
Premature Consolidatable condition triggers create-delete cycle on uninitialized nodes #2922

@tgoodwin

Description


Background

I'm developing a tool that systematically explores controller reconciliation ordering, staleness, and fault injection (kamera).

Observed Behavior

I observe that when the disruption controller marks a node as Consolidatable before the node is fully initialized (pods scheduled, kubelet registered), a subsequent disruption evaluation treats the node as a valid consolidation candidate and enqueues deletion. The provisioner then creates a replacement, producing a create-delete cycle.

The consolidation.ShouldDisrupt() predicate checks ConditionTypeConsolidatable (line 122) but does not gate on Initialized(). An uninitialized node has no pods scheduled yet, so it appears "empty" and is marked Consolidatable, which triggers deletion before the node has had a chance to receive workloads. This happens whenever the disruption controller evaluates the node before initialization completes.

Expected Behavior

The disruption controller should not evaluate nodes for consolidation until they are fully initialized (all expected pods scheduled, NodeClaim initialized, Node registered).

Proposed Fix

The consolidation.ShouldDisrupt() predicate (consolidation.go:88-123) checks ConditionTypeConsolidatable at line 122 but never checks whether the node is initialized. In contrast, ValidateNodeDisruptable() at statenode.go:210 gates on Initialized(), which checks for the NodeInitializedLabelKey label (statenode.go:343-350).

Possible fix: add an Initialized() check at the top of consolidation.ShouldDisrupt():

func (c *Consolidation) ShouldDisrupt(_ context.Context, cn *Candidate) bool {
    if !cn.Initialized() {
        return false
    }
    // ... existing checks ...
}

This mirrors the pattern in ValidateNodeDisruptable() and ensures nodes are never considered for consolidation before they are fully initialized.
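The intended behavior of the guarded predicate can be checked with a table of cases. The sketch below is self-contained and uses a hypothetical `candidate` struct and `shouldDisrupt` function in place of the real Candidate type and ShouldDisrupt() method, so it only illustrates the ordering of the checks, not Karpenter's actual implementation.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified stand-in for the disruption Candidate.
type candidate struct {
	initialized    bool // stands in for cn.Initialized()
	consolidatable bool // stands in for the ConditionTypeConsolidatable check
}

// shouldDisrupt applies the initialization guard first, then the
// existing consolidatable check.
func shouldDisrupt(c candidate) bool {
	if !c.initialized {
		return false
	}
	return c.consolidatable
}

func main() {
	cases := []struct {
		name string
		c    candidate
		want bool
	}{
		{"uninitialized and consolidatable", candidate{false, true}, false},
		{"uninitialized and not consolidatable", candidate{false, false}, false},
		{"initialized and consolidatable", candidate{true, true}, true},
		{"initialized and not consolidatable", candidate{true, false}, false},
	}
	for _, tc := range cases {
		got := shouldDisrupt(tc.c)
		fmt.Printf("%s: got=%v want=%v\n", tc.name, got, tc.want)
	}
}
```

The first case is the one from this report: a node that is already marked Consolidatable but not yet initialized is no longer a candidate.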

I'm happy to put up a PR for this if it would be helpful.

Versions

  • Karpenter: v1.8.0 (sigs.k8s.io/karpenter, commit 8ae07cf8)
  • Kubernetes: simulated via kamera (based on k8s.io/client-go v0.35.0 / Kubernetes 1.35)
