Background
I'm developing a tool that systematically explores controller reconciliation ordering, staleness, and fault injection (kamera).
Observed Behavior
I observe that `spec.limits` entries for resource types not present in instance type `Capacity` — most notably `nodes` — are completely invisible to the scheduler's capacity tracking.
This produces at least 4 distinct failure modes:
1. Batching bypass (always occurs)
When multiple pods arrive simultaneously and the provisioner batches them into a single `CreateNodeClaims()` call, all NodeClaims are created concurrently via `ParallelizeUntil()` (`provisioner.go:156`). Each concurrent `Create()` call reads `nodePoolResources`, which is empty at creation time (empty ProviderID → no StateNode → no `updateNodePoolResources()` call). The `ExceededBy()` check at `provisioner.go:420` passes for all concurrent creations.
With `spec.limits: { nodes: "1" }` and 2 pending pods, both NodeClaims are created in every ordering.
2. Sequential off-by-one (conditional)
Even with sequential pod arrivals and fresh cluster state, `ExceededBy()` at `nodepool.go:181` uses a strict `>` comparison: `usage.Cmp(limit) > 0`. This checks whether usage has exceeded the limit, but not whether it has reached it. So with `limits: {nodes: "1"}` and one NodeClaim already provisioned (`nodePoolResources = {nodes: 1}`), the check evaluates `1.Cmp(1) > 0 → false`: it reports the limit as not exceeded, and the provisioner creates a second NodeClaim. The intent of `nodes: "1"` is presumably "at most 1 node," but the strict `>` semantics treat it as "more than 1 node is too many," allowing exactly-at-limit usage to pass.
3. Multi-NodePool spillover failure (always occurs)
With two NodePools (pool-a: weight 10, `nodes: "1"`; pool-b: weight 1, `nodes: "1"`) and 3 pending pods, all 3 pods are assigned to pool-a — violating its `nodes: "1"` limit — and pool-b is never used. The expected behavior is that pool-a should accept 1 pod (filling its `nodes` limit), then the scheduler should spill over to pool-b for the next pod. Instead, because `remaining["nodes"]` for pool-a is never decremented (the `nodes` resource isn't tracked in `filterByRemainingResources` — see root cause below), pool-a always appears to have unlimited node capacity. The scheduler's weight-based selection keeps picking pool-a for every pod, and the limit is never enforced.
4. Disruption + provisioner race (conditional)
When a pre-existing empty node is being disrupted while a new pod arrives, if the provisioner runs before the disruption queue finishes deleting the old node, the `ExceededBy()` off-by-one allows the replacement NodeClaim to be created before the old one is removed — both exist simultaneously against `nodes: "1"`.
Root cause
`filterByRemainingResources()` at `scheduler.go:853-870` iterates over instance type `Capacity` to decrement `remaining[resourceName]`. The `nodes` resource is not present in instance type `Capacity`, so `itResources["nodes"]` returns a zero value, making `Cmp(0, positive) > 0` always false. Similarly, `subtractMax()` at `scheduler.go:830-851` subtracts zero from `remaining`, doing nothing.
This means any custom resource in `spec.limits` that doesn't appear in instance type `Capacity` is silently ignored. There is also no CEL validation preventing users from setting unsupported limit keys (the CEL validation at `nodepool.go:40` only covers static NodePools).
Expected Behavior
`spec.limits.nodes` should be enforced: the scheduler should track `nodes` as a consumed resource, `ExceededBy()` should use `>=` (or document that the limit is exclusive), and concurrent `CreateNodeClaims()` should either serialize creation or use atomic limit checks.
Proposed Fix
Three changes would address this:
- Fix the `ExceededBy()` comparison at `nodepool.go:181`: change `usage.Cmp(limit) > 0` to `usage.Cmp(limit) >= 0`. This closes the off-by-one that allows `usage == limit` to pass. (Note that `Cmp` returns an `int`, so the comparison must be written against that return value directly.)
- Track `nodes` in scheduler accounting: in `filterByRemainingResources()`, add a guard that checks for limit keys not present in instance type `Capacity` (like `nodes`) and decrements them by 1 per NodeClaim. Similarly in `subtractMax()`.
- Serialize or atomically check limits in `CreateNodeClaims()`: the concurrent `ParallelizeUntil()` at `provisioner.go:156` should either serialize NodeClaim creation or use an atomic counter for `nodes` consumed, so concurrent creates can't all pass the limit check simultaneously.
I'm happy to put up a PR for any or all of these if it would be helpful.
Versions
- Karpenter: v1.8.0 (`sigs.k8s.io/karpenter`, commit 8ae07cf8)
- Kubernetes: simulated via kamera (based on `k8s.io/client-go` v0.35.0 / Kubernetes 1.35)