Background
I'm developing a tool that systematically explores controller reconciliation ordering, staleness, and fault injection (kamera).
Observed Behavior
I observe that `spec.limits` entries for resource types not present in instance type `Capacity` — most notably `nodes` — are completely invisible to the scheduler's capacity tracking.
This produces at least 4 distinct failure modes:
1. Batching bypass (always occurs)
When multiple pods arrive simultaneously and the provisioner batches them into a single `CreateNodeClaims()` call, all NodeClaims are created concurrently via `ParallelizeUntil()` (`provisioner.go:156`). Each concurrent `Create()` call reads `nodePoolResources`, which is empty at creation time (empty ProviderID → no StateNode → no `updateNodePoolResources()` call). The `ExceededBy()` check at `provisioner.go:420` passes for all concurrent creations.
With `spec.limits: { nodes: "1" }` and 2 pending pods, both NodeClaims are created in every ordering.
2. Sequential off-by-one (conditional)
Even with sequential pod arrivals and fresh cluster state, `ExceededBy()` at `nodepool.go:181` uses a strict `>` comparison: `usage.Cmp(limit) > 0`. This checks whether usage has exceeded the limit, but not whether it has reached it. So with `limits: {nodes: "1"}` and one NodeClaim already provisioned (`nodePoolResources = {nodes: 1}`), the check evaluates `1.Cmp(1) > 0 → false`: it reports the limit as not exceeded, and the provisioner creates a second NodeClaim. The intent of `nodes: "1"` is presumably "at most 1 node," but the strict `>` semantics treat it as "more than 1 node is too many," allowing exactly-at-limit usage to pass.
3. Multi-NodePool spillover failure (always occurs)
With two NodePools (pool-a: weight 10, `nodes: "1"`; pool-b: weight 1, `nodes: "1"`) and 3 pending pods, all 3 pods are assigned to pool-a — violating its `nodes: "1"` limit — and pool-b is never used. The expected behavior is that pool-a should accept 1 pod (filling its `nodes` limit), then the scheduler should spill over to pool-b for the next pod. Instead, because `remaining["nodes"]` for pool-a is never decremented (the `nodes` resource isn't tracked in `filterByRemainingResources` — see root cause below), pool-a always appears to have unlimited node capacity. The scheduler's weight-based selection keeps picking pool-a for every pod, and the limit is never enforced.
4. Disruption + provisioner race (conditional)
When a pre-existing empty node is being disrupted while a new pod arrives, if the provisioner runs before the disruption queue finishes deleting the old node, the `ExceededBy()` off-by-one allows the replacement NodeClaim to be created before the old one is removed — both exist simultaneously against `nodes: "1"`.
Root cause
`filterByRemainingResources()` at `scheduler.go:853-870` iterates over instance type `Capacity` to decrement `remaining[resourceName]`. The `nodes` resource is not present in instance type `Capacity`, so `itResources["nodes"]` returns a zero value, making `Cmp(0, positive) > 0` always false. Similarly, `subtractMax()` at `scheduler.go:830-851` subtracts zero from `remaining`, doing nothing.
This means any custom resource in `spec.limits` that doesn't appear in instance type `Capacity` is silently ignored. There is also no CEL validation preventing users from setting unsupported limit keys (the CEL validation at `nodepool.go:40` only covers static NodePools).
Expected Behavior
`spec.limits.nodes` should be enforced: the scheduler should track `nodes` as a consumed resource, `ExceededBy()` should use `>=` (or document that the limit is exclusive), and concurrent `CreateNodeClaims()` should either serialize creation or use atomic limit checks.
Proposed Fix
Three changes would address this:
- Fix the `ExceededBy()` comparison at `nodepool.go:181`: change `usage.Cmp(limit) > 0` to `usage.Cmp(limit) >= 0`. This closes the off-by-one that allows `usage == limit` to pass. (Note that `Cmp` returns an `int`, so the comparison must be written against that return value directly.)
- Track `nodes` in scheduler accounting: in `filterByRemainingResources()`, add a guard that checks for limit keys not present in instance type `Capacity` (like `nodes`) and decrements them by 1 per NodeClaim. Similarly in `subtractMax()`.
- Serialize or atomically check limits in `CreateNodeClaims()`: the concurrent `ParallelizeUntil()` at `provisioner.go:156` should either serialize NodeClaim creation or use an atomic counter for `nodes` consumed, so concurrent creates can't all pass the limit check simultaneously.
I'm happy to put up a PR for any or all of these if it would be helpful.
Versions
- Karpenter: v1.8.0 (`sigs.k8s.io/karpenter`, commit 8ae07cf8)
- Kubernetes: simulated via kamera (based on `k8s.io/client-go` v0.35.0 / Kubernetes 1.35)