fix: skip not-ready sandboxes during warm pool adoption by noeljackson · Pull Request #519 · kubernetes-sigs/agent-sandbox

noeljackson · 2026-04-03T14:21:17Z

Summary

Prevent the claim controller from adopting not-Ready sandboxes during warm pool rotation, which causes claims to hang with ReconcilerError.

Problem

During warm pool rotation (e.g. template spec change triggers pod cycling), adoptSandboxFromCandidates sorts Ready sandboxes first but doesn't filter — if no Ready candidates are available (all pods being recreated), it adopts a not-Ready sandbox. The adoption succeeds (ownership transfer on the Sandbox CR), but the backing pod doesn't exist yet, causing reconcilePod to fail with "Pod not found". The claim gets stuck with Ready=False, Reason=ReconcilerError.

The error does trigger requeue with backoff, but the user sees a hung CLI.

Fix

Add a readiness guard in the adoption loop using the existing isSandboxReady() helper (already defined at line 480, previously only used for metrics). Candidates without Ready=True are skipped. If no Ready candidates exist, adoptSandboxFromCandidates returns nil, and getOrCreateSandbox falls through to cold creation.

if !isSandboxReady(adopted) {
    logger.V(1).Info("skipping not-ready adoption candidate",
        "sandbox", adopted.Name, "claim", claim.Name)
    continue
}
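For readers without the controller source at hand, the guard above can be sketched in isolation. This is a minimal illustration, not the controller's actual code: the `Sandbox` and `Condition` types are simplified stand-ins, `pickReadyCandidate` is a hypothetical name for the loop inside adoptSandboxFromCandidates, and the wrap-around indexing mirrors the `(startIndex + i) % n` pattern visible in the review diff.

```go
package main

import "fmt"

// Condition and Sandbox are simplified stand-ins for the real API types
// (assumption: the real Sandbox exposes a Ready condition in its status).
type Condition struct {
	Type   string
	Status string
}

type Sandbox struct {
	Name       string
	Conditions []Condition
}

// isSandboxReady reports whether the sandbox carries Ready=True.
// A sandbox with no Ready condition at all counts as not ready.
func isSandboxReady(sb *Sandbox) bool {
	for _, c := range sb.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// pickReadyCandidate (hypothetical name) walks the candidates starting at
// startIndex, wrapping around, and returns the first Ready one.
// It returns nil when no candidate is Ready, which the caller treats as
// "fall through to cold creation".
func pickReadyCandidate(candidates []*Sandbox, startIndex int) *Sandbox {
	n := len(candidates)
	for i := 0; i < n; i++ {
		adopted := candidates[(startIndex+i)%n]
		if !isSandboxReady(adopted) {
			continue // skip not-ready adoption candidate
		}
		return adopted
	}
	return nil
}

func main() {
	// During rotation every candidate is either missing a Ready condition
	// or has Ready=False.
	rotating := []*Sandbox{
		{Name: "warm-a"},
		{Name: "warm-b", Conditions: []Condition{{Type: "Ready", Status: "False"}}},
	}
	if pickReadyCandidate(rotating, 0) == nil {
		fmt.Println("no ready candidate: fall through to cold creation")
	}
}
```

Returning nil rather than an error keeps "nothing adoptable" distinct from a genuine failure, which matches the fall-through behavior described above.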

Behavior change

Scenario	Before	After
During rotation (no Ready pods)	Adopts not-Ready sandbox, claim hangs	Falls through to cold creation, slower but works
After rotation (Ready pods available)	Adopts Ready sandbox	Unchanged
Normal operation	Adopts Ready sandbox	Unchanged

Test plan

New test: TestSandboxClaimSkipsNotReadyAdoptionCandidates — only not-Ready candidates, verifies none are adopted
Updated existing test: "skips not-ready sandboxes and falls through to cold creation" (was "adopts oldest non-ready sandbox")
All existing adoption tests pass (Ready sandbox adoption unchanged)

netlify · 2026-04-03T14:21:25Z

✅ Deploy Preview for agent-sandbox canceled.

Name	Link
🔨 Latest commit	`0f2c735`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-sandbox/deploys/69e9f4f9f1f1720008e93716

k8s-ci-robot · 2026-04-03T14:21:29Z

Hi @noeljackson. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

justinsb · 2026-04-03T14:44:15Z

 		currIndex := (startIndex + i) % n
 		adopted := candidates[currIndex]

+		if !isSandboxReady(adopted) {


This seems reasonable, but if there is no ready sandbox, do we want to fall-back to a non-ready Sandbox? I'm imagining the (pathological) case that a Sandbox takes 2 minutes to start up; is it better to use a Sandbox that is 1 minute into startup in that case?

noeljackson · 2026-04-03T14:49:57Z

Good question. In practice, adopting a not-ready sandbox doesn't save time — it makes things worse:

The claim controller doesn't wait on the adopted sandbox's pod. It transfers ownership on the Sandbox CR, then immediately tries to reconcile the pod. If the pod doesn't exist yet (which is the case during rotation — pods are being deleted and recreated), it fails with "Pod not found" and the claim enters ReconcilerError.
Even if the pod exists but isn't ready, the claim enters a requeue loop waiting for Ready=True. A cold-created sandbox enters the same requeue loop. No time is saved either way.
The warm pool controller loses track of the adopted sandbox (ownership was transferred to the claim), so it creates a replacement. Now you have two sandboxes starting up for one claim.

The pathological case you're describing — sandbox exists, pod exists, 1 minute into a 2-minute startup — could theoretically benefit from adoption. But during rotation specifically, the not-ready sandboxes have no backing pod at all. The sort already puts Ready sandboxes first, so in the non-rotation case where some are ready and some aren't, claims always grab a ready one.

If we wanted to handle the "pod exists but isn't ready yet" case in the future, the right approach would be to check for pod existence (not just sandbox readiness) before adopting. But that's a separate optimization — this fix addresses the immediate hang during rotation.

justinsb · 2026-04-03T15:10:21Z

/ok-to-test

Copilot

Pull request overview

Prevents SandboxClaim warm-pool adoption from selecting sandboxes that are not Ready=True, avoiding stuck claims during warm pool rotation when backing pods don’t yet exist.

Changes:

Add a readiness guard in adoptSandboxFromCandidates to skip non-ready sandboxes and fall back to cold creation when none are ready.
Update the existing adoption test case to expect cold creation when only non-ready candidates exist.
Add a new unit test ensuring candidates without any Ready condition are not adopted.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`extensions/controllers/sandboxclaim_controller.go`	Skips not-ready warm pool candidates during adoption to avoid adopting sandboxes without backing pods.
`extensions/controllers/sandboxclaim_controller_test.go`	Adjusts adoption expectations and adds coverage for skipping candidates missing a Ready condition.

noeljackson · 2026-04-23T09:30:14Z

All presubmits have been green since 2026-04-03 and the Copilot bot review is in. Could a maintainer take a look when they have a moment? Thanks.

During warm pool rotation (template spec change triggers pod cycling), the claim controller could adopt a Sandbox whose backing pod doesn't exist yet. The adoption succeeds (ownership transfer on the Sandbox CR), but reconcilePod fails with "Pod not found", leaving the claim stuck in ReconcilerError.

Root cause: verifySandboxCandidate accepted any adoptable sandbox regardless of Ready status. During warm-pool rotation all backing pods can be mid-recreate, so getCandidate would return a not-ready candidate and adoption would hand it off to reconcilePod.

Fix: gate verifySandboxCandidate on isSandboxReady. Not-ready candidates are rejected with a clear error; getCandidate continues to the next queue entry. When the queue is exhausted without a ready candidate, adoptSandboxFromCandidates returns nil and falls through to cold creation. Claim startup takes longer in rotation scenarios but no longer hangs.

Rebased onto upstream/main after KEP-0174 (kubernetes-sigs#514) restructured warm-pool adoption into a queue-based flow; the guard now lives in verifySandboxCandidate rather than the old candidate-iteration loop.
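The queue-based flow in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the upstream implementation: the queue is modeled as a plain slice, `Sandbox` is a simplified stand-in with a boolean `Ready` field, the signatures of verifySandboxCandidate and getCandidate are guesses, and rejected entries are simply dropped (the real controller may handle them differently).

```go
package main

import (
	"errors"
	"fmt"
)

// Sandbox is a simplified stand-in; Ready collapses the Ready condition
// into a boolean for the purposes of this sketch.
type Sandbox struct {
	Name  string
	Ready bool
}

// errNotReady is a hypothetical sentinel error for rejected candidates.
var errNotReady = errors.New("sandbox is not ready")

// verifySandboxCandidate rejects candidates that are not Ready
// (assumed signature; the real function likely performs other checks too).
func verifySandboxCandidate(sb *Sandbox) error {
	if !sb.Ready {
		return fmt.Errorf("candidate %q: %w", sb.Name, errNotReady)
	}
	return nil
}

// getCandidate pops entries from the front of the queue until one passes
// verification. It returns nil when the queue is exhausted, which the
// caller treats as "fall through to cold creation".
func getCandidate(queue *[]*Sandbox) *Sandbox {
	for len(*queue) > 0 {
		sb := (*queue)[0]
		*queue = (*queue)[1:]
		if err := verifySandboxCandidate(sb); err != nil {
			continue // rejected: move on to the next queue entry
		}
		return sb
	}
	return nil
}

func main() {
	queue := []*Sandbox{{Name: "rotating"}, {Name: "warm", Ready: true}}
	if sb := getCandidate(&queue); sb != nil {
		fmt.Println("adopting", sb.Name)
	} else {
		fmt.Println("queue exhausted: cold creation")
	}
}
```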

k8s-ci-robot · 2026-04-23T10:31:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: noeljackson
Once this PR has been reviewed and has the lgtm label, please ask for approval from vicentefb. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

extensions/controllers/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- pr/sandbox-pod-annotation-propagation (kubernetes-sigs#517) superseded by upstream kubernetes-sigs#514 (KEP-0174).
- pr/fix-stale-pod-annotation (kubernetes-sigs#521) superseded by upstream kubernetes-sigs#613.
- pr/podip-status superseded by upstream kubernetes-sigs#518.
- pr/warm-adoption-preserve-podtemplate-metadata superseded by KEP-0174.

Remaining fork patches: kubernetes-sigs#455, kubernetes-sigs#459, kubernetes-sigs#519, pr/template-volume-claim-templates, pr/warmpool-requeue-after.

aditya-shantanu · 2026-04-23T20:27:59Z

Let's tackle this request as part of #491.

aditya-shantanu · 2026-04-23T20:28:45Z

Because you can choose ready sandboxes first, if those run out it is better to pick a non-ready but already-started sandbox than to fall back to creating one from scratch.

aditya-shantanu · 2026-04-23T20:29:27Z

/close

Pending discussion on the issue.

k8s-ci-robot · 2026-04-23T20:29:34Z

@aditya-shantanu: Closed this PR.

Details

In response to this:

/close

Pending discussion on the issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

janetkuo · 2026-04-23T20:30:44Z

Please feel free to reopen the PR (with /reopen command) incorporating #491

noeljackson · 2026-04-24T14:57:43Z

Reopening with a revised scope that addresses both @aditya-shantanu's and @janetkuo's feedback, and that incorporates the API surface for issue #491.

What changed

Narrower adoptability rule. The original "skip Ready=False" was too aggressive — @aditya-shantanu was right that a warm-pool sandbox whose backing Pod is running but not yet Ready is still more useful than cold-starting from scratch. The real defect is narrower: during warm-pool rotation, the Sandbox CR stays in the queue while its pod is deleted and recreated, and adopting during that window leaves the claim with ReconcilerError ("Pod not found"). The correct signal is "does a backing Pod exist" — independent of Ready state.

isAdoptable now requires len(Status.PodIPs) > 0. PodIPs is populated by the sandbox controller only when the pod has been scheduled and networked, so this cleanly separates "pod is mid-rotation / not yet scheduled" (skip) from "pod is running but not yet Ready" (adopt). The effect in sandboxEventHandler.Update is that rotating sandboxes never enter the queue; in verifySandboxCandidate they're skipped if already queued, and the adoption loop falls through to cold creation.
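A rough sketch of the "backing Pod exists" signal described above, for illustration only. The `Sandbox` and `SandboxStatus` types are simplified stand-ins, `hasBackingPod` and `firstAdoptable` are hypothetical helper names, and the real isAdoptable presumably combines this check with others that are not shown in this thread.

```go
package main

import "fmt"

// SandboxStatus and Sandbox are simplified stand-ins for the real API types.
type SandboxStatus struct {
	PodIPs []string // assumed to be populated once the pod is scheduled and networked
	Ready  bool
}

type Sandbox struct {
	Name   string
	Status SandboxStatus
}

// hasBackingPod (hypothetical helper) is the narrowed adoptability signal:
// it is true once PodIPs is non-empty, independent of Ready state.
func hasBackingPod(sb *Sandbox) bool {
	return len(sb.Status.PodIPs) > 0
}

// firstAdoptable (hypothetical helper) returns the first queued sandbox
// with a backing pod, or nil so the caller can fall through to cold creation.
func firstAdoptable(queue []*Sandbox) *Sandbox {
	for _, sb := range queue {
		if hasBackingPod(sb) {
			return sb
		}
	}
	return nil
}

func main() {
	queue := []*Sandbox{
		{Name: "mid-rotation"}, // pod deleted, not yet recreated: skip
		{Name: "starting", Status: SandboxStatus{PodIPs: []string{"10.0.0.7"}}}, // running, not Ready: adopt
	}
	if sb := firstAdoptable(queue); sb != nil {
		fmt.Println("adopting", sb.Name, "ready:", sb.Status.Ready)
	}
}
```

Compared with a Ready=True gate, this keeps the "pod started but not yet Ready" sandboxes adoptable while still excluding sandboxes whose pod is missing.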

API surface for issue #491. Added SandboxTemplate.spec.adoptionStrategy (enum) with OldestReady as the default and currently-only value — the existing FIFO warm sandbox queue implements this implicitly. The field exists so kincoy's follow-ups (NodeSpread, topology-aware, pressure-aware) can slot in behind the same API without a breaking change. I intentionally did not introduce a Go-level AdoptionStrategy interface yet: with the current FIFO queue there's no sort step to wrap, and a useful interface shape depends on choices about queue implementation (one queue per strategy vs. queue-aware strategies vs. replacing the queue entirely with a strategy-driven picker). Happy to design that in a separate PR once there's a second strategy motivating it.
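One plausible shape for the enum field described above, offered as a sketch rather than the actual API definition. Only the JSON field name `adoptionStrategy` and the value `OldestReady` come from this thread; the Go identifiers, the kubebuilder markers chosen, and the `effectiveStrategy` helper are assumptions for illustration.

```go
package main

import "fmt"

// AdoptionStrategy selects how a claim picks a warm-pool sandbox.
// The Enum marker below is an assumed way to restrict values in the CRD schema.
// +kubebuilder:validation:Enum=OldestReady
type AdoptionStrategy string

const (
	// AdoptionStrategyOldestReady (assumed Go name) corresponds to the
	// existing FIFO warm sandbox queue.
	AdoptionStrategyOldestReady AdoptionStrategy = "OldestReady"
)

// SandboxTemplateSpec shows only the new field; all other fields are omitted.
type SandboxTemplateSpec struct {
	// +kubebuilder:default=OldestReady
	// +optional
	AdoptionStrategy AdoptionStrategy `json:"adoptionStrategy,omitempty"`
}

// effectiveStrategy (hypothetical helper) treats an unset field as the
// default, so objects that omit the field keep the current behavior.
func effectiveStrategy(spec SandboxTemplateSpec) AdoptionStrategy {
	if spec.AdoptionStrategy == "" {
		return AdoptionStrategyOldestReady
	}
	return spec.AdoptionStrategy
}

func main() {
	fmt.Println(effectiveStrategy(SandboxTemplateSpec{}))
}
```

Adding further values later (for example the NodeSpread follow-up mentioned above) would then mean extending the enum without changing the field itself.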

Tests

Regression (rotation): "skips warm pool sandbox with no backing pod and falls through to cold creation" confirms the original hang cannot recur.
aditya-shantanu's case: "adopts not-ready sandbox with backing pod, skipping rotating sandboxes without pods" confirms that a pod-started-but-not-Ready sandbox remains adoptable.
The existing "adopts sandboxes from queue regardless of ready state" and "adopts first available non-ready sandbox from queue" cases continue to pass because the fixtures reflect the realistic state (a Sandbox has PodIPs once its pod has IPs).

go test ./extensions/... and make lint-go / make lint-api all green.

/cc @aditya-shantanu @janetkuo — does this look like the right shape for incorporating #491?

noeljackson · 2026-04-24T14:57:59Z

/reopen

k8s-ci-robot · 2026-04-24T14:58:05Z

@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated.

Details

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

noeljackson · 2026-04-24T14:58:56Z

Re-submitted as #683 (prow refused to reopen this PR after the branch was force-pushed).

The new revision scopes the fix more narrowly ("no backing pod" instead of "not Ready") per @aditya-shantanu's feedback, and adds the SandboxTemplate.spec.adoptionStrategy API surface per @janetkuo's / issue #491 redirect. Context and discussion from this thread remain linked from the new PR.

noeljackson · 2026-04-24T15:59:33Z

/reopen

k8s-ci-robot · 2026-04-24T15:59:38Z

@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated.

Details

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested review from barney-s and janetkuo April 3, 2026 14:21

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2026

noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 3, 2026

sync: add pr/claim-skip-not-ready (kubernetes-sigs#519)

c46d486

justinsb added the area:extensions label Apr 3, 2026

justinsb reviewed Apr 3, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026

justinsb assigned vicentefb Apr 6, 2026

janetkuo requested a review from Copilot April 21, 2026 17:00

Copilot started reviewing on behalf of janetkuo April 21, 2026 17:00 View session

Copilot AI reviewed Apr 21, 2026

janetkuo added the ready-for-review label Apr 21, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2026

noeljackson force-pushed the pr/claim-skip-not-ready branch from 62e4adb to 0f2c735 Compare April 23, 2026 10:31

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 23, 2026

k8s-ci-robot closed this Apr 23, 2026

noeljackson mentioned this pull request Apr 24, 2026

fix: skip warm pool sandboxes without a backing pod; add adoption strategy API #683

Open

7 participants