fix: skip not-ready sandboxes during warm pool adoption#519
fix: skip not-ready sandboxes during warm pool adoption#519noeljackson wants to merge 1 commit intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
Hi @noeljackson. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| currIndex := (startIndex + i) % n | ||
| adopted := candidates[currIndex] | ||
|
|
||
| if !isSandboxReady(adopted) { |
There was a problem hiding this comment.
This seems reasonable, but if there is no ready sandbox, do we want to fall-back to a non-ready Sandbox? I'm imagining the (pathological) case that a Sandbox takes 2 minutes to start up; is it better to use a Sandbox that is 1 minute into startup in that case?
|
Good question. In practice, adopting a not-ready sandbox doesn't save time — it makes things worse:
The pathological case you're describing — sandbox exists, pod exists, 1 minute into a 2-minute startup — could theoretically benefit from adoption. But during rotation specifically, the not-ready sandboxes have no backing pod at all. The sort already puts Ready sandboxes first, so in the non-rotation case where some are ready and some aren't, claims always grab a ready one. If we wanted to handle the "pod exists but isn't ready yet" case in the future, the right approach would be to check for pod existence (not just sandbox readiness) before adopting. But that's a separate optimization — this fix addresses the immediate hang during rotation. |
|
/ok-to-test |
There was a problem hiding this comment.
Pull request overview
Prevents SandboxClaim warm-pool adoption from selecting sandboxes that are not Ready=True, avoiding stuck claims during warm pool rotation when backing pods don’t yet exist.
Changes:
- Add a readiness guard in
adoptSandboxFromCandidatesto skip non-ready sandboxes and fall back to cold creation when none are ready. - Update the existing adoption test case to expect cold creation when only non-ready candidates exist.
- Add a new unit test ensuring candidates without any Ready condition are not adopted.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
extensions/controllers/sandboxclaim_controller.go |
Skips not-ready warm pool candidates during adoption to avoid adopting sandboxes without backing pods. |
extensions/controllers/sandboxclaim_controller_test.go |
Adjusts adoption expectations and adds coverage for skipping candidates missing a Ready condition. |
|
All presubmits have been green since 2026-04-03 and the Copilot bot review is in. Could a maintainer take a look when they have a moment? Thanks. |
During warm pool rotation (template spec change triggers pod cycling), the claim controller could adopt a Sandbox whose backing pod doesn't exist yet. The adoption succeeds (ownership transfer on the Sandbox CR), but reconcilePod fails with "Pod not found", leaving the claim stuck in ReconcilerError. Root cause: verifySandboxCandidate accepted any adoptable sandbox regardless of Ready status. During warm-pool rotation all backing pods can be mid-recreate, so getCandidate would return a not-ready candidate and adoption would hand it off to reconcilePod. Fix: gate verifySandboxCandidate on isSandboxReady. Not-ready candidates are rejected with a clear error; getCandidate continues to the next queue entry. When the queue is exhausted without a ready candidate, adoptSandboxFromCandidates returns nil and falls through to cold creation. Claim startup takes longer in rotation scenarios but no longer hangs. Rebased onto upstream/main after KEP-0174 (kubernetes-sigs#514) restructured warm- pool adoption into a queue-based flow; the guard now lives in verifySandboxCandidate rather than the old candidate-iteration loop.
62e4adb to
0f2c735
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: noeljackson The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
- pr/sandbox-pod-annotation-propagation (kubernetes-sigs#517) superseded by upstream kubernetes-sigs#514 (KEP-0174). - pr/fix-stale-pod-annotation (kubernetes-sigs#521) superseded by upstream kubernetes-sigs#613. - pr/podip-status superseded by upstream kubernetes-sigs#518. - pr/warm-adoption-preserve-podtemplate-metadata superseded by KEP-0174. Remaining fork patches: kubernetes-sigs#455, kubernetes-sigs#459, kubernetes-sigs#519, pr/template-volume-claim-templates, pr/warmpool-requeue-after.
- pr/sandbox-pod-annotation-propagation (kubernetes-sigs#517) superseded by upstream kubernetes-sigs#514 (KEP-0174). - pr/fix-stale-pod-annotation (kubernetes-sigs#521) superseded by upstream kubernetes-sigs#613. - pr/podip-status superseded by upstream kubernetes-sigs#518. - pr/warm-adoption-preserve-podtemplate-metadata superseded by KEP-0174. Remaining fork patches: kubernetes-sigs#455, kubernetes-sigs#459, kubernetes-sigs#519, pr/template-volume-claim-templates, pr/warmpool-requeue-after.
|
Let's tackle this request as part of |
|
because you can choose ready sandboxes if those run out, it is better to pick a non-ready but already started sandbox than to fallback to creating one from scratch. |
|
/close Pending discussion on the issue. |
|
@aditya-shantanu: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Please feel free to reopen the PR (with |
|
Reopening with a revised scope that addresses both @aditya-shantanu's and @janetkuo's feedback, and that incorporates the API surface for issue #491. What changedNarrower adoptability rule. The original "skip Ready=False" was too aggressive — @aditya-shantanu was right that a warm-pool sandbox whose backing Pod is running but not yet Ready is still more useful than cold-starting from scratch. The real defect is narrower: during warm-pool rotation, the Sandbox CR stays in the queue while its pod is deleted and recreated, and adopting during that window leaves the claim with
API surface for issue #491. Added Tests
/cc @aditya-shantanu @janetkuo — does this look like the right shape for incorporating #491? |
|
/reopen |
|
@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Re-submitted as #683 (prow refused to reopen this PR after the branch was force-pushed). The new revision scopes the fix more narrowly ("no backing pod" instead of "not Ready") per @aditya-shantanu's feedback, and adds the |
|
/reopen |
|
@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
Prevent the claim controller from adopting not-Ready sandboxes during warm pool rotation, which causes claims to hang with
ReconcilerError.Problem
During warm pool rotation (e.g. template spec change triggers pod cycling),
adoptSandboxFromCandidatessorts Ready sandboxes first but doesn't filter — if no Ready candidates are available (all pods being recreated), it adopts a not-Ready sandbox. The adoption succeeds (ownership transfer on the Sandbox CR), but the backing pod doesn't exist yet, causingreconcilePodto fail with "Pod not found". The claim gets stuck withReady=False, Reason=ReconcilerError.The error does trigger requeue with backoff, but the user sees a hung CLI.
Fix
Add a readiness guard in the adoption loop using the existing
isSandboxReady()helper (already defined at line 480, previously only used for metrics). Candidates withoutReady=Trueare skipped. If no Ready candidates exist,adoptSandboxFromCandidatesreturns nil, andgetOrCreateSandboxfalls through to cold creation.Behavior change
Test plan
TestSandboxClaimSkipsNotReadyAdoptionCandidates— only not-Ready candidates, verifies none are adoptedskips not-ready sandboxes and falls through to cold creation(wasadopts oldest non-ready sandbox)