Skip to content

fix: skip not-ready sandboxes during warm pool adoption#519

Closed
noeljackson wants to merge 1 commit intokubernetes-sigs:mainfrom
noeljackson:pr/claim-skip-not-ready
Closed

fix: skip not-ready sandboxes during warm pool adoption#519
noeljackson wants to merge 1 commit intokubernetes-sigs:mainfrom
noeljackson:pr/claim-skip-not-ready

Conversation

@noeljackson
Copy link
Copy Markdown
Contributor

Summary

Prevent the claim controller from adopting not-Ready sandboxes during warm pool rotation, which causes claims to hang with ReconcilerError.

Problem

During warm pool rotation (e.g. template spec change triggers pod cycling), adoptSandboxFromCandidates sorts Ready sandboxes first but doesn't filter — if no Ready candidates are available (all pods being recreated), it adopts a not-Ready sandbox. The adoption succeeds (ownership transfer on the Sandbox CR), but the backing pod doesn't exist yet, causing reconcilePod to fail with "Pod not found". The claim gets stuck with Ready=False, Reason=ReconcilerError.

The error does trigger requeue with backoff, but the user sees a hung CLI.

Fix

Add a readiness guard in the adoption loop using the existing isSandboxReady() helper (already defined at line 480, previously only used for metrics). Candidates without Ready=True are skipped. If no Ready candidates exist, adoptSandboxFromCandidates returns nil, and getOrCreateSandbox falls through to cold creation.

if !isSandboxReady(adopted) {
    logger.V(1).Info("skipping not-ready adoption candidate",
        "sandbox", adopted.Name, "claim", claim.Name)
    continue
}

Behavior change

Scenario Before After
During rotation (no Ready pods) Adopts not-Ready sandbox, claim hangs Falls through to cold creation, slower but works
After rotation (Ready pods available) Adopts Ready sandbox Unchanged
Normal operation Adopts Ready sandbox Unchanged

Test plan

  • New test: TestSandboxClaimSkipsNotReadyAdoptionCandidates — only not-Ready candidates, verifies none are adopted
  • Updated existing test: skips not-ready sandboxes and falls through to cold creation (was adopts oldest non-ready sandbox)
  • All existing adoption tests pass (Ready sandbox adoption unchanged)

@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 3, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 0f2c735
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69e9f4f9f1f1720008e93716

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @noeljackson. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2026
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 3, 2026
currIndex := (startIndex + i) % n
adopted := candidates[currIndex]

if !isSandboxReady(adopted) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, but if there is no ready sandbox, do we want to fall-back to a non-ready Sandbox? I'm imagining the (pathological) case that a Sandbox takes 2 minutes to start up; is it better to use a Sandbox that is 1 minute into startup in that case?

@noeljackson
Copy link
Copy Markdown
Contributor Author

Good question. In practice, adopting a not-ready sandbox doesn't save time — it makes things worse:

  1. The claim controller doesn't wait on the adopted sandbox's pod. It transfers ownership on the Sandbox CR, then immediately tries to reconcile the pod. If the pod doesn't exist yet (which is the case during rotation — pods are being deleted and recreated), it fails with "Pod not found" and the claim enters ReconcilerError.

  2. Even if the pod exists but isn't ready, the claim enters a requeue loop waiting for Ready=True. A cold-created sandbox enters the same requeue loop. No time is saved either way.

  3. The warm pool controller loses track of the adopted sandbox (ownership was transferred to the claim), so it creates a replacement. Now you have two sandboxes starting up for one claim.

The pathological case you're describing — sandbox exists, pod exists, 1 minute into a 2-minute startup — could theoretically benefit from adoption. But during rotation specifically, the not-ready sandboxes have no backing pod at all. The sort already puts Ready sandboxes first, so in the non-rotation case where some are ready and some aren't, claims always grab a ready one.

If we wanted to handle the "pod exists but isn't ready yet" case in the future, the right approach would be to check for pod existence (not just sandbox readiness) before adopting. But that's a separate optimization — this fix addresses the immediate hang during rotation.

@justinsb
Copy link
Copy Markdown
Contributor

justinsb commented Apr 3, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026
@janetkuo janetkuo requested a review from Copilot April 21, 2026 17:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents SandboxClaim warm-pool adoption from selecting sandboxes that are not Ready=True, avoiding stuck claims during warm pool rotation when backing pods don’t yet exist.

Changes:

  • Add a readiness guard in adoptSandboxFromCandidates to skip non-ready sandboxes and fall back to cold creation when none are ready.
  • Update the existing adoption test case to expect cold creation when only non-ready candidates exist.
  • Add a new unit test ensuring candidates without any Ready condition are not adopted.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
extensions/controllers/sandboxclaim_controller.go Skips not-ready warm pool candidates during adoption to avoid adopting sandboxes without backing pods.
extensions/controllers/sandboxclaim_controller_test.go Adjusts adoption expectations and adds coverage for skipping candidates missing a Ready condition.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2026
@noeljackson
Copy link
Copy Markdown
Contributor Author

All presubmits have been green since 2026-04-03 and the Copilot bot review is in. Could a maintainer take a look when they have a moment? Thanks.

During warm pool rotation (template spec change triggers pod cycling),
the claim controller could adopt a Sandbox whose backing pod doesn't
exist yet. The adoption succeeds (ownership transfer on the Sandbox
CR), but reconcilePod fails with "Pod not found", leaving the claim
stuck in ReconcilerError.

Root cause: verifySandboxCandidate accepted any adoptable sandbox
regardless of Ready status. During warm-pool rotation all backing
pods can be mid-recreate, so getCandidate would return a not-ready
candidate and adoption would hand it off to reconcilePod.

Fix: gate verifySandboxCandidate on isSandboxReady. Not-ready
candidates are rejected with a clear error; getCandidate continues to
the next queue entry. When the queue is exhausted without a ready
candidate, adoptSandboxFromCandidates returns nil and falls through
to cold creation. Claim startup takes longer in rotation scenarios
but no longer hangs.

Rebased onto upstream/main after KEP-0174 (kubernetes-sigs#514) restructured warm-
pool adoption into a queue-based flow; the guard now lives in
verifySandboxCandidate rather than the old candidate-iteration loop.
@noeljackson noeljackson force-pushed the pr/claim-skip-not-ready branch from 62e4adb to 0f2c735 Compare April 23, 2026 10:31
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: noeljackson
Once this PR has been reviewed and has the lgtm label, please ask for approval from vicentefb. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 23, 2026
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 23, 2026
- pr/sandbox-pod-annotation-propagation (kubernetes-sigs#517) superseded by upstream kubernetes-sigs#514 (KEP-0174).
- pr/fix-stale-pod-annotation (kubernetes-sigs#521) superseded by upstream kubernetes-sigs#613.
- pr/podip-status superseded by upstream kubernetes-sigs#518.
- pr/warm-adoption-preserve-podtemplate-metadata superseded by KEP-0174.

Remaining fork patches: kubernetes-sigs#455, kubernetes-sigs#459, kubernetes-sigs#519, pr/template-volume-claim-templates,
pr/warmpool-requeue-after.
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 23, 2026
- pr/sandbox-pod-annotation-propagation (kubernetes-sigs#517) superseded by upstream kubernetes-sigs#514 (KEP-0174).
- pr/fix-stale-pod-annotation (kubernetes-sigs#521) superseded by upstream kubernetes-sigs#613.
- pr/podip-status superseded by upstream kubernetes-sigs#518.
- pr/warm-adoption-preserve-podtemplate-metadata superseded by KEP-0174.

Remaining fork patches: kubernetes-sigs#455, kubernetes-sigs#459, kubernetes-sigs#519, pr/template-volume-claim-templates,
pr/warmpool-requeue-after.
@aditya-shantanu
Copy link
Copy Markdown
Contributor

Let's tackle this request as part of
#491

@aditya-shantanu
Copy link
Copy Markdown
Contributor

because you can choose ready sandboxes if those run out, it is better to pick a non-ready but already started sandbox than to fallback to creating one from scratch.

@aditya-shantanu
Copy link
Copy Markdown
Contributor

/close

Pending discussion on the issue.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@aditya-shantanu: Closed this PR.

Details

In response to this:

/close

Pending discussion on the issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@janetkuo
Copy link
Copy Markdown
Member

janetkuo commented Apr 23, 2026

Please feel free to reopen the PR (with /reopen command) incorporating #491

@noeljackson
Copy link
Copy Markdown
Contributor Author

Reopening with a revised scope that addresses both @aditya-shantanu's and @janetkuo's feedback, and that incorporates the API surface for issue #491.

What changed

Narrower adoptability rule. The original "skip Ready=False" was too aggressive — @aditya-shantanu was right that a warm-pool sandbox whose backing Pod is running but not yet Ready is still more useful than cold-starting from scratch. The real defect is narrower: during warm-pool rotation, the Sandbox CR stays in the queue while its pod is deleted and recreated, and adopting during that window leaves the claim with ReconcilerError ("Pod not found"). The correct signal is "does a backing Pod exist" — independent of Ready state.

isAdoptable now requires len(Status.PodIPs) > 0. PodIPs is populated by the sandbox controller only when the pod has been scheduled and networked, so this cleanly separates "pod is mid-rotation / not yet scheduled" (skip) from "pod is running but not yet Ready" (adopt). The effect in sandboxEventHandler.Update is that rotating sandboxes never enter the queue; in verifySandboxCandidate they're skipped if already queued, and the adoption loop falls through to cold creation.

API surface for issue #491. Added SandboxTemplate.spec.adoptionStrategy (enum) with OldestReady as the default and currently-only value — the existing FIFO warm sandbox queue implements this implicitly. The field exists so kincoy's follow-ups (NodeSpread, topology-aware, pressure-aware) can slot in behind the same API without a breaking change. I intentionally did not introduce a Go-level AdoptionStrategy interface yet: with the current FIFO queue there's no sort step to wrap, and a useful interface shape depends on choices about queue implementation (one queue per strategy vs. queue-aware strategies vs. replacing the queue entirely with a strategy-driven picker). Happy to design that in a separate PR once there's a second strategy motivating it.

Tests

  • Regression (rotation): skips warm pool sandbox with no backing pod and falls through to cold creation — confirms the original hang cannot recur.
  • aditya-shantanu's case: adopts not-ready sandbox with backing pod, skipping rotating sandboxes without pods — pod-started-but-not-Ready remains adoptable.
  • The existing adopts sandboxes from queue regardless of ready state and adopts first available non-ready sandbox from queue cases continue to pass because the fixtures reflect the realistic state (Sandbox has PodIPs once its pod has IPs).

go test ./extensions/... and make lint-go / make lint-api all green.

/cc @aditya-shantanu @janetkuo — does this look like the right shape for incorporating #491?

@noeljackson
Copy link
Copy Markdown
Contributor Author

/reopen

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated.

Details

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@noeljackson
Copy link
Copy Markdown
Contributor Author

Re-submitted as #683 (prow refused to reopen this PR after the branch was force-pushed).

The new revision scopes the fix more narrowly ("no backing pod" instead of "not Ready") per @aditya-shantanu's feedback, and adds the SandboxTemplate.spec.adoptionStrategy API surface per @janetkuo's / issue #491 redirect. Context and discussion from this thread remain linked from the new PR.

@noeljackson
Copy link
Copy Markdown
Contributor Author

/reopen

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@noeljackson: Failed to re-open PR: state cannot be changed. The pr/claim-skip-not-ready branch was force-pushed or recreated.

Details

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:extensions cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. ready-for-review size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants