fix(controlplane): decouple draining-CP-row reaper from the drain wall#818
Open
EDsCODE wants to merge 1 commit into
Open
fix(controlplane): decouple draining-CP-row reaper from the drain wall#818EDsCODE wants to merge 1 commit into
EDsCODE wants to merge 1 commit into
Conversation
In remote mode HandoverDrainTimeout is intentionally 0 (unbounded session drain) so a rolling control plane never cuts an in-flight customer import. But the janitor's draining-CP-row reaper was wired to the same field (janitor.maxDrainTimeout = cfg.HandoverDrainTimeout), so its gate `if maxDrainTimeout > 0` was never satisfied and ExpireDrainingControlPlaneInstances never ran. A draining old-generation CP heartbeats on a non-cancellable context, so its cp_instances row stays state=draining with a fresh heartbeat and is never expired. ListOrphanedWorkers only lists workers whose owning CP is expired, so every worker owned by that draining CP becomes invisible to the orphan sweep — they leak, holding full per-worker reservations and pinning their nodes via karpenter.sh/do-not-disrupt. Introduce a dedicated, finite DrainingInstanceExpiryTimeout (default 2h) that bounds how long a draining CP row may linger before the leader janitor force-expires it, leaving HandoverDrainTimeout=0 (session drain) untouched. Resolution goes through effectiveDrainingInstanceExpiry, which never returns 0, so the reaper can no longer be silently disabled by an unset knob. Expiring a draining CP row cannot cut a live import: the orphan sweep already spares workers with an active/reconnecting Flight session. Plumbed through file (draining_instance_expiry_timeout), env (DUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT), and CLI (--draining-instance-expiry-timeout), mirroring HandoverDrainTimeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test Impact PlanDeterministic summary of how this PR changes tests, CI runners, and coverage-risk signals. Summary
Signals
Coverage risk: neutral or increased No coverage-reduction warnings detected. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Worker pods leak and accumulate idle, holding their full per-worker reservations and pinning nodes (via
karpenter.sh/do-not-disrupt) far past their TTL.Root cause is a single shared knob. In remote mode
HandoverDrainTimeoutis intentionally0(unbounded session drain) so a rolling control plane never cuts an in-flight customer import. But the janitor's draining-CP-row reaper was wired to that same field:so its gate
if maxDrainTimeout > 0was never satisfied andExpireDrainingControlPlaneInstancesnever ran.A draining old-generation CP keeps heartbeating on a non-cancellable context, so its
cp_instancesrow staysstate=drainingwith a fresh heartbeat and is never expired.ListOrphanedWorkersonly lists workers whose owning CP isexpired— so every worker owned by that draining CP is invisible to the orphan sweep and never reaped.Fix
Introduce a dedicated, finite
DrainingInstanceExpiryTimeout(default 2h) that bounds how long a draining CP row may linger before the leader janitor force-expires it.HandoverDrainTimeout=0(session drain) is left untouched.Resolution goes through
effectiveDrainingInstanceExpiry, which never returns 0 — so the reaper can no longer be silently disabled by an unset knob (mirrors the existingeffectiveDefaultWorkerTTLpattern).Expiring a draining CP row cannot cut a live import:
ListOrphanedWorkersalready spares any worker with an active/reconnecting Flight session.Plumbed through the same three sources as
HandoverDrainTimeout:draining_instance_expiry_timeoutDUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT--draining-instance-expiry-timeoutTests
TestEffectiveDrainingInstanceExpiry— parameterized; asserts the resolver never returns 0 (the exact regression guard).cliflags_testflag-completeness table updated.go build/go vet(incl.-tags kubernetes), existing janitor/expiry tests, andconfigresolvetests all pass.Rollout
Ship with the generous 2h default first, then tighten. Watch
state=drainingcp_instancestrend down and orphan-retire metrics climb, while confirming no rise in cut live Flight sessions. Roll back by reverting the wiring.This is PR 1 of a sequence; it un-blinds the orphan sweep and is the prerequisite for the follow-up ownership-independent orphan terminalization +
do-not-disruptreconciliation.🤖 Generated with Claude Code