Skip to content

fix(controlplane): decouple draining-CP-row reaper from the drain wall#818

Open
EDsCODE wants to merge 1 commit into
mainfrom
eric/fix-draining-cp-row-reaper
Open

fix(controlplane): decouple draining-CP-row reaper from the drain wall#818
EDsCODE wants to merge 1 commit into
mainfrom
eric/fix-draining-cp-row-reaper

Conversation

@EDsCODE

@EDsCODE EDsCODE commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Problem

Worker pods leak and accumulate idle, holding their full per-worker reservations and pinning nodes (via karpenter.sh/do-not-disrupt) far past their TTL.

Root cause is a single shared knob. In remote mode HandoverDrainTimeout is intentionally 0 (unbounded session drain) so a rolling control plane never cuts an in-flight customer import. But the janitor's draining-CP-row reaper was wired to that same field:

janitor.maxDrainTimeout = cfg.HandoverDrainTimeout   // 0 in remote mode

so its gate if maxDrainTimeout > 0 was never satisfied and ExpireDrainingControlPlaneInstances never ran.

A draining old-generation CP keeps heartbeating on a non-cancellable context, so its cp_instances row stays state=draining with a fresh heartbeat and is never expired. ListOrphanedWorkers only lists workers whose owning CP is expired — so every worker owned by that draining CP is invisible to the orphan sweep and never reaped.

Fix

Introduce a dedicated, finite DrainingInstanceExpiryTimeout (default 2h) that bounds how long a draining CP row may linger before the leader janitor force-expires it. HandoverDrainTimeout=0 (session drain) is left untouched.

Resolution goes through effectiveDrainingInstanceExpiry, which never returns 0 — so the reaper can no longer be silently disabled by an unset knob (mirrors the existing effectiveDefaultWorkerTTL pattern).

Expiring a draining CP row cannot cut a live import: ListOrphanedWorkers already spares any worker with an active/reconnecting Flight session.

Plumbed through the same three sources as HandoverDrainTimeout:

  • file: draining_instance_expiry_timeout
  • env: DUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT
  • CLI: --draining-instance-expiry-timeout

Tests

  • New TestEffectiveDrainingInstanceExpiry — parameterized; asserts the resolver never returns 0 (the exact regression guard).
  • cliflags_test flag-completeness table updated.
  • go build/go vet (incl. -tags kubernetes), existing janitor/expiry tests, and configresolve tests all pass.

Rollout

Ship with the generous 2h default first, then tighten. Watch state=draining cp_instances trend down and orphan-retire metrics climb, while confirming no rise in cut live Flight sessions. Roll back by reverting the wiring.

Decision needed: confirm the longest plausible legitimate drain (a pgwire import session with no Flight session). If that can exceed 2h, raise the default or set the knob higher to start.

This is PR 1 of a sequence; it un-blinds the orphan sweep and is the prerequisite for the follow-up ownership-independent orphan terminalization + do-not-disrupt reconciliation.

🤖 Generated with Claude Code

In remote mode HandoverDrainTimeout is intentionally 0 (unbounded session
drain) so a rolling control plane never cuts an in-flight customer import.
But the janitor's draining-CP-row reaper was wired to the same field
(janitor.maxDrainTimeout = cfg.HandoverDrainTimeout), so its gate
`if maxDrainTimeout > 0` was never satisfied and
ExpireDrainingControlPlaneInstances never ran.

A draining old-generation CP heartbeats on a non-cancellable context, so
its cp_instances row stays state=draining with a fresh heartbeat and is
never expired. ListOrphanedWorkers only lists workers whose owning CP is
expired, so every worker owned by that draining CP becomes invisible to
the orphan sweep — they leak, holding full per-worker reservations and
pinning their nodes via karpenter.sh/do-not-disrupt.

Introduce a dedicated, finite DrainingInstanceExpiryTimeout (default 2h)
that bounds how long a draining CP row may linger before the leader
janitor force-expires it, leaving HandoverDrainTimeout=0 (session drain)
untouched. Resolution goes through effectiveDrainingInstanceExpiry, which
never returns 0, so the reaper can no longer be silently disabled by an
unset knob. Expiring a draining CP row cannot cut a live import: the
orphan sweep already spares workers with an active/reconnecting Flight
session.

Plumbed through file (draining_instance_expiry_timeout), env
(DUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT), and CLI
(--draining-instance-expiry-timeout), mirroring HandoverDrainTimeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area Added Changed Deleted
Test files 0 2 0
E2E/journey files 0 0 0
Workflow files 0 0 0

Signals

  • Test cases: +1 / -0
  • Assertions: +2 / -0
  • Skips or known failures added: 0
  • Workflow continue-on-error added: 0
  • Workflow path filters added: 0
  • Test commands removed from justfile: 0
  • E2E/journey retry lines added: 0

Coverage risk: neutral or increased

No coverage-reduction warnings detected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant