fix(controlplane): decouple draining-CP-row reaper from the drain wall by EDsCODE · Pull Request #818 · PostHog/duckgres

EDsCODE · 2026-06-26T19:11:28Z

Problem

Worker pods leak and accumulate idle, holding their full per-worker reservations and pinning nodes (via karpenter.sh/do-not-disrupt) far past their TTL.

Root cause is a single shared knob. In remote mode HandoverDrainTimeout is intentionally 0 (unbounded session drain) so a rolling control plane never cuts an in-flight customer import. But the janitor's draining-CP-row reaper was wired to that same field:

janitor.maxDrainTimeout = cfg.HandoverDrainTimeout   // 0 in remote mode

so its gate if maxDrainTimeout > 0 was never satisfied and ExpireDrainingControlPlaneInstances never ran.

A draining old-generation CP keeps heartbeating on a non-cancellable context, so its cp_instances row stays state=draining with a fresh heartbeat and is never expired. ListOrphanedWorkers only lists workers whose owning CP is expired — so every worker owned by that draining CP is invisible to the orphan sweep and never reaped.

Fix

Introduce a dedicated, finite DrainingInstanceExpiryTimeout (default 2h) that bounds how long a draining CP row may linger before the leader janitor force-expires it. HandoverDrainTimeout=0 (session drain) is left untouched.

Resolution goes through effectiveDrainingInstanceExpiry, which never returns 0 — so the reaper can no longer be silently disabled by an unset knob (mirrors the existing effectiveDefaultWorkerTTL pattern).

Expiring a draining CP row cannot cut a live import: ListOrphanedWorkers already spares any worker with an active/reconnecting Flight session.

Plumbed through the same three sources as HandoverDrainTimeout:

file: draining_instance_expiry_timeout
env: DUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT
CLI: --draining-instance-expiry-timeout

Tests

New TestEffectiveDrainingInstanceExpiry — parameterized; asserts the resolver never returns 0 (the exact regression guard).
cliflags_test flag-completeness table updated.
go build/go vet (incl. -tags kubernetes), existing janitor/expiry tests, and configresolve tests all pass.

Rollout

Ship with the generous 2h default first, then tighten. Watch state=draining cp_instances trend down and orphan-retire metrics climb, while confirming no rise in cut live Flight sessions. Roll back by reverting the wiring.

Decision needed: confirm the longest plausible legitimate drain (a pgwire import session with no Flight session). If that can exceed 2h, raise the default or set the knob higher to start.

This is PR 1 of a sequence; it un-blinds the orphan sweep and is the prerequisite for the follow-up ownership-independent orphan terminalization + do-not-disrupt reconciliation.

🤖 Generated with Claude Code

In remote mode HandoverDrainTimeout is intentionally 0 (unbounded session drain) so a rolling control plane never cuts an in-flight customer import. But the janitor's draining-CP-row reaper was wired to the same field (janitor.maxDrainTimeout = cfg.HandoverDrainTimeout), so its gate `if maxDrainTimeout > 0` was never satisfied and ExpireDrainingControlPlaneInstances never ran. A draining old-generation CP heartbeats on a non-cancellable context, so its cp_instances row stays state=draining with a fresh heartbeat and is never expired. ListOrphanedWorkers only lists workers whose owning CP is expired, so every worker owned by that draining CP becomes invisible to the orphan sweep — they leak, holding full per-worker reservations and pinning their nodes via karpenter.sh/do-not-disrupt. Introduce a dedicated, finite DrainingInstanceExpiryTimeout (default 2h) that bounds how long a draining CP row may linger before the leader janitor force-expires it, leaving HandoverDrainTimeout=0 (session drain) untouched. Resolution goes through effectiveDrainingInstanceExpiry, which never returns 0, so the reaper can no longer be silently disabled by an unset knob. Expiring a draining CP row cannot cut a live import: the orphan sweep already spares workers with an active/reconnecting Flight session. Plumbed through file (draining_instance_expiry_timeout), env (DUCKGRES_DRAINING_INSTANCE_EXPIRY_TIMEOUT), and CLI (--draining-instance-expiry-timeout), mirroring HandoverDrainTimeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-26T19:12:01Z

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area	Added	Changed	Deleted
Test files	0	2	0
E2E/journey files	0	0	0
Workflow files	0	0	0

Signals

Test cases: +1 / -0
Assertions: +2 / -0
Skips or known failures added: 0
Workflow continue-on-error added: 0
Workflow path filters added: 0
Test commands removed from justfile: 0
E2E/journey retry lines added: 0

Coverage risk: neutral or increased

No coverage-reduction warnings detected.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(controlplane): decouple draining-CP-row reaper from the drain wall#818

fix(controlplane): decouple draining-CP-row reaper from the drain wall#818
EDsCODE wants to merge 1 commit into
mainfrom
eric/fix-draining-cp-row-reaper

EDsCODE commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

EDsCODE commented Jun 26, 2026

Problem

Fix

Tests

Rollout

Uh oh!

github-actions Bot commented Jun 26, 2026

Test Impact Plan

Summary

Signals

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant