controlplane: deterministic worker provisioning + stop the idle-worker leak by fuziontech · Pull Request #820 · PostHog/duckgres

fuziontech · 2026-06-27T00:01:50Z

What

Deterministic, leak-free k8s/remote worker provisioning. Driven by a deep audit of the symptom "hot workers not getting returned to the pool, ~4000 vCPU of idle worker pods wasted."

Root cause (and the fix)

JanitorLeaderManager.Start ran elector.Run exactly once. client-go's LeaderElector.Run returns the moment the lease can't be renewed, and nothing re-ran it — so a single API-server blip permanently dropped a replica out of the election. Because every fleet-wide reclamation pass (hot-idle TTL reap, orphan/stuck sweeps, stranded-pod cleanup, headroom scale-down) is leader-gated with no fallback, enough leadership churn left the lease stale with no contender and the reaper dark fleet-wide, while all 6 replicas kept spawning r6gd pods. Bounded only by a CP restart — which is why the leak reset on deploy and re-grew between deploys. → Fix #1: re-contend in a loop.

All fixes

Elector re-contends after losing leadership (janitor_leader_k8s.go). Primary cause of the steady-state leak.
Per-CP fallback hot-idle reaper (per_cp_hot_idle_reaper.go) — every replica retires its own expired hot-idle workers on a local timer, leader-independent, honoring the DefaultWorkerMinHotIdle floor. Plus duckgres_control_plane_janitor_is_leader and ..._hot_idle_reap_last_run_timestamp_seconds gauges so a wedged reaper is alertable.
Persist-before-mutate on hot→hot_idle (k8s_pool_lifecycle.go) — a swallowed durable write left a worker in-memory hot_idle / durable hot, invisible to both reuse and the reaper. Now stays Hot (still reusable) on failure + a hot_idle_persist_failures_total metric.
Same hardening on the activate hot persist (k8s_pool_acquire.go) — removes the durable/in-memory split that let the stuck-activating reaper retire a live worker mid-session.
Failed spawn frees its spawning-slot row on the request thread (k8s_pool_acquire.go) instead of leaking org+global cap (→ spurious OrgCap) until the 10m sweep.
Headroom is now a small constant (DUCKGRES_K8S_HEADROOM_NODES, node-sized placeholders) instead of headroomPercent × demand, which counted hot-idle workers as demand and amplified any idle leak ~1.75×. Disabling headroom now deletes existing placeholders. (Charts wiring in the companion PR.)
Built-in default worker TTL 20m → 1m so idle one-session pods + nodes are reclaimed quickly by default; clients/orgs opt up via duckgres.worker_ttl / default_worker_ttl.

Tests

Unit: elector re-contend + stop, hot-idle persist failure, spawn-slot free, per-CP reaper (+floor +disabled), headroom constant/disabled. Full controlplane suite green on both kubernetes and default build tags; configstore/configresolve/manifests green. e2e harness.sh gains hot_idle_retired — a 1m-TTL worker must actually be reaped (the load-bearing regression for the leak). Docs (worker-ttl-pool.md, replenish-capacity.md) synced.

Companion

Charts PR wires headroomNodes, switches mw-prod-us to constant headroom (3 nodes) and workerDefaultTtl: 1m.

🤖 Generated with Claude Code

Audit of the k8s/remote worker provisioning flow (hot workers not retired /reused, ~4000 vCPU of idle pods wasted). Fixes, highest-leverage first: 1. Janitor leader elector now re-contends in a loop after losing the lease. client-go's LeaderElector.Run returns on a lost lease and we only ran it once, so a single renewal blip permanently dropped a replica out of the election; enough churn left the lease stale with no contender and ALL fleet-wide reaping (hot-idle TTL, orphan/stuck sweeps, headroom) dark. This is the primary cause of the steady-state idle accumulation. 2. Per-CP fallback hot-idle reaper: every replica retires its own expired hot-idle workers on a local timer, independent of leadership (honors the DefaultWorkerMinHotIdle floor). Backstops the leader-only janitor. Adds duckgres_control_plane_janitor_is_leader and duckgres_control_plane_hot_idle_reap_last_run_timestamp_seconds for alerting. 3. Persist-before-mutate on the hot->hot_idle transition: a swallowed durable write left a worker in-memory hot_idle / durable hot, invisible to BOTH reuse and the TTL reaper. On failure it now stays Hot (still reusable) and bumps duckgres_worker_hot_idle_persist_failures_total. 4. Same hardening on the activate hot persist, removing the durable/in-memory split that let the stuck-activating reaper retire a live worker mid-session. 5. Failed pod spawn frees its durable spawning-slot row on the request thread instead of leaking org+global cap until the 10m stale-spawning sweep. 6. Headroom is now a small CONSTANT (DUCKGRES_K8S_HEADROOM_NODES, node-sized placeholders) instead of headroomPercent*demand, which counted hot-idle workers as demand and amplified any idle leak ~1.75x. Disabling headroom now deletes existing placeholders instead of stranding them. 7. Built-in default worker TTL 20m -> 1m so idle one-session pods (and their nodes) are reclaimed quickly by default; clients/orgs opt up via duckgres.worker_ttl / default_worker_ttl (still clamped by workerMaxTtl). Unit tests for the elector restart, hot-idle persist failure, spawn-slot free, per-CP reaper (+floor), and headroom constant/disabled modes. e2e harness gains hot_idle_retired (a 1m-TTL worker must actually be reaped). Docs synced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-27T00:02:23Z

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area	Added	Changed
Test files	1	4
E2E/journey files	0	1
Workflow files	0	0

Signals

Test cases: +13 / -0
Assertions: +40 / -0
Skips or known failures added: 0
Workflow continue-on-error added: 0
Workflow path filters added: 0
Test commands removed from justfile: 0
E2E/journey retry lines added: 0

Coverage risk: needs review

Warnings

E2E or journey files changed (needs review)
- tests/e2e-mw-dev/harness.sh

Follow-up to the multi-agent review of this PR. Closes the gaps it confirmed: - Per-CP fallback reaper now also reaps ROLLOUT-ORPHANED hot-idle workers (owner CP instance dead), via the orphan snapshot list + fenced RetireOrphanFromSnapshot. cpInstanceID is freshly minted on every CP start, so a graceful rollout orphans the prior process's hot-idle rows; without this they were reaped only by the leader janitor — defeating the whole point of a leader-independent backstop in the exact rollout+lease-blip case. - Janitor stamps the last-successful-reap gauge ONLY after a successful list, so a config-store outage can no longer keep the gauge fresh and silence the wedged-reaper alert at the moment reaping is dark. - Pool-local stuck-activating reaper skips workers with a live pre-claimed session, so a slow in-band activation (e.g. DuckLake migration) isn't reaped out from under a waiting client. (The leader durable stuck reaper still can't see active sessions; persisting a claimed marker is a follow-up.) - FIX a self-deadlock introduced by the hot_idle persist-before-mutate change: the failure path logged via p.logw(id), which RLocks p.mu while we already hold p.mu (RWMutex is not reentrant). Log with workerLogAttrs(w) instead. Caught by the new persist-failure unit test hanging. Tests: persist-failure regression tests for the hot_idle park (#3, stays Hot) and activation hot persist (#4, returns error); per-CP orphan-reap + orphan-disabled tests. Corrected the overstated hot_idle_retired harness comment (it is an e2e smoke, not the leader-loss/persist-failure regression gate — those are the unit tests) and stale comments in control.go/janitor.go/k8s_pool_acquire.go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… reaped Closes the durable half of the stuck-activating gap the review flagged (the pool-local reaper was already made session-aware). The leader stuck-activating reaper keys off updated_at and cannot see the live in-memory session, so an activation that legitimately outlives activateTimeout (5m) — e.g. a large in-band DuckLake migration — could be CAS-retired mid-session with the client bound to it. ActivateReservedWorker now runs a heartbeat goroutine that re-persists the activating row (bumping updated_at) every activatingHeartbeatInterval while activate() is in flight. This ties "fresh" to "the activation goroutine is alive and progressing": a wedged activation returns via its ctx deadline (workerSpawnActivateTimeout) and is retired, and a crashed CP's goroutine dies so the row goes stale and IS reaped — so genuinely-stuck rows are still cleaned up, just not live ones. The heartbeat builds+persists under p.mu and re-checks the worker is still Activating, so it serializes with the Hot transition and can never clobber a committed Hot row back to activating. The fenced upsert can't resurrect a worker another path retired. Interval is injectable for tests (activatingHBInterval). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

golangci-lint runs with default build tags, so janitorIsLeaderGauge / setJanitorIsLeader and workerHotIdlePersistFailuresCounter / observeHotIdlePersistFailure — referenced only from kubernetes-tagged files (janitor_leader_k8s.go, k8s_pool_lifecycle.go) — were flagged "unused" under the default-tag analysis. Move them into worker_lifecycle_metrics_k8s.go (//go:build kubernetes) so neither the symbol nor its sole caller is compiled in the default build. hotIdleReapLastRunGauge / observeHotIdleReapRun stay in the shared file (used by the untagged janitor.go + per_cp_hot_idle_reaper.go). Verified: golangci-lint v2.11.4 (CI pin) reports 0 issues; builds clean on both tags. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both LOW/non-blocking, from the follow-up re-review: - Per-CP reaper: run the orphan pass BEFORE the hotIdleTTL<=0 early-return, so the orphan backstop is independent of the self-owned TTL guard — mirroring the leader orphan sweep (which is not gated on the hot-idle TTL block). Latent today (hotIdleTTL floors to 1m) but removes the coupling. - Activation heartbeat: make stop idempotent (sync.Once) and defer it as a panic-safe backstop while still joining explicitly before the Hot-commit (so no clobber race AND no goroutine/ticker leak if activate() panics). Full controlplane suite green on both tags; golangci-lint v2.11.4 reports 0 issues. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jghoman approved these changes Jun 27, 2026

View reviewed changes

fuziontech and others added 4 commits June 27, 2026 00:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

controlplane: deterministic worker provisioning + stop the idle-worker leak#820

controlplane: deterministic worker provisioning + stop the idle-worker leak#820
fuziontech wants to merge 5 commits into
mainfrom
worker-provisioning-determinism

fuziontech commented Jun 27, 2026

Uh oh!

github-actions Bot commented Jun 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

fuziontech commented Jun 27, 2026

What

Root cause (and the fix)

All fixes

Tests

Companion

Uh oh!

github-actions Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Impact Plan

Summary

Signals

Warnings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 27, 2026 •

edited

Loading