controlplane: deterministic worker provisioning + stop the idle-worker leak#820
Open
fuziontech wants to merge 5 commits into
Open
controlplane: deterministic worker provisioning + stop the idle-worker leak#820fuziontech wants to merge 5 commits into
fuziontech wants to merge 5 commits into
Conversation
Audit of the k8s/remote worker provisioning flow (hot workers not retired /reused, ~4000 vCPU of idle pods wasted). Fixes, highest-leverage first: 1. Janitor leader elector now re-contends in a loop after losing the lease. client-go's LeaderElector.Run returns on a lost lease and we only ran it once, so a single renewal blip permanently dropped a replica out of the election; enough churn left the lease stale with no contender and ALL fleet-wide reaping (hot-idle TTL, orphan/stuck sweeps, headroom) dark. This is the primary cause of the steady-state idle accumulation. 2. Per-CP fallback hot-idle reaper: every replica retires its own expired hot-idle workers on a local timer, independent of leadership (honors the DefaultWorkerMinHotIdle floor). Backstops the leader-only janitor. Adds duckgres_control_plane_janitor_is_leader and duckgres_control_plane_hot_idle_reap_last_run_timestamp_seconds for alerting. 3. Persist-before-mutate on the hot->hot_idle transition: a swallowed durable write left a worker in-memory hot_idle / durable hot, invisible to BOTH reuse and the TTL reaper. On failure it now stays Hot (still reusable) and bumps duckgres_worker_hot_idle_persist_failures_total. 4. Same hardening on the activate hot persist, removing the durable/in-memory split that let the stuck-activating reaper retire a live worker mid-session. 5. Failed pod spawn frees its durable spawning-slot row on the request thread instead of leaking org+global cap until the 10m stale-spawning sweep. 6. Headroom is now a small CONSTANT (DUCKGRES_K8S_HEADROOM_NODES, node-sized placeholders) instead of headroomPercent*demand, which counted hot-idle workers as demand and amplified any idle leak ~1.75x. Disabling headroom now deletes existing placeholders instead of stranding them. 7. Built-in default worker TTL 20m -> 1m so idle one-session pods (and their nodes) are reclaimed quickly by default; clients/orgs opt up via duckgres.worker_ttl / default_worker_ttl (still clamped by workerMaxTtl). Unit tests for the elector restart, hot-idle persist failure, spawn-slot free, per-CP reaper (+floor), and headroom constant/disabled modes. e2e harness gains hot_idle_retired (a 1m-TTL worker must actually be reaped). Docs synced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test Impact PlanDeterministic summary of how this PR changes tests, CI runners, and coverage-risk signals. Summary
Signals
Coverage risk: needs review Warnings
|
jghoman
approved these changes
Jun 27, 2026
Follow-up to the multi-agent review of this PR. Closes the gaps it confirmed: - Per-CP fallback reaper now also reaps ROLLOUT-ORPHANED hot-idle workers (owner CP instance dead), via the orphan snapshot list + fenced RetireOrphanFromSnapshot. cpInstanceID is freshly minted on every CP start, so a graceful rollout orphans the prior process's hot-idle rows; without this they were reaped only by the leader janitor — defeating the whole point of a leader-independent backstop in the exact rollout+lease-blip case. - Janitor stamps the last-successful-reap gauge ONLY after a successful list, so a config-store outage can no longer keep the gauge fresh and silence the wedged-reaper alert at the moment reaping is dark. - Pool-local stuck-activating reaper skips workers with a live pre-claimed session, so a slow in-band activation (e.g. DuckLake migration) isn't reaped out from under a waiting client. (The leader durable stuck reaper still can't see active sessions; persisting a claimed marker is a follow-up.) - FIX a self-deadlock introduced by the hot_idle persist-before-mutate change: the failure path logged via p.logw(id), which RLocks p.mu while we already hold p.mu (RWMutex is not reentrant). Log with workerLogAttrs(w) instead. Caught by the new persist-failure unit test hanging. Tests: persist-failure regression tests for the hot_idle park (#3, stays Hot) and activation hot persist (#4, returns error); per-CP orphan-reap + orphan-disabled tests. Corrected the overstated hot_idle_retired harness comment (it is an e2e smoke, not the leader-loss/persist-failure regression gate — those are the unit tests) and stale comments in control.go/janitor.go/k8s_pool_acquire.go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… reaped Closes the durable half of the stuck-activating gap the review flagged (the pool-local reaper was already made session-aware). The leader stuck-activating reaper keys off updated_at and cannot see the live in-memory session, so an activation that legitimately outlives activateTimeout (5m) — e.g. a large in-band DuckLake migration — could be CAS-retired mid-session with the client bound to it. ActivateReservedWorker now runs a heartbeat goroutine that re-persists the activating row (bumping updated_at) every activatingHeartbeatInterval while activate() is in flight. This ties "fresh" to "the activation goroutine is alive and progressing": a wedged activation returns via its ctx deadline (workerSpawnActivateTimeout) and is retired, and a crashed CP's goroutine dies so the row goes stale and IS reaped — so genuinely-stuck rows are still cleaned up, just not live ones. The heartbeat builds+persists under p.mu and re-checks the worker is still Activating, so it serializes with the Hot transition and can never clobber a committed Hot row back to activating. The fenced upsert can't resurrect a worker another path retired. Interval is injectable for tests (activatingHBInterval). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
golangci-lint runs with default build tags, so janitorIsLeaderGauge / setJanitorIsLeader and workerHotIdlePersistFailuresCounter / observeHotIdlePersistFailure — referenced only from kubernetes-tagged files (janitor_leader_k8s.go, k8s_pool_lifecycle.go) — were flagged "unused" under the default-tag analysis. Move them into worker_lifecycle_metrics_k8s.go (//go:build kubernetes) so neither the symbol nor its sole caller is compiled in the default build. hotIdleReapLastRunGauge / observeHotIdleReapRun stay in the shared file (used by the untagged janitor.go + per_cp_hot_idle_reaper.go). Verified: golangci-lint v2.11.4 (CI pin) reports 0 issues; builds clean on both tags. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both LOW/non-blocking, from the follow-up re-review: - Per-CP reaper: run the orphan pass BEFORE the hotIdleTTL<=0 early-return, so the orphan backstop is independent of the self-owned TTL guard — mirroring the leader orphan sweep (which is not gated on the hot-idle TTL block). Latent today (hotIdleTTL floors to 1m) but removes the coupling. - Activation heartbeat: make stop idempotent (sync.Once) and defer it as a panic-safe backstop while still joining explicitly before the Hot-commit (so no clobber race AND no goroutine/ticker leak if activate() panics). Full controlplane suite green on both tags; golangci-lint v2.11.4 reports 0 issues. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Deterministic, leak-free k8s/remote worker provisioning. Driven by a deep audit of the symptom "hot workers not getting returned to the pool, ~4000 vCPU of idle worker pods wasted."
Root cause (and the fix)
JanitorLeaderManager.Startranelector.Runexactly once. client-go'sLeaderElector.Runreturns the moment the lease can't be renewed, and nothing re-ran it — so a single API-server blip permanently dropped a replica out of the election. Because every fleet-wide reclamation pass (hot-idle TTL reap, orphan/stuck sweeps, stranded-pod cleanup, headroom scale-down) is leader-gated with no fallback, enough leadership churn left the lease stale with no contender and the reaper dark fleet-wide, while all 6 replicas kept spawning r6gd pods. Bounded only by a CP restart — which is why the leak reset on deploy and re-grew between deploys. → Fix #1: re-contend in a loop.All fixes
janitor_leader_k8s.go). Primary cause of the steady-state leak.per_cp_hot_idle_reaper.go) — every replica retires its own expired hot-idle workers on a local timer, leader-independent, honoring theDefaultWorkerMinHotIdlefloor. Plusduckgres_control_plane_janitor_is_leaderand..._hot_idle_reap_last_run_timestamp_secondsgauges so a wedged reaper is alertable.k8s_pool_lifecycle.go) — a swallowed durable write left a worker in-memory hot_idle / durable hot, invisible to both reuse and the reaper. Now staysHot(still reusable) on failure + ahot_idle_persist_failures_totalmetric.k8s_pool_acquire.go) — removes the durable/in-memory split that let the stuck-activating reaper retire a live worker mid-session.k8s_pool_acquire.go) instead of leaking org+global cap (→ spurious OrgCap) until the 10m sweep.DUCKGRES_K8S_HEADROOM_NODES, node-sized placeholders) instead ofheadroomPercent × demand, which counted hot-idle workers as demand and amplified any idle leak ~1.75×. Disabling headroom now deletes existing placeholders. (Charts wiring in the companion PR.)duckgres.worker_ttl/default_worker_ttl.Tests
Unit: elector re-contend + stop, hot-idle persist failure, spawn-slot free, per-CP reaper (+floor +disabled), headroom constant/disabled. Full
controlplanesuite green on bothkubernetesand default build tags;configstore/configresolve/manifestsgreen. e2eharness.shgainshot_idle_retired— a 1m-TTL worker must actually be reaped (the load-bearing regression for the leak). Docs (worker-ttl-pool.md,replenish-capacity.md) synced.Companion
Charts PR wires
headroomNodes, switches mw-prod-us to constant headroom (3 nodes) andworkerDefaultTtl: 1m.🤖 Generated with Claude Code