Skip to content

controlplane: deterministic worker provisioning + stop the idle-worker leak#820

Open
fuziontech wants to merge 5 commits into
mainfrom
worker-provisioning-determinism
Open

controlplane: deterministic worker provisioning + stop the idle-worker leak#820
fuziontech wants to merge 5 commits into
mainfrom
worker-provisioning-determinism

Conversation

@fuziontech

Copy link
Copy Markdown
Member

What

Deterministic, leak-free k8s/remote worker provisioning. Driven by a deep audit of the symptom "hot workers not getting returned to the pool, ~4000 vCPU of idle worker pods wasted."

Root cause (and the fix)

JanitorLeaderManager.Start ran elector.Run exactly once. client-go's LeaderElector.Run returns the moment the lease can't be renewed, and nothing re-ran it — so a single API-server blip permanently dropped a replica out of the election. Because every fleet-wide reclamation pass (hot-idle TTL reap, orphan/stuck sweeps, stranded-pod cleanup, headroom scale-down) is leader-gated with no fallback, enough leadership churn left the lease stale with no contender and the reaper dark fleet-wide, while all 6 replicas kept spawning r6gd pods. Bounded only by a CP restart — which is why the leak reset on deploy and re-grew between deploys. → Fix #1: re-contend in a loop.

All fixes

  1. Elector re-contends after losing leadership (janitor_leader_k8s.go). Primary cause of the steady-state leak.
  2. Per-CP fallback hot-idle reaper (per_cp_hot_idle_reaper.go) — every replica retires its own expired hot-idle workers on a local timer, leader-independent, honoring the DefaultWorkerMinHotIdle floor. Plus duckgres_control_plane_janitor_is_leader and ..._hot_idle_reap_last_run_timestamp_seconds gauges so a wedged reaper is alertable.
  3. Persist-before-mutate on hot→hot_idle (k8s_pool_lifecycle.go) — a swallowed durable write left a worker in-memory hot_idle / durable hot, invisible to both reuse and the reaper. Now stays Hot (still reusable) on failure + a hot_idle_persist_failures_total metric.
  4. Same hardening on the activate hot persist (k8s_pool_acquire.go) — removes the durable/in-memory split that let the stuck-activating reaper retire a live worker mid-session.
  5. Failed spawn frees its spawning-slot row on the request thread (k8s_pool_acquire.go) instead of leaking org+global cap (→ spurious OrgCap) until the 10m sweep.
  6. Headroom is now a small constant (DUCKGRES_K8S_HEADROOM_NODES, node-sized placeholders) instead of headroomPercent × demand, which counted hot-idle workers as demand and amplified any idle leak ~1.75×. Disabling headroom now deletes existing placeholders. (Charts wiring in the companion PR.)
  7. Built-in default worker TTL 20m → 1m so idle one-session pods + nodes are reclaimed quickly by default; clients/orgs opt up via duckgres.worker_ttl / default_worker_ttl.

Tests

Unit: elector re-contend + stop, hot-idle persist failure, spawn-slot free, per-CP reaper (+floor +disabled), headroom constant/disabled. Full controlplane suite green on both kubernetes and default build tags; configstore/configresolve/manifests green. e2e harness.sh gains hot_idle_retired — a 1m-TTL worker must actually be reaped (the load-bearing regression for the leak). Docs (worker-ttl-pool.md, replenish-capacity.md) synced.

Companion

Charts PR wires headroomNodes, switches mw-prod-us to constant headroom (3 nodes) and workerDefaultTtl: 1m.

🤖 Generated with Claude Code

Audit of the k8s/remote worker provisioning flow (hot workers not retired
/reused, ~4000 vCPU of idle pods wasted). Fixes, highest-leverage first:

1. Janitor leader elector now re-contends in a loop after losing the lease.
   client-go's LeaderElector.Run returns on a lost lease and we only ran it
   once, so a single renewal blip permanently dropped a replica out of the
   election; enough churn left the lease stale with no contender and ALL
   fleet-wide reaping (hot-idle TTL, orphan/stuck sweeps, headroom) dark. This
   is the primary cause of the steady-state idle accumulation.

2. Per-CP fallback hot-idle reaper: every replica retires its own expired
   hot-idle workers on a local timer, independent of leadership (honors the
   DefaultWorkerMinHotIdle floor). Backstops the leader-only janitor. Adds
   duckgres_control_plane_janitor_is_leader and
   duckgres_control_plane_hot_idle_reap_last_run_timestamp_seconds for alerting.

3. Persist-before-mutate on the hot->hot_idle transition: a swallowed durable
   write left a worker in-memory hot_idle / durable hot, invisible to BOTH
   reuse and the TTL reaper. On failure it now stays Hot (still reusable) and
   bumps duckgres_worker_hot_idle_persist_failures_total.

4. Same hardening on the activate hot persist, removing the durable/in-memory
   split that let the stuck-activating reaper retire a live worker mid-session.

5. Failed pod spawn frees its durable spawning-slot row on the request thread
   instead of leaking org+global cap until the 10m stale-spawning sweep.

6. Headroom is now a small CONSTANT (DUCKGRES_K8S_HEADROOM_NODES, node-sized
   placeholders) instead of headroomPercent*demand, which counted hot-idle
   workers as demand and amplified any idle leak ~1.75x. Disabling headroom now
   deletes existing placeholders instead of stranding them.

7. Built-in default worker TTL 20m -> 1m so idle one-session pods (and their
   nodes) are reclaimed quickly by default; clients/orgs opt up via
   duckgres.worker_ttl / default_worker_ttl (still clamped by workerMaxTtl).

Unit tests for the elector restart, hot-idle persist failure, spawn-slot free,
per-CP reaper (+floor), and headroom constant/disabled modes. e2e harness gains
hot_idle_retired (a 1m-TTL worker must actually be reaped). Docs synced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 27, 2026

Copy link
Copy Markdown

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area Added Changed Deleted
Test files 1 4 0
E2E/journey files 0 1 0
Workflow files 0 0 0

Signals

  • Test cases: +13 / -0
  • Assertions: +40 / -0
  • Skips or known failures added: 0
  • Workflow continue-on-error added: 0
  • Workflow path filters added: 0
  • Test commands removed from justfile: 0
  • E2E/journey retry lines added: 0

Coverage risk: needs review

Warnings

  • E2E or journey files changed (needs review)
    • tests/e2e-mw-dev/harness.sh

fuziontech and others added 4 commits June 27, 2026 00:52
Follow-up to the multi-agent review of this PR. Closes the gaps it confirmed:

- Per-CP fallback reaper now also reaps ROLLOUT-ORPHANED hot-idle workers
  (owner CP instance dead), via the orphan snapshot list + fenced
  RetireOrphanFromSnapshot. cpInstanceID is freshly minted on every CP start,
  so a graceful rollout orphans the prior process's hot-idle rows; without this
  they were reaped only by the leader janitor — defeating the whole point of a
  leader-independent backstop in the exact rollout+lease-blip case.
- Janitor stamps the last-successful-reap gauge ONLY after a successful list,
  so a config-store outage can no longer keep the gauge fresh and silence the
  wedged-reaper alert at the moment reaping is dark.
- Pool-local stuck-activating reaper skips workers with a live pre-claimed
  session, so a slow in-band activation (e.g. DuckLake migration) isn't reaped
  out from under a waiting client. (The leader durable stuck reaper still can't
  see active sessions; persisting a claimed marker is a follow-up.)
- FIX a self-deadlock introduced by the hot_idle persist-before-mutate change:
  the failure path logged via p.logw(id), which RLocks p.mu while we already
  hold p.mu (RWMutex is not reentrant). Log with workerLogAttrs(w) instead.
  Caught by the new persist-failure unit test hanging.

Tests: persist-failure regression tests for the hot_idle park (#3, stays Hot)
and activation hot persist (#4, returns error); per-CP orphan-reap +
orphan-disabled tests. Corrected the overstated hot_idle_retired harness comment
(it is an e2e smoke, not the leader-loss/persist-failure regression gate — those
are the unit tests) and stale comments in control.go/janitor.go/k8s_pool_acquire.go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… reaped

Closes the durable half of the stuck-activating gap the review flagged (the
pool-local reaper was already made session-aware). The leader stuck-activating
reaper keys off updated_at and cannot see the live in-memory session, so an
activation that legitimately outlives activateTimeout (5m) — e.g. a large in-band
DuckLake migration — could be CAS-retired mid-session with the client bound to it.

ActivateReservedWorker now runs a heartbeat goroutine that re-persists the
activating row (bumping updated_at) every activatingHeartbeatInterval while
activate() is in flight. This ties "fresh" to "the activation goroutine is alive
and progressing": a wedged activation returns via its ctx deadline
(workerSpawnActivateTimeout) and is retired, and a crashed CP's goroutine dies so
the row goes stale and IS reaped — so genuinely-stuck rows are still cleaned up,
just not live ones. The heartbeat builds+persists under p.mu and re-checks the
worker is still Activating, so it serializes with the Hot transition and can
never clobber a committed Hot row back to activating. The fenced upsert can't
resurrect a worker another path retired. Interval is injectable for tests
(activatingHBInterval).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
golangci-lint runs with default build tags, so janitorIsLeaderGauge /
setJanitorIsLeader and workerHotIdlePersistFailuresCounter /
observeHotIdlePersistFailure — referenced only from kubernetes-tagged files
(janitor_leader_k8s.go, k8s_pool_lifecycle.go) — were flagged "unused" under the
default-tag analysis. Move them into worker_lifecycle_metrics_k8s.go
(//go:build kubernetes) so neither the symbol nor its sole caller is compiled in
the default build. hotIdleReapLastRunGauge / observeHotIdleReapRun stay in the
shared file (used by the untagged janitor.go + per_cp_hot_idle_reaper.go).

Verified: golangci-lint v2.11.4 (CI pin) reports 0 issues; builds clean on both
tags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both LOW/non-blocking, from the follow-up re-review:

- Per-CP reaper: run the orphan pass BEFORE the hotIdleTTL<=0 early-return, so
  the orphan backstop is independent of the self-owned TTL guard — mirroring the
  leader orphan sweep (which is not gated on the hot-idle TTL block). Latent
  today (hotIdleTTL floors to 1m) but removes the coupling.
- Activation heartbeat: make stop idempotent (sync.Once) and defer it as a
  panic-safe backstop while still joining explicitly before the Hot-commit (so
  no clobber race AND no goroutine/ticker leak if activate() panics).

Full controlplane suite green on both tags; golangci-lint v2.11.4 reports 0 issues.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants