Problem
In remote/k8s mode the control-plane graceful-shutdown drain is unbounded: HandoverDrainTimeout defaults to 0 for --worker-backend remote (controlplane/control.go:256-268). On SIGTERM the CP stops accepting new connections and blocks on cp.wg.Wait() until every existing connection goroutine returns (control.go:1556-1631, :685/687).
A connection goroutine (server/conn.go messageLoop, :1047-1064) returns only on client disconnect/EOF, a fatal error, or an idle-timeout read deadline — and IdleTimeout defaults to 0 (disabled). The CP never force-closes a live connection (activeConns is just a counter, no registry).
Consequence: a single idle but open client connection keeps a Terminating CP pod alive until the pod's k8s terminationGracePeriodSeconds (24h in prod) expires and the kubelet SIGKILLs it. So:
- Rolling deploys / scale-down can leave old CP pods lingering for hours.
- A connection still open at the grace wall is SIGKILLed (ungraceful) instead of draining cleanly.
Why it is currently 0
The 15m default was deliberately removed after an incident where the timeout cut in-flight customer queries at the wall (see the comment at controlplane/control.go:256-268). So we cannot simply reinstate a flat hard cap — that reintroduces the original regression.
Note: CLAUDE.md, the main.go help text, and the README still say "15m default (remote)". That is stale: the actual remote default is 0. The docs should be corrected as part of this.
Possible directions (for discussion)
- Distinguish idle connections from connections with in-flight work during drain: after some grace, close idle connections (no active query / txn / stream) while continuing to wait on connections that are actively running work. This bounds drain without cutting live queries.
- Wire a sensible default IdleTimeout so abandoned idle connections are reaped in steady state (not only at shutdown).
- A drain deadline that only applies to idle connections, leaving active work unbounded.
Acceptance
- A terminating CP pod cannot be held open indefinitely by idle connections.
- In-flight queries are still NOT cut mid-execution (no regression of the original incident).
- Stale "15m" docs corrected to reflect actual behavior.
Context: surfaced while designing managed-warehouse compute-seconds billing (docs/billing-compute-seconds-plan.md), where shutdown/connection-end semantics matter for metering.