control-plane: remote drain is unbounded — idle connections can block pod termination indefinitely #782

Description

@benben

Problem

In remote/k8s mode the control-plane graceful-shutdown drain is unbounded: HandoverDrainTimeout defaults to 0 for --worker-backend remote (controlplane/control.go:256-268). On SIGTERM the CP stops accepting new connections and blocks on cp.wg.Wait() until every existing connection goroutine returns (control.go:1556-1631, :685/687).

A connection goroutine (server/conn.go messageLoop, :1047-1064) returns only on client disconnect/EOF, a fatal error, or an idle-timeout read deadline — and IdleTimeout defaults to 0 (disabled). The CP never force-closes a live connection (activeConns is just a counter, no registry).

Consequence: a single idle but open client connection keeps a Terminating CP pod alive until the pod's k8s terminationGracePeriodSeconds (24h in prod) expires and the kubelet SIGKILLs it. So:

  • Rolling deploys / scale-down can leave old CP pods lingering for hours.
  • A connection still open at the grace wall is SIGKILLed (ungraceful) instead of draining cleanly.

Why it is currently 0

The 15m default was deliberately removed after an incident where the timeout cut in-flight customer queries at the wall (see the comment at controlplane/control.go:256-268). So we cannot simply reinstate a flat hard cap — that reintroduces the original regression.

Note: CLAUDE.md, the main.go help text, and the README still say "15m default (remote)". That is stale: the actual remote default is 0. The docs should be corrected as part of this.

Possible directions (for discussion)

  • Distinguish idle connections from connections with in-flight work during drain: after some grace, close idle connections (no active query / txn / stream) while continuing to wait on connections that are actively running work. This bounds drain without cutting live queries.
  • Wire a sensible default IdleTimeout so abandoned idle connections are reaped in steady state (not only at shutdown).
  • A drain deadline that only applies to idle connections, leaving active work unbounded.

Acceptance

  • A terminating CP pod cannot be held open indefinitely by idle connections.
  • In-flight queries are still NOT cut mid-execution (no regression of the original incident).
  • Stale "15m" docs corrected to reflect actual behavior.

Context: surfaced while designing managed-warehouse compute-seconds billing (docs/billing-compute-seconds-plan.md), where shutdown/connection-end semantics matter for metering.
