control-plane: remote drain is unbounded — idle connections can block pod termination indefinitely

## Problem

In remote/k8s mode the control-plane graceful-shutdown drain is **unbounded**: `HandoverDrainTimeout` defaults to `0` for `--worker-backend remote` (`controlplane/control.go:256-268`). On SIGTERM the CP stops accepting new connections and blocks on `cp.wg.Wait()` until **every** existing connection goroutine returns (`control.go:1556-1631`, `:685/687`).

A connection goroutine (`server/conn.go messageLoop`, `:1047-1064`) returns only on client disconnect/EOF, a fatal error, or an idle-timeout read deadline, and `IdleTimeout` defaults to `0` (disabled), so that last exit path never fires. The CP also never force-closes a live connection: `activeConns` is only a counter, and there is no registry of connections it could close.

**Consequence:** a single **idle but open** client connection keeps a `Terminating` CP pod alive until the pod's k8s `terminationGracePeriodSeconds` (24h in prod) expires and the kubelet SIGKILLs it. So:
- Rolling deploys / scale-down can leave old CP pods lingering for hours.
- A connection still open at the grace wall is cut ungracefully when the pod is SIGKILLed, instead of draining cleanly.

## Why it is currently 0

The 15m default was deliberately removed after an incident where the timeout cut in-flight customer queries at the wall (see the comment at `controlplane/control.go:256-268`). So we cannot simply reinstate a flat hard cap — that reintroduces the original regression.

Note: CLAUDE.md, `main.go` help text, and the README still say "15m default (remote)" — **stale**, the actual remote default is `0`. Docs should be corrected as part of this.

## Possible directions (for discussion)

- Distinguish **idle** connections from connections with **in-flight work** during drain: after some grace, close idle connections (no active query / txn / stream) while continuing to wait on connections that are actively running work. This bounds drain without cutting live queries.
- Wire a sensible default `IdleTimeout` so abandoned idle connections are reaped in steady state (not only at shutdown).
- A drain deadline that only applies to idle connections, leaving active work unbounded.

## Acceptance

- A terminating CP pod cannot be held open indefinitely by idle connections.
- In-flight queries are still NOT cut mid-execution (no regression of the original incident).
- Stale "15m" docs corrected to reflect actual behavior.

Context: surfaced while designing managed-warehouse compute-seconds billing (`docs/billing-compute-seconds-plan.md`), where shutdown/connection-end semantics matter for metering.
