fix(leader-election): suppress standby lease false positive#72

Open
rswigginton wants to merge 1 commit into
Nextdoor:mainfrom
rswigginton:fix/70-standby-lease-false-positive

@rswigginton

In HA (leader election on, 2+ replicas), the non-leader replica logged "no controller is watching nodes" at ERROR every 2x leaseDuration forever, even with a healthy leader. The monitor only watched its own Elected() channel, which never fires on a standby, so it could not tell "another pod is leading" (fine) from "nobody is leading" (the real #55 wedge).

monitorLeaseAcquisition now reads the election Lease before escalating. If another replica holds the Lease and has renewed it within the lease duration, this pod is a standby and the monitor logs at debug instead of warning. ERROR is reserved for the case where no leader is live, which preserves #55 detection.

The lease namespace is pinned to POD_NAMESPACE (downward API) so the probe and the manager read the same Lease.

Tests: healthy standby -> no ERROR; no live leader -> ERROR; a stale lease fed through the real probe -> ERROR (#55 end to end); leaderLeaseLive table.

Fixes #70

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rswigginton rswigginton requested a review from a team as a code owner June 14, 2026 22:06

Development

Successfully merging this pull request may close these issues.

monitorLeaseAcquisition logs ERROR ... no controller is watching nodes on healthy standby replicas (false positive)