ci: runner-liveness alert for the self-hosted pool (#509 slice 1)#536

Open
avrabe wants to merge 1 commit into main from ci/509-runner-liveness-alert

@avrabe avrabe commented Jun 14, 2026

Contributor

What

Every gating CI job runs on [self-hosted, …], so when the pool goes offline every gate queues forever with no fallback and no alarm — the multi-day outage in #509 was invisible until someone noticed by hand. This adds a GitHub-hosted (ubuntu-latest) liveness probe that keeps firing even when the self-hosted pool is down, and turns that silent failure into a durable tracking issue rather than a transient red badge.

How

.github/workflows/runner-liveness.yml (schedule: */15 * * * * + workflow_dispatch:)

  1. Queued-run age (authoritative). Fails if any run has been queued longer than QUEUE_THRESHOLD_MINUTES (default 30). Needs only actions: read, and works regardless of whether runners are registered at the repo or org level — it measures the actual symptom (jobs not getting picked up).
  2. Runner list (best-effort). GET /repos/{repo}/actions/runners needs the administration scope, which is not grantable to GITHUB_TOKEN (actionlint flagged this), so the lookup self-skips on 403/empty instead of false-alarming. Wire a PAT into GH_TOKEN later if a hard runner count is wanted.
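The two checks above can be sketched as pure decision functions. This is a minimal illustration and not the workflow's actual code. It assumes run records shaped like the `workflow_runs` entries the REST API returns (`id`, `status`, `created_at`); the helper names `parse_github_time`, `stale_queued_runs`, and `should_skip_runner_check` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

QUEUE_THRESHOLD_MINUTES = 30  # default named in the PR description


def parse_github_time(stamp: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as '2026-06-14T10:00:00Z'."""
    # fromisoformat() only accepts a trailing 'Z' on Python 3.11+, so normalise it.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def stale_queued_runs(runs, now, threshold_minutes=QUEUE_THRESHOLD_MINUTES):
    """Return the queued runs that have waited strictly longer than the threshold."""
    limit = timedelta(minutes=threshold_minutes)
    return [
        run
        for run in runs
        if run["status"] == "queued"
        and now - parse_github_time(run["created_at"]) > limit
    ]


def should_skip_runner_check(status_code, runners):
    """Best-effort runner list: skip on 403 or an empty list instead of alarming."""
    return status_code == 403 or not runners


if __name__ == "__main__":
    now = datetime(2026, 6, 14, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"id": 1, "status": "queued", "created_at": "2026-06-14T11:00:00Z"},
        {"id": 2, "status": "queued", "created_at": "2026-06-14T11:45:00Z"},
        {"id": 3, "status": "in_progress", "created_at": "2026-06-14T09:00:00Z"},
    ]
    print([run["id"] for run in stale_queued_runs(sample, now)])  # prints [1]
    print(should_skip_runner_check(403, None))  # prints True
```

Keeping the queue-age check independent of the runner list is what lets the probe work with only actions: read: a 403 on the runner endpoint downgrades to a skip and never masks or fakes an alarm.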

On a problem it opens or updates a single runner-down-labelled tracking issue (idempotent — comment-updates the existing one, never duplicates); on recovery it comments and auto-closes. The run itself also goes red so there's a badge signal too.
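The open/update/close behaviour amounts to a small decision table. The sketch below is illustrative only: `plan_issue_action` and its return values are hypothetical names, and the "noop" branch (healthy pool, no open tracker) is an assumption, since the description does not spell that case out.

```python
from typing import Optional


def plan_issue_action(problem_detected: bool, open_issue: Optional[int]) -> str:
    """Decide what to do with the single runner-down tracking issue.

    open_issue is the number of the currently open tracker, or None if there is none.
    """
    if problem_detected:
        # Idempotent: reuse the existing tracker rather than opening a duplicate.
        return "open" if open_issue is None else "comment"
    # Recovery: comment and close the tracker if one exists.
    # Assumption: with no tracker open and no problem, there is nothing to do.
    return "noop" if open_issue is None else "comment_and_close"


if __name__ == "__main__":
    print(plan_issue_action(True, None))   # prints open
    print(plan_issue_action(True, 512))    # prints comment
    print(plan_issue_action(False, 512))   # prints comment_and_close
```

Because the action depends only on the current probe result and on whether a tracker is open, repeated 15-minute runs during an outage keep appending to one issue instead of creating a new one each time.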

Verification

  • actionlint clean (includes shellcheck on the run: blocks).
  • Injection-safe: dynamic content flows through env: vars; triggers are schedule/workflow_dispatch with no untrusted event payload.
  • Scheduled workflows only run from the default branch, so I'll smoke-test via workflow_dispatch once this is on main and record the result here.

Out of scope (separate PRs, per the issue)

  • Routing fast core gates (fmt/yaml-lint/validate/clippy) to ubuntu-latest — runner-policy + billing implications.
  • The operational runbook docs.

Refs #509, #436. Trace: skip (ci type, AGENTS.md exempt).

🤖 Generated with Claude Code

Every gating job runs on `[self-hosted, …]`, so when the pool goes offline every
gate queues forever with no fallback and no alarm — the multi-day outage in #509
was invisible until noticed by hand. This GitHub-hosted workflow (ubuntu-latest,
so it fires even when the pool is down) polls on a 15-min schedule + dispatch and
raises a durable tracking issue instead of a transient red badge.

Signals: (1) queued-run age > QUEUE_THRESHOLD_MINUTES (default 30) is the
authoritative alarm — needs only actions:read and is agnostic to repo-vs-org
runner registration; (2) the runner-list check is best-effort and self-skips,
since listing self-hosted runners needs the `administration` scope that
GITHUB_TOKEN cannot be granted. On a problem it opens or updates an idempotent
`runner-down`-labelled issue (one tracker, comment-updated); on recovery it
comments and auto-closes. Validated with actionlint (incl. shellcheck on the
run blocks). Smoke-test via workflow_dispatch after merge.

Out of scope (separate PRs): routing fast core gates to ubuntu-latest (runner
policy + billing); the operational runbook.

Trace: skip
Refs: #509, #436
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov Bot commented Jun 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
