ci: runner-liveness alert for the self-hosted pool (#509 slice 1) by avrabe · Pull Request #536 · pulseengine/rivet

avrabe · 2026-06-14T12:21:45Z

What

Every gating CI job runs on [self-hosted, …], so when the pool goes offline every gate queues forever with no fallback and no alarm — the multi-day outage in #509 was invisible until someone noticed by hand. This adds a GitHub-hosted (ubuntu-latest) liveness probe that keeps firing even when the self-hosted pool is down, and turns that silent failure into a durable tracking issue rather than a transient red badge.

How

.github/workflows/runner-liveness.yml — schedule: */15 * * * * + workflow_dispatch:

Queued-run age (authoritative). Fails if any run has been queued longer than QUEUE_THRESHOLD_MINUTES (default 30). Needs only actions: read, and works regardless of whether runners are registered at the repo or org level — it measures the actual symptom (jobs not getting picked up).
Runner list (best-effort). GET /repos/{repo}/actions/runners needs the administration scope, which is not grantable to GITHUB_TOKEN (actionlint flagged this), so the lookup self-skips on 403/empty instead of false-alarming. Wire a PAT into GH_TOKEN later if a hard runner count is wanted.
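The two signals above can be sketched as pure functions over already-fetched API data. This is a minimal illustration, not the workflow's actual `run:` script. It assumes queue age is measured from each run's `created_at` timestamp, and that a non-empty runner list with no `online` entry counts as a problem. The PR only states that the lookup self-skips on 403 or an empty list. The function names `stale_queued_runs` and `runner_list_verdict` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

QUEUE_THRESHOLD_MINUTES = 30  # default named in the PR description


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp of the form '2026-06-14T12:21:45Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_queued_runs(runs, now, threshold_minutes=QUEUE_THRESHOLD_MINUTES):
    """Signal 1 (authoritative): queued runs older than the threshold.

    Assumption: age is counted from the run's `created_at` field.
    """
    limit = timedelta(minutes=threshold_minutes)
    return [
        run
        for run in runs
        if run["status"] == "queued" and now - parse_ts(run["created_at"]) > limit
    ]


def runner_list_verdict(http_status, runners):
    """Signal 2 (best-effort): classify the runner-list lookup.

    'skip' when the lookup is unusable (non-200 such as 403, or an empty list),
    so a missing permission never raises a false alarm.
    Assumption: 'down' when runners are listed but none reports 'online'.
    """
    if http_status != 200 or not runners:
        return "skip"
    if any(runner.get("status") == "online" for runner in runners):
        return "ok"
    return "down"
```

Keeping the decision logic separate from the HTTP calls means the threshold and the skip behaviour can be checked without a token or a live pool.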

On a problem it opens or updates a single runner-down-labelled tracking issue (idempotent — comment-updates the existing one, never duplicates); on recovery it comments and auto-closes. The run itself also goes red so there's a badge signal too.
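The open/update/close behaviour described above amounts to a small decision table. The sketch below is one way to express it, assuming issues arrive as dictionaries with `number`, `state`, and `labels` (each label carrying a `name`). The helper names `find_tracker` and `plan_issue_action` and the action strings are hypothetical, not taken from the workflow.

```python
TRACKER_LABEL = "runner-down"  # label named in the PR description


def find_tracker(issues, label=TRACKER_LABEL):
    """Return the number of the first open issue carrying the label, else None."""
    for issue in issues:
        names = {entry["name"] for entry in issue.get("labels", [])}
        if issue.get("state") == "open" and label in names:
            return issue["number"]
    return None


def plan_issue_action(problem_detected, tracker_number):
    """Decide what to do with the single tracking issue.

    Idempotent: with an open tracker, a repeated problem only adds a comment,
    so consecutive failing runs never create duplicates.
    """
    if problem_detected:
        if tracker_number is None:
            return ("open", None)
        return ("comment", tracker_number)
    if tracker_number is not None:
        return ("comment_and_close", tracker_number)
    return ("noop", None)
```

For example, a failing probe with no open tracker yields `("open", None)`, the next failing probe yields `("comment", <number>)`, and the first healthy probe afterwards yields `("comment_and_close", <number>)`.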

Verification

actionlint clean (includes shellcheck on the run: blocks).
Injection-safe: dynamic content flows through env: vars; triggers are schedule/workflow_dispatch with no untrusted event payload.
Scheduled workflows only run from the default branch, so I'll smoke-test via workflow_dispatch once this is on main and record the result here.

Out of scope (separate PRs, per the issue)

Routing fast core gates (fmt/yaml-lint/validate/clippy) to ubuntu-latest — runner-policy + billing implications.
The operational runbook docs.

Refs #509, #436. Trace: skip (ci type, AGENTS.md exempt).

🤖 Generated with Claude Code

Every gating job runs on `[self-hosted, …]`, so when the pool goes offline every gate queues forever with no fallback and no alarm; the multi-day outage in #509 was invisible until noticed by hand. This GitHub-hosted workflow (ubuntu-latest, so it fires even when the pool is down) polls on a 15-min schedule + dispatch and raises a durable tracking issue instead of a transient red badge.

Signals:
(1) queued-run age > QUEUE_THRESHOLD_MINUTES (default 30) is the authoritative alarm; it needs only actions:read and is agnostic to repo-vs-org runner registration;
(2) the runner-list check is best-effort and self-skips, since listing self-hosted runners needs the `administration` scope that GITHUB_TOKEN cannot be granted.

On a problem it opens or updates an idempotent `runner-down`-labelled issue (one tracker, comment-updated); on recovery it comments and auto-closes.

Validated with actionlint (incl. shellcheck on the run blocks). Smoke-test via workflow_dispatch after merge.

Out of scope (separate PRs): routing fast core gates to ubuntu-latest (runner policy + billing); the operational runbook.

Trace: skip
Refs: #509, #436
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-06-14T12:50:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


avrabe wants to merge 1 commit into main from ci/509-runner-liveness-alert
