ci: runner-liveness alert for the self-hosted pool (#509 slice 1) #536
Open
avrabe wants to merge 1 commit into
Every gating job runs on `[self-hosted, …]`, so when the pool goes offline every gate queues forever with no fallback and no alarm; the multi-day outage in #509 was invisible until noticed by hand. This GitHub-hosted workflow (`ubuntu-latest`, so it fires even when the pool is down) polls on a 15-min schedule + dispatch and raises a durable tracking issue instead of a transient red badge.

Signals:
(1) Queued-run age > `QUEUE_THRESHOLD_MINUTES` (default 30) is the authoritative alarm. It needs only `actions:read` and is agnostic to repo-vs-org runner registration.
(2) The runner-list check is best-effort and self-skips, since listing self-hosted runners needs the `administration` scope that `GITHUB_TOKEN` cannot be granted.

On a problem it opens or updates an idempotent `runner-down`-labelled issue (one tracker, comment-updated); on recovery it comments and auto-closes.

Validated with actionlint (incl. shellcheck on the run blocks). Smoke-test via `workflow_dispatch` after merge.

Out of scope (separate PRs): routing fast core gates to `ubuntu-latest` (runner policy + billing); the operational runbook.

Trace: skip
Refs: #509, #436
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
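For reviewers, the queue-age signal can be sketched outside the workflow as a small pure function. This is an illustration only, not the code in `runner-liveness.yml`: the function name `stale_queued_runs` is hypothetical, and it assumes each run is a dict carrying the `id`, `status`, and `created_at` fields found in the REST "list workflow runs" response, with `created_at` taken as the moment the run entered the queue.

```python
from datetime import datetime, timedelta, timezone

# Default named in the commit message; the real workflow may read it from env.
QUEUE_THRESHOLD_MINUTES = 30


def stale_queued_runs(runs, now, threshold_minutes=QUEUE_THRESHOLD_MINUTES):
    """Return ids of runs that have been queued longer than the threshold.

    `runs` is assumed to be a list of dicts with "id", "status" and
    "created_at" keys (hypothetical input shape for this sketch).
    """
    limit = timedelta(minutes=threshold_minutes)
    stale = []
    for run in runs:
        if run["status"] != "queued":
            continue  # only runs still waiting for a runner count
        # GitHub timestamps are ISO 8601 in UTC with a trailing "Z".
        created = datetime.strptime(
            run["created_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > limit:
            stale.append(run["id"])
    return stale


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"id": 1, "status": "queued", "created_at": "2024-01-01T11:00:00Z"},
    {"id": 2, "status": "queued", "created_at": "2024-01-01T11:45:00Z"},
    {"id": 3, "status": "completed", "created_at": "2024-01-01T09:00:00Z"},
]
print(stale_queued_runs(runs, now))  # [1]
```

Measuring the age of queued runs, rather than counting runners, is what lets the check work with only read access to Actions data: it observes the symptom and does not need to know where the runners are registered.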
Codecov Report: ✅ All modified and coverable lines are covered by tests.
What
Every gating CI job runs on `[self-hosted, …]`, so when the pool goes offline every gate queues forever with no fallback and no alarm; the multi-day outage in #509 was invisible until someone noticed by hand. This adds a GitHub-hosted (`ubuntu-latest`) liveness probe that keeps firing even when the self-hosted pool is down, and turns that silent failure into a durable tracking issue rather than a transient red badge.

How
`.github/workflows/runner-liveness.yml`, triggered by `schedule: */15 * * * *` + `workflow_dispatch`:
- Queue age (authoritative): alarms when a run has been `queued` longer than `QUEUE_THRESHOLD_MINUTES` (default 30). Needs only `actions: read`, and works regardless of whether runners are registered at the repo or org level, because it measures the actual symptom (jobs not getting picked up).
- Runner list (best-effort): `GET /repos/{repo}/actions/runners` needs the `administration` scope, which is not grantable to `GITHUB_TOKEN` (actionlint flagged this), so the lookup self-skips on 403/empty instead of false-alarming. Wire a PAT into `GH_TOKEN` later if a hard runner count is wanted.

On a problem it opens or updates a single `runner-down`-labelled tracking issue (idempotent: it comment-updates the existing one and never duplicates); on recovery it comments and auto-closes. The run itself also goes red, so there's a badge signal too.

Verification
- `actionlint` clean (includes shellcheck on the `run:` blocks).
- Values are passed via `env:` vars; triggers are `schedule`/`workflow_dispatch` with no untrusted event payload.
- Smoke-test via `workflow_dispatch` once this is on `main` and record the result here.

Out of scope (separate PRs, per the issue)
- Routing fast core gates (`fmt`/`yaml-lint`/`validate`/`clippy`) to `ubuntu-latest`: runner-policy + billing implications.
- The operational runbook.

Refs #509, #436.
`Trace: skip` (ci type, AGENTS.md exempt).

🤖 Generated with Claude Code
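The open/update/close behaviour of the tracking issue described above amounts to a four-way decision on two inputs. A minimal sketch follows, with the caveat that `plan_issue_action` and its action names are hypothetical labels for this illustration and not identifiers from the workflow:

```python
def plan_issue_action(problem_detected, open_issue_number):
    """Decide what to do with the single `runner-down` tracking issue.

    `open_issue_number` is the number of the currently open tracker, or
    None when no tracker is open. Returns an (action, issue_number) tuple.
    """
    if problem_detected:
        if open_issue_number is None:
            return ("open", None)  # first detection: create the tracker
        # A tracker already exists: add a comment, never open a duplicate.
        return ("comment", open_issue_number)
    if open_issue_number is not None:
        # Recovered while a tracker is open: comment, then close it.
        return ("comment_and_close", open_issue_number)
    return ("noop", None)  # healthy and nothing to clean up
```

Because the decision depends only on whether a problem is present and whether a tracker is already open, repeated scheduled runs during one outage keep landing on the same issue, which is the idempotence the description calls out.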