Add optional Healthchecks.io dead-man's switch (#79) #133
Open
VijitSingh97 wants to merge 2 commits into
Adds an opt-in external liveness monitor (default off). When enabled, the dashboard's existing data-collection loop pings a unique Healthchecks.io URL each cycle; if the host dies (power loss, kernel panic, NIC death) the pings stop and Healthchecks.io alerts the operator on the *absence* of a ping. That is the one failure mode an in-stack notifier (#45) structurally can't report from a dead machine.

- New HealthchecksClient (service/healthchecks.py): throttled, fails silently when offline/Tor-only (#59 discipline), resolves full ping URLs or a bare uuid + base_url for self-hosted instances.
- Health-aware: sends /fail while a required node is down (monerod always; Tari when TARI_REQUIRED), matching #31's worker-rejection predicate. Toggle via signal_fail_on_node_down (default true).
- Wired into the data loop, gated on `enabled` so a disabled stack never even spawns the worker thread.
- config.json `healthchecks.*` plumbed through pithead render_env + .env + docker-compose; the ping URL is treated as a secret (owner-only .env, never echoed by apply, never logged).
- Manual setup only; Management-API auto-provisioning (model B) intentionally out of scope (no API key stored).
- Docs: new docs/monitoring.md, configuration reference rows, docs index, and CHANGELOG entry.
- Tests: unit tests for the client (URL resolution, throttle, fail signal, silent-offline) + data-loop integration tests + pithead render_env/apply-diff shell tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md
Summary
Implements #79: an optional Healthchecks.io dead-man's switch, behind a `config.json` flag that is off by default.

When enabled, the dashboard's existing data-collection loop pings a unique Healthchecks.io URL each cycle. The value is in what stops happening: if the whole host dies (power loss, kernel panic, NIC death, hard hang), the dashboard dies with it, the pings stop, and Healthchecks.io fires an alert on the absence of a ping. That check is evaluated on their servers, so it survives the very outage you want to catch. This is the one failure mode an in-stack notifier (#45) structurally can't report from a dead machine.
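As a rough illustration of the mechanism described above, here is a minimal sketch of a throttled, fail-silent pinger. It is not the PR's `HealthchecksClient`: the class name `DeadMansSwitch`, the `min_interval` and `timeout` parameters, and the injectable `opener` and `clock` arguments are all hypothetical, chosen so the sketch can be exercised without network access. The only external convention it relies on is that Healthchecks.io accepts a request to the ping URL as a success signal and to the same URL with `/fail` appended as a failure signal.

```python
import time
import urllib.request


class DeadMansSwitch:
    """Throttled, fail-silent pinger (hypothetical sketch, not the PR's client)."""

    def __init__(self, ping_url, min_interval=60.0, timeout=10.0,
                 opener=urllib.request.urlopen, clock=time.monotonic):
        self._ping_url = ping_url.rstrip("/")
        self._min_interval = min_interval  # assumed throttle window, in seconds
        self._timeout = timeout
        self._opener = opener              # injectable so tests need no network
        self._clock = clock
        self._last = None                  # time of the last successful ping

    def ping(self, fail=False):
        """Send one ping; return True only if a request was sent and succeeded."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return False  # throttled: a ping already succeeded recently
        url = self._ping_url + ("/fail" if fail else "")
        try:
            with self._opener(url, timeout=self._timeout):
                pass
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses. Stay
            # silent when offline and let the next loop cycle retry.
            return False
        self._last = now
        return True
```

Called once per loop cycle, this sends at most one request per `min_interval`. The alerting itself is not in this code at all: it happens on the monitoring side when the requests stop arriving.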
How it integrates
- Runs inside the `DataService` loop in `service/data_service.py` (every `UPDATE_INTERVAL`), so there's no extra container or daemon: it reuses the running loop and the node-health state already computed there.
- Sends `/fail` while a required node is down (monerod always, Tari only when `dashboard.tari_required`), reusing the exact predicate behind #31's worker rejection. Toggle with `signal_fail_on_node_down` (default `true`); set `false` for plain liveness.
- Accepts a full ping URL or a bare UUID, with a `base_url` override for self-hosted instances.
- Treats the ping URL as a secret (owner-only `.env`, never echoed by `apply`, never logged).

Config (`config.json`, default off)

Plumbed through `pithead` → `.env` → `docker-compose.yml` → `config.py`, matching the existing config flow.

Acceptance criteria
- `healthchecks.enabled=false` by default; when off, nothing pings and no errors.
- With a `ping_url` set, the stack pings on schedule; killing the stack/host stops the pings (→ alert after the configured period + grace).
- Sends `/fail` on a required-node-down signal.
- Self-hosted `base_url` supported.
- Documented (`docs/monitoring.md`).

Documentation

New `docs/monitoring.md` (account → check → period/grace → copy URL → `config.json` → `apply`, plus routing alerts to Telegram/email, a self-hosting note, a clearnet privacy note, and an optional host-level systemd-timer alternative). Also adds the keys to the configuration reference, links it from the docs index, and adds a CHANGELOG entry.

Testing
- Unit tests (`tests/service/test_healthchecks.py`): URL resolution (full/bare/self-hosted), throttle, `/fail` vs liveness, disabled no-op, misconfig-warns-once, silent-offline + retry. New module at 100% coverage.
- Integration tests (`tests/service/test_data_service.py`): the loop pings with `fail=False` when healthy and `fail=True` when a required node is down, and never calls the client when disabled.
- Shell tests (`tests/stack/run.sh`): `render_env` propagation (defaults + enabled) and the `apply` preview message, including that the ping URL is not leaked into the diff.

All green locally: 426 dashboard tests, 100 `pithead` shell tests, compose validation, and `shellcheck` clean.

Out of scope / future

- Management-API auto-provisioning (model B) is intentionally out of scope: setup is manual and no API key is stored.
Closes #79.
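For reviewers, the URL resolution and the health-aware `/fail` decision described under "How it integrates" can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in `service/healthchecks.py`: the function names `resolve_ping_url` and `should_signal_fail` and their parameters are hypothetical, and the sketch assumes `base_url` is the full ping endpoint prefix (for a self-hosted instance that usually ends in `/ping`).

```python
from urllib.parse import urlparse

# Hosted Healthchecks.io ping endpoint; a self-hosted base_url is an assumption
# supplied by the operator.
DEFAULT_BASE_URL = "https://hc-ping.com"


def resolve_ping_url(ping_url, base_url=DEFAULT_BASE_URL):
    """Return a full ping URL from a full URL or a bare UUID; None if unset."""
    value = (ping_url or "").strip()
    if not value:
        return None  # misconfigured or disabled: nothing to ping
    if urlparse(value).scheme in ("http", "https"):
        return value.rstrip("/")  # already a full URL, use as given
    return f"{base_url.rstrip('/')}/{value}"  # bare UUID joined to the base


def should_signal_fail(monerod_up, tari_up, tari_required,
                       signal_fail_on_node_down=True):
    """Decide between a /fail ping (True) and a plain liveness ping (False)."""
    if not signal_fail_on_node_down:
        return False  # plain liveness mode: never send /fail
    # monerod is always required; Tari only counts when it is marked required.
    return bool((not monerod_up) or (tari_required and not tari_up))
```

Keeping both decisions free of I/O is one way to make the cases listed under Testing (full, bare, and self-hosted URLs; `/fail` versus liveness) cheap to cover without mocking the network.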
🤖 Generated with Claude Code