Add optional Healthchecks.io dead-man's switch (#79) by VijitSingh97 · Pull Request #133 · p2pool-starter-stack/pithead

VijitSingh97 · 2026-06-04T07:00:43Z

Summary

Implements #79 — an optional Healthchecks.io dead-man's switch, behind a config.json flag that is off by default.

When enabled, the dashboard's existing data-collection loop pings a unique Healthchecks.io URL each cycle. The value is in what stops happening: if the whole host dies (power loss, kernel panic, NIC death, hard hang), the dashboard dies with it, the pings stop, and Healthchecks.io fires an alert on the absence of a ping — evaluated on their servers, so it survives the very outage you want to catch. This is the one failure mode an in-stack notifier (#45) structurally can't report from a dead machine.

How it integrates

Ping source: the existing DataService loop in service/data_service.py (every UPDATE_INTERVAL), so there's no extra container or daemon — it reuses the running loop and the node-health state already computed there.
Health-aware (optional): sends /fail while a required node is down (monerod always, Tari only when dashboard.tari_required), reusing the exact predicate behind #31's worker rejection. Toggle with signal_fail_on_node_down (default true); set false for plain liveness.
Fails silently: a ping that can't get out (offline, or Tor-only without clearnet) is logged at debug only; Healthchecks.io alerts on the missed ping regardless. Consistent with #59's offline-check discipline.
Self-hosted: paste a full self-hosted ping URL, or a bare uuid plus a base_url override.
Manual setup only (MVP): the operator pastes the ping URL. Management-API auto-provisioning (model B) is intentionally out of scope — it would mean storing a powerful API key. The ping URL itself is treated as a secret (owner-only .env, never echoed by apply, never logged).
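
The throttle, /fail signal, and fail-silent behaviour listed above can be sketched roughly as follows. This is an illustrative sketch, not the PR's HealthchecksClient: the class name `PingSketch`, the `tick` method, the injectable `send` and `clock` parameters, and the choice to advance the throttle only after a successful ping (so a failed ping is retried on the next cycle) are all assumptions made for the example.

```python
import logging
import time
from typing import Callable, Optional
from urllib.request import urlopen

log = logging.getLogger("healthchecks_sketch")


def required_node_down(monerod_up: bool, tari_up: bool, tari_required: bool) -> bool:
    """monerod is always required; Tari only counts when it is marked required."""
    return (not monerod_up) or (tari_required and not tari_up)


def _http_get(url: str) -> None:
    # Plain GET with a short timeout; the response body is ignored.
    with urlopen(url, timeout=10):
        pass


class PingSketch:
    """Hypothetical throttled, fail-silent pinger (names are illustrative)."""

    def __init__(
        self,
        url: str,
        interval_seconds: float = 60,
        signal_fail: bool = True,
        send: Callable[[str], None] = _http_get,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._url = url.rstrip("/")
        self._interval = interval_seconds
        self._signal_fail = signal_fail
        self._send = send
        self._clock = clock
        self._last_ok: Optional[float] = None

    def tick(self, node_down: bool) -> Optional[str]:
        """Call once per loop cycle. Returns the URL attempted, or None if throttled."""
        now = self._clock()
        if self._last_ok is not None and now - self._last_ok < self._interval:
            return None  # throttled: the last successful ping is recent enough
        target = self._url + "/fail" if (self._signal_fail and node_down) else self._url
        try:
            self._send(target)
        except Exception as exc:  # offline, Tor-only, DNS failure, ...
            # Debug level only, and only the exception type: the URL is a secret.
            log.debug("healthcheck ping failed: %s", type(exc).__name__)
            return target
        self._last_ok = now  # assumption: only a successful ping advances the throttle
        return target
```

In a loop like the one described above, each cycle would call something like `pinger.tick(required_node_down(monerod_up, tari_up, tari_required))`. Passing `signal_fail=False` reduces the sketch to plain liveness pings.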

Config (`config.json`, default off)

"healthchecks": {
    "enabled": false,
    "ping_url": "",
    "base_url": "https://hc-ping.com",
    "interval_seconds": 60,
    "signal_fail_on_node_down": true
}
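
As a rough illustration of how `ping_url` and `base_url` might combine, the sketch below accepts either a full ping URL or a bare uuid. The function names `resolve_ping_url` and `fail_url` are hypothetical and not taken from the PR; the only external fact relied on is that Healthchecks.io signals failure by appending `/fail` to the ping URL.

```python
from typing import Optional
from urllib.parse import urlparse

DEFAULT_BASE_URL = "https://hc-ping.com"


def resolve_ping_url(ping_url: str, base_url: str = DEFAULT_BASE_URL) -> Optional[str]:
    """Return a full ping URL, or None when nothing usable is configured."""
    value = ping_url.strip()
    if not value:
        return None  # misconfigured: enabled but no URL or uuid supplied
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # A full URL (hosted or self-hosted) is used as-is; base_url is ignored.
        return value.rstrip("/")
    # Otherwise treat the value as a bare uuid and join it to base_url.
    return f"{base_url.rstrip('/')}/{value.strip('/')}"


def fail_url(resolved: str) -> str:
    """Failure-signal endpoint for an already-resolved ping URL."""
    return resolved + "/fail"
```

With these assumptions, a bare uuid plus a self-hosted `base_url` and a pasted full self-hosted URL resolve to the same shape of result.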

Plumbed through pithead → .env → docker-compose.yml → config.py, matching the existing config flow.
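
The last hop of that flow, reading the rendered environment into typed settings, might look something like the sketch below. The environment variable names (`HEALTHCHECKS_ENABLED` and so on), the `HealthchecksSettings` dataclass, and `load_settings` are all hypothetical; the PR does not list the actual names used in config.py. The defaults mirror the config.json block above.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class HealthchecksSettings:
    enabled: bool = False
    # repr=False keeps the secret ping URL out of logs and tracebacks.
    ping_url: str = field(default="", repr=False)
    base_url: str = "https://hc-ping.com"
    interval_seconds: int = 60
    signal_fail_on_node_down: bool = True


def _as_bool(raw: str, default: bool) -> bool:
    text = raw.strip().lower()
    if not text:
        return default  # unset or empty falls back to the default
    return text in ("1", "true", "yes", "on")


def load_settings(env: Mapping[str, str] = os.environ) -> HealthchecksSettings:
    """Build settings from environment variables (names are assumptions)."""
    d = HealthchecksSettings()
    return HealthchecksSettings(
        enabled=_as_bool(env.get("HEALTHCHECKS_ENABLED", ""), d.enabled),
        ping_url=env.get("HEALTHCHECKS_PING_URL", "").strip(),
        base_url=env.get("HEALTHCHECKS_BASE_URL", "").strip() or d.base_url,
        interval_seconds=int(env.get("HEALTHCHECKS_INTERVAL_SECONDS", "") or d.interval_seconds),
        signal_fail_on_node_down=_as_bool(
            env.get("HEALTHCHECKS_SIGNAL_FAIL_ON_NODE_DOWN", ""),
            d.signal_fail_on_node_down,
        ),
    )
```

Accepting any mapping rather than reading `os.environ` directly keeps the sketch easy to exercise with a plain dict.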

Acceptance criteria

healthchecks.enabled=false by default; when off, nothing pings and no errors.
When enabled with a ping_url, the stack pings on schedule; killing the stack/host stops the pings (→ alert after the configured period + grace).
Optional /fail on a required-node-down signal.
Self-hosted base_url supported.
Fails silently / no noise when offline or Tor-only.
Documented setup (docs/monitoring.md).
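
For intuition about "period + grace" in the criteria above, here is a simplified model of the server-side evaluation. It is not Healthchecks.io's code, and the status labels and the `check_status` function are assumptions for illustration: a check counts as late once the period has elapsed without a ping, and as down (the point at which an alert goes out) once the grace time has also elapsed.

```python
from datetime import datetime, timedelta, timezone


def check_status(
    last_ping: datetime,
    period: timedelta,
    grace: timedelta,
    now: datetime,
) -> str:
    """Simplified dead-man's-switch evaluation, done on the monitoring side."""
    if now <= last_ping + period:
        return "up"  # a ping arrived within the expected period
    if now <= last_ping + period + grace:
        return "late"  # overdue, but still inside the grace window
    return "down"  # period + grace exceeded: this is when the alert fires


if __name__ == "__main__":
    last = datetime(2026, 6, 4, 7, 0, tzinfo=timezone.utc)
    for minutes in (3, 6, 8):
        status = check_status(
            last,
            period=timedelta(minutes=5),
            grace=timedelta(minutes=2),
            now=last + timedelta(minutes=minutes),
        )
        print(minutes, status)
```

Because this evaluation runs on the monitoring service rather than on the stack host, it keeps working when the host itself is gone.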

Documentation

New docs/monitoring.md (account → check → period/grace → copy URL → config.json → apply, plus routing alerts to Telegram/email, a self-hosting note, a clearnet privacy note, and an optional host-level systemd-timer alternative). Also adds the keys to the configuration reference, links it from the docs index, and adds a CHANGELOG entry.

Testing

Unit (tests/service/test_healthchecks.py): URL resolution (full/bare/self-hosted), throttle, /fail vs liveness, disabled no-op, misconfig-warns-once, silent-offline + retry. New module at 100% coverage.
Integration (tests/service/test_data_service.py): the loop pings fail=False when healthy, fail=True when a required node is down, and is never called when disabled.
Shell (tests/stack/run.sh): render_env propagation (defaults + enabled) and the apply preview message — including that the ping URL is not leaked into the diff.

All green locally: 426 dashboard tests, 100 pithead shell tests, compose validation, and shellcheck clean.

Out of scope / future

Model B auto-provisioning via the Management API (#77 appliance wizard).
A dashboard UI indicator for healthcheck status.

Closes #79.

🤖 Generated with Claude Code

Adds an opt-in external liveness monitor (default off). When enabled, the dashboard's existing data-collection loop pings a unique Healthchecks.io URL each cycle; if the host dies (power loss, kernel panic, NIC death) the pings stop and Healthchecks.io alerts the operator on the *absence* of a ping, the one failure mode an in-stack notifier (#45) structurally can't report from a dead machine.

- New HealthchecksClient (service/healthchecks.py): throttled, fails silently when offline/Tor-only (#59 discipline), resolves full ping URLs or a bare uuid + base_url for self-hosted instances.
- Health-aware: sends /fail while a required node is down (monerod always; Tari when TARI_REQUIRED), matching #31's worker-rejection predicate. Toggle via signal_fail_on_node_down (default true).
- Wired into the data loop, gated on `enabled` so a disabled stack never even spawns the worker thread.
- config.json `healthchecks.*` plumbed through pithead render_env + .env + docker-compose; the ping URL is treated as a secret (owner-only .env, never echoed by apply, never logged).
- Manual setup only; Management-API auto-provisioning (model B) intentionally out of scope (no API key stored).
- Docs: new docs/monitoring.md, configuration reference rows, docs index, and CHANGELOG entry.
- Tests: unit tests for the client (URL resolution, throttle, fail signal, silent-offline) + data-loop integration tests + pithead render_env/apply-diff shell tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

VijitSingh97 and others added 2 commits June 4, 2026 02:00

Merge remote-tracking branch 'origin/main' into pr-133

d5cb210

# Conflicts: # CHANGELOG.md

VijitSingh97 wants to merge 2 commits into main from claude/serene-faraday-eb0414
