Add optional Healthchecks.io dead-man's switch (#79) #133

Open
VijitSingh97 wants to merge 2 commits into main from claude/serene-faraday-eb0414
@VijitSingh97
Collaborator

Summary

Implements #79 — an optional Healthchecks.io dead-man's switch, behind a config.json flag that is off by default.

When enabled, the dashboard's existing data-collection loop pings a unique Healthchecks.io URL each cycle. The value is in what stops happening: if the whole host dies (power loss, kernel panic, NIC death, hard hang), the dashboard dies with it, the pings stop, and Healthchecks.io fires an alert on the absence of a ping — evaluated on their servers, so it survives the very outage you want to catch. This is the one failure mode an in-stack notifier (#45) structurally can't report from a dead machine.

How it integrates

  • Ping source: the existing DataService loop in service/data_service.py (every UPDATE_INTERVAL), so there's no extra container or daemon — it reuses the running loop and the node-health state already computed there.
  • Health-aware (optional): sends /fail while a required node is down (monerod always, Tari only when dashboard.tari_required), reusing the exact predicate behind #31's worker rejection. Toggle with signal_fail_on_node_down (default true); set false for plain liveness.
  • Fails silently: a ping that can't get out (offline, or Tor-only without clearnet) is logged at debug only; Healthchecks.io alerts on the missed ping regardless. Consistent with #59's offline-check discipline.
  • Self-hosted: paste a full self-hosted ping URL, or a bare uuid plus a base_url override.
  • Manual setup only (MVP): the operator pastes the ping URL. Management-API auto-provisioning (model B) is intentionally out of scope — it would mean storing a powerful API key. The ping URL itself is treated as a secret (owner-only .env, never echoed by apply, never logged).
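
As a rough illustration of the behaviour listed above (not the code in service/healthchecks.py), the URL resolution, the required-node predicate, the /fail suffix, and the silent failure could look roughly like this. Every function name here is hypothetical, and the timeout value is an assumption; only the config keys and the /fail endpoint come from the description.

```python
import http.client
import logging
import urllib.request
from typing import Optional

log = logging.getLogger("healthchecks_sketch")  # hypothetical logger name


def resolve_ping_url(ping_url: str, base_url: str = "https://hc-ping.com") -> Optional[str]:
    """Accept a full ping URL, or a bare uuid joined onto base_url."""
    value = ping_url.strip()
    if not value:
        return None  # nothing configured
    if value.startswith(("http://", "https://")):
        return value.rstrip("/")
    return f"{base_url.rstrip('/')}/{value}"


def required_node_down(monerod_up: bool, tari_up: bool, tari_required: bool) -> bool:
    """monerod always counts; Tari only counts when it is required."""
    return (not monerod_up) or (tari_required and not tari_up)


def ping_target(url: str, node_down: bool, signal_fail: bool = True) -> str:
    """Append /fail only when a required node is down and signalling is on."""
    return f"{url}/fail" if (node_down and signal_fail) else url


def send_ping(target: str, timeout: float = 10.0) -> bool:
    """Best-effort GET. Returns False instead of raising when it cannot get out."""
    try:
        with urllib.request.urlopen(target, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # Debug only, and the URL itself is a secret, so it is not logged.
        log.debug("healthchecks ping did not go out")
        return False
```

With signal_fail set to False the target is always the plain ping URL, which corresponds to the "plain liveness" mode described above.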

Config (config.json, default off)

"healthchecks": {
    "enabled": false,
    "ping_url": "",
    "base_url": "https://hc-ping.com",
    "interval_seconds": 60,
    "signal_fail_on_node_down": true
}

Plumbed through pithead → .env → docker-compose.yml → config.py, matching the existing config flow.
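
For readers unfamiliar with this flow, the last hop (environment variables into a typed config object) might be sketched as below. This is an assumption-laden illustration: the HEALTHCHECKS_* variable names, the HealthchecksConfig dataclass, and the fallback behaviour on a bad interval are all hypothetical and are not taken from config.py. Only the five keys and their defaults come from the config block above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HealthchecksConfig:
    # Defaults mirror the config.json block in this description.
    enabled: bool = False
    ping_url: str = ""
    base_url: str = "https://hc-ping.com"
    interval_seconds: int = 60
    signal_fail_on_node_down: bool = True


def _as_bool(raw: str, default: bool) -> bool:
    """Treat an empty or missing value as 'use the default'."""
    raw = raw.strip().lower()
    if not raw:
        return default
    return raw in ("1", "true", "yes", "on")


def load_healthchecks_config(env: Mapping[str, str] = os.environ) -> HealthchecksConfig:
    """Build the config from environment variables (names are hypothetical)."""
    defaults = HealthchecksConfig()
    try:
        interval = int(env.get("HEALTHCHECKS_INTERVAL_SECONDS", defaults.interval_seconds))
    except ValueError:
        interval = defaults.interval_seconds  # assumed fallback for a malformed value
    return HealthchecksConfig(
        enabled=_as_bool(env.get("HEALTHCHECKS_ENABLED", ""), defaults.enabled),
        ping_url=env.get("HEALTHCHECKS_PING_URL", defaults.ping_url),
        base_url=env.get("HEALTHCHECKS_BASE_URL", defaults.base_url),
        interval_seconds=interval,
        signal_fail_on_node_down=_as_bool(
            env.get("HEALTHCHECKS_SIGNAL_FAIL_ON_NODE_DOWN", ""),
            defaults.signal_fail_on_node_down,
        ),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.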

Acceptance criteria

  • healthchecks.enabled=false by default; when off, nothing pings and no errors.
  • When enabled with a ping_url, the stack pings on schedule; killing the stack/host stops the pings (→ alert after the configured period + grace).
  • Optional /fail on a required-node-down signal.
  • Self-hosted base_url supported.
  • Fails silently / no noise when offline or Tor-only.
  • Documented setup (docs/monitoring.md).

Documentation

New docs/monitoring.md (account → check → period/grace → copy URL → config.json → apply, plus routing alerts to Telegram/email, a self-hosting note, a clearnet privacy note, and an optional host-level systemd-timer alternative). Also adds the keys to the configuration reference, links it from the docs index, and adds a CHANGELOG entry.

Testing

  • Unit (tests/service/test_healthchecks.py): URL resolution (full/bare/self-hosted), throttle, /fail vs liveness, disabled no-op, misconfig-warns-once, silent-offline + retry. New module at 100% coverage.
  • Integration (tests/service/test_data_service.py): the loop pings fail=False when healthy, fail=True when a required node is down, and is never called when disabled.
  • Shell (tests/stack/run.sh): render_env propagation (defaults + enabled) and the apply preview message — including that the ping URL is not leaked into the diff.
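
To make the "throttle" case concrete, one way to test it deterministically is to inject the clock. The sketch below is hypothetical (the PingThrottle class and its due() method are invented for illustration and may not resemble the actual client or its tests); it only shows the general technique of throttling against interval_seconds with a fake clock.

```python
import time
from typing import Callable, Optional


class PingThrottle:
    """Allow at most one ping per interval, measured on an injectable clock."""

    def __init__(self, interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval = interval_seconds
        self._clock = clock
        self._last: Optional[float] = None  # time of the last allowed ping

    def due(self) -> bool:
        """Return True and record the time if a ping may go out now."""
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            return False
        self._last = now
        return True


# Usage with a fake clock, so the test never sleeps.
fake_now = [0.0]
throttle = PingThrottle(60, clock=lambda: fake_now[0])
first = throttle.due()       # first call is always allowed
fake_now[0] = 30.0
too_soon = throttle.due()    # only 30 s elapsed, so this is suppressed
fake_now[0] = 60.0
on_time = throttle.due()     # a full interval has passed
```

Using time.monotonic as the default avoids surprises from wall-clock adjustments, while the injected clock keeps the unit test fast.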

All green locally: 426 dashboard tests, 100 pithead shell tests, compose validation, and shellcheck clean.

Out of scope / future

Closes #79.

🤖 Generated with Claude Code

VijitSingh97 and others added 2 commits June 4, 2026 02:00
Adds an opt-in external liveness monitor (default off). When enabled, the
dashboard's existing data-collection loop pings a unique Healthchecks.io URL
each cycle; if the host dies (power loss, kernel panic, NIC death) the pings
stop and Healthchecks.io alerts the operator on the *absence* of a ping — the
one failure mode an in-stack notifier (#45) structurally can't report from a
dead machine.

- New HealthchecksClient (service/healthchecks.py): throttled, fails silently
  when offline/Tor-only (#59 discipline), resolves full ping URLs or a bare
  uuid + base_url for self-hosted instances.
- Health-aware: sends /fail while a required node is down (monerod always; Tari
  when TARI_REQUIRED), matching #31's worker-rejection predicate. Toggle via
  signal_fail_on_node_down (default true).
- Wired into the data loop, gated on `enabled` so a disabled stack never even
  spawns the worker thread.
- config.json `healthchecks.*` plumbed through pithead render_env + .env +
  docker-compose; the ping URL is treated as a secret (owner-only .env, never
  echoed by apply, never logged).
- Manual setup only; Management-API auto-provisioning (model B) intentionally
  out of scope (no API key stored).
- Docs: new docs/monitoring.md, configuration reference rows, docs index, and
  CHANGELOG entry.
- Tests: unit tests for the client (URL resolution, throttle, fail signal,
  silent-offline) + data-loop integration tests + pithead render_env/apply-diff
  shell tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Healthchecks.io dead-man's-switch: detect power loss / host-down via external ping (config.json, default off)