pithead robustness: backup degrades gracefully on du/df errors; doctor exits non-zero on FAIL (#127)#148

Merged
VijitSingh97 merged 2 commits into main from claude/pithead-robustness on Jun 4, 2026

@VijitSingh97
Collaborator

Two small, independent pithead error-handling fixes from the v1.0 sweep.

backup aborted on a non-fatal du/df error

The disk-space pre-check assigned need_kb/avail_kb as bare statements:

need_kb=$(sudo du -sck "${items[@]}" 2>/dev/null | awk 'END{print $1}')
avail_kb=$(df -Pk "$backups_dir" 2>/dev/null | awk 'NR==2{print $4}')

Under set -Eeuo pipefail + trap on_err ERR, a non-zero exit from du (a permission-denied subdir, a file vanishing mid-walk, an NFS hiccup) tripped errexit, even though 2>/dev/null hides the error and a total is still printed. That aborted the whole backup with the generic "aborted unexpectedly" message and left the carefully written "proceeding without a space check" fallback unreachable. Adding || true to both assignments lets the pre-check degrade as designed.
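The failure mode and the fix can be reproduced in isolation. The following is a minimal sketch, not the actual pithead code: `fake_du` is a hypothetical stand-in for a `du` that prints its total but exits non-zero, and the numeric check before the fallback message is an assumption about how the fallback is reached.

```shell
#!/usr/bin/env bash
# Sketch only: fake_du is a hypothetical stand-in for a du run that hits an
# unreadable subdirectory; it still prints a total but exits non-zero.
set -Eeuo pipefail

fake_du() {
  printf '1234\ttotal\n'   # the total is still printed...
  return 1                 # ...but the exit status is non-zero
}

# With pipefail, the pipeline inherits fake_du's failure. As a bare
# assignment this line would trip errexit; `|| true` keeps the script alive.
need_kb=$(fake_du 2>/dev/null | awk 'END{print $1}') || true

# Assumed fallback shape: only use the value if it is a plain number.
if [[ "$need_kb" =~ ^[0-9]+$ ]]; then
  echo "need_kb=$need_kb"
else
  echo "proceeding without a space check"
fi
```

Removing `|| true` from the assignment should make the script stop at that line before either `echo` runs, which is the behaviour this PR fixes.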

doctor always exited 0, even on critical FAIL

doctor tallied DR_FAIL for hard failures (missing jq/openssl/docker, unreachable daemon) but unconditionally returned 0, and the dispatch was doctor) doctor ;; with no propagation. As a result, pithead doctor; echo $? always printed 0, which made it useless as a cron/CI/monitoring health gate. It now returns 1 when DR_FAIL>0 and is dispatched via doctor || exit 1 (mirroring status). Warnings alone still exit 0.
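The return-and-propagate pattern described above might look roughly like this. It is a sketch under stated assumptions: the single "missing tool" check, the summary wording, and the `dispatch` function name are hypothetical. Only `DR_FAIL`, the conditional return 1, and `doctor || exit 1` come from the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: a stripped-down doctor with one hypothetical kind of check.
DR_FAIL=0

doctor() {
  DR_FAIL=0
  local tool
  # Hypothetical hard check: every tool named in the arguments must exist.
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "FAIL: $tool not found"
      DR_FAIL=$((DR_FAIL + 1))
    fi
  done
  echo "summary: $DR_FAIL failed"
  # Hard failures produce a non-zero status; otherwise return 0.
  if (( DR_FAIL > 0 )); then
    return 1
  fi
  return 0
}

# Hypothetical dispatcher: `|| exit 1` turns the function's status into the
# process exit status instead of discarding it.
dispatch() {
  case "${1:-}" in
    doctor) shift; doctor "$@" || exit 1 ;;
    *) echo "usage: dispatch doctor [tool...]" >&2; exit 2 ;;
  esac
}
```

Calling `dispatch doctor sh` in a subshell should exit 0, while naming a tool that is not installed should exit 1, which is what a cron or CI gate needs to see.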

Tests

Adds a black-box test (tests/stack/run.sh) that drives a single critical failure via an unreachable-daemon docker stub and asserts doctor runs to its summary (so the exit 1 is from the FAIL tally, not an early crash) and exits 1. Suite: 94 passed, shellcheck clean.
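The shape of such a test can be sketched as follows. This is not the contents of tests/stack/run.sh: the stub body, the summary text, and `fake_doctor` (a stand-in for `pithead doctor`) are all assumptions made so the example runs on its own.

```shell
#!/usr/bin/env bash
# Sketch only: stub contents, summary wording, and fake_doctor are assumptions.

# Create a directory holding a `docker` stub that acts like an unreachable daemon.
make_docker_stub() {
  local dir
  dir=$(mktemp -d)
  cat >"$dir/docker" <<'EOF'
#!/bin/sh
echo "Cannot connect to the Docker daemon" >&2
exit 1
EOF
  chmod +x "$dir/docker"
  echo "$dir"
}

# Hypothetical stand-in for `pithead doctor`: one critical check, a summary
# line, then a status derived from the failure tally.
fake_doctor() {
  local fails=0
  docker info >/dev/null 2>&1 || fails=$((fails + 1))
  echo "doctor summary: $fails FAIL"
  (( fails == 0 ))
}

# Run with the stub first on PATH; capture both output and exit status.
stub_dir=$(make_docker_stub)
rc=0
out=$(PATH="$stub_dir:$PATH"; fake_doctor) || rc=$?
rm -rf "$stub_dir"

# Both conditions matter: the summary shows doctor ran to the end, so the
# non-zero status comes from the FAIL tally rather than an early crash.
if [[ "$out" == *"summary"* ]] && (( rc == 1 )); then
  echo "ok: doctor reached its summary and exited 1"
else
  echo "not ok: rc=$rc out=$out"
fi
```

Putting the stub directory at the front of PATH keeps the test black-box: the code under test is unchanged and simply finds the failing `docker` first.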

(A backup round-trip test is deferred to #140, which scopes comprehensive backup/restore + doctor coverage.)

Closes #127.

🤖 Generated with Claude Code

VijitSingh97 and others added 2 commits June 4, 2026 08:19
…-zero on FAIL (#127)

backup: the disk-space pre-check assigned `need_kb`/`avail_kb` as bare statements, so a
non-zero `du` exit (an unreadable subdir, a vanished file, an NFS hiccup) tripped errexit
and aborted the whole backup — making the "proceeding without a space check" fallback
unreachable. Add `|| true` so it degrades as intended.

doctor: it tallied DR_FAIL but always `return 0`, and the dispatch didn't propagate, so
`pithead doctor` always exited 0 — useless as a cron/CI/monitoring health gate. Return 1
when DR_FAIL>0 and dispatch via `doctor || exit 1` (mirrors `status`); warnings still exit 0.

Adds a black-box test driving a critical FAIL via an unreachable daemon stub, asserting
doctor runs to its summary and exits 1. CHANGELOG updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@VijitSingh97 VijitSingh97 merged commit 62e7a39 into main Jun 4, 2026
5 checks passed
@VijitSingh97 VijitSingh97 deleted the claude/pithead-robustness branch June 4, 2026 13:32
Development

Successfully merging this pull request may close these issues.

pithead robustness: 'backup' aborts on a non-fatal du/df error; 'doctor' always exits 0 even on FAIL
