fix(recovery): quarantine a saga whose branch-deletion compensation fails (#434)#438
Conversation
hadamrd
left a comment
There was a problem hiding this comment.
Minimal path to green (must-fix to merge)
- Revert the out-of-scope additions in src/forge_loop/tasks/saga.py (the CompensationKind.DELETE_BRANCH enum member) and its registration in _HANDLED_COMPENSATION_KINDS in recovery.py — the issue's 'Out of scope' assigns these to sibling #431. Keep the real deliverable: the recovery.py change that drives a saga to QUARANTINED when a handled compensation handler raises (instead of leaving it RUNNING). Prove the invariant via the already-handled REMOVE_WORKTREE path by injecting a raising reap_worktree (as the ticket's out-of-scope section directs: 'stub/inject ... make the minimal recovery.py change so the invariant holds regardless'). This also removes the reap_branch param / DELETE_BRANCH dispatch branch, which otherwise becomes an unwired handler: boot.py and cli_status_commands.py never pass reap_branch, so a DELETE_BRANCH compensation would be silently marked COMPENSATED (reap_branch=None -> no-op) without deleting the branch.
Optional follow-ups (do NOT block merge)
- [sev3/tests] tests/test_recovery.py — If the DELETE_BRANCH handler stays in any form, add a sad-path test for reap_branch=None: a DELETE_BRANCH compensation with no handler wired must NOT silently reach COMPENSATED. The current happy-path test only exercises the wired-handler case.
- [sev3/architecture] src/forge_loop/control/recovery.py:70 — A successful DELETE_BRANCH reap records nothing in RecoveredSaga (only worktree targets populate worktrees_reaped), so the recovery report has no trace of which branch was reclaimed. Out of scope to add a field here (#434 forbids new RecoveryReport fields), but worth noting for the #431 wiring PR.
hadamrd
left a comment
There was a problem hiding this comment.
[sev2/architecture] Out-of-scope per #434: the ticket explicitly defers the DELETE_BRANCH enum member + its _HANDLED_COMPENSATION_KINDS registration to sibling #431 and instructs proving the invariant with an injected/stubbed handler plus the minimal recovery.py change. Adding the enum here (a) duplicates #431 (a same-line enum/keyset merge conflict when it lands), and (b) registers a 'handled' kind with NO production handler — boot.py:159 and cli_status_commands.py:392 both call reconcile_stale_sagas without reap_branch. Combined with _run_compensations treating reap_branch=None as a silent no-op before mark_compensated, a real DELETE_BRANCH compensation would be marked COMPENSATED without deleting the branch: the precise integrity violation #434 exists to prevent, and untested. Keep the quarantine-on-handler-raise fix; prove it via the already-handled REMOVE_WORKTREE path instead.
| """ | ||
|
|
||
| REMOVE_WORKTREE = "remove-worktree" | ||
| # Reclaim the orphan ``loop/<n>`` branch a dead worker left behind (#431). |
There was a problem hiding this comment.
[sev2/architecture] Out-of-scope per #434: the ticket explicitly defers the DELETE_BRANCH enum member + its _HANDLED_COMPENSATION_KINDS registration to sibling #431 and instructs proving the invariant with an injected/stubbed handler plus the minimal recovery.py change. Adding the enum here (a) duplicates #431 (a same-line enum/keyset merge conflict when it lands), and (b) registers a 'handled' kind with NO production handler — boot.py:159 and cli_status_commands.py:392 both call reconcile_stale_sagas without reap_branch. Combined with _run_compensations treating reap_branch=None as a silent no-op before mark_compensated, a real DELETE_BRANCH compensation would be marked COMPENSATED without deleting the branch: the precise integrity violation #434 exists to prevent, and untested. Keep the quarantine-on-handler-raise fix; prove it via the already-handled REMOVE_WORKTREE path instead.
bd5b5d5 to
204d333
Compare
Fixes #434.
Problem
Once
DELETE_BRANCHis a handled compensation kind,reconcile_stale_sagascalls
reap_branchduring the sweep — and that call can fail (protected branch,git lock, network/API error, ref gone-but-erroring). The current per-saga loop's
generic
except Exceptioncaught the raise, appended a line toRecoveryReport.errors, andcontinued — leaving the saga in its non-terminalRUNNING state (confirmed by the prior assertion in
test_reconcile_continues_past_a_failing_saga, which expectedRUNNING). It wastherefore still stale, re-swept on every boot/tick, and never drained from the
in-flight view: the immortal-saga bug #360 fixed, reborn for a handler that raises.
Fix (the only production mechanism)
A saga all of whose kinds are handled but whose handler raises is driven to
terminal
TaskState.QUARANTINED— the same never-half-compensated invariant #360established for unhandled kinds, extended to a handled-but-failing one.
QUARANTINED ends the saga (liveness — it drains and is never re-swept) without
asserting the branch was deleted (integrity).
recovery.py: extracted_run_compensations(dispatches each compensation toits handler by enum —
REMOVE_WORKTREE→reap_worktree,DELETE_BRANCH→reap_branch); a handler raise is surfaced as a typed_CompensationFailedcarrying kind+target+cause. The per-saga loop catches it, calls
mark_quarantined(...)with a branch/kind-referencing reason, appends toerrors, and continues — neverCOMPENSATED, never leftRUNNING.reap_branchis injected (the real git helper and the dispatch-timeDELETE_BRANCHenqueue are sibling tickets in Compensate the branch a failed worker abandons #431, not yet on base).DELETE_BRANCHis registered in_HANDLED_COMPENSATION_KINDSso theexhaustiveness contract stays green and the kind routes through a typed enum
rather than a stringly-typed literal (manifesto: no stringly-typed cross-module
discriminators). This required the 1-line enum member in
saga.pysince thesibling enum PR is not yet merged on this base — see Note on scope below.
Acceptance criteria → tests (
tests/test_recovery.py)reap_branch⇒QUARANTINED,is_terminal,report.recovered == (),exactly one
errorsentry naming the saga id + branch-deletion failure,terminal_reasonreferences the branch/kind,list_in_flight() == ()—test_reconcile_quarantines_saga_when_branch_deletion_raisestest_reconcile_branch_deletion_failure_not_resweptboth terminal —
test_reconcile_mixed_sweep_branch_fail_quarantined_worktree_compensatedreap_branch⇒COMPENSATED+ inrecovered—test_reconcile_branch_deletion_success_reaches_compensatedrecovered)_HANDLED_COMPENSATION_KINDS == set(CompensationKind)stays greenwith
DELETE_BRANCHregisterederrorscount in the emitted
boot_recoveryevent + QUARANTINED saga —test_runner_boot_recovery_surfaces_quarantine_on_failing_reaptest_reconcile_continues_past_a_failing_sagatoassert the corrected QUARANTINED-and-drained behavior.
All 22 tests in
tests/test_recovery.pypass.Note on scope / environment
recovery.pyas the only expected production file, assuming thesibling Compensate the branch a failed worker abandons #431 enum PR was merged. It is not on this base, so a minimal 1-line
CompensationKind.DELETE_BRANCHmember was added tosaga.pyto keep routingenum-typed (a stringly-typed
"delete-branch"literal would be a manifestosev1).
reap_branchitself is still injected, not implemented here.src/is shadowed ontoPYTHONPATHto test thebranch's code (the venv's editable install points at the main checkout; per the
environment-safety rule no editable install was performed against the worktree).
ruff check src testspasses.pyrighton the changed files(
recovery.py,saga.py) is clean; the twopyright src/forge_looperrors inrunner/dispatch.py:985andrunner/tick_checks.py:173are pre-existing ontrunk (verified against the main checkout) and untouched by this PR.