Skip to content

fix(recovery): quarantine a saga whose branch-deletion compensation fails (#434)#438

Merged
hadamrd merged 1 commit into
trunkfrom
loop/434-quarantine-a-saga-whose-branch-deletion
Jun 9, 2026
Merged

fix(recovery): quarantine a saga whose branch-deletion compensation fails (#434)#438
hadamrd merged 1 commit into
trunkfrom
loop/434-quarantine-a-saga-whose-branch-deletion

Conversation

@hadamrd

@hadamrd hadamrd commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Fixes #434.

Problem

Once DELETE_BRANCH is a handled compensation kind, reconcile_stale_sagas
calls reap_branch during the sweep — and that call can fail (protected branch,
git lock, network/API error, ref gone-but-erroring). The current per-saga loop's
generic except Exception caught the raise, appended a line to
RecoveryReport.errors, and continued — leaving the saga in its non-terminal
RUNNING
state (confirmed by the prior assertion in
test_reconcile_continues_past_a_failing_saga, which expected RUNNING). It was
therefore still stale, re-swept on every boot/tick, and never drained from the
in-flight view: the immortal-saga bug #360 fixed, reborn for a handler that raises.

Fix (the only production mechanism)

A saga all of whose kinds are handled but whose handler raises is driven to
terminal TaskState.QUARANTINED — the same never-half-compensated invariant #360
established for unhandled kinds, extended to a handled-but-failing one.
QUARANTINED ends the saga (liveness — it drains and is never re-swept) without
asserting the branch was deleted (integrity).

  • recovery.py: extracted _run_compensations (dispatches each compensation to
    its handler by enum — REMOVE_WORKTREEreap_worktree, DELETE_BRANCH
    reap_branch); a handler raise is surfaced as a typed _CompensationFailed
    carrying kind+target+cause. The per-saga loop catches it, calls
    mark_quarantined(...) with a branch/kind-referencing reason, appends to
    errors, and continues — never COMPENSATED, never left RUNNING.
  • reap_branch is injected (the real git helper and the dispatch-time
    DELETE_BRANCH enqueue are sibling tickets in Compensate the branch a failed worker abandons #431, not yet on base).
  • DELETE_BRANCH is registered in _HANDLED_COMPENSATION_KINDS so the
    exhaustiveness contract stays green and the kind routes through a typed enum
    rather than a stringly-typed literal (manifesto: no stringly-typed cross-module
    discriminators). This required the 1-line enum member in saga.py since the
    sibling enum PR is not yet merged on this base — see Note on scope below.

Acceptance criteria → tests (tests/test_recovery.py)

  • raising reap_branchQUARANTINED, is_terminal, report.recovered == (),
    exactly one errors entry naming the saga id + branch-deletion failure,
    terminal_reason references the branch/kind, list_in_flight() == ()
    test_reconcile_quarantines_saga_when_branch_deletion_raises
  • re-sweep idempotence (no longer immortal) —
    test_reconcile_branch_deletion_failure_not_reswept
  • mixed sweep: one branch-fail → QUARANTINED, one healthy worktree → COMPENSATED;
    both terminal — test_reconcile_mixed_sweep_branch_fail_quarantined_worktree_compensated
  • happy path (guards against over-quarantining): succeeding reap_branch
    COMPENSATED + in recoveredtest_reconcile_branch_deletion_success_reaches_compensated
  • no partial success claimed on raise (saga not in recovered)
  • exhaustiveness _HANDLED_COMPENSATION_KINDS == set(CompensationKind) stays green
    with DELETE_BRANCH registered
  • integration via the boot surface: a failing reap yields a non-zero errors
    count in the emitted boot_recovery event + QUARANTINED saga —
    test_runner_boot_recovery_surfaces_quarantine_on_failing_reap
  • updated the previously-buggy test_reconcile_continues_past_a_failing_saga to
    assert the corrected QUARANTINED-and-drained behavior.

All 22 tests in tests/test_recovery.py pass.

Note on scope / environment

  • The issue named recovery.py as the only expected production file, assuming the
    sibling Compensate the branch a failed worker abandons #431 enum PR was merged. It is not on this base, so a minimal 1-line
    CompensationKind.DELETE_BRANCH member was added to saga.py to keep routing
    enum-typed (a stringly-typed "delete-branch" literal would be a manifesto
    sev1). reap_branch itself is still injected, not implemented here.
  • Verification: the worktree's src/ is shadowed onto PYTHONPATH to test the
    branch's code (the venv's editable install points at the main checkout; per the
    environment-safety rule no editable install was performed against the worktree).
  • ruff check src tests passes. pyright on the changed files
    (recovery.py, saga.py) is clean; the two pyright src/forge_loop errors in
    runner/dispatch.py:985 and runner/tick_checks.py:173 are pre-existing on
    trunk
    (verified against the main checkout) and untouched by this PR.

@hadamrd hadamrd left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimal path to green (must-fix to merge)

  1. Revert the out-of-scope additions in src/forge_loop/tasks/saga.py (the CompensationKind.DELETE_BRANCH enum member) and its registration in _HANDLED_COMPENSATION_KINDS in recovery.py — the issue's 'Out of scope' assigns these to sibling #431. Keep the real deliverable: the recovery.py change that drives a saga to QUARANTINED when a handled compensation handler raises (instead of leaving it RUNNING). Prove the invariant via the already-handled REMOVE_WORKTREE path by injecting a raising reap_worktree (as the ticket's out-of-scope section directs: 'stub/inject ... make the minimal recovery.py change so the invariant holds regardless'). This also removes the reap_branch param / DELETE_BRANCH dispatch branch, which otherwise becomes an unwired handler: boot.py and cli_status_commands.py never pass reap_branch, so a DELETE_BRANCH compensation would be silently marked COMPENSATED (reap_branch=None -> no-op) without deleting the branch.

Optional follow-ups (do NOT block merge)

  • [sev3/tests] tests/test_recovery.py — If the DELETE_BRANCH handler stays in any form, add a sad-path test for reap_branch=None: a DELETE_BRANCH compensation with no handler wired must NOT silently reach COMPENSATED. The current happy-path test only exercises the wired-handler case.
  • [sev3/architecture] src/forge_loop/control/recovery.py:70 — A successful DELETE_BRANCH reap records nothing in RecoveredSaga (only worktree targets populate worktrees_reaped), so the recovery report has no trace of which branch was reclaimed. Out of scope to add a field here (#434 forbids new RecoveryReport fields), but worth noting for the #431 wiring PR.

@hadamrd hadamrd added the critic:blocking Critic found blocking issues label Jun 8, 2026

@hadamrd hadamrd left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev2/architecture] Out-of-scope per #434: the ticket explicitly defers the DELETE_BRANCH enum member + its _HANDLED_COMPENSATION_KINDS registration to sibling #431 and instructs proving the invariant with an injected/stubbed handler plus the minimal recovery.py change. Adding the enum here (a) duplicates #431 (a same-line enum/keyset merge conflict when it lands), and (b) registers a 'handled' kind with NO production handler — boot.py:159 and cli_status_commands.py:392 both call reconcile_stale_sagas without reap_branch. Combined with _run_compensations treating reap_branch=None as a silent no-op before mark_compensated, a real DELETE_BRANCH compensation would be marked COMPENSATED without deleting the branch: the precise integrity violation #434 exists to prevent, and untested. Keep the quarantine-on-handler-raise fix; prove it via the already-handled REMOVE_WORKTREE path instead.

Comment thread src/forge_loop/tasks/saga.py Outdated
"""

REMOVE_WORKTREE = "remove-worktree"
# Reclaim the orphan ``loop/<n>`` branch a dead worker left behind (#431).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev2/architecture] Out-of-scope per #434: the ticket explicitly defers the DELETE_BRANCH enum member + its _HANDLED_COMPENSATION_KINDS registration to sibling #431 and instructs proving the invariant with an injected/stubbed handler plus the minimal recovery.py change. Adding the enum here (a) duplicates #431 (a same-line enum/keyset merge conflict when it lands), and (b) registers a 'handled' kind with NO production handler — boot.py:159 and cli_status_commands.py:392 both call reconcile_stale_sagas without reap_branch. Combined with _run_compensations treating reap_branch=None as a silent no-op before mark_compensated, a real DELETE_BRANCH compensation would be marked COMPENSATED without deleting the branch: the precise integrity violation #434 exists to prevent, and untested. Keep the quarantine-on-handler-raise fix; prove it via the already-handled REMOVE_WORKTREE path instead.

@hadamrd hadamrd force-pushed the loop/434-quarantine-a-saga-whose-branch-deletion branch from bd5b5d5 to 204d333 Compare June 9, 2026 19:05
@hadamrd hadamrd merged commit 56e4058 into trunk Jun 9, 2026
2 checks passed
@hadamrd hadamrd deleted the loop/434-quarantine-a-saga-whose-branch-deletion branch June 9, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

critic:blocking Critic found blocking issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Quarantine a saga whose branch-deletion compensation fails

1 participant