Skip to content

fix: deep-review hardening — auth probes, session-id, signal-kill, concurrency#5

Merged
bbingz merged 3 commits into
mainfrom
fix/deep-review-2026-06-02
Jun 2, 2026
Merged

fix: deep-review hardening — auth probes, session-id, signal-kill, concurrency#5
bbingz merged 3 commits into
mainfrom
fix/deep-review-2026-06-02

Conversation

@bbingz
Copy link
Copy Markdown
Owner

@bbingz bbingz commented Jun 2, 2026

Branch 1 of 3 from the 2026-06-02 workflow+codegraph deep review (27 confirmed findings). This PR lands the runtime/utils + host-lib hardening (25/27); the kimi-code migration and the new grok provider ship as their own PRs.

Runtime / utils (commit 1)

  • auth probes no longer regress transient failures to loggedIn:false — qwen normalizes ETIMEDOUT; claude/copilot/minimax gain the transient/explicit split (timeout/429/ECONNRESET → inconclusive, only an explicit auth error → loggedOut). Restores the documented hard rule.
  • cmd no longer fabricates a sessionId by scanning prose stdout for a UUID (hard null, cf. agy v0.6.18).
  • gemini/pi restrict the resolveSessionId fallback to stderr/file (stdout blanked) so a UUID in the answer prose can't be promoted to a fabricated id; the stderr-id fallback is preserved.
  • gemini sync parseGeminiJsonResult now gates ok on status === 0 (was ok:true on a non-zero exit when stdout had JSON).
  • runCommand surfaces a signal-kill (status:null + signal, no error) as a synthetic error so the sync providers stop reading a SIGKILL/SIGTERM as success.
  • createLineDecoder enforces maxBufferBytes on the post-drain residual, so a burst of complete newline-terminated lines is no longer rejected and dropped.
  • withLockfile reclaims a stale no-pid / partial-write lock by mtime instead of wedging for the full timeout.
  • Tests: classifyProviderFailure (was untested) + the four-state mixed aggregate branch.

Host-lib (commit 2)

  • state.mjs imports the single hardened writeJsonAtomic/withLockfile from polycli-utils instead of carrying byte-duplicate copies (removes the drift class behind the v0.6.15 split-store regression; inherits the no-pid fix).
  • cacheProviderModel RMW is now locked + atomic (fixes the cross-process lost update).
  • recoverLedgerTerminalEvents read→check→append→removeConfig is serialized under a .recover.lock (fixes the cross-process double-append).
  • cancelJob transitions status + captures the pid atomically under the state lock before signalling (no longer kills a pid from a stale pre-lock snapshot).

Deferred (documented, not rushed — both medium/low)

  • post-finalize ledger recovery for a worker that crashed between the atomic finalize and the ledger append (needs a crashed-vs-normal-terminal discriminator so it doesn't re-emit on the normal cancel path).
  • PID-reuse liveness witness (record process start-time at the spawn site; inherent to PID-only liveness).

Validation

npm test 474/474 (from 453, +21 tests). Companion bundles byte-identical. git diff clean.

Findings adjudicated by 3-lens adversarial verification; the 4 session-purge "findings" were all refuted (purge path verified sound) and are intentionally not touched.

bbingz added 3 commits June 2, 2026 11:23
…on (deep-review 1/2)

Review-driven fixes from the 2026-06-02 workflow+codegraph review.
Runtime + utils layer (host-lib concurrency fixes follow in a second commit):

- auth probes no longer regress transient failures to loggedIn:false. qwen normalizes
  ETIMEDOUT in runQwenPrompt; claude/copilot/minimax gain the transient/explicit split
  (timeout/429/ECONNRESET -> inconclusive, only an explicit auth error -> loggedOut).
- cmd no longer fabricates a sessionId by scanning prose stdout for a UUID (hard null, cf. agy v0.6.18).
- gemini/pi restrict the resolveSessionId fallback to stderr/file (stdout blanked) so a UUID in
  the answer prose can never be promoted to a fabricated sessionId; stderr id fallback preserved.
- gemini sync parseGeminiJsonResult gates ok on status===0 (was ok:true on a non-zero exit).
- runCommand surfaces a signal kill (status:null + signal, no error) as a synthetic error so the
  sync providers stop reading a SIGKILL/SIGTERM as success.
- createLineDecoder enforces maxBufferBytes on the post-drain residual, so a burst of complete
  newline-terminated lines is no longer rejected and dropped.
- withLockfile reclaims a stale no-pid / partial-write lock by mtime instead of wedging for the
  full timeout.

Tests: +21 (474/474). npm test green; companion bundles byte-identical.
Refs: memory project_deep_review_2026_06_02.
…ew 2/2)

Host-lib half of the 2026-06-02 workflow+codegraph review batch:

- state.mjs no longer ships byte-duplicate writeJsonAtomic/withLockfile; it imports the single
  hardened implementation from @bbingz/polycli-utils/atomic-save (which now also reclaims a stale
  no-pid/partial-write lock), removing the drift class behind the v0.6.15 split-store regression.
- cacheProviderModel serializes its read-modify-write under a lockfile and writes atomically,
  fixing the cross-process lost update where concurrent runs dropped each other's cached model.
- recoverLedgerTerminalEvents serializes read -> hasLedgerPhase -> append -> removeConfig under a
  recover lock (distinct path from the ndjson append lock) so two concurrent refreshJob() callers
  can no longer double-append a run's terminal events.
- cancelJob flips status + captures the pid atomically under the state lock BEFORE signalling, so it
  never kills a pid read from a stale pre-lock snapshot (the pid was confirmed ACTIVE at lock time).

Deferred (documented, not rushed): post-finalize ledger recovery for a worker that crashed between
the atomic finalize and the ledger append (needs a crashed-vs-normal-terminal discriminator), and a
PID-reuse liveness witness (record process start-time at the spawn site; inherent to PID-only
liveness). Both medium/low; tracked in memory project_deep_review_2026_06_02.

npm test 474/474; companion bundles byte-identical.
…e it

Codex review of PR #5 stalled (~20min, no output) on the 1249-line diff, so
this gate ran a 6-dimension adversarial self-review instead (20 reviewers:
auth-probe transient cluster, atomic-save locking, signal-kill, stream limit,
job-control concurrency, state dedup, companion sessionId, test adequacy).
14 raw findings → 0 survived adversarial verification: the 25 hardening fixes
are correctly implemented with no regressions (auth transient→inconclusive
consistent across all 10 providers; signal-kill surfaces a synthetic error
before status is read; no-pid stale-lock reclaim correct and tested; state.mjs
reduction is pure helper extraction; sessionId never fabricated).

Only actionable item: the branch tracked `.codegraph/.gitignore` (a CodeGraph
index artifact) while root `.gitignore` lacked a `.codegraph/` entry. Untrack
the artifact and ignore the directory (matching the kimi/grok branches).

Residual (NOT changed, low-risk): cacheProviderModel now throws on lock-acquire
timeout — but it already threw on FS errors at the same unguarded call sites, so
the propagation behavior is essentially pre-existing; the timeout path is
extremely low-probability and self-heals via the no-pid reclaim fix.

No source/bundle change; npm test stays 474/474.
@bbingz bbingz merged commit b04661b into main Jun 2, 2026
1 check failed
@bbingz bbingz deleted the fix/deep-review-2026-06-02 branch June 2, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant