fix: deep-review hardening — auth probes, session-id, signal-kill, concurrency#5
Merged
Conversation
…on (deep-review 1/2) Review-driven fixes from the 2026-06-02 workflow+codegraph review. Runtime + utils layer (host-lib concurrency fixes follow in a second commit): - auth probes no longer regress transient failures to loggedIn:false. qwen normalizes ETIMEDOUT in runQwenPrompt; claude/copilot/minimax gain the transient/explicit split (timeout/429/ECONNRESET -> inconclusive, only an explicit auth error -> loggedOut). - cmd no longer fabricates a sessionId by scanning prose stdout for a UUID (hard null, cf. agy v0.6.18). - gemini/pi restrict the resolveSessionId fallback to stderr/file (stdout blanked) so a UUID in the answer prose can never be promoted to a fabricated sessionId; stderr id fallback preserved. - gemini sync parseGeminiJsonResult gates ok on status===0 (was ok:true on a non-zero exit). - runCommand surfaces a signal kill (status:null + signal, no error) as a synthetic error so the sync providers stop reading a SIGKILL/SIGTERM as success. - createLineDecoder enforces maxBufferBytes on the post-drain residual, so a burst of complete newline-terminated lines is no longer rejected and dropped. - withLockfile reclaims a stale no-pid / partial-write lock by mtime instead of wedging for the full timeout. Tests: +21 (474/474). npm test green; companion bundles byte-identical. Refs: memory project_deep_review_2026_06_02.
…ew 2/2) Host-lib half of the 2026-06-02 workflow+codegraph review batch: - state.mjs no longer ships byte-duplicate writeJsonAtomic/withLockfile; it imports the single hardened implementation from @bbingz/polycli-utils/atomic-save (which now also reclaims a stale no-pid/partial-write lock), removing the drift class behind the v0.6.15 split-store regression. - cacheProviderModel serializes its read-modify-write under a lockfile and writes atomically, fixing the cross-process lost update where concurrent runs dropped each other's cached model. - recoverLedgerTerminalEvents serializes read -> hasLedgerPhase -> append -> removeConfig under a recover lock (distinct path from the ndjson append lock) so two concurrent refreshJob() callers can no longer double-append a run's terminal events. - cancelJob flips status + captures the pid atomically under the state lock BEFORE signalling, so it never kills a pid read from a stale pre-lock snapshot (the pid was confirmed ACTIVE at lock time). Deferred (documented, not rushed): post-finalize ledger recovery for a worker that crashed between the atomic finalize and the ledger append (needs a crashed-vs-normal-terminal discriminator), and a PID-reuse liveness witness (record process start-time at the spawn site; inherent to PID-only liveness). Both medium/low; tracked in memory project_deep_review_2026_06_02. npm test 474/474; companion bundles byte-identical.
…e it Codex review of PR #5 stalled (~20min, no output) on the 1249-line diff, so this gate ran a 6-dimension adversarial self-review instead (20 reviewers: auth-probe transient cluster, atomic-save locking, signal-kill, stream limit, job-control concurrency, state dedup, companion sessionId, test adequacy). 14 raw findings → 0 survived adversarial verification: the 25 hardening fixes are correctly implemented with no regressions (auth transient→inconclusive consistent across all 10 providers; signal-kill surfaces a synthetic error before status is read; no-pid stale-lock reclaim correct and tested; state.mjs reduction is pure helper extraction; sessionId never fabricated). Only actionable item: the branch tracked `.codegraph/.gitignore` (a CodeGraph index artifact) while root `.gitignore` lacked a `.codegraph/` entry. Untrack the artifact and ignore the directory (matching the kimi/grok branches). Residual (NOT changed, low-risk): cacheProviderModel now throws on lock-acquire timeout — but it already threw on FS errors at the same unguarded call sites, so the propagation behavior is essentially pre-existing; the timeout path is extremely low-probability and self-heals via the no-pid reclaim fix. No source/bundle change; npm test stays 474/474.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Branch 1 of 3 from the 2026-06-02 workflow+codegraph deep review (27 confirmed findings). This PR lands the runtime/utils + host-lib hardening (25/27); the kimi-code migration and the new grok provider ship as their own PRs.
Runtime / utils (commit 1)
loggedIn:false— qwen normalizes ETIMEDOUT; claude/copilot/minimax gain the transient/explicit split (timeout/429/ECONNRESET → inconclusive, only an explicit auth error → loggedOut). Restores the documented hard rule.null, cf. agy v0.6.18).resolveSessionIdfallback to stderr/file (stdout blanked) so a UUID in the answer prose can't be promoted to a fabricated id; the stderr-id fallback is preserved.parseGeminiJsonResultnow gatesokonstatus === 0(wasok:trueon a non-zero exit when stdout had JSON).status:null+signal, no error) as a synthetic error so the sync providers stop reading a SIGKILL/SIGTERM as success.maxBufferByteson the post-drain residual, so a burst of complete newline-terminated lines is no longer rejected and dropped.classifyProviderFailure(was untested) + the four-statemixedaggregate branch.Host-lib (commit 2)
writeJsonAtomic/withLockfilefrom polycli-utils instead of carrying byte-duplicate copies (removes the drift class behind the v0.6.15 split-store regression; inherits the no-pid fix)..recover.lock(fixes the cross-process double-append).Deferred (documented, not rushed — both medium/low)
Validation
npm test474/474 (from 453, +21 tests). Companion bundles byte-identical.git diffclean.Findings adjudicated by 3-lens adversarial verification; the 4 session-purge "findings" were all refuted (purge path verified sound) and are intentionally not touched.