fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373
fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373njrun1804 wants to merge 5 commits into
Conversation
A worker that dies before recording a terminal state (crash, SIGKILL, OOM, broker drop, session teardown) left its job stuck status:running forever — status showed a growing elapsed on a frozen updatedAt, and result threw 'still running' / 'no job found'. - job-control.mjs: every job-read path now routes through a liveness reconcile — process.kill(pid,0) -> ESRCH means the worker is gone, so the job is persisted as failed:orphaned. Idempotent; a live pid (or our own pid) is never touched. - tracked-jobs.mjs: the worker installs exit/SIGTERM/SIGINT/SIGHUP finalizers that flush a terminal failed:interrupted record on graceful death (SIGKILL/OOM are caught by the reconcile above). Defense in depth. - render.mjs: result falls back to the job's log tail when no payload was captured. - tests/job-control.test.mjs: dead-pid->orphaned, live-pid untouched, result resolves. 86 -> 89 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94c10e78be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| process.kill(pid, 0); | ||
| return true; |
There was a problem hiding this comment.
Detect defunct workers before treating pid as live
When the worker has exited but remains as a defunct/zombie process, process.kill(pid, 0) still succeeds because the PID is still present in the process table. In that state this returns true, so reconcileJob leaves the job in queued/running and /codex:status or /codex:result can keep reporting it as active forever—the case this change is intended to repair, especially in containers where PID 1 may not reap orphaned children. Add a zombie/heartbeat check before accepting signal-0 success as liveness.
Useful? React with 👍 / 👎.
…e worker A single app-server turn that never reports completion would await forever. The liveness-reconcile in job-control only reaps DEAD workers — a turn that hangs while the pid stays alive slips past it. captureTurn now races state.completion against a wall-clock deadline (CODEX_TURN_TIMEOUT_MS, default 45m, 0 disables); on expiry it best-effort turn/interrupts the stuck turn and throws, so the job finalizes failed instead of hanging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…family jobs
Codex is used as a second model family for exactly two jobs; make both first-class and "like us":
(a) Design critique — /codex:critique + codex-critique subagent + handleCritique run the new
design-critique prompt READ-ONLY, with a hard mandate to ground every load-bearing claim in
the code (file:line) AND the LIVE database (cite the query + result) — not prose-only.
(b) Code review — a prompt-overlay seam (loadPromptTemplate appends prompts/overlays/<name>.md
inside <house_conventions>) carries our adversarial-review rubric ON TOP of the upstream
prompt, so "git pull upstream" never conflicts on the scaffold. The rubric: contradict don't
echo, correctness-first, file:line, demand check.sh proof, flag scope creep.
Also makes the brittle exact-list command test tolerant of fork additions, and adds
critique-and-overlays.test.mjs to lock in both jobs + the timeout. 96/96 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the two-job model Codex is a second model family for exactly two jobs — critique a design before building it (/codex:critique), and adversarially review completed work (/codex:adversarial-review). The plain /codex:review command was a strict subset of adversarial-review (same engine, same git targets, just no focus text and no house rubric), so it only created "which review?" confusion. - Remove the /codex:review command (the user-facing path). The companion `review` engine stays — the optional Stop review-gate uses it internally (runtime tests still cover it). - README: drop the review section + refs, add a /codex:critique section, frame the two-job model up front. adversarial-review is now "the one review command". - Skill routing + result.md follow-ups point at adversarial-review / critique. - Tests updated to match. 95/95 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fe2467952
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| await client.request("turn/interrupt", { | ||
| threadId: state.threadId, | ||
| turnId: state.turnId | ||
| }); |
There was a problem hiding this comment.
Bound the interrupt request after turn timeout
When a turn times out because the broker/app-server is still connected but no longer answering JSON-RPC, this await client.request("turn/interrupt", ...) can hang indefinitely: CodexAppServerClient.request only settles on a response or connection close. That leaves the worker process alive after CODEX_TURN_TIMEOUT_MS, so the job remains running and the liveness reconcile cannot help; race the interrupt with a short timeout or fire it best-effort before throwing the timeout error.
Useful? React with 👍 / 👎.
|
|
||
| if (options.background) { | ||
| ensureCodexAvailable(cwd); | ||
| const job = buildTaskJob(workspaceRoot, taskMetadata, false); |
There was a problem hiding this comment.
Isolate critique jobs from rescue resumes
Critique runs are created with buildTaskJob, so they are stored as jobClass: "task" and later match findLatestResumableTaskJob/resolveLatestTrackedTaskThread in this file. After a /codex:critique completes, the next /codex:rescue --resume or resume prompt can resume the critique thread instead of the latest rescue task, mixing a read-only design critique conversation into an implementation task; give critique its own job class/kind or exclude it from the rescue resume lookup.
Useful? React with 👍 / 👎.
… an orphaned job /codex:critique previously defaulted to --background, which detached the review into a separate Codex job and returned an instant stub — the subagent finished with nothing and no real completion was produced. - agent: always foreground; strip --background/--wait (Claude-side dispatch controls); set max Bash timeout so long critiques are not cut short - handleCritique: parse+ignore --background/--wait (matches the review commands); remove the detach path; also fixes --wait leaking into design text - README: returns the critique directly; drop stale status/result polling note - bump 1.0.4 -> 1.0.5 + CHANGELOG so the plugin cache refreshes on update Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 332f95735f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const turnTimeoutMs = resolveTurnTimeoutMs(options); | ||
| if (turnTimeoutMs > 0) { |
There was a problem hiding this comment.
Start the turn timeout before awaiting turn/start
When the app server or broker accepts the connection but never replies to the initial turn/start or review/start JSON-RPC, execution is still stuck on the earlier await startRequest() before this new Promise.race is reached. CodexAppServerClient.request has no per-request timeout and only settles on a response or connection close, so CODEX_TURN_TIMEOUT_MS does not prevent a live worker PID from staying running forever in that failure mode; race the start request itself, or add a request timeout, before waiting for state.completion.
Useful? React with 👍 / 👎.
| if (job.pid === process.pid || isPidAlive(job.pid)) { | ||
| return job; |
There was a problem hiding this comment.
Verify worker identity before trusting a live PID
If the worker exits and the OS reuses its PID before /codex:status, /codex:result, or /codex:cancel runs, this branch treats the unrelated process as the original live worker because signal 0 only proves some process with that PID exists. The orphaned job then remains active despite the worker being gone, and resolveCancelableJob can feed that same PID to terminateProcessTree during cancel, so use a worker identity check such as start time, a command marker, or a heartbeat rather than PID existence alone.
Useful? React with 👍 / 👎.
Problem
A
task/reviewworker that dies before recording a terminal state — crash, SIGKILL, OOM, broker-socket drop, or a SessionEnd teardown in another session — leaves its job stuckstatus: "running"forever.statusthen shows a growingelapsedover a frozenupdatedAt(elapsed is computed live fromstartedAtwhile the stored status never advances), andresultthrows "still running" / "No job found". The status output actively misreports liveness, becausebuildStatusSnapshot/resolveResultJobpartition jobs purely on the storedstatusstring and never probe the worker pid.Fix
lib/job-control.mjs— every job-read path (buildStatusSnapshot,buildSingleJobSnapshot,resolveResultJob,resolveCancelableJob) routes through areconcileJob: for aqueued/runningjob with a finitepid,process.kill(pid, 0)→ESRCHmeans the worker is gone, so the job is persisted asfailed:orphaned. Idempotent (a reconciled job is terminal, so later reads skip it); a live pid (deliverable orEPERM) or the current process's own pid is never touched.lib/tracked-jobs.mjs—runTrackedJobinstallsexit/SIGTERM/SIGINT/SIGHUPfinalizers that synchronously flush a terminalfailed:interruptedrecord if the worker dies while stillrunning(SIGKILL/OOM can't be trapped — those are caught by the reconcile above). Disarmed on normal completion. Defense in depth.lib/render.mjs—renderStoredJobResultfalls back to the job's log tail when no result payload was captured, so an orphaned job still yields its partial output instead of an empty "No result" message.Tests
tests/job-control.test.mjs(new): dead-pid →failed:orphaned, live-pid untouched,resultresolves a reconciled orphan. Full suite 86 → 89, all green (node --test tests/*.test.mjs).🤖 Generated with Claude Code