Skip to content

fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373

Open
njrun1804 wants to merge 5 commits into
openai:mainfrom
njrun1804:main
Open

fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373
njrun1804 wants to merge 5 commits into
openai:mainfrom
njrun1804:main

Conversation

@njrun1804

Copy link
Copy Markdown

Problem

A task/review worker that dies before recording a terminal state — crash, SIGKILL, OOM, broker-socket drop, or a SessionEnd teardown in another session — leaves its job stuck status: "running" forever. status then shows a growing elapsed over a frozen updatedAt (elapsed is computed live from startedAt while the stored status never advances), and result throws "still running" / "No job found". The status output actively misreports liveness, because buildStatusSnapshot/resolveResultJob partition jobs purely on the stored status string and never probe the worker pid.

Fix

  • lib/job-control.mjs — every job-read path (buildStatusSnapshot, buildSingleJobSnapshot, resolveResultJob, resolveCancelableJob) routes through a reconcileJob: for a queued/running job with a finite pid, process.kill(pid, 0)ESRCH means the worker is gone, so the job is persisted as failed:orphaned. Idempotent (a reconciled job is terminal, so later reads skip it); a live pid (deliverable or EPERM) or the current process's own pid is never touched.
  • lib/tracked-jobs.mjsrunTrackedJob installs exit/SIGTERM/SIGINT/SIGHUP finalizers that synchronously flush a terminal failed:interrupted record if the worker dies while still running (SIGKILL/OOM can't be trapped — those are caught by the reconcile above). Disarmed on normal completion. Defense in depth.
  • lib/render.mjsrenderStoredJobResult falls back to the job's log tail when no result payload was captured, so an orphaned job still yields its partial output instead of an empty "No result" message.

Tests

tests/job-control.test.mjs (new): dead-pid → failed:orphaned, live-pid untouched, result resolves a reconciled orphan. Full suite 86 → 89, all green (node --test tests/*.test.mjs).

🤖 Generated with Claude Code

A worker that dies before recording a terminal state (crash, SIGKILL, OOM, broker drop,
session teardown) left its job stuck status:running forever — status showed a growing
elapsed on a frozen updatedAt, and result threw 'still running' / 'no job found'.

- job-control.mjs: every job-read path now routes through a liveness reconcile —
  process.kill(pid,0) -> ESRCH means the worker is gone, so the job is persisted as
  failed:orphaned. Idempotent; a live pid (or our own pid) is never touched.
- tracked-jobs.mjs: the worker installs exit/SIGTERM/SIGINT/SIGHUP finalizers that flush
  a terminal failed:interrupted record on graceful death (SIGKILL/OOM are caught by the
  reconcile above). Defense in depth.
- render.mjs: result falls back to the job's log tail when no payload was captured.
- tests/job-control.test.mjs: dead-pid->orphaned, live-pid untouched, result resolves.

86 -> 89 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@njrun1804 njrun1804 requested a review from a team June 14, 2026 17:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94c10e78be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +222 to +223
process.kill(pid, 0);
return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Detect defunct workers before treating pid as live

When the worker has exited but remains as a defunct/zombie process, process.kill(pid, 0) still succeeds because the PID is still present in the process table. In that state this returns true, so reconcileJob leaves the job in queued/running and /codex:status or /codex:result can keep reporting it as active forever—the case this change is intended to repair, especially in containers where PID 1 may not reap orphaned children. Add a zombie/heartbeat check before accepting signal-0 success as liveness.

Useful? React with 👍 / 👎.

njrun1804 and others added 3 commits June 14, 2026 14:12
…e worker

A single app-server turn that never reports completion would await forever. The
liveness-reconcile in job-control only reaps DEAD workers — a turn that hangs while the
pid stays alive slips past it. captureTurn now races state.completion against a wall-clock
deadline (CODEX_TURN_TIMEOUT_MS, default 45m, 0 disables); on expiry it best-effort
turn/interrupts the stuck turn and throws, so the job finalizes failed instead of hanging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…family jobs

Codex is used as a second model family for exactly two jobs; make both first-class and "like us":

(a) Design critique — /codex:critique + codex-critique subagent + handleCritique run the new
    design-critique prompt READ-ONLY, with a hard mandate to ground every load-bearing claim in
    the code (file:line) AND the LIVE database (cite the query + result) — not prose-only.
(b) Code review — a prompt-overlay seam (loadPromptTemplate appends prompts/overlays/<name>.md
    inside <house_conventions>) carries our adversarial-review rubric ON TOP of the upstream
    prompt, so "git pull upstream" never conflicts on the scaffold. The rubric: contradict don't
    echo, correctness-first, file:line, demand check.sh proof, flag scope creep.

Also makes the brittle exact-list command test tolerant of fork additions, and adds
critique-and-overlays.test.mjs to lock in both jobs + the timeout. 96/96 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the two-job model

Codex is a second model family for exactly two jobs — critique a design before building it
(/codex:critique), and adversarially review completed work (/codex:adversarial-review). The
plain /codex:review command was a strict subset of adversarial-review (same engine, same git
targets, just no focus text and no house rubric), so it only created "which review?" confusion.

- Remove the /codex:review command (the user-facing path). The companion `review` engine stays —
  the optional Stop review-gate uses it internally (runtime tests still cover it).
- README: drop the review section + refs, add a /codex:critique section, frame the two-job model
  up front. adversarial-review is now "the one review command".
- Skill routing + result.md follow-ups point at adversarial-review / critique.
- Tests updated to match. 95/95 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fe2467952

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +637 to +640
await client.request("turn/interrupt", {
threadId: state.threadId,
turnId: state.turnId
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound the interrupt request after turn timeout

When a turn times out because the broker/app-server is still connected but no longer answering JSON-RPC, this await client.request("turn/interrupt", ...) can hang indefinitely: CodexAppServerClient.request only settles on a response or connection close. That leaves the worker process alive after CODEX_TURN_TIMEOUT_MS, so the job remains running and the liveness reconcile cannot help; race the interrupt with a short timeout or fire it best-effort before throwing the timeout error.

Useful? React with 👍 / 👎.


if (options.background) {
ensureCodexAvailable(cwd);
const job = buildTaskJob(workspaceRoot, taskMetadata, false);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate critique jobs from rescue resumes

Critique runs are created with buildTaskJob, so they are stored as jobClass: "task" and later match findLatestResumableTaskJob/resolveLatestTrackedTaskThread in this file. After a /codex:critique completes, the next /codex:rescue --resume or resume prompt can resume the critique thread instead of the latest rescue task, mixing a read-only design critique conversation into an implementation task; give critique its own job class/kind or exclude it from the rescue resume lookup.

Useful? React with 👍 / 👎.

… an orphaned job

/codex:critique previously defaulted to --background, which detached the
review into a separate Codex job and returned an instant stub — the
subagent finished with nothing and no real completion was produced.

- agent: always foreground; strip --background/--wait (Claude-side dispatch
  controls); set max Bash timeout so long critiques are not cut short
- handleCritique: parse+ignore --background/--wait (matches the review
  commands); remove the detach path; also fixes --wait leaking into design text
- README: returns the critique directly; drop stale status/result polling note
- bump 1.0.4 -> 1.0.5 + CHANGELOG so the plugin cache refreshes on update

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 332f95735f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +617 to +618
const turnTimeoutMs = resolveTurnTimeoutMs(options);
if (turnTimeoutMs > 0) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Start the turn timeout before awaiting turn/start

When the app server or broker accepts the connection but never replies to the initial turn/start or review/start JSON-RPC, execution is still stuck on the earlier await startRequest() before this new Promise.race is reached. CodexAppServerClient.request has no per-request timeout and only settles on a response or connection close, so CODEX_TURN_TIMEOUT_MS does not prevent a live worker PID from staying running forever in that failure mode; race the start request itself, or add a request timeout, before waiting for state.completion.

Useful? React with 👍 / 👎.

Comment on lines +238 to +239
if (job.pid === process.pid || isPidAlive(job.pid)) {
return job;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify worker identity before trusting a live PID

If the worker exits and the OS reuses its PID before /codex:status, /codex:result, or /codex:cancel runs, this branch treats the unrelated process as the original live worker because signal 0 only proves some process with that PID exists. The orphaned job then remains active despite the worker being gone, and resolveCancelableJob can feed that same PID to terminateProcessTree during cancel, so use a worker identity check such as start time, a command marker, or a heartbeat rather than PID existence alone.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant