fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state by njrun1804 · Pull Request #373 · openai/codex-plugin-cc

njrun1804 · 2026-06-14T17:41:19Z

Problem

A task/review worker that dies before recording a terminal state — crash, SIGKILL, OOM, broker-socket drop, or a SessionEnd teardown in another session — leaves its job stuck status: "running" forever. status then shows a growing elapsed over a frozen updatedAt (elapsed is computed live from startedAt while the stored status never advances), and result throws "still running" / "No job found". The status output actively misreports liveness, because buildStatusSnapshot/resolveResultJob partition jobs purely on the stored status string and never probe the worker pid.

Fix

lib/job-control.mjs — every job-read path (buildStatusSnapshot, buildSingleJobSnapshot, resolveResultJob, resolveCancelableJob) routes through a reconcileJob: for a queued/running job with a finite pid, process.kill(pid, 0) → ESRCH means the worker is gone, so the job is persisted as failed:orphaned. Idempotent (a reconciled job is terminal, so later reads skip it); a live pid (deliverable or EPERM) or the current process's own pid is never touched.
lib/tracked-jobs.mjs — runTrackedJob installs exit/SIGTERM/SIGINT/SIGHUP finalizers that synchronously flush a terminal failed:interrupted record if the worker dies while still running (SIGKILL/OOM can't be trapped — those are caught by the reconcile above). Disarmed on normal completion. Defense in depth.
lib/render.mjs — renderStoredJobResult falls back to the job's log tail when no result payload was captured, so an orphaned job still yields its partial output instead of an empty "No result" message.

Tests

tests/job-control.test.mjs (new): dead-pid → failed:orphaned, live-pid untouched, result resolves a reconciled orphan. Full suite 86 → 89, all green (node --test tests/*.test.mjs).

🤖 Generated with Claude Code

A worker that dies before recording a terminal state (crash, SIGKILL, OOM, broker drop, session teardown) left its job stuck status:running forever — status showed a growing elapsed on a frozen updatedAt, and result threw 'still running' / 'no job found'. - job-control.mjs: every job-read path now routes through a liveness reconcile — process.kill(pid,0) -> ESRCH means the worker is gone, so the job is persisted as failed:orphaned. Idempotent; a live pid (or our own pid) is never touched. - tracked-jobs.mjs: the worker installs exit/SIGTERM/SIGINT/SIGHUP finalizers that flush a terminal failed:interrupted record on graceful death (SIGKILL/OOM are caught by the reconcile above). Defense in depth. - render.mjs: result falls back to the job's log tail when no payload was captured. - tests/job-control.test.mjs: dead-pid->orphaned, live-pid untouched, result resolves. 86 -> 89 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94c10e78be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T17:46:28Z

+    process.kill(pid, 0);
+    return true;


Detect defunct workers before treating pid as live

When the worker has exited but remains as a defunct/zombie process, process.kill(pid, 0) still succeeds because the PID is still present in the process table. In that state this returns true, so reconcileJob leaves the job in queued/running and /codex:status or /codex:result can keep reporting it as active forever—the case this change is intended to repair, especially in containers where PID 1 may not reap orphaned children. Add a zombie/heartbeat check before accepting signal-0 success as liveness.

Useful? React with 👍 / 👎.

…e worker A single app-server turn that never reports completion would await forever. The liveness-reconcile in job-control only reaps DEAD workers — a turn that hangs while the pid stays alive slips past it. captureTurn now races state.completion against a wall-clock deadline (CODEX_TURN_TIMEOUT_MS, default 45m, 0 disables); on expiry it best-effort turn/interrupts the stuck turn and throws, so the job finalizes failed instead of hanging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…family jobs Codex is used as a second model family for exactly two jobs; make both first-class and "like us": (a) Design critique — /codex:critique + codex-critique subagent + handleCritique run the new design-critique prompt READ-ONLY, with a hard mandate to ground every load-bearing claim in the code (file:line) AND the LIVE database (cite the query + result) — not prose-only. (b) Code review — a prompt-overlay seam (loadPromptTemplate appends prompts/overlays/<name>.md inside <house_conventions>) carries our adversarial-review rubric ON TOP of the upstream prompt, so "git pull upstream" never conflicts on the scaffold. The rubric: contradict don't echo, correctness-first, file:line, demand check.sh proof, flag scope creep. Also makes the brittle exact-list command test tolerant of fork additions, and adds critique-and-overlays.test.mjs to lock in both jobs + the timeout. 96/96 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…the two-job model Codex is a second model family for exactly two jobs — critique a design before building it (/codex:critique), and adversarially review completed work (/codex:adversarial-review). The plain /codex:review command was a strict subset of adversarial-review (same engine, same git targets, just no focus text and no house rubric), so it only created "which review?" confusion. - Remove the /codex:review command (the user-facing path). The companion `review` engine stays — the optional Stop review-gate uses it internally (runtime tests still cover it). - README: drop the review section + refs, add a /codex:critique section, frame the two-job model up front. adversarial-review is now "the one review command". - Skill routing + result.md follow-ups point at adversarial-review / critique. - Tests updated to match. 95/95 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fe2467952

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T18:57:29Z

+            await client.request("turn/interrupt", {
+              threadId: state.threadId,
+              turnId: state.turnId
+            });


Bound the interrupt request after turn timeout

When a turn times out because the broker/app-server is still connected but no longer answering JSON-RPC, this await client.request("turn/interrupt", ...) can hang indefinitely: CodexAppServerClient.request only settles on a response or connection close. That leaves the worker process alive after CODEX_TURN_TIMEOUT_MS, so the job remains running and the liveness reconcile cannot help; race the interrupt with a short timeout or fire it best-effort before throwing the timeout error.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T18:57:29Z

+
+  if (options.background) {
+    ensureCodexAvailable(cwd);
+    const job = buildTaskJob(workspaceRoot, taskMetadata, false);


Isolate critique jobs from rescue resumes

Critique runs are created with buildTaskJob, so they are stored as jobClass: "task" and later match findLatestResumableTaskJob/resolveLatestTrackedTaskThread in this file. After a /codex:critique completes, the next /codex:rescue --resume or resume prompt can resume the critique thread instead of the latest rescue task, mixing a read-only design critique conversation into an implementation task; give critique its own job class/kind or exclude it from the rescue resume lookup.

Useful? React with 👍 / 👎.

… an orphaned job /codex:critique previously defaulted to --background, which detached the review into a separate Codex job and returned an instant stub — the subagent finished with nothing and no real completion was produced. - agent: always foreground; strip --background/--wait (Claude-side dispatch controls); set max Bash timeout so long critiques are not cut short - handleCritique: parse+ignore --background/--wait (matches the review commands); remove the detach path; also fixes --wait leaking into design text - README: returns the critique directly; drop stale status/result polling note - bump 1.0.4 -> 1.0.5 + CHANGELOG so the plugin cache refreshes on update Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 332f95735f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T20:47:26Z

+    const turnTimeoutMs = resolveTurnTimeoutMs(options);
+    if (turnTimeoutMs > 0) {


Start the turn timeout before awaiting turn/start

When the app server or broker accepts the connection but never replies to the initial turn/start or review/start JSON-RPC, execution is still stuck on the earlier await startRequest() before this new Promise.race is reached. CodexAppServerClient.request has no per-request timeout and only settles on a response or connection close, so CODEX_TURN_TIMEOUT_MS does not prevent a live worker PID from staying running forever in that failure mode; race the start request itself, or add a request timeout, before waiting for state.completion.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T20:47:26Z

+  if (job.pid === process.pid || isPidAlive(job.pid)) {
+    return job;


Verify worker identity before trusting a live PID

If the worker exits and the OS reuses its PID before /codex:status, /codex:result, or /codex:cancel runs, this branch treats the unrelated process as the original live worker because signal 0 only proves some process with that PID exists. The orphaned job then remains active despite the worker being gone, and resolveCancelableJob can feed that same PID to terminateProcessTree during cancel, so use a worker identity check such as start time, a command marker, or a heartbeat rather than PID existence alone.

Useful? React with 👍 / 👎.

njrun1804 requested a review from a team June 14, 2026 17:41

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

njrun1804 and others added 3 commits June 14, 2026 14:12

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373

fix: reconcile dead-pid jobs so status/result stop reporting a zombie 'running' state#373
njrun1804 wants to merge 5 commits into
openai:mainfrom
njrun1804:main

njrun1804 commented Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		const turnTimeoutMs = resolveTurnTimeoutMs(options);
		if (turnTimeoutMs > 0) {

		if (job.pid === process.pid \|\| isPidAlive(job.pid)) {
		return job;

Conversation

njrun1804 commented Jun 14, 2026

Problem

Fix

Tests

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant