fix: preserve pending messages when LLM returns non-XML or error responses#1643
gchaix wants to merge 6 commits into thedotmack:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.

Use the following commands to manage reviews, or the checkboxes below for quick actions:
Summary by CodeRabbit

Release Notes

Walkthrough

Call sites now await and inspect a structured ProcessAgentResponseResult.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent as Agent (Gemini / OpenRouter / SDK)
    participant RP as ResponseProcessor
    participant Store as PendingMessageStore
    participant Session as Session
    Agent->>RP: processAgentResponse(response, session,...)
    alt Empty/whitespace && processingMessageIds present
        RP->>Store: confirmProcessed or markFailed(processingMessageIds)
        RP-->>Agent: {status: 'skipped' or 'empty', observationCount:0, summaryStored:false}
    else Plain-text rate-limit/auth detected
        RP->>Store: markFailed(processingMessageIds)
        RP-->>Agent: {status: 'rate_limited'|'error', observationCount:0, summaryStored:false}
        Agent->>Session: abort/rollback or remove last conversation entry
    else Successful XML/observations parsed
        RP->>Store: storeObservations(parsedObservations)
        RP-->>Agent: {status: 'ok', observationCount:N, summaryStored:bool}
        Agent->>Session: commit conversation history and continue
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/worker/OpenRouterAgent.ts (1)
203-223: ⚠️ Potential issue | 🟠 Major

Rollback retry-preserved prompts from `session.conversationHistory`.

The prompt is already appended at lines 190-191 and 241-242 before the LLM call. When `processAgentResponse()` returns `'rate_limited'` or `'error'`, that queue item is preserved for retry, so leaving the prompt in `session.conversationHistory` means the retry adds a second copy and every later OpenRouter request carries duplicated context. Pop the just-added prompt, or only commit it after a confirmed response.

💡 Minimal fix:

```diff
 );
+if (obsResult.status === 'rate_limited' || obsResult.status === 'error') {
+  session.conversationHistory.pop();
+}
 if (obsResult.status === 'rate_limited') {
   logger.warn('SDK', 'OpenRouter rate-limited during observation, aborting session', { sessionId: session.sessionDbId });
   return;
 }

 );
+if (summaryResult.status === 'rate_limited' || summaryResult.status === 'error') {
+  session.conversationHistory.pop();
+}
 if (summaryResult.status === 'rate_limited') {
   logger.warn('SDK', 'OpenRouter rate-limited during summary, aborting session', { sessionId: session.sessionDbId });
   return;
 }
```

Also applies to: 254-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker/OpenRouterAgent.ts` around lines 203 - 223, The code appends the prompt to session.conversationHistory before calling processAgentResponse, but when processAgentResponse returns 'rate_limited' or 'error' the item is preserved for retry and the pre-appended prompt becomes duplicated on the next attempt; locate the prompt append in the OpenRouterAgent method (the block around the LLM call where session.conversationHistory is mutated at the earlier append lines) and either (A) only push/commit the prompt after processAgentResponse returns a successful outcome, or (B) immediately remove the just-added prompt when obsResult.status is 'rate_limited' or 'error' (e.g., pop the last conversationHistory entry or remove the matching item) and apply the same change for the other observation/response block referenced (the region around lines 254-274) so retries do not accumulate duplicate context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 2509-2514: The repeated Markdown headings "## What's Changed" and
"## New Contributors" violate MD024 and lack the required blank line for MD022;
update those heading instances by making them unique (e.g., "## What's Changed —
vX.Y.Z" or "## New Contributors — vX.Y.Z") or convert them to non-heading text
(bold or plain) and ensure each heading is followed by a blank line; apply the
same change to other repeated blocks in the file to resolve both MD024 and
MD022.
- Around line 2241-2248: The markdown block around the Deleted/What remains
section is missing required blank lines, causing MD022/MD031; open CHANGELOG.md,
find the headings "### Deleted" and "### What remains", the fenced code block
that follows, and the "## Dependencies" heading, and insert a single blank line
before and after each of those headings and around the fenced code block so each
block is separated by blank lines (also apply the same fixes to the similar
block near the "2276-2277" area).
In `@src/services/worker/agents/ResponseProcessor.ts`:
- Around line 36-52: The regex in isRateLimitResponse is written as `/reset
s?/…/` which incorrectly requires a space between "reset" and the optional "s"
and therefore never matches phrases like "resets at 7pm"; update the pattern
used in isRateLimitResponse to match either "reset" or "resets" without a
spurious space (for example use a pattern equivalent to "resets?" followed by
whitespace and either "at <digit>" or "in <digit>" with case-insensitive
matching and optional word boundaries) so that strings like "resets at 7pm" and
"resets in 2 hours" are detected as rate-limited.
- Around line 160-196: The guard currently preserves only plain non-XML text,
allowing truncated/malformed XML (e.g. "<observation><type>...") to fall through
when observations.length === 0 and summary is falsy; update the logic in
ResponseProcessor (around variables text, observations, summary, agentName) to
also preserve messages whenever the parser produced no observations/summary even
if raw text contains XML markers: call preserveMessages(session, sessionManager,
worker, 'malformed_xml') and return error/rate_limited status as appropriate
(also reuse isRateLimitResponse/isAuthErrorResponse checks and logger calls), so
unparsable/truncated XML responses are never silently discarded before
storeObservations() / confirmProcessed().
---
Outside diff comments:
In `@src/services/worker/OpenRouterAgent.ts`:
- Around line 203-223: The code appends the prompt to
session.conversationHistory before calling processAgentResponse, but when
processAgentResponse returns 'rate_limited' or 'error' the item is preserved for
retry and the pre-appended prompt becomes duplicated on the next attempt; locate
the prompt append in the OpenRouterAgent method (the block around the LLM call
where session.conversationHistory is mutated at the earlier append lines) and
either (A) only push/commit the prompt after processAgentResponse returns a
successful outcome, or (B) immediately remove the just-added prompt when
obsResult.status is 'rate_limited' or 'error' (e.g., pop the last
conversationHistory entry or remove the matching item) and apply the same change
for the other observation/response block referenced (the region around lines
254-274) so retries do not accumulate duplicate context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a6cdeeb6-227d-4d5c-aba6-31d45418d8a1
📒 Files selected for processing (8)
- CHANGELOG.md
- src/services/worker/GeminiAgent.ts
- src/services/worker/OpenRouterAgent.ts
- src/services/worker/SDKAgent.ts
- src/services/worker/agents/ResponseProcessor.ts
- src/services/worker/agents/index.ts
- src/services/worker/agents/types.ts
- tests/worker/agents/response-processor.test.ts
- Fix regex bug in isRateLimitResponse: /reset s?/ had a spurious space before the optional 's', so 'resets at 7pm' never matched; corrected to /\bresets?\s+(at\s+\d|in\s+\d)/i
- Preserve messages on malformed/truncated XML responses: the previous guard only fired when text had no XML markers; truncated XML like '<observation><type>bugfix' would pass the guard and fall through to storeObservations()+confirmProcessed(), recreating the message-loss bug; now also fires when XML markers are present but the parser yields nothing (unless it's an intentional <skip_summary>-only response)
- Roll back OpenRouter conversationHistory on rate_limited or error: the prompt is pushed before the LLM call; on failure the session object is reused on the next retry claim, causing duplicate prompts to accumulate; pop the last entry before returning on error/rate_limited
- Add test for malformed/truncated XML preservation (25 tests, all pass)
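The isRateLimitResponse fix described above can be checked in isolation. A minimal sketch, assuming nothing beyond the two patterns quoted in the commit message (the wrapper name `matchesReset` is illustrative; the repo's helper is `isRateLimitResponse`):

```typescript
// Corrected pattern from the commit: \bresets? matches "reset" or "resets",
// then whitespace, then "at <digit>" or "in <digit>", case-insensitively.
const RESET_PATTERN = /\bresets?\s+(at\s+\d|in\s+\d)/i;

// The buggy variant required a literal space between "reset" and the optional "s".
const BUGGY_PATTERN = /\breset s?\s+(at\s+\d|in\s+\d)/i;

function matchesReset(text: string): boolean {
  return RESET_PATTERN.test(text);
}

console.log(matchesReset('Rate limit exceeded, resets at 7pm')); // true
console.log(matchesReset('quota resets in 2 hours'));            // true
console.log(matchesReset('please reset your password'));         // false
console.log(BUGGY_PATTERN.test('resets at 7pm'));                // false: the spurious space never matches
```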
Fixed in cdda789 (re: the outside-diff comment on OpenRouterAgent.ts lines 203-223 and 254-274). The issue was real: session.conversationHistory is persisted on the in-memory ActiveSession object which is reused across retry claims (SessionManager.initializeSession returns the cached session if it already exists). So a prompt pushed before a failed LLM call would accumulate on subsequent retries. Fix: added a pop() of the last conversationHistory entry immediately after processAgentResponse returns rate_limited or error, for both the observation and summary call sites.
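The accumulation described here is easy to reproduce with a stripped-down session object. A sketch of the pattern, not the actual OpenRouterAgent code (the session shape is simplified to the one field that matters):

```typescript
// Simplified ActiveSession: only conversationHistory matters for this bug.
type Turn = { role: 'user' | 'assistant'; content: string };
type Status = 'ok' | 'rate_limited' | 'error';

const session = { conversationHistory: [] as Turn[] };

function attempt(prompt: string, status: Status): void {
  // The prompt is pushed before the LLM call...
  session.conversationHistory.push({ role: 'user', content: prompt });
  // ...so on a non-ok result it must be popped, because the same in-memory
  // session object is reused on the next retry claim.
  if (status === 'rate_limited' || status === 'error') {
    session.conversationHistory.pop();
  }
}

attempt('summarize these tool calls', 'rate_limited'); // rolled back
attempt('summarize these tool calls', 'ok');           // committed
console.log(session.conversationHistory.length);       // 1: without the pop it would be 2
```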
cdda789 to d8a24cb
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/services/worker/SDKAgent.ts (1)
264-284: ⚠️ Potential issue | 🟠 Major

Missing 'error' status handling and conversation history rollback.

SDKAgent only handles the `'rate_limited'` status but ignores the `'error'` status. Per ResponseProcessor.ts (context snippet 1), processAgentResponse returns `{ status: 'error' }` for:

- Empty responses with pending messages (line 153)
- Auth errors (line 194)
- Malformed/truncated XML (line 203)

Additionally, OpenRouterAgent (context snippet 2, lines 218-226) handles both statuses and rolls back the conversation history:

```ts
if (obsResult.status === 'rate_limited' || obsResult.status === 'error') {
  session.conversationHistory.pop();
}
```

Without this handling:

- Auth errors and malformed XML responses are silently ignored, allowing the loop to continue with corrupted state
- The pushed assistant response accumulates in conversationHistory on retries, causing context drift

Suggested fix (align with OpenRouterAgent pattern):

```diff
 // Parse and process response using shared ResponseProcessor
 const result = await processAgentResponse(
   textContent, session, this.dbManager, this.sessionManager, worker,
   discoveryTokens, originalTimestamp, 'SDK', cwdTracker.lastCwd, modelId
 );
+// Roll back conversation history on error/rate-limit to prevent accumulation
+if (result.status === 'rate_limited' || result.status === 'error') {
+  session.conversationHistory.pop();
+}
+
 // Rate-limited: abort — subsequent calls will also fail
 if (result.status === 'rate_limited') {
   logger.warn('SDK', 'Aborting session due to rate limiting', {
     sessionDbId: session.sessionDbId
   });
   session.abortController.abort();
   return;
 }
+
+// Error: abort — auth errors and malformed responses indicate non-recoverable state
+if (result.status === 'error') {
+  logger.warn('SDK', 'Aborting session due to error response', {
+    sessionDbId: session.sessionDbId
+  });
+  session.abortController.abort();
+  return;
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker/SDKAgent.ts` around lines 264 - 284, process the 'error' status returned by processAgentResponse in SDKAgent the same way OpenRouterAgent does: after calling processAgentResponse (in SDKAgent.ts) check if result.status === 'rate_limited' || result.status === 'error' and if so pop the last message from session.conversationHistory; if result.status === 'rate_limited' additionally log the warning and abort via session.abortController.abort() and return, whereas for 'error' just rollback the conversationHistory and return so retries don't accumulate corrupted assistant messages.

src/services/worker/agents/ResponseProcessor.ts (1)

139-154: ⚠️ Potential issue | 🟠 Major

Conversation history entry is not rolled back on error paths.

Line 141 pushes the assistant text to `conversationHistory` unconditionally (when text is non-empty). However, for the unproductive response guards (lines 179-203), the function returns early after the push but without popping the entry. This leaves error/rate-limit response text in the conversation history, which can cause accumulation on retries. The guards at lines 147-153 are fine since empty responses don't push anything. But the non-empty error paths need rollback.

Suggested fix:

```diff
 if (isRateLimitResponse(text)) {
   logger.warn('PARSER', `${agentName} returned rate-limit response; messages preserved for retry`, { sessionId: session.sessionDbId, preview });
+  // Roll back the assistant response we just pushed
+  session.conversationHistory.pop();
   preserveMessages(session, sessionManager, worker, 'rate_limit');
   return { status: 'rate_limited', observationCount: 0, summaryStored: false };
 }
 if (isAuthErrorResponse(text)) {
   logger.error('PARSER', `${agentName} returned auth error response; messages preserved for retry`, { sessionId: session.sessionDbId, preview });
+  // Roll back the assistant response we just pushed
+  session.conversationHistory.pop();
   preserveMessages(session, sessionManager, worker, 'auth_error');
   return { status: 'error', observationCount: 0, summaryStored: false };
 }
 const reason = hasXmlMarkers ? 'malformed_xml' : 'non_xml';
 logger.warn('PARSER', `${agentName} returned ${reason === 'malformed_xml' ? 'malformed/truncated XML' : 'non-XML'} response; messages preserved for retry`, { sessionId: session.sessionDbId, preview });
+// Roll back the assistant response we just pushed
+session.conversationHistory.pop();
 preserveMessages(session, sessionManager, worker, reason);
 return { status: 'error', observationCount: 0, summaryStored: false };
```

Alternatively, if the rollback is intended to be handled by callers (as mentioned in PR comments), then ResponseProcessor should document this contract clearly. The context snippet 2 shows OpenRouterAgent doing the pop at the call site, which creates inconsistency with this centralized approach.

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker/agents/ResponseProcessor.ts` around lines 139 - 154, The code pushes assistant text into session.conversationHistory via session.conversationHistory.push({ role: 'assistant', content: text }) before several early-return error guards, which leaves erroneous/rate-limit responses in history on retry; fix by ensuring any early-return paths (e.g., the empty-response guard that calls preserveMessages(session, sessionManager, worker, 'empty_response') and others that return { status: 'error', ... }) remove the last pushed entry (pop) or move the push until after all guards succeed so conversationHistory is only mutated on final successful responses, keeping the behavior consistent with callers like OpenRouterAgent that currently pop on error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/services/worker/agents/ResponseProcessor.ts`:
- Around line 139-154: The code pushes assistant text into
session.conversationHistory via session.conversationHistory.push({ role:
'assistant', content: text }) before several early-return error guards, which
leaves erroneous/rate-limit responses in history on retry; fix by ensuring any
early-return paths (e.g., the empty-response guard that calls
preserveMessages(session, sessionManager, worker, 'empty_response') and others
that return { status: 'error', ... }) remove the last pushed entry (pop) or move
the push until after all guards succeed so conversationHistory is only mutated
on final successful responses, keeping the behavior consistent with callers like
OpenRouterAgent that currently pop on error.
In `@src/services/worker/SDKAgent.ts`:
- Around line 264-284: process the 'error' status returned by
processAgentResponse in SDKAgent the same way OpenRouterAgent does: after
calling processAgentResponse (in SDKAgent.ts) check if result.status ===
'rate_limited' || result.status === 'error' and if so pop the last message from
session.conversationHistory; if result.status === 'rate_limited' additionally
log the warning and abort via session.abortController.abort() and return,
whereas for 'error' just rollback the conversationHistory and return so retries
don't accumulate corrupted assistant messages.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 285b78d1-52d1-4fb7-a267-b0b2b5985c8c
📒 Files selected for processing (7)
- src/services/worker/GeminiAgent.ts
- src/services/worker/OpenRouterAgent.ts
- src/services/worker/SDKAgent.ts
- src/services/worker/agents/ResponseProcessor.ts
- src/services/worker/agents/index.ts
- src/services/worker/agents/types.ts
- tests/worker/agents/response-processor.test.ts
✅ Files skipped from review due to trivial changes (2)
- src/services/worker/agents/types.ts
- src/services/worker/GeminiAgent.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- src/services/worker/agents/index.ts
- src/services/worker/OpenRouterAgent.ts
- tests/worker/agents/response-processor.test.ts
…onses

Previously, processAgentResponse() called confirmProcessed() (which DELETEs pending_messages rows) regardless of whether the LLM actually produced parseable XML. Any non-XML response — auth errors, rate limits, empty responses, or garbled output — would silently delete the queued tool interactions with zero observations stored. This was observed in production when stale auth tokens caused the Claude SDK to return '401 Invalid bearer token' as response text. Hours of observations were permanently lost.

Changes:

- Add ProcessAgentResponseResult type (status: ok|rate_limited|error|empty) so callers can inspect what happened
- Add isRateLimitResponse() and isAuthErrorResponse() classification helpers
- Add preserveMessages() helper that calls markFailed() on all in-flight message IDs, routing them through the existing retry logic (up to 3 attempts) rather than silently deleting them
- processAgentResponse() now returns early with markFailed() on all non-XML paths: empty response with pending messages, rate-limit text, auth error text, and any other unknown non-XML output
- All three agent call sites (SDKAgent, GeminiAgent, OpenRouterAgent) updated to capture the return value and abort the session on rate_limited to prevent wasting further LLM calls
- GeminiAgent's existing "leave for stale recovery" pattern (5-min wait) replaced by direct markFailed() calls for immediate retry
- Tests updated: non-XML test now asserts markFailed() is called and storeObservations() is NOT called; 7 new test cases cover rate-limit, auth error, empty-with-ids, empty-without-ids, and detection functions
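The result contract this commit introduces can be sketched as a discriminated record. This is an approximation based only on the PR text (the union below includes the 'skipped' status a later commit in this PR adds, and the exact shape in types.ts may differ):

```typescript
// Approximation of ProcessAgentResponseResult as described in the PR.
type ProcessAgentResponseResult = {
  status: 'ok' | 'rate_limited' | 'error' | 'empty' | 'skipped';
  observationCount: number;
  summaryStored: boolean;
};

// Callers branch on status instead of inferring success from side effects.
// Per the PR, only rate limiting warrants aborting the session, since
// subsequent API calls will also fail; errors preserve messages for retry.
function shouldAbortSession(result: ProcessAgentResponseResult): boolean {
  return result.status === 'rate_limited';
}

console.log(shouldAbortSession({ status: 'rate_limited', observationCount: 0, summaryStored: false })); // true
console.log(shouldAbortSession({ status: 'error', observationCount: 0, summaryStored: false }));        // false
```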
Two related issues:

1. ResponseProcessor was pushing the assistant response to conversationHistory before the early-return guards. On any error path (empty response, rate limit, auth error, non-XML, malformed XML) the bad response text was left in history, so the next retry attempt would carry corrupted context. Fix: move the push to after all guards succeed. Callers that push a user prompt before calling processAgentResponse can then pop their push (one level) on non-ok status to keep history clean.

2. SDKAgent handled rate_limited by aborting but did not pop conversationHistory on either rate_limited or error. On error it also continued rather than returning, allowing subsequent loop iterations to run after a failed response. Fix: pop conversationHistory on both rate_limited and error (mirrors what OpenRouterAgent already does), then return immediately on error (matching the abort-and-return on rate_limited).

Adds three new tests asserting conversationHistory is empty after empty-response, non-XML, and rate-limit error paths. 28 tests, all pass.
The `return` on `result.status === 'error'` added in d184b67 caused the SDK subprocess to be killed on every empty response, triggering crash-recovery loops that burned through the 3-restart limit. This was observed in production: after a worker restart, INIT prompts returned empty responses transiently. The `return` killed the generator, crash-recovery restarted it, another empty response killed it again, and after 3 cycles all pending messages were permanently stuck as failed.

Fix: remove the `return` on error so the generator loop continues to the next message. The failed messages are already preserved via markFailed() for retry. Only rate_limited warrants aborting, since subsequent API calls will also fail.

Before this commit (d184b67): error → return → kill subprocess → restart
After this commit: error → continue loop → process next message
…race fix
- PendingMessageStore: add drainSessionMessages() — on session end, requeues
messages with retries remaining instead of permanently failing them all.
Deprecates markAllSessionMessagesAbandoned().
- worker-service + SessionCompletionHandler: migrate both drain call sites to
drainSessionMessages(), logging { failed, requeued } counts.
- ResponseProcessor: treat empty LLM response as intentional skip (confirmed
+ deleted) instead of an error requiring retry. Adds 'skipped' to the
ProcessAgentResponseResult status union.
- SessionRoutes: fix race condition in handleStatusByClaudeId — when session
is not found in the active map, fall back to DB getPendingCount() instead of
hardcoding queueLength: 0. This was causing the Stop hook to exit early
before summarize messages were processed, explaining low summary counts.
- summarize.ts: add status?: string to queue-status type assertion to match
the race fix above.
- tests: update response-processor tests for new skip semantics; add 8-test
suite for drainSessionMessages() covering retries, exhaustion, isolation,
and edge cases.
- scripts: add inspect-failed.sh and requeue-failed.sh for operational
debugging of stuck pending_messages.
…empty observations

- tag-stripping: add TASK_NOTIFICATION_REGEX to strip <task-notification> blocks from user prompts. Claude Code injects these into the UserPromptSubmit hook when a background task completes at prompt submission time; they are system events, not user content, and were being persisted as prompt records.
- prompts.ts: strengthen summary mode-switch header with visual === banners, CRITICAL/STOP language, and a closing reminder. The observer subprocess was returning <observation> tags for summary requests due to established conversation context overriding a weak mode-switch instruction.
- parser.ts: skip observation shells with no title, narrative, or facts before storage. The LLM occasionally emits <observation><type>bugfix</type></observation> placeholders with no content, which appeared in the UI as blank "Untitled" cards. The "ALWAYS save observations" directive applies to observations with content.
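The <task-notification> stripping described above can be sketched with a simple regex. The actual TASK_NOTIFICATION_REGEX in the PR may differ, so treat this pattern as an assumption:

```typescript
// Strip <task-notification> blocks that Claude Code injects into the
// UserPromptSubmit hook; they are system events, not user content.
// Assumed pattern; the repo's TASK_NOTIFICATION_REGEX may be stricter.
const TASK_NOTIFICATION_REGEX = /<task-notification>[\s\S]*?<\/task-notification>\s*/g;

function stripTaskNotifications(prompt: string): string {
  return prompt.replace(TASK_NOTIFICATION_REGEX, '').trim();
}

const raw = '<task-notification>background build finished</task-notification>\nfix the login bug';
console.log(stripTaskNotifications(raw)); // "fix the login bug"
```

The lazy `[\s\S]*?` quantifier keeps each match inside a single block, so a prompt containing several notifications loses only the notifications, not the user text between them.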
85a537f to 6697aac
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/sdk/parser.ts (1)
167-174: ⚠️ Potential issue | 🟡 Minor

Preserve notes-only summaries.

`notes` is a first-class field in `ParsedSummary`, but this false-positive guard ignores it. `<summary><notes>...</notes></summary>` now returns `null` even though the comment above says summaries should still be saved when fields are missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sdk/parser.ts` around lines 167 - 174, The false-positive guard in the parser uses the condition if (!request && !investigated && !learned && !completed && !next_steps) which ignores the first-class ParsedSummary field notes and causes summaries that only contain <notes> to be dropped; update the guard to include notes (e.g., if (!request && !investigated && !learned && !completed && !next_steps && !notes)) so summaries with notes are preserved, leaving the logger.warn and return null behavior unchanged when truly empty.

src/services/worker/OpenRouterAgent.ts (1)
203-228: ⚠️ Potential issue | 🟠 Major

Roll back prompts when the OpenRouter call throws, not just on parsed error statuses.

These pops only run after `processAgentResponse()` returns. If `queryOpenRouterMultiTurn()` throws for 401/429/network/API errors, the prompt pushed at line 190 or line 246 stays in `conversationHistory`, and the next retry replays stale context. Add a local rollback path around each request as well.

Also applies to: 259-284
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker/OpenRouterAgent.ts` around lines 203 - 228, The code only rolls back the prompt after processAgentResponse() returns, so if queryOpenRouterMultiTurn() throws (401/429/network/API errors) the pushed prompt remains in session.conversationHistory; wrap each call to queryOpenRouterMultiTurn() in a try/catch (both the observation call and the later call referenced at the other site) and in the catch block immediately pop the last entry from session.conversationHistory to undo the push, then log the error and rethrow or return consistent with existing error handling (mirroring the existing rollback/return logic used when obsResult.status indicates rate_limited/error); reference queryOpenRouterMultiTurn, processAgentResponse, and session.conversationHistory.pop for locating the changes.

src/services/worker/SDKAgent.ts (1)
264-297: ⚠️ Potential issue | 🟠 Major

The stale-prompt rollback still misses thrown auth failures.

This only pops when `processAgentResponse()` returns `rate_limited` or `error`. The explicit 'Invalid API key' throw at line 259 happens after the current user prompt has already been pushed in `createMessageGenerator()`, so retries on the same `ActiveSession` can still accumulate duplicate prompts. The rollback needs to run in the per-message exception path too.

src/services/worker/GeminiAgent.ts (1)
176-194: ⚠️ Potential issue | 🟠 Major

Rollback the failed turn from `conversationHistory` before returning.

These early returns leave the just-appended prompt/response in the cached `ActiveSession`, so the next retry replays duplicate context. That is the exact retry-history leak called out in the PR notes, and it still happens here for init/observation/summary failures.

♻️ Suggested fix:

```diff
+const initHistoryBase = session.conversationHistory.length;
 session.conversationHistory.push({ role: 'user', content: initPrompt });
 const initResponse = await this.queryGeminiMultiTurn(session.conversationHistory, apiKey, model, rateLimitingEnabled);
 if (initResponse.content) {
   session.conversationHistory.push({ role: 'assistant', content: initResponse.content });
   // ...
   const initResult = await processAgentResponse(/* ... */);
   if (initResult.status === 'rate_limited' || initResult.status === 'error') {
+    session.conversationHistory.splice(initHistoryBase);
     logger.warn('SDK', `Gemini init response failed (${initResult.status}), aborting session`, { sessionId: session.sessionDbId });
     return;
   }
 }
```

Apply the same `historyBase`/`splice()` pattern to the observation and summary turns before any non-`ok` early return.

Also applies to: 255-276, 307-328
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker/GeminiAgent.ts` around lines 176 - 194, The early returns after calling processAgentResponse (e.g., handling initResult.status === 'rate_limited' || 'error') leave the just-appended turn in the ActiveSession.conversationHistory; before any non-'ok' return you must rollback that failed turn using the same historyBase/splice pattern: capture historyBase prior to appending (or compute index where the new turn(s) were pushed) and call conversationHistory.splice(historyBase) to remove the appended init/observation/summary turn(s) so retries don’t replay duplicate context; apply this fix in the init, observation, and summary handling blocks around processAgentResponse in GeminiAgent (use the existing historyBase variable and splice() pattern already used elsewhere).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/inspect-failed.sh`:
- Around line 71-84: The SQL uses unixepoch on millisecond timestamps and
references a non-existent column; update the datetime conversions to divide ms
by 1000 (e.g. datetime(created_at_epoch/1000, 'unixepoch', 'localtime')) and
replace the invalid failed_at_epoch reference with the actual schema column (or
a safe COALESCE fallback) in the SELECT for the pending_messages query inside
scripts/inspect-failed.sh; ensure both created_at_epoch and the
failure-timestamp column you choose are divided by 1000 and wrapped with
COALESCE(..., '—') so the while read loop (id, sess_id, msg_type, ...) can run
without errors.
In `@src/sdk/parser.ts`:
- Around line 53-62: The current emptiness guard uses hasContent = !!(title ||
narrative || facts.length > 0) which wrongly treats observations that only have
persisted fields like files_read or concept entries as empty; update the
hasContent check in src/sdk/parser.ts to also include all other persisted
observation fields (e.g., files_read/filesRead, concepts, references,
attachments or any other stored arrays/objects) using proper truthiness checks
(e.g., array.length > 0 or Object.keys(obj).length > 0) so those observations
are not skipped, and adjust the logger message near the hasContent check (and
any related logic in ResponseProcessor) to reflect which fields were considered
when deciding emptiness.
In `@src/services/sqlite/PendingMessageStore.ts`:
- Around line 297-303: The UPDATE uses a non-existent column failed_at_epoch on
the pending_messages table; change the UPDATE in failStmt (and any related uses)
to set the existing terminal timestamp column completed_at_epoch instead, i.e.
use completed_at_epoch = ? with the same now parameter, or if you intend a
separate failure timestamp add a DB migration to create failed_at_epoch and
update all readers/writers consistently (refer to failStmt, pending_messages,
failed_at_epoch, completed_at_epoch, and this.maxRetries to locate the code).
In `@src/services/worker/agents/ResponseProcessor.ts`:
- Around line 189-205: The auth-path in ResponseProcessor currently treats
isAuthErrorResponse(text) like a generic error; update it to return a distinct
fatal status (e.g., return { status: 'auth_error', observationCount: 0,
summaryStored: false }) instead of { status: 'error' ... } so callers can abort
retries; keep the preserveMessages(session, sessionManager, worker,
'auth_error') call and the existing log, and ensure any callers that switch on
returned status can handle the 'auth_error' value.
- Around line 139-155: The code currently treats any empty text as an
intentional skip and immediately confirms/clears session.processingMessageIds
(using pendingStore.confirmProcessed and cleanupProcessedMessages), which can
drop real queued work; change the guard in ResponseProcessor (the block that
checks if (!text.trim() && session.processingMessageIds.length > 0)) to only
treat an empty response as a skip when the observation includes an explicit skip
marker (e.g., observation.intentionalSkip or a new boolean returned by
OpenRouterAgent.queryOpenRouterMultiTurn), and otherwise do not call
pendingStore.confirmProcessed or clear session.processingMessageIds—leave the
messages for retry/inspection and return a non-skipped error status; update the
agent call (OpenRouterAgent.queryOpenRouterMultiTurn) to propagate this explicit
flag if necessary and adjust logging to reflect whether the empty response was
an explicit skip or an upstream failure.
---
Outside diff comments:
In `@src/sdk/parser.ts`:
- Around line 167-174: The false-positive guard in the parser uses the condition
if (!request && !investigated && !learned && !completed && !next_steps) which
ignores the first-class ParsedSummary field notes and causes summaries that only
contain <notes> to be dropped; update the guard to include notes (e.g., if
(!request && !investigated && !learned && !completed && !next_steps && !notes)
so summaries with notes are preserved, leaving the logger.warn and return null
behavior unchanged when truly empty.
In `@src/services/worker/OpenRouterAgent.ts`:
- Around line 203-228: The code only rolls back the prompt after
processAgentResponse() returns, so if queryOpenRouterMultiTurn() throws
(401/429/network/API errors) the pushed prompt remains in
session.conversationHistory; wrap each call to queryOpenRouterMultiTurn() in a
try/catch (both the observation call and the later call referenced at the other
site) and in the catch block immediately pop the last entry from
session.conversationHistory to undo the push, then log the error and rethrow or
return consistent with existing error handling (mirroring the existing
rollback/return logic used when obsResult.status indicates rate_limited/error);
reference queryOpenRouterMultiTurn, processAgentResponse, and
session.conversationHistory.pop for locating the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e764ab7-2e70-47e0-9e08-187ff4659353
📒 Files selected for processing (21)
- plugin/scripts/mcp-server.cjs
- plugin/scripts/worker-service.cjs
- plugin/ui/viewer-bundle.js
- scripts/inspect-failed.sh
- scripts/requeue-failed.sh
- src/cli/handlers/summarize.ts
- src/sdk/parser.ts
- src/sdk/prompts.ts
- src/services/sqlite/PendingMessageStore.ts
- src/services/worker-service.ts
- src/services/worker/GeminiAgent.ts
- src/services/worker/OpenRouterAgent.ts
- src/services/worker/SDKAgent.ts
- src/services/worker/agents/ResponseProcessor.ts
- src/services/worker/agents/index.ts
- src/services/worker/agents/types.ts
- src/services/worker/http/routes/SessionRoutes.ts
- src/services/worker/session/SessionCompletionHandler.ts
- src/utils/tag-stripping.ts
- tests/services/sqlite/pending-message-drain.test.ts
- tests/worker/agents/response-processor.test.ts
✅ Files skipped from review due to trivial changes (1)
- src/cli/handlers/summarize.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/services/worker/agents/index.ts
- src/services/worker/agents/types.ts
sqlite3 -separator $'\t' "$DB" "
  SELECT
    id,
    content_session_id,
    message_type,
    COALESCE(tool_name, '—'),
    retry_count,
    COALESCE(datetime(created_at_epoch, 'unixepoch', 'localtime'), '—'),
    COALESCE(datetime(failed_at_epoch, 'unixepoch', 'localtime'), '—'),
    COALESCE(substr(tool_response, 1, 120), substr(last_assistant_message, 1, 120), '—')
  FROM pending_messages
  WHERE ${WHERE}
  ORDER BY id;
" | while IFS=$'\t' read -r id sess_id msg_type tool retries created failed preview; do
The inspection query is using the wrong timestamp fields and units.
pending_messages stores Date.now()-style millisecond epochs, so datetime(..., 'unixepoch') prints bogus dates unless you divide by 1000. On top of that, failed_at_epoch is not in the current schema, so this query will error before the script can display anything.
🛠️ Suggested fix
COALESCE(tool_name, '—'),
retry_count,
- COALESCE(datetime(created_at_epoch, 'unixepoch', 'localtime'), '—'),
- COALESCE(datetime(failed_at_epoch, 'unixepoch', 'localtime'), '—'),
+ COALESCE(datetime(created_at_epoch / 1000, 'unixepoch', 'localtime'), '—'),
+ COALESCE(datetime(completed_at_epoch / 1000, 'unixepoch', 'localtime'), '—'),
COALESCE(substr(tool_response, 1, 120), substr(last_assistant_message, 1, 120), '—')
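For reference, the unit mismatch described above is easy to demonstrate in TypeScript — a small, self-contained illustration (epochMsToIso is a hypothetical helper, not code from the PR):

```typescript
// SQLite's 'unixepoch' modifier expects seconds, while pending_messages
// stores Date.now()-style milliseconds; fed raw, the value is read as
// ~1.7 trillion seconds (tens of thousands of years out of range).
function epochMsToIso(epochMs: number): string {
  return new Date(epochMs).toISOString(); // Date takes milliseconds directly
}

const createdAtEpoch = 1700000000000; // a Date.now()-style sample value
// Dividing by 1000 before datetime(..., 'unixepoch') recovers this instant:
console.log(epochMsToIso(createdAtEpoch)); // 2023-11-14T22:13:20.000Z
```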
// Guard: skip observations with no meaningful content (title, narrative, facts all empty).
// These are LLM-generated shells — e.g. <observation><type>bugfix</type></observation> —
// that provide no value and pollute the UI with blank "Untitled" cards.
// NOTE: "ALWAYS save observations" (10/24/2025) applies to observations with content;
// it was never intended to preserve completely empty shells.
const hasContent = !!(title || narrative || facts.length > 0);
if (!hasContent) {
  logger.warn('PARSER', 'Skipping empty observation shell (no title, narrative, or facts)', { correlationId, type });
  continue;
}
Don't treat file-only or concept-only observations as empty.
This guard only counts title, narrative, and facts as meaningful content. A structured response like <observation><type>bugfix</type><files_read>...</files_read></observation> now gets dropped here, and ResponseProcessor will classify the whole response as malformed/empty instead of storing it or treating it as a valid observation. Include the other persisted fields in the emptiness check too.
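A sketch of the broader emptiness check the review suggests, so observations carrying only file or concept data are not dropped. The field names below are taken from the review text and are assumptions about the real ParsedObservation shape:

```typescript
interface ParsedObservationLike {
  title?: string;
  narrative?: string;
  facts: string[];
  filesRead: string[];
  concepts: string[];
}

// An observation is a discardable shell only when *every* persisted field
// is empty, not just title/narrative/facts.
function hasMeaningfulContent(obs: ParsedObservationLike): boolean {
  return Boolean(
    obs.title ||
    obs.narrative ||
    obs.facts.length > 0 ||
    obs.filesRead.length > 0 ||   // file-only observations still count
    obs.concepts.length > 0       // concept-only observations still count
  );
}
```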
const failStmt = this.db.prepare(`
  UPDATE pending_messages
  SET status = 'failed', failed_at_epoch = ?
  WHERE session_db_id = ? AND status IN ('pending', 'processing')
    AND retry_count >= ?
`);
const failResult = failStmt.run(now, sessionDbId, this.maxRetries);
Use the schema’s terminal timestamp column here.
pending_messages does not have a failed_at_epoch column in the current schema; it only has completed_at_epoch. This update will fail as soon as a drained message has exhausted retries, and the new test/script changes are already assuming the same nonexistent column downstream.
🛠️ Suggested fix
const failStmt = this.db.prepare(`
UPDATE pending_messages
- SET status = 'failed', failed_at_epoch = ?
+ SET status = 'failed', completed_at_epoch = ?
WHERE session_db_id = ? AND status IN ('pending', 'processing')
AND retry_count >= ?
`);
If you want a dedicated failure timestamp, add a migration first and update all readers/writers consistently.
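The corrected statement can be sketched as a plain SQL string against the schema's existing completed_at_epoch column. buildFailStatement is a hypothetical helper for illustration; the real code prepares this SQL directly via this.db.prepare():

```typescript
// Mark exhausted pending/processing rows as failed, stamping the schema's
// existing terminal-timestamp column rather than a nonexistent failed_at_epoch.
function buildFailStatement(): string {
  return `
    UPDATE pending_messages
    SET status = 'failed', completed_at_epoch = ?
    WHERE session_db_id = ? AND status IN ('pending', 'processing')
      AND retry_count >= ?
  `;
}
```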
// --- GUARD: Empty response with pending messages ---
// The observation prompt explicitly allows the LLM to return empty to skip trivial
// tool uses. Genuine API failures produce non-empty error text (rate-limit notices,
// auth errors, HTTP bodies), so an empty string here is an intentional skip, not a
// transient error. Confirm and delete the messages rather than retrying them.
if (!text.trim() && session.processingMessageIds.length > 0) {
  const pendingStore = sessionManager.getPendingMessageStore();
  for (const messageId of session.processingMessageIds) {
    pendingStore.confirmProcessed(messageId);
  }
  logger.info('QUEUE', `SKIPPED | sessionDbId=${session.sessionDbId} | reason=intentional_skip | count=${session.processingMessageIds.length} | ids=[${session.processingMessageIds.join(',')}]`, {
    sessionId: session.sessionDbId
  });
  session.processingMessageIds = [];
  cleanupProcessedMessages(session, worker);
  return { status: 'skipped', observationCount: 0, summaryStored: false };
}
Empty responses can still drop queued messages.
This branch confirms and clears processingMessageIds before the caller can distinguish an intentional skip from an upstream failure. OpenRouterAgent.queryOpenRouterMultiTurn() already normalizes one API failure mode to content: '', so this path can still delete real queued work on empty provider responses. If empty strings need to mean “skip”, that signal should be explicit rather than inferred from !text.trim().
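The explicit skip signal the review proposes can be sketched like this. QueryResult and classifyResponse are hypothetical shapes — the real change would thread an equivalent flag through queryOpenRouterMultiTurn():

```typescript
interface QueryResult {
  content: string;
  intentionalSkip: boolean; // true only when the model deliberately returned nothing
}

type Disposition = 'process' | 'confirm_skip' | 'preserve';

// Decide what to do with queued messages instead of inferring a skip
// from !text.trim() alone.
function classifyResponse(result: QueryResult): Disposition {
  if (result.content.trim().length > 0) return 'process'; // hand off to the parser
  // Empty body: only drop the queued messages when the skip is explicit;
  // otherwise leave them for retry/inspection.
  return result.intentionalSkip ? 'confirm_skip' : 'preserve';
}
```

With this split, an empty body from a provider failure lands in 'preserve' and keeps the pending rows, while a deliberate model skip still gets confirmed and cleaned up.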
if (isAuthErrorResponse(text)) {
  logger.error('PARSER', `${agentName} returned auth error response; messages preserved for retry`, {
    sessionId: session.sessionDbId,
    preview
  });
  preserveMessages(session, sessionManager, worker, 'auth_error');
  return { status: 'error', observationCount: 0, summaryStored: false };
}

const reason = hasXmlMarkers ? 'malformed_xml' : 'non_xml';
logger.warn('PARSER', `${agentName} returned ${reason === 'malformed_xml' ? 'malformed/truncated XML' : 'non-XML'} response; messages preserved for retry`, {
  sessionId: session.sessionDbId,
  preview
});
preserveMessages(session, sessionManager, worker, reason);
return { status: 'error', observationCount: 0, summaryStored: false };
}
Auth failures need a distinct result status.
isAuthErrorResponse() identifies a deterministic credential/configuration failure, but this still returns status: 'error'. The current callers only abort on rate_limited, so an expired token can burn one retry for every queued message in the same session before anything stops. Returning a dedicated auth_error/fatal status would let agents abort immediately while keeping transient parse errors retryable.
Closed during the April 2026 backlog cleanup. The underlying bug is now tracked in #1874, which is the single canonical issue being addressed by the maintainer. Thanks for taking the time to report — your symptoms and repro are captured in the consolidated ticket. |
processAgentResponse() called confirmProcessed() (which DELETEs the pending_messages row) regardless of whether the LLM actually returned parseable XML. Any unexpected response — auth errors, rate limits, empty responses, garbled output — silently deleted the queued tool interactions with zero observations stored.
This was observed in production when stale auth tokens caused the Claude SDK to return 401 Invalid bearer token as response text. The parser found no XML, the warning was logged, and then confirmProcessed() deleted the queue entry.
What Changed
ResponseProcessor.ts — the core fix:
- isRateLimitResponse() and isAuthErrorResponse() classification helpers
- preserveMessages() — calls markFailed() on all in-flight message IDs, routing them through the existing retry logic (up to 3 retries, then permanently failed but visible in the queue UI) rather than silently deleting them
- processAgentResponse() now returns ProcessAgentResponseResult (status: ok | rate_limited | error | empty) instead of void
- failure classes routed through preserveMessages(): error, rate_limited, error, error
Agent call sites — all 6 processAgentResponse calls updated:
- SDKAgent.ts — captures result, aborts session on rate_limited
- GeminiAgent.ts — captures result at all 3 call sites; replaces the "leave for stale recovery" pattern (5-min wait) with immediate markFailed() calls
- OpenRouterAgent.ts — captures result at both call sites, aborts on rate_limited
- types.ts — adds ProcessAgentResponseResult interface
- index.ts — exports ProcessAgentResponseResult, isRateLimitResponse, isAuthErrorResponse
Tests
24 tests pass (bun test tests/worker/agents/response-processor.test.ts).
Updated existing tests:
- markFailed() is called and storeObservations() is not called
New tests:
- markFailed() called, status rate_limited
- markFailed() called, status error
- markFailed() called, status error
- storeObservations() called normally, status empty
- markFailed() called, status error
- isRateLimitResponse() unit tests
- isAuthErrorResponse() unit tests
Existing Safety Properties Preserved
- confirmProcessed() is still called only after successful storeObservations() on the happy path
- storeObservations() with 0 observations — no session timestamp side-effects
- markFailed() retries up to 3× then marks permanently failed (already the behaviour for rate-limit errors in OpenRouter batch processing)
- resetStaleProcessingMessages(0) and the 5-minute stale claim reset are unchanged
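The retry accounting described above can be sketched with a simplified in-memory model. PendingRow and this markFailed are illustrative, not the SQLite-backed store:

```typescript
interface PendingRow {
  status: 'pending' | 'processing' | 'failed';
  retryCount: number;
}

// Each failure bumps the retry count; after maxRetries attempts the row
// flips to 'failed' permanently (but stays visible in the queue UI),
// otherwise it returns to the queue for another try.
function markFailed(row: PendingRow, maxRetries = 3): PendingRow {
  const retryCount = row.retryCount + 1;
  return retryCount >= maxRetries
    ? { status: 'failed', retryCount }
    : { status: 'pending', retryCount };
}
```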