fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] #1609
Merged
matthv merged 5 commits on Jun 3, 2026
…ing provider errors [PRD-409] When the AI provider hangs (no response, internal retries, or holds the connection open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner. Add a dedicated timeout on each AI invocation (default 60s, configurable via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying HTTP request is actually cancelled. On timeout, throws the new AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an error outcome with a user-friendly message — the orchestrator then sets context.error on the step and the frontend exits its isLoading state immediately. fixes PRD-409 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage Impact ⬆️ Merging this pull request will increase total coverage. Modified files with diff coverage: 4.
Replace the manual AbortController + setTimeout in invokeWithTools with LangChain's native `timeout` call option, which it converts to an AbortSignal.timeout(ms) and forwards to the underlying HTTP request (real cancellation, not just a race). Lowers invokeWithTools complexity. Map the resulting TimeoutError/AbortError to AiInvokeTimeoutError to keep the user-facing message. Lower the default to 30s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtSignal Instead of guessing the thrown error's name (providers wrap an aborted request differently — AbortError, TimeoutError, APIUserAbortError, APIConnectionTimeoutError…), pass an AbortSignal.timeout we own and detect the timeout via signal.aborted. Provider-agnostic, and LangChain forwards the signal so a hanging provider is genuinely cancelled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
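The detection strategy described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual code: `invokeWithTimeout`, the `Invoke` type, and `hangingProvider` are hypothetical names, and the `AiInvokeTimeoutError` shown here is a stand-in whose constructor and message are assumptions. The point is that the caller owns the signal, so it can check `signal.aborted` instead of inspecting the name of whatever error the provider SDK throws.

```typescript
// Stand-in for the PR's AiInvokeTimeoutError (constructor and message are assumptions).
class AiInvokeTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`AI provider did not respond within ${timeoutMs}ms`);
    this.name = 'AiInvokeTimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

// Hypothetical shape: any call that accepts an AbortSignal and forwards it to its request.
type Invoke<T> = (options: { signal: AbortSignal }) => Promise<T>;

async function invokeWithTimeout<T>(invoke: Invoke<T>, timeoutMs: number): Promise<T> {
  // We create the signal ourselves, so we can ask it directly whether it fired.
  const signal = AbortSignal.timeout(timeoutMs);

  try {
    return await invoke({ signal });
  } catch (error) {
    // Provider-agnostic check: the thrown error's name does not matter,
    // only whether our own timeout signal has aborted.
    if (signal.aborted) throw new AiInvokeTimeoutError(timeoutMs);

    // Anything else is a genuine failure and is rethrown untouched.
    throw error;
  }
}

// Simulated provider that never answers and, once aborted, rejects with an
// SDK-specific error name (here mimicking 'APIUserAbortError').
function hangingProvider({ signal }: { signal: AbortSignal }): Promise<string> {
  return new Promise((_resolve, reject) => {
    signal.addEventListener(
      'abort',
      () => {
        const wrapped = new Error('Request was aborted.');
        wrapped.name = 'APIUserAbortError';
        reject(wrapped);
      },
      { once: true },
    );
  });
}
```

Calling `invokeWithTimeout(hangingProvider, 50)` rejects with the stand-in `AiInvokeTimeoutError` even though the simulated provider threw an error with a different name.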
…er disables
- build-workflow-executor: clamp stepTimeoutMs/aiInvokeTimeoutMs to a positive value or the default (`?? default` only caught null/undefined, so a 0/negative programmatic value silently disabled the timeout)
- base-step-executor: drop the `as number` casts via inline narrowing; add a WHY comment on the signal.aborted timeout detection
- tests: build clamps non-positive timeouts to default; AI invoke timeout now surfaces end-to-end through execute() as an error outcome with its userMessage
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…back to default) positiveOrDefault accepted Infinity (typeof Infinity === 'number' && Infinity > 0), which disabled the timeout instead of using the default. Guard with Number.isFinite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
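A guard of this kind might look like the sketch below. Only the name `positiveOrDefault` and the `Number.isFinite` check come from the commit messages; the signature and the parameter names are assumptions made for illustration.

```typescript
// Assumed signature: returns `value` only when it is a finite number greater than 0,
// otherwise falls back to the provided default.
function positiveOrDefault(value: number | null | undefined, fallback: number): number {
  // `?? fallback` alone would let 0, negative numbers, NaN and Infinity through.
  // Number.isFinite rejects NaN and +/-Infinity; `value > 0` rejects 0 and negatives.
  return typeof value === 'number' && Number.isFinite(value) && value > 0 ? value : fallback;
}

// Illustrative default, matching the 30s value mentioned in the PR description.
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000;

const fromZero = positiveOrDefault(0, DEFAULT_AI_INVOKE_TIMEOUT_MS);
const fromInfinity = positiveOrDefault(Infinity, DEFAULT_AI_INVOKE_TIMEOUT_MS);
const fromValid = positiveOrDefault(10_000, DEFAULT_AI_INVOKE_TIMEOUT_MS);
```

Here `fromZero` and `fromInfinity` both resolve to the default, while `fromValid` keeps the caller's value, so no programmatic input can silently disable the timeout.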
Scra3 approved these changes on Jun 3, 2026

Summary
When the AI provider hangs (no response, internal retries, or holds the connection open), the previous code relied on the global `STEP_TIMEOUT_MS` (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 30s, configurable via `AI_INVOKE_TIMEOUT_MS`) by passing LangChain's native `timeout` call option to `model.invoke`. LangChain converts it into an `AbortSignal.timeout(ms)` and forwards it to the underlying HTTP request, so a hanging provider is actually cancelled, not merely raced.

On timeout, the abort surfaces as a `TimeoutError`/`AbortError`, which `invokeWithTools` maps to the new `AiInvokeTimeoutError`. `BaseStepExecutor.execute()` then converts it to an error outcome with a user-friendly message; the orchestrator sets `context.error` on the step and the frontend exits its `isLoading` state immediately.

Why delegate to LangChain instead of a manual AbortController
An earlier version wired up `AbortController` + `setTimeout` by hand. LangChain already does exactly this internally when given a `timeout` call option (verified in `@langchain/core` `ensureConfig` → `AbortSignal.timeout` → forwarded as `signal` to the request). Delegating removes the manual timer plumbing and lowers `invokeWithTools` complexity, while still producing a real request cancellation. The `timeout` call option is in milliseconds.

Why not just lower STEP_TIMEOUT_MS globally
`STEP_TIMEOUT_MS` covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes
- `defaults.ts`: new `DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000`
- `errors.ts`: new `AiInvokeTimeoutError extends WorkflowExecutorError` with provider-specific user message
- `base-step-executor.ts`: `invokeWithTools` passes `{ timeout: aiInvokeTimeoutMs }` to `model.invoke`, and maps the resulting `TimeoutError`/`AbortError` to `AiInvokeTimeoutError`
- `RunnerConfig` → `StepContextConfig` → `ExecutionContext`
- `cli-core.ts`: parse `AI_INVOKE_TIMEOUT_MS` env var
- `AiInvokeTimeoutError`, `{ timeout }` passed as the 2nd arg, disabled when unset/<=0 (abort not mapped), non-abort errors rethrown as-is

fixes PRD-409
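The conversion from a thrown timeout error into an error outcome can be illustrated roughly as below. This is a sketch under stated assumptions, not the code from `base-step-executor.ts`: the `StepOutcome` shape, the `executeStep` helper, the constructor signatures, and the wording of the user message are all hypothetical. Only the class names and the `userMessage` property are taken from the PR text.

```typescript
// Assumed base class: carries a message that is safe to show to end users.
class WorkflowExecutorError extends Error {
  readonly userMessage: string;

  constructor(message: string, userMessage: string) {
    super(message);
    this.userMessage = userMessage;
  }
}

// Assumed constructor; the user-facing wording is invented for this example.
class AiInvokeTimeoutError extends WorkflowExecutorError {
  constructor(timeoutMs: number) {
    super(
      `AI invocation timed out after ${timeoutMs}ms`,
      'The AI provider took too long to respond. Please try again.',
    );
  }
}

// Hypothetical outcome shape returned to the orchestrator.
type StepOutcome =
  | { status: 'success'; result: unknown }
  | { status: 'error'; error: string };

// Hypothetical stand-in for the executor's execute(): known executor errors
// become an error outcome; unknown errors keep propagating.
async function executeStep(run: () => Promise<unknown>): Promise<StepOutcome> {
  try {
    return { status: 'success', result: await run() };
  } catch (error) {
    if (error instanceof WorkflowExecutorError) {
      return { status: 'error', error: error.userMessage };
    }
    throw error;
  }
}
```

Returning an outcome instead of letting the error escape is what would let an orchestrator record the failure on the step and let the UI leave its loading state promptly.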
Test plan
- `workflow-executor` test suite passes (`base-step-executor.test.ts`: 45 tests, incl. the 6 above)
- `tsc --noEmit` clean
- With `SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000`, the frontend shows the new user message after 10s instead of spinning for 5min

🤖 Generated with Claude Code
Note
Add per-invocation AI timeout to surface hanging provider errors in workflow executor
- `aiInvokeTimeoutMs` (default 30s) to `RunnerConfig`, `ExecutionContext`, and `ExecutorOptions`, configurable via the `AI_INVOKE_TIMEOUT_MS` env var in the CLI.
- `BaseStepExecutor.invokeWithTools` with `AbortSignal.timeout`, throwing a distinct `AiInvokeTimeoutError` with a user-facing message when the provider does not respond in time.
- `positiveOrDefault` in build-workflow-executor.ts so non-positive or non-finite timeout values always fall back to their defaults rather than disabling the timeout.
- `stepTimeoutMs` values that are non-positive or non-finite now fall back to `DEFAULT_STEP_TIMEOUT_MS` instead of disabling the step timeout.

Macroscope summarized 7e7b418.
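Reading the `AI_INVOKE_TIMEOUT_MS` env var in the CLI could be sketched like this. The helper `parseTimeoutEnv` and its behavior for invalid input are assumptions for illustration; the PR only states that `cli-core.ts` parses the variable and that invalid values end up falling back to the default.

```typescript
// Hypothetical helper: returns a usable timeout in milliseconds, or undefined
// when the variable is unset, empty, non-numeric, non-finite, or not positive.
function parseTimeoutEnv(
  env: Record<string, string | undefined>,
  name: string,
): number | undefined {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return undefined;

  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
}

// Example inputs, using plain objects in place of process.env.
const configured = parseTimeoutEnv({ AI_INVOKE_TIMEOUT_MS: '10000' }, 'AI_INVOKE_TIMEOUT_MS');
const invalid = parseTimeoutEnv({ AI_INVOKE_TIMEOUT_MS: 'abc' }, 'AI_INVOKE_TIMEOUT_MS');
```

Returning `undefined` for unusable input leaves the choice of default to the code that builds the executor, which keeps the fallback rule in a single place.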