fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] by matthv · Pull Request #1609 · ForestAdmin/agent-nodejs

matthv · 2026-05-28T14:52:22Z

Summary

When the AI provider hangs (no response, internal retries, or a connection held open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 30s, configurable via AI_INVOKE_TIMEOUT_MS) by passing LangChain's native timeout call option to model.invoke. LangChain converts it into an AbortSignal.timeout(ms) and forwards it to the underlying HTTP request, so a hanging provider is actually cancelled — not merely raced.

On timeout, the abort surfaces as a TimeoutError/AbortError, which invokeWithTools maps to the new AiInvokeTimeoutError. BaseStepExecutor.execute() then converts it to an error outcome with a user-friendly message — the orchestrator sets context.error on the step and the frontend exits its isLoading state immediately.
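The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from base-step-executor.ts: the class bodies, the `mapInvokeError` and `toOutcome` helpers, the `StepOutcome` shape, and the user message wording are all assumptions. Only the class names and the TimeoutError/AbortError mapping come from the description.

```typescript
// Hypothetical stand-ins for the classes named in this PR (the real ones live in errors.ts).
class WorkflowExecutorError extends Error {
  readonly userMessage: string;

  constructor(message: string, userMessage: string) {
    super(message);
    this.userMessage = userMessage;
  }
}

class AiInvokeTimeoutError extends WorkflowExecutorError {
  constructor(timeoutMs: number) {
    // The user-facing wording here is assumed, not taken from the PR.
    super(
      `AI invocation timed out after ${timeoutMs}ms`,
      "The AI provider did not respond in time. Please try again.",
    );
  }
}

// Map an abort-style error to AiInvokeTimeoutError; anything else is returned unchanged.
// When the timeout is unset or <= 0, abort errors are not mapped either.
function mapInvokeError(err: unknown, timeoutMs: number | undefined): unknown {
  if (typeof timeoutMs !== "number" || timeoutMs <= 0) return err;
  const isAbort =
    err instanceof Error && (err.name === "TimeoutError" || err.name === "AbortError");
  return isAbort ? new AiInvokeTimeoutError(timeoutMs) : err;
}

// Assumed outcome shape; the real executor's outcome type may differ.
type StepOutcome = { status: "success" } | { status: "error"; error: string };

// Assumed version of what execute() does with a known executor error.
function toOutcome(err: unknown): StepOutcome {
  if (err instanceof WorkflowExecutorError) {
    return { status: "error", error: err.userMessage };
  }
  throw err; // unknown errors keep propagating
}
```

Returning the original error object for the non-abort case keeps the "rethrown as-is" behaviour easy to assert in a unit test.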

Why delegate to LangChain instead of a manual AbortController

An earlier version wired up AbortController + setTimeout by hand. LangChain already does exactly this internally when given a timeout call option (verified in @langchain/core ensureConfig → AbortSignal.timeout → forwarded as signal to the request). Delegating removes the manual timer plumbing and lowers invokeWithTools complexity, while still producing a real request cancellation. The timeout call option is in milliseconds.

Why not just lower STEP_TIMEOUT_MS globally

STEP_TIMEOUT_MS covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes

defaults.ts: new DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000
errors.ts: new AiInvokeTimeoutError extends WorkflowExecutorError with provider-specific user message
base-step-executor.ts: invokeWithTools passes { timeout: aiInvokeTimeoutMs } to model.invoke, and maps the resulting TimeoutError/AbortError to AiInvokeTimeoutError
Config plumbing through RunnerConfig → StepContextConfig → ExecutionContext
cli-core.ts: parse AI_INVOKE_TIMEOUT_MS env var
6 unit tests: TimeoutError/AbortError mapped to AiInvokeTimeoutError, { timeout } passed as the 2nd arg, disabled when unset/<=0 (abort not mapped), non-abort errors rethrown as-is
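As a rough illustration of the config plumbing listed above, the sketch below parses the env var and carries the value down to the per-step context. The interface shapes and the helper names `parseTimeoutEnv`, `runnerConfigFromEnv`, and `toExecutionContext` are hypothetical, and the intermediate StepContextConfig hop is omitted for brevity. The real cli-core.ts parsing may differ.

```typescript
// Assumed slice of the runner configuration; the real RunnerConfig has more fields.
interface RunnerConfig {
  aiInvokeTimeoutMs: number | undefined;
}

// Assumed slice of the per-step context that invokeWithTools would read.
interface ExecutionContext {
  aiInvokeTimeoutMs: number | undefined;
}

// Parse a millisecond value from an env var; undefined means "not configured".
function parseTimeoutEnv(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  return Number.isNaN(parsed) ? undefined : parsed;
}

// Build the runner config from an env-like record (pass process.env in a CLI).
function runnerConfigFromEnv(env: Record<string, string | undefined>): RunnerConfig {
  return { aiInvokeTimeoutMs: parseTimeoutEnv(env["AI_INVOKE_TIMEOUT_MS"]) };
}

// Forward the value unchanged; choosing a default is left to the builder.
function toExecutionContext(config: RunnerConfig): ExecutionContext {
  return { aiInvokeTimeoutMs: config.aiInvokeTimeoutMs };
}
```

With `AI_INVOKE_TIMEOUT_MS=10000`, as in the live test, this sketch carries `10000` through to the context; an unset or non-numeric value comes through as `undefined`.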

fixes PRD-409

Test plan

workflow-executor test suite passes (base-step-executor.test.ts: 45 tests, incl. the 6 above)
Lint clean on changed files; tsc --noEmit clean
Live test: with SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000, the frontend shows the new user message after 10s instead of spinning for 5min
Default set to 30s

🤖 Generated with Claude Code

Note

Add per-invocation AI timeout to surface hanging provider errors in workflow executor

Adds aiInvokeTimeoutMs (default 30s) to RunnerConfig, ExecutionContext, and ExecutorOptions, configurable via the AI_INVOKE_TIMEOUT_MS env var in the CLI.
Wraps each AI model invocation in BaseStepExecutor.invokeWithTools with AbortSignal.timeout, throwing a distinct AiInvokeTimeoutError with a user-facing message when the provider does not respond in time.
Introduces positiveOrDefault in build-workflow-executor.ts so non-positive or non-finite timeout values always fall back to their defaults rather than disabling the timeout.
Behavioral Change: stepTimeoutMs values that are non-positive or non-finite now fall back to DEFAULT_STEP_TIMEOUT_MS instead of disabling the step timeout.

Macroscope summarized 7e7b418.

…ing provider errors [PRD-409]

When the AI provider hangs (no response, internal retries, or holds the connection open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

Add a dedicated timeout on each AI invocation (default 60s, configurable via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying HTTP request is actually cancelled. On timeout, throws the new AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an error outcome with a user-friendly message; the orchestrator then sets context.error on the step and the frontend exits its isLoading state immediately.

fixes PRD-409

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear · 2026-05-28T14:52:27Z

PRD-409

qltysh · 2026-05-28T14:53:38Z

All good ✅

qltysh · 2026-05-28T14:58:23Z

Coverage Impact

⬆️ Merging this pull request will increase total coverage on feat/prd-214-server-step-mapper by 0.03%.

Modified Files with Diff Coverage (4)

Rating	File	% Diff	Uncovered Line #s
	packages/workflow-executor/src/build-workflow-executor.ts	100.0%
	packages/workflow-executor/src/executors/base-step-executor.ts	100.0%
	packages/workflow-executor/src/errors.ts	100.0%
	packages/workflow-executor/src/defaults.ts	100.0%
	Total	100.0%

Replace the manual AbortController + setTimeout in invokeWithTools with LangChain's native `timeout` call option, which it converts to an AbortSignal.timeout(ms) and forwards to the underlying HTTP request (real cancellation, not just a race). Lowers invokeWithTools complexity.

Map the resulting TimeoutError/AbortError to AiInvokeTimeoutError to keep the user-facing message. Lower the default to 30s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rtSignal

Instead of guessing the thrown error's name (providers wrap an aborted request differently: AbortError, TimeoutError, APIUserAbortError, APIConnectionTimeoutError…), pass an AbortSignal.timeout we own and detect the timeout via signal.aborted. Provider-agnostic, and LangChain forwards the signal so a hanging provider is genuinely cancelled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
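A minimal sketch of the owned-signal approach from this commit, under stated assumptions: `invokeWithTimeout`, the `InvokeFn` type, and `hangingProvider` are hypothetical names, and the error class is a simplified stand-in for the one in errors.ts. The invoke callback stands in for a model call that accepts a `signal` option.

```typescript
// Simplified stand-in for the PR's AiInvokeTimeoutError.
class AiInvokeTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`AI invocation timed out after ${timeoutMs}ms`);
  }
}

// Hypothetical shape of a cancellable model call.
type InvokeFn<T> = (options: { signal: AbortSignal }) => Promise<T>;

async function invokeWithTimeout<T>(invoke: InvokeFn<T>, timeoutMs: number): Promise<T> {
  // We create the signal ourselves, so we can inspect it after a failure.
  const signal = AbortSignal.timeout(timeoutMs);
  try {
    return await invoke({ signal });
  } catch (err) {
    // Providers wrap an aborted request under different error names,
    // so check our own signal instead of the thrown error's name.
    if (signal.aborted) throw new AiInvokeTimeoutError(timeoutMs);
    throw err;
  }
}

// Stand-in for a provider that never answers and rejects with its own
// wrapper error once the request is aborted.
function hangingProvider(options: { signal: AbortSignal }): Promise<string> {
  return new Promise((_resolve, reject) => {
    options.signal.addEventListener(
      "abort",
      () => reject(new Error("provider-specific abort wrapper")),
      { once: true },
    );
  });
}
```

Calling `invokeWithTimeout(hangingProvider, 20)` rejects with AiInvokeTimeoutError even though the provider threw an unrelated-looking error, which is the point of checking `signal.aborted`. Note that in Node.js the timer behind `AbortSignal.timeout` does not keep the process alive on its own.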

…er disables

- build-workflow-executor: clamp stepTimeoutMs/aiInvokeTimeoutMs to a positive value or the default (`?? default` only caught null/undefined, so a 0/negative programmatic value silently disabled the timeout)
- base-step-executor: drop the `as number` casts via inline narrowing; add a WHY comment on the signal.aborted timeout detection
- tests: build clamps non-positive timeouts to default; AI invoke timeout now surfaces end-to-end through execute() as an error outcome with its userMessage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…back to default)

positiveOrDefault accepted Infinity (typeof Infinity === 'number' && Infinity > 0), which disabled the timeout instead of using the default. Guard with Number.isFinite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
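The clamp described in the last two commits might look roughly like this. The function name comes from the PR, but its signature and body here are assumptions, as is the exact value of DEFAULT_STEP_TIMEOUT_MS (taken from the "default 5 min" stated in the summary).

```typescript
const DEFAULT_STEP_TIMEOUT_MS = 5 * 60_000; // 5 min, per the PR summary
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000; // 30s, per defaults.ts in this PR

// Return `value` only when it is a finite, strictly positive number.
// null, undefined, NaN, 0, negatives, and Infinity all fall back to the default,
// so a bad programmatic value cannot silently disable the timeout.
function positiveOrDefault(value: number | null | undefined, fallback: number): number {
  return typeof value === "number" && Number.isFinite(value) && value > 0
    ? value
    : fallback;
}

// Hypothetical use when building the executor options.
function resolveTimeouts(config: { stepTimeoutMs?: number; aiInvokeTimeoutMs?: number }) {
  return {
    stepTimeoutMs: positiveOrDefault(config.stepTimeoutMs, DEFAULT_STEP_TIMEOUT_MS),
    aiInvokeTimeoutMs: positiveOrDefault(config.aiInvokeTimeoutMs, DEFAULT_AI_INVOKE_TIMEOUT_MS),
  };
}
```

Compared with `value ?? fallback`, which only replaces null and undefined, this version also rejects `0`, negative numbers, `NaN`, and `Infinity`.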

matthv and others added 3 commits June 2, 2026 12:11

macroscopeapp Bot reviewed Jun 3, 2026

Comment thread packages/workflow-executor/src/build-workflow-executor.ts

Scra3 approved these changes Jun 3, 2026

matthv merged commit ed1fe5c into feat/prd-214-server-step-mapper Jun 3, 2026
30 checks passed

matthv deleted the fix/prd-409-ai-invoke-timeout branch June 3, 2026 12:25

matthv merged 5 commits into feat/prd-214-server-step-mapper from fix/prd-409-ai-invoke-timeout
