fix: swarm bug "Failed to detach context" with opentelemetry #2281

Open
mehtarac wants to merge 1 commit into strands-agents:main from mehtarac:fix_swarm_bug

Conversation

@mehtarac
Member

Description

When using the Swarm multiagent pattern with OpenTelemetry tracing enabled, users encounter repeated "Failed to detach context" errors with ValueError: was created in a different Context. This happens because _stream_with_timeout uses asyncio.wait_for() to wrap each anext() call on the async generator. Each wait_for invocation creates a new asyncio.Task with a copied contextvars.Context, so OTEL span tokens attached in one iteration's context cannot be detached in a subsequent iteration's different context.
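The failure mode described above can be reproduced in isolation with the standard contextvars module. This is a minimal sketch, not OpenTelemetry code: the variable current_span and the helpers attach and detach_is_rejected are hypothetical names for illustration, and it assumes OTEL's attach/detach behaves like ContextVar.set/ContextVar.reset, as the error message suggests.

```python
import contextvars

# Hypothetical stand-in for the variable that holds the active span.
current_span = contextvars.ContextVar("current_span", default=None)


def attach(span_name):
    """Set the variable and return the token needed to undo the change."""
    return current_span.set(span_name)


def detach_is_rejected(token):
    """Return True if resetting with this token fails in the current Context."""
    try:
        current_span.reset(token)
    except ValueError:
        # Raised when the token was created in a different Context.
        return True
    return False


# Two independent copies of the current context, standing in for the
# contexts of two separate tasks.
ctx_a = contextvars.copy_context()
ctx_b = contextvars.copy_context()

token = ctx_a.run(attach, "span-1")

# Resetting inside a different Context is rejected...
rejected_in_other_context = ctx_b.run(detach_is_rejected, token)
# ...while resetting inside the Context that created the token succeeds.
rejected_in_same_context = ctx_a.run(detach_is_rejected, token)
```

A token is only valid in the Context where it was created, so any wrapper that moves generator iterations between contexts will break attach/detach pairs that span a yield.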

Note:
On Python 3.10 only, a node that hangs indefinitely mid-await (e.g., unresponsive model API that never returns) will not be interrupted until the next event arrives. This is an unavoidable limitation of Python 3.10's async APIs and affects an edge case on a version approaching EOL (Oct 2026). Python 3.11+ retains full mid-await cancellation semantics via asyncio.timeout().
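The version split in this note could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the function name stream_with_timeout, its parameters, and the demo generator slow_events are hypothetical, and only asyncio.timeout(), asyncio.get_running_loop(), and the timeout_message pattern come from the discussion.

```python
import asyncio
import sys


async def stream_with_timeout(async_generator, timeout, timeout_message):
    """Yield events from async_generator, failing once `timeout` seconds pass."""
    if sys.version_info >= (3, 11):
        try:
            # One timeout scope for the whole stream, entered in the task
            # that iterates this generator; no wrapper task per event.
            async with asyncio.timeout(timeout):
                async for event in async_generator:
                    yield event
        except asyncio.TimeoutError as err:
            raise Exception(timeout_message) from err
    else:
        # Python 3.10 fallback: the deadline is only checked when an event
        # arrives, so a generator that never yields again is not interrupted.
        start_time = asyncio.get_running_loop().time()
        async for event in async_generator:
            elapsed = asyncio.get_running_loop().time() - start_time
            if elapsed > timeout:
                raise Exception(timeout_message)
            yield event


async def slow_events():
    """Hypothetical node stream: the second event arrives after the deadline."""
    yield "first"
    await asyncio.sleep(0.2)
    yield "second"


async def demo():
    try:
        async for _ in stream_with_timeout(slow_events(), 0.05, "node timed out"):
            pass
    except Exception as exc:
        return str(exc)
    return None


result = asyncio.run(demo())
```

On 3.11+ the sleep is cancelled when the deadline passes; on 3.10 the error is raised only when "second" finally arrives. Both paths end with the same message, but a stream that never produced "second" would block indefinitely on the 3.10 path.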

Related Issues

#1316

Documentation PR

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mehtarac mehtarac changed the title from "fix: swarm bug" to 'fix: swarm bug "Failed to detach context" with opentelemetry' May 11, 2026
@codecov

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@mehtarac mehtarac marked this pull request as ready for review May 11, 2026 16:50
@mehtarac mehtarac deployed to manual-approval May 11, 2026 16:53 — with GitHub Actions Active
raise Exception(timeout_message) from err
else:
# Track start time for total timeout
start_time = asyncio.get_event_loop().time()

Issue: asyncio.get_event_loop() is deprecated in Python 3.10+ when there's no running event loop. While it works correctly here (since we're inside an async function), asyncio.get_running_loop() is more semantically correct — it explicitly documents the expectation that a loop is running and avoids any potential deprecation noise.

Suggestion:

start_time = asyncio.get_running_loop().time()
async for event in async_generator:
    elapsed = asyncio.get_running_loop().time() - start_time
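As a small illustration of the difference this suggestion relies on, the sketch below shows that asyncio.get_running_loop() raises RuntimeError when no loop is running and returns the active loop inside a coroutine. The helper names loop_outside_coroutine and measure_elapsed are hypothetical.

```python
import asyncio


def loop_outside_coroutine():
    """get_running_loop() fails loudly when called with no loop running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return "no running loop"
    return "loop found"


async def measure_elapsed():
    """Inside a coroutine the running loop is available and exposes a monotonic clock."""
    loop = asyncio.get_running_loop()
    start_time = loop.time()
    await asyncio.sleep(0.05)
    return loop.time() - start_time


outside = loop_outside_coroutine()
elapsed = asyncio.run(measure_elapsed())
```

Because the call cannot silently create or pick up a different loop, it documents the assumption that the surrounding code always runs inside one.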

yield event
except asyncio.TimeoutError as err:
raise Exception(timeout_message) from err
else:

Issue: On the Python 3.10 path, the timeout is only enforced after an event is received from the generator. If the generator hangs forever mid-await (e.g., unresponsive model API that never yields another event), the timeout will never trigger — the async for will simply block indefinitely. This means the node_timeout guarantee is effectively absent on 3.10 for the "no more events" case.

This is acknowledged in the PR description (and it's an acceptable tradeoff given 3.10 EOL). However, could you add a brief inline comment here explaining this limitation? It'll help future maintainers understand why this path exists and when it can be removed (once 3.10 support is dropped).

Suggestion:

else:
    # Python 3.10 fallback: timeout is only checked between yielded events.
    # A generator that hangs mid-await won't be interrupted until the next event.
    # This can be removed once Python 3.10 support is dropped (Oct 2026).
    start_time = asyncio.get_running_loop().time()
    ...

@github-actions

Assessment: Comment

The fix correctly identifies the root cause (asyncio.wait_for() creating new tasks with copied contexts breaks OTEL span token attachment) and addresses it with an appropriate version-branched approach. The Python 3.11+ path using asyncio.timeout() is clean and correct. The Python 3.10 fallback trades timeout precision for correctness, which is a reasonable tradeoff for a version approaching EOL.

Review Details
  • Documentation: The docstring still describes the old wait_for behavior and should be updated to reflect the new version-branched semantics and 3.10 limitations.
  • API usage: asyncio.get_event_loop() should be asyncio.get_running_loop() for correctness and to avoid deprecation concerns.
  • Maintainability: The Python 3.10 fallback has a significant behavioral difference (no mid-await cancellation) that should be documented inline with a note about when it can be removed.

The approach is sound and the change is well-scoped.
