fix: swarm bug "Failed to detach context" with opentelemetry#2281
mehtarac wants to merge 1 commit into
Codecov Report: ✅ All modified and coverable lines are covered by tests.
        raise Exception(timeout_message) from err
else:
    # Track start time for total timeout
    start_time = asyncio.get_event_loop().time()
Issue: asyncio.get_event_loop() can emit a DeprecationWarning on Python 3.10+ when it is called with no running event loop. It works correctly here because we're inside an async function, but asyncio.get_running_loop() is more semantically correct: it makes explicit that a loop is expected to be running and avoids any potential deprecation noise.
Suggestion:
start_time = asyncio.get_running_loop().time()
async for event in async_generator:
    elapsed = asyncio.get_running_loop().time() - start_time

            yield event
    except asyncio.TimeoutError as err:
        raise Exception(timeout_message) from err
else:
Issue: On the Python 3.10 path, the timeout is only enforced after an event is received from the generator. If the generator hangs forever mid-await (e.g., unresponsive model API that never yields another event), the timeout will never trigger — the async for will simply block indefinitely. This means the node_timeout guarantee is effectively absent on 3.10 for the "no more events" case.
This is acknowledged in the PR description (and it's an acceptable tradeoff given 3.10 EOL). However, could you add a brief inline comment here explaining this limitation? It'll help future maintainers understand why this path exists and when it can be removed (once 3.10 support is dropped).
Suggestion:
else:
    # Python 3.10 fallback: timeout is only checked between yielded events.
    # A generator that hangs mid-await won't be interrupted until the next event.
    # This can be removed once Python 3.10 support is dropped (Oct 2026).
    start_time = asyncio.get_running_loop().time()
    ...
Assessment: Comment
The fix correctly identifies the root cause. The approach is sound and the change is well-scoped.
Description
When using the Swarm multiagent pattern with OpenTelemetry tracing enabled, users encounter repeated "Failed to detach context" errors with ValueError: was created in a different Context. This happens because _stream_with_timeout uses asyncio.wait_for() to wrap each anext() call on the async generator. Each wait_for invocation creates a new asyncio.Task with a copied contextvars.Context, so OTEL span tokens attached in one iteration's context cannot be detached in a subsequent iteration's different context.
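The context mismatch described above can be reproduced without OpenTelemetry. The following is a minimal sketch, not the code from this PR: `current_span`, `traced_events`, and `next_event` are hypothetical names, a plain `contextvars.ContextVar` stands in for the OTEL context, and each generator step is driven through an explicit `asyncio.create_task()` to mimic the per-call Task that the description attributes to `asyncio.wait_for()`.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the OTEL "current span" context variable.
current_span = contextvars.ContextVar("current_span", default=None)


async def traced_events():
    # "Attach": the token remembers the Context it was created in.
    token = current_span.set("span-1")
    yield "started"
    # "Detach": raises ValueError if the current Context is a different one.
    current_span.reset(token)
    yield "finished"


async def next_event(gen):
    # Advance the generator by one step; a plain coroutine so it can be
    # scheduled as its own Task.
    return await gen.__anext__()


async def drive_in_separate_tasks():
    # Each step runs in a new Task, and every Task gets its own copy of
    # the Context, so the token from step 1 is foreign to step 2.
    gen = traced_events()
    first = await asyncio.create_task(next_event(gen))
    try:
        await asyncio.create_task(next_event(gen))
    except ValueError as exc:
        return first, str(exc)
    return first, None


async def drive_in_one_task():
    # Plain iteration keeps every step in the caller's Context.
    return [event async for event in traced_events()]


if __name__ == "__main__":
    print(asyncio.run(drive_in_separate_tasks()))
    print(asyncio.run(drive_in_one_task()))
```

Iterating the generator directly in one task avoids the error because the attach and the detach then happen in the same Context.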
Note:
On Python 3.10 only, a node that hangs indefinitely mid-await (e.g., unresponsive model API that never returns) will not be interrupted until the next event arrives. This is an unavoidable limitation of Python 3.10's async APIs and affects an edge case on a version approaching EOL (Oct 2026). Python 3.11+ retains full mid-await cancellation semantics via asyncio.timeout().
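The two code paths in the note can be sketched roughly as follows. This is an illustration under assumptions, not the actual `_stream_with_timeout` implementation: the function name `stream_with_timeout_sketch`, the `timeout` parameter, and the exact placement of the elapsed-time check are hypothetical.

```python
import asyncio
import sys


async def stream_with_timeout_sketch(async_generator, timeout, timeout_message):
    if sys.version_info >= (3, 11):
        # Python 3.11+: asyncio.timeout() cancels the current task when the
        # deadline passes, so a generator stuck mid-await is interrupted.
        # Iteration stays in the caller's task and Context.
        try:
            async with asyncio.timeout(timeout):
                async for event in async_generator:
                    yield event
        except asyncio.TimeoutError as err:
            raise Exception(timeout_message) from err
    else:
        # Python 3.10 fallback (assumed shape): the deadline is only checked
        # when an event arrives, so a generator that never yields again is
        # never interrupted.
        start_time = asyncio.get_running_loop().time()
        async for event in async_generator:
            elapsed = asyncio.get_running_loop().time() - start_time
            if elapsed > timeout:
                raise Exception(timeout_message)
            yield event


async def slow_node():
    # Hypothetical node: the second event arrives well after the deadline.
    yield "first"
    await asyncio.sleep(0.2)
    yield "second"


async def collect():
    events = []
    try:
        async for event in stream_with_timeout_sketch(slow_node(), 0.05, "node timed out"):
            events.append(event)
    except Exception as exc:
        return events, str(exc)
    return events, None


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

In this sketch both paths reject the late second event, but only the 3.11+ path would also stop a node that never produces another event at all.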
Related Issues
#1316
Documentation PR
Type of Change
Bug fix
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
hatch run prepare
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.