fix(python-client): harden tracing, uploads, and execution response handling #501
wllbo wants to merge 2 commits into kubernetes-sigs:main
✅ Deploy Preview for agent-sandbox canceled.
Hi @wllbo. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test.
/assign @SHRUTI6991

    from k8s_agent_sandbox.trace_manager import trace_span, trace

    # Maximum response size for command execution (16 MB).
    MAX_EXECUTION_RESPONSE_SIZE = 16 * 1024 * 1024

Why are we enforcing this limit?
It guards against the client parsing and propagating unexpectedly large command output. 16 MB matches the Go SDK limit from #424. For larger outputs, callers should use file I/O instead.

    @trace_span("run")
    def run(self, command: str, timeout: int = 60) -> ExecutionResult:
        """Executes a command. Rejects responses larger than 16 MB."""

Nit: You may want to parametrize this based on the MAX_EXECUTION_RESPONSE_SIZE value.
updated, docstring now references the constant by name

    class SandboxTracerConfig(BaseModel):
        """Configuration for tracer level information"""
        model_config = {"arbitrary_types_allowed": True}

Nit: add a comment about this field.
added a comment explaining it's needed for the tracer_provider field

     try:
    -    self._create_claim(claim_name, template, namespace)
    +    claim_name = self._create_claim(template, namespace)


    -    Initializes the global OpenTelemetry TracerProvider using the singleton pattern.
    -
    -    This function uses double-checked locking to ensure thread-safe, one-time initialization.
    +def create_tracer_provider(service_name: str) -> "TracerProvider | None":


    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter())
    )
    atexit.register(provider.shutdown)

The tracer provider is shut down when the client goes out of scope, right?
Not on client GC; atexit fires at process exit. Python's __del__ is unreliable, so atexit ensures BatchSpanProcessor flushes remaining spans before the process terminates.
If the caller passes their own provider via SandboxTracerConfig.tracer_provider, create_tracer_provider is never called, so we don't register anything. They own the lifecycle.
/ok-to-test
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aditya-shantanu, wllbo.
PR needs rebase.
/lgtm
/assign @igooch
Pull request overview
Hardens the Python SDK’s tracing setup, claim creation, upload path validation, and execution response handling to align with issues found during the Go SDK review.
Changes:
- Replaces global initialize_tracer() with a caller-owned create_tracer_provider() and threads an optional provider into tracer creation.
- Switches SandboxClaim creation to Kubernetes generateName and updates sync/async clients/helpers accordingly.
- Adds stricter upload filename validation and rejects oversized execution responses.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py | Introduces create_tracer_provider() and updates tracer manager creation to accept an optional provider. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py | Updates claim creation flow to use generated claim names returned from K8s. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py | Async equivalent of claim-name generation and tracing provider wiring. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py | Uses generateName for claim creation and returns the generated claim name. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py | Async equivalent of generateName claim creation and returning generated names. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py | Passes the configured tracer provider into tracer manager creation. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py | Allows arbitrary tracer provider types in SandboxTracerConfig. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py | Rejects upload paths containing directories (sync filesystem). |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py | Adds a max-size guard for execution responses (sync executor). |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_sandboxclient.py | Updates tests for generated claim names and updated _create_claim signature/return. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_async_sandboxclient.py | Updates async client tests for generated claim names. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_k8s_helper.py | Updates helper tests for new create_sandbox_claim signature. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_async_k8s_helper.py | Updates async helper tests for new create_sandbox_claim signature/return. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_sandbox.py | Updates expectations for create_tracer_manager(..., None) invocation. |

     try:
    -    await self._create_claim(claim_name, template, namespace, labels=labels, lifecycle=lifecycle)
    +    claim_name = await self._create_claim(template, namespace, labels=labels, lifecycle=lifecycle)

claim_name is assigned inside this try; if _create_claim(...) fails before returning a name, the except cleanup path later will reference an unassigned claim_name (leading to UnboundLocalError, and potentially skipping cleanup logic). Initialize claim_name before the try and guard the cleanup call so it only runs when a claim was actually created.
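
The fix suggested here might look roughly like the following sketch. It is written synchronously for brevity (the async version would have the same shape with `await`), and `create_claim`, `wait_ready`, and `delete_claim` are hypothetical callables standing in for the client's real methods.

```python
def create_with_cleanup(create_claim, wait_ready, delete_claim):
    """Create a claim, wait for it, and clean up only if creation succeeded."""
    claim_name = None  # bound before the try so the except path can always read it
    try:
        claim_name = create_claim()
        wait_ready(claim_name)
        return claim_name
    except Exception:
        # Only delete when a claim actually exists; if create_claim() itself
        # failed there is nothing to clean up.
        if claim_name is not None:
            delete_claim(claim_name)
        raise  # propagate the original error instead of an UnboundLocalError
```

With the name bound up front, a failure inside `create_claim()` surfaces as the original exception, and cleanup still runs for failures that happen after the claim exists.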

         return None, None

    -    manager = TracerManager(service_name=config.trace_service_name)
    +    manager = TracerManager(service_name=config.trace_service_name, provider=provider)

create_tracer_manager ignores config.tracer_provider unless callers thread it through explicitly. This already results in at least one call site (AsyncSandbox) not using the provider even when set on the config. Consider defaulting to config.tracer_provider when provider is None to avoid silent no-op tracing and to keep older call patterns working.

    -    manager = TracerManager(service_name=config.trace_service_name, provider=provider)
    +    effective_provider = provider
    +    if effective_provider is None:
    +        effective_provider = getattr(config, "tracer_provider", None)
    +    manager = TracerManager(
    +        service_name=config.trace_service_name,
    +        provider=effective_provider,
    +    )


    filename = os.path.basename(path)
    if filename != path:
        raise ValueError(
            f"path must be a plain filename without directories, got {path!r}"
        )

This validation only rejects separators recognized by os.path.basename on the current OS. On POSIX, a path like "dir\\file" will pass even though it contains a directory separator for Windows-style paths. If the goal is to reject any directory separators, explicitly check for both '/' and '\\' (and consider updating AsyncFilesystem.write to apply the same rule for parity).
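
A platform-independent version of the check could be sketched as below. The function name `validate_plain_filename` is hypothetical; the error message is copied from the diff above.

```python
def validate_plain_filename(path: str) -> str:
    """Return path unchanged if it contains no directory separator.

    Both '/' and '\\' are rejected explicitly, so the result does not
    depend on which separators os.path recognizes on the current OS.
    """
    if "/" in path or "\\" in path:
        raise ValueError(
            f"path must be a plain filename without directories, got {path!r}"
        )
    return path
```

Sharing one helper like this between the sync and async filesystems would keep the two write paths in agreement.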

    body = response.content
    if len(body) > MAX_EXECUTION_RESPONSE_SIZE:
        raise RuntimeError(
            f"Execution response exceeds {MAX_EXECUTION_RESPONSE_SIZE} byte limit"
        )

This size check happens after requests has already buffered the full response body (since send_request doesn’t use streaming), so it won’t protect the client from large memory/network usage—only from parsing/processing. Consider enforcing the limit via Content-Length pre-check and/or streaming reads with an explicit cap (and mirror the same limit in AsyncCommandExecutor, which currently has no equivalent guard).
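
One way to enforce the cap while reading, rather than after buffering, is sketched below. `read_capped` is a hypothetical helper that works on any iterable of byte chunks; how it would be wired into `send_request` is not shown in this PR and is left as an assumption.

```python
from typing import Iterable

MAX_EXECUTION_RESPONSE_SIZE = 16 * 1024 * 1024  # same value as the constant in the PR


def read_capped(chunks: Iterable[bytes], limit: int = MAX_EXECUTION_RESPONSE_SIZE) -> bytes:
    """Join byte chunks, raising before the running total can exceed limit."""
    buf = bytearray()
    for chunk in chunks:
        # Check before appending so at most `limit` bytes are ever held.
        if len(buf) + len(chunk) > limit:
            raise RuntimeError(f"Execution response exceeds {limit} byte limit")
        buf.extend(chunk)
    return bytes(buf)
```

With `requests`, the chunks could come from `response.iter_content(chunk_size=65536)` on a request made with `stream=True`. A `Content-Length` pre-check can reject obviously oversized responses earlier, but the header may be absent, so the streaming cap is still needed.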

     try:
    -    self._create_claim(claim_name, template, namespace, labels=labels, lifecycle=lifecycle)
    +    claim_name = self._create_claim(template, namespace, labels=labels, lifecycle=lifecycle)

Because claim_name is now assigned inside the try, if _create_claim(...) raises, the except block later will reference an unassigned claim_name (raising UnboundLocalError and masking the original exception). Initialize claim_name before the try (e.g., None) and only attempt cleanup when it was successfully created.
Issues surfaced during the Go SDK review (#424) that also apply to the Python SDK.
- Replace initialize_tracer() with a create_tracer_provider() factory; the SDK no longer calls trace.set_tracer_provider().
- Raise ValueError on upload paths with directory separators instead of silently stripping to basename.
- Use generateName for claim creation instead of a manual uuid.uuid4() suffix.
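
To illustrate the generateName change: with `metadata.generateName`, the client sends a name prefix and the Kubernetes API server appends a random suffix, returning the final name in the created object's `metadata.name`. The sketch below uses hypothetical helper names, a placeholder `apiVersion`, and an invented `spec` shape; none of these are taken from the actual SandboxClaim schema.

```python
def build_claim_manifest(template: str, prefix: str = "sandbox-claim-") -> dict:
    """Build a claim body that asks the API server to choose the name."""
    return {
        "apiVersion": "example.invalid/v1",  # placeholder, not the real group/version
        "kind": "SandboxClaim",
        # No "name" key: the server appends a random suffix to this prefix.
        "metadata": {"generateName": prefix},
        "spec": {"template": template},  # hypothetical spec shape
    }


def created_name(api_response: dict) -> str:
    """Read the server-assigned name from the create response."""
    return api_response["metadata"]["name"]
```

Because the name is only known after the create call returns, callers have to use the returned value, which is why `_create_claim` now returns `claim_name` in the diffs above.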