Conversation
_get_async_lock() had a race where two coroutines (or threads running per-thread event loops via ThreadedAsyncSandboxClient) could both see _async_lock as None and create separate Lock instances. Use a threading.Lock with double-checked locking to ensure exactly one asyncio.Lock is created.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When many sandboxes are created in a burst, all cache-miss concurrently and fire redundant auth POST requests. Add per-sandbox in-flight tracking so only one request is issued per sandbox_id while others wait for the result. Applies to both sync (threading.Event) and async (asyncio.Future) paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
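The async half of this coalescing can be sketched as below (the sync Event-based half is quoted later in this thread). Class and method names here are assumptions; the shape — first caller becomes the leader, later callers await a shared `asyncio.Future` — is the technique the commit describes.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class AsyncAuthCoalescer:
    """Sketch of per-sandbox request coalescing on the async path."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]]) -> None:
        self._fetch = fetch  # stands in for the real auth POST
        self._cache: Dict[str, Any] = {}
        # No lock needed: all mutation happens on the event loop thread.
        self._inflight: Dict[str, "asyncio.Future[Any]"] = {}

    async def get_or_refresh(self, sandbox_id: str) -> Any:
        if sandbox_id in self._cache:
            return self._cache[sandbox_id]
        fut = self._inflight.get(sandbox_id)
        if fut is not None:
            # Coalesce: wait on the in-flight request instead of firing our own.
            return await fut
        fut = asyncio.get_running_loop().create_future()
        self._inflight[sandbox_id] = fut
        try:
            result = await self._fetch(sandbox_id)  # the single POST
            self._cache[sandbox_id] = result
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            self._inflight.pop(sandbox_id, None)
```

A burst of concurrent callers for the same `sandbox_id` results in exactly one fetch; everyone else receives the leader's result (or its exception) through the shared future.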
_save_cache() does blocking file I/O (json.dump to disk). When called inside self._lock, it blocks all other threads waiting for the lock while doing disk writes. Move the save outside the critical section — the in-memory cache is already updated atomically under the lock, so the file write can happen without holding it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
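The snapshot-then-write pattern this commit describes looks roughly like the sketch below (class and method names are assumptions). The key move is copying the cache while the lock is held, then serializing the copy with the lock released.

```python
import json
import threading
from typing import Any, Dict


class CacheSaver:
    """Sketch: mutate in memory under the lock, write to disk outside it."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._auth_cache: Dict[str, Any] = {}

    def store(self, sandbox_id: str, auth: Dict[str, Any]) -> None:
        with self._lock:
            self._auth_cache[sandbox_id] = auth
            snapshot = dict(self._auth_cache)  # consistent copy under the lock
        # Blocking json.dump happens outside the critical section, so other
        # threads are never stalled behind a disk write.
        with open(self._path, "w") as f:
            json.dump(snapshot, f)
```

The trade-off: two concurrent `store` calls may write in either order, so the last writer wins — acceptable here because every snapshot is internally consistent.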
- _save_cache_async(): mkdir was sync, now offloaded to thread
- AsyncSandboxClient.download_file(): os.makedirs was sync, now offloaded
- AsyncSandboxClient.upload_file(): os.path.exists was sync, now offloaded

These sync syscalls block the event loop. Under high concurrency they contribute to event loop lag accumulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
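The offload pattern used in those three spots can be sketched as follows. The function name and signature here are assumptions for illustration; the technique — wrapping each blocking filesystem syscall in `asyncio.to_thread` — is what the commit describes.

```python
import asyncio
import os


def _write_bytes(path: str, data: bytes) -> None:
    # Plain blocking write, meant to run in a worker thread.
    with open(path, "wb") as f:
        f.write(data)


async def download_file_sketch(dest_path: str, data: bytes) -> None:
    """Sketch (hypothetical name): filesystem syscalls run in worker
    threads via asyncio.to_thread, so the event loop keeps servicing
    other coroutines instead of stalling on disk I/O."""
    parent = os.path.dirname(dest_path) or "."
    # os.makedirs touches the filesystem; keep it off the loop thread.
    await asyncio.to_thread(os.makedirs, parent, exist_ok=True)
    await asyncio.to_thread(_write_bytes, dest_path, data)
```

Each individual syscall is cheap, but under a burst of concurrent downloads the per-call loop stalls add up, which is the "lag accumulation" the commit message refers to.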
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dcf61b7c1d
```python
except BaseException as e:
    async with self._get_async_lock():
        fut = self._inflight_async.pop(sandbox_id, None)
        if fut is not None and not fut.done():
            fut.set_exception(e)
```
Stop one cancelled auth refresh from aborting peer requests
In SandboxAuthCache.get_or_refresh_async, concurrent callers for the same sandbox share the leader's future. If that leader task is cancelled mid-request (for example because one caller is wrapped in asyncio.wait_for or a client disconnects), this except BaseException path stores the CancelledError on the shared future, so every other waiter is cancelled too even though their own operations are still valid. Before this coalescing change those peer commands/downloads would have failed or succeeded independently.
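One possible mitigation — a sketch under assumed names, not the PR's code — is to run the shared request in its own `Task` and have each caller await it through `asyncio.shield()`. Cancelling one caller then cancels only that caller's shield, not the underlying request, so peers still receive the real result.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def get_or_refresh_shared(
    inflight: Dict[str, "asyncio.Task[Any]"],
    sandbox_id: str,
    fetch: Callable[[str], Awaitable[Any]],
) -> Any:
    """Sketch: the auth request lives in an independent Task; a cancelled
    waiter never stores CancelledError on state shared with its peers."""
    task = inflight.get(sandbox_id)
    if task is None:
        task = asyncio.create_task(fetch(sandbox_id))
        # Clean up the in-flight entry whenever the task finishes.
        task.add_done_callback(lambda _t: inflight.pop(sandbox_id, None))
        inflight[sandbox_id] = task
    # shield(): cancellation of this caller does not cancel `task`.
    return await asyncio.shield(task)
```

With this shape, wrapping one caller in `asyncio.wait_for` or losing one client mid-request leaves the other waiters' operations intact.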
```python
if event is not None:
    # Wait for the in-flight request to complete
    event.wait()
    with self._lock:
        cached_auth = self._check_cached_auth(sandbox_id)
        if cached_auth:
            return cached_auth
    # Fallback: if the in-flight request failed, fetch ourselves
    return self.get_or_refresh(sandbox_id)
```
Avoid recursive retries after a shared auth failure
When many threads call SandboxAuthCache.get_or_refresh for the same sandbox while /sandbox/{id}/auth is failing, each waiter wakes up here and re-enters get_or_refresh() recursively. Only one waiter becomes the new leader each round, so the others can accumulate one extra stack frame per failure; under a large burst this can end in RecursionError instead of surfacing the original auth error. An iterative retry path would preserve the coalescing without making outage handling stack-depth dependent.
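The iterative shape suggested here can be sketched as a loop that re-contends for leadership instead of recursing. Names below are assumptions mirroring the PR's attributes; the point is that failed rounds cost loop iterations, not stack frames.

```python
import threading
from typing import Any, Callable, Dict


class SyncAuthCoalescer:
    """Sketch of a loop-based waiter: after the leader fails, waiters loop
    back and contend to become the next leader instead of recursing."""

    def __init__(self, fetch: Callable[[str], Any]) -> None:
        self._fetch = fetch  # stands in for the real auth POST
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get_or_refresh(self, sandbox_id: str) -> Any:
        while True:  # bounded by contention, not by stack depth
            with self._lock:
                if sandbox_id in self._cache:
                    return self._cache[sandbox_id]
                event = self._inflight.get(sandbox_id)
                if event is None:
                    event = threading.Event()
                    self._inflight[sandbox_id] = event
                    break  # this thread is the new leader
            event.wait()  # waiter: sleep, then loop and re-check the cache
        try:
            result = self._fetch(sandbox_id)
            with self._lock:
                self._cache[sandbox_id] = result
            return result
        finally:
            with self._lock:
                self._inflight.pop(sandbox_id, None)
            event.set()  # wake all waiters, success or failure
```

During an outage each caller surfaces the fetch error from its own leadership round, and a burst of waiters never accumulates nested `get_or_refresh` frames.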
yeah, this recursion is sus.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
with self._lock:
    if sandbox_id in self._auth_cache:
        self._auth_cache[sandbox_id]["is_gpu"] = is_gpu
    self._save_cache()
```
Sync is_gpu saves cache unconditionally unlike async
Low Severity
In the sync is_gpu, _save_cache() was moved outside the if sandbox_id in self._auth_cache block, so it now runs even when no update was made. The original code (and the async counterpart is_gpu_async) only calls save inside the if block. This is an unintentional behavioral change that performs unnecessary file I/O and is inconsistent with the async version.
```python
self._async_lock: Optional[asyncio.Lock] = None
# Per-sandbox in-flight auth futures to coalesce concurrent requests
self._inflight_sync: Dict[str, threading.Event] = {}
self._inflight_sync_results: Dict[str, Dict[str, Any]] = {}
```
Unused _inflight_sync_results dict is dead code
Low Severity
_inflight_sync_results is initialized in __init__ but never read or written to anywhere in the codebase. It appears to be a leftover from an earlier design of the sync coalescing logic that was replaced by the event-based approach. This unused attribute adds confusion about the concurrency design.
DamianB-BitFlipper left a comment:
I get the gist of this PR, but it is trying to shoehorn new logic into an existing foundation that is not super compatible with it.

A lot of the `asyncio.to_thread` calls and the thread lock + async lock pairing are due to `def _load_cache(self) -> tuple[Dict[str, Any], bool]:` and other methods being "sync".

By my inspection, nothing prevents them from being async (or from having async variants). Then one would not need to do the whole `to_thread` business.
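The suggestion above could look something like the sketch below: an async-native variant whose name, signature, and `(dict, loaded_ok)` return shape are assumptions mirroring `_load_cache`. Note that plain file I/O still has to leave the loop thread somewhere, so the variant concentrates the offload in one place rather than scattering `to_thread` across call sites.

```python
import asyncio
import json
import os
from typing import Any, Dict, Tuple


async def load_cache_async(path: str) -> Tuple[Dict[str, Any], bool]:
    """Sketch of an async variant of the cache loader (names assumed):
    call sites become plain awaits, and the single internal offload
    replaces the scattered asyncio.to_thread wrapping."""

    def _read() -> Tuple[Dict[str, Any], bool]:
        # Blocking syscalls confined to the worker thread.
        if not os.path.exists(path):
            return {}, False
        with open(path) as f:
            return json.load(f), True

    return await asyncio.to_thread(_read)
```

A caller would then write `cache, ok = await load_cache_async(path)` with no knowledge of the threading underneath.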
> I would also not define

would love this, let's do it


Note
Medium Risk
Touches auth token caching and refresh behavior, adding new locking/coalescing paths that could deadlock or cause stale/failed token reuse if incorrect, but scope is limited to client-side cache management.
Overview
Auth cache concurrency was hardened.
`SandboxAuthCache` is now thread-safe and async-safe via locks, and it coalesces concurrent `POST /sandbox/{id}/auth` refreshes so only one request is issued per sandbox while others wait.

Cache lifecycle and I/O were adjusted. The cache can be lazily loaded (used by `AsyncSandboxClient`), expired entries are cleaned on load with a save-on-clean, async file ops avoid blocking the event loop (`to_thread` for mkdir/unlink/exists/makedirs), and `AsyncSandboxClient` adds `clear_auth_cache_async()` alongside the existing sync method.

Written by Cursor Bugbot for commit dcf61b7.