fix: add cuda cache cleanup after each inference to prevent progressi…#1139
black-eleven wants to merge 2 commits into
Code Review
This pull request adds garbage collection and CUDA cache clearing in the API server's task processing loop. However, feedback highlights that in a distributed inference setup, this cleanup only runs on the API server process (rank 0) and not on other worker processes, which could still lead to memory fragmentation and OOM errors. It is recommended to move this cleanup logic to the inference worker itself so that all ranks execute it.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
In a distributed inference setup (e.g., using torchrun), multiple worker processes (ranks) participate in the inference. Since _process_single_task is only executed on the API server process (typically rank 0), calling gc.collect() and torch.cuda.empty_cache() here will only clean up the CUDA cache on rank 0. The other worker processes (rank 1, rank 2, etc.) running the worker loop will not execute this block, meaning they will still suffer from progressive CUDA allocator fragmentation and potential Out-Of-Memory (OOM) errors or slowdowns.

To ensure consistent performance and prevent fragmentation across all ranks, the garbage collection and CUDA cache clearing should be performed within the inference worker itself (e.g., at the end of the task processing in TorchrunInferenceWorker or inside the generation services), so that every rank executes it after completing a task.
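The placement suggested above could look roughly like the following. This is only a sketch: `run_task_on_rank` and `release_cuda_memory` are hypothetical names, `infer` stands in for the real pipeline, and the actual TorchrunInferenceWorker internals are not shown in this thread. The cleanup sits in a `finally` block so that every rank running the function performs it after each task, whether the task succeeds or raises.

```python
import gc

try:
    import torch  # optional, so the sketch also runs on machines without PyTorch
except ImportError:
    torch = None


def release_cuda_memory() -> bool:
    """Collect Python garbage, then release cached CUDA blocks.

    Returns True if the CUDA cache was emptied, False if CUDA is unavailable.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


def run_task_on_rank(infer, task):
    """Hypothetical per-rank task handler (not the project's actual API)."""
    try:
        return infer(task)
    finally:
        # Executed by every rank that runs this function, not only rank 0.
        release_cuda_memory()
```

Because the helper lives next to the inference call rather than in the API server loop, no rank-specific branching is needed.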
0324a96 to 8c0ac6c
Two root causes:
1. TaskInfo cleanup kept 1000 completed tasks, each holding result_png (5-8MB) and the full message body, consuming 5-8GB and slowing down OrderedDict scans in get_next_pending_task. Lowered to 50.
2. Release message reference immediately on task completion.

Also add gc.collect() + torch.cuda.empty_cache() after each inference to prevent CUDA allocator fragmentation over time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
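For scale: 1000 retained results at 5-8MB each is roughly 5-8GB, while 50 retained results is roughly 250-400MB. A minimal sketch of the two changes described in this commit message, assuming a hypothetical `TaskRecord` and `TaskHistory` (the real TaskInfo and TaskManager fields are not shown here and may differ):

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical stand-in for TaskInfo; field names are assumptions."""
    task_id: str
    message: Optional[dict] = None
    result_png: Optional[bytes] = None
    done: bool = False


class TaskHistory:
    def __init__(self, history_limit: int = 50):
        self.history_limit = history_limit
        self.tasks: "OrderedDict[str, TaskRecord]" = OrderedDict()

    def add(self, task_id: str, message: dict) -> None:
        self.tasks[task_id] = TaskRecord(task_id, message=message)

    def complete(self, task_id: str, result_png: bytes) -> None:
        record = self.tasks[task_id]
        record.result_png = result_png
        record.done = True
        record.message = None  # drop the request body as soon as the task finishes
        self._prune()

    def _prune(self) -> None:
        # Evict the oldest completed tasks until at most history_limit remain.
        completed = [tid for tid, rec in self.tasks.items() if rec.done]
        for tid in completed[: max(0, len(completed) - self.history_limit)]:
            del self.tasks[tid]
```

Pending tasks are never evicted in this sketch; only completed entries count toward the limit.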
35e258e to 32f9739
Expose task history retention limit as:
- CLI: --history_limit (default 1000)
- Env: LIGHTX2V_HISTORY_LIMIT
- Config: history_limit
- TaskManager: set_history_limit()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
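One way the first three sources listed above could be combined is sketched below. The precedence order (CLI, then environment, then config, then the default) is an assumption made for illustration, since the commit message does not state it, and `resolve_history_limit` is a hypothetical helper rather than a function from the PR.

```python
import argparse
import os

DEFAULT_HISTORY_LIMIT = 1000  # default stated in the commit message


def resolve_history_limit(argv, config, environ=None):
    """Pick the limit from CLI, then env, then config, then the default.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--history_limit", type=int, default=None)
    args, _ = parser.parse_known_args(argv)  # ignore unrelated server flags

    if args.history_limit is not None:
        return args.history_limit
    if "LIGHTX2V_HISTORY_LIMIT" in environ:
        return int(environ["LIGHTX2V_HISTORY_LIMIT"])
    if "history_limit" in config:
        return int(config["history_limit"])
    return DEFAULT_HISTORY_LIMIT
```

The resolved value would then be handed to the task manager, for example through the set_history_limit() method named in the commit message.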
…ve slowdown
Repeated inference causes CUDA allocator fragmentation: gc.collect() and torch.cuda.empty_cache() were previously called only on task cancellation, not after normal completion. As a result, per-step inference times drifted upward over the server lifetime.