fix: add cuda cache cleanup after each inference to prevent progressi…#1139

Open
black-eleven wants to merge 2 commits into main from fix-server

Conversation

@black-eleven

…ve slowdown

Repeated inference causes CUDA allocator fragmentation — gc.collect() and torch.cuda.empty_cache() were previously only called on task cancellation, not after normal completion. This made per-step inference times drift upward over the server lifetime.
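
A minimal sketch of the change described above, assuming a hypothetical wrapper `run_with_cleanup` around the inference call (the helper names are illustrative and not taken from the repository). Putting the cleanup in a `finally` block makes it run after normal completion as well as after an error or cancellation. Note that `torch.cuda.empty_cache()` only returns unused cached blocks to the driver; it does not free memory still referenced by live tensors.

```python
import gc

try:
    import torch  # optional, so the sketch also runs on machines without PyTorch
except ImportError:
    torch = None


def release_cached_memory():
    """Collect unreachable Python objects, then release unused cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_with_cleanup(infer_fn, *args, **kwargs):
    """Hypothetical wrapper: run infer_fn and clean up on every exit path."""
    try:
        return infer_fn(*args, **kwargs)
    finally:
        # Runs on normal return, on exceptions, and on cancellation alike.
        release_cached_memory()
```

Emptying the cache after every task can make the next allocations slower, since the allocator has to request memory from the driver again, so the trade-off is worth measuring on the actual workload.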

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds garbage collection and CUDA cache clearing in the API server's task processing loop. However, feedback highlights that in a distributed inference setup, this cleanup only runs on the API server process (rank 0) and not on other worker processes, which could still lead to memory fragmentation and OOM errors. It is recommended to move this cleanup logic to the inference worker itself so that all ranks execute it.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v/server/api/server.py Outdated
Comment on lines +155 to +157
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

high

In a distributed inference setup (e.g., using torchrun), multiple worker processes (ranks) participate in the inference. Since _process_single_task is only executed on the API server process (typically rank 0), calling gc.collect() and torch.cuda.empty_cache() here will only clean up the CUDA cache on rank 0. The other worker processes (rank 1, rank 2, etc.) running the worker loop will not execute this block, meaning they will still suffer from progressive CUDA allocator fragmentation and potential Out-Of-Memory (OOM) errors or slowdowns.

To ensure consistent performance and prevent fragmentation across all ranks, the garbage collection and CUDA cache clearing should be performed within the inference worker itself (e.g., at the end of the task processing in TorchrunInferenceWorker or inside the generation services), so that every rank executes it after completing a task.
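
One way to read this suggestion is sketched below, under the assumption that every rank calls the same per-task method. `HypotheticalWorker` and `cleanup_after_task` are illustrative names, not the project's `TorchrunInferenceWorker` API. Because the cleanup is attached to the method that each rank runs, it executes in every process rather than only on rank 0.

```python
import functools
import gc

try:
    import torch  # optional, so the sketch also runs without PyTorch installed
except ImportError:
    torch = None


def cleanup_after_task(fn):
    """Decorator: free cached memory in whichever process (rank) runs fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            gc.collect()
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
    return wrapper


class HypotheticalWorker:
    """Stand-in for a per-rank inference worker; not the project's real class."""

    def __init__(self, rank):
        self.rank = rank
        self.completed = 0

    @cleanup_after_task
    def process_task(self, task):
        # Real code would run the model here; every rank executes this method.
        self.completed += 1
        return {"rank": self.rank, "task": task}
```

In a real torchrun launch each rank is a separate process, so a call such as `HypotheticalWorker(rank).process_task(task)` would clear only that rank's own cache, which is the behaviour the review asks for.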

@black-eleven black-eleven force-pushed the fix-server branch 3 times, most recently from 0324a96 to 8c0ac6c Compare June 9, 2026 10:09
Two root causes:
1. TaskInfo cleanup kept 1000 completed tasks, each holding result_png
   (5-8MB) and full message body — consuming 5-8GB and slowing down
   OrderedDict scans in get_next_pending_task. Lowered to 50.
2. The message reference was kept after task completion. It is now
   released immediately.

Also add gc.collect() + torch.cuda.empty_cache() after each inference
to prevent CUDA allocator fragmentation over time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
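
The two fixes in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the repository's `TaskInfo` or `TaskManager` code: `TaskRecord` and `BoundedTaskHistory` are hypothetical names. The sketch drops the request message as soon as a task completes and evicts the oldest finished tasks once their count exceeds the limit. With the figures quoted above, 50 retained results of 5-8MB each come to roughly 250-400MB instead of 5-8GB.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are illustrative."""
    task_id: str
    message: Optional[Any] = None   # request body, potentially large
    status: str = "pending"
    result: Optional[bytes] = None  # e.g. encoded image bytes


class BoundedTaskHistory:
    """Keeps all pending tasks but at most history_limit finished ones."""

    def __init__(self, history_limit=50):
        self.history_limit = history_limit
        self._tasks = OrderedDict()  # insertion order == submission order

    def add(self, task_id, message):
        self._tasks[task_id] = TaskRecord(task_id, message)

    def complete(self, task_id, result):
        record = self._tasks[task_id]
        record.status = "completed"
        record.result = result
        record.message = None  # release the request body immediately
        self._evict_old_finished()

    def _evict_old_finished(self):
        finished = [tid for tid, r in self._tasks.items() if r.status != "pending"]
        excess = max(0, len(finished) - self.history_limit)
        for tid in finished[:excess]:  # oldest finished tasks first
            del self._tasks[tid]

    def get(self, task_id):
        return self._tasks.get(task_id)

    def __len__(self):
        return len(self._tasks)
```

A smaller dictionary also shortens any linear scan for the next pending task, which is the second cost the commit message mentions.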
@black-eleven black-eleven force-pushed the fix-server branch 3 times, most recently from 35e258e to 32f9739 Compare June 10, 2026 10:12
Expose task history retention limit as:
- CLI: --history_limit (default 1000)
- Env: LIGHTX2V_HISTORY_LIMIT
- Config: history_limit
- TaskManager: set_history_limit()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
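
A sketch of how the three sources listed above might be combined. The commit message does not state a precedence order, so the order used here (CLI flag, then environment variable, then config file, then the default of 1000) is an assumption, and `resolve_history_limit` is a hypothetical helper rather than a function from the repository.

```python
import argparse
import os

DEFAULT_HISTORY_LIMIT = 1000  # default quoted in the commit message


def resolve_history_limit(argv=None, env=None, config=None):
    """Assumed precedence: CLI > environment variable > config > default."""
    env = os.environ if env is None else env
    config = config or {}

    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--history_limit", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.history_limit is not None:
        return args.history_limit
    if "LIGHTX2V_HISTORY_LIMIT" in env:
        return int(env["LIGHTX2V_HISTORY_LIMIT"])
    if "history_limit" in config:
        return int(config["history_limit"])
    return DEFAULT_HISTORY_LIMIT
```

For example, `resolve_history_limit(["--history_limit", "50"], {}, {})` returns the CLI value, and the resolved number could then be handed to `set_history_limit()` on the task manager.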