fix: add cuda cache cleanup after each inference to prevent progressi… by black-eleven · Pull Request #1139 · ModelTC/LightX2V

black-eleven · 2026-06-09T09:26:06Z

…ve slowdown

Repeated inference causes CUDA allocator fragmentation — gc.collect() and torch.cuda.empty_cache() were previously only called on task cancellation, not after normal completion. This made per-step inference times drift upward over the server lifetime.

gemini-code-assist

Code Review

This pull request adds garbage collection and CUDA cache clearing in the API server's task processing loop. However, feedback highlights that in a distributed inference setup, this cleanup only runs on the API server process (rank 0) and not on other worker processes, which could still lead to memory fragmentation and OOM errors. It is recommended to move this cleanup logic to the inference worker itself so that all ranks execute it.


gemini-code-assist · 2026-06-09T09:27:02Z

+            gc.collect()
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()


In a distributed inference setup (e.g., using torchrun), multiple worker processes (ranks) participate in the inference. Since _process_single_task is only executed on the API server process (typically rank 0), calling gc.collect() and torch.cuda.empty_cache() here will only clean up the CUDA cache on rank 0. The other worker processes (rank 1, rank 2, etc.) running the worker loop will not execute this block, meaning they will still suffer from progressive CUDA allocator fragmentation and potential Out-Of-Memory (OOM) errors or slowdowns.

To ensure consistent performance and prevent fragmentation across all ranks, the garbage collection and CUDA cache clearing should be performed within the inference worker itself (e.g., at the end of the task processing in TorchrunInferenceWorker or inside the generation services), so that every rank executes it after completing a task.
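A minimal sketch of the per-rank cleanup described in the review comment above. The helper names `release_cached_memory` and `run_task_with_cleanup` are hypothetical and do not come from the PR; the idea is only that each rank's worker loop wraps its own inference call, so every process clears its own allocator cache. The `torch` import is guarded so the sketch also runs where PyTorch is not installed.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_cached_memory() -> bool:
    """Collect Python garbage, then release unused cached CUDA blocks.

    Returns True if the CUDA cache was cleared, False otherwise.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Frees cached blocks that are not backing live tensors.
        torch.cuda.empty_cache()
        return True
    return False


def run_task_with_cleanup(infer_fn, *args, **kwargs):
    """Hypothetical wrapper: run one task and always clean up afterwards.

    Meant to be called inside each rank's worker loop rather than only
    in the API server process.
    """
    try:
        return infer_fn(*args, **kwargs)
    finally:
        release_cached_memory()
```

Using `finally` means the cleanup also runs when a task raises, which covers both the cancellation path and normal completion. `torch.cuda.empty_cache()` only returns unused cached blocks, so the next task may pay some reallocation cost; whether that trade-off is acceptable per task is a judgment call for the maintainers.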

Two root causes:

1. TaskInfo cleanup kept 1000 completed tasks, each holding result_png (5-8MB) and full message body, consuming 5-8GB and slowing down OrderedDict scans in get_next_pending_task. Lowered to 50.
2. Release message reference immediately on task completion.

Also add gc.collect() + torch.cuda.empty_cache() after each inference to prevent CUDA allocator fragmentation over time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
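The two changes in this commit message can be sketched roughly as follows. `TaskRecord` and `TaskHistory` are hypothetical stand-ins for the PR's TaskInfo and task manager, and their field and method names are assumptions made for illustration; only the ideas (drop the message on completion, cap the number of retained completed tasks) come from the commit message.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    # Hypothetical stand-in for TaskInfo; field names are assumptions.
    task_id: str
    status: str = "pending"
    message: Optional[dict] = None
    result_png: Optional[bytes] = None


class TaskHistory:
    def __init__(self, history_limit: int = 50):
        self.history_limit = history_limit
        self.tasks: "OrderedDict[str, TaskRecord]" = OrderedDict()

    def add(self, task_id: str, message: dict) -> None:
        self.tasks[task_id] = TaskRecord(task_id, message=message)

    def complete(self, task_id: str, result_png: Optional[bytes] = None) -> None:
        record = self.tasks[task_id]
        record.status = "completed"
        record.result_png = result_png
        record.message = None  # release the request body immediately
        self._trim()

    def _trim(self) -> None:
        # Evict the oldest completed records beyond history_limit;
        # pending records are never evicted.
        completed = [tid for tid, r in self.tasks.items() if r.status == "completed"]
        excess = max(0, len(completed) - self.history_limit)
        for tid in completed[:excess]:
            del self.tasks[tid]
```

With a small limit, the dictionary stays short, so any linear scan for the next pending task touches far fewer entries and far less image data stays referenced.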

Expose task history retention limit as:

- CLI: --history_limit (default 1000)
- Env: LIGHTX2V_HISTORY_LIMIT
- Config: history_limit
- TaskManager: set_history_limit()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
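One way the CLI flag and environment variable listed above could be resolved into a single value is sketched below. The function name `resolve_history_limit` and the precedence order (CLI flag, then environment variable, then the default of 1000) are assumptions; the commit message does not say which source wins.

```python
import argparse
import os

DEFAULT_HISTORY_LIMIT = 1000  # default stated in the commit message


def resolve_history_limit(argv=None, environ=None) -> int:
    """Hypothetical resolver. Assumed precedence: CLI > env var > default."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--history_limit", type=int, default=None)
    # parse_known_args ignores unrelated server flags instead of failing.
    args, _unknown = parser.parse_known_args(argv)

    if args.history_limit is not None:
        value = args.history_limit
    elif "LIGHTX2V_HISTORY_LIMIT" in environ:
        value = int(environ["LIGHTX2V_HISTORY_LIMIT"])
    else:
        value = DEFAULT_HISTORY_LIMIT

    if value < 0:
        raise ValueError("history_limit must be non-negative")
    return value
```

For example, `resolve_history_limit(["--history_limit", "50"], {})` yields 50, which could then be passed to `set_history_limit()` on the task manager.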

gemini-code-assist Bot reviewed Jun 9, 2026


black-eleven force-pushed the fix-server branch 3 times, most recently from 0324a96 to 8c0ac6c · June 9, 2026 10:09

black-eleven force-pushed the fix-server branch 3 times, most recently from 35e258e to 32f9739 · June 10, 2026 10:12

feat: make history_limit configurable via CLI and env var

8f181e6


black-eleven force-pushed the fix-server branch from 32f9739 to 8f181e6 · June 10, 2026 11:02

black-eleven wants to merge 2 commits into main from fix-server
