Describe the Bug
While testing CloudAI as the benchmark execution layer for a small wrapper project, CloudAI Autotune, I found a standalone sleep dry-run path that can terminate the parent wrapper process group.
The dry-run reached the common sleep scenario and then logged:
[INFO] Scheduling termination of job 0 after 0 seconds.
The CloudAI debug log for the same run shows the termination command:
Executing termination command for job 0: kill -9 0
Because kill -9 0 signals every process in the caller's process group, the parent wrapper process was killed too. From the wrapper's perspective, the command exited with code 137 (128 + 9, i.e. SIGKILL) before it could print a normal pass/fail summary or parse any resulting artifacts.
The likely area is standalone job termination. In my local checkout, src/cloudai/systems/standalone/standalone_system.py builds:
cmd = f"kill -9 {job.id}"
If job.id is 0, that becomes kill -9 0.
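To illustrate the kind of guard that would prevent this, here is a minimal sketch. The function name build_kill_command is hypothetical and is not taken from the CloudAI source; it only shows the idea of rejecting ids that kill interprets specially (0 means the caller's process group, -1 means every process the caller may signal, and other negative values mean a process group).

```python
def build_kill_command(pid: int) -> str:
    """Return a kill command for a single process, or raise for unsafe ids.

    Hypothetical helper, not CloudAI's actual API.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(pid, bool) or not isinstance(pid, int):
        raise ValueError(f"pid must be an int, got {pid!r}")
    # 0, -1 and other negative values address process groups, not one process.
    if pid <= 0:
        raise ValueError(f"refusing to signal process-group-sensitive id {pid}")
    return f"kill -9 {pid}"
```

With a guard like this, a job whose id is 0 would surface as an explicit error instead of a signal sent to the caller's process group.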
Steps to Reproduce
Observed using a local CloudAI checkout at commit dfa28da3 with the common standalone sleep scenario.
I triggered it through a wrapper command that shells out to CloudAI dry-run:
python -m autotune.cli smoke-cloudai \
/Users/shreyas/cloudai/conf/common/test_scenario/sleep.toml \
--cloudai-bin /Users/shreyas/cloudai/.venv/bin/cloudai \
--system-config /Users/shreyas/cloudai/conf/common/system/standalone_system.toml \
--tests-dir /Users/shreyas/cloudai/conf/common/test \
--runs-dir runs \
--timeout-sec 60
Relevant stdout captured by the wrapper:
[INFO] System Name: standalone_system
[INFO] Scheduler: standalone
[INFO] Test Scenario Name: sleep-scenario
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating StandaloneRunner
[INFO] Starting test: Tests.sleep1
[INFO] Executing command for test Tests.sleep1: sleep 10
[INFO] Starting test: Tests.sleep20
[INFO] Executing command for test Tests.sleep20: sleep 20
[INFO] Starting test: Tests.sleep5
[INFO] Executing command for test Tests.sleep5: sleep 5
[INFO] Job completed: Tests.sleep1 (iteration 1 of 1)
[INFO] Starting test: Tests.sleep5_2
[INFO] Executing command for test Tests.sleep5_2: sleep 5
[INFO] Scheduling termination of job 0 after 0 seconds.
The parent wrapper then exited with code 137.
Expected Behavior
A standalone CloudAI dry-run should return control to the parent process cleanly, even when a dependency-triggered termination is scheduled.
For wrapper tools, CI jobs, and smoke tests, I would expect one of these outcomes:
- CloudAI completes dry-run successfully and exits 0; or
- CloudAI reports a scoped job termination/failure and exits non-zero without killing the caller.
The termination command should target only the child process/job CloudAI owns, not process group 0.
Suggested Fix Direction
- Guard standalone kill against invalid or process-group-sensitive ids such as 0.
- Track the actual subprocess PID for standalone jobs and terminate that PID/process tree explicitly.
- Add a regression test for dry-running conf/common/test_scenario/sleep.toml under a parent wrapper/subprocess and assert the parent process survives.
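A rough sketch of the PID-tracking direction, assuming a POSIX host. The names start_job and terminate_job are hypothetical and do not reflect how CloudAI's standalone runner is actually structured. The idea is to start each job in its own session, so that its process group id equals its PID, and then signal only that group.

```python
import os
import signal
import subprocess
import sys


def start_job(argv):
    """Start a job in a new session so it leads its own process group."""
    # start_new_session=True makes the child call setsid() before exec (POSIX only).
    return subprocess.Popen(argv, start_new_session=True)


def terminate_job(proc, timeout=5.0):
    """Kill only the job's process group and return its exit status."""
    if proc.poll() is not None:
        return proc.returncode  # job already exited; nothing to signal
    try:
        # The child is a session leader, so its process group id equals its PID.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # job exited between poll() and killpg()
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    job = start_job([sys.executable, "-c", "import time; time.sleep(30)"])
    print("job exit status:", terminate_job(job))
    print("parent still running, pid", os.getpid())
```

Because the signal is addressed to the job's own group, the caller's process group is never a target, and any grandchildren the job spawned in that group are cleaned up with it.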
Screenshots
Not applicable. The relevant evidence is the captured stdout/debug-log text above.