Skip to content

Standalone sleep dry-run can kill parent wrapper when job id is 0 #929

Description

@shreyaskommuri

Describe the Bug

While testing CloudAI as the benchmark execution layer for a small wrapper project, CloudAI Autotune, I found a standalone sleep dry-run path that can terminate the parent wrapper process group.

The dry-run reached the common sleep scenario and then logged:

[INFO] Scheduling termination of job 0 after 0 seconds.

The CloudAI debug log for the same run shows the termination command:

Executing termination command for job 0: kill -9 0

Because kill -9 0 targets the current process group, the parent wrapper process was killed too. From the wrapper's perspective, the command exited with code 137 before it could print a normal pass/fail summary or parse any resulting artifacts.

The likely area is standalone job termination. In my local checkout, src/cloudai/systems/standalone/standalone_system.py builds:

cmd = f"kill -9 {job.id}"

If job.id is 0, that becomes kill -9 0.

Steps to Reproduce

Observed using a local CloudAI checkout at commit dfa28da3 with the common standalone sleep scenario.

I triggered it through a wrapper command that shells out to CloudAI dry-run:

python -m autotune.cli smoke-cloudai \
  /Users/shreyas/cloudai/conf/common/test_scenario/sleep.toml \
  --cloudai-bin /Users/shreyas/cloudai/.venv/bin/cloudai \
  --system-config /Users/shreyas/cloudai/conf/common/system/standalone_system.toml \
  --tests-dir /Users/shreyas/cloudai/conf/common/test \
  --runs-dir runs \
  --timeout-sec 60

Relevant stdout captured by the wrapper:

[INFO] System Name: standalone_system
[INFO] Scheduler: standalone
[INFO] Test Scenario Name: sleep-scenario
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating StandaloneRunner
[INFO] Starting test: Tests.sleep1
[INFO] Executing command for test Tests.sleep1: sleep 10
[INFO] Starting test: Tests.sleep20
[INFO] Executing command for test Tests.sleep20: sleep 20
[INFO] Starting test: Tests.sleep5
[INFO] Executing command for test Tests.sleep5: sleep 5
[INFO] Job completed: Tests.sleep1 (iteration 1 of 1)
[INFO] Starting test: Tests.sleep5_2
[INFO] Executing command for test Tests.sleep5_2: sleep 5
[INFO] Scheduling termination of job 0 after 0 seconds.

The parent wrapper then exited with code 137.

Expected Behavior

A standalone CloudAI dry-run should return control to the parent process cleanly, even when a dependency-triggered termination is scheduled.

For wrapper tools, CI jobs, and smoke tests, I would expect one of these outcomes:

  • CloudAI completes dry-run successfully and exits 0; or
  • CloudAI reports a scoped job termination/failure and exits non-zero without killing the caller.

The termination command should target only the child process/job CloudAI owns, not process group 0.

Suggested Fix Direction

  • Guard standalone kill against invalid or process-group-sensitive ids such as 0.
  • Track the actual subprocess PID for standalone jobs and terminate that PID/process tree explicitly.
  • Add a regression test for dry-running conf/common/test_scenario/sleep.toml under a parent wrapper/subprocess and assert the parent process survives.

Screenshots

Not applicable. The relevant evidence is the captured stdout/debug-log text above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions