Add pre-test hook support for MegatronBridge workload#957
Add pre-test hook support for MegatronBridge workload#957saivishal1999 wants to merge 2 commits into
Conversation
MegatronBridge's gen_exec_command uses setup_experiment.py to generate its own sbatch, bypassing CloudAI's normal pre_test injection. This adds support by submitting the pre-hook as a separate sbatch first, then passing dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params so Slurm gates training on pre-hook success. - _gen_pre_hook_sbatch: generates standalone sbatch for pre-hook tests, injecting main test's nnodes so srun allocates all training nodes - _collect_additional_slurm_params: extracted from _build_launcher_parts so the base params can be built in Python and dependency appended at bash runtime after the pre-hook job ID is known - _wrap_launcher_for_job_id_and_quiet_output: submits pre-hook sbatch, captures job ID, builds --additional_slurm_params with dependency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds optional pre-test sbatch support to the Megatron-Bridge Slurm command generation strategy. A standalone pre-hook sbatch script can now be generated, submitted, and its job ID used as an afterok dependency for the main launcher job, with conditional inclusion of additional Slurm parameters. ChangesPre-hook sbatch dependency for Megatron-Bridge
Estimated code review effort: 4 (Complex) | ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 133-136: The pre-hook sbatch generation in build_sbatch_lines is
dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes.
Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even
in the --nodes path, not just when node_list is present, and keep the existing
behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded
nodes are still passed through.
- Around line 117-118: The pre-test node assignment in
slurm_command_gen_strategy.py is using the read-only TestRun.nnodes property,
which will fail before sbatch generation. Update the loop in the strategy method
that iterates over self.test_run.pre_test.test_runs to assign the underlying
mutable field num_nodes instead of nnodes, keeping the existing logic that
copies the parent test run’s node count into each pre-test run.
- Line 139: The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy
currently only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let
the sbatch wrapper exit successfully and release the main job. Update the
command assembly around sbatch_lines.extend and pre_hook_cmds to explicitly exit
non-zero when the pre-test check fails, using the gen_pre_test-generated success
flag as the gate so the pre-hook sbatch cannot complete as success on failure.
- Around line 344-350: Make the include_slurm_params parameter keyword-only in
_build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002
warning is eliminated. Update the _build_launcher_parts function signature to
force include_slurm_params as a keyword-only argument, and ensure any internal
call sites use the named argument form; the existing keyword-based caller should
continue to work unchanged.
- Around line 268-270: The shell wrapper in the Slurm command generation logic
is inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS
assignment, which can allow shell expansion and produce a leading separator when
the base value is empty. Update the Megatron bridge’s slurm_command_gen_strategy
around the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape
base_slurm_params before embedding it in the shell snippet, and adjust the
PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when
ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly
in the strategy method that emits these lines to keep the fix localized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: dce884bc-1a4a-49c7-bf8d-0aa679ff96f6
📒 Files selected for processing (1)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
| if node_list: | ||
| sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}") | ||
| elif num_nodes: | ||
| sbatch_lines.append(f"#SBATCH --nodes={num_nodes}") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Preserve excluded nodes in the pre-hook sbatch.
When get_cached_nodes_spec() returns only num_nodes, this script emits --nodes but drops self.test_run.exclude_nodes, so the pre-hook can run on nodes the main job excludes.
🐛 Proposed fix
if node_list:
sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}")
elif num_nodes:
sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}")
+ if self.test_run.exclude_nodes:
+ sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if node_list: | |
| sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}") | |
| elif num_nodes: | |
| sbatch_lines.append(f"#SBATCH --nodes={num_nodes}") | |
| if node_list: | |
| sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}") | |
| elif num_nodes: | |
| sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}") | |
| if self.test_run.exclude_nodes: | |
| sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 133 - 136, The pre-hook sbatch generation in build_sbatch_lines is
dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes.
Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even
in the --nodes path, not just when node_list is present, and keep the existing
behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded
nodes are still passed through.
There was a problem hiding this comment.
this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -a SlurmCommandGenStrategy
fd . -a --type f -g '*slurm_command_gen_strategy.py' src/cloudai/systems/slurm
</command>Length of output: 810
🏁 Script executed:
#!/bin/bash
fd -a slurm_command_gen_strategy.py
echo "---"
fd -a -e py . src/cloudai/systems/slurmLength of output: 2675
🏁 Script executed:
#!/bin/bash
ast-grep outline src/cloudai/systems/slurm/slurm_command_gen_strategy.pyLength of output: 995
🏁 Script executed:
#!/bin/bash
ast-grep run --pattern 'def _append_sbatch_directives($$$) {
$$$
}' --lang python src/cloudai/systems/slurm/slurm_command_gen_strategy.py
echo "---"
sed -n '1,400p' src/cloudai/systems/slurm/slurm_command_gen_strategy.py | grep -n "_append_sbatch_directives\|_append_nodes_related_directives\|_write_sbatch_script"Length of output: 1164
🏁 Script executed:
#!/bin/bash
sed -n '342,470p' src/cloudai/systems/slurm/slurm_command_gen_strategy.pyLength of output: 5433
@podkidyshev, agreed — that's the better fix. I checked src/cloudai/systems/slurm/slurm_command_gen_strategy.py, and it already has most of this logic centralized in _append_sbatch_directives (Line 375) and _append_nodes_related_directives (Line 411), covering: reservation, --distribution, nodelist/hostfile vs -N, exclude, gpus-per-node/gres, ntasks-per-node, --time, and extra_sbatch_args. The _gen_pre_hook_sbatch method in slurm_command_gen_strategy.py currently reimplements only a small subset of that (job-name, output/error, partition, account, nodelist/nodes, time_limit), which is why distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude are all missing there.
The one wrinkle is that _append_sbatch_directives hardcodes self.test_run.output_path (for stdout/stderr) and self.job_name(), both of which differ for the pre-hook script (pre_hook subdir, pre_hook_ prefix). So a clean extraction would be something like pulling the node/GPU/misc-directive portion (distribution, nodelist/exclude/hostfile, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) out of _append_sbatch_directives into its own method, e.g. _append_resource_directives(content: List[str]), that both _write_sbatch_script and _gen_pre_hook_sbatch call directly, while each caller still appends its own job-name/output/error/partition/account lines.
Since this touches the shared base class and is broader than the original single-line fix, would you like this done in this PR, or should I open a follow-up issue to track the base-class refactor so both _gen_pre_hook_sbatch and the main sbatch path stay in sync?
✏️ Learnings added
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 957
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:133-136
Timestamp: 2026-07-03T12:16:35.962Z
Learning: In NVIDIA/cloudai, src/cloudai/systems/slurm/slurm_command_gen_strategy.py already centralizes most SBATCH directive generation (distribution, nodelist/hostfile/-N, exclude_nodes, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) inside `_append_sbatch_directives` and `_append_nodes_related_directives`. When a workload-specific strategy (e.g., src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py's `_gen_pre_hook_sbatch`) needs to generate a standalone sbatch script, it should reuse/extend these base-class methods rather than reimplementing a partial subset of directives, to avoid missing directives like distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude_nodes. Maintainer podkidyshev suggested extracting the resource-directive portion of `_append_sbatch_directives` into its own reusable sub-method on the base class for this purpose.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 799
File: src/cloudai/workloads/vllm/report_generation_strategy.py:47-48
Timestamp: 2026-02-10T13:29:25.671Z
Learning: In src/cloudai/workloads/vllm/report_generation_strategy.py, it is acceptable to annotate parse_vllm_bench_output with a caching decorator (e.g., functools.cache) because benchmark result files are immutable after creation in typical usage. This improves repeated calls by avoiding re-parsing unchanged data. Ensure the function is side-effect free and that cache invalidation is not needed for these inputs. If inputs include non-file-state variability, consider cache keys on file path or content hash.
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 821
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:87-97
Timestamp: 2026-03-10T11:01:25.158Z
Learning: In NVIDIA/cloudai repository, there is no secrets management infrastructure; environment variable values (including tokens/secrets) are serialized as literal strings into generated commands and run artifacts (e.g., cloudai_generated_command.sh). This is an architectural limitation acknowledged by the maintainer (podkidyshev). During code reviews for this repo, do not flag secret-serialization issues unless a secrets management solution is available. If a future change introduces proper secrets handling, update this guideline accordingly.
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 821
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:69-70
Timestamp: 2026-03-10T11:09:12.655Z
Learning: In src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, _build_launcher_parts uses ' '.join(parts) intentionally. The parts list contains raw shell constructs and pre-quoted entries (via shlex.quote) so they survive as single arguments to the Python launcher. Replacing with shlex.join(parts) would double-quote shell syntax and break the generated bash command. Do not suggest replacing ' '.join(parts) with shlex.join(parts) in this file during reviews.
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 829
File: src/cloudai/workloads/common/nixl.py:198-217
Timestamp: 2026-03-13T14:17:44.666Z
Learning: In NVIDIA/cloudai, when reviewing code in src/cloudai/workloads/common/nixl.py, understand that NIXL Pydantic models (e.g., NIXLExtendedCmdArgs) may define fields like device_list and total_buffer_size as str | list[str] to support sweep configuration. However, by the time methods such as _device_list_mounts in NIXLCmdGenBase are invoked, these fields are resolved to a single str or None. Casting them to str | None at the command-generation layer is intentional and correct and should not be flagged as a bug. If reviewing related code, verify that the sweep expansion happens before command generation and that the command-gen layer operates on concrete string values or None rather than lists.
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 836
File: tests/workloads/test_llm_serving.py:35-83
Timestamp: 2026-03-16T14:10:11.280Z
Learning: In the NVIDIA/cloudai repository, require meaningful docstrings for non-test Python classes (production/library code) to improve readability and maintainability. Test doubles, test helper classes, and other classes defined inside test files (e.g., Fake*, Plain*, stub classes) are exempt from the docstring requirement. Enforce via automated checks (e.g., a docstring style/quality gate using CI) that validate presence and quality of class docstrings for production code while skipping test-only classes within test files.
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 839
File: src/cloudai/workloads/sglang/slurm_command_gen_strategy.py:41-41
Timestamp: 2026-03-17T20:01:53.137Z
Learning: In src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, 0.0.0.0 bindings (default prefill_host/decode_host in get_serve_commands and --host 0.0.0.0 in get_helper_command) are intentional for in-cluster serving. Ruff S104 warnings on these lines should not be flagged or require suppression comments, as the maintainer (amaslenn) has explicitly confirmed this is the intended behavior.
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 846
File: src/cloudai/reporter.py:168-168
Timestamp: 2026-03-24T23:24:17.668Z
Learning: In the NVIDIA/cloudai repository, follow the project-wide logging convention: use root/global logging calls like `logging.info(...)` and `logging.warning(...)` rather than creating module-level loggers via `logging.getLogger(__name__)`. When reviewing Python code, do not flag or reference LOG015 for this reason and do not suggest switching to module-level loggers. Separately, Ruff’s formatting rule (e.g., G004 f-string vs %-style) is independent and may still be raised if applicable.
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 869
File: src/cloudai/_core/test_scenario.py:23-23
Timestamp: 2026-04-14T11:05:59.444Z
Learning: In the NVIDIA/cloudai repository, do not surface or fail code-review findings for Ruff UP035 warnings about deprecated typing aliases (e.g., typing.List, typing.Set, typing.Type) on individual files. This deprecation cleanup is planned for a dedicated whole-repo modernization bump, so UP035 should be treated as an expected/known issue rather than something reviewers should block at the per-file level.
Learnt from: orbalayla-nvidia
Repo: NVIDIA/cloudai PR: 899
File: src/cloudai/_core/grading_strategy.py:2-2
Timestamp: 2026-06-07T12:01:43.285Z
Learning: In the NVIDIA/cloudai repo, copyright headers’ year formatting must match the test logic enforced by tests/test_check_copyright_headers.py (via collect_years_same_file and _format_years_to_ranges): use comma-separated year lists for non-consecutive “touched” years (e.g., "2024, 2026") and use hyphen ranges only for consecutive years (e.g., "2024-2026"). Do not “normalize” "YYYY, YYYY" into "YYYY-YYYY" unless the missing years are actually present in git history; otherwise the copyright header check will fail.
Learnt from: rutayan-nv
Repo: NVIDIA/cloudai PR: 901
File: src/cloudai/cli/handlers.py:147-162
Timestamp: 2026-06-24T01:05:39.500Z
Learning: For NVIDIA/cloudai: Ruff rule G004 (logging-f-string) is not enabled in this repo’s Ruff configuration, and CI Ruff checks pass even with f-string usage in logging calls. During code reviews, do not flag or request changes specifically for Ruff G004 violations; leave logging f-strings as-is in .py files unless CI or other enabled rules fail.
| sbatch_lines.append(f"#SBATCH --nodes={num_nodes}") | ||
| if self.test_run.time_limit: | ||
| sbatch_lines.append(f"#SBATCH --time={self.test_run.time_limit}") | ||
| sbatch_lines.extend(["", pre_hook_cmds]) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
Exit non-zero when the pre-hook success check fails.
gen_pre_test() only sets PRE_TEST_SUCCESS; without an explicit exit here, the pre-hook sbatch can finish successfully and unblock the main training job even after a failed pre-test.
🐛 Proposed fix
- sbatch_lines.extend(["", pre_hook_cmds])
+ sbatch_lines.extend(
+ [
+ "",
+ pre_hook_cmds,
+ "",
+ 'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then',
+ ' echo "Pre-hook failed; not releasing dependent training job." >&2',
+ " exit 1",
+ "fi",
+ ]
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sbatch_lines.extend(["", pre_hook_cmds]) | |
| sbatch_lines.extend( | |
| [ | |
| "", | |
| pre_hook_cmds, | |
| "", | |
| 'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then', | |
| ' echo "Pre-hook failed; not releasing dependent training job." >&2', | |
| " exit 1", | |
| "fi", | |
| ] | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` at line
139, The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy currently
only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let the sbatch
wrapper exit successfully and release the main job. Update the command assembly
around sbatch_lines.extend and pre_hook_cmds to explicitly exit non-zero when
the pre-test check fails, using the gen_pre_test-generated success flag as the
gate so the pre-hook sbatch cannot complete as success on failure.
| f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"', | ||
| 'if [ -n "$PRE_HOOK_JOB_ID" ]; then', | ||
| ' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"', |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Quote base Slurm params and avoid a leading empty parameter.
base_slurm_params is injected into a double-quoted shell assignment, so $, backticks, or quotes from extra Slurm args can be expanded by the wrapper. Also, an empty base value becomes ;dependency=....
🐛 Proposed fix
if pre_hook_sbatch_path is not None:
+ quoted_base_slurm_params = shlex.quote(base_slurm_params)
pre_hook_lines = [
f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"',
'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)',
@@
"fi",
'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"',
- f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
- 'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
+ f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}",
+ 'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then',
' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
+ "else",
+ ' ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"',
"fi",
"",
]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"', | |
| 'if [ -n "$PRE_HOOK_JOB_ID" ]; then', | |
| ' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"', | |
| quoted_base_slurm_params = shlex.quote(base_slurm_params) | |
| pre_hook_lines = [ | |
| f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"', | |
| 'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)', | |
| # ... | |
| "fi", | |
| 'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"', | |
| f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}", | |
| 'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then', | |
| ' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"', | |
| "else", | |
| ' ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"', | |
| "fi", | |
| "", | |
| ] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 268 - 270, The shell wrapper in the Slurm command generation logic is
inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS assignment,
which can allow shell expansion and produce a leading separator when the base
value is empty. Update the Megatron bridge’s slurm_command_gen_strategy around
the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape
base_slurm_params before embedding it in the shell snippet, and adjust the
PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when
ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly
in the strategy method that emits these lines to keep the fix localized.
| def _build_launcher_parts( # noqa: C901 | ||
| self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path | ||
| self, | ||
| args: MegatronBridgeCmdArgs, | ||
| tdef: MegatronBridgeTestDefinition, | ||
| repo_path: Path, | ||
| launcher_py: Path, | ||
| include_slurm_params: bool = True, |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
ruff check --select FBT001,FBT002 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pyRepository: NVIDIA/cloudai
Length of output: 1524
Make include_slurm_params keyword-only. This removes the FBT001/FBT002 warning and avoids boolean positional calls; the existing keyword call site stays unchanged.
♻️ Proposed fix
args: MegatronBridgeCmdArgs,
tdef: MegatronBridgeTestDefinition,
repo_path: Path,
launcher_py: Path,
+ *,
include_slurm_params: bool = True,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _build_launcher_parts( # noqa: C901 | |
| self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path | |
| self, | |
| args: MegatronBridgeCmdArgs, | |
| tdef: MegatronBridgeTestDefinition, | |
| repo_path: Path, | |
| launcher_py: Path, | |
| include_slurm_params: bool = True, | |
| def _build_launcher_parts( # noqa: C901 | |
| self, | |
| args: MegatronBridgeCmdArgs, | |
| tdef: MegatronBridgeTestDefinition, | |
| repo_path: Path, | |
| launcher_py: Path, | |
| *, | |
| include_slurm_params: bool = True, |
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 344-344: Too many branches (28 > 12)
(PLR0912)
[warning] 344-344: Too many statements (152 > 50)
(PLR0915)
[warning] 350-350: Boolean-typed positional argument in function definition
(FBT001)
[warning] 350-350: Boolean default positional argument in function definition
(FBT002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 344 - 350, Make the include_slurm_params parameter keyword-only in
_build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002
warning is eliminated. Update the _build_launcher_parts function signature to
force include_slurm_params as a keyword-only argument, and ensure any internal
call sites use the named argument form; the existing keyword-based caller should
continue to work unchanged.
Source: Linters/SAST tools
tr.nnodes is a read-only property backed by num_nodes; set num_nodes directly so srun gets -N<main_job_nodes> instead of -N1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@saivishal1999 please resolve coderabbit comments |
| if node_list: | ||
| sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}") | ||
| elif num_nodes: | ||
| sbatch_lines.append(f"#SBATCH --nodes={num_nodes}") |
There was a problem hiding this comment.
this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer
MegatronBridge's gen_exec_command uses setup_experiment.py to generate its own sbatch, bypassing CloudAI's normal pre_test injection. This adds support by submitting the pre-hook as a separate sbatch first, then passing dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params so Slurm gates training on pre-hook success.