From 014e05d3fe822680830035182a4976bbaa271766 Mon Sep 17 00:00:00 2001
From: AtlantaPepsi
Date: Thu, 28 May 2026 19:57:30 +0000
Subject: [PATCH] first addition of skills

---
 .claude/skills/transferbench-debug/SKILL.md  |  94 +++++++++++
 .../examples/topology-probe.sh               |  58 +++++++
 .../references/common-failures.md            | 136 ++++++++++++++++
 .../references/multi-rank-debug.md           | 148 +++++++++++++++++
 .../references/verbose-introspection.md      | 112 +++++++++++++
 .claude/skills/transferbench-run/SKILL.md    | 150 ++++++++++++++++++
 .../examples/advanced-mixed.cfg              |  18 +++
 .../transferbench-run/examples/basic-p2p.cfg |  21 +++
 .../transferbench-run/examples/multi-node.sh |  59 +++++++
 .../references/config-format.md              | 110 +++++++++++++
 .../transferbench-run/references/env-vars.md | 114 +++++++++++++
 .../transferbench-run/references/presets.md  |  74 +++++++++
 12 files changed, 1094 insertions(+)
 create mode 100644 .claude/skills/transferbench-debug/SKILL.md
 create mode 100755 .claude/skills/transferbench-debug/examples/topology-probe.sh
 create mode 100644 .claude/skills/transferbench-debug/references/common-failures.md
 create mode 100644 .claude/skills/transferbench-debug/references/multi-rank-debug.md
 create mode 100644 .claude/skills/transferbench-debug/references/verbose-introspection.md
 create mode 100644 .claude/skills/transferbench-run/SKILL.md
 create mode 100644 .claude/skills/transferbench-run/examples/advanced-mixed.cfg
 create mode 100644 .claude/skills/transferbench-run/examples/basic-p2p.cfg
 create mode 100755 .claude/skills/transferbench-run/examples/multi-node.sh
 create mode 100644 .claude/skills/transferbench-run/references/config-format.md
 create mode 100644 .claude/skills/transferbench-run/references/env-vars.md
 create mode 100644 .claude/skills/transferbench-run/references/presets.md

diff --git a/.claude/skills/transferbench-debug/SKILL.md b/.claude/skills/transferbench-debug/SKILL.md
new file mode 100644
index 00000000..d0da63ea
--- /dev/null
+++ b/.claude/skills/transferbench-debug/SKILL.md
@@ -0,0 +1,94 @@
+---
+name: transferbench-debug
+description: Use when a TransferBench (ROCm/CUDA bandwidth-benchmark) run fails, hangs, crashes, validates incorrectly, or produces unexpected/misleading results — i.e. the user is troubleshooting rather than ramping up usage. Covers reading error output, isolating hangs (single-rank vs. multi-rank, NIC vs. POD detection), validation failures, performance regressions, and the binary's built-in verbose / dump / dryrun introspection. Does NOT cover writing new configs from scratch (use the run-side skill) or modifying TransferBench source.
+---
+
+# TransferBench debugging
+
+This skill kicks in when something is **wrong** with a TransferBench run. The goal is always: turn a vague "it doesn't work" into a specific failure mode with a known fix or workaround.
+
+## Triage flow
+
+Always run these three steps **first** before guessing:
+
+1. **Reproduce with the smallest possible config.** Replace presets with a single-line `cmdline` if possible; halve rank count; drop to one Transfer.
+2. **Confirm the binary parses the input.** Run `dryrun` instead of executing — separates parser bugs from runtime bugs.
+3. **Capture what the binary actually saw.** `TB_DUMP_CFG_FILE=out.cfg` for presets; `HIDE_ENV=0` (default) so the env-var summary at startup is visible.
+
+Only after that, branch by symptom — see the table below.
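The first two triage steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: `tb_triage` and the `TB_BIN` variable are hypothetical names, not part of TransferBench, and the sketch assumes that `dryrun` exits non-zero when an expression fails to parse, which should be confirmed on the build at hand.

```shell
# Hypothetical triage helper; not shipped with TransferBench.
# TB_BIN is an assumed variable naming the binary under test.
TB_BIN="${TB_BIN:-./TransferBench}"

# Usage: tb_triage "1 4 (G0->G0->G1)"
tb_triage() {
  spec="$1"

  echo "== parse only (dryrun) =="
  # Assumption: a parse failure shows up as a non-zero exit status.
  if ! "$TB_BIN" dryrun "$spec"; then
    echo "parse step failed: suspect the expression, not the runtime" >&2
    return 1
  fi

  echo "== smallest real run (one Test, one iteration) =="
  # NUM_ITERATIONS=1 is set for this single command only.
  NUM_ITERATIONS=1 "$TB_BIN" cmdline "$spec"
}
```

If the `dryrun` step fails, the problem is in the expression itself; if only the second step fails, the parser is fine and the fault lies at runtime.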
+
+## Symptom → reference
+
+| Symptom | Most likely cause | First thing to try | Deeper |
+|---|---|---|---|
+| Process hangs at startup, no output | MPI bootstrap or socket-mode env vars wrong | `mpirun --tag-output` to confirm all ranks started; verify `TB_NUM_RANKS` matches `-np` | `references/multi-rank-debug.md` |
+| `Pod-aware` preset hangs or errors out before transfers | AMD-SMI / NVML pod detection unavailable | `TB_FORCE_SINGLE_POD=1` | `references/common-failures.md` §pod |
+| RDMA preset (nicp2p, nica2a, …) hangs in NIC bring-up | GID index, IB port, or NIC filter wrong | Set `IB_GID_INDEX` to a known good index; `TB_NIC_FILTER` to a single NIC | `references/common-failures.md` §rdma |
+| Validation failure (`ALWAYS_VALIDATE=1` reports mismatch) | Wrong CU mask, wrong memory type, or actual HW issue | `VALIDATE_DIRECT=1`; rerun with `NUM_ITERATIONS=1` to see if first iter is wrong | `references/common-failures.md` §validation |
+| Bandwidth far below expected | Stream/HW-queue serialization, wrong executor, GFX kernel mis-tuned | `USE_SINGLE_STREAM=0` + `GPU_MAX_HW_QUEUES=8`; try `D` (DMA) instead of `G` | `references/common-failures.md` §perf |
+| Bandwidth varies wildly run-to-run | Warmup too short, NUMA/clock policy | `NUM_WARMUPS=10`, `SHOW_ITERATIONS=1`, `SHOW_PERCENTILES=50,90,99` | `references/common-failures.md` §perf |
+| Crash / segfault | Bad memory code (e.g. `F` on a GPU without fine-grain), bad kernel for arch | Run with `dryrun` first; rebuild without optimization for symbol info | `references/common-failures.md` §crash |
+| "Unsupported" / executor missing | Build-time disable (e.g. `DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`) | `./TransferBench` (no args) — its banner lists which executors are compiled in | `references/common-failures.md` §unsupported |
+| Output is garbled / interleaved across ranks | MPI stderr buffering, no per-rank labels | `mpirun --tag-output` or pipe each rank into a per-rank log | `references/multi-rank-debug.md` (Output layer) |
+
+## The four "always-on" introspection commands
+
+These four commands are how you **observe** the binary as it actually exists on this host (don't trust any documentation, including this one, when troubleshooting):
+
+```bash
+./TransferBench            # banner: detected GPUs, NUMA, NICs, compiled features
+./TransferBench help       # config-file syntax with examples
+./TransferBench presets    # list of presets compiled into THIS build
+./TransferBench envvars    # complete list of env vars THIS build honors
+```
+
+Plus two safe inspections of any preset/config:
+
+```bash
+./TransferBench dryrun ""                        # validate parsing, expand wildcards
+TB_DUMP_CFG_FILE=dump.cfg ./TransferBench p2p   # dump what a preset actually emits
+```
+
+## Verbose / capture env vars
+
+Reach for these when you need more visibility (full table in `references/verbose-introspection.md`):
+
+| Env var | Effect |
+|---|---|
+| `HIDE_ENV=0` (default) | Print env-var summary at start (shows what was actually set) |
+| `SHOW_ITERATIONS=1` | Per-iteration timings — exposes warmup/jitter issues |
+| `SHOW_PERCENTILES=50,90,99` | Tail latencies — exposes slow-iteration outliers |
+| `ALWAYS_VALIDATE=1` | Validate destination after every iteration (slow, but catches data-corruption regressions) |
+| `VALIDATE_DIRECT=1` | Validate by reading the destination directly (skips copy-back path) |
+| `VALIDATE_SOURCE=1` | Confirm src was unchanged (catches kernels that overwrite src) |
+| `NUM_ITERATIONS=1` | Run exactly one iteration — useful when validation fails on iter N>0 |
+| `NUM_WARMUPS=0` | Strip warmups so iter-0 timing is the cold case |
+| `USE_INTERACTIVE=1` | Pause between tests — useful for `gdb attach` mid-run |
+| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers from a preset to a config file |
+| `TB_DUMP_LINES=N` | Limit number of dumped lines |
+| `TB_VERBOSE=1` | Verbose lifecycle logging for newer execution paths (anvil/SDMA in recent builds) |
+| `TB_WALLCLOCK_RATE=` | Override GPU wallclock rate when the GPU returns 0 (debug-only) |
+
+## Multi-rank-specific quick checks
+
+When debugging across nodes, before suspecting TransferBench itself:
+
+1. **Same binary on every node.** `md5sum ./TransferBenchCuda` on each host. A different mtime/checksum is the most common multi-rank gotcha.
+2. **Same env on every rank.** Use `mpirun -x VAR` (not just shell-export); without `-x`, only rank 0 sees your shell vars.
+3. **Network actually up.** `ibstatus` (RDMA) or a `nc` between hosts on your master port (socket mode).
+4. **Hostfile slots = 1 per node.** TransferBench expects one rank per node by default.
+
+## When you're stuck
+
+If the table above and the references didn't help:
+
+1. Build with `-g -O0` (or `-g -O1`) to get usable symbols, run under `gdb` / `cuda-gdb` / `rocgdb`, and `bt` once it hangs or crashes. Hangs in particular are usually obvious from the stuck thread's stack.
+2. Strip the build down: pass `DISABLE_*` flags for any executor not under test (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, etc.). Eliminates whole code paths from suspicion.
+3. Compare against a known-good commit. The `git log` on this repo has many tagged commits where features were added — you can check out an older commit, run the same config, and confirm it passes there.
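For the "same binary on every node" check in the multi-rank list above, comparing checksums by eye gets error-prone beyond a couple of hosts. A minimal sketch, assuming you have already collected one `host checksum` pair per line; the helper name `tb_checksum_mismatches` is hypothetical:

```shell
# Hypothetical helper; not part of TransferBench.
# Reads "host checksum" pairs on stdin and prints every host whose
# checksum differs from the first line's checksum.
tb_checksum_mismatches() {
  awk 'NR == 1 { ref = $2 } $2 != ref { print $1 }'
}

# Example of feeding it (host names and the binary path are placeholders):
#   for h in node0 node1; do
#     printf '%s %s\n' "$h" "$(ssh "$h" md5sum ./TransferBenchCuda | cut -d" " -f1)"
#   done | tb_checksum_mismatches
```

Empty output means every host reported the same checksum as the first one; any printed host name is a node to rebuild or re-copy before debugging further.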
+
+## References
+
+- `references/common-failures.md` — symptom-organized catalog with concrete fixes
+- `references/multi-rank-debug.md` — MPI / socket / RDMA-specific issues
+- `references/verbose-introspection.md` — every debug-flavored env var + when to reach for it
+- `examples/topology-probe.sh` — minimal script that prints what TransferBench sees about the host
diff --git a/.claude/skills/transferbench-debug/examples/topology-probe.sh b/.claude/skills/transferbench-debug/examples/topology-probe.sh
new file mode 100755
index 00000000..6a87489b
--- /dev/null
+++ b/.claude/skills/transferbench-debug/examples/topology-probe.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# topology-probe.sh — print everything TransferBench can tell you about this host.
+#
+# Run as the FIRST step of any debugging session. Captures: detected GPUs,
+# NUMA, NICs, compiled feature flags, env-var defaults, and (optionally) what
+# a single preset actually emits.
+#
+# Usage:
+#   ./topology-probe.sh            # just probe
+#   ./topology-probe.sh p2p        # probe + dump what `p2p` would run
+#   ./topology-probe.sh p2p out/   # write output files into out/
+
+set -euo pipefail
+
+DEFAULT_BINARY=./TransferBench; [[ -x ./TransferBenchCuda ]] && DEFAULT_BINARY=./TransferBenchCuda
+BINARY="${BINARY:-$DEFAULT_BINARY}"   # an explicit BINARY=... always wins
+[[ -x "$BINARY" ]] || { echo "ERROR: $BINARY not found or not executable"; exit 1; }
+
+PRESET="${1:-}"
+OUTDIR="${2:-.}"
+mkdir -p "$OUTDIR"
+
+echo "=== Binary banner (compiled features + detected hardware) ==="
+"$BINARY" >"$OUTDIR/banner.txt" 2>&1 || true; head -60 "$OUTDIR/banner.txt"   # file first: `tee | head` can truncate it
+echo
+
+echo "=== Compiled-in presets ==="
+"$BINARY" presets 2>&1 | tee "$OUTDIR/presets.txt"
+echo
+
+echo "=== Compiled-in environment variables ==="
+"$BINARY" envvars >"$OUTDIR/envvars.txt" 2>&1 || true; head -40 "$OUTDIR/envvars.txt"
+echo " (full list in $OUTDIR/envvars.txt)"
+echo
+
+echo "=== Config-file syntax help ==="
+"$BINARY" help >"$OUTDIR/help.txt" 2>&1 || true; head -40 "$OUTDIR/help.txt"
+echo " (full help in $OUTDIR/help.txt)"
+echo
+
+if [[ -n "$PRESET" ]]; then
+  DUMP="$OUTDIR/${PRESET}_dump.cfg"
+  echo "=== Dumping what '$PRESET' actually runs to $DUMP ==="
+  TB_DUMP_CFG_FILE="$DUMP" TB_DUMP_LINES=100 "$BINARY" "$PRESET" >/dev/null 2>&1 || true
+  if [[ -f "$DUMP" ]]; then
+    echo " First 30 lines:"
+    head -30 "$DUMP" | sed 's/^/ /'
+  else
+    echo " (TB_DUMP_CFG_FILE produced no output — preset may not support dump)"
+  fi
+fi
+
+echo
+echo "=== Quick parser sanity check ==="
+"$BINARY" dryrun "1 4 (G0->G0->G1)" 2>&1 | head -10 || true
+echo
+
+echo "Done. Files written to $OUTDIR/"
diff --git a/.claude/skills/transferbench-debug/references/common-failures.md b/.claude/skills/transferbench-debug/references/common-failures.md
new file mode 100644
index 00000000..e3af917d
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/common-failures.md
@@ -0,0 +1,136 @@
+# TransferBench common failures — symptoms, causes, fixes
+
+Catalog of the most common "it doesn't work" cases, organized by symptom. Each section has: **symptom signal**, **likely cause(s)**, and **concrete fix or next probe**.
+
+---
+
+## §pod — pod-aware preset hangs / errors
+
+### Symptom
+- `podp2p` / `poda2a` / `rings` errors immediately with a pod-detection message.
+- Or hangs in startup before any transfer table prints.
+
+### Cause
+- AMD-SMI (HIP build) or NVML (CUDA build) is unavailable, blocked, or returns inconsistent pod membership across ranks.
+- The build has `POD_COMM_ENABLED` but the host lacks fabricmanager / `nvidia-fabricmanager.service`.
+
+### Fix
+1. `TB_FORCE_SINGLE_POD=1` — fastest workaround, treats all ranks as members of one pod.
+2. Confirm fabricmanager is running (CUDA): `systemctl status nvidia-fabricmanager`.
+3. Confirm NVML works: a one-liner like `nvidia-smi -q | head` should succeed on every rank.
+4. If you actually want per-pod awareness, ensure all ranks see the same pod IDs by running a probe script (each rank prints its detected pod ID).
+
+---
+
+## §rdma — NIC / RDMA preset hangs or errors
+
+### Symptom
+- Hang during NIC bring-up (no transfer table) on `nicp2p`, `nica2a`, `nicrings`, `a2a_n`.
+- Or "QP create failed" / "RDMA connect failed" messages.
+
+### Cause
+- Wrong `IB_GID_INDEX` — depends on the host's IB / RoCE configuration.
+- `IB_PORT_NUMBER` doesn't match the active port on the chosen NIC.
+- More NICs detected than usable; some are unconfigured.
+- RoCE version mismatch.
+
+### Fix
+1. Find a working GID: `show_gids` or `ibv_devinfo -v` on each host.
+2. Set `IB_GID_INDEX=`, `IB_PORT_NUMBER=` to known good values.
+3. Restrict to a single NIC with `TB_NIC_FILTER=` to localize the bad one.
+4. RoCE: try `ROCE_VERSION=2` (most common) or `ROCE_VERSION=1`.
+5. Confirm both ends agree on `IP_ADDRESS_FAMILY` (4 vs 6).
+6. If using OpenMPI: `--mca pml ucx --mca btl ^vader,openib` is the canonical setting in this repo.
+
+---
+
+## §validation — `ALWAYS_VALIDATE=1` reports mismatch
+
+### Symptom
+- Run completes but reports "Validation failed" / mismatch between expected and actual destination contents.
+
+### Cause (in order of likelihood)
+1. Wrong memory-location code (e.g. fine-grain `F` requested on a GPU that doesn't support it → memory backed by global instead, kernel writes go to a different place than reads).
+2. Wrong CU mask — kernel uses a CU group that doesn't have the right cache visibility.
+3. Multi-Transfer test where two Transfers race on the same destination address.
+4. Actual hardware issue (least likely).
+
+### Fix
+1. `VALIDATE_DIRECT=1` — read destination directly without copy-back; isolates copy-back-path bugs.
+2. `VALIDATE_SOURCE=1` — confirm source data was not overwritten by the kernel; catches `src == dst` issues.
+3. `NUM_ITERATIONS=1 NUM_WARMUPS=0 ALWAYS_VALIDATE=1` — confirms it's not a state-leak between iterations.
+4. Drop to `cmdline` with **one** Transfer to rule out multi-Transfer races.
+5. If still failing: `FILL_PATTERN=0xDEADBEEF` (or any custom pattern) — makes the corruption signature easy to spot in the diff.
+
+---
+
+## §perf — bandwidth far below expected, or wildly variable
+
+### Symptom
+- Reported BW is a fraction (e.g. 1/4, 1/8) of the link's theoretical max.
+- Or BW jumps 2× between iterations without obvious reason.
+
+### Cause
+- Stream-per-Transfer hits HW-queue limit and serializes (`USE_SINGLE_STREAM=0` + low `GPU_MAX_HW_QUEUES`).
+- GFX kernel parameters mis-tuned for the size (`GFX_UNROLL`, `GFX_BLOCK_SIZE`, `GFX_WORD_SIZE`).
+- Not enough warmup — first few iterations include allocation, paging, clock ramp.
+- Wrong executor for the workload: GFX kernel for tiny payload (use DMA), DMA for one-to-many (use Batched-DMA).
+- NUMA / pinned-memory mismatch (e.g. CPU-side memory on the wrong NUMA for the chosen GPU).
+
+### Fix
+1. `NUM_WARMUPS=10 NUM_ITERATIONS=20 SHOW_ITERATIONS=1` — see whether iter 0–2 are slow and the rest converge.
+2. `SHOW_PERCENTILES=50,75,90,99` — exposes outlier iterations.
+3. Try the alternate executor on the same memory pair: GFX (`G`) ↔ DMA (`D`) ↔ Batched-DMA (`B`).
+4. `USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8` for many parallel Transfers.
+5. Sweep with a preset (`gfxsweep`, `a2asweep`) to find the right kernel options before hand-tuning.
+
+---
+
+## §crash — crash / segfault
+
+### Symptom
+- Process exits with SIGSEGV / "memory access fault" / "invalid memory access."
+
+### Cause
+- Memory code unsupported by HW (e.g. `F` on a GPU without fine-grain memory; `U` with no uncached path).
+- DMA Transfer with multiple SRCs (DMA requires exactly one SRC).
+- NIC executor with mismatched index syntax (`I0` instead of `I0.0`).
+- Buffer alignment: byte count not a multiple of 4 (parser usually catches this, but custom builds may slip).
+
+### Fix
+1. `dryrun ""` first — most parser-level bugs surface here.
+2. Read the banner from `./TransferBench` with no args — confirms which memory types and executors are compiled in for this build.
+3. For DMA crashes: confirm exactly one SRC per Transfer.
+4. Build with `-g -O0` and run under `cuda-gdb` (NVIDIA) or `rocgdb` (AMD); the stack at the fault tells you which Executor's path failed.
+
+---
+
+## §unsupported — "executor missing" / "feature not compiled in"
+
+### Symptom
+- "Unsupported executor" or similar, even though the code seems to allow it.
+
+### Cause
+- This build was compiled with one of the `DISABLE_*` Makefile flags (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, `DISABLE_AMD_SMI=1`, etc.).
+- Or `MPI_PATH` was not set, so multi-rank paths were stubbed out.
+
+### Fix
+1. Run `./TransferBench` with no arguments. Its banner lists which executors and features are compiled in for this exact binary.
+2. If the feature is genuinely missing, rebuild without the corresponding `DISABLE_*` flag. (See the build-side skill — out of scope here.)
+
+---
+
+## §parser — config-file parser rejects a line
+
+### Symptom
+- "Failed to parse" / "Invalid config line" / silently runs the wrong thing.
+
+### Cause
+- Confused basic vs. advanced syntax (`numTransfers` positive vs. negative).
+- Whitespace inside an executor or memory token.
+- Quoting issues on the shell side when using `cmdline` (e.g. `G*` getting glob-expanded).
+
+### Fix
+1. **Always quote** `cmdline` arguments: `./TransferBench cmdline "1 4 (G0->G0->G1)"`.
+2. `dryrun` first to see the parse result without execution.
+3. For complex configs, use `##` echo lines liberally — they show in the output and help correlate result rows to test definitions.
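The quoting advice above is easier to trust once you see what the shell actually hands to a program. The sketch below uses a throwaway function, `count_args` (a hypothetical name, unrelated to TransferBench), to show that a quoted expression arrives as a single argument. Left unquoted, the `>` in `->` would be treated as a redirection, the parentheses as subshell syntax, and `G*` as a glob pattern.

```shell
# count_args prints how many arguments the shell actually passed.
count_args() { echo "$#"; }

# Quoted: the whole Transfer expression arrives as ONE argument.
count_args "1 4 (G0->G0->G1)"    # prints 1

# A wildcard inside quotes stays literal; the shell never glob-expands it.
count_args "1 4 (G*->G0->G1)"    # prints 1
```

The second string is only an illustration of shell quoting; whether that exact wildcard form is accepted is best checked with `dryrun` on your build.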
diff --git a/.claude/skills/transferbench-debug/references/multi-rank-debug.md b/.claude/skills/transferbench-debug/references/multi-rank-debug.md
new file mode 100644
index 00000000..d5793ef0
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/multi-rank-debug.md
@@ -0,0 +1,148 @@
+# Multi-rank debugging (MPI / socket / RDMA)
+
+Multi-rank TransferBench is the most failure-prone configuration. This guide is organized by the **layer where the failure happens**: launcher → bootstrap → NIC bring-up → output.
+
+## Pre-flight: things to check before suspecting TransferBench
+
+These rule out 90% of multi-rank "bugs":
+
+```bash
+# 1. Same binary on every node
+for h in node0 node1; do ssh "$h" md5sum /home/timhu102/tBench/TransferBenchCuda; done
+
+# 2. NICs functional on every node
+for h in node0 node1; do ssh "$h" "ibstatus | head -20"; done
+
+# 3. Hosts can reach each other on the master port (socket mode)
+ssh node1 "nc -zv node0 "
+
+# 4. Same env vars actually propagating to all ranks
+mpirun --tag-output -np 2 -host node0,node1 -x SOME_VAR env | grep SOME_VAR
+```
+
+## Launcher layer: hangs/errors before any TransferBench output
+
+### Hostfile / `-host` problems
+
+- **Hostnames not resolving:** `mpirun` will hang silently. Try IP addresses instead.
+- **More slots requested than declared:** `-host node0:1,node1:1 -np 4` is a misconfiguration; use one slot per node.
+- **SSH not passwordless:** `mpirun` quietly fails when an `ssh` it spawns prompts for a password. Test with `ssh node1 hostname` first.
+
+### MPI transport problems
+
+The canonical incantation in this repo is:
+
+```bash
+mpirun --mca pml ucx --mca btl '^vader,openib' ...
+```
+
+If you see "no transport available" / "BTL openib unavailable":
+- Confirm UCX is built into your OpenMPI (`ompi_info | grep ucx`).
+- The `^vader,openib` is a **negative** filter — it excludes those BTLs to force PML/UCX. Don't drop it without good reason.
+
+### Env-var propagation
+
+OpenMPI does **not** forward your shell env to remote ranks unless you ask it to:
+
+```bash
+mpirun -x VAR1 -x VAR2=value -np ... ./TransferBenchCuda ...
+```
+
+- `-x VAR` forwards the **current** shell value of `VAR`.
+- `-x VAR=value` sets it explicitly.
+- Without `-x`, remote ranks see only the env they inherit from the SSH session, which usually does NOT include your interactive shell exports.
+
+This is the most common source of "works on rank 0, fails on rank ≥1" bugs. The `examples/multi-node.sh` template in `transferbench-run` builds the `-x` flags from a list — model after it.
+
+## Bootstrap layer: process started, but no transfer output
+
+### MPI bootstrap
+
+If `mpirun` reports all ranks started but TransferBench prints nothing, the most common cause is the rank-0 → rank-N handshake taking longer than expected because of a stuck NIC or NUMA probe.
+
+- Run `mpirun --tag-output ...` so each line of output is prefixed with `[jobid,rank]`.
+- If rank 0 reaches the banner and others don't, those ranks are stuck in their initialization (often AMD-SMI or NVML).
+
+### Socket-mode bootstrap
+
+```bash
+# On rank 0, no TB_RANK set → rank 0 prints master address
+TB_NUM_RANKS=4 ./TransferBenchCuda
+
+# On rank N (N>0), set TB_RANK and TB_MASTER_ADDR
+TB_NUM_RANKS=4 TB_RANK=1 TB_MASTER_ADDR= ./TransferBenchCuda
+```
+
+Common socket-mode issues:
+- **`TB_NUM_RANKS` differs across ranks** → silent hang.
+- **`TB_MASTER_ADDR` not reachable** (firewall, wrong interface). Test with `nc -zv` first.
+- **One rank exited early** → the others wait forever.
+
+## NIC bring-up layer: `nicp2p` / `nica2a` / pod presets hang here
+
+### GID index
+
+The single most common cause of NIC hangs. Find a working GID:
+
+```bash
+for d in $(ibv_devices | tail -n +3 | awk '{print $1}'); do
+  echo "=== $d ==="
+  show_gids "$d" 2>/dev/null || ibv_devinfo -d "$d" -v 2>/dev/null | grep -E 'GID|state'
+done
+```
+
+Pick a GID with a populated address (not `0000:0000:...`) and an active port:
+
+```bash
+IB_GID_INDEX= IB_PORT_NUMBER= ./TransferBenchCuda nicp2p
+```
+
+### NIC filtering
+
+If you have eight NICs but only four are configured, list only the working ones:
+
+```bash
+TB_NIC_FILTER=mlx5_0,mlx5_1,mlx5_2,mlx5_3 ./TransferBenchCuda nicp2p
+```
+
+This restricts the world-view of NICs to a subset; useful for narrowing down a bad NIC.
+
+### RoCE version mismatch
+
+Symptom: connection establishment hangs forever.
+Fix: `ROCE_VERSION=2` is the modern default; some legacy clusters require `ROCE_VERSION=1`. Both ends must agree.
+
+### POD / MNNVL detection
+
+For `podp2p` / `poda2a`:
+
+- AMD-SMI (HIP) or NVML (CUDA) must be functional, AND fabricmanager must be running on NVIDIA.
+- Quick workaround if pod detection is broken: `TB_FORCE_SINGLE_POD=1`.
+
+## Output layer: results are garbled or look wrong
+
+### Interleaved output
+
+Without explicit tagging, MPI lets all ranks write to the same stdout, which interleaves bytes:
+
+```bash
+mpirun --tag-output ...                # each line prefixed with [jobid,rank]
+mpirun --output-filename out -np ...   # per-rank log files
+mpirun --merge-stderr-to-stdout ...    # one stream
+```
+
+### Apparent zeros / NaNs in the bandwidth table
+
+- Rank-N reported a hardware error and failed silently → check rank-N's stderr (use `--output-filename`).
+- Or the `Test` was malformed and one rank parsed it differently → use `dryrun` and `TB_DUMP_CFG_FILE` to confirm both ranks are running the same Transfers.
+
+## When the hang is genuinely TransferBench's fault
+
+If the layers above are clean and TransferBench still hangs:
+
+1. Build with `-g -O0` (or `-g -O1`).
+2. Run a tiny config (one Transfer, one rank pair) under `cuda-gdb` / `rocgdb`.
+3. When it hangs: `Ctrl-C`, then `bt`. The stuck thread's stack will name the function holding the lock or the queue it's polling.
+4. Look for: NIC completion-queue polling without a timeout, a stream-event wait, or an MPI collective that's missing a peer.
+
+If you're hitting this often, the build-side skill (separate skill) covers the recompile workflow.
diff --git a/.claude/skills/transferbench-debug/references/verbose-introspection.md b/.claude/skills/transferbench-debug/references/verbose-introspection.md
new file mode 100644
index 00000000..b9079905
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/verbose-introspection.md
@@ -0,0 +1,112 @@
+# Debug-flavored env vars and built-in introspection
+
+This is the curated debug-side counterpart to the run-side `env-vars.md`. Variables here are the ones you reach for **when something is wrong**, not when you're tuning. For the authoritative complete list compiled into your binary, run:
+
+```bash
+./TransferBench envvars
+```
+
+## Built-in info commands (no env vars needed)
+
+```bash
+./TransferBench                # banner: detected GPUs, NUMA, NICs, compiled features
+./TransferBench help           # config-file syntax + examples
+./TransferBench presets        # presets compiled into THIS build
+./TransferBench envvars        # full env-var list with descriptions
+./TransferBench dryrun "..."   # parse + expand wildcards, no execution
+```
+
+The first one is especially valuable when debugging — its banner names every executor and feature compiled in (`NIC_EXEC_ENABLED`, `POD_COMM_ENABLED`, `NVML_ENABLED`, etc.). If a feature you expect isn't there, you've found the problem before you even ran a test.
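The banner check described above can be scripted once the banner is saved to a file. This is a rough sketch under a stated assumption: that the banner contains the feature names verbatim as plain text (for example the string `NIC_EXEC_ENABLED`), which should be confirmed against real banner output. The helper names `tb_has_feature` and `tb_missing_features` are hypothetical.

```shell
# Hypothetical helpers; not part of TransferBench.
# Assumption: the saved banner contains feature names as literal text.

# tb_has_feature BANNER_FILE FEATURE  -> exit status 0 if FEATURE appears
tb_has_feature() {
  grep -qF -- "$2" "$1"
}

# tb_missing_features BANNER_FILE FEATURE...  -> prints each absent feature
tb_missing_features() {
  banner="$1"
  shift
  for feature in "$@"; do
    tb_has_feature "$banner" "$feature" || echo "$feature"
  done
}

# Example (the file name is a placeholder):
#   ./TransferBench > banner.txt 2>&1
#   tb_missing_features banner.txt NIC_EXEC_ENABLED POD_COMM_ENABLED
```

Any name it prints is a feature to investigate at build time before spending effort on runtime debugging.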
+
+## Verbose & lifecycle logging
+
+| Variable | What it does | When to use |
+|---|---|---|
+| `HIDE_ENV=0` (default) | Print env-var summary at startup | Always leave at default when debugging — you want to see what TransferBench thinks the env says |
+| `TB_VERBOSE=1` | Verbose lifecycle logging in newer execution paths (anvil/SDMA) | When recent-feature execution paths hang or behave oddly |
+| `SHOW_ITERATIONS=1` | Print every iteration's time | First step when "BW seems too low" or "BW is unstable" |
+| `SHOW_PERCENTILES=50,75,90,99` | Add percentile columns | When the mean is fine but tail latency is the suspected issue |
+| `SHOW_BORDERS=0` | Strip table borders | Easier to diff two runs' raw output |
+
+## Validation / correctness
+
+| Variable | Effect | Notes |
+|---|---|---|
+| `ALWAYS_VALIDATE=1` | Validate dst after every iteration | Catches data-corruption regressions; significantly slows runs |
+| `VALIDATE_DIRECT=1` | Validate by reading dst directly | Skips copy-back; isolates "is the validation path itself buggy?" |
+| `VALIDATE_SOURCE=1` | Confirm src wasn't overwritten | Catches kernels that aliased into src |
+| `FILL_PATTERN=0xDEADBEEF` | Custom hex source fill pattern | Makes corruption signatures recognizable |
+| `FILL_COMPRESS=1` | Use compressible source data | Useful when debugging compression-aware paths |
+| `BYTE_OFFSET=N` | Offset (bytes) into allocated buffers | Useful when alignment is suspected |
+| `BLOCK_BYTES=256` | Block granularity for transfers | Try larger / smaller when validation fails on edge sizes |
+
+## Iteration / timing isolation
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `NUM_ITERATIONS=1` | Run exactly one iteration | "Validation fails on iter N>0" → reduce to 1 to confirm cold case |
+| `NUM_WARMUPS=0` | No warmups | Makes iter 0 the cold case; combine with `NUM_ITERATIONS=1` to time only the cold iteration |
+| `NUM_ITERATIONS=-30` | Timed mode, 30 seconds | When you want to study perf over time, not iterations |
+| `NUM_SUBITERATIONS=N` | N sub-iterations per outer iteration | Reduce if sub-iter is the granularity at which a bug appears |
+
+## Capture / reproducibility
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers to a config file | "What is this preset *actually* running?" — crucial when a preset behaves unexpectedly |
+| `TB_DUMP_LINES=N` | Limit dumped lines | Quick peek at the start of a large preset |
+| `OUTPUT_TO_CSV=1` | CSV output | Easier to diff two runs programmatically |
+
+## Interactive / breakpoint-friendly
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `USE_INTERACTIVE=1` | Pause for stdin between tests | Attach `gdb` / `cuda-gdb` / `rocgdb` mid-run, set breakpoints, hit Enter to continue |
+
+## NIC / RDMA debug
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `IB_GID_INDEX=N` | Force a specific GID index | Almost always the first thing to set when NIC presets hang |
+| `IB_PORT_NUMBER=N` | Force a specific port | When the active port isn't 1 |
+| `ROCE_VERSION=1\|2` | RoCE version | Both ends must agree |
+| `IP_ADDRESS_FAMILY=4\|6` | IPv4 or IPv6 | When dual-stack hosts pick the wrong one |
+| `TB_NIC_FILTER=name1,name2` | Restrict to listed NICs | Localize which NIC is misbehaving |
+| `NIC_CHUNK_BYTES=N` | NIC transfer chunk size | Reduce to confirm a size-dependent bug |
+| `NIC_CQ_POLL_BATCH=N` | CQ poll batch size | Reduce to 1 to expose race conditions |
+| `NIC_RELAX_ORDER=0\|1` | Relaxed ordering on the NIC | Disable when ordering bugs suspected |
+
+## Pod / multi-rank fallbacks
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_FORCE_SINGLE_POD=1` | Treat all ranks as one pod | Workaround when AMD-SMI / NVML pod detection is broken |
+| `TB_RANK`, `TB_NUM_RANKS`, `TB_MASTER_ADDR` | Socket-mode rank coordination | Alternative bootstrap when MPI isn't available or is the suspect |
+
+## Wallclock / timing edge cases
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_WALLCLOCK_RATE=` | Override GPU wallclock rate | Some firmware reports 0 for the rate; this lets you bypass that |
+| `USE_HIP_EVENTS=0` | Use host clock instead of HIP/CUDA events | When HIP-event timing is suspected (rare) |
+
+## Combined recipes
+
+A few canned env-var combinations that come up often:
+
+```bash
+# "What does this preset actually run?"
+TB_DUMP_CFG_FILE=p2p_dump.cfg ./TransferBench p2p
+
+# "Is the slowness in iter 0 only, or every iter?"
+NUM_WARMUPS=0 NUM_ITERATIONS=20 SHOW_ITERATIONS=1 ./TransferBench cmdline "1 4 (G0->G0->G1)" 256M
+
+# "Validate every iteration, with a custom pattern"
+ALWAYS_VALIDATE=1 VALIDATE_SOURCE=1 FILL_PATTERN=0xDEADBEEF ./TransferBench p2p
+
+# "Quietest possible run for diff'ing two builds"
+HIDE_ENV=1 SHOW_BORDERS=0 OUTPUT_TO_CSV=1 NUM_WARMUPS=10 NUM_ITERATIONS=20 ./TransferBench p2p > run_A.csv
+
+# "Pause between tests so I can gdb attach"
+USE_INTERACTIVE=1 ./TransferBench p2p
+```
diff --git a/.claude/skills/transferbench-run/SKILL.md b/.claude/skills/transferbench-run/SKILL.md
new file mode 100644
index 00000000..050eac54
--- /dev/null
+++ b/.claude/skills/transferbench-run/SKILL.md
@@ -0,0 +1,150 @@
+---
+name: transferbench-run
+description: Use when the user wants to *run* TransferBench (the ROCm/CUDA memory-transfer benchmarking tool from AMD) — benchmarking, profiling, or measuring GPU/CPU/NIC bandwidth and latency. Covers writing config files, picking the right preset (a2a, p2p, sweep, nicp2p, podp2p, etc.), tuning environment variables, and launching single-node or multi-node (MPI / socket) runs. Does NOT cover building the binary from source, modifying its source code, or extending it with new presets/executors — for those, defer to a separate skill or the codebase itself.
+---
+
+# TransferBench
+
+TransferBench is a command-line utility for benchmarking simultaneous data transfers between CPU, GPU, and NIC memory locations using GPU kernels, DMA engines, RDMA NICs, or CPU threads. It runs on AMD (ROCm/HIP) and NVIDIA (CUDA) platforms.
+
+The binary is named `TransferBench` (HIP build) or `TransferBenchCuda` (CUDA build). A prebuilt `TransferBenchCuda` may exist at the repo root; otherwise build with `make` (HIP) or `make TransferBenchCuda` (CUDA).
+
+## Mental model
+
+A **Transfer** is one operation: an **Executor** reads values from one or more **SRC** memory locations, sums them, and writes the result to one or more **DST** memory locations. With one SRC and one DST it's a plain copy.
+
+```
+SRC 0
+SRC 1 -> Executor -> DST 0
+SRC X                DST Y
+```
+
+A **Test** is one line in a config file — a set of Transfers run in parallel.
+
+A **SubExecutor (SE)** is the unit of parallelism inside an executor:
+- CPU executor → CPU thread
+- GPU executor → threadblock / Compute Unit (CU)
+- DMA / Batched-DMA → stream / batch item (must have a single SRC)
+- NIC → Queue Pair
+
+## Invocation
+
+```bash
+./TransferBench [N]
+```
+
+- `` is one of:
+  - A path to a config file
+  - A preset name (`a2a`, `p2p`, `sweep`, `nicp2p`, `podp2p`, ...)
+  - `cmdline ""` — run one ad-hoc transfer
+  - `dryrun ""` — parse and print without executing
+  - `help`, `presets`, `envvars` — built-in info screens
+- `N` (optional) is the number of bytes per Transfer. Defaults if omitted. `0` means sweep over a range. May be suffixed with `K`, `M`, or `G`. Must be a multiple of 4.
+
+Run `./TransferBench` with no args to see usage and detected topology (GPUs, NUMA nodes, NICs).
+
+## Quick recipes
+
+Decide which path the user needs and pick from below. **Always prefer a preset** if it matches the user's intent — presets handle topology discovery and produce well-formatted output.
+
+### "How fast is GPU↔GPU?" → use a preset
+```bash
+./TransferBench p2p         # peer-to-peer matrix between all GPUs
+./TransferBench a2a         # all-to-all simultaneous transfers
+./TransferBench scaling     # one GPU to all others, scaled CU counts
+./TransferBench sweep 64M   # sweep through transfer combinations
+```
+
+### "How fast is HBM / local memory?"
+```bash
+./TransferBench hbm
+```
+
+### "How fast is CPU↔GPU or pinned-memory transfer?" → custom config
+Write a small `.cfg` file (see `references/config-format.md`) and pass it as the first argument.
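As a rough illustration of the custom-config path above, the sketch below writes a two-Test config in the basic `numTransfers numSEs (src->exe->dst)` form used elsewhere in this document. Two things are assumptions that `references/config-format.md` or `./TransferBench help` should confirm for your build: that `C0` denotes pinned CPU memory on NUMA node 0, and that 4 SubExecutors is a sensible count. The helper name `tb_write_h2d_cfg` is hypothetical.

```shell
# Hypothetical helper; not part of TransferBench.
# Assumption: "C0" is the memory code for pinned CPU memory on NUMA node 0.
# Lines starting with "##" are echoed into the output for annotation.
tb_write_h2d_cfg() {
  cat > "$1" <<'EOF'
## Host-to-device copy, executed by GPU 0 with 4 SubExecutors
1 4 (C0->G0->G0)
## Device-to-host copy, executed by GPU 0 with 4 SubExecutors
1 4 (G0->G0->C0)
EOF
}

# Example (the file name is a placeholder):
#   tb_write_h2d_cfg h2d.cfg
#   ./TransferBench h2d.cfg 64M
```

Each numbered line is one Test, so the two directions are measured separately rather than competing for the same link.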
+ +### "Benchmark one specific transfer" → cmdline mode +```bash +./TransferBench cmdline "1 4 (G0->G0->G1)" 256M +./TransferBench dryrun "2 8 G0->G0->G1 G1->G1->G0" # validate parsing first +``` + +### "RDMA / NIC across nodes" → NIC presets +```bash +./TransferBench nicp2p # NIC peer-to-peer matrix +./TransferBench nica2a # NIC all-to-all +./TransferBench nicrings # NIC ring tests +``` + +### "Pod-aware (multi-rank, single MNNVL/XGMI pod)" → pod presets +```bash +./TransferBench podp2p # within-pod P2P +./TransferBench poda2a # within-pod all-to-all +``` +For pod presets, set `TB_FORCE_SINGLE_POD=1` if AMD-SMI / NVML pod detection is unavailable. + +See `references/presets.md` for the full list. + +## Multi-rank execution + +TransferBench runs multi-node either via MPI (if compiled with `MPI_PATH` set) or via plain TCP sockets. + +### MPI approach +```bash +mpirun -np 4 -host node0,node1,node2,node3 ./TransferBench a2a +``` + +### Socket approach (no MPI) +On rank 0, set only `TB_NUM_RANKS=N` to print the master address; copy that to other ranks. +```bash +# node0 +TB_NUM_RANKS=4 ./TransferBench a2a +# node1, node2, node3 (use the address node0 prints) +TB_NUM_RANKS=4 TB_RANK=1 TB_MASTER_ADDR= ./TransferBench a2a +TB_NUM_RANKS=4 TB_RANK=2 TB_MASTER_ADDR= ./TransferBench a2a +TB_NUM_RANKS=4 TB_RANK=3 TB_MASTER_ADDR= ./TransferBench a2a +``` + +Recommend **one process per node**. See `examples/multi-node.sh` for a full mpirun launcher script with environment-variable propagation (`-x VAR`). + +## Tuning behavior with environment variables + +Most tuning is environment-variable driven. 
The most useful ones: + +| Variable | Purpose | +|---|---| +| `NUM_ITERATIONS` | Iterations per test (negative = run for that many seconds in timed mode) | +| `NUM_WARMUPS` | Warmup iterations (default 3) | +| `USE_SINGLE_STREAM` | When 0, each Transfer gets its own stream (may serialize on HW queue limits) | +| `GPU_MAX_HW_QUEUES` | Raise HW-queue cap when `USE_SINGLE_STREAM=0` | +| `OUTPUT_TO_CSV` | Emit CSV output | +| `SHOW_ITERATIONS` | Print per-iteration timings | +| `SHOW_PERCENTILES` | e.g. `50,75,90,99` for tail-latency percentiles | +| `ALWAYS_VALIDATE` | Validate destination data after each iteration | +| `FILL_PATTERN` | Custom source fill pattern | +| `TB_DUMP_CFG_FILE` | Dump executed Transfers (e.g. from a preset) to a config file | +| `TB_FORCE_SINGLE_POD` | Force single-pod membership when AMD-SMI/NVML unavailable | +| `TB_NIC_FILTER` | Restrict which NICs are used | + +Run `./TransferBench envvars` for the authoritative list. See `references/env-vars.md` for grouped/annotated reference. + +## Reading results + +Default output prints one row per Test with bandwidth (GB/s) and time (ms). With `OUTPUT_TO_CSV=1`, results are CSV-formatted. With `SHOW_PERCENTILES=...`, percentile tail-latency columns are appended. Lines beginning with `##` in a config file are echoed back into the output for annotation. + +## When you're stuck + +1. Run `./TransferBench help` for config-file syntax with examples. +2. Run `./TransferBench presets` for the live list of presets. +3. Run `./TransferBench envvars` for all environment variables. +4. Run `./TransferBench dryrun "..."` to validate a transfer expression without executing. +5. If a preset fails on multi-node, first try `TB_FORCE_SINGLE_POD=1` and confirm rank count matches host count. + +## References + +- `references/config-format.md` — full grammar for config files (memory codes, executor codes, basic vs. 
advanced syntax) +- `references/presets.md` — every preset, when to use each, and its key env-var knobs +- `references/env-vars.md` — grouped environment-variable reference +- `examples/basic-p2p.cfg` — minimal GPU↔GPU copy +- `examples/advanced-mixed.cfg` — mixed CPU/GPU/DMA transfers with explicit byte counts +- `examples/multi-node.sh` — MPI launcher template with logging and env-var propagation diff --git a/.claude/skills/transferbench-run/examples/advanced-mixed.cfg b/.claude/skills/transferbench-run/examples/advanced-mixed.cfg new file mode 100644 index 00000000..14068570 --- /dev/null +++ b/.claude/skills/transferbench-run/examples/advanced-mixed.cfg @@ -0,0 +1,18 @@ +# Advanced-mode: per-Transfer SE counts and explicit byte sizes. +# Run with: ./TransferBench examples/advanced-mixed.cfg +# (No N argument needed because each Transfer specifies its own byte count.) + +## CPU NUMA-node 0 -> GPU 0 (1 GiB) parallel with GPU 0 -> GPU 1 (256 MiB) +-2 (C0->G0->G0 8 1G) (G0->G0->G1 4 256M) + +## Three-way fan-out: GPU0 broadcasts to GPU0+GPU1+GPU2 (16 CUs, 128 MiB) +-1 (G0->G0->G0G1G2 16 128M) + +## Fan-in / sum: read GPU0 and GPU1, write the sum to GPU2 (16 CUs, 64 MiB) +-1 (G0G1->G2->G2 16 64M) + +## Pinned-host write benchmark: GPU 0 writes to coarse-grained pinned host on its closest NUMA +-1 (N0->G0->P0 16 256M) + +## Mixed bandwidth test: DMA copy + GFX copy in parallel, different sizes +-2 (G0->D0->G1 1 512M) (G2->G2->G3 8 512M) diff --git a/.claude/skills/transferbench-run/examples/basic-p2p.cfg b/.claude/skills/transferbench-run/examples/basic-p2p.cfg new file mode 100644 index 00000000..470f3703 --- /dev/null +++ b/.claude/skills/transferbench-run/examples/basic-p2p.cfg @@ -0,0 +1,21 @@ +# Minimal GPU<->GPU peer-to-peer config. +# Run with: ./TransferBench examples/basic-p2p.cfg 256M +# Lines starting with ## are echoed into the result output. 
+ +## GPU0 -> GPU1 with 4 CUs (kernel-driven) +1 4 (G0->G0->G1) + +## GPU1 -> GPU0 with 4 CUs (kernel-driven) +1 4 (G1->G1->G0) + +## Bidirectional simultaneous, 4 CUs each +2 4 (G0->G0->G1) (G1->G1->G0) + +## DMA-engine copy GPU0 -> GPU1 (single SE for DMA) +1 1 (G0->D0->G1) + +## "Memset" benchmark: write to GPU0 memory with no real source +1 32 (N0->G0->G0) + +## "Read-only" benchmark: read GPU0 memory and discard +1 32 (G0->G0->N0) diff --git a/.claude/skills/transferbench-run/examples/multi-node.sh b/.claude/skills/transferbench-run/examples/multi-node.sh new file mode 100755 index 00000000..ff43b42a --- /dev/null +++ b/.claude/skills/transferbench-run/examples/multi-node.sh @@ -0,0 +1,59 @@ +#!/usr/bin/env bash +# Template launcher for multi-node TransferBench runs via mpirun. +# Adjust HOSTS, NP, and the OpenMPI install path for your cluster. +# +# Usage: +# ./multi-node.sh [N] +# Example: +# ./multi-node.sh nicp2p +# ./multi-node.sh podp2p +# ./multi-node.sh my.cfg 256M + +set -euo pipefail + +BINARY="${BINARY:-./TransferBenchCuda}" # or ./TransferBench for HIP +PRESET="${1:?usage: $0 [N]}" +SIZE="${2:-}" + +# --- MPI environment (edit for your cluster) ---------------------------------- +export PATH="${HOME}/rdma/ompi/install/bin:${PATH}" +export LD_LIBRARY_PATH="${HOME}/rdma/ompi/install/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" +export OPAL_PREFIX="${HOME}/rdma/ompi/install" + +HOSTS="${HOSTS:-node0:1,node1:1}" # one slot per node is recommended +NP="${NP:-2}" + +# --- TransferBench tuning (edit as needed) ------------------------------------ +# Forward each var with -x so MPI propagates it to every rank. 
+TB_ENV=( + "NUM_ITERATIONS=20" + "NUM_WARMUPS=3" + "OUTPUT_TO_CSV=0" + # "TB_FORCE_SINGLE_POD=1" # uncomment if AMD-SMI / NVML pod detection fails + # "USE_REMOTE_READ=1" + # "TB_DUMP_CFG_FILE=run_dump.cfg" +) + +x_flags=() +env_inline=() +for kv in "${TB_ENV[@]}"; do + key="${kv%%=*}" + x_flags+=("-x" "$key") + env_inline+=("$kv") +done + +# Export so mpirun can see them when forwarding with -x KEY +for kv in "${TB_ENV[@]}"; do export "$kv"; done + +CMD=(mpirun + --mca pml ucx + --mca btl '^vader,openib' + --host "$HOSTS" + -np "$NP" + "${x_flags[@]}" + "$BINARY" "$PRESET" +) +[[ -n "$SIZE" ]] && CMD+=("$SIZE") + +echo "# Launching: ${env_inline[*]} ${CMD[*]}" +"${CMD[@]}" diff --git a/.claude/skills/transferbench-run/references/config-format.md b/.claude/skills/transferbench-run/references/config-format.md new file mode 100644 index 00000000..f07c9131 --- /dev/null +++ b/.claude/skills/transferbench-run/references/config-format.md @@ -0,0 +1,110 @@ +# TransferBench config-file format + +A config file is plain text. Each non-comment line defines a **Test** — a set of **Transfers** that run in parallel. + +- Lines starting with `#` are ignored. +- Lines starting with `##` are echoed verbatim into output (use them as labels for results). +- Round brackets `()` and arrows `->` are decorative and ignored by the parser. + +## Two ways to specify a Test + +### Basic: same SE count for every Transfer + +``` +numTransfers SEs (srcMem1 -> Executor1 -> dstMem1) ... (srcMemN -> ExecutorN -> dstMemN) +``` + +- `numTransfers` — positive integer, count of parallel Transfers on this line +- `SEs` — number of SubExecutors used by every Transfer on the line +- Each triplet describes one Transfer + +Examples: +``` +1 4 (G0->G0->G1) # 4 CUs on GPU0 copy GPU0 -> GPU1 +1 4 (C1->G2->G0) # 4 CUs on GPU2 copy CPU1 -> GPU0 +2 4 G0->G0->G1 G1->G1->G0 # bidirectional, 4 SEs each +``` + +### Advanced: per-Transfer SE count and byte count + +``` +-numTransfers (srcMem1 -> Exec1 -> dstMem1 SEs1 Bytes1) ... 
(srcMemN -> ExecN -> dstMemN SEsN BytesN) +``` + +- `numTransfers` is **negated** to switch into advanced mode. +- `Bytes` is per-Transfer; `0` means "use the command-line `N`". May be suffixed with `K`, `M`, or `G`. Must be a multiple of 4. + +Example: +``` +-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M) +# Copies 1MiB GPU0->GPU1 with 4 CUs, in parallel with 2MiB GPU1->GPU0 with 8 CUs +``` + +## Executor codes + +`Executor` is one character + a 0-based device index (NICs use a two-part index). + +| Code | Executor | Index range | Notes | +|---|---|---|---| +| `C` | CPU | NUMA node | SubExecutor = CPU thread | +| `G` | GPU kernel | GPU device | SubExecutor = threadblock / CU | +| `D` | DMA | GPU device | Single SRC, ≥1 DST | +| `B` | Batched-DMA | GPU device | `hipMemcpyBatchAsync`-based; HIP 7.1 / CUDA 12.8+ | +| `I#.#` | NIC executor | Source NIC index `.` destination NIC index | e.g. `I0.2` = NIC0 to NIC2; SubExecutor = Queue Pair | +| `N#.#` | Nearest-NIC executor | Source GPU index `.` destination GPU index | Picks each end's closest NIC | + +## Memory-location codes + +A memory location is a one-letter code followed by an index. Multiple locations can be concatenated for multi-SRC / multi-DST (e.g. `G0G1` is "both GPU0 and GPU1 memory"). + +| Code | Memory type | Indexed by | +|---|---|---| +| `C` | Pinned host (coarse-grained) | NUMA node | +| `P` | Pinned host (closest-GPU NUMA) | GPU index | +| `B` | Coherent pinned host | NUMA node | +| `D` | Non-coherent pinned host | NUMA node | +| `K` | Uncached pinned host | NUMA node | +| `H` | Unpinned host | NUMA node | +| `G` | Global device memory | GPU | +| `F` | Fine-grain device memory | GPU | +| `U` | Uncached device memory | GPU | +| `N` | Null (no read or no write) | ignored | + +`N` on the SRC side gives a "memset-like" write benchmark; `N` on the DST side gives a "read-only" benchmark. 
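Returning to the size rules for `Bytes` and the command-line `N` (optional `K`, `M`, or `G` suffix, multiple of 4): the helper below is a rough pre-flight check, not something TransferBench ships. The name `tb_size_to_bytes` is hypothetical, and it assumes the suffixes are binary (1M = 1 MiB), which matches the "1MiB" wording of the advanced example. Only uppercase suffixes are handled.

```shell
#!/usr/bin/env bash
# Sketch: convert a size such as 256M to bytes and enforce the multiple-of-4 rule.
# tb_size_to_bytes is a hypothetical helper; suffixes are assumed to be binary.
tb_size_to_bytes() {
  local spec="$1" num mult=1
  case "$spec" in
    *K) mult=1024;                    num="${spec%K}" ;;
    *M) mult=$((1024 * 1024));        num="${spec%M}" ;;
    *G) mult=$((1024 * 1024 * 1024)); num="${spec%G}" ;;
    *)  num="$spec" ;;
  esac
  # Accept only a plain non-negative integer before the suffix.
  [[ "$num" =~ ^[0-9]+$ ]] || { echo "bad size: $spec" >&2; return 1; }
  # 10# forces base 10 so a leading zero is not read as octal.
  local bytes=$((10#$num * mult))
  (( bytes % 4 == 0 )) || { echo "$spec is not a multiple of 4 bytes" >&2; return 1; }
  printf '%s\n' "$bytes"
}

tb_size_to_bytes 1M
tb_size_to_bytes 256M
```

A check like this can sit in front of a launcher script so that a bad size fails before any ranks are started.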
+ +## Idiomatic patterns + +``` +## Memset by GPU0 onto its own memory +1 32 (N0->G0->G0) + +## Read-only by CPU0 NUMA node +1 4 (C0->C0->N0) + +## Broadcast from GPU0 to GPU0 and GPU1 simultaneously +1 16 (G0->G0->G0G1) + +## Fan-in / sum: read from GPU0 and GPU1, write the sum to GPU2 +1 16 (G0G1->G2->G2) + +## NIC RDMA between two GPUs across NIC0 and NIC2 with 2 QPs +1 2 (F0->I0.2->F1) + +## Nearest-NIC RDMA: each side picks its closest NIC +1 1 (F0->N0.1->F1) +``` + +## Validating a config without running it + +``` +./TransferBench dryrun "1 4 (G0->G0->G1)" +./TransferBench dryrun my.cfg +``` +`dryrun` parses, expands wildcards, and prints what *would* execute — useful when iterating on complex configs. + +## Capturing what a preset actually executes + +``` +TB_DUMP_CFG_FILE=p2p_dump.cfg ./TransferBench p2p +``` +Writes the resolved Transfers from the preset to a config file you can edit and rerun. diff --git a/.claude/skills/transferbench-run/references/env-vars.md b/.claude/skills/transferbench-run/references/env-vars.md new file mode 100644 index 00000000..5013a1cd --- /dev/null +++ b/.claude/skills/transferbench-run/references/env-vars.md @@ -0,0 +1,114 @@ +# TransferBench environment variables + +This is a curated guide to the most-used variables. For the authoritative complete list as compiled into your binary, run: + +```bash +./TransferBench envvars +``` + +## Iteration / timing + +| Variable | Default | Effect | +|---|---|---| +| `NUM_ITERATIONS` | `10` | Iterations per test. **Negative** = timed mode (run for that many seconds). | +| `NUM_SUBITERATIONS` | `1` | Sub-iterations per outer iteration. | +| `NUM_WARMUPS` | `3` | Warmup iterations before timing. | +| `USE_HIP_EVENTS` | `1` | Use HIP/CUDA events for timing (vs. host clock). | +| `SAMPLING_FACTOR` | `1` | Subsampling factor for sweep presets. 
| + +## Output / reporting + +| Variable | Default | Effect | +|---|---|---| +| `OUTPUT_TO_CSV` | `0` | Emit CSV output instead of human-readable tables. | +| `SHOW_BORDERS` | `1` | Draw table borders. | +| `SHOW_ITERATIONS` | `0` | Print per-iteration timings. | +| `SHOW_PERCENTILES` | unset | Comma list, e.g. `50,75,90,99`, to add percentile columns. | +| `HIDE_ENV` | `0` | Suppress the env-var summary printed at startup. | +| `OUTPUT_FORMAT` | preset-specific | `0` = list, `1` = full matrix (used by `podp2p`). | + +## Validation / data + +| Variable | Default | Effect | +|---|---|---| +| `ALWAYS_VALIDATE` | `0` | Validate destination data after every iteration (slow but safe). | +| `VALIDATE_DIRECT` | `0` | Validation reads memory directly without copy-back. | +| `VALIDATE_SOURCE` | `0` | Validate that source data is unchanged. | +| `FILL_PATTERN` | unset | Custom hex pattern for source initialization. | +| `FILL_COMPRESS` | unset | Use compressible source data. | +| `BYTE_OFFSET` | `0` | Offset (bytes) into allocated buffers. | +| `BLOCK_BYTES` | `256` | Block granularity for transfers. | + +## GPU / GFX kernel knobs + +| Variable | Default | Effect | +|---|---|---| +| `USE_SINGLE_STREAM` | `1` | When `0`, each Transfer gets its own stream (may serialize on HW-queue cap). | +| `GPU_MAX_HW_QUEUES` | `4` | Hardware-queue cap when `USE_SINGLE_STREAM=0`. Raise for more parallelism. | +| `GFX_KERNEL` | `0` | Choose copy kernel variant. | +| `GFX_BLOCK_ORDER` | `0` | Threadblock dispatch order. | +| `GFX_BLOCK_SIZE` | `256` | Threads per block. | +| `GFX_SE_TYPE` | `0` | SubExecutor mapping strategy. | +| `GFX_SINGLE_TEAM` | `0` | Combine work into a single team. | +| `GFX_TEMPORAL` | `0` | Temporal hints for cache. | +| `GFX_UNROLL` | preset-specific | Loop-unroll factor in the kernel. | +| `GFX_WAVE_ORDER` | `0` | Wavefront iteration order. | +| `GFX_WORD_SIZE` | `4` | Per-thread element size in bytes. 
| +| `CU_MASK` | unset | Bitmask restricting which CUs are used. | +| `XCC_PREF_TABLE` | unset | XCC preference table for MI300-class GPUs. | +| `USE_HSA_DMA` | `0` | Use HSA DMA path on AMD platforms. | + +## Variable SubExecutor sweeps + +| Variable | Default | Effect | +|---|---|---| +| `MIN_VAR_SUBEXEC` | `1` | Min SE count when sweeping. | +| `MAX_VAR_SUBEXEC` | `0` | Max SE count when sweeping (`0` = unlimited). | + +## NIC / RDMA + +| Variable | Default | Effect | +|---|---|---| +| `IB_GID_INDEX` | `-1` | InfiniBand GID index (`-1` = auto). | +| `IB_PORT_NUMBER` | `1` | IB port number. | +| `ROCE_VERSION` | `2` | RoCE version (1 or 2). | +| `IP_ADDRESS_FAMILY` | `4` | `4` = IPv4, `6` = IPv6. | +| `NIC_CHUNK_BYTES` | `1073741824` | Chunk size (bytes) for NIC transfers. | +| `NIC_CQ_POLL_BATCH` | `4` | Completion-queue poll batch size. | +| `NIC_RELAX_ORDER` | `1` | Relaxed ordering on the NIC. | +| `TB_NIC_FILTER` | unset | Restrict which NICs participate. | + +## Multi-rank / pod + +| Variable | Default | Effect | +|---|---|---| +| `TB_RANK` | unset | Rank ID (0-based) for socket-mode. | +| `TB_NUM_RANKS` | unset | Total ranks for socket-mode. | +| `TB_MASTER_ADDR` | unset | Master address printed by rank 0. | +| `TB_FORCE_SINGLE_POD` | `0` | Force single-pod membership when AMD-SMI/NVML unavailable. | + +## Debug / capture + +| Variable | Default | Effect | +|---|---|---| +| `TB_DUMP_CFG_FILE` | unset | Dump executed Transfers (e.g. from a preset) to this config file. | +| `TB_DUMP_LINES` | unset | Limit number of dumped lines. | +| `TB_WALLCLOCK_RATE` | unset | Override wallclock rate when GPU returns 0 (debug). | +| `USE_INTERACTIVE` | `0` | Pause for input between tests. 
| + +## Pod-preset specific + +Used by `podp2p` and `poda2a`: + +| Variable | Used by | Values | +|---|---|---| +| `P2P_MODE` | `podp2p` | `0` both, `1` uni only, `2` bi only | +| `A2A_MODE` | `poda2a` | `0` copy, `1` read-only, `2` write-only, `2:3` custom | +| `A2A_LOCAL` | `poda2a` | `0` exclude same-rank, `1` include | +| `PARALLEL_LVL` | `podp2p` | `0` serial node pairs, `1` parallel | +| `STRIDE` | `poda2a` | Interleave stride | +| `GROUP_SIZE` | `poda2a` | GPUs per group (must divide rank count) | +| `USE_GPU_DMA` | `podp2p` | `0` GFX exec, `1` DMA exec | +| `USE_DMA_EXEC` | `poda2a` | `0` GFX exec, `1` DMA exec (DMA only allowed for `A2A_MODE=0`) | +| `USE_REMOTE_READ` | both | `0` write to remote, `1` read from remote | +| `NUM_GPU_DEVICES` | both | Limit GPUs per rank | diff --git a/.claude/skills/transferbench-run/references/presets.md b/.claude/skills/transferbench-run/references/presets.md new file mode 100644 index 00000000..086979cc --- /dev/null +++ b/.claude/skills/transferbench-run/references/presets.md @@ -0,0 +1,74 @@ +# TransferBench presets + +Presets are built-in configurations that handle topology discovery and produce well-formatted bandwidth tables. Run any of them as the first argument: + +```bash +./TransferBench [N] +``` + +For the live list on a given build, run `./TransferBench presets`. + +## Single-node bandwidth presets + +| Preset | Purpose | +|---|---| +| `a2a` | All-to-all parallel transfers between every pair of GPUs. | +| `a2asweep` | GFX-based a2a swept across CU counts and unroll factors (`MEM_TYPE`, `NUM_SUB_EXECS`). | +| `bmasweep` | Compares DMA vs. Batched-DMA for one-to-many copies (HIP 7.1 / CUDA 12.8+). | +| `gfxsweep` | Sweeps GFX kernel options for one Transfer. | +| `hbm` | Local HBM read bandwidth on each GPU. | +| `healthcheck` | Quick correctness/perf health check (AMD MI300 series only). | +| `one2all` | All subsets of parallel transfers from one GPU to all others. 
| +| `p2p` | Peer-to-peer device-memory matrix between every GPU pair. | +| `pcopy` | Parallel copies from a single GPU to other GPUs. | +| `rsweep` | Random sweep through Transfer combinations. | +| `rwrite` | Parallel remote writes from a single GPU to others. | +| `scaling` | Scaling test: one GPU → all others, varying SEs, mem types (`CPU_MEM_TYPE`, `GPU_MEM_TYPE`). | +| `schmoo` | Local/remote read/write/copy scaling between two GPUs. | +| `smoketest` | Quick DMA/GFX correctness sweep. | +| `sweep` | Ordered sweep through Transfer combinations. | +| `wallclock` | Compares wallclock counters across XCCs within one GPU. | + +## Multi-node / NIC presets + +Require an MPI launcher or socket-mode environment variables (`TB_NUM_RANKS`, `TB_RANK`, `TB_MASTER_ADDR`). + +| Preset | Purpose | +|---|---| +| `a2a_n` | All-to-all over RDMA via each GPU's nearest NIC. | +| `nica2a` | NIC all-to-all using each NIC's closest GPU/CPU endpoint. | +| `nicp2p` | NIC peer-to-peer matrix across all NICs in the world. | +| `nicrings` | Ring transfers across identical NIC indices on each rank. | +| `rings` | Ring transfers within subgroups of pod ranks (also runs single-node). | + +## Pod-aware presets (multi-rank, single MNNVL/XGMI pod) + +Detect pod membership via AMD-SMI (HIP) or NVML (CUDA). If unavailable, set `TB_FORCE_SINGLE_POD=1`. + +| Preset | Purpose | Key knobs | +|---|---|---| +| `podp2p` | P2P across ranks within a pod. | `P2P_MODE`, `PARALLEL_LVL`, `USE_GPU_DMA`, `USE_REMOTE_READ`, `OUTPUT_FORMAT`, `NUM_GPU_DEVICES` | +| `poda2a` | All-to-all across ranks within a pod. | `A2A_MODE`, `A2A_LOCAL`, `STRIDE`, `GROUP_SIZE`, `USE_DMA_EXEC`, `USE_REMOTE_READ`, `NUM_GPU_DEVICES` | + +`P2P_MODE`: `0` = both directions, `1` = unidirectional only, `2` = bidirectional only. +`A2A_MODE`: `0` = copy, `1` = read-only, `2` = write-only, `2:3` = custom ratio. +`PARALLEL_LVL`: `0` = serial node pairs, `1` = node pairs in parallel. 
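The knob values above combine as ordinary environment variables. The sketch below only assembles and prints a `podp2p` launch line; the specific values are examples, and a real run would still need the MPI launcher or the socket-mode variables described for multi-rank execution.

```shell
#!/usr/bin/env bash
# Sketch: assemble a podp2p command from the documented knobs without running it.
# The chosen values are illustrative, not recommendations.
pod_env=(
  "P2P_MODE=1"             # unidirectional only
  "PARALLEL_LVL=1"         # node pairs in parallel
  "USE_REMOTE_READ=0"      # write to remote
  "TB_FORCE_SINGLE_POD=1"  # only if AMD-SMI / NVML pod detection is unavailable
)
cmd=(env "${pod_env[@]}" ./TransferBench podp2p)

# Print the line for review; to launch it, run: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```

Keeping the variables in an array makes it easy to log exactly which knobs were in effect next to the results.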
+ +## Info-only presets + +These print and exit; they don't run transfers. + +| Preset | Purpose | +|---|---| +| `help` | Config-file syntax with examples. | +| `presets` | Lists all available presets. | +| `envvars` | Lists every environment variable and its effect. | + +## Choosing a preset + +- "Quick GPU↔GPU bandwidth" → `p2p`. +- "All-pairs simultaneous" → `a2a`. +- "How does perf scale with CUs?" → `scaling` or `gfxsweep`. +- "Across two nodes via RDMA" → `nicp2p` for matrix, `nica2a` for collective-style. +- "Within an MNNVL pod" → `podp2p` / `poda2a`. +- "I want to capture what a preset does and tweak it" → run with `TB_DUMP_CFG_FILE=out.cfg`, then edit `out.cfg`.
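
The selection list above can also be encoded as a small lookup, which may be handy in wrapper scripts. This is only a sketch: the goal keywords and the function name `tb_pick_preset` are hypothetical, and only the preset names come from the list.

```shell
#!/usr/bin/env bash
# Sketch: map a goal keyword to the preset suggested in the list above.
# Keywords are made up for illustration; preset names are from the document.
tb_pick_preset() {
  case "$1" in
    gpu-pairs)       echo p2p ;;
    all-pairs)       echo a2a ;;
    cu-scaling)      echo scaling ;;
    rdma-matrix)     echo nicp2p ;;
    rdma-collective) echo nica2a ;;
    pod-p2p)         echo podp2p ;;
    pod-a2a)         echo poda2a ;;
    *) echo "unknown goal: $1" >&2; return 1 ;;
  esac
}

preset="$(tb_pick_preset gpu-pairs)"
echo "./TransferBench $preset"
```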