ROCm · AtlantaPepsi · May 28, 2026
@@ -0,0 +1,94 @@
+---
+name: transferbench-debug
+description: Use when a TransferBench (ROCm/CUDA bandwidth-benchmark) run fails, hangs, crashes, validates incorrectly, or produces unexpected/misleading results — i.e. the user is troubleshooting rather than ramping up usage. Covers reading error output, isolating hangs (single-rank vs. multi-rank, NIC vs. POD detection), validation failures, performance regressions, and the binary's built-in verbose / dump / dryrun introspection. Does NOT cover writing new configs from scratch (use the run-side skill) or modifying TransferBench source.
+---
+
+# TransferBench debugging
+
+This skill kicks in when something is **wrong** with a TransferBench run. The goal is always: turn a vague "it doesn't work" into a specific failure mode with a known fix or workaround.
+
+## Triage flow
+
+Always run these three steps **first** before guessing:
+
+1. **Reproduce with the smallest possible config.** Replace presets with a single-line `cmdline` if possible; halve rank count; drop to one Transfer.
+2. **Confirm the binary parses the input.** Run `dryrun` instead of executing — separates parser bugs from runtime bugs.
+3. **Capture what the binary actually saw.** `TB_DUMP_CFG_FILE=out.cfg` for presets; `HIDE_ENV=0` (default) so the env-var summary at startup is visible.
+
+Only after that, branch by symptom — see the table below.
+
+## Symptom → reference
+
+| Symptom | Most likely cause | First thing to try | Deeper |
+|---|---|---|---|
+| Process hangs at startup, no output | MPI bootstrap or socket-mode env vars wrong | `mpirun --tag-output` to confirm all ranks started; verify `TB_NUM_RANKS` matches `-np` | `references/multi-rank-debug.md` |
+| `Pod-aware` preset hangs or errors out before transfers | AMD-SMI / NVML pod detection unavailable | `TB_FORCE_SINGLE_POD=1` | `references/common-failures.md` §pod |
+| RDMA preset (nicp2p, nica2a, …) hangs in NIC bring-up | GID index, IB port, or NIC filter wrong | Lower `IB_GID_INDEX` to a known good index; `TB_NIC_FILTER` to a single NIC | `references/multi-rank-debug.md` §rdma |
+| Validation failure (`ALWAYS_VALIDATE=1` reports mismatch) | Wrong CU mask, wrong memory type, or actual HW issue | `VALIDATE_DIRECT=1`; rerun with `NUM_ITERATIONS=1` to see if first iter is wrong | `references/common-failures.md` §validation |
+| Bandwidth far below expected | Stream/HW-queue serialization, wrong executor, GFX kernel mis-tuned | `USE_SINGLE_STREAM=0` + `GPU_MAX_HW_QUEUES=8`; try `D` (DMA) instead of `G` | `references/common-failures.md` §perf |
+| Bandwidth varies wildly run-to-run | Warmup too short, NUMA/clock policy | `NUM_WARMUPS=10`, `SHOW_ITERATIONS=1`, `SHOW_PERCENTILES=50,90,99` | `references/common-failures.md` §perf |
+| Crash / segfault | Bad memory code (e.g. `F` on a GPU without fine-grain), bad kernel for arch | Run with `dryrun` first; rebuild without optimization for symbol info | `references/common-failures.md` §crash |
+| "Unsupported" / executor missing | Build-time disable (e.g. `DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`) | `./TransferBench` (no args) — its banner lists which executors are compiled in | `references/common-failures.md` §unsupported |
+| Output is garbled / interleaved across ranks | MPI stderr buffering, no per-rank labels | `mpirun --tag-output` or pipe each rank into a per-rank log | `references/multi-rank-debug.md` §output |
+
+## The four "always-on" introspection commands
+
+These four commands are how you **observe** the binary as it actually exists on this host (don't trust any documentation, including this one, when troubleshooting):
+
+```bash
+./TransferBench                  # banner: detected GPUs, NUMA, NICs, compiled features
+./TransferBench help             # config-file syntax with examples
+./TransferBench presets          # list of presets compiled into THIS build
+./TransferBench envvars          # complete list of env vars THIS build honors
+```
+
+Plus two safe inspections of any preset/config:
+
+```bash
+./TransferBench dryrun "<expression>"          # validate parsing, expand wildcards
+TB_DUMP_CFG_FILE=dump.cfg ./TransferBench p2p  # dump what a preset actually emits
+```
+
+## Verbose / capture env vars
+
+Reach for these when you need more visibility (full table in `references/verbose-introspection.md`):
+
+| Env var | Effect |
+|---|---|
+| `HIDE_ENV=0` (default) | Print env-var summary at start (shows what was actually set) |
+| `SHOW_ITERATIONS=1` | Per-iteration timings — exposes warmup/jitter issues |
+| `SHOW_PERCENTILES=50,90,99` | Tail latencies — exposes slow-iteration outliers |
+| `ALWAYS_VALIDATE=1` | Validate destination after every iteration (slow, but catches data-corruption regressions) |
+| `VALIDATE_DIRECT=1` | Validate by reading the destination directly (skips copy-back path) |
+| `VALIDATE_SOURCE=1` | Confirm src was unchanged (catches kernels that overwrite src) |
+| `NUM_ITERATIONS=1` | Run exactly one iteration — useful when validation fails on iter N>0 |
+| `NUM_WARMUPS=0` | Strip warmups so iter-0 timing is the cold case |
+| `USE_INTERACTIVE=1` | Pause between tests — useful for `gdb attach` mid-run |
+| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers from a preset to a config file |
+| `TB_DUMP_LINES=N` | Limit number of dumped lines |
+| `TB_VERBOSE=1` | Verbose lifecycle logging for newer execution paths (anvil/SDMA in recent builds) |
+| `TB_WALLCLOCK_RATE=<hz>` | Override GPU wallclock rate when the GPU returns 0 (debug-only) |
+
+## Multi-rank-specific quick checks
+
+When debugging across nodes, before suspecting TransferBench itself:
+
+1. **Same binary on every node.** `md5sum ./TransferBenchCuda` on each host. A different mtime/checksum is the most common multi-rank gotcha.
+2. **Same env on every rank.** Use `mpirun -x VAR` (not just shell-export); without `-x`, only rank 0 sees your shell vars.
+3. **Network actually up.** `ibstatus` (RDMA) or a `nc` between hosts on your master port (socket mode).
+4. **Hostfile slots = 1 per node.** TransferBench expects one rank per node by default.
+
+## When you're stuck
+
+If the table above and the references didn't help:
+
+1. Build with `-g -O0` (or `-g -O1`) to get usable symbols, run under `gdb` / `cuda-gdb` / `rocgdb`, and `bt` once it hangs or crashes. Hangs in particular are usually obvious from the stuck thread's stack.
+2. Strip the build down: pass `DISABLE_*` flags for any executor not under test (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, etc.). Eliminates whole code paths from suspicion.
+3. Compare against a known-good commit. The `git log` on this repo has many tagged commits where features were added — you can check out an older commit, run the same config, and confirm it passes there.
+
+## References
+
+- `references/common-failures.md` — symptom-organized catalog with concrete fixes
+- `references/multi-rank-debug.md` — MPI / socket / RDMA-specific issues
+- `references/verbose-introspection.md` — every debug-flavored env var + when to reach for it
+- `examples/topology-probe.sh` — minimal script that prints what TransferBench sees about the host
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# topology-probe.sh — print everything TransferBench can tell you about this host.
+#
+# Run as the FIRST step of any debugging session. Captures: detected GPUs,
+# NUMA, NICs, compiled feature flags, env-var defaults, and (optionally) what
+# a single preset actually emits.
+#
+# Usage:
+#   ./topology-probe.sh                 # just probe
+#   ./topology-probe.sh p2p             # probe + dump what `p2p` would run
+#   ./topology-probe.sh p2p out/        # write output files into out/
+
+set -euo pipefail
+
+BINARY="${BINARY:-./TransferBench}"
+[[ -x "./TransferBenchCuda" ]] && BINARY="${BINARY/TransferBench/TransferBenchCuda}"
+[[ -x "$BINARY" ]] || { echo "ERROR: $BINARY not found or not executable"; exit 1; }
+
+PRESET="${1:-}"
+OUTDIR="${2:-.}"
+mkdir -p "$OUTDIR"
+
+echo "=== Binary banner (compiled features + detected hardware) ==="
+"$BINARY" 2>&1 | tee "$OUTDIR/banner.txt" | head -60
+echo
+
+echo "=== Compiled-in presets ==="
+"$BINARY" presets 2>&1 | tee "$OUTDIR/presets.txt"
+echo
+
+echo "=== Compiled-in environment variables ==="
+"$BINARY" envvars 2>&1 | tee "$OUTDIR/envvars.txt" | head -40
+echo "  (full list in $OUTDIR/envvars.txt)"
+echo
+
+echo "=== Config-file syntax help ==="
+"$BINARY" help 2>&1 | tee "$OUTDIR/help.txt" | head -40
+echo "  (full help in $OUTDIR/help.txt)"
+echo
+
+if [[ -n "$PRESET" ]]; then
+  DUMP="$OUTDIR/${PRESET}_dump.cfg"
+  echo "=== Dumping what '$PRESET' actually runs to $DUMP ==="
+  TB_DUMP_CFG_FILE="$DUMP" TB_DUMP_LINES=100 "$BINARY" "$PRESET" >/dev/null 2>&1 || true
+  if [[ -f "$DUMP" ]]; then
+    echo "  First 30 lines:"
+    head -30 "$DUMP" | sed 's/^/    /'
+  else
+    echo "  (TB_DUMP_CFG_FILE produced no output — preset may not support dump)"
+  fi
+fi
+
+echo
+echo "=== Quick parser sanity check ==="
+"$BINARY" dryrun "1 4 (G0->G0->G1)" 2>&1 | head -10
+echo
+
+echo "Done. Files written to $OUTDIR/"
@@ -0,0 +1,136 @@
+# TransferBench common failures — symptoms, causes, fixes
+
+Catalog of the most common "it doesn't work" cases, organized by symptom. Each section has: **symptom signal**, **likely cause(s)**, and **concrete fix or next probe**.
+
+---
+
+## §pod — pod-aware preset hangs / errors
+
+### Symptom
+- `podp2p` / `poda2a` / `rings` errors immediately with a pod-detection message.
+- Or hangs in startup before any transfer table prints.
+
+### Cause
+- AMD-SMI (HIP build) or NVML (CUDA build) is unavailable, blocked, or returns inconsistent pod membership across ranks.
+- The build has `POD_COMM_ENABLED` but the host lacks fabricmanager / `nvidia-fabricmanager.service`.
+
+### Fix
+1. `TB_FORCE_SINGLE_POD=1` — fastest workaround, treats every rank as one pod.
+2. Confirm fabricmanager is running (CUDA): `systemctl status nvidia-fabricmanager`.
+3. Confirm NVML works: a one-liner like `nvidia-smi -q | head` should succeed on every rank.
+4. If you actually want per-pod awareness, ensure all ranks see the same pod IDs by running a probe script (each rank prints its detected pod ID).
+
+---
+
+## §rdma — NIC / RDMA preset hangs or errors
+
+### Symptom
+- Hang during NIC bring-up (no transfer table) on `nicp2p`, `nica2a`, `nicrings`, `a2a_n`.
+- Or "QP create failed" / "RDMA connect failed" messages.
+
+### Cause
+- Wrong `IB_GID_INDEX` — depends on the host's IB / RoCE configuration.
+- `IB_PORT_NUMBER` doesn't match the active port on the chosen NIC.
+- More NICs detected than usable; some are unconfigured.
+- RoCE version mismatch.
+
+### Fix
+1. Find a working GID: `show_gids` or `ibv_devinfo -v` on each host.
+2. Set `IB_GID_INDEX=<index>`, `IB_PORT_NUMBER=<port>` to known good values.
+3. Restrict to a single NIC with `TB_NIC_FILTER=<nic_name>` to localize the bad one.
+4. RoCE: try `ROCE_VERSION=2` (most common) or `ROCE_VERSION=1`.
+5. Confirm both ends agree on `IP_ADDRESS_FAMILY` (4 vs 6).
+6. If using OpenMPI: `--mca pml ucx --mca btl ^vader,openib` is the canonical setting in this repo.
+
+---
+
+## §validation — `ALWAYS_VALIDATE=1` reports mismatch
+
+### Symptom
+- Run completes but reports "Validation failed" / mismatch between expected and actual destination contents.
+
+### Cause (in order of likelihood)
+1. Wrong memory-location code (e.g. fine-grain `F` requested on a GPU that doesn't support it → memory backed by global instead, kernel writes go to a different place than reads).
+2. Wrong CU mask — kernel uses a CU group that doesn't have the right cache visibility.
+3. Multi-Transfer test where two Transfers race on the same destination address.
+4. Actual hardware issue (least likely).
+
+### Fix
+1. `VALIDATE_DIRECT=1` — read destination directly without copy-back; isolates copy-back-path bugs.
+2. `VALIDATE_SOURCE=1` — confirm source data was not overwritten by the kernel; catches `src == dst` issues.
+3. `NUM_ITERATIONS=1 NUM_WARMUPS=0 ALWAYS_VALIDATE=1` — confirms it's not a state-leak between iterations.
+4. Drop to `cmdline` with **one** Transfer to rule out multi-Transfer races.
+5. If still failing: `FILL_PATTERN=0xDEADBEEF` (or any custom pattern) — makes the corruption signature easy to spot in the diff.
+
+---
+
+## §perf — bandwidth far below expected, or wildly variable
+
+### Symptom
+- Reported BW is a fraction (e.g. 1/4, 1/8) of the link's theoretical max.
+- Or BW jumps 2× between iterations without obvious reason.
+
+### Cause
+- Stream-per-Transfer hits HW-queue limit and serializes (`USE_SINGLE_STREAM=0` + low `GPU_MAX_HW_QUEUES`).
+- GFX kernel parameters mis-tuned for the size (`GFX_UNROLL`, `GFX_BLOCK_SIZE`, `GFX_WORD_SIZE`).
+- Not enough warmup — first few iterations include allocation, paging, clock ramp.
+- Wrong executor for the workload: GFX kernel for tiny payload (use DMA), DMA for one-to-many (use Batched-DMA).
+- NUMA / pinned-memory mismatch (e.g. CPU-side memory on the wrong NUMA for the chosen GPU).
+
+### Fix
+1. `NUM_WARMUPS=10 NUM_ITERATIONS=20 SHOW_ITERATIONS=1` — see whether iter 0–2 are slow and the rest converge.
+2. `SHOW_PERCENTILES=50,75,90,99` — exposes outlier iterations.
+3. Try the alternate executor on the same memory pair: GFX (`G`) ↔ DMA (`D`) ↔ Batched-DMA (`B`).
+4. `USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8` for many parallel Transfers.
+5. Sweep with a preset (`gfxsweep`, `a2asweep`) to find the right kernel options before hand-tuning.
+
+---
+
+## §crash — crash / segfault
+
+### Symptom
+- Process exits with SIGSEGV / "memory access fault" / "invalid memory access."
+
+### Cause
+- Memory code unsupported by HW (e.g. `F` on a GPU without fine-grain memory; `U` with no uncached path).
+- DMA Transfer with multiple SRCs (DMA requires exactly one SRC).
+- NIC executor with mismatched index syntax (`I0` instead of `I0.0`).
+- Buffer alignment: byte count not a multiple of 4 (parser usually catches this, but custom builds may slip).
+
+### Fix
+1. `dryrun "<expression>"` first — most parser-level bugs surface here.
+2. Read the banner from `./TransferBench` with no args — confirms which memory types and executors are compiled in for this build.
+3. For DMA crashes: confirm exactly one SRC per Transfer.
+4. Build with `-g -O0` and run under `cuda-gdb` (NVIDIA) or `rocgdb` (AMD); the stack at the fault tells you which Executor's path failed.
+
+---
+
+## §unsupported — "executor missing" / "feature not compiled in"
+
+### Symptom
+- "Unsupported executor" or similar, even though the code seems to allow it.
+
+### Cause
+- This build was compiled with one of the `DISABLE_*` Makefile flags (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, `DISABLE_AMD_SMI=1`, etc.).
+- Or `MPI_PATH` was not set, so multi-rank paths were stubbed out.
+
+### Fix
+1. Run `./TransferBench` with no arguments. Its banner lists which executors and features are compiled in for this exact binary.
+2. If the feature is genuinely missing, rebuild without the corresponding `DISABLE_*` flag. (See the build-side skill — out of scope here.)
+
+---
+
+## §parser — config-file parser rejects a line
+
+### Symptom
+- "Failed to parse" / "Invalid config line" / silently runs the wrong thing.
+
+### Cause
+- Confused basic vs. advanced syntax (`numTransfers` positive vs. negative).
+- Whitespace inside an executor or memory token.
+- Quoting issues on the shell side when using `cmdline` (e.g. `G*` getting glob-expanded).
+
+### Fix
+1. **Always quote** `cmdline` arguments: `./TransferBench cmdline "1 4 (G0->G0->G1)"`.
+2. `dryrun` first to see the parse result without execution.
+3. For complex configs, use `##` echo lines liberally — they show in the output and help correlate result rows to test definitions.