94 changes: 94 additions & 0 deletions .claude/skills/transferbench-debug/SKILL.md
@@ -0,0 +1,94 @@
---
name: transferbench-debug
description: Use when a TransferBench (ROCm/CUDA bandwidth-benchmark) run fails, hangs, crashes, validates incorrectly, or produces unexpected/misleading results — i.e. the user is troubleshooting rather than ramping up usage. Covers reading error output, isolating hangs (single-rank vs. multi-rank, NIC vs. POD detection), validation failures, performance regressions, and the binary's built-in verbose / dump / dryrun introspection. Does NOT cover writing new configs from scratch (use the run-side skill) or modifying TransferBench source.
---

# TransferBench debugging

This skill kicks in when something is **wrong** with a TransferBench run. The goal is always: turn a vague "it doesn't work" into a specific failure mode with a known fix or workaround.

## Triage flow

Always run these three steps **before** guessing:

1. **Reproduce with the smallest possible config.** Replace presets with a single-line `cmdline` if possible; halve rank count; drop to one Transfer.
2. **Confirm the binary parses the input.** Run `dryrun` instead of executing — separates parser bugs from runtime bugs.
3. **Capture what the binary actually saw.** `TB_DUMP_CFG_FILE=out.cfg` for presets; `HIDE_ENV=0` (default) so the env-var summary at startup is visible.
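
Steps 2 and 3 can be chained in a small helper. This is a minimal sketch, not part of TransferBench: `tb_triage`, `TB_BIN`, and the log file names are hypothetical, and it assumes `dryrun` exits non-zero when parsing fails (check that on your build).

```bash
# Hypothetical helper: TB_BIN and the log file names are illustrative only.
tb_triage() {
  local expr="$1" out="${2:-.}" bin="${TB_BIN:-./TransferBench}" rc=0
  mkdir -p "$out"
  # Step 2: parse only. A failure here points at the config, not the runtime.
  # ASSUMPTION: dryrun signals a parse failure through its exit status.
  if ! "$bin" dryrun "$expr" >"$out/dryrun.log" 2>&1; then
    echo "dryrun failed: see $out/dryrun.log"
    return 1
  fi
  # Step 3: run for real, keeping the startup env-var summary (HIDE_ENV=0 is the default).
  HIDE_ENV=0 "$bin" cmdline "$expr" >"$out/run.log" 2>&1 || rc=$?
  echo "run exited with status $rc: see $out/run.log"
  return "$rc"
}

# Usage: tb_triage "1 4 (G0->G0->G1)" triage-logs
```

Keeping the dryrun log separate from the run log makes it clear which stage failed.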

Only after that, branch by symptom — see the table below.

## Symptom → reference

| Symptom | Most likely cause | First thing to try | Deeper |
|---|---|---|---|
| Process hangs at startup, no output | MPI bootstrap or socket-mode env vars wrong | `mpirun --tag-output` to confirm all ranks started; verify `TB_NUM_RANKS` matches `-np` | `references/multi-rank-debug.md` |
| Pod-aware preset (`podp2p`, `poda2a`, `rings`) hangs or errors out before transfers | AMD-SMI / NVML pod detection unavailable | `TB_FORCE_SINGLE_POD=1` | `references/common-failures.md` §pod |
| RDMA preset (nicp2p, nica2a, …) hangs in NIC bring-up | GID index, IB port, or NIC filter wrong | Lower `IB_GID_INDEX` to a known good index; `TB_NIC_FILTER` to a single NIC | `references/multi-rank-debug.md` §rdma |
| Validation failure (`ALWAYS_VALIDATE=1` reports mismatch) | Wrong CU mask, wrong memory type, or actual HW issue | `VALIDATE_DIRECT=1`; rerun with `NUM_ITERATIONS=1` to see if first iter is wrong | `references/common-failures.md` §validation |
| Bandwidth far below expected | Stream/HW-queue serialization, wrong executor, GFX kernel mis-tuned | `USE_SINGLE_STREAM=0` + `GPU_MAX_HW_QUEUES=8`; try `D` (DMA) instead of `G` | `references/common-failures.md` §perf |
| Bandwidth varies wildly run-to-run | Warmup too short, NUMA/clock policy | `NUM_WARMUPS=10`, `SHOW_ITERATIONS=1`, `SHOW_PERCENTILES=50,90,99` | `references/common-failures.md` §perf |
| Crash / segfault | Bad memory code (e.g. `F` on a GPU without fine-grain), bad kernel for arch | Run with `dryrun` first; rebuild without optimization for symbol info | `references/common-failures.md` §crash |
| "Unsupported" / executor missing | Build-time disable (e.g. `DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`) | `./TransferBench` (no args) — its banner lists which executors are compiled in | `references/common-failures.md` §unsupported |
| Output is garbled / interleaved across ranks | MPI stderr buffering, no per-rank labels | `mpirun --tag-output` or pipe each rank into a per-rank log | `references/multi-rank-debug.md` §output |
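
For the interleaved-output row, tagged output can be split into one log per rank. A sketch, assuming Open MPI's `--tag-output` prefix looks like `[1,0]<stdout>:` (job ID, rank, stream); other launchers and versions may format the tag differently. `split_by_rank` is a hypothetical helper.

```bash
# Reads tagged mpirun output on stdin and appends each line to <outdir>/rank-<N>.log.
# ASSUMPTION: lines are prefixed "[<job>,<rank>]<stdout>:" or "[<job>,<rank>]<stderr>:".
split_by_rank() {
  local outdir="$1"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    match($0, /^\[[^,]*,[0-9]+\]<std(out|err)>: ?/) {
      tag  = substr($0, 1, RLENGTH)
      body = substr($0, RLENGTH + 1)
      sub(/^\[[^,]*,/, "", tag)   # drop "[<job>,"
      sub(/\].*$/, "", tag)       # drop "]<stdout>:", leaving only the rank
      print body >> (dir "/rank-" tag ".log")
      next
    }
    { print >> (dir "/untagged.log") }   # banner lines, launcher messages, etc.
  '
}

# Usage (launcher arguments elided):
#   mpirun --tag-output -np 2 ... 2>&1 | split_by_rank logs
```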

## The four "always-on" introspection commands

These four commands are how you **observe** the binary as it actually exists on this host (don't trust any documentation, including this one, when troubleshooting):

```bash
./TransferBench # banner: detected GPUs, NUMA, NICs, compiled features
./TransferBench help # config-file syntax with examples
./TransferBench presets # list of presets compiled into THIS build
./TransferBench envvars # complete list of env vars THIS build honors
```

Plus two safe inspections of any preset/config:

```bash
./TransferBench dryrun "<expression>" # validate parsing, expand wildcards
TB_DUMP_CFG_FILE=dump.cfg ./TransferBench p2p # dump what a preset actually emits
```

## Verbose / capture env vars

Reach for these when you need more visibility (full table in `references/verbose-introspection.md`):

| Env var | Effect |
|---|---|
| `HIDE_ENV=0` (default) | Print env-var summary at start (shows what was actually set) |
| `SHOW_ITERATIONS=1` | Per-iteration timings — exposes warmup/jitter issues |
| `SHOW_PERCENTILES=50,90,99` | Tail latencies — exposes slow-iteration outliers |
| `ALWAYS_VALIDATE=1` | Validate destination after every iteration (slow, but catches data-corruption regressions) |
| `VALIDATE_DIRECT=1` | Validate by reading the destination directly (skips copy-back path) |
| `VALIDATE_SOURCE=1` | Confirm src was unchanged (catches kernels that overwrite src) |
| `NUM_ITERATIONS=1` | Run exactly one iteration — useful when validation fails on iter N>0 |
| `NUM_WARMUPS=0` | Strip warmups so iter-0 timing is the cold case |
| `USE_INTERACTIVE=1` | Pause between tests — useful for `gdb attach` mid-run |
| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers from a preset to a config file |
| `TB_DUMP_LINES=N` | Limit number of dumped lines |
| `TB_VERBOSE=1` | Verbose lifecycle logging for newer execution paths (anvil/SDMA in recent builds) |
| `TB_WALLCLOCK_RATE=<hz>` | Override GPU wallclock rate when the GPU returns 0 (debug-only) |

## Multi-rank-specific quick checks

When debugging across nodes, before suspecting TransferBench itself:

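1. **Same binary on every node.** Run `md5sum` on the binary (`./TransferBench` or `./TransferBenchCuda`) on each host. A stale or mismatched binary on one node is the most common multi-rank gotcha.
2. **Same env on every rank.** With Open MPI, pass variables explicitly with `mpirun -x VAR` rather than relying on a shell `export`; without `-x`, ranks launched on remote nodes generally do not inherit your shell variables.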
3. **Network actually up.** `ibstatus` (RDMA) or a `nc` between hosts on your master port (socket mode).
4. **Hostfile slots = 1 per node.** TransferBench expects one rank per node by default.
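
The checksum comparison in check 1 is easy to automate. A sketch: `checksums_agree` is a hypothetical helper that reads `checksum host` pairs on stdin, and the host names and `ssh` loop in the usage comment are placeholders for however you reach your nodes.

```bash
# Succeeds only if every "checksum host" line on stdin carries the same checksum.
checksums_agree() {
  awk '
    NR == 1    { want = $1 }
    $1 != want { printf "MISMATCH on %s: %s (expected %s)\n", $2, $1, want; bad = 1 }
    END        { exit bad + 0 }
  '
}

# Usage (node1/node2 and the binary path are placeholders):
#   for h in node1 node2; do
#     printf '%s  %s\n' "$(ssh "$h" md5sum /path/to/binary | cut -d' ' -f1)" "$h"
#   done | checksums_agree
```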

## When you're stuck

If the table above and the references didn't help:

1. Build with `-g -O0` (or `-g -O1`) to get usable symbols, run under `gdb` / `cuda-gdb` / `rocgdb`, and `bt` once it hangs or crashes. Hangs in particular are usually obvious from the stuck thread's stack.
2. Strip the build down: pass `DISABLE_*` flags for any executor not under test (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, etc.). Eliminates whole code paths from suspicion.
3. Compare against a known-good commit. The `git log` on this repo has many tagged commits where features were added: check out an older commit, run the same config, and confirm it passes there. If it does, `git bisect` between that commit and the failing one narrows down the regression.
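
The known-good comparison can be automated with `git bisect run`, which interprets exit status 0 as good, 125 as "skip this commit", and 1 as bad. A sketch under assumptions: `bisect_step` and `TB_TIMEOUT` are hypothetical names, the build command is whatever your checkout uses, and GNU coreutils `timeout` is available.

```bash
# One bisect step: build, then run the failing config with a time limit.
bisect_step() {
  local build_cmd="$1"; shift
  # A commit that does not build says nothing about the bug, so ask git to skip it.
  sh -c "$build_cmd" >/dev/null 2>&1 || return 125
  # Treat a hang like a failure; the 120 s default is an arbitrary placeholder.
  timeout "${TB_TIMEOUT:-120}" "$@" >/dev/null 2>&1 || return 1
  return 0
}

# Usage (build command and commit names are placeholders):
#   git bisect start <bad-commit> <good-commit>
#   git bisect run bash -c '. ./bisect_step.sh; bisect_step "make" ./TransferBench cmdline "1 4 (G0->G0->G1)"'
```

This only works if the failure shows up as a non-zero exit status or a hang; a silently wrong result needs an extra check on the output.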

## References

- `references/common-failures.md` — symptom-organized catalog with concrete fixes
- `references/multi-rank-debug.md` — MPI / socket / RDMA-specific issues
- `references/verbose-introspection.md` — every debug-flavored env var + when to reach for it
- `examples/topology-probe.sh` — minimal script that prints what TransferBench sees about the host
58 changes: 58 additions & 0 deletions .claude/skills/transferbench-debug/examples/topology-probe.sh
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
# topology-probe.sh — print everything TransferBench can tell you about this host.
#
# Run as the FIRST step of any debugging session. Captures: detected GPUs,
# NUMA, NICs, compiled feature flags, env-var defaults, and (optionally) what
# a single preset actually emits.
#
# Usage:
#   ./topology-probe.sh            # just probe
#   ./topology-probe.sh p2p        # probe + dump what `p2p` would run
#   ./topology-probe.sh p2p out/   # write output files into out/

set -euo pipefail

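# Honor an explicit BINARY; otherwise prefer the CUDA build if present, else ./TransferBench.
if [[ -z "${BINARY:-}" ]]; then
  if [[ -x "./TransferBenchCuda" ]]; then BINARY="./TransferBenchCuda"; else BINARY="./TransferBench"; fi
fi
[[ -x "$BINARY" ]] || { echo "ERROR: $BINARY not found or not executable" >&2; exit 1; }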

PRESET="${1:-}"
OUTDIR="${2:-.}"
mkdir -p "$OUTDIR"

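echo "=== Binary banner (compiled features + detected hardware) ==="
# Capture the full output first, then preview it. Piping through `head` can end
# `tee` early with SIGPIPE, and under `set -o pipefail` a non-zero exit from
# either side would abort the script.
"$BINARY" >"$OUTDIR/banner.txt" 2>&1 || true
head -60 "$OUTDIR/banner.txt"
echo

echo "=== Compiled-in presets ==="
"$BINARY" presets >"$OUTDIR/presets.txt" 2>&1 || true
cat "$OUTDIR/presets.txt"
echo

echo "=== Compiled-in environment variables ==="
"$BINARY" envvars >"$OUTDIR/envvars.txt" 2>&1 || true
head -40 "$OUTDIR/envvars.txt"
echo " (full list in $OUTDIR/envvars.txt)"
echo

echo "=== Config-file syntax help ==="
"$BINARY" help >"$OUTDIR/help.txt" 2>&1 || true
head -40 "$OUTDIR/help.txt"
echo " (full help in $OUTDIR/help.txt)"
echo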

if [[ -n "$PRESET" ]]; then
  DUMP="$OUTDIR/${PRESET}_dump.cfg"
  echo "=== Dumping what '$PRESET' actually runs to $DUMP ==="
  TB_DUMP_CFG_FILE="$DUMP" TB_DUMP_LINES=100 "$BINARY" "$PRESET" >/dev/null 2>&1 || true
  if [[ -f "$DUMP" ]]; then
    echo " First 30 lines:"
    head -30 "$DUMP" | sed 's/^/ /'
  else
    echo " (TB_DUMP_CFG_FILE produced no output; preset may not support dump)"
  fi
fi

echo
echo "=== Quick parser sanity check ==="
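# Under `set -euo pipefail` a failing dryrun would abort before the final summary.
"$BINARY" dryrun "1 4 (G0->G0->G1)" 2>&1 | head -10 || echo " (dryrun pipeline exited non-zero)"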
echo

echo "Done. Files written to $OUTDIR/"
136 changes: 136 additions & 0 deletions .claude/skills/transferbench-debug/references/common-failures.md
@@ -0,0 +1,136 @@
# TransferBench common failures — symptoms, causes, fixes

Catalog of the most common "it doesn't work" cases, organized by symptom. Each section has: **symptom signal**, **likely cause(s)**, and **concrete fix or next probe**.

---

## §pod — pod-aware preset hangs / errors

### Symptom
- `podp2p` / `poda2a` / `rings` errors immediately with a pod-detection message.
- Or hangs in startup before any transfer table prints.

### Cause
- AMD-SMI (HIP build) or NVML (CUDA build) is unavailable, blocked, or returns inconsistent pod membership across ranks.
- The build has `POD_COMM_ENABLED` but the host lacks fabricmanager / `nvidia-fabricmanager.service`.

### Fix
1. `TB_FORCE_SINGLE_POD=1` — fastest workaround, treats every rank as one pod.
2. Confirm fabricmanager is running (CUDA): `systemctl status nvidia-fabricmanager`.
3. Confirm NVML works: a one-liner like `nvidia-smi -q | head` should succeed on every rank.
4. If you actually want per-pod awareness, ensure all ranks see the same pod IDs by running a probe script (each rank prints its detected pod ID).

---

## §rdma — NIC / RDMA preset hangs or errors

### Symptom
- Hang during NIC bring-up (no transfer table) on `nicp2p`, `nica2a`, `nicrings`, `a2a_n`.
- Or "QP create failed" / "RDMA connect failed" messages.

### Cause
- Wrong `IB_GID_INDEX` — depends on the host's IB / RoCE configuration.
- `IB_PORT_NUMBER` doesn't match the active port on the chosen NIC.
- More NICs detected than usable; some are unconfigured.
- RoCE version mismatch.

### Fix
1. Find a working GID: `show_gids` or `ibv_devinfo -v` on each host.
2. Set `IB_GID_INDEX=<index>`, `IB_PORT_NUMBER=<port>` to known good values.
3. Restrict to a single NIC with `TB_NIC_FILTER=<nic_name>` to localize the bad one.
4. RoCE: try `ROCE_VERSION=2` (most common) or `ROCE_VERSION=1`.
5. Confirm both ends agree on `IP_ADDRESS_FAMILY` (4 vs 6).
6. If using OpenMPI: `--mca pml ucx --mca btl ^vader,openib` is the canonical setting in this repo.
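
Fix steps 1 and 2 can be turned into a sweep when the right GID index is unknown. A sketch: `gid_sweep`, `TB_TIMEOUT`, and the log names are hypothetical, GNU coreutils `timeout` is assumed, and the launcher line in the usage comment is illustrative.

```bash
# Runs the given command once per candidate GID index, with a time limit so a
# hang in NIC bring-up does not stall the sweep. Logs go to gid-<index>.log.
gid_sweep() {
  local candidates="$1"; shift
  local idx
  for idx in $candidates; do
    if IB_GID_INDEX="$idx" timeout "${TB_TIMEOUT:-60}" "$@" >"gid-$idx.log" 2>&1; then
      echo "IB_GID_INDEX=$idx ok"
    else
      echo "IB_GID_INDEX=$idx failed or timed out (see gid-$idx.log)"
    fi
  done
}

# Usage (Open MPI style; -x forwards the variable to every rank):
#   gid_sweep "0 1 2 3" mpirun -x IB_GID_INDEX -np 2 ./TransferBench nicp2p
```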

---

## §validation — `ALWAYS_VALIDATE=1` reports mismatch

### Symptom
- Run completes but reports "Validation failed" / mismatch between expected and actual destination contents.

### Cause (in order of likelihood)
1. Wrong memory-location code (e.g. fine-grain `F` requested on a GPU that doesn't support it → memory backed by global instead, kernel writes go to a different place than reads).
2. Wrong CU mask — kernel uses a CU group that doesn't have the right cache visibility.
3. Multi-Transfer test where two Transfers race on the same destination address.
4. Actual hardware issue (least likely).

### Fix
1. `VALIDATE_DIRECT=1` — read destination directly without copy-back; isolates copy-back-path bugs.
2. `VALIDATE_SOURCE=1` — confirm source data was not overwritten by the kernel; catches `src == dst` issues.
3. `NUM_ITERATIONS=1 NUM_WARMUPS=0 ALWAYS_VALIDATE=1` — confirms it's not a state-leak between iterations.
4. Drop to `cmdline` with **one** Transfer to rule out multi-Transfer races.
5. If still failing: `FILL_PATTERN=0xDEADBEEF` (or any custom pattern) — makes the corruption signature easy to spot in the diff.
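
The first three steps can be run back to back, with each step's output in its own log. A sketch: `validation_ladder` is a hypothetical helper, and it makes no assumption about whether a mismatch changes the exit status, so read the logs rather than trusting the status alone.

```bash
# Runs the same command once per step, adding that step's env vars on top of
# ALWAYS_VALIDATE=1, and writes each run to <outdir>/step-<n>.log.
validation_ladder() {
  local outdir="$1"; shift
  local n=0 step rc
  mkdir -p "$outdir"
  while IFS= read -r step; do
    n=$((n + 1)); rc=0
    # $step is deliberately unquoted so it splits into separate VAR=value words.
    env ALWAYS_VALIDATE=1 $step "$@" >"$outdir/step-$n.log" 2>&1 </dev/null || rc=$?
    echo "step $n ($step): exit $rc"
  done <<'EOF'
VALIDATE_DIRECT=1
VALIDATE_SOURCE=1
NUM_ITERATIONS=1 NUM_WARMUPS=0
EOF
}

# Usage: validation_ladder vlogs ./TransferBench cmdline "1 4 (G0->G0->G1)"
```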

---

## §perf — bandwidth far below expected, or wildly variable

### Symptom
- Reported BW is a fraction (e.g. 1/4, 1/8) of the link's theoretical max.
- Or BW jumps 2× between iterations without obvious reason.

### Cause
- Stream-per-Transfer hits HW-queue limit and serializes (`USE_SINGLE_STREAM=0` + low `GPU_MAX_HW_QUEUES`).
- GFX kernel parameters mis-tuned for the size (`GFX_UNROLL`, `GFX_BLOCK_SIZE`, `GFX_WORD_SIZE`).
- Not enough warmup — first few iterations include allocation, paging, clock ramp.
- Wrong executor for the workload: GFX kernel for tiny payload (use DMA), DMA for one-to-many (use Batched-DMA).
- NUMA / pinned-memory mismatch (e.g. CPU-side memory on the wrong NUMA for the chosen GPU).

### Fix
1. `NUM_WARMUPS=10 NUM_ITERATIONS=20 SHOW_ITERATIONS=1` — see whether iter 0–2 are slow and the rest converge.
2. `SHOW_PERCENTILES=50,75,90,99` — exposes outlier iterations.
3. Try the alternate executor on the same memory pair: GFX (`G`) ↔ DMA (`D`) ↔ Batched-DMA (`B`).
4. `USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8` for many parallel Transfers.
5. Sweep with a preset (`gfxsweep`, `a2asweep`) to find the right kernel options before hand-tuning.
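
For step 3, the executor can be swapped without retyping the expression. A sketch, assuming each Transfer is written as a `(SRC->EXE->DST)` triple like the `(G0->G0->G1)` example used in this skill; `swap_executor` is a hypothetical helper and only handles single-letter executors with a plain numeric index.

```bash
# Rewrites the executor letter in every "(SRC->EXE->DST)" triple, keeping its index.
swap_executor() {
  local expr="$1" to="$2"
  printf '%s\n' "$expr" | sed -E "s/->[A-Z]([0-9]+)->/->${to}\1->/g"
}

# Usage: run the same memory pair under each executor and compare bandwidth.
#   for exe in G D B; do
#     ./TransferBench cmdline "$(swap_executor "1 4 (G0->G0->G1)" "$exe")"
#   done
```

Only the executor changes between runs, so any bandwidth difference can be attributed to it rather than to the memory pair.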

---

## §crash — crash / segfault

### Symptom
- Process exits with SIGSEGV / "memory access fault" / "invalid memory access."

### Cause
- Memory code unsupported by HW (e.g. `F` on a GPU without fine-grain memory; `U` with no uncached path).
- DMA Transfer with multiple SRCs (DMA requires exactly one SRC).
- NIC executor with mismatched index syntax (`I0` instead of `I0.0`).
- Buffer alignment: byte count not a multiple of 4 (the parser usually catches this, but custom builds may let it through).

### Fix
1. `dryrun "<expression>"` first — most parser-level bugs surface here.
2. Read the banner from `./TransferBench` with no args — confirms which memory types and executors are compiled in for this build.
3. For DMA crashes: confirm exactly one SRC per Transfer.
4. Build with `-g -O0` and run under `cuda-gdb` (NVIDIA) or `rocgdb` (AMD); the stack at the fault tells you which Executor's path failed.
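
Step 4 can be scripted so the backtrace is captured without an interactive session. A sketch: `tb_backtrace` and `TB_DEBUGGER` are hypothetical names, and it assumes `cuda-gdb` and `rocgdb` accept the same `-batch`, `-ex`, and `--args` options as upstream `gdb`.

```bash
# Runs the command under a debugger in batch mode and prints the stack at the fault.
tb_backtrace() {
  local dbg="${TB_DEBUGGER:-gdb}"
  # -batch exits once the -ex commands finish; "run" starts the program and
  # "bt" prints the backtrace of the thread that stopped.
  "$dbg" -batch -ex run -ex bt --args "$@"
}

# Usage: TB_DEBUGGER=rocgdb tb_backtrace ./TransferBench cmdline "1 4 (G0->G0->G1)" 2>&1 | tee bt.log
```

This covers crashes only; a hang still needs an interactive attach or an interrupt.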

---

## §unsupported — "executor missing" / "feature not compiled in"

### Symptom
- "Unsupported executor" or similar, even though the code seems to allow it.

### Cause
- This build was compiled with one of the `DISABLE_*` Makefile flags (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, `DISABLE_AMD_SMI=1`, etc.).
- Or `MPI_PATH` was not set, so multi-rank paths were stubbed out.

### Fix
1. Run `./TransferBench` with no arguments. Its banner lists which executors and features are compiled in for this exact binary.
2. If the feature is genuinely missing, rebuild without the corresponding `DISABLE_*` flag. (See the build-side skill — out of scope here.)

---

## §parser — config-file parser rejects a line

### Symptom
- "Failed to parse" / "Invalid config line" / silently runs the wrong thing.

### Cause
- Basic and advanced syntax confused (a positive `numTransfers` selects basic syntax, a negative one advanced).
- Whitespace inside an executor or memory token.
- Quoting issues on the shell side when using `cmdline` (e.g. `G*` getting glob-expanded).

### Fix
1. **Always quote** `cmdline` arguments: `./TransferBench cmdline "1 4 (G0->G0->G1)"`.
2. `dryrun` first to see the parse result without execution.
3. For complex configs, use `##` echo lines liberally — they show in the output and help correlate result rows to test definitions.
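
The quoting rule in step 1 is easy to see in isolation. A self-contained sketch with no TransferBench involved: `glob_demo` is a hypothetical helper that creates one file starting with `G` in a scratch directory and prints what a program would receive in each case.

```bash
# Shows why an unquoted expression is risky: the shell expands G* before the
# program ever sees it, if any file in the current directory matches.
glob_demo() {
  local dir
  dir=$(mktemp -d) || return 1
  (
    cd "$dir" || exit 1
    : >Gfoo                        # any file name starting with "G" triggers the expansion
    printf 'unquoted: %s\n' G*     # the shell substitutes the matching file name
    printf 'quoted:   %s\n' "G*"   # the literal pattern reaches the program
  )
  rm -rf "$dir"
}

glob_demo
# unquoted: Gfoo
# quoted:   G*
```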