From 014e05d3fe822680830035182a4976bbaa271766 Mon Sep 17 00:00:00 2001
From: AtlantaPepsi <timhu102@amd.com>
Date: Thu, 28 May 2026 19:57:30 +0000
Subject: [PATCH] first addition of skills

---
 .claude/skills/transferbench-debug/SKILL.md   |  94 +++++++++++
 .../examples/topology-probe.sh                |  58 +++++++
 .../references/common-failures.md             | 136 ++++++++++++++++
 .../references/multi-rank-debug.md            | 148 +++++++++++++++++
 .../references/verbose-introspection.md       | 112 +++++++++++++
 .claude/skills/transferbench-run/SKILL.md     | 150 ++++++++++++++++++
 .../examples/advanced-mixed.cfg               |  18 +++
 .../transferbench-run/examples/basic-p2p.cfg  |  21 +++
 .../transferbench-run/examples/multi-node.sh  |  59 +++++++
 .../references/config-format.md               | 110 +++++++++++++
 .../transferbench-run/references/env-vars.md  | 114 +++++++++++++
 .../transferbench-run/references/presets.md   |  74 +++++++++
 12 files changed, 1094 insertions(+)
 create mode 100644 .claude/skills/transferbench-debug/SKILL.md
 create mode 100755 .claude/skills/transferbench-debug/examples/topology-probe.sh
 create mode 100644 .claude/skills/transferbench-debug/references/common-failures.md
 create mode 100644 .claude/skills/transferbench-debug/references/multi-rank-debug.md
 create mode 100644 .claude/skills/transferbench-debug/references/verbose-introspection.md
 create mode 100644 .claude/skills/transferbench-run/SKILL.md
 create mode 100644 .claude/skills/transferbench-run/examples/advanced-mixed.cfg
 create mode 100644 .claude/skills/transferbench-run/examples/basic-p2p.cfg
 create mode 100755 .claude/skills/transferbench-run/examples/multi-node.sh
 create mode 100644 .claude/skills/transferbench-run/references/config-format.md
 create mode 100644 .claude/skills/transferbench-run/references/env-vars.md
 create mode 100644 .claude/skills/transferbench-run/references/presets.md

diff --git a/.claude/skills/transferbench-debug/SKILL.md b/.claude/skills/transferbench-debug/SKILL.md
new file mode 100644
index 00000000..d0da63ea
--- /dev/null
+++ b/.claude/skills/transferbench-debug/SKILL.md
@@ -0,0 +1,94 @@
+---
+name: transferbench-debug
+description: Use when a TransferBench (ROCm/CUDA bandwidth-benchmark) run fails, hangs, crashes, validates incorrectly, or produces unexpected/misleading results — i.e. the user is troubleshooting rather than ramping up usage. Covers reading error output, isolating hangs (single-rank vs. multi-rank, NIC vs. POD detection), validation failures, performance regressions, and the binary's built-in verbose / dump / dryrun introspection. Does NOT cover writing new configs from scratch (use the run-side skill) or modifying TransferBench source.
+---
+
+# TransferBench debugging
+
+This skill kicks in when something is **wrong** with a TransferBench run. The goal is always: turn a vague "it doesn't work" into a specific failure mode with a known fix or workaround.
+
+## Triage flow
+
+Always run these three steps **before** guessing:
+
+1. **Reproduce with the smallest possible config.** Replace presets with a single-line `cmdline` if possible; halve rank count; drop to one Transfer.
+2. **Confirm the binary parses the input.** Run `dryrun` instead of executing — separates parser bugs from runtime bugs.
+3. **Capture what the binary actually saw.** `TB_DUMP_CFG_FILE=out.cfg` for presets; `HIDE_ENV=0` (default) so the env-var summary at startup is visible.
+
+Only after that, branch by symptom — see the table below.
+
+## Symptom → reference
+
+| Symptom | Most likely cause | First thing to try | Deeper |
+|---|---|---|---|
+| Process hangs at startup, no output | MPI bootstrap or socket-mode env vars wrong | `mpirun --tag-output` to confirm all ranks started; in socket mode, verify `TB_NUM_RANKS` is identical on every rank and equals the number of processes launched | `references/multi-rank-debug.md` |
+| `Pod-aware` preset hangs or errors out before transfers | AMD-SMI / NVML pod detection unavailable | `TB_FORCE_SINGLE_POD=1` | `references/common-failures.md` §pod |
+| RDMA preset (nicp2p, nica2a, …) hangs in NIC bring-up | GID index, IB port, or NIC filter wrong | Set `IB_GID_INDEX` to a known-good index; restrict `TB_NIC_FILTER` to a single NIC | `references/common-failures.md` §rdma, `references/multi-rank-debug.md` (NIC bring-up layer) |
+| Validation failure (`ALWAYS_VALIDATE=1` reports mismatch) | Wrong CU mask, wrong memory type, or actual HW issue | `VALIDATE_DIRECT=1`; rerun with `NUM_ITERATIONS=1` to see if first iter is wrong | `references/common-failures.md` §validation |
+| Bandwidth far below expected | Stream/HW-queue serialization, wrong executor, GFX kernel mis-tuned | `USE_SINGLE_STREAM=0` + `GPU_MAX_HW_QUEUES=8`; try `D` (DMA) instead of `G` | `references/common-failures.md` §perf |
+| Bandwidth varies wildly run-to-run | Warmup too short, NUMA/clock policy | `NUM_WARMUPS=10`, `SHOW_ITERATIONS=1`, `SHOW_PERCENTILES=50,90,99` | `references/common-failures.md` §perf |
+| Crash / segfault | Bad memory code (e.g. `F` on a GPU without fine-grain), bad kernel for arch | Run with `dryrun` first; rebuild without optimization for symbol info | `references/common-failures.md` §crash |
+| "Unsupported" / executor missing | Build-time disable (e.g. `DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`) | `./TransferBench` (no args) — its banner lists which executors are compiled in | `references/common-failures.md` §unsupported |
+| Output is garbled / interleaved across ranks | MPI stderr buffering, no per-rank labels | `mpirun --tag-output` or pipe each rank into a per-rank log | `references/multi-rank-debug.md` §output |
+
+## The four "always-on" introspection commands
+
+These four commands are how you **observe** the binary as it actually exists on this host (don't trust any documentation, including this one, when troubleshooting):
+
+```bash
+./TransferBench                  # banner: detected GPUs, NUMA, NICs, compiled features
+./TransferBench help             # config-file syntax with examples
+./TransferBench presets          # list of presets compiled into THIS build
+./TransferBench envvars          # complete list of env vars THIS build honors
+```
+
+Plus two safe inspections of any preset/config:
+
+```bash
+./TransferBench dryrun "<expression>"          # validate parsing, expand wildcards
+TB_DUMP_CFG_FILE=dump.cfg ./TransferBench p2p  # dump what a preset actually emits
+```
+
+## Verbose / capture env vars
+
+Reach for these when you need more visibility (full table in `references/verbose-introspection.md`):
+
+| Env var | Effect |
+|---|---|
+| `HIDE_ENV=0` (default) | Print env-var summary at start (shows what was actually set) |
+| `SHOW_ITERATIONS=1` | Per-iteration timings — exposes warmup/jitter issues |
+| `SHOW_PERCENTILES=50,90,99` | Tail latencies — exposes slow-iteration outliers |
+| `ALWAYS_VALIDATE=1` | Validate destination after every iteration (slow, but catches data-corruption regressions) |
+| `VALIDATE_DIRECT=1` | Validate by reading the destination directly (skips copy-back path) |
+| `VALIDATE_SOURCE=1` | Confirm src was unchanged (catches kernels that overwrite src) |
+| `NUM_ITERATIONS=1` | Run exactly one iteration — useful when validation fails on iter N>0 |
+| `NUM_WARMUPS=0` | Strip warmups so iter-0 timing is the cold case |
+| `USE_INTERACTIVE=1` | Pause between tests — useful for `gdb attach` mid-run |
+| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers from a preset to a config file |
+| `TB_DUMP_LINES=N` | Limit number of dumped lines |
+| `TB_VERBOSE=1` | Verbose lifecycle logging for newer execution paths (anvil/SDMA in recent builds) |
+| `TB_WALLCLOCK_RATE=<hz>` | Override GPU wallclock rate when the GPU returns 0 (debug-only) |
+
+## Multi-rank-specific quick checks
+
+When debugging across nodes, before suspecting TransferBench itself:
+
+1. **Same binary on every node.** `md5sum ./TransferBenchCuda` (or `./TransferBench` for the HIP build) on each host. A mismatched checksum is the most common multi-rank gotcha.
+2. **Same env on every rank.** Use `mpirun -x VAR` (not just a shell export); without `-x`, ranks on remote nodes do not see your shell vars.
+3. **Network actually up.** `ibstatus` (RDMA) or a `nc` between hosts on your master port (socket mode).
+4. **Hostfile slots = 1 per node.** TransferBench expects one rank per node by default.
+
+## When you're stuck
+
+If the table above and the references didn't help:
+
+1. Build with `-g -O0` (or `-g -O1`) to get usable symbols, run under `gdb` / `cuda-gdb` / `rocgdb`, and `bt` once it hangs or crashes. Hangs in particular are usually obvious from the stuck thread's stack.
+2. Strip the build down: pass `DISABLE_*` flags for any executor not under test (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, etc.). Eliminates whole code paths from suspicion.
+3. Compare against a known-good commit. The `git log` on this repo has many tagged commits where features were added: check out an older commit, run the same config, and confirm it passes there. If it does, `git bisect` can narrow down the commit that introduced the failure.
+
+## References
+
+- `references/common-failures.md` — symptom-organized catalog with concrete fixes
+- `references/multi-rank-debug.md` — MPI / socket / RDMA-specific issues
+- `references/verbose-introspection.md` — every debug-flavored env var + when to reach for it
+- `examples/topology-probe.sh` — minimal script that prints what TransferBench sees about the host
diff --git a/.claude/skills/transferbench-debug/examples/topology-probe.sh b/.claude/skills/transferbench-debug/examples/topology-probe.sh
new file mode 100755
index 00000000..6a87489b
--- /dev/null
+++ b/.claude/skills/transferbench-debug/examples/topology-probe.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# topology-probe.sh — print everything TransferBench can tell you about this host.
+#
+# Run as the FIRST step of any debugging session. Captures: detected GPUs,
+# NUMA, NICs, compiled feature flags, env-var defaults, and (optionally) what
+# a single preset actually emits.
+#
+# Usage:
+#   ./topology-probe.sh                 # just probe
+#   ./topology-probe.sh p2p             # probe + dump what `p2p` would run
+#   ./topology-probe.sh p2p out/        # write output files into out/
+
+set -euo pipefail
+
+BINARY="${BINARY:-./TransferBench}"
+if [[ ! -x "$BINARY" && -x ./TransferBenchCuda ]]; then BINARY=./TransferBenchCuda; fi  # fall back to the CUDA build
+[[ -x "$BINARY" ]] || { echo "ERROR: $BINARY not found or not executable" >&2; exit 1; }
+
+PRESET="${1:-}"
+OUTDIR="${2:-.}"
+mkdir -p "$OUTDIR"
+
+echo "=== Binary banner (compiled features + detected hardware) ==="
+"$BINARY" > "$OUTDIR/banner.txt" 2>&1 || true; head -60 "$OUTDIR/banner.txt"  # file first: head closing a tee pipe would trip pipefail
+echo
+
+echo "=== Compiled-in presets ==="
+"$BINARY" presets 2>&1 | tee "$OUTDIR/presets.txt"
+echo
+
+echo "=== Compiled-in environment variables ==="
+"$BINARY" envvars > "$OUTDIR/envvars.txt" 2>&1 || true; head -40 "$OUTDIR/envvars.txt"
+echo "  (full list in $OUTDIR/envvars.txt)"
+echo
+
+echo "=== Config-file syntax help ==="
+"$BINARY" help > "$OUTDIR/help.txt" 2>&1 || true; head -40 "$OUTDIR/help.txt"
+echo "  (full help in $OUTDIR/help.txt)"
+echo
+
+if [[ -n "$PRESET" ]]; then
+  DUMP="$OUTDIR/${PRESET}_dump.cfg"
+  echo "=== Dumping what '$PRESET' actually runs to $DUMP ==="
+  TB_DUMP_CFG_FILE="$DUMP" TB_DUMP_LINES=100 "$BINARY" "$PRESET" >/dev/null 2>&1 || true
+  if [[ -f "$DUMP" ]]; then
+    echo "  First 30 lines:"
+    head -30 "$DUMP" | sed 's/^/    /'
+  else
+    echo "  (TB_DUMP_CFG_FILE produced no output — preset may not support dump)"
+  fi
+fi
+
+echo
+echo "=== Quick parser sanity check ==="
+"$BINARY" dryrun "1 4 (G0->G0->G1)" 2>&1 | head -10 || true  # tolerate SIGPIPE from head under pipefail
+echo
+
+echo "Done. Files written to $OUTDIR/"
diff --git a/.claude/skills/transferbench-debug/references/common-failures.md b/.claude/skills/transferbench-debug/references/common-failures.md
new file mode 100644
index 00000000..e3af917d
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/common-failures.md
@@ -0,0 +1,136 @@
+# TransferBench common failures — symptoms, causes, fixes
+
+Catalog of the most common "it doesn't work" cases, organized by symptom. Each section has: **symptom signal**, **likely cause(s)**, and **concrete fix or next probe**.
+
+---
+
+## §pod — pod-aware preset hangs / errors
+
+### Symptom
+- `podp2p` / `poda2a` / `rings` errors immediately with a pod-detection message.
+- Or hangs in startup before any transfer table prints.
+
+### Cause
+- AMD-SMI (HIP build) or NVML (CUDA build) is unavailable, blocked, or returns inconsistent pod membership across ranks.
+- The build has `POD_COMM_ENABLED` but the host lacks fabricmanager / `nvidia-fabricmanager.service`.
+
+### Fix
+1. `TB_FORCE_SINGLE_POD=1` — fastest workaround, treats every rank as one pod.
+2. Confirm fabricmanager is running (CUDA): `systemctl status nvidia-fabricmanager`.
+3. Confirm NVML works: a one-liner like `nvidia-smi -q | head` should succeed on every rank.
+4. If you actually want per-pod awareness, ensure all ranks see the same pod IDs by running a probe script (each rank prints its detected pod ID).
+
+---
+
+## §rdma — NIC / RDMA preset hangs or errors
+
+### Symptom
+- Hang during NIC bring-up (no transfer table) on `nicp2p`, `nica2a`, `nicrings`, `a2a_n`.
+- Or "QP create failed" / "RDMA connect failed" messages.
+
+### Cause
+- Wrong `IB_GID_INDEX` — depends on the host's IB / RoCE configuration.
+- `IB_PORT_NUMBER` doesn't match the active port on the chosen NIC.
+- More NICs detected than usable; some are unconfigured.
+- RoCE version mismatch.
+
+### Fix
+1. Find a working GID: `show_gids` or `ibv_devinfo -v` on each host.
+2. Set `IB_GID_INDEX=<index>`, `IB_PORT_NUMBER=<port>` to known good values.
+3. Restrict to a single NIC with `TB_NIC_FILTER=<nic_name>` to localize the bad one.
+4. RoCE: try `ROCE_VERSION=2` (most common) or `ROCE_VERSION=1`.
+5. Confirm both ends agree on `IP_ADDRESS_FAMILY` (4 vs 6).
+6. If using OpenMPI: `--mca pml ucx --mca btl ^vader,openib` is the canonical setting in this repo.
+
+---
+
+## §validation — `ALWAYS_VALIDATE=1` reports mismatch
+
+### Symptom
+- Run completes but reports "Validation failed" / mismatch between expected and actual destination contents.
+
+### Cause (in order of likelihood)
+1. Wrong memory-location code (e.g. fine-grain `F` requested on a GPU that doesn't support it → memory backed by global instead, kernel writes go to a different place than reads).
+2. Wrong CU mask — kernel uses a CU group that doesn't have the right cache visibility.
+3. Multi-Transfer test where two Transfers race on the same destination address.
+4. Actual hardware issue (least likely).
+
+### Fix
+1. `VALIDATE_DIRECT=1` — read destination directly without copy-back; isolates copy-back-path bugs.
+2. `VALIDATE_SOURCE=1` — confirm source data was not overwritten by the kernel; catches `src == dst` issues.
+3. `NUM_ITERATIONS=1 NUM_WARMUPS=0 ALWAYS_VALIDATE=1` — confirms it's not a state-leak between iterations.
+4. Drop to `cmdline` with **one** Transfer to rule out multi-Transfer races.
+5. If still failing: `FILL_PATTERN=0xDEADBEEF` (or any custom pattern) — makes the corruption signature easy to spot in the diff.
+
+---
+
+## §perf — bandwidth far below expected, or wildly variable
+
+### Symptom
+- Reported BW is a fraction (e.g. 1/4, 1/8) of the link's theoretical max.
+- Or BW jumps 2× between iterations without obvious reason.
+
+### Cause
+- Stream-per-Transfer hits HW-queue limit and serializes (`USE_SINGLE_STREAM=0` + low `GPU_MAX_HW_QUEUES`).
+- GFX kernel parameters mis-tuned for the size (`GFX_UNROLL`, `GFX_BLOCK_SIZE`, `GFX_WORD_SIZE`).
+- Not enough warmup — first few iterations include allocation, paging, clock ramp.
+- Wrong executor for the workload: GFX kernel for tiny payload (use DMA), DMA for one-to-many (use Batched-DMA).
+- NUMA / pinned-memory mismatch (e.g. CPU-side memory on the wrong NUMA for the chosen GPU).
+
+### Fix
+1. `NUM_WARMUPS=10 NUM_ITERATIONS=20 SHOW_ITERATIONS=1` — see whether iter 0–2 are slow and the rest converge.
+2. `SHOW_PERCENTILES=50,75,90,99` — exposes outlier iterations.
+3. Try the alternate executor on the same memory pair: GFX (`G`) ↔ DMA (`D`) ↔ Batched-DMA (`B`).
+4. `USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8` for many parallel Transfers.
+5. Sweep with a preset (`gfxsweep`, `a2asweep`) to find the right kernel options before hand-tuning.
+
+---
+
+## §crash — crash / segfault
+
+### Symptom
+- Process exits with SIGSEGV / "memory access fault" / "invalid memory access."
+
+### Cause
+- Memory code unsupported by HW (e.g. `F` on a GPU without fine-grain memory; `U` with no uncached path).
+- DMA Transfer with multiple SRCs (DMA requires exactly one SRC).
+- NIC executor with mismatched index syntax (`I0` instead of `I0.0`).
+- Buffer alignment: byte count not a multiple of 4 (parser usually catches this, but custom builds may slip).
+
+### Fix
+1. `dryrun "<expression>"` first — most parser-level bugs surface here.
+2. Read the banner from `./TransferBench` with no args — confirms which memory types and executors are compiled in for this build.
+3. For DMA crashes: confirm exactly one SRC per Transfer.
+4. Build with `-g -O0` and run under `cuda-gdb` (NVIDIA) or `rocgdb` (AMD); the stack at the fault tells you which Executor's path failed.
+
+---
+
+## §unsupported — "executor missing" / "feature not compiled in"
+
+### Symptom
+- "Unsupported executor" or similar, even though the code seems to allow it.
+
+### Cause
+- This build was compiled with one of the `DISABLE_*` Makefile flags (`DISABLE_NIC_EXEC=1`, `DISABLE_POD_COMM=1`, `DISABLE_AMD_SMI=1`, etc.).
+- Or `MPI_PATH` was not set, so multi-rank paths were stubbed out.
+
+### Fix
+1. Run `./TransferBench` with no arguments. Its banner lists which executors and features are compiled in for this exact binary.
+2. If the feature is genuinely missing, rebuild without the corresponding `DISABLE_*` flag. (See the build-side skill — out of scope here.)
+
+---
+
+## §parser — config-file parser rejects a line
+
+### Symptom
+- "Failed to parse" / "Invalid config line" / silently runs the wrong thing.
+
+### Cause
+- Confused basic vs. advanced syntax (`numTransfers` positive vs. negative).
+- Whitespace inside an executor or memory token.
+- Quoting issues on the shell side when using `cmdline` (e.g. `G*` getting glob-expanded).
+
+### Fix
+1. **Always quote** `cmdline` arguments: `./TransferBench cmdline "1 4 (G0->G0->G1)"`.
+2. `dryrun` first to see the parse result without execution.
+3. For complex configs, use `##` echo lines liberally — they show in the output and help correlate result rows to test definitions.
diff --git a/.claude/skills/transferbench-debug/references/multi-rank-debug.md b/.claude/skills/transferbench-debug/references/multi-rank-debug.md
new file mode 100644
index 00000000..d5793ef0
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/multi-rank-debug.md
@@ -0,0 +1,148 @@
+# Multi-rank debugging (MPI / socket / RDMA)
+
+Multi-rank TransferBench is the most failure-prone configuration. This guide is organized by the **layer where the failure happens**: launcher → bootstrap → NIC bring-up → transfer execution.
+
+## Pre-flight: things to check before suspecting TransferBench
+
+These rule out 90% of multi-rank "bugs":
+
+```bash
+# 1. Same binary on every node
+for h in node0 node1; do ssh "$h" md5sum /path/to/TransferBenchCuda; done
+
+# 2. NICs functional on every node
+for h in node0 node1; do ssh "$h" "ibstatus | head -20"; done
+
+# 3. Hosts can reach each other on the master port (socket mode)
+ssh node1 "nc -zv node0 <port>"
+
+# 4. Same env vars actually propagating to all ranks
+mpirun --tag-output -np 2 -host node0,node1 -x SOME_VAR env | grep SOME_VAR
+```
+
+## Launcher layer: hangs/errors before any TransferBench output
+
+### Hostfile / `-host` problems
+
+- **Hostnames not resolving:** `mpirun` will hang silently. Try IP addresses instead.
+- **More slots requested than declared:** `-host node0:1,node1:1 -np 4` asks for four ranks but declares only two slots; use one slot per node and match `-np` to the node count.
+- **SSH not passwordless:** `mpirun` quietly fails when an `ssh` it spawns prompts for a password. Test with `ssh node1 hostname` first.
+
+### MPI transport problems
+
+The canonical incantation in this repo is:
+
+```bash
+mpirun --mca pml ucx --mca btl '^vader,openib' ...
+```
+
+If you see "no transport available" / "BTL openib unavailable":
+- Confirm UCX is built into your OpenMPI (`ompi_info | grep ucx`).
+- The `^vader,openib` is a **negative** filter — it excludes those BTLs to force PML/UCX. Don't drop it without good reason.
+
+### Env-var propagation
+
+OpenMPI does **not** forward your shell env to remote ranks unless you ask it to:
+
+```bash
+mpirun -x VAR1 -x VAR2=value -np ... ./TransferBenchCuda ...
+```
+
+- `-x VAR` forwards the **current** shell value of `VAR`.
+- `-x VAR=value` sets it explicitly.
+- Without `-x`, remote ranks see only the env they inherit from the SSH session, which usually does NOT include your interactive shell exports.
+
+This is the most common source of "works on rank 0, fails on rank ≥1" bugs. The `examples/multi-node.sh` template in `transferbench-run` builds the `-x` flags from a list; use it as a model.
+
+## Bootstrap layer: process started, but no transfer output
+
+### MPI bootstrap
+
+If `mpirun` reports all ranks started but TransferBench prints nothing, the most common cause is the rank-0 → rank-N handshake taking longer than expected because of a stuck NIC or NUMA probe.
+
+- Run `mpirun --tag-output ...` so each rank's output lines are prefixed with `[jobid,rank]`.
+- If rank 0 reaches the banner and others don't, those ranks are stuck in their initialization (often AMD-SMI or NVML).
+
+### Socket-mode bootstrap
+
+```bash
+# On rank 0, no TB_RANK set → rank 0 prints master address
+TB_NUM_RANKS=4 ./TransferBenchCuda <preset>
+
+# On rank N (N>0), set TB_RANK and TB_MASTER_ADDR
+TB_NUM_RANKS=4 TB_RANK=1 TB_MASTER_ADDR=<addr> ./TransferBenchCuda <preset>
+```
+
+Common socket-mode issues:
+- **`TB_NUM_RANKS` differs across ranks** → silent hang.
+- **`TB_MASTER_ADDR` not reachable** (firewall, wrong interface). Test with `nc -zv` first.
+- **One rank exited early** → the others wait forever.
+
+## NIC bring-up layer: `nicp2p` / `nica2a` / pod presets hang here
+
+### GID index
+
+The single most common cause of NIC hangs. Find a working GID:
+
+```bash
+for d in $(ibv_devices | tail -n +3 | awk '{print $1}'); do
+  echo "=== $d ==="
+  show_gids "$d" 2>/dev/null || ibv_devinfo -d "$d" -v 2>/dev/null | grep -E 'GID|state'
+done
+```
+
+Pick a GID with a populated address (not `0000:0000:...`) and an active port:
+
+```bash
+IB_GID_INDEX=<index> IB_PORT_NUMBER=<port> ./TransferBenchCuda nicp2p
+```
+
+### NIC filtering
+
+If you have eight NICs but only four are configured, list only the working ones:
+
+```bash
+TB_NIC_FILTER=mlx5_0,mlx5_1,mlx5_2,mlx5_3 ./TransferBenchCuda nicp2p
+```
+
+This restricts the world-view of NICs to a subset; useful for narrowing down a bad NIC.
+
+### RoCE version mismatch
+
+Symptom: connection establishment hangs forever.
+Fix: `ROCE_VERSION=2` is the modern default; some legacy clusters require `ROCE_VERSION=1`. Both ends must agree.
+
+### POD / MNNVL detection
+
+For `podp2p` / `poda2a`:
+
+- AMD-SMI (HIP) or NVML (CUDA) must be functional, AND fabricmanager must be running on NVIDIA.
+- Quick workaround if pod detection is broken: `TB_FORCE_SINGLE_POD=1`.
+
+## Output layer: results are garbled or look wrong
+
+### Interleaved output
+
+Without explicit tagging, MPI lets all ranks write to the same stdout, which interleaves bytes:
+
+```bash
+mpirun --tag-output ...                   # each line prefixed with [jobid,rank]
+mpirun --output-filename out -np ...      # per-rank log files
+mpirun --merge-stderr-to-stdout ...       # one stream
+```
+
+### Apparent zeros / NaNs in the bandwidth table
+
+- Rank-N reported a hardware error and failed silently → check rank-N's stderr (use `--output-filename`).
+- Or the `Test` was malformed and one rank parsed it differently → use `dryrun` and `TB_DUMP_CFG_FILE` to confirm both ranks are running the same Transfers.
+
+## When the hang is genuinely TransferBench's fault
+
+If the layers above are clean and TransferBench still hangs:
+
+1. Build with `-g -O0` (or `-g -O1`).
+2. Run a tiny config (one Transfer, one rank pair) under `cuda-gdb` / `rocgdb`.
+3. When it hangs: `Ctrl-C`, then `bt` (or `thread apply all bt` to see every thread). The stuck thread's stack will name the function holding the lock or the queue it's polling.
+4. Look for: NIC completion-queue polling without a timeout, a stream-event wait, or an MPI collective that's missing a peer.
+
+If you're hitting this often, the build-side skill (separate skill) covers the recompile workflow.
diff --git a/.claude/skills/transferbench-debug/references/verbose-introspection.md b/.claude/skills/transferbench-debug/references/verbose-introspection.md
new file mode 100644
index 00000000..b9079905
--- /dev/null
+++ b/.claude/skills/transferbench-debug/references/verbose-introspection.md
@@ -0,0 +1,112 @@
+# Debug-flavored env vars and built-in introspection
+
+This is the curated debug-side counterpart to the run-side `env-vars.md`. Variables here are the ones you reach for **when something is wrong**, not when you're tuning. For the authoritative complete list compiled into your binary, run:
+
+```bash
+./TransferBench envvars
+```
+
+## Built-in info commands (no env vars needed)
+
+```bash
+./TransferBench                  # banner: detected GPUs, NUMA, NICs, compiled features
+./TransferBench help             # config-file syntax + examples
+./TransferBench presets          # presets compiled into THIS build
+./TransferBench envvars          # full env-var list with descriptions
+./TransferBench dryrun "..."     # parse + expand wildcards, no execution
+```
+
+The first one is especially valuable when debugging — its banner names every executor and feature compiled in (`NIC_EXEC_ENABLED`, `POD_COMM_ENABLED`, `NVML_ENABLED`, etc.). If a feature you expect isn't there, you've found the problem before you even ran a test.
+
+## Verbose & lifecycle logging
+
+| Variable | What it does | When to use |
+|---|---|---|
+| `HIDE_ENV=0` (default) | Print env-var summary at startup | Always leave at default when debugging — you want to see what TransferBench thinks the env says |
+| `TB_VERBOSE=1` | Verbose lifecycle logging in newer execution paths (anvil/SDMA) | When recent-feature execution paths hang or behave oddly |
+| `SHOW_ITERATIONS=1` | Print every iteration's time | First step when "BW seems too low" or "BW is unstable" |
+| `SHOW_PERCENTILES=50,75,90,99` | Add percentile columns | When the mean is fine but tail latency is the suspected issue |
+| `SHOW_BORDERS=0` | Strip table borders | Easier to diff two runs' raw output |
+
+## Validation / correctness
+
+| Variable | Effect | Notes |
+|---|---|---|
+| `ALWAYS_VALIDATE=1` | Validate dst after every iteration | Catches data-corruption regressions; significantly slows runs |
+| `VALIDATE_DIRECT=1` | Validate by reading dst directly | Skips copy-back; isolates "is the validation path itself buggy?" |
+| `VALIDATE_SOURCE=1` | Confirm src wasn't overwritten | Catches kernels that aliased into src |
+| `FILL_PATTERN=0xDEADBEEF` | Custom hex source fill pattern | Makes corruption signatures recognizable |
+| `FILL_COMPRESS=1` | Use compressible source data | Useful when debugging compression-aware paths |
+| `BYTE_OFFSET=N` | Offset (bytes) into allocated buffers | Useful when alignment is suspected |
+| `BLOCK_BYTES=256` | Block granularity for transfers | Try larger / smaller when validation fails on edge sizes |
+
+## Iteration / timing isolation
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `NUM_ITERATIONS=1` | Run exactly one iteration | "Validation fails on iter N>0" → reduce to 1 to confirm cold case |
+| `NUM_WARMUPS=0` | No warmups | Makes iter 0 the cold (un-warmed) case; combine with `NUM_ITERATIONS=1` to run only that iteration |
+| `NUM_ITERATIONS=-30` | Timed mode, 30 seconds | When you want to study perf over time, not iterations |
+| `NUM_SUBITERATIONS=N` | N sub-iterations per outer iteration | Reduce if sub-iter is the granularity at which a bug appears |
+
+## Capture / reproducibility
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_DUMP_CFG_FILE=out.cfg` | Dump executed Transfers to a config file | "What is this preset *actually* running?" — crucial when a preset behaves unexpectedly |
+| `TB_DUMP_LINES=N` | Limit dumped lines | Quick peek at the start of a large preset |
+| `OUTPUT_TO_CSV=1` | CSV output | Easier to diff two runs programmatically |
+
+## Interactive / breakpoint-friendly
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `USE_INTERACTIVE=1` | Pause for stdin between tests | Attach `gdb` / `cuda-gdb` / `rocgdb` mid-run, set breakpoints, hit Enter to continue |
+
+## NIC / RDMA debug
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `IB_GID_INDEX=N` | Force a specific GID index | Almost always the first thing to set when NIC presets hang |
+| `IB_PORT_NUMBER=N` | Force a specific port | When the active port isn't 1 |
+| `ROCE_VERSION=1\|2` | RoCE version | Both ends must agree |
+| `IP_ADDRESS_FAMILY=4\|6` | IPv4 or IPv6 | When dual-stack hosts pick the wrong one |
+| `TB_NIC_FILTER=name1,name2` | Restrict to listed NICs | Localize which NIC is misbehaving |
+| `NIC_CHUNK_BYTES=N` | NIC transfer chunk size | Reduce to confirm a size-dependent bug |
+| `NIC_CQ_POLL_BATCH=N` | CQ poll batch size | Reduce to 1 to expose race conditions |
+| `NIC_RELAX_ORDER=0\|1` | Relaxed ordering on the NIC | Disable when ordering bugs suspected |
+
+## Pod / multi-rank fallbacks
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_FORCE_SINGLE_POD=1` | Treat all ranks as one pod | Workaround when AMD-SMI / NVML pod detection is broken |
+| `TB_RANK`, `TB_NUM_RANKS`, `TB_MASTER_ADDR` | Socket-mode rank coordination | Alternative bootstrap when MPI isn't available or is the suspect |
+
+## Wallclock / timing edge cases
+
+| Variable | Effect | Use case |
+|---|---|---|
+| `TB_WALLCLOCK_RATE=<hz>` | Override GPU wallclock rate | Some firmware reports 0 for the rate; this lets you bypass that |
+| `USE_HIP_EVENTS=0` | Use host clock instead of HIP/CUDA events | When HIP-event timing is suspected (rare) |
+
+## Combined recipes
+
+A few canned env-var combinations that come up often:
+
+```bash
+# "What does this preset actually run?"
+TB_DUMP_CFG_FILE=p2p_dump.cfg ./TransferBench p2p
+
+# "Is the slowness in iter 0 only, or every iter?"
+NUM_WARMUPS=0 NUM_ITERATIONS=20 SHOW_ITERATIONS=1 ./TransferBench cmdline "1 4 (G0->G0->G1)" 256M
+
+# "Validate every iteration, with a custom pattern"
+ALWAYS_VALIDATE=1 VALIDATE_SOURCE=1 FILL_PATTERN=0xDEADBEEF ./TransferBench p2p
+
+# "Quietest possible run for diff'ing two builds"
+HIDE_ENV=1 SHOW_BORDERS=0 OUTPUT_TO_CSV=1 NUM_WARMUPS=10 NUM_ITERATIONS=20 ./TransferBench p2p > run_A.csv
+
+# "Pause between tests so I can gdb attach"
+USE_INTERACTIVE=1 ./TransferBench p2p
+```
diff --git a/.claude/skills/transferbench-run/SKILL.md b/.claude/skills/transferbench-run/SKILL.md
new file mode 100644
index 00000000..050eac54
--- /dev/null
+++ b/.claude/skills/transferbench-run/SKILL.md
@@ -0,0 +1,150 @@
+---
+name: transferbench-run
+description: Use when the user wants to *run* TransferBench (the ROCm/CUDA memory-transfer benchmarking tool from AMD) — benchmarking, profiling, or measuring GPU/CPU/NIC bandwidth and latency. Covers writing config files, picking the right preset (a2a, p2p, sweep, nicp2p, podp2p, etc.), tuning environment variables, and launching single-node or multi-node (MPI / socket) runs. Does NOT cover building the binary from source, modifying its source code, or extending it with new presets/executors — for those, defer to a separate skill or the codebase itself.
+---
+
+# TransferBench
+
+TransferBench is a command-line utility for benchmarking simultaneous data transfers between CPU, GPU, and NIC memory locations using GPU kernels, DMA engines, RDMA NICs, or CPU threads. It runs on AMD (ROCm/HIP) and NVIDIA (CUDA) platforms.
+
+The binary is named `TransferBench` (HIP build) or `TransferBenchCuda` (CUDA build). A prebuilt `TransferBenchCuda` may exist at the repo root; otherwise build with `make` (HIP) or `make TransferBenchCuda` (CUDA).
+
+## Mental model
+
+A **Transfer** is one operation: an **Executor** reads values from one or more **SRC** memory locations, sums them, and writes the result to one or more **DST** memory locations. With one SRC and one DST it's a plain copy.
+
+```
+SRC 0
+SRC 1 -> Executor -> DST 0
+SRC X                DST Y
+```
+
+A **Test** is one line in a config file — a set of Transfers run in parallel.
+
+A **SubExecutor (SE)** is the unit of parallelism inside an executor:
+- CPU executor → CPU thread
+- GPU executor → threadblock / Compute Unit (CU)
+- DMA / Batched-DMA → stream / batch item (must have a single SRC)
+- NIC → Queue Pair
+
+## Invocation
+
+```bash
+./TransferBench <config> [N]
+```
+
+- `<config>` is one of:
+  - A path to a config file
+  - A preset name (`a2a`, `p2p`, `sweep`, `nicp2p`, `podp2p`, ...)
+  - `cmdline "<transfer expression>"` — run one ad-hoc transfer
+  - `dryrun "<transfer expression>"` — parse and print without executing
+  - `help`, `presets`, `envvars` — built-in info screens
+- `N` (optional) is the number of bytes per Transfer; when omitted, the tool's built-in default size is used. `0` sweeps over a range of sizes. Accepts a `K`, `M`, or `G` suffix and must be a multiple of 4.
+
+Run `./TransferBench` with no args to see usage and detected topology (GPUs, NUMA nodes, NICs).
+
+## Quick recipes
+
+Decide which path the user needs and pick from below. **Always prefer a preset** if it matches the user's intent — presets handle topology discovery and produce well-formatted output.
+
+### "How fast is GPU↔GPU?" → use a preset
+```bash
+./TransferBench p2p           # peer-to-peer matrix between all GPUs
+./TransferBench a2a           # all-to-all simultaneous transfers
+./TransferBench scaling       # one GPU to all others, scaled CU counts
+./TransferBench sweep 64M     # sweep through transfer combinations
+```
+
+### "How fast is HBM / local memory?"
+```bash
+./TransferBench hbm
+```
+
+### "How fast is CPU↔GPU or pinned-memory transfer?" → custom config
+Write a small `.cfg` file (see `references/config-format.md`) and pass it as the first argument.
+
+### "Benchmark one specific transfer" → cmdline mode
+```bash
+./TransferBench cmdline "1 4 (G0->G0->G1)" 256M
+./TransferBench dryrun  "2 8 G0->G0->G1 G1->G1->G0"   # validate parsing first
+```
+
+### "RDMA / NIC across nodes" → NIC presets
+```bash
+./TransferBench nicp2p        # NIC peer-to-peer matrix
+./TransferBench nica2a        # NIC all-to-all
+./TransferBench nicrings      # NIC ring tests
+```
+
+### "Pod-aware (multi-rank, single MNNVL/XGMI pod)" → pod presets
+```bash
+./TransferBench podp2p        # within-pod P2P
+./TransferBench poda2a        # within-pod all-to-all
+```
+For pod presets, set `TB_FORCE_SINGLE_POD=1` if AMD-SMI / NVML pod detection is unavailable.
+
+See `references/presets.md` for the full list.
+
+## Multi-rank execution
+
+TransferBench runs multi-node either via MPI (if compiled with `MPI_PATH` set) or via plain TCP sockets.
+
+### MPI approach
+```bash
+mpirun -np 4 -host node0,node1,node2,node3 ./TransferBench a2a
+```
+
+### Socket approach (no MPI)
+On rank 0, set only `TB_NUM_RANKS=N`; it prints the master address, which every other rank then passes as `TB_MASTER_ADDR` along with its own `TB_RANK`.
+```bash
+# node0
+TB_NUM_RANKS=4 ./TransferBench a2a
+# node1, node2, node3 (use the address node0 prints)
+TB_NUM_RANKS=4 TB_RANK=1 TB_MASTER_ADDR=<addr> ./TransferBench a2a
+TB_NUM_RANKS=4 TB_RANK=2 TB_MASTER_ADDR=<addr> ./TransferBench a2a
+TB_NUM_RANKS=4 TB_RANK=3 TB_MASTER_ADDR=<addr> ./TransferBench a2a
+```
+
+Recommend **one process per node**. See `examples/multi-node.sh` for a full mpirun launcher script with environment-variable propagation (`-x VAR`).
+
+## Tuning behavior with environment variables
+
+Most tuning is environment-variable driven. The most useful ones:
+
+| Variable | Purpose |
+|---|---|
+| `NUM_ITERATIONS` | Iterations per test (negative = run for that many seconds in timed mode) |
+| `NUM_WARMUPS` | Warmup iterations (default 3) |
+| `USE_SINGLE_STREAM` | When 0, each Transfer gets its own stream (may serialize on HW queue limits) |
+| `GPU_MAX_HW_QUEUES` | Raise HW-queue cap when `USE_SINGLE_STREAM=0` |
+| `OUTPUT_TO_CSV` | Emit CSV output |
+| `SHOW_ITERATIONS` | Print per-iteration timings |
+| `SHOW_PERCENTILES` | e.g. `50,75,90,99` for tail-latency percentiles |
+| `ALWAYS_VALIDATE` | Validate destination data after each iteration |
+| `FILL_PATTERN` | Custom source fill pattern |
+| `TB_DUMP_CFG_FILE` | Dump executed Transfers (e.g. from a preset) to a config file |
+| `TB_FORCE_SINGLE_POD` | Force single-pod membership when AMD-SMI/NVML unavailable |
+| `TB_NIC_FILTER` | Restrict which NICs are used |
+
+Run `./TransferBench envvars` for the authoritative list. See `references/env-vars.md` for a grouped, annotated reference.
+
+## Reading results
+
+Default output prints one row per Test with bandwidth (GB/s) and time (ms). With `OUTPUT_TO_CSV=1`, results are CSV-formatted. With `SHOW_PERCENTILES=...`, percentile tail-latency columns are appended. Lines beginning with `##` in a config file are echoed back into the output for annotation.
+
+## When you're stuck
+
+1. Run `./TransferBench help` for config-file syntax with examples.
+2. Run `./TransferBench presets` for the live list of presets.
+3. Run `./TransferBench envvars` for all environment variables.
+4. Run `./TransferBench dryrun "..."` to validate a transfer expression without executing.
+5. If a preset fails on multi-node, first try `TB_FORCE_SINGLE_POD=1` and confirm rank count matches host count.
+
+## References
+
+- `references/config-format.md` — full grammar for config files (memory codes, executor codes, basic vs. advanced syntax)
+- `references/presets.md` — every preset, when to use each, and its key env-var knobs
+- `references/env-vars.md` — grouped environment-variable reference
+- `examples/basic-p2p.cfg` — minimal GPU↔GPU copy
+- `examples/advanced-mixed.cfg` — mixed CPU/GPU/DMA transfers with explicit byte counts
+- `examples/multi-node.sh` — MPI launcher template with logging and env-var propagation
diff --git a/.claude/skills/transferbench-run/examples/advanced-mixed.cfg b/.claude/skills/transferbench-run/examples/advanced-mixed.cfg
new file mode 100644
index 00000000..14068570
--- /dev/null
+++ b/.claude/skills/transferbench-run/examples/advanced-mixed.cfg
@@ -0,0 +1,18 @@
+# Advanced-mode: per-Transfer SE counts and explicit byte sizes.
+# Run with:   ./TransferBench examples/advanced-mixed.cfg
+# (No N argument needed because each Transfer specifies its own byte count.)
+
+## CPU NUMA-node 0 -> GPU 0 (1 GiB) parallel with GPU 0 -> GPU 1 (256 MiB)
+-2 (C0->G0->G0 8 1G) (G0->G0->G1 4 256M)
+
+## Three-way fan-out: GPU0 broadcasts to GPU0+GPU1+GPU2 (16 CUs, 128 MiB)
+-1 (G0->G0->G0G1G2 16 128M)
+
+## Fan-in / sum: read GPU0 and GPU1, write the sum to GPU2 (16 CUs, 64 MiB)
+-1 (G0G1->G2->G2 16 64M)
+
+## Pinned-host write benchmark: GPU 0 writes to coarse-grained pinned host on its closest NUMA
+-1 (N0->G0->P0 16 256M)
+
+## Mixed bandwidth test: DMA copy (1 SE) + GFX copy (8 CUs) in parallel, 512 MiB each
+-2 (G0->D0->G1 1 512M) (G2->G2->G3 8 512M)
diff --git a/.claude/skills/transferbench-run/examples/basic-p2p.cfg b/.claude/skills/transferbench-run/examples/basic-p2p.cfg
new file mode 100644
index 00000000..470f3703
--- /dev/null
+++ b/.claude/skills/transferbench-run/examples/basic-p2p.cfg
@@ -0,0 +1,21 @@
+# Minimal GPU<->GPU peer-to-peer config.
+# Run with:   ./TransferBench examples/basic-p2p.cfg 256M
+# Lines starting with ## are echoed into the result output.
+
+## GPU0 -> GPU1 with 4 CUs (kernel-driven)
+1 4 (G0->G0->G1)
+
+## GPU1 -> GPU0 with 4 CUs (kernel-driven)
+1 4 (G1->G1->G0)
+
+## Bidirectional simultaneous, 4 CUs each
+2 4 (G0->G0->G1) (G1->G1->G0)
+
+## DMA-engine copy GPU0 -> GPU1 (single SE for DMA)
+1 1 (G0->D0->G1)
+
+## "Memset" benchmark: write to GPU0 memory with no real source
+1 32 (N0->G0->G0)
+
+## "Read-only" benchmark: read GPU0 memory and discard
+1 32 (G0->G0->N0)
diff --git a/.claude/skills/transferbench-run/examples/multi-node.sh b/.claude/skills/transferbench-run/examples/multi-node.sh
new file mode 100755
index 00000000..ff43b42a
--- /dev/null
+++ b/.claude/skills/transferbench-run/examples/multi-node.sh
@@ -0,0 +1,59 @@
+#!/usr/bin/env bash
+# Template launcher for multi-node TransferBench runs via mpirun.
+# Adjust HOSTS, NP, and the Open MPI install path for your cluster.
+#
+# Usage:
+#   ./multi-node.sh <preset_or_config> [N]
+# Examples:
+#   ./multi-node.sh nicp2p
+#   ./multi-node.sh podp2p
+#   ./multi-node.sh my.cfg 256M
+
+set -euo pipefail
+
+BINARY="${BINARY:-./TransferBenchCuda}"   # or ./TransferBench for HIP
+PRESET="${1:?usage: $0 <preset_or_config> [N]}"
+SIZE="${2:-}"
+
+# --- MPI environment (edit for your cluster) ----------------------------------
+export PATH="${HOME}/rdma/ompi/install/bin:${PATH}"
+export LD_LIBRARY_PATH="${HOME}/rdma/ompi/install/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+export OPAL_PREFIX="${HOME}/rdma/ompi/install"
+
+HOSTS="${HOSTS:-node0:1,node1:1}"          # one slot per node is recommended
+NP="${NP:-2}"
+
+# --- TransferBench tuning (edit as needed) ------------------------------------
+# Forward each var with -x so MPI propagates it to every rank.
+TB_ENV=(
+  "NUM_ITERATIONS=20"
+  "NUM_WARMUPS=3"
+  "OUTPUT_TO_CSV=0"
+  # "TB_FORCE_SINGLE_POD=1"      # uncomment if AMD-SMI / NVML pod detection fails
+  # "USE_REMOTE_READ=1"
+  # "TB_DUMP_CFG_FILE=run_dump.cfg"
+)
+
+x_flags=()
+env_inline=()
+for kv in "${TB_ENV[@]}"; do
+  key="${kv%%=*}"
+  x_flags+=("-x" "$key")
+  env_inline+=("$kv")
+done
+
+# Export so mpirun can see them when forwarding with -x KEY
+for kv in "${TB_ENV[@]}"; do export "$kv"; done
+
+CMD=(mpirun
+  --mca pml ucx
+  --mca btl '^vader,openib'
+  --host "$HOSTS"
+  -np "$NP"
+  "${x_flags[@]}"
+  "$BINARY" "$PRESET"
+)
+[[ -n "$SIZE" ]] && CMD+=("$SIZE")
+
+echo "# Launching: ${env_inline[*]} ${CMD[*]}"
+"${CMD[@]}"
diff --git a/.claude/skills/transferbench-run/references/config-format.md b/.claude/skills/transferbench-run/references/config-format.md
new file mode 100644
index 00000000..f07c9131
--- /dev/null
+++ b/.claude/skills/transferbench-run/references/config-format.md
@@ -0,0 +1,110 @@
+# TransferBench config-file format
+
+A config file is plain text. Each non-comment line defines a **Test** — a set of **Transfers** that run in parallel.
+
+- Lines starting with `#` are ignored.
+- Lines starting with `##` are echoed verbatim into output (use them as labels for results).
+- Round brackets `()` and arrows `->` are decorative and ignored by the parser.
+
+## Two ways to specify a Test
+
+### Basic: same SE count for every Transfer
+
+```
+<numTransfers> <SEs> (srcMem1 -> Executor1 -> dstMem1) ... (srcMemN -> ExecutorN -> dstMemN)
+```
+
+- `numTransfers` — positive integer, count of parallel Transfers on this line
+- `SEs` — number of SubExecutors used by every Transfer on the line
+- Each triplet describes one Transfer
+
+Examples:
+```
+1 4  (G0->G0->G1)                  # 4 CUs on GPU0 copy GPU0 -> GPU1
+1 4  (C1->G2->G0)                  # 4 CUs on GPU2 copy CPU1 -> GPU0
+2 4  G0->G0->G1  G1->G1->G0        # bidirectional, 4 SEs each
+```
+
+### Advanced: per-Transfer SE count and byte count
+
+```
+-<numTransfers> (srcMem1 -> Exec1 -> dstMem1 SEs1 Bytes1) ... (srcMemN -> ExecN -> dstMemN SEsN BytesN)
+```
+
+- `numTransfers` is **negated** to switch into advanced mode.
+- `Bytes` is per-Transfer; `0` means "use the command-line `N`". May be suffixed with `K`, `M`, or `G`. Must be a multiple of 4.
+
+Example:
+```
+-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
+# Copies 1MiB GPU0->GPU1 with 4 CUs, in parallel with 2MiB GPU1->GPU0 with 8 CUs
+```
+
+## Executor codes
+
+`Executor` is one character + a 0-based device index (NICs use a two-part index).
+
+| Code | Executor | Index range | Notes |
+|---|---|---|---|
+| `C` | CPU | NUMA node | SubExecutor = CPU thread |
+| `G` | GPU kernel | GPU device | SubExecutor = threadblock / CU |
+| `D` | DMA | GPU device | Single SRC, ≥1 DST |
+| `B` | Batched-DMA | GPU device | `hipMemcpyBatchAsync`-based; HIP 7.1 / CUDA 12.8+ |
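+| `I#.#` | NIC executor | Source NIC index `.` destination NIC index | e.g. `I0.2` = NIC0 to NIC2; SubExecutor = Queue Pair |
+| `N#.#` | Nearest-NIC executor | Source GPU index `.` destination GPU index | Uses each GPU's closest NIC; SubExecutor = Queue Pair |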
+
+## Memory-location codes
+
+A memory location is `<code><index>`. Multiple locations can be concatenated for multi-SRC / multi-DST (e.g. `G0G1` is "both GPU0 and GPU1 memory").
+
+| Code | Memory type | Indexed by |
+|---|---|---|
+| `C` | Pinned host (coarse-grained) | NUMA node |
+| `P` | Pinned host (closest-GPU NUMA) | GPU index |
+| `B` | Coherent pinned host | NUMA node |
+| `D` | Non-coherent pinned host | NUMA node |
+| `K` | Uncached pinned host | NUMA node |
+| `H` | Unpinned host | NUMA node |
+| `G` | Global device memory | GPU |
+| `F` | Fine-grain device memory | GPU |
+| `U` | Uncached device memory | GPU |
+| `N` | Null (no read or no write) | ignored |
+
+`N` on the SRC side gives a "memset-like" write benchmark; `N` on the DST side gives a "read-only" benchmark.
+
+## Idiomatic patterns
+
+```
+## Memset by GPU0 onto its own memory
+1 32 (N0->G0->G0)
+
+## Read-only: 4 CPU threads on NUMA node 0 read pinned host memory on that node
+1 4 (C0->C0->N0)
+
+## Broadcast from GPU0 to GPU0 and GPU1 simultaneously
+1 16 (G0->G0->G0G1)
+
+## Fan-in / sum: read from GPU0 and GPU1, write the sum to GPU2
+1 16 (G0G1->G2->G2)
+
+## NIC RDMA between two GPUs across NIC0 and NIC2 with 2 QPs
+1 2 (F0->I0.2->F1)
+
+## Nearest-NIC RDMA: each side picks its closest NIC
+1 1 (F0->N0.1->F1)
+```
+
+## Validating a config without running it
+
+```
+./TransferBench dryrun "1 4 (G0->G0->G1)"
+./TransferBench dryrun my.cfg
+```
+`dryrun` parses, expands wildcards, and prints what *would* execute — useful when iterating on complex configs.
+
+## Capturing what a preset actually executes
+
+```
+TB_DUMP_CFG_FILE=p2p_dump.cfg ./TransferBench p2p
+```
+Writes the resolved Transfers from the preset to a config file you can edit and rerun.
diff --git a/.claude/skills/transferbench-run/references/env-vars.md b/.claude/skills/transferbench-run/references/env-vars.md
new file mode 100644
index 00000000..5013a1cd
--- /dev/null
+++ b/.claude/skills/transferbench-run/references/env-vars.md
@@ -0,0 +1,114 @@
+# TransferBench environment variables
+
+This is a curated guide to the most-used variables. For the authoritative complete list as compiled into your binary, run:
+
+```bash
+./TransferBench envvars
+```
+
+## Iteration / timing
+
+| Variable | Default | Effect |
+|---|---|---|
+| `NUM_ITERATIONS` | `10` | Iterations per test. **Negative** = timed mode (run for that many seconds). |
+| `NUM_SUBITERATIONS` | `1` | Sub-iterations per outer iteration. |
+| `NUM_WARMUPS` | `3` | Warmup iterations before timing. |
+| `USE_HIP_EVENTS` | `1` | Use HIP/CUDA events for timing (vs. host clock). |
+| `SAMPLING_FACTOR` | `1` | Subsampling factor for sweep presets. |
+
+## Output / reporting
+
+| Variable | Default | Effect |
+|---|---|---|
+| `OUTPUT_TO_CSV` | `0` | Emit CSV output instead of human-readable tables. |
+| `SHOW_BORDERS` | `1` | Draw table borders. |
+| `SHOW_ITERATIONS` | `0` | Print per-iteration timings. |
+| `SHOW_PERCENTILES` | unset | Comma list, e.g. `50,75,90,99`, to add percentile columns. |
+| `HIDE_ENV` | `0` | Suppress the env-var summary printed at startup. |
+| `OUTPUT_FORMAT` | preset-specific | `0` = list, `1` = full matrix (used by `podp2p`). |
+
+## Validation / data
+
+| Variable | Default | Effect |
+|---|---|---|
+| `ALWAYS_VALIDATE` | `0` | Validate destination data after every iteration (slow but safe). |
+| `VALIDATE_DIRECT` | `0` | Validation reads memory directly without copy-back. |
+| `VALIDATE_SOURCE` | `0` | Validate that source data is unchanged. |
+| `FILL_PATTERN` | unset | Custom hex pattern for source initialization. |
+| `FILL_COMPRESS` | unset | Use compressible source data. |
+| `BYTE_OFFSET` | `0` | Offset (bytes) into allocated buffers. |
+| `BLOCK_BYTES` | `256` | Block granularity for transfers. |
+
+## GPU / GFX kernel knobs
+
+| Variable | Default | Effect |
+|---|---|---|
+| `USE_SINGLE_STREAM` | `1` | When `0`, each Transfer gets its own stream (may serialize on HW-queue cap). |
+| `GPU_MAX_HW_QUEUES` | `4` | Hardware-queue cap when `USE_SINGLE_STREAM=0`. Raise for more parallelism. |
+| `GFX_KERNEL` | `0` | Choose copy kernel variant. |
+| `GFX_BLOCK_ORDER` | `0` | Threadblock dispatch order. |
+| `GFX_BLOCK_SIZE` | `256` | Threads per block. |
+| `GFX_SE_TYPE` | `0` | SubExecutor mapping strategy. |
+| `GFX_SINGLE_TEAM` | `0` | Combine work into a single team. |
+| `GFX_TEMPORAL` | `0` | Temporal hints for cache. |
+| `GFX_UNROLL` | preset-specific | Loop-unroll factor in the kernel. |
+| `GFX_WAVE_ORDER` | `0` | Wavefront iteration order. |
+| `GFX_WORD_SIZE` | `4` | Per-thread element size in bytes. |
+| `CU_MASK` | unset | Bitmask restricting which CUs are used. |
+| `XCC_PREF_TABLE` | unset | XCC preference table for MI300-class GPUs. |
+| `USE_HSA_DMA` | `0` | Use HSA DMA path on AMD platforms. |
+
+## Variable SubExecutor sweeps
+
+| Variable | Default | Effect |
+|---|---|---|
+| `MIN_VAR_SUBEXEC` | `1` | Min SE count when sweeping. |
+| `MAX_VAR_SUBEXEC` | `0` | Max SE count when sweeping (`0` = unlimited). |
+
+## NIC / RDMA
+
+| Variable | Default | Effect |
+|---|---|---|
+| `IB_GID_INDEX` | `-1` | InfiniBand GID index (`-1` = auto). |
+| `IB_PORT_NUMBER` | `1` | IB port number. |
+| `ROCE_VERSION` | `2` | RoCE version (1 or 2). |
+| `IP_ADDRESS_FAMILY` | `4` | `4` = IPv4, `6` = IPv6. |
+| `NIC_CHUNK_BYTES` | `1073741824` | Chunk size (bytes) for NIC transfers. |
+| `NIC_CQ_POLL_BATCH` | `4` | Completion-queue poll batch size. |
+| `NIC_RELAX_ORDER` | `1` | Relaxed ordering on the NIC. |
+| `TB_NIC_FILTER` | unset | Restrict which NICs participate. |
+
+## Multi-rank / pod
+
+| Variable | Default | Effect |
+|---|---|---|
+| `TB_RANK` | unset | Rank ID (0-based) for socket-mode. |
+| `TB_NUM_RANKS` | unset | Total ranks for socket-mode. |
+| `TB_MASTER_ADDR` | unset | Master address printed by rank 0. |
+| `TB_FORCE_SINGLE_POD` | `0` | Force single-pod membership when AMD-SMI/NVML unavailable. |
+
+## Debug / capture
+
+| Variable | Default | Effect |
+|---|---|---|
+| `TB_DUMP_CFG_FILE` | unset | Dump executed Transfers (e.g. from a preset) to this config file. |
+| `TB_DUMP_LINES` | unset | Limit number of dumped lines. |
+| `TB_WALLCLOCK_RATE` | unset | Override wallclock rate when GPU returns 0 (debug). |
+| `USE_INTERACTIVE` | `0` | Pause for input between tests. |
+
+## Pod-preset specific
+
+Used by `podp2p` and `poda2a`:
+
+| Variable | Used by | Values |
+|---|---|---|
+| `P2P_MODE` | `podp2p` | `0` both, `1` unidirectional only, `2` bidirectional only |
+| `A2A_MODE` | `poda2a` | `0` copy, `1` read-only, `2` write-only, `2:3` custom |
+| `A2A_LOCAL` | `poda2a` | `0` exclude same-rank, `1` include |
+| `PARALLEL_LVL` | `podp2p` | `0` serial node pairs, `1` parallel |
+| `STRIDE` | `poda2a` | Interleave stride |
+| `GROUP_SIZE` | `poda2a` | GPUs per group (must divide rank count) |
+| `USE_GPU_DMA` | `podp2p` | `0` GFX exec, `1` DMA exec |
+| `USE_DMA_EXEC` | `poda2a` | `0` GFX exec, `1` DMA exec (DMA only allowed for `A2A_MODE=0`) |
+| `USE_REMOTE_READ` | both | `0` write to remote, `1` read from remote |
+| `NUM_GPU_DEVICES` | both | Limit GPUs per rank |
diff --git a/.claude/skills/transferbench-run/references/presets.md b/.claude/skills/transferbench-run/references/presets.md
new file mode 100644
index 00000000..086979cc
--- /dev/null
+++ b/.claude/skills/transferbench-run/references/presets.md
@@ -0,0 +1,74 @@
+# TransferBench presets
+
+Presets are built-in configurations that handle topology discovery and produce well-formatted bandwidth tables. Run any of them as the first argument:
+
+```bash
+./TransferBench <preset> [N]
+```
+
+For the live list on a given build, run `./TransferBench presets`.
+
+## Single-node bandwidth presets
+
+| Preset | Purpose |
+|---|---|
+| `a2a` | All-to-all parallel transfers between every pair of GPUs. |
+| `a2asweep` | GFX-based a2a swept across CU counts and unroll factors (`MEM_TYPE`, `NUM_SUB_EXECS`). |
+| `bmasweep` | Compares DMA vs. Batched-DMA for one-to-many copies (HIP 7.1 / CUDA 12.8+). |
+| `gfxsweep` | Sweeps GFX kernel options for one Transfer. |
+| `hbm` | Local HBM read bandwidth on each GPU. |
+| `healthcheck` | Quick correctness/perf health check (AMD MI300 series only). |
+| `one2all` | All subsets of parallel transfers from one GPU to all others. |
+| `p2p` | Peer-to-peer device-memory matrix between every GPU pair. |
+| `pcopy` | Parallel copies from a single GPU to other GPUs. |
+| `rsweep` | Random sweep through Transfer combinations. |
+| `rwrite` | Parallel remote writes from a single GPU to others. |
+| `scaling` | Scaling test: one GPU → all others, varying SEs, mem types (`CPU_MEM_TYPE`, `GPU_MEM_TYPE`). |
+| `schmoo` | Local/remote read/write/copy scaling between two GPUs. |
+| `smoketest` | Quick DMA/GFX correctness sweep. |
+| `sweep` | Ordered sweep through Transfer combinations. |
+| `wallclock` | Compares wallclock counters across XCCs within one GPU. |
+
+## Multi-node / NIC presets
+
+Unless noted otherwise, these presets require an MPI launcher or the socket-mode environment variables (`TB_NUM_RANKS`, `TB_RANK`, `TB_MASTER_ADDR`).
+
+| Preset | Purpose |
+|---|---|
+| `a2a_n` | All-to-all over RDMA via each GPU's nearest NIC. |
+| `nica2a` | NIC all-to-all using each NIC's closest GPU/CPU endpoint. |
+| `nicp2p` | NIC peer-to-peer matrix across all NICs in the world. |
+| `nicrings` | Ring transfers across identical NIC indices on each rank. |
+| `rings` | Ring transfers within subgroups of pod ranks (also runs single-node). |
+
+## Pod-aware presets (multi-rank, single MNNVL/XGMI pod)
+
+These presets detect pod membership via AMD-SMI (HIP) or NVML (CUDA). If neither is available, set `TB_FORCE_SINGLE_POD=1`.
+
+| Preset | Purpose | Key knobs |
+|---|---|---|
+| `podp2p` | P2P across ranks within a pod. | `P2P_MODE`, `PARALLEL_LVL`, `USE_GPU_DMA`, `USE_REMOTE_READ`, `OUTPUT_FORMAT`, `NUM_GPU_DEVICES` |
+| `poda2a` | All-to-all across ranks within a pod. | `A2A_MODE`, `A2A_LOCAL`, `STRIDE`, `GROUP_SIZE`, `USE_DMA_EXEC`, `USE_REMOTE_READ`, `NUM_GPU_DEVICES` |
+
+`P2P_MODE`: `0` = both directions, `1` = unidirectional only, `2` = bidirectional only.
+`A2A_MODE`: `0` = copy, `1` = read-only, `2` = write-only, `2:3` = custom ratio.
+`PARALLEL_LVL`: `0` = serial node pairs, `1` = node pairs in parallel.
+
+## Info-only presets
+
+These print and exit; they don't run transfers.
+
+| Preset | Purpose |
+|---|---|
+| `help` | Config-file syntax with examples. |
+| `presets` | Lists all available presets. |
+| `envvars` | Lists every environment variable and its effect. |
+
+## Choosing a preset
+
+- "Quick GPU↔GPU bandwidth" → `p2p`.
+- "All-pairs simultaneous" → `a2a`.
+- "How does perf scale with CUs?" → `scaling` or `gfxsweep`.
+- "Across two nodes via RDMA" → `nicp2p` for matrix, `nica2a` for collective-style.
+- "Within an MNNVL pod" → `podp2p` / `poda2a`.
+- "I want to capture what a preset does and tweak it" → run with `TB_DUMP_CFG_FILE=out.cfg`, then edit `out.cfg`.