webern · Jun 20, 2026
diff --git a/.claude/skills/explain-api-roundtrip/SKILL.md b/.claude/skills/explain-api-roundtrip/SKILL.md
@@ -0,0 +1,167 @@
+---
+name: explain-api-roundtrip
+description: >
+  Use this skill to explain, in plain language, what is wrong with the `mx::api` round-trip and what
+  it needs next. It drives the failure classifier (dump -> classify) over the corpus, then reads
+  build/api/classified.json and turns it into a prioritized, human-readable explanation grouped by
+  failure mode (hard crash, dropped supported elements, reorder, by-design drop, audit blind spot).
+  Invoke for requests like "what's broken about mx::api", "explain the round-trip failures",
+  "what does the api need next", or "triage the api round-trip".
+argument-hint: "<optional: a category or element to focus on>"
+disable-model-invocation: false
+user-invocable: true
+---
+# Explain the `mx::api` round-trip
+
+`mx::api` is a deliberate subset of MusicXML, so some round-trip data loss is by design. This skill
+separates the by-design losses from the real defects and produces a plain-English answer to two
+questions: what is broken, and what should we fix next.
+
+It is the read-out layer on top of the failure classifier (issue #211): the classifier produces a
+machine-readable `build/api/classified.json`; this skill interprets it for a human.
+
+## How it works
+
+The pipeline has two steps, kept separate on purpose (see `audit/README.md`):
+
+1. `make dump-api-roundtrip` — runs the api pipeline over every corpus file and writes the
+   normalized expected/actual XML pairs to `build/api/roundtrip-dump/`. Slow: it builds the C++
+   harness and runs ~800 files. Re-run only when api/impl code has changed.
+2. `make classify-api-roundtrip` — pure Python, fast. Diffs each pair as an element multiset
+   (`Counter(expected) - Counter(actual)`), cross-references `data/api.features.xml`, and writes
+   `build/api/classified.json` plus a stdout summary.
+
+The categories in the JSON (`primary_category`):
+
+| id | meaning |
+|----|---------|
+| B | drop-only: every missing element is `support="none"` — a by-design subset drop |
+| C | reorder-only: same elements, different order |
+| D | enum bug: a value maps to a known-missing enum member |
+| E | missing attribute: a `partial` feature dropped one attribute |
+| F | pipeline error: LOADFAIL/GETDATAFAIL/CREATEFAIL — no output produced (a crash) |
+| unknown | a FAIL that matched none of the above — usually a `support="full"` element that was dropped (a real bug) or an element not tracked in `api.features.xml` |
+
+## Procedure
+
+### Step 1 — produce the data
+
+If `build/api/roundtrip-dump/` is empty or stale (api/impl changed since it was written), run:
+
+```
+make dump-api-roundtrip
+```
+
+Then always run:
+
+```
+make classify-api-roundtrip
+```
+
+Read the stdout summary it prints — that is the top-level shape (counts per category + the worklist
+of features blocking the most files).
+
+### Step 2 — mine `build/api/classified.json`
+
+Run these read-only analyses (they join the classifier output against the support index). Adjust the
+path if `--out` was overridden.
+
+Top dropped elements, with their audited support level (the key signal — `support="full"` drops are
+bugs, `support="none"` drops are by design):
+
+```
+python3 - <<'PY'
+import json, re
+from collections import Counter
+d = json.load(open("build/api/classified.json"))
+support = {m.group(1): m.group(2) for m in
+          re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
+miss = Counter()
+for r in d["files"]:
+    for tag in r["missing_element_counts"]:
+        miss[tag] += 1  # files affected
+for tag, n in miss.most_common(25):
+    print(f"{n:>4} files  {tag:<18} support={support.get(tag, 'NOT-IN-INDEX')}")
+PY
+```
+
+Pipeline-error (crash) cluster — group by file/feature to find the common root:
+
+```
+python3 - <<'PY'
+import json
+d = json.load(open("build/api/classified.json"))
+for r in d["files"]:
+    if r["primary_category"] == "F":
+        print(r["pipeline_error_kind"], r["file"])
+PY
+```
+
+Reorder cluster — where in the tree the order diverges:
+
+```
+python3 - <<'PY'
+import json
+from collections import Counter
+d = json.load(open("build/api/classified.json"))
+paths = Counter(r["first_divergence_path"] for r in d["files"] if r["primary_category"] == "C")
+for path, n in paths.most_common(10):
+    print(f"{n:>4}  {path}")
+PY
+```
+
+What is driving the `unknown` bucket (the warnings on stderr from Step 1 also list this; this is the
+programmatic view):
+
+```
+python3 - <<'PY'
+import json, re
+from collections import Counter
+d = json.load(open("build/api/classified.json"))
+support = {m.group(1): m.group(2) for m in
+          re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
+full_drop, untracked = Counter(), Counter()
+for r in d["files"]:
+    if r["primary_category"] != "unknown":
+        continue
+    for tag in r["missing_elements"]:
+        s = support.get(tag)
+        if s in ("full", "partial"):
+            full_drop[tag] += 1   # claimed supported but dropped -> bug
+        elif s is None:
+            untracked[tag] += 1   # not in api.features.xml -> audit gap
+print("supported-but-dropped:", full_drop.most_common(10))
+print("untracked:", untracked.most_common(10))
+PY
+```
+
+To drill into one file, look at its record (`missing_elements`, `mismatch_type`,
+`first_divergence_path`) and diff the pair directly:
+`diff build/api/roundtrip-dump/<flat>.expected.xml build/api/roundtrip-dump/<flat>.actual.xml`
+where `<flat>` is the corpus path with `/` replaced by `__`.
+
+### Step 3 — write the explanation
+
+Synthesize the findings into plain language grouped by failure mode, ordered by severity. Use this
+structure (fill the numbers and element names from Step 2; do not invent them):
+
+1. Frame it: `mx::api` is a subset, so some loss is by design — separate that from the real defects.
+2. Hard crashes (category F). Highest severity: no output at all. Name the cluster (the crash
+   analysis usually points at one feature). These are bugs.
+3. Dropped supported elements (the `support="full"`/`partial` rows from the top-dropped table and the
+   `supported-but-dropped` view). Either an impl round-trip bug or `api.features.xml` overstates
+   support — say which needs checking, per element.
+4. Reorder (category C). Lower severity: content intact, order wrong. Name the divergence path.
+5. By-design drops (category B): mention briefly — these are expected subset behavior, not bugs.
+6. Audit blind spots: the `untracked` view — elements dropped but not in `api.features.xml`, so they
+   can't be categorized. Recommend running `api-feature-audit` to close the gap.
+
+Then give a prioritized "what it needs" list. Be honest about caveats: the comparison is strict
+full-DOM, and if the pinned baseline (`roundtrip-baseline.txt`) has not been grown yet, almost the
+whole corpus shows as failing. That is the raw landscape, not a regression.
+
+## Hand-off for fixes (if requested)
+
+- To fix a dropped/under-supported element or a crash: use the `add-mx-api-feature` skill.
+- To correct or extend support levels in `data/api.features.xml`: use the `api-feature-audit` skill.
+- The findings belong under the tracking issue #208; file specifics with the `open-mx-issue` skill.
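
Note on the multiset diff that SKILL.md describes (`Counter(expected) - Counter(actual)`): the sketch below illustrates how an order-free element count separates a drop from a pure reorder. It is a minimal illustration, not the classifier's actual code. The function names (`tags`, `sketch_category`), the `"pass"` label, the inline XML samples, and the `support` dictionary are all assumptions made for the example; categories D, E, and F are not modelled.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tags(xml_text):
    """Return element tag names in document order (hypothetical helper)."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


def sketch_category(expected_xml, actual_xml, support):
    """Assign a simplified category using only element tags.

    `support` maps a tag name to its support level, standing in for
    data/api.features.xml (assumed shape, for illustration only).
    """
    exp, act = tags(expected_xml), tags(actual_xml)
    # Counter subtraction keeps only positive counts: tags that occur
    # more often in expected than in actual, regardless of order.
    missing = Counter(exp) - Counter(act)
    if missing:
        if all(support.get(tag) == "none" for tag in missing):
            return "B"        # every dropped tag is a by-design subset drop
        return "unknown"      # a supported or untracked tag was dropped
    if Counter(exp) == Counter(act) and exp != act:
        return "C"            # same multiset, different order
    return "pass" if exp == act else "unknown"


EXPECTED = "<measure><note><pitch/><stem/></note><note><pitch/></note></measure>"
DROPPED = "<measure><note><pitch/></note><note><pitch/></note></measure>"
REORDERED = "<measure><note><pitch/></note><note><pitch/><stem/></note></measure>"

print(sketch_category(EXPECTED, DROPPED, {"stem": "none"}))
print(sketch_category(EXPECTED, DROPPED, {"stem": "full"}))
print(sketch_category(EXPECTED, REORDERED, {}))
```

Because the subtraction ignores order, a reordered document yields an empty difference, which is why reorder-only failures need a separate sequence comparison rather than falling out of the multiset diff itself.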
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -58,6 +58,9 @@ jobs:
       - name: Generator tests
         run: make test-gen
 
+      - name: Audit tests
+        run: make test-audit
+
       - name: plates --check (all targets)
         run: make gen-check
 

diff --git a/.github/workflows/replace-claude.yaml b/.github/workflows/replace-claude.yaml
diff --git a/Makefile b/Makefile
@@ -65,12 +65,13 @@ FIND_CPP := find src \
 
 .DEFAULT_GOAL := help
 .PHONY: help sdk fmt check core-dev check-core-dev test-core-dev test-cpp-unit \
-        validate-cpp probe-cpp coverage-core-dev test-gen gen-check \
+        validate-cpp probe-cpp coverage-core-dev test-gen test-audit gen-check \
         gen-quality gen-lint \
         gen gen-cpp gen-go gen-c gen-schema \
         audit audit-force \
         build-go build-c test-go test-c \
-        lib dev test run-examples test-api-roundtrip discover-api-roundtrip coverage-api \
+        lib dev test run-examples test-api-roundtrip discover-api-roundtrip \
+        dump-api-roundtrip classify-api-roundtrip coverage-api \
         clean clean-docker check-docker docker-volume
 
 help:
@@ -84,6 +85,8 @@ help:
 	@echo '  make run-examples       Build and run all three api example programs.'
 	@echo '  make test-api-roundtrip Run the corpus api roundtrip in regression mode (CI gate).'
 	@echo '  make discover-api-roundtrip  Run discovery mode over the full corpus (manual only).'
+	@echo '  make dump-api-roundtrip      Dump normalized expected/actual XML for failures.'
+	@echo '  make classify-api-roundtrip  Classify dumped failures by root cause (Python).'
 	@echo '  make coverage-api            Instrumented api/impl/utility build + gcovr report.'
 	@echo ''
 	@echo '  C++ core:'
@@ -97,6 +100,7 @@ help:
 	@echo ''
 	@echo '  Generator:'
 	@echo '  make test-gen       Run the generator (parser + IR + plates + press) Python tests.'
+	@echo '  make test-audit     Run the audit tool Python tests (incl. failure classifier).'
 	@echo '  make gen-check      plates --check for every target (renames, collisions).'
 	@echo '  make gen            Run the generator for every target (cpp/go/c/schema).'
 	@echo '  make gen-cpp        Run the generator for the C++ target (src/private/mx/core/generated).'
@@ -177,6 +181,24 @@ test-api-roundtrip: dev
 discover-api-roundtrip: dev
 	$(BUILD_ROOT)/api/mxtest-api-roundtrip discovery $(CURDIR)/data
 
+# Dump normalized expected/actual XML for every failing api round-trip.
+# Output goes to build/api/roundtrip-dump/ (build dir, already gitignored).
+# Feeds the classifier: make dump-api-roundtrip && make classify-api-roundtrip
+dump-api-roundtrip: dev
+	mkdir -p $(BUILD_ROOT)/api/roundtrip-dump
+	$(BUILD_ROOT)/api/mxtest-api-roundtrip discovery $(CURDIR)/data \
+		--dump $(CURDIR)/$(BUILD_ROOT)/api/roundtrip-dump
+
+# Classify api round-trip failures by root cause.
+# Reads the dump produced by dump-api-roundtrip; writes build/api/classified.json.
+# Fast (pure Python); kept separate from the slow dump step so classification
+# logic can be re-run without re-dumping. Pass DUMP_DIR=path to override.
+DUMP_DIR ?= $(BUILD_ROOT)/api/roundtrip-dump
+classify-api-roundtrip:
+	python3 -m audit classify $(DUMP_DIR) \
+		--data $(CURDIR)/data \
+		--out $(BUILD_ROOT)/api/classified.json
+
 # Instrumented api coverage: build mx+mxtest with --coverage, run all
 # suites, produce gcovr report for src/private/mx/{api,impl,utility}/.
 coverage-api:
@@ -299,6 +321,10 @@ coverage-core-dev:
 test-gen:
 	python3 -m unittest discover -s gen/tests -t . $(ARGS)
 
+# Audit tool Python tests (feature-audit + the round-trip failure classifier).
+test-audit:
+	python3 -m unittest discover -s audit/tests -t . $(ARGS)
+
 # plates --check for every target: validates renames and detects identifier
 # collisions (a CI gate, like test-gen).
 gen-check:
@@ -406,6 +432,12 @@ test-api-roundtrip: $(DOCKER_STAMP) docker-volume
 discover-api-roundtrip: $(DOCKER_STAMP) docker-volume
 	$(DOCKER_RUN) make discover-api-roundtrip BUILD_TYPE=$(BUILD_TYPE)
 
+dump-api-roundtrip: $(DOCKER_STAMP) docker-volume
+	$(DOCKER_RUN) make dump-api-roundtrip BUILD_TYPE=$(BUILD_TYPE)
+
+classify-api-roundtrip: $(DOCKER_STAMP) docker-volume
+	$(DOCKER_RUN) make classify-api-roundtrip BUILD_TYPE=$(BUILD_TYPE)
+
 coverage-api: $(DOCKER_STAMP) docker-volume
 	@rm -rf $(COV_DIR)/api
 	$(DOCKER_RUN) make coverage-api BUILD_TYPE=$(BUILD_TYPE) ARGS='$(ARGS)'
@@ -437,6 +469,9 @@ coverage-core-dev: $(DOCKER_STAMP) docker-volume
 test-gen: $(DOCKER_STAMP)
 	$(DOCKER_RUN) make test-gen ARGS='$(ARGS)'
 
+test-audit: $(DOCKER_STAMP)
+	$(DOCKER_RUN) make test-audit ARGS='$(ARGS)'
+
 gen-check: $(DOCKER_STAMP)
 	$(DOCKER_RUN) make gen-check
 

diff --git a/audit/README.md b/audit/README.md
@@ -37,6 +37,32 @@ python3 -m audit all [--force]   # both
 common case (a new corpus file was added) only writes the new sidecar. Use
 `--force` when the output format itself changes.
 
+## Classifying api round-trip failures
+
+```
+make dump-api-roundtrip          # C++: write normalized expected/actual XML pairs
+make classify-api-roundtrip      # Python: classify those failures by root cause
+
+python3 -m audit classify <dump_dir> [--data DIR] [--out FILE]
+```
+
+`classify` reads the dump directory produced by `make dump-api-roundtrip`
+(`build/api/roundtrip-dump/`), diffs each expected/actual pair as an order-free
+element **multiset** (`Counter(expected) - Counter(actual)`), cross-references
+`data/api.features.xml`, and assigns each non-passing file a root-cause category
+(drop-only, reorder-only, enum bug, missing attribute, pipeline error). It writes
+`build/api/classified.json` and prints a worklist of the features blocking the
+most files. The two steps are kept separate: dumping is slow (runs the C++
+pipeline over the whole corpus), classifying is fast (pure Python), so the
+classification logic can be iterated without re-dumping. See
+`docs/ai/design/api-roundtrip-classifier.md`.
+
+## Tests
+
+```
+make test-audit                  # python3 -m unittest discover -s audit/tests -t .
+```
+
 ## Audited set
 
 The audited files are exactly those the `corert` round-trip suite processes (see