webern · Jun 20, 2026
diff --git a/.claude/skills/explain-api-roundtrip/SKILL.md b/.claude/skills/explain-api-roundtrip/SKILL.md
@@ -3,44 +3,47 @@ name: explain-api-roundtrip
 description: >
   Use this skill to explain, in plain language, what is wrong with the `mx::api` round-trip and what
   it needs next. It drives the failure classifier (dump -> classify) over the corpus, then reads
-  build/api/classified.json and turns it into a prioritized, human-readable explanation grouped by
-  failure mode (hard crash, dropped supported elements, reorder, by-design drop, audit blind spot).
+  build/api/classified.json and turns it into a prioritized, human-readable worklist grouped by
+  failure shape (crashes, instant wins, small fix-sets, reorder-blocked, high-frequency drops).
   Invoke for requests like "what's broken about mx::api", "explain the round-trip failures",
   "what does the api need next", or "triage the api round-trip".
-argument-hint: "<optional: a category or element to focus on>"
+argument-hint: "<optional: a signature or element to focus on>"
 disable-model-invocation: false
 user-invocable: true
 ---
 # Explain the `mx::api` round-trip
 
-`mx::api` is a deliberate subset of MusicXML, so some round-trip data loss is by design. This skill
-separates the by-design losses from the real defects and produces a plain-English answer to two
-questions: what is broken, and what should we fix next.
+`mx::api` is a deliberate subset of MusicXML, and the comparison is strict full-DOM, so most corpus
+files diverge somewhere. This skill turns the measured divergences into a plain-English answer to two
+questions: what is broken, and what should we fix next to land the most files into the round-trip
+corpus with the fewest software changes.
 
-It is the read-out layer on top of the failure classifier (issue #211): the classifier produces a
-machine-readable `build/api/classified.json`; this skill interprets it for a human.
+It is the read-out layer on top of the failure classifier (issues #211/#212): the classifier produces
+a machine-readable `build/api/classified.json`; this skill interprets it for a human.
 
-## How it works
+## What the classifier reports (and what it does not)
 
-The pipeline has two steps, kept separate on purpose (see `audit/README.md`):
+Classification is purely **measured** from each expected/actual pair. It does **not** consult
+`data/api.features.xml` or any record of what `mx::api` was believed to "support" — whether a drop is
+intended is a present-day human decision (#214), not something the tool asserts. So do not describe
+any drop as "by-design" on the classifier's authority; report what was dropped and let the human
+decide.
 
-1. `make dump-api-roundtrip` — runs the api pipeline over every corpus file and writes the
-   normalized expected/actual XML pairs to `build/api/roundtrip-dump/`. Slow: it builds the C++
-   harness and runs ~800 files. Re-run only when api/impl code changed.
-2. `make classify-api-roundtrip` — pure Python, fast. Diffs each pair as an element multiset
-   (`Counter(expected) - Counter(actual)`), cross-references `data/api.features.xml`, and writes
-   `build/api/classified.json` plus a stdout summary.
+Each difference is a **signature**, and a file's **distance** to passing is its count of unique
+signatures:
 
-The categories in the JSON (`primary_category`):
+| signature | meaning |
+|-----------|---------|
+| `drop:<tag>` | a tag in expected, missing from actual |
+| `add:<tag>` | a spurious tag only in actual (a bug) |
+| `value:<tag>` | a paired element whose text value differs |
+| `value:<tag>@<attr>` | a paired element whose attribute value differs |
+| `attr:<tag>@<attr>` | an attribute present on only one side |
+| `reorder:<parent>` | a parent with matching children in a different order |
 
-| id | meaning |
-|----|---------|
-| B | drop-only: every missing element is `support="none"` — a by-design subset drop |
-| C | reorder-only: same elements, different order |
-| D | enum bug: a value maps to a known-missing enum member |
-| E | missing attribute: a `partial` feature dropped one attribute |
-| F | pipeline error: LOADFAIL/GETDATAFAIL/CREATEFAIL — no output produced (a crash) |
-| unknown | a FAIL that matched none of the above — usually a `support="full"` element that was dropped (a real bug) or an element not tracked in `api.features.xml` |
+Per-file `status` is `PASS` / `FAIL` / `CRASH`. A `FAIL` with no `reorder:` signature is a
+**candidate**: the first-pass worklist targets these, since reorders are expected `mx::api` behavior
+to be absorbed in test normalization later (#214).
 
 ## Procedure
 
@@ -58,110 +61,106 @@ Then always run:
 make classify-api-roundtrip
 ```
 
-Read the stdout summary it prints — that is the top-level shape (counts per category + the worklist
-of features blocking the most files).
+Read the stdout summary it prints — status counts, the distance histogram, and the ranked worklist.
+That worklist *is* the headline answer to "what should we fix next."
 
 ### Step 2 — mine `build/api/classified.json`
 
-Run these read-only analyses (they join the classifier output against the support index). Adjust the
-path if `--out` was overridden.
+These read-only analyses expand on the stdout summary. Adjust the path if `--out` was overridden.
 
-Top dropped elements, with their audited support level (the key signal — `support="full"` drops are
-bugs, `support="none"` drops are by design):
+The worklist — signatures ranked by candidate files unblocked (`sole_blocker` = files this fix flips
+green on its own; `files_blocked` = candidate files that include it):
 
 ```
 python3 - <<'PY'
-import json, re
-from collections import Counter
+import json
+d = json.load(open("build/api/classified.json"))
+for row in d["worklist"][:25]:
+    print(f"{row['sole_blocker']:>4} sole  {row['files_blocked']:>5} total  {row['signature']}")
+PY
+```
+
+Instant wins — candidate files one fix away from passing (distance 1):
+
+```
+python3 - <<'PY'
+import json
 d = json.load(open("build/api/classified.json"))
-support = {m.group(1): m.group(2) for m in
-          re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
-miss = Counter()
-for r in d["files"]:
-    for tag in r["missing_element_counts"]:
-        miss[tag] += 1  # files affected
-for tag, n in miss.most_common(25):
-    print(f"{n:>4} files  {tag:<18} support={support.get(tag, 'NOT-IN-INDEX')}")
+for f in d["near_misses"].get("1", []):  # empty when no file is at distance 1
+    print(f["signatures"][0], f["file"])
 PY
 ```
 
-Pipeline-error (crash) cluster — group by file/feature to find the common root:
+Small fix-sets — files that pass once a handful of features land (distance 2–3); the union of their
+signatures is a high-yield batch:
 
 ```
 python3 - <<'PY'
 import json
+from collections import Counter
 d = json.load(open("build/api/classified.json"))
-for r in d["files"]:
-    if r["primary_category"] == "F":
-        print(r["pipeline_error_kind"], r["file"])
+for dist, near in ((k, d["near_misses"].get(k, [])) for k in ("2", "3")):
+    sigs = Counter(s for f in near for s in f["signatures"])
+    print(f"distance {dist}: {len(near)} files; signatures {sigs.most_common(10)}")
 PY
 ```
 
-Reorder cluster — where in the tree the order diverges:
+Crash cluster (highest severity — no output at all) — group by kind to find the common root:
 
 ```
 python3 - <<'PY'
 import json
 from collections import Counter
 d = json.load(open("build/api/classified.json"))
-paths = Counter(r["first_divergence_path"] for r in d["files"] if r["primary_category"] == "C")
-for path, n in paths.most_common(10):
-    print(f"{n:>4}  {path}")
+crashes = [(r["crash_kind"], r["file"]) for r in d["files"] if r["status"] == "CRASH"]
+print(Counter(k for k, _ in crashes))
+for kind, f in crashes[:20]:
+    print(kind, f)
 PY
 ```
 
-What is driving the `unknown` bucket (the warnings on stderr from Step 1 also list this; this is the
-programmatic view):
+Reorder-blocked files (deferred to #214) — how many, and at which parents:
 
 ```
 python3 - <<'PY'
-import json, re
+import json
 from collections import Counter
 d = json.load(open("build/api/classified.json"))
-support = {m.group(1): m.group(2) for m in
-          re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
-full_drop, untracked = Counter(), Counter()
-for r in d["files"]:
-    if r["primary_category"] != "unknown":
-        continue
-    for tag in r["missing_elements"]:
-        s = support.get(tag)
-        if s in ("full", "partial"):
-            full_drop[tag] += 1   # claimed supported but dropped -> bug
-        elif s is None:
-            untracked[tag] += 1   # not in api.features.xml -> audit gap
-print("supported-but-dropped:", full_drop.most_common(10))
-print("untracked:", untracked.most_common(10))
+reorders = Counter(s for r in d["files"] if r["has_reorder"]
+                   for s in r["signatures"] if s.startswith("reorder:"))
+print("reorder-blocked files:", sum(1 for r in d["files"] if r["has_reorder"]))
+print(reorders.most_common(10))
 PY
 ```
 
-To drill into one file, look at its record (`missing_elements`, `mismatch_type`,
-`first_divergence_path`) and diff the pair directly:
+To drill into one file, look at its record (`signatures`, `sample_paths`, `distance`) and diff the
+pair directly:
 `diff build/api/roundtrip-dump/<flat>.expected.xml build/api/roundtrip-dump/<flat>.actual.xml`
 where `<flat>` is the corpus path with `/` replaced by `__`.
 
 ### Step 3 — write the explanation
 
-Synthesize the findings into plain language grouped by failure mode, ordered by severity. Use this
-structure (fill the numbers and element names from Step 2; do not invent them):
-
-1. Frame it: `mx::api` is a subset, so some loss is by design — separate that from the real defects.
-2. Hard crashes (category F). Highest severity: no output at all. Name the cluster (the crash
-   analysis usually points at one feature). These are bugs.
-3. Dropped supported elements (the `support="full"`/`partial` rows from the top-dropped table and the
-   `supported-but-dropped` view). Either an impl round-trip bug or `api.features.xml` overstates
-   support — say which needs checking, per element.
-4. Reorder (category C). Lower severity: content intact, order wrong. Name the divergence path.
-5. By-design drops (category B): mention briefly — these are expected subset behavior, not bugs.
-6. Audit blind spots: the `untracked` view — elements dropped but not in `api.features.xml`, so they
-   can't be categorized. Recommend running `api-feature-audit` to close the gap.
-
-Then give a prioritized "what it needs" list. Be honest about caveats: the comparison is strict
-full-DOM, and if the pinned baseline (`roundtrip-baseline.txt`) is ungrown, almost the whole corpus
-shows as failing — these are the raw landscape, not a regression.
+Synthesize the findings into plain language, ordered by what grows the corpus fastest. Use this
+structure (fill the numbers and signatures from Step 2; do not invent them):
+
+1. Frame it: the comparison is strict full-DOM, so a file passes only when *every* signature is
+   resolved. The goal is to land files with the fewest software changes.
+2. Crashes (`status="CRASH"`). Highest severity: no output at all. Name the cluster (the crash kind
+   usually points at one feature). These are bugs.
+3. Instant wins: the distance-1 candidates and the top `sole_blocker` signatures — one fix each flips
+   a file green now. This is the front of the worklist.
+4. Small fix-sets: the distance-2/3 candidates and the union of signatures that would unblock a batch
+   — "add these N features → these M files pass."
+5. Reorder-blocked: count them, name the top `reorder:` parents, and note they are deferred to test
+   normalization (#214), not part of the first-pass worklist.
+6. High-frequency drops that are *not* sole blockers: large `files_blocked` with low `sole_blocker`
+   means the fix helps many files but flips none alone — flag as enabling, lower immediate yield.
+
+Be honest about caveats: the comparison is strict full-DOM, and if the pinned baseline
+(`roundtrip-baseline.txt`) is ungrown, almost the whole corpus shows as failing. That is the raw
+landscape, not a regression.
 
 ## Hand-off Fixes (if requested)
 
 - To fix a dropped/under-supported element or a crash: use the `add-mx-api-feature` skill.
-- To correct or extend support levels in `data/api.features.xml`: use the `api-feature-audit` skill.
 - The findings belong under the tracking issue #208; file specifics with the `open-mx-issue` skill.
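
Review note: the `drop:<tag>` / `add:<tag>` signatures and the per-file distance in the SKILL.md hunks above can be illustrated with a minimal sketch. It assumes tags are counted as an order-free multiset (`Counter(expected) - Counter(actual)`), as the README describes. The helper names `tag_counts` and `drop_add_signatures` and the inline XML are hypothetical, and the `value:`, `attr:`, and `reorder:` signatures are not covered here, so this is not the classifier's actual code.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text: str) -> Counter:
    """Count every element tag in the document, ignoring order and nesting."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())


def drop_add_signatures(expected_xml: str, actual_xml: str) -> set[str]:
    """Hypothetical helper: derive drop:/add: signatures from a multiset diff."""
    expected, actual = tag_counts(expected_xml), tag_counts(actual_xml)
    dropped = expected - actual  # tags (or extra copies) only in expected
    added = actual - expected    # tags (or extra copies) only in actual
    return {f"drop:{t}" for t in dropped} | {f"add:{t}" for t in added}


# Made-up pair: one <stem> and one of two <beam> elements are lost,
# and a spurious <footnote> appears.
expected = "<note><pitch/><stem/><beam/><beam/></note>"
actual = "<note><pitch/><beam/><footnote/></note>"

sigs = drop_add_signatures(expected, actual)
distance = len(sigs)  # distance = count of unique signatures
print(sorted(sigs), distance)
```

For this pair the sketch yields three unique signatures (`add:footnote`, `drop:beam`, `drop:stem`), so the file's distance is 3. Note that losing one of two `<beam>` elements still counts as a single `drop:beam` signature, because distance counts unique signatures rather than individual elements.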
diff --git a/audit/README.md b/audit/README.md
@@ -41,21 +41,28 @@ common case (a new corpus file was added) only writes the new sidecar. Use
 
 ```
 make dump-api-roundtrip          # C++: write normalized expected/actual XML pairs
-make classify-api-roundtrip      # Python: classify those failures by root cause
+make classify-api-roundtrip      # Python: diff each pair, rank a worklist
 
 python3 -m audit classify <dump_dir> [--data DIR] [--out FILE]
 ```
 
 `classify` reads the dump directory produced by `make dump-api-roundtrip`
-(`build/api/roundtrip-dump/`), diffs each expected/actual pair as an order-free
-element **multiset** (`Counter(expected) - Counter(actual)`), cross-references
-`data/api.features.xml`, and assigns each non-passing file a root-cause category
-(drop-only, reorder-only, enum bug, missing attribute, pipeline error). It writes
-`build/api/classified.json` and prints a worklist of the features blocking the
-most files. The two steps are kept separate: dumping is slow (runs the C++
-pipeline over the whole corpus), classifying is fast (pure Python), so the
-classification logic can be iterated without re-dumping. See
-`docs/ai/design/api-roundtrip-classifier.md`.
+(`build/api/roundtrip-dump/`) and diffs each expected/actual pair structurally.
+Drops/adds come from an order-free element **multiset**
+(`Counter(expected) - Counter(actual)`); value/attribute/reorder differences come
+from an alignment walk over the surviving structure. Each difference becomes a
+**signature** (`drop:<tag>`, `add:<tag>`, `value:<tag>`, `value:<tag>@<attr>`,
+`attr:<tag>@<attr>`, `reorder:<parent>`), and a file's **distance** to passing is
+its count of unique signatures. It writes `build/api/classified.json` and prints
+a worklist ranking each signature by how many candidate files it alone blocks.
+
+Classification is purely **measured**: it does not consult `data/api.features.xml`
+or any record of what `mx::api` was believed to "support" -- whether a drop is
+intended is a present-day human call (#214), not something the classifier
+asserts. `--data` is accepted for compatibility but unused. The two steps are
+kept separate: dumping is slow (runs the C++ pipeline over the whole corpus),
+classifying is fast (pure Python), so classification can be iterated without
+re-dumping. See `docs/ai/design/api-roundtrip-classifier.md`.
 
 ## Tests
 

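Review note: the worklist ranking described in the README hunk above can be sketched as follows. This assumes the SKILL.md definitions: a candidate is a failing file with no `reorder:` signature, `sole_blocker` counts candidates whose only signature is the given one, and `files_blocked` counts candidates that include it. The function name `rank_worklist`, the sample data, and the tie-break order are assumptions made for illustration, not the classifier's actual implementation.

```python
from collections import Counter


def rank_worklist(files: dict[str, list[str]]) -> list[dict]:
    """Hypothetical helper: rank signatures by how many candidates each unblocks."""
    sole, blocked = Counter(), Counter()
    for sigs in files.values():
        unique = set(sigs)
        # Skip passing files and reorder-blocked files (not candidates).
        if not unique or any(s.startswith("reorder:") for s in unique):
            continue
        for s in unique:
            blocked[s] += 1
        if len(unique) == 1:  # distance 1: this one fix flips the file green
            sole[next(iter(unique))] += 1
    rows = [
        {"signature": s, "sole_blocker": sole[s], "files_blocked": n}
        for s, n in blocked.items()
    ]
    # Assumed ordering: sole blockers first, then total reach, then name.
    rows.sort(key=lambda r: (-r["sole_blocker"], -r["files_blocked"], r["signature"]))
    return rows


# Made-up per-file signatures.
sample = {
    "a.xml": ["drop:stem"],
    "b.xml": ["drop:stem", "drop:beam"],
    "c.xml": ["drop:beam", "reorder:note"],  # reorder-blocked, not a candidate
    "d.xml": ["drop:beam", "value:duration"],
}
worklist = rank_worklist(sample)
for row in worklist:
    print(f"{row['sole_blocker']:>4} sole  {row['files_blocked']:>5} total  {row['signature']}")
```

In this sample `drop:stem` ranks first because it is the sole blocker of `a.xml`, while `drop:beam` blocks as many candidates but flips none on its own: the "enabling, lower immediate yield" case that Step 3 of the skill asks to flag.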
diff --git a/audit/__main__.py b/audit/__main__.py
@@ -6,8 +6,8 @@
   python3 -m audit corpus            (re)build data/corpus.xml from the corpus
   python3 -m audit all [--force]     run `files` then `corpus`
   python3 -m audit classify <dump_dir> [--data DIR] [--out FILE]
-                                     classify api round-trip failures by root
-                                     cause from a dump directory (see #211)
+                                     diff api round-trip dumps and rank a worklist
+                                     (see #211/#212; --data is unused)
 
 See audit/README.md. The audited set mirrors the corert round-trip suite.
 """
@@ -67,7 +67,7 @@ def main(argv: list[str]) -> int:
     p_all.add_argument("--force", action="store_true", help="overwrite existing sidecars")
 
     p_classify = sub.add_parser(
-        "classify", help="classify api round-trip failures from a dump directory"
+        "classify", help="diff api round-trip dumps and rank a worklist"
     )
     p_classify.add_argument("dump_dir", help="directory of *.expected.xml/*.actual.xml dumps")
     p_classify.add_argument(