Merged
171 changes: 85 additions & 86 deletions .claude/skills/explain-api-roundtrip/SKILL.md
@@ -3,44 +3,47 @@ name: explain-api-roundtrip
description: >
Use this skill to explain, in plain language, what is wrong with the `mx::api` round-trip and what
it needs next. It drives the failure classifier (dump -> classify) over the corpus, then reads
build/api/classified.json and turns it into a prioritized, human-readable explanation grouped by
failure mode (hard crash, dropped supported elements, reorder, by-design drop, audit blind spot).
build/api/classified.json and turns it into a prioritized, human-readable worklist grouped by
failure shape (crashes, instant wins, small fix-sets, reorder-blocked, high-frequency drops).
Invoke for requests like "what's broken about mx::api", "explain the round-trip failures",
"what does the api need next", or "triage the api round-trip".
argument-hint: "<optional: a category or element to focus on>"
argument-hint: "<optional: a signature or element to focus on>"
disable-model-invocation: false
user-invocable: true
---
# Explain the `mx::api` round-trip

`mx::api` is a deliberate subset of MusicXML, so some round-trip data loss is by design. This skill
separates the by-design losses from the real defects and produces a plain-English answer to two
questions: what is broken, and what should we fix next.
`mx::api` is a deliberate subset of MusicXML, and the comparison is strict full-DOM, so most corpus
files diverge somewhere. This skill turns the measured divergences into a plain-English answer to two
questions: what is broken, and what should we fix next to land the most files into the round-trip
corpus with the fewest software changes.

It is the read-out layer on top of the failure classifier (issue #211): the classifier produces a
machine-readable `build/api/classified.json`; this skill interprets it for a human.
It is the read-out layer on top of the failure classifier (issue #211/#212): the classifier produces
a machine-readable `build/api/classified.json`; this skill interprets it for a human.

## How it works
## What the classifier reports (and what it does not)

The pipeline has two steps, kept separate on purpose (see `audit/README.md`):
Classification is purely **measured** from each expected/actual pair. It does **not** consult
`data/api.features.xml` or any record of what `mx::api` was believed to "support" — whether a drop is
intended is a present-day human decision (#214), not something the tool asserts. So do not describe
any drop as "by-design" on the classifier's authority; report what was dropped and let the human
decide.

1. `make dump-api-roundtrip` — runs the api pipeline over every corpus file and writes the
normalized expected/actual XML pairs to `build/api/roundtrip-dump/`. Slow: it builds the C++
harness and runs ~800 files. Re-run only when api/impl code changed.
2. `make classify-api-roundtrip` — pure Python, fast. Diffs each pair as an element multiset
(`Counter(expected) - Counter(actual)`), cross-references `data/api.features.xml`, and writes
`build/api/classified.json` plus a stdout summary.
Each difference is a **signature**, and a file's **distance** to passing is its count of unique
signatures:

The categories in the JSON (`primary_category`):
| signature | meaning |
|-----------|---------|
| `drop:<tag>` | a tag in expected, missing from actual |
| `add:<tag>` | a spurious tag only in actual (a bug) |
| `value:<tag>` | a paired element whose text value differs |
| `value:<tag>@<attr>` | a paired element whose attribute value differs |
| `attr:<tag>@<attr>` | an attribute present on only one side |
| `reorder:<parent>` | a parent with matching children in a different order |

| id | meaning |
|----|---------|
| B | drop-only: every missing element is `support="none"` — a by-design subset drop |
| C | reorder-only: same elements, different order |
| D | enum bug: a value maps to a known-missing enum member |
| E | missing attribute: a `partial` feature dropped one attribute |
| F | pipeline error: LOADFAIL/GETDATAFAIL/CREATEFAIL — no output produced (a crash) |
| unknown | a FAIL that matched none of the above — usually a `support="full"` element that was dropped (a real bug) or an element not tracked in `api.features.xml` |
Per-file `status` is `PASS` / `FAIL` / `CRASH`. A `FAIL` with no `reorder:` signature is a
**candidate**: the first-pass worklist targets these, since reorders are expected `mx::api` behavior
to be absorbed in test normalization later (#214).
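
The distance and candidate rules above can be sketched as follows. This is a minimal illustration, assuming a per-file record that carries the `status` and `signatures` fields named in this skill; the helper names `distance` and `is_candidate` are hypothetical and are not taken from the classifier source.

```python
def distance(signatures):
    """Distance to passing: the number of unique signatures on a file."""
    return len(set(signatures))


def is_candidate(record):
    """A candidate is a FAIL whose signatures include no `reorder:` entry."""
    return record["status"] == "FAIL" and not any(
        sig.startswith("reorder:") for sig in record["signatures"]
    )


# Hypothetical record, shaped like the fields this skill reads.
record = {"status": "FAIL", "signatures": ["drop:stem", "drop:stem", "value:duration"]}
print(distance(record["signatures"]), is_candidate(record))
```

Duplicate signatures collapse to one, so a tag dropped many times in a file still counts as a single fix.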

## Procedure

@@ -58,110 +61,106 @@ Then always run:
make classify-api-roundtrip
```

Read the stdout summary it prints — that is the top-level shape (counts per category + the worklist
of features blocking the most files).
Read the stdout summary it prints — status counts, the distance histogram, and the ranked worklist.
That worklist *is* the headline answer to "what should we fix next."
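
If the stdout summary is not at hand, the status counts and distance histogram can be approximated from the JSON. The sketch below assumes each entry of `d["files"]` has the `status` and `distance` fields mentioned in this skill; restricting the histogram to `FAIL` records is an assumption, since the source does not say whether crashes carry a distance.

```python
from collections import Counter


def summarize(files):
    """Return (status counts, distance histogram over FAIL records)."""
    status_counts = Counter(rec["status"] for rec in files)
    # Assumption: only FAIL records have a meaningful distance.
    histogram = Counter(rec["distance"] for rec in files if rec["status"] == "FAIL")
    return dict(status_counts), dict(sorted(histogram.items()))


# Hypothetical records for illustration only.
files = [
    {"status": "PASS", "distance": 0},
    {"status": "FAIL", "distance": 1},
    {"status": "FAIL", "distance": 3},
    {"status": "FAIL", "distance": 1},
    {"status": "CRASH"},
]
print(summarize(files))
```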

### Step 2 — mine `build/api/classified.json`

Run these read-only analyses (they join the classifier output against the support index). Adjust the
path if `--out` was overridden.
These read-only analyses expand on the stdout summary. Adjust the path if `--out` was overridden.

Top dropped elements, with their audited support level (the key signal — `support="full"` drops are
bugs, `support="none"` drops are by design):
The worklist — signatures ranked by candidate files unblocked (`sole_blocker` = files this fix flips
green on its own; `files_blocked` = candidate files that include it):

```
python3 - <<'PY'
import json, re
from collections import Counter
import json
d = json.load(open("build/api/classified.json"))
for row in d["worklist"][:25]:
    print(f"{row['sole_blocker']:>4} sole {row['files_blocked']:>5} total {row['signature']}")
PY
```

Instant wins — candidate files one fix away from passing (distance 1):

```
python3 - <<'PY'
import json
d = json.load(open("build/api/classified.json"))
support = {m.group(1): m.group(2) for m in
           re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
miss = Counter()
for r in d["files"]:
    for tag in r["missing_element_counts"]:
        miss[tag] += 1  # files affected
for tag, n in miss.most_common(25):
    print(f"{n:>4} files {tag:<18} support={support.get(tag, 'NOT-IN-INDEX')}")
for f in d["near_misses"]["1"]:
    print(f["signatures"][0], f["file"])
PY
```

Pipeline-error (crash) cluster — group by file/feature to find the common root:
Small fix-sets — files that pass once a handful of features land (distance 2–3); the union of their
signatures is a high-yield batch:

```
python3 - <<'PY'
import json
from collections import Counter
d = json.load(open("build/api/classified.json"))
for r in d["files"]:
    if r["primary_category"] == "F":
        print(r["pipeline_error_kind"], r["file"])
for dist in ("2", "3"):
    sigs = Counter(s for f in d["near_misses"][dist] for s in f["signatures"])
    print(f"distance {dist}: {len(d['near_misses'][dist])} files; signatures {sigs.most_common(10)}")
PY
```

Reorder cluster — where in the tree the order diverges:
Crash cluster (highest severity — no output at all) — group by kind to find the common root:

```
python3 - <<'PY'
import json
from collections import Counter
d = json.load(open("build/api/classified.json"))
paths = Counter(r["first_divergence_path"] for r in d["files"] if r["primary_category"] == "C")
for path, n in paths.most_common(10):
    print(f"{n:>4} {path}")
crashes = [(r["crash_kind"], r["file"]) for r in d["files"] if r["status"] == "CRASH"]
print(Counter(k for k, _ in crashes))
for kind, f in crashes[:20]:
    print(kind, f)
PY
```

What is driving the `unknown` bucket (the warnings on stderr from Step 1 also list this; this is the
programmatic view):
Reorder-blocked files (deferred to #214) — how many, and at which parents:

```
python3 - <<'PY'
import json, re
import json
from collections import Counter
d = json.load(open("build/api/classified.json"))
support = {m.group(1): m.group(2) for m in
           re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())}
full_drop, untracked = Counter(), Counter()
for r in d["files"]:
    if r["primary_category"] != "unknown":
        continue
    for tag in r["missing_elements"]:
        s = support.get(tag)
        if s in ("full", "partial"):
            full_drop[tag] += 1  # claimed supported but dropped -> bug
        elif s is None:
            untracked[tag] += 1  # not in api.features.xml -> audit gap
print("supported-but-dropped:", full_drop.most_common(10))
print("untracked:", untracked.most_common(10))
reorders = Counter(s for r in d["files"] if r["has_reorder"]
                   for s in r["signatures"] if s.startswith("reorder:"))
print("reorder-blocked files:", sum(1 for r in d["files"] if r["has_reorder"]))
print(reorders.most_common(10))
PY
```

To drill into one file, look at its record (`missing_elements`, `mismatch_type`,
`first_divergence_path`) and diff the pair directly:
To drill into one file, look at its record (`signatures`, `sample_paths`, `distance`) and diff the
pair directly:
`diff build/api/roundtrip-dump/<flat>.expected.xml build/api/roundtrip-dump/<flat>.actual.xml`
where `<flat>` is the corpus path with `/` replaced by `__`.

### Step 3 — write the explanation

Synthesize the findings into plain language grouped by failure mode, ordered by severity. Use this
structure (fill the numbers and element names from Step 2; do not invent them):

1. Frame it: `mx::api` is a subset, so some loss is by design — separate that from the real defects.
2. Hard crashes (category F). Highest severity: no output at all. Name the cluster (the crash
analysis usually points at one feature). These are bugs.
3. Dropped supported elements (the `support="full"`/`partial` rows from the top-dropped table and the
`supported-but-dropped` view). Either an impl round-trip bug or `api.features.xml` overstates
support — say which needs checking, per element.
4. Reorder (category C). Lower severity: content intact, order wrong. Name the divergence path.
5. By-design drops (category B): mention briefly — these are expected subset behavior, not bugs.
6. Audit blind spots: the `untracked` view — elements dropped but not in `api.features.xml`, so they
can't be categorized. Recommend running `api-feature-audit` to close the gap.

Then give a prioritized "what it needs" list. Be honest about caveats: the comparison is strict
full-DOM, and if the pinned baseline (`roundtrip-baseline.txt`) is ungrown, almost the whole corpus
shows as failing — these are the raw landscape, not a regression.
Synthesize the findings into plain language, ordered by what grows the corpus fastest. Use this
structure (fill the numbers and signatures from Step 2; do not invent them):

1. Frame it: strict full-DOM, so a file passes only when *every* signature is resolved. The goal is
to land files with the fewest software changes.
2. Crashes (`status="CRASH"`). Highest severity: no output at all. Name the cluster (the crash kind
usually points at one feature). These are bugs.
3. Instant wins: the distance-1 candidates and the top `sole_blocker` signatures — one fix each flips
a file green now. This is the front of the worklist.
4. Small fix-sets: the distance-2/3 candidates and the union of signatures that would unblock a batch
— "add these N features → these M files pass."
5. Reorder-blocked: count them, name the top `reorder:` parents, and note they are deferred to test
normalization (#214), not part of the first-pass worklist.
6. High-frequency drops that are *not* sole blockers: large `files_blocked` with low `sole_blocker`
means the fix helps many files but flips none alone — flag as enabling, lower immediate yield.
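
The difference between `sole_blocker` and `files_blocked` in items 3 and 6 can be made concrete with a short sketch. It assumes each candidate record exposes a `signatures` list; the function name `rank_worklist` and the tie-breaking order are illustrative assumptions, not the classifier's actual implementation.

```python
from collections import Counter


def rank_worklist(candidates):
    """Rank signatures by how many candidate files each one solely blocks."""
    files_blocked = Counter()  # candidate files that include the signature
    sole_blocker = Counter()   # candidate files where it is the only signature
    for rec in candidates:
        sigs = set(rec["signatures"])
        files_blocked.update(sigs)
        if len(sigs) == 1:
            sole_blocker.update(sigs)
    rows = [
        {"signature": s, "sole_blocker": sole_blocker[s], "files_blocked": n}
        for s, n in files_blocked.items()
    ]
    # Assumed ordering: sole blockers first, then total reach, then name.
    rows.sort(key=lambda r: (-r["sole_blocker"], -r["files_blocked"], r["signature"]))
    return rows


# Hypothetical candidates for illustration only.
candidates = [
    {"signatures": ["drop:a"]},
    {"signatures": ["drop:a", "drop:b"]},
    {"signatures": ["drop:b", "drop:c"]},
    {"signatures": ["drop:a"]},
]
for row in rank_worklist(candidates):
    print(row)
```

Here `drop:b` appears in two files but is never the only signature, so fixing it alone flips nothing green: the "enabling, lower immediate yield" case from item 6.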

Be honest about caveats: the comparison is strict full-DOM, and if the pinned baseline
(`roundtrip-baseline.txt`) is ungrown, almost the whole corpus shows as failing. That is the raw
landscape, not a regression.

## Hand-off Fixes (if requested)

- To fix a dropped/under-supported element or a crash: use the `add-mx-api-feature` skill.
- To correct or extend support levels in `data/api.features.xml`: use the `api-feature-audit` skill.
- The findings belong under the tracking issue #208; file specifics with the `open-mx-issue` skill.
27 changes: 17 additions & 10 deletions audit/README.md
@@ -41,21 +41,28 @@ common case (a new corpus file was added) only writes the new sidecar. Use

```
make dump-api-roundtrip # C++: write normalized expected/actual XML pairs
make classify-api-roundtrip # Python: classify those failures by root cause
make classify-api-roundtrip # Python: diff each pair, rank a worklist

python3 -m audit classify <dump_dir> [--data DIR] [--out FILE]
```

`classify` reads the dump directory produced by `make dump-api-roundtrip`
(`build/api/roundtrip-dump/`), diffs each expected/actual pair as an order-free
element **multiset** (`Counter(expected) - Counter(actual)`), cross-references
`data/api.features.xml`, and assigns each non-passing file a root-cause category
(drop-only, reorder-only, enum bug, missing attribute, pipeline error). It writes
`build/api/classified.json` and prints a worklist of the features blocking the
most files. The two steps are kept separate: dumping is slow (runs the C++
pipeline over the whole corpus), classifying is fast (pure Python), so the
classification logic can be iterated without re-dumping. See
`docs/ai/design/api-roundtrip-classifier.md`.
(`build/api/roundtrip-dump/`) and diffs each expected/actual pair structurally.
Drops/adds come from an order-free element **multiset**
(`Counter(expected) - Counter(actual)`); value/attribute/reorder differences come
from an alignment walk over the surviving structure. Each difference becomes a
**signature** (`drop:<tag>`, `add:<tag>`, `value:<tag>`, `attr:<tag>@<name>`,
`reorder:<parent>`), and a file's **distance** to passing is its count of unique
signatures. It writes `build/api/classified.json` and prints a worklist ranking
each signature by how many candidate files it is the sole blocker of.

Classification is purely **measured**: it does not consult `data/api.features.xml`
or any record of what `mx::api` was believed to "support" -- whether a drop is
intended is a present-day human call (#214), not something the classifier
asserts. `--data` is accepted for compatibility but unused. The two steps are
kept separate: dumping is slow (runs the C++ pipeline over the whole corpus),
classifying is fast (pure Python), so classification can be iterated without
re-dumping. See `docs/ai/design/api-roundtrip-classifier.md`.
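
The order-free multiset step can be sketched as below. This is a minimal illustration that keys the multiset on bare element tags, which is an assumption; the real classifier may key on something richer, and the alignment walk for value/attribute/reorder differences is not shown. The function names are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_multiset(xml_text):
    """Count every element tag in the document, ignoring order and nesting."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())


def drop_add_signatures(expected_xml, actual_xml):
    """Derive drop:/add: signatures from Counter subtraction."""
    expected, actual = tag_multiset(expected_xml), tag_multiset(actual_xml)
    drops = expected - actual  # keeps only tags with a positive surplus in expected
    adds = actual - expected   # keeps only tags with a positive surplus in actual
    return sorted({f"drop:{t}" for t in drops} | {f"add:{t}" for t in adds})


expected = "<note><pitch/><stem/><stem/></note>"
actual = "<note><pitch/><stem/><beam/></note>"
print(drop_add_signatures(expected, actual))
```

Counter subtraction discards zero and negative counts, which is why a tag that appears equally often on both sides produces no signature.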

## Tests

6 changes: 3 additions & 3 deletions audit/__main__.py
@@ -6,8 +6,8 @@
python3 -m audit corpus (re)build data/corpus.xml from the corpus
python3 -m audit all [--force] run `files` then `corpus`
python3 -m audit classify <dump_dir> [--data DIR] [--out FILE]
classify api round-trip failures by root
cause from a dump directory (see #211)
diff api round-trip dumps and rank a worklist
(see #211/#212; --data is unused)

See audit/README.md. The audited set mirrors the corert round-trip suite.
"""
@@ -67,7 +67,7 @@ def main(argv: list[str]) -> int:
p_all.add_argument("--force", action="store_true", help="overwrite existing sidecars")

p_classify = sub.add_parser(
"classify", help="classify api round-trip failures from a dump directory"
"classify", help="diff api round-trip dumps and rank a worklist"
)
p_classify.add_argument("dump_dir", help="directory of *.expected.xml/*.actual.xml dumps")
p_classify.add_argument(