diff --git a/.claude/skills/explain-api-roundtrip/SKILL.md b/.claude/skills/explain-api-roundtrip/SKILL.md index c8caf15d..87f8a709 100644 --- a/.claude/skills/explain-api-roundtrip/SKILL.md +++ b/.claude/skills/explain-api-roundtrip/SKILL.md @@ -3,44 +3,47 @@ name: explain-api-roundtrip description: > Use this skill to explain, in plain language, what is wrong with the `mx::api` round-trip and what it needs next. It drives the failure classifier (dump -> classify) over the corpus, then reads - build/api/classified.json and turns it into a prioritized, human-readable explanation grouped by - failure mode (hard crash, dropped supported elements, reorder, by-design drop, audit blind spot). + build/api/classified.json and turns it into a prioritized, human-readable worklist grouped by + failure shape (crashes, instant wins, small fix-sets, reorder-blocked, high-frequency drops). Invoke for requests like "what's broken about mx::api", "explain the round-trip failures", "what does the api need next", or "triage the api round-trip". -argument-hint: "" +argument-hint: "" disable-model-invocation: false user-invocable: true --- # Explain the `mx::api` round-trip -`mx::api` is a deliberate subset of MusicXML, so some round-trip data loss is by design. This skill -separates the by-design losses from the real defects and produces a plain-English answer to two -questions: what is broken, and what should we fix next. +`mx::api` is a deliberate subset of MusicXML, and the comparison is strict full-DOM, so most corpus +files diverge somewhere. This skill turns the measured divergences into a plain-English answer to two +questions: what is broken, and what should we fix next to land the most files into the round-trip +corpus with the fewest software changes. -It is the read-out layer on top of the failure classifier (issue #211): the classifier produces a -machine-readable `build/api/classified.json`; this skill interprets it for a human. +It is the read-out layer on top of the failure classifier (issue #211/#212): the classifier produces +a machine-readable `build/api/classified.json`; this skill interprets it for a human. -## How it works +## What the classifier reports (and what it does not) -The pipeline has two steps, kept separate on purpose (see `audit/README.md`): +Classification is purely **measured** from each expected/actual pair. It does **not** consult +`data/api.features.xml` or any record of what `mx::api` was believed to "support" — whether a drop is +intended is a present-day human decision (#214), not something the tool asserts. So do not describe +any drop as "by-design" on the classifier's authority; report what was dropped and let the human +decide. -1. `make dump-api-roundtrip` — runs the api pipeline over every corpus file and writes the - normalized expected/actual XML pairs to `build/api/roundtrip-dump/`. Slow: it builds the C++ - harness and runs ~800 files. Re-run only when api/impl code changed. -2. `make classify-api-roundtrip` — pure Python, fast. Diffs each pair as an element multiset - (`Counter(expected) - Counter(actual)`), cross-references `data/api.features.xml`, and writes - `build/api/classified.json` plus a stdout summary. +Each difference is a **signature**, and a file's **distance** to passing is its count of unique +signatures: -The categories in the JSON (`primary_category`): +| signature | meaning | +|-----------|---------| +| `drop:` | a tag in expected, missing from actual | +| `add:` | a spurious tag only in actual (a bug) | +| `value:` | a paired element whose text value differs | +| `value:@` | a paired element whose attribute value differs | +| `attr:@` | an attribute present on only one side | +| `reorder:` | a parent with matching children in a different order | -| id | meaning | -|----|---------| -| B | drop-only: every missing element is `support="none"` — a by-design subset drop | -| C | reorder-only: same elements, different order | -| D | enum bug: a value maps to a known-missing enum member | -| E | missing attribute: a `partial` feature dropped one attribute | -| F | pipeline error: LOADFAIL/GETDATAFAIL/CREATEFAIL — no output produced (a crash) | -| unknown | a FAIL that matched none of the above — usually a `support="full"` element that was dropped (a real bug) or an element not tracked in `api.features.xml` | +Per-file `status` is `PASS` / `FAIL` / `CRASH`. A `FAIL` with no `reorder:` signature is a +**candidate**: the first-pass worklist targets these, since reorders are expected `mx::api` behavior +to be absorbed in test normalization later (#214). ## Procedure @@ -58,110 +61,106 @@ Then always run: make classify-api-roundtrip ``` -Read the stdout summary it prints — that is the top-level shape (counts per category + the worklist -of features blocking the most files). +Read the stdout summary it prints — status counts, the distance histogram, and the ranked worklist. +That worklist *is* the headline answer to "what should we fix next." ### Step 2 — mine `build/api/classified.json` -Run these read-only analyses (they join the classifier output against the support index). Adjust the -path if `--out` was overridden. +These read-only analyses expand on the stdout summary. Adjust the path if `--out` was overridden. -Top dropped elements, with their audited support level (the key signal — `support="full"` drops are -bugs, `support="none"` drops are by design): +The worklist — signatures ranked by candidate files unblocked (`sole_blocker` = files this fix flips +green on its own; `files_blocked` = candidate files that include it): ``` python3 - <<'PY' -import json, re -from collections import Counter +import json +d = json.load(open("build/api/classified.json")) +for row in d["worklist"][:25]: + print(f"{row['sole_blocker']:>4} sole {row['files_blocked']:>5} total {row['signature']}") +PY +``` + +Instant wins — candidate files one fix away from passing (distance 1): + +``` +python3 - <<'PY' +import json d = json.load(open("build/api/classified.json")) -support = {m.group(1): m.group(2) for m in - re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())} -miss = Counter() -for r in d["files"]: - for tag in r["missing_element_counts"]: - miss[tag] += 1 # files affected -for tag, n in miss.most_common(25): - print(f"{n:>4} files {tag:<18} support={support.get(tag, 'NOT-IN-INDEX')}") +for f in d["near_misses"]["1"]: + print(f["signatures"][0], f["file"]) PY ``` -Pipeline-error (crash) cluster — group by file/feature to find the common root: +Small fix-sets — files that pass once a handful of features land (distance 2–3); the union of their +signatures is a high-yield batch: ``` python3 - <<'PY' import json +from collections import Counter d = json.load(open("build/api/classified.json")) -for r in d["files"]: - if r["primary_category"] == "F": - print(r["pipeline_error_kind"], r["file"]) +for dist in ("2", "3"): + sigs = Counter(s for f in d["near_misses"][dist] for s in f["signatures"]) + print(f"distance {dist}: {len(d['near_misses'][dist])} files; signatures {sigs.most_common(10)}") PY ``` -Reorder cluster — where in the tree the order diverges: +Crash cluster (highest severity — no output at all) — group by kind to find the common root: ``` python3 - <<'PY' import json from collections import Counter d = json.load(open("build/api/classified.json")) -paths = Counter(r["first_divergence_path"] for r in d["files"] if r["primary_category"] == "C") -for path, n in paths.most_common(10): - print(f"{n:>4} {path}") +crashes = [(r["crash_kind"], r["file"]) for r in d["files"] if r["status"] == "CRASH"] +print(Counter(k for k, _ in crashes)) +for kind, f in crashes[:20]: + print(kind, f) PY ``` -What is driving the `unknown` bucket (the warnings on stderr from Step 1 also list this; this is the -programmatic view): +Reorder-blocked files (deferred to #214) — how many, and at which parents: ``` python3 - <<'PY' -import json, re +import json from collections import Counter d = json.load(open("build/api/classified.json")) -support = {m.group(1): m.group(2) for m in - re.finditer(r'name="([^"]+)" support="([a-z]+)"', open("data/api.features.xml").read())} -full_drop, untracked = Counter(), Counter() -for r in d["files"]: - if r["primary_category"] != "unknown": - continue - for tag in r["missing_elements"]: - s = support.get(tag) - if s in ("full", "partial"): - full_drop[tag] += 1 # claimed supported but dropped -> bug - elif s is None: - untracked[tag] += 1 # not in api.features.xml -> audit gap -print("supported-but-dropped:", full_drop.most_common(10)) -print("untracked:", untracked.most_common(10)) +reorders = Counter(s for r in d["files"] if r["has_reorder"] + for s in r["signatures"] if s.startswith("reorder:")) +print("reorder-blocked files:", sum(1 for r in d["files"] if r["has_reorder"])) +print(reorders.most_common(10)) PY ``` -To drill into one file, look at its record (`missing_elements`, `mismatch_type`, -`first_divergence_path`) and diff the pair directly: +To drill into one file, look at its record (`signatures`, `sample_paths`, `distance`) and diff the +pair directly: `diff build/api/roundtrip-dump/.expected.xml build/api/roundtrip-dump/.actual.xml` where `` is the corpus path with `/` replaced by `__`. ### Step 3 — write the explanation -Synthesize the findings into plain language grouped by failure mode, ordered by severity. Use this -structure (fill the numbers and element names from Step 2; do not invent them): - -1. Frame it: `mx::api` is a subset, so some loss is by design — separate that from the real defects. -2. Hard crashes (category F). Highest severity: no output at all. Name the cluster (the crash - analysis usually points at one feature). These are bugs. -3. Dropped supported elements (the `support="full"`/`partial` rows from the top-dropped table and the - `supported-but-dropped` view). Either an impl round-trip bug or `api.features.xml` overstates - support — say which needs checking, per element. -4. Reorder (category C). Lower severity: content intact, order wrong. Name the divergence path. -5. By-design drops (category B): mention briefly — these are expected subset behavior, not bugs. -6. Audit blind spots: the `untracked` view — elements dropped but not in `api.features.xml`, so they - can't be categorized. Recommend running `api-feature-audit` to close the gap. - -Then give a prioritized "what it needs" list. Be honest about caveats: the comparison is strict -full-DOM, and if the pinned baseline (`roundtrip-baseline.txt`) is ungrown, almost the whole corpus -shows as failing — these are the raw landscape, not a regression. +Synthesize the findings into plain language, ordered by what grows the corpus fastest. Use this +structure (fill the numbers and signatures from Step 2; do not invent them): + +1. Frame it: strict full-DOM, so a file passes only when *every* signature is resolved. The goal is + to land files with the fewest software changes. +2. Crashes (`status="CRASH"`). Highest severity: no output at all. Name the cluster (the crash kind + usually points at one feature). These are bugs. +3. Instant wins: the distance-1 candidates and the top `sole_blocker` signatures — one fix each flips + a file green now. This is the front of the worklist. +4. Small fix-sets: the distance-2/3 candidates and the union of signatures that would unblock a batch + — "add these N features → these M files pass." +5. Reorder-blocked: count them, name the top `reorder:` parents, and note they are deferred to test + normalization (#214), not part of the first-pass worklist. +6. High-frequency drops that are *not* sole blockers: large `files_blocked` with low `sole_blocker` + means the fix helps many files but flips none alone — flag as enabling, lower immediate yield. + +Be honest about caveats: the comparison is strict full-DOM, and if the pinned baseline +(`roundtrip-baseline.txt`) is ungrown, almost the whole corpus shows as failing — these are the raw +landscape, not a regression. ## Hand-off Fixes (if requested) - To fix a dropped/under-supported element or a crash: use the `add-mx-api-feature` skill. -- To correct or extend support levels in `data/api.features.xml`: use the `api-feature-audit` skill. - The findings belong under the tracking issue #208; file specifics with the `open-mx-issue` skill. diff --git a/audit/README.md b/audit/README.md index 14f89745..7a444377 100644 --- a/audit/README.md +++ b/audit/README.md @@ -41,21 +41,28 @@ common case (a new corpus file was added) only writes the new sidecar. Use ``` make dump-api-roundtrip # C++: write normalized expected/actual XML pairs -make classify-api-roundtrip # Python: classify those failures by root cause +make classify-api-roundtrip # Python: diff each pair, rank a worklist python3 -m audit classify [--data DIR] [--out FILE] ``` `classify` reads the dump directory produced by `make dump-api-roundtrip` -(`build/api/roundtrip-dump/`), diffs each expected/actual pair as an order-free -element **multiset** (`Counter(expected) - Counter(actual)`), cross-references -`data/api.features.xml`, and assigns each non-passing file a root-cause category -(drop-only, reorder-only, enum bug, missing attribute, pipeline error). It writes -`build/api/classified.json` and prints a worklist of the features blocking the -most files. The two steps are kept separate: dumping is slow (runs the C++ -pipeline over the whole corpus), classifying is fast (pure Python), so the -classification logic can be iterated without re-dumping. See -`docs/ai/design/api-roundtrip-classifier.md`. +(`build/api/roundtrip-dump/`) and diffs each expected/actual pair structurally. +Drops/adds come from an order-free element **multiset** +(`Counter(expected) - Counter(actual)`); value/attribute/reorder differences come +from an alignment walk over the surviving structure. Each difference becomes a +**signature** (`drop:`, `add:`, `value:`, `attr:@`, +`reorder:`), and a file's **distance** to passing is its count of unique +signatures. It writes `build/api/classified.json` and prints a worklist ranking +each signature by how many candidate files it is the sole blocker of. + +Classification is purely **measured**: it does not consult `data/api.features.xml` +or any record of what `mx::api` was believed to "support" -- whether a drop is +intended is a present-day human call (#214), not something the classifier +asserts. `--data` is accepted for compatibility but unused. The two steps are +kept separate: dumping is slow (runs the C++ pipeline over the whole corpus), +classifying is fast (pure Python), so classification can be iterated without +re-dumping. See `docs/ai/design/api-roundtrip-classifier.md`. ## Tests diff --git a/audit/__main__.py b/audit/__main__.py index 743a70e2..b494aaa3 100644 --- a/audit/__main__.py +++ b/audit/__main__.py @@ -6,8 +6,8 @@ python3 -m audit corpus (re)build data/corpus.xml from the corpus python3 -m audit all [--force] run `files` then `corpus` python3 -m audit classify [--data DIR] [--out FILE] - classify api round-trip failures by root - cause from a dump directory (see #211) + diff api round-trip dumps and rank a worklist + (see #211/#212; --data is unused) See audit/README.md. The audited set mirrors the corert round-trip suite. """ @@ -67,7 +67,7 @@ def main(argv: list[str]) -> int: p_all.add_argument("--force", action="store_true", help="overwrite existing sidecars") p_classify = sub.add_parser( - "classify", help="classify api round-trip failures from a dump directory" + "classify", help="diff api round-trip dumps and rank a worklist" ) p_classify.add_argument("dump_dir", help="directory of *.expected.xml/*.actual.xml dumps") p_classify.add_argument( diff --git a/audit/classify.py b/audit/classify.py index e98a05e2..0efd7e83 100644 --- a/audit/classify.py +++ b/audit/classify.py @@ -1,32 +1,50 @@ -"""Classify api round-trip failures by root cause. +"""Classify api round-trip failures by *measured* divergence, and rank a worklist. Reads the dump directory produced by ``make dump-api-roundtrip`` (Phase 1, issue #210): pairs of ``.expected.xml`` / ``.actual.xml`` plus a -``.status`` sidecar for pipeline errors. Cross-references -``data/api.features.xml`` and the per-file ``*.features.xml`` sidecars, assigns -each non-passing file one primary root-cause category plus any secondaries, and -writes a machine-readable JSON report (consumed by Phase 3 ranking) with a -human-readable summary on stdout. - -The diff is **multiset-based**, not a positional walk. ``mx::api`` drops and -reorders subtrees by design, so the dominant signal is a *deletion*; a naive -index-by-index walk desynchronizes after the first drop and only its first -divergence is trustworthy. The multiset difference -``Counter(expected) - Counter(actual)`` enumerates **every** dropped element -class in O(n), fully reorder-invariant. See -``docs/ai/design/api-roundtrip-classifier.md`` for the full rationale. - -Categories: - - A already-passing expected/actual identical (defensive; not normally dumped) - B drop-only every missing element class has support="none" - C reorder-only same tag multiset, but a parent's child order differs - D enum-bug text/attr value is a known missing enum member - E missing-attribute a partial feature's attribute was dropped - F pipeline-error LOADFAIL / GETDATAFAIL / CREATEFAIL (no actual produced) - G supported-drop a dropped element class is marked support="full"/"partial" - (an impl round-trip bug or an api.features.xml overstatement) - unknown a FAIL that matched none of the above +``.status`` sidecar for crashes. Diffs each pair structurally, records the +set of divergences that keep the file from round-tripping, and writes a +machine-readable JSON report (consumed by Phase 3 ranking, #212) with a +human-readable worklist on stdout. + +What this is *not*: it does not consult ``data/api.features.xml`` or any other +record of what someone believed ``mx::api`` "supports". Whether a drop was +intended is a present-day human decision (issue #214), not a property the +classifier asserts. The classifier reports only what the round-trip actually did +to each file, so the worklist is grounded in measured behavior. + +## How a file is diffed + +The comparison is strict full-DOM (same rule as the api round-trip gate). A file +PASSes only when nothing diverges. Every divergence is reduced to a *signature*, +and a file's **distance** to passing is the count of *unique* signatures it has +(```` dropped a thousand times is one ``drop:foo`` signature, distance 1): + + drop: a tag present in expected, missing from actual + add: a spurious tag the api invented (present only in actual) + reorder: a parent whose child multiset matches but order differs + value: a paired element whose text value differs + value:@ a paired element whose attribute *value* differs + attr:@ a paired element with an attribute present on one side only + +Drops/adds come from a multiset difference (``Counter(expected) - Counter(actual)``) +which is O(n) and reorder-invariant, so it enumerates *every* dropped class, not +just the first (a positional walk desyncs after the first drop). value/attr/reorder +come from an alignment walk (``difflib.SequenceMatcher`` over child-tag sequences) +that pairs surviving elements across drops. See +``docs/ai/design/api-roundtrip-classifier.md``. + +## Statuses and the worklist + + PASS expected == actual (defensive; not normally dumped) + CRASH no actual produced -- LOADFAIL / GETDATAFAIL / CREATEFAIL + FAIL produced output that diverged + +A FAIL with no ``reorder:`` signature is a **candidate**: reorders are expected +``mx::api`` behavior to be absorbed in test normalization later (#214), so the +first-pass worklist targets files that need only feature additions. The worklist +ranks each signature by how many candidate files it is the *sole* blocker of +(fixing it flips those files green now), then by total candidate files it blocks. """ from __future__ import annotations @@ -36,80 +54,23 @@ class in O(n), fully reorder-invariant. See import sys import xml.etree.ElementTree as ET from collections import Counter -from dataclasses import dataclass, field +from dataclasses import dataclass from datetime import datetime, timezone from pathlib import Path -from . import FEATURES_SUFFIX, featuresfile - # Filename suffixes written by the dump harness (#210). _EXPECTED_SUFFIX = ".expected.xml" _ACTUAL_SUFFIX = ".actual.xml" _STATUS_SUFFIX = ".status" -# Pipeline-error status codes (no actual document was produced). -_PIPELINE_STATUSES = frozenset({"LOADFAIL", "GETDATAFAIL", "CREATEFAIL"}) - -# Human-readable category labels for the stdout summary. -_CATEGORY_LABELS = { - "A": "already passing", - "B": "drop-only divergence", - "C": "reorder-only divergence", - "D": "enum bug", - "E": "missing attribute/element", - "F": "pipeline error", - "G": "supported-element drop", - "unknown": "unknown", -} - -# Categories that are actionable feature gaps (ranked in the worklist). -_ACTIONABLE = frozenset({"B", "D", "E", "G"}) - - -# --------------------------------------------------------------------------- # -# api.features.xml support index -# --------------------------------------------------------------------------- # +# Crash status codes (no actual document was produced). +_CRASH_STATUSES = frozenset({"LOADFAIL", "GETDATAFAIL", "CREATEFAIL"}) +# Signature type -> human label, in worklist display order. +_SIGNATURE_TYPES = ("drop", "add", "value", "attr", "reorder") -@dataclass -class ApiFeature: - """One ```` from ``data/api.features.xml`` with support detail.""" - - name: str - support: str # "full" / "partial" / "none" / "" (unknown) - attributes: dict[str, str] = field(default_factory=dict) # attr name -> support - enum_missing: set[str] = field(default_factory=set) # normalized missing members - - -def _normalize_enum(value: str | None) -> str: - """Bridge XML kebab-case values and api/core camelCase enum symbols. - - ``sharp-sharp`` (XML text) and ``sharpSharp`` (enum member) both normalize to - ``sharpsharp`` so they compare equal. - """ - if not value: - return "" - return value.strip().lower().replace("-", "").replace("_", "") - - -def load_api_features(path: Path) -> dict[str, ApiFeature]: - """Parse ``api.features.xml`` into a name -> ApiFeature map.""" - features: dict[str, ApiFeature] = {} - root = ET.parse(path).getroot() - for feature in root.findall("./features/feature"): - name = feature.get("name", "") - if not name: - continue - feat = ApiFeature(name=name, support=feature.get("support", "")) - for attr in feature.findall("./attributes/attribute"): - aname = attr.get("name", "") - if aname: - feat.attributes[aname] = attr.get("support", "") - for enum in feature.findall("./enums/enum"): - for missing in enum.findall("./missing"): - feat.enum_missing.add(_normalize_enum(missing.text)) - features[name] = feat - return features +# How many distance buckets of near-miss files to surface. +_NEAR_MISS_MAX_DISTANCE = 3 # --------------------------------------------------------------------------- # @@ -155,7 +116,7 @@ def _entry(flat: str) -> DumpEntry: # --------------------------------------------------------------------------- # -# XML helpers and the positional first-divergence walk +# XML helpers # --------------------------------------------------------------------------- # @@ -167,7 +128,7 @@ def _local(tag: str) -> str: def tag_multiset(root: ET.Element) -> Counter: - """Count every element's local tag name in the tree (the multiset primitive).""" + """Count every element's local tag name in the tree (the drop/add primitive).""" return Counter(_local(el.tag) for el in root.iter()) @@ -181,78 +142,99 @@ def _values_equiv(left: str, right: str) -> bool: return False -@dataclass -class Divergence: - """The first positional divergence, retained for continuity (not completeness).""" +def _first_paths(root: ET.Element) -> dict[str, str]: + """Map each local tag to one example root-to-node path (first occurrence).""" + paths: dict[str, str] = {} + + def walk(el: ET.Element, path: str) -> None: + for child in el: + if not isinstance(child.tag, str): + continue + tag = _local(child.tag) + here = f"{path}/{tag}" + paths.setdefault(tag, here) + walk(child, here) + + root_tag = _local(root.tag) + paths.setdefault(root_tag, f"/{root_tag}") + walk(root, f"/{root_tag}") + return paths - mismatch_type: str # element-name / text / attribute / attribute-count / child-count - path: str - element: str # leaf local name at the divergence - expected_value: str | None = None - actual_value: str | None = None - attr_name: str | None = None - missing_attrs: list[str] = field(default_factory=list) - parent_expected_children: list[str] = field(default_factory=list) - parent_actual_children: list[str] = field(default_factory=list) + +# --------------------------------------------------------------------------- # +# Divergence collection +# --------------------------------------------------------------------------- # -def first_divergence(exp: ET.Element, act: ET.Element, path: str) -> Divergence | None: - """Find the first positional divergence, mirroring ``compareElements``. +def _add_signature(sigs: dict[str, str], sig: str, path: str | None) -> None: + """Record a signature, keeping the first example path seen for it.""" + if sig not in sigs: + sigs[sig] = path or "" - ``path`` includes ``exp`` itself. Tags of ``exp``/``act`` are assumed equal - (the caller pairs them); the root pair is the only entry point. + +def _diff_walk(exp: ET.Element, act: ET.Element, path: str, sigs: dict[str, str]) -> None: + """Collect value/attr/reorder signatures over the aligned, surviving structure. + + ``exp``/``act`` are a matched pair (same tag). Children are aligned by tag + sequence so the walk survives drops without desyncing; only tag-equal pairs + are recursed, so value/attr comparisons are never made across mismatched tags. + Drops and adds are handled separately by the global multiset, not here. """ + tag = _local(exp.tag) + et = (exp.text or "").strip() at = (act.text or "").strip() if not _values_equiv(et, at): - return Divergence("text", path, _local(exp.tag), expected_value=et, actual_value=at) + _add_signature(sigs, f"value:{tag}", path) ea = {_local(k): v for k, v in exp.attrib.items()} aa = {_local(k): v for k, v in act.attrib.items()} - for aname in sorted(set(ea) & set(aa)): - if not _values_equiv(ea[aname], aa[aname]): - return Divergence( - "attribute", path, _local(exp.tag), - expected_value=ea[aname], actual_value=aa[aname], attr_name=aname, - ) - missing_attrs = sorted(set(ea) - set(aa)) - extra_attrs = sorted(set(aa) - set(ea)) - if missing_attrs or extra_attrs: - return Divergence( - "attribute-count", path, _local(exp.tag), - attr_name=missing_attrs[0] if missing_attrs else None, - missing_attrs=missing_attrs, - ) + for name in sorted(set(ea) & set(aa)): + if not _values_equiv(ea[name], aa[name]): + _add_signature(sigs, f"value:{tag}@{name}", path) + for name in sorted(set(ea) ^ set(aa)): + _add_signature(sigs, f"attr:{tag}@{name}", path) ec = [c for c in exp if isinstance(c.tag, str)] ac = [c for c in act if isinstance(c.tag, str)] - exp_children = [_local(c.tag) for c in ec] - act_children = [_local(c.tag) for c in ac] - for i in range(min(len(ec), len(ac))): - if exp_children[i] != act_children[i]: - return Divergence( - "element-name", f"{path}/{exp_children[i]}", exp_children[i], - parent_expected_children=exp_children, parent_actual_children=act_children, - ) - deeper = first_divergence(ec[i], ac[i], f"{path}/{exp_children[i]}") - if deeper is not None: - return deeper - if len(ec) != len(ac): - return Divergence( - "child-count", path, _local(exp.tag), - parent_expected_children=exp_children, parent_actual_children=act_children, - ) - return None - - -def _is_permutation(a: list[str], b: list[str]) -> bool: - """True if the two child-tag sequences are a pure reorder (same multiset).""" - if Counter(a) != Counter(b) or a == b: - return False - # Confirm via difflib opcodes: a permutation has no shared-multiset 'replace' - # that adds/removes a tag (handled by the Counter check above); this just - # asserts the sequences genuinely differ in order. - return any(tag != "equal" for tag, *_ in difflib.SequenceMatcher(a=a, b=b).get_opcodes()) + exp_tags = [_local(c.tag) for c in ec] + act_tags = [_local(c.tag) for c in ac] + + if exp_tags != act_tags and Counter(exp_tags) == Counter(act_tags): + _add_signature(sigs, f"reorder:{tag}", path) + + matcher = difflib.SequenceMatcher(a=exp_tags, b=act_tags, autojunk=False) + for op, i1, i2, j1, j2 in matcher.get_opcodes(): + if op == "equal": + for k in range(i2 - i1): + child = ec[i1 + k] + _diff_walk(child, ac[j1 + k], f"{path}/{_local(child.tag)}", sigs) + + +def collect_signatures(exp_root: ET.Element, act_root: ET.Element) -> dict[str, str]: + """Return every divergence signature (-> example path) for one expected/actual pair.""" + sigs: dict[str, str] = {} + + # Drops/adds: complete, reorder-invariant inventory via the tag multiset. + missing = tag_multiset(exp_root) - tag_multiset(act_root) + added = tag_multiset(act_root) - tag_multiset(exp_root) + if missing: + exp_paths = _first_paths(exp_root) + for tag in missing: + _add_signature(sigs, f"drop:{tag}", exp_paths.get(tag)) + if added: + act_paths = _first_paths(act_root) + for tag in added: + _add_signature(sigs, f"add:{tag}", act_paths.get(tag)) + + # value/attr/reorder over the aligned surviving structure. + _diff_walk(exp_root, act_root, f"/{_local(exp_root.tag)}", sigs) + return sigs + + +def _signature_type(sig: str) -> str: + """The type prefix of a signature (``drop:foo`` -> ``drop``).""" + return sig.split(":", 1)[0] # --------------------------------------------------------------------------- # @@ -265,163 +247,55 @@ def _blank_record(rel: str) -> dict: return { "file": rel, "status": None, - "primary_category": None, - "secondary_categories": [], - "first_divergence_element": None, - "first_divergence_path": None, - "mismatch_type": None, - "missing_elements": [], - "missing_element_counts": {}, - "distinct_missing_count": 0, - "added_elements": [], - "is_single_blocker": False, - "total_missing_instances": 0, - "blocking_features": [], - "pipeline_error_kind": None, + "crash_kind": None, + "signatures": [], + "sample_paths": {}, + "signature_counts": {}, + "distance": 0, + "has_reorder": False, + "is_candidate": False, } -def classify_entry( - entry: DumpEntry, - api_features: dict[str, ApiFeature], - data_root: Path, - warn, -) -> dict: +def classify_entry(entry: DumpEntry, warn) -> dict: """Classify one dump entry into a per-file JSON record.""" rec = _blank_record(entry.rel) - # Pipeline error: only an expected file (and maybe a .status sidecar). + # Crash: only an expected file (and maybe a .status sidecar). if entry.actual is None: - kind = entry.status_code if entry.status_code in _PIPELINE_STATUSES else None - rec["status"] = kind or "PIPELINE_ERROR" - rec["primary_category"] = "F" - rec["pipeline_error_kind"] = kind + kind = entry.status_code if entry.status_code in _CRASH_STATUSES else None + rec["status"] = "CRASH" + rec["crash_kind"] = kind return rec if entry.expected is None: warn(f"{entry.rel}: actual present but no expected; skipping") rec["status"] = "FAIL" - rec["primary_category"] = "unknown" return rec - rec["status"] = "FAIL" try: exp_root = ET.parse(entry.expected).getroot() act_root = ET.parse(entry.actual).getroot() except ET.ParseError as exc: - warn(f"{entry.rel}: XML parse error ({exc}); marking unknown") - rec["primary_category"] = "unknown" + warn(f"{entry.rel}: XML parse error ({exc}); marking FAIL with no signatures") + rec["status"] = "FAIL" return rec - # Layer 1: multiset tag diff -- the complete, reorder-invariant inventory. - missing = tag_multiset(exp_root) - tag_multiset(act_root) - added = tag_multiset(act_root) - tag_multiset(exp_root) - rec["missing_elements"] = sorted(missing) - rec["missing_element_counts"] = {k: missing[k] for k in sorted(missing)} - rec["distinct_missing_count"] = len(missing) - rec["added_elements"] = sorted(added) - rec["total_missing_instances"] = sum(missing.values()) - rec["is_single_blocker"] = len(missing) == 1 - - # First positional divergence (continuity only; the harness reports the same). - div = first_divergence(exp_root, act_root, f"/{_local(exp_root.tag)}") - if div is not None: - rec["mismatch_type"] = div.mismatch_type - rec["first_divergence_element"] = div.element - rec["first_divergence_path"] = div.path - - # No divergence and nothing dropped/added -> the file actually round-trips. - if div is None and not missing and not added: + sigs = collect_signatures(exp_root, act_root) + if not sigs: rec["status"] = "PASS" - rec["primary_category"] = "A" return rec - def support_of(tag: str) -> str | None: - feat = api_features.get(tag) - return feat.support if feat else None - - cats: list[str] = [] - - # B -- drop-only: provable across the whole file via the multiset. - if missing and not added and all(support_of(t) == "none" for t in missing): - cats.append("B") - - # C -- reorder-only: same global multiset, a parent's child order differs. - if ( - not missing - and not added - and div is not None - and div.mismatch_type in ("element-name", "child-count") - and _is_permutation(div.parent_expected_children, div.parent_actual_children) - ): - cats.append("C") - - # D -- enum bug: a value mismatch that is a known missing enum member. - if div is not None and div.mismatch_type in ("text", "attribute"): - feat = api_features.get(div.element) - if ( - feat is not None - and feat.support == "partial" - and feat.enum_missing - and _normalize_enum(div.expected_value) in feat.enum_missing - ): - cats.append("D") - - # E -- missing attribute: a partial feature's modeled attribute was dropped. - if div is not None and div.mismatch_type in ("attribute", "attribute-count"): - feat = api_features.get(div.element) - if ( - feat is not None - and feat.support == "partial" - and div.attr_name is not None - and div.attr_name in feat.attributes - ): - cats.append("E") - - # G -- a dropped element class the audit marks support="full"/"partial". - # Either a genuine impl round-trip bug or an api.features.xml overstatement; - # both need human triage (issue #219). Without this the file falls through to - # "unknown", since B requires *every* dropped class to be support="none". - supported_missing = sorted(t for t in missing if support_of(t) in ("full", "partial")) - if supported_missing: - cats.append("G") - - # Primary = first match in priority order; the rest are secondary. G is last - # so a precise enum/attribute finding still wins when one applies; otherwise - # a dropped supported element is surfaced instead of hidden in "unknown". - primary = next((c for c in ("B", "C", "D", "E", "G") if c in cats), None) - if primary is None: - warn(f"{entry.rel}: unclassified FAIL (missing={rec['missing_elements']}, " - f"mismatch={rec['mismatch_type']})") - rec["primary_category"] = "unknown" - else: - rec["primary_category"] = primary - rec["secondary_categories"] = [c for c in cats if c != primary] - - # Blocking features: what, if fully supported, would unblock this file. - if primary == "B": - rec["blocking_features"] = sorted(missing) - elif primary == "G": - rec["blocking_features"] = supported_missing - elif primary in ("D", "E") and div is not None and div.element: - rec["blocking_features"] = [div.element] - - # Cross-reference the per-file sidecar: a missing or unreadable sidecar is a - # warning, never an abort. When present, sanity-check that the divergence - # element really is part of the file's surface (catches namespace/local-name - # drift between the dump XML and the audited surface). - sidecar = (data_root / entry.rel).with_suffix("") - sidecar = sidecar.with_name(sidecar.name + FEATURES_SUFFIX) - if not sidecar.exists(): - warn(f"{entry.rel}: no feature sidecar at {sidecar}; cross-reference skipped") - else: - try: - surface = featuresfile.read(sidecar) - if div is not None and div.element and div.element not in surface.elements: - warn(f"{entry.rel}: divergence element '{div.element}' not in file surface") - except ET.ParseError as exc: - warn(f"{entry.rel}: unreadable feature sidecar ({exc}); cross-reference skipped") - + signatures = sorted(sigs) + rec["status"] = "FAIL" + rec["signatures"] = signatures + rec["sample_paths"] = {s: sigs[s] for s in signatures if sigs[s]} + rec["signature_counts"] = dict( + sorted(Counter(_signature_type(s) for s in signatures).items()) + ) + rec["distance"] = len(signatures) + rec["has_reorder"] = any(_signature_type(s) == "reorder" for s in signatures) + rec["is_candidate"] = not rec["has_reorder"] return rec @@ -430,85 +304,165 @@ def support_of(tag: str) -> str | None: # --------------------------------------------------------------------------- # -def _rank_blocking_features(records: list[dict]) -> list[tuple[str, int, int]]: - """Rank actionable (B/D/E) blocking features by files unblocked. +def build_worklist(candidates: list[dict]) -> list[dict]: + """Rank signatures by candidate files unblocked, sole-blocker first. - Returns (feature, files_unblocked, single_blocker_count) sorted descending by - files unblocked, then by single-blocker count, then name. + ``files_blocked`` is the number of candidate files carrying the signature; + ``sole_blocker`` is the number whose *only* divergence is that signature + (distance 1) -- fixing it flips those files green immediately. """ - files = Counter() - single = Counter() - for rec in records: - if rec["primary_category"] not in _ACTIONABLE: - continue - blockers = rec["blocking_features"] - for feat in set(blockers): - files[feat] += 1 - if rec["is_single_blocker"] and len(blockers) == 1: - single[blockers[0]] += 1 - ranked = [(f, files[f], single[f]) for f in files] - ranked.sort(key=lambda t: (-t[1], -t[2], t[0])) - return ranked - - -def build_report( - dump_dir: Path, data_root: Path, api_features_path: Path, warn -) -> dict: + files: Counter = Counter() + sole: Counter = Counter() + for rec in candidates: + for sig in rec["signatures"]: + files[sig] += 1 + if rec["distance"] == 1: + sole[rec["signatures"][0]] += 1 + rows = [ + { + "signature": sig, + "type": _signature_type(sig), + "files_blocked": files[sig], + "sole_blocker": sole[sig], + } + for sig in files + ] + rows.sort(key=lambda r: (-r["sole_blocker"], -r["files_blocked"], r["signature"])) + return rows + + +def build_batch_plan(candidates: list[dict], max_steps: int = 15) -> list[dict]: + """Greedy set-cover: the fix sequence that lands the most files the soonest. + + The signature-level worklist ranks fixes independently, but most candidate + files need *several* fixes before they pass. This answers the #212 question + directly -- "minimal changes -> most files" -- by greedily choosing, at each + step, the signature that maximizes the number of candidate files whose + *entire* remaining signature set is then cleared. Each row's + ``cumulative_files`` is how many candidate files fully pass once every fix up + to and including that row is made. + """ + sigsets = [set(r["signatures"]) for r in candidates] + all_sigs = {s for ss in sigsets for s in ss} + fixed: set[str] = set() + plan: list[dict] = [] + prev = 0 + for _ in range(max_steps): + best_sig, best_total = None, prev + for sig in sorted(all_sigs - fixed): # sorted: deterministic tie-break + trial = fixed | {sig} + total = sum(1 for ss in sigsets if ss <= trial) + if total > best_total: + best_total, best_sig = total, sig + if best_sig is None: # no remaining fix lands another file + break + fixed.add(best_sig) + plan.append({ + "fix": best_sig, + "added_files": best_total - prev, + "cumulative_files": best_total, + }) + prev = best_total + return plan + + +def _near_misses(candidates: list[dict]) -> dict[str, list[dict]]: + """Group candidate files by distance (1..N) so small fix-sets are visible.""" + out: dict[str, list[dict]] = {} + for dist in range(1, _NEAR_MISS_MAX_DISTANCE + 1): + bucket = [ + {"file": r["file"], "signatures": r["signatures"]} + for r in candidates + if r["distance"] == dist + ] + bucket.sort(key=lambda d: d["file"]) + out[str(dist)] = bucket + return out + + +def build_report(dump_dir: Path, warn) -> dict: """Run classification over the dump directory and return the JSON report.""" - api_features = load_api_features(api_features_path) entries = parse_dump_dir(dump_dir) - records = [classify_entry(e, api_features, data_root, warn) for e in entries] + records = [classify_entry(e, warn) for e in entries] + + candidates = [r for r in records if r["status"] == "FAIL" and r["is_candidate"]] + by_status: Counter = Counter(r["status"] for r in records) + distance_hist = Counter(r["distance"] for r in candidates) - by_category: Counter = Counter(r["primary_category"] for r in records) return { "dump_dir": str(dump_dir.resolve()), - "data_root": str(data_root.resolve()), "generated": datetime.now(timezone.utc).isoformat(), "summary": { "total": len(records), - "by_category": dict(sorted(by_category.items())), + "by_status": dict(sorted(by_status.items())), + "candidates": len(candidates), + "reorder_blocked": sum( + 1 for r in records if r["status"] == "FAIL" and r["has_reorder"] + ), + "distance_histogram": {str(k): distance_hist[k] for k in sorted(distance_hist)}, }, + "worklist": build_worklist(candidates), + "batch_plan": build_batch_plan(candidates), + "near_misses": _near_misses(candidates), "files": records, } def print_summary(report: dict, out_path: Path) -> None: - """Print the human-readable summary to stdout.""" - records = report["files"] - counts = report["summary"]["by_category"] - total = report["summary"]["total"] - print(f"Classified {total} files from {report['dump_dir']}\n") - - for cat in ("A", "B", "C", "D", "E", "F", "G", "unknown"): - n = counts.get(cat, 0) - if n == 0 and cat == "A": - continue - marker = "?" if cat == "unknown" else cat - print(f" {marker} {_CATEGORY_LABELS[cat]:<28}{n:>4}") + """Print the human-readable worklist to stdout.""" + summary = report["summary"] + print(f"Classified {summary['total']} files from {report['dump_dir']}\n") + + for status in ("PASS", "FAIL", "CRASH"): + n = summary["by_status"].get(status, 0) + print(f" {status:<8}{n:>5}") + print( + f"\n candidates (FAIL, no reorder): {summary['candidates']}" + f" | reorder-blocked: {summary['reorder_blocked']}" + ) + + hist = summary["distance_histogram"] + if hist: + print("\nDistance to passing (candidate files; #unique fixes needed):") + for dist, n in hist.items(): + print(f" {dist:>3} fix(es){n:>6} files") + + worklist = report["worklist"] + if worklist: + print("\nWorklist (signatures ranked by candidate files unblocked):") + print(f" {'signature':<28}{'sole':>6}{'total':>7}") + for row in worklist[:20]: + print( + f" {row['signature']:<28}{row['sole_blocker']:>6}{row['files_blocked']:>7}" + ) - ranked = _rank_blocking_features(records) - if ranked: - print("\nTop blocking features (ranked by files unblocked; B+D+E+G):") - for feat, files, single in ranked[:15]: - print(f" {feat:<24}{files:>4} files ({single} single-blocker)") + plan = report["batch_plan"] + if plan: + print("\nBatch plan (greedy: fewest fixes -> most candidate files passing):") + for i, row in enumerate(plan, 1): + print( + f" {i:>2}. +{row['fix']:<28}" + f"{row['cumulative_files']:>5} pass (+{row['added_files']})" + ) print(f"\nOutput: {out_path}") def run(dump_dir: Path, data_root: Path, out_path: Path) -> int: - """Entry point for the ``classify`` subcommand.""" + """Entry point for the ``classify`` subcommand. + + ``data_root`` is accepted for CLI compatibility but unused: classification is + derived entirely from the dump pair, never from an external support index. + """ + del data_root if not dump_dir.is_dir(): print(f"classify: dump dir not found: {dump_dir}", file=sys.stderr) return 2 - api_features_path = data_root / "api.features.xml" - if not api_features_path.exists(): - print(f"classify: api.features.xml not found: {api_features_path}", file=sys.stderr) - return 2 def warn(msg: str) -> None: print(f"classify: warning: {msg}", file=sys.stderr) - report = build_report(dump_dir, data_root, api_features_path, warn) + report = build_report(dump_dir, warn) out_path.parent.mkdir(parents=True, exist_ok=True) out_path.write_text(json.dumps(report, indent=2) + "\n", encoding="utf-8") diff --git a/audit/tests/test_classify.py b/audit/tests/test_classify.py index e80e3da9..4cc7a119 100644 --- a/audit/tests/test_classify.py +++ b/audit/tests/test_classify.py @@ -1,8 +1,10 @@ -"""Tests for the api round-trip failure classifier (audit/classify.py, #211). +"""Tests for the api round-trip failure classifier (audit/classify.py, #211/#212). Builds synthetic dump directories that mirror the #210 dump format and asserts -the classifier's category assignment, the multiset diff (the core fix: it must -enumerate *every* missing element class, not just the first), and the ranking. +the measured-divergence model: the signature set per file, the distance metric +(unique signatures; a tag dropped many times still counts once), the reorder +exclusion from candidates, and the sole-blocker worklist ranking. The classifier +consults no support index -- nothing here writes ``api.features.xml``. Run with: python3 -m unittest audit.tests.test_classify """ @@ -15,38 +17,6 @@ from audit import classify -# A small api.features.xml fixture with controlled support levels. -API_FEATURES = """ - - mx::api - 4.0 - - - - - - - - - - - - - - - sharpSharp - - - - - - - - - - -""" - def _wrap(body: str) -> str: return f"{body}" @@ -55,12 +25,8 @@ def _wrap(body: str) -> str: class ClassifierTest(unittest.TestCase): def setUp(self) -> None: self._tmp = tempfile.TemporaryDirectory() - self.root = Path(self._tmp.name) - self.data = self.root / "data" - self.dump = self.root / "dump" - self.data.mkdir() + self.dump = Path(self._tmp.name) / "dump" self.dump.mkdir() - (self.data / "api.features.xml").write_text(API_FEATURES, encoding="utf-8") self.warnings: list[str] = [] def tearDown(self) -> None: @@ -77,17 +43,14 @@ def _status(self, rel: str, code: str) -> None: (self.dump / f"{flat}.status").write_text(code + "\n", encoding="utf-8") def _classify(self) -> dict[str, dict]: - report = classify.build_report( - self.dump, self.data, self.data / "api.features.xml", self.warnings.append - ) - self.report = report - return {r["file"]: r for r in report["files"]} + self.report = classify.build_report(self.dump, self.warnings.append) + return {r["file"]: r for r in self.report["files"]} - # --- individual categories ------------------------------------------- # + # --- drops ----------------------------------------------------------- # - def test_drop_only_enumerates_all_missing(self) -> None: + def test_drop_enumerates_all_missing(self) -> None: # credit AND defaults are both dropped; a positional walk would only see - # the first. The multiset must report both. + # the first. The multiset must report both, as two drop signatures. self._pair( "wild/drop.xml", _wrap("ab" @@ -95,137 +58,197 @@ def test_drop_only_enumerates_all_missing(self) -> None: _wrap("C"), ) rec = self._classify()["wild/drop.xml"] - self.assertEqual(rec["primary_category"], "B") - self.assertEqual(rec["missing_elements"], ["credit", "defaults"]) - self.assertEqual(rec["distinct_missing_count"], 2) - self.assertFalse(rec["is_single_blocker"]) - self.assertEqual(rec["blocking_features"], ["credit", "defaults"]) + self.assertEqual(rec["status"], "FAIL") + self.assertEqual(rec["signatures"], ["drop:credit", "drop:defaults"]) + self.assertEqual(rec["distance"], 2) + self.assertTrue(rec["is_candidate"]) - def test_drop_only_single_blocker(self) -> None: + def test_repeated_drop_counts_once(self) -> None: + # dropped three times is still distance 1 (unique signatures). self._pair( - "wild/single.xml", - _wrap("aC"), - _wrap("C"), + "wild/repeat.xml", + _wrap("abc"), + _wrap(""), ) - rec = self._classify()["wild/single.xml"] - self.assertEqual(rec["primary_category"], "B") - self.assertEqual(rec["distinct_missing_count"], 1) - self.assertTrue(rec["is_single_blocker"]) + rec = self._classify()["wild/repeat.xml"] + self.assertEqual(rec["signatures"], ["drop:credit"]) + self.assertEqual(rec["distance"], 1) - def test_reorder_only(self) -> None: - self._pair( - "wild/reorder.xml", - _wrap("12"), - _wrap("21"), + def test_drop_records_sample_path(self) -> None: + self._pair("wild/path.xml", + _wrap("a"), _wrap("")) + rec = self._classify()["wild/path.xml"] + self.assertEqual( + rec["sample_paths"]["drop:credit"], + "/score-partwise/part/measure/credit", ) - rec = self._classify()["wild/reorder.xml"] - self.assertEqual(rec["primary_category"], "C") - self.assertEqual(rec["missing_elements"], []) - self.assertEqual(rec["added_elements"], []) - def test_enum_bug(self) -> None: + # --- value / attr ---------------------------------------------------- # + + def test_value_mismatch(self) -> None: self._pair( - "wild/enum.xml", + "wild/value.xml", _wrap("sharp-sharp"), _wrap("sharp"), ) - rec = self._classify()["wild/enum.xml"] - self.assertEqual(rec["primary_category"], "D") - self.assertEqual(rec["mismatch_type"], "text") - self.assertEqual(rec["first_divergence_element"], "accidental") + rec = self._classify()["wild/value.xml"] + self.assertEqual(rec["signatures"], ["value:accidental"]) + self.assertEqual(rec["signature_counts"], {"value": 1}) + + def test_value_survives_a_sibling_drop(self) -> None: + # A dropped sibling must not desync the walk: the value diff on the + # surviving is still found alongside the drop. + self._pair( + "wild/both.xml", + _wrap("aC"), + _wrap("D"), + ) + rec = self._classify()["wild/both.xml"] + self.assertEqual(rec["signatures"], ["drop:credit", "value:step"]) - def test_missing_attribute(self) -> None: + def test_attribute_dropped(self) -> None: self._pair( "wild/attr.xml", _wrap(''), _wrap(""), ) rec = self._classify()["wild/attr.xml"] - self.assertEqual(rec["primary_category"], "E") - self.assertEqual(rec["mismatch_type"], "attribute-count") + self.assertEqual(rec["signatures"], ["attr:dynamics@placement"]) - def test_supported_element_drop(self) -> None: - # backup is support="full" but vanishes. That is not category B (which - # needs *every* drop to be support="none"), so it must surface as G - # rather than fall through to "unknown". + def test_attribute_value_mismatch(self) -> None: self._pair( - "wild/supdrop.xml", - _wrap("1C"), - _wrap("C"), + "wild/attrval.xml", + _wrap(''), + _wrap(''), ) - rec = self._classify()["wild/supdrop.xml"] - self.assertEqual(rec["primary_category"], "G") - self.assertEqual(rec["missing_elements"], ["backup"]) - self.assertEqual(rec["blocking_features"], ["backup"]) - self.assertTrue(rec["is_single_blocker"]) - - def test_mixed_supported_and_none_drop_is_g(self) -> None: - # A supported drop (backup=full) mixed with an unsupported drop - # (credit=none) is G, not B -- and only the supported tag is a blocker. + rec = self._classify()["wild/attrval.xml"] + self.assertEqual(rec["signatures"], ["value:dynamics@placement"]) + + # --- reorder --------------------------------------------------------- # + + def test_reorder_is_not_a_candidate(self) -> None: self._pair( - "wild/mixed.xml", - _wrap("1c"), - _wrap(""), + "wild/reorder.xml", + _wrap("12"), + _wrap("21"), ) - rec = self._classify()["wild/mixed.xml"] - self.assertEqual(rec["primary_category"], "G") - self.assertEqual(rec["missing_elements"], ["backup", "credit"]) - self.assertEqual(rec["blocking_features"], ["backup"]) - self.assertEqual(rec["secondary_categories"], []) + rec = self._classify()["wild/reorder.xml"] + self.assertEqual(rec["signatures"], ["reorder:measure"]) + self.assertTrue(rec["has_reorder"]) + self.assertFalse(rec["is_candidate"]) + + # --- crashes --------------------------------------------------------- # - def test_pipeline_error_with_status(self) -> None: + def test_crash_with_status(self) -> None: self._pair("wild/load.xml", _wrap(""), None) self._status("wild/load.xml", "LOADFAIL") rec = self._classify()["wild/load.xml"] - self.assertEqual(rec["primary_category"], "F") - self.assertEqual(rec["status"], "LOADFAIL") - self.assertEqual(rec["pipeline_error_kind"], "LOADFAIL") + self.assertEqual(rec["status"], "CRASH") + self.assertEqual(rec["crash_kind"], "LOADFAIL") + self.assertFalse(rec["is_candidate"]) - def test_pipeline_error_without_status(self) -> None: + def test_crash_without_status(self) -> None: self._pair("wild/nostatus.xml", _wrap(""), None) rec = self._classify()["wild/nostatus.xml"] - self.assertEqual(rec["primary_category"], "F") - self.assertEqual(rec["status"], "PIPELINE_ERROR") - self.assertIsNone(rec["pipeline_error_kind"]) + self.assertEqual(rec["status"], "CRASH") + self.assertIsNone(rec["crash_kind"]) - def test_unknown_full_support_mismatch(self) -> None: - self._pair( - "wild/unknown.xml", - _wrap("C"), - _wrap("D"), - ) - rec = self._classify()["wild/unknown.xml"] - self.assertEqual(rec["primary_category"], "unknown") + # --- pass ------------------------------------------------------------ # + + def test_identical_is_pass(self) -> None: + self._pair("wild/ok.xml", _wrap(""), _wrap("")) + rec = self._classify()["wild/ok.xml"] + self.assertEqual(rec["status"], "PASS") + self.assertEqual(rec["distance"], 0) + self.assertFalse(rec["is_candidate"]) + + def test_numeric_text_equivalent_is_pass(self) -> None: + self._pair("wild/num.xml", + _wrap("1.0"), _wrap("1")) + rec = self._classify()["wild/num.xml"] + self.assertEqual(rec["status"], "PASS") + + # --- record shape ---------------------------------------------------- # def test_every_record_has_all_fields(self) -> None: - self._pair("wild/single.xml", - _wrap("a"), _wrap("")) + self._pair("wild/x.xml", _wrap("a"), _wrap("")) rec = next(iter(self._classify().values())) for key in ( - "file", "status", "primary_category", "secondary_categories", - "first_divergence_element", "first_divergence_path", "mismatch_type", - "missing_elements", "missing_element_counts", "distinct_missing_count", - "added_elements", "is_single_blocker", "total_missing_instances", - "blocking_features", "pipeline_error_kind", + "file", "status", "crash_kind", "signatures", "sample_paths", + "signature_counts", "distance", "has_reorder", "is_candidate", ): self.assertIn(key, rec) - def test_missing_sidecar_warns_not_aborts(self) -> None: - self._pair("wild/single.xml", - _wrap("a"), _wrap("")) - self._classify() # no sidecars created -> warnings, no exception - self.assertTrue(any("no feature sidecar" in w for w in self.warnings)) + # --- worklist / aggregation ----------------------------------------- # - def test_ranking_orders_by_files_unblocked(self) -> None: - # credit blocks two files, defaults blocks one. + def test_worklist_ranks_sole_blocker_first(self) -> None: + # credit blocks 3 files (sole in all 3); defaults blocks 2 but is sole in + # only 1 (the other also drops bar). credit must rank first. self._pair("wild/a.xml", _wrap("x"), _wrap("")) self._pair("wild/b.xml", _wrap("x"), _wrap("")) - self._pair("wild/c.xml", _wrap("x"), _wrap("")) - records = list(self._classify().values()) - ranked = classify._rank_blocking_features(records) - self.assertEqual(ranked[0][0], "credit") - self.assertEqual(ranked[0][1], 2) # files unblocked - self.assertEqual(ranked[0][2], 2) # single-blocker count + self._pair("wild/c.xml", _wrap("x"), _wrap("")) + self._pair("wild/d.xml", _wrap("x"), _wrap("")) + self._pair("wild/e.xml", + _wrap("xy"), _wrap("")) + self._classify() + worklist = {r["signature"]: r for r in self.report["worklist"]} + self.assertEqual(self.report["worklist"][0]["signature"], "drop:credit") + self.assertEqual(worklist["drop:credit"]["sole_blocker"], 3) + self.assertEqual(worklist["drop:credit"]["files_blocked"], 3) + self.assertEqual(worklist["drop:defaults"]["sole_blocker"], 1) + self.assertEqual(worklist["drop:defaults"]["files_blocked"], 2) + + def test_reorder_excluded_from_worklist(self) -> None: + self._pair("wild/r.xml", + _wrap("12"), + _wrap("21")) + self._classify() + self.assertEqual(self.report["worklist"], []) + self.assertEqual(self.report["summary"]["reorder_blocked"], 1) + self.assertEqual(self.report["summary"]["candidates"], 0) + + def test_near_misses_grouped_by_distance(self) -> None: + self._pair("wild/one.xml", _wrap("x"), _wrap("")) + self._pair("wild/two.xml", + _wrap("xy"), + _wrap("")) + self._classify() + nm = self.report["near_misses"] + self.assertEqual([f["file"] for f in nm["1"]], ["wild/one.xml"]) + self.assertEqual([f["file"] for f in nm["2"]], ["wild/two.xml"]) + self.assertEqual(nm["3"], []) + + def test_batch_plan_greedy_cover(self) -> None: + # Greedy picks credit first (clears the one distance-1 file), then + # defaults (clears the two that need both), then other. + cands = [ + {"signatures": ["drop:credit", "drop:defaults"]}, + {"signatures": ["drop:credit", "drop:defaults"]}, + {"signatures": ["drop:credit"]}, + {"signatures": ["drop:credit", "drop:other"]}, + ] + plan = classify.build_batch_plan(cands) + self.assertEqual( + [(r["fix"], r["cumulative_files"], r["added_files"]) for r in plan], + [("drop:credit", 1, 1), ("drop:defaults", 3, 2), ("drop:other", 4, 1)], + ) + + def test_batch_plan_in_report_excludes_reorder(self) -> None: + self._pair("wild/one.xml", _wrap(""), _wrap("")) + self._pair("wild/r.xml", + _wrap("12"), + _wrap("21")) + self._classify() + plan = self.report["batch_plan"] + self.assertEqual([r["fix"] for r in plan], ["drop:credit"]) + + def test_summary_status_counts(self) -> None: + self._pair("wild/ok.xml", _wrap(""), _wrap("")) + self._pair("wild/fail.xml", _wrap(""), _wrap("")) + self._pair("wild/crash.xml", _wrap(""), None) + self._classify() + self.assertEqual(self.report["summary"]["by_status"], + {"CRASH": 1, "FAIL": 1, "PASS": 1}) if __name__ == "__main__": diff --git a/docs/ai/design/api-roundtrip-classifier.md b/docs/ai/design/api-roundtrip-classifier.md index ab9c82bf..aebe49a9 100644 --- a/docs/ai/design/api-roundtrip-classifier.md +++ b/docs/ai/design/api-roundtrip-classifier.md @@ -6,6 +6,39 @@ and why. It supersedes the original "walk both trees in parallel until the first mismatch" sketch in the #211 issue body, which has a structural defect described below. +## Update (#212): measurement-only classification + +The classifier was reworked to be grounded **only** in what the round-trip +measurably did to each file. The earlier version cross-referenced +`data/api.features.xml` to label drops as "by-design" (category B) vs. bug +(category G) and to recognize enum/attribute gaps (D/E). That `support` field is +hand-authored opinion — a *prediction* of round-trip behavior, not a measurement +— and it can be wrong (part-group was marked `full` yet dropped data; see #224). +Keying classification on a fallible prediction made the worklist untrustworthy. + +What changed: + +- **No `api.features.xml`.** The support cross-reference (old Layer 3 and the + B/D/E/G category logic) is removed. Whether a drop is intended is a present-day + human decision (#214), not a property the classifier asserts. +- **Categories → signatures.** Every difference is reduced to a signature: + `drop:`, `add:`, `value:`, `value:@`, + `attr:@`, `reorder:`. A file's **distance** to passing is + its count of *unique* signatures (a tag dropped a thousand times is one + `drop:` signature, distance 1). +- **Statuses** are just `PASS` / `FAIL` / `CRASH` (the three pipeline codes fold + into `CRASH`). +- **Candidates and the worklist.** A FAIL with no `reorder:` signature is a + *candidate* (reorders are expected `mx::api` behavior, to be absorbed in test + normalization later, #214). The worklist ranks each signature by the number of + candidate files it is the *sole* blocker of (fixing it flips them green now), + then by total candidate files blocked. + +Layers 1 (multiset for drops/adds) and 4 (alignment, now also used for +value/attr, not just reorder) below are unchanged and still describe the engine. +Layer 3, "Category G", and the old JSON schema are retained only as historical +rationale; the current schema is at the end of this section. + ## The defect in the naive design Both `compareElements` (`src/private/mxtest/corert/Compare.cpp`) and the original @@ -99,9 +132,11 @@ a second multiset keyed by `parent/tag` (or the full root-to-node path). The headline `distinct_missing_count` stays on bare tags; path qualification is used when mapping a drop back to an `api.features.xml` feature. -### Layer 3 — cross-reference `api.features.xml` +### Layer 3 — cross-reference `api.features.xml` *(REMOVED in #212; historical)* -For each missing tag, look up `support` in `api.features.xml`: +This layer was removed: classification no longer consults `api.features.xml`. The +text below is retained only to explain what the categories used to mean. For each +missing tag, the old version looked up `support` in `api.features.xml`: - **Category B (drop-only)** becomes *provable across the whole file*: it holds iff **every** tag in `missing` has `support="none"`. The naive walk could never @@ -111,14 +146,19 @@ For each missing tag, look up `support` in `api.features.xml`: or a partial-drop — surfaced explicitly instead of being hidden behind whatever the first divergence happened to be. -### Layer 4 — sequence alignment per parent (category C only) +### Layer 4 — sequence alignment per parent (reorder + paired value/attr) + +`difflib.SequenceMatcher` over the two child-tag sequences earns its keep twice: + +- **Reorder** (`reorder:`): a parent whose child *multiset* is equal but + whose child *sequence* differs is a pure permutation. +- **Paired value/attr** (`value:`/`attr:`): recursing only the matcher's `equal` + blocks pairs surviving elements across drops without desyncing, so text and + attribute differences are compared between genuinely corresponding nodes (never + across mismatched tags). This is how value/attr diffs survive a sibling drop. -Reorder-only (category C) is the case where a parent's child *multiset* is equal -between expected and actual but the child *sequence* differs. Detect with -`difflib.SequenceMatcher` opcodes over the two child-tag sequences: a pure -permutation (no net insert/delete) confirms reorder-only. Stdlib, zero dependency. -This is the one place per-parent alignment earns its keep; it is **not** used for -the deletion inventory (Layer 1 already has that). +It is **not** used for the deletion inventory (Layer 1's multiset already has +that). Stdlib, zero dependency. ### Layer 5 — edit-distance scalar (secondary) @@ -147,23 +187,26 @@ This belongs to Phase 3 (#212) ranking, but the per-file fields it needs (`distinct_missing_count`, `missing_elements`, single-blocker flag) are emitted here in Phase 2. -## JSON schema additions +## JSON schema (current, #212) -The #211 schema gains these per-file fields (existing fields unchanged; all -present on every entry, `null`/`[]`/`{}` when not applicable): +Each file gets one record; every field is present on every entry: ```jsonc -"missing_elements": ["credit", "defaults"], // sorted distinct dropped tags -"missing_element_counts": {"credit": 3, "defaults": 1}, -"distinct_missing_count": 2, // headline ranking metric -"added_elements": [], // spurious tags in actual (bug) -"is_single_blocker": false, // distinct_missing_count == 1 -"total_missing_instances": 4 // sum of counts; severity scalar +"file": "wild/foo.xml", +"status": "FAIL", // PASS | FAIL | CRASH +"crash_kind": null, // LOADFAIL | GETDATAFAIL | CREATEFAIL | null +"signatures": ["drop:credit", "value:step"], // sorted unique signatures +"sample_paths": {"drop:credit": "/score-partwise/part/measure/credit"}, +"signature_counts": {"drop": 1, "value": 1}, // by signature type +"distance": 2, // len(signatures); the headline metric +"has_reorder": false, // any reorder: signature +"is_candidate": true // FAIL and not has_reorder ``` -`first_divergence_element` / `first_divergence_path` / `mismatch_type` are retained -for continuity with the harness detail string, but are explicitly **not** relied on -for completeness — they describe only the first positional divergence. +The report's top level adds `summary` (status counts, `candidates`, +`reorder_blocked`, `distance_histogram`), `worklist` (signatures ranked by +`sole_blocker` then `files_blocked`), and `near_misses` (candidate files bucketed +by distance 1..3). ### Pipeline errors and the `.status` sidecar @@ -171,32 +214,22 @@ A flat dump filename (`a__b.xml.expected.xml`) cannot, on its own, distinguish t three pipeline-error statuses (`LOADFAIL` / `GETDATAFAIL` / `CREATEFAIL`): all three produce only an `.expected.xml` (the api pipeline never reached a written document). So the dump step (#210) writes a tiny `.status` sidecar next to -the expected file recording the exact code. The classifier reads it to populate -`pipeline_error_kind` and `status`; when the sidecar is absent it falls back to a -generic `status = "PIPELINE_ERROR"` with `pipeline_error_kind = null`. The -presence/absence of `.actual.xml` remains the FAIL-vs-pipeline-error signal; the -`.status` sidecar only refines *which* pipeline error. The sidecar lives in the -gitignored dump dir and is never checked in. - -### Category G — `support="full"`/`"partial"` drops - -A dropped element whose `api.features.xml` support is `full` or `partial` does -**not** satisfy category B (drop-only) — B requires *every* missing class to be -`support="none"`. A class that is supposed to round-trip but vanished is either a -genuine impl round-trip bug or an `api.features.xml` overstatement; both are -actionable and need human triage (issue #219). These files are assigned -**category G**, with the dropped supported classes as their `blocking_features`, -rather than being buried in `unknown`. G is evaluated last (after B/C/D/E), so a -precise enum (D) or attribute (E) finding still wins when the first divergence -matches one; otherwise the supported drop is surfaced. The multiset makes this -provable across the whole file rather than hiding it behind whatever the first -positional divergence happened to be. - -`part-group` was the motivating case: marked `support="full"` yet the single -most-dropped element, which turned out to be partly an audit overstatement -(corrected to `partial`) and partly malformed synthetic input (unmatched -start/stop, fixed in the corpus). Surfacing such drops as G instead of `unknown` -is what makes that triage discoverable. +the expected file recording the exact code. The classifier reads it into +`crash_kind` (with `status = "CRASH"`); when the sidecar is absent, `status` is +still `"CRASH"` and `crash_kind` is `null`. The presence/absence of `.actual.xml` +remains the FAIL-vs-CRASH signal; the `.status` sidecar only refines *which* +crash. The sidecar lives in the gitignored dump dir and is never checked in. + +### Category G — `support="full"`/`"partial"` drops *(REMOVED in #212)* + +Category G (and the B/D/E categories) existed only because classification used to +read `api.features.xml`. With that cross-reference gone, every dropped tag is just +a `drop:` signature regardless of any prior belief about whether it was +"supported"; the supported-vs-by-design judgment is a present-day human call +(#214), not the classifier's. `part-group` (marked `support="full"` yet dropped; +#224) was the case that proved the `support` field could not be trusted as a +classification input — which is what motivated removing it. The drop is still +reported, now as `drop:part-group`, ranked purely by how many files it blocks. ## Dependency decision