Skip to content

chore(deps): bump idna from 3.11 to 3.15#2

Open
dependabot[bot] wants to merge 1 commit into
mainfrom
dependabot/uv/idna-3.15
Open

chore(deps): bump idna from 3.11 to 3.15#2
dependabot[bot] wants to merge 1 commit into
mainfrom
dependabot/uv/idna-3.15

Conversation

@dependabot
Copy link
Copy Markdown

@dependabot dependabot Bot commented on behalf of github May 19, 2026

Bumps idna from 3.11 to 3.15.

Changelog

Sourced from idna's changelog.

3.15 (2026-05-12)

  • Enforce DNS-length cap on individual labels early in check_label, short-circuiting contextual-rule processing for oversized input while staying compatible with UTS 46 usage.
  • Tidy core helpers: hoist bidi category sets to module-level frozensets (avoiding per-codepoint list construction), simplify length checks, and reuse the shared _unicode_dots_re from idna.core in the codec module.
  • Use raise ... from err for proper exception chaining and switch internal string formatting to f-strings.
  • Allow flit_core 4.x in the build backend.
  • Expand the ruff lint set (flake8-bugbear, flake8-simplify, pyupgrade, perflint) and apply the surfaced fixes; pin lint CI to Python 3.14.
  • Add Dependabot configuration for GitHub Actions.
  • Convert README and HISTORY from reStructuredText to Markdown.
  • Reference CVE-2026-45409 for the 3.14 advisory in place of the initial GHSA identifier.

Thanks to Felix Yan, Stan Ulbrych, and metsw24-max for contributions to this release.

3.14 (2026-05-10)

  • Removed opportunity to process long inputs into quadratic time by rejecting oversize inputs up-front. Closes a bypass of the CVE-2024-3651 mitigation. [CVE-2026-45409]

Thanks to Stan Ulbrych for reporting the issue.

3.13 (2026-04-22)

  • Correct classification error for codepoint U+A7F1

3.12 (2026-04-21)

  • Update to Unicode 17.0.0.
  • Issue a deprecation warning for the transitional argument.
  • Added lazy-loading to provide some performance improvements.
  • Removed vestiges of code related to Python 2 support, including segmentation of data structures specific to Jython.

Thanks to Rodrigo Nogueira for contributions to this release.

Commits
  • af30a09 Release 3.15
  • 30314d4 Pre-release 3.15rc0
  • 05d4b21 Merge pull request #237 from kjd/convert-docs-to-markdown
  • 2987fdb Convert README and HISTORY from reStructuredText to Markdown
  • 59fa800 Merge pull request #236 from kjd/dependabot/github_actions/actions-f3e34333ea
  • def6983 Merge branch 'master' into dependabot/github_actions/actions-f3e34333ea
  • bbd8004 Merge pull request #234 from StanFromIreland/patch-1
  • edd07c0 Bump github/codeql-action from 3.35.2 to 4.35.2 in the actions group
  • 5557db0 Merge branch 'master' into patch-1
  • f11746c Merge pull request #235 from StanFromIreland/patch-2
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [idna](https://github.com/kjd/idna) from 3.11 to 3.15.
- [Release notes](https://github.com/kjd/idna/releases)
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.md)
- [Commits](kjd/idna@v3.11...v3.15)

---
updated-dependencies:
- dependency-name: idna
  dependency-version: '3.15'
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot added dependencies Pull requests that update a dependency file python:uv Pull requests that update python:uv code labels May 19, 2026
maxyanghu pushed a commit that referenced this pull request May 26, 2026
…without shmem (mlcommons#306)

* feat(metrics): add registry, samplers, and snapshot wire schema

Introduces the three primitives that the upcoming pub/sub metrics path
will compose on top of:

- snapshot.py: MetricsSnapshot wire struct (msgspec, tagged union of
  CounterStat | SeriesStat) plus SessionState enum (LIVE / DRAINING /
  COMPLETE) and msgpack codec.
- registry.py: MetricsRegistry holding CounterSamplers and
  SeriesSamplers. Series samplers carry an HDR Histogram for cheap live
  percentiles, an array.array of raw values for exact-final
  computation, and exact rollup primitives. Histogram bucket edges are
  log-spaced over the observed [min, max] per snapshot, so they
  auto-zoom to data instead of wasting buckets on empty range.
- New unit tests cover the wire codec round-trip, sampler hot path,
  and registry registration/collision behavior.

Adds hdrhistogram==0.10.3 as a runtime dependency.

Wiring of these primitives into the aggregator and removal of the old
KVStore path follow in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(metrics): add MetricsPublisher and MetricsSnapshotSubscriber

- publisher.py: MetricsPublisher owns the periodic tick task that
  publishes live MetricsSnapshots over IPC pub/sub at refresh_hz, plus
  publish_final() which is awaited by the aggregator on ENDED. Final
  delivery is dual-path:
  * pub/sub publish (best-effort, telemetry knobs sndhwm=4, linger=10s)
  * disk fallback (atomic: tmp + fsync(file) + rename + fsync(parent dir))
  Both paths are independently wrapped in try/except — neither failure
  suppresses the other. publish_final is async and awaits tick-task
  cancellation before publishing COMPLETE so a late LIVE/DRAINING tick
  can never land after COMPLETE on the wire.

- subscriber.py: MetricsSnapshotSubscriber tracks ``latest`` and the
  ``COMPLETE``-state snapshot. Defaults to conflate=True (TUI / report
  consumer) but parametrized for any consumer that needs every tick.

- New unit tests cover tick-task lifecycle, atomic disk fallback,
  independence of pub/sub vs disk failure paths, and the regression
  that publish_final must await tick-task cancellation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(metrics): wire pub/sub into aggregator, remove KVStore + mmap

Replaces the mmap-backed BasicKVStore with the registry/publisher path
introduced in the previous two commits.

Aggregator changes:
- MetricsAggregatorService now constructs a MetricsRegistry and
  MetricsPublisher on entry; trigger callbacks call registry.record /
  registry.increment instead of kv_store.update.
- Tracks SessionState (LIVE → DRAINING on ENDED → COMPLETE on
  publish_final). The publisher tick task captures (state,
  n_pending_tasks) per tick via a callback; consumers detect drain
  timeout as state == COMPLETE and n_pending_tasks > 0.
- Adds TRACKED_SAMPLES_FAILED counter, incremented on ERROR events
  whose tracked row still exists at processing time. Correctness
  depends on the load_generator emitting ERROR before COMPLETE; the
  matching test asserts that order.
- ENDED handler awaits drain_tasks (30s timeout), publish_final, and
  closes the publisher (linger=10s drains pending pub/sub frames).

Report changes:
- Replaces from_kv_reader with from_snapshot (pure function on a
  MetricsSnapshot). complete is derived from state == COMPLETE and
  n_pending_tasks == 0. Display warns when not complete.

Main-process changes (commands/benchmark/execute.py):
- Spawns a MetricsSnapshotSubscriber on the main loop. Triple-redundant
  report sourcing: pub/sub COMPLETE → disk fallback → latest live.
- Removes _setup_kv_reader, ARM tmpfs branching, and mmap salvage in
  _salvage_tmpfs (events.jsonl salvage is preserved).
- Awaits subscriber.wait_for_complete(timeout=2.0) after launcher
  exit so the loop can dispatch the COMPLETE frame before deciding
  the pub/sub path missed.

Removed:
- async_utils/services/metrics_aggregator/kv_store.py
- async_utils/services/metrics_aggregator/fs_check.py

Tests:
- Deletes test_kv_store.py.
- Marks test_aggregator.py / test_aggregator_e2e.py /
  test_metrics_table.py / test_report_builder.py / conftest.py with
  module-level skip + a TODO referencing the design doc; rewriting
  these on the new fixtures is a tracked follow-up.
- Adds test_aggregator_error_handler.py covering the
  TRACKED_SAMPLES_FAILED increment path and the negative case where
  COMPLETE arrives before ERROR (documents the bug the ERROR/COMPLETE
  swap fixes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(load_generator): emit ERROR before COMPLETE for failed queries

Swaps the publish order in BenchmarkSession._handle_response so that a
QueryResult carrying an error emits ErrorEventType.GENERIC first, then
SampleEventType.COMPLETE.

This is required for metrics-aggregator correctness: COMPLETE causes
MetricsTable.set_field to remove the tracked row, so an ERROR observed
afterward has no row to inspect and TRACKED_SAMPLES_FAILED would
silently stay at 0. Emitting ERROR first keeps the row alive long
enough for the aggregator's error handler to identify the failure as
tracked. EventLoggerService and other event consumers treat the two
event types independently, so order is invisible to them.

The test_failed_query_published_as_error_event test now asserts the
order explicitly so a future revert is caught immediately, and the
aggregator-side regression is covered by test_aggregator_error_handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(agents): update AGENTS.md for metrics pub/sub refactor

- Core Data Flow diagram: replaces "EventRecorder + MetricsReporter"
  with the events PUB → EventLoggerService / MetricsAggregatorService
  fan-out + main-process Report subscriber.
- Key Components table: adds Metrics Aggregator and Report rows;
  notes the load_generator's ERROR-before-COMPLETE invariant.
- New "Metrics Aggregator subprocess (pub/sub)" section under
  Hot-Path Architecture: state machine, sampler storage layout, hot
  path API, dual-path final delivery, dynamic histogram edges.
- Code Organization tree: expands metrics_aggregator/ to list each
  module; removes deleted recorder.py / reporter.py from metrics/.
- Key Dependencies table: adds hdrhistogram (C-backed HDR Histogram).
- Test fixtures: removes events_db (deleted with the KVStore path).

Per AGENTS.md's own self-update rules: "Treat AGENTS.md changes as
part of the refactor itself — include them in the same PR".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): address P0 review-council findings

Two high-severity issues raised by the review-council pass on PR mlcommons#306:

1. (mlcommons#306-1) Subscriber late-binding could drop early ticks via the ZMQ
   slow-joiner pattern. Move MetricsSnapshotSubscriber construction +
   start() BEFORE launcher.launch() so the SUB handshake completes
   during the subprocess-spawn window. ZMQ tolerates connect-before-
   bind on IPC — the connect resolves once the binder appears. The
   prior ordering (subscribe AFTER launch returns) had a window where
   the aggregator could begin ticking on STARTED before the SUB
   subscription warmed up, dropping early live snapshots and, in the
   worst case, missing COMPLETE entirely.

2. (mlcommons#306-2) MetricsPublisher._write_atomic_fallback runs synchronous
   f.flush + fsync(file) + fsync(parent dir) + rename on the
   aggregator's event loop. On a busy host this can block tens-to-
   hundreds of ms — long enough to back-pressure event-record
   processing. Wrap with asyncio.to_thread inside publish_final.

Both fixes are localized — no API changes, no test changes required.
Existing integration tests (test_concurrency_benchmark, test_end_to_
end_oracle) exercise both paths end-to-end and still pass.

The third P0 item (mlcommons#306-3, unbounded raw-sample retention) is the
agreed memory trade documented in metrics_pubsub_design_v5.md §11;
addressed by adding "--persist-raw" as a tracked follow-up rather
than a code change in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(metrics): rewrite skipped suites on registry/snapshot fixtures

The metrics pub/sub refactor (PR #N) module-level-skipped four test
files plus their conftest because they hard-coupled to the deleted
KVStore API. This commit reinstates them on the new fixtures, in
scope with the design doc's "test impact" callout.

- conftest.py: rewrites shared fixtures to construct MetricsRegistry
  and MetricsTable instances directly. Drops events_db (SQLite
  fixture deleted with the KVStore path).
- test_metrics_table.py: 16 tests covering tracking-window lifecycle,
  trigger dispatch on field updates, tracked-block accounting, and
  the in-flight async-task drain path.
- test_aggregator.py: 31 tests covering MetricsAggregatorService end
  to end (in-process, MagicMock publisher) — counter accounting,
  ISSUED/COMPLETE/error event handling, ENDED → publish_final
  sequence, and the LIVE → DRAINING state transition. Adds a new
  TestCounterAccounting class to cover the total_* vs tracked_*
  counter split that the legacy tests conflated.
- test_aggregator_e2e.py: 3 tests round-tripping a real
  MetricsPublisher ↔ MetricsSnapshotSubscriber over IPC, covering
  COMPLETE-only delivery, LIVE-tick-then-COMPLETE lifecycle, and
  counter+series wire shape.
- test_report_builder.py: 14 tests on Report.from_snapshot, including
  the complete=False derivation when state != COMPLETE or
  n_pending_tasks > 0.

Net: 64 new tests across the 4 suites; full unit suite up from
660 to 724 passing. The 4 module-level skips are gone.

Production-code surfaces flagged for follow-up coverage:
- AsyncTokenTrigger exception path in metrics_table.py
- SeriesSampler HDR clamp warn-once branch
- MetricsAggregatorService._finalize shutdown_event signaling
- Report.tps() OSL-empty-with-duration case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(templates): unblock TestTemplateIntegration without HF_TOKEN

The 6 generated-template integration tests were skipped unconditionally
in CI/dev because the template placeholders default to gated
meta-llama/Llama-3.1-* repos that require HF_TOKEN to fetch the
tokenizer.

Substitute TinyLlama/TinyLlama-1.1B-Chat-v1.0 for the model name in
_resolve_template after placeholder expansion. TinyLlama is non-gated
(~1MB tokenizer download), shares the Llama-family tokenizer the
templates were written against, and the echo-server path doesn't care
about model identity — only that AutoTokenizer.from_pretrained
succeeds for the metrics aggregator's ISL/OSL/TPOT triggers.

Drops the @pytest.mark.skipif(not HF_TOKEN) decorator, removes the now-
unused os import.

Effect: integration suite goes from 20 passed / 8 skipped to 26 passed
/ 2 skipped. The remaining 2 skips need real LLM servers (vLLM/SGLang)
which aren't in scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(agents): add reference-hygiene rules + clean up violations

Adds two new sections to AGENTS.md "Development Standards":

1. "Documentation references — no local-only artifacts" — docs and
   comments must not reference paths outside the repo (gitignored
   directories, local scratch dirs, contributor workstation paths).
   A reviewer fetching the PR should be able to follow every cited
   reference.

2. "Comments and docstrings — describe current state, not development
   history" — no comments narrating iteration on the codebase ("we
   tried X first", "an earlier implementation did Y"). Such pointers
   belong in the PR description and git log, not the source tree.
   Especially relevant under AI-assisted development where it's
   tempting to leave a paper trail of design pivots inline.

Sweeps existing violations across both rules:

Production code: drops cites to ``metrics_pubsub_design_v5.md`` from
module/class docstrings (snapshot.py, registry.py, publisher.py) and
inlines self-contained rationale where useful (aggregator.py HDR
bounds, TOTAL_DURATION_NS comment).

Tests: removes "Migrated to ..." / "The legacy tests ..." framing
from rewritten test module docstrings; reframes regression-test
docstrings (test_registry.py, test_publisher.py, test_aggregator.py)
to describe the invariant being protected rather than narrating the
prior bug's discovery.

AGENTS.md: removes its own self-violation cite to the gitignored
design doc.

Behavior: no functional changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(metrics_table): encapsulate in-flight task access

Address PR mlcommons#306 review comments from gemini-code-assist (encapsulation)
and github-code-quality (non-iterable enum loop):

- Add `MetricsTable.in_flight_tasks_count` property so the aggregator
  no longer reaches into `table._in_flight_tasks` to report pending-
  task counts on snapshots and drain logging.
- Add `MetricsTable.cancel_in_flight_tasks()` returning the list of
  cancelled tasks (sets up the T3 await-cancellations fix).
- Update aggregator.py call sites accordingly.
- Use `MetricCounterKey.__members__.values()` in test_report_builder
  to satisfy CodeQL's "non-iterable used in for-loop" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(metrics): rename refresh_hz → publish_interval_s (seconds)

Address PR mlcommons#306 review comments from arekay-nv asking for the
"interval" naming convention used elsewhere in the repo (e.g.
`check_interval` in worker_manager, `interval` in benchmark_httpclient).

- CLI flag `--refresh-hz <Hz>` → `--publish-interval <seconds>`
  (default 4.0 Hz → 0.25 s; same wire cadence).
- Constructor parameters `refresh_hz` (aggregator + publisher) →
  `publish_interval_s`. The `_s` suffix makes the unit explicit so
  call sites can't accidentally pass a frequency.
- Internal field `_refresh_hz` → `_publish_interval_s`.
- Drops the `period = 1.0 / refresh_hz` indirection in publisher.start.
- Tests / AGENTS.md updated accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(report): numerically stable variance for integer-aggregate series

Address PR mlcommons#306 gemini-code-assist comment on report.py:53.

For ns-precision latency series (`SAMPLE_LATENCY_NS`, `TTFT_NS`,
`TPOT_NS`, etc.) the rollups store `total` and `sum_sq` as Python ints
that can grow to many digits. The previous formula
`sum_sq - total*total / n` evaluates `total*total / n` as a float and
catastrophically cancels against `sum_sq` when the variance is small
relative to the mean, producing a negative variance numerator the
sqrt() then clamps to 0.

Use the exact integer numerator `n*sum_sq - total*total` when the
inputs are ints (this is what the math.sqrt sees, no cancellation),
falling back to the float form for series whose dtype is float
(currently only TPOT, where the magnitudes are small enough that the
naive form is fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): drain / shutdown correctness for cancel + SIGTERM

Address PR mlcommons#306 review-council items mlcommons#4, mlcommons#9, mlcommons#12.

mlcommons#4 — Cancellations not awaited before reading n_pending
    After a drain timeout, the aggregator's `t.cancel()` loop only
    *scheduled* cancellation; reading `n_pending` on the next line
    therefore reported a stale-high count and left the to-be-cancelled
    tasks runnable when the loop tore down. Now `await
    asyncio.gather(*cancelled, return_exceptions=True)` runs before
    `n_pending = table.in_flight_tasks_count`, so the snapshot reflects
    the post-cancellation set and the cancelled tasks actually exit.

mlcommons#9 — close() cancels tick task but doesn't await it
    Added `MetricsPublisher.aclose()` (async) that cancels the tick task
    AND awaits its exit before closing the underlying transport.
    Aggregator's post-publish_final path and __main__.py's finally block
    now use it. Sync `close()` is kept for sync error-path fallbacks
    with a docstring noting the race.

mlcommons#12 — SIGTERM bypasses publish_final
    Installed `SIGTERM` and `SIGINT` handlers in __main__.py that fire
    `publish_final` defensively before setting `shutdown_event`. Added
    `MetricsPublisher._finalized` so the SIGTERM-triggered and the
    ENDED-triggered paths are safe to race — only the first call
    publishes a COMPLETE frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): pre-check HDR `high >= 2*low` before HdrHistogram ctor

Address PR mlcommons#306 council review mlcommons#5 (registry.py:161).

The C-backed hdrhistogram constructor requires `high >= 2*low` but
raises an opaque allocation error if that doesn't hold — making it
hard to debug a misconfigured `register_series` call. Add an explicit
pre-check after the low/high clamps so the error names the series and
both values up front.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): guard publisher.start against double-STARTED

Address PR mlcommons#306 council review mlcommons#8 (aggregator.py:281).

A repeat `SessionEventType.STARTED` (replay buffer, buggy producer,
test fixture) used to make `MetricsPublisher.start` overwrite
`_tick_task`, orphaning the first tick task — it kept running until
GC and raced the new task to publish snapshots.

Make `start` idempotent: if `_tick_task` is already set, log a warning
and return without spawning a second task. The original task remains
the one `publish_final` / `aclose` cancels and awaits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): structured logging around aggregator subprocess crash

Address PR mlcommons#306 council review mlcommons#10 (__main__.py:166).

Wrap the top-level `run_until_complete(main())` so startup / bind /
tokenizer-load failures emit a structured `logger.exception` before
the interpreter prints the traceback. The parent's ServiceLauncher
previously saw only the non-zero exit code and a raw stderr trace
with no context to correlate against the parent's logs.

`SystemExit` is re-raised untouched so argparse usage / explicit
sys.exit paths stay user-facing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(session): document publish-order invariant for ERROR-before-COMPLETE

Address PR mlcommons#306 council review mlcommons#11 (session.py:408).

The metrics aggregator's `TRACKED_SAMPLES_FAILED` accounting relies on
the publisher delivering ERROR strictly before COMPLETE for a failed
sample. The ordering is correct today (ZMQ PUB→SUB in-order delivery,
ZmqMessagePublisher batches without reordering), but it's an implicit
contract — a future transport refactor that breaks it would break
tracked-failure counting silently. Document the invariant inline so
that future refactors trip over it instead of past it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(metrics): --drain-timeout flag, default bumped to 60s

Address PR mlcommons#306 arekay-nv inline comment on aggregator.py:99
("This might need to be higher").

Analysis: at the system's design point (50k QPS short-context, default
2 tokenizer workers) the 30 s drain finishes in well under a second.
Long-context tokenize workloads can push the backlog higher — a 32k-
context 5k-QPS run with 2 workers can take ~100 s to drain. The right
knob there is `--tokenizer-workers`, not the drain budget, but giving
the user a CLI handle makes both ends tunable without redeploying.

Changes:
- Default drain budget bumped 30s → 60s. Covers normal + long-context
  at the default 2 workers without inflating the high-QPS short-
  context case (we exit early when drain_tasks returns).
- New `--drain-timeout <seconds>` CLI flag plumbed through the
  aggregator subprocess and into `MetricsAggregatorService` as a
  constructor arg `drain_timeout_s`. The kwarg is positionable (not
  a global) so callers can inject test values without monkey-patching
  the module-level constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(metrics): add SessionState.INITIALIZE for pre-START phase

Address PR mlcommons#306 arekay-nv inline comments on snapshot.py:35
(add an `INITIALIZE` state preceding `LIVE`) and test_snapshot.py:67
(add state-check tests).

- Add `SessionState.INITIALIZE = "initialize"` to the wire schema; the
  aggregator now starts in INITIALIZE and transitions to LIVE on the
  first STARTED event. The state machine is forward-only:
  INITIALIZE → LIVE → DRAINING → COMPLETE.
- No INITIALIZE snapshot is emitted today (the tick task only starts on
  the first STARTED), but the state exists as the well-defined starting
  point and so a future setup-phase tick has a state to carry. Wire
  compatibility is preserved — INITIALIZE round-trips through the codec
  (test added).
- New `TestSessionStateTransitions` pins: member set, declaration order
  (consumers can rely on `list(SessionState)` for forward checks), the
  `complete = state == COMPLETE and n_pending_tasks == 0` rule across
  every state, and the INITIALIZE round-trip.
- AGENTS.md updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(registry): cover SeriesSampler internal boundaries

Address PR mlcommons#306 arekay-nv inline comment on registry.py:118 ("can we
add tests to ensure that the behavior is fixed and any changes are
caught by tests, specifically the internal points/boundaries").

New `TestSeriesSamplerBoundaries` class pins:

- HDR construction-time invariants: `high < 2*low` rejected, equality
  case accepted, `low=0` coerced to 1, unsupported dtype rejected.
- Clamp behavior at the HDR bounds: values exactly at `hdr_low` /
  `hdr_high` are unclamped and don't trip the warn-once flag.
- Under- and over-bound clamping: warn-once fires exactly once per
  sampler, raw values stay un-clamped (only HDR's view is clamped).
- Float dtype uses float comparison for the lower clamp (so sub-integer
  under-bound values are still detected).
- sig_figs at HDR-supported extremes (1 and 5) construct and record.
- Rollup edges: count==1 (min==max==total, sum_sq==v^2) and the empty
  case (count==0, histogram==[]).
- Warn-once flag is per-sampler, not process-global.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(metrics): add SessionState.INTERRUPTED for signal-handler shutdown

Foundation commit for the JSON-file-final-snapshot refactor: add a
terminal state distinct from `COMPLETE` so signal-handler-triggered
final snapshots can be told apart from clean ENDED-driven ones.

- Add `SessionState.INTERRUPTED = "interrupted"` and document the
  forward-only transition graph in the enum docstring:
  `INITIALIZE → LIVE → DRAINING → {COMPLETE | INTERRUPTED}`
- Tighten the `state == COMPLETE and n_pending_tasks == 0` complete-
  predicate test to cover both INTERRUPTED + n_pending=0 and
  INTERRUPTED + n_pending>0 as "not complete".
- Add a wire-round-trip test for INTERRUPTED via the msgpack codec.

No call-site changes yet — the next commit wires the publisher /
signal handler / consumer to use INTERRUPTED, and switches the
persisted final snapshot from msgpack pub/sub fallback to a JSON
file as the primary source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(metrics): final snapshot = JSON file; pub/sub = TUI signal

Decouple the two delivery paths so the Report consumer no longer
depends on pub/sub terminal-frame survivability. Closes PR mlcommons#306
council mlcommons#6 (conflate=True fragility for the Report consumer).

Architecture change:

- `MetricsPublisher.publish_final(..., interrupted: bool = False)` now
  atomically writes `final_snapshot.json` (pretty-printed, dict form)
  as the **primary** Report source AND publishes the terminal-state
  snapshot over pub/sub as a **TUI shutdown signal**. Disk write and
  pub/sub send are independent best-effort paths.
- Signal handler in `__main__.py` invokes `publish_final(interrupted=
  True)` so SIGTERM/SIGINT writes a snapshot tagged `INTERRUPTED`
  (introduced in the prior commit) — distinguishes "user killed the
  run mid-execution" from a clean shutdown.
- `MetricsSnapshotSubscriber` is now TUI-only: stripped `complete`,
  `_complete_event`, `wait_for_complete`. `conflate=True` is the
  unambiguous default — no Report-consumer fragility to reason about.
- `execute.py` reads `final_snapshot.json` via `json.loads` straight
  to the dict form, drops the 2 s `wait_for_complete` window and the
  triple-redundant fallback chain. Single fallback: if the file is
  missing (SIGKILL/OOM before the signal handler ran), convert the
  subscriber's `latest` live snapshot via `snapshot_to_dict` and mark
  the report incomplete.
- `Report.from_snapshot` now accepts a dict (the consumer contract).
  All field reads use `dict.get(...)` with defaults that produce an
  honest "incomplete" report on missing fields rather than crashing.
  Surfaces a `state: str` field so `display()` renders an explicit
  INTERRUPTED indicator.
- New `snapshot_to_dict()` in `snapshot.py` is the one-way bridge from
  the wire `MetricsSnapshot` (array_like=True, compact msgpack) to the
  dict form used by both the file writer and any consumer that needs
  to feed a live Struct into Report. The inverse is intentionally
  absent — see `Report.from_snapshot` docstring for the rationale.

Tests rewritten:
- `test_publisher.py`: assertions read JSON from disk instead of
  msgpack, new test for `interrupted=True` writing `state=interrupted`.
- `test_aggregator_e2e.py`: covers both delivery paths (JSON file +
  pub/sub terminal frame).
- `test_report_builder.py`: routes through `snapshot_to_dict`; new
  tests for INTERRUPTED display, empty-dict defaults, and malformed
  metric entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): duplicate STARTED → log error + preserve session_start

Address PR mlcommons#306 review-council finding (Claude mlcommons#5): a duplicate
`STARTED` event silently froze `total_duration_ns` for the rest of
the run because the max-of-elapsed guard never beat the new smaller
deltas computed against the later start timestamp.

The producer contract is "STARTED exactly once per session". Treat a
duplicate as a producer bug: log an error with both timestamps and
DROP the duplicate (don't re-assign `_session_start_ns`). The
publisher.start guard already rejects the second tick-task spawn
(council mlcommons#8); this commit defends the session-state side of the same
invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): scrub NaN/Inf to None in snapshot_to_dict; allow_nan=False

Address PR mlcommons#306 review-council finding (Claude mlcommons#7): the persisted
`final_snapshot.json` could contain literal `NaN` / `Infinity` tokens
if any series recorded a non-finite float (e.g. division-by-zero in a
future TPOT calc, clock-skew artifact, etc.). Python's `json.loads`
reads those back fine, but `jq`, Go's `encoding/json`, JS strict mode,
and most other strict-JSON consumers reject them — and the documented
"cat / jq the file" workflow makes this a real interop tripwire.

Two changes:

1. `snapshot.py::snapshot_to_dict` scrubs non-finite floats to `None`
   on the numeric fields where they could land (counter value, series
   total/min/max/sum_sq/percentiles/histogram-edges). `None` is
   self-describing in the dict consumer: `Report.from_snapshot` uses
   `dict.get(..., default)` so the absence-mapping degrades gracefully
   to zero/empty.

2. `publisher.py::publish_final` switches `json.dumps` to
   `allow_nan=False`. With the scrub in place this should never raise;
   if it does, that's a producer-side bug that needs surfacing, not
   silencing into a non-strict JSON file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): SIGTERM refresh duration; SIGINT no-op

Address PR mlcommons#306 review-council findings (Codex #1 + #2):

#1 — SIGTERM-driven `_signal_finalize` skipped the
`tracked_duration_ns` refresh that the ENDED-driven path does at
`aggregator.py:379-381`. Interrupted reports therefore showed
`duration_ns=0` / `QPS=N/A` even after processing many tracked
samples. Mirror the ENDED path: `registry.set_counter(...,
table.total_tracked_duration_ns)` before `publish_final`.

#2 — On interactive ^C, the OS sends SIGINT to the whole foreground
process group; the aggregator child received it and immediately
called `publish_final(interrupted=True)`, writing the file from
whatever state it had at signal time. Samples that completed during
the parent's clean-shutdown window (between the SIGINT and the
parent's eventual ENDED) never reached the file because
`_finalized=True` made the subsequent ENDED-driven `publish_final` a
no-op. Result: systematic undercount on interactive runs.

Fix: SIGINT registers a no-op handler that silences Python's default
KeyboardInterrupt and lets the parent's ENDED path drive the
aggregator's finalize. SIGTERM remains the only signal that
finalizes — used by `ServiceLauncher.kill_all` when the parent
decides to terminate the child before ENDED arrives.

New integration tests in `tests/integration/async_utils/services/
metrics_aggregator/test_signal_handling.py` spawn the aggregator as
a real subprocess and verify both paths end-to-end (SIGTERM writes
`state=interrupted`; SIGINT does not write the file and the
subprocess stays alive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(execute): cover _load_final_snapshot_from_disk + Report fallback

Address PR mlcommons#306 review-council finding (Claude mlcommons#15): the consumer-
side fallback ladder in `execute.py` had no test coverage. The
three branches (file present / file absent / file malformed) plus
the state→complete-flag→display contract are load-bearing for the
"JSON file is the canonical Report source" architecture, but a
regression that swapped precedence or mis-defaulted on a malformed
file would go unnoticed until manual QA.

New `TestLoadFinalSnapshotFromDisk` pins:
- file missing → None (SIGKILL / OOM case)
- valid JSON → dict returned with state+pending fields intact
- malformed JSON → None + WARNING logged (graceful, not crash)

New `TestReportFromLoadedSnapshot` pins:
- Parametrized state × n_pending → expected `report.complete`,
  covering clean-COMPLETE, drain-timeout-COMPLETE, INTERRUPTED-0,
  and INTERRUPTED-with-pending.
- INTERRUPTED display() surfaces the signal-driven shutdown so a
  user reading the output knows the data is partial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): doc cleanup + contract enforcement (p50, output_dir)

Address PR mlcommons#306 review-council follow-ups mlcommons#6, mlcommons#7, mlcommons#8, mlcommons#9, mlcommons#14, mlcommons#17.

mlcommons#6 — Drop the bottom `if not self.complete` WARNING in
`Report.display`. The top if/elif (state == "interrupted" vs not
self.complete) already says everything needed and says it
correctly. The bottom warning fired a second time for INTERRUPTED
runs with the misleading "(drain timeout)" attribution.

mlcommons#7 — Reword `execute.py` fallback log from "report will be marked
incomplete" to "state may or may not be terminal" — the latest
pub/sub frame may in fact be a terminal-state signal.

mlcommons#8 — Update `MetricsSnapshot.state` field docstring to list all
five states (INITIALIZE, LIVE, DRAINING, COMPLETE, INTERRUPTED)
and note that COMPLETE / INTERRUPTED are both terminal.

mlcommons#9 — Codify the "parent owns directory setup" contract. The parent
(`commands/benchmark/execute.py:432-433`) already creates
`<report_dir>/metrics/` before launching the aggregator subprocess.
The child's redundant `mkdir` and the publisher's redundant
`path.parent.mkdir` are both replaced with a fail-fast contract
check in `__main__.py`: if the directory doesn't exist at startup,
the child raises `SystemExit` with a clear message in its stderr.
This prevents the prior failure mode where an mkdir error in the
child caused a 30s parent-side launcher timeout with no visible
diagnostic. (The parent-side fail-fast-on-early-subprocess-death
piece remains a known follow-up against `ServiceLauncher`.)

mlcommons#14 — Enforce the "p50 mandatory" contract at registration time.
`MetricsRegistry.register_series` now rejects percentiles tuples
that omit 50.0, with a clear error message naming the series.
`_series_to_metric_dict` keeps the midrange fallback as defense-
in-depth for hand-crafted snapshot dicts (e.g. manually-edited
JSON files) that bypass the registry path, with a comment
labeling it as approximate-only.

mlcommons#17 — Expand the `publisher.py:publish_final` pub/sub-publish
`except` comment to call out the legitimate ENDED-vs-signal race
(a SIGTERM-driven publish_final reaching `aclose()` first leaves
the underlying ZMQ socket closed when this publish runs). The
dropped TUI frame in that race is acceptable because the JSON
file is the authoritative Report source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): robustness — KeyboardInterrupt, finalize, .tmp cleanup

Address PR mlcommons#306 review-council follow-ups mlcommons#10, mlcommons#12, mlcommons#13.

mlcommons#10 — Top-level exception handler in `__main__.py` caught
`BaseException`, which includes `KeyboardInterrupt`. If SIGINT
arrived before the per-loop signal handlers were registered (during
argparse / `aggregator.start()` / tokenizer load), the user-initiated
^C was logged as "subprocess crashed" with a full traceback —
misleading on a clean interactive shutdown. Narrow to `except
Exception as e:` so KeyboardInterrupt and SystemExit propagate
untouched, and log the concrete exception type up front for grep-
ability.

mlcommons#12 — `aggregator.process()` ENDED path called `publish_final` →
`aclose()` → `_finalize()` as three top-level awaits. If
`publish_final` raised (e.g. tick-task crashed with a non-
CancelledError that escaped its `await self._tick_task`), the
remaining two cleanup steps were skipped — and `_finalize()` is
what sets `shutdown_event`. Without it, `await
shutdown_event.wait()` in main() hangs forever absent a signal.
Wrap in `try/finally` so the cleanup pair always runs, with the
inner `aclose()` also wrapped so its own failure can't prevent
`_finalize()` from completing.

mlcommons#13 — `_write_atomic_json` on `publisher.py` didn't clean up the
`.tmp` file on failure. If `os.rename` raised (EXDEV cross-device
after a tmpfs flip, parent dir removed mid-write, permission
change), the `.tmp` file leaked across runs. Wrap the write +
rename sequence so any failure unlinks `tmp` (with `missing_ok=True`
since rename may have consumed it just before the failure point).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): rename tests/datasets/ → tests/assets/datasets/

Make room for non-dataset test fixtures (e.g. local tokenizer artifacts
for tests that need ISL/OSL/TPOT triggers but can't depend on
HuggingFace Hub access in CI). `tests/datasets/` was too narrowly
named; `tests/assets/` will house both `datasets/` and other test
artifacts under logical subdirectories.

Pure path rename — files move from `tests/datasets/<x>` to
`tests/assets/datasets/<x>`, no content changes. References updated
across:

- pyproject.toml (sdist include glob)
- README.md, docs/CLI_QUICK_REFERENCE.md, docs/LOCAL_TESTING.md,
  examples/02_ServerBenchmarking/README.md
- scripts/create_dummy_dataset.py, scripts/regenerate_templates.py
- src/inference_endpoint/config/templates/*.yaml (placeholder examples
  regenerated from the updated script)
- tests/conftest.py, tests/unit/commands/test_benchmark.py
- AGENTS.md (Test data section)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): use local char-tokenizer fixture, drop HF Hub dependency

Two integration tests in PR mlcommons#306's metrics-aggregator path were
flaky / slow in CI because of HuggingFace Hub:

- `TestTemplateIntegration::test_template_runs` (6 cases) called
  `AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")`
  on the aggregator subprocess's startup path. Cold-cache CI runs
  paid the ~1 MB download + tokenizer-init cost, sometimes pushing
  subprocess startup past the parent launcher's 30 s timeout. Also
  required network egress / HF_TOKEN for some CI environments.
- `test_signal_handling.py` (new tests) were not affected (they
  don't pass `--tokenizer`), but the parent-owns-output-dir contract
  from the earlier mlcommons#9 follow-up also applied — those tests now
  create the output dir themselves before spawning the subprocess.

Fix: drop in a local character-level tokenizer fixture at
`tests/assets/tokenizers/char/`. ~3 KB total (`tokenizer.json` +
`tokenizer_config.json`). Loaded via the existing
`AutoTokenizer.from_pretrained(local_dir)` codepath — no test-only
hooks in production code. Each character is one token, which is
enough for the aggregator's ISL/OSL/TPOT triggers to produce
deterministic counts (the e2e test path doesn't care about
tokenization correctness, only that *some* count appears).

Effects: no network call on the aggregator startup path for these
tests, no HF_TOKEN requirement, and tokenizer load completes in
single-digit ms instead of seconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): bump worker_initialization_timeout to 120s in CI

`TestTemplateIntegration::test_template_runs[concurrency_template.yaml]`
consistently hits the 60s `worker_initialization_timeout` in CI on
cold-start. `concurrency_template.yaml` is alphabetically first in the
parametrized lane, so it pays the full first-time-this-CI-job cost:

- Python `multiprocessing` `spawn`-mode re-import of the entire
  `inference_endpoint` package per worker subprocess (transformers,
  msgspec, pyzmq, etc.)
- First-time ZMQ IPC bind + connect handshake for the worker pool
- Concurrent aggregator subprocess cold-start contending for the
  same small-CI-runner CPU

Subsequent templates in the same lane benefit from warm module
caches and don't approach the limit. Local Docker runs finish all 6
templates in ~40 s total (~6.5 s/template), but CI runners with less
headroom (and `spawn` vs `fork`) consistently push the first test
past 60 s.

Bump to 120 s in this test only — `_resolve_template` injects
`settings.client.worker_initialization_timeout: 120.0` into each
template before running. Production default (60 s) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(metrics): drain pending count, SIGTERM task GC, NaN display

Addresses the three high-priority findings from the review council:

H1: drain_tasks now owns the timeout + cancel-and-await sequence, so the
pending count is captured before per-task done callbacks empty the
in-flight set. Previously read 0 unconditionally — the documented
state==COMPLETE and n_pending_tasks>0 drain-timeout contract was
unenforceable.

H2: Extract _make_sigterm_handler returning a strong-ref set[Task] that
holds the spawned _signal_finalize task; the loop tracks tasks via
weakref only, so a discarded create_task() return value can be GC'd
mid-flight (Python asyncio docs) — exactly the failure the INTERRUPTED
delivery path exists to prevent.

H3: _scrub_nonfinite maps producer-side NaN/Inf to None for strict JSON.
_display_metric did val * scale_factor with no guard → TypeError on
display(), which finalize_benchmark calls outside the report-build
try/except. Render N/A for None across named scalars, histogram
bucket edges, and percentiles.

Tests added (all verified failing pre-fix):
- test_drain_timeout_reports_pending_count: forever-blocking pool +
  drain_timeout_s=0.05, asserts publish_final receives n_pending>0
- test_sigterm_handler_holds_strong_reference_to_finalize_task: drives
  the handler, asserts task is in the strong-ref set, survives
  gc.collect(), and self-removes via done-callback on completion
- test_sigterm_handler_refreshes_tracked_duration: handler mirrors the
  ENDED path's tracked_duration_ns refresh before publish_final
- test_display_handles_scrubbed_nan_percentiles: dict with scrubbed
  None percentile values does not crash display(); renders N/A
- test_scrub_nonfinite_round_trip_yields_none: registry-side NaN/Inf
  surfaces as None in snapshot_to_dict and round-trips through
  json.dumps(allow_nan=False)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(tests): hoist lazy imports to top of test_report_builder

AGENTS.md forbids imports inside function bodies. The H3 round-trip test
introduced lazy imports of math, json, MetricsSnapshot, and SeriesStat —
move them to the top-level import block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maxyanghu pushed a commit that referenced this pull request May 26, 2026
* feat: add multi-turn dataset manager with flat JSONL support

Add MultiTurnDataset, MultiTurnConfig schema, tool-calling types,
Query.metadata transport field, adapter tools= kwarg, and multi-turn
factory routing.

* feat: add ConversationManager and MultiTurnStrategy

Add per-conversation asyncio.Event sequencing (ConversationManager),
async turn pipeline (MultiTurnStrategy), and benchmark execution wiring
(execute.py, session.py PhaseIssuer data_override).

* test: add multi-turn unit and integration tests

Add unit tests for MultiTurnDataset, ConversationManager, and
MultiTurnStrategy; add integration tests including tool-use scenarios
and large-concurrency stress tests.

* feat: wire multi-turn into benchmark execution pipeline

Consolidate multi-turn dataset with single-turn transform pipeline,
fix prior-row extraction, live-history mode, system prompt injection,
tool_calls preservation, and asyncio.Event-based sequencing.

* docs: add multi-turn quickstart, examples, and conversion scripts

Add MULTI_TURN_QUICKSTART.md, examples/09_MultiTurn/ configs and sample
data, scripts/convert_agentic_snapshot.py, and README clarifications
including conversion script output destination.

* fix: replace hardcoded /model/ path in validate_jsonl_schema.py docstring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: move multi_turn_dataset_schema.json into scripts/ and update default path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR mlcommons#285 review comments for multi-turn implementation

Fix 15 review issues across severity levels:
- HIGH: metadata=None crash in msgspec adapter, silent exception swallowing in gather
- MEDIUM: timeout state consistency, conv_id canonicalization, PromptData fallback, conv_id guard
- LOW: enum comparison, frozen config, empty tool_results warning, adapter metadata extraction,
  groupby deduplication, live-history tool warning, asyncio.Event docs, test TODO

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve multi-turn PromptData text and add concurrent stress test

Use newline separators (instead of spaces) when flattening messages to
text for ISL estimation, and add a 12-conversation concurrent stress test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: replace semaphore with worker-pool concurrency in MultiTurnStrategy

target_concurrency now limits active conversations (not in-flight requests).
N worker tasks pull from asyncio.Queue, each processing one full conversation
before taking the next. Also adds slots=True back to PhaseConfig and sort=False
to groupby for file-order preservation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address remaining PR mlcommons#285 review comments for multi-turn implementation

- openai_adapter: normalize null content to "" instead of literal "None"
  to avoid polluting conversation history in tool-calling responses
- multi_turn_dataset: validate tool_results entries have required
  tool_call_id and content fields; raise InputValidationError at load time
- multi_turn_dataset: remove unused "index" field from samples metadata
- multi_turn_strategy: wrap mark_turn_complete/mark_turn_failed in
  try/except KeyError in on_sample_complete
- multi_turn_strategy: clear _inflight at end of execute() with warning
  if entries remain (transport failure or session abort)
- docs: remove prescriptive concurrency sizing guide; replace with
  definition of what target_concurrency controls
- docs: rename "Long Conversations" to "Conversations with Many Turns"
- docs: add dataset validation utility reference in Troubleshooting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address remaining PR mlcommons#285 review comments

- Fix refusal field set to literal string "None" instead of "" in
  openai_adapter.py — made downstream refusal checks incorrectly truthy
- Add test_pipeline_error_propagated to verify execute() re-raises
  worker exceptions instead of swallowing them via gather(return_exceptions=True)
- Clarify MultiTurnStrategy docstring and MULTI_TURN_QUICKSTART.md:
  target_concurrency = simultaneous conversations (not requests);
  each active conversation has exactly 1 in-flight turn at a time
- Remove unjustified "Common Configurations" section from quickstart
- Correct misleading "workers = concurrent conversations" tip; clarify
  client.workers and target_concurrency are independent layers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: replace worker-pool with event-driven model in MultiTurnStrategy

Rewrites MultiTurnStrategy to issue subsequent turns synchronously inside
on_sample_complete() (zero event-loop delay), removing pre-spawned worker
tasks and per-conversation asyncio.Event waiting. ConversationState no
longer holds an asyncio.Event; sequencing is driven entirely by the
strategy. Addresses PR mlcommons#285 reviewer request to move turn issuance into
the sample-complete handler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR mlcommons#285 review comments for multi-turn implementation

- Remove ConversationMode enum (single-member) and mode field from
  MultiTurnConfig; drop mode: independent from YAML examples and docs
- Merge AddDefaultColumns into AddStaticColumns(overwrite=False)
- Replace per-call strategy check with construct-time branch in execute.py
- Normalize None tool-calling content to "" in openai_adapter.py
- Delete unused Query.metadata, QueryResult.with_metadata, and
  InFlightRequest.query_metadata plumbing
- Add role-specific validation in _validate_conversation_structure:
  tool rows require non-empty tool_results, assistant rows require
  content or tool_calls
- Backfill explicit sample_index into conversation_metadata["samples"];
  MultiTurnStrategy reads sample_meta["sample_index"] instead of enumerate
- Add AT-RISK gc=False docstring notes to openai/types.py structs with
  mutable container fields
- Rewrite dataset tool_call_ids with model-generated ids in live-history
  mode; add test_live_history_remaps_tool_call_id integration test
- Lift inline imports to top of test_schema.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Import fix

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* fix: revert out-of-scope live-history tool_call_id rewriting

Remove tool_call_id rewriting from live-history mode (last_assistant_tool_call_ids
field, ConversationManager population, MultiTurnStrategy rewrite logic) and the
corresponding integration test. Live-history improvements are not in scope for
this PR. Also revert the _mt_strategy closure capture in execute.py that was not
requested by any review comment, while keeping the is-None branch elimination.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix issue with tool call accumulation and reasoning content

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* feat: account for tool-call tokens in OSL / TPOT / TPS metrics

Tool-call tokens were completely excluded from output sequence length,
TPOT, and TPS because they were only stored in QueryResult.metadata and
never reached TextModelOutput or EventRecord.data.

- Add `tool_calls` field to TextModelOutput; __str__ and
  text_after_first_chunk include JSON-encoded tool calls so the full
  generation is counted
- Add as_message_parts / as_message_parts_after_first_chunk helpers for
  chat-template-aware tokenization in the metrics pipeline
- OpenAI SSE accumulator populates tool_calls in TextModelOutput and
  emits a zero-length sentinel StreamChunk on the first tool-call delta
  so TTFT fires for agentic (content-free) responses
- Both OpenAI adapters (msgspec and pydantic) route tool_calls into
  TextModelOutput in addition to metadata
- TokenizePool gains token_count_message / token_count_message_async
  using apply_chat_template + baseline subtraction, with fallback to
  whitespace tokenization when the template raises
- OslTrigger and TpotTrigger override the new _extract_message hook to
  use the message tokenization path when tool_calls are present
- Forward `tools` key through MultiTurnDataset per-conversation defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: correct chat-template tokenization for tool-call messages

Two bugs in TokenizePool._token_count_message_worker caused OSL / TPOT
to be inflated for every response containing tool_calls:

1. tool_calls[].function.arguments arrives as the OpenAI wire-format JSON
   string, but Hermes-style chat templates (Qwen3-Coder, etc.) iterate
   arguments as a mapping. Passing a string raises, and the code silently
   fell through to whitespace-splitting content + reasoning + json(tool_calls) —
   counting every JSON bracket, quote, and escape as its own token.
   Fixed by parsing arguments to dict before rendering.

2. apply_chat_template rejects assistant-only message lists on several
   templates ("No user query found in messages"). The render also raised,
   forcing the fallback path. Fixed by prepending an empty user message
   and subtracting its token length back out.

Also switched the render path from tokenize=True (which returns a single-
element [Encoding] in recent transformers, so len() was 1) to
tokenize=False followed by tokenizer.tokenize(rendered), matching how
_token_count_worker measures plain text.

Verified on a real Qwen3.6-35B-A3B response: a tool-calling turn that
previously reported 130 tokens now reports 100, matching the raw-bytes
reference of 102 (2-token delta is the template's <think>\\n scaffolding).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: fix stale references and tool-row format in multi-turn docs

- QUICKSTART: validate_jsonl_schema.py only does per-row JSON Schema
  checks; cross-row invariants (role sequences, turn numbering, grouping)
  are enforced by MultiTurnDataset at load time, not the script
- README: collapse single/merged tool rows into unified tool_results form
  to match what MultiTurnDataset._validate_conversation_structure enforces
- multi_turn_dataset.py: fix docstring referencing removed AddDefaultColumns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: pre-compute ISL token counts for multi-turn dataset-history mode

- Add _precompute_isl_for_multi_turn() in execute.py: runs
  apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
  once per client turn at setup time and stores results in
  sample["input_tokens"], hitting the IslTrigger sync fast path
  (len(token_ids)) with zero hot-path cost.
- Add _extract_prompt_text() in session.py: refactors inline message
  content extraction to handle list-form multimodal content safely,
  fixing a crash when content is a list (e.g. vision/tool-call messages).
- Add unit tests for both helpers and two integration tests covering
  target_concurrency cap enforcement and pipeline exception propagation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: unwrap BatchEncoding from apply_chat_template for Qwen3 tokenizer

Qwen3's fast tokenizer returns a BatchEncoding object from
apply_chat_template(tokenize=True) instead of a plain list[int].
Storing the BatchEncoding in sample["input_tokens"] caused a msgspec
serialization error at benchmark setup time. Extract .input_ids when
the return value has that attribute; fall back to the plain list otherwise.

Add a regression test using a mock BatchEncoding so this is caught
before it can regress again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: accuracy phases now inherit configured load pattern instead of flooding with MAX_THROUGHPUT

Hardcoded LoadPatternType.MAX_THROUGHPUT for accuracy phases caused all
requests to be issued simultaneously with no concurrency cap, exhausting
the SGLang KV pool and producing truncated outputs (finish_reason=length)
with 0/990 final answers for GPQA. Accuracy phases now inherit the perf
phase load pattern, downgrading MULTI_TURN to CONCURRENCY (same cap) when
the accuracy dataset is not a MultiTurnDataset.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix pre-commit

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* Fix CI error for completion

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* Change to SSE choice for test completion

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* fix: address PR mlcommons#285 review deficiencies in multi-turn stack

Correctness fixes:
- multi_turn_strategy: derive response_text from TextModelOutput.output
  directly so tool-call JSON is not duplicated into assistant history
- multi_turn_strategy: KeyError path in on_sample_complete now pops
  _active_iters and calls _fill_slot() to prevent session hang
- multi_turn_strategy: logger.exception preserves traceback on issuance failure
- multi_turn_strategy: raise InputValidationError at __init__ when
  live-history mode (use_dataset_history=False) is combined with tool turns;
  removes the has_tool_msg warning that sent unissued tool_call_ids
- multi_turn_strategy: _handle_timeout synthesises a failure QueryResult and
  routes it through the composite callback so accuracy collector and event
  logger see timed-out turns; counts and logs dropped downstream turns;
  late responses get a debug log instead of silent drop
- execute.py: each hook in _on_sample_complete wrapped independently so a
  strategy failure cannot suppress accuracy collection
- execute.py: move AutoTokenizer import to top-level; narrow ISL precompute
  exception to (TemplateError, KeyError, ValueError, TypeError); raise
  RuntimeError when all samples fail
- token_metrics: join fallback parts with "\n" to avoid cross-boundary
  token merging; logger.exception for baseline computation failure
- adapter_protocol: per-document SSE try/except so one bad frame does not
  drop the rest of the buffer; filter None returns from decode_sse_message
- openai/types: add role field to SSEDelta to accept streaming first frame
- dataset_manager/factory: skip ColumnRemap for MultiTurnDataset
- multi_turn_dataset: warn and skip samples missing pre-built messages
- config/schema: add gt=0 validator on MultiTurnConfig.turn_timeout_s

Docs:
- examples/09_MultiTurn/README.md: correct concurrency/timeout semantics,
  mark per-conversation metrics as planned
- examples/10_CollectOutputs: add example for output collection

Tests:
- test_live_history_rejects_tool_turns: asserts InputValidationError at init
- test_isl_precomputed_for_dataset_history: guards ISL precompute hot path
- annotate bare except blocks in CapturingEchoServer with explanatory comment

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: close residual PR mlcommons#285 review deficiencies

- execute.py: guard accuracy phase against MULTI_TURN load pattern on
  non-MultiTurnDataset (currently unreachable, made explicit)
- runtime_settings.py: clamp multi-turn sample count to dataset size;
  warn when min_sample_count exceeds client-turn count
- token_metrics.py: emit one-shot per-(tokenizer, exc-class) warning
  when apply_chat_template falls back to whitespace tokenization
- strategy.py, schema.py, README.md: fix stale docstrings/docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: drop jinja2 import and fix test mocks for ISL precompute

Top-level `import jinja2` in execute.py required an undeclared runtime
dep, aborting CI at collection time. Revert the narrow except tuple back
to `except Exception:` — `logger.exception` already delivers the traceback
visibility the reviewer asked for. Fix `test_precompute_isl.py` patch
targets (broken since AutoTokenizer moved to module top) and guard the
all-failed check against the zero-samples-with-messages edge case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: address Copilot review comments on multi-turn implementation

Code fixes:
- Timeout handler now decrements PhaseIssuer.inflight, preventing
  _drain_inflight() from hanging when any turn times out (#1)
- _precompute_isl_for_multi_turn normalizes tool_calls arguments
  before apply_chat_template, fixing Hermes-style template failures
  on tool-call-bearing samples (#2)
- Non-streaming reasoning_content now set on TextModelOutput.reasoning
  so OSL/TPOT tokenization includes it (mlcommons#3)
- live-history data_override clears stale input_tokens and token_ids (mlcommons#18)
- execute() cleanup runs in finally block, surviving CancelledError (mlcommons#21)
- Validator rejects plain-assistant->tool (missing tool_calls) (mlcommons#13)
- Message builders include 'name' field for prior and current turns (mlcommons#19, mlcommons#20)
- max_new_tokens sets all three adapter aliases (mlcommons#22)
- Multi-turn accuracy datasets raise InputValidationError (not yet supported) (mlcommons#4)

Examples and docs:
- Remove ineffective samples: 10 from multi-turn YAML examples (mlcommons#8, mlcommons#9)
- Fix events.jsonl docs (no conversation_id/turn_number fields) (mlcommons#6, mlcommons#7, mlcommons#14)
- Fix workers -> num_workers in quickstart config snippet (mlcommons#10)
- Fix agentic role-sequence grammar in README (mlcommons#15)
- Document that multi-turn accuracy datasets are not yet supported
- Delete examples/10_CollectOutputs/ (mlcommons#11, mlcommons#12, mlcommons#16, mlcommons#17)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* refactor: typed ConversationMetadata dataclass, single build in load(), tool-call ISL test

- R1: Move _build_metadata() from __init__ to load() so pre_built_messages_by_key
  is always built against the post-transform dataframe (latent desync bug).
  conversation_metadata is None until load() is called.

- R2: Replace conversation_metadata dict[str, Any] with @DataClass ConversationMetadata
  (+ ConversationSampleEntry). Attribute access in MultiTurnStrategy replaces .get()
  calls; mypy now flags typos at the call site.

- R3: Append unit test to TestPrecomputeIslForMultiTurn asserting that
  _normalize_tool_calls_for_template converts tool_calls[].function.arguments
  from JSON string to dict before apply_chat_template (previously untested branch).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR mlcommons#285 round-4 review comments

- Assert conversation_metadata is not None before MultiTurnStrategy
  construction (fixes mypy CI failure at execute.py:634 and
  test_multi_turn.py:89 introduced by R1 moving build to load())
- Narrow SSE decode exception catch from bare Exception to
  msgspec.DecodeError/ValidationError and raise log level to warning
- Clear uuid_to_index entry on turn timeout to prevent inflight counter
  going negative when a late response arrives after timeout
- Add @pytest.mark.unit to 7 unmarked test classes in test_types.py so
  they are included in pytest -m unit CI lane

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Send conversation_id and turn number

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* Address perf concerns

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* fix: address PR mlcommons#285 round-5 Copilot review comments

- Revert 595faf4 (conversation_id/turn in EventRecord/PhaseIssuer/session)
- Validate user row content and reject assistant(tool_calls)→user transition
  without intervening tool row in _validate_conversation_structure
- Reject malformed tool_calls (present but not a non-empty list) on assistant rows
- Guard _expand_tool_results against non-dict entries in tool_results list
- Add "tools" to ColumnFilter optional_columns in both OpenAI adapters so
  single-turn datasets with a tools column are not silently stripped
- Replace RuntimeError with logger.warning in _precompute_isl_for_multi_turn
  so template-incompatible tokenizers fall back instead of aborting the benchmark
- Fix schema cross-validation to check only the performance dataset for
  multi_turn config instead of any dataset
- Move seeding loop inside try/finally in MultiTurnStrategy.execute so
  cleanup runs even if _start_conversation raises
- Add inflight/uuid_to_index to FakePhaseIssuer for _handle_timeout coverage
- Strengthen test_no_matching_columns to assert unrelated columns are preserved
- Update MULTI_TURN_QUICKSTART.md to accurately describe which turns produce
  sample events and how to correlate events back to conversations

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: update test_schema.py error message assertions for Fix 5

Match new error messages from the multi_turn schema cross-validation
fix that scopes validation to the performance dataset specifically.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix ci test failure post merge

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* fix: address PR mlcommons#285 round-6 Copilot review comments

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix double-firing of timed-out turns

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

* feat: stamp conversation_id and turn on EventRecord pipeline

Thread (conversation_id, turn) from the load generator through the event
record / SQL writer pipeline so multi-turn runs can group ISSUED, COMPLETE,
and ERROR events by conversation in the event log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: sum_sq overflow, streaming conv stamping, tool-call metric coverage

Change 0: cast sum_sq to float in _hdr_stat/_exact_stat to prevent
msgpack OverflowError for ns-range latencies >= 4.3s. SeriesStat wire
schema already accepts int|float.

Change 1: stamp conversation_id/turn on RECV_FIRST/RECV_NON_FIRST events
using .get() (not .pop()) so the uuid_to_conv_info entry remains available
for the terminal QueryResult pop.

Change 2: add test_tpot_osl_for_tool_call_complete to TestAsyncTriggers
with exact OSL/TPOT value assertions for tool-call streaming responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: stamp conv_id/turn on timeout events, clarify live-history guard, update docs

Closes Copilot review comments mlcommons#41, mlcommons#58, mlcommons#81, mlcommons#87, mlcommons#89, mlcommons#90, mlcommons#91 (round 7):

- timeout EventRecords: capture uuid_to_conv_info in a single .pop() and
  pass conversation_id/turn to both ERROR and COMPLETE constructors so
  timed-out turns correlate like normal completions
- live-history guard: add comment explaining why only role:tool rows are
  rejected; tighten error message wording
- MULTI_TURN_QUICKSTART.md: rewrite Events Log section to reflect
  conversation_id/turn on EventRecords; update validate_jsonl_schema
  description to clarify row-level vs cross-row validation scope
- examples/09_MultiTurn/README.md: update event correlation paragraph;
  fix agentic role-sequence diagram to [assistant(tool_calls)->tool]*;
  add note about merged tool_results
- offline_template_full.yaml: correct load_pattern comment to
  "offline only: max_throughput"
- scripts/regenerate_templates.py: apply offline-specific override so
  regeneration preserves the corrected comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: dataset validation, _fill_slot hang prevention, stale docs, tests

Address remaining dataset-history-scope reviewer concerns from PR mlcommons#285:

- MultiTurnDataset: reject null/empty/NaN conversation_id at construction
  (groupby dropna=False + explicit guard in _validate_conversation_grouping)
- MultiTurnDataset: validate per-call tool_calls shape (id, type, function.name,
  function.arguments) and per-entry tool_results shape (tool_call_id, content)
  at construction time instead of failing silently at the endpoint
- MultiTurnDataset.load(): raise NotImplementedError when adapter= is passed
  without api_type/model_params (was silently ignored)
- MultiTurnStrategy._fill_slot(): wrap in try/except so a transport or dataset
  error during a timeout/error-triggered slot refill sets _error and _all_done
  instead of leaving execute() hanging on _all_done.wait() forever
- docs: reword stale turn_timeout_s comment and target_concurrency description
- tests: add regression coverage for all fixes plus ISL/OSL/TPOT aggregator
  metrics integration test for multi-turn turns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: emit synthetic ISSUED+ERROR+COMPLETE for dropped turns on timeout/error

Dropped turns after a timeout or error were silent — absent from
events.jsonl, the sample-index map, and the accuracy collector.
Introduces PhaseIssuer.register_skipped() (mirrors issue() minus the
HTTP send and inflight increment) and wires both drop sites in
MultiTurnStrategy through _abort_remaining_turns(), which calls
register_skipped() and publishes synthetic failure EventRecords per
dropped turn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* build: pin jinja2 as direct dep for apply_chat_template

transformers' apply_chat_template (used by multi-turn ISL precompute and
metrics token rendering) requires jinja2 at runtime; without it ImportError
falls through silent text-tokenization fallbacks and quietly degrades ISL/
OSL/TPOT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: cover TTFT and tool-call streaming in multi-turn metrics test

Parameterize the streaming aggregator test over text and tool-calls-only
payloads, and add TTFT to the asserted metric set. Closes the gap where
neither unit nor integration tests verified that TTFT fires on a
RECV_FIRST whose payload carries only tool_calls.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file python:uv Pull requests that update python:uv code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants