chore(deps): bump idna from 3.11 to 3.15 by dependabot[bot] · Pull Request #2 · CentML/mlperf-endpoints

dependabot · 2026-05-19T23:02:25Z

Bumps idna from 3.11 to 3.15.

Changelog

3.15 (2026-05-12)

Enforce DNS-length cap on individual labels early in check_label, short-circuiting contextual-rule processing for oversized input while staying compatible with UTS 46 usage.

Tidy core helpers: hoist bidi category sets to module-level frozensets (avoiding per-codepoint list construction), simplify length checks, and reuse the shared _unicode_dots_re from idna.core in the codec module.

Use raise ... from err for proper exception chaining and switch internal string formatting to f-strings.

Allow flit_core 4.x in the build backend.

Expand the ruff lint set (flake8-bugbear, flake8-simplify, pyupgrade, perflint) and apply the surfaced fixes; pin lint CI to Python 3.14.

Add Dependabot configuration for GitHub Actions.

Convert README and HISTORY from reStructuredText to Markdown.

Reference CVE-2026-45409 for the 3.14 advisory in place of the initial GHSA identifier.

Thanks to Felix Yan, Stan Ulbrych, and metsw24-max for contributions to this release.

3.14 (2026-05-10)

Removed opportunity to process long inputs into quadratic time by rejecting oversize inputs up-front. Closes a bypass of the CVE-2024-3651 mitigation. [CVE-2026-45409]

Thanks to Stan Ulbrych for reporting the issue.

3.13 (2026-04-22)

Correct classification error for codepoint U+A7F1

3.12 (2026-04-21)

Update to Unicode 17.0.0.

Issue a deprecation warning for the transitional argument.

Added lazy-loading to provide some performance improvements.

Removed vestiges of code related to Python 2 support, including segmentation of data structures specific to Jython.

Thanks to Rodrigo Nogueira for contributions to this release.

Commits

af30a09 Release 3.15
30314d4 Pre-release 3.15rc0
05d4b21 Merge pull request #237 from kjd/convert-docs-to-markdown
2987fdb Convert README and HISTORY from reStructuredText to Markdown
59fa800 Merge pull request #236 from kjd/dependabot/github_actions/actions-f3e34333ea
def6983 Merge branch 'master' into dependabot/github_actions/actions-f3e34333ea
bbd8004 Merge pull request #234 from StanFromIreland/patch-1
edd07c0 Bump github/codeql-action from 3.35.2 to 4.35.2 in the actions group
5557db0 Merge branch 'master' into patch-1
f11746c Merge pull request #235 from StanFromIreland/patch-2
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [idna](https://github.com/kjd/idna) from 3.11 to 3.15. - [Release notes](https://github.com/kjd/idna/releases) - [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.md) - [Commits](kjd/idna@v3.11...v3.15) --- updated-dependencies: - dependency-name: idna dependency-version: '3.15' dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>

…without shmem (mlcommons#306) * feat(metrics): add registry, samplers, and snapshot wire schema Introduces the three primitives that the upcoming pub/sub metrics path will compose on top of: - snapshot.py: MetricsSnapshot wire struct (msgspec, tagged union of CounterStat | SeriesStat) plus SessionState enum (LIVE / DRAINING / COMPLETE) and msgpack codec. - registry.py: MetricsRegistry holding CounterSamplers and SeriesSamplers. Series samplers carry an HDR Histogram for cheap live percentiles, an array.array of raw values for exact-final computation, and exact rollup primitives. Histogram bucket edges are log-spaced over the observed [min, max] per snapshot, so they auto-zoom to data instead of wasting buckets on empty range. - New unit tests cover the wire codec round-trip, sampler hot path, and registry registration/collision behavior. Adds hdrhistogram==0.10.3 as a runtime dependency. Wiring of these primitives into the aggregator and removal of the old KVStore path follow in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(metrics): add MetricsPublisher and MetricsSnapshotSubscriber - publisher.py: MetricsPublisher owns the periodic tick task that publishes live MetricsSnapshots over IPC pub/sub at refresh_hz, plus publish_final() which is awaited by the aggregator on ENDED. Final delivery is dual-path: * pub/sub publish (best-effort, telemetry knobs sndhwm=4, linger=10s) * disk fallback (atomic: tmp + fsync(file) + rename + fsync(parent dir)) Both paths are independently wrapped in try/except — neither failure suppresses the other. publish_final is async and awaits tick-task cancellation before publishing COMPLETE so a late LIVE/DRAINING tick can never land after COMPLETE on the wire. - subscriber.py: MetricsSnapshotSubscriber tracks ``latest`` and the ``COMPLETE``-state snapshot. Defaults to conflate=True (TUI / report consumer) but parametrized for any consumer that needs every tick. - New unit tests cover tick-task lifecycle, atomic disk fallback, independence of pub/sub vs disk failure paths, and the regression that publish_final must await tick-task cancellation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(metrics): wire pub/sub into aggregator, remove KVStore + mmap Replaces the mmap-backed BasicKVStore with the registry/publisher path introduced in the previous two commits. Aggregator changes: - MetricsAggregatorService now constructs a MetricsRegistry and MetricsPublisher on entry; trigger callbacks call registry.record / registry.increment instead of kv_store.update. - Tracks SessionState (LIVE → DRAINING on ENDED → COMPLETE on publish_final). The publisher tick task captures (state, n_pending_tasks) per tick via a callback; consumers detect drain timeout as state == COMPLETE and n_pending_tasks > 0. - Adds TRACKED_SAMPLES_FAILED counter, incremented on ERROR events whose tracked row still exists at processing time. Correctness depends on the load_generator emitting ERROR before COMPLETE; the matching test asserts that order. - ENDED handler awaits drain_tasks (30s timeout), publish_final, and closes the publisher (linger=10s drains pending pub/sub frames). Report changes: - Replaces from_kv_reader with from_snapshot (pure function on a MetricsSnapshot). complete is derived from state == COMPLETE and n_pending_tasks == 0. Display warns when not complete. Main-process changes (commands/benchmark/execute.py): - Spawns a MetricsSnapshotSubscriber on the main loop. Triple-redundant report sourcing: pub/sub COMPLETE → disk fallback → latest live. - Removes _setup_kv_reader, ARM tmpfs branching, and mmap salvage in _salvage_tmpfs (events.jsonl salvage is preserved). - Awaits subscriber.wait_for_complete(timeout=2.0) after launcher exit so the loop can dispatch the COMPLETE frame before deciding the pub/sub path missed. Removed: - async_utils/services/metrics_aggregator/kv_store.py - async_utils/services/metrics_aggregator/fs_check.py Tests: - Deletes test_kv_store.py. - Marks test_aggregator.py / test_aggregator_e2e.py / test_metrics_table.py / test_report_builder.py / conftest.py with module-level skip + a TODO referencing the design doc; rewriting these on the new fixtures is a tracked follow-up. - Adds test_aggregator_error_handler.py covering the TRACKED_SAMPLES_FAILED increment path and the negative case where COMPLETE arrives before ERROR (documents the bug the ERROR/COMPLETE swap fixes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(load_generator): emit ERROR before COMPLETE for failed queries Swaps the publish order in BenchmarkSession._handle_response so that a QueryResult carrying an error emits ErrorEventType.GENERIC first, then SampleEventType.COMPLETE. This is required for metrics-aggregator correctness: COMPLETE causes MetricsTable.set_field to remove the tracked row, so an ERROR observed afterward has no row to inspect and TRACKED_SAMPLES_FAILED would silently stay at 0. Emitting ERROR first keeps the row alive long enough for the aggregator's error handler to identify the failure as tracked. EventLoggerService and other event consumers treat the two event types independently, so order is invisible to them. The test_failed_query_published_as_error_event test now asserts the order explicitly so a future revert is caught immediately, and the aggregator-side regression is covered by test_aggregator_error_handler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(agents): update AGENTS.md for metrics pub/sub refactor - Core Data Flow diagram: replaces "EventRecorder + MetricsReporter" with the events PUB → EventLoggerService / MetricsAggregatorService fan-out + main-process Report subscriber. - Key Components table: adds Metrics Aggregator and Report rows; notes the load_generator's ERROR-before-COMPLETE invariant. - New "Metrics Aggregator subprocess (pub/sub)" section under Hot-Path Architecture: state machine, sampler storage layout, hot path API, dual-path final delivery, dynamic histogram edges. - Code Organization tree: expands metrics_aggregator/ to list each module; removes deleted recorder.py / reporter.py from metrics/. - Key Dependencies table: adds hdrhistogram (C-backed HDR Histogram). - Test fixtures: removes events_db (deleted with the KVStore path). Per AGENTS.md's own self-update rules: "Treat AGENTS.md changes as part of the refactor itself — include them in the same PR". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): address P0 review-council findings Two high-severity issues raised by the review-council pass on PR mlcommons#306: 1. (mlcommons#306-1) Subscriber late-binding could drop early ticks via the ZMQ slow-joiner pattern. Move MetricsSnapshotSubscriber construction + start() BEFORE launcher.launch() so the SUB handshake completes during the subprocess-spawn window. ZMQ tolerates connect-before- bind on IPC — the connect resolves once the binder appears. The prior ordering (subscribe AFTER launch returns) had a window where the aggregator could begin ticking on STARTED before the SUB subscription warmed up, dropping early live snapshots and, in the worst case, missing COMPLETE entirely. 2. (mlcommons#306-2) MetricsPublisher._write_atomic_fallback runs synchronous f.flush + fsync(file) + fsync(parent dir) + rename on the aggregator's event loop. On a busy host this can block tens-to- hundreds of ms — long enough to back-pressure event-record processing. Wrap with asyncio.to_thread inside publish_final. Both fixes are localized — no API changes, no test changes required. Existing integration tests (test_concurrency_benchmark, test_end_to_ end_oracle) exercise both paths end-to-end and still pass. The third P0 item (mlcommons#306-3, unbounded raw-sample retention) is the agreed memory trade documented in metrics_pubsub_design_v5.md §11; addressed by adding "--persist-raw" as a tracked follow-up rather than a code change in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(metrics): rewrite skipped suites on registry/snapshot fixtures The metrics pub/sub refactor (PR #N) module-level-skipped four test files plus their conftest because they hard-coupled to the deleted KVStore API. This commit reinstates them on the new fixtures, in scope with the design doc's "test impact" callout. - conftest.py: rewrites shared fixtures to construct MetricsRegistry and MetricsTable instances directly. Drops events_db (SQLite fixture deleted with the KVStore path). - test_metrics_table.py: 16 tests covering tracking-window lifecycle, trigger dispatch on field updates, tracked-block accounting, and the in-flight async-task drain path. - test_aggregator.py: 31 tests covering MetricsAggregatorService end to end (in-process, MagicMock publisher) — counter accounting, ISSUED/COMPLETE/error event handling, ENDED → publish_final sequence, and the LIVE → DRAINING state transition. Adds a new TestCounterAccounting class to cover the total_* vs tracked_* counter split that the legacy tests conflated. - test_aggregator_e2e.py: 3 tests round-tripping a real MetricsPublisher ↔ MetricsSnapshotSubscriber over IPC, covering COMPLETE-only delivery, LIVE-tick-then-COMPLETE lifecycle, and counter+series wire shape. - test_report_builder.py: 14 tests on Report.from_snapshot, including the complete=False derivation when state != COMPLETE or n_pending_tasks > 0. Net: 64 new tests across the 4 suites; full unit suite up from 660 to 724 passing. The 4 module-level skips are gone. Production-code surfaces flagged for follow-up coverage: - AsyncTokenTrigger exception path in metrics_table.py - SeriesSampler HDR clamp warn-once branch - MetricsAggregatorService._finalize shutdown_event signaling - Report.tps() OSL-empty-with-duration case Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(templates): unblock TestTemplateIntegration without HF_TOKEN The 6 generated-template integration tests were skipped unconditionally in CI/dev because the template placeholders default to gated meta-llama/Llama-3.1-* repos that require HF_TOKEN to fetch the tokenizer. Substitute TinyLlama/TinyLlama-1.1B-Chat-v1.0 for the model name in _resolve_template after placeholder expansion. TinyLlama is non-gated (~1MB tokenizer download), shares the Llama-family tokenizer the templates were written against, and the echo-server path doesn't care about model identity — only that AutoTokenizer.from_pretrained succeeds for the metrics aggregator's ISL/OSL/TPOT triggers. Drops the @pytest.mark.skipif(not HF_TOKEN) decorator, removes the now- unused os import. Effect: integration suite goes from 20 passed / 8 skipped to 26 passed / 2 skipped. The remaining 2 skips need real LLM servers (vLLM/SGLang) which aren't in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(agents): add reference-hygiene rules + clean up violations Adds two new sections to AGENTS.md "Development Standards": 1. "Documentation references — no local-only artifacts" — docs and comments must not reference paths outside the repo (gitignored directories, local scratch dirs, contributor workstation paths). A reviewer fetching the PR should be able to follow every cited reference. 2. "Comments and docstrings — describe current state, not development history" — no comments narrating iteration on the codebase ("we tried X first", "an earlier implementation did Y"). Such pointers belong in the PR description and git log, not the source tree. Especially relevant under AI-assisted development where it's tempting to leave a paper trail of design pivots inline. Sweeps existing violations across both rules: Production code: drops cites to ``metrics_pubsub_design_v5.md`` from module/class docstrings (snapshot.py, registry.py, publisher.py) and inlines self-contained rationale where useful (aggregator.py HDR bounds, TOTAL_DURATION_NS comment). Tests: removes "Migrated to ..." / "The legacy tests ..." framing from rewritten test module docstrings; reframes regression-test docstrings (test_registry.py, test_publisher.py, test_aggregator.py) to describe the invariant being protected rather than narrating the prior bug's discovery. AGENTS.md: removes its own self-violation cite to the gitignored design doc. Behavior: no functional changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(metrics_table): encapsulate in-flight task access Address PR mlcommons#306 review comments from gemini-code-assist (encapsulation) and github-code-quality (non-iterable enum loop): - Add `MetricsTable.in_flight_tasks_count` property so the aggregator no longer reaches into `table._in_flight_tasks` to report pending- task counts on snapshots and drain logging. - Add `MetricsTable.cancel_in_flight_tasks()` returning the list of cancelled tasks (sets up the T3 await-cancellations fix). - Update aggregator.py call sites accordingly. - Use `MetricCounterKey.__members__.values()` in test_report_builder to satisfy CodeQL's "non-iterable used in for-loop" check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(metrics): rename refresh_hz → publish_interval_s (seconds) Address PR mlcommons#306 review comments from arekay-nv asking for the "interval" naming convention used elsewhere in the repo (e.g. `check_interval` in worker_manager, `interval` in benchmark_httpclient). - CLI flag `--refresh-hz <Hz>` → `--publish-interval <seconds>` (default 4.0 Hz → 0.25 s; same wire cadence). - Constructor parameters `refresh_hz` (aggregator + publisher) → `publish_interval_s`. The `_s` suffix makes the unit explicit so call sites can't accidentally pass a frequency. - Internal field `_refresh_hz` → `_publish_interval_s`. - Drops the `period = 1.0 / refresh_hz` indirection in publisher.start. - Tests / AGENTS.md updated accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(report): numerically stable variance for integer-aggregate series Address PR mlcommons#306 gemini-code-assist comment on report.py:53. For ns-precision latency series (`SAMPLE_LATENCY_NS`, `TTFT_NS`, `TPOT_NS`, etc.) the rollups store `total` and `sum_sq` as Python ints that can grow to many digits. The previous formula `sum_sq - total*total / n` evaluates `total*total / n` as a float and catastrophically cancels against `sum_sq` when the variance is small relative to the mean, producing a negative variance numerator the sqrt() then clamps to 0. Use the exact integer numerator `n*sum_sq - total*total` when the inputs are ints (this is what the math.sqrt sees, no cancellation), falling back to the float form for series whose dtype is float (currently only TPOT, where the magnitudes are small enough that the naive form is fine). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): drain / shutdown correctness for cancel + SIGTERM Address PR mlcommons#306 review-council items mlcommons#4, mlcommons#9, mlcommons#12. mlcommons#4 — Cancellations not awaited before reading n_pending After a drain timeout, the aggregator's `t.cancel()` loop only *scheduled* cancellation; reading `n_pending` on the next line therefore reported a stale-high count and left the to-be-cancelled tasks runnable when the loop tore down. Now `await asyncio.gather(*cancelled, return_exceptions=True)` runs before `n_pending = table.in_flight_tasks_count`, so the snapshot reflects the post-cancellation set and the cancelled tasks actually exit. mlcommons#9 — close() cancels tick task but doesn't await it Added `MetricsPublisher.aclose()` (async) that cancels the tick task AND awaits its exit before closing the underlying transport. Aggregator's post-publish_final path and __main__.py's finally block now use it. Sync `close()` is kept for sync error-path fallbacks with a docstring noting the race. mlcommons#12 — SIGTERM bypasses publish_final Installed `SIGTERM` and `SIGINT` handlers in __main__.py that fire `publish_final` defensively before setting `shutdown_event`. Added `MetricsPublisher._finalized` so the SIGTERM-triggered and the ENDED-triggered paths are safe to race — only the first call publishes a COMPLETE frame. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): pre-check HDR `high >= 2*low` before HdrHistogram ctor Address PR mlcommons#306 council review mlcommons#5 (registry.py:161). The C-backed hdrhistogram constructor requires `high >= 2*low` but raises an opaque allocation error if that doesn't hold — making it hard to debug a misconfigured `register_series` call. Add an explicit pre-check after the low/high clamps so the error names the series and both values up front. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): guard publisher.start against double-STARTED Address PR mlcommons#306 council review mlcommons#8 (aggregator.py:281). A repeat `SessionEventType.STARTED` (replay buffer, buggy producer, test fixture) used to make `MetricsPublisher.start` overwrite `_tick_task`, orphaning the first tick task — it kept running until GC and raced the new task to publish snapshots. Make `start` idempotent: if `_tick_task` is already set, log a warning and return without spawning a second task. The original task remains the one `publish_final` / `aclose` cancels and awaits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): structured logging around aggregator subprocess crash Address PR mlcommons#306 council review mlcommons#10 (__main__.py:166). Wrap the top-level `run_until_complete(main())` so startup / bind / tokenizer-load failures emit a structured `logger.exception` before the interpreter prints the traceback. The parent's ServiceLauncher previously saw only the non-zero exit code and a raw stderr trace with no context to correlate against the parent's logs. `SystemExit` is re-raised untouched so argparse usage / explicit sys.exit paths stay user-facing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(session): document publish-order invariant for ERROR-before-COMPLETE Address PR mlcommons#306 council review mlcommons#11 (session.py:408). The metrics aggregator's `TRACKED_SAMPLES_FAILED` accounting relies on the publisher delivering ERROR strictly before COMPLETE for a failed sample. The ordering is correct today (ZMQ PUB→SUB in-order delivery, ZmqMessagePublisher batches without reordering), but it's an implicit contract — a future transport refactor that breaks it would break tracked-failure counting silently. Document the invariant inline so that future refactors trip over it instead of past it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(metrics): --drain-timeout flag, default bumped to 60s Address PR mlcommons#306 arekay-nv inline comment on aggregator.py:99 ("This might need to be higher"). Analysis: at the system's design point (50k QPS short-context, default 2 tokenizer workers) the 30 s drain finishes in well under a second. Long-context tokenize workloads can push the backlog higher — a 32k- context 5k-QPS run with 2 workers can take ~100 s to drain. The right knob there is `--tokenizer-workers`, not the drain budget, but giving the user a CLI handle makes both ends tunable without redeploying. Changes: - Default drain budget bumped 30s → 60s. Covers normal + long-context at the default 2 workers without inflating the high-QPS short- context case (we exit early when drain_tasks returns). - New `--drain-timeout <seconds>` CLI flag plumbed through the aggregator subprocess and into `MetricsAggregatorService` as a constructor arg `drain_timeout_s`. The kwarg is positionable (not a global) so callers can inject test values without monkey-patching the module-level constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(metrics): add SessionState.INITIALIZE for pre-START phase Address PR mlcommons#306 arekay-nv inline comments on snapshot.py:35 (add an `INITIALIZE` state preceding `LIVE`) and test_snapshot.py:67 (add state-check tests). - Add `SessionState.INITIALIZE = "initialize"` to the wire schema; the aggregator now starts in INITIALIZE and transitions to LIVE on the first STARTED event. The state machine is forward-only: INITIALIZE → LIVE → DRAINING → COMPLETE. - No INITIALIZE snapshot is emitted today (the tick task only starts on the first STARTED), but the state exists as the well-defined starting point and so a future setup-phase tick has a state to carry. Wire compatibility is preserved — INITIALIZE round-trips through the codec (test added). - New `TestSessionStateTransitions` pins: member set, declaration order (consumers can rely on `list(SessionState)` for forward checks), the `complete = state == COMPLETE and n_pending_tasks == 0` rule across every state, and the INITIALIZE round-trip. - AGENTS.md updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(registry): cover SeriesSampler internal boundaries Address PR mlcommons#306 arekay-nv inline comment on registry.py:118 ("can we add tests to ensure that the behavior is fixed and any changes are caught by tests, specifically the internal points/boundaries"). New `TestSeriesSamplerBoundaries` class pins: - HDR construction-time invariants: `high < 2*low` rejected, equality case accepted, `low=0` coerced to 1, unsupported dtype rejected. - Clamp behavior at the HDR bounds: values exactly at `hdr_low` / `hdr_high` are unclamped and don't trip the warn-once flag. - Under- and over-bound clamping: warn-once fires exactly once per sampler, raw values stay un-clamped (only HDR's view is clamped). - Float dtype uses float comparison for the lower clamp (so sub-integer under-bound values are still detected). - sig_figs at HDR-supported extremes (1 and 5) construct and record. - Rollup edges: count==1 (min==max==total, sum_sq==v^2) and the empty case (count==0, histogram==[]). - Warn-once flag is per-sampler, not process-global. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(metrics): add SessionState.INTERRUPTED for signal-handler shutdown Foundation commit for the JSON-file-final-snapshot refactor: add a terminal state distinct from `COMPLETE` so signal-handler-triggered final snapshots can be told apart from clean ENDED-driven ones. - Add `SessionState.INTERRUPTED = "interrupted"` and document the forward-only transition graph in the enum docstring: `INITIALIZE → LIVE → DRAINING → {COMPLETE | INTERRUPTED}` - Tighten the `state == COMPLETE and n_pending_tasks == 0` complete- predicate test to cover both INTERRUPTED + n_pending=0 and INTERRUPTED + n_pending>0 as "not complete". - Add a wire-round-trip test for INTERRUPTED via the msgpack codec. No call-site changes yet — the next commit wires the publisher / signal handler / consumer to use INTERRUPTED, and switches the persisted final snapshot from msgpack pub/sub fallback to a JSON file as the primary source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(metrics): final snapshot = JSON file; pub/sub = TUI signal Decouple the two delivery paths so the Report consumer no longer depends on pub/sub terminal-frame survivability. Closes PR mlcommons#306 council mlcommons#6 (conflate=True fragility for the Report consumer). Architecture change: - `MetricsPublisher.publish_final(..., interrupted: bool = False)` now atomically writes `final_snapshot.json` (pretty-printed, dict form) as the **primary** Report source AND publishes the terminal-state snapshot over pub/sub as a **TUI shutdown signal**. Disk write and pub/sub send are independent best-effort paths. - Signal handler in `__main__.py` invokes `publish_final(interrupted= True)` so SIGTERM/SIGINT writes a snapshot tagged `INTERRUPTED` (introduced in the prior commit) — distinguishes "user killed the run mid-execution" from a clean shutdown. - `MetricsSnapshotSubscriber` is now TUI-only: stripped `complete`, `_complete_event`, `wait_for_complete`. `conflate=True` is the unambiguous default — no Report-consumer fragility to reason about. - `execute.py` reads `final_snapshot.json` via `json.loads` straight to the dict form, drops the 2 s `wait_for_complete` window and the triple-redundant fallback chain. Single fallback: if the file is missing (SIGKILL/OOM before the signal handler ran), convert the subscriber's `latest` live snapshot via `snapshot_to_dict` and mark the report incomplete. - `Report.from_snapshot` now accepts a dict (the consumer contract). All field reads use `dict.get(...)` with defaults that produce an honest "incomplete" report on missing fields rather than crashing. Surfaces a `state: str` field so `display()` renders an explicit INTERRUPTED indicator. - New `snapshot_to_dict()` in `snapshot.py` is the one-way bridge from the wire `MetricsSnapshot` (array_like=True, compact msgpack) to the dict form used by both the file writer and any consumer that needs to feed a live Struct into Report. The inverse is intentionally absent — see `Report.from_snapshot` docstring for the rationale. Tests rewritten: - `test_publisher.py`: assertions read JSON from disk instead of msgpack, new test for `interrupted=True` writing `state=interrupted`. - `test_aggregator_e2e.py`: covers both delivery paths (JSON file + pub/sub terminal frame). - `test_report_builder.py`: routes through `snapshot_to_dict`; new tests for INTERRUPTED display, empty-dict defaults, and malformed metric entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): duplicate STARTED → log error + preserve session_start Address PR mlcommons#306 review-council finding (Claude mlcommons#5): a duplicate `STARTED` event silently froze `total_duration_ns` for the rest of the run because the max-of-elapsed guard never beat the new smaller deltas computed against the later start timestamp. The producer contract is "STARTED exactly once per session". Treat a duplicate as a producer bug: log an error with both timestamps and DROP the duplicate (don't re-assign `_session_start_ns`). The publisher.start guard already rejects the second tick-task spawn (council mlcommons#8); this commit defends the session-state side of the same invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): scrub NaN/Inf to None in snapshot_to_dict; allow_nan=False Address PR mlcommons#306 review-council finding (Claude mlcommons#7): the persisted `final_snapshot.json` could contain literal `NaN` / `Infinity` tokens if any series recorded a non-finite float (e.g. division-by-zero in a future TPOT calc, clock-skew artifact, etc.). Python's `json.loads` reads those back fine, but `jq`, Go's `encoding/json`, JS strict mode, and most other strict-JSON consumers reject them — and the documented "cat / jq the file" workflow makes this a real interop tripwire. Two changes: 1. `snapshot.py::snapshot_to_dict` scrubs non-finite floats to `None` on the numeric fields where they could land (counter value, series total/min/max/sum_sq/percentiles/histogram-edges). `None` is self-describing in the dict consumer: `Report.from_snapshot` uses `dict.get(..., default)` so the absence-mapping degrades gracefully to zero/empty. 2. `publisher.py::publish_final` switches `json.dumps` to `allow_nan=False`. With the scrub in place this should never raise; if it does, that's a producer-side bug that needs surfacing, not silencing into a non-strict JSON file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): SIGTERM refresh duration; SIGINT no-op Address PR mlcommons#306 review-council findings (Codex #1 + #2): #1 — SIGTERM-driven `_signal_finalize` skipped the `tracked_duration_ns` refresh that the ENDED-driven path does at `aggregator.py:379-381`. Interrupted reports therefore showed `duration_ns=0` / `QPS=N/A` even after processing many tracked samples. Mirror the ENDED path: `registry.set_counter(..., table.total_tracked_duration_ns)` before `publish_final`. #2 — On interactive ^C, the OS sends SIGINT to the whole foreground process group; the aggregator child received it and immediately called `publish_final(interrupted=True)`, writing the file from whatever state it had at signal time. Samples that completed during the parent's clean-shutdown window (between the SIGINT and the parent's eventual ENDED) never reached the file because `_finalized=True` made the subsequent ENDED-driven `publish_final` a no-op. Result: systematic undercount on interactive runs. Fix: SIGINT registers a no-op handler that silences Python's default KeyboardInterrupt and lets the parent's ENDED path drive the aggregator's finalize. SIGTERM remains the only signal that finalizes — used by `ServiceLauncher.kill_all` when the parent decides to terminate the child before ENDED arrives. New integration tests in `tests/integration/async_utils/services/ metrics_aggregator/test_signal_handling.py` spawn the aggregator as a real subprocess and verify both paths end-to-end (SIGTERM writes `state=interrupted`; SIGINT does not write the file and the subprocess stays alive). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(execute): cover _load_final_snapshot_from_disk + Report fallback Address PR mlcommons#306 review-council finding (Claude mlcommons#15): the consumer- side fallback ladder in `execute.py` had no test coverage. The three branches (file present / file absent / file malformed) plus the state→complete-flag→display contract are load-bearing for the "JSON file is the canonical Report source" architecture, but a regression that swapped precedence or mis-defaulted on a malformed file would go unnoticed until manual QA. New `TestLoadFinalSnapshotFromDisk` pins: - file missing → None (SIGKILL / OOM case) - valid JSON → dict returned with state+pending fields intact - malformed JSON → None + WARNING logged (graceful, not crash) New `TestReportFromLoadedSnapshot` pins: - Parametrized state × n_pending → expected `report.complete`, covering clean-COMPLETE, drain-timeout-COMPLETE, INTERRUPTED-0, and INTERRUPTED-with-pending. - INTERRUPTED display() surfaces the signal-driven shutdown so a user reading the output knows the data is partial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): doc cleanup + contract enforcement (p50, output_dir) Address PR mlcommons#306 review-council follow-ups mlcommons#6, mlcommons#7, mlcommons#8, mlcommons#9, mlcommons#14, mlcommons#17. mlcommons#6 — Drop the bottom `if not self.complete` WARNING in `Report.display`. The top if/elif (state == "interrupted" vs not self.complete) already says everything needed and says it correctly. The bottom warning fired a second time for INTERRUPTED runs with the misleading "(drain timeout)" attribution. mlcommons#7 — Reword `execute.py` fallback log from "report will be marked incomplete" to "state may or may not be terminal" — the latest pub/sub frame may in fact be a terminal-state signal. mlcommons#8 — Update `MetricsSnapshot.state` field docstring to list all five states (INITIALIZE, LIVE, DRAINING, COMPLETE, INTERRUPTED) and note that COMPLETE / INTERRUPTED are both terminal. mlcommons#9 — Codify the "parent owns directory setup" contract. The parent (`commands/benchmark/execute.py:432-433`) already creates `<report_dir>/metrics/` before launching the aggregator subprocess. The child's redundant `mkdir` and the publisher's redundant `path.parent.mkdir` are both replaced with a fail-fast contract check in `__main__.py`: if the directory doesn't exist at startup, the child raises `SystemExit` with a clear message in its stderr. This prevents the prior failure mode where an mkdir error in the child caused a 30s parent-side launcher timeout with no visible diagnostic. (The parent-side fail-fast-on-early-subprocess-death piece remains a known follow-up against `ServiceLauncher`.) mlcommons#14 — Enforce the "p50 mandatory" contract at registration time. `MetricsRegistry.register_series` now rejects percentiles tuples that omit 50.0, with a clear error message naming the series. `_series_to_metric_dict` keeps the midrange fallback as defense- in-depth for hand-crafted snapshot dicts (e.g. manually-edited JSON files) that bypass the registry path, with a comment labeling it as approximate-only. mlcommons#17 — Expand the `publisher.py:publish_final` pub/sub-publish `except` comment to call out the legitimate ENDED-vs-signal race (a SIGTERM-driven publish_final reaching `aclose()` first leaves the underlying ZMQ socket closed when this publish runs). The dropped TUI frame in that race is acceptable because the JSON file is the authoritative Report source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): robustness — KeyboardInterrupt, finalize, .tmp cleanup Address PR mlcommons#306 review-council follow-ups mlcommons#10, mlcommons#12, mlcommons#13. mlcommons#10 — Top-level exception handler in `__main__.py` caught `BaseException`, which includes `KeyboardInterrupt`. If SIGINT arrived before the per-loop signal handlers were registered (during argparse / `aggregator.start()` / tokenizer load), the user-initiated ^C was logged as "subprocess crashed" with a full traceback — misleading on a clean interactive shutdown. Narrow to `except Exception as e:` so KeyboardInterrupt and SystemExit propagate untouched, and log the concrete exception type up front for grep- ability. mlcommons#12 — `aggregator.process()` ENDED path called `publish_final` → `aclose()` → `_finalize()` as three top-level awaits. If `publish_final` raised (e.g. tick-task crashed with a non- CancelledError that escaped its `await self._tick_task`), the remaining two cleanup steps were skipped — and `_finalize()` is what sets `shutdown_event`. Without it, `await shutdown_event.wait()` in main() hangs forever absent a signal. Wrap in `try/finally` so the cleanup pair always runs, with the inner `aclose()` also wrapped so its own failure can't prevent `_finalize()` from completing. mlcommons#13 — `_write_atomic_json` on `publisher.py` didn't clean up the `.tmp` file on failure. If `os.rename` raised (EXDEV cross-device after a tmpfs flip, parent dir removed mid-write, permission change), the `.tmp` file leaked across runs. Wrap the write + rename sequence so any failure unlinks `tmp` (with `missing_ok=True` since rename may have consumed it just before the failure point). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): rename tests/datasets/ → tests/assets/datasets/ Make room for non-dataset test fixtures (e.g. local tokenizer artifacts for tests that need ISL/OSL/TPOT triggers but can't depend on HuggingFace Hub access in CI). `tests/datasets/` was too narrowly named; `tests/assets/` will house both `datasets/` and other test artifacts under logical subdirectories. Pure path rename — files move from `tests/datasets/<x>` to `tests/assets/datasets/<x>`, no content changes. References updated across: - pyproject.toml (sdist include glob) - README.md, docs/CLI_QUICK_REFERENCE.md, docs/LOCAL_TESTING.md, examples/02_ServerBenchmarking/README.md - scripts/create_dummy_dataset.py, scripts/regenerate_templates.py - src/inference_endpoint/config/templates/*.yaml (placeholder examples regenerated from the updated script) - tests/conftest.py, tests/unit/commands/test_benchmark.py - AGENTS.md (Test data section) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(integration): use local char-tokenizer fixture, drop HF Hub dependency Two integration tests in PR mlcommons#306's metrics-aggregator path were flaky / slow in CI because of HuggingFace Hub: - `TestTemplateIntegration::test_template_runs` (6 cases) called `AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")` on the aggregator subprocess's startup path. Cold-cache CI runs paid the ~1 MB download + tokenizer-init cost, sometimes pushing subprocess startup past the parent launcher's 30 s timeout. Also required network egress / HF_TOKEN for some CI environments. - `test_signal_handling.py` (new tests) were not affected (they don't pass `--tokenizer`), but the parent-owns-output-dir contract from the earlier mlcommons#9 follow-up also applied — those tests now create the output dir themselves before spawning the subprocess. Fix: drop in a local character-level tokenizer fixture at `tests/assets/tokenizers/char/`. ~3 KB total (`tokenizer.json` + `tokenizer_config.json`). Loaded via the existing `AutoTokenizer.from_pretrained(local_dir)` codepath — no test-only hooks in production code. Each character is one token, which is enough for the aggregator's ISL/OSL/TPOT triggers to produce deterministic counts (the e2e test path doesn't care about tokenization correctness, only that *some* count appears). Effects: no network call on the aggregator startup path for these tests, no HF_TOKEN requirement, and tokenizer load completes in single-digit ms instead of seconds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(integration): bump worker_initialization_timeout to 120s in CI `TestTemplateIntegration::test_template_runs[concurrency_template.yaml]` consistently hits the 60s `worker_initialization_timeout` in CI on cold-start. `concurrency_template.yaml` is alphabetically first in the parametrized lane, so it pays the full first-time-this-CI-job cost: - Python `multiprocessing` `spawn`-mode re-import of the entire `inference_endpoint` package per worker subprocess (transformers, msgspec, pyzmq, etc.) - First-time ZMQ IPC bind + connect handshake for the worker pool - Concurrent aggregator subprocess cold-start contending for the same small-CI-runner CPU Subsequent templates in the same lane benefit from warm module caches and don't approach the limit. Local Docker runs finish all 6 templates in ~40 s total (~6.5 s/template), but CI runners with less headroom (and `spawn` vs `fork`) consistently push the first test past 60 s. Bump to 120 s in this test only — `_resolve_template` injects `settings.client.worker_initialization_timeout: 120.0` into each template before running. Production default (60 s) is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(metrics): drain pending count, SIGTERM task GC, NaN display Addresses the three high-priority findings from the review council: H1: drain_tasks now owns the timeout + cancel-and-await sequence, so the pending count is captured before per-task done callbacks empty the in-flight set. Previously read 0 unconditionally — the documented state==COMPLETE and n_pending_tasks>0 drain-timeout contract was unenforceable. H2: Extract _make_sigterm_handler returning a strong-ref set[Task] that holds the spawned _signal_finalize task; the loop tracks tasks via weakref only, so a discarded create_task() return value can be GC'd mid-flight (Python asyncio docs) — exactly the failure the INTERRUPTED delivery path exists to prevent. H3: _scrub_nonfinite maps producer-side NaN/Inf to None for strict JSON. _display_metric did val * scale_factor with no guard → TypeError on display(), which finalize_benchmark calls outside the report-build try/except. Render N/A for None across named scalars, histogram bucket edges, and percentiles. Tests added (all verified failing pre-fix): - test_drain_timeout_reports_pending_count: forever-blocking pool + drain_timeout_s=0.05, asserts publish_final receives n_pending>0 - test_sigterm_handler_holds_strong_reference_to_finalize_task: drives the handler, asserts task is in the strong-ref set, survives gc.collect(), and self-removes via done-callback on completion - test_sigterm_handler_refreshes_tracked_duration: handler mirrors the ENDED path's tracked_duration_ns refresh before publish_final - test_display_handles_scrubbed_nan_percentiles: dict with scrubbed None percentile values does not crash display(); renders N/A - test_scrub_nonfinite_round_trip_yields_none: registry-side NaN/Inf surfaces as None in snapshot_to_dict and round-trips through json.dumps(allow_nan=False) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(tests): hoist lazy imports to top of test_report_builder AGENTS.md forbids imports inside function bodies. The H3 round-trip test introduced lazy imports of math, json, MetricsSnapshot, and SeriesStat — move them to the top-level import block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add multi-turn dataset manager with flat JSONL support Add MultiTurnDataset, MultiTurnConfig schema, tool-calling types, Query.metadata transport field, adapter tools= kwarg, and multi-turn factory routing. * feat: add ConversationManager and MultiTurnStrategy Add per-conversation asyncio.Event sequencing (ConversationManager), async turn pipeline (MultiTurnStrategy), and benchmark execution wiring (execute.py, session.py PhaseIssuer data_override). * test: add multi-turn unit and integration tests Add unit tests for MultiTurnDataset, ConversationManager, and MultiTurnStrategy; add integration tests including tool-use scenarios and large-concurrency stress tests. * feat: wire multi-turn into benchmark execution pipeline Consolidate multi-turn dataset with single-turn transform pipeline, fix prior-row extraction, live-history mode, system prompt injection, tool_calls preservation, and asyncio.Event-based sequencing. * docs: add multi-turn quickstart, examples, and conversion scripts Add MULTI_TURN_QUICKSTART.md, examples/09_MultiTurn/ configs and sample data, scripts/convert_agentic_snapshot.py, and README clarifications including conversion script output destination. * fix: replace hardcoded /model/ path in validate_jsonl_schema.py docstring Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: move multi_turn_dataset_schema.json into scripts/ and update default path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address PR mlcommons#285 review comments for multi-turn implementation Fix 15 review issues across severity levels: - HIGH: metadata=None crash in msgspec adapter, silent exception swallowing in gather - MEDIUM: timeout state consistency, conv_id canonicalization, PromptData fallback, conv_id guard - LOW: enum comparison, frozen config, empty tool_results warning, adapter metadata extraction, groupby deduplication, live-history tool warning, asyncio.Event docs, test TODO Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve multi-turn PromptData text and add concurrent stress test Use newline separators (instead of spaces) when flattening messages to text for ISL estimation, and add a 12-conversation concurrent stress test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: replace semaphore with worker-pool concurrency in MultiTurnStrategy target_concurrency now limits active conversations (not in-flight requests). N worker tasks pull from asyncio.Queue, each processing one full conversation before taking the next. Also adds slots=True back to PhaseConfig and sort=False to groupby for file-order preservation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address remaining PR mlcommons#285 review comments for multi-turn implementation - openai_adapter: normalize null content to "" instead of literal "None" to avoid polluting conversation history in tool-calling responses - multi_turn_dataset: validate tool_results entries have required tool_call_id and content fields; raise InputValidationError at load time - multi_turn_dataset: remove unused "index" field from samples metadata - multi_turn_strategy: wrap mark_turn_complete/mark_turn_failed in try/except KeyError in on_sample_complete - multi_turn_strategy: clear _inflight at end of execute() with warning if entries remain (transport failure or session abort) - docs: remove prescriptive concurrency sizing guide; replace with definition of what target_concurrency controls - docs: rename "Long Conversations" to "Conversations with Many Turns" - docs: add dataset validation utility reference in Troubleshooting Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address remaining PR mlcommons#285 review comments - Fix refusal field set to literal string "None" instead of "" in openai_adapter.py — made downstream refusal checks incorrectly truthy - Add test_pipeline_error_propagated to verify execute() re-raises worker exceptions instead of swallowing them via gather(return_exceptions=True) - Clarify MultiTurnStrategy docstring and MULTI_TURN_QUICKSTART.md: target_concurrency = simultaneous conversations (not requests); each active conversation has exactly 1 in-flight turn at a time - Remove unjustified "Common Configurations" section from quickstart - Correct misleading "workers = concurrent conversations" tip; clarify client.workers and target_concurrency are independent layers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: replace worker-pool with event-driven model in MultiTurnStrategy Rewrites MultiTurnStrategy to issue subsequent turns synchronously inside on_sample_complete() (zero event-loop delay), removing pre-spawned worker tasks and per-conversation asyncio.Event waiting. ConversationState no longer holds an asyncio.Event; sequencing is driven entirely by the strategy. Addresses PR mlcommons#285 reviewer request to move turn issuance into the sample-complete handler. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address PR mlcommons#285 review comments for multi-turn implementation - Remove ConversationMode enum (single-member) and mode field from MultiTurnConfig; drop mode: independent from YAML examples and docs - Merge AddDefaultColumns into AddStaticColumns(overwrite=False) - Replace per-call strategy check with construct-time branch in execute.py - Normalize None tool-calling content to "" in openai_adapter.py - Delete unused Query.metadata, QueryResult.with_metadata, and InFlightRequest.query_metadata plumbing - Add role-specific validation in _validate_conversation_structure: tool rows require non-empty tool_results, assistant rows require content or tool_calls - Backfill explicit sample_index into conversation_metadata["samples"]; MultiTurnStrategy reads sample_meta["sample_index"] instead of enumerate - Add AT-RISK gc=False docstring notes to openai/types.py structs with mutable container fields - Rewrite dataset tool_call_ids with model-generated ids in live-history mode; add test_live_history_remaps_tool_call_id integration test - Lift inline imports to top of test_schema.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Import fix Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * fix: revert out-of-scope live-history tool_call_id rewriting Remove tool_call_id rewriting from live-history mode (last_assistant_tool_call_ids field, ConversationManager population, MultiTurnStrategy rewrite logic) and the corresponding integration test. Live-history improvements are not in scope for this PR. Also revert the _mt_strategy closure capture in execute.py that was not requested by any review comment, while keeping the is-None branch elimination. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix issue with tool call accumulation and reasoning content Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * feat: account for tool-call tokens in OSL / TPOT / TPS metrics Tool-call tokens were completely excluded from output sequence length, TPOT, and TPS because they were only stored in QueryResult.metadata and never reached TextModelOutput or EventRecord.data. - Add `tool_calls` field to TextModelOutput; __str__ and text_after_first_chunk include JSON-encoded tool calls so the full generation is counted - Add as_message_parts / as_message_parts_after_first_chunk helpers for chat-template-aware tokenization in the metrics pipeline - OpenAI SSE accumulator populates tool_calls in TextModelOutput and emits a zero-length sentinel StreamChunk on the first tool-call delta so TTFT fires for agentic (content-free) responses - Both OpenAI adapters (msgspec and pydantic) route tool_calls into TextModelOutput in addition to metadata - TokenizePool gains token_count_message / token_count_message_async using apply_chat_template + baseline subtraction, with fallback to whitespace tokenization when the template raises - OslTrigger and TpotTrigger override the new _extract_message hook to use the message tokenization path when tool_calls are present - Forward `tools` key through MultiTurnDataset per-conversation defaults Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: correct chat-template tokenization for tool-call messages Two bugs in TokenizePool._token_count_message_worker caused OSL / TPOT to be inflated for every response containing tool_calls: 1. tool_calls[].function.arguments arrives as the OpenAI wire-format JSON string, but Hermes-style chat templates (Qwen3-Coder, etc.) iterate arguments as a mapping. Passing a string raises, and the code silently fell through to whitespace-splitting content + reasoning + json(tool_calls) — counting every JSON bracket, quote, and escape as its own token. Fixed by parsing arguments to dict before rendering. 2. apply_chat_template rejects assistant-only message lists on several templates ("No user query found in messages"). The render also raised, forcing the fallback path. Fixed by prepending an empty user message and subtracting its token length back out. Also switched the render path from tokenize=True (which returns a single- element [Encoding] in recent transformers, so len() was 1) to tokenize=False followed by tokenizer.tokenize(rendered), matching how _token_count_worker measures plain text. Verified on a real Qwen3.6-35B-A3B response: a tool-calling turn that previously reported 130 tokens now reports 100, matching the raw-bytes reference of 102 (2-token delta is the template's <think>\\n scaffolding). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: fix stale references and tool-row format in multi-turn docs - QUICKSTART: validate_jsonl_schema.py only does per-row JSON Schema checks; cross-row invariants (role sequences, turn numbering, grouping) are enforced by MultiTurnDataset at load time, not the script - README: collapse single/merged tool rows into unified tool_results form to match what MultiTurnDataset._validate_conversation_structure enforces - multi_turn_dataset.py: fix docstring referencing removed AddDefaultColumns Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: pre-compute ISL token counts for multi-turn dataset-history mode - Add _precompute_isl_for_multi_turn() in execute.py: runs apply_chat_template(messages, tokenize=True, add_generation_prompt=True) once per client turn at setup time and stores results in sample["input_tokens"], hitting the IslTrigger sync fast path (len(token_ids)) with zero hot-path cost. - Add _extract_prompt_text() in session.py: refactors inline message content extraction to handle list-form multimodal content safely, fixing a crash when content is a list (e.g. vision/tool-call messages). - Add unit tests for both helpers and two integration tests covering target_concurrency cap enforcement and pipeline exception propagation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: unwrap BatchEncoding from apply_chat_template for Qwen3 tokenizer Qwen3's fast tokenizer returns a BatchEncoding object from apply_chat_template(tokenize=True) instead of a plain list[int]. Storing the BatchEncoding in sample["input_tokens"] caused a msgspec serialization error at benchmark setup time. Extract .input_ids when the return value has that attribute; fall back to the plain list otherwise. Add a regression test using a mock BatchEncoding so this is caught before it can regress again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: accuracy phases now inherit configured load pattern instead of flooding with MAX_THROUGHPUT Hardcoded LoadPatternType.MAX_THROUGHPUT for accuracy phases caused all requests to be issued simultaneously with no concurrency cap, exhausting the SGLang KV pool and producing truncated outputs (finish_reason=length) with 0/990 final answers for GPQA. Accuracy phases now inherit the perf phase load pattern, downgrading MULTI_TURN to CONCURRENCY (same cap) when the accuracy dataset is not a MultiTurnDataset. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix pre-commit Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * Fix CI error for completion Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * Change to SSE choice for test completion Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * fix: address PR mlcommons#285 review deficiencies in multi-turn stack Correctness fixes: - multi_turn_strategy: derive response_text from TextModelOutput.output directly so tool-call JSON is not duplicated into assistant history - multi_turn_strategy: KeyError path in on_sample_complete now pops _active_iters and calls _fill_slot() to prevent session hang - multi_turn_strategy: logger.exception preserves traceback on issuance failure - multi_turn_strategy: raise InputValidationError at __init__ when live-history mode (use_dataset_history=False) is combined with tool turns; removes the has_tool_msg warning that sent unissued tool_call_ids - multi_turn_strategy: _handle_timeout synthesises a failure QueryResult and routes it through the composite callback so accuracy collector and event logger see timed-out turns; counts and logs dropped downstream turns; late responses get a debug log instead of silent drop - execute.py: each hook in _on_sample_complete wrapped independently so a strategy failure cannot suppress accuracy collection - execute.py: move AutoTokenizer import to top-level; narrow ISL precompute exception to (TemplateError, KeyError, ValueError, TypeError); raise RuntimeError when all samples fail - token_metrics: join fallback parts with "\n" to avoid cross-boundary token merging; logger.exception for baseline computation failure - adapter_protocol: per-document SSE try/except so one bad frame does not drop the rest of the buffer; filter None returns from decode_sse_message - openai/types: add role field to SSEDelta to accept streaming first frame - dataset_manager/factory: skip ColumnRemap for MultiTurnDataset - multi_turn_dataset: warn and skip samples missing pre-built messages - config/schema: add gt=0 validator on MultiTurnConfig.turn_timeout_s Docs: - examples/09_MultiTurn/README.md: correct concurrency/timeout semantics, mark per-conversation metrics as planned - examples/10_CollectOutputs: add example for output collection Tests: - test_live_history_rejects_tool_turns: asserts InputValidationError at init - test_isl_precomputed_for_dataset_history: guards ISL precompute hot path - annotate bare except blocks in CapturingEchoServer with explanatory comment Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: close residual PR mlcommons#285 review deficiencies - execute.py: guard accuracy phase against MULTI_TURN load pattern on non-MultiTurnDataset (currently unreachable, made explicit) - runtime_settings.py: clamp multi-turn sample count to dataset size; warn when min_sample_count exceeds client-turn count - token_metrics.py: emit one-shot per-(tokenizer, exc-class) warning when apply_chat_template falls back to whitespace tokenization - strategy.py, schema.py, README.md: fix stale docstrings/docs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: drop jinja2 import and fix test mocks for ISL precompute Top-level `import jinja2` in execute.py required an undeclared runtime dep, aborting CI at collection time. Revert the narrow except tuple back to `except Exception:` — `logger.exception` already delivers the traceback visibility the reviewer asked for. Fix `test_precompute_isl.py` patch targets (broken since AutoTokenizer moved to module top) and guard the all-failed check against the zero-samples-with-messages edge case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: address Copilot review comments on multi-turn implementation Code fixes: - Timeout handler now decrements PhaseIssuer.inflight, preventing _drain_inflight() from hanging when any turn times out (#1) - _precompute_isl_for_multi_turn normalizes tool_calls arguments before apply_chat_template, fixing Hermes-style template failures on tool-call-bearing samples (#2) - Non-streaming reasoning_content now set on TextModelOutput.reasoning so OSL/TPOT tokenization includes it (mlcommons#3) - live-history data_override clears stale input_tokens and token_ids (mlcommons#18) - execute() cleanup runs in finally block, surviving CancelledError (mlcommons#21) - Validator rejects plain-assistant->tool (missing tool_calls) (mlcommons#13) - Message builders include 'name' field for prior and current turns (mlcommons#19, mlcommons#20) - max_new_tokens sets all three adapter aliases (mlcommons#22) - Multi-turn accuracy datasets raise InputValidationError (not yet supported) (mlcommons#4) Examples and docs: - Remove ineffective samples: 10 from multi-turn YAML examples (mlcommons#8, mlcommons#9) - Fix events.jsonl docs (no conversation_id/turn_number fields) (mlcommons#6, mlcommons#7, mlcommons#14) - Fix workers -> num_workers in quickstart config snippet (mlcommons#10) - Fix agentic role-sequence grammar in README (mlcommons#15) - Document that multi-turn accuracy datasets are not yet supported - Delete examples/10_CollectOutputs/ (mlcommons#11, mlcommons#12, mlcommons#16, mlcommons#17) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * refactor: typed ConversationMetadata dataclass, single build in load(), tool-call ISL test - R1: Move _build_metadata() from __init__ to load() so pre_built_messages_by_key is always built against the post-transform dataframe (latent desync bug). conversation_metadata is None until load() is called. - R2: Replace conversation_metadata dict[str, Any] with @DataClass ConversationMetadata (+ ConversationSampleEntry). Attribute access in MultiTurnStrategy replaces .get() calls; mypy now flags typos at the call site. - R3: Append unit test to TestPrecomputeIslForMultiTurn asserting that _normalize_tool_calls_for_template converts tool_calls[].function.arguments from JSON string to dict before apply_chat_template (previously untested branch). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: address PR mlcommons#285 round-4 review comments - Assert conversation_metadata is not None before MultiTurnStrategy construction (fixes mypy CI failure at execute.py:634 and test_multi_turn.py:89 introduced by R1 moving build to load()) - Narrow SSE decode exception catch from bare Exception to msgspec.DecodeError/ValidationError and raise log level to warning - Clear uuid_to_index entry on turn timeout to prevent inflight counter going negative when a late response arrives after timeout - Add @pytest.mark.unit to 7 unmarked test classes in test_types.py so they are included in pytest -m unit CI lane Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Send conversation_id and turn number Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * Address perf concerns Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * fix: address PR mlcommons#285 round-5 Copilot review comments - Revert 595faf4 (conversation_id/turn in EventRecord/PhaseIssuer/session) - Validate user row content and reject assistant(tool_calls)→user transition without intervening tool row in _validate_conversation_structure - Reject malformed tool_calls (present but not a non-empty list) on assistant rows - Guard _expand_tool_results against non-dict entries in tool_results list - Add "tools" to ColumnFilter optional_columns in both OpenAI adapters so single-turn datasets with a tools column are not silently stripped - Replace RuntimeError with logger.warning in _precompute_isl_for_multi_turn so template-incompatible tokenizers fall back instead of aborting the benchmark - Fix schema cross-validation to check only the performance dataset for multi_turn config instead of any dataset - Move seeding loop inside try/finally in MultiTurnStrategy.execute so cleanup runs even if _start_conversation raises - Add inflight/uuid_to_index to FakePhaseIssuer for _handle_timeout coverage - Strengthen test_no_matching_columns to assert unrelated columns are preserved - Update MULTI_TURN_QUICKSTART.md to accurately describe which turns produce sample events and how to correlate events back to conversations Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: update test_schema.py error message assertions for Fix 5 Match new error messages from the multi_turn schema cross-validation fix that scopes validation to the performance dataset specifically. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix ci test failure post merge Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * fix: address PR mlcommons#285 round-6 Copilot review comments Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix double-firing of timed-out turns Signed-off-by: Li, Tianmu <tianmu.li@intel.com> * feat: stamp conversation_id and turn on EventRecord pipeline Thread (conversation_id, turn) from the load generator through the event record / SQL writer pipeline so multi-turn runs can group ISSUED, COMPLETE, and ERROR events by conversation in the event log. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: sum_sq overflow, streaming conv stamping, tool-call metric coverage Change 0: cast sum_sq to float in _hdr_stat/_exact_stat to prevent msgpack OverflowError for ns-range latencies >= 4.3s. SeriesStat wire schema already accepts int|float. Change 1: stamp conversation_id/turn on RECV_FIRST/RECV_NON_FIRST events using .get() (not .pop()) so the uuid_to_conv_info entry remains available for the terminal QueryResult pop. Change 2: add test_tpot_osl_for_tool_call_complete to TestAsyncTriggers with exact OSL/TPOT value assertions for tool-call streaming responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: stamp conv_id/turn on timeout events, clarify live-history guard, update docs Closes Copilot review comments mlcommons#41, mlcommons#58, mlcommons#81, mlcommons#87, mlcommons#89, mlcommons#90, mlcommons#91 (round 7): - timeout EventRecords: capture uuid_to_conv_info in a single .pop() and pass conversation_id/turn to both ERROR and COMPLETE constructors so timed-out turns correlate like normal completions - live-history guard: add comment explaining why only role:tool rows are rejected; tighten error message wording - MULTI_TURN_QUICKSTART.md: rewrite Events Log section to reflect conversation_id/turn on EventRecords; update validate_jsonl_schema description to clarify row-level vs cross-row validation scope - examples/09_MultiTurn/README.md: update event correlation paragraph; fix agentic role-sequence diagram to [assistant(tool_calls)->tool]*; add note about merged tool_results - offline_template_full.yaml: correct load_pattern comment to "offline only: max_throughput" - scripts/regenerate_templates.py: apply offline-specific override so regeneration preserves the corrected comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: dataset validation, _fill_slot hang prevention, stale docs, tests Address remaining dataset-history-scope reviewer concerns from PR mlcommons#285: - MultiTurnDataset: reject null/empty/NaN conversation_id at construction (groupby dropna=False + explicit guard in _validate_conversation_grouping) - MultiTurnDataset: validate per-call tool_calls shape (id, type, function.name, function.arguments) and per-entry tool_results shape (tool_call_id, content) at construction time instead of failing silently at the endpoint - MultiTurnDataset.load(): raise NotImplementedError when adapter= is passed without api_type/model_params (was silently ignored) - MultiTurnStrategy._fill_slot(): wrap in try/except so a transport or dataset error during a timeout/error-triggered slot refill sets _error and _all_done instead of leaving execute() hanging on _all_done.wait() forever - docs: reword stale turn_timeout_s comment and target_concurrency description - tests: add regression coverage for all fixes plus ISL/OSL/TPOT aggregator metrics integration test for multi-turn turns Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: emit synthetic ISSUED+ERROR+COMPLETE for dropped turns on timeout/error Dropped turns after a timeout or error were silent — absent from events.jsonl, the sample-index map, and the accuracy collector. Introduces PhaseIssuer.register_skipped() (mirrors issue() minus the HTTP send and inflight increment) and wires both drop sites in MultiTurnStrategy through _abort_remaining_turns(), which calls register_skipped() and publishes synthetic failure EventRecords per dropped turn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * build: pin jinja2 as direct dep for apply_chat_template transformers' apply_chat_template (used by multi-turn ISL precompute and metrics token rendering) requires jinja2 at runtime; without it ImportError falls through silent text-tokenization fallbacks and quietly degrades ISL/ OSL/TPOT. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test: cover TTFT and tool-call streaming in multi-turn metrics test Parameterize the streaming aggregator test over text and tool-calls-only payloads, and add TTFT to the asserted metric set. Closes the gap where neither unit nor integration tests verified that TTFT fires on a RECV_FIRST whose payload carries only tool_calls. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

dependabot Bot added dependencies Pull requests that update a dependency file python:uv Pull requests that update python:uv code labels May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(deps): bump idna from 3.11 to 3.15#2

chore(deps): bump idna from 3.11 to 3.15#2
dependabot[bot] wants to merge 1 commit into
mainfrom
dependabot/uv/idna-3.15

dependabot Bot commented on behalf of github May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants