feat(belief-state): project runtime benchmark corpus evidence#229

Merged: drewstone merged 2 commits into main from feat/belief-runtime-benchmark-corpus on Jun 6, 2026

@drewstone commented Jun 6, 2026

Contributor

Summary

  • add a generic runtime trajectory projector for records that carry passive runtimeEvents
  • expose projectRuntimeTrajectoryEvidence and parseRuntimeTrajectoryHookEvent from the root package
  • keep the belief-state runtime benchmark helper as a thin consumer of that generic projection
  • feed optional explicit runtime decisions and observed labels into the existing Phase 0 measurement helper
  • diagnose lifecycle-only records as captured trajectory evidence that is not sufficient on its own to support belief policy claims

Why

Runtime now emits benchmark lifecycle evidence. The reusable substrate should be trajectory-first, not belief-only: benchmark/runtime rows record joins and lifecycle events; belief-state Phase 0 is one analysis over that evidence, and future trace/corpus analyses can reuse the same projection.

Verification

  • pnpm lint (passes; existing warnings in src/authenticity/index.ts and src/storyboard/code-edit.ts)
  • pnpm typecheck
  • pnpm exec vitest run tests/runtime-trajectory.test.ts tests/belief-state/runtime-benchmark-corpus.test.ts tests/belief-state/phase0-measurement.test.ts
  • pnpm test -- --testTimeout=300000
  • pnpm build
  • pnpm verify:package

@tangletools

Contributor

⚠️ Review Incomplete — 52bddba1

At least one required reviewer lane failed closed. No approval or request-changes review was published. Trigger a fresh review on the current PR head.

tangletools · 2026-06-06T18:24:27Z

@tangletools

Contributor

✅ No Blockers — 52bddba1

Readiness 60/100 · Confidence 70/100 · 7 findings (2 medium, 5 low)

deepseek: Correctness 60 · Security 60 · Testing 60 · Architecture 60

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM decisions-without-labels diagnostic path untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Implementation lines 123-126 produce a distinct diagnostic when decisions.length > 0 but labels.length === 0 ('no decision labels supplied; observed action/outcome joins will be incomplete'). No test supplies decisions without labels. Only the inverse (no decisions at all) is tested.
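
A minimal sketch of the branch this finding describes. The `DecisionSketch` and `LabelSketch` shapes and the helper name `diagnoseDecisionInputs` are hypothetical; only the quoted labels diagnostic string comes from the review text, and the no-decisions message is invented for illustration.

```typescript
// Hypothetical stand-ins for the real decision and label types.
interface DecisionSketch { decisionId: string }
interface LabelSketch { decisionId: string }

// Returns the diagnostics that the decisions/labels inputs would produce.
function diagnoseDecisionInputs(
  decisions: DecisionSketch[],
  labels: LabelSketch[],
): string[] {
  const diagnostics: string[] = [];
  if (decisions.length === 0) {
    // Wording is an assumption; the real message is not shown in the review.
    diagnostics.push("no runtime decisions supplied");
  } else if (labels.length === 0) {
    // Message text quoted from the finding above.
    diagnostics.push(
      "no decision labels supplied; observed action/outcome joins will be incomplete",
    );
  }
  return diagnostics;
}
```

A test for the uncovered path would pass at least one decision and an empty labels array, then assert that the labels diagnostic is present.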

🟠 MEDIUM defaultSplitTag option path untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Line 86 of the implementation (record.splitTag ?? options.defaultSplitTag ?? DEFAULT_SPLIT_TAG) has a code path where options.defaultSplitTag supplies the splitTag instead of the internal default. No test passes defaultSplitTag — all tests rely on the hidden DEFAULT_SPLIT_TAG='search'. This branch is not exercised.
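
The fallback chain quoted above can be exercised in isolation. This is a sketch only: `RecordSketch`, `OptionsSketch`, and `resolveSplitTag` are hypothetical names, while the `??` expression and the `'search'` default are taken from the finding.

```typescript
// Default value as reported in the finding.
const DEFAULT_SPLIT_TAG = "search";

// Hypothetical minimal shapes carrying only the fields the expression reads.
interface RecordSketch { splitTag?: string }
interface OptionsSketch { defaultSplitTag?: string }

// Record value wins, then the caller-supplied default, then the internal default.
function resolveSplitTag(record: RecordSketch, options: OptionsSketch = {}): string {
  return record.splitTag ?? options.defaultSplitTag ?? DEFAULT_SPLIT_TAG;
}
```

Note that `??` only falls through on `null` or `undefined`, so an empty-string `splitTag` on a record would be kept rather than replaced by a default. A test for the untested branch would omit `splitTag` on the record and pass a `defaultSplitTag` other than `'search'`.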

🟡 LOW No test coverage for empty records array or duplicate-runId conflict diagnostics — src/belief-state/runtime-benchmark-corpus.ts

Test file (tests/belief-state/runtime-benchmark-corpus.test.ts) has 3 tests covering: blocked pipeline without decisions, full decisions+labels pipeline, and malformed/missing events. Missing: (1) records: [] — should verify empty runs/events/summary, (2) two records sharing a runId but with differing splitTags — should verify the conflict diagnostic fires at line 111, (3) explicit defaultSplitTag override behavior. All existing tests pass. The conflict diagnostic path at lines 110-112 is reachable

🟡 LOW parseRuntimeHookEvent does not snapshot payload (inconsistent with runtime-hooks.ts) — src/belief-state/runtime-benchmark-corpus.ts

Line 178: payload: input.payload passes the raw reference without snapshotting. The equivalent function in runtime-hooks.ts snapshotRuntimeHookEvent (line 345) calls snapshotUnknown(event.payload) to shallow-copy objects and arrays. Benchmark data is de facto immutable and downstream consumers only read payload via previewUnknown (stringification), so no practical bug exists. However, the inconsistency means a future mutating consumer could corrupt parsed records silently. Fix: `payload: snapshotUnk
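
To illustrate what a shallow snapshot protects against, here is a sketch of a shallow-copy helper. `snapshotShallow` is a hypothetical stand-in and is not the repository's `snapshotUnknown`, whose implementation is not shown here; the sketch only assumes the behavior described above (shallow-copying objects and arrays).

```typescript
// Hypothetical helper: copies the top level of arrays and objects,
// and returns primitives (and null) unchanged.
function snapshotShallow(value: unknown): unknown {
  if (Array.isArray(value)) {
    return [...value];
  }
  if (value !== null && typeof value === "object") {
    return { ...(value as Record<string, unknown>) };
  }
  return value;
}

// Top-level mutation of the source no longer reaches the snapshot.
const rawPayload = { status: "ok", nested: { attempts: 1 } };
const parsedPayload = snapshotShallow(rawPayload) as typeof rawPayload;
rawPayload.status = "mutated";
// parsedPayload.status is still "ok"; parsedPayload.nested is shared with rawPayload.
```

Because the copy is shallow, nested objects remain shared, so this guards only against top-level mutation by a later consumer.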

🟡 LOW conflicting run metadata diagnostic untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Implementation lines 110-112 detect when two events for the same runId carry different scenarioId or splitTag ('runId X has conflicting scenario/split metadata'). No test input triggers this, so the conflict-detection path has no coverage.
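
A sketch of how such a conflict check could work, with a minimal input that triggers it. `RunEventSketch` and `findRunMetadataConflicts` are hypothetical names; only the diagnostic wording is taken from the finding, and the real implementation may track run metadata differently.

```typescript
// Hypothetical event shape carrying only the fields the check compares.
interface RunEventSketch { runId: string; scenarioId: string; splitTag: string }

// Compares every event against the first event seen for the same runId
// and reports each conflicting runId once.
function findRunMetadataConflicts(events: RunEventSketch[]): string[] {
  const firstSeen = new Map<string, RunEventSketch>();
  const reported = new Set<string>();
  const diagnostics: string[] = [];
  for (const event of events) {
    const prior = firstSeen.get(event.runId);
    if (prior === undefined) {
      firstSeen.set(event.runId, event);
      continue;
    }
    const differs =
      prior.scenarioId !== event.scenarioId || prior.splitTag !== event.splitTag;
    if (differs && !reported.has(event.runId)) {
      reported.add(event.runId);
      diagnostics.push(`runId ${event.runId} has conflicting scenario/split metadata`);
    }
  }
  return diagnostics;
}
```

Two events sharing a `runId` but carrying different `splitTag` values are enough to cover the path.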

🟡 LOW output fields decisions and labels never asserted — tests/belief-state/runtime-benchmark-corpus.test.ts

RuntimeBenchmarkBeliefPhase0Measurement (lines 63-71 of implementation) includes decisions: RuntimeBeliefDecisionPoint[] and labels: RuntimeBeliefDecisionLabel[] as pass-through output fields. No test assertion checks report.decisions or report.labels. Test 2 supplies 12 decisions and 12 labels but only inspects report.measurement.* — if the pass-through silently dropped items, no test would catch it.

🟡 LOW parseRuntimeHookEvent validation only tested for one failure mode — tests/belief-state/runtime-benchmark-corpus.test.ts

parseRuntimeHookEvent (lines 161-181 of implementation) has 7 validation checks (non-object input, empty-string id, empty-string runId, empty-string target, empty-string phase, non-finite timestamp). Test 3 only exercises the 'missing runId' path via { id: 'bad' }. The other 6 rejection paths have zero coverage, so a regression that silently accepts a timestamp of NaN or Infinity would not be caught.
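
The rejection paths named above can be sketched as a standalone parser. This is an illustration under assumptions, not the repository's `parseRuntimeHookEvent`: `HookEventSketch`, `isNonEmptyString`, and `parseHookEventSketch` are hypothetical, and the field list follows the checks named in the finding.

```typescript
// Hypothetical parsed-event shape based on the fields named in the finding.
interface HookEventSketch {
  id: string;
  runId: string;
  target: string;
  phase: string;
  timestamp: number;
  payload?: unknown;
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Returns undefined for any input that fails a validation check.
function parseHookEventSketch(input: unknown): HookEventSketch | undefined {
  if (input === null || typeof input !== "object" || Array.isArray(input)) {
    return undefined;
  }
  const { id, runId, target, phase, timestamp, payload } =
    input as Record<string, unknown>;
  if (!isNonEmptyString(id)) return undefined;
  if (!isNonEmptyString(runId)) return undefined;
  if (!isNonEmptyString(target)) return undefined;
  if (!isNonEmptyString(phase)) return undefined;
  // Number.isFinite rejects NaN, Infinity, and -Infinity.
  if (typeof timestamp !== "number" || !Number.isFinite(timestamp)) return undefined;
  return { id, runId, target, phase, timestamp, payload };
}
```

A table-driven test could start from one valid event and break a single field per case, which would cover each rejection path, including `NaN` and `Infinity` timestamps.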


tangletools · 2026-06-06T18:38:14Z · trace

tangletools previously approved these changes Jun 6, 2026

@tangletools tangletools left a comment

Contributor

✅ Approved — 7 non-blocking findings — 52bddba1

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T18:38:14Z · immutable trace

@tangletools tangletools left a comment

Contributor

✅ Refreshed approval after new commits — d88dd940

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T18:46:48Z

@drewstone drewstone merged commit 69e9f2e into main Jun 6, 2026
1 check passed

2 participants