feat(belief-state): project runtime benchmark corpus evidence by drewstone · Pull Request #229 · tangle-network/agent-eval

drewstone · 2026-06-06T18:15:59Z

Summary

- add a generic runtime trajectory projector for records that carry passive `runtimeEvents`
- expose `projectRuntimeTrajectoryEvidence` and `parseRuntimeTrajectoryHookEvent` from the root package
- keep the belief-state runtime benchmark helper as a thin consumer of that generic projection
- feed optional explicit runtime decisions and observed labels into the existing Phase 0 measurement helper
- diagnose lifecycle-only records as captured trajectory evidence that is not sufficient support for belief policy claims

Why

Runtime now emits benchmark lifecycle evidence. The reusable substrate should be trajectory-first, not belief-only: benchmark/runtime rows record joins and lifecycle events; belief-state Phase 0 is one analysis over that evidence, and future trace/corpus analyses can reuse the same projection.

Verification

- `pnpm lint` (passes; existing warnings in src/authenticity/index.ts and src/storyboard/code-edit.ts)
- `pnpm typecheck`
- `pnpm exec vitest run tests/runtime-trajectory.test.ts tests/belief-state/runtime-benchmark-corpus.test.ts tests/belief-state/phase0-measurement.test.ts`
- `pnpm test -- --testTimeout=300000`
- `pnpm build`
- `pnpm verify:package`

tangletools · 2026-06-06T18:24:29Z

⚠️ Review Incomplete — `52bddba1`

At least one required reviewer lane failed closed. No approval or request-changes review was published. Trigger a fresh review on the current PR head.

_{tangletools · 2026-06-06T18:24:27Z}

tangletools · 2026-06-06T18:38:17Z

✅ No Blockers — `52bddba1`

Readiness 60/100 · Confidence 70/100 · 7 findings (2 medium, 5 low)

deepseek: Correctness 60 · Security 60 · Testing 60 · Architecture 60

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM decisions-without-labels diagnostic path untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Implementation lines 123-126 produce a distinct diagnostic when decisions.length > 0 but labels.length === 0 ('no decision labels supplied; observed action/outcome joins will be incomplete'). No test supplies decisions without labels. Only the inverse (no decisions at all) is tested.
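
As a rough illustration of the branch this finding refers to, the sketch below reproduces only the condition and message quoted above. `labelDiagnostics` and its array-of-unknown inputs are hypothetical stand-ins, not the helper's real signature.

```typescript
// Message text quoted in the finding above.
const NO_LABELS_DIAGNOSTIC =
  'no decision labels supplied; observed action/outcome joins will be incomplete';

// Hypothetical stand-in for the diagnostic branch; the real helper may differ.
function labelDiagnostics(
  decisions: readonly unknown[],
  labels: readonly unknown[],
): string[] {
  const diagnostics: string[] = [];
  // Decisions were supplied, but there are no labels to join against them.
  if (decisions.length > 0 && labels.length === 0) {
    diagnostics.push(NO_LABELS_DIAGNOSTIC);
  }
  return diagnostics;
}

// The uncovered case: decisions present, labels omitted.
const decisionsOnly = labelDiagnostics([{ id: 'decision-1' }], []);
// A case that should not trigger this diagnostic: both present.
const decisionsAndLabels = labelDiagnostics([{ id: 'decision-1' }], [{ id: 'label-1' }]);
```

A test closing this gap would presumably pass decisions with an empty labels array to the real helper and assert that this message appears in its diagnostics.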

🟠 MEDIUM defaultSplitTag option path untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Line 86 of the implementation (record.splitTag ?? options.defaultSplitTag ?? DEFAULT_SPLIT_TAG) has a code path where options.defaultSplitTag supplies the splitTag instead of the internal default. No test passes defaultSplitTag — all tests rely on the hidden DEFAULT_SPLIT_TAG='search'. This branch is not exercised.
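
The fallback chain quoted above can be sketched in isolation as follows. `RecordLike`, `OptionsLike`, and `resolveSplitTag` are hypothetical names for illustration; only the `??` expression and the `'search'` default come from the finding.

```typescript
// Default value as reported in the finding.
const DEFAULT_SPLIT_TAG = 'search';

// Hypothetical minimal shapes; the real record and options types carry more fields.
interface RecordLike {
  splitTag?: string;
}
interface OptionsLike {
  defaultSplitTag?: string;
}

// Precedence: record value, then caller-supplied default, then internal default.
function resolveSplitTag(record: RecordLike, options: OptionsLike = {}): string {
  return record.splitTag ?? options.defaultSplitTag ?? DEFAULT_SPLIT_TAG;
}

const fromRecord = resolveSplitTag({ splitTag: 'holdout' }, { defaultSplitTag: 'dev' });
const fromOptions = resolveSplitTag({}, { defaultSplitTag: 'dev' }); // the unexercised branch
const fromInternalDefault = resolveSplitTag({});
```

Because `??` only falls through on `null` or `undefined`, an empty-string `splitTag` on the record would be kept rather than replaced, which may be worth a separate test case.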

🟡 LOW No test coverage for empty records array or duplicate-runId conflict diagnostics — src/belief-state/runtime-benchmark-corpus.ts

Test file (tests/belief-state/runtime-benchmark-corpus.test.ts) has 3 tests covering: blocked pipeline without decisions, full decisions+labels pipeline, and malformed/missing events. Missing: (1) records: [] — should verify empty runs/events/summary, (2) two records sharing a runId but with differing splitTags — should verify the conflict diagnostic fires at line 111, (3) explicit defaultSplitTag override behavior. All existing tests pass. The conflict diagnostic path at lines 110-112 is reachable

🟡 LOW parseRuntimeHookEvent does not snapshot payload (inconsistent with runtime-hooks.ts) — src/belief-state/runtime-benchmark-corpus.ts

Line 178: payload: input.payload passes the raw reference without snapshotting. The equivalent function in runtime-hooks.ts snapshotRuntimeHookEvent (line 345) calls snapshotUnknown(event.payload) to shallow-copy objects and arrays. Benchmark data is de facto immutable and downstream consumers only read payload via previewUnknown (stringification), so no practical bug exists. However, the inconsistency means a future mutating consumer could corrupt parsed records silently. Fix: `payload: snapshotUnk
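
To make the aliasing risk concrete, here is a minimal sketch of a shallow snapshot. `shallowSnapshot` is a hypothetical illustration of the "shallow-copy objects and arrays" behavior described above; the real `snapshotUnknown` in runtime-hooks.ts may behave differently.

```typescript
// Hypothetical shallow snapshot: copies the top level of arrays and objects,
// and returns primitives (and null) unchanged.
function shallowSnapshot(value: unknown): unknown {
  if (Array.isArray(value)) {
    return [...value];
  }
  if (value !== null && typeof value === 'object') {
    return { ...(value as Record<string, unknown>) };
  }
  return value;
}

const rawPayload = { status: 'ok', nested: { attempts: 1 } };
const parsedPayload = shallowSnapshot(rawPayload) as typeof rawPayload;

// A later top-level mutation of the raw input no longer reaches the parsed copy.
rawPayload.status = 'mutated';
```

Note that a shallow copy still shares nested objects (`parsedPayload.nested` is the same reference as `rawPayload.nested`), so it only guards against top-level mutation.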

🟡 LOW conflicting run metadata diagnostic untested — tests/belief-state/runtime-benchmark-corpus.test.ts

Implementation lines 110-112 detect when two events for the same runId carry different scenarioId or splitTag ('runId X has conflicting scenario/split metadata'). No test input triggers this, so the conflict-detection path has no coverage.
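
A minimal sketch of this kind of conflict check, assuming each event carries `runId`, `scenarioId`, and `splitTag`. `RunEventLike` and `findRunConflicts` are hypothetical, as is the choice to report each runId once; only the message format comes from the finding.

```typescript
// Hypothetical minimal event shape for illustration.
interface RunEventLike {
  runId: string;
  scenarioId: string;
  splitTag: string;
}

// Compare every event against the first event seen for its runId.
function findRunConflicts(events: readonly RunEventLike[]): string[] {
  const firstSeen = new Map<string, RunEventLike>();
  const reported = new Set<string>();
  const diagnostics: string[] = [];
  for (const event of events) {
    const first = firstSeen.get(event.runId);
    if (first === undefined) {
      firstSeen.set(event.runId, event);
      continue;
    }
    const conflicts =
      first.scenarioId !== event.scenarioId || first.splitTag !== event.splitTag;
    if (conflicts && !reported.has(event.runId)) {
      reported.add(event.runId);
      diagnostics.push(`runId ${event.runId} has conflicting scenario/split metadata`);
    }
  }
  return diagnostics;
}

// Input of the kind a covering test would need: same runId, differing splitTag.
const conflictDiagnostics = findRunConflicts([
  { runId: 'r1', scenarioId: 's1', splitTag: 'search' },
  { runId: 'r1', scenarioId: 's1', splitTag: 'holdout' },
  { runId: 'r2', scenarioId: 's2', splitTag: 'search' },
]);
```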

🟡 LOW output fields decisions and labels never asserted — tests/belief-state/runtime-benchmark-corpus.test.ts

RuntimeBenchmarkBeliefPhase0Measurement (lines 63-71 of implementation) includes decisions: RuntimeBeliefDecisionPoint[] and labels: RuntimeBeliefDecisionLabel[] as pass-through output fields. No test assertion checks report.decisions or report.labels. Test 2 supplies 12 decisions and 12 labels but only inspects report.measurement.* — if the pass-through silently dropped items, no test would catch it.

🟡 LOW parseRuntimeHookEvent validation only tested for one failure mode — tests/belief-state/runtime-benchmark-corpus.test.ts

parseRuntimeHookEvent (lines 161-181 of implementation) has 7 validation checks (non-object input, empty-string id, empty-string runId, empty-string target, empty-string phase, non-finite timestamp). Test 3 only exercises the 'missing runId' path via { id: 'bad' }. The other 6 rejection paths have zero coverage, so a regression that silently accepts a timestamp of NaN or Infinity would not be caught.
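
For reference, the rejection paths listed above can be sketched as a single guard function. `parseHookEventSketch`, `HookEventLike`, and the `undefined` return on failure are assumptions for illustration; the real `parseRuntimeHookEvent` may report failures differently.

```typescript
// Hypothetical event shape built from the field names mentioned in the findings.
interface HookEventLike {
  id: string;
  runId: string;
  target: string;
  phase: string;
  timestamp: number;
  payload: unknown;
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === 'string' && value.length > 0;
}

// Returns the parsed event, or undefined when any validation check fails.
function parseHookEventSketch(input: unknown): HookEventLike | undefined {
  if (input === null || typeof input !== 'object') {
    return undefined; // non-object input
  }
  const candidate = input as Record<string, unknown>;
  const { id, runId, target, phase, timestamp } = candidate;
  if (
    !isNonEmptyString(id) ||
    !isNonEmptyString(runId) ||
    !isNonEmptyString(target) ||
    !isNonEmptyString(phase)
  ) {
    return undefined; // missing or empty-string identifier fields
  }
  if (typeof timestamp !== 'number' || !Number.isFinite(timestamp)) {
    return undefined; // rejects NaN, Infinity, and -Infinity
  }
  return { id, runId, target, phase, timestamp, payload: candidate.payload };
}

const validEvent = { id: 'e1', runId: 'r1', target: 'agent', phase: 'start', timestamp: 1000 };
const parsedValid = parseHookEventSketch(validEvent);
const parsedNaN = parseHookEventSketch({ ...validEvent, timestamp: Number.NaN });
```

A table-driven test that starts from one valid event and breaks a single field per case would cover each rejection path without much duplication.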

_{tangletools · 2026-06-06T18:38:14Z · trace}

tangletools

✅ Approved — 7 non-blocking findings — `52bddba1`

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T18:38:14Z · immutable trace}

tangletools

✅ Refreshed approval after new commits — `d88dd940`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T18:46:48Z}

feat(belief-state): project runtime benchmark corpus evidence

52bddba

tangletools previously approved these changes Jun 6, 2026

refactor(runtime-trajectory): split generic runtime evidence projection

d88dd94

drewstone dismissed tangletools’s stale review via d88dd94 June 6, 2026 18:46

tangletools approved these changes Jun 6, 2026

drewstone merged commit 69e9f2e into main Jun 6, 2026
1 check passed

drewstone merged 2 commits into main from feat/belief-runtime-benchmark-corpus
