18 changes: 15 additions & 3 deletions README.md
@@ -289,6 +289,16 @@ provider-backed ELF evidence was required.
exported core block JSON, archival passage/readback/search JSON, and source ids are
present. The report makes no hosted mem0 Platform, OpenMemory UI/export, or Letta
parity, win, tie, or loss claim.
- Temporal/trajectory adapter coverage after XY-1070: the June 23 follow-up refreshes
Graphiti/Zep temporal-validity and OpenViking context-trajectory evidence. The
Graphiti/Zep blocked fixture now includes current, historical, provider-boundary
source ids plus trace-stage readback, and the generated smoke manifest emits a
temporal-validity scenario row. The OpenViking staged, hierarchy, and recursive
fixtures remain 3 typed blockers with 3 trace-stage artifacts for same-corpus,
missing stage/hierarchy/recursive output, rejected sibling or decoy handling, and
comparison gates. This improves auditability only: no graph-memory parity,
OpenViking trajectory win/tie/loss, hosted Zep, private-corpus, or provider-backed
quality claim is made.
- Operator-approved public-proxy addendum after XY-930: the June 19 follow-up runs
`cargo make baseline-production-private-addendum` with a simulated/public-proxy
production corpus manifest approved for this stage. The run records 12 documents,
@@ -424,6 +434,7 @@ Detailed evidence and interpretation:
- [P2 Knowledge Workspace PageIndex/OpenKB Closeout Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-p2-knowledge-workspace-pageindex-openkb-closeout-report.md)
- [PageIndex/OpenKB Same-Corpus Adapter Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-pageindex-openkb-same-corpus-adapter-report.md)
- [mem0/OpenMemory and Letta Memory-History/Core-Archive Adapter Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-mem0-openmemory-letta-memory-history-core-archive-report.md)
- [Temporal and Trajectory Adapter Coverage Report - June 23, 2026](docs/evidence/benchmarking/2026-06-23-temporal-trajectory-adapter-coverage-report.md)
- [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md)
- [Single-User Production Runbook](docs/runbook/single_user_production.md)
- Benchmark contract:
@@ -517,6 +528,7 @@ Detailed comparison, mechanism-level analysis, and source map:
- [P2 Knowledge Workspace PageIndex/OpenKB Closeout Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-p2-knowledge-workspace-pageindex-openkb-closeout-report.md)
- [PageIndex/OpenKB Same-Corpus Adapter Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-pageindex-openkb-same-corpus-adapter-report.md)
- [mem0/OpenMemory and Letta Memory-History/Core-Archive Adapter Report - June 22, 2026](docs/evidence/benchmarking/2026-06-22-mem0-openmemory-letta-memory-history-core-archive-report.md)
- [Temporal and Trajectory Adapter Coverage Report - June 23, 2026](docs/evidence/benchmarking/2026-06-23-temporal-trajectory-adapter-coverage-report.md)
- [Live Baseline Benchmark Runbook](docs/runbook/benchmarking/live_baseline_benchmark.md)
- [Real-World Agent Memory Benchmark](docs/runbook/benchmarking/real_world_agent_memory_benchmark.md)
- [External Memory Improvement Plan](docs/evidence/external_memory/external_memory_improvement_plan.md)
@@ -528,14 +540,14 @@ Detailed comparison, mechanism-level analysis, and source map:
- [Derived Knowledge Page Follow-Up Research](docs/research/derived_knowledge_page_followup.md)
- [Dreaming Product Surface Follow-Up Research](docs/research/dreaming_product_surface_followup.md)

Latest real-world benchmark report: June 22, 2026. Latest external research refresh:
Latest real-world benchmark report: June 23, 2026. Latest external research refresh:
June 11, 2026; June 20 adds the Agent Knowledge OS Closeout Benchmark Report,
the Graph Topic-Map Report - June 20, 2026, Knowledge Workspace Version-Diff
Report - June 20, 2026, and the Live Knowledge-Page Rebuild/Lint Report - June 20,
2026; June 22 adds the P1 Memory Authority Closeout Report, P2 Knowledge
Workspace PageIndex/OpenKB Closeout Report, PageIndex/OpenKB Same-Corpus Adapter
Report, and mem0/OpenMemory and Letta Memory-History/Core-Archive Adapter Report
after the June 19
Report, and mem0/OpenMemory and Letta Memory-History/Core-Archive Adapter Report;
June 23 adds the Temporal and Trajectory Adapter Coverage Report after the June 19
XY-930 operator-approved public-proxy production addendum and service-native Dreaming
readback, the qmd debug-ergonomics Dreaming retest, the June 17 competitor-strength
closeout, and the June 16 temporal reconciliation, live consolidation self-check,
@@ -66,7 +66,57 @@
},
"created_at": "2026-06-11T17:17:00Z"
}
]
],
"adapter_response": {
"adapter_id": "fixture_graphiti_zep_temporal_validity",
"answer": {
"content": "Graphiti/Zep temporal scoring requires current and historical facts with validity windows. The representative adapter output remains blocked until provider-backed temporal search maps those facts to generated evidence ids.",
"claims": [
{
"claim_id": "graphiti_temporal_contract",
"text": "Graphiti/Zep temporal scoring requires current and historical facts with validity windows.",
"evidence_ids": [
"graphiti-current-fact-contract",
"graphiti-historical-fact-contract",
"graphiti-provider-boundary"
],
"confidence": "high"
}
],
"evidence_ids": [
"graphiti-current-fact-contract",
"graphiti-historical-fact-contract",
"graphiti-provider-boundary"
],
"trace_explainability": {
"trace_id": "fixture-graphiti-zep-temporal-validity-blocked",
"failure_stage": "graphiti.provider_boundary",
"failure_reason": "provider_api_key_missing blocks live temporal search output, so the fixture records the current/historical validity-window contract instead of scoring parity.",
"stages": [
{
"stage_name": "graphiti.validity_window_contract",
"kept_evidence": [
"graphiti-current-fact-contract",
"graphiti-historical-fact-contract"
],
"notes": "The typed blocker still names the current and historical source ids required before scoring."
},
{
"stage_name": "graphiti.provider_boundary",
"kept_evidence": ["graphiti-provider-boundary"],
"notes": "Missing explicit provider configuration is a valid typed blocker, not a failed ELF graph-memory comparison."
}
]
},
"latency_ms": 0.0,
"cost": {
"currency": "USD",
"amount": 0.0,
"input_tokens": 0,
"output_tokens": 0
}
}
}
},
"timeline": [
{
@@ -113,6 +113,27 @@
"amount": 0.0,
"input_tokens": 0,
"output_tokens": 0
},
"trace_explainability": {
"trace_id": "fixture-openviking-hierarchy-selection-blocked",
"failure_stage": "openviking.hierarchy_artifact_gate",
"failure_reason": "Selected parent, child, resource, and rejected sibling evidence is not materialized, so hierarchy selection remains a typed blocker.",
"stages": [
{
"stage_name": "openviking.same_corpus_gate",
"kept_evidence": ["same-corpus-before-hierarchy"],
"notes": "Hierarchy scoring is gated behind same-corpus expected evidence id coverage."
},
{
"stage_name": "openviking.hierarchy_artifact_gate",
"kept_evidence": [
"hierarchy-selection-output-contract",
"hierarchy-comparison-requires-elf-equivalent"
],
"dropped_evidence": ["hierarchy-design-win-decoy"],
"notes": "The required artifact must show selected hierarchy nodes plus the rejected sibling or decoy context before any ELF/OpenViking comparison is scored."
}
]
}
}
}
@@ -113,6 +113,29 @@
"amount": 0.0,
"input_tokens": 0,
"output_tokens": 0
},
"trace_explainability": {
"trace_id": "fixture-openviking-recursive-expansion-blocked",
"failure_stage": "openviking.recursive_expansion_gate",
"failure_reason": "Seed, expanded child, final evidence, and pruned-branch artifacts are not materialized, so recursive/context expansion remains blocked.",
"stages": [
{
"stage_name": "openviking.same_corpus_gate",
"kept_evidence": ["recursive-same-corpus-gate"],
"notes": "Recursive expansion scoring remains gated behind expected evidence id coverage."
},
{
"stage_name": "openviking.recursive_expansion_gate",
"kept_evidence": ["recursive-expansion-output-contract"],
"dropped_evidence": ["recursive-expansion-win-decoy"],
"notes": "The missing expansion-path artifact must show seed context, expanded child contexts, final evidence ids, and pruned branches."
},
{
"stage_name": "openviking.comparison_gate",
"kept_evidence": ["recursive-elf-comparison-gate"],
"notes": "No ELF tie, win, or loss is allowed until both systems publish comparable expansion-path artifacts for the same scenario."
}
]
}
}
}
@@ -112,6 +112,27 @@
"amount": 0.0,
"input_tokens": 0,
"output_tokens": 0
},
"trace_explainability": {
"trace_id": "fixture-openviking-staged-retrieval-blocked",
"failure_stage": "openviking.stage_artifact_gate",
"failure_reason": "Stage-level OpenViking trajectory output is not materialized, so the fixture keeps the context-trajectory comparison blocked.",
"stages": [
{
"stage_name": "openviking.same_corpus_gate",
"kept_evidence": [
"openviking-evidence-id-output-contract",
"openviking-same-corpus-precondition-blocked"
],
"notes": "Same-corpus expected, matched, and missing evidence ids must be correct before stage scoring is allowed."
},
{
"stage_name": "openviking.stage_artifact_gate",
"kept_evidence": ["elf-comparison-requires-comparable-trajectory"],
"dropped_evidence": ["trajectory-win-decoy"],
"notes": "Comparable stage artifacts are missing, and the decoy ELF win claim is explicitly dropped."
}
]
}
}
}
58 changes: 56 additions & 2 deletions apps/elf-eval/tests/real_world_job_benchmark.rs
@@ -1847,6 +1847,10 @@ fn graph_rag_representative_fixtures_report_typed_non_pass_states() -> Result<()
report.pointer("/summary/temporal_validity_not_encoded_count").and_then(Value::as_u64),
Some(1)
);
assert_eq!(
report.pointer("/summary/trace_explainability_count").and_then(Value::as_u64),
Some(1)
);

let jobs = array_at(&report, "/jobs")?;
let ragflow = find_by_field(jobs, "/job_id", "graph-rag-ragflow-reference-chunks-001")?;
@@ -1872,6 +1876,17 @@ fn graph_rag_representative_fixtures_report_typed_non_pass_states() -> Result<()
graphiti.pointer("/evolution/temporal_validity_not_encoded").and_then(Value::as_bool),
Some(true)
);
assert_eq!(
graphiti.pointer("/trace_explainability/failure_stage").and_then(Value::as_str),
Some("graphiti.provider_boundary")
);
assert!(array_contains_str(graphiti, "/produced_evidence", "graphiti-current-fact-contract")?);
assert!(array_contains_str(
graphiti,
"/produced_evidence",
"graphiti-historical-fact-contract"
)?);
assert!(array_contains_str(graphiti, "/produced_evidence", "graphiti-provider-boundary")?);
assert!(array_contains_str(graphify, "/produced_evidence", "graphify-source-location-output")?);

Ok(())
@@ -3383,7 +3398,8 @@ fn assert_qmd_debug_retest_markdown_and_indexes(
benchmarking_index.contains("2026-06-19-qmd-debug-ergonomics-dreaming-retest-report.md")
);
assert!(readme.contains("qmd Debug-Ergonomics Dreaming Retest Report - June 19, 2026"));
assert!(readme.contains("Latest real-world benchmark report: June 22, 2026"));
assert!(readme.contains("Temporal and Trajectory Adapter Coverage Report - June 23, 2026"));
assert!(readme.contains("Latest real-world benchmark report: June 23, 2026"));
assert!(readme.contains("keeps the qmd edge unchanged"));
}

@@ -7199,6 +7215,10 @@ fn context_trajectory_fixtures_report_blocked_openviking_gates() -> Result<()> {
report.pointer("/summary/expected_evidence_recall").and_then(Value::as_f64),
Some(1.0)
);
assert_eq!(
report.pointer("/summary/trace_explainability_count").and_then(Value::as_u64),
Some(3)
);

let suites = array_at(&report, "/suites")?;
let context = find_by_field(suites, "/suite_id", "context_trajectory")?;
@@ -7217,6 +7237,40 @@ fn context_trajectory_fixtures_report_blocked_openviking_gates() -> Result<()> {
assert_eq!(staged.pointer("/status").and_then(Value::as_str), Some("blocked"));
assert_eq!(hierarchy.pointer("/status").and_then(Value::as_str), Some("blocked"));
assert_eq!(recursive.pointer("/status").and_then(Value::as_str), Some("blocked"));
assert_eq!(
staged.pointer("/trace_explainability/failure_stage").and_then(Value::as_str),
Some("openviking.stage_artifact_gate")
);
assert_eq!(
hierarchy.pointer("/trace_explainability/failure_stage").and_then(Value::as_str),
Some("openviking.hierarchy_artifact_gate")
);
assert_eq!(
recursive.pointer("/trace_explainability/failure_stage").and_then(Value::as_str),
Some("openviking.recursive_expansion_gate")
);

let staged_stages = array_at(staged, "/trace_explainability/stages")?;
let staged_gate =
find_by_field(staged_stages, "/stage_name", "openviking.stage_artifact_gate")?;

assert!(array_contains_str(staged_gate, "/dropped_evidence", "trajectory-win-decoy")?);

let hierarchy_stages = array_at(hierarchy, "/trace_explainability/stages")?;
let hierarchy_gate =
find_by_field(hierarchy_stages, "/stage_name", "openviking.hierarchy_artifact_gate")?;

assert!(array_contains_str(hierarchy_gate, "/dropped_evidence", "hierarchy-design-win-decoy")?);

let recursive_stages = array_at(recursive, "/trace_explainability/stages")?;
let recursive_gate =
find_by_field(recursive_stages, "/stage_name", "openviking.recursive_expansion_gate")?;

assert!(array_contains_str(
recursive_gate,
"/dropped_evidence",
"recursive-expansion-win-decoy"
)?);
assert!(
staged.pointer("/reason").and_then(Value::as_str).is_some_and(
|reason| reason.contains("same-corpus output returns expected evidence ids")
@@ -7292,7 +7346,7 @@ fn assert_root_aggregate_summary(report: &Value) {
assert_eq!(report.pointer("/summary/quote_coverage").and_then(Value::as_f64), Some(1.0));
assert_eq!(
report.pointer("/summary/trace_explainability_count").and_then(Value::as_u64),
Some(2)
Some(5)
);
assert_eq!(
report.pointer("/summary/wrong_result_stage_attribution_count").and_then(Value::as_u64),
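
For readers following the fixture changes, the `trace_explainability` shape added in this diff can be modelled with plain structs. This is a minimal, std-only sketch and not code from this PR: the type names `TraceStage` and `TraceExplainability`, the helper methods, and the `staged_retrieval_trace` constructor are hypothetical. Only the field names and the staged-retrieval values are taken from the fixture hunk above.

```rust
/// One entry of `trace_explainability.stages` (field names from the fixtures).
struct TraceStage {
    stage_name: &'static str,
    kept_evidence: Vec<&'static str>,
    dropped_evidence: Vec<&'static str>,
}

/// Hypothetical in-memory view of a fixture's `trace_explainability` object.
struct TraceExplainability {
    trace_id: &'static str,
    failure_stage: &'static str,
    stages: Vec<TraceStage>,
}

impl TraceExplainability {
    /// The stage whose name equals `failure_stage`, if the trace lists it.
    fn failing_stage(&self) -> Option<&TraceStage> {
        self.stages.iter().find(|stage| stage.stage_name == self.failure_stage)
    }

    /// True when the failing stage explicitly drops `evidence_id`.
    fn drops_at_failure(&self, evidence_id: &str) -> bool {
        self.failing_stage()
            .is_some_and(|stage| stage.dropped_evidence.iter().any(|id| *id == evidence_id))
    }

    /// True when any stage keeps `evidence_id`.
    fn keeps(&self, evidence_id: &str) -> bool {
        self.stages
            .iter()
            .any(|stage| stage.kept_evidence.iter().any(|id| *id == evidence_id))
    }
}

/// Values copied from the staged-retrieval OpenViking fixture in this diff.
fn staged_retrieval_trace() -> TraceExplainability {
    TraceExplainability {
        trace_id: "fixture-openviking-staged-retrieval-blocked",
        failure_stage: "openviking.stage_artifact_gate",
        stages: vec![
            TraceStage {
                stage_name: "openviking.same_corpus_gate",
                kept_evidence: vec![
                    "openviking-evidence-id-output-contract",
                    "openviking-same-corpus-precondition-blocked",
                ],
                dropped_evidence: vec![],
            },
            TraceStage {
                stage_name: "openviking.stage_artifact_gate",
                kept_evidence: vec!["elf-comparison-requires-comparable-trajectory"],
                dropped_evidence: vec!["trajectory-win-decoy"],
            },
        ],
    }
}
```

With this model, `staged_retrieval_trace().drops_at_failure("trajectory-win-decoy")` checks the same property that `context_trajectory_fixtures_report_blocked_openviking_gates` asserts against the generated report: the decoy win claim is dropped at the stage named by `failure_stage`.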