Skip to content

[Leaderboard] SCRIBE (Actioneer) - Claude Opus 4.7 - 71.99% Pass@1 (corrected)#57

Open
Suraj-gameramp wants to merge 2 commits into
ucbepic:mainfrom
Suraj-gameramp:scribe-actioneer-opus47-submission
Open

[Leaderboard] SCRIBE (Actioneer) - Claude Opus 4.7 - 71.99% Pass@1 (corrected)#57
Suraj-gameramp wants to merge 2 commits into
ucbepic:mainfrom
Suraj-gameramp:scribe-actioneer-opus47-submission

Conversation

@Suraj-gameramp

@Suraj-gameramp Suraj-gameramp commented Jun 10, 2026

Copy link
Copy Markdown

Agent name: SCRIBE (Actioneer)
Backbone LLM: Claude Opus 4.7 (Anthropic)
Hints: Yes
Trials: 5 per query
Stratified Pass@1: 71.99%

Note (correction): This is the corrected submission following the maintainers' leakage review of our earlier 83.87% entry. All flagged queries (and several additional ones we found in our own audit) were re-run under a hardened sandbox with regenerated, gold-free specs. The honest number is lower, and we report it as-is. Full remediation details below.

Architecture — design principles

SCRIBE (Spec-Conditioned ReAct with Inline Backreview Escalation) is a three-role harness. Each role mitigates a specific, well-known LLM failure mode on data-analysis tasks.

  1. Action bias -> planning agent. A dedicated spec agent commits a structured plan (JSON contract) before the executor sees a Python REPL; the executor is conditioned on that plan.
  2. "One definitive answer" bias -> multi-interpretation surfacing. The spec agent enumerates 2-4 explicit interpretations (filter strictness, metric definition, single- vs multi-row, code-vs-name granularity) and records the alternatives for the reviewer. Where the benchmark leaves a decisive interpretation unspecified, the spec now presents the options open, with no committed choice.
  3. Premature commitment -> backreview over the recent window. The reviewer receives the last N steps as structured context, so it can target the step where a wrong decision was made.
  4. Five-verdict cascade -> root-cause-aware recovery. The reviewer emits exactly one of [SPEC_WRONG | BLIND_SPOT | EXECUTOR_WRONG | NA_CONFIRMED | ANSWER].

Remediation of the flagged leakage (and what else we found)

Every issue from the review is fixed, and the affected queries re-run end-to-end:

  1. Local HF cache reads (agnews/query4). Added a filesystem guard to the executor preamble that blocks reads of any HuggingFace/Kaggle cache path and any .arrow/.parquet shard outside the task's own context directory. agnews was re-run with the guard active; all agnews answers now come from honest classification of the article text in MongoDB. (agnews/query4's honest answer does not match gold, and we report the failure.)
  2. Gold-derived values in the prompt (crmarenapro/q2, q12; deps_dev_v1/q1; googlelocal/q2; pancancer/q1). The hand-tuning pass that introduced these was removed entirely. Specs were regenerated from schema/data only, and the spec-extraction prompt now contains an explicit hygiene rule forbidding answer values, record IDs, gold cardinality, or validator references in any spec.
  3. Decisive unspecified interpretation injected up front (yelp/q1, q5). Regenerated specs surface the aggregation choice as an open interpretation with no steer. Notably, the executor still selects the review-weighted reading on its own from the data, and both queries pass honestly.
  4. Additional issues we found in our own audit and fixed the same way (re-run, not patched):
    • crmarenapro/q10's spec contained a literal record-ID prediction; all 13 crmarenapro queries were re-run with scrubbed specs.
    • github_repos/q2's spec used the eventual answer as a format example. After removing it, the executor still derives the same repo from the data in 5/5 runs — the example was redundant, but we re-ran anyway so the shipped traces are example-free.
    • github_repos/q1's spec pre-stated a computed value ("the answer is 1/3 ≈ 0.3333"). After removing it, the executor computes a different (failing) value in 5/5 runs — i.e. the hint was load-bearing. We kept the honest failing answers and report the lower github_repos score.
    • All specs were additionally scrubbed of validator-internals language (matcher behavior, row-count gates) and of sampled record-ID literals.

Integrity / anti-leakage

All trials in this submission ran under a hardened Python sandbox enforced in the executor's REPL preamble:

  • Import block: datasets, huggingface_hub, kaggle(hub), tensorflow_datasets, torchvision.datasets raise ImportError; offline env vars set.
  • Network egress block: sockets to any non-localhost host raise OSError; only the project DBs on 127.0.0.1 are reachable.
  • Local-cache block: reads of HuggingFace/Kaggle cache paths or external .arrow/.parquet shards raise OSError.

Self-audit (scribe_actioneer_opus47_taint.json, committed): 270/270 clean — 0 HF dataset loads, 0 cache file reads, 0 answer-key/validator reads, 0 external fetches, 0 gold values or answer-pointers in any spec that reaches a prompt, 0 JSON-vs-trace mismatches, 270 distinct real traces. We additionally scanned every trace prompt for gold tokens, record IDs, validator references, and format-examples equal to the answer (0 hits), and verified every passing answer is computed in its trace from the project data.

One disclosure: in a single deps_dev_v1/q1 trace (a query that fails 0/5), the executor writes "the planner's expected answer listed X" while overriding the planner's internal proposal against the data. This refers to our own planner's data-derived suggestion, not the benchmark gold; we left the trace untouched rather than edit it.

Results Summary

Dataset Pass@1
bookreview 1.00
googlelocal 1.00
stockindex 1.00
stockmarket 1.00
yelp 0.94
music_brainz_20k 0.93
crmarenapro 0.85
PANCANCER_ATLAS 0.67
deps_dev_v1 0.50
GITHUB_REPOS 0.50
agnews 0.25
PATENTS 0.00
Stratified Pass@1 0.7199

Notes

  • Pass@1 computed with DAB's stratified formula (1/D) x sum_J [(1/Q_J) x sum_i (c_i,j / n)]; every trial validated with the official query_<ds>/query<N>/validate.py (270 records, 0 validator errors).
  • Submission file: submissions/scribe_actioneer_opus47.json — 270 entries (54 queries x 5 trials).
  • Full per-trial trace bundle (submissions/scribe_actioneer_opus47_traces.zip, 270 session logs with prompts + tool I/O) and the self-audit ledger (submissions/scribe_actioneer_opus47_taint.json) are committed in this PR for verification.

@Ruiying-Ma

Ruiying-Ma commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Hi @Suraj-gameramp, thanks for your submission!

There are several leakage paths.

  1. agnews query4 (5 trials). After the sandbox correctly blocked the live load_dataset and network calls, the executor read the AG News data from a pre-existing local HuggingFace cache (~/.cache/huggingface/.../datasets--ag_news/.../train-00000-of-00001.parquet, plus the matching .arrow files).

  2. Answer or gold-derived value placed in the prompt:

Query (all 5 runs) What the prompt contains
crmarenapro / query2 "CONFIRMED CORRECT (verified against gold ka0Wt000000Eq0MIAS)", the exact answer id
crmarenapro / query12 "verified against gold 005Wt000003NDEBIA4 ... = gold", plus "DO NOT filter on Opportunity.CreatedDate"
deps_dev_v1 / query1 "CONFIRMED CORRECT COMPUTATION (verified against gold)" and the literal top row "@dmrvos/infrajs>0.0.6>typescript"
googlelocal / query2 "Gold has 4 entries; do NOT cap output" plus the decisive ">= 4.0 (INCLUSIVE)" rule
pancancer_atlas / query1 "the histology CODES (gold uses 9382/3, 9400/3, etc.)", which are the answer's row keys
  1. A decisive interpretation that the benchmark does not specify, handed over up front:
Query (all 5 runs) What the prompt contains
yelp / query1 "prefer avg_of_reviews (review-weighted)". The question does not say whether to average over reviews or over per-business means. Review-weighted gives 3.547 (the gold value), per-business-mean gives 3.497.
yelp / query5 the spec's review-weighted AVG(rating) fixes the rating to 3.48 (gold) versus 3.58 for per-business-mean. The winning state is robust either way, but the rating half depends on the injected choice.

We we will re-review your submission and report the pass@1 on our leaderboard after the issues above are resolved. Thanks!

@Suraj-gameramp

Copy link
Copy Markdown
Author

Thank you for the careful review  you're right on all three counts, and we appreciate you catching them. We're treating these as genuine leakage and fixing them properly rather than arguing the margins.

What we’ve already changed

1. Local HF cache reads (agnews/query4)

Our sandbox blocked live load_dataset calls and network access, but it did not block reads from the existing on-disk Hugging Face cache (e.g. ~/.cache/huggingface/.../ag_news-train.arrow).

We’ve added a filesystem guard to the executor preamble that now blocks:

  • Reads from any Hugging Face or Kaggle cache path
  • Reads of any .arrow or .parquet shard outside the task’s own context directory

(Project databases — SQLite, DuckDB, Postgres, and Mongo — are unaffected.)

We’ll re-run agnews/query4 with this guard enabled so the task is solved solely from the article text stored in MongoDB.

2. Gold-derived values in the prompt (crmarena/q2, crmarena/q12, deps_dev_v1/q1, googlelocal/q2, pancancer/q1)

These values came from a hand-tuning pass we performed on a subset of specs, and they should never have included:

  • Gold IDs
  • Gold row keys
  • Gold-cardinality hints

We’ve removed that hand-tuning step entirely and scrubbed the spec-extraction prompt so it can no longer emit:

  • Record IDs
  • Answer values
  • “Verified against gold”–style language

These specs are being regenerated cleanly from schema and data only.

3. Decisive unspecified interpretation injected up front (yelp/q1, yelp/q5)

Agreed — supplying the executor with the review-weighted vs. per-business-mean interpretation (the one that happened to match gold) is not legitimate.

The regenerated specs will now surface that as an open interpretation for the agent to resolve from the data itself, with no steering toward the gold-matching branch.

We’re re-running under the hardened sandbox with clean, gold-free specs, re-auditing all 270 trials, and will push the corrected submission to this same PR.

The resulting number might be lower than what we originally reported, we’d rather publish a clean result.

Thanks again for the thorough review.

…clean - 71.99% Pass@1

- Hardened sandbox: import block + network egress block + local HF/Kaggle
  cache read block (closes the agnews/query4 vector)
- Regenerated gold-free specs for all flagged queries (crmarenapro q2/q12,
  deps_dev_v1 q1, googlelocal q2, pancancer q1, yelp q1/q5) and re-ran 5x
- Additional self-audit findings fixed the same way: crmarenapro q10 record-ID
  prediction (all 13 crmarena queries re-run), github_repos q2 answer-as-format-
  example, github_repos q1 pre-stated value (honest re-run fails; reported as-is),
  validator-internals language scrubbed from all specs
- New trace bundle (270 sessions) + upgraded self-audit ledger: 270/270 clean
@Suraj-gameramp Suraj-gameramp force-pushed the scribe-actioneer-opus47-submission branch from 8f0d2f8 to ab84122 Compare June 11, 2026 16:52
@Suraj-gameramp Suraj-gameramp changed the title [Leaderboard] SCRIBE (Actioneer) - Claude Opus 4.7 - 83.87% Pass@1 [Leaderboard] SCRIBE (Actioneer) - Claude Opus 4.7 - 71.99% Pass@1 (corrected) Jun 11, 2026
@Suraj-gameramp

Copy link
Copy Markdown
Author

Hi! The corrected submission is now pushed to this PR.

Corrected Stratified Pass@1: 71.99% (down from the flagged 83.87% — the difference was the leakage, and we report the honest number as-is).

Artifacts updated in this PR: scribe_actioneer_opus47.json (270 cells), scribe_actioneer_opus47_traces.zip (270 session logs incl. full prompts), scribe_actioneer_opus47_taint.json (self-audit: 270/270 clean — 0 HF/cache loads, 0 answer-key reads, 0 external fetches, 0 gold tokens in any prompt, 0 JSON↔trace mismatches, 270 distinct traces).

One disclosure for trace review: a single deps_dev_v1/q1 trace (query fails 0/5) contains the executor phrase "the planner's expected answer listed X" while it overrides our planner's internal data-derived proposal — it refers to our own spec agent, not benchmark gold. We left the trace unedited.

Thanks again for the careful review — happy to answer anything that comes up in re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants