Skip to content

Add JS-SAM language backend (fourth spec language)#17

Open
jdubray wants to merge 3 commits into
specula-org:mainfrom
jdubray:full-stack
Open

Add JS-SAM language backend (fourth spec language)#17
jdubray wants to merge 3 commits into
specula-org:mainfrom
jdubray:full-stack

Conversation

@jdubray

@jdubray jdubray commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Adds a fourth specification language: JavaScript using the SAM pattern (sam.js.org), via @cognitive-fab/sam-pattern and @cognitive-fab/sam-fsm.

SAM is grounded in TLA+ semantics — actions present proposals, a model accepts them, state is a pure function of the model — but the artifact is an executable .js module. This is the bench's first non-formal-language backend, testing a different hypothesis: can LLMs produce correct system specs in a pattern they see constantly in training data, rather than in a formal language they rarely see?

Full design rationale, user stories, and decision log: docs/js_sam_backend_spec.md.
New to the project? Start with docs/js_sam_walkthrough.md — a from-scratch explainer of the integration, every test that was run, and what it demonstrates.

How it fits

One LanguageBackend subclass (tla_eval/languages/js_sam.py) plus a small Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) the Python side shells out to — the same shape as the Alloy and PAT backends. No new CLI plumbing.

  • Phase 1 — parse + load + structural check of the mandatory module contract (instance, init, actions, getState, setState, checkerIntents)
  • Phase 2 — the sam-pattern bundled behavior explorer (bench-wide depthMax: 6), with runtime-error / nondeterminism / timeout classification aligned to the existing vocabulary
  • Phase 3 — first backend to use the direct path from docs/add_new_spec_language.md §3.3: SAM trace replay is setState(pre); actions[name](data); diff(getState(), post). Runs in seconds, no agent, no API spend. Per §3.3 this lands the evaluator-side work with it: a language-neutral NDJSON trace loader (trace_loader.py) and the previously-stubbed direct branch in transition_validation.py
  • Phase 4 — translated invariants are (state) => boolean predicates checked by the explorer; agent translators remap to a direct API call, matching the Alloy (alloy.py) and PAT (pat.py) precedent

Scope matches Alloy/PAT's historical footprint: spin only, direct_call, 3 starter invariants (safety-only, same bounded-checking rationale as Alloy's library), Rocket Launcher as the in-context example.

Incidental fixes

  • scripts/run_benchmark.py: the unconditional TLA+ (java + tla2tools.jar) prerequisite gate now applies only when the TLA+ backend is selected; other languages rely on their own check_available(). A Node-only machine can now run JS-SAM end to end.
  • Resolved the contract ambiguity between the stale comment in transition_validation.py ("trace loading is the backend's responsibility") and the base.py validate_transitions signature (windows passed in) — in favor of the signature; docs updated.

Testing

  • 29 new tests (all passing): backend integration through the real Node helper, trace loader (both NDJSON record shapes), and the direct-path evaluator. Known-good / known-bad spec fixtures plus synthetic traces under tests/fixtures/js_sam/.
  • §6 end-to-end checkpoint passes: run_benchmark.py --metric compilation_check --language JS-SAM --spec-file tests/fixtures/js_sam/specs/spin-good.js
  • Reference spin model explores ~327k states in ~2.7s at depth 6; a deliberately buggy release model fails exactly the windows exercising the bug.

Open questions for maintainers (details in the spec doc §9)

  1. Is a runtime-pattern language in scope for the leaderboard? ("working SAM model" vs "sound TLA+ spec")
  2. Captured spin NDJSON traces (data/sys_traces/spin) are not in the repo — will they be published, or is harness execution expected per-evaluator? Unit tests use synthetic fixtures; the leaderboard Phase 3 number is blocked on real traces.
  3. Depth-bound governance: depthMax: 6 is a bench-wide constant for the first cut — per-task later?
  4. If real agent paths for invariant translation ever land for Alloy/PAT, JS-SAM should share that infrastructure rather than keep the remap.

🤖 Generated with Claude Code

kizo-core and others added 3 commits June 9, 2026 14:55
JavaScript + SAM pattern (sam.js.org) via @cognitive-fab/sam-pattern and
@cognitive-fab/sam-fsm: the bench's first non-formal-language backend, and
the first to use the direct Phase 3 transition-validation path.

- tla_eval/languages/js_sam.py: LanguageBackend subclass; shells out to a
  Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) like Alloy/PAT.
  Phase 2/4 use the sam-pattern bundled behavior explorer (depthMax 6,
  bench-wide constant). Agent translators remap to a direct API call,
  matching Alloy/PAT precedent.
- tools/js-sam: helper CLI (ping/validate/check/transitions/invariants);
  captures sam-pattern acceptor throws (async-wrapped) and treats gated
  'unexpected action' rejections as legal model behavior.
- Direct Phase 3 path: new language-neutral NDJSON trace loader
  (trace_loader.py) + implementation of the stubbed direct branch in
  transition_validation.py per add_new_spec_language.md 3.3.
- spin task: js-sam prompts (Rocket Launcher in-context example, mandatory
  module contract) and 3 starter invariant templates.
- scripts/run_benchmark.py: TLA+ tool gate now applies only when the TLA+
  backend is selected; other languages use backend.check_available().
- Tests: 29 new (backend integration via real Node, trace loader,
  direct-path evaluator) plus known-good/known-bad spec fixtures and
  synthetic traces. docs/js_sam_backend_spec.md records the design and
  the implementation map.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
From-scratch explainer for reviewers new to the integration: what
SysMoBench is, what the JS-SAM backend adds, the module contract, the
reference and adversarial fixtures, every test that was run and what it
demonstrates, and the honest list of what is not yet demonstrated
(no live LLM runs, real spin traces not in the repo).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
claude-sonnet-4-20250514 generated a JS-SAM spin spec from the Asterinas
source and passed Phase 1 (first attempt, no correction loop), Phase 2
(clean depth-6 exploration), and Phase 4 (3/3 expert invariants via the
live translation path). Phase 3 remains blocked on captured traces.

The run surfaced a config bug: the claude entry requested
max_tokens 200000, which the Anthropic API rejects for claude-sonnet-4
(output cap 64000). Lowered with a dated probe note; this entry also
backs the Alloy/PAT invariant-translator remaps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jdubray

jdubray commented Jun 9, 2026

Copy link
Copy Markdown
Author

Live benchmark run: Claude on the spin task (2026-06-09)

Beyond the fixture-based tests in the PR, we ran the pipeline with a real model: claude-sonnet-4-20250514 generated a JS-SAM spec for spin from the actual Asterinas Rust source (run_benchmark.py --language JS-SAM, real generation, no --spec-file).

Stage Result Detail
Generation valid on attempt 1 of 3 15.0 s, 3,922 prompt + 1,293 completion tokens, no correction loop
Phase 1 — syntax/contract PASS 0 syntax, 0 semantic errors, 0.09 s
Phase 2 — exploration PASS depth 6, ~4 s, no runtime errors, no nondeterminism
Phase 4 — invariants PASS 3/3 live translation by Claude + explorer checks: MutualExclusion, LockStatusConsistency, NoDeadlock
Phase 3 — traces not run blocked on captured spin traces (open question #2 above)

Most interesting artifact: the generated spec was the model's own modeling, not a copy of the prompt's Rocket Launcher example. It added a reactor that grants the freed lock to a spinning thread the instant ReleaseLock fires — a defensible reading of "spins until acquisition," but it makes lock hand-off atomic with the release step, whereas real traces leave the waiting thread trying until its own next CAS wins. Phases 1/2/4 cannot see that divergence; only Phase 3 can adjudicate it — which makes the trace-availability question above concrete rather than hypothetical.

One fix this surfaced (now in the branch, commit 41ccc15): config/models.yaml requested max_tokens: 200000 for the claude entry, which the Anthropic API rejects for claude-sonnet-4 (output cap 64,000). Lowered with a dated probe note. The same entry backs the Alloy/PAT Phase 4 translator remaps, so the fix isn't JS-SAM-specific.

Caveat for honest reading: one model passing one task is an existence proof that the pipeline works end to end with real generation — not the hypothesis test. That needs multiple models and side-by-side JS-SAM vs TLA+/Alloy/PAT scores on the same task. Full details in docs/js_sam_walkthrough.md §5.

🤖 Generated with Claude Code

@Qian-Cheng-nju

Copy link
Copy Markdown
Member

Thank you very much. I briefly went through it, and I don’t think there are any major issues. However, I plan to fix a few things, and I will finish the review asap. Thank you again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants