Add JS-SAM language backend (fourth spec language) by jdubray · Pull Request #17 · specula-org/SysMoBench

jdubray · 2026-06-09T22:10:04Z

Summary

Adds a fourth specification language: JavaScript using the SAM pattern (sam.js.org), via @cognitive-fab/sam-pattern and @cognitive-fab/sam-fsm.

SAM is grounded in TLA+ semantics — actions present proposals, a model accepts them, state is a pure function of the model — but the artifact is an executable .js module. This is the bench's first non-formal-language backend, testing a different hypothesis: can LLMs produce correct system specs in a pattern they see constantly in training data, rather than in a formal language they rarely see?

Full design rationale, user stories, and decision log: docs/js_sam_backend_spec.md.
New to the project? Start with docs/js_sam_walkthrough.md — a from-scratch explainer of the integration, every test that was run, and what it demonstrates.

How it fits

One LanguageBackend subclass (tla_eval/languages/js_sam.py) plus a small Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) the Python side shells out to — the same shape as the Alloy and PAT backends. No new CLI plumbing.

Phase 1 — parse + load + structural check of the mandatory module contract (instance, init, actions, getState, setState, checkerIntents)
Phase 2 — the sam-pattern bundled behavior explorer (bench-wide depthMax: 6), with runtime-error / nondeterminism / timeout classification aligned to the existing vocabulary
Phase 3 — first backend to use the direct path from docs/add_new_spec_language.md §3.3: SAM trace replay is setState(pre); actions[name](data); diff(getState(), post). Runs in seconds, no agent, no API spend. Per §3.3 this lands the evaluator-side work with it: a language-neutral NDJSON trace loader (trace_loader.py) and the previously-stubbed direct branch in transition_validation.py
Phase 4 — translated invariants are (state) => boolean predicates checked by the explorer; agent translators remap to a direct API call, matching the Alloy (alloy.py) and PAT (pat.py) precedent

Scope matches Alloy/PAT's historical footprint: spin only, direct_call, 3 starter invariants (safety-only, same bounded-checking rationale as Alloy's library), Rocket Launcher as the in-context example.

Incidental fixes

scripts/run_benchmark.py: the unconditional TLA+ (java + tla2tools.jar) prerequisite gate now applies only when the TLA+ backend is selected; other languages rely on their own check_available(). A Node-only machine can now run JS-SAM end to end.
Resolved the contract ambiguity between the stale comment in transition_validation.py ("trace loading is the backend's responsibility") and the base.py validate_transitions signature (windows passed in) — in favor of the signature; docs updated.

Testing

29 new tests (all passing): backend integration through the real Node helper, trace loader (both NDJSON record shapes), and the direct-path evaluator. Known-good / known-bad spec fixtures plus synthetic traces under tests/fixtures/js_sam/.
§6 end-to-end checkpoint passes: run_benchmark.py --metric compilation_check --language JS-SAM --spec-file tests/fixtures/js_sam/specs/spin-good.js
Reference spin model explores ~327k states in ~2.7s at depth 6; a deliberately buggy release model fails exactly the windows exercising the bug.

Open questions for maintainers (details in the spec doc §9)

Is a runtime-pattern language in scope for the leaderboard? ("working SAM model" vs "sound TLA+ spec")
Captured spin NDJSON traces (data/sys_traces/spin) are not in the repo — will they be published, or is harness execution expected per-evaluator? Unit tests use synthetic fixtures; the leaderboard Phase 3 number is blocked on real traces.
Depth-bound governance: depthMax: 6 is a bench-wide constant for the first cut — per-task later?
If real agent paths for invariant translation ever land for Alloy/PAT, JS-SAM should share that infrastructure rather than keep the remap.

🤖 Generated with Claude Code

JavaScript + SAM pattern (sam.js.org) via @cognitive-fab/sam-pattern and @cognitive-fab/sam-fsm: the bench's first non-formal-language backend, and the first to use the direct Phase 3 transition-validation path. - tla_eval/languages/js_sam.py: LanguageBackend subclass; shells out to a Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) like Alloy/PAT. Phase 2/4 use the sam-pattern bundled behavior explorer (depthMax 6, bench-wide constant). Agent translators remap to a direct API call, matching Alloy/PAT precedent. - tools/js-sam: helper CLI (ping/validate/check/transitions/invariants); captures sam-pattern acceptor throws (async-wrapped) and treats gated 'unexpected action' rejections as legal model behavior. - Direct Phase 3 path: new language-neutral NDJSON trace loader (trace_loader.py) + implementation of the stubbed direct branch in transition_validation.py per add_new_spec_language.md 3.3. - spin task: js-sam prompts (Rocket Launcher in-context example, mandatory module contract) and 3 starter invariant templates. - scripts/run_benchmark.py: TLA+ tool gate now applies only when the TLA+ backend is selected; other languages use backend.check_available(). - Tests: 29 new (backend integration via real Node, trace loader, direct-path evaluator) plus known-good/known-bad spec fixtures and synthetic traces. docs/js_sam_backend_spec.md records the design and the implementation map. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

From-scratch explainer for reviewers new to the integration: what SysMoBench is, what the JS-SAM backend adds, the module contract, the reference and adversarial fixtures, every test that was run and what it demonstrates, and the honest list of what is not yet demonstrated (no live LLM runs, real spin traces not in the repo). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

claude-sonnet-4-20250514 generated a JS-SAM spin spec from the Asterinas source and passed Phase 1 (first attempt, no correction loop), Phase 2 (clean depth-6 exploration), and Phase 4 (3/3 expert invariants via the live translation path). Phase 3 remains blocked on captured traces. The run surfaced a config bug: the claude entry requested max_tokens 200000, which the Anthropic API rejects for claude-sonnet-4 (output cap 64000). Lowered with a dated probe note; this entry also backs the Alloy/PAT invariant-translator remaps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jdubray · 2026-06-09T23:05:44Z

Live benchmark run: Claude on the spin task (2026-06-09)

Beyond the fixture-based tests in the PR, we ran the pipeline with a real model: claude-sonnet-4-20250514 generated a JS-SAM spec for spin from the actual Asterinas Rust source (run_benchmark.py --language JS-SAM, real generation, no --spec-file).

Stage	Result	Detail
Generation	valid on attempt 1 of 3	15.0 s, 3,922 prompt + 1,293 completion tokens, no correction loop
Phase 1 — syntax/contract	PASS	0 syntax, 0 semantic errors, 0.09 s
Phase 2 — exploration	PASS	depth 6, ~4 s, no runtime errors, no nondeterminism
Phase 4 — invariants	PASS 3/3	live translation by Claude + explorer checks: MutualExclusion, LockStatusConsistency, NoDeadlock
Phase 3 — traces	not run	blocked on captured `spin` traces (open question #2 above)

Most interesting artifact: the generated spec was the model's own modeling, not a copy of the prompt's Rocket Launcher example. It added a reactor that grants the freed lock to a spinning thread the instant ReleaseLock fires — a defensible reading of "spins until acquisition," but it makes lock hand-off atomic with the release step, whereas real traces leave the waiting thread trying until its own next CAS wins. Phases 1/2/4 cannot see that divergence; only Phase 3 can adjudicate it — which makes the trace-availability question above concrete rather than hypothetical.

One fix this surfaced (now in the branch, commit 41ccc15): config/models.yaml requested max_tokens: 200000 for the claude entry, which the Anthropic API rejects for claude-sonnet-4 (output cap 64,000). Lowered with a dated probe note. The same entry backs the Alloy/PAT Phase 4 translator remaps, so the fix isn't JS-SAM-specific.

Caveat for honest reading: one model passing one task is an existence proof that the pipeline works end to end with real generation — not the hypothesis test. That needs multiple models and side-by-side JS-SAM vs TLA+/Alloy/PAT scores on the same task. Full details in docs/js_sam_walkthrough.md §5.

🤖 Generated with Claude Code

Qian-Cheng-nju · 2026-06-11T12:27:23Z

Thank you very much. I briefly went through it, and I don’t think there are any major issues. However, I plan to fix a few things, and I will finish the review asap. Thank you again!

kizo-core and others added 3 commits June 9, 2026 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JS-SAM language backend (fourth spec language)#17

Add JS-SAM language backend (fourth spec language)#17
jdubray wants to merge 3 commits into
specula-org:mainfrom
jdubray:full-stack

jdubray commented Jun 9, 2026 •

edited

Loading

Uh oh!

jdubray commented Jun 9, 2026

Uh oh!

Qian-Cheng-nju commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jdubray commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it fits

Incidental fixes

Testing

Open questions for maintainers (details in the spec doc §9)

Uh oh!

jdubray commented Jun 9, 2026

Live benchmark run: Claude on the spin task (2026-06-09)

Uh oh!

Qian-Cheng-nju commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jdubray commented Jun 9, 2026 •

edited

Loading