Add JS-SAM language backend (fourth spec language)#17
Conversation
JavaScript + SAM pattern (sam.js.org) via @cognitive-fab/sam-pattern and @cognitive-fab/sam-fsm: the bench's first non-formal-language backend, and the first to use the direct Phase 3 transition-validation path. - tla_eval/languages/js_sam.py: LanguageBackend subclass; shells out to a Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) like Alloy/PAT. Phase 2/4 use the sam-pattern bundled behavior explorer (depthMax 6, bench-wide constant). Agent translators remap to a direct API call, matching Alloy/PAT precedent. - tools/js-sam: helper CLI (ping/validate/check/transitions/invariants); captures sam-pattern acceptor throws (async-wrapped) and treats gated 'unexpected action' rejections as legal model behavior. - Direct Phase 3 path: new language-neutral NDJSON trace loader (trace_loader.py) + implementation of the stubbed direct branch in transition_validation.py per add_new_spec_language.md 3.3. - spin task: js-sam prompts (Rocket Launcher in-context example, mandatory module contract) and 3 starter invariant templates. - scripts/run_benchmark.py: TLA+ tool gate now applies only when the TLA+ backend is selected; other languages use backend.check_available(). - Tests: 29 new (backend integration via real Node, trace loader, direct-path evaluator) plus known-good/known-bad spec fixtures and synthetic traces. docs/js_sam_backend_spec.md records the design and the implementation map. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
From-scratch explainer for reviewers new to the integration: what SysMoBench is, what the JS-SAM backend adds, the module contract, the reference and adversarial fixtures, every test that was run and what it demonstrates, and the honest list of what is not yet demonstrated (no live LLM runs, real spin traces not in the repo). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
claude-sonnet-4-20250514 generated a JS-SAM spin spec from the Asterinas source and passed Phase 1 (first attempt, no correction loop), Phase 2 (clean depth-6 exploration), and Phase 4 (3/3 expert invariants via the live translation path). Phase 3 remains blocked on captured traces. The run surfaced a config bug: the claude entry requested max_tokens 200000, which the Anthropic API rejects for claude-sonnet-4 (output cap 64000). Lowered with a dated probe note; this entry also backs the Alloy/PAT invariant-translator remaps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Live benchmark run: Claude on the spin task (2026-06-09)Beyond the fixture-based tests in the PR, we ran the pipeline with a real model: claude-sonnet-4-20250514 generated a JS-SAM spec for
Most interesting artifact: the generated spec was the model's own modeling, not a copy of the prompt's Rocket Launcher example. It added a reactor that grants the freed lock to a spinning thread the instant One fix this surfaced (now in the branch, commit 41ccc15): Caveat for honest reading: one model passing one task is an existence proof that the pipeline works end to end with real generation — not the hypothesis test. That needs multiple models and side-by-side JS-SAM vs TLA+/Alloy/PAT scores on the same task. Full details in 🤖 Generated with Claude Code |
|
Thank you very much. I briefly went through it, and I don’t think there are any major issues. However, I plan to fix a few things, and I will finish the review asap. Thank you again! |
Summary
Adds a fourth specification language: JavaScript using the SAM pattern (sam.js.org), via
@cognitive-fab/sam-patternand@cognitive-fab/sam-fsm.SAM is grounded in TLA+ semantics — actions present proposals, a model accepts them, state is a pure function of the model — but the artifact is an executable .js module. This is the bench's first non-formal-language backend, testing a different hypothesis: can LLMs produce correct system specs in a pattern they see constantly in training data, rather than in a formal language they rarely see?
Full design rationale, user stories, and decision log:
docs/js_sam_backend_spec.md.New to the project? Start with
docs/js_sam_walkthrough.md— a from-scratch explainer of the integration, every test that was run, and what it demonstrates.How it fits
One
LanguageBackendsubclass (tla_eval/languages/js_sam.py) plus a small Node helper (tools/js-sam/cli.mjs, JSON-in/JSON-out) the Python side shells out to — the same shape as the Alloy and PAT backends. No new CLI plumbing.instance,init,actions,getState,setState,checkerIntents)depthMax: 6), with runtime-error / nondeterminism / timeout classification aligned to the existing vocabularydocs/add_new_spec_language.md§3.3: SAM trace replay issetState(pre); actions[name](data); diff(getState(), post). Runs in seconds, no agent, no API spend. Per §3.3 this lands the evaluator-side work with it: a language-neutral NDJSON trace loader (trace_loader.py) and the previously-stubbed direct branch intransition_validation.py(state) => booleanpredicates checked by the explorer; agent translators remap to a direct API call, matching the Alloy (alloy.py) and PAT (pat.py) precedentScope matches Alloy/PAT's historical footprint: spin only,
direct_call, 3 starter invariants (safety-only, same bounded-checking rationale as Alloy's library), Rocket Launcher as the in-context example.Incidental fixes
scripts/run_benchmark.py: the unconditional TLA+ (java + tla2tools.jar) prerequisite gate now applies only when the TLA+ backend is selected; other languages rely on their owncheck_available(). A Node-only machine can now run JS-SAM end to end.transition_validation.py("trace loading is the backend's responsibility") and thebase.pyvalidate_transitionssignature (windows passed in) — in favor of the signature; docs updated.Testing
tests/fixtures/js_sam/.run_benchmark.py --metric compilation_check --language JS-SAM --spec-file tests/fixtures/js_sam/specs/spin-good.jsOpen questions for maintainers (details in the spec doc §9)
spinNDJSON traces (data/sys_traces/spin) are not in the repo — will they be published, or is harness execution expected per-evaluator? Unit tests use synthetic fixtures; the leaderboard Phase 3 number is blocked on real traces.depthMax: 6is a bench-wide constant for the first cut — per-task later?🤖 Generated with Claude Code