
feat(test): automated Parallels launcher→terminal test harness #411

Open
ryanbreen wants to merge 13 commits into main from feat/parallels-launcher-test-harness

Conversation

@ryanbreen
Owner

What this does

Host-side automation that drives the real Breenix GUI input path on a fresh Parallels VM and validates it with serial-log oracles — no kernel/userspace changes:

boot (run.sh --parallels) → BWM ready → double-tap SUPER →
  /bin/blauncher (Terminal pre-selected) → Enter → /bin/bterm

A run PASSes only when the serial log shows both [spawn] path='/bin/bterm' and [bterm] config:. "Launcher opened" alone ([spawn] path='/bin/blauncher') is an explicit FAIL — the oracle never accepts weaker evidence than "the terminal actually launched and initialized".
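
A minimal sketch of that verdict logic, assuming the serial excerpt has already been written to a file. The function name `oracle_verdict` and the FAIL reason strings are illustrative, not taken from launcher-smoke.sh; only the three markers come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the verdict for one serial excerpt.
# Usage: oracle_verdict <serial-excerpt-file>
oracle_verdict() {
  local log="$1"
  # -a: treat the log as text even if it holds stray binary bytes.
  # -F: match the markers literally (they contain '[' and ']').
  if grep -aqF "[spawn] path='/bin/bterm'" "$log" &&
     grep -aqF "[bterm] config:" "$log"; then
    echo "RESULT: PASS"
    return 0
  fi
  # Anything short of both markers is a FAIL; say which weaker state was seen.
  if grep -aqF "[spawn] path='/bin/blauncher'" "$log"; then
    echo "RESULT: FAIL: launcher opened but terminal did not launch"
  else
    echo "RESULT: FAIL: launcher never opened"
  fi
  return 1
}
```

The PASS branch is the only path that returns 0, so a caller cannot accidentally treat "launcher opened" as success.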

Files

  • scripts/parallels/inject.sh — canonical prlctl send-key-event helper (PS/2 set-1 scancodes, extended-key aware for Super = 0xE0 0x5B). Errors loudly (exit 2) on an empty/unset $VM instead of silently no-op'ing.
  • scripts/parallels/launcher-smoke.sh — one full run; prints exactly RESULT: PASS (exit 0) or RESULT: FAIL: <reason> (exit 1), with an evidence dir (serial excerpt + screenshots + result.txt). Includes a locked-screen preflight (refuses to run on a locked Mac, where Parallels silently drops injected keys) and a caffeinate -d keep-alive, both wired into the cleanup trap. The injection trigger is isolated to one config block at the top (SUPER_PREFIX/SUPER_CODE/INTER_TAP_MS/ENTER_CODE).
  • .claude/workflows/parallels-launcher-test.js — runs the smoke test sequentially (one VM, never parallel) up to 15×; gate = 10 consecutive PASS; reports the streak + first failure.
  • docs/planning/parallels-test-harness/README.md, RALPH_STATE.md — the proven recipe, host prerequisites, and known limitations.

Recipe

The recipe is proven in code and was walked manually in a prior session: double-tap Super (bwm.rs load_defaults binds SUPER+SUPER → exec /bin/blauncher), blauncher pre-selects APPS[0] = "Terminal", Enter alone launches /bin/bterm. Injection is prlctl send-key-event <VM> --scancode <ps2-set1> --event press|release — NOT CGEvents.
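
As a rough illustration of that command form, the sketch below wraps it in two hypothetical functions (`send_key`, `tap_key`) with the empty-VM guard described under Files. It is not the contents of inject.sh. Scancode 28 (0x1C) is Enter in PS/2 set 1.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the prlctl form quoted above.
# send_key <vm> <set-1 scancode, decimal> <press|release>
send_key() {
  local vm="$1" code="$2" event="$3"
  if [ -z "$vm" ]; then
    # Fail loudly instead of sending keys to nowhere.
    echo "send_key: VM name is empty or unset" >&2
    return 2
  fi
  prlctl send-key-event "$vm" --scancode "$code" --event "$event"
}

# tap_key <vm> <scancode>: one press followed by one release.
tap_key() {
  send_key "$1" "$2" press && send_key "$1" "$2" release
}

# Example (not executed here): tap_key "$VM" 28   # Enter
```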

⚠️ Validation status — live 10× run is PENDING AN UNLOCKED MAC

The harness is built and verified (both scripts bash -n clean; the recipe walked manually before), but the live 10-in-a-row validation has not run, because:

Root cause: prlctl send-key-event reaches the guest only when the macOS console is unlocked. With the screen locked, Parallels detaches the VM window and silently drops every injected keystroke (send-key-event returns rc=0, key never lands — proven functionally by injecting = into the Bounce demo with no effect). This is not a TCC/Accessibility issue: injection goes through the virtual xHCI HID via prl_disp_service, not macOS CGEvent — so no permissions grant fixes it. There is no non-interactive unlock bypass; the smoke script preflights this and fails fast with a clear message.

Exact steps to run the validation (operator, at the console):

  1. Physically unlock the Mac at the console.
  2. Disable auto-lock for unattended runs: System Settings → Lock Screen → "Require password after screen saver begins/display is turned off" = Never/Off.
  3. caffeinate -d & (the smoke script also starts one, but disabling auto-lock is still required).
  4. bash scripts/parallels/launcher-smoke.sh (single run), then invoke the parallels-launcher-test workflow for the 10× gate.

QEMU is not a substitute for this flow: BWM's ARM64 path needs the Parallels-specific VirGL 3D compositor (absent on QEMU here, so BWM never starts), and the SUPER hotkey reads SUPER_PRESSED only from the USB-HID/xHCI driver, which never enumerates on QEMU (the virtio-keyboard MMIO driver never tracks Super). Making QEMU viable would require kernel changes (software-compositor fallback for BWM + a virtio-keyboard→SUPER bridge) — out of scope for this host-side harness.

Test plan

  • bash scripts/parallels/launcher-smoke.sh prints RESULT: PASS (requires unlocked Mac)
  • parallels-launcher-test workflow reports consecutiveGreenAchieved: true — 10 consecutive green (requires unlocked Mac)

🤖 Generated with Claude Code

ryanbreen and others added 13 commits June 1, 2026 21:16
Host-side automation that drives the real Breenix GUI input path on a fresh
Parallels VM and validates it with serial-log oracles:

  boot (run.sh --parallels) -> BWM ready -> double-tap SUPER -> /bin/blauncher
  (Terminal pre-selected) -> Enter -> /bin/bterm

PASS requires real serial evidence that bterm spawned AND emitted its config
line -- "launcher opened" alone is an explicit FAIL.

Files:
- scripts/parallels/inject.sh -- canonical prlctl send-key-event helper
  (PS/2 set-1 scancodes; extended-key aware; errors loudly on empty $VM).
- scripts/parallels/launcher-smoke.sh -- one full run, prints exactly
  "RESULT: PASS" / "RESULT: FAIL: <reason>". Locked-screen preflight (refuses
  to run on a locked Mac, where Parallels silently drops injected keys) plus a
  caffeinate -d keep-alive, both wired into the cleanup trap.
- .claude/workflows/parallels-launcher-test.js -- runs the smoke test
  sequentially (one VM, never parallel) up to 15x; gate = 10 consecutive PASS.
- docs/planning/parallels-test-harness/{README,RALPH_STATE}.md -- proven
  recipe, host prerequisites, and known limitations.

Documents the night's findings: the macOS console must be unlocked for
prlctl send-key-event to reach the guest (it injects through the virtual xHCI
HID via prl_disp_service, NOT macOS CGEvent/TCC -- so no permissions grant
fixes a locked screen), the unattended-run requirements (disable auto-lock +
caffeinate), and why QEMU is not a viable substitute for this flow (BWM needs
the Parallels-specific VirGL compositor and SUPER is only read from the
USB-HID/xHCI driver, which never enumerates on QEMU).

Validation status: the live 10x run is PENDING AN UNLOCKED MAC. The recipe is
proven in code and was walked manually in a prior session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…VM from run.sh stdout

Adversarial correctness review of the launcher-test harness (PR #411), which
has never been run end-to-end. Found and fixed one dangerous race that could
have caused a false readiness signal / wrong-VM injection on the very first
real run:

1. Stale-serial false match (HIGH). The readiness poll grepped
   /tmp/breenix-parallels-serial.log for the BWM ready marker with no guarantee
   the log was the fresh one this boot created. run.sh only `rm -f`s and
   recreates the serial log late (right before `prlctl start`, after the whole
   build). A leftover prior-run log at that path already containing the marker
   (confirmed present on the test Mac right now) would be matched as "ready"
   before the VM even started, after which BASE_LINE/tail-since would be
   computed against the wrong file and the oracle greps would see nothing.
   Fix: snapshot the leftover log's inode before launching run.sh and only
   trust the marker once the log's inode changes (fresh file) — serial_inode()
   + serial_is_fresh() gate the readiness poll.

2. Indirect VM-name resolution (MEDIUM). `prlctl list -a | grep breenix- |
   tail -1` could select a leftover/stuck breenix-* VM (run.sh's old-VM delete
   is best-effort). Fix: resolve the VM name authoritatively from run.sh's own
   `VM:     breenix-<epoch>` stdout line in RUN_LOG (printed only after the
   fresh VM is created+started), falling back to the prlctl heuristic.
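
The inode gate from item 1 could look roughly like this. The function names match the ones named above, but the bodies are a guess at the technique rather than the committed code; `ls -i` is used because it behaves the same on macOS and Linux.

```shell
#!/usr/bin/env bash
SERIAL_LOG="${SERIAL_LOG:-/tmp/breenix-parallels-serial.log}"

# Print the log's inode number, or nothing if the file does not exist.
serial_inode() {
  [ -e "$SERIAL_LOG" ] || return 0
  ls -i "$SERIAL_LOG" | awk '{print $1}'
}

# True once the log exists AND is a different file from the snapshot.
# $1 = inode captured before launching run.sh (empty if there was no log).
serial_is_fresh() {
  local before="$1" now
  now="$(serial_inode)"
  [ -n "$now" ] && [ "$now" != "$before" ]
}

# Intended use (sketch):
#   before="$(serial_inode)"
#   ... launch run.sh ...
#   until serial_is_fresh "$before"; do sleep 1; done
#   ... only now grep the log for the ready marker ...
```

One caveat with this approach: some filesystems can hand a deleted file's inode number to the next file created, so an inode comparison is a strong hint of freshness rather than a guarantee.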

The proven recipe (double-tap SUPER trigger, Enter, and the dual-oracle PASS
gate requiring BOTH `[spawn] path='/bin/bterm'` AND `[bterm] config:`) is
unchanged. README updated to match. inject.sh and the workflow JS were
reviewed and required no changes. bash -n, node --check, and shellcheck clean
(only an SC2329 false positive on the trap-invoked cleanup()).
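
For the VM-name resolution in item 2, a sketch under two assumptions: the `VM:` line has the name as its second whitespace-separated field, and `prlctl list -a` prints the VM name in its last column. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the VM name from run.sh's captured stdout, or nothing if absent.
# $1 = RUN_LOG path. Keeps the last matching line.
vm_from_run_log() {
  awk '$1 == "VM:" && $2 ~ /^breenix-[0-9]+$/ { name = $2 }
       END { if (name != "") print name }' "$1"
}

# Prefer the authoritative RUN_LOG line; fall back to the prlctl heuristic.
resolve_vm() {
  local name
  name="$(vm_from_run_log "$1")"
  if [ -z "$name" ]; then
    # Assumption: NAME is the last column of `prlctl list -a`.
    name="$(prlctl list -a 2>/dev/null | grep 'breenix-' | tail -1 | awk '{print $NF}')"
  fi
  [ -n "$name" ] && printf '%s\n' "$name"
}
```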

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…idate bterm's own startup

Three fixes from the first live end-to-end runs on an unlocked Mac (the flow is
now proven working — double-Ctrl opens /bin/blauncher, Enter launches /bin/bterm,
terminal window + child shell come up; serial + screenshot evidence):

- set -e lock preflight: the python lock probe exits 1 when UNLOCKED (the required
  state); as a bare statement that tripped `set -e` and aborted before reading $?.
  Run it as an if-condition (set -e exempt). Previously the harness died in ~1s on
  an unlocked Mac — the one state in which it must run.

- injection: Parallels 26.3.3 rejects `--scancode 91` (0x5B Super) with "Invalid
  scan code sequence" and offers no way to send the 0xE0 0x5B extended pair as
  separate --scancode calls. Breenix's HID layer maps the Left-Ctrl bit to the
  SUPER modifier, so inject Left-Ctrl (scancode 29, no prefix): accepted by
  Parallels and the exact "double control key" the operator describes.

- oracle: blauncher launches bterm via fork+execv, which does NOT emit the kernel's
  "[spawn] path='/bin/bterm'" line. Validate bterm's OWN startup logs instead --
  '[bterm] config:' AND '[bterm] spawned child pid=' (terminal started AND loaded
  its shell). Stronger, honest proof (the binary actually ran); never weakens the gate.
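
The `set -e` fix in the first bullet can be shown in isolation. `lock_probe` below is a stand-in that only mimics the exit-code convention described (1 when unlocked, 0 when locked); the real python probe is not reproduced here.

```shell
#!/usr/bin/env bash
# Stand-in probe: exits 0 when "locked", 1 when "unlocked".
# FAKE_LOCKED is a test knob invented for this sketch.
lock_probe() { [ "${FAKE_LOCKED:-0}" = "1" ]; }

# Runs in a subshell so `set -e` and `exit` stay contained.
preflight() (
  set -e
  # As a bare statement, lock_probe exiting 1 (unlocked) would abort here.
  # As an if-condition its exit status is exempt from `set -e`.
  if lock_probe; then
    echo "RESULT: FAIL: screen is locked"
    exit 1
  fi
  echo "preflight ok: screen unlocked"
)
```

With the probe in the `if`, a non-zero status only selects the branch and can no longer terminate the script.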

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The double-tap trigger is timing-sensitive (bwm requires two Ctrl taps within a
400ms window). On a CPU-throttled / overloaded host, prlctl send-key-event
latency balloons (observed 162s for a single doubletap at 4 VM cores), spreading
the two taps far past the window so the launcher never opens. Log the injection
wall-time and warn when it exceeds ~350ms, so a "launcher did not open" failure
is diagnosable as a timing miss vs. the key never reaching the guest.

Conclusion from the throttled gate: do NOT throttle these runs. The flow works
at full CPU (proven once end-to-end); reliability must be measured at full speed,
which means running when the operator is away rather than throttled alongside them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… body, --no-build)

The generated workflow had two bugs that would have wrecked a real run:
- it invoked launcher-smoke.sh WITHOUT --no-build, so each of up to 15 attempts
  would trigger a full kernel+userspace+ext2 rebuild (~10 min each).
- it was written as `export default async function run()` calling
  `agent({prompt, schema})`, but the Workflow runtime executes the script BODY
  directly and agent() takes (promptString, {schema}) -- so as written the loop
  was never invoked.

Rewrite to the documented pattern: top-level body with phase()/await agent(),
agent(prompt, {schema}), --no-build, a pre-run lock guard, and per-attempt
injection-wall-time capture. Stops at 10 consecutive PASS or 15 attempts.
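
The stop condition itself is independent of the Workflow runtime. A shell sketch of the same gate, with a hypothetical `run_gate` helper that is not part of the PR:

```shell
#!/usr/bin/env bash
# run_gate <max_attempts> <needed_streak> <command...>
# Runs the command strictly one at a time; a failure resets the streak.
run_gate() {
  local max="$1" need="$2"
  shift 2
  local attempt=0 streak=0 first_fail=0
  while [ "$attempt" -lt "$max" ] && [ "$streak" -lt "$need" ]; do
    attempt=$((attempt + 1))
    if "$@"; then
      streak=$((streak + 1))
    else
      streak=0
      if [ "$first_fail" -eq 0 ]; then first_fail="$attempt"; fi
    fi
  done
  echo "attempts=$attempt streak=$streak first_failure=$first_fail"
  [ "$streak" -ge "$need" ]
}

# Example (not executed here):
#   run_gate 15 10 bash scripts/parallels/launcher-smoke.sh --no-build
```

Because the streak resets on any failure, a failure at attempt 6 or later makes 10 consecutive passes unreachable within 15 attempts; this sketch still runs the remaining attempts rather than stopping early.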

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…thout breaking injection)

The operator uses this Mac while runs happen, so the VM must not hog CPU — but
throttling it breaks the timing-sensitive double-tap. Resolution:
- background_vm_proc: drop the VM to `renice 20` (perf cores, polite under
  contention) as soon as it boots, through the long boot/warmup phases.
- foreground_vm_proc: restore `renice 0` for the brief double-tap injection window.
- Use renice ONLY (no `taskpolicy -b`): E-core banishment starved the guest so it
  couldn't consume the two taps inside bwm's 400ms window (observed 1876ms).
- Add --no-background opt-out; bump default timeout to 1200s (backgrounded boots
  are slower).

NB: a separate, host-side issue gates reliability — `prlctl send-key-event`
latency is variable and coupled to host load (seen 0.4s..166s/call); a double-tap
needs each call <~100ms, which requires a responsive/quiet Parallels dispatcher.
The renice toggle fires correctly; an end-to-end PASS with it is still pending a
responsive dispatcher (run on a quiet host / after a Parallels restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enix VM is already running

run.sh kills any existing breenix VM before creating its own, so two overlapping
launcher-smoke runs would destroy each other's in-flight VM (and two VMs would
fight the Parallels dispatcher). Add a preflight that emits RESULT: FAIL and exits
if a breenix VM is already running, enforcing strictly-serial execution even if a
caller accidentally launches runs concurrently.
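
A sketch of such a preflight check, assuming `prlctl list -a` prints one VM per line with the status in the second column and the name in the last. The helper name is hypothetical, and it reads the listing on stdin so it can be exercised without Parallels.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) if any breenix-* VM in the listing is running.
# Assumed columns: UUID STATUS IP_ADDR NAME
breenix_vm_running() {
  awk '$2 == "running" && $NF ~ /^breenix-/ { found = 1 }
       END { exit !found }'
}

# Intended use (sketch, not executed here):
#   if prlctl list -a | breenix_vm_running; then
#     echo "RESULT: FAIL: a breenix VM is already running"
#     exit 1
#   fi
```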

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t -j` (load-independent double-tap)

ROOT CAUSE of the failing reliability gate (15/15 fail, every double-tap ~1.9s):
the double-tap was 4 SEPARATE `prlctl send-key-event` spawns, each ~475ms on a
loaded host, so the two taps landed ~1.9s apart — far outside bwm's 400ms window.
Proof #3 only passed because the dispatcher was fast on an idle (5am) host.

FIX: send every command as ONE `prlctl send-key-event -j` batch (JSON event array
on stdin). The inter-event delays are then applied by the Parallels dispatcher
with precise timing, INDEPENDENT of prlctl's per-spawn latency — so the double-tap
lands inside the 400ms window regardless of host load. Validated: the whole
double-tap is one ~0.6s call with the two taps spaced exactly 190ms by the
dispatcher (vs ~1.9s and unreliable across 4 spawns).

inject.sh: tap/doubletap/hold/type now build a JSON event array and send it via
one `-j` stdin call. launcher-smoke.sh: the injection wall-time log is reworded
(wall-time is now just prlctl overhead, not the tap spacing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m input drops

The 10/15 gate exposed two REAL Breenix intermittency bugs (not harness issues):
- ~25% double-tap drop: bwm never registers the (correctly batched, dispatcher-
  timed) double-tap — blauncher truly never spawns (verified absent across the
  whole boot, not late). Guest-side BWM/HID input intermittency.
- EC=0xe Illegal Execution State crash on the Enter->fork/exec->bterm path
  (run-124137): launcher opened, then [UNHANDLED_EC] cpu=5 + [FATAL_POSTMORTEM];
  the handler parks the CPU in idle so heartbeats continue (looks "hung"). This is
  clone-exec/TTBR0 SMP territory — the area of this branch's in-flight fixes.

Make the harness an honest bug-detector: grep the post-injection serial for
[UNHANDLED_EC]/[FATAL_POSTMORTEM]/panic and report "KERNEL FAULT ..." with the
offending line, distinctly from a benign "double-tap dropped" or "terminal did
not launch". No silent retry-to-green — the gate honestly reports the real
reliability (and which failure mode), per the no-faking-tests policy.
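
One way to express that three-way classification, as a sketch: the markers are the ones named in this commit message, while the function name and the exact reason strings are illustrative.

```shell
#!/usr/bin/env bash
# classify_failure <post-injection-serial-file>
# Call only after the PASS oracle has already failed.
classify_failure() {
  local log="$1" fault
  # First line that looks like a kernel fault, if any.
  fault="$(grep -aE -m1 '\[UNHANDLED_EC\]|\[FATAL_POSTMORTEM\]|panic' "$log" || true)"
  if [ -n "$fault" ]; then
    echo "RESULT: FAIL: KERNEL FAULT: $fault"
  elif ! grep -aqF "[spawn] path='/bin/blauncher'" "$log"; then
    echo "RESULT: FAIL: double-tap dropped (launcher never spawned)"
  else
    echo "RESULT: FAIL: terminal did not launch"
  fi
}
```

The fault check comes first so a crash after the launcher opened is never reported as the milder "terminal did not launch".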

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…postmortem

The EC=0xe (Illegal Execution State) catch-all previously printed only ELR, which
is not enough to confirm WHY the ERET landed in an illegal state. Add, on the
fatal park path only (interrupts already masked; lock-free raw-UART output like
the existing [UNHANDLED_EC]/[DATA_ABORT] lines; nothing on hot paths):
- [FATAL_REGS]: spsr, esr, far, elr, sp, x0..x30 from the exception frame.
- [FATAL_THREAD]: current tid, saved_by_inline_schedule, ctx_elr_el1 via the
  deadlock-safe scheduler try_dump_state (try_lock; skips if busy) — the same
  accessor the PC_ALIGN fatal handler already uses.

This makes the next capture of the intermittent crash decisive: SPSR shows the
illegal PSTATE, and saved_by_inline_schedule + ctx_elr_el1 directly confirm/refute
the stale-elr_el1-restored-on-dispatch-ERET hypothesis. Diagnostic only; exception.rs
only (no gold-master / context_switch.rs / userspace).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The launcher-test harness reproduced an intermittent crash; forensic analysis
(enhanced postmortem b196121 + symbolization + trace ring) confirms the
proximate cause with high confidence: idle_loop_arm64's register file gets saved
into a non-idle thread's Thread.context, which is later dispatched via ERET into
.bss (0x269000=WAKE_SITE_SCHEDULE) -> EC=0x0 (UDF) or EC=0xe (illegal SPSR).
Same bug, two exception classes. Unifies the prior crash hunt + the branch's
TTBR0/clone-exec cluster.

Fix is in gold-master context_switch.rs and the obvious mitigation intersects the
"NO EL0 dispatch guard" autopsy warning -> documented as a signoff proposal, not
applied. Doc lays out both fix options, the upstream-writer candidates, the
Parallels-only confirmation path, and how to validate via the harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…'t dropped

The launcher double-tap was dropped ~22% of the time: the modifier path is
polled-level (hid.rs SUPER_PRESSED.store), and bwm samples it once per (bursty,
GPU-fenced) compositor wake, so a tap's ~30ms high window can fall entirely
between two polls and be missed -> tap_count reaches 1 not 2 -> launcher never
fires. The mouse path already solved this with a press-edge latch; modifiers
lacked the equivalent.

Fix (mirrors the mouse latch; none of the 3 files are gold-master/prohibited):
- hid.rs: SUPER_TAP_COUNT atomic, incremented on the SUPER 0->1 rising edge at
  HID-report time (swap-based), plus a read-and-clear accessor; wakes the
  compositor on a Super edge. Lock-free, no logging on the path.
- graphics.rs: op=31 returns+clears the latched tap count; a keyboard-ready bit
  in compositor_ready_bits so a tap wakes compositor_wait.
- bwm.rs: drains the latch every frame and drives SUPER multi-tap from latched
  press-edges (combo semantics + 400ms window + cooldown preserved; a single tap
  cannot read as a double).

Validated via the launcher harness: drop rate ~22% -> ~9% (10/11 injected runs
opened the launcher), no regressions, no spurious launches, injection
load-independent. The residual ~9% showed zero guest HID activity post-injection
(a host injection-delivery miss, not the latch) -- separate, host-side.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run 3 of a validation batch hit a false readiness timeout (and leaked a VM)
because concurrent serial writers split the one-shot marker mid-line
("[in[bwm] hotkeys: using built-TELNETD_STARTING"). Match EITHER the
hotkeys-defaults line OR the recurring [bwm-fps] compositing line (printed
~180x/s once the desktop is live, so a clean instance appears within ms), via
grep -aE. Removes the harness's own flaky failure mode.
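
A sketch of the either-or readiness match. The exact marker text is not shown in this message, so the helper takes both patterns as arguments; the patterns in the usage comment are placeholders to be replaced with the real lines.

```shell
#!/usr/bin/env bash
# serial_ready <log> <one-shot-regex> <recurring-regex>
# -a keeps grep in text mode if the serial log contains binary bytes;
# -E allows the alternation between the two markers.
serial_ready() {
  grep -aqE "($2)|($3)" "$1"
}

# Example with placeholder patterns (not executed here):
#   serial_ready "$SERIAL_LOG" '^\[bwm\] hotkeys: ' '^\[bwm-fps\] '
```

Anchoring each pattern at the start of a line is a design choice in this sketch: a marker that another writer has split mid-line will not match, but the recurring line soon supplies a clean instance.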

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>