feat(test): automated Parallels launcher→terminal test harness #411
Open
ryanbreen wants to merge 13 commits into
Host-side automation that drives the real Breenix GUI input path on a fresh
Parallels VM and validates it with serial-log oracles:
boot (run.sh --parallels) -> BWM ready -> double-tap SUPER -> /bin/blauncher
(Terminal pre-selected) -> Enter -> /bin/bterm
PASS requires real serial evidence that bterm spawned AND emitted its config
line -- "launcher opened" alone is an explicit FAIL.
Files:
- scripts/parallels/inject.sh -- canonical prlctl send-key-event helper
(PS/2 set-1 scancodes; extended-key aware; errors loudly on empty $VM).
- scripts/parallels/launcher-smoke.sh -- one full run, prints exactly
"RESULT: PASS" / "RESULT: FAIL: <reason>". Locked-screen preflight (refuses
to run on a locked Mac, where Parallels silently drops injected keys) plus a
caffeinate -d keep-alive, both wired into the cleanup trap.
- .claude/workflows/parallels-launcher-test.js -- runs the smoke test
sequentially (one VM, never parallel) up to 15x; gate = 10 consecutive PASS.
- docs/planning/parallels-test-harness/{README,RALPH_STATE}.md -- proven
recipe, host prerequisites, and known limitations.
Documents the night's findings: the macOS console must be unlocked for
prlctl send-key-event to reach the guest (it injects through the virtual xHCI
HID via prl_disp_service, NOT macOS CGEvent/TCC -- so no permissions grant
fixes a locked screen), the unattended-run requirements (disable auto-lock +
caffeinate), and why QEMU is not a viable substitute for this flow (BWM needs
the Parallels-specific VirGL compositor and SUPER is only read from the
USB-HID/xHCI driver, which never enumerates on QEMU).
Validation status: the live 10x run is PENDING AN UNLOCKED MAC. The recipe is
proven in code and was walked manually in a prior session.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
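For illustration, a minimal sketch of the "errors loudly on empty $VM" behaviour described above. It is not the contents of inject.sh: the function names `require_vm` and `tap` are hypothetical, and the only thing taken from this PR is the `prlctl send-key-event <VM> --scancode <code> --event press|release` command shape and the exit code 2.

```shell
# Sketch only; require_vm and tap are hypothetical names, not inject.sh itself.

# Fail with exit 2 when VM is empty or unset, instead of silently doing nothing.
require_vm() {
  if [ -z "${VM:-}" ]; then
    echo "inject: VM is empty or unset" >&2
    exit 2
  fi
}

# Press then release one PS/2 set-1 scancode given in decimal
# (28 = 0x1C = Enter, 29 = 0x1D = Left-Ctrl).
tap() {
  require_vm
  local code="$1"
  prlctl send-key-event "$VM" --scancode "$code" --event press
  prlctl send-key-event "$VM" --scancode "$code" --event release
}
```

With `VM=breenix-<epoch>` exported, `tap 28` would send Enter; with `VM` unset the helper stops before any `prlctl` call is made.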
…VM from run.sh stdout

Adversarial correctness review of the launcher-test harness (PR #411), which
has never been run end-to-end. Found and fixed one dangerous race that could
have caused a false readiness signal / wrong-VM injection on the very first
real run:

1. Stale-serial false match (HIGH). The readiness poll grepped
   /tmp/breenix-parallels-serial.log for the BWM ready marker with no guarantee
   the log was the fresh one this boot created. run.sh only `rm -f`s and
   recreates the serial log late (right before `prlctl start`, after the whole
   build). A leftover prior-run log at that path already containing the marker
   (confirmed present on the test Mac right now) would be matched as "ready"
   before the VM even started, after which BASE_LINE/tail-since would be
   computed against the wrong file and the oracle greps would see nothing.
   Fix: snapshot the leftover log's inode before launching run.sh and only
   trust the marker once the log's inode changes (fresh file);
   serial_inode() + serial_is_fresh() gate the readiness poll.
2. Indirect VM-name resolution (MEDIUM). `prlctl list -a | grep breenix- |
   tail -1` could select a leftover/stuck breenix-* VM (run.sh's old-VM delete
   is best-effort). Fix: resolve the VM name authoritatively from run.sh's own
   `VM: breenix-<epoch>` stdout line in RUN_LOG (printed only after the fresh
   VM is created+started), falling back to the prlctl heuristic.

The proven recipe (double-tap SUPER trigger, Enter, and the dual-oracle PASS
gate requiring BOTH `[spawn] path='/bin/bterm'` AND `[bterm] config:`) is
unchanged. README updated to match. inject.sh and the workflow JS were reviewed
and required no changes. bash -n, node --check, and shellcheck clean (only an
SC2329 false positive on the trap-invoked cleanup()).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
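The inode gate from fix 1 can be sketched roughly as follows. The function names match the ones named in the commit message, but the bodies are an assumption (the real implementation may use `stat` instead of `ls -i`), and the sketch ignores the possibility that a filesystem hands the same inode number back after a delete.

```shell
# Sketch under assumptions: bodies are illustrative, not copied from launcher-smoke.sh.

# Print the inode number of a path, or nothing if the path does not exist.
serial_inode() {
  [ -e "$1" ] || return 0
  ls -i "$1" | awk '{print $1}'
}

# Succeed only once the log exists AND its inode differs from the snapshot
# taken before run.sh was launched (an empty snapshot means "no leftover log").
serial_is_fresh() {
  local log="$1" before="$2" now
  now="$(serial_inode "$log")"
  [ -n "$now" ] && [ "$now" != "$before" ]
}
```

A readiness poll would take `before="$(serial_inode "$LOG")"` once, start run.sh, and only grep for the ready marker after `serial_is_fresh "$LOG" "$before"` succeeds (`LOG` is a placeholder variable here).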
…idate bterm's own startup

Three fixes from the first live end-to-end runs on an unlocked Mac (the flow is
now proven working: double-Ctrl opens /bin/blauncher, Enter launches
/bin/bterm, terminal window + child shell come up; serial + screenshot
evidence):

- set -e lock preflight: the python lock probe exits 1 when UNLOCKED (the
  required state); as a bare statement that tripped `set -e` and aborted before
  reading $?. Run it as an if-condition (set -e exempt). Previously the harness
  died in ~1s on an unlocked Mac, the one state in which it must run.
- injection: Parallels 26.3.3 rejects `--scancode 91` (0x5B Super) with
  "Invalid scan code sequence" and offers no way to send the 0xE0 0x5B extended
  pair as separate --scancode calls. Breenix's HID layer maps the Left-Ctrl bit
  to the SUPER modifier, so inject Left-Ctrl (scancode 29, no prefix): accepted
  by Parallels and the exact "double control key" the operator describes.
- oracle: blauncher launches bterm via fork+execv, which does NOT emit the
  kernel's "[spawn] path='/bin/bterm'" line. Validate bterm's OWN startup logs
  instead -- '[bterm] config:' AND '[bterm] spawned child pid=' (terminal
  started AND loaded its shell). Stronger, honest proof (the binary actually
  ran); never weakens the gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
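The `set -e` pitfall in the first fix is easy to reproduce in isolation. In this sketch `screen_is_locked` is a hypothetical stand-in for the python lock probe (exit 0 when locked, 1 when unlocked), driven by a made-up `FAKE_LOCKED` variable, and the failure text is illustrative rather than the script's real message.

```shell
set -e

# Hypothetical stand-in for the lock probe: exit 0 = LOCKED, exit 1 = UNLOCKED.
screen_is_locked() {
  [ "${FAKE_LOCKED:-0}" = "1" ]
}

preflight() {
  # Written as a bare statement, `screen_is_locked` returning 1 (unlocked)
  # would abort the whole script under `set -e`. As an if-condition its exit
  # status is only tested, so the unlocked case falls through normally.
  if screen_is_locked; then
    echo "RESULT: FAIL: screen is locked"
    return 1
  fi
  echo "preflight ok: console unlocked"
}
```

Calling `preflight` with `FAKE_LOCKED` unset prints the "ok" line and the script carries on, which is the case that previously died in about a second.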
The double-tap trigger is timing-sensitive (bwm requires two Ctrl taps within a
400ms window). On a CPU-throttled / overloaded host, prlctl send-key-event
latency balloons (observed 162s for a single doubletap at 4 VM cores),
spreading the two taps far past the window so the launcher never opens.

Log the injection wall-time and warn when it exceeds ~350ms, so a "launcher did
not open" failure is diagnosable as a timing miss vs. the key never reaching
the guest.

Conclusion from the throttled gate: do NOT throttle these runs. The flow works
at full CPU (proven once end-to-end); reliability must be measured at full
speed, which means running when the operator is away rather than throttled
alongside them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… body, --no-build)
The generated workflow had two bugs that would have wrecked a real run:
- it invoked launcher-smoke.sh WITHOUT --no-build, so each of up to 15 attempts
would trigger a full kernel+userspace+ext2 rebuild (~10 min each).
- it was written as `export default async function run()` calling
`agent({prompt, schema})`, but the Workflow runtime executes the script BODY
directly and agent() takes (promptString, {schema}) -- so as written the loop
was never invoked.
Rewrite to the documented pattern: top-level body with phase()/await agent(),
agent(prompt, {schema}), --no-build, a pre-run lock guard, and per-attempt
injection-wall-time capture. Stops at 10 consecutive PASS or 15 attempts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
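The stop condition (10 consecutive PASS or 15 attempts, strictly one run at a time) can be written down as a small loop. This is a shell transcription for illustration only, not the workflow JS; `run_gate`, `NEED_STREAK` and `MAX_ATTEMPTS` are hypothetical names, and the only contract assumed is that each run prints a final `RESULT: ...` line as described earlier in this PR.

```shell
# run_gate CMD [ARGS...]: run CMD sequentially until the streak or the attempt cap is hit.
run_gate() {
  local need="${NEED_STREAK:-10}" max="${MAX_ATTEMPTS:-15}"
  local streak=0 attempt=0 out
  while [ "$attempt" -lt "$max" ] && [ "$streak" -lt "$need" ]; do
    attempt=$((attempt + 1))
    # Keep only the last RESULT line; anything else the run prints is ignored.
    out="$("$@" 2>/dev/null | grep -a '^RESULT: ' | tail -n 1)" || true
    if [ "$out" = "RESULT: PASS" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any failure resets the streak; there is no retry-to-green
    fi
    echo "attempt $attempt: ${out:-RESULT: FAIL: no result line} (streak $streak)"
  done
  # Exit status 0 only if the required streak was reached.
  [ "$streak" -ge "$need" ]
}
```

A call such as `run_gate bash scripts/parallels/launcher-smoke.sh --no-build` would mirror the gate. Note that with a cap of 15, a single failure after attempt 5 already makes 10 in a row unreachable.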
…thout breaking injection)

The operator uses this Mac while runs happen, so the VM must not hog CPU, but
throttling it breaks the timing-sensitive double-tap. Resolution:

- background_vm_proc: drop the VM to `renice 20` (perf cores, polite under
  contention) as soon as it boots, through the long boot/warmup phases.
- foreground_vm_proc: restore `renice 0` for the brief double-tap injection
  window.
- Use renice ONLY (no `taskpolicy -b`): E-core banishment starved the guest so
  it couldn't consume the two taps inside bwm's 400ms window (observed 1876ms).
- Add --no-background opt-out; bump default timeout to 1200s (backgrounded
  boots are slower).

NB: a separate, host-side issue gates reliability: `prlctl send-key-event`
latency is variable and coupled to host load (seen 0.4s..166s/call); a
double-tap needs each call <~100ms, which requires a responsive/quiet Parallels
dispatcher. The renice toggle fires correctly; an end-to-end PASS with it is
still pending a responsive dispatcher (run on a quiet host / after a Parallels
restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enix VM is already running

run.sh kills any existing breenix VM before creating its own, so two
overlapping launcher-smoke runs would destroy each other's in-flight VM (and
two VMs would fight the Parallels dispatcher). Add a preflight that emits
RESULT: FAIL and exits if a breenix VM is already running, enforcing
strictly-serial execution even if a caller accidentally launches runs
concurrently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
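A rough sketch of such a preflight, written so it can be exercised without Parallels installed: it reads a VM listing on stdin instead of calling `prlctl` itself. The column layout assumed here (status in the second column, name in the last) and both function names are assumptions, as is the exact wording of the failure line.

```shell
# Reads `prlctl list -a`-style rows on stdin and prints the names of running
# breenix-* VMs. ASSUMPTION: column 2 is the status and the last column is
# the VM name (breenix-<epoch> names contain no spaces).
running_breenix_vms() {
  awk '$2 == "running" && $NF ~ /^breenix-/ { print $NF }'
}

# Emits a RESULT: FAIL line and returns 1 if any breenix VM is running.
preflight_no_running_vm() {
  local found
  found="$(running_breenix_vms | tr '\n' ' ')"
  found="${found% }"
  if [ -n "$found" ]; then
    echo "RESULT: FAIL: breenix VM already running: $found"
    return 1
  fi
}
```

In a harness this would be fed as `prlctl list -a | preflight_no_running_vm`, before run.sh gets a chance to kill the other run's VM.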
…t -j` (load-independent double-tap)

ROOT CAUSE of the failing reliability gate (15/15 fail, every double-tap
~1.9s): the double-tap was 4 SEPARATE `prlctl send-key-event` spawns, each
~475ms on a loaded host, so the two taps landed ~1.9s apart, far outside bwm's
400ms window. Proof #3 only passed because the dispatcher was fast on an idle
(5am) host.

FIX: send every command as ONE `prlctl send-key-event -j` batch (JSON event
array on stdin). The inter-event delays are then applied by the Parallels
dispatcher with precise timing, INDEPENDENT of prlctl's per-spawn latency, so
the double-tap lands inside the 400ms window regardless of host load.

Validated: the whole double-tap is one ~0.6s call with the two taps spaced
exactly 190ms by the dispatcher (vs ~1.9s and unreliable across 4 spawns).

inject.sh: tap/doubletap/hold/type now build a JSON event array and send it via
one `-j` stdin call. launcher-smoke.sh: the injection wall-time log is reworded
(wall-time is now just prlctl overhead, not the tap spacing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m input drops

The 10/15 gate exposed two REAL Breenix intermittency bugs (not harness
issues):

- ~25% double-tap drop: bwm never registers the (correctly batched,
  dispatcher-timed) double-tap; blauncher truly never spawns (verified absent
  across the whole boot, not late). Guest-side BWM/HID input intermittency.
- EC=0xe Illegal Execution State crash on the Enter->fork/exec->bterm path
  (run-124137): launcher opened, then [UNHANDLED_EC] cpu=5 +
  [FATAL_POSTMORTEM]; the handler parks the CPU in idle so heartbeats continue
  (looks "hung"). This is clone-exec/TTBR0 SMP territory, the area of this
  branch's in-flight fixes.

Make the harness an honest bug-detector: grep the post-injection serial for
[UNHANDLED_EC]/[FATAL_POSTMORTEM]/panic and report "KERNEL FAULT ..." with the
offending line, distinctly from a benign "double-tap dropped" or "terminal did
not launch". No silent retry-to-green: the gate honestly reports the real
reliability (and which failure mode), per the no-faking-tests policy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
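The three-way classification can be sketched as a single function over the post-injection serial excerpt. The marker strings are the ones quoted in this PR; the function name, the exact RESULT wording, and the choice to check for a kernel fault before anything else are assumptions made for the sketch.

```shell
# classify_serial FILE: FILE is assumed to hold only serial lines captured
# after the double-tap was injected.
classify_serial() {
  local log="$1" fault
  # 1. Any kernel fault marker wins, and the offending line is reported.
  fault="$(grep -aE -m 1 '\[UNHANDLED_EC\]|\[FATAL_POSTMORTEM\]|panic' "$log" || true)"
  if [ -n "$fault" ]; then
    echo "RESULT: FAIL: KERNEL FAULT: $fault"
    return 1
  fi
  # 2. PASS needs BOTH of bterm's own startup lines.
  if grep -aqF '[bterm] config:' "$log" &&
     grep -aqF '[bterm] spawned child pid=' "$log"; then
    echo "RESULT: PASS"
    return 0
  fi
  # 3. Otherwise tell "launcher opened but no terminal" from "tap never landed".
  if grep -aqF "[spawn] path='/bin/blauncher'" "$log"; then
    echo "RESULT: FAIL: terminal did not launch"
  else
    echo "RESULT: FAIL: double-tap dropped"
  fi
  return 1
}
```

Checking for the fault first means a run that crashes after bterm has already logged its startup lines is still reported as a kernel fault rather than a PASS.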
…postmortem

The EC=0xe (Illegal Execution State) catch-all previously printed only ELR,
which is not enough to confirm WHY the ERET landed in an illegal state. Add, on
the fatal park path only (interrupts already masked; lock-free raw-UART output
like the existing [UNHANDLED_EC]/[DATA_ABORT] lines; nothing on hot paths):

- [FATAL_REGS]: spsr, esr, far, elr, sp, x0..x30 from the exception frame.
- [FATAL_THREAD]: current tid, saved_by_inline_schedule, ctx_elr_el1 via the
  deadlock-safe scheduler try_dump_state (try_lock; skips if busy), the same
  accessor the PC_ALIGN fatal handler already uses.

This makes the next capture of the intermittent crash decisive: SPSR shows the
illegal PSTATE, and saved_by_inline_schedule + ctx_elr_el1 directly
confirm/refute the stale-elr_el1-restored-on-dispatch-ERET hypothesis.
Diagnostic only; exception.rs only (no gold-master / context_switch.rs /
userspace).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The launcher-test harness reproduced an intermittent crash; forensic analysis
(enhanced postmortem b196121 + symbolization + trace ring) confirms the
proximate cause with high confidence: idle_loop_arm64's register file gets
saved into a non-idle thread's Thread.context, which is later dispatched via
ERET into .bss (0x269000=WAKE_SITE_SCHEDULE) -> EC=0x0 (UDF) or EC=0xe (illegal
SPSR). Same bug, two exception classes. Unifies the prior crash hunt + the
branch's TTBR0/clone-exec cluster.

Fix is in gold-master context_switch.rs and the obvious mitigation intersects
the "NO EL0 dispatch guard" autopsy warning -> documented as a signoff
proposal, not applied. Doc lays out both fix options, the upstream-writer
candidates, the Parallels-only confirmation path, and how to validate via the
harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…'t dropped

The launcher double-tap was dropped ~22% of the time: the modifier path is
polled-level (hid.rs SUPER_PRESSED.store), and bwm samples it once per (bursty,
GPU-fenced) compositor wake, so a tap's ~30ms high window can fall entirely
between two polls and be missed -> tap_count reaches 1 not 2 -> launcher never
fires. The mouse path already solved this with a press-edge latch; modifiers
lacked the equivalent.

Fix (mirrors the mouse latch; none of the 3 files are gold-master/prohibited):

- hid.rs: SUPER_TAP_COUNT atomic, incremented on the SUPER 0->1 rising edge at
  HID-report time (swap-based), plus a read-and-clear accessor; wakes the
  compositor on a Super edge. Lock-free, no logging on the path.
- graphics.rs: op=31 returns+clears the latched tap count; a keyboard-ready bit
  in compositor_ready_bits so a tap wakes compositor_wait.
- bwm.rs: drains the latch every frame and drives SUPER multi-tap from latched
  press-edges (combo semantics + 400ms window + cooldown preserved; a single
  tap cannot read as a double).

Validated via the launcher harness: drop rate ~22% -> ~9% (10/11 injected runs
opened the launcher), no regressions, no spurious launches, injection
load-independent. The residual ~9% showed zero guest HID activity
post-injection (a host injection-delivery miss, not the latch) -- separate,
host-side.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run 3 of a validation batch hit a false readiness timeout (and leaked a VM)
because concurrent serial writers split the one-shot marker mid-line
("[in[bwm] hotkeys: using built-TELNETD_STARTING"). Match EITHER the
hotkeys-defaults line OR the recurring [bwm-fps] compositing line (printed
~180x/s once the desktop is live, so a clean instance appears within ms), via
grep -aE. Removes the harness's own flaky failure mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
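The either-marker readiness check amounts to one `grep -aE` with an alternation. In this sketch the one-shot hotkeys pattern is passed in by the caller, because its full text is not quoted intact anywhere in this PR; `bwm_ready` is a hypothetical name and only the `[bwm-fps]` prefix is taken from the commit message.

```shell
# bwm_ready LOG HOTKEYS_RE
# Succeeds if LOG contains either the one-shot hotkeys line (caller-supplied
# ERE, a placeholder here) or any recurring [bwm-fps] line. -a makes grep
# treat a serial log with stray binary bytes as text.
bwm_ready() {
  local log="$1" hotkeys_re="$2"
  grep -aqE "(${hotkeys_re})|\[bwm-fps\]" "$log"
}
```

Because the `[bwm-fps]` line keeps recurring once the desktop is live, a one-shot marker that was split mid-line by a concurrent writer no longer turns into a readiness timeout.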
What this does
Host-side automation that drives the real Breenix GUI input path on a fresh Parallels VM and validates it with serial-log oracles — no kernel/userspace changes:
A run PASSes only when the serial log shows both `[spawn] path='/bin/bterm'` and `[bterm] config:`. "Launcher opened" alone (`[spawn] path='/bin/blauncher'`) is an explicit FAIL: the oracle never accepts weaker evidence than "the terminal actually launched and initialized".

Files
- `scripts/parallels/inject.sh` -- canonical `prlctl send-key-event` helper (PS/2 set-1 scancodes, extended-key aware for Super = `0xE0 0x5B`). Errors loudly (exit 2) on an empty/unset `$VM` instead of silently no-op'ing.
- `scripts/parallels/launcher-smoke.sh` -- one full run; prints exactly `RESULT: PASS` (exit 0) or `RESULT: FAIL: <reason>` (exit 1), with an evidence dir (serial excerpt + screenshots + `result.txt`). Includes a locked-screen preflight (refuses to run on a locked Mac, where Parallels silently drops injected keys) and a `caffeinate -d` keep-alive, both wired into the cleanup trap. The injection trigger is isolated to one config block at the top (`SUPER_PREFIX` / `SUPER_CODE` / `INTER_TAP_MS` / `ENTER_CODE`).
- `.claude/workflows/parallels-launcher-test.js` -- runs the smoke test sequentially (one VM, never parallel) up to 15×; gate = 10 consecutive PASS; reports the streak + first failure.
- `docs/planning/parallels-test-harness/README.md`, `RALPH_STATE.md` -- the proven recipe, host prerequisites, and known limitations.

Recipe
The recipe is proven in code and was walked manually in a prior session: double-tap Super (`bwm.rs` `load_defaults` binds `SUPER+SUPER → exec /bin/blauncher`), `blauncher` pre-selects `APPS[0] = "Terminal"`, Enter alone launches `/bin/bterm`. Injection is `prlctl send-key-event <VM> --scancode <ps2-set1> --event press|release`, NOT CGEvents.

The harness is built and verified (both scripts `bash -n` clean; the recipe walked manually before), but the live 10-in-a-row validation has not run, because:

Root cause: `prlctl send-key-event` reaches the guest only when the macOS console is unlocked. With the screen locked, Parallels detaches the VM window and silently drops every injected keystroke (`send-key-event` returns `rc=0`, key never lands; proven functionally by injecting `=` into the Bounce demo with no effect). This is not a TCC/Accessibility issue: injection goes through the virtual xHCI HID via `prl_disp_service`, not macOS CGEvent, so no permissions grant fixes it. There is no non-interactive unlock bypass; the smoke script preflights this and fails fast with a clear message.

Exact steps to run the validation (operator, at the console):
- `caffeinate -d &` (the smoke script also starts one, but disabling auto-lock is still required).
- `bash scripts/parallels/launcher-smoke.sh` (single run), then invoke the `parallels-launcher-test` workflow for the 10× gate.

QEMU is not a substitute for this flow: BWM's ARM64 path needs the Parallels-specific VirGL 3D compositor (absent on QEMU here, so BWM never starts), and the SUPER hotkey reads `SUPER_PRESSED` only from the USB-HID/xHCI driver, which never enumerates on QEMU (the `virtio-keyboard` MMIO driver never tracks Super). Making QEMU viable would require kernel changes (software-compositor fallback for BWM + a `virtio-keyboard`→SUPER bridge), which is out of scope for this host-side harness.

Test plan
- `bash scripts/parallels/launcher-smoke.sh` prints `RESULT: PASS` (requires unlocked Mac)
- `parallels-launcher-test` workflow reports `consecutiveGreenAchieved: true`, i.e. 10 consecutive green (requires unlocked Mac)

🤖 Generated with Claude Code