From f9ecf20d23598e4506b51ab0cc9d9b8351b9f4f5 Mon Sep 17 00:00:00 2001 From: Ryan Breen Date: Mon, 1 Jun 2026 13:57:41 -0400 Subject: [PATCH] docs(roadmap): operator launcher->picker->terminal full-VM lockup + #404 TTBR0 status Operator hit a full-VM lockup launching the terminal via launcher->picker->terminal. Folded into the active #404 fresh-process spawn-dispatch root-cause as the primary real-world repro (leading hypothesis: same wrong-page-table-root-on-fresh-child class). Fix forward, no bisect. Committed via plumbing off origin/main so the running Codex executor working tree is untouched. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/planning/ralph-roadmap.html | 36 ++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/docs/planning/ralph-roadmap.html b/docs/planning/ralph-roadmap.html index 853006b5..aac09a2c 100644 --- a/docs/planning/ralph-roadmap.html +++ b/docs/planning/ralph-roadmap.html @@ -12,6 +12,7 @@ h2{font-size:15px;margin:26px 0 10px;padding-bottom:6px;border-bottom:1px solid var(--border);text-transform:uppercase;letter-spacing:.3px;color:var(--muted)} .grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(300px,1fr));gap:10px} .card{background:var(--panel);border:1px solid var(--border);border-radius:8px;padding:12px 14px} + .card.live{border-color:var(--amber);box-shadow:0 0 0 1px rgba(210,153,34,.25)} .card h3{margin:0 0 6px;font-size:14px;display:flex;align-items:center;gap:8px;flex-wrap:wrap} .card p{margin:6px 0 0;font-size:13px} .card .meta{margin-top:8px;color:var(--muted);font-size:12px} .badge{display:inline-block;font-size:11px;font-weight:600;padding:1px 7px;border-radius:10px;border:1px solid transparent} @@ -28,39 +29,52 @@ .summary div{background:var(--panel);border:1px solid var(--border);border-radius:8px;padding:8px 14px;min-width:84px;text-align:center} .summary .n{font-size:20px;font-weight:700} .summary 
.l{font-size:11px;color:var(--muted);text-transform:uppercase;letter-spacing:.4px} code{background:#1f2630;padding:1px 5px;border-radius:4px;font-size:12px} .note{color:var(--muted);font-size:12px;margin-top:6px} + .live-banner{background:rgba(210,153,34,.10);border:1px solid rgba(210,153,34,.35);border-radius:8px;padding:10px 14px;margin:0 0 16px;font-size:13px}

Breenix Roadmap: Backlog

-

Canonical orchestrator backlog (maintained by Claude). Last updated 2026-06-01. ARM64 / Parallels focus. Refreshed at merge milestones (not per-turn; per-turn churn caused the prior revert). Per-turn execution detail lives in the Ralph inbox.md; version-tracked copy in-repo at docs/planning/ralph-roadmap.html.

+

Canonical orchestrator backlog (maintained by Claude). Last updated 2026-06-01; refreshed every workflow round & finding (not just merges). ARM64 / Parallels focus. Live per-turn detail in the Ralph inbox.md + /workflows; in-repo at docs/planning/ralph-roadmap.html.

+ +
🔴 Operator-reported (2026-06-01): launching the terminal via launcher → picker → terminal locked up the whole VM. Folded into the active #404 root-cause as the primary real-world repro. The leading hypothesis is the same SMP fresh-process spawn-dispatch class (wrong page-table root on a freshly-spawned child), to be confirmed or refuted with evidence. Fix forward, no bisect.

🔎 Active (TURN 350): root-causing the #404 residual post-spawn crash. It is narrowed to a TTBR0 / page-table-root mismatch on the SMP spawn path: a freshly-spawned child's text address translates through the wrong page-table root to a sibling process's memory, so it executes the sibling's data as code (EC=0 fault). Prime suspect is the gold-master context_switch.rs dispatch fallback. PR #404 is held, not merged (the original "fixed" proof was a false pass). No gold-master file has been touched; if the fix lands there it comes to you as a signoff proposal first.
12
Shipped (session)
-
1
In progress
+
2
In progress
3
Queued
8
Signoff-gated
-

🚧 In progress

+

🚧 In progress: fresh-process spawn-dispatch root-cause (unified)

-
-

#404 residual post-spawn crash: root-cause re-verify failed ARM64 · Parallels

-

The user-stack frame-aliasing lockup fix (PR #404) reduces but does not eliminate the fault: a clean re-verify on current main faulted 1 of 2 stress boots with a post-spawn UNHANDLED_EC → PANIC on the freshly spawned child (PID 5) after exec.

-

Key finding: the prior "assertion-fired 3/3, 0 crashes" proof was contaminated by an SMP-serial byte-interleaving blind spot. A naive FATAL/PANIC line-grep reads 0; you must de-interleave the two CPU streams to see [FATAL] bug=UNHANDLED_EC. The fault sits near the stack/spawn path → may overlap the #406 kstack-reuse area. PR #404 held, not merged. Next: de-interleave + root-cause; a gold-master file may be required → operator signoff first. beads breenix-oia family adjacent.

+
+

launcher → picker → terminal full-VM lockup operator repro · investigating ARM64 · Parallels · SMP

+

Operator (2026-06-01) launched the terminal through the launcher → picker → terminal chain and the entire VM locked up. Fix forward, no bisect (likely an intermittent defect that could predate any single commit).

+

Leading hypothesis: the same SMP fresh-process spawn-dispatch defect as #404 below. Launching the terminal spawns a fresh child process under a live multi-process SMP system, exactly the path where the wrong page-table root gets installed. A wrong root could surface as a clean fault (#404) or as a fault-storm / dispatch hang (full lockup). Not yet proven the same: the investigation reproduces the operator's exact chain, captures the de-interleaved serial + trace ring at the lockup, and confirms or refutes the shared cause before fixing forward.

+

Primary verification target: the launcher → picker → terminal chain must launch repeatedly without lockup. Gold-master fixes (e.g. context_switch.rs) come to the operator as a signoff proposal first.

+
+
+

#404 residual post-spawn crash TURN 350 · TTBR0 mismatch ARM64 · Parallels · SMP

+

The user-stack frame-aliasing fix (PR #404) reduces but does not eliminate the fault: ~4 of 6 fresh stress boots still crash. PR held, not merged.

+

Root cause (narrowed, TURN 349): a freshly-spawned child (bsshd, PID 5) takes EC=0x0 at EL0 on its own valid text address 0x4000460c. The SPSR is legal, so this is not a corrupted exception frame. The live fault-time page-table root translates that text address to PID 2 (wait_stress)'s physical frame, so the child executes a sibling's read-only data as instructions → undefined-instruction fault. Page tables are correct at load time (each process's text frame verified live==expected); the bug is the page-table root (TTBR0) installed when the fresh child is dispatched, often on a remote CPU.

+

Prime suspect (not yet proven): the gold-master context_switch.rs inline-ret dispatch. Under PROCESS_MANAGER lock contention (PmLockBusy) it falls back to the thread's cached_ttbr0, which for a fresh child can hold a sibling's root. Fits the SMP / ~4-of-6 probabilistic pattern. Falsified with evidence: illegal-SPSR / exec-state, stale I-cache, kstack reuse, frame double-free, load-time ELF aliasing, reuse of bsshd's own text frame.
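
The suspected fallback can be modelled in a few lines of host-side Rust. This is a toy sketch of the hypothesis only, not the code in context_switch.rs: `ProcessTable`, `Thread`, `RootSource` and `select_root` are hypothetical names, and a `std::sync::Mutex` stands in for the PROCESS_MANAGER lock. It shows how a stale cached root wins whenever the lock is busy.

```rust
use std::sync::Mutex;

/// Physical address of a page-table root (the value that would be loaded into TTBR0).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Ttbr0(pub u64);

/// Hypothetical stand-in for the process manager's pid -> root mapping.
pub struct ProcessTable {
    roots: Vec<(u32, Ttbr0)>,
}

impl ProcessTable {
    pub fn new(roots: Vec<(u32, Ttbr0)>) -> Self {
        Self { roots }
    }

    fn root_of(&self, pid: u32) -> Option<Ttbr0> {
        self.roots.iter().find(|(p, _)| *p == pid).map(|(_, r)| *r)
    }
}

/// Hypothetical per-thread state; the cached root is only as fresh as its last update.
pub struct Thread {
    pub pid: u32,
    pub cached_ttbr0: Ttbr0,
}

/// Records which path produced the root, so a trace can tell the two apart.
#[derive(Debug, PartialEq, Eq)]
pub enum RootSource {
    Authoritative,
    CachedFallback,
}

/// Pick the root for a thread about to be dispatched.
/// If the lock is busy (or the pid is unknown), fall back to the cached value.
pub fn select_root(pm: &Mutex<ProcessTable>, t: &Thread) -> (Ttbr0, RootSource) {
    match pm.try_lock() {
        Ok(table) => match table.root_of(t.pid) {
            Some(root) => (root, RootSource::Authoritative),
            None => (t.cached_ttbr0, RootSource::CachedFallback),
        },
        // Lock contention: the cached root is used without being re-validated.
        Err(_) => (t.cached_ttbr0, RootSource::CachedFallback),
    }
}
```

In this model a fresh child whose cache still holds a sibling's root gets the correct root when the lock is free and the sibling's root when it is contended, which would match an intermittent, SMP-only failure. Whether the real dispatch path behaves this way is exactly what remains unproven.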

+

Why the prior "fix" looked proven: the original "assertion-fired 3/3, 0 crashes" result was a false pass. The two SMP serial streams are byte-interleaved, so FATAL/PANIC line-greps read 0 unless you de-interleave. Verification now always de-interleaves.
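
A small sketch of why the naive grep reads 0. It assumes each captured byte can be attributed to the CPU that emitted it; the `(cpu, byte)` tagging, the round-robin `interleave` helper and all function names are assumptions of this example, not the project's actual capture format or tooling.

```rust
/// Split a tagged capture into one text stream per CPU.
/// The per-byte CPU tag is an assumption of this sketch.
pub fn deinterleave(tagged: &[(usize, u8)], cpus: usize) -> Vec<String> {
    let mut streams: Vec<Vec<u8>> = vec![Vec::new(); cpus];
    for &(cpu, byte) in tagged {
        if cpu < cpus {
            streams[cpu].push(byte);
        }
    }
    streams
        .iter()
        .map(|s| String::from_utf8_lossy(s).into_owned())
        .collect()
}

/// What a plain line-grep sees: the bytes in arrival order, tags ignored.
pub fn naive_view(tagged: &[(usize, u8)]) -> String {
    let bytes: Vec<u8> = tagged.iter().map(|&(_, b)| b).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

/// Build a worst-case capture: CPU 0 and CPU 1 alternate byte by byte.
pub fn interleave(cpu0: &str, cpu1: &str) -> Vec<(usize, u8)> {
    let (a, b) = (cpu0.as_bytes(), cpu1.as_bytes());
    let mut out = Vec::new();
    for i in 0..a.len().max(b.len()) {
        if let Some(&x) = a.get(i) {
            out.push((0, x));
        }
        if let Some(&y) = b.get(i) {
            out.push((1, y));
        }
    }
    out
}
```

With a fatal line on one CPU and an ordinary log line on the other, `naive_view` contains neither marker as a contiguous string, while the per-CPU streams from `deinterleave` do. Real interleaving is burstier than strict alternation, so a naive grep may sometimes hit, which is what makes a 0 count untrustworthy rather than always wrong.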

+

Now (TURN 350): prove the discriminator. Is the fault root PID-5's own table (⇒ a post-load page-table-entry bug in NON-prohibited code: manager.rs / elf.rs / pt allocator ⇒ direct PR) or a sibling's root (⇒ TTBR0 dispatch mismatch in gold-master context_switch.rs ⇒ precise signoff proposal, no edit without your go)?
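
The discriminator reduces to one comparison, sketched here under stated assumptions: the fault-time root and each process's expected root are available as plain base addresses (any ASID bits already masked off), and `Verdict` and `classify` are hypothetical names for illustration.

```rust
/// Outcome of comparing the fault-time root against the known per-process roots.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// The root is the child's own table: points at a post-load page-table-entry bug.
    OwnRootBadEntry,
    /// The root belongs to another process: points at a TTBR0 dispatch mismatch.
    SiblingRoot { owner_pid: u32 },
    /// The root matches no known process, so neither branch is supported.
    Unknown,
}

/// `roots` maps pid -> expected page-table root base address.
pub fn classify(fault_root: u64, child_pid: u32, roots: &[(u32, u64)]) -> Verdict {
    match roots.iter().find(|(_, r)| *r == fault_root) {
        Some(&(pid, _)) if pid == child_pid => Verdict::OwnRootBadEntry,
        Some(&(pid, _)) => Verdict::SiblingRoot { owner_pid: pid },
        None => Verdict::Unknown,
    }
}
```

Keeping an explicit `Unknown` outcome avoids forcing an ambiguous capture into either branch, since each branch leads to a different fix path (direct PR versus signoff proposal).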

+

PR merges only after ≥4-6 fresh de-interleaved crash-free stock-init WAIT_STRESS boots. Rounds: wf_2c11eaa2-3e1 → wf_fa1c5e4f-941. Live detail: /workflows + inbox.md TURN 345+.

📋 Queued: non-gold-master (ARM64)

SOFT_LOCKUP_VIRGL Parallels failure class P1 ARM64 · Parallels

Investigate + classify the SOFT_LOCKUP_VIRGL failure class on Parallels. beads breenix-ha9.

-

F15: ARM64 AHCI timeout corridor after GICR discovery P1 ARM64

Remaining AHCI timeout corridor after GICR discovery; verify whether the fix is storage-driver-only (non-gold-master) or touches gic.rs → signoff. beads breenix-xk8.

+

F15: ARM64 AHCI timeout corridor after GICR discovery P1 ARM64

Remaining AHCI timeout corridor after GICR discovery; verify storage-driver-only (non-gold-master) vs gic.rs (signoff). beads breenix-xk8.

bsshd SSH exit-status / close for exec P2 ARM64

bsshd should send SSH exit-status + channel close for exec requests. beads breenix-72x.

🔒 Signoff-gated: gold-master / CPU0 cluster (awaiting operator)

-

CPU0 timer-death + scheduler cluster signoff ARM64 · Parallels

CPU0 vtimer death on Parallels + remote-wake / resched scheduling. Fixes live in frozen gold-master timer_interrupt.rs / gic.rs / context_switch.rs → need operator signoff (this cluster burned ~a week before; investigate-only without go). beads oia, 9f1, cb7, 6f4, e43, k16, eh4.

-

BusyBox applet DATA_ABORT: ARM64 musl TLS signoff ARM64

BusyBox applet faults on the ARM64 musl TLS path (errno read from a zero TPIDR_EL0). User-facing symptom already fixed (native bls as /bin/ls); the principled TLS fix touches gold-master context-switch → signoff. beads breenix-b7u.

+

CPU0 timer-death + scheduler cluster signoff ARM64 · Parallels

CPU0 vtimer death on Parallels + remote-wake / resched scheduling. Fixes live in frozen gold-master timer_interrupt.rs / gic.rs / context_switch.rs → need operator signoff (this cluster burned ~a week before). beads oia, 9f1, cb7, 6f4, e43, k16, eh4.

+

BusyBox applet DATA_ABORT: ARM64 musl TLS signoff ARM64

BusyBox applet faults on the ARM64 musl TLS path (errno from a zero TPIDR_EL0). User-facing symptom fixed (native bls as /bin/ls); principled TLS fix touches gold-master context-switch → signoff. beads breenix-b7u.

โธ Deprioritized โ€” x86_64 (future)

@@ -78,7 +92,7 @@

✅ Shipped: merged to main (this session)

NB-1: CPU%-accounting fix merged

PR #396 · core-aware denominator + single-snapshot sampling (the 189% was an accounting artifact, not a real burn).

bssh client + host-auth suite merged

client channel #397, publickey #398, known-hosts verify #399, host-auth #403.

net RX stall + ARP pending queue merged

#400 net-rx-stall fix, #401 ARP pending queue.

-

crash-trace instrumentation merged

PR #402 · trace-ring crash diagnostics.

+

crash-trace instrumentation + roadmap refresh merged

#402 trace-ring crash diagnostics; #408 roadmap brought current.

✅ Resolved without a fix (closed with evidence)