diff --git a/docs/planning/ralph-roadmap.html b/docs/planning/ralph-roadmap.html index 853006b5..aac09a2c 100644 --- a/docs/planning/ralph-roadmap.html +++ b/docs/planning/ralph-roadmap.html @@ -12,6 +12,7 @@ h2{font-size:15px;margin:26px 0 10px;padding-bottom:6px;border-bottom:1px solid var(--border);text-transform:uppercase;letter-spacing:.3px;color:var(--muted)} .grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(300px,1fr));gap:10px} .card{background:var(--panel);border:1px solid var(--border);border-radius:8px;padding:12px 14px} + .card.live{border-color:var(--amber);box-shadow:0 0 0 1px rgba(210,153,34,.25)} .card h3{margin:0 0 6px;font-size:14px;display:flex;align-items:center;gap:8px;flex-wrap:wrap} .card p{margin:6px 0 0;font-size:13px} .card .meta{margin-top:8px;color:var(--muted);font-size:12px} .badge{display:inline-block;font-size:11px;font-weight:600;padding:1px 7px;border-radius:10px;border:1px solid transparent} @@ -28,39 +29,52 @@ .summary div{background:var(--panel);border:1px solid var(--border);border-radius:8px;padding:8px 14px;min-width:84px;text-align:center} .summary .n{font-size:20px;font-weight:700} .summary .l{font-size:11px;color:var(--muted);text-transform:uppercase;letter-spacing:.4px} code{background:#1f2630;padding:1px 5px;border-radius:4px;font-size:12px} .note{color:var(--muted);font-size:12px;margin-top:6px} + .live-banner{background:rgba(210,153,34,.10);border:1px solid rgba(210,153,34,.35);border-radius:8px;padding:10px 14px;margin:0 0 16px;font-size:13px}
Canonical orchestrator backlog (maintained by Claude). Last updated 2026-06-01. ARM64 / Parallels focus. Refreshed at merge milestones (not per-turn; per-turn churn caused the prior revert). Per-turn execution detail lives in the Ralph inbox.md; version-tracked copy in-repo at docs/planning/ralph-roadmap.html.
Canonical orchestrator backlog (maintained by Claude). Last updated 2026-06-01; refreshed every workflow round & finding (not just merges). ARM64 / Parallels focus. Live per-turn detail in the Ralph inbox.md + /workflows; in-repo at docs/planning/ralph-roadmap.html.
The user-stack frame-aliasing lockup fix (PR #404) reduces but does not eliminate the fault: a clean re-verify on current main faulted 1 of 2 stress boots with a post-spawn UNHANDLED_EC → PANIC on the freshly spawned child (PID 5) after exec.
Key finding: the prior "assertion-fired 3/3, 0 crashes" proof was contaminated by an SMP-serial byte-interleaving blind spot: a naive FATAL/PANIC line-grep reads 0; you must de-interleave the two CPU streams to see [FATAL] bug=UNHANDLED_EC. The fault sits near the stack/spawn path and may overlap the #406 kstack-reuse area. PR #404 held, not merged. Next: de-interleave + root-cause; a gold-master file may be required → operator signoff first. beads breenix-oia family adjacent.
Operator (2026-06-01) launched the terminal through the launcher → picker → terminal chain and the entire VM locked up. Fix forward, no bisect (likely an intermittent defect that could predate any single commit).
+Leading hypothesis: the same SMP fresh-process spawn-dispatch defect as #404 below. Launching the terminal spawns a fresh child process under a live multi-process SMP system, exactly the path where the wrong page-table root gets installed. A wrong root could surface as a clean fault (#404) or as a fault-storm / dispatch hang (full lockup). Not yet proven the same: the investigation reproduces the operator's exact chain, captures the de-interleaved serial + trace ring at the lockup, and confirms or refutes the shared cause before fixing forward.
+ +The user-stack frame-aliasing fix (PR #404) reduces but does not eliminate the fault: ~4 of 6 fresh stress boots still crash. PR held, not merged.
+Root cause (narrowed, TURN 349): a freshly-spawned child (bsshd, PID 5) takes EC=0x0 at EL0 on its own valid text address 0x4000460c. The SPSR is legal, so this is not a corrupted exception frame. The live fault-time page-table root translates that text address to PID 2 (wait_stress)'s physical frame, so the child executes a sibling's read-only data as instructions → undefined-instruction fault. Page tables are correct at load time (each process's text frame verified live==expected); the bug is the page-table root (TTBR0) installed when the fresh child is dispatched, often on a remote CPU.
Prime suspect (not yet proven): the gold-master context_switch.rs inline-ret dispatch. Under PROCESS_MANAGER lock contention (PmLockBusy) it falls back to the thread's cached_ttbr0, which for a fresh child can hold a sibling's root. Fits the SMP / ~4-of-6 probabilistic pattern. Falsified with evidence: illegal-SPSR / exec-state, stale I-cache, kstack reuse, frame double-free, load-time ELF aliasing, reuse of bsshd's own text frame.
Why the prior "fix" looked proven: the original "assertion-fired 3/3, 0 crashes" result was a false pass. The two SMP serial streams are byte-interleaved, so FATAL/PANIC line-greps read 0 unless you de-interleave. Verification now always de-interleaves.
Now (TURN 350): prove the discriminator. Is the fault root PID-5's own table (→ a post-load page-table-entry bug in NON-prohibited code: manager.rs / elf.rs / pt allocator → direct PR) or a sibling's root (→ TTBR0 dispatch mismatch in gold-master context_switch.rs → precise signoff proposal, no edit without your go)?
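The prime-suspect fallback can be modeled in a few lines. This is a minimal user-space sketch of the hypothesis, not the actual context_switch.rs code: `Thread`, `RootSource`, `select_ttbr0`, and the `HashMap` standing in for PROCESS_MANAGER are illustrative assumptions; only `cached_ttbr0` and the lock-busy fallback come from the notes above.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for a scheduler thread record.
pub struct Thread {
    pub pid: u32,
    /// Root cached on the thread; for a fresh child this may still hold
    /// a sibling's root (the suspected defect).
    pub cached_ttbr0: u64,
}

/// Where the chosen page-table root came from.
#[derive(Debug, PartialEq, Eq)]
pub enum RootSource {
    ProcessTable,
    CachedFallback,
}

/// Pick the TTBR0 value to install when dispatching `thread`.
/// `pm` maps pid -> authoritative root (a stand-in for PROCESS_MANAGER).
pub fn select_ttbr0(pm: &Mutex<HashMap<u32, u64>>, thread: &Thread) -> (u64, RootSource) {
    match pm.try_lock() {
        // Lock acquired: use the authoritative root for this pid.
        Ok(table) => match table.get(&thread.pid) {
            Some(&root) => (root, RootSource::ProcessTable),
            None => (thread.cached_ttbr0, RootSource::CachedFallback),
        },
        // Lock busy (the PmLockBusy case): fall back to the cached value,
        // which is only safe if the cache was refreshed for this process.
        Err(_) => (thread.cached_ttbr0, RootSource::CachedFallback),
    }
}
```

In this model, a child with pid 5 whose `cached_ttbr0` still holds pid 2's root gets the correct root whenever the lock is free and the sibling's root whenever it is contended. That would match an intermittent, SMP-only failure pattern, if the hypothesis holds.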
Investigate + classify the SOFT_LOCKUP_VIRGL failure class on Parallels. beads breenix-ha9.
Remaining AHCI timeout corridor after GICR discovery; verify whether the fix is storage-driver-only (non-gold-master) or touches gic.rs → signoff. beads breenix-xk8.
Remaining AHCI timeout corridor after GICR discovery; verify storage-driver-only (non-gold-master) vs gic.rs (signoff). beads breenix-xk8.
bsshd should send SSH exit-status + channel close for exec requests. beads breenix-72x.
CPU0 vtimer death on Parallels + remote-wake / resched scheduling. Fixes live in frozen gold-master timer_interrupt.rs / gic.rs / context_switch.rs → need operator signoff (this cluster burned ~a week before; investigate-only without go). beads oia, 9f1, cb7, 6f4, e43, k16, eh4.
BusyBox applet faults on the ARM64 musl TLS path (errno read from a zero TPIDR_EL0). User-facing symptom already fixed (native bls as /bin/ls); the principled TLS fix touches gold-master context-switch → signoff. beads breenix-b7u.
CPU0 vtimer death on Parallels + remote-wake / resched scheduling. Fixes live in frozen gold-master timer_interrupt.rs / gic.rs / context_switch.rs → need operator signoff (this cluster burned ~a week before). beads oia, 9f1, cb7, 6f4, e43, k16, eh4.
BusyBox applet faults on the ARM64 musl TLS path (errno from a zero TPIDR_EL0). User-facing symptom fixed (native bls as /bin/ls); principled TLS fix touches gold-master context-switch → signoff. beads breenix-b7u.
PR #396 · core-aware denominator + single-snapshot sampling (the 189% was an accounting artifact, not a real burn).
client channel #397, publickey #398, known-hosts verify #399, host-auth #403.
PR #402 · trace-ring crash diagnostics.
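For the PR #396 item, a short sketch shows how a denominator that ignores the core count can report more than 100%. Everything here is an assumption for illustration: the `Snapshot` layout, the function names, and the two-core figure are hypothetical, and the notes do not say how the 189% actually arose.

```rust
/// Counters captured together in one pass, so numerator and denominator
/// cover the same interval (the "single-snapshot" idea). Hypothetical layout.
#[derive(Clone, Copy)]
pub struct Snapshot {
    /// Busy ticks summed across all cores.
    pub busy_ticks: u64,
    /// Wall-clock ticks elapsed.
    pub wall_ticks: u64,
}

/// Utilization that ignores the core count; can exceed 100% on SMP.
pub fn naive_percent(prev: Snapshot, now: Snapshot) -> f64 {
    let busy = now.busy_ticks.saturating_sub(prev.busy_ticks) as f64;
    let wall = now.wall_ticks.saturating_sub(prev.wall_ticks) as f64;
    if wall == 0.0 { 0.0 } else { 100.0 * busy / wall }
}

/// Utilization with a core-aware denominator: wall ticks times core count.
pub fn core_aware_percent(prev: Snapshot, now: Snapshot, cores: u32) -> f64 {
    let busy = now.busy_ticks.saturating_sub(prev.busy_ticks) as f64;
    let wall = now.wall_ticks.saturating_sub(prev.wall_ticks) as f64;
    let capacity = wall * f64::from(cores);
    if capacity == 0.0 { 0.0 } else { 100.0 * busy / capacity }
}
```

With illustrative numbers (189 busy ticks over 100 wall ticks on two cores), `naive_percent` reports 189% and `core_aware_percent` reports 94.5% for the same workload.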