fix(aarch64): requeue instead of using stale cached ttbr0 #410
Merged
Remove the PM-lock-busy fallback that restored cached TTBR0 values during dispatch. If TTBR0 setup cannot acquire the process-manager lock, the existing redirect/requeue path now handles the thread instead of risking a stale address-space root.

Co-authored-by: Ryan Breen <ryan@ryanbreen.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
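The behavior change described above can be sketched roughly as follows. This is a minimal illustration and not the kernel's actual code: `ProcessManager`, `Thread`, `Dispatch`, and `choose_dispatch` are hypothetical names, `std::sync::Mutex` stands in for whatever lock the kernel really uses, and requeueing when the PID is not found is an assumption made for the example.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the process manager: maps a PID to the
/// TTBR0 value of that process's live page-table root.
pub struct ProcessManager {
    pub ttbr0_by_pid: HashMap<u32, u64>,
}

/// Hypothetical per-thread state. `cached_ttbr0` may be stale, so the
/// sketch below never installs it.
pub struct Thread {
    pub pid: u32,
    pub cached_ttbr0: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// Lock acquired and the root looked up: install this TTBR0.
    Install(u64),
    /// Root could not be verified: hand the thread back to the run queue.
    Requeue,
}

pub fn choose_dispatch(pm: &Mutex<ProcessManager>, thread: &Thread) -> Dispatch {
    match pm.try_lock() {
        Ok(guard) => match guard.ttbr0_by_pid.get(&thread.pid) {
            Some(&root) => Dispatch::Install(root),
            // Assumption for this sketch: an unknown PID is also requeued.
            None => Dispatch::Requeue,
        },
        // Lock busy (or poisoned): the removed fallback would have been
        // `Dispatch::Install(thread.cached_ttbr0)`; now the thread is requeued.
        Err(_) => Dispatch::Requeue,
    }
}
```

The point of the design is that a busy lock costs the thread a scheduling round rather than letting it run under a root nobody could verify.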
Summary

Fixes the aarch64 stale cached TTBR0 dispatch bug: when the process-manager lock is busy, the dispatcher no longer installs `thread.cached_ttbr0` as an unverifiable fallback. It requeues/redirects instead of dispatching under a stale or orphaned userspace page-table root.

Root-cause evidence from TURN 350 showed the residual post-spawn `EC=0` fault was not PID 5 running under its own page table. The fault-time thread was PID 4/TID 14 with `cached_ttbr0=0x100004407e000`, whose untagged root `0x4407e000` no longer belonged to any live process.

Corrected TURN 351 verification

The initial verification harness had a false-pass bug: it stopped on `WAIT_STRESS_PASS` and did not scan the whole boot. TURN 351 reclassified the saved boots using normalized/de-interleaved serial and a whole-boot rule: a pass requires `WAIT_STRESS_PASS`, the #404 non-contiguous-frame assertion, and zero `UNHANDLED_EC` / `FATAL_POSTMORTEM` / `PANIC` / `DEFER_SNAP` / trace dump / `SOFT_LOCKUP` / `DATA_ABORT` anywhere in the log.

Corrected whole-boot results:

- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `WAIT_STRESS_PASS` + #404 assertion, no fault markers
- `UNHANDLED_EC`, 6x `DEFER_SNAP`, and trace dump after `WAIT_STRESS_PASS`
- `UNHANDLED_EC` and `DEFER_SNAP`

Artifact table: `/Users/wrb/Downloads/Ralph/breenix-interrupt-io-roadmap-1780056222/turn351-artifacts/whole_boot_reclassification.md`.

Comparison notes:

- `main-compare-1` and `main-compare-2`: no fail window in the same whole-log classifier.
- `pr404-compare-1`: no `WAIT_STRESS_PASS`; not a clean comparison pass.
- `pr404-compare-2`: no fail window.

Build evidence: `turn351-artifacts/final-fix-build-warning-error-grep.txt` is 0 bytes.

Test plan

- `turn351-artifacts/final-fix-build-warning-error-grep.txt`.
- `WAIT_STRESS` verification boots with Fix ARM64 user stack frame mapping #404 assertion active: `verify-2`, `verify-5`, `verify-7`, `verify-10`, `verify-11`, `verify-12`.
- `verify-3` as FAIL from de-interleaved whole-boot serial.

🤖 Generated with Claude Code / Codex Ralph verification loop.
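
The whole-boot pass rule described in the verification section can be sketched as a small classifier. This is only an illustration of the rule, not the actual harness: `classify_boot` is a hypothetical name, the log is assumed to be already normalized and de-interleaved, and because the PR does not give the literal strings for the #404 assertion or the trace dump, both are passed in as caller-supplied (non-empty) markers.

```rust
/// Fault markers named in the PR text; none may appear anywhere in the log.
const FAULT_MARKERS: [&str; 6] = [
    "UNHANDLED_EC",
    "FATAL_POSTMORTEM",
    "PANIC",
    "DEFER_SNAP",
    "SOFT_LOCKUP",
    "DATA_ABORT",
];

/// Returns true only if the whole boot log satisfies the pass rule.
///
/// `assertion_marker` and `trace_dump_marker` are hypothetical parameters:
/// the real marker strings are not stated in the PR.
pub fn classify_boot(log: &str, assertion_marker: &str, trace_dump_marker: &str) -> bool {
    // Both positive markers must be present somewhere in the log.
    let has_pass = log.contains("WAIT_STRESS_PASS");
    let has_assertion = log.contains(assertion_marker);

    // Scan the entire log, including everything after WAIT_STRESS_PASS,
    // so a late fault cannot hide behind an early pass marker.
    let has_fault = FAULT_MARKERS.iter().any(|m| log.contains(*m))
        || log.contains(trace_dump_marker);

    has_pass && has_assertion && !has_fault
}
```

Plain substring matching is deliberately coarse here. A real classifier would probably match whole marker lines, so that a word like `PANIC` inside unrelated output does not fail a boot.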