Summary
A call to a function that takes 5 scalar args and returns a 16-byte struct by value gets mismatched argument-register assignment between the call site and the callee: the caller omits the first argument from the register sequence and shifts the rest into r0–r3, while the callee reads r0–r3 as args 1–4. The result is silently wrong values (no fault). It reproduces without loom (synth directly), so it's a synth ARM-lowering issue, not a loom issue.
Found while extending the wasm-cross-LTO work to a 4th struct-return primitive (k_msgq_put). The 3 working primitives (sem, mutex, stack) all have decides with ≤3 args; msgq's decide is the first with 5 args, which is what exposes this.
Repro
gale_k_msgq_put_decide(write_idx, used_msgs, max_msgs, has_waiter, is_no_wait) -> GaleMsgqPutDecision (16-byte #[repr(C)]: i32 ret; u8 action; u32 new_write_idx; u32 new_used). Pipeline: clang --target=wasm32 -O2 → wasm-ld → synth compile --target cortex-m4f --native-pointer-abi --all-exports --relocatable. (loom not required — see below.)
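For reference, the return type described above can be sketched in Rust to confirm the 16-byte figure. This is a minimal mirror built only from the field list in this report; the derive attributes and the `field_offsets` helper are illustrative assumptions, not the actual gale source.

```rust
use std::mem::{align_of, size_of};

/// Assumed mirror of the decision struct, built from the field list in the report.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct GaleMsgqPutDecision {
    pub ret: i32,           // offset 0
    pub action: u8,         // offset 4, followed by 3 bytes of padding
    pub new_write_idx: u32, // offset 8
    pub new_used: u32,      // offset 12
}

/// Hypothetical helper: byte offsets of (ret, action, new_write_idx, new_used).
pub fn field_offsets() -> [usize; 4] {
    let d = GaleMsgqPutDecision::default();
    let base = &d as *const GaleMsgqPutDecision as usize;
    [
        &d.ret as *const i32 as usize - base,
        &d.action as *const u8 as usize - base,
        &d.new_write_idx as *const u32 as usize - base,
        &d.new_used as *const u32 as usize - base,
    ]
}

fn main() {
    println!("size    = {}", size_of::<GaleMsgqPutDecision>());
    println!("align   = {}", align_of::<GaleMsgqPutDecision>());
    println!("offsets = {:?}", field_offsets());
}
```

At 16 bytes the struct does not fit in a single 32-bit return register, which is why the sret interaction is listed as a suspect under Isolation.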
Caller (the dissolved z_impl_k_msgq_put body) just before bl <gale_k_msgq_put_decide> — synth, NO loom:
ldr r5, [fp, msgq+32] ; r5 = used_msgs
ldr r7, [fp, msgq+12] ; r7 = max_msgs
... cmp reader,#0 ; r1 = has_waiter
ldr r4, [sp,#104] ; r4 = is_no_wait
mov r0, r5 ; r0 <- used_msgs
mov r1, r7 ; r1 <- max_msgs
mov r2, r1(has_waiter) ; r2 <- has_waiter
mov r3, r4 ; r3 <- is_no_wait
bl <gale_k_msgq_put_decide>
write_idx (computed earlier) is never moved into an argument register. The caller passes args 2..5 in r0..r3 and drops arg1.
Callee gale_k_msgq_put_decide entry:
stmdb sp!, {r4,r5,r6,r7,r8,lr}
str r0,[sp,#24] ; str r1,[sp,#28] ; str r2,[sp,#32] ; str r3,[sp,#36]
... uses r1,r2,r3 (=args) in the used<max comparison
The callee reads r0..r3 as args 1..4 (write_idx, used, max, has_waiter).
Net effect
- callee used_msgs ← caller max_msgs (8)
- callee max_msgs ← caller has_waiter (0)
- → used(8) >= max(0) → not(used<max) → Full → returns -ENOMSG.
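The shift can be modelled on the host to show why an empty queue reports Full. This is a rough sketch, not the real decide: `callee_ret` implements only the used<max check named above, the register file is a plain array for r0..r3, and `is_no_wait = 1` is an assumed value. The fifth argument's location is not modelled.

```rust
/// Errno value taken from the report (rc=-35 is ENOMSG on the target).
const ENOMSG: i32 = 35;

/// The five scalar arguments in source order.
#[derive(Debug, Clone, Copy)]
struct Args {
    write_idx: u32,
    used_msgs: u32,
    max_msgs: u32,
    has_waiter: u32,
    is_no_wait: u32,
}

/// Caller as intended: args 1..4 go to r0..r3 (arg 5 is not modelled here).
fn regs_expected(a: Args) -> [u32; 4] {
    [a.write_idx, a.used_msgs, a.max_msgs, a.has_waiter]
}

/// Caller as observed: arg 1 is dropped and args 2..5 land in r0..r3.
fn regs_observed(a: Args) -> [u32; 4] {
    [a.used_msgs, a.max_msgs, a.has_waiter, a.is_no_wait]
}

/// Callee: reads r1 as used_msgs and r2 as max_msgs.
/// HYPOTHETICAL stand-in: only the full-queue check, nothing else.
fn callee_ret(regs: [u32; 4]) -> i32 {
    let (used, max) = (regs[1], regs[2]);
    if used < max { 0 } else { -ENOMSG }
}

fn main() {
    // Freshly initialised empty queue with max_msgs = 8, as in the report.
    let empty = Args { write_idx: 0, used_msgs: 0, max_msgs: 8, has_waiter: 0, is_no_wait: 1 };
    println!("expected call: rc = {}", callee_ret(regs_expected(empty)));
    println!("observed call: rc = {}", callee_ret(regs_observed(empty)));
}
```

With these inputs the expected assignment yields rc=0 and the shifted assignment yields rc=-35, matching the on-silicon symptom.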
On silicon (G474RE): a freshly k_msgq_init'd empty queue returns rc=-35 (ENOMSG) and stores nothing, instead of rc=0. Native k_msgq_put = 145 cyc (correct); the wasm-cross-LTO build returns the wrong result (no fault). The msg layout is independently DWARF-verified correct, and hardcoding write_idx=0 does not change the result (so it's not the write_idx division): the args are mis-assigned at the ABI level.
Isolation
- Reproduces with synth compile directly on the wasm-ld output (no loom): identical caller/callee register mismatch. → synth, not loom.
- The 3 ≤3-arg struct-return decides (sem/mutex/stack, 12-byte returns) lower correctly and run correctly on silicon. The trigger is 5 args (and/or the 5-args + sret interaction).
Kill-criterion
Wrong if, after the fix, the dissolved z_impl_k_msgq_put passes write_idx in the correct arg register (caller/callee agree) and the bench returns rc=0 + round-trips the value on an empty queue. Repro harness: gale-smart-data/.../wasm-testbed/msgq-microbench/ (one command: builds + flashes + measures native vs wasm).
I'm the on-silicon gate (G474RE) — will re-measure the moment a fix lands.