
Don't panic in get_signal when a finished compile has neither code nor signal#2746

Closed
jetm wants to merge 1 commit into mozilla:main from jetm:fix/get-signal-must-have-signal
jetm commented Jun 19, 2026

Problem

get_signal does status.signal().expect("must have signal"), assuming the Unix
invariant that an ExitStatus with no exit code was terminated by a signal. That
does not always hold. When the server reconstructs an ExitStatus for a
distributed compile whose compiler died abnormally, the status can carry neither
an exit code nor a terminating signal, and the expect() panics the compile task.
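
To make the failure mode concrete, here is a minimal Unix-only sketch that builds an `ExitStatus` from a raw wait status and reads both accessors. The helpers `code_and_signal` and `stopped_raw` are hypothetical names for illustration and are not part of sccache; the `(sig << 8) | 0x7f` encoding of a stopped process is an assumption that holds on Linux and the common BSD-derived platforms.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Hypothetical helper: returns `(code, signal)` for a raw wait status.
pub fn code_and_signal(raw: i32) -> (Option<i32>, Option<i32>) {
    let status = ExitStatus::from_raw(raw);
    (status.code(), status.signal())
}

/// Hypothetical helper: raw wait status for a process *stopped* by `sig`.
/// Assumes the usual encoding where the low byte 0x7f marks a stopped
/// process and the next byte carries the stop signal.
pub fn stopped_raw(sig: i32) -> i32 {
    (sig << 8) | 0x7f
}
```

A normal exit yields a code and no signal, a process killed by a signal yields a signal and no code, and `code_and_signal(stopped_raw(19))` yields neither. That last case is the one on which `status.signal().expect("must have signal")` panics.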

I hit this with sccache-dist: when the build server's packaged toolchain fails
to start a compile (e.g. `gcc: error while loading shared libraries: libm.so.6`),
the synthesized status has neither a code nor a signal. The panic crashes the
in-flight task, which the server then surfaces as the misleading error
`Failed to bind socket: ... panicked with message "must have signal"`. Under
load, this recurs on every such compile.

Fix

Return Option<i32> from get_signal and assign it straight into res.signal,
so a compile that reports neither a code nor a signal leaves res.signal unset
instead of panicking.
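
The shape of the change can be sketched roughly as follows. This is not the actual diff: `CompileResult` and `record_status` are hypothetical stand-ins for the server's response type and the call site, and only `get_signal`, `retcode`, and `signal` are names taken from this description.

```rust
use std::process::ExitStatus;

/// On Unix, forward whatever the status reports; `None` is now a valid answer.
#[cfg(unix)]
pub fn get_signal(status: ExitStatus) -> Option<i32> {
    use std::os::unix::process::ExitStatusExt;
    status.signal()
}

/// On Windows there are no terminating signals, so report none instead of panicking.
#[cfg(windows)]
pub fn get_signal(_status: ExitStatus) -> Option<i32> {
    None
}

/// Hypothetical stand-in for the compile response carried back to the client.
#[derive(Debug, Default, PartialEq)]
pub struct CompileResult {
    pub retcode: Option<i32>,
    pub signal: Option<i32>,
}

/// Hypothetical call site: record the exit code if there is one, otherwise
/// assign the optional signal directly. Both fields may end up `None`.
pub fn record_status(status: ExitStatus) -> CompileResult {
    let mut res = CompileResult::default();
    match status.code() {
        Some(code) => res.retcode = Some(code),
        None => res.signal = get_signal(status),
    }
    res
}
```

Because `get_signal` returns an `Option`, the caller needs no special case: the "neither code nor signal" status flows through as two `None` fields.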

This matches the contract the client already implements: handle_compile_finished
in commands.rs handles the case where both retcode and signal are None by
printing Missing compiler exit status! and returning -3. The server simply
never reached that path because it panicked first.
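
For reference, the client-side contract described above can be approximated like this. `client_exit_code` is a hypothetical function, not the real `handle_compile_finished`; the `-3` return and the `Missing compiler exit status!` message come from this description, while the `-2` return and the wording of the signal message are assumptions made for illustration.

```rust
/// Hypothetical approximation of how the client maps a finished compile
/// to a process exit code.
pub fn client_exit_code(retcode: Option<i32>, signal: Option<i32>) -> i32 {
    match (retcode, signal) {
        // Normal case: propagate the compiler's own exit code.
        (Some(code), _) => code,
        // Compiler was terminated by a signal (return value is an assumption).
        (None, Some(sig)) => {
            println!("Compiler killed by signal {}", sig);
            -2
        }
        // Neither code nor signal: the case the server used to panic on.
        (None, None) => {
            println!("Missing compiler exit status!");
            -3
        }
    }
}
```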

The Windows arm returns None rather than panic!; ExitStatus::code() is always
Some there, so the signal branch is unreachable anyway.

Test

Added a unit test covering a real terminating signal (SIGKILL) and the
neither-code-nor-signal case (WIFSTOPPED via ExitStatusExt::from_raw), which
previously panicked.

With the fix, the same sccache-dist scenario now reports Missing compiler exit status! and falls back to local compilation instead of crashing the server.

get_signal did `status.signal().expect("must have signal")`, assuming the
Unix invariant that an ExitStatus with no exit code was terminated by a
signal. That does not always hold: an ExitStatus reconstructed for a
distributed compile (or an abnormal wait status such as WIFSTOPPED) can
report neither a code nor a signal. When that happened the expect() panicked
the compile task, which the server surfaced as a misleading "Failed to bind
socket" and, under load, repeatedly fell back to local compilation.

Return Option<i32> from get_signal and assign it straight into res.signal, so
a compile that reports neither code nor signal leaves res.signal unset
instead of crashing the in-flight task. The Windows arm returns None rather
than panicking; ExitStatus::code() is always Some there, so the signal branch
is never reached anyway.

Add a unit test covering a real terminating signal (SIGKILL) and the
neither-code-nor-signal case (WIFSTOPPED via from_raw), which previously
panicked.

Signed-off-by: Javier Tia <javier@peridio.com>

jetm commented Jun 23, 2026

Superseded by #2750, which combines this with the rest of the OpenEmbedded/Yocto distributed-compile fixes into a single series. Closing in favor of that PR.

jetm closed this Jun 23, 2026