Make distributed compilation work for OpenEmbedded/Yocto builds #2750
Draft
jetm wants to merge 5 commits into
get_signal did `status.signal().expect("must have signal")`, assuming the
Unix invariant that an ExitStatus with no exit code was terminated by a
signal. That does not always hold: an ExitStatus reconstructed for a
distributed compile (or an abnormal wait status such as WIFSTOPPED) can
report neither a code nor a signal. When that happened the expect() panicked
the compile task, which the server surfaced as a misleading "Failed to bind
socket" and, under load, repeatedly fell back to local compilation.
Return Option<i32> from get_signal and assign it straight into res.signal, so
a compile that reports neither code nor signal leaves res.signal unset
instead of crashing the in-flight task. The Windows arm returns None rather
than panicking; ExitStatus::code() is always Some there, so the signal branch
is never reached anyway.
Add a unit test covering a real terminating signal (SIGKILL) and the
neither-code-nor-signal case (WIFSTOPPED via from_raw), which previously
panicked.
Signed-off-by: Javier Tia <javier@peridio.com>
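The shape of that change can be sketched roughly as follows. This is a minimal illustration, not the patch itself: `CompileOutcome` and `outcome_from_status` are hypothetical stand-ins for the real result struct and its call site, and the raw wait statuses in the usage note assume Linux encoding.

```rust
#[cfg(unix)]
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Signal that terminated the process, if there was one.
/// Returns None instead of panicking when the status carries no signal.
#[cfg(unix)]
fn get_signal(status: ExitStatus) -> Option<i32> {
    status.signal()
}

/// On Windows there are no signals; code() is always Some there,
/// so this arm is not expected to be reached in practice.
#[cfg(windows)]
fn get_signal(_status: ExitStatus) -> Option<i32> {
    None
}

/// Hypothetical stand-in for the fields of the real compile result.
#[derive(Debug, PartialEq)]
struct CompileOutcome {
    code: Option<i32>,
    signal: Option<i32>,
}

/// Hypothetical call site: only consult the signal when there is no exit
/// code, and leave `signal` unset when the status reports neither.
fn outcome_from_status(status: ExitStatus) -> CompileOutcome {
    let code = status.code();
    let signal = if code.is_none() { get_signal(status) } else { None };
    CompileOutcome { code, signal }
}
```

On Unix, `ExitStatus::from_raw(9)` models a process killed by SIGKILL, and `ExitStatus::from_raw((19 << 8) | 0x7f)` models a stopped (WIFSTOPPED) status that has neither a code nor a signal; the second case is the one that used to hit the `expect()`.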
sccache-dist packages a toolchain's shared libraries by parsing `ldd` output, which resolves NEEDED libraries against the host's dynamic loader. Yocto/OpenEmbedded "uninative" cross toolchains ship a relocated glibc whose loader has a built-in search path pointing at its own sysroot. For those binaries `ldd` reports host paths (e.g. /usr/lib/libm.so.6), yet inside the build sandbox the relocated loader searches its own sysroot lib dir, where those libraries were never packaged. The remote compile then dies with "libm.so.6: cannot open shared object file" and silently falls back to local compilation, so distribution never actually runs.
When a packaged executable's PT_INTERP lives outside the standard host loader directories, also bundle the interpreter's own directory. That directory holds the libc/libm the relocated loader resolves against, so they land at the absolute path the loader searches inside the sandbox. Standard host toolchains are untouched: their interpreter is under /lib, /lib64, or /usr/lib, so the existing ldd-only path is preserved.
Signed-off-by: Javier Tia <javier@peridio.com>
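The decision described above could look something like the sketch below. It assumes the PT_INTERP path has already been read from the ELF header; the helper name `extra_libdir_for_interp` and the constant `HOST_LOADER_DIRS` are hypothetical, and the directory list is simply the three prefixes named in the commit message.

```rust
use std::path::{Path, PathBuf};

/// Loader directories of a standard host toolchain, per the commit message.
const HOST_LOADER_DIRS: [&str; 3] = ["/lib", "/lib64", "/usr/lib"];

/// Hypothetical helper: given an executable's PT_INTERP path, return the
/// directory that should additionally be bundled, or None when the
/// interpreter is a standard host loader and ldd output is sufficient.
fn extra_libdir_for_interp(interp: &Path) -> Option<PathBuf> {
    // A relative interpreter path gives no absolute directory to bundle.
    if !interp.is_absolute() {
        return None;
    }
    // Path::starts_with compares whole components, so "/lib64/..." does
    // not match the "/lib" prefix by accident.
    let is_host_loader = HOST_LOADER_DIRS.iter().any(|dir| interp.starts_with(dir));
    if is_host_loader {
        return None;
    }
    // Relocated loader: its own directory holds the libc/libm it resolves.
    interp.parent().map(Path::to_path_buf)
}
```

For a made-up uninative-style path such as `/opt/uninative/lib/ld-linux-x86-64.so.2` this yields `/opt/uninative/lib`, while `/lib64/ld-linux-x86-64.so.2` yields `None` and leaves the existing behaviour untouched.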
sccache-dist ships the preprocessed input through the inputs packager keyed on the absolute, simplified path cwd.join(input) (CInputsPackager), but the distributed compile command referenced the raw parsed_args.input. For out-of-tree builds the input is relative (e.g. OpenEmbedded's ../sources/foo.c), so the command and the packaged input disagreed: the build-server compiled a path the inputs were never placed at and failed with "cc1: fatal error: ... No such file or directory".
Use the same absolute, simplified path in the dist command so it matches the packaged input. An absolute input is unchanged, since cwd.join of an absolute path returns it verbatim.
Signed-off-by: Javier Tia <javier@peridio.com>
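A small sketch of the path computation both sides need to agree on. The function name `dist_input_path` is hypothetical, and the "simplification" here is assumed to be purely lexical (dropping `.` and resolving `..` without touching the filesystem), which may differ in detail from what sccache does.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: the path under which the input is packaged and
/// which the dist command should therefore reference.
fn dist_input_path(cwd: &Path, input: &Path) -> PathBuf {
    // Joining an absolute `input` replaces `cwd`, so it comes back verbatim.
    let joined = cwd.join(input);

    // Lexical simplification (an assumption): drop "." and fold "..".
    let mut simplified = PathBuf::new();
    for component in joined.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                simplified.pop();
            }
            other => simplified.push(other.as_os_str()),
        }
    }
    simplified
}
```

With a working directory of `/build/work/foo/build` and an input of `../sources/foo.c`, this produces `/build/work/foo/sources/foo.c`, whereas the raw relative input would not name any file that was shipped to the server.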
sccache logged the distribute-vs-local decision only at debug ("Compiling
locally", "Attempting distributed compilation"), while an infrastructure
fallback warned. At info a successful distribution and a local compile were
both silent, so the only visible dist signal was the failure path - leaving
no way to see the distribute/fallback ratio without the full debug firehose.
Diagnosing why a distributed build under-distributes meant guessing.
Promote the decision points to info and add a log on successful
distribution naming the server and exit code, so SCCACHE_LOG=info gives a
per-compile dist trace (attempt then distributed-on-server, compiled-locally,
or falling-back-with-reason). sccache only emits logs when SCCACHE_LOG is
set, so default runs are unaffected.
Signed-off-by: Javier Tia <javier@peridio.com>
A distributed compile the build-server rejects is often a distribution artifact, not a genuine compiler error: an object that .incbin's a binary the inputs packager does not ship - the kernel's vdso, embedded-config, and dtb wrappers - cannot be assembled remotely and returns non-zero, which failed the whole build with no recourse and forced the kernel to be excluded wholesale.
Treat a non-zero remote result as a fallback trigger rather than a terminal error: recompile locally, which either succeeds (confirming a dist-only artifact) or reproduces the genuine error. A remote failure can no longer break a build that would compile locally. Only failing dist compiles are affected - successful ones are returned unchanged - so the kernel now distributes (1124/1128 compiles), with its handful of .incbin objects falling back to local.
Signed-off-by: Javier Tia <javier@peridio.com>
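The control flow this describes is roughly the following. It is only a sketch under simplifying assumptions: compiles are modelled as closures returning an exit code, and `RanOn`, `CompileResult`, and `compile_with_fallback` are hypothetical names, not sccache types.

```rust
/// Where the returned result was produced (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RanOn {
    Remote,
    Local,
}

/// Hypothetical, stripped-down compile result.
#[derive(Debug, PartialEq)]
struct CompileResult {
    exit_code: i32,
    ran_on: RanOn,
}

/// Run the remote compile first; treat a non-zero remote exit code as a
/// trigger to recompile locally rather than as the final answer.
fn compile_with_fallback<R, L>(remote: R, local: L) -> CompileResult
where
    R: FnOnce() -> i32,
    L: FnOnce() -> i32,
{
    let remote_code = remote();
    if remote_code == 0 {
        // Successful distributed compiles are returned unchanged.
        return CompileResult { exit_code: 0, ran_on: RanOn::Remote };
    }
    // The local run either succeeds (the failure was a dist-only artifact)
    // or reproduces the genuine compiler error.
    CompileResult { exit_code: local(), ran_on: RanOn::Local }
}
```

The cost of this design is one extra local compile for every genuinely failing translation unit; in exchange, the reported error always comes from a local run.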
AJIOB reviewed Jun 24, 2026
// either succeeds (confirming a dist-only artifact) or reproduces the
// real error locally, so a remote failure never breaks a build that
// would compile fine locally. This only affects failing dist
// compiles; successful ones are returned unchanged above.
Contributor
This looks like a fix for #2700.
IMO, please, do a dedicated PR with that fix & test it to mitigate possible future regressions
Collaborator
Contributor
Maybe yes. IMO, your code part should resolve that problem
Collaborator
I actually haven't written any code in that area (yet, at least)
sylvestre reviewed Jun 24, 2026
Some(dc) => dc,
None => {
    debug!("[{}]: Compiling locally", out_pretty);
    info!(
Collaborator
should be done in a different pr
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2750      +/-   ##
==========================================
+ Coverage   74.63%   74.70%   +0.06%
==========================================
  Files          70       70
  Lines       39898    40015     +117
==========================================
+ Hits        29778    29892     +114
- Misses      10120    10123       +3
This series makes sccache --dist usable end-to-end for OpenEmbedded/Yocto
(BitBake) builds, where it previously panicked, failed, or silently fell back
to local on several distinct issues. Validated by building a qemuarm64
core-image-minimal entirely through a single-node sccache-dist cluster:
userspace do_compile distributes (e.g. busybox 509/509) and the kernel
distributes 1124/1128 compiles.
It combines and supersedes #2746 and #2747 into one OE/BitBake-support series.
Don't panic when a finished compile has neither code nor signal (#2746).
sccache-dist synthesizes an ExitStatus with neither exit code nor signal for
some abnormal remote compiles; get_signal panicked on .expect("must have
signal"). Return Option and handle the both-absent case.
Bundle the relocated interpreter's libdir into the dist toolchain package
(#2747). OE cross-toolchains use an absolute-path uninative loader whose
libc/libm live in the uninative sysroot, not the host paths ldd reports;
bundle the interpreter's libdir so the toolchain resolves in the sandbox.
Fix distributed compile of relative input paths. The dist command used the
raw input path while the inputs packager shipped the preprocessed content at
the absolute, simplified path, so out-of-tree builds (../sources/foo.c)
failed with "No such file or directory" on the server. Use the same path in both.
Log distributed-compile decisions at info level. The distribute-vs-local
decision was debug-only and only failures warned; log it at info so
SCCACHE_LOG=info shows the distribute/fallback breakdown. Emitted only when
SCCACHE_LOG is set.
Fall back to local on distributed-compile failure. A remote failure is often
a distribution artifact (e.g. a kernel object that .incbin's a binary the
packager cannot ship) rather than a real compiler error; recompile locally
so a dist failure never breaks a build that compiles fine locally.
Draft: validated single-node so far; marking ready after a two-machine cluster
run confirms it.