
Make distributed compilation work for OpenEmbedded/Yocto builds#2750

Draft
jetm wants to merge 5 commits into mozilla:main from jetm:oe-bitbake-dist-support

@jetm jetm commented Jun 23, 2026

This series makes sccache --dist usable end-to-end for OpenEmbedded/Yocto
(BitBake) builds, where it previously panicked, failed, or silently fell back
to local compilation because of several distinct issues. Validated by building a qemuarm64
core-image-minimal entirely through a single-node sccache-dist cluster:
userspace do_compile distributes (e.g. busybox 509/509) and the kernel
distributes 1124/1128 compiles.

It combines and supersedes #2746 and #2747 into one OE/BitBake-support series.

  1. Don't panic when a finished compile has neither code nor signal (#2746).
    sccache-dist synthesizes an ExitStatus with neither exit code nor signal for
    some abnormal remote compiles; get_signal panicked on .expect("must have
    signal"). Return Option and handle the both-absent case.

  2. Bundle the relocated interpreter's libdir into the dist toolchain package
    (#2747). OE cross-toolchains use an absolute-path uninative loader whose
    libc/libm live in the uninative sysroot, not the host paths ldd reports;
    bundle the interpreter's libdir so the toolchain resolves in the sandbox.

  3. Fix distributed compile of relative input paths. The dist command used the
    raw input path while the inputs packager shipped the preprocessed content at
    the absolute, simplified path, so out-of-tree builds (../sources/foo.c)
    failed with "No such file or directory" on the server. Use the same path in both.

  4. Log distributed-compile decisions at info level. The distribute-vs-local
    decision was debug-only and only failures warned; log it at info so
    SCCACHE_LOG=info shows the distribute/fallback breakdown. Emitted only when
    SCCACHE_LOG is set.

  5. Fall back to local on distributed-compile failure. A remote failure is often
    a distribution artifact (e.g. a kernel object that .incbin's a binary the
    packager cannot ship) rather than a real compiler error; recompile locally
    so a dist failure never breaks a build that compiles fine locally.

Draft: validated single-node so far; marking ready after a two-machine cluster
run confirms it.

jetm added 5 commits June 23, 2026 13:29
get_signal did `status.signal().expect("must have signal")`, assuming the
Unix invariant that an ExitStatus with no exit code was terminated by a
signal. That does not always hold: an ExitStatus reconstructed for a
distributed compile (or an abnormal wait status such as WIFSTOPPED) can
report neither a code nor a signal. When that happened the expect() panicked
the compile task; the server surfaced this as a misleading "Failed to bind
socket" error and, under load, compiles repeatedly fell back to local compilation.

Return Option<i32> from get_signal and assign it straight into res.signal, so
a compile that reports neither code nor signal leaves res.signal unset
instead of crashing the in-flight task. The Windows arm returns None rather
than panicking; ExitStatus::code() is always Some there, so the signal branch
is never reached anyway.

Add a unit test covering a real terminating signal (SIGKILL) and the
neither-code-nor-signal case (WIFSTOPPED via from_raw), which previously
panicked.

Signed-off-by: Javier Tia <javier@peridio.com>
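
To make the both-absent case concrete, here is a minimal Unix-only sketch of the behaviour this commit message describes. It is not the code from the PR: `CompileOutcome` and `outcome_from_status` are hypothetical names introduced for illustration, and only `get_signal` and its `Option<i32>` return type come from the message above. The raw wait statuses in the usage note assume the Linux encoding.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Hypothetical stand-in for the result fields mentioned above
/// (an exit code and a signal, either of which may be absent).
#[derive(Debug, Default, PartialEq)]
struct CompileOutcome {
    retcode: Option<i32>,
    signal: Option<i32>,
}

/// Returns the terminating signal if there is one, instead of
/// calling `.expect("must have signal")` and panicking.
fn get_signal(status: ExitStatus) -> Option<i32> {
    status.signal()
}

/// Fills in whichever of code/signal the status actually reports.
/// If it reports neither, both fields simply stay `None`.
fn outcome_from_status(status: ExitStatus) -> CompileOutcome {
    let mut res = CompileOutcome::default();
    match status.code() {
        Some(code) => res.retcode = Some(code),
        None => res.signal = get_signal(status),
    }
    res
}
```

On Linux, `ExitStatus::from_raw(9)` describes a process killed by SIGKILL, while `ExitStatus::from_raw(0x137f)` is a stopped (WIFSTOPPED) status for which both `code()` and `signal()` return `None`. With the sketch above the second case yields an outcome with both fields unset rather than a panic.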
sccache-dist packages a toolchain's shared libraries by parsing `ldd`
output, which resolves NEEDED libraries against the host's dynamic
loader. Yocto/OpenEmbedded "uninative" cross toolchains ship a relocated
glibc whose loader has a built-in search path pointing at its own
sysroot. For those binaries `ldd` reports host paths (e.g.
/usr/lib/libm.so.6), yet inside the build sandbox the relocated loader
searches its own sysroot lib dir, where those libraries were never
packaged. The remote compile then dies with "libm.so.6: cannot open
shared object file" and silently falls back to local compilation, so
distribution never actually runs.

When a packaged executable's PT_INTERP lives outside the standard host
loader directories, also bundle the interpreter's own directory. That
directory holds the libc/libm the relocated loader resolves against, so
they land at the absolute path the loader searches inside the sandbox.
Standard host toolchains are untouched: their interpreter is under /lib,
/lib64, or /usr/lib, so the existing ldd-only path is preserved.

Signed-off-by: Javier Tia <javier@peridio.com>
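
The decision rule in this commit message can be sketched as a pure path check. This is an illustration under assumptions, not the PR's implementation: the function name `relocated_interp_libdir` is hypothetical, the interpreter path is assumed to have already been read from the executable's PT_INTERP, and the list of standard directories is taken literally from the message (/lib, /lib64, /usr/lib).

```rust
use std::path::{Path, PathBuf};

// Assumption: exactly the three host loader directories named in the
// commit message; the real list may differ.
const HOST_LOADER_DIRS: [&str; 3] = ["/lib", "/lib64", "/usr/lib"];

/// Given the PT_INTERP path of a packaged executable, returns the
/// directory that should additionally be bundled, or `None` when the
/// interpreter is a standard host loader and ldd output is enough.
fn relocated_interp_libdir(interp: &Path) -> Option<PathBuf> {
    if !interp.is_absolute() {
        return None;
    }
    // `Path::starts_with` compares whole components, so "/lib64/..."
    // matches "/lib64" but not "/lib".
    if HOST_LOADER_DIRS.iter().any(|dir| interp.starts_with(dir)) {
        return None;
    }
    interp.parent().map(Path::to_path_buf)
}
```

For a host toolchain whose interpreter is `/lib64/ld-linux-x86-64.so.2` the function returns `None`. For a relocated loader at a made-up path such as `/opt/uninative/lib/ld-linux-x86-64.so.2` it returns `/opt/uninative/lib`, the directory that would be packaged at its absolute path.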
sccache-dist ships the preprocessed input through the inputs packager
keyed on the absolute, simplified path cwd.join(input) (CInputsPackager),
but the distributed compile command referenced the raw parsed_args.input.
For out-of-tree builds the input is relative (e.g. OpenEmbedded's
../sources/foo.c), so the command and the packaged input disagreed and
the build-server compiled a path the inputs were never placed at, failing
with "cc1: fatal error: ... No such file or directory".

Transform the same absolute, simplified path in the dist command so it
matches the packaged input. An absolute input is unchanged, since
cwd.join of an absolute path returns it verbatim.

Signed-off-by: Javier Tia <javier@peridio.com>
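
The path agreement described here can be illustrated with a short sketch. The helper names `simplify` and `dist_input_path` are hypothetical, and the lexical normalisation below is an assumption standing in for whatever simplification sccache actually applies; the point is only that the command and the packaged input derive their path from the same `cwd.join(input)` expression.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically removes `.` and resolves `..` without touching the
/// filesystem. Assumed stand-in for the packager's simplification.
fn simplify(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for component in path.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                // Popping at the root is a no-op, so "/.." stays "/".
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Path to use both when packaging the input and when building the
/// distributed compile command.
fn dist_input_path(cwd: &Path, input: &Path) -> PathBuf {
    // `join` returns `input` unchanged when it is already absolute.
    simplify(&cwd.join(input))
}
```

With a made-up working directory `/work/build`, the relative input `../sources/foo.c` becomes `/work/sources/foo.c`, while an absolute input such as `/abs/foo.c` comes back unchanged.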
sccache logged the distribute-vs-local decision only at debug ("Compiling
locally", "Attempting distributed compilation"), while an infrastructure
fallback warned. At info a successful distribution and a local compile were
both silent, so the only visible dist signal was the failure path - leaving
no way to see the distribute/fallback ratio without the full debug firehose.
Diagnosing why a distributed build under-distributes meant guessing.

Promote the decision points to info and add a log on successful
distribution naming the server and exit code, so SCCACHE_LOG=info gives a
per-compile dist trace (attempt then distributed-on-server, compiled-locally,
or falling-back-with-reason). sccache only emits logs when SCCACHE_LOG is
set, so default runs are unaffected.

Signed-off-by: Javier Tia <javier@peridio.com>
A distributed compile the build-server rejects is often a distribution
artifact, not a genuine compiler error: an object that .incbin's a binary the
inputs packager does not ship - the kernel's vdso, embedded-config, and dtb
wrappers - cannot be assembled remotely and returns non-zero, which failed the
whole build with no recourse and forced the kernel to be excluded wholesale.

Treat a non-zero remote result as a fallback trigger rather than a terminal
error: recompile locally, which either succeeds (confirming a dist-only
artifact) or reproduces the genuine error. A remote failure can no longer
break a build that would compile locally. Only failing dist compiles are
affected - successful ones are returned unchanged - so the kernel now
distributes (1124/1128 compiles), with its handful of .incbin objects falling
back to local.

Signed-off-by: Javier Tia <javier@peridio.com>
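
The control flow this commit message describes can be sketched independently of sccache's types. Everything named here (`CompiledBy`, `compile_with_fallback`, closures returning a bare exit code) is hypothetical and chosen for illustration; the actual change lives in src/compiler/compiler.rs and operates on sccache's own result types.

```rust
/// Which path produced the final result.
#[derive(Debug, PartialEq)]
enum CompiledBy {
    Remote,
    LocalFallback,
}

/// Runs the remote compile first. A zero exit code is returned
/// unchanged; any non-zero result triggers a local recompile, whose
/// exit code becomes the final answer.
fn compile_with_fallback<R, L>(remote: R, local: L) -> (CompiledBy, i32)
where
    R: FnOnce() -> i32,
    L: FnOnce() -> i32,
{
    let remote_code = remote();
    if remote_code == 0 {
        return (CompiledBy::Remote, 0);
    }
    // Either succeeds (the remote failure was a distribution artifact)
    // or reproduces the genuine compiler error locally.
    (CompiledBy::LocalFallback, local())
}
```

A successful remote compile never invokes the local closure. A remote failure followed by a local success yields `(CompiledBy::LocalFallback, 0)`, and a genuine compiler error still surfaces as the non-zero local exit code.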
Comment thread src/compiler/compiler.rs
// either succeeds (confirming a dist-only artifact) or reproduces the
// real error locally, so a remote failure never breaks a build that
// would compile fine locally. This only affects failing dist
// compiles; successful ones are returned unchanged above.

Contributor

This looks like a fix for #2700.

IMO, please, do a dedicated PR with that fix & test it to mitigate possible future regressions

Contributor

And it implements #2745, doesn't it?

Collaborator

To implement #2745, it probably(?) makes sense to wait until #2735 is merged, then reuse some of its code.

Contributor

Maybe yes. IMO, your code part should resolve that problem

Collaborator

I actually haven't written any code in that area (yet, at least)

Comment thread src/compiler/compiler.rs
Some(dc) => dc,
None => {
debug!("[{}]: Compiling locally", out_pretty);
info!(

Collaborator

This should be done in a different PR.

codecov-commenter commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 81.88406% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.70%. Comparing base (b799af2) to head (425db1a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/server.rs | 47.61% | 11 Missing ⚠️ |
| src/dist/pkg.rs | 82.97% | 8 Missing ⚠️ |
| src/compiler/compiler.rs | 61.53% | 5 Missing ⚠️ |
| src/compiler/gcc.rs | 98.24% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2750      +/-   ##
==========================================
+ Coverage   74.63%   74.70%   +0.06%     
==========================================
  Files          70       70              
  Lines       39898    40015     +117     
==========================================
+ Hits        29778    29892     +114     
- Misses      10120    10123       +3     

