
asan again (ignore for the moment, just checking CI)#1529

Draft
daniel-noland wants to merge 58 commits into main from
pr/daniel-noland/asan-again

Conversation

@daniel-noland
Collaborator

No description provided.

daniel-noland and others added 30 commits May 11, 2026 08:47
The test spawned two worker threads but never joined them, so any panic
in a worker was silently dropped instead of failing the test. Joining
the handles surfaces the second thread's panic: it tried to allocate
NAT for 2.0.1.3, an IP belonging to an expose without stateful NAT
configured, so no pool exists and the allocator returns Denied. Switch
the second source to 1.2.0.0 (in expose1, which has stateful NAT) so
each thread hits a distinct pool and the test does what it claims to.
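
The failure mode above can be sketched in plain std (a minimal illustration, not the test itself): a panic in a spawned worker only surfaces if the handle is joined.

```rust
use std::thread;

fn main() {
    // A panic in a spawned thread does not propagate to the spawner by
    // itself; it is captured inside the JoinHandle's result.
    let worker = thread::spawn(|| {
        panic!("allocation denied");
    });

    // Without this join, a test harness never observes the panic and the
    // test passes vacuously. Joining converts the panic into an Err that
    // can be asserted on (or re-raised with unwrap()).
    let result = worker.join();
    assert!(result.is_err(), "worker panic must surface at the join point");
}
```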

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lift the concurrent_allocations test body out of tests_shuttle into a
sibling tests module so it can be exercised under different shuttle
schedulers without duplication. Extract shuttle_config() and add
run_shuttle_pct, then run the same fixture under both RandomScheduler
(test_concurrent_allocations_shuttle_random) and PctScheduler
(test_concurrent_allocations_shuttle_pct). PCT explores schedules with
bounded preemptions, complementing random's coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
UnicastIpv4Addr::new rejects both multicast and broadcast addresses,
so the InvalidSourceAddr fuzz-test branch can legitimately observe a
broadcast source. Update the assertion to allow either, and expand
the error text and doc comment to describe both rejections accurately.
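
A minimal sketch of the two rejection cases, using std's `Ipv4Addr` predicates (the `unicast_checked` name is illustrative, not the real `UnicastIpv4Addr::new`):

```rust
use std::net::Ipv4Addr;

fn main() {
    // Hypothetical stand-in for UnicastIpv4Addr::new: reject a source
    // that is multicast OR the limited-broadcast address -- the two
    // cases the fuzz assertion must now allow.
    fn unicast_checked(ip: Ipv4Addr) -> Result<Ipv4Addr, String> {
        if ip.is_multicast() {
            Err(format!("{ip} is multicast, not unicast"))
        } else if ip.is_broadcast() {
            Err(format!("{ip} is broadcast, not unicast"))
        } else {
            Ok(ip)
        }
    }

    assert!(unicast_checked(Ipv4Addr::new(224, 0, 0, 1)).is_err()); // multicast
    assert!(unicast_checked(Ipv4Addr::new(255, 255, 255, 255)).is_err()); // broadcast
    assert!(unicast_checked(Ipv4Addr::new(1, 2, 3, 4)).is_ok());
}
```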

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Allows PciEbdf to be embedded in rkyv-archived types (notably
args::PortArg::PCI), which a later commit migrates from the
hardware crate's structured PciAddress to this string-based form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Two bugs in the same generator caused PciEbdf::try_new to reject
every value the bolero TypeGenerator produced:

  - The separator between bus and device was '.', but EBDF format
    is domain:bus:device.function (e.g. 0000:01:00.0). Generated
    strings looked like 0000:01.00.00 and were rejected on layout.
  - The function field was formatted with `{function:02x}`, always
    two hex digits, while try_new requires exactly one (PCI
    function is a 3-bit field, 0..=7). Even after the separator
    fix, the function width alone would have kept every generated
    string rejected.

Switch to the correct separator and mask function to 0x07 before
formatting it as a single hex digit, so the generator covers the
parser's accepted input space.
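
The two format-string bugs can be shown side by side (a sketch of the shapes described above; the real generator is a bolero `TypeGenerator`):

```rust
fn main() {
    let (domain, bus, device, function): (u16, u8, u8, u8) = (0x0000, 0x01, 0x00, 0x0a);

    // Buggy shape: '.' between bus and device, and a two-digit function.
    let broken = format!("{domain:04x}:{bus:02x}.{device:02x}.{function:02x}");
    assert_eq!(broken, "0000:01.00.0a"); // rejected on layout by try_new

    // Fixed shape: domain:bus:device.function, with function masked to
    // its 3-bit range and printed as a single hex digit.
    let fixed = format!("{domain:04x}:{bus:02x}:{device:02x}.{:01x}", function & 0x07);
    assert_eq!(fixed, "0000:01:00.2");
}
```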

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Decouple args, config, and k8s-intf from the hardware crate by
swapping the structured PciAddress newtype for the string-based
PciEbdf already exposed by net::pci. PciEbdf::try_new validates the
EBDF format on construction so call sites remain checked at the API
boundary; downstream consumers continue to round-trip through
to_string().

This removes hardware as a build-time dependency from these crates,
which is needed so that they (and their dependents) can be exercised
under miri -- hardware pulls in OS-specific machinery (sysfs,
hwlocality, dpdk-sysroot-helper) that miri can't run.

Dropping the hardware dep also severs an implicit chain that
enabled tracing/attributes for args via workspace-wide feature
unification (args -> hardware -> sysfs -> tracing/attributes), so
declare tracing/attributes explicitly on args/Cargo.toml since
args/src/lib.rs uses #[instrument]. Without this, `just test-each`
(which builds each package in isolation, bypassing workspace
unification) would fail to compile args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Both crates ship code (FFI, asm, custom unwind) that miri can't
execute on the powerpc64 target we use for memory-model coverage.
Move the loom and shuttle entries of flow-entry, nat, and quiescent
into [target.'cfg(not(miri))'.dependencies] / dev-dependencies so
the workspace stays buildable under miri without dropping those
crates from the run.

Add a compile_error! in concurrency/src/lib.rs to surface a clear
diagnostic if someone explicitly enables shuttle or loom while
running under miri (the metadata-driven exclude path in miri.just
already prevents this combination, but the explicit error gives a
better failure mode for ad-hoc invocations). The concurrency crate
itself does not move its loom/shuttle dependencies here -- they
remain optional in the regular [dependencies] section, gated only
by the feature flags.
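
The guard has roughly this shape (an assumed sketch, not the exact source): it is inert in ordinary builds and only fires when miri and either feature coincide.

```rust
// Fail the build loudly if loom or shuttle is enabled while compiling
// under miri; this combination cannot execute and should not get as far
// as a confusing runtime failure. (Illustrative message text.)
#[cfg(all(miri, any(feature = "loom", feature = "shuttle")))]
compile_error!("loom/shuttle cannot run under miri; run them natively");

fn main() {
    // Compiles to nothing in a normal (non-miri) build.
    println!("guard inactive");
}
```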

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a hidden _strict_provenance feature that selects the Mutex<Arc<T>>
implementation of Slot even outside loom/shuttle. Used by the miri job
to exercise this crate under -Zmiri-strict-provenance, which fails
against arc-swap because its hazard-pointer machinery doesn't yet
expose provenance information in the standard form. The fallback
implementation gives us provenance coverage on the QSBR protocol
without changing observable behavior.
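
The fallback's shape is roughly this (a hedged sketch -- the real `Slot` and its API live in the quiescent crate; only the `Mutex<Arc<T>>` idea is taken from the message above):

```rust
use std::sync::{Arc, Mutex};

// Fallback Slot: same observable load/store semantics as the arc-swap
// fast path, but built only from std pointers, which strict-provenance
// miri can reason about.
struct Slot<T>(Mutex<Arc<T>>);

impl<T> Slot<T> {
    fn new(value: T) -> Self {
        Slot(Mutex::new(Arc::new(value)))
    }
    // Readers clone the Arc under the lock; no hazard pointers involved.
    fn load(&self) -> Arc<T> {
        Arc::clone(&self.0.lock().unwrap())
    }
    // Writers swap in a new Arc; earlier readers keep their clones alive.
    fn store(&self, value: T) {
        *self.0.lock().unwrap() = Arc::new(value);
    }
}

fn main() {
    let slot = Slot::new(1u32);
    let before = slot.load();
    slot.store(2);
    assert_eq!(*before, 1); // earlier readers still see the old value
    assert_eq!(*slot.load(), 2);
}
```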

Collapse the two cfg-gated mod imp { ... } blocks into a single
cfg_select! at module scope so adding a new selector only requires
extending the predicate, not introducing a third nested module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
These tests interact with resources miri's interpreter does not
provide:

- cli::cliproto::test_communications and routing::router::rio's two
  tests bind Unix domain sockets at /tmp paths.
- routing::frr::test::test_fake_frr_agent stands up a fake FRR agent
  over Unix sockets.
- routing::atable::resolver::test_adjacency_resolver reads
  /proc/net/arp and queries kernel interfaces.
- k8s_less::local::test_kubeless drives an inotify watcher; miri
  cannot emulate inotify even with isolation disabled.
- flow_entry::flow_table::table::test_flow_table_flow_invalidation
  mixes tokio::time::Instant::from_std on a wall-clock std deadline
  with a tokio::time::sleep call counted in real seconds, so a 4 s
  sleep under miri's slow interpreter elapses past the "extended"
  flow's deadline.

For nat::stateful::test::test_full_config_unidirectional_nat_overlapping_destination
the only miri-incompatible piece is the tracectl::get_trace_ctl()
setup -- linkme's distributed_slice uses link_section, which miri
cannot load. Skip just that block under cfg(not(miri)) instead of
ignoring the whole test; the tracing macros at use sites are no-ops
when no targets are registered, so the rest of the test runs fine
on the interpreter.
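
The gating pattern looks roughly like this (names are illustrative; `init_tracing` stands in for the `tracectl::get_trace_ctl()` setup):

```rust
fn main() {
    // Gate only the miri-incompatible setup, not the whole test.
    #[cfg(not(miri))]
    {
        init_tracing();
    }

    // The rest of the test body runs under miri and natively alike;
    // tracing macros are no-ops when no targets are registered.
    assert_eq!(2 + 2, 4);
}

#[cfg(not(miri))]
fn init_tracing() {
    // stands in for linkme/link_section machinery miri cannot load
}
```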

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
miri's interpreter is roughly 100x slower than native execution, so
property tests and concurrency stress tests that run thousands of
iterations are impractical to run unmodified. Use cfg_select! to
shrink loop counts and iteration budgets when miri is the active
target, keeping the same coverage shape but with budgets that finish
in seconds rather than hours:

- flow-entry::flow_table::nf_lookup: 1000 -> 10 packets
- net::packet::hash::test_hash_bounds: 2000 -> 20 packets
- routing::fib::test_concurrency_fib*: 100_000 -> 50 packets
- routing::fib::test_*_table iterations: 1000/5_000 -> 50
- left-right-tlcache::test_readhandle_cache: 1000 -> 10 handles
- k8s-intf::bolero::support: bolero iterations 1000 -> 3 and drop
  the largest UNIQUE_COUNTS entry (100); also skip
  test_unique_v6_cidr_generator under miri (still too slow even at 3
  iterations because the mask range is 0..=128).
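
A simpler stable spelling of the same budget shrink (the tree itself uses `cfg_select!`, per the message above) picks the iteration count at compile time:

```rust
// Full budget natively; a budget that finishes in seconds under miri's
// ~100x-slower interpreter.
const PACKETS: usize = if cfg!(miri) { 10 } else { 1000 };

fn main() {
    let mut processed = 0;
    for _ in 0..PACKETS {
        processed += 1; // stands in for generating and checking one packet
    }
    assert_eq!(processed, PACKETS);
}
```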

Also gate serial_test in left-right-tlcache to cfg(not(miri)) since
the crate's serial coordination uses OS-level locking that miri
doesn't model meaningfully and would only add overhead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Threads a `nightly` argument through default.nix into the LLVM
overlay so that callers can request the rust nightly channel with
the miri component installed. When `nightly = "true"` the
rust-toolchain pulls "nightly" with miri added to its component
list; otherwise the pinned stable channel is used unchanged. Also
sort the components alphabetically so future additions diff
cleanly.

Disable cargo-llvm-cov's test suite while we're in the overlay
file; its tests fail on the nightly channel for reasons unrelated
to this project, and we never use them anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wires up `just miri::test` (and `just miri` via [default]) which:
- Drops into a nix-shell with the nightly+miri toolchain active.
- Sets MIRIFLAGS for the desired interpreter knobs (compare-exchange
  weak failure rate, isolation, many-seeds sweep, preemption rate,
  symbolic alignment check, provenance mode, optional stacked-borrows
  disable).
- Optionally sets RUSTFLAGS for layout randomization.
- Invokes `cargo miri nextest run --profile=miri --target=$target`
  for the configured CPU (default powerpc64 -- weak memory model and
  big-endian, ideal for surfacing concurrency and endianness bugs).

The runner consults a workspace.metadata.package table in the root
Cargo.toml that lists which packages are not buildable under miri
(dataplane, dpdk*, hardware, init, interface-manager, mgmt, sysfs,
tracectl). When invoked with no package selector the runner expands
those into --exclude= flags via tomlq so `--workspace` does the
right thing without an extra argument list to maintain.

Threads the additional --argstr flags through the existing
setup-roots and shell recipes so they accept the same surface as
build, and adds a `miri` profile to nextest with relaxed thread
budget and 500s slow-timeout (miri tests are heavy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Run `just miri::test` on devfile changes, version tags, and manual
dispatch. The job is gated to powerpc64 only -- weak memory model
and big-endian, catches the most concurrency/endianness bugs --
with other CPU rows commented out for targeted debugging.

The aggregator's check emits ::warning:: rather than ::error:: + exit 1,
so a miri failure surfaces in the run summary but does not block the
PR's overall status. max-parallel is set to 1 because miri is
memory-hungry.

Three steps per matrix entry:
1. The full workspace under miri.
2. dataplane-quiescent under permissive provenance with extra random
   seeds. This is the configuration that exercises the real arc-swap
   implementation, complementing what shuttle/loom can model.
3. dataplane-quiescent under strict provenance via the
   _strict_provenance feature, which switches Slot to the Mutex
   fallback so provenance checks can run cleanly.

Also flatten the existing `features` matrix into pure `include` rows
that each carry their own `build:` reference. The previous form
combined a `build:` axis with a `features:` axis under cross-product
semantics, then added `test_package` via include; the flattened form
is more legible without changing what gets executed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The previous assertion sampled Instant::now() at assertion time and
compared the remaining lifetime against the configured timeout minus
a 2-second tolerance. That meant the test was implicitly asserting
"the framework finishes the rest of the function in under 2 seconds",
which is true natively but false under miri's interpreter.

Switch to a wall-clock-independent form: snapshot Instant::now()
immediately before the process_packet call that is expected to
extend the timer, then assert that expires_at >= snapshot + timeout.
reset_expiry_unchecked sets expires_at = Instant::now() + duration
internally, so this lower bound holds whenever the timer was
extended at all -- and a regression that fails to extend it would
violate the bound regardless of how slow execution is.

Apply the same pattern to the second assertion (against
DEFAULT_ESTABLISHED_TOUT_UDP) since it had the same shape with a
larger 5-second tolerance.
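
The before/after shapes can be sketched with std `Instant` (the `reset_expiry` closure is an illustrative stand-in for `reset_expiry_unchecked`):

```rust
use std::time::{Duration, Instant};

fn main() {
    let timeout = Duration::from_secs(30);

    // Stand-in for reset_expiry_unchecked, which internally sets
    // expires_at = Instant::now() + duration.
    let reset_expiry = |d: Duration| Instant::now() + d;

    // Old shape (fragile): sample now() *after* the call and require the
    // remaining lifetime to sit within a fixed tolerance -- implicitly
    // asserting "the test body runs fast", which miri violates.
    //
    // New shape: snapshot *before* the call, then assert a lower bound
    // that holds no matter how slowly execution proceeds.
    let snapshot = Instant::now();
    let expires_at = reset_expiry(timeout); // the call expected to extend the timer
    assert!(
        expires_at >= snapshot + timeout,
        "timer was not extended by at least the configured timeout"
    );
}
```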

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
PCI device is 5 bits and function is 3 bits on the wire, but the
PciEbdf parser only enforced syntax -- "0000:00:ff.f" would parse
successfully despite both fields being unrepresentable. Validate
the numeric ranges in try_new via new MAX_DEVICE (0x1f) and
MAX_FUNCTION (0x07) constants, and update the bolero generator to
mask both fields so generated values stay in range.
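
A hedged sketch of the range check (the real `PciEbdf::try_new` lives in `net::pci`; this reimplements only the device/function bounds for illustration):

```rust
const MAX_DEVICE: u8 = 0x1f;   // device is a 5-bit field on the wire
const MAX_FUNCTION: u8 = 0x07; // function is a 3-bit field on the wire

fn check_ebdf(s: &str) -> Result<(), String> {
    // Expect domain:bus:device.function, e.g. "0000:01:00.0".
    let (rest, function) = s.rsplit_once('.').ok_or("missing '.'")?;
    let mut parts = rest.split(':');
    let (_domain, _bus, device) = (
        parts.next().ok_or("missing domain")?,
        parts.next().ok_or("missing bus")?,
        parts.next().ok_or("missing device")?,
    );
    let device = u8::from_str_radix(device, 16).map_err(|e| e.to_string())?;
    let function = u8::from_str_radix(function, 16).map_err(|e| e.to_string())?;
    if device > MAX_DEVICE {
        return Err(format!("device {:#x} exceeds {:#x}", device, MAX_DEVICE));
    }
    if function > MAX_FUNCTION {
        return Err(format!("function {:#x} exceeds {:#x}", function, MAX_FUNCTION));
    }
    Ok(())
}

fn main() {
    assert!(check_ebdf("0000:01:00.0").is_ok());
    // Syntactically valid but numerically unrepresentable, as above.
    assert!(check_ebdf("0000:00:ff.f").is_err());
}
```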

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Three tweaks to the miri runner:

* nextest's `threads-required` for the miri profile was hardcoded
  to 8 in `.config/nextest.toml`. With `-Zmiri-many-seeds` packing
  $seeds seeds into a single test process, that fixed budget
  oversubscribes the box when $seeds exceeds 8 (and wastes
  capacity when it is smaller). Move the value out of the static
  config: at run time, mktemp a copy of the toml and tomlq in
  `threads-required = $seeds`, then pass `--config-file`. The
  checked-in miri profile no longer pins the value.

* `layout_seed` was hardcoded to "1", which meant every miri run
  exercised the same struct layout and let layout-sensitive bugs
  hide behind whatever ordering "1" happens to produce. Derive it
  from `git rev-parse HEAD` so each commit gets a different layout
  for free, with no manual seed bumping.

* Wrap `cargo miri` in `nice -n 19` so a 64-seed run stays out of
  the way of foreground work on the same machine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`test_flow_table_timeout` set a 2-second deadline against
`std::Instant::now()` and then `tokio::sleep`d 1s and 2s of wall
time. Under miri the interpreter is so much slower than wall
time that by the time the first 1s sleep returns, real time has
already elapsed past the 2s deadline, the per-flow timer task
has fired and removed the entry, and the "still present after 1
second" assertion fails. Increasing the seed count made this
reliably reachable rather than rare.

Switch to `#[tokio::test(start_paused = true)]` so the timer
task's `sleep_until` and the test's `sleep`s both run on tokio's
virtual clock, and capture `now` from `tokio::time::Instant`
instead of `std::Instant` so the deadline is anchored on the
same virtual baseline. With paused time a `std::Instant::now()`
in the test body still returns real wall time, which under miri
sits many real seconds past the paused virtual baseline; the
test's virtual sleeps would never advance virtual time far
enough to reach a deadline computed that way, and the entry
would still be present at the second assertion. Both ends of
the test now run on the same time base.

Add the `test-util` feature to the tokio dev-dep so the
`start_paused` attribute is available.

Same root cause as the existing `cfg_attr(miri, ignore)` on
`test_flow_table_flow_invalidation`; that test stays ignored
under miri for now since it exercises additional behaviour
(`extend_expiry`) that this fix does not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There actually isn't a rule that says the scheduler will always
give you a TCP or UDP or ICMP packet in a given run. For instance,
this fails under miri because the number of runs is much lower and
the random schedule is deliberately unfair.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There is no rule that the fuzzer will always give you both IPv4
and IPv6.  This tends to fail under miri because

1. the iteration count is far lower
2. the schedule is not fair (intentionally)

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Same wall-clock-vs-virtual-clock mismatch as test_flow_table_timeout.
flow_2's deadline is set to `Instant::now() + 2s + 1min`, then the
test sleeps 3 tokio seconds and asserts flow_1 expired (active_len
== 1). Under miri the entire test takes ~90s of wall time, so by
the time the assertion runs flow_2's 62-second std-Instant deadline
has elapsed too and its timer task has expired it as well, leaving
active_len == 0.

Switch to `#[tokio::test(start_paused = true)]` so the per-flow
timers' `sleep_until` and the test's `sleep` share tokio's virtual
clock, and capture `now` from `tokio::time::Instant` so flow_2's
extended deadline is anchored on the same virtual baseline that
the timer task evaluates against.

extend_expiry_unchecked uses fetch_add on expires_at (no
Instant::now() call internally), so the 1-minute extension stays
relative to the virtual baseline once `now` is virtual.
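
That last property can be sketched with an atomic deadline (an assumed representation -- nanoseconds in an `AtomicU64` -- used only to illustrate why `fetch_add` stays anchored to whatever baseline the stored value used):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // Deadline stored as nanoseconds past some baseline: here, 2s.
    let expires_at_nanos = AtomicU64::new(2_000_000_000);

    // fetch_add extends the deadline relative to the stored value, with
    // no Instant::now() call involved -- so a virtual-clock baseline
    // stays virtual after the extension.
    let one_minute = 60_000_000_000u64;
    expires_at_nanos.fetch_add(one_minute, Ordering::Relaxed);

    assert_eq!(expires_at_nanos.load(Ordering::Relaxed), 62_000_000_000);
}
```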

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Now that miri is run under nice there is little or no advantage
in delaying the run.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a Miri section to testing.md covering:

* what `just miri::test` is and how to run it (whole workspace, a
  specific test, with a larger seed fan-out);
* the configurable knobs the recipe exposes via `just miri::<name>=<value>`
  (cpu, seeds, schedule_seed, provenance, stacked_borrow_check,
  preemption_rate, weak_failure_rate, randomize_struct_layout,
  layout_seed) and their defaults;
* the workspace.metadata.package miri-exclusion list and how the
  runner threads it into `--exclude=` flags automatically;

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
From this commit forward, a miri failure is a regression on equal
footing with any other CI failure. The expectation is to fix the
underlying issue, not suppress the signal. If a future change
demonstrates that miri genuinely cannot exercise some piece of
code -- e.g. a syscall or hardware interaction the interpreter
does not model -- the resolution is to document that reason
explicitly:

  - (preferred) Gate the specific test or item with `#[cfg(miri)]`
    (or `#[cfg_attr(miri, ignore = "...")]`) and a comment naming
    the upstream limitation or unsupported operation.
  - (if you must) Add the package to workspace.metadata.package in
    the root Cargo.toml with `miri = false` and a comment citing the
    reason.

Note that some tests, or even whole crates, are just genuinely
hopeless under miri.  That's fine.  But if so, it should be
documented, and it should not be the first answer you reach for.

The "if you must" option should be reserved for situations where
miri just can't compile the crate.  For instance, the dpdk-sys crate
is well and truly hopeless (and pointless) under miri.  That's not
a flaw in the crate, and it can be exempted with a shrug.  The
hardware crate is in a similar situation.  Miri simply does not
have virtual hardware.

More of a middle ground is the router crate.  That largely passes
and works under miri, but has a few tests which use unix sockets.
Those won't work and that's fine.  Just ignore those tests under miri.
Exempting the whole crate from miri would not be a reasonable solution.
Reasons like "this test is flaky under miri" do not qualify (finding
flaky tests is part of the point of this effort).

More rarely, some tests take a genuinely excessive amount of time
under miri.  Even that is rarely a valid reason to exempt: more
often you really can just reduce an iteration count or similar as a
condition of the miri build.  Sometimes that is impractical, or
defeats the point of the test.  Exempting in that case is valid,
but it really should be rare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
crane was imported with no pkgs argument, falling back to its own
`import <nixpkgs> { }` default in `npins/crane/default.nix`.  That makes
`craneLib.buildPackage` / `mkCargoDerivation` build their derivations
against the *host* nixpkgs's stdenv, regardless of what stdenv the
project's `pkgs` (a `pkgsCross.<arch>`) provides.

Symptom under cross compile (e.g. `just platform=bluefield3 build-each`):
aws-lc-sys's build script invokes `cc` to assemble aarch64 `.S` files
because the surrounding `CC` env was the build-host x86_64 gcc, not the
target's `aarch64-unknown-linux-gnu-gcc`.  The link step rejects every
aarch64 mnemonic in `p521_jscalarmul.S` and friends:

    cargo:warning=Error: no such instruction: `ldp x4,x3,[x2,...`

Verified via `nix eval`: with `pkgs` passed in, the resulting derivation
has `stdenv.hostPlatform = aarch64-unknown-linux-gnu` and crane's
`mkCrossToolchainEnv` correctly exports `CC_aarch64_unknown_linux_gnu`
etc. to the cross toolchain so cc-rs picks it up.

No effect on native (libc=gnu, x86-64-v3) builds where buildPlatform
equals hostPlatform; the cross stdenv is the native stdenv in that
configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`is-cross-compile` was comparing `pkgs.stdenv'.hostPlatform` against
`ctarget`.  In a cross stdenv those are the *same* value (hostPlatform
is the target architecture, and ctarget is derived from
targetPlatform.rust.rustcTarget which equals hostPlatform.rust.rustcTarget
in a non-cross-compiler stdenv).  The check was therefore always false
under cross, and the `cxx` / `strip` / `objcopy` paths that depend on it
fell through to the non-cross-prefixed binary names.

Symptom under bluefield3 cross: the linker arg
`-Clinker=${pkgs.pkgsBuildHost.llvmPackages'.clang}/bin/clang++` resolved
to a wrapped-cross-clang directory that contains only the prefixed
`aarch64-unknown-linux-gnu-clang++`, not the bare `clang++`:

    error: linker `/nix/store/...-aarch64-unknown-linux-gnu-clang-wrapper-22.1.5/bin/clang++` not found

`buildPlatform` is the right comparison: cross-compile is defined by
"build host produces output for a different host", which is exactly
buildPlatform != hostPlatform.  Comparing buildPlatform to ctarget gives
true when actually cross-compiling (e.g. x86_64 -> aarch64) and false
for native and wasm (where build is the same as target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`pkgs.pkgsBuildHost` is the splice variant for tools that *run on*
the build host and *target* the host -- i.e. cross compilers.  Under
a cross `pkgs` (`pkgsCross.<arch>` for libc=musl or platform=bluefield3),
that means `pkgsBuildHost.pkg-config` resolves to
`x86_64-unknown-linux-musl-pkg-config-wrapper-0.29.2`, whose bin/
contains only the prefixed binary `x86_64-unknown-linux-musl-pkg-config`
-- no plain `pkg-config`.

Cargo build scripts (hwlocality-sys, etc.) shell out via `Command::new
("pkg-config")` and crash because the unprefixed binary isn't on PATH:

    error: failed to run custom build command for `hwlocality-sys`
    ... The pkg-config command could not be found.

Same trap applies to `clang` and other unprefixed helpers in the
dev-shell symlinkJoin.

`pkgsBuildBuild` is "runs on build, targets build" -- exactly what
native dev tooling wants.  Under a native pkgs (libc=gnu host arch)
the two splice variants are the same derivation, so this is a no-op
for the current default build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The libcap overlay forces `CC:=clang` via makeFlags, which overrides
the cross stdenv's wrapped CC (set to e.g.
`x86_64-unknown-linux-musl-clang` by the stdenv setup hooks).  An
unprefixed `clang` in a cross-stdenv build environment resolves to
the *unwrapped* clang binary, which has no `-isystem` paths set up
for the target sysroot:

    clang -O2 ... _makenames.c -o _makenames
    fatal error: 'stdio.h' file not found
    make[1]: *** [Makefile:83: _makenames] Error 1

For native gnu (`pkgsCross.gnu64`) the cc.targetPrefix is empty and
the wrapped clang is named just `clang`, so `CC:=clang` happens to
hit the wrapper.  For musl or aarch64, the wrapped clang has a
prefix and the bare `clang` is the unwrapped binary.

`${final.stdenv'.cc.targetPrefix}clang` resolves to the right wrapper
in all configurations: `clang` natively, `x86_64-unknown-linux-musl-clang`
under musl, `aarch64-unknown-linux-gnu-clang` under bluefield3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
libcap's Makefile compiles a helper binary `_makenames` and *executes*
it during the build to generate `cap_names.h`.  Make.Rules defaults
`BUILD_CC ?= $(CC)`, so without an explicit BUILD_CC the helper gets
built with whatever CC is, which under cross stdenv is the wrapped
cross-clang.  For same-arch cross (x86_64 -> x86_64-musl) the produced
binary happens to run on the build host; for cross-arch it does not:

    ./_makenames > cap_names.h
    /bin/bash: ./_makenames: cannot execute binary file: Exec format error
    make[1]: *** [Makefile:86: cap_names.h] Error 126

`pkgsBuildBuild.llvmPackages'.clang` is the clang that runs on, and
targets, the build host -- exactly what BUILD_CC should be regardless
of the target architecture.  Pinning the absolute path avoids the
ambiguity around which `clang` PATH resolves to during a libcap build.

Note: `pkgsBuildHost` would be the cross compiler (runs on build,
targets host = aarch64), which is the wrong splice for a build-host
helper; this is the same trap that fooled the dev-shell tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`clib/bin/cpmock.c` calls memset/strcpy/strerror/memcmp without an
`#include <string.h>`.  Under glibc the standard headers transitively
re-expose those declarations (via <sys/socket.h>, <stdlib.h>, etc.) so
the compile happens to succeed.  Under musl the headers are strict and
disjoint, so the declarations are missing.  Combined with the
`-DCMAKE_C_STANDARD=23` cmake flag (which makes clang treat
`-Wimplicit-function-declaration` as an error per C23), the build dies:

    clib/bin/cpmock.c:33:5: error: call to undeclared library function
        'memset' with type 'void *(void *, int, __size_t)'; ISO C99 and
        later do not support implicit function declarations
    ... and similar for strcpy, strerror, memcmp.

This is a real correctness bug in the upstream source (implicit function
declarations have been UB since C99 and outright illegal since C23) --
glibc just covers for it.  The workaround here patches the include in
locally so the FRR/dataplane container can be built against musl.

TODO: remove once fixed upstream in githedgehog/dplane-rpc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`src/hh_dp_msg.c` accesses `struct in6_addr` through glibc's internal
anonymous union member name `.__in6_u.__u6_addr8` (three callsites:
lines 74, 131, 179).  That name is a glibc implementation detail, not
part of any standard.  POSIX defines `struct in6_addr` to expose the
bytes via the `.s6_addr` member directly, and musl does only that --
no wrapping anonymous union -- so the access fails to compile against
musl headers:

    src/hh_dp_msg.c:74:57: error: no member named '__in6_u' in 'struct in6_addr'
        memcpy(ifa.address.addr.ipv6, ifaddr->u.prefix6.__in6_u.__u6_addr8, 16);

The three sites are all doing the same thing -- copying 16 bytes out of
the address -- and `.s6_addr` is the portable spelling that works on
both glibc and musl.

To make the postPatch actually take effect, also remove `dontUnpack = true`
so nix copies the source to a writable build directory, and change the
cmake `-S "$src"` reference to `-S .` so cmake consumes the patched copy
rather than the (read-only) original store path.

TODO: remove once fixed upstream in githedgehog/dplane-plugin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
daniel-noland and others added 18 commits May 11, 2026 16:36
Adds a `must-build` CI job that exercises the platforms and libcs the
project intends to ship for, without yet routing them through `check` /
`build` (those jobs assume gnu/x86-64-v3 today).  The matrix:

  - wasm32-wasip1 / unknown / release  (build-each)
  - zen5 / musl / release              (build-container dataplane)
  - bluefield3 / gnu / debug           (build-container dataplane)
  - bluefield3 / musl / debug          (build-container dataplane)

`build-each` is added to the justfile as a thin wrapper that maps to
the `workspace` nix target (build every workspace member to its own
output), parallel to the existing `test-each`.

The wasm matrix entry uses `build-each`; the cross/native container
entries use `build-container dataplane`.  All entries pass platform,
profile, and libc explicitly via `JUST_VARS` so the just default
fallbacks don't matter and the row is self-documenting.

`must-build` is added to the final-status aggregator's `needs` list
along with a step that flags any matrix-row failure -- so a red
must-build row turns the workflow conclusion red even though the
job itself only blocks the summary aggregator.

The `# - check # TODO: add dependency before PR lands` line in the
job's `needs:` block flags that under the current `check` / `build`
implementation those jobs do not exercise cross/wasm, so making
`must-build` strictly depend on them would just slow down feedback
for the cross builds without buying coverage.  Tighten when `check`
itself learns to honour the platform/libc axis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The dataplane.tar `buildPhase` unconditionally tarred in `libc.out`
(`pkgs.pkgsHostHost.libc`) and `${pkgs.pkgsHostHost.glibc.libgcc}`.
The latter is a glibc-evaluation reference: under a musl pkgsCross
`pkgs.pkgsHostHost.glibc` triggers nix trying to build glibc against
musl, which is a nonsensical combination and fails outright with
errors from `sysdeps/x86_64/multiarch/strcasecmp_l.c` ifunc redirection
inside `glibc-nolibgcc-x86_64-unknown-linux-musl-2.42-61`.

Gate the libgcc input on `libc == "gnu"`: glibc-dynamic Rust binaries
need `libgcc_s.so.1` for unwinding, but musl Rust targets statically
link musl and Rust's compiler-builtins and have no libgcc_s consumer.
The exact store path of the libgcc input matters: nix's glibc ld-linux
is compiled with `pkgs.pkgsHostHost.glibc.libgcc` (`xgcc-<ver>-libgcc`
on native, `libgcc-<triple>-<ver>` on cross) in its system search list.
Bundling libgcc_s.so.1 from any other store path (e.g.
`pkgs.stdenv.cc.cc.lib`, which has the same content but a different
name) results in the file being present in the container but
unreachable: the loader doesn't search the directory and there's no
RUNPATH to redirect it.  Symptom on a binary built that way:
    /bin/dataplane: error while loading shared libraries:
    libgcc_s.so.1: cannot open shared object file: No such file or
    directory

Keep libc.out bundled in both libc variants.  The Rust binaries are
static on musl and don't need it, but the busybox we ship alongside
for `/bin/*` shell utilities is dynamically linked against whichever
libc its pkgset uses.  On musl-cross-arch the busybox interpreter is
`${pkgs.pkgsHostHost.libc}/lib/ld-musl-<arch>.so.1`; without that store
path inside the tar the applets fail at exec with no usable libc.

Rename the inner `libc` let-binding to `libc-pkg` so the outer
function-arg `libc` (the string "gnu" / "musl" / "unknown") stays
visible inside this scope for the conditional.  Add a header comment
on the new `libgcc-tar-input` documenting the path-matching constraint
so the next person doesn't "fix" it to a more-natural-looking
`stdenv.cc.cc.lib`.  Add a matching comment on `libc-tar-input`
explaining why busybox makes the bundled libc unconditional.

Also: typo fix on the surrounding comment block (`hazzards` -> `hazards`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The debug-tools container content list referenced `pkgs.libgcc.libgcc`,
which only exists as a passthru on the native gcc derivation
(`outputs = [ "out" "man" "info" "lib" "libgcc" "checksum" ]`).  Under
a cross pkgs (`pkgsCross.aarch64-multiplatform`), `pkgs.libgcc` is a
different derivation -- the static cross-libgcc helpers
(`crtbegin.o`/`libgcc.a`/...) with `outputs = [ "out" "dev" ]` and no
`libgcc` output -- so the attribute access raises "attribute 'libgcc'
missing" during eval of debug-tools for any non-native build.

`pkgs.pkgsHostHost.glibc.libgcc` resolves correctly in both contexts
and points at the same store path that the bundled glibc ld-linux has
in its compiled-in library search list (so runtime resolution still
works the way it has been).

Gate the inclusion on `libc == "gnu"` so musl-libc targets don't drag
in glibc.libgcc, which has no purpose there -- musl Rust binaries use
the system llvm libunwind, not libgcc_s.so.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR's LDFLAGS hard-coded `-L${fancy.libgccjit}/lib -latomic`, pointing
at the glibc-targeted libgccjit-bundled `libatomic.so.1`.  That's the
right path when targeting glibc but is ABI-wrong (and at the wrong
store path) when targeting musl.

Add `libc` to the overlay's function args (and inherit it through
`default.nix`'s overlays import) so the LDFLAGS can pick a
libc-specific path:

  - libc == "gnu":  `${fancy.libgccjit}/lib/libatomic.so.1`
        The libgccjit overlay still builds with the host stdenv (its
        own override has `# TODO: debug issue preventing clang build`),
        so its libatomic is glibc-targeted.  Same path FRR has always
        used.

  - libc == "musl": `${stdenv.cc.cc.lib}/${triple}/lib/libatomic.so.1`
        The gcc-libs output of the cross-musl gcc that backs the
        cross-musl clang stdenv.  Note: `stdenv`, *not* `stdenv'` --
        the prime is the clang stdenv whose `cc.cc` is clang and
        contains no libatomic at all.

  - else: throw, to fail loudly on a libc this layout doesn't yet
        cover.

Link is dynamic on both sides (the `-L .. -latomic` sits *outside* the
`-Wl,--push-state,...,-Bstatic` block).  Single-`.so`-process-wide is
the only ODR-safe choice for libatomic: its `lock_for_pointer` lock
table is per-image, so a static copy inside libfrr.so cannot
synchronize with any other consumer that picks up libatomic.so
dynamically.

Also extend the `preFixup` nuke-refs allowlist to keep the cross-musl
gcc-libs path when targeting musl, so the RUNPATH entry pointing at
the dynamic libatomic.so survives fixup -- without it, nuke-refs
would scrub the entry, the binary's loader couldn't find libatomic,
and `watchfrr` / friends would die at startup with
`Error relocating .../libfrr.so: __atomic_*: symbol not found`.

(`libxcrypt` was already preserved in the working tree's allowlist
shape but missing from this branch -- include it here so the allowlist
matches what's actually linked.)

The same libc-aware nuke-refs treatment applies to `frr-agent`: its
fixupPhase referenced `libgcc.lib`, but under cross pkgs `final.libgcc`
is the static cross-libgcc helpers derivation (outputs `[ "out" "dev" ]`
only -- no `lib` attribute), so the package failed to evaluate for
every non-native target.  Move the fixupPhase from the package into
the overlay so `libc` is in scope, gate the libgcc exclusion on
`libc == "gnu"`, and use `pkgsHostHost.glibc.libgcc` -- the path the
bundled ld-linux's compiled-in search list points at, matching the
libgcc handling already in place at the dataplane.tar bundler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Two unrelated FRR-package cleanups that surface once the overlay starts
exercising libc=musl:

1. `libgccjit` is unconditional in FRR's buildInputs but is only meaningful
   on glibc.  The overlay's LDFLAGS already gates its `-L${libgccjit}/lib
   -latomic` on `libc == "gnu"` and pulls libatomic from the cross-musl
   gcc-libs output instead, so on musl libgccjit is just dead weight in
   the build closure (and a possible source of glibc-targeted ABI leakage
   into the musl runtime image).  Gate the buildInput on
   `stdenv.hostPlatform.isGnu`.

2. `--enable-grpc=no` / `--enable-protobuf=no` are mis-spelled autoconf
   flags.  Autoconf's `AC_ARG_ENABLE(grpc, ...)` accepts `--enable-grpc[=ARG]`
   and `--disable-grpc`, but `--enable-grpc=no` is parsed as
   `enableval=no` for the `grpc` feature -- which most projects handle by
   coincidence, but it's not the documented spelling and breaks under
   stricter autoconf versions.  `--disable-grpc` / `--disable-protobuf`
   is the canonical spelling.  No runtime-behavior change today, but
   the syntax is no longer wrong-looking-but-works-anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR was just built with --disable-protobuf, and dplane-plugin does not
actually link against protobufc itself, so the buildInput (and the
matching call arg) were dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Native cargo invocations from the dev shell (cargo build/clippy/test
--doc) defaulted to the build-host triple while LIBRARY_PATH and
PKG_CONFIG_PATH were already pointing at the cross sysroot.  The
resulting links pulled in cross-target libraries against the host
rust-std and failed (e.g. glibc rust-std + musl libc gave undefined
references to `open64`, `fstat64`, ...).

Pin CARGO_BUILD_TARGET to the same triple the sysroot is built for,
and set PKG_CONFIG_ALLOW_CROSS=1 so the Rust pkg-config crate stops
refusing cross-target builds (PKG_CONFIG_PATH already points at the
matching cross sysroot).
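In shell terms the two pins amount to the following (a sketch; the musl triple is illustrative -- the real value is whatever triple the cross sysroot was built for, and in the dev shell these are exported by nix, not by hand):

```shell
# Pin cargo to the sysroot's triple so host-default builds can't mix libcs.
export CARGO_BUILD_TARGET="aarch64-unknown-linux-musl"
# Let the Rust pkg-config crate proceed even though host != target; safe
# here because PKG_CONFIG_PATH already points at the matching cross sysroot.
export PKG_CONFIG_ALLOW_CROSS=1
echo "cargo will target: $CARGO_BUILD_TARGET"
```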

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Glibc-targeted Rust binaries unwind through libgcc_s.so.1 (still pulled
into the image via glibc.libgcc).  Musl and wasi have no libgcc
consumer, so panic-unwind has no unwinder unless we ask build-std to
pull LLVM's libunwind out of the sysroot.

Gate the build-std-features flag on libc != "gnu" so glibc keeps using
the libgcc path while musl/wasi get system-llvm-libunwind appended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Extract a generic aarch64 platform definition (effectively march=generic:
no `march` or `mcpu` overrides) and refactor bluefield2 as a
recursiveUpdate of it to retain its armv8.2-a / cortex-a72 specialization.

Lets cross-platform CI exercise a vanilla aarch64 build without depending
on the bluefield-specific CPU/march flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wire a `cargo check` derivation per workspace package as `check.<name>`,
mirroring the existing `workspace.<name>` build derivations.  The point
is to give CI a fast, no-link verification path -- particularly useful
for wasm32-wasip1 where a full container build is overkill but we still
want a build-error gate.

Add matching `just check` (single package) and `just check-each` (all
packages) recipes that drive the new derivation through the standard
nix build pipeline.

Also rearrange the `inherit` list so `check`/`dataplane` land in
alphabetical order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lets contributors execute the wasm32-wasip1 binaries we already build
without leaving the dev shell:

- wasmtime in the dev-shell tooling
- `[target.wasm32-wasip1] runner = "wasmtime run --dir ."` so plain
  `cargo run`/`cargo test` against the wasm target invokes wasmtime
  automatically, with the cwd bound in.
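The runner wiring reduces to a small `.cargo/config.toml` fragment; sketched here against a throwaway directory (in the repo the stanza lives in the checked-in cargo config, written once, not generated per-build):

```shell
# Demonstration only: write the runner stanza into a scratch config and
# show it. Plain `cargo run`/`cargo test` for the wasm target would then
# exec wasmtime with the cwd preopened.
tmp="$(mktemp -d)"
mkdir -p "$tmp/.cargo"
cat > "$tmp/.cargo/config.toml" <<'EOF'
[target.wasm32-wasip1]
runner = "wasmtime run --dir ."
EOF
cat "$tmp/.cargo/config.toml"
```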

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The single must-build matrix was conflating two different jobs:
a fast wasm sanity check and a slow cross-arch container build sweep.
Split them along that line:

wasm: same triggers as must-build (devfiles, tags, dispatch),
runs the wasm32-wasip1 `check` recipe only.  Keeps wasm verification
as a blocker for any PR that touches dev files.  Recipe is shaped as
a `{name, args}` object so the matrix axis carries recipe+args as a
single tuple rather than as parallel axes.

cross: {aarch64, bluefield3} x {gnu, musl} x {debug} x
{build-container dataplane, build-container frr.dataplane}.  Covers
both the dataplane container and the FRR sidecar, the two images
most affected by this PR's FRR overlay / dplane-plugin / dplane-rpc
cross-arch fixes.  Release profile is dropped to keep the leg count
flat while picking up FRR; cross-arch build problems surface in
debug just as well.  Gated narrowly (PR label `ci:+cross`, or
push/merge_group).

Advisory failure handling: job-level `continue-on-error: true`.  A
failing leg still reports `failure` to its own conclusion -- the job
badge goes red, the build step's `failure()` predicate fires (so the
shared `*tmate` anchor still opens a session on workflow_dispatch
debug), and the leg is visible in the run summary.  The workflow's
overall conclusion ignores the failure, so cross is non-blocking.
`needs.cross.result` reports `success` to dependents under job-level
CoE, so the summary aggregator does not (and cannot) gate on cross;
treat cross as purely informational.

Add `check` as a hard dependency so we don't burn a full cross build
on code that won't even type-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`docker import` reads a raw rootfs tar (no manifest, no platform metadata).
Without an explicit --platform, the imported image is tagged with the host
runner's architecture even when the rootfs is for a different one, which
silently mislabels cross-built images and would mislead anything that
consumes them by platform-aware pull or digest.

Map the just `platform` variable (aarch64 / bluefield2 / bluefield3 /
x86-64-v3 / x86-64-v4 / zen3 / zen4 / zen5) to the docker linux/{arm64,amd64}
pair before calling import.  Fail loudly on an unmapped platform rather
than silently fall back to host arch.

Only the `dataplane` arm of `build-container` is affected; the other arms
use `docker load` on real image tarballs (which carry their own platform
metadata) and do not need --platform.
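The mapping can be sketched as a small shell function (the names mirror the just `platform` values above; how the justfile actually wires this into the recipe is assumed):

```shell
# Map the just `platform` value to docker's --platform pair; fail loudly
# on anything unmapped instead of letting docker default to the host arch.
map_docker_platform() {
  case "$1" in
    aarch64|bluefield2|bluefield3)      echo "linux/arm64" ;;
    x86-64-v3|x86-64-v4|zen3|zen4|zen5) echo "linux/amd64" ;;
    *) echo "unmapped platform: $1" >&2; return 1 ;;
  esac
}
map_docker_platform bluefield3   # -> linux/arm64
```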

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`nuke-refs` is generated by a `replaceVarsWith` call that bakes a perl
path into the script.  Under a cross pkgset `prev.nukeReferences` (and a
plain `nukeReferences` argument grabbed without splicing) substitutes
a target-arch perl: e.g. `perl-aarch64-unknown-linux-gnu-5.42.0` for
an aarch64 cross.  Both `frr-build`'s preFixup and `frr-agent`'s
fixupPhase invoke `nuke-refs` on the build host during this
derivation's own build, where the kernel cannot execute target-arch
ELF and the build dies with:

    nuke-refs: ...-perl-aarch64-...-5.42.0/bin/perl:
    cannot execute binary file: Exec format error

Reach into `final.buildPackages.nukeReferences` in both spots so the
substituted perl is the build-host one.  Verified by building
`containers.frr.dataplane` for aarch64/gnu, which the cross matrix
added in this stack exercises.

The sibling `removeReferencesTo` is shell-only -- it doesn't bake in
perl, and the `@shell@` substituted into the script is
`stdenvNoCC.shell`, which resolves to build-host bash on every
pkgset.  No equivalent fix is needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
No need to bother with cross compilation until miri passes.
Best not to overload CI, and a miri failure will often indicate
that the cross build would be pointless anyway.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The `libc` string is a tag carried through the build for derivation
naming and platform dispatch.  For `wasm32-wasip1` it was set to
`"unknown"`, which read as "we don't know what libc this is" rather
than the truth: there is no libc -- WASI Preview 1 provides its own
system interface and the Rust target name uses the `wasi` env field,
not a libc.  Use `"none"` so the tag describes the actual situation.

Mechanical rename across `default.nix` (comment), `justfile` (`libc`
default for `wasm32-wasip1`), `nix/platforms.nix`
(`wasm32.wasip1.<libc>` attr key), and `.github/workflows/dev.yml`
(matrix `libc` value).  No build-output change for non-wasm targets.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
nixpkgs' musl `rustPlatform` (which `frr-agent` is currently built
with) emits binaries with `DT_NEEDED libgcc_s.so.1` and
`_Unwind_*@GCC_*` references for stack unwinding.  Our musl FRR
image has no `libgcc_s.so.1` -- the `frr-agent` overlay's
`fixupPhase` deliberately nukes the libgcc store-path on musl -- so
the musl loader bails before `main`:

    Error relocating /bin/frr-agent: _Unwind_Resume: symbol not found
    Error relocating /bin/frr-agent: _Unwind_Backtrace: symbol not found
    ...

Force `+crt-static` on the musl variant of `frr-agent` so the C
runtime (and with it `libgcc_eh.a`) is linked statically.  The
resulting binary is `static-pie linked` with no `_Unwind_*` undefs
and no `NEEDED` entries, and the container starts.

The accompanying comment on the `fixupPhase` is also corrected: the
prior claim that "musl Rust uses llvm-libunwind and has no libgcc_s
consumer" was wrong for the way `frr-agent` is currently built --
that's the path the rest of the workspace takes via crane +
`-Zbuild-std=...,system-llvm-libunwind`, not the path `frr-agent`
takes through nixpkgs' `rustPlatform.buildRustPackage`.

TODO: switch `frr-agent` onto the workspace's crane +
`system-llvm-libunwind` build path so it pulls LLVM libunwind
statically like every other Rust binary in this repo, rather than
relying on libgcc_eh from the gcc backing the cross-musl clang
stdenv.
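The libc gate described above can be sketched in shell terms (the real wiring is in nix; the `libc` tag and the flag spelling follow the surrounding commits):

```shell
# Sketch: +crt-static only on the musl variant; glibc keeps dynamic
# linking (it unwinds through libgcc_s.so.1 instead).
libc="musl"   # tag carried through the build, per the surrounding commits
rustflags=""
if [ "$libc" = "musl" ]; then
  rustflags="-C target-feature=+crt-static"
fi
echo "RUSTFLAGS=$rustflags"
```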

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
@daniel-noland changed the base branch from main to pr/daniel-noland/everything-is-wasm on May 12, 2026 01:37
I think it works now?
@daniel-noland force-pushed the pr/daniel-noland/asan-again branch from dda0e90 to bd030ac on May 12, 2026 01:59
@daniel-noland force-pushed the pr/daniel-noland/everything-is-wasm branch from 06bea51 to f45af96 on May 12, 2026 19:16
Base automatically changed from pr/daniel-noland/everything-is-wasm to main May 13, 2026 09:03