asan again (ignore for the moment, just checking CI)#1529
Draft
daniel-noland wants to merge 58 commits into
The test spawned two worker threads but never joined them, so any panic in a worker was silently dropped instead of failing the test. Joining the handles surfaces the second thread's panic: it tried to allocate NAT for 2.0.1.3, an IP belonging to an expose without stateful NAT configured, so no pool exists and the allocator returns Denied. Switch the second source to 1.2.0.0 (in expose1, which has stateful NAT) so each thread hits a distinct pool and the test does what it claims to.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lift the concurrent_allocations test body out of tests_shuttle into a sibling tests module so it can be exercised under different shuttle schedulers without duplication. Extract shuttle_config() and add run_shuttle_pct, then run the same fixture under both RandomScheduler (test_concurrent_allocations_shuttle_random) and PctScheduler (test_concurrent_allocations_shuttle_pct). PCT explores schedules with bounded preemptions, complementing random's coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
UnicastIpv4Addr::new rejects both multicast and broadcast addresses, so the InvalidSourceAddr fuzz-test branch can legitimately observe a broadcast source. Update the assertion to allow either, and expand the error text and doc comment to describe both rejections accurately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Allows PciEbdf to be embedded in rkyv-archived types (notably args::PortArg::PCI), which a later commit migrates from the hardware crate's structured PciAddress to this string-based form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Two bugs in the same generator caused PciEbdf::try_new to reject
every value the bolero TypeGenerator produced:
- The separator between bus and device was '.', but EBDF format
is domain:bus:device.function (e.g. 0000:01:00.0). Generated
strings looked like 0000:01.00.00 and were rejected on layout.
- The function field was formatted with `{function:02x}`, always
two hex digits, while try_new requires exactly one (PCI
function is a 3-bit field, 0..=7). Even after the separator
fix, the function width alone would have kept every generated
string rejected.
Switch to the correct separator and mask function to 0x07 before
formatting it as a single hex digit, so the generator covers the
parser's accepted input space.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Decouple args, config, and k8s-intf from the hardware crate by swapping the structured PciAddress newtype for the string-based PciEbdf already exposed by net::pci. PciEbdf::try_new validates the EBDF format on construction so call sites remain checked at the API boundary; downstream consumers continue to round-trip through to_string().
This removes hardware as a build-time dependency from these crates, which is needed so that they (and their dependents) can be exercised under miri -- hardware pulls in OS-specific machinery (sysfs, hwlocality, dpdk-sysroot-helper) that miri can't run.
Dropping the hardware dep also severs an implicit chain that enabled tracing/attributes for args via workspace-wide feature unification (args -> hardware -> sysfs -> tracing/attributes), so declare tracing/attributes explicitly in args/Cargo.toml since args/src/lib.rs uses #[instrument]. Without this, `just test-each` (which builds each package in isolation, bypassing workspace unification) would fail to compile args.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Both crates ship code (FFI, asm, custom unwind) that miri can't execute on the powerpc64 target we use for memory-model coverage. Move the loom and shuttle entries of flow-entry, nat, and quiescent into [target.'cfg(not(miri))'.dependencies] / dev-dependencies so the workspace stays buildable under miri without dropping those crates from the run.
Add a compile_error! in concurrency/src/lib.rs to surface a clear diagnostic if someone explicitly enables shuttle or loom while running under miri (the metadata-driven exclude path in miri.just already prevents this combination, but the explicit error gives a better failure mode for ad-hoc invocations).
The concurrency crate itself does not move its loom/shuttle dependencies here -- they remain optional in the regular [dependencies] section, gated only by the feature flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a hidden _strict_provenance feature that selects the Mutex<Arc<T>>
implementation of Slot even outside loom/shuttle. Used by the miri job
to exercise this crate under -Zmiri-strict-provenance, which fails
against arc-swap because its hazard-pointer machinery doesn't yet
expose provenance information in the standard form. The fallback
implementation gives us provenance coverage on the QSBR protocol
without changing observable behavior.
Collapse the two cfg-gated mod imp { ... } blocks into a single
cfg_select! at module scope so adding a new selector only requires
extending the predicate, not introducing a third nested module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
These tests interact with resources miri's interpreter does not provide:
- cli::cliproto::test_communications and routing::router::rio's two tests bind Unix domain sockets at /tmp paths.
- routing::frr::test::test_fake_frr_agent stands up a fake FRR agent over Unix sockets.
- routing::atable::resolver::test_adjacency_resolver reads /proc/net/arp and queries kernel interfaces.
- k8s_less::local::test_kubeless drives an inotify watcher; miri cannot emulate inotify even with isolation disabled.
- flow_entry::flow_table::table::test_flow_table_flow_invalidation mixes tokio::time::Instant::from_std on a wall-clock std deadline with a tokio::time::sleep call counted in real seconds, so a 4 s sleep under miri's slow interpreter elapses past the "extended" flow's deadline.
For nat::stateful::test::test_full_config_unidirectional_nat_overlapping_destination the only miri-incompatible piece is the tracectl::get_trace_ctl() setup -- linkme's distributed_slice uses link_section, which miri cannot load. Skip just that block under cfg(not(miri)) instead of ignoring the whole test; the tracing macros at use sites are no-ops when no targets are registered, so the rest of the test runs fine on the interpreter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
miri's interpreter is roughly 100x slower than native execution, so property tests and concurrency stress tests that run thousands of iterations are impractical to run unmodified. Use cfg_select! to shrink loop counts and iteration budgets when miri is the active target, keeping the same coverage shape but with budgets that finish in seconds rather than hours:
- flow-entry::flow_table::nf_lookup: 1000 -> 10 packets
- net::packet::hash::test_hash_bounds: 2000 -> 20 packets
- routing::fib::test_concurrency_fib*: 100_000 -> 50 packets
- routing::fib::test_*_table iterations: 1000/5_000 -> 50
- left-right-tlcache::test_readhandle_cache: 1000 -> 10 handles
- k8s-intf::bolero::support: bolero iterations 1000 -> 3 and drop the largest UNIQUE_COUNTS entry (100); also skip test_unique_v6_cidr_generator under miri (still too slow even at 3 iterations because the mask range is 0..=128).
Also gate serial_test in left-right-tlcache to cfg(not(miri)) since the crate's serial coordination uses OS-level locking that miri doesn't model meaningfully and would only add overhead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Threads a `nightly` argument through default.nix into the LLVM overlay so that callers can request the rust nightly channel with the miri component installed. When `nightly = "true"` the rust-toolchain pulls "nightly" with miri added to its component list; otherwise the pinned stable channel is used unchanged. Also sort the components alphabetically so future additions diff cleanly.
Disable cargo-llvm-cov's test suite while we're in the overlay file; its tests fail on the nightly channel for reasons unrelated to this project, and we never use them anyway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wires up `just miri::test` (and `just miri` via [default]) which:
- Drops into a nix-shell with the nightly+miri toolchain active.
- Sets MIRIFLAGS for the desired interpreter knobs (compare-exchange weak failure rate, isolation, many-seeds sweep, preemption rate, symbolic alignment check, provenance mode, optional stacked-borrows disable).
- Optionally sets RUSTFLAGS for layout randomization.
- Invokes `cargo miri nextest run --profile=miri --target=$target` for the configured CPU (default powerpc64 -- weak memory model and big-endian, ideal for surfacing concurrency and endianness bugs).
The runner consults a workspace.metadata.package table in the root Cargo.toml that lists which packages are not buildable under miri (dataplane, dpdk*, hardware, init, interface-manager, mgmt, sysfs, tracectl). When invoked with no package selector the runner expands those into --exclude= flags via tomlq so `--workspace` does the right thing without an extra argument list to maintain.
Threads the additional --argstr flags through the existing setup-roots and shell recipes so they accept the same surface as build, and adds a `miri` profile to nextest with relaxed thread budget and 500s slow-timeout (miri tests are heavy).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Run `just miri::test` on devfile changes, version tags, and manual dispatch. The job is gated to powerpc64 only -- weak memory model and big-endian, catches the most concurrency/endianness bugs -- with other CPU rows commented out for targeted debugging. The aggregator's check emits ::warning:: rather than ::error:: + exit 1, so a miri failure surfaces in the run summary but does not block the PR's overall status. max-parallel is set to 1 because miri is memory-hungry.
Three steps per matrix entry:
1. The full workspace under miri.
2. dataplane-quiescent under permissive provenance with extra random seeds. This is the configuration that exercises the real arc-swap implementation, complementing what shuttle/loom can model.
3. dataplane-quiescent under strict provenance via the _strict_provenance feature, which switches Slot to the Mutex fallback so provenance checks can run cleanly.
Also flatten the existing `features` matrix into pure `include` rows that each carry their own `build:` reference. The previous form combined a `build:` axis with a `features:` axis under cross-product semantics, then added `test_package` via include; the flattened form is more legible without changing what gets executed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The previous assertion sampled Instant::now() at assertion time and compared the remaining lifetime against the configured timeout minus a 2-second tolerance. That meant the test was implicitly asserting "the framework finishes the rest of the function in under 2 seconds", which is true natively but false under miri's interpreter.
Switch to a wall-clock-independent form: snapshot Instant::now() immediately before the process_packet call that is expected to extend the timer, then assert that expires_at >= snapshot + timeout. reset_expiry_unchecked sets expires_at = Instant::now() + duration internally, so this lower bound holds whenever the timer was extended at all -- and a regression that fails to extend it would violate the bound regardless of how slow execution is.
Apply the same pattern to the second assertion (against DEFAULT_ESTABLISHED_TOUT_UDP) since it had the same shape with a larger 5-second tolerance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
PCI device is 5 bits and function is 3 bits on the wire, but the PciEbdf parser only enforced syntax -- "0000:00:ff.f" would parse successfully despite both fields being unrepresentable. Validate the numeric ranges in try_new via new MAX_DEVICE (0x1f) and MAX_FUNCTION (0x07) constants, and update the bolero generator to mask both fields so generated values stay in range. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Three tweaks to the miri runner:
* nextest's `threads-required` for the miri profile was hardcoded to 8 in `.config/nextest.toml`. With `-Zmiri-many-seeds` packing $seeds seeds into a single test process, that fixed budget oversubscribes the box when $seeds exceeds 8 (and wastes capacity when it is smaller). Move the value out of the static config: at run time, mktemp a copy of the toml and tomlq in `threads-required = $seeds`, then pass `--config-file`. The checked-in miri profile no longer pins the value.
* `layout_seed` was hardcoded to "1", which meant every miri run exercised the same struct layout and let layout-sensitive bugs hide behind whatever ordering "1" happens to produce. Derive it from `git rev-parse HEAD` so each commit gets a different layout for free, with no manual seed bumping.
* Wrap `cargo miri` in `nice -n 19` so a 64-seed run stays out of the way of foreground work on the same machine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`test_flow_table_timeout` set a 2-second deadline against `std::Instant::now()` and then `tokio::sleep`d 1s and 2s of wall time. Under miri the interpreter is so much slower than wall time that by the time the first 1s sleep returns, real time has already elapsed past the 2s deadline, the per-flow timer task has fired and removed the entry, and the "still present after 1 second" assertion fails. Increasing the seed count made this reliably reachable rather than rare.
Switch to `#[tokio::test(start_paused = true)]` so the timer task's `sleep_until` and the test's `sleep`s both run on tokio's virtual clock, and capture `now` from `tokio::time::Instant` instead of `std::Instant` so the deadline is anchored on the same virtual baseline. With paused time a `std::Instant::now()` in the test body still returns real wall time, which under miri sits many real seconds past the paused virtual baseline; the test's virtual sleeps would never advance virtual time far enough to reach a deadline computed that way, and the entry would still be present at the second assertion. Both ends of the test now run on the same time base.
Add the `test-util` feature to the tokio dev-dep so the `start_paused` attribute is available.
Same root cause as the existing `cfg_attr(miri, ignore)` on `test_flow_table_flow_invalidation`; that test stays ignored under miri for now since it exercises additional behaviour (`extend_expiry`) that this fix does not.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There actually isn't a rule saying the scheduler will always give you a TCP, UDP, or ICMP packet in a given run. For instance, this fails under miri because the number of runs is much lower and the random scheduler's schedule is deliberately unfair.
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There is no rule that the fuzzer will always give you both IPv4 and IPv6. This tends to fail under miri because:
1. the iteration count is far lower
2. the schedule is intentionally not fair
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Same wall-clock-vs-virtual-clock mismatch as test_flow_table_timeout. flow_2's deadline is set to `Instant::now() + 2s + 1min`, then the test sleeps 3 tokio seconds and asserts flow_1 expired (active_len == 1). Under miri the entire test takes ~90s of wall time, so by the time the assertion runs flow_2's 62-second std-Instant deadline has elapsed too and its timer task has expired it as well, leaving active_len == 0.
Switch to `#[tokio::test(start_paused = true)]` so the per-flow timers' `sleep_until` and the test's `sleep` share tokio's virtual clock, and capture `now` from `tokio::time::Instant` so flow_2's extended deadline is anchored on the same virtual baseline that the timer task evaluates against. extend_expiry_unchecked uses fetch_add on expires_at (no Instant::now() call internally), so the 1-minute extension stays relative to the virtual baseline once `now` is virtual.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Now that miri is run under nice there is little or no advantage in delaying the run.
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a Miri section to testing.md covering:
* what `just miri::test` is and how to run it (whole workspace, a specific test, with a larger seed fan-out);
* the configurable knobs the recipe exposes via `just miri::<name>=<value>` (cpu, seeds, schedule_seed, provenance, stacked_borrow_check, preemption_rate, weak_failure_rate, randomize_struct_layout, layout_seed) and their defaults;
* the workspace.metadata.package miri-exclusion list and how the runner threads it into `--exclude=` flags automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
From this commit forward, a miri failure is a regression on equal
footing with any other CI failure. The expectation is to fix the
underlying issue, not suppress the signal. If a future change
demonstrates that miri genuinely cannot exercise some piece of
code -- e.g. a syscall or hardware interaction the interpreter
does not model -- the resolution is to document that reason
explicitly:
- (preferred) Gate the specific test or item with `#[cfg(not(miri))]`
(or `#[cfg_attr(miri, ignore = "...")]`) and a comment naming
the upstream limitation or unsupported operation.
- (if you must) Add the package to workspace.metadata.package in
the root Cargo.toml with `miri = false` and a comment citing the
reason.
Note that some tests or even whole crates are just genuinely hopeless
under miri. That's fine. But that should be documented if so, and it
should not be the first answer you reach for.
The "if you must" option should be reserved for situations where
miri just can't compile the crate. For instance, the dpdk-sys crate
is well and truly hopeless (and pointless) under miri. That's not
a flaw in the crate, and it can be exempted with a shrug. The
hardware crate is in a similar situation. Miri simply does not
have virtual hardware.
More of a middle ground is the router crate. That largely passes
and works under miri, but has a few tests which use unix sockets.
Those won't work and that's fine. Just ignore those tests under miri.
Exempting the whole crate from miri would not be a reasonable solution.
Reasons like "this test is flaky under miri" do not qualify (finding
flaky tests is part of the point of this effort).
More rarely, some tests just take a genuinely excessive amount of time
under miri. This is rarely a valid reason to exempt. More often you
really can just reduce an iteration counter or something as a condition
in the miri build. Sometimes that is just impractical or defeats the
point of the test. Exempting in this case is valid, but it really
should be rare.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
crane was imported with no pkgs argument, falling back to its own
`import <nixpkgs> { }` default in `npins/crane/default.nix`. That makes
`craneLib.buildPackage` / `mkCargoDerivation` build their derivations
against the *host* nixpkgs's stdenv, regardless of what stdenv the
project's `pkgs` (a `pkgsCross.<arch>`) provides.
Symptom under cross compile (e.g. `just platform=bluefield3 build-each`):
aws-lc-sys's build script invokes `cc` to assemble aarch64 `.S` files
because the surrounding `CC` env was the build-host x86_64 gcc, not the
target's `aarch64-unknown-linux-gnu-gcc`. The link step rejects every
aarch64 mnemonic in `p521_jscalarmul.S` and friends:
cargo:warning=Error: no such instruction: `ldp x4,x3,[x2,...`
Verified via `nix eval`: with `pkgs` passed in, the resulting derivation
has `stdenv.hostPlatform = aarch64-unknown-linux-gnu` and crane's
`mkCrossToolchainEnv` correctly exports `CC_aarch64_unknown_linux_gnu`
etc. to the cross toolchain so cc-rs picks it up.
No effect on native (libc=gnu, x86-64-v3) builds where buildPlatform
equals hostPlatform; the cross stdenv is the native stdenv in that
configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`is-cross-compile` was comparing `pkgs.stdenv'.hostPlatform` against
`ctarget`. In a cross stdenv those are the *same* value (hostPlatform
is the target architecture, and ctarget is derived from
targetPlatform.rust.rustcTarget which equals hostPlatform.rust.rustcTarget
in a non-cross-compiler stdenv). The check was therefore always false
under cross, and the `cxx` / `strip` / `objcopy` paths that depend on it
fell through to the non-cross-prefixed binary names.
Symptom under bluefield3 cross: the linker arg
`-Clinker=${pkgs.pkgsBuildHost.llvmPackages'.clang}/bin/clang++` resolved
to a wrapped-cross-clang directory that contains only the prefixed
`aarch64-unknown-linux-gnu-clang++`, not the bare `clang++`:
error: linker `/nix/store/...-aarch64-unknown-linux-gnu-clang-wrapper-22.1.5/bin/clang++` not found
`buildPlatform` is the right comparison: cross-compile is defined by
"build host produces output for a different host", which is exactly
buildPlatform != hostPlatform. Comparing buildPlatform to ctarget gives
true when actually cross-compiling (e.g. x86_64 -> aarch64) and false
for native and wasm (where build is the same as target).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`pkgs.pkgsBuildHost` is the splice variant for tools that *run on*
the build host and *target* the host -- i.e. cross compilers. Under
a cross `pkgs` (`pkgsCross.<arch>` for libc=musl or platform=bluefield3),
that means `pkgsBuildHost.pkg-config` resolves to
`x86_64-unknown-linux-musl-pkg-config-wrapper-0.29.2`, whose bin/
contains only the prefixed binary `x86_64-unknown-linux-musl-pkg-config`
-- no plain `pkg-config`.
Cargo build scripts (hwlocality-sys, etc.) shell out via `Command::new
("pkg-config")` and crash because the unprefixed binary isn't on PATH:
error: failed to run custom build command for `hwlocality-sys`
... The pkg-config command could not be found.
Same trap applies to `clang` and other unprefixed helpers in the
dev-shell symlinkJoin.
`pkgsBuildBuild` is "runs on build, targets build" -- exactly what
native dev tooling wants. Under a native pkgs (libc=gnu host arch)
the two splice variants are the same derivation, so this is a no-op
for the current default build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The libcap overlay forces `CC:=clang` via makeFlags, which overrides
the cross stdenv's wrapped CC (set to e.g.
`x86_64-unknown-linux-musl-clang` by the stdenv setup hooks). An
unprefixed `clang` in a cross-stdenv build environment resolves to
the *unwrapped* clang binary, which has no `-isystem` paths set up
for the target sysroot:
clang -O2 ... _makenames.c -o _makenames
fatal error: 'stdio.h' file not found
make[1]: *** [Makefile:83: _makenames] Error 1
For native gnu (`pkgsCross.gnu64`) the cc.targetPrefix is empty and
the wrapped clang is named just `clang`, so `CC:=clang` happens to
hit the wrapper. For musl or aarch64, the wrapped clang has a
prefix and the bare `clang` is the unwrapped binary.
`${final.stdenv'.cc.targetPrefix}clang` resolves to the right wrapper
in all configurations: `clang` natively, `x86_64-unknown-linux-musl-clang`
under musl, `aarch64-unknown-linux-gnu-clang` under bluefield3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
libcap's Makefile compiles a helper binary `_makenames` and *executes*
it during the build to generate `cap_names.h`. Make.Rules defaults
`BUILD_CC ?= $(CC)`, so without an explicit BUILD_CC the helper gets
built with whatever CC is, which under cross stdenv is the wrapped
cross-clang. For same-arch cross (x86_64 -> x86_64-musl) the produced
binary happens to run on the build host; for cross-arch it does not:
./_makenames > cap_names.h
/bin/bash: ./_makenames: cannot execute binary file: Exec format error
make[1]: *** [Makefile:86: cap_names.h] Error 126
`pkgsBuildBuild.llvmPackages'.clang` is the clang that runs on, and
targets, the build host -- exactly what BUILD_CC should be regardless
of the target architecture. Pinning the absolute path avoids the
ambiguity around which `clang` PATH resolves to during a libcap build.
Note: `pkgsBuildHost` would be the cross compiler (runs on build,
targets host = aarch64), which is the wrong splice for a build-host
helper; this is the same trap that fooled the dev-shell tooling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`clib/bin/cpmock.c` calls memset/strcpy/strerror/memcmp without an
`#include <string.h>`. Under glibc the standard headers transitively
re-expose those declarations (via <sys/socket.h>, <stdlib.h>, etc.) so
the compile happens to succeed. Under musl the headers are strict and
disjoint, so the declarations are missing. Combined with the
`-DCMAKE_C_STANDARD=23` cmake flag (which makes clang treat
`-Wimplicit-function-declaration` as an error per C23), the build dies:
clib/bin/cpmock.c:33:5: error: call to undeclared library function
'memset' with type 'void *(void *, int, __size_t)'; ISO C99 and
later do not support implicit function declarations
... and similar for strcpy, strerror, memcmp.
This is a real correctness bug in the upstream source (implicit function
declarations were removed in C99 and are outright illegal in C23) --
glibc's headers just cover for it. The workaround here patches the
missing `#include <string.h>` in locally so the FRR/dataplane container
can be built against musl.
TODO: remove once fixed upstream in githedgehog/dplane-rpc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`src/hh_dp_msg.c` accesses `struct in6_addr` through glibc's internal
anonymous union member name `.__in6_u.__u6_addr8` (three callsites:
lines 74, 131, 179). That name is a glibc implementation detail, not
part of any standard. POSIX defines `struct in6_addr` to expose the
bytes via the `.s6_addr` member directly, and musl does only that --
no wrapping anonymous union -- so the access fails to compile against
musl headers:
src/hh_dp_msg.c:74:57: error: no member named '__in6_u' in 'struct in6_addr'
memcpy(ifa.address.addr.ipv6, ifaddr->u.prefix6.__in6_u.__u6_addr8, 16);
The three sites are all doing the same thing -- copying 16 bytes out of
the address -- and `.s6_addr` is the portable spelling that works on
both glibc and musl.
To make the postPatch actually take effect, also remove `dontUnpack = true`
so nix copies the source to a writable build directory, and change the
cmake `-S "$src"` reference to `-S .` so cmake consumes the patched copy
rather than the (read-only) original store path.
TODO: remove once fixed upstream in githedgehog/dplane-plugin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Adds a `must-build` CI job that exercises the platforms and libcs the project intends to ship for, without yet routing them through `check` / `build` (those jobs assume gnu/x86-64-v3 today). The matrix:
- wasm32-wasip1 / unknown / release (build-each)
- zen5 / musl / release (build-container dataplane)
- bluefield3 / gnu / debug (build-container dataplane)
- bluefield3 / musl / debug (build-container dataplane)
`build-each` is added to the justfile as a thin wrapper that maps to the `workspace` nix target (build every workspace member to its own output), parallel to the existing `test-each`. The wasm matrix entry uses `build-each`; the cross/native container entries use `build-container dataplane`. All entries pass platform, profile, and libc explicitly via `JUST_VARS` so the just default fallbacks don't matter and the row is self-documenting.
`must-build` is added to the final-status aggregator's `needs` list along with a step that flags any matrix-row failure -- so a red must-build row turns the workflow conclusion red even though the job itself only blocks the summary aggregator.
The `# - check # TODO: add dependency before PR lands` line in the job's `needs:` block flags that under the current `check` / `build` implementation those jobs do not exercise cross/wasm, so making `must-build` strictly depend on them would just slow down feedback for the cross builds without buying coverage. Tighten when `check` itself learns to honour the platform/libc axis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The dataplane.tar `buildPhase` unconditionally tarred in `libc.out`
(`pkgs.pkgsHostHost.libc`) and `${pkgs.pkgsHostHost.glibc.libgcc}`.
The latter is a glibc-evaluation reference: under a musl pkgsCross
`pkgs.pkgsHostHost.glibc` triggers nix trying to build glibc against
musl, which is a nonsensical combination and fails outright with
errors from `sysdeps/x86_64/multiarch/strcasecmp_l.c` ifunc redirection
inside `glibc-nolibgcc-x86_64-unknown-linux-musl-2.42-61`.
Gate the libgcc input on `libc == "gnu"`: glibc-dynamic Rust binaries
need `libgcc_s.so.1` for unwinding, but musl Rust targets statically
link musl and Rust's compiler-builtins and have no libgcc_s consumer.
The exact store path of the libgcc input matters: nix's glibc ld-linux
is compiled with `pkgs.pkgsHostHost.glibc.libgcc` (`xgcc-<ver>-libgcc`
on native, `libgcc-<triple>-<ver>` on cross) in its system search list.
Bundling libgcc_s.so.1 from any other store path (e.g.
`pkgs.stdenv.cc.cc.lib`, which has the same content but a different
name) results in the file being present in the container but
unreachable: the loader doesn't search the directory and there's no
RUNPATH to redirect it. Symptom on a binary built that way:
/bin/dataplane: error while loading shared libraries:
libgcc_s.so.1: cannot open shared object file: No such file or
directory
Keep libc.out bundled in both libc variants. The Rust binaries are
static on musl and don't need it, but the busybox we ship alongside
for `/bin/*` shell utilities is dynamically linked against whichever
libc its pkgset uses. On musl-cross-arch the busybox interpreter is
`${pkgs.pkgsHostHost.libc}/lib/ld-musl-<arch>.so.1`; without that store
path inside the tar the applets fail at exec with no usable libc.
Rename the inner `libc` let-binding to `libc-pkg` so the outer
function-arg `libc` (the string "gnu" / "musl" / "unknown") stays
visible inside this scope for the conditional. Add a header comment
on the new `libgcc-tar-input` documenting the path-matching constraint
so the next person doesn't "fix" it to a more-natural-looking
`stdenv.cc.cc.lib`. Add a matching comment on `libc-tar-input`
explaining why busybox makes the bundled libc unconditional.
Also: typo fix on the surrounding comment block (`hazzards` -> `hazards`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The debug-tools container content list referenced `pkgs.libgcc.libgcc`,
which only exists as a passthru on the native gcc derivation
(`outputs = [ "out" "man" "info" "lib" "libgcc" "checksum" ]`). Under a
cross pkgs (`pkgsCross.aarch64-multiplatform`), `pkgs.libgcc` is a
different derivation -- the static cross-libgcc helpers
(`crtbegin.o`/`libgcc.a`/...) with `outputs = [ "out" "dev" ]` and no
`libgcc` output -- so the attribute access raises "attribute 'libgcc'
missing" during eval of debug-tools for any non-native build.
`pkgs.pkgsHostHost.glibc.libgcc` resolves correctly in both contexts and
points at the same store path that the bundled glibc ld-linux has in its
compiled-in library search list (so runtime resolution still works the way
it has been).
Gate the inclusion on `libc == "gnu"` so musl-libc targets don't drag in
glibc.libgcc, which has no purpose there -- musl Rust binaries use the
system llvm libunwind, not libgcc_s.so.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR's LDFLAGS hard-coded `-L${fancy.libgccjit}/lib -latomic`, pointing
at the glibc-targeted libgccjit-bundled `libatomic.so.1`. That's the
right path when targeting glibc but is ABI-wrong (and at the wrong
store path) when targeting musl.
Add `libc` to the overlay's function args (and inherit it through
`default.nix`'s overlays import) so the LDFLAGS can pick a
libc-specific path:
- libc == "gnu": `${fancy.libgccjit}/lib/libatomic.so.1`
The libgccjit overlay still builds with the host stdenv (its
own override has `# TODO: debug issue preventing clang build`),
so its libatomic is glibc-targeted. Same path FRR has always
used.
- libc == "musl": `${stdenv.cc.cc.lib}/${triple}/lib/libatomic.so.1`
The gcc-libs output of the cross-musl gcc that backs the
cross-musl clang stdenv. Note: `stdenv`, *not* `stdenv'` --
the prime is the clang stdenv whose `cc.cc` is clang and
contains no libatomic at all.
- else: throw, to fail loudly on a libc this layout doesn't yet
cover.
Link is dynamic on both sides (the `-L .. -latomic` sits *outside* the
`-Wl,--push-state,...,-Bstatic` block). Single-`.so`-process-wide is
the only ODR-safe choice for libatomic: its `lock_for_pointer` lock
table is per-image, so a static copy inside libfrr.so cannot
synchronize with any other consumer that picks up libatomic.so
dynamically.
Also extend the `preFixup` nuke-refs allowlist to keep the cross-musl
gcc-libs path when targeting musl, so the RUNPATH entry pointing at
the dynamic libatomic.so survives fixup -- without it, nuke-refs
would scrub the entry, the binary's loader couldn't find libatomic,
and `watchfrr` / friends would die at startup with
`Error relocating .../libfrr.so: __atomic_*: symbol not found`.
(`libxcrypt` was already preserved in the working tree's allowlist
shape but missing from this branch -- include it here so the allowlist
matches what's actually linked.)
The same libc-aware nuke-refs treatment applies to `frr-agent`: its
fixupPhase referenced `libgcc.lib`, but under cross pkgs `final.libgcc`
is the static cross-libgcc helpers derivation (outputs `[ "out" "dev" ]`
only -- no `lib` attribute), so the package failed to evaluate for
every non-native target. Move the fixupPhase from the package into
the overlay so `libc` is in scope, gate the libgcc exclusion on
`libc == "gnu"`, and use `pkgsHostHost.glibc.libgcc` -- the path the
bundled ld-linux's compiled-in search list points at, matching the
libgcc handling already in place at the dataplane.tar bundler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
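The libc-dispatched libatomic path selection above can be sketched roughly as follows. This is an illustrative shell rendering of the nix-side logic, not the overlay's actual code; `LIBGCCJIT_LIB`, `GCC_LIBS`, and `TRIPLE` stand in for the nix-interpolated store paths, and the function name is made up.

```shell
# Pick the libatomic.so.1 matching the target libc, failing loudly on
# anything the layout doesn't yet cover (mirrors the overlay's throw).
libatomic_path() {
  case "$1" in
    gnu)  echo "$LIBGCCJIT_LIB/lib/libatomic.so.1" ;;            # glibc-targeted libgccjit bundle
    musl) echo "$GCC_LIBS/$TRIPLE/lib/libatomic.so.1" ;;         # cross-musl gcc-libs output
    *)    echo "unsupported libc: $1" >&2; return 1 ;;
  esac
}
```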
Two unrelated FRR-package cleanups that surface once the overlay starts
exercising libc=musl:
1. `libgccjit` is unconditional in FRR's buildInputs but is only meaningful
on glibc. The overlay's LDFLAGS already gates its `-L${libgccjit}/lib
-latomic` on `libc == "gnu"` and pulls libatomic from the cross-musl
gcc-libs output instead, so on musl libgccjit is just dead weight in
the build closure (and a possible source of glibc-targeted ABI leakage
into the musl runtime image). Gate the buildInput on
`stdenv.hostPlatform.isGnu`.
2. `--enable-grpc=no` / `--enable-protobuf=no` are mis-spelled autoconf
flags. Autoconf's `AC_ARG_ENABLE(grpc, ...)` accepts `--enable-grpc[=ARG]`
and `--disable-grpc`, but `--enable-grpc=no` is parsed as
`enableval=no` for the `grpc` feature -- which most projects handle by
coincidence, but it's not the documented spelling and breaks under
stricter autoconf versions. `--disable-grpc` / `--disable-protobuf`
is the canonical spelling. No runtime-behavior change today, but
the syntax is no longer wrong-looking-but-works-anyway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR was just built with --disable-protobuf, and dplane-plugin does not actually link against protobufc itself, so the buildInput (and the matching call arg) were dead weight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Native cargo invocations from the dev shell (cargo build/clippy/test
--doc) defaulted to the build-host triple while LIBRARY_PATH and
PKG_CONFIG_PATH were already pointing at the cross sysroot. The resulting
links pulled in cross-target libraries against the host rust-std and
failed (e.g. glibc rust-std + musl libc gave undefined references to
`open64`, `fstat64`, ...).
Pin CARGO_BUILD_TARGET to the same triple the sysroot is built for, and
set PKG_CONFIG_ALLOW_CROSS=1 so the Rust pkg-config crate stops refusing
cross-target builds (PKG_CONFIG_PATH already points at the matching cross
sysroot).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
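A minimal sketch of the dev-shell pinning described above, assuming an aarch64 musl sysroot (the triple shown is an example, not necessarily what the shell actually exports):

```shell
# Pin cargo's default target to the triple the cross sysroot was built for,
# so plain `cargo build` / `cargo clippy` stop linking host rust-std
# against cross-target libraries.
export CARGO_BUILD_TARGET="aarch64-unknown-linux-musl"

# Let the Rust pkg-config crate proceed for a non-host target;
# PKG_CONFIG_PATH is assumed to already point at the matching sysroot.
export PKG_CONFIG_ALLOW_CROSS=1
```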
Glibc-targeted Rust binaries unwind through libgcc_s.so.1 (still pulled into the image via glibc.libgcc). Musl and wasi have no libgcc consumer, so panic-unwind has no unwinder unless we ask build-std to pull LLVM's libunwind out of the sysroot. Gate the build-std-features flag on libc != "gnu" so glibc keeps using the libgcc path while musl/wasi get system-llvm-libunwind appended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
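The libc gating described above might look roughly like this; variable names are illustrative, and `panic-unwind` stands in for whatever base build-std feature set the build already passes:

```shell
libc="musl"               # "gnu", "musl", or the wasi tag, per target
features="panic-unwind"   # base build-std feature set (illustrative)

# glibc targets unwind via the libgcc_s.so.1 already shipped in the image;
# musl/wasi need std's system-llvm-libunwind feature to get an unwinder.
if [ "$libc" != "gnu" ]; then
  features="$features,system-llvm-libunwind"
fi

printf -- '-Zbuild-std-features=%s\n' "$features"
```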
Extract a generic aarch64 platform definition (march=generic, no march or mcpu overrides) and refactor bluefield2 as a recursiveUpdate of it to retain its armv8.2-a / cortex-a72 specialization. Lets cross-platform CI exercise a vanilla aarch64 build without depending on the bluefield-specific CPU/march flags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wire a `cargo check` derivation per workspace package as `check.<name>`, mirroring the existing `workspace.<name>` build derivations. The point is to give CI a fast, no-link verification path -- particularly useful for wasm32-wasip1 where a full container build is overkill but we still want a build-error gate. Add matching `just check` (single package) and `just check-each` (all packages) recipes that drive the new derivation through the standard nix build pipeline. Also rearrange the `inherit` list so `check`/`dataplane` land in alphabetical order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lets contributors execute the wasm32-wasip1 binaries we already build without leaving the dev shell: - wasmtime in the dev-shell tooling - `[target.wasm32-wasip1] runner = "wasmtime run --dir ."` so plain `cargo run`/`cargo test` against the wasm target invokes wasmtime automatically, with the cwd bound in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The single must-build matrix was conflating two different jobs:
a fast wasm sanity check and a slow cross-arch container build sweep.
Split them along that line:
wasm: same triggers as must-build (devfiles, tags, dispatch),
runs the wasm32-wasip1 `check` recipe only. Keeps wasm verification
as a blocker for any PR that touches dev files. Recipe is shaped as
a `{name, args}` object so the matrix axis carries recipe+args as a
single tuple rather than as parallel axes.
cross: {aarch64, bluefield3} x {gnu, musl} x {debug} x
{build-container dataplane, build-container frr.dataplane}. Covers
both the dataplane container and the FRR sidecar, the two images
most affected by this PR's FRR overlay / dplane-plugin / dplane-rpc
cross-arch fixes. Release profile is dropped to keep the leg count
flat while picking up FRR; cross-arch build problems surface in
debug just as well. Gated narrowly (PR label `ci:+cross`, or
push/merge_group).
Advisory failure handling: job-level `continue-on-error: true`. A
failing leg still reports `failure` to its own conclusion -- the job
badge goes red, the build step's `failure()` predicate fires (so the
shared `*tmate` anchor still opens a session on workflow_dispatch
debug), and the leg is visible in the run summary. The workflow's
overall conclusion ignores the failure, so cross is non-blocking.
`needs.cross.result` reports `success` to dependents under job-level
CoE, so the summary aggregator does not (and cannot) gate on cross;
treat cross as purely informational.
Add `check` as a hard dependency so we don't burn a full cross build
on code that won't even type-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`docker import` reads a raw rootfs tar (no manifest, no platform metadata).
Without an explicit --platform, the imported image is tagged with the host
runner's architecture even when the rootfs is for a different one, which
silently mislabels cross-built images and would mislead anything that
consumes them by platform-aware pull or digest.
Map the just `platform` variable (aarch64 / bluefield2 / bluefield3 /
x86-64-v3 / x86-64-v4 / zen3 / zen4 / zen5) to the docker linux/{arm64,amd64}
pair before calling import. Fail loudly on an unmapped platform rather
than silently fall back to host arch.
Only the `dataplane` arm of `build-container` is affected; the other arms
use `docker load` on real image tarballs (which carry their own platform
metadata) and do not need --platform.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
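The platform-to-docker mapping above can be sketched as a shell case statement; the function name and exact recipe shape are illustrative, not the justfile's actual text:

```shell
# Map the just `platform` value onto docker's linux/{arm64,amd64} pair,
# failing loudly on an unmapped platform instead of silently falling back
# to the host runner's architecture.
docker_platform() {
  case "$1" in
    aarch64|bluefield2|bluefield3)      echo "linux/arm64" ;;
    x86-64-v3|x86-64-v4|zen3|zen4|zen5) echo "linux/amd64" ;;
    *) echo "unmapped platform: $1" >&2; return 1 ;;
  esac
}

# usage (sketch): docker import --platform "$(docker_platform "$platform")" rootfs.tar ...
```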
`nuke-refs` is generated by `replaceVarsWith` that bakes a perl path
into the script. Under a cross pkgset `prev.nukeReferences` (and a
plain `nukeReferences` argument grabbed without splicing) substitutes
a target-arch perl: e.g. `perl-aarch64-unknown-linux-gnu-5.42.0` for
an aarch64 cross. Both `frr-build`'s preFixup and `frr-agent`'s
fixupPhase invoke `nuke-refs` on the build host during this
derivation's own build, where the kernel cannot execute target-arch
ELF and the build dies with:
nuke-refs: ...-perl-aarch64-...-5.42.0/bin/perl:
cannot execute binary file: Exec format error
Reach into `final.buildPackages.nukeReferences` in both spots so the
substituted perl is the build-host one. Verified by building
`containers.frr.dataplane` for aarch64/gnu, which the cross matrix
added in this stack exercises.
The sibling `removeReferencesTo` is shell-only -- it doesn't bake in
perl, and the `@shell@` substituted into the script is
`stdenvNoCC.shell`, which resolves to build-host bash on every
pkgset. No equivalent fix is needed there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
No need to bother with cross compile until miri passes. Best not to overload the CI, and miri failures will often indicate that cross is going to be pointless. Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The `libc` string is a tag carried through the build for derivation naming and platform dispatch. For `wasm32-wasip1` it was set to `"unknown"`, which read as "we don't know what libc this is" rather than the truth: there is no libc -- WASI Preview 1 provides its own system interface and the Rust target name uses the `wasi` env field, not a libc. Use `"none"` so the tag describes the actual situation. Mechanical rename across `default.nix` (comment), `justfile` (`libc` default for `wasm32-wasip1`), `nix/platforms.nix` (`wasm32.wasip1.<libc>` attr key), and `.github/workflows/dev.yml` (matrix `libc` value). No build-output change for non-wasm targets. Signed-off-by: Daniel Noland <daniel@githedgehog.com>
nixpkgs' musl `rustPlatform` (which `frr-agent` is currently built
with) emits binaries with `DT_NEEDED libgcc_s.so.1` and
`_Unwind_*@GCC_*` references for stack unwinding. Our musl FRR
image has no `libgcc_s.so.1` -- the `frr-agent` overlay's
`fixupPhase` deliberately nukes the libgcc store-path on musl -- so
the musl loader bails before `main`:
Error relocating /bin/frr-agent: _Unwind_Resume: symbol not found
Error relocating /bin/frr-agent: _Unwind_Backtrace: symbol not found
...
Force `+crt-static` on the musl variant of `frr-agent` so the C
runtime (and with it `libgcc_eh.a`) is linked statically. The
resulting binary is `static-pie linked` with no `_Unwind_*` undefs
and no `NEEDED` entries, and the container starts.
The accompanying comment on the `fixupPhase` is also corrected: the
prior claim that "musl Rust uses llvm-libunwind and has no libgcc_s
consumer" was wrong for the way `frr-agent` is currently built --
that's the path the rest of the workspace takes via crane +
`-Zbuild-std=...,system-llvm-libunwind`, not the path `frr-agent`
takes through nixpkgs' `rustPlatform.buildRustPackage`.
TODO: switch `frr-agent` onto the workspace's crane +
`system-llvm-libunwind` build path so it pulls LLVM libunwind
statically like every other Rust binary in this repo, rather than
relying on libgcc_eh from the gcc backing the cross-musl clang
stdenv.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
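One way the `+crt-static` forcing might be expressed, as a hedged sketch; the overlay's actual mechanism (e.g. where in `buildRustPackage` the flag is injected) may differ:

```shell
# Force static linking of the C runtime -- and with it libgcc_eh.a -- for
# the musl frr-agent build, so no DT_NEEDED libgcc_s.so.1 survives into
# the binary and the musl loader has nothing to fail relocating.
export RUSTFLAGS="${RUSTFLAGS:+$RUSTFLAGS }-C target-feature=+crt-static"
```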
I think it works now?