
asan again (ignore for the moment, just checking CI)#1529

Draft
daniel-noland wants to merge 58 commits into main from
pr/daniel-noland/asan-again

Conversation

@daniel-noland
Collaborator

No description provided.

daniel-noland and others added 30 commits May 11, 2026 08:47
The test spawned two worker threads but never joined them, so any panic
in a worker was silently dropped instead of failing the test. Joining
the handles surfaces the second thread's panic: it tried to allocate
NAT for 2.0.1.3, an IP belonging to an expose without stateful NAT
configured, so no pool exists and the allocator returns Denied. Switch
the second source to 1.2.0.0 (in expose1, which has stateful NAT) so
each thread hits a distinct pool and the test does what it claims to.
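
The failure mode above can be sketched in plain std (a minimal illustration, not the test itself): a panic in a spawned worker only surfaces if the handle is joined.

```rust
use std::thread;

fn main() {
    // A panic in a spawned thread does not propagate to the spawner by
    // itself; it is captured inside the JoinHandle's result.
    let worker = thread::spawn(|| {
        panic!("allocation denied");
    });

    // Without this join, a test harness never observes the panic and the
    // test passes vacuously. Joining converts the panic into an Err that
    // can be asserted on (or re-raised with unwrap()).
    let result = worker.join();
    assert!(result.is_err(), "worker panic must surface at the join point");
}
```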

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lift the concurrent_allocations test body out of tests_shuttle into a
sibling tests module so it can be exercised under different shuttle
schedulers without duplication. Extract shuttle_config() and add
run_shuttle_pct, then run the same fixture under both RandomScheduler
(test_concurrent_allocations_shuttle_random) and PctScheduler
(test_concurrent_allocations_shuttle_pct). PCT explores schedules with
bounded preemptions, complementing random's coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
UnicastIpv4Addr::new rejects both multicast and broadcast addresses,
so the InvalidSourceAddr fuzz-test branch can legitimately observe a
broadcast source. Update the assertion to allow either, and expand
the error text and doc comment to describe both rejections accurately.
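
A minimal sketch of the two rejection cases, using std's `Ipv4Addr` predicates (the `unicast_checked` name is illustrative, not the real `UnicastIpv4Addr::new`):

```rust
use std::net::Ipv4Addr;

fn main() {
    // Hypothetical stand-in for UnicastIpv4Addr::new: reject a source
    // that is multicast OR the limited-broadcast address -- the two
    // cases the fuzz assertion must now allow.
    fn unicast_checked(ip: Ipv4Addr) -> Result<Ipv4Addr, String> {
        if ip.is_multicast() {
            Err(format!("{ip} is multicast, not unicast"))
        } else if ip.is_broadcast() {
            Err(format!("{ip} is broadcast, not unicast"))
        } else {
            Ok(ip)
        }
    }

    assert!(unicast_checked(Ipv4Addr::new(224, 0, 0, 1)).is_err()); // multicast
    assert!(unicast_checked(Ipv4Addr::new(255, 255, 255, 255)).is_err()); // broadcast
    assert!(unicast_checked(Ipv4Addr::new(1, 2, 3, 4)).is_ok());
}
```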

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Allows PciEbdf to be embedded in rkyv-archived types (notably
args::PortArg::PCI), which a later commit migrates from the
hardware crate's structured PciAddress to this string-based form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Two bugs in the same generator caused PciEbdf::try_new to reject
every value the bolero TypeGenerator produced:

  - The separator between bus and device was '.', but EBDF format
    is domain:bus:device.function (e.g. 0000:01:00.0). Generated
    strings looked like 0000:01.00.00 and were rejected on layout.
  - The function field was formatted with `{function:02x}`, always
    two hex digits, while try_new requires exactly one (PCI
    function is a 3-bit field, 0..=7). Even after the separator
    fix, the function width alone would have kept every generated
    string rejected.

Switch to the correct separator and mask function to 0x07 before
formatting it as a single hex digit, so the generator covers the
parser's accepted input space.
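
The two format-string bugs can be shown side by side (a sketch of the shapes described above; the real generator is a bolero `TypeGenerator`):

```rust
fn main() {
    let (domain, bus, device, function): (u16, u8, u8, u8) = (0x0000, 0x01, 0x00, 0x0a);

    // Buggy shape: '.' between bus and device, and a two-digit function.
    let broken = format!("{domain:04x}:{bus:02x}.{device:02x}.{function:02x}");
    assert_eq!(broken, "0000:01.00.0a"); // rejected on layout by try_new

    // Fixed shape: domain:bus:device.function, with function masked to
    // its 3-bit range and printed as a single hex digit.
    let fixed = format!("{domain:04x}:{bus:02x}:{device:02x}.{:01x}", function & 0x07);
    assert_eq!(fixed, "0000:01:00.2");
}
```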

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Decouple args, config, and k8s-intf from the hardware crate by
swapping the structured PciAddress newtype for the string-based
PciEbdf already exposed by net::pci. PciEbdf::try_new validates the
EBDF format on construction so call sites remain checked at the API
boundary; downstream consumers continue to round-trip through
to_string().

This removes hardware as a build-time dependency from these crates,
which is needed so that they (and their dependents) can be exercised
under miri -- hardware pulls in OS-specific machinery (sysfs,
hwlocality, dpdk-sysroot-helper) that miri can't run.

Dropping the hardware dep also severs an implicit chain that
enabled tracing/attributes for args via workspace-wide feature
unification (args -> hardware -> sysfs -> tracing/attributes), so
declare tracing/attributes explicitly on args/Cargo.toml since
args/src/lib.rs uses #[instrument]. Without this, `just test-each`
(which builds each package in isolation, bypassing workspace
unification) would fail to compile args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Both crates ship code (FFI, asm, custom unwind) that miri can't
execute on the powerpc64 target we use for memory-model coverage.
Move the loom and shuttle entries of flow-entry, nat, and quiescent
into [target.'cfg(not(miri))'.dependencies] / dev-dependencies so
the workspace stays buildable under miri without dropping those
crates from the run.

Add a compile_error! in concurrency/src/lib.rs to surface a clear
diagnostic if someone explicitly enables shuttle or loom while
running under miri (the metadata-driven exclude path in miri.just
already prevents this combination, but the explicit error gives a
better failure mode for ad-hoc invocations). The concurrency crate
itself does not move its loom/shuttle dependencies here -- they
remain optional in the regular [dependencies] section, gated only
by the feature flags.
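
The guard has roughly this shape (an assumed sketch, not the exact source): it is inert in ordinary builds and only fires when miri and either feature coincide.

```rust
// Fail the build loudly if loom or shuttle is enabled while compiling
// under miri; this combination cannot execute and should not get as far
// as a confusing runtime failure. (Illustrative message text.)
#[cfg(all(miri, any(feature = "loom", feature = "shuttle")))]
compile_error!("loom/shuttle cannot run under miri; run them natively");

fn main() {
    // Compiles to nothing in a normal (non-miri) build.
    println!("guard inactive");
}
```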

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a hidden _strict_provenance feature that selects the Mutex<Arc<T>>
implementation of Slot even outside loom/shuttle. Used by the miri job
to exercise this crate under -Zmiri-strict-provenance, which fails
against arc-swap because its hazard-pointer machinery doesn't yet
expose provenance information in the standard form. The fallback
implementation gives us provenance coverage on the QSBR protocol
without changing observable behavior.
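
The fallback's shape is roughly this (a hedged sketch -- the real `Slot` and its API live in the quiescent crate; only the `Mutex<Arc<T>>` idea is taken from the message above):

```rust
use std::sync::{Arc, Mutex};

// Fallback Slot: same observable load/store semantics as the arc-swap
// fast path, but built only from std pointers, which strict-provenance
// miri can reason about.
struct Slot<T>(Mutex<Arc<T>>);

impl<T> Slot<T> {
    fn new(value: T) -> Self {
        Slot(Mutex::new(Arc::new(value)))
    }
    // Readers clone the Arc under the lock; no hazard pointers involved.
    fn load(&self) -> Arc<T> {
        Arc::clone(&self.0.lock().unwrap())
    }
    // Writers swap in a new Arc; earlier readers keep their clones alive.
    fn store(&self, value: T) {
        *self.0.lock().unwrap() = Arc::new(value);
    }
}

fn main() {
    let slot = Slot::new(1u32);
    let before = slot.load();
    slot.store(2);
    assert_eq!(*before, 1); // earlier readers still see the old value
    assert_eq!(*slot.load(), 2);
}
```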

Collapse the two cfg-gated mod imp { ... } blocks into a single
cfg_select! at module scope so adding a new selector only requires
extending the predicate, not introducing a third nested module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
These tests interact with resources miri's interpreter does not
provide:

- cli::cliproto::test_communications and routing::router::rio's two
  tests bind Unix domain sockets at /tmp paths.
- routing::frr::test::test_fake_frr_agent stands up a fake FRR agent
  over Unix sockets.
- routing::atable::resolver::test_adjacency_resolver reads
  /proc/net/arp and queries kernel interfaces.
- k8s_less::local::test_kubeless drives an inotify watcher; miri
  cannot emulate inotify even with isolation disabled.
- flow_entry::flow_table::table::test_flow_table_flow_invalidation
  mixes tokio::time::Instant::from_std on a wall-clock std deadline
  with a tokio::time::sleep call counted in real seconds, so a 4 s
  sleep under miri's slow interpreter elapses past the "extended"
  flow's deadline.

For nat::stateful::test::test_full_config_unidirectional_nat_overlapping_destination
the only miri-incompatible piece is the tracectl::get_trace_ctl()
setup -- linkme's distributed_slice uses link_section, which miri
cannot load. Skip just that block under cfg(not(miri)) instead of
ignoring the whole test; the tracing macros at use sites are no-ops
when no targets are registered, so the rest of the test runs fine
on the interpreter.
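
The gating pattern looks roughly like this (names are illustrative; `init_tracing` stands in for the `tracectl::get_trace_ctl()` setup):

```rust
fn main() {
    // Gate only the miri-incompatible setup, not the whole test.
    #[cfg(not(miri))]
    {
        init_tracing();
    }

    // The rest of the test body runs under miri and natively alike;
    // tracing macros are no-ops when no targets are registered.
    assert_eq!(2 + 2, 4);
}

#[cfg(not(miri))]
fn init_tracing() {
    // stands in for linkme/link_section machinery miri cannot load
}
```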

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
miri's interpreter is roughly 100x slower than native execution, so
property tests and concurrency stress tests that run thousands of
iterations are impractical to run unmodified. Use cfg_select! to
shrink loop counts and iteration budgets when miri is the active
target, keeping the same coverage shape but with budgets that finish
in seconds rather than hours:

- flow-entry::flow_table::nf_lookup: 1000 -> 10 packets
- net::packet::hash::test_hash_bounds: 2000 -> 20 packets
- routing::fib::test_concurrency_fib*: 100_000 -> 50 packets
- routing::fib::test_*_table iterations: 1000/5_000 -> 50
- left-right-tlcache::test_readhandle_cache: 1000 -> 10 handles
- k8s-intf::bolero::support: bolero iterations 1000 -> 3 and drop
  the largest UNIQUE_COUNTS entry (100); also skip
  test_unique_v6_cidr_generator under miri (still too slow even at 3
  iterations because the mask range is 0..=128).
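
A simpler stable spelling of the same budget shrink (the tree itself uses `cfg_select!`, per the message above) picks the iteration count at compile time:

```rust
// Full budget natively; a budget that finishes in seconds under miri's
// ~100x-slower interpreter.
const PACKETS: usize = if cfg!(miri) { 10 } else { 1000 };

fn main() {
    let mut processed = 0;
    for _ in 0..PACKETS {
        processed += 1; // stands in for generating and checking one packet
    }
    assert_eq!(processed, PACKETS);
}
```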

Also gate serial_test in left-right-tlcache to cfg(not(miri)) since
the crate's serial coordination uses OS-level locking that miri
doesn't model meaningfully and would only add overhead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Threads a `nightly` argument through default.nix into the LLVM
overlay so that callers can request the rust nightly channel with
the miri component installed. When `nightly = "true"` the
rust-toolchain pulls "nightly" with miri added to its component
list; otherwise the pinned stable channel is used unchanged. Also
sort the components alphabetically so future additions diff
cleanly.

Disable cargo-llvm-cov's test suite while we're in the overlay
file; its tests fail on the nightly channel for reasons unrelated
to this project, and we never use them anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wires up `just miri::test` (and `just miri` via [default]) which:
- Drops into a nix-shell with the nightly+miri toolchain active.
- Sets MIRIFLAGS for the desired interpreter knobs (compare-exchange
  weak failure rate, isolation, many-seeds sweep, preemption rate,
  symbolic alignment check, provenance mode, optional stacked-borrows
  disable).
- Optionally sets RUSTFLAGS for layout randomization.
- Invokes `cargo miri nextest run --profile=miri --target=$target`
  for the configured CPU (default powerpc64 -- weak memory model and
  big-endian, ideal for surfacing concurrency and endianness bugs).

The runner consults a workspace.metadata.package table in the root
Cargo.toml that lists which packages are not buildable under miri
(dataplane, dpdk*, hardware, init, interface-manager, mgmt, sysfs,
tracectl). When invoked with no package selector the runner expands
those into --exclude= flags via tomlq so `--workspace` does the
right thing without an extra argument list to maintain.

Threads the additional --argstr flags through the existing
setup-roots and shell recipes so they accept the same surface as
build, and adds a `miri` profile to nextest with relaxed thread
budget and 500s slow-timeout (miri tests are heavy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Run `just miri::test` on devfile changes, version tags, and manual
dispatch. The job is gated to powerpc64 only -- weak memory model
and big-endian, catches the most concurrency/endianness bugs --
with other CPU rows commented out for targeted debugging.

The aggregator's check emits ::warning:: rather than ::error:: + exit 1,
so a miri failure surfaces in the run summary but does not block the
PR's overall status. max-parallel is set to 1 because miri is
memory-hungry.

Three steps per matrix entry:
1. The full workspace under miri.
2. dataplane-quiescent under permissive provenance with extra random
   seeds. This is the configuration that exercises the real arc-swap
   implementation, complementing what shuttle/loom can model.
3. dataplane-quiescent under strict provenance via the
   _strict_provenance feature, which switches Slot to the Mutex
   fallback so provenance checks can run cleanly.

Also flatten the existing `features` matrix into pure `include` rows
that each carry their own `build:` reference. The previous form
combined a `build:` axis with a `features:` axis under cross-product
semantics, then added `test_package` via include; the flattened form
is more legible without changing what gets executed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The previous assertion sampled Instant::now() at assertion time and
compared the remaining lifetime against the configured timeout minus
a 2-second tolerance. That meant the test was implicitly asserting
"the framework finishes the rest of the function in under 2 seconds",
which is true natively but false under miri's interpreter.

Switch to a wall-clock-independent form: snapshot Instant::now()
immediately before the process_packet call that is expected to
extend the timer, then assert that expires_at >= snapshot + timeout.
reset_expiry_unchecked sets expires_at = Instant::now() + duration
internally, so this lower bound holds whenever the timer was
extended at all -- and a regression that fails to extend it would
violate the bound regardless of how slow execution is.

Apply the same pattern to the second assertion (against
DEFAULT_ESTABLISHED_TOUT_UDP) since it had the same shape with a
larger 5-second tolerance.
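
The before/after shapes can be sketched with std `Instant` (the `reset_expiry` closure is an illustrative stand-in for `reset_expiry_unchecked`):

```rust
use std::time::{Duration, Instant};

fn main() {
    let timeout = Duration::from_secs(30);

    // Stand-in for reset_expiry_unchecked, which internally sets
    // expires_at = Instant::now() + duration.
    let reset_expiry = |d: Duration| Instant::now() + d;

    // Old shape (fragile): sample now() *after* the call and require the
    // remaining lifetime to sit within a fixed tolerance -- implicitly
    // asserting "the test body runs fast", which miri violates.
    //
    // New shape: snapshot *before* the call, then assert a lower bound
    // that holds no matter how slowly execution proceeds.
    let snapshot = Instant::now();
    let expires_at = reset_expiry(timeout); // the call expected to extend the timer
    assert!(
        expires_at >= snapshot + timeout,
        "timer was not extended by at least the configured timeout"
    );
}
```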

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
PCI device is 5 bits and function is 3 bits on the wire, but the
PciEbdf parser only enforced syntax -- "0000:00:ff.f" would parse
successfully despite both fields being unrepresentable. Validate
the numeric ranges in try_new via new MAX_DEVICE (0x1f) and
MAX_FUNCTION (0x07) constants, and update the bolero generator to
mask both fields so generated values stay in range.
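
A hedged sketch of the range check (the real `PciEbdf::try_new` lives in `net::pci`; this reimplements only the device/function bounds for illustration):

```rust
const MAX_DEVICE: u8 = 0x1f;   // device is a 5-bit field on the wire
const MAX_FUNCTION: u8 = 0x07; // function is a 3-bit field on the wire

fn check_ebdf(s: &str) -> Result<(), String> {
    // Expect domain:bus:device.function, e.g. "0000:01:00.0".
    let (rest, function) = s.rsplit_once('.').ok_or("missing '.'")?;
    let mut parts = rest.split(':');
    let (_domain, _bus, device) = (
        parts.next().ok_or("missing domain")?,
        parts.next().ok_or("missing bus")?,
        parts.next().ok_or("missing device")?,
    );
    let device = u8::from_str_radix(device, 16).map_err(|e| e.to_string())?;
    let function = u8::from_str_radix(function, 16).map_err(|e| e.to_string())?;
    if device > MAX_DEVICE {
        return Err(format!("device {:#x} exceeds {:#x}", device, MAX_DEVICE));
    }
    if function > MAX_FUNCTION {
        return Err(format!("function {:#x} exceeds {:#x}", function, MAX_FUNCTION));
    }
    Ok(())
}

fn main() {
    assert!(check_ebdf("0000:01:00.0").is_ok());
    // Syntactically valid but numerically unrepresentable, as above.
    assert!(check_ebdf("0000:00:ff.f").is_err());
}
```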

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Three tweaks to the miri runner:

* nextest's `threads-required` for the miri profile was hardcoded
  to 8 in `.config/nextest.toml`. With `-Zmiri-many-seeds` packing
  $seeds seeds into a single test process, that fixed budget
  oversubscribes the box when $seeds exceeds 8 (and wastes
  capacity when it is smaller). Move the value out of the static
  config: at run time, mktemp a copy of the toml and tomlq in
  `threads-required = $seeds`, then pass `--config-file`. The
  checked-in miri profile no longer pins the value.

* `layout_seed` was hardcoded to "1", which meant every miri run
  exercised the same struct layout and let layout-sensitive bugs
  hide behind whatever ordering "1" happens to produce. Derive it
  from `git rev-parse HEAD` so each commit gets a different layout
  for free, with no manual seed bumping.

* Wrap `cargo miri` in `nice -n 19` so a 64-seed run stays out of
  the way of foreground work on the same machine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`test_flow_table_timeout` set a 2-second deadline against
`std::Instant::now()` and then `tokio::sleep`d 1s and 2s of wall
time. Under miri the interpreter is so much slower than wall
time that by the time the first 1s sleep returns, real time has
already elapsed past the 2s deadline, the per-flow timer task
has fired and removed the entry, and the "still present after 1
second" assertion fails. Increasing the seed count made this
reliably reachable rather than rare.

Switch to `#[tokio::test(start_paused = true)]` so the timer
task's `sleep_until` and the test's `sleep`s both run on tokio's
virtual clock, and capture `now` from `tokio::time::Instant`
instead of `std::Instant` so the deadline is anchored on the
same virtual baseline. With paused time a `std::Instant::now()`
in the test body still returns real wall time, which under miri
sits many real seconds past the paused virtual baseline; the
test's virtual sleeps would never advance virtual time far
enough to reach a deadline computed that way, and the entry
would still be present at the second assertion. Both ends of
the test now run on the same time base.

Add the `test-util` feature to the tokio dev-dep so the
`start_paused` attribute is available.

Same root cause as the existing `cfg_attr(miri, ignore)` on
`test_flow_table_flow_invalidation`; that test stays ignored
under miri for now since it exercises additional behaviour
(`extend_expiry`) that this fix does not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There actually isn't a rule that says the scheduler will always
give you a TCP or UDP or ICMP packet in a given run. For instance,
this fails under miri because the number of runs is much lower and
the random schedule is deliberately unfair.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
There is no rule that the fuzzer will always give you both IPv4
and IPv6.  This tends to fail under miri because

1. the iteration count is far lower
2. the schedule is not fair (intentionally)

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Same wall-clock-vs-virtual-clock mismatch as test_flow_table_timeout.
flow_2's deadline is set to `Instant::now() + 2s + 1min`, then the
test sleeps 3 tokio seconds and asserts flow_1 expired (active_len
== 1). Under miri the entire test takes ~90s of wall time, so by
the time the assertion runs flow_2's 62-second std-Instant deadline
has elapsed too and its timer task has expired it as well, leaving
active_len == 0.

Switch to `#[tokio::test(start_paused = true)]` so the per-flow
timers' `sleep_until` and the test's `sleep` share tokio's virtual
clock, and capture `now` from `tokio::time::Instant` so flow_2's
extended deadline is anchored on the same virtual baseline that
the timer task evaluates against.

extend_expiry_unchecked uses fetch_add on expires_at (no
Instant::now() call internally), so the 1-minute extension stays
relative to the virtual baseline once `now` is virtual.
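
That last property can be sketched with an atomic deadline (an assumed representation -- nanoseconds in an `AtomicU64` -- used only to illustrate why `fetch_add` stays anchored to whatever baseline the stored value used):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // Deadline stored as nanoseconds past some baseline: here, 2s.
    let expires_at_nanos = AtomicU64::new(2_000_000_000);

    // fetch_add extends the deadline relative to the stored value, with
    // no Instant::now() call involved -- so a virtual-clock baseline
    // stays virtual after the extension.
    let one_minute = 60_000_000_000u64;
    expires_at_nanos.fetch_add(one_minute, Ordering::Relaxed);

    assert_eq!(expires_at_nanos.load(Ordering::Relaxed), 62_000_000_000);
}
```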

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Now that miri is run under nice there is little or no advantage
in delaying the run.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Add a Miri section to testing.md covering:

* what `just miri::test` is and how to run it (whole workspace, a
  specific test, with a larger seed fan-out);
* the configurable knobs the recipe exposes via `just miri::<name>=<value>`
  (cpu, seeds, schedule_seed, provenance, stacked_borrow_check,
  preemption_rate, weak_failure_rate, randomize_struct_layout,
  layout_seed) and their defaults;
* the workspace.metadata.package miri-exclusion list and how the
  runner threads it into `--exclude=` flags automatically;

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
From this commit forward, a miri failure is a regression on equal
footing with any other CI failure. The expectation is to fix the
underlying issue, not suppress the signal. If a future change
demonstrates that miri genuinely cannot exercise some piece of
code -- e.g. a syscall or hardware interaction the interpreter
does not model -- the resolution is to document that reason
explicitly:

  - (preferred) Gate the specific test or item with `#[cfg(miri)]`
    (or `#[cfg_attr(miri, ignore = "...")]`) and a comment naming
    the upstream limitation or unsupported operation.
  - (if you must) Add the package to workspace.metadata.package in
    the root Cargo.toml with `miri = false` and a comment citing the
    reason.

Note that some tests, or even whole crates, are just genuinely
hopeless under miri.  That's fine.  But if so, it should be
documented, and it should not be the first answer you reach for.

The "if you must" option should be reserved for situations where
miri just can't compile the crate.  For instance, the dpdk-sys crate
is well and truly hopeless (and pointless) under miri.  That's not
a flaw in the crate, and it can be exempted with a shrug.  The
hardware crate is in a similar situation.  Miri simply does not
have virtual hardware.

More of a middle ground is the router crate.  That largely passes
and works under miri, but has a few tests which use unix sockets.
Those won't work and that's fine.  Just ignore those tests under miri.
Exempting the whole crate from miri would not be a reasonable solution.
Reasons like "this test is flaky under miri" do not qualify (finding
flaky tests is part of the point of this effort).

More rarely, some tests take a genuinely excessive amount of time
under miri.  Even that is rarely a valid reason to exempt: more
often you really can just reduce an iteration count or similar as a
condition of the miri build.  Sometimes that is impractical, or
defeats the point of the test.  Exempting in that case is valid,
but it really should be rare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
crane was imported with no pkgs argument, falling back to its own
`import <nixpkgs> { }` default in `npins/crane/default.nix`.  That makes
`craneLib.buildPackage` / `mkCargoDerivation` build their derivations
against the *host* nixpkgs's stdenv, regardless of what stdenv the
project's `pkgs` (a `pkgsCross.<arch>`) provides.

Symptom under cross compile (e.g. `just platform=bluefield3 build-each`):
aws-lc-sys's build script invokes `cc` to assemble aarch64 `.S` files
because the surrounding `CC` env was the build-host x86_64 gcc, not the
target's `aarch64-unknown-linux-gnu-gcc`.  The link step rejects every
aarch64 mnemonic in `p521_jscalarmul.S` and friends:

    cargo:warning=Error: no such instruction: `ldp x4,x3,[x2,...`

Verified via `nix eval`: with `pkgs` passed in, the resulting derivation
has `stdenv.hostPlatform = aarch64-unknown-linux-gnu` and crane's
`mkCrossToolchainEnv` correctly exports `CC_aarch64_unknown_linux_gnu`
etc. to the cross toolchain so cc-rs picks it up.

No effect on native (libc=gnu, x86-64-v3) builds where buildPlatform
equals hostPlatform; the cross stdenv is the native stdenv in that
configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`is-cross-compile` was comparing `pkgs.stdenv'.hostPlatform` against
`ctarget`.  In a cross stdenv those are the *same* value (hostPlatform
is the target architecture, and ctarget is derived from
targetPlatform.rust.rustcTarget which equals hostPlatform.rust.rustcTarget
in a non-cross-compiler stdenv).  The check was therefore always false
under cross, and the `cxx` / `strip` / `objcopy` paths that depend on it
fell through to the non-cross-prefixed binary names.

Symptom under bluefield3 cross: the linker arg
`-Clinker=${pkgs.pkgsBuildHost.llvmPackages'.clang}/bin/clang++` resolved
to a wrapped-cross-clang directory that contains only the prefixed
`aarch64-unknown-linux-gnu-clang++`, not the bare `clang++`:

    error: linker `/nix/store/...-aarch64-unknown-linux-gnu-clang-wrapper-22.1.5/bin/clang++` not found

`buildPlatform` is the right comparison: cross-compile is defined by
"build host produces output for a different host", which is exactly
buildPlatform != hostPlatform.  Comparing buildPlatform to ctarget gives
true when actually cross-compiling (e.g. x86_64 -> aarch64) and false
for native and wasm (where build is the same as target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`pkgs.pkgsBuildHost` is the splice variant for tools that *run on*
the build host and *target* the host -- i.e. cross compilers.  Under
a cross `pkgs` (`pkgsCross.<arch>` for libc=musl or platform=bluefield3),
that means `pkgsBuildHost.pkg-config` resolves to
`x86_64-unknown-linux-musl-pkg-config-wrapper-0.29.2`, whose bin/
contains only the prefixed binary `x86_64-unknown-linux-musl-pkg-config`
-- no plain `pkg-config`.

Cargo build scripts (hwlocality-sys, etc.) shell out via `Command::new
("pkg-config")` and crash because the unprefixed binary isn't on PATH:

    error: failed to run custom build command for `hwlocality-sys`
    ... The pkg-config command could not be found.

Same trap applies to `clang` and other unprefixed helpers in the
dev-shell symlinkJoin.

`pkgsBuildBuild` is "runs on build, targets build" -- exactly what
native dev tooling wants.  Under a native pkgs (libc=gnu host arch)
the two splice variants are the same derivation, so this is a no-op
for the current default build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The libcap overlay forces `CC:=clang` via makeFlags, which overrides
the cross stdenv's wrapped CC (set to e.g.
`x86_64-unknown-linux-musl-clang` by the stdenv setup hooks).  An
unprefixed `clang` in a cross-stdenv build environment resolves to
the *unwrapped* clang binary, which has no `-isystem` paths set up
for the target sysroot:

    clang -O2 ... _makenames.c -o _makenames
    fatal error: 'stdio.h' file not found
    make[1]: *** [Makefile:83: _makenames] Error 1

For native gnu (`pkgsCross.gnu64`) the cc.targetPrefix is empty and
the wrapped clang is named just `clang`, so `CC:=clang` happens to
hit the wrapper.  For musl or aarch64, the wrapped clang has a
prefix and the bare `clang` is the unwrapped binary.

`${final.stdenv'.cc.targetPrefix}clang` resolves to the right wrapper
in all configurations: `clang` natively, `x86_64-unknown-linux-musl-clang`
under musl, `aarch64-unknown-linux-gnu-clang` under bluefield3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
libcap's Makefile compiles a helper binary `_makenames` and *executes*
it during the build to generate `cap_names.h`.  Make.Rules defaults
`BUILD_CC ?= $(CC)`, so without an explicit BUILD_CC the helper gets
built with whatever CC is, which under cross stdenv is the wrapped
cross-clang.  For same-arch cross (x86_64 -> x86_64-musl) the produced
binary happens to run on the build host; for cross-arch it does not:

    ./_makenames > cap_names.h
    /bin/bash: ./_makenames: cannot execute binary file: Exec format error
    make[1]: *** [Makefile:86: cap_names.h] Error 126

`pkgsBuildBuild.llvmPackages'.clang` is the clang that runs on, and
targets, the build host -- exactly what BUILD_CC should be regardless
of the target architecture.  Pinning the absolute path avoids the
ambiguity around which `clang` PATH resolves to during a libcap build.

Note: `pkgsBuildHost` would be the cross compiler (runs on build,
targets host = aarch64), which is the wrong splice for a build-host
helper; this is the same trap that fooled the dev-shell tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`clib/bin/cpmock.c` calls memset/strcpy/strerror/memcmp without an
`#include <string.h>`.  Under glibc the standard headers transitively
re-expose those declarations (via <sys/socket.h>, <stdlib.h>, etc.) so
the compile happens to succeed.  Under musl the headers are strict and
disjoint, so the declarations are missing.  Combined with the
`-DCMAKE_C_STANDARD=23` cmake flag (which makes clang treat
`-Wimplicit-function-declaration` as an error per C23), the build dies:

    clib/bin/cpmock.c:33:5: error: call to undeclared library function
        'memset' with type 'void *(void *, int, __size_t)'; ISO C99 and
        later do not support implicit function declarations
    ... and similar for strcpy, strerror, memcmp.

This is a real correctness bug in the upstream source (implicit function
declarations have been UB since C99 and outright illegal since C23) --
glibc just covers for it.  The workaround here patches the include in
locally so the FRR/dataplane container can be built against musl.

TODO: remove once fixed upstream in githedgehog/dplane-rpc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`src/hh_dp_msg.c` accesses `struct in6_addr` through glibc's internal
anonymous union member name `.__in6_u.__u6_addr8` (three callsites:
lines 74, 131, 179).  That name is a glibc implementation detail, not
part of any standard.  POSIX defines `struct in6_addr` to expose the
bytes via the `.s6_addr` member directly, and musl does only that --
no wrapping anonymous union -- so the access fails to compile against
musl headers:

    src/hh_dp_msg.c:74:57: error: no member named '__in6_u' in 'struct in6_addr'
        memcpy(ifa.address.addr.ipv6, ifaddr->u.prefix6.__in6_u.__u6_addr8, 16);

The three sites are all doing the same thing -- copying 16 bytes out of
the address -- and `.s6_addr` is the portable spelling that works on
both glibc and musl.

To make the postPatch actually take effect, also remove `dontUnpack = true`
so nix copies the source to a writable build directory, and change the
cmake `-S "$src"` reference to `-S .` so cmake consumes the patched copy
rather than the (read-only) original store path.

TODO: remove once fixed upstream in githedgehog/dplane-plugin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
daniel-noland and others added 18 commits May 11, 2026 16:36
Adds a `must-build` CI job that exercises the platforms and libcs the
project intends to ship for, without yet routing them through `check` /
`build` (those jobs assume gnu/x86-64-v3 today).  The matrix:

  - wasm32-wasip1 / unknown / release  (build-each)
  - zen5 / musl / release              (build-container dataplane)
  - bluefield3 / gnu / debug           (build-container dataplane)
  - bluefield3 / musl / debug          (build-container dataplane)

`build-each` is added to the justfile as a thin wrapper that maps to
the `workspace` nix target (build every workspace member to its own
output), parallel to the existing `test-each`.

The wasm matrix entry uses `build-each`; the cross/native container
entries use `build-container dataplane`.  All entries pass platform,
profile, and libc explicitly via `JUST_VARS` so the just default
fallbacks don't matter and the row is self-documenting.

`must-build` is added to the final-status aggregator's `needs` list
along with a step that flags any matrix-row failure -- so a red
must-build row turns the workflow conclusion red even though the
job itself only blocks the summary aggregator.

The `# - check # TODO: add dependency before PR lands` line in the
job's `needs:` block flags that under the current `check` / `build`
implementation those jobs do not exercise cross/wasm, so making
`must-build` strictly depend on them would just slow down feedback
for the cross builds without buying coverage.  Tighten when `check`
itself learns to honour the platform/libc axis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The dataplane.tar `buildPhase` unconditionally tarred in `libc.out`
(`pkgs.pkgsHostHost.libc`) and `${pkgs.pkgsHostHost.glibc.libgcc}`.
The latter is a glibc-evaluation reference: under a musl pkgsCross
`pkgs.pkgsHostHost.glibc` triggers nix trying to build glibc against
musl, which is a nonsensical combination and fails outright with
errors from `sysdeps/x86_64/multiarch/strcasecmp_l.c` ifunc redirection
inside `glibc-nolibgcc-x86_64-unknown-linux-musl-2.42-61`.

Gate the libgcc input on `libc == "gnu"`: glibc-dynamic Rust binaries
need `libgcc_s.so.1` for unwinding, but musl Rust targets statically
link musl and Rust's compiler-builtins and have no libgcc_s consumer.
The exact store path of the libgcc input matters: nix's glibc ld-linux
is compiled with `pkgs.pkgsHostHost.glibc.libgcc` (`xgcc-<ver>-libgcc`
on native, `libgcc-<triple>-<ver>` on cross) in its system search list.
Bundling libgcc_s.so.1 from any other store path (e.g.
`pkgs.stdenv.cc.cc.lib`, which has the same content but a different
name) results in the file being present in the container but
unreachable: the loader doesn't search the directory and there's no
RUNPATH to redirect it.  Symptom on a binary built that way:
    /bin/dataplane: error while loading shared libraries:
    libgcc_s.so.1: cannot open shared object file: No such file or
    directory

Keep libc.out bundled in both libc variants.  The Rust binaries are
static on musl and don't need it, but the busybox we ship alongside
for `/bin/*` shell utilities is dynamically linked against whichever
libc its pkgset uses.  On musl-cross-arch the busybox interpreter is
`${pkgs.pkgsHostHost.libc}/lib/ld-musl-<arch>.so.1`; without that store
path inside the tar the applets fail at exec with no usable libc.

Rename the inner `libc` let-binding to `libc-pkg` so the outer
function-arg `libc` (the string "gnu" / "musl" / "unknown") stays
visible inside this scope for the conditional.  Add a header comment
on the new `libgcc-tar-input` documenting the path-matching constraint
so the next person doesn't "fix" it to a more-natural-looking
`stdenv.cc.cc.lib`.  Add a matching comment on `libc-tar-input`
explaining why busybox makes the bundled libc unconditional.

Also: typo fix on the surrounding comment block (`hazzards` -> `hazards`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The debug-tools container content list referenced `pkgs.libgcc.libgcc`,
which only exists as a passthru on the native gcc derivation
(`outputs = [ "out" "man" "info" "lib" "libgcc" "checksum" ]`).  Under
a cross pkgs (`pkgsCross.aarch64-multiplatform`), `pkgs.libgcc` is a
different derivation -- the static cross-libgcc helpers
(`crtbegin.o`/`libgcc.a`/...) with `outputs = [ "out" "dev" ]` and no
`libgcc` output -- so the attribute access raises "attribute 'libgcc'
missing" during eval of debug-tools for any non-native build.

`pkgs.pkgsHostHost.glibc.libgcc` resolves correctly in both contexts
and points at the same store path that the bundled glibc ld-linux has
in its compiled-in library search list (so runtime resolution still
works the way it has been).

Gate the inclusion on `libc == "gnu"` so musl-libc targets don't drag
in glibc.libgcc, which has no purpose there -- musl Rust binaries use
the system llvm libunwind, not libgcc_s.so.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR's LDFLAGS hard-coded `-L${fancy.libgccjit}/lib -latomic`, pointing
at the glibc-targeted libgccjit-bundled `libatomic.so.1`.  That's the
right path when targeting glibc but is ABI-wrong (and at the wrong
store path) when targeting musl.

Add `libc` to the overlay's function args (and inherit it through
`default.nix`'s overlays import) so the LDFLAGS can pick a
libc-specific path:

  - libc == "gnu":  `${fancy.libgccjit}/lib/libatomic.so.1`
        The libgccjit overlay still builds with the host stdenv (its
        own override has `# TODO: debug issue preventing clang build`),
        so its libatomic is glibc-targeted.  Same path FRR has always
        used.

  - libc == "musl": `${stdenv.cc.cc.lib}/${triple}/lib/libatomic.so.1`
        The gcc-libs output of the cross-musl gcc that backs the
        cross-musl clang stdenv.  Note: `stdenv`, *not* `stdenv'` --
        the prime is the clang stdenv whose `cc.cc` is clang and
        contains no libatomic at all.

  - else: throw, to fail loudly on a libc this layout doesn't yet
        cover.

Link is dynamic on both sides (the `-L .. -latomic` sits *outside* the
`-Wl,--push-state,...,-Bstatic` block).  Single-`.so`-process-wide is
the only ODR-safe choice for libatomic: its `lock_for_pointer` lock
table is per-image, so a static copy inside libfrr.so cannot
synchronize with any other consumer that picks up libatomic.so
dynamically.

Also extend the `preFixup` nuke-refs allowlist to keep the cross-musl
gcc-libs path when targeting musl, so the RUNPATH entry pointing at
the dynamic libatomic.so survives fixup -- without it, nuke-refs
would scrub the entry, the binary's loader couldn't find libatomic,
and `watchfrr` / friends would die at startup with
`Error relocating .../libfrr.so: __atomic_*: symbol not found`.

(`libxcrypt` was already preserved in the working tree's allowlist
shape but missing from this branch -- include it here so the allowlist
matches what's actually linked.)

The same libc-aware nuke-refs treatment applies to `frr-agent`: its
fixupPhase referenced `libgcc.lib`, but under cross pkgs `final.libgcc`
is the static cross-libgcc helpers derivation (outputs `[ "out" "dev" ]`
only -- no `lib` attribute), so the package failed to evaluate for
every non-native target.  Move the fixupPhase from the package into
the overlay so `libc` is in scope, gate the libgcc exclusion on
`libc == "gnu"`, and use `pkgsHostHost.glibc.libgcc` -- the path the
bundled ld-linux's compiled-in search list points at, matching the
libgcc handling already in place at the dataplane.tar bundler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Two unrelated FRR-package cleanups that surface once the overlay starts
exercising libc=musl:

1. `libgccjit` is unconditional in FRR's buildInputs but is only meaningful
   on glibc.  The overlay's LDFLAGS already gates its `-L${libgccjit}/lib
   -latomic` on `libc == "gnu"` and pulls libatomic from the cross-musl
   gcc-libs output instead, so on musl libgccjit is just dead weight in
   the build closure (and a possible source of glibc-targeted ABI leakage
   into the musl runtime image).  Gate the buildInput on
   `stdenv.hostPlatform.isGnu`.

2. `--enable-grpc=no` / `--enable-protobuf=no` are mis-spelled autoconf
   flags.  Autoconf's `AC_ARG_ENABLE(grpc, ...)` accepts `--enable-grpc[=ARG]`
   and `--disable-grpc`, but `--enable-grpc=no` is parsed as
   `enableval=no` for the `grpc` feature -- which most projects handle by
   coincidence, but it's not the documented spelling and breaks under
   stricter autoconf versions.  `--disable-grpc` / `--disable-protobuf`
   is the canonical spelling.  No runtime-behavior change today, but
   the syntax is no longer wrong-looking-but-works-anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
FRR was just built with --disable-protobuf, and dplane-plugin does not
actually link against protobufc itself, so the buildInput (and the
matching call arg) were dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Native cargo invocations from the dev shell (cargo build/clippy/test
--doc) defaulted to the build-host triple while LIBRARY_PATH and
PKG_CONFIG_PATH were already pointing at the cross sysroot.  The
resulting links pulled in cross-target libraries against the host
rust-std and failed (e.g. glibc rust-std + musl libc gave undefined
references to `open64`, `fstat64`, ...).

Pin CARGO_BUILD_TARGET to the same triple the sysroot is built for,
and set PKG_CONFIG_ALLOW_CROSS=1 so the Rust pkg-config crate stops
refusing cross-target builds (PKG_CONFIG_PATH already points at the
matching cross sysroot).
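In shell terms the two pins amount to the following (a sketch; the musl triple is illustrative -- the real value is whatever triple the cross sysroot was built for, and in the dev shell these are exported by nix, not by hand):

```shell
# Pin cargo to the sysroot's triple so host-default builds can't mix libcs.
export CARGO_BUILD_TARGET="aarch64-unknown-linux-musl"
# Let the Rust pkg-config crate proceed even though host != target; safe
# here because PKG_CONFIG_PATH already points at the matching cross sysroot.
export PKG_CONFIG_ALLOW_CROSS=1
echo "cargo will target: $CARGO_BUILD_TARGET"
```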

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Glibc-targeted Rust binaries unwind through libgcc_s.so.1 (still pulled
into the image via glibc.libgcc).  Musl and wasi have no libgcc
consumer, so panic-unwind has no unwinder unless we ask build-std to
pull LLVM's libunwind out of the sysroot.

Gate the build-std-features flag on libc != "gnu" so glibc keeps using
the libgcc path while musl/wasi get system-llvm-libunwind appended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Extract a generic aarch64 platform definition (effectively march=generic:
no `march` or `mcpu` overrides) and refactor bluefield2 as a
recursiveUpdate of it to retain its armv8.2-a / cortex-a72 specialization.

Lets cross-platform CI exercise a vanilla aarch64 build without depending
on the bluefield-specific CPU/march flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Wire a `cargo check` derivation per workspace package as `check.<name>`,
mirroring the existing `workspace.<name>` build derivations.  The point
is to give CI a fast, no-link verification path -- particularly useful
for wasm32-wasip1 where a full container build is overkill but we still
want a build-error gate.

Add matching `just check` (single package) and `just check-each` (all
packages) recipes that drive the new derivation through the standard
nix build pipeline.

Also rearrange the `inherit` list so `check`/`dataplane` land in
alphabetical order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Lets contributors execute the wasm32-wasip1 binaries we already build
without leaving the dev shell:

- wasmtime in the dev-shell tooling
- `[target.wasm32-wasip1] runner = "wasmtime run --dir ."` so plain
  `cargo run`/`cargo test` against the wasm target invokes wasmtime
  automatically, with the cwd bound in.
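The runner wiring reduces to a small `.cargo/config.toml` fragment; sketched here against a throwaway directory (in the repo the stanza lives in the checked-in cargo config, written once, not generated per-build):

```shell
# Demonstration only: write the runner stanza into a scratch config and
# show it. Plain `cargo run`/`cargo test` for the wasm target would then
# exec wasmtime with the cwd preopened.
tmp="$(mktemp -d)"
mkdir -p "$tmp/.cargo"
cat > "$tmp/.cargo/config.toml" <<'EOF'
[target.wasm32-wasip1]
runner = "wasmtime run --dir ."
EOF
cat "$tmp/.cargo/config.toml"
```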

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The single must-build matrix was conflating two different jobs:
a fast wasm sanity check and a slow cross-arch container build sweep.
Split them along that line:

wasm: same triggers as must-build (devfiles, tags, dispatch),
runs the wasm32-wasip1 `check` recipe only.  Keeps wasm verification
as a blocker for any PR that touches dev files.  Recipe is shaped as
a `{name, args}` object so the matrix axis carries recipe+args as a
single tuple rather than as parallel axes.

cross: {aarch64, bluefield3} x {gnu, musl} x {debug} x
{build-container dataplane, build-container frr.dataplane}.  Covers
both the dataplane container and the FRR sidecar, the two images
most affected by this PR's FRR overlay / dplane-plugin / dplane-rpc
cross-arch fixes.  Release profile is dropped to keep the leg count
flat while picking up FRR; cross-arch build problems surface in
debug just as well.  Gated narrowly (PR label `ci:+cross`, or
push/merge_group).

Advisory failure handling: job-level `continue-on-error: true`.  A
failing leg still reports `failure` to its own conclusion -- the job
badge goes red, the build step's `failure()` predicate fires (so the
shared `*tmate` anchor still opens a session on workflow_dispatch
debug), and the leg is visible in the run summary.  The workflow's
overall conclusion ignores the failure, so cross is non-blocking.
`needs.cross.result` reports `success` to dependents under job-level
CoE, so the summary aggregator does not (and cannot) gate on cross;
treat cross as purely informational.

Add `check` as a hard dependency so we don't burn a full cross build
on code that won't even type-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`docker import` reads a raw rootfs tar (no manifest, no platform metadata).
Without an explicit --platform, the imported image is tagged with the host
runner's architecture even when the rootfs is for a different one, which
silently mislabels cross-built images and would mislead anything that
consumes them by platform-aware pull or digest.

Map the just `platform` variable (aarch64 / bluefield2 / bluefield3 /
x86-64-v3 / x86-64-v4 / zen3 / zen4 / zen5) to the docker linux/{arm64,amd64}
pair before calling import.  Fail loudly on an unmapped platform rather
than silently fall back to host arch.

Only the `dataplane` arm of `build-container` is affected; the other arms
use `docker load` on real image tarballs (which carry their own platform
metadata) and do not need --platform.
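The mapping can be sketched as a small shell function (the names mirror the just `platform` values above; how the justfile actually wires this into the recipe is assumed):

```shell
# Map the just `platform` value to docker's --platform pair; fail loudly
# on anything unmapped instead of letting docker default to the host arch.
map_docker_platform() {
  case "$1" in
    aarch64|bluefield2|bluefield3)      echo "linux/arm64" ;;
    x86-64-v3|x86-64-v4|zen3|zen4|zen5) echo "linux/amd64" ;;
    *) echo "unmapped platform: $1" >&2; return 1 ;;
  esac
}
map_docker_platform bluefield3   # -> linux/arm64
```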

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
`nuke-refs` is generated by a `replaceVarsWith` call that bakes a perl
path into the script.  Under a cross pkgset `prev.nukeReferences` (and a
plain `nukeReferences` argument grabbed without splicing) substitutes
a target-arch perl: e.g. `perl-aarch64-unknown-linux-gnu-5.42.0` for
an aarch64 cross.  Both `frr-build`'s preFixup and `frr-agent`'s
fixupPhase invoke `nuke-refs` on the build host during this
derivation's own build, where the kernel cannot execute target-arch
ELF and the build dies with:

    nuke-refs: ...-perl-aarch64-...-5.42.0/bin/perl:
    cannot execute binary file: Exec format error

Reach into `final.buildPackages.nukeReferences` in both spots so the
substituted perl is the build-host one.  Verified by building
`containers.frr.dataplane` for aarch64/gnu, which the cross matrix
added in this stack exercises.

The sibling `removeReferencesTo` is shell-only -- it doesn't bake in
perl, and the `@shell@` substituted into the script is
`stdenvNoCC.shell`, which resolves to build-host bash on every
pkgset.  No equivalent fix is needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
No need to bother with cross compilation until miri passes.
Best not to overload CI, and a miri failure will often indicate
that the cross build would be pointless anyway.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
The `libc` string is a tag carried through the build for derivation
naming and platform dispatch.  For `wasm32-wasip1` it was set to
`"unknown"`, which read as "we don't know what libc this is" rather
than the truth: there is no libc -- WASI Preview 1 provides its own
system interface and the Rust target name uses the `wasi` env field,
not a libc.  Use `"none"` so the tag describes the actual situation.

Mechanical rename across `default.nix` (comment), `justfile` (`libc`
default for `wasm32-wasip1`), `nix/platforms.nix`
(`wasm32.wasip1.<libc>` attr key), and `.github/workflows/dev.yml`
(matrix `libc` value).  No build-output change for non-wasm targets.

Signed-off-by: Daniel Noland <daniel@githedgehog.com>
nixpkgs' musl `rustPlatform` (which `frr-agent` is currently built
with) emits binaries with `DT_NEEDED libgcc_s.so.1` and
`_Unwind_*@GCC_*` references for stack unwinding.  Our musl FRR
image has no `libgcc_s.so.1` -- the `frr-agent` overlay's
`fixupPhase` deliberately nukes the libgcc store-path on musl -- so
the musl loader bails before `main`:

    Error relocating /bin/frr-agent: _Unwind_Resume: symbol not found
    Error relocating /bin/frr-agent: _Unwind_Backtrace: symbol not found
    ...

Force `+crt-static` on the musl variant of `frr-agent` so the C
runtime (and with it `libgcc_eh.a`) is linked statically.  The
resulting binary is `static-pie linked` with no `_Unwind_*` undefs
and no `NEEDED` entries, and the container starts.

The accompanying comment on the `fixupPhase` is also corrected: the
prior claim that "musl Rust uses llvm-libunwind and has no libgcc_s
consumer" was wrong for the way `frr-agent` is currently built --
that's the path the rest of the workspace takes via crane +
`-Zbuild-std=...,system-llvm-libunwind`, not the path `frr-agent`
takes through nixpkgs' `rustPlatform.buildRustPackage`.

TODO: switch `frr-agent` onto the workspace's crane +
`system-llvm-libunwind` build path so it pulls LLVM libunwind
statically like every other Rust binary in this repo, rather than
relying on libgcc_eh from the gcc backing the cross-musl clang
stdenv.
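The libc gate described above can be sketched in shell terms (the real wiring is in nix; the `libc` tag and the flag spelling follow the surrounding commits):

```shell
# Sketch: +crt-static only on the musl variant; glibc keeps dynamic
# linking (it unwinds through libgcc_s.so.1 instead).
libc="musl"   # tag carried through the build, per the surrounding commits
rustflags=""
if [ "$libc" = "musl" ]; then
  rustflags="-C target-feature=+crt-static"
fi
echo "RUSTFLAGS=$rustflags"
```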

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
@daniel-noland changed the base branch from main to pr/daniel-noland/everything-is-wasm on May 12, 2026 01:37
I think it works now?
@daniel-noland force-pushed the pr/daniel-noland/asan-again branch from dda0e90 to bd030ac on May 12, 2026 01:59
@daniel-noland force-pushed the pr/daniel-noland/everything-is-wasm branch from 06bea51 to f45af96 on May 12, 2026 19:16
Base automatically changed from pr/daniel-noland/everything-is-wasm to main May 13, 2026 09:03