perf(poseidon): FFT MDS in AIR + RATE=12 MMO sponge #216

Open
Barnadrot wants to merge 5 commits into leanEthereum:main from Barnadrot:perf/poseidon-fft-mmo

Title

perf(poseidon): FFT MDS in AIR + RATE=12 MMO sponge — -5.6% on XMSS aggregation

Body

Summary

Three performance changes to Poseidon on the leaf-aggregation hot path, plus a small commit adding #[inline] to four cross-crate hot functions. Net result: -5.58% wall-clock on the production XMSS leaf workload (1550 signatures, log_inv_rate=1).

| Commit | Change |
|---|---|
| b11aac3a | mds_air_16: Karatsuba (72 mults) → FFT MDS (50 mults) |
| 27319044 | WHIR Merkle leaf sponge: RATE=8 → RATE=12, capacity=4 (native + zk-DSL verifier) |
| 2198c0b4 | MMO feedforward sponge: restores 124-bit collision security at RATE=12 |
| 602859ad | Add #[inline] to mmo_hash_slice, mmo_precompute_zero_suffix_state, compress_mut, permute_mut |

Benchmark

Hetzner AX42-U (Zen 4), RUSTFLAGS="-C target-cpu=native", 1550 signatures, log_inv_rate=1. zk-alloc allocator (workspace default). Production release profile (fat LTO, codegen-units = 1). Warm-proof average over 4 consecutive proofs after a discarded cold warmup; per-proof variance was <1% on both branches.

| Branch | Time / proof | XMSS/s | Proof size | Δ vs main |
|---|---|---|---|---|
| main (19f1c774) | 2.120 s | 731 | 338 KiB | |
| perf/poseidon-fft-mmo | 2.002 s | 774 | 345 KiB | -5.58% |

Welch's t-test on individual warm proof times: t = -28.8, df ≈ 6, p < 1e-6.

Reproduce with the production profile (fat LTO + codegen-units = 1):

CARGO_PROFILE_RELEASE_LTO=fat \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
RUSTFLAGS="-C target-cpu=native" \
cargo run --release -- xmss --n-signatures 1550 --log-inv-rate 1

Proof size grows 338 → 345 KiB (+2.0%) because the recursive zk-DSL verifier program adds dispatch logic for RATE=12 (250,208 → 253,755 instructions in the aggregation program). The wall-clock improvement is the net gain after that overhead.

Per-commit attribution

Each commit was cherry-picked onto main and benchmarked individually under the production profile:

| Commit | Standalone Δ |
|---|---|
| FFT MDS only | -3.0% |
| RATE=12 + MMO (logically coupled; see security note) | ~-2.5% on top of FFT MDS |
| #[inline] annotations alone | ~0% on the production profile (see note below) |
| Bundle | -5.58% |

Correctness

All five integration tests pass at HEAD:

  • test_run_whir
  • test_xmss_signature
  • test_type_1_aggregation
  • test_aggregation
  • test_type_2_aggregation

End-to-end verification (including the recursive zkVM verifier) succeeds in ~37 ms. Proof remains valid under the existing verifier.

Security — RATE=12 + MMO

RATE=8 with capacity=8 in a plain Sponge gives ~124-bit generic collision security (capacity/2, with each KoalaBear element contributing about 31 bits). Bumping to RATE=12 with capacity=4 in a plain Sponge would drop generic collision security to ~62 bits, which is unacceptable.

Commit 2198c0b4 swaps the absorption mode from plain Sponge to MMO (Matyas-Meyer-Oseas) feedforward: each absorb step XORs the input back into the state after the permutation. MMO with width-16 KoalaBear gives ~124-bit collision security at RATE=12, capacity=4. Full analysis in the commit message.
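The security figures can be checked with a short back-of-the-envelope calculation. This is a sketch under two assumptions that are not spelled out in the text above: each KoalaBear element is counted as roughly 31 bits (p = 2^31 - 2^24 + 1), and the MMO figure is read as the birthday bound on an 8-element digest.

```latex
\begin{aligned}
\log_2 p &\approx 31 \quad \text{for } p = 2^{31} - 2^{24} + 1 \\
\text{plain sponge, } c = 8:&\quad \tfrac{1}{2}\, c \log_2 p \approx \tfrac{1}{2} \cdot 8 \cdot 31 = 124 \text{ bits} \\
\text{plain sponge, } c = 4:&\quad \tfrac{1}{2}\, c \log_2 p \approx \tfrac{1}{2} \cdot 4 \cdot 31 = 62 \text{ bits} \\
\text{MMO, 8-element digest (assumed)}:&\quad \tfrac{1}{2} \cdot 8 \cdot 31 = 124 \text{ bits}
\end{aligned}
```

This is where the 62-bit and 124-bit figures in the commit messages come from; the commit message for 2198c0b4 remains the authoritative analysis.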

Why the #[inline] commit is included

Although the production profile uses fat LTO and inlines the new sponge calls aggressively, the workspace [profile.release] is lto = "thin". Under thin LTO, the new RATE=12 + MMO hot path crosses three crates: mt_whir::merkle::build_merkle_tree_koalabear → mt_symetric::sponge::mmo_hash_slice → Compression::compress_mut (impl on Poseidon1KoalaBear16 in mt_koala_bear) → Permutation::permute_mut. These calls are left out-of-line. Inside the rayon worker loop that means a stack spill of the full 16-element packed state on every absorb iteration, which dominates per-iteration cost.

Concretely, without the #[inline] commit, the same source under workspace defaults (cargo run --release -- xmss ...) regresses by 3.2% vs main. With it, the same command improves by 4.87% vs main on the same machine.

#[inline] is just a hint; under fat LTO the compiler already inlines these. The annotations only change codegen under thin LTO, where they let it match what fat LTO already produces. No semantic change.
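As a minimal illustration of the annotation itself (not the PR's code), the sketch below marks a small hot function with #[inline]. The module `hash_core` and the function `compress_step` are hypothetical stand-ins for a function living in another crate; the lane update is arbitrary and has nothing to do with Poseidon.

```rust
// Stand-in for a separate crate; in a real workspace this would be its own
// package and `compress_step` would be called across the crate boundary.
mod hash_core {
    /// #[inline] asks rustc to make this body available to callers in other
    /// crates, so the optimizer can consider inlining it there instead of
    /// depending on what the LTO mode happens to import. It is only a hint.
    #[inline]
    pub fn compress_step(state: &mut [u32; 16]) {
        for lane in state.iter_mut() {
            // Arbitrary cheap mixing step, just to give the loop a body.
            *lane = lane.rotate_left(7) ^ 0x9E37_79B9;
        }
    }
}

fn main() {
    let mut state = [0u32; 16];
    // Hot loop: an out-of-line call here would pass `state` through memory
    // on every iteration.
    for _ in 0..4 {
        hash_core::compress_step(&mut state);
    }
    println!("{:08x}", state[0]);
}
```

Whether the call is actually inlined still depends on the compiler version, optimization level and LTO mode, so inspecting the generated assembly is the only way to confirm it for a given build.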

Files touched

  • crates/backend/koala-bear/src/poseidon1_koalabear_16.rs: FFT MDS, #[inline]
  • crates/lean_vm/src/tables/poseidon_16/mod.rs: FFT MDS
  • crates/backend/symetric/src/sponge.rs: RATE=12, MMO mode, #[inline]
  • crates/backend/symetric/src/permutation.rs: #[inline]
  • crates/whir/src/merkle.rs: sponge integration, padding formula
  • crates/backend/fiat-shamir/src/verifier.rs: sponge integration
  • crates/rec_aggregation/zkdsl_implem/hashing.py: zk-DSL RATE=12 port

Test plan

  • cargo test --workspace --release
  • Production-profile reproducer command above produces and verifies a valid proof
  • Wall-clock improvement reproduces the -5.6% headline (warm proofs, t-test)
  • Proof size: 338 → 345 KiB (+2.0%, expected from added verifier instructions)

Barnadrot added 5 commits May 9, 2026 09:53

Commit b11aac3a (FFT MDS):
The Poseidon AIR constraint folder evaluates mds_air_16 8x per row across
runtime types (F, EF, FPacking, EFPacking). Previously this used Karatsuba
convolution (72 mults). Switch to the same FFT-MDS already used in the
permute_simd hot path: DIT_FFT(lambda/16 ⊙ DIF_IFFT(state)), 50 mults.

Saves 22 mults × 8 MDS calls per AIR row = 176 mults/row, ~10% reduction
in AIR Poseidon eval mult count. AIR Poseidon eval is ~10% of CPU time
in the e2e prover (eval_2_full_rounds_16 + eval_last_2_full_rounds_16 +
Poseidon16Precompile::eval).

The unpacked lambda_over_16 = (DIF_IFFT(MDS_CIRC_COL) * 16^-1) is
factored out of the SimdPrecomputed branch and stored at the top of
Precomputed; the SIMD branch reuses it (no duplication). FFT helpers
(bt/dit/neg_dif/dif_ifft/dit_fft) are ungated from target_feature
since they're pure generic Rust, and their bound is relaxed from
Algebra<KoalaBear> to PrimeCharacteristicRing + Mul<KoalaBear> to match
mds_circ_16 (so EFPacking, which lacks Algebra<KoalaBear>, is admitted).
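The identity behind the FFT MDS is the convolution theorem: multiplying by a 16×16 circulant matrix equals a pointwise product in the transform domain. The sketch below checks that identity over the KoalaBear modulus with a plain recursive radix-2 transform. It is a toy model, not the PR's code: all function names are hypothetical, it uses a generic forward/inverse transform rather than the DIF/DIT pair named above, and it makes no attempt to reproduce the 50-multiplication count.

```rust
const P: u64 = 2_130_706_433; // KoalaBear modulus 2^31 - 2^24 + 1
const N: usize = 16;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { a * b % P } // operands < 2^31, so no overflow

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Finds a primitive 16th root of unity by trial: w = g^((P-1)/16) is
/// primitive exactly when w^8 == -1 (mod P).
fn primitive_root_16() -> u64 {
    (2..P).map(|g| pow(g, (P - 1) / 16)).find(|&w| pow(w, 8) == P - 1).unwrap()
}

/// Recursive radix-2 transform: out[k] = sum_j a[j] * w^(j*k), where `w`
/// must be a primitive n-th root of unity and n a power of two.
fn ntt(a: &[u64], w: u64) -> Vec<u64> {
    let n = a.len();
    if n == 1 { return a.to_vec(); }
    let even: Vec<u64> = a.iter().step_by(2).copied().collect();
    let odd: Vec<u64> = a.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (ntt(&even, mul(w, w)), ntt(&odd, mul(w, w)));
    let mut out = vec![0; n];
    let mut t = 1; // running power w^k
    for k in 0..n / 2 {
        let v = mul(t, o[k]);
        out[k] = add(e[k], v);
        out[k + n / 2] = sub(e[k], v); // uses w^(n/2) == -1
        t = mul(t, w);
    }
    out
}

/// Reference: y[i] = sum_j col[(i - j) mod N] * x[j], i.e. the circulant
/// matrix with first column `col` applied to `x`.
fn circulant_naive(col: &[u64; N], x: &[u64; N]) -> Vec<u64> {
    (0..N)
        .map(|i| (0..N).fold(0, |acc, j| add(acc, mul(col[(i + N - j) % N], x[j]))))
        .collect()
}

/// Same product via transform, pointwise multiply, inverse transform.
/// The 1/16 of the inverse transform is folded into the precomputed spectrum.
fn circulant_fft(col: &[u64; N], x: &[u64; N]) -> Vec<u64> {
    let w = primitive_root_16();
    let inv16 = pow(16, P - 2);
    let lambda_over_16: Vec<u64> = ntt(col, w).iter().map(|&l| mul(l, inv16)).collect();
    let spectrum: Vec<u64> =
        ntt(x, w).iter().zip(&lambda_over_16).map(|(&s, &l)| mul(s, l)).collect();
    ntt(&spectrum, pow(w, P - 2)) // transform with w^-1 is the unscaled inverse
}

fn main() {
    // Arbitrary test vectors, not the real MDS column.
    let col: [u64; N] = core::array::from_fn(|i| 3 * (i as u64) + 1);
    let x: [u64; N] = core::array::from_fn(|i| (i as u64).pow(2) + 7);
    assert_eq!(circulant_naive(&col, &x), circulant_fft(&col, &x));
    println!("FFT path matches the direct circulant product");
}
```

In the commit, the spectrum of the MDS column is fixed, so `lambda_over_16` is computed once and only the two transforms and the pointwise product run per call.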

Predicted magnitude: medium (1.0-1.5%).

Commit 27319044 (RATE=12):
Reduce Poseidon permutations per Merkle leaf by 22-32% by increasing the
sponge absorption rate from 8 to 12 field elements per permutation call.
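To see why the saving depends on the leaf width, here is a small counting sketch. It assumes one permutation per rate-sized chunk with the last chunk padded, and the leaf widths are hypothetical; the real count also depends on the padding formula in merkle.rs.

```rust
/// Permutation calls needed to absorb `n_elems` field elements at a given
/// rate, assuming one permutation per (possibly padded) rate-sized chunk.
fn permutations_per_leaf(n_elems: usize, rate: usize) -> usize {
    n_elems.div_ceil(rate)
}

fn main() {
    // Hypothetical leaf widths, chosen only for illustration.
    for n in [64usize, 96, 128] {
        let old = permutations_per_leaf(n, 8);
        let new = permutations_per_leaf(n, 12);
        let saved = 100.0 * (old - new) as f64 / old as f64;
        println!("{n} elements: {old} -> {new} permutations ({saved:.0}% fewer)");
    }
}
```

The ideal saving is one third (8/12), and rounding up to whole chunks pulls individual widths below that.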

Changes:
- sponge.rs: relax RATE==OUT and WIDTH==OUT+RATE asserts, support arbitrary RATE
- merkle.rs: SPONGE_RATE=12, padded_full_base_width helper, corrected
  n_zero_suffix_rate_chunks formula for RATE!=WIDTH/2
- verifier.rs: pad base_data to sponge-aligned length before hashing
- hashing.py: zk-DSL slice_hash_rtl rewritten for RATE=12, @inline removed
  to fix conditional branch fall-through bug

Commit 2198c0b4 (MMO):
Replace standard outer-sponge with Matyas-Meyer-Oseas (MMO) feedforward
construction. Same Poseidon-16 permutation, same RATE=12, but collision
security lifts from 62-bit to 124-bit by chaining the full 16-element
state instead of just the 4-element capacity.

Changes:
- sponge.rs: mmo_hash_slice, mmo_hash_rtl_iter, mmo_precompute_zero_suffix_state
  with full-state feedforward (XOR pre-perm state into post-perm state)
- merkle.rs: wire MMO hash functions into Merkle tree construction
- verifier.rs: use MMO hash in verification path
- poseidon_16: new poseidon16_permute precompile (16-element output) for
  zk-DSL recursive verifier, with AIR constraints and trace generation
- hashing.py: zk-DSL updated to use MMO via poseidon16_permute precompile
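The feedforward step can be sketched as follows. This is an illustrative model only, under several assumptions that the commit message does not confirm: the chunk overwrites the rate lanes, the last chunk is zero-padded, the feedforward is modelled as field addition (the message calls it an XOR), and the digest is the first 8 lanes. `toy_permute` is a non-cryptographic stand-in for Poseidon-16, and `mmo_hash_sketch` is a hypothetical name, not the crate's mmo_hash_slice.

```rust
const P: u64 = 2_130_706_433; // KoalaBear modulus 2^31 - 2^24 + 1
const WIDTH: usize = 16;
const RATE: usize = 12;
const OUT: usize = 8; // assumed digest length

/// Toy stand-in for Poseidon-16 (NOT cryptographic): per round, add a
/// constant and cube each lane, then add the lane sum to every lane.
fn toy_permute(state: &mut [u64; WIDTH]) {
    for round in 0..4u64 {
        for (i, s) in state.iter_mut().enumerate() {
            let v = (*s + round * 16 + i as u64 + 1) % P;
            *s = v * v % P * v % P;
        }
        let sum = state.iter().fold(0, |acc, &s| (acc + s) % P);
        for s in state.iter_mut() {
            *s = (*s + sum) % P;
        }
    }
}

/// Hashes `input` (elements assumed < P) in RATE-sized chunks. Each step
/// writes the chunk into the rate lanes, permutes, then adds the
/// pre-permutation state back into all WIDTH lanes (the feedforward).
fn mmo_hash_sketch(input: &[u64], permute: fn(&mut [u64; WIDTH])) -> [u64; OUT] {
    let mut state = [0u64; WIDTH];
    for chunk in input.chunks(RATE) {
        state[..RATE].fill(0); // zero-pad a short final chunk
        state[..chunk.len()].copy_from_slice(chunk);
        let pre = state; // copy of the full pre-permutation state
        permute(&mut state);
        for (s, p) in state.iter_mut().zip(pre) {
            *s = (*s + p) % P;
        }
    }
    let mut digest = [0u64; OUT];
    digest.copy_from_slice(&state[..OUT]);
    digest
}

fn main() {
    let leaf: Vec<u64> = (1..=30).collect(); // arbitrary 30-element leaf
    println!("{:?}", mmo_hash_sketch(&leaf, toy_permute));
}
```

Because the whole 16-lane state is carried from one step to the next, the chaining value is no longer limited to the 4 capacity lanes, which is the property the commit message relies on. Zero-padding as written is only reasonable for fixed-length inputs.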

Security: standard sponge collision = c*log2(p)/2 = 62 bits (unshippable).
MMO collision = b-bit birthday on full state output = 124 bits (meets target).
Verified against: Coratger-Khovratovich-Wagner-Mennink 2026, SAFE proof
(eprint 2023/520), Beetle (CHES 2018).

Commit 602859ad (#[inline]):
Under the workspace default thin LTO profile, the new RATE=12 + MMO sponge
code introduced cross-crate calls that did not get inlined: mmo_hash_slice,
mmo_precompute_zero_suffix_state, compress_mut, permute_mut. The hot loop
in build_merkle_tree_koalabear ended up making out-of-line calls into
mt_symetric and mt_koala_bear on every absorb, spilling the 16-element
state to the stack each iteration.

Adding #[inline] makes these functions available for cross-CGU inlining
under thin LTO, matching the codegen fat LTO already produces.

No semantic change. The functions are short hot-path wrappers/loops that
the compiler should inline anyway given the chance.

Final commit (formatting and lint cleanup):
- rustfmt: re-flow long lines introduced by the MMO commit
- clippy: replace redundant closures in sponge tests with function refs
- clippy: allow too_many_arguments on eval_last_2_full_rounds_16 (AIR helper, 9 args)
- clippy: rewrite full_output_flags loop with .iter().enumerate()