A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277
Conversation
Previously each of the 23 layers held its own std::vector<float> history. Heap allocators can place these at addresses that share the same L1 cache sets, causing conflict misses on the wide-dilation layers whose tap reads, _layer_in writes, and head-history writes simultaneously compete for cache. Replace with a single contiguous allocation (_history_arena) from which each layer takes a raw pointer slice. Sequential placement naturally distributes base addresses across cache sets because each slot size mod the set stride is non-zero. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously each layer accumulated its post-activation output into a separate _head_sum buffer, which was then zeroed before every block (memset of Channels*num_frames floats) and copied into _head_history by _head_ring_write (another memcpy of the same size) after all layers finished. Replace with direct writes: layer 0 assigns (IsFirst=true template parameter), layers 1-22 accumulate (IsFirst=false). The IsFirst branch resolves at compile time, so there is no runtime branch in the hot loop. _head_ring_write is removed; process() now manages the head ring write position directly, rewinding with a cheap memmove of kHeadKernelSize-1 columns when the end of the buffer is reached. Also eliminates ztile.setZero() for the Channels=8 Eigen path: tap 0 now assigns into ztile (noalias() =) rather than accumulating on top of a zero-initialised matrix, saving one Channels*num_frames write per layer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously _ring_write always refreshed the full tail mirror — a memcpy of maxBufferSize * Channels * sizeof(float) bytes per layer per block (4 KB for Ch=8 / 128 frames, 92 KB total across 23 layers per buffer). Most of the time no tap read crosses the pow2_size boundary, so the mirror copy was pure overhead. Replace with mirror-on-demand: after each write, compute each tap's next-call read range and mirror only the columns that actually overflow past pow2_size. Zero copy is emitted whenever no tap straddles the wrap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Replaces the Eigen noalias() GEMM loop with an explicit T=4 frame-tiled,
tap-major C++ kernel. Structure of the inner body:
a[f][o] += Wcol[o] * h[f] (o = output channel, f = frame in tile)
Wcol = W[:,cp] is stride-1 in o, so the o loop vectorizes. h[f] is a
scalar loaded from the (column-major) history buffer, so the compiler
emits a broadcast-scalar FMA instruction — vmlaq_n_f32 on AArch64, the
AVX equivalent on x86 — without requiring architecture-specific intrinsics.
The T=4 tiling amortizes the two weight register loads (W[:,cp] = 8
floats = 2 Q-regs on AArch64) across four independent accumulator chains,
matching the weight-reuse strategy of the explicit NEON microkernel.
A scalar tail handles buffer sizes that are not multiples of 4; in
practice audio buffer sizes (64, 128, 256) are always multiples of 4.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
…s.txt to compile to Release by default.
Allman braces for the new if constexpr / dispatcher blocks, no single-line loops, and collapse the manually-aligned declarations. No behavioral change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sdatkinson
left a comment
There was a problem hiding this comment.
CMakeLists.txt but otherwise LGTM
| // broadcast) with weight vector Wcol reused across all T frames — the same | ||
| // weight amortisation as an explicit NEON microkernel, without intrinsics. | ||
| // Equivalent SIMD broadcast-FMA instructions are emitted on x86 (AVX) and | ||
| // other SIMD targets automatically. |
|
|
||
| set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake") | ||
|
|
||
| # Default to a Release build: without this, an empty CMAKE_BUILD_TYPE compiles |
There was a problem hiding this comment.
I think I prefer defaulting to Debug--slower performance is very noticeable; losing your assertions when running tests, less so :)
|
Just a heads-up that this new optimized path seems to be very architecture and compiler dependent. In my testing, it is significantly faster on x64 with clang, but slower on x64 with MSVC. On Raspberry Pi 4, it is much slower using both clang (v14) and gcc (v12). |
|
@mikeoliphant thanks for testing and reporting. I did not test with MSVC and on ARM I only tested in a couple of architectures (M7 and A8). Weird that I expected it to be faster on the RPi 4 but that isn't the case... did you compile with -mnative or cross-compile? @sdatkinson Did you want me to gate these optimizations behind a compile time flag? |
|
@jfsantos I compiled natively on RPi4 (arm64), which enables the native optimizations by default using gcc. I don't know if clang does, but forcing "-march=native" doesn't improve things. |
|
I've now got confirmation the change also decreases performance on MOD Dwarf (ARM A35). |
…hannels=8) (sdatkinson#277)" This reverts commit baf1bf8.
…hannels=8) (sdatkinson#277)" This reverts commit baf1bf8.
Optimizes the A2-shape fast-path WaveNet (
a2_fast.cpp) with portable, intrinsic-free changes. No change to model output; the generic WaveNet path is untouched.Changes
T×Channelsaccumulator local and reuses each weight column across 4 frames. The compiler emits SIMD broadcast-FMA (vmlaq_n_f32on AArch64, AVX equivalents on x86) — the same weight amortization as a hand-written microkernel, without intrinsics._head_historyat the ring slot (IsFirstselects=vs+=), eliminating the separate_head_sumbuffer, its per-block zeroing pass, and the head ring-write copy.layer1x1projection + residual store feed nothing downstream, so they're elided via anIsLasttemplate parameter.mbs×Channelsmirror refresh every block, predict each tap's next read range and mirror only the columns that actually overflow pastpow2_size— frequently zero.Releasewhen no build type is set (an emptyCMAKE_BUILD_TYPEcompiles unoptimized, ~10x slower for Eigen-heavy code).Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs
main)Both builds forced to
-DCMAKE_BUILD_TYPE=Releaseto isolate the code changes. The unchanged generic WaveNet ran as a control and stayed flat in both builds, confirming the deltas are real.Channels=8 sees 1.3–1.6x from the tiled conv. Channels=3 is roughly flat on desktop; the arena/mirror changes target narrow embedded caches (Cortex-A8/M7) and aren't captured by a desktop run.
Some of these optimizations were contributed by landon-chaos, as he found them out while optimizing Core to run on his pedals :)