PIF, Higher Order and Optimized Scatter/Gather, #501 Splitup #534

@aaadelmann

PR 501 Split Map

Source PR: #501

Compared against: origin/master at da43fd66a18848b65dcd82af90001f5b64a798ab

Original PR size: 41 commits, 149 files changed, 23,231 insertions, 2,205 deletions.

The original PR is reviewable only as a source branch. The extracted split is now the working structure:

| Order | Branch | Scope | Depends on |
|---|---|---|---|
| 1 | pr501-pcg | PCG / preconditioner allocation cleanup | master |
| 2 | pr501-communication-particle-update | communication buffers, particle update, particle serialization, particle send/recv tests | pr501-pcg |
| 3 | pr501-hosg | higher-order scatter/gather, binning, autotune cache | pr501-communication-particle-update |
| 4 | pr501-fft | FFT backend / transform refactor, pruned transforms | pr501-hosg |
| 5 | pr501-nufft | native NUFFT plus FINUFFT/cuFINUFFT integration | pr501-fft |
| 6 | pr501-pif | Alpine ElectrostaticPIF examples | pr501-nufft |

Split Rationale

The review load in PR #501 is dominated by three areas that look independent but depend on each other technically:

  • Particle communication/update: direct archive serialization, shared send/recv buffers, layout migration, sorting, and nghost plumbing.
  • Interpolation/autotune: new scatter/gather dispatch, binning, tiling, tuning cache, and legacy CIC routing.
  • FFT/NUFFT/PIF: transform/backend refactor, native NUFFT, FINUFFT/cuFINUFFT support, and Alpine PIF examples.

The dependency order is:

PCG
  -> communication + particle update
    -> higher-order scatter/gather
      -> FFT backend / transform refactor
        -> native NUFFT + FINUFFT/cuFINUFFT
          -> PIF examples
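
The chain above can also be written down as data. The following is a minimal sketch, not part of the PR: `splitParents` and `reviewOrder` are hypothetical helper names, and the only facts used are the branch names and "depends on" relations listed in the split table.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Branch name -> branch it is stacked on (taken from the split table).
inline std::map<std::string, std::string> splitParents() {
    return {
        {"pr501-pcg", "master"},
        {"pr501-communication-particle-update", "pr501-pcg"},
        {"pr501-hosg", "pr501-communication-particle-update"},
        {"pr501-fft", "pr501-hosg"},
        {"pr501-nufft", "pr501-fft"},
        {"pr501-pif", "pr501-nufft"},
    };
}

// Hypothetical helper: follow the parent links from `branch` back to master
// and return the branches in the order they have to be reviewed and merged.
inline std::vector<std::string> reviewOrder(const std::string& branch) {
    const auto parents = splitParents();
    std::vector<std::string> chain;
    std::string current = branch;
    while (current != "master") {
        chain.push_back(current);
        current = parents.at(current);  // throws std::out_of_range for unknown branches
    }
    return std::vector<std::string>(chain.rbegin(), chain.rend());
}
```

Because the stack is strictly linear, reviewing any branch means reviewing every branch below it first; for example, `reviewOrder("pr501-fft")` starts with `pr501-pcg` and ends with `pr501-fft`.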

Important dependency details:

  • Native NUFFT uses the new interpolation scatter/gather layer.
  • PIF uses FFT<NUFFTransform, Field_t>, scatterPIFNUFFT, and gatherPIFNUFFT.
  • Particle update and communication are coupled through ParticleAttrib::serialize, ParticleAttrib::deserialize, Archive, and BufferHandler.
  • nghost changes touch FieldLayout and HaloCells; they belong with particle update because the update path needs kernel-width-aware halos.

Extracted Branches

PR 1: PCG Allocation / Solver Performance

Branch: pr501-pcg

Scope:

  • src/LinearSolvers/PCG.h
  • src/LinearSolvers/Preconditioner.h
  • src/PoissonSolvers/PoissonCG.h

Purpose:

  • Hoist repeated per-iteration allocation work out of PCG/preconditioner hot paths.
  • Pass Field by reference through OperatorF.
  • Keep this independent from particle, interpolation, FFT, NUFFT, and PIF changes.

PR 2: Communication And Particle Update Infrastructure

Branch: pr501-communication-particle-update

Scope:

  • src/Communicate/*
  • src/Particle/ParticleSpatialLayout*
  • src/Particle/ParticleAttrib*
  • src/Particle/ParticleAttribBase.h
  • src/Particle/ParticleBase*
  • src/Particle/ParticleSort.h
  • src/Particle/SortBuffer.h
  • src/FieldLayout/FieldLayout*
  • src/Field/HaloCells*
  • particle update/send-recv tests

Purpose:

  • Replace per-attribute staging buffers with direct serialization into shared communication archives.
  • Rework particle migration into a packed locate/send/recv path with reusable scratch.
  • Preserve this as the non-PIF communication and particle update PR.
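
As a rough illustration of "direct serialization into a shared archive", the sketch below appends several attributes to one byte buffer instead of staging each attribute separately. It is standard-library C++ only; `ByteArchive` and its `write`/`read` members are hypothetical names and do not reproduce the IPPL `Archive` or `BufferHandler` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a shared communication archive: one byte buffer
// that every attribute appends to, instead of one staging buffer per attribute.
class ByteArchive {
public:
    // Append `count` trivially copyable values to the end of the buffer.
    template <typename T>
    void write(const T* data, std::size_t count) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "raw byte copies require a trivially copyable type");
        const std::size_t bytes = count * sizeof(T);
        if (bytes == 0) return;
        const std::size_t old = buffer_.size();
        buffer_.resize(old + bytes);
        std::memcpy(buffer_.data() + old, data, bytes);
    }

    // Read `count` values back, in the same order they were written.
    template <typename T>
    void read(T* out, std::size_t count) {
        const std::size_t bytes = count * sizeof(T);
        if (bytes == 0) return;
        assert(readPos_ + bytes <= buffer_.size());
        std::memcpy(out, buffer_.data() + readPos_, bytes);
        readPos_ += bytes;
    }

    std::size_t size() const { return buffer_.size(); }

private:
    std::vector<unsigned char> buffer_;
    std::size_t readPos_ = 0;
};
```

The sender and receiver must agree on attribute order and counts, since the buffer carries no type information; in this sketch that contract is simply "read in the order written".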

Important fix added during split:

  • The receive path now pre-reserves particle attribute storage once per update before deferred receive deserialization.
  • This avoids repeated content-preserving Kokkos::resize(offset + nrecvs) calls, which were previously issued once per source rank and per attribute.
  • On GPU this removed the observed multi-second particleDeserialize bottleneck.
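
The difference between the two receive strategies can be shown with a toy model. This is a sketch in plain C++, not IPPL or Kokkos code: `CountingArray`, `copiesWithPerRankResize`, and `copiesWithPreReserve` are hypothetical names, and the model assumes that every size-changing preserving resize reallocates and copies all surviving elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy attribute array whose resize always reallocates and copies the
// surviving elements. It only counts how many elements get copied.
struct CountingArray {
    std::vector<double> data;
    std::size_t copiedElements = 0;

    void preservingResize(std::size_t n) {
        if (n == data.size()) return;  // unchanged size: no reallocation
        std::vector<double> fresh(n);
        const std::size_t keep = n < data.size() ? n : data.size();
        for (std::size_t i = 0; i < keep; ++i) fresh[i] = data[i];
        copiedElements += keep;
        data.swap(fresh);
    }
};

// Slow path: grow the array once per source rank.
inline std::size_t copiesWithPerRankResize(std::size_t local,
                                           const std::vector<std::size_t>& recvCounts) {
    CountingArray a;
    a.preservingResize(local);
    a.copiedElements = 0;  // count only the copies caused by receives
    std::size_t offset = local;
    for (std::size_t nrecvs : recvCounts) {
        a.preservingResize(offset + nrecvs);
        offset += nrecvs;
    }
    return a.copiedElements;
}

// Fast path: reserve local + total receives once, then fill in place.
inline std::size_t copiesWithPreReserve(std::size_t local,
                                        const std::vector<std::size_t>& recvCounts) {
    CountingArray a;
    a.preservingResize(local);
    a.copiedElements = 0;
    std::size_t total = 0;
    for (std::size_t nrecvs : recvCounts) total += nrecvs;
    a.preservingResize(local + total);
    return a.copiedElements;
}
```

With 1000 local particles and four source ranks sending 100 particles each, the per-rank path copies 1000 + 1100 + 1200 + 1300 = 4600 elements in this model, while the pre-reserve path copies 1000. The real code repeats this for every attribute, which multiplies the gap.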

Key regression tests:

  • ParticleSendRecv with at least 2 MPI ranks; a single rank does not exercise the receive/deserialization path.
  • ParticleUpdate
  • ParticleUpdateNonuniform

PR 3: Higher-Order Scatter/Gather And Autotune

Branch: pr501-hosg

Scope:

  • src/Interpolation/*
  • cmake/AutoTunePresets.cmake
  • cmake/IpplAutoTunePresets.h.in
  • cmake/auto_tune/*
  • interpolation tests

Purpose:

  • Add scatter/gather dispatch over atomic, atomic-sort, tiled, and output-focused paths.
  • Add particle binning and hardware tile-size cache lookup.
  • Provide the algorithmic layer required by native NUFFT.
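
For readers unfamiliar with the scatter/gather terminology, here is a minimal serial sketch of the lowest-order case: 1D cloud-in-cell (linear) weights on a periodic grid. It does not show the atomic, tiled, or higher-order paths of this PR. The function names `scatterCIC` and `gatherCIC` are hypothetical, and the sketch assumes non-negative positions and grid points located at `i * h`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scatter: deposit each particle's charge onto its two neighbouring grid
// points with linear weights. Positions are assumed to be >= 0.
inline void scatterCIC(const std::vector<double>& x, const std::vector<double>& q,
                       double h, std::vector<double>& rho) {
    const std::size_t n = rho.size();
    for (std::size_t p = 0; p < x.size(); ++p) {
        const double s = x[p] / h;
        const double base = std::floor(s);
        const std::size_t i = static_cast<std::size_t>(base) % n;
        const double w = s - base;            // fractional distance to point i
        rho[i] += q[p] * (1.0 - w);
        rho[(i + 1) % n] += q[p] * w;         // periodic wrap at the last cell
    }
}

// Gather: interpolate a grid field to one particle position with the
// same linear weights that the scatter uses.
inline double gatherCIC(const std::vector<double>& field, double xp, double h) {
    const std::size_t n = field.size();
    const double s = xp / h;
    const double base = std::floor(s);
    const std::size_t i = static_cast<std::size_t>(base) % n;
    const double w = s - base;
    return field[i] * (1.0 - w) + field[(i + 1) % n] * w;
}
```

The weights of each particle sum to one, so the scatter conserves total charge. In a parallel scatter, the two `+=` updates are where several particles can write to the same grid point, which is why the dispatch distinguishes atomic, sorted, and tiled variants.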

PR 4: FFT Backend / Transform Refactor

Branch: pr501-fft

Scope:

  • src/FFT/Backend/*
  • src/FFT/Transform/CC.h
  • src/FFT/Transform/RC.h
  • src/FFT/Transform/Trig.h
  • src/FFT/Transform/PrunedCC.h
  • src/FFT/Transform/PrunedRC.h
  • src/FFT/Transform/Common.h
  • src/FFT/Traits.h
  • existing FFT test updates

Purpose:

  • Split the old monolithic FFT layer into backend and transform frontends.
  • Expose CC, RC, pruned CC/RC, and trigonometric transforms without native NUFFT/PIF in the same PR.

Intentionally left for PR 5:

  • src/FFT/NUFFT/*
  • src/FFT/Transform/NUFFT.*
  • NUFFT-specific tests
  • FINUFFT/cuFINUFFT dependency use beyond dormant traits/config hooks.

PR 5: Native NUFFT And FINUFFT/cuFINUFFT Integration

Branch: pr501-nufft

Scope:

  • src/FFT/NUFFT/*
  • src/FFT/Transform/NUFFT.*
  • IPPL_ENABLE_FINUFFT
  • IPPL_ENABLE_CUFFTMP
  • NUFFT tests

Purpose:

  • Add native IPPL NUFFT and optional FINUFFT/cuFINUFFT backend support.
  • Keep all PIF examples out of this PR.
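
As background, a type-1 NUFFT approximates a sum from nonuniform points to uniform Fourier modes. The sketch below evaluates that sum directly in O(N·M) time and is only useful as a small reference for checking accuracy. It is not the native IPPL implementation: `nudftType1` is a hypothetical name, and the sign of the exponent and the mode ordering (k from -M/2 to M/2-1) are assumptions that may differ from the conventions used in the PR.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct evaluation of f_k = sum_j c_j * exp(-i * k * x_j)
// for integer modes k = -numModes/2, ..., -numModes/2 + numModes - 1.
// x holds nonuniform points, c the matching complex strengths.
inline std::vector<std::complex<double>> nudftType1(
    const std::vector<double>& x,
    const std::vector<std::complex<double>>& c,
    int numModes) {
    std::vector<std::complex<double>> f(static_cast<std::size_t>(numModes));
    const int kMin = -(numModes / 2);
    for (int m = 0; m < numModes; ++m) {
        const double k = static_cast<double>(kMin + m);
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < x.size(); ++j) {
            sum += c[j] * std::polar(1.0, -k * x[j]);  // exp(-i k x_j)
        }
        f[static_cast<std::size_t>(m)] = sum;
    }
    return f;
}
```

A single unit-strength point at x = π gives modes that alternate between +1 and -1, which makes a convenient sanity check for any faster implementation.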

Validation with FINUFFT disabled:

cmake -S . -B build-pr501-nufft-debug -DCMAKE_BUILD_TYPE=Debug -DIPPL_PLATFORMS=SERIAL -DIPPL_ENABLE_FFT=ON -DIPPL_ENABLE_UNIT_TESTS=ON
cmake --build build-pr501-nufft-debug --target ippl NUFFT NUFFTAccuracy -j 8
ctest --test-dir build-pr501-nufft-debug -R '^NUFFT$|^NUFFTAccuracy$' --output-on-failure

2/2 NUFFT tests passed
Total test time: 19.19 sec

PR 6: PIF / Alpine Examples

Branch: pr501-pif

Scope:

  • alpine/ElectrostaticPIF/*
  • PIF-specific particle attribute convenience APIs if not already included in NUFFT
  • Alpine PIF wiring

Purpose:

  • Add the ElectrostaticPIF examples after the core particle, interpolation, FFT, and NUFFT APIs are already reviewable.

Current status:

  • LandauDampingPIF supports the default upsampled mode and an optional positional pruned mode.
  • LandauDampingPIFPruned is not kept as a separate executable in the current split.
  • pr501-pif has the receive pre-reserve particle update fix applied and pushed.

Build note:

  • LandauDampingPIF is added when IPPL_ENABLE_ALPINE=ON and IPPL_ENABLE_FFT=ON.
  • With IPPL_ENABLE_FINUFFT=OFF, it can compile against the native NUFFT stack, but runtime behavior depends on ChargedParticlesPIF::initNUFFT() selecting the native backend.

LUMI Status

The split branch stack is running successfully on LUMI. Current benchmark data show no evidence that the decomposition itself introduced a broad regression; several workloads improve relative to master.

Benchmark Problem size Nodes Ranks master pr501-pcg pr501-com pr501-fft pr501-hosg pr501-nufft
FEM 513_10 8 64 28.10 28.13 (0%) 27.55 (-2%) 26.78 (-5%) 27.29 (-3%) 26.49 (-6%)
FFT 512_10 4 32 4.43 4.38 (-1%) 3.48 (-21%) 3.70 (-16%) 3.47 (-22%) 3.53 (-20%)
FFT 512_10 16 128 1.66 1.66 (0%) 1.14 (-31%) 1.13 (-31%) 1.49 (-10%) 1.08 (-35%)
PCG 512_10 1 8 72.28 70.17 (-3%) 67.58 (-6%) 67.65 (-6%) 67.59 (-6%)
PCG 512_10 4 32 34.80 32.95 (-5%) 32.34 (-7%) 36.40 (+5%) 31.93 (-8%) 31.93 (-8%)
PCG 512_10 64 512 25.09 22.96 (-8%) 22.93 (-9%) 22.11 (-12%) 22.09 (-12%) 22.21 (-11%)
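
The parenthesized percentages appear to be the change of each branch relative to the master column, rounded to the nearest integer. The table does not state this, so the sketch below is an assumption that happens to reproduce the listed values; `relativeChangePercent` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Relative change of a branch measurement versus master, in percent,
// rounded to the nearest integer. Negative values mean the branch is lower.
inline long relativeChangePercent(double master, double branch) {
    return std::lround(100.0 * (branch - master) / master);
}
```

For instance, the FFT row on 32 ranks gives `relativeChangePercent(4.43, 3.48) == -21`, and the PCG row on 32 ranks gives `relativeChangePercent(34.80, 36.40) == 5` for pr501-fft, matching the table.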

Particle receive pre-reserve fix evidence from LUMI:

| Run | Before | After |
|---|---|---|
| 32 ranks particleDeserialize wall max | 7.88320 s | 0.0106297 s |
| 32 ranks updateParticle wall max | 8.08981 s | 0.338958 s |
| 128 ranks particleDeserialize wall max | 1.40305 s | 0.0132888 s |
| 128 ranks updateParticle wall max | 1.59376 s | 0.214328 s |

Interpretation:

  • The slow path repeatedly resized particle attributes during deferred receive finalization.
  • Each preserving resize copied existing particle storage on the GPU.
  • Pre-reserving once to localAfterDestroy + totalRecvs removes the repeated copy pattern.
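
A back-of-the-envelope copy count makes the interpretation concrete. The symbols are introduced here for illustration only: $n_0$ stands for localAfterDestroy, $S$ for the number of source ranks, and $r_j$ for the number of particles received from source rank $j$. The estimate assumes that every preserving resize reallocates and copies all elements currently stored, and it counts copied elements per attribute.

```latex
\[
C_{\text{per-rank}} \;=\; \sum_{k=1}^{S}\Bigl(n_0 + \sum_{j=1}^{k-1} r_j\Bigr),
\qquad
C_{\text{pre-reserve}} \;=\; n_0 .
\]
% Special case: equal receive counts r_j = r for all source ranks
\[
C_{\text{per-rank}} \;=\; S\,n_0 + r\,\frac{S(S-1)}{2} .
\]
```

Under these assumptions the repeated-resize path costs at least $S$ times as many copies as the single pre-reserve, and the cost is multiplied again by the number of attributes.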

LUMI Final Results

Benchmark Problem size Nodes Ranks master pr501-pcg pr501-com pr501-fft pr501-hosg pr501-nufft pr501-original pr501-pif
FEM 513_10 8 64 28.08 27.93 (-1%) 27.01 (-4%) 26.56 (-5%) 26.60 (-5%) 26.32 (-6%) 28.76 (+2%) 27.78 (-1%)
FFT 512_10 4 32 4.50 4.46 (-1%) 3.51 (-22%) 3.59 (-20%) 3.42 (-24%) 3.55 (-21%) 3.81 (-15%) 3.48 (-23%)
FFT 512_10 16 128 1.65 1.64 (0%) 1.10 (-33%) 1.15 (-30%) 1.26 (-24%) 1.08 (-35%) 1.15 (-30%) 1.14 (-31%)
PCG 512_10 1 8 72.21 70.20 (-3%) 67.58 (-6%) 67.37 (-7%) 67.69 (-6%) 66.42 (-8%) 67.81 (-6%)
PCG 512_10 4 32 34.70 32.97 (-5%) 32.30 (-7%) 31.99 (-8%) 31.94 (-8%) 32.02 (-8%) 32.60 (-6%) 32.63 (-6%)
PCG 512_10 64 512 24.63 22.86 (-7%) 23.06 (-6%) 22.12 (-10%) 22.16 (-10%) 22.21 (-10%) 22.99 (-7%) 22.82 (-7%)
PIF 1024_10 16 128 132.20
PIF 1024_10 32 256 69.34
PIF 1024_10 64 512 33.78
PIF 512_10 4 32 59.02 59.10
PIF 512_10 8 64 31.05 29.16
PIF 512_10 16 128 15.50 15.63

Contrast With PR #501 Description

The live PR #501 description is still useful as a source inventory, but it no longer matches the finite split branches.

| Area | PR #501 Description | Split Status |
|---|---|---|
| Overall scope | One large PR for PIF, FFT, NUFFT, scatter/gather, particle update, communication, timings, Alpine, and tests | Six finite branches with explicit dependency order |
| PCG | Not a central part of the PR description | Extracted as standalone pr501-pcg |
| Particle update | Describes direct archive serialization and shared buffers | Also includes receive pre-reserve fix for GPU deserialization performance |
| Scatter/gather | Described together with all other changes | Isolated as pr501-hosg |
| FFT | Mixed with native NUFFT | Backend/transform refactor isolated as pr501-fft |
| NUFFT | Native NUFFT and FINUFFT/cuFINUFFT described together with FFT and PIF | Isolated as pr501-nufft, after scatter/gather and FFT |
| PIF | Mentions separate LandauDampingPIFPruned and ChargedParticlesPIFPruned files | Current split uses LandauDampingPIF with optional pruned mode; check file list before reusing original wording |
| Validation | Generic unit/integration test list | Split map records concrete local tests and LUMI benchmark results |
