PR 501 Split Map
Source PR: #501
Compared against: origin/master at da43fd66a18848b65dcd82af90001f5b64a798ab
Original PR size: 41 commits, 149 files changed, 23,231 insertions, 2,205 deletions.
The original PR is reviewable only as a source branch. The extracted split is now the working structure:
| Order | Branch | Scope | Depends on |
|---|---|---|---|
| 1 | pr501-pcg | PCG / preconditioner allocation cleanup | master |
| 2 | pr501-communication-particle-update | communication buffers, particle update, particle serialization, particle send/recv tests | pr501-pcg |
| 3 | pr501-hosg | higher-order scatter/gather, binning, autotune cache | pr501-communication-particle-update |
| 4 | pr501-fft | FFT backend / transform refactor, pruned transforms | pr501-hosg |
| 5 | pr501-nufft | native NUFFT plus FINUFFT/cuFINUFFT integration | pr501-fft |
| 6 | pr501-pif | Alpine ElectrostaticPIF examples | pr501-nufft |
Split Rationale
The review load in PR #501 is dominated by three independent-looking but technically dependent areas:
- Particle communication/update: direct archive serialization, shared send/recv buffers, layout migration, sorting, and nghost plumbing.
- Interpolation/autotune: new scatter/gather dispatch, binning, tiling, tuning cache, and legacy CIC routing.
- FFT/NUFFT/PIF: transform/backend refactor, native NUFFT, FINUFFT/cuFINUFFT support, and Alpine PIF examples.
The dependency order is:
PCG
-> communication + particle update
-> higher-order scatter/gather
-> FFT backend / transform refactor
-> native NUFFT + FINUFFT/cuFINUFFT
-> PIF examples
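The chain above can be re-derived mechanically from the "Depends on" column of the split table, which is a useful check before rebasing the stack. The following is a minimal sketch under stated assumptions: `pr501Stack` and `stackOrder` are hypothetical helper names that do not exist in the repository, and the code assumes the stack is strictly linear (each base has exactly one dependent branch).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Branch -> base, copied from the "Depends on" column of the split table.
std::map<std::string, std::string> pr501Stack() {
    return {
        {"pr501-pcg", "master"},
        {"pr501-communication-particle-update", "pr501-pcg"},
        {"pr501-hosg", "pr501-communication-particle-update"},
        {"pr501-fft", "pr501-hosg"},
        {"pr501-nufft", "pr501-fft"},
        {"pr501-pif", "pr501-nufft"},
    };
}

// Return the branches in the order they have to land, starting from `root`.
// Hypothetical helper: assumes a linear stack (each base has one dependent).
std::vector<std::string> stackOrder(const std::map<std::string, std::string>& dependsOn,
                                    const std::string& root) {
    assert(dependsOn.count(root) == 0);  // the root is only ever a base

    // Invert the table: base -> the branch stacked directly on top of it.
    std::map<std::string, std::string> child;
    for (const auto& [branch, base] : dependsOn) {
        if (!child.emplace(base, branch).second) {
            throw std::runtime_error("two branches share the base " + base);
        }
    }

    // Follow the chain upward from the root.
    std::vector<std::string> order;
    std::string current = root;
    for (auto it = child.find(current); it != child.end(); it = child.find(current)) {
        order.push_back(it->second);
        current = it->second;
    }
    // Anything left over is disconnected from the root (or part of a cycle).
    if (order.size() != dependsOn.size()) {
        throw std::runtime_error("some branches are not reachable from " + root);
    }
    return order;
}
```

Calling `stackOrder(pr501Stack(), "master")` yields the six branches from pr501-pcg through pr501-pif in merge order; a table edit that breaks the chain shows up as an exception rather than as a silently wrong order.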
Important dependency details:
- Native NUFFT uses the new interpolation scatter/gather layer.
- PIF uses FFT<NUFFTransform, Field_t>, scatterPIFNUFFT, and gatherPIFNUFFT.
- Particle update and communication are coupled through ParticleAttrib::serialize, ParticleAttrib::deserialize, Archive, and BufferHandler.
- nghost changes touch FieldLayout and HaloCells; they belong with particle update because the update path needs kernel-width-aware halos.
Extracted Branches
PR 1: PCG Allocation / Solver Performance
Branch: pr501-pcg
Scope:
- src/LinearSolvers/PCG.h
- src/LinearSolvers/Preconditioner.h
- src/PoissonSolvers/PoissonCG.h
Purpose:
- Hoist repeated per-iteration allocation work out of PCG/preconditioner hot paths.
- Pass Field by reference through OperatorF.
- Keep this independent from particle, interpolation, FFT, NUFFT, and PIF changes.
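The two performance points above (scratch allocated once instead of per iteration, and operator results written into a caller-owned buffer passed by reference) can be illustrated with a small standalone conjugate-gradient loop. This is a sketch of the general pattern only, not the code in PCG.h: `Vec`, `applyLaplacian`, `conjugateGradient`, and `residualNorm` are hypothetical names, the operator is a stand-in 1D Laplacian, and `std::vector` replaces the IPPL Field type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Stand-in operator: 1D Laplacian stencil (2 on the diagonal, -1 next to it).
// The result goes into a caller-owned buffer passed by reference, so applying
// the operator never allocates a temporary.
void applyLaplacian(const Vec& x, Vec& out) {
    assert(out.size() == x.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        double v = 2.0 * x[i];
        if (i > 0) v -= x[i - 1];
        if (i + 1 < n) v -= x[i + 1];
        out[i] = v;
    }
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG; x holds the initial guess on entry and the solution on
// exit. Returns the number of iterations performed.
int conjugateGradient(const Vec& b, Vec& x, double tol, int maxIter) {
    assert(x.size() == b.size());
    const std::size_t n = b.size();
    Vec r(n), p(n), Ap(n);  // all scratch is allocated once, before the loop

    applyLaplacian(x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rr = dot(r, r);

    int iter = 0;
    while (iter < maxIter && std::sqrt(rr) > tol) {
        applyLaplacian(p, Ap);  // reuses Ap; nothing is allocated in the loop
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rrNew = dot(r, r);
        const double beta = rrNew / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
        ++iter;
    }
    return iter;
}

// Verification helper (allocates; not part of the hot path).
double residualNorm(const Vec& b, const Vec& x) {
    Vec Ax(x.size());
    applyLaplacian(x, Ax);
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = b[i] - Ax[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```

The contrast to look for is an operator that returns a freshly constructed vector on every call, or scratch vectors declared inside the `while` body: both move an allocation into every iteration, which is the cost this PR is described as removing.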
PR 2: Communication And Particle Update Infrastructure
Branch: pr501-communication-particle-update
Scope:
- src/Communicate/*
- src/Particle/ParticleSpatialLayout*
- src/Particle/ParticleAttrib*
- src/Particle/ParticleAttribBase.h
- src/Particle/ParticleBase*
- src/Particle/ParticleSort.h
- src/Particle/SortBuffer.h
- src/FieldLayout/FieldLayout*
- src/Field/HaloCells*
- particle update/send-recv tests
Purpose:
- Replace per-attribute staging buffers with direct serialization into shared communication archives.
- Rework particle migration into a packed locate/send/recv path with reusable scratch.
- Preserve this as the non-PIF communication and particle update PR.
Important fix added during split:
- The receive path now pre-reserves particle attribute storage once per update before deferred receive deserialization.
- This avoids repeated content-preserving Kokkos::resize(offset + nrecvs) calls, previously issued once per source rank and per attribute.
- On GPU this removed the observed multi-second particleDeserialize bottleneck.
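The cost difference between the two receive patterns can be seen by counting how many existing elements get copied. The sketch below is a host-side model, not the IPPL code: `CountingBuffer`, `copiesWithPerSourceResize`, and `copiesWithPreReserve` are hypothetical names, and it assumes that every content-preserving resize allocates fresh storage and copies the surviving elements across, which is how this document describes the slow path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Model of one particle attribute whose resize preserves existing contents by
// allocating new storage of exactly the requested size and copying into it.
struct CountingBuffer {
    std::vector<double> data;
    std::size_t copied = 0;  // total elements copied by preserving resizes

    void preservingResize(std::size_t n) {
        std::vector<double> fresh(n);
        const std::size_t keep = std::min(n, data.size());
        std::copy(data.begin(), data.begin() + keep, fresh.begin());
        copied += keep;
        data.swap(fresh);
    }
};

// Slow pattern: grow the attribute once per source rank while deserializing.
std::size_t copiesWithPerSourceResize(std::size_t localCount,
                                      const std::vector<std::size_t>& recvCounts) {
    CountingBuffer buf;
    buf.data.assign(localCount, 0.0);
    std::size_t offset = localCount;
    for (std::size_t nrecvs : recvCounts) {
        buf.preservingResize(offset + nrecvs);
        offset += nrecvs;
    }
    return buf.copied;
}

// Fixed pattern: size the attribute once for all incoming particles.
std::size_t copiesWithPreReserve(std::size_t localCount,
                                 const std::vector<std::size_t>& recvCounts) {
    const std::size_t totalRecvs =
        std::accumulate(recvCounts.begin(), recvCounts.end(), std::size_t{0});
    CountingBuffer buf;
    buf.data.assign(localCount, 0.0);
    buf.preservingResize(localCount + totalRecvs);
    assert(buf.data.size() == localCount + totalRecvs);
    return buf.copied;
}
```

With 1000 local particles and three source ranks sending 10, 20, and 30 particles, the per-source pattern copies 1000 + 1010 + 1030 = 3040 elements while the pre-reserve pattern copies 1000. In this model the gap grows with the number of source ranks and is paid again for every attribute.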
Key regression tests:
- ParticleSendRecv with at least 2 MPI ranks; 1 rank does not exercise receive/deserialization.
- ParticleUpdate
- ParticleUpdateNonuniform
PR 3: Higher-Order Scatter/Gather And Autotune
Branch: pr501-hosg
Scope:
- src/Interpolation/*
- cmake/AutoTunePresets.cmake
- cmake/IpplAutoTunePresets.h.in
- cmake/auto_tune/*
- interpolation tests
Purpose:
- Add scatter/gather dispatch over atomic, atomic-sort, tiled, and output-focused paths.
- Add particle binning and hardware tile-size cache lookup.
- Provide the algorithmic layer required by native NUFFT.
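As background for the binning step, the sketch below shows a generic counting-sort binning of particles into uniform cells in one dimension. It is an illustration of the general technique only and makes no claim about how src/Interpolation implements it: `Bins` and `binParticles` are hypothetical names, and positions are assumed to lie in [0, length).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Bins {
    // offsets has nCells + 1 entries; the particles of cell c are
    // permutation[offsets[c]] .. permutation[offsets[c + 1] - 1].
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> permutation;  // particle indices grouped by cell
};

// Counting-sort binning of 1D positions into nCells uniform cells.
Bins binParticles(const std::vector<double>& positions, double length, std::size_t nCells) {
    assert(length > 0.0 && nCells > 0);
    Bins bins;
    bins.offsets.assign(nCells + 1, 0);

    // Pass 1: find each particle's cell and count particles per cell.
    std::vector<std::size_t> cell(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        assert(positions[i] >= 0.0);
        std::size_t c = static_cast<std::size_t>(positions[i] / length * nCells);
        if (c >= nCells) c = nCells - 1;  // clamp a position sitting on the upper edge
        cell[i] = c;
        ++bins.offsets[c + 1];
    }

    // Pass 2: prefix sum turns per-cell counts into start offsets.
    for (std::size_t c = 0; c < nCells; ++c) bins.offsets[c + 1] += bins.offsets[c];

    // Pass 3: place each particle index into its cell's slot, keeping input order.
    std::vector<std::size_t> cursor(bins.offsets.begin(), bins.offsets.end() - 1);
    bins.permutation.resize(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        bins.permutation[cursor[cell[i]]++] = i;
    }
    return bins;
}
```

Once particles are grouped this way, a scatter can process one cell (or tile of cells) at a time, so neighbouring particles write to neighbouring grid points. That locality is the usual motivation for sorted and tiled variants over a plain atomic scatter.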
PR 4: FFT Backend / Transform Refactor
Branch: pr501-fft
Scope:
- src/FFT/Backend/*
- src/FFT/Transform/CC.h
- src/FFT/Transform/RC.h
- src/FFT/Transform/Trig.h
- src/FFT/Transform/PrunedCC.h
- src/FFT/Transform/PrunedRC.h
- src/FFT/Transform/Common.h
- src/FFT/Traits.h
- existing FFT test updates
Purpose:
- Split the old monolithic FFT layer into backend and transform frontends.
- Expose CC, RC, pruned CC/RC, and trigonometric transforms without native NUFFT/PIF in the same PR.
Intentionally left for PR 5:
- src/FFT/NUFFT/*
- src/FFT/Transform/NUFFT.*
- NUFFT-specific tests
- FINUFFT/cuFINUFFT dependency use beyond dormant traits/config hooks.
PR 5: Native NUFFT And FINUFFT/cuFINUFFT Integration
Branch: pr501-nufft
Scope:
- src/FFT/NUFFT/*
- src/FFT/Transform/NUFFT.*
- IPPL_ENABLE_FINUFFT
- IPPL_ENABLE_CUFFTMP
- NUFFT tests
Purpose:
- Add native IPPL NUFFT and optional FINUFFT/cuFINUFFT backend support.
- Keep all PIF examples out of this PR.
Validation with FINUFFT disabled:
cmake -S . -B build-pr501-nufft-debug -DCMAKE_BUILD_TYPE=Debug -DIPPL_PLATFORMS=SERIAL -DIPPL_ENABLE_FFT=ON -DIPPL_ENABLE_UNIT_TESTS=ON
cmake --build build-pr501-nufft-debug --target ippl NUFFT NUFFTAccuracy -j 8
ctest --test-dir build-pr501-nufft-debug -R '^NUFFT$|^NUFFTAccuracy$' --output-on-failure
2/2 NUFFT tests passed
Total test time: 19.19 sec
PR 6: PIF / Alpine Examples
Branch: pr501-pif
Scope:
- alpine/ElectrostaticPIF/*
- PIF-specific particle attribute convenience APIs if not already included in NUFFT
- Alpine PIF wiring
Purpose:
- Add the ElectrostaticPIF examples after the core particle, interpolation, FFT, and NUFFT APIs are already reviewable.
Current status:
- LandauDampingPIF supports the default upsampled mode and an optional positional pruned mode.
- LandauDampingPIFPruned is not kept as a separate executable in the current split.
- pr501-pif has the receive pre-reserve particle update fix applied and pushed.
Build note:
- LandauDampingPIF is added when IPPL_ENABLE_ALPINE=ON and IPPL_ENABLE_FFT=ON.
- With IPPL_ENABLE_FINUFFT=OFF, it can compile against the native NUFFT stack, but runtime behavior depends on ChargedParticlesPIF::initNUFFT() selecting the native backend.
LUMI Status
The split branch stack is running successfully on LUMI. Current benchmark data show no evidence that the decomposition itself introduced a broad regression; several workloads improve relative to master.
| Benchmark | Problem size | Nodes | Ranks | master | pr501-pcg | pr501-com | pr501-fft | pr501-hosg | pr501-nufft |
|---|---|---|---|---|---|---|---|---|---|
| FEM | 513_10 | 8 | 64 | 28.10 | 28.13 (0%) | 27.55 (-2%) | 26.78 (-5%) | 27.29 (-3%) | 26.49 (-6%) |
| FFT | 512_10 | 4 | 32 | 4.43 | 4.38 (-1%) | 3.48 (-21%) | 3.70 (-16%) | 3.47 (-22%) | 3.53 (-20%) |
| FFT | 512_10 | 16 | 128 | 1.66 | 1.66 (0%) | 1.14 (-31%) | 1.13 (-31%) | 1.49 (-10%) | 1.08 (-35%) |
| PCG | 512_10 | 1 | 8 | 72.28 | 70.17 (-3%) |  | 67.58 (-6%) | 67.65 (-6%) | 67.59 (-6%) |
| PCG | 512_10 | 4 | 32 | 34.80 | 32.95 (-5%) | 32.34 (-7%) | 36.40 (+5%) | 31.93 (-8%) | 31.93 (-8%) |
| PCG | 512_10 | 64 | 512 | 25.09 | 22.96 (-8%) | 22.93 (-9%) | 22.11 (-12%) | 22.09 (-12%) | 22.21 (-11%) |
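The parenthesised percentages read as signed changes relative to the master column. The helper below is a sketch of that reading, with the formula inferred from the table rather than taken from the benchmark scripts; `relativeChangePercent` is a hypothetical name. The published figures were presumably computed from unrounded timings, so recomputing from the two-decimal values shown here can differ by one point in borderline cases.

```cpp
#include <cassert>
#include <cmath>

// Signed change of a branch timing relative to master, in whole percent.
// Negative values mean the branch is faster than master.
int relativeChangePercent(double master, double branch) {
    assert(master > 0.0);
    return static_cast<int>(std::lround(100.0 * (branch - master) / master));
}
```

For example, `relativeChangePercent(34.80, 36.40)` gives 5, matching the one positive entry in the table (pr501-fft on the 4-node PCG run), and `relativeChangePercent(4.43, 3.48)` gives -21 for pr501-com on the 4-node FFT run.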
Particle receive pre-reserve fix evidence from LUMI:
| Run | Before | After |
|---|---|---|
| 32 ranks particleDeserialize wall max | 7.88320 s | 0.0106297 s |
| 32 ranks updateParticle wall max | 8.08981 s | 0.338958 s |
| 128 ranks particleDeserialize wall max | 1.40305 s | 0.0132888 s |
| 128 ranks updateParticle wall max | 1.59376 s | 0.214328 s |
Interpretation:
- The slow path repeatedly resized particle attributes during deferred receive finalization.
- Each preserving resize copied existing particle storage on the GPU.
- Pre-reserving once to localAfterDestroy + totalRecvs removes the repeated copy pattern.
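The before/after timings can be turned into two derived quantities that make the interpretation concrete: the speedup of each timer, and the share of updateParticle time that was spent in particleDeserialize. The sketch below uses hypothetical names (`Timing`, `speedup`, `deserializeShare`). Both timers are reported as maxima over ranks, so the share is an approximation: the two maxima need not come from the same rank.

```cpp
#include <cassert>

// One row pair from the table: wall-max timings in seconds.
struct Timing {
    double deserialize;  // particleDeserialize wall max
    double update;       // updateParticle wall max
};

// How many times faster the "after" measurement is.
double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}

// Approximate fraction of updateParticle time spent in particleDeserialize.
double deserializeShare(const Timing& t) {
    assert(t.update > 0.0);
    return t.deserialize / t.update;
}
```

Using the 32-rank rows, `speedup(7.88320, 0.0106297)` is about 742 and `speedup(8.08981, 0.338958)` is about 24, while `deserializeShare` drops from about 0.97 to about 0.03. On 128 ranks the particleDeserialize speedup is about 106. In other words, deserialization accounted for nearly all of the update time before the fix and only a few percent afterwards.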
LUMI Final Results
| Benchmark | Problem size | Nodes | Ranks | master | pr501-pcg | pr501-com | pr501-fft | pr501-hosg | pr501-nufft | pr501-original | pr501-pif |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FEM | 513_10 | 8 | 64 | 28.08 | 27.93 (-1%) | 27.01 (-4%) | 26.56 (-5%) | 26.60 (-5%) | 26.32 (-6%) | 28.76 (+2%) | 27.78 (-1%) |
| FFT | 512_10 | 4 | 32 | 4.50 | 4.46 (-1%) | 3.51 (-22%) | 3.59 (-20%) | 3.42 (-24%) | 3.55 (-21%) | 3.81 (-15%) | 3.48 (-23%) |
| FFT | 512_10 | 16 | 128 | 1.65 | 1.64 (0%) | 1.10 (-33%) | 1.15 (-30%) | 1.26 (-24%) | 1.08 (-35%) | 1.15 (-30%) | 1.14 (-31%) |
| PCG | 512_10 | 1 | 8 | 72.21 | 70.20 (-3%) |  | 67.58 (-6%) | 67.37 (-7%) | 67.69 (-6%) | 66.42 (-8%) | 67.81 (-6%) |
| PCG | 512_10 | 4 | 32 | 34.70 | 32.97 (-5%) | 32.30 (-7%) | 31.99 (-8%) | 31.94 (-8%) | 32.02 (-8%) | 32.60 (-6%) | 32.63 (-6%) |
| PCG | 512_10 | 64 | 512 | 24.63 | 22.86 (-7%) | 23.06 (-6%) | 22.12 (-10%) | 22.16 (-10%) | 22.21 (-10%) | 22.99 (-7%) | 22.82 (-7%) |
| PIF | 1024_10 | 16 | 128 |  |  |  |  |  |  |  | 132.20 |
| PIF | 1024_10 | 32 | 256 |  |  |  |  |  |  |  | 69.34 |
| PIF | 1024_10 | 64 | 512 |  |  |  |  |  |  |  | 33.78 |
| PIF | 512_10 | 4 | 32 |  |  |  |  |  |  | 59.02 | 59.10 |
| PIF | 512_10 | 8 | 64 |  |  |  |  |  |  | 31.05 | 29.16 |
| PIF | 512_10 | 16 | 128 |  |  |  |  |  |  | 15.50 | 15.63 |
Contrast With PR #501 Description
The live PR #501 description is still useful as a source inventory, but it no longer matches the finite split branches.
| Area | PR #501 Description | Split Status |
|---|---|---|
| Overall scope | One large PR for PIF, FFT, NUFFT, scatter/gather, particle update, communication, timings, Alpine, and tests | Six finite branches with explicit dependency order |
| PCG | Not a central part of the PR description | Extracted as standalone pr501-pcg |
| Particle update | Describes direct archive serialization and shared buffers | Also includes receive pre-reserve fix for GPU deserialization performance |
| Scatter/gather | Described together with all other changes | Isolated as pr501-hosg |
| FFT | Mixed with native NUFFT | Backend/transform refactor isolated as pr501-fft |
| NUFFT | Native NUFFT and FINUFFT/cuFINUFFT described together with FFT and PIF | Isolated as pr501-nufft, after scatter/gather and FFT |
| PIF | Mentions separate LandauDampingPIFPruned and ChargedParticlesPIFPruned files | Current split uses LandauDampingPIF with optional pruned mode; check file list before reusing original wording |
| Validation | Generic unit/integration test list | Split map records concrete local tests and LUMI benchmark results |