Pull request overview
This PR fixes incorrect CUDA min/max reductions when Compyle’s generated min/max collector is used with PyCUDA’s ReductionKernel final warp reduction, which accesses shared memory through volatile pointers. The fix ensures the collector can be copied from const volatile& sources (i.e., volatile out_type shared-memory values) into by-value reducer arguments.
Changes:
- Add a `__device__` default constructor for the generated CUDA min/max collector type.
- Add a `__device__` copy constructor that loads fields from a `const volatile &` source, enabling correct reads from volatile shared memory during reduction.
Comment on lines +100 to +114:
```
__device__ ${dtype}()
{
}

__device__ ${dtype}(${dtype} const volatile &src)
{
% for prop in prop_names:
% if not only_max:
    this->cur_min_${prop} = src.cur_min_${prop};
% endif
% if not only_min:
    this->cur_max_${prop} = src.cur_max_${prop};
% endif
% endfor
}
```
Fixes #98.
The bug is in Compyle's CUDA min/max collector when it is used with PyCUDA's final warp reduction.
Compyle passes the generated collector to `pycuda.reduction.ReductionKernel` with the snippet at compyle/compyle/array.py, lines 209 to 212 in d738bf5.
The generated reducer takes the collector by value:

`WITHIN_KERNEL ${dtype} agg_mmc(${dtype} a, ${dtype} b)`

(compyle/compyle/array.py, lines 77 to 92 in d738bf5)
In PyCUDA, the final warp reduction reads and writes shared memory through `volatile out_type *smem`. So `REDUCE(smem[tid], smem[tid + offset])` copies `volatile out_type` values into the by-value `agg_mmc` parameters. Before this patch, Compyle only generated a volatile assignment operator for the collector:
`__device__ ${dtype} volatile &operator=(${dtype} const &src) volatile`

(compyle/compyle/array.py, lines 98 to 112 in d738bf5)
That handles writing the reduced result back to `smem[tid]`, but it does not define how to read from `volatile out_type` into `agg_mmc`. This patch adds a constructor from `const volatile &`. That constructor is called when `smem[tid]` or `smem[tid + offset]` is copied into the by-value `agg_mmc` arguments, and it explicitly loads each collector field from the volatile source.

The following construction makes the access pattern visible. The `-1` values are not harmless placeholders: they are deliberately placed at positions that should participate in the full reduction but are not observed by lane 0 when the volatile shared-memory updates are missed. With the bug, the computed minimum is still `0`, even though the true minimum is `-1`. If the marked `-1` positions are replaced by `1000`, the computed maximum still does not change. This shows that those values are present in the input but are not reaching the final result.

This matches PyCUDA's reduction tree. In the final warp stage, lane 0 combines values through offsets `32, 16, 8, 4, 2, 1`. The values placed at indices `0, 1, 2, 4, 8, 16, 32, ...` are exactly the ones lane 0 can still see in the broken path. The marked `-1` values require intermediate `smem[...]` updates from other lanes to be visible. Without the `const volatile &` collector constructor, those intermediate volatile shared-memory updates can be missed.

For the original example in #98, `a1 = np.asarray([4.2, 2.0, 0.01, 0.08, 4.0, 29.2], dtype=np.float32)`, Compyle set `ca1.maximum = 4.2` because the broken path effectively computes only `max(4.2, 2.0, 0.01, 4.0)`.