
Fix CUDA minmax collector copies from volatile shared memory #127

Open
xsjk wants to merge 1 commit into pypr:main from xsjk:main

Conversation


@xsjk xsjk commented May 7, 2026

Fixes #98.

The bug is in Compyle's CUDA min/max collector when it is used with PyCUDA's final warp reduction.

Compyle passes the generated collector to pycuda.reduction.ReductionKernel (compyle/compyle/array.py, lines 209 to 212 in d738bf5):

knl = ReductionKernel(
    mmc_dtype, neutral="mmc_neutral()",
    reduce_expr="agg_mmc(a, b)",
    map_expr="mmc_from_scalar(%s)" % map_args,

The generated reducer takes the collector by value (compyle/compyle/array.py, lines 77 to 92 in d738bf5):

WITHIN_KERNEL ${dtype} agg_mmc(${dtype} a, ${dtype} b)
{
    ${dtype} result = a;
    % for prop in prop_names:
    % if not only_max:
    if (b.cur_min_${prop} < result.cur_min_${prop})
        result.cur_min_${prop} = b.cur_min_${prop};
    % endif
    % if not only_min:
    if (b.cur_max_${prop} > result.cur_max_${prop})
        result.cur_max_${prop} = b.cur_max_${prop};
    % endif
    % endfor
    return result;
}

In PyCUDA, the final warp reduction reads and writes shared memory through volatile out_type *smem:

if (tid < 32)
{
  // 'volatile' required according to Fermi compatibility guide 1.2.2
  volatile out_type *smem = sdata;
  if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
  if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
  if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
  if (BLOCK_SIZE >= 8)  smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
  if (BLOCK_SIZE >= 4)  smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
  if (BLOCK_SIZE >= 2)  smem[tid] = REDUCE(smem[tid], smem[tid + 1]);
}

So REDUCE(smem[tid], smem[tid + offset]) copies volatile out_type values into the by-value agg_mmc parameters.

Before this patch, Compyle only generated a volatile assignment operator for the collector (compyle/compyle/array.py, lines 98 to 112 in d738bf5):

minmax_operator_tpl = """
__device__ ${dtype} volatile &operator=(
    ${dtype} const &src) volatile
{
    % for prop in prop_names:
    % if not only_max:
    this->cur_min_${prop} = src.cur_min_${prop};
    % endif
    % if not only_min:
    this->cur_max_${prop} = src.cur_max_${prop};
    % endif
    % endfor
    return *this;
}
"""

That handles writing the reduced result back to smem[tid], but it does not define how to read a volatile out_type value into agg_mmc's by-value parameters.

This patch adds a constructor from const volatile &. That constructor is called when smem[tid] or smem[tid + offset] is copied into the by-value agg_mmc arguments, and it explicitly loads each collector field from the volatile source.

The following constructed input makes the access pattern visible. The -1 values are not harmless placeholders: they are deliberately placed at positions that should participate in the full reduction, but are not observed by lane 0 when the volatile shared-memory updates are missed.

import numpy as np
from compyle import array

x = array.wrap_array(np.array([
    0,
    1,
    2, -1,
    4, -1, -1, -1,
    8, -1, -1, -1, -1, -1, -1, -1,
    16, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    32, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    64,
    65,
    66, -1,
    68, -1, -1, -1,
    72, -1, -1, -1, -1, -1, -1, -1,
    90, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    128, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
]), backend="cuda")

array.update_minmax_gpu([x], backend="cuda")
print(x.minimum, x.maximum)

With the bug, the computed minimum is still 0, even though the true minimum is -1. If the marked -1 positions are replaced by 1000, the computed maximum still does not change. This shows that those values are present in the input but are not reaching the final result.

This matches PyCUDA's reduction tree. In the final warp stage, lane 0 combines values through offsets 32, 16, 8, 4, 2, 1:

volatile out_type *smem = sdata;
if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
if (BLOCK_SIZE >= 8)  smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
if (BLOCK_SIZE >= 4)  smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
if (BLOCK_SIZE >= 2)  smem[tid] = REDUCE(smem[tid], smem[tid + 1]);

The values placed at 0, 1, 2, 4, 8, 16, 32, ... are exactly the ones lane 0 can still see in the broken path. The marked -1 values require intermediate smem[...] updates from other lanes to be visible. Without the const volatile & collector constructor, those intermediate volatile shared-memory updates can be missed.

For the original example in #98, a1 = np.asarray([4.2, 2.0, 0.01, 0.08, 4.0, 29.2], dtype=np.float32), Compyle set ca1.maximum=4.2 because the broken path effectively computes only max(4.2, 2.0, 0.01, 4.0).

Copilot AI review requested due to automatic review settings May 7, 2026 19:50

Copilot AI left a comment


Pull request overview

This PR fixes incorrect CUDA min/max reductions when Compyle’s generated min/max collector is used with PyCUDA’s ReductionKernel final warp reduction, which accesses shared memory through volatile pointers. The fix ensures the collector can be copied from const volatile& sources (i.e., volatile out_type shared-memory values) into by-value reducer arguments.

Changes:

  • Add a device default constructor for the generated CUDA min/max collector type.
  • Add a __device__ copy constructor that loads fields from a const volatile& source, enabling correct reads from volatile shared memory during reduction.


Comment thread compyle/array.py
Comment on lines +100 to +114
__device__ ${dtype}()
{
}

__device__ ${dtype}(${dtype} const volatile &src)
{
    % for prop in prop_names:
    % if not only_max:
    this->cur_min_${prop} = src.cur_min_${prop};
    % endif
    % if not only_min:
    this->cur_max_${prop} = src.cur_max_${prop};
    % endif
    % endfor
}
Author


@copilot apply changes based on this feedback



Development

Successfully merging this pull request may close these issues.

Bug: minimum and maximum are computed incorrectly for certain arrays with 'cuda' backend.

2 participants