
Fix CUDA minmax collector copies from volatile shared memory #127

Open
xsjk wants to merge 1 commit into pypr:main from xsjk:main

Conversation


@xsjk xsjk commented May 7, 2026

Fixes #98.

The bug is in Compyle's CUDA min/max collector when it is used with PyCUDA's final warp reduction.

Compyle passes the generated collector to pycuda.reduction.ReductionKernel (compyle/compyle/array.py, lines 209 to 212 in d738bf5):

knl = ReductionKernel(
    mmc_dtype, neutral="mmc_neutral()",
    reduce_expr="agg_mmc(a, b)",
    map_expr="mmc_from_scalar(%s)" % map_args,

The generated reducer takes the collector by value (compyle/compyle/array.py, lines 77 to 92 in d738bf5):

WITHIN_KERNEL ${dtype} agg_mmc(${dtype} a, ${dtype} b)
{
    ${dtype} result = a;
    % for prop in prop_names:
    % if not only_max:
    if (b.cur_min_${prop} < result.cur_min_${prop})
        result.cur_min_${prop} = b.cur_min_${prop};
    % endif
    % if not only_min:
    if (b.cur_max_${prop} > result.cur_max_${prop})
        result.cur_max_${prop} = b.cur_max_${prop};
    % endif
    % endfor
    return result;
}

In PyCUDA, the final warp reduction reads and writes shared memory through volatile out_type *smem:

if (tid < 32)
{
  // 'volatile' required according to Fermi compatibility guide 1.2.2
  volatile out_type *smem = sdata;
  if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
  if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
  if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
  if (BLOCK_SIZE >= 8)  smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
  if (BLOCK_SIZE >= 4)  smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
  if (BLOCK_SIZE >= 2)  smem[tid] = REDUCE(smem[tid], smem[tid + 1]);
}

So REDUCE(smem[tid], smem[tid + offset]) copies volatile out_type values into the by-value agg_mmc parameters.

Before this patch, Compyle only generated a volatile assignment operator for the collector (compyle/compyle/array.py, lines 98 to 112 in d738bf5):

minmax_operator_tpl = """
__device__ ${dtype} volatile &operator=(
    ${dtype} const &src) volatile
{
    % for prop in prop_names:
    % if not only_max:
    this->cur_min_${prop} = src.cur_min_${prop};
    % endif
    % if not only_min:
    this->cur_max_${prop} = src.cur_max_${prop};
    % endif
    % endfor
    return *this;
}
"""

That handles writing the reduced result back to smem[tid], but it does not define how to read a volatile out_type value into agg_mmc's by-value parameters.

This patch adds a constructor from const volatile &. That constructor is called when smem[tid] or smem[tid + offset] is copied into the by-value agg_mmc arguments, and it explicitly loads each collector field from the volatile source.

The following constructed input makes the access pattern visible. The -1 values are not harmless placeholders: they are deliberately placed at positions that should participate in the full reduction, but are not observed by lane 0 when the volatile shared-memory updates are missed.

import numpy as np
from compyle import array

x = array.wrap_array(np.array([
    0,
    1,
    2, -1,
    4, -1, -1, -1,
    8, -1, -1, -1, -1, -1, -1, -1,
    16, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    32, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    64,
    65,
    66, -1,
    68, -1, -1, -1,
    72, -1, -1, -1, -1, -1, -1, -1,
    90, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    128, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
]), backend="cuda")

array.update_minmax_gpu([x], backend="cuda")
print(x.minimum, x.maximum)

With the bug, the computed minimum is still 0, even though the true minimum is -1. If the marked -1 positions are replaced by 1000, the computed maximum still does not change. This shows that those values are present in the input but are not reaching the final result.

This matches PyCUDA's reduction tree. In the final warp stage, lane 0 combines values through offsets 32, 16, 8, 4, 2, 1:

volatile out_type *smem = sdata;
if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
if (BLOCK_SIZE >= 8)  smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
if (BLOCK_SIZE >= 4)  smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
if (BLOCK_SIZE >= 2)  smem[tid] = REDUCE(smem[tid], smem[tid + 1]);

The values placed at 0, 1, 2, 4, 8, 16, 32, ... are exactly the ones lane 0 can still see in the broken path. The marked -1 values require intermediate smem[...] updates from other lanes to be visible. Without the const volatile & collector constructor, those intermediate volatile shared-memory updates can be missed.

For the original example in #98, a1 = np.asarray([4.2, 2.0, 0.01, 0.08, 4.0, 29.2], dtype=np.float32), Compyle set ca1.maximum=4.2 because the broken path effectively computes only max(4.2, 2.0, 0.01, 4.0).

Copilot AI review requested due to automatic review settings May 7, 2026 19:50

Copilot AI left a comment


Pull request overview

This PR fixes incorrect CUDA min/max reductions when Compyle’s generated min/max collector is used with PyCUDA’s ReductionKernel final warp reduction, which accesses shared memory through volatile pointers. The fix ensures the collector can be copied from const volatile& sources (i.e., volatile out_type shared-memory values) into by-value reducer arguments.

Changes:

  • Add a device default constructor for the generated CUDA min/max collector type.
  • Add a __device__ copy constructor that loads fields from a const volatile& source, enabling correct reads from volatile shared memory during reduction.


Comment thread compyle/array.py
Comment on lines +100 to +114
__device__ ${dtype}()
{
}

__device__ ${dtype}(${dtype} const volatile &src)
{
    % for prop in prop_names:
    % if not only_max:
    this->cur_min_${prop} = src.cur_min_${prop};
    % endif
    % if not only_min:
    this->cur_max_${prop} = src.cur_max_${prop};
    % endif
    % endfor
}
Author


@copilot apply changes based on this feedback



Development

Successfully merging this pull request may close these issues.

Bug: minimum and maximum are computed incorrectly for certain arrays with 'cuda' backend.

2 participants