perftest: fix reducescatter latency size reporting#93

Open
jasha64 wants to merge 1 commit into NVIDIA:devel from jasha64:devel

@jasha64 jasha64 commented Jun 15, 2026

This fixes incorrect device reducescatter perftest results caused by RUN_ITERS_OP declaring a loop-local num_elems after the kernel argument lists had already captured the outer variable. The benchmark was printing increasing message sizes while repeatedly timing the initial element count, which produced nearly constant latencies and inflated bandwidth values. Reusing the existing num_elems keeps the kernel arguments, size table, and reported metrics in sync.
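The shadowing problem described above can be reproduced in plain C++ without a GPU. The following is a minimal sketch, not the perftest source: `fake_launch`, `timed_counts_shadowed`, and `timed_counts_fixed` are hypothetical names, and the argument array only imitates the way a launch argument list holds the address of a variable. It assumes the loop-local declaration sat in the `for` initializer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel launch: it reads the element count
// through the argument list, which stores the ADDRESS of a variable.
static std::size_t fake_launch(void** args) {
    assert(args != nullptr && args[0] != nullptr);
    return *static_cast<std::size_t*>(args[0]);
}

// Buggy pattern: the for-initializer declares a new num_elems that shadows
// the outer one, so the argument list keeps pointing at the stale outer value.
std::vector<std::size_t> timed_counts_shadowed(std::size_t max_elems) {
    std::size_t num_elems = 1;
    void* args[] = {&num_elems};  // captures the outer variable's address
    std::vector<std::size_t> seen;
    for (std::size_t num_elems = 1; num_elems <= max_elems; num_elems *= 2) {
        seen.push_back(fake_launch(args));  // always sees the outer value
    }
    return seen;
}

// Fixed pattern: the loop reuses the outer num_elems, so every launch
// observes the current element count.
std::vector<std::size_t> timed_counts_fixed(std::size_t max_elems) {
    std::size_t num_elems = 1;
    void* args[] = {&num_elems};
    std::vector<std::size_t> seen;
    for (num_elems = 1; num_elems <= max_elems; num_elems *= 2) {
        seen.push_back(fake_launch(args));
    }
    return seen;
}
```

Calling `timed_counts_shadowed(8)` yields the initial count on every iteration, while `timed_counts_fixed(8)` yields the doubling sequence. Compiling with a shadowing warning enabled (for example `-Wshadow` in GCC or Clang) flags the buggy loop.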

Before this change, the reducescatter_latency perftest reported implausibly large bandwidth values, as in the output below. This pull request fixes that.

#device_reducescatter
size(B)     count     type      redop     scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
8           1         int64     sum       b         2.140800          0.004         0.000       
16          2         int64     sum       b         1.910400          0.008         0.000       
32          4         int64     sum       b         1.923200          0.017         0.000       
64          8         int64     sum       b         1.955200          0.033         0.000       
128         16        int64     sum       b         1.907200          0.067         0.000       
256         32        int64     sum       b         1.907200          0.134         0.000       
512         64        int64     sum       b         1.900800          0.269         0.000       
1024        128       int64     sum       b         1.897600          0.540         0.000       
2048        256       int64     sum       b         1.916800          1.068         0.000       
4096        512       int64     sum       b         1.910400          2.144         0.000       
8192        1024      int64     sum       b         1.913600          4.281         0.000       
16384       2048      int64     sum       b         1.920000          8.533         0.000       
32768       4096      int64     sum       b         1.913600          17.124        0.000       
65536       8192      int64     sum       b         1.913600          34.247        0.000       
131072      16384     int64     sum       b         1.904000          68.840        0.000       
262144      32768     int64     sum       b         1.980800          132.342       0.000       
524288      65536     int64     sum       b         1.916800          273.523       0.000       
1048576     131072    int64     sum       b         2.016000          520.127       0.000       
2097152     262144    int64     sum       b         1.990400          1053.633      0.000       
4194304     524288    int64     sum       b         1.948800          2152.250      0.000       
8388608     1048576   int64     sum       b         1.926400          4354.552      0.000       
16777216    2097152   int64     sum       b         1.948800          8608.999      0.000       
33554432    4194304   int64     sum       b         2.038400          16461.161     0.000       
67108864    8388608   int64     sum       b         1.900800          35305.590     0.000       
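The algbw column above is consistent with message size divided by latency, taking 1 GB as 1e9 bytes. That relationship is inferred from the printed numbers rather than taken from the perftest source, and `algbw_gbps` is a hypothetical helper name. A short sketch shows why a near-constant latency of about 1.9 us makes the reported bandwidth grow linearly with the printed size:

```cpp
#include <cassert>
#include <cstdint>

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes), assuming
// algbw = bytes transferred / elapsed time.
double algbw_gbps(std::uint64_t size_bytes, double latency_us) {
    assert(latency_us > 0.0);
    // bytes / (us * 1e-6 s) / 1e9 bytes per GB = bytes / (us * 1e3)
    return static_cast<double>(size_bytes) / (latency_us * 1e3);
}
```

For the last row, `algbw_gbps(67108864, 1.9008)` gives roughly 35305.59 GB/s, matching the table. Because the timed kernel kept processing the initial element count, the latency in the denominator never grew while the size in the numerator doubled each row.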

Avoid shadowing num_elems in the for loop so that the timed kernel launches use the current message size instead of repeatedly measuring the initial element count.

Signed-off-by: jasha64 <yijunma@student.ethz.ch>