A systematic comparison of GPU kernel performance across three frameworks on NVIDIA H200
This repository contains the complete source code and raw Nsight Compute profiling data for vector addition, batched matrix multiplication, and softmax — each implemented identically in CUDA, Triton, and Helion. Every kernel is a standalone script you can run, profile, and compare.
Built during our internship at the Red Hat PyTorch Team (Bangalore, India) as part of ongoing research into GPU kernel profiling with NVIDIA Nsight Tools.
| Kernel | CUDA | Triton | Helion |
|---|---|---|---|
| Vec-add (memory-bound) | 36–68% DRAM BW | 86–92% DRAM BW | 86–92% DRAM BW |
| MatMul (compute-bound) | 10–100x slower (no tensor cores) | Near-peak (autotuned) | Near-peak (autotuned) |
| Softmax fwd (FP16) | Moderate | 93.5% compute | Competitive |
| Mixed precision speedup | Minimal | 10–30x via tensor cores | 10–30x via tensor cores |
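The DRAM bandwidth percentages in the vec-add row can be sanity-checked by hand. The sketch below is a back-of-envelope estimate, not the metric Nsight Compute reports: it assumes each output element costs two reads and one write, and it assumes the H200 vendor peak of 4.8 TB/s. The function name, the element count, and the kernel time are all hypothetical and are not measurements from this repository.

```python
# Assumed peak DRAM bandwidth for H200 (HBM3e), from the vendor spec: 4.8 TB/s.
H200_PEAK_DRAM_BW_GBPS = 4800.0


def vec_add_dram_utilization(n_elements, bytes_per_element, kernel_time_s,
                             peak_gbps=H200_PEAK_DRAM_BW_GBPS):
    """Estimate the fraction of peak DRAM bandwidth used by c = a + b.

    Hypothetical helper: counts two loads (a, b) and one store (c) per
    element and ignores caching effects.
    """
    bytes_moved = 3 * n_elements * bytes_per_element
    achieved_gbps = bytes_moved / kernel_time_s / 1e9
    return achieved_gbps / peak_gbps


if __name__ == "__main__":
    # Illustrative inputs only: 2**28 FP32 elements and a made-up kernel time.
    utilization = vec_add_dram_utilization(2**28, 4, 0.75e-3)
    print(f"Estimated DRAM utilization: {utilization:.1%}")
```

To apply this to a real run, substitute the kernel duration reported in the profile and the element count and dtype size used by the script.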
git clone https://github.com/XAheli/Kernel-Profiling.git
cd Kernel-Profiling
pip install -r requirements.txt
# Run a kernel (correctness check against PyTorch)
python matmul/triton/kernel/matmul_triton_configG.py
# Profile it
ncu --set full -o matmul/triton/results/matmul_triton_configG \
python matmul/triton/kernel/matmul_triton_configG.py
# View results
ncu-ui matmul/triton/results/matmul_triton_configG.ncu-rep

Requirements: Python 3.10+ | CUDA 12.8+ | NVIDIA GPU (tested on H200, 141 GB HBM3e)
<operation>/
├── cuda/
│   ├── kernel/      # Python + inline CUDA C++
│   └── results/     # .ncu-rep reports
├── triton/
│   ├── kernel/      # @triton.jit kernels
│   └── results/
└── helion/
    ├── kernel/      # @helion.kernel DSL
    └── results/
| Operation | What it tests |
|---|---|
| vec-add/ | Pure memory throughput: element-wise add on large 1-D buffers |
| matmul/ | Compute throughput: batched GEMM with tiling and tensor cores |
| softmax/ | Mixed workload: online 2-pass softmax (forward + backward) |
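For readers unfamiliar with the online 2-pass softmax named above, the idea can be sketched on the CPU in plain Python. This is a scalar reference for the forward pass only, assuming finite inputs. It is not the repository's kernel code, and the function names are hypothetical. Pass 1 keeps a running maximum and a running sum that is rescaled whenever the maximum changes. Pass 2 normalizes.

```python
import math


def online_softmax(row):
    """Forward softmax over one row using the online (running max/sum) scheme."""
    running_max = float("-inf")
    running_sum = 0.0
    # Pass 1: a single sweep that tracks the max and the rescaled normalizer.
    for x in row:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add this term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Pass 2: normalize each element with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]


def naive_softmax(row):
    """Three-pass reference: max, then sum, then normalize."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 1000.0]
    diffs = [abs(a - b) for a, b in zip(online_softmax(values), naive_softmax(values))]
    print("max abs difference vs. naive:", max(diffs))
```

A GPU implementation would typically apply the same rescaling across tiles of a row rather than single elements, which saves one full read of the input compared with the three-pass version.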
File naming: <op>_<framework>_<config>.py with matching .ncu-rep results.
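The naming convention makes it possible to derive a report path from a kernel script path. The sketch below is a hypothetical helper, not part of the repository. It assumes scripts live under `<operation>/<framework>/kernel/`, and it builds the same `ncu --set full -o ...` invocation shown in the quick start, passing the output name without the `.ncu-rep` extension as that command does.

```python
import re
from pathlib import PurePosixPath

# Matches <op>_<framework>_<config>.py for the three frameworks in this repo.
_NAME_RE = re.compile(r"^(?P<op>.+?)_(?P<framework>cuda|triton|helion)_(?P<config>.+)\.py$")


def report_path_for(kernel_script):
    """Map <operation>/<framework>/kernel/X.py to <operation>/<framework>/results/X.ncu-rep."""
    script = PurePosixPath(kernel_script)
    if _NAME_RE.match(script.name) is None:
        raise ValueError(f"unexpected kernel file name: {script.name}")
    # Go up from kernel/ to the framework directory, then into results/.
    return script.parent.parent / "results" / (script.stem + ".ncu-rep")


def ncu_command(kernel_script):
    """Build the profiling command as an argument list (does not run it)."""
    report = report_path_for(kernel_script)
    output_name = str(report.with_suffix(""))  # drop ".ncu-rep", as in the quick start
    return ["ncu", "--set", "full", "-o", output_name, "python", str(kernel_script)]


if __name__ == "__main__":
    script = "matmul/triton/kernel/matmul_triton_configG.py"
    print(report_path_for(script))
    print(" ".join(ncu_command(script)))
```

The argument list can be handed to `subprocess.run` on a machine with Nsight Compute installed.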
- NVIDIA Nsight Compute — Getting Started
- Triton Kernel Profiling with NVIDIA Nsight Tools — Red Hat Emerging Technologies Blog
- Full Research Report (PDF)
Built with care at Red Hat · PyTorch Team · Bangalore, India