Kernel Profiling: CUDA vs Triton vs Helion

A systematic comparison of GPU kernel performance across three frameworks on NVIDIA H200

This repository contains the complete source code and raw Nsight Compute profiling data for vector addition, batched matrix multiplication, and softmax — each implemented identically in CUDA, Triton, and Helion. Every kernel is a standalone script you can run, profile, and compare.

Built during our internship at the Red Hat PyTorch Team (Bangalore, India) as part of ongoing research into GPU kernel profiling with NVIDIA Nsight Tools.

Highlights

Workload	CUDA	Triton	Helion
Vec-add (memory-bound)	36–68% DRAM BW	86–92% DRAM BW	86–92% DRAM BW
MatMul (compute-bound)	10–100x slower (no tensor cores)	Near-peak (autotuned)	Near-peak (autotuned)
Softmax fwd (FP16)	Moderate	93.5% compute	Competitive
Mixed-precision speedup	Minimal	10–30x via tensor cores	10–30x via tensor cores
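
As a rough cross-check on the DRAM bandwidth figures for vec-add, achieved bandwidth can be estimated from the problem size and the kernel time. The sketch below assumes the kernel reads two inputs and writes one output per element. The function name, the peak value of 4.8 TB/s (the nominal H200 HBM3e figure), and the timing are illustrative assumptions, not measurements from this repository. Nsight Compute derives its own percentage from hardware counters, so the two numbers will not match exactly.

```python
def vec_add_dram_utilization(n_elements, bytes_per_element, elapsed_s, peak_bytes_per_s):
    """Estimate the fraction of peak DRAM bandwidth achieved by c = a + b."""
    # Each element costs two reads (a, b) and one write (c).
    bytes_moved = 3 * n_elements * bytes_per_element
    achieved_bytes_per_s = bytes_moved / elapsed_s
    return achieved_bytes_per_s / peak_bytes_per_s


# Hypothetical inputs for illustration only (not data from this repository).
H200_PEAK_BYTES_PER_S = 4.8e12  # assumed nominal HBM3e peak bandwidth
utilization = vec_add_dram_utilization(
    n_elements=2**28,        # 268M FP32 elements per buffer
    bytes_per_element=4,
    elapsed_s=0.75e-3,       # assumed kernel time
    peak_bytes_per_s=H200_PEAK_BYTES_PER_S,
)
print(f"Estimated DRAM utilization: {utilization:.1%}")
```

An estimate far below the profiler's reading usually means the timing includes launch or synchronization overhead rather than kernel time alone.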

Quick Start

git clone https://github.com/XAheli/Kernel-Profiling.git
cd Kernel-Profiling
pip install -r requirements.txt

# Run a kernel (correctness check against PyTorch)
python matmul/triton/kernel/matmul_triton_configG.py

# Profile it
ncu --set full -o matmul/triton/results/matmul_triton_configG \
    python matmul/triton/kernel/matmul_triton_configG.py

# View results
ncu-ui matmul/triton/results/matmul_triton_configG.ncu-rep

Requirements: Python 3.10+ | CUDA 12.8+ | NVIDIA GPU (tested on H200, 141 GB HBM3e)

What's Inside

<operation>/
├── cuda/
│   ├── kernel/       # Python + inline CUDA C++
│   └── results/      # .ncu-rep reports
├── triton/
│   ├── kernel/       # @triton.jit kernels
│   └── results/
└── helion/
    ├── kernel/       # @helion.kernel DSL
    └── results/

Operation	What it tests
`vec-add/`	Pure memory throughput — element-wise add on large 1-D buffers
`matmul/`	Compute throughput — batched GEMM with tiling and tensor cores
`softmax/`	Mixed workload — online 2-pass softmax (forward + backward)
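
For readers unfamiliar with the term, the "online 2-pass" softmax idea can be sketched in plain Python. This is a reference illustration of the general technique, assuming finite inputs. It is not the repository's kernel code, and the function name is hypothetical. The first pass keeps a running maximum and a running sum that is rescaled whenever the maximum grows; the second pass normalizes.

```python
import math


def online_softmax(row):
    """Numerically stable softmax over one row using two passes."""
    # Pass 1: track the running max m and the running sum d of exp(x - m).
    # When a larger max appears, the old sum is rescaled by exp(m - m_new).
    m = float("-inf")
    d = 0.0
    for x in row:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Pass 2: normalize each element with the final max and sum.
    return [math.exp(x - m) / d for x in row]


probs = online_softmax([1.0, 2.0, 3.0])
print(probs)
```

Compared with the classic three-pass version (max, sum, normalize), this saves one full read of the row, which matters when the row has to be streamed from global memory.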

File naming: <op>_<framework>_<config>.py with matching .ncu-rep results.
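
The naming convention makes it easy to script profiling runs. The sketch below is a hypothetical set of helpers, not part of the repository. It assumes the directory layout shown above and the `ncu --set full -o ...` invocation from Quick Start. It splits a kernel filename into its parts, derives the matching report path, and builds the profiling command without running it.

```python
from pathlib import Path

# Framework names taken from the directory layout above.
FRAMEWORKS = ("cuda", "triton", "helion")


def parse_kernel_name(script):
    """Split '<op>_<framework>_<config>.py' into (op, framework, config)."""
    stem = Path(script).stem
    for framework in FRAMEWORKS:
        op, sep, config = stem.partition(f"_{framework}_")
        if sep and op and config:
            return op, framework, config
    raise ValueError(f"{script!r} does not match <op>_<framework>_<config>.py")


def report_path(script):
    """Map <op>/<framework>/kernel/X.py to <op>/<framework>/results/X.ncu-rep."""
    p = Path(script)
    return p.parent.parent / "results" / f"{p.stem}.ncu-rep"


def ncu_command(script):
    """Build the Nsight Compute command as an argument list."""
    p = Path(script)
    # ncu appends the .ncu-rep extension to the name given with -o.
    output_base = p.parent.parent / "results" / p.stem
    return ["ncu", "--set", "full", "-o", str(output_base), "python", str(p)]


script = "matmul/triton/kernel/matmul_triton_configG.py"
print(parse_kernel_name(script))
print(report_path(script))
print(" ".join(ncu_command(script)))
```

The argument list can be passed to `subprocess.run` on a machine with Nsight Compute installed.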

Related Resources

Built with care at Red Hat · PyTorch Team · Bangalore, India
