XAheli/Kernel-Profiling

Kernel Profiling: CUDA vs Triton vs Helion

A systematic comparison of GPU kernel performance across three frameworks on NVIDIA H200


This repository contains the complete source code and raw Nsight Compute profiling data for three operations: vector addition, batched matrix multiplication, and softmax. Each operation is implemented in all three frameworks (CUDA, Triton, and Helion), and every kernel is a standalone script you can run, profile, and compare.

Built during our internship at the Red Hat PyTorch Team (Bangalore, India) as part of ongoing research into GPU kernel profiling with NVIDIA Nsight Tools.


Highlights

| | CUDA | Triton | Helion |
|---|---|---|---|
| Vec-add (memory-bound) | 36–68% DRAM BW | 86–92% DRAM BW | 86–92% DRAM BW |
| MatMul (compute-bound) | 10–100x slower (no TC) | Near-peak (autotuned) | Near-peak (autotuned) |
| Softmax fwd (FP16) | Moderate | 93.5% compute | Competitive |
| Mixed precision speedup | Minimal | 10–30x via tensor cores | 10–30x via tensor cores |
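
To give a feel for what a "% DRAM BW" figure means for vector addition, here is a minimal back-of-the-envelope sketch. It is not how Nsight Compute derives its numbers (those come from hardware counters); it only assumes that c = a + b reads two elements and writes one per output. The function name, the timing, and the 4.8 TB/s nominal H200 peak are illustrative assumptions, not measurements from this repository.

```python
def vec_add_dram_utilization(n_elements, bytes_per_element, elapsed_s, peak_bw_bytes_per_s):
    """Estimate the fraction of peak DRAM bandwidth used by c = a + b."""
    # Each output element needs two reads (a, b) and one write (c).
    bytes_moved = 3 * n_elements * bytes_per_element
    achieved_bw = bytes_moved / elapsed_s
    return achieved_bw / peak_bw_bytes_per_s


# Hypothetical inputs for illustration only (not taken from the .ncu-rep files).
H200_PEAK_BW = 4.8e12  # assumed nominal HBM3e peak, in bytes per second

util = vec_add_dram_utilization(
    n_elements=2**28,       # 256 Mi elements
    bytes_per_element=4,    # FP32
    elapsed_s=0.75e-3,      # made-up kernel time
    peak_bw_bytes_per_s=H200_PEAK_BW,
)
print(f"Estimated DRAM bandwidth utilization: {util:.1%}")
```

A kernel that lands well below the Triton and Helion range on this estimate is usually leaving bandwidth on the table, for example through uncoalesced accesses or too little work in flight per thread.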

Quick Start

git clone https://github.com/XAheli/Kernel-Profiling.git
cd Kernel-Profiling
pip install -r requirements.txt

# Run a kernel (correctness check against PyTorch)
python matmul/triton/kernel/matmul_triton_configG.py

# Profile it
ncu --set full -o matmul/triton/results/matmul_triton_configG \
    python matmul/triton/kernel/matmul_triton_configG.py

# View results
ncu-ui matmul/triton/results/matmul_triton_configG.ncu-rep

Requirements: Python 3.10+ | CUDA 12.8+ | NVIDIA GPU (tested on H200, 141 GB HBM3e)


What's Inside

<operation>/
├── cuda/
│   ├── kernel/       # Python + inline CUDA C++
│   └── results/      # .ncu-rep reports
├── triton/
│   ├── kernel/       # @triton.jit kernels
│   └── results/
└── helion/
    ├── kernel/       # @helion.kernel DSL
    └── results/

| Operation | What it tests |
|---|---|
| vec-add/ | Pure memory throughput: element-wise add on large 1-D buffers |
| matmul/ | Compute throughput: batched GEMM with tiling and tensor cores |
| softmax/ | Mixed workload: online 2-pass softmax (forward + backward) |
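
For readers unfamiliar with the term, the following is a minimal pure-Python sketch of the general online 2-pass softmax idea for a single row. It is a reference illustration of the technique, not the code in the softmax/ kernels; the function name is hypothetical and finite inputs are assumed.

```python
import math


def online_softmax(row):
    """Softmax over one row using a running max and a rescaled running sum."""
    # Pass 1: track the running max m and the sum d of exp(x - m).
    # Whenever the max grows, the old sum is rescaled to the new max.
    m = float("-inf")
    d = 0.0
    for x in row:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Pass 2: normalize each element with the final max and sum.
    return [math.exp(x - m) / d for x in row]


probs = online_softmax([1.0, 2.0, 3.0])
print(probs)
```

Fusing the max and sum into one pass saves a full read of the row compared with the classic three-pass formulation, which is why this variant is a natural fit for a mixed memory/compute benchmark.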

File naming: <op>_<framework>_<config>.py with matching .ncu-rep results.
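
The naming convention makes it easy to script profiling runs. The sketch below is an assumption-laden helper, not part of the repository: the function names are hypothetical, and it only relies on the layout shown above and the `ncu --set full -o <output> python <script>` invocation from Quick Start.

```python
from pathlib import PurePosixPath


def parse_kernel_path(script):
    """Split <op>_<framework>_<config>.py and derive the matching report path."""
    path = PurePosixPath(script)
    # Split on the first two underscores; the remainder is the config name.
    op, framework, config = path.stem.split("_", 2)
    # Reports live in the sibling results/ directory next to kernel/.
    report = path.parent.parent / "results" / (path.stem + ".ncu-rep")
    return op, framework, config, str(report)


def ncu_command(script):
    """Build the profiling command used in Quick Start as an argument list."""
    report = parse_kernel_path(script)[3]
    # ncu appends the .ncu-rep extension itself, so pass the path without it.
    output = report[: -len(".ncu-rep")]
    return ["ncu", "--set", "full", "-o", output, "python", script]


script = "matmul/triton/kernel/matmul_triton_configG.py"
print(parse_kernel_path(script))
print(" ".join(ncu_command(script)))
```

The returned list can be handed to `subprocess.run` to profile several configurations in a loop.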


Built with care at Red Hat · PyTorch Team · Bangalore, India