
Torch Quantization to ONNX Export

This example demonstrates how to quantize PyTorch models and then export them to ONNX format. The scripts use the NVIDIA ModelOpt toolkit for quantization and ONNX export.

For vision models, the torch_quant_to_onnx.py script in this directory handles quantization and ONNX export directly.

For LLMs and VLMs, use TensorRT-Edge-LLM, which provides a complete pipeline for quantizing models with ModelOpt and exporting them to optimized ONNX for deployment on edge platforms (Jetson, DRIVE).

| Section | Description | Link |
| --- | --- | --- |
| Pre-Requisites | Required packages to use this example | Link |
| Vision Models | Quantize timm models and export to ONNX | Link |
| LLM Quantization and Export | Quantize and export LLMs/VLMs via TensorRT-Edge-LLM | Link |
| Supported Models | LLM and VLM models supported by TensorRT-Edge-LLM | Link |
| Mixed Precision | Auto mode for optimal per-layer quantization | Link |
| Resources | Extra links to relevant resources | Link |

Pre-Requisites

Docker

Please use the TensorRT Docker image (e.g., nvcr.io/nvidia/tensorrt:26.02-py3) or visit our installation docs for more information.

Set the following environment variables inside the TensorRT Docker container.

export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"

Local Installation

Install Model Optimizer with its ONNX dependencies from PyPI, then install the requirements for this example:

pip install -U "nvidia-modelopt[onnx]"
pip install -r requirements.txt

For TensorRT Compiler framework workloads, install the latest TensorRT from here.

Vision Models

The torch_quant_to_onnx.py script quantizes timm vision models and exports them to ONNX.

What it does

  • Loads a pretrained timm torch model (default: ViT-Base).
  • Quantizes the torch model to FP8, MXFP8, INT8, NVFP4, or INT4_AWQ using ModelOpt.
  • For models with Conv2d layers (e.g., SwinTransformer), automatically overrides Conv2d quantization to FP8 (for MXFP8/NVFP4 modes) or INT8 (for INT4_AWQ mode) for TensorRT compatibility.
  • Exports the quantized model to ONNX.
  • Postprocesses the ONNX model to be compatible with TensorRT.
  • Saves the final ONNX model.

Opset 20 is used to export the torch models to ONNX.
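The quantize-then-export flow can be sketched with ModelOpt's Python API (a minimal sketch; the real script adds calibration data loading and TensorRT-specific postprocessing, and the model name, batch shape, and output path below are placeholders):

import timm
import torch
import modelopt.torch.quantization as mtq

# Load a pretrained timm model (placeholder name; the script defaults to ViT-Base)
model = timm.create_model("vit_base_patch16_224", pretrained=True).cuda().eval()

# Calibration loop: ModelOpt runs this to collect activation statistics
def forward_loop(m):
    for _ in range(8):  # a few batches; use representative data in practice
        m(torch.randn(1, 3, 224, 224, device="cuda"))

# Quantize in place using one of ModelOpt's preset configs (FP8 shown here)
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export the quantized model to ONNX with opset 20, as noted above
torch.onnx.export(
    model,
    torch.randn(1, 3, 224, 224, device="cuda"),
    "model.quant.onnx",
    opset_version=20,
)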

Usage

python torch_quant_to_onnx.py \
    --timm_model_name=<timm model name> \
    --quantize_mode=<fp8|mxfp8|int8|nvfp4|int4_awq> \
    --onnx_save_path=<path to save the exported ONNX model>

Conv2d Quantization Override

TensorRT only supports FP8 and INT8 for convolution operations. When quantizing models with Conv2d layers (like SwinTransformer), the script automatically applies the following overrides:

| Quantize Mode | Conv2d Override | Reason |
| --- | --- | --- |
| FP8, INT8 | None (already compatible) | Native TRT support |
| MXFP8, NVFP4 | Conv2d -> FP8 | TRT Conv limitation |
| INT4_AWQ | Conv2d -> INT8 | TRT Conv limitation |
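In ModelOpt, such an override amounts to editing the preset config's quant_cfg before quantizing. A minimal sketch, assuming a dict-style preset and a wildcard pattern that matches the model's Conv2d module names (both the preset name and the pattern are illustrative and depend on the actual model):

import copy
import modelopt.torch.quantization as mtq

# Start from the MXFP8 preset and force Conv2d layers down to FP8;
# num_bits=(4, 3) is ModelOpt's notation for the FP8 E4M3 format.
config = copy.deepcopy(mtq.MXFP8_DEFAULT_CFG)
for pattern in ("*conv*weight_quantizer", "*conv*input_quantizer"):  # illustrative pattern
    config["quant_cfg"][pattern] = {"num_bits": (4, 3), "axis": None}

model = mtq.quantize(model, config, forward_loop)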

Evaluation

If the input model is an image classification model, use the following script to evaluate it. The script automatically downloads and uses the ILSVRC/imagenet-1k dataset from Hugging Face. This gated repository requires authentication via a Hugging Face access token. See https://huggingface.co/docs/hub/en/security-tokens for details.
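One way to authenticate before running the evaluation (a sketch using the huggingface_hub library; exporting the HF_TOKEN environment variable works as well):

from huggingface_hub import login

# Log in so the gated ILSVRC/imagenet-1k dataset can be downloaded
login(token="hf_...")  # replace with your own access token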

Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.

python ../onnx_ptq/evaluate.py \
    --onnx_path=<path to the exported ONNX model> \
    --imagenet_path=<HF dataset card or local path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=<timm model name>

LLM Quantization and Export with TensorRT-Edge-LLM

TensorRT-Edge-LLM provides a complete pipeline for quantizing LLMs and VLMs using NVIDIA ModelOpt and exporting them to optimized ONNX for deployment on edge platforms such as NVIDIA Jetson and DRIVE.

Overview

The pipeline follows these stages:

  1. Quantize (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
  2. Export (x86 host with GPU) — Convert quantized model to ONNX
  3. Build (edge device) — Compile ONNX into TensorRT engines
  4. Inference (edge device) — Run the compiled engines

Installation

# Use the PyTorch Docker image (recommended)
docker pull nvcr.io/nvidia/pytorch:25.12-py3
docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/pytorch:25.12-py3 bash

# Clone and install TensorRT-Edge-LLM
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
python3 -m venv venv
source venv/bin/activate
pip3 install .

# Verify installation
tensorrt-edgellm-quantize-llm --help
tensorrt-edgellm-export-llm --help

System requirements:

  • x86-64 Linux (Ubuntu 22.04 or 24.04 recommended)
  • NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)
  • CUDA 12.x or 13.x, Python 3.10+
  • GPU VRAM: 16 GB for models up to 3B, 40 GB for models up to 4B, 80 GB for models up to 8B

CLI Tools

| Tool | Purpose |
| --- | --- |
| tensorrt-edgellm-quantize-llm | Quantize LLM models using ModelOpt (FP8, INT4 AWQ, NVFP4) |
| tensorrt-edgellm-export-llm | Export LLM to ONNX with precision-specific optimizations |
| tensorrt-edgellm-export-visual | Export visual encoders for multimodal VLM models |
| tensorrt-edgellm-quantize-draft | Quantize EAGLE draft models for speculative decoding |
| tensorrt-edgellm-export-draft | Export EAGLE draft models to ONNX |
| tensorrt-edgellm-insert-lora | Insert LoRA patterns into existing ONNX models |
| tensorrt-edgellm-process-lora | Process LoRA adapter weights for runtime loading |

Example: Quantize and Export an LLM

# Step 1: Quantize with ModelOpt
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen2.5-3B-Instruct \
    --quantization fp8 \
    --output_dir quantized/qwen2.5-3b-fp8

# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir quantized/qwen2.5-3b-fp8 \
    --output_dir onnx_models/qwen2.5-3b

Example: Quantize and Export a VLM

# Quantize the language model component
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
    --quantization fp8 \
    --output_dir quantized/qwen2.5-vl-3b

# Export the language model
tensorrt-edgellm-export-llm \
    --model_dir quantized/qwen2.5-vl-3b \
    --output_dir onnx_models/qwen2.5-vl-3b/llm

# Export the visual encoder
tensorrt-edgellm-export-visual \
    --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
    --output_dir onnx_models/qwen2.5-vl-3b/visual

Example: EAGLE Speculative Decoding

# Quantize base model
tensorrt-edgellm-quantize-llm \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --output_dir quantized/llama3.1-8b-base

# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
    --model_dir quantized/llama3.1-8b-base \
    --output_dir onnx_models/llama3.1-8b/base \
    --is_eagle_base

# Quantize EAGLE draft model
tensorrt-edgellm-quantize-draft \
    --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
    --draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
    --quantization fp8 \
    --output_dir quantized/llama3.1-8b-draft

# Export draft model
tensorrt-edgellm-export-draft \
    --draft_model_dir quantized/llama3.1-8b-draft \
    --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
    --output_dir onnx_models/llama3.1-8b/draft

Quantization Methods

| Method | Description |
| --- | --- |
| FP8 | Best accuracy-to-memory balance on SM89+ hardware (Hopper, Ada) |
| INT4 AWQ | Weight-only quantization; effective for memory-constrained platforms and low-batch inference |
| NVFP4 | 4-bit format for NVIDIA Blackwell and Thor hardware; applies to both weights and activations |
| MXFP8 | Experimental; Microscaling FP8 format for SM89+ hardware |
| INT8 SmoothQuant | Experimental; INT8 weight and activation quantization with SmoothQuant |
| INT4 GPTQ | Can be loaded directly from Hugging Face Hub (no additional quantization needed) |

Supported Models

For the latest support matrix, see the TensorRT-Edge-LLM Supported Models page.

LLMs

Refer to the support matrix linked above for per-precision coverage (FP16, FP8, INT4, NVFP4) of each model:

  • Llama-3-8B-Instruct
  • Llama-3.1-8B-Instruct
  • Llama-3.2-3B-Instruct
  • Qwen2-0.5B-Instruct
  • Qwen2-1.5B-Instruct
  • Qwen2-7B-Instruct
  • Qwen2.5-0.5B-Instruct
  • Qwen2.5-1.5B-Instruct
  • Qwen2.5-3B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen3-0.6B
  • Qwen3-1.7B
  • Qwen3-4B-Instruct-2507
  • Qwen3-8B
  • DeepSeek-R1-Distill-Qwen-1.5B
  • DeepSeek-R1-Distill-Qwen-7B

VLMs

Refer to the support matrix linked above for per-precision coverage (FP16, FP8, INT4, NVFP4) of each model:

  • Qwen2-VL-2B-Instruct
  • Qwen2-VL-7B-Instruct
  • Qwen2.5-VL-3B-Instruct
  • Qwen2.5-VL-7B-Instruct
  • Qwen3-VL-2B-Instruct
  • Qwen3-VL-4B-Instruct
  • Qwen3-VL-8B-Instruct
  • InternVL3-1B
  • InternVL3-2B
  • Phi-4-multimodal-instruct

Troubleshooting

  • GPU out of memory: Use a larger GPU (40 GB for models up to 4B, 80 GB for models up to 8B) or try --device cpu (limited precision support).
  • Calibration dataset issues: Download the dataset manually and pass the local path with --calib_dataset ./path/to/dataset.
  • Accuracy degradation: Try FP8 instead of INT4/NVFP4, or increase calibration sample size.

For full documentation, see the TensorRT-Edge-LLM Developer Guide.

Mixed Precision Quantization (Auto Mode)

The auto mode enables mixed precision quantization by searching for the optimal quantization format per layer. This approach balances model accuracy and compression by assigning different precision formats (e.g., NVFP4, FP8) to different layers based on their sensitivity.

How it works

  1. Sensitivity Analysis: Computes per-layer sensitivity scores using gradient-based analysis
  2. Format Search: Searches across specified quantization formats for each layer
  3. Constraint Optimization: Finds the optimal format assignment that satisfies the effective bits constraint while minimizing accuracy loss
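Under the hood this maps to ModelOpt's auto_quantize API. A minimal sketch, assuming a data loader that yields (images, labels) batches; the forward and loss functions here are placeholders for a classification setup:

import torch.nn.functional as F
import modelopt.torch.quantization as mtq

# Labels are required in auto mode: per-layer sensitivity scores are
# computed from gradients of a task loss.
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=["NVFP4_AWQ_LITE_CFG", "FP8_DEFAULT_CFG"],
    data_loader=data_loader,
    forward_step=lambda m, batch: m(batch[0].cuda()),
    loss_func=lambda output, batch: F.cross_entropy(output, batch[1].cuda()),
    num_score_steps=128,
)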

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --effective_bits | 4.8 | Target average bits per weight across the model. Lower values = more compression but potentially lower accuracy. The search algorithm finds the optimal per-layer format assignment that meets this constraint while minimizing accuracy loss. For example, 4.8 means an average of 4.8 bits per weight (mix of FP4 and FP8 layers). |
| --num_score_steps | 128 | Number of forward/backward passes used to compute per-layer sensitivity scores via gradient-based analysis. Higher values provide more accurate sensitivity estimates but increase search time. Recommended range: 64-256. |
| --calibration_data_size | 512 | Number of calibration samples used for both sensitivity scoring and calibration. For auto mode, labels are required for loss computation. |
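As a rough illustration of the effective-bits budget (assuming it is the parameter-weighted average of per-layer weight precisions):

# Hypothetical split: 80% of weights quantized to NVFP4 (~4 bits),
# 20% kept in FP8 (8 bits) -> meets the default 4.8-bit budget.
effective_bits = 0.8 * 4 + 0.2 * 8
print(effective_bits)  # 4.8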

Usage

python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=auto \
    --auto_quantization_formats NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG \
    --effective_bits=4.8 \
    --num_score_steps=128 \
    --calibration_data_size=512 \
    --evaluate \
    --onnx_save_path=vit_base_patch16_224.auto_quant.onnx

ONNX Export Supported Vision Models

The following timm models are covered by this example (quantize modes: FP8, INT8, MXFP8, NVFP4, INT4_AWQ, Auto):

  • vit_base_patch16_224
  • swin_tiny_patch4_window7_224
  • swinv2_tiny_window8_256
  • resnet50

Resources

Technical Resources

The example scripts support several quantization schemes:

  1. The FP8 format is available on Hopper and Ada GPUs, i.e., CUDA compute capability 8.9 or higher.

  2. INT4 AWQ is an INT4 weight-only quantization and calibration method. It is particularly effective for low-batch inference, where latency is dominated by weight-loading time rather than by computation. In that regime, INT4 AWQ can deliver lower latency than FP8/INT8 and smaller accuracy degradation than INT8.

  3. NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights and activations, offering the potential for a significant increase in math throughput as well as reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.