Accelerated on-device training with edge AI accelerators

An accelerated on-device training pipeline that uses an inference-only edge AI co-processor (Hailo). The project evaluates the proposed workflow and compares it with general-purpose edge AI hardware, such as the Raspberry Pi and the GPU-based NVIDIA Jetson. The repository covers model preparation, conversion, on-device training, and runtime profiling.

Build

Jetson Orin Nano

JetPack: 6.2
CUDA: 12.6
Python: 3.12.6

ONNX Runtime

Source: https://github.com/microsoft/onnxruntime
Branch: v1.22.1

Build command:

./build.sh --config Release --update --build --build_shared_lib --no_kleidiai --skip_tests --enable_training --build_wheel --use_cuda --cuda_home /usr/local/cuda --cuda_version=12.6 --cudnn_home /usr/lib/aarch64-linux-gnu --parallel

Hailo edge AI accelerator support

Hailo accelerator
HailoRT: 4.23.0 (pre-built packages, documentation, and installation steps are available at https://hailo.ai/developer-zone/)
Python: 3.12.6

ONNX Runtime

Source: https://github.com/MatPiech/onnxruntime
Branch: v1.22.1-hailo

Build command:

./build.sh --config Release --update --build --build_shared_lib --no_kleidiai --skip_tests --enable_pybind --build_wheel --use_hailo --enable_training --parallel --cmake_extra_defines CMAKE_POLICY_VERSION_MINIMUM=3.5

Conversion setup

Installation steps

To avoid dependency conflicts, it is recommended to create two separate virtualenvs, e.g., hailo (conversion from ONNX to the Hailo HAR and HEF formats) and ort (conversion from PyTorch to ONNX and generation of ONNX Runtime training graphs).

hailo virtualenv:

Use Python 3.10 due to Hailo Dataflow Compiler Python bindings requirements.
Install HailoRT (4.23.0) and Hailo Dataflow Compiler (3.31.0) - follow the instructions at https://hailo.ai/developer-zone/.
Install repository dependencies: pip install -e .

ort virtualenv:

Keep the same Python version (3.10) for compatibility with hailo virtualenv.
Install repository dependencies: pip install -e .
Build or install onnxruntime-training with HailoRT provider support to enable conversion of ONNX models with HailoOp to training graphs compatible with ONNX Runtime training APIs:

pip install onnxruntime_training-1.22.1+hailo*.whl
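
Because both conversion environments depend on a specific Python version, it can help to check the interpreter before installing anything. The snippet below is a minimal sketch, not part of the repository: the environment names are the ones suggested above, and the helper names (`python_matches`, `describe`) are hypothetical.

```python
import sys

# Versions stated in this README; environment names follow the suggestion above.
REQUIRED_PYTHON = {
    "hailo": (3, 10),  # required by the Hailo Dataflow Compiler Python bindings
    "ort": (3, 10),    # kept identical to the hailo virtualenv
}


def python_matches(env_name, version_info=None):
    """Return True if the (major, minor) version matches the one required for env_name."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) == REQUIRED_PYTHON[env_name]


def describe(env_name):
    """Build a one-line status message for the running interpreter."""
    required = ".".join(map(str, REQUIRED_PYTHON[env_name]))
    found = ".".join(map(str, sys.version_info[:2]))
    status = "OK" if python_matches(env_name) else "MISMATCH"
    return f"{env_name}: requires Python {required}, found {found} [{status}]"


print(describe("hailo"))
```

Running it inside an activated virtualenv reports whether that environment was created with the expected interpreter.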

Model conversion

Model conversion relies on the notebooks in notebooks/ and the helper script scripts/change_model_output_features.py.

Create a PyTorch model (from PyTorch Image Models - timm) and export to ONNX: notebooks/create_onnx_model.ipynb (ort virtualenv).
Convert ONNX for Hailo (quantization/optimization): notebooks/create_onnx_hailoop_model.ipynb (hailo virtualenv).
Generate training artifacts for ONNX Runtime: notebooks/generate_training_graph.ipynb (ort virtualenv).

The conversion flow typically produces artifacts under artifacts/<model_name>/... used by the runtime scripts.
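
The last conversion step can be sketched with the `onnxruntime.training.artifacts` module. This is an illustration under assumptions rather than the content of notebooks/generate_training_graph.ipynb: the choice of loss and optimizer, the prefix-based split of parameters, and the helper names are all hypothetical, and `param_names` is assumed to come from the PyTorch model (for example, from `named_parameters()`).

```python
from pathlib import Path


def split_parameters(param_names, trainable_prefixes):
    """Split parameter names into (requires_grad, frozen) lists by name prefix."""
    prefixes = tuple(trainable_prefixes)
    requires_grad, frozen = [], []
    for name in param_names:
        target = requires_grad if name.startswith(prefixes) else frozen
        target.append(name)
    return requires_grad, frozen


def generate_training_artifacts(onnx_path, output_dir, param_names, trainable_prefixes):
    """Write training, eval, and optimizer graphs plus a checkpoint to output_dir."""
    # Imported lazily so split_parameters stays usable without these packages.
    import onnx
    from onnxruntime.training import artifacts

    requires_grad, frozen = split_parameters(param_names, trainable_prefixes)
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    artifacts.generate_artifacts(
        onnx.load(onnx_path),
        requires_grad=requires_grad,                 # parameters updated on device
        frozen_params=frozen,                        # parameters kept fixed
        loss=artifacts.LossType.CrossEntropyLoss,    # assumed classification loss
        optimizer=artifacts.OptimType.AdamW,         # assumed optimizer
        artifact_directory=str(output_dir),
    )


# Hypothetical parameter names: train only the classifier head.
print(split_parameters(["conv1.weight", "fc.weight", "fc.bias"], ["fc."]))
```

Freezing everything except the final layers keeps the trainable part of the graph small, which matters when the frozen part is intended to run on an inference-only accelerator.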

Runtime installation

Jetson Orin Nano

Use Python 3.12 with pip (preferably in a virtualenv).
Install project dependencies:

pip install -e .[jetson]

Build or install onnxruntime-training with CUDA support.

pip install onnxruntime_training-1.22.1+cu12*.whl

CPU-based devices (Raspberry Pi)

Use Python 3.12 with pip (preferably in a virtualenv).
Install HailoRT and its Python bindings following the instructions at https://hailo.ai/developer-zone/.
Install project dependencies:

pip install -e .[rpi]

Build or install the onnxruntime-training package for the target execution provider (CPU or Hailo, depending on the platform):

pip install onnxruntime_training-1.22.1+*.whl
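
After installing a wheel on either platform, a quick check of the available execution providers can confirm that the build includes the intended backend. The sketch below uses `onnxruntime.get_available_providers()`; the provider name `HailoExecutionProvider` is an assumption about what the Hailo-enabled fork registers, and the helper names are hypothetical.

```python
# Mapping from the --device values used by the scripts to ONNX Runtime provider names.
# "HailoExecutionProvider" is an assumption about the name registered by the Hailo fork.
DEVICE_TO_PROVIDER = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "hailo": "HailoExecutionProvider",
}


def missing_providers(devices, available):
    """Return the providers needed for `devices` that are absent from `available`."""
    needed = [DEVICE_TO_PROVIDER[device] for device in devices]
    return [provider for provider in needed if provider not in available]


def installed_providers():
    """Query the installed onnxruntime build; return [] if it cannot be imported."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()


available = installed_providers()
print("available providers:", available)
print("missing for cpu + hailo:", missing_providers(["cpu", "hailo"], available))
```

An empty "missing" list suggests the wheel was built with the expected flags; otherwise the build command for that platform is the first place to look.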

On-device training

ONNX Runtime training

Run on-device training using ONNX Runtime training APIs:

python scripts/onnx_runtime_training.py \
	--model-dir <model-dir> \
	--data-path <dataset-path> \
	--device <cpu / cuda / hailo> \
	--epochs 1 \
	--batch-size 1 \
	--train

Use --device cuda on GPU-based Jetson or --device hailo with the Hailo provider when available.
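
For orientation, a training loop built on the ONNX Runtime training APIs typically looks like the sketch below. It is not the implementation of scripts/onnx_runtime_training.py: it assumes the model directory holds the default file names written by `generate_artifacts`, that the training graph returns a single scalar loss, and that batches arrive as NumPy arrays. How the `hailo` device value is mapped onto the Hailo provider is specific to the repository and is not shown here.

```python
from pathlib import Path

# Default file names written by onnxruntime.training.artifacts.generate_artifacts.
ARTIFACT_FILES = {
    "train": "training_model.onnx",
    "eval": "eval_model.onnx",
    "optimizer": "optimizer_model.onnx",
    "checkpoint": "checkpoint",
}


def artifact_paths(model_dir):
    """Map each artifact role to its path inside model_dir."""
    return {role: str(Path(model_dir) / name) for role, name in ARTIFACT_FILES.items()}


def train_one_epoch(model_dir, batches, device="cpu"):
    """Run one pass over `batches` and return the mean training loss.

    `batches` yields (images, labels) pairs of NumPy arrays
    (float32 images, int64 labels); this input layout is an assumption.
    """
    # Imported lazily so artifact_paths stays usable without onnxruntime-training.
    import numpy as np
    from onnxruntime.training.api import CheckpointState, Module, Optimizer

    paths = artifact_paths(model_dir)
    state = CheckpointState.load_checkpoint(paths["checkpoint"])
    module = Module(paths["train"], state, paths["eval"], device=device)
    optimizer = Optimizer(paths["optimizer"], module)

    module.train()                       # switch to the training graph
    total, count = 0.0, 0
    for images, labels in batches:
        loss = module(images, labels)    # forward and backward pass
        optimizer.step()                 # apply the parameter update
        module.lazy_reset_grad()         # clear gradients before the next step
        total += np.asarray(loss).item()
        count += 1
    return total / max(count, 1)
```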

PyTorch training

For native PyTorch training on CPU or CUDA:

python scripts/pytorch_training.py \
	--model-path <model-file>.pth \
	--data-path <dataset-path> \
	--device <cpu / cuda> \
	--epochs 1 \
	--batch-size 1 \
	--train
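
To compare wall-clock time between the two training scripts, the commands above can be launched from a small wrapper. This is a hedged sketch rather than the repository's profiling code: `build_command` and `timed_run` are hypothetical helpers, and the model and dataset paths in the example are placeholders.

```python
import subprocess
import sys
import time


def build_command(script, model_flag, model, data_path, device, epochs=1, batch_size=1):
    """Assemble the argument list for one of the training scripts shown above."""
    return [
        sys.executable, script,
        model_flag, str(model),
        "--data-path", str(data_path),
        "--device", device,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--train",
    ]


def timed_run(command):
    """Run the command and return (return code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


# Placeholder paths; the command is only printed here, not executed.
cmd = build_command("scripts/pytorch_training.py", "--model-path", "model.pth", "data/", "cpu")
print(" ".join(cmd))
```

Passing `"--model-dir"` as `model_flag` together with scripts/onnx_runtime_training.py builds the ONNX Runtime variant of the same command.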

Results

Training accuracy vs time curves for CIFAR-100 and Oxford-IIIT Pet across four models, comparing Raspberry Pi CPU, Jetson Orin Nano, and Raspberry Pi + Hailo-8L.

Bar charts comparing CIFAR-100 and Oxford-IIIT Pet test accuracy across models, showing how different Hailo int8 quantization and accuracy recovery strategies restore accuracy relative to the Raspberry Pi fp32 baseline.
