ONNX Export Tutorial
Overview
After compressing a model with fasterai, you’ll want to deploy it. ONNX (Open Neural Network Exchange) is an open, widely supported format for deploying models across different platforms and runtimes.
Why Export to ONNX?
| Benefit | Description |
|---|---|
| Portability | Run on any platform: servers, mobile, edge devices, browsers |
| Performance | ONNX Runtime is highly optimized for inference |
| Quantization | Apply additional INT8 quantization during export |
| No Python needed | Deploy without Python dependencies |
The Deployment Pipeline
Train → Compress (prune/sparsify/quantize) → Fold BN → Export ONNX → Deploy
This tutorial walks through the complete pipeline.
1. Setup and Training
First, let’s train a model that we’ll later compress and export.
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")
def label_func(f): return f[0].isupper()
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.unfreeze()
learn.fit_one_cycle(3)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.641139 | 0.338822 | 0.868065 | 00:02 |
| 1 | 0.365181 | 0.225297 | 0.905954 | 00:02 |
| 2 | 0.195590 | 0.186686 | 0.914750 | 00:02 |
2. Compress the Model
Apply sparsification to zero out half of the weights and reduce the model's effective complexity (unstructured sparsity alone does not shrink the saved file, as the size comparison below shows). You could also use pruning, quantization, or any combination.
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', context='local', criteria=large_final, schedule=one_cycle)
learn.fit_one_cycle(2, cbs=sp_cb)
Pruning of weight until a sparsity of [50]%
Saving Weights at epoch 0
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.263917 | 0.234192 | 0.901894 | 00:02 |
| 1 | 0.176362 | 0.202073 | 0.921516 | 00:02 |
Sparsity at the end of epoch 0: [36.57]%
Sparsity at the end of epoch 1: [50.0]%
Final Sparsity: [50.0]%
Sparsity Report:
--------------------------------------------------------------------------------
Layer Type Params Zeros Sparsity
--------------------------------------------------------------------------------
Layer 0 Conv2d 9,408 4,704 50.00%
Layer 1 Conv2d 36,864 18,432 50.00%
Layer 2 Conv2d 36,864 18,432 50.00%
Layer 3 Conv2d 36,864 18,432 50.00%
Layer 4 Conv2d 36,864 18,432 50.00%
Layer 5 Conv2d 73,728 36,863 50.00%
Layer 6 Conv2d 147,456 73,726 50.00%
Layer 7 Conv2d 8,192 4,096 50.00%
Layer 8 Conv2d 147,456 73,726 50.00%
Layer 9 Conv2d 147,456 73,726 50.00%
Layer 10 Conv2d 294,912 147,452 50.00%
Layer 11 Conv2d 589,824 294,905 50.00%
Layer 12 Conv2d 32,768 16,384 50.00%
Layer 13 Conv2d 589,824 294,905 50.00%
Layer 14 Conv2d 589,824 294,905 50.00%
Layer 15 Conv2d 1,179,648 589,810 50.00%
Layer 16 Conv2d 2,359,296 1,179,619 50.00%
Layer 17 Conv2d 131,072 65,534 50.00%
Layer 18 Conv2d 2,359,296 1,179,619 50.00%
Layer 19 Conv2d 2,359,296 1,179,618 50.00%
--------------------------------------------------------------------------------
Overall all 11,166,912 5,583,320 50.00%
3. Fold BatchNorm Layers
Before export, fold batch normalization layers into convolutions for faster inference. In eval mode a BatchNorm layer applies a fixed per-channel scale and shift, so it can be absorbed into the preceding convolution's weights and bias, removing the extra layer entirely.
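For intuition, here is a minimal sketch of that folding for a single Conv2d/BatchNorm2d pair. It is illustration only, not fasterai's implementation, and the fold_conv_bn helper is hypothetical:
import torch
import torch.nn as nn
def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Absorb the BN running statistics (eval mode) into a copy of the convolution
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
for _ in range(3): bn(conv(torch.randn(4, 3, 16, 16)))             # accumulate running stats
conv.eval(); bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(fold_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5)
fasterai's BN_Folder applies the same transformation to every Conv/BN pair in the model: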
bn_folder = BN_Folder()
model = bn_folder.fold(learn.model)
model.eval();
4. Export to ONNX
Now export the optimized model to ONNX format:
# Create example input (batch_size=1, channels=3, height=64, width=64)
sample = torch.randn(1, 3, 64, 64)
# Export to ONNX
onnx_path = export_onnx(model.cpu(), sample, "model.onnx")
print(f"Exported to: {onnx_path}")Exported to: model.onnx
Verify the Export
Always verify that the ONNX model produces the same outputs as the PyTorch model:
is_valid = verify_onnx(model, onnx_path, sample)
print(f"Verification {'passed' if is_valid else 'FAILED'}: ONNX outputs {'match' if is_valid else 'do not match'} PyTorch!")Verification passed: ONNX outputs match PyTorch!
5. Export with INT8 Quantization
For even smaller models, and often faster inference on hardware with good INT8 support, apply INT8 quantization during export:
# Dynamic quantization (no calibration data needed)
quantized_path = export_onnx(
    model.cpu(), sample, "model_int8.onnx",
    quantize=True,
    quantize_mode="dynamic"
)
print(f"Exported quantized model to: {quantized_path}")
Exported quantized model to: model_int8_int8.onnx
For better accuracy, use static quantization with calibration data:
# Static quantization with calibration
quantized_path = export_onnx(
    model, sample, "model_int8_static.onnx",
    quantize=True,
    quantize_mode="static",
    calibration_data=dls.train  # Use training data for calibration
)
6. Compare Model Sizes
import os
def get_size_mb(path):
    return os.path.getsize(path) / 1e6
# Save PyTorch model for comparison
torch.save(model.state_dict(), "model.pt")
pt_size = get_size_mb("model.pt")
onnx_size = get_size_mb("model.onnx")
int8_size = get_size_mb(quantized_path)
print(f"PyTorch model: {pt_size:.2f} MB")
print(f"ONNX model: {onnx_size:.2f} MB")
print(f"ONNX INT8 model: {int8_size:.2f} MB ({pt_size/int8_size:.1f}x smaller)")PyTorch model: 46.83 MB
ONNX model: 46.82 MB
ONNX INT8 model: 11.78 MB (4.0x smaller)
7. Running Inference with ONNX Runtime
Use the ONNXModel wrapper for easy inference:
# Load the ONNX model
onnx_model = ONNXModel("model.onnx", device="cpu")
# Run inference
test_input = torch.randn(1, 3, 64, 64)
output = onnx_model(test_input)
print(f"Output shape: {output.shape}")
print(f"Predictions: {output}")Output shape: torch.Size([1, 2])
Predictions: tensor([[0.6364, 0.3489]])
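The wrapper returns the raw model outputs. To turn them into a class prediction, you can apply softmax and index into the DataLoaders vocabulary (a sketch, assuming the dls defined earlier):
import torch.nn.functional as F
probs = F.softmax(output, dim=1)                     # convert logits to probabilities
pred_idx = probs.argmax(dim=1).item()
print(f"Predicted class: {dls.vocab[pred_idx]} (p={probs[0, pred_idx]:.2f})")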
Benchmark Inference Speed
import time
def benchmark(fn, input_tensor, warmup=10, runs=100):
    # Warmup
    for _ in range(warmup):
        fn(input_tensor)
    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        fn(input_tensor)
    elapsed = (time.perf_counter() - start) / runs * 1000
    return elapsed
test_input = torch.randn(1, 3, 64, 64)
# PyTorch
model.eval()
with torch.no_grad():
    pt_time = benchmark(model, test_input)
# ONNX
onnx_model = ONNXModel("model.onnx")
onnx_time = benchmark(onnx_model, test_input)
# ONNX INT8
onnx_int8 = ONNXModel(quantized_path)
int8_time = benchmark(onnx_int8, test_input)
print(f"PyTorch inference: {pt_time:.2f} ms")
print(f"ONNX inference: {onnx_time:.2f} ms ({pt_time/onnx_time:.1f}x faster)")
print(f"ONNX INT8: {int8_time:.2f} ms ({pt_time/int8_time:.1f}x faster)")PyTorch inference: 1.27 ms
ONNX inference: 0.87 ms (1.5x faster)
ONNX INT8: 2.90 ms (0.4x faster)
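Since the model was exported with the default dynamic_batch=True, you can re-run the same benchmark at the batch size your deployment will actually use, which often changes the picture:
batch_input = torch.randn(16, 3, 64, 64)   # any batch size works thanks to dynamic_batch=True
with torch.no_grad():
    pt_batch = benchmark(model, batch_input)
onnx_batch = benchmark(onnx_model, batch_input)
int8_batch = benchmark(onnx_int8, batch_input)
print(f"batch=16 | PyTorch: {pt_batch:.2f} ms | ONNX: {onnx_batch:.2f} ms | INT8: {int8_batch:.2f} ms")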
8. Parameter Reference
export_onnx Parameters
| Parameter | Default | Description |
|---|---|---|
| model | Required | PyTorch model to export |
| sample | Required | Example input tensor (with batch dimension) |
| output_path | Required | Output .onnx file path |
| opset_version | 18 | ONNX opset version |
| quantize | False | Apply INT8 quantization after export |
| quantize_mode | "dynamic" | "dynamic" (no calibration) or "static" |
| calibration_data | None | DataLoader for static quantization |
| optimize | True | Run ONNX graph optimizer |
| dynamic_batch | True | Allow variable batch size at runtime |
Quantization Mode Comparison
| Mode | Calibration | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| dynamic | Not needed | Good | Fast export | Quick deployment |
| static | Required | Better | Slower export | Production models |
Summary
| Step | Tool | Purpose |
|---|---|---|
| Compress | SparsifyCallback, PruneCallback, etc. | Reduce model complexity |
| Fold BN | BN_Folder | Eliminate batch norm overhead |
| Export | export_onnx | Convert to deployment format |
| Verify | verify_onnx | Ensure correctness |
| Quantize | quantize=True | Further reduce size (4x) |
| Deploy | ONNXModel | Run inference |
Complete Pipeline Example
from fasterai.sparse.all import *
from fasterai.misc.all import *
from fasterai.export.all import *
# 1. Train with compression
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', ...)
learn.fit_one_cycle(5, cbs=sp_cb)
# 2. Fold batch norm
model = BN_Folder().fold(learn.model)
# 3. Export with quantization
sample = torch.randn(1, 3, 224, 224)
path = export_onnx(model, sample, "model_int8.onnx", quantize=True)
# 4. Verify
assert verify_onnx(model, path, sample)
# 5. Deploy
onnx_model = ONNXModel(path)
output = onnx_model(sample)
See Also
- ONNX Exporter API - Detailed API reference
- BN Folding - Fold batch norm before export
- CPU Optimizer - Alternative: TorchScript for CPU deployment
- Sparsify Callback - Compress before export
- Quantize Callback - QAT before export