ONNX Export Tutorial

Export compressed models to ONNX for deployment

Overview

After compressing a model with fasterai, you’ll want to deploy it. ONNX (Open Neural Network Exchange) is the standard format for deploying models across different platforms and runtimes.

Why Export to ONNX?

Benefit           Description
Portability       Run on any platform: servers, mobile, edge devices, browsers
Performance       ONNX Runtime is highly optimized for inference
Quantization      Apply additional INT8 quantization during export
No Python needed  Deploy without Python dependencies

The Deployment Pipeline

Train → Compress (prune/sparsify/quantize) → Fold BN → Export ONNX → Deploy

This tutorial walks through the complete pipeline.

1. Setup and Training

First, let’s train a model that we’ll later compress and export.

from fastai.vision.all import *
from fasterai.sparse.all import *
from fasterai.misc.all import *
from fasterai.export.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.unfreeze()
learn.fit_one_cycle(3)
epoch  train_loss  valid_loss  accuracy  time
0      0.641139    0.338822    0.868065  00:02
1      0.365181    0.225297    0.905954  00:02
2      0.195590    0.186686    0.914750  00:02

2. Compress the Model

Apply sparsification to zero out half of the weights and reduce model complexity. You could also use pruning, quantization, or any combination of compression techniques.

sp_cb = SparsifyCallback(sparsity=50, granularity='weight', context='local', criteria=large_final, schedule=one_cycle)
learn.fit_one_cycle(2, cbs=sp_cb)
Pruning of weight until a sparsity of [50]%
Saving Weights at epoch 0
epoch  train_loss  valid_loss  accuracy  time
0      0.263917    0.234192    0.901894  00:02
1      0.176362    0.202073    0.921516  00:02
Sparsity at the end of epoch 0: [36.57]%
Sparsity at the end of epoch 1: [50.0]%
Final Sparsity: [50.0]%

Sparsity Report:
--------------------------------------------------------------------------------
Layer                Type            Params     Zeros      Sparsity  
--------------------------------------------------------------------------------
Layer 0              Conv2d          9,408      4,704         50.00%
Layer 1              Conv2d          36,864     18,432        50.00%
Layer 2              Conv2d          36,864     18,432        50.00%
Layer 3              Conv2d          36,864     18,432        50.00%
Layer 4              Conv2d          36,864     18,432        50.00%
Layer 5              Conv2d          73,728     36,863        50.00%
Layer 6              Conv2d          147,456    73,726        50.00%
Layer 7              Conv2d          8,192      4,096         50.00%
Layer 8              Conv2d          147,456    73,726        50.00%
Layer 9              Conv2d          147,456    73,726        50.00%
Layer 10             Conv2d          294,912    147,452       50.00%
Layer 11             Conv2d          589,824    294,905       50.00%
Layer 12             Conv2d          32,768     16,384        50.00%
Layer 13             Conv2d          589,824    294,905       50.00%
Layer 14             Conv2d          589,824    294,905       50.00%
Layer 15             Conv2d          1,179,648  589,810       50.00%
Layer 16             Conv2d          2,359,296  1,179,619     50.00%
Layer 17             Conv2d          131,072    65,534        50.00%
Layer 18             Conv2d          2,359,296  1,179,619     50.00%
Layer 19             Conv2d          2,359,296  1,179,618     50.00%
--------------------------------------------------------------------------------
Overall              all             11,166,912 5,583,320     50.00%
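
If you want to double-check the reported numbers yourself, a small loop over the convolution weights gives the same totals. This is a minimal sketch in plain PyTorch, assuming learn.model still holds the sparsified network; it is not a fasterai API.

import torch.nn as nn

# Count zeroed weights across all Conv2d layers
total, zeros = 0, 0
for m in learn.model.modules():
    if isinstance(m, nn.Conv2d):
        total += m.weight.numel()
        zeros += (m.weight == 0).sum().item()

print(f"Overall conv sparsity: {100 * zeros / total:.2f}%")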

3. Fold BatchNorm Layers

Before export, fold batch normalization layers into convolutions for faster inference:

bn_folder = BN_Folder()
model = bn_folder.fold(learn.model)
model.eval();
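
In eval mode, folding should not change the network's outputs beyond small numerical differences. Below is a quick sanity check on a random batch, assuming fold returned a copy and learn.model still holds the unfolded network; this is an illustrative sketch, not part of fasterai.

# Compare the folded model against the original on random data (eval mode)
learn.model.cpu().eval()
x = torch.randn(4, 3, 64, 64)
with torch.no_grad():
    out_orig = learn.model(x)
    out_fold = model.cpu()(x)

print(torch.allclose(out_orig, out_fold, atol=1e-4))  # expect True up to numerical tolerance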

4. Export to ONNX

Now export the optimized model to ONNX format:

# Create example input (batch_size=1, channels=3, height=64, width=64)
sample = torch.randn(1, 3, 64, 64)

# Export to ONNX
onnx_path = export_onnx(model.cpu(), sample, "model.onnx")
print(f"Exported to: {onnx_path}")
Exported to: model.onnx
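
If the onnx package is installed, you can also load the exported file and run its structural checker. This step is optional and independent of fasterai.

import onnx

# Load the exported graph and run ONNX's own consistency checks
proto = onnx.load("model.onnx")
onnx.checker.check_model(proto)

print(f"Opset version: {proto.opset_import[0].version}")
print(f"Graph inputs:  {[i.name for i in proto.graph.input]}")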

Verify the Export

Always verify that the ONNX model produces the same outputs as the PyTorch model:

is_valid = verify_onnx(model, onnx_path, sample)
print(f"Verification {'passed' if is_valid else 'FAILED'}: ONNX outputs {'match' if is_valid else 'do not match'} PyTorch!")
Verification passed: ONNX outputs match PyTorch!
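
Conceptually, verification runs both models on the same input and compares the outputs. Done by hand with onnxruntime, it looks roughly like this (a sketch; the tolerances verify_onnx actually uses may differ):

import numpy as np
import onnxruntime as ort

# Run the ONNX model on the sample input
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
ort_out = sess.run(None, {input_name: sample.numpy()})[0]

# Run the PyTorch model on the same input and compare
with torch.no_grad():
    pt_out = model(sample).numpy()

print(np.allclose(pt_out, ort_out, atol=1e-4))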

5. Export with INT8 Quantization

For even smaller models and, on hardware with low-precision support, faster inference, apply INT8 quantization during export:

# Dynamic quantization (no calibration data needed)
quantized_path = export_onnx(
    model.cpu(), sample, "model_int8.onnx",
    quantize=True,
    quantize_mode="dynamic"
)
print(f"Exported quantized model to: {quantized_path}")
Exported quantized model to: model_int8_int8.onnx
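
For reference, the dynamic path maps onto ONNX Runtime's own quantization API, which you can also call directly on an already-exported file. This is a sketch; the weight type below is a common choice, not necessarily what fasterai uses internally.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize an exported ONNX file to INT8 weights
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8_manual.onnx",
    weight_type=QuantType.QInt8,  # signed 8-bit weights
)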

For better accuracy, use static quantization with calibration data:

# Static quantization with calibration
quantized_path = export_onnx(
    model, sample, "model_int8_static.onnx",
    quantize=True,
    quantize_mode="static",
    calibration_data=dls.train  # Use training data for calibration
)
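
Behind a static quantization call like this, ONNX Runtime needs a CalibrationDataReader that feeds representative batches through the model. The sketch below shows that machinery with a fastai DataLoader; it is illustrative, not fasterai's internal implementation, and the number of calibration batches is an arbitrary choice.

import onnxruntime as ort
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class DLCalibrationReader(CalibrationDataReader):
    "Feed a limited number of batches from a fastai DataLoader to the calibrator."
    def __init__(self, dl, input_name, n_batches=10):
        self.batches = iter(dl)
        self.input_name = input_name
        self.n_batches = n_batches

    def get_next(self):
        if self.n_batches == 0:
            return None  # signals the calibrator to stop
        self.n_batches -= 1
        xb, _ = next(self.batches)
        return {self.input_name: xb.cpu().numpy()}

# Look up the graph's input name, then calibrate and quantize
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
reader = DLCalibrationReader(dls.train, input_name=sess.get_inputs()[0].name)
quantize_static("model.onnx", "model_int8_static_manual.onnx", reader, weight_type=QuantType.QInt8)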

6. Compare Model Sizes

import os

def get_size_mb(path):
    return os.path.getsize(path) / 1e6

# Save PyTorch model for comparison
torch.save(model.state_dict(), "model.pt")

pt_size = get_size_mb("model.pt")
onnx_size = get_size_mb("model.onnx")
int8_size = get_size_mb(quantized_path)

print(f"PyTorch model:    {pt_size:.2f} MB")
print(f"ONNX model:       {onnx_size:.2f} MB")
print(f"ONNX INT8 model:  {int8_size:.2f} MB ({pt_size/int8_size:.1f}x smaller)")
PyTorch model:    46.83 MB
ONNX model:       46.82 MB
ONNX INT8 model:  11.78 MB (4.0x smaller)
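
Note that the FP32 ONNX file is essentially the same size as the PyTorch checkpoint: unstructured sparsity only sets weights to zero, and those zeros are still stored as dense 32-bit floats. The on-disk saving here comes from INT8 quantization, which stores each weight in a single byte.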

7. Running Inference with ONNX Runtime

Use the ONNXModel wrapper for easy inference:

# Load the ONNX model
onnx_model = ONNXModel("model.onnx", device="cpu")

# Run inference
test_input = torch.randn(1, 3, 64, 64)
output = onnx_model(test_input)

print(f"Output shape: {output.shape}")
print(f"Predictions: {output}")
Output shape: torch.Size([1, 2])
Predictions: tensor([[0.6364, 0.3489]])
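
To turn the raw output into a class label, take the argmax over the class dimension and look it up in the DataLoaders vocabulary. Whether the values are logits or probabilities depends on the exported head, but argmax works either way. A small sketch using the dls defined above:

# Map the highest-scoring class index back to its label
pred_idx = output.argmax(dim=1).item()
print(f"Predicted class: {dls.vocab[pred_idx]}")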

Benchmark Inference Speed

import time

def benchmark(fn, input_tensor, warmup=10, runs=100):
    # Warmup
    for _ in range(warmup):
        fn(input_tensor)
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        fn(input_tensor)
    elapsed = (time.perf_counter() - start) / runs * 1000
    return elapsed

test_input = torch.randn(1, 3, 64, 64)

# PyTorch
model.eval()
with torch.no_grad():
    pt_time = benchmark(model, test_input)

# ONNX
onnx_model = ONNXModel("model.onnx")
onnx_time = benchmark(onnx_model, test_input)

# ONNX INT8
onnx_int8 = ONNXModel(quantized_path)
int8_time = benchmark(onnx_int8, test_input)

print(f"PyTorch inference: {pt_time:.2f} ms")
print(f"ONNX inference:    {onnx_time:.2f} ms ({pt_time/onnx_time:.1f}x faster)")
print(f"ONNX INT8:         {int8_time:.2f} ms ({pt_time/int8_time:.1f}x faster)")
PyTorch inference: 1.27 ms
ONNX inference:    0.87 ms (1.5x faster)
ONNX INT8:         2.90 ms (0.4x faster)
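
In this run the INT8 model is actually slower than FP32. That is not unusual for small models at batch size 1 on CPUs without low-precision acceleration (for example VNNI or AMX): the quantize/dequantize overhead can outweigh the arithmetic savings, so always benchmark on the hardware you will deploy to.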

8. Parameter Reference

export_onnx Parameters

Parameter         Default    Description
model             Required   PyTorch model to export
sample            Required   Example input tensor (with batch dimension)
output_path       Required   Output .onnx file path
opset_version     18         ONNX opset version
quantize          False      Apply INT8 quantization after export
quantize_mode     "dynamic"  "dynamic" (no calibration) or "static"
calibration_data  None       DataLoader for static quantization
optimize          True       Run ONNX graph optimizer
dynamic_batch     True       Allow variable batch size at runtime
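
For example, several of these options can be combined in one call (the values below are illustrative):

# Export with an explicit opset and a fixed batch size
path = export_onnx(
    model, sample, "model_opset17.onnx",
    opset_version=17,
    dynamic_batch=False,  # bake the sample's batch size into the graph
    optimize=True,
)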

Quantization Mode Comparison

Mode     Calibration  Accuracy  Speed          Use Case
dynamic  Not needed   Good      Fast export    Quick deployment
static   Required     Better    Slower export  Production models

Summary

Step      Tool                                   Purpose
Compress  SparsifyCallback, PruneCallback, etc.  Reduce model complexity
Fold BN   BN_Folder                              Eliminate batch norm overhead
Export    export_onnx                            Convert to deployment format
Verify    verify_onnx                            Ensure correctness
Quantize  quantize=True                          Further reduce size (4x)
Deploy    ONNXModel                              Run inference

Complete Pipeline Example

from fasterai.sparse.all import *
from fasterai.misc.all import *
from fasterai.export.all import *

# 1. Train with compression
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', ...)
learn.fit_one_cycle(5, cbs=sp_cb)

# 2. Fold batch norm
model = BN_Folder().fold(learn.model)

# 3. Export with quantization
sample = torch.randn(1, 3, 224, 224)
path = export_onnx(model, sample, "model_int8.onnx", quantize=True)

# 4. Verify
assert verify_onnx(model, path, sample)

# 5. Deploy
onnx_model = ONNXModel(path)
output = onnx_model(sample)

See Also