ONNX Export Tutorial
Overview
After compressing a model with fasterai, you’ll want to deploy it. ONNX (Open Neural Network Exchange) is an open, widely supported format for deploying models across different platforms and runtimes.
Why Export to ONNX?
| Benefit | Description |
|---|---|
| Portability | Run on any platform: servers, mobile, edge devices, browsers |
| Performance | ONNX Runtime is highly optimized for inference |
| Quantization | Apply additional INT8 quantization during export |
| No Python needed | Deploy without Python dependencies |
The Deployment Pipeline
Train → Compress (prune/sparsify/quantize) → Fold BN → Export ONNX → Deploy
This tutorial walks through the complete pipeline.
1. Setup and Training
First, let’s train a model that we’ll later compress and export.
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")
def label_func(f): return f[0].isupper()
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.unfreeze()
learn.fit_one_cycle(3)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.641139 | 0.338822 | 0.868065 | 00:02 |
| 1 | 0.365181 | 0.225297 | 0.905954 | 00:02 |
| 2 | 0.195590 | 0.186686 | 0.914750 | 00:02 |
2. Compress the Model
Apply sparsification to zero out half of the weights and reduce the model's effective complexity (unstructured sparsity alone does not shrink the saved file, as the size comparison below shows). You could also use pruning, quantization, or any combination.
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', context='local', criteria=large_final, schedule=one_cycle)
learn.fit_one_cycle(2, cbs=sp_cb)
Pruning of weight until a sparsity of [50]%
Saving Weights at epoch 0
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.263917 | 0.234192 | 0.901894 | 00:02 |
| 1 | 0.176362 | 0.202073 | 0.921516 | 00:02 |
Sparsity at the end of epoch 0: [36.57]%
Sparsity at the end of epoch 1: [50.0]%
Final Sparsity: [50.0]%
Sparsity Report:
--------------------------------------------------------------------------------
Layer Type Params Zeros Sparsity
--------------------------------------------------------------------------------
Layer 0 Conv2d 9,408 4,704 50.00%
Layer 1 Conv2d 36,864 18,432 50.00%
Layer 2 Conv2d 36,864 18,432 50.00%
Layer 3 Conv2d 36,864 18,432 50.00%
Layer 4 Conv2d 36,864 18,432 50.00%
Layer 5 Conv2d 73,728 36,863 50.00%
Layer 6 Conv2d 147,456 73,726 50.00%
Layer 7 Conv2d 8,192 4,096 50.00%
Layer 8 Conv2d 147,456 73,726 50.00%
Layer 9 Conv2d 147,456 73,726 50.00%
Layer 10 Conv2d 294,912 147,452 50.00%
Layer 11 Conv2d 589,824 294,905 50.00%
Layer 12 Conv2d 32,768 16,384 50.00%
Layer 13 Conv2d 589,824 294,905 50.00%
Layer 14 Conv2d 589,824 294,905 50.00%
Layer 15 Conv2d 1,179,648 589,810 50.00%
Layer 16 Conv2d 2,359,296 1,179,619 50.00%
Layer 17 Conv2d 131,072 65,534 50.00%
Layer 18 Conv2d 2,359,296 1,179,619 50.00%
Layer 19 Conv2d 2,359,296 1,179,618 50.00%
--------------------------------------------------------------------------------
Overall all 11,166,912 5,583,320 50.00%
3. Fold BatchNorm Layers
Before export, fold batch normalization layers into convolutions for faster inference. In eval mode a BatchNorm layer applies a fixed per-channel scale and shift, so it can be absorbed into the preceding convolution's weights and bias, removing the extra layer entirely.
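For intuition, here is a minimal sketch of that folding for a single Conv2d/BatchNorm2d pair. It is illustration only, not fasterai's implementation, and the fold_conv_bn helper is hypothetical:
import torch
import torch.nn as nn
def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Absorb the BN running statistics (eval mode) into a copy of the convolution
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
for _ in range(3): bn(conv(torch.randn(4, 3, 16, 16)))             # accumulate running stats
conv.eval(); bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(fold_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5)
fasterai's BN_Folder applies the same transformation to every Conv/BN pair in the model: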
bn_folder = BN_Folder()
model = bn_folder.fold(learn.model)
model.eval();
4. Export to ONNX
Now export the optimized model to ONNX format:
# Create example input (batch_size=1, channels=3, height=64, width=64)
sample = torch.randn(1, 3, 64, 64)
# Export to ONNX
onnx_path = export_onnx(model.cpu(), sample, "model.onnx")
print(f"Exported to: {onnx_path}")Exported to: model.onnx
Verify the Export
Always verify that the ONNX model produces the same outputs as the PyTorch model:
is_valid = verify_onnx(model, onnx_path, sample)
print(f"Verification {'passed' if is_valid else 'FAILED'}: ONNX outputs {'match' if is_valid else 'do not match'} PyTorch!")Verification passed: ONNX outputs match PyTorch!
5. Export with INT8 Quantization
For even smaller models, and often faster inference on hardware with good INT8 support, apply INT8 quantization during export:
# Dynamic quantization (no calibration data needed)
quantized_path = export_onnx(
    model.cpu(), sample, "model_int8.onnx",
    quantize=True,
    quantize_mode="dynamic"
)
print(f"Exported quantized model to: {quantized_path}")
Exported quantized model to: model_int8_int8.onnx
For better accuracy, use static quantization with calibration data:
# Static quantization with calibration
quantized_path = export_onnx(
    model, sample, "model_int8_static.onnx",
    quantize=True,
    quantize_mode="static",
    calibration_data=dls.train  # Use training data for calibration
)
6. Compare Model Sizes
import os
def get_size_mb(path):
    return os.path.getsize(path) / 1e6
# Save PyTorch model for comparison
torch.save(model.state_dict(), "model.pt")
pt_size = get_size_mb("model.pt")
onnx_size = get_size_mb("model.onnx")
int8_size = get_size_mb(quantized_path)
print(f"PyTorch model: {pt_size:.2f} MB")
print(f"ONNX model: {onnx_size:.2f} MB")
print(f"ONNX INT8 model: {int8_size:.2f} MB ({pt_size/int8_size:.1f}x smaller)")PyTorch model: 46.83 MB
ONNX model: 46.82 MB
ONNX INT8 model: 11.78 MB (4.0x smaller)
7. Running Inference with ONNX Runtime
Use the ONNXModel wrapper for easy inference:
# Load the ONNX model
onnx_model = ONNXModel("model.onnx", device="cpu")
# Run inference
test_input = torch.randn(1, 3, 64, 64)
output = onnx_model(test_input)
print(f"Output shape: {output.shape}")
print(f"Predictions: {output}")Output shape: torch.Size([1, 2])
Predictions: tensor([[0.6364, 0.3489]])
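The wrapper returns the raw model outputs. To turn them into a class prediction, you can apply softmax and index into the DataLoaders vocabulary (a sketch, assuming the dls defined earlier):
import torch.nn.functional as F
probs = F.softmax(output, dim=1)                     # convert logits to probabilities
pred_idx = probs.argmax(dim=1).item()
print(f"Predicted class: {dls.vocab[pred_idx]} (p={probs[0, pred_idx]:.2f})")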
Benchmark Inference Speed
import time
def benchmark(fn, input_tensor, warmup=10, runs=100):
    # Warmup
    for _ in range(warmup):
        fn(input_tensor)
    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        fn(input_tensor)
    elapsed = (time.perf_counter() - start) / runs * 1000
    return elapsed
test_input = torch.randn(1, 3, 64, 64)
# PyTorch
model.eval()
with torch.no_grad():
    pt_time = benchmark(model, test_input)
# ONNX
onnx_model = ONNXModel("model.onnx")
onnx_time = benchmark(onnx_model, test_input)
# ONNX INT8
onnx_int8 = ONNXModel(quantized_path)
int8_time = benchmark(onnx_int8, test_input)
print(f"PyTorch inference: {pt_time:.2f} ms")
print(f"ONNX inference: {onnx_time:.2f} ms ({pt_time/onnx_time:.1f}x faster)")
print(f"ONNX INT8: {int8_time:.2f} ms ({pt_time/int8_time:.1f}x faster)")PyTorch inference: 1.27 ms
ONNX inference: 0.87 ms (1.5x faster)
ONNX INT8: 2.90 ms (0.4x faster)
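Since the model was exported with the default dynamic_batch=True, you can re-run the same benchmark at the batch size your deployment will actually use, which often changes the picture:
batch_input = torch.randn(16, 3, 64, 64)   # any batch size works thanks to dynamic_batch=True
with torch.no_grad():
    pt_batch = benchmark(model, batch_input)
onnx_batch = benchmark(onnx_model, batch_input)
int8_batch = benchmark(onnx_int8, batch_input)
print(f"batch=16 | PyTorch: {pt_batch:.2f} ms | ONNX: {onnx_batch:.2f} ms | INT8: {int8_batch:.2f} ms")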
8. Parameter Reference
export_onnx Parameters
| Parameter | Default | Description |
|---|---|---|
| model | Required | PyTorch model to export |
| sample | Required | Example input tensor (with batch dimension) |
| output_path | Required | Output .onnx file path |
| opset_version | 18 | ONNX opset version |
| quantize | False | Apply INT8 quantization after export |
| quantize_mode | "dynamic" | "dynamic" (no calibration) or "static" |
| calibration_data | None | DataLoader for static quantization |
| optimize | True | Run ONNX graph optimizer |
| dynamic_batch | True | Allow variable batch size at runtime |
Quantization Mode Comparison
| Mode | Calibration | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| dynamic | Not needed | Good | Fast export | Quick deployment |
| static | Required | Better | Slower export | Production models |
Summary
| Step | Tool | Purpose |
|---|---|---|
| Compress | SparsifyCallback, PruneCallback, etc. | Reduce model complexity |
| Fold BN | BN_Folder | Eliminate batch norm overhead |
| Export | export_onnx | Convert to deployment format |
| Verify | verify_onnx | Ensure correctness |
| Quantize | quantize=True | Further reduce size (4x) |
| Deploy | ONNXModel | Run inference |
Complete Pipeline Example
from fasterai.sparse.all import *
from fasterai.misc.all import *
from fasterai.export.all import *
# 1. Train with compression
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', ...)
learn.fit_one_cycle(5, cbs=sp_cb)
# 2. Fold batch norm
model = BN_Folder().fold(learn.model)
# 3. Export with quantization
sample = torch.randn(1, 3, 224, 224)
path = export_onnx(model, sample, "model_int8.onnx", quantize=True)
# 4. Verify
assert verify_onnx(model, path, sample)
# 5. Deploy
onnx_model = ONNXModel(path)
output = onnx_model(sample)
See Also
- ONNX Exporter API - Detailed API reference
- BN Folding - Fold batch norm before export
- CPU Optimizer - Alternative: TorchScript for CPU deployment
- Sparsify Callback - Compress before export
- Quantize Callback - QAT before export