# Quantizer

## Overview

The `Quantizer` class provides model quantization capabilities to reduce model size and improve inference speed. Quantization converts floating-point weights and activations to lower-precision integers (typically int8).

**Supported backends:**

- `'x86'`: Optimized for Intel CPUs (default)
- `'qnnpack'`: Optimized for ARM CPUs (mobile devices)
- `'fbgemm'`: Facebook's quantization backend

**Quantization methods:**

- `'static'`: Post-training quantization with calibration data — best accuracy, requires representative data
- `'dynamic'`: Runtime quantization without calibration — easier to use, slightly lower accuracy
- `'qat'`: Quantization-aware training — highest accuracy, requires retraining

**Note:** PyTorch quantization produces CPU-only models. The quantized model will always run on CPU regardless of the original device.
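This CPU-only behavior comes from core PyTorch, not from fasterai. A minimal sketch using PyTorch's own `torch.ao.quantization.quantize_dynamic` (core PyTorch, not the `Quantizer` API documented here) shows that a quantized model's outputs live on CPU:

```python
import torch
import torch.nn as nn

# Toy float model; dynamic quantization swaps the Linear layers
# for their int8 counterparts
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Quantized kernels only run on CPU, so outputs stay on CPU
out = qmodel(torch.randn(4, 8))
print(out.device)  # cpu
```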
## Choosing the Right Method
| Method | Accuracy | Setup Effort | When to Use |
|---|---|---|---|
| Static | High | Medium (needs calibration data) | Production with representative dataset available |
| Dynamic | Medium | Low (no calibration) | Quick experiments, NLP models with variable input |
| QAT | Highest | High (requires retraining) | Maximum accuracy critical, have training resources |
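To make the static workflow in the table concrete, here is a hedged sketch of what it looks like in core PyTorch's eager-mode API (prepare, calibrate, convert) — this is the mechanism underneath, not the `Quantizer` API, and the module names here are illustrative:

```python
import torch
import torch.nn as nn

# Eager-mode static quantization: observers collect activation ranges
# during calibration, then convert() swaps modules for int8 versions.
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # fp32 -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

m = M().eval()
m.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
prepared = torch.ao.quantization.prepare(m)

# Calibration: run a few representative batches through the observers
for _ in range(10):
    prepared(torch.randn(8, 16))

quantized = torch.ao.quantization.convert(prepared)
print(quantized.fc)  # fc has been swapped for its int8 quantized counterpart
```

The calibration loop is why static quantization needs a representative dataloader: the observers record the ranges used to pick quantization scales.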
## Backend Selection Guide

| Backend | Target Hardware | Best For |
|---|---|---|
| `'x86'` | Intel/AMD CPUs | Desktop/server deployment |
| `'qnnpack'` | ARM CPUs | Mobile (iOS/Android), Raspberry Pi |
| `'fbgemm'` | Intel CPUs | Server-side with batch inference |
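These backend names correspond to PyTorch's quantized engines, and not every PyTorch build ships every engine. A quick way to check what the current build supports (core PyTorch, independent of the `Quantizer` wrapper):

```python
import torch

# Engines compiled into this PyTorch build ('none' is always present)
engines = torch.backends.quantized.supported_engines
print(engines)

# The active engine must match the backend the model was prepared for;
# pick 'qnnpack' for ARM targets, 'x86'/'fbgemm' for Intel/AMD servers
if 'fbgemm' in engines:
    torch.backends.quantized.engine = 'fbgemm'
```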
## Quantizer

```python
def Quantizer(
    backend: str = 'x86',                 # Target backend for quantization
    method: str = 'static',               # Quantization method: 'static', 'dynamic', or 'qat'
    qconfig_mapping: dict | None = None,  # Optional custom quantization config
    custom_configs: dict | None = None,   # Custom module-specific configurations
    use_per_tensor: bool = False,         # Force per-tensor quantization
    verbose: bool = False,                # Enable verbose output
):
```

Initialize a quantizer with the specified backend and options.

**Parameters:**

- `backend`: Target hardware backend (`'x86'`, `'qnnpack'`, `'fbgemm'`)
- `method`: Quantization approach (`'static'`, `'dynamic'`, `'qat'`)
- `qconfig_mapping`: Optional custom quantization configuration
- `custom_configs`: Dict of module-specific configurations
- `use_per_tensor`: Force per-tensor quantization (may help with conversion issues)
- `verbose`: Enable detailed output during quantization
## Quantizer.quantize

```python
def quantize(
    model: Module,                       # Model to quantize
    calibration_dl: Any,                 # Dataloader for calibration
    max_calibration_samples: int = 100,  # Maximum number of samples to use for calibration
    device: str | torch.device = 'cpu',  # Device to use for calibration
) -> Module:
```

Quantize a model using the specified method and settings.

**Note:** PyTorch quantization produces CPU-only models. The returned model will always be on CPU regardless of the input model's device.
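Whichever of the three methods you use, a simple way to check the size savings on the returned model is to serialize its state dict before and after. This sketch measures a model quantized with core PyTorch's `quantize_dynamic` rather than `Quantizer.quantize`, but the same measurement applies to its output:

```python
import io
import torch
import torch.nn as nn

def checkpoint_bytes(model):
    # Serialize the state dict into memory to measure checkpoint size
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# int8 weights cut the dominant Linear layers to roughly 1/4 of their fp32 size
print(f"fp32: {checkpoint_bytes(model)} B, int8: {checkpoint_bytes(qmodel)} B")
```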
## Usage Examples

### Static Quantization (Recommended for best accuracy)

```python
from fasterai.quantize.quantizer import Quantizer

# Create quantizer for static quantization
quantizer = Quantizer(
    backend='x86',
    method='static',
    verbose=True
)

# Quantize with calibration data
quantized_model = quantizer.quantize(
    model,
    calibration_dl=dls.valid,
    max_calibration_samples=100
)
```

### Dynamic Quantization (No calibration needed)
```python
from fasterai.quantize.quantizer import Quantizer

# Create quantizer for dynamic quantization
quantizer = Quantizer(
    backend='x86',
    method='dynamic'
)

# Quantize - the dataloader is passed through, but no calibration pass is run
quantized_model = quantizer.quantize(model, calibration_dl=dls.valid)
```

### Mobile Deployment (ARM devices)
```python
from fasterai.quantize.quantizer import Quantizer

# Use the qnnpack backend for mobile / ARM targets
quantizer = Quantizer(
    backend='qnnpack',
    method='static'
)

quantized_model = quantizer.quantize(model, calibration_dl=dls.valid)
```
## See Also

- `QuantizeCallback` - Apply quantization during fastai training
- PyTorch Quantization Documentation - Official PyTorch quantization guide
- ONNX Exporter - Export models for cross-platform deployment