Quantizer

Quantize your network

Overview

The Quantizer class provides model quantization capabilities to reduce model size and improve inference speed. Quantization converts floating-point weights and activations to lower precision integers (typically int8).

Supported Backends:

- `'x86'`: Optimized for Intel CPUs (default)
- `'qnnpack'`: Optimized for ARM CPUs (mobile devices)
- `'fbgemm'`: Facebook's quantization backend

Quantization Methods:

- `'static'`: Post-training quantization with calibration data - best accuracy, requires representative data
- `'dynamic'`: Runtime quantization without calibration - easier to use, slightly lower accuracy
- `'qat'`: Quantization-aware training - highest accuracy, requires retraining

Note: PyTorch quantization produces CPU-only models. The quantized model will always run on CPU regardless of original device.

Choosing the Right Method

| Method  | Accuracy | Setup Effort                    | When to Use                                        |
|---------|----------|---------------------------------|----------------------------------------------------|
| Static  | High     | Medium (needs calibration data) | Production with representative dataset available   |
| Dynamic | Medium   | Low (no calibration)            | Quick experiments, NLP models with variable input  |
| QAT     | Highest  | High (requires retraining)      | Maximum accuracy critical, have training resources |
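For reference, the static path above can be sketched with PyTorch's FX graph-mode quantization API directly (plain `torch`, not the `Quantizer` wrapper; the toy model and sample counts here are illustrative):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Toy model standing in for a real network
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
example_inputs = (torch.randn(1, 16),)

# 1. Insert observers according to the backend's default qconfig
qconfig_mapping = get_default_qconfig_mapping("x86")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# 2. Calibrate: run representative data through the observed model
for _ in range(10):
    prepared(torch.randn(4, 16))

# 3. Convert observed modules to int8 quantized ones
quantized = convert_fx(prepared)
out = quantized(torch.randn(4, 16))
```

The calibration loop is what distinguishes static from dynamic quantization: the observers record activation ranges so fixed scales and zero points can be baked in at conversion time.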

Backend Selection Guide

| Backend     | Target Hardware | Best For                           |
|-------------|-----------------|------------------------------------|
| `'x86'`     | Intel/AMD CPUs  | Desktop/server deployment          |
| `'qnnpack'` | ARM CPUs        | Mobile (iOS/Android), Raspberry Pi |
| `'fbgemm'`  | Intel CPUs      | Server-side with batch inference   |
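Under the hood these names correspond to PyTorch's quantized engines. A quick way to check which backends your PyTorch build supports (plain `torch`, shown here as a sketch):

```python
import torch

# Engines compiled into this PyTorch build,
# e.g. ['x86', 'fbgemm', 'onednn', 'none', 'qnnpack']
print(torch.backends.quantized.supported_engines)

# Select the ARM-optimized backend before quantizing for mobile,
# if this build supports it
if 'qnnpack' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'qnnpack'
```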


source

Quantizer


```python
def Quantizer(
    backend:str='x86', # Target backend for quantization
    method:str='static', # Quantization method: 'static', 'dynamic', or 'qat'
    qconfig_mapping:dict | None=None, # Optional custom quantization config
    custom_configs:dict | None=None, # Custom module-specific configurations
    use_per_tensor:bool=False, # Force per-tensor quantization
    verbose:bool=False, # Enable verbose output
):
```

Initialize a quantizer with specified backend and options.

Parameters:

- `backend`: Target hardware backend (`'x86'`, `'qnnpack'`, `'fbgemm'`)
- `method`: Quantization approach (`'static'`, `'dynamic'`, `'qat'`)
- `qconfig_mapping`: Optional custom quantization configuration
- `custom_configs`: Dict of module-specific configurations
- `use_per_tensor`: Force per-tensor quantization (may help with conversion issues)
- `verbose`: Enable detailed output during quantization


source

Quantizer.quantize


```python
def quantize(
    model:Module, # Model to quantize
    calibration_dl:Any, # Dataloader for calibration
    max_calibration_samples:int=100, # Maximum number of samples to use for calibration
    device:str | torch.device='cpu', # Device to use for calibration
)->Module:
```

Quantize a model using the specified method and settings.

Note: PyTorch quantization produces CPU-only models. The returned model will always be on CPU regardless of the input model’s device.
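The CPU-only restriction comes from PyTorch itself: quantized tensors and kernels are implemented for CPU only. A minimal demonstration with plain `torch`:

```python
import torch

# Build a quantized tensor directly; scale and zero_point here are arbitrary
w = torch.quantize_per_tensor(
    torch.randn(4, 4), scale=0.1, zero_point=0, dtype=torch.qint8
)

print(w.device)         # always cpu; quantized ops have no CUDA kernels
print(w.int_repr()[0])  # the underlying int8 values
```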


Usage Examples

Dynamic Quantization (No calibration needed)

```python
from fasterai.quantize.quantizer import Quantizer

# Create quantizer for dynamic quantization
quantizer = Quantizer(
    backend='x86',
    method='dynamic'
)

# Dynamic quantization does not use the calibration data,
# but quantize() still takes a dataloader argument
quantized_model = quantizer.quantize(model, calibration_dl=dls.valid)
```
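For reference, what dynamic quantization does under the hood can be sketched with plain PyTorch (no fasterai needed; the toy model here is illustrative):

```python
import torch
import torch.nn as nn

# Toy model: dynamic quantization targets nn.Linear (and LSTM) layers
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()

# Weights are converted to int8 now; activations are quantized
# on the fly at inference time, so no calibration pass is needed
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(4, 16))  # runs on CPU
```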

Mobile Deployment (ARM devices)

```python
from fasterai.quantize.quantizer import Quantizer

# Use qnnpack backend for mobile
quantizer = Quantizer(
    backend='qnnpack',
    method='static'
)

quantized_model = quantizer.quantize(model, calibration_dl=dls.valid)
```

See Also