Speed
Latency and throughput measurement for PyTorch models
sweep_latency
def sweep_latency(
model:nn.Module, # model to benchmark
shapes:Sequence[Sequence[int]], # input shapes to test, e.g. [(1,3,224,224), (1,3,384,384)]
device:str | torch.device='cuda', # device to run on
warmup:int=20, # warmup iterations per shape
steps:int=100, # measurement iterations per shape
)->list[dict]:
Sweep input shapes to analyze latency vs resolution.
sweep_batch_sizes
def sweep_batch_sizes(
model:nn.Module, # model to benchmark
input_shape:Sequence[int], # input shape WITHOUT batch dim, e.g. (3, 224, 224)
batch_sizes:Sequence[int]=(1, 2, 4, 8, 16, 32), # batch sizes to test
device:str | torch.device='cuda', # device to run on
warmup:int=20, # warmup iterations per batch size
steps:int=100, # measurement iterations per batch size
)->list[dict]:
Sweep batch sizes to find optimal throughput.
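Once a sweep has produced one result per batch size, picking the "optimal" entry is a simple maximum over the throughput value. The sketch below assumes each result dict carries `batch_size` and `throughput_s` keys; these key names are hypothetical, since the page does not document the dict layout, and the numbers are made-up placeholders rather than measurements.

```python
from typing import Mapping, Sequence


def best_by_throughput(
    rows: Sequence[Mapping[str, float]],
    metric: str = "throughput_s",  # hypothetical key name, not confirmed by the docs
) -> Mapping[str, float]:
    """Return the sweep entry with the highest value for `metric`."""
    if not rows:
        raise ValueError("no sweep results to compare")
    return max(rows, key=lambda row: row[metric])


# Placeholder rows standing in for the output of a batch-size sweep.
rows = [
    {"batch_size": 1, "throughput_s": 120.0},
    {"batch_size": 4, "throughput_s": 410.0},
    {"batch_size": 8, "throughput_s": 650.0},
    {"batch_size": 16, "throughput_s": 640.0},
]

best = best_by_throughput(rows)
print(best["batch_size"])
```

Throughput often plateaus or drops once the device is saturated, so the largest batch size is not necessarily the best one.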
sweep_threads
def sweep_threads(
model:nn.Module, # model to benchmark
sample:torch.Tensor, # input tensor (with batch dimension)
thread_counts:Sequence[int]=(1, 2, 4, 8), # thread counts to test
warmup:int=20, # warmup iterations per thread count
steps:int=100, # measurement iterations per thread count
)->list[dict]:
Sweep CPU thread counts to find optimal parallelism.
compute_speed_multi
def compute_speed_multi(
model:nn.Module, # model to benchmark
sample:torch.Tensor, # input tensor (with batch dimension)
devices:Sequence[str | torch.device] | None=None, # devices to benchmark (default: cpu + cuda if available)
**kwargs
)->dict[str, SpeedMetrics]:
Measure latency/throughput on multiple devices.
compute_speed
def compute_speed(
model:nn.Module, # model to benchmark
sample:torch.Tensor, # input tensor (with batch dimension)
device:str | torch.device='cpu', # device to run on
warmup:int=20, # warmup iterations
steps:int=100, # measurement iterations
)->SpeedMetrics:
Measure latency and throughput on a single device.
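The `warmup` and `steps` parameters follow the usual benchmark pattern: run a number of untimed iterations first, then time each measured iteration. The following is a minimal, library-independent sketch of that pattern using only the standard library; `time_callable` is a hypothetical helper, not part of this module, and it does not reflect how `compute_speed` is actually implemented. When timing GPU work, the device must also be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, which this sketch omits.

```python
import time
from typing import Callable, List


def time_callable(fn: Callable[[], object], warmup: int = 20, steps: int = 100) -> List[float]:
    """Return per-call latencies in milliseconds, excluding warmup calls."""
    # Warmup: let caches, allocators, and lazy initialization settle.
    for _ in range(warmup):
        fn()

    latencies_ms: List[float] = []
    for _ in range(steps):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    return latencies_ms


# Stand-in workload; in practice this would wrap a model forward pass.
latencies = time_callable(lambda: sum(range(10_000)), warmup=5, steps=50)
print(len(latencies))
```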
SpeedMetrics
def SpeedMetrics(
p50_ms:float, p90_ms:float, p99_ms:float, mean_ms:float, std_ms:float, throughput_s:float
)->None:
Latency and throughput metrics for a single device.
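To make the fields concrete, here is a sketch of how a list of per-iteration latencies could be reduced to the same six values. `LatencySummary`, `percentile`, and `summarize` are hypothetical stand-ins written for illustration. The interpolation method, the use of the population standard deviation, and the reading of `throughput_s` as samples per second derived from the mean latency are all assumptions, not confirmed behavior of `SpeedMetrics`.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySummary:
    """Hypothetical stand-in mirroring the SpeedMetrics field names."""
    p50_ms: float
    p90_ms: float
    p99_ms: float
    mean_ms: float
    std_ms: float
    throughput_s: float


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile of pre-sorted values, q in [0, 100]."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(latencies_ms: Sequence[float], batch_size: int = 1) -> LatencySummary:
    """Reduce raw latencies (ms) to percentile, mean, spread, and throughput."""
    vals = sorted(latencies_ms)
    mean_ms = statistics.fmean(vals)
    return LatencySummary(
        p50_ms=percentile(vals, 50),
        p90_ms=percentile(vals, 90),
        p99_ms=percentile(vals, 99),
        mean_ms=mean_ms,
        std_ms=statistics.pstdev(vals),
        # Assumption: throughput counts samples per second, from mean latency.
        throughput_s=batch_size * 1000.0 / mean_ms,
    )


summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary.p50_ms, summary.mean_ms)
```

Tail percentiles such as p99 are only meaningful with enough measured steps; with 100 steps, p99 is determined by the slowest one or two iterations.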
See Also
- Benchmark - Unified benchmarking with benchmark()
- Profiling - Per-layer speed analysis with LayerProfiler
- Memory - Memory consumption metrics
- Profiling Tutorial - Practical examples of sweep functions