# Profiling Tutorial

## Overview

When optimizing models, you need to know which layers are the bottlenecks. This tutorial covers advanced profiling techniques in fasterbench for identifying performance issues:

| Tool | Use Case |
|---|---|
| `LayerProfiler` | Unified per-layer analysis (speed, memory, size, compute) |
| `sweep_batch_sizes()` | Find optimal batch size for throughput |
| `sweep_threads()` | Find optimal CPU thread count |
| `sweep_latency()` | Analyze latency vs. input resolution |

## 1. LayerProfiler: Unified Per-Layer Analysis

The `LayerProfiler` class provides a unified interface for profiling multiple metrics per layer. It is the recommended approach for comprehensive layer-level analysis:

```python
import torch
import pandas as pd
from torchvision.models import resnet18

from fasterbench.profiling import LayerProfiler

model = resnet18()
dummy = torch.randn(1, 3, 224, 224)

# Create profiler
profiler = LayerProfiler(model, dummy)

# Profile multiple metrics at once
results = profiler.profile(["speed", "size", "memory"], device="cpu", warmup=3, steps=10)
```
### Available Metrics

| Metric | Columns Added | Description |
|---|---|---|
| `speed` | `speed_ms`, `speed_percent` | Forward pass latency per layer |
| `memory` | `memory_mib`, `memory_percent` | Output tensor size (activation memory) |
| `size` | `params`, `params_percent` | Parameter count per layer |
| `compute` | `macs`, `macs_percent` | MACs per layer (requires torchprofile) |
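Because the `memory` metric reports output-tensor (activation) size, it can be sanity-checked by hand. A minimal sketch, assuming float32 activations (4 bytes each) and ResNet-18's `conv1` output shape of `(1, 64, 112, 112)` for a 224×224 input; `activation_mib` is an illustrative helper, not part of fasterbench:

```python
import math

def activation_mib(shape, bytes_per_elem=4):
    """Size of a dense activation tensor in MiB (float32 assumed by default)."""
    return math.prod(shape) * bytes_per_elem / 2**20

# resnet18's conv1 maps a (1, 3, 224, 224) input to a (1, 64, 112, 112) output
print(f"{activation_mib((1, 64, 112, 112)):.3f} MiB")  # -> 3.062 MiB
```

This matches the 3.062 MiB reported for `conv1` in the summary output further below.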
### Utility Methods: top() and summary()

`LayerProfiler` provides convenient methods to quickly identify bottlenecks after profiling:
```python
# Get top 5 slowest layers
print("Top 5 slowest layers:")
for r in profiler.top("speed", n=5):
    print(f"  {r['name']:30} {r['speed_ms']:.3f} ms ({r['speed_percent']:.1f}%)")

# Get top 5 fastest layers (ascending order)
print("\nTop 5 fastest layers:")
for r in profiler.top("speed", n=5, ascending=True):
    print(f"  {r['name']:30} {r['speed_ms']:.3f} ms")

# Get layers with most parameters
print("\nTop 5 layers by parameter count:")
for r in profiler.top("size", n=5):
    print(f"  {r['name']:30} {r['params']:>12,} params")
```

```
Top 5 slowest layers:
  layer4.0.conv2                 0.503 ms (6.8%)
  layer4.1.conv1                 0.498 ms (6.7%)
  layer4.1.conv2                 0.494 ms (6.6%)
  layer3.0.conv2                 0.346 ms (4.6%)
  layer3.1.conv1                 0.345 ms (4.6%)

Top 5 fastest layers:
  layer4.0.relu                  0.011 ms
  layer4.1.relu                  0.011 ms
  layer3.0.relu                  0.012 ms
  layer3.1.relu                  0.013 ms
  layer2.0.relu                  0.013 ms

Top 5 layers by parameter count:
  layer4.0.conv2                  2,359,296.0 params
  layer4.1.conv1                  2,359,296.0 params
  layer4.1.conv2                  2,359,296.0 params
  layer4.0.conv1                  1,179,648.0 params
  layer3.0.conv2                    589,824.0 params
```
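Conceptually, `top()` amounts to sorting the per-layer records by a metric column and slicing. A minimal pure-Python sketch of that idea (`sort_top` is a hypothetical stand-in, not the library function; the records are abbreviated from the output above):

```python
def sort_top(records, metric, n=5, ascending=False):
    # Sort per-layer records by the chosen metric column, keep the first n
    return sorted(records, key=lambda r: r[metric], reverse=not ascending)[:n]

records = [
    {"name": "conv1", "speed_ms": 0.337},
    {"name": "layer4.0.conv2", "speed_ms": 0.503},
    {"name": "layer4.0.relu", "speed_ms": 0.011},
]
print([r["name"] for r in sort_top(records, "speed_ms", n=2)])
# -> ['layer4.0.conv2', 'conv1']
```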
```python
# Print formatted summary of all profiled metrics
profiler.summary(top=10)
```

```
═══ Speed (slowest) ═══════════════════════════════════════
layer4.0.conv2            Conv2d        0.503 ms ( 6.8%)
layer4.1.conv1            Conv2d        0.498 ms ( 6.7%)
layer4.1.conv2            Conv2d        0.494 ms ( 6.6%)
layer3.0.conv2            Conv2d        0.346 ms ( 4.6%)
layer3.1.conv1            Conv2d        0.345 ms ( 4.6%)
layer3.1.conv2            Conv2d        0.340 ms ( 4.6%)
conv1                     Conv2d        0.337 ms ( 4.5%)
maxpool                   MaxPool2d     0.330 ms ( 4.4%)
layer1.0.conv1            Conv2d        0.328 ms ( 4.4%)
layer2.0.conv2            Conv2d        0.322 ms ( 4.3%)

═══ Parameters (largest) ══════════════════════════════════
layer4.0.conv2            Conv2d        2,359,296 ( 20.2%)
layer4.1.conv1            Conv2d        2,359,296 ( 20.2%)
layer4.1.conv2            Conv2d        2,359,296 ( 20.2%)
layer4.0.conv1            Conv2d        1,179,648 ( 10.1%)
layer3.0.conv2            Conv2d          589,824 (  5.0%)
layer3.1.conv1            Conv2d          589,824 (  5.0%)
layer3.1.conv2            Conv2d          589,824 (  5.0%)
fc                        Linear          513,000 (  4.4%)
layer3.0.conv1            Conv2d          294,912 (  2.5%)
layer2.0.conv2            Conv2d          147,456 (  1.3%)

═══ Memory (largest) ══════════════════════════════════════
conv1                     Conv2d        3.062 MiB ( 11.9%)
bn1                       BatchNorm2d   3.062 MiB ( 11.9%)
relu                      ReLU          3.062 MiB ( 11.9%)
maxpool                   MaxPool2d     0.766 MiB (  3.0%)
layer1.0.conv1            Conv2d        0.766 MiB (  3.0%)
layer1.1.conv2            Conv2d        0.766 MiB (  3.0%)
layer1.1.conv1            Conv2d        0.766 MiB (  3.0%)
layer1.0.conv2            Conv2d        0.766 MiB (  3.0%)
layer1.0.bn1              BatchNorm2d   0.766 MiB (  3.0%)
layer1.1.bn1              BatchNorm2d   0.766 MiB (  3.0%)
```
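The parameter counts in the summary can be verified from layer shapes alone: a bias-free k×k convolution has c_out · c_in · k · k weights. A quick check against standard ResNet-18 shapes (the shapes here are assumed from the architecture, not read from the profiler; `conv_params` is an illustrative helper):

```python
def conv_params(c_in, c_out, k, bias=False):
    # Conv2d weight tensor has shape (c_out, c_in, k, k); bias adds c_out more
    return c_out * c_in * k * k + (c_out if bias else 0)

print(conv_params(512, 512, 3))   # layer4.0.conv2 -> 2359296
print(conv_params(256, 512, 3))   # layer4.0.conv1 -> 1179648
print(512 * 1000 + 1000)          # fc: Linear(512, 1000) with bias -> 513000
```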
## 2. Batch Size Sweeping

Find the optimal batch size for maximum throughput. Larger batches improve GPU utilization but eventually hit memory limits.

```python
from fasterbench import sweep_batch_sizes

if torch.cuda.is_available():
    results = sweep_batch_sizes(
        model,
        input_shape=(3, 224, 224),  # shape WITHOUT the batch dimension
        batch_sizes=[1, 2, 4, 8, 16, 32],
        device="cuda",
        warmup=10,
        steps=50,
    )

    print("Batch Size Analysis:")
    print("-" * 60)
    print(f"{'Batch':>6} {'Latency':>10} {'Per-Sample':>12} {'Throughput':>12}")
    print(f"{'Size':>6} {'(ms)':>10} {'(ms)':>12} {'(inf/s)':>12}")
    print("-" * 60)
    for r in results:
        if 'throughput_s' in r and not pd.isna(r.get('mean_ms')):
            print(f"{r['batch_size']:>6} {r['mean_ms']:>10.2f} {r['latency_per_sample_ms']:>12.3f} {r['throughput_s']:>12.1f}")
```

```
Batch Size Analysis:
------------------------------------------------------------
 Batch    Latency   Per-Sample   Throughput
  Size       (ms)         (ms)      (inf/s)
------------------------------------------------------------
     1       0.69        0.687       1455.3
     2       0.72        0.362       2760.6
     4       0.87        0.218       4582.8
     8       1.06        0.133       7530.9
    16       1.33        0.083      11988.0
    32       2.54        0.079      12605.3
```
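The three reported columns are tied together by simple identities: per-sample latency is the mean latency divided by batch size, and throughput is batch size divided by mean latency in seconds. A quick check against the batch-16 row (the small gap vs. the reported 11988 inf/s comes from the mean being rounded to 1.33 ms in the printout; `derived_stats` is an illustrative helper):

```python
def derived_stats(batch_size, mean_ms):
    # Per-sample latency and throughput implied by one batched latency measurement
    return {
        "latency_per_sample_ms": mean_ms / batch_size,
        "throughput_s": batch_size / (mean_ms / 1000.0),
    }

stats = derived_stats(16, 1.33)
print(f"{stats['latency_per_sample_ms']:.3f} ms/sample, {stats['throughput_s']:.0f} inf/s")
```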
## 3. Thread Count Sweeping (CPU)

For CPU inference, the number of threads significantly impacts performance. More threads isn't always better due to contention.

```python
import os

from fasterbench import sweep_threads

num_cores = os.cpu_count()
thread_counts = [t for t in [1, 2, 4, 8, 16, 32] if t <= num_cores]

results = sweep_threads(model, dummy, thread_counts=thread_counts, warmup=10, steps=30)

print("Thread Count Analysis:")
print("-" * 50)
print(f"{'Threads':>8} {'Latency (ms)':>15} {'Throughput':>15}")
print("-" * 50)
for r in results:
    print(f"{r['threads']:>8} {r['mean_ms']:>15.2f} {r['throughput_s']:>15.1f}")
```

```
Thread Count Analysis:
--------------------------------------------------
 Threads    Latency (ms)      Throughput
--------------------------------------------------
       1           30.66            32.6
       2           29.59            33.8
       4           29.16            34.3
       8           33.89            29.5
      16           29.92            33.4
      32           31.37            31.9
```
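The warmup/steps pattern used throughout these sweeps matters especially on CPU: the first few runs pay for lazy initialization and thread-pool spin-up, so they are excluded from the average. A minimal sketch of such a timing loop, with a toy workload standing in for model inference (this is not the fasterbench implementation, just the general pattern):

```python
import time

def time_fn(fn, warmup=10, steps=30):
    # Warmup runs: executed but not timed (caches, lazy init, thread pools)
    for _ in range(warmup):
        fn()
    # Timed runs: average wall-clock time over `steps` calls
    t0 = time.perf_counter()
    for _ in range(steps):
        fn()
    elapsed = time.perf_counter() - t0
    mean_ms = elapsed / steps * 1000
    return {"mean_ms": mean_ms, "throughput_s": 1000 / mean_ms}

stats = time_fn(lambda: sum(i * i for i in range(10_000)))
print(f"{stats['mean_ms']:.3f} ms, {stats['throughput_s']:.1f} inf/s")
```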
## 4. Input Resolution Sweeping

For vision models, latency scales with input resolution. Use this to find the right speed/accuracy trade-off:

```python
from fasterbench import sweep_latency

shapes = [
    (1, 3, 128, 128),
    (1, 3, 224, 224),
    (1, 3, 384, 384),
    (1, 3, 512, 512),
]

results = sweep_latency(model, shapes, device="cpu", warmup=5, steps=20)

print("Resolution Analysis:")
print("-" * 50)
print(f"{'Shape':>20} {'Latency (ms)':>15} {'Throughput':>12}")
print("-" * 50)
for r in results:
    print(f"{r['shape']:>20} {r['mean_ms']:>15.2f} {r['throughput_s']:>12.1f}")
```

```
Resolution Analysis:
--------------------------------------------------
               Shape    Latency (ms)   Throughput
--------------------------------------------------
         1×3×128×128           12.18         82.1
         1×3×224×224           32.76         30.5
         1×3×384×384          167.54          6.0
         1×3×512×512          438.43          2.3
```
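Since convolution cost grows roughly with the number of input pixels, normalizing the measured latencies to ms per megapixel makes the scaling easier to read. Using the figures from the run above (treated here as illustrative data), cost per pixel rises sharply past 224×224, plausibly as activations stop fitting in cache:

```python
measurements = [  # (input shape, mean latency in ms) from the sweep above
    ((1, 3, 128, 128), 12.18),
    ((1, 3, 224, 224), 32.76),
    ((1, 3, 384, 384), 167.54),
    ((1, 3, 512, 512), 438.43),
]

for (n, c, h, w), ms in measurements:
    megapixels = h * w / 1e6
    print(f"{h}x{w}: {ms / megapixels:7.1f} ms/MP")
```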
## Summary

| Tool | Use Case |
|---|---|
| `LayerProfiler` | Comprehensive per-layer analysis (speed, memory, size, compute) |
| `LayerProfiler.top()` | Get top N layers sorted by any metric |
| `LayerProfiler.summary()` | Print formatted summary of all metrics |
| `sweep_batch_sizes()` | Find optimal batch size for throughput |
| `sweep_threads()` | Find optimal CPU thread count |
| `sweep_latency()` | Analyze latency vs. input resolution |
## See Also

- Getting Started Tutorial - Basic benchmarking with `benchmark()`
- Sensitivity Analysis - Analyze layer importance for pruning
- Profiling API - Full `LayerProfiler` reference
- Speed Metrics - Detailed speed measurement options