Profiling Tutorial

Advanced profiling techniques for model optimization

Overview

When optimizing models, you need to know which layers are the bottlenecks. This tutorial covers advanced profiling techniques in fasterbench for identifying performance issues:

Tool                 Use Case
-------------------  ---------------------------------------------------------
LayerProfiler        Unified per-layer analysis (speed, memory, size, compute)
sweep_batch_sizes()  Find optimal batch size for throughput
sweep_threads()      Find optimal CPU thread count
sweep_latency()      Analyze latency vs input resolution

1. LayerProfiler: Unified Per-Layer Analysis

The LayerProfiler class provides a unified interface for profiling multiple metrics per layer. This is the recommended approach for comprehensive layer-level analysis.

import torch
import pandas as pd
from torchvision.models import resnet18
from fasterbench.profiling import LayerProfiler

model = resnet18()
dummy = torch.randn(1, 3, 224, 224)

# Create profiler
profiler = LayerProfiler(model, dummy)

# Profile multiple metrics at once
results = profiler.profile(["speed", "size", "memory"], device="cpu", warmup=3, steps=10)

Available Metrics

Metric   Columns Added               Description
-------  --------------------------  ----------------------------------------
speed    speed_ms, speed_percent     Forward pass latency per layer
memory   memory_mib, memory_percent  Output tensor size (activation memory)
size     params, params_percent      Parameter count per layer
compute  macs, macs_percent          MACs per layer (requires torchprofile)
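The per-layer records produced by profile() behave like plain dicts carrying the columns above, so they can be filtered with ordinary Python. A minimal sketch with illustrative values (not real measurements) showing how to flag layers that are both slow and parameter-heavy:

```python
# Illustrative per-layer records with columns from the "speed" and "size"
# metrics; real records come from profiler.profile([...]).
records = [
    {"name": "conv1",          "speed_ms": 0.337, "speed_percent": 4.5, "params": 9_408},
    {"name": "layer4.0.conv2", "speed_ms": 0.503, "speed_percent": 6.8, "params": 2_359_296},
    {"name": "fc",             "speed_ms": 0.051, "speed_percent": 0.7, "params": 513_000},
]

# Layers that are both slow (>5% of total latency) and heavy (>1M params)
# are usually the best pruning or fusion candidates. Thresholds are
# illustrative; tune them for your model.
hot = [r["name"] for r in records if r["speed_percent"] > 5 and r["params"] > 1_000_000]
print(hot)  # ['layer4.0.conv2']
```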

Utility Methods: top() and summary()

The LayerProfiler provides convenient methods to quickly identify bottlenecks after profiling:

# Get top 5 slowest layers
print("Top 5 slowest layers:")
for r in profiler.top("speed", n=5):
    print(f"  {r['name']:30} {r['speed_ms']:.3f} ms ({r['speed_percent']:.1f}%)")

# Get top 5 fastest layers (ascending order)
print("\nTop 5 fastest layers:")
for r in profiler.top("speed", n=5, ascending=True):
    print(f"  {r['name']:30} {r['speed_ms']:.3f} ms")

# Get layers with most parameters
print("\nTop 5 layers by parameter count:")
for r in profiler.top("size", n=5):
    print(f"  {r['name']:30} {r['params']:>12,} params")
Top 5 slowest layers:
  layer4.0.conv2                 0.503 ms (6.8%)
  layer4.1.conv1                 0.498 ms (6.7%)
  layer4.1.conv2                 0.494 ms (6.6%)
  layer3.0.conv2                 0.346 ms (4.6%)
  layer3.1.conv1                 0.345 ms (4.6%)

Top 5 fastest layers:
  layer4.0.relu                  0.011 ms
  layer4.1.relu                  0.011 ms
  layer3.0.relu                  0.012 ms
  layer3.1.relu                  0.013 ms
  layer2.0.relu                  0.013 ms

Top 5 layers by parameter count:
  layer4.0.conv2                  2,359,296.0 params
  layer4.1.conv1                  2,359,296.0 params
  layer4.1.conv2                  2,359,296.0 params
  layer4.0.conv1                  1,179,648.0 params
  layer3.0.conv2                    589,824.0 params

# Print formatted summary of all profiled metrics
profiler.summary(top=10)
═══ Speed (slowest) ═══════════════════════════════════════
  layer4.0.conv2                           Conv2d             0.503 ms (  6.8%)
  layer4.1.conv1                           Conv2d             0.498 ms (  6.7%)
  layer4.1.conv2                           Conv2d             0.494 ms (  6.6%)
  layer3.0.conv2                           Conv2d             0.346 ms (  4.6%)
  layer3.1.conv1                           Conv2d             0.345 ms (  4.6%)
  layer3.1.conv2                           Conv2d             0.340 ms (  4.6%)
  conv1                                    Conv2d             0.337 ms (  4.5%)
  maxpool                                  MaxPool2d          0.330 ms (  4.4%)
  layer1.0.conv1                           Conv2d             0.328 ms (  4.4%)
  layer2.0.conv2                           Conv2d             0.322 ms (  4.3%)

═══ Parameters (largest) ══════════════════════════════════
  layer4.0.conv2                           Conv2d             2,359,296 ( 20.2%)
  layer4.1.conv1                           Conv2d             2,359,296 ( 20.2%)
  layer4.1.conv2                           Conv2d             2,359,296 ( 20.2%)
  layer4.0.conv1                           Conv2d             1,179,648 ( 10.1%)
  layer3.0.conv2                           Conv2d               589,824 (  5.0%)
  layer3.1.conv1                           Conv2d               589,824 (  5.0%)
  layer3.1.conv2                           Conv2d               589,824 (  5.0%)
  fc                                       Linear               513,000 (  4.4%)
  layer3.0.conv1                           Conv2d               294,912 (  2.5%)
  layer2.0.conv2                           Conv2d               147,456 (  1.3%)

═══ Memory (largest) ══════════════════════════════════════
  conv1                                    Conv2d             3.062 MiB ( 11.9%)
  bn1                                      BatchNorm2d        3.062 MiB ( 11.9%)
  relu                                     ReLU               3.062 MiB ( 11.9%)
  maxpool                                  MaxPool2d          0.766 MiB (  3.0%)
  layer1.0.conv1                           Conv2d             0.766 MiB (  3.0%)
  layer1.1.conv2                           Conv2d             0.766 MiB (  3.0%)
  layer1.1.conv1                           Conv2d             0.766 MiB (  3.0%)
  layer1.0.conv2                           Conv2d             0.766 MiB (  3.0%)
  layer1.0.bn1                             BatchNorm2d        0.766 MiB (  3.0%)
  layer1.1.bn1                             BatchNorm2d        0.766 MiB (  3.0%)
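summary() prints to stdout; for downstream analysis you usually want the records themselves. Assuming top() returns plain dicts as iterated in the loops above (values below are illustrative), they can be dumped to CSV with the standard library:

```python
import csv
import io

# Records shaped like those returned by profiler.top("speed", n=...);
# the values here are illustrative, not fresh measurements.
rows = [
    {"name": "layer4.0.conv2", "speed_ms": 0.503, "speed_percent": 6.8},
    {"name": "layer4.1.conv1", "speed_ms": 0.498, "speed_percent": 6.7},
]

# Write to an in-memory buffer; swap io.StringIO for open("layers.csv", "w")
# to produce a file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "speed_ms", "speed_percent"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Since pandas is already imported in this tutorial, `pd.DataFrame(rows).to_csv("layers.csv", index=False)` achieves the same in one line.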

2. Batch Size Sweeping

Find the optimal batch size for maximum throughput. Larger batches improve GPU utilization but eventually hit memory limits.

from fasterbench import sweep_batch_sizes

if torch.cuda.is_available():
    results = sweep_batch_sizes(
        model,
        input_shape=(3, 224, 224),  # Shape WITHOUT batch dimension
        batch_sizes=[1, 2, 4, 8, 16, 32],
        device="cuda",
        warmup=10,
        steps=50
    )
    
    print("Batch Size Analysis:")
    print("-" * 60)
    print(f"{'Batch':>6} {'Latency':>10} {'Per-Sample':>12} {'Throughput':>12}")
    print(f"{'Size':>6} {'(ms)':>10} {'(ms)':>12} {'(inf/s)':>12}")
    print("-" * 60)
    for r in results:
        if 'throughput_s' in r and not pd.isna(r.get('mean_ms')):
            print(f"{r['batch_size']:>6} {r['mean_ms']:>10.2f} {r['latency_per_sample_ms']:>12.3f} {r['throughput_s']:>12.1f}")
Batch Size Analysis:
------------------------------------------------------------
 Batch    Latency   Per-Sample   Throughput
  Size       (ms)         (ms)      (inf/s)
------------------------------------------------------------
     1       0.69        0.687       1455.3
     2       0.72        0.362       2760.6
     4       0.87        0.218       4582.8
     8       1.06        0.133       7530.9
    16       1.33        0.083      11988.0
    32       2.54        0.079      12605.3
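Note the diminishing returns in the run above: going from batch 16 to 32 roughly doubles latency for a ~5% throughput gain. Rather than taking the raw maximum, a common heuristic is to pick the smallest batch size within some tolerance of peak throughput. A sketch using the figures from the sweep above (the 10% tolerance is an illustrative choice, not a fasterbench default):

```python
# Throughput figures from the sweep above (batch size -> inferences/sec).
throughput = {1: 1455.3, 2: 2760.6, 4: 4582.8, 8: 7530.9, 16: 11988.0, 32: 12605.3}

# Smallest batch size within 10% of peak throughput: beyond this point,
# larger batches mostly add latency and memory pressure.
best = max(throughput.values())
chosen = min(bs for bs, t in throughput.items() if t >= 0.9 * best)
print(chosen)  # 16
```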

3. Thread Count Sweeping (CPU)

For CPU inference, the number of threads significantly impacts performance. More threads isn’t always better due to contention.

from fasterbench import sweep_threads
import os

num_cores = os.cpu_count()
thread_counts = [t for t in [1, 2, 4, 8, 16, 32] if t <= num_cores]

results = sweep_threads(model, dummy, thread_counts=thread_counts, warmup=10, steps=30)

print("Thread Count Analysis:")
print("-" * 50)
print(f"{'Threads':>8} {'Latency (ms)':>15} {'Throughput':>15}")
print("-" * 50)
for r in results:
    print(f"{r['threads']:>8} {r['mean_ms']:>15.2f} {r['throughput_s']:>15.1f}")
Thread Count Analysis:
--------------------------------------------------
 Threads    Latency (ms)      Throughput
--------------------------------------------------
       1           30.66            32.6
       2           29.59            33.8
       4           29.16            34.3
       8           33.89            29.5
      16           29.92            33.4
      32           31.37            31.9
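The sweep results can feed directly into a thread-count decision. A sketch using the figures from the run above; the "fewest threads within ~2% of the best latency" rule is an illustrative heuristic (it leaves cores free for other work), not something fasterbench prescribes:

```python
# Latency figures from the sweep above (threads -> mean latency in ms).
latency = {1: 30.66, 2: 29.59, 4: 29.16, 8: 33.89, 16: 29.92, 32: 31.37}

# Fewest threads within ~2% of the best latency.
best = min(latency.values())
chosen = min(t for t, ms in latency.items() if ms <= 1.02 * best)
print(chosen)  # 2

# Then pin it for subsequent inference:
# torch.set_num_threads(chosen)
```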

4. Input Resolution Sweeping

For vision models, latency scales with input resolution. Use this to find the right speed/accuracy trade-off:

from fasterbench import sweep_latency

shapes = [
    (1, 3, 128, 128),
    (1, 3, 224, 224),
    (1, 3, 384, 384),
    (1, 3, 512, 512),
]

results = sweep_latency(model, shapes, device="cpu", warmup=5, steps=20)

print("Resolution Analysis:")
print("-" * 50)
print(f"{'Shape':>20} {'Latency (ms)':>15} {'Throughput':>12}")
print("-" * 50)
for r in results:
    print(f"{r['shape']:>20} {r['mean_ms']:>15.2f} {r['throughput_s']:>12.1f}")
Resolution Analysis:
--------------------------------------------------
               Shape    Latency (ms)   Throughput
--------------------------------------------------
         1×3×128×128           12.18         82.1
         1×3×224×224           32.76         30.5
         1×3×384×384          167.54          6.0
         1×3×512×512          438.43          2.3
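A sweep like this turns the speed/accuracy trade-off into a simple budget check: since higher resolution generally means higher accuracy, take the largest input that still meets your latency target. A sketch using the figures from the run above (the 50 ms budget, i.e. 20 FPS, is an illustrative assumption):

```python
# Mean latencies from the sweep above (square input resolution -> ms).
latency_ms = {128: 12.18, 224: 32.76, 384: 167.54, 512: 438.43}

budget_ms = 50.0  # example per-frame budget (~20 FPS)

# Largest resolution that fits within the budget.
fit = max(res for res, ms in latency_ms.items() if ms <= budget_ms)
print(fit)  # 224
```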

Summary

Tool                     Use Case
-----------------------  ----------------------------------------------------------------
LayerProfiler            Comprehensive per-layer analysis (speed, memory, size, compute)
LayerProfiler.top()      Get top N layers sorted by any metric
LayerProfiler.summary()  Print formatted summary of all metrics
sweep_batch_sizes()      Find optimal batch size for throughput
sweep_threads()          Find optimal CPU thread count
sweep_latency()          Analyze latency vs input resolution

See Also