import torch
from fasterbench.roofline import measure_peaks
peaks = measure_peaks("cuda", steps=20, warmup=5)
print(peaks)
Roofline analysis
This notebook demonstrates measurement primitives. For compression decisions based on roofline data, see fasterrecipes.
What is a roofline?
The roofline model (Williams et al., 2009) plots a layer’s achieved performance against its arithmetic intensity:
- Arithmetic intensity (AI) = FLOPs per byte moved to or from memory. A property of the computation itself.
- Achieved performance = FLOPs per second actually delivered on the device.
- The roof is the min of two ceilings: a sloped line AI x peak_bandwidth (memory-bound region) and a flat line peak_flops (compute-bound region).
- The ridge point peak_flops / peak_bandwidth is the AI at which the two ceilings meet. Layers with AI < ridge are memory-bound; layers with AI >= ridge are compute-bound.
On a log-log plot, the roof looks like a tilted ceiling with a flat top. Each layer becomes a marker underneath that ceiling.
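The definitions above fit in a few lines of plain Python. This is a minimal sketch, not fasterbench code: the helpers roof and bound are hypothetical names, and the peak values are round illustrative numbers rather than a measurement.

```python
def roof(ai: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Attainable FLOPs/s at arithmetic intensity `ai` (FLOPs/byte)."""
    # The roof is the lower of the bandwidth ceiling and the compute ceiling.
    return min(ai * peak_bandwidth, peak_flops)


def bound(ai: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Classify an intensity relative to the ridge point."""
    ridge = peak_flops / peak_bandwidth
    return "memory" if ai < ridge else "compute"


# Hypothetical round numbers chosen for illustration only.
PEAK_FLOPS = 8.0e12      # FLOPs/s
PEAK_BANDWIDTH = 5.0e11  # bytes/s

ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # 16 FLOPs/byte for these numbers
for ai in (1.0, 16.0, 64.0):
    print(ai, bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH), roof(ai, PEAK_FLOPS, PEAK_BANDWIDTH))
```

With these numbers, an intensity of 1 FLOP/byte is capped by bandwidth, while 16 and 64 FLOPs/byte both hit the flat compute ceiling.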
Measuring hardware peaks
measure_peaks() empirically probes the device with a large square matmul (for peak FLOPs/s) and a cache-defeating memory copy (for streaming bandwidth). It returns a HardwarePeaks dataclass.
By default, TF32 is pinned off on CUDA so the fp32 peak reflects honest fp32 throughput. Pass allow_tf32=True if you want the TF32 peak instead.
HardwarePeaks(peak_flops=8.12e+12, peak_bandwidth=5.43e+11, ridge_point=14.95, device='cuda:0', dtype='torch.float32', tf32_enabled=False, cudnn_benchmark=False)
The ridge point here is ~15 FLOPs/byte. Any layer below that intensity is memory-bound on this device.
Profiling ResNet-18
RooflineAnalyzer profiles a model in a single pass: forward hooks on every leaf module record FLOPs (computed analytically for Conv and Linear), bytes moved (weights + input + output, following Williams et al., 2009), and wall time.
If you do not pass a peaks= argument, it calls measure_peaks() automatically.
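To make the analytic accounting concrete, here is a rough sketch of how FLOPs and bytes could be derived for a single convolution. The helper conv2d_cost is hypothetical and not part of fasterbench. It assumes 2 FLOPs per multiply-accumulate, fp32 tensors (4 bytes per element), no bias, and that weights, input, and output each cross the memory bus exactly once. The library's own byte accounting may differ.

```python
def conv2d_cost(c_in, c_out, kernel, h_in, w_in, h_out, w_out,
                batch=1, bytes_per_elem=4):
    """Return (flops, bytes_moved, ai) for a bias-free 2D convolution."""
    # Each output element needs c_in * kernel * kernel multiply-accumulates.
    macs = batch * c_out * h_out * w_out * c_in * kernel * kernel
    flops = 2 * macs  # assumption: one multiply plus one add per MAC

    weight_elems = c_out * c_in * kernel * kernel
    input_elems = batch * c_in * h_in * w_in
    output_elems = batch * c_out * h_out * w_out
    bytes_moved = bytes_per_elem * (weight_elems + input_elems + output_elems)

    return flops, bytes_moved, flops / bytes_moved


# A 3x3, 512 -> 512 channel conv on a 7x7 feature map, the shape found in
# the last stage of ResNet-18 for a 224x224 input.
flops, bytes_moved, ai = conv2d_cost(512, 512, 3, 7, 7, 7, 7)
print(f"{flops / 1e6:.2f}M FLOPs, {bytes_moved / 1e6:.2f}M bytes, AI = {ai:.2f}")
```

Under these assumptions the weights dominate the byte count for this layer, which is why its intensity is governed mostly by how many output positions reuse each weight.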
from torchvision.models import resnet18
from fasterbench.roofline import RooflineAnalyzer
model = resnet18()
sample = torch.randn(1, 3, 224, 224)
ra = RooflineAnalyzer(model, sample)
ra.profile(device="cuda", warmup=5, steps=20)
ra.summary(top=10)
=== Roofline =============================================================
name type FLOPs bytes AI GFLOPs/s bound util%
layer4.0.conv2 Conv2d 231.21M 10.01M 23.10 820.14 compute 10.1%
layer4.1.conv1 Conv2d 231.21M 10.01M 23.10 810.22 compute 10.0%
layer4.1.conv2 Conv2d 231.21M 10.01M 23.10 812.49 compute 10.0%
layer3.0.conv2 Conv2d 115.61M 5.11M 22.62 402.33 compute 5.0%
...
Each row shows a layer’s FLOPs, bytes moved, arithmetic intensity, achieved throughput, bound classification, and utilization (fraction of the roof reached).
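As a sanity check on the util% column, the sketch below recomputes utilization for the first row from the peaks printed earlier. It assumes utilization is achieved throughput divided by the roof at the layer's intensity, as the description above suggests. The function utilization is a hypothetical helper, not a fasterbench API.

```python
def utilization(achieved_flops, ai, peak_flops, peak_bandwidth):
    """Fraction of the roof reached at arithmetic intensity `ai`."""
    roof = min(ai * peak_bandwidth, peak_flops)
    return achieved_flops / roof


# Peaks taken from the HardwarePeaks output shown earlier.
PEAK_FLOPS = 8.12e12
PEAK_BANDWIDTH = 5.43e11

# layer4.0.conv2: AI = 23.10 FLOPs/byte, 820.14 GFLOPs/s achieved.
util = utilization(820.14e9, 23.10, PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"{util:.1%}")
```

Because 23.10 is above the ridge point, the roof for this layer is the flat compute ceiling, so utilization is simply achieved throughput over peak_flops.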
Reading the plot
ra.plot() returns a plotly Figure with the roof line, the ridge point, and one marker per layer. Memory-bound layers are colored teal; compute-bound layers are a darker teal.
fig = ra.plot(title="ResNet-18 roofline (CUDA)")
fig.show()
How to read the plot:
- The diagonal segment (slope 1 on log-log) is the memory bandwidth ceiling.
- The flat segment is the compute ceiling.
- A marker near the roof indicates a layer achieving a high fraction of what the hardware permits at its intensity.
- A marker far below the roof indicates a layer that leaves available performance unused.
- A marker to the left of the ridge point sits in the memory-bound region; one to the right sits in the compute-bound region.
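The shape of the roof can be checked numerically without plotting anything. The sketch below samples the roof at log-spaced intensities and measures the slope in log-log space on each side of the ridge. The helpers roof_curve and loglog_slope are hypothetical and independent of how ra.plot() builds its figure. The peaks are the ones printed earlier.

```python
import math


def roof_curve(peak_flops, peak_bandwidth, ai_min=0.1, ai_max=1000.0, n=50):
    """Sample the roof at n log-spaced intensities; returns (ai, flops) pairs."""
    lo, hi = math.log10(ai_min), math.log10(ai_max)
    points = []
    for i in range(n):
        ai = 10 ** (lo + (hi - lo) * i / (n - 1))
        points.append((ai, min(ai * peak_bandwidth, peak_flops)))
    return points


def loglog_slope(p, q):
    """Slope of the segment p -> q when both axes are logarithmic."""
    (x0, y0), (x1, y1) = p, q
    return (math.log10(y1) - math.log10(y0)) / (math.log10(x1) - math.log10(x0))


curve = roof_curve(8.12e12, 5.43e11)
memory_side = loglog_slope(curve[0], curve[1])     # well left of the ridge
compute_side = loglog_slope(curve[-2], curve[-1])  # well right of the ridge
print(memory_side, compute_side)
```

The first slope comes out at 1 (up to floating-point error) and the second at 0, matching the diagonal and flat segments described above.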
Comparing input resolutions
Arithmetic intensity depends on tensor shapes as well as on the operation itself. Increasing the spatial resolution increases activation traffic, and in the run below more layers land in the memory-bound region.
for side in (224, 512):
    x = torch.randn(1, 3, side, side)
    ra = RooflineAnalyzer(model, x, peaks=peaks)
    ra.profile(device="cuda", warmup=3, steps=10)
    mem_bound = sum(1 for r in ra.results if r.bound == "memory")
    comp_bound = sum(1 for r in ra.results if r.bound == "compute")
    print(f"{side}x{side}: {mem_bound} memory-bound, {comp_bound} compute-bound")
224x224: 18 memory-bound, 42 compute-bound
512x512: 31 memory-bound, 29 compute-bound
At 512x512 many more layers fall below the ridge point. For a fixed kernel, both activation bytes and conv FLOPs scale with H x W, but with different constant factors, and BN/ReLU/pooling layers (which have very low AI) dominate when activations are large.
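To see why elementwise layers sit so far to the left of the ridge, consider a back-of-the-envelope cost model for ReLU. This is a sketch under stated assumptions: one FLOP per element, fp32 tensors, and one read plus one write per element with no operator fusion. The helper relu_cost is hypothetical and not part of fasterbench.

```python
def relu_cost(channels, height, width, batch=1, bytes_per_elem=4):
    """Return (flops, bytes_moved, ai) for an unfused elementwise ReLU."""
    elems = batch * channels * height * width
    flops = elems                             # assumption: one FLOP per element
    bytes_moved = 2 * bytes_per_elem * elems  # read the input, write the output
    return flops, bytes_moved, flops / bytes_moved


# Feature maps after the ResNet-18 stem: 112x112 for a 224 input,
# 256x256 for a 512 input, both with 64 channels.
for side in (112, 256):
    flops, bytes_moved, ai = relu_cost(64, side, side)
    print(f"{side}x{side}: {bytes_moved / 1e6:.2f}M bytes, AI = {ai}")
```

Under this model the intensity stays at 0.125 FLOPs/byte regardless of resolution, far below a ridge point of about 15, while the bytes moved grow in proportion to H x W.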
Summary
| Tool | Purpose |
|---|---|
| measure_peaks() | Empirically probe peak FLOPs/s and streaming bandwidth |
| HardwarePeaks | Dataclass holding device peaks and ridge point |
| RooflineAnalyzer | Per-layer roofline profiler |
| RooflineAnalyzer.profile() | Measure FLOPs, bytes moved, and time per layer |
| RooflineAnalyzer.summary() | Print a table of the slowest layers with their roofline metrics |
| RooflineAnalyzer.plot() | Plotly figure with roof ceiling and per-layer markers |
| RooflinePoint | Dataclass for a single layer’s measurement |
| clear_peaks_cache() | Reset the measure_peaks() cache |
See Also
- Roofline API - Full reference
- Profiling Tutorial - Per-layer speed/memory/size/compute profiling
- Compute metrics - Underlying FLOPs counting