Roofline analysis

Measuring arithmetic intensity vs hardware peaks with RooflineAnalyzer

This notebook demonstrates measurement primitives. For compression decisions based on roofline data, see fasterrecipes.

What is a roofline?

The roofline model (Williams et al., 2009) plots a layer’s achieved performance against its arithmetic intensity:

Arithmetic intensity (AI) = FLOPs per byte moved from memory. A property of the computation itself.
Achieved performance = FLOPs per second actually delivered on the device.
The roof is the minimum of two ceilings: a sloped line AI x peak_bandwidth (memory-bound region) and a flat line at peak_flops (compute-bound region).
The ridge point peak_flops / peak_bandwidth is the AI at which the two ceilings meet. Layers with AI < ridge are memory-bound; layers with AI >= ridge are compute-bound.
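The definitions above reduce to a few lines of arithmetic. The sketch below is only an illustration: attainable_flops and classify are hypothetical helper names, not part of fasterbench, and the peak values are made-up round numbers rather than measurements.

```python
def attainable_flops(ai, peak_flops, peak_bandwidth):
    """Roof height at arithmetic intensity `ai` (FLOPs/byte), in FLOPs/s."""
    return min(peak_flops, ai * peak_bandwidth)


def classify(ai, peak_flops, peak_bandwidth):
    """Return 'memory' below the ridge point and 'compute' at or above it."""
    ridge = peak_flops / peak_bandwidth
    return "memory" if ai < ridge else "compute"


# Illustrative peaks, not measurements: 10 TFLOP/s and 500 GB/s.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 500e9
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # 20 FLOPs/byte

for ai in (2.0, 20.0, 200.0):
    roof = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    bound = classify(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"AI={ai:6.1f}  roof={roof / 1e12:5.2f} TFLOP/s  {bound}-bound")
```

At AI = 2 the roof is bandwidth-limited (1 TFLOP/s with these numbers); at and beyond the ridge it stays flat at the compute peak.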

On a log-log plot, the roof looks like a tilted ceiling with a flat top. Each layer becomes a marker underneath that ceiling.

Measuring hardware peaks

measure_peaks() empirically probes the device with a large square matmul (for peak FLOPs/s) and a cache-defeating memory copy (for streaming bandwidth). It returns a HardwarePeaks dataclass.

By default, TF32 is pinned off on CUDA so the fp32 peak reflects honest fp32 throughput. Pass allow_tf32=True if you want the TF32 peak instead.

import torch
from fasterbench.roofline import measure_peaks

peaks = measure_peaks("cuda", steps=20, warmup=5)
print(peaks)

HardwarePeaks(peak_flops=8.12e+12, peak_bandwidth=5.43e+11, ridge_point=14.95, device='cuda:0', dtype='torch.float32', tf32_enabled=False, cudnn_benchmark=False)

The ridge point here is ~15 FLOPs/byte. Any layer below that intensity is memory-bound on this device.
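To make the probing idea concrete, here is a minimal sketch of a streaming-bandwidth probe with the same warmup/steps structure. It is not the measure_peaks() implementation: measure_copy_bandwidth is a hypothetical name, it uses only the standard library, and it times a host-memory copy rather than a GPU one. On a CUDA device the timed region would also need a synchronization before the clock is read, because kernel launches are asynchronous.

```python
import time


def measure_copy_bandwidth(nbytes=32 * 1024 * 1024, steps=5, warmup=2):
    """Estimate host streaming bandwidth (bytes/s) from a large buffer copy."""
    src = bytearray(nbytes)  # chosen to be larger than typical CPU caches
    dst = bytearray(nbytes)

    for _ in range(warmup):  # untimed runs so both buffers are paged in
        dst[:] = src

    start = time.perf_counter()
    for _ in range(steps):
        dst[:] = src
    elapsed = time.perf_counter() - start

    bytes_moved = 2 * nbytes * steps  # each copy reads src and writes dst
    return bytes_moved / elapsed


bandwidth = measure_copy_bandwidth()
print(f"~{bandwidth / 1e9:.1f} GB/s host copy bandwidth")
```

Counting both the read and the write is a convention choice; a probe that counts only one direction would report half the figure for the same hardware.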

Profiling ResNet-18

RooflineAnalyzer profiles a model in a single pass: forward hooks on every leaf module record FLOPs (computed analytically for Conv and Linear layers), bytes moved (weights + input + output, following Williams et al., 2009), and wall-clock time.

If you do not pass a peaks= argument, it calls measure_peaks() automatically.

from torchvision.models import resnet18
from fasterbench.roofline import RooflineAnalyzer

model = resnet18()
sample = torch.randn(1, 3, 224, 224)

ra = RooflineAnalyzer(model, sample)
ra.profile(device="cuda", warmup=5, steps=20)
ra.summary(top=10)

=== Roofline =============================================================
  name                             type           FLOPs      bytes       AI   GFLOPs/s     bound   util%
  layer4.0.conv2                   Conv2d         231.21M     10.01M    23.10      820.14   compute   10.1%
  layer4.1.conv1                   Conv2d         231.21M     10.01M    23.10      810.22   compute   10.0%
  layer4.1.conv2                   Conv2d         231.21M     10.01M    23.10      812.49   compute   10.0%
  layer3.0.conv2                   Conv2d         115.61M      5.11M    22.62      402.33   compute    5.0%
  ...

Each row shows a layer’s FLOPs, bytes moved, arithmetic intensity, achieved throughput, bound classification, and utilization (fraction of the roof reached).
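Two of these columns can be checked by hand. The sketch below assumes 2 FLOPs per multiply-accumulate and no bias term, which happens to reproduce the FLOPs value shown for layer4.0.conv2; conv2d_flops is a hypothetical helper, not a fasterbench function, and the library's own counting rules may differ in other cases. The utilization check uses the peaks and the table row printed above.

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    """FLOPs for one Conv2d forward pass at batch size 1 (2 per MAC, no bias)."""
    macs = c_in * c_out * kernel * kernel * h_out * w_out
    return 2 * macs


# layer4.0.conv2 in torchvision's ResNet-18: 512 -> 512 channels, 3x3 kernel,
# 7x7 output map for a 224x224 input.
flops = conv2d_flops(512, 512, 3, 7, 7)
print(f"{flops / 1e6:.2f}M FLOPs")  # 231.21M, matching the table

# Utilization = achieved throughput / roof height at the layer's AI.
peak_flops, peak_bandwidth = 8.12e12, 5.43e11  # from the HardwarePeaks output
ai, achieved = 23.10, 820.14e9                 # from the layer4.0.conv2 row
roof = min(peak_flops, ai * peak_bandwidth)    # AI is past the ridge, so roof = peak_flops
util = achieved / roof
print(f"util = {util:.1%}")  # 10.1%
```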

Reading the plot

ra.plot() returns a plotly Figure with the roof line, the ridge point, and one marker per layer. Memory-bound layers are colored teal; compute-bound layers are a darker teal.

fig = ra.plot(title="ResNet-18 roofline (CUDA)")
fig.show()

How to read the plot:

The diagonal segment (slope 1 on log-log) is the memory bandwidth ceiling.
The flat segment is the compute ceiling.
A marker near the roof indicates a layer achieving a high fraction of what the hardware permits at its intensity.
A marker far below the roof indicates a layer leaving available performance unused.
A marker to the left of the ridge point sits in the memory-bound region; one to the right sits in the compute-bound region.

Comparing input resolutions

Arithmetic intensity depends on the tensor shapes as well as on the operation. Increasing spatial resolution increases the activation traffic of every layer, so the balance between memory-bound and compute-bound layers can shift.

for side in (224, 512):
    x = torch.randn(1, 3, side, side)
    ra = RooflineAnalyzer(model, x, peaks=peaks)
    ra.profile(device="cuda", warmup=3, steps=10)
    mem_bound = sum(1 for r in ra.results if r.bound == "memory")
    comp_bound = sum(1 for r in ra.results if r.bound == "compute")
    print(f"{side}x{side}: {mem_bound} memory-bound, {comp_bound} compute-bound")

224x224: 18 memory-bound, 42 compute-bound
512x512: 31 memory-bound, 29 compute-bound

At 512x512 many more layers fall below the ridge point in this run. For a fixed kernel, activation bytes and FLOPs both scale with H x W, so they grow at the same rate and only the constant factors differ; BN, ReLU, and pooling layers, which have very low AI, dominate the memory traffic when activations are large.

Summary

Tool	Purpose
`measure_peaks()`	Empirically probe peak FLOPs/s and streaming bandwidth
`HardwarePeaks`	Dataclass holding device peaks and ridge point
`RooflineAnalyzer`	Per-layer roofline profiler
`RooflineAnalyzer.profile()`	Measure FLOPs, bytes moved, and time per layer
`RooflineAnalyzer.summary()`	Print a table of the slowest layers with their roofline metrics
`RooflineAnalyzer.plot()`	Plotly figure with roof ceiling and per-layer markers
`RooflinePoint`	Dataclass for a single layer’s measurement
`clear_peaks_cache()`	Reset the `measure_peaks()` cache