A7. The Roofline Model
Deep Dive: Hardware Ceilings and Arithmetic Intensity
What is a Roofline Model?
The Roofline Model is a visual performance model that bounds the throughput a compute kernel can reach on specific hardware. It combines floating-point performance (GFLOPS), memory bandwidth (GB/s), and arithmetic intensity (FLOPs/byte) in a single two-dimensional plot, conventionally drawn with logarithmic axes.
The "Roof" is composed of two lines that intersect at the "Ridge Point":
- The Memory Ceiling: The slanted line on the left. If a program's arithmetic intensity is low, it is bound by how fast the hardware can move data from RAM to the CPU (Memory Bandwidth).
- The Compute Ceiling: The flat line on the right. If a program does a lot of math for every byte of data it loads, it becomes bound by the maximum theoretical calculation speed of the CPU (Peak Compute).
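The two ceilings above reduce to a single formula: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. Below is a minimal sketch of that bound. The function name `attainable_gflops` and the machine numbers (100 GFLOPS, 10 GB/s) are illustrative assumptions, not measurements from this article.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute and memory ceilings.

    arithmetic_intensity is in FLOPs/byte, bandwidth_gbs in GB/s,
    so their product is in GFLOPS.
    """
    memory_ceiling = arithmetic_intensity * bandwidth_gbs
    return min(peak_gflops, memory_ceiling)


# Hypothetical machine, chosen only to keep the arithmetic easy to follow.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 10.0

for ai in (0.5, 10.0, 40.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    regime = "memory-bound" if ai * BANDWIDTH_GBS < PEAK_GFLOPS else "compute-bound"
    print(f"AI = {ai:5.1f} FLOPs/byte -> at most {bound:6.1f} GFLOPS ({regime})")
```

On this hypothetical machine the two ceilings meet at 10 FLOPs/byte: kernels to the left of that point sit under the slanted line, kernels to the right sit under the flat one.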
Analyzing the Engines
A Roofline Plot visualizes the theoretical limits of your hardware. Every kernel is bounded by two ceilings: Memory Bandwidth (how fast you can feed the CPU) and Compute (how fast the CPU can crunch numbers).
Hardware Specs
- CPU Processor: Intel Core i7-13650HX
- Physical Memory: 24 GB DDR5-4800 (4800 MT/s)
- Theoretical Max Compute: ~816 GFLOPS
- Theoretical Max Bandwidth: 76.8 GB/s
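As a sanity check on these specs, the bandwidth figure can be reproduced from the memory speed, and the Ridge Point follows from dividing peak compute by bandwidth. This is a rough sketch: the dual-channel configuration and the 8 bytes per transfer per channel are assumptions (they are what makes 4800 MT/s come out to 76.8 GB/s), and the ~816 GFLOPS figure is taken as given rather than derived.

```python
# DDR5-4800 performs 4800 mega-transfers per second.
TRANSFERS_PER_SEC = 4800e6
BYTES_PER_TRANSFER = 8   # assumption: 64-bit data path per channel
CHANNELS = 2             # assumption: dual-channel configuration

bandwidth_gbs = TRANSFERS_PER_SEC * BYTES_PER_TRANSFER * CHANNELS / 1e9

PEAK_GFLOPS = 816.0      # theoretical max compute from the spec list

# The Ridge Point is the arithmetic intensity where the two ceilings meet.
ridge_point = PEAK_GFLOPS / bandwidth_gbs

print(f"Theoretical bandwidth: {bandwidth_gbs:.1f} GB/s")
print(f"Ridge point: {ridge_point:.3f} FLOPs/byte")
```

Under these assumptions a kernel needs roughly 10.6 FLOPs per byte of RAM traffic before the compute ceiling, rather than memory bandwidth, becomes the limit.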
The X-axis is Arithmetic Intensity: how many math operations (FLOPs) you perform per byte of data moved to or from RAM. The Y-axis is attainable performance in GFLOPS.
Notice the Naive implementation: despite the large matrix sizes, it is trapped under the slanted Memory Ceiling. Because it reads memory inefficiently and reuses little of what it loads, its effective Arithmetic Intensity collapses. Meanwhile, PyTorch (using Intel MKL) and my custom SIMD AVX2 implementations move past the "Ridge Point" and ride close to the flat Compute Ceiling, which indicates they are compute-bound rather than limited by memory traffic.
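To see why effective arithmetic intensity can collapse, it helps to compare two extremes for an N x N matrix multiply. The sketch below assumes 4-byte float32 elements and uses two deliberately simplified traffic models: an ideal one where each matrix crosses the memory bus exactly once, and a worst case where every multiply-add reloads both operands from RAM. Real implementations fall somewhere in between, and the function names are hypothetical.

```python
def ideal_matmul_intensity(n, bytes_per_element=4):
    """Best case: A and B are read once and C is written once."""
    flops = 2 * n**3                          # one multiply + one add per inner step
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved


def no_reuse_intensity(bytes_per_element=4):
    """Worst case: each multiply-add loads one element of A and one of B from RAM."""
    flops_per_step = 2
    bytes_per_step = 2 * bytes_per_element
    return flops_per_step / bytes_per_step


PEAK_GFLOPS = 816.0    # from the hardware specs
BANDWIDTH_GBS = 76.8   # from the hardware specs

for label, ai in (("ideal reuse, N=1024", ideal_matmul_intensity(1024)),
                  ("no reuse", no_reuse_intensity())):
    bound = min(PEAK_GFLOPS, ai * BANDWIDTH_GBS)
    print(f"{label:>20}: AI = {ai:8.2f} FLOPs/byte, bound = {bound:6.1f} GFLOPS")
```

With perfect reuse the intensity grows with N and lands far to the right of the Ridge Point, while the no-reuse model stays at 0.25 FLOPs/byte regardless of matrix size, which caps it at about 19 GFLOPS on this hardware. Blocking and SIMD-friendly layouts are ways of moving a kernel from the second regime toward the first.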
[Figure: cpp-vs-torch]