A1. Cache Misses & Tiling

Deep Dive: The Physical Constraints of L1/L2 Memory

Read the full raw markdown report on GitHub ↗

The Memory Wall

In modern CPUs, mathematical operations (ALU) are incredibly fast, while fetching data from RAM is painfully slow. To bridge this gap, CPUs use a hierarchy of caches (L1, L2, L3). L1 cache is the fastest but also the smallest (typically 32KB to 64KB per core). If the data required by the ALU is not in the L1 cache, the CPU stalls and waits for it to be fetched from L2, L3, or RAM. This is called a Cache Miss.

The Naive O(N^3) matrix multiplication algorithm is notorious for cache misses. While traversing the first matrix row by row (cache-friendly), it traverses the second matrix column by column (cache-hostile). Because memory is physically laid out in row-major order, every single element access in the second matrix causes a cache miss, resulting in billions of wasted CPU cycles.

Loop Tiling (Block Matrix Multiplication)

To solve this, I implemented Loop Tiling. Instead of multiplying entire rows and columns, I break the matrix down into smaller sub-matrices (tiles or blocks). For the C Engine, I used a block size of 32x32 floats. Why 32? A 32x32 block of 32-bit floats perfectly occupies a segment of the L1 Cache. Then load a block into the L1 cache, perform all possible mathematical operations on it, and then write the block back. This maximizes Data Reuse.

The result? At N=1000, L1 Cache Misses drop from a staggering 997,679,203 misses in the Naive implementation, down to just 3,560,762 in the Tiled implementation. A near 300x improvement purely from changing the order in which memory is read.