A4. The Branch Predictor
Deep Dive: Speculative Execution and Pipeline Flushes
The Cost of "If"
In modern CPUs, instructions are not executed one at a time. They are processed in a long "pipeline" (fetch, decode, execute, write-back). This means the CPU is working on dozens of instructions simultaneously. However, when the CPU encounters an `if` statement (a conditional branch), it doesn't know which path to fetch next until the `if` condition is actually evaluated.
Instead of stalling the pipeline and waiting, the CPU guesses. This is the Branch Predictor. If it guesses correctly, execution continues at full speed. If it guesses wrong, it suffers a Branch Misprediction Penalty: the entire pipeline must be flushed, discarding all the speculatively executed work, and restarting from the correct path. This wastes 10 to 20 clock cycles.
Branching in Matrix Multiplication
Matrix multiplication is largely composed of tight `for` loops. The condition of a `for` loop (`i < N`) is essentially a branch. For a loop that runs 1,000 times, the branch predictor will guess "loop again" 999 times (correctly), and will guess "loop again" on the 1000th time (incorrectly). Thus, tight, predictable loops have near-perfect branch prediction rates.
During profiling, I observed that branch mispredictions were negligible for the Naive and Tiled implementations (often less than 0.1% miss rate). The bottlenecks were overwhelmingly Memory Bandwidth and L1 Cache Misses, not Branch Prediction failures. Avoiding branching inside the innermost loop is critical, but a standard matrix multiplication naturally avoids this.
Why does PyTorch have a massive 10% miss rate at N=32? It's an illusion of scale. At small matrix sizes, the actual math takes microseconds, meaning the majority of the instructions the CPU executes are just PyTorch's Python interpreter overhead: dynamic dispatching, type checking, and tensor metadata validation. These generic framework instructions involve thousands of unpredictable `if/else` checks, leading to a high miss rate. As the matrix grows to N=2048, the tight C++ MKL math loop completely dominates the execution time, and PyTorch's miss rate converges back down to ~0.01%.
cpp-vs-torch