A6. Thread Scaling
Deep Dive: Amdahl's Law and Hybrid CPU Architectures
Amdahl's Law in Practice
Amdahl's Law defines the maximum theoretical speedup a program can achieve via parallelization: the serial portion of the runtime does not shrink no matter how many cores are added. If 5% of a program is strictly serial (e.g., memory allocation, setup, teardown), the maximum possible speedup is 1 / 0.05 = 20x, even if you have 1,000,000 cores.
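To make the 20x ceiling concrete, here is a minimal sketch of the formula, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n is the number of cores. The function name `amdahl_speedup` is just an illustrative choice, not something from the benchmark code.

```cpp
#include <cassert>

// Amdahl's Law: theoretical speedup on `n` cores when a fraction `serial`
// of the single-core runtime cannot be parallelized.
// (Hypothetical helper for illustration; not part of the benchmark.)
double amdahl_speedup(double serial, double n) {
    assert(serial >= 0.0 && serial <= 1.0);
    assert(n >= 1.0);
    // The serial part stays fixed; only the parallel part is divided by n.
    return 1.0 / (serial + (1.0 - serial) / n);
}
```

With `serial = 0.05`, calling this with `n = 6` gives roughly 4.8x, and with `n = 1000000` it gets close to, but never reaches, 20x.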
In matrix multiplication, the inner loops are highly parallelizable: each output row can be computed independently of the others. I used OpenMP (`#pragma omp parallel for`) to divide the output matrix rows equally among available threads. This allows me to scale performance almost linearly, up to a point.
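The row-wise split can be sketched roughly as follows. This is not the benchmark's actual kernel: the function name `matmul_rows_parallel`, the flat row-major `std::vector<double>` layout, and the explicit `schedule(static)` clause are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for square n x n matrices stored row-major in flat vectors.
// (Illustrative sketch; layout and names are assumptions.)
void matmul_rows_parallel(const std::vector<double>& A,
                          const std::vector<double>& B,
                          std::vector<double>& C,
                          int n) {
    const std::size_t size = static_cast<std::size_t>(n) * static_cast<std::size_t>(n);
    assert(A.size() == size && B.size() == size && C.size() == size);

    // Each thread receives a contiguous, equally sized block of output rows.
    // Rows are independent, so no locking is needed inside the loop.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += A[static_cast<std::size_t>(i) * n + k] *
                       B[static_cast<std::size_t>(k) * n + j];
            }
            C[static_cast<std::size_t>(i) * n + j] = sum;
        }
    }
    // Implicit barrier here: the function returns only after every thread
    // has finished its block of rows.
}
```

Compiled with OpenMP enabled (for example `-fopenmp` on GCC or Clang), the outer loop runs in parallel. Without that flag the pragma is ignored and the same code runs serially with identical results.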
The Hybrid Architecture Cliff
Modern Intel CPUs (like the Core i7-12650H used in this benchmark) use a Hybrid Architecture consisting of Performance Cores (P-Cores) and Efficiency Cores (E-Cores). P-Cores run at high clock speeds and have large caches. E-Cores run at lower speeds and are optimized for background tasks.
Notice the sharp drop in the C/C++ Engine speedup curves exactly at Thread Count = 7. The CPU has 6 P-Cores. The first 6 threads are scheduled onto the fast P-Cores, resulting in near-linear scaling. The 7th thread is pushed onto a slower E-Core. Because the rows are split equally and OpenMP synchronizes all threads at the end of the loop, the 6 P-Cores finish early and sit idle waiting for the 1 E-Core to finish its chunk of math. The loop's runtime is then set by the slowest thread, so adding the 7th thread lowers the speedup instead of raising it.
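A toy model shows why one slow thread is enough to cause the drop. It assumes the work is split into equal chunks and that the loop ends only when the slowest thread finishes. The function name `static_schedule_speedup` and the relative E-Core speed of 0.5 used in the note afterwards are hypothetical values for illustration, not measurements from this benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Speedup of an equally partitioned loop, relative to one core of speed 1.0.
// `speeds[i]` is the relative throughput of the core running thread i.
// (Toy model; ignores caches, memory bandwidth, and scheduler effects.)
double static_schedule_speedup(const std::vector<double>& speeds) {
    assert(!speeds.empty());
    const double slowest = *std::min_element(speeds.begin(), speeds.end());
    assert(slowest > 0.0);
    // Each of N threads gets 1/N of the work, and the end-of-loop barrier
    // waits for the slowest one: time = (1/N) / slowest, speedup = N * slowest.
    return static_cast<double>(speeds.size()) * slowest;
}
```

Under this model, six threads at speed 1.0 give a speedup of 6.0, while adding a seventh thread at an assumed speed of 0.5 gives 7 * 0.5 = 3.5. The real drop depends on the actual P-Core to E-Core performance ratio for this workload.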
[Figure: cpp-vs-torch]