A6. Thread Scaling

Deep Dive: Amdahl's Law and Hybrid CPU Architectures


Amdahl's Law in Practice

Amdahl's Law defines the maximum theoretical speedup a program can achieve through parallelization. If 5% of a program's runtime is strictly serial (e.g., memory allocation, setup, teardown), the speedup can never exceed 1 / 0.05 = 20x, even with 1,000,000 cores.

In matrix multiplication, the inner loops are highly parallelizable. I used OpenMP (`#pragma omp parallel for`) to divide the output matrix rows equally among the available threads. This lets performance scale almost linearly, but only up to a point.
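The row-wise split can be sketched roughly as follows. This is not the benchmark's actual kernel: the function name `matmul`, the row-major `std::vector<double>` layout, the loop order, and the explicit `schedule(static)` clause are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Multiply two n x n row-major matrices and return c = a * b.
// Hypothetical sketch; layout and names are assumptions.
std::vector<double> matmul(const std::vector<double>& a,
                           const std::vector<double>& b,
                           int n) {
    const std::size_t size = static_cast<std::size_t>(n);
    std::vector<double> c(size * size, 0.0);

    // schedule(static) gives each thread one contiguous block of rows of
    // roughly equal size. Each thread writes only its own rows of c, so no
    // locking is needed. Threads wait at an implicit barrier when the loop ends.
    // If OpenMP is not enabled at compile time, the pragma is ignored and
    // the loop simply runs on one thread.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * size;
        for (std::size_t k = 0; k < size; ++k) {
            const double aik = a[row + k];
            for (std::size_t j = 0; j < size; ++j) {
                c[row + j] += aik * b[k * size + j];
            }
        }
    }
    return c;
}
```

Compiling with OpenMP enabled (for example `-fopenmp` on GCC or Clang) turns on the parallel loop; the result is the same either way, only the runtime changes.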

The Hybrid Architecture Cliff

Modern Intel CPUs (like the Core i7-12650H used in this benchmark) use a Hybrid Architecture consisting of Performance Cores (P-Cores) and Efficiency Cores (E-Cores). P-Cores run at high clock speeds and have large caches. E-Cores run at lower speeds and are optimized for background tasks.

Active threads: 20 (P-Cores: 6/6, E-Cores: 8/8, SMT/Hyper-Threading: 6/6)

Notice the sharp drop in the C/C++ Engine speedup curves exactly at Thread Count = 7. The CPU has 6 P-Cores. The first 6 threads are scheduled onto the fast P-Cores, so scaling is close to linear. The 7th thread is pushed onto a slower E-Core. Because every thread receives an equal share of the rows and OpenMP places an implicit barrier at the end of the `parallel for` loop, the 6 P-Cores finish early and sit idle while the single E-Core works through its chunk. The loop takes as long as its slowest thread, so adding the 7th thread makes the run slower rather than faster.