A3. Power Analysis
System Dive: Energy Efficiency and the "Race to Sleep"
The Counter-Intuitive Nature of Power
A common assumption in systems programming is that highly optimized, parallelized code burns more energy because it saturates the CPU cores. The instantaneous power draw (watts) of the SIMD and PyTorch implementations is indeed much higher than that of the Naive loop, but the total energy consumed (joules, i.e. power multiplied by time) tells a completely different story.
This phenomenon is known as the "Race to Sleep." Modern CPUs are designed to finish work as quickly as they can so they can return to a low-power idle state (C-states) sooner. A fast implementation draws more power, but only briefly; the Naive loop keeps the CPU out of its idle states for a full 17 seconds.
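The arithmetic behind this can be sketched in a few lines. The wattages, the 0.5 s runtime, and the idle power below are hypothetical values chosen only to illustrate the effect; the 17 s Naive runtime is the only figure taken from the text above.

```python
def energy_joules(avg_power_watts: float, duration_s: float) -> float:
    """Energy = average power x time (1 W sustained for 1 s = 1 J)."""
    return avg_power_watts * duration_s


# HYPOTHETICAL figures for illustration only; they are not measurements
# from this benchmark. Only the 17 s Naive runtime comes from the text.
WINDOW_S = 17.0          # Naive loop runtime (from the text)
NAIVE_POWER_W = 15.0     # assumed: modest draw, one busy core
FAST_POWER_W = 45.0      # assumed: 3x the draw while running
FAST_RUNTIME_S = 0.5     # assumed: finishes early, then sleeps
IDLE_POWER_W = 2.0       # assumed: package power in a deep C-state

naive = energy_joules(NAIVE_POWER_W, WINDOW_S)

# Compare over the same 17 s window: a short burst, then idle for the rest.
fast_burst = energy_joules(FAST_POWER_W, FAST_RUNTIME_S)
fast_idle = energy_joules(IDLE_POWER_W, WINDOW_S - FAST_RUNTIME_S)
fast_total = fast_burst + fast_idle

print(f"naive: {naive:.1f} J")
print(f"fast:  {fast_total:.1f} J ({fast_burst:.1f} J burst + {fast_idle:.1f} J idle)")
```

Under these assumed numbers the fast path uses 55.5 J against 255.0 J for the slow one, even though its peak draw is three times higher.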
Dynamic Frequency Scaling (Turbo Boost)
Modern CPUs use Dynamic Frequency Scaling. If the CPU detects a heavy workload and its thermal and power headroom allow, it raises its clock speed well above the base frequency (Intel brands this "Turbo Boost"). However, AVX2 instructions draw significantly more current than scalar integer math, which generates intense localized heat on the silicon die.
Because AVX2 instructions cause rapid heating, the processor may aggressively downclock (thermal throttle) to stay within its thermal limits. This is why a highly optimized SIMD kernel might show diminishing returns on a poorly cooled system: the CPU slows itself down to cope with the very instructions designed to make it faster.
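One way to watch for this downclocking is to sample the core clock while a kernel runs. The sketch below assumes a Linux system that exposes the cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_max_freq`; the helper names `read_khz` and `freq_fraction` are hypothetical, and nothing here is taken from the benchmark's own tooling.

```python
from pathlib import Path
from typing import Optional

# Assumption: Linux cpufreq sysfs layout. Other operating systems differ.
CPU_ROOT = "/sys/devices/system/cpu"


def read_khz(cpu: int, name: str, root: str = CPU_ROOT) -> Optional[int]:
    """Read one cpufreq value in kHz, or None if it is unavailable."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def freq_fraction(cpu: int = 0, root: str = CPU_ROOT) -> Optional[float]:
    """Current clock as a fraction of the hardware maximum for one core."""
    cur = read_khz(cpu, "scaling_cur_freq", root)
    top = read_khz(cpu, "cpuinfo_max_freq", root)
    if cur is None or not top:
        return None
    return cur / top


if __name__ == "__main__":
    frac = freq_fraction(0)
    if frac is None:
        print("cpufreq information is not available on this system")
    else:
        print(f"cpu0 is running at {frac:.0%} of its maximum frequency")
```

Sampling this value periodically during a long SIMD run and seeing it fall over time would be consistent with throttling, though a low reading alone can also reflect the active power governor.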
The Power Data
The chart below compares the raw energy consumption (measured in joules via Intel RAPL) of the C Engine, C++ Engine, and PyTorch across 6 different OS power governors at matrix size N=1000.
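For readers who want to take a similar reading themselves, here is a minimal sketch of sampling a RAPL energy counter around a workload. It assumes the Linux powercap interface at `/sys/class/powercap/intel-rapl:0` (the package-0 domain) and that the counter file is readable, which often requires root. The function names are hypothetical, and this is not necessarily how the figures in this section were collected.

```python
import time
from pathlib import Path

# Assumption: Linux powercap sysfs, package-0 RAPL domain.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"


def read_uj(name: str, rapl_dir: str = RAPL_DIR) -> int:
    """Read an integer microjoule value from the RAPL sysfs directory."""
    return int((Path(rapl_dir) / name).read_text().strip())


def delta_uj(start: int, end: int, max_range: int) -> int:
    """Counter difference, allowing for a single wraparound."""
    if end >= start:
        return end - start
    return end + max_range - start


def measure_joules(fn, rapl_dir: str = RAPL_DIR):
    """Run fn() and return (joules consumed by the package, seconds elapsed)."""
    max_range = read_uj("max_energy_range_uj", rapl_dir)
    start = read_uj("energy_uj", rapl_dir)
    t0 = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - t0
    end = read_uj("energy_uj", rapl_dir)
    return delta_uj(start, end, max_range) / 1e6, elapsed


if __name__ == "__main__":
    try:
        joules, seconds = measure_joules(lambda: sum(i * i for i in range(10**6)))
        print(f"{joules:.3f} J over {seconds:.3f} s")
    except OSError as exc:
        print(f"RAPL counter not readable here: {exc}")
```

The counter covers the whole CPU package, so background activity is included in the reading; repeating a measurement and comparing against an idle baseline helps separate the workload's share.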
You can also use the interactive dropdown on the Main Page's 'Execution Time' chart to switch the displayed benchmark execution times between the 6 different power modes. Observe how the gap between the Naive implementation and PyTorch narrows or widens depending on how aggressively the CPU is allowed to boost.
[Chart: cpp-vs-torch]