C++ vs PyTorch

17s→0.2s

REPORT BY : MYOK

A Systems Engineering Journey

A deep-dive into matrix multiplication, cache hierarchies, SIMD vectorization, and the limits of solo engineering against 30 years of BLAS optimization.

TEST RIG: Intel Core i7-13650HX (20 Threads) | 24GiB DDR5 RAM | Linux

01. The Naive Baseline

I started this project by implementing the most obvious triple for loop matrix multiplication in C. I wanted to see just how slow unoptimized code really is compared to industry titans like PyTorch and NumPy.

73x

At N=2000: Naive C took 17 seconds. PyTorch took 0.23 seconds.

The moment I realized I had no idea what I was doing.

The primary reason for this massive delay is not just the sheer number of math operations, it is how the CPU interacts with memory. When the naive nested loop iterates over the columns of the second matrix, the CPU is constantly jumping around in physical memory, destroying the cache pre-fetcher.

Read the full report on GitHub: 01. The Naive Baseline ↗

System Dive: The baseline numbers hide a wealth of microarchitectural details. How much energy does it take to run this unoptimized code? Do modern CPUs successfully predict the branches of my loops?

Explore A3. Power Analysis →Explore A4. Branch Predictor →

02. Tiling & Cache Optimization

Before SIMD, before multithreading — the single biggest fix was reading memory in the right order. By splitting the matrix into 32x32 blocks that perfectly fit inside the ultra-fast L1 cache, performance skyrocketed. This prevents the CPU from thrashing the cache by predicting exactly which rows and columns will be loaded next.

-92%

Drop in LLC (Last Level Cache) misses.

System Dive A1: Hooking into the Linux perf_event_open syscall to mathematically prove how column-major traversal wastes 93.75% of cache line bandwidth.

Explore A1. Cache Misses →

03. SIMD Vectorization

Once memory access was optimized, the math itself became the bottleneck. Modern Intel/AMD processors possess 256-bit wide vector registers (AVX2). Normally, a processor computes one floating-point multiplication per clock cycle. But by manually writing assembly-level C intrinsics (_mm256_fmadd_ps), I forced the processor to compute 8 multiplications and 8 additions simultaneously in a single cycle. This is called Single Instruction Multiple Data (SIMD).

By vectorizing the inner loop, the sheer number of instructions required to multiply a 1000x1000 matrix collapsed entirely.

Metric:

Fewer instructions executed compared to the baseline loop.

System Dive A5: Hand-writing assembly intrinsics is fun, but can GCC beat me if I just let it optimize the Naive code itself using -O3 and -ffast-math?

Explore A5. Compiler Flags →

04. The Arena Allocator

Despite executing the exact same SIMD instructions, the C Engine was consistently 19% faster than the C++ Engine. The culprit was memory allocation overhead via Page Faults.

Every time the C++ engine instantiated a matrix during the forward/backward pass using std::vector, the Operating System had to trap the execution and map physical RAM pages to virtual memory. This caused severe OS Jitter. To fix this, I implemented a 1GB Bump Allocator in the C Engine: memory is allocated exactly once at startup, and during training, I just move a pointer forward and backward. The result? Zero interaction with the OS.

0 vs 15,593

Page faults eliminated. Zero system calls during execution.

System Dive A2: Exposing the hidden cost of Demand Paging. Proving why std::vector (which maps virtual pages lazily and faults on write) triggers thousands of minor page faults compared to a pre-faulted Arena.

Explore A2. OS Allocators →

05. Beating PyTorch (Multithreading)

After weeks of optimization, I was still losing to PyTorch. One final change closed the gap: Multithreading via OpenMP. By dividing the output matrix rows among available CPU cores, the computation scaled linearly.

But not everything scales infinitely. Look at the chart below: at 7 threads, the speedup abruptly drops. This happens because modern Intel CPUs use a Hybrid Architecture. Once all 6 Performance-Cores (P-Cores) were saturated, the OS scheduler pushed the 7th thread onto a slow Efficiency-Core (E-Core), dragging down the entire thread pool due to Amdahl's Law.

Active Threads: 20

P-Cores: 6/6

E-Cores: 8/8

SMT (HyperT): 6/6

84.6x

Total Speedup. C++ OpenMP (0.20s) finally beat PyTorch (0.23s).

Read the full report on GitHub: 02. Beating PyTorch ↗

System Dive A6: Explore Amdahl's Law and exactly how OpenMP thread synchronization locks up when dealing with heterogenous architectures (P-Cores vs E-Cores).

Explore A6. Thread Scaling →

View Raw Data (JSON)

{
  "labels": [
    10,
    20,
    30,
    40,
    50,
    60,
    70,
    80,
    90,
    100,
    200,
    300,
    400,
    500,
    600,
    700,
    800,
    900,
    1000,
    2000
  ],
  "datasets": [
    {
      "label": "Naive C",
      "data": [
        9.49111e-7,
        0.00000630829,
        0.0000167138,
        0.0000389817,
        0.0000764757,
        0.000134441,
        0.000234274,
        0.000361164,
        0.000528895,
        0.000734619,
        0.00643984,
        0.0222186,
        0.0535749,
        0.104916,
        0.183185,
        0.291707,
        0.496373,
        0.676366,
        0.934128,
        9.73528
      ]
    },
    {
      "label": "Naive C++",
      "data": [
        5.07396e-7,
        0.00000315959,
        0.00000997538,
        0.0000224736,
        0.0000461553,
        0.0000875925,
        0.000146793,
        0.000236967,
        0.000366967,
        0.000498606,
        0.00566939,
        0.0205328,
        0.0503611,
        0.100307,
        0.175949,
        0.284257,
        0.429051,
        0.607257,
        0.835162,
        8.48415
      ]
    },
    {
      "label": "Tiled C",
      "data": [
        6.95104e-7,
        0.0000046188,
        0.0000163932,
        0.0000366509,
        0.0000677266,
        0.000118622,
        0.000196206,
        0.000291588,
        0.000406131,
        0.000571923,
        0.00456646,
        0.0152637,
        0.0362421,
        0.0708004,
        0.122488,
        0.194633,
        0.290745,
        0.415413,
        0.569557,
        4.59074
      ]
    },
    {
      "label": "Tiled C++",
      "data": [
        4.82495e-7,
        0.00000280609,
        0.00000855642,
        0.0000204052,
        0.0000365038,
        0.0000644459,
        0.000108275,
        0.000161376,
        0.000217748,
        0.000306553,
        0.00249031,
        0.00825655,
        0.0194673,
        0.0378638,
        0.0655746,
        0.104285,
        0.156286,
        0.222535,
        0.305381,
        2.47733
      ]
    },
    {
      "label": "SIMD AVX2 C",
      "data": [
        7.11484e-7,
        0.00000149501,
        0.00000417769,
        0.00000640019,
        0.0000116194,
        0.0000220081,
        0.0000395222,
        0.0000448464,
        0.0000628039,
        0.0000862582,
        0.000732444,
        0.00240997,
        0.00620991,
        0.0114016,
        0.0215808,
        0.0327726,
        0.0509093,
        0.0682166,
        0.0979069,
        0.874592
      ]
    },
    {
      "label": "PyTorch (MKL)",
      "data": [
        0.0000019550323486328123,
        0.0000016450881958007813,
        0.000002789497375488281,
        0.0000027418136596679688,
        0.000005197525024414063,
        0.0000049591064453125,
        0.000008606910705566407,
        0.00001010894775390625,
        0.00001778602600097656,
        0.000017404556274414062,
        0.00012755393981933594,
        0.0004117012023925781,
        0.0009807586669921876,
        0.0018505811691284179,
        0.0031421422958374024,
        0.004951834678649902,
        0.007352280616760254,
        0.010358905792236328,
        0.014184975624084472,
        0.11530628204345703
      ]
    }
  ]
}

SCROLL TO EXPLORE

06. The BLAS Wall

The unifying theory. This single chart explains every result in this project. The Naive algorithm never becomes compute-bound. At any matrix size, it is bottlenecked by Memory Bandwidth. PyTorch's MKL hits the Compute Ceiling at 606 GFLOPS. The difference isn't just optimization — it's a different mathematical regime.

Ceiling:

Custom SIMD gets you 95% of the way. The last 5% requires dedicating your life to assembly micro-kernels and GotoBLAS memory packing. I chose to write this site instead.

Read the full report on GitHub: 06. The BLAS Wall / SIMD Limits ↗

System Dive A7: Learn how to calculate Arithmetic Intensity from scratch, and map floating point operations against DDR5 memory bandwidth to prove MKL achieved ~74% of the absolute physical limit of the silicon.

Explore A7. The Roofline →

07Diagnosing Erratic Execution Times

During single-threaded execution sweeps, a bizarre anomaly occurred. Even though the mathematical workload was identical, execution time would occasionally jump from 1 second to 20 seconds, representing a massive 20x latency spike.

In systems programming, severe single-threaded jitter like this points to three OS/Hardware suspects: Thermal Throttling firmware logic, the OS Scheduler aggressively migrating the thread to an E-Core to save power, or Subnormal Floating Point Values causing the FPU to trap into software microcode interrupts. Check out the full investigation report below to see the experimental pipeline used to diagnose it.

Explore 05. Diagnosing Erratic Times →

08The End-to-End Test

Microbenchmarks are useful, but they don't mean anything unless the system actually trains a neural network. To prove the engine works, I implemented a full Stochastic Gradient Descent (SGD) loop for a Multi-Layer Perceptron (MLP) on the MNIST dataset as well as a synthetic dataset.

The neural network achieved ~90% test accuracy across all backends, proving that the aggressive SIMD math, custom bumped memory allocations, and loop unrolling were completely numerically stable and mathematically correct.

The C engine successfully trains the dataset in a fraction of a second per epoch. While PyTorch's highly optimized MKL backend retains the crown on the real-world MNIST dataset, my bespoke C engine actually beats PyTorch on the synthetic benchmark dataset. I didn't just compete with PyTorch in an isolated matrix multiplication; I built an end-to-end framework capable of training models from scratch.

Read Report 03: End-to-End ML Training ↗Read Report 04: Scaling to MNIST ↗