A8. Model Training

System Dive: End-to-End Stochastic Gradient Descent

Beyond Microbenchmarks

A matrix multiplication microbenchmark demonstrates raw compute capability, but only an end-to-end training loop shows whether the algorithm stays numerically stable over many iterations. To validate the C/C++ engine architectures, I implemented a full Multi-Layer Perceptron (MLP) trained with a Stochastic Gradient Descent (SGD) optimizer. The goal was to ensure that aggressive loop unrolling, AVX2 SIMD intrinsics, and bump-arena memory allocation did not introduce subtle floating-point errors during the backward pass.
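As an illustration of what a single training step involves, here is a minimal scalar sketch of one SGD update for a single dense layer. It is not the engine's actual code: the function names `dense_forward` and `dense_sgd_step`, the row-major weight layout, and the squared-error loss are all assumptions chosen to keep the example short. A plain reference version like this is the kind of baseline an unrolled or AVX2 variant can be checked against.

```c
#include <assert.h>
#include <stddef.h>

/* Forward pass of one dense layer: y = W x + b.
 * W is assumed row-major with shape (out x in). */
void dense_forward(const float *W, const float *b, const float *x,
                   float *y, size_t in, size_t out)
{
    for (size_t i = 0; i < out; ++i) {
        float acc = b[i];
        for (size_t j = 0; j < in; ++j)
            acc += W[i * in + j] * x[j];
        y[i] = acc;
    }
}

/* One SGD step on a single sample for the (assumed) loss
 *   L = 0.5 * sum_i (y_i - t_i)^2
 * Gradients: dL/dy_i = y_i - t_i, dL/dW_ij = dL/dy_i * x_j, dL/db_i = dL/dy_i.
 * Returns the loss measured before the update. */
float dense_sgd_step(float *W, float *b, const float *x, const float *t,
                     float *y, size_t in, size_t out, float lr)
{
    assert(W && b && x && t && y);
    dense_forward(W, b, x, y, in, out);

    float loss = 0.0f;
    for (size_t i = 0; i < out; ++i) {
        float dy = y[i] - t[i];           /* dL/dy_i */
        loss += 0.5f * dy * dy;
        for (size_t j = 0; j < in; ++j)
            W[i * in + j] -= lr * dy * x[j];
        b[i] -= lr * dy;
    }
    return loss;
}
```

Calling `dense_sgd_step` repeatedly on the same sample with a small learning rate should make the returned loss shrink, which is a quick sanity check before any optimization is layered on top.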

Dataset Paradigms: Real vs Synthetic

The models were trained on two distinct datasets: the classic MNIST handwritten digits dataset (60,000 training images), and a heavily over-parameterized synthetic dataset designed to stress-test the raw floating-point throughput of the CPU.

The chart above shows a performance inversion that depends on the workload: the custom C engine is faster on the synthetic benchmark, while PyTorch is faster on MNIST.

The Synthetic Victory: On the synthetic benchmark, where the matrices are large and contiguous, the bespoke C engine outperforms PyTorch. The lack of framework overhead, combined with tiling sized for the L1 cache, allows the C engine to sustain high instruction throughput.
The MNIST Crown: On the real-world MNIST dataset, however, PyTorch's highly optimized Intel MKL backend reclaims the lead. The batching logic and less regular memory access patterns of real-world data favor the decades of algorithmic tuning inside the MKL library.
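The cache tiling mentioned above can be sketched roughly as follows. This is a generic blocked matrix multiply, not the engine's kernel: the name `matmul_tiled`, the square row-major layout, and the tile edge of 32 are assumptions. With 32 x 32 float tiles, the three tiles in flight take about 12 KiB, which is intended to fit in an L1 data cache of 32 KiB (also an assumption about the target CPU).

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32 };  /* assumed tile edge; tune for the target cache size */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, processed tile by tile so that
 * each block of A, B, and C is reused while it is still cache-resident.
 * The caller is expected to zero C first if a plain product is wanted. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(A && B && C);
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        /* Unit-stride inner loop: a natural target for
                         * compiler vectorization or hand-written AVX2. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The `min_sz` clamps handle matrix sizes that are not a multiple of the tile edge, so the same routine works for small test matrices as well as large synthetic ones.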

Numerical Stability

Despite the differences in execution speed, the most important outcome of this test was mathematical correctness. Across both the custom C engines and PyTorch, the neural network consistently achieved ~90% accuracy on the MNIST test set. This indicates that the SIMD optimizations preserved numerical correctness and that the custom forward/backward passes compute gradients consistent with those from PyTorch's Autograd engine.
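Matching accuracy is an indirect signal. A more direct check is to compare gradient buffers element by element against a reference, for example values exported from PyTorch or produced by an unoptimized scalar path. The sketch below assumes both buffers are already in memory as float arrays. The helper name `grads_allclose` is hypothetical, and the tolerance rule |got - ref| <= atol + rtol * |ref| is the same form used by NumPy's `allclose`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Absolute value without pulling in libm. NaN passes through unchanged. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Returns true when every element of `got` is within tolerance of `ref`:
 *   |got[i] - ref[i]| <= atol + rtol * |ref[i]|
 * Any NaN makes the comparison false, so NaNs are reported as mismatches. */
bool grads_allclose(const float *got, const float *ref, size_t n,
                    float rtol, float atol)
{
    assert(got && ref);
    for (size_t i = 0; i < n; ++i) {
        float diff = absf(got[i] - ref[i]);
        if (!(diff <= atol + rtol * absf(ref[i])))
            return false;
    }
    return true;
}
```

Exact equality is usually too strict here, because SIMD kernels reorder floating-point additions. A relative tolerance around 1e-5 is a common starting point for single precision, though the right value depends on layer sizes and accumulation order.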