A2. OS Jitter & Allocators

Deep Dive: Bypassing the Kernel with Custom Memory Arenas


The Virtual Memory Wall

In high-performance computing, the operating system is often your biggest bottleneck. The C++ Engine used standard library containers such as std::vector to manage memory. Whenever a new matrix was instantiated during training, std::vector requested memory from the allocator, which for blocks this large typically goes back to the OS. However, the OS does not immediately give the process physical RAM. Instead, it only hands out "virtual memory" addresses.

When the CPU writes to one of these addresses for the first time, a minor page fault occurs. The OS suspends the thread, finds a free page of physical RAM, maps it to the virtual address, and then resumes execution. This happens once per page, so the overhead is devastating when you are dynamically allocating and freeing hundreds of megabytes of memory across thousands of training epochs.
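The first-touch cost can be observed directly. The following is a minimal sketch, assuming a Linux or other POSIX system where getrusage reports minor faults in ru_minflt. The helper names minor_faults and faults_on_first_touch are hypothetical and are not taken from either engine.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

/* Minor (soft) page faults taken by this process so far, or -1 on error. */
static long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}

/* Allocate `bytes`, write one byte in every page, and return how many
   minor faults those writes caused. Returns -1 on failure. */
static long faults_on_first_touch(size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;

    unsigned char *buf = malloc(bytes);
    if (buf == NULL) return -1;

    long before = minor_faults();
    /* volatile keeps the compiler from discarding the stores */
    volatile unsigned char *p = buf;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        p[off] = 1;
    }
    long after = minor_faults();

    free(buf);
    if (before < 0 || after < 0) return -1;
    return after - before;
}
```

Calling faults_on_first_touch with the size of one matrix gives a rough idea of the per-allocation cost. The exact count depends on the page size, on whether transparent huge pages are enabled, and on whether the allocator reuses memory that is already mapped.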

The Bump Allocator Solution

To eliminate this overhead in the C Engine, I took the system allocator (malloc / new) out of the training loop entirely. Instead, I implemented a custom Memory Arena (specifically, a Bump Allocator) that calls malloc exactly once, at startup.

The C++ Way (15,593 Page Faults)

// Each iteration constructs a fresh vector, so pages are
// allocated and mapped DURING the loop execution.
for (int i = 0; i < epochs; i++) {
    // The constructor zero-fills all N * N floats, so the
    // first-touch page faults happen here.
    std::vector<float> matrix(N * N);
    matrix[0] = 1.0f;
}

The C Way (0 Page Faults)

// Allocate a 1 GiB block upfront
size_t arena_bytes = (size_t)1024 * 1024 * 1024;
float* arena = malloc(arena_bytes);
if (arena == NULL) abort();

// CRITICAL HACK: Pre-fault all pages
// during initialization, NOT during the loop!
memset(arena, 0, arena_bytes);

for (int i = 0; i < epochs; i++) {
    // Allocation is just moving a pointer
    float* matrix = bump_alloc(arena, N * N);
}

Upon startup, the C Engine asks the OS for a single, massive 1 GiB block of memory upfront. Crucially, it then uses memset to write 0 to every single byte. This forces the OS to handle all 262,000+ page faults (1 GiB / 4 KiB = 262,144 pages, assuming standard 4 KiB pages) during startup, rather than during the benchmark. Once training begins, allocating memory for a new matrix simply involves moving a pointer forward by N * N * sizeof(float) bytes. Freeing the most recent allocation is just moving the pointer backward. Both are O(1) operations that cost a few instructions, with no system call and no lock.
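The allocation scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the C Engine's actual code: the Arena struct and the arena_init, bump_free, and arena_destroy helpers are hypothetical, and bump_alloc here takes an Arena pointer rather than the raw float pointer shown earlier. The sketch pre-faults by writing one byte per page through a volatile pointer instead of calling memset, because some optimizing compilers can merge malloc followed by a zeroing memset into calloc, which may leave the pages untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    unsigned char *base; /* start of the pre-faulted block */
    size_t capacity;     /* total bytes in the block */
    size_t offset;       /* bytes handed out so far */
} Arena;

/* Reserve `capacity` bytes and touch every page so the OS maps them now.
   Returns 0 on success, -1 on failure. */
static int arena_init(Arena *a, size_t capacity) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) page = 4096; /* assumption: fall back to 4 KiB pages */

    a->base = malloc(capacity);
    if (a->base == NULL) return -1;

    volatile unsigned char *p = a->base;
    for (size_t off = 0; off < capacity; off += (size_t)page) {
        p[off] = 0; /* one write per page triggers the fault up front */
    }
    a->capacity = capacity;
    a->offset = 0;
    return 0;
}

/* Hand out `count` floats by advancing the offset. NULL if the arena is full. */
static float *bump_alloc(Arena *a, size_t count) {
    if (count > (a->capacity - a->offset) / sizeof(float)) return NULL;
    float *out = (float *)(a->base + a->offset);
    a->offset += count * sizeof(float);
    return out;
}

/* Release the most recent `count` floats by moving the offset back.
   Only valid in last-in, first-out order. */
static void bump_free(Arena *a, size_t count) {
    assert(count <= a->offset / sizeof(float));
    a->offset -= count * sizeof(float);
}

/* Return the whole block to the system allocator. */
static void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->offset = 0;
}
```

The trade-off of this design is that frees must happen in reverse order of allocation, which suits a training loop that creates and discards temporaries in a fixed pattern. Memory returned by bump_alloc is not zeroed after the first use, so callers that rely on zero-initialized matrices would need to clear them explicitly.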

By using a Bump Allocator, the C Engine incurs exactly 0 page faults during the training loop. In stark contrast, the C++ Engine triggers up to 15,593 page faults at N=2000, causing severe OS jitter and making the C++ Engine consistently 19% slower than the C Engine despite running the same math.