A5. Compiler Flags

System Dive: -O3, -ffast-math, and Auto-Vectorization


The Baseline `-O3`

Writing highly optimized C or C++ is only half the battle. The compiler must still translate that source code into machine instructions. By default (`-O0`), GCC and Clang prioritize fast compilation and easy debugging over execution speed. Passing the `-O3` flag selects their most aggressive standard optimization level, which includes function inlining, loop transformations such as unrolling and vectorization, and dead code elimination.

The Power of Modern Compilers

A common critique of manual SIMD intrinsics is that modern compilers (like GCC or Clang) can auto-vectorize loops for you if you provide the right flags. Let's look at what GCC does to my Naive C code when I pass -O3 -march=native -ffast-math.

C Source (Naive)

void matmul_naive(float* A, float* B, float* C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                // The critical inner loop
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}

Assembly Output (AVX2)

// GCC -O3 -march=native -ffast-math
.L4:
        // Load 8 floats from A into ymm1
        vmovups ymm1, YMMWORD PTR [rcx+rax]
        
        // Load 8 floats from B into ymm2
        vmovups ymm2, YMMWORD PTR [r8+rax]
        
        // Fused Multiply-Add (ymm0 = ymm1 * ymm2 + ymm0)
        vfmadd231ps     ymm0, ymm1, ymm2
        
        // Loop unrolling (GCC does this 4x or 8x automatically)
        vmovups ymm1, YMMWORD PTR [rcx+32+rax]
        vmovups ymm2, YMMWORD PTR [r8+32+rax]
        vfmadd231ps     ymm3, ymm1, ymm2

        add     rax, 128
        cmp     rdx, rax
        jne     .L4

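The compiler successfully vectorizes the inner loop and unrolls it into vfmadd231ps (Fused Multiply-Add) instructions, each of which processes 8 floats. The unrolled copies accumulate into separate registers (ymm0 and ymm3), which changes the order of the additions; that reordering is only legal because of -ffast-math. However, as the chart below shows, even with aggressive compiler auto-vectorization, the Naive loop still cannot beat a hand-written Tiled (blocked) algorithm: vectorization speeds up the arithmetic but leaves the loop nest's memory access pattern unchanged, and compilers generally do not restructure it for cache reuse on their own.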
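For comparison, here is a minimal sketch of what a tiled version can look like. It is not the exact implementation measured in this report: the name matmul_tiled and the TILE size of 32 are assumptions chosen for illustration (three 32x32 float tiles take 12 KB, which should fit in a typical L1 data cache).

```c
#include <stddef.h>

/* Hypothetical tile edge length; tune it for the target cache. */
#define TILE 32

/* Cache-blocked multiply of row-major N x N matrices: C = A * B.
 * C is zeroed first, then accumulated one tile at a time. */
void matmul_tiled(const float *A, const float *B, float *C, int N) {
    if (N <= 0) {
        return;
    }
    size_t n = (size_t)N;
    for (size_t idx = 0; idx < n * n; idx++) {
        C[idx] = 0.0f;
    }
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = (ii + TILE < n) ? ii + TILE : n;
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = (kk + TILE < n) ? kk + TILE : n;
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = (jj + TILE < n) ? jj + TILE : n;
                /* Work on one tile of A, B, and C while it is hot in cache. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* The innermost loop walks B and C contiguously. */
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The point of the restructuring is that every element loaded into cache is reused up to TILE times before it is evicted, and the innermost loop reads B row-wise instead of jumping N floats per iteration. The compiler can still auto-vectorize that innermost loop, so tiling and the flags discussed here complement each other.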

Hardware-Specific Optimizations

Even with `-O3`, the compiler generates generic x86_64 machine code that can run on any processor from the last 20 years. To unlock the full power of a modern CPU, you must tell the compiler exactly what hardware it is targeting.

-march=native -mtune=native

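These flags instruct the compiler to generate machine code explicitly tailored for the CPU of the machine doing the compiling. They enable advanced instruction sets (like AVX2 or AVX-512) that a generic build would be forced to ignore for compatibility reasons. The trade-off is portability: the resulting binary may crash with an illegal-instruction fault on a CPU that lacks those extensions. In GCC, -march=native already implies -mtune=native, so the second flag is redundant but harmless.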
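One way to check what a given set of flags actually enabled is to look at the macros GCC and Clang predefine for each instruction-set extension. The sketch below relies on the `__AVX2__`, `__FMA__`, and `__AVX512F__` macros; the function names are my own and purely illustrative.

```c
#include <stdio.h>

/* Each helper returns 1 if this file was compiled with the extension enabled. */
int built_with_avx2(void) {
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}

int built_with_fma(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

int built_with_avx512f(void) {
#ifdef __AVX512F__
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the extensions this build may use. */
void print_build_features(void) {
    printf("AVX2: %d, FMA: %d, AVX-512F: %d\n",
           built_with_avx2(), built_with_fma(), built_with_avx512f());
}
```

Calling print_build_features() from a program built with plain -O3 and again from one built with -O3 -march=native shows the difference directly. Note that these macros describe what the compiler was allowed to emit, not what the CPU running the binary supports.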

Flag Performance Impact

The following chart demonstrates the cumulative speedup multiplier gained by applying increasingly aggressive compiler flags to my base SIMD AVX2 C implementation. (Baseline -O0 = 1x).
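To reproduce this kind of comparison, a small timing harness is enough. The following is a sketch under my own assumptions, not the benchmark code behind the chart: best_seconds is a hypothetical helper, it uses the standard clock() function (CPU time, adequate for a single-threaded kernel), and it reports the fastest of several runs to reduce noise.

```c
#include <stdlib.h>
#include <time.h>

typedef void (*matmul_fn)(float *A, float *B, float *C, int N);

/* Same naive kernel as in the listing earlier, repeated so this compiles alone. */
void matmul_naive(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

/* Runs fn `repeats` times on N x N inputs and returns the fastest run in
 * seconds, or -1.0 on invalid arguments or allocation failure. */
double best_seconds(matmul_fn fn, int N, int repeats) {
    if (N <= 0 || repeats <= 0) {
        return -1.0;
    }
    size_t count = (size_t)N * (size_t)N;
    float *A = malloc(count * sizeof *A);
    float *B = malloc(count * sizeof *B);
    float *C = malloc(count * sizeof *C);
    if (!A || !B || !C) {
        free(A);
        free(B);
        free(C);
        return -1.0;
    }
    for (size_t i = 0; i < count; i++) {
        A[i] = (float)(i % 7);
        B[i] = (float)(i % 5);
    }
    double best = -1.0;
    volatile float sink = 0.0f; /* keeps the result observable to the optimizer */
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        fn(A, B, C, N);
        clock_t stop = clock();
        sink += C[count - 1];
        double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || seconds < best) {
            best = seconds;
        }
    }
    free(A);
    free(B);
    free(C);
    return best;
}
```

Building the same file once per flag combination and dividing the -O0 time by each optimized time gives a speedup multiplier comparable to the one plotted above.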

Fast Math

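Floating-point arithmetic under the IEEE 754 standard is not associative, and the standard defines specific behavior for NaNs, infinities, and signed zeros. A conforming compiler must therefore evaluate floating-point expressions in the order they are written. This strictness limits the compiler's ability to reorder floating-point operations for speed.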
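A tiny example makes the ordering problem concrete. The two helpers below (names are mine) add the same three floats with different grouping; the volatile temporaries are there only to force each intermediate result to be rounded to float.

```c
/* Computes (a + b) + c. */
float sum_left(float a, float b, float c) {
    volatile float t = a + b; /* rounded to float before the second add */
    return t + c;
}

/* Computes a + (b + c). */
float sum_right(float a, float b, float c) {
    volatile float t = b + c; /* rounded to float before the second add */
    return a + t;
}
```

With a = 1e20f, b = -1e20f, and c = 1.0f, sum_left returns 1.0f, while sum_right returns 0.0f because the 1.0f is lost when it is added to -1e20f first. A compiler that reorders additions can silently switch between these two answers.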

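The -ffast-math flag breaks IEEE compliance. It allows the compiler to assume math is associative (i.e., (A+B)+C == A+(B+C)), to assume that NaNs and infinities never occur, to ignore the sign of zero, and, on some platforms, to flush denormalized numbers to zero. In machine learning, where slight floating-point inaccuracies are usually absorbed by the neural network's noise tolerance, -ffast-math can provide a large speedup for almost no engineering effort. The cost is strict numerical precision: results may differ between compilers and flag sets, and checks such as isnan() may be optimized away.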
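The associativity assumption is what lets the vectorizer split one running sum into several independent ones, as with the separate ymm accumulators in the assembly listing earlier. The sketch below writes that transformation out by hand for a dot product. It is an illustration of the idea rather than what GCC literally emits; LANES = 8 is an assumption matching eight floats per AVX2 register, and both function names are hypothetical.

```c
#include <stddef.h>

/* Number of independent partial sums; 8 floats fit in one AVX2 register. */
#define LANES 8

/* Strict left-to-right summation, the order the source code asks for. */
float dot_strict(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* The same dot product with the additions regrouped into LANES partial
 * sums, which are combined only at the end. */
float dot_reassociated(const float *x, const float *y, size_t n) {
    float partial[LANES] = {0.0f};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            partial[lane] += x[i + lane] * y[i + lane];
        }
    }
    float sum = 0.0f;
    for (size_t lane = 0; lane < LANES; lane++) {
        sum += partial[lane];
    }
    /* Leftover elements when n is not a multiple of LANES. */
    for (; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Both functions are mathematically equivalent, but on general inputs they can round differently in the last few bits. Without -ffast-math (or the narrower -fassociative-math) the compiler must keep the strict order; with it, the compiler is free to pick the regrouped form that maps onto SIMD registers.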