SIMD

Single instruction, multiple data

Note: All examples and intrinsics below target 256-bit AVX/AVX2. AVX-512 provides equivalent _mm512 variants, which are not covered here.

How does SIMD work?

SIMD hardware packs multiple data elements into a single wide processor register (e.g., 128-bit or 256-bit). A single instruction then operates on every element in the register at once, so the same work takes far fewer instructions than processing the elements one by one.
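As a rough illustration of the idea, the sketch below adds two groups of eight floats, once with a scalar loop and once with a single vector add. The names add8_scalar, add8_simd and the SIMD_TARGET macro are made up for this example. The macro assumes GCC or Clang and enables AVX2 for one function only, so the snippet builds without extra compiler flags; other compilers may need a flag instead.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// Scalar version: eight separate additions.
inline void add8_scalar(const float* a, const float* b, float* out) {
    for (int i = 0; i < 8; ++i) out[i] = a[i] + b[i];
}

// SIMD version: one add instruction covers all eight lanes.
SIMD_TARGET inline void add8_simd(const float* a, const float* b, float* out) {
    const __m256 va = _mm256_loadu_ps(a);  // load 8 floats into a YMM register
    const __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}
```

Both functions produce the same eight results; the SIMD version just needs one add instead of eight.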

What builtin functions can I use?

The instructions available to you depend on the programming language (and compiler) and on the target CPU architecture.

Arithmetic (Float & Double)

Intrinsic	Data Type	Operation
_mm256_add_ps / pd	8 floats / 4 doubles	Addition (A + B)
_mm256_sub_ps / pd	8 floats / 4 doubles	Subtraction (A - B)
_mm256_mul_ps / pd	8 floats / 4 doubles	Multiplication (A × B)
_mm256_div_ps / pd	8 floats / 4 doubles	Division (A ÷ B)
_mm256_sqrt_ps / pd	8 floats / 4 doubles	Square root (√A)
_mm256_fmadd_ps / pd	8 floats / 4 doubles	Fused multiply-add (A × B + C); requires the FMA extension
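The arithmetic intrinsics in the table all follow the same load, compute, store pattern. Here is a small sketch using two of them. The function names fmadd8 and sqrt8 and the SIMD_TARGET macro are hypothetical; the macro assumes GCC or Clang and a CPU that supports both AVX2 and FMA.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX2 and FMA for the marked functions only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2,fma")))
#else
#define SIMD_TARGET
#endif

// out[i] = a[i] * b[i] + c[i] for 8 floats, with a single rounding step.
SIMD_TARGET inline void fmadd8(const float* a, const float* b, const float* c, float* out) {
    const __m256 r = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
}

// out[i] = sqrt(a[i]) for 8 floats.
SIMD_TARGET inline void sqrt8(const float* a, float* out) {
    _mm256_storeu_ps(out, _mm256_sqrt_ps(_mm256_loadu_ps(a)));
}
```

The double-precision variants work the same way with __m256d, _mm256_loadu_pd and four elements per register.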

Integer Operations

The original AVX only offered 256-bit floating-point arithmetic; AVX2 extended most integer instructions to 256 bits.

Intrinsic	Operation
_mm256_add_epi32	Adds 8 packed 32-bit integers.
_mm256_sub_epi64	Subtracts 4 packed 64-bit integers.
_mm256_slli_epi32	Shifts each of 8 packed 32-bit integers left by a constant bit count, shifting in zeros.
_mm256_cmpeq_epi32	Compares 8 packed 32-bit integers for equality; each result lane is all 1s if equal, otherwise all 0s.
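One way to see the comparison mask in action is to count matching values. The sketch below relies on the fact that an all-1s lane is the integer -1, so subtracting the mask from a running counter adds 1 for every match. The function name count_equal and the SIMD_TARGET macro are invented for this example, and the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// Counts how many elements of data[0..n) equal needle.
SIMD_TARGET inline std::size_t count_equal(const std::int32_t* data, std::size_t n,
                                           std::int32_t needle) {
    const __m256i v_needle = _mm256_set1_epi32(needle);
    __m256i v_counts = _mm256_setzero_si256();  // 8 per-lane counters
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        const __m256i eq = _mm256_cmpeq_epi32(v, v_needle);  // -1 where equal, 0 elsewhere
        v_counts = _mm256_sub_epi32(v_counts, eq);           // x - (-1) == x + 1
    }
    // Add up the 8 per-lane counters, then handle the leftover elements.
    alignas(32) std::int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), v_counts);
    std::size_t count = 0;
    for (int k = 0; k < 8; ++k) count += static_cast<std::size_t>(lanes[k]);
    for (; i < n; ++i) count += (data[i] == needle) ? 1 : 0;
    return count;
}
```

The per-lane counters are 32-bit, so this sketch is only meant for inputs small enough that no lane overflows.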

Memory Management (Loading & Storing)

Data must be loaded from memory into a YMM register before any 256-bit SIMD operation (AVX/AVX2) can work on it.

Intrinsic	Operation
_mm256_loadu_si256	Loads 256 bits of integer data from any address, aligned or not. Can be slower than an aligned load when the data straddles a cache line.
_mm256_load_ps	Loads 8 floats from a 32-byte aligned address. May fault if the address is not aligned.
_mm256_storeu_ps	Stores 8 floats back to memory at any address, aligned or not.
_mm256_set1_ps(float a)	"Broadcasts" a single float to all 8 slots in the register.
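A minimal sketch that combines a broadcast, an unaligned load and an unaligned store might look like this. The function name add_offset and the SIMD_TARGET macro are hypothetical, and the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstddef>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// dst[i] = src[i] + offset for n floats; neither pointer has to be aligned.
SIMD_TARGET inline void add_offset(const float* src, float* dst, std::size_t n, float offset) {
    const __m256 v_offset = _mm256_set1_ps(offset);  // broadcast to all 8 lanes
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(src + i);               // load 8 floats
        _mm256_storeu_ps(dst + i, _mm256_add_ps(v, v_offset));   // store 8 results
    }
    for (; i < n; ++i) dst[i] = src[i] + offset;  // leftover elements
}
```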

Logical Operations (Bitwise SIMD)

These apply bitwise logic across all 256 bits of the register, regardless of how the lanes are interpreted.

Intrinsic	Operation
_mm256_and_si256	Bitwise AND.
_mm256_or_si256	Bitwise OR.
_mm256_xor_si256	Bitwise XOR.
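For illustration, the three intrinsics can be applied to eight 32-bit values at a time as follows. The function name bitwise8 and the SIMD_TARGET macro are assumptions of this sketch; the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// Computes a[i] & b[i], a[i] | b[i] and a[i] ^ b[i] for 8 unsigned 32-bit values.
SIMD_TARGET inline void bitwise8(const std::uint32_t* a, const std::uint32_t* b,
                                 std::uint32_t* and_out, std::uint32_t* or_out,
                                 std::uint32_t* xor_out) {
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(and_out), _mm256_and_si256(va, vb));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(or_out), _mm256_or_si256(va, vb));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(xor_out), _mm256_xor_si256(va, vb));
}
```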

Important things to note

Data Alignment: 256-bit AVX registers work most efficiently when memory addresses are aligned to 32-byte boundaries. Using _mm256_load_ps on an unaligned address can raise a fault, which typically crashes the program with a segmentation fault. Use _mm256_loadu_ps (the 'u' stands for unaligned) if you aren't sure; it carries at most a minor performance penalty.
The "Remainder" Problem: If your array size isn't a multiple of 8 (for floats) or 4 (for doubles), a loop that always processes a full register will run past the end of the array. Stop the SIMD loop at the last full block and handle the final few elements with a standard scalar loop (called a cleanup loop).
Branching (The Masking Technique): You cannot branch per lane, because one instruction applies to every lane in the register. Instead, compute the result for all lanes, generate a mask using a comparison intrinsic (like _mm256_cmpeq_epi32), and use that mask to "blend" the results together.
AVX-SSE Transition Penalty: Mixing older 128-bit SSE code with newer 256-bit AVX code can cause a CPU "state change" penalty. If you mix them, call the _mm256_zeroupper() intrinsic after the AVX code to clear the upper half of the YMM registers and avoid performance drops.
Horizontal Operations are Slow: SIMD is designed for vertical operations (adding each element of array A to the matching element of array B). Horizontal operations (adding all numbers inside a single register together) are much slower and need extra instructions such as _mm256_hadd_ps, which itself only adds neighbouring pairs within each 128-bit half.
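Two of the points above, masking and horizontal operations, can be sketched in a few lines. The first function is a branch-free version of "if (x == target) x = replacement", the second adds up the eight floats of one register. The names replace_equal, sum8 and SIMD_TARGET are made up for this example, the macro assumes GCC or Clang, and the reduction shown is just one common way to write a horizontal sum.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 for the marked functions only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// Masking: replaces every element equal to target, without a per-lane branch.
SIMD_TARGET inline void replace_equal(std::int32_t* data, std::size_t n,
                                      std::int32_t target, std::int32_t replacement) {
    const __m256i v_target = _mm256_set1_epi32(target);
    const __m256i v_repl = _mm256_set1_epi32(replacement);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i* p = reinterpret_cast<__m256i*>(data + i);
        const __m256i v = _mm256_loadu_si256(p);
        const __m256i mask = _mm256_cmpeq_epi32(v, v_target);  // all 1s where equal
        // Take bytes from v_repl where the mask is set, from v elsewhere.
        _mm256_storeu_si256(p, _mm256_blendv_epi8(v, v_repl, mask));
    }
    for (; i < n; ++i) {  // cleanup loop for the leftover elements
        if (data[i] == target) data[i] = replacement;
    }
}

// Horizontal sum of 8 floats: several dependent steps instead of one instruction.
SIMD_TARGET inline float sum8(const float* p) {
    const __m256 v = _mm256_loadu_ps(p);
    const __m128 lo = _mm256_castps256_ps128(v);    // lanes 0..3
    const __m128 hi = _mm256_extractf128_ps(v, 1);  // lanes 4..7
    __m128 s = _mm_add_ps(lo, hi);                  // 4 partial sums
    s = _mm_hadd_ps(s, s);                          // 2 partial sums
    s = _mm_hadd_ps(s, s);                          // final sum in lane 0
    return _mm_cvtss_f32(s);
}
```

In a real loop you would usually keep a vector accumulator and do the horizontal sum only once at the end.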

Register comparison

Name	Size	Capacity (32-bit float)	Architecture	Memory alignment
XMM	128 bits	4 elements	SSE	16 byte
YMM	256 bits	8 elements	AVX / AVX2	32 byte
ZMM	512 bits	16 elements	AVX-512	64 byte
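The capacity and alignment columns follow directly from the register width, which can be checked at compile time. The helper names lanes and register_alignment are hypothetical, and the checks assume the usual 4-byte float and 8-byte double.

```cpp
#include <cstddef>

// Number of elements of type T that fit into a register of the given width.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (sizeof(T) * 8);
}

// Natural alignment in bytes for a register of the given width.
constexpr std::size_t register_alignment(std::size_t register_bits) {
    return register_bits / 8;
}

static_assert(lanes<float>(128) == 4, "XMM holds 4 floats");
static_assert(lanes<float>(256) == 8, "YMM holds 8 floats");
static_assert(lanes<float>(512) == 16, "ZMM holds 16 floats");
static_assert(lanes<double>(256) == 4, "YMM holds 4 doubles");
static_assert(register_alignment(256) == 32, "YMM data is aligned to 32 bytes");
```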

Memory alignment

When allocating memory for SIMD operations, the golden rule is that the alignment must match the register width: 16 bytes for XMM, 32 bytes for YMM, and 64 bytes for ZMM.

The Modern C++ Way

Starting with C++17, you can use std::aligned_alloc on Linux. MSVC does not provide it; use _aligned_malloc and _aligned_free there instead. For vectors and other data structures, use a custom memory allocator such as the following:

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment>
struct aligned_allocator {
    using value_type = T;

    aligned_allocator() noexcept = default;
    template <typename U> aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;
        
        // Size argument inside std::aligned_alloc MUST be a valid multiple of the specified alignment
        std::size_t size = n * sizeof(T);
        if (std::size_t remainder = size % Alignment; remainder != 0)
          size += (Alignment - remainder);

        void* ptr = std::aligned_alloc(Alignment, size);
        if (!ptr) throw std::bad_alloc();
        
        return static_cast<T*>(ptr);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);
    }

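    // Required because Alignment is a non-type template parameter, which
    // std::allocator_traits cannot rebind automatically.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    // Stateless allocator: instances with the same alignment are interchangeable.
    // Written out by hand so the code also compiles as C++17.
    template <typename U>
    bool operator==(const aligned_allocator<U, Alignment>&) const noexcept { return true; }
    template <typename U>
    bool operator!=(const aligned_allocator<U, Alignment>&) const noexcept { return false; }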
};

std::vector<float, aligned_allocator<float, 32>> vec32; // AVX2
std::vector<float, aligned_allocator<float, 64>> vec64; // AVX-512

// AVX2
struct alignas(32) Vec32 {
  float data[8];
};

// AVX-512
struct alignas(64) Vec64 {
  float data[16]; // 16 floats fill all 64 bytes. With data[8] the struct would be padded up to the 64-byte boundary.
};
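As a quick sanity check, an alignas struct like the one above can be passed to the aligned load and store intrinsics directly. In this sketch the struct is repeated so the snippet is self-contained; the function name scale and the SIMD_TARGET macro are hypothetical, and the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

struct alignas(32) Vec32 {
    float data[8];
};

static_assert(alignof(Vec32) == 32, "Vec32 starts on a 32-byte boundary");
static_assert(sizeof(Vec32) == 32, "8 floats fill the struct without padding");

// Multiplies all 8 floats by factor using the aligned load/store variants.
SIMD_TARGET inline void scale(Vec32& v, float factor) {
    const __m256 x = _mm256_load_ps(v.data);  // safe: alignas guarantees 32-byte alignment
    _mm256_store_ps(v.data, _mm256_mul_ps(x, _mm256_set1_ps(factor)));
}
```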

On typical 64-bit platforms (for example glibc on x86-64), the standard malloc returns memory aligned to 16 bytes.

Because the aligned 128-bit SIMD loads (SSE/SSE2/SSE4) require 16-byte alignment, memory from malloc, and on most x86-64 toolchains from std::vector with its default allocator, works fine with aligned instructions like _mm_load_ps or _mm_load_pd.

AVX instructions like _mm256_load_ps require 32-byte alignment. Since malloc only guarantees 16 byte alignment, there is a chance your pointer will fall on a 16-byte boundary that is not a 32-byte boundary.

If you use an aligned AVX load on a pointer from malloc that is only 16-byte aligned, the behaviour is undefined; in practice the load may fault and crash the program.
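If you cannot control where a pointer comes from, you can check its alignment at run time and fall back to the unaligned load. The helper names is_aligned and copy8 and the SIMD_TARGET macro are invented for this sketch; the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// True if p sits on a multiple of alignment bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Copies 8 floats, using the aligned load only when src is 32-byte aligned.
SIMD_TARGET inline void copy8(const float* src, float* dst) {
    const __m256 v = is_aligned(src, 32) ? _mm256_load_ps(src) : _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, v);
}
```

In practice, always calling _mm256_loadu_ps is often simpler; the check mainly serves to make the alignment requirement visible.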

Example usage

#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <immintrin.h>

// Multiplies every sample by volume_factor in place. data must be 32-byte aligned.
void apply_audio_gain_avx2(float* data, size_t count, float volume_factor) {
  // 1. Broadcast the single scalar factor into all 8 slots of a 256-bit YMM register
  __m256 v_factor = _mm256_set1_ps(volume_factor);

  size_t i = 0;
  // 2. Vectorized Main Loop: Process 8 float elements at a time
  size_t vectorized_end = count - (count % 8);
  for (; i < vectorized_end; i += 8) {
    // Load 8 contiguous elements from memory. Assumes data pointer is aligned to 32 bytes!
    __m256 v_samples = _mm256_load_ps(&data[i]);

    // Perform parallel element-wise multiplication
    __m256 v_result = _mm256_mul_ps(v_samples, v_factor);

    // Store back optimized vector block directly into target buffer memory
    _mm256_store_ps(&data[i], v_result);
  }

  // 3. Cleanup / Tail Loop: Catch leftover elements sequentially to prevent out-of-bound errors
  for (; i < count; ++i) {
    data[i] *= volume_factor;
  }
}

int main() {
  const size_t total_elements = 1003; // Intentional odd number to activate the cleanup tail loop
  const float volume_multiplier = 1.5f;

  // Allocate 32-byte aligned memory with std::aligned_alloc (C++17)
  size_t alloc_size = total_elements * sizeof(float);
  // Adjust size to a multiple of 32 bytes to conform to std::aligned_alloc rules
  if (alloc_size % 32 != 0) {
    alloc_size += (32 - (alloc_size % 32));
  }

  float* audio_buffer = static_cast<float*>(std::aligned_alloc(32, alloc_size));
  
  if (!audio_buffer) {
    std::cerr << "Aligned memory allocation failed!" << std::endl;
    return EXIT_FAILURE;
  }

  // Populate buffer dummy audio values
  for (size_t i = 0; i < total_elements; ++i) {
    audio_buffer[i] = static_cast<float>(i % 10);
  }

  // Run vectorized pipeline
  apply_audio_gain_avx2(audio_buffer, total_elements, volume_multiplier);

  // Cleanup resources
  std::free(audio_buffer);
  return 0;
}
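If the samples live in an ordinary std::vector<float> with the default allocator, the buffer is not guaranteed to be 32-byte aligned. A variant of the gain loop that uses the unaligned load and store intrinsics could look like the sketch below. The function name apply_gain and the SIMD_TARGET macro are hypothetical; the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Assumption: GCC/Clang. Enables AVX2 for the marked function only.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_TARGET __attribute__((target("avx2")))
#else
#define SIMD_TARGET
#endif

// Multiplies every sample by factor in place; no alignment requirement.
SIMD_TARGET inline void apply_gain(std::vector<float>& samples, float factor) {
    float* data = samples.data();
    const std::size_t count = samples.size();
    const __m256 v_factor = _mm256_set1_ps(factor);

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        const __m256 v = _mm256_loadu_ps(data + i);               // unaligned load
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, v_factor));   // unaligned store
    }
    for (; i < count; ++i) {  // cleanup loop for the leftover elements
        data[i] *= factor;
    }
}
```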
