
SIMD

Single instruction, multiple data

Note: All examples and intrinsics on this page target AVX2. AVX-512 uses the corresponding _mm512 variants, which are not covered here.

How does SIMD work?

SIMD hardware packs multiple data elements into a single, wider processor register (e.g., 128-bit or 256-bit). A single instruction then operates on every element in the register at once, so a loop needs far fewer instructions: one 256-bit addition, for example, replaces eight scalar float additions.
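To make the idea concrete, here is a minimal sketch that adds two float arrays once with a scalar loop and once with 256-bit AVX intrinsics. The function names `add_scalar` and `add_avx2` are made up for illustration, and the sketch assumes the element count is a multiple of 8. The `target` attribute is a GCC/Clang feature; compiling with `-mavx2` is an alternative.

```cpp
#include <cstddef>
#include <immintrin.h>

// Scalar baseline: one addition per loop iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// SIMD version: each iteration adds 8 floats with a single instruction.
// Assumption: n is a multiple of 8 (this sketch has no tail handling).
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);  // 8 floats from b
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```

For 16 elements the scalar loop runs 16 iterations, while the AVX2 loop runs 2.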

What built-in functions can I use?

The instructions available to you depend on the programming language and the target CPU architecture. In C and C++, the x86 intrinsics listed below are declared in <immintrin.h>.

Arithmetic (Float & Double)

| Intrinsic | Data type | Operation |
|---|---|---|
| _mm256_add_ps / pd | 8 floats / 4 doubles | Addition (A + B) |
| _mm256_sub_ps / pd | 8 floats / 4 doubles | Subtraction (A - B) |
| _mm256_mul_ps / pd | 8 floats / 4 doubles | Multiplication (A × B) |
| _mm256_div_ps / pd | 8 floats / 4 doubles | Division (A ÷ B) |
| _mm256_sqrt_ps / pd | 8 floats / 4 doubles | Square root (√A) |
| _mm256_fmadd_ps / pd | 8 floats / 4 doubles | Fused multiply-add (A × B + C); requires the FMA extension in addition to AVX |
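As an illustration of how these arithmetic intrinsics combine, the sketch below computes sqrt(a² + b²) for 8 floats at a time. The function name `hypot8` is hypothetical, and the code assumes a CPU with both AVX2 and FMA. The `target` attribute is GCC/Clang-specific; `-mavx2 -mfma` on the command line does the same job.

```cpp
#include <immintrin.h>

// out[i] = sqrt(a[i]*a[i] + b[i]*b[i]) for i in 0..7.
__attribute__((target("avx2,fma")))  // GCC/Clang only; or use -mavx2 -mfma
void hypot8(const float* a, const float* b, float* out) {
    __m256 va  = _mm256_loadu_ps(a);
    __m256 vb  = _mm256_loadu_ps(b);
    __m256 bb  = _mm256_mul_ps(vb, vb);         // b*b
    __m256 sum = _mm256_fmadd_ps(va, va, bb);   // a*a + b*b with a single rounding
    _mm256_storeu_ps(out, _mm256_sqrt_ps(sum)); // square root of all 8 lanes
}
```

The fused multiply-add saves one instruction and one rounding step compared with a separate multiply followed by an add.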

Integer Operations

AVX2 extends most integer instructions to the full 256-bit registers. The original AVX offered 256-bit operations mainly for floating-point data.

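| Intrinsic | Operation |
|---|---|
| _mm256_add_epi32 | Adds 8 packed 32-bit integers. |
| _mm256_sub_epi64 | Subtracts 4 packed 64-bit integers. |
| _mm256_slli_epi32 | Shifts each of 8 packed 32-bit integers left by a constant bit count, filling with zeros (useful for multiplying by powers of two). |
| _mm256_cmpeq_epi32 | Compares 8 packed 32-bit integers for equality; each result lane is all 1 bits if equal and all 0 bits otherwise. |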
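The following sketch exercises the integer intrinsics from the table. Both function names (`shift_add8`, `equal_mask8`) are invented for this example. `_mm256_movemask_ps` is used here only to turn the comparison result into an ordinary integer that is easy to inspect; it is not part of the table above.

```cpp
#include <immintrin.h>

// out[i] = (a[i] << 2) + b[i] for 8 packed 32-bit integers.
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void shift_add8(const int* a, const int* b, int* out) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    __m256i shifted = _mm256_slli_epi32(va, 2);  // multiply each lane by 4
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_add_epi32(shifted, vb));
}

// Returns an 8-bit value whose bit i is set when v[i] == target.
__attribute__((target("avx2")))
int equal_mask8(const int* v, int target) {
    __m256i vv = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(v));
    __m256i eq = _mm256_cmpeq_epi32(vv, _mm256_set1_epi32(target));
    // Collect the top bit of each 32-bit lane into an integer.
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

For the input {1, 5, 3, 5, 5, 0, 7, 5} and target 5, lanes 1, 3, 4 and 7 match, so `equal_mask8` returns 0x9A.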

Memory Management (Loading & Storing)

Data must be loaded into a YMM register (the 256-bit registers used by AVX and AVX2) before any arithmetic can be performed on it.

| Intrinsic | Operation |
|---|---|
| _mm256_loadu_si256 | Loads 256 bits of integer data from any address, aligned or not. On recent CPUs this typically costs the same as an aligned load unless the access crosses a cache-line boundary. |
| _mm256_load_ps | Loads 8 floats from an address that must be 32-byte aligned. |
| _mm256_storeu_ps | Stores 8 floats back to memory at any address. |
| _mm256_set1_ps(float a) | "Broadcasts" a single float to all 8 slots in the register. |
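A small sketch of how these load, broadcast and store intrinsics fit together. The function name `add_bias8` and its parameter names are hypothetical. The source pointer is assumed to be 32-byte aligned, while the destination may be any address.

```cpp
#include <immintrin.h>

// Adds `bias` to 8 floats.
// Precondition: src_aligned is 32-byte aligned. dst has no alignment requirement.
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void add_bias8(const float* src_aligned, float* dst, float bias) {
    __m256 v = _mm256_load_ps(src_aligned);      // aligned load
    __m256 b = _mm256_set1_ps(bias);             // broadcast bias to all 8 lanes
    _mm256_storeu_ps(dst, _mm256_add_ps(v, b));  // unaligned store
}
```

A caller could declare the source as `alignas(32) float src[8]` to satisfy the precondition.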

Logical Operations (Bitwise SIMD)

These apply bitwise logic across all 256 bits of the register, regardless of how wide the packed elements are.

| Intrinsic | Operation |
|---|---|
| _mm256_and_si256 | Bitwise AND. |
| _mm256_or_si256 | Bitwise OR. |
| _mm256_xor_si256 | Bitwise XOR. |
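As a quick illustration, the sketch below applies all three operations to the same pair of inputs. The function name `bitwise8` is made up for this example.

```cpp
#include <immintrin.h>

// Computes a & b, a | b and a ^ b for 8 packed 32-bit integers.
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void bitwise8(const int* a, const int* b,
              int* and_out, int* or_out, int* xor_out) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(and_out), _mm256_and_si256(va, vb));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(or_out),  _mm256_or_si256(va, vb));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(xor_out), _mm256_xor_si256(va, vb));
}
```

With every lane of a set to 0b1100 and every lane of b set to 0b1010, the results are 0b1000, 0b1110 and 0b0110.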

Important things to note

  • Data Alignment: 256-bit AVX loads and stores work most efficiently when memory addresses are aligned to 32-byte boundaries. Calling _mm256_load_ps on an unaligned address is undefined behaviour and typically raises a general-protection fault, which Linux reports as a segmentation fault (crash). Use _mm256_loadu_ps (the 'u' stands for unaligned) if you aren't sure; on recent CPUs the penalty is small and mostly shows up when a load crosses a cache-line boundary.
  • The "Remainder" Problem: If your array size isn't a multiple of 8 (for floats) or 4 (for doubles), a loop that always processes a full vector will read and write past the end of the array. Handle the final few elements with a standard scalar loop (called a cleanup or tail loop).
  • Branching (The Masking Technique): You cannot branch per lane, because one instruction processes every lane. Instead, compute the result for all lanes, generate a mask using a comparison intrinsic (like _mm256_cmpeq_epi32), and use that mask to "blend" the candidate results together.
  • AVX-SSE Transition Penalty: Running legacy 128-bit SSE instructions while the upper halves of the YMM registers still hold data can cause a CPU "state change" penalty. If you mix the two, call the _mm256_zeroupper() intrinsic after the AVX code to clear the upper halves of the YMM registers and avoid performance drops.
  • Horizontal Operations are Slow: SIMD is designed for vertical operations (adding array A to array B element by element). Horizontal operations (adding all numbers inside a single register together) are much slower and need extra instructions such as _mm256_hadd_ps, which adds adjacent pairs within each 128-bit half rather than summing the whole register in one step.
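Two of the points above, masking and horizontal reduction, can be sketched as follows. The function names `replace_equal8` and `horizontal_sum8` are hypothetical. The blend uses `_mm256_blendv_epi8`, and the reduction shown is one common pattern among several rather than the only way to do it.

```cpp
#include <immintrin.h>

// Masking: replaces every lane equal to `from` with `to`, without any branch.
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void replace_equal8(int* v, int from, int to) {
    __m256i vv   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(v));
    __m256i mask = _mm256_cmpeq_epi32(vv, _mm256_set1_epi32(from));
    // Where the mask is all ones take `to`, elsewhere keep the original value.
    __m256i res  = _mm256_blendv_epi8(vv, _mm256_set1_epi32(to), mask);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(v), res);
}

// Horizontal reduction: sums the 8 floats held in one 256-bit register.
__attribute__((target("avx2")))
float horizontal_sum8(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);    // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);  // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);           // 4 partial sums
    s = _mm_hadd_ps(s, s);                    // 2 partial sums
    s = _mm_hadd_ps(s, s);                    // total in lane 0
    return _mm_cvtss_f32(s);
}
```

The reduction needs four dependent instructions for a single result, which is why it is usually done once after a loop rather than on every iteration.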

Comparison of Registers

| Name | Size | Capacity (32-bit float) | Architecture | Memory alignment |
|---|---|---|---|---|
| XMM | 128 bits | 4 elements | SSE | 16 bytes |
| YMM | 256 bits | 8 elements | AVX / AVX2 | 32 bytes |
| ZMM | 512 bits | 16 elements | AVX-512 | 64 bytes |

Memory alignment

To allocate memory for SIMD operations, the golden rule is that your alignment must match the register width: 16 bytes for XMM, 32 bytes for YMM and 64 bytes for ZMM.
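A minimal sketch of two helpers that make this rule checkable in code. Both names (`is_aligned`, `round_up_to_multiple`) are invented for illustration, and both assume the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// True when the address p is a multiple of `alignment` bytes.
// Assumption: alignment is a power of two (16, 32, 64, ...).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Rounds `size` up to the next multiple of `alignment` (a power of two).
// Useful for allocation functions that require such a size.
inline std::size_t round_up_to_multiple(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}
```

For example, 1003 floats occupy 4012 bytes, which rounds up to 4032 bytes for 32-byte alignment.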

The Modern C++ Way

Starting with C++17, std::aligned_alloc is available on Linux for allocating raw buffers. For std::vector and other containers, use a custom allocator such as the one below. For fixed-size types, alignas sets the alignment directly.

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment>
struct aligned_allocator {
    using value_type = T;

    // Needed because Alignment is a non-type template parameter:
    // std::allocator_traits cannot derive rebind automatically in that case.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    aligned_allocator() noexcept = default;
    template <typename U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;

        // The size passed to std::aligned_alloc must be a multiple of the alignment
        std::size_t size = n * sizeof(T);
        if (std::size_t remainder = size % Alignment; remainder != 0)
            size += (Alignment - remainder);

        void* ptr = std::aligned_alloc(Alignment, size);
        if (!ptr) throw std::bad_alloc();

        return static_cast<T*>(ptr);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);
    }

    // Defaulted comparison requires C++20
    bool operator==(const aligned_allocator&) const = default;
};

std::vector<float, aligned_allocator<float, 32>> vec32; // AVX2
std::vector<float, aligned_allocator<float, 64>> vec64; // AVX-512

// AVX2
struct alignas(32) Vec32 {
    float data[8];
};

// AVX-512
struct alignas(64) Vec64 {
    // data[16] instead of data[8] so that all 64 bytes are used.
    // Otherwise the struct would be padded up to the 64-byte boundary.
    float data[16];
};

On typical 64-bit platforms, the standard malloc returns memory aligned to 16 bytes (the alignment of max_align_t).

Because 128-bit SIMD (SSE/SSE2/SSE4) requires 16-byte alignment, standard malloc and std::vector will work perfectly fine with aligned instructions like _mm_load_ps or _mm_load_pd.

AVX instructions like _mm256_load_ps require 32-byte alignment. Since malloc only guarantees 16 byte alignment, there is a chance your pointer will fall on a 16-byte boundary that is not a 32-byte boundary.

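If you use an aligned AVX load on a pointer from malloc that is only 16-byte aligned, you are in undefined behaviour territory: the code may appear to work whenever the allocation happens to land on a 32-byte boundary and crash when it does not.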

Example usage

#include <iostream>
#include <cstdlib>
#include <immintrin.h>
#include <vector>

// High-performance real-time processing loop
void apply_audio_gain_avx2(float* data, size_t count, float volume_factor) {
    // 1. Broadcast the single scalar factor into all 8 slots of a 256-bit YMM register
    __m256 v_factor = _mm256_set1_ps(volume_factor);

    size_t i = 0;
    // 2. Vectorized Main Loop: Process 8 float elements at a time
    size_t vectorized_end = count - (count % 8);
    for (; i < vectorized_end; i += 8) {
        // Load 8 contiguous elements from memory. Assumes data pointer is aligned to 32 bytes!
        __m256 v_samples = _mm256_load_ps(&data[i]);

        // Perform parallel element-wise multiplication
        __m256 v_result = _mm256_mul_ps(v_samples, v_factor);

        // Store back optimized vector block directly into target buffer memory
        _mm256_store_ps(&data[i], v_result);
    }

    // 3. Cleanup / Tail Loop: Catch leftover elements sequentially to prevent out-of-bound errors
    for (; i < count; ++i) {
        data[i] *= volume_factor;
    }
}

int main() {
    const size_t total_elements = 1003; // Intentional odd number to activate the cleanup tail loop
    const float volume_multiplier = 1.5f;

    // Allocate 32-byte strictly aligned memory using standard C++17 paradigms
    size_t alloc_size = total_elements * sizeof(float);
    // Adjust size to a multiple of 32 bytes to conform to std::aligned_alloc rules
    if (alloc_size % 32 != 0) {
        alloc_size += (32 - (alloc_size % 32));
    }

    float* audio_buffer = static_cast<float*>(std::aligned_alloc(32, alloc_size));

    if (!audio_buffer) {
        std::cerr << "Memory alignment allocation failed!" << std::endl;
        return -1;
    }

    // Populate buffer dummy audio values
    for (size_t i = 0; i < total_elements; ++i) {
        audio_buffer[i] = static_cast<float>(i % 10);
    }

    // Run vectorized pipeline
    apply_audio_gain_avx2(audio_buffer, total_elements, volume_multiplier);

    // Cleanup resources
    std::free(audio_buffer);
    return 0;
}
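As a possible alternative to the scalar cleanup loop, the tail can also be handled with a masked load and store, which only touch the lanes that are still inside the array. This is a sketch under assumptions, not a drop-in replacement: the function name `apply_gain_masked_tail` is hypothetical, and it uses unaligned loads so it works with any pointer.

```cpp
#include <cstddef>
#include <immintrin.h>

// Multiplies `count` floats by `gain`; the last partial vector uses a lane mask.
__attribute__((target("avx2")))  // GCC/Clang only; or compile with -mavx2
void apply_gain_masked_tail(float* data, std::size_t count, float gain) {
    const __m256 v_gain = _mm256_set1_ps(gain);

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, v_gain));
    }

    const int remaining = static_cast<int>(count - i);  // 0..7 elements left
    if (remaining > 0) {
        // Lane k is enabled (all ones) when k < remaining.
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining), lane);

        // Disabled lanes are neither read from nor written to memory.
        __m256 v = _mm256_maskload_ps(data + i, mask);
        _mm256_maskstore_ps(data + i, mask, _mm256_mul_ps(v, v_gain));
    }
}
```

Whether this beats a scalar tail loop depends on the CPU and the workload, so it is worth measuring before adopting it.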