SIMD
Single instruction, multiple data
Note: All examples and intrinsics here target 256-bit AVX/AVX2. AVX-512 uses the corresponding _mm512 variants, which you will have to look up yourself.
How does SIMD work?
SIMD hardware packs multiple data elements into a single, wider processor register (e.g., 128-bit or 256-bit). A single instruction then operates on every element in the register at once, so a loop needs far fewer instructions (and usually far fewer clock cycles) than its scalar equivalent.
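To make this concrete, here is a minimal sketch comparing a scalar loop with an AVX version that adds 8 floats per instruction. The names add_scalar, add_simd8 and the AVX2_FN macro are made up for this example; the macro assumes GCC or Clang and only exists so the snippet builds without -mavx2. The SIMD version also assumes n is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Scalar version: one addition per loop iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// SIMD version: one add instruction handles 8 floats per iteration.
AVX2_FN void add_simd8(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 8 == 0 && "this sketch has no cleanup loop");
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load a[i..i+7]
        __m256 vb = _mm256_loadu_ps(b + i);   // load b[i..i+7]
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```

For 16 elements the scalar loop runs 16 iterations and the SIMD loop runs 2.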
What builtin functions can I use?
You have access to different instructions depending on the programming language and the target CPU architecture.
Arithmetic (Float & Double)
| Intrinsic | Data Type | Operation |
|---|---|---|
| _mm256_add_ps / pd | 8 floats / 4 doubles | Addition (A+B) |
| _mm256_sub_ps / pd | 8 floats / 4 doubles | Subtraction (A-B) |
| _mm256_mul_ps / pd | 8 floats / 4 doubles | Multiplication (A×B) |
| _mm256_div_ps / pd | 8 floats / 4 doubles | Division (A÷B) |
| _mm256_sqrt_ps / pd | 8 floats / 4 doubles | Square Root (√A) |
| _mm256_fmadd_ps / pd | 8 floats / 4 doubles | Fused Multiply-Add (A×B+C); needs the separate FMA extension (e.g. -mfma) |
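As a small illustration of chaining these arithmetic intrinsics, the sketch below computes sqrt(x² + y²) for 8 floats at once. The function name magnitude8 and the AVX2_FN macro are hypothetical helpers for this example (the macro assumes GCC or Clang). It deliberately uses separate multiply and add rather than _mm256_fmadd_ps so that it does not depend on the FMA extension.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// out[i] = sqrt(x[i]*x[i] + y[i]*y[i]) for i in 0..7.
AVX2_FN void magnitude8(const float* x, const float* y, float* out) {
    __m256 vx = _mm256_loadu_ps(x);
    __m256 vy = _mm256_loadu_ps(y);
    __m256 xx = _mm256_mul_ps(vx, vx);        // x squared, 8 lanes
    __m256 yy = _mm256_mul_ps(vy, vy);        // y squared, 8 lanes
    __m256 sum = _mm256_add_ps(xx, yy);       // x² + y²
    _mm256_storeu_ps(out, _mm256_sqrt_ps(sum));
}
```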
Integer Operations
The original AVX offered 256-bit operations for floating point only; AVX2 extends integer operations to the full 256-bit registers.
| Intrinsic | Operation |
|---|---|
| _mm256_add_epi32 | Adds 8 packed 32-bit integers. |
| _mm256_sub_epi64 | Subtracts 4 packed 64-bit integers. |
| _mm256_slli_epi32 | Shift left logical for 8 integers (great for bitwise math). |
| _mm256_cmpeq_epi32 | Compares 8 integers; returns a mask where bits are all 1 if equal. |
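Here is a hedged sketch of how the compare intrinsic is typically used: counting how often a value occurs in an int32 array. The function name count_equal_avx2 and the AVX2_FN macro are invented for this example (the macro assumes GCC or Clang). Besides the intrinsics in the table it uses _mm256_set1_epi32 to broadcast the needle and _mm256_movemask_ps to turn the per-lane masks into 8 bits.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

AVX2_FN std::size_t count_equal_avx2(const std::int32_t* data, std::size_t n,
                                     std::int32_t needle) {
    const __m256i v_needle = _mm256_set1_epi32(needle);
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i eq = _mm256_cmpeq_epi32(v, v_needle);  // all-ones in lanes that match
        // Take the top bit of each 32-bit lane, giving an 8-bit mask.
        int bits = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
        count += std::bitset<8>(static_cast<unsigned>(bits)).count();
    }
    for (; i < n; ++i) count += (data[i] == needle);   // scalar cleanup loop
    return count;
}
```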
Memory Management (Loading & Storing)
For 256-bit SIMD operations (AVX/AVX2), data must be loaded into a YMM register before you can do any math on it.
| Intrinsic | Operation |
|---|---|
| _mm256_loadu_si256 | Loads 256 bits of integer data from an unaligned memory address. Can be slower than an aligned load, mainly when the access crosses a cache-line boundary. |
| _mm256_load_ps | Loads 8 floats from a 32-byte aligned address. On recent CPUs the speed advantage over an unaligned load is usually small. |
| _mm256_storeu_ps | Stores 8 floats back to memory; the address may be unaligned. |
| _mm256_set1_ps(float a) | "Broadcasts" a single float to all 8 slots in the register. |
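The sketch below shows the aligned and unaligned load/store pairs side by side, together with a broadcast. The function names and the AVX2_FN macro are hypothetical (the macro assumes GCC or Clang); the aligned variant checks its precondition with an assert in debug builds.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Adds `offset` to 8 floats. src and dst may have any alignment.
AVX2_FN void add_offset8_unaligned(const float* src, float* dst, float offset) {
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, _mm256_add_ps(v, _mm256_set1_ps(offset)));
}

// Same operation, but src and dst MUST be 32-byte aligned.
AVX2_FN void add_offset8_aligned(const float* src, float* dst, float offset) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
    __m256 v = _mm256_load_ps(src);
    _mm256_store_ps(dst, _mm256_add_ps(v, _mm256_set1_ps(offset)));
}
```

A buffer declared as alignas(32) float buf[16]; can be passed to either variant, while buf + 1 is only safe with the unaligned one.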
Logical Operations (Bitwise SIMD)
These perform bitwise logic across all 256 bits of the register, regardless of the element type stored in it.
| Intrinsic | Operation |
|---|---|
| _mm256_and_si256 | Bitwise AND. |
| _mm256_or_si256 | Bitwise OR. |
| _mm256_xor_si256 | Bitwise XOR. |
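One common use of these, sketched here under the assumption of IEEE 754 single-precision floats: clearing the sign bit with AND gives the absolute value, and flipping it with XOR negates. The names abs8, negate8 and the AVX2_FN macro are made up for this example (the macro assumes GCC or Clang). The _mm256_castps_si256 / _mm256_castsi256_ps calls only reinterpret the bits; they do not convert values.

```cpp
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// dst[i] = |src[i]| for 8 floats: AND away the sign bit (bit 31).
AVX2_FN void abs8(const float* src, float* dst) {
    const __m256i keep_all_but_sign = _mm256_set1_epi32(0x7FFFFFFF);
    __m256i bits = _mm256_castps_si256(_mm256_loadu_ps(src));
    __m256i cleared = _mm256_and_si256(bits, keep_all_but_sign);
    _mm256_storeu_ps(dst, _mm256_castsi256_ps(cleared));
}

// dst[i] = -src[i] for 8 floats: XOR flips the sign bit.
AVX2_FN void negate8(const float* src, float* dst) {
    const __m256i sign_bit = _mm256_set1_epi32(INT32_MIN);  // 0x80000000
    __m256i bits = _mm256_castps_si256(_mm256_loadu_ps(src));
    __m256i flipped = _mm256_xor_si256(bits, sign_bit);
    _mm256_storeu_ps(dst, _mm256_castsi256_ps(flipped));
}
```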
Important things to note
- Data Alignment: 256-bit AVX registers work most efficiently when memory addresses are aligned to 32-byte boundaries. Using _mm256_load_ps on unaligned memory can crash the program (typically with a segmentation fault). Use _mm256_loadu_ps (the 'u' stands for unaligned) if you aren't sure, though it can carry a minor performance penalty.
- The "Remainder" Problem: If your array size isn't a perfect multiple of 8 (for floats) or 4 (for doubles), your SIMD loop will overshoot the end of the array. You must handle the final few elements using a standard scalar loop (called a cleanup loop).
- Branching (The Masking Technique): You cannot use an if statement inside a SIMD lane. Instead, you perform the operation for all lanes, generate a mask using a comparison intrinsic (like _mm256_cmpeq_epi32), and use that mask to "blend" the results together.
- AVX-SSE Transition Penalty: Mixing older 128-bit SSE code with newer 256-bit AVX code can cause a CPU "state change" penalty. If you mix them, use the _mm256_zeroupper() intrinsic to clear the upper half of the YMM registers before the SSE code runs and avoid performance drops.
- Horizontal Operations are Slow: SIMD is designed for vertical operations (adding Array A to Array B). Horizontal operations (adding all numbers inside a single register together) are much slower and require special instructions like _mm256_hadd_ps.
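Two of the points above, masking and horizontal operations, can be sketched as follows. The function names and the AVX2_FN macro are hypothetical (the macro assumes GCC or Clang). The first function replaces the scalar branch "if (x < 0) x *= 0.5f" with a compare plus _mm256_blendv_ps; the second sums the 8 lanes of one register with _mm256_hadd_ps, which works within each 128-bit half, so the two halves are added at the end.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Scalar equivalent: out[i] = (x[i] < 0) ? x[i] * 0.5f : x[i];
AVX2_FN void halve_negatives8(const float* x, float* out) {
    __m256 v = _mm256_loadu_ps(x);
    // Compute the "then" branch for every lane, needed or not.
    __m256 halved = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));
    // Mask: all-ones in lanes where x < 0, zero elsewhere.
    __m256 is_negative = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_LT_OQ);
    // Per lane: take `halved` where the mask is set, otherwise keep `v`.
    _mm256_storeu_ps(out, _mm256_blendv_ps(v, halved, is_negative));
}

// Returns x[0] + x[1] + ... + x[7].
AVX2_FN float horizontal_sum8(const float* x) {
    __m256 v = _mm256_loadu_ps(x);
    __m256 h1 = _mm256_hadd_ps(v, v);    // pairwise sums inside each 128-bit half
    __m256 h2 = _mm256_hadd_ps(h1, h1);  // lane 0: x0..x3, lane 4: x4..x7
    __m128 lo = _mm256_castps256_ps128(h2);
    __m128 hi = _mm256_extractf128_ps(h2, 1);
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}
```

Note that the horizontal sum needs four dependent instructions for one result, whereas a vertical add produces 8 results with one instruction.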
Comparison of Registers
| Name | Size | Capacity (32-bit float) | Architecture | Memory alignment |
|---|---|---|---|---|
| XMM | 128 bits | 4 elements | SSE | 16 byte |
| YMM | 256 bits | 8 elements | AVX / AVX2 | 32 byte |
| ZMM | 512 bits | 16 elements | AVX-512 | 64 byte |
Memory alignment
To allocate memory for SIMD operations, the golden rule is that your alignment must match the register width in bytes: 16 for XMM, 32 for YMM, 64 for ZMM.
The Modern C++ Way
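Starting with C++17, you can use std::aligned_alloc on Linux (MSVC's standard library does not provide it, so Windows code typically uses _aligned_malloc and _aligned_free instead). For vectors and other containers, use a custom allocator like this: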
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment>
struct aligned_allocator {
    using value_type = T;

    // Needed because Alignment is a non-type template parameter:
    // std::allocator_traits cannot rebind such an allocator by itself.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    aligned_allocator() noexcept = default;
    template <typename U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;
        // The size passed to std::aligned_alloc must be a multiple of the alignment
        std::size_t size = n * sizeof(T);
        if (std::size_t remainder = size % Alignment; remainder != 0)
            size += (Alignment - remainder);
        void* ptr = std::aligned_alloc(Alignment, size);
        if (!ptr) throw std::bad_alloc();
        return static_cast<T*>(ptr);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);
    }

    // Stateless allocator: all instances compare equal.
    // Written out by hand because a defaulted operator== requires C++20.
    friend bool operator==(const aligned_allocator&, const aligned_allocator&) noexcept { return true; }
    friend bool operator!=(const aligned_allocator&, const aligned_allocator&) noexcept { return false; }
};

std::vector<float, aligned_allocator<float, 32>> vec32; // AVX2
std::vector<float, aligned_allocator<float, 64>> vec64; // AVX-512

// AVX2
struct alignas(32) Vec32 {
    float data[8];
};

// AVX-512
struct alignas(64) Vec64 {
    float data[16]; // 16 floats fill all 64 bytes; with data[8] the struct would be padded up to 64 bytes.
};
The standard malloc returns memory aligned for any fundamental type (alignof(std::max_align_t)), which is 16 bytes on typical 64-bit platforms.
Because aligned 128-bit SSE loads require 16-byte alignment, memory from malloc or a default std::vector works fine on those platforms with aligned instructions like _mm_load_ps or _mm_load_pd.
AVX instructions like _mm256_load_ps require 32-byte alignment. Since malloc only guarantees 16-byte alignment, your pointer may fall on a 16-byte boundary that is not a 32-byte boundary.
If you use an aligned AVX load on a 16-byte aligned pointer provided by malloc, you are in undefined behaviour territory: it may appear to work until an allocation happens to land off a 32-byte boundary, and then it crashes.
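If you are handed a pointer of unknown origin, you can test its alignment before choosing the load. The sketch below does that; is_aligned, scale8_auto and the AVX2_FN macro are hypothetical names for this example (the macro assumes GCC or Clang). In practice, simply always using the unaligned intrinsics is often just as good.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// True if the address is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Multiplies p[0..7] by `factor`, using aligned access only when it is safe.
AVX2_FN void scale8_auto(float* p, float factor) {
    const __m256 v_factor = _mm256_set1_ps(factor);
    if (is_aligned(p, 32)) {
        _mm256_store_ps(p, _mm256_mul_ps(_mm256_load_ps(p), v_factor));
    } else {
        _mm256_storeu_ps(p, _mm256_mul_ps(_mm256_loadu_ps(p), v_factor));
    }
}
```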
Example usage
// Build with AVX enabled, e.g. g++ -std=c++17 -mavx2 example.cpp
#include <cstddef>
#include <cstdlib>
#include <immintrin.h>
#include <iostream>

// High-performance real-time processing loop
void apply_audio_gain_avx2(float* data, std::size_t count, float volume_factor) {
    // 1. Broadcast the single scalar factor into all 8 slots of a 256-bit YMM register
    __m256 v_factor = _mm256_set1_ps(volume_factor);
    std::size_t i = 0;

    // 2. Vectorized main loop: process 8 float elements at a time
    std::size_t vectorized_end = count - (count % 8);
    for (; i < vectorized_end; i += 8) {
        // Load 8 contiguous elements. Assumes the data pointer is aligned to 32 bytes!
        __m256 v_samples = _mm256_load_ps(&data[i]);
        // Multiply all 8 elements by the factor in parallel
        __m256 v_result = _mm256_mul_ps(v_samples, v_factor);
        // Store the 8 results back into the same buffer
        _mm256_store_ps(&data[i], v_result);
    }

    // 3. Cleanup / tail loop: handle leftover elements one by one to stay in bounds
    for (; i < count; ++i) {
        data[i] *= volume_factor;
    }
}

int main() {
    const std::size_t total_elements = 1003; // Not a multiple of 8, so the tail loop runs
    const float volume_multiplier = 1.5f;

    // Allocate 32-byte aligned memory with C++17 std::aligned_alloc
    std::size_t alloc_size = total_elements * sizeof(float);
    // Round the size up to a multiple of 32 bytes, as std::aligned_alloc requires
    if (alloc_size % 32 != 0) {
        alloc_size += (32 - (alloc_size % 32));
    }
    float* audio_buffer = static_cast<float*>(std::aligned_alloc(32, alloc_size));
    if (!audio_buffer) {
        std::cerr << "Aligned memory allocation failed!" << std::endl;
        return EXIT_FAILURE;
    }

    // Fill the buffer with dummy audio values
    for (std::size_t i = 0; i < total_elements; ++i) {
        audio_buffer[i] = static_cast<float>(i % 10);
    }

    // Run the vectorized pipeline
    apply_audio_gain_avx2(audio_buffer, total_elements, volume_multiplier);

    // Release the buffer
    std::free(audio_buffer);
    return 0;
}
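As an alternative to the scalar cleanup loop, the tail can also be handled with a masked load and store, which only touch the lanes that are still inside the array. This is a hedged sketch, not a replacement for the code above: the function name apply_gain_masked_tail and the AVX2_FN macro are made up (the macro assumes GCC or Clang), and it uses unaligned accesses so it works for any pointer.

```cpp
#include <cstddef>
#include <immintrin.h>

// Assumption: GCC/Clang. Lets a function use AVX2 without -mavx2;
// with MSVC the macro expands to nothing, so build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

AVX2_FN void apply_gain_masked_tail(float* data, std::size_t count, float gain) {
    const __m256 v_gain = _mm256_set1_ps(gain);
    std::size_t i = 0;
    // Main loop: full blocks of 8 floats.
    for (; i + 8 <= count; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, v_gain));
    }
    // Tail: 0 to 7 elements are left.
    const int remaining = static_cast<int>(count - i);
    if (remaining > 0) {
        // Lane k is enabled (all-ones) when k < remaining.
        const __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining), lane_ids);
        __m256 tail = _mm256_maskload_ps(data + i, mask);   // disabled lanes read as 0
        _mm256_maskstore_ps(data + i, mask, _mm256_mul_ps(tail, v_gain));
    }
}
```

Whether this beats a short scalar loop depends on the CPU and the workload, so measure before switching.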