Version: 0.1.0
SIMD_math_lib is a small C++20 math library that exposes a clean, architecture-independent API for simple SIMD-friendly operations on float data.
The current focus of the project is:
- a stable public API
- a lightweight Vec4f abstraction
- array-based math operations such as add, sub, mul, div, and dot
- ARM NEON optimization for hot paths
- a generic fallback implementation for unsupported targets
- unit tests and micro-benchmarks to validate correctness and measure performance
The goal is to let you write code against one simple interface, while the implementation selects the best available backend for the target architecture.
- Why this library exists
- Current features
- Architecture support
- Project layout
- Public API
- Code examples
- Build instructions
- Install and consume
- Run tests
- Run benchmarks
- Benchmark notes
- Design notes
- Current limitations
- Roadmap
- License
SIMD code is fast, but it is often hard to maintain because it quickly becomes tied to platform-specific intrinsics such as NEON or SSE/AVX.
This project separates the problem into two layers:
- Public API layer
  - the code you use in your application
  - portable and easy to read
- Architecture-specific backend layer
  - the code that uses NEON, SSE, AVX, or a scalar fallback
  - optimized internally without changing the public API
That makes it easier to:
- start with a correct implementation first
- optimize incrementally
- compare architectures
- keep tests stable while changing low-level code
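The two layers described above can be sketched in a single file. This is a minimal illustration, not the library's actual source: the names `sketch::add4` and `sketch::detail::add4_backend` are hypothetical, and the backend is chosen here with compiler-defined macros instead of the project's CMake logic.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

namespace sketch {
namespace detail {

// Backend layer (hypothetical): exactly one body is compiled per target.
inline void add4_backend(const float* a, const float* b, float* out) noexcept {
#if defined(__ARM_NEON)
    // NEON: load 4 floats from each input, add, store 4 floats.
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__SSE__) || defined(_M_X64)
    // SSE: unaligned loads and store, so callers need no special alignment.
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    // Scalar fallback for every other target.
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}

}  // namespace detail

// Public layer (hypothetical): the only function application code calls.
// Its signature does not change when a backend is added or rewritten.
inline void add4(const float* a, const float* b, float* out) noexcept {
    assert(a != nullptr && b != nullptr && out != nullptr);
    detail::add4_backend(a, b, out);
}

}  // namespace sketch
```

Application code only ever names `sketch::add4`, so a test written against it keeps working while the body of `add4_backend` is replaced or tuned.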
Version 0.1.0 currently provides:
- simd::Vec4f
  - load/store 4 floats
  - +, -, *, /
  - dot product
- Array-based operations on float*
  - add
  - sub
  - mul
  - div
  - dot
- Null-pointer guards for array APIs
- Correct handling of lengths that are not multiples of 4
- GoogleTest-based unit tests
- A benchmark executable to compare common operations
At build time, the library selects a backend based on CMAKE_SYSTEM_PROCESSOR.
ARM (NEON):
- Vec4f backend implemented with NEON
- hot array operations in src/import.cpp have direct NEON paths
- current best-optimized path in this version
x86:
- Vec4f backend implemented with SSE-style intrinsics in src/arch/x86/vec4f_x86.cpp
- CMake enables -mavx2 for x86 builds
- the public API stays the same
Generic fallback:
- portable scalar implementation in src/arch/generic/vec4f_generic.cpp
- used when no specialized backend is selected
Note: the most optimized array hot-paths in 0.1.0 are currently the ARM NEON ones.
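The library makes this choice in CMake, from CMAKE_SYSTEM_PROCESSOR. For readers who want to see the same three-way split expressed in source, here is a hedged analogue that uses predefined compiler macros instead. The function `sketch::backend_name` and the strings it returns are hypothetical and are not part of the library.

```cpp
#include <cassert>
#include <string_view>

namespace sketch {

// Hypothetical helper: reports which backend family the predefined compiler
// macros point to. The real project decides this in CMake, not in source.
constexpr std::string_view backend_name() noexcept {
#if defined(__aarch64__) || defined(_M_ARM64)
    return "arm-neon";
#elif defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
    return "x86-sse";
#else
    return "generic";
#endif
}

// Exactly one branch above is compiled, so the name is never empty.
static_assert(!backend_name().empty(), "a backend is always selected");

}  // namespace sketch
```

Because the function is `constexpr`, the result is known at compile time, which mirrors the fact that version 0.1.0 has no runtime CPU feature dispatch.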
SIMD-math-lib/
├── CMakeLists.txt
├── README.md
├── include/
│   └── simd/
│       ├── math.h
│       └── vec4f.h
├── src/
│   ├── CMakeLists.txt
│   ├── import.cpp
│   └── arch/
│       ├── arm/
│       │   └── vec4f_arm.cpp
│       ├── generic/
│       │   └── vec4f_generic.cpp
│       └── x86/
│           └── vec4f_x86.cpp
├── tests/
│   ├── CMakeLists.txt
│   └── test_main.cpp
└── bench/
    └── bench_main.cpp
- include/simd/vec4f.h: Public vector type and small vector operations.
- include/simd/math.h: Public array-based math API.
- src/import.cpp: Main implementation of array operations and hot-path SIMD loops.
- src/arch/...: Architecture-specific Vec4f backends.
- tests/test_main.cpp: Functional and edge-case tests.
- bench/bench_main.cpp: Micro-benchmark runner for add, sub, mul, div, and dot.
Header:
#include <simd/vec4f.h>
Vec4f is a lightweight public type representing 4 float values.
Current interface:
namespace simd {
struct Vec4f {
    float x;
    float y;
    float z;
    float w;
    constexpr Vec4f() noexcept;
    constexpr Vec4f(float x_, float y_, float z_, float w_) noexcept;
    [[nodiscard]] static Vec4f load(const float* ptr) noexcept;
    void store(float* ptr) const noexcept;
    friend Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept;
    friend Vec4f operator-(const Vec4f& lhs, const Vec4f& rhs) noexcept;
    friend Vec4f operator*(const Vec4f& lhs, const Vec4f& rhs) noexcept;
    friend Vec4f operator/(const Vec4f& lhs, const Vec4f& rhs) noexcept;
};
[[nodiscard]] float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept;
}
Header:
#include <simd/math.h>
Current interface:
namespace simd {
void add(const float* a, const float* b, float* out, std::size_t n) noexcept;
void sub(const float* a, const float* b, float* out, std::size_t n) noexcept;
void mul(const float* a, const float* b, float* out, std::size_t n) noexcept;
void div(const float* a, const float* b, float* out, std::size_t n) noexcept;
[[nodiscard]] float dot(const float* a, const float* b, std::size_t n) noexcept;
}
- a, b, and out must point to arrays of at least n elements.
- If a required pointer is nullptr, the function returns immediately.
- Operations work for any n, including values not divisible by 4.
- dot returns 0.0f if either input pointer is nullptr or if n == 0.
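The contract above (null-pointer guard, any length n, dot returning 0.0f for empty or null input) can be illustrated with a portable reference version. This is a sketch under stated assumptions, not the code in src/import.cpp: the `sketch` namespace is hypothetical, and the 4-wide main loop is written in plain scalar code only to show where a 4-lane SIMD kernel would sit.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical reference add that follows the documented contract.
inline void add(const float* a, const float* b, float* out, std::size_t n) noexcept {
    if (a == nullptr || b == nullptr || out == nullptr) {
        return;  // null-pointer guard: do nothing
    }
    std::size_t i = 0;
    // Main loop: 4 elements per iteration, the shape a 4-lane kernel would take.
    for (; i + 4 <= n; i += 4) {
        out[i + 0] = a[i + 0] + b[i + 0];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    // Tail loop: the remaining 0 to 3 elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical reference dot: 0.0f for null input or n == 0.
inline float dot(const float* a, const float* b, std::size_t n) noexcept {
    if (a == nullptr || b == nullptr) {
        return 0.0f;
    }
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

}  // namespace sketch
```

With n = 5, the main loop handles the first four elements and the tail loop handles the fifth, which is the case the "not divisible by 4" guarantee is about.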
#include <simd/math.h>
#include <cstddef>
#include <iostream>
#include <vector>
int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f, 50.0f};
    std::vector<float> out(a.size());
    simd::add(a.data(), b.data(), out.data(), out.size());
    for (float value : out) {
        std::cout << value << ' ';
    }
    std::cout << '\n';
    return 0;
}
Output:
11 22 33 44 55
#include <simd/math.h>
#include <iostream>
#include <vector>
int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f};
    float result = simd::dot(a.data(), b.data(), a.size());
    std::cout << "dot = " << result << '\n';
    return 0;
}
Expected result:
dot = 300
Because:
1*10 + 2*20 + 3*30 + 4*40 = 300
#include <simd/vec4f.h>
#include <iostream>
int main() {
    float lhs_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float rhs_data[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    simd::Vec4f lhs = simd::Vec4f::load(lhs_data);
    simd::Vec4f rhs = simd::Vec4f::load(rhs_data);
    simd::Vec4f sum = lhs + rhs;
    float out[4] = {};
    sum.store(out);
    std::cout << out[0] << ' ' << out[1] << ' ' << out[2] << ' ' << out[3] << '\n';
    std::cout << "dot = " << simd::dot(lhs, rhs) << '\n';
    return 0;
}
Expected output:
6 8 10 12
dot = 70
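To make the Vec4f interface more concrete, here is how a portable scalar backend for it could look. This is a hedged sketch, not the contents of src/arch/generic/vec4f_generic.cpp: it lives in a hypothetical `sketch` namespace, defines everything inline, and implements only `+`, `*`, and `dot` to stay short.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical scalar stand-in for simd::Vec4f.
struct Vec4f {
    float x{0.0f};
    float y{0.0f};
    float z{0.0f};
    float w{0.0f};

    constexpr Vec4f() noexcept = default;
    constexpr Vec4f(float x_, float y_, float z_, float w_) noexcept
        : x(x_), y(y_), z(z_), w(w_) {}

    // Reads 4 consecutive floats starting at ptr.
    [[nodiscard]] static Vec4f load(const float* ptr) noexcept {
        return Vec4f{ptr[0], ptr[1], ptr[2], ptr[3]};
    }

    // Writes the 4 lanes to 4 consecutive floats starting at ptr.
    void store(float* ptr) const noexcept {
        ptr[0] = x;
        ptr[1] = y;
        ptr[2] = z;
        ptr[3] = w;
    }

    // Lane-wise addition.
    friend Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept {
        return Vec4f{lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z, lhs.w + rhs.w};
    }

    // Lane-wise multiplication.
    friend Vec4f operator*(const Vec4f& lhs, const Vec4f& rhs) noexcept {
        return Vec4f{lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z, lhs.w * rhs.w};
    }
};

// Sum of the lane-wise products.
[[nodiscard]] inline float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept {
    return lhs.x * rhs.x + lhs.y * rhs.y + lhs.z * rhs.z + lhs.w * rhs.w;
}

}  // namespace sketch
```

A NEON or SSE backend would keep the same member functions and replace each body with the matching intrinsic, which is what lets the example above run unchanged on every target.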
The easiest integration model right now is to add this repository as a subdirectory.
cmake_minimum_required(VERSION 4.2)
project(my_app LANGUAGES CXX)
add_subdirectory(path/to/SIMD-math-lib)
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE SIMD_math_lib)
Then, in main.cpp:
#include <simd/math.h>
#include <vector>
int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{2.0f, 2.0f, 2.0f, 2.0f};
    std::vector<float> out(a.size());
    simd::mul(a.data(), b.data(), out.data(), out.size());
    return 0;
}
Requirements:
- CMake 4.2 or newer
- a C++20 compiler
- a supported platform toolchain (Apple Clang / Clang / GCC with the required SIMD support)
cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release
cmake --build cmake-build-release
cmake -S . -B cmake-build-debug -DCMAKE_BUILD_TYPE=Debug
cmake --build cmake-build-debug
The root project currently exposes:
- SIMD_BUILD_TESTS=ON by default
- SIMD_BUILD_BENCH=ON by default
Example:
cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DSIMD_BUILD_TESTS=ON -DSIMD_BUILD_BENCH=ON
cmake --build cmake-build-release
This project exports a CMake package, so you can install it and consume it with find_package.
cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$PWD/_install"
cmake --build cmake-build-release
cmake --install cmake-build-release
After installation, you should have:
- _install/include/simd/math.h
- _install/include/simd/vec4f.h
- _install/lib/libSIMD_math_lib.dylib (platform-dependent extension)
- _install/lib/cmake/SIMD_math_lib/SIMD_math_libConfig.cmake
- _install/lib/cmake/SIMD_math_lib/SIMD_math_libTargets.cmake
cmake_minimum_required(VERSION 4.2)
project(my_consumer LANGUAGES CXX)
find_package(SIMD_math_lib CONFIG REQUIRED)
add_executable(my_consumer main.cpp)
target_link_libraries(my_consumer PRIVATE SIMD_math_lib::SIMD_math_lib)
If the package is installed in a custom prefix, set CMAKE_PREFIX_PATH when configuring the consumer project:
cmake -S . -B build -DCMAKE_PREFIX_PATH="/absolute/path/to/_install"
cmake --build build
The project uses GoogleTest through CMake FetchContent, and tests are registered through CTest.
ctest --test-dir cmake-build-release --output-on-failure
At the time of writing, the test suite covers:
- Vec4f load/store
- Vec4f operators
- vector and array dot products
- array add/mul/div
- non-multiple-of-4 sizes
- zero-length input
- null-pointer handling
- a larger mixed-value dot-product validation
The benchmark executable is named SIMD_math_lib_bench.
./cmake-build-release/SIMD_math_lib_bench
The benchmark currently measures:
- add
- sub
- mul
- div
- dot
Two array sizes are used in one run:
- 1 << 24 (16,777,216 elements)
- 1 << 26 (67,108,864 elements)
This gives a quick view of how the library behaves for medium and large workloads.
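An "average time" figure of this kind is usually obtained by timing several repetitions and dividing by the count. The sketch below shows that pattern with `std::chrono::steady_clock`. It is an assumption about the general approach, not a copy of bench/bench_main.cpp: `sketch::average_add_ms` and the scalar `sketch::add` stand-in are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

namespace sketch {

// Scalar stand-in for the operation being timed.
inline void add(const float* a, const float* b, float* out, std::size_t n) noexcept {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Returns the average wall-clock time, in milliseconds, of `repeats` calls
// to add() over arrays of n elements.
inline double average_add_ms(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<float> a(n, 1.0f);
    std::vector<float> b(n, 2.0f);
    std::vector<float> out(n);

    // Warm-up call so that first-touch page faults are not part of the timing.
    add(a.data(), b.data(), out.data(), n);

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        add(a.data(), b.data(), out.data(), n);
    }
    const auto stop = clock::now();

    // Read one result through a volatile so the stores stay observable.
    volatile float sink = out.back();
    (void)sink;

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}

}  // namespace sketch
```

Calling `sketch::average_add_ms(std::size_t{1} << 24, 10)` would time ten passes over the smaller of the two sizes; the result depends entirely on the machine and build flags.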
These numbers were measured during development on an ARM64 / Apple Silicon machine in a Release build. They are examples, not guarantees.
Array size 1 << 24:
| Operation | Average time |
|---|---|
| add | 2.67 ms |
| dot | 2.22 ms |
| sub | 2.60 ms |
| mul | 2.61 ms |
| div | 2.61 ms |
Array size 1 << 26:
| Operation | Average time |
|---|---|
| add | 10.93 ms |
| dot | 9.53 ms |
| sub | 10.52 ms |
| mul | 10.57 ms |
| div | 10.70 ms |
- add, sub, mul, and div are very close to each other for large inputs.
- That usually means the workload is memory-bandwidth bound.
- dot benefits more from arithmetic intensity and register accumulation: it reads two arrays but writes no output array.
- The ARM NEON path gives a very large improvement compared with the original scalar baseline used during development.
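The memory-bandwidth reading can be checked with simple arithmetic: bytes moved per call divided by the measured time. The helper below is a hypothetical sketch that counts only the arrays each operation touches (two reads and one write for add, sub, mul, and div; two reads for dot) and ignores cache effects.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Effective throughput in GB/s (1 GB = 1e9 bytes).
// elements:     number of floats per array
// streams:      arrays touched per call (3 for add/sub/mul/div, 2 for dot)
// milliseconds: measured average time per call
inline double effective_gb_per_s(std::size_t elements, int streams, double milliseconds) {
    assert(streams > 0 && milliseconds > 0.0);
    const double bytes = static_cast<double>(elements) * sizeof(float) * streams;
    return bytes / (milliseconds * 1e-3) / 1e9;
}

}  // namespace sketch
```

Assuming the 2.67 ms figure for add was measured at 1 << 24 elements and the 10.93 ms figure at 1 << 26, the helper gives roughly 75 GB/s and roughly 74 GB/s respectively. Near-identical throughput at both sizes is what a bandwidth limit would look like.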
Vec4f gives users a small, explicit type that is easy to understand and test. It also acts as a clean boundary between:
- the public API
- the backend-specific implementation
Real applications often work on contiguous arrays rather than individual vectors.
That is why simd::add, simd::sub, simd::mul, simd::div, and simd::dot are the main "workhorse" APIs.
The generic backend ensures correctness and portability. The optimized backend ensures performance where SIMD instructions are available.
This approach makes it easier to develop safely:
- write and validate the portable version
- add architecture-specific acceleration
- verify that tests still pass
- compare benchmark numbers
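The workflow above depends on being able to compare an accelerated kernel against the portable one. A minimal sketch of that check follows. All names are hypothetical, and `dot_lanes` is a plain C++ stand-in that uses four independent accumulators to mimic the summation order of a 4-lane SIMD kernel. Because the two versions add terms in a different order, the comparison uses a relative tolerance instead of exact equality.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

// Portable reference: one accumulator, strict left-to-right summation.
inline float dot_reference(const float* a, const float* b, std::size_t n) noexcept {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Stand-in for an accelerated kernel: four accumulators, like four SIMD lanes.
inline float dot_lanes(const float* a, const float* b, std::size_t n) noexcept {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }
    float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);  // horizontal reduction
    for (; i < n; ++i) {
        total += a[i] * b[i];  // scalar tail
    }
    return total;
}

// Relative comparison, with a floor of 1.0 so values near zero still compare.
inline bool nearly_equal(float x, float y, float rel_tol = 1e-5f) noexcept {
    const float scale = std::max(1.0f, std::max(std::fabs(x), std::fabs(y)));
    return std::fabs(x - y) <= rel_tol * scale;
}

// Fills two arrays with a fixed mixed-value pattern and compares both kernels.
inline bool kernels_agree(std::size_t n) {
    std::vector<float> a(n);
    std::vector<float> b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.5f + static_cast<float>(i % 7) * 0.25f;
        b[i] = 1.0f - static_cast<float>(i % 5) * 0.125f;
    }
    return nearly_equal(dot_reference(a.data(), b.data(), n),
                        dot_lanes(a.data(), b.data(), n));
}

}  // namespace sketch
```

Running `kernels_agree` for lengths such as 0, 1, 5, and a larger odd value exercises the empty case, the tail-only case, and the mixed main-loop-plus-tail case.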
Version 0.1.0 is intentionally small.
Current limitations include:
- API currently focuses on float only
- vector width is fixed to 4 lanes in the public type
- no runtime CPU feature dispatch yet
- no higher-level math functions such as sqrt, min, max, fma, abs, or reductions beyond dot
- x86 support exists, but the most aggressively optimized hot-path work has been done on ARM first
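For context on the "no runtime CPU feature dispatch" item, here is one common way such dispatch is done. Version 0.1.0 does not do this, and every name in the `sketch` namespace is hypothetical. The sketch assumes GCC or Clang on x86, where `__builtin_cpu_supports` and the `target` function attribute are available; on every other compiler or architecture it compiles to the scalar path only.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define SKETCH_X86_GNU 1
#include <immintrin.h>
#endif

namespace sketch {

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Portable kernel, always available.
inline void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

#if defined(SKETCH_X86_GNU)
// 8-lane kernel compiled with AVX2 enabled for this one function only,
// so the rest of the program still runs on CPUs without AVX2.
__attribute__((target("avx2")))
inline void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];  // scalar tail
    }
}
#endif

// Asks the CPU at run time which kernel it can execute.
inline AddFn select_add() {
#if defined(SKETCH_X86_GNU)
    if (__builtin_cpu_supports("avx2")) {
        return &add_avx2;
    }
#endif
    return &add_scalar;
}

// Public entry point: the kernel is selected once, on the first call.
inline void add(const float* a, const float* b, float* out, std::size_t n) {
    static const AddFn impl = select_add();
    impl(a, b, out, n);
}

}  // namespace sketch
```

The caller-facing signature matches the array API style used elsewhere in this document, so dispatch of this kind could be added behind the existing interface without changing user code.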
Planned next steps after 0.1.0:
- improve and validate the x86 backend further
- add more operations (min, max, sqrt, fused operations, reductions)
- expand benchmark coverage and reporting
- add more architecture-specific kernels while preserving the same API
- improve documentation and examples further
This project is licensed under the Apache License 2.0.
Copyright © 2026 Samuele Garbuglia
You can find the full license text in the LICENSE file.
You can find the project notice in the NOTICE file.
SIMD_math_lib 0.1.0 is a compact C++20 SIMD math library built around a simple idea:
one clear public interface, multiple internal backends
If you want a small project that is easy to read, test, benchmark, and evolve toward more advanced SIMD support, this is exactly what this repository is designed for.