SIMD_math_lib

Version: 0.1.0

SIMD_math_lib is a small C++20 math library that exposes a clean, architecture-independent API for simple SIMD-friendly operations on float data.

The current focus of the project is:

- a stable public API
- a lightweight Vec4f abstraction
- array-based math operations such as add, sub, mul, div, and dot
- ARM NEON optimization for hot paths
- a generic fallback implementation for unsupported targets
- unit tests and micro-benchmarks to validate correctness and measure performance

The goal is to let you write code against one simple interface, while the implementation selects the best available backend for the target architecture.

Why this library exists

SIMD code is fast, but it is often hard to maintain because it quickly becomes tied to platform-specific intrinsics such as NEON or SSE/AVX.

This project separates the problem into two layers:

- Public API layer
  - the code you use in your application
  - portable and easy to read
- Architecture-specific backend layer
  - the code that uses NEON, SSE, AVX, or a scalar fallback
  - optimized internally without changing the public API

That makes it easier to:

- start with a correct implementation first
- optimize incrementally
- compare architectures
- keep tests stable while changing low-level code
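The two-layer split described above can be sketched in a single file. This is a minimal illustration, not the library's actual source: the `sketch` namespace and the `add4` helper are hypothetical, and the preprocessor checks stand in for the build-time backend selection that the real project performs in CMake. The intrinsics used (`vld1q_f32`, `vaddq_f32`, `vst1q_f32`, `_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) are the standard NEON and SSE ones.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON)
  #include <arm_neon.h>
#elif defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>
#endif

namespace sketch {

// Backend layer (hypothetical helper): adds exactly four floats with whatever
// instruction set the compiler targets. The #if chain is an assumption made
// for this single-file sketch.
inline void add4(const float* a, const float* b, float* out) noexcept {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if defined(__ARM_NEON)
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__SSE__) || defined(_M_X64)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}

// Public layer: a portable signature with no intrinsics visible to the caller.
inline void add(const float* a, const float* b, float* out,
                std::size_t n) noexcept {
    if (a == nullptr || b == nullptr || out == nullptr) {
        return;
    }
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        add4(a + i, b + i, out + i);  // full blocks go through the backend
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];         // leftover 0 to 3 elements
    }
}

}  // namespace sketch
```

Callers only ever see `sketch::add`, so swapping the body of `add4` for a different instruction set would not change any calling code or test.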

Current features

Version 0.1.0 currently provides:

- simd::Vec4f
  - load/store 4 floats
  - +, -, *, /
  - dot product
- Array-based operations on float*
  - add
  - sub
  - mul
  - div
  - dot
- Null-pointer guards for array APIs
- Correct handling of lengths that are not multiples of 4
- GoogleTest-based unit tests
- A benchmark executable to compare common operations
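Handling lengths that are not multiples of 4 typically comes down to a blocked main loop followed by a short tail loop. The sketch below shows that shape in plain scalar C++; `sketch::mul_blocked` and `sketch::tail_is_handled` are hypothetical names used only for illustration, and the real kernels may be structured differently.

```cpp
#include <cstddef>

namespace sketch {

// Element-wise multiply. The main loop consumes four elements per iteration
// (the shape a 4-lane SIMD kernel would have); the tail loop handles the
// remaining 0 to 3 elements one at a time.
constexpr void mul_blocked(const float* a, const float* b, float* out,
                           std::size_t n) noexcept {
    if (a == nullptr || b == nullptr || out == nullptr) {
        return;
    }
    const std::size_t blocked_end = n - (n % 4);  // largest multiple of 4 <= n
    std::size_t i = 0;
    for (; i < blocked_end; i += 4) {
        out[i + 0] = a[i + 0] * b[i + 0];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}

// Compile-time usage check: seven elements means one block of four plus a
// tail of three.
constexpr bool tail_is_handled() {
    float a[7] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f};
    float b[7] = {2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f};
    float out[7] = {};
    mul_blocked(a, b, out, 7);
    for (std::size_t i = 0; i < 7; ++i) {
        if (out[i] != a[i] * 2.0f) {
            return false;
        }
    }
    return true;
}

static_assert(tail_is_handled(), "every element, including the tail, is written");

}  // namespace sketch
```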

Architecture support

At build time, the library selects a backend based on CMAKE_SYSTEM_PROCESSOR.

ARM / AArch64

- Vec4f backend implemented with NEON
- hot array operations in src/import.cpp have direct NEON paths
- current best-optimized path in this version

x86 / x86_64

- Vec4f backend implemented with SSE-style intrinsics in src/arch/x86/vec4f_x86.cpp
- CMake enables -mavx2 for x86 builds
- the public API stays the same

Generic fallback

- portable scalar implementation in src/arch/generic/vec4f_generic.cpp
- used when no specialized backend is selected

Note: the most optimized array hot-paths in 0.1.0 are currently the ARM NEON ones.

Project layout

SIMD-math-lib/
├── CMakeLists.txt
├── README.md
├── include/
│   └── simd/
│       ├── math.h
│       └── vec4f.h
├── src/
│   ├── CMakeLists.txt
│   ├── import.cpp
│   └── arch/
│       ├── arm/
│       │   └── vec4f_arm.cpp
│       ├── generic/
│       │   └── vec4f_generic.cpp
│       └── x86/
│           └── vec4f_x86.cpp
├── tests/
│   ├── CMakeLists.txt
│   └── test_main.cpp
└── bench/
    └── bench_main.cpp

What each part does

- include/simd/vec4f.h: public vector type and small vector operations.
- include/simd/math.h: public array-based math API.
- src/import.cpp: main implementation of array operations and hot-path SIMD loops.
- src/arch/...: architecture-specific Vec4f backends.
- tests/test_main.cpp: functional and edge-case tests.
- bench/bench_main.cpp: micro-benchmark runner for add, sub, mul, div, and dot.

Public API

`simd::Vec4f`

Header:

#include <simd/vec4f.h>

Vec4f is a lightweight public type representing 4 float values.

Current interface:

namespace simd {
    struct Vec4f {
        float x;
        float y;
        float z;
        float w;

        constexpr Vec4f() noexcept;
        constexpr Vec4f(float x_, float y_, float z_, float w_) noexcept;

        [[nodiscard]] static Vec4f load(const float* ptr) noexcept;
        void store(float* ptr) const noexcept;

        friend Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator-(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator*(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator/(const Vec4f& lhs, const Vec4f& rhs) noexcept;
    };

    [[nodiscard]] float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept;
}
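To make the interface above concrete, here is a minimal scalar sketch of how a generic (non-SIMD) backend could satisfy it. It lives in a hypothetical `sketch` namespace and is not the code in src/arch/generic/vec4f_generic.cpp. Two things are assumptions of this sketch: the default constructor zero-initializes all lanes, and every member is marked `constexpr` so the behaviour can be checked at compile time (the real `load` and `store` are not declared `constexpr`).

```cpp
namespace sketch {

struct Vec4f {
    float x;
    float y;
    float z;
    float w;

    // Assumption: default construction yields all-zero lanes.
    constexpr Vec4f() noexcept : x(0.0f), y(0.0f), z(0.0f), w(0.0f) {}
    constexpr Vec4f(float x_, float y_, float z_, float w_) noexcept
        : x(x_), y(y_), z(z_), w(w_) {}

    // Reads four consecutive floats starting at ptr.
    [[nodiscard]] static constexpr Vec4f load(const float* ptr) noexcept {
        return Vec4f(ptr[0], ptr[1], ptr[2], ptr[3]);
    }

    // Writes the four lanes to four consecutive floats starting at ptr.
    constexpr void store(float* ptr) const noexcept {
        ptr[0] = x;
        ptr[1] = y;
        ptr[2] = z;
        ptr[3] = w;
    }

    friend constexpr Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept {
        return Vec4f(lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z, lhs.w + rhs.w);
    }
    friend constexpr Vec4f operator*(const Vec4f& lhs, const Vec4f& rhs) noexcept {
        return Vec4f(lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z, lhs.w * rhs.w);
    }
};

// Sum of lane-wise products.
[[nodiscard]] constexpr float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept {
    return lhs.x * rhs.x + lhs.y * rhs.y + lhs.z * rhs.z + lhs.w * rhs.w;
}

// Compile-time usage check: load followed by store reproduces the input.
constexpr bool round_trip_ok() {
    const float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {};
    Vec4f::load(in).store(out);
    return out[0] == 1.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f;
}

static_assert(round_trip_ok(), "load/store round trip");
static_assert(dot(Vec4f(1.0f, 2.0f, 3.0f, 4.0f), Vec4f(5.0f, 6.0f, 7.0f, 8.0f)) == 70.0f,
              "1*5 + 2*6 + 3*7 + 4*8 = 70");

}  // namespace sketch
```

Only `+` and `*` are shown; `-` and `/` would follow the same lane-wise pattern.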

Array operations

Header:

#include <simd/math.h>

Current interface:

namespace simd {
    void add(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void sub(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void mul(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void div(const float* a, const float* b, float* out, std::size_t n) noexcept;

    [[nodiscard]] float dot(const float* a, const float* b, std::size_t n) noexcept;
}

Semantics

- For add, sub, mul, and div, the pointers a, b, and out must each point to an array of at least n elements; for dot, the same applies to a and b.
- If a required pointer is nullptr, the function returns immediately.
- Operations work for any n, including values not divisible by 4.
- dot returns 0.0f if either input pointer is nullptr or if n == 0.
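The rules above can be written down as a scalar reference that is easy to reason about. This is a sketch of the documented behaviour only, under the assumption that "returns immediately" means the output array is left untouched; `sketch::ref_sub`, `sketch::ref_dot`, and `sketch::semantics_hold` are hypothetical names, not part of the library.

```cpp
#include <cstddef>

namespace sketch {

// Reference for the element-wise operations, shown here for subtraction.
constexpr void ref_sub(const float* a, const float* b, float* out,
                       std::size_t n) noexcept {
    if (a == nullptr || b == nullptr || out == nullptr) {
        return;  // null guard: nothing is read or written
    }
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] - b[i];
    }
}

// Reference for dot: null inputs and n == 0 both yield 0.0f.
[[nodiscard]] constexpr float ref_dot(const float* a, const float* b,
                                      std::size_t n) noexcept {
    if (a == nullptr || b == nullptr) {
        return 0.0f;
    }
    float sum = 0.0f;  // with n == 0 the loop body never runs
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Compile-time walk through each documented rule.
constexpr bool semantics_hold() {
    const float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    const float b[5] = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f};
    float out[5] = {-1.0f, -1.0f, -1.0f, -1.0f, -1.0f};

    ref_sub(a, nullptr, out, 5);  // null input: out must stay untouched
    if (out[0] != -1.0f || out[4] != -1.0f) {
        return false;
    }

    ref_sub(b, a, out, 5);  // n = 5 is not a multiple of 4
    if (out[0] != 9.0f || out[4] != 45.0f) {
        return false;
    }

    return ref_dot(a, b, 0) == 0.0f &&        // zero length
           ref_dot(nullptr, b, 5) == 0.0f &&  // null input
           ref_dot(a, b, 5) == 550.0f;        // 10 + 40 + 90 + 160 + 250
}

static_assert(semantics_hold(), "documented semantics");

}  // namespace sketch
```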

Code examples

Example 1: add two arrays

#include <simd/math.h>

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f, 50.0f};
    std::vector<float> out(a.size());

    simd::add(a.data(), b.data(), out.data(), out.size());

    for (float value : out) {
        std::cout << value << ' ';
    }
    std::cout << '\n';

    return 0;
}

Output:

11 22 33 44 55

Example 2: dot product

#include <simd/math.h>

#include <iostream>
#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f};

    float result = simd::dot(a.data(), b.data(), a.size());
    std::cout << "dot = " << result << '\n';

    return 0;
}

Expected result:

dot = 300

Because:

1*10 + 2*20 + 3*30 + 4*40 = 300

Example 3: work with `Vec4f`

#include <simd/vec4f.h>

#include <iostream>

int main() {
    float lhs_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float rhs_data[4] = {5.0f, 6.0f, 7.0f, 8.0f};

    simd::Vec4f lhs = simd::Vec4f::load(lhs_data);
    simd::Vec4f rhs = simd::Vec4f::load(rhs_data);

    simd::Vec4f sum = lhs + rhs;

    float out[4] = {};
    sum.store(out);

    std::cout << out[0] << ' ' << out[1] << ' ' << out[2] << ' ' << out[3] << '\n';
    std::cout << "dot = " << simd::dot(lhs, rhs) << '\n';

    return 0;
}

Expected output:

6 8 10 12
dot = 70

Example 4: use the library from another CMake project

The easiest integration model right now is to add this repository as a subdirectory.

Your `CMakeLists.txt`

cmake_minimum_required(VERSION 4.2)
project(my_app LANGUAGES CXX)

add_subdirectory(path/to/SIMD-math-lib)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE SIMD_math_lib)

Your `main.cpp`

#include <simd/math.h>

#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{2.0f, 2.0f, 2.0f, 2.0f};
    std::vector<float> out(a.size());

    simd::mul(a.data(), b.data(), out.data(), out.size());
    return 0;
}

Build instructions

Requirements:

- CMake 4.2 or newer
- a C++20 compiler
- a supported platform toolchain (Apple Clang / Clang / GCC with the required SIMD support)

Configure a Release build

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release
cmake --build cmake-build-release

Configure a Debug build

cmake -S . -B cmake-build-debug -DCMAKE_BUILD_TYPE=Debug
cmake --build cmake-build-debug

CMake options

The root project currently exposes:

- SIMD_BUILD_TESTS = ON by default
- SIMD_BUILD_BENCH = ON by default

Example:

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DSIMD_BUILD_TESTS=ON -DSIMD_BUILD_BENCH=ON
cmake --build cmake-build-release

Install and consume

This project exports a CMake package, so you can install it and consume it with find_package.

Install locally

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$PWD/_install"
cmake --build cmake-build-release
cmake --install cmake-build-release

After installation, you should have:

- _install/include/simd/math.h
- _install/include/simd/vec4f.h
- _install/lib/libSIMD_math_lib.dylib (the file extension depends on the platform)
- _install/lib/cmake/SIMD_math_lib/SIMD_math_libConfig.cmake
- _install/lib/cmake/SIMD_math_lib/SIMD_math_libTargets.cmake

Consume from another CMake project

cmake_minimum_required(VERSION 4.2)
project(my_consumer LANGUAGES CXX)

find_package(SIMD_math_lib CONFIG REQUIRED)

add_executable(my_consumer main.cpp)
target_link_libraries(my_consumer PRIVATE SIMD_math_lib::SIMD_math_lib)

If the package is installed in a custom prefix, set CMAKE_PREFIX_PATH when configuring the consumer project:

cmake -S . -B build -DCMAKE_PREFIX_PATH="/absolute/path/to/_install"
cmake --build build

Run tests

The project uses GoogleTest through CMake FetchContent, and tests are registered through CTest.

ctest --test-dir cmake-build-release --output-on-failure

At the time of writing, the test suite covers:

- Vec4f load/store
- Vec4f operators
- vector and array dot products
- array add/mul/div
- non-multiple-of-4 sizes
- zero-length input
- null-pointer handling
- a larger mixed-value dot-product validation

Run benchmarks

The benchmark executable is named SIMD_math_lib_bench.

./cmake-build-release/SIMD_math_lib_bench

The benchmark currently measures:

- add
- sub
- mul
- div
- dot

Two array sizes are used in one run:

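- 1 << 24
- 1 << 26

At 4 bytes per float, each array occupies 64 MiB and 256 MiB respectively, so both sizes are well beyond typical CPU cache capacities. This gives a quick view of how the library behaves on large, memory-resident workloads.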

Benchmark notes

These numbers were measured during development on an ARM64 / Apple Silicon machine in a Release build. They are examples, not guarantees.

Example results

`N = 1 << 24` (`16,777,216` elements)

| Operation | Average time |
| --- | --- |
| `add` | `2.67 ms` |
| `dot` | `2.22 ms` |
| `sub` | `2.60 ms` |
| `mul` | `2.61 ms` |
| `div` | `2.61 ms` |

`N = 1 << 26` (`67,108,864` elements)

| Operation | Average time |
| --- | --- |
| `add` | `10.93 ms` |
| `dot` | `9.53 ms` |
| `sub` | `10.52 ms` |
| `mul` | `10.57 ms` |
| `div` | `10.70 ms` |

Interpretation

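- add, sub, mul, and div take nearly the same time at both sizes, even though division is a more expensive instruction than addition.
- That usually means the workload is bound by memory bandwidth rather than by arithmetic: each of these operations reads two arrays and writes one.
- dot is somewhat faster, most likely because it reads two arrays but writes only a single scalar and keeps its partial sums in registers.
- The ARM NEON path gives a very large improvement compared with the original scalar baseline used during development.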
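One way to sanity-check the memory-bandwidth explanation is to convert the timings above into an effective bandwidth. The sketch below does that arithmetic for the `N = 1 << 24` rows. It is a rough estimate under stated assumptions: `sketch::effective_gb_per_s` is a hypothetical helper, a float is taken to be 4 bytes, only the arrays named in each function signature are counted (3 for add, 2 for dot), and any extra traffic the cache hierarchy generates for stores is ignored.

```cpp
#include <cstddef>

namespace sketch {

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that streams
// `arrays_touched` float arrays of `n` elements each in `milliseconds`.
constexpr double effective_gb_per_s(std::size_t n, int arrays_touched,
                                    double milliseconds) {
    const double bytes =
        static_cast<double>(n) * sizeof(float) * arrays_touched;
    return bytes / (milliseconds * 1e-3) / 1e9;
}

constexpr std::size_t kN24 = std::size_t{1} << 24;  // 16,777,216 elements

// add reads a and b and writes out: three arrays, 2.67 ms in the table.
constexpr double kAddGbps = effective_gb_per_s(kN24, 3, 2.67);  // about 75.4

// dot reads a and b only: two arrays, 2.22 ms in the table.
constexpr double kDotGbps = effective_gb_per_s(kN24, 2, 2.22);  // about 60.5

static_assert(kAddGbps > 75.0 && kAddGbps < 76.0, "add estimate");
static_assert(kDotGbps > 60.0 && kDotGbps < 61.0, "dot estimate");

}  // namespace sketch
```

Under these assumptions the element-wise operations sustain tens of gigabytes per second, which is consistent with a kernel limited by memory traffic rather than by arithmetic throughput.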

Design notes

Why `Vec4f` is public

Vec4f gives users a small, explicit type that is easy to understand and test. It also acts as a clean boundary between:

- the public API
- the backend-specific implementation

Why array functions still matter

Real applications often work on contiguous arrays rather than individual vectors.

That is why simd::add, simd::sub, simd::mul, simd::div, and simd::dot are the main "workhorse" APIs.

Why there is both a generic and an optimized backend

The generic backend ensures correctness and portability. The optimized backend ensures performance where SIMD instructions are available.

This approach makes it easier to develop safely:

1. write and validate the portable version
2. add architecture-specific acceleration
3. verify that tests still pass
4. compare benchmark numbers
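When validating an accelerated kernel against the portable one, reductions such as dot need a tolerance rather than exact equality: a 4-lane kernel adds the products in a different order than a sequential loop, and floating-point addition is not associative. The sketch below emulates a 4-lane accumulation in scalar code and compares it with the sequential result. All names in the `sketch` namespace are hypothetical, the tolerance of 1e-5 is an illustrative choice, and pointers are assumed valid to keep the example short.

```cpp
#include <cstddef>

namespace sketch {

// Portable reference: one running sum, left to right.
constexpr float dot_sequential(const float* a, const float* b,
                               std::size_t n) noexcept {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Scalar emulation of a 4-lane kernel: four independent partial sums,
// combined at the end, followed by a tail loop.
constexpr float dot_four_lanes(const float* a, const float* b,
                               std::size_t n) noexcept {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            lane[k] += a[i + k] * b[i + k];
        }
    }
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Relative comparison, with a floor of 1.0 on the scale for small values.
constexpr bool nearly_equal(float x, float y, float rel_tol) noexcept {
    const float diff = x > y ? x - y : y - x;
    const float ax = x < 0.0f ? -x : x;
    const float ay = y < 0.0f ? -y : y;
    const float scale = ax > ay ? ax : ay;
    return diff <= rel_tol * (scale > 1.0f ? scale : 1.0f);
}

// Compile-time check on 11 elements (two full blocks plus a tail of three).
constexpr bool kernels_agree() {
    float a[11] = {};
    float b[11] = {};
    for (std::size_t i = 0; i < 11; ++i) {
        a[i] = 0.1f * static_cast<float>(i + 1);
        b[i] = 1.0f / static_cast<float>(i + 1);
    }
    return nearly_equal(dot_sequential(a, b, 11), dot_four_lanes(a, b, 11), 1e-5f);
}

static_assert(kernels_agree(), "lane-wise and sequential dot agree within tolerance");

}  // namespace sketch
```

Element-wise operations such as add or mul do not reorder any arithmetic, so for those an exact comparison against the portable result is a reasonable test.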

Current limitations

Version 0.1.0 is intentionally small.

Current limitations include:

- API currently focuses on float only
- vector width is fixed to 4 lanes in the public type
- no runtime CPU feature dispatch yet
- no higher-level math functions such as sqrt, min, max, fma, abs, or reductions beyond dot
- x86 support exists, but the most aggressively optimized hot-path work has been done on ARM first

Roadmap

Planned next steps after 0.1.0:

- improve and validate the x86 backend further
- add more operations (min, max, sqrt, fused operations, reductions)
- expand benchmark coverage and reporting
- add more architecture-specific kernels while preserving the same API
- improve documentation and examples further

License

This project is licensed under the Apache License 2.0.

You can find the full license text in the LICENSE file. You can find the project notice in the NOTICE file.

Summary

SIMD_math_lib 0.1.0 is a compact C++20 SIMD math library built around a simple idea:

one clear public interface, multiple internal backends

If you want a small project that is easy to read, test, benchmark, and evolve toward more advanced SIMD support, this is exactly what this repository is designed for.
