garbsam97/SIMD-Math-Lib

SIMD_math_lib

Version: 0.1.0

SIMD_math_lib is a small C++20 math library that exposes a clean, architecture-independent API for simple SIMD-friendly operations on float data.

The current focus of the project is:

  • a stable public API
  • a lightweight Vec4f abstraction
  • array-based math operations such as add, sub, mul, div, and dot
  • ARM NEON optimization for hot paths
  • a generic fallback implementation for unsupported targets
  • unit tests and micro-benchmarks to validate correctness and measure performance

The goal is to let you write code against one simple interface, while the implementation selects the best available backend for the target architecture.



Why this library exists

SIMD code is fast, but it is often hard to maintain because it quickly becomes tied to platform-specific intrinsics such as NEON or SSE/AVX.

This project separates the problem into two layers:

  1. Public API layer
    • the code you use in your application
    • portable and easy to read
  2. Architecture-specific backend layer
    • the code that uses NEON, SSE, AVX, or a scalar fallback
    • optimized internally without changing the public API

That makes it easier to:

  • start with a correct implementation first
  • optimize incrementally
  • compare architectures
  • keep tests stable while changing low-level code
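
The two layers can be sketched in a single file. This is a hypothetical illustration, not the library's code: the `sketch` namespace, the `add4` function, and the preprocessor-based selection are all assumptions made for the example (the library itself selects its backend through CMake and separate source files).

```cpp
#include <cassert>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

// Public layer: one portable signature, no intrinsics visible to the caller.
// Adds four floats from a and b and writes the four sums to out.
inline void add4(const float* a, const float* b, float* out) noexcept {
    assert(a != nullptr && b != nullptr && out != nullptr);  // sketch-only precondition
#if defined(__ARM_NEON)
    // Backend layer: NEON path.
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__SSE__)
    // Backend layer: SSE path, using unaligned loads and stores.
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    // Backend layer: scalar fallback.
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}

}  // namespace sketch
```

Callers only ever see `add4(a, b, out)`, so a backend can be added or rewritten without touching application code or tests.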

Current features

Version 0.1.0 currently provides:

  • simd::Vec4f
    • load/store 4 floats
    • +, -, *, /
    • dot product
  • Array-based operations on float*
    • add
    • sub
    • mul
    • div
    • dot
  • Null-pointer guards for array APIs
  • Correct handling of lengths that are not multiples of 4
  • GoogleTest-based unit tests
  • A benchmark executable to compare common operations
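
Null-pointer guards and lengths that are not multiples of 4 are commonly handled with an early return, a loop over full 4-element blocks, and a scalar tail. The sketch below shows that pattern under stated assumptions: `sketch::add_blocked` is a hypothetical name, and the unrolled block body only stands in for a real 4-lane SIMD instruction.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

inline void add_blocked(const float* a, const float* b, float* out,
                        std::size_t n) noexcept {
    // Null-pointer guard: do nothing if any required pointer is missing.
    if (a == nullptr || b == nullptr || out == nullptr) {
        return;
    }

    // Largest multiple of 4 that is not greater than n.
    const std::size_t blocked = n - (n % 4);
    assert(blocked % 4 == 0 && blocked <= n);

    std::size_t i = 0;
    for (; i < blocked; i += 4) {
        // Stands in for one 4-lane SIMD add.
        out[i + 0] = a[i + 0] + b[i + 0];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }

    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

}  // namespace sketch
```

With n = 5, the block loop handles elements 0 to 3 and the tail loop handles element 4.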

Architecture support

At configure time, CMake selects a backend based on CMAKE_SYSTEM_PROCESSOR, so the choice is fixed for each build.

ARM / AArch64

  • Vec4f backend implemented with NEON
  • hot array operations in src/import.cpp have direct NEON paths
  • current best-optimized path in this version

x86 / x86_64

  • Vec4f backend implemented with SSE-style intrinsics in src/arch/x86/vec4f_x86.cpp
  • CMake enables -mavx2 for x86 builds
  • the public API stays the same

Generic fallback

  • portable scalar implementation in src/arch/generic/vec4f_generic.cpp
  • used when no specialized backend is selected

Note: the most optimized array hot-paths in 0.1.0 are currently the ARM NEON ones.


Project layout

SIMD-math-lib/
├── CMakeLists.txt
├── README.md
├── include/
│   └── simd/
│       ├── math.h
│       └── vec4f.h
├── src/
│   ├── CMakeLists.txt
│   ├── import.cpp
│   └── arch/
│       ├── arm/
│       │   └── vec4f_arm.cpp
│       ├── generic/
│       │   └── vec4f_generic.cpp
│       └── x86/
│           └── vec4f_x86.cpp
├── tests/
│   ├── CMakeLists.txt
│   └── test_main.cpp
└── bench/
    └── bench_main.cpp

What each part does

  • include/simd/vec4f.h
    Public vector type and small vector operations.
  • include/simd/math.h
    Public array-based math API.
  • src/import.cpp
    Main implementation of array operations and hot-path SIMD loops.
  • src/arch/...
    Architecture-specific Vec4f backends.
  • tests/test_main.cpp
    Functional and edge-case tests.
  • bench/bench_main.cpp
    Micro-benchmark runner for add, sub, mul, div, and dot.

Public API

simd::Vec4f

Header:

#include <simd/vec4f.h>

Vec4f is a lightweight public type representing 4 float values.

Current interface:

namespace simd {
    struct Vec4f {
        float x;
        float y;
        float z;
        float w;

        constexpr Vec4f() noexcept;
        constexpr Vec4f(float x_, float y_, float z_, float w_) noexcept;

        [[nodiscard]] static Vec4f load(const float* ptr) noexcept;
        void store(float* ptr) const noexcept;

        friend Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator-(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator*(const Vec4f& lhs, const Vec4f& rhs) noexcept;
        friend Vec4f operator/(const Vec4f& lhs, const Vec4f& rhs) noexcept;
    };

    [[nodiscard]] float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept;
}
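
For orientation, a portable scalar version of part of this interface might look like the following. It is a minimal sketch, not the contents of src/arch/generic/vec4f_generic.cpp: the `sketch` namespace is hypothetical, only operator+ is shown, and the null checks are sketch-only assertions (the interface above does not say how load and store treat a null pointer).

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

struct Vec4f {
    float x;
    float y;
    float z;
    float w;

    // Reads four consecutive floats starting at ptr.
    [[nodiscard]] static Vec4f load(const float* ptr) noexcept {
        assert(ptr != nullptr);  // sketch-only precondition
        return Vec4f{ptr[0], ptr[1], ptr[2], ptr[3]};
    }

    // Writes the four lanes to four consecutive floats starting at ptr.
    void store(float* ptr) const noexcept {
        assert(ptr != nullptr);  // sketch-only precondition
        ptr[0] = x;
        ptr[1] = y;
        ptr[2] = z;
        ptr[3] = w;
    }

    // Lane-wise addition; -, *, and / follow the same shape.
    friend Vec4f operator+(const Vec4f& lhs, const Vec4f& rhs) noexcept {
        return Vec4f{lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z, lhs.w + rhs.w};
    }
};

// Sum of the lane-wise products.
[[nodiscard]] inline float dot(const Vec4f& lhs, const Vec4f& rhs) noexcept {
    return lhs.x * rhs.x + lhs.y * rhs.y + lhs.z * rhs.z + lhs.w * rhs.w;
}

}  // namespace sketch
```

A NEON or SSE backend would keep the same member names and signatures and replace only the function bodies.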

Array operations

Header:

#include <simd/math.h>

Current interface:

namespace simd {
    void add(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void sub(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void mul(const float* a, const float* b, float* out, std::size_t n) noexcept;
    void div(const float* a, const float* b, float* out, std::size_t n) noexcept;

    [[nodiscard]] float dot(const float* a, const float* b, std::size_t n) noexcept;
}

Semantics

  • a, b, and out must point to arrays of at least n elements.
  • If a required pointer is nullptr, the function returns immediately.
  • Operations work for any n, including values not divisible by 4.
  • dot returns 0.0f if either input pointer is nullptr or if n == 0.
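
The return-value rules for dot can be written down as a scalar reference function plus a few checks. This is a sketch of the documented behavior only: `sketch::dot_reference` and `sketch::check_documented_semantics` are hypothetical names, and the loop says nothing about how the library's optimized kernels are written.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

// Scalar reference that mirrors the documented semantics of simd::dot.
[[nodiscard]] inline float dot_reference(const float* a, const float* b,
                                         std::size_t n) noexcept {
    if (a == nullptr || b == nullptr) {
        return 0.0f;  // null input: documented result is 0.0f
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;  // n == 0 skips the loop and yields 0.0f
}

// Exercises each rule from the list above.
inline void check_documented_semantics() {
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float b[3] = {4.0f, 5.0f, 6.0f};
    assert(dot_reference(nullptr, b, 3) == 0.0f);
    assert(dot_reference(a, nullptr, 3) == 0.0f);
    assert(dot_reference(a, b, 0) == 0.0f);
    assert(dot_reference(a, b, 3) == 32.0f);  // 1*4 + 2*5 + 3*6
}

}  // namespace sketch
```

A reference like this is also a convenient oracle when testing an optimized kernel against the same rules.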

Code examples

Example 1: add two arrays

#include <simd/math.h>

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f, 50.0f};
    std::vector<float> out(a.size());

    simd::add(a.data(), b.data(), out.data(), out.size());

    for (float value : out) {
        std::cout << value << ' ';
    }
    std::cout << '\n';

    return 0;
}

Output:

11 22 33 44 55

Example 2: dot product

#include <simd/math.h>

#include <iostream>
#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f, 40.0f};

    float result = simd::dot(a.data(), b.data(), a.size());
    std::cout << "dot = " << result << '\n';

    return 0;
}

Expected result:

dot = 300

Because:

1*10 + 2*20 + 3*30 + 4*40 = 300

Example 3: work with Vec4f

#include <simd/vec4f.h>

#include <iostream>

int main() {
    float lhs_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float rhs_data[4] = {5.0f, 6.0f, 7.0f, 8.0f};

    simd::Vec4f lhs = simd::Vec4f::load(lhs_data);
    simd::Vec4f rhs = simd::Vec4f::load(rhs_data);

    simd::Vec4f sum = lhs + rhs;

    float out[4] = {};
    sum.store(out);

    std::cout << out[0] << ' ' << out[1] << ' ' << out[2] << ' ' << out[3] << '\n';
    std::cout << "dot = " << simd::dot(lhs, rhs) << '\n';

    return 0;
}

Expected output:

6 8 10 12
dot = 70

Because:

1*5 + 2*6 + 3*7 + 4*8 = 70

Example 4: use the library from another CMake project

The easiest integration model right now is to add this repository as a subdirectory.

Your CMakeLists.txt

cmake_minimum_required(VERSION 4.2)
project(my_app LANGUAGES CXX)

add_subdirectory(path/to/SIMD-math-lib)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE SIMD_math_lib)

Your main.cpp

#include <simd/math.h>

#include <vector>

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b{2.0f, 2.0f, 2.0f, 2.0f};
    std::vector<float> out(a.size());

    simd::mul(a.data(), b.data(), out.data(), out.size());
    return 0;
}

Build instructions

Requirements:

  • CMake 4.2 or newer
  • a C++20 compiler
  • a supported platform toolchain (Apple Clang / Clang / GCC with the required SIMD support)

Configure a Release build

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release
cmake --build cmake-build-release

Configure a Debug build

cmake -S . -B cmake-build-debug -DCMAKE_BUILD_TYPE=Debug
cmake --build cmake-build-debug

CMake options

The root project currently exposes:

  • SIMD_BUILD_TESTS = ON by default
  • SIMD_BUILD_BENCH = ON by default

Example:

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DSIMD_BUILD_TESTS=ON -DSIMD_BUILD_BENCH=ON
cmake --build cmake-build-release

Install and consume

This project exports a CMake package, so you can install it and consume it with find_package.

Install locally

cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$PWD/_install"
cmake --build cmake-build-release
cmake --install cmake-build-release

After installation, you should have:

  • _install/include/simd/math.h
  • _install/include/simd/vec4f.h
  • _install/lib/libSIMD_math_lib.dylib (the library file extension is platform dependent)
  • _install/lib/cmake/SIMD_math_lib/SIMD_math_libConfig.cmake
  • _install/lib/cmake/SIMD_math_lib/SIMD_math_libTargets.cmake

Consume from another CMake project

cmake_minimum_required(VERSION 4.2)
project(my_consumer LANGUAGES CXX)

find_package(SIMD_math_lib CONFIG REQUIRED)

add_executable(my_consumer main.cpp)
target_link_libraries(my_consumer PRIVATE SIMD_math_lib::SIMD_math_lib)

If the package is installed in a custom prefix, set CMAKE_PREFIX_PATH when configuring the consumer project:

cmake -S . -B build -DCMAKE_PREFIX_PATH="/absolute/path/to/_install"
cmake --build build

Run tests

The project uses GoogleTest through CMake FetchContent, and tests are registered through CTest.

ctest --test-dir cmake-build-release --output-on-failure

At the time of writing, the test suite covers:

  • Vec4f load/store
  • Vec4f operators
  • vector and array dot products
  • array add/mul/div
  • non-multiple-of-4 sizes
  • zero-length input
  • null-pointer handling
  • a larger mixed-value dot-product validation

Run benchmarks

The benchmark executable is named SIMD_math_lib_bench.

./cmake-build-release/SIMD_math_lib_bench

The benchmark currently measures:

  • add
  • sub
  • mul
  • div
  • dot

Two array sizes are used in one run:

  • 1 << 24
  • 1 << 26

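Each float array occupies 64 MiB at 1 << 24 and 256 MiB at 1 << 26, so the working set generally does not fit in CPU caches. This gives a quick view of how the library behaves for large and very large workloads.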
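
The source of bench/bench_main.cpp is not shown here, so the following is only a sketch of how an "average time per call" figure can be measured. All names (`sketch::average_ms`, `sketch::add_scalar`, `sketch::time_add`) are hypothetical, and the scalar loop stands in for whichever kernel is being timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

// Average wall-clock milliseconds per call of fn over `runs` repetitions.
template <typename Fn>
double average_ms(Fn&& fn, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}

// Scalar stand-in for the kernel under test.
inline void add_scalar(const float* a, const float* b, float* out,
                       std::size_t n) noexcept {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Times add_scalar on freshly allocated arrays of n floats.
inline double time_add(std::size_t n, int runs) {
    std::vector<float> a(n, 1.0f);
    std::vector<float> b(n, 2.0f);
    std::vector<float> out(n);
    const double ms = average_ms(
        [&] { add_scalar(a.data(), b.data(), out.data(), n); }, runs);
    // Read one result so the compiler cannot discard the work entirely.
    volatile float sink = out.empty() ? 0.0f : out[n / 2];
    (void)sink;
    return ms;
}

}  // namespace sketch
```

Calling `sketch::time_add(std::size_t{1} << 24, 10)` would time ten passes over 16,777,216 elements; results depend heavily on the machine and build type.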


Benchmark notes

These numbers were measured during development on an ARM64 / Apple Silicon machine in a Release build. They are examples, not guarantees.

Example results

N = 1 << 24 (16,777,216 elements)

| Operation | Average time |
|-----------|--------------|
| add       | 2.67 ms      |
| dot       | 2.22 ms      |
| sub       | 2.60 ms      |
| mul       | 2.61 ms      |
| div       | 2.61 ms      |

N = 1 << 26 (67,108,864 elements)

| Operation | Average time |
|-----------|--------------|
| add       | 10.93 ms     |
| dot       | 9.53 ms      |
| sub       | 10.52 ms     |
| mul       | 10.57 ms     |
| div       | 10.70 ms     |

Interpretation

  • add, sub, mul, and div are very close to each other for large inputs.
  • That usually means the workload is memory-bandwidth bound.
  • dot is somewhat faster, which is consistent with its higher arithmetic intensity: it accumulates in registers and writes a single scalar instead of an output array.
  • The ARM NEON path gives a very large improvement compared with the original scalar baseline used during development.
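
One way to check the memory-bandwidth reading is to convert the example timings into an effective bandwidth. The helper below is a hypothetical sketch (`sketch::effective_gb_per_s` is not part of the library). It assumes add touches three arrays (two reads, one write) and dot touches two (reads only), and it ignores any extra traffic the hardware generates for stores.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that streams
// `arrays_touched` float arrays of n elements in `ms` milliseconds.
inline double effective_gb_per_s(std::size_t n, int arrays_touched, double ms) {
    assert(ms > 0.0 && arrays_touched > 0);
    const double bytes =
        static_cast<double>(n) * sizeof(float) * arrays_touched;
    const double seconds = ms * 1e-3;
    return bytes / seconds / 1e9;
}

}  // namespace sketch
```

For N = 1 << 24, `effective_gb_per_s(1 << 24, 3, 2.67)` gives roughly 75 GB/s for add, and `effective_gb_per_s(1 << 24, 2, 2.22)` gives roughly 60 GB/s for dot. Figures in the tens of GB/s for simple streaming kernels are consistent with the memory system, rather than the arithmetic, setting the pace.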

Design notes

Why Vec4f is public

Vec4f gives users a small, explicit type that is easy to understand and test. It also acts as a clean boundary between:

  • the public API
  • the backend-specific implementation

Why array functions still matter

Real applications often work on contiguous arrays rather than individual vectors.

That is why simd::add, simd::sub, simd::mul, simd::div, and simd::dot are the main "workhorse" APIs.

Why there is both a generic and an optimized backend

The generic backend ensures correctness and portability. The optimized backend ensures performance where SIMD instructions are available.

This approach makes it easier to develop safely:

  1. write and validate the portable version
  2. add architecture-specific acceleration
  3. verify that tests still pass
  4. compare benchmark numbers
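
Steps 1 to 3 can be sketched as follows. The names are hypothetical, and `dot_4lane` is plain C++ that only imitates the summation order of a 4-lane SIMD kernel. Because that order differs from the scalar loop, the two results are compared with a relative tolerance rather than exact equality.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

namespace sketch {  // hypothetical namespace, not part of SIMD_math_lib

// Step 1: portable reference, summing left to right.
inline float dot_scalar(const float* a, const float* b, std::size_t n) noexcept {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Step 2: stand-in for an accelerated kernel. Four independent partial sums
// play the role of four SIMD lanes, followed by a scalar tail.
inline float dot_4lane(const float* a, const float* b, std::size_t n) noexcept {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const std::size_t blocked = n - (n % 4);
    std::size_t i = 0;
    for (; i < blocked; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            lane[k] += a[i + k] * b[i + k];
        }
    }
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);  // horizontal add
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Step 3: agreement check that tolerates a different summation order.
inline bool agrees(float expected, float actual, float rel_tol) noexcept {
    assert(rel_tol >= 0.0f);
    const float scale = std::fmax(1.0f, std::fabs(expected));
    return std::fabs(expected - actual) <= rel_tol * scale;
}

}  // namespace sketch
```

A test would then assert `agrees(dot_scalar(a, b, n), dot_4lane(a, b, n), tol)` for several sizes, including ones that are not multiples of 4.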

Current limitations

Version 0.1.0 is intentionally small.

Current limitations include:

  • the API supports float only
  • vector width is fixed to 4 lanes in the public type
  • no runtime CPU feature dispatch yet
  • no higher-level math functions such as sqrt, min, max, fma, abs, or reductions beyond dot
  • x86 support exists, but the most aggressively optimized hot-path work has been done on ARM first

Roadmap

Planned next steps after 0.1.0:

  • improve and validate the x86 backend further
  • add more operations (min, max, sqrt, fused operations, reductions)
  • expand benchmark coverage and reporting
  • add more architecture-specific kernels while preserving the same API
  • improve documentation and examples further

License

This project is licensed under the Apache License 2.0.

Copyright © 2026 Samuele Garbuglia

You can find the full license text in the LICENSE file. You can find the project notice in the NOTICE file.


Summary

SIMD_math_lib 0.1.0 is a compact C++20 SIMD math library built around a simple idea:

one clear public interface, multiple internal backends

If you want a small project that is easy to read, test, benchmark, and evolve toward more advanced SIMD support, this is exactly what this repository is designed for.

About

C++20 multi-architecture SIMD math library with a unified API, optimized ARM NEON kernels, generic fallback, CMake package export, and benchmarks.
