Skip to content

n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA

Repository files navigation

OptMathKernels

High-Performance Numerical Library for ARM SBCs and NVIDIA GPUs

OptMathKernels is a C++20 numerical library optimized for Raspberry Pi 5, Orange Pi 6 Plus, and NVIDIA CUDA GPUs. It seamlessly bridges Eigen (CPU), ARM NEON (SIMD), ARM SVE2 (Scalable Vectors), Vulkan (Compute Shaders), and CUDA (NVIDIA GPUs) into a single, easy-to-use API.

Designed to accelerate math and signal processing tasks by leveraging:

  • Raspberry Pi 5: Cortex-A76 NEON and VideoCore VII GPU
  • Orange Pi 6 Plus: Cortex-A720 SVE2/FCMA/I8MM and Mali-G720-Immortalis GPU
  • NVIDIA GPUs: cuBLAS, cuFFT, cuSOLVER, and Tensor Cores (Volta+)

While remaining compatible with standard Linux x86/ARM environments.


Table of Contents


Key Applications

Passive Radar Signal Processing

OptMathKernels powers the PassiveRadar_Kraken project, providing hardware-accelerated kernels for:

ARM NEON (Raspberry Pi 5):

Operation Speedup Application
Complex multiply 4-8x CAF Doppler shifting
Dot product 4-6x NLMS adaptive filter
GEMM (blocked) 3-5x Beamforming, covariance
Transcendentals 10-50x Phase computation
FFT (Vulkan) 10x Large batch processing

NVIDIA CUDA (Workstation/Server):

Operation Speedup GPU Features
GEMM (1024x1024) 50-100x Tensor Cores (Ampere+)
FFT (64K points) 20-50x cuFFT optimized
CAF (full map) 30-80x Batched FFT + complex ops
CFAR 2D 10-30x Parallel threshold compute
Beamforming 20-50x cuBLAS matrix-vector
Transcendentals 50-200x CUDA fast math intrinsics

General Numerical Computing

  • Machine learning inference (activation functions, matrix ops)
  • Digital signal processing (filtering, FFT, convolution)
  • Scientific computing (linear algebra, statistics)
  • Real-time audio/video processing

Features

Core Capabilities

  • NEON Acceleration: Hand-tuned ARMv8 NEON intrinsics for SIMD acceleration on 64-bit ARM (aarch64). Includes optimized matrix multiplication, convolution, and vector math.
  • Vulkan Compute: Massive parallel offloading to the GPU (VideoCore VII on Pi 5). Supports large vector operations, matrix math, FFT (Radix-2/4), and reductions.
  • Eigen Integration: Fully compatible with Eigen::VectorXf, Eigen::MatrixXf, and Eigen::VectorXcf. Pass your existing data structures directly to accelerated kernels.
  • Easy Integration: Standard CMake package that installs to /usr/local and works with find_package(OptMathKernels).

Radar Signal Processing (optmath::radar)

  • Cross-Ambiguity Function (CAF): Core passive radar operation for range-Doppler detection
  • CFAR Detection: 1D/2D Cell-Averaging and Ordered Statistic CFAR detectors
  • Clutter Filtering: NLMS adaptive filter and projection-based clutter cancellation
  • Doppler Processing: FFT-based Doppler processing and MTI filters
  • Beamforming: Delay-and-sum and phase-shift beamformers with steering vector generation
  • Window Functions: Hamming, Hanning, Blackman, Blackman-Harris, Kaiser

Optimized Kernels (optmath::neon)

  • Vectorized Transcendentals: Fast NEON-accelerated exp, sin, cos, sigmoid, tanh (~10-50x faster than scalar)
  • Complex Operations: Vectorized complex multiply, conjugate multiply, magnitude, phase
  • Cache-Blocked GEMM: 8x8 microkernel with runtime cache blocking (MC/KC/NC auto-tuned for detected L3 cache size)
  • Reductions: Sum, max, min, dot product with horizontal NEON adds

SVE2 Acceleration (optmath::sve2)

  • Predicated Vector Operations: All loops use svwhilelt/svptest predication - zero tail-handling code
  • FCMA Complex Multiply: 2-instruction complex multiply via svcmla (vs 4 NEON instructions)
  • I8MM GEMM: Int8 matrix multiply using svmmla_s32 with float32 quantize/dequantize
  • SVE2 Cache-Blocked GEMM: Vectorized 8x8 microkernel with svmla_n_f32_z FMA, tuned for Cortex-A720 12MB L3 (MC=256, KC=512, NC=2048)
  • SVE2 Transcendentals: Single-pass inlined polynomial approximations (no heap allocations), cos/sigmoid/tanh fused with range reduction
  • SVE2 Radar DSP: FCMA-accelerated CAF with vectorized Doppler shift (batch sve2_fast_cos/sin), cross-correlation, and beamforming
  • SVE2 Complex Operations: Split and interleaved formats, dot product, magnitude, phase; vectorized complex_exp via fast sin/cos

Platform Detection (optmath::platform)

  • CPU Topology Detection: Identifies big.LITTLE core clusters via /proc/cpuinfo and sysfs
  • Thread Affinity: Pin threads to performance (A720) or efficiency (A520) cores
  • SVE Vector Length: Runtime detection via prctl(PR_SVE_GET_VL)
  • L2/L3 Cache Detection: Per-core L2 and shared L3 cache sizes from sysfs with part-ID fallbacks
  • Cache-Aware GEMM Tuning: Auto-selects blocking parameters based on detected L2/L3 cache sizes

DSP Kernels (optmath::neon)

  • Polyphase Resampler: Rational L/M rate conversion with NEON-optimized FIR per phase, streaming and one-shot APIs
  • Biquad IIR Filter: Direct Form II Transposed with cascade support, design helpers for lowpass/highpass/bandpass/notch
  • 2D Convolution: NEON-vectorized general NxM convolution, separable kernels, unrolled 3x3 and 5x5 specializations

Dense Linear Algebra (optmath::neon)

  • Triangular Solve (TRSV/TRSM): Column-oriented forward/backward substitution with NEON-vectorized AXPY updates, unit-diagonal and transpose variants, multi-RHS support
  • Cholesky Decomposition: Unblocked right-looking A = L*L^T with NEON dot products, SPD validation with error reporting
  • LU Decomposition: Partial pivoting with NEON-vectorized column scaling and rank-1 trailing updates
  • QR Decomposition: Householder reflections with NEON-vectorized reflector application, explicit Q extraction
  • Solvers: General solve (LU), SPD solve (Cholesky), matrix inverse (LU + TRSM)

GPU Acceleration (optmath::vulkan)

  • Tiled GPU Matrix Multiply: 16x16 shared memory tiles (default), 32x32 tiles for Mali-G720
  • FFT: Radix-2/4 FFT with butterfly operations in compute shaders
  • Convolution: 1D and 2D convolution with separable kernel optimization
  • Vector Operations: Add, multiply, dot product, reductions
  • Mali-G720 Optimized: Auto-selects 1024-thread shaders with subgroup arithmetic on Mali GPUs

NVIDIA CUDA Acceleration (optmath::cuda)

  • cuBLAS Integration: Level 1, 2, and 3 BLAS operations with Tensor Core support
  • cuFFT: High-performance FFT up to 64M points, batched transforms
  • cuSOLVER: Cholesky, LU, QR, SVD, eigenvalue decomposition
  • Tensor Cores: Automatic acceleration on Volta+ (SM 7.0+) and Ampere+ (SM 8.0+)
  • Mixed Precision: FP32, TF32, FP16 compute modes
  • Multi-GPU: Device enumeration, peer-to-peer access, workload distribution
  • Unified Memory: Simplified CPU-GPU data management
  • Radar Processing: Full GPU-accelerated CAF, CFAR, beamforming, NLMS filter
  • Optimized CAF: Zero CPU-GPU transfers inside Doppler loop for maximum throughput

Hardware Support

Orange Pi 6 Plus (CIX P1)

Target Hardware: Orange Pi 6 Plus with CIX P1 CD8160 SoC (ARMv9)

Feature Specification
CPU 8x Cortex-A720 + 4x Cortex-A520 (big.LITTLE)
SIMD NEON (128-bit) + SVE2 (128-bit VL) + FCMA + I8MM + BF16
GPU Mali-G720-Immortalis (Vulkan 1.3)
Memory 8GB/16GB LPDDR5
L1 Cache 64KB per core
L2 Cache 512KB per core
L3 Cache 12MB shared

SVE2 Optimizations:

  • Predicated loops eliminate all scalar tail handling (vs 28+ NEON tail loops)
  • FCMA complex multiply: 2 instructions vs 4 NEON instructions per complex multiply
  • I8MM: Hardware int8 matrix multiply with svmmla_s32
  • Cache blocking tuned for 12MB L3: MC=256, KC=512, NC=2048

Mali-G720-Immortalis GPU (Vulkan):

Feature Specification
Vulkan 1.3
Shared Memory 32KB per workgroup
Max Workgroup Size 1024
Subgroup Size 16
Subgroup Ops Arithmetic (subgroupAdd, etc.)
  • 32x32 tiled GEMM shader (1024 threads, 8KB shared memory)
  • 1024-thread reduction with subgroup-level arithmetic

Build for Orange Pi 6 Plus:

cmake -DCMAKE_BUILD_TYPE=Release \
      -DENABLE_NEON=ON \
      -DENABLE_SVE2=ON \
      -DENABLE_VULKAN=ON \
      -DENABLE_CUDA=OFF \
      -DBUILD_TESTS=ON ..
make -j$(nproc)

Raspberry Pi 5

Target Hardware: Raspberry Pi 5 with Broadcom BCM2712 SoC (Cortex-A76)

Feature Specification
CPU Quad-core ARM Cortex-A76 @ 2.4 GHz
SIMD NEON (128-bit Advanced SIMD)
GPU VideoCore VII (Vulkan 1.2)
Memory 4GB/8GB LPDDR4X-4267
L1 Cache 64KB per core
L2 Cache 512KB per core
L3 Cache 2MB shared

NEON Optimizations:

  • Complex multiply-accumulate (4 complex float32 per cycle)
  • Vectorized dot products (8 float32 per instruction)
  • SIMD FFT butterflies with vfmaq_f32
  • Parallel magnitude/phase computation
  • Cache-blocked GEMM tuned for A76 cache hierarchy

Cache Blocking Parameters (auto-tuned per platform):

// Pi 5 (2MB L3):  MC=128, KC=256, NC=512
// Orange Pi 6+ (12MB L3): MC=256, KC=512, NC=2048
// 8x8 microkernel with 4x4 register tiles (32 NEON registers)
neon_gemm_blocked_f32(C, A, B, M, N, K, lda, ldb, ldc);

Performance on Pi 5:

Operation NEON Time Scalar Time Speedup
Complex dot 4096 0.8 μs 12.4 μs 15.5x
GEMM 256x256 1.2 ms 18 ms 15x
CAF 4096x64 5.2 ms 82 ms 15.8x
CFAR 2D 256x64 0.9 ms 11.2 ms 12.4x
Exp 1M elements 1.2 ms 54 ms 45x
Sin 1M elements 1.5 ms 48 ms 32x

x86_64 (Intel/AMD)

Supported Processors: Any x86_64 CPU with SSE4.2 or AVX2

Feature Minimum Recommended
CPU Intel Core i3 / AMD Ryzen 3 Intel Core i7 / AMD Ryzen 7
SIMD SSE4.2 AVX2/AVX-512
Memory 8GB DDR4 16GB+ DDR4/DDR5

Eigen3 Auto-Vectorization: The OptMathKernels library uses Eigen3's expression templates which automatically vectorize for AVX/AVX2/AVX-512 when available on x86_64.

# Check for AVX2 support
cat /proc/cpuinfo | grep avx2

NVIDIA CUDA/RTX

Supported GPUs: NVIDIA GPUs with Compute Capability 5.0+ (Maxwell and later)

GPU Generation Architecture Compute Capability CUDA Cores Tensor Cores Memory
GTX 900 Maxwell 5.0/5.2 1024-3072 - 2-12 GB
GTX 1000 Pascal 6.1 1280-3584 - 3-11 GB
Tesla P100 Pascal 6.0 3584 - 12-16 GB
Tesla V100 Volta 7.0 5120 640 16-32 GB
RTX 2000 Turing 7.5 2304-4608 288-576 8-11 GB
RTX 3000 Ampere 8.6 3584-10496 112-328 8-24 GB
A100 Ampere 8.0 6912 432 40-80 GB
RTX 4000 Ada Lovelace 8.9 5888-16384 184-512 8-24 GB
H100 Hopper 9.0 14592 456 80 GB
Jetson Orin Ampere 8.7 1024-2048 32-64 8-64 GB
RTX 5000 Blackwell 10.0 21760 680 32 GB

Architecture Features:

Architecture SM FP16 Tensor Cores TF32 FP8
Maxwell 5.x
Pascal 6.x
Volta 7.0 ✅ Gen 1
Turing 7.5 ✅ Gen 2
Ampere 8.x ✅ Gen 3
Ada 8.9 ✅ Gen 4
Hopper 9.x ✅ Gen 4
Blackwell 10.x ✅ Gen 5

CUDA Toolkit Requirements:

GPU Architecture Minimum CUDA Recommended CUDA SM Version
Maxwell (GTX 9xx) CUDA 6.5 CUDA 11.x+ SM 5.0/5.2
Pascal (GTX 10xx) CUDA 8.0 CUDA 11.x+ SM 6.0/6.1
Volta (V100) CUDA 9.0 CUDA 11.x+ SM 7.0
Turing (RTX 20xx) CUDA 10.0 CUDA 12.x SM 7.5
Ampere (RTX 30xx) CUDA 11.0 CUDA 12.x SM 8.0/8.6
Ada (RTX 40xx) CUDA 11.8 CUDA 12.x SM 8.9
Hopper (H100) CUDA 12.0 CUDA 12.x SM 9.0
Blackwell (RTX 50xx) CUDA 12.8 CUDA 13.0+ SM 10.0

Note: RTX 5090 (Blackwell) requires CUDA 12.8+ for native SM 10.0 support. Earlier CUDA versions (e.g., 12.0 from Ubuntu packages) will compile with SM 9.0a for Hopper, but this may cause runtime issues due to PTX forward compatibility limitations. For optimal Blackwell performance, install CUDA 12.8+ directly from NVIDIA.

CUDA Performance (RTX 4090):

Operation CUDA Time CPU Time Speedup
Complex FFT 65536 0.12 ms 8.4 ms 70x
GEMM 1024x1024 0.08 ms 45 ms 562x
2D Convolution 512x512 0.15 ms 120 ms 800x
CAF 16384x256 1.2 ms 450 ms 375x
Exp 1M elements 0.03 ms 54 ms 1800x

Vulkan GPU Compute

Supported GPUs: Any Vulkan 1.2+ capable GPU

Vendor GPUs Vulkan Version Driver
NVIDIA GTX 900+, RTX 20xx/30xx/40xx/50xx 1.3 Proprietary
AMD RX 400+, RDNA/RDNA2/RDNA3 1.3 Mesa RADV
Intel UHD 600+, Arc 1.3 Mesa ANV
Raspberry Pi 5 VideoCore VII 1.2 Mesa V3D
Orange Pi 6 Plus Mali-G720-Immortalis 1.3 Mali (panfrost/panvk)
Qualcomm Adreno 6xx+ 1.1 Proprietary

Vulkan on RTX 50xx: The Vulkan backend provides full GPU acceleration on Blackwell GPUs regardless of CUDA toolkit version. This is a good fallback when CUDA 12.8+ is not available.

VideoCore VII (Raspberry Pi 5):

Feature Specification
QPU Cores 12
Clock 800 MHz
Compute ~1 GFLOPS FP32
Shared Memory 4KB per workgroup
Max Workgroup Size 256

Vulkan Compute Shaders: 40 GLSL compute shaders compiled to SPIR-V:

  • Vector operations (add, sub, mul, div, dot, norm)
  • Matrix operations (add, mul, transpose, scale)
  • Reductions (sum, max, min, prefix scan)
  • Convolution (1D, 2D, optimized variants)
  • FFT (radix-2, radix-4, forward/inverse)
  • Radar (CAF Doppler shift, CFAR 2D)

Installation

Prerequisites

Raspberry Pi 5 / Ubuntu / Debian

sudo apt update
sudo apt install -y \
    build-essential \
    cmake \
    git \
    pkg-config \
    libeigen3-dev \
    libvulkan-dev \
    mesa-vulkan-drivers \
    vulkan-tools \
    glslang-tools \
    glslc

NVIDIA CUDA Support (Optional)

# Ubuntu/Debian with NVIDIA driver already installed (CUDA 12.0)
# Note: Ubuntu packages provide CUDA 12.0 which does NOT support Blackwell (RTX 50xx)
sudo apt install -y nvidia-cuda-toolkit

# For RTX 50xx (Blackwell) - REQUIRES CUDA 12.8+ from NVIDIA
# Download from: https://developer.nvidia.com/cuda-downloads

# Ubuntu 22.04/24.04 - NVIDIA repo (gets latest CUDA)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8  # or latest available

# Add to PATH (use NVIDIA's version, not Ubuntu's)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify installation
nvcc --version   # Should show 12.8+ for Blackwell
nvidia-smi       # Shows driver and GPU info

GPU Architecture Detection:

# Check your GPU's compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# Example output: "NVIDIA GeForce RTX 5090, 10.0" (Blackwell)

Build & Install

1. Clone the Repository

git clone https://github.com/n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA.git
cd OptimizedKernelsForRaspberryPi5_NvidiaCUDA

2. Configure and Build

Raspberry Pi 5 (NEON + Vulkan):

mkdir -p build && cd build

cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
      -DENABLE_NEON=ON \
      -DENABLE_VULKAN=ON \
      -DENABLE_CUDA=OFF \
      -DBUILD_TESTS=ON \
      -DBUILD_BENCHMARKS=OFF \
      -DCMAKE_INSTALL_PREFIX=/usr/local \
      ..

make -j$(nproc)

NVIDIA Workstation (CUDA + optional Vulkan):

mkdir -p build && cd build

# Default: All architectures Maxwell through Blackwell (CUDA 13+)
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
      -DENABLE_NEON=OFF \
      -DENABLE_VULKAN=ON \
      -DENABLE_CUDA=ON \
      -DBUILD_TESTS=ON \
      -DCMAKE_INSTALL_PREFIX=/usr/local \
      ..

# Specific architectures - Modern GPUs only (RTX 20xx/30xx/40xx)
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_ARCHITECTURES="75;86;89" \
      -DENABLE_CUDA=ON \
      ..

# Legacy GPUs (GTX 900/1000 series) - CUDA 11.x compatible
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_ARCHITECTURES="50;52;60;61" \
      -DENABLE_CUDA=ON \
      ..

# For RTX 50xx Blackwell (REQUIRES CUDA 12.8+ or CUDA 13)
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
      -DENABLE_NEON=OFF \
      -DENABLE_VULKAN=ON \
      -DENABLE_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="100" \
      -DCMAKE_CXX_FLAGS="-O3 -march=native" \
      -DBUILD_TESTS=ON \
      -DCMAKE_INSTALL_PREFIX=/usr/local \
      ..

make -j$(nproc)

CMake Options:

Option Default Description
ENABLE_NEON ON (ARM) Enable ARM NEON SIMD
ENABLE_SVE2 ON Enable ARM SVE2 (ARMv9)
ENABLE_FCMA ON Enable FCMA complex multiply
ENABLE_I8MM ON Enable I8MM int8 matrix multiply
ENABLE_VULKAN ON Enable Vulkan compute
ENABLE_CUDA ON Enable NVIDIA CUDA
CMAKE_CUDA_ARCHITECTURES 50;52;60;61;70;75;80;86;89;100;103 CUDA compute capabilities (Maxwell through Blackwell)
BUILD_TESTS ON Build GoogleTest tests
BUILD_BENCHMARKS OFF Build Google Benchmark
CMAKE_POSITION_INDEPENDENT_CODE ON Enable -fPIC (set globally)

3. Run Tests

ctest --output-on-failure

4. Install

sudo make install
sudo ldconfig

5. Verify Installation

ls /usr/local/lib/libOptMathKernels*
ls /usr/local/include/optmath/
ls /usr/local/lib/cmake/OptMathKernels/

API Reference

For complete API documentation of all 473 functions, see:

FunctionsIncluded.md - Complete API Reference

Quick Reference by Backend

Backend Functions Description
NEON 104 ARM SIMD operations for Raspberry Pi 5 / Orange Pi 6 Plus
SVE2 46 ARM SVE2/FCMA/I8MM for ARMv9 (Orange Pi 6 Plus)
Platform 10 CPU topology detection, thread affinity, cache tuning
CUDA 242 NVIDIA GPU kernels (cuBLAS, cuFFT, cuSOLVER)
Vulkan 23 Cross-platform GPU compute shaders
Radar 48 Passive radar signal processing

Headers

#include <optmath/neon_kernels.hpp>    // ARM NEON operations
#include <optmath/sve2_kernels.hpp>    // ARM SVE2/FCMA/I8MM operations
#include <optmath/platform.hpp>        // CPU detection, thread affinity
#include <optmath/vulkan_backend.hpp>  // Vulkan GPU compute
#include <optmath/cuda_backend.hpp>    // NVIDIA CUDA operations
#include <optmath/radar_kernels.hpp>   // Radar signal processing

CMake Integration

cmake_minimum_required(VERSION 3.18)
project(MyApp)

find_package(OptMathKernels REQUIRED)
find_package(Eigen3 REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE
    OptMathKernels::OptMathKernels
    Eigen3::Eigen
)

Usage Examples

NEON Vector Operations

#include <optmath/neon_kernels.hpp>

if (optmath::neon::is_available()) {
    Eigen::VectorXf a = Eigen::VectorXf::Random(1024);
    Eigen::VectorXf b = Eigen::VectorXf::Random(1024);

    // Basic operations
    Eigen::VectorXf c = optmath::neon::neon_add(a, b);
    Eigen::VectorXf d = optmath::neon::neon_mul(a, b);
    float dot = optmath::neon::neon_dot(a, b);
    float norm = optmath::neon::neon_norm(a);

    // Reductions
    float sum = optmath::neon::neon_reduce_sum(a);
    float max = optmath::neon::neon_reduce_max(a);
}

NEON Matrix Operations

Eigen::MatrixXf A = Eigen::MatrixXf::Random(256, 256);
Eigen::MatrixXf B = Eigen::MatrixXf::Random(256, 256);

// Basic GEMM (4x4 tiled)
Eigen::MatrixXf C = optmath::neon::neon_gemm(A, B);

// Optimized blocked GEMM (8x8 microkernel, cache blocking)
Eigen::MatrixXf D = optmath::neon::neon_gemm_blocked(A, B);

// Other operations
Eigen::MatrixXf At = optmath::neon::neon_mat_transpose(A);
Eigen::MatrixXf As = optmath::neon::neon_mat_scale(A, 2.5f);

NEON Vectorized Transcendentals

std::vector<float> input(N), output(N);

// Fast exp approximation (~12% relative error at extremes, better near zero)
optmath::neon::neon_fast_exp_f32(output.data(), input.data(), N);

// Fast sin/cos approximation (~1e-5 accuracy)
optmath::neon::neon_fast_sin_f32(output.data(), input.data(), N);
optmath::neon::neon_fast_cos_f32(output.data(), input.data(), N);

// Fast activation functions (~3% sigmoid, ~6% tanh)
optmath::neon::neon_fast_sigmoid_f32(output.data(), input.data(), N);
optmath::neon::neon_fast_tanh_f32(output.data(), input.data(), N);

NEON Complex Operations

Eigen::VectorXcf a = Eigen::VectorXcf::Random(1024);
Eigen::VectorXcf b = Eigen::VectorXcf::Random(1024);

// Element-wise complex operations
Eigen::VectorXcf c = optmath::neon::neon_complex_mul(a, b);
Eigen::VectorXcf d = optmath::neon::neon_complex_conj_mul(a, b);  // a * conj(b)

// Complex dot product: sum(a * conj(b))
std::complex<float> dot = optmath::neon::neon_complex_dot(a, b);

// Magnitude and phase
Eigen::VectorXf mag = optmath::neon::neon_complex_magnitude(a);
Eigen::VectorXf phase = optmath::neon::neon_complex_phase(a);

NEON DSP: Polyphase Resampler

#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;

// Rational rate conversion by L/M (e.g., 48kHz -> 44.1kHz ≈ 147/160)
std::size_t L = 147, M = 160;

// Design a lowpass prototype filter (cutoff at min(pi/L, pi/M))
std::vector<float> filter = /* your FIR lowpass filter */;

// One-shot resampling
std::vector<float> input(10000), output(input.size() * L / M + 100);
std::size_t out_len = 0;
neon_resample_oneshot_f32(output.data(), &out_len,
                           input.data(), input.size(),
                           filter.data(), filter.size(), L, M);

// Streaming resampling (maintains state between calls)
PolyphaseResamplerState state;
neon_resample_init(state, filter.data(), filter.size(), L, M);
std::size_t n = neon_resample_f32(output.data(), input.data(), input.size(), state);

// Eigen wrapper
Eigen::VectorXf in_vec = Eigen::VectorXf::Random(10000);
Eigen::VectorXf filt_vec = Eigen::Map<Eigen::VectorXf>(filter.data(), filter.size());
Eigen::VectorXf result = neon_resample(in_vec, filt_vec, L, M);

NEON DSP: Biquad IIR Filter

#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;

// Design a lowpass filter (fc=1000Hz, fs=48000Hz, Q=0.707)
BiquadCoeffs lp = neon_biquad_lowpass(1000.0f, 48000.0f);

// Process audio samples
std::vector<float> input(4096), output(4096);
BiquadState state;
neon_biquad_f32(output.data(), input.data(), input.size(), lp, state);

// Cascade for higher-order filtering (4th order = 2 biquad sections)
BiquadCoeffs cascade[2] = {
    neon_biquad_lowpass(1000.0f, 48000.0f),
    neon_biquad_lowpass(1000.0f, 48000.0f)
};
BiquadState states[2] = {};
neon_biquad_cascade_f32(output.data(), input.data(), input.size(),
                         cascade, states, 2);

// Other filter types
BiquadCoeffs hp = neon_biquad_highpass(500.0f, 48000.0f);
BiquadCoeffs bp = neon_biquad_bandpass(1000.0f, 48000.0f, 5.0f);
BiquadCoeffs notch = neon_biquad_notch(60.0f, 48000.0f, 30.0f);  // 60Hz hum removal

NEON DSP: 2D Convolution

#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;

// Input image in row-major layout
std::size_t rows = 480, cols = 640;
std::vector<float> image(rows * cols);

// 3x3 Sobel edge detection (horizontal)
float sobel_x[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
std::vector<float> edges((rows - 2) * (cols - 2));
neon_conv2d_3x3_f32(edges.data(), image.data(), rows, cols, sobel_x);

// 5x5 Gaussian blur
float gauss5x5[25] = { /* your kernel */ };
std::vector<float> blurred((rows - 4) * (cols - 4));
neon_conv2d_5x5_f32(blurred.data(), image.data(), rows, cols, gauss5x5);

// Separable convolution (faster for separable kernels)
float row_k[] = {0.25f, 0.5f, 0.25f};
float col_k[] = {0.25f, 0.5f, 0.25f};
neon_conv2d_separable_f32(blurred.data(), image.data(), rows, cols,
                           row_k, 3, col_k, 3);

// Eigen wrapper
Eigen::MatrixXf img = Eigen::MatrixXf::Random(64, 64);
Eigen::MatrixXf kern(3, 3);
kern << 1, 2, 1, 2, 4, 2, 1, 2, 1;
kern /= 16.0f;
Eigen::MatrixXf result = neon_conv2d(img, kern);

NEON Dense Linear Algebra

#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;

// Cholesky decomposition (A = L * L^T)
Eigen::MatrixXf A = /* symmetric positive definite matrix */;
Eigen::MatrixXf L = neon_cholesky(A);  // returns lower triangular L

// LU decomposition with partial pivoting
Eigen::MatrixXf M = Eigen::MatrixXf::Random(64, 64);
auto [LU, piv] = neon_lu(M);

// QR decomposition (Householder)
auto [Q, R] = neon_qr(M);  // Q orthogonal, R upper triangular

// Solve A*x = b (general via LU)
Eigen::VectorXf b = Eigen::VectorXf::Random(64);
Eigen::VectorXf x = neon_solve(M, b);

// Solve SPD system via Cholesky (faster for symmetric positive definite)
Eigen::VectorXf x_spd = neon_solve_spd(A, b);

// Triangular solve
Eigen::VectorXf y = neon_trsv_lower(L, b);  // solve L*y = b
Eigen::VectorXf z = neon_trsv_upper(R, b);  // solve R*z = b

// Matrix inverse
Eigen::MatrixXf Minv = neon_inverse(M);

Radar: Cross-Ambiguity Function (CAF)

#include <optmath/radar_kernels.hpp>
using namespace optmath::radar;

// Reference and surveillance signals
Eigen::VectorXcf ref = /* transmitter signal */;
Eigen::VectorXcf surv = /* receiver signal */;

// CAF parameters
size_t n_doppler_bins = 101;
size_t n_range_bins = 500;
float doppler_start = -500.0f;  // Hz
float doppler_step = 10.0f;      // Hz
float sample_rate = 1e6f;        // Hz

// Compute CAF (output: n_doppler x n_range magnitude matrix)
Eigen::MatrixXf caf_mag = caf(ref, surv,
                               n_doppler_bins, doppler_start, doppler_step,
                               sample_rate, n_range_bins);

// Find peak (target detection)
Eigen::Index doppler_bin, range_bin;
float peak_mag = caf_mag.maxCoeff(&doppler_bin, &range_bin);

Radar: CFAR Detection

// 1D CA-CFAR
Eigen::VectorXf power_data = /* input power */;
size_t guard_cells = 4;
size_t reference_cells = 16;
float pfa_factor = 10.0f;

auto detections = cfar_ca(power_data, guard_cells, reference_cells, pfa_factor);

// 2D CFAR for range-Doppler maps
Eigen::MatrixXf range_doppler_map = /* CAF output */;
auto det_2d = cfar_2d(range_doppler_map,
                      guard_range, guard_doppler,
                      ref_range, ref_doppler,
                      pfa_factor);

CUDA Operations

#include <optmath/cuda_backend.hpp>

if (optmath::cuda::is_available()) {
    optmath::cuda::init();

    Eigen::VectorXf a = Eigen::VectorXf::Random(1000000);
    Eigen::VectorXf b = Eigen::VectorXf::Random(1000000);

    // Vector operations
    Eigen::VectorXf c = optmath::cuda::cuda_add(a, b);
    float dot = optmath::cuda::cuda_dot(a, b);

    // Matrix operations (Tensor Core accelerated)
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(1024, 1024);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(1024, 1024);
    Eigen::MatrixXf C = optmath::cuda::cuda_gemm(A, B);

    // FFT
    Eigen::VectorXcf signal = Eigen::VectorXcf::Random(65536);
    Eigen::VectorXcf spectrum = optmath::cuda::cuda_fft(signal);

    // Transcendentals
    Eigen::VectorXf exp_x = optmath::cuda::cuda_exp(a);
    Eigen::VectorXf sigmoid_x = optmath::cuda::cuda_sigmoid(a);
}

Vulkan Operations

#include <optmath/vulkan_backend.hpp>

if (optmath::vulkan::is_available()) {
    Eigen::VectorXf a = Eigen::VectorXf::Random(1000000);
    Eigen::VectorXf b = Eigen::VectorXf::Random(1000000);

    // Vector operations
    Eigen::VectorXf c = optmath::vulkan::vulkan_vec_add(a, b);
    float dot = optmath::vulkan::vulkan_vec_dot(a, b);

    // Matrix operations
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(512, 512);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(512, 512);
    Eigen::MatrixXf C = optmath::vulkan::vulkan_mat_mul(A, B);

    // Convolution
    Eigen::VectorXf signal = Eigen::VectorXf::Random(10000);
    Eigen::VectorXf kernel = Eigen::VectorXf::Random(64);
    Eigen::VectorXf result = optmath::vulkan::vulkan_convolution_1d(signal, kernel);
}

Benchmarking

cmake -DBUILD_BENCHMARKS=ON ..
make -j$(nproc)

# Run individual benchmarks
./benchmarks/bench_neon_transcendentals  # exp, sin, cos, sigmoid, tanh
./benchmarks/bench_neon_gemm             # Matrix multiplication
./benchmarks/bench_neon_fft              # FIR, cross-correlation, complex ops
./benchmarks/bench_vulkan_matmul         # GPU vector/matrix operations
./benchmarks/bench_radar_caf             # CAF, CFAR, beamforming

Example output (Raspberry Pi 5):

BM_NEON_Exp_Approx/1048576    1.2 ms    1.2 ms    580  FLOPS=13.1G/s
BM_Std_Exp/1048576           45.2 ms   45.1 ms     15  FLOPS=23.2M/s
BM_NEON_ComplexMul/65536      0.1 ms    0.1 ms   6200  Elements/s=655M
BM_CAF_NEON/4096x64x256       4.8 ms    4.7 ms    148  CAF/s=212

Orange Pi 6 Plus Benchmark Results

Tested on CIX P1 CD8160 (12x cores: 8x Cortex-A720 + 4x Cortex-A520) @ 2.6 GHz, Mali-G720-Immortalis GPU (Vulkan 1.3.275). All benchmarks compiled with -march=armv9-a+sve2+bf16+i8mm, C++20.

NEON GEMM (Matrix Multiplication)

Benchmark Size Time FLOPS Notes
NEON GEMM 4x4 16 30.0 ns 17.1 GFLOPS Micro-kernel
NEON GEMM 4x4 128 245 ns 16.7 GFLOPS L1-resident
NEON GEMM 4x4 1024 1.91 μs 17.1 GFLOPS Streaming
NEON GEMM Blocked 32 4.48 μs 14.6 GFLOPS Cache-blocked
NEON GEMM Blocked 256 2.87 ms 11.7 GFLOPS L3-resident
NEON GEMM Blocked 512 29.9 ms 9.01 GFLOPS Beyond L3
Eigen GEMM 32 2.25 μs 29.1 GFLOPS Reference
Eigen GEMM 256 919 μs 36.6 GFLOPS Highly optimized
Eigen GEMM 512 7.14 ms 37.7 GFLOPS AVX-class perf
NEON MatVec 1024 220 μs 9.55 GFLOPS Matrix-vector
NEON MatVec 2048 867 μs 9.71 GFLOPS Large

NEON Transcendentals (Speedup vs std:: scalar)

Function Size NEON FLOPS std:: FLOPS Speedup
exp 1M 8.55 GFLOPS 261 MFLOPS 32.8x
sin 1M 8.55 GFLOPS 228 MFLOPS 37.5x
cos 1M 7.41 GFLOPS ~33x
sigmoid 256K 8.09 GFLOPS 408 MFLOPS 19.8x
tanh 256K 8.66 GFLOPS 73.8 MFLOPS 117x
ReLU 1M 34.2 GB/s Memory-bound

NEON DSP (Signal Processing)

Benchmark Size Time Throughput
FIR Filter 16K samples, 64 taps 179 μs 11.7 GFLOPS
FIR Filter 64K samples, 128 taps 1.53 ms 10.9 GFLOPS
Cross-Correlation 4096 3.65 ms 9.18 GFLOPS
Complex Multiply 4096 8.08 μs 3.04 GFLOPS
Complex Magnitude 64K 87.5 μs 3.00 GFLOPS
Dot Product (NEON) 256K 60.9 μs 8.62 GFLOPS
Dot Product (Eigen) 256K 48.0 μs 10.9 GFLOPS

Vulkan GPU (Mali-G720-Immortalis)

Benchmark Size Wall Time GPU FLOPS Notes
Matrix Multiply 64x64 228 μs 5.54 GFLOPS Setup overhead
Matrix Multiply 256x256 1.19 ms 108 GFLOPS Tiled 32x32
Matrix Multiply 512x512 7.07 ms 274 GFLOPS Sweet spot
Matrix Multiply 1024x1024 51.2 ms 481 GFLOPS Peak compute
Vec Add 1M 3.06 ms 5.98 GB/s Memory-bound
Vec Add 4M 15.0 ms 4.27 GB/s Large transfer

Radar Signal Processing

Benchmark Parameters Time FLOPS
CAF 4096 samples, 41 Doppler, 100 range 9.94 ms 17.0 GFLOPS
CAF 16384 samples, 61 Doppler, 200 range 110 ms 18.3 GFLOPS
CAF 65536 samples, 101 Doppler, 500 range 1.82 s 18.2 GFLOPS
CFAR CA 1D 64K samples 6.81 ms 1.23 GFLOPS
CFAR 2D 256x512 range-Doppler 34.0 ms 247 MFLOPS
CFAR 2D 512x1024 range-Doppler 137 ms 246 MFLOPS
NLMS Filter 64K samples, 64 taps 4.07 ms 4.12 GFLOPS
NLMS Filter 256K samples, 128 taps 32.2 ms 4.17 GFLOPS
MTI Filter 256 pulses x 2048 range 2.55 ms 1.22 GFLOPS
Beamform (Phase) 8 elements, 16K samples 267 μs 3.92 GFLOPS
Beamform (Delay-Sum) 16 elements, 64K samples 2.63 ms 801 MFLOPS
Steering Vector 64 elements 19.1 μs 104 M items/s

Demo Application Output

OptMathKernels Benchmark (N=1000000)
------------------------------------------
NEON: Available
Vulkan: Available (GPU initialized) — Mali-G720-Immortalis

--- Dot Product ---
Eigen (CPU): 0.303 ms, Result: -95.2917
NEON       : 0.289 ms, Result: -95.2971 (Diff: 0.005)
Vulkan     : 15.5 ms,  Result: -95.2891 (Diff: 0.003)

--- Vector Addition ---
Eigen (CPU): 1.59 ms
NEON       : 1.61 ms, Norm Diff: 0
Vulkan     : 7.79 ms, Norm Diff: 0

--- FIR Filter (Small Input) ---
Naive CPU  : 0.910 ms
NEON       : 0.242 ms, Norm Diff: 8.16e-05 (3.8x speedup)
Vulkan     : 7.12 ms,  Norm Diff: 8.16e-05

Test Results (Orange Pi 6 Plus — 16/16 Suites Pass)

Test Suite Tests Status Backend
test_basic 1 Passed Core
test_neon_kernels 4 Passed NEON
test_neon_complex 7 Passed NEON
test_neon_transcendentals 10 Passed NEON
test_neon_resample 7 Passed NEON
test_neon_iir 10 Passed NEON
test_neon_conv2d 9 Passed NEON
test_neon_linalg 21 Passed NEON
test_sve2_kernels 18 Passed SVE2
test_platform 9 Passed Platform
test_vulkan_vector 1 Passed Vulkan
test_vulkan_matrix 1 Passed Vulkan
test_vulkan_dsp 1 Passed Vulkan
test_vulkan_advanced 2 Passed Vulkan
test_radar_caf 6 Passed Radar
test_radar_cfar 9 Passed Radar

Troubleshooting

Build Issues

"Vulkan not found" during CMake:

# Install Vulkan SDK
sudo apt install -y libvulkan-dev vulkan-tools glslang-tools

# Verify Vulkan
vulkaninfo | head -20

"Could NOT find GTest":

# GTest is fetched automatically via FetchContent
# Clear build directory and re-run CMake
rm -rf build && mkdir build && cd build
cmake ..

"CUDA not found":

# Ensure CUDA toolkit is installed
nvcc --version

# Add to PATH if needed
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

"Unsupported gpu architecture 'compute_100'" (Blackwell):

# Your CUDA toolkit is too old for Blackwell (RTX 50xx)
nvcc --version  # Shows version (needs 12.8+ for SM 100)

# Option 1: Install CUDA 12.8+ from NVIDIA
# https://developer.nvidia.com/cuda-downloads

# Option 2: Use Hopper SM 90a (runs via PTX forward compatibility, may have issues)
cmake -DCMAKE_CUDA_ARCHITECTURES="90a" ..

# Option 3: Use Vulkan backend instead (full GPU acceleration, no CUDA needed)
cmake -DENABLE_CUDA=OFF -DENABLE_VULKAN=ON ..

CUDA tests fail with zeros/segfault on RTX 50xx:

# PTX forward compatibility from SM 90a to Blackwell SM 100 can have issues
# The Vulkan backend works correctly on RTX 50xx

# Solution: Install CUDA 12.8+ for native Blackwell support
# Or use Vulkan for GPU acceleration until CUDA is upgraded

"-fPIC" linking error when building shared libraries:

# Rebuild with position-independent code
cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON ..
make -j$(nproc)
sudo make install

Runtime Issues

"failed to open file: vec_add.comp.spv":

# Run sudo make install to place shaders in /usr/local/share/...
sudo make install

# Or set shader path manually
export OPTMATH_KERNELS_PATH=/usr/local/share/optmathkernels/shaders/

NEON tests skipped:

# NEON is only enabled on ARM platforms
# On x86, tests will skip with "NEON not available"

Eigen3 not found when using installed package:

sudo apt install libeigen3-dev

Performance Issues

High CPU usage on Pi 5:

  • Use NEON for operations that fit in L2 cache (<512KB)
  • Use Vulkan for large arrays where GPU parallelism helps
  • Enable cache blocking for large GEMM operations

GPU not being used:

# Check Vulkan
vulkaninfo | grep "GPU"

# Check CUDA
nvidia-smi

File Structure

OptMathKernels/
├── include/optmath/
│   ├── neon_kernels.hpp      # NEON API declarations (104 functions)
│   ├── sve2_kernels.hpp      # SVE2/FCMA/I8MM API declarations (46 functions)
│   ├── platform.hpp          # Platform detection, thread affinity (10 functions)
│   ├── vulkan_backend.hpp    # Vulkan API declarations (23 functions)
│   ├── cuda_backend.hpp      # CUDA API declarations (242 functions)
│   └── radar_kernels.hpp     # Radar processing API (48 functions)
├── src/
│   ├── neon/                           # ARM NEON (128-bit SIMD) kernels
│   │   ├── neon_kernels.cpp        # Core vector ops (4-accumulator dot, add/sub/mul/div),
│   │   │                           #   reductions (sum/max/min), 4x4 GEMM microkernel,
│   │   │                           #   FIR filter, ReLU, vectorized transcendentals
│   │   │                           #   (6th-order minimax exp/sin/cos/sigmoid/tanh)
│   │   ├── neon_complex.cpp        # Split & interleaved complex arithmetic (vld2q/vst2q),
│   │   │                           #   complex dot product, magnitude (Newton-Raphson rsqrt),
│   │   │                           #   phase, complex exponential
│   │   ├── neon_gemm_optimized.cpp # 3-level Goto-style cache-blocked GEMM with 8x8
│   │   │                           #   microkernel (vmlaq_laneq_f32 rank-1 updates),
│   │   │                           #   runtime A76/A720 cache tuning (MC/KC/NC)
│   │   ├── neon_radar.cpp          # Window functions (Hamming-Kaiser), CAF with vectorized
│   │   │                           #   Doppler shift, CFAR (CA/2D-SAT/OS), NLMS adaptive
│   │   │                           #   filter, projection clutter, DFT, MTI, beamforming
│   │   ├── neon_resample.cpp       # Polyphase L:M rational resampler with NEON FIR
│   │   ├── neon_iir.cpp            # Biquad IIR DF2T, cascade, Bristow-Johnson design
│   │   │                           #   (lowpass/highpass/bandpass/notch)
│   │   ├── neon_conv2d.cpp         # General NxM, separable, unrolled 3x3 & 5x5 convolution
│   │   └── neon_linalg.cpp         # TRSV/TRSM, Cholesky, LU (partial pivot), QR
│   │                               #   (Householder), general/SPD solve, matrix inverse
│   ├── sve2/                           # ARM SVE2 (scalable vector) kernels
│   │   ├── sve2_kernels.cpp        # Predicated vector ops (svwhilelt_b32 loops),
│   │   │                           #   transcendentals (predicated Horner/Chebyshev),
│   │   │                           #   8x8 GEMM microkernel (MC=256/KC=512/NC=2048),
│   │   │                           #   I8MM int8 GEMM (svmmla_s32), FIR filter
│   │   ├── sve2_complex.cpp        # FCMA-accelerated complex multiply (svcmla rotations
│   │   │                           #   0/90/270, 2 instructions vs 4), split & interleaved
│   │   │                           #   formats, complex dot/magnitude (svsqrt)/phase/exp
│   │   ├── sve2_radar.cpp          # CAF with predicated complex MAC (svmla_f32_m merging),
│   │   │                           #   cross-correlation, phase-shift beamforming, windowing
│   │   └── sve2_detect.cpp         # Runtime SVE2 detection via getauxval(AT_HWCAP2)
│   ├── platform/
│   │   └── platform.cpp            # CPU topology, thread affinity, cache detection
│   ├── vulkan/
│   │   ├── vulkan_backend.cpp      # Vulkan context & dispatch (Mali-G720 auto-detect)
│   │   └── shaders/                # 40 GLSL compute shaders
│   │       ├── vec_add.comp.glsl
│   │       ├── mat_mul_tiled.comp.glsl
│   │       ├── mat_mul_tiled_mali.comp.glsl  # 32x32 tiles for Mali-G720
│   │       ├── reduce_sum_mali.comp.glsl     # 1024-thread subgroup reduction
│   │       ├── fft_radix2.comp.glsl
│   │       ├── caf_doppler_shift.comp.glsl
│   │       ├── cfar_2d.comp.glsl
│   │       └── ... (33 more shaders)
│   └── cuda/                           # NVIDIA CUDA GPU kernels
│       ├── cuda_backend.cpp        # Context, device enumeration, memory management
│       ├── cuda_kernels.cu         # Vector elementwise (float4 vectorized), transcendentals
│       │                           #   (__expf/__sinf/__cosf/__sincosf fast-math intrinsics),
│       │                           #   activation functions (sigmoid/tanh/ReLU/GELU/softmax),
│       │                           #   matrix ops (tiled transpose with bank-conflict padding),
│       │                           #   cuBLAS (Sgemm/Sdot/Snrm2/Sgemv), CUB reductions
│       │                           #   (DeviceReduce::Sum/Max/Min), cuSOLVER Cholesky
│       │                           #   (Spotrf/Dpotrf/Spotrs, architecture-aware thresholds)
│       ├── cuda_complex.cu         # Split-format complex arithmetic, warp-level dot product
│       │                           #   (__shfl_down_sync reduction + atomicAdd), format
│       │                           #   conversion (interleave/deinterleave), convolution
│       │                           #   (1D naive/shared-memory, 2D), cuFFT wrappers
│       │                           #   (C2C 1D/batch/2D via cufftPlan1d/PlanMany/Plan2d)
│       └── cuda_radar.cu           # Window functions (Hamming-Blackman-Harris), GPU-resident
│                                   #   CAF pipeline (Doppler shift + FFT + conj-mul + IFFT),
│                                   #   CFAR (1D CA + 2D), Bartlett beamformer with shared-
│                                   #   memory reduction, ULA steering vectors, NLMS (CPU),
│                                   #   projection clutter cancellation
├── tests/
│   ├── test_neon_kernels.cpp       # NEON unit tests
│   ├── test_neon_complex.cpp       # Complex operation tests
│   ├── test_neon_transcendentals.cpp
│   ├── test_neon_resample.cpp      # Polyphase resampler tests
│   ├── test_neon_iir.cpp           # Biquad IIR filter tests
│   ├── test_neon_conv2d.cpp        # 2D convolution tests
│   ├── test_neon_linalg.cpp        # Dense linear algebra tests (21 tests)
│   ├── test_sve2_kernels.cpp       # SVE2 unit tests (18 tests)
│   ├── test_platform.cpp           # Platform detection tests (9 tests)
│   ├── test_vulkan_vector.cpp      # Vulkan vector tests
│   ├── test_vulkan_matrix.cpp      # Vulkan matrix tests
│   ├── test_vulkan_dsp.cpp         # Vulkan DSP tests
│   ├── test_cuda_kernels.cpp       # CUDA unit tests
│   ├── test_cuda_radar.cpp         # CUDA radar tests
│   ├── test_radar_caf.cpp          # CAF computation tests
│   └── test_radar_cfar.cpp         # CFAR detection tests
├── benchmarks/
│   ├── bench_neon_transcendentals.cpp
│   ├── bench_neon_gemm.cpp
│   ├── bench_neon_fft.cpp
│   ├── bench_vulkan_matmul.cpp
│   └── bench_radar_caf.cpp
├── examples/
│   ├── demo.cpp                    # NEON/Vulkan demo
│   └── cuda_demo.cpp               # CUDA demo
├── cmake/
│   └── OptMathKernelsConfig.cmake.in  # CMake package config
├── FunctionsIncluded.md            # Complete API reference (473 functions)
└── README.md                       # This file

Related Projects

  • PassiveRadar_Kraken: Complete passive bistatic radar system using OptMathKernels for acceleration. Includes GNU Radio blocks, multi-target tracking, and real-time displays.

Recent Changes

v0.5.11 - Comprehensive Kernel Documentation (April 2026)

Source-Level Documentation:

Every kernel source file now has a thorough header comment documenting all functional blocks, specific API/intrinsic usage, and algorithmic techniques. This makes the codebase self-documenting for contributors and users reading the source.

CUDA Kernels (3 files):

  • cuda_kernels.cu: 8 documented blocks — vector elementwise (float4 vectorized), transcendentals (CUDA __expf/__sinf/__cosf/__sincosf/__tanf/__powf fast-math intrinsics), activation functions (sigmoid/tanh/ReLU/leaky-ReLU/GELU/softmax with shared-memory reduction), matrix ops (tiled transpose with +1 bank-conflict padding), cuBLAS wrappers (cublasSgemm/cublasSdot/cublasSnrm2/cublasSgemv), CUB parallel reductions (cub::DeviceReduce::Sum/Max/Min), Eigen host wrappers (error-checked cudaMalloc/cudaMemcpy with CPU fallback), cuSOLVER Cholesky (cusolverDnSpotrf/cusolverDnDpotrf/cusolverDnSpotrs with architecture-aware GPU/CPU thresholds)
  • cuda_complex.cu: 7 documented blocks — split-format complex arithmetic, complex analysis kernels, warp-level complex dot product (__shfl_down_sync + atomicAdd), format conversion (interleave/deinterleave), convolution (1D naive/shared-memory template, 2D), cuFFT wrappers (cufftPlan1d/cufftPlanMany/cufftPlan2d with CUFFT_C2C), Eigen complex wrappers
  • cuda_radar.cu: 7 documented blocks — window function generators, GPU-resident CAF pipeline (Doppler shift via __sincosf + FFT + conj-multiply + IFFT + magnitude extraction), CFAR detection (1D CA-CFAR + 2D with guard cells), Doppler windowing, Bartlett beamformer (ULA steering vectors + shared-memory spectrum reduction), NLMS adaptive filter (CPU, sequential weight updates), projection clutter cancellation

NEON Kernels (8 files):

  • neon_kernels.cpp: Vector ops with 4-accumulator FMA pipeline utilization, vectorized transcendentals (6th-order minimax polynomials with IEEE754 bit manipulation for exponent reconstruction)
  • neon_complex.cpp: Split and interleaved (vld2q_f32/vst2q_f32) complex arithmetic, Newton-Raphson rsqrt magnitude (vrsqrteq_f32/vrsqrtsq_f32)
  • neon_gemm_optimized.cpp: 3-level Goto-style cache-blocked GEMM, 8x8 microkernel with vmlaq_laneq_f32 rank-1 updates, runtime A76/A720 cache parameter tuning
  • neon_radar.cpp: Window functions (including Kaiser with Bessel I0), CAF, CFAR (CA/2D-SAT/OS), NLMS, projection clutter, DFT, MTI, delay-sum and phase-shift beamforming
  • neon_conv2d.cpp: General, separable, fully-unrolled 3x3 and 5x5 convolution
  • neon_iir.cpp: Biquad Direct Form II Transposed, cascade, Bristow-Johnson filter design
  • neon_linalg.cpp: TRSV/TRSM, Cholesky (A=L*L^T), LU (partial pivot), QR (Householder), solvers, matrix inverse
  • neon_resample.cpp: Polyphase L:M rational rate conversion with streaming delay line

SVE2 Kernels (4 files):

  • sve2_kernels.cpp: Predicated vector ops (svwhilelt_b32 loops), transcendentals (predicated Horner/Chebyshev polynomials), 8x8 GEMM microkernel (MC=256/KC=512/NC=2048 for A720 12MB L3), I8MM int8 GEMM (svmmla_s32)
  • sve2_complex.cpp: FCMA-accelerated complex multiply (svcmla_f32_z rotations 0/90/270 for 2-instruction complex multiply), non-FCMA fallback with svtbl_f32 deinterleaving, native svsqrt_f32_z magnitude
  • sve2_radar.cpp: CAF with predicated complex MAC (svmla_f32_m merging semantics), cross-correlation, phase-shift beamforming
  • sve2_detect.cpp: Runtime SVE2 detection via getauxval(AT_HWCAP2) & HWCAP2_SVE2

README Updated:

  • File Structure section now includes per-file descriptions of all functional blocks and key APIs
  • Each source file entry documents the specific intrinsics, algorithms, and CUDA/cuBLAS/cuFFT/cuSOLVER calls used

v0.5.13 - GEMM L3 Optimization, L2 Detection & SVE2 Prefetch (April 2026)

Bug Fixes:

  • SVE2 GEMM buffer bounds clamping - Added std::min() clamping of runtime MC/KC/NC parameters to static buffer maximums, preventing potential buffer overflow if platform detection returned values exceeding MAX_MC/MAX_KC/MAX_NC. NEON GEMM already had this safety check; SVE2 GEMM was missing it.
  • NEON GEMM docstring correction - Fixed inaccurate cache blocking parameter values in the file header comment (claimed MC=384-512 but code used MC=128-256).

Optimizations (CIX P1 / Orange Pi 6 Plus):

  • GEMM NC doubled to 2048 for large L3 - B panel now occupies 4MB (33% of 12MB L3) vs prior 2MB (17%). Reduces main memory traffic by ~2x for large matrix multiplications on the CIX P1 CD8160.
  • SVE2 GEMM microkernel prefetch - Added svprfb(SV_PLDL1KEEP) prefetch hints to pipeline next iteration's A/B panels into L1 data cache, reducing stall cycles on Cortex-A720.

Platform Detection Enhancements:

  • L2 cache size detection - New get_l2_cache_size() API reads per-core L2 from sysfs (probes a performance core), with heuristic fallbacks: A720=512KB, A520=256KB, A76=512KB.
  • CpuInfo::l2_cache_bytes field - Cached L2 size available via detect_cpu_info().
  • Updated test suite - Platform test validates L2 detection and new NC=2048 GEMM blocking.

Test Results (CIX P1 CD8160, aarch64, all 16/16 pass):

Backend Tests Status
NEON 7/7 Pass
SVE2 1/1 Pass
Vulkan 4/4 Pass
Radar 2/2 Pass
Platform 1/1 Pass
Basic 1/1 Pass

v0.5.12 - CUDA 13 Full Upgrade & SM 103 Support (April 2026)

CUDA 13.0 Full Support:

  • CUDA 13.0.88 verified - Complete build and test pass with CUDA 13 toolkit
  • SM 103 architecture added - Support for RTX 5070/5060 (GB205/GB206 Blackwell variants)
  • Updated architecture list - CUDA 13+ now builds: SM 75, 80, 86, 89, 90, 100, 103
  • Improved Blackwell detection - CMake now detects both SM 100 and SM 103 for OPTMATH_CUDA_BLACKWELL

Test Results (x86-64 with RTX 5090, CUDA 13.0.88):

Backend Tests Status Notes
CUDA 36/36 Pass Full GPU acceleration with TF32 Tensor Cores
Vulkan 4/4 Pass Full GPU acceleration
NEON/CPU 11/11 Pass Eigen auto-vectorized for AVX2

Architecture Support (CUDA 13+):

SM Architecture GPUs
75 Turing RTX 2060/2070/2080
80/86 Ampere RTX 3060/3070/3080/3090, A100
89 Ada Lovelace RTX 4060/4070/4080/4090
90 Hopper H100
100 Blackwell RTX 5080/5090 (GB202/GB203)
103 Blackwell RTX 5060/5070 (GB205/GB206)

v0.5.10 - TF32 Tolerance Fix & Verified CUDA 13 + cuSolver Cholesky (April 2026)

CUDA 13 Verified:

  • Full test pass on RTX 5090 (Blackwell) - All 36 CUDA tests pass including 7 Cholesky tests
  • cuSolver Cholesky decomposition - Verified working with CUDA 13's cusolverDnSpotrf/Dpotrf
  • SM 75 (Turing) fallback - Architecture-aware thresholds work correctly from SM 75-100

TF32 Precision Handling:

  • MatrixGEMM test tolerance adjusted for TF32 (TensorFloat-32) precision on Ampere+ GPUs
  • TF32 uses 19-bit mantissa vs FP32's 24-bit, allowing ~0.4% relative error
  • Tolerance updated to 1% relative + 5e-3 absolute to account for accumulated rounding in GEMM

Test Results (x86-64 with RTX 5090, CUDA 13.0):

Backend Tests Status Notes
CUDA 36/36 Pass Full GPU acceleration with TF32 Tensor Cores
Vulkan 4/4 Pass Full GPU acceleration
NEON/CPU 11/11 Pass Eigen auto-vectorized for AVX2

v0.5.9 - CUDA 13 Support & Multi-Architecture Compatibility (April 2026)

CUDA 13 Support:

  • Full CUDA 13.0 compatibility - API changes for cudaMemPrefetchAsync and cudaDeviceProp
  • Architecture auto-detection - CMake automatically selects appropriate architectures based on CUDA toolkit version:
    • CUDA 13+: SM 75, 80, 86, 89, 90, 100, 103 (Turing through Blackwell)
    • CUDA 12.x: SM 50-89 (Maxwell through Ada)
    • CUDA 11.x: SM 50-86 (Maxwell through Ampere)
  • Native build option - Use -DOPTMATH_CUDA_NATIVE=ON for faster compilation targeting only the local GPU
  • Blackwell (SM 100) optimizations - Full native support with CUDA 13

Architecture Compile Definitions:

New compile definitions for architecture-specific code paths:

  • OPTMATH_CUDA_13_PLUS - CUDA 13+ detected
  • OPTMATH_CUDA_BLACKWELL - SM 100 enabled
  • OPTMATH_CUDA_HOPPER - SM 90 enabled
  • OPTMATH_CUDA_ADA - SM 89 enabled
  • OPTMATH_CUDA_AMPERE - SM 80/86 enabled
  • OPTMATH_CUDA_TURING - SM 75 enabled

cuSolver Cholesky Improvements:

  • Enhanced fallback chain with architecture-aware thresholds
  • Small matrix optimization (CPU path for matrices < 64-128 depending on architecture)
  • Improved DeviceInfo API with architecture detection helpers

Note: CUDA 13 dropped support for SM < 75. For Maxwell/Pascal/Volta GPUs, use CUDA 12.x.


v0.5.8 - Blackwell (RTX 50xx) Support & Documentation (April 2026)

NVIDIA Blackwell Support:

  • RTX 5090 Tested: Full build and test run on NVIDIA GeForce RTX 5090 (SM 10.0, Blackwell architecture)
  • CUDA 12.8+ Requirement: Documented that Blackwell SM 100 requires CUDA 12.8+ for native support
  • Vulkan Fallback: Vulkan 1.3 backend provides full GPU acceleration on Blackwell when CUDA toolkit is older
  • Architecture Table Updated: Added RTX 5090 specs (21760 CUDA cores, 680 Tensor Cores, 32GB VRAM)

Build System:

  • Multi-Architecture Support: Build instructions updated for RTX 20xx/30xx/40xx (SM 75/86/89) and RTX 50xx (SM 100)
  • Native Optimization: Added -march=native -mtune=native flags for x86-64 builds
  • Local Install: Support for ~/.local prefix installation without root access

Documentation:

  • CUDA Toolkit Requirements Table: Clear mapping of GPU generations to minimum CUDA versions
  • Troubleshooting Expanded: Added Blackwell-specific error messages and solutions
  • GPU Architecture Detection: Added nvidia-smi commands to identify compute capability

Test Results (x86-64 with RTX 5090, CUDA 12.0, Vulkan 1.3):

Backend Tests Status Notes
Vulkan 4/4 ✅ Pass Full GPU acceleration
NEON/CPU 11/11 ✅ Pass Eigen auto-vectorized for AVX2
CUDA 0/1 ⚠️ Skip Requires CUDA 12.8+ for Blackwell

Known Limitation: CUDA tests fail on RTX 50xx with CUDA 12.0 due to PTX forward compatibility issues. Install CUDA 12.8+ from NVIDIA for native Blackwell support, or use the Vulkan backend.


v0.5.7 - SVE2 & Radar Pipeline Optimization for Orange Pi 6 Plus (March 2026)

SVE2 Transcendentals — Eliminate Heap Allocations (sve2_kernels.cpp):

  • sve2_fast_cos_f32: Inlined sin polynomial with pi/2 offset in a single SVE2 predicated pass. Previously allocated a std::vector<float> temp buffer and made 2 passes over the data.
  • sve2_fast_sigmoid_f32: Fused single-pass with inline exp(-x) computation. Previously allocated 2 std::vector buffers and made 3 passes (clamp+negate, exp, divide).
  • sve2_fast_tanh_f32: Fused single-pass with inline exp(-2x) + sigmoid. Previously allocated 2 std::vector buffers and made 3 passes.
  • Impact: Eliminates 1-2 heap allocations and 1-2 extra data passes per call in hot paths.

SVE2 GEMM Microkernel — Vectorize with SVE2 FMA (sve2_kernels.cpp):

  • micro_kernel_8x8_sve2: Replaced scalar float acc[8][8] loop with SVE2 column-oriented accumulators using svmla_n_f32_z for rank-1 broadcast FMA, matching the NEON kernel's register-blocked design.
  • Uses lo/hi vector pairs for 8-row columns on 128-bit SVE2 (CIX P1 svcntw()=4).
  • Vectorized load-add-store for C matrix writeback.

CAF Doppler Shift — Vectorize Trig in Hot Loop (neon_radar.cpp, sve2_radar.cpp):

  • Replaced per-sample std::cos/std::sin calls (the CAF bottleneck) with batch neon_fast_cos/sin_f32 and sve2_fast_cos/sin_f32.
  • Doppler phase rotation now uses vectorized neon_complex_mul_f32 / sve2_complex_mul_f32 instead of scalar multiply.
  • Estimated ~10x speedup for the Doppler shift phase of CAF computation.

SVE2 Complex Exponential — Vectorize (sve2_complex.cpp):

  • sve2_complex_exp_f32: Replaced scalar std::cos/std::sin loop with sve2_fast_cos/sin_f32 for full SVE2 vectorization.

Architecture Safety: All changes guarded by #ifdef OPTMATH_USE_SVE2 / #ifdef OPTMATH_USE_NEON with scalar fallbacks intact. No changes to headers, APIs, or non-ARM code paths.

All 16 test suites pass (100%) on Orange Pi 6 Plus.


v0.5.6 - SVE2 Runtime Detection, Platform Model Name Fallback (March 2026)

Bug Fixes:

  • SVE2 SIGILL on Non-SVE2 Hardware (sve2_detect.cpp, src/CMakeLists.txt):

    • is_available() was compiled with -march=armv9-a+sve2 flags in sve2_kernels.cpp, causing SIGILL crash on hardware without SVE2 (e.g., Raspberry Pi 5 Cortex-A76)
    • Moved is_available() to new sve2_detect.cpp compiled without SVE2 flags
    • Now performs runtime detection via getauxval(AT_HWCAP2) & HWCAP2_SVE2
    • SVE2 tests gracefully skip on non-SVE2 hardware instead of crashing
  • Platform CPU Model Name Empty on ARM Kernels (platform.cpp):

    • Raspberry Pi 5 /proc/cpuinfo lacks the model name field, causing DetectCPUInfo test failure
    • Added fallback that maps ARM CPU part IDs to model names (e.g., 0xd0b → "Cortex-A76", 0xd81 → "Cortex-A720")
    • Covers Cortex-A53 through Cortex-X4

All 16 test suites pass (100%) on Raspberry Pi 5.


v0.5.5 - NEON Kernel Audit: Pipeline Optimization, Allocation Elimination (March 2026)

Performance Optimizations:

  • Dot Product / Reduce Sum (neon_kernels.cpp):

    • 4 independent accumulators to break FMA dependency chain (A76: 4-cycle latency, 2 pipelines)
  • Transcendentals (neon_kernels.cpp):

    • neon_fast_cos_f32: inline sin polynomial with π/2 offset, eliminate std::vector heap allocation
    • neon_fast_sigmoid_f32: fused single-pass with inline exp(-x), eliminate 2 heap allocations per call
    • neon_fast_tanh_f32: fused single-pass with inline exp(-2x) + sigmoid, eliminate 4 heap allocations per call
  • GEMM (neon_gemm_optimized.cpp):

    • micro_kernel_8x8: column-oriented accumulators for vector store (16 vector ops vs 64 scalar vgetq_lane_f32 extractions)
    • neon_gemm_blocked Eigen wrapper: enable actual blocked GEMM path
  • Complex Exponential (neon_complex.cpp):

    • neon_complex_exp_f32: vectorize using neon_fast_sin/cos_f32

Vulkan Fix:

  • SPV Shader Discovery for FetchContent (vulkan_backend.cpp, src/CMakeLists.txt):
    • Added OPTMATH_SPV_BUILD_DIR compile definition so Vulkan backend locates compiled SPIR-V shaders when used via CMake FetchContent

v0.5.4 - Orange Pi 6 Plus Full Audit, Build, and Benchmark (March 2026)

Build and Installation Verified on Orange Pi 6 Plus (CIX P1 CD8160, Cortex-A720, Mali-G720-Immortalis):

  • All backends enabled: NEON, SVE2 (FCMA/I8MM), Vulkan 1.3
  • Library, 40 SPIR-V shaders, 16 tests, 5 benchmarks, demo all compile cleanly
  • make install verified with CMake package config

Bug Fixes (Build):

  • SPIR-V 1.3 Target for Mali Subgroup Shaders (src/CMakeLists.txt):

    • reduce_sum_mali.comp.glsl uses subgroupAdd() which requires SPIR-V 1.3
    • glslangValidator was invoked with default SPIR-V 1.0 target, causing compilation failure
    • Added --target-env vulkan1.1 to all shader compilation commands (Vulkan 1.1 implies SPIR-V 1.3)
  • Vulkan WSI Layer Static Destruction Crash (vulkan_backend.cpp, vulkan_backend.hpp):

    • All 4 Vulkan tests passed assertions but crashed during program exit with unordered_map::at
    • Root cause: Mesa's libVkLayer_window_system_integration.so destroys its internal unordered_map before the static VulkanContext singleton runs its destructor, causing vkDestroyDevice to crash inside the WSI layer
    • Fixed by registering std::atexit() handler in VulkanContext::init() to run cleanup before static destruction order conflicts
    • Added vkDeviceWaitIdle() before resource teardown
    • Added try-catch in destructor as safety net

Bug Fixes (Deep Source Audit — CRITICAL):

  • Cross-Correlation OOB Reads (neon_radar.cpp, sve2_radar.cpp):
    • When nx > ny, overlap computation y_offset + len > ny causes out-of-bounds reads on the y array
    • Fixed in all 4 functions: xcorr_f32 and xcorr_complex_f32 for both NEON and SVE2 backends
    • Added len clamping to ny - y_offset and zero-input early-return guards

Bug Fixes (Deep Source Audit — MODERATE):

  • MTI Filter Unsigned Underflow (neon_radar.cpp):

    • n_pulses - n_coeffs + 1 wraps to SIZE_MAX when n_pulses < n_coeffs (unsigned arithmetic)
    • Added early-return guards in both mti_filter_f32 and the Eigen wrapper
  • Vulkan Init Retry Leaks Handles (vulkan_backend.cpp):

    • Partial failure during VulkanContext::init() left Vulkan handles allocated; retry overwrote them
    • Added cleanup() call before re-init if device or instance are already non-null
  • Vulkan vkMapMemory Unchecked (vulkan_backend.cpp):

    • mapAndCopyFrom ignored vkMapMemory return value; would crash on mapping failure
    • Added error check with throw std::runtime_error on failure

Bug Fixes (Deep Source Audit — LOW):

  • NEON Sin Scalar Tail UB (neon_kernels.cpp):

    • (int)k cast of large float is undefined behavior when float exceeds INT_MAX
    • Changed to (int64_t)k to handle full float range
  • SPIR-V Alignment (vulkan_backend.cpp):

    • readFile returns vector<char> (1-byte alignment) but Vulkan requires uint32_t-aligned pCode
    • Fixed by padding buffer size to 4-byte alignment boundary

Benchmark Results Added:

  • Complete Orange Pi 6 Plus benchmark results for all 5 benchmark suites
  • NEON GEMM: 17.1 GFLOPS (4x4 micro-kernel), Eigen reference: 37.7 GFLOPS
  • NEON transcendentals: exp 32.8x, sin 37.5x, tanh 117x faster than std::
  • Vulkan MatMul on Mali-G720: 481 GFLOPS peak at 1024x1024
  • Radar CAF: 18.2 GFLOPS sustained for 65K-sample processing

All 16 test suites pass (100%) on Orange Pi 6 Plus.


v0.5.3 - Audit Review: Vulkan Fixes, False Positives Removed (March 2026)

Vulkan Bug Fixes:

  • Unchecked vkBindBufferMemory (vulkan_backend.cpp:90):

    • Return value was silently ignored; can fail with VK_ERROR_OUT_OF_DEVICE_MEMORY
    • Added error check with proper cleanup (free memory, destroy buffer) before throwing
  • Memory Barrier Spec Violation (vulkan_backend.cpp:500-507):

    • VK_PIPELINE_STAGE_HOST_BIT was used as destination stage in vkCmdPipelineBarrier, which is invalid per Vulkan spec
    • Removed HOST_BIT and HOST_READ_BIT; host visibility is guaranteed by HOST_COHERENT_BIT on all buffers + vkQueueWaitIdle
  • FFT Float log2 Truncation (vulkan_backend.cpp:1171,1220):

    • Radix-2 FFT used (uint32_t)std::log2(N) which can truncate incorrectly (e.g., log2(8) = 2.9999... → 2)
    • Replaced with integer bit-counting loop (radix-4 already had a loop but redundantly called std::log2 again)

Audit Plan Corrections (5 false positives removed):

  • Issue 14 (NEON Conv2D out-of-bounds): Loop condition c + 3 < out_cols with out_cols = in_cols - kernel_cols + 1 ensures max access = in_cols - 1 (last valid index). Algebraically proven safe for all conv variants.
  • Issue 37 (SVE2 FCMA complex multiply): svcmla with rotations 0 and 90 is the correct way to multiply interleaved complex data — the instruction natively operates on [re, im] pairs
  • Issue 17 (NEON complex magnitude): Newton-Raphson via vrsqrteq_f32 + 2x vrsqrtsq_f32 is mathematically correct (~24-bit accuracy)
  • Issue 33 (Vulkan host coherency): HOST_COHERENT_BIT is set on all buffers
  • Issue 15 (Householder sign): Standard QR sign selection with 1e-30 denominator guard

Audit Plan Status Updates (4 issues already fixed but not tracked):

  • Issues 6, 7 (CUDA window div/0 and dot product race) were fixed in v0.5.2 but audit plan still listed them as OPEN
  • Issues 11, 12 (Vulkan resource leaks and command buffer check) were already properly handled
  • Issue 22 line reference corrected: CAF Doppler phase is in sve2_radar.cpp:43-50, not sve2_kernels.cpp:1095-1106

All 16 test suites pass (100%).


v0.5.2 - Comprehensive Audit: Critical Bug Fixes Across All Backends (March 2026)

SVE2 Fixes:

  • Interleaved Complex Multiply Bug (sve2_complex.cpp):

    • Fixed incorrect deinterleave using svuzp1_f32(va, va) / svuzp2_f32(va, va) which duplicates elements instead of splitting even/odd
    • Replaced with svtbl_f32 index-based extraction for correct real/imaginary separation
    • Affected: sve2_complex_mul_interleaved_f32, sve2_complex_conj_mul_interleaved_f32
  • ODR Violation: Duplicate Function Definitions (sve2_kernels.cpp):

    • Removed ~580 lines of duplicate function definitions that were also defined in sve2_complex.cpp and sve2_radar.cpp
    • Static library silently picked one copy, masking the One Definition Rule violation

Platform Fixes:

  • Feature Detection End-of-Line Bug (platform.cpp):

    • Fixed: getline() strips \n, so features at end of line (e.g., " sve2") were missed by substring search
    • Added space padding before search to ensure end-of-line features are found
    • Features on heterogeneous CPUs (different Features lines) now detected correctly with !info.has_* guards
  • sysconf Negative Return (platform.cpp):

    • sysconf(_SC_NPROCESSORS_ONLN) can return -1 on error; now guarded with fallback to 1

CUDA Fixes:

  • tanh NaN for Large Inputs (cuda_kernels.cu):

    • __expf(2*x) overflows for |x| > ~20, producing NaN
    • Added input clamping: values beyond +/-20 return +/-1.0f directly
  • Softmax Single-Block Limitation (cuda_kernels.cu):

    • Kernel assumed n <= 256 (single block) but was called with arbitrary sizes, producing wrong results
    • Added CPU fallback path for n > BLOCK_SIZE
  • IFFT Dead Normalization Code (cuda_complex.cu):

    • Removed unused scale and blocks variables that computed normalization but never applied it
  • CFAR 2D Row-Major vs Column-Major Mismatch (cuda_radar.cu):

    • CUDA kernel used row-major indexing (d * n_range + r) but Eigen MatrixXf is column-major
    • Fixed all indexing to column-major (d + r * n_doppler)
  • CPU Window Functions Divide-by-Zero (cuda_radar.cu):

    • Window generation with n=1 caused division by (n-1) = 0
    • Added safe divisor: (n > 1) ? (n-1) : 1

Vulkan Fixes:

  • SPIR-V File Read Overflow (vulkan_backend.cpp):

    • tellg() returns -1 on failure; cast to size_t produced a massive allocation
    • Added check before cast
  • Buffer Leak on Allocation Failure (vulkan_backend.cpp):

    • vkCreateBuffer succeeded but vkAllocateMemory failed, leaking the buffer handle
    • Added vkDestroyBuffer cleanup before throwing
  • Floating-Point log2 Truncation (vulkan_backend.cpp):

    • bit_reverse_copy and radix-4 power check used (int)std::log2(N), which can truncate incorrectly (e.g., log2(8) = 2.9999... → 2)
    • Replaced with integer bit-counting loop

Safety Guards:

  • NEON GEMM Buffer Overflow (neon_gemm_optimized.cpp):

    • Runtime MC/KC/NC from platform detection could exceed compile-time MAX constants
    • Added std::min clamp to prevent stack buffer overflows
  • IIR Validation Order (neon_iir.cpp):

    • fc >= fs * 0.5f was checked before fs > 0, causing UB when fs = 0
    • Reordered: fs > 0 check now comes first
  • TRSV Zero-Diagonal Guard (neon_linalg.cpp):

    • Added check before division in neon_trsv_lower_trans_f32 to prevent division by zero
  • CUDA Buffer Move Semantics (cuda_backend.hpp):

    • PinnedBuffer and UnifiedBuffer had implicit copy constructors that would double-free
    • Added deleted copy ops and proper move semantics

Test Quality Improvements:

  • Tolerance Fixes (test_neon_kernels.cpp, test_vulkan_vector.cpp, test_sve2_kernels.cpp):

    • Changed dot/norm tolerance from 1e-2 * N to 1e-4f * N + 1e-2f (was masking real bugs)
  • Float Comparison (test_neon_kernels.cpp):

    • Changed EXPECT_EQ to EXPECT_FLOAT_EQ for reduce_max/min (bitwise vs ULP comparison)
  • Type Mismatch (test_radar_caf.cpp):

    • size_t max_idxEigen::Index max_idx to match maxCoeff() signature
  • Platform-Gated Assertions (test_platform.cpp):

    • L3 cache size and GEMM blocking assertions now gated on target platform detection
    • Previously failed on any system without 8MB+ L3 cache
  • Complex Dot Test Convention (test_sve2_kernels.cpp):

    • Test computed sum(a * conj(b)) but kernel implements sum(conj(a) * b) (BLAS convention)
    • Fixed test reference to match kernel

All 16 test suites pass (100%).


v0.5.1 - Bug Fixes: Numerical Stability and Error Handling (March 2026)

Critical Bug Fixes:

  • SVE2 Accumulation Bug (sve2_radar.cpp, sve2_complex.cpp):

    • Fixed predicate variant: changed _z (zeroing) to _m (merging) for accumulation operations
    • The zeroing variant was destroying accumulated values in inactive lanes during predicated loops
    • Affected: CAF, cross-correlation, complex dot product, and beamforming kernels
  • SVE2 Loop Termination (sve2_complex.cpp, sve2_radar.cpp):

    • Fixed: svptest_firstsvptest_any to prevent premature loop exit
    • svptest_first could exit early when the first lane was inactive but other lanes had work remaining
  • CUDA Window Division by Zero (cuda_radar.cu):

    • Added safe_window_divisor() helper to prevent NaN when generating windows with n=1
    • Affected: Hamming, Hanning, Blackman, Blackman-Harris, Kaiser window functions
  • CUDA Warp Reduction Race Condition (cuda_complex.cu):

    • Added guard for blockDim.x < 64 to prevent reading uninitialized shared memory
    • The warp reduction assumed at least 64 threads but smaller blocks could be launched
  • CUDA/cuFFT Error Handling (cuda_backend.cpp):

    • Added error checking for cufftExecC2C and memory allocation functions
    • Previously silent failures could occur on FFT execution errors

Numerical Stability Fixes:

  • NEON IIR Parameter Validation (neon_iir.cpp):

    • Added validate_iir_params() to prevent division by zero when Q=0
    • Validates: Q > 0, 0 < fc < fs/2, fs > 0
    • Returns unity pass-through filter coefficients on invalid parameters
    • Affected: neon_biquad_lowpass, neon_biquad_highpass, neon_biquad_bandpass, neon_biquad_notch
  • NEON Householder QR Stability (neon_linalg.cpp):

    • Fixed potential division by zero when alpha ≈ beta in Householder reflection
    • Added threshold check: if |alpha - beta| < 1e-30, skip reflection (column already normalized)

Vulkan Error Handling (vulkan_backend.cpp):

  • Added error checking for: vkAllocateCommandBuffers, vkBeginCommandBuffer, vkEndCommandBuffer, vkQueueSubmit, vkQueueWaitIdle
  • Proper resource cleanup on error paths

Test Improvements:

  • Platform Test Portability (test_platform.cpp):
    • Made ARM-specific assertions conditional on detecting target platform
    • Tests now pass on both ARM (Orange Pi 6 Plus, Raspberry Pi 5) and x86_64 development machines
    • Platform-specific checks (core part IDs, SVE2 features, efficiency cores) only run on appropriate hardware

v0.5.0 - Orange Pi 6 Plus: SVE2, FCMA, I8MM, Mali-G720 (March 2026)

New Platform: Orange Pi 6 Plus (CIX P1 CD8160)

Full hardware-specific optimization for the CIX P1 SoC's ARMv9 big.LITTLE cores and Mali-G720-Immortalis GPU.

SVE2 Acceleration (sve2_kernels.cpp, sve2_complex.cpp, sve2_radar.cpp):

  • Predicated vector operations: svwhilelt/svptest loops eliminate all scalar tail handling
  • FCMA complex multiply: svcmla_f32_z for 2-instruction complex multiply (vs 4 NEON)
  • I8MM GEMM: svmmla_s32 hardware int8 matrix multiply with float32 quantization
  • SVE2 cache-blocked GEMM tuned for A720 12MB L3 (MC=256, KC=512, NC=2048)
  • Complete SVE2 transcendentals: exp, sin, cos, sigmoid, tanh
  • SVE2 radar DSP: CAF with FCMA inner loop, cross-correlation, beamforming

Platform Detection (platform.cpp):

  • CPU topology detection via /proc/cpuinfo and sysfs (A720/A520 part IDs)
  • Thread affinity: pin to performance or efficiency cores via sched_setaffinity
  • SVE vector length detection via prctl(PR_SVE_GET_VL)
  • Runtime GEMM cache blocking parameter selection based on L3 cache size

NEON GEMM Runtime Tuning (neon_gemm_optimized.cpp):

  • Cache blocking parameters now auto-selected at runtime based on detected hardware
  • 12MB L3 (A720): MC=256, KC=512, NC=2048; 2MB L3 (A76): MC=128, KC=256, NC=512

Mali-G720 Vulkan Shaders (vulkan_backend.cpp):

  • Auto-detects Mali-G720 via vendor ID and device name
  • 32x32 tiled GEMM shader (1024 threads, 8KB shared memory)
  • 1024-thread reduction with subgroupAdd() subgroup arithmetic
  • Transparent fallback to standard shaders on non-Mali GPUs

New Tests:

Test Suite Tests Description
test_sve2_kernels 18 SVE2 correctness, edge cases, stress tests
test_platform 9 CPU topology, affinity, cache detection

v0.4.0 - Dense Linear Algebra: Cholesky, LU, QR, Solve, Inverse (February 2026)

New Dense Linear Algebra Kernels (neon_linalg.cpp):

  • Triangular Solve (TRSV/TRSM):

    • Forward/backward substitution with NEON-vectorized AXPY column updates (vld1q_f32/vmlaq_f32)
    • Unit-diagonal and transpose variants for composing with LU and Cholesky
    • Multi-RHS TRSM for matrix inverse computation
  • Cholesky Decomposition (neon_cholesky_f32):

    • Unblocked right-looking algorithm with NEON dot products
    • SPD validation: returns 1-based error index if matrix is not positive definite
    • Clean lower-triangular output (upper triangle zeroed)
  • LU Decomposition (neon_lu_f32):

    • Partial pivoting with argmax pivot search
    • NEON-vectorized column scaling and rank-1 trailing submatrix updates
    • Row permutation vector output for solver composition
  • QR Decomposition (neon_qr_f32):

    • Householder reflections with stable sign choice to avoid cancellation
    • NEON-vectorized reflector application to trailing columns
    • Explicit Q extraction via reverse-order reflector accumulation
  • Solvers:

    • neon_solve_f32: General A*x=b via LU + TRSV
    • neon_solve_spd_f32: SPD A*x=b via Cholesky + TRSV
    • neon_inverse_f32: A^{-1} via LU + TRSM on identity columns
  • Eigen Wrappers: neon_cholesky, neon_lu, neon_qr, neon_trsv_lower, neon_trsv_upper, neon_solve, neon_solve_spd, neon_inverse

New Tests:

Test Suite Tests Status
test_neon_linalg 21 Passed

Total: 88 individual test cases passed (21 new + 67 existing)

v0.3.0 - DSP Kernels: Resampler, IIR, 2D Convolution (February 2026)

New DSP Kernels:

  • Polyphase Resampler (neon_resample.cpp):

    • Rational L/M sample rate conversion using polyphase decomposition
    • NEON-optimized FIR filtering per phase (reuses neon_dot_f32)
    • Phase accumulator algorithm with delay line state management
    • Both streaming (neon_resample_f32) and one-shot (neon_resample_oneshot_f32) APIs
  • Biquad IIR Filter (neon_iir.cpp):

    • Direct Form II Transposed for optimal numerical stability
    • Single section (neon_biquad_f32) and cascade (neon_biquad_cascade_f32)
    • In-place processing supported (out and in may alias)
    • Design helpers using Audio EQ Cookbook formulas: lowpass, highpass, bandpass, notch
  • 2D Convolution (neon_conv2d.cpp):

    • General NxM convolution with NEON vectorization over output columns (4 at a time)
    • Separable 2D convolution: row pass reuses neon_fir_f32, NEON column pass
    • Fully unrolled 3x3 kernel (9 broadcast coefficients, 9 multiply-accumulates)
    • Unrolled 5x5 kernel specialization
    • Row-major layout for natural image/matrix processing

New Tests:

Test Suite Tests Status
test_neon_resample 7 Passed
test_neon_iir 10 Passed
test_neon_conv2d 9 Passed

Total: 67 individual test cases passed (26 new + 41 existing)

v0.2.2 - Cross-Platform Verification (January 2026)

Cross-Platform Testing:

  • Verified full compatibility between AMD64 (x86_64) and ARM64 (aarch64) platforms
  • All 10 test suites pass on Raspberry Pi 5 (ARM Cortex-A76)
  • Tested with GCC 14.2.0, Eigen 3.4.0, Vulkan 1.4.309

Test Results (Raspberry Pi 5):

Test Suite Tests Status
test_basic 1 Passed
test_neon_kernels 4 Passed
test_neon_complex 7 Passed
test_neon_transcendentals 10 Passed
test_vulkan_vector 1 Passed
test_vulkan_matrix 1 Passed
test_vulkan_dsp 1 Passed
test_vulkan_advanced 1 Passed
test_radar_caf 6 Passed
test_radar_cfar 9 Passed

Total: 41 individual test cases passed

v0.2.1 - Performance and Bug Fixes

CUDA Backend:

  • Optimized cuda_caf(): Eliminated all CPU-GPU memory transfers inside the Doppler processing loop
    • Added kernel_interleave_complex_f32 for GPU-side complex array interleaving
    • Added kernel_complex_conj_mul_interleaved_f32 for frequency-domain multiplication
    • Added kernel_magnitude_interleaved_f32 for magnitude extraction
    • Result: ~6 fewer cudaMemcpy calls per Doppler bin, data stays on GPU until final output
  • Fixed cuda_vec_sum_f32(): Removed dead code (unused buffer allocation and kernel call)

NEON Backend:

  • Fixed neon_fast_tanh_f32(): Removed dead code line that was immediately overwritten

Testing:

  • All 10 test suites pass (NEON, Vulkan, CUDA, Radar)

License

MIT License - See LICENSE file for details.


Author

N4HY - Bob McGwier Dr Robert W McGwier, PhD

All unit tests, much of the debugging, and all documentation except the earliest pieces are all Claude code (AMAZING).

About

Optmized mathematics kernels. in a library using NEON, Vulkans, etc

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors