High-Performance Numerical Library for ARM SBCs and NVIDIA GPUs
OptMathKernels is a C++20 numerical library optimized for Raspberry Pi 5, Orange Pi 6 Plus, and NVIDIA CUDA GPUs. It seamlessly bridges Eigen (CPU), ARM NEON (SIMD), ARM SVE2 (Scalable Vectors), Vulkan (Compute Shaders), and CUDA (NVIDIA GPUs) into a single, easy-to-use API.
Designed to accelerate math and signal processing tasks by leveraging:
- Raspberry Pi 5: Cortex-A76 NEON and VideoCore VII GPU
- Orange Pi 6 Plus: Cortex-A720 SVE2/FCMA/I8MM and Mali-G720-Immortalis GPU
- NVIDIA GPUs: cuBLAS, cuFFT, cuSOLVER, and Tensor Cores (Volta+)
While remaining compatible with standard Linux x86/ARM environments.
- Key Applications
- Features
- Hardware Support
- Installation
- API Reference
- Usage Examples
- Benchmarking
- Troubleshooting
- File Structure
- License
- Recent Changes
OptMathKernels powers the PassiveRadar_Kraken project, providing hardware-accelerated kernels for:
ARM NEON (Raspberry Pi 5):
| Operation | Speedup | Application |
|---|---|---|
| Complex multiply | 4-8x | CAF Doppler shifting |
| Dot product | 4-6x | NLMS adaptive filter |
| GEMM (blocked) | 3-5x | Beamforming, covariance |
| Transcendentals | 10-50x | Phase computation |
| FFT (Vulkan) | 10x | Large batch processing |
NVIDIA CUDA (Workstation/Server):
| Operation | Speedup | GPU Features |
|---|---|---|
| GEMM (1024x1024) | 50-100x | Tensor Cores (Ampere+) |
| FFT (64K points) | 20-50x | cuFFT optimized |
| CAF (full map) | 30-80x | Batched FFT + complex ops |
| CFAR 2D | 10-30x | Parallel threshold compute |
| Beamforming | 20-50x | cuBLAS matrix-vector |
| Transcendentals | 50-200x | CUDA fast math intrinsics |
- Machine learning inference (activation functions, matrix ops)
- Digital signal processing (filtering, FFT, convolution)
- Scientific computing (linear algebra, statistics)
- Real-time audio/video processing
- NEON Acceleration: Hand-tuned ARMv8 NEON intrinsics for SIMD acceleration on 64-bit ARM (aarch64). Includes optimized matrix multiplication, convolution, and vector math.
- Vulkan Compute: Massive parallel offloading to the GPU (VideoCore VII on Pi 5). Supports large vector operations, matrix math, FFT (Radix-2/4), and reductions.
- Eigen Integration: Fully compatible with
Eigen::VectorXf,Eigen::MatrixXf, andEigen::VectorXcf. Pass your existing data structures directly to accelerated kernels. - Easy Integration: Standard CMake package that installs to
/usr/localand works withfind_package(OptMathKernels).
- Cross-Ambiguity Function (CAF): Core passive radar operation for range-Doppler detection
- CFAR Detection: 1D/2D Cell-Averaging and Ordered Statistic CFAR detectors
- Clutter Filtering: NLMS adaptive filter and projection-based clutter cancellation
- Doppler Processing: FFT-based Doppler processing and MTI filters
- Beamforming: Delay-and-sum and phase-shift beamformers with steering vector generation
- Window Functions: Hamming, Hanning, Blackman, Blackman-Harris, Kaiser
- Vectorized Transcendentals: Fast NEON-accelerated exp, sin, cos, sigmoid, tanh (~10-50x faster than scalar)
- Complex Operations: Vectorized complex multiply, conjugate multiply, magnitude, phase
- Cache-Blocked GEMM: 8x8 microkernel with runtime cache blocking (MC/KC/NC auto-tuned for detected L3 cache size)
- Reductions: Sum, max, min, dot product with horizontal NEON adds
- Predicated Vector Operations: All loops use
svwhilelt/svptestpredication - zero tail-handling code - FCMA Complex Multiply: 2-instruction complex multiply via
svcmla(vs 4 NEON instructions) - I8MM GEMM: Int8 matrix multiply using
svmmla_s32with float32 quantize/dequantize - SVE2 Cache-Blocked GEMM: Vectorized 8x8 microkernel with
svmla_n_f32_zFMA, tuned for Cortex-A720 12MB L3 (MC=256, KC=512, NC=2048) - SVE2 Transcendentals: Single-pass inlined polynomial approximations (no heap allocations), cos/sigmoid/tanh fused with range reduction
- SVE2 Radar DSP: FCMA-accelerated CAF with vectorized Doppler shift (batch
sve2_fast_cos/sin), cross-correlation, and beamforming - SVE2 Complex Operations: Split and interleaved formats, dot product, magnitude, phase; vectorized
complex_expvia fast sin/cos
- CPU Topology Detection: Identifies big.LITTLE core clusters via
/proc/cpuinfoand sysfs - Thread Affinity: Pin threads to performance (A720) or efficiency (A520) cores
- SVE Vector Length: Runtime detection via
prctl(PR_SVE_GET_VL) - L2/L3 Cache Detection: Per-core L2 and shared L3 cache sizes from sysfs with part-ID fallbacks
- Cache-Aware GEMM Tuning: Auto-selects blocking parameters based on detected L2/L3 cache sizes
- Polyphase Resampler: Rational L/M rate conversion with NEON-optimized FIR per phase, streaming and one-shot APIs
- Biquad IIR Filter: Direct Form II Transposed with cascade support, design helpers for lowpass/highpass/bandpass/notch
- 2D Convolution: NEON-vectorized general NxM convolution, separable kernels, unrolled 3x3 and 5x5 specializations
- Triangular Solve (TRSV/TRSM): Column-oriented forward/backward substitution with NEON-vectorized AXPY updates, unit-diagonal and transpose variants, multi-RHS support
- Cholesky Decomposition: Unblocked right-looking A = L*L^T with NEON dot products, SPD validation with error reporting
- LU Decomposition: Partial pivoting with NEON-vectorized column scaling and rank-1 trailing updates
- QR Decomposition: Householder reflections with NEON-vectorized reflector application, explicit Q extraction
- Solvers: General solve (LU), SPD solve (Cholesky), matrix inverse (LU + TRSM)
- Tiled GPU Matrix Multiply: 16x16 shared memory tiles (default), 32x32 tiles for Mali-G720
- FFT: Radix-2/4 FFT with butterfly operations in compute shaders
- Convolution: 1D and 2D convolution with separable kernel optimization
- Vector Operations: Add, multiply, dot product, reductions
- Mali-G720 Optimized: Auto-selects 1024-thread shaders with subgroup arithmetic on Mali GPUs
- cuBLAS Integration: Level 1, 2, and 3 BLAS operations with Tensor Core support
- cuFFT: High-performance FFT up to 64M points, batched transforms
- cuSOLVER: Cholesky, LU, QR, SVD, eigenvalue decomposition
- Tensor Cores: Automatic acceleration on Volta+ (SM 7.0+) and Ampere+ (SM 8.0+)
- Mixed Precision: FP32, TF32, FP16 compute modes
- Multi-GPU: Device enumeration, peer-to-peer access, workload distribution
- Unified Memory: Simplified CPU-GPU data management
- Radar Processing: Full GPU-accelerated CAF, CFAR, beamforming, NLMS filter
- Optimized CAF: Zero CPU-GPU transfers inside Doppler loop for maximum throughput
Target Hardware: Orange Pi 6 Plus with CIX P1 CD8160 SoC (ARMv9)
| Feature | Specification |
|---|---|
| CPU | 8x Cortex-A720 + 4x Cortex-A520 (big.LITTLE) |
| SIMD | NEON (128-bit) + SVE2 (128-bit VL) + FCMA + I8MM + BF16 |
| GPU | Mali-G720-Immortalis (Vulkan 1.3) |
| Memory | 8GB/16GB LPDDR5 |
| L1 Cache | 64KB per core |
| L2 Cache | 512KB per core |
| L3 Cache | 12MB shared |
SVE2 Optimizations:
- Predicated loops eliminate all scalar tail handling (vs 28+ NEON tail loops)
- FCMA complex multiply: 2 instructions vs 4 NEON instructions per complex multiply
- I8MM: Hardware int8 matrix multiply with
svmmla_s32 - Cache blocking tuned for 12MB L3: MC=256, KC=512, NC=2048
Mali-G720-Immortalis GPU (Vulkan):
| Feature | Specification |
|---|---|
| Vulkan | 1.3 |
| Shared Memory | 32KB per workgroup |
| Max Workgroup Size | 1024 |
| Subgroup Size | 16 |
| Subgroup Ops | Arithmetic (subgroupAdd, etc.) |
- 32x32 tiled GEMM shader (1024 threads, 8KB shared memory)
- 1024-thread reduction with subgroup-level arithmetic
Build for Orange Pi 6 Plus:
cmake -DCMAKE_BUILD_TYPE=Release \
-DENABLE_NEON=ON \
-DENABLE_SVE2=ON \
-DENABLE_VULKAN=ON \
-DENABLE_CUDA=OFF \
-DBUILD_TESTS=ON ..
make -j$(nproc)Target Hardware: Raspberry Pi 5 with Broadcom BCM2712 SoC (Cortex-A76)
| Feature | Specification |
|---|---|
| CPU | Quad-core ARM Cortex-A76 @ 2.4 GHz |
| SIMD | NEON (128-bit Advanced SIMD) |
| GPU | VideoCore VII (Vulkan 1.2) |
| Memory | 4GB/8GB LPDDR4X-4267 |
| L1 Cache | 64KB per core |
| L2 Cache | 512KB per core |
| L3 Cache | 2MB shared |
NEON Optimizations:
- Complex multiply-accumulate (4 complex float32 per cycle)
- Vectorized dot products (8 float32 per instruction)
- SIMD FFT butterflies with
vfmaq_f32 - Parallel magnitude/phase computation
- Cache-blocked GEMM tuned for A76 cache hierarchy
Cache Blocking Parameters (auto-tuned per platform):
// Pi 5 (2MB L3): MC=128, KC=256, NC=512
// Orange Pi 6+ (12MB L3): MC=256, KC=512, NC=2048
// 8x8 microkernel with 4x4 register tiles (32 NEON registers)
neon_gemm_blocked_f32(C, A, B, M, N, K, lda, ldb, ldc);Performance on Pi 5:
| Operation | NEON Time | Scalar Time | Speedup |
|---|---|---|---|
| Complex dot 4096 | 0.8 μs | 12.4 μs | 15.5x |
| GEMM 256x256 | 1.2 ms | 18 ms | 15x |
| CAF 4096x64 | 5.2 ms | 82 ms | 15.8x |
| CFAR 2D 256x64 | 0.9 ms | 11.2 ms | 12.4x |
| Exp 1M elements | 1.2 ms | 54 ms | 45x |
| Sin 1M elements | 1.5 ms | 48 ms | 32x |
Supported Processors: Any x86_64 CPU with SSE4.2 or AVX2
| Feature | Minimum | Recommended |
|---|---|---|
| CPU | Intel Core i3 / AMD Ryzen 3 | Intel Core i7 / AMD Ryzen 7 |
| SIMD | SSE4.2 | AVX2/AVX-512 |
| Memory | 8GB DDR4 | 16GB+ DDR4/DDR5 |
Eigen3 Auto-Vectorization: The OptMathKernels library uses Eigen3's expression templates which automatically vectorize for AVX/AVX2/AVX-512 when available on x86_64.
# Check for AVX2 support
cat /proc/cpuinfo | grep avx2Supported GPUs: NVIDIA GPUs with Compute Capability 5.0+ (Maxwell and later)
| GPU Generation | Architecture | Compute Capability | CUDA Cores | Tensor Cores | Memory |
|---|---|---|---|---|---|
| GTX 900 | Maxwell | 5.0/5.2 | 1024-3072 | - | 2-12 GB |
| GTX 1000 | Pascal | 6.1 | 1280-3584 | - | 3-11 GB |
| Tesla P100 | Pascal | 6.0 | 3584 | - | 12-16 GB |
| Tesla V100 | Volta | 7.0 | 5120 | 640 | 16-32 GB |
| RTX 2000 | Turing | 7.5 | 2304-4608 | 288-576 | 8-11 GB |
| RTX 3000 | Ampere | 8.6 | 3584-10496 | 112-328 | 8-24 GB |
| A100 | Ampere | 8.0 | 6912 | 432 | 40-80 GB |
| RTX 4000 | Ada Lovelace | 8.9 | 5888-16384 | 184-512 | 8-24 GB |
| H100 | Hopper | 9.0 | 14592 | 456 | 80 GB |
| Jetson Orin | Ampere | 8.7 | 1024-2048 | 32-64 | 8-64 GB |
| RTX 5000 | Blackwell | 10.0 | 21760 | 680 | 32 GB |
Architecture Features:
| Architecture | SM | FP16 | Tensor Cores | TF32 | FP8 |
|---|---|---|---|---|---|
| Maxwell | 5.x | ❌ | ❌ | ❌ | ❌ |
| Pascal | 6.x | ✅ | ❌ | ❌ | ❌ |
| Volta | 7.0 | ✅ | ✅ Gen 1 | ❌ | ❌ |
| Turing | 7.5 | ✅ | ✅ Gen 2 | ❌ | ❌ |
| Ampere | 8.x | ✅ | ✅ Gen 3 | ✅ | ❌ |
| Ada | 8.9 | ✅ | ✅ Gen 4 | ✅ | ✅ |
| Hopper | 9.x | ✅ | ✅ Gen 4 | ✅ | ✅ |
| Blackwell | 10.x | ✅ | ✅ Gen 5 | ✅ | ✅ |
CUDA Toolkit Requirements:
| GPU Architecture | Minimum CUDA | Recommended CUDA | SM Version |
|---|---|---|---|
| Maxwell (GTX 9xx) | CUDA 6.5 | CUDA 11.x+ | SM 5.0/5.2 |
| Pascal (GTX 10xx) | CUDA 8.0 | CUDA 11.x+ | SM 6.0/6.1 |
| Volta (V100) | CUDA 9.0 | CUDA 11.x+ | SM 7.0 |
| Turing (RTX 20xx) | CUDA 10.0 | CUDA 12.x | SM 7.5 |
| Ampere (RTX 30xx) | CUDA 11.0 | CUDA 12.x | SM 8.0/8.6 |
| Ada (RTX 40xx) | CUDA 11.8 | CUDA 12.x | SM 8.9 |
| Hopper (H100) | CUDA 12.0 | CUDA 12.x | SM 9.0 |
| Blackwell (RTX 50xx) | CUDA 12.8 | CUDA 13.0+ | SM 10.0 |
Note: RTX 5090 (Blackwell) requires CUDA 12.8+ for native SM 10.0 support. Earlier CUDA versions (e.g., 12.0 from Ubuntu packages) will compile with SM 9.0a for Hopper, but this may cause runtime issues due to PTX forward compatibility limitations. For optimal Blackwell performance, install CUDA 12.8+ directly from NVIDIA.
CUDA Performance (RTX 4090):
| Operation | CUDA Time | CPU Time | Speedup |
|---|---|---|---|
| Complex FFT 65536 | 0.12 ms | 8.4 ms | 70x |
| GEMM 1024x1024 | 0.08 ms | 45 ms | 562x |
| 2D Convolution 512x512 | 0.15 ms | 120 ms | 800x |
| CAF 16384x256 | 1.2 ms | 450 ms | 375x |
| Exp 1M elements | 0.03 ms | 54 ms | 1800x |
Supported GPUs: Any Vulkan 1.2+ capable GPU
| Vendor | GPUs | Vulkan Version | Driver |
|---|---|---|---|
| NVIDIA | GTX 900+, RTX 20xx/30xx/40xx/50xx | 1.3 | Proprietary |
| AMD | RX 400+, RDNA/RDNA2/RDNA3 | 1.3 | Mesa RADV |
| Intel | UHD 600+, Arc | 1.3 | Mesa ANV |
| Raspberry Pi 5 | VideoCore VII | 1.2 | Mesa V3D |
| Orange Pi 6 Plus | Mali-G720-Immortalis | 1.3 | Mali (panfrost/panvk) |
| Qualcomm | Adreno 6xx+ | 1.1 | Proprietary |
Vulkan on RTX 50xx: The Vulkan backend provides full GPU acceleration on Blackwell GPUs regardless of CUDA toolkit version. This is a good fallback when CUDA 12.8+ is not available.
VideoCore VII (Raspberry Pi 5):
| Feature | Specification |
|---|---|
| QPU Cores | 12 |
| Clock | 800 MHz |
| Compute | ~1 GFLOPS FP32 |
| Shared Memory | 4KB per workgroup |
| Max Workgroup Size | 256 |
Vulkan Compute Shaders: 40 GLSL compute shaders compiled to SPIR-V:
- Vector operations (add, sub, mul, div, dot, norm)
- Matrix operations (add, mul, transpose, scale)
- Reductions (sum, max, min, prefix scan)
- Convolution (1D, 2D, optimized variants)
- FFT (radix-2, radix-4, forward/inverse)
- Radar (CAF Doppler shift, CFAR 2D)
sudo apt update
sudo apt install -y \
build-essential \
cmake \
git \
pkg-config \
libeigen3-dev \
libvulkan-dev \
mesa-vulkan-drivers \
vulkan-tools \
glslang-tools \
glslc# Ubuntu/Debian with NVIDIA driver already installed (CUDA 12.0)
# Note: Ubuntu packages provide CUDA 12.0 which does NOT support Blackwell (RTX 50xx)
sudo apt install -y nvidia-cuda-toolkit
# For RTX 50xx (Blackwell) - REQUIRES CUDA 12.8+ from NVIDIA
# Download from: https://developer.nvidia.com/cuda-downloads
# Ubuntu 22.04/24.04 - NVIDIA repo (gets latest CUDA)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8 # or latest available
# Add to PATH (use NVIDIA's version, not Ubuntu's)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify installation
nvcc --version # Should show 12.8+ for Blackwell
nvidia-smi # Shows driver and GPU infoGPU Architecture Detection:
# Check your GPU's compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# Example output: "NVIDIA GeForce RTX 5090, 10.0" (Blackwell)git clone https://github.com/n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA.git
cd OptimizedKernelsForRaspberryPi5_NvidiaCUDARaspberry Pi 5 (NEON + Vulkan):
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DENABLE_NEON=ON \
-DENABLE_VULKAN=ON \
-DENABLE_CUDA=OFF \
-DBUILD_TESTS=ON \
-DBUILD_BENCHMARKS=OFF \
-DCMAKE_INSTALL_PREFIX=/usr/local \
..
make -j$(nproc)NVIDIA Workstation (CUDA + optional Vulkan):
mkdir -p build && cd build
# Default: All architectures Maxwell through Blackwell (CUDA 13+)
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DENABLE_NEON=OFF \
-DENABLE_VULKAN=ON \
-DENABLE_CUDA=ON \
-DBUILD_TESTS=ON \
-DCMAKE_INSTALL_PREFIX=/usr/local \
..
# Specific architectures - Modern GPUs only (RTX 20xx/30xx/40xx)
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="75;86;89" \
-DENABLE_CUDA=ON \
..
# Legacy GPUs (GTX 900/1000 series) - CUDA 11.x compatible
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="50;52;60;61" \
-DENABLE_CUDA=ON \
..
# For RTX 50xx Blackwell (REQUIRES CUDA 12.8+ or CUDA 13)
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DENABLE_NEON=OFF \
-DENABLE_VULKAN=ON \
-DENABLE_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="100" \
-DCMAKE_CXX_FLAGS="-O3 -march=native" \
-DBUILD_TESTS=ON \
-DCMAKE_INSTALL_PREFIX=/usr/local \
..
make -j$(nproc)CMake Options:
| Option | Default | Description |
|---|---|---|
ENABLE_NEON |
ON (ARM) | Enable ARM NEON SIMD |
ENABLE_SVE2 |
ON | Enable ARM SVE2 (ARMv9) |
ENABLE_FCMA |
ON | Enable FCMA complex multiply |
ENABLE_I8MM |
ON | Enable I8MM int8 matrix multiply |
ENABLE_VULKAN |
ON | Enable Vulkan compute |
ENABLE_CUDA |
ON | Enable NVIDIA CUDA |
CMAKE_CUDA_ARCHITECTURES |
50;52;60;61;70;75;80;86;89;100;103 | CUDA compute capabilities (Maxwell through Blackwell) |
BUILD_TESTS |
ON | Build GoogleTest tests |
BUILD_BENCHMARKS |
OFF | Build Google Benchmark |
CMAKE_POSITION_INDEPENDENT_CODE |
ON | Enable -fPIC (set globally) |
ctest --output-on-failuresudo make install
sudo ldconfigls /usr/local/lib/libOptMathKernels*
ls /usr/local/include/optmath/
ls /usr/local/lib/cmake/OptMathKernels/For complete API documentation of all 473 functions, see:
FunctionsIncluded.md - Complete API Reference
| Backend | Functions | Description |
|---|---|---|
| NEON | 104 | ARM SIMD operations for Raspberry Pi 5 / Orange Pi 6 Plus |
| SVE2 | 46 | ARM SVE2/FCMA/I8MM for ARMv9 (Orange Pi 6 Plus) |
| Platform | 10 | CPU topology detection, thread affinity, cache tuning |
| CUDA | 242 | NVIDIA GPU kernels (cuBLAS, cuFFT, cuSOLVER) |
| Vulkan | 23 | Cross-platform GPU compute shaders |
| Radar | 48 | Passive radar signal processing |
#include <optmath/neon_kernels.hpp> // ARM NEON operations
#include <optmath/sve2_kernels.hpp> // ARM SVE2/FCMA/I8MM operations
#include <optmath/platform.hpp> // CPU detection, thread affinity
#include <optmath/vulkan_backend.hpp> // Vulkan GPU compute
#include <optmath/cuda_backend.hpp> // NVIDIA CUDA operations
#include <optmath/radar_kernels.hpp> // Radar signal processingcmake_minimum_required(VERSION 3.18)
project(MyApp)
find_package(OptMathKernels REQUIRED)
find_package(Eigen3 REQUIRED)
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE
OptMathKernels::OptMathKernels
Eigen3::Eigen
)#include <optmath/neon_kernels.hpp>
if (optmath::neon::is_available()) {
Eigen::VectorXf a = Eigen::VectorXf::Random(1024);
Eigen::VectorXf b = Eigen::VectorXf::Random(1024);
// Basic operations
Eigen::VectorXf c = optmath::neon::neon_add(a, b);
Eigen::VectorXf d = optmath::neon::neon_mul(a, b);
float dot = optmath::neon::neon_dot(a, b);
float norm = optmath::neon::neon_norm(a);
// Reductions
float sum = optmath::neon::neon_reduce_sum(a);
float max = optmath::neon::neon_reduce_max(a);
}Eigen::MatrixXf A = Eigen::MatrixXf::Random(256, 256);
Eigen::MatrixXf B = Eigen::MatrixXf::Random(256, 256);
// Basic GEMM (4x4 tiled)
Eigen::MatrixXf C = optmath::neon::neon_gemm(A, B);
// Optimized blocked GEMM (8x8 microkernel, cache blocking)
Eigen::MatrixXf D = optmath::neon::neon_gemm_blocked(A, B);
// Other operations
Eigen::MatrixXf At = optmath::neon::neon_mat_transpose(A);
Eigen::MatrixXf As = optmath::neon::neon_mat_scale(A, 2.5f);std::vector<float> input(N), output(N);
// Fast exp approximation (~12% relative error at extremes, better near zero)
optmath::neon::neon_fast_exp_f32(output.data(), input.data(), N);
// Fast sin/cos approximation (~1e-5 accuracy)
optmath::neon::neon_fast_sin_f32(output.data(), input.data(), N);
optmath::neon::neon_fast_cos_f32(output.data(), input.data(), N);
// Fast activation functions (~3% sigmoid, ~6% tanh)
optmath::neon::neon_fast_sigmoid_f32(output.data(), input.data(), N);
optmath::neon::neon_fast_tanh_f32(output.data(), input.data(), N);Eigen::VectorXcf a = Eigen::VectorXcf::Random(1024);
Eigen::VectorXcf b = Eigen::VectorXcf::Random(1024);
// Element-wise complex operations
Eigen::VectorXcf c = optmath::neon::neon_complex_mul(a, b);
Eigen::VectorXcf d = optmath::neon::neon_complex_conj_mul(a, b); // a * conj(b)
// Complex dot product: sum(a * conj(b))
std::complex<float> dot = optmath::neon::neon_complex_dot(a, b);
// Magnitude and phase
Eigen::VectorXf mag = optmath::neon::neon_complex_magnitude(a);
Eigen::VectorXf phase = optmath::neon::neon_complex_phase(a);#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;
// Rational rate conversion by L/M (e.g., 48kHz -> 44.1kHz ≈ 147/160)
std::size_t L = 147, M = 160;
// Design a lowpass prototype filter (cutoff at min(pi/L, pi/M))
std::vector<float> filter = /* your FIR lowpass filter */;
// One-shot resampling
std::vector<float> input(10000), output(input.size() * L / M + 100);
std::size_t out_len = 0;
neon_resample_oneshot_f32(output.data(), &out_len,
input.data(), input.size(),
filter.data(), filter.size(), L, M);
// Streaming resampling (maintains state between calls)
PolyphaseResamplerState state;
neon_resample_init(state, filter.data(), filter.size(), L, M);
std::size_t n = neon_resample_f32(output.data(), input.data(), input.size(), state);
// Eigen wrapper
Eigen::VectorXf in_vec = Eigen::VectorXf::Random(10000);
Eigen::VectorXf filt_vec = Eigen::Map<Eigen::VectorXf>(filter.data(), filter.size());
Eigen::VectorXf result = neon_resample(in_vec, filt_vec, L, M);#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;
// Design a lowpass filter (fc=1000Hz, fs=48000Hz, Q=0.707)
BiquadCoeffs lp = neon_biquad_lowpass(1000.0f, 48000.0f);
// Process audio samples
std::vector<float> input(4096), output(4096);
BiquadState state;
neon_biquad_f32(output.data(), input.data(), input.size(), lp, state);
// Cascade for higher-order filtering (4th order = 2 biquad sections)
BiquadCoeffs cascade[2] = {
neon_biquad_lowpass(1000.0f, 48000.0f),
neon_biquad_lowpass(1000.0f, 48000.0f)
};
BiquadState states[2] = {};
neon_biquad_cascade_f32(output.data(), input.data(), input.size(),
cascade, states, 2);
// Other filter types
BiquadCoeffs hp = neon_biquad_highpass(500.0f, 48000.0f);
BiquadCoeffs bp = neon_biquad_bandpass(1000.0f, 48000.0f, 5.0f);
BiquadCoeffs notch = neon_biquad_notch(60.0f, 48000.0f, 30.0f); // 60Hz hum removal#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;
// Input image in row-major layout
std::size_t rows = 480, cols = 640;
std::vector<float> image(rows * cols);
// 3x3 Sobel edge detection (horizontal)
float sobel_x[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
std::vector<float> edges((rows - 2) * (cols - 2));
neon_conv2d_3x3_f32(edges.data(), image.data(), rows, cols, sobel_x);
// 5x5 Gaussian blur
float gauss5x5[25] = { /* your kernel */ };
std::vector<float> blurred((rows - 4) * (cols - 4));
neon_conv2d_5x5_f32(blurred.data(), image.data(), rows, cols, gauss5x5);
// Separable convolution (faster for separable kernels)
float row_k[] = {0.25f, 0.5f, 0.25f};
float col_k[] = {0.25f, 0.5f, 0.25f};
neon_conv2d_separable_f32(blurred.data(), image.data(), rows, cols,
row_k, 3, col_k, 3);
// Eigen wrapper
Eigen::MatrixXf img = Eigen::MatrixXf::Random(64, 64);
Eigen::MatrixXf kern(3, 3);
kern << 1, 2, 1, 2, 4, 2, 1, 2, 1;
kern /= 16.0f;
Eigen::MatrixXf result = neon_conv2d(img, kern);#include <optmath/neon_kernels.hpp>
using namespace optmath::neon;
// Cholesky decomposition (A = L * L^T)
Eigen::MatrixXf A = /* symmetric positive definite matrix */;
Eigen::MatrixXf L = neon_cholesky(A); // returns lower triangular L
// LU decomposition with partial pivoting
Eigen::MatrixXf M = Eigen::MatrixXf::Random(64, 64);
auto [LU, piv] = neon_lu(M);
// QR decomposition (Householder)
auto [Q, R] = neon_qr(M); // Q orthogonal, R upper triangular
// Solve A*x = b (general via LU)
Eigen::VectorXf b = Eigen::VectorXf::Random(64);
Eigen::VectorXf x = neon_solve(M, b);
// Solve SPD system via Cholesky (faster for symmetric positive definite)
Eigen::VectorXf x_spd = neon_solve_spd(A, b);
// Triangular solve
Eigen::VectorXf y = neon_trsv_lower(L, b); // solve L*y = b
Eigen::VectorXf z = neon_trsv_upper(R, b); // solve R*z = b
// Matrix inverse
Eigen::MatrixXf Minv = neon_inverse(M);#include <optmath/radar_kernels.hpp>
using namespace optmath::radar;
// Reference and surveillance signals
Eigen::VectorXcf ref = /* transmitter signal */;
Eigen::VectorXcf surv = /* receiver signal */;
// CAF parameters
size_t n_doppler_bins = 101;
size_t n_range_bins = 500;
float doppler_start = -500.0f; // Hz
float doppler_step = 10.0f; // Hz
float sample_rate = 1e6f; // Hz
// Compute CAF (output: n_doppler x n_range magnitude matrix)
Eigen::MatrixXf caf_mag = caf(ref, surv,
n_doppler_bins, doppler_start, doppler_step,
sample_rate, n_range_bins);
// Find peak (target detection)
Eigen::Index doppler_bin, range_bin;
float peak_mag = caf_mag.maxCoeff(&doppler_bin, &range_bin);// 1D CA-CFAR
Eigen::VectorXf power_data = /* input power */;
size_t guard_cells = 4;
size_t reference_cells = 16;
float pfa_factor = 10.0f;
auto detections = cfar_ca(power_data, guard_cells, reference_cells, pfa_factor);
// 2D CFAR for range-Doppler maps
Eigen::MatrixXf range_doppler_map = /* CAF output */;
auto det_2d = cfar_2d(range_doppler_map,
guard_range, guard_doppler,
ref_range, ref_doppler,
pfa_factor);#include <optmath/cuda_backend.hpp>
if (optmath::cuda::is_available()) {
optmath::cuda::init();
Eigen::VectorXf a = Eigen::VectorXf::Random(1000000);
Eigen::VectorXf b = Eigen::VectorXf::Random(1000000);
// Vector operations
Eigen::VectorXf c = optmath::cuda::cuda_add(a, b);
float dot = optmath::cuda::cuda_dot(a, b);
// Matrix operations (Tensor Core accelerated)
Eigen::MatrixXf A = Eigen::MatrixXf::Random(1024, 1024);
Eigen::MatrixXf B = Eigen::MatrixXf::Random(1024, 1024);
Eigen::MatrixXf C = optmath::cuda::cuda_gemm(A, B);
// FFT
Eigen::VectorXcf signal = Eigen::VectorXcf::Random(65536);
Eigen::VectorXcf spectrum = optmath::cuda::cuda_fft(signal);
// Transcendentals
Eigen::VectorXf exp_x = optmath::cuda::cuda_exp(a);
Eigen::VectorXf sigmoid_x = optmath::cuda::cuda_sigmoid(a);
}#include <optmath/vulkan_backend.hpp>
if (optmath::vulkan::is_available()) {
Eigen::VectorXf a = Eigen::VectorXf::Random(1000000);
Eigen::VectorXf b = Eigen::VectorXf::Random(1000000);
// Vector operations
Eigen::VectorXf c = optmath::vulkan::vulkan_vec_add(a, b);
float dot = optmath::vulkan::vulkan_vec_dot(a, b);
// Matrix operations
Eigen::MatrixXf A = Eigen::MatrixXf::Random(512, 512);
Eigen::MatrixXf B = Eigen::MatrixXf::Random(512, 512);
Eigen::MatrixXf C = optmath::vulkan::vulkan_mat_mul(A, B);
// Convolution
Eigen::VectorXf signal = Eigen::VectorXf::Random(10000);
Eigen::VectorXf kernel = Eigen::VectorXf::Random(64);
Eigen::VectorXf result = optmath::vulkan::vulkan_convolution_1d(signal, kernel);
}cmake -DBUILD_BENCHMARKS=ON ..
make -j$(nproc)
# Run individual benchmarks
./benchmarks/bench_neon_transcendentals # exp, sin, cos, sigmoid, tanh
./benchmarks/bench_neon_gemm # Matrix multiplication
./benchmarks/bench_neon_fft # FIR, cross-correlation, complex ops
./benchmarks/bench_vulkan_matmul # GPU vector/matrix operations
./benchmarks/bench_radar_caf # CAF, CFAR, beamformingExample output (Raspberry Pi 5):
BM_NEON_Exp_Approx/1048576 1.2 ms 1.2 ms 580 FLOPS=13.1G/s
BM_Std_Exp/1048576 45.2 ms 45.1 ms 15 FLOPS=23.2M/s
BM_NEON_ComplexMul/65536 0.1 ms 0.1 ms 6200 Elements/s=655M
BM_CAF_NEON/4096x64x256 4.8 ms 4.7 ms 148 CAF/s=212
Tested on CIX P1 CD8160 (12x cores: 8x Cortex-A720 + 4x Cortex-A520) @ 2.6 GHz, Mali-G720-Immortalis GPU (Vulkan 1.3.275). All benchmarks compiled with -march=armv9-a+sve2+bf16+i8mm, C++20.
| Benchmark | Size | Time | FLOPS | Notes |
|---|---|---|---|---|
| NEON GEMM 4x4 | 16 | 30.0 ns | 17.1 GFLOPS | Micro-kernel |
| NEON GEMM 4x4 | 128 | 245 ns | 16.7 GFLOPS | L1-resident |
| NEON GEMM 4x4 | 1024 | 1.91 μs | 17.1 GFLOPS | Streaming |
| NEON GEMM Blocked | 32 | 4.48 μs | 14.6 GFLOPS | Cache-blocked |
| NEON GEMM Blocked | 256 | 2.87 ms | 11.7 GFLOPS | L3-resident |
| NEON GEMM Blocked | 512 | 29.9 ms | 9.01 GFLOPS | Beyond L3 |
| Eigen GEMM | 32 | 2.25 μs | 29.1 GFLOPS | Reference |
| Eigen GEMM | 256 | 919 μs | 36.6 GFLOPS | Highly optimized |
| Eigen GEMM | 512 | 7.14 ms | 37.7 GFLOPS | AVX-class perf |
| NEON MatVec | 1024 | 220 μs | 9.55 GFLOPS | Matrix-vector |
| NEON MatVec | 2048 | 867 μs | 9.71 GFLOPS | Large |
| Function | Size | NEON FLOPS | std:: FLOPS | Speedup |
|---|---|---|---|---|
| exp | 1M | 8.55 GFLOPS | 261 MFLOPS | 32.8x |
| sin | 1M | 8.55 GFLOPS | 228 MFLOPS | 37.5x |
| cos | 1M | 7.41 GFLOPS | — | ~33x |
| sigmoid | 256K | 8.09 GFLOPS | 408 MFLOPS | 19.8x |
| tanh | 256K | 8.66 GFLOPS | 73.8 MFLOPS | 117x |
| ReLU | 1M | 34.2 GB/s | — | Memory-bound |
| Benchmark | Size | Time | Throughput |
|---|---|---|---|
| FIR Filter | 16K samples, 64 taps | 179 μs | 11.7 GFLOPS |
| FIR Filter | 64K samples, 128 taps | 1.53 ms | 10.9 GFLOPS |
| Cross-Correlation | 4096 | 3.65 ms | 9.18 GFLOPS |
| Complex Multiply | 4096 | 8.08 μs | 3.04 GFLOPS |
| Complex Magnitude | 64K | 87.5 μs | 3.00 GFLOPS |
| Dot Product (NEON) | 256K | 60.9 μs | 8.62 GFLOPS |
| Dot Product (Eigen) | 256K | 48.0 μs | 10.9 GFLOPS |
| Benchmark | Size | Wall Time | GPU FLOPS | Notes |
|---|---|---|---|---|
| Matrix Multiply | 64x64 | 228 μs | 5.54 GFLOPS | Setup overhead |
| Matrix Multiply | 256x256 | 1.19 ms | 108 GFLOPS | Tiled 32x32 |
| Matrix Multiply | 512x512 | 7.07 ms | 274 GFLOPS | Sweet spot |
| Matrix Multiply | 1024x1024 | 51.2 ms | 481 GFLOPS | Peak compute |
| Vec Add | 1M | 3.06 ms | 5.98 GB/s | Memory-bound |
| Vec Add | 4M | 15.0 ms | 4.27 GB/s | Large transfer |
| Benchmark | Parameters | Time | FLOPS |
|---|---|---|---|
| CAF | 4096 samples, 41 Doppler, 100 range | 9.94 ms | 17.0 GFLOPS |
| CAF | 16384 samples, 61 Doppler, 200 range | 110 ms | 18.3 GFLOPS |
| CAF | 65536 samples, 101 Doppler, 500 range | 1.82 s | 18.2 GFLOPS |
| CFAR CA 1D | 64K samples | 6.81 ms | 1.23 GFLOPS |
| CFAR 2D | 256x512 range-Doppler | 34.0 ms | 247 MFLOPS |
| CFAR 2D | 512x1024 range-Doppler | 137 ms | 246 MFLOPS |
| NLMS Filter | 64K samples, 64 taps | 4.07 ms | 4.12 GFLOPS |
| NLMS Filter | 256K samples, 128 taps | 32.2 ms | 4.17 GFLOPS |
| MTI Filter | 256 pulses x 2048 range | 2.55 ms | 1.22 GFLOPS |
| Beamform (Phase) | 8 elements, 16K samples | 267 μs | 3.92 GFLOPS |
| Beamform (Delay-Sum) | 16 elements, 64K samples | 2.63 ms | 801 MFLOPS |
| Steering Vector | 64 elements | 19.1 μs | 104 M items/s |
OptMathKernels Benchmark (N=1000000)
------------------------------------------
NEON: Available
Vulkan: Available (GPU initialized) — Mali-G720-Immortalis
--- Dot Product ---
Eigen (CPU): 0.303 ms, Result: -95.2917
NEON : 0.289 ms, Result: -95.2971 (Diff: 0.005)
Vulkan : 15.5 ms, Result: -95.2891 (Diff: 0.003)
--- Vector Addition ---
Eigen (CPU): 1.59 ms
NEON : 1.61 ms, Norm Diff: 0
Vulkan : 7.79 ms, Norm Diff: 0
--- FIR Filter (Small Input) ---
Naive CPU : 0.910 ms
NEON : 0.242 ms, Norm Diff: 8.16e-05 (3.8x speedup)
Vulkan : 7.12 ms, Norm Diff: 8.16e-05
| Test Suite | Tests | Status | Backend |
|---|---|---|---|
test_basic |
1 | Passed | Core |
test_neon_kernels |
4 | Passed | NEON |
test_neon_complex |
7 | Passed | NEON |
test_neon_transcendentals |
10 | Passed | NEON |
test_neon_resample |
7 | Passed | NEON |
test_neon_iir |
10 | Passed | NEON |
test_neon_conv2d |
9 | Passed | NEON |
test_neon_linalg |
21 | Passed | NEON |
test_sve2_kernels |
18 | Passed | SVE2 |
test_platform |
9 | Passed | Platform |
test_vulkan_vector |
1 | Passed | Vulkan |
test_vulkan_matrix |
1 | Passed | Vulkan |
test_vulkan_dsp |
1 | Passed | Vulkan |
test_vulkan_advanced |
2 | Passed | Vulkan |
test_radar_caf |
6 | Passed | Radar |
test_radar_cfar |
9 | Passed | Radar |
"Vulkan not found" during CMake:
# Install Vulkan SDK
sudo apt install -y libvulkan-dev vulkan-tools glslang-tools
# Verify Vulkan
vulkaninfo | head -20"Could NOT find GTest":
# GTest is fetched automatically via FetchContent
# Clear build directory and re-run CMake
rm -rf build && mkdir build && cd build
cmake .."CUDA not found":
# Ensure CUDA toolkit is installed
nvcc --version
# Add to PATH if needed
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH"Unsupported gpu architecture 'compute_100'" (Blackwell):
# Your CUDA toolkit is too old for Blackwell (RTX 50xx)
nvcc --version # Shows version (needs 12.8+ for SM 100)
# Option 1: Install CUDA 12.8+ from NVIDIA
# https://developer.nvidia.com/cuda-downloads
# Option 2: Use Hopper SM 90a (runs via PTX forward compatibility, may have issues)
cmake -DCMAKE_CUDA_ARCHITECTURES="90a" ..
# Option 3: Use Vulkan backend instead (full GPU acceleration, no CUDA needed)
cmake -DENABLE_CUDA=OFF -DENABLE_VULKAN=ON ..CUDA tests fail with zeros/segfault on RTX 50xx:
# PTX forward compatibility from SM 90a to Blackwell SM 100 can have issues
# The Vulkan backend works correctly on RTX 50xx
# Solution: Install CUDA 12.8+ for native Blackwell support
# Or use Vulkan for GPU acceleration until CUDA is upgraded"-fPIC" linking error when building shared libraries:
# Rebuild with position-independent code
cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON ..
make -j$(nproc)
sudo make install"failed to open file: vec_add.comp.spv":
# Run sudo make install to place shaders in /usr/local/share/...
sudo make install
# Or set shader path manually
export OPTMATH_KERNELS_PATH=/usr/local/share/optmathkernels/shaders/NEON tests skipped:
# NEON is only enabled on ARM platforms
# On x86, tests will skip with "NEON not available"
Eigen3 not found when using installed package:
sudo apt install libeigen3-devHigh CPU usage on Pi 5:
- Use NEON for operations that fit in L2 cache (<512KB)
- Use Vulkan for large arrays where GPU parallelism helps
- Enable cache blocking for large GEMM operations
GPU not being used:
# Check Vulkan
vulkaninfo | grep "GPU"
# Check CUDA
nvidia-smiOptMathKernels/
├── include/optmath/
│ ├── neon_kernels.hpp # NEON API declarations (104 functions)
│ ├── sve2_kernels.hpp # SVE2/FCMA/I8MM API declarations (46 functions)
│ ├── platform.hpp # Platform detection, thread affinity (10 functions)
│ ├── vulkan_backend.hpp # Vulkan API declarations (23 functions)
│ ├── cuda_backend.hpp # CUDA API declarations (242 functions)
│ └── radar_kernels.hpp # Radar processing API (48 functions)
├── src/
│ ├── neon/ # ARM NEON (128-bit SIMD) kernels
│ │ ├── neon_kernels.cpp # Core vector ops (4-accumulator dot, add/sub/mul/div),
│ │ │ # reductions (sum/max/min), 4x4 GEMM microkernel,
│ │ │ # FIR filter, ReLU, vectorized transcendentals
│ │ │ # (6th-order minimax exp/sin/cos/sigmoid/tanh)
│ │ ├── neon_complex.cpp # Split & interleaved complex arithmetic (vld2q/vst2q),
│ │ │ # complex dot product, magnitude (Newton-Raphson rsqrt),
│ │ │ # phase, complex exponential
│ │ ├── neon_gemm_optimized.cpp # 3-level Goto-style cache-blocked GEMM with 8x8
│ │ │ # microkernel (vmlaq_laneq_f32 rank-1 updates),
│ │ │ # runtime A76/A720 cache tuning (MC/KC/NC)
│ │ ├── neon_radar.cpp # Window functions (Hamming-Kaiser), CAF with vectorized
│ │ │ # Doppler shift, CFAR (CA/2D-SAT/OS), NLMS adaptive
│ │ │ # filter, projection clutter, DFT, MTI, beamforming
│ │ ├── neon_resample.cpp # Polyphase L:M rational resampler with NEON FIR
│ │ ├── neon_iir.cpp # Biquad IIR DF2T, cascade, Bristow-Johnson design
│ │ │ # (lowpass/highpass/bandpass/notch)
│ │ ├── neon_conv2d.cpp # General NxM, separable, unrolled 3x3 & 5x5 convolution
│ │ └── neon_linalg.cpp # TRSV/TRSM, Cholesky, LU (partial pivot), QR
│ │ # (Householder), general/SPD solve, matrix inverse
│ ├── sve2/ # ARM SVE2 (scalable vector) kernels
│ │ ├── sve2_kernels.cpp # Predicated vector ops (svwhilelt_b32 loops),
│ │ │ # transcendentals (predicated Horner/Chebyshev),
│ │ │ # 8x8 GEMM microkernel (MC=256/KC=512/NC=2048),
│ │ │ # I8MM int8 GEMM (svmmla_s32), FIR filter
│ │ ├── sve2_complex.cpp # FCMA-accelerated complex multiply (svcmla rotations
│ │ │ # 0/90/270, 2 instructions vs 4), split & interleaved
│ │ │ # formats, complex dot/magnitude (svsqrt)/phase/exp
│ │ ├── sve2_radar.cpp # CAF with predicated complex MAC (svmla_f32_m merging),
│ │ │ # cross-correlation, phase-shift beamforming, windowing
│ │ └── sve2_detect.cpp # Runtime SVE2 detection via getauxval(AT_HWCAP2)
│ ├── platform/
│ │ └── platform.cpp # CPU topology, thread affinity, cache detection
│ ├── vulkan/
│ │ ├── vulkan_backend.cpp # Vulkan context & dispatch (Mali-G720 auto-detect)
│ │ └── shaders/ # 40 GLSL compute shaders
│ │ ├── vec_add.comp.glsl
│ │ ├── mat_mul_tiled.comp.glsl
│ │ ├── mat_mul_tiled_mali.comp.glsl # 32x32 tiles for Mali-G720
│ │ ├── reduce_sum_mali.comp.glsl # 1024-thread subgroup reduction
│ │ ├── fft_radix2.comp.glsl
│ │ ├── caf_doppler_shift.comp.glsl
│ │ ├── cfar_2d.comp.glsl
│ │ └── ... (33 more shaders)
│ └── cuda/ # NVIDIA CUDA GPU kernels
│ ├── cuda_backend.cpp # Context, device enumeration, memory management
│ ├── cuda_kernels.cu # Vector elementwise (float4 vectorized), transcendentals
│ │ # (__expf/__sinf/__cosf/__sincosf fast-math intrinsics),
│ │ # activation functions (sigmoid/tanh/ReLU/GELU/softmax),
│ │ # matrix ops (tiled transpose with bank-conflict padding),
│ │ # cuBLAS (Sgemm/Sdot/Snrm2/Sgemv), CUB reductions
│ │ # (DeviceReduce::Sum/Max/Min), cuSOLVER Cholesky
│ │ # (Spotrf/Dpotrf/Spotrs, architecture-aware thresholds)
│ ├── cuda_complex.cu # Split-format complex arithmetic, warp-level dot product
│ │ # (__shfl_down_sync reduction + atomicAdd), format
│ │ # conversion (interleave/deinterleave), convolution
│ │ # (1D naive/shared-memory, 2D), cuFFT wrappers
│ │ # (C2C 1D/batch/2D via cufftPlan1d/PlanMany/Plan2d)
│ └── cuda_radar.cu # Window functions (Hamming-Blackman-Harris), GPU-resident
│ # CAF pipeline (Doppler shift + FFT + conj-mul + IFFT),
│ # CFAR (1D CA + 2D), Bartlett beamformer with shared-
│ # memory reduction, ULA steering vectors, NLMS (CPU),
│ # projection clutter cancellation
├── tests/
│ ├── test_neon_kernels.cpp # NEON unit tests
│ ├── test_neon_complex.cpp # Complex operation tests
│ ├── test_neon_transcendentals.cpp
│ ├── test_neon_resample.cpp # Polyphase resampler tests
│ ├── test_neon_iir.cpp # Biquad IIR filter tests
│ ├── test_neon_conv2d.cpp # 2D convolution tests
│ ├── test_neon_linalg.cpp # Dense linear algebra tests (21 tests)
│ ├── test_sve2_kernels.cpp # SVE2 unit tests (18 tests)
│ ├── test_platform.cpp # Platform detection tests (9 tests)
│ ├── test_vulkan_vector.cpp # Vulkan vector tests
│ ├── test_vulkan_matrix.cpp # Vulkan matrix tests
│ ├── test_vulkan_dsp.cpp # Vulkan DSP tests
│ ├── test_cuda_kernels.cpp # CUDA unit tests
│ ├── test_cuda_radar.cpp # CUDA radar tests
│ ├── test_radar_caf.cpp # CAF computation tests
│ └── test_radar_cfar.cpp # CFAR detection tests
├── benchmarks/
│ ├── bench_neon_transcendentals.cpp
│ ├── bench_neon_gemm.cpp
│ ├── bench_neon_fft.cpp
│ ├── bench_vulkan_matmul.cpp
│ └── bench_radar_caf.cpp
├── examples/
│ ├── demo.cpp # NEON/Vulkan demo
│ └── cuda_demo.cpp # CUDA demo
├── cmake/
│ └── OptMathKernelsConfig.cmake.in # CMake package config
├── FunctionsIncluded.md # Complete API reference (473 functions)
└── README.md # This file
- PassiveRadar_Kraken: Complete passive bistatic radar system using OptMathKernels for acceleration. Includes GNU Radio blocks, multi-target tracking, and real-time displays.
Source-Level Documentation:
Every kernel source file now has a thorough header comment documenting all functional blocks, specific API/intrinsic usage, and algorithmic techniques. This makes the codebase self-documenting for contributors and users reading the source.
CUDA Kernels (3 files):
cuda_kernels.cu: 8 documented blocks — vector elementwise (float4 vectorized), transcendentals (CUDA__expf/__sinf/__cosf/__sincosf/__tanf/__powffast-math intrinsics), activation functions (sigmoid/tanh/ReLU/leaky-ReLU/GELU/softmax with shared-memory reduction), matrix ops (tiled transpose with +1 bank-conflict padding), cuBLAS wrappers (cublasSgemm/cublasSdot/cublasSnrm2/cublasSgemv), CUB parallel reductions (cub::DeviceReduce::Sum/Max/Min), Eigen host wrappers (error-checked cudaMalloc/cudaMemcpy with CPU fallback), cuSOLVER Cholesky (cusolverDnSpotrf/cusolverDnDpotrf/cusolverDnSpotrswith architecture-aware GPU/CPU thresholds)cuda_complex.cu: 7 documented blocks — split-format complex arithmetic, complex analysis kernels, warp-level complex dot product (__shfl_down_sync+atomicAdd), format conversion (interleave/deinterleave), convolution (1D naive/shared-memory template, 2D), cuFFT wrappers (cufftPlan1d/cufftPlanMany/cufftPlan2dwithCUFFT_C2C), Eigen complex wrapperscuda_radar.cu: 7 documented blocks — window function generators, GPU-resident CAF pipeline (Doppler shift via__sincosf+ FFT + conj-multiply + IFFT + magnitude extraction), CFAR detection (1D CA-CFAR + 2D with guard cells), Doppler windowing, Bartlett beamformer (ULA steering vectors + shared-memory spectrum reduction), NLMS adaptive filter (CPU, sequential weight updates), projection clutter cancellation
NEON Kernels (8 files):
neon_kernels.cpp: Vector ops with 4-accumulator FMA pipeline utilization, vectorized transcendentals (6th-order minimax polynomials with IEEE754 bit manipulation for exponent reconstruction)neon_complex.cpp: Split and interleaved (vld2q_f32/vst2q_f32) complex arithmetic, Newton-Raphson rsqrt magnitude (vrsqrteq_f32/vrsqrtsq_f32)neon_gemm_optimized.cpp: 3-level Goto-style cache-blocked GEMM, 8x8 microkernel withvmlaq_laneq_f32rank-1 updates, runtime A76/A720 cache parameter tuningneon_radar.cpp: Window functions (including Kaiser with Bessel I0), CAF, CFAR (CA/2D-SAT/OS), NLMS, projection clutter, DFT, MTI, delay-sum and phase-shift beamformingneon_conv2d.cpp: General, separable, fully-unrolled 3x3 and 5x5 convolutionneon_iir.cpp: Biquad Direct Form II Transposed, cascade, Bristow-Johnson filter designneon_linalg.cpp: TRSV/TRSM, Cholesky (A=L*L^T), LU (partial pivot), QR (Householder), solvers, matrix inverseneon_resample.cpp: Polyphase L:M rational rate conversion with streaming delay line
SVE2 Kernels (4 files):
sve2_kernels.cpp: Predicated vector ops (svwhilelt_b32loops), transcendentals (predicated Horner/Chebyshev polynomials), 8x8 GEMM microkernel (MC=256/KC=512/NC=2048 for A720 12MB L3), I8MM int8 GEMM (svmmla_s32)sve2_complex.cpp: FCMA-accelerated complex multiply (svcmla_f32_zrotations 0/90/270 for 2-instruction complex multiply), non-FCMA fallback withsvtbl_f32deinterleaving, nativesvsqrt_f32_zmagnitudesve2_radar.cpp: CAF with predicated complex MAC (svmla_f32_mmerging semantics), cross-correlation, phase-shift beamformingsve2_detect.cpp: Runtime SVE2 detection viagetauxval(AT_HWCAP2) & HWCAP2_SVE2
README Updated:
- File Structure section now includes per-file descriptions of all functional blocks and key APIs
- Each source file entry documents the specific intrinsics, algorithms, and CUDA/cuBLAS/cuFFT/cuSOLVER calls used
Bug Fixes:
- SVE2 GEMM buffer bounds clamping - Added
std::min()clamping of runtime MC/KC/NC parameters to static buffer maximums, preventing potential buffer overflow if platform detection returned values exceedingMAX_MC/MAX_KC/MAX_NC. NEON GEMM already had this safety check; SVE2 GEMM was missing it. - NEON GEMM docstring correction - Fixed inaccurate cache blocking parameter values in the file header comment (claimed MC=384-512 but code used MC=128-256).
Optimizations (CIX P1 / Orange Pi 6 Plus):
- GEMM NC doubled to 2048 for large L3 - B panel now occupies 4MB (33% of 12MB L3) vs prior 2MB (17%). Reduces main memory traffic by ~2x for large matrix multiplications on the CIX P1 CD8160.
- SVE2 GEMM microkernel prefetch - Added
svprfb(SV_PLDL1KEEP)prefetch hints to pipeline next iteration's A/B panels into L1 data cache, reducing stall cycles on Cortex-A720.
Platform Detection Enhancements:
- L2 cache size detection - New
get_l2_cache_size()API reads per-core L2 from sysfs (probes a performance core), with heuristic fallbacks: A720=512KB, A520=256KB, A76=512KB. CpuInfo::l2_cache_bytesfield - Cached L2 size available viadetect_cpu_info().- Updated test suite - Platform test validates L2 detection and new NC=2048 GEMM blocking.
Test Results (CIX P1 CD8160, aarch64, all 16/16 pass):
| Backend | Tests | Status |
|---|---|---|
| NEON | 7/7 | Pass |
| SVE2 | 1/1 | Pass |
| Vulkan | 4/4 | Pass |
| Radar | 2/2 | Pass |
| Platform | 1/1 | Pass |
| Basic | 1/1 | Pass |
CUDA 13.0 Full Support:
- CUDA 13.0.88 verified - Complete build and test pass with CUDA 13 toolkit
- SM 103 architecture added - Support for RTX 5070/5060 (GB205/GB206 Blackwell variants)
- Updated architecture list - CUDA 13+ now builds: SM 75, 80, 86, 89, 90, 100, 103
- Improved Blackwell detection - CMake now detects both SM 100 and SM 103 for
OPTMATH_CUDA_BLACKWELL
Test Results (x86-64 with RTX 5090, CUDA 13.0.88):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| CUDA | 36/36 | Pass | Full GPU acceleration with TF32 Tensor Cores |
| Vulkan | 4/4 | Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | Pass | Eigen auto-vectorized for AVX2 |
Architecture Support (CUDA 13+):
| SM | Architecture | GPUs |
|---|---|---|
| 75 | Turing | RTX 2060/2070/2080 |
| 80/86 | Ampere | RTX 3060/3070/3080/3090, A100 |
| 89 | Ada Lovelace | RTX 4060/4070/4080/4090 |
| 90 | Hopper | H100 |
| 100 | Blackwell | RTX 5080/5090 (GB202/GB203) |
| 103 | Blackwell | RTX 5060/5070 (GB205/GB206) |
CUDA 13 Verified:
- Full test pass on RTX 5090 (Blackwell) - All 36 CUDA tests pass including 7 Cholesky tests
- cuSolver Cholesky decomposition - Verified working with CUDA 13's cusolverDnSpotrf/Dpotrf
- SM 75 (Turing) fallback - Architecture-aware thresholds work correctly from SM 75-100
TF32 Precision Handling:
- MatrixGEMM test tolerance adjusted for TF32 (TensorFloat-32) precision on Ampere+ GPUs
- TF32 uses 19-bit mantissa vs FP32's 24-bit, allowing ~0.4% relative error
- Tolerance updated to 1% relative + 5e-3 absolute to account for accumulated rounding in GEMM
Test Results (x86-64 with RTX 5090, CUDA 13.0):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| CUDA | 36/36 | Pass | Full GPU acceleration with TF32 Tensor Cores |
| Vulkan | 4/4 | Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | Pass | Eigen auto-vectorized for AVX2 |
CUDA 13 Support:
- Full CUDA 13.0 compatibility - API changes for
cudaMemPrefetchAsyncandcudaDeviceProp - Architecture auto-detection - CMake automatically selects appropriate architectures based on CUDA toolkit version:
- CUDA 13+: SM 75, 80, 86, 89, 90, 100, 103 (Turing through Blackwell)
- CUDA 12.x: SM 50-89 (Maxwell through Ada)
- CUDA 11.x: SM 50-86 (Maxwell through Ampere)
- Native build option - Use
-DOPTMATH_CUDA_NATIVE=ONfor faster compilation targeting only the local GPU - Blackwell (SM 100) optimizations - Full native support with CUDA 13
Architecture Compile Definitions:
New compile definitions for architecture-specific code paths:
OPTMATH_CUDA_13_PLUS- CUDA 13+ detectedOPTMATH_CUDA_BLACKWELL- SM 100 enabledOPTMATH_CUDA_HOPPER- SM 90 enabledOPTMATH_CUDA_ADA- SM 89 enabledOPTMATH_CUDA_AMPERE- SM 80/86 enabledOPTMATH_CUDA_TURING- SM 75 enabled
cuSolver Cholesky Improvements:
- Enhanced fallback chain with architecture-aware thresholds
- Small matrix optimization (CPU path for matrices < 64-128 depending on architecture)
- Improved DeviceInfo API with architecture detection helpers
Note: CUDA 13 dropped support for SM < 75. For Maxwell/Pascal/Volta GPUs, use CUDA 12.x.
NVIDIA Blackwell Support:
- RTX 5090 Tested: Full build and test run on NVIDIA GeForce RTX 5090 (SM 10.0, Blackwell architecture)
- CUDA 12.8+ Requirement: Documented that Blackwell SM 100 requires CUDA 12.8+ for native support
- Vulkan Fallback: Vulkan 1.3 backend provides full GPU acceleration on Blackwell when CUDA toolkit is older
- Architecture Table Updated: Added RTX 5090 specs (21760 CUDA cores, 680 Tensor Cores, 32GB VRAM)
Build System:
- Multi-Architecture Support: Build instructions updated for RTX 20xx/30xx/40xx (SM 75/86/89) and RTX 50xx (SM 100)
- Native Optimization: Added
-march=native -mtune=nativeflags for x86-64 builds - Local Install: Support for
~/.localprefix installation without root access
Documentation:
- CUDA Toolkit Requirements Table: Clear mapping of GPU generations to minimum CUDA versions
- Troubleshooting Expanded: Added Blackwell-specific error messages and solutions
- GPU Architecture Detection: Added
nvidia-smicommands to identify compute capability
Test Results (x86-64 with RTX 5090, CUDA 12.0, Vulkan 1.3):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| Vulkan | 4/4 | ✅ Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | ✅ Pass | Eigen auto-vectorized for AVX2 |
| CUDA | 0/1 | Requires CUDA 12.8+ for Blackwell |
Known Limitation: CUDA tests fail on RTX 50xx with CUDA 12.0 due to PTX forward compatibility issues. Install CUDA 12.8+ from NVIDIA for native Blackwell support, or use the Vulkan backend.
SVE2 Transcendentals — Eliminate Heap Allocations (sve2_kernels.cpp):
sve2_fast_cos_f32: Inlined sin polynomial with pi/2 offset in a single SVE2 predicated pass. Previously allocated astd::vector<float>temp buffer and made 2 passes over the data.sve2_fast_sigmoid_f32: Fused single-pass with inline exp(-x) computation. Previously allocated 2std::vectorbuffers and made 3 passes (clamp+negate, exp, divide).sve2_fast_tanh_f32: Fused single-pass with inline exp(-2x) + sigmoid. Previously allocated 2std::vectorbuffers and made 3 passes.- Impact: Eliminates 1-2 heap allocations and 1-2 extra data passes per call in hot paths.
SVE2 GEMM Microkernel — Vectorize with SVE2 FMA (sve2_kernels.cpp):
micro_kernel_8x8_sve2: Replaced scalarfloat acc[8][8]loop with SVE2 column-oriented accumulators usingsvmla_n_f32_zfor rank-1 broadcast FMA, matching the NEON kernel's register-blocked design.- Uses lo/hi vector pairs for 8-row columns on 128-bit SVE2 (CIX P1
svcntw()=4). - Vectorized load-add-store for C matrix writeback.
CAF Doppler Shift — Vectorize Trig in Hot Loop (neon_radar.cpp, sve2_radar.cpp):
- Replaced per-sample
std::cos/std::sincalls (the CAF bottleneck) with batchneon_fast_cos/sin_f32andsve2_fast_cos/sin_f32. - Doppler phase rotation now uses vectorized
neon_complex_mul_f32/sve2_complex_mul_f32instead of scalar multiply. - Estimated ~10x speedup for the Doppler shift phase of CAF computation.
SVE2 Complex Exponential — Vectorize (sve2_complex.cpp):
sve2_complex_exp_f32: Replaced scalarstd::cos/std::sinloop withsve2_fast_cos/sin_f32for full SVE2 vectorization.
Architecture Safety: All changes guarded by #ifdef OPTMATH_USE_SVE2 / #ifdef OPTMATH_USE_NEON with scalar fallbacks intact. No changes to headers, APIs, or non-ARM code paths.
All 16 test suites pass (100%) on Orange Pi 6 Plus.
Bug Fixes:
-
SVE2 SIGILL on Non-SVE2 Hardware (
sve2_detect.cpp,src/CMakeLists.txt):is_available()was compiled with-march=armv9-a+sve2flags insve2_kernels.cpp, causing SIGILL crash on hardware without SVE2 (e.g., Raspberry Pi 5 Cortex-A76)- Moved
is_available()to newsve2_detect.cppcompiled without SVE2 flags - Now performs runtime detection via
getauxval(AT_HWCAP2) & HWCAP2_SVE2 - SVE2 tests gracefully skip on non-SVE2 hardware instead of crashing
-
Platform CPU Model Name Empty on ARM Kernels (
platform.cpp):- Raspberry Pi 5
/proc/cpuinfolacks themodel namefield, causingDetectCPUInfotest failure - Added fallback that maps ARM CPU part IDs to model names (e.g.,
0xd0b→ "Cortex-A76",0xd81→ "Cortex-A720") - Covers Cortex-A53 through Cortex-X4
- Raspberry Pi 5
All 16 test suites pass (100%) on Raspberry Pi 5.
Performance Optimizations:
-
Dot Product / Reduce Sum (
neon_kernels.cpp):- 4 independent accumulators to break FMA dependency chain (A76: 4-cycle latency, 2 pipelines)
-
Transcendentals (
neon_kernels.cpp):neon_fast_cos_f32: inline sin polynomial with π/2 offset, eliminatestd::vectorheap allocationneon_fast_sigmoid_f32: fused single-pass with inline exp(-x), eliminate 2 heap allocations per callneon_fast_tanh_f32: fused single-pass with inline exp(-2x) + sigmoid, eliminate 4 heap allocations per call
-
GEMM (
neon_gemm_optimized.cpp):micro_kernel_8x8: column-oriented accumulators for vector store (16 vector ops vs 64 scalarvgetq_lane_f32extractions)neon_gemm_blockedEigen wrapper: enable actual blocked GEMM path
-
Complex Exponential (
neon_complex.cpp):neon_complex_exp_f32: vectorize usingneon_fast_sin/cos_f32
Vulkan Fix:
- SPV Shader Discovery for FetchContent (
vulkan_backend.cpp,src/CMakeLists.txt):- Added
OPTMATH_SPV_BUILD_DIRcompile definition so Vulkan backend locates compiled SPIR-V shaders when used via CMake FetchContent
- Added
Build and Installation Verified on Orange Pi 6 Plus (CIX P1 CD8160, Cortex-A720, Mali-G720-Immortalis):
- All backends enabled: NEON, SVE2 (FCMA/I8MM), Vulkan 1.3
- Library, 40 SPIR-V shaders, 16 tests, 5 benchmarks, demo all compile cleanly
make installverified with CMake package config
Bug Fixes (Build):
-
SPIR-V 1.3 Target for Mali Subgroup Shaders (
src/CMakeLists.txt):reduce_sum_mali.comp.glslusessubgroupAdd()which requires SPIR-V 1.3glslangValidatorwas invoked with default SPIR-V 1.0 target, causing compilation failure- Added
--target-env vulkan1.1to all shader compilation commands (Vulkan 1.1 implies SPIR-V 1.3)
-
Vulkan WSI Layer Static Destruction Crash (
vulkan_backend.cpp,vulkan_backend.hpp):- All 4 Vulkan tests passed assertions but crashed during program exit with
unordered_map::at - Root cause: Mesa's
libVkLayer_window_system_integration.sodestroys its internalunordered_mapbefore the staticVulkanContextsingleton runs its destructor, causingvkDestroyDeviceto crash inside the WSI layer - Fixed by registering
std::atexit()handler inVulkanContext::init()to run cleanup before static destruction order conflicts - Added
vkDeviceWaitIdle()before resource teardown - Added try-catch in destructor as safety net
- All 4 Vulkan tests passed assertions but crashed during program exit with
Bug Fixes (Deep Source Audit — CRITICAL):
- Cross-Correlation OOB Reads (
neon_radar.cpp,sve2_radar.cpp):- When
nx > ny, overlap computationy_offset + len > nycauses out-of-bounds reads on the y array - Fixed in all 4 functions:
xcorr_f32andxcorr_complex_f32for both NEON and SVE2 backends - Added
lenclamping tony - y_offsetand zero-input early-return guards
- When
Bug Fixes (Deep Source Audit — MODERATE):
-
MTI Filter Unsigned Underflow (
neon_radar.cpp):n_pulses - n_coeffs + 1wraps toSIZE_MAXwhenn_pulses < n_coeffs(unsigned arithmetic)- Added early-return guards in both
mti_filter_f32and the Eigen wrapper
-
Vulkan Init Retry Leaks Handles (
vulkan_backend.cpp):- Partial failure during
VulkanContext::init()left Vulkan handles allocated; retry overwrote them - Added
cleanup()call before re-init ifdeviceorinstanceare already non-null
- Partial failure during
-
Vulkan vkMapMemory Unchecked (
vulkan_backend.cpp):mapAndCopyFromignoredvkMapMemoryreturn value; would crash on mapping failure- Added error check with
throw std::runtime_erroron failure
Bug Fixes (Deep Source Audit — LOW):
-
NEON Sin Scalar Tail UB (
neon_kernels.cpp):(int)kcast of large float is undefined behavior when float exceedsINT_MAX- Changed to
(int64_t)kto handle full float range
-
SPIR-V Alignment (
vulkan_backend.cpp):readFilereturnsvector<char>(1-byte alignment) but Vulkan requiresuint32_t-alignedpCode- Fixed by padding buffer size to 4-byte alignment boundary
Benchmark Results Added:
- Complete Orange Pi 6 Plus benchmark results for all 5 benchmark suites
- NEON GEMM: 17.1 GFLOPS (4x4 micro-kernel), Eigen reference: 37.7 GFLOPS
- NEON transcendentals: exp 32.8x, sin 37.5x, tanh 117x faster than std::
- Vulkan MatMul on Mali-G720: 481 GFLOPS peak at 1024x1024
- Radar CAF: 18.2 GFLOPS sustained for 65K-sample processing
All 16 test suites pass (100%) on Orange Pi 6 Plus.
Vulkan Bug Fixes:
-
Unchecked
vkBindBufferMemory(vulkan_backend.cpp:90):- Return value was silently ignored; can fail with
VK_ERROR_OUT_OF_DEVICE_MEMORY - Added error check with proper cleanup (free memory, destroy buffer) before throwing
- Return value was silently ignored; can fail with
-
Memory Barrier Spec Violation (
vulkan_backend.cpp:500-507):VK_PIPELINE_STAGE_HOST_BITwas used as destination stage invkCmdPipelineBarrier, which is invalid per Vulkan spec- Removed
HOST_BITandHOST_READ_BIT; host visibility is guaranteed byHOST_COHERENT_BITon all buffers +vkQueueWaitIdle
-
FFT Float log2 Truncation (
vulkan_backend.cpp:1171,1220):- Radix-2 FFT used
(uint32_t)std::log2(N)which can truncate incorrectly (e.g.,log2(8) = 2.9999...→ 2) - Replaced with integer bit-counting loop (radix-4 already had a loop but redundantly called
std::log2again)
- Radix-2 FFT used
Audit Plan Corrections (5 false positives removed):
- Issue 14 (NEON Conv2D out-of-bounds): Loop condition
c + 3 < out_colswithout_cols = in_cols - kernel_cols + 1ensures max access =in_cols - 1(last valid index). Algebraically proven safe for all conv variants. - Issue 37 (SVE2 FCMA complex multiply):
svcmlawith rotations 0 and 90 is the correct way to multiply interleaved complex data — the instruction natively operates on[re, im]pairs - Issue 17 (NEON complex magnitude): Newton-Raphson via
vrsqrteq_f32+ 2xvrsqrtsq_f32is mathematically correct (~24-bit accuracy) - Issue 33 (Vulkan host coherency):
HOST_COHERENT_BITis set on all buffers - Issue 15 (Householder sign): Standard QR sign selection with
1e-30denominator guard
Audit Plan Status Updates (4 issues already fixed but not tracked):
- Issues 6, 7 (CUDA window div/0 and dot product race) were fixed in v0.5.2 but audit plan still listed them as OPEN
- Issues 11, 12 (Vulkan resource leaks and command buffer check) were already properly handled
- Issue 22 line reference corrected: CAF Doppler phase is in
sve2_radar.cpp:43-50, notsve2_kernels.cpp:1095-1106
All 16 test suites pass (100%).
SVE2 Fixes:
-
Interleaved Complex Multiply Bug (
sve2_complex.cpp):- Fixed incorrect deinterleave using
svuzp1_f32(va, va)/svuzp2_f32(va, va)which duplicates elements instead of splitting even/odd - Replaced with
svtbl_f32index-based extraction for correct real/imaginary separation - Affected:
sve2_complex_mul_interleaved_f32,sve2_complex_conj_mul_interleaved_f32
- Fixed incorrect deinterleave using
-
ODR Violation: Duplicate Function Definitions (
sve2_kernels.cpp):- Removed ~580 lines of duplicate function definitions that were also defined in
sve2_complex.cppandsve2_radar.cpp - Static library silently picked one copy, masking the One Definition Rule violation
- Removed ~580 lines of duplicate function definitions that were also defined in
Platform Fixes:
-
Feature Detection End-of-Line Bug (
platform.cpp):- Fixed:
getline()strips\n, so features at end of line (e.g.," sve2") were missed by substring search - Added space padding before search to ensure end-of-line features are found
- Features on heterogeneous CPUs (different Features lines) now detected correctly with
!info.has_*guards
- Fixed:
-
sysconf Negative Return (
platform.cpp):sysconf(_SC_NPROCESSORS_ONLN)can return -1 on error; now guarded with fallback to 1
CUDA Fixes:
-
tanh NaN for Large Inputs (
cuda_kernels.cu):__expf(2*x)overflows for|x| > ~20, producing NaN- Added input clamping: values beyond +/-20 return +/-1.0f directly
-
Softmax Single-Block Limitation (
cuda_kernels.cu):- Kernel assumed
n <= 256(single block) but was called with arbitrary sizes, producing wrong results - Added CPU fallback path for
n > BLOCK_SIZE
- Kernel assumed
-
IFFT Dead Normalization Code (
cuda_complex.cu):- Removed unused
scaleandblocksvariables that computed normalization but never applied it
- Removed unused
-
CFAR 2D Row-Major vs Column-Major Mismatch (
cuda_radar.cu):- CUDA kernel used row-major indexing (
d * n_range + r) but EigenMatrixXfis column-major - Fixed all indexing to column-major (
d + r * n_doppler)
- CUDA kernel used row-major indexing (
-
CPU Window Functions Divide-by-Zero (
cuda_radar.cu):- Window generation with
n=1caused division by(n-1) = 0 - Added safe divisor:
(n > 1) ? (n-1) : 1
- Window generation with
Vulkan Fixes:
-
SPIR-V File Read Overflow (
vulkan_backend.cpp):tellg()returns -1 on failure; cast tosize_tproduced a massive allocation- Added check before cast
-
Buffer Leak on Allocation Failure (
vulkan_backend.cpp):vkCreateBuffersucceeded butvkAllocateMemoryfailed, leaking the buffer handle- Added
vkDestroyBuffercleanup before throwing
-
Floating-Point log2 Truncation (
vulkan_backend.cpp):bit_reverse_copyand radix-4 power check used(int)std::log2(N), which can truncate incorrectly (e.g.,log2(8) = 2.9999...→ 2)- Replaced with integer bit-counting loop
Safety Guards:
-
NEON GEMM Buffer Overflow (
neon_gemm_optimized.cpp):- Runtime MC/KC/NC from platform detection could exceed compile-time MAX constants
- Added
std::minclamp to prevent stack buffer overflows
-
IIR Validation Order (
neon_iir.cpp):fc >= fs * 0.5fwas checked beforefs > 0, causing UB whenfs = 0- Reordered:
fs > 0check now comes first
-
TRSV Zero-Diagonal Guard (
neon_linalg.cpp):- Added check before division in
neon_trsv_lower_trans_f32to prevent division by zero
- Added check before division in
-
CUDA Buffer Move Semantics (
cuda_backend.hpp):PinnedBufferandUnifiedBufferhad implicit copy constructors that would double-free- Added deleted copy ops and proper move semantics
Test Quality Improvements:
-
Tolerance Fixes (
test_neon_kernels.cpp,test_vulkan_vector.cpp,test_sve2_kernels.cpp):- Changed dot/norm tolerance from
1e-2 * Nto1e-4f * N + 1e-2f(was masking real bugs)
- Changed dot/norm tolerance from
-
Float Comparison (
test_neon_kernels.cpp):- Changed
EXPECT_EQtoEXPECT_FLOAT_EQfor reduce_max/min (bitwise vs ULP comparison)
- Changed
-
Type Mismatch (
test_radar_caf.cpp):size_t max_idx→Eigen::Index max_idxto matchmaxCoeff()signature
-
Platform-Gated Assertions (
test_platform.cpp):- L3 cache size and GEMM blocking assertions now gated on target platform detection
- Previously failed on any system without 8MB+ L3 cache
-
Complex Dot Test Convention (
test_sve2_kernels.cpp):- Test computed
sum(a * conj(b))but kernel implementssum(conj(a) * b)(BLAS convention) - Fixed test reference to match kernel
- Test computed
All 16 test suites pass (100%).
Critical Bug Fixes:
-
SVE2 Accumulation Bug (
sve2_radar.cpp,sve2_complex.cpp):- Fixed predicate variant: changed
_z(zeroing) to_m(merging) for accumulation operations - The zeroing variant was destroying accumulated values in inactive lanes during predicated loops
- Affected: CAF, cross-correlation, complex dot product, and beamforming kernels
- Fixed predicate variant: changed
-
SVE2 Loop Termination (
sve2_complex.cpp,sve2_radar.cpp):- Fixed:
svptest_first→svptest_anyto prevent premature loop exit svptest_firstcould exit early when the first lane was inactive but other lanes had work remaining
- Fixed:
-
CUDA Window Division by Zero (
cuda_radar.cu):- Added
safe_window_divisor()helper to prevent NaN when generating windows with n=1 - Affected: Hamming, Hanning, Blackman, Blackman-Harris, Kaiser window functions
- Added
-
CUDA Warp Reduction Race Condition (
cuda_complex.cu):- Added guard for
blockDim.x < 64to prevent reading uninitialized shared memory - The warp reduction assumed at least 64 threads but smaller blocks could be launched
- Added guard for
-
CUDA/cuFFT Error Handling (
cuda_backend.cpp):- Added error checking for
cufftExecC2Cand memory allocation functions - Previously silent failures could occur on FFT execution errors
- Added error checking for
Numerical Stability Fixes:
-
NEON IIR Parameter Validation (
neon_iir.cpp):- Added
validate_iir_params()to prevent division by zero when Q=0 - Validates: Q > 0, 0 < fc < fs/2, fs > 0
- Returns unity pass-through filter coefficients on invalid parameters
- Affected:
neon_biquad_lowpass,neon_biquad_highpass,neon_biquad_bandpass,neon_biquad_notch
- Added
-
NEON Householder QR Stability (
neon_linalg.cpp):- Fixed potential division by zero when
alpha ≈ betain Householder reflection - Added threshold check: if
|alpha - beta| < 1e-30, skip reflection (column already normalized)
- Fixed potential division by zero when
Vulkan Error Handling (vulkan_backend.cpp):
- Added error checking for:
vkAllocateCommandBuffers,vkBeginCommandBuffer,vkEndCommandBuffer,vkQueueSubmit,vkQueueWaitIdle - Proper resource cleanup on error paths
Test Improvements:
- Platform Test Portability (
test_platform.cpp):- Made ARM-specific assertions conditional on detecting target platform
- Tests now pass on both ARM (Orange Pi 6 Plus, Raspberry Pi 5) and x86_64 development machines
- Platform-specific checks (core part IDs, SVE2 features, efficiency cores) only run on appropriate hardware
New Platform: Orange Pi 6 Plus (CIX P1 CD8160)
Full hardware-specific optimization for the CIX P1 SoC's ARMv9 big.LITTLE cores and Mali-G720-Immortalis GPU.
SVE2 Acceleration (sve2_kernels.cpp, sve2_complex.cpp, sve2_radar.cpp):
- Predicated vector operations:
svwhilelt/svptestloops eliminate all scalar tail handling - FCMA complex multiply:
svcmla_f32_zfor 2-instruction complex multiply (vs 4 NEON) - I8MM GEMM:
svmmla_s32hardware int8 matrix multiply with float32 quantization - SVE2 cache-blocked GEMM tuned for A720 12MB L3 (MC=256, KC=512, NC=2048)
- Complete SVE2 transcendentals: exp, sin, cos, sigmoid, tanh
- SVE2 radar DSP: CAF with FCMA inner loop, cross-correlation, beamforming
Platform Detection (platform.cpp):
- CPU topology detection via
/proc/cpuinfoand sysfs (A720/A520 part IDs) - Thread affinity: pin to performance or efficiency cores via
sched_setaffinity - SVE vector length detection via
prctl(PR_SVE_GET_VL) - Runtime GEMM cache blocking parameter selection based on L3 cache size
NEON GEMM Runtime Tuning (neon_gemm_optimized.cpp):
- Cache blocking parameters now auto-selected at runtime based on detected hardware
- 12MB L3 (A720): MC=256, KC=512, NC=2048; 2MB L3 (A76): MC=128, KC=256, NC=512
Mali-G720 Vulkan Shaders (vulkan_backend.cpp):
- Auto-detects Mali-G720 via vendor ID and device name
- 32x32 tiled GEMM shader (1024 threads, 8KB shared memory)
- 1024-thread reduction with
subgroupAdd()subgroup arithmetic - Transparent fallback to standard shaders on non-Mali GPUs
New Tests:
| Test Suite | Tests | Description |
|---|---|---|
test_sve2_kernels |
18 | SVE2 correctness, edge cases, stress tests |
test_platform |
9 | CPU topology, affinity, cache detection |
New Dense Linear Algebra Kernels (neon_linalg.cpp):
-
Triangular Solve (TRSV/TRSM):
- Forward/backward substitution with NEON-vectorized AXPY column updates (
vld1q_f32/vmlaq_f32) - Unit-diagonal and transpose variants for composing with LU and Cholesky
- Multi-RHS TRSM for matrix inverse computation
- Forward/backward substitution with NEON-vectorized AXPY column updates (
-
Cholesky Decomposition (
neon_cholesky_f32):- Unblocked right-looking algorithm with NEON dot products
- SPD validation: returns 1-based error index if matrix is not positive definite
- Clean lower-triangular output (upper triangle zeroed)
-
LU Decomposition (
neon_lu_f32):- Partial pivoting with argmax pivot search
- NEON-vectorized column scaling and rank-1 trailing submatrix updates
- Row permutation vector output for solver composition
-
QR Decomposition (
neon_qr_f32):- Householder reflections with stable sign choice to avoid cancellation
- NEON-vectorized reflector application to trailing columns
- Explicit Q extraction via reverse-order reflector accumulation
-
Solvers:
neon_solve_f32: General A*x=b via LU + TRSVneon_solve_spd_f32: SPD A*x=b via Cholesky + TRSVneon_inverse_f32: A^{-1} via LU + TRSM on identity columns
-
Eigen Wrappers:
neon_cholesky,neon_lu,neon_qr,neon_trsv_lower,neon_trsv_upper,neon_solve,neon_solve_spd,neon_inverse
New Tests:
| Test Suite | Tests | Status |
|---|---|---|
test_neon_linalg |
21 | Passed |
Total: 88 individual test cases passed (21 new + 67 existing)
New DSP Kernels:
-
Polyphase Resampler (
neon_resample.cpp):- Rational L/M sample rate conversion using polyphase decomposition
- NEON-optimized FIR filtering per phase (reuses
neon_dot_f32) - Phase accumulator algorithm with delay line state management
- Both streaming (
neon_resample_f32) and one-shot (neon_resample_oneshot_f32) APIs
-
Biquad IIR Filter (
neon_iir.cpp):- Direct Form II Transposed for optimal numerical stability
- Single section (
neon_biquad_f32) and cascade (neon_biquad_cascade_f32) - In-place processing supported (out and in may alias)
- Design helpers using Audio EQ Cookbook formulas: lowpass, highpass, bandpass, notch
-
2D Convolution (
neon_conv2d.cpp):- General NxM convolution with NEON vectorization over output columns (4 at a time)
- Separable 2D convolution: row pass reuses
neon_fir_f32, NEON column pass - Fully unrolled 3x3 kernel (9 broadcast coefficients, 9 multiply-accumulates)
- Unrolled 5x5 kernel specialization
- Row-major layout for natural image/matrix processing
New Tests:
| Test Suite | Tests | Status |
|---|---|---|
test_neon_resample |
7 | Passed |
test_neon_iir |
10 | Passed |
test_neon_conv2d |
9 | Passed |
Total: 67 individual test cases passed (26 new + 41 existing)
Cross-Platform Testing:
- Verified full compatibility between AMD64 (x86_64) and ARM64 (aarch64) platforms
- All 10 test suites pass on Raspberry Pi 5 (ARM Cortex-A76)
- Tested with GCC 14.2.0, Eigen 3.4.0, Vulkan 1.4.309
Test Results (Raspberry Pi 5):
| Test Suite | Tests | Status |
|---|---|---|
test_basic |
1 | Passed |
test_neon_kernels |
4 | Passed |
test_neon_complex |
7 | Passed |
test_neon_transcendentals |
10 | Passed |
test_vulkan_vector |
1 | Passed |
test_vulkan_matrix |
1 | Passed |
test_vulkan_dsp |
1 | Passed |
test_vulkan_advanced |
1 | Passed |
test_radar_caf |
6 | Passed |
test_radar_cfar |
9 | Passed |
Total: 41 individual test cases passed
CUDA Backend:
- Optimized
cuda_caf(): Eliminated all CPU-GPU memory transfers inside the Doppler processing loop- Added
kernel_interleave_complex_f32for GPU-side complex array interleaving - Added
kernel_complex_conj_mul_interleaved_f32for frequency-domain multiplication - Added
kernel_magnitude_interleaved_f32for magnitude extraction - Result: ~6 fewer
cudaMemcpycalls per Doppler bin, data stays on GPU until final output
- Added
- Fixed
cuda_vec_sum_f32(): Removed dead code (unused buffer allocation and kernel call)
NEON Backend:
- Fixed
neon_fast_tanh_f32(): Removed dead code line that was immediately overwritten
Testing:
- All 10 test suites pass (NEON, Vulkan, CUDA, Radar)
MIT License - See LICENSE file for details.
N4HY - Bob McGwier Dr Robert W McGwier, PhD
All unit tests, much of the debugging, and all documentation except the earliest pieces are all Claude code (AMAZING).