Systems-Performance-Lab/cache-optimized-matrix-multiplication
Cache-Optimized Matrix Multiplication

IEEE-style research project — Design, implementation and experimental evaluation of cache-aware matrix multiplication in C, targeting x86-64 architecture with AVX2/FMA SIMD vectorization.

Authors: Dila Öykü Eyüboğlu · Burcu Kösedağı
Affiliation: Computer Engineering, Istanbul Commerce University
Platform: Intel Core i5-10210U · Linux WSL2 · GCC 13.3 · AVX2 + FMA


Key Results

  • 40.25× speedup over naive baseline
  • 455× reduction in LLC cache misses
  • 98.4% → 3.9% D1 miss rate reduction
  • Fully verified with Valgrind & Cachegrind
| Matrix Size | Naïve Time | Optimized Time | SIMD Time | Speedup (Opt) | Speedup (SIMD) |
|---|---|---|---|---|---|
| N = 256 | 0.027 s | 0.005 s | 0.008 s | 5.15× | 3.33× |
| N = 512 | 0.266 s | 0.054 s | 0.078 s | 4.94× | 3.40× |
| N = 1024 | 8.007 s | 0.434 s | 0.591 s | 18.46× | 13.54× |
| N = 2048 | 142.5 s | 3.54 s | 5.18 s | 40.25× | 27.51× |

Overview

Matrix multiplication (C = A × B) is a foundational computational kernel in machine learning, scientific simulation, and graphics rendering. Despite its O(N³) arithmetic complexity, real-world performance is dictated by memory access patterns, not raw floating-point throughput.

This project demonstrates the memory wall effect quantitatively:

Naïve (N=2048):       0.12 GFLOPS   ← 90% degradation from N=256
Optimized (N=2048):   4.85 GFLOPS   ← stable across all sizes
SIMD (N=2048):        3.32 GFLOPS   ← stable across all sizes

The naïve implementation accesses matrix B column-wise (stride-N), causing virtually every access to miss L1 cache. Loop interchange, tiling, and vectorization eliminate this bottleneck.


Implementations

1. Naïve (multiply_naive.c)

Standard i → j → k loop order. B is accessed with stride N — cache-hostile.

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];  // stride-N miss on B

Problem: For N=1024, consecutive accesses to B are 1024 doubles = 8,192 bytes apart → an L1 miss on virtually every inner-loop iteration.


2. Optimized (multiply_opt.c)

Three combined techniques:

① Loop Interchange (i → k → j)

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        double a_ik = A[i*N+k];   // hoisted to register
        for (j = 0; j < N; j++)
            C[i*N+j] += a_ik * B[k*N+j];  // stride-1 on B and C
    }

② Cache Blocking (BLOCK_SIZE = 64)

#define BLOCK_SIZE 64
for (ii = 0; ii < N; ii += BLOCK_SIZE)
  for (kk = 0; kk < N; kk += BLOCK_SIZE)
    for (jj = 0; jj < N; jj += BLOCK_SIZE)
      // inner 64×64 tile fits in 32 KB L1-D cache

A single 64×64 tile of doubles occupies 64 × 64 × 8 B = 32,768 bytes — exactly the capacity of the L1-D cache. The three tiles active at once (A, B, C) total 96 KB and fit comfortably in L2 (256 KB).

③ Manual Loop Unrolling (×4)

for (; j + 3 < j_max; j += 4) {
    C[i*N+j]   += a_ik * B[k*N+j];
    C[i*N+j+1] += a_ik * B[k*N+j+1];
    C[i*N+j+2] += a_ik * B[k*N+j+2];
    C[i*N+j+3] += a_ik * B[k*N+j+3];
}
for (; j < j_max; j++)   // scalar remainder
    C[i*N+j] += a_ik * B[k*N+j];

3. SIMD AVX2 (multiply_simd.c)

Replaces the scalar inner loop with 256-bit FMA intrinsics, processing 4 doubles per instruction:

#ifdef __AVX2__
#include <immintrin.h>
__m256d va = _mm256_set1_pd(a_ik);         // broadcast scalar to 4 lanes
for (; j + 3 < j_max; j += 4) {
    __m256d vb = _mm256_loadu_pd(&B[k*N+j]);
    __m256d vc = _mm256_loadu_pd(&C[i*N+j]);
    vc = _mm256_fmadd_pd(va, vb, vc);       // vc += va × vb (4 doubles per instruction)
    _mm256_storeu_pd(&C[i*N+j], vc);
}
#endif

GCC-emitted assembly (verified via gcc -S):

vbroadcastsd  %xmm2, %ymm1       ; broadcast a_ik to all 4 lanes
vmovupd       (%r8,%rcx,8), %ymm0 ; load 4 B values
vfmadd213pd   (%rdi), %ymm1, %ymm0 ; fused multiply-add (4 doubles)
vmovupd       %ymm0, (%rdi)        ; store 4 C values

Project Structure

cache_optimization_project/
│
├── include/
│   ├── matrix.h          # Matrix struct + allocation API
│   ├── multiply.h        # Function declarations (naive / opt / simd)
│   ├── timer.h           # High-resolution timer (POSIX + Win32)
│   └── verify.h          # Element-wise correctness check
│
├── src/
│   ├── main.c            # Benchmark driver + CSV export
│   ├── matrix.c          # create_matrix(), free_matrix(), fill_matrix()
│   ├── multiply_naive.c  # Baseline i→j→k
│   ├── multiply_opt.c    # Blocked + unrolled (i→k→j, BLOCK_SIZE=64)
│   ├── multiply_simd.c   # AVX2 FMA vectorized
│   ├── timer.c           # clock_gettime / QueryPerformanceCounter
│   └── verify.c          # compare_matrices() with ε = 1×10⁻⁹
│
├── figures/
│   ├── speedup_plot.png          # Speedup vs. N
│   ├── gflops_plot.png           # Throughput vs. N
│   ├── runtime_plot.jpg          # Execution time (log scale)
│   ├── cache_miss_plot.jpg       # D1 & LLd miss rates (Cachegrind)
│   ├── memory_hierarchy.jpg      # x86-64 cache hierarchy diagram
│   └── system_architecture.jpg  # Benchmark pipeline diagram
│
├── results/
│   ├── benchmark_results_aggregated.csv  # Median of 5 runs per config
│   ├── benchmark_results_all_runs.csv    # Raw per-run data
│   └── cachegrind_report.txt            # Valgrind cache profiling output
│
├── paper/
│   └── Cache_Optimized_Matrix_Multiplication.pdf
│
├── scripts/
│   ├── run_cachegrind.sh   # Valgrind Cachegrind profiling (Linux)
│   └── run_perf.sh         # Linux perf hardware counters
│
├── makefile
└── README.md

Requirements

| Tool | Version | Purpose |
|---|---|---|
| GCC | ≥ 9.0 | Compilation with AVX2 (-mavx2 -mfma) |
| make | any | Build system |
| valgrind + cg_annotate | any | Cache profiling (make cachegrind) |
| perf | any | Hardware performance counters (make perf) |

Install on Linux (Ubuntu/Debian)

sudo apt update
sudo apt install gcc make valgrind linux-tools-generic

Windows (MinGW/MSYS2)

AVX2 supported; valgrind and perf targets are Linux-only.

Verify AVX2 support

lscpu | grep -E "avx2|fma"
# Expected: avx2 fma in Flags section

Build & Run

Build with AVX2 SIMD (default)

make

Compiler flags: -O3 -Wall -Wextra -std=c11 -march=native -mavx2 -mfma

Build without SIMD (scalar fallback)

make nosimd

Run full benchmark suite

make run
# or directly:
./matrix_app

Benchmarks N ∈ {256, 512, 1024, 2048}, 5 runs each. Results written to benchmark_results.csv.

Inspect AVX2 assembly

gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s
grep -A5 "vfmadd" multiply_simd.s

Clean build artifacts

make clean

Output Format

Terminal Output (example at N=1024)

AVX2 ENABLED
Cache-Optimized Matrix Multiplication Benchmark

N = 1024
C[0][0]     naive=2048.00   opt=2048.00   simd=2048.00
Correctness opt=PASS         simd=PASS
Time(s)     naive=8.007      opt=0.434     simd=0.591
Speedup     opt=18.46x       simd=13.54x
GFLOPS      naive=0.268      opt=4.951     simd=3.631

CSV Columns (benchmark_results.csv)

| Column | Description |
|---|---|
| N | Matrix size |
| Correct_Opt | Correctness of optimized (PASS/FAIL) |
| Correct_SIMD | Correctness of SIMD (PASS/FAIL) |
| NaiveTime | Median wall-clock time (s) |
| OptTime | Median wall-clock time (s) |
| SIMDTime | Median wall-clock time (s) |
| Speedup_Opt | T_naive / T_opt |
| Speedup_SIMD | T_naive / T_simd |
| NaiveGFLOPS | 2N³ / (T × 10⁹) |
| OptGFLOPS | 2N³ / (T × 10⁹) |
| SIMDGFLOPS | 2N³ / (T × 10⁹) |

Cache Profiling

Run Valgrind Cachegrind

make cachegrind
# or manually:
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
         --cachegrind-out-file=cachegrind.out ./matrix_app

cg_annotate --show=Ir,Dr,D1mr,DLmr,Dw,D1mw,DLmw \
            cachegrind.out > cachegrind_report.txt

Results at N=1024 (Valgrind Cachegrind)

| Implementation | D1 Read Misses | LLd Read Misses | D1 Miss Rate | LLd Miss Rate |
|---|---|---|---|---|
| Naïve | ≈1.08×10⁹ | ≈1.07×10⁹ | ≈98.4% | ≈93.1% |
| Optimized | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.9% | ≈0.19% |
| SIMD | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.8% | ≈0.18% |

455× reduction in LLd read misses · 25× reduction in D1 miss rate


Performance Counters

make perf
# or:
bash scripts/run_perf.sh

Reports hardware events (real cache misses, CPI, pipeline stalls) via Linux perf stat. Output saved to perf_report.txt.


Correctness Verification

All implementations are verified against the naïve baseline with element-wise tolerance ε = 1×10⁻⁹:

∀ i, j :  |C_opt[i][j] − C_naive[i][j]| ≤ 1×10⁻⁹

Test case: A[i][j] = 1.0, B[i][j] = 2.0 → expected C[i][j] = 2N for all i, j.

All three implementations pass for all N across all 5 runs.

Memory safety: Verified with valgrind --tool=memcheck — zero leaks, zero invalid accesses across all configurations.


Optimization Techniques

Why naïve is slow

In the i → j → k loop order, B[k][j] is accessed column-wise:

  • For N=1024: stride = 1024 × 8 bytes = 8,192 bytes between consecutive accesses
  • The 32 KB L1-D cache holds only 512 64-byte lines (4,096 doubles), while one column walk of B touches 1,024 distinct lines → no reuse of B in L1
  • Result: D1 miss rate ≈ 98.4%; the CPU stalls on every inner-loop iteration

Why the optimized version is fast

| Technique | Effect |
|---|---|
| Loop interchange (i→k→j) | All three matrices traversed stride-1; a_ik hoisted to a register |
| Cache blocking (64×64) | Working set (3 tiles = 96 KB) fits in L2; capacity misses drop from O(N³) to O(N³/B) |
| Loop unrolling (×4) | 4 independent MACs per iteration; improves instruction-level parallelism |
| AVX2 FMA | 4 doubles per instruction; fused multiply-add halves the arithmetic instruction count |

Memory hierarchy fit (Intel Core i5-10210U)

CPU Registers  (YMM) │ 1 cycle      │ ← a_ik hoisted here
L1-D Cache  32 KB    │ 4–5 cycles   │ ← one 64×64 tile (32 KB) fits here
L2 Cache   256 KB    │ ~12 cycles   │ ← three tiles (96 KB) fit here
L3 Cache     6 MiB   │ 30–40 cycles │
DRAM        16 GB    │ 200–300 cyc  │ ← naïve accesses land here

Experimental Setup

| Parameter | Value |
|---|---|
| CPU | Intel Core i5-10210U (Comet Lake), 4C/8T, 1.6–4.2 GHz |
| L1-D | 128 KB total (4 × 32 KB, 8-way, 64 B lines) |
| L2 | 1 MiB total (4 × 256 KB) |
| L3 | 6 MiB shared |
| OS | Linux WSL2 (Ubuntu 24.04), kernel 6.6.87.2 |
| Compiler | GCC 13.3.0, -O3 -march=native -mavx2 -mfma |
| Matrix sizes | N ∈ {256, 512, 1024, 2048}, square, double precision |
| Runs | 5 independent; median reported |
| Timer | clock_gettime(CLOCK_MONOTONIC) |

Reproducibility

The full benchmark suite, Cachegrind profiling, and assembly inspection can each be triggered with a single command:

# Full benchmark
make clean && make && make run

# Valgrind cache profiling
make cachegrind

# AVX2 assembly inspection
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s

# Scalar build (no SIMD) for comparison
make nosimd && make run

All results are written to results/benchmark_results_aggregated.csv for reproducible post-processing.


References

Selected key references (full list in paper):

  1. Drepper, U. — What Every Programmer Should Know About Memory, Red Hat, 2007
  2. Lam, Rothberg, Wolf — The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS IV, 1991
  3. Williams, Waterman, Patterson — Roofline: An Insightful Visual Performance Model, CACM, 2009
  4. Goto & van de Geijn — Anatomy of High-Performance Matrix Multiplication, ACM TOMS, 2008
  5. Intel Corporation — Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2023

For the full experimental analysis, see paper/Cache_Optimized_Matrix_Multiplication.pdf.
