IEEE-style research project — Design, implementation, and experimental evaluation of cache-aware matrix multiplication in C, targeting the x86-64 architecture with AVX2/FMA SIMD vectorization.
Authors: Dila Öykü Eyüboğlu · Burcu Kösedağı
Affiliation: Computer Engineering, Istanbul Commerce University
Platform: Intel Core i5-10210U · Linux WSL2 · GCC 13.3 · AVX2 + FMA
- 40.25× speedup over naive baseline
- 455× reduction in LLC cache misses
- 98.4% → 3.9% D1 miss rate reduction
- Fully verified with Valgrind & Cachegrind
| Matrix Size | Naïve Time | Optimized Time | SIMD Time | Speedup (Opt) | Speedup (SIMD) |
|---|---|---|---|---|---|
| N = 256 | 0.027 s | 0.005 s | 0.008 s | 5.15× | 3.33× |
| N = 512 | 0.266 s | 0.054 s | 0.078 s | 4.94× | 3.40× |
| N = 1024 | 8.007 s | 0.434 s | 0.591 s | 18.46× | 13.54× |
| N = 2048 | 142.5 s | 3.54 s | 5.18 s | 40.25× | 27.51× |
- Overview
- Implementations
- Project Structure
- Requirements
- Build & Run
- Output Format
- Cache Profiling
- Performance Counters
- Correctness Verification
- Optimization Techniques
- Architecture Notes
- Reproducibility
Matrix multiplication (C = A × B) is a foundational computational kernel in machine learning, scientific simulation, and graphics rendering. Despite its O(N³) arithmetic complexity, real-world performance is dictated by memory access patterns, not raw floating-point throughput.
This project demonstrates the memory wall effect quantitatively:
```text
Naïve     (N=2048): 0.12 GFLOPS  ← 90% degradation from N=256
Optimized (N=2048): 4.85 GFLOPS  ← stable across all sizes
SIMD      (N=2048): 3.32 GFLOPS  ← stable across all sizes
```
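These GFLOPS figures follow directly from the 2N³ flop count of classical matrix multiplication divided by wall-clock time. A minimal sketch of the calculation (the function name `gflops` is illustrative, not from the project sources):

```c
#include <stdio.h>

/* Classical C = A × B performs N³ multiply-add pairs = 2N³ flops. */
static double gflops(long n, double seconds) {
    return (2.0 * n * n * n) / (seconds * 1e9);
}

int main(void) {
    /* Naïve N=2048 run from the results table: 142.5 s */
    printf("%.2f GFLOPS\n", gflops(2048, 142.5)); /* prints ≈ 0.12 */
    return 0;
}
```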
The naïve implementation accesses matrix B column-wise (stride-N), causing virtually every access to miss L1 cache. Loop interchange, tiling, and vectorization eliminate this bottleneck.
Standard i → j → k loop order. B is accessed with stride N — cache-hostile.
```c
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];  // stride-N miss on B
```

Problem: for N=1024, consecutive accesses to B are 8,192 bytes apart (one full row) → an L1 miss on virtually every iteration.
Three combined techniques:
① Loop Interchange (i → k → j)
```c
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        double a_ik = A[i*N+k];           // hoisted to register
        for (j = 0; j < N; j++)
            C[i*N+j] += a_ik * B[k*N+j];  // stride-1 on B and C
    }
```

② Cache Blocking (BLOCK_SIZE = 64)
```c
#define BLOCK_SIZE 64
for (ii = 0; ii < N; ii += BLOCK_SIZE)
    for (kk = 0; kk < N; kk += BLOCK_SIZE)
        for (jj = 0; jj < N; jj += BLOCK_SIZE)
            // inner 64×64 tile fits in the 32 KB L1-D cache
```

A single 64×64 tile of doubles = 64 × 64 × 8 = 32,768 bytes, exactly the capacity of the 32 KB L1-D cache. Three tiles (A, B, C) = 96 KB → fit in L2 (256 KB).
③ Manual Loop Unrolling (×4)
```c
for (; j + 3 < j_max; j += 4) {
    C[i*N+j]   += a_ik * B[k*N+j];
    C[i*N+j+1] += a_ik * B[k*N+j+1];
    C[i*N+j+2] += a_ik * B[k*N+j+2];
    C[i*N+j+3] += a_ik * B[k*N+j+3];
}
for (; j < j_max; j++)            // scalar remainder
    C[i*N+j] += a_ik * B[k*N+j];
```
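How the three techniques compose into a single kernel is implicit above; the following sketch shows blocked, interchanged, and unrolled multiplication in one function (loop bounds and names are illustrative — the project's multiply_opt.c may differ in detail):

```c
#define BLOCK_SIZE 64

/* C += A × B for N×N row-major double matrices. */
void multiply_opt_sketch(const double *A, const double *B, double *C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE)
        for (int kk = 0; kk < N; kk += BLOCK_SIZE)
            for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
                /* Tile boundaries (clamped for N not divisible by 64). */
                int i_max = ii + BLOCK_SIZE < N ? ii + BLOCK_SIZE : N;
                int k_max = kk + BLOCK_SIZE < N ? kk + BLOCK_SIZE : N;
                int j_max = jj + BLOCK_SIZE < N ? jj + BLOCK_SIZE : N;
                for (int i = ii; i < i_max; i++)
                    for (int k = kk; k < k_max; k++) {
                        double a_ik = A[i*N+k];         /* hoisted to register */
                        int j = jj;
                        for (; j + 3 < j_max; j += 4) { /* unrolled ×4 */
                            C[i*N+j]   += a_ik * B[k*N+j];
                            C[i*N+j+1] += a_ik * B[k*N+j+1];
                            C[i*N+j+2] += a_ik * B[k*N+j+2];
                            C[i*N+j+3] += a_ik * B[k*N+j+3];
                        }
                        for (; j < j_max; j++)          /* scalar remainder */
                            C[i*N+j] += a_ik * B[k*N+j];
                    }
            }
}
```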
The SIMD variant replaces the scalar inner loop with 256-bit FMA intrinsics, processing 4 doubles per instruction:

```c
#ifdef __AVX2__
#include <immintrin.h>

__m256d va = _mm256_set1_pd(a_ik);            // broadcast scalar to 4 lanes
for (; j + 3 < j_max; j += 4) {
    __m256d vb = _mm256_loadu_pd(&B[k*N+j]);
    __m256d vc = _mm256_loadu_pd(&C[i*N+j]);
    vc = _mm256_fmadd_pd(va, vb, vc);         // vc += va × vb (4 doubles per instruction)
    _mm256_storeu_pd(&C[i*N+j], vc);
}
#endif
```
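The kernel uses unaligned loads (_mm256_loadu_pd), which cost nothing extra on recent Intel cores when the address happens to be 32-byte aligned. If guaranteed alignment is wanted, the matrix buffers can be allocated with C11 aligned_alloc — a sketch, assuming the project's matrix.h allocation API could be adapted:

```c
#include <stdlib.h>

/* Allocate an N×N double matrix on a 32-byte boundary, so the SIMD
   kernel could use aligned loads (_mm256_load_pd) instead of loadu.
   aligned_alloc requires size to be a multiple of the alignment;
   n*n*sizeof(double) is a multiple of 32 for any even n. */
double *alloc_matrix32(size_t n) {
    return aligned_alloc(32, n * n * sizeof(double));
}
```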
GCC-emitted assembly (verified via gcc -S):

```asm
vbroadcastsd %xmm2, %ymm1           ; broadcast a_ik to all 4 lanes
vmovupd (%r8,%rcx,8), %ymm0         ; load 4 B values
vfmadd213pd (%rdi), %ymm1, %ymm0    ; fused multiply-add (4 doubles)
vmovupd %ymm0, (%rdi)               ; store 4 C values
```

```text
cache_optimization_project/
│
├── include/
│ ├── matrix.h # Matrix struct + allocation API
│ ├── multiply.h # Function declarations (naive / opt / simd)
│ ├── timer.h # High-resolution timer (POSIX + Win32)
│ └── verify.h # Element-wise correctness check
│
├── src/
│ ├── main.c # Benchmark driver + CSV export
│ ├── matrix.c # create_matrix(), free_matrix(), fill_matrix()
│ ├── multiply_naive.c # Baseline i→j→k
│ ├── multiply_opt.c # Blocked + unrolled (i→k→j, BLOCK_SIZE=64)
│ ├── multiply_simd.c # AVX2 FMA vectorized
│ ├── timer.c # clock_gettime / QueryPerformanceCounter
│ └── verify.c # compare_matrices() with ε = 1×10⁻⁹
│
├── figures/
│ ├── speedup_plot.png # Speedup vs. N
│ ├── gflops_plot.png # Throughput vs. N
│ ├── runtime_plot.jpg # Execution time (log scale)
│ ├── cache_miss_plot.jpg # D1 & LLd miss rates (Cachegrind)
│ ├── memory_hierarchy.jpg # x86-64 cache hierarchy diagram
│ └── system_architecture.jpg # Benchmark pipeline diagram
│
├── results/
│ ├── benchmark_results_aggregated.csv # Median of 5 runs per config
│ ├── benchmark_results_all_runs.csv # Raw per-run data
│ └── cachegrind_report.txt # Valgrind cache profiling output
│
├── paper/
│ └── Cache_Optimized_Matrix_Multiplication.pdf
│
├── scripts/
│ ├── run_cachegrind.sh # Valgrind Cachegrind profiling (Linux)
│ └── run_perf.sh # Linux perf hardware counters
│
├── makefile
└── README.md
```
| Tool | Version | Purpose |
|---|---|---|
| GCC | ≥ 9.0 | Compilation with AVX2 (-mavx2 -mfma) |
| make | any | Build system |
| valgrind + cg_annotate | any | Cache profiling (make cachegrind) |
| perf | any | Hardware performance counters (make perf) |
```bash
sudo apt update
sudo apt install gcc make valgrind linux-tools-generic
```

Requires a CPU with AVX2 support; the valgrind and perf targets are Linux-only.
```bash
lscpu | grep -E "avx2|fma"
# Expected: avx2 fma in Flags section
```

Build:

```bash
make
```

Compiler flags: -O3 -Wall -Wextra -std=c11 -march=native -mavx2 -mfma
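The binary assumes AVX2 at compile time; a runtime guard can fail fast on unsupported CPUs instead of crashing with SIGILL. A minimal sketch using GCC/Clang's __builtin_cpu_supports (not part of the project sources):

```c
#include <stdio.h>
#include <stdlib.h>

/* Abort early with a clear message on a non-AVX2/FMA CPU. */
static void require_avx2(void) {
    if (!__builtin_cpu_supports("avx2") || !__builtin_cpu_supports("fma")) {
        fprintf(stderr, "This build requires AVX2 and FMA.\n");
        exit(EXIT_FAILURE);
    }
}
```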
Scalar build (no SIMD) for comparison:

```bash
make nosimd
```

Run the benchmark:

```bash
make run
# or directly:
./matrix_app
```

Benchmarks N ∈ {256, 512, 1024, 2048}, 5 runs each. Results are written to benchmark_results.csv.
Inspect the generated AVX2 assembly:

```bash
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s
grep -A5 "vfmadd" multiply_simd.s
```

Clean the build tree:

```bash
make clean
```

Sample output (N = 1024):

```text
AVX2 ENABLED
Cache-Optimized Matrix Multiplication Benchmark
N = 1024
C[0][0]      naive=2048.00  opt=2048.00  simd=2048.00
Correctness  opt=PASS  simd=PASS
Time(s)      naive=8.007  opt=0.434  simd=0.591
Speedup      opt=18.46x  simd=13.54x
GFLOPS       naive=0.268  opt=4.951  simd=3.631
```
| Column | Description |
|---|---|
| N | Matrix size |
| Correct_Opt | Correctness of optimized (PASS/FAIL) |
| Correct_SIMD | Correctness of SIMD (PASS/FAIL) |
| NaiveTime | Median wall-clock time (s), naïve |
| OptTime | Median wall-clock time (s), optimized |
| SIMDTime | Median wall-clock time (s), SIMD |
| Speedup_Opt | T_naive / T_opt |
| Speedup_SIMD | T_naive / T_simd |
| NaiveGFLOPS | 2N³ / (T_naive × 10⁹) |
| OptGFLOPS | 2N³ / (T_opt × 10⁹) |
| SIMDGFLOPS | 2N³ / (T_simd × 10⁹) |
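A row in this format can be emitted with a single fprintf; a sketch of the export step (the actual main.c may format fields differently):

```c
#include <stdio.h>

/* Append one benchmark row in the CSV column order above. */
void write_csv_row(FILE *f, int n,
                   const char *ok_opt, const char *ok_simd,
                   double t_naive, double t_opt, double t_simd) {
    double flops = 2.0 * n * n * n;
    fprintf(f, "%d,%s,%s,%.3f,%.3f,%.3f,%.2f,%.2f,%.3f,%.3f,%.3f\n",
            n, ok_opt, ok_simd, t_naive, t_opt, t_simd,
            t_naive / t_opt, t_naive / t_simd,
            flops / (t_naive * 1e9), flops / (t_opt * 1e9),
            flops / (t_simd * 1e9));
}
```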
```bash
make cachegrind
# or manually:
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
         --cachegrind-out-file=cachegrind.out ./matrix_app
cg_annotate --show=Ir,Dr,D1mr,DLmr,Dw,D1mw,DLmw \
            cachegrind.out > cachegrind_report.txt
```

| Implementation | D1 Read Misses | LLd Read Misses | D1 Miss Rate | LLd Miss Rate |
|---|---|---|---|---|
| Naïve | ≈1.08×10⁹ | ≈1.07×10⁹ | ≈98.4% | ≈93.1% |
| Optimized | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.9% | ≈0.19% |
| SIMD | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.8% | ≈0.18% |
455× reduction in LLd read misses · 25× reduction in D1 miss rate
```bash
make perf
# or:
bash scripts/run_perf.sh
```

Reports hardware events (real cache misses, CPI, pipeline stalls) via Linux perf stat. Output is saved to perf_report.txt.
All implementations are verified against the naïve baseline with element-wise tolerance ε = 1×10⁻⁹:
∀ i, j : |C_opt[i][j] − C_naive[i][j]| ≤ 1×10⁻⁹
Test case: A[i][j] = 1.0, B[i][j] = 2.0 → expected C[i][j] = 2N for all i, j.
All three implementations pass for all N across all 5 runs.
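A sketch of the check in verify.c, together with the constant test pattern (the exact signature in the project's verify.h may differ):

```c
#include <math.h>
#include <stdbool.h>

#define EPSILON 1e-9

/* Element-wise comparison against the naïve reference result. */
bool compare_matrices(const double *ref, const double *test, int n) {
    for (int i = 0; i < n * n; i++)
        if (fabs(ref[i] - test[i]) > EPSILON)
            return false;
    return true;
}

/* Test pattern: A[i][j] = 1.0, B[i][j] = 2.0 means every C entry is
   Σ_k 1.0 × 2.0 = 2N, which makes errors easy to spot. */
```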
Memory safety: Verified with valgrind --tool=memcheck — zero leaks, zero invalid accesses across all configurations.
In the i → j → k loop order, B[k][j] is accessed column-wise:
- For N=1024: stride = 1024 × 8 bytes = 8,192 bytes between consecutive accesses
- L1-D cache (32 KB, 64 B lines) holds only 512 cache lines; one column traversal of B touches 1,024 distinct lines → no reuse of B in L1
- Result: D1 miss rate ≈ 98.4%, CPU stalls on every inner-loop iteration
| Technique | Effect |
|---|---|
| Loop interchange (i→k→j) | All three matrices traversed stride-1; a_ik hoisted to register |
| Cache blocking (64×64) | Working set (3 tiles = 96 KB) fits in L2; misses drop from O(N³) to O(N³/B) |
| Loop unrolling (×4) | 4 independent MACs per iteration; improves instruction-level parallelism |
| AVX2 FMA | 4 doubles per instruction; fused multiply-add halves arithmetic instruction count |
```text
CPU Registers (YMM)       │   1 cycle    │ ← a_ik hoisted here
L1-D Cache      32 KB     │  4–5 cycles  │ ← one 64×64 tile (32 KB) fits here
L2 Cache       256 KB     │ ~12 cycles   │ ← three tiles (96 KB) fit here
L3 Cache        6 MiB     │ 30–40 cycles │
DRAM            16 GB     │ 200–300 cyc  │ ← naïve accesses land here
```
| Parameter | Value |
|---|---|
| CPU | Intel Core i5-10210U (Comet Lake), 4C/8T, 1.6–4.2 GHz |
| L1-D | 128 KB (4 × 32 KB, 8-way, 64B lines) |
| L2 | 1 MiB (4 × 256 KB) |
| L3 | 6 MiB shared |
| OS | Linux WSL2 (Ubuntu 24.04), kernel 6.6.87.2 |
| Compiler | GCC 13.3.0, -O3 -march=native -mavx2 -mfma |
| Matrix sizes | N ∈ {256, 512, 1024, 2048}, square, double precision |
| Runs | 5 independent; median reported |
| Timer | clock_gettime(CLOCK_MONOTONIC) |
The full benchmark suite, Cachegrind profiling, and assembly inspection can each be triggered with a single command:
```bash
# Full benchmark
make clean && make && make run

# Valgrind cache profiling
make cachegrind

# AVX2 assembly inspection
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s

# Scalar build (no SIMD) for comparison
make nosimd && make run
```

All results are written to results/benchmark_results_aggregated.csv for reproducible post-processing.
Selected key references (full list in paper):
- Drepper, U. — What Every Programmer Should Know About Memory, Red Hat, 2007
- Lam, Rothberg, Wolf — The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS IV, 1991
- Williams, Waterman, Patterson — Roofline: An Insightful Visual Performance Model, CACM, 2009
- Goto & van de Geijn — Anatomy of High-Performance Matrix Multiplication, ACM TOMS, 2008
- Intel Corporation — Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2023
For the full experimental analysis, see paper/Cache_Optimized_Matrix_Multiplication.pdf.