This project implements a CUDA-accelerated multi-restart K-Means clustering system. It compares a CPU baseline against two GPU execution modes: serial GPU restarts and CUDA-streamed GPU restarts. The main GPU computation is implemented using custom CUDA kernels for point assignment, shared-memory partial aggregation, centroid update, and objective computation.
For the full project explanation, design decisions, benchmark results, correctness validation, Nsight Systems/Compute profiling, and scalability discussion, please see: Project Report
make clean
make
./bin/kmeans_streams --n 100000 --dim 8 --k 8 --iters 20 --restarts 4 --streams 4./scripts/run_project.sh standardObtain the plots:
pip install matplotlib
python3 scripts/plot_results.pyThe main benchmark outputs are written to:
results/timing_all.csv
results/derived_speedups.csv
results/plots/
Nsight Systems:
./scripts/profile_nsys.shNsight Compute:
./scripts/profile_ncu.shProfiling outputs are written to:
results/profile/
The dataset is synthetically generated by the program and written to data/points.bin before being read back for CPU/GPU execution. The report PDF contains the main analysis and should be used as the primary writeup.