⚡ Bolt: Optimize Decompression for Offset 25 #381
Optimized the decompression hot path for Offset 25 by specializing `decompress_offset_alignr_cycle` for `SHIFT=7`. The loop was unrolled to a stride of 96 bytes (6 vectors). The serial dependency chain of `alignr` instructions was shortened by computing vectors `v_next2`, `v_next4`, and `v_next5` with accumulated shift constants (e.g., applying shift 14 to `v_prev` and `v_align` directly instead of relying on `v_next1`). This reduces the dependency depth and increases instruction-level parallelism.

Performance Impact:
- Throughput for `Decompress offset25` improved by ~1.4% (from ~10.07 GiB/s to ~10.23 GiB/s).
- Verified correctness with `cargo test`.

Co-authored-by: 404Setup <[email protected]>
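To illustrate why an accumulated shift of 14 can stand in for two chained shift-7 steps, here is a minimal portable sketch that models a 128-bit `alignr` on plain byte arrays. It is not the code from this PR: the helper names `Vec16`, `alignr`, `serial_chain`, and `parallel_chain` are hypothetical, the argument order (high, low, shift) is modeled after `_mm_alignr_epi8`, and the way `v_next3` is derived is an assumption, since the commit message only names `v_next2`, `v_next4`, and `v_next5`.

```rust
/// A 16-byte vector, standing in for one 128-bit SIMD register.
type Vec16 = [u8; 16];

/// Portable model of a byte-wise `alignr`: concatenate `lo` (low half) and
/// `hi` (high half) into 32 bytes, then take 16 bytes starting at `shift`.
fn alignr(hi: Vec16, lo: Vec16, shift: usize) -> Vec16 {
    assert!(shift <= 16, "this model only covers shifts of 0..=16 bytes");
    let mut cat = [0u8; 32];
    cat[..16].copy_from_slice(&lo);
    cat[16..].copy_from_slice(&hi);
    let mut out = [0u8; 16];
    out.copy_from_slice(&cat[shift..shift + 16]);
    out
}

/// Serial chain: every vector depends on the one computed just before it,
/// so the six results sit on a dependency chain of depth six.
fn serial_chain(v_align: Vec16, v_prev: Vec16) -> [Vec16; 6] {
    let mut out = [[0u8; 16]; 6];
    let (mut older, mut newer) = (v_align, v_prev);
    for slot in out.iter_mut() {
        let next = alignr(newer, older, 7);
        *slot = next;
        older = newer;
        newer = next;
    }
    out
}

/// Parallelized variant: `v_next2`, `v_next4`, and `v_next5` use shift 14 on
/// vectors from two steps back, as described in the commit message.
fn parallel_chain(v_align: Vec16, v_prev: Vec16) -> [Vec16; 6] {
    let v_next0 = alignr(v_prev, v_align, 7);
    let v_next1 = alignr(v_next0, v_prev, 7);
    let v_next2 = alignr(v_prev, v_align, 14); // independent of v_next1
    // Assumption: v_next3 keeps the plain shift-7 step.
    let v_next3 = alignr(v_next2, v_next1, 7);
    let v_next4 = alignr(v_next1, v_next0, 14); // independent of v_next3
    let v_next5 = alignr(v_next2, v_next1, 14); // independent of v_next4
    [v_next0, v_next1, v_next2, v_next3, v_next4, v_next5]
}
```

Seeding both functions with the first two 16-byte windows of a stream that repeats every 25 bytes should yield the same six vectors, equal to the next 96 bytes of that stream. In this sketch the longest dependency chain in `parallel_chain` is three `alignr` steps, compared with six in `serial_chain`.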
Optimized decompression for Offset 25 (Shift 7) by parallelizing `alignr` instructions.

💡 What
- Specialized `decompress_offset_alignr_cycle` for `SHIFT = 7` (which corresponds to Offset 25).
- Unrolled the loop to write 96 bytes (6 vectors) per iteration.
- Broke the serial dependency chain:
  - `v_next2` is computed as `alignr(v_prev, v_align, 14)` instead of depending on `v_next1`.
  - `v_next4` is computed as `alignr(v_next1, v_next0, 14)` instead of depending on `v_next3`.
  - `v_next5` is computed as `alignr(v_next2, v_next1, 14)` instead of depending on `v_next4`.

🎯 Why
The generic implementation of `decompress_offset_alignr_cycle` uses a serial chain of `alignr` instructions, where each vector depends on the previous one. This limits throughput due to latency. By calculating intermediate vectors directly from earlier states using larger shift constants, we expose more parallelism to the CPU.

📊 Impact
Benchmark `Decompress offset25` shows a throughput improvement of approximately 1.4% (10.07 GiB/s -> 10.23 GiB/s).

🔬 Measurement
- Run `cargo bench --bench bench_main "Decompress offset25"` to verify.
- Ensure `bench_data/data_offset25.bin` is generated using `scripts/gen_bench_files.py`.

PR created automatically by Jules for task 12004466611968664539 started by @404Setup