Commit 9a81fcb
⚡ Bolt: Optimize decompression for offset 24
Optimized `decompress_bmi2` for `offset == 24` by removing the `alignr` dependency chain.
The pattern for offset 24 repeats every 48 bytes (3 vectors). The implementation now precomputes these 3 vectors (`v1`, `v2`, `v0`) using `unpacklo` and `alignr` and stores them in an unrolled loop.
This breaks the loop-carried dependency found in the previous implementation, allowing better pipelining and ILP.
Benchmarks show a ~32% improvement in throughput for offset 24 decompression.
Co-authored-by: 404Setup <[email protected]>1 parent 288d3a7 commit 9a81fcb
2 files changed
Lines changed: 37 additions & 10 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
709 | 709 | | |
710 | 710 | | |
711 | 711 | | |
712 | | - | |
713 | | - | |
| 712 | + | |
| 713 | + | |
714 | 714 | | |
715 | | - | |
| 715 | + | |
716 | 716 | | |
717 | | - | |
718 | | - | |
719 | | - | |
720 | | - | |
| 717 | + | |
| 718 | + | |
721 | 719 | | |
| 720 | + | |
| 721 | + | |
| 722 | + | |
| 723 | + | |
| 724 | + | |
| 725 | + | |
| 726 | + | |
| 727 | + | |
| 728 | + | |
| 729 | + | |
| 730 | + | |
| 731 | + | |
| 732 | + | |
| 733 | + | |
| 734 | + | |
| 735 | + | |
| 736 | + | |
| 737 | + | |
| 738 | + | |
| 739 | + | |
| 740 | + | |
| 741 | + | |
722 | 742 | | |
723 | | - | |
| 743 | + | |
| 744 | + | |
| 745 | + | |
| 746 | + | |
| 747 | + | |
| 748 | + | |
724 | 749 | | |
725 | 750 | | |
726 | 751 | | |
727 | 752 | | |
728 | | - | |
729 | | - | |
730 | 753 | | |
731 | 754 | | |
732 | 755 | | |
| |||
0 commit comments