Commit 42e7161
Optimize decompress_bmi2 for offset 3 using SIMD
This commit optimizes the decompression of `offset == 3` matches in
`decompress_bmi2` (x86-64). Previously, this case fell back to a slow
byte-by-byte scalar copy loop.
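For reference, a byte-by-byte overlapping copy of the kind described above might look like the sketch below. The function name `copy_match_scalar` and the `Vec<u8>` output buffer are assumptions made for illustration; they are not taken from this commit.

```rust
/// Hypothetical scalar fallback: append `len` bytes to `out`, each copied
/// from `offset` bytes behind the current end of the buffer.
/// When `offset < len` the source and destination overlap, so every byte
/// has to be written before a later iteration can read it back.
fn copy_match_scalar(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!(offset >= 1 && offset <= out.len(), "offset must point inside `out`");
    for _ in 0..len {
        let byte = out[out.len() - offset];
        out.push(byte);
    }
}
```

With `offset == 3`, each iteration depends on a byte written only three iterations earlier, which is why a plain 16-byte block copy cannot be used directly for this case.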
The optimization uses `_mm_shuffle_epi8` with precomputed cyclic masks
(`OFFSET3_MASKS`) to construct 16-byte vectors containing the repeating
3-byte pattern from the first 16 bytes loaded from `src` (safely masking
out garbage). The copy loop is unrolled to process 48 bytes (LCM of 3
and 16) per iteration.
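The mask idea can be modelled in portable Rust as follows. This is only a sketch under stated assumptions: the actual contents of `OFFSET3_MASKS` are not shown here, so `offset3_masks` is a hypothetical reconstruction, and `shuffle_model` is a scalar stand-in for `_mm_shuffle_epi8` rather than the intrinsic itself.

```rust
/// Hypothetical reconstruction of a cyclic mask table: lane `i` of mask `k`
/// selects source byte `(16 * k + i) % 3`, so only bytes 0, 1 and 2 of the
/// loaded vector are ever read.
fn offset3_masks() -> [[u8; 16]; 3] {
    let mut masks = [[0u8; 16]; 3];
    for k in 0..3 {
        for i in 0..16 {
            masks[k][i] = ((16 * k + i) % 3) as u8;
        }
    }
    masks
}

/// Scalar model of `_mm_shuffle_epi8`: a mask byte with its high bit set
/// produces zero, otherwise its low four bits index into `src`.
fn shuffle_model(src: &[u8; 16], mask: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        if mask[i] & 0x80 == 0 {
            out[i] = src[(mask[i] & 0x0f) as usize];
        }
    }
    out
}

/// Expand the 3-byte pattern at the start of `loaded` into 48 bytes,
/// one 16-byte shuffle per mask.
fn expand_offset3(loaded: &[u8; 16]) -> [u8; 48] {
    let masks = offset3_masks();
    let mut out = [0u8; 48];
    for k in 0..3 {
        out[16 * k..16 * (k + 1)].copy_from_slice(&shuffle_model(loaded, &masks[k]));
    }
    out
}
```

Because 48 is a multiple of 3, the pattern phase returns to zero after every 48 bytes, so the same three masks can be reused on each loop iteration. Bytes 3 to 15 of the loaded vector never appear in the output, which is one way to read "masking out garbage".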
Performance Impact:
- `Decompress offset3` throughput improved by ~540% (~1.44 GiB/s -> ~9.22 GiB/s).
- `Decompress offset3 small` throughput improved by ~1.4%.
- `Decompress offset30` throughput improved by ~4%.
- `Decompress offset31` throughput improved by ~9%.
- `Decompress offset32` throughput improved by ~1.8%.
This aligns the performance of offset 3 with other small offsets that are
already optimized.
Co-authored-by: 404Setup <[email protected]>
1 parent 6502d7b commit 42e7161
2 files changed
Lines changed: 67 additions & 1 deletion
First file: 4 lines added (new lines 30-33).
Second file: 6 lines added (new lines 52-57); 56 lines added (new lines 940-995); old line 935 replaced by new line 997.