Skip to content

Cache min/max indices in patches#7753

Open
palaska wants to merge 12 commits intodevelopfrom
bp/minmax-index
Open

Cache min/max indices in patches#7753
palaska wants to merge 12 commits intodevelopfrom
bp/minmax-index

Conversation

@palaska
Copy link
Copy Markdown
Contributor

@palaska palaska commented May 1, 2026

Cache min/max patch indices in Patches and short-circuit search_index when the query falls outside that range. Speeds up ALP/BitPacked scalar_at. This should help with slicing any patched array into a region that doesn't overlap the patch range

Screenshot 2026-05-01 at 16 26 04

Signed-off-by: Baris Palaska <[email protected]>
@palaska palaska added the changelog/performance A performance improvement label May 1, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 1, 2026

Merging this PR will degrade performance by 31.73%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 16 improved benchmarks
❌ 71 regressed benchmarks
✅ 1119 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime cuda/bitpacked_u8/unpack/3bw[100M] 352.3 µs 298.6 µs +17.96%
Simulation chunked_dict_primitive_canonical_into[f32, (1000, 100, 100)] 686.1 µs 766.3 µs -10.46%
Simulation chunked_dict_primitive_canonical_into[u32, (1000, 100, 100)] 685.8 µs 764.6 µs -10.31%
Simulation chunked_dict_primitive_canonical_into[u32, (1000, 10, 100)] 670.5 µs 749.3 µs -10.52%
Simulation chunked_dict_primitive_canonical_into[u32, (1000, 100, 10)] 87.4 µs 97.2 µs -10.08%
Simulation chunked_dict_primitive_canonical_into[f32, (1000, 10, 100)] 670.4 µs 750.6 µs -10.69%
Simulation chunked_dict_primitive_into_canonical[f32, (1000, 10, 100)] 727.7 µs 808.6 µs -10%
Simulation chunked_dict_primitive_into_canonical[u32, (1000, 100, 10)] 97.6 µs 108.7 µs -10.22%
Simulation decode_primitives[f32, (1000, 2)] 17 µs 19.8 µs -14.19%
Simulation decode_primitives[f32, (1000, 32)] 17.1 µs 19.3 µs -11.85%
Simulation decode_primitives[f32, (1000, 4)] 17 µs 19.9 µs -14.71%
Simulation decode_primitives[f32, (1000, 512)] 18.3 µs 20.9 µs -12.34%
Simulation decode_primitives[f32, (1000, 8)] 17 µs 19.3 µs -11.75%
Simulation decode_primitives[i64, (1000, 2)] 19.3 µs 22.3 µs -13.65%
Simulation decode_primitives[i64, (1000, 4)] 19.3 µs 22 µs -12.43%
Simulation decode_primitives[i64, (1000, 512)] 21.4 µs 24.3 µs -11.91%
Simulation decode_primitives[i64, (1000, 8)] 19.3 µs 22 µs -11.94%
Simulation decode_primitives[u8, (1000, 2)] 16.4 µs 18.8 µs -12.67%
Simulation decode_primitives[u8, (1000, 32)] 16 µs 18.4 µs -13.09%
Simulation decode_primitives[u8, (1000, 4)] 16 µs 18.4 µs -13.25%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing bp/minmax-index (8bc3d8a) with develop (f307edc)

Open in CodSpeed

Signed-off-by: Baris Palaska <[email protected]>
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices arrays have min/max stats these should be used instead.

We might not currently save or use this, but by these benchmarks we should.

If these benchmarks are cleaned up I am happy to merge those first.

@joseph-isaacs
Copy link
Copy Markdown
Contributor

Signed-off-by: Baris Palaska <[email protected]>
Signed-off-by: Baris Palaska <[email protected]>
palaska added a commit that referenced this pull request May 5, 2026
Isolating benchmarks before this optimization:
#7753

---------

Signed-off-by: Baris Palaska <[email protected]>
Copy link
Copy Markdown
Contributor

@robert3005 robert3005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meta point, this is why you want to have special logic that's shared across all arrays, if we have Patched array then that could be one place where all of this logic could live without hacks. Alas we need Patched array to get over the line

@palaska
Copy link
Copy Markdown
Contributor Author

palaska commented May 6, 2026

Stats lookups (rwlock + hashmap + scalar conversion) were too expensive to be in the hot path (search_index is called in tight loops) so I still went with a OnceLock approach, combined with stats.

@palaska
Copy link
Copy Markdown
Contributor Author

palaska commented May 6, 2026

I think RwLock -> ArcSwap change made setting stats more expensive..

@joseph-isaacs
Copy link
Copy Markdown
Contributor

It does look that way. I wonder if it made reading them faster?

@robert3005
Copy link
Copy Markdown
Contributor

robert3005 commented May 6, 2026

Github is failing me - remove extra Arc around arcswap tihs is not going to work, arcswap isn't clone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants