perf(parquet): use SVE gather for dictionary decode on AArch64#10066

Open
wuleiwuleiwulei wants to merge 1 commit into apache:main from wuleiwuleiwulei:wl_0602_get_batch_with_dict

@wuleiwuleiwulei wuleiwuleiwulei commented Jun 4, 2026

Add a runtime-detected SVE gather path to RleDecoder::get_batch_with_dict. On AArch64, NEON has no gather instruction and SVE is not part of the baseline ISA, so a portable binary cannot autovectorise the dictionary lookup to a hardware gather. This adds an inline-assembly SVE kernel, gated by a cached is_aarch64_feature_detected!("sve") check, that gathers 4- and 8-byte dictionary values.

The gather slots in at the existing get_unchecked site introduced in #9746 and reuses its hoisted max-reduction bounds check, so it adds no new per-element checks. It is gated on !needs_drop::<T>() since the kernel copies raw element bytes, and falls back to the scalar loop for every other type, width, or when SVE is unavailable. Non-AArch64 targets are unchanged.

Which issue does this PR close?

Rationale for this change

RleDecoder::get_batch_with_dict is on the hot path for decoding
dictionary-encoded columns. #9746 recently reworked the bit-packed branch so
the per-element bounds check is hoisted into a single u32 max-reduction
followed by an unchecked gather, which lets the autovectorizer emit a hardware
gather on x86 (AVX2/AVX-512) when those target features are enabled.
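The shape of that rework can be sketched as follows. This is a minimal illustration, not the code from #9746: the function name `gather_with_hoisted_check`, its signature, the `Copy` bound, and the `String` error type are all assumptions made for the example.

```rust
/// Hypothetical helper (not the parquet crate's API): writes `dict[idx]` into
/// `out` for every `idx` in `indices`, using one bounds check per chunk.
pub fn gather_with_hoisted_check<T: Copy>(
    dict: &[T],
    indices: &[u32],
    out: &mut [T],
) -> Result<(), String> {
    assert_eq!(indices.len(), out.len(), "one output slot per index");

    // Max-reduction over the chunk: a branch-free loop over u32 values.
    let max_idx = indices.iter().copied().fold(0u32, u32::max);

    // Single bounds check covering every index in the chunk.
    if !indices.is_empty() && max_idx as usize >= dict.len() {
        return Err(format!(
            "dictionary index {max_idx} out of bounds for length {}",
            dict.len()
        ));
    }

    for (slot, &idx) in out.iter_mut().zip(indices) {
        // SAFETY: idx <= max_idx < dict.len(), established by the check above.
        *slot = unsafe { *dict.get_unchecked(idx as usize) };
    }
    Ok(())
}
```

With the check out of the loop body, the remaining loop is a plain indexed load, which is the pattern a compiler can turn into a hardware gather when the target features allow it.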

On AArch64 the situation is different: base Armv8-A / NEON has no gather
instruction, and the only Arm gather lives in SVE, which is an optional
extension that is not part of the AArch64 baseline. A portable
aarch64-unknown-linux-gnu binary therefore does not have +sve in its target
features, so even after #9746 the autovectorizer cannot emit an SVE gather for
this loop and falls back to scalar loads. This PR fills that gap with a
runtime-detected SVE gather, so portable binaries can use the hardware gather on
SVE-capable cores (e.g. Kunpeng 920B) without requiring -C target-cpu=... at
build time.

What changes are included in this PR?

  • A new #[cfg(target_arch = "aarch64")] mod sve_gather containing:
    • a cached runtime feature check (is_aarch64_feature_detected!("sve")),
    • gather_4byte / gather_8byte inline-assembly kernels (Rust has no stable
      SVE intrinsics yet, hence asm!),
    • a try_gather entry point that returns false when SVE is unavailable or
      the type is unsuitable, so callers fall back to the scalar path.
  • In the bit-packed branch of get_batch_with_dict, the SVE gather is attempted
    at the existing get_unchecked site introduced in #9746 (perf(parquet):
    Vectorize dict-index bounds check in RleDecoder::get_batch_with_dict (up to
    -7.9%)). It reuses #9746's hoisted max-reduction bounds check unchanged:
    every index in the chunk is already proven in-bounds, so it adds no new
    per-element checks and no new unsoundness. If SVE is unavailable (or the
    type is unsuitable) it falls back to the same scalar loop.
  • The SVE path is gated on !needs_drop::<T>() and element size 4/8, since the
    kernels copy raw element bytes; everything else (other widths, drop types,
    non-AArch64 targets) is unchanged and takes the existing scalar path.

No public API changes; the x86 / non-AArch64 code paths are untouched.
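The gating described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the names `sve_available`, `detect_sve`, and `type_is_gatherable` are hypothetical, the `try_gather` signature is a guess, and the inline-assembly kernel is replaced by a plain raw copy so the example builds on any target.

```rust
use std::mem::{needs_drop, size_of};
use std::sync::OnceLock;

/// Cached runtime check: detection runs once per process.
pub fn sve_available() -> bool {
    static SVE: OnceLock<bool> = OnceLock::new();
    *SVE.get_or_init(detect_sve)
}

#[cfg(target_arch = "aarch64")]
fn detect_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_sve() -> bool {
    false // non-AArch64 targets never take the SVE path
}

/// Type-level gate: 4- or 8-byte elements with no drop glue.
pub fn type_is_gatherable<T>() -> bool {
    !needs_drop::<T>() && matches!(size_of::<T>(), 4 | 8)
}

/// Hypothetical entry point. Returns `true` if `out` was filled, `false` if
/// the caller must run its scalar loop instead.
///
/// # Safety
/// Every index must be `< dict.len()` (the hoisted bounds check guarantees
/// this), and `T` must be valid to duplicate bytewise.
pub unsafe fn try_gather<T>(dict: &[T], indices: &[u32], out: &mut [T]) -> bool {
    if !type_is_gatherable::<T>() || !sve_available() || indices.len() != out.len() {
        return false;
    }
    // Stand-in for the asm! kernel (assumption): copy raw element bytes.
    for (slot, &idx) in out.iter_mut().zip(indices) {
        unsafe {
            std::ptr::copy_nonoverlapping(dict.as_ptr().add(idx as usize), slot as *mut T, 1);
        }
    }
    true
}
```

A caller would use it as `if !unsafe { try_gather(dict, idx, out) } { /* existing scalar loop */ }`, so the scalar loop remains the single fallback for every case the gate rejects.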

Are these changes tested?

Correctness is covered by the existing dictionary-decode tests
(test_rle_decode_with_dict_int32, test_rle_skip_dict, test_truncated_rle,
test_long_run), which exercise the same get_batch_with_dict path; on
SVE-capable hardware these run through the SVE gather, and on all other targets
through the unchanged scalar path. The SVE gather is type-transparent (it
produces the same bytes as the scalar clone_from), and it reuses the bounds
check that #9746 already validates.

Build/lint verified for both targets:

  • x86_64: existing rle tests pass; cargo clippy ... -D warnings clean.
  • aarch64-unknown-linux-gnu (cross): compiles and cargo clippy -D warnings
    is clean, so the asm! blocks are checked by the compiler.

Performance: measured on Kunpeng 920B over TPC-H. Note that my original numbers
were taken against a pre-#9746 baseline, so they conflate #9746's
autovectorization with the SVE gather. I'm re-running the benchmark against
current main (post-#9746) to isolate the SVE-only contribution and will post
the apples-to-apples results here.

Are there any user-facing changes?

No public API or behavior changes. On AArch64 CPUs that report SVE support at
runtime, dictionary decoding uses an SVE gather; on every other CPU/target the
behavior and code path are identical to before. No new feature flags and no MSRV
change (uses stable is_aarch64_feature_detected! and asm!).

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 4, 2026

Development

Successfully merging this pull request may close these issues.

perf: Use AArch64 SVE gather to speed up RLE dictionary decoding
