perf(parquet): use SVE gather for dictionary decode on AArch64 #10066
Open
wuleiwuleiwulei wants to merge 1 commit into
Add a runtime-detected SVE gather path to RleDecoder::get_batch_with_dict.
On AArch64, NEON has no gather instruction and SVE is not part of the
baseline ISA, so a portable binary cannot autovectorise the dictionary
lookup to a hardware gather. This adds an inline-assembly SVE kernel,
gated by a cached `is_aarch64_feature_detected!("sve")` check, that gathers
4- and 8-byte dictionary values.
The gather slots in at the existing `get_unchecked` site introduced in apache#9746
and reuses its hoisted max-reduction bounds check, so it adds no new
per-element checks. It is gated on `!needs_drop::<T>()` since the kernel
copies raw element bytes, and falls back to the scalar loop for every other
type, width, or when SVE is unavailable. Non-AArch64 targets are unchanged.
Which issue does this PR close?
Rationale for this change
`RleDecoder::get_batch_with_dict` is on the hot path for decoding
dictionary-encoded columns. #9746 recently reworked the bit-packed branch so
the per-element bounds check is hoisted into a single u32 max-reduction
followed by an unchecked gather, which lets the autovectorizer emit a hardware
gather on x86 (AVX2/AVX-512) when those target features are enabled.
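For readers who have not looked at #9746, the hoisting pattern can be sketched roughly as below. This is an illustrative reduction of the idea, not the code from either PR: the function name `gather_checked_once`, its signature, and the `String` error type are assumptions made for the example.

```rust
/// Hypothetical sketch: look up `indices` in `dict`, validating bounds once
/// per chunk instead of once per element.
fn gather_checked_once<T: Clone>(
    dict: &[T],
    indices: &[u32],
    out: &mut [T],
) -> Result<(), String> {
    assert_eq!(indices.len(), out.len());

    // Branch-free max-reduction over the chunk; this loop has no early exit,
    // so it is a good candidate for autovectorization.
    let max_idx = indices.iter().copied().fold(0u32, u32::max);
    if !indices.is_empty() && max_idx as usize >= dict.len() {
        return Err(format!(
            "index {max_idx} out of range for dictionary of length {}",
            dict.len()
        ));
    }

    for (slot, &idx) in out.iter_mut().zip(indices) {
        // SAFETY: every index is <= max_idx, and max_idx < dict.len() was
        // verified above, so the unchecked access stays in bounds.
        slot.clone_from(unsafe { dict.get_unchecked(idx as usize) });
    }
    Ok(())
}
```

With the check hoisted out, the lookup loop body is a plain indexed load, which is the shape a compiler can turn into a hardware gather when the target supports one.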
On AArch64 the situation is different: base Armv8-A / NEON has no gather
instruction, and the only Arm gather lives in SVE, which is an optional
extension that is not part of the AArch64 baseline. A portable
`aarch64-unknown-linux-gnu` binary therefore does not have `+sve` in its target
features, so even after #9746 the autovectorizer cannot emit an SVE gather for
this loop and falls back to scalar loads. This PR fills that gap with a
runtime-detected SVE gather, so portable binaries can use the hardware gather on
SVE-capable cores (e.g. Kunpeng 920B) without requiring
`-C target-cpu=...` at build time.
What changes are included in this PR?
- New `#[cfg(target_arch = "aarch64")] mod sve_gather` containing:
  - a cached runtime feature check (`is_aarch64_feature_detected!("sve")`),
  - `gather_4byte` / `gather_8byte` inline-assembly kernels (Rust has no stable
    SVE intrinsics yet, hence `asm!`),
  - a `try_gather` entry point that returns `false` when the gather cannot be
    used, so callers fall back to the scalar path.
- In `get_batch_with_dict`, the SVE gather is attempted at the existing
  `get_unchecked` site introduced in #9746. It reuses #9746's hoisted
  max-reduction bounds check unchanged (every index in the chunk is already
  proven in-bounds), so it adds no new per-element checks and no new
  unsoundness. If SVE is unavailable (or the type is unsuitable) it falls back
  to the same scalar loop.
- The gather is gated on `!needs_drop::<T>()` and element size 4/8, since the
  kernels copy raw element bytes; everything else (other widths, drop types,
  non-AArch64 targets) is unchanged and takes the existing scalar path.
No public API changes; the x86 / non-AArch64 code paths are untouched.
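The dispatch described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `detect_sve`, `sve_available`, `gather_eligible`, `gather_raw`, and `decode_chunk` are hypothetical names, the signatures are guesses, and `gather_raw` is a portable stand-in that copies raw element bytes in a scalar loop where the real change uses an `asm!` SVE kernel.

```rust
use std::mem::{needs_drop, size_of};
use std::sync::OnceLock;

#[cfg(target_arch = "aarch64")]
fn detect_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_sve() -> bool {
    false // non-AArch64 targets never take the gather path
}

/// Runs the feature check once and caches the answer.
fn sve_available() -> bool {
    static SVE: OnceLock<bool> = OnceLock::new();
    *SVE.get_or_init(detect_sve)
}

/// Type-level gate: only drop-free 4- or 8-byte elements qualify.
fn gather_eligible<T>() -> bool {
    !needs_drop::<T>() && matches!(size_of::<T>(), 4 | 8)
}

/// Stand-in for the SVE kernel (assumption: the real one is inline assembly).
///
/// # Safety
/// Every index must be `< dict.len()`, `out.len()` must equal
/// `indices.len()`, and `gather_eligible::<T>()` must hold.
unsafe fn gather_raw<T>(dict: &[T], indices: &[u32], out: &mut [T]) {
    for (k, &idx) in indices.iter().enumerate() {
        // Raw byte copy: sound here only because T has no drop glue.
        unsafe {
            std::ptr::copy_nonoverlapping(
                dict.as_ptr().add(idx as usize),
                out.as_mut_ptr().add(k),
                1,
            );
        }
    }
}

/// Returns `false` when the gather cannot be used, so the caller falls back.
///
/// # Safety
/// Same index and length requirements as `gather_raw`.
unsafe fn try_gather<T>(dict: &[T], indices: &[u32], out: &mut [T]) -> bool {
    if !gather_eligible::<T>() || !sve_available() {
        return false;
    }
    unsafe { gather_raw(dict, indices, out) };
    true
}

/// Caller side: validate once, try the gather, otherwise run the scalar loop.
fn decode_chunk<T: Clone>(dict: &[T], indices: &[u32], out: &mut [T]) {
    assert_eq!(indices.len(), out.len());
    assert!(indices.iter().all(|&i| (i as usize) < dict.len()));
    // SAFETY: lengths and indices were validated just above.
    if unsafe { try_gather(dict, indices, out) } {
        return;
    }
    for (slot, &idx) in out.iter_mut().zip(indices) {
        slot.clone_from(&dict[idx as usize]);
    }
}
```

In this sketch a `String` dictionary always takes the scalar loop because `needs_drop::<String>()` is true, while an `i32` or `f64` dictionary takes the gather only when the CPU reports SVE at runtime.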
Are these changes tested?
Correctness is covered by the existing dictionary-decode tests
(`test_rle_decode_with_dict_int32`, `test_rle_skip_dict`, `test_truncated_rle`,
`test_long_run`), which exercise the same `get_batch_with_dict` path; on
SVE-capable hardware these run through the SVE gather, and on all other targets
through the unchanged scalar path. The SVE gather is type-transparent (it
produces the same bytes as the scalar `clone_from`), reusing the bounds check
that #9746 already validates.
Build/lint verified for both targets:
- x86_64: existing `rle` tests pass; `cargo clippy ... -D warnings` clean.
- `aarch64-unknown-linux-gnu` (cross): compiles and `cargo clippy -D warnings`
  is clean, so the `asm!` blocks are checked by the compiler.
Performance: measured on Kunpeng 920B over TPC-H. Note that my original numbers
were taken against a pre-#9746 baseline, so they conflate #9746's
autovectorization with the SVE gather. I'm re-running the benchmark against
current `main` (post-#9746) to isolate the SVE-only contribution and will post
the apples-to-apples results here.
Are there any user-facing changes?
No public API or behavior changes. On AArch64 CPUs that report SVE support at
runtime, dictionary decoding uses an SVE gather; on every other CPU/target the
behavior and code path are identical to before. No new feature flags and no MSRV
change (uses stable `is_aarch64_feature_detected!` and `asm!`).