These issues/PRs implement a coordinated performance improvement effort for parquet-java encoding and decoding hot paths. The work focuses on reducing CPU overhead, allocation pressure, and avoidable memory copies in commonly used readers and writers, including plain values, binary values, byte-stream split encoding, dictionary encoding, delta byte-array encoding, delta binary packing, RLE/bit-packing decoding, page assembly, and row-group flushing.
Together, the changes preserve existing Parquet format compatibility and public behavior while making the implementation more efficient internally. The improvements rely on more direct ByteBuffer access, batched read/write operations, reusable buffers and helpers, cached computed values, and earlier release of temporary memory. The goal of this parent issue is to track the broader optimization series as a set of focused, reviewable PRs, each improving one hot path while contributing to better end-to-end read/write performance and lower memory usage.
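As a rough illustration of the "more direct ByteBuffer access" idea, the sketch below contrasts byte-at-a-time little-endian assembly with reading through a ByteBuffer. The class and method names are hypothetical, not the actual parquet-java reader classes; this is only a minimal model of the technique.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch (not parquet-java code): decode a page of plain
// little-endian ints either byte by byte or via direct ByteBuffer reads.
public class PlainIntReadSketch {

  // Byte-at-a-time decoding, roughly what a naive reader does.
  static int[] readSlow(byte[] page, int count) {
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      int base = i * 4;
      out[i] = (page[base] & 0xFF)
          | (page[base + 1] & 0xFF) << 8
          | (page[base + 2] & 0xFF) << 16
          | (page[base + 3] & 0xFF) << 24;
    }
    return out;
  }

  // Direct ByteBuffer reads: getInt() lets the JVM use a single wide load
  // per value instead of four masked byte loads and shifts.
  static int[] readDirect(byte[] page, int count) {
    ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = buf.getInt();
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] page = {1, 0, 0, 0, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
    int[] a = readSlow(page, 2);
    int[] b = readDirect(page, 2);
    if (a[0] != 1 || a[1] != -1 || b[0] != 1 || b[1] != -1) {
      throw new AssertionError("decoders disagree");
    }
    System.out.println("ok");
  }
}
```

Both paths produce identical values; the direct-read variant simply removes per-byte work from the hot loop.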
Benchmark summary
Benchmarks were run with JMH (-wi 3 -i 5 -f 1, 100k values/invocation) on Linux x86_64, JDK 25 (Temurin-25.0.3+9-LTS). The machine was an Azure VM with 8 vCPUs on an AMD EPYC 9V45 96-Core Processor, 4 cores / 8 threads visible, AVX2 and AVX-512 available, and 31 GiB RAM.
Additional changes are primarily allocation or memory improvements rather than direct throughput microbenchmark wins: IntList.size() becomes O(1), the batch read API enables more efficient reader implementations, page assembly avoids full-page copies, and row-group flushing releases column buffers earlier to reduce peak memory usage.
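To make the batch-read point concrete, here is a minimal sketch of what such an API can look like. The interface and class names are illustrative assumptions, not the real parquet-java ValuesReader API: a bulk `readIntegers()` lets an implementation decode a whole run per call, with a default per-value fallback so existing readers keep working.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical batch read API sketch (names are illustrative).
interface IntReader {
  int readInteger();

  // Default fallback: per-value loop, so old readers need no changes.
  default void readIntegers(int count, int[] dst, int off) {
    for (int i = 0; i < count; i++) {
      dst[off + i] = readInteger();
    }
  }
}

class PlainIntReader implements IntReader {
  private final ByteBuffer buf;

  PlainIntReader(byte[] page) {
    this.buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
  }

  @Override
  public int readInteger() {
    return buf.getInt();
  }

  // Batch override: one bulk copy through an IntBuffer view instead of
  // count separate virtual calls.
  @Override
  public void readIntegers(int count, int[] dst, int off) {
    buf.asIntBuffer().get(dst, off, count);
    buf.position(buf.position() + count * 4); // advance past consumed bytes
  }
}

public class BatchReadSketch {
  public static void main(String[] args) {
    byte[] page = {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};
    PlainIntReader r = new PlainIntReader(page);
    int[] dst = new int[3];
    r.readIntegers(3, dst, 0);
    if (dst[0] != 1 || dst[1] != 2 || dst[2] != 3) {
      throw new AssertionError("batch read mismatch");
    }
    System.out.println("ok");
  }
}
```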
This is a parent issue to track the ongoing work on performance improvements for encodings/decodings and other areas of the Java implementation. Since I am not a committer, I don't have permission to create sub-issues, so I am using this one as the main place to track them.
Benchmarks covered (the results table is not reproduced here): IntEncodingBenchmark.decodePlain, IntEncodingBenchmark.encodePlain, BinaryEncodingBenchmark.encodeDictionary (LOW/1000), ByteStreamSplitEncodingBenchmark (Long), ByteStreamSplitDecodingBenchmark (Float), BinaryEncodingBenchmark.decodePlain (LOW/10), IntEncodingBenchmark.encodeDictionary (RANDOM), BinaryEncodingBenchmark.encodeDeltaByteArray (HIGH/10), IntEncodingBenchmark.decodeDictionary (SEQUENTIAL), IntEncodingBenchmark.decodeDelta (HIGH_CARDINALITY).
REVIEWS IN PROGRESS
GH-3493: Optimize PlainValuesReader with direct ByteBuffer reads #3494
GH-3495: Optimize PlainValuesWriter with direct ByteBuffer slab writes (~2.5x encode speedup) #3496
GH-3499: Cache hashCode() for non-reused Binary instances (up to 73x dictionary-encode speedup) #3500
GH-3503: Optimize ByteStreamSplitValuesWriter with batched scatter writes #3504
GH-3505: Optimize ByteStreamSplitValuesReader page transposition #3506
GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3510
Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList #3513
GH-3516: Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter #3517
GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) #3523
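The byte-stream split PRs above rest on a simple transposition: byte k of every fixed-width value is gathered into stream k, which usually compresses better, and decoding reverses the scatter in one batched pass. A minimal sketch of that layout (illustrative code, not the actual ByteStreamSplitValuesWriter/Reader):

```java
import java.util.Arrays;

// Hypothetical sketch of byte-stream split encoding for fixed-width values.
public class ByteStreamSplitSketch {

  // Scatter: byte k of value i goes to position i within stream k.
  static byte[] encode(byte[] plain, int width) {
    int n = plain.length / width;
    byte[] out = new byte[plain.length];
    for (int i = 0; i < n; i++) {
      for (int k = 0; k < width; k++) {
        out[k * n + i] = plain[i * width + k];
      }
    }
    return out;
  }

  // Gather: reassemble each value from one byte per stream.
  static byte[] decode(byte[] split, int width) {
    int n = split.length / width;
    byte[] out = new byte[split.length];
    for (int i = 0; i < n; i++) {
      for (int k = 0; k < width; k++) {
        out[i * width + k] = split[k * n + i];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] plain = {1, 2, 3, 4, 5, 6, 7, 8}; // two 4-byte values
    byte[] split = encode(plain, 4);
    if (!Arrays.equals(decode(split, 4), plain)) {
      throw new AssertionError("round trip failed");
    }
    System.out.println("ok");
  }
}
```

The optimization work in the PRs is about doing this transposition in batched passes with reused buffers rather than per value, not about changing the layout itself.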
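For the dictionary-writer change, the core structure can be sketched as follows, with the caveat that stdlib HashMap stands in for fastutil's open-addressing maps and all names are illustrative: a plain hash map answers "what id does this value have?", while a side ArrayList preserves first-seen order for writing the dictionary page, so a linked map's extra per-entry links become unnecessary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary-writer sketch: HashMap for id lookup,
// ArrayList for insertion order (stand-in for OpenHashMap + ArrayList).
public class DictionarySketch {
  private final Map<String, Integer> ids = new HashMap<>();
  private final List<String> entries = new ArrayList<>(); // first-seen order

  // Returns the dictionary id for a value, assigning a new one on first use.
  int idOf(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = entries.size();
      ids.put(value, id);
      entries.add(value); // the dictionary page is written in this order
    }
    return id;
  }

  List<String> dictionaryPageOrder() {
    return entries;
  }

  public static void main(String[] args) {
    DictionarySketch d = new DictionarySketch();
    int a = d.idOf("x"), b = d.idOf("y"), c = d.idOf("x");
    if (a != 0 || b != 1 || c != 0) throw new AssertionError();
    if (!d.dictionaryPageOrder().equals(List.of("x", "y"))) throw new AssertionError();
    System.out.println("ok");
  }
}
```

Ids still match entry positions, so readers see the same dictionary pages; only the writer-side bookkeeping gets cheaper.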
Benchmarks PR
GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512
(Warning: the associated GH-XXXX issue number is not correct.)