Multiple changes to optimize and cleanup typed_buffers, BasicXorEncryptor, and Sequencer by avalerio-tkd · Pull Request #231 · protegrity/DataBatchProtectionService

avalerio-tkd · 2026-03-13T21:43:32Z

General haircut
- Inlined methods and cached the sizes of vectors/spans to avoid repeated xx.size() calls.

Setup
- Read num_elements from the Parquet metadata instead of the costly calculation from the payload.

Sequencer
- Removed compress/decompress of the encrypted result (minimal or negative size gains).

Loop-read
- Simplified the iterator to read spans from the input data (zero-copy).

Loop-xor
- Perform the XOR encryption through char* pointers.

Loop-write
- GetWritableSpan to encrypt in place.
- GetWritableSpan with a single resize and pointers.
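
The Loop-xor and Loop-write items can be pictured with a minimal sketch. It assumes a repeating key and plain byte pointers; `XorInto` and `XorEncryptCopy` are hypothetical names for illustration and are not the signatures of `XorEncryptInto` or `GetWritableSpan` in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: XOR `size` bytes from `src` into `dst` with a repeating key.
// `src` and `dst` may be the same pointer, which gives in-place encryption.
inline void XorInto(const uint8_t* src, uint8_t* dst, std::size_t size,
                    const uint8_t* key, std::size_t key_size) {
    if (key_size == 0) {
        throw std::invalid_argument("XorInto: key must not be empty");
    }
    std::size_t k = 0;  // position inside the key, wrapped without a modulo per byte
    for (std::size_t i = 0; i < size; ++i) {
        dst[i] = static_cast<uint8_t>(src[i] ^ key[k]);
        if (++k == key_size) k = 0;
    }
}

// Hypothetical helper: sizes the output once, then XORs straight into it.
inline std::vector<uint8_t> XorEncryptCopy(const std::vector<uint8_t>& input,
                                           const std::vector<uint8_t>& key) {
    std::vector<uint8_t> out(input.size());  // single resize up front
    XorInto(input.data(), out.data(), input.size(), key.data(), key.size());
    return out;
}
```

Applying `XorInto` a second time with the same key and with `src == dst` restores the original bytes, which is what makes the in-place variant convenient for round-trip tests.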

- Made various functions inline to avoid extra function calls

- Merging conflicts.

- Marked various functions as inline - Switched to use XorEncryptInto

- small performance improvements.

…s optimization.

…aders+level bytes without actually reading the value bytes.
- num_elements for DICTIONARY_PAGE and DATA_PAGE_V2 pages is read directly from the header values (easy)
- num_elements for DATA_PAGE_V1 pages is read from the level bytes via RLE bit decoding (hard!)
- Added unittests for RLE bit decoding and for counting present values from definition levels for DATA_PAGE_V1 pages.
- Updated all scripts and unittests for the new Parquet parsing requirements.

…XorEncryptor.
- Now num_elements is an invariant / constant all along TypedValuesBuffer and BasicXorEncryptor (Yay!)
- Fixed an issue in the iterator to account for the prefix size on input buffers.
- Many unittest fixes for the interface updates for num_elements.

… trailing values in bit-packed runs.

…uffer_testing_codecs.h for testing border cases. - Added StringToBytes helper function to bytes_utils.h for testing.

argmarco-tkd

Thank you for this. Overall it looks good. I only skimmed the DataPageV1-related code (mainly because I'm not familiar with the format). I can take a deeper dive if needed. Left a few comments throughout the PR.

argmarco-tkd · 2026-03-14T23:01:06Z

+inline std::vector<uint8_t> StringToBytes(const std::string& str) {
+    return std::vector<uint8_t>(str.begin(), str.end());
+}


we should have a test for this function.

argmarco-tkd · 2026-03-14T23:09:29Z

    constexpr bool is_fixed = TypedBuffer::is_fixed_sized;
    constexpr size_t prefix_length = is_fixed ? kFixedHeaderLength : kVariableHeaderLength;
+
+    // GetNumElements is read from Parquet metadata and level bytes, not calculated from payload.


nit: are we introducing Parquet concepts into the Encryptor intentionally?

Removed the comment. We're not adding Parquet concepts. This was a note-to-self as a reminder of the optimization: num_elements is now read from the Parquet metadata instead of being calculated from the payload.

argmarco-tkd · 2026-03-14T23:23:08Z

+      element_iterator_current_ptr_(
+          elements_span.data() + std::min(prefix_size, elements_span_size_)),
+      element_iterator_end_ptr_(elements_span.data() + elements_span_size_),


let's add some comments to help understand this.

Added comments. Let me know if those are sufficient. Ideally I would have put these in the constructor's body, but if moved there, the members can't be const (they become mutable), so I preferred to keep them here.

Thanks, it helps. For clarity, I think line 165 should read **calculated as** start of the span + size

Yes, much better. Updated the comment. Thanks.
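
For readers without the full diff, here is a minimal sketch of the pointer setup discussed in this thread. `ByteCursor` is a hypothetical stand-in, not the PR's `ByteBuffer`; it only shows why the start pointer is clamped with `std::min` and why both members can stay `const` when set in the initializer list.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical cursor over a byte range that begins with a prefix to skip.
struct ByteCursor {
    // Calculated as start of the range + prefix size, clamped to the range size
    // so that it can never point past `end`, even when prefix_size > size.
    const uint8_t* const current;
    // Calculated as start of the range + range size (one past the last byte).
    const uint8_t* const end;

    ByteCursor(const uint8_t* data, std::size_t size, std::size_t prefix_size)
        : current(data + std::min(prefix_size, size)),
          end(data + size) {}

    // Number of payload bytes left after the prefix.
    std::size_t Remaining() const { return static_cast<std::size_t>(end - current); }
};
```

With an 8-byte range and a 3-byte prefix, `Remaining()` is 5; with a prefix longer than the range, the clamp makes `current == end` and `Remaining()` is 0 instead of producing an out-of-range pointer.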

argmarco-tkd · 2026-03-14T23:30:08Z

-                "Malformed fixed-size buffer: computed payload size does not match readable size");
+
+        // Check that the num_elements passed at construction time matches the count calculated from the payload size.
+        const size_t num_elements_on_payload = readable_size / element_size_;


is there risk of this not being a pure integer division?

Thanks for checking on that. The result is always a correct integer result because of the "modulo" check above (no remainder on the division). Added an inline comment for this.
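
The ordering described in this reply (modulo check first, division second) can be sketched as follows. `CountFixedSizeElements` and its exception types are hypothetical and only illustrate the reasoning; the PR's actual checks live inside the typed buffer.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: validates a fixed-size payload and returns its element count.
inline std::size_t CountFixedSizeElements(std::size_t readable_size,
                                          std::size_t element_size,
                                          std::size_t expected_num_elements) {
    if (element_size == 0) {
        throw std::invalid_argument("element_size must be greater than zero");
    }
    // Modulo check: reject payloads that are not a whole number of elements.
    if (readable_size % element_size != 0) {
        throw std::runtime_error("payload size is not a multiple of the element size");
    }
    // The remainder is zero, so this division is exact (nothing is truncated).
    const std::size_t num_elements_on_payload = readable_size / element_size;
    if (num_elements_on_payload != expected_num_elements) {
        throw std::runtime_error("num_elements does not match the payload size");
    }
    return num_elements_on_payload;
}
```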

argmarco-tkd · 2026-03-14T23:32:09Z

@@ -246,52 +212,54 @@ void ByteBuffer<Codec>::InitializeFromSpan() const {
    // Variable-size layout stores [u32 size][element value] back-to-back.
    // Single pass validates shape and captures per-element prefix offsets.
    offsets_.clear();


as a note: vector.clear() is not a zero-cost operation, as each element is destroyed and the vector is 'resized' to zero. In this case, because the vectors are essentially byte arrays, the cost of calling clear may be very small (but non zero).

You're right. In this case, since this is the initialization function (called once during lazy evaluation) and the line runs only once, I preferred to guarantee a clean state and wipe the vector first. The cost should be very small since the vector should already be empty at that point.
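
As an illustration of the single pass mentioned in the quoted code comments, the sketch below walks a `[u32 size][element value]` layout and records the offset of each length prefix. It assumes a little-endian u32 prefix, which the thread does not state, and `CollectPrefixOffsets` is a hypothetical name rather than the PR's `InitializeFromSpan`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Reads a 32-bit little-endian value (assumed byte order for this sketch).
inline uint32_t ReadU32LE(const uint8_t* p) {
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical helper: validates the shape of the buffer and returns the
// offset of every length prefix, one per element.
inline std::vector<std::size_t> CollectPrefixOffsets(const std::vector<uint8_t>& bytes) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        if (bytes.size() - pos < 4) {
            throw std::runtime_error("truncated length prefix");
        }
        const std::size_t length = ReadU32LE(bytes.data() + pos);
        if (bytes.size() - pos - 4 < length) {
            throw std::runtime_error("truncated element value");
        }
        offsets.push_back(pos);  // remember where this element's prefix starts
        pos += 4 + length;       // jump over the prefix and the value
    }
    return offsets;
}
```

Zero-length elements are valid in this layout: they contribute a prefix offset and advance the position by four bytes only.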

argmarco-tkd · 2026-03-15T16:35:18Z

+    RawBytesFixedSizedBuffer buffer(
+        tcb::span<const uint8_t>(bytes), expected.size(), 0u, RawBytesFixedSizedCodec{kElementSize});


nit: shouldn't it be bytes.size() instead of expected.size() (assuming we keep using bytes, that is)?

In this case, expected.size() gives us the number of elements, the param needed for the TypedBuffer constructor. bytes.size() would give the length of the flattened raw buffer, which for this parameter would be incorrect.
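
The difference between the two sizes can be shown with a small sketch. `Flatten` is a hypothetical test helper, not part of the PR; it builds the raw buffer from a list of fixed-size elements the way the test above appears to.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical test helper: concatenates the elements into one raw byte buffer.
inline std::vector<uint8_t> Flatten(const std::vector<std::vector<uint8_t>>& elements) {
    std::vector<uint8_t> bytes;
    for (const auto& element : elements) {
        bytes.insert(bytes.end(), element.begin(), element.end());
    }
    return bytes;
}
```

For two 8-byte elements, `expected.size()` is 2 (the element count that a num_elements parameter wants), while `Flatten(expected).size()` is 16 (the byte length of the flattened buffer).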

argmarco-tkd · 2026-03-15T17:14:46Z

+        if (value.size() != write_span.size()) {
+            throw InvalidInputException("Encode: value size does not match write_span size");
+        }
+        std::memcpy(write_span.data(), value.data(), write_span.size());


this is a "string" encoder. Don't we need to zero-terminate the "encoded" version?

I double-checked the C++ docs, and I was initially wrong that C++ strings need to be zero-terminated here. Strings (and string views) carry their own length, so the encoded bytes don't need a terminator.

argmarco-tkd · 2026-03-15T17:15:09Z

+        if (value.size() != write_span.size()) {
+            throw InvalidInputException("Encode: value size does not match write_span size");
+        }
+        std::memcpy(write_span.data(), value.data(), write_span.size());


similar to above - doesn't this need to be zero-terminated?

Same as above. And thanks very much for checking on these details!

argmarco-tkd · 2026-03-15T17:25:06Z

+            // Build a minimal valid DATA_PAGE_V1 nullable payload (max_def_level=1, max_rep_level=0):
+            // level bytes layout is [u32 def_payload_len][def_payload].
+            // def_payload is hybrid RLE/bit-packed:
+            //   - 0x06 = varint header for an RLE run (LSB=0), run_len = 0x06 >> 1 = 3
+            //   - 0x01 = repeated definition level value (bit_width=1, value=1 => "present")
+            // This yields 3 definition levels, all present, matching the 3 fixed-length values below.


This testing makes a lot of sense here (in the remote_testapp) as a sort of integ test - however, this made me think that we need similar tests for the sequencer (I took a quick look and did not find them) - after all, the sequencer is the main execution path for both the remote and local versions of the app.

Thanks. This is a great observation. This happened when we initially were parsing only DICTIONARY_PAGE as a POC and we didn't intend at the time to fully support Parquet page formatting. Later we added more support but didn't come back to the test.

In fairness, there is coverage of this on the Parquet utils level, but not on the outermost encryption_sequencer.

Will add a few tests around that at the encryption_sequencer entry point for this.
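
To make the byte layout in the quoted test comment concrete, here is a sketch of counting present values from DATA_PAGE_V1 definition levels. It follows the hybrid RLE/bit-packed encoding as described in that comment and in the Parquet format documentation; `BitWidth` and `CountPresentValues` are hypothetical names, the sketch assumes max_def_level >= 1, and it is not hardened against hostile input.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: bits needed to store a level in [0, max_level].
inline std::size_t BitWidth(uint32_t max_level) {
    std::size_t width = 0;
    for (; max_level != 0; max_level >>= 1) ++width;
    return width;
}

// Hypothetical helper: counts levels equal to max_def_level ("present" values).
// level_bytes is [u32 little-endian payload_len][hybrid RLE/bit-packed payload];
// num_values is the total number of levels, taken from the page header.
inline std::size_t CountPresentValues(const std::vector<uint8_t>& level_bytes,
                                      std::size_t num_values, uint32_t max_def_level) {
    if (max_def_level == 0) throw std::invalid_argument("sketch expects max_def_level >= 1");
    if (level_bytes.size() < 4) throw std::runtime_error("missing length prefix");
    const std::size_t payload_len =
        static_cast<uint32_t>(level_bytes[0]) | (static_cast<uint32_t>(level_bytes[1]) << 8) |
        (static_cast<uint32_t>(level_bytes[2]) << 16) | (static_cast<uint32_t>(level_bytes[3]) << 24);
    if (level_bytes.size() - 4 < payload_len) throw std::runtime_error("truncated payload");

    const uint8_t* p = level_bytes.data() + 4;
    const uint8_t* const end = p + payload_len;
    const std::size_t bit_width = BitWidth(max_def_level);
    const std::size_t value_bytes = (bit_width + 7) / 8;
    std::size_t seen = 0;
    std::size_t present = 0;

    while (seen < num_values && p < end) {
        // The run header is an unsigned varint: 7 payload bits per byte, low group first.
        uint64_t header = 0;
        for (unsigned shift = 0;; shift += 7) {
            if (p == end || shift > 56) throw std::runtime_error("bad run header");
            const uint8_t byte = *p++;
            header |= static_cast<uint64_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) break;
        }
        const std::size_t count = static_cast<std::size_t>(header >> 1);
        if ((header & 1) == 0) {
            // RLE run: `count` repetitions of one value stored in value_bytes bytes.
            if (static_cast<std::size_t>(end - p) < value_bytes) throw std::runtime_error("truncated RLE run");
            uint32_t value = 0;
            for (std::size_t i = 0; i < value_bytes; ++i) {
                value |= static_cast<uint32_t>(*p++) << (8 * i);
            }
            const std::size_t take = std::min(count, num_values - seen);
            if (value == max_def_level) present += take;
            seen += take;
        } else {
            // Bit-packed run: `count` groups of 8 values, bit_width bits each, LSB first.
            const std::size_t run_bytes = count * bit_width;
            if (static_cast<std::size_t>(end - p) < run_bytes) throw std::runtime_error("truncated bit-packed run");
            // The last group may carry padded trailing values, so stop at num_values.
            const std::size_t take = std::min(count * 8, num_values - seen);
            for (std::size_t i = 0; i < take; ++i) {
                uint32_t value = 0;
                for (std::size_t b = 0; b < bit_width; ++b) {
                    const std::size_t bit = i * bit_width + b;
                    value |= static_cast<uint32_t>((p[bit / 8] >> (bit % 8)) & 1u) << b;
                }
                if (value == max_def_level) ++present;
            }
            p += run_bytes;
            seen += take;
        }
    }
    if (seen != num_values) throw std::runtime_error("fewer levels than num_values");
    return present;
}
```

For the bytes in the test comment (length prefix 2, then 0x06 0x01) with num_values = 3 and max_def_level = 1, the sketch reads one RLE run of length 3 with value 1 and counts 3 present values. Capping each run at num_values is the same idea as the padded-trailing-values fix mentioned in the commit list.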

…ow-compression ratio for encrypted payloads). - Small updates from code review.

avalerio-tkd

Thanks @argmarco-tkd for the review. Added the optimization to remove the Compress/Decompress call to the end of the sequencer (the one discussed earlier today)

Could you PTAL?

avalerio-tkd · 2026-03-16T20:23:05Z

        auto [level_bytes, value_bytes] = Split(decompressed_bytes, leading_bytes_to_strip);
-        return LevelAndValueBytes{std::move(level_bytes), std::move(value_bytes)};
+
+        // For DATA_PAGE_V1, calculate num_elements by parsing the level bytes.


Added comment to compare to v2. Also expanded comment on v2.

avalerio-tkd · 2026-03-16T21:36:47Z

+        {0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57},
+    };
+
+    std::vector<uint8_t> bytes;


This is because expected is the list of elements (like a list of int32 values, for example), while bytes is the assembled buffer we pass to the TypedBuffer to parse. It may look redundant for the fixed-byte-array case, but it isn't: one is a vector of vectors, the other is a flattened vector.

argmarco-tkd

Thank you for all the changes! Overall LGTM. Left a few comments, but none which need a new PR when/if addressed.

argmarco-tkd · 2026-03-17T00:12:17Z

+        // Encrypted payloads mostly have a low-compression ratio, so the gains in size from compression are minimal or negative.
+        // Therefore, the final joined encrypted bytes are returned as-is without compression.


nit: this comment may end up generating more questions than answering them (i.e. the reviewers of the final version may not have context of the version where we did have compression). I'd suggest removing or rephrasing

Ok, removing it then.

argmarco-tkd · 2026-03-17T00:21:36Z

 template <class Codec>
 ByteBuffer<Codec>::ByteBuffer(
    tcb::span<const uint8_t> elements_span,
+    size_t num_elements,


Thanks. It helps! I'd just say it the **actual** payload count mismatches...

avalerio-tkd

Thanks. Addressed the num_elements comment, the element_iterator_current_ptr_ comment.

Also removed the compression/decompression comment at the end of the sequencer per your suggestion.

Merging..

avalerio-tkd added 16 commits March 10, 2026 16:21

TEMPORARY BRANCH FOR ZERO-COPY VERSION PERFORMANCE DEBUGGING

1ddbed9

- Timer switched to nanoseconds

a997cce

- Small optimization in GetRawElement for fixed-size elements

f52a9e1

- Made various functions inline to avoid extra function calls

Merge branch 'main' into av_typelist_optimizing_079

dec9c18

- Merging conflicts.

- Updating to use GetWritableRawElement in the BasicXorEncryptor.

2d551a2

- Marked various functions as inline - Switched to use XorEncryptInto

- Restored one lost comment in the BasicXorEncryptor.cpp

99d8c34

- Adding streamlined iterator for typed buffers.

91b76db

- Fixing unittests for streamlined iterator.

fb0566f

- Optimizing GetWritableRawElement for variable-size elements.

971903a

- small performance improvements.

- Pushing small cleanups before pushing the Parquet-based num_element…

097d919

…s optimization.

- Fixing issue with empty strings on iterator.

6d8944d

- Fixed corner case in Parquet page v1 that didn't account for padded…

ccfb079

… trailing values in bit-packed runs.

- Added StringFixedSizedCodec and StringVariableSizedCodec to typed_b…

af827a0

…uffer_testing_codecs.h for testing border cases. - Added StringToBytes helper function to bytes_utils.h for testing.

- Removing macOS hidden files.

55c9baa

avalerio-tkd changed the base branch from main to av_typelist_optimizing_079 March 13, 2026 21:47

avalerio-tkd changed the base branch from av_typelist_optimizing_079 to main March 13, 2026 21:47

avalerio-tkd changed the base branch from main to av_typelist_optimizing_079 March 13, 2026 21:47

avalerio-tkd changed the base branch from av_typelist_optimizing_079 to main March 13, 2026 21:52

avalerio-tkd requested review from argmarco-tkd and sofia-tekdatum March 13, 2026 22:05

argmarco-tkd reviewed Mar 16, 2026

- Removing Compression call from encryption sequencer final result (l…

3ad364d

…ow-compression ratio for encrypted payloads). - Small updates from code review.

avalerio-tkd commented Mar 16, 2026

- Comment update.

f7c57e4

argmarco-tkd approved these changes Mar 17, 2026

avalerio-tkd commented Mar 17, 2026

- Updating comments after code review.

145f85c

avalerio-tkd merged commit 63c4c8b into main Mar 17, 2026
2 checks passed

avalerio-tkd mentioned this pull request Mar 19, 2026

Optimize memory buffers on DBPS EncryptionSequencer libraries #218

Closed

avalerio-tkd deleted the av_typelist_optimizing_083 branch April 8, 2026 16:52
