Skip to content

Multiple changes to optimize and cleanup typed_buffers, BasicXorEncryptor, and Sequencer#231

Merged
avalerio-tkd merged 19 commits into
mainfrom
av_typelist_optimizing_083
Mar 17, 2026
Merged

Multiple changes to optimize and cleanup typed_buffers, BasicXorEncryptor, and Sequencer#231
avalerio-tkd merged 19 commits into
mainfrom
av_typelist_optimizing_083

Conversation

@avalerio-tkd

@avalerio-tkd avalerio-tkd commented Mar 13, 2026

Copy link
Copy Markdown
Collaborator

General haircut
_ Inline methods, cache sizes of vectors/spans to avoid xx.size() calls.

Setup
_ Read num_elements from Parquet metadata, not costly calculating from payload.

Sequencer
_ Removed compress/decompress of encrypted result (minimal or negative size gains)

Loop-read
_ Simplified iterator to read spans from input data (zero-copy)

Loop-xor
_ Make xor encryption with char*

Loop-write
_ GetWritableSpan to encrypt in-place
_ GetWritableSpan with single resize and pointers

- Made various functions inline to avoid extra function calls
- Marked various functions as inline
- Switched to ues XorEncryptInto
…aders+level bytes without actually reading the value bytes.

- num_elements for DICTIONARY_PAGE and DATA_PAGE_V2 pages read from the headers values directly (easy)
- num_elements for DATA_PAGE_V1 pages read from the level bytes and RLE bit decoding (hard!)
- Added unittests for RLE bit decoding and count present values from definition levels for DATA_PAGE_V1 pages.
- Updated all scripts and unittests for new Parquet parsing requirements.
…XorEncryptor.

- Now num_elements is an invariant / constant all along TypedValuesBuffer and BasicXorEncryptor (Yay!)
- Fixed issue on iterator to account for prefix size on input buffers.
- Many unittest fixes for interfaces updates for num_elements.
…uffer_testing_codecs.h for testing border cases.

- Added StringToBytes helper function to bytes_utils.h for testing.
@avalerio-tkd avalerio-tkd changed the base branch from main to av_typelist_optimizing_079 March 13, 2026 21:47
@avalerio-tkd avalerio-tkd changed the base branch from av_typelist_optimizing_079 to main March 13, 2026 21:47
@avalerio-tkd avalerio-tkd changed the base branch from main to av_typelist_optimizing_079 March 13, 2026 21:47
@avalerio-tkd avalerio-tkd changed the base branch from av_typelist_optimizing_079 to main March 13, 2026 21:52

@argmarco-tkd argmarco-tkd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this. Overall it looks good. For the DataPageV1 related-code I only skimmed (mainly because of non-familiarity with the format). I can take a deeper dive if needed. Left a few comments through the PR.

Comment thread src/common/bytes_utils.h
Comment on lines +307 to +309
inline std::vector<uint8_t> StringToBytes(const std::string& str) {
return std::vector<uint8_t>(str.begin(), str.end());
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have a test for this function.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

constexpr bool is_fixed = TypedBuffer::is_fixed_sized;
constexpr size_t prefix_length = is_fixed ? kFixedHeaderLength : kVariableHeaderLength;

// GetNumElements is read from Parquet metadata and level bytes, not calculated from payload.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: are we introducing Parquet concepts into the Encryptor intentionally?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.. We're not adding Parquet concepts. This was a note-to-self as a reminder of the optimization of num_elements being read from the Parquet metadata instead of reading the payload.

Comment on lines +155 to +157
element_iterator_current_ptr_(
elements_span.data() + std::min(prefix_size, elements_span_size_)),
element_iterator_end_ptr_(elements_span.data() + elements_span_size_),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some comments to help understand this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments. Let me know if those are sufficient. Ideally I would have put these in the constructor's body, but if moved there, then those can't be const (become mutable) so preferred to keep them here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. it help. for clarity, I think lin e 165 should read **calculated as** start of the span + size

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much better. Updated the comment. Thanks.

"Malformed fixed-size buffer: computed payload size does not match readable size");

// Check if the num_elements passed at contruction time coincides with the calculated from the payload size.
const size_t num_elements_on_payload = readable_size / element_size_;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there risk of this not being a pure integer division?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking on that. The result is always a correct integer result because of the "modulo" check above (no remainder on the division). Added an inline comment for this.

@@ -246,52 +212,54 @@ void ByteBuffer<Codec>::InitializeFromSpan() const {
// Variable-size layout stores [u32 size][element value] back-to-back.
// Single pass validates shape and captures per-element prefix offsets.
offsets_.clear();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a note: vector.clear() is not a zero-cost operation, as each element is destroyed and the vector is 'resized' to zero. In this case, because the vectors are essentially byte arrays, the cost of calling clear may be very small (but non zero).

@avalerio-tkd avalerio-tkd Mar 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. In this case, since this is the initialization function (called once during lazy eval) and the line is called once, I preferred to guarantee a clean state and swipe the vector first. The cost should be very small since the vector should come clean anyway.

Comment on lines +982 to +983
RawBytesFixedSizedBuffer buffer(
tcb::span<const uint8_t>(bytes), expected.size(), 0u, RawBytesFixedSizedCodec{kElementSize});

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shouldn't it be bytes.size() instead of expected.size() (assuming we keep using bytes, that is)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, expected.size() gives us the number of elements, the param needed for the TypedBuffer constructor. bytes.size() would give the length of the flattened raw buffer, which for this parameter would be incorrect.

Comment thread src/processing/encryptors/basic_xor_encryptor.cpp
if (value.size() != write_span.size()) {
throw InvalidInputException("Encode: value size does not match write_span size");
}
std::memcpy(write_span.data(), value.data(), write_span.size());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a "string" encoder. Don't we need to zero-terminate the "encoded" version?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 2-checked the cpp docs and I was initially wrong that cpp strings should be zero-terminated. Strings (and string views) don't need that.

if (value.size() != write_span.size()) {
throw InvalidInputException("Encode: value size does not match write_span size");
}
std::memcpy(write_span.data(), value.data(), write_span.size());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above - doesn't this need to be zero-terminated?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. And thanks very much for checking on these details!

Comment on lines +625 to +630
// Build a minimal valid DATA_PAGE_V1 nullable payload (max_def_level=1, max_rep_level=0):
// level bytes layout is [u32 def_payload_len][def_payload].
// def_payload is hybrid RLE/bit-packed:
// - 0x06 = varint header for an RLE run (LSB=0), run_len = 0x06 >> 1 = 3
// - 0x01 = repeated definition level value (bit_width=1, value=1 => "present")
// This yields 3 definition levels, all present, matching the 3 fixed-length values below.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This testing makes a lot of sense here (in the remote_testapp) as a sort of integ test - however, this made me think that we need similar tests for the sequencer (I took a quick look and did not find them) - after all, the sequencer is the main execution path for both the remote and local versions of the app.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This is a great observation. This happened when we initially were parsing only DICTIONARY_PAGE as a POC and we didn't intend at the time to fully support Parquet page formatting. Later we added more support but didn't come back to the test.

In fairness, there is coverage of this on the Parquet utils level, but not on the outermost encryption_sequencer.

Will add a few tests around that at the encryption_sequencer entry point for this.

…ow-compression ratio for encrypted payloads).

- Small updates from code review.

@avalerio-tkd avalerio-tkd left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @argmarco-tkd for the review. Added the optimization to remove the Compress/Decompress call to the end of the sequencer (the one discussed earlier today)

Could you PTAL?

Comment thread src/common/bytes_utils.h
Comment on lines +307 to +309
inline std::vector<uint8_t> StringToBytes(const std::string& str) {
return std::vector<uint8_t>(str.begin(), str.end());
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread src/processing/encryptors/basic_xor_encryptor.cpp
constexpr bool is_fixed = TypedBuffer::is_fixed_sized;
constexpr size_t prefix_length = is_fixed ? kFixedHeaderLength : kVariableHeaderLength;

// GetNumElements is read from Parquet metadata and level bytes, not calculated from payload.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.. We're not adding Parquet concepts. This was a note-to-self as a reminder of the optimization of num_elements being read from the Parquet metadata instead of reading the payload.

Comment thread src/processing/parquet_utils.cpp
Comment thread src/processing/parquet_utils.cpp Outdated
auto [level_bytes, value_bytes] = Split(decompressed_bytes, leading_bytes_to_strip);
return LevelAndValueBytes{std::move(level_bytes), std::move(value_bytes)};

// For DATA_PAGE_V1, calculate num_elements by parsing the level bytes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment to compare to v2. Also expanded comment on v2.

{0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57},
};

std::vector<uint8_t> bytes;

@avalerio-tkd avalerio-tkd Mar 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because expected is the list of elements (like a list of int32 for example). bytes is the assembled buffer we pass the TypedBuffer to parse. It may look redundant for the fixed-byte-array case, but it isn't. One is a vector-of-vectors, the other is a flattened vector.

Comment on lines +982 to +983
RawBytesFixedSizedBuffer buffer(
tcb::span<const uint8_t>(bytes), expected.size(), 0u, RawBytesFixedSizedCodec{kElementSize});

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, expected.size() gives us the number of elements, the param needed for the TypedBuffer constructor. bytes.size() would give the length of the flattened raw buffer, which for this parameter would be incorrect.

if (value.size() != write_span.size()) {
throw InvalidInputException("Encode: value size does not match write_span size");
}
std::memcpy(write_span.data(), value.data(), write_span.size());

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 2-checked the cpp docs and I was initially wrong that cpp strings should be zero-terminated. Strings (and string views) don't need that.

if (value.size() != write_span.size()) {
throw InvalidInputException("Encode: value size does not match write_span size");
}
std::memcpy(write_span.data(), value.data(), write_span.size());

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. And thanks very much for checking on these details!

Comment on lines +625 to +630
// Build a minimal valid DATA_PAGE_V1 nullable payload (max_def_level=1, max_rep_level=0):
// level bytes layout is [u32 def_payload_len][def_payload].
// def_payload is hybrid RLE/bit-packed:
// - 0x06 = varint header for an RLE run (LSB=0), run_len = 0x06 >> 1 = 3
// - 0x01 = repeated definition level value (bit_width=1, value=1 => "present")
// This yields 3 definition levels, all present, matching the 3 fixed-length values below.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This is a great observation. This happened when we initially were parsing only DICTIONARY_PAGE as a POC and we didn't intend at the time to fully support Parquet page formatting. Later we added more support but didn't come back to the test.

In fairness, there is coverage of this on the Parquet utils level, but not on the outermost encryption_sequencer.

Will add a few tests around that at the encryption_sequencer entry point for this.

@argmarco-tkd argmarco-tkd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the changes! Overall LGTM. Left a few comments, but none which need a new PR when/if addressed.

Comment thread src/processing/encryption_sequencer.cpp Outdated
Comment on lines +151 to +152
// Encrypted payloads mostly have a low-compression ratio, so the gains in size from compression are minimal or negative.
// Therefore, the final joined encrypted bytes are returned as-is without compression.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment may end up generating more questions than answering them (i.e. the reviewers of the final version may not have context of the version where we did have compression). I'd suggest removing or rephrasing

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removing it then.

Comment thread src/processing/parquet_utils.cpp
template <class Codec>
ByteBuffer<Codec>::ByteBuffer(
tcb::span<const uint8_t> elements_span,
size_t num_elements,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It helps! I'd just say it the **actual** payload count mismatches...

Comment on lines +155 to +157
element_iterator_current_ptr_(
elements_span.data() + std::min(prefix_size, elements_span_size_)),
element_iterator_end_ptr_(elements_span.data() + elements_span_size_),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. it help. for clarity, I think lin e 165 should read **calculated as** start of the span + size

@avalerio-tkd avalerio-tkd left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Addressed the num_elements comment, the element_iterator_current_ptr_ comment.

Also removed the compression/decompression comment at the end of the sequencer per your suggestion.

Merging..

Comment thread src/processing/encryption_sequencer.cpp Outdated
Comment on lines +151 to +152
// Encrypted payloads mostly have a low-compression ratio, so the gains in size from compression are minimal or negative.
// Therefore, the final joined encrypted bytes are returned as-is without compression.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removing it then.

@avalerio-tkd avalerio-tkd merged commit 63c4c8b into main Mar 17, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants