Multiple changes to optimize and cleanup typed_buffers, BasicXorEncryptor, and Sequencer#231
Conversation
- Made various functions inline to avoid extra function calls
- Merging conflicts.
- Marked various functions as inline - Switched to ues XorEncryptInto
- small performance improvements.
…aders+level bytes without actually reading the value bytes. - num_elements for DICTIONARY_PAGE and DATA_PAGE_V2 pages read from the headers values directly (easy) - num_elements for DATA_PAGE_V1 pages read from the level bytes and RLE bit decoding (hard!) - Added unittests for RLE bit decoding and count present values from definition levels for DATA_PAGE_V1 pages. - Updated all scripts and unittests for new Parquet parsing requirements.
…XorEncryptor. - Now num_elements is an invariant / constant all along TypedValuesBuffer and BasicXorEncryptor (Yay!) - Fixed issue on iterator to account for prefix size on input buffers. - Many unittest fixes for interfaces updates for num_elements.
… trailing values in bit-packed runs.
…uffer_testing_codecs.h for testing border cases. - Added StringToBytes helper function to bytes_utils.h for testing.
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thank you for this. Overall it looks good. For the DataPageV1 related-code I only skimmed (mainly because of non-familiarity with the format). I can take a deeper dive if needed. Left a few comments through the PR.
| inline std::vector<uint8_t> StringToBytes(const std::string& str) { | ||
| return std::vector<uint8_t>(str.begin(), str.end()); | ||
| } |
There was a problem hiding this comment.
we should have a test for this function.
| constexpr bool is_fixed = TypedBuffer::is_fixed_sized; | ||
| constexpr size_t prefix_length = is_fixed ? kFixedHeaderLength : kVariableHeaderLength; | ||
|
|
||
| // GetNumElements is read from Parquet metadata and level bytes, not calculated from payload. |
There was a problem hiding this comment.
nit: are we introducing Parquet concepts into the Encryptor intentionally?
There was a problem hiding this comment.
Removed the comment.. We're not adding Parquet concepts. This was a note-to-self as a reminder of the optimization of num_elements being read from the Parquet metadata instead of reading the payload.
| element_iterator_current_ptr_( | ||
| elements_span.data() + std::min(prefix_size, elements_span_size_)), | ||
| element_iterator_end_ptr_(elements_span.data() + elements_span_size_), |
There was a problem hiding this comment.
let's add some comments to help understand this.
There was a problem hiding this comment.
Added comments. Let me know if those are sufficient. Ideally I would have put these in the constructor's body, but if moved there, then those can't be const (become mutable) so preferred to keep them here.
There was a problem hiding this comment.
thanks. it help. for clarity, I think lin e 165 should read **calculated as** start of the span + size
There was a problem hiding this comment.
Yes, much better. Updated the comment. Thanks.
| "Malformed fixed-size buffer: computed payload size does not match readable size"); | ||
|
|
||
| // Check if the num_elements passed at contruction time coincides with the calculated from the payload size. | ||
| const size_t num_elements_on_payload = readable_size / element_size_; |
There was a problem hiding this comment.
is there risk of this not being a pure integer division?
There was a problem hiding this comment.
Thanks for checking on that. The result is always a correct integer result because of the "modulo" check above (no remainder on the division). Added an inline comment for this.
| @@ -246,52 +212,54 @@ void ByteBuffer<Codec>::InitializeFromSpan() const { | |||
| // Variable-size layout stores [u32 size][element value] back-to-back. | |||
| // Single pass validates shape and captures per-element prefix offsets. | |||
| offsets_.clear(); | |||
There was a problem hiding this comment.
as a note: vector.clear() is not a zero-cost operation, as each element is destroyed and the vector is 'resized' to zero. In this case, because the vectors are essentially byte arrays, the cost of calling clear may be very small (but non zero).
There was a problem hiding this comment.
You're right. In this case, since this is the initialization function (called once during lazy eval) and the line is called once, I preferred to guarantee a clean state and swipe the vector first. The cost should be very small since the vector should come clean anyway.
| RawBytesFixedSizedBuffer buffer( | ||
| tcb::span<const uint8_t>(bytes), expected.size(), 0u, RawBytesFixedSizedCodec{kElementSize}); |
There was a problem hiding this comment.
nit: shouldn't it be bytes.size() instead of expected.size() (assuming we keep using bytes, that is)
There was a problem hiding this comment.
In this case, expected.size() gives us the number of elements, the param needed for the TypedBuffer constructor. bytes.size() would give the length of the flattened raw buffer, which for this parameter would be incorrect.
| if (value.size() != write_span.size()) { | ||
| throw InvalidInputException("Encode: value size does not match write_span size"); | ||
| } | ||
| std::memcpy(write_span.data(), value.data(), write_span.size()); |
There was a problem hiding this comment.
this is a "string" encoder. Don't we need to zero-terminate the "encoded" version?
There was a problem hiding this comment.
I 2-checked the cpp docs and I was initially wrong that cpp strings should be zero-terminated. Strings (and string views) don't need that.
| if (value.size() != write_span.size()) { | ||
| throw InvalidInputException("Encode: value size does not match write_span size"); | ||
| } | ||
| std::memcpy(write_span.data(), value.data(), write_span.size()); |
There was a problem hiding this comment.
similar to above - doesn't this need to be zero-terminated?
There was a problem hiding this comment.
Same as above. And thanks very much for checking on these details!
| // Build a minimal valid DATA_PAGE_V1 nullable payload (max_def_level=1, max_rep_level=0): | ||
| // level bytes layout is [u32 def_payload_len][def_payload]. | ||
| // def_payload is hybrid RLE/bit-packed: | ||
| // - 0x06 = varint header for an RLE run (LSB=0), run_len = 0x06 >> 1 = 3 | ||
| // - 0x01 = repeated definition level value (bit_width=1, value=1 => "present") | ||
| // This yields 3 definition levels, all present, matching the 3 fixed-length values below. |
There was a problem hiding this comment.
This testing makes a lot of sense here (in the remote_testapp) as a sort of integ test - however, this made me think that we need similar tests for the sequencer (I took a quick look and did not find them) - after all, the sequencer is the main execution path for both the remote and local versions of the app.
There was a problem hiding this comment.
Thanks. This is a great observation. This happened when we initially were parsing only DICTIONARY_PAGE as a POC and we didn't intend at the time to fully support Parquet page formatting. Later we added more support but didn't come back to the test.
In fairness, there is coverage of this on the Parquet utils level, but not on the outermost encryption_sequencer.
Will add a few tests around that at the encryption_sequencer entry point for this.
…ow-compression ratio for encrypted payloads). - Small updates from code review.
avalerio-tkd
left a comment
There was a problem hiding this comment.
Thanks @argmarco-tkd for the review. Added the optimization to remove the Compress/Decompress call to the end of the sequencer (the one discussed earlier today)
Could you PTAL?
| inline std::vector<uint8_t> StringToBytes(const std::string& str) { | ||
| return std::vector<uint8_t>(str.begin(), str.end()); | ||
| } |
| constexpr bool is_fixed = TypedBuffer::is_fixed_sized; | ||
| constexpr size_t prefix_length = is_fixed ? kFixedHeaderLength : kVariableHeaderLength; | ||
|
|
||
| // GetNumElements is read from Parquet metadata and level bytes, not calculated from payload. |
There was a problem hiding this comment.
Removed the comment.. We're not adding Parquet concepts. This was a note-to-self as a reminder of the optimization of num_elements being read from the Parquet metadata instead of reading the payload.
| auto [level_bytes, value_bytes] = Split(decompressed_bytes, leading_bytes_to_strip); | ||
| return LevelAndValueBytes{std::move(level_bytes), std::move(value_bytes)}; | ||
|
|
||
| // For DATA_PAGE_V1, calculate num_elements by parsing the level bytes. |
There was a problem hiding this comment.
Added comment to compare to v2. Also expanded comment on v2.
| {0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57}, | ||
| }; | ||
|
|
||
| std::vector<uint8_t> bytes; |
There was a problem hiding this comment.
This is because expected is the list of elements (like a list of int32 for example). bytes is the assembled buffer we pass the TypedBuffer to parse. It may look redundant for the fixed-byte-array case, but it isn't. One is a vector-of-vectors, the other is a flattened vector.
| RawBytesFixedSizedBuffer buffer( | ||
| tcb::span<const uint8_t>(bytes), expected.size(), 0u, RawBytesFixedSizedCodec{kElementSize}); |
There was a problem hiding this comment.
In this case, expected.size() gives us the number of elements, the param needed for the TypedBuffer constructor. bytes.size() would give the length of the flattened raw buffer, which for this parameter would be incorrect.
| if (value.size() != write_span.size()) { | ||
| throw InvalidInputException("Encode: value size does not match write_span size"); | ||
| } | ||
| std::memcpy(write_span.data(), value.data(), write_span.size()); |
There was a problem hiding this comment.
I 2-checked the cpp docs and I was initially wrong that cpp strings should be zero-terminated. Strings (and string views) don't need that.
| if (value.size() != write_span.size()) { | ||
| throw InvalidInputException("Encode: value size does not match write_span size"); | ||
| } | ||
| std::memcpy(write_span.data(), value.data(), write_span.size()); |
There was a problem hiding this comment.
Same as above. And thanks very much for checking on these details!
| // Build a minimal valid DATA_PAGE_V1 nullable payload (max_def_level=1, max_rep_level=0): | ||
| // level bytes layout is [u32 def_payload_len][def_payload]. | ||
| // def_payload is hybrid RLE/bit-packed: | ||
| // - 0x06 = varint header for an RLE run (LSB=0), run_len = 0x06 >> 1 = 3 | ||
| // - 0x01 = repeated definition level value (bit_width=1, value=1 => "present") | ||
| // This yields 3 definition levels, all present, matching the 3 fixed-length values below. |
There was a problem hiding this comment.
Thanks. This is a great observation. This happened when we initially were parsing only DICTIONARY_PAGE as a POC and we didn't intend at the time to fully support Parquet page formatting. Later we added more support but didn't come back to the test.
In fairness, there is coverage of this on the Parquet utils level, but not on the outermost encryption_sequencer.
Will add a few tests around that at the encryption_sequencer entry point for this.
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thank you for all the changes! Overall LGTM. Left a few comments, but none which need a new PR when/if addressed.
| // Encrypted payloads mostly have a low-compression ratio, so the gains in size from compression are minimal or negative. | ||
| // Therefore, the final joined encrypted bytes are returned as-is without compression. |
There was a problem hiding this comment.
nit: this comment may end up generating more questions than answering them (i.e. the reviewers of the final version may not have context of the version where we did have compression). I'd suggest removing or rephrasing
There was a problem hiding this comment.
Ok, removing it then.
| template <class Codec> | ||
| ByteBuffer<Codec>::ByteBuffer( | ||
| tcb::span<const uint8_t> elements_span, | ||
| size_t num_elements, |
There was a problem hiding this comment.
Thanks. It helps! I'd just say it the **actual** payload count mismatches...
| element_iterator_current_ptr_( | ||
| elements_span.data() + std::min(prefix_size, elements_span_size_)), | ||
| element_iterator_end_ptr_(elements_span.data() + elements_span_size_), |
There was a problem hiding this comment.
thanks. it help. for clarity, I think lin e 165 should read **calculated as** start of the span + size
avalerio-tkd
left a comment
There was a problem hiding this comment.
Thanks. Addressed the num_elements comment, the element_iterator_current_ptr_ comment.
Also removed the compression/decompression comment at the end of the sequencer per your suggestion.
Merging..
| // Encrypted payloads mostly have a low-compression ratio, so the gains in size from compression are minimal or negative. | ||
| // Therefore, the final joined encrypted bytes are returned as-is without compression. |
There was a problem hiding this comment.
Ok, removing it then.
General haircut
_ Inline methods, cache sizes of vectors/spans to avoid xx.size() calls.
Setup
_ Read num_elements from Parquet metadata, not costly calculating from payload.
Sequencer
_ Removed compress/decompress of encrypted result (minimal or negative size gains)
Loop-read
_ Simplified iterator to read spans from input data (zero-copy)
Loop-xor
_ Make xor encryption with char*
Loop-write
_ GetWritableSpan to encrypt in-place
_ GetWritableSpan with single resize and pointers