[CELEBORN-2329][CIP22] Encryption at Rest Spark Impl#3689
Conversation
|
+CC @mridulm @rmcyang @FMX @SteNicholas PTAL, thanks. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3689 +/- ##
==========================================
+ Coverage 66.91% 66.98% +0.08%
==========================================
Files 358 359 +1
Lines 21986 22218 +232
Branches 1946 1968 +22
==========================================
+ Hits 14710 14881 +171
- Misses 6262 6315 +53
- Partials 1014 1022 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds Spark-side “encryption at rest” support by introducing a pluggable crypto interface in the Celeborn client, wiring encryption into push paths and decryption into fetch/read paths, and providing a Spark-backed implementation that uses Spark’s IO encryption settings/key.
Changes:
- Introduces
CryptoHandlerand threads an optional handler throughShuffleClient/ShuffleClientImplinto push (encrypt) and read (decrypt) code paths. - Adds Spark implementation (
SparkCryptoHandler) + plumbing (SparkCommonUtils.getCryptoHandler, Spark shuffle readers/managers updated to pass the handler). - Updates tests and shaded artifacts/dependencies to include the needed crypto runtime.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| client/src/test/java/org/apache/celeborn/client/read/CelebornInputStreamPeerFailoverTest.java | Updates test calls to new CelebornInputStream.create signature with optional crypto handler. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java | Encrypts outgoing shuffle payloads when crypto is enabled; passes handler into partition reads. |
| client/src/main/java/org/apache/celeborn/client/ShuffleClient.java | Adds ShuffleClient.get(...) overload accepting an optional crypto handler; exposes setup hook. |
| client/src/main/java/org/apache/celeborn/client/security/CryptoHandler.java | New interface for encrypt/decrypt on byte arrays. |
| client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java | Decrypts incoming batches before optional decompression and consumption. |
| client/src/main/java/org/apache/celeborn/client/DummyShuffleClient.java | Adds no-op crypto handler setup for dummy client. |
| client/pom.xml | Adds commons-crypto dependency to client module. |
| client-spark/spark-3/src/test/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReaderSuite.scala | Updates static mock to match new ShuffleClient.get signature. |
| client-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala | Extends constructors/plumbing to carry optional crypto handler to ShuffleClient. |
| client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java | Updates reflective columnar shuffle reader construction to pass optional crypto handler. |
| client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java | Passes Spark-derived crypto handler into writers/readers. |
| client-spark/spark-3-shaded/pom.xml | Ensures commons-crypto is included in shaded Spark 3 artifact. |
| client-spark/spark-3-columnar-shuffle/src/test/scala/org/apache/spark/shuffle/celeborn/CelebornColumnarShuffleReaderSuite.scala | Updates tests to pass Optional.empty() crypto handler. |
| client-spark/spark-3-columnar-shuffle/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornColumnarShuffleReader.scala | Threads optional crypto handler through columnar reader. |
| client-spark/spark-2/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala | Threads optional crypto handler through Spark 2 reader. |
| client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java | Passes Spark-derived crypto handler into Spark 2 writers/readers. |
| client-spark/spark-2-shaded/pom.xml | Ensures commons-crypto is included in shaded Spark 2 artifact. |
| client-spark/common/src/test/java/org/apache/spark/shuffle/celeborn/SparkCryptoHandlerSuiteJ.java | Adds unit tests for Spark crypto handler behavior. |
| client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SparkCryptoHandler.java | Implements CryptoHandler using Spark CryptoStreamUtils. |
| client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SparkCommonUtils.java | Derives optional crypto handler from Spark config/env IO encryption key. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Read and optionally decrypt data into the appropriate buffer | ||
| if (cryptoHandler.isPresent()) { | ||
| if (size > encryptedBuf.length) { | ||
| encryptedBuf = new byte[size]; | ||
| } | ||
| currentChunk.readBytes(encryptedBuf, 0, size); | ||
| byte[] decrypted = cryptoHandler.get().decrypt(encryptedBuf, 0, size); | ||
| logger.debug( | ||
| "Decrypted shuffle data for shuffle {} partition {}: {} bytes -> {} bytes.", | ||
| shuffleId, | ||
| partitionId, | ||
| size, | ||
| decrypted.length); | ||
| size = decrypted.length; | ||
| if (shouldDecompress) { | ||
| compressedBuf = decrypted; | ||
| } else { | ||
| rawDataBuf = decrypted; | ||
| } |
| ByteArrayInputStream bais = new ByteArrayInputStream(input, offset, length); | ||
| DataInputStream dis = new DataInputStream(bais); | ||
| int decryptedLength = dis.readInt(); | ||
| if (decryptedLength < 0) { | ||
| throw new IOException( | ||
| "Invalid decrypted length: " + decryptedLength + ", encrypted length: " + length); | ||
| } | ||
| try (DataInputStream cis = | ||
| new DataInputStream(CryptoStreamUtils.createCryptoInputStream(dis, sparkConf, key))) { | ||
| byte[] decrypted = new byte[decryptedLength]; | ||
| cis.readFully(decrypted); | ||
| return decrypted; |
| @Test | ||
| public void testEncryptWithOffset() throws IOException { | ||
| byte[] actual = "offset test data".getBytes(); | ||
| byte[] padded = Arrays.copyOf(actual, actual.length + 20); | ||
|
|
||
| byte[] encrypted = handler.encrypt(padded, 0, actual.length); | ||
| byte[] decrypted = handler.decrypt(encrypted, 0, encrypted.length); | ||
|
|
||
| assertArrayEquals(actual, decrypted); | ||
| } |
PR Review: [CELEBORN-2329][CIP22] Encryption at Rest Spark ImplOverall: Good architecture, some performance and safety concerns. The encrypt-at-client / decrypt-at-client approach is clean — workers store ciphertext, no key distribution needed to workers. Compress-then-encrypt ordering is correct. Performance Concerns1. Heavy allocation on hot path — every batch allocates multiple objects
For a typical shuffle with thousands of batches per second per executor, this generates significant GC pressure. Consider:
Moderate Issues2. Non-Spark engines (Flink, MR) using the Celeborn client now pull in 3. SparkCommonUtils.getCryptoHandler(conf) // called in getWriter
SparkCommonUtils.getCryptoHandler(conf) // called in getReader (multiple times for columnar path)Each call accesses Minor / Nits4. No AES-GCM / integrity protection Spark's 5. Test coverage is good — roundtrip, wrong key, large data, empty data, offset. Could add a test for corrupted ciphertext (flip a byte in encrypted output, verify decrypt fails or produces wrong output). 6. Debug logging on every batch logger.debug("Encrypted shuffle data for shuffle {} ...", shuffleId, ...);
logger.debug("Decrypted shuffle data for shuffle {} ...", shuffleId, ...);Even with debug disabled, the varargs boxing and string formatting preparation happens. On a hot path with millions of batches, this adds up. Consider guarding with Reviewed with Claude Code |
|
I will address the above comments in 2 weeks, I will be on vacation. |
What changes were proposed in this pull request?
Adds EAR support for Spark side.
See more details in doc.
Why are the changes needed?
See more details in doc.
Does this PR resolve a correctness bug?
no
Does this PR introduce any user-facing change?
no
How was this patch tested?
unit tests and tested in production internally.