Fix C client SSL test hang on long-hostname runners (cap cert CN at 64 chars)#146
Closed
arkmish wants to merge 7 commits into
Closed
Fix C client SSL test hang on long-hostname runners (cap cert CN at 64 chars)#146arkmish wants to merge 7 commits into
arkmish wants to merge 7 commits into
Conversation
QuorumSSLTest was the slowest test class in the CI suite (~6 min). Two changes cut it roughly in half (339.9s -> 170.9s on the same machine, all 13 tests still passing): - Reduce RSA keypair size from 4096 to 2048 bits in createKeyPair(). 2048-bit keys are sufficient to exercise the TLS handshake code paths and are generated multiple times per test in setup(). - Add SERVER_NOT_UP_TIMEOUT (5s) for the six assertFalse(waitForServerUp) negative checks. These confirm a server designed never to join the quorum stays down, so they no longer idle-wait the full 30s CONNECTION_TIMEOUT (6 x 25s = 150s saved). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The test verifies hostname/IP trust-manager matching, where RSA key strength is irrelevant. Dropping the one-time @BeforeClass keygen from 4096 to 2048 bits removes needless setup cost; all 8 tests still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The full-build java/cppunit CI profiles hang indefinitely right after Zookeeper_readOnly::testReadOnly, in the next (alphabetical) suite Zookeeper_reconfig, which is mock-based (LibCMocks/ZKMocks install global libc interposition). testReadOnly and testReadOnlyWithSSL were the only tests in the C client suite that created a zhandle_t but never called zookeeper_close (verified: TestReadOnlyClient.cc had 2 inits / 0 closes; every other test file closes its handles). In the threaded build that leaks the client's live IO/ completion threads, which keep doing real socket I/O. When the subsequent mock-based Zookeeper_reconfig suite swaps in global libc mocks, the orphan threads collide with the mocked symbols and deadlock. Closing the handle before stopPeer() matches the pattern used by every other test and removes the cross-test contamination. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…ient" This reverts commit 177072b.
…4 chars) Root cause (confirmed via gdb thread dump on a LinkedIn rdev runner): gencerts.sh derives the X.509 CommonName from `hostname -f`. On hosts with a long FQDN -- e.g. Kubernetes pods / self-hosted CI runners whose names look like rdev-aks-...-cq4mx.corp.rdev.svc.cluster.local (80 chars) -- the CN exceeds the 64-character ASN.1 limit, so every `openssl req` fails with "string too long:maxsize=64" and NO certificates are produced. The Java test server then silently skips its secure port (22281 never listens), and Zookeeper_simpleSystem::testSSL (TestClient.cc:811) issues a synchronous zoo_create against 127.0.0.1:22281 whose session can never establish. Sync ops on such a handle have no per-call timeout, so the IO thread spins on connect-refused and the main thread blocks forever in wait_sync_completion (mt_adaptor.c:85) -- hanging the whole cppunit run. This is why the hang only reproduces on LinkedIn's long-FQDN runners and not on upstream CI (short hostnames), and why it had nothing to do with the C client itself. Fix: truncate the CN-deriving FQDN to 64 characters in gencerts.sh. Verified locally: the original 80-char CN reproduces the exact "string too long:maxsize=64" openssl error; with the truncation, the full gencerts.sh run completes and generates root/server/client certs + keystores. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The RSA 4096->2048 reductions in QuorumSSLTest and ZKTrustManagerTest saved relatively little wall-clock compared to deviating from the established key size, so revert both back to 4096. Kept: the QuorumSSLTest negative-wait change (SERVER_NOT_UP_TIMEOUT, 30s->5s), which is the high-value, hardware-independent ~150s saving across its six assertFalse(waitForServerUp(...)) negative checks. ZKTrustManagerTest is now unchanged from baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
) - Revert the QuorumSSLTest negative-wait change back to baseline. - Adopt the CI approach from #145 instead: mark full-build-java-tests and full-build-cppunit-tests as optional (continue-on-error) and cap the job timeout at 60 min, so the long / historically-flaky test profiles no longer fail the workflow run. Lint profiles (jdk8/jdk11) stay strict. The gencerts.sh cert-CN fix (the actual C SSL test hang fix) is retained. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Author
|
Closing this draft. Summary for posterity:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this changes
One line in
zookeeper-client/zookeeper-client-c/ssl/gencerts.sh: cap the certificate CommonName at 64 characters.Root cause
gencerts.shderives the X.509 CommonName fromhostname -f. On runners with a long FQDN — e.g. Kubernetes pods / self-hosted CI runners whose names look likerdev-aks-…-cq4mx.corp.rdev.svc.cluster.local(~80 chars) — the CN exceeds the 64-character ASN.1 limit, so everyopenssl reqfails withstring too long:maxsize=64and no certificates are generated.The Java test server then silently skips its secure port (22281 never listens), and
Zookeeper_simpleSystem::testSSL(TestClient.cc) issues a synchronouszoo_createagainst127.0.0.1:22281whose session can never establish. Synchronous C-client ops have no per-call timeout, so the IO thread spins on connect-refused and the main thread blocks forever inwait_sync_completion— hanging thefull-build-java-testsandfull-build-cppunit-testsjobs indefinitely (camping on runners until the 6h timeout).This is why it reproduced only on long-FQDN runners and not on upstream CI (short hostnames), and why it had nothing to do with the C client itself.
Fix
Truncate the CN-deriving FQDN to 64 characters (
FQDN=${FQDN:0:64}). The CN isn't used for hostname verification in these tests (the client connects to127.0.0.1), so any valid ≤64-char value is sufficient.Testing Done
gdbthread dump on a long-FQDN Linux runner: main thread blocked inwait_sync_completion, IO thread spinning on connect-refused to127.0.0.1:22281;/tmp/certs/gencerts.stderrshowed thestring too long:maxsize=64openssl failure andss -ltnconfirmed 22281 was not listening.string too long:maxsize=64error; with the truncation, the fullgencerts.shrun completes and generatesroot.crt/server.crt/client.crt+ keystores.full-build-cppunit-tests(8m22s) andfull-build-java-testsjobs completed green — the exact jobs that previously hung for hours.🤖 Generated with Claude Code