Fix ports.isEmpty() reliability gap: make request correlation survive scheduler hops#313
Fix ports.isEmpty() reliability gap: make request correlation survive scheduler hops#313Mishenevd wants to merge 3 commits into
Conversation
… scheduler hops PendingHostnamesStore was ThreadLocal, so a WebClient call's registered intent could get lost whenever intent and connect ran on different OS threads - either an app's own .publishOn() or Reactor Netty's internal event-loop dispatch. Confirmed live: an SSRF request to a private IP went out completely unblocked when preceded by a scheduler hop. Made PendingHostnamesStore a global bounded-LRU map instead (so it survives any thread hop), each entry now also carrying the ContextObject captured at registration time, bridged through Reactor's own Context (not ours) via a new reflection-only helper so it works without needing reactor-core as a hard runtime dependency. Also fixes a latent bug this surfaced: two wrappers located URLCollector.report via name-only reflection, which silently broke once a second overload existed.
|
|
||
| private ReactorAikidoContext() {} | ||
|
|
||
| // `mono` is a Mono<Void>, returned Object is that same Mono<Void> wrapped with .contextWrite(). |
There was a problem hiding this comment.
ReactorAikidoContext uses extensive reflection, dynamic Proxy, and classloader-based lookups (write, deferRegisterUrl, RegisterUrlHandler.invoke), which obscures control flow and behavior and hinders auditing.
Details
✨ AI Reasoning
The new helper uses runtime classloader lookups (Class.forName), reflection to call methods (getMethod/invoke), and a dynamic Proxy/InvocationHandler that inspects and invokes on opaque Objects passed from the target app's reactor classes. These constructs make the control flow and data flow non-obvious to code reviewers and static analysis tools and can be used to hide or obfuscate behavior. The code attempts to bridge reactor types without a compile-time dependency by keeping everything Object-typed and reflective, which increases opacity. Specifically, the write() and deferRegisterUrl() methods and the RegisterUrlHandler.invoke() perform many reflective operations and dynamically call into target-context objects, which can be used to execute logic that is hard to audit.
🔧 How do I fix it?
Ensure code is transparent and not intentionally obfuscated. Avoid hiding functionality from code review. Focus on intent and deception, not specific patterns.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
| try { | ||
| Method getOrDefault = ctxView.getClass().getMethod("getOrDefault", Object.class, Object.class); | ||
| context = (ContextObject) getOrDefault.invoke(ctxView, KEY, null); | ||
| } catch (Throwable ignored) { |
There was a problem hiding this comment.
Empty catch in RegisterUrlHandler.invoke swallowing Throwable; add logging or explicit handling so context extraction/reporting failures aren't silently ignored.
Details
✨ AI Reasoning
A new private InvocationHandler implementation swallows all Throwables with an empty catch block. This handler runs during Reactor context extraction and URL registration, so silent failures could hide problems in request correlation. The empty catch neither logs nor documents why errors are ignored, making debugging harder and potentially masking bugs in context retrieval or URL reporting.
🔧 How do I fix it?
Add proper error handling in catch blocks. Log the error, show user feedback, or rethrow if needed.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
| * application's classpath. Must be public: the woven target class (in a completely different | ||
| * package) needs to call into it directly. | ||
| */ | ||
| public final class ReactorAikidoContext { |
There was a problem hiding this comment.
ReactorAikidoContext mixes Reactor reflection/plumbing with URLCollector.report call; split reactor-context bridging from URL registration logic.
Details
✨ AI Reasoning
The new ReactorAikidoContext both performs low-level Reactor reflection plumbing (constructing Context objects, creating Function proxies, invoking contextWrite/deferContextual) and directly calls URLCollector.report to register URLs. This couples reactor-integration plumbing with application-level collector behavior, increasing responsibility scope and coupling.
🔧 How do I fix it?
Split classes that handle database, HTTP, and UI concerns into separate, focused classes.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
… HttpURLConnectionTest Tried writing one (webClient.get().uri(privateIP)...block(), asserting SSRFException) assuming PendingHostnamesStore going global would make it possible. It doesn't: the taint-context capture only gets written by SpringWebfluxWrapper, which requires a real incoming WebFlux request - a bare WebClient call from a test never populates it, so the SSRF check silently no-ops and a real network call goes out (confirmed empirically, request hung until timeout instead of being blocked).
Hans separately found a customer report involving OkHttpClient and an unknown-port DNS lookup - same shape as this PR's WebFlux scheduler-hop bug, but via OkHttp's own Dispatcher thread pool instead (newCall() registers intent on the calling thread, but call.enqueue() runs the actual connect on a different thread from OkHttp's own pool). Confirmed empirically rather than assumed: this PR's PendingHostnamesStore fix is generic (not Spring/Reactor-specific), so it already covers this case without needing anything OkHttp-specific. Verified the same test fails on the base branch (ports lost across the thread hop, real network call goes out unblocked) and passes here.
Context
Follow-up to #312. Hans's review there flagged
ports.isEmpty()(used to distinguish real requests from infra noise inDNSRecordCollector) as not fully reliable. #312 fixed the one concrete case found at the time (WebClient redirects) and left the general problem as a documented, unverified hypothesis.This PR closes that gap for real:
Context/PendingHostnamesStorewereThreadLocal, which only survives when a WebClient call's "intent" (registering the pending host+port) and its "effect" (the actual socket connect) run on the same OS thread. That holds for the default WebFlux request-handling flow, but breaks under:.publishOn(Schedulers.boundedElastic())before aWebClientcall - a common pattern for mixing blocking JDBC with reactive controllers).reactor-http-niothread) - separate from any app-level scheduler hop.Confirmed via e2e before investing in a fix, not just reasoned about on paper. Added a test endpoint (
/api/request/publish-on,Mono.just(url).publishOn(Schedulers.boundedElastic()).flatMap(...)) targeting the same private IP (169.254.169.254) used elsewhere in #312's e2e tests:SSRF Attack detected due to: 169.254.169.254:80.URLCollectorregistered intent on threadboundedElastic-1,DNSRecordCollector.reportConnect()ran the check on threadreactor-http-nio-5, no SSRF log line at all, the request actually went out over the network (curl hung until timeout).What changed
PendingHostnamesStore.java- rewritten fromThreadLocalto a global,synchronized, bounded-LRU map (same 1000-entry cap and eviction policy as before). Each entry now also carries theContextObjectthat was active when it was registered, not just the ports.Trade-off: two concurrent requests to the same hostname can now share/overwrite each other's entry in a narrow race window. Doesn't open an SSRF bypass -
SSRFDetector/StoredSSRFDetectorstill run unconditionally either way - worst case is a wrong source attribution for that one request. This is a real, demonstrated effect and not just theoretical: it brokeRestTemplateTest/InetAddressTestduring development (turned out to be an unrelated reflection bug this change exposed, see below, now fixed) - worth keeping an eye on if dashboard source attribution ever looks off for high-traffic shared hostnames.Considered an object-identity-keyed alternative (key by the Reactor Netty request-handler object instead of hostname, avoiding the cross-contamination trade-off entirely) but ruled it out: that object isn't reachable from
SpringWebClientWrapper's hook point (ExchangeFunction.exchange(), Spring's own layer, which runs before Reactor Netty constructs its own internal handler for the call).WebRequestCollector.java- removed the per-incoming-requestPendingHostnamesStore.clear()call. It was aThreadLocal-era safety net ("wipe this thread's leftover entries before handling a new request") that becomes actively harmful now that the store is global - it would wipe other concurrent requests' in-flight entries, not just stale ones from a previous request on a reused thread. The bounded-LRU eviction already handles unbounded growth on its own.ReactorAikidoContext.java(new) - bridges the capturedContextObjectthrough Reactor's ownContext(reactor.util.context.Context/ContextView- not our ownContextclass) so it survives.publishOn()between the incoming request and aWebClientcall made while handling it.SpringWebfluxWrapperwrites it via.contextWrite();SpringWebClientWrapperreads it back via.deferContextual()atExchangeFunction.exchange()time and passes it intoPendingHostnamesStorealongside the port.DNSRecordCollector.report()/reportConnect()temporarily restore that captured context around the SSRF check, so it's used regardless of which thread actually ends up processing the connect.Everything in this class goes through reflection rather than a static type reference to
reactor.util.context.*.reactor-coreiscompileOnlyfor this module: unlike@Advice-bound parameter types (which ByteBuddy resolves specially against the woven target's own classloader), a normal helper class that declaresContext/ContextViewas a concrete parameter type fails to verify at all on the agent's own classloader, which has no visibility into the target app's classpath - confirmed by hitting exactly thatNoClassDefFoundErroron the first attempt. A second attempt using lambdas for theFunctioncallbacks hit a different problem: ByteBuddy inlines@Advicebytecode into the target class, so a lambda constructed there becomes a private synthetic method that the target class can't call back into (IllegalAccessError) - same class of bug already hit once in Track and protect WebClient outbound requests, fix private-IP SSRF regression #312 withSpringWebClientRedirectWrapper'sexternalFormhelper. Fixed with a plain,public,Object-typedInvocationHandlerclass instead of a lambda.Considered
io.micrometer:context-propagationfirst (Micrometer'sThreadLocalAccessorAPI, purpose-built for this). Dropped it: it only helps if the target app already happens to have that optional library on its classpath, which can't be required or forced - the agent can't add a hard dependency that isn't guaranteed present. The fix above needs nothing beyondreactor-core, which is guaranteed present whereverWebClientis.Incidental bugfix found while testing this:
HttpURLConnectionWrapperandHttpClientSendWrapperlocateURLCollector.reportvia reflection, matching by method name only (clazz.getMethods()+.getName().equals("report")). Adding the newreport(URL, ContextObject)overload made this ambiguous - depending on JVM method-ordering, it could silently pick the wrong overload, throw anIllegalArgumentExceptionon the argument-count mismatch, and (since both wrappers run withsuppress = Throwable.class) fail completely silently, breaking outbound tracking/SSRF for plainHttpURLConnection/java.net.http.HttpClientusers. Caught byRestTemplateTest/InetAddressTestfailing during development. Fixed by looking up the exact single-argument overload (clazz.getMethod("report", URL.class)) instead of matching by name.e2e scenario tested
Same setup as #312 (real agent, real Spring WebFlux sample app, mock core):
SSRF via WebClient after an explicit scheduler hop, now blocked:
→ HTTP 500,
SSRF Attack detected due to: 169.254.169.254:80,source=query. Log confirms the cross-thread path:URLCollectorregisters intent on threadboundedElastic-1,DNSRecordCollector.reportConnect()runs the check on threadreactor-http-nio-5- different threads, correctly correlated.Also re-ran all of #312's existing e2e scenarios (safe request, literal-IP SSRF, dual-stack, redirect) to confirm no regressions - all still pass.
Tests
New
PendingHostnamesStoreTestcases:testEntryIsVisibleFromADifferentThread- the actual point of making the store global: writes on one thread, reads on another, confirms the entry is visible.testGetContextReturnsWhatWasCapturedAtAddTime/testGetContextIsNullWhenNoneWasCaptured- the captured-ContextObjecthalf of an entry.testUnboundedHostnamesDoNotGrowThreadLocalMapForever→testUnboundedHostnamesDoNotGrowMapForever(no longer thread-local).Full regression pass: 1036 tests, same 42 pre-existing environmental failures (network/DB dependent, unavailable in sandbox) as #312's baseline, zero new failures.
New
OkHttpTest.testSSRFAsyncEnqueueOnDifferentThread- a separate customer report surfaced the same class of bug viaOkHttpClient:newCall()registers intent on the calling thread, butcall.enqueue()(async) runs the actual connect on OkHttp's own Dispatcher thread pool, a different thread. This PR'sPendingHostnamesStorefix is generic, not Spring/Reactor-specific, so it turns out to already cover this case with no OkHttp-specific changes needed - confirmed by running the same test against the base branch first (fails: real network call goes out unblocked) and then this branch (passes).Addendum: tried to close a #312 review comment fully, found a real limit
Hans's review on #312 also asked for something matching
HttpURLConnectionTest's pattern - a single JUnit test where a real client call to a private IP gets blocked, no splitting needed. #312 couldn't do that forWebClient(had to splitWebClientTestinto two tests) because of theThreadLocalthread-hop problem this PR fixes - so tried writing that single test here, assuming it'd now work.It doesn't, and not for a reason this PR can fix:
ReactorAikidoContext's taint-context capture only gets written into Reactor's ownContextbySpringWebfluxWrapper, which only fires for a real incoming WebFlux request. A barewebClient.get().uri(privateIP)...block()call made directly from a JUnit test, with no request behind it, never populates it - confirmed empirically, the request actually went out over the network and hung until timeout instead of getting blocked. UnlikeHttpURLConnectionTest(whose SSRF check never depended on any request-scoped context propagation beyond a plainThreadLocalon the same thread), this fix inherently needs a real Spring request context to exist.The correct test for "WebClient in a Spring app gets blocked" is what already exists:
spring_webflux_postgres.py'sssrfpayload (from #312) plus this PR's own scheduler-hop e2e scenario, both against the real running sample app. Documented this inWebClientTest.java's javadoc so it doesn't get re-litigated in review as "why isn't this just one test".Known remaining limitations (unchanged from #312, not addressed here)
SocketChannelWrappernever fires under Netty's native epoll transport (Linux's default) - see Track and protect WebClient outbound requests, fix private-IP SSRF regression #312's worklog item 1 for the full writeup.EpollSocketChannelimplements Netty's ownio.netty.channel.socket.SocketChannel, not the JDK'sjava.nio.channels.SocketChannelour matcher targets, despite the identical simple name. WebClient traffic is completely untracked/unprotected under this transport, which is why this PR's ownSpringWebfluxSampleAppe2e job fails 100% of the time in CI while passing locally on macOS (native epoll lib present but fails to load there, silent fallback to plain NIO masks the gap). Not yet fixed - candidate direction noted in Track and protect WebClient outbound requests, fix private-IP SSRF regression #312.RestClient(6.1+) instrumentation status is still unverified.IsPrivateIP.isPrivateIp()'s handling of exotic private-IP encodings (decimal/octal, IPv4-mapped IPv6) is still unverified.🤖 Generated with Claude Code
Summary by Aikido
🚀 New Features
🐛 Bugfixes
🔧 Refactors
More info