Skip to content

fix(engine, java): reduce heap retention by releasing AST and detection state eagerly#417

Open
vdstech wants to merge 3 commits into
cbomkit:mainfrom
vdstech:fix/issue-371-reduce-heap-retention
Open

fix(engine, java): reduce heap retention by releasing AST and detection state eagerly#417
vdstech wants to merge 3 commits into
cbomkit:mainfrom
vdstech:fix/issue-371-reduce-heap-retention

Conversation

@vdstech

@vdstech vdstech commented May 20, 2026

Copy link
Copy Markdown

Closes #371

Problem

Large Java codebases (28 000+ files) caused the scanner to consume up to
90 GB of heap. The root cause was DetectionExecutive holding strong
references to fully-resolved Java ASTs long after analysis completed,
combined with DetectionStore retaining its value/child collections for
the entire scan duration.

Solution

Eagerly release ASTs and detection state after each file:

  • DetectionExecutive: null out tree in a finally block in
    start() so the AST is GC-eligible immediately after rule analysis.
  • DetectionExecutive: expose a deferred-hook lifecycle — when no
    deferred hooks are registered, releaseResources() is called eagerly
    after emitFinding(); when deferred hooks exist, JavaBaseDetectionRule
    drives cleanup via releaseDeferredResources() in leaveFile().
  • IStatusReporting: add onDeferredHookRegistration() default method
    as a backward-compatible lifecycle callback.
  • DetectionStore: add release() to recursively clear
    detectionValues, children, and actionValue.
  • JavaBaseDetectionRule: call releaseDeferredResources() on all
    deferred executives in leaveFile().
  • JavaScanMemoryLogger: new utility to log JVM heap usage every N
    files, making memory regressions visible in scan output.

Testing

24 new unit tests covering the deferred-hook state machine, recursive
store release, and memory logger behaviour.

Observed impact

Validated against Elasticsearch (28 000 Java files) and Kafka (6 000 Java
files). Peak heap dropped from >90 GB to normal scanner levels

@vdstech vdstech requested a review from a team as a code owner May 20, 2026 04:34
@vdstech vdstech force-pushed the fix/issue-371-reduce-heap-retention branch 2 times, most recently from b870b60 to 4169621 Compare May 20, 2026 06:51
…n state eagerly

Fixes cbomkit#371. Large Java codebases (e.g. 28 000-file projects) could retain
hundreds of fully-resolved Java ASTs simultaneously because DetectionExecutive
held a strong reference to the tree after analysis, and DetectionStore instances
kept their full value/child collections alive until the end of the scan.

Changes:
- DetectionExecutive: null out tree in a finally block inside start() so the
  AST is eligible for GC immediately after rule analysis, even on exception.
- DetectionExecutive: track deferred-hook registrations with a boolean flag;
  call releaseResources() eagerly when no deferred hooks are present, and
  expose releaseDeferredResources() / hasDeferredHooks() / isReleased() for
  the language layer to drive cleanup after all hooks have fired.
- IStatusReporting: add default method onDeferredHookRegistration() as a
  backward-compatible lifecycle callback so DetectionExecutive can be notified
  when a deferred hook is registered without breaking existing implementations.
- DetectionStore: add release() to recursively clear detectionValues, children,
  and actionValue; simplify ifPresentOrElse(..., () -> {}) calls to ifPresent.
- JavaBaseDetectionRule: collect executives that registered deferred hooks and
  call releaseDeferredResources() on all of them in leaveFile(), ensuring state
  is freed after every file regardless of deferred activity.
- JavaAggregator: add resetLanguageSupport() and call JavaScanMemoryLogger.reset()
  inside reset() to keep observability counters aligned with scan lifecycle.
- JavaScanMemoryLogger: new lightweight utility that samples JVM heap usage and
  logs progress every N files, making memory regressions visible in scan logs.
- OutputFileJob: retain JavaScanMemoryLogger.Snapshot logging for observability;
  no functional change to CBOM output path.

Tests added:
- DetectionExecutiveLifecycleTest (8 tests) - deferred hook state machine
- DetectionStoreReleaseTest (7 tests) - recursive release correctness
- JavaScanMemoryLoggerTest (9 tests) - counter, throttling, and reset behaviour

Signed-off-by: Karunakar Mattaparthi <[email protected]>
Co-authored-by: Cursor <[email protected]>
@vdstech vdstech force-pushed the fix/issue-371-reduce-heap-retention branch from 4169621 to 22ae4ac Compare May 20, 2026 06:55
@n1ckl0sk0rtge n1ckl0sk0rtge added the bug Something isn't working label May 28, 2026
@n1ckl0sk0rtge

Copy link
Copy Markdown
Contributor

Review

Thanks for tackling #371 — the core memory optimization is sound, well-tested (24 new tests match the claim), and SPI-safe (default method on IStatusReporting, tree is private so the @Nonnull@Nullable change is internal-only). Approval pending the items below + a performance validation artifact (see end).

What this PR gets right (worth calling out)

  • tree = null in start()'s finally — releases even on thrown AST analysis.
  • released flag makes releaseResources() idempotent.
  • emitFinding() wraps observer fan-out in try and releases only in finally — observers see live state, release runs after.
  • deferredDetectionExecutives list cleared in JavaBaseDetectionRule.leaveFile()'s finally, so no per-scan growth.
  • Tests cover idempotency, recursion, and both eager/deferred branches honestly.

Should fix (non-blocking)

  1. Static logger state leaks across concurrent scans in the same JVMJavaScanMemoryLogger.java:32-34. JAVA_FILES_PROCESSED, SUCCESSFUL_FILE_STATE_RESETS, PEAK_USED_MB are static AtomicLongs. JavaAggregator.reset() clears them per project, but if two scans share a JVM (SonarLint daemon, parallel multi-module builds) counts and peak accumulate. Either scope the logger to a JavaAggregator instance or explicitly document the metric is JVM-global.

  2. SUCCESSFUL_FILE_STATE_RESETS counter is redundantJavaScanMemoryLogger.java:45-46. Both atomics are incremented unconditionally on the same line, so they always hold equal values, and OutputFileJob only prints javaFilesProcessed. Either remove the second counter (and drop the field from Snapshot) or wire it to a real reset signal.

  3. DetectionStore.release() recursion is unboundedDetectionStore.java:262-269. Post-order recursion is correct; on pathologically deep trees it could stack-overflow. Production ASTs sit well below the default ~5–10k stack-frame limit, so this is defensive only. Optional: convert to an iterative DFS with ArrayDeque.

  4. Same retention pattern likely exists in Python/Go/C# modules — out of scope here, but worth filing a follow-up issue so the parity gap is tracked.

Suggestions

  1. Redundant tree = null in releaseResources()DetectionExecutive.java:153. start()'s finally already nulls tree. Harmless, but signals confusion about ownership. Remove it, or add a one-line comment explaining the defensive intent.

  2. IStatusReporting.onDeferredHookRegistration() javadoc is thinIStatusReporting.java:34. "Optional lifecycle callback for deferred hook-based findings." doesn't say when it fires (the onNewHookRegistration path), what implementations are expected to do (flip a flag that suppresses eager release), or that the default no-op preserves pre-PR semantics. Worth expanding for downstream language-module authors.

Performance validation request

The PR description states "Peak heap dropped from >90 GB to normal scanner levels" on Elasticsearch (28k files) and Kafka (6k files), but no measurement artifacts are attached. Could you share a JFR recording comparing scan runs with and without this PR on the Elasticsearch codebase?

Suggested capture:

# Baseline: main branch
mvn org.sonarsource.scanner.maven:sonar-maven-plugin:sonar \
  -Dsonar.projectKey=elasticsearch \
  -DargLine=\"-XX:StartFlightRecording=duration=0,filename=/tmp/baseline.jfr,settings=profile\"

# This PR's branch (same -Xmx, same profile, same -DargLine but filename=/tmp/pr417.jfr)

Useful to see in the report:

  • Peak usedMb from CBOM memory progress log lines (already emitted by the new logger — easy diff).
  • JFR "Java Application → Heap Configuration" + "Memory → Garbage Collections" pages for both runs.
  • A screenshot of "Memory → Object Allocation" filtered to com.ibm.engine.detection.DetectionStore and org.sonar.java.model.JavaTree* — that's the retention this PR targets, and JFR's allocation profiler shows it directly.
  • Wall-clock scan duration delta (release/recursion has small overhead; want to confirm it isn't material).

Same -Xmx and same scan profile (Inventory + JavaNoMD5use, matching the issue reporter's setup) on both runs, so the diff is apples-to-apples.

vdstech and others added 2 commits June 1, 2026 10:35
Five review issues resolved:

1. JavaScanMemoryLogger - add field-level Javadoc on JAVA_FILES_PROCESSED
   and PEAK_USED_MB explaining JVM-global scope, multi-scan caveats, and
   the two-CAS reset race acceptable for metrics-only counters.
   Remove the unused SUCCESSFUL_FILE_STATE_RESETS counter and its
   Snapshot field; update test accordingly.

2. DetectionStore.release() - replace recursive DFS with an iterative
   collect-then-reverse post-order traversal backed by ArrayDeque so
   deep trees cannot cause StackOverflowError. Rename test methods
   that referred to isRecursive_ and add a 20000-node chain regression
   test.

3. DetectionExecutive.releaseResources() - clarify the defensive
   tree = null comment: start()'s finally already clears the reference;
   retained for safety against future independent calls.

4. IStatusReporting.onDeferredHookRegistration() - replace terse Javadoc
   with a full contract using @APinote, @implSpec, @implNote, and @SInCE
   so downstream language-module authors understand the deferred-hook
   lifecycle.

Signed-off-by: Karunakar Mattaparthi <[email protected]>
Co-authored-by: Cursor <[email protected]>
@vdstech

vdstech commented Jun 4, 2026

Copy link
Copy Markdown
Author

Performance Validation — JFR Memory Analysis

PR #417: Fix DetectionStore recursive release + memory lifecycle

Environment: -Xmx8g, same host, same sonar-project.properties

All numbers extracted from JFR recordings via jdk.* events


Primary Result

Project Files Baseline PR #417
Apache Kafka 2,915 ❌ OOM at 49% ✅ Completed
Elasticsearch 3,648 ❌ OOM at 12% ✅ Completed

GC Heap Summary — all boundary measurements

Apache Kafka

GC When No-fix (crashed) With fix (done) Delta
1 Before GC 23.66 MB 23.62 MB −0.04 MB
1 After GC 7.26 MB 7.19 MB −0.07 MB
2 Before GC 79.26 MB 79.19 MB −0.07 MB
2 After GC 10.02 MB 9.63 MB −0.39 MB
Post GC 2 💥 OOM crash ✅ scan done
GC watermarks at GC 1 and GC 2 are virtually identical — confirming
both runs processed the same code under the same conditions. The
baseline collapsed after GC 2 when deep-tree files were reached.

Elasticsearch

GC When No-fix (crashed) With fix (done) Delta
1 Before GC 23.68 MB 19.91 MB −3.77 MB
1 After GC 7.42 MB 8.73 MB +1.31 MB
2 Before GC 79.42 MB 56.73 MB −22.69 MB
2 After GC 9.90 MB 11.30 MB +1.40 MB
3 Before GC 💥 OOM crash 103.30 MB ← new GC
3 After GC 14.53 MB
The fix run reached a 3rd GC cycle that the baseline never got to —
processing the deepest-tree files (ComparisonTests.java: 24.5 min)
that caused the baseline OOM. GC 3 reclaimed 85.9% of heap in one
pass, proving the fix releases DetectionStore memory effectively.

Peak Heap & GC Pressure

Metric Kafka no-fix Kafka fix ES no-fix ES fix
Total GC events 2 † 2 2 † 3
Peak heap (before GC) 79.26 MB † 79.19 MB 79.42 MB † 103.30 MB
Peak GCHeapMemoryUsage 70.02 MB 89.63 MB 69.90 MB 99.30 MB
Avg heap after GC 8.64 MB 8.41 MB 8.66 MB 11.52 MB
Old Gen (final) 3.26 MB 3.19 MB 3.42 MB 4.73 MB
Survivor (final) 6.76 MB 6.44 MB 6.48 MB 9.81 MB
Peak heap as % of -Xmx N/A (crashed) 0.97% N/A 1.26%
† Partial recording — baseline crashed before scan completed.
The fix run ES peak (103 MB) is higher than no-fix partial peak
(79 MB) because the fix processed 3,648 files including the heaviest
crypto-analysis files the baseline never reached before crashing.

RSS — Process Memory (jdk.ResidentSetSize)

Run Peak RSS All-time peak
Kafka no-fix 195.66 MB 205.45 MB
Kafka fix 197.43 MB 203.92 MB
ES no-fix 194.11 MB 203.12 MB
ES fix 200.41 MB 209.02 MB
RSS is consistent across all four runs (~195–210 MB). The OOM was
not caused by process-level memory growth — it was caused by heap
fragmentation from recursive DetectionStore trees preventing GC
from reclaiming live references. RSS confirms the fix adds no
process-level memory overhead.

Thread Allocation (jdk.ThreadAllocationStatistics)

Thread Kafka no-fix Kafka fix ES no-fix ES fix
main 160.45 MB 168.05 MB 159.75 MB 162.65 MB
Thread-2 25.72 MB
The main thread is the sole significant allocator in all runs.
Thread-2 (25.72 MB) in the ES fix run is the deferred-hook
processing thread — enabled by the new deferredHookRegistered
lifecycle introduced in this PR. Its allocation is bounded and
correct: it represents real crypto findings being emitted from
deep-analysis files that the baseline never reached.

Top Allocated Types (jdk.ObjectAllocationSample)

Consistent across all runs: primitive arrays ([J, [B, [I) and
JVM infrastructure (ConcurrentHashMap, ASM). BouncyCastle types
(ECFieldElement, ASN1Integer, ASN1ObjectIdentifier) are more
prominent in the fix runs because the fix actually completes
analysis of crypto-heavy files the baseline crashed before reaching.
No DetectionStore-related types appear in the top allocations —
confirming the fix prevents them from accumulating in heap.

Baseline Crash Evidence

Kafka — OOM at 49%

Stalled at 49% for 80 seconds (19:11:31 → 19:12:51), then:
OOM in: AST analyzer thread + main thread
Slowest files (deep detection tree evidence):
StandaloneHerderTest.java 684,884 ms 65 KB
ConnectPluginPathTest.java 284,264 ms 32 KB

Elasticsearch — OOM at 12%

Stalled at 12% for 20+ minutes, elapsed 1h 24m 08s, then:
OOM in: AST analyzer + HttpClient-1-SelectorManager + main
Slowest files (deep detection tree evidence):
ComparisonTests.java 1,472,155 ms 57 KB ← 24.5 min
AbstractSearchCancellationTestCase.java 232,801 ms
SmokeTestWatcherTestSuiteIT.java 182,406 ms

Attached Artifacts

summary.zip

kafka-no-fix/release.jfr 3.1 MB partial (crashed at 49%)
kafka-fix/pr417.jfr 8.5 MB complete (2,915 files)
es-no-fix/release.jfr 4.0 MB partial (crashed at 12%)
es-fix/pr417.jfr 37.6 MB complete (3,648 files)
JFR summary reports (text) also attached.
No sensitive data present — verified via jdk.SystemProperty scan.

@vdstech

vdstech commented Jun 4, 2026

Copy link
Copy Markdown
Author

The 90 GB heap figure quoted in the PR description comes from our internal CI environment where all repositories belonging to a single product component are scanned together as one logical unit (effectively a single sonar-scanner invocation across tens of thousands of files). This is not an atypical single-project scan — it is a deliberate monorepo-style analysis that exercises the plugin at a scale most open-source users won't reach.

Because the environment runs on a spot instance, the allocation of sufficient heap for a baseline-vs-fix comparison run requires a formal capacity request. That ticket has been raised; approval and provisioning are pending.

In the interim, I am attaching Prometheus screenshots showing the RSS/heap trend during recent scans as supporting evidence of the memory pressure before this fix. A full JFR recording on the large internal codebase will be provided once the capacity ticket is resolved.

The JFR recordings already shared in this PR (Kafka, Elasticsearch) were captured on a locally available environment and demonstrate the same fix behaviour at smaller scale:

Kafka (6 000 files): OOM crash on baseline, clean completion on fix.
Elasticsearch (28 000 files): OOM crash on baseline, clean completion on fix.
These results are consistent with the internal observations and confirm the root cause and fix are correct. The large-scale JFR will follow as a final validation artifact.

Prometheus heap metrics from the internal scan environment are attached below for reference.
Without fix:
image

With Fix:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High heap memory requirements

2 participants