Make Pipeline Overview responsive on large in-progress builds #1225
timja wants to merge 26 commits into jenkinsci:main
Conversation

timja force-pushed: 6e69845 → eeb3475, then eeb3475 → a0d3a43
@jimklimov I'd be interested if it's possible for you to test this on your instance. I've done some basic testing and it seems to work functionally, but I haven't done any serious performance testing on it.
Pull request overview
This PR implements an in-memory “live graph” path for in-progress Pipeline runs so graph/steps DTOs can be produced without repeatedly walking the entire FlowNode graph via DepthFirstScanner on each HTTP request (fixes #884).
Changes:
- Added a new `livestate/` subsystem (registry + per-run state + listeners) to accumulate FlowNodes as the build runs and expose immutable snapshots.
- Updated `PipelineGraphApi`/`PipelineStepApi` to prefer live snapshots (with per-version DTO caching) and fall back to the existing scanner path when unavailable.
- Added constructors to `PipelineNodeTreeScanner`/`PipelineNodeGraphAdapter` to build from a pre-collected node set, plus new integration tests for lifecycle/caching behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphState.java | Per-run mutable node accumulator + per-version DTO caches. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphSnapshot.java | Immutable snapshot record of observed nodes/workspace nodes + version. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphRegistry.java | Singleton registry (Caffeine) for live states + escape-hatch feature flag. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphPopulator.java | GraphListener.Synchronous extension feeding FlowNodes into LiveGraphState with catch-up. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphLifecycle.java | FlowExecutionListener extension creating/evicting state on start/resume/complete. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/utils/PipelineGraphApi.java | Prefer live snapshot + per-version graph caching; reduce per-stage scans using workspace-node list. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/utils/PipelineStepApi.java | Prefer live snapshot + per-version steps caching. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/PipelineNodeTreeScanner.java | New constructor + factoring to build from a caller-supplied node collection. |
| src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/PipelineNodeGraphAdapter.java | New constructor to use the pre-collected node set. |
| src/test/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphLifecycleTest.java | Integration tests covering live population, eviction, feature flag, and cache hit/miss behavior. |
```java
void cacheGraph(long version, PipelineGraph graph) {
    VersionedCache<PipelineGraph> current = cachedGraph;
    if (current == null || current.version < version) {
        cachedGraph = new VersionedCache<>(version, graph);
    }
}
```
cacheGraph is not thread-safe: two concurrent callers can both observe the same current value and then overwrite cachedGraph out of order, allowing an older version to replace a newer one. This can cause stale PipelineGraph instances to be served even after a newer graph was cached. Make the update monotonic (e.g., synchronize cacheGraph, or use an AtomicReference/CAS loop that only replaces the cache when the stored version is still older).
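The monotonic update the comment asks for can be sketched with an `AtomicReference` CAS loop. This is an illustrative stand-in, not the plugin's code: `MonotonicCache` is a hypothetical simplified holder using a `Map.Entry` as a version/value pair in place of `VersionedCache`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: a version-tagged cache whose stored version can never decrease,
// even under concurrent writers, via a compare-and-set retry loop.
class MonotonicCache<T> {
    private final AtomicReference<Map.Entry<Long, T>> ref = new AtomicReference<>();

    void cache(long version, T value) {
        Map.Entry<Long, T> next = new SimpleImmutableEntry<>(version, value);
        Map.Entry<Long, T> current;
        do {
            current = ref.get();
            if (current != null && current.getKey() >= version) {
                return; // a newer (or equal) entry already won; never regress
            }
        } while (!ref.compareAndSet(current, next));
    }

    T get(long minVersion) {
        Map.Entry<Long, T> current = ref.get();
        return (current != null && current.getKey() >= minVersion) ? current.getValue() : null;
    }
}
```

A stale writer that loses the race either sees a newer stored version and returns, or fails the CAS and retries against the fresh value, so the cached version is monotonic without a monitor.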
```java
void cacheAllSteps(long version, PipelineStepList steps) {
    VersionedCache<PipelineStepList> current = cachedAllSteps;
    if (current == null || current.version < version) {
        cachedAllSteps = new VersionedCache<>(version, steps);
    }
}
```
cacheAllSteps has the same race as cacheGraph: the check-then-set on the volatile cachedAllSteps reference can regress the cached version under concurrent requests, leading to stale step lists being served. Use synchronization or an atomic compare-and-set style update to guarantee the cached version never decreases.
```java
synchronized LiveGraphSnapshot snapshot() {
    if (poisoned) {
        return null;
    }
    return new LiveGraphSnapshot(List.copyOf(nodes), List.copyOf(workspaceNodes), version);
}
```
snapshot() is synchronized and performs List.copyOf(...) over the full node lists on every call. For large in-progress graphs and frequent polling, this is O(N) work while holding the monitor that addNode also needs, which can block the CPS VM thread. Consider memoizing the last LiveGraphSnapshot and returning it when version is unchanged, or publishing an immutable snapshot from addNode so readers can fetch it without copying under the lock.
```java
/**
 * Non-null when the live-state path supplied a pre-collected list of nodes carrying a
 * {@link WorkspaceAction}. Avoids the per-stage {@code DepthFirstScanner} walk inside
 * {@link #getStageNode(FlowNodeWrapper)}.
 */
@CheckForNull
private transient List<FlowNode> workspaceNodesOverride;
```
workspaceNodesOverride is mutable shared state on the PipelineGraphApi instance. PipelineConsoleViewAction holds a single PipelineGraphApi per run and in-progress requests are not synchronized (see PipelineGraphViewCache#getGraph), so concurrent HTTP calls can race: one thread can clear/replace workspaceNodesOverride while another is in getStageNode, causing unpredictable fallback scanning and inconsistent performance. Avoid using an instance field for this; instead, thread the candidate iterable as a local variable/parameter (e.g., pass the workspace-node list into getStageNode or into createTree so each request stays isolated).
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```java
        LiveGraphRegistry.get().cacheAllSteps(run, snapshot.version(), computed);
        return computed;
    }
    return getAllSteps(CachedPipelineNodeGraphAdaptor.instance.getFor(run), runIsComplete);
```
computeAllSteps() was changed from package-private to public so LiveGraphLifecycle (different package) can call it. This makes an internal implementation detail part of the plugin's public API. If it is not meant for external callers, consider keeping it non-public via package refactoring, or annotating it with `@Restricted(DoNotUse.class)` to match the existing pattern used for internal-only public helpers (e.g., `StatusAndTiming#printNodes`).
```java
synchronized LiveGraphSnapshot snapshot() {
    if (poisoned) {
        return null;
    }
    if (lastSnapshot != null && lastSnapshot.version() == version) {
        return lastSnapshot;
    }
```
LiveGraphState.snapshot() memoizes and returns lastSnapshot when the node version hasn't changed. This can make workspaceNodes stale: WorkspaceAction may be attached to an existing FlowNode after onNewHead fires (as noted in the class Javadoc), but with no new nodes the version won't bump, so subsequent snapshot() calls can miss the newly-attached WorkspaceAction and return incorrect agent/node mapping. Consider recomputing workspaceNodes even when version is unchanged (e.g., reuse the previously-copied nodes list but rescan actions), or drop the lastSnapshot fast-path to preserve correctness.
```diff
@@ -160,7 +168,18 @@ public PipelineGraph createTree() {
     return PipelineGraphViewCache.get().getGraph(run, this::computeTree);
 }

-PipelineGraph computeTree() {
-    return createTree(CachedPipelineNodeGraphAdaptor.instance.getFor(run));
+public PipelineGraph computeTree() {
+    LiveGraphSnapshot snapshot = LiveGraphRegistry.get().snapshot(run);
+    if (snapshot != null) {
+        PipelineGraph cached = LiveGraphRegistry.get().cachedGraph(run, snapshot.version());
+        if (cached != null) {
+            return cached;
+        }
+        PipelineGraph computed =
+                createTree(new PipelineNodeGraphAdapter(run, snapshot.nodes()), snapshot.workspaceNodes());
+        LiveGraphRegistry.get().cacheGraph(run, snapshot.version(), computed);
+        return computed;
+    }
+    return createTree(CachedPipelineNodeGraphAdaptor.instance.getFor(run), null);
```
computeTree() was changed from package-private to public solely so LiveGraphLifecycle (different package) can call it. Exposing this as a public method effectively expands the plugin's supported API surface. If this is intended as an internal hook, consider keeping it non-public and moving the caller into the same package, or annotating the method with `@Restricted(DoNotUse.class)` (consistent with `StatusAndTiming#printNodes`) to avoid implying external use is supported.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
```java
synchronized LiveGraphSnapshot snapshot() {
    if (poisoned || !ready) {
        return null;
    }
    // Filter for WorkspaceAction at snapshot time rather than at addNode time:
    // WorkspaceAction is attached to a block-start node when the workspace is allocated,
    // which can happen AFTER onNewHead has already fired for that node. A snapshot-time
    // scan always observes the latest action state on each captured FlowNode.
    //
    // The list is built newest-first (reverse insertion order) to match the iteration
    // order of DepthFirstScanner (from current heads backward): PipelineGraphApi#getStageNode
    // returns on the first match, and for nested agents the innermost workspace is the
    // more-specific match — the one a later-created inner `node {}` block sits in.
    List<FlowNode> workspaceNodes = new ArrayList<>();
    for (int i = nodes.size() - 1; i >= 0; i--) {
        FlowNode n = nodes.get(i);
        if (n.getAction(WorkspaceAction.class) != null) {
            workspaceNodes.add(n);
        }
    }
    return new LiveGraphSnapshot(List.copyOf(nodes), List.copyOf(workspaceNodes), version);
```
LiveGraphState.snapshot() holds the instance monitor while (a) scanning the entire nodes list for WorkspaceAction and (b) copying both nodes and workspaceNodes. Since addNode() is called from GraphListener.Synchronous on the CPS VM thread, this means an API read can block pipeline execution for O(N) time on large graphs. Consider minimizing time under the monitor by copying nodes+version under lock and doing the WorkspaceAction filtering outside the synchronized block (or maintaining a separate concurrent/workspace index) so writers are less likely to stall.
```java
PipelineGraph graph = new PipelineGraphApi(run).computeTree();
PipelineStepList allSteps = new PipelineStepApi(run).computeAllSteps();
// WorkflowRun.isBuilding() can still be true here even though FlowExecution
// is complete; rebuild with runIsComplete=true so the persisted copy matches
// reality. PipelineGraph.complete already reflects FlowExecution.isComplete().
PipelineStepList finalSteps = new PipelineStepList(allSteps.steps, true);
PipelineGraphViewCache.get().seed(run, graph, finalSteps);
```
In onCompleted(), finalSteps reuses the same mutable List instance via new PipelineStepList(allSteps.steps, true). Since PipelineStepList.steps is publicly mutable, this aliases the list between the cached in-memory DTO (potentially concurrently read) and the persisted payload being written to disk. Prefer copying the list (or adding a constructor that defensively copies) when creating the “final” persisted version.
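The defensive copy the comment suggests can be sketched as follows. `StepListDto` is a simplified, hypothetical stand-in for `PipelineStepList`: the point is only that the constructor copies the incoming list, so the cached DTO and the persisted "final" instance never alias the same mutable `List`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a DTO whose constructor defensively copies its list, so a later
// mutation of the caller's list cannot leak into the stored instance.
final class StepListDto {
    final List<String> steps;
    final boolean runIsComplete;

    StepListDto(List<String> steps, boolean runIsComplete) {
        this.steps = new ArrayList<>(steps); // defensive copy breaks aliasing
        this.runIsComplete = runIsComplete;
    }
}
```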
```java
static void catchUp(FlowExecution execution, LiveGraphState state) {
    try {
        DepthFirstScanner scanner = new DepthFirstScanner();
        scanner.setup(execution.getCurrentHeads());
        for (FlowNode existing : scanner) {
            state.addNode(existing);
        }
```
LiveGraphState.snapshot() assumes nodes insertion order is chronological (oldest→newest) and reverses it to approximate DepthFirstScanner (newest-first). In catchUp(), you iterate DepthFirstScanner directly and call state.addNode(existing) in that iteration order, which is already newest-first; after a restart this makes the subsequent reverse-scan produce oldest-first workspace candidates, so PipelineGraphApi#getStageNode may pick an outer workspace instead of the most-specific one. Consider collecting scanner output and adding nodes to the state in reverse (oldest→newest) to keep insertion order consistent with the live onNewHead path.
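The suggested ordering fix amounts to buffering the scanner's newest-first iteration and replaying it oldest-first. A minimal, hypothetical sketch (generic element type standing in for `FlowNode`):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: normalize a newest-first iteration into chronological
// (oldest -> newest) order, so catch-up insertion order matches the
// order the live onNewHead path would have produced.
final class CatchUpOrder {
    static <N> List<N> chronological(Iterable<N> newestFirst) {
        List<N> buffered = new ArrayList<>();
        for (N node : newestFirst) {
            buffered.add(node);
        }
        Collections.reverse(buffered); // oldest -> newest
        return buffered;
    }
}
```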
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
```java
    return (cachedAllSteps != null && cachedAllSteps.version >= minVersion) ? cachedAllSteps.value : null;
}

synchronized void cacheAllSteps(long version, PipelineStepList steps) {
    if (cachedAllSteps == null || cachedAllSteps.version < version) {
        cachedAllSteps = new VersionedCache<>(version, steps);
    }
}
```
The live-state caches return the same PipelineStepList instance across calls, but PipelineStepList.steps is a publicly mutable List. That means a consumer (or even accidental mutation during serialization) can corrupt the cached value and affect other concurrent readers. Consider caching an immutable/defensive copy (or wrapping steps as unmodifiable) before storing/returning it from the live-state cache.
Suggested change:
```diff
-    return (cachedAllSteps != null && cachedAllSteps.version >= minVersion) ? cachedAllSteps.value : null;
-}
-
-synchronized void cacheAllSteps(long version, PipelineStepList steps) {
-    if (cachedAllSteps == null || cachedAllSteps.version < version) {
-        cachedAllSteps = new VersionedCache<>(version, steps);
-    }
-}
+    return (cachedAllSteps != null && cachedAllSteps.version >= minVersion)
+            ? copyPipelineStepList(cachedAllSteps.value)
+            : null;
+}
+
+synchronized void cacheAllSteps(long version, PipelineStepList steps) {
+    if (cachedAllSteps == null || cachedAllSteps.version < version) {
+        cachedAllSteps = new VersionedCache<>(version, copyPipelineStepList(steps));
+    }
+}
+
+private PipelineStepList copyPipelineStepList(PipelineStepList steps) {
+    return new PipelineStepList(List.copyOf(steps.steps));
+}
```
```java
public void onCompleted(@NonNull FlowExecution execution) {
    try {
        WorkflowRun run = workflowRunFor(execution);
        if (run != null) {
            PipelineGraph graph;
            PipelineStepList allSteps;
            LiveGraphSnapshot snapshot = LiveGraphRegistry.get().snapshot(run);
            if (snapshot != null) {
                // Share a single adapter (and therefore a single tree-scanner pass) for
                // both graph and steps rather than paying the cost twice.
                PipelineNodeGraphAdapter adapter = new PipelineNodeGraphAdapter(run, snapshot.nodes());
                // runIsComplete=true here directly: WorkflowRun.isBuilding() can still be
                // true even though FlowExecution is complete. PipelineGraph.complete is
                // already derived from FlowExecution.isComplete() inside createTreeFrom.
                graph = new PipelineGraphApi(run).createTreeFrom(adapter, snapshot.workspaceNodes());
                allSteps = new PipelineStepApi(run).getAllStepsFrom(adapter, true);
            } else {
                // No live state (feature disabled, poisoned, plugin installed mid-build).
                // Fall back to the scanner-backed paths and rebuild steps with
                // runIsComplete=true to keep the persisted copy consistent. Defensively
                // copy the list: PipelineStepList.steps is publicly mutable, and `raw`
                // may be held by the DTO cache or returned to a concurrent HTTP reader.
                graph = new PipelineGraphApi(run).computeTree();
                PipelineStepList raw = new PipelineStepApi(run).computeAllSteps();
                allSteps = new PipelineStepList(new ArrayList<>(raw.steps), true);
            }
            PipelineGraphViewCache.get().seed(run, graph, allSteps);
        }
```
onCompleted now computes and writes the graph/steps for every finished run (including when the live-state feature flag is disabled, since snapshot() will be null and the code falls back to a full scanner sweep). This can add noticeable CPU and I/O to build finalization even if nobody ever requests the view. Consider honoring the .enabled flag here (or adding a separate knob for seeding), and/or short-circuiting if the on-disk cache file already exists and matches the current schema.
```java
/**
 * Writes a final graph and step list directly to disk, bypassing the {@code isBuilding}
 * guard used by {@link #getGraph} / {@link #getAllSteps}. Intended for use at
 * {@code FlowExecutionListener.onCompleted}, where the execution is known complete but
 * {@code WorkflowRun.isBuilding()} may not have flipped yet. Calling this avoids wasting
 * the work already done by the live-state path: without it, the first read after
 * completion falls through to a full scanner sweep.
 */
public void seed(WorkflowRun run, PipelineGraph graph, PipelineStepList allSteps) {
    CachedPayload payload = load(run);
    synchronized (payload) {
        payload.graph = graph;
        payload.allSteps = allSteps;
        payload.schemaVersion = SCHEMA_VERSION;
        write(run, payload);
    }
}
```
seed(...) is described as an internal hook for FlowExecutionListener.onCompleted, but it is a new public API surface. To avoid accidentally committing to it for external consumers, consider marking it @Restricted(NoExternalUse.class) (or otherwise limiting visibility) similar to the other newly-added internal compute/build entry points.
```java
private final Cache<String, LiveGraphState> states = Caffeine.newBuilder()
        .maximumSize(CACHE_MAX_SIZE)
        .expireAfterAccess(Duration.ofMinutes(30))
        .build();
```
expireAfterAccess(30m) can evict a live state while a run is still in progress (e.g. long input/sleep/semaphore with no new heads and no API reads). If that happens, the next onNewHead recreates a fresh LiveGraphState that never becomes ready (since onRunning/onResumed won't fire again), so the live-state path stays permanently disabled for that run. Consider removing the TTL for in-progress runs, switching to a much longer expireAfterWrite, or ensuring recreated states are marked ready (possibly with a safe catch-up) so eviction doesn't break functionality.
```java
synchronized (this) {
    if (poisoned || !ready) {
        return null;
    }
    nodesCopy = List.copyOf(nodes);
    v = version;
}
```
poisoned is set without synchronization, but snapshot() only checks it while holding the monitor and then releases the lock to scan for WorkspaceAction. If poison() happens after the synchronized block, snapshot() can still return a non-null snapshot from a now-poisoned state (which defeats the “poison → always fall back to scanner” contract). Consider re-checking poisoned just before returning, or making poison() coordinated with snapshot (e.g., synchronized + a second check) while still keeping the WorkspaceAction scan outside the lock.
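The double-check pattern being suggested can be sketched like this. `PoisonAwareSource` uses simplified, hypothetical names and an integer filter standing in for the `WorkspaceAction` scan; the structure mirrors the proposal: cheap copy under the monitor, expensive work outside it, then a second poison check before publishing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a snapshot source that re-checks its poison flag after doing
// expensive filtering outside the lock, so a concurrent poison() during
// the scan still forces the caller onto the fallback path.
final class PoisonAwareSource {
    private final AtomicBoolean poisoned = new AtomicBoolean();
    private final List<Integer> nodes = new ArrayList<>();

    synchronized void add(int n) {
        nodes.add(n);
    }

    void poison() {
        poisoned.set(true);
    }

    List<Integer> snapshot() {
        List<Integer> copy;
        synchronized (this) {
            if (poisoned.get()) {
                return null;
            }
            copy = List.copyOf(nodes); // only the cheap copy holds the monitor
        }
        // stand-in for the WorkspaceAction scan, done without the monitor
        List<Integer> filtered = copy.stream().filter(n -> n % 2 == 0).toList();
        // second check closes the race window between unlock and return
        return poisoned.get() ? null : filtered;
    }
}
```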
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
…lineNodeTreeScanner.getStageSteps line 136
Pull request overview
Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 30 out of 31 changed files in this pull request and generated no new comments.
Do we want to split this out into its own PR, which we can merge first, to make this one more focussed?
Summary
Fixes #884.
Pipeline Overview has been effectively unusable during large in-progress builds: every `/tree` and `/allSteps` poll triggered a full `DepthFirstScanner` sweep that contended with the CPS VM for storage locks. On a 300k-node run, `/tree` was hitting 90+ seconds per request and `/allSteps` 20-50 seconds. This change rebuilds the serving path so those requests stay in the low seconds even as the build progresses.

What changed
Live graph state (`livestate/` package). A `@Extension GraphListener.Synchronous` feeds every new `FlowNode` into a per-run `LiveGraphState`, capturing the node's enclosing-block ancestry and `hideFromView` markers at `onNewHead` time. HTTP readers work from a snapshot of this state, with no storage reads for ancestry/flags on the hot path. On execution completion the final DTOs are seeded into the on-disk cache from #885, so post-build reads stay cache-warm without a second scan. Feature flag: `-Dio.jenkins.plugins.pipelinegraphview.livestate.LiveGraphRegistry.enabled=false` forces scanner fallback.
Eliminated a quadratic in `getStageSteps`. The previous `getAllSteps` called a per-stage filter that scanned all step nodes, doing `List.contains()` on each step's parent list. At 6000 stages × 300k steps that was ~1.8 billion comparisons per request, about 90s of CPU. Replaced with a single `O(steps)` pass that buckets each step into its parent stage's list.

Per-run compute dedup. Concurrent readers for the same run now serialise on a per-run monitor, so N parallel `/tree` or `/allSteps` callers don't each do the same `O(nodes)` rebuild. A fast-path re-check means repeat readers almost always see a cached result.
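The `O(steps)` bucketing pass described above can be sketched as follows. This is an illustrative example with simplified, hypothetical types (plain id strings instead of `FlowNodeWrapper`s): one walk over the steps, appending each to its parent stage's bucket, instead of filtering every step once per stage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: single-pass bucketing of steps by parent stage, replacing
// an O(stages x steps) repeated-filter approach.
final class StepBucketer {
    static Map<String, List<String>> bucketByStage(List<String[]> stepAndStage) {
        Map<String, List<String>> byStage = new HashMap<>();
        for (String[] pair : stepAndStage) {
            // pair[0] = step id, pair[1] = parent stage id
            byStage.computeIfAbsent(pair[1], k -> new ArrayList<>()).add(pair[0]);
        }
        return byStage;
    }
}
```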
Eliminated hot-path debug formatting. Two guarded debug lines were doing `String.join(", ", knownNodes)` inside per-node lookups; with DEBUG enabled on `io.jenkins.plugins.pipelinegraphview`, that was allocating a 300k-char string per enclosing-id check.
Switched JSON serialization to Jackson 3. `net.sf.json`'s `JSONObject.toString` was spending ~9s of CPU on the 15 MB `/allSteps` response. All four plugin JSON endpoints (`/tree`, `/allSteps`, `/steps`, `/runs`) now stream through a single `PipelineJsonWriter` backed by `jackson3-api`. Wire format is preserved (Stapler `{"status":"ok","data":...}` envelope, null fields omitted, enum names lowercased). All eight `JsonBeanProcessor`/`JsonValueProcessor` classes have been removed in favour of Jackson annotations.

Numbers (300k-node stress pipeline)
Follow-ups (out of scope for this PR)
- Wrapping nodes is still `O(N)` per request; a future change could cache wrapped nodes per version so reads become `O(ΔN)`.
- `StatusAndTiming.findWorstWarningBetween` still walks every node in a stage calling `getPersistentAction(WarningAction.class)`. Caching the result on the live snapshot (WarningActions don't change after attach) would remove the remaining ~1s of per-request cost.
- Remaining `net.sf.json` uses in `PipelineConsoleViewAction` (small responses on `doRerun`/`hasNextBuild`/`doCancel`, not on the perf path).

These are the results from a job with ~250k flow nodes:
Future work
- Cache `WarningAction` results on the live snapshot: they are fixed per node and never change, so the cache is stable per-version. That would knock another second or two off `/allSteps`.

All steps future work:
1. Switch this endpoint to Jackson - same DTO, ~5-10× faster serialization. Need to be careful about Stapler's default `net.sf.json` behavior; the easiest way is to write to `rsp.getWriter()` directly with an `ObjectMapper` and set the content type. Self-contained change, no wire-format diff. (Done - made quite a difference for all steps.)
2. Stop sending all 300k steps on every 3s poll. The front-end only cares about steps for stages currently on screen. This is the real architectural fix, but it's a frontend+backend refactor.
Testing done
Tested with the 300k-node pipeline; it's now near instant.
Submitter checklist