
Make Pipeline Overview responsive on large in-progress builds#1225

Open
timja wants to merge 26 commits into jenkinsci:main from timja:issue-884-save-as-graph-progresses

Conversation

Member

@timja timja commented Apr 22, 2026

Summary

Fixes #884.

Pipeline Overview has been effectively unusable during large in-progress builds — every /tree and /allSteps poll triggered a full DepthFirstScanner sweep that contended with the CPS VM for storage locks.

On a 300k-node run, /tree was hitting 90+ seconds per request and /allSteps 20-50 seconds. This change rebuilds the serving path so those requests stay in the low seconds even as the build progresses.

What changed

Live graph state (livestate/ package). A @Extension GraphListener.Synchronous feeds every new FlowNode into a per-run LiveGraphState, capturing the node's enclosing-block ancestry and hideFromView markers at onNewHead time. HTTP readers work from a snapshot of this state — no storage reads for ancestry/flags on the hot path. On execution completion the final DTOs are seeded into the on-disk cache from #885, so post-build reads stay cache-warm without a second scan. Feature flag: -Dio.jenkins.plugins.pipelinegraphview.livestate.LiveGraphRegistry.enabled=false forces the scanner fallback.
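The livestate classes themselves aren't quoted in this thread; as a rough stdlib-only sketch of the pattern (all names hypothetical, not the plugin's actual API): a listener appends each new node under a monitor, and readers take an immutable snapshot instead of scanning storage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a per-run live state. The real LiveGraphState tracks
// FlowNodes plus ancestry/flags; here node IDs stand in for that detail.
final class LiveState {
    private final List<String> nodeIds = new ArrayList<>();
    private long version;

    // Called from the graph listener as each new node head appears.
    synchronized void addNode(String nodeId) {
        nodeIds.add(nodeId);
        version++;
    }

    // Readers get an immutable copy plus the version it corresponds to.
    synchronized Snapshot snapshot() {
        return new Snapshot(List.copyOf(nodeIds), version);
    }

    record Snapshot(List<String> nodeIds, long version) {}
}
```

The key property is that the HTTP path only ever touches the in-memory copy; the writer side is a cheap append under a short-lived lock.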

Eliminated a quadratic in getStageSteps. The previous getAllSteps called a per-stage filter that scanned all step nodes, doing List.contains() on each step's parent list. At 6000 stages × 300k steps that was ~1.8 billion comparisons per request — 90s of CPU. Replaced with a single O(steps) pass that buckets each step into its parent stage's list.
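The replacement pass isn't quoted in the PR body; the shape of the fix looks roughly like this (hypothetical types — the plugin's real DTOs differ):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StepBucketer {
    record Step(String id, String parentStageId) {}

    // One O(steps) pass: each step lands in its parent stage's bucket, instead
    // of an O(stages * steps) loop doing List.contains() per stage per step.
    static Map<String, List<Step>> bucketByStage(List<Step> steps) {
        Map<String, List<Step>> byStage = new HashMap<>();
        for (Step step : steps) {
            byStage.computeIfAbsent(step.parentStageId(), k -> new ArrayList<>())
                    .add(step);
        }
        return byStage;
    }
}
```

Per-stage lookups afterwards are O(1) map reads, which is what collapses the ~1.8 billion comparisons to a single linear scan.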

Per-run compute dedup. Concurrent readers for the same run now serialise on a per-run monitor so N parallel /tree or /allSteps callers don't each do the same O(nodes) rebuild. A fast-path re-check means repeat readers almost always see a cached result.
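A minimal sketch of that dedup pattern, assuming a per-run monitor map and a cache (names hypothetical; the real code invalidates per graph version, which this sketch omits):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical per-run compute dedup: concurrent readers of the same run
// serialise on one monitor, and the re-check inside the lock means only the
// first caller pays for the expensive rebuild.
final class PerRunDedup<V> {
    private final Map<String, Object> monitors = new ConcurrentHashMap<>();
    private final Map<String, V> cache = new ConcurrentHashMap<>();

    V compute(String runId, Supplier<V> expensiveRebuild) {
        V fast = cache.get(runId); // fast path: repeat readers skip the lock
        if (fast != null) {
            return fast;
        }
        Object monitor = monitors.computeIfAbsent(runId, k -> new Object());
        synchronized (monitor) {
            // Re-check under the monitor: a racing caller may have filled it.
            return cache.computeIfAbsent(runId, k -> expensiveRebuild.get());
        }
    }
}
```

With N pollers hitting the same run, the rebuild runs once and the other N-1 callers block briefly and then read the cached value.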

Eliminated hot-path debug formatting. Two guarded debug lines were doing String.join(", ", knownNodes) inside per-node lookups — with DEBUG enabled on io.jenkins.plugins.pipelinegraphview that was allocating a 300k-char string per enclosing-id check.
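The PR removes those lines outright; for reference, the standard way to keep such diagnostics without eager formatting is java.util.logging's Supplier overload, where the join only runs if the level is actually enabled (stand-alone illustration, not the plugin's code):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

final class LazyLogDemo {
    static final Logger LOG = Logger.getLogger(LazyLogDemo.class.getName());
    static final AtomicInteger FORMAT_CALLS = new AtomicInteger();

    static void lookup(List<String> knownNodes) {
        // Eager variant would be: LOG.fine("known: " + String.join(", ", knownNodes));
        // which allocates the full string even when FINE is disabled.
        LOG.fine(() -> {
            FORMAT_CALLS.incrementAndGet();
            return "known: " + String.join(", ", knownNodes);
        });
    }
}
```

With the default INFO level the supplier is never invoked, so the 300k-char join never happens.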

Switched JSON serialization to Jackson 3. net.sf.json's JSONObject.toString was spending ~9s of CPU on the 15MB /allSteps response. All four plugin JSON endpoints (/tree, /allSteps, /steps, /runs) now stream through a single PipelineJsonWriter backed by jackson3-api. Wire format is preserved (Stapler {"status":"ok","data":...} envelope, null fields omitted, enum names lowercased). All eight JsonBeanProcessor/JsonValueProcessor classes have been removed in favour of Jackson annotations.

Numbers (300k-node stress pipeline)

  Endpoint    Before   After
  /tree       90+ s    200-600 ms typical
  /allSteps   20-50 s  1-2 s typical

Follow-ups (out of scope for this PR)

  • Incremental DTO assembly — the scanner pass is still O(N) per request; a future change could cache wrapped nodes per version so reads become O(ΔN).
  • StatusAndTiming.findWorstWarningBetween still walks every node in a stage calling getPersistentAction(WarningAction.class). Caching the result on the live snapshot (WarningActions don't change
    after attach) would remove the remaining ~1s of per-request cost.
  • Annotate-away the last net.sf.json uses in PipelineConsoleViewAction (small responses on doRerun/hasNextBuild/doCancel — not on the perf path).

These are the results from a job with ~250k flow nodes:

  14:11:14 allSteps   200    2502.9ms   2327032 bytes
  14:11:18 tree       200     641.6ms     48689 bytes
  14:11:19 allSteps   200    1367.4ms   2328882 bytes
  14:11:23 tree       200     761.2ms     49674 bytes
  14:11:26 allSteps   200    2797.3ms   2329862 bytes
  14:11:30 tree       200     799.4ms     49674 bytes
  14:11:31 allSteps   200    1715.4ms   2332203 bytes
  14:11:35 tree       200     647.0ms     49674 bytes
  14:11:37 allSteps   200    1687.9ms   2334295 bytes
  14:11:41 tree       200     725.4ms     49674 bytes
  14:11:42 allSteps   200    1289.8ms   2336273 bytes
  14:11:46 tree       200    1179.6ms     49674 bytes
  14:11:48 allSteps   200    1917.1ms   2338108 bytes
  14:11:52 tree       200    1065.7ms     49674 bytes
  14:11:56 allSteps   200    3610.0ms   2340317 bytes
  14:12:00 tree       200     894.2ms     49674 bytes
  14:12:03 allSteps   200    3052.3ms   2342736 bytes
  14:12:08 tree       200    2296.7ms     49674 bytes
  14:12:11 allSteps   200    2033.6ms   2345456 bytes
  14:12:14 tree       200     627.4ms     49674 bytes
  14:12:16 allSteps   200    1474.2ms   2347573 bytes
  14:12:20 tree       200     860.7ms     49674 bytes
  14:12:22 allSteps   200    1813.7ms   2349559 bytes
  14:12:25 tree       200     705.5ms     49674 bytes
  14:12:27 allSteps   200    1554.2ms   2351569 bytes
  14:12:31 tree       200     637.2ms     49674 bytes
  14:12:32 allSteps   200    1698.5ms   2353813 bytes
  14:12:36 tree       200     617.6ms     49674 bytes
  14:12:37 allSteps   200    1227.1ms   2355776 bytes
  14:12:41 tree       200     673.1ms     49674 bytes
  14:12:43 allSteps   200    1339.9ms   2357638 bytes
  14:12:46 tree       200     708.2ms     49674 bytes
  14:12:48 allSteps   200    1583.1ms   2359695 bytes

Future work

  • /tree fast-path that drops step nodes from the rebuild input — the endpoint only renders stages.
  • Incremental graph construction: update stage / step maps as onNewHead fires instead of rebuilding per request. Would make HTTP reads O(1).
  • cache findWorstWarningBetween results on the live snapshot alongside the ancestry/hideFromView data we already capture at onNewHead time. WarningActions are attached once
    per node and never change, so the cache is stable per-version. That would knock another second or two off /allSteps.
  • If the user wants an immediate workaround without code changes, they can set -Dio.jenkins.plugins.pipelinegraphview.analysis.StatusAndTiming.DISABLE_WARNING_ACTION_LOOKUP=true to skip the scan entirely (at the cost of losing UNSTABLE status detection).

/allSteps future work:
1. Switch this endpoint to Jackson: same DTO, ~5-10× faster serialization. Stapler's default net.sf.json behaviour needs care; the easiest route is to write to rsp.getWriter() directly with an ObjectMapper and set the content type. Self-contained change, no wire-format diff. (Done; it made quite a difference for /allSteps.)
2. Stop sending all 300k steps on every 3s poll. The front-end only cares about steps for stages currently on screen. This is the real architectural fix but it's a frontend+backend refactor.

Testing done

Tested with the 300k-node pipeline; it's now near-instant.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@timja timja added the enhancement New feature or request label Apr 22, 2026
@timja timja force-pushed the issue-884-save-as-graph-progresses branch from 6e69845 to eeb3475 on April 22, 2026 10:19
@timja timja force-pushed the issue-884-save-as-graph-progresses branch from eeb3475 to a0d3a43 on April 22, 2026 10:20
@timja timja changed the title Save as graph progresses Compute graph as build progresses Apr 22, 2026
Member Author

timja commented Apr 22, 2026

@jimklimov I'd be interested to know whether it's possible for you to test this on your instance.

I've done some basic testing and it seems to work functionally but haven't done any serious performance testing on it.

Copilot AI left a comment

Pull request overview

This PR implements an in-memory “live graph” path for in-progress Pipeline runs so graph/steps DTOs can be produced without repeatedly walking the entire FlowNode graph via DepthFirstScanner on each HTTP request (fixes #884).

Changes:

  • Added a new livestate/ subsystem (registry + per-run state + listeners) to accumulate FlowNodes as the build runs and expose immutable snapshots.
  • Updated PipelineGraphApi / PipelineStepApi to prefer live snapshots (with per-version DTO caching) and fall back to the existing scanner path when unavailable.
  • Added constructors to PipelineNodeTreeScanner / PipelineNodeGraphAdapter to build from a pre-collected node set, plus new integration tests for lifecycle/caching behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:
src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphState.java Per-run mutable node accumulator + per-version DTO caches.
src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphSnapshot.java Immutable snapshot record of observed nodes/workspace nodes + version.
src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphRegistry.java Singleton registry (Caffeine) for live states + escape-hatch feature flag.
src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphPopulator.java GraphListener.Synchronous extension feeding FlowNodes into LiveGraphState with catch-up.
src/main/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphLifecycle.java FlowExecutionListener extension creating/evicting state on start/resume/complete.
src/main/java/io/jenkins/plugins/pipelinegraphview/utils/PipelineGraphApi.java Prefer live snapshot + per-version graph caching; reduce per-stage scans using workspace-node list.
src/main/java/io/jenkins/plugins/pipelinegraphview/utils/PipelineStepApi.java Prefer live snapshot + per-version steps caching.
src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/PipelineNodeTreeScanner.java New constructor + factoring to build from a caller-supplied node collection.
src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/PipelineNodeGraphAdapter.java New constructor to use the pre-collected node set.
src/test/java/io/jenkins/plugins/pipelinegraphview/livestate/LiveGraphLifecycleTest.java Integration tests covering live population, eviction, feature flag, and cache hit/miss behavior.


Comment on lines +75 to +80
    void cacheGraph(long version, PipelineGraph graph) {
        VersionedCache<PipelineGraph> current = cachedGraph;
        if (current == null || current.version < version) {
            cachedGraph = new VersionedCache<>(version, graph);
        }
    }

Copilot AI Apr 22, 2026


cacheGraph is not thread-safe: two concurrent callers can both observe the same current value and then overwrite cachedGraph out of order, allowing an older version to replace a newer one. This can cause stale PipelineGraph instances to be served even after a newer graph was cached. Make the update monotonic (e.g., synchronize cacheGraph, or use an AtomicReference/CAS loop that only replaces the cache when the stored version is still older).
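One way to make the update monotonic, as the review suggests, is a CAS loop on an AtomicReference that only installs strictly newer versions (hypothetical stand-alone sketch, not the plugin's code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Version-monotonic cache: a stale writer can never clobber a newer value,
// even when two callers race on the check-then-set.
final class MonotonicCache<V> {
    record Versioned<T>(long version, T value) {}

    private final AtomicReference<Versioned<V>> ref = new AtomicReference<>();

    void put(long version, V value) {
        while (true) {
            Versioned<V> current = ref.get();
            if (current != null && current.version() >= version) {
                return; // an equal-or-newer value is already cached
            }
            if (ref.compareAndSet(current, new Versioned<>(version, value))) {
                return;
            }
            // Lost the race: re-read and re-check against the winner.
        }
    }

    V get(long minVersion) {
        Versioned<V> current = ref.get();
        return (current != null && current.version() >= minVersion) ? current.value() : null;
    }
}
```

A synchronized cacheGraph would work too; the CAS loop just avoids taking a monitor on the read-mostly path.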

Comment on lines +87 to +92
    void cacheAllSteps(long version, PipelineStepList steps) {
        VersionedCache<PipelineStepList> current = cachedAllSteps;
        if (current == null || current.version < version) {
            cachedAllSteps = new VersionedCache<>(version, steps);
        }
    }

Copilot AI Apr 22, 2026


cacheAllSteps has the same race as cacheGraph: the check-then-set on the volatile cachedAllSteps reference can regress the cached version under concurrent requests, leading to stale step lists being served. Use synchronization or an atomic compare-and-set style update to guarantee the cached version never decreases.

Comment on lines +58 to +63
    synchronized LiveGraphSnapshot snapshot() {
        if (poisoned) {
            return null;
        }
        return new LiveGraphSnapshot(List.copyOf(nodes), List.copyOf(workspaceNodes), version);
    }

Copilot AI Apr 22, 2026


snapshot() is synchronized and performs List.copyOf(...) over the full node lists on every call. For large in-progress graphs and frequent polling, this is O(N) work while holding the monitor that addNode also needs, which can block the CPS VM thread. Consider memoizing the last LiveGraphSnapshot and returning it when version is unchanged, or publishing an immutable snapshot from addNode so readers can fetch it without copying under the lock.
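The memoization the review suggests can be sketched like this (hypothetical names): keep the last snapshot and hand it back unchanged while no new node has arrived, so repeat pollers skip the O(N) copy entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class MemoizedState {
    record Snapshot(List<String> nodes, long version) {}

    private final List<String> nodes = new ArrayList<>();
    private long version;
    private Snapshot lastSnapshot;

    synchronized void addNode(String id) {
        nodes.add(id);
        version++;
    }

    synchronized Snapshot snapshot() {
        if (lastSnapshot == null || lastSnapshot.version() != version) {
            // O(N) copy only when something actually changed.
            lastSnapshot = new Snapshot(List.copyOf(nodes), version);
        }
        return lastSnapshot; // same immutable instance while the version is unchanged
    }
}
```

Since the snapshot is immutable, returning the same instance to many readers is safe.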

Comment on lines +26 to +33
/**
* Non-null when the live-state path supplied a pre-collected list of nodes carrying a
* {@link WorkspaceAction}. Avoids the per-stage {@code DepthFirstScanner} walk inside
* {@link #getStageNode(FlowNodeWrapper)}.
*/
@CheckForNull
private transient List<FlowNode> workspaceNodesOverride;


Copilot AI Apr 22, 2026


workspaceNodesOverride is mutable shared state on the PipelineGraphApi instance. PipelineConsoleViewAction holds a single PipelineGraphApi per run and in-progress requests are not synchronized (see PipelineGraphViewCache#getGraph), so concurrent HTTP calls can race: one thread can clear/replace workspaceNodesOverride while another is in getStageNode, causing unpredictable fallback scanning and inconsistent performance. Avoid using an instance field for this; instead, thread the candidate iterable as a local variable/parameter (e.g., pass the workspace-node list into getStageNode or into createTree so each request stays isolated).


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.



Comment on lines 119 to 137
LiveGraphRegistry.get().cacheAllSteps(run, snapshot.version(), computed);
return computed;
}
return getAllSteps(CachedPipelineNodeGraphAdaptor.instance.getFor(run), runIsComplete);

Copilot AI Apr 22, 2026


computeAllSteps() was changed from package-private to public so LiveGraphLifecycle (different package) can call it. This makes an internal implementation detail part of the plugin's public API. If it is not meant for external callers, consider keeping it non-public via package refactoring, or annotate it with @Restricted(DoNotUse.class) to match the existing pattern used for internal-only public helpers (e.g., StatusAndTiming#printNodes).

Comment on lines +63 to +69
    synchronized LiveGraphSnapshot snapshot() {
        if (poisoned) {
            return null;
        }
        if (lastSnapshot != null && lastSnapshot.version() == version) {
            return lastSnapshot;
        }

Copilot AI Apr 22, 2026


LiveGraphState.snapshot() memoizes and returns lastSnapshot when the node version hasn't changed. This can make workspaceNodes stale: WorkspaceAction may be attached to an existing FlowNode after onNewHead fires (as noted in the class Javadoc), but with no new nodes the version won't bump, so subsequent snapshot() calls can miss the newly-attached WorkspaceAction and return incorrect agent/node mapping. Consider recomputing workspaceNodes even when version is unchanged (e.g., reuse the previously-copied nodes list but rescan actions), or drop the lastSnapshot fast-path to preserve correctness.

Comment on lines +167 to +183
    @@ -160,7 +168,18 @@
         public PipelineGraph createTree() {
             return PipelineGraphViewCache.get().getGraph(run, this::computeTree);
         }

    -    PipelineGraph computeTree() {
    -        return createTree(CachedPipelineNodeGraphAdaptor.instance.getFor(run));
    +    public PipelineGraph computeTree() {
    +        LiveGraphSnapshot snapshot = LiveGraphRegistry.get().snapshot(run);
    +        if (snapshot != null) {
    +            PipelineGraph cached = LiveGraphRegistry.get().cachedGraph(run, snapshot.version());
    +            if (cached != null) {
    +                return cached;
    +            }
    +            PipelineGraph computed =
    +                    createTree(new PipelineNodeGraphAdapter(run, snapshot.nodes()), snapshot.workspaceNodes());
    +            LiveGraphRegistry.get().cacheGraph(run, snapshot.version(), computed);
    +            return computed;
    +        }
    +        return createTree(CachedPipelineNodeGraphAdaptor.instance.getFor(run), null);

Copilot AI Apr 22, 2026


computeTree() was changed from package-private to public solely so LiveGraphLifecycle (different package) can call it. Exposing this as a public method effectively expands the plugin's supported API surface. If this is intended as an internal hook, consider keeping it non-public and moving the caller into the same package, or annotate the method with @Restricted(DoNotUse.class) (consistent with StatusAndTiming#printNodes) to avoid implying external use is supported.


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Comment on lines +50 to +70
    synchronized LiveGraphSnapshot snapshot() {
        if (poisoned || !ready) {
            return null;
        }
        // Filter for WorkspaceAction at snapshot time rather than at addNode time:
        // WorkspaceAction is attached to a block-start node when the workspace is allocated,
        // which can happen AFTER onNewHead has already fired for that node. A snapshot-time
        // scan always observes the latest action state on each captured FlowNode.
        //
        // The list is built newest-first (reverse insertion order) to match the iteration
        // order of DepthFirstScanner (from current heads backward): PipelineGraphApi#getStageNode
        // returns on the first match, and for nested agents the innermost workspace is the
        // more-specific match — the one a later-created inner `node {}` block sits in.
        List<FlowNode> workspaceNodes = new ArrayList<>();
        for (int i = nodes.size() - 1; i >= 0; i--) {
            FlowNode n = nodes.get(i);
            if (n.getAction(WorkspaceAction.class) != null) {
                workspaceNodes.add(n);
            }
        }
        return new LiveGraphSnapshot(List.copyOf(nodes), List.copyOf(workspaceNodes), version);

Copilot AI Apr 23, 2026


LiveGraphState.snapshot() holds the instance monitor while (a) scanning the entire nodes list for WorkspaceAction and (b) copying both nodes and workspaceNodes. Since addNode() is called from GraphListener.Synchronous on the CPS VM thread, this means an API read can block pipeline execution for O(N) time on large graphs. Consider minimizing time under the monitor by copying nodes+version under lock and doing the WorkspaceAction filtering outside the synchronized block (or maintaining a separate concurrent/workspace index) so writers are less likely to stall.
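The shape of that suggestion, as a stand-alone sketch (hypothetical names): hold the monitor only long enough to copy the list, then run the per-node filter lock-free so the writer thread is not stalled for the O(N) scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ShortLockState {
    private final List<String> nodes = new ArrayList<>();

    synchronized void addNode(String id) {
        nodes.add(id);
    }

    // Monitor held only for the copy; the O(N) filter runs outside the lock,
    // so addNode() callers contend with a cheap copy rather than a full scan.
    List<String> filteredSnapshot(Predicate<String> keep) {
        List<String> nodesCopy;
        synchronized (this) {
            nodesCopy = List.copyOf(nodes);
        }
        List<String> out = new ArrayList<>();
        for (String n : nodesCopy) {
            if (keep.test(n)) {
                out.add(n);
            }
        }
        return out;
    }
}
```

The copy is still O(N) under the lock, as the comment notes; a further step would be an incrementally maintained index so readers never copy at all.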

Comment on lines +60 to +66
    PipelineGraph graph = new PipelineGraphApi(run).computeTree();
    PipelineStepList allSteps = new PipelineStepApi(run).computeAllSteps();
    // WorkflowRun.isBuilding() can still be true here even though FlowExecution
    // is complete; rebuild with runIsComplete=true so the persisted copy matches
    // reality. PipelineGraph.complete already reflects FlowExecution.isComplete().
    PipelineStepList finalSteps = new PipelineStepList(allSteps.steps, true);
    PipelineGraphViewCache.get().seed(run, graph, finalSteps);

Copilot AI Apr 23, 2026


In onCompleted(), finalSteps reuses the same mutable List instance via new PipelineStepList(allSteps.steps, true). Since PipelineStepList.steps is publicly mutable, this aliases the list between the cached in-memory DTO (potentially concurrently read) and the persisted payload being written to disk. Prefer copying the list (or adding a constructor that defensively copies) when creating the “final” persisted version.

Comment on lines +52 to +58
    static void catchUp(FlowExecution execution, LiveGraphState state) {
        try {
            DepthFirstScanner scanner = new DepthFirstScanner();
            scanner.setup(execution.getCurrentHeads());
            for (FlowNode existing : scanner) {
                state.addNode(existing);
            }

Copilot AI Apr 23, 2026


LiveGraphState.snapshot() assumes nodes insertion order is chronological (oldest→newest) and reverses it to approximate DepthFirstScanner (newest-first). In catchUp(), you iterate DepthFirstScanner directly and call state.addNode(existing) in that iteration order, which is already newest-first; after a restart this makes the subsequent reverse-scan produce oldest-first workspace candidates, so PipelineGraphApi#getStageNode may pick an outer workspace instead of the most-specific one. Consider collecting scanner output and adding nodes to the state in reverse (oldest→newest) to keep insertion order consistent with the live onNewHead path.


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



Comment on lines +107 to +115
        return (cachedAllSteps != null && cachedAllSteps.version >= minVersion) ? cachedAllSteps.value : null;
    }

    synchronized void cacheAllSteps(long version, PipelineStepList steps) {
        if (cachedAllSteps == null || cachedAllSteps.version < version) {
            cachedAllSteps = new VersionedCache<>(version, steps);
        }
    }


Copilot AI Apr 23, 2026


The live-state caches return the same PipelineStepList instance across calls, but PipelineStepList.steps is a publicly mutable List. That means a consumer (or even accidental mutation during serialization) can corrupt the cached value and affect other concurrent readers. Consider caching an immutable/defensive copy (or wrapping steps as unmodifiable) before storing/returning it from the live-state cache.

Suggested change (copy the list defensively when caching and returning):

        return (cachedAllSteps != null && cachedAllSteps.version >= minVersion)
                ? copyPipelineStepList(cachedAllSteps.value)
                : null;
    }

    synchronized void cacheAllSteps(long version, PipelineStepList steps) {
        if (cachedAllSteps == null || cachedAllSteps.version < version) {
            cachedAllSteps = new VersionedCache<>(version, copyPipelineStepList(steps));
        }
    }

    private PipelineStepList copyPipelineStepList(PipelineStepList steps) {
        return new PipelineStepList(List.copyOf(steps.steps));
    }

Comment on lines +58 to +85
    public void onCompleted(@NonNull FlowExecution execution) {
        try {
            WorkflowRun run = workflowRunFor(execution);
            if (run != null) {
                PipelineGraph graph;
                PipelineStepList allSteps;
                LiveGraphSnapshot snapshot = LiveGraphRegistry.get().snapshot(run);
                if (snapshot != null) {
                    // Share a single adapter (and therefore a single tree-scanner pass) for
                    // both graph and steps rather than paying the cost twice.
                    PipelineNodeGraphAdapter adapter = new PipelineNodeGraphAdapter(run, snapshot.nodes());
                    // runIsComplete=true here directly: WorkflowRun.isBuilding() can still be
                    // true even though FlowExecution is complete. PipelineGraph.complete is
                    // already derived from FlowExecution.isComplete() inside createTreeFrom.
                    graph = new PipelineGraphApi(run).createTreeFrom(adapter, snapshot.workspaceNodes());
                    allSteps = new PipelineStepApi(run).getAllStepsFrom(adapter, true);
                } else {
                    // No live state (feature disabled, poisoned, plugin installed mid-build).
                    // Fall back to the scanner-backed paths and rebuild steps with
                    // runIsComplete=true to keep the persisted copy consistent. Defensively
                    // copy the list: PipelineStepList.steps is publicly mutable, and `raw`
                    // may be held by the DTO cache or returned to a concurrent HTTP reader.
                    graph = new PipelineGraphApi(run).computeTree();
                    PipelineStepList raw = new PipelineStepApi(run).computeAllSteps();
                    allSteps = new PipelineStepList(new ArrayList<>(raw.steps), true);
                }
                PipelineGraphViewCache.get().seed(run, graph, allSteps);
            }

Copilot AI Apr 23, 2026


onCompleted now computes and writes the graph/steps for every finished run (including when the live-state feature flag is disabled, since snapshot() will be null and the code falls back to a full scanner sweep). This can add noticeable CPU and I/O to build finalization even if nobody ever requests the view. Consider honoring the .enabled flag here (or adding a separate knob for seeding), and/or short-circuiting if the on-disk cache file already exists and matches the current schema.

Comment on lines +72 to +88
    /**
     * Writes a final graph and step list directly to disk, bypassing the {@code isBuilding}
     * guard used by {@link #getGraph} / {@link #getAllSteps}. Intended for use at
     * {@code FlowExecutionListener.onCompleted}, where the execution is known complete but
     * {@code WorkflowRun.isBuilding()} may not have flipped yet. Calling this avoids wasting
     * the work already done by the live-state path: without it, the first read after
     * completion falls through to a full scanner sweep.
     */
    public void seed(WorkflowRun run, PipelineGraph graph, PipelineStepList allSteps) {
        CachedPayload payload = load(run);
        synchronized (payload) {
            payload.graph = graph;
            payload.allSteps = allSteps;
            payload.schemaVersion = SCHEMA_VERSION;
            write(run, payload);
        }
    }

Copilot AI Apr 23, 2026


seed(...) is described as an internal hook for FlowExecutionListener.onCompleted, but it is a new public API surface. To avoid accidentally committing to it for external consumers, consider marking it @Restricted(NoExternalUse.class) (or otherwise limiting visibility) similar to the other newly-added internal compute/build entry points.

Comment on lines +43 to +46
    private final Cache<String, LiveGraphState> states = Caffeine.newBuilder()
            .maximumSize(CACHE_MAX_SIZE)
            .expireAfterAccess(Duration.ofMinutes(30))
            .build();

Copilot AI Apr 23, 2026


expireAfterAccess(30m) can evict a live state while a run is still in progress (e.g. long input/sleep/semaphore with no new heads and no API reads). If that happens, the next onNewHead recreates a fresh LiveGraphState that never becomes ready (since onRunning/onResumed won't fire again), so the live-state path stays permanently disabled for that run. Consider removing the TTL for in-progress runs, switching to a much longer expireAfterWrite, or ensuring recreated states are marked ready (possibly with a safe catch-up) so eviction doesn't break functionality.

Comment on lines +68 to +74
    synchronized (this) {
        if (poisoned || !ready) {
            return null;
        }
        nodesCopy = List.copyOf(nodes);
        v = version;
    }

Copilot AI Apr 23, 2026


poisoned is set without synchronization, but snapshot() only checks it while holding the monitor and then releases the lock to scan for WorkspaceAction. If poison() happens after the synchronized block, snapshot() can still return a non-null snapshot from a now-poisoned state (which defeats the “poison → always fall back to scanner” contract). Consider re-checking poisoned just before returning, or making poison() coordinated with snapshot (e.g., synchronized + a second check) while still keeping the WorkspaceAction scan outside the lock.


Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.



@timja timja requested a review from Copilot April 23, 2026 13:12
@timja timja marked this pull request as ready for review April 23, 2026 13:12
@timja timja requested a review from a team as a code owner April 23, 2026 13:12
@timja timja changed the title Compute graph as build progresses Make Pipeline Overview responsive on large in-progress builds Apr 23, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.



@timja timja requested a review from Copilot April 23, 2026 13:36
@timja timja enabled auto-merge (squash) April 23, 2026 13:38

Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated no new comments.



@lewisbirks
Contributor

Switched JSON serialization to Jackson 3. net.sf.json's JSONObject.toString was spending ~9s of CPU on the 15MB /allSteps response. All four plugin JSON endpoints (/tree, /allSteps,
/steps, /runs) now stream through a single PipelineJsonWriter backed by jackson3-api. Wire format is preserved (Stapler {"status":"ok","data":...} envelope, null fields omitted, enum names
lowercased). All eight JsonBeanProcessor/JsonValueProcessor classes have been removed in favour of Jackson annotations.

do we want to split this out into its own PR which we can merge first to make this one more focussed?


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Process graph as build progresses

3 participants