Skip to content

Side Input improvements#38363

Open
reuvenlax wants to merge 16 commits intoapache:masterfrom
reuvenlax:ontimer_sideinput
Open

Side Input improvements#38363
reuvenlax wants to merge 16 commits intoapache:masterfrom
reuvenlax:ontimer_sideinput

Conversation

@reuvenlax
Copy link
Copy Markdown
Contributor

@reuvenlax reuvenlax commented May 3, 2026

Allow side inputs to be accessed in OnTimer and OnWindowExpiration.

Fix latent bug in Dataflow runner. Previously the cleanup timer was set when an element was seen, even if that element was buffered. This could cause the cleanup timer to fire early, while those elements are still buffered. The new code doesn't set a cleanup timer for a window until an element is actually processed for that window (i.e. the window is unblocked).

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables the use of side inputs within timer-related callbacks, specifically @ontimer and @OnWindowExpiration. It addresses limitations in the current runner implementations to ensure that side inputs are correctly fetched and cached when these callbacks are triggered. Additionally, it includes robust fixes for streaming side input handling, ensuring that blocked elements and timers are properly managed and persisted across bundle boundaries.

Highlights

  • Side Input Support in Timers: Enabled side input access within @ontimer and @OnWindowExpiration methods by updating DoFn signatures and context implementations.
  • Streaming Side Input Fixes: Improved handling of blocked windows in streaming pipelines, ensuring timers are correctly managed and persisted when side inputs become available.
  • Testing Infrastructure: Added a new test category 'UsesSideInputsInTimer' and expanded ParDo tests to validate side input availability in timer callbacks.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for side inputs within @ontimer and @OnWindowExpiration methods in DoFn. The implementation refactors SimpleParDoFn to handle streaming side inputs directly via a StreamingSideInputFetcher, replacing the previous StreamingSideInputDoFnRunner wrapper. It also updates the Beam SDK to permit side input parameters in these callbacks and provides necessary implementations for the Dataflow worker and Fn API runner. Feedback includes a high-priority suggestion to nullify the sideInputFetcher in finishBundle to prevent memory leaks, an optimization to move redundant state cleanup registration out of a loop, the removal of a side-effect-free line in StreamingSideInputFetcher, and a recommendation to replace long Thread.sleep calls in tests with more deterministic approaches like TestStream.

Comment on lines +504 to +506
if (sideInputFetcher != null) {
sideInputFetcher.persist();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The sideInputFetcher field should be cleared at the end of the bundle. Since SimpleParDoFn instances are typically reused across bundles in the Dataflow worker, failing to null out this field can lead to stale state being carried over to subsequent bundles or potential memory leaks. Clearing it here ensures that each bundle starts with a clean state.

Suggested change
if (sideInputFetcher != null) {
sideInputFetcher.persist();
}
if (sideInputFetcher != null) {
sideInputFetcher.persist();
sideInputFetcher = null;
}

Comment on lines +346 to +353
boolean hasState = fnSignature != null && !fnSignature.stateDeclarations().isEmpty();
if (hasState) {
// These elements are now processed. Register cleanup timers for all the unblocked
// windows.
registerStateCleanup(
(WindowingStrategy<?, W>) getDoFnInfo().getWindowingStrategy(),
(Collection<W>) readyWindows);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to registerStateCleanup is located inside the loop that iterates over elementsBags. Since it is passed the entire readyWindows collection, it will redundantly register cleanup timers for all unblocked windows in every iteration of the loop. This should be moved outside the loop to improve performance, especially when multiple windows are unblocked simultaneously.

Comment on lines +212 to +213
Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window);
blockedMap().get(window);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to blockedMap().get(window) on line 213 is a side-effect-free operation whose result is ignored. It appears to be a leftover from refactoring and should be removed to keep the code clean.

Suggested change
Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window);
blockedMap().get(window);
Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window);

new DoFn<KV<String, String>, String>() {
@ProcessElement
public void process(OutputReceiver<String> o) throws InterruptedException {
Thread.sleep(java.time.Duration.ofSeconds(15).toMillis());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Thread.sleep(15000) in a test is highly discouraged as it significantly increases the execution time of the test suite and can lead to flakiness. Consider using a much shorter duration or a more deterministic approach using TestStream to simulate the side input availability delay if the test environment allows.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@github-actions github-actions Bot removed the direct label May 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@281270c). Learn more about missing BASE report.
⚠️ Report is 72 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #38363   +/-   ##
=========================================
  Coverage          ?   72.59%           
  Complexity        ?     3629           
=========================================
  Files             ?      335           
  Lines             ?    30806           
  Branches          ?     3750           
=========================================
  Hits              ?    22365           
  Misses            ?     6736           
  Partials          ?     1705           
Flag Coverage Δ
java 72.59% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@reuvenlax
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables side input access within OnTimer and OnWindowExpiration contexts for the Beam Java SDK and Dataflow runner. It refactors streaming side input logic into a StreamingSideInputProcessor class, integrated into SimpleParDoFn and StreamingSideInputDoFnRunner. Feedback identifies a missing progress tracker notification for unblocked elements, a potential memory risk from eager list accumulation in the processor, and opportunities to optimize list allocations and test execution times.

Comment on lines +337 to +338
for (WindowedValue<InputT> unblockedElement : unblockedElements) {
fnRunner.processElement(unblockedElement);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The outputsPerElementTracker is not notified when processing unblocked elements in startBundle. This can lead to incorrect progress tracking and metrics in Dataflow, as the tracker expects onProcessElement() to be called before processElement(). You should wrap the processElement call with the appropriate tracker notifications.

      for (WindowedValue<InputT> unblockedElement : unblockedElements) {
        outputsPerElementTracker.onProcessElement();
        fnRunner.processElement(unblockedElement);
        outputsPerElementTracker.onProcessElementSuccess();

Comment on lines +49 to +56
List<WindowedValue<InputT>> releaseElements = Lists.newArrayList();
for (BagState<WindowedValue<InputT>> elementsBag : elementsBags) {
Iterable<WindowedValue<InputT>> elements = elementsBag.read();
for (WindowedValue<InputT> elem : elements) {
releaseElements.add(elem);
}
elementsBag.clear();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Flattening all unblocked elements from all ready windows into a single ArrayList poses a significant memory risk and could lead to OutOfMemoryError if a large number of elements were buffered. The previous implementation in StreamingSideInputDoFnRunner processed elements bag-by-bag, which is much more memory-efficient. Consider refactoring this to process elements lazily or bag-by-bag to avoid accumulating everything in memory at once.

Comment on lines +368 to +373
windowsProcessed = Lists.newArrayList();
Iterable<WindowedValue<InputT>> elementsToProcess =
sideInputProcessor.handleProcessElement(elem);
for (WindowedValue<InputT> toProcess : elementsToProcess) {
fnRunner.processElement(toProcess);
windowsProcessed.addAll((Collection<W>) toProcess.getWindows());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The windowsProcessed list is allocated and populated even if hasState is false, in which case it is never used. Deferring the allocation and population of this list until it's known that state cleanup is required would be more efficient.

Suggested change
windowsProcessed = Lists.newArrayList();
Iterable<WindowedValue<InputT>> elementsToProcess =
sideInputProcessor.handleProcessElement(elem);
for (WindowedValue<InputT> toProcess : elementsToProcess) {
fnRunner.processElement(toProcess);
windowsProcessed.addAll((Collection<W>) toProcess.getWindows());
windowsProcessed = hasState ? Lists.newArrayList() : ImmutableList.of();
Iterable<WindowedValue<InputT>> elementsToProcess =
sideInputProcessor.handleProcessElement(elem);
for (WindowedValue<InputT> toProcess : elementsToProcess) {
fnRunner.processElement(toProcess);
if (hasState) {
windowsProcessed.addAll((Collection<W>) toProcess.getWindows());
}

new DoFn<KV<String, String>, String>() {
@ProcessElement
public void process(OutputReceiver<String> o) throws InterruptedException {
Thread.sleep(java.time.Duration.ofSeconds(15).toMillis());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Thread.sleep(15000) makes the test suite significantly slower. For simulating a delayed side input in these tests, a shorter duration (e.g., 5 seconds) is usually sufficient and helps maintain a faster CI pipeline.

Suggested change
Thread.sleep(java.time.Duration.ofSeconds(15).toMillis());
Thread.sleep(java.time.Duration.ofSeconds(5).toMillis());

@reuvenlax reuvenlax changed the title Side Input fixes Side Input improvements May 4, 2026
@reuvenlax
Copy link
Copy Markdown
Contributor Author

R: @ahmedabu98
R: @kennknowles

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@github-actions github-actions Bot added the prism label May 4, 2026
@reuvenlax
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for accessing side inputs within OnTimer and OnWindowExpiration methods in Beam's DoFn. It includes the necessary API changes in the SDK, implementation updates in the Fn API runner and Dataflow runner, and a new helper class StreamingSideInputProcessor to manage side input blocking and unblocking. The reviewer raised concerns regarding the use of a strict state assertion in the new processor and the use of Thread.sleep in the new test case, which could impact test stability and performance.

// be set via element processing (or another timer) in the same window, the window should be
// unblocked once
// we get here.
Preconditions.checkState(!sideInputFetcher.storeIfBlocked(timer));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of Preconditions.checkState(!sideInputFetcher.storeIfBlocked(timer)) assumes that side inputs are always ready when a timer fires. While this is generally true for user timers (which are set after an element is processed), it relies on the runner's guarantee that cleanup timers are only set after unblocking. If this assumption is ever violated, it will lead to an IllegalStateException rather than buffering the timer. Consider if it's safer to handle the blocked case by returning a boolean and letting the caller decide, or if this strict assertion is the intended behavior for the Dataflow runner.

new DoFn<KV<String, String>, String>() {
@ProcessElement
public void process(OutputReceiver<String> o) throws InterruptedException {
Thread.sleep(java.time.Duration.ofSeconds(15).toMillis());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Thread.sleep with a duration of 15 seconds in a test can lead to slow build times and potential flakiness in CI environments. While it's used here to simulate side input delay, consider if there's a more deterministic way to test this behavior, perhaps by using TestStream with multiple bundles or a custom PCollectionView that allows controlling readiness without blocking the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant