Feature/add max new dagruns to schedule by Nataneljpwd · Pull Request #64294 · apache/airflow

Nataneljpwd · 2026-03-27T12:27:37Z

When new dagruns are created in bulk (e.g. with TriggerDagRunOperator), the scheduler can struggle with the volume and cause other dagruns to starve.

This is due to the sort order in get_running_dag_runs_to_examine, which orders by last scheduling decision with nulls first. If a lot of new dagruns are created, the scheduler examines them first, and when the dags have a lot of tasks (hundreds to tens of thousands) this can cause the scheduler to stall, as it has to both examine a lot of dagruns and create new tasks for those dagruns.
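The starvation described above can be illustrated with a small stand-alone sketch. It does not use Airflow; the `Run` class, `nulls_first_key` and `pick_runs` are hypothetical names that only emulate a "nulls first" ordering on the last scheduling decision combined with a per-loop limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Run:
    run_id: str
    last_scheduling_decision: Optional[datetime]  # None = never examined


def nulls_first_key(run: Run):
    # Emulates ORDER BY last_scheduling_decision NULLS FIRST:
    # never-examined runs sort ahead of every examined run.
    decision = run.last_scheduling_decision
    return (decision is not None, decision or datetime.min)


def pick_runs(runs: List[Run], limit: int) -> List[Run]:
    # Take the first `limit` runs in nulls-first order (one scheduler loop).
    return sorted(runs, key=nulls_first_key)[:limit]


now = datetime(2026, 3, 27, 12, 0, 0)
existing = [Run(f"old_{i}", now - timedelta(minutes=i + 1)) for i in range(5)]
new_batch = [Run(f"new_{i}", None) for i in range(50)]

selected = pick_runs(existing + new_batch, limit=20)
starved = [r.run_id for r in existing if r not in selected]
```

With 50 never-examined runs and a limit of 20, every slot in this toy loop goes to a new run and all five existing runs are left out.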

When we tried to tune max_dagruns_per_loop_to_schedule, we got either starvation of other dagruns or the scheduler being reset because it did not emit a heartbeat for a long time and failed the readiness probe.

To fix this, a new configuration option, max_new_dagruns_per_loop_to_schedule, is added. It helps when a lot of new dagruns are created in large batches at the same time: the scheduler can keep looking at existing dagruns (so they are not starved and do not time out with no running / scheduled tasks) while still creating and managing the new dagruns.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)
No

Important

🛠️ Maintainer triage note for @Nataneljpwd · by @potiuk · 2026-06-17 14:51 UTC

Helpful heads-up from the maintainers — please address before this PR can be reviewed:

Failing test jobs: Low dep tests:core / All-core:LowestDeps:14:3.10:Core...Serialization, MySQL tests: core / DB-core:MySQL:8.0:3.10:Core...Serialization, Postgres tests: core / DB-core:Postgres:14:3.10:Core...Serialization, Sqlite tests: core / DB-core:Sqlite:3.10:Core...Serialization. Reproduce and fix locally, then push.
See the Pull Request quality criteria.

The ball is in your court — you've been assigned to this PR. Fix the above, then mark it Ready for review.

Automated triage: may be imperfect; a maintainer takes the next look.

Copilot

Pull request overview

This PR introduces a scheduler tuning knob to limit how many new (never-before-examined) running DagRuns are considered per scheduling loop, to reduce starvation/slowdown when large batches of DagRuns are created at once.

Changes:

Add scheduler.max_new_dagruns_per_loop_to_schedule config (default 0) and plumb it into DagRun selection.
Update DagRun.get_running_dag_runs_to_examine() to optionally split selection into “previously examined” vs “new” DagRuns.
Add/adjust unit tests to cover the new selection behavior.
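As a rough illustration of the "previously examined" vs "new" split summarized above, here is a minimal stand-alone sketch. It is not the PR's implementation: `Run` and `select_runs` are hypothetical names, and the sketch assumes that a value of 0 leaves the existing nulls-first behaviour unchanged and that previously examined runs fill whatever slots the capped new runs do not use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Run:
    run_id: str
    last_scheduling_decision: Optional[datetime]  # None = never examined


def select_runs(runs: List[Run], max_total: int, max_new: int) -> List[Run]:
    """Hypothetical selection logic; an assumption, not the PR's code."""
    new = [r for r in runs if r.last_scheduling_decision is None]
    old = sorted(
        (r for r in runs if r.last_scheduling_decision is not None),
        key=lambda r: r.last_scheduling_decision,
    )
    if max_new <= 0:
        # Assumed: 0 disables the split (new runs first, no separate cap).
        return (new + old)[:max_total]
    # Cap the new runs, then give the remaining slots to examined runs,
    # oldest scheduling decision first.
    picked_new = new[: min(max_new, max_total)]
    picked_old = old[: max_total - len(picked_new)]
    return picked_old + picked_new


now = datetime(2026, 3, 27, 12, 0, 0)
runs = [Run(f"old_{i}", now - timedelta(minutes=i + 1)) for i in range(30)]
runs += [Run(f"new_{i}", None) for i in range(50)]

unlimited = select_runs(runs, max_total=20, max_new=0)
capped = select_runs(runs, max_total=20, max_new=5)
new_in_unlimited = sum(r.last_scheduling_decision is None for r in unlimited)
new_in_capped = sum(r.last_scheduling_decision is None for r in capped)
```

In this toy setup the uncapped loop spends all 20 slots on new runs, while the capped loop examines 5 new runs and 15 existing ones.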

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
airflow-core/src/airflow/models/dagrun.py	Adds config-backed limit and changes running DagRun selection logic to optionally fetch “old” and “new” runs separately.
airflow-core/src/airflow/config_templates/config.yml	Documents the new scheduler configuration option.
airflow-core/tests/unit/models/test_dagrun.py	Adds tests for the new DagRun selection behavior and updates an existing test to handle the new return type.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

potiuk · 2026-05-18T10:48:13Z

@Nataneljpwd — Removing the ready for maintainer review label and converting back to draft. The branch now has merge conflicts with main that surfaced after the label was added.

The label's contract is that the PR is ready for maintainer review — a regression like this means the PR temporarily isn't. Rebase your branch onto the latest main, resolve conflicts, then mark "Ready for review" again to re-enter the queue.

git fetch upstream main && git rebase upstream/main, resolve, git push --force-with-lease. See the working-with-git docs.

No rush.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

potiuk · 2026-05-27T19:57:28Z

@Nataneljpwd — There is 1 unresolved review thread on this PR from kaxil, and you have pushed commits since the review (most recently the rebase that cleared the merge conflict). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, please mark the thread as resolved and ping the reviewer (kaxil) for a final look. They will either label the PR ready for maintainer review or follow up with additional feedback.

If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose.

eladkal · 2026-06-01T17:42:42Z

cc @kaxil waiting for 2nd review

kaxil · 2026-06-01T18:48:50Z

This one also needs a review from @ashb .

Also cc @BIS7 @ephraimbuddy -- who might be interested in reviewing it

ashb

I'm not convinced that this is the right fix. Tuning and configuring the scheduler is already nigh on impossible, and I am wary of adding more.

Additionally, couldn't the already existing max active runs controls be used here? That would keep most of the dagruns in the Queued state, meaning the scheduler only looks at, at most, 16 (by default, I think) newly created runs, which massively reduces the impact of "cause the scheduler to stall, as it has to both examine a lot of dagruns, and create new tasks for those dagruns." as it doesn't do that. That is why DagRuns can exist in the queued state.

Did you try this existing tunable first?
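The gating effect described in this comment can be sketched with a deliberately simplified model. This is not Airflow code: `promote_queued_runs` is a hypothetical helper, and the default of 16 is taken from the comment above.

```python
from typing import List, Tuple


def promote_queued_runs(
    queued: List[str], running_count: int, max_active_runs: int = 16
) -> Tuple[List[str], List[str]]:
    """Toy model for one dag: only free slots let queued runs start."""
    free_slots = max(0, max_active_runs - running_count)
    return queued[:free_slots], queued[free_slots:]


# A batch of 500 runs is created at once for a single dag.
queued = [f"run_{i}" for i in range(500)]
promoted, still_queued = promote_queued_runs(queued, running_count=0)
```

In this model only 16 of the 500 runs become running and need their tasks created; the remaining 484 stay queued until slots free up.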

Nataneljpwd · 2026-06-01T20:46:27Z

> I'm not convinced that this is the right fix. Tuning and configuring the scheduler is already nigh on impossible, and I am wary of adding more.

We have tried tuning both the scheduler count and max_dagruns_per_loop_to_schedule, and each time we hit a different issue. I understand the concern, but we run this change locally and it fixed our problem. The main problem is that dagruns are created in batches on our clusters, sometimes very large batches, and a new dagrun is heavier to process than a running one: the scheduler has to create its tasks when it starts, whereas other dagruns only occasionally (once tasks finish) create new tasks. At the same time, other dagruns are not moved to running because the scheduler is busy processing large batches of new dagruns.

We tried increasing the scheduler count quite a bit, in addition to increasing max_dagruns_per_loop_to_schedule. That caused scheduler heartbeat timeouts, as we had a lot of runs with mapped tasks, so we also had to raise that timeout to a very large value (10 minutes). On top of that, dagruns timed out because they were not examined, and in the Gantt view we saw large pauses between tasks where no task existed, so dagruns could cause other dagruns to miss their SLA. Even if I just create a medium backfill alongside my regular dags, which include mapped tasks, increasing the number of examined dagruns gives me one of the issues stated above.

> Additionally, couldn't the already existing max active runs controls be used here? That would keep most of the dagruns in the Queued state, meaning the scheduler only looks at, at most, 16 (by default, I think) newly created runs, which massively reduces the impact of "cause the scheduler to stall, as it has to both examine a lot of dagruns, and create new tasks for those dagruns." as it doesn't do that. That is why DagRuns can exist in the queued state.

As stated above, we tried to tune it: we changed it to around 300 and even tripled the scheduler count, yet for both batch-triggered runs and large backfills we still experienced the issue. We even tried dividing the batch size several times over (spreading it more evenly). The scheduler either accumulated a lot of queued dagruns and never finished the batch, or, when it was able to finish the batch, it was reset quite often due to not emitting a heartbeat and failing the readiness probe, or it hit an OOM, or other dagruns timed out (which is why we didn't increase the number beyond 300).

> Did you try this existing tunable first?

As stated above, yes, we have tried. I am pretty sure we tried all related configurations, as I have gone over all of the scheduler configurations.

ashb · 2026-06-02T07:00:16Z

No, not those. I mean the max_active_runs parameter to a dag

Nataneljpwd · 2026-06-02T10:17:14Z

> No, not those. I mean the max_active_runs parameter to a dag

That as well. When our clients changed it, we were unable to stay within the Dag's SLA, and more runs were created in a day than finished. It also happens when we have a lot of dags (over 1000) in one Airflow instance where we cap max active runs at 40 with a cluster policy, though most clients use the default of 16.

ephraimbuddy

This looks like a deployment specific mitigation. Do you have a simple repro/benchmark showing max_active_runs and other existing scheduler knobs cannot solve this?

Also, as I read it, the starvation comes from the nulls_first(last_scheduling_decision) ordering: never examined runs are always pulled to the front. Have you considered fixing the ordering itself instead? I think that would address the starvation without adding a new knob. Something like the below:

.order_by(
    nulls_first(cast("ColumnElement[Any]", BackfillDagRun.sort_ordinal), session=session),
    coalesce(cls.last_scheduling_decision, cls.run_after),
    cls.run_after,
)

Fair aging: never-examined runs are ordered by when they became eligible (run_after), not pulled ahead of everything. A run examined long ago still outranks one examined a second ago.
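To make the "fair aging" ordering concrete, here is a small stand-alone sketch of the `coalesce(last_scheduling_decision, run_after)` idea in plain Python. It leaves out the `BackfillDagRun.sort_ordinal` term from the snippet above, and `Run` and `fair_aging_key` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Run:
    run_id: str
    run_after: datetime
    last_scheduling_decision: Optional[datetime] = None  # None = never examined


def fair_aging_key(run: Run):
    # Emulates ORDER BY coalesce(last_scheduling_decision, run_after), run_after:
    # a never-examined run is ranked by when it became eligible.
    decision = run.last_scheduling_decision
    effective = run.run_after if decision is None else decision
    return (effective, run.run_after)


now = datetime(2026, 6, 8, 12, 0, 0)
runs = [
    Run("examined_1s_ago", now - timedelta(hours=1), now - timedelta(seconds=1)),
    Run("new_run", now - timedelta(minutes=1)),
    Run("examined_10m_ago", now - timedelta(hours=1), now - timedelta(minutes=10)),
]
ordered = [r.run_id for r in sorted(runs, key=fair_aging_key)]
```

Here the run examined ten minutes ago comes first, the never-examined run second, and the run examined a second ago last, so new runs no longer jump ahead of everything else.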

potiuk · 2026-06-09T08:26:55Z

@Nataneljpwd A few things need addressing before review — see our Pull Request quality criteria.

❌ Merge conflicts with main. See docs.

No rush.

Nataneljpwd added 2 commits March 23, 2026 21:48

added new_dagruns_to_examine configuration

2bce943

added an option to choose new dagruns to schedule amount

03bacfb

Nataneljpwd requested review from XD-DENG and ashb as code owners March 27, 2026 12:27

boring-cyborg Bot added the area:ConfigTemplates label Mar 27, 2026

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

0399261

Nataneljpwd marked this pull request as draft March 27, 2026 12:28

Nataneljpwd and others added 5 commits March 27, 2026 16:21

separated to 2 queries

c70108c

removed subquery

c5431b9

fixed mypy

36e0514

fixed mypy

25ced1c

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

bdf59c7

Nataneljpwd marked this pull request as ready for review March 27, 2026 17:24

Nataneljpwd added 2 commits March 29, 2026 08:18

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

920f06f

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

0f6948c

kaxil requested a review from Copilot April 2, 2026 00:42

Copilot started reviewing on behalf of kaxil April 2, 2026 00:43

Copilot AI reviewed Apr 2, 2026

potiuk added the ready for maintainer review label Apr 2, 2026

Nataneljpwd added 2 commits April 4, 2026 08:48

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

eec2612

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

9de6190

eladkal added this to the Airflow 3.2.1 milestone Apr 9, 2026

eladkal added the backport-to-v3-2-test label Apr 9, 2026

kaxil requested a review from Copilot April 10, 2026 19:55

Copilot started reviewing on behalf of kaxil April 10, 2026 19:56

Copilot AI reviewed Apr 10, 2026

vatsrahul1001 modified the milestones: Airflow 3.2.1, Airflow 3.2.2 Apr 15, 2026

Natanel Rudyuklakir and others added 2 commits April 15, 2026 22:04

Merge branch 'main' of https://github.com/apache/airflow into feature/add-max-new-dagruns-to-schedule

db00151

Update airflow-core/tests/unit/models/test_dagrun.py

e727f7e

Co-authored-by: Copilot <[email protected]>

vatsrahul1001 added this to the Airflow 3.3.0 milestone May 12, 2026

potiuk removed the ready for maintainer review label May 18, 2026

potiuk marked this pull request as draft May 18, 2026 10:48

Natanel Rudyuklakir added 2 commits May 21, 2026 19:44

resolve CR comments

1491844

Merge branch 'main' of https://github.com/apache/airflow into feature/add-max-new-dagruns-to-schedule

06261d2

eladkal removed the backport-to-v3-2-test label May 26, 2026

eladkal marked this pull request as ready for review May 27, 2026 09:34

eladkal requested a review from kaxil May 27, 2026 14:29

Nataneljpwd added 2 commits May 27, 2026 20:10

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

316f4af

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

28874df

Nataneljpwd added 3 commits June 1, 2026 21:55

Merge branch 'main' of https://github.com/apache/airflow into feature/add-max-new-dagruns-to-schedule

4007c65

address CR comments

fa873ff

move the create_dagruns to a fixture

e647963

ashb requested changes Jun 1, 2026

Nataneljpwd and others added 4 commits June 3, 2026 22:12

Merge branch 'main' of https://github.com/apache/airflow into feature/add-max-new-dagruns-to-schedule

9f9a2e8

fixed failing tests

f577267

Merge branch 'main' into feature/add-max-new-dagruns-to-schedule

3043732

fixed failing tests

177ee0b

ephraimbuddy reviewed Jun 8, 2026

DonHaul mentioned this pull request Jun 10, 2026

workflows: stress test Airflow on QA cern-sis/issues-inspire#1447

Open

potiuk assigned Nataneljpwd Jun 17, 2026

potiuk marked this pull request as draft June 18, 2026 21:46

8 participants