Cooperative memory reclaim via async MemoryReclaimer#22043

Draft
JanKaul wants to merge 2 commits into apache:main from Embucket:memory-reclaimer

JanKaul (Contributor) commented May 6, 2026

Adds an async hook that lets a MemoryPool ask other consumers to free memory before failing an allocation.

This PR complements #21425 rather than replacing it. It offers a concrete alternative to broaden the design discussion there, not to supersede that work.

Design

  • trait MemoryReclaimer (async), attached to a MemoryConsumer via with_reclaimer. Implementors provide reclaim(target) and may optionally override reclaimable_bytes and priority.
  • MemoryPool::try_grow_async: the default implementation delegates to the sync try_grow. TrackConsumersPool overrides it so that, on OOM, it walks the registered reclaimers (priority descending, then size descending), retries the grow after each one, and finally falls through to inner.try_grow_async so that a wrapped reclaim-aware pool isn't shadowed.
  • Operator-side state machine (SortExec): a channel-based ExternalSorterReclaimer sends a oneshot reply channel to the partition's stream loop. The loop runs tokio::select! in biased mode over reclaim_rx.recv() and input.next(), so a pending reclaim request is polled first, and it spills end-to-end before replying with the freed-byte count. The stream loop is the sole owner of the sorter's batches, so the spill is ordered before the report: the bytes the pool sees as freed are already on disk.
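The reclaim-and-retry walk described in the second bullet can be modelled with a small, synchronous, std-only sketch. It is not the PR's code: the PR's trait is async and the real pool tracks reservations, whereas here `Reclaimer`, `Spillable`, `Pool`, and `try_grow_with_reclaim` are hypothetical names, and freed bytes are subtracted from the pool directly. The final plain `try_grow` call merely stands in for the fall-through to the inner pool.

```rust
// Hypothetical, synchronous stand-in for the PR's async MemoryReclaimer.
trait Reclaimer {
    /// Try to free up to `target` bytes; returns the bytes actually freed.
    fn reclaim(&mut self, target: usize) -> usize;
    fn reclaimable_bytes(&self) -> usize;
    fn priority(&self) -> i32 {
        0
    }
}

// Toy consumer that can give back any of the bytes it holds.
struct Spillable {
    held: usize,
    prio: i32,
}

impl Reclaimer for Spillable {
    fn reclaim(&mut self, target: usize) -> usize {
        let freed = self.held.min(target);
        self.held -= freed;
        freed
    }
    fn reclaimable_bytes(&self) -> usize {
        self.held
    }
    fn priority(&self) -> i32 {
        self.prio
    }
}

struct Pool {
    limit: usize,
    used: usize,
    reclaimers: Vec<Box<dyn Reclaimer>>,
}

impl Pool {
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!("cannot reserve {bytes} bytes"));
        }
        self.used += bytes;
        Ok(())
    }

    fn try_grow_with_reclaim(&mut self, bytes: usize) -> Result<(), String> {
        if self.try_grow(bytes).is_ok() {
            return Ok(());
        }
        // Visit reclaimers by priority (descending), then by size (descending).
        self.reclaimers.sort_by(|a, b| {
            b.priority()
                .cmp(&a.priority())
                .then(b.reclaimable_bytes().cmp(&a.reclaimable_bytes()))
        });
        for i in 0..self.reclaimers.len() {
            let shortfall = (self.used + bytes).saturating_sub(self.limit);
            let freed = self.reclaimers[i].reclaim(shortfall);
            self.used = self.used.saturating_sub(freed);
            // Retry the grow after each reclaimer has run.
            if self.try_grow(bytes).is_ok() {
                return Ok(());
            }
        }
        // Stand-in for falling through to the inner pool.
        self.try_grow(bytes)
    }
}

fn demo() -> (bool, usize) {
    let mut pool = Pool {
        limit: 100,
        used: 90,
        reclaimers: vec![
            Box::new(Spillable { held: 30, prio: 0 }),
            Box::new(Spillable { held: 60, prio: 0 }),
        ],
    };
    let ok = pool.try_grow_with_reclaim(40).is_ok();
    (ok, pool.used)
}
```

In `demo`, the request for 40 bytes initially fails, the larger of the two equal-priority consumers is asked for the 30-byte shortfall, and the retry then succeeds without touching the second consumer.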

How this differs from #21425

  • Async trait + try_grow_async instead of sync pool.reclaim(...), which matches the channel hand-off pattern needed for cooperative spill inside DataFusion's async execution.
  • Auto-triggered on OOM rather than caller-driven.
  • Includes operator wiring for SortExec to demonstrate the full flow.
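The channel hand-off can be illustrated with a single-threaded, std-only sketch. It assumes nothing about the PR's actual types: `ReclaimRequest`, `SorterLoop`, `step`, and `spill_all` are hypothetical, `std::sync::mpsc` stands in for the tokio channels, and checking the reclaim channel before taking input only mimics the polling order of a biased select.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical request: a target byte count plus a reply channel.
struct ReclaimRequest {
    target: usize,
    reply: Sender<usize>,
}

struct SorterLoop {
    buffered: Vec<Vec<u8>>, // in-memory batches, owned only by this loop
    spilled_bytes: usize,   // stand-in for bytes written to disk
    reclaim_rx: Receiver<ReclaimRequest>,
}

impl SorterLoop {
    /// Runs one loop iteration; returns false once the input is exhausted.
    fn step(&mut self, input: &mut impl Iterator<Item = Vec<u8>>) -> bool {
        // Reclaim requests are checked first, mimicking a biased select.
        if let Ok(req) = self.reclaim_rx.try_recv() {
            // This sketch spills everything and ignores the requested target.
            let _requested = req.target;
            let freed = self.spill_all();
            // Reply only after the spill has finished.
            let _ = req.reply.send(freed);
            return true;
        }
        match input.next() {
            Some(batch) => {
                self.buffered.push(batch);
                true
            }
            None => false,
        }
    }

    fn spill_all(&mut self) -> usize {
        let freed: usize = self.buffered.iter().map(Vec::len).sum();
        self.buffered.clear();
        self.spilled_bytes += freed;
        freed
    }
}

fn demo() -> (usize, usize) {
    let (reclaim_tx, reclaim_rx) = channel();
    let mut sorter = SorterLoop {
        buffered: Vec::new(),
        spilled_bytes: 0,
        reclaim_rx,
    };
    let mut input = vec![vec![0u8; 10], vec![0u8; 20], vec![0u8; 30]].into_iter();

    // Buffer the first two batches (30 bytes in total).
    sorter.step(&mut input);
    sorter.step(&mut input);

    // A reclaimer asks the loop to free memory and waits for the reply.
    let (reply_tx, reply_rx) = channel();
    reclaim_tx
        .send(ReclaimRequest { target: 25, reply: reply_tx })
        .unwrap();
    sorter.step(&mut input); // handles the request instead of reading input
    let freed = reply_rx.recv().unwrap();

    sorter.step(&mut input); // resumes with the third batch
    (freed, sorter.buffered.len())
}
```

Because only the loop touches `buffered`, the byte count it sends back always describes a spill that has already completed.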


github-actions Bot commented May 6, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-execution v53.1.0 (current)
       Built [  31.892s] (current)
     Parsing datafusion-execution v53.1.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-execution v53.1.0 (baseline)
       Built [  30.759s] (baseline)
     Parsing datafusion-execution v53.1.0 (baseline)
      Parsed [   0.026s] (baseline)
    Checking datafusion-execution v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.218s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure auto_trait_impl_removed: auto trait no longer implemented ---

Description:
A public type has stopped implementing one or more auto traits. This can break downstream code that depends on the traits being implemented.
        ref: https://doc.rust-lang.org/reference/special-types-and-traits.html#auto-traits
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/auto_trait_impl_removed.ron

Failed in:
  type MemoryConsumer is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/memory_pool/mod.rs:263
  type MemoryConsumer is no longer RefUnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/memory_pool/mod.rs:263
  type TrackConsumersPool is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/memory_pool/pool.rs:410

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  64.096s] datafusion-execution
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  33.084s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.123s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  33.251s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.129s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.609s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  68.389s] datafusion-physical-plan
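
One common way for a type to lose UnwindSafe and RefUnwindSafe is gaining a field that holds a trait object without those bounds. Whether that is what happened to MemoryConsumer here is an assumption; the sketch below uses hypothetical types (`Reclaimer`, `ConsumerBefore`, `ConsumerAfter`, `ConsumerBounded`) only to show the mechanism and one possible way to keep the auto traits.

```rust
use std::panic::{RefUnwindSafe, UnwindSafe};
use std::sync::Arc;

// Hypothetical reclaimer trait with no unwind-safety supertraits.
trait Reclaimer: Send + Sync {}

// Only plain data: both auto traits are implemented automatically.
#[allow(dead_code)]
struct ConsumerBefore {
    name: String,
    can_spill: bool,
}

// Arc<dyn Reclaimer> is neither UnwindSafe nor RefUnwindSafe,
// so this struct is not either.
#[allow(dead_code)]
struct ConsumerAfter {
    name: String,
    reclaimer: Option<Arc<dyn Reclaimer>>,
}

// Naming the auto traits on the trait object keeps them on the struct,
// at the cost of requiring them from every implementor stored here.
#[allow(dead_code)]
struct ConsumerBounded {
    name: String,
    reclaimer: Option<Arc<dyn Reclaimer + UnwindSafe + RefUnwindSafe>>,
}

// Compile-time check: only instantiable for types with both auto traits.
fn assert_unwind_safe<T: UnwindSafe + RefUnwindSafe>() {}

fn check() -> bool {
    assert_unwind_safe::<ConsumerBefore>();
    assert_unwind_safe::<ConsumerBounded>();
    // assert_unwind_safe::<ConsumerAfter>(); // would fail to compile
    true
}
```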

github-actions Bot added the "auto detected api change" label May 6, 2026

JanKaul (Contributor, Author) commented May 7, 2026

It seems to work for SortExec. I no longer get OOM errors where I got them before. I still need to hook up more operators, though.

JanKaul force-pushed the memory-reclaimer branch from b8ec2f5 to 54022de May 7, 2026 16:48

Labels

  • auto detected api change
  • execution (Related to the execution crate)
  • physical-plan (Changes to the physical-plan crate)
