Implement power-of-two VMem allocator by Copilot · Pull Request #441 · ROCm/iris

Copilot · 2026-03-09T01:30:58Z

Original prompt

This section details on the original issue you should resolve

<issue_title>Implement power-of-two vmem allocator</issue_title>
<issue_description>Implement a new virtual memory power of two allocator. Allocator must support as symmetric api</issue_description>

<agent_instructions>You have access to AMD gpu. Make sure everything is tested and working </agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes Implement power-of-two vmem allocator #440

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: mawad-amd <[email protected]>

mawad-amd · 2026-03-09T02:04:13Z

@copilot Please address the following issues in this PR:

Critical Bugs

import_external_tensor OOM check mismatch: The OOM check on line 390 uses aligned_size (derived from alloc_size) but the actual mapping on line 403 uses aligned_export_size (derived from export_size). These can differ, so the check may pass but the mapping could still exceed the heap, or vice versa. The check should use aligned_export_size.
owns_tensor incorrectly claims zero-element tensors: owns_tensor returns True unconditionally for any zero-element tensor, even ones not allocated by this allocator. A torch.zeros(0, device="cuda") created externally would be incorrectly claimed as owned. This can break routing logic that depends on owns_tensor.
No GC-based auto-free: Tensors that go out of scope leak their VA block permanently — the block is never returned to the free list until the entire allocator is close()'d. The allocator should use a weak-reference finalizer (e.g., weakref.ref with a callback) on returned tensors to automatically return blocks to the free list when the tensor is garbage collected. Users should not be required to manually call allocator.free(t) for normal usage.
Physical memory not released on free(): Currently free() only does bookkeeping — the physical memory (mem_create handle) stays mapped, permanently holding VRAM for blocks that aren't in use. On free(), the allocator should call mem_unmap and mem_release to return physical memory to the system while keeping the VA offset on the free list for reuse. On re-allocation from the free list, call mem_create + mem_map again to back the reserved VA with fresh physical memory. The cumulative_mapped_size / mem_set_access logic will need adjustment to handle gaps in the physical mapping. The DMA-BUF concern can be handled by re-exporting segments after reallocation rather than assuming all segments are always physically present.

Design Issues

_CUDAArrayInterface class redefined on every call: The inner class _CUDAArrayInterface is redefined inside both allocate() and import_external_tensor() on every invocation. Define it once at module level or as a nested class of the allocator.
Inefficient element_size computation: torch.tensor([], dtype=dtype).element_size() creates a temporary tensor on every allocation just to get the element size. Cache this or use a more efficient approach.
File descriptor leak in import_external_tensor: If an exception occurs between export_dmabuf_handle and os.close(dmabuf_fd), the fd leaks. Wrap in try/finally.
Unused va_multiplier parameter: The constructor accepts va_multiplier but never uses it. Either implement it or remove it.

Testing Gaps

No test for OOM / heap exhaustion — the RuntimeError on VA space exhaustion is untested.
No test for thread safety — given the Lock, concurrent allocation/free should be tested.
No test for close() / resource cleanup — the close() and __del__ paths are untested.
Tests use print() for pass reporting — remove prints; rely on pytest assertions.

Co-authored-by: mawad-amd <[email protected]>

Copilot · 2026-03-09T02:53:39Z

@copilot Please address the following issues in this PR:

Critical Bugs...

All 12 issues addressed in commit 17edd0b. Key points:

Critical bugs:

OOM check in import_external_tensor now uses aligned_export_size (export called first, then check)
owns_tensor checks data_ptr() range for all tensors — zero-element tensors no longer unconditionally claimed
GC auto-free via weakref.finalize on tensor.untyped_storage() (not the Python wrapper, which doesn't survive .reshape()) + _pending_free deque to avoid lock-reentry deadlock
Physical memory is renewed at reuse time: _remap_free_block releases the old handle (mem_unmap + mem_release) and creates fresh physical memory (mem_create + mem_map) at the same VA, then restores mem_set_access(base_va, cumulative_size, …). Releasing at free() time is incompatible with ROCm's requirement that hipMemSetAccess must start from base_va — calling it on arbitrary sub-ranges gives HIP error 1 (invalid argument), so VA ranges must always stay gap-free

Design improvements: _CUDAArrayInterface at module level, _element_size() cache, try/finally fd guard, va_multiplier removed

Segment API: get_allocation_segments() now returns (offset, size, va, generation) 4-tuples; _remap_free_block increments the generation; symmetric_heap replaces the _peer_imported_segments set with _peer_segment_generations dict and handles stale peer mappings (unmap old, remap new, re-set access)

New tests: OOM, thread safety, close(), GC auto-free, generation counter, zero-element owns_tensor check; all print() calls removed

All 24 new tests pass on 2 ranks; all 2255 existing tests pass.

Initial plan

92e25d7

Copilot AI assigned Copilot and mawad-amd Mar 9, 2026

Copilot started work on behalf of mawad-amd March 9, 2026 01:31 View session

Implement power-of-two VMem allocator (VMemPow2Allocator)

52dd8a8

Co-authored-by: mawad-amd <[email protected]>

Copilot AI changed the title ~~[WIP] Implement power-of-two virtual memory allocator~~ Implement power-of-two VMem allocator Mar 9, 2026

Copilot finished work on behalf of mawad-amd March 9, 2026 01:48

Copilot started work on behalf of mawad-amd March 9, 2026 02:04 View session

Address review feedback: fix all 12 issues in VMemPow2Allocator

17edd0b

Co-authored-by: mawad-amd <[email protected]>

Copilot finished work on behalf of mawad-amd March 9, 2026 02:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement power-of-two VMem allocator#441

Implement power-of-two VMem allocator#441
Copilot wants to merge 3 commits into
mainfrom
copilot/implement-power-of-two-allocator

Copilot AI commented Mar 9, 2026 •

edited

Loading

Uh oh!

mawad-amd commented Mar 9, 2026

Uh oh!

Copilot AI commented Mar 9, 2026

Critical Bugs...

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comments on the Issue (you are @copilot in this section)

Uh oh!

mawad-amd commented Mar 9, 2026

Critical Bugs

Design Issues

Testing Gaps

Uh oh!

Copilot AI commented Mar 9, 2026

Critical Bugs...

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 9, 2026 •

edited

Loading