cuda.core: graph slot table for node attachment lifetimes#2280
cuda.core: graph slot table for node attachment lifetimes#2280Andy-Jost wants to merge 15 commits into
Conversation
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit `GraphDefinition` view that shares ownership of the underlying `CUgraph`. The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the user-facing surface and locks in the state-guard rules. State semantics: - PRIMARY builder: only valid after `end_building()`. Before `begin_building()` no graph exists; during capture the driver is the sole writer, so explicit access is unsafe. - CONDITIONAL_BODY builder: valid both before `begin_building()` (the body graph is allocated at conditional-node creation time) and after `end_building()`. This enables a hybrid flow where a conditional body is populated entirely via the explicit API, with no capture at all. - FORKED builder: never valid. Forked builders share the primary's graph; access through the primary instead. Tests cover the happy path, both hybrid flows on conditional bodies (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the `GraphDefinition` survives the builder's `close()`). Co-authored-by: Cursor <[email protected]>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
Introduce OpaqueHandle and a per-graph slot table retained on the CUgraph as a user object, preparing to replace ad-hoc per-resource user objects when wiring graph node attachments in a follow-up change.
|
Replace the per-resource CUDA user objects attached at each graph node with the per-graph slot table from phase 1. Kernel, event-record, event-wait, and host-callback nodes now store their owning handles in node slots via graph_set_slot. Stream-captured callbacks map the just-captured host node from cuStreamGetCaptureInfo and use the same path; forked builders share the primary's graph handle so their attachments reach the same table. Refine the phase 1 surface to support this: the slot table is created lazily on first attachment, so conditional-branch bodies (ref handles) get one too, and graph_set_slot returns CUresult for HANDLE_RETURN-style error checking. Removes _attach_user_object and the per-type heap-copy deleters.
Rename graph/_utils to graph/_host_callback now that it holds only host-callback machinery (the trampoline, _is_py_host_trampoline, and _resolve_host_callback), matching the concept-named files around it, and update the three cimport sites. Add _attach_host_callback_owners to share the "callback -> slot 0, user_data -> slot 1" attachment between the eager (GN_callback) and capture (add_callback) paths. Guard a zero-length user_data copy against malloc(0) and hoist the per-call ctypes import. Attach the kernel-argument tuple to the kernel node's slot 1 so the Python objects backing the arguments -- notably device Buffers -- outlive the graph. The driver copies argument values into the node at add time but does not keep the referenced device memory alive, so without this a kernel node could be left with a stale device pointer. This is the slot-table port of the user-object fix from NVIDIA#2041 (currently only on main).
GraphNode.memcpy/memset (and the GraphDefinition pass-throughs) now accept a Buffer or a raw int for each address. A new _resolve_ptr helper reads the device pointer from a Buffer and returns it as an owner; a raw int casts through with no owner. GN_memcpy attaches a Buffer dst to slot 0 and src to slot 1, and GN_memset attaches dst to slot 0, so buffers passed by value outlive the graph. Raw ints behave exactly as before (caller owns the lifetime), so this is backward compatible. Document the stream-capture lifetime contract on GraphBuilder: operations recorded during capture reference caller-owned memory and are not retained, unlike explicit GraphDefinition construction. Host callbacks are the one exception, retained on both the capture and explicit paths.
… capture callbacks Cover GraphDefinition memset/memcpy with Buffer operands (including clone), and GraphBuilder capture host callbacks retained after dropping Python refs.
Keyword-only *_owner args retain arbitrary objects for raw pointer operands; Buffer+owner combinations are rejected. Strengthen owner tests with weakref retention checks and add src_owner rejection test.
Store DevicePtrHandle in slot table instead of Buffer wrappers so reset/close cannot release memory while a graph still references it. Add test-only weak_handle() for deterministic allocation lifetime checks and extend graph lifetime tests accordingly.
13dab04 to
621ade8
Compare
|
/ok to test |
|
I'm trying to help a little bit using gpt-5.5 on my side. It'd be great to get into the habit of letting our agents add the authorship markers, at least to agent-generated tests. Findings
|
| _attach_user_object(as_cu(h_graph), <void*>kernel_args, | ||
| <cydriver.CUhostFn>py_object_user_object_destroy) | ||
| owner = make_opaque_py(kernel_args) | ||
| HANDLE_RETURN(graph_set_slot(h_graph, new_node, 1, owner)) |
There was a problem hiding this comment.
One of the things Claude identified here is that if either of these graph_set_slot calls fail, the node created with cuGraphAddKernelNode is still in the graph, and pointing at Python memory that could be freed from underneath it (since the slot setting thing here failed).
The fix is probably to have a try/except that deletes the node and then reraises in the except block.
But whether this is worth the squeeze I leave up to you.
====
cuda_core/cuda/core/graph/_graph_node.pyx:694-706 (and the analogous memset/memcpy/event/callback sites) — a graph node is committed to the driver before its owner is retained, with no rollback on failure.
cuGraphAddKernelNode (or cuGraphAddHostNode/memcpy/memset equivalents) runs first and permanently commits the node — for host-callback nodes this means the driver now holds a raw pointer into Python-owned memory. Only afterward does graph_set_slot(...) via HANDLE_RETURN attempt to retain the owner. If that call fails (ensure_slot_table returns null because the driver lacks cuUserObjectCreate/cuGraphRetainUserObject/cuUserObjectRelease, or either call errors for any other reason, e.g. OOM), an exception is raised — but the node is already permanently in the graph with no owner slot attached. If the caller's only Python reference to the callback/argument tuple is later dropped, a subsequent launch()/instantiate() dereferences a freed pointer. The "old driver" branch of this is unlikely in practice since cuUserObjectCreate (CUDA 11.3+) is present on every driver supporting the package's declared CUDA 12/13 dependency, but genuine transient failures from the two driver calls are not otherwise guarded against, and there's no cleanup/rollback path in any of the call sites.
mdboom
left a comment
There was a problem hiding this comment.
The review is a combination of some things Claude found and some things I found.
I'm putting this as request changes mainly for the API breakage -- we should make a call on how to handle that.
The rest of the comments are "real, but probably very hard to reproduce bugs" or random performance suggestions that I would have approved the PR even if unaddressed.
| p_cuUserObjectRelease(user_obj, 1); // drops refcount to 0 -> frees table | ||
| return nullptr; | ||
| } | ||
| } |
There was a problem hiding this comment.
Identified by Claude (full output below, my commentary above):
If two threads where to act on the same GraphBox here, there is a race condition on setting up its slot_table. Holding the GIL is probably a simple solution -- locking on the individual GraphBox would be more complex. It might also be possible to recheck that box->slot_table is still NULL after the GIL is re-aquired here and only overwrite if so.
This is only an issue if two threads were going to work on building the same graph in parallel. Given how complex graphs are, it's probably ok to say concurrent writing to them isn't supported, but I don't think we actually enforce that anywhere.
====
cuda_core/cuda/core/_cpp/resource_handles.cpp:1171-1196 (ensure_slot_table) — unsynchronized check-then-act on GraphBox::slot_table racing across threads.
The function checks box->slot_table for null, then releases the GIL around cuUserObjectCreate/cuGraphRetainUserObject, then writes box->slot_table. If two threads add the first node to the same graph concurrently, both can see slot_table == nullptr, both release the GIL, and both race to create and retain a separate GraphSlotTable/user object on the same graph — an unsynchronized concurrent write to a plain pointer (a data race under the C++ memory model) that leaves one of the two tables orphaned as the "cached" pointer while both are still retained on the graph. No lock guards slot_table. This requires genuinely concurrent graph-building from multiple threads on the same builder/graph, which may already be outside the library's supported usage, but nothing documents or enforces that.
There was a problem hiding this comment.
Agreed that concurrent graph modifications are outside of the supported concurrency model, so locking is overkill. The CUDA graph-construction calls are already documented as not thread safe (in a dedicated section “Graph object thread safety”) and the custom in Python numeric/GPU libraries like NumPy, PyTorch, CuPy is that synchronization for concurrent mutation falls to the user. Beyond that, cuda.core does not make any concurrency guarantees that I'm aware of.
My understanding, then, is that when making concurrent modifications to cuda.core objects, applications are responsible for providing synchronization.
There was a problem hiding this comment.
Makes sense. Maybe an AGENTS.md content would keep flagging this stuff. But nothing to do on that here.
| // when the graph is destroyed. | ||
| constexpr std::size_t SLOTS_PER_NODE = 2; | ||
| using NodeSlots = std::array<OpaqueHandle, SLOTS_PER_NODE>; | ||
| using GraphSlotTable = std::map<CUgraphNode, NodeSlots>; |
There was a problem hiding this comment.
Claude suggested that std::unordered_map might be a small performance win here.
====
cuda_core/cuda/core/_cpp/resource_handles.cpp:1132 — GraphSlotTable uses std::map instead of std::unordered_map.
<unordered_map> is already included and used elsewhere in the same file (HandleRegistry, ipc_ptr_cache). graph_set_slot does a red-black-tree lookup/insert (with a heap allocation per node) on every single node addition; an unordered_map<CUgraphNode, NodeSlots> would be O(1) amortized with no custom hash needed (CUgraphNode is a plain pointer). Cheap, low-risk swap for graphs with many nodes.
There was a problem hiding this comment.
I doubt it. For small maps (N<100) with integer keys, std::map wins in my experience.
| } // namespace | ||
|
|
||
| OpaqueHandle make_opaque_py(PyObject* obj) { | ||
| Py_INCREF(obj); |
There was a problem hiding this comment.
Should we aquire the GIL here (or at least assert that it's already held?). Usage looks fine at the moment, but it might be easy to misuse.
There was a problem hiding this comment.
AFAIK all of the entry points in resource_handles are called from Cython with the GIL held. Destroy callbacks acquire the GIL because they can be run at arbitrary times (we also check for interpreter shutdown in those). My feeling is that it's probably not worth the extra complexity to add GIL acquisition to all of the entry points.
| return GraphHandle(box, &box->resource); | ||
| } | ||
|
|
||
| CUresult graph_set_slot(const GraphHandle& h_graph, CUgraphNode node, |
There was a problem hiding this comment.
Most (but not all) of the places where this is called look like:
if owner:
graph_set_slot(graph, node, 0, owner)
I wonder if the places where there isn't a NULL check is a bug.
We could modify this function so that owner == NULL is accepted, but is a no-op. That would prevent such bugs in the future. (On the other hand it may read as less explicit at the call site, so I'm 50/50 on this suggestion).
| If the builder is forked, currently building, or (for primary | ||
| builders) has not started building yet. | ||
| """ | ||
| if self._kind == FORKED: |
There was a problem hiding this comment.
Minor performance nit: If you refactor this as if/elif (with the last and as a nested if) Cython will generate a switch statement.
| """ | ||
|
|
||
| def memset(self, dst: int, value, width: int, height: int=1, pitch: int=0) -> MemsetNode: | ||
| def memset(self, dst: Buffer | int, value, width: int, *, height: int=1, pitch: int=0, dst_owner=None) -> MemsetNode: |
There was a problem hiding this comment.
This is technically an API breakage. Before this change GraphDefinition.memset(buffer, value, width, height) would work. After, this would have to be GraphDefinition.memset(buffer, value, width, height=height).
(Even though this is in a private module, GraphDefinition is re-exposed to the public API, so this is public API).
Personally, I think this is a good change and small/obscure enough to not matter much in practice. At a minimum it needs a "Breaking change" release note. If we want to be really pedantic, it probably needs a deprecation cycle.
There was a problem hiding this comment.
Good catch. For this relatively new API, I'm leaning towards just adding an item in the release note.
There was a problem hiding this comment.
The ship has sailed. We must not break across 1.x.
| """ | ||
|
|
||
| def memcpy(self, dst: int, src: int, size: int) -> MemcpyNode: | ||
| def memcpy(self, dst: Buffer | int, src: Buffer | int, size: int, *, dst_owner=None, src_owner=None) -> MemcpyNode: |
There was a problem hiding this comment.
This one only affects new arguments, so isn't API breakage.
| """ | ||
|
|
||
| def memset(self, dst: int, value, width: int, height: int=1, pitch: int=0) -> MemsetNode: | ||
| def memset(self, dst: Buffer | int, value, width: int, *, height: int=1, pitch: int=0, dst_owner=None) -> MemsetNode: |
There was a problem hiding this comment.
Ditto as above: This is technically API breakage.
| """ | ||
|
|
||
| def memset(self, dst: int, value, width: int, height: int=1, pitch: int=0) -> MemsetNode: | ||
| def memset(self, dst: Buffer | int, value, width: int, *, height: int=1, pitch: int=0, dst_owner=None) -> MemsetNode: |
There was a problem hiding this comment.
The ship has sailed. We must not break across 1.x.
We need to set up our agents to be ultra-sensitive about breaking changes. I am glad that everyone's agent caught this, but it should not be labelled as P2, but instead P0 or blocking. |
Address PR NVIDIA#2280 review feedback: - Move the keyword-only "*" marker in GraphNode.memset and GraphDefinition.memset to after height/pitch, so pre-existing positional calls memset(dst, value, width, height, pitch) keep working. The new dst_owner argument remains keyword-only. This avoids a public API break across 1.x. memcpy is unchanged (its dst_owner/src_owner args are new, so the existing "*" placement is non-breaking). - Add @pytest.mark.agent_authored markers to the new graph tests in test_graph_builder.py and test_graph_definition_lifetime.py.
A node added via cuGraphAdd*Node is committed to the graph before its owner slots are attached. If graph_set_slot fails (e.g. the driver lacks cuUserObjectCreate, or a transient error), the node would remain in the graph referencing Python-owned memory with nothing keeping it alive, risking a later launch dereferencing freed memory. Guard the slot-attachment at each explicit-add site (kernel, memset, memcpy, event record/wait, host callback) with a try/except that destroys the node (best effort) and re-raises. The capture-path callback in _graph_builder is intentionally left alone: its node is created by cuLaunchHostFunc during active capture, where destroying a capture dependency would corrupt capture state.
For the record, it's the |
Convert the sequential guard checks in GraphBuilder.graph_definition to an if/elif chain (splitting the final compound condition into a nested if). Behavior is unchanged since each leading branch raises; the chain lets Cython generate tighter branch code. Addresses a review nit on PR NVIDIA#2280.
Centralize null-owner handling in graph_set_slot: a null OpaqueHandle now returns CUDA_SUCCESS without forcing slot-table (and user-object) creation. This resolves the reviewer question about the asymmetric per-call-site NULL checks -- optional owners are uniformly safe at the source, so callers no longer need to guard them. Update the header doc accordingly.
|
Thanks for the feedback everyone. The latest changes should address all of the issues. |
Summary
CUDA graph nodes frequently reference Python-owned resources — kernel-argument buffers, host-callback functions and their user data, and memcpy/memset operands — but the driver does not keep those resources alive. The driver copies argument values into a node at add time; it does not retain the Python wrappers or the device allocations they point at. If the owning object is collected (or a
Bufferis closed) before the graph is instantiated or launched, the node is left referencing freed memory.This PR introduces a per-graph slot table that binds the lifetime of these attachments to the lifetime of the
CUgraphitself. Each node gets a small, fixed set of slots; an owner placed in a slot is held until the graph is destroyed. The table is stored on the graph as a CUDA user object, so destroying (or cloning) the graph propagates ownership correctly with no per-node bookkeeping in Python.It also exposes
GraphBuilder.graph_definition, completing step 3 of #1330 by giving users an explicitGraphDefinitionview of a captured graph.Design
make_opaque_pyretains aPyObject*(incref/decref), andmake_opaque_mallocowns amalloc'd buffer. A slot can therefore hold any kind of owner uniformly.CUgraphNodeto a fixed-size array ofOpaqueHandleslots. It is created lazily on first attachment and retained on theCUgraphas a user object (cuUserObjectCreate+cuGraphRetainUserObjectwithMOVE). When the graph is destroyed, the user object's destructor frees the table and releases every owner it holds. Conditional-branch body graphs (ref handles) get their own table the same way.graph_set_slot(graph, node, slot, owner)installs an owner into a node slot and returnsCUresultforHANDLE_RETURN-style error handling. This replaces the previous approach of attaching an ad-hoc CUDA user object per resource at each node, along with its per-type heap-copy deleters.Changes
resource_handles.{hpp,cpp}(OpaqueHandle, the slot table, lazy creation,graph_set_slot) with its Cython surface in_resource_handles.{pxd,pyx}._graph_node.pyx— kernel, event-record, event-wait, host-callback, memcpy, and memset — now store their owning handles in node slots. Stream-captured callbacks recover the just-captured host node fromcuStreamGetCaptureInfoand use the same path; forked builders share the primary builder's graph handle, so their attachments land in the same table.Buffers — outlive the graph. This is the slot-table port of the user-object fix from cuda.core: keep kernel-argument objects alive in graph kernel nodes #2041.GraphNode.memcpy/memset(and theGraphDefinitionpass-throughs) now accept either aBufferor a rawintaddress for each operand. ABufferoperand is retained at the allocation level (itsDevicePtrHandle), soclose()/reset cannot free memory the graph still references; a rawintbehaves exactly as before (caller owns the lifetime), keeping the change backward compatible. Keyword-onlydst_owner/src_ownerarguments let callers attach an arbitrary owner to a raw-pointer operand; combining an owner with aBufferoperand is rejected.graph/_utilsis renamed tograph/_host_callbacknow that it holds only host-callback machinery, and a shared_attach_host_callback_ownersunifies the eager (GN_callback) and capture (add_callback) attachment paths.GraphBuilder.graph_definitionexposes the captured graph as aGraphDefinitionthat shares ownership of the underlyingCUgraph. State-guard rules: valid on a primary builder only afterend_building(); valid on a conditional body both beforebegin_building()and afterend_building(); never valid on a forked builder (access through the primary instead)._utils/_weak_handlesmodule providesweak_handle()for deterministic, refcount-free lifetime assertions.Stream-capture lifetime contract
Operations recorded during stream capture reference caller-owned memory and are not retained, unlike explicit
GraphDefinitionconstruction. Host callbacks are the one exception: they are retained on both the capture and explicit paths. This contract is documented onGraphBuilder.Test coverage
Buffermemcpy/memset operands (including clone), and for capture host callbacks retained after dropping their Python references.dst_owner/src_ownerretention verified with weakrefs, plus rejection ofBuffer+owner combinations.weak_handle()to confirm an allocation survivesreset/closewhile a graph still references it.graph_definitiontests: happy path, both hybrid conditional-body flows (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (theGraphDefinitionsurvives the builder'sclose()).Related work