Since #2008 merged on 2026-06-29, the Nightly standard (linux-aarch64) / Python 3.14, CUDA 13.3.0 (local), GPU l4 (x2) job (and its 3.14t sibling) in the "CI: Nightly optional-deps" workflow has intermittently failed on the same 6 tests in cuda_core/tests/graph/test_graph_builder.py (all newly added in #2008):
- test_graph_begin_building_twice: raises CUDAError(CUDA_ERROR_ILLEGAL_STATE) instead of the expected RuntimeError("Graph builder is already building.")
- test_graph_split_requires_building: DID NOT RAISE RuntimeError (expected: "Graph builder must be building before it can be split.")
- test_graph_complete_after_close_forked: raises RuntimeError("Graph has not finished building.") instead of RuntimeError("Graph builder has been closed.")
- test_graph_update_after_source_close: TypeError: int() argument must be … not 'NoneType' at _graph_builder.pyx:843, instead of ValueError("Source graph builder has been closed.")
- test_graph_embed_non_builder: AttributeError: 'object' object has no attribute '_building_ended' at _graph_builder.pyx:689, instead of the intended TypeError from the isinstance check
- test_graph_close_is_idempotent: after graph.close() is called twice, int(graph.handle) == 0 fails because graph.handle is None (not the null-int handle the assertion expects)
Nightly history on main
| Date | HEAD | aarch64 py3.14 | Notes |
|---|---|---|---|
| 2026-06-28 | ea0215fd09 | ✅ pass | pre-#2008 |
| 2026-06-29 | ea0215fd09 | ✅ pass | ran 03:25 UTC, before #2008 merged (21:04 UTC same day) |
| 2026-06-30 | dad6a421df | ❌ 6 graph_builder tests fail (py3.14 / py3.14t) | first nightly after #2008 (merge commit) |
| 2026-07-01 | f9f3849bd8 | ❌ (unrelated nvshmem failure), py3.14 / py3.14t | graph_builder tests passed this night; only test_locate_bitcode_lib[nvshmem_device] failed (a separate, already-fixed issue via #2286) |
Also observed on PR #2283 push run ec04c554bd at 2026-07-01T17:48 UTC — same 6 graph_builder failures — py3.14 / py3.14t. Then a /ok to test rerun of the exact same tree (synthetic head 58fd95efb4, run 28542961462) passed all 6.
Direct link into the failure block for the 2026-06-30 run: https://github.com/NVIDIA/cuda-python/actions/runs/28418107489/job/84205333659#step:35:3873
Diagnosis
Two failures and two passes have been observed, including one failure and one pass on the exact same tree (the PR #2283 push run and its rerun), so the tests behave non-deterministically on linux-aarch64 py3.14/3.14t (L4 x2). pytest-randomly is in the test group and reshuffles test order on every invocation; some of the 6 failures (e.g. test_graph_close_is_idempotent finding graph.handle is None where the test expects int(graph.handle) == 0) point at test-to-test state leakage or a lifecycle assumption that only holds under a specific ordering. The failures have not been observed on linux-64, win-64, or linux-aarch64 py3.12/3.13.
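If ordering is the trigger, a seed sweep on an affected runner should be able to confirm it. The following is only a sketch: the helper names `build_pytest_cmd` and `sweep` are hypothetical, and it assumes pytest-randomly is installed so that its `--randomly-seed` option and `-p no:randomly` switch are available.

```python
import subprocess
import sys

# Test file named in this report; adjust if running from another directory.
TEST_FILE = "cuda_core/tests/graph/test_graph_builder.py"


def build_pytest_cmd(seed=None, test_file=TEST_FILE):
    """Build a pytest command line (hypothetical helper).

    seed=None disables pytest-randomly so tests run in file order;
    any other value pins the shuffle to that seed.
    """
    cmd = [sys.executable, "-m", "pytest", test_file, "-q"]
    if seed is None:
        cmd += ["-p", "no:randomly"]
    else:
        cmd += [f"--randomly-seed={seed}"]
    return cmd


def sweep(seeds, runner=subprocess.run):
    """Run the test file once per seed and return the seeds that failed."""
    failing = []
    for seed in seeds:
        result = runner(build_pytest_cmd(seed))
        if result.returncode != 0:
            failing.append(seed)
    return failing
```

Calling `sweep(range(20))` on the aarch64 py3.14 runner would list any seeds that reproduce the failures. If every seed passes, or if the run with `seed=None` also fails intermittently, ordering alone is probably not the cause. pytest-randomly also reports the seed it used in the session header, so a seed from a failing CI log could be replayed directly if the log still has it.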
Because #2008 introduced both the tests and the underlying refactor, the fix should live either in _graph_builder.pyx (make the invariants hold regardless of ordering) or in the new tests (make them order-independent — e.g. assert graph.handle in (0, None)), whichever matches the intended semantics.
cc @Andy-Jost