From 046eac264025a2f80ee0ea9ca56c47ffc4fe471d Mon Sep 17 00:00:00 2001 From: Sri Koundinyan Date: Wed, 1 Jul 2026 13:45:09 -0400 Subject: [PATCH] cuda_core: add "10 minutes to cuda.core" tutorial to docs Add a beginner-friendly "10 minutes to cuda.core" guide that walks through the core workflow (select a device, compile a kernel, allocate memory, copy, launch, time with events, use multiple streams, capture a CUDA graph, and interoperate with CuPy/PyTorch), then wire it into the cuda.core docs table of contents between the installation and examples sections. Signed-off-by: Sri Koundinyan --- .../docs/source/10_minutes_to_cuda_core.rst | 438 ++++++++++++++++++ cuda_core/docs/source/index.rst | 1 + 2 files changed, 439 insertions(+) create mode 100644 cuda_core/docs/source/10_minutes_to_cuda_core.rst diff --git a/cuda_core/docs/source/10_minutes_to_cuda_core.rst b/cuda_core/docs/source/10_minutes_to_cuda_core.rst new file mode 100644 index 00000000000..411f6a67fd8 --- /dev/null +++ b/cuda_core/docs/source/10_minutes_to_cuda_core.rst @@ -0,0 +1,438 @@ +.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: Apache-2.0 + +.. currentmodule:: cuda.core + +10 minutes to ``cuda.core`` +=========================== + +Why ``cuda.core``? +------------------ + +``cuda.core`` gives you a **Pythonic interface to the CUDA runtime**: you can +compile a CUDA C++ kernel at runtime, launch it, move memory, time it, and +capture it into a CUDA graph, all without writing a single raw CUDA driver or +runtime call. It is part of `CUDA Python +`_, providing high-level, Pythonic access +to the CUDA runtime on top of the low-level +:doc:`cuda.bindings `, so you get clean, idiomatic Python: + +- **Write GPU kernels from Python.** Compile CUDA C++ to a runnable kernel in a + few lines, with no separate build step, no ``Makefile``, and no ``nvcc`` + invocation. +- **Stay in the Python GPU ecosystem.** It shares CUDA context, streams, and + memory with `CuPy `_ and `PyTorch `_, + so you can launch your own kernels directly on their arrays, with zero copies. +- **Safe by design.** Resources are real Python objects that you release with + ``close()``, and errors raise Python exceptions instead of return codes you + have to check. + +If you have ever wanted to drop a custom CUDA kernel into a NumPy/CuPy/PyTorch +workflow without leaving Python, this is the fastest way to do it. + +Before you begin +---------------- + +This is a short, hands-on introduction to ``cuda.core``, geared mainly toward +new users. It walks from "talk to a GPU" to "compile, launch, time, and capture +a kernel" using small, runnable snippets. For the full reference, see the +:doc:`API reference `; for more complete programs, see the +:doc:`examples page `. + +.. note:: + + These snippets target CUDA 13 and need a working CUDA 13 installation plus + ``cuda.core`` and a matching ``cuda.bindings`` (13.x). The quickest way to get + both from PyPI is: + + .. code-block:: console + + $ pip install cuda-core[cu13] + + The :doc:`install page ` covers driver requirements, conda, CUDA 12, + and installing from source. + +Customarily, we import as follows: + +.. code-block:: python + + from cuda.core import Device, LaunchConfig, Program, ProgramOptions, launch + + +Selecting a device +------------------ + +:class:`Device` is your entry point. Creating one does **not** initialize the +GPU; calling :meth:`Device.set_current` is what sets up a `CUDA context +`_ +on the current host thread. Always call it before doing GPU work. + +.. code-block:: python + + dev = Device() # device 0 by default; Device(1) selects another GPU + dev.set_current() # initialize CUDA and make this device current + + print(dev.name) # e.g. 'NVIDIA GB10' + print(dev.compute_capability) # ComputeCapability(major=12, minor=1) + print(dev.arch) # '121' (handy for building kernels below) + +:class:`Device` objects are thread-local singletons: ``Device(0)`` always hands +you back the same object for device 0 on that thread, so libraries sharing your +process see and use the same GPU. Rich device attributes live under +:attr:`Device.properties` (for example +``dev.properties.multiprocessor_count``). + + +Compiling a kernel +------------------ + +CUDA C++ source is compiled at runtime with :class:`Program`. You describe the +compile with :class:`ProgramOptions` (here: the C++ standard and the target +architecture, taken from ``dev.arch``), call :meth:`Program.compile` to get an +:class:`ObjectCode`, then pull out a callable :class:`Kernel` by name. + +.. code-block:: python + + code = r""" + extern "C" __global__ + void scale(float* data, float factor, size_t n) { + size_t i = threadIdx.x + blockIdx.x * blockDim.x; + if (i < n) + data[i] *= factor; + } + """ + + opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}") + prog = Program(code, code_type="c++", options=opts) + mod = prog.compile("cubin") # "cubin" | "ptx" | "ltoir" + scale = mod.get_kernel("scale") + +We wrap the kernel in ``extern "C"`` so its name is exactly ``"scale"``, which is +the name we pass to :meth:`~ObjectCode.get_kernel`. + +When something goes wrong, ``cuda.core`` raises a regular Python exception rather +than returning an error code. A compilation failure is especially friendly: the +exception carries the compiler's log, so you can see exactly what went wrong +(other errors raise an exception with the CUDA error name and description). + +.. code-block:: python + + bad = r'extern "C" __global__ void k() { not_a_real_symbol; }' + try: + Program(bad, code_type="c++", options=opts).compile("cubin") + except Exception as e: + print(e) # includes: error: identifier "not_a_real_symbol" is undefined + + +Streams +------- + +A :class:`Stream` is an ordered queue of GPU work. Operations on the same stream +run in order; separate streams may overlap. Most ``cuda.core`` operations that +touch the GPU take a stream so you stay in control of ordering. Create one from +the device: + +.. code-block:: python + + stream = dev.create_stream() + +We create it first because the steps that follow, allocating, copying, and +launching, are all issued on a stream. We will pass this ``stream`` to each of +them in the next two sections. + + +Allocating memory +----------------- + +Memory comes from a :class:`MemoryResource`. Two you'll use early: + +- :meth:`Device.allocate` hands you a **device** :class:`Buffer` from the + device's default memory resource. +- :class:`PinnedMemoryResource` allocates **host** memory that is page-locked + (and host-accessible), which is convenient for staging data to and from the GPU. + +Like :meth:`Device.allocate`, :class:`PinnedMemoryResource` is stream-ordered, so +its :meth:`~PinnedMemoryResource.allocate` also takes a ``stream``. + +From here on we use `NumPy `_ for host data and to check +results, but only as a convenience: ``cuda.core`` works directly on raw memory +buffers and typed pointers, and does not depend on NumPy or any other array +library. + +.. code-block:: python + + import numpy as np + + from cuda.core import PinnedMemoryResource + + n = 1024 + nbytes = n * np.dtype(np.float32).itemsize + + pinned = PinnedMemoryResource() + host = pinned.allocate(nbytes, stream=stream) # host-accessible buffer + dbuf = dev.allocate(nbytes, stream=stream) # device buffer + + print(host.size, host.is_host_accessible) # 4096 True + print(dbuf.is_device_accessible) # True + +A :class:`Buffer` is an owning handle to an allocation. Because every +:class:`Buffer` implements ``__dlpack__``, we can get a NumPy view of the +*host* buffer with :func:`numpy.from_dlpack` (no copy, no raw pointers) and +fill it in place. (``from_dlpack`` hands back raw bytes, so we ``.view`` it as +``float32``.) + +.. code-block:: python + + host_np = np.from_dlpack(host).view(np.float32) # writable view, zero-copy + host_np[:] = np.arange(n, dtype=np.float32) + +The same trick works on the *device* side with CuPy: ``cp.from_dlpack(dbuf)`` +gives a CuPy array backed by the device buffer. + + +Copying and launching +--------------------- + +Copy host → device with :meth:`Buffer.copy_from`, run the kernel with +:func:`launch`, then copy device → host with :meth:`Buffer.copy_to`. These are +all stream-ordered and asynchronous; :meth:`Stream.sync` blocks until the queued +work finishes. + +:class:`LaunchConfig` describes the grid and block; :func:`launch` takes the +stream, the config, the kernel, and then the kernel arguments. A :class:`Buffer` +can be passed straight through as a pointer argument. Scalars, however, must +carry an explicit C type, so we use NumPy scalars (``np.float32``, +``np.uint64``) to match the kernel signature. + +.. code-block:: python + + dbuf.copy_from(host, stream=stream) # H2D + + block = 256 + grid = (n + block - 1) // block + config = LaunchConfig(grid=grid, block=block) + launch(stream, config, scale, dbuf, np.float32(3.0), np.uint64(n)) + + dbuf.copy_to(host, stream=stream) # D2H + stream.sync() # wait for all of the above + + assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 3.0) + +That is the core ``cuda.core`` workflow: **select a device, compile, allocate, +copy, launch, sync.** Everything below builds on it. + + +Timing with events +------------------ + +An :class:`Event` marks a point in a stream. Create timing-enabled events, +:meth:`record ` them around some work, synchronize, and subtract +to get elapsed GPU time in milliseconds. + +.. code-block:: python + + start = dev.create_event({"timing_enabled": True}) + end = dev.create_event({"timing_enabled": True}) + + stream.record(start) + for _ in range(100): + launch(stream, config, scale, dbuf, np.float32(1.0), np.uint64(n)) + stream.record(end) + end.sync() + + print(f"100 launches took {end - start:.4f} ms") + +Events are also how you build cross-stream dependencies without stalling the +host: :meth:`Stream.wait` makes one stream wait on an event (or another stream). + + +Working with multiple streams +----------------------------- + +A single stream is enough for ordered work, but multiple streams let independent +work proceed in parallel. The two launches below touch different buffers, so the +GPU is free to run them at the same time. When a later step *does* depend on +another stream's result, :meth:`Stream.wait` joins them: it makes one stream wait +for another's work to finish before continuing. + +.. code-block:: python + + stream_a = dev.create_stream() + stream_b = dev.create_stream() + + buf_a = dev.allocate(nbytes, stream=stream_a) + buf_b = dev.allocate(nbytes, stream=stream_b) + + host_np[:] = np.arange(n, dtype=np.float32) # known input + buf_a.copy_from(host, stream=stream_a) + buf_b.copy_from(host, stream=stream_b) + + # Independent work on each stream: free to run concurrently. + launch(stream_a, config, scale, buf_a, np.float32(2.0), np.uint64(n)) + launch(stream_b, config, scale, buf_b, np.float32(5.0), np.uint64(n)) + + # stream_b now needs stream_a's result, so it waits before reading it. + stream_b.wait(stream_a) + buf_b.copy_from(buf_a, stream=stream_b) # safe: buf_a is ready + + buf_b.copy_to(host, stream=stream_b) + stream_b.sync() + assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 2.0) + +Without the :meth:`Stream.wait`, the copy on ``stream_b`` could race ahead of the +kernel on ``stream_a``. Whether the two independent launches actually overlap is +up to the GPU scheduler, which can run them together only when each leaves the +device underutilized; using separate streams expresses that the work is +independent and lets the runtime overlap it when it can. The +:doc:`examples ` show this scaled up across multiple GPUs. + + +Capturing work in a CUDA graph +------------------------------ + +When you launch the same sequence of operations repeatedly, per-launch CPU +overhead adds up. A CUDA graph lets you record that sequence once and replay it +with a single launch. Use the stream's graph builder: begin building, issue the +launches into the *builder* instead of the stream, then complete the graph. + +.. code-block:: python + + gb = stream.create_graph_builder() + gb.begin_building() + + launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n)) + launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n)) + + graph = gb.end_building().complete() + + graph.upload(stream) + graph.launch(stream) # replay the whole sequence in one shot + stream.sync() + +See :cuda-core-example:`cuda_graphs.py ` +for a complete capture-and-replay example with a measured speedup. + + +Working with CuPy and PyTorch +----------------------------- + +``cuda.core`` is designed to interoperate with the rest of the Python GPU +ecosystem, so in real workflows you often skip manual host buffers entirely and +operate directly on CuPy or PyTorch arrays. + +**Current device/context.** Because :meth:`Device.set_current` sets a normal +CUDA context (the standard primary context), other CUDA-runtime libraries pick +it up automatically: if CuPy or PyTorch has already selected a device, +``Device()`` shares it, and vice versa. + +**Passing array data to a kernel.** Both CuPy and PyTorch expose their device +pointer, which you pass to :func:`launch` like any other buffer pointer. We reuse +the ``scale`` kernel and ``config`` from above, applying a factor of ``2.0`` to an +array of ones: + +.. code-block:: python + + # CuPy exposes its device pointer as .data.ptr + import cupy as cp + + a = cp.ones(n, dtype=cp.float32) + dev.sync() # CuPy fills "a" on its own stream; sync before our stream reads it + launch(stream, config, scale, a.data.ptr, np.float32(2.0), np.uint64(a.size)) + stream.sync() + assert bool((a == 2).all()) + +.. code-block:: python + + # PyTorch: wrap torch's stream via the __cuda_stream__ protocol + import torch + + class PyTorchStreamWrapper: + def __init__(self, pt_stream): + self.pt_stream = pt_stream + def __cuda_stream__(self): + return (0, self.pt_stream.cuda_stream) + + t = torch.ones(n, dtype=torch.float32, device="cuda") + ts = dev.create_stream(PyTorchStreamWrapper(torch.cuda.current_stream())) + launch(ts, config, scale, t.data_ptr(), np.float32(2.0), np.uint64(t.numel())) + ts.sync() + assert torch.allclose(t, torch.full_like(t, 2.0)) + +The ``__cuda_stream__`` protocol is how any object advertises that it represents +a CUDA stream; :meth:`Device.create_stream` accepts such objects so you can drive +``cuda.core`` work on another library's stream. See the +:doc:`interoperability guide ` for details. + +**Array-library-agnostic views.** To accept *any* CuPy/PyTorch/NumPy-like array +that supports DLPack or the CUDA Array Interface, decorate a function with +:func:`~cuda.core.utils.args_viewable_as_strided_memory`. The chosen arguments +become :class:`~cuda.core.utils.StridedMemoryView` objects exposing ``ptr``, +``shape``, ``dtype``, and ``is_device_accessible``: + +.. code-block:: python + + from cuda.core.utils import StridedMemoryView, args_viewable_as_strided_memory + + @args_viewable_as_strided_memory((0,)) + def scale_array(arr, work_stream, kern, factor): + view = arr.view(work_stream.handle) + assert isinstance(view, StridedMemoryView) + assert view.is_device_accessible + size = view.shape[0] + cfg = LaunchConfig(grid=(size + 255) // 256, block=256) + launch(work_stream, cfg, kern, view.ptr, np.float32(factor), np.uint64(size)) + work_stream.sync() + + # Works on a CuPy array, a PyTorch tensor, or anything else with DLPack/CAI: + buf = cp.ones(n, dtype=cp.float32) + dev.sync() + scale_array(buf, stream, scale, 2.0) + assert bool((buf == 2).all()) + +``cuda.core`` buffers also implement ``__dlpack__``, so a device :class:`Buffer` +can be handed to any DLPack importer for zero-copy exchange. + + +Cleaning up +----------- + +Buffers, streams, events, graphs, and graph builders hold CUDA resources. They +are released when garbage-collected, but you can release them explicitly with +``close()``, which the :cuda-core-examples:`cuda.core examples ` do in a +``finally`` block. Buffers are also context managers. + +.. code-block:: python + + graph.close() + gb.close() + dbuf.close(stream=stream) + host.close(stream=stream) + stream.close() + + # or scope a buffer with a with-statement: + with dev.allocate(nbytes, stream=stream) as tmp: + ... # tmp is freed at the end of the block + + +Where to go next +---------------- + +You now know the essential ``cuda.core`` workflow and how to plug it into the +Python GPU ecosystem. From here: + +- :doc:`Examples `: runnable programs for templated kernels + (``name_expressions``), multi-GPU, graphs, JIT link-time optimization, TMA, + and interop. +- :doc:`Interoperability `: the ``__cuda_stream__`` protocol, + DLPack/CAI, and ``StridedMemoryView`` in depth. +- :doc:`API reference `: every public class and function, including + :class:`Linker` (runtime linking/LTO), the memory-resource family + (:class:`DeviceMemoryResource`, :class:`ManagedMemoryResource`, + :class:`VirtualMemoryResource`, ...), and the :mod:`graph ` + node types. +- :doc:`Environment variables `: runtime knobs such as + the per-thread default stream. +- Prefer writing kernels in Python instead of CUDA C++? `Numba CUDA + `_ compiles a subset of Python into CUDA + kernels, and `numba-cuda-mlir `_ is + its next-generation evolution. diff --git a/cuda_core/docs/source/index.rst b/cuda_core/docs/source/index.rst index 7e4e2ffd4ec..bf9128b1389 100644 --- a/cuda_core/docs/source/index.rst +++ b/cuda_core/docs/source/index.rst @@ -12,6 +12,7 @@ Welcome to the documentation for ``cuda.core``. getting-started install + 10_minutes_to_cuda_core examples interoperability api