
[ludus-renderer]: add Vulkan mesh-shader backend (LudusTimestampedContext) #135

Open
wlewNV wants to merge 8 commits into main from dev/wlew/ludus_vk

wlewNV commented May 21, 2026

Collaborator
[Image: render comparison (left: cuda, middle: vk, right: diff)]

Summary

Adds LudusTimestampedContext — a Vulkan VK_EXT_mesh_shader rendering backend that mirrors the LudusCudaTimestampedContext API via CUDA–Vulkan external-memory interop (one vkCmdDrawMeshTasksEXT submission per primitive family). CUDA stays the default; the Vulkan path is opt-in and also selectable in interactive-drive via --ludus-backend vulkan.

What's in it

  • C++ Vulkan backend (vkutil, ludus_timestamped_vk): external-memory SSBOs imported into CUDA, MSAA, nvjpeg encode.
  • GL_EXT_mesh_shader shaders for the polyline/polygon/obstacle families; SPIR-V embedded in shaders_spv.h (rebuild with shaders/compile.sh).
  • Python LudusTimestampedContext (lazy import; ImportError only on construction).
  • interactive-drive: --ludus-backend {cuda,vulkan}.
  • examples/compare_vulkan_vs_cuda.py (CUDA-vs-Vulkan parity) + tests/test_vulkan_backend.py.
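
The lazy-import behaviour listed above (ImportError only on construction) can be sketched roughly as follows. This is an illustration of the pattern, not the PR's code: `LazyVulkanContext`, `_load_vk_plugin`, and the module name `_hypothetical_vk_plugin` are invented for the example.

```python
import importlib


def _load_vk_plugin(module_name="_hypothetical_vk_plugin"):
    """Import the native plugin on demand (module name is a placeholder)."""
    return importlib.import_module(module_name)


class LazyVulkanContext:
    """Stand-in for a context class whose native dependency is optional."""

    def __init__(self, plugin_module="_hypothetical_vk_plugin"):
        # The import happens here, not at package import time, so merely
        # importing the package never fails on machines without Vulkan.
        try:
            self._plugin = _load_vk_plugin(plugin_module)
        except ImportError as exc:
            raise ImportError(f"Vulkan backend unavailable: {exc}") from exc
```

With this shape, code that only uses the CUDA path never touches the Vulkan plugin, and a missing plugin surfaces as a single, clearly worded error at the point of use.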

Testing

# Parity vs CUDA (writes side-by-side + pixel diff to ./_vk_compare/):
cd integrations/omnidreams/ludus-renderer
uv run python examples/compare_vulkan_vs_cuda.py --frame 12   # auto-discovers cached USDZ
uv run python examples/compare_vulkan_vs_cuda.py --synthetic  # no scene data needed
uv run pytest tests/test_vulkan_backend.py
# interactive-drive on Vulkan (log shows: [rasterizer] ludus_backend=vulkan):
cd integrations/omnidreams
uv run python -m omnidreams.interactive_drive --backend raster --ludus-backend vulkan
uv run python -m omnidreams.interactive_drive --backend raster --ludus-backend vulkan --stream-mjpeg 8080

Caveats

  • CUDA→Vulkan interop uses opaque-FD external memory, so the Vulkan path is Linux-only. The CUDA backend is unchanged and cross-platform.

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps Bot commented Jun 5, 2026

Contributor

Greptile Summary

This PR introduces LudusTimestampedContext, a Vulkan VK_EXT_mesh_shader rendering backend that mirrors the LudusCudaTimestampedContext API using CUDA–Vulkan external-memory interop. CUDA stays the default; the Vulkan path is opt-in and selectable via --ludus-backend vulkan in interactive-drive.

  • Adds a complete C++ Vulkan backend (vkutil, ludus_timestamped_vk) with MSAA support, CUDA–Vulkan SSBO interop via opaque-FD external memory, and nvjpeg encode.
  • Introduces GL_EXT_mesh_shader GLSL shaders for polyline/polygon/obstacle families (SPIR-V embedded in shaders_spv.h), a Python LudusTimestampedContext with lazy JIT compilation, and a compare_vulkan_vs_cuda.py parity example with a smoke test suite.

Confidence Score: 5/5

Safe to merge. The rendering path is correct, all previously raised correctness issues have been resolved, and the remaining findings are non-blocking quality notes.

All critical correctness fixes from the prior review round are present: the HostToDevice tombstone copy with synchronization, the render-pass/pipeline rebuild on MSAA change, the descriptor-pool reset moved before vkCmdBeginRenderPass, and the multi-scene SSBO cursor tracking. The remaining findings are validation-layer noise and JIT-loader robustness issues that do not affect rendered output or data integrity.

Files worth a closer look: vkutil.cpp (transitionImageLayout uses the wrong source masks on the readback return transition) and _plugin_vk.py (TORCH_CUDA_ARCH_LIST side effect and vulkaninfo availability heuristic).
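
One way to address the vulkaninfo heuristic mentioned above is to look for the Vulkan loader library itself. The sketch below uses only the standard library; the function names are hypothetical, and the library name `vulkan` assumes a Linux system, which matches the PR's Linux-only caveat for the Vulkan path.

```python
import ctypes.util
import shutil


def vulkan_loader_available():
    """Return True if the Vulkan loader shared library can be located.

    ctypes.util.find_library searches the platform's usual library
    locations, so it checks for the loader itself rather than for the
    optional vulkaninfo tool.
    """
    return ctypes.util.find_library("vulkan") is not None


def vulkaninfo_on_path():
    """The weaker heuristic: only tells us a CLI tool is installed."""
    return shutil.which("vulkaninfo") is not None
```

The two checks can disagree in either direction: a container may ship the loader without the tools package, or the reverse.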

Important Files Changed

| Filename | Overview |
| --- | --- |
| integrations/omnidreams/ludus-renderer/ludus_renderer/_cpp/render/ludus_timestamped_vk.cpp | Core Vulkan renderer: render pass/pipeline creation, descriptor set management, CUDA→Vulkan SSBO sync (host-roundtrip + queue family barriers), and batch render dispatch. Previously flagged issues are all addressed. Minor: unchecked vkAllocateDescriptorSets return after pool reset. |
| integrations/omnidreams/ludus-renderer/ludus_renderer/_cpp/common/vkutil.cpp | Vulkan context/buffer/image helpers with CUDA external-memory interop via opaque-FD. transitionImageLayout has hardcoded src access masks that are incorrect for the TRANSFER_SRC→GENERAL return transition used in every frame readback, causing validation layer noise and unnecessary stalls. |
| integrations/omnidreams/ludus-renderer/ludus_renderer/_ops/context_vk.py | Python LudusTimestampedContext: scene/pool packing, SSBO cursor tracking, query assembly, and render dispatch. Global cursor state correctly mirrors C++ *Used counters. Struct layout verified against C++ types. |
| integrations/omnidreams/ludus-renderer/ludus_renderer/_ops/_plugin_vk.py | JIT compilation loader for the Vulkan extension. Two issues: TORCH_CUDA_ARCH_LIST='' is set process-globally before compilation, and vulkaninfo PATH presence is used as a proxy for libvulkan.so availability. |
| integrations/omnidreams/ludus-renderer/ludus_renderer/_cpp/bindings/torch_rasterize_vk.cpp | PyTorch pybind11 bindings wrapping the Vulkan state. FLU→RDF basis conversion, MSAA sample change triggers width/height reset for framebuffer rebuild, and color palette int32→float32 conversion are all correctly implemented. |
| integrations/omnidreams/ludus-renderer/tests/test_vulkan_backend.py | Smoke tests for Vulkan backend: init, upload, single-frame render, and a CUDA/Vulkan lit-pixel parity check (within 25%). Tests gracefully skip on missing Vulkan ICD or mesh-shader support. |
| integrations/omnidreams/omnidreams/interactive_drive/rasterizer.py | Adds backend selection: instantiates LudusTimestampedContext or LudusCudaTimestampedContext based on raster.ludus_backend. Change is minimal and correct. |
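
For readers unfamiliar with the FLU→RDF basis conversion mentioned for the bindings, here is a small sketch. It assumes the conventional reading of the acronyms (FLU = forward, left, up; RDF = right, down, forward); the PR does not spell them out, and the matrix and function names are illustrative only.

```python
# Rows express each RDF axis in terms of FLU components:
#   right = -left, down = -up, forward = forward.
FLU_TO_RDF = (
    (0.0, -1.0, 0.0),
    (0.0, 0.0, -1.0),
    (1.0, 0.0, 0.0),
)


def flu_to_rdf(point):
    """Convert an (x, y, z) point from FLU axes to RDF axes."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in FLU_TO_RDF)
```

Under that assumption, a point one unit ahead of the vehicle, `(1, 0, 0)` in FLU, lands on the RDF z axis, which is the usual camera-forward direction.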

Sequence Diagram

sequenceDiagram
    participant Py as Python (LudusTimestampedContext)
    participant Cpp as C++ Binding (torch_rasterize_vk)
    participant VkCpp as ludus_timestamped_vk
    participant CUDA as CUDA Runtime
    participant Vk as Vulkan GPU

    Py->>Cpp: upload_cameras(intrinsics_cuda_tensor)
    Cpp->>VkCpp: ludusUploadCamerasVk()
    VkCpp->>CUDA: cudaMemcpyAsync(DeviceToDevice, cameraIntrinsicsBuffer.cuDevPtr)

    Py->>Cpp: upload_scene(packed_tensors)
    Cpp->>VkCpp: ludusUploadSceneVk()
    VkCpp->>VkCpp: ensureBuffers() / resizeExternalBuffer()
    VkCpp->>CUDA: cudaMemcpyAsync(timestamps, int32, vertices DeviceToDevice)
    VkCpp->>VkCpp: updateDescriptorSet()

    Py->>Cpp: render_batch(queries, poses, resolution)
    Cpp->>VkCpp: ludusRenderBatchVk()
    VkCpp->>CUDA: cudaMemcpyAsync(queryBuffer, cameraPoseBuffer)
    VkCpp->>CUDA: cudaStreamSynchronize()

    alt kHostRoundtrip (LUDUS_VK_DIRECT_IMPORT not set)
        VkCpp->>CUDA: cudaMemcpy DeviceToHost (all SSBOs)
        VkCpp->>Vk: vkCmdUpdateBuffer (re-injects via Vulkan transfer path)
    else Direct import
        VkCpp->>Vk: VkBufferMemoryBarrier (VK_QUEUE_FAMILY_EXTERNAL to graphics)
    end

    VkCpp->>Vk: vkCmdBeginRenderPass
    VkCpp->>Vk: vkCmdDrawMeshTasksEXT (polyline / polygon / obstacle)
    VkCpp->>Vk: vkCmdEndRenderPass then vkQueueSubmit then vkWaitForFences

    Cpp->>VkCpp: ludusCopyBatchResultsVk()
    VkCpp->>Vk: vkCmdCopyImageToBuffer (colorImage to readbackBuffer)
    VkCpp->>CUDA: cudaMemcpyAsync(output, readbackBuffer.cuDevPtr, DeviceToDevice)
    Cpp-->>Py: NHWC uint8 tensor

Reviews (8): Last reviewed commit: "Track true per-pool varray max in Vulkan..."

wlewNV added 2 commits June 8, 2026 17:04
…dContext)

Adds a parallel rendering backend that mirrors the public API of the
existing CUDA software rasterizer. The new path uses VK_EXT_mesh_shader
with CUDA-Vulkan external-memory interop so render uploads stay on the
GPU, and is selected at construction via LudusTimestampedContext while
LudusCudaTimestampedContext remains the default everywhere.

New: Vulkan context (vkutil), pipelines for polyline/polygon/obstacle
mesh+task+fragment shaders, NV->EXT GLSL converter and SPIR-V embed
scripts, JIT plugin, Python context wrapper, and a CUDA-vs-Vulkan
example/parity test. Multi-pool task shaders use a force_zero_tasks
flag to keep over-dispatched workgroups' SSBO reads in-bounds, which
is what prevents the giant cross-pool garbage triangles seen with the
naive early-EmitMeshTasksEXT(0) pattern.
wlewNV force-pushed the dev/wlew/ludus_vk branch from 67d3b69 to cea0dd0 June 9, 2026 00:06
wlewNV commented Jun 9, 2026

Collaborator Author

Marking as ready for review to get fresh comments from greptile.

wlewNV marked this pull request as ready for review June 9, 2026 02:15
wlewNV added 2 commits June 8, 2026 19:41
Review fixes (greptile):
- remove_scene: write the tombstone with cudaMemcpyHostToDevice + stream sync
  instead of cudaMemcpyDeviceToDevice on a host stack pointer (P0)
- MSAA: rebuild the render pass and pipelines when the sample count changes so
  the framebuffer attachment counts match (P1)
- gate the device-info log behind VK_DBG; re-read LUDUS_VK_DIRECT_IMPORT on
  every render instead of caching it in a static bool (P2)
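
The P2 fix above (re-reading LUDUS_VK_DIRECT_IMPORT on every render) is a C++ change, but the idea translates directly. The Python sketch below is an analogy only: the function names are hypothetical, and treating any non-empty value as "set" is an assumption, since the PR does not say which values enable direct import.

```python
import os


def direct_import_enabled(environ=None):
    """Read LUDUS_VK_DIRECT_IMPORT at call time instead of caching it.

    Assumption for this sketch: any non-empty value enables direct import.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("LUDUS_VK_DIRECT_IMPORT", ""))


def choose_sync_path(environ=None):
    """Pick the SSBO sync path for one render call."""
    return "direct_import" if direct_import_enabled(environ) else "host_roundtrip"
```

Because nothing is cached, toggling the variable between two renders in the same process takes effect on the next call, which a static flag would not allow.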

Cleanup / redundancy:
- replace the loguru dependency with a small stdlib-logging shim (_logging.py)
- remove dead state (unused VkContext / LudusTimestampedVkState fields) and the
  inert VkCudaSync semaphore subsystem
- share VK_CHECK / VK_DBG via vkutil.h instead of duplicating them per file
- delete the orphaned ts_common.glsl and the one-time nv_to_ext.py migration;
  the committed shaders are maintained directly as GL_EXT_mesh_shader
- add render_to_staging device/contiguity checks and a removeScene device guard
- vectorize render_batch query packing to avoid per-query host/device syncs

Examples / packaging:
- compare_vulkan_vs_cuda.py: render clipgt USDZ scenes (extract clipgt/*.parquet)
  with scene-cache auto-discovery and an available-camera fallback
- stop tracking/shipping the intermediate .spv (regenerated by compile.sh and
  embedded in shaders_spv.h); ignore _vk_compare/ render output

The CUDA backend is unchanged and remains the default.
upload_scene packed every scene with buffer-base offsets of 0, so the task
shaders' scene.<buf>_offset + pool.<buf>_offset indexing made scene 2+ read
scene 0's pools / timestamps / vertices. Carry persistent global cursors that
mirror the C++ *Used counters: seed each scene descriptor's base offsets from
them, advance them per upload_scene, and reset them in clear_scenes. Single-
scene rendering is unchanged (all bases are 0 for the first scene).

Also tighten the verbose comments added in the previous commit.
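
The cursor scheme described in this commit message can be sketched as below. It is a simplified model, not the code in context_vk.py: the class name is hypothetical, and only the three buffers named in the message (pools, timestamps, vertices) are tracked.

```python
class SceneCursors:
    """Running totals of elements already packed into each shared buffer."""

    BUFFERS = ("pools", "timestamps", "vertices")

    def __init__(self):
        self.used = {name: 0 for name in self.BUFFERS}

    def upload_scene(self, counts):
        """Return this scene's base offsets, then advance the cursors.

        counts maps buffer name -> number of elements the scene appends.
        """
        bases = dict(self.used)  # snapshot before advancing
        for name in self.BUFFERS:
            self.used[name] += counts[name]
        return bases

    def clear_scenes(self):
        """Reset every cursor so the next upload starts at offset 0."""
        for name in self.BUFFERS:
            self.used[name] = 0
```

The first scene still gets all-zero bases, so single-scene rendering behaves as before; the second scene's bases equal the first scene's sizes, which is what keeps its shader-side indexing out of scene 0's data.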
wlewNV commented Jun 9, 2026

Collaborator Author

/ok to test 1681b99

…135)

greptile re-review (3/5) flagged two issues in ludus_timestamped_vk.cpp:

- updateDescriptorSet (which calls vkResetDescriptorPool + reallocates the set)
  was invoked after vkCmdBeginRenderPass, which the Vulkan spec forbids. Move it
  before the render pass begins.
- The kHostRoundtrip path copied every CUDA-imported SSBO host->device on every
  render, even buffers unchanged since upload_scene. Add a sceneBuffersDirty flag
  set by the control-plane ops (upload scene/cameras/palette, remove, clear) so
  the scene/camera/palette buffers are re-pushed only when they actually change;
  the per-query buffers (query/cameraPose) still roundtrip every frame.
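
The dirty-flag logic from the second bullet can be modelled in a few lines. This is a sketch of the bookkeeping only, with a hypothetical class name; starting in the dirty state is an assumption made so that the first render pushes everything.

```python
class RoundtripTracker:
    """Decide which buffers to re-push through the host on each render."""

    SCENE_BUFFERS = ("scene", "camera", "palette")
    PER_QUERY_BUFFERS = ("query", "cameraPose")

    def __init__(self):
        # Assumption: nothing has been pushed yet, so start dirty.
        self.scene_buffers_dirty = True

    def mark_dirty(self):
        """Call from upload scene/cameras/palette, remove, and clear."""
        self.scene_buffers_dirty = True

    def buffers_to_push(self):
        """Return the buffers to copy for this render and clear the flag."""
        buffers = list(self.PER_QUERY_BUFFERS)
        if self.scene_buffers_dirty:
            buffers.extend(self.SCENE_BUFFERS)
            self.scene_buffers_dirty = False
        return buffers
```

In steady state, with no control-plane calls between frames, each render copies only the two per-query buffers.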
wlewNV changed the title from "feat(ludus-renderer): add Vulkan mesh-shader backend (LudusTimestampedContext)" to "[ludus-renderer]: add Vulkan mesh-shader backend (LudusTimestampedContext)" Jun 9, 2026
wlewNV commented Jun 9, 2026

Collaborator Author

/ok to test ca140e6

…port

- Add --ludus-backend {cuda,vulkan} (default cuda), threaded through
  RasterConfig into the interactive-drive rasterizer so both the raster and
  world-model backends can render the HD map via LudusTimestampedContext.
- Fix createExternalImage: a Vulkan image with arrayLayers > 1 is layered, so
  the CUDA import must set CUDA_ARRAY3D_LAYERED; without it multi-frame/batched
  rendering failed with cuExternalMemoryGetMappedMipmappedArray INVALID_VALUE.
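
The --ludus-backend flag from the first bullet could be wired up along these lines. The sketch uses argparse and returns class names as strings so it runs without the renderer installed; the parser program name, help text, and the helper `context_class_name` are made up for illustration, and the real code threads the value through RasterConfig instead.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="interactive_drive_sketch")
    parser.add_argument(
        "--ludus-backend",
        choices=("cuda", "vulkan"),
        default="cuda",  # CUDA stays the default, as the PR states
        help="HD-map rasterizer backend",
    )
    return parser


def context_class_name(backend):
    """Map the flag value to the context class named in the PR."""
    if backend == "vulkan":
        return "LudusTimestampedContext"
    return "LudusCudaTimestampedContext"
```

Using `choices` means a typo such as `--ludus-backend vulcan` is rejected by the parser before any renderer code runs.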
wlewNV commented Jun 9, 2026

Collaborator Author

/ok to test 65e6bb1

gtong-nv commented Jun 9, 2026

Collaborator

On the first run in a fresh setup, I got:

  File "/home/horde/github/flashdreams/integrations/omnidreams/ludus-renderer/ludus_renderer/_ops/_plugin_vk.py", line 98, in _get_vk_plugin
    raise RuntimeError(f"Vulkan backend unavailable: {msg}")
RuntimeError: Vulkan backend unavailable: Vulkan headers not found. Install libvulkan-dev (Debian/Ubuntu) or set VULKAN_SDK to the Vulkan SDK root.

Installing libvulkan-dev solved it. Other than that, it works fine on a machine with graphics capability.
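
A header check in the spirit of the error message above might look like the following. This is a guess at the general shape, not the logic in _plugin_vk.py: the function name is hypothetical, and the search order ($VULKAN_SDK/include, then /usr/include) is an assumption based on the two remedies the message suggests.

```python
import os
from pathlib import Path


def find_vulkan_header(environ=None, system_include="/usr/include"):
    """Return the path to vulkan/vulkan.h, or None if it is not found.

    Search order in this sketch: $VULKAN_SDK/include first, then the
    system include directory (assumed location for libvulkan-dev).
    """
    env = os.environ if environ is None else environ
    candidates = []
    sdk_root = env.get("VULKAN_SDK")
    if sdk_root:
        candidates.append(Path(sdk_root) / "include")
    candidates.append(Path(system_include))
    for include_dir in candidates:
        header = include_dir / "vulkan" / "vulkan.h"
        if header.is_file():
            return header
    return None
```

Running a check like this before JIT compilation lets the loader fail early with an actionable message instead of a compiler error about a missing include.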

The `ruff format` pre-commit hook (CI "Run linter checks") reflowed the
--ludus-backend help string and the ctx_cls ternary. Pure formatting; no
behavior change.
wlewNV commented Jun 9, 2026

Collaborator Author

/ok to test 2b63821

greptile re-review (3/5) flagged a hardcoded MAX_VARRAYS_PER_POOL=1000 that
drove both the mesh-task dispatch count and the u_max_varrays_per_pool push
constant, so any polyline/polygon pool with >1000 varrays at a timestamp had
its tail silently dropped (task shader never invoked for those workgroup IDs).

- context_vk.py: compute max_varrays_per_ts_{polyline,polygon} from the
  timestamped prefix sums (mirrors the CUDA backend) and pass them to upload.
- thread the two values through the binding into ludusUploadSceneVk; track the
  running max in LudusTimestampedVkState (reset in clear_scenes) like the other
  per-scene maxima.
- drive the dispatch stride and push constant from the tracked per-family max
  instead of the constant 1000.
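
To make the fix concrete, here is a small sketch of deriving the per-timestamp maximum from prefix sums and of what a fixed cap would drop. The layout is an assumption: `prefix_sums` is taken to be an exclusive prefix sum with one more entry than there are timestamps, and both function names are hypothetical.

```python
def max_varrays_per_timestamp(prefix_sums):
    """Largest per-timestamp varray count, given exclusive prefix sums.

    prefix_sums has one more entry than there are timestamps; the count at
    timestamp t is prefix_sums[t + 1] - prefix_sums[t].
    """
    counts = [b - a for a, b in zip(prefix_sums, prefix_sums[1:])]
    return max(counts, default=0)


def dropped_by_fixed_cap(prefix_sums, cap=1000):
    """How many varrays a fixed cap would skip at the worst timestamp."""
    return max(0, max_varrays_per_timestamp(prefix_sums) - cap)
```

For a pool with 300, 1200, and 100 varrays at three timestamps (prefix sums `[0, 300, 1500, 1600]`), the tracked maximum is 1200, whereas the old constant of 1000 would have skipped the last 200 at the middle timestamp.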

Also gate the "[Vulkan] Context ready" line behind VK_DBG so context creation
is silent in production (the device/API line was already gated).

Verified: a single polyline pool with 1200 varrays now renders at CUDA<->Vulkan
parity (vk/cuda lit-pixel ratio 0.998); dot/polyline scenes unchanged.
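
A lit-pixel ratio like the one quoted above can be computed as in the sketch below. The definition of "lit" used here (any channel above a threshold of 0) is an assumption, as are the function names and the nested-list image layout; the default tolerance of 0.25 follows the 25% bound mentioned in the review summary of the smoke tests.

```python
def lit_pixel_count(image, threshold=0):
    """Count pixels with at least one channel above the threshold.

    image is a list of rows; each row is a list of (r, g, b) tuples.
    """
    return sum(1 for row in image for pixel in row if max(pixel) > threshold)


def lit_pixel_ratio(vk_image, cuda_image):
    """Vulkan/CUDA lit-pixel ratio; 1.0 means identical coverage."""
    cuda_lit = lit_pixel_count(cuda_image)
    if cuda_lit == 0:
        raise ValueError("reference image has no lit pixels")
    return lit_pixel_count(vk_image) / cuda_lit


def within_tolerance(ratio, tolerance=0.25):
    """True when the ratio is within the given fraction of 1.0."""
    return abs(ratio - 1.0) <= tolerance
```

A coverage ratio is deliberately coarse: it tolerates small rasterization differences at edges between the two backends while still catching missing or spurious geometry.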
wlewNV commented Jun 9, 2026

Collaborator Author

/ok to test 9b81d3f


VkPhysicalDeviceFeatures2 features2 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2};
features2.features.multiDrawIndirect = VK_TRUE;
features2.features.fillModeNonSolid = VK_TRUE;

Collaborator

Renderer SPIR-V declares OpCapability Int64, but this device-creation chain never enables VkPhysicalDeviceFeatures::shaderInt64. The validation layer flags it as VUID-VkShaderModuleCreateInfo-pCode-08740 on every shader module create with an Int64 op:

▎ vkCreateShaderModule(): SPIR-V Capability Int64 was declared, but one of the following requirements is required (VkPhysicalDeviceFeatures::shaderInt64).

Suggested fix — one line:

Suggested change
- features2.features.fillModeNonSolid = VK_TRUE;
+ features2.features.fillModeNonSolid = VK_TRUE;
+ features2.features.shaderInt64 = VK_TRUE;

Probably also worth querying vkGetPhysicalDeviceFeatures2 first and bailing out (or skipping the Int64 shader paths) if the physical device does not report shaderInt64 support, instead of blindly requesting it.
