[ludus-renderer]: add Vulkan mesh-shader backend (LudusTimestampedContext)#135
[ludus-renderer]: add Vulkan mesh-shader backend (LudusTimestampedContext)#135wlewNV wants to merge 8 commits into
Conversation
Greptile SummaryThis PR introduces
Confidence Score: 5/5Safe to merge. The rendering path is correct, all previously raised correctness issues have been resolved, and the remaining findings are non-blocking quality notes. All critical correctness fixes from the prior review round are present: the HostToDevice tombstone copy with synchronization, the render-pass/pipeline rebuild on MSAA change, the descriptor-pool reset moved before vkCmdBeginRenderPass, and the multi-scene SSBO cursor tracking. The remaining findings are validation-layer noise and JIT-loader robustness issues that do not affect rendered output or data integrity. vkutil.cpp (transitionImageLayout wrong source masks on the readback return transition) and _plugin_vk.py (TORCH_CUDA_ARCH_LIST side effect and vulkaninfo availability heuristic). Important Files Changed
Sequence DiagramsequenceDiagram
participant Py as Python (LudusTimestampedContext)
participant Cpp as C++ Binding (torch_rasterize_vk)
participant VkCpp as ludus_timestamped_vk
participant CUDA as CUDA Runtime
participant Vk as Vulkan GPU
Py->>Cpp: upload_cameras(intrinsics_cuda_tensor)
Cpp->>VkCpp: ludusUploadCamerasVk()
VkCpp->>CUDA: cudaMemcpyAsync(DeviceToDevice, cameraIntrinsicsBuffer.cuDevPtr)
Py->>Cpp: upload_scene(packed_tensors)
Cpp->>VkCpp: ludusUploadSceneVk()
VkCpp->>VkCpp: ensureBuffers() / resizeExternalBuffer()
VkCpp->>CUDA: cudaMemcpyAsync(timestamps, int32, vertices DeviceToDevice)
VkCpp->>VkCpp: updateDescriptorSet()
Py->>Cpp: render_batch(queries, poses, resolution)
Cpp->>VkCpp: ludusRenderBatchVk()
VkCpp->>CUDA: cudaMemcpyAsync(queryBuffer, cameraPoseBuffer)
VkCpp->>CUDA: cudaStreamSynchronize()
alt kHostRoundtrip (LUDUS_VK_DIRECT_IMPORT not set)
VkCpp->>CUDA: cudaMemcpy DeviceToHost (all SSBOs)
VkCpp->>Vk: vkCmdUpdateBuffer (re-injects via Vulkan transfer path)
else Direct import
VkCpp->>Vk: VkBufferMemoryBarrier (VK_QUEUE_FAMILY_EXTERNAL to graphics)
end
VkCpp->>Vk: vkCmdBeginRenderPass
VkCpp->>Vk: vkCmdDrawMeshTasksEXT (polyline / polygon / obstacle)
VkCpp->>Vk: vkCmdEndRenderPass then vkQueueSubmit then vkWaitForFences
Cpp->>VkCpp: ludusCopyBatchResultsVk()
VkCpp->>Vk: vkCmdCopyImageToBuffer (colorImage to readbackBuffer)
VkCpp->>CUDA: cudaMemcpyAsync(output, readbackBuffer.cuDevPtr, DeviceToDevice)
Cpp-->>Py: NHWC uint8 tensor
Reviews (8): Last reviewed commit: "Track true per-pool varray max in Vulkan..." | Re-trigger Greptile |
…dContext) Adds a parallel rendering backend that mirrors the public API of the existing CUDA software rasterizer. The new path uses VK_EXT_mesh_shader with CUDA-Vulkan external-memory interop so render uploads stay on the GPU, and is selected at construction via LudusTimestampedContext while LudusCudaTimestampedContext remains the default everywhere. New: Vulkan context (vkutil), pipelines for polyline/polygon/obstacle mesh+task+fragment shaders, NV->EXT GLSL converter and SPIR-V embed scripts, JIT plugin, Python context wrapper, and a CUDA-vs-Vulkan example/parity test. Multi-pool task shaders use a force_zero_tasks flag to keep over-dispatched workgroups' SSBO reads in-bounds, which is what prevents the giant cross-pool garbage triangles seen with the naive early-EmitMeshTasksEXT(0) pattern.
|
Marking as ready for review to get fresh comments from greptile. |
Review fixes (greptile): - remove_scene: write the tombstone with cudaMemcpyHostToDevice + stream sync instead of cudaMemcpyDeviceToDevice on a host stack pointer (P0) - MSAA: rebuild the render pass and pipelines when the sample count changes so the framebuffer attachment counts match (P1) - gate the device-info log behind VK_DBG; re-read LUDUS_VK_DIRECT_IMPORT on every render instead of caching it in a static bool (P2) Cleanup / redundancy: - replace the loguru dependency with a small stdlib-logging shim (_logging.py) - remove dead state (unused VkContext / LudusTimestampedVkState fields) and the inert VkCudaSync semaphore subsystem - share VK_CHECK / VK_DBG via vkutil.h instead of duplicating them per file - delete the orphaned ts_common.glsl and the one-time nv_to_ext.py migration; the committed shaders are maintained directly as GL_EXT_mesh_shader - add render_to_staging device/contiguity checks and a removeScene device guard - vectorize render_batch query packing to avoid per-query host/device syncs Examples / packaging: - compare_vulkan_vs_cuda.py: render clipgt USDZ scenes (extract clipgt/*.parquet) with scene-cache auto-discovery and an available-camera fallback - stop tracking/shipping the intermediate .spv (regenerated by compile.sh and embedded in shaders_spv.h); ignore _vk_compare/ render output The CUDA backend is unchanged and remains the default.
upload_scene packed every scene with buffer-base offsets of 0, so the task shaders' scene.<buf>_offset + pool.<buf>_offset indexing made scene 2+ read scene 0's pools / timestamps / vertices. Carry persistent global cursors that mirror the C++ *Used counters: seed each scene descriptor's base offsets from them, advance them per upload_scene, and reset them in clear_scenes. Single- scene rendering is unchanged (all bases are 0 for the first scene). Also tighten the verbose comments added in the previous commit.
|
/ok to test 1681b99 |
…135) greptile re-review (3/5) flagged two issues in ludus_timestamped_vk.cpp: - updateDescriptorSet (which calls vkResetDescriptorPool + reallocates the set) was invoked after vkCmdBeginRenderPass, which the Vulkan spec forbids. Move it before the render pass begins. - The kHostRoundtrip path copied every CUDA-imported SSBO host->device on every render, even buffers unchanged since upload_scene. Add a sceneBuffersDirty flag set by the control-plane ops (upload scene/cameras/palette, remove, clear) so the scene/camera/palette buffers are re-pushed only when they actually change; the per-query buffers (query/cameraPose) still roundtrip every frame.
|
/ok to test ca140e6 |
…port
- Add --ludus-backend {cuda,vulkan} (default cuda), threaded through
RasterConfig into the interactive-drive rasterizer so both the raster and
world-model backends can render the HD map via LudusTimestampedContext.
- Fix createExternalImage: a Vulkan image with arrayLayers > 1 is layered, so
the CUDA import must set CUDA_ARRAY3D_LAYERED; without it multi-frame/batched
rendering failed with cuExternalMemoryGetMappedMipmappedArray INVALID_VALUE.
|
/ok to test 65e6bb1 |
|
for the first run in a new setup, I got solved by installing |
The `ruff format` pre-commit hook (CI "Run linter checks") reflowed the --ludus-backend help string and the ctx_cls ternary. Pure formatting; no behavior change.
|
/ok to test 2b63821 |
greptile re-review (3/5) flagged a hardcoded MAX_VARRAYS_PER_POOL=1000 that
drove both the mesh-task dispatch count and the u_max_varrays_per_pool push
constant, so any polyline/polygon pool with >1000 varrays at a timestamp had
its tail silently dropped (task shader never invoked for those workgroup IDs).
- context_vk.py: compute max_varrays_per_ts_{polyline,polygon} from the
timestamped prefix sums (mirrors the CUDA backend) and pass them to upload.
- thread the two values through the binding into ludusUploadSceneVk; track the
running max in LudusTimestampedVkState (reset in clear_scenes) like the other
per-scene maxima.
- drive the dispatch stride and push constant from the tracked per-family max
instead of the constant 1000.
Also gate the "[Vulkan] Context ready" line behind VK_DBG so context creation
is silent in production (the device/API line was already gated).
Verified: a single polyline pool with 1200 varrays now renders at CUDA<->Vulkan
parity (vk/cuda lit-pixel ratio 0.998); dot/polyline scenes unchanged.
|
/ok to test 9b81d3f |
|
|
||
| VkPhysicalDeviceFeatures2 features2 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2}; | ||
| features2.features.multiDrawIndirect = VK_TRUE; | ||
| features2.features.fillModeNonSolid = VK_TRUE; |
There was a problem hiding this comment.
Renderer SPIR-V declares OpCapability Int64, but this device-creation chain never enables VkPhysicalDeviceFeatures::shaderInt64. The validation layer flags it as VUID-VkShaderModuleCreateInfo-pCode-08740 on every shader module create with an Int64 op:
▎ vkCreateShaderModule(): SPIR-V Capability Int64 was declared, but one of the following requirements is required (VkPhysicalDeviceFeatures::shaderInt64).
Suggested fix — one line:
| features2.features.fillModeNonSolid = VK_TRUE; | |
| features2.features.fillModeNonSolid = VK_TRUE; | |
| features2.features.shaderInt64 = VK_TRUE; |
Probably also worth querying vkGetPhysicalDeviceFeatures2 first and bailing out (or skipping the Int64 shader paths) if the physical device does not report shaderInt64 support, instead of blindly requesting it.
Summary
Adds
LudusTimestampedContext— a VulkanVK_EXT_mesh_shaderrendering backend that mirrors theLudusCudaTimestampedContextAPI via CUDA–Vulkan external-memory interop (onevkCmdDrawMeshTasksEXTsubmission per primitive family). CUDA stays the default; the Vulkan path is opt-in and also selectable ininteractive-drivevia--ludus-backend vulkan.What's in it
vkutil,ludus_timestamped_vk): external-memory SSBOs imported into CUDA, MSAA, nvjpeg encode.GL_EXT_mesh_shadershaders for the polyline/polygon/obstacle families; SPIR-V embedded inshaders_spv.h(rebuild withshaders/compile.sh).LudusTimestampedContext(lazy import;ImportErroronly on construction).interactive-drive:--ludus-backend {cuda,vulkan}.examples/compare_vulkan_vs_cuda.py(CUDA-vs-Vulkan parity) +tests/test_vulkan_backend.py.Testing
Caveats