The software rayQuery shim's _rqTraverseTlas detected BVH leaves with a
compile-time constant TLAS_BVH_LEAVES_START = 16384 - 1, while the actual
TLAS sweep tree is built at depth log2(next_pow2(instanceCount)). For any
scene with fewer than 8193 instances the padded leaf count is far below
16384, so no node index ever reached 16383: every node looked internal,
the descent walked into zeroed out-of-tree AABBs, and the pick reported a
permanent miss. This broke every rayQuery=true compute shader (builder
picking, splash queries) on the WebGPU backend.
Pass the per-build padded leaf count to the shim the same way the
megakernel _rtwTraverseTlas reads wfParams.tlasNPadded: a small uniform
(RqTlasMeta.nPadded) at @group(1) @binding(10), written each wgpuBuildTLAS
from wfNextPow2(instanceCount), and bound by both rayQuery dispatch paths.
_rqTraverseTlas now computes leavesStart = nPadded - 1 dynamically.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Indexing a `layout(descriptor_heap)` array with a runtime (non-constant)
index inside a ray-tracing hit shader device-losts on NVIDIA 610.43.02,
for both SSBO and sampled-image descriptors. A constant/spec-constant
index is fine, and the same dynamic pattern works in fragment shaders, so
it's an RT-stage-specific driver fault — the same family as #7/#15
(descriptor-heap AS reads) and #21/#22 (RT recursion + compute TLAS push).
Unlike the AS-read fault, this cannot be worked around transparently: a
sampled image has no device-address escape hatch the way an acceleration
structure does (OpConvertUToAccelerationStructureKHR), and a buffer-only
buffer_reference rewrite would need a whole address-table architecture
while still leaving the texture half broken. So the resolution is the
documented-limitation path (the precedent set by #7).
Records the fault and its isolation in README's Native RT status and in
the Sponza example README (the textured-closest-hit example, which already
reads its albedo through a spec-constant slot for exactly this reason).
Documents the recommended consumer pattern: bind one resource and index
*within* it dynamically (single geometry SSBO / buffer_reference at a
spec-constant slot; one texture2DArray indexed by layer) rather than
selecting a descriptor dynamically — what the WebGPU path already does.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two gaps in the Vulkan RT path that fault the device on the NVIDIA
proprietary driver with a non-trivial pipeline (simple VulkanTriangle
never hit them):
1. maxPipelineRayRecursionDepth was hardcoded to 1, so any closest-hit
shader that traces a secondary ray (shadow ray — a very common
pattern) recursed past the pipeline limit (UB → device fault).
PipelineRTVulkan::Init now takes a maxRecursionDepth parameter
(default 1, clamped to the device's maxRayRecursionDepth).
2. The NVIDIA descriptor-heap AS-read workaround rewrites every shader
that reads an accelerationStructureEXT from the heap — including
compute shaders — to read the TLAS device address from a push
constant, but only RTPass pushed that address. A compute shader that
ray-queries the TLAS (rayQueryEXT) therefore ran against an unwritten
push slot → garbage AS handle → VK_ERROR_DEVICE_LOST.
WorkaroundNvidiaAS::Patch now returns a per-shader PatchResult
{patched, tlasPushOffset} instead of writing the clobber-prone global
Device::workaroundTlasPushOffset (removed). VulkanShader stores it;
ShaderBindingTableVulkan/PipelineRTVulkan carry it for RTPass, and
ComputeShader tracks its own offset and pushes the caller-supplied
TLAS address in Dispatch (new defaulted tlasAddress parameter),
mirroring RTPass::Record.
The PushConstantRewrite regression test now asserts Patch's returned
patched/offset and adds two ray-querying compute-shader cases, proving
the rewrite is stage-agnostic and the per-shader offset is correct.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds tests/PushConstantRewrite, a host test that compiles representative
ray-generation shaders with glslang, runs the real WorkaroundNvidiaAS::Patch
over them, and asserts with spirv-val (the same invocation vkCreateShaderModule
uses) that the result is valid and contains exactly one push-constant block —
covering both the merge path (shaders that already declare a push constant,
including mat4/vec3/uint, a lone uint, and an array layout) and the synthesize
path, plus a no-op case (push constant but no AS read). It also checks the
published TLAS push offset for each layout.
The workaround namespace is exported so the test can drive Patch directly; both
go away with the rest of the workaround. project.cpp wires the test as an
executable that recompiles the module and requires glslang + spirv-val.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The NVIDIA descriptor-heap AS-read workaround (#15) rewrote heap
acceleration-structure reads into a load of the TLAS device address from
a push-constant block. It always *synthesized a new* push-constant block,
so any ray-tracing shader that already declared one ended up with two —
which SPIR-V forbids ("at most one push constant block statically used per
entry point"), and vkCreateShaderModule's spirv-val check rejected:
Entry point id '4' uses more than one PushConstant interface.
WorkaroundNvidiaAS::Patch now detects an existing PushConstant variable and,
when present, appends a single ulong member (the TLAS address) to that
block instead of adding a second one, reading the address through the
shader's own push-constant variable. The append offset is the end of the
user's block, computed from the members' explicit Offset/ArrayStride/
MatrixStride decorations (correct under both scalar and std140 layout) and
rounded up to 8. Shaders with no push constant of their own keep getting a
freshly synthesized single-member block at offset 0, exactly as before.
That offset is published via Device::workaroundTlasPushOffset and RTPass
feeds it to vkCmdPushDataEXT so the address lands where the rewritten load
reads it (0 for the synthesized case, preserving prior behaviour).
Verified on the affected driver (NVIDIA 610.43.02, RTX 4090): VulkanTriangle
ray-traces correctly and validation-clean both with and without a
user-declared raygen push constant.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The GPU-AV enable list was removed to dodge a crash in SDK 1.4.341,
whose GPU-AV null-deref'd on descriptor_heap pipelines
(VK_PIPELINE_CREATE_2_DESCRIPTOR_HEAP_BIT_EXT, layout = VK_NULL_HANDLE)
in PipelineSubState::GetPipelineLayoutUnion:
https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/12103
That was fixed in the next SDK release. The validation layer is now
1.4.350 (> 1.4.341), so restore VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_EXT
in the VkValidationFeaturesEXT enable list.
Verified by running the HelloUI example (which draws through the
descriptor_heap compute pipelines) with the layer active: it renders the
full UI for the entire run with GPU-AV reporting "Both GPU Assisted
Validation and Normal Core Check Validation are enabled" and no
descriptor-heap null-deref or VUID errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reading an acceleration structure through VK_EXT_descriptor_heap aborts
with VK_ERROR_DEVICE_LOST on NVIDIA 610.43.02 — a brand-new-extension
driver fault isolated in #7 (engine setup is correct and validation-clean;
images/buffers through the same heap work, and both traceRayEXT and inline
rayQuery fault identically on the AS read).
An acceleration structure can equally be reached by its device address via
OpConvertUToAccelerationStructureKHR, which reads no descriptor and so never
touches the faulting heap path. glslang has no GLSL spelling for that
conversion, so VulkanShader rewrites the compiled SPIR-V at module-load
time: every `OpLoad %accelStruct <heap-ptr>` becomes a load of the TLAS
device address from a synthesized push-constant block followed by the
convert. RTPass pushes the active frame's TLAS address into that push
constant. User GLSL and example code are unchanged; acceleration structures
still bind into the heap normally.
The workaround is gated on Device::workaroundDescriptorHeapAS (true only on
the NVIDIA proprietary driver) and confined to one fenced block in
Crafter.Graphics-ShaderVulkan.cppm plus the RTPass push and the shaderInt64
feature toggle — delete those once a fixed NVIDIA driver ships and the heap
AS read becomes the direct path again.
Verified: VulkanTriangle ray-traces correctly on native NVIDIA (RTX 4090),
validation-layer-clean, no device loss. The SPIR-V rewrite was independently
validated with spirv-val on both the VulkanTriangle and Sponza raygen
modules.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A 3x3x3 grid of AABB-geometry spheres rendered through an analytic
ray-sphere intersection shader, with an any-hit spherical-checkerboard
cut-out so the background shows through. Exercises both features end to
end on the WebGPU wavefront tracer.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The TRACE-stage BLAS descent now threads the payload through, runs the
any-hit shader for non-opaque candidates (DXR/VK opacity resolution:
ray FORCE flags > instance FORCE flags > geometry opaque bit), and
handles AABB leaves via the intersection shader. MeshRecord grows to 64
bytes with geomType + opaque. When any-hit/intersection are present the
TRACE pipeline takes the user bind-group layout so those shaders can
sample @group(3+) resources; otherwise TRACE keeps its zero-user-code
path unchanged. rayQuery stays triangle-only (skips AABB leaves).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PipelineRTWebGPU emits a runIntersection mega-switch and the
RT_HAS_ANYHIT / RT_HAS_INTERSECTION consts (+ the @CRAFTER_RT_TRACE_USER
marker) that gate the library's new TRACE-stage user callbacks, so an
opaque triangle-only scene still const-folds them away. Mesh-WebGPU
builds a SAH BVH2 over AABB primitives and uploads them in primitive
order for the intersection shader to fetch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extends the cross-backend RT type surface for procedural geometry +
any-hit on the WebGPU path:
- RTShaderGroupType::ProceduralHitGroup + RTShaderGroup::intersectionShader
(mirror VK_RAY_TRACING_SHADER_GROUP_TYPE_PROCEDURAL_HIT_GROUP_KHR).
- WebGPURTStage::Intersection for AABB intersection shaders.
- Mesh::BuildProcedural(span<RTAabb>, opaque) — the WebGPU analog of a
VK_GEOMETRY_TYPE_AABBS_KHR geometry.
- wgpuRegisterMeshBLAS gains geomType / opaqueFlag / primCount.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A 1-D indirect dispatch of ceil(W*H/64) workgroups for the wavefront
TRACE/SHADE stages overflows maxComputeWorkgroupsPerDimension (65535 on
Dawn/Firefox) once the surface exceeds ~4.19M rays (~2560x1640). Per the
WebGPU spec such a dispatch is silently dropped — no validation error —
so at 4K the world is never traced and the accumulator stays black while
non-RT passes survive.
_wfPrep now spreads the workgroups across a 2-D grid (x clamped to 65535,
y = ceil(wg/65535)), and the wfTrace/wfShade entry points rebuild the
linear ray index from (global_invocation_id, num_workgroups). The existing
`i >= _wfCurCount()` guard absorbs the grid overshoot. GENERATE/RESOLVE
already use a 2-D tile dispatch and are unchanged.
Verified in Firefox/WebGPU with RTStress at a 3449x1739 surface (5.99M
rays, 93716 workgroups — well over the 65535 cap): renders the full cube
grid where master shows a black screen.
Resolves#11
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the VK_ERROR_DEVICE_LOST on the native VulkanTriangle (#7).
Verified the engine side is correct and validation-clean: the BLAS/TLAS
build finishes before render (FinishInit waits), the built instance is
well-formed (identity transform, mask=0xFF, correct BLAS ref), and
vkWriteResourceDescriptorsEXT stores the TLAS device address at the
expected heap offset (confirmed by dumping the heap bytes). Khronos
validation 1.4.350 reports zero errors.
The fault is isolated to reading the acceleration structure through
VK_EXT_descriptor_heap:
- images/buffers via the same heap render fine (trace disabled -> the
raygen imageStore path renders a full gradient);
- both traceRayEXT and inline rayQueryEXT (no SBT) fault identically on
the AS read;
- reproduces with the AS descriptor at heap byte 0 / shader index 0 (no
offset/stride ambiguity) and regardless of pAddressRange size.
NVIDIA 610.43.02 is the only descriptor_heap implementation available
(llvmpipe lacks the extension), so there is no second implementation to
cross-check. Conclusion: driver-side fault in NVIDIA's brand-new
VK_EXT_descriptor_heap acceleration-structure path; should be reported to
NVIDIA. The traceRayEXT call is left active so the example stays a
faithful reproducer. Documented in both READMEs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dom-webgpu.js capped maxStorageBuffersPerShaderStage at 16 even when the
adapter reports far more (64 in our test env). The wavefront SHADE kernel
already binds ~16 storage buffers before any user binding, so any RT
pipeline declaring 2+ user storage buffers at @group(3) overflowed the
limit and failed to build with "Too many bindings of type StorageBuffers".
Request the adapter's reported maxStorageBuffersPerShaderStage /
maxStorageBuffersInPipelineLayout instead of a fixed 16. `clamp` already
mins against the adapter cap, so baseline-only devices still get a valid
request, and the `|| 16` fallback + the `typeof cap === "number"` guard
handle limit names a browser doesn't expose (Firefox returns null for
maxStorageBuffersInPipelineLayout).
Verified in-browser: a 17-storage-buffer compute pipeline fails with the
exact reported error on a device clamped to 16, and builds cleanly on a
device requesting the adapter's 64. RTStress renders correctly.
Resolves#8
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two Vulkan validation errors fired on startup of every native (Vulkan)
example, reported in #5:
1. vkCreateDevice enabledLayerCount != 0. Device layers are deprecated
and ignored since Vulkan 1.0; passing them is a spec violation
(VUID-VkDeviceCreateInfo-enabledLayerCount-12384). The device-layer
enumeration/match block in Device::Initialize is removed and
enabledLayerCount is pinned to 0 — layers are enabled at the instance
only.
2. vkQueueSubmit layout transition on a presentable image that "has not
been acquired". StartInit() and RecreateSwapchainAndImages() eagerly
transitioned every swapchain image UNDEFINED -> PRESENT_SRC_KHR before
any vkAcquireNextImageKHR, which the spec forbids (a presentable image
may only be touched after acquire). Those pre-transitions are removed.
Each image's first layout transition now happens lazily in Render(),
after acquire, from UNDEFINED; subsequent frames transition from
PRESENT_SRC_KHR. A per-image `imageInitialised` flag (reset in
CreateSwapchain) selects the correct oldLayout.
Verified under sway (headless, GPU renderer) + VK_LAYER_KHRONOS_validation:
the original code reproduces both errors on HelloUI; the fixed build emits
zero validation messages across initial render and swapchain recreation.
Resolves#5
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.
Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:
GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE
- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
the indirect-args buffer lives in its own bind group so it is never
bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
PerDimension so the W*H-sized work buffers fit past the 128MB baseline.
VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.
A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.
Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>