Investigated the VK_ERROR_DEVICE_LOST on the native VulkanTriangle (#7).
Verified the engine side is correct and validation-clean: the BLAS/TLAS
build finishes before render (FinishInit waits), the built instance is
well-formed (identity transform, mask=0xFF, correct BLAS ref), and
vkWriteResourceDescriptorsEXT stores the TLAS device address at the
expected heap offset (confirmed by dumping the heap bytes). Khronos
validation 1.4.350 reports zero errors.
The fault is isolated to reading the acceleration structure through
VK_EXT_descriptor_heap:
- images/buffers via the same heap render fine (trace disabled -> the
raygen imageStore path renders a full gradient);
- both traceRayEXT and inline rayQueryEXT (no SBT) fault identically on
the AS read;
- reproduces with the AS descriptor at heap byte 0 / shader index 0 (no
offset/stride ambiguity) and regardless of pAddressRange size.
NVIDIA 610.43.02 is the only descriptor_heap implementation available
(llvmpipe lacks the extension), so there is no second implementation to
cross-check. Conclusion: driver-side fault in NVIDIA's brand-new
VK_EXT_descriptor_heap acceleration-structure path; should be reported to
NVIDIA. The traceRayEXT call is left active so the example stays a
faithful reproducer. Documented in both READMEs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two Vulkan validation errors fired on startup of every native (Vulkan)
example, reported in #5:
1. vkCreateDevice enabledLayerCount != 0. Device layers are deprecated
and ignored since Vulkan 1.0; passing them is a spec violation
(VUID-VkDeviceCreateInfo-enabledLayerCount-12384). The device-layer
enumeration/match block in Device::Initialize is removed and
enabledLayerCount is pinned to 0 — layers are enabled at the instance
only.
2. vkQueueSubmit layout transition on a presentable image that "has not
been acquired". StartInit() and RecreateSwapchainAndImages() eagerly
transitioned every swapchain image UNDEFINED -> PRESENT_SRC_KHR before
any vkAcquireNextImageKHR, which the spec forbids (a presentable image
may only be touched after acquire). Those pre-transitions are removed.
Each image's first layout transition now happens lazily in Render(),
after acquire, from UNDEFINED; subsequent frames transition from
PRESENT_SRC_KHR. A per-image `imageInitialised` flag (reset in
CreateSwapchain) selects the correct oldLayout.
Verified under sway (headless, GPU renderer) + VK_LAYER_KHRONOS_validation:
the original code reproduces both errors on HelloUI; the fixed build emits
zero validation messages across initial render and swapchain recreation.
Resolves#5
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.
Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:
GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE
- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
the indirect-args buffer lives in its own bind group so it is never
bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
PerDimension so the W*H-sized work buffers fit past the 128MB baseline.
VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.
A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.
Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both were on disk but missing from the V2 port's implementations list;
Rendertarget is required for RendertargetVulkan linkage, Shm is the
Wayland shared-memory helper.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>