Commit graph

251 commits

Author SHA1 Message Date
2b266262ee Merge pull request 'fix(webgpu-rt): dynamic rayQuery TLAS leaf-start so picks hit for realistic instance counts (#25)' (#26) from claude/issue-25 into master 2026-06-04 15:33:55 +02:00
catbot
b645746c8c test(webgpu-rt): RayQueryPick example exercising the rayQuery TLAS shim (#25)
Adds an 8^3 = 512-instance TLAS pick test that shoots one analytically
determined ray through a rayQuery=true PlainComputeShader and checks the
read-back committed hit (customIndex 484, t 40.75). 512 instances sit in
the < 8193 regime that the hardcoded 16384-leaf start used to miss, so the
example fails fast if the shim regresses. Verified in Firefox/WebGPU:
"[RayQueryPick] PASS".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 13:33:04 +00:00
catbot
8f6a52a460 fix(webgpu-rt): derive rayQuery TLAS leaf-start from dynamic nPadded (#25)
The software rayQuery shim's _rqTraverseTlas detected BVH leaves with a
compile-time constant TLAS_BVH_LEAVES_START = 16384 - 1, while the actual
TLAS sweep tree is built at depth log2(next_pow2(instanceCount)). For any
scene with fewer than 8193 instances the padded leaf count is far below
16384, so no node index ever reached 16383: every node looked internal,
the descent walked into zeroed out-of-tree AABBs, and the pick reported a
permanent miss. This broke every rayQuery=true compute shader (builder
picking, splash queries) on the WebGPU backend.

Pass the per-build padded leaf count to the shim the same way the
megakernel _rtwTraverseTlas reads wfParams.tlasNPadded: a small uniform
(RqTlasMeta.nPadded) at @group(1) @binding(10), written each wgpuBuildTLAS
from wfNextPow2(instanceCount), and bound by both rayQuery dispatch paths.
_rqTraverseTlas now computes leavesStart = nPadded - 1 dynamically.
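
For illustration, the leaf test described above can be sketched on the CPU as follows. This is a minimal sketch, not the shim's WGSL: the helper names (NextPow2, LeavesStart, IsLeaf, LastNode) are hypothetical, and the heap-order layout with 2 * nPadded - 1 nodes is an assumption inferred from leavesStart = nPadded - 1.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two >= n (hypothetical stand-in for wfNextPow2).
inline std::uint32_t NextPow2(std::uint32_t n) {
    std::uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Assumed layout: a complete binary tree stored in heap order, so the
// internal nodes occupy [0, nPadded - 1) and the leaves start at nPadded - 1.
inline std::uint32_t LeavesStart(std::uint32_t nPadded) {
    assert(nPadded >= 1);
    return nPadded - 1;
}

// Highest node index that exists under the assumed layout.
inline std::uint32_t LastNode(std::uint32_t nPadded) {
    assert(nPadded >= 1);
    return 2 * nPadded - 2;
}

inline bool IsLeaf(std::uint32_t node, std::uint32_t leavesStart) {
    return node >= leavesStart;
}
```

Under these assumptions a 512-instance tree ends at node 1022, so a test against the old constant 16383 never reports a leaf, while LeavesStart(512) = 511 does.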

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 13:32:58 +00:00
f9d23cd1f9 Merge pull request 'docs(vulkan-rt): document dynamic descriptor_heap-index hit-shader fault (#23)' (#24) from claude/issue-23 into master 2026-06-03 22:05:45 +02:00
catbot
d08c7cea11 docs(vulkan-rt): document dynamic descriptor_heap-index hit-shader fault (#23)
Indexing a `layout(descriptor_heap)` array with a runtime (non-constant)
index inside a ray-tracing hit shader causes a device loss on NVIDIA 610.43.02,
for both SSBO and sampled-image descriptors. A constant/spec-constant
index is fine, and the same dynamic pattern works in fragment shaders, so
it's an RT-stage-specific driver fault — the same family as #7/#15
(descriptor-heap AS reads) and #21/#22 (RT recursion + compute TLAS push).

Unlike the AS-read fault, this cannot be worked around transparently: a
sampled image has no device-address escape hatch the way an acceleration
structure does (OpConvertUToAccelerationStructureKHR), and a buffer-only
buffer_reference rewrite would need a whole address-table architecture
while still leaving the texture half broken. So the resolution is the
documented-limitation path (the precedent set by #7).

Records the fault and its isolation in README's Native RT status and in
the Sponza example README (the textured-closest-hit example, which already
reads its albedo through a spec-constant slot for exactly this reason).
Documents the recommended consumer pattern: bind one resource and index
*within* it dynamically (single geometry SSBO / buffer_reference at a
spec-constant slot; one texture2DArray indexed by layer) rather than
selecting a descriptor dynamically — what the WebGPU path already does.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 20:05:12 +00:00
5358aee2f6 Merge pull request 'fix(vulkan-rt): configurable recursion depth + per-shader TLAS push for compute (#21)' (#22) from claude/issue-21 into master 2026-06-03 20:36:22 +02:00
catbot
1c310762a7 fix(vulkan-rt): configurable recursion depth + per-shader TLAS push for compute (#21)
Two gaps in the Vulkan RT path that fault the device on the NVIDIA
proprietary driver with a non-trivial pipeline (simple VulkanTriangle
never hit them):

1. maxPipelineRayRecursionDepth was hardcoded to 1, so any closest-hit
   shader that traces a secondary ray (shadow ray — a very common
   pattern) recursed past the pipeline limit (UB → device fault).
   PipelineRTVulkan::Init now takes a maxRecursionDepth parameter
   (default 1, clamped to the device's maxRayRecursionDepth).

2. The NVIDIA descriptor-heap AS-read workaround rewrites every shader
   that reads an accelerationStructureEXT from the heap — including
   compute shaders — to read the TLAS device address from a push
   constant, but only RTPass pushed that address. A compute shader that
   ray-queries the TLAS (rayQueryEXT) therefore ran against an unwritten
   push slot → garbage AS handle → VK_ERROR_DEVICE_LOST.

   WorkaroundNvidiaAS::Patch now returns a per-shader PatchResult
   {patched, tlasPushOffset} instead of writing the clobber-prone global
   Device::workaroundTlasPushOffset (removed). VulkanShader stores it;
   ShaderBindingTableVulkan/PipelineRTVulkan carry it for RTPass, and
   ComputeShader tracks its own offset and pushes the caller-supplied
   TLAS address in Dispatch (new defaulted tlasAddress parameter),
   mirroring RTPass::Record.

The PushConstantRewrite regression test now asserts Patch's returned
patched/offset and adds two ray-querying compute-shader cases, proving
the rewrite is stage-agnostic and the per-shader offset is correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 18:35:39 +00:00
2790bbd576 Merge pull request 'fix(vulkan-rt): merge TLAS push constant into existing block (#18)' (#20) from claude/issue-18 into master 2026-06-03 04:29:00 +02:00
catbot
471f480c5d test(vulkan-rt): spirv-val coverage for the push-constant rewrite (#18)
Adds tests/PushConstantRewrite, a host test that compiles representative
ray-generation shaders with glslang, runs the real WorkaroundNvidiaAS::Patch
over them, and asserts with spirv-val (the same invocation vkCreateShaderModule
uses) that the result is valid and contains exactly one push-constant block —
covering both the merge path (shaders that already declare a push constant,
including mat4/vec3/uint, a lone uint, and an array layout) and the synthesize
path, plus a no-op case (push constant but no AS read). It also checks the
published TLAS push offset for each layout.

The workaround namespace is exported so the test can drive Patch directly; both
go away with the rest of the workaround. project.cpp wires the test as an
executable that recompiles the module and requires glslang + spirv-val.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 02:28:09 +00:00
catbot
45ecc91424 fix(vulkan-rt): merge TLAS push constant into existing block (#18)
The NVIDIA descriptor-heap AS-read workaround (#15) rewrote heap
acceleration-structure reads into a load of the TLAS device address from
a push-constant block. It always *synthesized a new* push-constant block,
so any ray-tracing shader that already declared one ended up with two.
Vulkan's SPIR-V environment forbids that ("at most one push constant block
statically used per entry point"), and vkCreateShaderModule's spirv-val
check rejected the module:

    Entry point id '4' uses more than one PushConstant interface.

WorkaroundNvidiaAS::Patch now detects an existing PushConstant variable and,
when present, appends a single ulong member (the TLAS address) to that
block instead of adding a second one, reading the address through the
shader's own push-constant variable. The append offset is the end of the
user's block, computed from the members' explicit Offset/ArrayStride/
MatrixStride decorations (correct under both scalar and std140 layout) and
rounded up to 8. Shaders with no push constant of their own keep getting a
freshly synthesized single-member block at offset 0, exactly as before.
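
The append-offset rule above can be sketched as follows. This is a rough illustration under stated assumptions, not the code in WorkaroundNvidiaAS::Patch: the Member struct and both function names are hypothetical, and each member's size is assumed to already account for its ArrayStride/MatrixStride.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One push-constant block member (hypothetical struct for illustration).
struct Member {
    std::uint32_t offset;  // explicit Offset decoration
    std::uint32_t size;    // bytes occupied, strides already applied
};

// Round value up to a power-of-two alignment.
inline std::uint32_t AlignUp(std::uint32_t value, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Offset for an appended 64-bit address: the end of the furthest-reaching
// member, rounded up to 8. An empty member list yields offset 0.
inline std::uint32_t TlasAppendOffset(const std::vector<Member>& members) {
    std::uint32_t end = 0;
    for (const Member& m : members) {
        end = std::max(end, m.offset + m.size);
    }
    return AlignUp(end, 8);
}
```

With these assumptions a lone uint (4 bytes at offset 0) puts the address at offset 8, and a scalar-layout mat4/vec3/uint block ending at byte 80 puts it at 80.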

That offset is published via Device::workaroundTlasPushOffset and RTPass
feeds it to vkCmdPushDataEXT so the address lands where the rewritten load
reads it (0 for the synthesized case, preserving prior behaviour).

Verified on the affected driver (NVIDIA 610.43.02, RTX 4090): VulkanTriangle
ray-traces correctly and validation-clean both with and without a
user-declared raygen push constant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 02:28:02 +00:00
7c00ddd474 Merge pull request 'feat(vulkan): re-enable GPU-Assisted Validation' (#19) from claude/issue-17 into master 2026-06-03 04:11:00 +02:00
catbot
e7469133e8 feat(vulkan): re-enable GPU-Assisted Validation
The GPU-AV enable list was removed to dodge a crash in SDK 1.4.341,
whose GPU-AV null-deref'd on descriptor_heap pipelines
(VK_PIPELINE_CREATE_2_DESCRIPTOR_HEAP_BIT_EXT, layout = VK_NULL_HANDLE)
in PipelineSubState::GetPipelineLayoutUnion:
  https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/12103

That was fixed in the next SDK release. The validation layer is now
1.4.350 (> 1.4.341), so restore VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_EXT
in the VkValidationFeaturesEXT enable list.

Verified by running the HelloUI example (which draws through the
descriptor_heap compute pipelines) with the layer active: it renders the
full UI for the entire run with GPU-AV reporting "Both GPU Assisted
Validation and Normal Core Check Validation are enabled" and no
descriptor-heap null-deref or VUID errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 02:10:23 +00:00
f24107264d Merge pull request 'fix(vulkan-rt): work around NVIDIA descriptor-heap AS-read device-loss (#15)' (#16) from claude/issue-15 into master 2026-06-03 04:00:38 +02:00
catbot
950059c86e fix(vulkan-rt): work around NVIDIA descriptor-heap AS-read device-loss (#15)
Reading an acceleration structure through VK_EXT_descriptor_heap aborts
with VK_ERROR_DEVICE_LOST on NVIDIA 610.43.02. This is a driver fault in the
brand-new extension, isolated in #7 (engine setup is correct and validation-clean;
images/buffers through the same heap work, and both traceRayEXT and inline
rayQuery fault identically on the AS read).

An acceleration structure can equally be reached by its device address via
OpConvertUToAccelerationStructureKHR, which reads no descriptor and so never
touches the faulting heap path. glslang has no GLSL spelling for that
conversion, so VulkanShader rewrites the compiled SPIR-V at module-load
time: every `OpLoad %accelStruct <heap-ptr>` becomes a load of the TLAS
device address from a synthesized push-constant block followed by the
convert. RTPass pushes the active frame's TLAS address into that push
constant. User GLSL and example code are unchanged; acceleration structures
still bind into the heap normally.

The workaround is gated on Device::workaroundDescriptorHeapAS (true only on
the NVIDIA proprietary driver) and confined to one fenced block in
Crafter.Graphics-ShaderVulkan.cppm plus the RTPass push and the shaderInt64
feature toggle — delete those once a fixed NVIDIA driver ships and the heap
AS read becomes the direct path again.

Verified: VulkanTriangle ray-traces correctly on native NVIDIA (RTX 4090),
validation-layer-clean, no device loss. The SPIR-V rewrite was independently
validated with spirv-val on both the VulkanTriangle and Sponza raygen
modules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 01:59:54 +00:00
b9f65f5273 Merge pull request 'feat(webgpu-rt): any-hit + AABB (procedural) geometry support' (#14) from claude/issue-13 into master 2026-06-03 00:10:17 +02:00
catbot
5dd1086f08 docs(webgpu-rt): add RTVolume example (procedural spheres + any-hit cut-out)
A 3x3x3 grid of AABB-geometry spheres rendered through an analytic
ray-sphere intersection shader, with an any-hit spherical-checkerboard
cut-out so the background shows through. Exercises both features end to
end on the WebGPU wavefront tracer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:09:30 +00:00
catbot
1628e1a58c feat(webgpu-rt): wire any-hit + AABB intersection into wavefront traversal
The TRACE-stage BLAS descent now threads the payload through, runs the
any-hit shader for non-opaque candidates (DXR/VK opacity resolution:
ray FORCE flags > instance FORCE flags > geometry opaque bit), and
handles AABB leaves via the intersection shader. MeshRecord grows to 64
bytes with geomType + opaque. When any-hit/intersection are present the
TRACE pipeline takes the user bind-group layout so those shaders can
sample @group(3+) resources; otherwise TRACE keeps its zero-user-code
path unchanged. rayQuery stays triangle-only (skips AABB leaves).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:09:25 +00:00
catbot
a91603c70b feat(webgpu-rt): emit intersection/any-hit dispatch + build AABB BVH
PipelineRTWebGPU emits a runIntersection mega-switch and the
RT_HAS_ANYHIT / RT_HAS_INTERSECTION consts (+ the @CRAFTER_RT_TRACE_USER
marker) that gate the library's new TRACE-stage user callbacks, so an
opaque triangle-only scene still const-folds them away. Mesh-WebGPU
builds a SAH BVH2 over AABB primitives and uploads them in primitive
order for the intersection shader to fetch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:09:20 +00:00
catbot
321fe596a7 feat(webgpu-rt): add intersection stage, procedural hit group, AABB BLAS API
Extends the cross-backend RT type surface for procedural geometry +
any-hit on the WebGPU path:

- RTShaderGroupType::ProceduralHitGroup + RTShaderGroup::intersectionShader
  (mirror VK_RAY_TRACING_SHADER_GROUP_TYPE_PROCEDURAL_HIT_GROUP_KHR).
- WebGPURTStage::Intersection for AABB intersection shaders.
- Mesh::BuildProcedural(span<RTAabb>, opaque) — the WebGPU analog of a
  VK_GEOMETRY_TYPE_AABBS_KHR geometry.
- wgpuRegisterMeshBLAS gains geomType / opaqueFlag / primCount.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:09:14 +00:00
d7b9a41b4f Merge pull request 'fix(webgpu): reshape wavefront TRACE/SHADE to 2-D to survive >4.19M rays' (#12) from claude/issue-11 into master 2026-06-01 13:10:05 +02:00
catbot
1e749818ef fix(webgpu): reshape wavefront TRACE/SHADE to 2-D to survive >4.19M rays
A 1-D indirect dispatch of ceil(W*H/64) workgroups for the wavefront
TRACE/SHADE stages overflows maxComputeWorkgroupsPerDimension (65535 on
Dawn/Firefox) once the surface exceeds ~4.19M rays (~2560x1640). Per the
WebGPU spec such a dispatch is silently dropped — no validation error —
so at 4K the world is never traced and the accumulator stays black while
non-RT passes survive.

_wfPrep now spreads the workgroups across a 2-D grid (x clamped to 65535,
y = ceil(wg/65535)), and the wfTrace/wfShade entry points rebuild the
linear ray index from (global_invocation_id, num_workgroups). The existing
`i >= _wfCurCount()` guard absorbs the grid overshoot. GENERATE/RESOLVE
already use a 2-D tile dispatch and are unchanged.
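
The grid arithmetic above can be checked on the CPU with a small sketch. The names DispatchGrid and LinearIndex are hypothetical, and the index formula assumes a 64x1x1 workgroup, which matches the ceil(W*H/64) figure but is not spelled out in this message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWorkgroupSize = 64;       // invocations per workgroup
constexpr std::uint32_t kMaxPerDimension = 65535;  // maxComputeWorkgroupsPerDimension

struct Grid {
    std::uint32_t x;
    std::uint32_t y;
};

// Spread the workgroups for rayCount rays over a 2-D grid:
// x is clamped to the per-dimension cap, y = ceil(workgroups / cap).
inline Grid DispatchGrid(std::uint64_t rayCount) {
    const std::uint64_t wg = (rayCount + kWorkgroupSize - 1) / kWorkgroupSize;
    Grid g;
    g.x = static_cast<std::uint32_t>(std::min<std::uint64_t>(wg, kMaxPerDimension));
    g.y = static_cast<std::uint32_t>((wg + kMaxPerDimension - 1) / kMaxPerDimension);
    assert(g.y <= kMaxPerDimension);
    return g;
}

// Linear ray index of an invocation: each grid row is
// numWorkgroups.x * 64 invocations wide (assumed 64x1x1 workgroup).
inline std::uint64_t LinearIndex(std::uint32_t gidX, std::uint32_t gidY,
                                 Grid numWorkgroups) {
    return std::uint64_t(gidY) * numWorkgroups.x * kWorkgroupSize + gidX;
}
```

For a 3449x1739 surface this gives a 65535 x 2 grid. The grid covers more invocations than there are rays, which is the overshoot the `i >= _wfCurCount()` guard is said to absorb.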

Verified in Firefox/WebGPU with RTStress at a 3449x1739 surface (5.99M
rays, 93716 workgroups — well over the 65535 cap): renders the full cube
grid where master shows a black screen.

Resolves #11

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 11:09:15 +00:00
afb9e320e1 Merge pull request 'docs(vulkan-rt): native descriptor-heap AS read is an NVIDIA driver fault (#7)' (#10) from claude/issue-7 into master 2026-06-01 00:22:52 +02:00
catbot
464cb66063 docs(vulkan-rt): record native descriptor-heap AS read as a driver fault
Investigated the VK_ERROR_DEVICE_LOST on the native VulkanTriangle (#7).
Verified the engine side is correct and validation-clean: the BLAS/TLAS
build finishes before render (FinishInit waits), the built instance is
well-formed (identity transform, mask=0xFF, correct BLAS ref), and
vkWriteResourceDescriptorsEXT stores the TLAS device address at the
expected heap offset (confirmed by dumping the heap bytes). Khronos
validation 1.4.350 reports zero errors.

The fault is isolated to reading the acceleration structure through
VK_EXT_descriptor_heap:
- images/buffers via the same heap render fine (trace disabled -> the
  raygen imageStore path renders a full gradient);
- both traceRayEXT and inline rayQueryEXT (no SBT) fault identically on
  the AS read;
- reproduces with the AS descriptor at heap byte 0 / shader index 0 (no
  offset/stride ambiguity) and regardless of pAddressRange size.

NVIDIA 610.43.02 is the only descriptor_heap implementation available
(llvmpipe lacks the extension), so there is no second implementation to
cross-check. Conclusion: driver-side fault in NVIDIA's brand-new
VK_EXT_descriptor_heap acceleration-structure path; should be reported to
NVIDIA. The traceRayEXT call is left active so the example stays a
faithful reproducer. Documented in both READMEs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 22:21:57 +00:00
6470c12db5 Merge pull request 'fix(webgpu): request adapter's storage-buffer limit, not hardcoded 16' (#9) from claude/issue-8 into master 2026-05-31 23:58:19 +02:00
catbot
23780d83a8 fix(webgpu): request adapter's storage-buffer limit, not hardcoded 16
dom-webgpu.js capped maxStorageBuffersPerShaderStage at 16 even when the
adapter reports far more (64 in our test env). The wavefront SHADE kernel
already binds ~16 storage buffers before any user binding, so any RT
pipeline declaring 2+ user storage buffers at @group(3) overflowed the
limit and failed to build with "Too many bindings of type StorageBuffers".

Request the adapter's reported maxStorageBuffersPerShaderStage /
maxStorageBuffersInPipelineLayout instead of a fixed 16. `clamp` already
mins against the adapter cap, so baseline-only devices still get a valid
request, and the `|| 16` fallback + the `typeof cap === "number"` guard
handle limit names a browser doesn't expose (Firefox returns null for
maxStorageBuffersInPipelineLayout).

Verified in-browser: a 17-storage-buffer compute pipeline fails with the
exact reported error on a device clamped to 16, and builds cleanly on a
device requesting the adapter's 64. RTStress renders correctly.

Resolves #8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 21:55:42 +00:00
26a41ac528 Merge pull request 'fix(vulkan): clear startup validation errors on native triangle' (#6) from claude/issue-5 into master 2026-05-31 22:59:47 +02:00
catbot
cac433ee09 fix(vulkan): clear startup validation errors on native triangle
Two Vulkan validation errors fired on startup of every native (Vulkan)
example, reported in #5:

1. vkCreateDevice enabledLayerCount != 0. Device layers are deprecated
   and ignored since Vulkan 1.0; passing them is a spec violation
   (VUID-VkDeviceCreateInfo-enabledLayerCount-12384). The device-layer
   enumeration/match block in Device::Initialize is removed and
   enabledLayerCount is pinned to 0 — layers are enabled at the instance
   only.

2. vkQueueSubmit layout transition on a presentable image that "has not
   been acquired". StartInit() and RecreateSwapchainAndImages() eagerly
   transitioned every swapchain image UNDEFINED -> PRESENT_SRC_KHR before
   any vkAcquireNextImageKHR, which the spec forbids (a presentable image
   may only be touched after acquire). Those pre-transitions are removed.
   Each image's first layout transition now happens lazily in Render(),
   after acquire, from UNDEFINED; subsequent frames transition from
   PRESENT_SRC_KHR. A per-image `imageInitialised` flag (reset in
   CreateSwapchain) selects the correct oldLayout.
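
   A minimal sketch of the per-image flag, assuming a small tracker class.
   SwapchainLayoutTracker and the Layout enum are hypothetical stand-ins so
   the example compiles without Vulkan headers; only the flag idea comes
   from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for VK_IMAGE_LAYOUT_UNDEFINED and VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
enum class Layout { Undefined, PresentSrc };

class SwapchainLayoutTracker {
public:
    // Call when the swapchain is (re)created: every image is fresh again.
    void Reset(std::size_t imageCount) { initialised_.assign(imageCount, false); }

    // Call after acquire: returns the oldLayout for this frame's transition
    // and marks the image as initialised for later frames.
    Layout OldLayoutAfterAcquire(std::size_t imageIndex) {
        assert(imageIndex < initialised_.size());
        const bool seen = initialised_[imageIndex];
        initialised_[imageIndex] = true;
        return seen ? Layout::PresentSrc : Layout::Undefined;
    }

private:
    std::vector<bool> initialised_;
};
```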

Verified under sway (headless, GPU renderer) + VK_LAYER_KHRONOS_validation:
the original code reproduces both errors on HelloUI; the fixed build emits
zero validation messages across initial render and swapchain recreation.

Resolves #5

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:59:10 +00:00
6a54c3c4ca Merge pull request 'WebGPU RT: wavefront/streaming tracer (replaces megakernel)' (#4) from claude/issue-3 into master 2026-05-31 22:31:35 +02:00
catbot
358084185a docs: wavefront RT in README + design-doc status; add RTStress to examples 2026-05-31 20:29:12 +00:00
catbot
afc0292fab WebGPU RT: dynamic TLAS sweep-tree depth (next_pow2 instances)
The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:28:12 +00:00
catbot
82e5e867d4 WebGPU RT: remove dead megakernel WGSL (no dual path)
The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:24:04 +00:00
catbot
dd4122f2ba WebGPU RT: ordered (nearest-child-first) traversal
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.
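
As a rough CPU illustration of the idea, an entry-distance slab test plus a nearest-first ordering step might look like the following. The types and function names are hypothetical and this is not the WGSL of _rtAabbT; axis-parallel rays whose origin lies exactly on a slab plane are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 lo, hi; };

constexpr float kMiss = std::numeric_limits<float>::infinity();

// Slab test returning the entry distance, or kMiss when the box is missed
// or starts beyond tMax. invDir holds 1/direction per axis.
inline float AabbEntryT(const Aabb& b, Vec3 origin, Vec3 invDir, float tMax) {
    assert(tMax >= 0.0f);
    const float o[3] = {origin.x, origin.y, origin.z};
    const float d[3] = {invDir.x, invDir.y, invDir.z};
    const float lo[3] = {b.lo.x, b.lo.y, b.lo.z};
    const float hi[3] = {b.hi.x, b.hi.y, b.hi.z};
    float t0 = 0.0f, t1 = tMax;
    for (int axis = 0; axis < 3; ++axis) {
        float tNear = (lo[axis] - o[axis]) * d[axis];
        float tFar = (hi[axis] - o[axis]) * d[axis];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
    }
    return t0 <= t1 ? t0 : kMiss;
}

// Child visit order: index 0 = left, 1 = right, -1 = nothing to visit.
struct ChildOrder { int first; int second; };

// Descend into the nearer child first; the farther one is deferred
// (pushed on the stack) only if it was hit at all.
inline ChildOrder OrderChildren(float tLeft, float tRight) {
    const bool hitL = tLeft != kMiss;
    const bool hitR = tRight != kMiss;
    if (!hitL && !hitR) return ChildOrder{-1, -1};
    if (!hitR) return ChildOrder{0, -1};
    if (!hitL) return ChildOrder{1, -1};
    return tLeft <= tRight ? ChildOrder{0, 1} : ChildOrder{1, 0};
}
```

Passing the current bestT as tMax when a deferred child is popped gives the re-cull step: a box that starts beyond the tightened bestT reports kMiss and is skipped.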

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:21:44 +00:00
catbot
376e66aeed WebGPU RT: port Sponza to wavefront (shadow ray in SHADE)
Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.

Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:16:04 +00:00
catbot
1d2e12dbc9 WebGPU RT: GPU timestamp-query per-pass harness
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:08:39 +00:00
catbot
f4d6493d91 wip: uncommitted changes from claude run on issue #3 2026-05-31 16:28:38 +00:00
catbot
4e42d663a6 WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE)
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize /
  maxComputeWorkgroupsPerDimension so the W*H-sized work buffers fit past
  the 128MB baseline.
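
The pass order in the diagram above can be written out as a host-side schedule. This is only a sketch: PassSchedule is a hypothetical helper that lists pass names, whereas the real loop is described as GPU-driven through indirect dispatch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE.
inline std::vector<std::string> PassSchedule(unsigned maxDepth) {
    assert(maxDepth >= 1);
    std::vector<std::string> passes;
    passes.push_back("GENERATE");
    for (unsigned bounce = 0; bounce < maxDepth; ++bounce) {
        passes.push_back("PREP");   // publish indirect args from the live ray count
        passes.push_back("TRACE");  // traversal only, no user code
        passes.push_back("SHADE");  // user hit/miss shaders, may emit rays
    }
    passes.push_back("RESOLVE");
    return passes;
}
```

A frame therefore records 2 + 3 * maxDepth passes, each of which would need its own WfParams slot in the dynamic-offset ring.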

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 16:24:41 +00:00
e0d72f57f2 Merge pull request 'WebGPU RT: enable TLAS spatial sort via bitonic network (plan phase 3)' (#2) from claude/issue-1 into master 2026-05-31 17:49:38 +02:00
catbot
14091dcdca WebGPU RT: enable TLAS spatial sort via bitonic network
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.
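
A serial CPU sketch of a bitonic sorting network, for illustration only. BitonicSort is a hypothetical name and this is the textbook network, not the lbvhBuildMain kernel. It counts sub-stages so the 105 figure for 2^14 keys can be checked (14 * 15 / 2 = 105).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sorts keys ascending in place and returns the number of sub-stages run.
// keys.size() must be a power of two.
inline std::size_t BitonicSort(std::vector<std::uint32_t>& keys) {
    const std::size_t n = keys.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    std::size_t subStages = 0;
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of merged runs
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // partner distance
            ++subStages;
            // The partner of i depends only on (i, j, k), never on key
            // values. On a GPU the pairs of one sub-stage run in parallel
            // with a barrier between sub-stages; here they run serially.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once
                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending ? keys[i] > keys[partner]
                                                  : keys[i] < keys[partner];
                if (outOfOrder) std::swap(keys[i], keys[partner]);
            }
        }
    }
    return subStages;
}
```

Each sub-stage performs n / 2 compare-exchanges, so 2^14 keys split over 1024 threads gives the 8 compare-exchanges per thread quoted above.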

Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 15:48:29 +00:00
162d98cf5b got rid of --local 2026-05-27 04:38:30 +02:00
909a9b46d2 wasm fixes 2026-05-26 22:50:49 +02:00
8347467e1e webgpu improvements 2026-05-24 13:32:08 +02:00
5a75571ffd readme update 2026-05-19 01:43:46 +02:00
850ef7bfb3 clipboard 2026-05-19 00:45:22 +02:00
b5d0f52da0 webgpu sponza 2026-05-19 00:27:09 +02:00
5553ded476 webgpu triangle 2026-05-18 18:43:30 +02:00
64116cd980 custom shader webgpu 2026-05-18 05:39:17 +02:00
dedf6b0467 webgpu support 2026-05-18 04:58:52 +02:00
5352ef69a2 browser DOM support 2026-05-18 02:07:48 +02:00
3859c43ce3 compression example 2026-05-12 00:27:55 +02:00
ac2eb7fb0a new input system 2026-05-12 00:24:48 +02:00