WebGPU RT: enable TLAS spatial sort via bitonic network (plan phase 3) #2
Reference
Catcrafts/Crafter.Graphics!2
Summary
This PR lands Phase 3 (TLAS coherence) of the wavefront-rewrite plan in #1: it re-enables the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a data-oblivious workgroup bitonic sorting network.
The radix sort in `lbvhBuildMain` was gated behind `if (false)` because it produced count- and distribution-dependent corruption (see `TODO-lbvh-sort.md`): a memory-ordering bug in the Hillis-Steele scan / parallel scatter that surfaced only for certain Morton distributions, specifically a small object next to a tight cluster (the 3DForts projectile-near-fort case), and made geometry flicker. With the sort disabled, TLAS BVH leaves had no spatial coherence, which is fatal at the thousands-of-instances scale the many-instance benchmark targets.

A bitonic network's compare-exchange schedule depends only on `N_PADDED`, never on the key values, so it structurally cannot exhibit that class of distribution-dependent race (`TODO-lbvh-sort.md` strategy #5). This restores Morton (Z-order) spatial coherence to the TLAS.

What changed
- `additional/dom-webgpu.js`: `lbvhBuildMain` Phase 2 is now a bitonic sort: 105 compare-exchange sub-stages (14 × 15 / 2) over 2^14 keys, run by a single workgroup of 1024 threads at 8 compare-exchanges per thread per sub-stage, in place on `sortA` with a `storageBarrier` between sub-stages. Sentinel keys (`0xFFFFFFFF`) compare largest and settle at the tail, exactly where Phase 4 expects them. The now-dead radix histogram/scan workgroup memory (`shHist`/`shOffsets`/`shScan`) and constants (`BUCKETS`/`PASSES`/`SCAN_STEPS`) are removed (~130 fewer lines).
- `TODO-lbvh-sort.md`: marked resolved, historical analysis retained.

Downstream phases (3: write permutation, 4: leaf AABBs, 5: sweep-tree refit) and TLAS traversal are unchanged: they already consume `sortA` correctly and are agnostic to leaf order as long as sentinel leaves get degenerate AABBs (they do).

Verification
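As a quick arithmetic check on the Phase 2 figures quoted under What changed (105 sub-stages, 8 compare-exchanges per thread), the schedule can be enumerated on the CPU from the padded key count alone. This is a minimal sketch, not code from `dom-webgpu.js`: `bitonicSchedule` and `WORKGROUP_SIZE` are hypothetical names, and the `(k, j)` loop is the textbook bitonic schedule, which the shader is assumed to follow.

```javascript
// Hypothetical helper, not taken from dom-webgpu.js: list the bitonic
// sub-stages for a power-of-two key count. Each sub-stage is a (k, j) pair.
// Nothing in this function ever reads a key value.
function bitonicSchedule(nPadded) {
  if (nPadded < 2 || (nPadded & (nPadded - 1)) !== 0) {
    throw new RangeError("nPadded must be a power of two >= 2");
  }
  const stages = [];
  for (let k = 2; k <= nPadded; k *= 2) {   // size of the bitonic runs being merged
    for (let j = k / 2; j >= 1; j /= 2) {   // compare distance inside this merge
      stages.push({ k, j });
    }
  }
  return stages;
}

const N_PADDED = 1 << 14;     // 2^14 keys, as stated in the PR
const WORKGROUP_SIZE = 1024;  // single workgroup size, as stated in the PR

const schedule = bitonicSchedule(N_PADDED);
const pairsPerSubStage = N_PADDED / 2;  // each compare-exchange touches two keys
const perThread = pairsPerSubStage / WORKGROUP_SIZE;

console.log(schedule.length); // 105
console.log(perThread);       // 8
```

Because the list of `(k, j)` pairs is a function of `N_PADDED` only, two inputs with different Morton distributions walk exactly the same sequence of compare-exchanges.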
GPU unit test (CPU oracle, real Firefox/Dawn WebGPU stack). The exact bitonic kernel was run against the `TODO-lbvh-sort.md` acceptance criteria: all three required distributions (all-uniform, all-one-bucket, small-object-next-to-cluster) plus random, reverse, and empty (all-sentinel) inputs. Every case matched a CPU ascending-sort oracle bit-for-bit, with a valid real-index permutation and zero GPU errors. Critically, this includes the very distribution that broke the old radix sort.

End-to-end render. Sponza (25 TLAS instances) renders correctly with the sort live: no flicker, no missing geometry, no corruption.
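The oracle comparison described for the GPU unit test can be modeled entirely on the CPU. The sketch below is an illustration under stated assumptions, not the PR's test harness: it assumes keys and instance indices live in two parallel `Uint32Array`s (the real `sortA` layout is not shown in this PR), uses a small 64-key pad instead of 2^14, and all function names are hypothetical. Since a bitonic network is not stable, the index check only requires a valid permutation that maps back to the right keys, not a specific tie order.

```javascript
const SENTINEL = 0xFFFFFFFF; // padding key from the PR; compares largest

// CPU model of an in-place ascending bitonic sort on parallel key/index arrays.
function bitonicSortPairs(keys, idx) {
  const n = keys.length;
  if (n < 2 || (n & (n - 1)) !== 0) {
    throw new RangeError("length must be a power of two >= 2");
  }
  for (let k = 2; k <= n; k *= 2) {
    for (let j = k / 2; j >= 1; j /= 2) {
      for (let i = 0; i < n; i++) {
        const p = i ^ j;                 // partner slot for this sub-stage
        if (p <= i) continue;            // handle each pair once
        const ascending = (i & k) === 0; // direction of this bitonic run
        const outOfOrder = ascending ? keys[i] > keys[p] : keys[i] < keys[p];
        if (outOfOrder) {
          [keys[i], keys[p]] = [keys[p], keys[i]];
          [idx[i], idx[p]] = [idx[p], idx[i]];
        }
      }
    }
  }
}

// Pad real keys to n slots with sentinels (hypothetical helper).
function padWithSentinels(realKeys, n) {
  const out = new Uint32Array(n).fill(SENTINEL);
  out.set(realKeys);
  return out;
}

// Compare against a CPU ascending sort and validate the index permutation.
function checkAgainstOracle(input) {
  const keys = Uint32Array.from(input);
  const idx = Uint32Array.from(input, (_, i) => i);
  bitonicSortPairs(keys, idx);
  const oracle = Uint32Array.from(input).sort(); // typed arrays sort numerically
  const keysMatch = keys.every((v, i) => v === oracle[i]);
  const seen = new Uint8Array(input.length);
  let permOk = true;
  for (let i = 0; i < idx.length; i++) {
    if (seen[idx[i]] || input[idx[i]] !== keys[i]) permOk = false;
    seen[idx[i]] = 1;
  }
  return keysMatch && permOk;
}

const N = 64; // small stand-in for N_PADDED
const cluster = Array.from({ length: 19 }, (_, i) => 0x2AAAA000 + i);
const cases = {
  smallObjectNextToCluster: padWithSentinels([...cluster, 0x2AAA9F00], N),
  allOneBucket: padWithSentinels(new Array(20).fill(0x12345678), N),
  reverse: Uint32Array.from({ length: N }, (_, i) => N - 1 - i),
  emptyAllSentinel: padWithSentinels([], N),
};

for (const [name, input] of Object.entries(cases)) {
  console.log(name, checkAgainstOracle(input)); // each case prints true
}
```

A GPU version of this check would read the sorted buffer back and run the same two comparisons against it; the readback path is outside what this sketch covers.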
Scope note
Issue #1 lays out a full multi-phase wavefront rewrite (megakernel split into GENERATE/TRACE/SHADE/RESOLVE, GPU-driven indirect bounce loop, the `RTStress` benchmark + timing HUD, the emit/accumulate API break, ordered traversal). This PR delivers the plan's Phase 3 (TLAS coherence) in isolation: the plan explicitly calls it out as independent of the kernel rewrite and validatable against the existing renderer, and it fixes the concrete documented bug that breaks the many-instance scene. The remaining phases (0, 2, 4–7) are intentionally not in this PR and remain open follow-up work; landing the coherence fix first de-risks the benchmark those phases will be measured against. Dynamic TLAS tree depth (`N_PADDED = next_pow2(N_real)`) is deferred: it couples the build and trace shaders and is secondary to coherence.

Screenshots
Sponza atrium, software ray-traced with the bitonic-sorted TLAS:
Refs #1