LBVH parallel radix sort: count-dependent corruption
Summary
The parallel radix sort in lbvhBuildMain (additional/dom-webgpu.js) produces
incorrect output that depends on the input distribution. Symptom: geometry in
the BVH-built TLAS appears to flicker (instances missing or pointing at the
wrong entry) as soon as a small object enters the TLAS alongside a tight
cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
Bisected by selectively skipping each LBVH phase. Skipping only the radix sort eliminates the corruption — every other phase (scene-AABB reduce, Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
Current state: the sort is gated behind if (false) in lbvhBuildMain. BVH
leaves are in instance-index order with no spatial coherence. The BVH still
builds correctly and traversal still descends a real tree, just with looser
parent AABBs.
What we know
- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
- Keys are (morton16 << 16) | (tlasIndex16); sentinels (i >= n) get 0xFFFFFFFF.
- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a Hillis-Steele exclusive prefix scan to compute per-thread destination offsets.
- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
- The math of the Hillis-Steele scan was verified: after log2(THREADS) = 10 steps with the read/barrier/write/barrier pattern, shScan[tid] holds the inclusive prefix sum.
- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix + localIdx`, where exclusivePrefix is per-thread and localIdx increments per-element within the thread.
- All required barriers are present:
  - workgroupBarrier between scan iterations.
  - workgroupBarrier at end of each bucket iteration.
  - storageBarrier at end of each radix pass.
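The scan and scatter-offset claims above can be checked on the CPU. The following is a minimal JavaScript sketch, not the shader code: it emulates the read/barrier/write/barrier steps in lockstep and derives destinations as shOffsets[b] + exclusivePrefix + localIdx. The function names are hypothetical, and bucketBase stands in for shOffsets[b].

```javascript
// CPU emulation of a single-buffered Hillis-Steele scan over one workgroup.
// Each step has two phases separated by a simulated barrier:
// every thread reads shScan[tid - offset] first, then every thread writes.
function hillisSteeleInclusive(counts) {
  const threads = counts.length; // assumed power of two, e.g. 1024
  const shScan = Uint32Array.from(counts);
  const read = new Uint32Array(threads); // per-thread "register"
  for (let offset = 1; offset < threads; offset <<= 1) {
    // Phase 1: all threads read (a workgroupBarrier would follow).
    for (let tid = 0; tid < threads; tid++) {
      read[tid] = tid >= offset ? shScan[tid - offset] : 0;
    }
    // Phase 2: all threads write (a workgroupBarrier would follow).
    for (let tid = 0; tid < threads; tid++) {
      shScan[tid] += read[tid];
    }
  }
  return shScan; // inclusive prefix sum
}

// Exclusive prefix for a thread = inclusive prefix minus its own count.
function toExclusive(inclusive, counts) {
  return inclusive.map((v, i) => v - counts[i]);
}

// Destinations for one bucket: bucketBase + exclusivePrefix + localIdx.
function scatterDestinations(bucketBase, counts) {
  const exclusive = toExclusive(hillisSteeleInclusive(counts), counts);
  const dests = [];
  for (let tid = 0; tid < counts.length; tid++) {
    for (let localIdx = 0; localIdx < counts[tid]; localIdx++) {
      dests.push(bucketBase + exclusive[tid] + localIdx);
    }
  }
  return dests;
}
```

Because the emulation never lets a write overtake a read within a step, it only confirms the arithmetic. It cannot reproduce a barrier or memory-model fault on the GPU.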
What we suspect
The bug is likely one of:
- WGSL implementation issue in the specific browser/driver. workgroupBarrier semantics around atomicLoad on workgroup memory, or around single-buffered Hillis-Steele where one thread reads shScan[tid - offset] while a neighbor writes shScan[tid]. Standard pattern, but the spec is subtle.
- Memory model edge case triggered only with very unbalanced histograms (e.g. bucket 15 holding ~94% of entries because almost everything is sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15} buckets and exactly 15-16 for bucket 15; that mix may surface a compiler-introduced reordering.
- A logical bug in the scan or scatter that the human review keeps missing — re-reading the code is the last thing that helps; what's needed is a GPU-side trace.
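To make the "unbalanced histogram" case concrete, here is a small JavaScript sketch that computes the per-pass 4-bit digit histograms for a sentinel-padded key buffer. It assumes the layout described in these notes (16384 slots, sentinel 0xFFFFFFFF). The helper names and the example Morton value 0x1234 are made up for illustration.

```javascript
const LBVH_MAX = 16384; // 1024 threads x K_PER 16, per the notes
const PASSES = 8;
const BITS = 4;

// Pad the real keys out to LBVH_MAX with sentinel keys.
function paddedKeys(realKeys) {
  const keys = new Uint32Array(LBVH_MAX).fill(0xFFFFFFFF);
  keys.set(realKeys);
  return keys;
}

// One 16-bucket histogram per radix pass (pass 0 = lowest 4 bits).
function digitHistograms(keys) {
  const hists = [];
  for (let pass = 0; pass < PASSES; pass++) {
    const h = new Uint32Array(1 << BITS);
    for (const k of keys) h[(k >>> (pass * BITS)) & 0xF]++;
    hists.push(h);
  }
  return hists;
}

// Fraction of all slots that land in bucket 15 for one pass.
function bucket15Share(hist) {
  return hist[15] / LBVH_MAX;
}

// Example: 1011 real entries in one tight cluster (hypothetical Morton 0x1234).
const exampleKeys = Array.from({ length: 1011 }, (_, i) => 0x12340000 + i);
const exampleHists = digitHistograms(paddedKeys(exampleKeys));
console.log(exampleHists.map((h) => bucket15Share(h).toFixed(3)));
```

With 1011 real entries, at least 15373 of the 16384 slots fall into bucket 15 on every pass, which is the ~94% figure quoted above.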
Reproducing
- Run 3DForts WebGPU build with normal projectile firing.
- Aim near (not necessarily at) the fort.
- Observe braces / panels flickering as the projectile flies past.
Diagnostic strategies if revisiting
- GPU-side trace. Add a debug buffer (array<u32> sized for all 16384 entries × a few u32). Have each thread write its intermediate scan values and final scatter destinations there. Read back to CPU and diff against an expected oracle (CPU-computed reference sort of the same input keys).
- Halve the search. Reduce PASSES to 1 and check: does a single-pass sort already corrupt, or does corruption only emerge after multiple ping-pongs?
- Replace the scan. Swap Hillis-Steele for a Blelloch up/down-sweep scan or a subgroupExclusiveAdd variant where available. If the replacement fixes it, the bug is in the Hillis-Steele specifically.
- Serialize the scatter. Have thread 0 do all scatters by itself (loop over all 16384 entries × 16 buckets sequentially). Slow but a provably-correct reference. If this fixes the flicker, the parallel scatter has the bug.
- Replace LSD with bitonic sort. Different algorithm entirely. If bitonic works, radix has a structural problem.
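A CPU oracle for the trace and halve-the-search strategies could look like the sketch below. It is a plain stable LSD radix sort with the same 8 × 4-bit structure, plus a diff helper. The function names are hypothetical, and gpuKeys is assumed to be a Uint32Array that has already been read back from the GPU by whatever readback path the build uses.

```javascript
// Reference LSD radix sort: `passes` stable counting-sort passes of 4 bits each.
// passes = 8 gives the full 32-bit sort; passes = 1 matches a single-pass check.
function referenceRadixSort(keys, passes = 8) {
  let src = Uint32Array.from(keys);
  let dst = new Uint32Array(src.length);
  for (let pass = 0; pass < passes; pass++) {
    const shift = pass * 4;
    // offsets[b] = number of keys whose digit is less than b.
    const offsets = new Uint32Array(17);
    for (const k of src) offsets[((k >>> shift) & 0xF) + 1]++;
    for (let b = 0; b < 16; b++) offsets[b + 1] += offsets[b];
    // Stable scatter in input order.
    for (const k of src) dst[offsets[(k >>> shift) & 0xF]++] = k;
    [src, dst] = [dst, src]; // ping-pong
  }
  return src;
}

// Returns null if the GPU output matches the oracle, else the first difference.
function firstMismatch(gpuKeys, inputKeys, passes = 8) {
  const expected = referenceRadixSort(inputKeys, passes);
  for (let i = 0; i < expected.length; i++) {
    if (gpuKeys[i] !== expected[i]) {
      return { index: i, got: gpuKeys[i], expected: expected[i] };
    }
  }
  return null;
}
```

Running firstMismatch with passes = 1 against a one-pass GPU run would show whether a single pass already corrupts. The index of the first difference also hints at which bucket's scatter went wrong.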
Why it's not blocking
At the current scale (~1011 entries), the BVH still functions:
- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the tree very cheaply (~1 AABB test per skipped subtree).
- The real-leaf subtree has ~10 levels of descent (log2(1024)), all of which are real AABB tests. Without spatial coherence the AABBs are looser than a properly-sorted BVH, but they still bound the geometry.
- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction of the per-pixel cost.
Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real entries this stops being acceptable and the sort needs to actually work.
Acceptance criteria for "fixed"
- The diagnostic repro (3DForts: fire a projectile near the fort) shows no flicker at all.
- The sort produces output ordered by (morton16, tlasIndex) ascending.
- A unit test (CPU oracle vs GPU output) passes for at least three histogram distributions: all-uniform, all-in-one-bucket, and the 3DForts-style "one small object next to a tight cluster".
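One possible shape for the unit-test inputs and the ordering check is sketched below in JavaScript. The three generators are one reading of the distributions named above, not the actual test data: the Morton values (0x7A3C, 0x0001) and all helper names are hypothetical, and the small object is placed last so that the input is not already sorted.

```javascript
const SENTINEL = 0xFFFFFFFF;

// Key layout from the notes: (morton16 << 16) | tlasIndex16, as an unsigned 32-bit value.
function packKey(morton16, tlasIndex16) {
  return (((morton16 & 0xFFFF) << 16) | (tlasIndex16 & 0xFFFF)) >>> 0;
}

// Build a sentinel-padded key buffer; the TLAS index is the instance position.
function makeKeys(mortons, capacity = 16384) {
  const keys = new Uint32Array(capacity).fill(SENTINEL);
  mortons.forEach((m, i) => { keys[i] = packKey(m, i); });
  return keys;
}

// Hypothetical Morton-code generators for the three test distributions.
const distributions = {
  uniform: (n) => Array.from({ length: n }, (_, i) => Math.floor((i * 65536) / n)),
  oneBucket: (n) => new Array(n).fill(0x7A3C),
  clusterPlusOne: (n) => [...new Array(n - 1).fill(0x7A3C), 0x0001],
};

// Acceptance check: ascending by (morton16, tlasIndex).
function isSortedByMortonThenIndex(keys) {
  for (let i = 1; i < keys.length; i++) {
    const prevMorton = keys[i - 1] >>> 16;
    const prevIndex = keys[i - 1] & 0xFFFF;
    const morton = keys[i] >>> 16;
    const index = keys[i] & 0xFFFF;
    if (prevMorton > morton || (prevMorton === morton && prevIndex > index)) {
      return false;
    }
  }
  return true;
}
```

A test would feed each makeKeys(...) buffer to the GPU sort, read the result back, and require isSortedByMortonThenIndex to hold and the output to match a CPU-sorted copy of the same input.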