catbot 14091dcdca WebGPU RT: enable TLAS spatial sort via bitonic network

Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.

Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-31 15:48:29 +00:00

5.9 KiB

Raw Permalink Blame History

LBVH parallel radix sort: count-dependent corruption

RESOLVED (strategy #5 — bitonic sort). The LSD radix scatter was replaced with a data-oblivious workgroup bitonic sorting network in lbvhBuildMain (additional/dom-webgpu.js, Phase 2). Because a bitonic network's compare-exchange schedule depends only on N_PADDED — never on the key distribution — it cannot exhibit the count-dependent corruption documented below. The sort is now enabled (the old if (false) guard is gone) so TLAS leaves are Morton (Z-order) coherent again.

Verified on the Firefox/Dawn WebGPU stack with a GPU unit test that diffs the kernel output against a CPU oracle across all three required distributions (all-uniform, all-one-bucket, and the "small object next to a tight cluster" repro) plus random/reverse/empty edge cases — all match bit-for-bit, with a valid index permutation. Sponza renders correctly with the sort live. The historical analysis below is retained for context.

Summary

The parallel radix sort in lbvhBuildMain (additional/dom-webgpu.js) produces incorrect output that depends on the input distribution. Symptom: geometry in the BVH-built TLAS appears to flicker (instances missing or pointing at the wrong entry) as soon as a small object enters the TLAS alongside a tight cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).

Bisected by selectively skipping each LBVH phase. Skipping only the radix sort eliminates the corruption — every other phase (scene-AABB reduce, Morton-key write, leaf init, sweep-tree refit) is correctness-clean.

Current state: the sort is gated behind if (false) in lbvhBuildMain. BVH leaves are in instance-index order with no spatial coherence. The BVH still builds correctly and traversal still descends a real tree, just with looser parent AABBs.

What we know

The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
Keys are (morton16 << 16) | (tlasIndex16); sentinels (i >= n) get 0xFFFFFFFF.
Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a Hillis-Steele exclusive prefix scan to compute per-thread destination offsets.
Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
The math of the Hillis-Steele scan was verified: after log2(THREADS)=10 steps with the read/barrier/write/barrier pattern, shScan[tid] holds the inclusive prefix sum.
Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
- localIdx, where exclusivePrefixis per-thread andlocalIdx` increments per-element within the thread.
All required barriers are present:
- workgroupBarrier between scan iterations.
- workgroupBarrier at end of each bucket iteration.
- storageBarrier at end of each radix pass.

What we suspect

The bug is likely one of:

WGSL implementation issue in the specific browser/driver. workgroup Barrier semantics around atomicLoad on workgroup memory, or around single-buffered Hillis-Steele where one thread reads shScan[tid - offset] while a neighbor writes shScan[tid]. Standard pattern, but the spec is subtle.
Memory model edge case triggered only with very unbalanced histograms (e.g. bucket 15 holding ~94% of entries because almost everything is sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15} buckets and exactly 15-16 for bucket 15; that mix may surface a compiler-introduced reordering.
A logical bug in the scan or scatter that the human review keeps missing — re-reading the code is the last thing that helps; what's needed is a GPU-side trace.

Reproducing

Run 3DForts WebGPU build with normal projectile firing.
Aim near (not necessarily at) the fort.
Observe braces / panels flickering as the projectile flies past.

Diagnostic strategies if revisiting

GPU-side trace. Add a debug buffer (array<u32> sized for all 16384 entries × a few u32). Have each thread write its intermediate scan values and final scatter destinations there. Read back to CPU and diff against an expected oracle (CPU-computed reference sort of the same input keys).
Halve the search. Reduce PASSES to 1 and check: does a single-pass sort already corrupt, or does corruption only emerge after multiple ping-pongs?
Replace the scan. Swap Hillis-Steele for a Blelloch up/down-sweep scan or a subgroupExclusiveAdd variant where available. If the replacement fixes it, the bug is in the Hillis-Steele specifically.
Serialize the scatter. Have thread 0 do all scatters by itself (loop over all 16384 entries × 16 buckets sequentially). Slow but a provably-correct reference. If this fixes the flicker, the parallel scatter has the bug.
Replace LSD with bitonic sort. Different algorithm entirely. If bitonic works, radix has a structural problem.

Why it's not blocking

At the current scale (~1011 entries), the BVH still functions:

Sentinel half-subtrees are degenerate-AABB-rejected at the top of the tree very cheaply (~1 AABB test per skipped subtree).
The real-leaf subtree has ~10 levels of descent (log2(1024)), all of which are real AABB tests. Without spatial coherence the AABBs are looser than a properly-sorted BVH, but they still bound the geometry.
Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction of the per-pixel cost.

Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real entries this stops being acceptable and the sort needs to actually work.

Acceptance criteria for "fixed"

The diagnostic repro (3DForts: fire a projectile near the fort) shows no flicker at all.
The sort produces output ordered by (morton16, tlasIndex) ascending.
A unit test (CPU oracle vs GPU output) passes for at least three histogram distributions: all-uniform, all-in-one-bucket, and the 3DForts-style "one small object next to a tight cluster".

5.9 KiB Raw Permalink Blame History Unescape Escape