webgpu improvements

2026-05-24 13:32:08 +02:00 · 2026-05-24 13:32:08 +02:00 · 8347467e1e
commit 8347467e1e
parent 5a75571ffd
18 changed files with 1932 additions and 153 deletions
--- a/TODO-lbvh-sort.md
+++ b/TODO-lbvh-sort.md
@ -0,0 +1,106 @@
+# LBVH parallel radix sort: count-dependent corruption
+
+## Summary
+
+The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
+incorrect output that depends on the input distribution. Symptom: geometry in
+the BVH-built TLAS appears to flicker (instances missing or pointing at the
+wrong entry) as soon as a small object enters the TLAS alongside a tight
+cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
+
+Bisected by selectively skipping each LBVH phase. Skipping only the radix
+sort eliminates the corruption — every other phase (scene-AABB reduce,
+Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
+
+Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
+leaves are in instance-index order with no spatial coherence. The BVH still
+builds correctly and traversal still descends a real tree, just with looser
+parent AABBs.
+
+## What we know
+
+- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
+- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
+  `0xFFFFFFFF`.
+- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
+  Hillis-Steele exclusive prefix scan to compute per-thread destination
+  offsets.
+- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
+- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
+  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
+  inclusive prefix sum.
+- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
+  + localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
+  increments per-element within the thread.
+- All required barriers are present:
+  - `workgroupBarrier` between scan iterations.
+  - `workgroupBarrier` at end of each bucket iteration.
+  - `storageBarrier` at end of each radix pass.
+
+## What we suspect
+
+The bug is likely one of:
+
+1. **WGSL implementation issue** in the specific browser/driver. `workgroup
+   Barrier` semantics around `atomicLoad` on workgroup memory, or around
+   single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
+   while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
+   subtle.
+2. **Memory model edge case** triggered only with very unbalanced histograms
+   (e.g. bucket 15 holding ~94% of entries because almost everything is
+   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
+   buckets and exactly 15-16 for bucket 15; that mix may surface a
+   compiler-introduced reordering.
+3. **A logical bug in the scan or scatter** that the human review keeps
+   missing — re-reading the code is the last thing that helps; what's
+   needed is a GPU-side trace.
+
+## Reproducing
+
+1. Run 3DForts WebGPU build with normal projectile firing.
+2. Aim near (not necessarily at) the fort.
+3. Observe braces / panels flickering as the projectile flies past.
+
+## Diagnostic strategies if revisiting
+
+1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
+   entries × a few u32). Have each thread write its intermediate scan
+   values and final scatter destinations there. Read back to CPU and diff
+   against an expected oracle (CPU-computed reference sort of the same
+   input keys).
+2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
+   sort already corrupt, or does corruption only emerge after multiple
+   ping-pongs?
+3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
+   scan or a `subgroupExclusiveAdd` variant where available. If the
+   replacement fixes it, the bug is in the Hillis-Steele specifically.
+4. **Serialize the scatter.** Have thread 0 do all scatters by itself
+   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
+   provably-correct reference. If this fixes the flicker, the parallel
+   scatter has the bug.
+5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
+   bitonic works, radix has a structural problem.
+
+## Why it's not blocking
+
+At the current scale (~1011 entries), the BVH still functions:
+
+- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
+  tree very cheaply (~1 AABB test per skipped subtree).
+- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
+  which are real AABB tests. Without spatial coherence the AABBs are
+  looser than a properly-sorted BVH, but they still bound the geometry.
+- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
+  of the per-pixel cost.
+
+Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
+entries this stops being acceptable and the sort needs to actually work.
+
+## Acceptance criteria for "fixed"
+
+- The diagnostic repro (3DForts: fire a projectile near the fort) shows
+  no flicker at all.
+- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
+- A unit test (CPU oracle vs GPU output) passes for at least three
+  histogram distributions: all-uniform, all-in-one-bucket, and the
+  3DForts-style "one small object next to a tight cluster".