Crafter.Graphics/TODO-lbvh-sort.md

106 lines
4.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# LBVH parallel radix sort: count-dependent corruption
## Summary
The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
incorrect output that depends on the input distribution. Symptom: geometry in
the BVH-built TLAS appears to flicker (instances missing or pointing at the
wrong entry) as soon as a small object enters the TLAS alongside a tight
cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
Bisected by selectively skipping each LBVH phase. Skipping only the radix
sort eliminates the corruption — every other phase (scene-AABB reduce,
Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
leaves are in instance-index order with no spatial coherence. The BVH still
builds correctly and traversal still descends a real tree, just with looser
parent AABBs.
## What we know
- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
`0xFFFFFFFF`.
- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
Hillis-Steele exclusive prefix scan to compute per-thread destination
offsets.
- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
inclusive prefix sum.
- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
+ localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
increments per-element within the thread.
- All required barriers are present:
- `workgroupBarrier` between scan iterations.
- `workgroupBarrier` at end of each bucket iteration.
- `storageBarrier` at end of each radix pass.
## What we suspect
The bug is likely one of:
1. **WGSL implementation issue** in the specific browser/driver. `workgroup
Barrier` semantics around `atomicLoad` on workgroup memory, or around
single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
subtle.
2. **Memory model edge case** triggered only with very unbalanced histograms
(e.g. bucket 15 holding ~94% of entries because almost everything is
sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
buckets and exactly 15-16 for bucket 15; that mix may surface a
compiler-introduced reordering.
3. **A logical bug in the scan or scatter** that the human review keeps
missing — re-reading the code is the last thing that helps; what's
needed is a GPU-side trace.
## Reproducing
1. Run 3DForts WebGPU build with normal projectile firing.
2. Aim near (not necessarily at) the fort.
3. Observe braces / panels flickering as the projectile flies past.
## Diagnostic strategies if revisiting
1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
entries × a few u32). Have each thread write its intermediate scan
values and final scatter destinations there. Read back to CPU and diff
against an expected oracle (CPU-computed reference sort of the same
input keys).
2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
sort already corrupt, or does corruption only emerge after multiple
ping-pongs?
3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
scan or a `subgroupExclusiveAdd` variant where available. If the
replacement fixes it, the bug is in the Hillis-Steele specifically.
4. **Serialize the scatter.** Have thread 0 do all scatters by itself
(loop over all 16384 entries × 16 buckets sequentially). Slow but a
provably-correct reference. If this fixes the flicker, the parallel
scatter has the bug.
5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
bitonic works, radix has a structural problem.
## Why it's not blocking
At the current scale (~1011 entries), the BVH still functions:
- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
tree very cheaply (~1 AABB test per skipped subtree).
- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
which are real AABB tests. Without spatial coherence the AABBs are
looser than a properly-sorted BVH, but they still bound the geometry.
- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
of the per-pixel cost.
Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
entries this stops being acceptable and the sort needs to actually work.
## Acceptance criteria for "fixed"
- The diagnostic repro (3DForts: fire a projectile near the fort) shows
no flicker at all.
- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
- A unit test (CPU oracle vs GPU output) passes for at least three
histogram distributions: all-uniform, all-in-one-bucket, and the
3DForts-style "one small object next to a tight cluster".