Crafter.Graphics/TODO-lbvh-sort.md
catbot 14091dcdca WebGPU RT: enable TLAS spatial sort via bitonic network
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.

Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 15:48:29 +00:00

121 lines
5.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# LBVH parallel radix sort: count-dependent corruption
> **RESOLVED (strategy #5 — bitonic sort).** The LSD radix scatter was
> replaced with a data-oblivious workgroup **bitonic sorting network** in
> `lbvhBuildMain` (`additional/dom-webgpu.js`, Phase 2). Because a bitonic
> network's compare-exchange schedule depends only on N_PADDED — never on
> the key distribution — it cannot exhibit the count-dependent corruption
> documented below. The sort is now enabled (the old `if (false)` guard is
> gone) so TLAS leaves are Morton (Z-order) coherent again.
>
> Verified on the Firefox/Dawn WebGPU stack with a GPU unit test that diffs
> the kernel output against a CPU oracle across all three required
> distributions (all-uniform, all-one-bucket, and the "small object next to
> a tight cluster" repro) plus random/reverse/empty edge cases — all match
> bit-for-bit, with a valid index permutation. Sponza renders correctly with
> the sort live. The historical analysis below is retained for context.
## Summary
The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
incorrect output that depends on the input distribution. Symptom: geometry in
the BVH-built TLAS appears to flicker (instances missing or pointing at the
wrong entry) as soon as a small object enters the TLAS alongside a tight
cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
Bisected by selectively skipping each LBVH phase. Skipping only the radix
sort eliminates the corruption — every other phase (scene-AABB reduce,
Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
leaves are in instance-index order with no spatial coherence. The BVH still
builds correctly and traversal still descends a real tree, just with looser
parent AABBs.
## What we know
- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
`0xFFFFFFFF`.
- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
Hillis-Steele exclusive prefix scan to compute per-thread destination
offsets.
- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
inclusive prefix sum.
- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
+ localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
increments per-element within the thread.
- All required barriers are present:
- `workgroupBarrier` between scan iterations.
- `workgroupBarrier` at end of each bucket iteration.
- `storageBarrier` at end of each radix pass.
## What we suspect
The bug is likely one of:
1. **WGSL implementation issue** in the specific browser/driver. `workgroup
Barrier` semantics around `atomicLoad` on workgroup memory, or around
single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
subtle.
2. **Memory model edge case** triggered only with very unbalanced histograms
(e.g. bucket 15 holding ~94% of entries because almost everything is
sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
buckets and exactly 15-16 for bucket 15; that mix may surface a
compiler-introduced reordering.
3. **A logical bug in the scan or scatter** that the human review keeps
missing — re-reading the code is the last thing that helps; what's
needed is a GPU-side trace.
## Reproducing
1. Run 3DForts WebGPU build with normal projectile firing.
2. Aim near (not necessarily at) the fort.
3. Observe braces / panels flickering as the projectile flies past.
## Diagnostic strategies if revisiting
1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
entries × a few u32). Have each thread write its intermediate scan
values and final scatter destinations there. Read back to CPU and diff
against an expected oracle (CPU-computed reference sort of the same
input keys).
2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
sort already corrupt, or does corruption only emerge after multiple
ping-pongs?
3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
scan or a `subgroupExclusiveAdd` variant where available. If the
replacement fixes it, the bug is in the Hillis-Steele specifically.
4. **Serialize the scatter.** Have thread 0 do all scatters by itself
(loop over all 16384 entries × 16 buckets sequentially). Slow but a
provably-correct reference. If this fixes the flicker, the parallel
scatter has the bug.
5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
bitonic works, radix has a structural problem.
## Why it's not blocking
At the current scale (~1011 entries), the BVH still functions:
- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
tree very cheaply (~1 AABB test per skipped subtree).
- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
which are real AABB tests. Without spatial coherence the AABBs are
looser than a properly-sorted BVH, but they still bound the geometry.
- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
of the per-pixel cost.
Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
entries this stops being acceptable and the sort needs to actually work.
## Acceptance criteria for "fixed"
- The diagnostic repro (3DForts: fire a projectile near the fort) shows
no flicker at all.
- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
- A unit test (CPU oracle vs GPU output) passes for at least three
histogram distributions: all-uniform, all-in-one-bucket, and the
3DForts-style "one small object next to a tight cluster".