Crafter.Graphics/TODO-lbvh-sort.md

# LBVH parallel radix sort: count-dependent corruption

> **RESOLVED (strategy #5 — bitonic sort).** The LSD radix scatter was
> replaced with a data-oblivious workgroup **bitonic sorting network** in
> `lbvhBuildMain` (`additional/dom-webgpu.js`, Phase 2). Because a bitonic
> network's compare-exchange schedule depends only on N_PADDED — never on
> the key distribution — it cannot exhibit the count-dependent corruption
> documented below. The sort is now enabled (the old `if (false)` guard is
> gone) so TLAS leaves are Morton (Z-order) coherent again.
>
> Verified on the Firefox/Dawn WebGPU stack with a GPU unit test that diffs
> the kernel output against a CPU oracle across all three required
> distributions (all-uniform, all-one-bucket, and the "small object next to
> a tight cluster" repro) plus random/reverse/empty edge cases — all match
> bit-for-bit, with a valid index permutation. Sponza renders correctly with
> the sort live. The historical analysis below is retained for context.

## Summary

The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
incorrect output that depends on the input distribution. Symptom: geometry in
the BVH-built TLAS appears to flicker (instances missing or pointing at the
wrong entry) as soon as a small object enters the TLAS alongside a tight
cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).

Bisected by selectively skipping each LBVH phase. Skipping only the radix
sort eliminates the corruption — every other phase (scene-AABB reduce,
Morton-key write, leaf init, sweep-tree refit) is correctness-clean.

Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
leaves are in instance-index order with no spatial coherence. The BVH still
builds correctly and traversal still descends a real tree, just with looser
parent AABBs.

## What we know

- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
  `0xFFFFFFFF`.
- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
  Hillis-Steele exclusive prefix scan to compute per-thread destination
  offsets.
- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
  inclusive prefix sum.
- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
  + localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
  increments per-element within the thread.
- All required barriers are present:
  - `workgroupBarrier` between scan iterations.
  - `workgroupBarrier` at end of each bucket iteration.
  - `storageBarrier` at end of each radix pass.

## What we suspect

The bug is likely one of:

1. **WGSL implementation issue** in the specific browser/driver. `workgroup
   Barrier` semantics around `atomicLoad` on workgroup memory, or around
   single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
   while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
   subtle.
2. **Memory model edge case** triggered only with very unbalanced histograms
   (e.g. bucket 15 holding ~94% of entries because almost everything is
   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
   buckets and exactly 15-16 for bucket 15; that mix may surface a
   compiler-introduced reordering.
3. **A logical bug in the scan or scatter** that the human review keeps
   missing — re-reading the code is the last thing that helps; what's
   needed is a GPU-side trace.

## Reproducing

1. Run 3DForts WebGPU build with normal projectile firing.
2. Aim near (not necessarily at) the fort.
3. Observe braces / panels flickering as the projectile flies past.

## Diagnostic strategies if revisiting

1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
   entries × a few u32). Have each thread write its intermediate scan
   values and final scatter destinations there. Read back to CPU and diff
   against an expected oracle (CPU-computed reference sort of the same
   input keys).
2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
   sort already corrupt, or does corruption only emerge after multiple
   ping-pongs?
3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
   scan or a `subgroupExclusiveAdd` variant where available. If the
   replacement fixes it, the bug is in the Hillis-Steele specifically.
4. **Serialize the scatter.** Have thread 0 do all scatters by itself
   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
   provably-correct reference. If this fixes the flicker, the parallel
   scatter has the bug.
5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
   bitonic works, radix has a structural problem.

## Why it's not blocking

At the current scale (~1011 entries), the BVH still functions:

- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
  tree very cheaply (~1 AABB test per skipped subtree).
- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
  which are real AABB tests. Without spatial coherence the AABBs are
  looser than a properly-sorted BVH, but they still bound the geometry.
- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
  of the per-pixel cost.

Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
entries this stops being acceptable and the sort needs to actually work.

## Acceptance criteria for "fixed"

- The diagnostic repro (3DForts: fire a projectile near the fort) shows
  no flicker at all.
- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
- A unit test (CPU oracle vs GPU output) passes for at least three
  histogram distributions: all-uniform, all-in-one-bucket, and the
  3DForts-style "one small object next to a tight cluster".
-												webgpu improvements

											
										
										
											2026-05-24 13:32:08 +02:00
+								# LBVH parallel radix sort: count-dependent corruption
-												WebGPU RT: enable TLAS spatial sort via bitonic network

Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.

Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-05-31 15:48:29 +00:00
+								> **RESOLVED (strategy #5 — bitonic sort).** The LSD radix scatter was
 								> replaced with a data-oblivious workgroup **bitonic sorting network** in
 								> `lbvhBuildMain` (`additional/dom-webgpu.js`, Phase 2). Because a bitonic
 								> network's compare-exchange schedule depends only on N_PADDED — never on
 								> the key distribution — it cannot exhibit the count-dependent corruption
 								> documented below. The sort is now enabled (the old `if (false)` guard is
 								> gone) so TLAS leaves are Morton (Z-order) coherent again.
 								>
 								> Verified on the Firefox/Dawn WebGPU stack with a GPU unit test that diffs
 								> the kernel output against a CPU oracle across all three required
 								> distributions (all-uniform, all-one-bucket, and the "small object next to
 								> a tight cluster" repro) plus random/reverse/empty edge cases — all match
 								> bit-for-bit, with a valid index permutation. Sponza renders correctly with
 								> the sort live. The historical analysis below is retained for context.
-												webgpu improvements

											
										
										
											2026-05-24 13:32:08 +02:00
+								## Summary
 								The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
 								incorrect output that depends on the input distribution. Symptom: geometry in
 								the BVH-built TLAS appears to flicker (instances missing or pointing at the
 								wrong entry) as soon as a small object enters the TLAS alongside a tight
 								cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
 								Bisected by selectively skipping each LBVH phase. Skipping only the radix
 								sort eliminates the corruption — every other phase (scene-AABB reduce,
 								Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
 								Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
 								leaves are in instance-index order with no spatial coherence. The BVH still
 								builds correctly and traversal still descends a real tree, just with looser
 								parent AABBs.
 								## What we know
 								- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
 								- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
 								  `0xFFFFFFFF`.
 								- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
 								  Hillis-Steele exclusive prefix scan to compute per-thread destination
 								  offsets.
 								- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
 								- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
 								  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
 								  inclusive prefix sum.
 								- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
 								  + localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
 								  increments per-element within the thread.
 								- All required barriers are present:
 								  - `workgroupBarrier` between scan iterations.
 								  - `workgroupBarrier` at end of each bucket iteration.
 								  - `storageBarrier` at end of each radix pass.
 								## What we suspect
 								The bug is likely one of:
 . **WGSL implementation issue** in the specific browser/driver. `workgroup
 								   Barrier` semantics around `atomicLoad` on workgroup memory, or around
 								   single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
 								   while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
 								   subtle.
 . **Memory model edge case** triggered only with very unbalanced histograms
 								   (e.g. bucket 15 holding ~94% of entries because almost everything is
 								   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
 								   buckets and exactly 15-16 for bucket 15; that mix may surface a
 								   compiler-introduced reordering.
 . **A logical bug in the scan or scatter** that the human review keeps
 								   missing — re-reading the code is the last thing that helps; what's
 								   needed is a GPU-side trace.
 								## Reproducing
 . Run 3DForts WebGPU build with normal projectile firing.
 . Aim near (not necessarily at) the fort.
 . Observe braces / panels flickering as the projectile flies past.
 								## Diagnostic strategies if revisiting
 . **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
 								   entries × a few u32). Have each thread write its intermediate scan
 								   values and final scatter destinations there. Read back to CPU and diff
 								   against an expected oracle (CPU-computed reference sort of the same
 								   input keys).
 . **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
 								   sort already corrupt, or does corruption only emerge after multiple
 								   ping-pongs?
 . **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
 								   scan or a `subgroupExclusiveAdd` variant where available. If the
 								   replacement fixes it, the bug is in the Hillis-Steele specifically.
 . **Serialize the scatter.** Have thread 0 do all scatters by itself
 								   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
 								   provably-correct reference. If this fixes the flicker, the parallel
 								   scatter has the bug.
 . **Replace LSD with bitonic sort.** Different algorithm entirely. If
 								   bitonic works, radix has a structural problem.
 								## Why it's not blocking
 								At the current scale (~1011 entries), the BVH still functions:
 								- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
 								  tree very cheaply (~1 AABB test per skipped subtree).
 								- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
 								  which are real AABB tests. Without spatial coherence the AABBs are
 								  looser than a properly-sorted BVH, but they still bound the geometry.
 								- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
 								  of the per-pixel cost.
 								Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
 								entries this stops being acceptable and the sort needs to actually work.
 								## Acceptance criteria for "fixed"
 								- The diagnostic repro (3DForts: fire a projectile near the fort) shows
 								  no flicker at all.
 								- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
 								- A unit test (CPU oracle vs GPU output) passes for at least three
 								  histogram distributions: all-uniform, all-in-one-bucket, and the
 DForts-style "one small object next to a tight cluster".