WebGPU RT: enable TLAS spatial sort via bitonic network (plan phase 3) #2

Merged
catbot merged 1 commit from claude/issue-1 into master 2026-05-31 17:49:39 +02:00
Member

Summary

This PR lands Phase 3 (TLAS coherence) of the wavefront-rewrite plan in #1: it re-enables the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a data-oblivious workgroup bitonic sorting network.

The radix sort in lbvhBuildMain was gated behind if (false) because it produced count/distribution-dependent corruption (see TODO-lbvh-sort.md): a memory-ordering bug in the Hillis-Steele scan / parallel scatter that surfaced only for certain Morton distributions — specifically a small object next to a tight cluster (the 3DForts projectile-near-fort case) — making geometry flicker. With the sort disabled, TLAS BVH leaves had no spatial coherence, which is fatal at the thousands-of-instances scale the many-instance benchmark targets.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never on the key values, so it structurally cannot exhibit that class of distribution-dependent race (TODO-lbvh-sort.md strategy #5). This restores Morton (Z-order) spatial coherence to the TLAS.

What changed

  • additional/dom-webgpu.jslbvhBuildMain Phase 2 is now a bitonic sort: 105 compare-exchange sub-stages over 2^14 keys, single workgroup of 1024 threads, 8 compare-exchanges/thread/sub-stage, in-place on sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF) compare largest and settle at the tail, exactly where Phase 4 expects them. The now-dead radix histogram/scan workgroup memory (shHist/shOffsets/shScan) and constants (BUCKETS/PASSES/SCAN_STEPS) are removed (~130 fewer lines).
  • TODO-lbvh-sort.md — marked resolved, historical analysis retained.

Downstream phases (3: write permutation, 4: leaf AABBs, 5: sweep-tree refit) and TLAS traversal are unchanged — they already consume sortA correctly and are agnostic to leaf order as long as sentinel leaves get degenerate AABBs (they do).

Verification

GPU unit test (CPU oracle, real Firefox/Dawn WebGPU stack). The exact bitonic kernel was run against the TODO-lbvh-sort.md acceptance criteria — all three required distributions (all-uniform, all-one-bucket, small-object-next-to-cluster) plus random, reverse, and empty (all-sentinel) inputs. Every case matched a CPU ascending-sort oracle bit-for-bit, with a valid real-index permutation, and zero GPU errors. Critically this includes the very distribution that broke the old radix sort.

End-to-end render. Sponza (25 TLAS instances) renders correctly with the sort live — no flicker, no missing geometry, no corruption.

Scope note

Issue #1 lays out a full multi-phase wavefront rewrite (megakernel split into GENERATE/TRACE/SHADE/RESOLVE, GPU-driven indirect bounce loop, the RTStress benchmark + timing HUD, the emit/accumulate API break, ordered traversal). This PR delivers the plan's Phase 3 (TLAS coherence) in isolation — the plan explicitly calls it out as independent of the kernel rewrite and validatable against the existing renderer, and it fixes the concrete documented bug that breaks the many-instance scene. The remaining phases (0, 2, 4–7) are intentionally not in this PR and remain open follow-up work; landing the coherence fix first de-risks the benchmark those phases will be measured against. Dynamic TLAS tree depth (N_PADDED = next_pow2(N_real)) is deferred — it couples the build and trace shaders and is secondary to coherence.

Screenshots

Sponza atrium, software ray-traced with the bitonic-sorted TLAS:

Sponza atrium ray-traced with bitonic-sorted TLAS

Refs #1

## Summary This PR lands **Phase 3 (TLAS coherence)** of the wavefront-rewrite plan in #1: it re-enables the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a **data-oblivious workgroup bitonic sorting network**. The radix sort in `lbvhBuildMain` was gated behind `if (false)` because it produced count/distribution-dependent corruption (see `TODO-lbvh-sort.md`): a memory-ordering bug in the Hillis-Steele scan / parallel scatter that surfaced only for certain Morton distributions — specifically a small object next to a tight cluster (the 3DForts projectile-near-fort case) — making geometry flicker. With the sort disabled, TLAS BVH leaves had **no spatial coherence**, which is fatal at the thousands-of-instances scale the many-instance benchmark targets. A bitonic network's compare-exchange schedule depends only on `N_PADDED`, never on the key values, so it structurally cannot exhibit that class of distribution-dependent race (`TODO-lbvh-sort.md` strategy #5). This restores Morton (Z-order) spatial coherence to the TLAS. ## What changed - `additional/dom-webgpu.js` — `lbvhBuildMain` Phase 2 is now a bitonic sort: 105 compare-exchange sub-stages over 2^14 keys, single workgroup of 1024 threads, 8 compare-exchanges/thread/sub-stage, in-place on `sortA` with a `storageBarrier` between sub-stages. Sentinel keys (`0xFFFFFFFF`) compare largest and settle at the tail, exactly where Phase 4 expects them. The now-dead radix histogram/scan workgroup memory (`shHist`/`shOffsets`/`shScan`) and constants (`BUCKETS`/`PASSES`/`SCAN_STEPS`) are removed (~130 fewer lines). - `TODO-lbvh-sort.md` — marked resolved, historical analysis retained. Downstream phases (3: write permutation, 4: leaf AABBs, 5: sweep-tree refit) and TLAS traversal are unchanged — they already consume `sortA` correctly and are agnostic to leaf order as long as sentinel leaves get degenerate AABBs (they do). ## Verification **GPU unit test (CPU oracle, real Firefox/Dawn WebGPU stack).** The exact bitonic kernel was run against the `TODO-lbvh-sort.md` acceptance criteria — all three required distributions (**all-uniform**, **all-one-bucket**, **small-object-next-to-cluster**) plus random, reverse, and empty (all-sentinel) inputs. Every case matched a CPU ascending-sort oracle **bit-for-bit**, with a valid real-index permutation, and zero GPU errors. Critically this includes the very distribution that broke the old radix sort. **End-to-end render.** Sponza (25 TLAS instances) renders correctly with the sort live — no flicker, no missing geometry, no corruption. ## Scope note Issue #1 lays out a full multi-phase wavefront rewrite (megakernel split into GENERATE/TRACE/SHADE/RESOLVE, GPU-driven indirect bounce loop, the `RTStress` benchmark + timing HUD, the emit/accumulate API break, ordered traversal). This PR delivers the plan's **Phase 3 (TLAS coherence)** in isolation — the plan explicitly calls it out as independent of the kernel rewrite and validatable against the existing renderer, and it fixes the concrete documented bug that breaks the many-instance scene. The remaining phases (0, 2, 4–7) are intentionally **not** in this PR and remain open follow-up work; landing the coherence fix first de-risks the benchmark those phases will be measured against. Dynamic TLAS tree depth (`N_PADDED = next_pow2(N_real)`) is deferred — it couples the build and trace shaders and is secondary to coherence. ## Screenshots Sponza atrium, software ray-traced with the bitonic-sorted TLAS: ![Sponza atrium ray-traced with bitonic-sorted TLAS](https://forgejo.catcrafts.net/attachments/aa278ef8-12bb-49f7-bcee-99fae929b029) Refs #1
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.

A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.

Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
catbot merged commit e0d72f57f2 into master 2026-05-31 17:49:39 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Catcrafts/Crafter.Graphics!2
No description provided.