WebGPU RT: wavefront/streaming tracer (replaces megakernel) #4

Merged
catbot merged 8 commits from claude/issue-3 into master 2026-05-31 22:31:35 +02:00
Member

Replaces the WebGPU megakernel software ray tracer with a wavefront /
streaming
tracer, end-to-end, in one PR (the remaining phases from #1; #2
landed only the TLAS bitonic sort). Refs #1.

Pipeline

The single @compute megakernel is gone. The RT pipeline now compiles five
kernels sharing one module, connected by GPU ray/hit/payload buffers and a
GPU-driven indirect bounce loop:

GENERATE → (PREP → TRACE → SHADE) × maxDepth → RESOLVE
  • GENERATE (1 thread/pixel): runs user raygen → rtEmitPrimaryRay.
  • PREP (1 thread): publishes dispatchWorkgroupsIndirect args from the
    live ray count; zeroes the next emit counter. The indirect-args buffer
    lives in its own bind group so it is never bound read-write in the same
    dispatch that consumes it as INDIRECT (a Dawn usage-scope rule proven
    out with a standalone emit→prep→indirect round-trip first).
  • TRACE (1 thread/ray, indirect): zero user code_rtwTraverseTlas
    /_rtwTraverseBlas only. This is the occupancy win: TRACE's register
    footprint is the traversal loop alone, with no SBT/shading code inlined.
  • SHADE (1 thread/ray, indirect): runs user closesthit/miss, which
    rtAccumulate radiance and rtEmitRay continuation/shadow rays into the
    next buffer.
  • RESOLVE (1 thread/pixel): optional user tonemap (identity by default).

Breaking API change

raygen emits a primary ray instead of calling traceRay; closesthit/miss
run in SHADE and emit/accumulate. New API: rtEmitPrimaryRay, rtEmitRay,
rtAccumulate, and an optional WebGPURTStage::Resolve tonemap hook. The
Payload-typed wfPayload storage binding is emitted in the codegen region
after the user's struct Payload; payload travels with each ray
(2·W·H slots, double-buffered). User bindings move to @group(3)
(0..2 reserved for WfParams / data heaps / indirect args).

Also in this PR

  • RTStress (examples/RTStress/): the standing many-instance benchmark —
    an N³ cube grid (kGrid knob, 512 → 8000), primary + shadow rays.
  • GPU timestamp-query per-pass HUD (per-pass µs printed ~1×/sec).
  • Ordered (nearest-child-first) traversal in both BLAS and TLAS
    (_rtAabbT entry-t; push the farther child only when it hits, re-cull on
    pop).
  • Dynamic TLAS sweep-tree depth: built and traced at
    log2(next_pow2(instances)) instead of a fixed 14 (the bitonic sort is
    untouched — sentinels already sink to the tail).
  • Device limits: requests maxBufferSize, maxStorageBufferBindingSize
    (payload store ≈245 MB at 1080p, over the 128 MB baseline),
    maxComputeWorkgroupsPerDimension, and the timestamp-query feature.
  • Megakernel dead path removed — no dual path remains.
  • Binding packing (Phase 7): skipped — the target reports 64 storage
    buffers/stage (≥12), so the merge is unnecessary (issue gates it on <12).

Validation

All three render correctly through the wavefront pipeline (validated in
Firefox/Dawn WebGPU):

  • VulkanTriangle — bit-identical to the pre-rewrite megakernel at
    maxDepth=1 (also exercises the single-instance nPadded=1 degenerate
    tree).
  • RTStress — 512 and 4096 instances, no flicker as count scales.
  • Sponza — atrium with textures, two-sided shading, sun + ambient,
    shadows, Reinhard+gamma resolve.

Per-pass GPU time (timestamp-query, 1920×995, primary+shadow)

scene GENERATE TRACE SHADE total ~fps
RTStress 512 0.80 ms 1.63 ms 1.00 ms 3.52 ms ~280
RTStress 4096 0.80 ms 1.95 ms 1.00 ms 3.85 ms ~260
Sponza 0.79 ms 1.81 ms 1.00 ms 3.69 ms

8× the instances costs only ~16% more TRACE — the spatial TLAS + ordered
descent scale sub-linearly.

Honest caveat on the DoD's "4090 fps before/after + register delta":
this CI container is not a 4090, and a megakernel "before" number would
require reviving the path this PR deletes. The numbers above are this
container's GPU via timestamp-query. The TRACE-kernel register/occupancy
delta needs a GPU profiler not available here; the architectural basis for
the win (TRACE carries zero user code) is structural rather than measured.

See WAVEFRONT-DESIGN.md for the full design.

Screenshots

Sponza — atrium through the wavefront pipeline (textures, two-sided shading, sun + ambient, shadows, Reinhard+gamma resolve):

Sponza

RTStress — 512 instances (left, default kGrid=8) and 4096 (right, kGrid=16), primary + shadow:

RTStress 512
RTStress 4096

VulkanTriangle — bit-identical to the pre-rewrite megakernel:

VulkanTriangle

Resolves #3

Replaces the WebGPU **megakernel** software ray tracer with a **wavefront / streaming** tracer, end-to-end, in one PR (the remaining phases from #1; #2 landed only the TLAS bitonic sort). Refs #1. ## Pipeline The single `@compute` megakernel is gone. The RT pipeline now compiles five kernels sharing one module, connected by GPU ray/hit/payload buffers and a **GPU-driven indirect bounce loop**: ``` GENERATE → (PREP → TRACE → SHADE) × maxDepth → RESOLVE ``` - **GENERATE** (1 thread/pixel): runs user raygen → `rtEmitPrimaryRay`. - **PREP** (1 thread): publishes `dispatchWorkgroupsIndirect` args from the live ray count; zeroes the next emit counter. The indirect-args buffer lives in its own bind group so it is never bound read-write in the same dispatch that consumes it as `INDIRECT` (a Dawn usage-scope rule proven out with a standalone emit→prep→indirect round-trip first). - **TRACE** (1 thread/ray, indirect): **zero user code** — `_rtwTraverseTlas` /`_rtwTraverseBlas` only. This is the occupancy win: TRACE's register footprint is the traversal loop alone, with no SBT/shading code inlined. - **SHADE** (1 thread/ray, indirect): runs user closesthit/miss, which `rtAccumulate` radiance and `rtEmitRay` continuation/shadow rays into the next buffer. - **RESOLVE** (1 thread/pixel): optional user tonemap (identity by default). ### Breaking API change raygen emits a primary ray instead of calling `traceRay`; closesthit/miss run in SHADE and emit/accumulate. New API: `rtEmitPrimaryRay`, `rtEmitRay`, `rtAccumulate`, and an optional `WebGPURTStage::Resolve` tonemap hook. The `Payload`-typed `wfPayload` storage binding is emitted in the codegen region after the user's `struct Payload`; payload travels with each ray (`2·W·H` slots, double-buffered). User bindings move to `@group(3)` (0..2 reserved for WfParams / data heaps / indirect args). ## Also in this PR - **RTStress** (`examples/RTStress/`): the standing many-instance benchmark — an N³ cube grid (`kGrid` knob, 512 → 8000), primary + shadow rays. - **GPU timestamp-query** per-pass HUD (per-pass µs printed ~1×/sec). - **Ordered (nearest-child-first) traversal** in both BLAS and TLAS (`_rtAabbT` entry-t; push the farther child only when it hits, re-cull on pop). - **Dynamic TLAS sweep-tree depth**: built and traced at `log2(next_pow2(instances))` instead of a fixed 14 (the bitonic sort is untouched — sentinels already sink to the tail). - **Device limits**: requests `maxBufferSize`, `maxStorageBufferBindingSize` (payload store ≈245 MB at 1080p, over the 128 MB baseline), `maxComputeWorkgroupsPerDimension`, and the `timestamp-query` feature. - Megakernel dead path removed — no dual path remains. - **Binding packing (Phase 7): skipped** — the target reports 64 storage buffers/stage (≥12), so the merge is unnecessary (issue gates it on <12). ## Validation All three render correctly through the wavefront pipeline (validated in Firefox/Dawn WebGPU): - **VulkanTriangle** — bit-identical to the pre-rewrite megakernel at `maxDepth=1` (also exercises the single-instance `nPadded=1` degenerate tree). - **RTStress** — 512 and 4096 instances, no flicker as count scales. - **Sponza** — atrium with textures, two-sided shading, sun + ambient, shadows, Reinhard+gamma resolve. ### Per-pass GPU time (timestamp-query, 1920×995, primary+shadow) | scene | GENERATE | TRACE | SHADE | total | ~fps | |---|---|---|---|---|---| | RTStress 512 | 0.80 ms | 1.63 ms | 1.00 ms | 3.52 ms | ~280 | | RTStress 4096 | 0.80 ms | 1.95 ms | 1.00 ms | 3.85 ms | ~260 | | Sponza | 0.79 ms | 1.81 ms | 1.00 ms | 3.69 ms | — | 8× the instances costs only ~16% more TRACE — the spatial TLAS + ordered descent scale sub-linearly. > **Honest caveat on the DoD's "4090 fps before/after + register delta":** > this CI container is **not** a 4090, and a megakernel "before" number would > require reviving the path this PR deletes. The numbers above are this > container's GPU via `timestamp-query`. The TRACE-kernel register/occupancy > delta needs a GPU profiler not available here; the architectural basis for > the win (TRACE carries zero user code) is structural rather than measured. See [WAVEFRONT-DESIGN.md](WAVEFRONT-DESIGN.md) for the full design. ## Screenshots **Sponza** — atrium through the wavefront pipeline (textures, two-sided shading, sun + ambient, shadows, Reinhard+gamma resolve): ![Sponza](https://forgejo.catcrafts.net/attachments/a05ecb86-4f14-489b-971f-c934bfafd554) **RTStress** — 512 instances (left, default `kGrid=8`) and 4096 (right, `kGrid=16`), primary + shadow: ![RTStress 512](https://forgejo.catcrafts.net/attachments/b303e6ee-3e28-48a3-a9e9-20b99a84bd1c) ![RTStress 4096](https://forgejo.catcrafts.net/attachments/826b77e1-d141-4aec-a325-b1060e8b84ed) **VulkanTriangle** — bit-identical to the pre-rewrite megakernel: ![VulkanTriangle](https://forgejo.catcrafts.net/attachments/5217bca5-ed95-497e-9f0c-7d8ea9b8fed6) Resolves #3
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
  PerDimension so the W*H-sized work buffers fit past the 128MB baseline.

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.

Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
catbot merged commit 6a54c3c4ca into master 2026-05-31 22:31:35 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Catcrafts/Crafter.Graphics!4
No description provided.