fix(vulkan-rt): work around NVIDIA descriptor-heap AS-read device-loss (#15)

Reading an acceleration structure through VK_EXT_descriptor_heap aborts with VK_ERROR_DEVICE_LOST on NVIDIA 610.43.02, a brand-new-extension driver fault isolated in #7 (engine setup is correct and validation-clean; images/buffers through the same heap work, and both traceRayEXT and inline rayQuery fault identically on the AS read).

An acceleration structure can equally be reached by its device address via OpConvertUToAccelerationStructureKHR, which reads no descriptor and so never touches the faulting heap path. glslang has no GLSL spelling for that conversion, so VulkanShader rewrites the compiled SPIR-V at module-load time: every `OpLoad %accelStruct <heap-ptr>` becomes a load of the TLAS device address from a synthesized push-constant block followed by the convert. RTPass pushes the active frame's TLAS address into that push constant. User GLSL and example code are unchanged; acceleration structures still bind into the heap normally.

The workaround is gated on Device::workaroundDescriptorHeapAS (true only on the NVIDIA proprietary driver) and confined to one fenced block in Crafter.Graphics-ShaderVulkan.cppm plus the RTPass push and the shaderInt64 feature toggle. Delete those once a fixed NVIDIA driver ships and the heap AS read becomes the direct path again.

Verified: VulkanTriangle ray-traces correctly on native NVIDIA (RTX 4090), validation-layer-clean, no device loss. The SPIR-V rewrite was independently validated with spirv-val on both the VulkanTriangle and Sponza raygen modules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
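
To make the rewrite described above concrete, here is a minimal sketch of its first step only: walking a SPIR-V word stream and locating each `OpLoad` whose result type is the acceleration-structure type. It is not the code in `VulkanShader`; the names `AccelLoadSite` and `findAccelStructLoads` are hypothetical and exist only for illustration. The opcode values and the instruction layout (word count in the high 16 bits, opcode in the low 16 bits, five header words) are taken from the SPIR-V specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Opcode values from the SPIR-V specification.
constexpr std::uint32_t kOpLoad = 61;
constexpr std::uint32_t kOpTypeAccelerationStructureKHR = 5341;
// Module header: magic, version, generator, id bound, schema.
constexpr std::size_t kHeaderWords = 5;

// Hypothetical record for one OpLoad that a rewrite would have to replace.
struct AccelLoadSite {
    std::size_t wordOffset;   // index of the instruction's first word
    std::uint32_t resultId;   // id produced by the OpLoad
    std::uint32_t pointerId;  // id of the pointer being loaded from
};

// Hypothetical helper: returns every OpLoad whose result type is the
// module's acceleration-structure type.
std::vector<AccelLoadSite> findAccelStructLoads(const std::vector<std::uint32_t>& words) {
    std::vector<AccelLoadSite> sites;
    std::uint32_t accelTypeId = 0;  // 0 is never a valid SPIR-V id

    for (std::size_t i = kHeaderWords; i < words.size();) {
        const std::uint32_t wordCount = words[i] >> 16;
        const std::uint32_t opcode = words[i] & 0xFFFFu;
        if (wordCount == 0 || i + wordCount > words.size()) {
            break;  // malformed stream: stop rather than read out of bounds
        }

        if (opcode == kOpTypeAccelerationStructureKHR && wordCount >= 2) {
            // Type declarations precede function bodies, so the type id is
            // known before any OpLoad is visited.
            accelTypeId = words[i + 1];
        } else if (opcode == kOpLoad && wordCount >= 4 &&
                   accelTypeId != 0 && words[i + 1] == accelTypeId) {
            // OpLoad operands: result type, result id, pointer.
            sites.push_back({i, words[i + 2], words[i + 3]});
        }
        i += wordCount;
    }
    return sites;
}
```

Each site found this way is where, per the commit message, the heap load would be swapped for a load of the TLAS device address followed by `OpConvertUToAccelerationStructureKHR` (opcode 4447). The sketch leaves out synthesizing the push-constant block and re-emitting the module.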
parent b9f65f5273
commit 950059c86e
7 changed files with 270 additions and 30 deletions
@@ -28,22 +28,36 @@ cd examples/VulkanTriangle
 crafter-build -r
 ```
 
-On a working driver you should see a 1280×720 window with a triangle
-filling roughly the centre. **On the current NVIDIA driver the native
-build aborts with `VK_ERROR_DEVICE_LOST` the moment `traceRayEXT` runs —
-see below.**
+You should see a 1280×720 window with an RGB-barycentric triangle filling
+roughly the centre. On the NVIDIA driver this works through an engine-side
+workaround for a driver fault — see below.
 
-## Native status — known driver fault (`VK_ERROR_DEVICE_LOST`)
+## Native status — NVIDIA driver fault, worked around
 
-On NVIDIA driver `610.43.02` (Vulkan 1.4) the native build aborts with
-`VK_ERROR_DEVICE_LOST` on the first frame as soon as the shader reads the
-acceleration structure. `VK_EXT_device_fault` reports an invalid GPU read
-(address `~0xffff…`) plus instruction-pointer faults inside the
-ray-tracing shader. Commenting out the `traceRayEXT` call makes the crash
-disappear (the dispatch + `imageStore` path renders a solid colour fine).
+On NVIDIA driver `610.43.02` (Vulkan 1.4) reading the acceleration
+structure through `VK_EXT_descriptor_heap` aborts the device with
+`VK_ERROR_DEVICE_LOST` on the first frame. This is a **driver-side fault**
+in the brand-new descriptor-heap acceleration-structure path, not an engine
+bug (full investigation in #7, summarised below).
 
-This was investigated thoroughly and traced to the **acceleration-structure
-read through `VK_EXT_descriptor_heap`**, *not* to the engine's RT setup:
+**The engine works around it transparently** (issue #15). On the NVIDIA
+proprietary driver only, `VulkanShader` rewrites the compiled SPIR-V at
+module-load time so that every `OpLoad` of an `accelerationStructureEXT`
+out of the heap becomes a load of the TLAS *device address* (from a
+synthesized push-constant block) followed by
+`OpConvertUToAccelerationStructureKHR` — which reads no descriptor and so
+never touches the faulting path. `RTPass` feeds the active frame's TLAS
+address in as push data. `raygen.glsl` and the example code are unchanged;
+acceleration structures still bind into the heap normally. On every other
+driver the workaround is inert. It's gated on
+`Device::workaroundDescriptorHeapAS` and confined to one fenced block in
+`interfaces/Crafter.Graphics-ShaderVulkan.cppm` so it can be deleted wholesale
+once a fixed NVIDIA driver ships.
+
+### The underlying fault (#7)
+
+The fault was traced to the **acceleration-structure read through
+`VK_EXT_descriptor_heap`**, *not* to the engine's RT setup:
 
 - The BLAS/TLAS build is correct and finishes before rendering
   (`Window::FinishInit` does `vkQueueWaitIdle`). The built TLAS instance
@@ -70,7 +84,7 @@ read through `VK_EXT_descriptor_heap`**, *not* to the engine's RT setup:
 second conformant implementation to cross-check against.
 
 **Conclusion:** this is a driver-side fault in NVIDIA's
-`VK_EXT_descriptor_heap` acceleration-structure path, not an engine bug. It
-should be reported to NVIDIA. The `traceRayEXT` call is intentionally left
-in `raygen.glsl` so this stays a faithful one-file reproducer; the example
-will start rendering the triangle again once a fixed driver ships.
+`VK_EXT_descriptor_heap` acceleration-structure path, not an engine bug, and
+it should be reported to NVIDIA. Until a fixed driver ships, the SPIR-V
+rewrite above keeps the native RT path working; once it does, remove the
+workaround and the heap AS read becomes the direct path again.
@@ -201,12 +201,13 @@ int main() {
     RTPass rtPass(&pipeline);
     window.passes.push_back(&rtPass);
 
-    // NOTE: on NVIDIA 610.43.02 this aborts with VK_ERROR_DEVICE_LOST the
-    // first time the raygen shader reads the acceleration structure out of
-    // the VK_EXT_descriptor_heap. The build, descriptors and SBT are all
-    // correct and validation-clean; it is a driver-side fault in the
-    // descriptor-heap acceleration-structure path. See README.md
-    // ("Native status — known driver fault") for the full investigation.
+    // NOTE: reading the acceleration structure through VK_EXT_descriptor_heap
+    // aborts with VK_ERROR_DEVICE_LOST on NVIDIA 610.43.02 (a driver fault —
+    // see #7). The engine transparently works around it: on the NVIDIA driver
+    // VulkanShader rewrites the heap AS read into a TLAS-device-address +
+    // OpConvertUToAccelerationStructureKHR path and RTPass feeds the address in
+    // as push data. Nothing here (or in raygen.glsl) changes. See README.md
+    // ("Native status") and interfaces/Crafter.Graphics-ShaderVulkan.cppm.
     window.Render();
     window.StartSync();
 }