Minimalist ray-tracing leveraging only acceleration structures

AnKi has had ray-tracing support for quite some time. Beyond managing acceleration structures, the engine uses VK_KHR_ray_tracing_pipeline/DXR 1.0 for shadows (soon deprecated), indirect diffuse, and indirect specular. Upon initialization, the engine reads pre-compiled shader binary blobs and creates ray-tracing pipelines containing a number of ray-generation shaders, one or more miss shaders, and several hit shaders. Currently, two RT libraries are built: one for RT shadows and one for indirect RT. When the engine loads materials, the materials select the appropriate hit shaders, which are then forwarded to the shader binding table during rendering. This approach works well, as materials have the flexibility to choose different RT shaders, just as they choose different combinations of vertex and pixel shaders.

Some of the problems with VK_KHR_ray_tracing_pipeline/DXR 1.0 are that it’s not widely supported on mobile, and when it is supported, performance might not be great. This motivates the work here: creating a low-quality (“potato” mode) RT implementation to see how it performs on indirect specular and indirect diffuse. But first, let’s look at how VK_KHR_ray_tracing_pipeline/DXR 1.0 works.

RT using ray_tracing_pipeline/DXR 1.0

Indirect RT uses hit shaders that are minimal in functionality: they simply return a thin G-Buffer. The ray-generation shaders then take that thin G-Buffer and perform the lighting calculations. Light shading does not happen in the hit shaders. To better understand, here is the ray payload used:

struct [raypayload] RtMaterialFetchRayPayload
{
	Vec3 m_diffuseColor : write(closesthit, miss): read(caller);
	Vec3 m_worldNormal : write(closesthit, miss): read(caller);
	Vec3 m_emission : write(closesthit, miss): read(caller);
	F32 m_textureLod : write(caller): read(closesthit);
	F32 m_rayT : write(closesthit, miss): read(caller);
};

The thin G-Buffer contains only the diffuse color, the world normal of the surface hit, and the emission. The m_rayT is used to calculate the world position of the surface the ray intersected. The m_textureLod controls the texture LOD when the hit shaders read the material textures. For indirect specular, it’s a good idea to use detailed mipmaps, but for indirect diffuse, we can afford to cheat a lot and use low quality mipmaps.
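
To make the flow concrete, here is a minimal sketch of what a ray-generation shader built around this payload might look like. The TLAS binding, the ray setup and the constants are illustrative and not AnKi’s actual code:

RaytracingAccelerationStructure g_tlas : register(t0, space0);

[shader("raygeneration")]
void rayGenMain()
{
	RayDesc ray;
	ray.Origin = ...; // The world position being shaded, reconstructed from the G-Buffer
	ray.Direction = ...; // Sampled direction: cosine hemisphere for diffuse, GGX for specular
	ray.TMin = 0.01;
	ray.TMax = 1000.0;

	RtMaterialFetchRayPayload payload;
	payload.m_textureLod = 2.0; // Low-detail mips are fine for indirect diffuse

	TraceRay(g_tlas, RAY_FLAG_FORCE_OPAQUE, 0xFFu, 0, 0, 0, ray, payload);

	// The hit (or miss) position is reconstructed from m_rayT and the lighting happens
	// right here, using nothing but the thin G-Buffer stored in the payload
	const Vec3 hitPoint = ray.Origin + ray.Direction * payload.m_rayT;
	// ... light hitPoint using payload.m_diffuseColor, m_worldNormal and m_emission ...
}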

And here is a debug view of the diffuse color:

The normals are built from the vertex positions:

Implementing the “potato” ray-tracing

Now that we understand how the ray_tracing_pipeline/DXR 1.0 is used, let’s look at the cheap alternative. The idea is simple: use ray queries/inline RT instead of ray_tracing_pipeline/DXR 1.0, and avoid using any textures when building the thin G-Buffer.

The first change was for AnKi to compute the average color of textures, which happens at asset baking time. Materials then use the average color of their diffuse texture or, if there is no diffuse texture, a fallback path that infers it. The average diffuse color is then stored directly in the instance of the top-level acceleration structure (TLAS). Each TLAS instance provides 24 bits that can be used for anything, which is enough to store the diffuse color. In Vulkan, the spare bits are in instanceCustomIndex:

struct AccelerationStructureInstance
{
	Mat3x4 m_transform;
	U32 m_instanceCustomIndex : 24; // Custom value that can be accessed in the shaders.
	U32 m_mask : 8;
	U32 m_instanceShaderBindingTableRecordOffset : 24;
	U32 m_flags : 8;
	U64 m_accelerationStructureAddress;
};
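
Packing the averaged color into those 24 bits is just a quantization to 8 bits per channel. A rough sketch, using the same vector shorthand as the rest of the post (the helper name is made up):

U32 packAverageDiffuseColor(Vec3 color)
{
	// Quantize each channel to 8 bits and pack it into the 24 spare bits,
	// matching the unpacking order used by the shaders below (R in the top bits)
	const UVec3 quantized = UVec3(round(saturate(color) * 255.0));
	return (quantized.r << 16u) | (quantized.g << 8u) | quantized.b;
}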

The instanceCustomIndex can be accessed inside shaders by calling the CommittedInstanceID() method of the RayQuery objects:

RayQuery<...> q;
...
U32 id = q.CommittedInstanceID();
UVec3 coloru = (UVec3)id >> UVec3(16, 8, 0);
coloru &= 0xFF;
Vec3 color = Vec3(coloru) / 255.0;

Now that we’ve solved the diffuse color, we need to figure out the normals. To achieve this we will be using VK_KHR_ray_tracing_position_fetch, a useful extension that provides the three vertex positions of the primitive that was hit. Unfortunately, DX12 doesn’t support something like this. Taking the cross product of two edges formed by those positions and transforming the result to world space gives the face normal. Here’s a snippet showing how to use position fetch in HLSL:

// Define the inline SPIR-V in some header
#define SpvRayQueryPositionFetchKHR 5391
#define SpvOpRayQueryGetIntersectionTriangleVertexPositionsKHR 5340
#define SpvRayQueryCandidateIntersectionKHR 0
#define SpvRayQueryCommittedIntersectionKHR 1

[[vk::ext_capability(SpvRayQueryPositionFetchKHR)]]
[[vk::ext_extension("SPV_KHR_ray_tracing_position_fetch")]]
[[vk::ext_instruction(SpvOpRayQueryGetIntersectionTriangleVertexPositionsKHR)]]
float3 spvRayQueryGetIntersectionTriangleVertexPositionsKHR([[vk::ext_reference]] RayQuery<RAY_FLAG_FORCE_OPAQUE> query, int committed)[3];

// Access the positions
Vec3 positions[3] = spvRayQueryGetIntersectionTriangleVertexPositionsKHR(q, SpvRayQueryCommittedIntersectionKHR);
Vec3 vertNormal = normalize(cross(positions[1] - positions[0], positions[2] - positions[1]));

To transform the normal to world space, we make use of the TLAS once again to access the transform of the instance:

Vec3 worldNormal = normalize(mul(q.CommittedObjectToWorld3x4(), Vec4(vertNormal, 0.0)));

What about the emission? A solution hasn’t been implemented yet, but one idea is to reserve one bit in instanceCustomIndex as a flag that switches the remaining 23 bits from storing diffuse color to storing tone-mapped emission. Not perfect, but potentially workable.

One important thing to note is that with “potato” RT, there is no need for shader binding tables. This can save some CPU or GPU time, depending on where it’s built. In AnKi, it’s built on the GPU.

And this is how the diffuse color looks in “potato” RT:

Ray-tracing pipeline vs “potato” ray-tracing

Now let’s look at the full-blown indirect diffuse term. First we have the RT pipeline:

And next is the “potato” RT:

Not much difference in Sponza, really, but that doesn’t mean this kind of cheating will work on every scene.

Now let’s look at reflections. We hacked the roughness to be zero and removed the normal maps to create the worst possible reflection scenario: basically making everything as smooth as possible.

First the RT pipeline:

And the reflections in “potato” RT:

Once again, they look similar. The primary reason is that AnKi uses screen-space reflections, which add some detail to the reflections.

Performance

And now some numbers, taken from an NVIDIA 4080 running at 4K native. This time, we’re not using Sponza but a more complex scene: Bistro. Unfortunately, it wasn’t possible to gather numbers on AMD due to a driver bug (VK_KHR_ray_tracing_position_fetch not working with ray queries).

As we can see the “potato” RT is faster but nothing crazy.

Conclusion

We presented a method for implementing ray tracing using only ray queries and data stored directly in the top-level acceleration structure, completely eliminating ray-tracing pipelines, shader binding tables, and textures. Despite its simplicity, this approach produces surprisingly promising results for indirect diffuse lighting, thanks to the low-frequency nature of indirect diffuse. If emission were properly handled, the results would be even better. Reflections remain more challenging, but in many scenes creative approximations can still yield visually acceptable results. Overall, this “potato” RT approach demonstrates that lightweight ray tracing is feasible, often with acceptable quality.

Simplified pipeline barriers

One of the challenges of low-level APIs such as Vulkan and D3D12 is the synchronization of GPU work. The most used primitive in GPU synchronization is the pipeline barrier, and if someone takes a look at vkCmdPipelineBarrier or ID3D12GraphicsCommandList::ResourceBarrier/Barrier it’s easy to understand why many people struggle. In this article we will decompose barriers and put them back together in a more simplified form. We will use Vulkan as the base of our discussion, but since D3D12’s Enhanced Barriers are extremely close to Vulkan’s barrier model, most of the concepts discussed here are transferable to D3D12 as well. D3D12’s old barrier model (ID3D12GraphicsCommandList::ResourceBarrier) is out of scope since I’m not a big fan of it. Unfortunately, this article requires a deep understanding of synchronization, so be prepared.


Workarounds for issues with mesh shaders + Vulkan + HLSL

TL;DR: If you use per-primitive inputs/outputs in mesh shaders with Vulkan and HLSL/DXC, you need to manually decorate variables and struct members using SPIR-V intrinsics. See how at the bottom.

This is going to be a quick post to raise awareness of a difficult-to-track bug with mesh shaders in Vulkan when using HLSL. Hopefully this will save some time for the next poor soul that bumps into it.

With mesh shaders you have two options when it comes to output variables. Your output can be per-vertex (just like in vertex shaders) or per-primitive, which is a new addition with mesh shaders. This is how a mesh shader entry point might look:

[numthreads(64, 1, 1)] [outputtopology("triangle")] 
void meshMain(
    out vertices PerVertOut myVerts[XXX], 
    out primitives PerPrimitiveOut myPrimitives[XXX], 
    out indices uint3 indices[XXX])
{
...
}

The “vertices” keyword denotes that the output is per-vertex and the “primitives” keyword that it’s per-primitive. The DirectX Shader Compiler will decorate myPrimitives with the PerPrimitiveEXT decoration, which is specific to mesh shaders. That’s all fine, but the real problem lies in the fragment shader. Vulkan mandates that myPrimitives be decorated with PerPrimitiveEXT in the fragment shader as well, but in HLSL there is no way to do that: the “primitives” keyword is not allowed in fragment shaders.

FragOut psMain(in PerVertOut myVerts, PerPrimitiveOut myPrimitives)
{
...
}

In other words, if someone tries to use HLSL to write mesh shaders that use per-primitive input/output variables, they will end up with incorrect SPIR-V. The problems continue because, at the moment, this is a very hard-to-diagnose error:

  • The HLSL is valid
  • It works in D3D12 (because D3D12 does semantics linking at PSO creation time)
  • It works on NVIDIA + Vulkan (because NVIDIA has a tendency to work around user errors)
  • Vulkan validation doesn’t complain

… but it fails on AMD and nobody can understand why.

How do we work around this issue, then? We can use inline SPIR-V intrinsics! It’s a little bit cumbersome but it can be done. This is how we should change our fragment shader:

struct PerPrimitiveOut {
    [[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someMember;
    [[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someOtherMember;
};

FragOut psMain(
    in PerVertOut myVerts, 
    [[vk::ext_extension("SPV_EXT_mesh_shader")]] [[vk::ext_capability(5283 /*MeshShadingEXT*/)]] PerPrimitiveOut myPrimitives)
{
...
}

And voila! It works on AMD.

For reference, I’ve created a couple of GitHub issues that should help avoid this problem in the future:

https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/8398

https://github.com/microsoft/DirectXShaderCompiler/issues/6862

Parsing and rewriting SPIR-V

SPIR-V is a binary language that describes GPU shader code for the Vulkan and OpenCL APIs. For most graphics programmers SPIR-V is a completely opaque binary blob that they don’t have to touch or know much about. You can use your preferred compiler to translate GLSL or HLSL to SPIR-V, use SPIRV-Tools to optimize or assemble/disassemble it, use SPIRV-Cross or SPIRV-Reflect to perform shader reflection, and use SPIRV-Cross to cross-compile SPIR-V to some high-level language. There is enough tooling out there to not have to worry about the internals. Still, SPIR-V is very well defined and quite simple to understand, and manipulating it without 3rd-party tools is not something to feel intimidated by; that is the point I’m trying to make with this post.

Why parse and/or rewrite SPIR-V? For AnKi I’ve stumbled into a couple of cases where no existing tooling could do what I wanted. The first case is quite simple: I wanted to find out whether a fragment shader discards. That was easy, just search for spv::OpKill in the shader. The 2nd use case is more elaborate. For various reasons AnKi’s shaders were rewritten to support HLSL and HLSL’s binding model (using the register keyword). Since there is no direct mapping of HLSL’s binding model to Vulkan/SPIR-V, we had to get a little creative. Without going into many details, DXC remaps HLSL registers to some logical Vulkan bindings using -fvk-b-shift and co (register -> spv::DecorationBinding). These logical bindings are used to identify the register when performing shader reflection (spv::DecorationBinding -> register). After DXC completes the translation, the output SPIR-V contains logical bindings that need to be replaced, so after reflection AnKi rewrites the bindings. This is the 2nd case, where AnKi had to parse SPIR-V but also rewrite it.

This page describes how parsing SPIR-V works: https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/parsing_instructions.md

A SPIR-V binary starts with a 20-byte header (5 words: the magic number, the version, the generator ID, the ID bound and a reserved word). After that the instructions follow. Each instruction starts with a 32-bit word whose low 16 bits hold the opcode and whose high 16 bits hold the instruction’s length in words, followed by a variable number of 32-bit operands. Iterating all the instructions is pretty simple. Finding out whether the SPIR-V binary contains spv::OpKill can be done like this:

bool hasOpKill(uint32_t* pCode, uint32_t codeSize)
{
	uint32_t offset = 5; // first 5 words of module are the headers

	while(offset < codeSize)
	{
		uint32_t instruction = pCode[offset];

		uint32_t length = instruction >> 16;
		uint32_t opcode = instruction & 0x0ffffu;

		offset += length;

		if(opcode == spv::OpKill)
		{
			return true;
		}
	}

	return false;
}

Similar story if someone would want to rewrite the bindings for example:

void rewriteBindings(uint32_t* pCode, uint32_t codeSize)
{
	uint32_t offset = 5; // first 5 words of module are the headers

	while(offset < codeSize)
	{
		uint32_t instruction = pCode[offset];

		uint32_t length = instruction >> 16;
		uint32_t opcode = instruction & 0x0ffffu;

		//  Encoding: OpDecorate || id || DecorationBinding || <literal>
		if(opcode == spv::OpDecorate && pCode[offset + 2] == spv::DecorationBinding)
		{
			pCode[offset + 3] = someValue; // 3rd argument is the binding. Re-write it
		}

		offset += length;
	}
}

Pretty simple.

GPU driven rendering in AnKi: A high level overview

For the last few months AnKi underwent a heavy re-write in order to enable GPU driven rendering. This post will quickly go through the changes and the design without diving too deep into details. There will be another post (or maybe video) with all the implementation details.

What is GPU driven rendering (GDR from now on)? It’s the process of moving logic that traditionally runs on the CPU to the GPU. The first category of GDR work includes various simulation workloads: particle simulations, clutter placement, snow/water deformations, cloth simulation and more. The next GDR category includes techniques that do GPU-based occlusion culling. That includes per-object occlusion and more fine-grained meshlet (aka cluster) occlusion. The last category is generating command buffers directly on the GPU. In DX12 this is done using ExecuteIndirect and in Vulkan with VK_NV_device_generated_commands. A very recent addition to this category is Work Graphs from AMD, but that’s a bit experimental at the moment. Due to Vulkan limitations AnKi doesn’t use device-generated commands, so the ceiling of AnKi’s implementation is multi-draw indirect with indirect count (aka VK_KHR_draw_indirect_count, aka MDI).

So what kind of GDR variant does AnKi use? The final goal was to perform all per-object visibility on the GPU: visibility for regular meshes as well as for lights, probes, decals and the rest of the meta-objects. The whole work was split into 2 logical parts. The 1st part re-organizes data in a more GDR-friendly way and the 2nd is the GPU visibility and command buffer generation itself.

GDR Data / GPU Scene

With GDR the way we store data needs to be rethought. For AnKi there are 3 main components. The 1st is the Unified Geometry Buffer (UGB), a huge buffer that holds all index and vertex buffers of the whole game. The 2nd is the GPU scene buffer that stores a GPU view of the scenegraph: things like object transforms, lights, probes, uniforms (aka constants) and more. The 3rd is a descriptor set with all the textures, in other words bindless textures. Every renderable object has a structure (a data record) in the GPU scene that points to its transforms, uniforms, bindless texture indices etc.
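
For illustration only, a renderable’s record in the GPU scene might look something like this; the field names are made up and the real layout is more involved:

struct GpuSceneRenderable
{
	U32 m_worldTransformOffset; // Offset into the GPU scene buffer
	U32 m_uniformsOffset;       // Material constants, also in the GPU scene
	U32 m_geometryOffset;       // Index/vertex buffer ranges inside the UGB
	U32 m_boundingVolumeOffset; // Consumed by the GPU visibility passes
};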

The intent with these 3 data holders is to have persistent data in the GPU and be able to address index and vertex buffers using indirect arguments. No more CPU to GPU roundtrips. At the beginning of the frame there is what we call micro-patching of the GPU scene. For example, if an object was moved there will be a small copy from the CPU to the GPU scene.

GPU occlusion

The next bit to discuss is how GPU occlusion works, at a very high level. Here things are somewhat vanilla; nothing that hasn’t been tried before. There is a Hierarchical Z Buffer (HZB) generated in frame N and used in frame N+1. This HZB is used for the occlusion culling of the GBuffer pass and of the transparents rendering, but also to cull lights, decals and probes. The one-frame delay is fine since the occlusion is per object; if it were per meshlet this delay might have been problematic.
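
To illustrate the idea, here is a sketch of a per-object test against the HZB. It assumes a non-reversed depth buffer, an HZB whose mips store the maximum depth of the area they cover, and a nearest/clamp sampler; all names are made up and AnKi’s actual implementation differs:

bool isObjectVisible(Vec3 aabbMin, Vec3 aabbMax, Mat4 viewProjMat, Texture2D<F32> hzbTex, SamplerState pointClampSampler, Vec2 hzbMip0Size)
{
	Vec2 uvMin = 1.0;
	Vec2 uvMax = 0.0;
	F32 closestDepth = 1.0;

	[unroll] for(U32 i = 0u; i < 8u; ++i)
	{
		// Build the i-th corner of the AABB
		const Vec3 corner = Vec3((i & 1u) ? aabbMax.x : aabbMin.x, (i & 2u) ? aabbMax.y : aabbMin.y, (i & 4u) ? aabbMax.z : aabbMin.z);

		Vec4 clip = mul(viewProjMat, Vec4(corner, 1.0));
		clip /= clip.w; // Ignoring near-plane crossings and off-screen clamping for simplicity

		const Vec2 uv = clip.xy * Vec2(0.5, -0.5) + 0.5;
		uvMin = min(uvMin, uv);
		uvMax = max(uvMax, uv);
		closestDepth = min(closestDepth, clip.z);
	}

	// Pick the mip where the screen footprint is covered by roughly 2x2 texels
	const Vec2 footprint = (uvMax - uvMin) * hzbMip0Size;
	const F32 mip = ceil(log2(max(max(footprint.x, footprint.y), 1.0)));

	// Farthest occluder over the footprint (each HZB texel stores the max depth of its area)
	F32 farthestOccluder = hzbTex.SampleLevel(pointClampSampler, uvMin, mip);
	farthestOccluder = max(farthestOccluder, hzbTex.SampleLevel(pointClampSampler, Vec2(uvMax.x, uvMin.y), mip));
	farthestOccluder = max(farthestOccluder, hzbTex.SampleLevel(pointClampSampler, Vec2(uvMin.x, uvMax.y), mip));
	farthestOccluder = max(farthestOccluder, hzbTex.SampleLevel(pointClampSampler, uvMax, mip));

	// Visible unless the object's closest point is behind every occluder in its footprint
	return closestDepth <= farthestOccluder;
}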

Occlusion is also done for the shadow cascades, using a variation of Assassin’s Creed Unity’s technique. Meaning: take the max (in our case) depth of each 64×64 tile of the depth buffer, then render boxes for each tile into light space and use that to construct the HZB. Do that for each cascade.
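
As a rough illustration of the first step, a compute shader that reduces each 64×64 tile to its max depth could look like this. The group size, resource names and output layout are assumptions, and the subsequent box rendering into light space is not shown:

Texture2D<F32> g_depthTex;
RWTexture2D<F32> g_tileMaxDepth; // One texel per 64x64 tile

groupshared F32 s_maxDepth[64];

[numthreads(8, 8, 1)]
void tileMaxDepthMain(UVec3 svGroupId : SV_GroupID, UVec3 svGroupThreadId : SV_GroupThreadID, U32 svGroupIndex : SV_GroupIndex)
{
	// Each of the 64 threads reduces an 8x8 block of the 64x64 tile
	F32 maxDepth = 0.0;
	const UVec2 blockBase = svGroupId.xy * 64u + svGroupThreadId.xy * 8u;
	for(U32 y = 0u; y < 8u; ++y)
	{
		for(U32 x = 0u; x < 8u; ++x)
		{
			maxDepth = max(maxDepth, g_depthTex[blockBase + UVec2(x, y)]);
		}
	}

	s_maxDepth[svGroupIndex] = maxDepth;
	GroupMemoryBarrierWithGroupSync();

	// Then a parallel reduction over the group's 64 partial results
	for(U32 s = 32u; s > 0u; s >>= 1u)
	{
		if(svGroupIndex < s)
		{
			s_maxDepth[svGroupIndex] = max(s_maxDepth[svGroupIndex], s_maxDepth[svGroupIndex + s]);
		}
		GroupMemoryBarrierWithGroupSync();
	}

	if(svGroupIndex == 0u)
	{
		g_tileMaxDepth[svGroupId.xy] = s_maxDepth[0];
	}
}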

One additional problem had to do with the other shadow-casting lights (and the probes). The CPU is not aware of the visible punctual lights, but it needs to know them in order to build the commands required for shadows. What AnKi does in this case is send some info about the visible lights from the GPU to the CPU. With a few frames of delay, the CPU will schedule the shadow passes. This mechanism is also used for probes that need to be updated in real-time.

Command Buffer Building

As mentioned at the beginning, AnKi uses multi-draw indirect. The CPU organizes all renderable objects present in the scene into graphics-state buckets; the state is essentially derived from the graphics pipeline. The renderer then builds a single MDI call for each state bucket. When doing visibility testing, the GPU places the indirect arguments of the visible renderables into the correct bucket.

The GPU visibility pass populates the indirect arguments but also an additional buffer with one entry per visible instance. This buffer is a vertex buffer with per-instance rate, a well-known trick to work around the lack of base-instance support in DX. This vertex buffer holds some offsets required by the renderable; for example, one of these offsets points to the transform and another to the uniforms.
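
Putting the last two paragraphs together, the output step of the GPU visibility shader could look roughly like this. The structures, buffer names and the emitDraw helper are hypothetical; only the overall scheme matches what was described above:

struct DrawIndexedIndirectArgs // Mirrors VkDrawIndexedIndirectCommand
{
	U32 m_indexCount;
	U32 m_instanceCount;
	U32 m_firstIndex;
	U32 m_vertexOffset;
	U32 m_firstInstance;
};

struct InstanceOffsets
{
	U32 m_transformOffset; // Offsets into the GPU scene buffer
	U32 m_uniformsOffset;
};

RWStructuredBuffer<U32> g_bucketDrawCounts; // One counter per state bucket, feeds the indirect count
RWStructuredBuffer<DrawIndexedIndirectArgs> g_indirectArgs; // Pre-partitioned per bucket
RWStructuredBuffer<InstanceOffsets> g_instanceOffsets; // Bound as a per-instance-rate vertex buffer

void emitDraw(U32 bucketIdx, U32 bucketArgsOffset, DrawIndexedIndirectArgs args, InstanceOffsets offsets)
{
	// Grab a slot in the renderable's bucket
	U32 slot;
	InterlockedAdd(g_bucketDrawCounts[bucketIdx], 1u, slot);

	args.m_instanceCount = 1u;
	args.m_firstInstance = bucketArgsOffset + slot; // Indexes the per-instance vertex buffer
	g_indirectArgs[bucketArgsOffset + slot] = args;
	g_instanceOffsets[bucketArgsOffset + slot] = offsets;
}

Since vertex buffers with per-instance rate do honor the first-instance value, the vertex shader receives the right offsets as a plain vertex attribute, with no need for a base instance.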

There are a few complexities in this scheme that have to do with counting drawcalls for gathering statistics. For that reason there is a small drawcall at the beginning of the renderpass that iterates the buckets and gathers the draw counts. The CPU will read that count at some point later.

Future

Next stop is extending the occlusion to meshlets. Let’s see how that goes.

Random thoughts after moving from GLSL to HLSL

AnKi always had its shaders written in GLSL because it started as an OpenGL engine. Recently I decided to start rewriting them from GLSL to HLSL with one of the main motivations being that GLSL, as a language, has stopped evolving and it’s missing modern features. Sure, there are new extensions added to GLSL all the time but those extensions expose new Vulkan functionality, they don’t add new syntax to the language. So let’s dive into some random thoughts around HLSL that are tailored to Vulkan and SPIR-V.


Porting AnKi to Android… again after ~8 years

I think it was 8 or 9 years ago when AnKi was first ported to Android for a demo competition, and I remember the process being quite painful. Building native projects for Android was a hack, the micro differences between OpenGL ES and desktop OpenGL required compromises, and GLSL-related issues were a constant pain (compilation issues, bugs, shader reflection differences between implementations etc). These and other issues were the reason I left the Android port to rot and eventually removed it from the codebase.

All that changed 10 months ago when I decided to re-port AnKi to Android. But if it was so painful, why try again? The main reason is that the ecosystem had improved over the years. Android tooling for native code development (basically C/C++) got better, but the biggest motivator was Vulkan. I was optimistic that there would be fewer issues with Vulkan this time around. At the end of the day, was it less painful? The short answer is “yes”, but if you want the long answer, continue reading.
