One of the challenges of low level APIs such as Vulkan and D3D12 is the synchronization of GPU work. The most used primitive in GPU synchronization is the pipeline barrier, and if someone takes a look at vkCmdPipelineBarrier or ID3D12GraphicsCommandList::ResourceBarrier/Barrier it’s easy to understand why many people struggle. In this article we will decompose barriers and put them back together in a more simplified form. We will use Vulkan as the base of our discussion but, since D3D12’s Enhanced Barriers (link) are extremely close to Vulkan’s barrier model, most concepts discussed here are transferable to D3D12 as well. D3D12’s old barrier model (ID3D12GraphicsCommandList::ResourceBarrier) is out of scope since I’m not a big fan of it. Unfortunately this article requires a deep understanding of synchronization so be prepared.
GPU driven rendering in AnKi
This time I decided to try something different and do a video presentation instead of a blog post. There were a lot of things to cover and a video presentation seemed better.
Anyway, here is the video:
And here is the link to the presentation: https://docs.google.com/presentation/d/1tsGjwZmP2mZBNyURNN4-XYsgJxfsG6omaC3w5ZQXzr4/edit#slide=id.g2155b36a3d3_0_0
Workarounds for issues with mesh shaders + Vulkan + HLSL
TL;DR: If using per-primitive in/out in mesh shaders in Vulkan and in HLSL/DXC you need to manually decorate variables and struct members using SPIR-V intrinsics. See how at the bottom.
This is going to be a quick post to raise awareness of a difficult-to-track bug with mesh shaders in Vulkan when using HLSL. Hopefully this will save some time for the next poor soul that bumps into it.
With mesh shaders you have two options when it comes to output variables. Your output can be per-vertex (just like in vertex shaders) or it can be per-primitive, which is a new addition with mesh shaders. This is how a mesh shader entry point might look:
[numthreads(64, 1, 1)] [outputtopology("triangle")]
void meshMain(
out vertices PerVertOut myVerts[XXX],
out primitives PerPrimitiveOut myPrimitives[XXX],
out indices uint3 indices[XXX])
{
...
}
The “vertices” keyword denotes that the output is per-vertex and the “primitives” keyword that it’s per-primitive. The DirectX Shader Compiler (DXC) will decorate myPrimitives with the PerPrimitiveEXT decoration, which is specific to mesh shaders. That’s all fine, but the real problem lies in the fragment shader. Vulkan mandates that myPrimitives is decorated with PerPrimitiveEXT in the fragment shader as well, but in HLSL there is no way to do that: the “primitives” keyword is not allowed in fragment shaders.
FragOut psMain(in PerVertOut myVerts, PerPrimitiveOut myPrimitives)
{
...
}
In other words, if someone tries to use HLSL to write mesh shaders that use per-primitive input/output variables they will end up with incorrect SPIR-V. The problems continue because, at the moment, this is a very hard-to-diagnose error:
- The HLSL is valid
- It works in D3D12 (because D3D12 does semantics linking at PSO creation time)
- It works on NVIDIA + Vulkan (because NVIDIA has a tendency to work around user errors)
- Vulkan validation doesn’t complain
… but it fails on AMD and nobody can understand why.
So how do we work around this issue? We can use SPIR-V inline intrinsics! It’s a little bit cumbersome but it can be done. This is how we should change our fragment shader:
struct PerPrimitiveOut {
[[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someMember;
[[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someOtherMember;
};
FragOut psMain(
in PerVertOut myVerts,
[[vk::ext_extension("SPV_EXT_mesh_shader")]] [[vk::ext_capability(5283 /*MeshShadingEXT*/)]] PerPrimitiveOut myPrimitives)
{
...
}
And voila! It works on AMD.
For reference, I’ve created a couple of GitHub issues that will help avoid this issue in the future:
https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/8398
https://github.com/microsoft/DirectXShaderCompiler/issues/6862
Parsing and rewriting SPIR-V
SPIR-V is a binary language that describes the GPU shader code for the Vulkan and OpenCL APIs. For most graphics programmers SPIR-V is a completely opaque binary blob that they don’t have to touch or know much about. Someone can use their preferred compiler to translate GLSL or HLSL to SPIR-V, can use SPIRV-Tools to optimize or assemble/disassemble, can use SPIRV-Cross or SPIRV-Reflect to perform shader reflection, and finally they can use SPIRV-Cross to cross-compile SPIR-V to some high level language. There is enough tooling out there that you don’t have to worry about the internals. At the same time, SPIR-V is very well defined and quite simple to understand, so manipulating it without the use of 3rd party tools is not something to feel intimidated by. That is the point I’m trying to make with this post.
Why parse and/or rewrite SPIR-V? For AnKi I’ve stumbled into a couple of cases where there was no existing tooling that could do what I wanted. The first case is quite simple: I wanted to find out if the fragment shader discards. That boils down to searching the shader for spv::OpKill. The 2nd use-case is more elaborate. Due to various reasons AnKi’s shaders were rewritten to support HLSL and HLSL’s binding model (using the register keyword). Since there is no direct mapping of HLSL’s binding model to Vulkan/SPIR-V we had to get a little creative. Without going into many details, DXC remaps HLSL registers to some logical Vulkan bindings using -fvk-b-shift and co (register -> spv::DecorationBinding). These logical bindings are used to identify the register when performing shader reflection (spv::DecorationBinding -> register). After DXC completes the translation, the output SPIR-V contains logical bindings that need to be replaced, so after reflection AnKi rewrites the bindings. This is the 2nd case where AnKi had to parse SPIR-V, but also rewrite it.
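To make the register-to-binding remapping more concrete, here is a minimal CPU-side sketch of the idea. The shift values and the enum are made up for illustration; the real values are whatever an engine chooses to pass to DXC via -fvk-b-shift and friends.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register types: b#, t#, s#, u# in HLSL terms.
enum class HlslRegisterType : uint32_t { B = 0, T, S, U };

// Hypothetical per-type shifts, mimicking -fvk-b-shift/-fvk-t-shift/etc.
constexpr uint32_t kShifts[4] = {0, 100, 200, 300};

// register -> logical binding (what DXC would emit as spv::DecorationBinding).
uint32_t toLogicalBinding(HlslRegisterType type, uint32_t slot)
{
	return kShifts[static_cast<uint32_t>(type)] + slot;
}

// logical binding -> register (what reflection would recover).
void fromLogicalBinding(uint32_t binding, HlslRegisterType& type, uint32_t& slot)
{
	uint32_t i = 3;
	while(i > 0 && binding < kShifts[i])
	{
		--i;
	}
	type = static_cast<HlslRegisterType>(i);
	slot = binding - kShifts[i];
}
```

With a scheme like this, reflection can always tell which HLSL register a logical binding came from, and the engine is then free to rewrite the binding to its final value.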
This page describes how parsing SPIR-V works: https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/parsing_instructions.md
A SPIR-V binary starts with a 20-byte (5-word) header, after which the instructions follow. Each instruction starts with a 32-bit word that encodes the opcode in its low 16 bits and the instruction’s total length in words in its high 16 bits, followed by a variable number of 32-bit operands. Iterating over all the instructions is pretty simple. Finding if the SPIR-V binary contains spv::OpKill can be done like this:
bool hasOpKill(const uint32_t* pCode, uint32_t codeSize) // codeSize in words
{
	uint32_t offset = 5; // first 5 words of the module are the header
	while(offset < codeSize)
	{
		const uint32_t instruction = pCode[offset];
		const uint32_t length = instruction >> 16; // instruction word count
		const uint32_t opcode = instruction & 0xFFFFu;
		offset += length;
		if(opcode == spv::OpKill)
		{
			return true;
		}
	}
	return false;
}
It’s a similar story if someone wants to rewrite the bindings, for example:
void rewriteBindings(uint32_t* pCode, uint32_t codeSize)
{
uint32_t offset = 5; // first 5 words of module are the headers
while(offset < codeSize)
{
uint32_t instruction = pCode[offset];
uint32_t length = instruction >> 16;
uint32_t opcode = instruction & 0x0ffffu;
// Encoding: OpDecorate || id || DecorationBinding || <literal>
if(opcode == spv::OpDecorate && pCode[offset + 2] == spv::DecorationBinding)
{
pCode[offset + 3] = someValue; // 3rd argument is the binding. Re-write it
}
offset += length;
}
}
Pretty simple.
GPU driven rendering in AnKi: A high level overview
For the last few months AnKi underwent a heavy re-write in order to enable GPU driven rendering. This post will quickly go through the changes and the design without diving too deep into details. There will be another post (or maybe video) with all the implementation details.
What is GPU driven rendering (GDR from now on)? It’s the process of moving logic that traditionally runs on the CPU to the GPU. The first category of GDR work includes various simulation workloads: things like particle simulations, clutter placement, snow/water deformations, cloth simulation and more. The next GDR category includes techniques that do GPU based occlusion culling. That includes per-object occlusion and more fine-grained meshlet (aka cluster) occlusion. The last category is generating command buffers directly on the GPU. In D3D12 this is done using ExecuteIndirect and in Vulkan with NV_device_generated_commands. A very recent addition to this category is Work Graphs from AMD, but that’s a bit experimental at the moment. Due to Vulkan limitations AnKi doesn’t use device generated commands, so the ceiling of AnKi’s implementation is multi-draw indirect with indirect count (aka VK_KHR_draw_indirect_count, aka MDI).
So what kind of GDR variant does AnKi use? The final goal was to perform all per-object visibility on the GPU: visibility for regular meshes as well as lights, probes, decals and the rest of the meta-objects. The whole work was split into 2 logical parts. The 1st part re-organizes data in a more GDR-friendly way and the 2nd is the GPU visibility and command buffer generation itself.
GDR Data / GPU Scene
With GDR the way we store data needs to be rethought. For AnKi there are 3 main components. The 1st is the Unified Geometry Buffer (UGB), a huge buffer that holds all index and vertex buffers of the whole game. The 2nd is the GPU scene buffer that stores a GPU view of the scenegraph: things like object transforms, lights, probes, uniforms (aka constants) and more. The 3rd bit is a descriptor set with all the textures, in other words bindless textures. Every renderable object has a structure (a data record) in the GPU scene that points to transforms, uniforms, bindless texture indices etc.
The intent with these 3 data holders is to have persistent data on the GPU and be able to address index and vertex buffers using indirect arguments. No more CPU-to-GPU roundtrips. At the beginning of the frame there is what we call micro-patching of the GPU scene: for example, if an object was moved there will be a small copy from the CPU to the GPU scene.
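The post doesn’t detail how AnKi tracks what to micro-patch, but a minimal sketch of the bookkeeping such a scheme implies could look like this: the CPU records the byte ranges of the GPU scene it touched and, at frame start, coalesces them into the few small copy regions (e.g. vkCmdCopyBuffer regions) that get uploaded. All names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A byte range inside the GPU scene buffer that the CPU modified this frame.
struct DirtyRange
{
	uint32_t offset;
	uint32_t size;
};

// Coalesce overlapping or touching dirty ranges into the minimal set of copy
// regions. Each surviving range would become one small CPU->GPU copy.
std::vector<DirtyRange> coalesce(std::vector<DirtyRange> ranges)
{
	std::sort(ranges.begin(), ranges.end(),
	          [](const DirtyRange& a, const DirtyRange& b) { return a.offset < b.offset; });
	std::vector<DirtyRange> out;
	for(const DirtyRange& r : ranges)
	{
		if(!out.empty() && r.offset <= out.back().offset + out.back().size)
		{
			// Overlaps or touches the previous range: extend it.
			const uint32_t end = std::max(out.back().offset + out.back().size, r.offset + r.size);
			out.back().size = end - out.back().offset;
		}
		else
		{
			out.push_back(r);
		}
	}
	return out;
}
```

Coalescing matters because many tiny copies (one per moved object) would be more expensive to record and execute than a handful of merged regions.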
GPU occlusion
The next bit to discuss is how GPU occlusion works at a very high level. Here things are somewhat vanilla; nothing that hasn’t been tried before. There is a Hierarchical Z Buffer (HZB) generated in frame N and used in frame N+1. This HZB is used for the occlusion of the GBuffer pass and the transparents rendering, but also to cull lights, decals and probes. The 1-frame delay is fine since the occlusion is per object, but if it was per meshlet this delay might have been problematic.
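The core of HZB construction is a simple reduction: each texel of the next mip takes the most conservative depth of a 2×2 quad of the previous one. The following is a CPU sketch of one reduction step, assuming a convention where larger depth means farther (with reversed-Z the reduction would take the min instead); it is not AnKi’s actual compute shader.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One HZB reduction step: every output texel holds the max depth of the 2x2
// quad it covers in the source mip. w and h are the source dimensions and are
// assumed even for simplicity.
std::vector<float> downsampleMax(const std::vector<float>& src, uint32_t w, uint32_t h)
{
	std::vector<float> dst((w / 2) * (h / 2));
	for(uint32_t y = 0; y < h / 2; ++y)
	{
		for(uint32_t x = 0; x < w / 2; ++x)
		{
			const float d0 = src[(2 * y) * w + (2 * x)];
			const float d1 = src[(2 * y) * w + (2 * x + 1)];
			const float d2 = src[(2 * y + 1) * w + (2 * x)];
			const float d3 = src[(2 * y + 1) * w + (2 * x + 1)];
			dst[y * (w / 2) + x] = std::max(std::max(d0, d1), std::max(d2, d3));
		}
	}
	return dst;
}
```

An occlusion test then only has to compare an object’s nearest depth against a single texel of the appropriate mip: if the object is farther than the max depth stored there, it’s occluded.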
Occlusion is also done for the shadow cascades, using a variation of Assassin’s Creed Unity’s technique. Meaning: take the max (in our case) depth of each 64×64 tile of the depth buffer, render a box for each tile into light space and use that to construct the HZB. Do that for each cascade.
One additional problem had to do with the other shadow casting lights (and the probes). The CPU is not aware of the visible punctual lights, but it needs to know them in order to build the commands required for shadows. What AnKi does in this case is send some info about the visible lights from the GPU to the CPU. With a few frames of delay the CPU will schedule the shadow passes. The same mechanism is also used for probes that need to be updated in real-time.
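The post doesn’t spell out how that GPU-to-CPU feedback is buffered, but one plausible way to model it is a small ring of readback slots, one per frame in flight: the GPU visibility pass writes into the slot of the current frame and the CPU consumes the slot that is guaranteed to be idle. This is a toy model with made-up names, not AnKi’s implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

constexpr uint32_t kFramesInFlight = 3; // assumption; a typical value

struct FeedbackRing
{
	std::vector<uint32_t> slots[kFramesInFlight];

	// What the GPU visibility pass would write for frame `frame`:
	// the IDs of the lights it found visible.
	void gpuWrite(uint64_t frame, std::vector<uint32_t> visibleLightIds)
	{
		slots[frame % kFramesInFlight] = std::move(visibleLightIds);
	}

	// What the CPU reads while preparing frame `frame`: the results of the
	// oldest in-flight frame, which the GPU has certainly finished by now.
	const std::vector<uint32_t>& cpuRead(uint64_t frame) const
	{
		return slots[(frame + 1) % kFramesInFlight];
	}
};
```

The point of the ring is that the CPU never reads a slot the GPU may still be writing, at the cost of acting on visibility information that is a few frames stale.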
Command Buffer Building
As mentioned at the beginning, AnKi is using multi-draw indirect. The CPU organizes all renderable objects present in the scene into graphics state buckets; the state is effectively derived from the graphics pipeline. Then the renderer will build a single MDI call for each state bucket. When doing visibility testing the GPU will place the indirect arguments of the visible renderables into the correct bucket.
The GPU visibility pass will populate the indirect arguments but also an additional buffer per visible instance. This buffer is a vertex buffer with per-instance rate, a well known trick to work around the lack of base instance support in DX. This vertex buffer holds some offsets required by the renderable: for example, one of these offsets points to the transform and another to the uniforms.
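A rough sketch of what one visibility “thread” does when it finds a renderable visible, assuming the atomics-into-buckets scheme described above (the struct mirrors Vulkan’s VkDrawIndexedIndirectCommand and is defined locally to keep the sketch self-contained; in the real thing the counter and arguments live in GPU buffers):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Same layout as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand
{
	uint32_t indexCount;
	uint32_t instanceCount;
	uint32_t firstIndex;
	int32_t vertexOffset;
	uint32_t firstInstance;
};

// One MDI call per state bucket: an array of indirect args plus the count
// that the indirect-count draw (VK_KHR_draw_indirect_count) consumes.
struct StateBucket
{
	std::vector<DrawIndexedIndirectCommand> args;
	std::atomic<uint32_t> drawCount{0};
};

// Called for each renderable the visibility pass found visible. The atomic
// hands out a unique slot; firstInstance is set to that slot so the draw can
// fetch its entry in the per-instance offsets vertex buffer.
void emitVisible(StateBucket& bucket, uint32_t indexCount, uint32_t firstIndex)
{
	const uint32_t slot = bucket.drawCount.fetch_add(1);
	bucket.args[slot] = {indexCount, 1, firstIndex, 0, slot};
}
```

Since DX lacks a base-instance equivalent the shader can read, pointing firstInstance at the per-instance vertex buffer slot is what lets each draw find its own transform and uniform offsets.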
There are a few complexities in this scheme that have to do with counting drawcalls for gathering statistics. For that reason there is a small drawcall at the beginning of the renderpass that iterates the buckets and gathers the draw counts. The CPU will read that count at some point later.
Future
Next stop is extending the occlusion to meshlets. Let’s see how that goes.
Random thoughts after moving from GLSL to HLSL
AnKi always had its shaders written in GLSL because it started as an OpenGL engine. Recently I decided to start rewriting them from GLSL to HLSL with one of the main motivations being that GLSL, as a language, has stopped evolving and it’s missing modern features. Sure, there are new extensions added to GLSL all the time but those extensions expose new Vulkan functionality, they don’t add new syntax to the language. So let’s dive into some random thoughts around HLSL that are tailored to Vulkan and SPIR-V.
Porting AnKi to Android… again after ~8 years
I think it was 8 or 9 years ago when AnKi was first ported to Android for a demo competition and I remember the process being quite painful. Building native projects for Android was a hack, the micro differences between OpenGL ES and desktop OpenGL required compromises, GLSL related issues were a constant pain (compilation issues, bugs, shader reflection differences between implementations etc). These and more were the reasons I left the Android port to rot and eventually removed it from the codebase.
All that changed 10 months ago when I decided to re-port AnKi to Android. But if it was so painful, why try again? The main reason is that the ecosystem had improved over the years. Android tooling for native code development (basically C/C++) got better, but the biggest motivator was Vulkan. I was optimistic that there would be fewer issues with Vulkan this time around. At the end of the day, was it less painful? The short answer is “yes” but if you want the long answer continue reading.
How to use debugPrintf in Vulkan
Some days ago I was trying to use the debugPrintf functionality that was introduced to Vulkan and adjacent projects more than a year ago. Since I haven’t found (or maybe I missed it) a good online document that describes all the steps to enable such functionality programmatically, I thought it might be a good idea to document it myself. This is going to be a short post.
First of all, what is debugPrintf? debugPrintf is a way to write text messages from shaders that execute on the GPU to stdout or to an output of your own choosing. In other words, the GPU can print text messages that the CPU will display. debugPrintf’s primary use-case is to help debug shaders.
How does it work? Someone can add expressions like these in their GLSL shaders:
#extension GL_EXT_debug_printf : enable
...
debugPrintfEXT("This is a message from the GPU. Some float=%f, some int=%d", 1.0, 123);
As you can see debugPrintfEXT looks quite similar to printf which makes it quite powerful. Using glslang (aka glslangValidator) you can convert shaders that contain debugPrintfEXT to SPIR-V and pass that SPIR-V to a VkShaderModule.
The majority of the implementation of debugPrintf lives in the Vulkan validation layer. The validation layer will rewrite the SPIR-V generated by glslang and add code that processes the given text and sends it down to the CPU. To pass data from the GPU to the CPU it makes use of a hidden descriptor set, atomics and hidden buffers. All this work and setup is transparent to the user, and it can be quite slow.
So what are the steps to start using debugPrintf?
Step 1: Enable the extension in your shaders by adding: #extension GL_EXT_debug_printf : enable
Step 2: Enable the validation layer while creating the VkInstance:
VkInstanceCreateInfo instanceCreateInfo;
...
const char* layerNames[1] = {"VK_LAYER_KHRONOS_validation"};
instanceCreateInfo.ppEnabledLayerNames = &layerNames[0];
instanceCreateInfo.enabledLayerCount = 1;
Step 3: Enable the debugPrintf validation layer feature while creating the VkInstance:
// Populate the VkValidationFeaturesEXT
VkValidationFeaturesEXT validationFeatures = {};
validationFeatures.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;
validationFeatures.enabledValidationFeatureCount = 1;
VkValidationFeatureEnableEXT enabledValidationFeatures[1] = {
VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT};
validationFeatures.pEnabledValidationFeatures = enabledValidationFeatures;
// Then add the VkValidationFeaturesEXT to the VkInstanceCreateInfo
validationFeatures.pNext = instanceCreateInfo.pNext;
instanceCreateInfo.pNext = &validationFeatures;
Step 4: Setup the callback that will print the messages:
VkDebugReportCallbackEXT debugCallbackHandle;
// Populate the VkDebugReportCallbackCreateInfoEXT
VkDebugReportCallbackCreateInfoEXT ci = {};
ci.sType = VK_STRUCTURE_TYPE_DEBUG_REPORT_CALLBACK_CREATE_INFO_EXT;
ci.pfnCallback = myDebugCallback;
ci.flags = VK_DEBUG_REPORT_INFORMATION_BIT_EXT;
ci.pUserData = myUserData;
// Create the callback handle
vkCreateDebugReportCallbackEXT(vulkanInstance, &ci, nullptr, &debugCallbackHandle);
...
// And this is the callback that the validator will call
VKAPI_ATTR VkBool32 VKAPI_CALL myDebugCallback(VkDebugReportFlagsEXT flags,
	VkDebugReportObjectTypeEXT objectType,
	uint64_t object,
	size_t location,
	int32_t messageCode,
	const char* pLayerPrefix,
	const char* pMessage,
	void* pUserData)
{
	// debugPrintf messages arrive as information messages, so filter on that bit
	if(flags & VK_DEBUG_REPORT_INFORMATION_BIT_EXT)
	{
		printf("debugPrintfEXT: %s\n", pMessage);
	}
	return VK_FALSE;
}
Step 5: Make sure you enable the VK_KHR_shader_non_semantic_info device extension while building your VkDevice. This is pretty trivial so I won’t show any code.
Some additional notes:
- It is possible to use debugPrintf with the DirectX compiler. In DX land debugPrintfEXT is just named printf
- It is also possible to avoid all this annoying setup and use the Vulkan configurator (aka vkconfig). vkconfig is part of Vulkan SDK. More info on vkconfig here https://vulkan.lunarg.com/doc/view/1.2.135.0/windows/vkconfig.html
- Latest RenderDoc also supports debugPrintf but I’m unsure about the details
And that’s pretty much it. If I missed something feel free to drop a comment below.