Workarounds for issues with mesh shaders + Vulkan + HLSL

TL;DR: If you are using per-primitive input/output in mesh shaders with Vulkan and HLSL/DXC, you need to manually decorate variables and struct members using SPIR-V inline intrinsics. See how at the bottom.

This is going to be a quick post to raise awareness of a difficult-to-track bug with mesh shaders in Vulkan when using HLSL. Hopefully this will save some time for the next poor soul that bumps into it.

With mesh shaders you have two options when it comes to output variables. Your output can be per-vertex (just like in vertex shaders) or it can be per-primitive, which is a new addition with mesh shaders. This is what a mesh shader entry point could look like:

[numthreads(64, 1, 1)]
[outputtopology("triangle")]
void meshMain(
    out vertices PerVertOut myVerts[XXX], 
    out primitives PerPrimitiveOut myPrimitives[XXX], 
    out indices uint3 indices[XXX])
{
...
}

The “vertices” keyword denotes that the output will be per-vertex and the “primitives” keyword that it’s per-primitive. The DirectX Shader Compiler (DXC) will decorate myPrimitives with the PerPrimitiveEXT decoration, which is specific to mesh shaders. That’s all fine but the real problem lies in the fragment shader. Vulkan mandates that myPrimitives is decorated with PerPrimitiveEXT in the fragment shader as well, but in HLSL there is no way to do that. The “primitives” keyword is not allowed in fragment shaders.

FragOut psMain(in PerVertOut myVerts, PerPrimitiveOut myPrimitives)
{
...
}

In other words, if someone uses HLSL to write mesh shaders with per-primitive input/output variables they will end up with incorrect SPIR-V. The problems don’t stop there because, at the moment, this is a very hard-to-diagnose error:

  • The HLSL is valid
  • It works in D3D12 (because D3D12 does semantics linking at PSO creation time)
  • It works on nVidia + Vulkan (because nVidia has a tendency to work around user errors)
  • Vulkan validation doesn’t complain

… but it fails on AMD and nobody can understand why.

So how do we work around this issue? We can use SPIR-V inline intrinsics! It’s a little bit cumbersome but it can be done. This is how we should change our fragment shader:

struct PerPrimitiveOut {
    [[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someMember;
    [[vk::ext_decorate(5271 /*PerPrimitiveEXT*/)]] float4 someOtherMember;
};

FragOut psMain(
    in PerVertOut myVerts, 
    [[vk::ext_extension("SPV_EXT_mesh_shader")]] [[vk::ext_capability(5283 /*MeshShadingEXT*/)]] PerPrimitiveOut myPrimitives)
{
...
}

And voila! It works on AMD.

For reference, I’ve created a couple of GitHub issues that will hopefully help others avoid this problem:

https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/8398

https://github.com/microsoft/DirectXShaderCompiler/issues/6862

Parsing and rewriting SPIR-V

SPIR-V is a binary language that describes GPU shader code for the Vulkan and OpenCL APIs. For most graphics programmers SPIR-V is a completely opaque binary blob that they don’t have to touch or know much about. Someone can use their preferred compiler to translate GLSL or HLSL to SPIR-V, use SPIRV-Tools to optimize or assemble/disassemble it, use SPIRV-Cross or SPIRV-Reflect to perform shader reflection and, finally, use SPIRV-Cross to cross-compile SPIR-V to some high-level language. There is enough tooling out there to not have to worry about the internals. But SPIR-V is very well defined and quite simple to understand, and manipulating it without the use of 3rd-party tools is not something to feel intimidated by. This is the point I’m trying to make with this post.

Why parse and/or rewrite SPIR-V? For AnKi I’ve stumbled into a couple of cases where no existing tooling could do what I wanted. The first case is simple: I wanted to find out if a fragment shader was discarding. That boils down to searching the shader for spv::OpKill. The 2nd use case is more elaborate. For various reasons AnKi’s shaders were rewritten to support HLSL and HLSL’s binding model (using the register keyword). Since there is no direct mapping of HLSL’s binding model to Vulkan/SPIR-V we had to get a little creative. Without going into too many details, DXC remaps HLSL registers to some logical Vulkan bindings using -fvk-b-shift and co (register -> spv::DecorationBinding). These logical bindings are used to identify the register when performing shader reflection (spv::DecorationBinding -> register). After DXC completes the translation, the output SPIR-V contains logical bindings that need to be replaced. So after reflection AnKi rewrites the bindings. And this is the 2nd case where AnKi had to not only parse SPIR-V but also rewrite it.

This page describes how parsing SPIR-V works: https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/parsing_instructions.md

A SPIR-V binary starts with a header that is 20 bytes (5 words). After that the instructions follow. Each instruction starts with a 32bit word that encodes the opcode in its low 16 bits and the instruction’s total word count in its high 16 bits, followed by a variable number of 32bit operands. Iterating over all the instructions is pretty simple. Finding out if a SPIR-V binary contains spv::OpKill can be done like this:

#include <spirv/unified1/spirv.hpp> // provides the spv::Op and spv::Decoration enums

// codeSize is the size of the module in 32bit words
bool hasOpKill(const uint32_t* pCode, uint32_t codeSize)
{
	uint32_t offset = 5; // first 5 words of the module are the header

	while(offset < codeSize)
	{
		const uint32_t instruction = pCode[offset];

		const uint32_t length = instruction >> 16u; // high 16 bits: word count of the instruction
		const uint32_t opcode = instruction & 0xFFFFu; // low 16 bits: the opcode

		if(opcode == spv::OpKill)
		{
			return true;
		}

		offset += length;
	}

	return false;
}

It’s a similar story if someone wants to rewrite the bindings, for example:

// Rewrites all binding decorations in the module. For simplicity a single
// newBinding value is used; in practice you'd compute a value per decoration
void rewriteBindings(uint32_t* pCode, uint32_t codeSize, uint32_t newBinding)
{
	uint32_t offset = 5; // first 5 words of the module are the header

	while(offset < codeSize)
	{
		const uint32_t instruction = pCode[offset];

		const uint32_t length = instruction >> 16u;
		const uint32_t opcode = instruction & 0xFFFFu;

		// Encoding: OpDecorate | target id | DecorationBinding | <literal>
		if(opcode == spv::OpDecorate && pCode[offset + 2] == spv::DecorationBinding)
		{
			pCode[offset + 3] = newBinding; // 3rd operand is the binding. Rewrite it
		}

		offset += length;
	}
}

Pretty simple.
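
For completeness, this is a hypothetical way to load a SPIR-V file from disk and feed it to the functions above (the file name is made up and error handling is omitted for brevity):

#include <cstdio>
#include <vector>

std::vector<uint32_t> loadSpirv(const char* filename)
{
	FILE* file = fopen(filename, "rb");
	fseek(file, 0, SEEK_END);
	const long fileSize = ftell(file);
	fseek(file, 0, SEEK_SET);

	std::vector<uint32_t> code(fileSize / sizeof(uint32_t));
	fread(code.data(), 1, fileSize, file);
	fclose(file);
	return code;
}

...

const std::vector<uint32_t> code = loadSpirv("shader.spv");
const bool discards = hasOpKill(code.data(), uint32_t(code.size()));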

GPU driven rendering in AnKi: A high level overview

For the last few months AnKi underwent a heavy re-write in order to enable GPU driven rendering. This post will quickly go through the changes and the design without diving too deep into details. There will be another post (or maybe video) with all the implementation details.

What is GPU driven rendering (GDR from now on)? It’s the process of moving logic that traditionally runs on the CPU over to the GPU. The first category of GDR work includes various simulation workloads: things like particle simulations, clutter placement, snow/water deformations, cloth simulation and more. The next GDR category includes techniques that do GPU-based occlusion culling. That includes per-object occlusion and more fine-grained meshlet (aka cluster) occlusion. The last category is generating command buffers directly on the GPU. In DX12 this is done using ExecuteIndirect and in Vulkan with VK_NV_device_generated_commands. A very recent addition to this category is Work Graphs from AMD but that’s a bit experimental at the moment. Due to Vulkan limitations AnKi doesn’t use device generated commands, so the ceiling of AnKi’s implementation is multi-draw indirect with indirect count (aka VK_KHR_draw_indirect_count, aka MDI).

So what kind of GDR variant does AnKi use? The final goal was to perform all per-object visibility on the GPU: visibility for regular meshes as well as lights, probes, decals and the rest of the meta-objects. The whole work was split into 2 logical parts. The 1st part re-organizes data in a more GDR-friendly way and the 2nd is the GPU visibility and command buffer generation itself.

GDR Data / GPU Scene

With GDR the way we store data needs to be rethought. For AnKi there are 3 main components. The 1st is the Unified Geometry Buffer (UGB), which is a huge buffer that holds all index and vertex buffers of the whole game. The 2nd is the GPU scene buffer that stores a GPU view of the scenegraph: things like object transforms, lights, probes, uniforms (aka constants) and more. The 3rd bit is a descriptor set with all the textures, in other words bindless textures. Every renderable object has a structure (a data record) in the GPU scene that points to transforms, uniforms, bindless texture indices etc.
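
To make the GPU scene a bit more concrete, this is what such a renderable record could look like. Note that this is a hypothetical layout for illustration, not AnKi’s actual structure:

#include <cstdint>

// Everything in the record is an offset or index into persistent GPU data
struct GpuSceneRenderable
{
	uint32_t indexBufferOffset; // Into the Unified Geometry Buffer
	uint32_t vertexBufferOffset; // Into the Unified Geometry Buffer
	uint32_t transformOffset; // Into the GPU scene buffer
	uint32_t uniformsOffset; // Into the GPU scene buffer
	uint32_t diffuseTextureIndex; // Into the bindless descriptor set
};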

The intent with these 3 data holders is to have persistent data on the GPU and to be able to address index and vertex buffers using indirect arguments. No more CPU-to-GPU roundtrips. At the beginning of the frame there is what we call micro-patching of the GPU scene. For example, if an object was moved there will be a small copy from the CPU to the GPU scene.
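
As a rough sketch of what micro-patching could look like with the plain Vulkan API (the ScenePatch bookkeeping is hypothetical), the CPU gathers the dirty ranges into a staging buffer and records a single copy command for all of them:

#include <vulkan/vulkan.h>
#include <vector>

// Hypothetical description of a dirty range of the GPU scene
struct ScenePatch
{
	VkDeviceSize stagingOffset; // Where the new data lives in the staging buffer
	VkDeviceSize sceneOffset; // Where it goes in the GPU scene buffer
	VkDeviceSize size;
};

void recordMicroPatching(VkCommandBuffer cmdb, VkBuffer stagingBuffer, VkBuffer gpuSceneBuffer,
	const std::vector<ScenePatch>& patches)
{
	std::vector<VkBufferCopy> regions;
	regions.reserve(patches.size());
	for(const ScenePatch& patch : patches)
	{
		regions.push_back({patch.stagingOffset, patch.sceneOffset, patch.size});
	}

	// One command at the beginning of the frame copies all the small dirty ranges
	vkCmdCopyBuffer(cmdb, stagingBuffer, gpuSceneBuffer, uint32_t(regions.size()), regions.data());
}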

GPU occlusion

The next bit to discuss is how GPU occlusion works, at a very high level. Here things are somewhat vanilla; nothing that hasn’t been tried before. There is a Hierarchical Z Buffer (HZB) generated in frame N and used in frame N+1. This HZB is used for the occlusion of the GBuffer pass and the transparents rendering, but also to cull lights, decals and probes. The 1-frame delay is fine since the occlusion is per-object, but if it were per-meshlet this delay might have been problematic.

Occlusion is also done for the shadow cascades, using a variation of Assassin’s Creed Unity’s technique. Meaning: take the max (in our case) depth of each 64×64 tile of the depth buffer, render boxes for each tile into light space and use that to construct the HZB. Do that for each cascade.

One additional problem had to do with the other shadow-casting lights (and the probes). The CPU is not aware of the visible punctual lights but it needs to know them in order to build the commands required for shadows. What AnKi does in this case is send some info about the visible lights from the GPU to the CPU. With a few frames of delay the CPU will schedule the shadow passes. This mechanism is also used for probes that need to be updated in real time.

Command Buffer Building

As mentioned at the beginning, AnKi is using multi-draw indirect. The CPU organizes all renderable objects present in the scene into graphics state buckets. The state is effectively derived from the graphics pipeline. Then the renderer will build a single MDI call for each state bucket. When doing visibility testing, the GPU will place the indirect arguments of the visible renderables into the correct bucket.
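
A minimal sketch of what issuing one MDI call per bucket could look like. The StateBucket bookkeeping is hypothetical but vkCmdDrawIndexedIndirectCount is the real VK_KHR_draw_indirect_count entry point:

#include <vulkan/vulkan.h>
#include <vector>

// Hypothetical per-bucket bookkeeping. The GPU visibility pass has written
// VkDrawIndexedIndirectCommand structures and a draw count for every bucket
struct StateBucket
{
	VkPipeline pipeline;
	VkDeviceSize argsOffset; // Offset into the indirect arguments buffer
	VkDeviceSize countOffset; // Offset into the draw count buffer
	uint32_t maxDrawCount; // Number of renderables in the bucket
};

void drawBuckets(VkCommandBuffer cmdb, VkBuffer argsBuffer, VkBuffer countBuffer,
	const std::vector<StateBucket>& buckets)
{
	for(const StateBucket& bucket : buckets)
	{
		vkCmdBindPipeline(cmdb, VK_PIPELINE_BIND_POINT_GRAPHICS, bucket.pipeline);

		// The GPU-written count decides how many of the maxDrawCount draws will actually execute
		vkCmdDrawIndexedIndirectCount(cmdb, argsBuffer, bucket.argsOffset, countBuffer,
			bucket.countOffset, bucket.maxDrawCount, sizeof(VkDrawIndexedIndirectCommand));
	}
}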

The GPU visibility pass will populate the indirect arguments but also an additional buffer with one entry per visible instance. This buffer is a vertex buffer with per-instance rate, which is a well-known trick to work around the lack of base instance support in DX. This vertex buffer holds some offsets required by the renderable. For example, one of these offsets points to the transform and another to the uniforms.
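
Setting up such a per-instance-rate vertex buffer on the Vulkan side is straightforward. A minimal sketch, assuming a single 32bit offset per instance:

#include <vulkan/vulkan.h>

// One 32bit offset per instance, stepped at instance rate instead of vertex rate
VkVertexInputBindingDescription binding = {};
binding.binding = 0;
binding.stride = sizeof(uint32_t);
binding.inputRate = VK_VERTEX_INPUT_RATE_INSTANCE;

VkVertexInputAttributeDescription attribute = {};
attribute.location = 0;
attribute.binding = 0;
attribute.format = VK_FORMAT_R32_UINT;
attribute.offset = 0;

VkPipelineVertexInputStateCreateInfo vertexInput = {};
vertexInput.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
vertexInput.vertexBindingDescriptionCount = 1;
vertexInput.pVertexBindingDescriptions = &binding;
vertexInput.vertexAttributeDescriptionCount = 1;
vertexInput.pVertexAttributeDescriptions = &attribute;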

There are a few complexities in this scheme that have to do with counting drawcalls for gathering statistics. For that reason there is a small drawcall at the beginning of the renderpass that iterates the buckets and gathers the draw counts. The CPU will read that count back at some point later.

Future

Next stop is extending the occlusion to meshlets. Let’s see how that goes.

Random thoughts after moving from GLSL to HLSL

AnKi always had its shaders written in GLSL because it started as an OpenGL engine. Recently I decided to start rewriting them from GLSL to HLSL with one of the main motivations being that GLSL, as a language, has stopped evolving and it’s missing modern features. Sure, there are new extensions added to GLSL all the time but those extensions expose new Vulkan functionality, they don’t add new syntax to the language. So let’s dive into some random thoughts around HLSL that are tailored to Vulkan and SPIR-V.

Continue reading “Random thoughts after moving from GLSL to HLSL”

Porting AnKi to Android… again after ~8 years

I think it was 8 or 9 years ago when AnKi was first ported to Android for a demo competition, and I remember the process being quite painful. Building native projects for Android was a hack, the micro-differences between OpenGL ES and desktop OpenGL required compromises, and GLSL-related issues were a constant pain (compilation issues, bugs, shader reflection differences between implementations etc). These and more issues were the reason I left the Android port to rot and eventually removed it from the codebase.

All that changed 10 months ago when I decided to re-port AnKi to Android. But if it was so painful, why try again? The main reason is that the ecosystem has improved over the years. Android tooling for native code development (basically C/C++) got better, but the biggest motivator was Vulkan. I was optimistic that there would be fewer issues with Vulkan this time around. At the end of the day, was it less painful? The short answer is “yes” but if you want the long answer continue reading.

Continue reading “Porting AnKi to Android… again after ~8 years”

How to use debugPrintf in Vulkan

Some days ago I was trying to use the debugPrintf functionality that was introduced to Vulkan and adjacent projects more than a year ago. Since I haven’t found (or maybe I missed it) a good online document that describes all the steps to enable such functionality programmatically, I thought it might be a good idea to document it myself. This is going to be a short post.

First of all, what is debugPrintf? debugPrintf is a way to write text messages from shaders that execute on the GPU to stdout or to an output of your own choosing. In other words, the GPU can print text messages that the CPU will display. debugPrintf’s primary use-case is to help debug shaders.

How does it work? You can add expressions like these to your GLSL shaders:

#extension GL_EXT_debug_printf : enable
...
debugPrintfEXT("This is a message from the GPU. Some float=%f, some int=%d", 1.0, 123);

As you can see debugPrintfEXT looks quite similar to printf which makes it quite powerful. Using glslang (aka glslangValidator) you can convert shaders that contain debugPrintfEXT to SPIR-V and pass that SPIR-V to a VkShaderModule.

The majority of the implementation of debugPrintf lives in the Vulkan validation layer. The validation layer will rewrite the SPIR-V generated by glslang and add code that processes the given text and sends it down to the CPU. The validation layer makes use of a hidden descriptor set, atomics and hidden buffers to pass data from the GPU to the CPU. All this work and setup is transparent to the user but it can be quite slow.

So what are the steps to start using debugPrintf?

Step 1: Enable the extension in your shaders by adding: #extension GL_EXT_debug_printf : enable

Step 2: Enable the validation layer while creating the VkInstance:

VkInstanceCreateInfo instanceCreateInfo;
...

const char* layerNames[1] = {"VK_LAYER_KHRONOS_validation"};
instanceCreateInfo.ppEnabledLayerNames = layerNames;
instanceCreateInfo.enabledLayerCount = 1;

Step 3: Enable the debugPrintf validation layer feature while creating the VkInstance:

// Populate the VkValidationFeaturesEXT
VkValidationFeaturesEXT validationFeatures = {};
validationFeatures.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;
validationFeatures.enabledValidationFeatureCount = 1;

VkValidationFeatureEnableEXT enabledValidationFeatures[1] = {
	VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT};
validationFeatures.pEnabledValidationFeatures = enabledValidationFeatures;

// Then add the VkValidationFeaturesEXT to the VkInstanceCreateInfo
validationFeatures.pNext = instanceCreateInfo.pNext;
instanceCreateInfo.pNext = &validationFeatures;

Step 4: Setup the callback that will print the messages:

VkDebugReportCallbackEXT debugCallbackHandle;

// Populate the VkDebugReportCallbackCreateInfoEXT
VkDebugReportCallbackCreateInfoEXT ci = {};
ci.sType = VK_STRUCTURE_TYPE_DEBUG_REPORT_CALLBACK_CREATE_INFO_EXT;
ci.pfnCallback = myDebugCallback;
ci.flags = VK_DEBUG_REPORT_INFORMATION_BIT_EXT;
ci.pUserData = myUserData;

// Create the callback handle. Note that vkCreateDebugReportCallbackEXT is an
// extension function that needs to be fetched with vkGetInstanceProcAddr first
vkCreateDebugReportCallbackEXT(vulkanInstance, &ci, nullptr, &debugCallbackHandle);

...

// And this is the callback that the validation layer will call
VKAPI_ATTR VkBool32 VKAPI_CALL myDebugCallback(VkDebugReportFlagsEXT flags,
	VkDebugReportObjectTypeEXT objectType,
	uint64_t object,
	size_t location,
	int32_t messageCode,
	const char* pLayerPrefix,
	const char* pMessage,
	void* pUserData)
{
	// debugPrintf messages arrive with the INFORMATION bit, the same bit the
	// callback was registered with
	if(flags & VK_DEBUG_REPORT_INFORMATION_BIT_EXT)
	{
		printf("debugPrintfEXT: %s", pMessage);
	}

	return VK_FALSE;
}

Step 5: Make sure you enable the VK_KHR_shader_non_semantic_info device extension while building your VkDevice. This is pretty trivial so I won’t show any code.

Some additional notes:

  • It is possible to use debugPrintf with the DirectX Shader Compiler. In DX land debugPrintfEXT is just named printf
  • It is also possible to avoid all this annoying setup and use the Vulkan Configurator (aka vkconfig). vkconfig is part of the Vulkan SDK. More info on vkconfig here https://vulkan.lunarg.com/doc/view/1.2.135.0/windows/vkconfig.html
  • Latest RenderDoc also supports debugPrintf but I’m unsure about the details

And that’s pretty much it. If I missed something feel free to drop a comment below.

Resource uniformity & bindless access in Vulkan

This is going to be a short post about bindless descriptor access in Vulkan. There were other posts that touched on this topic in the past but in this one I’ll focus more on the Vulkan spec and the terminology in general. The concepts described here also apply to DX12 so I’ll try to cover both terminologies when possible.

In SPIR-V/GLSL/HLSL you can have arrays of values, arrays of descriptors and so on. These arrays can be sized, unsized, runtime-sized etc; it doesn’t matter. There are a few ways to index into those arrays:

  • Constant integral index
  • Dynamically uniform index
  • Non-dynamically uniform index
  • Subgroup uniform index

Constant integral index is a very typical access pattern. Nothing special:

const int idx = 10;
... = myArray[idx];

Dynamically uniform access is uniform across an invocation group. The Vulkan spec is a bit vague (on purpose) in defining what the invocation group is, but to be 100% safe we can view it as a whole drawcall or a whole compute dispatch. So dynamically uniform access doesn’t diverge inside a drawcall at all. All invocations (threads in DX12) of a drawcall (or dispatch) access the same thing.

layout(...) uniform MyConstantBuffer
{
    bool dynamicallyUniformBool;
    int dynamicallyUniformIndex;
};

...

if(dynamicallyUniformBool)
{
    ... = myArray[dynamicallyUniformIndex];
}

In the above example myArray is accessed using a dynamically uniform index and it’s inside a dynamically uniform control flow. All invocations of that drawcall will access the same array element.

Non-dynamically uniform access is when there is divergence between invocations of an invocation group (aka drawcall or dispatch).

int idx = rand() % 100; // rand() stands in for any value that diverges between invocations
... = myArray[idx];

Subgroup uniform access is when something doesn’t diverge between the invocations that form a subgroup (a wave in DX12). This is not explicitly exposed by the shading languages so we’ll leave it out for now.

We spoke about the various access methods as a general concept but what we are really interested in is accessing arrays of descriptors. This is what bindless really is. Vulkan added support for bindless in version 1.2 and, as usual, it also exposed a bunch of caps that define what’s allowed and what’s not.

shaderUniformBufferArrayDynamicIndexing, shaderSampledImageArrayDynamicIndexing, shaderStorageBufferArrayDynamicIndexing and shaderStorageImageArrayDynamicIndexing are caps since Vulkan 1.0. Having those set to false means that arrays of the relevant resources can only be accessed using a constant index (or, more precisely, any constant expression). Pretty much everyone has those set to true so let’s move on. Vulkan 1.2 added shaderInputAttachmentArrayDynamicIndexing, shaderUniformTexelBufferArrayDynamicIndexing and shaderStorageTexelBufferArrayDynamicIndexing and for most IHVs these are true as well. Note that sampler descriptors are absent from these caps.

Then there is the XXXArrayNonUniformIndexing family of caps. If these are false then the implementation doesn’t allow non-dynamically uniform access of the relevant descriptors. If such a cap is true then you can do bindless on the specific type of descriptor. Most vendors have these set to true, except Intel which doesn’t enable all of them.

An additional family of caps is XXXArrayNonUniformIndexingNative. These feel like a performance warning more than anything else. If XXXArrayNonUniformIndexingNative is false then the shader compiler will have to add additional instructions to cope with non-dynamically uniform access. This varies between IHVs quite a bit.
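
For reference, this is roughly how someone can query these caps at runtime on Vulkan 1.2, using the sampled image caps as an example (physicalDevice is assumed to be a valid VkPhysicalDevice):

#include <vulkan/vulkan.h>

VkPhysicalDeviceVulkan12Features features12 = {};
features12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_FEATURES;

VkPhysicalDeviceFeatures2 features2 = {};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
features2.pNext = &features12;
vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);

VkPhysicalDeviceVulkan12Properties props12 = {};
props12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_PROPERTIES;

VkPhysicalDeviceProperties2 props2 = {};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &props12;
vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

// Can we do bindless on sampled images at all?
const bool bindlessTextures = features12.shaderSampledImageArrayNonUniformIndexing == VK_TRUE;
// Is non-uniform indexing native or will the compiler have to patch things up?
const bool native = props12.shaderSampledImageArrayNonUniformIndexingNative == VK_TRUE;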

One additional piece of the puzzle is the NonUniform SPIR-V decoration, which is exposed via nonuniformEXT in GLSL and NonUniformResourceIndex() in HLSL. The default SPIR-V behavior mandates that descriptor accesses are dynamically uniform. When they are not, things might break. So when doing non-dynamically uniform accesses (when the implementation allows it, of course) you are required to use the NonUniform decoration. The NonUniform decoration is somewhat orthogonal to the caps discussed above: it doesn’t mean that if XXXArrayNonUniformIndexingNative is true you can omit the NonUniform decoration. The spec doesn’t really say when and if you can omit NonUniform, so the best thing to do is to always use it to decorate non-dynamically uniform accesses. If an implementation doesn’t care then it will simply ignore it.

Example of bindless in GLSL:

#extension GL_EXT_nonuniform_qualifier : enable

layout(...) uniform texture2D myBindlessHandles[]; // Runtime sized array
layout(...) uniform sampler mySampler;

...
vec4 color = texture(sampler2D(myBindlessHandles[nonuniformEXT(nonUniformIndex)], mySampler), uvs);
...

So, putting all of this together: AMD, for example, allows non-dynamically uniform access on sampled images (shaderSampledImageArrayNonUniformIndexing=true) but these accesses are not native (shaderSampledImageArrayNonUniformIndexingNative=false). By default AMD’s compiler will treat all descriptor accesses as dynamically uniform and use SGPRs to store the descriptors. If the access is non-dynamically uniform then things might break. This is where NonUniform comes into play. Since AMD’s HW doesn’t natively support non-dynamically uniform access, NonUniform will instruct the compiler to add extra instructions to ensure subgroup invariance.

It’s a similar story for Arm’s Mali, though for a different reason. On Mali some instructions require some of their arguments to be subgroup invariant and this is where non-dynamically uniform patterns become a problem.

One additional thing worth mentioning is that using buffer addresses to load data from buffers (exposed by VK_KHR_buffer_device_address and part of Vulkan 1.2) doesn’t require any NonUniform decoration. NonUniform is irrelevant if your shader code doesn’t index arrays of descriptors. Addresses don’t point to descriptors, they point to some raw memory.
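
For completeness, a minimal sketch of fetching a buffer’s address on the CPU side. device and myBuffer are assumed to be valid handles and the buffer must have been created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT:

#include <vulkan/vulkan.h>

VkBufferDeviceAddressInfo addressInfo = {};
addressInfo.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
addressInfo.buffer = myBuffer;

// A raw GPU pointer. No descriptors are involved so NonUniform never enters the picture.
// The address is typically passed to the shader via a push constant or another buffer
const VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);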

The final bit of the puzzle is to understand which builtins are dynamically uniform and which are not. The answer is hidden inside the spec: only gl_DrawID is explicitly mentioned as dynamically uniform; everything else is not. If, for example, you are using gl_InstanceIndex/SV_InstanceID (directly or indirectly) to index resources then you technically need to use the NonUniform decoration.

Big thanks to Christian Forfang for providing some early feedback!