Multi Draw Indirect Rendering in Vulkan

Ultimate Ldrago
Dec 8, 2022

Updated: Dec 11, 2022

In this blog, I will show how I implemented multi draw indirect rendering in my engine.

As usual, this post is not meant to be a tutorial, but more a way to document my learning process. If you spot any mistakes in my explanations, please feel free to let me know!

Note: I am coding in Java using LWJGL, so some of the Vulkan method signatures might look a bit different.

Why Multi Draw Indirect?

So first of all, what is Draw Indirect? If we take a look at how a render loop might usually be written:

...

for (int i = 0; i < renderables.size(); i++) {
        var renderable = renderables.get(i);
        
        vkCmdDrawIndexed(commandBuffer,
                renderable.mesh.indices.size(), 1, renderable.firstIndex,
                renderable.firstVertex, i);
}
...

(Note: This example setup uses a shared vertex and index buffer for all meshes, an SSBO object buffer that holds all the object specific information such as the transformation matrices, and there are no object specific descriptors that need to be set)

We can see that a vkCmdDrawIndexed() command is recorded for each object in the scene. In this scenario, where no object-specific state changes such as descriptor set bindings, push constants, or UBO updates are recorded between draws, wouldn't it be much more efficient if we could render all the objects with a single command? That is essentially what multi draw indirect does.

Ignoring a few (really important) steps, the above rendering loop rewritten with multi draw indirect would look something like this:

long offset = 0; // byte offset of the first command inside the buffer
int stride = VkDrawIndexedIndirectCommand.SIZEOF;

vkCmdDrawIndexedIndirect(commandBuffer, 
        indirectCommandsBuffer, offset, renderables.size(), stride);

Ignoring offset, stride, and indirectCommandsBuffer for now, this single statement is equivalent to all the vkCmdDrawIndexed() calls made from within the for-loop.

So how does it work?

If we take a closer look at the arguments for vkCmdDrawIndexedIndirect(), it might give us an idea.

vkCmdDrawIndexedIndirect(commandBuffer, 
        indirectCommandsBuffer, offset, renderables.size(), stride);

As the name suggests, indirectCommandsBuffer is a buffer that stores the indirect draw commands. Instead of explicitly recording a vkCmdDrawIndexed() command for every object into the commandBuffer from the CPU side, we fill a buffer with all the draw commands and then record a single indirect draw call into the commandBuffer that consumes them.
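To make the roles of offset, drawCount, and stride concrete, here is a minimal CPU-side sketch of how the commands are walked. It is only an illustration in plain Java: the class and helper names are made up, and on real hardware this walk is performed by the GPU, not by code like this.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// CPU-side illustration only; the names here are hypothetical and not part of Vulkan or LWJGL.
final class IndirectWalkSketch {
    // indexCount, instanceCount, firstIndex, vertexOffset, firstInstance: five 32-bit values
    static final int COMMAND_SIZE = 5 * Integer.BYTES;

    // Writes one command into slot `slot` of a tightly packed buffer.
    static void writeCommand(ByteBuffer buf, int slot, int indexCount, int instanceCount,
                             int firstIndex, int vertexOffset, int firstInstance) {
        int base = slot * COMMAND_SIZE;
        buf.putInt(base, indexCount);
        buf.putInt(base + 4, instanceCount);
        buf.putInt(base + 8, firstIndex);
        buf.putInt(base + 12, vertexOffset);
        buf.putInt(base + 16, firstInstance);
    }

    // Returns the five fields of every command that a call with these arguments would consume.
    static List<int[]> walk(ByteBuffer buf, long offset, int drawCount, int stride) {
        List<int[]> draws = new ArrayList<>();
        for (int i = 0; i < drawCount; i++) {
            int base = (int) offset + i * stride;
            draws.add(new int[] {
                buf.getInt(base), buf.getInt(base + 4), buf.getInt(base + 8),
                buf.getInt(base + 12), buf.getInt(base + 16)
            });
        }
        return draws;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(3 * COMMAND_SIZE).order(ByteOrder.nativeOrder());
        writeCommand(buf, 0, 36, 1, 0, 0, 0);
        writeCommand(buf, 1, 6, 1, 36, 24, 1);
        writeCommand(buf, 2, 12, 1, 42, 28, 2);

        // offset 0 with drawCount 3 visits every command, like the original for-loop
        for (int[] draw : walk(buf, 0, 3, COMMAND_SIZE)) {
            System.out.println(Arrays.toString(draw));
        }
    }
}
```

Passing a non-zero offset simply starts the walk at a later command, which is what the batching code further down relies on.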

Besides the efficiency of recording a single command to the commandBuffer, the main advantage of doing it this way is that, because the commands are stored in a buffer, they can also be modified from the GPU. This gives us a lot of flexibility to make the renderer more GPU-driven by reducing the amount of CPU-side communication.

In the next blog post, I'll show how I implemented GPU-driven frustum culling using compute shaders, which makes use of multi draw indirect.

Implementation

The first step is to enable Multi Draw Indirect when creating the logical device during program initialization.

...
var deviceFeatures = VkPhysicalDeviceFeatures.calloc(stack);
...
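deviceFeatures.multiDrawIndirect(true);
// The indirect commands written below use a non-zero firstInstance,
// which Vulkan only allows when this feature is enabled as well
deviceFeatures.drawIndirectFirstInstance(true);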

Then we need to create and allocate the indirect commands buffer.

createBufferVMA(
        vmaAllocator,
        MAXOBJECTS * VkDrawIndexedIndirectCommand.SIZEOF,
        VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
        VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT |
        VK_BUFFER_USAGE_TRANSFER_DST_BIT,
        VMA_MEMORY_USAGE_AUTO,
        VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT
        );

(Note: I am using VMA to create and allocate buffers, and the function call shown here is a light abstraction over the actual VMA functions)

As can be seen, we allocate enough space for MAXOBJECTS indirect commands. Choose a value for MAXOBJECTS large enough that the number of objects in your scene will not exceed it.

The VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT flag should be self-explanatory. We also set VK_BUFFER_USAGE_TRANSFER_DST_BIT because this buffer is going to be GPU-memory-only, so to initialize it with values from the CPU we need to copy data into it via a staging buffer. VK_BUFFER_USAGE_STORAGE_BUFFER_BIT is set because in the next blog post I will be accessing the buffer as an SSBO from the compute shader while implementing frustum culling.

Finally, VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT asks VMA to give the buffer its own dedicated memory allocation. What makes the buffer GPU-memory-only is VMA_MEMORY_USAGE_AUTO combined with the absence of any host access flag, which leads VMA to prefer device-local memory.

Then, we initialize the buffer with values from the CPU:

...
        // Create a temporary buffer to copy values into the indirect
        // commands buffer
        var alignmentSize = VkDrawIndexedIndirectCommand.SIZEOF;
        var bufferSize = alignmentSize * renderables.size();
        
        var stagingBuffer = createBufferVMA(vmaAllocator,
                bufferSize,
                VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
                VMA_MEMORY_USAGE_AUTO,
                VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT);
        
        //  BufferWriter is my abstraction for managing memory buffers.
        //  This writes the initialization values to the temp buffer
        var bw = new BufferWriter(vmaAllocator, stagingBuffer,
                                 alignmentSize, bufferSize);
        bw.mapBuffer();
        for(int i = 0; i < renderables.size(); i++) {
            bw.setPosition(i);

//            uint32_t    indexCount;
//            uint32_t    instanceCount;
//            uint32_t    firstIndex;
//            int32_t     vertexOffset;
//            uint32_t    firstInstance;

            bw.put(renderables.get(i).mesh.indices.size());
            bw.put(1);
            bw.put(renderables.get(i).firstIndex);
            bw.put(renderables.get(i).firstVertex);
            bw.put(i);
        }
        bw.unmapBuffer();

        // Copy temp buffer into the indirect commands buffer
        try (var stack = stackPush()) {
            var copy = VkBufferCopy.calloc(1, stack);
            
            Consumer<VkCommandBuffer> copyCmd = cmd -> {
                copy.dstOffset(0);
                copy.size(bufferSize);
                copy.srcOffset(0);
                vkCmdCopyBuffer(cmd, stagingBuffer.buffer, 
                frame.indirectCommandBuffer.buffer, copy);
            };

            submitImmediateCommand(copyCmd,
             singleTimeTransferCommandContext);
        }
        
        // Destroy temp buffer
        vmaDestroyBuffer(vmaAllocator, stagingBuffer.buffer,
        stagingBuffer.allocation);

The code should be self-explanatory, and it is pretty similar to how you would normally set up other Vulkan buffers.

For every object in the scene, we instantiate a VkDrawIndexedIndirectCommand. We set the appropriate vertex and index offsets according to how they are laid out in the shared vertex and index buffers.

Just as when individual draw commands were recorded for each object, we set the value of firstInstance to i, because the vertex shader uses this value to access the appropriate object-specific data from the objects buffer. Note that non-zero firstInstance values in indirect commands require the drawIndirectFirstInstance device feature.
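The lookup works because gl_InstanceIndex in Vulkan GLSL counts up from firstInstance. The following is a small CPU-side sketch of that indexing in plain Java; the class name and the string "object buffer" are made up for illustration and stand in for the real SSBO.

```java
import java.util.ArrayList;
import java.util.List;

// CPU-side illustration of how firstInstance selects an entry in the object buffer.
final class FirstInstanceLookupSketch {
    // Returns the object-buffer indices visited by one draw.
    // Mirrors gl_InstanceIndex, which starts at firstInstance and increases per instance.
    static List<Integer> instanceIndices(int firstInstance, int instanceCount) {
        List<Integer> indices = new ArrayList<>();
        for (int k = 0; k < instanceCount; k++) {
            indices.add(firstInstance + k);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Stand-in for the SSBO: one entry per renderable (contents are made up)
        String[] objectBuffer = {"floor transform", "crate transform", "lamp transform"};
        for (int i = 0; i < objectBuffer.length; i++) {
            // Each command uses firstInstance = i and instanceCount = 1
            for (int index : instanceIndices(i, 1)) {
                System.out.println("draw " + i + " reads " + objectBuffer[index]);
            }
        }
    }
}
```

With instanceCount fixed at 1, each draw therefore reads exactly one entry, the one at index i.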

What's more interesting here is the instanceCount parameter. When this is set to 1, the GPU renders one instance of the object as expected. If it is set to 0, the object is not rendered. We will later manipulate this value when we implement GPU frustum culling: the culling test runs in a compute shader, which then writes 0 or 1 into instanceCount to control whether the object is rendered.
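As a rough picture of what that write looks like, here is a CPU-side sketch in plain Java that patches only the instanceCount field of each command. It is an assumption-laden stand-in, not the compute shader itself: the class name, the visibility array, and the helpers are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// CPU-side illustration only; on the GPU a compute shader would do the equivalent write.
final class InstanceCountCullingSketch {
    static final int COMMAND_SIZE = 5 * Integer.BYTES;
    // instanceCount is the second 32-bit field of each command
    static final int INSTANCE_COUNT_OFFSET = Integer.BYTES;

    // Sets instanceCount to 1 for visible objects and 0 for culled ones.
    static void applyVisibility(ByteBuffer commands, boolean[] visible) {
        for (int i = 0; i < visible.length; i++) {
            commands.putInt(i * COMMAND_SIZE + INSTANCE_COUNT_OFFSET, visible[i] ? 1 : 0);
        }
    }

    // Counts the commands that would still produce geometry.
    static int countDrawn(ByteBuffer commands, int drawCount) {
        int drawn = 0;
        for (int i = 0; i < drawCount; i++) {
            if (commands.getInt(i * COMMAND_SIZE + INSTANCE_COUNT_OFFSET) > 0) {
                drawn++;
            }
        }
        return drawn;
    }

    public static void main(String[] args) {
        ByteBuffer commands = ByteBuffer.allocate(4 * COMMAND_SIZE).order(ByteOrder.nativeOrder());
        boolean[] visible = {true, false, true, true}; // made-up culling results
        applyVisibility(commands, visible);
        System.out.println(countDrawn(commands, 4) + " of 4 objects drawn");
    }
}
```

The draw count passed to vkCmdDrawIndexedIndirect() stays the same; culled objects simply become draws with zero instances.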

That should be it, this is essentially how you could enable and integrate indirect commands with your rendering engine.

But what about object specific changes?

You might have one question in mind; how do you get indirect commands working if your current rendering pipeline requires object specific descriptor sets or push constant updates to work?

Ideally your engine would be fully GPU-driven, meaning that you don't need any object-specific descriptor set bindings or UBO/push constant updates. Unfortunately, making the renderer fully bindless and GPU-driven is difficult. Rather than making huge architectural changes, you probably just want this one feature to (kinda) work for now and optimize it later.

My Kurama engine currently requires a descriptor set binding change too if the texture being used changes. The solution I implemented for this is the method shown in vkGuide.

Here is how my code looks:

...

// Group objects into batches that share the same texture descriptor set
var indirectBatches = compactDraws(renderables);

// Loop through each batch, bind the descriptor set, and record a single
// vkCmdDrawIndexedIndirect() for all the objects in the batch

int stride = VkDrawIndexedIndirectCommand.SIZEOF;
for(var batch: indirectBatches) {
    vkCmdBindDescriptorSets(commandBuffer, 
                            VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipelineLayout, 2,
                            stack.longs(renderables.get(batch.first).
                                                    textureDescriptorSet),
                             null);
    long offset = batch.first * VkDrawIndexedIndirectCommand.SIZEOF;
    
    vkCmdDrawIndexedIndirect(commandBuffer, indirectCommandBuffer, offset, 
                            batch.count, stride);
}

The solution is to group the individual objects into batches that share the same texture descriptor set, then loop through each batch, bind the descriptor set, and record a single vkCmdDrawIndexedIndirect() for all the objects in the batch.

Of course, if every object has a different texture, then this isn't any better than just recording a single draw command for every object, but remember that this still gives us the flexibility to perform GPU-driven culling. Also, if you use texture atlases, this should be a bit more efficient.

Another thing to note is that since culling would be directly performed by the GPU without having to go through the CPU, you would not have to re-record the command buffer each frame, but instead you would only have to re-record it whenever an object is added or removed from the scene.

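And just for the sake of completeness, this is my compactDraws() function:

public static List<IndirectBatch> compactDraws(List<Renderable> renderables) {
    var draws = new ArrayList<IndirectBatch>();

    // Nothing to batch; also avoids renderables.get(0) throwing on an empty scene
    if (renderables.isEmpty()) {
        return draws;
    }
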
    var lastDraw = new IndirectBatch(renderables.get(0).mesh,
                                     renderables.get(0).getMaterial(), 
                                     0, 1);
    draws.add(lastDraw);

    for(int i = 1; i < renderables.size(); i++) {

        var sameMaterial = renderables.get(i).getMaterial() == 
                                                    lastDraw.material;

        if(sameMaterial) {
            lastDraw.count++;
        }
        else {
            lastDraw = new IndirectBatch(renderables.get(i).mesh,
                                renderables.get(i).getMaterial(), i, 1);
            draws.add(lastDraw);
        }

    }

    return draws;
}
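Because compactDraws() only merges neighbouring renderables, the number of batches depends on the order of the list. The sketch below shows this with the same merging idea reduced to plain integer material ids; the class name, the ids, and the {first, count} array layout are assumptions made for illustration, not the engine's actual types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for compactDraws(): materials are plain ints, batches are {first, count}.
final class BatchingSketch {
    static List<int[]> compact(int[] materialIds) {
        List<int[]> batches = new ArrayList<>();
        for (int i = 0; i < materialIds.length; i++) {
            boolean startsNewBatch = i == 0 || materialIds[i] != materialIds[i - 1];
            if (startsNewBatch) {
                batches.add(new int[] {i, 1});
            } else {
                batches.get(batches.size() - 1)[1]++; // extend the current batch
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        int[] unsorted = {7, 3, 7, 3, 7}; // made-up material ids
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        System.out.println(compact(unsorted).size()); // 5 batches, one per object
        System.out.println(compact(sorted).size());   // 2 batches
    }
}
```

If you sort by material, the sorting presumably has to happen before the indirect commands and the objects buffer are filled, since batch.first and firstInstance both index into that order.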

Code

I do not have a code example created specifically to showcase multi draw indirect rendering, but you can see it implemented in my Kurama Engine's code base. Try using Ctrl+F to find the relevant method calls in the ActiveShutterRenderer.java file.

https://github.com/alanjoshua/Kurama-Engine/blob/master/RenderEngine/projects/ActiveShutter/code/main/ActiveShutterRenderer.java