Ultimate Ldrago

Implementing GPU frustum culling using Compute Shaders

Updated: Dec 13, 2022


In this blog, I will show how I implemented GPU-driven frustum culling using compute shaders in my engine. As usual, this post is not meant to be a tutorial, but more a way to document my learning process. If you spot any mistakes in my explanations, please feel free to let me know!


Also, this is going to be an extension to my blog about implementing multi draw indirect commands, so I would recommend reading that first.


Note: Please note that I am coding in Java using LWJGL, and thus some of the Vulkan method signatures might look a bit different.

 

So what is Frustum Culling?


Frustum Culling is the process of culling (excluding from rendering) objects that are not visible to the camera. When you have hundreds of complex meshes in your scene, frustum culling is probably one of the simplest yet most impactful performance optimization methods you could implement.




To perform frustum culling, for each frame, you first construct a frustum from your camera, then for each object, check whether its bounds are present within the frustum or not. You then only render the objects that are within the camera's view frustum.


The simplest way to do this would be to run all the frustum checks sequentially on the CPU, and then adjust the draw calls. But you can imagine how this would quickly rack up a significant performance cost as the number of objects in the scene increases.


The solution we will be implementing is a little more sophisticated: we perform frustum culling on the GPU via compute shaders. We will also be using bounding spheres for the frustum checks, since you only need to store an extra float (bound radius) per object for this.


Note: I will not really be going over the math of how the frustum check works, but it is pretty straightforward, and there are a lot of resources available.
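For reference, the standard sphere-vs-plane formulation can be sketched on the CPU like this. This is not the code from my engine; it assumes each plane is stored as (normal.xyz, d) with normalized normals pointing into the frustum, so a signed distance below -radius means the sphere is fully outside that plane:

```java
public class FrustumTest {

    // Sphere-vs-frustum check: the sphere is culled only if it lies
    // entirely on the outside of at least one of the six planes.
    public static boolean isWithinFrustum(float[][] planes,
                                          float cx, float cy, float cz,
                                          float radius) {
        for (float[] p : planes) {
            // Signed distance from the sphere center to the plane
            float dist = p[0] * cx + p[1] * cy + p[2] * cz + p[3];
            if (dist < -radius) {
                return false; // fully outside this plane
            }
        }
        return true;
    }
}
```

The same test translates almost line-for-line into GLSL for the compute shader.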


 

Compute Pipeline


Setting up the compute shader should be pretty similar to how you set up the graphics pipeline.


First, when we pick the VkPhysicalDevice, similar to how you check for graphics and transfer queues, we need to make sure that it supports a separate compute queue:



private static int getQueueFamily(VkQueueFamilyProperties.Buffer queueFamilies, int flag) {

    // Dedicated queue for compute:
    // try to find a queue family index that supports compute but not graphics
    if((flag & VK_QUEUE_COMPUTE_BIT) == flag) {
        for(int i = 0; i < queueFamilies.capacity(); i++) {
            int queueFlags = queueFamilies.get(i).queueFlags();
            if((queueFlags & VK_QUEUE_COMPUTE_BIT) != 0
                    && (queueFlags & VK_QUEUE_GRAPHICS_BIT) == 0) {
                return i;
            }
        }
    }

   ...

Then get the compute queue from the queue family index:

vkGetDeviceQueue(device, indices.computeFamily, 0, pQueue);
computeQueue = new VkQueue(pQueue.get(0), device);

We then create the compute pipeline and pipeline layout objects:

public void createComputePipeline() {

    // Abstraction for building graphics and compute pipelines and
    // pipeline layouts
    var builder = new PipelineBuilder(PipelineBuilder.PipelineType.COMPUTE);

    builder.shaderStages.add(new PipelineBuilder.ShaderStageCreateInfo(
            "shaders/cull.comp", VK_SHADER_STAGE_COMPUTE_BIT));
    builder.descriptorSetLayouts =
            new long[]{multiViewRenderPass.computeDescriptorSetLayout};

    var result = builder.build(device, null);
    computePipelineLayout = result.pipelineLayout();
    computePipeline = result.pipeline();

    // Add to deletion queue to destroy the compute pipeline objects when
    // the program is terminated
    deletionQueue.add(() -> vkDestroyPipeline(device, computePipeline, null));
    deletionQueue.add(() -> vkDestroyPipelineLayout(device, computePipelineLayout, null));
}

This is my abstraction over the pipeline creation step, so the functions are of course different from what you would use in pure Vulkan, but I included it to show how much simpler it is to set up a compute pipeline compared to a graphics pipeline; the user only needs to pass in the shader and the descriptor set layout. (We'll take a look at the descriptor set layout when discussing the compute shader.)


Internally, this is how build() for the compute pipeline looks:

...

if(descriptorSetLayouts == null) {
    throw new RuntimeException("Compute shaders must have descriptors");
}

// Create the shaderStageInfo buffer the same way you create it when 
// building your graphics pipeline

// I use a helper class to compile GLSL to SPIR-V, so ignore the code related to that
var shaderStagesBuffer =
       VkPipelineShaderStageCreateInfo.calloc(shaderStages.size(), stack);

var shaderModules = new ArrayList<Long>();
var shaderSPIRVs = new ArrayList<ShaderSPIRVUtils.SPIRV>();

for(int i = 0; i < shaderStages.size(); i++) {
    var shader = shaderStages.get(i);
    var entryPoint = stack.UTF8(shader.entryPoint);
    ShaderSPIRVUtils.ShaderKind shaderKind = null;

    if (shader.ShaderType == VK_SHADER_STAGE_COMPUTE_BIT) {
        shaderKind = COMPUTE_SHADER;
    }
    if(shaderKind == null) {
        throw new IllegalArgumentException("Invalid shader type was passed in");
    }
    var shaderSPRIV = compileShaderFile(shader.shaderFile, shaderKind);
    var shaderModule = createShaderModule(shaderSPRIV.bytecode(), device);
    shaderModules.add(shaderModule);
    shaderSPIRVs.add(shaderSPRIV);

    var shaderStageInfo = shaderStagesBuffer.get(i);
    shaderStageInfo.sType(
        VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO);
        
    shaderStageInfo.stage(shader.ShaderType);
    shaderStageInfo.module(shaderModule);
    shaderStageInfo.pName(entryPoint);
}   
             
var pipelineLayoutInfo = VkPipelineLayoutCreateInfo.calloc(stack);
pipelineLayoutInfo.sType(VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO);
pipelineLayoutInfo.pSetLayouts(stack.longs(descriptorSetLayouts));

// ===> PIPELINE LAYOUT CREATION <===
LongBuffer pPipelineLayout = stack.longs(VK_NULL_HANDLE);
vkCheck(vkCreatePipelineLayout(device, pipelineLayoutInfo, null, pPipelineLayout));

// ===> PIPELINE CREATION <===
LongBuffer pPipeline = stack.longs(VK_NULL_HANDLE);
var pipelineInfo = VkComputePipelineCreateInfo.calloc(1, stack);
pipelineInfo.sType(VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO);
pipelineInfo.layout(pPipelineLayout.get(0));
pipelineInfo.flags(0);
pipelineInfo.stage(shaderStagesBuffer.get(0));

vkCheck(vkCreateComputePipelines(device, VK_NULL_HANDLE, pipelineInfo, null, pPipeline));

// RELEASE RESOURCES
shaderModules.forEach(s -> vkDestroyShaderModule(device, s, null));
shaderSPIRVs.forEach(s -> s.free());

return new PipelineLayoutAndPipeline(pPipelineLayout.get(0),
                                     pPipeline.get(0));
...

 

Compute Shader


Now that we have the compute pipeline, let's take a look at how we define the compute shader to perform frustum culling.


As I already mentioned, we use a spherical bound to encapsulate each object, which we tack onto the object buffer.

struct ObjectData {
    mat4 model;
    vec4 bound_radius; 
};

layout(std140, binding = 0) readonly buffer ObjectBuffer {
    ObjectData objects[];
} objectBuffer;

Aside from that, we also take the indirect commands buffer as an input:

// Same layout as VkDrawIndexedIndirectCommand
struct IndexedIndirectCommand
{
    uint indexCount;
    uint instanceCount;
    uint firstIndex;
    uint vertexOffset;
    uint firstInstance;
};

// Binding 1: Multi draw output
layout(std430, binding = 1) writeonly buffer IndirectDraws {
    IndexedIndirectCommand indirectDraws[ ];
};

We also take in as input a UBO which stores the frustum information and total object count, and an output storage buffer to keep track of how many objects were actually drawn:

layout(std140, binding = 2) uniform UBO {
    vec4 frustumPlanes[6];
    uint objectCount;
} ubo;

// Binding 3: Indirect draw stats
layout (binding = 3) buffer UBOOut
{
    uint drawCount;
} uboOut;

The frustum is encapsulated as a set of planes, one for each side of the frustum, so 6 planes in total.
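Since the UBO block is declared std140, the CPU-side buffer has to respect its padding rules: each vec4 in the plane array occupies 16 bytes, so objectCount lands at byte offset 96. A quick sketch of the offset math (the helper names here are hypothetical, not methods from my engine):

```java
public class Std140Layout {
    static final int VEC4_SIZE = 16; // std140: vec4 aligned to 16 bytes

    // Byte offset of ubo.objectCount: it follows six vec4 planes
    public static int objectCountOffset() {
        return 6 * VEC4_SIZE;
    }

    // Total block size, padded up to a 16-byte multiple, which is a
    // safe size to allocate for the uniform buffer
    public static int uboSize() {
        int raw = objectCountOffset() + 4; // uint is 4 bytes
        return (raw + 15) / 16 * 16;
    }
}
```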


The main function of the compute shader looks something like this:

layout (local_size_x = 16) in;

void main() {
    uint idx = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y *
                 gl_NumWorkGroups.x * gl_WorkGroupSize.x;

    // Clear stats on first invocation
    if (idx == 0) {
        atomicExchange(uboOut.drawCount, 0);
    }

    // Guard against the extra invocations launched when the workgroup
    // count is rounded up past the object count
    if (idx >= ubo.objectCount) {
        return;
    }

    vec4 pos = vec4(objectBuffer.objects[idx].model[3].xyz, 1.0);
    float bound_radius = objectBuffer.objects[idx].bound_radius.x;

    // Check if object is within current viewing frustum
    if (isWithinFrustum(pos, bound_radius)) {
        indirectDraws[idx].instanceCount = 1;
        // Increase number of indirect draw counts
        atomicAdd(uboOut.drawCount, 1);
    }
    else {
        indirectDraws[idx].instanceCount = 0;
    }

}

A compute shader instance is invoked for each object in the scene. We then get the center of the bounding sphere directly from the 4x4 transformation matrix of the object, and pass it along with the bound_radius to the isWithinFrustum() function to check whether the object is within the view frustum or not. If it is within the frustum, we set the corresponding indirect command's instanceCount to 1. If not, we set it to zero. As we had already discussed in the multi draw indirect commands blog, toggling the instanceCount between 0 and 1 lets us control whether vertex and pixel shaders are called for that object or not. We also increment the drawCount by 1.
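The effect of the pass on the indirect command buffer can be mimicked on the CPU, which is handy for sanity-checking the logic (a hypothetical sketch, with the per-object visibility results precomputed):

```java
public class CullSimulation {

    // CPU-side mirror of what each compute invocation does to its
    // indirect command: instanceCount becomes 1 if visible, else 0,
    // and the total draw count is accumulated.
    public static int cull(boolean[] visible, int[] instanceCounts) {
        int drawCount = 0;
        for (int i = 0; i < visible.length; i++) {
            instanceCounts[i] = visible[i] ? 1 : 0;
            if (visible[i]) {
                drawCount++;
            }
        }
        return drawCount;
    }
}
```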


(Note: isWithinFrustum() is a custom function that checks whether the bounding sphere is within the frustum or not. I am intentionally not providing the code for it, along with the code to generate the camera frustum, as my current setup has a bug where I need to have the logic inverted for the math to work.)


Now that we know the inputs and outputs for the compute shader, we can finally create the descriptor set and descriptor set layout:

// CREATE COMPUTE SHADER DESCRIPTOR SETS
var result = new DescriptorBuilder(descriptorSetLayoutCache, descriptorAllocator)
        .bindBuffer(0,
                new DescriptorBufferInfo(0, VK_WHOLE_SIZE, objectBuffer.buffer),
                VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, VK_SHADER_STAGE_COMPUTE_BIT)
        .bindBuffer(1,
                new DescriptorBufferInfo(0, VK_WHOLE_SIZE, indirectCommandBuffer.buffer),
                VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, VK_SHADER_STAGE_COMPUTE_BIT)
        .bindBuffer(2,
                new DescriptorBufferInfo(0, VK_WHOLE_SIZE, computeUBOBuffer.buffer),
                VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, VK_SHADER_STAGE_COMPUTE_BIT)
        .bindBuffer(3,
                new DescriptorBufferInfo(0, VK_WHOLE_SIZE, indirectDrawCountBuffer.buffer),
                VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, VK_SHADER_STAGE_COMPUTE_BIT)
        .build();

computeDescriptorSetLayout = result.layout();
computeDescriptorSet = result.descriptorSet();

This again uses my abstraction layer over descriptor set management; I presume you are already familiar with the underlying Vulkan functions used to create descriptor sets. I am also not showing the creation of the individual UBO buffers.


 

Command Buffer


Now that the compute shader and pipeline are ready, we need to record the command buffer.


Similar to how we handle command buffers and command pools for the graphics pipeline, we create a compute command pool and command buffer for every frame-in-flight. The only difference in the code is that, here, the command pools are created from the compute queue family instead of the graphics queue family:

...
VkCommandPoolCreateInfo poolInfo = VkCommandPoolCreateInfo.calloc(stack);
poolInfo.sType(VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO);
poolInfo.queueFamilyIndex(queueFamilyIndices.computeFamily);
poolInfo.flags(VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT);

...
// create command pools and command buffers per frame in flight

Now, we record the command buffers:

VkCommandBufferBeginInfo beginInfo = VkCommandBufferBeginInfo.calloc(stack);
beginInfo.sType(VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO);

vkCheck(vkBeginCommandBuffer(commandBuffer, beginInfo), "Could not begin compute command buffer");

// Acquire barrier
// Add memory barrier to ensure that the indirect commands have been consumed before the compute shader updates them
if(queueFamilyIndices.graphicsFamily != queueFamilyIndices.computeFamily) {

    var bufferBarrier = VkBufferMemoryBarrier.calloc(1, stack);
    bufferBarrier.sType(VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER);
    bufferBarrier.srcAccessMask(0);
    bufferBarrier.dstAccessMask(VK_ACCESS_SHADER_WRITE_BIT);
    bufferBarrier.srcQueueFamilyIndex(queueFamilyIndices.graphicsFamily);
    bufferBarrier.dstQueueFamilyIndex(queueFamilyIndices.computeFamily);
    bufferBarrier.buffer(indirectCommandBuffer.buffer);
    bufferBarrier.offset(0);
    bufferBarrier.size(VK_WHOLE_SIZE);

    vkCmdPipelineBarrier(commandBuffer,
            VK_PIPELINE_STAGE_TRANSFER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            0, null, bufferBarrier, null);
}

vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, computePipelineLayout, 0, stack.longs(computeDescriptorSet), null);

vkCmdDispatch(commandBuffer, (computeUBOIn.objectCount / 16)+1, 1, 1);

// Release barrier
if(queueFamilyIndices.graphicsFamily != queueFamilyIndices.computeFamily) {
    var bufferBarrier = VkBufferMemoryBarrier.calloc(1, stack);
    bufferBarrier.sType(VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER);
    bufferBarrier.srcAccessMask(VK_ACCESS_SHADER_WRITE_BIT);
    bufferBarrier.dstAccessMask(0);
    bufferBarrier.srcQueueFamilyIndex(queueFamilyIndices.computeFamily);
    bufferBarrier.dstQueueFamilyIndex(queueFamilyIndices.graphicsFamily);
    bufferBarrier.buffer(frame.indirectCommandBuffer.buffer);
    bufferBarrier.offset(0);
    bufferBarrier.size(VK_WHOLE_SIZE);

    vkCmdPipelineBarrier(commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_TRANSFER_BIT,
            0, null, bufferBarrier, null);
}

vkEndCommandBuffer(commandBuffer);

Most of the code here should look familiar, except for the vkCmdDispatch and the vkCmdPipelineBarrier.


Basically, vkCmdDispatch takes in the number of workgroups to launch as input. Each workgroup contains a certain number of threads, which we set earlier in the compute shader to 16:

layout (local_size_x = 16) in; 

Thus, the total number of workgroups to launch is objectCount / 16 + 1. We add 1 because integer division truncates; note that this launches one redundant workgroup whenever the count is an exact multiple of 16, which is why the shader should bounds-check its index.
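The same rounding can also be written as a ceiling division, which avoids the redundant extra workgroup when the object count is an exact multiple of the local size (groupCount here is a hypothetical helper, not a method from my engine):

```java
public class DispatchMath {

    // Number of workgroups needed so that every object gets an
    // invocation; equivalent to ceil(objectCount / localSize).
    // The shader should still bounds-check its global index.
    public static int groupCount(int objectCount, int localSize) {
        return (objectCount + localSize - 1) / localSize;
    }
}
```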


The more interesting things here are the buffer barriers. Buffer barriers are necessary for the indirect commands buffer because the compute shader writes data into it, while the graphics pipeline reads it to launch the indirect draws. Since the compute and graphics work can also run on separate queues, ownership of the buffer needs to be transferred from one queue family to the other, which is likewise done via the buffer barriers.


The first vkCmdPipelineBarrier() acquires the buffer for the compute queue from the graphics queue, setting the relevant access masks and pipeline stages. The second vkCmdPipelineBarrier() releases the buffer back to the graphics queue, so that the graphics pipeline is no longer blocked from reading it.


This would indicate that buffer barriers have to be set up while recording the command buffers for the graphics pipeline too, and that's exactly what we do here:

...
vkCmdSetViewport(commandBuffer, 0, viewportBuffer);
vkCmdSetScissor(commandBuffer, 0, scissorBuffer);

// Acquire barrier
// Add memory barrier to ensure that the compute shader has finished updating the indirect commands before they are read
if(queueFamilyIndices.graphicsFamily != queueFamilyIndices.computeFamily) {

    var bufferBarrier = VkBufferMemoryBarrier.calloc(1, stack);
    bufferBarrier.sType(VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER);
    bufferBarrier.srcAccessMask(0);
    bufferBarrier.dstAccessMask(VK_ACCESS_INDIRECT_COMMAND_READ_BIT);
    bufferBarrier.srcQueueFamilyIndex(queueFamilyIndices.computeFamily);
    bufferBarrier.dstQueueFamilyIndex(queueFamilyIndices.graphicsFamily);
    bufferBarrier.buffer(currentFrame.indirectCommandBuffer.buffer);
    bufferBarrier.offset(0);
    bufferBarrier.size(VK_WHOLE_SIZE);

    vkCmdPipelineBarrier(commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0, null, bufferBarrier, null);
}

renderPassInfo.framebuffer(multiViewRenderPass.frameBuffer);

vkCmdBeginRenderPass(commandBuffer, renderPassInfo,
                     VK_SUBPASS_CONTENTS_INLINE);
{
    recordSceneToCommandBufferIndirect(renderables, commandBuffer,
                                     currentFrame, frameIndex, stack);
}
vkCmdEndRenderPass(commandBuffer);

// Release barrier
if(queueFamilyIndices.graphicsFamily != queueFamilyIndices.computeFamily) {

    var bufferBarrier = VkBufferMemoryBarrier.calloc(1, stack);
    bufferBarrier.sType(VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER);
    bufferBarrier.srcAccessMask(VK_ACCESS_INDIRECT_COMMAND_READ_BIT);
    bufferBarrier.dstAccessMask(0);
    bufferBarrier.srcQueueFamilyIndex(queueFamilyIndices.graphicsFamily);
    bufferBarrier.dstQueueFamilyIndex(queueFamilyIndices.computeFamily);
    bufferBarrier.buffer(currentFrame.indirectCommandBuffer.buffer);
    bufferBarrier.offset(0);
    bufferBarrier.size(VK_WHOLE_SIZE);

    vkCmdPipelineBarrier(commandBuffer,
            VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
            0, null, bufferBarrier, null);
}
...

We check whether the graphics and compute queue families are the same before setting up the buffer barriers because, if the same queue family serves both compute and graphics, there is no need to transfer buffer ownership between queue families.

if (queueFamilyIndices.graphicsFamily != queueFamilyIndices.computeFamily)

All that's left now is to run the compute pipeline before we call the graphics pipeline. The code here should be pretty familiar: we submit the compute command buffer to the GPU using vkQueueSubmit().


// Submit compute commands to GPU
VkSubmitInfo submitInfo = VkSubmitInfo.calloc(stack);
submitInfo.sType(VK_STRUCTURE_TYPE_SUBMIT_INFO);
...
submitInfo.pCommandBuffers(stack.pointers(computeCommandBuffer));

vkCheck(vkQueueSubmit(computeQueue, submitInfo, VK_NULL_HANDLE), 
                                "Could not submit to compute queue");

Note: You would have to use fences/semaphores to synchronize between the compute and graphics pipelines.



 

We are done!


And that's it, I have described most of the important steps required to set up GPU-driven frustum culling using compute shaders. Of course, I skipped some details here and there, but again, this is not meant to be an in-depth tutorial.


 

Code


I do not have a code example specifically created to showcase this in isolation, but the concepts shown here can be seen implemented in my Kurama Engine's code base. Try to Ctrl+F for the relevant method calls in the ActiveShutterRenderer.java file.

 

Resources used


I used the relevant sections from vkguide and Sascha Willems' examples as my guides.

