
Point Cloud Rendering using the Mesh Shader pipeline - Algorithm overview

Updated: Dec 9, 2022


Note: This post only goes over the general idea for the algorithm and some implementation details. I haven't yet implemented this in code, so keep that in mind.




Introduction


The idea for this algorithm arose from a research project I am currently working on with Dr. Aurisano, which involves rendering a huge Lidar point cloud dataset collected from Barro Colorado Island in Panama. Movement data from the kinkajou, a nocturnal, tree-dwelling animal, was also collected, and it needs to be overlaid on the point cloud environment.


 

Point Cloud Rendering - Overview


From the full point cloud model, we first generate a hierarchical LOD (Level of Detail) tree structure. This tree is then traversed each frame to get a view-dependent cut, so that only nodes that are within the camera's frustum and are necessary for the required level of detail are rendered. This structure was first described in 2004 in [1], which used a binary tree, but more recent papers have modified it to use other tree structures, such as an octree [2].


There are programs such as Potree Converter that can be used to generate the hierarchical LOD tree, so we are not going to worry too much about generating the tree structure right now.


We then use a task-mesh shader rendering pipeline to traverse the hierarchy tree and render the point cloud, all in parallel on the GPU and in a single draw call.


 


Hierarchical LOD tree structure


This section describes how the LOD structure works. I will mainly explain it using a binary tree, but the same concept extends to other types of trees too.





Unlike the LOD cluster structures generally used for meshes, where child nodes can be rendered on their own to portray a complete, simplified version of the parent node, the hierarchy structure used here is additive, meaning that all nodes from the root down to the desired child node have to be rendered together to produce the required LOD.


Taking a step back, when generating the tree structure, we must first set a hyperparameter M that controls how many samples each node contains. For example, if M = 100, then the root node samples M points from the full point cloud (different sampling methods could be used). Ideally this would be a very rough approximation of the entire model, as can be seen in the box at the top of the above figure. The remaining points are then spatially divided between the child nodes, and each child node again samples M points. This is repeated until the number of points left in a node is less than M, which means we have reached a leaf node.

A child node can thus be thought of as refining its parent by adding an extra M points to the parent's point cloud.


During each frame, a view-dependent cut of the tree is generated, which is then rendered. View-dependent simply means that the primary camera is taken into consideration, and entire subtrees that are not currently within the camera's frustum are frustum culled. This is highly efficient because, once we find that a node has been frustum culled, the entire subtree starting from that node can be safely culled without any additional checks.
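Just to make the culling step concrete, here is a rough sketch (in GLSL-style pseudocode, like the shader code later in this post) of what the frustum check could look like, assuming each node stores a simple bounding sphere (a center and a radius) and that the six world-space frustum planes are uploaded by the CPU each frame. The buffer binding and all the names here are placeholders of my own, not anything final.

layout(std140, binding = 4) uniform FrustumData {
    vec4 frustumPlanes[6]; // xyz = plane normal, w = plane distance
};

bool isWithinFrustum(vec3 center, float radius) {
    for (int i = 0; i < 6; i++) {
        // Signed distance from the bounding sphere's center to the plane
        float dist = dot(frustumPlanes[i].xyz, center) + frustumPlanes[i].w;
        if (dist < -radius) {
            return false; // Fully behind one plane, so the node (and its whole subtree) can be culled
        }
    }
    return true; // Inside or intersecting the frustum
}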


We figure out where to "cut" the tree by making use of a "quality" metric. Different metrics could be tested, but one candidate is the average sample distance 'r' at that node, which can be computed for each node during tree generation. Then, during the tree traversal, once a node has passed the frustum cull check, r is projected to screen space and compared against a threshold hyperparameter to check whether the required quality has been reached. If the quality is not sufficient at the current node, the child nodes are processed next during the traversal. If the quality is sufficient, the tree is "cut" at the current node.
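And here is a similarly rough sketch of how that quality check could work, assuming a perspective projection and a bounding sphere center per node. 'proj', 'cameraPos', 'screenHeight' and 'lodThresholdPixels' are hypothetical uniforms I made up for illustration; the threshold is the hyperparameter mentioned above.

layout(std140, binding = 5) uniform LODParams {
    mat4 proj;                // camera projection matrix
    vec3 cameraPos;           // camera position in world space
    float screenHeight;       // viewport height in pixels
    float lodThresholdPixels; // quality threshold hyperparameter
};

bool isDetailSufficient(vec3 nodeCenter, float averageSampleDist) {
    // Distance from the camera to the node
    float viewDist = max(length(nodeCenter - cameraPos), 0.0001);

    // Approximate on-screen size (in pixels) of the average sample distance 'r'
    // at that distance from the camera
    float projectedR = (averageSampleDist * proj[1][1] / viewDist) * (0.5 * screenHeight);

    // If the projected spacing between samples is already below the threshold,
    // the node is dense enough and the tree can be "cut" here
    return projectedR <= lodThresholdPixels;
}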


One thing to note is that since I am planning on using the task-mesh shader pipeline, M cannot be too large, since mesh shaders have a hard limit on the total number of vertices they can output. In fact, NVIDIA recommends an output of at most 64 vertices [3].


 

Point Cloud Rendering Algorithm


The main "feature" of this rendering algorithm is that it makes use of the mesh shader pipeline. I am sure some AAA engines already use a mesh shader pipeline for point cloud rendering, but I was not able to find any research papers or technical documentation describing an implementation similar to the one detailed below.





The main advantage of using a task/mesh shader pipeline is that it lets us replace both the compute pipeline and the traditional geometry pipeline with just the task/mesh pipeline, removing additional overhead as well as the need for intermediate output buffers. (It is not shown in the above picture, but the traditional pipeline does not have feature parity with the task/mesh pipeline unless you run a compute pipeline before it.)


The basic idea of the algorithm is that instead of finding and rendering the perfect view-dependent cut each frame, the set of nodes to be rendered is updated progressively, with updates flowing down one LOD level per frame. The advantage of this approach is that sudden camera movements won't cause the system to stall on all the additional processing, but it obviously also means there is a delay between the camera movement and when everything finishes updating. The hope is that the camera normally won't be moving too fast, so the delay would be unnoticeable.


Algorithm walkthrough using an example

Please ignore my terrible sketching abilities, but using this binary tree LOD structure as an example, I will now try to explain how the algorithm would work.


Frame 1:

During the first frame, only the root node is processed. The task shader performs frustum culling (checking whether the node is within the camera's view). If it is, it then checks whether the current node provides enough detail using the LOD metric. In our case, the root node does not provide enough detail, so we add its children to the list of nodes to be rendered next frame. The task shader then invokes a mesh shader instance for the current node, which renders it.


Frame 2:

Now we have the root node and its two children in the list of nodes to be rendered. A task shader is invoked for each of these nodes, and the above process is repeated. The only difference is that the child nodes first check whether their parent is actually being rendered before doing any other work, since if the parent has been culled, the current node obviously does not need to be rendered either. The grandchild nodes are now also added to the render list, and mesh shaders are invoked for each of the nodes.


Frame 3:

During this frame, all the nodes are being considered for rendering, and the same process as described above happens.


Scenario for when the current node provides sufficient LOD:

Let's say that instead of the above example, we have a more complex model, and the camera is at a distance where not all nodes have to be rendered. During the LOD sufficiency test, if we find that the current node provides enough detail and its children therefore don't need to be rendered, we simply remove the child nodes from the render_nodes list. But remember that this list is for the next frame, so the children would still be rendered during the current frame.



Implementation Details


When the rendering first starts, only the root node would be present in the list of nodes that would be considered for rendering. Task shaders would then be invoked for these nodes.


(Note: Admittedly, I am not yet very clear on the workgroup structure, such as whether I should have one task shader invocation per node being rendered, and whether each task shader should be responsible for launching only one mesh shader workgroup. I am sure these questions will be answered as I start implementing the code, so the details here are likely to change from what I discuss below.)
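To give an idea of what I currently have in mind, here is one possible mapping using the NV_mesh_shader extension: one single-invocation task workgroup per entry in the current render list, each passing the node index along in the task payload and launching exactly one mesh workgroup. This is just a sketch of one option; the buffer name and binding are placeholders.

#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 1) in;

taskNV out Task {
    uint nodeId;
} OUT;

layout(std430, binding = 3) readonly buffer CurrNodesToRender {
    uint currNodesToRender[];
};

void main() {
    // gl_WorkGroupID.x indexes into the list of nodes queued for this frame
    OUT.nodeId = currNodesToRender[gl_WorkGroupID.x];

    // ... the culling / LOD logic from the pseudocode below would go here,
    // setting gl_TaskCountNV = 0 when the node should not be rendered ...

    gl_TaskCountNV = 1; // Launch one mesh workgroup for this node
}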


We have a separate nodes list that keeps track of per-node information. Technically we would have three of these: nodes_static, nodes_read and nodes_write, for storing static node information, for reading dynamic data, and for writing dynamic data from the GPU, respectively. We need three lists because the dynamic data updates happen while everything runs in parallel on the GPU: the static data is kept separate to avoid duplicating it, and the read and write lists handle the dynamic updates. At the end of each frame, the read and write buffers are swapped. These buffers would live in GPU-only memory.


(We would also need two render_nodes lists that are swapped and cleared each frame to account for the parallel GPU updates, or maybe a different structure that handles updates to the render_nodes list more efficiently.)
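As a purely speculative example of how the "write" copy of the render_nodes list could be filled from many task shader invocations at once, an atomic counter could be used to append node indices; something like this could back the addChildNodesToNextFrameList() placeholder used in the pseudocode below. Duplicate insertions (a parent appending children that are already in the list) would still need to be guarded against, for example with a per-node flag in nodes_read. All names and bindings here are placeholders.

layout(std430, binding = 6) buffer NextFrameRenderNodes {
    uint count;      // reset to 0 by the CPU (or a small compute pass) each frame
    uint nodeIds[];
} next_render_nodes;

void addChildNodeToNextFrameList(int childIndex) {
    // Atomically reserve a slot in next frame's render list and write the node index
    uint slot = atomicAdd(next_render_nodes.count, 1u);
    next_render_nodes.nodeIds[slot] = uint(childIndex);
}

The plural version used in the task shader pseudocode would simply loop over the node's child indices and call this once per child.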



The Node information structures might look something like the pseudocode shown here:

// Node information that will not be updated
public class NodeStatic {
    BoundingInfo bound; // Placeholder for whatever bounding method
                        // you want to use. Bounding spheres are probably
                        // the simplest
    float averageSampleDist; // LOD quality metric
    int childIndices[];
    int parentIndex;
    
    // A shared vertex buffer is used for all nodes, and the vertices of
    // the same node are arranged in sequence in this buffer. Thus, just
    // providing the start and end indices is enough to extract the node
    // geometry
    
    int startIndex;
    int endIndex;
} 

// Node information that might be updated in the GPU
public class NodeDynamic {
    boolean isCulled;
    boolean isLODSufficient;
}

and here is the Task shader pseudocode:

struct NodeStatic {...} // As shown above
struct NodeDynamic {...} // As shown above

layout(std...binding...) readonly buffer NodesBufferStatic {
    NodeStatic nodes[];
}nodes_static;

layout(std...binding...) readonly buffer NodesBufferRead {
    NodeDynamic nodes[];
}nodes_read;

layout(std...binding...) writeonly buffer NodesBufferWrite {
    NodeDynamic nodes[];
}nodes_write;

... 
// Other buffers, variables and functions are loaded and defined
...

void main() {
    int id = gl_....; // Get task shader id;
    int nodeId = currNodesToRender[id]; // Get index of node to render

    NodeStatic curNodeStatic = nodes_static.nodes[nodeId];
    NodeDynamic parent = nodes_read.nodes[curNodeStatic.parentIndex]; // Root node has no parent and would need a special case
    
    
    if (parent.isCulled || parent.isLODSufficient) {
        // Don't render current node since parent is either culled or 
        // already provides enough detail
        
        removeNodeFromNextFrameList(nodeId);
        nodes_write.nodes[nodeId].isCulled = true;
        return;
    }
    
    if (!isWithinFrustum(curNodeStatic)) {
        // Do the same if the current node is not within the camera
        // frustum
        removeNodeFromNextFrameList(nodeId);
        nodes_write.nodes[nodeId].isCulled = true;
        return;
    }
    else {
        nodes_write.nodes[nodeId].isCulled = false;
    }
    
    if (!isDetailSufficient(curNodeStatic)) {
        addChildNodesToNextFrameList(curNodeStatic.childIndices);
        nodes_write.nodes[nodeId].isLODSufficient = false;
    }
    else {
        nodes_write.nodes[nodeId].isLODSufficient = true;
    }
    
    invokeMeshShader(nodeId);
}

I am not providing full pseudocode for the Mesh shader since it should be fairly straightforward: it would fetch the node's vertices from the shared vertex buffer using the start and end indices stored in nodes_static, transform them, and output them as point primitives, which are then rasterized and shaded by the fragment shader.
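That said, here is a very rough sketch of what it could look like with the NV_mesh_shader extension, assuming the points are output directly as point primitives, the shared vertex buffer holds world-space positions, and M (and therefore each node's point count) stays within the 64 vertices recommended earlier. The bindings and uniform names are again placeholders of my own.

#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;
layout(points) out;
layout(max_vertices = 64, max_primitives = 64) out;

taskNV in Task {
    uint nodeId;
} IN;

struct NodeStatic {...} // As shown above

layout(std430, binding = 0) readonly buffer NodesBufferStatic {
    NodeStatic nodes[];
} nodes_static;

layout(std430, binding = 1) readonly buffer VertexBuffer {
    vec4 positions[]; // Shared vertex buffer for all nodes
};

layout(std140, binding = 2) uniform Camera {
    mat4 viewProj;
};

void main() {
    NodeStatic node = nodes_static.nodes[IN.nodeId];
    uint pointCount = uint(node.endIndex - node.startIndex);

    // Each invocation emits a strided subset of the node's points
    for (uint i = gl_LocalInvocationID.x; i < pointCount; i += gl_WorkGroupSize.x) {
        gl_MeshVerticesNV[i].gl_Position = viewProj * positions[node.startIndex + i];
        gl_MeshVerticesNV[i].gl_PointSize = 1.0;
        gl_PrimitiveIndicesNV[i] = i; // One point primitive per vertex
    }

    if (gl_LocalInvocationID.x == 0) {
        gl_PrimitiveCountNV = pointCount;
    }
}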



 

Well, that's the gist of it for now. I will soon start implementing this in my Kurama Engine, and I am sure I will discover that a lot of things have to change here and there. I will probably make a follow-up blog post once I have a working demo.



 


References


[1] E. Gobbetti and F. Marton, “Layered point clouds: a simple and efficient multiresolution structure for distributing and rendering gigantic point-sampled models,” Computers & Graphics, vol. 28, no. 6, pp. 815–826, Dec. 2004, doi: 10.1016/j.cag.2004.08.010.


[2] M. Schütz, S. Ohrhallinger, and M. Wimmer, “Fast Out-of-Core Octree Generation for Massive Point Clouds,” Computer Graphics Forum, vol. 39, no. 7, pp. 155–167, 2020, doi: 10.1111/cgf.14134.


[3] NVIDIA Developer Blog, "Introduction to Turing Mesh Shaders." https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/





