In 3D rendering, the term culling describes the early rejection of work of any kind (objects, draw calls, triangles, and pixels) that does not contribute to the final image. Many techniques exist that reject such work at different stages of the rendering pipeline. Some of these techniques are performed entirely in software on the CPU, some rely on GPU hardware, and some are built directly into the graphics card's pipeline. It is helpful to understand all of these techniques in order to achieve good performance. To keep the processing effort as low as possible, it is best to cull as early and as much as possible. On the other hand, the culling itself should not cost too much performance or memory. To ensure good performance, the system balances itself automatically. This leads to better overall performance, but also makes the system's behavior more difficult to understand.
As a rule, engine culling takes place at the draw call level or across several draw calls. We do not go down to the triangle level or even further, as this is often not efficient. This means it can be useful to split large objects into several parts so that not all parts have to be rendered together.
No static culling techniques.
We deliberately avoid static techniques such as a precomputed PVS (Potentially Visible Set), which was common in early 3D engines. The big advantage of a PVS is its very low runtime cost, but on modern computers this cost matters less. A PVS is usually created in a time-consuming pre-processing step, which is bad for the production of modern game worlds. Gameplay often requires dynamic geometry updates (e.g. opening/closing doors, destroying buildings), and the static nature of a PVS is not well suited to this. By avoiding PVS and similar precomputed techniques we even save some memory, which is very valuable on consoles.
Another popular approach besides PVS is portals. Portals are usually hand-placed, flat, convex 2D shapes. The idea is that when a portal is visible, the world section behind it must be rendered. If that world section is considered visible, its geometry can be tested further, as the portal shape, or the intersection of several portals, allows tighter culling. By separating world sections with portals, designers can create efficient levels. Good portal positions are in narrow areas such as doors, and for performance it is beneficial to create environments where portals cull a lot of content. There are algorithms for placing portals automatically, but the resulting performance is likely to be less optimal if designers are not involved in the optimization process.
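The portal test described above can be sketched in 2D: the camera and a portal segment define a wedge-shaped visible region, and an object's bounding box is tested conservatively against the half-planes bounding that wedge. This is a minimal illustration under my own simplifications (2D, axis-aligned boxes, invented function names), not CryEngine's implementation:

```python
def cross(o, a, b):
    """2D cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def portal_visible(cam, p1, p2, box_min, box_max):
    """Conservatively test an AABB against the wedge seen through a portal
    segment (p1, p2) from camera position cam."""
    corners = [(box_min[0], box_min[1]), (box_max[0], box_min[1]),
               (box_min[0], box_max[1]), (box_max[0], box_max[1])]
    # Three half-planes bound the visible region: the two rays cam->p1 and
    # cam->p2, plus the portal line itself (the object must lie beyond it).
    planes = [
        (cam, p1, cross(cam, p1, p2)),    # left edge, inside = side of p2
        (cam, p2, cross(cam, p2, p1)),    # right edge, inside = side of p1
        (p1,  p2, -cross(p1, p2, cam)),   # portal line, inside = away from cam
    ]
    for o, a, ref in planes:
        # If every corner lies strictly outside one plane, the box is culled.
        if all(cross(o, a, c) * ref < 0 for c in corners):
            return False
    return True  # conservatively visible
```

The same corner-vs-plane pattern extends to 3D frustums clipped by a portal quad; intersecting several portals just adds more clip planes.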
Hand-placed portals avoid time-consuming pre-processing steps, consume little additional memory, and allow some dynamic updates. Portals can be switched on and off, and some engines even implement special effects with portals (e.g. mirrors, teleporters). In CryEngine, portals are used only to improve rendering performance. We decided not to extend the use of portals in order to keep the code and the portal workflow simple and efficient. Portals have their advantages, but they require additional effort from designers, and it is often difficult to find good portal positions. Open environments such as cities or pure nature often do not allow efficient portal usage. Portals are supported by CryEngine, but should only be used where they work better than the coverage buffer.
The portal technique can be extended by the opposite of portals, generally referred to as anti-portals. These objects, often convex in 2D or 3D, can occlude other portals. Imagine a large column in a room with several doors leading to other rooms: this is a difficult case for classic portals, but the typical use case for anti-portals. Anti-portals can be implemented with geometric intersections of objects, but this method has problems with the fusion of multiple anti-portals, and efficiency suffers. Instead of geometric anti-portals we have the coverage buffer, which serves the same purpose but has better properties.
GPU Z Test.
In modern graphics cards, the Z buffer is used to solve the hidden surface problem. A simple explanation: for each pixel on the screen, a so-called Z or depth value is stored, which represents the distance from the camera to the nearest geometry at this pixel position. All renderable objects must consist of triangles. All pixels covered by a triangle perform a Z comparison (stored Z buffer value vs. Z value of the triangle), and depending on the result the triangle's pixel is discarded or not. This elegantly solves the hidden surface problem, even for intersecting objects. The problem of occluder fusion (multiple occluders jointly hiding an object) is solved without further effort. The Z test happens quite late in the rendering pipeline, which means that many engine and setup costs (e.g. skinning, shader constants) have already been paid by then.
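The per-pixel comparison can be sketched in a few lines, assuming a flat array of float depths initialized to infinity and the convention that a smaller Z means closer to the camera (this is an illustration, not hardware code):

```python
def z_test_write(depth_buffer, width, x, y, z):
    """Classic Z test: keep the fragment only if it is closer than what is
    already stored at this pixel, then update the stored depth."""
    i = y * width + x
    if z < depth_buffer[i]:       # smaller z = closer to the camera
        depth_buffer[i] = z
        return True               # fragment survives and would be shaded
    return False                  # fragment is hidden, discard it
```

Rasterizing objects front to back makes later, hidden fragments fail this test immediately, which is exactly the culling side effect described above.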
In some cases it allows pixel shader execution or frame buffer blending to be skipped, but its main purpose is to solve the hidden surface problem; culling is a side effect. Roughly sorting objects from front to back can improve culling performance. The early Z pass technique (sometimes referred to as Z pre-pass) makes this sorting less essential, as the first pass is explicitly tuned for fast per-pixel performance. Some hardware even runs at double speed when color writes are disabled. Unfortunately, we have to output data in that pass to set up the G-buffer for deferred lighting.
The Z buffer precision is influenced by the pixel format (e.g. 24 bit), the Z buffer range, and, in a very extreme (non-linear) way, by the Z near value. The Z near value defines how close an object can get to the viewer before it is clipped away. By halving Z near (e.g. from 10cm to 5cm), you effectively halve the precision of the Z buffer. This has little effect on most object rendering, but decals render correctly only because their Z value is slightly smaller than that of the surface beneath them, so reduced precision can make them flicker. It is a good idea not to change Z near at runtime.
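The effect of Z near can be made concrete by differentiating a standard D3D-style non-linear depth mapping and converting one quantization step of a 24-bit buffer into world units. The function below is my own illustrative sketch of that calculation, not engine code:

```python
def depth_resolution(z, z_near, z_far, bits=24):
    """Smallest world-space depth difference a `bits`-deep Z buffer can still
    distinguish at view distance z, for a D3D-style projection mapping
    z_near -> 0 and z_far -> 1 via d = z_far * (z - z_near) / (z * (z_far - z_near))."""
    # derivative of the stored depth d with respect to world-space z
    dd_dz = (z_far * z_near) / (z * z * (z_far - z_near))
    quantum = 1.0 / (2 ** bits - 1)   # one quantization step of the Z buffer
    return quantum / dd_dz            # world units per Z buffer step

# Halving z_near roughly doubles the world-space error at a given distance:
fine = depth_resolution(100.0, 0.10, 1000.0)    # z_near = 10cm
coarse = depth_resolution(100.0, 0.05, 1000.0)  # z_near = 5cm
```

Since the resolvable step grows with z squared but only linearly with 1/z_near, pushing Z near closer degrades precision everywhere in the scene at once.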
GPU Z Cull / HiZ.
Efficient Z buffer implementations in the GPU cull fragments (pixels or multi-sampled subsamples) at an earlier pipeline stage, working on coarser blocks. This helps reduce pixel shader execution. Many conditions must be met to keep this optimization active, and seemingly harmless renderer changes can easily break it. The rules are complicated and depend on the graphics card.
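The coarse-block idea can be illustrated as follows: a low-resolution buffer stores the farthest (maximum) depth per tile, and a whole tile of fragments can be rejected at once if even its nearest fragment lies behind that stored value. This is a hypothetical sketch of the principle; actual HiZ behavior is hardware-specific and largely undocumented:

```python
def build_hiz(depth, width, height, block=2):
    """Build one coarse HiZ level: each cell stores the MAX depth of a
    block x block tile, i.e. the farthest geometry in that tile."""
    bw, bh = width // block, height // block
    hiz = []
    for by in range(bh):
        for bx in range(bw):
            tile = [depth[(by * block + dy) * width + (bx * block + dx)]
                    for dy in range(block) for dx in range(block)]
            hiz.append(max(tile))
    return hiz, bw

def hiz_reject(hiz, bw, bx, by, fragment_min_z):
    """Reject a whole tile of fragments early: if even the nearest incoming
    fragment is behind the farthest stored depth, nothing can pass."""
    return fragment_min_z > hiz[by * bw + bx]
```

Tiles that survive this coarse test still run the exact per-pixel Z test, so the optimization is purely an early-out and never changes the image.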
GPU occlusion queries.
The occlusion query function of modern graphics cards allows the CPU to retrieve information about previously performed Z buffer tests. This function can be used to implement more advanced culling techniques. After rendering some occluders (preferably front to back, large objects first), other objects (occludees) can be tested for visibility. The graphics hardware makes it possible to test many objects efficiently, but there is a big problem: since the entire rendering is heavily buffered, the information about whether an object is visible arrives with a long delay (up to several frames). This is unacceptable because it means either severe stalls while waiting for query results, a bad frame rate in general, or objects that are invisible for a while where they shouldn't be.
For some hardware/drivers this latency problem is less severe than for others, but a delay of about one frame is roughly the best one can get. This also means that we cannot perform efficient hierarchical tests, e.g. first testing whether an enclosing box is visible and then running fine-grained tests on its subdivisions. The occlusion test functionality is implemented in the engine and is currently used for ocean rendering. We even use the number of visible pixels to scale the update frequency of the reflection. Unfortunately, the latency can still cause the ocean to be invisible for one or two frames after fast camera changes. This happened in Crysis, for example, when the player left the submarine.
Software Coverage Buffer (cbuffer).
The Z buffer performance depends on the number of triangles, the number of vertices, and the number of covered pixels. All of this is very fast on graphics hardware and would be very slow on the CPU. However, the CPU does not have the latency problem of occlusion queries, and modern CPUs keep getting faster. So we made a software implementation on the CPU called the "coverage buffer". To achieve good performance, we use simplified occluders and occludees: artists can add a few occluder triangles to objects that occlude well, and we test the occludee's bounding box for occlusion. Animated objects are not considered. We also use a lower resolution and hand-optimized triangle rasterization code. The result is a rather aggressively optimized set of objects that need to be rendered. It is possible that an object is culled even though it should still be visible, but this is very rare and often caused by a bad asset (e.g. an occluder polygon that is slightly larger than the object). We decided to prefer performance, efficiency, and simplicity of code over correctness.
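The core idea can be sketched as follows. For brevity, occluders are splatted as screen-space rectangles with a conservative (farthest) depth instead of rasterized triangles, and occludees are tested with their screen-space bounding rectangle and nearest depth; all names are illustrative, not the engine's code:

```python
import math

class CoverageBuffer:
    """Minimal software coverage buffer sketch at a low resolution."""
    def __init__(self, width, height):
        self.w, self.h = width, height
        self.depth = [math.inf] * (width * height)  # inf = nothing drawn

    def rasterize_occluder(self, x0, y0, x1, y1, max_z):
        # Store the occluder's FARTHEST depth so the test stays conservative.
        for y in range(max(0, y0), min(self.h, y1)):
            for x in range(max(0, x0), min(self.w, x1)):
                i = y * self.w + x
                self.depth[i] = min(self.depth[i], max_z)

    def is_visible(self, x0, y0, x1, y1, min_z):
        # The occludee is hidden only if EVERY covered cell is closer than
        # its nearest point; one uncovered or farther cell keeps it visible.
        for y in range(max(0, y0), min(self.h, y1)):
            for x in range(max(0, x0), min(self.w, x1)):
                if min_z <= self.depth[y * self.w + x]:
                    return True
        return False
```

Using the occluder's farthest depth and the occludee's nearest depth biases every comparison toward "visible", which is why errors in practice come from bad assets rather than from the algorithm itself.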
Coverage buffer with Z buffer readback.
On some hardware (PlayStation 3, Xbox 360) we can efficiently copy the Z buffer into main memory and run coverage buffer tests against it. This still has the same latency problem, but it integrates well into the software implementation of the coverage buffer and is efficient when used for many objects. Since this method introduces a frame of delay, fast rotations, for example, can be a problem.
Backface culling.
Normally, backface culling is a piece of cake for graphics programmers. Depending on the triangle orientation (clockwise or counterclockwise with respect to the viewer), the hardware can discard back-facing triangles without rasterizing them, and we get some speedup. Only for some alpha-blended objects or special effects do we need to disable backface culling. On the PS3, this topic needs to be reconsidered. The GPU performance for processing or fetching vertices can be a bottleneck, and the good connection between the SPUs and the GPU makes it possible to generate data on demand. The effort of SPU transformation and triangle testing can pay off. An efficient SPU implementation could combine frustum culling of small triangle batches, backface culling, mesh skinning, and even lighting. However, this is not an easy task. Besides maintaining this PS3-specific code path, the mesh data must be available in CPU memory. At the time of writing, we have not done this optimization yet because memory is scarce. We might reconsider once we have efficient streaming of mesh data into CPU memory (code that consumes this data may have to deal with frame latency).
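The test the hardware performs reduces to the sign of the triangle's signed area in screen space, which encodes its winding order. A minimal sketch, assuming counter-clockwise triangles are front-facing (the convention is a per-API/engine choice):

```python
def is_backfacing(v0, v1, v2):
    """Screen-space backface test via the triangle's signed area.
    Positive area = counter-clockwise = front-facing here."""
    signed_area = ((v1[0] - v0[0]) * (v2[1] - v0[1])
                 - (v1[1] - v0[1]) * (v2[0] - v0[0]))
    return signed_area <= 0  # degenerate (zero-area) triangles are culled too
```

A software implementation on the SPUs would run exactly this kind of test per triangle before building the index buffer that is handed to the GPU.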
Conditional rendering.
The occlusion query function could be used for another culling technique. Many draw calls must be rendered in multiple passes, and later passes could be skipped if the results of the earlier passes had been fetched. This requires many occlusion queries and some bookkeeping effort. On most hardware this would not pay off, but our PS3 renderer implementation can access data structures at a very low level, so the overhead is lower.
Heightmap ray casting.
Heightmaps enable efficient ray cast tests. With these, objects hidden behind the terrain can be culled. This culling technique has been available since CryEngine 1, but now that we have many other methods and store the heightmap data in compressed form, it has become less attractive. This is all the more true given how the hardware has changed since then: over time, computing power has grown faster than memory performance. This culling technique may be replaced by others over time.
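The ray cast against a heightfield can be sketched by marching from the eye toward the tested point and checking whether the terrain rises above the ray anywhere along the way. This is a simplified illustration (uniform stepping, heightmap given as a function; a real engine samples a compressed 2D grid with interpolation):

```python
def terrain_occludes(heightmap, eye, target, steps=64):
    """March a ray from eye to target and report whether the terrain
    heightfield blocks the line of sight. heightmap is (x, y) -> height."""
    ex, ey, ez = eye
    tx, ty, tz = target
    for i in range(1, steps):
        t = i / steps
        x = ex + (tx - ex) * t
        y = ey + (ty - ey) * t
        z = ez + (tz - ez) * t
        if heightmap(x, y) > z:
            return True   # ray dips below the terrain: target is hidden
    return False          # clear line of sight
```

Testing the highest corner of an object's bounding box this way gives a conservative occlusion result for the whole object.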