Trying to understand ‘Add special game perf workaround for Starfield and other DGC junkies’


This entry is the first in a series where I take a specific example of something I don’t understand and research it until I understand what’s going on. I’ve found taking a vertical slice of an existing problem to be a great way of exposing myself to new topics and understanding while limiting the work or surface area needed. I hope by doing so publicly, it also reminds others that it’s okay to not understand something initially. Confusion is the sweat of learning.

Introduction

The talk of the town recently has been the release of Bethesda’s video game Starfield. Some users are complaining that the game is unoptimized for PC, to which the studio has responded that it has optimized for PC and users may need a better computer. Naturally, this comment went over well with the community (narrator: it did not).

It’s unclear where on the spectrum of

[mild performance drops for a few users] to [massively seen horrible performance]

that Starfield actually lies, because the vocal criticisms may not reflect the proportion of users that are experiencing performance issues. What is clear, however, is that a number of patches to systems related to low-level graphics software have been released in the past few days to address some of the diagnosed issues. One that has gained mild online traction is a GitHub pull request opened against the vkd3d-proton project titled ‘Add special game perf workaround for Starfield and other DGC junkies’. I looked at the PR description and was mildly intimidated by the verbiage. To be honest, I know little about GPU hardware. So I figured I’d roll up my sleeves and try to understand what was going on.

You can check out the PR yourself before continuing: https://github.com/HansKristian-Work/vkd3d-proton/pull/1694

Notes before we begin

Hello there, Nathan-from-the-future here to clarify a couple things. First, it’s worth mentioning that much of the GPU terminology is steeped in its computer graphics origins. As a result, some terms sound graphics-specific in a misleading way, because they have since been updated to carry a more general computational meaning. A good example of this is compute shader, which used to mean “calculate shading/textures for computer graphics” and now has the general meaning of “program that runs on the GPU.” Another example is GPU itself, which stands for Graphics Processing Unit, but could more accurately be described now as a linear-algebra-go-brrr-machine (or GPGPU, for general-purpose GPU, but it’s less catchy IMO).

Second, the author notes that the Pull Request results in minimal performance improvements (about 1%). This post should be understood less as a substantive guide to improving performance and more as a real-world-motivated entry point into exploring GPU computing. As always, there is no easy way to determine what your performance bottlenecks actually are, so you’ll need to profile your own application.

Understanding the vkd3d-proton project

It seems relevant to first understand the project that this PR is being raised against.

From looking over the README,

vkd3d-proton is a fork of VKD3D, which aims to implement the full Direct3D 12 API on top of Vulkan. The project serves as the development effort for Direct3D 12 support in Proton.

Great, we just replaced 1 term I don’t understand with 3 terms I don’t understand. We can read READMEs recursively until we get back to simple English.

+--------------+
| vkd3d-proton | _ (fork)
+--------------+  \ 
                 +-------+   +-----------------+
                 | VKD3D | = | Direct3D 12 API | (Microsoft graphics API) 
                 +-------+   +-----------------+
                             |     Vulkan      | (Khronos group's graphics API)
                             +-----------------+

Okay, this kinda makes sense. Basically, the project is a translation layer that implements Microsoft’s Direct3D 12 graphics API on top of Vulkan, used by Proton to run Windows games on Linux for Steam (probably for their Steam Deck, maybe other stuff as well). So we at least know we’re in Computer Graphics and GPU-land.

Moving on to understanding ExecuteIndirect

The heart of the PR focuses on the inefficient use of the ExecuteIndirect API call by Starfield and other games. (The ‘DGC’ in the PR title presumably stands for device-generated commands, i.e., letting the GPU generate its own work, which is exactly the territory ExecuteIndirect lives in.)

What is ExecuteIndirect?

This term took a little while to dig up. I started with googling the specific API call, but that led to esoteric DirectX documentation. It did little to explain what the Indirect in ExecuteIndirect actually means. I decided to switch the direction of inquiry and start from general GPU principles, following Apple’s Metal documentation to understand how CPUs and GPUs interact.

Understanding CPU/GPU communication with an example

Let’s say the CPU has 3 tasks it wants the GPU to perform, [A, B, C], where the inputs to C depend on the output of B. For example, we can say B calculates which triangles are visible and C draws the visible triangles.

Naive solution

We could define a first pass at CPU-GPU communication like so:

       CPU                   GPU
+--------------+     
| Issue Call A |     
+--------------+ --->
                     +----------------+
                     | Perform Call A |
                <--- +----------------+
+--------------+
| Issue Call B |
+--------------+ ---> 
                     +----------------+
                     | Perform Call B |
                <--- +----------------+
+--------------+
| Issue Call C |
+--------------+ ---> 
                     +----------------+
                     | Perform Call C |
                     +----------------+
                <---

The CPU sends a command to the GPU; the GPU performs the command, returns the output to the CPU, and asks for more work.

However, this approach leaves the CPU and GPU idle a lot of the time. The CPU has to wait for the GPU, the GPU has to wait for the CPU, etc. In addition, every time data moves between the CPU and GPU, it has to travel over the bus (not discussed here), which can become a bottleneck. Preferably, we’d like to remove unnecessary dependencies between the CPU and GPU.

Improving with a command buffer

Since A and B have no dependency on each other, we can write them to a command buffer on the GPU. A command buffer is a memory buffer, created at GPU program startup, that both the CPU and GPU can access. The buffer acts as a queue between the CPU and GPU: the CPU queues up commands and the GPU executes them as it becomes free. By de-coupling the timing of the CPU and GPU, we can run them in parallel.

       CPU         |   (GPU command buffer)          GPU            
+--------------+                                            
| Issue Call A |   |   gpu      cpu                                      
+--------------+  --->  V        V                                       
+--------------+   |    +--------------+        +----------------+   
| Issue Call B |  --->  | Call A | ... |  --->  | Perform Call A |   
+--------------+  <---  +--------------+  <---  +----------------+   
                   |      gpu      cpu                                 
                        +--V--------V--+        +----------------+   
                   |    |..| Call B |..|  --->  | Perform Call B |   
                  <---  +--------------+  <---  +----------------+   
+--------------+   |                           
| Issue Call C |                              
+--------------+  --->    gpu      cpu                            
                   |    +--V--------V--+        +----------------+   
                        |..| Call C |..|  --->  | Perform Call C |   
                  <---  +--------------+  <---  +----------------+   
                   | 
                       
                   |
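
In real APIs, this “queue it up and submit it” pattern is what command list recording looks like. Here’s a rough D3D12-flavored sketch (D3D12 calls its command buffer a “command list”); cmd_list and queue are assumed to have been created during startup, and the draw parameters are placeholders:

#include <d3d12.h>

// The CPU only *records* here; nothing executes on the GPU yet.
void record_and_submit(ID3D12GraphicsCommandList *cmd_list,
                       ID3D12CommandQueue *queue)
{
    cmd_list->DrawInstanced(3, 1, 0, 0);  // "Call A" (3 vertices, 1 instance)
    cmd_list->DrawInstanced(3, 1, 0, 0);  // "Call B"
    cmd_list->Close();

    // One submission hands the whole batch to the GPU, which drains it
    // whenever it is free; the CPU returns immediately and keeps working.
    ID3D12CommandList *lists[] = { cmd_list };
    queue->ExecuteCommandLists(1, lists);
}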

That’s better, but we still need to do a round-trip between B and C, because the input to C depends on the output of B. There’s no way to get around returning to the CPU in-between commands…or is there??

Finishing with data buffers and indirect execution

Okay, if we don’t want to return to the CPU between B and C, we need some way for the GPU to pass the output of one command into the input of another.

Idea: We can re-use the buffer concept above, but modify it so that only the GPU can access it (no CPU allowed). We will also need to update Call B to tell the GPU to store its output in a specific GPU data buffer, and update Call C to read from that same data buffer.

Since we don’t explicitly pass in the parameters from the CPU, but rather determine them dynamically on the GPU, we call this indirect execution (hence the API name ExecuteIndirect).

(Hint: you can scroll the diagram left/right)

       CPU         |   (GPU command buffer)            GPU                  (GPU data buffer)   
+--------------+                                                                          
| Issue Call A |   |   gpu      cpu                                                            
+--------------+  --->  V        V                                                             
+--------------+   |    +--------------+        +----------------+                           
| Issue Call B |  --->  | Call A | ... |  --->  | Perform Call A |                       
+--------------+  <---  +--------------+  <---  +----------------+                           
                   |      gpu      cpu                                  Call B                  
+--------------+        +--V--------V--+        +----------------+        V-----------------+
| Issue Call C |  --->  |..| Call B |..|  --->  | Perform Call B |  --->  |    ...   | ...      
+--------------+  <---  +--------------+  <---  +----------------+        +-----------------+   
                   |      gpu      cpu                                  Call C     Call B        
                        +--V--------V--+        +----------------+        V----------V------+
                   |    |..| Call C |..|  --->  | Perform Call C |  <---  | Output B | ...
                        +--------------+  <---  +----------------+        +-----------------+   
                   |

Now we’re basically running the CPU and GPU in parallel!

That’s the gist of indirect execution. The CPU and GPU set up a series of buffers to store 1) commands and 2) data. The CPU sends ExecuteIndirect commands which can run on the output of other GPU commands, in sort of a completion-handler-y way I suppose.

Since it’s important to the next part of the post, I’ve included an example data buffer between Call B and Call C:

+----------------------------------------------------------+
|VertexData(0,0,0),VertexData(0,0,0),...,TrianglesToDraw(0)|
+----------------------------------------------------------+

Notably, the buffer just contains 1) triangle data and 2) how many triangles are to be drawn. The buffer should probably include which triangles are to be drawn, but for simplicity’s sake the buffer omits that info in our example.
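
For the curious, here’s roughly what consuming a buffer like this looks like in D3D12, based on my reading of the docs (the resource and variable names are placeholders). ExecuteIndirect reads both the per-draw arguments and the draw count straight from GPU buffers:

#include <d3d12.h>

// Sketch of an indirect draw: the GPU reads the parameters out of buffers
// that an earlier GPU pass (our "Call B") wrote, instead of the CPU
// passing them directly.
void issue_indirect_draw(ID3D12GraphicsCommandList *cmd_list,
                         ID3D12CommandSignature *signature,  // describes the argument layout
                         ID3D12Resource *argument_buffer,    // per-draw args, GPU-written
                         ID3D12Resource *count_buffer)       // actual draw count, GPU-written
{
    // The argument buffer holds records shaped like D3D12_DRAW_ARGUMENTS
    // (vertex count, instance count, offsets). MaxCommandCount is only an
    // upper bound; the real count comes from count_buffer at execution time.
    cmd_list->ExecuteIndirect(signature,
                              /* MaxCommandCount */ 100,
                              argument_buffer, /* ArgumentBufferOffset */ 0,
                              count_buffer,    /* CountBufferOffset */ 0);
}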

Boring technicalities

In the complex reality of GPUs, our example above of B (calculate which triangles are visible) and C (draw the visible triangles) may not actually be the best solution. Speculative execution (on both CPUs and GPUs), which does work up front and throws it away if it turns out to be unneeded, can oftentimes be faster. For example, perhaps the GPU should run B and C at the same time, then combine the two and throw away any triangles drawn in C that aren’t visible according to B. Point is, it’s complicated. Still, it’s a good thematic example for our purposes :)

Back to the PR - What issues does it address?

In light of our newfound GPU knowledge, we can re-read the PR and understand what it’s trying to accomplish.

The goal of this refactor is to optimize for cases where games (Starfield in particular) uses advanced ExecuteIndirect in very inefficient ways.

  • Indirect count, but indirect count ends up being 0.
  • Non-indirect count, but none of the active draws inside the multi-draw indirect actually result in a draw. Multiple back-to-back ExecuteIndirects are called like this, causing big bubbles on the GPU.

Issue 1: Indirect count ends up being 0

This issue is pretty straightforward. Basically, the performance issue is that the data buffer states “here are (100 triangles with data), we’re going to draw (0) of them” and the GPU says “Okay! I’ll create a new shader, sync the shader, and then run the operation NO-OP on each of the triangles because we’re drawing none of them.” Doing this adds the overhead of, ya know, creating the shader, syncing it, etc…

As the author HansKristian notes, there already exists an optimization where you can add a predicate (a predicate simply returns true or false) to the data buffer to check if draw_count is 0. If the predicate returns true, the GPU will automatically skip generating the shader etc. and boom, performance increase.
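
To make the predicate idea concrete, here’s a purely conceptual sketch. None of these names come from vkd3d-proton; it just shows the shape of the check that lets the GPU bail out early:

#include <cstdint>

// Made-up illustration, not vkd3d-proton's actual code.
struct IndirectDrawData {
    uint32_t draw_count;   // written by an earlier GPU pass (our "Call B")
    // ... per-triangle argument data would follow ...
};

// The predicate checks whether draw_count is 0; if it returns true, the GPU
// skips the shader setup, the sync, and all of the pointless no-op draws.
bool skip_this_execute_indirect(const IndirectDrawData &data)
{
    return data.draw_count == 0;
}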

Issue 2: Non-indirect count, but the number of drawn triangles ends up being 0

The predicate above handles the case where we pass an indirect count and it turns out to be 0. However, some draw commands pass the draw_count input directly while the triangle data is still indirect. Example call: “draw 16 triangles with this indirect data.” This is not performant when none of the triangles in the buffer will actually be drawn.

It turns out it’s pretty easy to have the GPU scan the data buffer and output the draw_count indirectly from the buffer data rather than use the direct draw_count.

Example

Here’s the original command buffer, where we can’t run our predicate because the draw_count is non-indirect:

   GPU

   (command buffer)
+--------------------+
| Run shader to      |
| transform Output B |
| to Input C         |
+--------------------+
| Sync GPU           |   ( Technically, because a shader runs non-linearly, we
| post-shader        |     need to fence the GPU before continuing )
+--------------------+
| ExecuteIndirect    |
| (X, draw_count=16) | <- Note the hardcoded draw_count in the command
+--------------------+
| ExecuteIndirect    |
| (Y, draw_count=16) |
+--------------------+
| ExecuteIndirect    |
| (Z, draw_count=16) |
+--------------------+

And here’s how we can modify the GPU execution to create an indirect (and accurate) draw_count, and then run the same predicate as before:

(Hint: again, you can scroll left/right. I have too much fun making these diagrams)

   GPU
                              (new init buffer,
   (command buffer)        creates command buffer)   (new command buffer)
+--------------------+                          ,-->+-------------------+
| Run shader to      |                          |   | ExecuteIndirect   |
| transform Output B |                          |   | (X, draw_count=x) | <- Note now the draw_count 
| to Input C         |                          |   +-------------------+    is reflective of actual 
+--------------------+                          |   | ExecuteIndirect   |    draw count data and is 
| Sync GPU           |                          |   | (Y, draw_count=y) |    indirect
| post-shader        |                          |   +-------------------+
+--------------------+ -->                      |   | ExecuteIndirect   |
                           +------------------+ |   | (Z, draw_count=z) |
                           |Scan Input C, find| |   +-------------------+
                           | draw_count of X  | |
                           +------------------+ |
                           |Scan Input C, find| |
                           | draw_count of Y  | |
                           +------------------+ |
                           |Scan Input C, find| |
                           | draw_count of Z  | |
                           +------------------+-

This change lets us run the same predicate as in Issue 1 and skip any commands that have no actual draws!
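
Here’s a sketch of what those “Scan Input C, find draw_count” boxes do. On the real GPU this would run as a small compute pass; I’ve written it as plain C++ for readability, and every name here is made up for illustration:

#include <cstdint>
#include <vector>

struct DrawArgs {
    uint32_t triangle_count;   // 0 means this slot draws nothing
    // ... other per-draw parameters ...
};

// Replace the hardcoded draw_count=16 with a count derived from the data
// itself, written into a new (indirect) count buffer. Once the count is
// indirect, the zero-count predicate from Issue 1 can skip the empty calls.
uint32_t derive_draw_count(const std::vector<DrawArgs> &args)
{
    uint32_t count = 0;
    for (const DrawArgs &a : args)
        if (a.triangle_count != 0)
            ++count;   // only count draws that will actually draw something
    return count;
}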

A natural question is:

Does scanning the input and creating another command buffer really improve performance over just not-drawing the triangles?

To which my response would be:

*shrugs shoulders* I guess so, based on the PR. As always, performance simply depends on (cost of a path) × (how often that path is taken), and the PR mentions Starfield apparently does a lot of these empty draw calls.

Conclusion

Not so scary after all, right? Once you understand basic CPU/GPU interactions, the performance issues become much more understandable. Again, the author mentions that only minute performance increases can be achieved by these changes. Still, it’s fun to use current trends to learn something (such as GPUs) in more depth.

Note: I can’t promise everything I say in this post is accurate; it’s merely my own initial understanding of the topics. In addition, I omitted some details from the original PR, so I strongly recommend reading it now that we’ve covered some of the necessary background.

Below is a random diagram that I also found useful, but that didn’t fit into the flow of the post:

       CPU                      GPU
+--------------+
| Issue direct |
| culling call |  data transfer...
+--------------+  ---> 
                       +--------------------+
                       | Generate arguments |
                       +--------------------+
      data transfer    | Write to regular   |
            (again)... | buffer             |
                  <--- +--------------------+
+-----------------+
| Read from       |
| regular buffer  |
+-----------------+ data transfer
| Issue draw call |   (yet again)...
+-----------------+ ---> +----------------------+
                         | Execute regular call |
                         +----------------------+
                         | Complete pass        |
                         +----------------------+

vs

       CPU                      GPU
+----------------+
| Issue indirect |
| culling call   |  data transfer...
+----------------+  ---> +--------------------+
                         | Generate arguments |
                         +--------------------+
                         | Write to indirect  |
                         | buffer             |
                         +--------------------+
                         | Complete pass      |
                         +--------------------+
                         | Execute indirect   |
                         | call               |
                         +--------------------+
                         | Complete pass      |
                         +--------------------+

Thanks and References

Special thanks to Liam and Zander for feedback on an earlier draft of this post.

Here are the references I used to learn while writing this post: