Wednesday, July 3, 2024

Vulkan Synchronization: Why it sucks (Part 2)

There are a variety of reasons why I don't think the Vulkan committee designed synchronization very well.

Let's get one thing out of the way first:

State/Hazard Tracking is a Thing that People Need to Do

At GDC 2016 (the year Vulkan came out), Matthaeus Chajdas gave a presentation making the point that you shouldn't need to do state tracking in Vulkan. His argument is that, instead of tracking state at runtime, you should just *know* which barriers are needed where, and hardcode them in the right places. After all - you wrote your rendering algorithm.

I don't think this is realistic. If you're making a toy, then sure, you'll just know where the barriers go. But no serious engines are toys.

Consider something like Unreal Engine - who knows what kind of crazy shit artists who don't work at Epic have created that the engine is being asked to render. Node graphs can be arbitrarily complex and can have dependencies on multiple subsystems within the engine. I'd put good money on the claim that the people who wrote Unreal Engine don't know exactly where all the barriers go - especially since they can't even see the content they're being asked to render.

Also, consider that Direct3D is the lingua franca of 3D graphics (whether you want it to be or not). Almost no games call Vulkan directly. Vulkan is, instead, used as a porting layer - a platform API that your calls have to funnel through if you want your game to run on Android. Game code either calls Direct3D directly or, in the case of the big engines, goes through an RHI that looks a lot like Direct3D. For all this content, the way it ends up running on Vulkan is through a layer like Proton - something that implements the Direct3D API (or an API very much like it) on top of Vulkan - which necessarily requires state tracking, because Direct3D's synchronization API is significantly less detailed/precise than Vulkan's.

So, I reject outright the idea that devs should *just know* where all their barriers go, and will magically place them correctly because the people writing the all-Vulkan game engine (which doesn't exist) are just really smart. It's not realistic.

It's Actually D3D, It's Always Been D3D

The Vulkan spec pretends that each stage in a pipeline may execute on a different chip with different caches. Therefore, you have to track each stage in every pipeline independently. But no hardware actually works this way. No GPU puts *every* stage of the graphics pipeline on a different chip with different caches. So you're doing all this tracking for each of these stages, but the hardware only has a few (usually one) chips that all the programmable stages execute on, and almost all of these different barriers boil down to the same thing under the hood anyway.
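
To make that concrete, here's a minimal sketch (using the Vulkan 1.3 synchronization2 API; the function name and the assumption that the command buffer is already recording are mine) of two barriers that the spec treats as distinct, but that most real hardware - where every programmable stage runs on the same cores behind the same caches - will service identically:

    #include <vulkan/vulkan.h>

    // Sketch: a compute shader wrote a storage buffer; a later draw reads it.
    // Vulkan wants to know precisely *which* shader stage does the read.
    static void barrierComputeWriteToShaderRead(VkCommandBuffer cmd)
    {
        VkMemoryBarrier2 readInFragment = {
            .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2,
            .srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
            .srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT,
            .dstStageMask  = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
            .dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT,
        };

        // Tracking state precisely enough to know the reader is the vertex
        // shader instead buys you a "different" barrier...
        VkMemoryBarrier2 readInVertex = readInFragment;
        readInVertex.dstStageMask = VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT;
        (void)readInVertex; // shown only for comparison, not submitted

        // ...even though on most GPUs vertex and fragment shaders execute on
        // the same compute units behind the same caches, so both barriers
        // boil down to the same wait-plus-cache-flush in the driver.
        VkDependencyInfo dep = {
            .sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
            .memoryBarrierCount = 1,
            .pMemoryBarriers    = &readInFragment,
        };
        vkCmdPipelineBarrier2(cmd, &dep);
    }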

Imagine that you weren't writing a Vulkan app, but instead were writing a device-specific driver. You'd know exactly the topology of the device you're writing for, and you'd track only the state that the device actually cares about. Instead, Vulkan forces you to track all the state that a conceptual worst-case device might need. No real device actually needs all that precision.

But the worst part of all this is that, when device manufacturers design devices, they are informed by the graphics APIs of the day. They make their hardware knowing that Direct3D is the lingua franca API, and that the vast majority of content will be written to that API or a similar one, and they design the hardware accordingly. So what we're left with is: the apps are written for the D3D model, then they have to be filtered down to the super precise Vulkan model, which requires state tracking, only to end up *back* at the D3D model when it hits the hardware. So what was the point of all that tracking for Vulkan? Nothing!

Naive Hazard Tracking is Too Precise

Each array slice, mip level, and aspect (depth/stencil/color) of each image can be in a different layout. Synchronization between previous writes and current reads can likewise be specified per array slice, mip level, and aspect. In a buffer, every byte range can be synchronized independently.
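
For instance, here's a minimal sketch of a barrier (synchronization2 again; the function name, image, and command buffer are assumptions, and the image is assumed to be a color texture array) that transitions and synchronizes exactly one mip level of one array slice, leaving every other subresource alone:

    #include <vulkan/vulkan.h>

    // Sketch: mip 3 of array layer 7 was just rendered to; make it readable
    // by a fragment shader. Nothing else in the image is touched.
    static void barrierSingleSubresource(VkCommandBuffer cmd, VkImage image)
    {
        VkImageMemoryBarrier2 barrier = {
            .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2,
            .srcStageMask        = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
            .srcAccessMask       = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT,
            .dstStageMask        = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
            .dstAccessMask       = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT,
            .oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
            .newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
            .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .image               = image,
            // This is the granularity Vulkan offers: one aspect, one mip,
            // one slice. Every other subresource keeps whatever layout and
            // synchronization state it already had - which is exactly what a
            // hazard tracker has to remember.
            .subresourceRange = {
                .aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT,
                .baseMipLevel   = 3,
                .levelCount     = 1,
                .baseArrayLayer = 7,
                .layerCount     = 1,
            },
        };

        VkDependencyInfo dep = {
            .sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
            .imageMemoryBarrierCount = 1,
            .pImageMemoryBarriers    = &barrier,
        };
        vkCmdPipelineBarrier2(cmd, &dep);
    }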

Now, imagine that you wanted to do hazard tracking at this granularity. You'd need a data structure that remembered, for each texture, for each array slice, for each mip level, for each aspect, what the most recent write to that area of the resource was. Even if you do some kind of run-length encoding, there's no getting around the fact that a 4-dimensional data structure is gonna be a mess. You could have a hashmap of hashmaps of hashmaps of hashmaps, which would waste a ton of space, or you could have something like a 4-dimensional k-d tree, which is going to end up with a ton of nodes. Either way, you're gonna create a monster.
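
As a rough sketch of what even the flat version looks like (all the type names here are made up for illustration; buffers would need a separate structure keyed on byte ranges, which is its own mess):

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vulkan/vulkan.h>

    // One trackable unit at Vulkan's full precision:
    // image x array slice x mip level x aspect.
    struct SubresourceKey {
        VkImage            image;
        uint32_t           arrayLayer;
        uint32_t           mipLevel;
        VkImageAspectFlags aspect;

        bool operator==(const SubresourceKey&) const = default;
    };

    // What has to be remembered about the most recent write to that unit in
    // order to emit a correct barrier before the next read.
    struct LastWrite {
        VkPipelineStageFlags2 stage;
        VkAccessFlags2        access;
        VkImageLayout         layout;
    };

    struct SubresourceKeyHash {
        size_t operator()(const SubresourceKey& k) const {
            // Cheap hash combine - good enough for a sketch.
            size_t h = std::hash<uint64_t>{}((uint64_t)k.image);
            h = h * 31 + k.arrayLayer;
            h = h * 31 + k.mipLevel;
            h = h * 31 + k.aspect;
            return h;
        }
    };

    // The flattened version of the "hashmap of hashmaps of hashmaps of
    // hashmaps": one entry per (image, slice, mip, aspect). A 2048-layer
    // texture array with full mip chains contributes tens of thousands of
    // entries on its own.
    using HazardMap =
        std::unordered_map<SubresourceKey, LastWrite, SubresourceKeyHash>;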

If you do something simpler and don't represent the full precision that Vulkan has, you're going to oversynchronize somewhat.

The crux here is that there's a nonobvious crossover point. If you're super precise in your tracking data structure, you're going to spend a bunch of time partitioning / compacting it. If you're imprecise, you're going to oversynchronize. There's no good solution.

Image Layouts are (Differently) Too Precise

There are 36 different layouts a region of an image can be in (if you include all non-vendor-specific extensions). Some of those layouts can only be used for one purpose, but many can be used for multiple purposes while only being optimal for one.

So, the developer has a choice. They can transition each image to whichever layout is optimal for each command it's used in, thereby issuing lots of transitions. Or they can pick layouts which are *compatible* with multiple uses but optimal for just some (or maybe even none), and issue fewer barriers. Or something in the middle.
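
As a hedged illustration of the two ends of that spectrum (the function name and the scenario - an image that's rendered to, then sampled, then copied from every frame - are made up):

    #include <vulkan/vulkan.h>

    // Which layout should the image be in while it's being sampled?
    static VkImageLayout chooseSampledLayout(bool perUseOptimalLayouts)
    {
        if (perUseOptimalLayouts) {
            // Strategy A: transition to whichever layout is optimal for each
            // use - COLOR_ATTACHMENT_OPTIMAL for rendering, then
            // SHADER_READ_ONLY_OPTIMAL for sampling, then TRANSFER_SRC_OPTIMAL
            // for the copy - paying a layout-transition barrier at every
            // hand-off.
            return VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
        }
        // Strategy B: park the image in GENERAL forever. It's compatible with
        // all three uses, so there are no transitions at all, but the spec
        // documents it as optimal for none of them. Whether that costs
        // anything depends entirely on hardware details the API never exposes.
        return VK_IMAGE_LAYOUT_GENERAL;
    }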

How should the developer make this decision? What information are they armed with to make it? The answer: none. The developer has no idea which Vulkan layouts actually correspond to identical byte layouts under the hood. (And D3D has way fewer layouts, so you better believe that not all of the Vulkan ones are actually distinct in the hardware.)

So, the only thing the developer can do is write the code multiple ways and test the performance. But how many places on the spectrum should they implement? How do they know which codepath to pick if the user runs their code on a device they haven't tested with? There is no good answer.

Vulkan Made an Intentional Decision

The thing that really gets me about Vulkan's synchronization API is that they *didn't* actually go all the way to the extreme:

  • Semaphores automatically create memory dependencies, and you don't have to specify an access mask with them. This didn't have to be the case, though - the Vulkan committee could have decided to require you to specify access masks with semaphores
  • Semaphores signaled via queue submissions automatically form a memory dependency with previous command buffers. The Vulkan committee didn't have to do this - they could have said "if you want to form a memory dependency with a command buffer, you must interact with *that* command buffer"
  • Queue submission automatically forms a memory dependency with host accesses prior to the queue submission (coherent mappings notwithstanding). The Vulkan committee didn't need to do this - they could have said that you need to issue a device-side synchronization command to synchronize with the host
  • Pipeline barriers themselves don't have to be synchronized with memory dependencies, which is a little surprising because pipeline barriers can change image layouts, which can read/write every byte in the image. The Vulkan committee could have said image layout transitions are done via separate commands which need to have memory dependencies around them.
  • Pipeline barriers require a bitmask of pipeline stages as well as a bitmask of access types - but there's no relation between them. You can't say "stage X uses access mask Y, whereas stage Z uses access mask W" without issuing multiple barriers. This means that the most natural way to use pipeline barriers (lumping all the stages and access masks together into a single barrier) actually oversynchronizes - every access flag you list ends up applying to every stage you list, including combinations you never intended. (See the sketch after this list.)
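
Here's a minimal sketch of that last point, using the original vkCmdPipelineBarrier. The scenario and function names are made up: a compute shader wrote one buffer and a transfer wrote another; a vertex shader reads the first and a transfer reads the second.

    #include <vulkan/vulkan.h>

    // The "natural" version: one barrier, all stages and access flags OR'd
    // together. It also covers compute-write -> transfer-read and
    // transfer-write -> vertex-shader-read, pairings that never happen.
    static void lumpedBarrier(VkCommandBuffer cmd)
    {
        VkMemoryBarrier barrier = {
            .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_TRANSFER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_TRANSFER_READ_BIT,
        };
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
            VK_PIPELINE_STAGE_VERTEX_SHADER_BIT  | VK_PIPELINE_STAGE_TRANSFER_BIT,
            0, 1, &barrier, 0, nullptr, 0, nullptr);
    }

    // The precise version: expressing "compute write is read by the vertex
    // shader" separately from "transfer write is read by a transfer" takes
    // two barrier commands.
    static void preciseBarriers(VkCommandBuffer cmd)
    {
        VkMemoryBarrier computeToVertex = {
            .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        };
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
            0, 1, &computeToVertex, 0, nullptr, 0, nullptr);

        VkMemoryBarrier transferToTransfer = {
            .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT,
        };
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_TRANSFER_BIT,
            VK_PIPELINE_STAGE_TRANSFER_BIT,
            0, 1, &transferToTransfer, 0, nullptr, 0, nullptr);
    }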

I would understand it if the Vulkan committee had gone all the way, and said that Vulkan gives you nothing for free and it's up to the author to do everything. But, instead, they chose a middle ground, which indicates they think this particular compromise is appropriate. They chose it intentionally. But it's a terrible compromise! It's precise enough that you can't track it well, and the precision is entirely needless because it pessimizes for a worst-case device that doesn't actually exist.

Manufacturers are Better Equipped for this than Game Developers

I'd like to make this point again: State/hazard tracking is a thing that people have to do with Vulkan, but there's no way to do it better than a device-specific driver can, because Vulkan forces you to pessimize for a worst-case device. I appreciate that Vulkan lets the app developer choose what granularity of state/hazard tracking they perform, but I think that's a constituency inversion: the only rubric for the tradeoffs of tracking is performance, and the tradeoffs are device-dependent, so the manufacturers making the devices, who are experts in each individual device they produce, are best equipped to make that tradeoff. They'll certainly do a better job than an individual game studio staring down hundreds of GPUs from many vendors to test against. This goes doubly for GPUs that release after the game does, in which case the game has to run on a device the developer has never seen.

The game developer doesn't know the internal topology of the GPU they're running on, and they're not equipped to make the kinds of tradeoffs that the Vulkan API forces them to make. Only the device manufacturers can actually make these sensible tradeoffs for each device.
