Litherum: July 2024

Wednesday, July 3, 2024

Vulkan Synchronization: Why it sucks (Part 2)

There are a variety of reasons why I don't think the Vulkan committee designed synchronization very well.

Let's get one thing out of the way first:

State/Hazard Tracking is a Thing that People Need to Do

At GDC 2016 (the year that Vulkan came out), Matthaeus Chajdas gave a presentation making the point that you shouldn't need to do state tracking in Vulkan. He says that, instead of tracking state at runtime, you should just *know* which barriers are needed in which places, and just hardcode them in the right places to be correct. After all - you wrote your rendering algorithm.

I don't think this is realistic. If you're making a toy, then sure, you'll just know where the barriers go. But no serious engines are toys.

Consider something like Unreal Engine - who knows what kind of crazy shit the artists, who don't work at Epic, created in UE that the engine is being asked to render. Node graphs can be arbitrarily complex and can have dependencies with multiple subsystems within the engine. I'd put good money on the claim that the people who wrote Unreal Engine don't know exactly where all the barriers go, without even being able to see the content they're being asked to render.

Also, consider that Direct3D is the lingua franca of 3D graphics (whether you want it to be or not). Almost no games call Vulkan directly. Vulkan is, instead, used as a porting layer - a platform API that your calls have to funnel through if you want your game to run on Android. Game code either calls Direct3D directly, or in the case of the big engines, goes through an RHI that looks similar to Direct3D. For all this content, the way it ends up running on Vulkan is with a layer like Proton - implementing the Direct3D API, or something similar to the Direct3D API, on top of Vulkan - which necessarily requires state tracking, because the Direct3D synchronization API is significantly less detailed/precise than Vulkan's.

So, I just reject outright the idea that devs should *just know* where all their barriers should go, and should magically place them in the right place because the people writing the Vulkan game engine (which doesn't exist) are just really smart. It's just not realistic.

It's Actually D3D, It's Always Been D3D

The Vulkan spec pretends that each stage in a pipeline may execute on a different chip with different caches. Therefore, you have to track each stage in every pipeline independently. But there's no hardware which actually works this way. No GPU actually puts *every* stage of the graphics pipeline on different chips with different caches. So you're doing all this tracking for each one of these stages, but the hardware only has a few (usually 1) different chips that all the programmable stages execute on. So almost all these different barriers are all going to boil down to the same thing under the hood anyway.

Imagine that you weren't writing a Vulkan app, but instead were writing a device-specific driver. You'd know exactly the topology of the device you're writing for, and you'd track only the state that the device actually cares about. Instead, Vulkan forces you to track all the state that a conceptual worst-case device might need. No real device actually needs all that precision.

But the worst part of all this is that, when device manufacturers design devices, they are informed by the graphics APIs of the day. They make their hardware knowing that Direct3D is the lingua franca API, and that the vast majority of content will be written to that API or a similar API. So they design the hardware accordingly. So, what we're left with is: the apps are written for the D3D model, then they have to be filtered down to the super precise Vulkan model, which requires state tracking, only to end up *back* at the D3D model when it hits the hardware. So what was the point of all that tracking for Vulkan? Nothing!

Naive Hazard Tracking is Too Precise

Each array slice and mip level and aspect (depth/stencil/color) of each image can be in a different layout. Synchronization between previous writes and current reads can synchronize each array slice and mip level and aspect independently. In a buffer, every byte range can be synchronized independently.

Now, imagine that you wanted to do hazard tracking at this granularity. You'd need a data structure that remembered, for each texture, for each array slice, for each mip level, for each aspect, what the most recent write to that area of the resource was. Even if you do some kind of run-length encoding, there's no getting around the fact that a 4-dimensional data structure is gonna be a mess. You could have a hashmap of hashmaps of hashmaps of hashmaps, which would waste a ton of space, or you could have something like a 4-dimensional kD tree, which is going to end up with a ton of nodes. There's no getting around the fact that you're gonna create a monster.

If you do something simpler, and don't represent exactly the precision that Vulkan has, that means you're going to somewhat oversynchronize.

The crux here is that there's a nonobvious crossover point. If you're super precise in your tracking data structure, you're going to spend a bunch of time partitioning / compacting it. If you're imprecise, you're going to oversynchronize. There's no good solution.
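To make that crossover concrete, here's a minimal sketch of the two extremes. Everything in it is hypothetical (the class names, the `region` helper, the pass labels), and it only models "which earlier writers do I have to wait on", not real Vulkan barriers. The precise tracker stores one entry per (mip, layer, aspect), and the coarse tracker just remembers everyone who ever wrote to the image.

```python
from itertools import product

class PreciseTracker:
    """Remembers the last writer of every (mip, layer, aspect) subresource."""
    def __init__(self):
        self.last_write = {}  # (mip, layer, aspect) -> writer label

    def write(self, subresources, label):
        for key in subresources:
            self.last_write[key] = label

    def writers_to_wait_on(self, subresources):
        # Only writers that actually overlap the region being read.
        return {self.last_write[k] for k in subresources if k in self.last_write}

class CoarseTracker:
    """Remembers only that somebody wrote to the image, not where."""
    def __init__(self):
        self.writers = set()

    def write(self, subresources, label):
        self.writers.add(label)

    def writers_to_wait_on(self, subresources):
        # Can't tell which writer overlaps, so wait on all of them.
        return set(self.writers)

def region(mips, layers, aspects=("color",)):
    """Hypothetical helper: list the subresources in a mip/layer/aspect box."""
    return list(product(mips, layers, aspects))

precise, coarse = PreciseTracker(), CoarseTracker()
for tracker in (precise, coarse):
    tracker.write(region([0], range(4)), "pass_A")  # writes all of mip 0
    tracker.write(region([1], range(4)), "pass_B")  # writes all of mip 1

read = region([1], [2])  # reads a single slice of mip 1
print(sorted(precise.writers_to_wait_on(read)))  # ['pass_B']
print(sorted(coarse.writers_to_wait_on(read)))   # ['pass_A', 'pass_B']
print(len(precise.last_write))                   # 8
```

The precise tracker waits on exactly the right pass, but it's already holding 8 entries for one small image after two writes. The coarse tracker holds 2 labels and waits on a pass it didn't need to.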

Image Layouts are (Differently) Too Precise

There are 36 different layouts a region of an image can be in (if you include all non-vendor-specific extensions). Some of those layouts are capable of only being used for one purpose, but many of them are capable of being used for multiple purposes, while only being optimal for one.

So, the developer has a choice. They can either transition every image to whichever layout is optimal for each command they're being used in, thereby issuing lots of transitions. Or, they can pick layouts which are *compatible* with multiple uses but optimal for just some (or maybe even none), and issue fewer barriers. Or something in the middle.

How should the developer make this decision? With which information are they armed in order to make this decision? The answer is: nothing. The developer has no idea which Vulkan layouts actually correspond to identical byte layouts under the hood. (And, D3D has way fewer layouts, so you better believe that not all of the Vulkan ones will actually be distinct in the hardware.)

So, the only thing the developer can do is write the code multiple ways and test the performance. But how many places on the spectrum should they implement? How do they know which codepath to pick if the user runs their code on a device they haven't tested with? There is no good answer.

Vulkan Made an Intentional Decision

The thing that really gets me about Vulkan's synchronization API is that they *didn't* actually go all the way to the extreme:

Semaphores automatically create memory dependencies, and you don't have to specify an access mask with them. This didn't have to be the case, though - the Vulkan committee could have decided to require you to specify access masks with semaphores.
Semaphores signaled via queue submissions automatically form a memory dependency with previous command buffers. The Vulkan committee didn't have to do this - they could have said "if you want to form a memory dependency with a command buffer, you must interact with *that* command buffer."
Queue submission automatically forms a memory dependency with host accesses prior to the queue submission (coherent mappings notwithstanding). The Vulkan committee didn't need to do this - they could have said that you need to issue a device-side synchronization command to synchronize with the host.
Pipeline barriers themselves don't have to be synchronized with memory dependencies, which is a little surprising because pipeline barriers can change image layouts, which can read/write every byte in the image. The Vulkan committee could have said image layout transitions are done via separate commands which need to have memory dependencies around them.
Pipeline barriers require a bitmask of pipeline stages as well as a bitmask of access masks - but there's no relation between them. You can't say "stage X uses access mask Y, whereas stage Z uses access mask W" without issuing multiple barriers. This means that the most natural way to use pipeline barriers (lumping all the stages and access masks together into a single barrier) is actually oversynchronizing - it's totally possible for a single access mask to apply to multiple pipeline stages, which you might not have anticipated when issuing the barrier.
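That last point about stages and access masks is easiest to see as a cross product. Here's a small sketch, assuming you meant two specific (stage, access) pairings and then lumped them into one barrier. The stage and access labels are plain strings loosely modeled on Vulkan's flag names, and `implied_pairs` is a made-up helper, not an API.

```python
from itertools import product

def implied_pairs(stages, accesses):
    """Every (stage, access) combination that one half of a barrier covers."""
    return set(product(stages, accesses))

# What the author meant: two specific pairings.
intended = {
    ("VERTEX_SHADER", "UNIFORM_READ"),
    ("FRAGMENT_SHADER", "SHADER_SAMPLED_READ"),
}

# What a single lumped barrier expresses: all the stages, and all the accesses.
stages = {stage for stage, _ in intended}
accesses = {access for _, access in intended}
lumped = implied_pairs(stages, accesses)

extra = lumped - intended
print(sorted(extra))
# [('FRAGMENT_SHADER', 'UNIFORM_READ'), ('VERTEX_SHADER', 'SHADER_SAMPLED_READ')]
```

Two intended pairings turn into four covered pairings. The two extra ones are synchronization you never asked for, and the only way to avoid them is to issue two separate barriers.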

I would understand it if the Vulkan committee went all the way, and said that Vulkan would give you nothing for free, and it's up to the author to do everything. But, instead, they chose a middle ground, which indicates they think this is an appropriate middle ground. They intentionally chose this compromise. But it's a terrible compromise! It's precise enough that you can't do it well, and it's entirely needless because it pessimizes to no hardware that actually exists.

Manufacturers are Better Equipped for this than Game Developers

I'd like to make this point again: State/hazard tracking is a thing that people have to do with Vulkan, but there's no way to do it better than a device-specific driver, because Vulkan pessimizes. I appreciate that Vulkan lets the app developer choose what granularity of state/hazard tracking they perform, but I think that's a constituency inversion: the only rubric for the tradeoffs of tracking is performance, and the tradeoffs will be device-dependent, so the manufacturers making the devices, who are experts in each individual device they produce, will be best equipped to make that tradeoff. They'll certainly do a better job than an individual game developer house with hundreds of GPUs from many vendors to test out. This goes doubly for GPUs that release after the game does, which is a situation where the game has to run on a device the developer has never seen before.

The game developer doesn't know the internal topology of the GPU they're running on, and they're not equipped to make the kinds of tradeoffs that the Vulkan API forces them to make. Only the device manufacturers can actually make these sensible tradeoffs for each device.

Vulkan Synchronization: What it is (Part 1)

Synchronization in Vulkan is actually somewhat tricky to grok. And you have to grok it, because it isn't done automatically for you - you have to explicitly synchronize every data hazard that might arise. (This is in contrast to APIs like Metal which do most, or even all, of the synchronization for you.)

Hazards

So let's start out with: What's a data hazard? A data hazard is when two operations can't happen simultaneously, either because one operation depends on the result of the other one (think: a read depends on a previous write) or one needs to be independent of the other (think: a read before a write needs to *not* see the result of the write). To forbid these, you make explicit calls to Vulkan to tell it what to disallow from happening simultaneously with what else.

There are 2 flavors of synchronization in Vulkan: execution dependencies and memory dependencies.

Execution dependencies are the simplest - they don't say *anything* about memory, but instead *just* describe a "happens-before" relationship. This is sufficient for a write-after-read hazard, where the read shouldn't see the results of the write - it's enough to just delay the write until after the read completed.

Memory dependencies imply an execution dependency - so memory dependencies are strictly more powerful than execution dependencies. A memory dependency is where there is actually communication going on, through memory. It is sufficient for read-after-write hazards, where the read needs to see the result of the write, and also write-after-write hazards. There are 2 pieces to it:

The first operation's result becomes "available". Think of this as a cache flush. After the first operation, the data has to actually move from the source cache out to main memory.
The memory becomes "visible" to the second operation. Think of this as a cache invalidation. If the receiver wants to see the result of the previous operation, it has to mark its own cache as invalid, so the operation actually goes out to main memory to see the results.

There is no such thing as a read-read hazard, so no synchronization is necessary in that case.
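The classification above boils down to a tiny lookup. Here's a sketch of it; the `Dependency` enum and `required_dependency` function are names I made up for illustration, not anything in the Vulkan API.

```python
from enum import Enum

class Dependency(Enum):
    NONE = "none"
    EXECUTION = "execution"
    MEMORY = "memory"

def required_dependency(first, second):
    """Classify the hazard between two accesses to the same memory.

    `first` and `second` are each "read" or "write", in the order they happen.
    """
    for access in (first, second):
        if access not in ("read", "write"):
            raise ValueError(f"unknown access kind: {access!r}")
    if first == "read" and second == "read":
        return Dependency.NONE       # read-after-read: not a hazard
    if first == "read" and second == "write":
        return Dependency.EXECUTION  # write-after-read: just delay the write
    # read-after-write and write-after-write: the first write has to be
    # made available and then visible, so a memory dependency is needed.
    return Dependency.MEMORY

print(required_dependency("write", "read").value)  # memory
print(required_dependency("read", "write").value)  # execution
```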

Tools

Vulkan provides many tools that you can use to synchronize.

Fences are used to synchronize the device ("device" = "GPU") with the host ("host" = "CPU"). You specify a fence to be signaled when the device work is done as a part of vkQueueSubmit(). This kind of synchronization is actually (surprisingly) really easy, because Section 7.9 of the spec essentially says that you don't need to deal with fences when uploading data to the device. The spec for vkWaitForFences() indicates that the only thing you need to do for downloading data from the device is simply wait on the fence. (Also, if your resource isn't coherently mapped, you need to use vkFlushMappedMemoryRanges() and vkInvalidateMappedMemoryRanges() so the CPU's caches will get out of the way.)
Semaphores are used to synchronize between different queues. They actually form a dependency graph, where each command buffer submitted to a queue specifies a set of semaphores to wait for before it starts executing, and a set of semaphores to signal when it's done executing.
Events are used to synchronize different commands within the same queue. Just because two operations are submitted to the same queue does not mean they will execute in-order. Events are "split" in that there is vkCmdSetEvent() as distinct from vkCmdWaitEvents(). These are commands that get recorded into a command buffer.
Pipeline barriers are also used to synchronize different commands within the same queue, but the difference with events is that barriers aren't split - there's just a single vkCmdPipelineBarrier(). This is also a command that gets recorded into a command buffer.
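To picture the dependency graph that semaphores form, here's a rough sketch. The submission names and semaphore names are invented, and it assumes each semaphore is signaled by exactly one submission. It just derives an execution order that respects the waits, using Python's standard `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical submissions: name -> (semaphores waited on, semaphores signaled)
submissions = {
    "shadow_pass":   (set(), {"shadows_done"}),
    "async_compute": (set(), {"compute_done"}),
    "main_pass":     ({"shadows_done", "compute_done"}, {"frame_done"}),
    "present":       ({"frame_done"}, set()),
}

def execution_order(submissions):
    """Return one order of submissions that satisfies every semaphore wait."""
    signaled_by = {
        semaphore: name
        for name, (_, signals) in submissions.items()
        for semaphore in signals
    }
    # Map each submission to the set of submissions it has to wait for.
    graph = {
        name: {signaled_by[semaphore] for semaphore in waits}
        for name, (waits, _) in submissions.items()
    }
    return list(TopologicalSorter(graph).static_order())

order = execution_order(submissions)
print(order)
```

The two passes with no waits can come out in either order, which is the point: the semaphores only constrain what they actually connect.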

I won't say much more about fences - as I described above, you really don't need to think about them much. One of the cool things about fences is that you can "export" one to a "sync fd" on UNIX platforms, and the resulting fd can be used in select() and epoll(), which makes for a nice way to integrate into your app's existing run loop.

I also won't say much more about events - they're just the same thing as pipeline barriers, but split into two halves.

Stages and Accesses

Accesses by the GPU happen within a "pipeline," which is comprised of a sequence of stages. For example, the vertex shader and the fragment shader are stages of the graphics pipeline (among many other stages). Vulkan is designed so that each stage within a pipeline can execute on a totally different chip than any of the other stages - and each chip will have its own cache. Therefore, if your pipeline has n stages, Vulkan forces you to pessimize and assume that each of those n stages will execute on a different chip with a different set of caches, so you have n different caches you have to manage.

However, it's actually worse than that. Each stage might be able to access a resource in a variety of different ways - for example, a sampled texture vs a storage texture. Each of these different ways to access a resource *also* might have its own cache - the cache used for storage textures might be a totally different cache than the cache used for sampled textures. So you actually have more than n caches to worry about.

The more caches you have to manage, the more synchronization calls you need to make.

When you issue a pipeline barrier to Vulkan, you have to describe which source and destination you're synchronizing. The source and destination both include which stage is doing the access, and which kind of access it is (e.g. sampled texture vs storage texture). If you just supply the stages, but don't supply the accesses, that describes an execution dependency. If you supply both, that describes a memory dependency.

Semaphores always describe memory dependencies, and they don't ask you what kind of accesses they're synchronizing with - they presumably pessimize and assume they have to synchronize as if all kinds of accesses happened. Instead, they just ask you which stages they should provide a memory barrier between.

I should also probably mention that it's possible for *you* to pessimize too - the enum for access kind is a bitmask, so you can specify multiple values, and it also has values for all and none.
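Here's a sketch of what a barrier description looks like as data, under the assumption that stages and accesses are both bitmasks. The `Stage`, `Access`, and `Barrier` types are stand-ins I wrote for illustration, with only a handful of made-up members, not the real Vulkan enums.

```python
from dataclasses import dataclass
from enum import Flag

class Stage(Flag):
    VERTEX_SHADER = 1 << 0
    FRAGMENT_SHADER = 1 << 1
    COMPUTE_SHADER = 1 << 2
    TRANSFER = 1 << 3

class Access(Flag):
    NONE = 0
    SHADER_SAMPLED_READ = 1 << 0
    SHADER_STORAGE_WRITE = 1 << 1
    TRANSFER_WRITE = 1 << 2

@dataclass(frozen=True)
class Barrier:
    src_stage: Stage
    dst_stage: Stage
    src_access: Access = Access.NONE
    dst_access: Access = Access.NONE

    def kind(self):
        # Stages alone describe an execution dependency;
        # adding access masks turns it into a memory dependency.
        if self.src_access == Access.NONE and self.dst_access == Access.NONE:
            return "execution"
        return "memory"

# Stages only: the fragment shader must finish before the transfer starts.
war = Barrier(Stage.FRAGMENT_SHADER, Stage.TRANSFER)

# Stages plus accesses, with a pessimized source mask covering two write kinds.
raw = Barrier(
    Stage.COMPUTE_SHADER | Stage.TRANSFER,
    Stage.FRAGMENT_SHADER,
    Access.SHADER_STORAGE_WRITE | Access.TRANSFER_WRITE,
    Access.SHADER_SAMPLED_READ,
)

print(war.kind())  # execution
print(raw.kind())  # memory
```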

Pipeline Barriers et al.

Pipeline barriers also do 2 more things: image layouts and queue transfers.

Image layouts are fundamentally different than what I've been describing so far. Previously, I've been describing synchronization - cache operations and source and destination processors. On the other hand, an image layout is a state that the image is in. The spec says it's an in-memory ordering of the data blocks within the image. There are lots of different layouts an image can be in, with each one being optimized for some particular purpose. Transitioning an image may require reading and writing all of the data within the image - to move its blocks around. So, you can't pessimize here in the same way you can pessimize about synchronization (and issue the biggest pipeline barrier possible between every command) - instead, if you want your accesses to be optimal, you have to remember which layout every (region of every) image is in, and transition it as necessary. If you were going to pessimize, you'd just leave the image in the "general" layout and never change it.
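The "remember which layout everything is in" part can be sketched like this. The `LayoutTracker` class and the layout names are hypothetical; the layout strings are just labels, and the sketch tracks a single subresource key to keep it short.

```python
class LayoutTracker:
    """Remembers the current layout of each subresource and counts transitions."""
    def __init__(self):
        self.current = {}     # subresource key -> layout name
        self.transitions = 0

    def use(self, subresource, wanted_layout):
        """Record a use. Returns the (old, new) transition needed, or None."""
        old = self.current.get(subresource, "undefined")
        if old == wanted_layout:
            return None
        self.current[subresource] = wanted_layout
        self.transitions += 1
        return (old, wanted_layout)

image = ("scene_color", 0, 0)  # (image name, mip level, array layer)
uses = ["color_attachment", "shader_read_only", "shader_read_only", "color_attachment"]

# Strategy 1: always transition to the layout that's optimal for the use.
optimal = LayoutTracker()
for layout in uses:
    optimal.use(image, layout)

# Strategy 2: pessimize, and leave the image in "general" forever.
general = LayoutTracker()
for _ in uses:
    general.use(image, "general")

print(optimal.transitions, general.transitions)  # 3 1
```

Whether 3 transitions with optimal layouts beats 1 transition with a suboptimal layout is exactly the thing the sketch can't tell you.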

Queue transfers are somewhat similar, in that they are state the image is in. When creating an image, you decide whether the image is "exclusive" to a single queue, or shared among multiple queues. If the image is shared among multiple queues, you have to use synchronization to make sure the different queues don't step on each other's toes. Otherwise, if the image is exclusive, you can change which queue it's owned by with a queue transfer in a pipeline barrier. It's actually pretty straightforward - the source queue issues the pipeline barrier to release its ownership, and the destination queue issues the same pipeline barrier to acquire ownership.
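The release/acquire handshake can be modeled as a two-step state machine. This is only a sketch: `ExclusiveImage` and the queue names are invented, and it ignores how the two halves get ordered in time across queues.

```python
class ExclusiveImage:
    """Models ownership of an exclusive image moving between queues."""
    def __init__(self, owner):
        self.owner = owner   # the queue that currently owns the image
        self.pending = None  # (src, dst) once the release half has been issued

    def release(self, queue, src, dst):
        if queue != src or self.owner != src:
            raise RuntimeError("release must be issued on the owning queue")
        self.pending = (src, dst)

    def acquire(self, queue, src, dst):
        if queue != dst or self.pending != (src, dst):
            raise RuntimeError("acquire must match a prior release, on the destination queue")
        self.owner, self.pending = dst, None

image = ExclusiveImage(owner="graphics")
image.release("graphics", src="graphics", dst="compute")  # source queue's half
image.acquire("compute", src="graphics", dst="compute")   # destination queue's half
print(image.owner)  # compute
```

Both halves name the same source and destination, which is what "issues the same pipeline barrier" means here.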

Which Bytes?

There's one last piece of Vulkan synchronization - and that is chopping up resources. When you issue a pipeline barrier on a buffer, you get to say which byte region of the buffer the synchronization applies to. (If you supply something that the implementation can't exactly match - let's say you didn't supply it on page boundaries or something - the implementation is allowed to pessimize and synchronize more of the resource than you asked for.)

For textures, it's a little more complicated. You can't issue a pipeline barrier for a particular rectangular region of a 2D texture. Instead, your barriers can target a specific mip level range, array layer range, and aspect range (aspect = depth part, stencil part, or color part). Each mip level, array layer, or aspect of a texture can be in a different layout and synchronized independently.
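In other words, a texture barrier targets a box in (mip, layer, aspect) space, and two accesses only conflict if their boxes overlap. Here's a sketch of that overlap test. The `SubresourceRange` class is my own, with field names that echo the ones in `VkImageSubresourceRange`, and aspects are modeled as a set of strings rather than a bitmask.

```python
from dataclasses import dataclass

def intervals_overlap(a_start, a_count, b_start, b_count):
    """True if [a_start, a_start + a_count) and [b_start, b_start + b_count) intersect."""
    return a_start < b_start + b_count and b_start < a_start + a_count

@dataclass(frozen=True)
class SubresourceRange:
    base_mip: int
    mip_count: int
    base_layer: int
    layer_count: int
    aspects: frozenset

    def overlaps(self, other):
        # The ranges conflict only if they intersect on all three axes.
        return (
            intervals_overlap(self.base_mip, self.mip_count, other.base_mip, other.mip_count)
            and intervals_overlap(self.base_layer, self.layer_count, other.base_layer, other.layer_count)
            and bool(self.aspects & other.aspects)
        )

depth_write = SubresourceRange(0, 1, 0, 6, frozenset({"depth"}))
depth_read = SubresourceRange(0, 4, 2, 1, frozenset({"depth"}))
stencil_read = SubresourceRange(0, 4, 2, 1, frozenset({"stencil"}))

print(depth_write.overlaps(depth_read))    # True
print(depth_write.overlaps(stencil_read))  # False
```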

The Story

So, as your Vulkan program runs, it will issue reads and writes to specific regions of resources. The sequence of reads and writes to a particular part of a resource will cause hazards, and you need to classify the kinds of hazards and issue synchronization calls to cause them to be synchronized. You can use synchronization calls to create either execution dependencies or memory dependencies. Within a single queue, you use pipeline barriers (or events) to synchronize, and across queues you use semaphores (which act upon entire command buffers), and to synchronize command buffers with the host you use fences.

There's kind of a problem, though - hazards become apparent at the site of recording the destination access, but the necessary pipeline barrier requires you to specify the *previous* access. Now, maybe you *just know* the previous access - you wrote your rendering engine, after all - but usually command buffers are recorded independently (perhaps even in parallel). At the time you record a command into a command buffer, you may not know what the previous access was to that portion of the resource - maybe the last access happened in a totally different command buffer that hasn't even been recorded yet.

The natural solution to this is a two-phase tracking solution: within a single command buffer, try to issue whatever pipeline barriers you know are necessary. For the ones you can't issue because you don't know the source accesses, simply remember those (and don't issue pipeline barriers for them). Your queue submission system can then do its own global tracking to use semaphores to synchronize with whichever accesses actually got submitted just before the current command buffer's accesses.
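Here's a minimal sketch of that two-phase scheme, with a lot of assumptions: resources are tracked whole (no subresources), accesses are just string labels, read-after-read pairs aren't filtered out, and `CommandBufferRecorder` and `SubmissionTracker` are hypothetical names.

```python
class CommandBufferRecorder:
    """Phase 1: per-command-buffer tracking, done at record time."""
    def __init__(self, name):
        self.name = name
        self.last_access = {}  # resource -> most recent access in this buffer
        self.barriers = []     # (resource, src_access, dst_access), resolved locally
        self.unresolved = {}   # resource -> first access, whose source is unknown

    def access(self, resource, access):
        if resource in self.last_access:
            # The previous access is in this same buffer, so the barrier is known.
            self.barriers.append((resource, self.last_access[resource], access))
        else:
            # The previous access lives elsewhere; remember it for submission time.
            self.unresolved[resource] = access
        self.last_access[resource] = access

class SubmissionTracker:
    """Phase 2: global tracking, done at queue submission time."""
    def __init__(self):
        self.global_last = {}  # resource -> (command buffer name, access)

    def submit(self, recorder):
        """Return the cross-command-buffer dependencies this submission needs."""
        dependencies = []
        for resource, first_access in recorder.unresolved.items():
            if resource in self.global_last:
                src_buffer, src_access = self.global_last[resource]
                dependencies.append((resource, src_buffer, src_access, first_access))
        for resource, access in recorder.last_access.items():
            self.global_last[resource] = (recorder.name, access)
        return dependencies

shadow = CommandBufferRecorder("shadow")
shadow.access("shadow_map", "depth_write")

main = CommandBufferRecorder("main")
main.access("shadow_map", "sampled_read")
main.access("color", "color_write")
main.access("color", "sampled_read")

tracker = SubmissionTracker()
print(tracker.submit(shadow))  # []
print(tracker.submit(main))    # [('shadow_map', 'shadow', 'depth_write', 'sampled_read')]
print(main.barriers)           # [('color', 'color_write', 'sampled_read')]
```

The two recorders never look at each other, so they could be filled in on different threads. Only `submit()` needs the global view.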

I'll be discussing this more in a forthcoming Part 2.