Wednesday, July 3, 2024

Vulkan Synchronization: What it is (Part 1)

Synchronization in Vulkan is actually somewhat tricky to grok. And you have to grok it, because it isn't done automatically for you - you have to explicitly synchronize around every data hazard that might arise. (This is in contrast to APIs like Metal, which do most, or even all, of the synchronization for you.)

Hazards

So let's start out with: What's a data hazard? A data hazard is when two operations can't safely happen simultaneously, either because one operation depends on the result of the other (think: a read that depends on a previous write) or because one needs to be independent of the other (think: a read before a write needs to *not* see the result of that write). To prevent these, you make explicit calls to Vulkan to tell it which operations must not happen simultaneously with which others.

There are 2 flavors of synchronization in Vulkan: execution dependencies and memory dependencies.

Execution dependencies are the simplest - they don't say *anything* about memory, but instead *just* describe a "happens-before" relationship. This is sufficient for a write-after-read hazard, where the read shouldn't see the results of the write - it's enough to just delay the write until after the read has completed.

Memory dependencies imply execution dependencies - so memory dependencies are strictly more powerful than execution dependencies. A memory dependency is where there is actually communication going on, through memory. It is sufficient for read-after-write hazards, where the read needs to see the result of the write, and also for write-after-write hazards. There are 2 pieces to it:

  • The first operation's results become "available". Think of this as a cache flush: after the first operation, the data has to actually move from the source cache out to main memory.
  • The memory becomes "visible" to the second operation. Think of this as a cache invalidation: if the receiver wants to see the result of the previous operation, it has to mark its own cache as invalid, so its access actually goes out to main memory to see the results.
There is no such thing as a read-read hazard, so no synchronization is necessary in that case.

Tools


Vulkan provides many tools that you can use to synchronize.
  • Fences are used to synchronize the device ("device" = "GPU") with the host ("host" = "CPU"). You specify a fence to be signaled when the device work is done as part of vkQueueSubmit(). This kind of synchronization is actually (surprisingly) really easy, because Section 7.9 of the spec essentially says that you don't need to deal with fences when uploading data to the device, and the spec for vkWaitForFences() indicates that the only thing you need to do when downloading data from the device is simply wait on the fence. (Also, if your resource isn't coherently mapped, you need to use vkFlushMappedMemoryRanges() and vkInvalidateMappedMemoryRanges() so the CPU's caches will get out of the way.)
  • Semaphores are used to synchronize between different queues. They effectively form a dependency graph, where each batch of command buffers submitted to a queue specifies a set of semaphores to wait on before it starts executing, and a set of semaphores to signal when it's done executing.
  • Events are used to synchronize different commands within the same queue. Just because two operations are submitted to the same queue does not mean they will execute in order. Events are "split" in that vkCmdSetEvent() is distinct from vkCmdWaitEvents(). These are commands that get recorded into a command buffer.
  • Pipeline barriers are also used to synchronize different commands within the same queue, but the difference with events is that barriers aren't split - there's just a single vkCmdPipelineBarrier(). This is also a command that gets recorded into a command buffer.
I won't say much more about fences - as I described above, you really don't need to think about them much. One of the cool things about fences is that you can "export" one to a "sync fd" on UNIX platforms, and the resulting fd can be used in select() and epoll(), which makes for a nice way to integrate into your app's existing run loop.
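Still, to make it concrete, here's the whole download flow in one place. This is just a sketch with hypothetical names (read_back, and all the handles, which are assumed to have been created elsewhere); error handling is omitted:

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Sketch: submit work, wait for it with a fence, then read back from a
       mapped allocation that is not coherently mapped. */
    void read_back(VkDevice device, VkQueue queue, VkCommandBuffer cmd,
                   VkDeviceMemory memory) {
        VkFenceCreateInfo fenceInfo = {
            .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
        };
        VkFence fence;
        vkCreateFence(device, &fenceInfo, NULL, &fence);

        VkSubmitInfo submit = {
            .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
            .commandBufferCount = 1,
            .pCommandBuffers = &cmd,
        };
        /* The fence gets signaled when this submission's work completes. */
        vkQueueSubmit(queue, 1, &submit, fence);
        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

        /* Only needed without VK_MEMORY_PROPERTY_HOST_COHERENT_BIT: toss the
           CPU's stale cache lines so the read sees the device's writes. */
        VkMappedMemoryRange range = {
            .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
            .memory = memory,
            .offset = 0,
            .size = VK_WHOLE_SIZE,
        };
        vkInvalidateMappedMemoryRanges(device, 1, &range);

        /* The pointer from vkMapMemory() now holds the results. */
        vkDestroyFence(device, fence, NULL);
    }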

I also won't say much more about events - they're just the same thing as pipeline barriers, but split into two halves.
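For completeness, here's what the split looks like (a sketch; assume cmd is a command buffer being recorded and event came from vkCreateEvent()). The point of the split is that work recorded between the two halves is free to overlap with the work before the set:

    /* First half: the event signals once the compute stage's work is done. */
    vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

    /* ... record unrelated commands here; they may overlap the work above ... */

    /* Second half: wait on the event, with a memory barrier making the
       compute writes available and visible to fragment-shader reads. */
    VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    };
    vkCmdWaitEvents(cmd, 1, &event,
                    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  /* srcStageMask */
                    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, /* dstStageMask */
                    1, &barrier, 0, NULL, 0, NULL);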

Stages and Accesses


Accesses by the GPU happen within a "pipeline," which is composed of a sequence of stages. For example, the vertex shader and the fragment shader are stages of the graphics pipeline (among many other stages). Vulkan is designed so that each stage within a pipeline can execute on a totally different chip than any of the other stages - and each chip will have its own cache. Therefore, if your pipeline has n stages, Vulkan forces you to pessimize and assume that each of those n stages will execute on a different chip with a different set of caches, so you have n different caches to manage.

However, it's actually worse than that. Each stage might be able to access a resource in a variety of different ways - for example, as a sampled texture vs. a storage texture. Each of these different ways to access a resource *also* might have its own cache - the cache used for storage textures might be a totally different cache than the cache used for sampled textures. So you actually have more than n caches to worry about.

The more caches you have to manage, the more synchronization calls you need to make.

When you issue a pipeline barrier to Vulkan, you have to describe which source and destination you're synchronizing. The source and destination both include which stage is doing the access, and which kind of access it is (e.g. sampled texture vs storage texture). If you just supply the stages, but don't supply the accesses, that describes an execution dependency. If you supply both, that describes a memory dependency.
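Here's what both flavors look like (a sketch; cmd is a hypothetical command buffer being recorded):

    /* Execution dependency only: fragment-shader reads happen-before later
       transfer writes. No access masks, so no cache maintenance is
       requested - fine for a write-after-read hazard. */
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, /* srcStageMask */
                         VK_PIPELINE_STAGE_TRANSFER_BIT,        /* dstStageMask */
                         0,                                     /* dependencyFlags */
                         0, NULL, 0, NULL, 0, NULL);            /* no barriers */

    /* Memory dependency: make compute-shader writes "available" (flush the
       writer's cache) and "visible" (invalidate the reader's cache) to
       vertex-shader reads - what a read-after-write hazard needs. */
    VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                         0,
                         1, &barrier, 0, NULL, 0, NULL);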

Semaphores always describe memory dependencies, and they don't ask you what kinds of accesses they're synchronizing - presumably they pessimize and assume they have to synchronize as if every kind of access happened. Instead, they just ask you which stages they should provide a memory barrier between.
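In code, that looks like this (a sketch; uploadDone, renderDone, cmd, and queue are hypothetical names for objects created elsewhere):

    /* This submission waits on uploadDone before its transfer stage runs,
       and signals renderDone when all of its work completes. Note there are
       stage masks here, but no access masks anywhere. */
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &uploadDone,
        .pWaitDstStageMask = &waitStage,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &renderDone,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);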

I should also probably mention that it's possible for *you* to pessimize, too - the access-kind enum is a bitmask, so you can specify multiple values at once, and it also has values for "all" and "none".
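For instance, I believe this is the maximally pessimistic barrier you can express within a queue - correct for any hazard, and correspondingly slow:

    /* All commands, all access kinds, in both directions. */
    VkMemoryBarrier fullBarrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT,
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         0, 1, &fullBarrier, 0, NULL, 0, NULL);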

Pipeline Barriers et al.


Pipeline barriers also do 2 more things: image layout transitions and queue transfers.

Image layouts are fundamentally different from what I've been describing so far. Previously, I've been describing synchronization - cache operations and source and destination processors. An image layout, on the other hand, is a state that the image is in. The spec says it's an in-memory ordering of the data blocks within the image. There are lots of different layouts an image can be in, each one optimized for some particular purpose. Transitioning an image may require reading and rewriting all of the data within the image - to move its blocks around. So, you can't pessimize here in the same way you can pessimize about synchronization (and issue the biggest pipeline barrier possible between every command) - instead, if you want your accesses to be optimal, you have to remember which layout every (region of every) image is in, and transition it as necessary. If you were going to pessimize, you'd just leave the image in the "general" layout and never change it.
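A typical transition looks like this (a sketch; image and cmd are hypothetical): after copying into an image, move it from the transfer-destination layout to the shader-read layout, synchronizing the copy against sampling at the same time:

    VkImageMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED, /* no queue transfer */
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .levelCount = VK_REMAINING_MIP_LEVELS,
            .layerCount = VK_REMAINING_ARRAY_LAYERS,
        },
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                         0, 0, NULL, 0, NULL, 1, &barrier);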

Queue transfers are somewhat similar, in that queue ownership is also state the image is in. When creating an image, you decide whether the image is "exclusive" to a single queue or shared among multiple queues. If the image is shared among multiple queues, you have to use synchronization to make sure the different queues don't step on each other's toes. Otherwise, if the image is exclusive, you can change which queue owns it with a queue transfer in a pipeline barrier. It's actually pretty straightforward - the source queue issues a pipeline barrier to release its ownership, and the destination queue issues the same pipeline barrier to acquire ownership.
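Here's roughly what that looks like (a sketch; the queue family indices, command buffers, and image are hypothetical). The two submissions still need to be ordered with a semaphore:

    /* Release, recorded on the graphics queue. (The destination half of a
       release is ignored, so its access mask is left zero.) */
    VkImageMemoryBarrier release = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
        .dstAccessMask = 0,
        .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, /* a layout transition
                                                              can ride along */
        .srcQueueFamilyIndex = graphicsFamily,
        .dstQueueFamilyIndex = transferFamily,
        .image = image,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(graphicsCmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                         0, 0, NULL, 0, NULL, 1, &release);

    /* Acquire, recorded on the transfer queue: the same barrier, except the
       source half is the ignored one this time. */
    VkImageMemoryBarrier acquire = release;
    acquire.srcAccessMask = 0;
    acquire.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    vkCmdPipelineBarrier(transferCmd,
                         VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, NULL, 0, NULL, 1, &acquire);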

Which Bytes?


There's one last piece of Vulkan synchronization - and that is chopping up resources. When you issue a pipeline barrier on a buffer, you get to say which byte region of the buffer the synchronization applies to. (If you supply something that the implementation can't exactly match - let's say you didn't supply it on page boundaries or something - the implementation is allowed to pessimize and synchronize more of the resource than you asked for.)
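For example (a sketch; buffer and cmd are hypothetical), this synchronizes a transfer write against a compute read of just one 1024-byte slice:

    VkBufferMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .buffer = buffer,
        .offset = 256,  /* byte offset into the buffer */
        .size = 1024,   /* number of bytes to synchronize */
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         0, 0, NULL, 1, &barrier, 0, NULL);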

For textures, it's a little more complicated. You can't issue a pipeline barrier for a particular rectangular region of a 2D texture. Instead, your barriers can target a specific mip level range, array layer range, and aspect range (aspect = depth part, stencil part, or color part). Each mip level, array layer, or aspect of a texture can be in a different layout and synchronized independently.
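The selection happens through the VkImageSubresourceRange in the barrier. For example (a sketch), this targets only mip levels 2 through 4 of the first array layer, leaving every other subresource (and its layout) alone:

    VkImageSubresourceRange range = {
        .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
        .baseMipLevel = 2,
        .levelCount = 3,     /* mips 2, 3, and 4 */
        .baseArrayLayer = 0,
        .layerCount = 1,
    };
    /* This plugs into VkImageMemoryBarrier's subresourceRange field, as in
       the layout-transition example earlier. */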

The Story


So, as your Vulkan program runs, it will issue reads and writes to specific regions of resources. The sequence of reads and writes to a particular part of a resource will cause hazards, and you need to classify those hazards and issue the synchronization calls that resolve them. You can use synchronization calls to create either execution dependencies or memory dependencies. Within a single queue, you use pipeline barriers (or events) to synchronize; across queues, you use semaphores (which act upon entire command buffers); and to synchronize command buffers with the host, you use fences.

There's kind of a problem, though - hazards become apparent at the site of recording the destination access, but the necessary pipeline barrier requires you to specify the *previous* access. Now, maybe you *just know* the previous access - you wrote your rendering engine, after all - but usually command buffers are recorded independently (perhaps even in parallel). At the time you record a command into a command buffer, you may not know what the previous access was to that portion of the resource - maybe the last access happened in a totally different command buffer that hasn't even been recorded yet.

The natural solution to this is two-phase tracking: within a single command buffer, issue whatever pipeline barriers you know are necessary. For the ones you can't issue because you don't know the source accesses, simply remember them (and don't issue pipeline barriers for them). Your queue submission system can then do its own global tracking, using semaphores to synchronize with whichever accesses actually got submitted just before the current command buffer's accesses.
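To sketch the shape of that idea (entirely hypothetical types and names - this is what Part 2 is about):

    #include <stddef.h>
    #include <vulkan/vulkan.h>

    /* Per command buffer: accesses whose previous access we couldn't know at
       record time, so no barrier was emitted for them. */
    typedef struct {
        VkImage image;                  /* or a buffer + byte range */
        VkImageSubresourceRange range;
        VkPipelineStageFlags dstStage;  /* the stage about to access it */
        VkAccessFlags dstAccess;        /* the kind of access */
        VkImageLayout requiredLayout;   /* the layout it needs to be in */
    } PendingAccess;

    typedef struct {
        VkCommandBuffer cmd;
        PendingAccess *pending;         /* barriers we couldn't emit yet */
        size_t pendingCount;
    } TrackedCommandBuffer;

    /* At submit time, a global tracker knows which accesses were actually
       submitted just before this command buffer, so it can resolve each
       PendingAccess - e.g. by waiting on a semaphore that the earlier
       submission signals. */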

I'll be discussing this more in a forthcoming Part 2.
