Tuesday, November 28, 2023

Nvidia SLI from Vulkan's Point of View

SLI is an Nvidia technology that is supposed to let multiple GPUs act as one. The promise is simple: you turn it on, and everything gets faster. However, that's not how it works in Vulkan (because of course it isn't - nothing is simple in Vulkan). So let's dig in and see exactly how it works and what's exposed in Vulkan.

Logical Device Creation

SLI is exposed in Vulkan with 2 extensions, both of which have been promoted to core in Vulkan 1.1: VK_KHR_device_group_creation, and VK_KHR_device_group. The reason there are 2 is esoteric: one is an "instance extension" and the other is a "device extension." Because enumerating device groups has to happen before you actually create a logical device, those enumeration functions can't be part of a device extension, so they're part of the instance extension instead. The instance extension is really small - it essentially just lets you list device groups, and for each group, list the physical devices inside it. When you create your logical device, you just list which physical devices should be part of the new logical device.
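As a rough sketch (assuming a Vulkan 1.1 instance, where these functions are core, and eliding error handling), enumeration and device creation look something like this:

    // List the device groups, then the physical devices inside group 0.
    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, NULL);
    VkPhysicalDeviceGroupProperties groups[8] = {0}; // assume <= 8 groups
    for (uint32_t i = 0; i < groupCount; ++i)
        groups[i].sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups);

    // Create one logical device spanning every physical device in group 0.
    VkDeviceGroupDeviceCreateInfo groupInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_DEVICE_CREATE_INFO,
        .physicalDeviceCount = groups[0].physicalDeviceCount,
        .pPhysicalDevices = groups[0].physicalDevices,
    };
    VkDeviceCreateInfo deviceInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .pNext = &groupInfo,
        /* queue create infos, features, etc. as usual ... */
    };
    VkDevice device;
    vkCreateDevice(groups[0].physicalDevices[0], &deviceInfo, NULL, &device);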

Now that you've created your logical device, there are a few different pieces to how this stuff works.

Beginning of the Frame

At the beginning of your frame, you would normally call vkAcquireNextImageKHR(), which schedules a semaphore to be signaled when the next swapchain image is "acquired" (which means "able to be rendered to"). (The rest of your rendering is supposed to wait on this semaphore to be signaled.) VK_KHR_device_group replaces this function with vkAcquireNextImage2KHR(), which adds a single parameter: a "device mask" of which physical devices in the logical device should be ready before the semaphore is signaled.
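A sketch, assuming a two-GPU group (so the mask 0x3 names both physical devices) and pre-existing swapchain and semaphore handles:

    VkAcquireNextImageInfoKHR acquireInfo = {
        .sType = VK_STRUCTURE_TYPE_ACQUIRE_NEXT_IMAGE_INFO_KHR,
        .swapchain = swapchain,
        .timeout = UINT64_MAX,
        .semaphore = imageAcquiredSemaphore,
        .fence = VK_NULL_HANDLE,
        .deviceMask = 0x3, // signal only once the image is ready on both GPUs
    };
    uint32_t imageIndex;
    vkAcquireNextImage2KHR(device, &acquireInfo, &imageIndex);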

It took me a while to figure this out, but each physical device gets its own distinct contents of the swapchain image. When you write your Vulkan program and bind a swapchain image to a framebuffer, that actually binds n different contents - one on each physical device. When a physical device executes and interacts with the image, it sees its own independent contents of the image.

End of the Frame

At the end of the frame, you'll want to present, and this is where things get a little complicated. Each physical device in the logical device may or may not have a "presentation engine" in it. Also, recall that each physical device has its own distinct contents of the swapchain image.

There are 4 different presentation "modes" (VkDeviceGroupPresentModeFlagBitsKHR). Your logical device will support some subset of these modes. The 4 modes are:

  1. Local presentation: Any physical device with a presentation engine can present, but it can only present the contents of its own image. When you present, you tell Vulkan which physical device and image to present (VkDeviceGroupPresentInfoKHR).
  2. Remote presentation: Any physical device with a presentation engine can present, and it can present contents from other physical devices. Vulkan exposes a graph (vkGetDeviceGroupPresentCapabilitiesKHR()) that describes which physical devices can present from which other physical devices in the group. When you present, you tell Vulkan which image to present, and there's a requirement that some physical device with a presentation engine is able to present the image you selected.
  3. Sum presentation: Any physical device with a presentation engine can present, and it presents the component-wise sum of the contents of the image from multiple physical devices. Again, there's a graph that indicates, for each physical device that has a presentation engine, which other physical devices it's able to sum from. When you present, you specify which physical devices' contents to sum, via a device mask (and there's a requirement that there is some physical device with a presentation engine that can sum from all of the requested physical devices).
  4. Local multi-device presentation: Different physical devices (with presentation engines) can present different disjoint rects of their own images, which get merged together to a final image. You can tell which physical devices present which rects by calling vkGetPhysicalDevicePresentRectanglesKHR(). When you present, you specify a device mask, which tells which physical devices present their rects.

On my machine, only the local presentation mode is supported, and both GPUs have presentation engines. That means the call to present gets to pick (VkDeviceGroupPresentInfoKHR) which of the two image contents actually gets presented.
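Here's roughly what that looks like (a sketch; the capability query is shown for completeness, and the wait semaphores are assumed to be set up as usual):

    // Which modes does this device group support?
    VkDeviceGroupPresentCapabilitiesKHR caps = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_PRESENT_CAPABILITIES_KHR,
    };
    vkGetDeviceGroupPresentCapabilitiesKHR(device, &caps);
    // On my machine, caps.modes only contains the local bit.

    // Present physical device 1's contents of the swapchain image.
    uint32_t deviceMask = 1u << 1;
    VkDeviceGroupPresentInfoKHR groupPresentInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_PRESENT_INFO_KHR,
        .swapchainCount = 1,
        .pDeviceMasks = &deviceMask,
        .mode = VK_DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR,
    };
    VkPresentInfoKHR presentInfo = {
        .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .pNext = &groupPresentInfo,
        .swapchainCount = 1,
        .pSwapchains = &swapchain,
        .pImageIndices = &imageIndex,
        /* wait semaphores as usual ... */
    };
    vkQueuePresentKHR(queue, &presentInfo);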

Middle of the Frame

The commands in the middle of the frame are probably actually the most straightforward. When you begin a command buffer, you can specify a device mask (VkDeviceGroupCommandBufferBeginInfo) of which physical devices will execute the command buffer. Inside the command buffer, when you start a render pass, you can also specify another device mask (VkDeviceGroupRenderPassBeginInfo) for which physical devices will execute the render pass, as well as assigning each physical device its own distinct "render area" rect. Inside the render pass, you can call vkCmdSetDeviceMask() to change the set of currently running physical devices. In your SPIR-V shader, there's even a built-in "DeviceIndex" to tell you which GPU in the group you're running on. And then, finally, when you actually submit the command buffer, you can supply (VkDeviceGroupSubmitInfo) a device mask for each command buffer, telling which physical devices it should run on.
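Putting those pieces together, a command buffer that runs on both GPUs but gives each its own render area might look roughly like this (a sketch, assuming a two-GPU group; width, height, and the render pass handles are placeholders):

    // This command buffer will execute on both physical devices.
    VkDeviceGroupCommandBufferBeginInfo groupBeginInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_COMMAND_BUFFER_BEGIN_INFO,
        .deviceMask = 0x3,
    };
    VkCommandBufferBeginInfo beginInfo = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        .pNext = &groupBeginInfo,
    };
    vkBeginCommandBuffer(commandBuffer, &beginInfo);

    // Each physical device renders its own half of the frame.
    VkRect2D renderAreas[2] = {
        { { 0, 0 },                    { width / 2, height } }, // GPU 0: left
        { { (int32_t)(width / 2), 0 }, { width / 2, height } }, // GPU 1: right
    };
    VkDeviceGroupRenderPassBeginInfo groupRenderPassInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_RENDER_PASS_BEGIN_INFO,
        .deviceMask = 0x3,
        .deviceRenderAreaCount = 2,
        .pDeviceRenderAreas = renderAreas,
    };
    VkRenderPassBeginInfo renderPassInfo = {
        .sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,
        .pNext = &groupRenderPassInfo,
        /* renderPass, framebuffer, clear values as usual ... */
    };
    vkCmdBeginRenderPass(commandBuffer, &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);
    /* ... draws run on both GPUs ... */
    vkCmdSetDeviceMask(commandBuffer, 0x1); // now only GPU 0 executes
    /* ... more commands ... */
    vkCmdEndRenderPass(commandBuffer);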

There's even a convenience command, vkCmdDispatchBase(), which lets you set "base" values for workgroup IDs - handy if you want to spread one compute workload across multiple GPUs. Pipelines have to be created with VK_PIPELINE_CREATE_DISPATCH_BASE_KHR to use it, though.
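For instance, to split a 1024-workgroup-wide dispatch evenly across two GPUs (a sketch; assumes a compute pipeline created with that flag is already bound):

    // GPU 0 computes workgroups [0, 512); GPU 1 computes [512, 1024).
    // Each invocation's workgroup ID reflects the base offset, so the two
    // halves compute disjoint ranges.
    vkCmdSetDeviceMask(commandBuffer, 0x1);
    vkCmdDispatchBase(commandBuffer, 0, 0, 0, 512, 1, 1);
    vkCmdSetDeviceMask(commandBuffer, 0x2);
    vkCmdDispatchBase(commandBuffer, 512, 0, 0, 512, 1, 1);
    vkCmdSetDeviceMask(commandBuffer, 0x3); // restore both GPUs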

Resources

It's all well and good to have multiple physical devices executing the same command buffer, but execution alone is not enough: you also need to bind resources to the shaders and commands that get run.

When you allocate memory, there are 2 possibilities: either each physical device gets its own distinct instance of the allocation, or all the physical devices share a single one. If the allocation's heap is marked as VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHR, then every allocation from it is replicated, with a distinct instance on each physical device. Even if the heap isn't marked that way, an individual allocation can still be marked that way (VkMemoryAllocateFlagsInfo). On my device, the GPU-local heap is marked as multi-instance.
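If I understand VkMemoryAllocateFlagsInfo correctly, the per-allocation marking looks something like this (a sketch; size and memoryTypeIndex are placeholders):

    // Explicitly specify which physical devices get an instance of this
    // allocation (here: both GPUs in a two-GPU group).
    VkMemoryAllocateFlagsInfo flagsInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO,
        .flags = VK_MEMORY_ALLOCATE_DEVICE_MASK_BIT,
        .deviceMask = 0x3,
    };
    VkMemoryAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &flagsInfo,
        .allocationSize = size,
        .memoryTypeIndex = memoryTypeIndex,
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &allocInfo, NULL, &memory);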

Communication can happen between the devices by using Vulkan's existing memory binding infrastructure. Recall that, in Vulkan, you don't just create a resource; instead, you make an allocation, and a resource, and then you separately bind the two together. Well, it's possible to bind a resource on one physical device with an allocation on a different physical device (VkBindBufferMemoryDeviceGroupInfo, VkBindImageMemoryDeviceGroupInfo)! When you make one of these calls, it will execute on all the physical devices, so these structs indicate the graph of which resources on which physical devices get bound to which allocations on which (other) physical devices. For textures, you can even be more fine-grained than this, and bind just a region of a texture across physical devices (assuming you created the image with VK_IMAGE_CREATE_SPLIT_INSTANCE_BIND_REGIONS_BIT_KHR). This also works with sparse resources - when you bind a sparse region of a texture, that sparse region can come from another physical device, too (VkDeviceGroupBindSparseInfo).
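For example, to bind a buffer so that both physical devices in a two-GPU group use GPU 0's instance of a multi-instance allocation (a sketch; buffer and memory are placeholders):

    // pDeviceIndices[i] = j means: the resource on physical device i is
    // bound to the allocation's instance on physical device j.
    uint32_t deviceIndices[2] = { 0, 0 }; // both GPUs bind to GPU 0's memory
    VkBindBufferMemoryDeviceGroupInfo groupBindInfo = {
        .sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_DEVICE_GROUP_INFO,
        .deviceIndexCount = 2,
        .pDeviceIndices = deviceIndices,
    };
    VkBindBufferMemoryInfo bindInfo = {
        .sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_INFO,
        .pNext = &groupBindInfo,
        .buffer = buffer,
        .memory = memory,
        .memoryOffset = 0,
    };
    vkBindBufferMemory2(device, 1, &bindInfo);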

Alas, there are restrictions. vkGetDeviceGroupPeerMemoryFeatures() tells you, once you've created a resource and bound it to an allocation on a different physical device, how you're allowed to use that resource. For each combination of (heap index, local device index, and remote device index), a subset of 4 possible uses will be allowed (VkPeerMemoryFeatureFlagBits):

  1. The local device can copy to the remote device
  2. The local device can copy from the remote device
  3. The local device can read the resource directly
  4. The local device can write to the resource directly

This is really exciting - if either of the bottom two uses is allowed, it means you can bind one of these cross-physical-device resources to a shader and use it as if it were any normal resource! Even if neither of the bottom two uses is allowed, just being able to copy between devices without having to round-trip through main memory is already cool. On my device, only the first 3 uses are allowed.
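Querying this looks something like the following (a sketch; heap index 0 is assumed to be the GPU-local heap):

    // How may physical device 0 access physical device 1's instance of
    // memory allocated from heap 0?
    VkPeerMemoryFeatureFlags features;
    vkGetDeviceGroupPeerMemoryFeatures(device, 0 /* heap */,
        0 /* local device */, 1 /* remote device */, &features);
    if (features & VK_PEER_MEMORY_FEATURE_GENERIC_SRC_BIT) {
        // Device 0 can read device 1's memory directly (e.g. from a shader).
    }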

Swapchain Resources

Being able to bind a new resource to a different physical device's allocation is good, but swapchain images come pre-bound, which means that mechanism won't work for swapchain resources. So there's a new mechanism for that: it's possible to bind a new image to the storage for an existing swapchain image (VkBindImageMemorySwapchainInfoKHR). This can be used in conjunction with VkBindImageMemoryDeviceGroupInfo, which I mentioned above, to make the bindings cross physical devices.

So, if you want to copy from one physical device's swapchain image to another physical device's swapchain image, here's what you'd do (sketched in code after the list):

  1. Create a new image (of the right size, format, etc.). Specify VkImageSwapchainCreateInfoKHR to indicate its storage will come from the swap chain.
  2. Bind it (VkBindImageMemoryInfo), but use both...
    1. VkBindImageMemorySwapchainInfoKHR to have its storage come from the swap chain, and
    2. VkBindImageMemoryDeviceGroupInfo to specify that its storage comes from another physical device's swap chain contents
  3. Execute a copy command to copy from one image to the other image.
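Steps 1 and 2 might look roughly like this (a sketch; it assumes the swapchain and imageIndex handles from earlier, elides most image creation fields, and the copy in step 3 is an ordinary vkCmdCopyImage):

    // Step 1: create an image whose storage will come from the swapchain.
    VkImageSwapchainCreateInfoKHR imageSwapchainInfo = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_SWAPCHAIN_CREATE_INFO_KHR,
        .swapchain = swapchain,
    };
    VkImageCreateInfo imageInfo = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .pNext = &imageSwapchainInfo,
        /* format, extent, usage, etc. must match the swapchain ... */
    };
    VkImage peerImage;
    vkCreateImage(device, &imageInfo, NULL, &peerImage);

    // Step 2: bind it to the other physical device's contents of swapchain
    // image `imageIndex`. With the indices "swapped" like this, on GPU 0
    // the new image aliases GPU 1's contents, and vice versa.
    uint32_t deviceIndices[2] = { 1, 0 };
    VkBindImageMemoryDeviceGroupInfo groupBindInfo = {
        .sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_DEVICE_GROUP_INFO,
        .deviceIndexCount = 2,
        .pDeviceIndices = deviceIndices,
    };
    VkBindImageMemorySwapchainInfoKHR swapchainBindInfo = {
        .sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_SWAPCHAIN_INFO_KHR,
        .pNext = &groupBindInfo,
        .swapchain = swapchain,
        .imageIndex = imageIndex,
    };
    VkBindImageMemoryInfo bindInfo = {
        .sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO,
        .pNext = &swapchainBindInfo,
        .image = peerImage,
        .memory = VK_NULL_HANDLE, // storage comes from the swapchain instead
        .memoryOffset = 0,
    };
    vkBindImageMemory2(device, 1, &bindInfo);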

Conclusion

It's a pretty complicated system! Certainly much more complicated than SLI is in Direct3D. It seems like there are 3 core benefits of device groups:

  1. You can execute the same command stream on multiple devices without having to re-encode it multiple times or call into Vulkan multiple times for each command. The device masks implicitly duplicate the execution.
  2. There are a variety of presentation modes, which allow automatic merging of rendering results, without having to explicitly execute a render pass or a compute shader to merge the results. Unfortunately, my cards don't support this.
  3. Direct physical-device-to-physical-device communication, without round-tripping through main memory. Indeed, for some use cases, you can just bind a remote resource and use it as if it were local. Very cool!

I'm not quite at the point where I can run some benchmarks to see how much SLI improves performance over simply creating two independent Vulkan logical devices. I'm working on a ray tracer, so there are a few different ways of joining the rendering results from the two GPUs. To avoid seams, the denoiser will probably have to run on just one of the GPUs.