Monday, March 18, 2024

So I wrote a double delete...

I wrote a double delete. Actually, it was a double autorelease. Here's a fun story describing the path it took to figure out the problem.

I'm essentially writing a plugin to a different app, and the app I'm plugging in to is closed-source. So, I'm making a dylib which gets loaded at runtime. I'm doing this on macOS.

The Symptom

When running my code, I'm seeing this:


Let's see what we can learn from this.

First, it's a crash. We can see we're accessing memory that we shouldn't be accessing.

Second, it's inside objc_release(). We can use a bit of deductive reasoning here: If the object we're releasing has a positive retain count, then the release shouldn't crash. Therefore, either we're releasing something that isn't an object, or we're releasing something that has a retain count of 0 (meaning: a double release).

Third, we can actually read a bit of the assembly to understand what's happening. The first two instructions are just a way to check if %rdi is null, and, if so, jump to an address that's later in the function. Therefore, we can deduce that %rdi isn't null.

%rdi is interesting because it's the register that holds the first argument. It's probably a safe assumption that objc_release() takes a single argument, that the argument is a pointer, and that the pointer is stored in %rdi. This assumption is somewhat validated by reading the assembly: nothing seems to be using any of the other parameter registers.

The next 3 lines check if the low bit in %rdi is 1 or not. If it's 1, then we again jump to an address that's later in the function. Therefore, we can deduce that %rdi is an even number (its low bit isn't 1).

The next 3 lines load a value that %rdi is pointing to, and mask off most of its bits. The next line, which is the line that's crashing, is trying to load the value that the result points to.

All this makes total sense: Releasing a null pointer should do nothing, and releasing tagged pointers (which I'm assuming are marked by having their low bit set to 1) should do nothing as well. If the argument is an Objective-C object, it looks like we're trying to load the isa pointer, which probably holds something useful at offset 0x20. That's the point where we're crashing.

That leads to the deduction: Either the thing we're trying to release isn't an Objective-C object, or it's already been released, and the release procedure clears (or somehow poisons) the isa value, which caused this crash. Either way, we're releasing something that we shouldn't be releasing.

One of the really useful observations about the assembly is that nothing before the crash point clobbers the value of %rdi. This means that a pointer to the object that's getting erroneously released is *still* in %rdi at the crash site.

We can also see that the crash is happening inside AutoreleasePool:

This doesn't indicate much - just that we're autoreleasing the object instead of releasing it directly. It also means that, because autorelease is delayed, we can't see anything useful in the stack trace. (If we were releasing directly instead of autoreleasing, we could see exactly what caused it in the stack trace.)

The First Thing That Didn't Work

The most natural solution would be "Let's use Instruments!" It's supposed to have a tool that shows all the retain stacks and release stacks for every object.

When running with Instruments, we get a nice crash popup showing us that we crashed:

The coolest part about this is that it shows us the register state at the crash site, which gives us %rdi, the pointer to the object getting erroneously released.

Cool, so the object which is getting erroneously released is at 0x600002a87b40. Let's see what Instruments lists for that address:

Well, it didn't list anything for that address. It listed something for an address just before it, and just after it, but not what we were looking for. Thanks for nothing, Instruments.

The Second Thing That Didn't Work

Well, I'm allocating and destroying objects in my own code. Why don't I try logging all of my own objects to see where they all get retained and released! Hopefully, by cross-referencing the address of the object that gets erroneously deleted with the logged locations of my own objects, I'll be able to tell what's going wrong.

We can do this by overriding the -[NSObject release] and -[NSObject retain] calls:

As well as init / dealloc:
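A minimal sketch of that kind of logging, using runtime swizzling (method_setImplementation) instead of a category override - the function names are made up, and retain, init, and dealloc would follow the same pattern:

#include <objc/runtime.h>
#include <stdio.h>

static IMP originalRelease;

static void myRelease(id self, SEL _cmd) {
    // Log the pointer and class, then forward to the original -[NSObject release].
    printf("release %p (%s)\n", (void *)self, class_getName(object_getClass(self)));
    ((void (*)(id, SEL))originalRelease)(self, _cmd);
}

__attribute__((constructor))
static void installReleaseLogging(void) {
    Method releaseMethod = class_getInstanceMethod(objc_getClass("NSObject"), sel_registerName("release"));
    originalRelease = method_setImplementation(releaseMethod, (IMP)myRelease);
}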

Unfortunately, this spewed out a bunch of logging, but the only thing it told me was that the object being erroneously released wasn't one of my own objects. It must be some other object (an NSString, NSArray, etc.).

The Third Thing That Didn't Work

Okay, we know the object is being erroneously autoreleased. Why don't we log some useful information every time anyone autoreleases anything? We can add a symbolic breakpoint on -[NSObject autorelease].

Here's what it looks like when this breakpoint is hit:

Interesting - so it looks like all calls to -[NSObject autorelease] are immediately redirected to _objc_rootAutorelease(). The self pointer is preserved as the value of the first argument.

If you list the registers at the time of the call, you can see the object being released:

So let's modify the breakpoint to print all the information we're looking for:

Unfortunately, this didn't work because it was too slow. Every time lldb evaluates something, it takes a while, and this was evaluating 3 things every time anybody wanted to autorelease anything, which is essentially all the time. The closed-source application I'm debugging is sensitive enough that, if anything takes too long, it just quits.

The Fourth Thing That Didn't Work

Let's try to print out the same information as before, but do it inside the application rather than in lldb. That way, it will be much faster.

The way we can do this is with something called "function interposing." This uses a feature of dyld which can replace a library's function with your own. Note that this only works if you disable SIP and set the nvram variable amfi_get_out_of_my_way=0x1 and reboot.

We can do this to swap out all calls to _objc_rootAutorelease() with our own function.
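A minimal sketch of the interposing mechanism, using the standard __DATA,__interpose pattern from dyld's headers (since _objc_rootAutorelease() isn't in the public headers, we declare it ourselves):

#include <objc/objc.h>

// _objc_rootAutorelease() isn't in the public headers, so declare it ourselves.
extern "C" id _objc_rootAutorelease(id object);

static id my_objc_rootAutorelease(id object) {
    // ... record `object` in our bookkeeping here ...
    // dyld doesn't rewrite calls made from the interposing dylib itself, so this calls the real one.
    return _objc_rootAutorelease(object);
}

// The (replacement, original) pair in the __DATA,__interpose section is what tells dyld
// to redirect every other image's calls to _objc_rootAutorelease() into our function.
__attribute__((used)) static struct {
    const void *replacement;
    const void *replacee;
} interposers[] __attribute__((section("__DATA,__interpose"))) = {
    { (const void *)my_objc_rootAutorelease, (const void *)_objc_rootAutorelease },
};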

Inside our own version of _objc_rootAutorelease(), we want to keep track of everything that gets autoreleased. So, let's keep track of a global dictionary, from pointer value to info string.

We can initialize this dictionary inside a "constructor," which is a special function in a dylib which gets run when the dylib gets loaded by dyld. This is a great way to initialize a global.

Inside my_objc_rootAutorelease(), we can just add information to the dictionary. Then, when the crash occurs, we can print the dictionary and find information about the thing that was autoreleased.
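A sketch of what the global dictionary and its constructor might look like. The names are made up, and this version uses a CF dictionary keyed by raw pointer value; the actual code used an NSMutableDictionary mapping pointer values to info strings:

#include <CoreFoundation/CoreFoundation.h>
#include <objc/objc.h>

static CFMutableDictionaryRef bigDict;

// A "constructor" runs when dyld loads the dylib, so it's a convenient place to set up globals.
__attribute__((constructor))
static void createBigDict(void) {
    // NULL key callbacks: the keys are raw pointer values, not retained objects.
    bigDict = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, NULL, &kCFTypeDictionaryValueCallBacks);
}

static void recordAutorelease(id object) {
    // Hypothetical info string; the real version recorded the class name and more.
    CFStringRef info = CFStringCreateWithFormat(kCFAllocatorDefault, NULL, CFSTR("autoreleased %p"), (void *)object);
    CFDictionarySetValue(bigDict, (const void *)object, info);
    CFRelease(info);
}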

However, something is wrong...

The dictionary only holds 315 items. That can't possibly be right - it's inconceivable that only 315 things got autoreleased.

The Fifth Thing That Didn't Work

We're close - we just need to figure out why so few things got autoreleased. Let's verify our assumption that [foo autorelease] actually calls _objc_rootAutorelease() by writing such code and looking at its disassembly.

And if you look at the disassembly...

You can see 2 really interesting things: the call to alloc and init got compressed into a single C call to objc_alloc_init(), and the call to autorelease got compressed into a single C call to objc_autorelease(). I suppose the Objective-C compiler knows about the autorelease message, and is smart enough to not invoke the entire objc_msgSend() infrastructure for it, but instead just emits a raw C call for it. So that means we've interposed the wrong function - we were interposing _objc_rootAutorelease() when we should have been interposing objc_autorelease(). So let's interpose both:

This, of course, almost worked - we just have to be super sure that my_objc_autorelease() doesn't accidentally call autorelease on any object - that would cause infinite recursion.

The Sixth Thing That Didn't Work

Avoiding calling autorelease inside my_objc_autorelease() is actually pretty much impossible, because anything interesting you could log about an object will, almost necessarily, call autorelease. Remember that we're logging information about literally every object which gets autoreleased, which is, in effect, every object in the entire world. Even if you just call NSStringFromClass([object class]), that will still cause something to be autoreleased.

So, the solution is to set some global state for the duration of the call to my_objc_autorelease(). If we see a call to my_objc_autorelease() while the state is set, that means we're autoreleasing inside being autoreleased, and we can skip our custom logic and just call the underlying objc_autorelease() directly. However, there's a caveat: this "global" state can't actually be global, because Objective-C objects are created and retained and released on every thread, which means this state has to be thread-local. Therefore, because we're writing in Objective-C and not C++, we must use the pthreads API. The pthreads thread-specific API uses a "key" which has to be set up once, so we can do that in our constructor:

Then we can use pthread_setspecific() and pthread_getspecific() to determine if our calls are being nested.
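Here's a minimal sketch of that guard, with made-up names; the key is created in a constructor, and the real logging goes where the comment is:

#include <pthread.h>
#include <objc/objc.h>

extern "C" id objc_autorelease(id object);

static pthread_key_t inMyAutoreleaseKey;

__attribute__((constructor))
static void createThreadLocalKey(void) {
    pthread_key_create(&inMyAutoreleaseKey, NULL);
}

static id my_objc_autorelease(id object) {
    if (pthread_getspecific(inMyAutoreleaseKey) != NULL) {
        // We're already inside our own logging on this thread: don't recurse, just forward.
        return objc_autorelease(object);
    }
    pthread_setspecific(inMyAutoreleaseKey, (void *)1);
    // ... build and record the info string; anything here is allowed to autorelease ...
    pthread_setspecific(inMyAutoreleaseKey, NULL);
    return objc_autorelease(object);
}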

Except this still didn't actually work, because abort() is being called...

The Seventh Thing That Didn't Work

Luckily, when abort() is called, Xcode shows us a pending Objective-C exception:

Okay, something is being set to nil when it shouldn't be. Let's set an exception breakpoint to see what is being set wrong:

Welp. It turns out NSStringFromClass([object class]) can sometimes return nil...

The Eighth Thing That Worked

Okay, let's fix that by checking for nil and using [NSNull null]. Now, the program actually crashes in the right place!

That's more like it. Let's see what the pointer we're looking for is...

Okay, let's look for it in bigDict!

Woohoo! Finally some progress. The object being autoreleased is an NSDictionary.

That's not enough, though. What we really want is a backtrace. We can't use lldb's backtrace because it's too slow, but luckily macOS has a backtrace() function which gives us backtrace information! Let's build a string out of the backtrace information:

Welp, that's too slow - the program exits. Let's try again by setting frameCount to 6:
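Here's a sketch of that backtrace capture with the small, fixed frame count (the function name and buffer handling are assumptions; strlcat is used because it's available on macOS):

#include <execinfo.h>
#include <string.h>
#include <stdlib.h>

static void copyShortBacktrace(char *buffer, size_t bufferSize) {
    const int frameCount = 6;
    void *frames[frameCount];
    int captured = backtrace(frames, frameCount);
    char **symbols = backtrace_symbols(frames, captured);
    buffer[0] = '\0';
    for (int i = 0; symbols != NULL && i < captured; ++i) {
        strlcat(buffer, symbols[i], bufferSize);
        strlcat(buffer, "\n", bufferSize);
    }
    free(symbols);
}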

So here is the final autorelease function:

Okay, now let's run it, and print out the object we're interested in:

And the bigDict:

Woohoo! It's a great success! Here's the stack trace, formatted:

Excellent! This was enough for me to find the place where I had over-released the object.

Tuesday, November 28, 2023

Nvidia SLI from Vulkan's Point of View

SLI is an Nvidia technology, which (is supposed to) allow multiple GPUs to act as one. The use case is supposed to be simple: you turn it on, and everything gets faster. However, that's not how it works in Vulkan (because of course it isn't - nothing is simple in Vulkan). So let's dig in and see exactly how it works and what's exposed in Vulkan.

Logical Device Creation

SLI is exposed in Vulkan with 2 extensions, both of which have been promoted to core in Vulkan 1.1: VK_KHR_device_group_creation, and VK_KHR_device_group. The reason there are 2 is esoteric: one is an "instance extension" and the other is a "device extension." Because enumerating device groups has to happen before you actually create a logical device, those enumeration functions can't be part of a device extension, so they're part of the instance extension instead. The instance extension is really small - it essentially just lets you list device groups, and for each group, list the physical devices inside it. When you create your logical device, you just list which physical devices should be part of the new logical device.
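A rough sketch of that flow (the helper name is made up, error handling is omitted, and the usual queue/extension setup in VkDeviceCreateInfo is assumed to happen elsewhere):

#include <vulkan/vulkan.h>
#include <vector>

VkDevice createGroupDevice(VkInstance instance, const VkDeviceCreateInfo &baseCreateInfo) {
    // List the device groups, and the physical devices inside the first one.
    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);
    std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
    for (auto &group : groups)
        group.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

    // Ask for a logical device that spans every physical device in the group.
    VkDeviceGroupDeviceCreateInfo groupInfo = {};
    groupInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_DEVICE_CREATE_INFO;
    groupInfo.physicalDeviceCount = groups[0].physicalDeviceCount;
    groupInfo.pPhysicalDevices = groups[0].physicalDevices;

    VkDeviceCreateInfo createInfo = baseCreateInfo; // queues, extensions, etc. set up by the caller
    createInfo.pNext = &groupInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(groups[0].physicalDevices[0], &createInfo, nullptr, &device);
    return device;
}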

Now that you've created your logical device, there are a few different pieces to how this stuff works.

Beginning of the Frame

At the beginning of your frame, you would normally call vkAcquireNextImageKHR(), which schedules a semaphore to be signaled when the next swapchain image is "acquired" (which means "able to be rendered to"). (The rest of your rendering is supposed to wait on this semaphore to be signaled.) VK_KHR_device_group replaces this function with vkAcquireNextImage2KHR(), which adds a single parameter: a "device mask" of which physical devices in the logical device should be ready before the semaphore is signaled.
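A minimal sketch, assuming a two-GPU group where a device mask of 0b11 means "both physical devices":

#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t acquireOnAllDevices(VkDevice device, VkSwapchainKHR swapchain, VkSemaphore acquireSemaphore) {
    VkAcquireNextImageInfoKHR acquireInfo = {};
    acquireInfo.sType = VK_STRUCTURE_TYPE_ACQUIRE_NEXT_IMAGE_INFO_KHR;
    acquireInfo.swapchain = swapchain;
    acquireInfo.timeout = UINT64_MAX;
    acquireInfo.semaphore = acquireSemaphore;
    acquireInfo.fence = VK_NULL_HANDLE;
    acquireInfo.deviceMask = 0b11; // the semaphore signals once the image is ready on both GPUs

    uint32_t imageIndex = 0;
    vkAcquireNextImage2KHR(device, &acquireInfo, &imageIndex);
    return imageIndex;
}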

It took me a while to figure this out, but each physical device gets its own distinct contents of the swapchain image. When you write your Vulkan program, and you bind a swapchain image to a framebuffer, that actually binds n different contents - one on each physical device. When a physical device executes and interacts with the image, it sees its own independent contents of the image.

End of the Frame

At the end of the frame, you'll want to present, and this is where things get a little complicated. Each physical device in the logical device may or may not have a "presentation engine" in it. Also, recall that each physical device has its own distinct contents of the swapchain image.

There are 4 different presentation "modes" (VkDeviceGroupPresentModeFlagBitsKHR). Your logical device will support some subset of these modes. The 4 modes are:

  1. Local presentation: Any physical device with a presentation engine can present, but it can only present the contents of its own image. When you present, you tell Vulkan which physical device and image to present (VkDeviceGroupPresentInfoKHR).
  2. Remote presentation: Any physical device with a presentation engine can present, and it can present contents from other physical devices. Vulkan exposes a graph (vkGetDeviceGroupPresentCapabilities()) that describes which physical devices can present from which other physical devices in the group. When you present, you tell Vulkan which image to present, and there's a requirement that some physical device with a presentation engine is able to present the image you selected.
  3. Sum presentation: Any physical device with a presentation engine can present, and it presents the component-wise sum of the contents of the image from multiple physical devices. Again, there's a graph that indicates, for each physical device that has a presentation engine, which other physical devices it's able to sum from. When you present, you specify which physical devices' contents to sum, via a device mask (and there's a requirement that there is some physical device with a presentation engine that can sum from all of the requested physical devices).
  4. Local multi-device presentation: Different physical devices (with presentation engines) can present different disjoint rects of their own images, which get merged together to a final image. You can tell which physical devices present which rects by calling vkGetPhysicalDevicePresentRectanglesKHR(). When you present, you specify a device mask, which tells which physical devices present their rects.

On my machine, only the local presentation mode is supported, and both GPUs have presentation engines. That means the call to present gets to pick (VkDeviceGroupPresentInfoKHR) which of the two image contents actually gets presented.
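A sketch of that present call in local mode (the function name is made up; deviceMask has exactly one bit set, picking which GPU's contents get shown):

#include <vulkan/vulkan.h>

void presentFromDevice(VkQueue queue, VkSwapchainKHR swapchain, uint32_t imageIndex,
                       VkSemaphore waitSemaphore, uint32_t deviceMask) {
    VkDeviceGroupPresentInfoKHR groupPresentInfo = {};
    groupPresentInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_PRESENT_INFO_KHR;
    groupPresentInfo.swapchainCount = 1;
    groupPresentInfo.pDeviceMasks = &deviceMask; // e.g. 0b01 for GPU 0, 0b10 for GPU 1
    groupPresentInfo.mode = VK_DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR;

    VkPresentInfoKHR presentInfo = {};
    presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    presentInfo.pNext = &groupPresentInfo;
    presentInfo.waitSemaphoreCount = 1;
    presentInfo.pWaitSemaphores = &waitSemaphore;
    presentInfo.swapchainCount = 1;
    presentInfo.pSwapchains = &swapchain;
    presentInfo.pImageIndices = &imageIndex;

    vkQueuePresentKHR(queue, &presentInfo);
}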

Middle of the Frame

The commands in the middle of the frame are actually probably the most straightforward. When you begin a command buffer, you can specify a device mask (VkDeviceGroupCommandBufferBeginInfo) of which physical devices will execute the command buffer. Inside the command buffer, when you start a render pass, you can also specify another device mask (VkDeviceGroupRenderPassBeginInfo) for which physical devices will execute the render pass, as well as assigning each physical device its own distinct "render area" rect. Inside the render pass, you can call vkCmdSetDeviceMask() to change the set of currently running physical devices. In your SPIR-V shader, there's even a built-in intrinsic "DeviceIndex" to tell you which GPU in the group you're running on. And then, finally, when you actually submit the command buffer, you can supply (VkDeviceGroupSubmitInfo) a device mask you want to submit the command buffers to.
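Here's a rough sketch of how those masks fit together for a two-GPU group (the helper function is made up, and the actual rendering commands are elided):

#include <vulkan/vulkan.h>

void recordAndSubmit(VkCommandBuffer commandBuffer, VkQueue queue) {
    VkDeviceGroupCommandBufferBeginInfo groupBeginInfo = {};
    groupBeginInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_COMMAND_BUFFER_BEGIN_INFO;
    groupBeginInfo.deviceMask = 0b11; // both physical devices execute this command buffer

    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.pNext = &groupBeginInfo;
    vkBeginCommandBuffer(commandBuffer, &beginInfo);

    vkCmdSetDeviceMask(commandBuffer, 0b01); // commands recorded now only run on GPU 0
    // ... GPU-0-only work ...
    vkCmdSetDeviceMask(commandBuffer, 0b11); // back to both GPUs
    // ... shared work ...

    vkEndCommandBuffer(commandBuffer);

    uint32_t submitDeviceMask = 0b11;
    VkDeviceGroupSubmitInfo groupSubmitInfo = {};
    groupSubmitInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_SUBMIT_INFO;
    groupSubmitInfo.commandBufferCount = 1;
    groupSubmitInfo.pCommandBufferDeviceMasks = &submitDeviceMask;

    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.pNext = &groupSubmitInfo;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &commandBuffer;
    vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
}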

There's even a convenience vkCmdDispatchBase() which lets you set "base" values for workgroup IDs, which is convenient if you want to spread one workload across multiple GPUs. Pipelines have to be created with VK_PIPELINE_CREATE_DISPATCH_BASE_KHR to use this, though.

Resources

It's all well and good to have multiple physical devices executing the same command buffer, but execution alone is not enough: you also need to bind resources to those shaders and commands that get run.

When allocating a resource, there are 2 ways for it to happen: either each physical device gets its own distinct contents of the allocation, or all the physical devices share a single contents. If the allocation's heap is marked as VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHR, then all allocations will be replicated distinctly across each of the physical devices. Even if the heap isn't marked that way, the individual allocation can still be marked that way (VkMemoryAllocateFlagsInfo). On my device, the GPU-local heap is marked as multi-instance.

Communication can happen between the devices by using Vulkan's existing memory binding infrastructure. Recall that, in Vulkan, you don't just create a resource; instead, you make an allocation, and a resource, and then you separately bind the two together. Well, it's possible to bind a resource on one physical device with an allocation on a different physical device (VkBindBufferMemoryDeviceGroupInfo, VkBindImageMemoryDeviceGroupInfo)! When you make one of these calls, it will execute on all the physical devices, so these structs indicate the graph of which resources on which physical devices get bound to which allocations on which (other) physical devices. For textures, you can even be more fine-grained than this, and bind just a region of a texture across physical devices (assuming you created the image with VK_IMAGE_CREATE_SPLIT_INSTANCE_BIND_REGIONS_BIT_KHR). This also works with sparse resources - when you bind a sparse region of a texture, that sparse region can come from another physical device, too (VkDeviceGroupBindSparseInfo).
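Here's a sketch for the buffer case, assuming a two-device group and an allocation whose heap is multi-instance; pDeviceIndices[i] names the device whose memory instance device i binds to:

#include <vulkan/vulkan.h>

void bindBufferCrossDevice(VkDevice device, VkBuffer buffer, VkDeviceMemory multiInstanceMemory) {
    // Device 0 binds to the memory instance living on device 1, and vice versa.
    const uint32_t deviceIndices[2] = { 1, 0 };

    VkBindBufferMemoryDeviceGroupInfo groupBindInfo = {};
    groupBindInfo.sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_DEVICE_GROUP_INFO;
    groupBindInfo.deviceIndexCount = 2;
    groupBindInfo.pDeviceIndices = deviceIndices;

    VkBindBufferMemoryInfo bindInfo = {};
    bindInfo.sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_INFO;
    bindInfo.pNext = &groupBindInfo;
    bindInfo.buffer = buffer;
    bindInfo.memory = multiInstanceMemory;
    bindInfo.memoryOffset = 0;

    vkBindBufferMemory2(device, 1, &bindInfo);
}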

Alas, there are restrictions. vkGetDeviceGroupPeerMemoryFeatures() tells you, once you've created a resource and bound it to an allocation on a different physical device, how you're allowed to use that resource. For each combination of (heap index, local device index, and remote device index), a subset of 4 possible uses will be allowed (VkPeerMemoryFeatureFlagBits):

  1. The local device can copy to the remote device
  2. The local device can copy from the remote device
  3. The local device can read the resource directly
  4. The local device can write to the resource directly

This is really exciting - if either of the bottom two uses is allowed, it means you can bind one of these cross-physical-device resources to a shader and use it as-if it were any normal resource! Even if neither of the bottom two uses is allowed, just being able to copy between devices without having to round-trip through main memory is already cool. On my device, only the first 3 uses are allowed.

Swapchain Resources

Being able to bind a new resource to a different physical device's allocation is good, but swapchain images come pre-bound, which means that mechanism won't work for swapchain resources. So there's a new mechanism for that: it's possible to bind a new image to the storage for an existing swapchain image (VkBindImageMemorySwapchainInfoKHR). This can be used in conjunction with VkBindImageMemoryDeviceGroupInfo which I mentioned above, to make the allocations cross physical devices.

So, if you want to copy from one physical device's swapchain image to another physical device's swapchain image, what you'd do is the following (there's a sketch of step 2 after the list):

  1. Create a new image (of the right size, format, etc.). Specify VkImageSwapchainCreateInfoKHR to indicate its storage will come from the swap chain.
  2. Bind it (VkBindImageMemoryInfo), but use both...
    1. VkBindImageMemorySwapchainInfoKHR to have its storage come from the swap chain, and
    2. VkBindImageMemoryDeviceGroupInfo to specify that its storage comes from another physical device's swap chain contents
  3. Execute a copy command to copy from one image to the other image.
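Here's a sketch of step 2 for a two-GPU group (the function name is made up, and using pDeviceIndices to select the peer's instance of the swapchain storage is my reading of the extension, so treat this as a sketch):

#include <vulkan/vulkan.h>

void bindImageToPeerSwapchainStorage(VkDevice device, VkImage image,
                                     VkSwapchainKHR swapchain, uint32_t imageIndex,
                                     uint32_t peerDeviceIndex) {
    // Every device in the group binds to the peer's instance of the swapchain image.
    const uint32_t deviceIndices[2] = { peerDeviceIndex, peerDeviceIndex };

    VkBindImageMemoryDeviceGroupInfo groupInfo = {};
    groupInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_DEVICE_GROUP_INFO;
    groupInfo.deviceIndexCount = 2;
    groupInfo.pDeviceIndices = deviceIndices;

    VkBindImageMemorySwapchainInfoKHR swapchainInfo = {};
    swapchainInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_SWAPCHAIN_INFO_KHR;
    swapchainInfo.pNext = &groupInfo;
    swapchainInfo.swapchain = swapchain;
    swapchainInfo.imageIndex = imageIndex;

    VkBindImageMemoryInfo bindInfo = {};
    bindInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO;
    bindInfo.pNext = &swapchainInfo;
    bindInfo.image = image;
    bindInfo.memory = VK_NULL_HANDLE; // storage comes from the swapchain, not a VkDeviceMemory

    vkBindImageMemory2(device, 1, &bindInfo);
}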

Conclusion

It's a pretty complicated system! Certainly much more complicated than SLI is in Direct3D. It seems like there are 3 core benefits of device groups:

  1. You can execute the same command stream on multiple devices without having to re-encode it multiple times or call into Vulkan multiple times for each command. The device masks implicitly duplicate the execution.
  2. There are a variety of presentation modes, which allow automatic merging of rendering results, without having to explicitly execute a render pass or a compute shader to merge the results. Unfortunately, my cards don't support this.
  3. Direct physical-device-to-physical-device communication, without round-tripping through main memory. Indeed, for some use cases, you can just bind a remote resource and use it as-if it was local. Very cool!

I'm not quite at the point where I can run some benchmarks to see how much SLI improves performance over simply creating two independent Vulkan logical devices. I'm working on a ray tracer, so there are a few different ways of joining the rendering results from the two GPUs. To avoid seams, the denoiser will probably have to run on just one of the GPUs.

Sunday, October 29, 2023

My First Qt App

Just for fun, I wanted to try to make a Qt app that graphs some data. For contrast, I'm aware of Swift Charts, and I thought using Qt to graph some stuff would be a fun little project, now that I'm using FreeBSD full time, rather than macOS. The latest version of Qt is version 6, so that's what I'll be using.

Basics

When you use Qt Creator to create a new Qt project, it only creates 4 files:

  • CMakeLists.txt
  • CMakeLists.txt.user
  • main.cpp
  • Main.qml

Qt Creator can understand CMakeLists.txt directly - if you want to open the "project," you open that file. Just like with Cocoa programming, main.cpp doesn't contain much inside it - it's just a few lines long, and it initializes the existing infrastructure to load the app's UI.

Also, like Cocoa programming, most of the description of the UI of the app is described declaratively, in the .qml file. The way this works is you say something like:

Foo {
    bar: baz
}

And this means "when the QML file is loaded, create an instance of type Foo, and set the property named bar on this new object to a value of baz."

The outermost level is this:

Window {
    width: 640
    height: 480
    visible: true
    title: qsTr("Hello World")
}

Then, you can add "elements" inside the window, by placing them inside the {}s. There are collection views (Row, Column, Grid, Flow) which define how to lay out their children, and there are also more general elements like Rectangle. When your layout is not naturally specified (because you're not using containers or whatever), you describe the layout using anchors, like anchors.centerIn: parent or anchors.fill: parent.

Qt Charts

Qt has a built-in chart element, so the first thing I did was just copy the ChartView example directly into my QML document as a child of the Window. However, that didn't work, and some searching found this note:

> Note: An instance of QApplication is required for the QML types as the module depends on Qt's Graphics View Framework for rendering. QGuiApplication is not sufficient. However, projects created with Qt Creator's Qt Quick Application wizard are based on the Qt Quick template that uses QGuiApplication by default. All the QGuiApplication instances in such projects must be replaced with QApplication.

Okay, so I replaced QGuiApplication with QApplication in main.cpp, and changed #include <QGuiApplication> to #include <QApplication>, only to find that there is now a compile error: the compiler can't find that file. After some more searching, it turns out I needed to change this:

find_package(Qt6 6.5 REQUIRED COMPONENTS Quick)

to

find_package(Qt6 6.5 REQUIRED COMPONENTS Quick Widgets)

and change 

target_link_libraries(appGrapher
    PRIVATE Qt6::Quick
)

to

target_link_libraries(appGrapher
    PRIVATE Qt6::Quick
    PRIVATE Qt6::Widgets
)

Huh. After doing that, it worked no problem.

Data Source (C++ interop)

So now I have a chart, which is pretty cool, but the data that the chart uses is spelled out literally in the QML file. That's not very useful - I plan on generating thousands of data points, and I don't want to have to put them inline inside this QML thing. Instead, I want to load them from an external source.

QML files allow you to run JavaScript by literally placing bits of JavaScript inside the QML file, but I think I want to do better - I want my data source to come from C++ code, so I have full freedom about how I generate it. From some searching, it looks like there are 2 ways of having C++ and QML JavaScript interoperate:

  • You can register a singleton, or a singleton instance, and then the JavaScript can call methods on that singleton
  • You can register a type, and have the QML create an instance of that type, just like any other element
  • (You can setContextProperty(), which lets the QML look up an instance that you set ahead of time. However, there's a note that says "You should not use context properties to inject values into your QML components" which is exactly what I'm trying to do, so this probably isn't the right solution.)

I have a general aversion to singletons, and I think registering a type is actually what I want, because I want the QML infrastructure to own the instance and define its lifetime, so that's the approach I went with. The way you do this is, in main() after you create the QApplication but before you do anything else, you call qmlRegisterType(). Here is what main() says:

qmlRegisterType<DataSource>("com.litherum", 1, 0, "DataSource");

This allows the QML to say import com.litherum, which is pretty cool.

QObject

Defining the DataSource type in C++ is a bit weird. It turns out that Qt objects are not just regular C++ objects. Instead, you write your classes in a different language, which is similar to C++, and then there is a "meta-object compiler" which will compile your source to actual C++. It looks like the main purpose of this is to be able to connect signals and slots, where an object can emit a signal, and if a slot in some other object is connected to that signal, then the slot callback gets run in that other object. It seems pretty similar to observers in Objective-C. They also have the ability to perform introspection, like Objective-C .... I kind of don't understand why they didn't just invent a real language rather than doing this C++ transpilation silliness.

Anyway, you can define your (not-)C++ class, inherit from QObject, annotate the class with Q_OBJECT and QML_ELEMENT, and give it a method with the Q_INVOKABLE annotation. Sure, fine. Then, in the QML file, you can add a stanza which tells the system to create an instance of this class, and you can use the Component.onCompleted JavaScript handler to call into it (via its id). Now you can call the C++ method you just defined from within the QML. Cool. This is what the C++ header says:

class DataSource : public QObject
{
    Q_OBJECT
    QML_ELEMENT
public:
    explicit DataSource(QObject *parent = nullptr);

    Q_INVOKABLE void updateData(QXYSeries*, double time);
};
 

Okay, the method is supposed to set the value of the SplineSeries in the chart. The most natural way to do this is to pass the SplineSeries into the C++ function as a parameter. This is actually pretty natural - all the QML types have corresponding C++ types, so you just make the C++ function accept a QSplineSeries*. Except we run into the same compiler error where the compiler can't find #include <QSplineSeries>. It turns out that in CMakeLists.txt we have to make a similar addition and add Charts to both places that we added Widgets above. Fine. Here's what the QML says:

 DataSource {
    id: dataSource
    Component.onCompleted: function() {
        dataSource.updateData(splineSeries, Date.now());
    }
}

Once you do this, it actually works out well - the C++ code can call methods on the QSplineSeries, and it can see the values that have been set in the QML. It can generate a QList<QPointF> and call QSplineSeries::replace() with the new list.
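For concreteness, here's a sketch of what such an updateData() might look like (the sine-wave data and the "datasource.h" file name are made up for illustration):

// datasource.cpp - implements the header shown above; the data itself is invented.
#include "datasource.h"
#include <QtCharts/QXYSeries>
#include <QList>
#include <QPointF>
#include <cmath>

DataSource::DataSource(QObject *parent) : QObject(parent) {}

void DataSource::updateData(QXYSeries *series, double time)
{
    QList<QPointF> points;
    const int count = 1000;
    points.reserve(count);
    for (int i = 0; i < count; ++i) {
        double x = static_cast<double>(i) / count;
        points.append(QPointF(x, std::sin(x * 20.0 + time / 1000.0)));
    }
    // Swapping in the whole data set at once is what lets the ChartView animate
    // from the old values to the new ones.
    series->replace(points);
}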

The one thing I couldn't get it to do was automatically rescale the charts' axes when I swap in new data with different bounds. Oh well.

I did want to go one step further, though!

Animation

One of the coolest things about retained-mode UI toolkits is that they often allow for animations for free. Swapping out the data in the series should allow Qt to smoothly animate from the first data set to the second. And it actually totally worked! It took me a while to figure out how specifically to spell the values, but in the QML file, you can set these on the ChartView:

animationOptions: ChartView.AllAnimations
animationDuration: 300 // milliseconds
animationEasingCurve {
    type: Easing.InOutQuad
}

I found these by looking at the documentation for QChart. And, lo and behold, changing the data values smoothly animated the spline on the graph! I also needed some kind of timer to actually call my C++ function to generate new data, which you do with QML also:

Timer {
    interval: 1000 // milliseconds
    running: true
    repeat: true
    onTriggered: function() {
        dataSource.updateData(splineSeries, Date.now());
    }
}

Super cool stuff! I'm always impressed when you can enable animations in a declarative way, without having your own code running at 60fps. Also, while the animations are running, from watching KSysGuard, it looks like the rendering is multithreaded, which is super cool too! (And, I realized that KSysGuard probably uses Qt Charts under the hood too, to show its performance graphs.)

Conclusion

It looks like Qt Charts is pretty powerful, has lots of options to make it beautiful, and is fairly performant (though I didn't rigorously test the performance). Using it did require creating a whole Qt application, but the application is super small, only has a few files, and each file is pretty small and understandable. And, being able to make arbitrary dynamic updates over time while getting animation for free was pretty awesome. I think being able to describe most of the UI declaratively, rather than having to describe it all 100% in code, is definitely a good design decision for Qt. And the C++ interop story was a little convoluted (having to touch main() is a bit unfortunate) but honestly not too bad in the end.

Saturday, October 28, 2023

ReSTIR Part 2: Characterizing Sample Reuse

After enumerating all the building blocks of ReSTIR, there isn't actually that much more. The rendering equation is an integral, and our job is to approximate the value of the integral by sampling it in the most intelligent way possible.

Importance sampling tells us that we want to generate samples with a density that's proportional to the contribution of those samples to the value of the final integral. (So, where the light is strongest, sample that with highest density.) We can't directly produce samples with this probability density function, though - if we could, we could just compute the integral directly rather than dealing with all this sampling business.

The function being integrated in the rendering equation is the product of a few independent functions:

  • The BRDF (BSDF) function, which is a property of the material we are rendering,
  • The distribution of incoming light. For direct illumination, this is distributed over the relevant light sources
  • A geometry term, where the orientation of the surface(s) affects the result
  • A visibility term (the point being shaded might be in shadow)

The fact that there are a bunch of independent terms means that Multiple Importance Sampling (MIS) works well - we can use these independent functions to produce a single aggregated "target" function which we expect will approximate the real function fairly well. So, we can generate samples according to the target function, using Sequential Importance Resampling (SIR), evaluate the real function at those sampling locations (by tracing rays or whatever), then use Resampled Importance Sampling (RIS) to calculate an integral. Easy peasy, right?
 

ReSTIR


This is where ReSTIR starts. The first observation that ReSTIR makes is that it's possible to use reservoir sampling (RS) to turn this into a streaming algorithm. The paper assumes that the reservoir only holds a single sample (though this isn't actually necessary). The contents of the reservoir represent a set of (one) sample with pdf proportional to the target function, and the more samples the reservoir encounters, the better that pdf matches the target function. The name of the game, now, is to make the reservoir encounter as many samples as possible.

Which brings us to the second observation that ReSTIR makes. Imagine if there was some way of merging reservoirs in constant time (or rather: in time proportional to the size of the reservoirs, rather than time proportional to the number of samples the reservoirs have encountered). If this were possible, you could imagine a classic parallel reduction algorithm: each thread (pixel) could start out with a naive reservoir (a poor approximation of your target function), but then adjacent threads could merge their reservoirs, then one-of-every-4-threads could merge their reservoirs, then one-of-every-8, etc., until you have a single result that incorporates results from all the threads. If only a single level (generation) of this reduction occurs each frame, each thread performs only a constant amount of work per frame, yet an exponential number of samples end up being accumulated. This is the key insight that ReSTIR makes.
 

Merging Reservoirs


Merging reservoirs is a subtle business, though. The problem is that different pixels/threads are shading different materials oriented at different orientations. In effect, your target function you're sampling (and the real function you're evaluating) are different from pixel to pixel. If you ignore this fact, and pretend that all your pixels are all sampling the same thing, you can naively just jam the reservoirs together, by creating a new reservoir which encounters the values saved in the reservoirs of the inputs. This is fast, but gets wrong results (called "bias" in the literature).

What you have to do instead is to treat the merging operation with care. The key here lies with the concept of "supports." Essentially, if you're trying to sample a function, you have to be able to generate samples at every place the function is nonzero. If there's an area where the function is nonzero but you never sample that area, your answer will turn out wrong. Well, the samples that one pixel generates (recall that the things in the reservoirs are sample locations) might end up not being applicable to a different pixel. For example, consider if there's an occlusion edge where one pixel is in shadow and a nearby pixel isn't. Or, another example: the surface normal varies across the object, and the sample at one pixel is at a sharp angle, such that if you use that same sample at a different pixel, that sample actually points behind the object. You have to account for this in the formulas involved.
 

Jacobian Determinant


There's a generalization of this, which uses the concept of a Jacobian determinant. Recall that, in general, a function describes a relationship between inputs and outputs. The Jacobian determinant of a function describes, for a particular point in the input space of the function, if you make a small perturbation and feed a slightly different input point into the function, how much the output of the function will be perturbed. It's kind of a measure of sensitivity - at a particular point, how sensitive the output is to changes in the input.

Well, if you have a sample at one particular pixel of an image, and you then apply it to a different pixel, you have an input (the sample at the original pixel) and you have an output (the sample at the destination pixel) and you have a relationship between the two (the probability of that sample won't be exactly the same at the two different places). So, the Jacobian tells you how to compensate for the fact that you're changing the domain of the sample.

In order to incorporate the Jacobian, you have to be able to calculate it (of course), which means you have to be able to characterize how sample reuse across pixels affects the probabilities involved. For direct illumination, that's just assumed to be 1 or 0 depending on the value of the sample point - hence why above you just ignore some samples altogether when reusing them. For indirect illumination (path tracing), a sample is an entire path, and when you re-use it at a different pixel, you're producing a new path that is slightly different than the original path. This path manipulation is called "shift mapping" of a path in the gradient domain rendering literature, and common shift mappings have well-defined Jacobian functions associated with them. So, if you spatially reuse a path, you can pick a "shift mapping" for how to define the new path, and then include that shift mapping's Jacobian in the reservoir merging formula.

This concept of a "shift mapping" and its Jacobian can be generalized to any kind of sampling - it's not just for path tracing.

Conclusion


So that's kind of it. If you're careful about it, you can merge reservoirs in closed form (or, at least, in closed form for each sample in the reservoirs), which results in a pdf of the values in the reservoir that is informed by the union of samples of all the input reservoirs. This leads to a computation tree of merges which allows the number of samples to be aggregated exponentially over time, where each frame only has to do constant work per pixel. You can perform this reuse both spatially and temporally, if you remember information about the previous frame. The more samples you aggregate, the closer the pdf of the samples in the reservoir matches the target function, and the target function is formed by using MIS to approximate the rendering equation. This allows you to sample with a density very close to the final function you're integrating, which has the effect of reducing variance (noise) in the output image.

ReSTIR also has some practical concerns, such as all the reuse causing an echo chamber of old data - the authors deal with that by weighting old and new data differently, to try to strike a balance between reuse (high sample counts) vs quickly adhering to new changes in geometry or whatever. It's a tunable parameter.

Tuesday, October 17, 2023

ReSTIR Part 1: Building Blocks

ReSTIR is built on a bunch of other technologies. Let's discuss them one-by-one.

Rejection Sampling


Rejection sampling isn't actually used in ReSTIR, but it's useful to cover it anyway. It is a technique to convert samples from one PDF (probability density function) to another PDF.

So, you start with the fact that you have 2 PDFs: a source PDF and a destination PDF. The first thing you do is you find a scalar "M" which, when scaling the source PDF, causes the source PDF to be strictly larger than the destination PDF, for all x coordinates. Then, for every sample in the source, accept that sample with a probability equal to destination PDF at the sample / (M * source PDF at the sample). You'll end up with fewer samples than you started with, but that's the price you pay. You can see how the scalar M is necessary to keep the probabilities between 0 and 1.

The larger the distance between the destination PDF and M * the source PDF, the fewer samples will be accepted. So, if you pick M very conservatively, you'll end up with almost no samples accepted. That's a downside to rejection sampling.

On the other hand, if the source PDF and the destination PDF are the same, then M = 1, and all the samples will be accepted. Which is good, because the input samples are exactly what should be produced by the algorithm.
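Here's a tiny sketch of that procedure (the names are made up, and the PDFs are passed in as callables):

#include <random>
#include <vector>
#include <functional>

// Turn samples drawn from sourcePdf into samples distributed according to destPdf,
// given M such that M * sourcePdf(x) >= destPdf(x) everywhere.
std::vector<double> rejectionSample(const std::vector<double> &sourceSamples,
                                    const std::function<double(double)> &sourcePdf,
                                    const std::function<double(double)> &destPdf,
                                    double M, std::mt19937 &rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> accepted;
    for (double x : sourceSamples) {
        double acceptProbability = destPdf(x) / (M * sourcePdf(x));
        if (uniform(rng) < acceptProbability)
            accepted.push_back(x);
    }
    return accepted; // possibly far fewer samples than we started with
}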

Sequential Importance Resampling


This is another technique used to convert samples from one PDF to another PDF. Compared to rejection sampling, we don't reject samples as we encounter them; instead, we pick ahead of time how many samples we want to accept.

Again, you have a source PDF and a destination PDF. Go through all your samples, and compute a "score" which is the destination PDF at the sample / the source PDF at the sample. Now that you have all your scores, select N samples from them, with probabilities proportional to the scores. You might end up with duplicate samples; that's okay.

Compared to rejection sampling, this approach has a number of benefits. The first is that you don't have to pick that "M" value. The scores are allowed to be any (non-negative) value - not necessarily between 0 and 1. This means you don't have to have any global knowledge about the PDFs involved.

Another benefit is that you know how many samples you're going to get at the end - you can't end up in a situation where you accidentally don't end up with any samples.

The downside to this algorithm is that you have to pick N up front. But, usually that's not actually a big deal.

The other really cool thing about SIR is that the source and destination PDFs don't actually have to be normalized. Because the scores can be arbitrary, it's okay if your destination PDF is actually just some arbitrary (non-normalized) function. This is super valuable, as we'll see later.
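Here's a sketch of SIR with made-up names; note that destFunction doesn't need to be a normalized PDF, because only the relative scores matter:

#include <random>
#include <vector>
#include <functional>

std::vector<double> sirResample(const std::vector<double> &sourceSamples,
                                const std::function<double(double)> &sourcePdf,
                                const std::function<double(double)> &destFunction,
                                int N, std::mt19937 &rng) {
    // Score every candidate by destination / source.
    std::vector<double> scores;
    scores.reserve(sourceSamples.size());
    for (double x : sourceSamples)
        scores.push_back(destFunction(x) / sourcePdf(x));

    // Draw N samples with probability proportional to the scores.
    std::discrete_distribution<size_t> pick(scores.begin(), scores.end());
    std::vector<double> result;
    result.reserve(N);
    for (int i = 0; i < N; ++i)
        result.push_back(sourceSamples[pick(rng)]); // duplicates are allowed
    return result;
}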

Monte Carlo Integration


The goal of Monte Carlo integration is to compute an integral of a function. You simply sample it at random locations, and average the results.

This assumes that the pdf you're using to generate random numbers is constant from 0 to 1.

So, the formula is: 1/N * sum from 1 to N of f(x_i)

Importance Sampling


The idea here is to improve upon basic Monte Carlo integration as described above. Certain samples will contribute to the final result more than others. Instead of sampling from a constant PDF, if you instead sample using a PDF that approximates the function being integrated, you'll more quickly approach the final answer.

Doing so adds another term to the formula. It now is: 1/N * sum from 1 to N of f(x_i) / q(x_i), where q(x) is the PDF used to generate samples.

The best PDF, of course, is proportional to the function being sampled - if you pick this, f(x_i) / q(x_i) will be a constant value for all i, which means you only need 1 term to calculate the final perfect answer. However, usually this is impractical - if you knew how to generate samples proportional to the function being integrated, you'd probably know enough to just integrate the function directly. For direct illumination, you can use things like the BRDF of the material, or the locations where the light sources are. Those will probably match the final answer pretty well.
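As a tiny self-contained example of that estimator (the particular f and q are made up): integrating f(x) = x^2 over [0, 1) with samples drawn from q(x) = 2x.

#include <cstdio>
#include <random>
#include <cmath>

int main() {
    std::mt19937 rng(42);
    auto f = [](double x) { return x * x; };   // the function being integrated
    auto q = [](double x) { return 2.0 * x; }; // the pdf used to generate samples
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const int N = 100000;
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = std::sqrt(uniform(rng)); // inverse-CDF sampling of q(x) = 2x
        sum += f(x) / q(x);
    }
    std::printf("estimate = %f (exact = 1/3)\n", sum / N);
}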

Multiple Importance Sampling


So now the question becomes how to generate that approximating function. If you look at the above formula, you'll notice that when f(x) is large, but q(x) is small, that leads to the worst possible situation - you are trying to compute an integral, but you're not generating any samples in an area that contributes a lot to it.

The other extreme - where f(x) is small but q(x) is big - isn't actually harmful, but it is wasteful. You're generating all these samples that don't actually contribute much to the final answer.

The idea behind MIS is that you can generate q(x) from multiple base formulas. For example, one of the base formulas might be the uniform distribution, and another might be proportional to the BRDF of the material you're shading, and another might be proportional to the direction of where the lights in the scene are. The idea is that, by linearly blending all these formulas, you can generate a better q(x) PDF. 

Incorporating the uniform distribution is useful to make sure that q(x) never gets too small anywhere, thereby solving the problem where f(x) is large and q(x) is small.

Resampled Importance Sampling


RIS is what happens when you bring together importance sampling and SIR. You can use SIR to generate samples proportional to your approximating function. You can then use the importance sampling formula to compute the integral.

If, when using SIR, your approximating function isn't normalized, there's another term added into the formula to re-normalize the result, which allows the correct integral to be calculated.

This is really exciting, because it means that we can calculate integrals (like the rendering equation) by sampling in strategic places - and the pdf of those strategic places can be arbitrary (non-normalized) functions.

Reservoir Sampling


Reservoir Sampling is a reformulation of SIR, to make it streamable. Recall that, in SIR, you encounter samples, and each sample produces a weight, and then you select N samples proportional to each sample's weight. Reservoir sampling allows you to select the N samples without knowing the total number of samples there are. The idea is that you keep a "reservoir" of N samples, and each time you encounter a new sample, you update the contents of the reservoir depending on probabilities involved. The invariant is that the contents of the reservoir is proportional to the probabilities of all the samples encountered.

The other cool thing about reservoir sampling is that 2 reservoirs can be joined together into a single reservoir, via only looking at the contents of the reservoirs, without requiring another full pass over all the data.
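Here's a minimal single-sample reservoir sketch, including the constant-time merge. It assumes both reservoirs are targeting the same function; reusing samples across pixels needs the extra care described in Part 2 above:

#include <random>

struct Reservoir {
    double sample = 0.0;      // the surviving sample
    double totalWeight = 0.0; // sum of all weights seen so far
    long count = 0;           // how many samples this reservoir has encountered

    void update(double candidate, double weight, std::mt19937 &rng) {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        totalWeight += weight;
        count += 1;
        // Keep the new candidate with probability weight / totalWeight.
        if (uniform(rng) * totalWeight < weight)
            sample = candidate;
    }

    void merge(const Reservoir &other, std::mt19937 &rng) {
        long combinedCount = count + other.count;
        // Treat the other reservoir's sample as one candidate carrying its whole accumulated weight.
        update(other.sample, other.totalWeight, rng);
        count = combinedCount; // update() counted the merge as a single sample; fix that up
    }
};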

Conclusion


So far, we've set ourselves up for success. We can calculate integrals, in a streamable way, by "resampling" our samples to approximate the final function being integrated. Being streamable is important, as we need to be able to update our results as we encounter new samples (perhaps across new frames, or across other pixels). The fact that you can merge reservoirs in constant time is super powerful, as it allows the merged result to behave as if it had seen 2*N samples, while only running a constant-time algorithm. This can be done multiple times, thereby allowing for synthesis of an exponential number of samples, but each operation is constant time.

Friday, October 13, 2023

Implementing a GPU's Programming Model on a CPU

SIMT

The programming model of a GPU uses what has been coined "single instruction multiple thread." The idea is that the programmer writes their program from the perspective of a single thread, using normal regular variables. So, for example, a programmer might write something like:

int x = threadID;
int y = 6;
int z = x + y;

Straightforward, right? Then, they ask the system to run this program a million times, in parallel, with different threadIDs.

The system *could* simply schedule a million threads to do this, but GPUs do better than this. Instead, the compiler will transparently rewrite the program to use vector registers and instructions in order to run multiple "threads" at the same time. So, imagine you have a vector register, where each item in the vector represents a scalar from a particular "thread." In the above program, x corresponds to a vector of [0, 1, 2, 3, etc.] and y corresponds to a vector of [6, 6, 6, 6, etc.]. Then, the operation x + y is simply a single vector add operation of both vectors. This way, performance can be dramatically improved, because these vector operations are usually significantly faster than if you had performed each scalar operation one-by-one.
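To make that concrete, here's a sketch of the three-line program above run for eight "threads" at once, using the AVX-512 intrinsics this post gets to below (assuming an AVX-512F-capable CPU):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    __m512i x = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7); // threadIDs 0..7
    __m512i y = _mm512_set1_epi64(6);                      // "int y = 6;" broadcast to every lane
    __m512i z = _mm512_add_epi64(x, y);                    // "int z = x + y;" for all 8 threads at once

    int64_t results[8];
    _mm512_storeu_si512(results, z);
    for (int i = 0; i < 8; ++i)
        std::printf("thread %d: z = %lld\n", i, (long long)results[i]);
}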

(This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data.)

SIMT gets complicated, though, when you have control flow. Imagine the program did something like:

if (threadID < 4) {
    doSomethingObservable();
}

Here, the system has to behave as-if threads 0-3 executed the "then" block, but also behave as-if threads 4-n didn't execute it. And, of course, thread 0-3 want to take advantage of vector operations - you don't want to pessimize and run each thread serially. So, what do you do?

Well, the way that GPUs handle this is by using predicated instructions. There is a bitmask which indicates which "threads" are alive: within the above "then" block, that bitmask will have value 0xF. Then, all the vector instructions use this bitmask to determine which elements of the vector it should actually operate on. So, if the bitmask is 0xF, and you execute a vector add operation, the vector add operation is only going to perform the add on the 0th-3rd items in the vector. (Or, at least, it will behave "as-if" it only performed the operation on those items, from an observability perspective.) So, the way that control flow like this works is: all threads actually execute the "then" block, but all the operations in the block are predicated on a bitmask which specifies that only certain items in the vector operations should actually be performed. The "if" statement itself just modifies the bitmask.

The Project

AVX-512 is an optional instruction set on some (fairly rare) x86_64 machines. The exciting thing about AVX-512 is that it adds support for this predication bitmask thing. It has a bunch of vector registers (512 bits wide, named zmm0 - zmm31) and it also adds a set of predication bitmask registers (k0 - k7). The instructions that act upon the vector registers can be predicated on the value of one of those predication registers, to achieve the effect of SIMT.
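For example, here's a sketch of the earlier if (threadID < 4) example expressed with these mask registers (the surrounding function and the "add 1" body are made up):

#include <immintrin.h>

// "if (threadID < 4) { sum = sum + 1; }" for 8 threads: the compare produces a bitmask of
// live threads, and the masked add only touches the lanes whose bit is set.
__m512i addOneWhereThreadIdBelowFour(__m512i threadIds, __m512i sums) {
    __mmask8 liveThreads = _mm512_cmplt_epi64_mask(threadIds, _mm512_set1_epi64(4));
    // Lanes whose mask bit is 0 keep their old value from `sums`.
    return _mm512_mask_add_epi64(sums, liveThreads, sums, _mm512_set1_epi64(1));
}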

It turns out I actually have a machine lying around in my home which supports AVX-512, so I thought I'd give it a go, and actually implement a compiler that compiles a toy language, but performs the SIMT transformation to use the vector operations and predication registers. The purpose of this exercise isn't really to achieve incredible performance - there are lots of sophisticated compiler optimizations which I am not really interested in implementing - but instead the purpose is really just as a learning exercise. Hopefully, by implementing this transformation myself for a toy language, I can learn more about the kinds of things that real GPU compilers do.

The toy language is one I invented myself - it's very similar to C, with some syntax that's slightly easier to parse. Programs look like this:

function main(index: uint64): uint64 {
    variable accumulator: uint64 = 0;
    variable accumulatorPointer: pointer<uint64> = &accumulator;
    for (variable i: uint64 = 0; i < index; i = i + 1) {
        accumulator = *accumulatorPointer + i;
    }
    return accumulator;
}

It's pretty straightforward. It doesn't have things like ++ or +=. It also doesn't have floating-point numbers (which is fine, because AVX-512 supports vector integer operations). It has pointers, for loops, continue & break statements, early returns... the standard stuff.

Tour

Let's take a tour, and examine how each piece of a C-like language gets turned into AVX-512 SIMT. I implemented this so it can run real programs, and tested it somewhat rigorously - enough to be fairly convinced that it's generally right and correct.

Variables and Simple Math

The most straightforward part of this system is variables and literal math. Consider:

variable accumulator: uint64;

This is a variable declaration. Each thread may store different values into the variable, so its storage needs to be a vector. No problem, right?

What about if the variable's type is a complex type? Consider:

struct Foo {
    x: uint64;
    y: uint64;
}

variable bar: Foo;

Here, we need to maintain the invariant that Foo.x has the same memory layout as any other uint64. This means that, rather than alternating x,y,x,y,x,y in memory, there instead has to be a vector for all the threads' x values, followed by another vector for all the threads' y values. This works recursively: if a struct has other structs inside it, the compiler will go through all the leaf-types in the tree, turn each leaf type into a vector, and then lay them out in memory end-to-end.

Simple math is even more straightforward. Literal numbers have the same value no matter which thread you're running, so they just turn into broadcast instructions. The program says "3" and the instruction that gets executed is "broadcast 3 to every item in a vector". Easy peasy.

L-values and R-values

In a C-like language, every value is categorized as either an "l-value" or an "r-value". An l-value is defined as having a location in memory, and r-values don't have a location in memory. The value produced by the expression "2 + 3" is an r-value, but the value produced by the expression "*foo()" is an l-value, because you dereferenced the pointer, so the thing the pointer points to is the location in memory of the resulting value. L-values can be assigned to; r-values cannot be assigned to. So, you can say things like "foo = 3 + 4;" (because "foo" refers to a variable, which has a memory location) but you can't say "3 + 4 = foo;". That's why it's called "l-value" and "r-value" - l-values are legal on the left side of an assignment.

At runtime, every expression has to produce some value, which is consumed by its parent in the AST. E.g, in "3 * 4 + 5", the "3 * 4" has to produce a "12" which the "+" will consume. The simplest way to handle l-values is to make them produce a pointer. This is so expressions like "&foo" work - the "foo" is an lvalue and produces a pointer that points to the variable's storage, and the & operator receives this pointer and produces that same pointer (unmodified!) as an r-value. The same thing happens in reverse for the unary * ("dereference") operator: it accepts an r-value of pointer type, and produces an l-value - which is just the pointer it just received. This is how expressions like "*&*&*&*&*&foo = 7;" work (which is totally legal and valid C!): the "foo" produces a pointer, which the & operator accepts and passes through untouched to the &, which takes it and passes it through untouched, all the way to the final *, which produces the same pointer as an lvalue, that points to the storage of foo.

The assignment operator knows that the thing on its left side must be an lvalue and therefore will always produce a pointer, so that's the storage that the assignment stores into. The right side can either be an l-value or an r-value; if it's an l-value, the assignment operator has to read from the thing it points to; otherwise, it's an r-value, and the assignment operator reads the value itself. This is generalized to every operation: it's legal to say "foo + 3", so the + operator needs to determine which of its parameters are l-values, and will thus produce pointers instead of values, and it will need to react accordingly to read from the storage the pointers point to.

All this stuff means that, even for simple programs where the author didn't even spell the name "pointer" anywhere in the program, or even use the * or & operators anywhere in the program, there will still be pointers internally used just by virtue of the fact that there will be l-values used in the program. So, dealing with pointers is a core part of the language. They appear everywhere, whether the program author wants them to or not.

Pointers

If we now think about what this means for SIMT, l-values produce pointers, but each thread has to get its own distinct pointer! That's because of programs like this:

variable x: pointer<uint64>;

if (...) {
    x = &something;
} else {
    x = &somethingElse;
}

*x = 4;

That *x expression is an l-value. It's not special - it's just like any other l-value. The assignment operator needs to handle the fact that, in SIMT, the lvalue that *x produces is a vector of pointers, where each pointer can potentially be distinct. Therefore, that assignment operator doesn't actually perform a single vector store; instead, it performs a "scatter" operation. There's a vector of pointers, and there's a vector of values to store to those pointers; the assignment operator might end up spraying those values all around memory. In AVX-512, there's an instruction that does this scatter operation.

(Aside: That scatter operation in AVX-512 uses a predication mask register (of course), but the instruction has a side-effect of clearing that register. That kind of sucks from the programmer's point of view - the program has to save and restore the value of the register just because of a quirk of this instruction. But then, thinking about it more, I realized that the memory operation might cause a page fault, which has to be handled by the operating system. The operating system therefore needs to know which address triggered the page fault, so it knows which pages to load. The predication register holds this information - as each memory access completes, the corresponding bit in the predication register gets set to false. So the kernel can look at the register to determine the first predication bit that's high, which indicates which pointer in the vector caused the fault. So it makes sense why the operation will clear the register, but it is annoying to deal with from the programmer's perspective.)

And, of course, the program can also say "foo = *x;", which means there also has to be a gather operation. Sure. Something like "*x = *y;" will end up doing both a gather and a scatter.
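A matching sketch for "*x = *y;", again with intrinsics and invented names, just chains the two operations:

#include <immintrin.h>

// Sketch of "*x = *y;": gather through the source pointers, then scatter
// through the destination pointers, both predicated on the live lanes.
void emitGatherScatter(__m512i dstPtrs, __m512i srcPtrs, __mmask8 alive) {
    __m512i loaded = _mm512_mask_i64gather_epi64(
        _mm512_setzero_si512(), // dead lanes keep this placeholder value
        alive, srcPtrs, nullptr, 1);
    _mm512_mask_i64scatter_epi64(nullptr, alive, dstPtrs, loaded, 1);
}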

Copying

Consider a program like:

struct Foo {
    x: uint64;
    y: uint64;
}

someVariableOfFooType = aFunctionThatReturnsAFoo();

That assignment needs to set both fields inside the Foo. Naively, a compiler might be tempted to use a memcpy() to copy the contents - after all, the contents could be arbitrarily complex, with nested structs. However, that won't work for SIMT, because only some of the threads might be alive at this point in the program. Therefore, the assignment has to copy only the vector elements belonging to the threads that are alive; it can't copy the whole vectors, because that could clobber other entries in the destination vectors which are supposed to persist.

So, all the stores to someVariableOfFooType need to be predicated using the predication registers - we can't naively use a memcpy(). This means that every assignment needs to actually perform n memory operations, where n is the number of leaf types in the struct being assigned - because those memory operations can be predicated correctly using the predication registers. We have to copy structs leaf-by-leaf. This means that the number of instructions to copy a type is proportional to the complexity of the type. Also, both the left side and the right side may be l-values, which means each leaf-copy could actually be a gather/scatter pair of instructions. So, depending on the complexity of the type and the context of the assignment, that single "=" operation might actually generate a huge amount of code.
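Here's a hedged sketch of what that leaf-by-leaf copy could look like for the Foo above, assuming (as the leaf-by-leaf description implies) that each leaf field gets its own 8-lane vector of storage; the real compiler emits this inline rather than calling a helper:

#include <immintrin.h>
#include <cstdint>

// Hypothetical SIMT storage for a Foo: one 8-lane vector per leaf field.
struct FooStorage {
    alignas(64) uint64_t x[8];
    alignas(64) uint64_t y[8];
};

// "dst = src;" as leaf-by-leaf masked copies: lanes belonging to inactive
// threads are left untouched in dst, which a memcpy() couldn't guarantee.
void copyFoo(FooStorage& dst, const FooStorage& src, __mmask8 alive) {
    __m512i xs = _mm512_mask_loadu_epi64(_mm512_setzero_si512(), alive, src.x);
    _mm512_mask_storeu_epi64(dst.x, alive, xs);

    __m512i ys = _mm512_mask_loadu_epi64(_mm512_setzero_si512(), alive, src.y);
    _mm512_mask_storeu_epi64(dst.y, alive, ys);
}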

Pointers (Part 2)

There's one other decision that needs to be made about pointers: Consider:

variable x: uint64;

... &x ...

As I described above, the storage for the variable x is a vector (each thread owns one value in the vector). &x produces a vector of pointers, sure. The question is: should all the pointer values point to the beginning of the x vector? Or should each pointer value point to its own slot inside the x vector? If they point to the beginning, that makes the & operator itself really straightforward: it's just a broadcast instruction. But it also means that the scatter/gather operations get more complicated: they have to offset each pointer by a different amount in order to scatter/gather to the correct place. On the other hand, if each pointer points to its own slot inside x, that means the scatter/gather operations are already set up correctly, but the & operation itself gets more complicated.

Both options will work, but I ended up making all the pointers point to the beginning of x. The reason is programs like:

struct Foo {
    x: uint32;
    y: uint64;
}

variable x: Foo;

... &x ...

If I picked the other option, and had the pointers point to their own slot inside x, it isn't clear which member of Foo they should be pointing inside of. I could have, like, found the first leaf, and made the pointers point into that, but what if the struct is empty... It's not very elegant.

Also, if I'm assigning to x or something else where I need to populate every field, then because every copy operation has to copy leaf-by-leaf, I'm going to have to be modifying the pointers to point to each field anyway. If one of the fields is a uint32 and the next one is a uint64, I can't simply add a constant amount to each pointer to get it to point to its slot in the next leaf. So, if I'm going to be mucking about with individual pointer values for each leaf in a copy operation, I might as well have the original pointers point to the overall x vector rather than to individual fields, because pointing to individual fields doesn't actually make anything simpler.
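To illustrate the choice I made, here's a small intrinsics sketch (helper names invented, 8 uint64 lanes assumed): & is just a broadcast of the base address, and the per-lane offsets only get added later, at the point where the pointers are actually consumed by a scatter/gather.

#include <immintrin.h>
#include <cstdint>

// "&x" for a local uint64 variable backed by an 8-lane vector: every lane
// gets the same pointer, the start of x's storage. Just a broadcast.
__m512i emitAddressOf(const uint64_t (&storage)[8]) {
    return _mm512_set1_epi64((long long)(uintptr_t)storage);
}

// The scatter/gather is where each lane gets offset to its own slot
// (lane i ends up at base + i * 8 bytes).
__m512i offsetToOwnSlot(__m512i basePtrs) {
    const __m512i laneOffsets =
        _mm512_setr_epi64(0, 8, 16, 24, 32, 40, 48, 56);
    return _mm512_add_epi64(basePtrs, laneOffsets);
}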

Function Calls

This language supports function pointers, which are callable. This means that you can write a program like this (taken from the test suite):

function helper1(): uint64 ...

function helper2(): uint64 ...

function main(index: uint64): uint64 {
    variable x: FunctionPointer<uint64>;

    if (index < 3) {
        x = helper1;
    } else {
        x = helper2;
    }

    return x();
}

Here, that call to x() means different threads may be calling different functions. This is a problem for us, because all the "threads" that are running share the same instruction pointer. We can't actually have some threads call one function and other threads call another function. So, what we have to do instead is set the predication bitmask to only the "threads" which call one function, then call that function, then set the predication bitmask to the remaining threads, then call the other function. Both functions get called, but the only "threads" that are alive during each call are the ones that are actually supposed to be running that function.

This is tricky to get right, though, because anything could be in that function pointer vector. Maybe all the threads ended up with the same pointers! Or maybe each thread ended up with a different pointer! You *could* do the naive thing and do something like:

for i in 0 ..< numThreads:
    predicationMask = originalPredicationMask & (1 << i)
    call function[i]

But this has really atrocious performance characteristics. This means that every call actually calls numThreads functions, one-by-one. But each one of those functions can have more function calls! The execution time will be proportional to numThreads ^ callDepth. Given that function calls are super common, this exponential runtime isn't acceptable.

Instead, what you have to do is gather up and deduplicate function pointers. You need to do something like this instead:

func generateMask(functionPointers, target):
    mask = 0;
    for i in 0 ..< numThreads:
        if functionPointers[i] == target:
            mask |= 1 << i;
    return mask;

for pointer in unique(functionPointers):
    predicationMask = originalPredicationMask & generateMask(functionPointers, pointer)
    call pointer

I couldn't find an instruction in the Intel instruction set that did this. This is also a complicated enough algorithm that I didn't want to write this in assembly and have the compiler emit the instructions for it. So, instead, I wrote it in C++, and had the compiler emit code to call this function at runtime. Therefore, this routine can be considered a sort of "runtime library": a function that automatically gets called when the code the author writes does a particular thing (in this case, "does a particular thing" means "calls a function").
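Here's a rough C++ sketch of what such a runtime routine could look like for 8 threads - the names and exact interface are invented, but the shape matches the pseudocode above. The generated code would then loop over the returned groups, load each group's mask into k1, and call each target.

#include <cstdint>

// One entry per distinct function pointer found among the live lanes.
struct CallGroup {
    uint64_t target; // the function pointer value
    uint8_t mask;    // lanes that should be live while 'target' runs
};

// Walk the live lanes and coalesce equal pointers into groups.
// Returns the number of groups written (at most 8).
int coalesceCallTargets(const uint64_t lanes[8], uint8_t alive, CallGroup out[8]) {
    int groups = 0;
    uint8_t remaining = alive;
    for (int i = 0; i < 8; i++) {
        if (!(remaining & (1u << i)))
            continue;
        uint8_t mask = 0;
        for (int j = i; j < 8; j++)
            if ((remaining & (1u << j)) && lanes[j] == lanes[i])
                mask |= (uint8_t)(1u << j);
        out[groups++] = CallGroup{lanes[i], mask};
        remaining = (uint8_t)(remaining & ~mask);
    }
    return groups;
}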

Doing it this way means that you don't get exponential runtime. Indeed, if your threads all have the same function pointer value, each call site performs just a single call. And if the threads diverge, the slowdown will be at most proportional to the number of threads. You'll never run a function where the predication bitmask is 0, which puts a bound on how slow the worst case can be - it will never get worse than having each thread individually diverge from all the other threads.

Control Flow

As described above, control flow (meaning: if statements, for loops, breaks, continues, and returns) is implemented by changing the value of the predication bitmask register. The x86_64 instruction set has instructions that do this.

There are 2 ways to handle the predication registers. One way is to observe the fact that there are 8 predication registers, and to limit the language to only allow 8 (7? 6? 3?) levels of nested control flow. If you pick this approach, the code that you emit inside each if statement and for loop would use a different predication register. (Sibling if statements can use the same predication register, but nested ones have to use different predication registers.) 

I elected not to add this restriction, but instead to save and restore the values of the predication register to the stack. This is slower, but it means that control flow can be nested without limit. So, all the instructions I emit are predicated on the k1 register - I never use k2 - k7 (except that I use k2 to save/restore the value of k1 during scatter/gather operations, because those clobber whichever mask register you pass into them).

For an "if" statement, you actually need to save 2 predication masks:

  1. One that saves the predication mask that was incoming to the beginning of the "if" statement. You need to save this so that, after the "if" statement is totally completed, you can restore it back to what it was originally
  2. If there's an "else" block, you also need to save the bitmask of the threads that should run the "else" block. You might think that you can compute this value at runtime instead of saving/loading it (it would be the inverse of the threads that ran the "then" block, and-ed with the set of incoming threads) but you actually can't do that, because break and continue statements might need to modify this value. Consider a break statement that's a direct child of the "then" block - at the end of the "then" block, there will be no threads executing (because they all executed the "break" statement). If you then use the set of currently executing threads to try to determine which should execute the "else" block, you'll erroneously determine that all threads (even the ones which ran the "then" block!) should run the "else" block. Instead, you need to compute up-front the set of threads that should be running the "else" block, save it, and re-load it when starting to execute the "else" block.
 For a "for" loop, you also need to save 2 predication masks:

  1. Again, you need to store the incoming predication mask, to restore it after the loop has totally completed
  2. You also need to save and restore the set of threads which should execute the loop increment operation at the end of the loop. The purpose of saving and restoring this is so that break statements can modify it. Any thread that executes a break statement needs to remove itself from the set of threads which executes the loop increment. Any thread that executes a continue statement needs to remove itself from executing *until* the loop increment. Again, this is a place where you can't recompute the value at runtime because you don't know which threads will execute break or continue statements.
If you set up "if" statements and "for" loops as above, then break and continue statements actually end up really quite simple. First, you can verify statically that no statement directly follows them - they should be the last statement in their block.

Then, what a break statement does is:
  1. Find the deepest loop it's inside of, and find all the "if" statements between that loop and the break statement
  2. For each of the "if" statements:
    1. Emit code to remove all the currently running threads from both of the saved bitmasks associated with that "if" statement. Any thread that executes a break statement should not run an "else" block and should not come back to life after the "if" statement.
  3. Emit code to remove all the currently running threads from just the second bitmask associated with the loop. (This is the one that gets restored just before the loop increment operation). Any thread that executes a break statement should not execute the loop increment.
A "continue" statement does the same thing except for the last step (those threads *should* execute the loop increment). And a "return" statement removes all the currently running threads from all bitmasks from every "if" statement and "for" loop it's inside of.

This is kind of interesting - it means an early return doesn't actually stop the function or perform a jmp or ret. The function still continues executing, albeit with a modified predication bitmask, because there might still be some threads "alive." It also means that "if" statements don't actually need to have any jumps in them - in the general case, both the "then" block and the "else" block will be executed, so instead of jumps you can just modify the predication bitmasks - and emit straight-line code. (Of course, you'll want the "then" block and the "else" block to both jump to the end if they find that they start executing with an empty predication bitmask, but this isn't technically necessary - it's just an optimization.)
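To make the bookkeeping concrete, here's a small sketch (masks simulated as plain bytes rather than k registers, names invented) of the sequence an "if"/"else" compiles down to:

#include <cstdint>

// The two masks an "if" statement spills to the stack, per the list above.
struct SavedIfMasks {
    uint8_t incoming; // restored once the whole "if" statement is done
    uint8_t elseMask; // computed up front, loaded when the "else" block starts
};

// 'exec' plays the role of k1; 'condMask' has a bit set for every lane whose
// condition evaluated to true.
void ifElseBookkeeping(uint8_t& exec, uint8_t condMask) {
    SavedIfMasks saved;
    saved.incoming = exec;
    saved.elseMask = (uint8_t)(exec & ~condMask);

    exec = (uint8_t)(exec & condMask);
    // ... "then" block runs here; a break/return inside may clear bits in
    //     exec, saved.elseMask, and saved.incoming ...

    exec = saved.elseMask;
    // ... "else" block runs here ...

    exec = saved.incoming; // threads come back to life after the "if"
}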

Shared Variables


When you're using the SIMT approach, one thing that becomes useful is the ability to interact with external memory. GPU threads don't really perform I/O as such; instead they communicate with the outside world by reading/writing global memory. This is a bit of a problem for SIMT-generated code, because it assumes that everything has a vector type - one element per thread. But, when interacting with external memory, all "threads" see the same values - a shared int is just an int, not a vector of ints.

That means we now have a 3rd kind of value classification. Previously, we had l-values and r-values, but l-values can be further split into vector-l-values and scalar-l-values. A pointer type now needs to know statically whether it points to a vector-l-value or a scalar-l-value. (This information needs to be preserved as we pass it from l-value pointers through the & and * operators.) In the language, this looks like "pointer<uint64 | shared>".

It turns out that, beyond the classical type-checking analysis, it's actually pretty straightforward to deal with scalar-l-values. They are strictly simpler than vector-l-values.

In the language, you can declare something like:

variable<shared> x: uint64;
x = 4;

which means that it is shared among all the threads. If you then refer to x, that reference expression becomes a scalar-l-value, and produces a vector of pointers, all of which point to x's (shared) storage. The "=" in the "x = 4;" statement now has to be made aware that:
  1. If the left side is a vector-l-value, then the scatter operation needs to offset each pointer in the vector to point to the specific place inside the destination vectors that the memory operations should write to
  2. But, if the left side is a scalar-l-value, then no such offset needs to occur. The pointers already point to the one single shared memory location. Everybody points to the right place already.

(And, of course, same thing for the right side of the assignment, which can be either a vector-l-value, a scalar-l-value, or an r-value.)

Comparisons and Booleans


AVX-512 of course has vector compare instructions. The result of these vector comparisons *isn't* another vector. Instead, you specify one of the bitmask registers to receive the result of the comparison. This is useful if the comparison is the condition of an "if" statement, but it's also reasonable for a language to have a boolean type. If the boolean type is represented as a normal vector holding 0s and 1s, there's an elegant way to convert between the comparison and the boolean.

The comparison instructions look like:

vpcmpleq %zmm1,%zmm0,%k2{%k1}

If you were to speak this aloud, what you'd say is "do a vector packed compare for less-than-or-equal-to on the quadwords in zmm0 and zmm1, put the result in k2, and predicate the whole operation on the value of k1." Importantly, the operation itself is predicated, and the result can be put into a different predication register. This means that, after you execute this thing, you know which threads executed the instruction (because k1 is still there) but you also know the result of the comparison (because it's in k2).

So, what you can do is: use k1 to broadcast a constant 0 into a vector register, and then use k2 to broadcast a constant 1 into the same vector register. This will leave a 1 in all the spots where the test succeeded, and a 0 in all the remaining spots. Pretty cool!

If you want to go the other way, to convert from a boolean to a mask, you can just compare the boolean vector to a broadcasted 0, and compare for "not equal." Pretty straightforward.
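As a sketch of both directions in intrinsics form (names invented; 'exec' plays the role of k1 and 'cmp' plays the role of k2):

#include <immintrin.h>

// Mask -> boolean vector: merge 0 into every live lane, then 1 into every
// lane where the comparison passed. Dead lanes of 'dst' are left untouched.
__m512i maskToBool(__m512i dst, __mmask8 exec, __mmask8 cmp) {
    dst = _mm512_mask_set1_epi64(dst, exec, 0);
    dst = _mm512_mask_set1_epi64(dst, cmp, 1);
    return dst;
}

// Boolean vector -> mask: "not equal to zero", predicated on the live lanes.
__mmask8 boolToMask(__mmask8 exec, __m512i boolVec) {
    return _mm512_mask_cmpneq_epi64_mask(exec, boolVec, _mm512_setzero_si512());
}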

Miscellanea


I'm using my own calling convention (and ABI) to pass values into and out of functions. It's for simplicity - the x64 calling convention is kind of complicated if you're using vector registers for everything. One of the most useful decisions I made was to formalize this calling convention by encoding it in a C++ class in the compiler. Rather than having various different parts of the compiler just assume they knew where parameters were stored, it was super useful to create a single source of truth about the layout of the stack at call frame boundaries. I ended up changing the layout a few different times, and having this single source of truth meant that each change only required updating a single class, rather than making changes scattered all over the compiler.
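A hypothetical sketch of what that single source of truth might look like (the layout and names here are invented, not the real ABI):

#include <cstddef>

// Invented sketch: the one class that answers "where does everything live
// at a call boundary?" in this made-up ABI, where every value gets one
// vector-sized slot.
class CallFrameLayout {
public:
    static constexpr size_t slotSize = 64; // one ZMM-sized slot per value

    explicit CallFrameLayout(size_t parameterCount)
        : parameterCount(parameterCount) {}

    size_t returnValueOffset() const { return 0; }
    size_t parameterOffset(size_t index) const { return slotSize * (1 + index); }
    size_t frameSize() const { return slotSize * (1 + parameterCount); }

private:
    size_t parameterCount;
};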

Inventing my own ABI also means that there will be a boundary, where the harness will have to call the generated code. At this boundary, there has to be a trampoline, where the contents of the stack gets rejiggered to set it up for the generated code to look in the right place for stuff. And, this trampoline can't be implemented in C++, because it has to do things like align the stack pointer register, which you can't do in C++. AVX-512 requires vectors to be loaded and stored at 64-byte alignment, but Windows only requires 16-byte stack alignments. So, in my own ABI I've said "stack frames are all aligned to 64-byte boundaries" which means the trampoline has to enforce this before the entry point can be run. So the trampoline has to be written in assembly.

The scatter/gather operations (which are required for l-values to work) only operate on 32-bit and 64-bit values inside the AVX-512 registers. This means that the only types the language can offer are 32-bit and 64-bit types. An AVX-512 vector, which is 512 bits = 64 bytes wide, can hold 8 64-bit values or 16 32-bit values. However, the entire SIMT programming model requires you to pick up front how many "threads" will be executing at once. If some calculations in your program can compute 8 values at a time, and some other calculations can compute 16 values at a time, it doesn't matter - you have to pessimize and only use 8-at-a-time. So, if the language contains 64-bit types, then the max number of "threads" you can run at once is 8. If the language only contains 32-bit types (and you get rid of 64-bit type support, including 64-bit pointers), then you can run 16 "threads" at once. I chose to include 64-bit types and run 8 "threads" at a time, because I didn't want to limit myself to the first 4GB of memory (the natural stack and heap are already farther than 4GB apart from each other in virtual address space, so I'd have to, like, mess with Windows's VM subsystem to allocate my own stack/heap and put them close to each other, and yuck, I'll just use 64-bit pointers thankyouverymuch).

Conclusion


And that's kind of it. I learned a lot along the way - there seem to be good reasons why, in many shading languages (which use this SIMT model),
  • Support for 8-bit and 16-bit types is rare - the scatter/gather operations might not support them.
  • Support for 64-bit types is also rare - the smaller your types, the more parallelism you get, for a particular vector bit-width.
  • Memory loads and stores turn into scatter/gather operations instead.
    • A sophisticated compiler could optimize this, and turn some of them into vector loads/stores instead.
    • This might be why explicit support for pointers is relatively rare in shading languages - no pointers means you can _always_ use vector load/store operations instead of scatter/gather operations (I think).
  • You can't treat memory as a big byte array and memcpy() stuff around; instead you need to treat it logically and operate on well-typed fields, so the predication registers can do the right thing.
  • Shading languages usually don't have support for function pointers, because calling them ends up becoming a loop (with a complicated pointer coalescing phase, no less) in the presence of non-uniformity. Instead, it's easy for the language to just say "You know what? All calls have to be direct. Them's the rules."
  • Pretty much every shading language has a concept of multiple address spaces. The need for them naturally arises when you have local variables which are stored in vectors, but you also need to interact with global memory, which every thread "sees" identically. Address spaces and SIMT are thoroughly intertwined.
  • I thought it was quite cool how AVX-512 complemented the existing (scalar) instruction set. E.g., all the math operations in the language use vector operations, but you still use the normal call/ret instructions. You use the same rsp/rbp registers to interact with the stack. The vector instructions can still use the SIB byte. The broadcast instruction broadcasts from a scalar register to a vector register. Given that AVX-512 came out of the Larrabee project, it strikes me as a very Intel-y way to build a GPU instruction set.