Litherum: 2024

Wednesday, July 3, 2024

Vulkan Synchronization: Why it sucks (Part 2)

There are a variety of reasons why I don't think the Vulkan committee designed synchronization very well.

Let's get one thing out of the way first:

State/Hazard Tracking is a Thing that People Need to Do

At GDC 2016 (the year that Vulkan came out), Matthaeus Chajdas presented making the point that you shouldn't need to do state tracking in Vulkan. He says that, instead of tracking state at runtime, you should just *know* which barriers are needed in which places, and just hardcode them in the right places to be correct. After all - you wrote your rendering algorithm.

I don't think this is realistic. If you're making a toy, then sure, you'll just know where the barriers go. But no serious engines are toys.

Consider something like Unreal Engine - who knows what kind of crazy shit the artists, who don't work at Epic, created in UE that the engine is being asked to render. Node graphs can be arbitrarily complex and can have dependencies with multiple subsystems within the engine. I'd put good money on the claim that the people who wrote Unreal Engine don't know exactly where all the barriers go, without even being able to see the content they're being asked to render.

Also, consider that Direct3D is the lingua franca of 3D graphics (whether you want it to be or not). Almost no games call Vulkan directly. Vulkan is, instead, used as a porting layer - a platform API that your calls have to funnel through if you want your game to run on Android. Game code either calls Direct3D directly, or in the case of the big engines, go through an RHI that's looks similar to Direct3D. For all this content, the way it ends up running on Vulkan is with a layer like Proton - implementing the Direct3D API - or something similar to the Direct3D API - on top of Vulkan, which necessarily requires state tracking, because the Direct3D synchronization API is significantly less detailed/precise than Vulkan is.

So, I just reject outright the idea that devs should *just know* where all their barriers should go, and should magically place them in the right place because the people writing the Vulkan game engine (which doesn't exist) are just really smart. It's just not realistic.

It's Actually D3D, It's Always Been D3D

The Vulkan spec pretends that each stage in a pipeline may execute on a different chip with different caches. Therefore, you have to track each stage in every pipeline independently. But there's no hardware which actually works this way. No GPU actually puts *every* stage of the graphics pipeline on different chips with different caches. So you're doing all this tracking for each one of these stages, but the hardware only has a few (usually 1) different chips that all the programmable stages execute on. So almost all these different barriers are all going to boil down to the same thing under the hood anyway.

Imagine that you weren't writing a Vulkan app, but instead were writing a device-specific driver. You'd know exactly the topology of the device you're writing for, and you'd track only the state that the device actually cares about. Instead, Vulkan forces you to track all the state that a conceptual worst-case device might need. No real device actually needs all that precision.

But the worst part of all this is that, when device manufacturers design devices, they are informed by the graphics APIs of the day. They make their hardware knowing that Direct3D is the lingua franca API, and that the vast majority of content will be written to that API or a similar API. So they design the hardware accordingly. So, what we're left with is: the apps are written for the D3D model, then they have to be filtered down to the super precise Vulkan model, which requires state tracking, only to end up *back* at the D3D model when it hits the hardware. So what was the point of all that tracking for Vulkan? Nothing!

Naive Hazard Tracking is Too Precise

Each array slice and mip level and aspect (depth/stencil/color) of each image can be in a different layout. Synchronization between previous writes and current reads can synchronize each array slice and mip level and aspect independently. In a buffer, every byte range can be synchronized independently.

Now, imagine that you wanted to do hazard tracking at this granularity. You'd need a data structure that remembered, for each texture, for each array slice, for each mip level, for each aspect, what the most recent write to that area of the resource was. Even if you do some kind of run-length encoding, there's no getting around the fact that a 4-dimensional data structure is gonna be a mess. You could have a hashmap of hashmaps of hashmaps of hashmaps, which would waste a ton of space, or you could have something like a 4-dimensional kD tree, which is going to end up with a ton of nodes. There's no getting around the fact that you're gonna create a monster.

If you do something simpler, and don't represent exactly the precision that Vulkan has, that means you're going to somewhat oversynchronize.

The crux here is that there's a nonobvious crossover point. If you're super precise in your tracking data structure, you're going to spend a bunch of time partitioning / compacting it. If you're imprecise, you're going to oversynchronize. There's no good solution.

Image Layouts are (Differently) Too Precise

There are 36 different layouts a region of an image can be in (if you include all non-vendor-specific extensions). Some of those layouts are capable of only being used for one purpose, but many of them are capable of being used for multiple purposes, while only being optimal for one.

So, the developer has a choice. They can either transition every image to whichever layout is optimal for each command they're being used in, thereby issuing lots of transitions. Or, they can pick layouts which are *compatible* with multiple uses but optimal for just some (or maybe even none), and issue fewer barriers. Or something in the middle.

How should the developer make this decision? With which information are they armed in order to make this decision? The answer is: nothing. The developer has no idea which Vulkan layouts actually correspond to identical byte layouts under the hood. (And, D3D has way fewer layouts, so you better believe that not all of the Vulkan ones will actually be distinct in the hardware.)

So, the only thing the developer can do is write the code multiple ways and test the performance. But how many places on the spectrum should they implement? How do they know which codepath to pick if the user runs their code on a device they haven't tested with? There is no good answer.

Vulkan Made an Intentional Decision

The thing that really gets me about Vulkan's synchronization API is that they *didn't* actually go all the way to the extreme:

Semaphores automatically create memory dependencies, and you don't have to specify an access mask with them. This didn't have to be the case, though - the Vulkan committee could have decided to require you to specify access masks with semaphores
Sempahores signaled via queue submissions automatically form a memory dependency with previous command buffers. The Vulkan committee didn't have to do this - they could have said "if you want to form a memory dependency with a command buffer, you must interact with *that* command buffer"
Queue submission automatically forms a memory dependency with host accesses prior to the queue submission (coherent mappings notwithstanding). The Vulkan committee didn't need to do this - they could have said that you need to issue a device-side synchronization command to synchronize with the host
Pipeline barriers themselves don't have to be synchronized with memory dependencies, which is a little surprising because pipeline barriers can change image layouts, which can read/write every byte in the image. The Vulkan committee could have said image layout transitions are done via separate commands which need to have memory dependencies around them.
Pipeline barriers require a bitmask of pipline stages as well as a bitmask of access masks - but there's no relation between them. You can't say "stage X uses access mask Y, whereas stage Z uses access mask W" without issuing multiple barriers. This means that the most natural way to use pipeline barriers (lumping all the stages and access masks together into a single barrier) is actually oversynchronizing - it's totally possible for a single access mask to apply to multiple pipeline stages, which you might not have anticipated when issuing the barrier.

I would understand it if the Vulkan committee went all the way, and said that Vulkan would give you nothing for free, and it's up to the author to do everything. But, instead, they chose a middle ground, which indicates they think this is an appropriate middle ground. They intentionally chose this compromise. But it's a terrible compromise! It's precise enough that you can't do it well, and it's entirely needless because it pessimizes to no hardware that actually exists.

Manufacturers are Better Equipped for this than Game Developers

I'd like to make this point again: State/hazard tracking is a thing that people have to do with Vulkan, but there's no way to do it better than a device-specific driver, because Vulkan pessimizes. I appreciate that Vulkan lets the app developer choose what granularity of state/hazard tracking they perform, but I think that's a constituency inversion: the only rubric for the tradeoffs of tracking is performance, and the tradeoffs will be device-dependent, so the manufacturers making the devices, who are experts in each individual device they produce, will be best equipped to make that tradeoff. They'll certainly do a better job than an individual game developer house with hundreds of GPUs from many vendors to test out. This goes doubly for GPUs that release after the game does, which is a situation where the game has to run on a device the developer has never seen before.

The game developer doesn't know the internal topology of the GPU they're running on, and they're not equipped to make the kinds of tradeoffs that the Vulkan API forces them to make. Only the device manufacturers can actually make these sensible tradeoffs for each device.

Vulkan Synchronization: What it is (Part 1)

Synchronization in Vulkan is actually somewhat tricky to grok. And you have to grok it, because it isn't done automatically for you - you have to explicitly synchronize between every data hazards that might arise. (This is in contrast to APIs like Metal which do most, or even all, of the synchronization for you.)

Hazards

So let's start out with: What's a data hazard? A data hazard is when two operations can't happen simultaneously, either because one operation depends on the result of the other one (think: a read depends on a previous write) or one needs to be independent of the other (think: a read before a write needs to *not* see the result of the write). To forbid these, you make explicit calls to Vulkan to tell it what to disallow from happening simultaneously with what else.

There are 2 flavors of synchronization in Vulkan: execution dependencies and memory dependencies.

Execution dependencies are the simplest - they don't say *anything* about memory, but instead *just* describe a "happens-before" relationship. This is sufficient for a write-after-read hazard, where the read shouldn't see the results of the write - it's enough to just delay the write until after the read completed.

Memory dependencies imply an execution dependency - so memory dependencies are strictly more powerful than execution dependencies. A memory dependency is where there is actually communication going on, through memory. It is sufficient for read-after-write hazards, where the read needs to see the result of the write, and also write-after-write hazards. There are 2 pieces to it:

The first operation's result become "available". Think of this as a cache flush. After the first operation, the data has to actually move from the source cache out to main memory
The memory becomes "visible" to the second operation. Think of this as a cache invalidation. If the receiver wants to see the result of the previous operation, it has to mark its own cache as invalid, so the operation actually goes out to main memory to see the results.

There is no such thing as a read-read hazard, so no synchronization is necessary in that case.

Tools

Vulkan provides many tools that you can use to synchronize.

Fences are used to synchronize the device ("device" = "GPU") with the host ("host" = "CPU"). You specify a fence to be signaled when the device work is done as a part of vkQueueSubmit(). This kind of synchronization is actually (surprisingly) really easy, because Section 7.9 essentially says that you don't need to deal with fences when uploading data to the device. The spec for vkWaitForFences() indicates that the only thing you need to do for downloading data from the device is simply wait on the fence. (Also, if your resource isn't coherently mapped, you need to use vkFlushMappedMemoryRanges() and vkInvalidateMemoryRanges() so the CPU's caches will get out of the way.
Semaphores are used to synchronize between different queues. They actually form a dependency graph, where each command buffer submitted to a queue specifies a set of semaphores to wait for before it starts executing, and a set of semaphores to signal when it's done executing.
Events are used to synchronize different commands within the same queue. Just because two operations are submitted to the same queue does not mean they will execute in-order. Events are "split" in that there is vkCmdSetEvent() as distinct from vkCmdWaitEvents(). These are commands that get recorded into a command buffer.
Pipeline barriers are also used to synchronize different commands within the same queue, but the difference with events is that barriers aren't split - there's just a single vkCmdPipelineBarrier(). This is also a command that gets recorded into a command buffer.

I won't say much more about fences - as I described above, you really don't need to think about them much. One of the cool things about fences is that you can "export" it to a "sync fd" on UNIX platforms, and the resulting fd can be used in select() and epoll(), which makes for a nice way to integrate into your app's existing run loop.

I also won't say much more about events - they're just the same thing as pipeline barriers, but split into two halves.

Stages and Accesses

Accesses by the GPU happen within a "pipeline," which is comprised of a sequence of stages. For example, the vertex shader and the fragment shader are stages of the graphics pipeline (among many other stages). Vulkan is designed so that each stage within a pipeline can execute on a totally different chip than any of the other stages - and each chip will have its own cache. Therefore, if your pipeline has n stages, Vulkan forces you to pessimize and assume that each of those n stages will execute on a different chip with a different set of caches, so you have n different caches you have to manage.

However, it's actually worse than that. Each stage might be able to access a resource in a variety of different way - for example, a sampled texture vs a storage texture. Each of these different ways to access a resource *also* might have its own cache - the cache used for storage textures might be a totally different cache than the cache used for sampled textures. So you actually have more than n cached to worry about.

The more caches you have to manage, the more synchronization calls you need to make.

When you issue a pipeline barrier to Vulkan, you have to describe which source and destination you're synchronizing. The source and destination both include which stage is doing the access, and which kind of access it is (e.g. sampled texture vs storage texture). If you just supply the stages, but don't supply the accesses, that describes an execution dependency. If you supply both, that describes a memory dependency.

Semaphores always describe memory dependencies, and they don't ask you for what kind of accesses it's synchronizing with - it presumably pessimizes and assumes it has to synchronize as-if all kinds of accesses happened. Instead, it just asks you which stages it should provide a memory barrier between.

I should also probably mention that it's possible for *you* to pessimize too - the enum for access kind is a bitmask, so you can specify multiple values, and it also has values for all and none.

Pipeline Barriers et al.

Pipeline barriers also do 2 more things: image layouts and queue transfers.

Image layouts are fundamentally different than what I've been describing so far. Previously, I've been describing synchronization - cache operations and source and destination processors. On the other hand, an image layout is a state that the image is in. The spec says it's an in-memory ordering of the data blocks within the image. There are lots of different layouts an image can be in, with each one being optimized for some particular purpose. Transitioning an image may require reading and writing all of the data within the image - to move its blocks around. So, you can't pessimize here in the same way you can pessimize about synchronization (and issue the biggest pipeline barrier possible between every command) - instead, if you want your accesses to be optimal, you have to remember which layout every (region of every) image is in, and transition it as necessary. If you were going to pessimize, you'd just leave the image in the "general" layout and never change it.

Queue transfers are somewhat similar, in that they are state the image is in. When creating an image, you decide whether the image is "exclusive" to a single queue, or shared among multiple queues. If the image is shared among multiple queues, you have to use synchronization to make sure the different queues don't step on each others' toes. Otherwise, if the image is exclusive, you can change which queue it's owned by with a queue transfer in a pipeline barrier. It's actually pretty straightforward - the source queue issues the pipeline barrier to release its ownership, and the destination queue issues the same pipeline barrier to acquire ownership.

Which Bytes?

There's one last piece of Vulkan synchronization - and that is chopping up resources. When you issue a pipeline barrier on a buffer, you get to say which byte region of the buffer the synchronization applies to. (If you supply something that the implementation can't exactly match - let's say you didn't supply it on page boundaries or something - the implementation is allowed to pessimize and synchronize more of the resource than you asked for.)

For textures, it's a little more complicated. You can't issue a pipeline barrier for a particular rectangular region of a 2D texture. Instead, your barriers can target a specific mip level range, array layer range, and aspect range (aspect = depth part, stencil part, or color part). Each mip level, array layer, or aspect of a texture can be in a different layout and synchronized independently.

The Story

So, as your Vulkan program runs, it will issue reads and writes to specific regions of resources. The sequence of reads and writes to a particular part of a resource will cause hazards, and you need to classify the kinds of hazards and issue synchronization calls to cause them to be synchronized. You can use synchronization calls to create either execution dependencies or memory dependencies. Within a single queue, you use pipeline barriers (or events) to synchronize, and across queues you use semaphores (which act upon entire command buffers), and to synchronize command buffers with the host you use fences.

There's kind of a problem, though - hazards become apparent at the site of recording the destination access, but the necessary pipeline barrier requires you to specify the *previous* access. Now, maybe you *just know* the previous access - you wrote your rendering engine, after all - but usually command buffers are recorded independently (perhaps even in parallel). At the time you record a command into a command buffer, you may not know what the previous access was to that portion of the resource - maybe the last access happened in a totally different command buffer that hasn't even been recorded yet.

The natural solution to this is a two-phase tracking solution: within a single command buffer, try to issue whatever pipeline barriers you know are necessary. For the ones you can't issue because you don't know the source accesses, simply remember those (and don't issue pipeline barriers for them). Your queue submission system can then do its own global tracking to use semaphores to synchronize with whichever accesses actually got submitted just before the current command buffer's accesses.

I'll be discussing this in more in a forthcoming Part 2.

Saturday, April 27, 2024

Avoiding Seemingly-Necessary Retain Cycles

I was recently implementing an API that seemed like it required a retain cycle. Object A has a property of object B, and object B has a property of object A. Customer code is allowed to retain either object A or object B and must be able to access the other one via the property. It seems like this requires a retain cycle to implement, doesn't it!

Specifically, I'm implementing reflection for a programming language compiler. The customer gives a program to the system as a string, and the system compiles their program, but also returns a reflection object that describes the contents of their program. The reflection object consists of a collection of subobjects, each of which represents a part of the customer's program.

So, if the customer's program is something like:

struct Foo {

Foo* link;

}

There will be a subobject which represents the struct, and there will be a subobject under that which represents the field. The field has a subobject which represents the type of the field, and that object has an accessor which references the inner type of the pointer - which is the Foo struct again. That's the retain cycle.

There is a solution, though, which doesn't actually involve retain cycles. It's something I learned from the WebKit team during my time working on that project. The solution is that you take all the objects in the cycle, and instead of having each object have its own reference count, you make a single reference count for the entire collection. If any customer code wants to reference any subobject, the subobjects are set up to automatically forward it on to the entire collection's single reference count. If you know, then, that every object in the collection has the same reference count and therefore the same lifetime, then all links within the collection (from one subobject to another subobject) can be raw pointers (rather than strong references that would increment the reference count). Indeed, you actually *can't* have a strong reference from one subobject to another, because that would actually mean that the collection has a strong reference to itself.

So, let's break it down.

Step 1: Identify a single owning object under which all of the subobjects live. In my case, that's the top-level "Reflection" object. This object's reference count will represent the reference count for the entire collection of subobjects under it.

Step 2: For each subobject under the reflection object, give it a raw pointer to the owning parent object. This can be a raw pointer, because we are guaranteeing that the lifetime of all the subobjects is equal to the lifetime of the single owning object. We can also have each subobject notify the owning object that it owns the subobject. This will be important later, when the owning object eventually dies, and wants to *actually* destroy each subobject.

Step 3: For each subobject, override its retain and release method to call the owning object's retain and release method instead. This means that any time any customer wants to retain or release a subobject, they end up retaining and releasing the single owning object instead. In this way, the owning object's lifetime is the union of the lifetimes of each of the subobjects.

Step 4: We have to modify any references from any subobject to any other subobject to be a raw pointer, instead of a strong reference. At this point, if there is any strong reference from a subobject to another subobject, that strong reference will be forwarded to the parent object, and the parent object would stay alive forever. So we have to avoid this, and use raw pointers for all links between subobjects.

Step 5: We have to modify the destructor of the parent object to destroy each of the subobjects under it. There are 2 pieces to this. The first piece is that the parent object needs a vector of raw pointers to each of the subobjects. We can build up this vector when the subobject notifies the parent object back in step 2. Also, the vector needs to hold raw pointers (as opposed to strong references) for the same reason that subobjects need to hold raw pointers among themselves - if we don't, the owning object will live forever. The second piece is that there has to be some way for the parent to *actually* destroy a subobject. It can't just call release on the subobject, because that will end up being forwarded back to the owning object. Therefore, all of the subobjects need a new method, "actually release," which actually releases the subobject without forwarding it to the owning object. The destructor of the owning object calls this on each subobject which was registered to it.

Step 6: There's one last gotcha: When subobjects are created, the customer code is going to think they are +1 objects, and release them eventually. Therefore, the constructor of the subobjects need to retain their owner, which balances out this pending release. This looks like a leak, but it's not - the owning object will destroy every subobject when the owning object gets destroyed, and the owning object gets destroyed when the last reference to either itself or any of the subobjects gets destroyed.

It's a little tricky to get it all correct, but it is possible. The API contract *seems* like it would require a retain cycle, but it's actually possible to implement without one, by unifying the retain counts of all the objects in the cycle into a single retain count for the whole group, thereby treating the cycle as a single item.

Monday, March 18, 2024

So I wrote a double delete...

I wrote a double delete. Actually, it was a double autorelease. Here's a fun story describing the path it took to figure out the problem.

I'm essentially writing a plugin to a different app, and the app I'm plugging-in-to is closed-source. So, I'm making a dylib which gets loaded at runtime. I'm doing this on macOS.

The Symptom

When running my code, I'm seeing this:

Let's see what we can learn from this.

First, it's a crash. We can see we're accessing memory that we shouldn't be accessing.

Second, it's inside objc_release(). We can use a bit of deductive reasoning here: If the object we're releasing has a positive retain count, then the release shouldn't crash. Therefore, either we're releasing something that isn't an object, or we're releasing something that has a retain count of 0 (meaning: a double release).

Third, we can actually read a bit of the assembly to understand what's happening. The first two instructions are just a way to check if %rdi is null, and, if so, jump to an address that's later in the function. Therefore, we can deduce that %rdi isn't null.

%rdi is interesting because it's the register that holds the first argument. It's probably a safe assumption to make that objc_release() probably just takes a single argument, and that argument is a pointer, and that pointer is stored in %rdi. This assumption is somewhat-validated by reading the assembly: nothing seems to be using any of the other parameter registers.

The next 3 lines check if the low bit in %rdi is 1 or not. If it's 1, then we again jump to an address that's later in the function. Therefore, we can deduce that %rdi is an even number (its low bit isn't 1).

The next 3 lines load a value that %rdi is pointing to, and mask off most of its bits. The next line, which is the line that's crashing, is trying to load the value that the result points to.

All this makes total sense: Releasing a null pointer should do nothing, and releasing tagged pointers (which I'm assuming are marked by having their low bit set to 1) should do nothing as well. If the argument is an Objective-C object, it looks like we're trying to load the isa pointer, which probably holds something useful at offset 0x20. That's the point where we're crashing.

That leads to the deduction: Either the thing we're trying to release isn't an Objective-C object, or it's already been released, and the release procedure clears (or somehow poisons) the isa value, which caused this crash. Either way, we're releasing something that we shouldn't be releasing.

One of the really useful observations about the assembly is that nothing before the crash point clobbers the value of %rdi. This means that a pointer to the object that's getting erronously released is *still* in %rdi at the crash site.

We can also see that the crash is happening inside AutoreleasePool:

This doesn't indicate much - just that we're autoreleasing the object instead of releasing it directly. It also means that, because autorelease is delayed, we can't see anything useful in the stack trace. (If we were releasing directly instead of autoreleasing, we could see exactly what caused it in the stack trace.)

The First Thing That Didn't Work

The most natural solution would be "Let's use Instruments!" It's supposed to have a tool that shows all the retain stacks and release stacks for every object.

When running with Instruments, we get a nice crash popup showing us that we crashed:

The coolest part about this is that it shows us the register state at the crash site, which gives us %rdi, the pointer to the object getting erroneously released.

Cool, so the object which is getting erroneously released is at 0x600002a87b40. Let's see what instruments lists for that address:

Well, it didn't list anything for that address. It listed something for an address just before it, and just after it, but not what we were looking for. Thanks for nothing, Instruments.

The Second Thing That Didn't Work

Well, I'm allocating and destroying objects in my own code. Why don't I try to add logging all my own objects to see where they all get retained and released! Hopefully, by cross referencing the address of the object that gets erroneously deleted with the logging of the locations of my own objects, I'll be able to tell what's going wrong.

We can do this by overriding the -[NSObject release] and -[NSObject retain] calls:

As well as init / dealloc:

Unfortunately, this spewed out a bunch of logging, but the only thing it told me was the object that was being erroneously released wasn't one of my own objects. It must be some other object (NSString, NSArray, etc.).

The Third Thing That Didn't Work

Okay, we know the object is being erroneously autoreleased. Why don't we log some useful information every time anyone autoreleases anything? We can add a symbolic breakpoint on -[NSObject autorelease].

Here's what it looks like when this breakpoint is hit:

Interesting - so it looks like all calls to -[NSObject autorelease] are immediately redirected to _objc_rootAutorelease(). The self pointer is preserved as the value of the first argument.

If you list the registers at the time of the call, you can see the object being released:

So let's modify the breakpoint to print all the information we're looking for:

Unfortunately, this didn't work because it was too slow. Every time lldb evaluates something, it takes a bunch of time, and this was evaluating 3 things every time anybody wanted to autorelease anything, which is essentially all the time. The closed-source application I'm debugging is sensitive enough, that if anything takes too long, the application just quits.

The Fourth Thing That Didn't Work

Lets try to print out the same information as before, but do it inside the application rather than in lldb. That way, it will be much faster.

The way we can do this is with something called "function interposing." This uses a feature of dyld which can replace a library's function with your own. Note that this only works if you disable SIP and set the nvram variable amfi_get_out_of_my_way=0x1 and reboot.

We can do this to swap out all calls to _objc_rootAutorelease() with our own function.

Inside our own version of _objc_rootAutorelease(), we want to keep track of everything that gets autoreleased. So, let's keep track of a global dictionary, from pointer value to info string.

We can initialize this dictionary inside a "constructor," which is a special function in a dylib which gets run when the dylib gets loaded by dyld. This is a great way to initialize a global.

Inside my_objc_rootAutorelease(), we can just add information to the dictionary. Then, when the crash occurs, we can print the dictionary and find information about the thing that was autoreleased.

However, something is wrong...

The dictionary only holds 315 items. That can't possibly be right - it's inconceivable that only 315 things got autoreleased.

The Fifth Thing That Didn't Work

We're close - we just need to figure out why so few things got autoreleased. Let's verify our assumptions, that [foo autorelease] actually calls _objc_rootAutorelease() by writing such code and looking at its disassembly.

And if you look at the disassembly...

You can see 2 really interesting things: the call to alloc and init got compressed to a single C call to objc_alloc_init(), and the call to autorelease got compressed to a single C call to obc_autorelease(). I suppose the Objective-C compiler knows about the autorelease message, and is smart enough to not invoke the entire objc_msgSend() infrastructure for it, but instead just emits a raw C call for it. So that means we've interposed the wrong function - we were interposing _objc_rootAutorelease() when we should have been interposing objc_autorelease(). So let's interpose both:

This, of course, almost worked - we just have to be super sure that my_objc_autorelease() doesn't accidentally call autorelease on any object - that would cause infinite recursion.

The Sixth Thing That Didn't Work

Avoiding calling autorelease inside my_objc_autorelease() is actually pretty much impossible, because anything interesting you could log about an object will, almost necessarily, call autorelease. Remember that we're logging information about literally every object which gets autoreleased, which is, in effect, every object in the entire world. Even if you call NSStringFromClass([object class]) that will still cause something to be autoreleased.

So, the solution is to set some global state for the duration of the call to my_objc_autorelease(). If we see a call to my_objc_autorelease() while the state is set, that means we're autoreleasing inside being autoreleased, and we can skip our custom logic and just call the underlying objc_autorelease() directly. However, there's a caveat: this "global" state can't actually be global, because Objective-C objects are created and retained and released on every thread, which means this state has to be thread-local. Therefore, because we're writing in Objective-C and not C++, we must use the pthreads API. The pthreads threadspecific API uses a "key" which has to be set up once, so we can do that in our constructor:

Then we can use pthread_setspecific() and pthread_getspecific() to determine if our calls are being nested.

Except this still didn't actually work, because abort() is being called...

The Seventh Thing That Didn't Work

Luckily, when abort() is called, Xcode shows us a pending Objective-C exception:

Okay, something is being set to nil when it shouldn't be. Let's set an exception breakpoint to see what is being set wrong:

Welp. It turns out NSStringFromClass([object class]) can sometimes return nil...

The Eighth Thing That Worked

Okay, let's fix that by checking for nil and using [NSNull null]. Now, the program actually crashes in the right place!

That's more like it. Let's see what the pointer we're looking for is...

Okay, let's look for it in bigDict!

Woohoo! Finally some progress. The object being autoreleased is an NSDictionary.

But that's not enough, though. What we really want is a backtrace. We can't use lldb's backtrace because it's too slow, but luckily macOS has a backtrace() function which gives us backtrace information! Let's build a string out of the backtrace information:

Welp, that's too slow - the program exits. Let's try again by setting frameCount to 6:

So here is the final autorelease function:

Okay, now let's run it, and print out the object we're interested in:

And the bigDict:

Woohoo! It's a great success! Here's the stack trace, formatted:

Excellent! This was enough for me to find the place where I had over-released the object.