<h2 style="text-align: left;">So I wrote a double delete...</h2><p>Litherum, 2024-03-18</p><p>I wrote a double delete. Actually, it was a double autorelease. Here's a fun story describing the path I took to figure out the problem.</p><p>I'm essentially writing a plugin to a different app, and the app I'm plugging into is closed-source. So, I'm making a dylib which gets loaded at runtime. I'm doing this on macOS.</p><h3 style="text-align: left;">The Symptom</h3><p>When running my code, I'm seeing this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjlXfswMQ2x8tcDJr5tOKLn3S8J3Yboo1qa2rCiIMInAxuedYhqxTSPOfbV_vlcxMa_4jlpEV_bviI2VCeTPGtTsFW8SX-xWMPf8O0Wtg1Z7eOxjr7WL1cVEk6AkH5LaYw45gDrc2kkXAKP6GdQ8S6jiWr50wgksN1esGFiND2dEiy5PhHMj0q_WouBKgfX" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="986" data-original-width="2044" height="308" src="https://blogger.googleusercontent.com/img/a/AVvXsEjlXfswMQ2x8tcDJr5tOKLn3S8J3Yboo1qa2rCiIMInAxuedYhqxTSPOfbV_vlcxMa_4jlpEV_bviI2VCeTPGtTsFW8SX-xWMPf8O0Wtg1Z7eOxjr7WL1cVEk6AkH5LaYw45gDrc2kkXAKP6GdQ8S6jiWr50wgksN1esGFiND2dEiy5PhHMj0q_WouBKgfX=w640-h308" width="640" /></a></div><br />Let's see what we can learn from this.<p></p><p>First, it's a crash. We can see we're accessing memory that we shouldn't be accessing.</p><p>Second, it's inside <span style="font-family: courier;">objc_release()</span>. We can use a bit of deductive reasoning here: If the object we're releasing has a positive retain count, then the release shouldn't crash.
Therefore, either we're releasing something that isn't an object, or we're releasing something that has a retain count of 0 (meaning: a double release).</p><p>Third, we can actually read a bit of the assembly to understand what's happening. The first two instructions are just a way to check if <span style="font-family: courier;">%rdi</span> is null, and, if so, jump to an address that's later in the function. Therefore, we can deduce that <span style="font-family: courier;">%rdi</span> isn't null.</p><p><span style="font-family: courier;">%rdi</span> is interesting because it's the register that holds the first argument. It's probably a safe assumption that <span style="font-family: courier;">objc_release()</span> takes a single argument, that the argument is a pointer, and that the pointer is stored in <span style="font-family: courier;">%rdi</span>. This assumption is somewhat validated by reading the assembly: nothing seems to be using any of the other parameter registers.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQPrbkAUtzqnH2ExSrx_C1rwdoWR9puN2nWSC5QEKI5TSMLlAiX-6DBurzzr4Vs54WQSKWP17QrJzmDkD5guVD4GLzpqr4mMS3BIkEl1KFzplinywNPFm-Ym5scudhr7a_os6CeE-N72uqmnq4o6N6HzsNa18cOC8JYED8AGvjcDZxTy8hXHaUFMlhE9gS" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1600" data-original-width="2308" height="444" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQPrbkAUtzqnH2ExSrx_C1rwdoWR9puN2nWSC5QEKI5TSMLlAiX-6DBurzzr4Vs54WQSKWP17QrJzmDkD5guVD4GLzpqr4mMS3BIkEl1KFzplinywNPFm-Ym5scudhr7a_os6CeE-N72uqmnq4o6N6HzsNa18cOC8JYED8AGvjcDZxTy8hXHaUFMlhE9gS=w640-h444" width="640" /></a></div><p></p><p>The next 3 lines check if the low bit in <span style="font-family: courier;">%rdi</span> is 1 or not. If it's 1, then we again jump to an address that's later in the function.
Therefore, we can deduce that <span style="font-family: courier;">%rdi</span> is an even number (its low bit isn't 1).</p><p>The next 3 lines load a value that <span style="font-family: courier;">%rdi</span> is pointing to, and mask off most of its bits. The next line, which is the line that's crashing, is trying to load the value that the result points to.</p><p>All this makes total sense: Releasing a null pointer should do nothing, and releasing tagged pointers (which I'm assuming are marked by having their low bit set to 1) should do nothing as well. If the argument is an Objective-C object, it looks like we're trying to load the <span style="font-family: courier;">isa</span> pointer, which probably holds something useful at offset <span style="font-family: courier;">0x20</span>. That's the point where we're crashing.</p><p>That leads to the deduction: Either the thing we're trying to release isn't an Objective-C object, or it's already been released, and the release procedure clears (or somehow poisons) the <span style="font-family: courier;">isa</span> value, which caused this crash. Either way, we're releasing something that we shouldn't be releasing.</p><p>One of the really useful observations about the assembly is that nothing before the crash point clobbers the value of <span style="font-family: courier;">%rdi</span>. 
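</p>

<p>Putting those checks together, the preamble we just walked through behaves roughly like this C sketch. (The struct layout, the names, the mask constant, and the meaning of offset <span style="font-family: courier;">0x20</span> are all my guesses from reading the disassembly, not Apple's actual implementation.)</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rough C rendering of the first few instructions of objc_release(), as
 * read from the disassembly. Everything here is a guess for illustration. */
typedef struct {
    uintptr_t isa_bits; /* isa pointer mixed together with flag bits */
} fake_objc_object;

/* Some plausible mask; the real constant is whatever the assembly shows. */
#define FAKE_ISA_MASK 0x00007ffffffffff8ULL

int release_would_dereference(fake_objc_object *obj) {
    if (obj == NULL)
        return 0;                   /* releasing nil: do nothing */
    if ((uintptr_t)obj & 1)
        return 0;                   /* low bit set: tagged pointer, skip */
    uintptr_t isa = obj->isa_bits & FAKE_ISA_MASK; /* mask off flag bits */
    /* The real code next loads *(isa + 0x20); if a prior release cleared
     * or poisoned isa, that load is exactly the crash in the screenshot. */
    return isa != 0;
}
```

<p>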
This means that a pointer to the object that's getting erroneously released is <i>still</i> in <span style="font-family: courier;">%rdi</span> at the crash site.</p><p>We can also see that the crash is happening inside <span style="font-family: courier;">AutoreleasePool</span>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiMC26gqUah_X3PR98UKUZkPMiAAkbmuHgwkGBKuYjp4Uftot3Gh60SuaZ4u7hq2e0Ej0-Sj5S0ZJ4a1inkr4X0yAHOUsP7f3vzzK0Ix2C-cWSPU8d0FbjUcZ95ZZl8Eylc6yRcFCoLBmY0eWV8fP4GskdSlL5GZJpefOtjzW6sXCQRTE-QGmDIEPdZDRAA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="794" data-original-width="760" height="640" src="https://blogger.googleusercontent.com/img/a/AVvXsEiMC26gqUah_X3PR98UKUZkPMiAAkbmuHgwkGBKuYjp4Uftot3Gh60SuaZ4u7hq2e0Ej0-Sj5S0ZJ4a1inkr4X0yAHOUsP7f3vzzK0Ix2C-cWSPU8d0FbjUcZ95ZZl8Eylc6yRcFCoLBmY0eWV8fP4GskdSlL5GZJpefOtjzW6sXCQRTE-QGmDIEPdZDRAA=w613-h640" width="613" /></a></div><p></p><p>This doesn't indicate much - just that we're autoreleasing the object instead of releasing it directly. It also means that, because autorelease is delayed, we can't see anything useful in the stack trace. (If we were releasing directly instead of autoreleasing, we could see exactly what caused it in the stack trace.)</p><h3 style="text-align: left;">The First Thing That Didn't Work</h3><p>The most natural solution would be "Let's use Instruments!"
It's supposed to have a tool that shows all the retain stacks and release stacks for every object.</p><p>When running with Instruments, we get a nice crash popup showing us that we crashed:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj9GcM4-x9r1mk6BpquxQVNApdrjZW9aLut3elx8Rf45SqOZd4-BH3etz_W_h0pYWOp789Ito1-mRkpkLAIl_5VL8GXuLE74uqEVe1otDxnLHAblo6pwfQ_EGX0EXpfjaGda9HPJKXYR1kc7IoX3oyrIW1RKTSs3MXgBn2xt_t3cIWMvETX7K3xdxViU6gq" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1752" data-original-width="1990" height="563" src="https://blogger.googleusercontent.com/img/a/AVvXsEj9GcM4-x9r1mk6BpquxQVNApdrjZW9aLut3elx8Rf45SqOZd4-BH3etz_W_h0pYWOp789Ito1-mRkpkLAIl_5VL8GXuLE74uqEVe1otDxnLHAblo6pwfQ_EGX0EXpfjaGda9HPJKXYR1kc7IoX3oyrIW1RKTSs3MXgBn2xt_t3cIWMvETX7K3xdxViU6gq=w640-h563" width="640" /></a></div><p></p><p>The coolest part about this is that it shows us the register state at the crash site, which gives us <span style="font-family: courier;">%rdi</span>, the pointer to the object getting erroneously released.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgVMu0VX1Lt0ons2_Y66o03br0_pv6w0TRvupZEUYBSUWcqAl56-4YjT-KhH06gStyQdXrcWBBQtfQoyjA9O5RQ0BZVo4idx2J9KSGmDDN9u2DDoEnej1gncl0scyzFQca1_iw2rSbKjQ3lD5iepM4_jPPdAAJAD4k0YawnZNjjuQz0j2mNSfzACAHu54RB" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1752" data-original-width="1990" height="563" src="https://blogger.googleusercontent.com/img/a/AVvXsEgVMu0VX1Lt0ons2_Y66o03br0_pv6w0TRvupZEUYBSUWcqAl56-4YjT-KhH06gStyQdXrcWBBQtfQoyjA9O5RQ0BZVo4idx2J9KSGmDDN9u2DDoEnej1gncl0scyzFQca1_iw2rSbKjQ3lD5iepM4_jPPdAAJAD4k0YawnZNjjuQz0j2mNSfzACAHu54RB=w640-h563" width="640" /></a></div><p></p><p>Cool, so the object which is getting erroneously released is at <span style="font-family: 
courier;">0x600002a87b40</span>. Let's see what Instruments lists for that address:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjxK7f2e4PQbmTfM67RVO9clkuN9N_DYWxw40y2H2L3RwCbJdCgSvVa3mpjbUVelnu1wwSrySNv6HGLEMzTqnjdStitzV-99XjlKJr-EvfJ0FpoTobEzIt_Mgx3ifVtsds0mwuawc-Ryf5iryvjkJFJS_awSZaBCDb1W8EF8t6uMB0mkSxrNrcIS1Ixj3Ld" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1602" data-original-width="2024" height="507" src="https://blogger.googleusercontent.com/img/a/AVvXsEjxK7f2e4PQbmTfM67RVO9clkuN9N_DYWxw40y2H2L3RwCbJdCgSvVa3mpjbUVelnu1wwSrySNv6HGLEMzTqnjdStitzV-99XjlKJr-EvfJ0FpoTobEzIt_Mgx3ifVtsds0mwuawc-Ryf5iryvjkJFJS_awSZaBCDb1W8EF8t6uMB0mkSxrNrcIS1Ixj3Ld=w640-h507" width="640" /></a></div></div><p></p><p>Well, it didn't list anything for that address. It listed something for an address just before it, and just after it, but not what we were looking for. Thanks for nothing, Instruments.</p><h3 style="text-align: left;">The Second Thing That Didn't Work</h3><p>Well, I'm allocating and destroying objects in my own code. Why don't I try adding logging to all of my own objects to see where they all get retained and released!
Hopefully, by cross referencing the address of the object that gets erroneously deleted with the logging of the locations of my own objects, I'll be able to tell what's going wrong.</p><p>We can do this by overriding the <span style="font-family: courier;">-[NSObject release]</span> and <span style="font-family: courier;">-[NSObject retain]</span> calls:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjQkmWWlK873Ym8EZ2Ip12m0aNTvTeH5VBqoz6Y9PAwLysNvF3Fda1srEn6EmJNw_hijiQZ040DdP-CNpibg-fxffCCIZO5Mw98Sf36rB1TvGKCJJSjR2UzULYp8mvOfs7CH_rc6JQqRL1_6NxdtrRj2g1F6FJHtolOgqwkzlsHEny-lCeU0PR073MCiLf5" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="472" data-original-width="1632" height="186" src="https://blogger.googleusercontent.com/img/a/AVvXsEjQkmWWlK873Ym8EZ2Ip12m0aNTvTeH5VBqoz6Y9PAwLysNvF3Fda1srEn6EmJNw_hijiQZ040DdP-CNpibg-fxffCCIZO5Mw98Sf36rB1TvGKCJJSjR2UzULYp8mvOfs7CH_rc6JQqRL1_6NxdtrRj2g1F6FJHtolOgqwkzlsHEny-lCeU0PR073MCiLf5=w640-h186" width="640" /></a></div><p></p><p>As well as <span style="font-family: courier;">init</span> / <span style="font-family: courier;">dealloc</span>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhcaIZYc83200NE96hoUxi3p1iI6ivBQRGvUlTXmf5-K6E4TQwWhx3NCfpkv1sC99KufkpHfLdmWfbOIBXQU3cEUMZQry6L1yGt-WYLp_-SvgwEQNh1kWcv_IDTIBRoIWuoyKIOqO3wPZrfm31KOw_8UV1eW5950uJnisX-bnGKvHXa1WEmYN40npFdrRZJ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="692" data-original-width="1272" height="348" src="https://blogger.googleusercontent.com/img/a/AVvXsEhcaIZYc83200NE96hoUxi3p1iI6ivBQRGvUlTXmf5-K6E4TQwWhx3NCfpkv1sC99KufkpHfLdmWfbOIBXQU3cEUMZQry6L1yGt-WYLp_-SvgwEQNh1kWcv_IDTIBRoIWuoyKIOqO3wPZrfm31KOw_8UV1eW5950uJnisX-bnGKvHXa1WEmYN40npFdrRZJ=w640-h348" width="640" /></a></div><p></p><p>Unfortunately, this spewed out a bunch of 
logging, but the only thing it told me was that the object being erroneously released wasn't one of my own objects. It must be some other object (<span style="font-family: courier;">NSString</span>, <span style="font-family: courier;">NSArray</span>, etc.).</p><h3 style="text-align: left;">The Third Thing That Didn't Work</h3><p>Okay, we know the object is being erroneously autoreleased. Why don't we log some useful information every time anyone autoreleases anything? We can add a symbolic breakpoint on <span style="font-family: courier;">-[NSObject autorelease]</span>.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiH-g-rDWPF1mH8GTq4FF7MKNnf1CYNbmUmMShi6rzed3oyQKp9LumIEp1gQZj4iGu19OzFz4wj9JjkWXN0zSICYFFOYc0DXvPEMbyk52WjALRBsTVz3NNkBsuhT6R-XqZ6T_Mh7MjSQqxL81fvohg6SSzKhFZdgXM5sYsda6ZOqGQAtgo-SFfUUyHnoyWZ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="488" data-original-width="932" height="336" src="https://blogger.googleusercontent.com/img/a/AVvXsEiH-g-rDWPF1mH8GTq4FF7MKNnf1CYNbmUmMShi6rzed3oyQKp9LumIEp1gQZj4iGu19OzFz4wj9JjkWXN0zSICYFFOYc0DXvPEMbyk52WjALRBsTVz3NNkBsuhT6R-XqZ6T_Mh7MjSQqxL81fvohg6SSzKhFZdgXM5sYsda6ZOqGQAtgo-SFfUUyHnoyWZ=w640-h336" width="640" /></a></div><p></p><p>Here's what it looks like when this breakpoint is hit:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj86J3i_yqudwLxKyxgamJpbNdbgRkyw9KtzxWARJthwubmFk1DpzpWK4cGOGp8zxKOb2ESLDOfBdBzZuEqkeL-GgdXv_sqruEKxcqhUlahRpKiCcvkdKyI7n_HLbTT_J3mmAC_gA-Co_xsZd2DCFOKeTA-tEYPUBKTpZOL9rV_TZFzfnAIDmRmaWWmUzmS" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="540" data-original-width="1970" height="176"
src="https://blogger.googleusercontent.com/img/a/AVvXsEj86J3i_yqudwLxKyxgamJpbNdbgRkyw9KtzxWARJthwubmFk1DpzpWK4cGOGp8zxKOb2ESLDOfBdBzZuEqkeL-GgdXv_sqruEKxcqhUlahRpKiCcvkdKyI7n_HLbTT_J3mmAC_gA-Co_xsZd2DCFOKeTA-tEYPUBKTpZOL9rV_TZFzfnAIDmRmaWWmUzmS=w640-h176" width="640" /></a></div><p></p><p>Interesting - so it looks like all calls to <span style="font-family: courier;">-[NSObject autorelease]</span> are immediately redirected to <span style="font-family: courier;">_objc_rootAutorelease()</span>. The <span style="font-family: courier;">self</span> pointer is preserved as the value of the first argument.</p><p>If you list the registers at the time of the call, you can see the object being released:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgRHTrx-1seQZ763KsXVUDnYXlbvFOyXMMccPn52C8kbz4VtSgfaOdS5QjoT7KmhvRLbVxsfoOxrJGjKgHOq5w65zpgZT3txIgarYtOP0hvCT0Jg2JV8qjCfsTQ2BwTQ5SfGUWCSpSS2rWm4-jT0C7vldDv0Ai4O129B1tKzRq4gqP41GVuet1t3kHXfLMN" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="586" data-original-width="594" height="632" src="https://blogger.googleusercontent.com/img/a/AVvXsEgRHTrx-1seQZ763KsXVUDnYXlbvFOyXMMccPn52C8kbz4VtSgfaOdS5QjoT7KmhvRLbVxsfoOxrJGjKgHOq5w65zpgZT3txIgarYtOP0hvCT0Jg2JV8qjCfsTQ2BwTQ5SfGUWCSpSS2rWm4-jT0C7vldDv0Ai4O129B1tKzRq4gqP41GVuet1t3kHXfLMN=w640-h632" width="640" /></a></div><p></p><p>So let's modify the breakpoint to print all the information we're looking for:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjzmQwkA5KWpNpBu88j3GOJCsF07q04yVcO21GjyC3wPn0LlUiWdKMVYmpubcOn5q_YIKjKsdTjDfpvXX74ByRpbSfuEWq72r9225sOoCuPgYBPTjCVbarSde3ZWHf21_1gUKI5IWV7Wc8Ou4Al1MsxFz_IuBDbHtXBUYXgKZTky4byeGu-yE5805Captpc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="868" data-original-width="922" height="602"
src="https://blogger.googleusercontent.com/img/a/AVvXsEjzmQwkA5KWpNpBu88j3GOJCsF07q04yVcO21GjyC3wPn0LlUiWdKMVYmpubcOn5q_YIKjKsdTjDfpvXX74ByRpbSfuEWq72r9225sOoCuPgYBPTjCVbarSde3ZWHf21_1gUKI5IWV7Wc8Ou4Al1MsxFz_IuBDbHtXBUYXgKZTky4byeGu-yE5805Captpc=w640-h602" width="640" /></a></div><p></p><p>Unfortunately, this didn't work because it was too slow. Every time lldb evaluates something, it takes a bunch of time, and this was evaluating 3 things every time anybody wanted to autorelease anything, which is essentially all the time. The closed-source application I'm debugging is sensitive enough that if anything takes too long, it just quits.</p><h3 style="text-align: left;">The Fourth Thing That Didn't Work</h3><p>Let's try to print out the same information as before, but do it inside the application rather than in lldb. That way, it will be much faster.</p><p>The way we can do this is with something called "function interposing." This uses a <a href="https://opensource.apple.com/source/dyld/dyld-97.1/include/mach-o/dyld-interposing.h.auto.html">feature</a> of dyld which can replace a library's function with your own.
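</p>

<p>For context, interposing works by planting (replacement, replacee) pointer pairs in a special <span style="font-family: courier;">__DATA,__interpose</span> section that dyld reads at load time. Here's a C sketch modeled on the macro in Apple's <span style="font-family: courier;">dyld-interposing.h</span>, with a toy local function standing in for the real replacee; the <span style="font-family: courier;">#else</span> branch just lets it compile on non-Mach-O platforms, where the mechanism doesn't exist.</p>

```c
#include <assert.h>
#include <stdio.h>

#ifdef __APPLE__
/* Mirrors the macro in Apple's dyld-interposing.h: plant a
 * (replacement, replacee) pointer pair in the __DATA,__interpose
 * section, which dyld reads when the image is loaded. */
#define DYLD_INTERPOSE(_replacement, _replacee)                              \
    __attribute__((used)) static struct {                                    \
        const void *replacement;                                             \
        const void *replacee;                                                \
    } _interpose_##_replacee                                                 \
    __attribute__((section("__DATA,__interpose"))) = {                       \
        (const void *)(unsigned long)&_replacement,                          \
        (const void *)(unsigned long)&_replacee};
#else
/* Interposing is a Mach-O/dyld mechanism; compile to nothing elsewhere. */
#define DYLD_INTERPOSE(_replacement, _replacee)
#endif

/* Toy stand-in for the library function being interposed; in the real
 * dylib the replacee would be _objc_rootAutorelease(). */
static int forwarded = 0;
void *some_library_function(void *obj) { forwarded++; return obj; }

/* Our replacement logs, then forwards to the original. */
void *my_replacement(void *obj) {
    fprintf(stderr, "autoreleasing %p\n", obj);
    return some_library_function(obj);
}

DYLD_INTERPOSE(my_replacement, some_library_function)
```

<p>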
Note that this only works if you disable SIP and set the nvram variable <span style="font-family: courier;">amfi_get_out_of_my_way=0x1</span> and reboot.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjaE1nVY0EOfb4liBj75KaiptV1i3caYH1JlFyr-v1LBL3u2icig5ZqBAvbiwKPZMzaSMp_bciZQdyZeWX_RMNBFD4UsPYulaUyo2ZpBinYU6CHvPRGf_uhgNdYKV6wNqrKdW8Lb8v_Ni4dRUUMqFTqsuRIo37G8TuLZb_-MT4V_9W0An4J7Gvqq-4pYsBF" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="734" data-original-width="2596" height="180" src="https://blogger.googleusercontent.com/img/a/AVvXsEjaE1nVY0EOfb4liBj75KaiptV1i3caYH1JlFyr-v1LBL3u2icig5ZqBAvbiwKPZMzaSMp_bciZQdyZeWX_RMNBFD4UsPYulaUyo2ZpBinYU6CHvPRGf_uhgNdYKV6wNqrKdW8Lb8v_Ni4dRUUMqFTqsuRIo37G8TuLZb_-MT4V_9W0An4J7Gvqq-4pYsBF=w640-h180" width="640" /></a></div><p></p><p>We can do this to swap out all calls to <span style="font-family: courier;">_objc_rootAutorelease()</span> with our own function.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiY7A7lOjboTQvAwQqlKvRKdFiWE9tnBMm8q9DgwPlaMYU-3P7PG17Xu3UZrxBbalXLX6w7XMaFcqh_AShArhbcMTJi_Nh-FxeYXruM5Z1PhEHinF6Jm6C86w8QqntFWp3XggixLT2j_Uc4oLctoWjWzSFblrocSSHNn6P_n7qRCyIMBmN1gpR727QgWemb" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="46" data-original-width="1088" height="28" src="https://blogger.googleusercontent.com/img/a/AVvXsEiY7A7lOjboTQvAwQqlKvRKdFiWE9tnBMm8q9DgwPlaMYU-3P7PG17Xu3UZrxBbalXLX6w7XMaFcqh_AShArhbcMTJi_Nh-FxeYXruM5Z1PhEHinF6Jm6C86w8QqntFWp3XggixLT2j_Uc4oLctoWjWzSFblrocSSHNn6P_n7qRCyIMBmN1gpR727QgWemb=w640-h28" width="640" /></a></div><p></p><p>Inside our own version of <span style="font-family: courier;">_objc_rootAutorelease()</span>, we want to keep track of everything that gets autoreleased. 
So, let's keep track of a global dictionary, from pointer value to info string.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi8VO7_RyXVSlP46YzORpt4Ji9WCs1gcFUZkDOEKEmKZpCpLio7DR57-NX-RJ2vGe2YwLVFnPzLcMFiXKyTr-fkWYpWZfXOD8YMuOkjj8RYhxmYpxJBrTCg4eTVsMzYlvN8JHeY7Lo856sdxA35-A_li6eWo9XG4Cl_ZFygEKREmrwBYMH-atqW8TyM_684" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="46" data-original-width="932" height="32" src="https://blogger.googleusercontent.com/img/a/AVvXsEi8VO7_RyXVSlP46YzORpt4Ji9WCs1gcFUZkDOEKEmKZpCpLio7DR57-NX-RJ2vGe2YwLVFnPzLcMFiXKyTr-fkWYpWZfXOD8YMuOkjj8RYhxmYpxJBrTCg4eTVsMzYlvN8JHeY7Lo856sdxA35-A_li6eWo9XG4Cl_ZFygEKREmrwBYMH-atqW8TyM_684=w640-h32" width="640" /></a></div><p></p><p>We can initialize this dictionary inside a "constructor," which is a special function in a dylib which gets run when the dylib gets loaded by dyld. This is a great way to initialize a global.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhojUQIN4e0Lms0tXzo3Omy1UaqorIsIN4y8rRqRa91VF3FG-vrPTriNMW5pdRdOHmiwkOyOMJpznojscNMogcjZBK_XbiobUZM1QplO10_Qn6XBxTpHInSetJ5iQj8WiQeAMhhQbZPQ7fJ1u1MyikzZi4zgfCOSq994Tgxh3FQ7IdXIozVQMj6se7-_v3y" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="198" data-original-width="1296" height="98" src="https://blogger.googleusercontent.com/img/a/AVvXsEhojUQIN4e0Lms0tXzo3Omy1UaqorIsIN4y8rRqRa91VF3FG-vrPTriNMW5pdRdOHmiwkOyOMJpznojscNMogcjZBK_XbiobUZM1QplO10_Qn6XBxTpHInSetJ5iQj8WiQeAMhhQbZPQ7fJ1u1MyikzZi4zgfCOSq994Tgxh3FQ7IdXIozVQMj6se7-_v3y=w640-h98" width="640" /></a></div><p></p><p>Inside <span style="font-family: courier;">my_objc_rootAutorelease()</span>, we can just add information to the dictionary. 
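</p>

<p>The constructor-plus-global plumbing here is plain C, nothing Objective-C-specific. A minimal sketch (names invented; the real version would allocate the pointer-to-info-string dictionary instead of flipping a flag):</p>

```c
#include <assert.h>
#include <stdio.h>

/* A dylib "constructor": __attribute__((constructor)) marks a function
 * that the dynamic loader runs when the library is loaded, before
 * anything else uses it -- a handy place to initialize globals like our
 * tracking table. */
static int table_initialized = 0;

__attribute__((constructor))
static void init_tracking_table(void) {
    table_initialized = 1; /* real version: create the dictionary here */
    fprintf(stderr, "tracking table ready\n");
}

int tracking_table_ready(void) { return table_initialized; }
```

<p>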
Then, when the crash occurs, we can print the dictionary and find information about the thing that was autoreleased.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh-FzMtyrOypmPMt9_zRrtGStDz5KRCqtdPPwQLKxPUXRMM1vJNU7zxI-cCmDQtTkLboHx-oO11xi8YSyMvtgi5dqKFge9aRCmivnIrYTcy-1sFxzjCgLHAr13DCCpsd6bJ_lC3Lmbqk-ylYAn3YUWPDEttwevaqWLM8aghslPuRrPC2reQeXSYR5OMOeho" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="54" data-original-width="264" height="130" src="https://blogger.googleusercontent.com/img/a/AVvXsEh-FzMtyrOypmPMt9_zRrtGStDz5KRCqtdPPwQLKxPUXRMM1vJNU7zxI-cCmDQtTkLboHx-oO11xi8YSyMvtgi5dqKFge9aRCmivnIrYTcy-1sFxzjCgLHAr13DCCpsd6bJ_lC3Lmbqk-ylYAn3YUWPDEttwevaqWLM8aghslPuRrPC2reQeXSYR5OMOeho=w640-h130" width="640" /></a></div><p></p><p>However, something is wrong...</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEibrhCLuHaoY2KtrJfOISsKF7BY0uZrRo6kdQDUMTX1zpO0eb7zaNj-83tatbAXpnI53bpf8YNCRUGgULdSPwAET6erNlVnIa_eE-3LSzlODUh0n6hm-ON6qAIwjr6nkgbILMVifUwH58x-3_Sk_KAmMiwW3jM1wXwp7PjxgajyUWCfu2-mn0MS5b4tgf-_" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="396" data-original-width="1452" height="174" src="https://blogger.googleusercontent.com/img/a/AVvXsEibrhCLuHaoY2KtrJfOISsKF7BY0uZrRo6kdQDUMTX1zpO0eb7zaNj-83tatbAXpnI53bpf8YNCRUGgULdSPwAET6erNlVnIa_eE-3LSzlODUh0n6hm-ON6qAIwjr6nkgbILMVifUwH58x-3_Sk_KAmMiwW3jM1wXwp7PjxgajyUWCfu2-mn0MS5b4tgf-_=w640-h174" width="640" /></a></div><p></p><p>The dictionary only holds 315 items. That can't possibly be right - it's inconceivable that only 315 things got autoreleased.</p><h3 style="text-align: left;">The Fifth Thing That Didn't Work</h3><p>We're close - we just need to figure out why so few things got autoreleased. 
Let's verify our assumption that <span style="font-family: courier;">[foo autorelease]</span> actually calls <span style="font-family: courier;">_objc_rootAutorelease()</span> by writing such code and looking at its disassembly.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi6GIb2jv9tLu6NKiLRTc5N8spGg1J2BxX1UJyEtDzHhKlVkUnQWPYUJ5CHKzKmAH_ylJzpdDdBq6iCHbmgDqF8jUSub-bRfMJdnFLDJdA-pdGRrWO3PyA4IfRbePz9mHe-cgGWn8RgviEl5QIbiqWKFA1pUFlOuUHjEGX9lEin_Fs9oHnoVK1fdUeFxodj" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="190" data-original-width="964" height="126" src="https://blogger.googleusercontent.com/img/a/AVvXsEi6GIb2jv9tLu6NKiLRTc5N8spGg1J2BxX1UJyEtDzHhKlVkUnQWPYUJ5CHKzKmAH_ylJzpdDdBq6iCHbmgDqF8jUSub-bRfMJdnFLDJdA-pdGRrWO3PyA4IfRbePz9mHe-cgGWn8RgviEl5QIbiqWKFA1pUFlOuUHjEGX9lEin_Fs9oHnoVK1fdUeFxodj=w640-h126" width="640" /></a></div><p></p><p>And if you look at the disassembly...</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhFUDwNuRL_BpEnBLd6XF40mKvRijvCAK1enqQBLsDUH5uc3DFMErKqraYPH-xvGqcZfA1Ebgwasj02Od7_1lLDIwsv32uWEBiNGs0lr2SmIZJZREl2B886vp5Qh-3tSL3tae-ZD-ant12LpVjxG5VaRgOlh7u-nmP3c_-SFZ9dfKJRhwq20eSOeu6jbjdw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="308" data-original-width="1686" height="116" src="https://blogger.googleusercontent.com/img/a/AVvXsEhFUDwNuRL_BpEnBLd6XF40mKvRijvCAK1enqQBLsDUH5uc3DFMErKqraYPH-xvGqcZfA1Ebgwasj02Od7_1lLDIwsv32uWEBiNGs0lr2SmIZJZREl2B886vp5Qh-3tSL3tae-ZD-ant12LpVjxG5VaRgOlh7u-nmP3c_-SFZ9dfKJRhwq20eSOeu6jbjdw=w640-h116" width="640" /></a></div><p></p><p>You can see 2 really interesting things: the call to <span style="font-family: courier;">alloc</span> and <span style="font-family: courier;">init</span> got compressed to a single C call to <span style="font-family: 
courier;">objc_alloc_init()</span>, and the call to <span style="font-family: courier;">autorelease</span> got compressed to a single C call to <span style="font-family: courier;">objc_autorelease()</span>. I suppose the Objective-C compiler knows about the <span style="font-family: courier;">autorelease</span> message, and is smart enough to not invoke the entire <span style="font-family: courier;">objc_msgSend()</span> infrastructure for it, but instead just emits a raw C call for it. So that means we've interposed the wrong function - we were interposing <span style="font-family: courier;">_objc_rootAutorelease()</span> when we should have been interposing <span style="font-family: courier;">objc_autorelease()</span>. So let's interpose both:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiQ7l7LuAZL8Nv6UHXvzYXHrVXE46qXFHVpbl-MXRUw2HEA_9WETlEWZww7TpTxQKeYOpX5XEFpUkzHkgaDQ1bvYQHctOfla0f9MIUziXmHaOsDl9oKRUfPkV78fhfLIYJu61IH0XMWuLA9G57z5A3oFbQwXUctjiXFp_SwGX925VhG4HbGsQEuohWwFdPy" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="82" data-original-width="1088" height="48" src="https://blogger.googleusercontent.com/img/a/AVvXsEiQ7l7LuAZL8Nv6UHXvzYXHrVXE46qXFHVpbl-MXRUw2HEA_9WETlEWZww7TpTxQKeYOpX5XEFpUkzHkgaDQ1bvYQHctOfla0f9MIUziXmHaOsDl9oKRUfPkV78fhfLIYJu61IH0XMWuLA9G57z5A3oFbQwXUctjiXFp_SwGX925VhG4HbGsQEuohWwFdPy=w640-h48" width="640" /></a></div><p></p><p>This, of course, almost worked - we just have to be super sure that <span style="font-family: courier;">my_objc_autorelease()</span> doesn't accidentally call <span style="font-family: courier;">autorelease</span> on any object - that would cause infinite recursion.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a
href="https://blogger.googleusercontent.com/img/a/AVvXsEhP26OtHj8vF-wPYtgCvmV48liYp77nncM-F_s-H6l7ssB7yI5QsuIu5Px4M4yFsIYeDrZKZ9TzqocNNBuhncTsc_CjI89-rWtv1GpBwjwc3nCr9dqPctLO2Ftjxl_RVffsX2DaZQ4L5oro8xyExbmosYWhuA2nKk-TXXkTKlK4K0VvlV3YA4F798gZrUx3" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1318" data-original-width="1830" height="460" src="https://blogger.googleusercontent.com/img/a/AVvXsEhP26OtHj8vF-wPYtgCvmV48liYp77nncM-F_s-H6l7ssB7yI5QsuIu5Px4M4yFsIYeDrZKZ9TzqocNNBuhncTsc_CjI89-rWtv1GpBwjwc3nCr9dqPctLO2Ftjxl_RVffsX2DaZQ4L5oro8xyExbmosYWhuA2nKk-TXXkTKlK4K0VvlV3YA4F798gZrUx3=w640-h460" width="640" /></a></div><p></p><h3 style="text-align: left;">The Sixth Thing That Didn't Work</h3><p>Avoiding calling <span style="font-family: courier;">autorelease</span> inside <span style="font-family: courier;">my_objc_autorelease()</span> is actually pretty much impossible, because anything interesting you could log about an object will, almost necessarily, call <span style="font-family: courier;">autorelease</span>. Remember that we're logging information about literally every object which gets autoreleased, which is, in effect, every object in the entire world. Even if you call <span style="font-family: courier;">NSStringFromClass([object class])</span> that will still cause something to be autoreleased.</p><p>So, the solution is to set some global state for the duration of the call to <span style="font-family: courier;">my_objc_autorelease()</span>. If we see a call to <span style="font-family: courier;">my_objc_autorelease()</span> while the state is set, that means we're autoreleasing inside being autoreleased, and we can skip our custom logic and just call the underlying <span style="font-family: courier;">objc_autorelease()</span> directly. 
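</p>

<p>In C, that guard looks roughly like the sketch below. Note that it already uses per-thread rather than truly global state; the next paragraph explains why that's necessary. All names here are invented, and returning the object stands in for forwarding to the real <span style="font-family: courier;">objc_autorelease()</span>.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Per-thread reentrancy flag, built on the pthreads thread-specific API. */
static pthread_key_t in_hook_key;

__attribute__((constructor))
static void init_guard(void) {
    pthread_key_create(&in_hook_key, NULL); /* key is set up exactly once */
}

static int logged_calls = 0;

void *my_autorelease_hook(void *obj) {
    if (pthread_getspecific(in_hook_key) != NULL) {
        /* Re-entered while logging: skip our logic and just forward
         * (stands in for calling the real objc_autorelease() directly). */
        return obj;
    }
    pthread_setspecific(in_hook_key, (void *)1);
    logged_calls++;           /* stands in for building the info string,  */
    my_autorelease_hook(obj); /* which itself autoreleases -> nested call */
    pthread_setspecific(in_hook_key, NULL);
    return obj;
}
```

<p>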
However, there's a caveat: this "global" state can't actually be global, because Objective-C objects are created and retained and released on every thread, which means this state has to be thread-local. Therefore, because we're writing in Objective-C and not C++, we must use the pthreads API. The pthreads thread-specific API uses a "key" which has to be set up once, so we can do that in our constructor:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgfAPpMoligBURoBbfINEDsJir5NyQB56rbQUjLXo8TJKxg4OjK4q-wvfIuHDTCB2qV3iz4CFq_iGOT82eHrLDidWyeS9Tk9LoVkBZEbnOjFvZlu2ffWruXzjF24kGutD_aBwHk5lxTyiHQlsYsXumoTd_tnUGE_98UID3QX0s-cTFRoKKztmmhW4hezYbD" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="272" data-original-width="1298" height="134" src="https://blogger.googleusercontent.com/img/a/AVvXsEgfAPpMoligBURoBbfINEDsJir5NyQB56rbQUjLXo8TJKxg4OjK4q-wvfIuHDTCB2qV3iz4CFq_iGOT82eHrLDidWyeS9Tk9LoVkBZEbnOjFvZlu2ffWruXzjF24kGutD_aBwHk5lxTyiHQlsYsXumoTd_tnUGE_98UID3QX0s-cTFRoKKztmmhW4hezYbD=w640-h134" width="640" /></a></div><p></p><p>Then we can use <span style="font-family: courier;">pthread_setspecific()</span> and <span style="font-family: courier;">pthread_getspecific()</span> to determine if our calls are being nested.</p><p>Except this still didn't actually work, because <span style="font-family: courier;">abort()</span> is being called...</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg-C_Ridv61b_Fek2dR1BEhcF1OYhQFzd85FC7BJbFd_8C3LcGPg-cARUyvUygXaZ5tMxr8YhKWwBl9Qe5_qe20lnLgZ_OAct2p_uvT9lT7Y5d9nI2hh5RpZOgxfEE1ymJ5p6okEYwU7fLBllINGpqyrdxpudcgnpyNtRphbpgUo3y2FWHJ9rKwQWt8Dhxj" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="748" data-original-width="828" height="577"
src="https://blogger.googleusercontent.com/img/a/AVvXsEg-C_Ridv61b_Fek2dR1BEhcF1OYhQFzd85FC7BJbFd_8C3LcGPg-cARUyvUygXaZ5tMxr8YhKWwBl9Qe5_qe20lnLgZ_OAct2p_uvT9lT7Y5d9nI2hh5RpZOgxfEE1ymJ5p6okEYwU7fLBllINGpqyrdxpudcgnpyNtRphbpgUo3y2FWHJ9rKwQWt8Dhxj=w640-h577" width="640" /></a></div><p></p><h3 style="text-align: left;">The Seventh Thing That Didn't Work</h3><p>Luckily, when <span style="font-family: courier;">abort()</span> is called, Xcode shows us a pending Objective-C exception:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgjns12D7Kc86LxjgrU0YBbuG15Ojlqw1RtaQ1HpdYBE0uzDZpG1XMACG6VZQtNwwREjKgivY4RVpN-PJcpjjGOHdMsKWmEpzPFnJDX6MeY1Z-QPuA6soylyWTU7iM4O41DQ8j_nFe33T7i4l1AhTIQDszuanE-6yB6_OQ_jIi_hPJirvk7FLIKlFHGN_mk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="118" data-original-width="1178" height="64" src="https://blogger.googleusercontent.com/img/a/AVvXsEgjns12D7Kc86LxjgrU0YBbuG15Ojlqw1RtaQ1HpdYBE0uzDZpG1XMACG6VZQtNwwREjKgivY4RVpN-PJcpjjGOHdMsKWmEpzPFnJDX6MeY1Z-QPuA6soylyWTU7iM4O41DQ8j_nFe33T7i4l1AhTIQDszuanE-6yB6_OQ_jIi_hPJirvk7FLIKlFHGN_mk=w640-h64" width="640" /></a></div><p></p><p>Okay, something is being set to nil when it shouldn't be. 
Let's set an exception breakpoint to see what is being set wrong:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgAgfnrzkfs3j8UA9A4VXJnMOQY8QeFkihd_DoyeMWajcUgPlMKw6Ixv69OgSBDY0bYMZJmRUJdKg2cm52ZI0RsSuf9M_489oraTNYG3AxW1IhjQ7IsDst0wCFVwHw3OYYmTXmQeOC3NLdK_dG0NsbFBy4fKRrPT5UkhB5QSThbpTZ0oqHyCWOqWYORiP8u" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="470" data-original-width="1250" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgAgfnrzkfs3j8UA9A4VXJnMOQY8QeFkihd_DoyeMWajcUgPlMKw6Ixv69OgSBDY0bYMZJmRUJdKg2cm52ZI0RsSuf9M_489oraTNYG3AxW1IhjQ7IsDst0wCFVwHw3OYYmTXmQeOC3NLdK_dG0NsbFBy4fKRrPT5UkhB5QSThbpTZ0oqHyCWOqWYORiP8u=w640-h240" width="640" /></a></div><p></p><p>Welp. It turns out <span style="font-family: courier;">NSStringFromClass([object class])</span> can sometimes return nil...</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj-k1dltjXY5AK_tAPiX8df0tsOIkoTpQgE1_yjU6p7GAX3_Vxbz0Q_4Nq6IGkPb3DjopvPnlI2sdvsftd7h7lEAlzfbOXIPKgGsVKr3401efvPN0-DlkQWc8Df97Y2otxb9BAWmLjLFeXJzq-3XPYzIVGIFomB_7hDbI_-QZwfTSwkT68QTKL03OK-0mKg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="536" data-original-width="2672" height="128" src="https://blogger.googleusercontent.com/img/a/AVvXsEj-k1dltjXY5AK_tAPiX8df0tsOIkoTpQgE1_yjU6p7GAX3_Vxbz0Q_4Nq6IGkPb3DjopvPnlI2sdvsftd7h7lEAlzfbOXIPKgGsVKr3401efvPN0-DlkQWc8Df97Y2otxb9BAWmLjLFeXJzq-3XPYzIVGIFomB_7hDbI_-QZwfTSwkT68QTKL03OK-0mKg=w640-h128" width="640" /></a></div><p></p><h3 style="text-align: left;">The Eighth Thing That Worked</h3><p>Okay, let's fix that by checking for nil and using <span style="font-family: courier;">[NSNull null]</span>. 
Now, the program actually crashes in the right place!</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhXuW5yQ4bzbOptp-su4bZVqs5QBR63UnS6sNq3yteubVh2BTA9NXtyJexXCPvghGdNBki9aEQU-8gqRT01RUeV4v5i6aTGdfxRLUaYLluUyc2WlKmdJl_7r2mAhgy3Xk_iRpQ5Aonk03ENkSeGwvLNMwVpaIQmdUMd6U6EWDIQE0A_4uyDdn1Deek350Ft" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="82" data-original-width="324" height="162" src="https://blogger.googleusercontent.com/img/a/AVvXsEhXuW5yQ4bzbOptp-su4bZVqs5QBR63UnS6sNq3yteubVh2BTA9NXtyJexXCPvghGdNBki9aEQU-8gqRT01RUeV4v5i6aTGdfxRLUaYLluUyc2WlKmdJl_7r2mAhgy3Xk_iRpQ5Aonk03ENkSeGwvLNMwVpaIQmdUMd6U6EWDIQE0A_4uyDdn1Deek350Ft=w640-h162" width="640" /></a></div><p></p><p>That's more like it. Let's see what the pointer we're looking for is...</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEji82vSEIOSwWid1IVXWHz_15P0lUp4tHXic48AAw92gnL_gLZgpbresnNADuKr64DyEy9MKGOIohdurgd_bqCQj8VPBxvf0P7v8Zm9Y5T03guQOuGweUov9NZMDZL_HwNeUrmgWPXpYPx24xr28oK7zQ0caj1_xl5sbxITshXF9-1rMkoHHTFRUWTmxPyr" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="136" data-original-width="324" height="134" src="https://blogger.googleusercontent.com/img/a/AVvXsEji82vSEIOSwWid1IVXWHz_15P0lUp4tHXic48AAw92gnL_gLZgpbresnNADuKr64DyEy9MKGOIohdurgd_bqCQj8VPBxvf0P7v8Zm9Y5T03guQOuGweUov9NZMDZL_HwNeUrmgWPXpYPx24xr28oK7zQ0caj1_xl5sbxITshXF9-1rMkoHHTFRUWTmxPyr" width="320" /></a></div><p></p><p>Okay, let's look for it in bigDict!</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjyOUeKAHGM69J96eUrfv6gRGyt5_paiYpLyVaTPIUod73weOYVKDTRaYNzjfs7Yr5m7zctwlXQkNCFRULamGTRg-h8yniXHBbgqvnkO17KF0RniYJkcCqAO9U4FOI2Cv7XeqIAeYlOH0sxBApE61zUfjEPtwws6g2WL4dHtqCIbvdyLuFJYBzMg-YVmjnw" style="margin-left: 1em; margin-right: 
1em;"><img alt="" data-original-height="452" data-original-width="1552" height="186" src="https://blogger.googleusercontent.com/img/a/AVvXsEjyOUeKAHGM69J96eUrfv6gRGyt5_paiYpLyVaTPIUod73weOYVKDTRaYNzjfs7Yr5m7zctwlXQkNCFRULamGTRg-h8yniXHBbgqvnkO17KF0RniYJkcCqAO9U4FOI2Cv7XeqIAeYlOH0sxBApE61zUfjEPtwws6g2WL4dHtqCIbvdyLuFJYBzMg-YVmjnw=w640-h186" width="640" /></a></div><p></p><p>Woohoo! Finally some progress. The object being autoreleased is an <span style="font-family: courier;">NSDictionary</span>.</p><p>But that's not enough, though. What we really want is a backtrace. We can't use lldb's backtrace because it's too slow, but luckily macOS has a <span style="font-family: courier;">backtrace()</span> <a href="https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/backtrace.3.html">function</a> which gives us backtrace information! Let's build a string out of the backtrace information:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgPOwjz-IjNczdMO53JH4jz0Fva7GwNlWevVK8pmnI26Q11VCW728Uj6MYW3gukkP2tTqJOdrIhOF5kk53JIHKINWz46o2dtyHZ46owWR8CLc8VB4DmCohoQXsMR-ZTZkFxHTkF17GxOiODQgOPRypmhgTOOXa6ifcPIjgiCglFh4SbYasR8Lax-pxXxVeL" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="460" data-original-width="1312" height="224" src="https://blogger.googleusercontent.com/img/a/AVvXsEgPOwjz-IjNczdMO53JH4jz0Fva7GwNlWevVK8pmnI26Q11VCW728Uj6MYW3gukkP2tTqJOdrIhOF5kk53JIHKINWz46o2dtyHZ46owWR8CLc8VB4DmCohoQXsMR-ZTZkFxHTkF17GxOiODQgOPRypmhgTOOXa6ifcPIjgiCglFh4SbYasR8Lax-pxXxVeL=w640-h224" width="640" /></a></div><p></p><p>Welp, that's too slow - the program exits. 
Let's try again by setting <span style="font-family: courier;">frameCount</span> to 6:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg0hd7D-S-q9aUs33jmk-RqXpgSScDGy1jYl9wz_P6aEtoW5UTKkk22DFPUPrmqTxW2UBCa0QMh23nuAqY_w9Bd4m6WbAmVC4w5krrMYhDcOdUJqf6v7TpvGHQVvHo-FDipEec6CRI7F0rx9kO1OVxTip4e-FsBDO92Um-6P4pp2La5SVXhOMSu1paVhJ2P" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="462" data-original-width="1314" height="226" src="https://blogger.googleusercontent.com/img/a/AVvXsEg0hd7D-S-q9aUs33jmk-RqXpgSScDGy1jYl9wz_P6aEtoW5UTKkk22DFPUPrmqTxW2UBCa0QMh23nuAqY_w9Bd4m6WbAmVC4w5krrMYhDcOdUJqf6v7TpvGHQVvHo-FDipEec6CRI7F0rx9kO1OVxTip4e-FsBDO92Um-6P4pp2La5SVXhOMSu1paVhJ2P=w640-h226" width="640" /></a></div><p></p><p>So here is the final <span style="font-family: courier;">autorelease</span> function:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgqOAzfcSE7ux1nTBxLRF4s0x5AkP1O1_sKZbeQJbXFJu5pD-n0X3UPYsW3-qzdg1GCQIYaKQJb9hdi4o-ShTFtKdRg1umk7rXVtgp_PWgdqroDCRBq7JP4Lq3FNuXII8G04yuASSzwNSWMEe8NDet5uMEGs8KKUFzB6vtzGBx2rbUXpY0nwY4g9mMPp3PG" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="726" data-original-width="1488" height="312" src="https://blogger.googleusercontent.com/img/a/AVvXsEgqOAzfcSE7ux1nTBxLRF4s0x5AkP1O1_sKZbeQJbXFJu5pD-n0X3UPYsW3-qzdg1GCQIYaKQJb9hdi4o-ShTFtKdRg1umk7rXVtgp_PWgdqroDCRBq7JP4Lq3FNuXII8G04yuASSzwNSWMEe8NDet5uMEGs8KKUFzB6vtzGBx2rbUXpY0nwY4g9mMPp3PG=w640-h312" width="640" /></a></div><p></p><p>Okay, now let's run it, and print out the object we're interested in:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEiEnpJB9asbNpkhiEbd8SEtKCKj3PPZWp_zVAUU83WLtdZTUGnYdieAmh83y2kQA6VoVjVRpU6V5E9fx2pFFBN2h7PP8mL5BrUd9M35e-WeUYBXfULgKMo0gqAKgSViMZWTPHMrdL1OOZq_WVWZ2hOyM8MgprjsN9pwpAytRodbaKyHxNXm_yCqFd3Tfdab" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="84" data-original-width="326" height="164" src="https://blogger.googleusercontent.com/img/a/AVvXsEiEnpJB9asbNpkhiEbd8SEtKCKj3PPZWp_zVAUU83WLtdZTUGnYdieAmh83y2kQA6VoVjVRpU6V5E9fx2pFFBN2h7PP8mL5BrUd9M35e-WeUYBXfULgKMo0gqAKgSViMZWTPHMrdL1OOZq_WVWZ2hOyM8MgprjsN9pwpAytRodbaKyHxNXm_yCqFd3Tfdab=w640-h164" width="640" /></a></div><p></p><p>And the bigDict:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhtqBPVvPi6NdPCEhGyXQKXzF6Whf24KuYtUq-bbigDsYfuvrnQNsjSIrHnaMJET02-wB80EIdts8Fb95BxcUkVomBmr7LJsCVXSM8p3gTUd_aOlec5juLkctkGXufGEkcrbt4MOGpTM7k19DfeBCt_HLURjP8-YWNEEQMZzxEhlOdb4WV7NYKPYzoyUIUl" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="400" data-original-width="1550" height="166" src="https://blogger.googleusercontent.com/img/a/AVvXsEhtqBPVvPi6NdPCEhGyXQKXzF6Whf24KuYtUq-bbigDsYfuvrnQNsjSIrHnaMJET02-wB80EIdts8Fb95BxcUkVomBmr7LJsCVXSM8p3gTUd_aOlec5juLkctkGXufGEkcrbt4MOGpTM7k19DfeBCt_HLURjP8-YWNEEQMZzxEhlOdb4WV7NYKPYzoyUIUl=w640-h166" width="640" /></a></div><p></p><p>Woohoo! It's a great success! 
Here's the stack trace, formatted:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEitFwEEcfSNM6_ub1PYuLwG5yPVfhFhVIrgehOV6kzoCThmeAr4oOZ7v8uRssCNqk8kAt512Kv3rMiFJG-QYqGB_BilPJtXQI3NCt4CM2Ah76xHYrkIhnWOSqO1CZMRmDs1ONalYmtaPgPwfrbae2TPHRnh-txy4C2BExxtmP9IcsosiBKaUy-5t_4KZ6P5" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="138" data-original-width="1144" height="78" src="https://blogger.googleusercontent.com/img/a/AVvXsEitFwEEcfSNM6_ub1PYuLwG5yPVfhFhVIrgehOV6kzoCThmeAr4oOZ7v8uRssCNqk8kAt512Kv3rMiFJG-QYqGB_BilPJtXQI3NCt4CM2Ah76xHYrkIhnWOSqO1CZMRmDs1ONalYmtaPgPwfrbae2TPHRnh-txy4C2BExxtmP9IcsosiBKaUy-5t_4KZ6P5=w640-h78" width="640" /></a></div><p></p><p>Excellent! This was enough for me to find the place where I had over-released the object.</p>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-87895121775154691122023-11-28T00:32:00.000-08:002023-11-28T00:37:37.310-08:00Nvidia SLI from Vulkan's Point of View<p>SLI is an Nvidia technology, which (is supposed to) allow multiple GPUs to act as one. The use case is supposed to be simple: you turn it on, and everything gets faster. However, that's not how it works in Vulkan (because of course it isn't - nothing is simple in Vulkan). So let's dig in and see exactly how it works and what's exposed in Vulkan.</p><h3 style="text-align: left;">Logical Device Creation</h3><p>SLI is exposed in Vulkan with 2 extensions, both of which have been promoted to core in Vulkan 1.1: <span style="font-family: courier;">VK_KHR_device_group_creation</span>, and <span style="font-family: courier;">VK_KHR_device_group</span>. The reason there are 2 is esoteric: one is an "instance extension" and the other is a "device extension." 
Because enumerating device groups has to happen before you actually create a logical device, those enumeration functions can't be part of a device extension, so they're part of the instance extension instead. The instance extension is really small - it essentially just lets you list device groups, and for each group, list the physical devices inside it. When you create your logical device, you just list which physical devices should be part of the new logical device.</p><p>Now that you've created your logical device, there are a few different pieces to how this stuff works.</p><h3 style="text-align: left;">Beginning of the Frame</h3><p>At the beginning of your frame, you would normally call <span style="font-family: courier;">vkAcquireNextImageKHR()</span>, which schedules a semaphore to be signaled when the next swapchain image is "acquired" (which means "able to be rendered to"). (The rest of your rendering is supposed to wait on this semaphore to be signaled.) <span style="font-family: courier;">VK_KHR_device_group</span> replaces this function with <span style="font-family: courier;">vkAcquireNextImage2KHR()</span>, which adds a single parameter: a "device mask" of which physical devices in the logical device should be ready before the semaphore is signaled.</p><p>It took me a while to figure this out, but each physical device gets its own distinct contents of the swapchain image. When you write your Vulkan program, and you bind a swapchain image to a framebuffer, that actually binds <span style="font-family: courier;">n</span> different contents - one on each physical device. When a physical device executes and interacts with the image, it sees its own independent contents of the image.</p><h3 style="text-align: left;">End of the Frame</h3><p>At the end of the frame, you'll want to present, and this is where things get a little complicated. Each physical device in the logical device may or may not have a "presentation engine" in it. 
Also, recall that each physical device has its own distinct contents of the swapchain image.</p><p>There are 4 different presentation "modes" (<span style="font-family: courier;">VkDeviceGroupPresentModeFlagBitsKHR</span>). Your logical device will support some subset of these modes. The 4 modes are:</p><p></p><ol style="text-align: left;"><li>Local presentation: Any physical device with a presentation engine can present, but it can only present the contents of its own image. When you present, you tell Vulkan which physical device and image to present (<span style="font-family: courier;">VkDeviceGroupPresentInfoKHR</span>).</li><li>Remote presentation: Any physical device with a presentation engine can present, and it can present contents from other physical devices. Vulkan exposes a graph (<span style="font-family: courier;">vkGetDeviceGroupPresentCapabilities()</span>) that describes which physical devices can present from which other physical devices in the group. When you present, you tell Vulkan which image to present, and there's a requirement that <i>some</i> physical device with a presentation engine is able to present the image you selected.</li><li>Sum presentation: Any physical device with a presentation engine can present, and it presents the component-wise sum of the contents of the image from multiple physical devices. Again, there's a graph that indicates, for each physical device that has a presentation engine, which other physical devices it's able to sum from. When you present, you specify which physical devices' contents to sum, via a device mask (and there's a requirement that there is some physical device with a presentation engine that can sum from all of the requested physical devices).</li><li>Local multi-device presentation: Different physical devices (with presentation engines) can present different disjoint rects of their own images, which get merged together to a final image.
You can tell which physical devices present which rects by calling <span style="font-family: courier;">vkGetPhysicalDevicePresentRectanglesKHR()</span>. When you present, you specify a device mask, which tells which physical devices present their rects.</li></ol><p></p><p>On my machine, only the local presentation mode is supported, and both GPUs have presentation engines. That means the call to present gets to pick (<span style="font-family: courier;">VkDeviceGroupPresentInfoKHR</span>) which of the two image contents actually gets presented.</p><h3 style="text-align: left;">Middle of the Frame</h3><p>The commands in the middle of the frame are probably the most straightforward. When you begin a command buffer, you can specify a device mask (<span style="font-family: courier;">VkDeviceGroupCommandBufferBeginInfo</span>) of which physical devices will execute the command buffer. Inside the command buffer, when you start a render pass, you can also specify another device mask (<span style="font-family: courier;">VkDeviceGroupRenderPassBeginInfo</span>) for which physical devices will execute the render pass, as well as assigning each physical device its own distinct "render area" rect. Inside the render pass, you can run <span style="font-family: courier;">vkCmdSetDeviceMask()</span> to change the set of currently running physical devices. In your SPIR-V shader, there's even a built-in intrinsic "DeviceIndex" to tell you which GPU in the group you're running on. And then, finally, when you actually submit the command buffer, you can supply (<span style="font-family: courier;">VkDeviceGroupSubmitInfo</span>) a device mask you want to submit the command buffers to.</p><p>There's even a convenience <span style="font-family: courier;">vkCmdDispatchBase()</span> which lets you set "base" values for workgroup IDs, which is convenient if you want to spread one workload across multiple GPUs.
Pipelines have to be created with <span style="font-family: courier;">VK_PIPELINE_CREATE_DISPATCH_BASE_KHR</span> to use this, though.</p><h3 style="text-align: left;">Resources</h3><p>It's all well and good to have multiple physical devices executing the same command buffer, but simply <i>execution</i> is not enough: you also need to bind resources to those shaders and commands that get run.</p><p>When allocating a resource, there are 2 ways for it to happen: either each physical device gets its own distinct contents of the allocation, or all the physical devices share a single contents. If the allocation's heap is marked as <span style="font-family: courier;">VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHR</span>, then all allocations will be replicated distinctly across each of the physical devices. Even if the heap isn't marked that way, the individual allocation can still be marked that way (<span style="font-family: courier;">VkMemoryAllocateFlagsInfo</span>). On my device, the GPU-local heap is marked as multi-instance.</p><p>Communication can happen between the devices by using Vulkan's existing memory binding infrastructure. Recall that, in Vulkan, you don't just create a resource; instead, you make an allocation, and a resource, and then you separately bind the two together. Well, it's possible to bind a resource on one physical device with an allocation on a different physical device (<span style="font-family: courier;">VkBindBufferMemoryDeviceGroupInfo</span>, <span style="font-family: courier;">VkBindImageMemoryDeviceGroupInfo</span>)! When you make one of these calls, it will execute on all the physical devices, so these structs indicate the graph of which resources on which physical devices get bound to which allocations on which (other) physical devices.
For textures, you can even be more fine-grained than this, and bind just a region of a texture across physical devices (assuming you created the image with <span style="font-family: courier;">VK_IMAGE_CREATE_SPLIT_INSTANCE_BIND_REGIONS_BIT_KHR</span>). This also works with sparse resources - when you bind a sparse region of a texture, that sparse region can come from another physical device, too (<span style="font-family: courier;">VkDeviceGroupBindSparseInfo</span>).</p><p>Alas, there are restrictions. <span style="font-family: courier;">vkGetDeviceGroupPeerMemoryFeatures()</span> tells you, once you've created a resource and bound it to an allocation on a different physical device, how you're allowed to use that resource. For each combination of (heap index, local device index, and remote device index), a subset of 4 possible uses will be allowed (<span style="font-family: courier;">VkPeerMemoryFeatureFlagBits</span>):</p><p></p><ol style="text-align: left;"><li>The local device can copy to the remote device</li><li>The local device can copy from the remote device</li><li>The local device can read the resource directly</li><li>The local device can write to the resource directly</li></ol><p></p><p>This is really exciting - if either of the bottom two uses is allowed, it means you can bind one of these cross-physical-device resources to a shader and use it as if it were any normal resource! Even if neither of the bottom two uses is allowed, just being able to copy between devices without having to round-trip through main memory is already cool. On my device, only the first 3 uses are allowed.</p><h3 style="text-align: left;">Swapchain Resources</h3><p>Being able to bind a new resource to a different physical device's allocation is good, but swapchain images come pre-bound, which means that mechanism won't work for swapchain resources.
So there's a new mechanism for that: it's possible to bind a new image to the storage for an existing swapchain image (<span style="font-family: courier;">VkBindImageMemorySwapchainInfoKHR</span>). This can be used in conjunction with <span style="font-family: courier;">VkBindImageMemoryDeviceGroupInfo</span> which I mentioned above, to make the allocations cross physical devices.</p><p>So, if you want to copy from one physical device's swapchain image to another physical device's swapchain image, what you'd do is:</p><p></p><ol style="text-align: left;"><li>Create a new image (of the right size, format, etc.). Specify <span style="font-family: courier;">VkImageSwapchainCreateInfoKHR</span> to indicate its storage will come from the swap chain.</li><li>Bind it (<span style="font-family: courier;">VkBindImageMemoryInfo</span>), but use both...</li><ol><li><span style="font-family: courier;">VkBindImageMemorySwapchainInfoKHR</span> to have its storage come from the swap chain, and</li><li><span style="font-family: courier;">VkBindImageMemoryDeviceGroupInfo</span> to specify that its storage comes from <i>another physical device's swap chain contents</i></li></ol><li>Execute a copy command to copy from one image to the other image.</li></ol><p></p><h3 style="text-align: left;">Conclusion</h3><p>It's a pretty complicated system! Certainly much more complicated than SLI is in Direct3D. It seems like there are 3 core benefits of device groups:</p><p></p><ol style="text-align: left;"><li>You can execute the same command stream on multiple devices without having to re-encode it multiple times or call into Vulkan multiple times for each command. The device masks implicitly duplicate the execution.</li><li>There are a variety of presentation modes, which allow automatic merging of rendering results, without having to explicitly execute a render pass or a compute shader to merge the results. 
Unfortunately, my cards don't support this.</li><li>Direct physical-device-to-physical-device communication, without round-tripping through main memory. Indeed, for some use cases, you can just bind a remote resource and use it as if it were local. Very cool!</li></ol><p></p><p>I'm not quite at the point where I can run some benchmarks to see how much SLI improves performance over simply creating two independent Vulkan logical devices. I'm working on a ray tracer, so there are a few different ways of joining the rendering results from the two GPUs. To avoid seams, the denoiser will probably have to run on just one of the GPUs.</p>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-19321614113449094712023-10-29T22:23:00.006-07:002023-10-29T22:28:47.649-07:00My First Qt App<p>Just for fun, I wanted to try to make a Qt app that graphs some data. For contrast, I'm aware of <a href="https://developer.apple.com/documentation/Charts">Swift Charts</a>, and I thought using Qt to graph some stuff would be a fun little project, now that I'm using FreeBSD full time, rather than macOS. The latest version of Qt is version 6, so that's what I'll be using.</p><h3 style="text-align: left;">Basics<br /></h3><p>When you use Qt Creator to create a new Qt project, it only creates 4 files:</p><ul style="text-align: left;"><li>CMakeLists.txt</li><li>CMakeLists.txt.user</li><li>main.cpp</li><li>Main.qml</li></ul><p>Qt Creator can understand CMakeLists.txt directly - if you want to open the "project," you open that file. Just like with Cocoa programming, main.cpp doesn't contain much inside it - it's just a few lines long, and it initializes the existing infrastructure to load the app's UI.</p><p>Also, like Cocoa programming, most of the app's UI is described declaratively, in the .qml file.
The way this works is you say something like:</p><div style="text-align: left;"><span style="font-family: courier;">Foo {</span></div><div style="text-align: left;"><span style="font-family: courier;"> bar: baz <br /></span></div><div style="text-align: left;"><span style="font-family: courier;">}</span></div><p>And this means "when the QML file is loaded, create an instance of type Foo, and set the property named bar on this new object to a value of baz."</p><p>The outermost level is this:</p><p><span style="font-family: courier;">Window {<br /> width: 640<br /> height: 480<br /> visible: true<br /> title: qsTr("Hello World")<br />}</span></p><p>Then, you can add "elements" inside the window, by placing it inside the <span style="font-family: courier;">{}</span>s. There are collection views (<span style="font-family: courier;">Row</span>, <span style="font-family: courier;">Column</span>, <span style="font-family: courier;">Grid</span>, <span style="font-family: courier;">Flow</span>) which define how to lay out their children, and there are also more general elements like <span style="font-family: courier;">Rectangle</span>. When your layout is not naturally specified (because you're not using containers or whatever), you describe the layout using anchors, like <span style="font-family: courier;">anchors.centerIn: parent</span> or <span style="font-family: courier;">anchors.fill: parent</span>.</p><h3 style="text-align: left;">Qt Charts</h3><p>Qt has a <a href="https://doc.qt.io/qt-6/qtcharts-overview.html">built-in chart element</a>, so the first thing I did was just copy the ChartView example directly into my QML document as a child of the Window. However, that didn't work, and some searching found <a href="https://doc.qt.io/qt-6/qtcharts-index.html">this note</a>:</p><p>> Note: An instance of QApplication is required for the QML types as the module depends on Qt's Graphics View Framework for rendering. QGuiApplication is not sufficient. 
However, projects created with Qt Creator's Qt Quick Application wizard are based on the Qt Quick template that uses QGuiApplication by default. All the QGuiApplication instances in such projects must be replaced with QApplication.</p><p>Okay, so I replaced <span style="font-family: courier;">QGuiApplication</span> with <span style="font-family: courier;">QApplication</span> in main.cpp, and changed <span style="font-family: courier;">#include <QGuiApplication></span> to <span style="font-family: courier;">#include <QApplication></span>, only to find that there is now a compile error: the compiler can't find that file. After some more searching, it turns out I needed to change this:</p><p><span style="font-family: courier;">find_package(Qt6 6.5 REQUIRED COMPONENTS Quick)</span> </p><p>to</p><p><span style="font-family: courier;">find_package(Qt6 6.5 REQUIRED COMPONENTS Quick Widgets)</span> </p><p>and change </p><p><span style="font-family: courier;">target_link_libraries(appGrapher<br /> PRIVATE Qt6::Quick<br />)</span></p><p>to</p><p><span style="font-family: courier;">target_link_libraries(appGrapher<br /> PRIVATE Qt6::Quick<br /> PRIVATE Qt6::Widgets<br />)</span></p><p>Huh. After doing that, it worked no problem.</p><h3 style="text-align: left;">Data Source (C++ interop)<br /></h3><p>So now I have a chart, which is pretty cool, but the data that the chart uses is spelled out literally in the QML file. That's not very useful - I plan on generating thousands of data points, and I don't want to have to put them inline inside this QML thing. Instead, I want to load them from an external source.</p><p>QML files allow you to run JavaScript by literally placing bits of JavaScript inside the QML file, but I think I want to do better - I want my data source to come from C++ code, so I have full freedom about how I generate it.
From some searching, it looks like there are 2 ways of having C++ and QML JavaScript interoperate:</p><ul style="text-align: left;"><li>You can <a href="https://doc.qt.io/qt-6/qqmlengine.html#qmlRegisterSingletonInstance">register</a> a singleton, or a singleton instance, and then the JavaScript can call methods on that singleton</li><li>You can <a href="https://doc.qt.io/qt-6/qqmlengine.html#qmlRegisterType">register</a> a type, and have the QML create an instance of that type, just like any other element</li><li>(You can <a href="https://doc.qt.io/qt-6/qqmlcontext.html#setContextProperty">setContextProperty()</a>, which lets the QML look up an instance that you set ahead of time. However, there's a note that says "You should not use context properties to inject values into your QML components" which is exactly what I'm trying to do, so this probably isn't the right solution.)</li></ul><p>I have a general aversion to singletons, and I think registering a type is actually what I want, because I want the QML infrastructure to own the instance and define its lifetime, so that's the approach I went with. The way you do this is, in <span style="font-family: courier;">main()</span> after you create the <span style="font-family: courier;">QApplication</span> but before you do anything else, you call <span style="font-family: courier;">qmlRegisterType()</span>. Here is what <span style="font-family: courier;">main()</span> says:</p><p><span style="font-family: courier;">qmlRegisterType<DataSource>("com.litherum", 1, 0, "DataSource");</span></p><p>This allows the QML to say <span style="font-family: courier;">import com.litherum</span>, which is pretty cool.</p><h3 style="text-align: left;">QObject<br /></h3><p>Defining the <span style="font-family: courier;">DataSource</span> type in C++ is a bit weird. It turns out that Qt objects are not just regular C++ objects. 
Instead, you write your classes in a different language, which is similar to C++, and then there is a "meta-object compiler" which will compile your source to actual C++. It <a href="https://doc.qt.io/qt-6/metaobjects.html">looks like</a> the main purpose of this is to be able to connect signals and slots, where an object can emit a signal, and if a slot in some other object is connected to that signal, then the slot callback gets run in that other object. It seems pretty similar to observers in Objective-C. They also have the ability to perform introspection, like Objective-C. I kind of don't understand why they didn't just invent a real language rather than doing this C++ transpilation silliness.</p><p>Anyway, you can define your (not-)C++ class, inherit from <span style="font-family: courier;">QObject</span>, annotate the class with <span style="font-family: courier;">Q_OBJECT</span> and <span style="font-family: courier;">QML_ELEMENT</span>, and give it a method with the <span style="font-family: courier;">Q_INVOKABLE</span> annotation. Sure, fine. Then, in the QML file, you can add a stanza which tells the system to create an instance of this class, and you can use the <span style="font-family: courier;">Component.onCompleted</span> JavaScript handler to call into it (via its id). Now you can call the C++ method you just defined from within the QML. Cool. This is what the C++ header says:</p><p><span style="font-family: courier;">class DataSource : public QObject<br />{<br /> Q_OBJECT<br /> QML_ELEMENT<br />public:<br /> explicit DataSource(QObject *parent = nullptr);<br /><br /> Q_INVOKABLE void updateData(QXYSeries*, double time);<br />};</span> <br /></p><p>Okay, the method is supposed to set the value of the <span style="font-family: courier;">SplineSeries</span> in the chart. The most natural way to do this is to pass the <span style="font-family: courier;">SplineSeries</span> into the C++ function as a parameter.
This is actually pretty natural - all the QML types have corresponding C++ types, so you just make the C++ function accept a <span style="font-family: courier;">QSplineSeries*</span>. Except we run into the same compiler error where the compiler can't find <span style="font-family: courier;">#include <QSplineSeries></span>. It turns out that in CMakeLists.txt we have to make a similar addition and add <span style="font-family: courier;">Charts</span> to both places that we added <span style="font-family: courier;">Widgets</span> above. Fine. Here's what the QML says:<br /></p><p> <span style="font-family: courier;">DataSource {<br /> id: dataSource<br /> Component.onCompleted: function() {<br /> dataSource.updateData(splineSeries, Date.now());<br /> }<br />} </span></p><p>Once you do this, it actually works out well - the C++ code can call methods on the <span style="font-family: courier;">QSplineSeries</span>, and it can see the values that have been set in the QML. It can generate a <span style="font-family: courier;">QList<QPointF></span> and call <span style="font-family: courier;">QSplineSeries::replace()</span> with the new list.</p><p>The one thing I couldn't get it to do was automatically rescale the charts' axes when I swap in new data with different bounds. Oh well.<br /></p><p>I did want to go one step further, though!</p><h3 style="text-align: left;">Animation</h3><p>One of the coolest things about retained-mode UI toolkits is that they often allow for animations for free. Swapping out the data in the series should allow Qt to smoothly animate from the first data set to the second. And it actually totally worked! 
It took me a while to figure out how specifically to spell the values, but in the QML file, you can set these on the <span style="font-family: courier;">ChartView</span>:</p><div style="text-align: left;"><span style="font-family: courier;">animationOptions: ChartView.AllAnimations</span></div><div style="text-align: left;"><span style="font-family: courier;">animationDuration: 300 // milliseconds<br /></span></div><div style="text-align: left;"><span style="font-family: courier;">animationEasingCurve {</span></div><div style="text-align: left;"><span style="font-family: courier;"> type: Easing.InOutQuad</span></div><div style="text-align: left;"><span style="font-family: courier;">}</span></div><p>I found these by looking at the <a href="https://doc.qt.io/qt-6/qchart.html">documentation</a> for QChart. And, lo and behold, changing the data values smoothly animated the spline on the graph! I also needed some kind of timer to actually call my C++ function to generate new data, which you do with QML also:</p><p><span style="font-family: courier;">Timer {<br /> interval: 1000 // milliseconds<br /> running: true<br /> repeat: true<br /> onTriggered: function() {<br /> dataSource.updateData(splineSeries, Date.now());<br /> }<br />} </span><br /></p><p>Super cool stuff! I'm always impressed when you can enable animations in a declarative way, without having your own code running at 60fps. Also, while the animations are running, from watching KSysGuard, it looks like the rendering is multithreaded, which is super cool too! (And, I realized that KSysGuard probably uses Qt Charts under the hood too, to show its performance graphs.)<br /></p><h3 style="text-align: left;">Conclusion</h3><p>It looks like Qt Charts is pretty powerful, has lots of options to make it beautiful, and is fairly performant (though I didn't rigorously test the performance).
Using it did require creating a whole Qt application, but the application is super small, only has a few files, and each file is pretty small and understandable. And, being able to make arbitrary dynamic updates over time while getting animation for free was pretty awesome. I think being able to describe most of the UI declaratively, rather than having to describe it all 100% in code, is definitely a good design decision for Qt. And the C++ interop story was a little convoluted (having to touch <span style="font-family: courier;">main()</span> is a bit unfortunate) but honestly not too bad in the end.<br /></p>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-38746261902252307272023-10-28T18:26:00.001-07:002023-10-28T18:26:37.926-07:00ReSTIR Part 2: Characterizing Sample Reuse<p>After enumerating all the <a href="https://litherum.blogspot.com/2023/10/restir-part-1-building-blocks.html">building blocks of ReSTIR</a>, there isn't actually that much more. The <a href="https://en.wikipedia.org/wiki/Rendering_equation">rendering equation</a> is an integral, and our job is to approximate the value of the integral by sampling it in the most intelligent way possible. <br /></p><p>Importance sampling tells us that we want to generate samples with a density that's proportional to the contribution of those samples to the value of the final integral. (So, where the light is strongest, sample that with highest density.) We can't directly produce samples with this probability density function, though - if we could, we could just compute the integral directly rather than dealing with all this sampling business.</p><p>The function being integrated in the rendering equation is the product of a few independent functions:</p><div><ul style="text-align: left;"><li>The BRDF (BSDF) function, which is a property of the material we are rendering,</li><li>The distribution of incoming light. 
For direct illumination, this is distributed over the relevant light sources</li><li>A geometry term, where the orientation of the surface(s) affects the result</li><li>A visibility term (the point being shaded might be in shadow)</li></ul>The fact that there are a bunch of independent terms means that Multiple Importance Sampling (MIS) works well - we can use these independent functions to produce a single aggregated "target" function which we expect will approximate the <i>real</i> function fairly well. So, we can generate samples according to the target function, using Sequential Importance Resampling (SIR), evaluate the real function at those sampling locations (by tracing rays or whatever), then use Resampled Importance Sampling (RIS) to calculate an integral. Easy peasy, right?</div><div> </div><div><h3 style="text-align: left;">ReSTIR </h3></div><div><br /></div><div>This is where ReSTIR <i>starts</i>. The first observation that ReSTIR makes is that it's possible to use reservoir sampling (RS) to turn this into a <i>streaming</i> algorithm. The paper assumes that the reservoir only holds a single sample (though this isn't actually necessary). The contents of the reservoir represent a set of (one) sample with pdf proportional to the target function, and the more samples the reservoir encounters, the better that pdf matches the target function. The name of the game, now, is to make the reservoir encounter as many samples as possible.</div><div><br /></div><div>Which brings us to the second observation that ReSTIR makes. Imagine if there was some way of merging reservoirs in constant time (or rather: in time proportional to the size of the reservoirs, rather than time proportional to the number of samples the reservoirs have encountered). 
If this were possible, you could imagine a classic parallel reduction algorithm: each thread (pixel) could start out with a naive reservoir (a poor approximation of your target function), but then adjacent threads could merge their reservoirs, then one-of-every-4-threads could merge their reservoirs, then one-of-every-8, etc, until you have a single result that incorporates results from all the threads. If only a single level (generation) of this reduction occurs each frame, you end up with a result where you perform a constant amount of work each frame per thread, but the result is that an exponential number of samples end up being accumulated. This is the key insight that ReSTIR makes.</div><div> </div><div style="text-align: left;"><h3>Merging Reservoirs</h3></div><div><br /></div><div>Merging reservoirs is a subtle business, though. The problem is that different pixels/threads are shading different materials oriented at different orientations. In effect, your target function you're sampling (and the real function you're evaluating) are different from pixel to pixel. If you ignore this fact, and pretend that all your pixels are all sampling the same thing, you can naively just jam the reservoirs together, by creating a new reservoir which encounters the values saved in the reservoirs of the inputs. This is fast, but gets wrong results (called "bias" in the literature).</div><div><br /></div><div>What you have to do instead is to treat the merging operation with care. The key here lies with the concept of "supports." Essentially, if you're trying to sample a function, you have to be able to generate samples at every place the function is nonzero. If there's an area where the function is nonzero but you never sample that area, your answer will turn out wrong. Well, the samples that one pixel generates (recall that the things in the reservoirs are sample locations) might end up not being applicable to a different pixel. 
For example, consider if there's an occlusion edge where one pixel is in shadow and a nearby pixel isn't. Or, another example: the surface normal varies across the object, and the sample at one pixel is at a sharp angle, such that if you use that same sample at a different pixel, that sample actually points behind the object. You have to account for this in the formulas involved.</div><div> </div><div style="text-align: left;"><h3>Jacobian Determinant </h3></div><div><br /></div><div>There's a generalization of this, which uses the concept of a <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian determinant</a>. Recall that, in general, a function describes a relationship between inputs and outputs. The Jacobian determinant of a function describes, for a particular point in the input space of the function, if you make a small perturbation and feed a slightly different input point into the function, how much the output of the function will be perturbed. It's kind of a measure of sensitivity - at a particular point, how sensitive are changes in the output to changes in the input.</div><div><br /></div><div>Well, if you have a sample at one particular pixel of an image, and you then apply it to a different pixel, you have an input (the sample at the original pixel) and you have an output (the sample at the destination pixel) and you have a relationship between the two (the probability of that sample won't be exactly the same at the two different places). So, the Jacobian tells you how to compensate for the fact that you're changing the domain of the sample.</div><div><br /></div><div>In order to incorporate the Jacobian, you have to be able to calculate it (of course), which means you have to be able to characterize how sample reuse across pixels affects the probabilities involved. 
For direct illumination, that's just assumed to be 1 or 0 depending on the value of the sample point - hence why above you just ignore some samples altogether when reusing them. For indirect illumination (path tracing), a sample is an entire path, and when you re-use it at a different pixel, you're producing a new path that is slightly different than the original path. This path manipulation is called "shift mapping" of a path in the gradient domain rendering literature, and common shift mappings have well-defined Jacobian functions associated with them. So, if you spatially reuse a path, you can pick a "shift mapping" for how to define the new path, and then include that shift mapping's Jacobian in the reservoir merging formula.</div><div><br /></div><div>This concept of a "shift mapping" and its Jacobian can be generalized to any kind of sampling - it's not just for path tracing.</div><div><br /></div><div><h3 style="text-align: left;">Conclusion</h3></div><div><br /></div><div>So that's kind of it. If you're careful about it, you can merge reservoirs in closed form (or, at least, in closed form for each sample in the reservoirs), which results in a pdf of the values in the reservoir that are informed by the <i>union</i> of samples of all the input reservoirs. This leads to a computation tree of merges which allows the number of samples to be aggregated exponentially over time, where each frame only has to do constant work per pixel. You can perform this reuse both spatially and temporally, if you remember information about the previous frame. The more samples you aggregate, the closer the pdf of the samples in the reservoir matches the target function, and the target function is formed by using MIS to approximate the rendering equation. 
This allows you to sample with a density very close to the final function you're integrating, which has the effect of reducing variance (noise) in the output image.</div><div><br /></div><div>ReSTIR also has some practical concerns, such as all the reuse causing an echo chamber of old data - the authors deal with that by weighting old and new data differently, to try to strike a balance between reuse (high sample counts) vs quickly adhering to new changes in geometry or whatever. It's a tunable parameter.<br /></div>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-24819340349807737112023-10-17T18:34:00.006-07:002023-10-24T17:36:31.964-07:00ReSTIR Part 1: Building Blocks<p>ReSTIR is built on a bunch of other technologies. Let's discuss them one-by-one.</p><h3 style="text-align: left;">Rejection Sampling</h3><div><br /></div><div>Rejection sampling isn't actually used in ReSTIR, but it's useful to cover it anyway. It is a technique to convert samples from one PDF (probability density function) to another PDF.</div><div><br /></div><div>So, you start with the fact that you have 2 PDFs: a source PDF and a destination PDF. The first thing you do is you find a scalar "M" which, when scaling the source PDF, causes the source PDF to be strictly larger than the destination PDF, for all x coordinates. Then, for every sample in the source, accept that sample with a probability equal to destination PDF at the sample / (M * source PDF at the sample). You'll end up with fewer samples than you started with, but that's the price you pay. You can see how the scalar M is necessary to keep the probabilities between 0 and 1.</div><div><br /></div><div>The larger the distance between the destination PDF and M * the source PDF, the fewer samples will be accepted. So, if you pick M very conservatively, you'll end up with almost no samples accepted. 
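A minimal sketch of the accept/reject loop just described (all names here are hypothetical, not from any real library; it assumes M * sourcePdf(x) >= targetPdf(x) everywhere):

```cpp
#include <functional>
#include <random>
#include <vector>

// Rejection sampling: convert samples drawn from sourcePdf into samples
// distributed according to targetPdf. Each sample is accepted with
// probability targetPdf(x) / (M * sourcePdf(x)).
std::vector<double> rejectionSample(const std::vector<double>& sourceSamples,
                                    const std::function<double(double)>& sourcePdf,
                                    const std::function<double(double)>& targetPdf,
                                    double M,
                                    std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> accepted;
    for (double x : sourceSamples) {
        double acceptProbability = targetPdf(x) / (M * sourcePdf(x));
        if (uniform(rng) < acceptProbability)
            accepted.push_back(x); // Keep the sample.
        // Otherwise the sample is discarded -- the price rejection sampling pays.
    }
    return accepted;
}
```

Note that when the two PDFs are identical and M = 1, the acceptance probability is 1 everywhere and no samples are lost.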
That's a downside to rejection sampling.</div><div><br /></div><div>On the other hand, if the source PDF and the destination PDF are the same, then M = 1, and all the samples will be accepted. Which is good, because the input samples are exactly what should be produced by the algorithm.</div><div><br /></div><h3 style="text-align: left;">Sequential Importance Resampling</h3><div><br /></div><div>This is another technique used to convert samples from one PDF to another PDF. Compared to rejection sampling, we don't reject samples as we encounter them; instead, we pick ahead of time how many samples we want to accept.</div><div><br /></div><div>Again, you have a source PDF and a destination PDF. Go through all your samples, and compute a "score" which is the destination PDF at the sample / the source PDF at the sample. Now that you have all your scores, select N samples from them, with probabilities proportional to the scores. You might end up with duplicate samples; that's okay.</div><div><br /></div><div>Compared to rejection sampling, this approach has a number of benefits. The first is that you don't have to pick that "M" value. The scores are allowed to be any (non-negative) value - not necessarily between 0 and 1. This means you don't have to have any global knowledge about the PDFs involved.</div><div><br /></div><div>Another benefit is that you know how many samples you're going to get at the end - you can't end up in a situation where you accidentally don't end up with any samples.</div><div><br /></div><div>The downside to this algorithm is that you have to pick N up front ahead of time. But, usually that's not actually a big deal.</div><div><br /></div><div>The other really cool thing about SIR is that the source and destination PDFs don't actually have to be normalized. Because the scores can be arbitrary, it's okay if your destination PDF is actually just some arbitrary (non-normalized) function. 
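The SIR procedure just described can be sketched like this (hypothetical names; note that targetFunction does not need to be normalized, since the scores are only used proportionally):

```cpp
#include <functional>
#include <random>
#include <vector>

// Sequential Importance Resampling: given samples drawn from sourcePdf, pick N
// of them (with replacement) with probability proportional to the score
// targetFunction(x) / sourcePdf(x). Duplicates in the output are OK.
std::vector<double> resample(const std::vector<double>& sourceSamples,
                             const std::function<double(double)>& sourcePdf,
                             const std::function<double(double)>& targetFunction,
                             size_t N,
                             std::mt19937& rng) {
    std::vector<double> scores;
    for (double x : sourceSamples)
        scores.push_back(targetFunction(x) / sourcePdf(x));
    // Select N indices with probability proportional to each score.
    std::discrete_distribution<size_t> pick(scores.begin(), scores.end());
    std::vector<double> result;
    for (size_t i = 0; i < N; ++i)
        result.push_back(sourceSamples[pick(rng)]);
    return result;
}
```

Unlike rejection sampling, there's no M to choose and the output count N is fixed up front.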
This is super valuable, as we'll see later.</div><div><br /></div><h3 style="text-align: left;">Monte Carlo Integration</h3><div><br /></div><div>The goal of Monte Carlo integration is to compute an integral of a function. You simply sample it at random locations, and average the results.</div><div><br /></div><div>This assumes that the pdf you're using to generate random numbers is constant from 0 to 1.</div><div><br /></div><div>So, the formula is: 1/N * sum from 1 to N of f(x_i)</div><div><br /></div><h3 style="text-align: left;">Importance Sampling</h3><div><br /></div><div>The idea here is to improve upon basic Monte Carlo integration as described above. Certain samples will contribute to the final result more than others. Instead of sampling from a constant PDF, if you instead sample using a PDF that approximates the function being integrated, you'll more quickly approach the final answer.</div><div><br /></div><div>Doing so adds another term to the formula. It now is: 1/N * sum from 1 to N of f(x_i) / q(x_i), where q(x) is the PDF used to generate samples.</div><div><br /></div><div>The best PDF, of course, is proportional to the function being sampled - if you pick this, f(x_i) / q(x_i) will be a constant value for all i, which means you only need 1 term to calculate the final perfect answer. However, usually this is impractical - if you knew how to generate samples proportional to the function being integrated, you probably know enough to just integrate the function directly. For direct illumination, you can use things like the BRDF of the material, or the locations where the light sources are. Those will probably match the final answer pretty well.</div><div><br /></div><h3 style="text-align: left;">Multiple Importance Sampling</h3><div><br /></div><div>So now the question becomes how to generate that approximating function. 
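The importance-sampled estimator above can be sketched in a few lines (hypothetical names). It also illustrates the point about the ideal q: when q is exactly proportional to f, every term f(x_i)/q(x_i) is the same constant, so the estimate is exact no matter where the samples land:

```cpp
#include <functional>
#include <vector>

// Importance-sampled Monte Carlo estimate of an integral:
//   integral of f  ~=  (1/N) * sum_i f(x_i) / q(x_i)
// where the x_i were drawn with pdf q.
double estimateIntegral(const std::vector<double>& samples,
                        const std::function<double(double)>& f,
                        const std::function<double(double)>& q) {
    double sum = 0.0;
    for (double x : samples)
        sum += f(x) / q(x);
    return sum / samples.size();
}
```

For example, f(x) = 2x on [0, 1] integrates to exactly 1, and q(x) = 2x is a valid pdf on [0, 1] proportional to f, so every term is 1 and the estimate is exactly 1 for any sample set.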
If you look at the above formula, you'll notice that when f(x) is large, but q(x) is small, that leads to the worst possible situation - you are trying to compute an integral, but you're not generating any samples in an area that contributes heavily to it.</div><div><br /></div><div>The other extreme - where f(x) is small but q(x) is big - isn't actually harmful, but it is wasteful. You're generating all these samples that don't actually contribute much to the final answer.</div><div><br /></div><div>The idea behind MIS is that you can generate q(x) from multiple base formulas. For example, one of the base formulas might be the uniform distribution, and another might be proportional to the BRDF of the material you're shading, and another might be proportional to the directions of the lights in the scene. The idea is that, by linearly blending all these formulas, you can generate a better q(x) PDF. </div><div><br /></div><div>Incorporating the uniform distribution is useful to make sure that q(x) never gets too small anywhere, thereby solving the problem where f(x) is large and q(x) is small.</div><div><br /></div><h3 style="text-align: left;">Resampled Importance Sampling</h3><div><br /></div><div>RIS is what happens when you bring together importance sampling and SIR. You can use SIR to generate samples proportional to your approximating function. 
You can then use the importance sampling formula to compute the integral.</div><div><br /></div><div>If, when using SIR, your approximating function isn't normalized, there's another term added into the formula to re-normalize the result, which allows the correct integral to be calculated.</div><div><br /></div><div>This is really exciting, because it means that we can calculate integrals (like the rendering equation) by sampling in strategic places - and the pdf of those strategic places can be arbitrary (non-normalized) functions.</div><div><br /></div><h3 style="text-align: left;">Reservoir Sampling</h3><div><br /></div><div>Reservoir Sampling is a reformulation of SIR, to make it streamable. Recall that, in SIR, you encounter samples, and each sample produces a weight, and then you select N samples proportional to each sample's weight. Reservoir sampling allows you to select the N samples without knowing the total number of samples there are. The idea is that you keep a "reservoir" of N samples, and each time you encounter a new sample, you update the contents of the reservoir depending on the probabilities involved. The invariant is that the contents of the reservoir are distributed proportionally to the weights of all the samples encountered.</div><div><br /></div><div>The other cool thing about reservoir sampling is that 2 reservoirs can be joined together into a single reservoir, by looking only at the contents of the reservoirs, without requiring another full pass over all the data.</div><div><br /></div><h3 style="text-align: left;">Conclusion</h3><div><br /></div><div>So far, we've set ourselves up for success. We can calculate integrals, in a streamable way, by "resampling" our samples to approximate the final function being integrated. Being streamable is important, as we need to be able to update our results as we encounter new samples (perhaps across new frames, or across other pixels). 
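A single-sample weighted reservoir with a constant-time merge, as a rough sketch (this is a simplification for illustration; the real ReSTIR reservoir also tracks the sample count and the reweighting terms needed for an unbiased estimator):

```cpp
#include <random>

// A single-sample weighted reservoir. update() streams in one candidate at a
// time; the kept sample ends up chosen with probability proportional to its
// weight among everything seen so far. merge() combines two reservoirs by
// treating the other reservoir's sample as one candidate carrying its whole
// accumulated weight -- constant time, independent of how many samples either
// reservoir has encountered.
struct Reservoir {
    double sample = 0.0;    // The one sample we keep.
    double weightSum = 0.0; // Total weight of everything encountered.

    void update(double candidate, double weight, std::mt19937& rng) {
        weightSum += weight;
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        if (weightSum > 0.0 && uniform(rng) < weight / weightSum)
            sample = candidate; // Replace the kept sample with this probability.
    }

    void merge(const Reservoir& other, std::mt19937& rng) {
        update(other.sample, other.weightSum, rng);
    }
};
```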
The fact that you can merge reservoirs in constant time is super powerful, as it allows the merged result to behave as if it saw 2*N samples, while running only a constant-time algorithm. This can be done multiple times, thereby allowing for synthesis of an exponential number of samples, but each operation is constant time.</div>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-11718693179345827902023-10-13T03:31:00.007-07:002023-10-13T03:52:46.432-07:00Implementing a GPU's Programming Model on a CPU<h2 style="text-align: left;">SIMT</h2><p>The programming model of a GPU uses what has been coined "single instruction multiple thread." The idea is that the programmer writes their program from the perspective of a single thread, using normal regular variables. So, for example, a programmer might write something like:</p><p><span style="font-family: courier;">int x = threadID;</span></p><p><span style="font-family: courier;">int y = 6;</span></p><p><span style="font-family: courier;">int z = x + y;</span></p><p>Straightforward, right? Then, they ask the system to run this program a million times, in parallel, with different threadIDs.</p><p>The system *could* simply schedule a million threads to do this, but GPUs do better than this. Instead, the compiler will transparently rewrite the program to use vector registers and instructions in order to run multiple "threads" at the same time. So, imagine you have a vector register, where each item in the vector represents a scalar from a particular "thread." In the above program, x corresponds to a vector of [0, 1, 2, 3, etc.] and y corresponds to a vector of [6, 6, 6, 6, etc.]. Then, the operation x + y is simply a single vector add operation of both vectors. 
This way, performance can be dramatically improved, because these vector operations are usually significantly faster than if you had performed each scalar operation one-by-one.</p><p>(This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data.)</p><p>SIMT gets complicated, though, when you have control flow. Imagine the program did something like:</p><p><span style="font-family: courier;">if (threadID < 4) {</span></p><p><span style="font-family: courier;"> doSomethingObservable();</span></p><p><span style="font-family: courier;">}</span></p><p>Here, the system has to behave as-if threads 0-3 executed the "then" block, but also behave as-if threads 4-n didn't execute it. And, of course, thread 0-3 want to take advantage of vector operations - you don't want to pessimize and run each thread serially. So, what do you do?</p><p>Well, the way that GPUs handle this is by using predicated instructions. There is a bitmask which indicates which "threads" are alive: within the above "then" block, that bitmask will have value 0xF. Then, all the vector instructions use this bitmask to determine which elements of the vector it should actually operate on. So, if the bitmask is 0xF, and you execute a vector add operation, the vector add operation is only going to perform the add on the 0th-3rd items in the vector. (Or, at least, it will behave "as-if" it only performed the operation on those items, from an observability perspective.) So, the way that control flow like this works is: all threads actually execute the "then" block, but all the operations in the block are predicated on a bitmask which specifies that only certain items in the vector operations should actually be performed. 
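The predicated control flow just described can be modeled in plain scalar code (a sketch with hypothetical names; a real GPU or AVX-512 does each loop below as one vector instruction):

```cpp
#include <cstdint>

constexpr int kLanes = 8; // hypothetical vector width

// A scalar model of SIMT predication: every lane runs the "then" block, but
// the lane mask decides which lanes' results are actually committed.
void predicatedIncrement(uint64_t values[kLanes], uint8_t mask) {
    for (int lane = 0; lane < kLanes; ++lane) {
        uint64_t result = values[lane] + 1; // All lanes compute...
        if (mask & (1u << lane))
            values[lane] = result;          // ...only live lanes commit.
    }
}

// The "if (threadID < 4)" from the text just computes a new mask:
uint8_t computeMask() {
    uint8_t mask = 0;
    for (int threadID = 0; threadID < kLanes; ++threadID)
        if (threadID < 4)
            mask |= (1u << threadID);
    return mask; // 0x0F: threads 0-3 are alive.
}
```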
The "if" statement itself just modifies the bitmask.</p><h2 style="text-align: left;">The Project</h2><p>AVX-512 is an optional instruction set on some (fairly rare) x86_64 machines. The exciting thing about AVX-512 is that it adds support for this predication bitmask thing. It has a bunch of vector registers (512 bits wide, named zmm0 - zmm31) and it also adds a set of predication bitmask registers (k0 - k7). The instructions that act upon the vector registers can be predicated on the value of one of those predication registers, to achieve the effect of SIMT.</p><p>It turns out I actually have a machine lying around in my home which supports AVX-512, so I thought I'd give it a go, and actually implement a compiler that compiles a toy language, but performs the SIMT transformation to use the vector operations and predication registers. The purpose of this exercise isn't really to achieve incredible performance - there are lots of sophisticated compiler optimizations which I am not really interested in implementing - but instead the purpose is really just as a learning exercise. Hopefully, by implementing this transformation myself for a toy language, I can learn more about the kinds of things that real GPU compilers do.</p><p>The toy language is one I invented myself - it's very similar to C, with some syntax that's slightly easier to parse. 
Programs look like this:</p><p><span style="font-family: courier;">function main(index: uint64): uint64 {</span></p><p><span style="font-family: courier;"> variable accumulator: uint64 = 0;</span></p><p><span style="font-family: courier;"> variable accumulatorPointer: pointer<uint64> = &accumulator;</span></p><p><span style="font-family: courier;"> for (variable i: uint64 = 0; i < index; i = i + 1) {</span></p><p><span style="font-family: courier;"> accumulator = *</span><span style="font-family: courier;">accumulatorPointer </span><span style="font-family: courier;">+ i;</span></p><p><span style="font-family: courier;"> }</span></p><p><span style="font-family: courier;"> return accumulator;</span></p><p><span style="font-family: courier;">}</span></p><p>It's pretty straightforward. It doesn't have things like ++ or +=. It also doesn't have floating-point numbers (which is fine, because AVX-512 supports vector integer operations). It has pointers, for loops, continue & break statements, early returns... the standard stuff.</p><h2 style="text-align: left;">Tour</h2><p>Let's take a tour, and examine how each piece of a C-like language gets turned into AVX-512 SIMT. I implemented this so it can run real programs, and tested it somewhat-rigorously - enough to be fairly convinced that it's generally right and correct.</p><h3 style="text-align: left;">Variables and Simple Math</h3><p>The most straightforward part of this system is variables and literal math. Consider:</p><p><span style="font-family: courier;">variable accumulator: uint64;</span></p><p>This is a variable declaration. Each thread may store different values into the variable, so its storage needs to be a vector. No problem, right?</p><p>What about if the variable's type is a complex type? 
Consider:</p><p><span style="font-family: courier;">struct Foo {</span></p><p><span style="font-family: courier;"> x: uint64;</span></p><p><span style="font-family: courier;"> y: uint64;</span></p><p><span style="font-family: courier;">}</span></p><p><span style="font-family: courier;">variable bar: Foo;</span></p><p>Here, we need to maintain the invariant that Foo.x has the same memory layout as any other uint64. This means that, rather than alternating x,y,x,y,x,y in memory, there instead has to be a vector for all the threads' x values, followed by another vector for all the threads' y values. This works recursively: if a struct has other structs inside it, the compiler will go through all the leaf types in the tree, turn each leaf type into a vector, and then lay them out in memory end-to-end.</p><p>Simple math is even more straightforward. Literal numbers have the same value no matter which thread you're running, so they just turn into broadcast instructions. The program says "3" and the instruction that gets executed is "broadcast 3 to every item in a vector". Easy peasy.</p><h3 style="text-align: left;">L-values and R-values</h3><p>In a C-like language, every value is categorized as either an "l-value" or an "r-value". An l-value is defined as having a location in memory, and r-values don't have a location in memory. The value produced by the expression "2 + 3" is an r-value, but the value produced by the expression "*foo()" is an l-value, because you dereferenced the pointer, so the thing the pointer points to is the location in memory of the resulting value. L-values can be assigned to; r-values cannot be assigned to. So, you can say things like "foo = 3 + 4;" (because "foo" refers to a variable, which has a memory location) but you can't say "3 + 4 = foo;". 
That's why it's called "l-value" and "r-value" - l-values are legal on the left side of an assignment.</p><p>At runtime, every expression has to produce some value, which is consumed by its parent in the AST. E.g, in "3 * 4 + 5", the "3 * 4" has to produce a "12" which the "+" will consume. The simplest way to handle l-values is to make them produce a pointer. This is so expressions like "&foo" work - the "foo" is an lvalue and produces a pointer that points to the variable's storage, and the & operator receives this pointer and produces that same pointer (unmodified!) as an r-value. The same thing happens in reverse for the unary * ("dereference") operator: it accepts an r-value of pointer type, and produces an l-value - which is just the pointer it just received. This is how expressions like "*&*&*&*&*&foo = 7;" work (which is totally legal and valid C!): the "foo" produces a pointer, which the & operator accepts and passes through untouched to the &, which takes it and passes it through untouched, all the way to the final *, which produces the same pointer as an lvalue, that points to the storage of foo.</p><p>The assignment operator knows that the thing on its left side must be an lvalue and therefore will always produce a pointer, so that's the storage that the assignment stores into. The right side can either be an l-value or an r-value; if it's an l-value, the assignment operator has to read from the thing it points to; otherwise, it's an r-value, and the assignment operator reads the value itself. 
This is generalized to every operation: it's legal to say "foo + 3", so the + operator needs to determine which of its parameters are l-values, and will thus produce pointers instead of values, and it will need to react accordingly to read from the storage the pointers point to.</p><p>All this stuff means that, even for simple programs where the author didn't even spell the name "pointer" anywhere in the program, or even use the * or & operators anywhere in the program, there will still be pointers internally used just by virtue of the fact that there will be l-values used in the program<span style="font-family: inherit;">. So, dealing with pointers is a core part of the language. They appear everywhere, whether the program author wants them to or not.</span></p><h3 style="text-align: left;"><span style="font-family: inherit;">Pointers</span></h3><p><span style="font-family: inherit;">If we now</span> think about what this means for SIMT, l-values produce pointers, but each thread has to get its own distinct pointer! That's because of programs like this:</p><p><span style="font-family: courier;">variable x: pointer<uint64>;</span></p><p><span style="font-family: courier;">if (...) {</span></p><p><span style="font-family: courier;"> x = &something;</span></p><p><span style="font-family: courier;">} else {</span></p><p><span style="font-family: courier;"> x = &somethingElse;</span></p><p><span style="font-family: courier;">}</span></p><p><span style="font-family: courier;">*x = 4;</span></p><p>That *x expression is an l-value. It's not special - it's just like any other l-value. The assignment operator needs to handle the fact that, in SIMT, the lvalue that *x produces is a vector of pointers, where each pointer can potentially be distinct. Therefore, that assignment operator doesn't actually perform a single vector store; instead, it performs a "scatter" operation. 
There's a vector of pointers, and there's a vector of values to store to those pointers; the assignment operator might end up spraying those values all around memory. In AVX-512, there's an <a href="https://www.felixcloutier.com/x86/vpscatterdd:vpscatterdq:vpscatterqd:vpscatterqq">instruction</a> that does this scatter operation.</p><p>(Aside: That scatter operation in AVX-512 uses a predication mask register (of course), but the instruction has a side-effect of clearing that register. That kind of sucks from the programmer's point of view - the program has to save and restore the value of the register just because of a quirk of this instruction. But then, thinking about it more, I realized that the memory operation might cause a page fault, which has to be handled by the operating system. The operating system therefore needs to know which address triggered the page fault, so it knows which pages to load. The predication register holds this information - as each memory access completes, the corresponding bit in the predication register gets set to false. So the kernel can look at the register to determine the first predication bit that's high, which indicates which pointer in the vector caused the fault. So it makes sense why the operation will clear the register, but it is annoying to deal with from the programmer's perspective.)</p><p>And, of course, the operation can also say "foo = *x;" which means that there also has to be a <a href="https://www.felixcloutier.com/x86/vpgatherqd:vpgatherqq">gather operation</a>. Sure. 
Something like "*x = *y;" will end up doing both a gather and a scatter.</p><h3 style="text-align: left;">Copying</h3><p>Consider a program like:</p><p><span style="font-family: courier;">struct Foo {</span></p><p><span style="font-family: courier;"> x: uint64;</span></p><p><span style="font-family: courier;"> y: uint64;</span></p><p><span style="font-family: courier;">}</span></p><p><span style="font-family: courier;">someVariableOfFooType = aFunctionThatReturnsAFoo();</span></p><p>That initializer needs to set both fields inside the Foo. Naively, a compiler might be tempted to use a memcpy() to copy the contents - after all, the contents could be arbitrarily complex, with nested structs. However, that won't work for SIMT, because only some of the threads might be alive at this point in the program. Therefore, that assignment has to only copy the items of the vectors for the threads that are alive; it can't copy the whole vectors because that can clobber other entries in the destination vector which are supposed to persist.</p><p>So, all the stores to someVariableOfFooType need to be predicated using the predication registers - we can't naively use a memcpy(). This means that every assignment needs to actually perform n memory operations, where n is the number of leaf types in the struct being assigned - because those memory operations can be predicated correctly using the predication registers. We have to copy structs leaf-by-leaf. This means that the number of instructions to copy a type is proportional to the complexity of the type. Also, both the left side and the right side may be l-values, which means each leaf-copy could actually be a gather/scatter pair of instructions. 
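Here's the leaf-by-leaf idea as a Python sketch (again illustrative - in the generated code the per-leaf stores are masked vector instructions, not loops):

```python
# A struct variable is stored as one vector per leaf field, and an
# assignment is one predicated store per leaf - never a bulk memcpy.
NUM_THREADS = 8

dest = {"x": [9] * NUM_THREADS, "y": [9] * NUM_THREADS}   # someVariableOfFooType
src  = {"x": [1] * NUM_THREADS, "y": [2] * NUM_THREADS}   # the returned Foo

def assign_struct(dest, src, mask):
    for leaf in dest:                      # "x", then "y": leaf-by-leaf
        for lane in range(NUM_THREADS):
            if mask & (1 << lane):         # predicated store
                dest[leaf][lane] = src[leaf][lane]

assign_struct(dest, src, mask=0b00001111)  # only threads 0-3 are alive
# Threads 4-7 keep their old 9s; a memcpy would have clobbered them.
```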
So, depending on the complexity of the type and the context of the assignment, that single "=" operation might actually generate a huge amount of code.</p><h3 style="text-align: left;">Pointers (Part 2)</h3><p>There's one other decision that needs to be made about pointers: Consider:</p><p><span style="font-family: courier;">variable x: uint64;</span></p><p><span style="font-family: courier;">... &x ...</span></p><p>As I described above, the storage for the variable x is a vector (each thread owns one value in the vector). &x produces a vector of pointers, sure. The question is: should all the pointer values point to the beginning of the x vector? Or should each pointer value point to its own slot inside the x vector? If they point to the beginning, that makes the & operator itself really straightforward: it's just a broadcast instruction. But it also means that the scatter/gather operations get more complicated: they have to offset each pointer by a different amount in order to scatter/gather to the correct place. On the other hand, if each pointer points to its own slot inside x, that means the scatter/gather operations are already set up correctly, but the & operation itself gets more complicated.</p><p>Both options will work, but I ended up making all the pointers point to the beginning of x. The reason for that is programs like:</p><p><span style="font-family: courier;">struct Foo {</span></p><p><span style="font-family: courier;"> x: uint32;</span></p><p><span style="font-family: courier;"> y: uint64;</span></p><p><span style="font-family: courier;">}</span></p><p><span style="font-family: courier;">variable x: </span><span style="font-family: courier;">Foo</span><span style="font-family: courier;">;</span></p><p><span style="font-family: courier;">... &x ...</span></p><p>If I picked the other option, and had the pointers point to their own slot inside x, it isn't clear which member of Foo they should be pointing inside of. 
I could have, like, found the first leaf, and made the pointers point into that, but what if the struct is empty... It's not very elegant.</p><p>Also, if I'm assigning to x or something where I need to populate every field, because every copy operation has to copy leaf-by-leaf, I'm going to have to be modifying the pointers to point to each field. If one of the fields is a uint32 and the next one is a uint64, I can't simply add a constant amount to each pointer to get it to point to its slot in the next leaf. So, if I'm going to be mucking about with individual pointer values for each leaf in a copy operation, I might as well have the original pointer point to the overall x vector rather than individual fields, because pointing to individual fields doesn't actually make anything simpler.</p><h3 style="text-align: left;">Function Calls</h3><p>This language supports function pointers, which are callable. This means that you can write a program like this (taken from the test suite):</p><p><span style="font-family: courier;">function helper1(): uint64 ...</span></p><p><span style="font-family: courier;">function helper2(): uint64 ...</span></p><p><span style="font-family: courier;">function main(index: uint64): uint64 {</span></p><p><span style="white-space: normal;"><span style="font-family: courier;"> variable x: FunctionPointer&lt;uint64&gt;;</span></span></p><p><span style="font-family: courier;"> if (index < 3) {</span></p><p><span style="font-family: courier;"> x = helper1;</span></p><p><span style="font-family: courier;"> } else {</span></p><p><span style="font-family: courier;"> x = helper2;</span></p><p><span style="font-family: courier;"> }</span></p><p><span style="font-family: courier;"> return x();</span></p><p><span style="font-family: courier;">}</span></p><p>Here, that call to x() allows different threads to point to different functions. This is a problem for us, because all the "threads" that are running share the same instruction pointer. 
We can't actually have some threads call one function and other threads call another function. So, what we have to do instead is to set the predication bitmask to only the "threads" which call one function, then call that function, then set the predication bitmask to the remaining threads, then call the other function. Both functions get called, but the only "threads" alive during each call are the ones that are supposed to actually be running the function.</p><p>This is tricky to get right, though, because anything could be in that function pointer vector. Maybe all the threads ended up with the same pointers! Or maybe each thread ended up with a different pointer! You *could* do the naive thing and do something like:</p><p><span style="font-family: courier;">for i in 0 ..< numThreads:</span></p><p><span style="font-family: courier;"> predicationMask = originalPredicationMask & (1 << i)</span></p><p><span style="font-family: courier;"> call function[i]</span></p><p>But this has really atrocious performance characteristics. This means that every call actually calls numThreads functions, one-by-one. But each one of those functions can have more function calls! The execution time will be proportional to numThreads ^ callDepth. Given that function calls are super common, this exponential runtime isn't acceptable.</p><p>Instead, what you have to do is gather up and deduplicate function pointers. 
You need to do something like this instead:</p><p><span style="font-family: courier;">func generateMask(functionPointers, target):</span></p><p><span style="font-family: courier;"> mask = 0;</span></p><p><span style="font-family: courier;"> for i in 0 ..< numThreads:</span></p><p><span style="font-family: courier;"> if functionPointers[i] == target:</span></p><p><span style="font-family: courier;"> mask |= 1 << i;</span></p><p><span style="font-family: courier;"> return mask;</span></p><p><span style="font-family: courier;">for pointer in unique(functionPointers):</span></p><p><span style="font-family: courier;"> predicationMask = originalPredicationMask & generateMask(functionPointers, pointer)</span></p><p><span style="font-family: courier;"> call pointer</span></p><p>I couldn't find an instruction in the Intel instruction set that did this. This is also a complicated enough algorithm that I didn't want to write this in assembly and have the compiler emit the instructions for it. So, instead, I wrote it in C++, and had the compiler emit code to call this function at runtime. Therefore, this routine can be considered a sort of "runtime library": a function that automatically gets called when the code the author writes does a particular thing (in this case, "does a particular thing" means "calls a function").</p><p>Doing it this way means that you don't get exponential runtime. Indeed, if your threads all have the same function pointer value, you get constant runtime. And if the threads diverge, the slowdown will be at most proportional to the number of threads. 
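Here's a runnable Python rendition of that dedup-and-call loop. The names follow the pseudocode above; in the actual compiler this lives in a C++ runtime-library routine, and the inner per-lane Python loop stands in for a single predicated call:

```python
NUM_THREADS = 8

def generate_mask(function_pointers, target):
    # Bitmask of the lanes whose function pointer equals target.
    mask = 0
    for i in range(NUM_THREADS):
        if function_pointers[i] == target:
            mask |= 1 << i
    return mask

def call_indirect(function_pointers, original_predication_mask):
    # Call each distinct target once, predicated on the lanes that chose it.
    results = [None] * NUM_THREADS
    for pointer in dict.fromkeys(function_pointers):   # unique(), order-preserving
        predication_mask = original_predication_mask & generate_mask(function_pointers, pointer)
        if predication_mask == 0:
            continue   # never call a function with an empty mask
        for i in range(NUM_THREADS):                   # stand-in for one predicated call
            if predication_mask & (1 << i):
                results[i] = pointer()
    return results

def helper1(): return 1
def helper2(): return 2

# Threads with index < 3 chose helper1, the rest chose helper2:
results = call_indirect([helper1] * 3 + [helper2] * 5, 0b11111111)
# Only 2 calls happen (one per distinct pointer), not 8.
```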
You'll never run a function where the predication bitmask is 0, which means there is a floor on how slow the worst case can be - it will never get worse than having each thread individually diverge from all the other threads.</p><h3 style="text-align: left;">Control Flow</h3><p>As described above, control flow (meaning: if statements, for loops, breaks, continues, and returns) is implemented by changing the value of the predication bitmask register. The x86_64 instruction set has instructions that do this.</p><p>There are 2 ways to handle the predication registers. One way is to observe the fact that there are 8 predication registers, and to limit the language to only allow 8 (7? 6? 3?) levels of nested control flow. If you pick this approach, the code that you emit inside each if statement and for loop would use a different predication register. (Sibling if statements can use the same predication register, but nested ones have to use different predication registers.) </p><p>I elected to not add this restriction, but instead to save and restore the values of the predication register to the stack. This is slower, but it means that control flow can be nested without limit. So, all the instructions I emit are predicated on the k1 register - I never use k2 - k7 (except - I use k2 to save/restore the value of k1 during scatter/gather operations because those clobber the value of whichever register you pass into it).</p><p>For an "if" statement, you actually need to save 2 predication masks:</p><p></p><ol style="text-align: left;"><li>One that saves the predication mask that was incoming to the beginning of the "if" statement. You need to save this so that, after the "if" statement is totally completed, you can restore it back to what it was originally</li><li>If there's an "else" block, you also need to save the bitmask of the threads that should run the "else" block. 
You might think that you can compute this value at runtime instead of saving/loading it (it would be the inverse of the threads that ran the "then" block, and-ed with the set of incoming threads) but you actually can't do that because break and continue statements might actually need to modify this value. Consider if there's a break statement as a direct child of the "then" block - at the end of the "then" block, there will be no threads executing (because they all executed the "break" statement). If you then use the set of currently executing threads to try to determine which should execute the "else" block, you'll erroneously determine that all threads (even the ones which ran the "then" block!) should run the "else" block. Instead, you need to compute up-front the set of threads that should be running the "else" block, save it, and re-load it when starting to execute the "else" block.</li></ol> For a "for" loop, you also need to save 2 predication masks:<p></p><p></p><ol style="text-align: left;"><li>Again, you need to store the incoming predication mask, to restore it after the loop has totally completed</li><li>You also need to save and restore the set of threads which should execute the loop increment operation at the end of the loop. The purpose of saving and restoring this is so that break statements can modify it. Any thread that executes a break statement needs to remove itself from the set of threads which executes the loop increment. Any thread that executes a continue statement needs to remove itself from executing *until* the loop increment. Again, this is a place where you can't recompute the value at runtime because you don't know which threads will execute break or continue statements.</li></ol><div>If you set up "if" statements and "for" loops as above, then break and continue statements actually end up really quite simple. 
First, you can verify statically that no statement directly follows them - they should be the last statement in their block.</div><div><br /></div><div>Then, what a break statement does is:</div><div><ol style="text-align: left;"><li>Find the deepest loop it's inside of, and find all the "if" statements between that loop and the break statement</li><li>For each of the "if" statements:</li><ol><li>Emit code to remove all the currently running threads from both of the saved bitmasks associated with that "if" statement. Any thread that executes a break statement should not run an "else" block and should not come back to life after the "if" statement.</li></ol><li>Emit code to remove all the currently running threads from just the second bitmask associated with the loop. (This is the one that gets restored just before the loop increment operation). Any thread that executes a break statement should not execute the loop increment.</li></ol>A "continue" statement does the same thing except for the last step (those threads *should* execute the loop increment). And a "return" statement removes all the currently running threads from all bitmasks from every "if" statement and "for" loop it's inside of.</div><div><br /></div><div>This is kind of interesting - it means an early return doesn't actually stop the function or perform a jmp or ret. The function still continues executing, albeit with a modified predication bitmask, because there might still be some threads "alive." It also means that "if" statements don't actually need to have any jumps in them - in the general case, both the "then" block and the "else" block will be executed, so instead of jumps you can just modify the predication bitmasks - and emit straight-line code. 
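Here's a small Python model of why the "else" mask has to be saved up front rather than recomputed. A "block" here is just a function that takes the mask it runs under and returns the mask of threads still live at its end (all zeroes if every thread broke); the names are made up for illustration:

```python
NUM_THREADS = 8
trace = []   # records (block name, mask it actually ran with)

def then_all_break(mask):
    trace.append(("then", mask))
    return 0                      # every thread executed "break"

def else_block(mask):
    trace.append(("else", mask))
    return mask

def lower_if(incoming, cond_mask):
    saved_else = incoming & ~cond_mask      # computed and saved up front
    then_all_break(incoming & cond_mask)
    else_block(saved_else)                  # re-loaded, not recomputed
    return incoming                         # restore the incoming mask after

def lower_if_naive(incoming, cond_mask):
    live = then_all_break(incoming & cond_mask)
    else_block(incoming & ~live)            # WRONG: derived from the live set
    return incoming

lower_if(0b11111111, cond_mask=0b00001111)
# the else block correctly ran with mask 0b11110000

trace.clear()
lower_if_naive(0b11111111, cond_mask=0b00001111)
# the else block wrongly ran with mask 0b11111111 - even the threads that broke!
```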
(Of course, you'll want the "then" block and the "else" block to both jump to the end if they find that they start executing with an empty predication bitmask, but this isn't technically necessary - it's just an optimization.)</div><div><br /></div><h3 style="text-align: left;">Shared Variables</h3><div><br /></div><div>When you're using the SIMT approach, one thing that becomes useful is the ability to interact with external memory. GPU threads don't really perform I/O as such, but instead just communicate with the outside world via reading/writing global memory. This is a bit of a problem for SIMT-generated code, because it will assume that the type of everything is a vector type - one for each thread. But, when interacting with external memory, all "threads" see the same values - a shared int is just an int, not a vector of ints.</div><div><br /></div><div>That means we now have a 3rd kind of value classification. Previously, we had l-values and r-values, but l-values can be further split into vector-l-values and scalar-l-values. A pointer type now needs to know statically whether it points to a vector-l-value or a scalar-l-value. (This information needs to be preserved as we pass it from l-value pointers through the & and * operators.) In the language, this looks like "pointer&lt;uint64 | shared&gt;".</div><div><br /></div><div>It turns out that, beyond the classical type-checking analysis, it's actually pretty straightforward to deal with scalar-l-values. They are strictly simpler than vector-l-values.</div><div><br /></div><div>In the language, you can declare something like:</div><div><br /></div><div><span style="font-family: courier;">variable&lt;shared&gt; x: uint64;</span></div><div><span style="font-family: courier;">x = 4;</span></div><div><br /></div><div>which means that it is shared among all the threads. 
If you then refer to x, that reference expression becomes a scalar-l-value, and produces a vector of pointers, all of which point to x's (shared) storage. The "=" in the "x = 4;" statement now has to be made aware that:</div><div><ol style="text-align: left;"><li>If the left side is a vector-l-value, then the scatter operation needs to offset each pointer in the vector to point to the specific place inside the destination vectors that the memory operations should write to</li><li>But, if the left side is a scalar-l-value, then no such offset needs to occur. The pointers already point to the one single shared memory location. Everybody points to the right place already.</li></ol>(And, of course, same thing for the right side of the assignment, which can be either a vector-l-value, a scalar-l-value, or an r-value.)</div><div><br /></div><h3 style="text-align: left;">Comparisons and Booleans</h3><div><br /></div><div>AVX-512 of course has <a href="https://www.felixcloutier.com/x86/vpcmpq:vpcmpuq">vector compare instructions</a>. The result of these vector comparisons *isn't* another vector. Instead, you specify one of the bitmask registers to receive the result of the comparison. This is useful if the comparison is the condition of an "if" statement, but it's also reasonable for a language to have a boolean type. If the boolean type is represented as a normal vector holding 0s and 1s, there's an elegant way to convert between the comparison and the boolean.</div><div><br /></div><div>The comparison instructions look like:</div><div><br /></div><div><span style="font-family: courier;">vpcmpleq %zmm1,%zmm0,%k2{%k1}</span></div><div><br /></div><div>If you were to speak this aloud, what you'd say is "do a vector packed compare for less-than-or-equal-to on the quadwords in zmm0 and zmm1, put the result in k2, and predicate the whole operation on the value of k1." 
Importantly, the operation itself is predicated, and the result can be put into a different predication register. This means that, after you execute this thing, you know which threads executed the instruction (because k1 is still there) but you also know the result of the comparison (because it's in k2).</div><div><br /></div><div>So, what you can do is: use k1 to broadcast a constant 0 into a vector register, and then use k2 to broadcast a constant 1 into the same vector register. This will leave a 1 in all the spots where the test succeeded, and a 0 in all the remaining spots. Pretty cool!</div><div><br /></div><div>If you want to go the other way, to convert from a boolean to a mask, you can just compare the boolean vector to a broadcasted 0, and compare for "not equal." Pretty straightforward.</div><div><br /></div><h3 style="text-align: left;">Miscellanea</h3><div><br /></div><div>I'm using my own calling convention (and ABI) to pass values into and out of functions. It's for simplicity - the x64 calling convention is kind of complicated if you're using vector registers for everything. One of the most useful decisions I made was to formalize this calling convention by encoding it in a C++ class in the compiler. Rather than having various different parts of the compiler just assume they knew where parameters were stored, it was super useful to create a single source of truth about the layout of the stack at call frame boundaries. I ended up changing the layout a few different times, and having this single source of truth meant that such changes only required updating a single class, rather than making a global change all over the compiler.</div><div><br /></div><div>Inventing my own ABI also means that there will be a boundary, where the harness will have to call the generated code. At this boundary, there has to be a trampoline, where the contents of the stack get rejiggered to set it up for the generated code to look in the right place for stuff. 
And, this trampoline can't be implemented in C++, because it has to do things like align the stack pointer register, which you can't do in C++. AVX-512 requires vectors to be loaded and stored at 64-byte alignment, but Windows only requires 16-byte stack alignment. So, in my own ABI I've said "stack frames are all aligned to 64-byte boundaries" which means the trampoline has to enforce this before the entry point can be run. So the trampoline has to be written in assembly.</div><div><br /></div><div>The scatter/gather operations (which are required for l-values to work) only operate on 32-bit and 64-bit values inside the AVX-512 registers. This means that the language can only contain 32-bit and 64-bit types. An AVX-512 vector, which is 512 bits = 64 bytes wide, can hold 8 64-bit values, or 16 32-bit values. However, the entire SIMT programming model requires you to pick up front how many "threads" will be executing at once. If some calculations in your program can calculate 8 values at a time, and some other calculations can calculate 16 values at a time, it doesn't matter - you have to pessimize and only use 8-at-a-time. So, if the language contains 64-bit types, then the max number of "threads" you can run at once is 8. If the language only contains 32-bit types (and you get rid of 64-bit type support, including 64-bit pointers), then you can run 16 "threads" at once. For me, I chose to include 64-bit types and do 8 "threads" at a time, because I didn't want to limit myself to the first 4GB of memory (the natural stack and heap are already farther than 4GB apart from each other in virtual address space, so I'd have to, like, mess with Windows's VM subsystem to allocate my own stack/heap and put them close to each other, and yuck I'll just use 64-bit pointers thankyouverymuch).</div><div><br /></div><h3 style="text-align: left;">Conclusion</h3><div><br /></div><div>And that's kind of it. 
I learned a lot along the way - there seem to be good reasons why, in many shading languages (which use this SIMT model),</div><div><ul style="text-align: left;"><li>Support for 8-bit and 16-bit types is rare - the scatter/gather operations might not support them.</li><li>Support for 64-bit types is also rare - the smaller your types, the more parallelism you get, for a particular vector bit-width.</li><li>Memory loads and stores turn into scatter/gather operations instead.</li><ul><li>A sophisticated compiler could optimize this, and turn some of them into vector loads/stores instead.</li><li>This might be why explicit support for pointers is relatively rare in shading languages - no pointers means you can _always_ use vector load/store operations instead of scatter/gather operations (I think).</li></ul><li>You can't treat memory as a big byte array and memcpy() stuff around; instead you need to treat it logically and operate on well-typed fields, so the predication registers can do the right thing.</li><li>Shading languages usually don't have support for function pointers, because calling them ends up becoming a loop (with a complicated pointer coalescing phase, no less) in the presence of non-uniformity. Instead, it's easy for the language to just say "You know what? All calls have to be direct. Them's the rules."</li><li>Pretty much every shading language has a concept of multiple address spaces. The need for them naturally arises when you have local variables which are stored in vectors, but you also need to interact with global memory, which every thread "sees" identically. Address spaces and SIMT are thoroughly intertwined.</li><li>I thought it was quite cool how AVX-512 complemented the existing (scalar) instruction set. E.g. all the math operations in the language use vector operations, but you still use the normal call/ret instructions. You use the same rsp/rbp registers to interact with the stack. The vector instructions can still use the SIB byte. 
The broadcast instruction broadcasts from a scalar register to a vector register. Given that AVX-512 came out of the Larrabee project, it strikes me as a very Intel-y way to build a GPU instruction set.</li></ul></div>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com6tag:blogger.com,1999:blog-8778351438463999796.post-68903035912769824742022-11-09T20:05:00.006-08:002022-12-01T11:01:25.000-08:00Video Splitter<p>I record ~all the games I play. Not for any particular reason, but really just because sometimes it's fun to go back and re-watch them, to repeat a positive experience. Usually, I just upload the raw footage <a href="https://www.youtube.com/user/sylemx/playlists">directly to YouTube</a>, but this time I wanted to see if I could do a little better.</p><h1 style="text-align: left;">Problem Statement</h1><div><div><a href="https://www.youtube.com/watch?v=51cLM3pp0II&list=PLCIcaY7cW31SD1o3XfM24eUQw89ae1r41">I just finished playing Cyberpunk 2077</a>, and recorded 77 hours of footage. This footage is split across 41 video files. Also, when recording the footage, I wasn't particularly meticulous about stopping recording when I had to step away for a little bit. Therefore, there are a bunch of times in the videos when I stepped away for a few minutes, and nothing much is happening on-screen.</div><div><br /></div><div>I want to take these videos, and concatenate them into a few long videos. Ideally the result would be a single video, but YouTube doesn't allow you to upload anything longer than 10 hours, so the result will be a few 10-hour-long videos. 
Also, I'd like to identify the periods of time in the footage when nothing much is happening, and remove those periods from the result.</div><div><br /></div><div>Also, just for the sake of convenience, I'd like to do this with as few libraries as possible, focusing just on using the software that's built into my Mac.</div></div><div><br /></div><h1 style="text-align: left;">Plan</h1><div><div>There are 3 phases:</div><div><ol style="text-align: left;"><li>Feature Extraction</li><li>Partitioning</li><li>Writing out the final result.</li></ol></div><div>Let's take these one at a time.</div></div><div><br /></div><h2 style="text-align: left;">Feature Extraction</h2><div><div>This is the part where I analyze the videos to pull out useful information from them. This entails going through all the videos frame-by-frame, and mapping each frame to a set of metrics. I ended up using 5 metrics:</div><div><ol style="text-align: left;"><li>The number of words that appear in the frame</li><li>The cross-correlation from the previous frame to the frame in question</li><li>The optical-flow from the previous frame to the frame in question</li><li>The standard deviation of the optical flow</li><li>The average luminance of the frame</li></ol></div><div>Let's consider these one-by-one.</div></div><div><br /></div><h3 style="text-align: left;">Number of Words</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy3gJEVq-PtduGenorA8CUsVxj16A90K3Y91HsnLnNgyPyk1qFt3XvzJvRxNySz7K6Chqrmxfrk2zkwd3Z2WxTQNiKUktLaSfCSDuiTOK0vT5bcBe5RIY3MSGEwR-uYWyTcvA1mkVechsXJJm_PGbpzb1aUFI4lBGFnKvKIstHxC-84UoRTrbMq_4X2g/s1536/Screenshot%202022-11-09%20at%208.23.02%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1214" data-original-width="1536" height="506" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy3gJEVq-PtduGenorA8CUsVxj16A90K3Y91HsnLnNgyPyk1qFt3XvzJvRxNySz7K6Chqrmxfrk2zkwd3Z2WxTQNiKUktLaSfCSDuiTOK0vT5bcBe5RIY3MSGEwR-uYWyTcvA1mkVechsXJJm_PGbpzb1aUFI4lBGFnKvKIstHxC-84UoRTrbMq_4X2g/w640-h506/Screenshot%202022-11-09%20at%208.23.02%20PM.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div>This is pretty straightforward to gather. Apple's Vision framework contains <a href="https://developer.apple.com/documentation/vision/vnrecognizetextrequest">API to recognize the text in an image</a>. There's one other step beyond simply using this API, though - the results contain strings, but each string may contain many words. I'm interested in the number of words, rather than the number of strings in the image. So, I use another Apple API - <a href="https://developer.apple.com/documentation/corefoundation/cfstringtokenizer-rf8"><span style="font-family: courier;">CFStringTokenizer</span></a> to pull out the words from the string. Then I simply count the number of words. 
Easy peasy.</div><div><br /></div><h3 style="text-align: left;">Cross Correlation</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYaaa6hUJxQs6tqdG8lUYeMhTgoBRvDGn2xvBamryo6rKs-6xAYq0ggAzj7CKpy-MgJeVK0BR-qnfqLUycQ3bdCtNvBiJRtkGjSZS-lnkO7GtD0rUp0ebPdaNhooWv6vmf7oYNbKYtLiP4k7t1GXgGYza2epIgZ1P9FVU_KKtSYHi0sTPl-ZlCVT2Xvw/s1526/Screenshot%202022-11-09%20at%208.23.11%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1224" data-original-width="1526" height="514" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYaaa6hUJxQs6tqdG8lUYeMhTgoBRvDGn2xvBamryo6rKs-6xAYq0ggAzj7CKpy-MgJeVK0BR-qnfqLUycQ3bdCtNvBiJRtkGjSZS-lnkO7GtD0rUp0ebPdaNhooWv6vmf7oYNbKYtLiP4k7t1GXgGYza2epIgZ1P9FVU_KKtSYHi0sTPl-ZlCVT2Xvw/w640-h514/Screenshot%202022-11-09%20at%208.23.11%20PM.png" width="640" /></a></div><br /><div><br /></div><div><div>The goal here is to cross-correlate adjacent frames of video, to see roughly how much is changing from frame to frame. This one was the most difficult to implement. </div><div><br /></div><div>Cross correlation is usually defined a little differently than what I'm interested in. Classically, cross correlation is a function that takes 2 functions as input, and produces another function. The idea is that you take the 2 functions, multiply them together and integrate the result, to produce a particular value. However, you often want to actually displace the two input functions away from each other. The size of that displacement is why the output of cross-correlation is a function - the input of that function represents the displacement by which the two input functions are separated from each other before multiplying and integrating. (I'm assuming here that the functions are real-valued.)</div><div><br /></div><div>For me, my 2 functions are discrete - there are X and Y inputs, and the output is the color value of that pixel. 
So, instead of multiplying and integrating like you would for continuous functions, this operation actually becomes simply a dot product. I'm also using just the luminance of the image, so that if a color changes chroma but luminance stays the same, that still counts as a high cross-correlation. Also, having a 1-dimensional result is a little more convenient than if I had a 3-dimensional result (by treating red, green, and blue as distinct).</div><div><br /></div><div>However, I don't want my output to be a whole function - I just want a single value. Usually, in the sciences, they do this by maximizing the value of the output function - often by trying every input, and reporting the maximum result achievable. This would be great: if I'm turning the camera in the game, this operation would find the location where the previous frame best matches up with the current frame. Unfortunately, it's too slow - for a 4K image, there are 35,389,440 inputs to try, and each trial operates on 2 entire 4K images. So, instead, I just set displacement = 0, and assume that adjacent video frames aren't changing a huge amount from frame to frame.</div><div><br /></div><div><div>From reading <a href="https://en.wikipedia.org/wiki/Cross-correlation">Wikipedia's article about cross correlation</a>, it looks like what I want is the "zero normalized cross-correlation" which normalizes the values in the image around the mean, and divides by the standard deviation. The idea is that if the image gets brighter as a whole, but nothing else changes, that should count the same as if it didn't get brighter. It's measuring relative changes, rather than absolute changes.</div></div><div><br /></div><div>So this all boils down to:</div><div><ol style="text-align: left;"><li>Calculate the luminance of both images</li><li>Calculate the average and standard deviation of the luminance for each image. 
This ignores geometry and just treats all the pixels as an unordered set.</li><li>For each image, create a new "zero-normalized" image, which is (old image - mean) / standard deviation</li><li>Perform a dot product of the two zero-normalized images. The result of this is a single scalar.</li><li>Divide the scalar by the number of pixels</li></ol></div><div>Okay, how to actually do this in code?</div><div><br /></div><h4 style="text-align: left;">Luminance</h4><div><br /></div><div>Calculating the luminance is actually a little tricky. The most natural way I found was to convert the image into the XYZ colorspace, whose Y channel represents luminance. I'm doing this using <a href="https://developer.apple.com/documentation/metalperformanceshaders/mpsimageconversion"><span style="font-family: courier;">MPSImageConversion</span></a> which can convert images between any 2 color spaces. It operates on <span style="font-family: courier;">MTLTexture</span>s, so I had to bounce through Core Image to actually produce them (via <span style="font-family: courier;">CIRenderDestination</span>). I then broadcasted the Y channel to every channel of the result, which isn't strictly necessary, but makes it more convenient to use later - I can't forget which channel is the correct channel to use. I did this broadcast using <span style="font-family: courier;">CIFilter.colorMatrix</span>.</div><div><br /></div><h4 style="text-align: left;">Mean</h4><div><br /></div><div>Okay, so now we've got luminance, let's calculate the mean, which is pretty straightforward - Core Image already has <span style="font-family: courier;">CIFilter.areaAverage</span> which produces a 1x1 image of the average. 
We can tile that 1x1 image using <span style="font-family: courier;">CIFilter.affineTile</span> so it's as big as the input image.</div><div><br /></div><h4 style="text-align: left;">Subtraction</h4><div><br /></div><div>Subtracting an image from its mean now is actually kind of tricky - Both <span style="font-family: courier;">CIDifferenceBlendMode</span> and <span style="font-family: courier;">CISubtractBlendMode</span> seem to try really hard to not produce negative values. What I had to do in the end was to add the negative of the second image (instead of subtracting). Adding is just <span style="font-family: courier;">CIFilter.additionCompositing</span>, and negating is just <span style="font-family: courier;">CIFilter.multiplyCompositing</span> with an image full of <span style="font-family: courier;">-1</span>s. However, you can't use <span style="font-family: courier;">CIConstantColorGenerator</span> to fill an image with <span style="font-family: courier;">-1</span>s, because that will clamp to <span style="font-family: courier;">0</span>. Instead you have to actually create a new <span style="font-family: courier;">CIImage</span> from a 1x1 CPU-side bitmap buffer that holds a <span style="font-family: courier;">-1</span>, and then use <span style="font-family: courier;">CIFilter.affineTile</span> to make it big enough. Also, in the <span style="font-family: courier;">CIContext</span>, you have to set the working format to one that can represent negative values (normally Core Image uses unsigned values); I'm using <span style="font-family: courier;">CIFormat.RGBAf</span>.</div><div><br /></div><h4 style="text-align: left;">Standard Deviation</h4><div><br /></div><div>Okay, so the next step is to calculate standard deviation. The standard deviation is the square root of variance, and to calculate variance, we take all the pixels, subtract them from the mean like we just did above, square the result, then find the average of the result values. 
Luckily, we already did all the hard parts - squaring the result is just <span style="font-family: courier;">CIFilter.multiplyCompositing</span>, and we can use the same <span style="font-family: courier;">CIFilter.areaAverage</span> & <span style="font-family: courier;">CIFilter.affineTile</span> to find the average. No problem. We can then take the square root to find standard deviation by using <span style="font-family: courier;">CIFilter.gammaAdjust</span> with a gamma of <span style="font-family: courier;">0.5</span>.</div><div><br /></div><h4 style="text-align: left;">Division</h4><div><br /></div><div>We can't actually divide by the standard deviation using Core Image as far as I can tell - <span style="font-family: courier;">CIDivideBlendMode</span> doesn't seem to do a naive division like we want. However, because the standard deviation is constant across the whole image, we can hoist that division out of the dot product computation. The dot product results in a single scalar, and the standard deviation for an image is a single scalar, so we can just calculate these things independently, and then divide them on the CPU afterwards. 
No problem.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbzI5yyBPDyCZSCAB9NGfbyGaTuUxEgXy-AMjEt0uhUHV9MTTRrbD6faF1zCRq2vGGl8xa8EIQ2snQgOPWuCt5Mwt5me4JNeB25z47n1Dqn0l1tviZNmzuuYtLQuGQOKdbbLPnjS4K301Vur7SYOEQUVGE9UWX0Z4iLcBxQ_kk30XC3stCDfKSncK5wA/s2723/IMG_0391.heic" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1446" data-original-width="2723" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbzI5yyBPDyCZSCAB9NGfbyGaTuUxEgXy-AMjEt0uhUHV9MTTRrbD6faF1zCRq2vGGl8xa8EIQ2snQgOPWuCt5Mwt5me4JNeB25z47n1Dqn0l1tviZNmzuuYtLQuGQOKdbbLPnjS4K301Vur7SYOEQUVGE9UWX0Z4iLcBxQ_kk30XC3stCDfKSncK5wA/w640-h341/IMG_0391.heic" width="640" /></a></div><div><br /></div><h4 style="text-align: left;">Dot Product</h4><div><br /></div><div>Okay, so now we've got our zero-normalized images. Let's do a dot product and average the result! This is pretty easy too - a dot product is just <span style="font-family: courier;">CIFilter.multiplyCompositing</span>, and averaging the result is <span style="font-family: courier;">CIFilter.areaAverage</span>.</div><div><br /></div><div>Phew! 
All that Core Image work just to get a single value!</div></div><div><br /></div><h3 style="text-align: left;">Optical Flow</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipF2sArgXg9yAOnrvODbtp_DkeawFhlnQ4Mre7rkDPMJtPVlwGXjGoZysLudNJGKVe37YzGxx5DgSBmTpOFtdoCVKaxSYCO0i5PqnX21roJPfDKnnXWM-QX-0hd6YywqKxxg0GO2C3eTeafrLyIhCmjauXvj_1m2rPlruB2AfaABxlLHoDZ8qxLspJQQ/s1526/Screenshot%202022-11-09%20at%208.23.19%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1214" data-original-width="1526" height="510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipF2sArgXg9yAOnrvODbtp_DkeawFhlnQ4Mre7rkDPMJtPVlwGXjGoZysLudNJGKVe37YzGxx5DgSBmTpOFtdoCVKaxSYCO0i5PqnX21roJPfDKnnXWM-QX-0hd6YywqKxxg0GO2C3eTeafrLyIhCmjauXvj_1m2rPlruB2AfaABxlLHoDZ8qxLspJQQ/w640-h510/Screenshot%202022-11-09%20at%208.23.19%20PM.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFiuwSQdVPnw7my0t8Li-9wIW0tO1Kvh2epLJngYFchBz-R4D9_JOZloL2eOijeLNiQozqmtFhGxW1TWLQCbkgiigSWZXH1B4Ime9UtMJEOkAtcZ099s3CE8l0_5Eriqv75M5pAce04cUM6EzvdPAmGCGJ-p0oyuGk28n1-5VR-awrep6svhstC8Tr8w/s1528/Screenshot%202022-11-09%20at%208.23.26%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1220" data-original-width="1528" height="510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFiuwSQdVPnw7my0t8Li-9wIW0tO1Kvh2epLJngYFchBz-R4D9_JOZloL2eOijeLNiQozqmtFhGxW1TWLQCbkgiigSWZXH1B4Ime9UtMJEOkAtcZ099s3CE8l0_5Eriqv75M5pAce04cUM6EzvdPAmGCGJ-p0oyuGk28n1-5VR-awrep6svhstC8Tr8w/w640-h510/Screenshot%202022-11-09%20at%208.23.26%20PM.png" width="640" /></a></div><br /><div><br /></div><div><div>Optical flow between 2 images produces an (x, y) displacement vector that indicates, for each pixel in the first image, where it moved to in the 
second image. I'm interested in this because as I rotate the camera around in the game, that should cause most of those displacements to be pointing in roughly the same direction across the whole image. So, the average of the optical flow should tell me if I'm moving the camera or not.</div><div><br /></div><div>On the other hand, if I'm walking forward in the game, then pixels at the top of the screen will move up, pixels on the right will move more to the right, etc. In that situation, the displacements will all cancel each other out! That's why I'm also interested in the standard deviation of the flow. If the average is 0, but the standard deviation is high, that means I'm walking forward (or backward). If the standard deviation is 0, but the average is high, that means I'm turning the camera.</div><div><br /></div><div>Calculating this is pretty straightforward. Apple's Vision framework contains <a href="https://developer.apple.com/documentation/vision/vngenerateopticalflowrequest">API to calculate optical flow</a>. We can then calculate its average and standard deviation using the same method we used above, in the cross correlation section.</div><div><br /></div><div>Optical flow is a 2-dimensional value - pixels can move in the X and Y dimension. I'm not really interested in the direction of the movement, though; I'm more interested in the amount of movement. 
So, after calculating the average and the standard deviation, I take the magnitude of them, to turn them into scalar values.</div></div><div><br /></div><h3 style="text-align: left;">Average Luminance</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8hPYmSwctzmh1NFneOPr6VRDF3ekktu7BTvBwtUQ_ynzN4Vnh7xLEfDJdNe_CgeSc4IlQV1oH6LFYEOH_Aid8U-jUEmiy8hOdlT9xovHPADh0ogXvfJ95mwh26R3el4RA1nX3pj8lyT-itBrPWNufawlwtzctp3Lf3bpzltN510ZyEakgrgewvRsSLw/s1596/Screenshot%202022-11-09%20at%208.23.34%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1222" data-original-width="1596" height="490" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8hPYmSwctzmh1NFneOPr6VRDF3ekktu7BTvBwtUQ_ynzN4Vnh7xLEfDJdNe_CgeSc4IlQV1oH6LFYEOH_Aid8U-jUEmiy8hOdlT9xovHPADh0ogXvfJ95mwh26R3el4RA1nX3pj8lyT-itBrPWNufawlwtzctp3Lf3bpzltN510ZyEakgrgewvRsSLw/w640-h490/Screenshot%202022-11-09%20at%208.23.34%20PM.png" width="640" /></a></div><br /><div><br /></div><div>This is pretty straightforward - in fact, we already calculated this above in the cross-correlation section. It's useful because the menus have a black background, so they are darker than regular gameplay. Also, the luminance of menus is very consistent, as opposed to regular gameplay, where luminance is going up and down all the time. So, just looking at a graph of average luminance, you can kind of already see where the menus are in the video.</div><div><br /></div><h2 style="text-align: left;">Partitioning</h2><div><div>Alright, now we've extracted 5 features from each frame of video. Each of these features is a single scalar value. 
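The zero-normalized cross-correlation feature described earlier boils down to straightforward arithmetic. Here it is as a plain Python sketch over lists of luminance values - an illustration of the math, not the actual Core Image pipeline:

```python
import math

def zncc(a, b):
    # Mean, population standard deviation, zero-normalize, dot product,
    # then divide by the number of pixels.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    norm_a = [(x - mean_a) / std_a for x in a]
    norm_b = [(x - mean_b) / std_b for x in b]
    return sum(x * y for x, y in zip(norm_a, norm_b)) / n

frame1 = [0.2, 0.4, 0.6, 0.8]  # luminance values of one frame
frame2 = [0.3, 0.5, 0.7, 0.9]  # the same frame, uniformly brighter
print(zncc(frame1, frame2))    # ~1.0: only relative changes count
```

The uniformly-brighter frame still scores 1.0, which is exactly the property the zero-normalization buys.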
The next task is to try to use statistical analysis to determine which parts of the video are the boring parts that I should cut out, and which parts are full of action and should be kept in.</div><div><br /></div><div>Originally, I thought I could just use cross-correlation to do this. I thought that if the cross-correlation between adjacent frames is low, that means I've cut to a menu or something, and that would be a good place to cut the video up. However, this turned out not to work very well, because menus actually come on screen with an animation, so they don't actually have low cross-correlation. Also, regular gameplay has a bunch of flashes in it (things explode, the camera can turn quickly, visual effects distort the screen, etc.).</div><div><br /></div><div>Instead, I wanted to model the data I had gathered using a piecewise constant function. E.g. during gameplay, the 5 features will adhere to a particular distribution, and during menus or boring parts, the 5 features will adhere to a different distribution. I'm modelling these distributions as normal distributions, but with different means. I'm trying to partition the data by time, and calculate a new normal distribution for each partition, such that each partition's distribution fits its data as well as possible. The trick here is to find the partitioning points. I'm looking for a statistical method of finding places where the data is discontinuous.</div><div><br /></div><div>Originally, I thought this would be a classic K-means clustering problem, but after implementing it, it turned out not to work very well. No matter how hard I tried, the partitions overlapped a lot, and looked pretty random. 
So that didn't work.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijwBq1hUO6m5YYyVxbMcEzr4kFQsshha3Q2CjsDwvvmn6weCmi56DwIT7j4UkIwpdvv3XycfQWVJ28lTlfyTFo8G1ShvmdI40wN4WxkZrQ8S7Ptqkb3Z1GH55druFqfwgh4jjqdery32s_Qb7ed7sdv5G6lbi8Yz_ON1p_sojTLMQswsClkyr31UvQ_Q/s1978/Screenshot%202022-11-09%20at%208.44.58%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1130" data-original-width="1978" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijwBq1hUO6m5YYyVxbMcEzr4kFQsshha3Q2CjsDwvvmn6weCmi56DwIT7j4UkIwpdvv3XycfQWVJ28lTlfyTFo8G1ShvmdI40wN4WxkZrQ8S7Ptqkb3Z1GH55druFqfwgh4jjqdery32s_Qb7ed7sdv5G6lbi8Yz_ON1p_sojTLMQswsClkyr31UvQ_Q/w640-h366/Screenshot%202022-11-09%20at%208.44.58%20PM.png" width="640" /></a></div><br /><div><br /></div><div><br /></div><h3 style="text-align: left;">Bayesian Information Criterion</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuqa8-vsT7Ya9t9jPJKoyg2deajaio099BBsq5iZelritWl5tNOlT6Etcb2UMrx2-jfZJ88jBwI9QDswMP7W-jRaXh5VWPsgWCYIzIaFiE8KtbZ-aUPFDykKShhNsXVDzvOREefVBBl-uKS4rq8mm21H7PP94XZndOt1UM9GRzzPog9_TJU5xszfp9vA/s1626/Screenshot%202022-11-09%20at%208.30.24%20PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1248" data-original-width="1626" height="492" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuqa8-vsT7Ya9t9jPJKoyg2deajaio099BBsq5iZelritWl5tNOlT6Etcb2UMrx2-jfZJ88jBwI9QDswMP7W-jRaXh5VWPsgWCYIzIaFiE8KtbZ-aUPFDykKShhNsXVDzvOREefVBBl-uKS4rq8mm21H7PP94XZndOt1UM9GRzzPog9_TJU5xszfp9vA/w640-h492/Screenshot%202022-11-09%20at%208.30.24%20PM.png" width="640" /></a></div><br /><div><br /></div><div>Next, I discovered something called the <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian information criterion</a> (BIC). 
This comes from the science of model selection, which is essentially what I'm trying to do. I have a bunch of data, and I have a family of models in mind (disjoint normal distributions, one for each partition) and I'm trying to select among the family for the best model for my data.</div><div><br /></div><div>The BIC is a measure of how well data fits a model. Importantly, though, it tries to mitigate overfitting. For example, in my data, if every data point was in its own partition, then the model would perfectly fit the data; but this wouldn't actually solve my problem. The Bayesian information criterion has 2 terms - one that reflects how well the data fits the model, and another one that reflects how many parameters the model has. The better the data fits the model, the better the BIC; but the more parameters in the model, the worse the BIC. It's essentially a way of balancing fitting vs overfitting.</div><div><br /></div><div>There are 2^(n-1) different ways of partitioning n values into contiguous runs, and for me, n is the number of frames in 77 hours of video. So exhaustively calculating the BIC for every possible partitioning is clearly too expensive. I instead opted to use a greedy approach. Given a particular partitioning, we can try to add a new split at every location (which is <span style="font-family: courier;">O(n)</span> locations), and calculate the BIC as-if we split at that position. We then pick the location which results in the best BIC, and add it into the partitioning. We keep doing this until we find there is no single new splitting location which will improve the BIC (because adding another split would overfit the model).</div><div><br /></div><div>If you preprocess the data to precalculate prefix-sums, you can answer "sum all the values from here to there" queries in <span style="font-family: courier;">O(1)</span> time. 
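The prefix-sum trick looks like this (a generic Python sketch, not code from the project):

```python
def make_prefix_sums(values):
    # prefix[i] holds the sum of values[0..i-1], so prefix has n+1 entries.
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    return prefix

def range_sum(prefix, start, end):
    # Sum of values[start..end-1], answered in O(1).
    return prefix[end] - prefix[start]

data = [3.0, 1.0, 4.0, 1.0, 5.0]
prefix = make_prefix_sums(data)
print(range_sum(prefix, 1, 4))  # 1 + 4 + 1 = 6.0
```

For the Gaussian BIC you'd keep two such tables - one of the values and one of their squares - since the mean and variance of any range fall out of those two sums.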
This allows you to compute the BIC for a single partition in <span style="font-family: courier;">O(1)</span> time (if you expand out the polynomial in the <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion#Gaussian_special_case">formula</a>).</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBiuMPK9R_COKIAfB1mAZfnBKNIQ6rsMYPVJxBdWdTCWT5_IdRVJBHvO8tHyLTH7uGAz6fKgPHJEAUqX-0-7NtCxzuVn26ZGhMXVnDdapgqQBTT-ohhLdjFTm8BblCVnTN3mKkfQACxOXrcWVAHYjUVK3n8nXZX_jPFsA_9wxxaSUlMrlu8HtQ38whxw/s2783/IMG_0392.heic" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2599" data-original-width="2783" height="598" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBiuMPK9R_COKIAfB1mAZfnBKNIQ6rsMYPVJxBdWdTCWT5_IdRVJBHvO8tHyLTH7uGAz6fKgPHJEAUqX-0-7NtCxzuVn26ZGhMXVnDdapgqQBTT-ohhLdjFTm8BblCVnTN3mKkfQACxOXrcWVAHYjUVK3n8nXZX_jPFsA_9wxxaSUlMrlu8HtQ38whxw/w640-h598/IMG_0392.heic" width="640" /></a></div><div><br /></div><div>Therefore, determining the BIC for a particular partitioning is <span style="font-family: courier;">O(number of partitions)</span>. Every time we pick a splitting point, the number of partitions increases by <span style="font-family: courier;">1</span>. We want to calculate a BIC once for each candidate splitting point, which is <span style="font-family: courier;">O(</span><span style="font-family: courier;">number of partitions</span><span style="font-family: courier;"> * number of frames)</span>. Even better, calculating the BIC at every candidate splitting point is an embarrassingly parallel problem, and we can parallelize this across all the cores in our machine. 
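Putting the pieces together, the greedy splitting loop might look something like this. This is a simplified, single-threaded Python sketch with hypothetical parameter choices (the penalty weighting and the variance guard are assumptions, and the real implementation is parallel):

```python
import math

def segment_cost(p1, p2, start, end):
    # n * ln(variance) for values[start:end], computed from prefix sums
    # of the values (p1) and of their squares (p2).
    n = end - start
    s1 = p1[end] - p1[start]
    s2 = p2[end] - p2[start]
    variance = max(s2 / n - (s1 / n) ** 2, 1e-12)  # guard against log(0)
    return n * math.log(variance)

def total_bic(p1, p2, boundaries, n, penalty):
    edges = [0] + boundaries + [n]
    fit = sum(segment_cost(p1, p2, a, b) for a, b in zip(edges, edges[1:]))
    return fit + penalty * (len(boundaries) + 1)

def greedy_partition(values, penalty_weight=10.0, params_per_segment=2):
    n = len(values)
    p1, p2 = [0.0], [0.0]
    for v in values:
        p1.append(p1[-1] + v)
        p2.append(p2[-1] + v * v)
    penalty = penalty_weight * params_per_segment * math.log(n)
    boundaries = []
    best = total_bic(p1, p2, boundaries, n, penalty)
    while True:
        candidates = [
            (total_bic(p1, p2, sorted(boundaries + [i]), n, penalty), i)
            for i in range(1, n) if i not in boundaries
        ]
        if not candidates:
            return boundaries
        score, split = min(candidates)
        if score >= best:
            return boundaries  # no single new split improves the BIC
        best = score
        boundaries = sorted(boundaries + [split])

# Two regimes with different means: the first (and only) split lands at the jump.
data = [0.0, 1.0] * 25 + [10.0, 11.0] * 25
print(greedy_partition(data))  # [50]
```

The inner `total_bic` call recomputes every segment for clarity; in practice only the segment being split changes, which is where the O(1)-per-candidate cost comes from.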
It turns out this is fast enough - the longest single video I had is 5 hours long, and this algorithm completed on that video in around 10 minutes on my 20-core machine.</div><div><br /></div><div>There are 2 tweaks that I ended up having to do to the above:</div><div><ol style="text-align: left;"><li>The BIC assumes that the data you have is one-dimensional. However, my data is 5-dimensional; I extracted 5 features from each frame of the video, so each data point has 5 components. There may be a way to generalize the BIC to higher dimensions, but I instead opted to do something similar: for each 5-dimensional data point, create a single one-dimensional synthetic data point that represents it. This is meaningless, but isn't without precedent: there are lots of examples of people using this approach to average together multiple benchmarks into a single synthetic score. I opted to combine the 5 values using the geometric mean, because that is sensitive to ratios of the data points, rather than the magnitudes of the data points themselves. (Arithmetic mean is definitely wrong, because calculating the arithmetic mean involves adding together the 5 values, but the 5 values have different units, so they can't be added.)</li><ol><li>I also decided to use <span style="font-family: courier;">1 - cross correlation</span> instead of cross correlation itself, because all the other values have a baseline near 0, but cross correlation has a baseline near 1 (because most frames are similar to their adjacent frames). This makes all the values behave a bit more similarly, and makes ratios more meaningful.</li></ol><li>The BIC formula involves a tradeoff between fitting the data and overfitting the data. The way overfitting is measured is that it's proportional to the number of parameters in the model. For me, each different partition has (I believe) 2 parameters: the mean of the data within that partition, and the constant variance of the errors. 
However, using this value causes the data to be overfitted significantly, so I instead multiplied that by 10, which produced a pretty good result. Doing it this way ended up with the average partition in each video being around 30 - 60 seconds long, regardless of the length of the video being partitioned. So that's a pretty cool result.</li></ol><h2 style="text-align: left;">Writing out the Result</h2></div></div>
I found that, if the average synthetic value is less than around <span style="font-family: courier;">0.12145</span>, then the partition was boring and should not be included.</div><div><br /></div><div>Writing the result is pretty straightforward - I'm using <span style="font-family: courier;">AVAssetWriter</span> and passing it frames from the input file (which was read using <span style="font-family: courier;">AVAssetReader</span>).</div></div><h1 style="text-align: left;">Results</h1><div><div>All this work seems, somewhat surprisingly, to give pretty good results. Not perfect, but better than I was expecting. In the first few hours of gameplay that I reviewed, it neatly cut out:</div><div><ol style="text-align: left;"><li>A section when I was reading an in-game lore book thing (it was just showing text on the screen for a minute or so)</li><li>A section when the game was paused</li><li>A section when I was fussing with my inventory for a few minutes</li><li>A section when I was looking at my character in the mirror</li></ol></div><div>The remarkable part of this is that it cut out all these things wholesale: from right when they started, to right when they ended. So you see the character walk into the elevator, and then they're immediately walking out of the elevator. And it didn't cut out any of the action or interesting parts. It also seems to have some resistance to durations of events: there was a point when I read a different in-game lore book, but I only read it for a few seconds, and it kept that part in - presumably because it wasn't worth another partition.</div></div>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com1tag:blogger.com,1999:blog-8778351438463999796.post-15673396232695469582021-05-12T23:41:00.003-07:002021-05-17T23:33:29.934-07:00Understanding CVDisplayLink<p>I found it actually somewhat difficult to understand how to use <span style="font-family: courier;">CVDisplayLink</span>. 
But, after a while of playing around with it, I think I've got a pretty good handle on it. It's not too complicated.</p><p>The main use of a <span style="font-family: courier;">CVDisplayLink </span>is to have a callback that runs once per vsync of a screen. It's also stateful, so you can stop and start the callback stream.</p><h3 style="text-align: left;">Creation</h3><p>When you create one of these objects, you have to tell the system which screen to match - because different screens can have different refresh rates. The type <span style="font-family: courier;">CVDisplayLink </span>accepts to do this is <span style="font-family: courier;">CGDirectDisplayID</span>. You can get this from an <span style="font-family: courier;">NSScreen*</span> as follows:</p><p><span style="font-family: courier;">NSDictionary<NSDeviceDescriptionKey, id> *deviceDescription = theScreen.deviceDescription;</span></p><p><span style="font-family: courier;">NSNumber *directDisplayIDNumber = deviceDescription[@"NSScreenNumber"];</span></p><p><span style="font-family: courier;">CGDirectDisplayID directDisplay = directDisplayIDNumber.unsignedIntValue;</span></p><p>Then, you can use <span style="font-family: courier;">CVDisplayLinkCreateWithCGDisplay()</span> to create the object for that display.</p><h3 style="text-align: left;">Setup</h3><p>The setup can be either a block or a C function. The block doesn't need a <span style="font-family: courier;">void* userInfo</span> object because that context is implicitly captured by the block. 
So, you just say:</p><p><span style="font-family: courier;">CVDisplayLinkSetOutputHandler(displayLink, ^CVReturn (CVDisplayLinkRef displayLink, const CVTimeStamp *inNow, const CVTimeStamp *inOutputTime, CVOptionFlags flagsIn, CVOptionFlags *flagsOut) {</span></p><p><span style="font-family: courier;"> ...</span></p><p><span style="font-family: courier;"> return kCVReturnSuccess;</span></p><p><span style="font-family: courier;">});</span></p><p>And then you start it with just <span style="font-family: courier;">CVDisplayLinkStart(displayLink);</span> Easy peasy. There are also functions for stopping, retaining, and releasing the <span style="font-family: courier;">CVDisplayLink</span>.</p><h3 style="text-align: left;">Interpreting the Arguments</h3><p>It actually took me quite a while to figure out what each of the arguments means. The docs say that <span style="font-family: courier;">flagsIn </span>and <span style="font-family: courier;">flagsOut </span>are 0, and the <span style="font-family: courier;">displayLink </span>is the <span style="font-family: courier;">CVDisplayLink </span>that you started, so there are only really two interesting arguments: <span style="font-family: courier;">inNow</span>, and <span style="font-family: courier;">inOutputTime</span>, both of which are of type <span style="font-family: courier;">CVTimeStamp</span>. <span style="font-family: courier;">inNow </span>represents the time that this callback is being run, and <span style="font-family: courier;">inOutputTime </span>represents the time that anything you draw in the callback is supposed to show up at.</p><p>So, let's dig into <span style="font-family: courier;">CVTimeStamp</span>. 
The <span style="font-family: courier;">version </span>and <span style="font-family: courier;">reserved </span>fields are 0, and the flags field <a href="https://developer.apple.com/documentation/corevideo/cvtimestampflags?language=objc">tells you</a> which of the fields in the <span style="font-family: courier;">CVTimeStamp </span>are valid. I don't know what SMPTE time is, but it never seems to be set/valid, so I'm going to ignore that one. So these are the ones that are remaining:</p><p></p><ul style="text-align: left;"><li><span style="font-family: courier;">hostTime</span></li><li><span style="font-family: courier;">rateScalar</span></li><li><span style="font-family: courier;">videoRefreshPeriod</span></li><li><span style="font-family: courier;">videoTime</span></li><li><span style="font-family: courier;">videoTimeScale</span></li></ul><p></p><p>The thing you have to realize is that there are two timelines happening concurrently: "host" time and "video" time. So, a "point" in time actually has two different representations: one for each of the timelines.</p><p>The <span style="font-family: courier;">hostTime </span>field uses the same tick count that <span style="font-family: courier;">mach_absolute_time()</span> uses. To convert it to seconds, you have to use <span style="font-family: courier;">mach_timebase_info()</span>. And, the "meaning" of the <span style="font-family: courier;">hostTime </span>field is the current time as measured by your application - exactly what <span style="font-family: courier;">mach_absolute_time()</span> returns.</p><p>The <span style="font-family: courier;">videoTime </span>field does not use those same tick counts. Instead, it uses the <span style="font-family: courier;">videoTimeScale </span>field. It's a rational number: <span style="font-family: courier;">videoTime </span>/ <span style="font-family: courier;">videoTimeScale </span>= seconds. 
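As a toy illustration of the video timeline's rational-number arithmetic (with made-up field values - real timestamps and timescales will differ):

```python
# Hypothetical CVTimeStamp-style fields for two adjacent callbacks.
video_time_scale = 600          # ticks per second on the video timeline
video_refresh_period = 10       # ticks between vsyncs: 10/600 = 1/60 second
video_time_frame1 = 120_000
video_time_frame2 = video_time_frame1 + video_refresh_period

# videoTime / videoTimeScale = seconds, so the delta between callbacks is:
seconds_between_frames = (video_time_frame2 - video_time_frame1) / video_time_scale
print(seconds_between_frames)   # 0.01666... i.e. 1/60 of a second
```

The host timeline, by contrast, needs mach_timebase_info() to convert its ticks to seconds.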
<span style="font-family: courier;">videoRefreshPeriod </span>is a rational number too, using the same denominator, but it represents the delta between adjacent video frames.</p><p>For <span style="font-family: courier;">CVDisplayLink</span>, the "video" time represents time as measured by vsyncs. You can think of vsyncs as an independent clock - it ticks every so-often, and those ticks don't have to be in exact cadence with any of the other clocks on the system. They're supposed to be, but when you actually measure them, they won't perfectly line up, because of course nothing is that perfect. So, if <span style="font-family: courier;">videoRefreshPeriod </span>/ <span style="font-family: courier;">videoTimeScale </span>equals 1/60, and you record adjacent frames' <span style="font-family: courier;">hostTime </span>and convert them to seconds using <span style="font-family: courier;">mach_timebase_info()</span>, you'll get something that's close to 1/60, but it won't be exact, because nothing is ever that exact all the time.</p><p>So that's what <span style="font-family: courier;">rateScalar </span>tries to measure. It's the only field that is floating point, and it measures the speed of the video timeline relative to the speed of the host timeline. Ideally, it would always be 1.0, but, of course, nothing is ever that perfect. It's not sensitive to workload, just as time doesn't dilate when you start asking your computer to do some work.</p><p>The video time is time based on vsyncs, not time based on the window server render loop or the core animation render loop. If some other application loads up a big Core Animation scene, your <span style="font-family: courier;">CVDisplayLink </span>isn't going to tick slower.</p><p>Also, I assume the fact that <span style="font-family: courier;">videoRefreshPeriod </span>is passed into each callback indicates that videos can change their refresh rate ... 
but I'm not sure.</p>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com1tag:blogger.com,1999:blog-8778351438463999796.post-92233128342377579852019-03-07T22:26:00.002-08:002019-03-07T22:44:17.952-08:00Addition Font<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgxdfjoWM1n1tNeEoun7jw1SWH-wOtVM79cRwa7BCoAT9Q2znYRKeg18W352OartZUCEW_VEn2plVA9BcfdDFZWefagJ0MADa5gIAQEHSGtdcwcPGyxVmb_I-ZFDq6QLeXDXzc5uJGHjM1/s1600/Screen+Shot+2019-03-07+at+10.34.48+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="471" data-original-width="1587" height="94" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgxdfjoWM1n1tNeEoun7jw1SWH-wOtVM79cRwa7BCoAT9Q2znYRKeg18W352OartZUCEW_VEn2plVA9BcfdDFZWefagJ0MADa5gIAQEHSGtdcwcPGyxVmb_I-ZFDq6QLeXDXzc5uJGHjM1/s320/Screen+Shot+2019-03-07+at+10.34.48+PM.png" width="320" /></a></div>
<br />
<h3>
Shaping</h3>
<br />
When laying out text using a font file, the code points in the string are mapped one-to-one to glyphs inside the file. A glyph is a little picture inside the font file, and is identified by an ID, which is a number from 0 to 65535. However, there’s a step after character mapping but before rasterization: shaping.<br />
<br />
For example, one application of shaping is seen in Arabic text. In Arabic, each letter has four different forms, depending on which letters are next to it. For example, two letters in their isolated form look like ف and ي but as soon as you put them together, they form new shapes and look like في. This type of modification isn’t possible if characters are naively mapped to glyphs and then rasterized directly. Instead, there needs to be a step in the middle to modify the glyph forms so the correct thing is rasterized.<br />
<br />
This “middle step” is called shaping, and is implemented by three tables inside the OpenType font file: GSUB, GPOS, and GDEF. Let’s consider GSUB alone.<br />
<br />
<h3>
GSUB</h3>
<br />
The GSUB table, or “glyph substitution” table, is designed to let font authors replace glyphs with other glyphs. It describes a transformation where the input is a sequence of glyphs and the output is a different sequence of glyphs. It is made up of a collection of constituent “lookup tables,” each of which has a “type.”<br />
<br />
Type 1 (“single substitution”) provides a map from glyph to glyph. This is used, for example, when someone enables the ‘swsh’ (swash) feature: the font can substitute the regular ampersand with a fancy ampersand. In that situation, the map would contain a mapping from the regular ampersand to the fancy ampersand (possibly alongside other mappings, too).<br />
<br />
Type 2 (“multiple substitution”) provides a map from glyph to sequence-of-glyphs. This is used, for example, if diacritic (accent) marks are represented as separate glyphs inside the font. The font can replace the “è” glyph with the “e” glyph followed by the ◌̀ glyph (and then the GPOS table later can position the two glyphs physically on top of each other).<br />
<br />
Type 4 (“ligature substitution”) provides a map of sequence-of-glyphs to single glyph (the opposite of type 2). This is used for ligatures, so if you have a fancy “<span style="font-family: "zapfino";">ffi</span>” ligature, you can represent all three of those letters in the same fancy glyph.<br />
<br />
Type 5 (“contextual substitution”) is special. It doesn’t do any replacements directly, but instead maps a sequence of glyphs to a list of other tables that should be applied at specific points in the glyph sequence. So it can say things like “in the glyph sequence ‘abcde’, apply table #7 at index 2, and then when you’re done with that, apply table #13 at index 4.” Tables #7 and #13 can be any of the types above, so you could use this table to say something like “swap out the ‘d’ for an ‘f’, but only if it appears in the sequence ‘abcde’.” This sort of thing is used to implement the “contextual alternates” feature.<br />
<br />
There are also a few other lookup types, but they’re not particularly relevant here, so I’m going to ignore them.<br />
<br />
So, the inputs to the text system are a set of features and an input string of glyphs (the characters have already been mapped to glyphs via the “cmap” table). Features are mapped to a set of lookup tables, each of which has one of the types listed above. Each of those lookup tables describes a map whose keys are sequences of glyphs, so the runtime iterates through the glyph sequence until it finds a sequence that’s an input to one of the tables. The runtime then performs that glyph replacement according to the rules of the table, and continues iterating.<br />
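To make that runtime loop concrete, here’s a toy model of it in Python. The lookup contents (a made-up ligature and a made-up swash substitution) are purely illustrative; they aren’t taken from any real font:

```python
# Toy model of the GSUB runtime: walk the glyph stream looking for a
# sequence that keys one of the active lookups, apply the replacement,
# and keep going. Lookups are applied in order, one pass each.
lookups = [
    {"type": "ligature", "map": {("f", "f", "i"): ("ffi",)}},   # type 4
    {"type": "single",   "map": {("&",): ("fancy&",)}},          # type 1
]

def shape(glyphs):
    glyphs = list(glyphs)
    for lookup in lookups:
        i = 0
        while i < len(glyphs):
            for key, replacement in lookup["map"].items():
                if tuple(glyphs[i:i + len(key)]) == key:
                    glyphs[i:i + len(key)] = replacement
                    break
            i += 1
    return glyphs

print(shape(["f", "f", "i", "x", "&"]))  # ['ffi', 'x', 'fancy&']
```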
<br />
<h3>
Turing Complete</h3>
<br />
So this is pretty cool, but it turns out that the contextual substitution lookup type is really powerful. This is because the table that it references can be itself, which means it can be recursive.<br />
<br />
Let’s pretend we have a lookup table named 42 (presumably because it’s the 42nd lookup table inside the font), and it’s a contextual substitution lookup table. This table maps glyph sequences to tuples of (lookup table to recurse to, offset in glyph sequence to recurse). Let’s say we design it with these two mappings:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Table42 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> A A : (Table42, 1);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> A B : (Table100, 1);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
If the runtime is operating on a glyph sequence of “<span style="font-family: "courier new" , "courier" , monospace;">AAAAB</span>”, the first two “<span style="font-family: "courier new" , "courier" , monospace;">AA</span>”s will match the first rule, so then the system recurses and runs <span style="font-family: "courier new" , "courier" , monospace;">Table42</span> on the stream “<span style="font-family: "courier new" , "courier" , monospace;">AAAB</span>”. Then these first two “<span style="font-family: "courier new" , "courier" , monospace;">AA</span>”s will match, and so-on. This happens until you get to the end of the string, “<span style="font-family: "courier new" , "courier" , monospace;">AB</span>” matches, and then <span style="font-family: "courier new" , "courier" , monospace;">Table100</span> is run on the string “<span style="font-family: "courier new" , "courier" , monospace;">B</span>”.<br />
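Here’s that trace as a Python sketch. Real GSUB recursion is driven by (lookup, index) pairs inside a type 5 rule; since the post leaves <span style="font-family: "courier new" , "courier" , monospace;">Table100</span>’s contents unspecified, I’ve made it a hypothetical type 1 substitution of “B” for “C” just so the recursion produces a visible result:

```python
# Each contextual rule: if the pattern matches at index i, recurse into
# the named lookup at the given offset. Table100 is hypothetical here
# (a type 1 substitution B -> C), only to make the effect observable.
def apply_table42(glyphs, i):
    if glyphs[i:i + 2] == ["A", "A"]:
        apply_table42(glyphs, i + 1)   # rule "A A : (Table42, 1)"
    elif glyphs[i:i + 2] == ["A", "B"]:
        apply_table100(glyphs, i + 1)  # rule "A B : (Table100, 1)"

def apply_table100(glyphs, i):
    if i < len(glyphs) and glyphs[i] == "B":
        glyphs[i] = "C"

sequence = list("AAAAB")
apply_table42(sequence, 0)
print("".join(sequence))  # AAAAC
```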
<br />
This is “tail recursion,” and can be used to implement all different types of loops. Also, each mapping in the table acts as an “if” statement because it only executes if the pattern is matched.<br />
<br />
You can use the glyph stream as memory by reading and writing to it; that is, after all, what the shaping algorithm is designed to do. You can delete a glyph by using Type 2 to map it to an empty sequence. You can insert a glyph by using Type 2 to map the preceding glyph to a sequence of [itself, the new glyph you want to insert]. And, once you’ve inserted a glyph, you can check for its presence by using the “if” statements described above.<br />
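Those two memory operations can be modeled directly. In the font each one would be a type 2 lookup; plain list surgery in Python shows the idea:

```python
# Modeling the glyph stream as memory:
# delete: a type 2 lookup maps a glyph to the empty sequence
# insert: a type 2 lookup maps the preceding glyph to [itself, new glyph]
def delete_glyph(stream, i):
    stream[i:i + 1] = []

def insert_after(stream, i, new_glyph):
    stream[i:i + 1] = [stream[i], new_glyph]

stream = list("abc")
insert_after(stream, 1, "FLAG")   # ['a', 'b', 'FLAG', 'c']
delete_glyph(stream, 0)           # ['b', 'FLAG', 'c']
print(stream)  # ['b', 'FLAG', 'c']
```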
<br />
So that’s a pretty powerful virtual machine. I think the above is sufficient to prove Turing-completeness.<br />
<br />
<h3>
Caveat</h3>
<br />
So it turns out the example (“<span style="font-family: "courier new" , "courier" , monospace;">Table42</span>”) above doesn’t actually work in DirectWrite. This is because, in DirectWrite, inner matches have to be entirely contained within outer matches. So when the outer call to <span style="font-family: "courier new" , "courier" , monospace;">Table42</span> matched “<span style="font-family: "courier new" , "courier" , monospace;">AA</span>”, the inner call to <span style="font-family: "courier new" , "courier" , monospace;">Table42</span> can only match within that specific “<span style="font-family: "courier new" , "courier" , monospace;">AA</span>”. This means it’s impossible to, for example, find the first even glyphID and move it to the beginning. So, DirectWrite’s implementation isn’t Turing complete. However, it does work in HarfBuzz and CoreText, so those implementations are Turing complete.<br />
<br />
But even in HarfBuzz and CoreText, there are hard limits on the recursion depth. HarfBuzz sets its limit to 6. Therefore, the above example will only work on strings of length 7 or fewer. HarfBuzz is open source, though, so I simply used a custom build of HarfBuzz which bumps up this limit to 4 billion. This let me recurse to my heart’s content. A limit of 6 is probably a good thing; I don’t think users generally expect their text engines to be running arbitrary computation during layout. But I want to go beyond it.<br />
<br />
<h3>
DSL</h3>
<br />
After making the above realizations, I decided to try to implement a nontrivial algorithm using only the GSUB table in a font. I wanted to try to implement addition. The input glyph stream would be of the form “<span style="font-family: "courier new" , "courier" , monospace;">=1234+5678=</span>” and the shaping process would turn that string into “<span style="font-family: "courier new" , "courier" , monospace;">6912</span>”.<br />
<br />
When thinking about how to do this, I started jotting down some ideas on paper for what the lookup tables should be, and the things I was writing down were very similar to that “<span style="font-family: "courier new" , "courier" , monospace;">Table42</span>” example above. However, writing down tables of types 1, 2, and 4 was quite cumbersome, because what I really wanted to describe were things like “move this glyph to the beginning of the sequence” rather than individual insertions or deletions.<br />
<br />
I looked at the “fea” language, which is how these lookups are traditionally written by font designers. However, after reading the <a href="https://github.com/fonttools/fonttools/blob/cbd099522446a6815ae4015294a77b7c69788270/Lib/fontTools/feaLib/parser.py">parser</a>, it looks like it doesn’t support recursive or mutually-recursive lookups.<br />
<br />
So, I did what any good programmer does in this situation: I invented a new domain-specific language.<br />
<br />
The DSL has two types of statements. The first is a way of giving a set of glyphs a name. I wanted to be able to address all of the digits without having to write out every individual digit. So, there’s a statement that looks like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">digit: 1 2 3 4 5 6 7 8 9 10;</span><br />
<br />
Note that those numbers are glyph IDs, not code points. In my specific font, the “0” character is mapped to glyph 1, “1” is mapped to glyph 2, etc.<br />
<br />
Then, you can describe a lookup using the syntax above:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">digitMove {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, digitMove), (1, digitMove2);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus: (1, digitMove2);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
The lookup is named <span style="font-family: "courier new" , "courier" , monospace;">digitMove</span>, and it has two rules inside it. The first rule matches any two digits next to each other, and if they match, it invokes the lookup named <span style="font-family: "courier new" , "courier" , monospace;">digitMove</span> (which is the name of the current lookup, so this is recursive) at index 1, and then after that, invokes the lookup named <span style="font-family: "courier new" , "courier" , monospace;">digitMove2</span> at index 1.<br />
<br />
Each of these stanzas gets translated fairly trivially to a lookup of type 5.<br />
<br />
These rules are recursive, as above, but they need a terminal form so that the recursion will eventually end. Those are described like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">flagPayload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus digit: flag \3 \0 \1 \2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
This rule is a terminal because there are no parentheses on the right side of the colon. The right side represents a glyph sequence to replace the match with. The values on the right side are either a literal glyph ID, a glyph set that contains exactly one glyph, or a backreference which starts with a backslash. “<span style="font-family: "courier new" , "courier" , monospace;">\3</span>” means “the glyph at index 3 (counting from zero) in the matched sequence”. So the rule above would turn the glyph sequence “<span style="font-family: "courier new" , "courier" , monospace;">34+5</span>” into the sequence “<span style="font-family: "courier new" , "courier" , monospace;">F534+</span>”. (The flag glyph is read and removed in a later stage of processing).<br />
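Applying a terminal rule is then just substitution with backreferences. A minimal sketch, where backreferences are plain integers and anything else is a literal glyph:

```python
# Build the replacement sequence for a matched span: integers are
# backreferences into the match, everything else is a literal glyph.
def apply_terminal(matched, replacement):
    return [matched[item] if isinstance(item, int) else item
            for item in replacement]

# The rule above: "digit digit plus digit : flag \3 \0 \1 \2"
result = apply_terminal(list("34+5"), ["F", 3, 0, 1, 2])
print("".join(result))  # F534+
```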
<br />
Translating one of these rules to lookups is nontrivial. I tried a few things, but ended up with the following design:<br />
<br />
For each output glyph, right to left:<br />
<br />
<ul>
<li>If it’s a backreference, duplicate the glyph it’s referencing, and perform a sequence of swaps to move the new glyph all the way to the right.</li>
<li>If it’s a literal glyph, insert it at the beginning, and perform a sequence of swaps to move it all the way to the right.</li>
</ul>
<br />
This means we need to have a way to do the following operations:<br />
<br />
<ul>
<li>Duplicate. This is a type 4 lookup that maps every glyph to a sequence of two of that glyph.</li>
<li>Swap. This has two pieces: an outer type 5 lookup that has a rule for every ordered pair of glyphs, and each rule invokes inner lookups of type 1 which replace each glyph of the pair with the other one. You need n of these inner (type 1) lookups (one per target glyph), allowing you to map any glyph to any other glyph. However, the encoding format allows each of these inner lookups to be encoded in constant space in the font, so the inner lookups don’t take much space; it’s the outer type 5 lookup that takes n^2 space.</li>
<li>Insert a literal. If you implemented this by simply making a type 2 that mapped every glyph to that same glyph + the literal, you would need n^2 space because there would be n of these tables. Instead, you can cut down the size by doing it in two phases: inserting a flag glyph (which is O(n) space using a lookup type 2) and mapping that glyph to any constant value (also O(n) space using a type 1).</li>
</ul>
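As a sketch of the swap construction (with arbitrary small integers standing in for glyph IDs): the outer type 5 lookup contributes one rule per ordered pair of glyphs, and each rule dispatches to two of the n constant-size inner type 1 lookups:

```python
# Sketch of the swap primitive. In the real font, become_k[k] is a
# constant-size type 1 "become glyph k" lookup, and swap() stands for the
# n^2-rule outer type 5 contextual lookup that picks which two to apply.
n = 4
become_k = [lambda stream, i, k=k: stream.__setitem__(i, k) for k in range(n)]

def swap(stream, i):
    a, b = stream[i], stream[i + 1]   # the outer rule matched the pair (a, b)
    become_k[b](stream, i)            # apply "become b" at offset 0
    become_k[a](stream, i + 1)        # apply "become a" at offset 1

stream = [2, 3, 1]
swap(stream, 0)
print(stream)  # [3, 2, 1]
```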
<br />
Above, I’m worried about space constraints in the file because pointers in the file are (in general) 2 bytes, meaning the maximum size that anything can be is 2^16 bytes. If n^2 space is needed, that means n can only be 2^8 = 256, which isn’t that big. Most fonts have on the order of 256 glyphs. Therefore, we need to reduce the places where we require O(n^2) space as much as possible. LookupType 7 helps somewhat, because it allows you to use 32-bit pointers in one specific place, but it only helps that one place.<br />
<br />
My font only has 14 glyphs in it, so I didn’t end up near any of these limits, but it’s still important to watch out for out-of-bounds problems.<br />
<br />
So, given all that, we can make a parser which builds an AST for the language, and we can build an intermediate representation which represents the bytes in the file, and we can make a lowering phase which lowers the AST to the IR. Then we can serialize the IR and write out the data to the file.<br />
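The front of that pipeline is straightforward. As one hypothetical fragment (the record shapes here are invented for illustration, not my actual compiler), parsing a glyph-set statement of the DSL might look like:

```python
# A minimal shape for the compiler front end described above: parse one
# DSL glyph-set statement into an AST node. Record layout is invented.
from dataclasses import dataclass

@dataclass
class GlyphSet:
    name: str
    glyphs: list

def parse_glyph_set(line):
    # e.g. "digit: 1 2 3 4 5 6 7 8 9 10;"
    name, rest = line.split(":")
    return GlyphSet(name.strip(), [int(g) for g in rest.strip(" ;").split()])

print(parse_glyph_set("digit: 1 2 3 4 5 6 7 8 9 10;"))
```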
<br />
<h3>
Addition</h3>
<br />
So, once the language was up and running, I had to actually write a program that represented addition. It works in four phases.<br />
<br />
First, define some glyphs:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">digit: 1 2 3 4 5 6 7 8 9 10;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">flag: 13;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">plus: 11;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">equals: 12;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">digit0: 1;</span><br />
<br />
Then, parse the string. If the string didn’t match the form “<span style="font-family: "courier new" , "courier" , monospace;">=digits+digits=</span>” then I wanted nothing to happen. You can do this by recursing across the string, and if you find that it matches the pattern, insert a flag, and then when all the calls return, move the flag leftward.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">parse {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> equals digit: (1, parseLeft), (0, afterParse);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">parseLeft {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, parseLeft), (0, moveFlagAcrossDigit);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus digit: (2, parseRight), (0, moveFlagAcrossPlus);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">parseRight {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, parseRight), (0, moveFlagAcrossDigit);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit equals: flag \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">moveFlagAcrossDigit {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit flag digit: \1 \0 \2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">moveFlagAcrossPlus {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus flag digit: \2 \0 \1 \3;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">afterParse {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> equals flag: (0, removeFlag), (0, startDigitMove);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">removeFlag {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> equals flag: \0;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
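The job of this parse phase, expressed outside the font for clarity, is just a form check; a regular expression is the obvious reference implementation:

```python
# Reference check for the parse phase: accept only strings of the form
# "=digits+digits=", and do nothing otherwise.
import re

def matches_form(s):
    return re.fullmatch(r"=\d+\+\d+=", s) is not None

print(matches_form("=1234+5678="))  # True
print(matches_form("1234+5678"))    # False
```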
The next step is to pair up the glyphs. For example, this would turn “<span style="font-family: "courier new" , "courier" , monospace;">1234+5678</span>” into “<span style="font-family: "courier new" , "courier" , monospace;">15263748+</span>”.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">startDigitMove {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> equals digit: (1, digitMove), (1, startPhase2), (0, removeEquals);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">removeEquals {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> equals digit: \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">digitMove {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, digitMove), (1, digitMove2);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus: (1, digitMove2);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">digitMove2 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, digitMove2), (0, swapDigits);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus digit: (2, digitMove2), (0, digitMove3);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus equals: digit0 \0 \1 \2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plus digit: (1, digitMove2), (0, swapPlusDigit);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plus equals: digit0 \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit equals: \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">swapDigits {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: \1 \0;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">digitMove3 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit plus digit: \2 \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">swapPlusDigit {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plus digit: \1 \0;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
The next step is to see if there were any glyphs left over on the right side that didn’t get moved. This happens if the right side is longer than the left side. For example, if the input string is “<span style="font-family: "courier new" , "courier" , monospace;">12+3456</span>” we would now have “<span style="font-family: "courier new" , "courier" , monospace;">1526+34</span>”. We want to turn this into “<span style="font-family: "courier new" , "courier" , monospace;">03041526</span>”.<br />
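The intended net effect of the pairing and padding phases can be written as a short reference implementation (dropping the “+”, which the font removes along the way; the zero-padding of the shorter operand matches what the rules achieve with <span style="font-family: "courier new" , "courier" , monospace;">digit0</span>):

```python
# Reference implementation of the pairing phases: zero-pad the shorter
# operand, then interleave the two operands' digits pairwise.
def pair_up(left, right):
    width = max(len(left), len(right))
    left, right = left.rjust(width, "0"), right.rjust(width, "0")
    return "".join(l + r for l, r in zip(left, right))

print(pair_up("1234", "5678"))  # 15263748
print(pair_up("12", "3456"))    # 03041526
```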
<br />
<span style="font-family: "courier new" , "courier" , monospace;">startPhase2 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (0, phase2), (0, beginPhase3);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">phase2 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (0, phase2Move), (0, checkPayload);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">phase2Move {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit digit digit: (2, phase2Move), (0, movePayload);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus equals: \0 \1 \2 \3;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus digit: (3, payload), (0, flagPayload);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">payload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (1, payload), (0, swapDigits);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit equals: \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">flagPayload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus digit: flag \3 \0 \1 \2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">movePayload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit flag digit: \2 \3 \0 \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">checkPayload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> flag digit digit digit: (0, rearrangePayload), (0, phase2);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">rearrangePayload {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> flag digit digit digit: digit0 \1 \2 \3;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
The last step is to actually perform the addition. This works like a ripple carry adder. We want to take the glyphs two-at-a-time, and add them, and produce a carry. Then the next pair of glyphs will add, and include the carry. We start the process by introducing a carry = 0.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">beginPhase3 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit: (0, phase3), (0, removeZero);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">phase3 {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit digit digit: (2, phase3), (0, addPair);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus equals: (0, insertCarry), (0, addPair);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">insertCarry {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit digit plus equals: \0 \1 digit0;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">removeZero {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit0 digit: (1, removeZero), (0, removeSingleZero);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">removeSingleZero {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> digit0 digit: \1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">addPair {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 1 1: 1 1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 1 2: 1 2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 2 1: 1 2;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 2 2: 1 3;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 3 1: 1 3;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 3 2: 1 4;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> … more here</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
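The full <span style="font-family: "courier new" , "courier" , monospace;">addPair</span> table can be generated mechanically. My reading of the rules above is that glyph d+1 stands for digit d and that the first glyph of each rule is the incoming carry; under that assumption, a generator plus a spot-check against the listed rules looks like this:

```python
# Generate the addPair rules: keys are (carry glyph, digit glyph, digit
# glyph), values are (new carry glyph, sum digit glyph), where glyph d+1
# encodes digit d. The carry ordering is my reading of the rules.
def add_pair_rules():
    rules = {}
    for carry in (0, 1):
        for a in range(10):
            for b in range(10):
                s = a + b + carry
                rules[(carry + 1, a + 1, b + 1)] = (s // 10 + 1, s % 10 + 1)
    return rules

rules = add_pair_rules()
print(rules[(1, 3, 2)])  # (1, 4), matching the post's "1 3 2: 1 4"
```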
<br />
<h3>
HarfBuzz</h3>
<br />
So I’m doing this whole thing using the HarfBuzz shaper, as described above. This is because it’s open source, so I can find where I’m hitting limits and increase them. It turned out that not only did I have to increase <span style="font-family: "courier new" , "courier" , monospace;">HB_MAX_NESTING_LEVEL</span> to <span style="font-family: "courier new" , "courier" , monospace;">4294967294</span>, but I was also running into other limits. I ended up just taking all the limits in <span style="font-family: "courier new" , "courier" , monospace;">hb-buffer.hh</span>, <span style="font-family: "courier new" , "courier" , monospace;">hb-machinery.hh</span>, and <span style="font-family: "courier new" , "courier" , monospace;">hb-ot-layout-common.hh</span> and increasing them by a factor of 10.<br />
<br />
There’s one more piece that was necessary to get it to work. Inside <span style="font-family: "courier new" , "courier" , monospace;">apply_lookup()</span> in <span style="font-family: "courier new" , "courier" , monospace;">hb-ot-layout-gsubgpos.hh</span>, there’s a section <span style="font-family: "courier new" , "courier" , monospace;">if (end <= int (match_positions[idx]))</span>. It looks to me like this section is detecting if a recursive call caused the glyph sequence to get shorter than the size of the match. Inside this block, it says <span style="font-family: "courier new" , "courier" , monospace;">/* There can't be any further changes. */ break;</span> which seems to stop the recursion (which seems incorrect to me, but I’m not a HarfBuzz developer, so I could be wrong). In order to get this whole system to work, I had to comment out the “<span style="font-family: "courier new" , "courier" , monospace;">break</span>” statement.<br />
<br />
So that’s it! After doing that, the system works, and the font correctly adds numbers. The font has 75 shaping rules and is 32KB large.<br />
<br />
The glyph paths (contours) were taken from the <a href="https://fontlibrary.org/en/font/retroscape">Retroscape</a> font.<br />
<br />
<a href="https://drive.google.com/file/d/14oNfNM1aYmgq3r8f50SmIHWQHz08HMWx/view?usp=sharing">Font file download</a>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com69tag:blogger.com,1999:blog-8778351438463999796.post-23763444695768263792019-03-02T19:04:00.000-08:002019-03-02T19:10:38.976-08:00Wide color vs HDROver the past few years, there’s been something of a renaissance in display technology. It started with <a href="https://en.wikipedia.org/wiki/Retina_display">retina displays</a> and now is extending to <a href="https://developer.apple.com/videos/play/wwdc2016/712/">wide color</a> and <a href="https://support.microsoft.com/en-us/help/4040263/windows-10-hdr-advanced-color-settings">HDR</a>. Wide color has been added to Apple’s devices, and HDR support has arrived in Windows. These are similar technologies, but they aren’t the same.<br />
<br />
HDR allows you to display the same colors you could display without it, but at a higher luminosity (colloquially: brightness). This is sort of like the difference between one red lightbulb and two red lightbulbs. Looking at two red lightbulbs doesn’t change the color of the red; it’s just brighter.<br />
<br />
Wide color, on the other hand, lets you see colors that weren’t possible to see before. It makes it possible to display colors that are more saturated than could otherwise be shown.<br />
<br />
HDR monitors use the same color primaries as non-HDR monitors, but the luminosity of each of those primaries can grow beyond 1.0. On the other hand, wide color monitors use different, more saturated primaries.<br />
<br />
<canvas height="400" id="canvas2" width="400"></canvas>
<script id="vertexShader" type="x-shader/x-vertex">
uniform mat4 modelViewProjectionMatrix;
attribute vec3 position;
attribute vec3 color;
varying vec3 outColor;
void main() {
gl_Position = modelViewProjectionMatrix * vec4(position, 1);
outColor = color;
}
</script>
<script id="fragmentShader" type="x-shader/x-fragment">
precision mediump float;
varying vec3 outColor;
void main() {
gl_FragColor = vec4(outColor, 1);
}
</script>
<script>
function crossProduct(u, v) {
return [u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]];
}
function unitVector(v) {
var length = Math.sqrt(Math.pow(v[0], 2) + Math.pow(v[1], 2) + Math.pow(v[2], 2));
return [v[0] / length, v[1] / length, v[2] / length];
}
function constructLookAtMatrix(eyePosition, centerPosition, upVector) {
var f = [centerPosition[0] - eyePosition[0], centerPosition[1] - eyePosition[1], centerPosition[2] - eyePosition[2]];
var fUnit = unitVector(f);
var upUnit = unitVector(upVector);
var s = crossProduct(fUnit, upUnit);
var sUnit = unitVector(s);
var u = crossProduct(sUnit, fUnit);
return DOMMatrix.fromMatrix({
m11: s[0], m21: s[1] , m31: s[2] , m41: -eyePosition[0],
m12: u[0], m22: u[1] , m32: u[2] , m42: -eyePosition[1],
m13: -fUnit[0], m23: -fUnit[1], m33: -fUnit[2], m43: -eyePosition[2],
m14: 0 , m24: 0 , m34: 0 , m44: 1,
});
}
function constructPerspectiveMatrix(yFieldOfView, aspectRatio, nearPlaneDistance, farPlaneDistance) {
var f = 1 / Math.tan(yFieldOfView / 2);
var m11 = f / aspectRatio;
var m22 = f;
var m33 = (farPlaneDistance + nearPlaneDistance) / (nearPlaneDistance - farPlaneDistance);
var m43 = (2 * farPlaneDistance * nearPlaneDistance) / (nearPlaneDistance - farPlaneDistance);
return DOMMatrix.fromMatrix({
m11: m11, m21: 0 , m31: 0 , m41: 0 ,
m12: 0 , m22: m22, m32: 0 , m42: 0 ,
m13: 0 , m23: 0 , m33: m33, m43: m43,
m14: 0 , m24: 0 , m34: -1 , m44: 1 ,
});
}
function constructModelViewProjectionMatrix(aspectRatio, angle) {
var modelToWorldMatrix = new DOMMatrix();
modelToWorldMatrix.rotateAxisAngleSelf(0, 1, 0, angle);
modelToWorldMatrix.translateSelf(-0.5, -0.5, -0.5);
var worldToCameraMatrix = constructLookAtMatrix([0, 0.15, 1], [0, 0, 0], [0, 1, 0]);
var projectionMatrix = constructPerspectiveMatrix(0.8 * (Math.PI / 2), aspectRatio, 0.1, 10);
return projectionMatrix.multiply(worldToCameraMatrix).multiply(modelToWorldMatrix);
}
function matrixToSequence(matrix) {
// Column major
return [matrix.m11, matrix.m12, matrix.m13, matrix.m14,
matrix.m21, matrix.m22, matrix.m23, matrix.m24,
matrix.m31, matrix.m32, matrix.m33, matrix.m34,
matrix.m41, matrix.m42, matrix.m43, matrix.m44];
}
function start() {
let canvas = document.getElementById("canvas2");
let context = canvas.getContext("webgl");
let vertexShaderElement = document.getElementById("vertexShader");
let fragmentShaderElement = document.getElementById("fragmentShader");
var vertexShader = context.createShader(context.VERTEX_SHADER);
context.shaderSource(vertexShader, vertexShaderElement.text);
context.compileShader(vertexShader);
let compiled = context.getShaderParameter(vertexShader, context.COMPILE_STATUS);
let fragmentShader = context.createShader(context.FRAGMENT_SHADER);
context.shaderSource(fragmentShader, fragmentShaderElement.text);
context.compileShader(fragmentShader);
compiled = context.getShaderParameter(fragmentShader, context.COMPILE_STATUS);
let program = context.createProgram();
context.attachShader(program, vertexShader);
context.attachShader(program, fragmentShader);
context.linkProgram(program);
let linked = context.getProgramParameter(program, context.LINK_STATUS);
let sRGBVertices = new Float32Array([
0, 0, 0, 0, 1, 0, // left front
0, 0, 0, 1, 0, 0, // bottom front
1, 0, 0, 1, 1, 0, // right front
0, 1, 0, 1, 1, 0, // top front
0, 0, 1, 0, 1, 1, // left back
0, 0, 1, 1, 0, 1, // bottom back
1, 0, 1, 1, 1, 1, // right back
0, 1, 1, 1, 1, 1, // top back
0, 1, 0, 0, 1, 1, // top left
0, 0, 0, 0, 0, 1, // bottom left
1, 1, 0, 1, 1, 1, // top right
1, 0, 0, 1, 0, 1 // bottom right
]);
let sRGBVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, sRGBVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, sRGBVertices, context.STATIC_DRAW);
let sRGBColors = new Float32Array([
0, 0, 0, 0, 1, 0, // left front
0, 0, 0, 1, 0, 0, // bottom front
1, 0, 0, 1, 1, 0, // right front
0, 1, 0, 1, 1, 0, // top front
0, 0, 1, 0, 1, 1, // left back
0, 0, 1, 1, 0, 1, // bottom back
1, 0, 1, 1, 1, 1, // right back
0, 1, 1, 1, 1, 1, // top back
0, 1, 0, 0, 1, 1, // top left
0, 0, 0, 0, 0, 1, // bottom left
1, 1, 0, 1, 1, 1, // top right
1, 0, 0, 1, 0, 1 // bottom right
]);
let sRGBColorBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, sRGBColorBuffer);
context.bufferData(context.ARRAY_BUFFER, sRGBColors, context.STATIC_DRAW);
let p3Vertices = new Float32Array([
0, 0, 0, -0.225, 1.042, -0.079, // left front
0, 0, 0, 1.225, -0.042, -0.02, // bottom front
1.225, -0.042, -0.02, 1.0, 1.0, -0.098, // right front
-0.225, 1.042, -0.079, 1.0, 1.0, -0.098, // top front
0.0, 0.0, 1.098, -0.225, 1.042, 1.019, // left back
0.0, 0.0, 1.098, 1.225, -0.042, 1.078, // bottom back
1.225, -0.042, 1.078, 1, 1, 1, // right back
-0.225, 1.042, 1.019, 1, 1, 1, // top back
-0.225, 1.042, -0.079, -0.225, 1.042, 1.019, // top left
0, 0, 0, 0.0, 0.0, 1.098, // bottom left
1.0, 1.0, -0.098, 1, 1, 1, // top right
1.225, -0.042, -0.02, 1.225, -0.042, 1.078 // bottom right
]);
let p3VertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, p3VertexBuffer);
context.bufferData(context.ARRAY_BUFFER, p3Vertices, context.STATIC_DRAW);
let p3Colors = new Float32Array([
1, 1, 1, 1, 1, 1, // left front
1, 1, 1, 1, 1, 1, // bottom front
1, 1, 1, 1, 1, 1, // right front
1, 1, 1, 1, 1, 1, // top front
1, 1, 1, 1, 1, 1, // left back
1, 1, 1, 1, 1, 1, // bottom back
1, 1, 1, 1, 1, 1, // right back
1, 1, 1, 1, 1, 1, // top back
1, 1, 1, 1, 1, 1, // top left
1, 1, 1, 1, 1, 1, // bottom left
1, 1, 1, 1, 1, 1, // top right
1, 1, 1, 1, 1, 1 // bottom right
]);
let p3ColorBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, p3ColorBuffer);
context.bufferData(context.ARRAY_BUFFER, p3Colors, context.STATIC_DRAW);
let hdrVertices = new Float32Array([
0.0, 0.0, 0.0, -0.8406297072082757, 1.9076076353654263, -0.052351971805095654, // left front
0.0, 0.0, 0.0, 2.554871967813559, -0.09029165014876057, -0.026618440567143256, // bottom front
2.554871967813559, -0.09029165014876057, -0.026618440567143256, 1.7142421158939605, 1.817316070188582, -0.07897041518986225, // right front
-0.8406297072082757, 1.9076076353654263, -0.052351971805095654, 1.7142421158939605, 1.817316070188582, -0.07897041518986225, // top front
0.040655566745996685, -0.05532626436054705, 1.839225462436676, -0.799974313187599, 1.8522814843535422, 1.7868734956026078, // left back
0.040655566745996685, -0.05532626436054705, 1.839225462436676, 2.5955275093317036, -0.14561788636446005, 1.8126070237517355, // bottom back
2.5955275093317036, -0.14561788636446005, 1.8126070237517355, 1.7548976064920434, 1.7619898903012274, 1.7602550538778303, // right back
-0.799974313187599, 1.8522814843535422, 1.7868734956026078, 1.7548976064920434, 1.7619898903012274, 1.7602550538778303, // top back
-0.8406297072082757, 1.9076076353654263, -0.052351971805095654, -0.799974313187599, 1.8522814843535422, 1.7868734956026078, // top left
0.0, 0.0, 0.0, 0.040655566745996685, -0.05532626436054705, 1.839225462436676, // bottom left
1.7142421158939605, 1.817316070188582, -0.07897041518986225, 1.7548976064920434, 1.7619898903012274, 1.7602550538778303, // top right
2.554871967813559, -0.09029165014876057, -0.026618440567143256, 2.5955275093317036, -0.14561788636446005, 1.8126070237517355, // bottom right
]);
let hdrVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, hdrVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, hdrVertices, context.STATIC_DRAW);
let hdrColors = new Float32Array([
0, 1, 1, 0, 1, 1, // left front
0, 1, 1, 0, 1, 1, // bottom front
0, 1, 1, 0, 1, 1, // right front
0, 1, 1, 0, 1, 1, // top front
0, 1, 1, 0, 1, 1, // left back
0, 1, 1, 0, 1, 1, // bottom back
0, 1, 1, 0, 1, 1, // right back
0, 1, 1, 0, 1, 1, // top back
0, 1, 1, 0, 1, 1, // top left
0, 1, 1, 0, 1, 1, // bottom left
0, 1, 1, 0, 1, 1, // top right
0, 1, 1, 0, 1, 1 // bottom right
]);
let hdrColorBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, hdrColorBuffer);
context.bufferData(context.ARRAY_BUFFER, hdrColors, context.STATIC_DRAW);
let laptopVertices = new Float32Array([
0.0, 0.0, 0.0, -0.0002792076006530794, 0.7211763429544866, 0.0006004979923367598, // left front
0.0, 0.0, 0.0, 0.7263884289287031, 0.00030994328223168564, 0.00013657837584614765, // bottom front
0.7263884289287031, 0.00030994328223168564, 0.00013657837584614765, 0.7261091228932143, 0.7214863152667879, 0.0007370786458253953, // right front
-0.0002792076006530794, 0.7211763429544866, 0.0006004979923367598, 0.7261091228932143, 0.7214863152667879, 0.0007370786458253953, // top front
0.0016182385280728573, -0.0010248487599194105, 0.7190324349924921, 0.0013389428615571686, 0.7201515444993972, 0.7196329519808291, // left back
0.0016182385280728573, -0.0010248487599194105, 0.7190324349924921, 0.7280066970169545, -0.0007149118453264602, 0.7191690410017967, // bottom back
0.7280066970169545, -0.0007149118453264602, 0.7191690410017967, 0.7277272919297222, 0.7204615152657031, 0.7197694932579994, // right back
0.0013389428615571686, 0.7201515444993972, 0.7196329519808291, 0.7277272919297222, 0.7204615152657031, 0.7197694932579994, // top back
-0.0002792076006530794, 0.7211763429544866, 0.0006004979923367598, 0.0013389428615571686, 0.7201515444993972, 0.7196329519808291, // top left
0.0, 0.0, 0.0, 0.0016182385280728573, -0.0010248487599194105, 0.7190324349924921, // bottom left
0.7261091228932143, 0.7214863152667879, 0.0007370786458253953, 0.7277272919297222, 0.7204615152657031, 0.7197694932579994, // top right
0.7263884289287031, 0.00030994328223168564, 0.00013657837584614765, 0.7280066970169545, -0.0007149118453264602, 0.7191690410017967, // bottom right
]);
let laptopVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, laptopVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, laptopVertices, context.STATIC_DRAW);
let laptopColors = new Float32Array([
1, 0, 1, 1, 0, 1, // left front
1, 0, 1, 1, 0, 1, // bottom front
1, 0, 1, 1, 0, 1, // right front
1, 0, 1, 1, 0, 1, // top front
1, 0, 1, 1, 0, 1, // left back
1, 0, 1, 1, 0, 1, // bottom back
1, 0, 1, 1, 0, 1, // right back
1, 0, 1, 1, 0, 1, // top back
1, 0, 1, 1, 0, 1, // top left
1, 0, 1, 1, 0, 1, // bottom left
1, 0, 1, 1, 0, 1, // top right
1, 0, 1, 1, 0, 1 // bottom right
]);
let laptopColorBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, laptopColorBuffer);
context.bufferData(context.ARRAY_BUFFER, laptopColors, context.STATIC_DRAW);
context.useProgram(program);
let vertexBufferAttribLocation = context.getAttribLocation(program, "position");
context.enableVertexAttribArray(vertexBufferAttribLocation);
let colorAttribLocation = context.getAttribLocation(program, "color");
context.enableVertexAttribArray(colorAttribLocation);
let modelViewProjectionMatrixLocation = context.getUniformLocation(program, "modelViewProjectionMatrix");
//let colorLocation = context.getUniformLocation(program, "color");
context.lineWidth(3);
context.clearColor(0, 0, 0, 1);
context.enable(context.DEPTH_TEST);
context.enable(context.BLEND);
context.blendFunc(context.SRC_ALPHA, context.ONE_MINUS_SRC_ALPHA);
let theta = 0;
let offsetX;
let offsetY;
let dragging = false;
function onMouseDrag(event) {
var newX = event.offsetX;
var newY = event.offsetY;
var deltaX = newX - offsetX;
var deltaY = newY - offsetY;
theta += deltaX;
offsetX = newX;
offsetY = newY;
}
function onMouseUp() {
dragging = false;
canvas.removeEventListener("mousemove", onMouseDrag, false);
canvas.removeEventListener("mouseup", onMouseUp, false);
}
function onMouseDown(event) {
if (!dragging) {
dragging = true;
canvas.addEventListener("mousemove", onMouseDrag, false);
canvas.addEventListener("mouseup", onMouseUp, false);
offsetX = event.offsetX;
offsetY = event.offsetY;
}
}
canvas.addEventListener("mousedown", onMouseDown, false);
function draw(timeDelta, aspectRatio) {
context.clear(context.COLOR_BUFFER_BIT | context.DEPTH_BUFFER_BIT);
if (!dragging)
theta += timeDelta * 0.05;
let modelViewProjectionMatrix = matrixToSequence(constructModelViewProjectionMatrix(aspectRatio, theta));
context.useProgram(program);
context.bindBuffer(context.ARRAY_BUFFER, sRGBVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.bindBuffer(context.ARRAY_BUFFER, sRGBColorBuffer);
context.vertexAttribPointer(colorAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(modelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
//context.uniform4fv(colorLocation, [1, 0, 0, 1]);
context.drawArrays(context.LINES, 0, 24);
context.bindBuffer(context.ARRAY_BUFFER, p3VertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.bindBuffer(context.ARRAY_BUFFER, p3ColorBuffer);
context.vertexAttribPointer(colorAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(modelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
//context.uniform4fv(colorLocation, [0, 1, 0, 1]);
context.drawArrays(context.LINES, 0, 24);
context.bindBuffer(context.ARRAY_BUFFER, hdrVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.bindBuffer(context.ARRAY_BUFFER, hdrColorBuffer);
context.vertexAttribPointer(colorAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(modelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
//context.uniform4fv(colorLocation, [0, 0, 1, 1]);
context.drawArrays(context.LINES, 0, 24);
context.bindBuffer(context.ARRAY_BUFFER, laptopVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.bindBuffer(context.ARRAY_BUFFER, laptopColorBuffer);
context.vertexAttribPointer(colorAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(modelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
//context.uniform4fv(colorLocation, [0, 0, 1, 1]);
context.drawArrays(context.LINES, 0, 24);
}
let previousTime;
function tick(time) {
if (previousTime == undefined)
previousTime = time;
let aspectRatio = canvas.clientWidth / canvas.clientHeight;
context.viewport(0, 0, canvas.clientWidth, canvas.clientHeight);
draw(time - previousTime, aspectRatio);
previousTime = time;
window.requestAnimationFrame(tick);
}
window.requestAnimationFrame(tick);
let error = context.getError();
}
window.addEventListener("load", start);
</script>
<br />
Click and drag to rotate!
<br />
The colorful cube is sRGB, normalized to the luminosity of an iPad Pro screen. The white lines describe the gamut of an iPad Pro screen using the Display-P3 color space. The light blue describes the gamut of an ASUS ROG PG27UQ monitor, which is both HDR and wide color. The purple describes the gamut of a SurfaceBook laptop. The coordinate system is XYZ, but transformed such that sRGB is a unit cube.<br />
<br />
In the above diagram, luminosity is roughly equivalent to distance in the +X+Y+Z direction. The chroma (hue and saturation) of a point is roughly the angle between two lines, one of which goes through the origin and the point, and the other goes through the origin and pure white. Therefore, wider colors are characterized by the three primary axes pointing in more opposite directions, whereas luminosity is roughly how far those lines extend.<br />
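To make the geometry above concrete, here is a small sketch (illustrative, not part of the diagram’s own code) that measures a color’s chroma as the angle between its line through the origin and the line through pure white:

```javascript
// Chroma, per the description above: the angle (in radians) between the
// line from the origin through a color's XYZ point and the line from the
// origin through pure white. Scaling a color (making it brighter) moves it
// farther along its line but leaves this angle unchanged.
function angleFromWhite(xyz, whiteXYZ) {
    var dot = function(a, b) { return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]; };
    var length = function(a) { return Math.sqrt(dot(a, a)); };
    // Clamp to guard against floating-point values slightly outside [-1, 1].
    var cosine = Math.min(1, Math.max(-1, dot(xyz, whiteXYZ) / (length(xyz) * length(whiteXYZ))));
    return Math.acos(cosine);
}
```

Doubling every component of a color doubles its luminosity but leaves this angle unchanged, which is the two-red-lightbulbs observation from earlier: brighter, but not a different color.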
<br />
You can see this above. The black and white points are shared between sRGB and Display P3, but the Display P3 monitor can show more points around the middle. So it isn’t more luminous, but it is wider. The ASUS monitor is both wide and HDR, so its axes open up widely, and also extend very far. A monitor that’s HDR but not wide would have the same primaries as sRGB, but would extend out far like the ASUS monitor.<br />
<br />
Luminosity isn’t only tangentially related to color; in fact, each color has exactly one luminosity value. If you take a color and convert it to the XYZ color space, the Y component is luminosity. So an HDR monitor can show colors with Y components significantly larger than a non-HDR monitor can. A wide color monitor can’t, but it can show colors with X and Z values beyond what non-wide monitors can show.<br />
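As a sketch of that conversion (using the standard linear-sRGB-to-XYZ matrix, the inverse of the XYZ-to-sRGB matrix in the Swift Playground later in this post; gamma decoding is omitted for brevity):

```javascript
// Convert a linear (gamma-decoded) sRGB color to XYZ. The middle row is
// the luminosity: Y = 0.2126 r + 0.7152 g + 0.0722 b.
function srgbLinearToXYZ(r, g, b) {
    var X = 0.4124 * r + 0.3576 * g + 0.1805 * b;
    var Y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    var Z = 0.0193 * r + 0.1192 * g + 0.9505 * b;
    return [X, Y, Z];
}
```

sRGB white (1, 1, 1) maps to Y = 1, the reference white luminosity; an HDR monitor can produce colors whose Y lands well above 1 on this scale.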
<br />
This is kind of interesting, because the sRGB spec says that its white point is defined to be 80 nits (the unit of luminosity). However, over the decades, monitors have gotten brighter, presumably because, psychologically, consumers prefer brighter displays to dimmer ones. Nowadays, most monitors are around 200-300 nits. Therefore, if you strictly adhere to the spec, an sRGB color value (r, g, b) should be some particular point in XYZ space, but in practice, because everyone bought brighter monitors, those same color values (r, g, b) are actually a point with a much greater Y value in XYZ. So different displays have different primaries, but they also have different luminosities, which affects how far from the origin the white point is. You can see this in the above diagram: the Surface Book’s maximum white point is significantly smaller than the color cube, because the Surface Book reports a luminosity of only 270 nits. The diagram above is normalized to the luminosity of an iPad Pro, which is <a href="https://www.laptopmag.com/articles/ipad-pro-10-inch-upgrade">measured</a> by laptopmag.com to be 368 nits.<br />
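That scaling can be sketched as follows (the 80-nit figure is from the sRGB spec; the monitor brightness values here are illustrative):

```javascript
// If a monitor is brighter than the sRGB spec's 80-nit white, its white
// point sits proportionally farther from the origin along the same line
// in XYZ space.
function scaleWhitePointForBrightness(specWhiteXYZ, measuredNits, specNits) {
    var scale = measuredNits / specNits;
    return [specWhiteXYZ[0] * scale, specWhiteXYZ[1] * scale, specWhiteXYZ[2] * scale];
}
```

For example, a 270-nit display’s white point lands 270 / 80 = 3.375 times farther out than the 80-nit spec white point.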
<br />
You can get all this information on Windows by using the <a href="https://docs.microsoft.com/en-us/windows/desktop/api/DXGI1_6/nf-dxgi1_6-idxgioutput6-getdesc1">IDXGIOutput6::GetDesc1()</a> API call. This call gives you a lot of information, and it’s a little bit difficult to decipher. The redPrimary, greenPrimary, and bluePrimary fields give you the direction of each of the primaries in XYZ space. Each one is reported as an (x, y) tuple, which is the result of the <a href="https://en.wikipedia.org/wiki/CIE_1931_color_space#CIE_xy_chromaticity_diagram_and_the_CIE_xyY_color_space">calculations</a> X/(X+Y+Z) and Y/(X+Y+Z), respectively. Notice that you’re only given two pieces of information; that means this isn’t a 3D point in XYZ space, but rather a line. The line can be given in parametric form:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">X(t) = x * t</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Y(t) = y * t</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Z(t) = (1-x-y) * t</span><br />
<br />
As you can see, this passes through the origin and extends outward in some direction forever. Therefore, these xy values give you direction, but not magnitude.<br />
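In code, that parametric line looks like this (a sketch; t is the free parameter):

```javascript
// An (x, y) chromaticity gives only a direction in XYZ space. Every choice
// of t lands on the same line through the origin, so every choice has the
// same chromaticity; the magnitude is still unknown.
function pointOnChromaticityLine(x, y, t) {
    return [x * t, y * t, (1 - x - y) * t];
}
```

Recomputing x = X / (X + Y + Z) from any point on the line returns the original x, since the components always sum to t.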
<br />
To get magnitude, you need to consider the white point. The white point also has a direction, given in xy coordinates, which tells you which direction the farthest corner of the cube lies in, but not how far along that line the corner is. To figure this out, you have to use the luminance figures reported by that API. Luminance is the Y channel of XYZ, so if you know the Y value and the direction of the line, you can solve for X and Z. Then, once you know that point, you can solve for the maximum extents of the primaries using the fact that redPrimary + greenPrimary + bluePrimary = whitePoint. That gives you the entire cube.<br />
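That solve can be sketched as follows (using the textbook sRGB primaries and D65 white point as illustrative inputs, not values reported by any particular monitor):

```javascript
// 3x3 determinant, used for Cramer's rule below.
function det3(m) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}
// Find the scale factors s, t, u that place each primary's endpoint along
// its chromaticity line such that red + green + blue = white.
function solvePrimaryScales(redxy, greenxy, bluexy, whiteXYZ) {
    // Each column of the matrix is a primary's direction (x, y, 1 - x - y).
    var columns = [redxy, greenxy, bluexy].map(function(p) {
        return [p[0], p[1], 1 - p[0] - p[1]];
    });
    var M = [0, 1, 2].map(function(row) {
        return [columns[0][row], columns[1][row], columns[2][row]];
    });
    var d = det3(M);
    function withColumnReplaced(i) {
        return M.map(function(rowValues, row) {
            return rowValues.map(function(v, column) {
                return column === i ? whiteXYZ[row] : v;
            });
        });
    }
    return [det3(withColumnReplaced(0)) / d,
            det3(withColumnReplaced(1)) / d,
            det3(withColumnReplaced(2)) / d];
}
```

With sRGB’s primaries and a D65 white of luminosity 1, the Y components of the solved primaries come out to roughly 0.2126, 0.7152, and 0.0722: the familiar sRGB luminosity coefficients.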
<br />
Calculating the cube for iOS is simpler. The Display P3 color space is supposed to match the colors representable on the monitor, so we can interrogate the color space instead of the monitor’s reported info. You can construct a CGColor using <a href="https://developer.apple.com/documentation/coregraphics/cgcolorspace/1408916-displayp3">CGColorSpace.displayP3</a> and then use CGColor’s <a href="https://developer.apple.com/documentation/coregraphics/cgcolor/1455493-converted">conversion function</a> to turn it into an XYZ color. You can then scale the result by the luminosity of the display (which I looked up on laptopmag.com).<br />
<br />
Here's the full text of the Swift Playground I used to calculate the Windows information:<br />
<span style="font-family: Courier New, Courier, monospace;">import Foundation</span><br />
<span style="font-family: Courier New, Courier, monospace;">import CoreGraphics</span><br />
<span style="font-family: Courier New, Courier, monospace;">import GLKit</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">func calculateWhitePoint() -> (CGFloat, CGFloat, CGFloat) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let xWhite = CGFloat(0.3125)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let yWhite = CGFloat(0.329101563)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let zWhite = 1 - xWhite - yWhite</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let luminance = CGFloat(658.345215)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let normalizedLuminance = luminance / 374</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> let t = normalizedLuminance / yWhite</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let XWhite = xWhite * t</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let YWhite = yWhite * t</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let ZWhite = (1 - xWhite - yWhite) * t</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> return (XWhite, YWhite, ZWhite)</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">func convertXYZToRGB(X: CGFloat, Y: CGFloat, Z: CGFloat) -> (CGFloat, CGFloat, CGFloat) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let r = 3.2406 * X - 1.5372 * Y - 0.4986 * Z</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let g = -0.9689 * X + 1.8758 * Y + 0.0415 * Z</span><br />
<span style="font-family: Courier New, Courier, monospace;"> let b = 0.0557 * X - 0.2040 * Y + 1.0570 * Z</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return (r, g, b)</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">let (XWhite, YWhite, ZWhite) = calculateWhitePoint()</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// X(t) = x * t</span><br />
<span style="font-family: Courier New, Courier, monospace;">// Y(t) = y * t</span><br />
<span style="font-family: Courier New, Courier, monospace;">// Z(t) = (1 - x - y) * t</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">let xRed = Float(0.674804688)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let yRed = Float(0.316406250)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let zRed = 1 - xRed - yRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">let xGreen = Float(0.1953125)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let yGreen = Float(0.708007813)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let zGreen = 1 - xGreen - yGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">let xBlue = Float(0.151367188)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let yBlue = Float(0.046875)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let zBlue = 1 - xBlue - yBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// Red primary (XRed, YRed, ZRed): s * (xRed, yRed, zRed)</span><br />
<span style="font-family: Courier New, Courier, monospace;">// Green primary (XGreen, YGreen, ZGreen): t * (xGreen, yGreen, zGreen)</span><br />
<span style="font-family: Courier New, Courier, monospace;">// Blue primary (XBlue, YBlue, ZBlue): u * (xBlue, yBlue, zBlue)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// XWhite = XRed + XGreen + XBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// YWhite = YRed + YGreen + YBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// ZWhite = ZRed + ZGreen + ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// XWhite = s * xRed + t * xGreen + u * xBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// YWhite = s * yRed + t * yGreen + u * yBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// ZWhite = s * zRed + t * zGreen + u * zBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// [xRed, xGreen, xBlue] [s] [XWhite]</span><br />
<span style="font-family: Courier New, Courier, monospace;">// [yRed, yGreen, yBlue] * [t] = [YWhite]</span><br />
<span style="font-family: Courier New, Courier, monospace;">// [zRed, zGreen, zBlue] [u] [ZWhite]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">let matrix = GLKMatrix3MakeAndTranspose(xRed, xGreen, xBlue, yRed, yGreen, yBlue, zRed, zGreen, zBlue)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let inverted = GLKMatrix3Invert(matrix, nil)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let solution = GLKMatrix3MultiplyVector3(inverted, GLKVector3Make(Float(XWhite), Float(YWhite), Float(ZWhite)))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let s = solution.x</span><br />
<span style="font-family: Courier New, Courier, monospace;">let t = solution.y</span><br />
<span style="font-family: Courier New, Courier, monospace;">let u = solution.z</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">let XRed = s * xRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">let YRed = s * yRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">let ZRed = s * zRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">let XGreen = t * xGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">let YGreen = t * yGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">let ZGreen = t * zGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">let XBlue = u * xBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">let YBlue = u * yBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">let ZBlue = u * zBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// Let's check our work</span><br />
<span style="font-family: Courier New, Courier, monospace;">XRed + XGreen + XBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">XWhite</span><br />
<span style="font-family: Courier New, Courier, monospace;">YRed + YGreen + YBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">YWhite</span><br />
<span style="font-family: Courier New, Courier, monospace;">ZRed + ZGreen + ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">ZWhite</span><br />
<span style="font-family: Courier New, Courier, monospace;">XRed / (XRed + YRed + ZRed)</span><br />
<span style="font-family: Courier New, Courier, monospace;">xRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">YRed / (XRed + YRed + ZRed)</span><br />
<span style="font-family: Courier New, Courier, monospace;">yRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">XGreen / (XGreen + YGreen + ZGreen)</span><br />
<span style="font-family: Courier New, Courier, monospace;">xGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">YGreen / (XGreen + YGreen + ZGreen)</span><br />
<span style="font-family: Courier New, Courier, monospace;">yGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">XBlue / (XBlue + YBlue + ZBlue)</span><br />
<span style="font-family: Courier New, Courier, monospace;">xBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">YBlue / (XBlue + YBlue + ZBlue)</span><br />
<span style="font-family: Courier New, Courier, monospace;">yBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">// 0 0 0 -> 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 1 0 0 -> XRed, YRed, ZRed</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 0 1 0 -> XGreen, YGreen, ZGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 0 0 1 -> XBlue, YBlue, ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 1 1 0 -> XRed + XGreen, YRed + YGreen, ZRed + ZGreen</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 0 1 1 -> XGreen + XBlue, YGreen + YBlue, ZGreen + ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 1 0 1 -> XRed + XBlue, YRed + YBlue, ZRed + ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;">// 1 1 1 -> XRed + XGreen + XBlue, YRed + YGreen + YBlue, ZRed + ZGreen + ZBlue</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">let _000 = convertXYZToRGB(X: 0, Y: 0, Z: 0)</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _100 = convertXYZToRGB(X: CGFloat(XRed), Y: CGFloat(YRed), Z: CGFloat(ZRed))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _010 = convertXYZToRGB(X: CGFloat(XGreen), Y: CGFloat(YGreen), Z: CGFloat(ZGreen))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _001 = convertXYZToRGB(X: CGFloat(XBlue), Y: CGFloat(YBlue), Z: CGFloat(ZBlue))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _110 = convertXYZToRGB(X: CGFloat(XRed + XGreen), Y: CGFloat(YRed + YGreen), Z: CGFloat(ZRed + ZGreen))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _011 = convertXYZToRGB(X: CGFloat(XGreen + XBlue), Y: CGFloat(YGreen + YBlue), Z: CGFloat(ZGreen + ZBlue))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _101 = convertXYZToRGB(X: CGFloat(XRed + XBlue), Y: CGFloat(YRed + YBlue), Z: CGFloat(ZRed + ZBlue))</span><br />
<span style="font-family: Courier New, Courier, monospace;">let _111 = convertXYZToRGB(X: CGFloat(XRed + XGreen + XBlue), Y: CGFloat(YRed + YGreen + YBlue), Z: CGFloat(ZRed + ZGreen + ZBlue))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">/*</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 0, 0, 0, 1, 0, // left front</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 0, 0, 1, 0, 0, // bottom front</span><br />
<span style="font-family: Courier New, Courier, monospace;">1, 0, 0, 1, 1, 0, // right front</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 1, 0, 1, 1, 0, // top front</span><br />
<span style="font-family: Courier New, Courier, monospace;">*/</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_000.0), \(_000.1), \(_000.2), \(_010.0), \(_010.1), \(_010.2), // left front")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_000.0), \(_000.1), \(_000.2), \(_100.0), \(_100.1), \(_100.2), // bottom front")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_100.0), \(_100.1), \(_100.2), \(_110.0), \(_110.1), \(_110.2), // right front")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_010.0), \(_010.1), \(_010.2), \(_110.0), \(_110.1), \(_110.2), // top front")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">/*</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 0, 1, 0, 1, 1, // left back</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 0, 1, 1, 0, 1, // bottom back</span><br />
<span style="font-family: Courier New, Courier, monospace;">1, 0, 1, 1, 1, 1, // right back</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 1, 1, 1, 1, 1, // top back</span><br />
<span style="font-family: Courier New, Courier, monospace;">*/</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_001.0), \(_001.1), \(_001.2), \(_011.0), \(_011.1), \(_011.2), // left back")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_001.0), \(_001.1), \(_001.2), \(_101.0), \(_101.1), \(_101.2), // bottom back")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_101.0), \(_101.1), \(_101.2), \(_111.0), \(_111.1), \(_111.2), // right back")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_011.0), \(_011.1), \(_011.2), \(_111.0), \(_111.1), \(_111.2), // top back")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">/*</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 1, 0, 0, 1, 1, // top left</span><br />
<span style="font-family: Courier New, Courier, monospace;">0, 0, 0, 0, 0, 1, // bottom left</span><br />
<span style="font-family: Courier New, Courier, monospace;">1, 1, 0, 1, 1, 1, // top right</span><br />
<span style="font-family: Courier New, Courier, monospace;">1, 0, 0, 1, 0, 1 // bottom right</span><br />
<span style="font-family: Courier New, Courier, monospace;">*/</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_010.0), \(_010.1), \(_010.2), \(_011.0), \(_011.1), \(_011.2), // top left")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_000.0), \(_000.1), \(_000.2), \(_001.0), \(_001.1), \(_001.2), // bottom left")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_110.0), \(_110.1), \(_110.2), \(_111.0), \(_111.1), \(_111.2), // top right")</span><br />
<span style="font-family: Courier New, Courier, monospace;">print("\(_100.0), \(_100.1), \(_100.2), \(_101.0), \(_101.1), \(_101.2), // bottom right")</span><br />
<div>
<br /></div>
Texture Sampling<br />
<br />
Textures are one of the fundamental data types in 3D graphics. Any time you want to show an image on a 3D surface, you use a texture.<br />
<br />
<h3>
Texture Types</h3>
<br />
First of all, there are many kinds of textures. The simplest kind to understand is the 2D texture, whose purpose is to act like a rectangular image. Each element in the image is configurable; you can specify that it’s a float, or an int, or 4 floats (one for each channel of RGBA), etc. These “elements” are usually called “texels.” There are also 1D and 3D textures, which act analogously.<br />
<br />
Then, you’ve got 1D texture arrays, and 2D texture arrays, which are not simply arrays-of-textures. Instead, they are distinct types, where each element in the array is the relevant texture type. They are their own distinct resource types because GPUs can operate on them in hardware, so the array doesn’t have to be implemented in software. As such, the hardware restricts each element in the array to have the same dimensions. If you don’t like this requirement, you can create a software array of textures, and it will go slower but the requirement won’t apply. (Or you could even have an array of texture arrays!)<br />
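As a concrete sketch of the 2D-texture-array type (using Metal, since this is macOS; the specific sizes and formats here are made up for illustration):

```swift
import Metal

// Sketch: describing a 2D texture array in Metal. Because the array is a
// hardware type, every slice must share the same dimensions and pixel format.
let descriptor = MTLTextureDescriptor()
descriptor.textureType = .type2DArray
descriptor.pixelFormat = .rgba8Unorm
descriptor.width = 256
descriptor.height = 256
descriptor.arrayLength = 16      // 16 slices, all forced to be 256x256 RGBA8
// A MTLDevice would turn this descriptor into an actual texture:
// let texture = device.makeTexture(descriptor: descriptor)
```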
<br />
<h3>
Mipmaps</h3>
<br />
There’s one other important piece to textures: mipmaps. Generally, textures are mapped onto arbitrary 3d geometry, which means that the number of pixels on-screen the texture is stretched over is totally arbitrary. Using regular projection matrices, the farther the geometry is from the viewer, the fewer pixels the texture is mapped onto.<br />
<br />
Consider drawing a single pixel of geometry that is far away from the camera. Here, the entire texture will be squished to fit into a small number of pixels, so that single pixel will be covered by many texels. So, if the renderer wanted to compute an accurate color for that pixel, it would have to average all the covered texels together. However, what if that geometry moves closer to the camera, such that each pixel contains only ~1 texel? In this situation, no averaging is necessary; you can just do a single read of the texture data.<br />
<br />
So, if the texture is big relative to the size it’s drawn on-screen, that’s a problem, but if it’s small, that’s no problem. Think about that for a second - big data sizes are a problem, but small data sizes are okay. So what if the system could just reduce the big texture to a small texture as a preprocess? In fact, if there were a collection of reductions of various sizes, there would always be a size appropriate for the number of pixels being drawn.<br />
<br />
That’s exactly what a mipmap is. If a 2D texture has dimensions m * n, the object also has storage for an additional level of m/2 * n/2, another of m/4 * n/4, and so on, down to a single texel. This doesn’t waste much memory: each 2D level holds a quarter of the texels of the one before it, and x + x/4 + x/16 + … = (4/3) * x, so the whole chain costs only about a third more than the base texture alone. (For a 1D texture, where each level is half the previous one, x + x/2 + x/4 + … = 2*x, so even there the overhead is at most the size of the original.) This storage scheme also assumes that texture sizes are powers of two, which was traditionally required, though nowadays many implementations have extensions that relax this requirement.<br />
<br />
So, naïvely, addressing a 2D texture requires 3 components: x, y, and which mipmap level. 3D textures require 4 components, and 1D textures require 2 components. 2D texture arrays require 4 components (there’s an extra one for the layer in the array) and 1D texture arrays require 3 components. With these components, the system only has to do a single read at runtime - no looping over texels required.<br />
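To make the storage math concrete, here’s a small sketch (plain Swift, no graphics API) that enumerates the mip levels of a 2D texture and tallies the memory overhead; since each 2D level holds a quarter of the texels of the one before it, the whole chain totals about 4/3 of the base level:

```swift
// Enumerate every mipmap level of a 2D texture: halve each dimension
// (never below 1) until reaching a single texel.
func mipChain(width: Int, height: Int) -> [(width: Int, height: Int)] {
    var levels = [(width: width, height: height)]
    var (w, h) = (width, height)
    while w > 1 || h > 1 {
        w = max(1, w / 2)
        h = max(1, h / 2)
        levels.append((width: w, height: h))
    }
    return levels
}

let chain = mipChain(width: 256, height: 256)   // 9 levels: 256x256 down to 1x1
let texels = chain.map { $0.width * $0.height }
let overhead = Double(texels.dropFirst().reduce(0, +)) / Double(texels[0])
// overhead comes out to about 0.333: the extra levels cost roughly a third
// of the base texture.
```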
<br />
<h3>
Automatic Miplevel Selection</h3>
<br />
The shader API, however, can calculate the mipmap level for you, so you don’t have to do that yourself in the shader (though you can if you want to). The key here is to figure out how many texels per pixel the texture is getting squished down to. If the answer is 2, you should use the second mipmap level. If the answer is 4, you should use the third mipmap level (since each level is half as large as the previous).<br />
<br />
So how does the system know how many texels cover your pixel? Well, if you think about it, this is the screen-space derivative of the sampling coordinate in the base level. Stated differently, it’s the rate of change of the texture coordinate (in texels) across the screen. So, how do you calculate this?<br />
<br />
If the code you’re writing is differentiable, you could calculate the derivative yourself in closed form and use that. However, the system can approximate it automatically, using the fact that the GPU scheduler can schedule fragment shader threads however it likes. If the scheduler dispatches fragment shader threads in 2x2 blocks, the threads in a block can share data with each other. Approximating the derivative is then easy: it’s rise-over-run, the difference of adjacent sampling coordinates divided by the difference of adjacent screen-space coordinates. Because we are sampling adjacent pixels, the difference of adjacent screen-space coordinates is just 1, so the derivative is computed by simply subtracting the sampling positions of adjacent pixels. The pixels in the 2x2 block share the result. (Of course, this sharing only works if every fragment shader in the 2x2 block is at the same point in the shader, so they can cooperate.)<br />
<br />
So, the system does this subtraction of adjacent sampling coordinates to estimate the derivative, and takes the log base 2 of the derivative to select which miplevel to use. The result may not be exactly integral, so the sampler describes whether to round to the nearest integer miplevel or to read both straddling miplevels and compute a weighted average. You can also short-circuit this computation by explicitly specifying derivatives to use (the derivatives won’t be automatically calculated, but everything else works the same way) or by specifying which miplevel to use directly.<br />
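The whole selection can be sketched in a few lines of plain Swift (using the length of the derivative vector, one admissible choice for the spec’s scale function):

```swift
import Foundation

// Sketch of automatic miplevel selection. (dudx, dvdx) is the screen-x
// derivative of the denormalized texel coordinate; (dudy, dvdy) is the
// screen-y derivative. The GPU gets these by differencing a 2x2 quad.
func mipLevel(dudx: Double, dvdx: Double, dudy: Double, dvdy: Double) -> Double {
    let rhoX = (dudx * dudx + dvdx * dvdx).squareRoot()  // texels per pixel, horizontally
    let rhoY = (dudy * dudy + dvdy * dvdy).squareRoot()  // texels per pixel, vertically
    let rho = max(rhoX, rhoY)     // the worse direction wins when not anisotropic
    return max(0.0, log2(rho))    // clamp: magnification stays on the base level
}

// 4 texels per pixel -> lambda = 2; the sampler then either rounds this
// to one level or blends levels 2 and 3 using the fractional part.
let lambda = mipLevel(dudx: 4, dvdx: 0, dudy: 0, dvdy: 4)
```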
<br />
<h3>
Dimension Reduction</h3>
<br />
But I’ve breezed over one of the details here - 2D textures have 2-dimensional texel coordinates, and screens also have 2-dimensional coordinates. How do we reduce these to a single miplevel? The Vulkan spec doesn’t actually describe exactly how to reduce the 2-dimensional texel coordinates into a single scalar, but it <a href="https://www.khronos.org/registry/vulkan/specs/1.1/html/vkspec.html#textures-scale-factor">does say</a> in section 15.6.7:<br />
<br />
ρ<sub>x</sub> and ρ<sub>y</sub> may be approximated with functions f<sub>x</sub> and f<sub>y</sub>, subject to the following constraints:<br />
f<sub>x</sub> is continuous and monotonically increasing in each of m<sub>ux</sub>, m<sub>vx</sub>, and m<sub>wx</sub><br />
f<sub>y</sub> is continuous and monotonically increasing in each of m<sub>uy</sub>, m<sub>vy</sub>, and m<sub>wy</sub><br />
max(|m<sub>ux</sub>|, |m<sub>vx</sub>|, |m<sub>wx</sub>|) &le; f<sub>x</sub> &le; sqrt(2) * (|m<sub>ux</sub>| + |m<sub>vx</sub>| + |m<sub>wx</sub>|)<br />
max(|m<sub>uy</sub>|, |m<sub>vy</sub>|, |m<sub>wy</sub>|) &le; f<sub>y</sub> &le; sqrt(2) * (|m<sub>uy</sub>| + |m<sub>vy</sub>| + |m<sub>wy</sub>|)<br />
<br />
So, you reduce the n-dimensional texture coordinate to a scalar by making up a formula that fits the above requirements. You apply the function twice - once for the horizontal screen derivative direction, and once for the vertical screen derivative direction.<br />
<br />
So this tells you (roughly) how many texels fit in the pixel vertically, and how many texels fit in the pixel horizontally. But these values don’t have to be the same. Imagine looking out in first-person across a rendered floor. There are many texels squished vertically, but not that many horizontally.<br />
<br />
This is called anisotropy. The amount of anisotropy is just the ratio of these two values. By default (without anisotropic filtering), texture sampling uses the larger of the two values when figuring out which miplevel to use, which avoids aliasing along the squished direction at the cost of blurring the other one. Remember - miplevels are zero-indexed, so the smaller the index, the more data is in that level, meaning the smallest miplevel index holds the highest level of detail. However, there are some techniques in this area that involve doing extra work to improve the quality of the result.<br />
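A sketch of how anisotropic filtering adjusts the selection, loosely following the Vulkan model (the exact formulas vary per implementation, so treat this as an approximation):

```swift
import Foundation

// rhoX / rhoY: texels per pixel along the two screen directions.
// Instead of letting the worse direction force a blurry level, take up to
// maxAnisotropy samples along the squished axis and use a sharper level.
func anisotropicSelection(rhoX: Double, rhoY: Double,
                          maxAnisotropy: Double) -> (level: Double, samples: Int) {
    let rhoMax = max(rhoX, rhoY)
    let rhoMin = max(min(rhoX, rhoY), 1e-9)        // guard against division by zero
    let eta = min(rhoMax / rhoMin, maxAnisotropy)  // anisotropy ratio, clamped
    let level = max(0.0, log2(rhoMax / eta))       // sharper than log2(rhoMax)
    return (level, Int(eta.rounded(.up)))
}

// Looking out across a floor: 8 texels per pixel vertically, 1 horizontally.
// With 16x anisotropy allowed, we can stay on level 0 and take 8 samples.
let result = anisotropicSelection(rhoX: 1, rhoY: 8, maxAnisotropy: 16)
```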
<br />
<h3>
Wrapping Things Up</h3>
<br />
At this point, the sampler provides shader authors some control over the miplevel selection. The sampler / optional arguments can include a “LOD Bias” which gets added to this value, so the author can get higher-or-lower detail as necessary. The sampler / optional arguments can also include a “LOD Clamp” which will be applied here, if, for example, not all the miplevels of the texture have their contents populated yet.<br />
<br />
So, now that you have a miplevel, you can do the rest of the operation. If the sampler says the sampling coordinate is normalized, you denormalize it by multiplying by the dimensions of the miplevel, and modulus / mirror / whatever the sampler tells you to do. Then, depending on the sampler settings, you either round the denormalized coordinates to the nearest integer, or you read all the straddling texels and perform a weighted average. Then, if the sampler tells you to, you do it all again at the next miplevel, and perform yet another weighted average.<br />
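The tail end of that pipeline can be sketched for a 1D texture: denormalize, wrap (here clamp-to-edge, one of the sampler’s options), then blend the two straddling texels for linear filtering. Note that texel i’s center sits at (i + 0.5) / width:

```swift
// Linear filtering of a 1D texture with clamp-to-edge wrapping.
func sampleLinear(texels: [Double], coordinate: Double) -> Double {
    let width = Double(texels.count)
    // Denormalize, shifting by 0.5 so texel centers land on integers.
    let x = min(max(coordinate * width - 0.5, 0), width - 1)  // clamp-to-edge
    let i0 = Int(x.rounded(.down))
    let i1 = min(i0 + 1, texels.count - 1)
    let t = x - Double(i0)                 // weight between the straddling texels
    return texels[i0] * (1 - t) + texels[i1] * t
}

// Two texels, black then white: full black up to 1/4 of the texture,
// a gradient from 1/4 to 3/4, and full white from 3/4 onward.
let mid = sampleLinear(texels: [0, 1], coordinate: 0.5)   // mid-gradient: 0.5
```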
<br />
There’s one last tiny detail I’ve skipped over, and that is the fact that texel elements are considered to lie at the center of the texel. So, if you have a 1D texture with 2 texels, where one is black and one is white, 1/4 of the way through the texture will be full black, 3/4 of the way through will be full white, and from 1/4 to 3/4 there will be a gradient from black to white. But what is drawn from 0 to 1/4 and from 3/4 to 1? What about values less than 0 or greater than 1? The sampler allows for configuring this. The modulus / mirroring operation results in a value that is either on the interior of the texture, or within 1 texel of the edge. These edge texels either get values from being repeated / mirrored / whatever, or they can just be set to a constant “border color.” This color is fed as input to the weighted average calculation, so everything just works correctly.<br />
<br />
Comparison of Entity Framework to Core Data<br />
<br />
Object Relational Mapping libraries connect in-memory object graphs to relational databases. Object-oriented programming is built upon the idea that there is an in-memory object graph, where each object is an instance of a class. An ORM is the software that can save that object graph to a database, either on-disk or using a service across the network.<br />
<br />
<a href="https://docs.microsoft.com/en-us/ef/index">Entity Framework</a> is Microsoft’s premier ORM library, and <a href="https://developer.apple.com/documentation/coredata">Core Data</a> is Apple’s premier ORM library. Both have the same goals - to persist an object graph to a database - but they were developed by different companies for different languages. It stands to reason that they made some different design choices.<br />
<br />
<h2>
Which Entity Framework?</h2>
<br />
Microsoft is infamous for creating multiple ways to do the same thing, and ORM libraries are no different. There are two versions of Entity Framework: Entity Framework 6 and Entity Framework Core. The <a href="https://docs.microsoft.com/en-us/ef/efcore-and-ef6/choosing">documentation</a> says that Entity Framework Core is the new hotness. Also, Entity Framework Core is <a href="https://github.com/aspnet/EntityFrameworkCore">open source</a>.<br />
<br />
So let’s start using Entity Framework Core, right? Well, not so fast. It turns out that you have to pick a runtime that Entity Framework Core will run on top of.<br />
<br />
<h2>
Which Runtime?</h2>
<br />
Entity Framework was originally developed for .NET. So that’s fine, but it turns out there are multiple versions of .NET.<br />
<ul>
<li><a href="https://en.wikipedia.org/wiki/.NET_Framework">.NET Framework</a> only runs on Windows</li>
<li><a href="https://docs.microsoft.com/en-us/dotnet/core/index">.NET Core</a> is written by Microsoft, and runs on Windows, Linux, and macOS. The documentation says that .NET Core is better than .NET Framework. Also, .NET Core is <a href="https://github.com/dotnet/core">open source</a>.</li>
<li><a href="https://docs.microsoft.com/en-us/dotnet/standard/net-standard">.NET Standard</a> is just a standard. It isn’t a piece of software - it’s a specification that describes a level of support that a runtime needs to have in order to be compliant. <a href="https://visualstudio.microsoft.com/xamarin/">Xamarin</a> is another .NET runtime that supports the .NET Standard (and it runs on iOS / Android). Targeting this runtime means your app will work in every .NET implementation, but it won’t have access to some of the libraries only present in .NET Core.</li>
<li>The <a href="https://en.wikipedia.org/wiki/Universal_Windows_Platform">Universal Windows Platform</a> is a runtime compliant with the .NET Standard. The Entity Framework documentation says that UWP is now supported. One interesting note: as part of the compilation process, the platform-independent .NET bytecode is run through the <a href="https://docs.microsoft.com/en-us/dotnet/framework/net-native/index">.NET Native</a> toolchain, which produces a platform-dependent binary. They say this is to improve performance. (So I guess this means that the Universal Windows Platform isn’t really universal?) This compilation is somewhat lossy because reflection doesn’t fully work in native apps, and it sounds like Entity Framework had some bugs here that they had to fix.</li>
</ul>
There’s an <a href="https://docs.microsoft.com/en-us/ef/core/get-started/uwp/getting-started">example</a> in the Entity Framework Core documentation about how to use it with the Universal Windows Platform, and UWP is the new hotness, so I’ll use that. If you dig into the example, you’ll find that the Entity Framework tools don’t work with UWP projects, so they had to make a dummy .NET Core project with nothing inside it, just to run the tools. How unfortunate.<br />
<br />
<h2>
Getting Entity Framework</h2>
<br />
Entity Framework is not built in to the system. Instead, you’ll have to get it from Visual Studio’s blessed package manager, named <a href="https://www.nuget.org/">NuGet</a>. When you install packages with NuGet, they’re not installed across the whole system; instead, they’re installed only for a single project. NuGet is built in to Visual Studio - simply go to Project -> Manage NuGet Packages to search/install packages.<br />
<br />
Entity Framework is designed to be pluggable to different kinds of databases, and each database has its own package inside NuGet. The example uses a SQLite database, so it uses the Microsoft.EntityFrameworkCore.Sqlite package. There is also another package, Microsoft.EntityFrameworkCore.Tools, which includes command-line tools to generate migration code / apply migrations, so that one is included too.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglOdjbIenBJtQVliiSYDPKnmYzDbAH5Rz_XzcxP7yF7wviUkDy_jdOoSTWQhQqe20sbjZHVd25_LwNx6CATKWTEcQ3odvNE8FpbiiXtQUZKeOjRTXsSC0q5II41hG9qveTWoo05uKf8USe/s1600/NuGet.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1034" data-original-width="1600" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglOdjbIenBJtQVliiSYDPKnmYzDbAH5Rz_XzcxP7yF7wviUkDy_jdOoSTWQhQqe20sbjZHVd25_LwNx6CATKWTEcQ3odvNE8FpbiiXtQUZKeOjRTXsSC0q5II41hG9qveTWoo05uKf8USe/s320/NuGet.PNG" width="320" /></a></div>
<br />
<br />
<h2>
How to get Core Data</h2>
<br />
It’s already part of the platform, and there’s only one version. Just use it.<br />
<br />
<h2>
High Level</h2>
<br />
Both libraries have a concept of a “context” which is the thing that holds the link to all the objects in the object graph. For Entity Framework, this is the <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.dbcontext?view=efcore-2.1">Microsoft.EntityFrameworkCore.DbContext</a>, and for Core Data, this is the <a href="https://developer.apple.com/documentation/coredata/nsmanagedobjectcontext">NSManagedObjectContext</a>. When you create an object, you register it with the context, and when you delete an object, you notify the context that it has been deleted. After you’ve done all your modifications, you tell the context to “save,” which stores all the changes in the database.<br />
<br />
Entity Framework:<br />
<code>var blog = new Blog { Url = url };<br />
db.Blogs.Add(blog);<br />
db.SaveChanges();</code><br />
<br />
Core Data:<br />
<code>let blog = Blog(context: context)<br />
blog.url = url<br />
try context.save()</code><br />
<br />
Read/Modify/Write operations are also quite similar:<br />
<br />
Entity Framework:<br />
<code>var blog = db.Blogs.First();<br />
blog.Url = url;<br />
db.SaveChanges();</code><br />
<br />
Core Data:<br />
<code>let fetchRequest = Blog.fetchRequest() as! NSFetchRequest&lt;Blog&gt;<br />
fetchRequest.fetchLimit = 1<br />
let blog = try context.fetch(fetchRequest)[0]<br />
blog.url = url<br />
try context.save()</code><br />
<br />
<br />
<h2>
Context</h2>
<br />
In Core Data, the NSManagedObjectContext is just a class. When modifications are made to the object graph, the NSManagedObjectContext makes a strong reference to the modified object (because Swift is reference-counted, the distinction between strong and weak references is important). When it comes time to save, the NSManagedObjectContext knows what to save.<br />
<br />
However, in Entity Framework, the DbContext is magical. The application needs to subclass DbContext, and the subclass needs to have <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.dbset-1?view=efcore-2.1">DbSet</a> properties. These DbSets refer to the various tables in the database. When the DbContext’s constructor is run, it <a href="https://github.com/aspnet/EntityFrameworkCore/blob/master/src/EFCore/Internal/DbSetFinder.cs">uses reflection</a> to inspect itself, find all the DbSet properties, and inspect the generic type argument to determine the data model. It builds up a <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.modelbuilder?view=efcore-2.1">Microsoft.EntityFrameworkCore.ModelBuilder</a>, and lets you make any last-minute changes you want inside <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.dbcontext.onmodelcreating?view=efcore-2.1#Microsoft_EntityFrameworkCore_DbContext_OnModelCreating_Microsoft_EntityFrameworkCore_ModelBuilder_">DbContext.OnModelCreating()</a>.<br />
<br />
<h2>
Objects</h2>
<br />
In Core Data, each object in the object graph is represented by <a href="https://developer.apple.com/documentation/coredata/nsmanagedobject">NSManagedObject</a>. This object acts like a dictionary; you can “set properties” by using the <a href="https://developer.apple.com/documentation/foundation/object_runtime/nskeyvaluecoding">Key-Value Coding</a> functions value(forKey:) and setValue(_:forKey:). You can get better type-safety if you subclass NSManagedObject for each of your entities and add typed properties. However, if you do this, you have to make sure that getting/setting these properties calls the Key-Value Coding methods on the inner NSManagedObject. Swift has a helpful keyword, @NSManaged, which does this for you. Further, Xcode will even generate the subclass for you at compilation time, with the appropriately typed @NSManaged properties, if you select the appropriate value for “Codegen” in the right sidebar, with the entity selected. (Or you can use the managedObjectClassName string property on NSEntityDescription when building the NSManagedObjectModel, and Core Data will construct this class at runtime using the <a href="https://developer.apple.com/documentation/objectivec">Objective-C runtime</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMDKtgFzBn1AAcMhlxaJXBKNTAt3GOjuWXiL7nUimqrPpXlORZ18AaBq66jrp9zLoM5X42yV12mZZcAhfTitZIkNOmVZeCmUIPe79ca41fWZKOzwRpF5CD7w1tYHsUCMSOrJe0QLIPw8rL/s1600/Screen+Shot+2018-09-09+at+3.27.25+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="511" data-original-width="260" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMDKtgFzBn1AAcMhlxaJXBKNTAt3GOjuWXiL7nUimqrPpXlORZ18AaBq66jrp9zLoM5X42yV12mZZcAhfTitZIkNOmVZeCmUIPe79ca41fWZKOzwRpF5CD7w1tYHsUCMSOrJe0QLIPw8rL/s320/Screen+Shot+2018-09-09+at+3.27.25+PM.png" width="162" /></a></div>
<br />
<br />
NSManagedObjects know which context they belong to; their initializer requires you to pass in the context. This is presumably so that, when values get modified, the NSManagedObject can notify the NSManagedObjectContext.<br />
<br />
In Entity Framework, each object in the object graph is just a regular object. No subclassing required, and no manifest or custom model creation code either. The DbContext learns about the object’s shape from reflection. This means that the ChangeTracker in the DbContext doesn’t automatically know about changes; instead, it has to call DetectChanges(), which <a href="https://github.com/aspnet/EntityFrameworkCore/blob/release/2.2/src/EFCore/ChangeTracking/Internal/ChangeDetector.cs">iterates</a> through the known objects. This happens automatically whenever it’s required.<br />
<br />
<h2>
Connection Between Classes and Data</h2>
<br />
In Core Data, when the system wants to populate a property of an object, it can do so dynamically, because property accesses are funneled through value(forKey:) and setValue(_:forKey:). This way, the caller doesn’t have to know the name of the field at compilation time, which is required when the data model is created at runtime.<br />
<br />
However, in Entity Framework, objects are just regular classes. This is a problem, though; how can Entity Framework set the correct property on the class when the name of the property is only known at runtime (because the model can be modified at runtime)? Well, it turns out it uses <a href="https://docs.microsoft.com/en-us/dotnet/standard/using-linq">Linq</a> to <a href="https://github.com/aspnet/EntityFrameworkCore/blob/master/src/EFCore.Relational/Query/ExpressionVisitors/Internal/MaterializerFactory.cs">build</a> a program at runtime that <a href="https://docs.microsoft.com/en-us/dotnet/api/system.linq.expressions.expression.makememberaccess?view=netcore-2.1">can set</a> properties that are only known at runtime. This is extremely powerful; it looks like you can use Linq to write almost anything that you could write in C#.<br />
<br />
<h2>
Data Model</h2>
<br />
In Entity Framework, the DbContext constructor uses reflection to discover the object graph. You get a chance to modify the model at runtime in DbContext.OnModelCreating(), which is called inside the DbContext’s constructor. Adding an entity to the model requires a class to match that entity; for properties, however, you <a href="https://docs.microsoft.com/en-us/ef/core/modeling/shadow-properties">can</a> have a property that is present in the model but isn’t present in the class. This is valuable for things like automatically saved date fields.<br />
<br />
In Core Data, there is a separate data file that describes the model declaratively (with the file extension .xcdatamodeld). You can edit these with a GUI inside Xcode. This file corresponds to a <a href="https://developer.apple.com/documentation/coredata/nsmanagedobjectmodel">NSManagedObjectModel</a>, which you can build at runtime instead, if you want. Then, when you bring up the Core Data stack, you can specify this model.<br />
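Building the model at runtime looks roughly like this - a sketch that uses an in-memory store so it runs without a .xcdatamodeld file (the “Blog” entity and “url” attribute names are just illustrative, echoing the earlier examples):

```swift
import CoreData

// Describe one entity, "Blog", with a single optional string attribute.
let urlAttribute = NSAttributeDescription()
urlAttribute.name = "url"
urlAttribute.attributeType = .stringAttributeType
urlAttribute.isOptional = true

let blogEntity = NSEntityDescription()
blogEntity.name = "Blog"
blogEntity.properties = [urlAttribute]

let model = NSManagedObjectModel()
model.entities = [blogEntity]

// Stand the stack up on an in-memory store and save one object.
let container = NSPersistentContainer(name: "Example", managedObjectModel: model)
container.persistentStoreDescriptions.first?.type = NSInMemoryStoreType
container.loadPersistentStores { _, error in precondition(error == nil) }

let context = container.viewContext
let blog = NSManagedObject(entity: blogEntity, insertInto: context)
blog.setValue("https://example.com", forKey: "url")
try! context.save()
```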
<br />
<h2>
Fetch Queries</h2>
<br />
In Entity Framework, the DbSet implements the <a href="https://docs.microsoft.com/en-us/dotnet/api/system.linq.iqueryable?view=netcore-2.1">IQueryable</a> interface, which represents a query node inside the Linq framework. Functions like .Where() and .OrderBy() operate on these nodes and return other nodes, letting you chain the operators up. The operators aren’t actually applied at the time you call them; instead they form a sort of retained-mode program. When you finally pull data out of the query, the runtime looks at the chain of operators and figures out how best to apply it (usually by creating SQL that matches the operation). Some of the operations may need to be applied on the client; this works transparently, but it obviously isn’t great for performance.<br />
<br />
Core Data uses the same sort of thing, encapsulated by NSPredicate and NSExpression. NSExpression is the same kind of node inside a retained-mode program. These are quite powerful; you <a href="https://developer.apple.com/documentation/foundation/nsexpression/1412905-init">can</a> even call arbitrary selectors on arbitrary objects. The big difference between this and Linq is that, in true Objective-C style, NSExpression isn’t typed, but Linq is typed.<br />
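A small sketch of the NSPredicate side (the format-string syntax is untyped, in contrast to Linq; a predicate can be attached to a fetch request, or evaluated directly against any key-value-coding-compliant object - here, a dictionary):

```swift
import Foundation

// NSPredicate is a retained-mode query node: building it executes nothing.
let predicate = NSPredicate(format: "url CONTAINS[c] %@", "example")

// Attached to a fetch request it would become part of the generated SQL,
// but it can also be evaluated in-process against anything KVC-compliant.
// CONTAINS[c] is a case-insensitive substring match.
let matches = predicate.evaluate(with: ["url": "https://EXAMPLE.com/blog"] as NSDictionary)
```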
<br />
<h2>
Parallelism</h2>
<br />
Both Entity Framework and Core Data’s contexts are single-threaded, which means the managed objects all have to live on the same thread as their context. However, fetches and stores involve round trips to databases, which can be quite slow and would block the main thread. Entity Framework gets around this by providing Async versions of the fetching / saving functions. In this model, the objects live on the main thread, but the UI can still be redrawn during the slow database operations.<br />
<br />
Core Data has two approaches to this. One way is to host the entire Core Data object graph in another thread. You get this if the NSManagedObjectContext is initialized with the <a href="https://developer.apple.com/documentation/coredata/nsmanagedobjectcontext/1506709-init">concurrencyType</a> argument set to <a href="https://developer.apple.com/documentation/coredata/nsmanagedobjectcontextconcurrencytype/privatequeueconcurrencytype">.privateQueueConcurrencyType</a>. If you do this, the NSManagedObjectContext will create its own private queue, and operations on the NSManagedObjectContext are only valid from that queue. You run code on that queue by using NSManagedObjectContext’s <a href="https://developer.apple.com/documentation/coredata/nsmanagedobjectcontext/1506578-perform">perform(_:)</a> function. Inside the callback, you can execute your fetch requests, build up some data, and post a message back to the main queue with your data (but not with NSManagedObjects!).<br />
<br />
Alternatively, you can stay on the main queue and use <a href="https://developer.apple.com/documentation/coredata/nsasynchronousfetchrequest">NSAsynchronousFetchRequest</a> to fetch objects asynchronously. As far as I can tell, there is no equivalent call for NSManagedObjectContext.save(), and from my sampling, it appears that NSManagedObjectContext.save() is synchronous (though perhaps it doesn't have to be?).<br />
<br />
Entity Framework:<br />
<code>var blog = await db.Blogs.FirstAsync();<br />
blog.Url = url;<br />
await db.SaveChangesAsync();</code><br />
<br />
Core Data:<br />
<code>let fetchRequest = Blog.fetchRequest() as NSFetchRequest<br />
fetchRequest.fetchLimit = 1<br />
let asynchronousFetch = NSAsynchronousFetchRequest(fetchRequest: fetchRequest) { (result) in<br />
let blog = result.finalResult![0]<br />
blog.url = url<br />
do {<br />
try context.save()<br />
} catch {<br />
…<br />
}<br />
}<br />
try context.execute(asynchronousFetch)</code><br />
<br />
Edit: The <a href="https://developer.apple.com/videos/play/wwdc2012/214/">Core Data Best Practices</a> video from 2012 describes how you can achieve asynchronous saves by using a parent/child NSManagedObjectContext pair. You set the child to live on one thread and the parent to live on the other thread, and when you tell the child to save, it will just push its changes to the other context on the other thread. Then you can asynchronously tell the other thread to save by using perform(_:).<br />
<br />
<h2>
Migrations</h2>
<br />
In Entity Framework, a migration is modeled as a chunk of code. However, this code is written by one of the tools inside Microsoft.EntityFrameworkCore.Tools. The command line tool saves a snapshot of whatever the current database schema is, and can create a new schema by using the same mechanism that DbContext uses when it creates a schema at runtime. Then, after you’ve created a migration, you can apply it, which involves running the code on your local development machine to upgrade the database to the new version. These tools even have <a href="https://docs.microsoft.com/en-us/ef/core/miscellaneous/cli/powershell">documentation</a>. You have a chance to fine-tune the migration by editing the source code the tool created, because creating the migration code and applying it to the database are two distinct steps. Because the migration is generated code, you can run it in your app instead of on your local development machine.<br />
<br />
But wait, not so fast! The command-line tools use reflection on your code to generate a model? Yep. That means the command-line tools build your source code. Then they look in the built code for the new model. If the command-line tools are supposed to perform the migration, then they’re supposed to connect to the database, too. But wait, how do they connect to the database? Well, your source code connects to the database … and the command-line tools will just run that code. The <a href="https://docs.microsoft.com/en-us/ef/core/miscellaneous/cli/dbcontext-creation">documentation</a> describes which functions / classes the tools will look for in your code and run on your local machine.<br />
<br />
Core Data handles migrations totally differently. Some simple migrations can happen automatically, right when you open the database (and you can check whether your change is “simple” by using a class function on <a href="https://developer.apple.com/documentation/coredata/nsmappingmodel">NSMappingModel</a>). But more complicated migrations are described declaratively in a .xcmappingmodel file, which Xcode lets you edit with a GUI. The expressions are described by strings, which (presumably) are the same strings that NSExpression accepts. This file corresponds to an NSMappingModel, which you can also construct at runtime instead of loading from a bundle. Then, when you want to run the migration, you can use <a href="https://developer.apple.com/documentation/coredata/nsmigrationmanager">NSMigrationManager</a> and pass in the NSMappingModel you want it to use. (One gotcha: to create a .xcmappingmodel in Xcode, it has to be between two different versions of the same model. You can create a new version of a model by selecting Editor -> Add Model Version.)<br />
<br />
<h2>
Configuring the Database</h2>
<br />
The constructor to Microsoft.EntityFrameworkCore.DbContext requires a <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.dbcontextoptions?view=efcore-2.1">Microsoft.EntityFrameworkCore.DbContextOptions</a>, which is built by a Microsoft.EntityFrameworkCore.DbContextOptionsBuilder. C# has a nifty feature (extension methods) where you can declare a static method outside a class but mark its first parameter with the “this” keyword, and that method will appear as if it were defined inside the parameter’s class. So, the individual database package <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.sqlitedbcontextoptionsbuilderextensions.usesqlite?view=efcore-2.1">adds a function</a> to the DbContextOptionsBuilder. (I haven’t investigated what the package does inside this function.) Then, the client code calls optionsBuilder.UseSqlite(connectionString), for example. You can use Microsoft.Data.Sqlite.SqliteConnectionStringBuilder to build the connection string. You do this inside the <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.dbcontext.onconfiguring?view=efcore-2.1">DbContext.OnConfiguring()</a> function so the command-line tools know how to configure the database.<br />
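The shape of this pattern can be imitated in Python, as a hedged sketch (C# extension methods have no direct Python equivalent, and the class and function names below are made up to mirror the C# ones): a free function that takes the builder as its first argument, attached to the class after the fact.<br />

```python
# Hypothetical sketch of the extension-method pattern. Names imitate the
# C# API but nothing here is the real EF Core library.
class DbContextOptionsBuilder:
    def __init__(self):
        self.options = {}

# A "free function" whose first parameter is the builder...
def use_sqlite(builder, connection_string):
    builder.options["provider"] = "sqlite"
    builder.options["connection_string"] = connection_string
    return builder

# ...grafted onto the class, so call sites read like a method call,
# just as UseSqlite() appears to be a member of the C# builder.
DbContextOptionsBuilder.use_sqlite = use_sqlite

builder = DbContextOptionsBuilder()
builder.use_sqlite("Data Source=blogging.db")
print(builder.options["provider"])  # sqlite
```

The point of the C# feature is the same as this monkey-patch: the database package, not the core framework, supplies the configuration call, yet client code reads as if the builder always had it.<br />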
<br />
Core Data works differently. Each persistent store is described via a NSPersistentStoreDescription, which includes a string “type” property. This “type” refers to the registeredStoreTypes registry inside NSPersistentStoreCoordinator, which can be extended with additional subclasses of NSPersistentStore. There are also 4 <a href="https://developer.apple.com/documentation/coredata/nspersistentstorecoordinator/persistent_store_types">built-in strings</a> for well-known database types.Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com2tag:blogger.com,1999:blog-8778351438463999796.post-68980988613794930722018-08-25T15:00:00.000-07:002018-08-25T15:05:06.010-07:00Development Analogy<style>
td,th {
border: 1px solid white;
}
</style>
<table>
<tbody>
<tr><th>iOS</th><th>Windows</th></tr>
<tr><td>Swift</td><td>C#</td></tr>
<tr><td>Metal</td><td>Direct3D 12</td></tr>
<tr><td>Core Animation</td><td>DirectComposition</td></tr>
<tr><td>Core Graphics</td><td>Direct2D</td></tr>
<tr><td>WebKit</td><td>EdgeHTML</td></tr>
<tr><td>WKWebView</td><td>Windows.UI.Xaml.Controls.WebView</td></tr>
<tr><td>JavaScriptCore</td><td>Chakra</td></tr>
<tr><td>Core Text</td><td>DirectWrite</td></tr>
<tr><td>Core Data</td><td>EntityFramework</td></tr>
<tr><td>XMLParser</td><td>Windows.Data.Xml.Dom</td></tr>
<tr><td>JSONSerialization</td><td>Windows.Data.Json</td></tr>
<tr><td>.xib</td><td>.xaml</td></tr>
<tr><td>.dylib</td><td>.dll</td></tr>
<tr><td>dlopen/dlsym</td><td>LoadLibrary/GetProcAddress</td></tr>
<tr><td>UISplitViewController</td><td>Windows.UI.Xaml.Controls.SplitView</td></tr>
<tr><td>UITextField</td><td>Windows.UI.Xaml.Controls.TextBox</td></tr>
<tr><td>URLSession</td><td>Windows.Web.Http.HttpClient</td></tr>
<tr><td>Bundle</td><td>Windows.ApplicationModel.Package</td></tr>
<tr><td>Xcode</td><td>Visual Studio</td></tr>
</tbody></table>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com1tag:blogger.com,1999:blog-8778351438463999796.post-19454742157333275692017-07-31T14:27:00.000-07:002017-08-02T01:37:14.419-07:00Wide and Deep Color in Metal and OpenGL<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-text-stroke-width: initial; font-size: 11pt;">“Wide Color” and “Deep Color” refer to different things. A color space is “wide” if it has a gamut that is bigger than sRGB. “Gamut” roughly corresponds to how saturated a representable color can be: the wider the color space, the more saturated the colors it can represent.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">“Deep color” refers to the number of representable values in a particular encoding of a color space. An encoding of a color space is “deep” if it has more than 2^24 representable values.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">Consider widening a color space without making it deeper. In this situation, you have the same number of representable colors, but these individual points are being stretched farther apart. Therefore, the density of representable colors decreases. This is a problem because it means that our eyes might be able to distinguish between adjacent colors with a higher granularity than the granularity at which they are represented. This commonly leads to “banding,” where what should be a smooth gradient of color over an area appears to our eyes as having stripes of individual colors.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">Consider deepening a color space without making it wider. In this situation, you are squeezing more and more points within the same volume of colors, making the density of these points increase. Now, adjacent points may be so close that our eyes cannot distinguish them. This results in image quality that isn’t any better, but the amount of information required to store the image is higher, resulting in wasted space.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">The trick is to do both at once. Widening the gamut, and increasing the number of representable values within that gamut, keeps the density of points roughly equivalent. More information is required to store the image, and the image looks more vibrant to our eyes.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<h2>
<span style="-webkit-font-kerning: none; font-size: 11pt;">OpenGL</span></h2>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">Originally, OpenGL itself didn’t specify what color space its result pixels are in. At the time it was created, this meant that by default, the results were interpreted as sRGB. However, sRGB is a non-linear color space, which means that math on pixel values is meaningless. Unfortunately, alpha blending is math on pixel values, which meant that, by default, blend operations (and all math done in pixel shaders, unless explicitly corrected by the shader author) were broken.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">One solution is to simply make the operating system interpret the pixel results as “linear sRGB.” Indeed, macOS lets you do this by setting the <a href="https://developer.apple.com/documentation/quartzcore/caopengllayer/1521873-colorspace">colorSpace property</a> of an NSWindow or CAOpenGLLayer. Unfortunately, this doesn’t give good results, because these pixel results are in 24-bit color, and all of the representable colors should be (roughly) perceptually equidistant from each other. Our eyes, though, are better at perceiving color differences in low light, which means that dark colors need a higher density of representable values than bright colors. In “linear sRGB,” however, the density of representable values is constant, so we actually don’t have enough definition for dark colors to look good. Increasing the density of representable values would solve the problem for dark colors, but it would make bright colors waste information. (This extra information would probably cost GPU bandwidth, which would probably be fine for just displaying the image on a monitor, but not all GPUs support rendering to > 24-bit color…)</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">So the colors in the framebuffer need to be in regular sRGB, not linear sRGB. But this means that blending math is meaningless! OpenGL solved this by creating an <a href="https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_texture_sRGB.txt">extension</a>, EXT_texture_sRGB (which later got promoted to be part of OpenGL Core), which says “whenever you want to perform blending, read the sRGB destination color from the framebuffer, convert it to a float, linearize it, perform the blend, delinearize it, convert it back to 24-bit color, and store it to the framebuffer.” This way, the final results are always in sRGB, but the blending is done in linear space. This ugly processing only happens on the framebuffer color, not on the output of the fragment shader, so your fragment shader can assume that everything is in linear space, and any math it performs will be meaningful.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">The trigger to perform this processing is a special format for the framebuffer (so it’s an opt-in feature). Now, in OpenGL, the default framebuffer is not created by OpenGL. Instead, it is created by the Operating System and handed as-is to OpenGL. This means that you have to tell the OS, not OpenGL, to create a framebuffer with one of these special formats. On iOS, you do this by setting the drawableColorFormat of the GLKView. Note that opting in to sRGB is not orthogonal to using other formats - only certain formats are compatible with the sRGB processing.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">On iOS, as far as I can tell, OpenGL does not support wide or deep color (because you can’t tell the OS how to interpret the pixel results of OpenGL like you can on macOS - all OpenGL pixels are assumed to be in sRGB). <a href="https://developer.apple.com/documentation/quartzcore/caeagllayer">CAEAGLLayer</a> doesn't have a "colorSpace" property. <a href="https://developer.apple.com/documentation/glkit/glkviewdrawablecolorformat">I can’t find any extended-range formats.</a></span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<h2>
<span style="-webkit-font-kerning: none; font-size: 11pt;">Metal</span></h2>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">On iOS, Metal supports the same sort of sRGB / non-sRGB formats that OpenGL does. You can set the MTKView’s <a href="https://developer.apple.com/documentation/metalkit/mtkview/1535940-colorpixelformat">colorPixelFormat</a> to one of the sRGB formats, which has the same effect as it does in OpenGL. Setting it to a non-sRGB format means that blending is performed on the encoded values as-is, which is broken; the sRGB formats, however, perform the correct linearization / delinearization for sRGB.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">iOS doesn’t support the same sort of color space annotation that macOS does. In particular, a UIWindow or a CALayer doesn’t have a “colorspace” property. Because of this, all colors are expected to be in sRGB. For non-deep and non-wide color, using the regular sRGB pixel formats is sufficient, and these will clamp to the sRGB gamut (meaning clamped between 0 and 1).</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">And then wide color came along. As noted earlier, wide color and deep color need to happen together, so they aren’t controllable independently. However, there is a conundrum: because the programmer can’t annotate a particular layer with the color space its values should be interpreted in, how do you represent colors outside of sRGB? The solution is for the color space to be extended beyond the 0 - 1 range. This way, colors within 0 - 1 are interpreted as sRGB, as they always have been, while colors outside that range represent the new, wider colors. It’s important to note that, because the new gamut completely contains sRGB, values must be able to be negative as well as greater than 1. A completely saturated red in the display’s native color space (which is similar to P3) has negative components for green and blue.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">The mechanism for enabling this is similar to OpenGL: you select a new special pixel format. The new pixel formats have “_XR” in their name, for “extended range.” These formats aren’t clamped to 0 - 1. sRGB also applies here; the new extended range pixel formats have sRGB variants, which perform a similar gamma function as before in OpenGL. This gamma function is <a href="https://developer.apple.com/documentation/coregraphics/kcgcolorspaceextendedsrgb?language=objc">extended</a> (in the natural way) to values greater than 1. For values less than 0, the gamma curve is flipped through the origin so that it curves downward (making it an <a href="https://en.m.wikipedia.org/wiki/Even_and_odd_functions">“odd” function</a>).</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"><br /></span>
<span style="-webkit-font-kerning: none; font-size: 11pt;">Using these new pixel formats causes your colors to go from 8 bits per channel to 10 bits per channel. The new 10 bits per channel colors are now signed (because they can go < 0), which means that there are 4 times as many representable values, and half of them are below 0, so the number of positive representable values doubled. In a non-sRGB variant, the maximum value is just around 2, but in an sRGB variant, the maximum value is greater than 2 because of the gamma curve.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">On macOS, there is a way to explicitly tell the system how to interpret the color values in an NSWindow or a CALayer using the <a href="https://developer.apple.com/documentation/quartzcore/cametallayer/1478170-colorspace">colorspace property.</a> This works because there is a secondary pass which converts the pixels into the color space of the monitor. (Presumably iOS omits this pass for performance, leading to the restriction on which color spaces a pixel value can be represented in.) Therefore, to output colors using P3, simply assign the appropriate color space value to the CALayer you are using with Metal. If you do this, remember that “1.0” doesn’t represent sRGB’s 1.0; instead, it represents the most saturated color in the new color space. If you don’t also change your rendering code to compensate for this, your colors will be stretched across the gamut, leading to oversaturated colors and ugly renderings. You can solve this by setting the color space to the <a href="https://developer.apple.com/documentation/coregraphics/kcgcolorspaceextendedsrgb?language=objc">new “Extended sRGB”</a> color space of CGColor, which will give you the same rendering as iOS (while allowing values > 1.0). Note that if you do this, you can’t render to an integer pixel format, because those are clipped at 1.0; instead, you’ll have to render to a floating-point pixel format so that you can have values > 1.0.</span></div>
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal; min-height: 13.1px;">
<span style="-webkit-font-kerning: none; font-size: 11pt;"></span><br /></div>
<br />
<div style="-webkit-text-stroke-color: rgb(0, 0, 0); -webkit-text-stroke-width: initial; font-family: 'Helvetica Neue'; font-size: 11px; line-height: normal;">
<span style="-webkit-font-kerning: none; font-size: 11pt;">So, on iOS, you have one switch which turns on both deep color and wide color, and on macOS, you have two switches, one of which turns on wide color and one of which turns on deep color.</span></div>
Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-33051704864860604632017-05-28T22:45:00.003-07:002017-05-28T23:11:12.595-07:00Chromaticity DiagramsHumans can see light with wavelengths between around 380 nm and 780 nm. We see many photons at a time, and we recognize the collection of photons as a particular color. Each photon has a wavelength, so a particular color is determined by how much power the photons carry at each visible wavelength. Put another way, a color is a distribution of power across the visible wavelengths of light. For example, if 1/3 of your power is at 700 nm and 2/3 of your power is at 400 nm, the color is a deep purple. This color can be described by the following function of wavelength:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFKezAG6DeLmvFPdWj3pQ7G8MtRut6iLmvIZ6U6rRk3K_2trX8r6v7TRk7gby5EDGea9VJUeGQUyKZ8Kd9yxR6LvXJu-LXRAFuG_qbdzM631vBrTnvHRgGHbU1aK6iYsl5DxIQlILDF67F/s1600/Screen+Shot+2017-05-28+at+7.51.11+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="758" data-original-width="1076" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFKezAG6DeLmvFPdWj3pQ7G8MtRut6iLmvIZ6U6rRk3K_2trX8r6v7TRk7gby5EDGea9VJUeGQUyKZ8Kd9yxR6LvXJu-LXRAFuG_qbdzM631vBrTnvHRgGHbU1aK6iYsl5DxIQlILDF67F/s320/Screen+Shot+2017-05-28+at+7.51.11+PM.png" width="320" /></a></div>
<br />
<br />
Different curves over this domain represent different colors. Here is the curve for daylight:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ81lyP1-mKDQWCcXCMT2w0iMRG11kRcBrstxO37_wHOu7OwiO4xtTh7hq9bYcvMakYneq3re_0bJKUH7uhqP1JY4AD3os7HJbWSVABrWJZf7ax6vUNqC1JgdWy1uD6JrCwym-0in8aGeh/s1600/unnamed.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="410" data-original-width="591" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ81lyP1-mKDQWCcXCMT2w0iMRG11kRcBrstxO37_wHOu7OwiO4xtTh7hq9bYcvMakYneq3re_0bJKUH7uhqP1JY4AD3os7HJbWSVABrWJZf7ax6vUNqC1JgdWy1uD6JrCwym-0in8aGeh/s320/unnamed.jpg" width="320" /></a></div>
<br />
<br />
So, if we want to represent a color, we can describe the power function over the domain of visible wavelengths. However, we can do better if we include some biology.<br />
<br />
<h2>
Biology</h2>
We have three types of cells (called “cones”) in our eyes which react to light. Each of the three kinds of cones is sensitive to different wavelengths of light. Cones only exist in the center of our eye (the “fovea”) and not in our peripheral vision, so this model is only accurate for the colors we are directly looking at. Here is a graph of the sensitivities of the three kinds of cones:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS1f0DddHdXfb3C4oaf8G9XFk8jv5vQWCZFIXrp-xgag9yK-OCAMIksBUzjNQTIlTOK3kjD5E5APsZvvfnGS2GrnzwoAk4LXwi68yX7H2YcnwZWyJtc5GsM7-63EU9f0SI7uhLE8geMRME/s1600/Screen+Shot+2017-05-28+at+7.56.16+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="920" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS1f0DddHdXfb3C4oaf8G9XFk8jv5vQWCZFIXrp-xgag9yK-OCAMIksBUzjNQTIlTOK3kjD5E5APsZvvfnGS2GrnzwoAk4LXwi68yX7H2YcnwZWyJtc5GsM7-63EU9f0SI7uhLE8geMRME/s320/Screen+Shot+2017-05-28+at+7.56.16+PM.png" width="320" /></a></div>
<br />
<br />
Here, you can see that the “S” cones are mostly sensitive to light at around 430 nm, but they still respond to light within a window of about 75 nm around it. You can also see that if all the light entering your eye is at 540 nm, the M cones will respond the most, the L cones will respond strongly (but not as much as the M cones), and the S cones will respond almost not at all.<br />
<br />
This means that the entire power distribution of light entering your eye is encoded as three values on its way to your brain. There are significantly fewer degrees of freedom in this encoding than there are in the source frequency distribution. This means that information is lost. Put another way, there are many different frequency distributions which get encoded the same way by your cones’ response.<br />
<br />
This is actually a really interesting finding. It means we can represent a color by three values instead of a whole function across the frequency spectrum, which is a significant space savings. It also means that if you have two colors which appear to “match,” their frequency distributions may not match, so if you modify both colors in the same way, they may cease to match.<br />
<br />
If you think about it, though, this is the principle that computer monitors and TVs use. They have phosphors in them which emit light at a particular frequency. When we watch TV, the frequency diagram of the light we are viewing contains three spikes at the three frequencies of phosphors. However, the images we see appear to match our idea of nature, which is represented by a much more continuous and flat frequency diagram. Somehow, the images we see on TV and the images we see in nature match.<br />
<br />
<h2>
Describing color</h2>
So a color can be represented by a triple of numbers: (response strength of S cones, response strength of M cones, response strength of L cones). Together, these three values can represent every color we can perceive.<br />
<br />
It would be great if we could simply represent a color by the response strength of each of the particular kinds of cones in our eyes; however, this is difficult to measure. Instead, let’s pick frequencies of light which we can easily produce. Let’s also select these frequencies such that they correspond as well as possible to each of the three cones. By varying the power these lights produce, we should be able to produce many of these triples, and therefore many of the colors we can see.<br />
<br />
In 1931, two experiments attempted to “match” colors using lights of 435.8 nm, 546.1 nm, and 700 nm (let’s call them the “blue,” “green,” and “red” lamps). The first two wavelengths are easily created using mercury vapor tubes, and correspond to the S and M cones, respectively. The last frequency corresponds to the L cones and, though it isn’t easily created with mercury, is insensitive to small errors because the L cones’ response curve is close to flat in this neighborhood.<br />
<br />
So, which colors should be matched? Every color can be decomposed into a collection of power values at particular frequencies. Therefore, if we could find a way to match every frequency in the range observable by humans, this data would be sufficient to match any color. For example, if you have a color with a peak at 680 nm and a peak at 400 nm, and you know that 680 nm light corresponds to lamp powers of (a, b, c) and 400 nm corresponds to lamp powers of (d, e, f), then (a + d, b + e, c + f) should match your color.<br />
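This additivity can be sketched directly: if each spectral component of a color has a known triple of lamp powers, the match for the whole color is the component-wise sum. The numbers below are made-up illustrative values, not real matching data.

```javascript
// Additivity of color matches: a color made of several spectral components
// is matched by the component-wise sum of each component's lamp powers.
// Each triple is [red, green, blue] lamp power; the values are hypothetical.
function combineMatches(matches) {
  return matches.reduce(
    (sum, m) => [sum[0] + m[0], sum[1] + m[1], sum[2] + m[2]],
    [0, 0, 0]
  );
}

// Hypothetical matches for the 680 nm and 400 nm components:
const match680 = [0.9, 0.1, 0.0]; // (a, b, c) in the text
const match400 = [0.0, 0.0, 0.8]; // (d, e, f) in the text
const combined = combineMatches([match680, match400]);
console.log(combined); // [0.9, 0.1, 0.8], i.e. (a + d, b + e, c + f)
```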
<br />
This was performed by two people, Guild and Wright, using 7 and 10 observers, respectively (and averaging the results). They went through every frequency of light in the visible range, and found how much power each of the lamps had to emit in order to match that frequency’s color.<br />
<br />
However, they found something a little upsetting. Consider the challenge of matching light at a wavelength of 510 nm. At this wavelength, we can see that the S cones respond near 0, and that the M cones respond maybe 20% more than the L cones. So, we are looking for how much power our primaries should emit to construct this same response in our cones.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdQ6t5W7S1kZ_Gi_wCDwxAkiJpxAI4oV5GDkCwakqBvm71R3KZPyx0rUuZ9BOy62hijeGCDSOdFZ8139YlQUJGFpMUQgtf0xHlAdo4r0Um-kTMh8ZJRCSrb_kQ99r3GclbyyVJNUn2JoTy/s1600/Screen+Shot+2017-05-28+at+7.56.16+PM+copy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="700" data-original-width="920" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdQ6t5W7S1kZ_Gi_wCDwxAkiJpxAI4oV5GDkCwakqBvm71R3KZPyx0rUuZ9BOy62hijeGCDSOdFZ8139YlQUJGFpMUQgtf0xHlAdo4r0Um-kTMh8ZJRCSrb_kQ99r3GclbyyVJNUn2JoTy/s320/Screen+Shot+2017-05-28+at+7.56.16+PM+copy.png" width="320" /></a></div>
(The grey bars are our primaries, and the light blue bar is our target)<br />
<br />
Our primaries lie at 435.8 nm, 546.1 nm, and 700 nm. So, the blue lamp should be at or near 0; so far so good. If we select a power for the green lamp which gives us the correct M cone response, we find that it causes too high an L cone response (because of how the cones’ response curves overlap). Adding more of the red light only makes the problem worse. Therefore, because the cones overlap, it is impossible to achieve a color match at this wavelength using these primaries.<br />
<br />
The solution is to subtract the red light instead of adding it. The reason we couldn’t find a match before is that our green lamp added too much L cone response. If we could remove some of the L cone’s response, our color would match. We can do this by, instead of matching against the 510 nm light alone, matching against the sum of the 510 nm light plus some of our red lamp. This has the effect of subtracting out some of the L cone response, and lets us match our color.<br />
<br />
Using this approach, we can construct a graph, where for each wavelength, the three powers of the three lights are plotted. It will include negative values where matches would otherwise be impossible.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX4XNHUG7_TNfq73k4beWDbWGBPzlFp68D-Yq4TtnQ3lP0Z5Cv0HD9Bm0qAmtZT8yqrJo2aZ2eL-4ne-6F2INL9IsGiaEXwjCWLNPWTm81-JQO-jdKCNIHkGpf4lZnWQNrTVs1W2j2yI_p/s1600/Screen+Shot+2017-05-28+at+8.18.54+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="816" data-original-width="1284" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX4XNHUG7_TNfq73k4beWDbWGBPzlFp68D-Yq4TtnQ3lP0Z5Cv0HD9Bm0qAmtZT8yqrJo2aZ2eL-4ne-6F2INL9IsGiaEXwjCWLNPWTm81-JQO-jdKCNIHkGpf4lZnWQNrTVs1W2j2yI_p/s320/Screen+Shot+2017-05-28+at+8.18.54+PM.png" width="320" /></a></div>
<br />
<br />
<h2>
X Y Z color space</h2>
Once we have this, we can now represent any color by a (possibly negative) triple, where each value in the triple represents the power of one of our primaries. However, the fact that these values can be negative kind of sucks. In particular, machines were created which can measure colors, but those machines would have to be more complicated if some of the values could be negative.<br />
<br />
Luckily, light obeys the rules of a vector space: addition and scalar multiplication behave consistently. Color A plus color B yields a consistent result, no matter which frequency distribution color A is represented by, and the same is true for scaling a color’s power. This means that we are actually dealing with a vector space, and a vector space can be transformed via a linear transformation.<br />
<br />
So, the power values of each of the lights at each frequency were transformed such that the resulting values were non-negative at every frequency. This new, transformed vector space is called X Y Z, and its primaries are not physically based.<br />
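A change of primaries like this is just a 3×3 matrix applied to each triple. As an illustration of the mechanics (not the historical 1931 transformation), here is a matrix-vector multiply using the linear-sRGB-to-XYZ matrix that also appears in the demo later in this post:

```javascript
// Converting a color triple between sets of primaries is a 3x3 matrix
// multiplication. This particular matrix maps linear sRGB to XYZ.
const linearSRGBToXYZ = [
  [0.4124, 0.3576, 0.1805],
  [0.2126, 0.7152, 0.0722],
  [0.0193, 0.1192, 0.9505],
];

function transform(matrix, [a, b, c]) {
  return matrix.map(row => row[0] * a + row[1] * b + row[2] * c);
}

// Linear-sRGB white (1, 1, 1) lands at the D65 white point in XYZ:
console.log(transform(linearSRGBToXYZ, [1, 1, 1])); // ≈ [0.9505, 1.0, 1.089]
```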
<br />
Given these new non-physical primaries, you can construct a similar graph. It shows, for each frequency of light, how much of each primary is necessary to represent that frequency.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTBocHov8oYeCxhdfQDRLPHPoKNZpmCSEdnxcvLJ4EKkEiTNdCZhQQjpzJEFULNVGtA4HTlQHwklGnwcq19L_PNa94HI2T7vDsiYikWiOM_fb0iWpLq2Blf-3y1rrMJdLAaMifizow-8ql/s1600/Screen+Shot+2017-05-28+at+8.14.12+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="926" data-original-width="1522" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTBocHov8oYeCxhdfQDRLPHPoKNZpmCSEdnxcvLJ4EKkEiTNdCZhQQjpzJEFULNVGtA4HTlQHwklGnwcq19L_PNa94HI2T7vDsiYikWiOM_fb0iWpLq2Blf-3y1rrMJdLAaMifizow-8ql/s320/Screen+Shot+2017-05-28+at+8.14.12+PM.png" width="320" /></a></div>
<br />
<br />
<h2>
Chromaticity graphs</h2>
So, for each frequency of light, we have an associated (X, Y, Z) triple. Let’s plot it on a 3-D graph!<br />
<br />
<div>
<input checked="" id="axes" type="checkbox" /><label for="axes">Axes</label></div>
<div>
<input checked="" id="xyz" type="checkbox" /><label for="xyz">XYZ Triplets</label></div>
<div>
<input checked="" id="xyzProjection" type="checkbox" /><label for="xyzProjection">Projection of XYZ Triplets onto X+Y+Z=1 plane</label></div>
<div>
<input id="srgbCube" type="checkbox" /><label for="srgbCube">sRGB Cube</label></div>
<div>
<input checked="" id="srgbProjection" type="checkbox" /><label for="srgbProjection">Projection of sRGB Primaries onto X+Y+Z=1 plane</label></div>
<div>
<input id="p3Projection" type="checkbox" /><label for="p3Projection">Projection of DCI-P3 Primaries onto X+Y+Z=1 plane</label></div>
<canvas height="400" id="canvas" width="400"></canvas>
<script id="vertexShader" type="x-shader/x-vertex">
uniform mat4 modelViewProjectionMatrix;
attribute vec3 position;
void main() {
gl_Position = modelViewProjectionMatrix * vec4(position, 1);
}
</script>
<script id="projectionVertexShader" type="x-shader/x-vertex">
uniform mat4 modelViewProjectionMatrix;
attribute vec3 position;
void main() {
vec3 projectedPosition = position / (position.x + position.y + position.z);
gl_Position = modelViewProjectionMatrix * vec4(projectedPosition, 1);
}
</script>
<script id="sRGBVertexShader" type="x-shader/x-vertex">
uniform mat4 modelViewProjectionMatrix;
varying vec3 xyzCoordinates;
attribute vec3 position;
void main() {
xyzCoordinates = position;
gl_Position = modelViewProjectionMatrix * vec4(position, 1);
}
</script>
<script id="fragmentShader" type="x-shader/x-fragment">
precision mediump float;
uniform vec4 color;
void main() {
gl_FragColor = color;
}
</script>
<script id="sRGBFragmentShader" type="x-shader/x-fragment">
precision mediump float;
varying vec3 xyzCoordinates;
void main() {
vec3 column0 = vec3(3.2406, -0.9689, 0.0557);
vec3 column1 = vec3(-1.5372, 1.8758, -0.2040);
vec3 column2 = vec3(-0.4986, -0.0415, 1.0570);
mat3 conversionMatrix = mat3(column0, column1, column2);
vec3 linearSRGBColor = conversionMatrix * xyzCoordinates;
// The triangle we are rendering is the projection of the corners of the sRGB cube onto the X + Y + Z = 1 plane, which
// totally could be outside sRGB. The default behavior is to clamp each component individually, but we don't want that.
// Instead, we want to scale the result until we are in-gamut.
vec3 colorForDrawing = linearSRGBColor / (linearSRGBColor.x + linearSRGBColor.y + linearSRGBColor.z);
// "Conventionally, OpenGL assumes framebuffer color components are stored in a linear color space."
// So we need to output linear sRGB, and then hope the OS fixes it up on its way to the monitor.
gl_FragColor = vec4(colorForDrawing, 0.85);
}
</script>
<script>
function visualizeXYZ(canvas, vertexShaderElement, projectionVertexShaderElement, sRGBVertexShaderElement, fragmentShaderElement, sRGBFragmentShaderElement) {
var data = [
0.0014, 0.0000, 0.0065,
0.0022, 0.0001, 0.0105,
0.0042, 0.0001, 0.0201,
0.0076, 0.0002, 0.0362,
0.0143, 0.0004, 0.0679,
0.0232, 0.0006, 0.1102,
0.0435, 0.0012, 0.2074,
0.0776, 0.0022, 0.3713,
0.1344, 0.0040, 0.6456,
0.2148, 0.0073, 1.0391,
0.2839, 0.0116, 1.3856,
0.3285, 0.0168, 1.6230,
0.3483, 0.0230, 1.7471,
0.3481, 0.0298, 1.7826,
0.3362, 0.0380, 1.7721,
0.3187, 0.0480, 1.7441,
0.2908, 0.0600, 1.6692,
0.2511, 0.0739, 1.5281,
0.1954, 0.0910, 1.2876,
0.1421, 0.1126, 1.0419,
0.0956, 0.1390, 0.8130,
0.0580, 0.1693, 0.6162,
0.0320, 0.2080, 0.4652,
0.0147, 0.2586, 0.3533,
0.0049, 0.3230, 0.2720,
0.0024, 0.4073, 0.2123,
0.0093, 0.5030, 0.1582,
0.0291, 0.6082, 0.1117,
0.0633, 0.7100, 0.0782,
0.1096, 0.7932, 0.0573,
0.1655, 0.8620, 0.0422,
0.2257, 0.9149, 0.0298,
0.2904, 0.9540, 0.0203,
0.3597, 0.9803, 0.0134,
0.4334, 0.9950, 0.0087,
0.5121, 1.0000, 0.0057,
0.5945, 0.9950, 0.0039,
0.6784, 0.9786, 0.0027,
0.7621, 0.9520, 0.0021,
0.8425, 0.9154, 0.0018,
0.9163, 0.8700, 0.0017,
0.9786, 0.8163, 0.0014,
1.0263, 0.7570, 0.0011,
1.0567, 0.6949, 0.0010,
1.0622, 0.6310, 0.0008,
1.0456, 0.5668, 0.0006,
1.0026, 0.5030, 0.0003,
0.9384, 0.4412, 0.0002,
0.8544, 0.3810, 0.0002,
0.7514, 0.3210, 0.0001,
0.6424, 0.2650, 0.0000,
0.5419, 0.2170, 0.0000,
0.4479, 0.1750, 0.0000,
0.3608, 0.1382, 0.0000,
0.2835, 0.1070, 0.0000,
0.2187, 0.0816, 0.0000,
0.1649, 0.0610, 0.0000,
0.1212, 0.0446, 0.0000,
0.0874, 0.0320, 0.0000,
0.0636, 0.0232, 0.0000,
0.0468, 0.0170, 0.0000,
0.0329, 0.0119, 0.0000,
0.0227, 0.0082, 0.0000,
0.0158, 0.0057, 0.0000,
0.0114, 0.0041, 0.0000,
0.0081, 0.0029, 0.0000,
0.0058, 0.0021, 0.0000,
0.0041, 0.0015, 0.0000,
0.0029, 0.0010, 0.0000,
0.0020, 0.0007, 0.0000,
0.0014, 0.0005, 0.0000,
0.0010, 0.0004, 0.0000,
0.0007, 0.0002, 0.0000,
0.0005, 0.0002, 0.0000,
0.0003, 0.0001, 0.0000,
0.0002, 0.0001, 0.0000,
0.0002, 0.0001, 0.0000,
0.0001, 0.00002, 0.0000,
];
function onContextLost() {
}
function onContextRestored() {
}
function crossProduct(u, v) {
return [u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]];
}
function unitVector(v) {
var length = Math.sqrt(Math.pow(v[0], 2) + Math.pow(v[1], 2) + Math.pow(v[2], 2));
return [v[0] / length, v[1] / length, v[2] / length];
}
function constructLookAtMatrix(eyePosition, centerPosition, upVector) {
var f = [centerPosition[0] - eyePosition[0], centerPosition[1] - eyePosition[1], centerPosition[2] - eyePosition[2]];
var fUnit = unitVector(f);
var upUnit = unitVector(upVector);
var s = crossProduct(fUnit, upUnit);
var sUnit = unitVector(s);
var u = crossProduct(sUnit, fUnit);
function dot(a, b) { return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]; }
// The rotation rows are the camera's basis vectors; the translation column
// is the rotated -eyePosition (i.e. -dot(basis, eye)), as in gluLookAt.
return DOMMatrix.fromMatrix({
m11: sUnit[0], m21: sUnit[1] , m31: sUnit[2] , m41: -dot(sUnit, eyePosition),
m12: u[0], m22: u[1] , m32: u[2] , m42: -dot(u, eyePosition),
m13: -fUnit[0], m23: -fUnit[1], m33: -fUnit[2], m43: dot(fUnit, eyePosition),
m14: 0 , m24: 0 , m34: 0 , m44: 1,
});
}
function constructPerspectiveMatrix(yFieldOfView, aspectRatio, nearPlaneDistance, farPlaneDistance) {
var f = 1 / Math.tan(yFieldOfView / 2);
var m11 = f / aspectRatio
var m22 = f;
var m33 = (farPlaneDistance + nearPlaneDistance) / (nearPlaneDistance - farPlaneDistance);
var m43 = (2 * farPlaneDistance * nearPlaneDistance) / (nearPlaneDistance - farPlaneDistance);
return DOMMatrix.fromMatrix({
m11: m11, m21: 0 , m31: 0 , m41: 0 ,
m12: 0 , m22: m22, m32: 0 , m42: 0 ,
m13: 0 , m23: 0 , m33: m33, m43: m43,
m14: 0 , m24: 0 , m34: -1 , m44: 1 ,
});
}
function constructModelViewProjectionMatrix(aspectRatio, angle) {
var modelToWorldMatrix = new DOMMatrix();
modelToWorldMatrix.rotateAxisAngleSelf(0, 1, 0, angle);
modelToWorldMatrix.translateSelf(-0.5, -0.5, -0.5);
var worldToCameraMatrix = constructLookAtMatrix([0, 0.15, 1], [0, 0, 0], [0, 1, 0]);
var projectionMatrix = constructPerspectiveMatrix(0.8 * (Math.PI / 2), aspectRatio, 0.1, 10);
return projectionMatrix.multiply(worldToCameraMatrix).multiply(modelToWorldMatrix);
}
function matrixToSequence(matrix) {
// Column major
return [matrix.m11, matrix.m12, matrix.m13, matrix.m14,
matrix.m21, matrix.m22, matrix.m23, matrix.m24,
matrix.m31, matrix.m32, matrix.m33, matrix.m34,
matrix.m41, matrix.m42, matrix.m43, matrix.m44];
}
var theta = 0;
var dragging = false;
var context;
var program;
var projectionProgram;
var sRGBProgram;
var vertexBufferAttribLocation;
var projectionVertexBufferAttribLocation;
var sRGBVertexBufferAttribLocation;
var axesVertexBuffer;
var dataVertexBuffer;
var sRGBVertexBuffer;
var sRGBCubeVertexBuffer;
var p3VertexBuffer;
var modelViewProjectionMatrixLocation;
var colorLocation;
var projectionModelViewProjectionMatrixLocation;
var projectionColorLocation;
var sRGBModelViewProjectionMatrixLocation;
var previousTime;
var dataVertices;
var drawAxes;
var drawXYZ;
var drawXYZProjection;
var drawSRGBCube;
var drawSRGBProjection;
var drawP3Projection;
function draw(timeDelta, aspectRatio) {
context.clear(context.COLOR_BUFFER_BIT | context.DEPTH_BUFFER_BIT);
if (!dragging)
theta += timeDelta * 0.05;
var modelViewProjectionMatrix = matrixToSequence(constructModelViewProjectionMatrix(aspectRatio, theta));
context.useProgram(program);
context.bindBuffer(context.ARRAY_BUFFER, axesVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(modelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
context.uniform4fv(colorLocation, [1, 0, 0, 1]);
if (drawAxes)
context.drawArrays(context.LINES, 0, 6);
context.bindBuffer(context.ARRAY_BUFFER, sRGBCubeVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniform4fv(colorLocation, [0, 0, 1, 1]);
if (drawSRGBCube)
context.drawArrays(context.LINES, 0, 24);
context.bindBuffer(context.ARRAY_BUFFER, p3VertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniform4fv(colorLocation, [1, 0, 1, 1]);
if (drawP3Projection)
context.drawArrays(context.LINE_LOOP, 0, 3);
context.bindBuffer(context.ARRAY_BUFFER, dataVertexBuffer);
context.vertexAttribPointer(vertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniform4fv(colorLocation, [1, 1, 1, 1]);
if (drawXYZ)
context.drawArrays(context.LINE_LOOP, 0, dataVertices);
context.useProgram(projectionProgram);
context.bindBuffer(context.ARRAY_BUFFER, dataVertexBuffer);
context.vertexAttribPointer(projectionVertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(projectionModelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
context.uniform4fv(projectionColorLocation, [0, 1, 0, 1]);
if (drawXYZProjection)
context.drawArrays(context.LINE_LOOP, 0, dataVertices);
context.useProgram(sRGBProgram);
context.bindBuffer(context.ARRAY_BUFFER, sRGBVertexBuffer);
context.vertexAttribPointer(sRGBVertexBufferAttribLocation, 3, context.FLOAT, false, 0, 0);
context.uniformMatrix4fv(sRGBModelViewProjectionMatrixLocation, false, modelViewProjectionMatrix);
if (drawSRGBProjection)
context.drawArrays(context.TRIANGLES, 0, 3);
}
function tick(time) {
if (previousTime == undefined)
previousTime = time;
var aspectRatio = canvas.clientWidth / canvas.clientHeight;
context.viewport(0, 0, canvas.clientWidth, canvas.clientHeight);
draw(time - previousTime, aspectRatio);
previousTime = time;
window.requestAnimationFrame(tick);
}
var offsetX;
var offsetY;
function onMouseDrag(event) {
var newX = event.offsetX;
var newY = event.offsetY;
var deltaX = newX - offsetX;
var deltaY = newY - offsetY;
theta += deltaX;
offsetX = newX;
offsetY = newY;
}
function onMouseUp() {
dragging = false;
canvas.removeEventListener("mousemove", onMouseDrag, false);
canvas.removeEventListener("mouseup", onMouseUp, false);
}
function onMouseDown(event) {
if (!dragging) {
dragging = true;
canvas.addEventListener("mousemove", onMouseDrag, false);
canvas.addEventListener("mouseup", onMouseUp, false);
offsetX = event.offsetX;
offsetY = event.offsetY;
}
}
function setDrawBooleans() {
drawAxes = document.getElementById("axes").checked;
drawXYZ = document.getElementById("xyz").checked;
drawXYZProjection = document.getElementById("xyzProjection").checked;
drawSRGBCube = document.getElementById("srgbCube").checked;
drawSRGBProjection = document.getElementById("srgbProjection").checked;
drawP3Projection = document.getElementById("p3Projection").checked;
}
function start() {
canvas.addEventListener("mousedown", onMouseDown, false);
context = canvas.getContext("webgl");
setDrawBooleans();
document.getElementById("axes").addEventListener("change", setDrawBooleans, false);
document.getElementById("xyz").addEventListener("change", setDrawBooleans, false);
document.getElementById("xyzProjection").addEventListener("change", setDrawBooleans, false);
document.getElementById("srgbCube").addEventListener("change", setDrawBooleans, false);
document.getElementById("srgbProjection").addEventListener("change", setDrawBooleans, false);
document.getElementById("p3Projection").addEventListener("change", setDrawBooleans, false);
canvas.addEventListener("webglcontextlost", onContextLost, false);
canvas.addEventListener("webglcontextrestored", onContextRestored, false);
var vertexShader = context.createShader(context.VERTEX_SHADER);
context.shaderSource(vertexShader, vertexShaderElement.text);
context.compileShader(vertexShader);
var compiled = context.getShaderParameter(vertexShader, context.COMPILE_STATUS);
var projectionVertexShader = context.createShader(context.VERTEX_SHADER);
context.shaderSource(projectionVertexShader, projectionVertexShaderElement.text);
context.compileShader(projectionVertexShader);
compiled = context.getShaderParameter(projectionVertexShader, context.COMPILE_STATUS);
var sRGBVertexShader = context.createShader(context.VERTEX_SHADER);
context.shaderSource(sRGBVertexShader, sRGBVertexShaderElement.text);
context.compileShader(sRGBVertexShader);
compiled = context.getShaderParameter(sRGBVertexShader, context.COMPILE_STATUS);
var fragmentShader = context.createShader(context.FRAGMENT_SHADER);
context.shaderSource(fragmentShader, fragmentShaderElement.text);
context.compileShader(fragmentShader);
compiled = context.getShaderParameter(fragmentShader, context.COMPILE_STATUS);
var sRGBFragmentShader = context.createShader(context.FRAGMENT_SHADER);
context.shaderSource(sRGBFragmentShader, sRGBFragmentShaderElement.text);
context.compileShader(sRGBFragmentShader);
compiled = context.getShaderParameter(sRGBFragmentShader, context.COMPILE_STATUS);
program = context.createProgram();
context.attachShader(program, vertexShader);
context.attachShader(program, fragmentShader);
context.linkProgram(program);
var linked = context.getProgramParameter(program, context.LINK_STATUS);
projectionProgram = context.createProgram();
context.attachShader(projectionProgram, projectionVertexShader);
context.attachShader(projectionProgram, fragmentShader);
context.linkProgram(projectionProgram);
linked = context.getProgramParameter(projectionProgram, context.LINK_STATUS);
sRGBProgram = context.createProgram();
context.attachShader(sRGBProgram, sRGBVertexShader);
context.attachShader(sRGBProgram, sRGBFragmentShader);
context.linkProgram(sRGBProgram);
linked = context.getProgramParameter(sRGBProgram, context.LINK_STATUS);
var vertices = new Float32Array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]);
axesVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, axesVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, vertices, context.STATIC_DRAW);
dataVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, dataVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, Float32Array.from(data), context.STATIC_DRAW);
dataVertices = data.length / 3;
var linearSRGBToXYZTransformation = DOMMatrix.fromMatrix({
m11: 0.4124, m21: 0.3576, m31: 0.1805,
m12: 0.2126, m22: 0.7152, m32: 0.0722,
m13: 0.0193, m23: 0.1192, m33: 0.9505,
});
var corner0 = [linearSRGBToXYZTransformation.m11, linearSRGBToXYZTransformation.m12, linearSRGBToXYZTransformation.m13];
var corner1 = [linearSRGBToXYZTransformation.m21, linearSRGBToXYZTransformation.m22, linearSRGBToXYZTransformation.m23];
var corner2 = [linearSRGBToXYZTransformation.m31, linearSRGBToXYZTransformation.m32, linearSRGBToXYZTransformation.m33];
var corner0Sum = corner0[0] + corner0[1] + corner0[2];
var corner1Sum = corner1[0] + corner1[1] + corner1[2];
var corner2Sum = corner2[0] + corner2[1] + corner2[2];
var projectedCorner0 = [corner0[0] / corner0Sum, corner0[1] / corner0Sum, corner0[2] / corner0Sum];
var projectedCorner1 = [corner1[0] / corner1Sum, corner1[1] / corner1Sum, corner1[2] / corner1Sum];
var projectedCorner2 = [corner2[0] / corner2Sum, corner2[1] / corner2Sum, corner2[2] / corner2Sum];
var sRGBVertices = new Float32Array([projectedCorner0[0], projectedCorner0[1], projectedCorner0[2], projectedCorner1[0], projectedCorner1[1], projectedCorner1[2], projectedCorner2[0], projectedCorner2[1], projectedCorner2[2]]);
sRGBVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, sRGBVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, sRGBVertices, context.STATIC_DRAW);
var sRGBWhite = [corner0[0] + corner1[0] + corner2[0], corner0[1] + corner1[1] + corner2[1], corner0[2] + corner1[2] + corner2[2]];
var sRGBCubeVertices = new Float32Array([
0, 0, 0,
corner0[0], corner0[1], corner0[2],
0, 0, 0,
corner1[0], corner1[1], corner1[2],
0, 0, 0,
corner2[0], corner2[1], corner2[2],
corner0[0], corner0[1], corner0[2],
corner0[0] + corner1[0], corner0[1] + corner1[1], corner0[2] + corner1[2],
corner1[0], corner1[1], corner1[2],
corner0[0] + corner1[0], corner0[1] + corner1[1], corner0[2] + corner1[2],
corner1[0], corner1[1], corner1[2],
corner1[0] + corner2[0], corner1[1] + corner2[1], corner1[2] + corner2[2],
corner2[0], corner2[1], corner2[2],
corner1[0] + corner2[0], corner1[1] + corner2[1], corner1[2] + corner2[2],
corner0[0], corner0[1], corner0[2],
corner0[0] + corner2[0], corner0[1] + corner2[1], corner0[2] + corner2[2],
corner2[0], corner2[1], corner2[2],
corner0[0] + corner2[0], corner0[1] + corner2[1], corner0[2] + corner2[2],
sRGBWhite[0], sRGBWhite[1], sRGBWhite[2],
corner0[0] + corner1[0], corner0[1] + corner1[1], corner0[2] + corner1[2],
sRGBWhite[0], sRGBWhite[1], sRGBWhite[2],
corner1[0] + corner2[0], corner1[1] + corner2[1], corner1[2] + corner2[2],
sRGBWhite[0], sRGBWhite[1], sRGBWhite[2],
corner0[0] + corner2[0], corner0[1] + corner2[1], corner0[2] + corner2[2],
]);
sRGBCubeVertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, sRGBCubeVertexBuffer);
context.bufferData(context.ARRAY_BUFFER, sRGBCubeVertices, context.STATIC_DRAW);
var p3Vertices = new Float32Array([0.680, 0.320, 1 - (0.680 + 0.320), 0.265, 0.690, 1 - (0.265 + 0.690), 0.150, 0.060, 1 - (0.150 + 0.060)]);
p3VertexBuffer = context.createBuffer();
context.bindBuffer(context.ARRAY_BUFFER, p3VertexBuffer);
context.bufferData(context.ARRAY_BUFFER, p3Vertices, context.STATIC_DRAW);
context.useProgram(program);
vertexBufferAttribLocation = context.getAttribLocation(program, "position");
context.enableVertexAttribArray(vertexBufferAttribLocation);
modelViewProjectionMatrixLocation = context.getUniformLocation(program, "modelViewProjectionMatrix");
colorLocation = context.getUniformLocation(program, "color");
context.useProgram(projectionProgram);
projectionVertexBufferAttribLocation = context.getAttribLocation(projectionProgram, "position");
context.enableVertexAttribArray(projectionVertexBufferAttribLocation);
projectionModelViewProjectionMatrixLocation = context.getUniformLocation(projectionProgram, "modelViewProjectionMatrix");
projectionColorLocation = context.getUniformLocation(projectionProgram, "color");
context.useProgram(sRGBProgram);
sRGBVertexBufferAttribLocation = context.getAttribLocation(sRGBProgram, "position");
context.enableVertexAttribArray(sRGBVertexBufferAttribLocation);
sRGBModelViewProjectionMatrixLocation = context.getUniformLocation(sRGBProgram, "modelViewProjectionMatrix");
context.lineWidth(3);
context.clearColor(0, 0, 0, 1);
context.enable(context.DEPTH_TEST);
context.enable(context.BLEND);
context.blendFunc(context.SRC_ALPHA, context.ONE_MINUS_SRC_ALPHA);
window.requestAnimationFrame(tick);
var error = context.getError();
}
start();
}
window.addEventListener("load", function() {
visualizeXYZ(document.getElementById("canvas"), document.getElementById("vertexShader"), document.getElementById("projectionVertexShader"), document.getElementById("sRGBVertexShader"), document.getElementById("fragmentShader"), document.getElementById("sRGBFragmentShader"));
}, false);
</script>
<br />
(Best viewed in Safari Nightly build.)<br />
Click and drag to control the rotation of the graph!<br />
<br />
The white curve is our collection of (X, Y, Z) triples. (The red is our unit axes.)<br />
<br />
Remember that every visible color is represented as a linear combination of the vectors from the origin to points on this (one-dimensional) curve.<br />
<br />
The origin is black, because it represents 0 power. The hue and saturation of a color are described by the direction from the origin to its point, not by the point’s distance from the origin. If we want to represent this space in two dimensions, it makes sense to eliminate the brightness component and instead only show hue and saturation. This can be done by projecting each point onto the X + Y + Z = 1 plane, shown in green on the above chart.<br />
<br />
Note that this shape is convex. This is particularly interesting: any point on the contour of the shape, plus any other point on the contour, yields a point within the interior of the shape (when projected back onto our projection plane). Recall that all visible colors are linear combinations of points on the contour of the curve. Therefore, the visible colors are exactly the points in the interior of this curve. Points exterior to this curve would represent colors with a negative response for at least one type of cone (which cannot happen).<br />
<br />
So, the inside of this horseshoe shape represents every visible color. This shape is usually visualized by projecting it down to the (X, Y) plane. That yields the familiar diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgZnw84xXJD8gXPcKlFQfhDxlCkJFt-xHsP-Qyp6clBTrrnXL2lZaVYc8DpgVVvtyrzwUueAu_JRGW5izU0afNmGcRUGRIl-_6D5ayQDjQhGQKQXImwPkxD0AppkqY-hL_Bqt2EYHrnUCh/s1600/Screen+Shot+2017-05-28+at+10.38.47+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="692" data-original-width="670" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgZnw84xXJD8gXPcKlFQfhDxlCkJFt-xHsP-Qyp6clBTrrnXL2lZaVYc8DpgVVvtyrzwUueAu_JRGW5izU0afNmGcRUGRIl-_6D5ayQDjQhGQKQXImwPkxD0AppkqY-hL_Bqt2EYHrnUCh/s320/Screen+Shot+2017-05-28+at+10.38.47+PM.png" width="309" /></a></div>
<br />
<br />
The inside of this horseshoe represents every color we can see. Also, notice that, because X Y Z was constructed so that every visible color has positive coordinates, and the projection we are viewing is onto the X + Y + Z = 1 plane, all the points on the diagram lie below the X + Y = 1 line.<br />
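The projection behind these 2-D diagrams is simple arithmetic: divide each coordinate by the sum X + Y + Z, then keep (x, y). A minimal sketch, tested here with the XYZ coordinates of the sRGB red primary, which should land at the well-known chromaticity (0.64, 0.33):

```javascript
// Project an XYZ triple onto the X + Y + Z = 1 plane, then drop z.
// The resulting (x, y) pair is the color's chromaticity: hue and
// saturation with the brightness divided out.
function chromaticity([X, Y, Z]) {
  const sum = X + Y + Z;
  return [X / sum, Y / sum];
}

// XYZ of the sRGB red primary (first column of the matrix in the demo):
const [x, y] = chromaticity([0.4124, 0.2126, 0.0193]);
console.log(x.toFixed(2), y.toFixed(2)); // 0.64 0.33
```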
<br />
<h2>
Color spaces</h2>
A color space is usually represented as three primary colors, as well as a white point (or a maximum bound on the magnitude of each primary color). The colors in the color space are usually represented as a linear combination of the primary colors (subject to some maximum). In our chromaticity diagram, we aren’t concerned with brightness, so we can ignore these maximum values (and associated white points). Because we know the representable colors in a color space are linear combinations of the primaries, we can plot the primaries in X Y Z color space and project them onto the same X + Y + Z = 1 plane. Using the same logic we used above, we know that the representable colors in the color space are in the interior of the triangle realized by this projection.<br />
<br />
You can see the result of this projection for the primaries of sRGB in the shaded triangle in the above chart. As you can see, there are many colors that human eyes can see which aren’t representable within sRGB. The chart also allows you to toggle the bounding triangle for the DCI-P3 primaries, which Apple recently adopted (as Display P3) on some of its devices. You can see how P3 includes more colors than sRGB.<br />
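Deciding whether a given chromaticity is representable in a color space boils down to a standard point-in-triangle test against the primaries' chromaticities. The sketch below uses the sRGB primaries (0.64, 0.33), (0.30, 0.60), (0.15, 0.06) as the triangle, with the D65 white point as an in-gamut example and a very saturated green as an out-of-gamut one:

```javascript
// A chromaticity is representable in a three-primary color space iff it
// lies inside the triangle formed by the primaries' chromaticities.
// Sign-of-cross-product point-in-triangle test.
function cross(o, a, b) {
  return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
}

function insideTriangle(p, a, b, c) {
  const d1 = cross(a, b, p);
  const d2 = cross(b, c, p);
  const d3 = cross(c, a, p);
  const hasNegative = d1 < 0 || d2 < 0 || d3 < 0;
  const hasPositive = d1 > 0 || d2 > 0 || d3 > 0;
  return !(hasNegative && hasPositive); // p is on the same side of all edges
}

// sRGB primaries as (x, y) chromaticities:
const red = [0.64, 0.33], green = [0.30, 0.60], blue = [0.15, 0.06];

console.log(insideTriangle([0.3127, 0.3290], red, green, blue)); // true: D65 white
console.log(insideTriangle([0.2, 0.7], red, green, blue));       // false: out of gamut
```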
<br />
Because the shape of all visible colors isn’t a triangle, it isn’t possible to create a color space where each primary is a visible color and the color space encompasses every visible color. If your color space encompasses every visible color, the primaries must lie outside of the horseshoe and are therefore not visible. If your primaries lie inside the horseshoe, there are visible colors which cannot be captured by your primaries. Having your primaries be real physical colors is valuable so that you can, for example, actually build physical devices which emit your primaries (like the phosphors in a computer monitor). You can get closer to encompassing every visible color if you increase the number of primaries to 4 or 5, at the cost of making each color’s representation “fatter” (one extra component per extra primary).<br />
<br />
Keep in mind that these chromaticity diagrams (the 2-D diagrams above) are only useful for plotting individual points. Specifically, 2-D distances across the diagram are not meaningful: points that are close together on the diagram may not be visually similar, and points which are visually similar may not be close together on the diagram.<br />
<br />
Also, when reading these horseshoe graphs, realize that they are simply projections of a 3-D graph onto a somewhat arbitrary plane. A better visualization of color would include all three dimensions.<br />
<br />
<h2>
Relationship Between Glyphs and Code Points</h2>
(Posted 2017-05-20.)<br />
<br />
Recently, there have been some discussions about various Unicode concepts like surrogate pairs, variation selectors, and combining clusters, but I thought I could shed some light on how these pieces all fit together and their relationship with the things we actually see on screen.<br />
<br />
<b style="font-size: larger;">tl;dr: The relationship between what you see on the screen and the unicode string behind it is completely arbitrary.</b><br />
<br />
The biggest piece to understand is the difference between Unicode's specs and the contents of font files. A string is a sequence of code points. Certain code points have certain meanings, which can affect things like the width of the rendered string, caret placement, and editing commands. Once you have a string and you want to render it, you partition it into runs, where each run can be rendered with a single font (and has other properties, like a single direction throughout the run). You then map code points to glyphs one-to-one, and then you run a Turing-complete "shaping" pass over the sequence of glyphs and advances. Once you've got your shaped glyphs and advances, you can finally render them.<br />
<br />
<h2>
Code Points</h2>
<br />
Alright, so what's a code point? A code point is just a number. There are many specs which describe a mapping of number to meaning. Most of them are language-specific, which makes sense because, in any given document, there will likely only be a single language. For example, in the GBK encoding, character number 33088 (which is 0x8140 in hex) represents the 丂 character in Chinese. Unicode includes another such mapping. In Unicode, this same character number represents the 腀 character in Chinese. Therefore, the code point number alone is insufficient unless you know what encoding it is in.<br />
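This encoding dependence is easy to demonstrate in JavaScript. Here is a sketch; it assumes a TextDecoder implementation that supports the "gbk" label, which browsers and recent Node.js builds provide:

```javascript
// The byte pair 0x81 0x40, interpreted as GBK, decodes to 丂 (U+4E02).
const fromGBK = new TextDecoder("gbk").decode(new Uint8Array([0x81, 0x40]));
console.log(fromGBK); // 丂

// The same number, 0x8140, interpreted as a Unicode code point, is 腀.
const fromUnicode = String.fromCodePoint(0x8140);
console.log(fromUnicode); // 腀
```

Same number, two different characters, depending entirely on which mapping you apply.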
<br />
Unicode is special because it aims to include characters from every writing system on the planet. Therefore, it is a convenient choice for an internal encoding inside text engines. For example, if you didn't have a single internal encoding, all your editing commands would have to be reimplemented for each supported encoding. For this reason (and potentially some others), it has become the standard encoding for most text systems.<br />
<br />
<h3>
UTF-32</h3>
<br />
In Unicode, there are over 1 million (0x10FFFF) available options for code points, though most of those haven't been assigned a meaning yet. This means you need 21 bits to represent a code point. One way to do this is to use a 32-bit type and pad it out with zeroes. This is called UTF-32 (which is just a way of mapping a 21-bit number to a sequence of bytes so it can be stored). If you have one of these strings on disk or in memory, you need to know the endianness of each of these 4-byte numbers so that you can properly interpret it. You should already have an out-of-band mechanism to know what encoding the string is in, so this same mechanism is often re-used to describe the endianness of the bytes. (On the Web, this is HTTP headers or the &lt;meta&gt; tag.) There's also this neat hack called Byte Order Marks, if you don't have any out-of-band data.<br />
<br />
<h3>
UTF-16</h3>
<br />
Unfortunately, including 11 bits of 0s for every character is kind of wasteful. There is a more efficient encoding, called UTF-16. In this encoding, each code point may be encoded as either a single 16-bit number or a pair of 16-bit numbers. For code points which fit into a 16-bit number naturally, the encoding is the identity function. Unfortunately, there are over a million (0x100000) code points remaining which don't fit into a 16-bit number themselves. Because there are 20 bits of entropy in these remaining code points, we can split each into a pair of 10-bit numbers, and then encode this pair as two successive "code units." Once you've done that, you need a way of knowing, if someone hands you a 16-bit number, if it's a standalone code point or if it's part of a pair. This is done by reserving two 10-bit ranges inside the character mapping. By saying that code points 0xD800 - 0xDBFF are invalid, and code points 0xDC00 - 0xDFFF are invalid, we can now use these ranges to encode these 20-bit numbers. So, if someone hands you a 16-bit number, if it's in one of those ranges, you know you need to read a second 16-bit number, mask the 10 low bits of each, shift them together, and add 0x10000 to get the real code point (otherwise, the number is equal to the code point it represents).<br />
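The decoding steps just described can be sketched in JavaScript (the function name is mine; real engines do this internally):

```javascript
// Reconstruct a code point from a UTF-16 surrogate pair:
// take the low 10 bits of each half, shift them together, and add 0x10000.
function decodeSurrogatePair(high, low) {
  if (high < 0xD800 || high > 0xDBFF) throw new Error("not a high surrogate");
  if (low < 0xDC00 || low > 0xDFFF) throw new Error("not a low surrogate");
  return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
}

// U+1F4A9 PILE OF POO is stored as the two code units 0xD83D 0xDCA9.
console.log(decodeSurrogatePair(0xD83D, 0xDCA9).toString(16)); // "1f4a9"
```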
<br />
There are some interesting details here. The first is that the two 10-bit ranges are distinct. It could have been possible to re-use the same 10-bit range for both items in the pair (and use its position in the pair to determine its meaning). However, if you have an item missing from a long string of these surrogates, it may cause every code point after the missing one to be wrong. By using distinct ranges, if you come across an unpaired surrogate (like two high surrogates next to each other), most text systems will simply consider the first surrogate alone, treat it like an unsupported character, and resume processing correctly at the next surrogate.<br />
<br />
<h3>
UTF-8</h3>
<br />
There's also another encoding called UTF-8, which represents code points as sequences of 1, 2, 3, or 4 bytes. Because it uses bytes, endianness is irrelevant. However, the encoding is more complicated and it can be less efficient for some strings than UTF-16. It does have the nice property, however, that no byte within a UTF-8 string can be 0 (other than an encoded U+0000 itself), which means it is compatible with C strings.<br />
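A minimal sketch of the encoding rules (the helper function is mine; it covers Unicode's 1-4 byte range and skips error handling for surrogates):

```javascript
// Encode a single code point as a sequence of UTF-8 bytes.
function utf8Encode(cp) {
  if (cp < 0x80) return [cp]; // ASCII: one byte, high bit clear
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

console.log(utf8Encode(0x00E9));  // é encodes as bytes 0xC3 0xA9
console.log(utf8Encode(0x1F4A9)); // 💩 encodes as bytes 0xF0 0x9F 0x92 0xA9
```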
<br />
<h3>
"💩".length === 2</h3>
<br />
Because its encoding is somewhat simple, but fairly compact, many text systems, including Web browsers, ICU, and Cocoa strings, use UTF-16. This decision has actually had kind of a profound impact on the web. It is the reason that the "length" attribute of a string holding a single emoji can return 2: "length" returns the number of code units in the UTF-16 string, not the number of code points. If it wanted to return the number of code points, it would require linear time to compute. The choice of which number represents which "character" (or emoji) isn't completely arbitrary, and some things we think of as emoji actually have a number value less than 0x10000. This is why some emoji have a length of two while others have a length of one.<br />
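You can see this distinction directly in JavaScript; string iteration (unlike "length") is code-point-aware:

```javascript
const poo = "💩"; // U+1F4A9: above 0xFFFF, so two UTF-16 code units
console.log(poo.length);      // 2 - counts code units
console.log([...poo].length); // 1 - iteration yields whole code points
console.log(poo.codePointAt(0).toString(16)); // "1f4a9"

const watch = "⌚"; // U+231A: an emoji below 0x10000, so one code unit
console.log(watch.length);    // 1
```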
<br />
<h2>
Combining code points</h2>
<br />
Unicode also includes the concept of combining marks. The idea is that if you want to have the character "é", you can represent it as the "e" character followed by U+0301 COMBINING ACUTE ACCENT. This is so that every combination of diacritic marks and base characters doesn't have to be encoded in Unicode. It's important because, once a code point is assigned a meaning, it can never ever be un-assigned.<br />
<br />
To make matters worse, there is also a standalone code point U+00E9 LATIN SMALL LETTER E WITH ACUTE. When doing string comparisons, these two strings need to be equal. Therefore, string comparisons aren't just raw byte comparisons.<br />
<br />
This idea can happen even without these zero-width combining marks. In Korean, adjacent letters in words are grouped up to form blocks. For example, the letters ㅂ ㅏ ㅂ join to form the Korean word for "rice:" 밥 (read from top left to bottom right). Unicode includes a code point for each letter of the alphabet (ㅂ is U+3142 HANGUL LETTER PIEUP), as well as a code point for each joined block (밥 is U+BC25). It also includes joining letters, so 밥 can be represented as a single code point, but can also be represented by the string:<br />
<br />
U+1107 HANGUL CHOSEONG PIEUP<br />
U+1161 HANGUL JUNGSEONG A<br />
U+11B8 HANGUL JONGSEONG PIEUP<br />
<br />
This means, in JavaScript, you can have two strings which are treated exactly equally by the text system, and look visually identical (they literally have the same glyph drawn on screen), but have different lengths in JavaScript.<br />
<br />
<h3>
Normalization</h3>
<br />
One way to perform these string comparisons is to use Unicode's notion of "normalization." The idea is that strings which are conceptually equal should be normalized to the same sequence of code points. There are a few different normalization algorithms, depending on if you want the string to be exploded as much as possible into its constituent parts, or if you want it to be combined to be as short as possible, etc.<br />
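JavaScript exposes these algorithms through String.prototype.normalize(); a quick sketch of both the accent and Hangul cases:

```javascript
// "é" as one precomposed code point vs. "e" plus a combining accent.
const precomposed = "\u00E9";
const decomposed = "e\u0301";
console.log(precomposed === decomposed);  // false - raw comparison differs
console.log(precomposed.normalize("NFC") ===
            decomposed.normalize("NFC")); // true - after normalization

// The Korean block and its sequence of joining letters also unify under NFC.
const block = "\uBC25";               // 밥 as a single code point
const letters = "\u1107\u1161\u11B8"; // choseong, jungseong, jongseong
console.log(letters.normalize("NFC") === block); // true
console.log(block.normalize("NFD") === letters); // true - NFD explodes it again
```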
<br />
<h2>
Fonts</h2>
<br />
When reading text, people see pictures, not numbers. Or, put another way, computer monitors are not capable of showing you numbers; instead, they can only show pictures. All the picture information for text is contained within fonts. Unicode doesn't describe what information is included in a font file.<br />
<br />
When people think of emoji, they usually think of the little color pictures inside our text. These little color pictures come from font files. A font file can do whatever it wants with the string it is tasked with rendering. It can draw emoji without color. It can draw non-emoji with color. The idea of color in a glyph is orthogonal to whether or not a code point is classified as "emoji."<br />
<br />
Similarly, a font can include a ligature, which draws multiple code points as a single glyph ("glyph" just means "picture"). A font can also draw a single code point as multiple glyphs (for example, the accent in é may be implemented as a separate glyph from the e). But it doesn't have to. The choice of which glyphs to use where is totally an implementation detail of the font, as is the choice of which glyphs include color. Some ligatures get caret positions inside them; others don't.<br />
<br />
For example, Arabic is a cursive script, which means that the letters flow together from one to the next. Here are two images of two different fonts (Geeza Pro and Noto Nastaliq Urdu) rendering the same string, where each glyph is painted in a different color. You can see that both fonts show the string with a different number of glyphs. Sometimes diacritics are contained within their base glyph, but sometimes not.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg90OQPb-a2jc9ghhKLdfSy0UAxgI8xefpr8uriJVFRzhPWSI0lbV-1DKzLODSzuUYdOQqE4gYs2lU__K6zqrK9wk4GYnEQ7fBupd_ISqvrTcBsoHWdTYPBYPYnFBPHuuRmiQBCg_fgtJzE/s1600/IMG_0079.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="202" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg90OQPb-a2jc9ghhKLdfSy0UAxgI8xefpr8uriJVFRzhPWSI0lbV-1DKzLODSzuUYdOQqE4gYs2lU__K6zqrK9wk4GYnEQ7fBupd_ISqvrTcBsoHWdTYPBYPYnFBPHuuRmiQBCg_fgtJzE/s320/IMG_0079.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVNDraxRKiQcTX7Si5pNhV2Iryzn-iInFoPld7_NeEUc8QbXbCzpDz3-ROAF7fR9Jge1bY_aXEET8-vw2h63Q0NaK8siRlLDKFF74bSMhpbjo3R6Z98EgxIhn0NLNdJdAQ48XGptS0ZmwD/s1600/IMG_0078.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVNDraxRKiQcTX7Si5pNhV2Iryzn-iInFoPld7_NeEUc8QbXbCzpDz3-ROAF7fR9Jge1bY_aXEET8-vw2h63Q0NaK8siRlLDKFF74bSMhpbjo3R6Z98EgxIhn0NLNdJdAQ48XGptS0ZmwD/s320/IMG_0078.jpg" width="320" /></a></div>
<br />
<h2>
Variation Selectors</h2>
<br />
There are other classes of code points which are invisible and are added after a base code point to modify it. One example is the use of Variation Selector 15 and Variation Selector 16. The problem these try to solve is the fact that some code points may be drawn in either text style (☃︎) or emoji style (☃️). Variation Selector 16 is an invisible code point that means "please draw the base character like an emoji" while #15 means "please draw the base character like text." The platform also has a default representation which is used when no variation selector is present. Unicode includes a table of which code points should be able to accept these variation selectors (but, like everything Unicode creates, it affects but doesn't dictate implementations).<br />
<br />
These variation selectors are a little special because they are the only combining code points I know of that can interact with the "cmap" table in the font, and can therefore affect font selection. This means that a font can say "I support the snowman code point, but not the emoji style of it." Many text systems have special processing for these variation selectors.<br />
<br />
<h2>
Zero-Width-Joiner Sequences</h2>
<br />
Rendering on old platforms is also important when Unicode defines new emoji. Some Unicode characters, such as "👨&#8205;👩&#8205;👧", are a collection of things (people) which can already be represented with other code points. This specific "emoji" is actually the string of code points:<br />
<br />
U+1F468 MAN<br />
U+200D ZERO WIDTH JOINER<br />
U+1F469 WOMAN<br />
U+200D ZERO WIDTH JOINER<br />
U+1F467 GIRL<br />
<br />
The zero width joiners are necessary for backwards compatibility. If someone had a string somewhere that was just a list of people in a row, the creation of this new "emoji" shouldn't magically join them up into a family. The benefit of using the collection of code points is that older systems showing the new string will show something understandable instead of just an empty square. Fonts often implement these as ligatures. Unicode specifies which sequences should be represented by a single glyph, but, again, it's up to each implementation to actually do that, and implementations vary.<br />
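The structure of this sequence is visible from JavaScript:

```javascript
// The "family" emoji written out explicitly: MAN, ZWJ, WOMAN, ZWJ, GIRL.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length);      // 8 - three surrogate pairs plus two ZWJs
console.log([...family].length); // 5 code points
console.log([...family].map(c => c.codePointAt(0).toString(16)));
// ["1f468", "200d", "1f469", "200d", "1f467"]
```

Whether this draws as one glyph or as three people is up to the font and text engine rendering it.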
<br />
<h2>
Caret Positions</h2>
<br />
Similarly to how Unicode describes sequences of code points which should visually combine to a single thing, Unicode also describes what a "character" is, in the sense of what most people mean when they say "character." Unicode calls this a "grapheme cluster." Part of the ICU library (which implements pieces of Unicode) provides iterators which will give you all the locations where lines can break, where words can be formed (in Chinese this is hard), and where characters' boundaries lie. If you give it the string of "e" followed by U+0301 COMBINING ACUTE ACCENT, it should tell you that these code points are part of the same grapheme cluster. It does this by ingesting data tables which Unicode creates.<br />
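Newer JavaScript engines expose this segmentation directly via Intl.Segmenter, which is backed by the same Unicode data tables ICU uses (a sketch):

```javascript
// Split a string into grapheme clusters using the engine's Unicode tables.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const s = "e\u0301x"; // "e" + U+0301 COMBINING ACUTE ACCENT, then "x"
const clusters = [...segmenter.segment(s)].map(part => part.segment);
console.log(clusters); // ["é", "x"] - the accent stays with its base
console.log(s.length); // 3 code units, but only 2 grapheme clusters
```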
<br />
However, this isn't quite sufficient to know where to put the caret when the user presses the arrow keys, delete key, or forward-delete key (Fn + delete on macOS). Consider the following string in Hindi "कि". This is composed of the following two code points:<br />
<br />
U+0915 DEVANAGARI LETTER KA<br />
U+093F DEVANAGARI VOWEL SIGN I<br />
<br />
Here, if you select the text or use arrow keys, the entire string is selected as a unit. However, if you place the caret after the string and press delete, only the U+093F is deleted. This is particularly confusing because this vowel sign is actually drawn to the left of the letter, so it isn't even adjacent to the caret when you press delete. (Hindi is a left-to-right script.) If you place the caret just before the string and press the forward delete key (Fn + delete), both code points get deleted. The user expectations for the results of these kinds of editing commands are somewhat platform-specific, and aren't entirely codified in Unicode currently.<br />
Try it out here:<br />
<div contenteditable="true" style="background: white; color: black;">
==> कि <==</div>
<br />
<h2>
Simplified and Traditional Chinese</h2>
<br />
The Chinese language is many thousands of years old. In the 1950s and 1960s, the Chinese government (PRC) decided that their characters had too many strokes, and simplifying the characters would increase literacy rates. So, they decided to change how about 1/3 of the characters were written. Some of the characters were untouched, some were touched only very slightly, and some were completely changed.<br />
<br />
When Unicode started codifying these characters, they had to figure out whether or not to give the simplified characters new code points. For the characters which were completely unchanged, it is obvious they shouldn't get their own code points. For characters which were entirely changed, it is obvious that they should. But what about the characters which changed only slightly? These were decided on a case-by-case basis, and some of the slightly-changed characters did not receive their own new code points.<br />
<br />
This is really problematic for a text engine, because for these characters there is a discernible difference between the simplified and traditional forms, yet both are represented by the same code point. If you show the wrong form, the text is wrong. This means the text engine has to know out-of-band which form to show.<br />
<br />
Here's an example showing the same code point with two different "lang" tags.<br />
Simplified Chinese:<br />
<div lang="zh-Hans" style="font-size: 60px;">
雪</div>
Traditional Chinese:<br />
<div lang="zh-Hant" style="font-size: 60px;">
雪</div>
<br />
There are a few different mechanisms for this. HTML includes the "lang" attribute, which includes whether or not the language is supposed to be simplified or traditional. This is used during font selection. On macOS and iOS, every Chinese face actually includes two font files: one for Simplified Chinese and one for Traditional Chinese. (For example, PingFang SC and PingFang TC.) Browsers use the language of the element when deciding which of these fonts to use. If the lang tag isn't present or doesn't include the information browsers need, browsers will use the language the machine is configured to use.<br />
<br />
Rather than including two separate fonts for every face, another mechanism to implement this is by using font features. This is part of that "shaping" step I mentioned earlier. This shaping step can include a set of key/value pairs provided by the environment. CSS controls this with the font-variant-east-asian property. This works by having the font include glyphs for both kinds of Chinese, and the correct one is selected as part of text layout. This only works, however, with text renderers which support complex shaping and font features.<br />
<br />
I think there's at least one other way to have a single font file be able to draw both simplified and traditional forms, but I can't remember what it is right now.<br />
<br />
<h1>Single Screen GPU Handoff</h1>
<i>2016-11-16</i><br />
<br />
Over the past few years, a <a href="https://www.microsoft.com/en-us/surface/devices/surface-book/overview">collection</a> of <a href="http://www.apple.com/macbook-pro/">laptops</a> have been released with two graphics cards. The idea is that one is low-power and one is high-power. When you want long battery life, you can use the low-power GPU, but when you want high performance, you can use the high-power GPU. However, there is a wrinkle: the laptop only has one screen.<br />
<br />
The screen’s contents have to come from somewhere. One way to implement this system would be to daisy-chain the two GPUs, thereby keeping the screen always plugged into the same GPU. In this system, the primary GPU (which the screen is plugged into) would have to be told to give the results of the secondary GPU to the screen.<br />
<br />
A different approach is to connect both GPUs in parallel with a switch between them. The system will decide when to flip the switch between each of the GPUs. When the screen is connected to one GPU, the other GPU can be turned off completely.<br />
<br />
The question, then, is how this looks to a user application. I’ll be investigating three different scenarios here. Note that I’m not discussing what happens if you drag a window between two different monitors each plugged into a separate card; instead, I’m discussing the specific hardware which allows multiple graphics cards to display to the same monitor.<br />
<br />
<h2>
OpenGL on macOS</h2>
<br />
On macOS, you can tell which GPU your OpenGL context is running on by running glGetString(GL_VENDOR). When you create your context, you declare whether or not you are capable of using the low-power GPU (the high-power GPU is the default). macOS is designed so that if any context requires the high-power GPU, the whole system flips to use it. This is observable by using <a href="https://gfx.io/">gfxCardStatus</a>. This means that the whole system may switch out from under you while your app is running because of something a completely different app did.<br />
<br />
For many apps, this isn’t a problem because macOS will copy your OpenGL resources between the GPUs, which means your app may be able to continue without caring that the switch occurred. This works because the OpenGL context itself survives the switch, but the internal renderer changes. Because the context is still alive, your app can likely continue.<br />
<br />
The problem, though, is with OpenGL extensions. Different renderers support different extensions, and app logic may depend on the presence of an extension. On my machine, the high-powered GPU supports both GL_EXT_depth_bounds_test and GL_EXT_texture_mirror_clamp, but the low-powered one doesn’t. Therefore, if an app relies on an extension, and the renderer changes in the middle of operation, the app may malfunction. The way to fix this is to listen to the NSWindowDidChangeScreenNotification in the default NSNotificationCenter. When you receive this notification, re-interrogate the OpenGL context for its supported extensions. Note that switching in both directions may occur - the system switches to the high-power GPU when some other app is launched, and the system switches back when that app is quit.<br />
<br />
You only have to do this if you opt-in to running on the low-power GPU, because if you don’t opt in, you will run on the high-power GPU, which means your app will be the app keeping the system on the high-power GPU, which means the system will never switch back while your app is alive.<br />
<br />
<h2>
Metal on macOS</h2>
<br />
Metal takes a different approach. When you want to create a MTLDevice, you must choose which GPU your device reflects. There is an API call, MTLCopyAllDevices(), which will simply return a list, and you are free to interrogate each device in the list to determine which one you want to run on. In addition, there’s a MTLCreateSystemDefaultDevice() which will simply pick one for you. On my machine, this “default device” isn’t magical - it is simply exactly equal (by pointer equality) to one of the items in the list that MTLCopyAllDevices() returns. On my machine, it returns the high-powered GPU.<br />
<br />
However, MTLDevices don’t have the concept of an internal renderer. In fact, even if you cause the system to change the active GPU (using the above approach of making another app create an OpenGL context), your MTLDevice still refers to the same device that it did when you created it.<br />
<br />
I was suspicious of this, so I ran a performance test. I created a shader which got 28 fps on the high-powered GPU and 11 fps on the low-powered one. While this program was running on the low-powered GPU, I opened up an OpenGL app which I knew would cause the system to switch to the high-powered GPU, and I saw that the app’s fps didn’t change. Therefore, the Metal device doesn’t migrate to a new GPU when the system switches GPUs.<br />
<br />
Another interesting thing I noticed during this experiment was that the Metal app was responsive throughout the entire test. This means that the rendering was being performed on the low-power GPU, but the results were being shown on the high-power GPU. I can only guess that this means that the visual results of the rendering are being copied between GPUs every frame. This would also seem to mean that both GPUs were on at the same time, which seems like it would be bad for battery life.<br />
<br />
<h2>
DirectX 12 on Windows 10</h2>
<br />
I recently bought a Microsoft Surface Book which has the same kind of setup: one low-power GPU and one high-power GPU. Similarly to Metal, when you create a DirectX 12 context, you have to select which adapter you want to use. IDXGIFactory4::EnumAdapters1() returns a list of adapters, and you are free to interrogate them and choose which one you prefer. However, there is no separate API call to get the default adapter; there is simply a convention that the first device in the list is the one you should be using, and that it is the low-power GPU.<br />
<br />
As I stated above, on macOS, switching to the discrete GPU is all-or-nothing - the screen’s signal is either coming from the high-power GPU or the low-power GPU. I don’t know whether or not this is true on Windows 10 because I don’t know of a way to observe it there.<br />
<br />
However, an individual DirectX 12 context won’t migrate between GPUs on Windows 10. This is observable with a similar test as the one described above. Automatic migration occurred on previous versions of Windows, but it doesn’t occur now.<br />
<br />
Therefore, the model here is similar to Metal on macOS, so it seems like the visual results of rendering are copied between the two cards, and that both cards are kept on at the same time if there are any contexts executing on the high-power GPU.<br />
<br />
However, the Surface Book has an interesting design: the high-power GPU is in the bottom part of the laptop, near the keyboard, and the laptop’s upper (screen) half can separate from the lower half. This means that the high-power GPU can be removed from the system.<br />
<br />
Before the machine’s two parts can be separated, the user must press a special button on the keyboard which is more than just a physical switch. It causes software to run which inspects all the contexts on the machine to determine if any app is using the high-powered GPU on the bottom half of the machine. If it is being used by any app, the machine refuses to separate from the base (and shows a pop up asking the user to please quit the app, or presumably just destroy the DirectX context). There is currently no way for the app to react to the button being pressed so that it could destroy its context. Instead, currently, the user must quit the app.<br />
<br />
However, it is possible to lose your DirectX context in other ways. For example, if a user connects to your machine via Terminal Services (similar to VNC), the system will switch from a GPU-accelerated environment to a software-rendering environment. To an app, this will look like the call to IDXGISwapChain3::Present() will return DXGI_ERROR_DEVICE_REMOVED or DXGI_ERROR_DEVICE_RESET. Apps should react to this by destroying their device and re-querying the system for the present devices. This sort of thing will also happen when Windows Update updates GPU drivers or when some older Windows versions (before Windows 10) perform a global low-power to high-power (or vice-versa) switch. So, a well-formed app should already be handling the DEVICE_REMOVED error. Unfortunately, this doesn’t help the use case of separating the two pieces of the Surface Book.<br />
<br />
Thanks to Frank Olivier for lots of help with this post.<br />
<br />
<h1>Variation Fonts Demo</h1>
<i>2016-09-30</i><br />
<style>
@keyframes gainWeight {
from {
font-variation-settings: "wght" 0.5;
}
to {
font-variation-settings: "wght" 3;
}
}
@keyframes gainWidth {
from {
font-variation-settings: "wdth" 0.7;
}
to {
font-variation-settings: "wdth" 1.2;
}
}
@keyframes gainBoth {
from {
font-variation-settings: "wdth" 0.7, "wght" 0.5;
}
to {
font-variation-settings: "wdth" 1.2, "wght" 3;
}
}
</style>
Try opening this in a recent <a href="https://webkit.org/downloads/">Safari nightly build</a>.<br />
<br/>
The first line shows the text with no variations.<br />
The second line animates the weight.<br />
The third line animates the width.<br />
The fourth line animates both.<br />
<br />
<div style="font: 48px 'skia'; text-decoration: underline; transform-origin: left top;">
<div>
hamburgefonstiv</div>
<div style="animation-name: 'gainWeight'; animation-direction: alternate; animation-duration: 3s; animation-iteration-count: infinite; animation-timing-function: ease-in-out;">
hamburgefonstiv</div>
<div style="animation-name: 'gainWidth'; animation-direction: alternate; animation-duration: 3s; animation-iteration-count: infinite; animation-timing-function: ease-in-out;">
hamburgefonstiv</div>
<div style="animation-name: 'gainBoth'; animation-direction: alternate; animation-duration: 3s; animation-iteration-count: infinite; animation-timing-function: ease-in-out;">
hamburgefonstiv</div>
</div>
<br />
<h1>Variable Fonts in CSS Draft</h1>
<i>2016-09-22</i><br />
<br />
Recently, the CSS Working Group in the W3C resolved to pursue adding support for variable fonts within CSS. A draft has been added to the <a href="https://github.com/w3c/csswg-drafts/blob/master/css-fonts-4/Overview.bs">CSS Fonts Level 4 spec</a>. Your questions and comments are extremely appreciated, and will help shape the future of variation font support in CSS! Please add them to a <a href="https://github.com/w3c/csswg-drafts/issues/new">new CSS GitHub issue</a>, tweet at <a href="https://twitter.com/Litherum">@Litherum</a>, email <a href="mailto:mmaxfield@apple.com">mmaxfield@apple.com</a>, or use any other means to get in contact with anyone at the CSSWG! Thank you very much!<br />
<br />
Here is what CSS would look like using the current draft:<br />
<br />
<style>
code {
display: block;
font-size: 0.9em;
background: rgb(40, 40, 40);
border-left: 10px solid rgb(100, 100, 100);
padding: 10px;
}
</style>
1. Use a preinstalled font with a semibold weight:<br />
<br />
<code><div style="font-weight: 632;">hamburgefonstiv</div></code><br />
<br />
2. Use a preinstalled font with a semicondensed weight:<br />
<br />
<code><div style='font-stretch: 83.7%;'>hamburgefonstiv</div></code><br />
<br />
3. Use the "ital" axis to enable italics<br />
<br />
<code>/* Note: No change! The browser can enable variation italics automatically. */<br />
<div style="font-style: italic;">hamburgefonstiv</div></code><br />
<br />
4. Set the "fancy" axis to 9001:<br />
<br />
<code><div style="<br />font-variation-settings: 'fncy' 9001;">hamburgefonstiv</div></code><br />
<br />
5. Animate the weight and width axes together:<br />
<br />
<code>@keyframes zooming {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>from {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-variation-settings: 'wght' 400, 'wdth' 85;<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}<br />
<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>to {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-variation-settings: 'wght' 800, 'wdth' 105;<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}<br />
}<br />
<br />
<div style="animation-duration: 3s;<br />animation-name: zooming;">hamburgefonstiv</div></code><br />
<br />
6. Use a variation font as a web font (without fallback):<br />
<br />
<code>@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>/* Note that this is identical to what you currently do today! */<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "VariationFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("VariationFont.otf");<br />
}<br />
<br />
<div style="font-family: 'VariationFont';"> hamburgefonstiv</div></code><br />
<br />
7. Use a variation font as a web font (with fallback):<br />
<br />
<code>@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: 'FancyFont';<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("FancyFont.otf") format("opentype-variations"), url("FancyFont-600.otf") format("opentype");<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 600;<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>/* Old browsers fail to parse "615", so it is<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>ignored and 600 remains. New browsers parse it<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>correctly, so 615 wins. Note that, because of<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>the font selection rules, the font-weight<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>descriptor above may be sufficient, thereby<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>making the font-weight descriptor below<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>unnecessary. */<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 615;<br />
}<br />
<br />
#fancy {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "FancyFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 600;<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 615;<br />
}<br />
<br />
<div id="fancy">hamburgefonstiv</div></code><br />
<br />
8. Use two variations of the same variation font:<br />
<br />
<code>@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "VariationFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("VariationFont.otf");<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 400;<br />
}<br />
<br />
<div style="font-family: VariationFont; font-weight: 300;">hamburgefonstiv</div><br />
<br />
<div style="font-family: VariationFont; font-weight: 700;">hamburgefonstiv</div></code><br />
<br />
9. Combine two variation fonts together as if they were a single font: one for weights 1-300 and another for weights 301-999:<br />
<br />
<code>@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "SegmentedVariationFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("SegmentedVariationFont-LightWeights.otf");<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 1;<br />
}<br />
<br />
@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>/* There is complication here due to the peculiar nature of the font selection rules.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Note how this block uses the same source file as the block below. */<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "SegmentedVariationFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("SegmentedVariationFont-HeavyWeights.otf");<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 301;<br />
}<br />
<br />
@font-face {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-family: "SegmentedVariationFont";<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>src: url("SegmentedVariationFont-HeavyWeights.otf");<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>font-weight: 999;<br />
}</code>Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-26825558264023338132016-09-15T10:03:00.000-07:002016-09-15T10:03:08.059-07:00Font Taxonomy<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrkCfOcHcyFcck4MPMyRqjVu8nWqKHd1av430gfEFWCwPImFmujBkiG2Flt6ZtWMpq8pv4trRrpMS9BBOWcrgCghhwQv2GGmTC7eaKKOK9dTDYdhTHecdo1-_gwe-1GGpuZLAkyd7DCSsY/s1600/Screen+Shot+2016-09-15+at+10.02.39+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="282" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrkCfOcHcyFcck4MPMyRqjVu8nWqKHd1av430gfEFWCwPImFmujBkiG2Flt6ZtWMpq8pv4trRrpMS9BBOWcrgCghhwQv2GGmTC7eaKKOK9dTDYdhTHecdo1-_gwe-1GGpuZLAkyd7DCSsY/s320/Screen+Shot+2016-09-15+at+10.02.39+AM.png" width="320" /></a></div>
<br />Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-56268595211702195242016-09-03T14:16:00.004-07:002016-09-03T14:16:38.974-07:00OpenGL on iOSThe model of OpenGL on iOS is much simpler than that on macOS. In particular, the context creation routine on macOS is older than the concept of OpenGL frame buffers, which is why it is structured the way that it is. Back then, the model was much simpler: the OS gave you a buffer, and you drew stuff into it. If you wanted to render offscreen, you had to ask the OS to give you an offscreen buffer.<br />
<br />
That all changed with frame buffer objects. Now, in OpenGL, you can create your own offscreen render targets, render into them, and when you’re done, read from them (either as a texture or into host memory). This means that there is a conceptual divide between that buffer the OS gives you when you create your context, and the frame buffer objects you have created in your own OpenGL code.<br />
<br />
On iOS, the OpenGL infrastructure was created with frame buffer objects in mind. Instead of asking the OS for a buffer to render into, you ask the OS to assign a backing store to a renderbuffer (which is part of a framebuffer). Specifically, you do this after the OpenGL context is created. This means that almost all of those creation parameters are now unnecessary, since most of them define the structure of that buffer the OS gives you. Indeed, on iOS, when you create a context, the only thing you specify is which version of OpenGL ES you want to use.<br />
<br />
On iOS, the way you render directly to the screen is with CoreAnimation layers. There is a method on EAGLContext, renderbufferStorage:fromDrawable: which connects an EAGLDrawable with a renderbuffer. Currently, CAEAGLLayer is the only class which implements EAGLDrawable, which means you have to draw into a layer in the CoreAnimation layer tree. (You can also draw into an offscreen IOSurface by wrapping a texture around it and using render-to-texture, as detailed in my previous post).<br />
<br />
This model is quite different from CAOpenGLLayer, as used on macOS. Here, you can affect the properties of the drawable by setting the drawableProperties property on the EAGLDrawable.<br />
<br />
There is a higher-level abstraction: a GLKView, which subclasses UIView. This class has a GLKViewDelegate which provides the drawing operations. It has properties which let you specify the attributes of the drawable. There’s also the associated GLKViewController, which subclasses UIViewController and has its own GLKViewControllerDelegate. This delegate has an update() method, which is called between frames. The idea is that you shouldn’t need to subclass GLKView or GLKViewController; instead, you subclass the delegates.<br />
<br />
Many iOS devices have retina screens. The programmer has to opt in to high-density screens by setting the contentsScale property of the CAEAGLLayer to whatever UIScreen.nativeScale is set to. If you don’t do this, your view will be stretched and blurry. This also means that you have to take care to update any places where you interact with pixel data directly, like glReadPixels().<br />
<br />
iOS devices also support multiple monitors via AirPlay. With AirPlay, an app can render content on to a remote display. However, the model for this is a little different than on macOS: instead of the user dragging a window to another monitor, and the system telling the app about it, the app handles the movement to the external monitor. The system will give you a UIScreenDidConnectNotification / UIScreenDidDisconnectNotification when the user enables AirPlay. Then, you can see that the [UIScreen screens] array has multiple items in it. You can then move a view hierarchy to the external screen by assigning the screen to your UIWindow’s screen property. You can create a new UIWindow by using the regular alloc / initWithFrame constructor and passing in the UIScreen’s bounds. You then set the rootViewController of this new window to whatever you want to show on the external monitor. Therefore, when this occurs, you have the freedom to query the properties of the remote screen (using UIScreen APIs, such as UIScreen.nativeScale) and react accordingly. For example, if you have a retina device but you are moving content to a 1x screen, you can know this by querying the screen at the time you move the window to it.<br />
<br />
On macOS, an OpenGL context could have many renderers inside it, with only one being active at a current time. On iOS devices, there is only one GPU, which means there is only one renderer. This means you don’t have to worry about a switch in renderers. This means that the model is much simpler and you don’t have to worry so much about things changing out from under you.Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com0tag:blogger.com,1999:blog-8778351438463999796.post-67597987926041563712016-08-22T01:56:00.005-07:002016-08-22T01:57:33.018-07:00OpenGL on macOSOpenGL is a specification created by a cross-vendor group, and is designed to work on all (fairly modern) graphics cards. While this sounds obvious, it actually has some interesting implications. It means that nothing platform-specific is inside the OpenGL spec itself. Instead, only the common pieces are inside the spec.<br />
<br />
In addition, technically, OpenGL is not a piece of software. OpenGL is a document designed for humans to read. There are many libraries written by many people which claim to implement this spec, but it’s important to realize that these libraries are not OpenGL itself. There can be problems with an individual implementation, and there can be problems with the spec, and those are separate problems.<br />
<br />
OpenGL operates inside a “context” which is “current” to a thread. However, the spec doesn’t include any way of interacting with this context directly (like creating it or making it current). This is because each platform has its own way of creating this context. On macOS, this is done with the CGL (Core OpenGL) framework.<br />
<br />
Another example of something not existing in the spec is the issue of device memory availability. The OpenGL spec does not list any way to ask the device how much memory is available or used on the device. This is because GPUs can be implemented with many different regions of memory with different performance characteristics. For example, many GPUs have a separate area where constant memory or texture memory lives. On the other hand, an integrated GPU uses main memory, which is shared with regular applications, so the whole concept of available graphics memory doesn’t make a lot of sense. (Also, imagine a theoretical GPU with automatic memory compression.) Indeed, these varied memory architectures are incredibly valuable, and GPU vendors should be able to innovate in this space. If being able to ask for available memory limits were added to the spec, it would either 1) be simple but meaningless on many GPUs with varied memory architectures, or 2) be so generic and nebulous that it would be impossible for a program to make any actionable decisions at runtime. The lack of such an API is actually a success, not an oversight. If you are running on a specific GPU whose memory architecture you understand, perhaps the vendor of that GPU can give you a vendor-specific API to answer these kinds of questions in a platform-specific way. However, this API would only work on that specific GPU.<br />
<br />
Another example is the idea of “losing” a context. Most operating systems include mechanisms which will cause your OpenGL context to become invalid, or “lost.” Each operating system has its own affordances for why a context may be lost, or how to listen for events which may cause the context to be lost. Similar to context creation, this concept falls squarely in the “platform-dependent” bucket. Therefore, the spec itself just assumes your context is valid, and it is the programmer’s responsibility to make sure that’s true on any specific operating system.<br />
<br />
As mentioned above, OpenGL contexts on macOS are interacted with directly by using CGL (in addition to its higher-level NSOpenGL* wrappers). There are a few concepts involved with using CGL:<br />
<ul>
<li>Pixel Formats</li>
<li>Renderers</li>
<li>Virtual Screens</li>
<li>Contexts</li>
</ul>
<div>
<br /></div>
A context is the thing you need to run OpenGL functions. In order to create a context, you need to specify a pixel format. This is a configuration of the external resources the context will be able to access. For example, you can say things like “Make a double-buffered color buffer 8 bits-per-channel, with a similar 8-bit depth buffer.” This information needs to be specified on the context itself (and is therefore not in the OpenGL spec because it’s platform-specific) because there is a relationship between what you specify here and the integration with the rest of the machine. For example, you can only successfully create a context with a pixel format that the window server understands, because at the end of the day, the window server needs to composite the output of your OpenGL rendering with the rest of the windows on the system. (This is also the reason why there’s no “present” call in the OpenGL spec - it requires interaction with the platform-specific window server.)<br />
<br />
Because the pixel format attributes also act as configuration parameters to the renderer in general, this is also the place where you specify things like which version of OpenGL the context should support (which is necessary because OpenGL deprecated some things and increasingly moves things from ARB extensions into core). Parameters like this one don’t affect the format of the pixels, per se, but they do affect the selection of the CGL renderer used to implement the OpenGL functions.<br />
<br />
A CGL renderer is conceptually similar to a vtable which backs the OpenGL drawing commands. There is a software renderer, as well as a renderer provided by the GPU driver. On a MacBook Pro with both an integrated and discrete GPU, different renderers are used for each one. A renderer can operate on one or more virtual screens, which are conceptually similar to physical screens attached to the machine, but generalized (virtualized) so it is possible to, for example, have a virtual screen that spans across two physical screens. There is a relationship between CGDisplayIDs and OpenGL virtual screens, so it’s possible to map back and forth between them. This means that you can get semantic knowledge of an OpenGL renderer based on existing context in your program. It’s possible to iterate through all the renderers on the system (and their relationships with virtual screens) and then use CGL to query attributes about each renderer.<br />
<br />
A CGL context has a set of renderers that it may use for rendering. (This set can have more than one object in it.) The context may decide to migrate from one renderer to another. When this happens, the context the application uses doesn’t change; instead if you query the context for its current renderer, it will just reply with a different answer.<br />
<br />
(Side note: it’s possible to create an OpenGL context where you specify exactly one renderer to use with kCGLPFARendererID. If you do this, the renderer won’t change; however, the virtual screen can change if, for example, the user drags the window to a second monitor attached to the same video card.)<br />
<br />
Therefore, this causes something of a problem. Inside a single context, the system may decide to switch you to a different renderer, but different renderers have different capabilities. Therefore, if you were relying on the specific capabilities of the current renderer, you may have to change your program logic if the renderer changes. Similarly, even if the renderer doesn’t change, but the virtual screen does change, your program may also need to alter its logic if it was relying on specific traits of the screen. Luckily, if the renderer changes, then the virtual screen will also change (even on a MacBook Pro with integrated & discrete GPU switching).<br />
<br />
On macOS, the only supported way to show something on the screen is to use Cocoa (NSWindow / NSView, etc.). Therefore, using NSOpenGLView with NSOpenGLContext is a natural fit. The best part of NSOpenGLView is that it provides an “update” method which you can override in a subclass. Cocoa will call this update method any time the view’s format changes. For example, if you drag a window from a 1x screen to a 2x screen, Cocoa will call your “update” method, because you need to be aware that the format changed. Inside the “update” function, you’re supposed to investigate the current state of the world (including the current renderer / format / virtual screen, etc.), figure out what changed, and react accordingly.<br />
<br />
This means that using the “update” method on NSOpenGLView is how you support Hi-DPI screens. You also should opt-in to Hi-DPI support using wantsBestResolutionOpenGLSurface. If you don’t do this and you’re using a 2x display, your OpenGL content will be rendered at 1x and then stretched across the relevant portion of the 2x display. You can convert between these logical coordinates and the 2x pixel coordinates by using the convert*ToBacking methods on NSView. By default, this stretching happens so calls like glReadPixels() will still work in the default case even without mapping coordinates to their backing equivalent. (Therefore, if you want to support 2x screens, all your calls which interact with pixels directly, like glReadPixels(), will need to be updated.)<br />
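The logical-to-pixel mapping above is just a multiplication by the scale factor. A minimal Python sketch of the conversion (mirroring what NSView’s convertRectToBacking: does, assuming a uniform scale factor):

```python
def to_backing(rect, scale):
    # Convert a rectangle in logical (point) coordinates into pixel
    # coordinates, for use with direct pixel APIs like glReadPixels().
    # Mirrors NSView's convertRectToBacking:, assuming a uniform scale.
    x, y, w, h = rect
    return (x * scale, y * scale, w * scale, h * scale)

# On a 2x display, a 100x100-point view is backed by 200x200 pixels:
print(to_backing((0, 0, 100, 100), 2.0))  # (0.0, 0.0, 200.0, 200.0)
```

The inverse (dividing by the scale) takes pixel coordinates back to points, which is what convertRectFromBacking: does.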
<br />
Similarly, NSOpenGLView has a property which supports wide-gamut color: wantsExtendedDynamicRangeOpenGLSurface. There is an explanatory comment next to this property which describes how normally colors are clipped in the 0.0 - 1.0 range, but if you set this boolean, the maximum clipping value may increase to something larger than 1.0 depending on which monitor you’re using. You can query this by asking the NSScreen for its maximumExtendedDynamicRangeColorComponentValue. Similar to before, the update method should be called whenever anything relevant here changes, thereby giving you an opportunity to investigate what changed and react accordingly.<br />
<br />
However, if you increase the color gamut that your numbers are supposed to span (think: raising that maximum clipping value above 1.0), one of two things will happen:<br />
<ul>
<li>You keep the same number of representable values as before, but spread each representable value farther from its neighbors (so that the same number of representable values spans the larger space)</li>
<li>You add more representable values to keep the density of representable values the same (or higher!) than before.</li>
</ul>
<br />
The first option sucks because the distance between adjacent representable values is fairly close to the minimum perception threshold of our eyes. Therefore, if you increase the distance between adjacent representable values, these “adjacent” colors actually start looking fairly distinct to us humans. The effect becomes obvious if you look at what should be a smooth gradient, because you see bands of solid color instead of the smooth transition.<br />
<br />
The second option sucks because more representable values means more information, which means your numbers have to be held in more bits. More bits means more memory is required.<br />
<br />
Usually, the best solution is to pay for the additional memory. You can repurpose the alpha channel bits for the color channels and go to a 10-bit/10-bit/10-bit/2-bit pixel format, which uses the same amount of memory but gives up alpha fidelity. Or you can go to a half-float (16-bit) pixel format, which doubles your memory use (since each channel was 8 bits before and is now 16 bits). Therefore, if you want to use wide color, you probably want deep color, which means you should be specifying an appropriate deep-color pixel format attribute when you create your OpenGL context. You probably want to specify NSOpenGLPFAColorFloat as well as NSOpenGLPFAColorSize 64. Note that, if you don’t use a floating point pixel format (meaning: you use a regular integral pixel format), you do get additional fidelity, but might not be able to represent values outside of the 0.0 - 1.0 range, depending on how the mapping of the integral units maps to the color space (which I don’t know).<br />
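The memory arithmetic here is easy to check. A small Python sketch comparing the pixel formats discussed above (the variable names are just illustrative labels):

```python
def bytes_per_pixel(channel_bits):
    # channel_bits: per-channel bit depths, e.g. (8, 8, 8, 8) for RGBA8888.
    return sum(channel_bits) // 8

rgba8888 = bytes_per_pixel((8, 8, 8, 8))      # 4 bytes; 2**8  = 256 color steps
rgb10a2  = bytes_per_pixel((10, 10, 10, 2))   # 4 bytes; 2**10 = 1024 color steps
rgba16f  = bytes_per_pixel((16, 16, 16, 16))  # 8 bytes; half floats can exceed 1.0

# The same 4 bytes per pixel buys 4x the color resolution if you cut the
# alpha channel down to 2 bits (4 levels); half floats double the memory.
```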
<br />
There’s one other interesting piece of interesting tech released in the past few years - A MacBook Pro with two GPUs (one integrated and one discrete) will switch between them based on which apps are running and which contexts have been created across the entire system. This switch occurs for all apps, which means that one app can cause the screen to change for all the existing apps. As mentioned before, this means that the renderer inside your OpenGL context could change at an arbitrary time, which means a well-behaved app should listen for these changes and respond accordingly. However, not all existing apps do this, which means that the switching behavior is entirely opt-in. This means that if any app is running which doesn’t understand this switching behavior, the system will simply pick a GPU (the discrete one) and force the entire system to use it until the app closes (or, if more than one naive app is running, until they all close). Therefore, no switches will occur when these apps are running, and the apps can run in peace. However, keeping the discrete GPU running for a long time is a battery drain, so it’s valuable to teach your apps how to react correctly to a GPU switch.<br />
<br />
Unfortunately, I’ve found that Cocoa doesn’t call NSOpenGLView’s “update” method when one of these GPU switches occurs. The switch is modeled in OpenGL as a change of the virtual screen of the OpenGL context. You can listen for a virtual screen change in two possible ways:<br />
<ul>
<li>Add an observer to the default NSNotificationCenter to listen for the NSWindowDidChangeScreenNotification</li>
<li>Use CGDisplayRegisterReconfigurationCallback</li>
</ul>
<br />
If you’re rendering to the screen, then using NSNotificationCenter should be okay because you’re using Cocoa anyway (because the only way to render to the screen is by using Cocoa). There’s no way to associate a CGL context directly with an NSView without going through NSOpenGLContext. If you’re not rendering to the screen, then presumably you wouldn’t care which GPU is outputting to the screen.<br />
<br />
Inside these callbacks, you can simply read the currentVirtualScreen property on NSOpenGLView (or use CGLGetVirtualScreen() - Cocoa will automatically call the setter when necessary). Once you’ve detected a virtual screen change, you should probably re-render your scene because the contents of your view will be stale.<br />
<br />
After you’ve implemented support for switching GPUs, you then have to tell the system that the support exists, so that it won’t take the legacy approach of choosing one GPU for the lifetime of your app. You can do this either by setting NSSupportsAutomaticGraphicsSwitching = YES in your Info.plist inside your app’s bundle, or, if you’re using CGL, you can use the kCGLPFASupportsAutomaticGraphicsSwitching pixel format attribute when you create the context. Luckily, CGLPixelFormatObj and NSOpenGLPixelFormat can be freely converted between (likewise with CGLContextObj and NSOpenGLContext).<br />
<br />
Now that you’ve told the system you know how to switch GPUs, the system won’t force you to use the discrete GPU. However, if you naively create an OpenGL context, you will still get the discrete GPU by default. What you now have is the ability to specify that you would prefer the integrated GPU. You do this by specifying that you would like an “offline” renderer (NSOpenGLPFAAllowOfflineRenderers).<br />
<br />
So far, I’ve discussed how we go about rendering into an NSView. However, there are a few other rendering destinations that we can render into.<br />
<br />
The first is: no rendering destination. This is considered an “offscreen” context. You can create one of these contexts by never setting the context’s view (which NSOpenGLView does for you). One way to do this is to simply create the context with CGL, and then never touch NSOpenGLView.<br />
<br />
Why would you want to do this? Because OpenGL commands you run inside an offscreen context still execute. You can use your newly constructed context to create a framebuffer object, and render to an OpenGL renderbuffer. Then, you can read the results out of the renderbuffer with glReadPixels(). If your goal is rendering a 3D scene, but you aren’t interested in outputting it on a screen, this is the way to do it.<br />
<br />
Another destination is a CoreAnimation layer. In order to do this, you would use a CAOpenGLLayer or NSOpenGLLayer. The layer owns and creates the OpenGL context and pixel format; however, it does this with input from you. The idea is that you would subclass CAOpenGLLayer/NSOpenGLLayer and override the copyCGLPixelFormatForDisplayMask: method (and/or the copyCGLContextForPixelFormat: method). When CoreAnimation wants to create its context, it will call these methods. By supplying the pixel format method, you can specify that, for example, you want an OpenGL version 4 context rather than a version 2 context. Then, when CoreAnimation wants you to render, it will call a draw method which you should override in your subclass and perform any drawing you prefer. By default, it will only ask you to draw in response to setNeedsDisplay, but you can set the “asynchronous” flag to ask CoreAnimation to continually ask you to draw.<br />
<br />
Another destination is an IOSurface. An IOSurface is a buffer which can live in graphics memory which can represent a 2D image. The interesting part of an IOSurface is that it can be shared across process boundaries. If you do that, you have to implement synchronization yourself between the multiple processes. It’s possible to wrap an OpenGL texture around an IOSurface, which means you can render to an IOSurface with render-to-texture. If you create a framebuffer object, create a texture from the IOSurface using CGLTexImageIOSurface2D(), bind the texture to the framebuffer, then render into the framebuffer, the result is that you render into the IOSurface. You can share a handle to the IOSurface by using IOSurfaceCreateXPCObject(). Then, if you manage synchronization yourself, you can have another process read from the IOSurface by locking it with IOSurfaceLock() and getting the pointer to the mapped data with IOSurfaceGetBaseAddressOfPlane(). Alternately, you can set it as the “contents” of an CoreAnimation layer. Or, you could use it in another OpenGL context in the other process.Litherumhttp://www.blogger.com/profile/12738405376090442005noreply@blogger.com1tag:blogger.com,1999:blog-8778351438463999796.post-38094288640285401122016-04-09T19:17:00.001-07:002016-11-15T10:02:17.518-08:00GPU Text Rendering OverviewThere are a few different ways to render text using the GPU. I’ll be discussing a few ways here. All these different ways represent general strategies - they can be mixed and matched. Also, this list may not be comprehensive, but I’ll only discuss the approaches that I’m familiar with. Also, note that I’m only interested in computing coverage here - not color.<br />
<br />
First: a little background on text. Glyphs are just sequences of bezier paths (I’m ignoring “sbix” glyphs and things like that). I’m only interested in TrueType / OpenType fonts, so the form of the glyphs is given in the ‘glyf’ table or the ‘CFF ‘ table. The ‘glyf’ table only describes quadratic bezier curves, while the ‘CFF ‘ table can describe cubic bezier curves. At first, this may sound like a small difference, but it turns out that math involving cubic bezier curves is way more complicated than math involving quadratic bezier curves. (For example: finding the intersections of two cubic bezier curves involves finding roots of a 9th order polynomial - something which humanity is currently unable to compute in closed form.) Also, the winding-order is different for the two formats: the ‘glyf’ table encodes paths with a non-zero winding-order, while the ‘CFF ‘ table encodes paths with an even/odd winding-order. Subpaths are allowed to intersect. This means that you can’t assume that the right-hand-side of a contour is always inside the glyph.<br />
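To make the fill-rule difference concrete, here is a small Python sketch that computes a winding number by ray crossing and applies both rules. It uses straight-line contours for simplicity (real glyph outlines contain curves), with two nested counter-clockwise squares standing in for overlapping subpaths:

```python
def winding_number(q, subpaths):
    # Signed crossings of a rightward horizontal ray from q, summed over
    # every edge of every subpath: upward crossings count +1, downward -1.
    w = 0
    for pts in subpaths:
        for i in range(len(pts)):
            (x0, y0), (x1, y1) = pts[i], pts[(i + 1) % len(pts)]
            if (y0 <= q[1]) != (y1 <= q[1]):        # edge spans the ray's height
                x = x0 + (q[1] - y0) * (x1 - x0) / (y1 - y0)
                if x > q[0]:                        # crossing is right of q
                    w += 1 if y1 > y0 else -1
    return w

def filled_nonzero(q, subpaths):   # 'glyf'-style fill rule
    return winding_number(q, subpaths) != 0

def filled_evenodd(q, subpaths):   # 'CFF '-style fill rule
    return winding_number(q, subpaths) % 2 != 0

# Two nested counter-clockwise squares: the inner region has winding 2,
# so the non-zero rule fills it but the even/odd rule makes it a hole.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]
```

This is exactly why contour direction matters under the non-zero rule but not under the even/odd rule.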
<br />
<h2>
Texture Atlas</h2>
<br />
The first approach is kind of a hack. The text is still “rendered” on the CPU (the same way it has been done since the 70s), but the final image is uploaded into a texture atlas on the GPU. This approach actually makes the first use of a glyph slower (because of the additional upload step); however, subsequent uses of that glyph are much faster.<br />
<br />
If a subsequent use of a glyph is slightly different in position (within a pixel) or size, there are a couple things you can do. If the position is different (one usage has its origin on the left edge of a pixel, and another usage has its origin in the middle), you might be able to get away with simply relying on the GPU’s texture filtering hardware to interpolate the result. If that isn’t good enough for you, you could snap the glyph origins to a few quanta within a pixel, and consider glyphs which differ in this snapped origin to be unique. This approach works similarly for varying glyph sizes - you can either rely on the texture filtering hardware to scale the glyph, or you could snap to a size quanta. (Or both!)<br />
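One way to implement the snapping is to fold the quantized subpixel origin and size into the atlas cache key. A Python sketch (the atlas_key helper and the choice of quanta are hypothetical, purely for illustration):

```python
def atlas_key(glyph_id, size, origin_x, origin_y, quanta=4):
    # Cache key for an atlas entry. The subpixel part of the origin (and
    # the point size) is snapped to 1/quanta increments, so glyph uses
    # whose origins differ by less than half a quantum share one entry.
    # (atlas_key and quanta=4 are illustrative, not from any real API.)
    def snap(value):
        return round(value * quanta) / quanta
    return (glyph_id,
            snap(size),
            snap(origin_x - int(origin_x)),   # subpixel x offset
            snap(origin_y - int(origin_y)))   # subpixel y offset

# Origins 3.01 and 3.05 snap to the same quantum, so they share an entry;
# origin 3.4 snaps to a different quantum and gets its own rasterization.
```

Larger quanta mean fewer distinct rasterizations (less memory, more reuse) at the cost of positioning accuracy.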
<br />
<h2>
Signed Distance Field</h2>
<br />
Valve <a href="http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">published</a> a similar approach which they use in the Team Fortress 2 game. Recall how in the previous approach, the value of each texel is coverage of that texel. Valve’s approach uses the same idea, except that the value of each texel is a “signed distance field.” This means that the value of each texel is a signed distance of closest approach to the boundary of the curve (signed because “inside” values are negative). Using this approach causes bilinear filtering to provide higher-quality results (or, put another way, you can achieve comparable results with fewer texels).<br />
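As a concrete (if simplified) illustration, here is a Python sketch that fills a signed distance field for a circle standing in for a glyph; a real implementation would compute the distance of closest approach to arbitrary bezier contours:

```python
import math

def circle_sdf(cx, cy, r, width, height):
    # Fill a texture where each texel stores the signed distance from the
    # texel center to the outline: negative inside, positive outside.
    # A circle stands in for a glyph because its distance is analytic.
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            d = math.hypot(x + 0.5 - cx, y + 0.5 - cy) - r
            row.append(d)
        field.append(row)
    return field

field = circle_sdf(cx=8.0, cy=8.0, r=5.0, width=16, height=16)
# field[8][8] is negative (inside the shape); field[0][0] is positive.
# At draw time, a sample is "on" wherever the bilinearly-interpolated
# field value is below zero, which keeps edges sharp under magnification.
```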
<br />
A newer approach using signed distance fields has been <a href="https://github.com/Chlumsky/msdfgen">implemented</a> which uses additional color channels in the GPU texture to achieve higher-fidelity results.<br />
<br />
<h2>
Generated Geometry</h2>
<br />
The next approach is to generate geometry which matches the contours closely. GPUs can only render triangle geometry, so this means that this approach requires triangulating the input curve. One way to do this is to choose a constant triangle size. However, a better idea is to increase the triangle density in areas of high complexity, and to decrease the triangle density in areas of low complexity.<br />
<br />
This means you want to subdivide the bezier curves finely where the curve is sharp, and loosely where it is flat. Luckily, the <a href="https://en.wikipedia.org/wiki/De_Casteljau%27s_algorithm">De Casteljau</a> method already does this! If you subdivide a Bezier curve at equal parameter intervals using that method, the subdivision points will be closer together where the curve is sharp, and vice-versa.<br />
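A common way to apply this in practice is recursive subdivision with a flatness test: split the curve with one De Casteljau step at t = 0.5 until the control point is within a tolerance of the chord. A Python sketch for quadratic curves (the flatness criterion is one of several reasonable choices):

```python
def split_quadratic(p0, p1, p2, t=0.5):
    # One De Casteljau step: split a quadratic bezier into two quadratics.
    def lerp(a, b, t):
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)
    q0 = lerp(p0, p1, t)
    q1 = lerp(p1, p2, t)
    r = lerp(q0, q1, t)               # the point on the curve at t
    return (p0, q0, r), (r, q1, p2)

def flatten(p0, p1, p2, tolerance, out):
    # Recurse until the control point sits within `tolerance` of the
    # chord, then emit the chord. Sharp regions recurse deeper, so they
    # automatically receive more (and shorter) line segments.
    dx, dy = p2[0] - p0[0], p2[1] - p0[1]
    chord = (dx * dx + dy * dy) ** 0.5 or 1.0
    # |cross product| / chord = perpendicular distance of p1 from chord.
    deviation = abs((p1[0] - p0[0]) * dy - (p1[1] - p0[1]) * dx)
    if deviation <= tolerance * chord:
        out.append((p0, p2))
        return
    left, right = split_quadratic(p0, p1, p2)
    flatten(*left, tolerance, out)
    flatten(*right, tolerance, out)

segments = []
flatten((0.0, 0.0), (50.0, 100.0), (100.0, 0.0), 1.0, segments)
# `segments` now approximates the arch with chords that join end-to-end.
```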
<br />
Once you’ve done the subdivision, you can use a “<a href="https://en.wikipedia.org/wiki/Constrained_Delaunay_triangulation">Constrained Delaunay Triangulation</a>” to actually run the triangulation. This is similar to a regular Delaunay triangulation, except that it can guarantee that particular edges are present in the triangulation. This means you can guarantee that no triangle will cross a contour. Therefore, each triangle can be considered to be entirely inside or entirely outside the glyph, and can be shaded accordingly.<br />
<br />
<h2>
Stencil Buffer Approach</h2>
<br />
If you don't want to run that triangulation, you can use the GPU’s stencil buffer (or equivalent) to calculate coverage instead. The idea is that you use the subdivision points to model the contours as a sequence of line segments. Then, you pick a point (let’s call it P) way off somewhere (it can be arbitrary), and, for every line segment, form a triangle with that line segment and that point P. When you do that, you’ll have lots of overlapping triangles.<br />
<br />
You can then set up the stencil buffer to say “increment the texel’s counter if the triangle you’re shading has positive area, and decrement the texel’s counter if the triangle you’re shading has negative area” (where “negative area” and “positive area” refer to shading the “front” or “back” of the triangle, and are determined by whether the points are submitted in a clockwise or counter-clockwise direction). If you shade all the triangles like this, all the overlapping triangles cancel out, and you're left with nonzero counters in all the places where the glyph lies. You can then set up the stencil buffer to say “only output a value if the stencil buffer has a nonzero value.” Note that this only works for font files which use the nonzero winding rule.<br />
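The cancellation argument can be checked on the CPU. This Python sketch emulates what the stencil hardware does for a single sample point: one triangle per contour segment and the (arbitrary) far point, with front-facing triangles incrementing the counter and back-facing ones decrementing it.

```python
def signed_area(a, b, c):
    """Twice the signed area of triangle abc; positive when the
    vertices are counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def contains(tri, p):
    """Point-in-triangle test that works for either winding."""
    a, b, c = tri
    d1, d2, d3 = signed_area(a, b, p), signed_area(b, c, p), signed_area(c, a, p)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

def winding_count(contour, sample, far=(-10.0, -3.0)):
    """Emulate the stencil trick per sample: for each contour segment,
    form a triangle with the far point P; count +1 when the triangle
    is wound counter-clockwise and -1 when clockwise."""
    count = 0
    for a, b in zip(contour, contour[1:] + contour[:1]):
        tri = (far, a, b)
        if contains(tri, sample):
            count += 1 if signed_area(*tri) > 0 else -1
    return count

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]  # counter-clockwise
```

For a point inside the square the overlapping triangles leave a count of 1; for a point outside, they cancel to 0, just as the stencil counters do.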
<br />
This approach has the obvious quality/speed tradeoff of the subdivision density. The higher the density, the slower the rendering, but the better-looking the results. Also, you want the subdivision density to be (somewhat) proportional to the font size, since you want the subdivision density to be roughly equal in screen space for all glyphs rendered. Unfortunately, it’s difficult to use this approach to get high-quality rendering without super tiny triangles.<br />
<br />
<h2>
Loop-Blinn Method</h2>
<br />
In the first method, glyph coverage information was represented by a texture. In the third method, glyph coverage information was represented by geometry. Another method (called the <a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch25.html">Loop Blinn method</a>) can represent glyph coverage information by using mathematical formulas. This method tries to represent a particular contour in a way that can be computed inside a fragment shader.<br />
<br />
In particular, in order to draw a contour, you draw a single triangle which encompasses the entire contour. Inside the triangle, you define a scalar field where each point inside the triangle has a scalar value associated with it. You can create this scalar field in such a way that the following attributes hold:<br />
<br />
<ul>
<li>Scalar values which are negative are “inside” the contour, and scalar values which are positive are “outside” the contour</li>
<li>Calculating the scalar value at any point in the field can be done in closed form, knowing only that point’s location within the triangle plus some interpolated information associated with the vertices of the triangle</li>
</ul>
<br />
This means that, given some vertex attributes, you can run some math in a pixel shader which will tell you if the shaded point is inside or outside the contour. So, for each contour, you consider a single triangle which includes the contour, and you then calculate some magic values to associate with each vertex of the triangle. Then, you shade that triangle with a fairly simple pixel shader which computes a closed-form equation to determine the coverage.<br />
<br />
Note that the formulas involved with the Loop-Blinn method are much simpler for quadratic Bezier curves than for cubic Bezier curves. However, the general approach still works for cubic curves - the difference is that the formulas are bigger (and you need to perform an additional classification step).<br />
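To make the quadratic case concrete, here is a CPU-side Python sketch of the standard Loop-Blinn trick for one quadratic segment: the three control points carry the canonical texture coordinates (0,0), (1/2,0), and (1,1), the rasterizer interpolates them (emulated here with barycentric coordinates), and the “fragment shader” evaluates u² − v, which is zero on the curve, negative on the chord side, and positive on the control-point side.

```python
def loop_blinn_quadratic(p, tri):
    """Evaluate the Loop-Blinn implicit for one quadratic segment.
    tri holds the segment's three control points, which carry the
    canonical texture coordinates (0,0), (1/2,0), (1,1). Returns
    u^2 - v at p: zero on the curve, negative on the chord side."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Barycentric coordinates of p in the control triangle -- this is
    # the interpolation the rasterizer would perform for us.
    denom = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    l0 = ((y1 - y2) * (p[0] - x2) + (x2 - x1) * (p[1] - y2)) / denom
    l1 = ((y2 - y0) * (p[0] - x2) + (x0 - x2) * (p[1] - y2)) / denom
    l2 = 1 - l0 - l1
    u = l0 * 0.0 + l1 * 0.5 + l2 * 1.0  # interpolated (u, v)
    v = l0 * 0.0 + l1 * 0.0 + l2 * 1.0
    return u * u - v

# A quadratic from (0,0) to (2,0) with off-curve control point (1,2);
# the curve itself passes through (1,1) at t = 0.5.
tri = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
```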
<br />
Also note that this approach can still use the Constrained Delaunay Triangulation, because you still need to generate triangles that lie entirely within the bounds of the glyph. However, there is no need to do the heuristic subdivision like in the previous method; instead, all the triangles of the mesh are created from the control points of the contours themselves.<br />
<br />
This means that the quality of the curves is defined by mathematical formulas, which means that it is effectively infinitely scalable. In fact, the information in the glyph contours can be losslessly converted to the information used for the Loop-Blinn method.<br />
<br />
Overall, these methods are not monolithic things, and can be used in conjunction with one another. For example, you could use the stencil buffer approach to shade the insides of the glyph, but the Loop-Blinn method to shade the contours themselves (so that you don’t have to do any subdivision). These algorithms represent general approaches, and should be used as the basis for further thought (rather than simply coding them up wholesale).<br />
<br />
Antialiasing with each of these methods is a pretty interesting discussion, but I’ll save that for another post.<br />
<br />
Litherum (http://www.blogger.com/profile/12738405376090442005)<br />
<br />
<h2>
CSS Box Model</h2>
<br />
Click for higher resolution.
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifAw-JZjG197E_H-Hlj392MLtYM7D8_ij24L2cP00J-L5A5I7Wsl2Qd6fV5SRCIaOcRAh4zP87Z8bgNaR885O5b4ryWZiS4xyd6Mtm5pAYzPUDzZ3pxxAHRkVHlqIGrMF67EEhIHXhl28Q/s1600/IMG_0001.jpg" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifAw-JZjG197E_H-Hlj392MLtYM7D8_ij24L2cP00J-L5A5I7Wsl2Qd6fV5SRCIaOcRAh4zP87Z8bgNaR885O5b4ryWZiS4xyd6Mtm5pAYzPUDzZ3pxxAHRkVHlqIGrMF67EEhIHXhl28Q/s320/IMG_0001.jpg" style="height: 100%; width: 100%;" /></a></div>
<br />
<h2>
Color Blending</h2>
<br />
Color blending is something which is done all the time. Your computer is doing it right now. Literally. However, it requires a little bit of thought to get it right.<br />
<br />
There are four pieces involved when blending:<br />
<ul>
<li>Source color</li>
<li>Coverage information (alpha)</li>
<li>Destination color</li>
<li>A working color space</li>
</ul>
<br />
We can all agree on what coverage is. You simply model each sample as a square, or rectangle, or circle, and decide on how much of that area is covered by the foreground. Therefore, it is fractional and unitless (because it’s a ratio). It can never be greater than 1 or less than 0. It is associated with the foreground because it represents the geometry in the foreground. You may have more than one value per sample (for example, if you are interested in each sub-pixel individually).<br />
<br />
The working color space is the color space that our blending computations are performed in. This color space must be a linear color space. This means that, if you take a number which represents one channel of a color and double it, the light intensity it represents must also exactly double.<br />
<br />
sRGB is not a linear color space. However, we can come up with a conceptual “linearized sRGB” color space which uses the same primaries as sRGB and the same 0-point and 1-point, but uses linear interpolation between 0 and 1. Converting from sRGB into this new color space is simply raising each value to the 2.2 power. (The conversion is actually a little more complicated than that - it uses a piecewise function - but we’re only discussing the conceptual model here.)<br />
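For reference, the actual piecewise transfer functions look like this (a Python sketch; the constants come from the sRGB specification):

```python
def srgb_to_linear(c):
    """Decode one sRGB channel value (0..1) to linear light.
    This is the piecewise function approximated by pow(c, 2.2)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    """Encode one linear channel value (0..1) back to sRGB."""
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
```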
<br />
So, the first step is to convert all our colors into this working color space. Then, each blending operation is performed by taking a weighted average of the color primaries’ values, using the coverage information as a weight. Successive blending operations are performed back-to-front.<br />
<br />
The formula for this weighted average is:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Source Primary * Source Alpha + Destination Primary * (1 - Source Alpha)</span><br />
<br />
You can see that if the alpha is 0, the result is equal to the destination, and if the alpha is 1, the result is equal to the source. This is simply a linear interpolation between the two.<br />
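In code, the non-premultiplied blend is just a per-channel lerp (a Python sketch, assuming all values are already in the linear working space):

```python
def blend(src, src_alpha, dst):
    """Back-to-front 'over' for non-premultiplied colors: a linear
    interpolation between destination and source, per channel."""
    return tuple(s * src_alpha + d * (1 - src_alpha)
                 for s, d in zip(src, dst))

src = (0.25, 0.5, 0.75)
dst = (1.0, 0.0, 0.5)
```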
<br />
Now, it turns out that the requirement of rendering items from back-to-front is greatly constraining. It means that if we have a whole bunch of items to blend together, we can’t precompute the result of blending certain items together, and then blend those with a background. This is because the formula above is not associative.<br />
<br />
However, a very similar formula is associative:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Source Primary + Destination Primary * (1 - Source Alpha)</span><br />
<br />
The only difference between this formula and the original is the replacement of “Source Primary * Source Alpha” with “Source Primary.”<br />
<br />
Well, let’s come up with a new concept, called a “premultiplied color.” This is the same thing as a regular color, except the values in the primaries’ channels have already been multiplied by the alpha of the color. This is possible because the color primaries’ values and the alpha channel have the same lifetime, so we can perform the multiplication at the time when this object is created.<br />
<br />
Well, we can see that if we use these objects in the associative formula, we get the same answer as before (because the new “Source Primary” is equal to the old “Source Primary” times the Source Alpha; this multiplication is performed inside the “Premultiplied Color” object). However, we get the benefit of using an associative formula.<br />
<br />
Therefore, with premultiplied colors, you can blend in any order. It’s worth noting that using premultiplied colors is not a requirement - if you blend out-of-order with premultiplied colors, you will get the exact same result as if you had blended back-to-front with non-premultiplied colors. This also means that you can start blending out-of-order, but if you notice that you have blended all the deepest items, you can transition to non-premultiplied blending halfway through all your blending operations. The answer will be the same.<br />
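The associativity claim is easy to verify numerically. In this Python sketch, colors are stored premultiplied, with alpha appended as a fourth channel (conveniently, the alpha channel obeys the very same formula):

```python
def premultiply(color, alpha):
    """Store the color with its channels already scaled by alpha."""
    return tuple(c * alpha for c in color) + (alpha,)

def over(src, dst):
    """Premultiplied 'over'. Because the source term is no longer
    scaled at blend time, this operator is associative."""
    return tuple(s + d * (1 - src[-1]) for s, d in zip(src, dst))

a = premultiply((0.9, 0.2, 0.1), 0.5)
b = premultiply((0.1, 0.8, 0.3), 0.25)
c = premultiply((0.0, 0.0, 1.0), 1.0)  # opaque background

back_to_front = over(a, over(b, c))
front_to_back = over(over(a, b), c)
```

Up to floating-point rounding, both groupings produce the same composite, which is exactly what lets you precompute partial blends.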
<br />
By contrast, using a linear working color space is a hard requirement. If you don’t do this, your math will yield values which are meaningless. Once you’re done with all your blending, you usually want to write out the output in a well-known colorspace (such as sRGB), which means you usually have to un-linearize the result just before output.<br />
<br />
Because of this, linearization / unlinearization should be the first and last steps. Premultiplication and unpremultiplication should be the second and second-to-last steps (if they are used at all). Premultiplication is optional and you can even unpremultiply halfway through your calculations if some conditions are met.<br />
<br />
Note that linearizing / unlinearizing sRGB can have some pretty dramatic results. For example, if you blend pure black and pure white (technically “sRGB black” and “sRGB white”) with 50% alpha, you end up with your (resulting sRGB) primaries having values of 74%, nowhere near the 50% you would get if you performed the same calculation (incorrectly) in the non-linear sRGB space.
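That 74% figure is easy to reproduce (a Python sketch using the piecewise sRGB encode):

```python
def linear_to_srgb(c):
    # Piecewise sRGB encode (the inverse of linearization).
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

# sRGB black and white are also 0.0 and 1.0 in linear light, so
# blending them at 50% alpha in the linear space gives 0.5 ...
linear_result = 0.0 * 0.5 + 1.0 * 0.5
# ... which encodes back to roughly 74% in sRGB.
print(round(linear_to_srgb(linear_result), 3))  # → 0.735
```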