Monday, March 18, 2024

So I wrote a double delete...

I wrote a double delete. Actually, it was a double autorelease. Here's a fun story describing the path it took to figure out the problem.

I'm essentially writing a plugin to a different app, and the app I'm plugging in to is closed-source. So, I'm making a dylib which gets loaded at runtime. I'm doing this on macOS.

The Symptom

When running my code, I'm seeing this:


Let's see what we can learn from this.

First, it's a crash. We can see we're accessing memory that we shouldn't be accessing.

Second, it's inside objc_release(). We can use a bit of deductive reasoning here: If the object we're releasing has a positive retain count, then the release shouldn't crash. Therefore, either we're releasing something that isn't an object, or we're releasing something that has a retain count of 0 (meaning: a double release).

Third, we can actually read a bit of the assembly to understand what's happening. The first two instructions are just a way to check if %rdi is null, and, if so, jump to an address that's later in the function. Therefore, we can deduce that %rdi isn't null.

%rdi is interesting because it's the register that holds the first argument. It's probably a safe assumption that objc_release() takes a single argument, that the argument is a pointer, and that the pointer is stored in %rdi. This assumption is somewhat validated by reading the assembly: nothing seems to be using any of the other parameter registers.

The next 3 lines check if the low bit in %rdi is 1 or not. If it's 1, then we again jump to an address that's later in the function. Therefore, we can deduce that %rdi is an even number (its low bit isn't 1).

The next 3 lines load a value that %rdi is pointing to, and mask off most of its bits. The next line, which is the line that's crashing, is trying to load the value that the result points to.

All this makes total sense: Releasing a null pointer should do nothing, and releasing tagged pointers (which I'm assuming are marked by having their low bit set to 1) should do nothing as well. If the argument is an Objective-C object, it looks like we're trying to load the isa pointer, which probably holds something useful at offset 0x20. That's the point where we're crashing.

That leads to the deduction: Either the thing we're trying to release isn't an Objective-C object, or it's already been released, and the release procedure clears (or somehow poisons) the isa value, which caused this crash. Either way, we're releasing something that we shouldn't be releasing.

One of the really useful observations about the assembly is that nothing before the crash point clobbers the value of %rdi. This means that a pointer to the object that's getting erroneously released is *still* in %rdi at the crash site.

We can also see that the crash is happening inside AutoreleasePool:

This doesn't indicate much - just that we're autoreleasing the object instead of releasing it directly. It also means that, because autorelease is delayed, we can't see anything useful in the stack trace. (If we were releasing directly instead of autoreleasing, we could see exactly what caused it in the stack trace.)

The First Thing That Didn't Work

The most natural solution would be "Let's use Instruments!" It's supposed to have a tool that shows all the retain stacks and release stacks for every object.

When running with Instruments, we get a nice crash popup showing us that we crashed:

The coolest part about this is that it shows us the register state at the crash site, which gives us %rdi, the pointer to the object getting erroneously released.

Cool, so the object which is getting erroneously released is at 0x600002a87b40. Let's see what Instruments lists for that address:

Well, it didn't list anything for that address. It listed something for an address just before it, and just after it, but not what we were looking for. Thanks for nothing, Instruments.

The Second Thing That Didn't Work

Well, I'm allocating and destroying objects in my own code. Why don't I try logging all of my own objects to see where they all get retained and released! Hopefully, by cross-referencing the address of the object that gets erroneously deleted with the logged locations of my own objects, I'll be able to tell what's going wrong.

We can do this by overriding the -[NSObject release] and -[NSObject retain] calls:

As well as init / dealloc:
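A minimal sketch of that kind of logging, using runtime swizzling (method_setImplementation) instead of a category override - the function names are made up, and retain, init, and dealloc would follow the same pattern:

#include <objc/runtime.h>
#include <stdio.h>

static IMP originalRelease;

static void myRelease(id self, SEL _cmd) {
    // Log the pointer and class, then forward to the original -[NSObject release].
    printf("release %p (%s)\n", (void *)self, class_getName(object_getClass(self)));
    ((void (*)(id, SEL))originalRelease)(self, _cmd);
}

__attribute__((constructor))
static void installReleaseLogging(void) {
    Method releaseMethod = class_getInstanceMethod(objc_getClass("NSObject"), sel_registerName("release"));
    originalRelease = method_setImplementation(releaseMethod, (IMP)myRelease);
}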

Unfortunately, this spewed out a bunch of logging, but the only thing it told me was that the object being erroneously released wasn't one of my own objects. It must be some other object (an NSString, NSArray, etc.).

The Third Thing That Didn't Work

Okay, we know the object is being erroneously autoreleased. Why don't we log some useful information every time anyone autoreleases anything? We can add a symbolic breakpoint on -[NSObject autorelease].

Here's what it looks like when this breakpoint is hit:

Interesting - so it looks like all calls to -[NSObject autorelease] are immediately redirected to _objc_rootAutorelease(). The self pointer is preserved as the value of the first argument.

If you list the registers at the time of the call, you can see the object being released:

So let's modify the breakpoint to print all the information we're looking for:

Unfortunately, this didn't work because it was too slow. Every time lldb evaluates something, it takes a while, and this was evaluating 3 things every time anybody wanted to autorelease anything, which is essentially all the time. The closed-source application I'm debugging is sensitive enough that, if anything takes too long, it just quits.

The Fourth Thing That Didn't Work

Let's try to print out the same information as before, but do it inside the application rather than in lldb. That way, it will be much faster.

The way we can do this is with something called "function interposing." This uses a feature of dyld which can replace a library's function with your own. Note that this only works if you disable SIP and set the nvram variable amfi_get_out_of_my_way=0x1 and reboot.

We can do this to swap out all calls to _objc_rootAutorelease() with our own function.
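A minimal sketch of the interposing mechanism, using the standard __DATA,__interpose pattern from dyld's headers (since _objc_rootAutorelease() isn't in the public headers, we declare it ourselves):

#include <objc/objc.h>

// _objc_rootAutorelease() isn't in the public headers, so declare it ourselves.
extern "C" id _objc_rootAutorelease(id object);

static id my_objc_rootAutorelease(id object) {
    // ... record `object` in our bookkeeping here ...
    // dyld doesn't rewrite calls made from the interposing dylib itself, so this calls the real one.
    return _objc_rootAutorelease(object);
}

// The (replacement, original) pair in the __DATA,__interpose section is what tells dyld
// to redirect every other image's calls to _objc_rootAutorelease() into our function.
__attribute__((used)) static struct {
    const void *replacement;
    const void *replacee;
} interposers[] __attribute__((section("__DATA,__interpose"))) = {
    { (const void *)my_objc_rootAutorelease, (const void *)_objc_rootAutorelease },
};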

Inside our own version of _objc_rootAutorelease(), we want to keep track of everything that gets autoreleased. So, let's keep track of a global dictionary, from pointer value to info string.

We can initialize this dictionary inside a "constructor," which is a special function in a dylib which gets run when the dylib gets loaded by dyld. This is a great way to initialize a global.

Inside my_objc_rootAutorelease(), we can just add information to the dictionary. Then, when the crash occurs, we can print the dictionary and find information about the thing that was autoreleased.
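A sketch of what the global dictionary and its constructor might look like. The names are made up, and this version uses a CF dictionary keyed by raw pointer value; the actual code used an NSMutableDictionary mapping pointer values to info strings:

#include <CoreFoundation/CoreFoundation.h>
#include <objc/objc.h>

static CFMutableDictionaryRef bigDict;

// A "constructor" runs when dyld loads the dylib, so it's a convenient place to set up globals.
__attribute__((constructor))
static void createBigDict(void) {
    // NULL key callbacks: the keys are raw pointer values, not retained objects.
    bigDict = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, NULL, &kCFTypeDictionaryValueCallBacks);
}

static void recordAutorelease(id object) {
    // Hypothetical info string; the real version recorded the class name and more.
    CFStringRef info = CFStringCreateWithFormat(kCFAllocatorDefault, NULL, CFSTR("autoreleased %p"), (void *)object);
    CFDictionarySetValue(bigDict, (const void *)object, info);
    CFRelease(info);
}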

However, something is wrong...

The dictionary only holds 315 items. That can't possibly be right - it's inconceivable that only 315 things got autoreleased.

The Fifth Thing That Didn't Work

We're close - we just need to figure out why so few things got autoreleased. Let's verify our assumption that [foo autorelease] actually calls _objc_rootAutorelease() by writing such code and looking at its disassembly.

And if you look at the disassembly...

You can see 2 really interesting things: the call to alloc and init got compressed into a single C call to objc_alloc_init(), and the call to autorelease got compressed into a single C call to objc_autorelease(). I suppose the Objective-C compiler knows about the autorelease message, and is smart enough to not invoke the entire objc_msgSend() infrastructure for it, but instead just emits a raw C call for it. So that means we've interposed the wrong function - we were interposing _objc_rootAutorelease() when we should have been interposing objc_autorelease(). So let's interpose both:

This, of course, almost worked - we just have to be super sure that my_objc_autorelease() doesn't accidentally call autorelease on any object - that would cause infinite recursion.

The Sixth Thing That Didn't Work

Avoiding calling autorelease inside my_objc_autorelease() is actually pretty much impossible, because anything interesting you could log about an object will, almost necessarily, call autorelease. Remember that we're logging information about literally every object which gets autoreleased, which is, in effect, every object in the entire world. Even if you just call NSStringFromClass([object class]), that will still cause something to be autoreleased.

So, the solution is to set some global state for the duration of the call to my_objc_autorelease(). If we see a call to my_objc_autorelease() while the state is set, that means we're autoreleasing inside being autoreleased, and we can skip our custom logic and just call the underlying objc_autorelease() directly. However, there's a caveat: this "global" state can't actually be global, because Objective-C objects are created and retained and released on every thread, which means this state has to be thread-local. Therefore, because we're writing in Objective-C and not C++, we must use the pthreads API. The pthreads thread-specific API uses a "key" which has to be set up once, so we can do that in our constructor:

Then we can use pthread_setspecific() and pthread_getspecific() to determine if our calls are being nested.
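Here's a minimal sketch of that guard, with made-up names; the key is created in a constructor, and the real logging goes where the comment is:

#include <pthread.h>
#include <objc/objc.h>

extern "C" id objc_autorelease(id object);

static pthread_key_t inMyAutoreleaseKey;

__attribute__((constructor))
static void createThreadLocalKey(void) {
    pthread_key_create(&inMyAutoreleaseKey, NULL);
}

static id my_objc_autorelease(id object) {
    if (pthread_getspecific(inMyAutoreleaseKey) != NULL) {
        // We're already inside our own logging on this thread: don't recurse, just forward.
        return objc_autorelease(object);
    }
    pthread_setspecific(inMyAutoreleaseKey, (void *)1);
    // ... build and record the info string; anything here is allowed to autorelease ...
    pthread_setspecific(inMyAutoreleaseKey, NULL);
    return objc_autorelease(object);
}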

Except this still didn't actually work, because abort() is being called...

The Seventh Thing That Didn't Work

Luckily, when abort() is called, Xcode shows us a pending Objective-C exception:

Okay, something is being set to nil when it shouldn't be. Let's set an exception breakpoint to see what is being set wrong:

Welp. It turns out NSStringFromClass([object class]) can sometimes return nil...

The Eighth Thing That Worked

Okay, let's fix that by checking for nil and using [NSNull null]. Now, the program actually crashes in the right place!

That's more like it. Let's see what the pointer we're looking for is...

Okay, let's look for it in bigDict!

Woohoo! Finally some progress. The object being autoreleased is an NSDictionary.

That's not enough, though. What we really want is a backtrace. We can't use lldb's backtrace because it's too slow, but luckily macOS has a backtrace() function which gives us backtrace information! Let's build a string out of the backtrace information:

Welp, that's too slow - the program exits. Let's try again by setting frameCount to 6:
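Here's a sketch of that backtrace capture with the small, fixed frame count (the function name and buffer handling are assumptions; strlcat is used because it's available on macOS):

#include <execinfo.h>
#include <string.h>
#include <stdlib.h>

static void copyShortBacktrace(char *buffer, size_t bufferSize) {
    const int frameCount = 6;
    void *frames[frameCount];
    int captured = backtrace(frames, frameCount);
    char **symbols = backtrace_symbols(frames, captured);
    buffer[0] = '\0';
    for (int i = 0; symbols != NULL && i < captured; ++i) {
        strlcat(buffer, symbols[i], bufferSize);
        strlcat(buffer, "\n", bufferSize);
    }
    free(symbols);
}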

So here is the final autorelease function:

Okay, now let's run it, and print out the object we're interested in:

And the bigDict:

Woohoo! It's a great success! Here's the stack trace, formatted:

Excellent! This was enough for me to find the place where I had over-released the object.

Tuesday, November 28, 2023

Nvidia SLI from Vulkan's Point of View

SLI is an Nvidia technology, which (is supposed to) allow multiple GPUs to act as one. The use case is supposed to be simple: you turn it on, and everything gets faster. However, that's not how it works in Vulkan (because of course it isn't - nothing is simple in Vulkan). So let's dig in and see exactly how it works and what's exposed in Vulkan.

Logical Device Creation

SLI is exposed in Vulkan with 2 extensions, both of which have been promoted to core in Vulkan 1.1: VK_KHR_device_group_creation, and VK_KHR_device_group. The reason there are 2 is esoteric: one is an "instance extension" and the other is a "device extension." Because enumerating device groups has to happen before you actually create a logical device, those enumeration functions can't be part of a device extension, so they're part of the instance extension instead. The instance extension is really small - it essentially just lets you list device groups, and for each group, list the physical devices inside it. When you create your logical device, you just list which physical devices should be part of the new logical device.
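A rough sketch of that flow (the helper name is made up, error handling is omitted, and the usual queue/extension setup in VkDeviceCreateInfo is assumed to happen elsewhere):

#include <vulkan/vulkan.h>
#include <vector>

VkDevice createGroupDevice(VkInstance instance, const VkDeviceCreateInfo &baseCreateInfo) {
    // List the device groups, and the physical devices inside the first one.
    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);
    std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
    for (auto &group : groups)
        group.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

    // Ask for a logical device that spans every physical device in the group.
    VkDeviceGroupDeviceCreateInfo groupInfo = {};
    groupInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_DEVICE_CREATE_INFO;
    groupInfo.physicalDeviceCount = groups[0].physicalDeviceCount;
    groupInfo.pPhysicalDevices = groups[0].physicalDevices;

    VkDeviceCreateInfo createInfo = baseCreateInfo; // queues, extensions, etc. set up by the caller
    createInfo.pNext = &groupInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(groups[0].physicalDevices[0], &createInfo, nullptr, &device);
    return device;
}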

Now that you've created your logical device, there are a few different pieces to how this stuff works.

Beginning of the Frame

At the beginning of your frame, you would normally call vkAcquireNextImageKHR(), which schedules a semaphore to be signaled when the next swapchain image is "acquired" (which means "able to be rendered to"). (The rest of your rendering is supposed to wait on this semaphore to be signaled.) VK_KHR_device_group replaces this function with vkAcquireNextImage2KHR(), which adds a single parameter: a "device mask" of which physical devices in the logical device should be ready before the semaphore is signaled.
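A minimal sketch, assuming a two-GPU group where a device mask of 0b11 means "both physical devices":

#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t acquireOnAllDevices(VkDevice device, VkSwapchainKHR swapchain, VkSemaphore acquireSemaphore) {
    VkAcquireNextImageInfoKHR acquireInfo = {};
    acquireInfo.sType = VK_STRUCTURE_TYPE_ACQUIRE_NEXT_IMAGE_INFO_KHR;
    acquireInfo.swapchain = swapchain;
    acquireInfo.timeout = UINT64_MAX;
    acquireInfo.semaphore = acquireSemaphore;
    acquireInfo.fence = VK_NULL_HANDLE;
    acquireInfo.deviceMask = 0b11; // the semaphore signals once the image is ready on both GPUs

    uint32_t imageIndex = 0;
    vkAcquireNextImage2KHR(device, &acquireInfo, &imageIndex);
    return imageIndex;
}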

It took me a while to figure this out, but each physical device gets its own distinct contents of the swapchain image. When you write your Vulkan program, and you bind a swapchain image to a framebuffer, that actually binds n different contents - one on each physical device. When a physical device executes and interacts with the image, it sees its own independent contents of the image.

End of the Frame

At the end of the frame, you'll want to present, and this is where things get a little complicated. Each physical device in the logical device may or may not have a "presentation engine" in it. Also, recall that each physical device has its own distinct contents of the swapchain image.

There are 4 different presentation "modes" (VkDeviceGroupPresentModeFlagBitsKHR). Your logical device will support some subset of these modes. The 4 modes are:

  1. Local presentation: Any physical device with a presentation engine can present, but it can only present the contents of its own image. When you present, you tell Vulkan which physical device and image to present (VkDeviceGroupPresentInfoKHR).
  2. Remote presentation: Any physical device with a presentation engine can present, and it can present contents from other physical devices. Vulkan exposes a graph (vkGetDeviceGroupPresentCapabilities()) that describes which physical devices can present from which other physical devices in the group. When you present, you tell Vulkan which image to present, and there's a requirement that some physical device with a presentation engine is able to present the image you selected.
  3. Sum presentation: Any physical device with a presentation engine can present, and it presents the component-wise sum of the contents of the image from multiple physical devices. Again, there's a graph that indicates, for each physical device that has a presentation engine, which other physical devices it's able to sum from. When you present, you specify which physical devices' contents to sum, via a device mask (and there's a requirement that there is some physical device with a presentation engine that can sum from all of the requested physical devices).
  4. Local multi-device presentation: Different physical devices (with presentation engines) can present different disjoint rects of their own images, which get merged together to a final image. You can tell which physical devices present which rects by calling vkGetPhysicalDevicePresentRectanglesKHR(). When you present, you specify a device mask, which tells which physical devices present their rects.

On my machine, only the local presentation mode is supported, and both GPUs have presentation engines. That means the call to present gets to pick (VkDeviceGroupPresentInfoKHR) which of the two image contents actually gets presented.
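A sketch of that present call in local mode (the function name is made up; deviceMask has exactly one bit set, picking which GPU's contents get shown):

#include <vulkan/vulkan.h>

void presentFromDevice(VkQueue queue, VkSwapchainKHR swapchain, uint32_t imageIndex,
                       VkSemaphore waitSemaphore, uint32_t deviceMask) {
    VkDeviceGroupPresentInfoKHR groupPresentInfo = {};
    groupPresentInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_PRESENT_INFO_KHR;
    groupPresentInfo.swapchainCount = 1;
    groupPresentInfo.pDeviceMasks = &deviceMask; // e.g. 0b01 for GPU 0, 0b10 for GPU 1
    groupPresentInfo.mode = VK_DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR;

    VkPresentInfoKHR presentInfo = {};
    presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    presentInfo.pNext = &groupPresentInfo;
    presentInfo.waitSemaphoreCount = 1;
    presentInfo.pWaitSemaphores = &waitSemaphore;
    presentInfo.swapchainCount = 1;
    presentInfo.pSwapchains = &swapchain;
    presentInfo.pImageIndices = &imageIndex;

    vkQueuePresentKHR(queue, &presentInfo);
}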

Middle of the Frame

The commands in the middle of the frame are actually probably the most straightforward. When you begin a command buffer, you can specify a device mask (VkDeviceGroupCommandBufferBeginInfo) of which physical devices will execute the command buffer. Inside the command buffer, when you start a render pass, you can also specify another device mask (VkDeviceGroupRenderPassBeginInfo) for which physical devices will execute the render pass, as well as assigning each physical device its own distinct "render area" rect. Inside the render pass, you can call vkCmdSetDeviceMask() to change the set of currently running physical devices. In your SPIR-V shader, there's even a built-in intrinsic "DeviceIndex" to tell you which GPU in the group you're running on. And then, finally, when you actually submit the command buffer, you can supply (VkDeviceGroupSubmitInfo) a device mask you want to submit the command buffers to.
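Here's a rough sketch of how those masks fit together for a two-GPU group (the helper function is made up, and the actual rendering commands are elided):

#include <vulkan/vulkan.h>

void recordAndSubmit(VkCommandBuffer commandBuffer, VkQueue queue) {
    VkDeviceGroupCommandBufferBeginInfo groupBeginInfo = {};
    groupBeginInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_COMMAND_BUFFER_BEGIN_INFO;
    groupBeginInfo.deviceMask = 0b11; // both physical devices execute this command buffer

    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.pNext = &groupBeginInfo;
    vkBeginCommandBuffer(commandBuffer, &beginInfo);

    vkCmdSetDeviceMask(commandBuffer, 0b01); // commands recorded now only run on GPU 0
    // ... GPU-0-only work ...
    vkCmdSetDeviceMask(commandBuffer, 0b11); // back to both GPUs
    // ... shared work ...

    vkEndCommandBuffer(commandBuffer);

    uint32_t submitDeviceMask = 0b11;
    VkDeviceGroupSubmitInfo groupSubmitInfo = {};
    groupSubmitInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_SUBMIT_INFO;
    groupSubmitInfo.commandBufferCount = 1;
    groupSubmitInfo.pCommandBufferDeviceMasks = &submitDeviceMask;

    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.pNext = &groupSubmitInfo;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &commandBuffer;
    vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
}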

There's even a convenience vkCmdDispatchBase() which lets you set "base" values for workgroup IDs, which is convenient if you want to spread one workload across multiple GPUs. Pipelines have to be created with VK_PIPELINE_CREATE_DISPATCH_BASE_KHR to use this, though.

Resources

It's all well and good to have multiple physical devices executing the same command buffer, but execution alone is not enough: you also need to bind resources to those shaders and commands that get run.

When allocating a resource, there are 2 ways for it to happen: either each physical device gets its own distinct contents of the allocation, or all the physical devices share a single contents. If the allocation's heap is marked as VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHR, then all allocations will be replicated distinctly across each of the physical devices. Even if the heap isn't marked that way, the individual allocation can still be marked that way (VkMemoryAllocateFlagsInfo). On my device, the GPU-local heap is marked as multi-instance.

Communication can happen between the devices by using Vulkan's existing memory binding infrastructure. Recall that, in Vulkan, you don't just create a resource; instead, you make an allocation, and a resource, and then you separately bind the two together. Well, it's possible to bind a resource on one physical device with an allocation on a different physical device (VkBindBufferMemoryDeviceGroupInfo, VkBindImageMemoryDeviceGroupInfo)! When you make one of these calls, it will execute on all the physical devices, so these structs indicate the graph of which resources on which physical devices get bound to which allocations on which (other) physical devices. For textures, you can even be more fine-grained than this, and bind just a region of a texture across physical devices (assuming you created the image with VK_IMAGE_CREATE_SPLIT_INSTANCE_BIND_REGIONS_BIT_KHR). This also works with sparse resources - when you bind a sparse region of a texture, that sparse region can come from another physical device, too (VkDeviceGroupBindSparseInfo).
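Here's a sketch for the buffer case, assuming a two-device group and an allocation whose heap is multi-instance; pDeviceIndices[i] names the device whose memory instance device i binds to:

#include <vulkan/vulkan.h>

void bindBufferCrossDevice(VkDevice device, VkBuffer buffer, VkDeviceMemory multiInstanceMemory) {
    // Device 0 binds to the memory instance living on device 1, and vice versa.
    const uint32_t deviceIndices[2] = { 1, 0 };

    VkBindBufferMemoryDeviceGroupInfo groupBindInfo = {};
    groupBindInfo.sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_DEVICE_GROUP_INFO;
    groupBindInfo.deviceIndexCount = 2;
    groupBindInfo.pDeviceIndices = deviceIndices;

    VkBindBufferMemoryInfo bindInfo = {};
    bindInfo.sType = VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_INFO;
    bindInfo.pNext = &groupBindInfo;
    bindInfo.buffer = buffer;
    bindInfo.memory = multiInstanceMemory;
    bindInfo.memoryOffset = 0;

    vkBindBufferMemory2(device, 1, &bindInfo);
}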

Alas, there are restrictions. vkGetDeviceGroupPeerMemoryFeatures() tells you, once you've created a resource and bound it to an allocation on a different physical device, how you're allowed to use that resource. For each combination of (heap index, local device index, and remote device index), a subset of 4 possible uses will be allowed (VkPeerMemoryFeatureFlagBits):

  1. The local device can copy to the remote device
  2. The local device can copy from the remote device
  3. The local device can read the resource directly
  4. The local device can write to the resource directly

This is really exciting - if either of the bottom two uses is allowed, it means you can bind one of these cross-physical-device resources to a shader and use it as-if it were any normal resource! Even if neither of the bottom two uses is allowed, just being able to copy between devices without having to round-trip through main memory is already cool. On my device, only the first 3 uses are allowed.

Swapchain Resources

Being able to bind a new resource to a different physical device's allocation is good, but swapchain images come pre-bound, which means that mechanism won't work for swapchain resources. So there's a new mechanism for that: it's possible to bind a new image to the storage for an existing swapchain image (VkBindImageMemorySwapchainInfoKHR). This can be used in conjunction with VkBindImageMemoryDeviceGroupInfo which I mentioned above, to make the allocations cross physical devices.

So, if you want to copy from one physical device's swapchain image to another physical device's swapchain image, what you'd do is the following (there's a sketch of step 2 after the list):

  1. Create a new image (of the right size, format, etc.). Specify VkImageSwapchainCreateInfoKHR to indicate its storage will come from the swap chain.
  2. Bind it (VkBindImageMemoryInfo), but use both...
    1. VkBindImageMemorySwapchainInfoKHR to have its storage come from the swap chain, and
    2. VkBindImageMemoryDeviceGroupInfo to specify that its storage comes from another physical device's swap chain contents
  3. Execute a copy command to copy from one image to the other image.
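Here's a sketch of step 2 for a two-GPU group (the function name is made up, and using pDeviceIndices to select the peer's instance of the swapchain storage is my reading of the extension, so treat this as a sketch):

#include <vulkan/vulkan.h>

void bindImageToPeerSwapchainStorage(VkDevice device, VkImage image,
                                     VkSwapchainKHR swapchain, uint32_t imageIndex,
                                     uint32_t peerDeviceIndex) {
    // Every device in the group binds to the peer's instance of the swapchain image.
    const uint32_t deviceIndices[2] = { peerDeviceIndex, peerDeviceIndex };

    VkBindImageMemoryDeviceGroupInfo groupInfo = {};
    groupInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_DEVICE_GROUP_INFO;
    groupInfo.deviceIndexCount = 2;
    groupInfo.pDeviceIndices = deviceIndices;

    VkBindImageMemorySwapchainInfoKHR swapchainInfo = {};
    swapchainInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_SWAPCHAIN_INFO_KHR;
    swapchainInfo.pNext = &groupInfo;
    swapchainInfo.swapchain = swapchain;
    swapchainInfo.imageIndex = imageIndex;

    VkBindImageMemoryInfo bindInfo = {};
    bindInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO;
    bindInfo.pNext = &swapchainInfo;
    bindInfo.image = image;
    bindInfo.memory = VK_NULL_HANDLE; // storage comes from the swapchain, not a VkDeviceMemory

    vkBindImageMemory2(device, 1, &bindInfo);
}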

Conclusion

It's a pretty complicated system! Certainly much more complicated than SLI is in Direct3D. It seems like there are 3 core benefits of device groups:

  1. You can execute the same command stream on multiple devices without having to re-encode it multiple times or call into Vulkan multiple times for each command. The device masks implicitly duplicate the execution.
  2. There are a variety of presentation modes, which allow automatic merging of rendering results, without having to explicitly execute a render pass or a compute shader to merge the results. Unfortunately, my cards don't support this.
  3. Direct physical-device-to-physical-device communication, without round-tripping through main memory. Indeed, for some use cases, you can just bind a remote resource and use it as-if it was local. Very cool!

I'm not quite at the point where I can run some benchmarks to see how much SLI improves performance over simply creating two independent Vulkan logical devices. I'm working on a ray tracer, so there are a few different ways of joining the rendering results from the two GPUs. To avoid seams, the denoiser will probably have to run on just one of the GPUs.

Sunday, October 29, 2023

My First Qt App

Just for fun, I wanted to try to make a Qt app that graphs some data. For contrast, I'm aware of Swift Charts, and I thought using Qt to graph some stuff would be a fun little project, now that I'm using FreeBSD full time, rather than macOS. The latest version of Qt is version 6, so that's what I'll be using.

Basics

When you use Qt Creator to create a new Qt project, it only creates 4 files:

  • CMakeLists.txt
  • CMakeLists.txt.user
  • main.cpp
  • Main.qml

Qt Creator can understand CMakeLists.txt directly - if you want to open the "project," you open that file. Just like with Cocoa programming, main.cpp doesn't contain much inside it - it's just a few lines long, and it initializes the existing infrastructure to load the app's UI.

Also, like Cocoa programming, most of the description of the UI of the app is described declaratively, in the .qml file. The way this works is you say something like:

Foo {
    bar: baz
}

And this means "when the QML file is loaded, create an instance of type Foo, and set the property named bar on this new object to a value of baz."

The outermost level is this:

Window {
    width: 640
    height: 480
    visible: true
    title: qsTr("Hello World")
}

Then, you can add "elements" inside the window, by placing them inside the {}s. There are collection views (Row, Column, Grid, Flow) which define how to lay out their children, and there are also more general elements like Rectangle. When your layout is not naturally specified (because you're not using containers or whatever), you describe the layout using anchors, like anchors.centerIn: parent or anchors.fill: parent.

Qt Charts

Qt has a built-in chart element, so the first thing I did was just copy the ChartView example directly into my QML document as a child of the Window. However, that didn't work, and some searching found this note:

> Note: An instance of QApplication is required for the QML types as the module depends on Qt's Graphics View Framework for rendering. QGuiApplication is not sufficient. However, projects created with Qt Creator's Qt Quick Application wizard are based on the Qt Quick template that uses QGuiApplication by default. All the QGuiApplication instances in such projects must be replaced with QApplication.

Okay, so I replaced QGuiApplication with QApplication in main.cpp, and changed #include <QGuiApplication> to #include <QApplication>, only to find that there is now a compile error: the compiler can't find that file. After some more searching, it turns out I needed to change this:

find_package(Qt6 6.5 REQUIRED COMPONENTS Quick)

to

find_package(Qt6 6.5 REQUIRED COMPONENTS Quick Widgets)

and change 

target_link_libraries(appGrapher
    PRIVATE Qt6::Quick
)

to

target_link_libraries(appGrapher
    PRIVATE Qt6::Quick
    PRIVATE Qt6::Widgets
)

Huh. After doing that, it worked no problem.

Data Source (C++ interop)

So now I have a chart, which is pretty cool, but the data that the chart uses is spelled out literally in the QML file. That's not very useful - I plan on generating thousands of data points, and I don't want to have to put them inline inside this QML thing. Instead, I want to load them from an external source.

QML files allow you to run JavaScript by literally placing bits of JavaScript inside the QML file, but I think I want to do better - I want my data source to come from C++ code, so I have full freedom about how I generate it. From some searching, it looks like there are 2 ways of having C++ and QML JavaScript interoperate:

  • You can register a singleton, or a singleton instance, and then the JavaScript can call methods on that singleton
  • You can register a type, and have the QML create an instance of that type, just like any other element
  • (You can setContextProperty(), which lets the QML look up an instance that you set ahead of time. However, there's a note that says "You should not use context properties to inject values into your QML components" which is exactly what I'm trying to do, so this probably isn't the right solution.)

I have a general aversion to singletons, and I think registering a type is actually what I want, because I want the QML infrastructure to own the instance and define its lifetime, so that's the approach I went with. The way you do this is, in main() after you create the QApplication but before you do anything else, you call qmlRegisterType(). Here is what main() says:

qmlRegisterType<DataSource>("com.litherum", 1, 0, "DataSource");

This allows the QML to say import com.litherum, which is pretty cool.

QObject

Defining the DataSource type in C++ is a bit weird. It turns out that Qt objects are not just regular C++ objects. Instead, you write your classes in a different language, which is similar to C++, and then there is a "meta-object compiler" which will compile your source to actual C++. It looks like the main purpose of this is to be able to connect signals and slots, where an object can emit a signal, and if a slot in some other object is connected to that signal, then the slot callback gets run in that other object. It seems pretty similar to observers in Objective-C. They also have the ability to perform introspection, like Objective-C .... I kind of don't understand why they didn't just invent a real language rather than doing this C++ transpilation silliness.

Anyway, you can define your (not-)C++ class, inherit from QObject, annotate the class with Q_OBJECT and QML_ELEMENT, and give it a method with the Q_INVOKABLE annotation. Sure, fine. Then, in the QML file, you can add a stanza which tells the system to create an instance of this class, and you can use the Component.onCompleted JavaScript handler to call into it (via its id). Now you can call the C++ method you just defined from within the QML. Cool. This is what the C++ header says:

class DataSource : public QObject
{
    Q_OBJECT
    QML_ELEMENT
public:
    explicit DataSource(QObject *parent = nullptr);

    Q_INVOKABLE void updateData(QXYSeries*, double time);
};
 

Okay, the method is supposed to set the value of the SplineSeries in the chart. The most natural way to do this is to pass the SplineSeries into the C++ function as a parameter. This is actually pretty natural - all the QML types have corresponding C++ types, so you just make the C++ function accept a QSplineSeries*. Except we run into the same compiler error where the compiler can't find #include <QSplineSeries>. It turns out that in CMakeLists.txt we have to make a similar addition and add Charts to both places that we added Widgets above. Fine. Here's what the QML says:

 DataSource {
    id: dataSource
    Component.onCompleted: function() {
        dataSource.updateData(splineSeries, Date.now());
    }
}

Once you do this, it actually works out well - the C++ code can call methods on the QSplineSeries, and it can see the values that have been set in the QML. It can generate a QList<QPointF> and call QSplineSeries::replace() with the new list.
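For concreteness, here's a sketch of what such an updateData() might look like (the sine-wave data and the "datasource.h" file name are made up for illustration):

// datasource.cpp - implements the header shown above; the data itself is invented.
#include "datasource.h"
#include <QtCharts/QXYSeries>
#include <QList>
#include <QPointF>
#include <cmath>

DataSource::DataSource(QObject *parent) : QObject(parent) {}

void DataSource::updateData(QXYSeries *series, double time)
{
    QList<QPointF> points;
    const int count = 1000;
    points.reserve(count);
    for (int i = 0; i < count; ++i) {
        double x = static_cast<double>(i) / count;
        points.append(QPointF(x, std::sin(x * 20.0 + time / 1000.0)));
    }
    // Swapping in the whole data set at once is what lets the ChartView animate
    // from the old values to the new ones.
    series->replace(points);
}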

The one thing I couldn't get it to do was automatically rescale the charts' axes when I swap in new data with different bounds. Oh well.

I did want to go one step further, though!

Animation

One of the coolest things about retained-mode UI toolkits is that they often allow for animations for free. Swapping out the data in the series should allow Qt to smoothly animate from the first data set to the second. And it actually totally worked! It took me a while to figure out how specifically to spell the values, but in the QML file, you can set these on the ChartView:

animationOptions: ChartView.AllAnimations
animationDuration: 300 // milliseconds
animationEasingCurve {
    type: Easing.InOutQuad
}

I found these by looking at the documentation for QChart. And, lo and behold, changing the data values smoothly animated the spline on the graph! I also needed some kind of timer to actually call my C++ function to generate new data, which you do with QML also:

Timer {
    interval: 1000 // milliseconds
    running: true
    repeat: true
    onTriggered: function() {
        dataSource.updateData(splineSeries, Date.now());
    }
}

Super cool stuff! I'm always impressed when you can enable animations in a declarative way, without having your own code running at 60fps. Also, while the animations are running, from watching KSysGuard, it looks like the rendering is multithreaded, which is super cool too! (And, I realized that KSysGuard probably uses Qt Charts under the hood too, to show its performance graphs.)

Conclusion

It looks like Qt Charts is pretty powerful, has lots of options to make it beautiful, and is fairly performant (though I didn't rigorously test the performance). Using it did require creating a whole Qt application, but the application is super small, only has a few files, and each file is pretty small and understandable. And, being able to make arbitrary dynamic updates over time while getting animation for free was pretty awesome. I think being able to describe most of the UI declaratively, rather than having to describe it all 100% in code, is definitely a good design decision for Qt. And the C++ interop story was a little convoluted (having to touch main() is a bit unfortunate) but honestly not too bad in the end.

Saturday, October 28, 2023

ReSTIR Part 2: Characterizing Sample Reuse

After enumerating all the building blocks of ReSTIR, there isn't actually that much more. The rendering equation is an integral, and our job is to approximate the value of the integral by sampling it in the most intelligent way possible.

Importance sampling tells us that we want to generate samples with a density that's proportional to the contribution of those samples to the value of the final integral. (So, where the light is strongest, sample that with highest density.) We can't directly produce samples with this probability density function, though - if we could, we could just compute the integral directly rather than dealing with all this sampling business.

The function being integrated in the rendering equation is the product of a few independent functions:

  • The BRDF (BSDF) function, which is a property of the material we are rendering,
  • The distribution of incoming light. For direct illumination, this is distributed over the relevant light sources
  • A geometry term, where the orientation of the surface(s) affects the result
  • A visibility term (the point being shaded might be in shadow)

The fact that there are a bunch of independent terms means that Multiple Importance Sampling (MIS) works well - we can use these independent functions to produce a single aggregated "target" function which we expect will approximate the real function fairly well. So, we can generate samples according to the target function, using Sequential Importance Resampling (SIR), evaluate the real function at those sampling locations (by tracing rays or whatever), then use Resampled Importance Sampling (RIS) to calculate an integral. Easy peasy, right?
 

ReSTIR


This is where ReSTIR starts. The first observation that ReSTIR makes is that it's possible to use reservoir sampling (RS) to turn this into a streaming algorithm. The paper assumes that the reservoir only holds a single sample (though this isn't actually necessary). The contents of the reservoir represent a set of (one) sample with pdf proportional to the target function, and the more samples the reservoir encounters, the better that pdf matches the target function. The name of the game, now, is to make the reservoir encounter as many samples as possible.

Which brings us to the second observation that ReSTIR makes. Imagine if there was some way of merging reservoirs in constant time (or rather: in time proportional to the size of the reservoirs, rather than time proportional to the number of samples the reservoirs have encountered). If this were possible, you could imagine a classic parallel reduction algorithm: each thread (pixel) could start out with a naive reservoir (a poor approximation of your target function), but then adjacent threads could merge their reservoirs, then one-of-every-4-threads could merge their reservoirs, then one-of-every-8, etc., until you have a single result that incorporates results from all the threads. If only a single level (generation) of this reduction occurs each frame, each thread performs only a constant amount of work per frame, yet an exponential number of samples end up being accumulated. This is the key insight that ReSTIR makes.
 

Merging Reservoirs


Merging reservoirs is a subtle business, though. The problem is that different pixels/threads are shading different materials oriented at different orientations. In effect, your target function you're sampling (and the real function you're evaluating) are different from pixel to pixel. If you ignore this fact, and pretend that all your pixels are all sampling the same thing, you can naively just jam the reservoirs together, by creating a new reservoir which encounters the values saved in the reservoirs of the inputs. This is fast, but gets wrong results (called "bias" in the literature).

What you have to do instead is to treat the merging operation with care. The key here lies with the concept of "supports." Essentially, if you're trying to sample a function, you have to be able to generate samples at every place the function is nonzero. If there's an area where the function is nonzero but you never sample that area, your answer will turn out wrong. Well, the samples that one pixel generates (recall that the things in the reservoirs are sample locations) might end up not being applicable to a different pixel. For example, consider if there's an occlusion edge where one pixel is in shadow and a nearby pixel isn't. Or, another example: the surface normal varies across the object, and the sample at one pixel is at a sharp angle, such that if you use that same sample at a different pixel, that sample actually points behind the object. You have to account for this in the formulas involved.
 

Jacobian Determinant


There's a generalization of this, which uses the concept of a Jacobian determinant. Recall that, in general, a function describes a relationship between inputs and outputs. The Jacobian determinant of a function describes, for a particular point in the input space of the function, if you make a small perturbation and feed a slightly different input point into the function, how much the output of the function will be perturbed. It's kind of a measure of sensitivity - at a particular point, how sensitive the output is to changes in the input.

Well, if you have a sample at one particular pixel of an image, and you then apply it to a different pixel, you have an input (the sample at the original pixel) and you have an output (the sample at the destination pixel) and you have a relationship between the two (the probability of that sample won't be exactly the same at the two different places). So, the Jacobian tells you how to compensate for the fact that you're changing the domain of the sample.

In order to incorporate the Jacobian, you have to be able to calculate it (of course), which means you have to be able to characterize how sample reuse across pixels affects the probabilities involved. For direct illumination, that's just assumed to be 1 or 0 depending on the value of the sample point - hence why above you just ignore some samples altogether when reusing them. For indirect illumination (path tracing), a sample is an entire path, and when you re-use it at a different pixel, you're producing a new path that is slightly different than the original path. This path manipulation is called "shift mapping" of a path in the gradient domain rendering literature, and common shift mappings have well-defined Jacobian functions associated with them. So, if you spatially reuse a path, you can pick a "shift mapping" for how to define the new path, and then include that shift mapping's Jacobian in the reservoir merging formula.

This concept of a "shift mapping" and its Jacobian can be generalized to any kind of sampling - it's not just for path tracing.

Conclusion


So that's kind of it. If you're careful about it, you can merge reservoirs in closed form (or, at least, in closed form for each sample in the reservoirs), which results in a pdf of the values in the reservoir that is informed by the union of samples of all the input reservoirs. This leads to a computation tree of merges which allows the number of samples to be aggregated exponentially over time, where each frame only has to do constant work per pixel. You can perform this reuse both spatially and temporally, if you remember information about the previous frame. The more samples you aggregate, the closer the pdf of the samples in the reservoir matches the target function, and the target function is formed by using MIS to approximate the rendering equation. This allows you to sample with a density very close to the final function you're integrating, which has the effect of reducing variance (noise) in the output image.

ReSTIR also has some practical concerns, such as all the reuse causing an echo chamber of old data - the authors deal with that by weighting old and new data differently, to try to strike a balance between reuse (high sample counts) vs quickly adhering to new changes in geometry or whatever. It's a tunable parameter.

Tuesday, October 17, 2023

ReSTIR Part 1: Building Blocks

ReSTIR is built on a bunch of other technologies. Let's discuss them one-by-one.

Rejection Sampling


Rejection sampling isn't actually used in ReSTIR, but it's useful to cover it anyway. It is a technique to convert samples from one PDF (probability density function) to another PDF.

So, you start with the fact that you have 2 PDFs: a source PDF and a destination PDF. The first thing you do is you find a scalar "M" which, when scaling the source PDF, causes the source PDF to be strictly larger than the destination PDF, for all x coordinates. Then, for every sample in the source, accept that sample with a probability equal to destination PDF at the sample / (M * source PDF at the sample). You'll end up with fewer samples than you started with, but that's the price you pay. You can see how the scalar M is necessary to keep the probabilities between 0 and 1.

The larger the distance between the destination PDF and M * the source PDF, the fewer samples will be accepted. So, if you pick M very conservatively, you'll end up with almost no samples accepted. That's a downside to rejection sampling.

On the other hand, if the source PDF and the destination PDF are the same, then M = 1, and all the samples will be accepted. Which is good, because the input samples are exactly what should be produced by the algorithm.
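Here's a tiny sketch of that procedure (the names are made up, and the PDFs are passed in as callables):

#include <random>
#include <vector>
#include <functional>

// Turn samples drawn from sourcePdf into samples distributed according to destPdf,
// given M such that M * sourcePdf(x) >= destPdf(x) everywhere.
std::vector<double> rejectionSample(const std::vector<double> &sourceSamples,
                                    const std::function<double(double)> &sourcePdf,
                                    const std::function<double(double)> &destPdf,
                                    double M, std::mt19937 &rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> accepted;
    for (double x : sourceSamples) {
        double acceptProbability = destPdf(x) / (M * sourcePdf(x));
        if (uniform(rng) < acceptProbability)
            accepted.push_back(x);
    }
    return accepted; // possibly far fewer samples than we started with
}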

Sequential Importance Resampling


This is another technique used to convert samples from one PDF to another PDF. Compared to rejection sampling, we don't reject samples as we encounter them; instead, we pick ahead of time how many samples we want to accept.

Again, you have a source PDF and a destination PDF. Go through all your samples, and compute a "score" which is the destination PDF at the sample / the source PDF at the sample. Now that you have all your scores, select N samples from them, with probabilities proportional to the scores. You might end up with duplicate samples; that's okay.

Compared to rejection sampling, this approach has a number of benefits. The first is that you don't have to pick that "M" value. The scores are allowed to be any (non-negative) value - not necessarily between 0 and 1. This means you don't have to have any global knowledge about the PDFs involved.

Another benefit is that you know how many samples you're going to get at the end - you can't end up in a situation where you accidentally don't end up with any samples.

The downside to this algorithm is that you have to pick N up front. But, usually that's not actually a big deal.

The other really cool thing about SIR is that the source and destination PDFs don't actually have to be normalized. Because the scores can be arbitrary, it's okay if your destination PDF is actually just some arbitrary (non-normalized) function. This is super valuable, as we'll see later.
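Here's a sketch of SIR with made-up names; note that destFunction doesn't need to be a normalized PDF, because only the relative scores matter:

#include <random>
#include <vector>
#include <functional>

std::vector<double> sirResample(const std::vector<double> &sourceSamples,
                                const std::function<double(double)> &sourcePdf,
                                const std::function<double(double)> &destFunction,
                                int N, std::mt19937 &rng) {
    // Score every candidate by destination / source.
    std::vector<double> scores;
    scores.reserve(sourceSamples.size());
    for (double x : sourceSamples)
        scores.push_back(destFunction(x) / sourcePdf(x));

    // Draw N samples with probability proportional to the scores.
    std::discrete_distribution<size_t> pick(scores.begin(), scores.end());
    std::vector<double> result;
    result.reserve(N);
    for (int i = 0; i < N; ++i)
        result.push_back(sourceSamples[pick(rng)]); // duplicates are allowed
    return result;
}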

Monte Carlo Integration


The goal of Monte Carlo integration is to compute an integral of a function. You simply sample it at random locations, and average the results.

This assumes that the pdf you're using to generate random numbers is constant from 0 to 1.

So, the formula is: 1/N * sum from 1 to N of f(x_i)

Importance Sampling


The idea here is to improve upon basic Monte Carlo integration as described above. Certain samples will contribute to the final result more than others. Instead of sampling from a constant PDF, if you instead sample using a PDF that approximates the function being integrated, you'll more quickly approach the final answer.

Doing so adds another term to the formula. It now is: 1/N * sum from 1 to N of f(x_i) / q(x_i), where q(x) is the PDF used to generate samples.

The best PDF, of course, is proportional to the function being sampled - if you pick this, f(x_i) / q(x_i) will be a constant value for all i, which means you only need 1 term to calculate the final perfect answer. However, usually this is impractical - if you knew how to generate samples proportional to the function being integrated, you'd probably know enough to just integrate the function directly. For direct illumination, you can use things like the BRDF of the material, or the locations where the light sources are. Those will probably match the final answer pretty well.
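As a tiny self-contained example of that estimator (the particular f and q are made up): integrating f(x) = x^2 over [0, 1) with samples drawn from q(x) = 2x.

#include <cstdio>
#include <random>
#include <cmath>

int main() {
    std::mt19937 rng(42);
    auto f = [](double x) { return x * x; };   // the function being integrated
    auto q = [](double x) { return 2.0 * x; }; // the pdf used to generate samples
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const int N = 100000;
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = std::sqrt(uniform(rng)); // inverse-CDF sampling of q(x) = 2x
        sum += f(x) / q(x);
    }
    std::printf("estimate = %f (exact = 1/3)\n", sum / N);
}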

Multiple Importance Sampling


So now the question becomes how to generate that approximating function. If you look at the above formula, you'll notice that when f(x) is large, but q(x) is small, that leads to the worst possible situation - you are trying to compute an integral, but you're not generating any samples in an area that contributes a lot to it.

The other extreme - where f(x) is small but q(x) is big - isn't actually harmful, but it is wasteful. You're generating all these samples that don't actually contribute much to the final answer.

The idea behind MIS is that you can generate q(x) from multiple base formulas. For example, one of the base formulas might be the uniform distribution, and another might be proportional to the BRDF of the material you're shading, and another might be proportional to the direction of where the lights in the scene are. The idea is that, by linearly blending all these formulas, you can generate a better q(x) PDF. 

Incorporating the uniform distribution is useful to make sure that q(x) never gets too small anywhere, thereby solving the problem where f(x) is large and q(x) is small.

Resampled Importance Sampling


RIS is what happens when you bring together importance sampling and SIR. You can use SIR to generate samples proportional to your approximating function. You can then use the importance sampling formula to compute the integral.

If, when using SIR, your approximating function isn't normalized, there's another term added into the formula to re-normalize the result, which allows the correct integral to be calculated.

This is really exciting, because it means that we can calculate integrals (like the rendering equation) by sampling in strategic places - and the pdf of those strategic places can be arbitrary (non-normalized) functions.

Reservoir Sampling


Reservoir Sampling is a reformulation of SIR, to make it streamable. Recall that, in SIR, you encounter samples, and each sample produces a weight, and then you select N samples proportional to each sample's weight. Reservoir sampling allows you to select the N samples without knowing the total number of samples there are. The idea is that you keep a "reservoir" of N samples, and each time you encounter a new sample, you update the contents of the reservoir depending on probabilities involved. The invariant is that the contents of the reservoir is proportional to the probabilities of all the samples encountered.

The other cool thing about reservoir sampling is that 2 reservoirs can be joined together into a single reservoir, via only looking at the contents of the reservoirs, without requiring another full pass over all the data.
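Here's a minimal single-sample reservoir sketch, including the constant-time merge. It assumes both reservoirs are targeting the same function; reusing samples across pixels needs the extra care described in Part 2 above:

#include <random>

struct Reservoir {
    double sample = 0.0;      // the surviving sample
    double totalWeight = 0.0; // sum of all weights seen so far
    long count = 0;           // how many samples this reservoir has encountered

    void update(double candidate, double weight, std::mt19937 &rng) {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        totalWeight += weight;
        count += 1;
        // Keep the new candidate with probability weight / totalWeight.
        if (uniform(rng) * totalWeight < weight)
            sample = candidate;
    }

    void merge(const Reservoir &other, std::mt19937 &rng) {
        long combinedCount = count + other.count;
        // Treat the other reservoir's sample as one candidate carrying its whole accumulated weight.
        update(other.sample, other.totalWeight, rng);
        count = combinedCount; // update() counted the merge as a single sample; fix that up
    }
};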

Conclusion


So far, we've set ourselves up for success. We can calculate integrals, in a streamable way, by "resampling" our samples to approximate the final function being integrated. Being streamable is important, as we need to be able to update our results as we encounter new samples (perhaps across new frames, or across other pixels). The fact that you can merge reservoirs in constant time is super powerful, as it allows the merged result to behave as if it had seen 2*N samples, while only running a constant-time algorithm. This can be done multiple times, thereby allowing for synthesis of an exponential number of samples, but each operation is constant time.

Friday, October 13, 2023

Implementing a GPU's Programming Model on a CPU

SIMT

The programming model of a GPU uses what has been coined "single instruction multiple thread." The idea is that the programmer writes their program from the perspective of a single thread, using normal regular variables. So, for example, a programmer might write something like:

int x = threadID;
int y = 6;
int z = x + y;

Straightforward, right? Then, they ask the system to run this program a million times, in parallel, with different threadIDs.

The system *could* simply schedule a million threads to do this, but GPUs do better than this. Instead, the compiler will transparently rewrite the program to use vector registers and instructions in order to run multiple "threads" at the same time. So, imagine you have a vector register, where each item in the vector represents a scalar from a particular "thread." In the above program, x corresponds to a vector of [0, 1, 2, 3, etc.] and y corresponds to a vector of [6, 6, 6, 6, etc.]. Then, the operation x + y is simply a single vector add operation of both vectors. This way, performance can be dramatically improved, because these vector operations are usually significantly faster than if you had performed each scalar operation one-by-one.
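To make that concrete, here's a sketch of the three-line program above run for eight "threads" at once, using the AVX-512 intrinsics this post gets to below (assuming an AVX-512F-capable CPU):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    __m512i x = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7); // threadIDs 0..7
    __m512i y = _mm512_set1_epi64(6);                      // "int y = 6;" broadcast to every lane
    __m512i z = _mm512_add_epi64(x, y);                    // "int z = x + y;" for all 8 threads at once

    int64_t results[8];
    _mm512_storeu_si512(results, z);
    for (int i = 0; i < 8; ++i)
        std::printf("thread %d: z = %lld\n", i, (long long)results[i]);
}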

(This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data.)

SIMT gets complicated, though, when you have control flow. Imagine the program did something like:

if (threadID < 4) {
    doSomethingObservable();
}

Here, the system has to behave as-if threads 0-3 executed the "then" block, but also behave as-if threads 4-n didn't execute it. And, of course, thread 0-3 want to take advantage of vector operations - you don't want to pessimize and run each thread serially. So, what do you do?

Well, the way that GPUs handle this is by using predicated instructions. There is a bitmask which indicates which "threads" are alive: within the above "then" block, that bitmask will have value 0xF. Then, all the vector instructions use this bitmask to determine which elements of the vector it should actually operate on. So, if the bitmask is 0xF, and you execute a vector add operation, the vector add operation is only going to perform the add on the 0th-3rd items in the vector. (Or, at least, it will behave "as-if" it only performed the operation on those items, from an observability perspective.) So, the way that control flow like this works is: all threads actually execute the "then" block, but all the operations in the block are predicated on a bitmask which specifies that only certain items in the vector operations should actually be performed. The "if" statement itself just modifies the bitmask.

The Project

AVX-512 is an optional instruction set on some (fairly rare) x86_64 machines. The exciting thing about AVX-512 is that it adds support for this predication bitmask thing. It has a bunch of vector registers (512 bits wide, named zmm0 - zmm31) and it also adds a set of predication bitmask registers (k0 - k7). The instructions that act upon the vector registers can be predicated on the value of one of those predication registers, to achieve the effect of SIMT.
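For example, here's a sketch of the earlier if (threadID < 4) example expressed with these mask registers (the surrounding function and the "add 1" body are made up):

#include <immintrin.h>

// "if (threadID < 4) { sum = sum + 1; }" for 8 threads: the compare produces a bitmask of
// live threads, and the masked add only touches the lanes whose bit is set.
__m512i addOneWhereThreadIdBelowFour(__m512i threadIds, __m512i sums) {
    __mmask8 liveThreads = _mm512_cmplt_epi64_mask(threadIds, _mm512_set1_epi64(4));
    // Lanes whose mask bit is 0 keep their old value from `sums`.
    return _mm512_mask_add_epi64(sums, liveThreads, sums, _mm512_set1_epi64(1));
}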

It turns out I actually have a machine lying around in my home which supports AVX-512, so I thought I'd give it a go, and actually implement a compiler that compiles a toy language, but performs the SIMT transformation to use the vector operations and predication registers. The purpose of this exercise isn't really to achieve incredible performance - there are lots of sophisticated compiler optimizations which I am not really interested in implementing - but instead the purpose is really just as a learning exercise. Hopefully, by implementing this transformation myself for a toy language, I can learn more about the kinds of things that real GPU compilers do.

The toy language is one I invented myself - it's very similar to C, with some syntax that's slightly easier to parse. Programs look like this:

function main(index: uint64): uint64 {
    variable accumulator: uint64 = 0;
    variable accumulatorPointer: pointer<uint64> = &accumulator;
    for (variable i: uint64 = 0; i < index; i = i + 1) {
        accumulator = *accumulatorPointer + i;
    }
    return accumulator;
}

It's pretty straightforward. It doesn't have things like ++ or +=. It also doesn't have floating-point numbers (which is fine, because AVX-512 supports vector integer operations). It has pointers, for loops, continue & break statements, early returns... the standard stuff.

Tour

Let's take a tour, and examine how each piece of a C-like language gets turned into AVX-512 SIMT. I implemented this so it can run real programs, and tested it somewhat rigorously - enough to be fairly convinced that it's generally right and correct.

Variables and Simple Math

The most straightforward part of this system is variables and literal math. Consider:

variable accumulator: uint64;

This is a variable declaration. Each thread may store different values into the variable, so its storage needs to be a vector. No problem, right?

What about if the variable's type is a complex type? Consider:

struct Foo {
    x: uint64;
    y: uint64;
}

variable bar: Foo;

Here, we need to maintain the invariant that Foo.x has the same memory layout as any other uint64. This means that, rather than alternating x,y,x,y,x,y in memory, there instead has to be a vector for all the threads' x values, followed by another vector for all the threads' y values. This works recursively: if a struct has other structs inside it, the compiler will go through all the leaf-types in the tree, turn each leaf type into a vector, and then lay them out in memory end-to-end.

Simple math is even more straightforward. Literal numbers have the same value no matter which thread you're running, so they just turn into broadcast instructions. The program says "3" and the instruction that gets executed is "broadcast 3 to every item in a vector". Easy peasy.

L-values and R-values

In a C-like language, every value is categorized as either an "l-value" or an "r-value". An l-value is defined as having a location in memory, and r-values don't have a location in memory. The value produced by the expression "2 + 3" is an r-value, but the value produced by the expression "*foo()" is an l-value, because you dereferenced the pointer, so the thing the pointer points to is the location in memory of the resulting value. L-values can be assigned to; r-values cannot be assigned to. So, you can say things like "foo = 3 + 4;" (because "foo" refers to a variable, which has a memory location) but you can't say "3 + 4 = foo;". That's why it's called "l-value" and "r-value" - l-values are legal on the left side of an assignment.

At runtime, every expression has to produce some value, which is consumed by its parent in the AST. E.g, in "3 * 4 + 5", the "3 * 4" has to produce a "12" which the "+" will consume. The simplest way to handle l-values is to make them produce a pointer. This is so expressions like "&foo" work - the "foo" is an lvalue and produces a pointer that points to the variable's storage, and the & operator receives this pointer and produces that same pointer (unmodified!) as an r-value. The same thing happens in reverse for the unary * ("dereference") operator: it accepts an r-value of pointer type, and produces an l-value - which is just the pointer it just received. This is how expressions like "*&*&*&*&*&foo = 7;" work (which is totally legal and valid C!): the "foo" produces a pointer, which the & operator accepts and passes through untouched to the &, which takes it and passes it through untouched, all the way to the final *, which produces the same pointer as an lvalue, that points to the storage of foo.

The assignment operator knows that the thing on its left side must be an lvalue and therefore will always produce a pointer, so that's the storage that the assignment stores into. The right side can either be an l-value or an r-value; if it's an l-value, the assignment operator has to read from the thing it points to; otherwise, it's an r-value, and the assignment operator reads the value itself. This is generalized to every operation: it's legal to say "foo + 3", so the + operator needs to determine which of its parameters are l-values, and will thus produce pointers instead of values, and it will need to react accordingly to read from the storage the pointers point to.

All this stuff means that, even for simple programs where the author didn't even spell the name "pointer" anywhere in the program, or even use the * or & operators anywhere in the program, there will still be pointers internally used just by virtue of the fact that there will be l-values used in the program. So, dealing with pointers is a core part of the language. They appear everywhere, whether the program author wants them to or not.

Pointers

If we now think about what this means for SIMT, l-values produce pointers, but each thread has to get its own distinct pointer! That's because of programs like this:

variable x: pointer<uint64>;

if (...) {
    x = &something;
} else {
    x = &somethingElse;
}

*x = 4;

That *x expression is an l-value. It's not special - it's just like any other l-value. The assignment operator needs to handle the fact that, in SIMT, the lvalue that *x produces is a vector of pointers, where each pointer can potentially be distinct. Therefore, that assignment operator doesn't actually perform a single vector store; instead, it performs a "scatter" operation. There's a vector of pointers, and there's a vector of values to store to those pointers; the assignment operator might end up spraying those values all around memory. In AVX-512, there's an instruction that does this scatter operation.

(Aside: That scatter operation in AVX-512 uses a predication mask register (of course), but the instruction has a side-effect of clearing that register. That kind of sucks from the programmer's point of view - the program has to save and restore the value of the register just because of a quirk of this instruction. But then, thinking about it more, I realized that the memory operation might cause a page fault, which has to be handled by the operating system. The operating system therefore needs to know which address triggered the page fault, so it knows which pages to load. The predication register holds this information - as each memory access completes, the corresponding bit in the predication register gets set to false. So the kernel can look at the register to determine the first predication bit that's high, which indicates which pointer in the vector caused the fault. So it makes sense why the operation will clear the register, but it is annoying to deal with from the programmer's perspective.)

And, of course, the program can also say "foo = *x;", which means there also has to be a gather operation. Sure. Something like "*x = *y;" will end up doing both a gather and a scatter.
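A matching sketch for "*x = *y;", again with intrinsics and invented names, just chains the two operations:

#include <immintrin.h>

// Sketch of "*x = *y;": gather through the source pointers, then scatter
// through the destination pointers, both predicated on the live lanes.
void emitGatherScatter(__m512i dstPtrs, __m512i srcPtrs, __mmask8 alive) {
    __m512i loaded = _mm512_mask_i64gather_epi64(
        _mm512_setzero_si512(), // dead lanes keep this placeholder value
        alive, srcPtrs, nullptr, 1);
    _mm512_mask_i64scatter_epi64(nullptr, alive, dstPtrs, loaded, 1);
}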

Copying

Consider a program like:

struct Foo {
    x: uint64;
    y: uint64;
}

someVariableOfFooType = aFunctionThatReturnsAFoo();

That assignment needs to set both fields inside the Foo. Naively, a compiler might be tempted to use a memcpy() to copy the contents - after all, the contents could be arbitrarily complex, with nested structs. However, that won't work for SIMT, because only some of the threads might be alive at this point in the program. Therefore, the assignment has to copy only the vector elements belonging to the threads that are alive; it can't copy the whole vectors, because that could clobber other entries in the destination vectors which are supposed to persist.

So, all the stores to someVariableOfFooType need to be predicated using the predication registers - we can't naively use a memcpy(). This means that every assignment needs to actually perform n memory operations, where n is the number of leaf types in the struct being assigned - because those memory operations can be predicated correctly using the predication registers. We have to copy structs leaf-by-leaf. This means that the number of instructions to copy a type is proportional to the complexity of the type. Also, both the left side and the right side may be l-values, which means each leaf-copy could actually be a gather/scatter pair of instructions. So, depending on the complexity of the type and the context of the assignment, that single "=" operation might actually generate a huge amount of code.
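Here's a hedged sketch of what that leaf-by-leaf copy could look like for the Foo above, assuming (as the leaf-by-leaf description implies) that each leaf field gets its own 8-lane vector of storage; the real compiler emits this inline rather than calling a helper:

#include <immintrin.h>
#include <cstdint>

// Hypothetical SIMT storage for a Foo: one 8-lane vector per leaf field.
struct FooStorage {
    alignas(64) uint64_t x[8];
    alignas(64) uint64_t y[8];
};

// "dst = src;" as leaf-by-leaf masked copies: lanes belonging to inactive
// threads are left untouched in dst, which a memcpy() couldn't guarantee.
void copyFoo(FooStorage& dst, const FooStorage& src, __mmask8 alive) {
    __m512i xs = _mm512_mask_loadu_epi64(_mm512_setzero_si512(), alive, src.x);
    _mm512_mask_storeu_epi64(dst.x, alive, xs);

    __m512i ys = _mm512_mask_loadu_epi64(_mm512_setzero_si512(), alive, src.y);
    _mm512_mask_storeu_epi64(dst.y, alive, ys);
}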

Pointers (Part 2)

There's one other decision that needs to be made about pointers: Consider:

variable x: uint64;

... &x ...

As I described above, the storage for the variable x is a vector (each thread owns one value in the vector). &x produces a vector of pointers, sure. The question is: should all the pointer values point to the beginning of the x vector? Or should each pointer value point to its own slot inside the x vector? If they point to the beginning, that makes the & operator itself really straightforward: it's just a broadcast instruction. But it also means that the scatter/gather operations get more complicated: they have to offset each pointer by a different amount in order to scatter/gather to the correct place. On the other hand, if each pointer points to its own slot inside x, that means the scatter/gather operations are already set up correctly, but the & operation itself gets more complicated.

Both options will work, but I ended up making all the pointers point to the beginning of x. The reason is programs like:

struct Foo {
    x: uint32;
    y: uint64;
}

variable x: Foo;

... &x ...

If I picked the other option, and had the pointers point to their own slot inside x, it isn't clear which member of Foo they should be pointing inside of. I could have, like, found the first leaf, and made the pointers point into that, but what if the struct is empty... It's not very elegant.

Also, if I'm assigning to x or something else where I need to populate every field, then because every copy operation has to copy leaf-by-leaf, I'm going to have to be modifying the pointers to point to each field anyway. If one of the fields is a uint32 and the next one is a uint64, I can't simply add a constant amount to each pointer to get it to point to its slot in the next leaf. So, if I'm going to be mucking about with individual pointer values for each leaf in a copy operation, I might as well have the original pointers point to the overall x vector rather than to individual fields, because pointing to individual fields doesn't actually make anything simpler.
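To illustrate the choice I made, here's a small intrinsics sketch (helper names invented, 8 uint64 lanes assumed): & is just a broadcast of the base address, and the per-lane offsets only get added later, at the point where the pointers are actually consumed by a scatter/gather.

#include <immintrin.h>
#include <cstdint>

// "&x" for a local uint64 variable backed by an 8-lane vector: every lane
// gets the same pointer, the start of x's storage. Just a broadcast.
__m512i emitAddressOf(const uint64_t (&storage)[8]) {
    return _mm512_set1_epi64((long long)(uintptr_t)storage);
}

// The scatter/gather is where each lane gets offset to its own slot
// (lane i ends up at base + i * 8 bytes).
__m512i offsetToOwnSlot(__m512i basePtrs) {
    const __m512i laneOffsets =
        _mm512_setr_epi64(0, 8, 16, 24, 32, 40, 48, 56);
    return _mm512_add_epi64(basePtrs, laneOffsets);
}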

Function Calls

This language supports function pointers, which are callable. This means that you can write a program like this (taken from the test suite):

function helper1(): uint64 ...

function helper2(): uint64 ...

function main(index: uint64): uint64 {
    variable x: FunctionPointer<uint64>;

    if (index < 3) {
        x = helper1;
    } else {
        x = helper2;
    }

    return x();
}

Here, that call to x() means different threads may be calling different functions. This is a problem for us, because all the "threads" that are running share the same instruction pointer. We can't actually have some threads call one function and other threads call another function. So, what we have to do instead is set the predication bitmask to only the "threads" which call one function, then call that function, then set the predication bitmask to the remaining threads, then call the other function. Both functions get called, but the only "threads" that are alive during each call are the ones that are actually supposed to be running that function.

This is tricky to get right, though, because anything could be in that function pointer vector. Maybe all the threads ended up with the same pointers! Or maybe each thread ended up with a different pointer! You *could* do the naive thing and do something like:

for i in 0 ..< numThreads:
    predicationMask = originalPredicationMask & (1 << i)
    call function[i]

But this has really atrocious performance characteristics. This means that every call actually calls numThreads functions, one-by-one. But each one of those functions can have more function calls! The execution time will be proportional to numThreads ^ callDepth. Given that function calls are super common, this exponential runtime isn't acceptable.

Instead, what you have to do is gather up and deduplicate function pointers. You need to do something like this instead:

func generateMask(functionPointers, target):
    mask = 0;
    for i in 0 ..< numThreads:
        if functionPointers[i] == target:
            mask |= 1 << i;
    return mask;

for pointer in unique(functionPointers):
    predicationMask = originalPredicationMask & generateMask(functionPointers, pointer)
    call pointer

I couldn't find an instruction in the Intel instruction set that did this. This is also a complicated enough algorithm that I didn't want to write this in assembly and have the compiler emit the instructions for it. So, instead, I wrote it in C++, and had the compiler emit code to call this function at runtime. Therefore, this routine can be considered a sort of "runtime library": a function that automatically gets called when the code the author writes does a particular thing (in this case, "does a particular thing" means "calls a function").
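Here's a rough C++ sketch of what such a runtime routine could look like for 8 threads - the names and exact interface are invented, but the shape matches the pseudocode above. The generated code would then loop over the returned groups, load each group's mask into k1, and call each target.

#include <cstdint>

// One entry per distinct function pointer found among the live lanes.
struct CallGroup {
    uint64_t target; // the function pointer value
    uint8_t mask;    // lanes that should be live while 'target' runs
};

// Walk the live lanes and coalesce equal pointers into groups.
// Returns the number of groups written (at most 8).
int coalesceCallTargets(const uint64_t lanes[8], uint8_t alive, CallGroup out[8]) {
    int groups = 0;
    uint8_t remaining = alive;
    for (int i = 0; i < 8; i++) {
        if (!(remaining & (1u << i)))
            continue;
        uint8_t mask = 0;
        for (int j = i; j < 8; j++)
            if ((remaining & (1u << j)) && lanes[j] == lanes[i])
                mask |= (uint8_t)(1u << j);
        out[groups++] = CallGroup{lanes[i], mask};
        remaining = (uint8_t)(remaining & ~mask);
    }
    return groups;
}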

Doing it this way means that you don't get exponential runtime. Indeed, if your threads all have the same function pointer value, each call site performs just a single call. And if the threads diverge, the slowdown will be at most proportional to the number of threads. You'll never run a function where the predication bitmask is 0, which puts a bound on how slow the worst case can be - it will never get worse than having each thread individually diverge from all the other threads.

Control Flow

As described above, control flow (meaning: if statements, for loops, breaks, continues, and returns) is implemented by changing the value of the predication bitmask register. The x86_64 instruction set has instructions that do this.

There are 2 ways to handle the predication registers. One way is to observe the fact that there are 8 predication registers, and to limit the language to only allow 8 (7? 6? 3?) levels of nested control flow. If you pick this approach, the code that you emit inside each if statement and for loop would use a different predication register. (Sibling if statements can use the same predication register, but nested ones have to use different predication registers.) 

I elected not to add this restriction, but instead to save and restore the values of the predication register to the stack. This is slower, but it means that control flow can be nested without limit. So, all the instructions I emit are predicated on the k1 register - I never use k2 - k7 (except that I use k2 to save/restore the value of k1 during scatter/gather operations, because those clobber whichever mask register you pass into them).

For an "if" statement, you actually need to save 2 predication masks:

  1. One that saves the predication mask that was incoming to the beginning of the "if" statement. You need to save this so that, after the "if" statement is totally completed, you can restore it back to what it was originally
  2. If there's an "else" block, you also need to save the bitmask of the threads that should run the "else" block. You might think that you can compute this value at runtime instead of saving/loading it (it would be the inverse of the threads that ran the "then" block, and-ed with the set of incoming threads) but you actually can't do that, because break and continue statements might need to modify this value. Consider a break statement that's a direct child of the "then" block - at the end of the "then" block, there will be no threads executing (because they all executed the "break" statement). If you then use the set of currently executing threads to try to determine which should execute the "else" block, you'll erroneously determine that all threads (even the ones which ran the "then" block!) should run the "else" block. Instead, you need to compute up-front the set of threads that should be running the "else" block, save it, and re-load it when starting to execute the "else" block.
 For a "for" loop, you also need to save 2 predication masks:

  1. Again, you need to store the incoming predication mask, to restore it after the loop has totally completed
  2. You also need to save and restore the set of threads which should execute the loop increment operation at the end of the loop. The purpose of saving and restoring this is so that break statements can modify it. Any thread that executes a break statement needs to remove itself from the set of threads which executes the loop increment. Any thread that executes a continue statement needs to remove itself from executing *until* the loop increment. Again, this is a place where you can't recompute the value at runtime because you don't know which threads will execute break or continue statements.
If you set up "if" statements and "for" loops as above, then break and continue statements actually end up really quite simple. First, you can verify statically that no statement directly follows them - they should be the last statement in their block.

Then, what a break statement does is:
  1. Find the deepest loop it's inside of, and find all the "if" statements between that loop and the break statement
  2. For each of the "if" statements:
    1. Emit code to remove all the currently running threads from both of the saved bitmasks associated with that "if" statement. Any thread that executes a break statement should not run an "else" block and should not come back to life after the "if" statement.
  3. Emit code to remove all the currently running threads from just the second bitmask associated with the loop. (This is the one that gets restored just before the loop increment operation). Any thread that executes a break statement should not execute the loop increment.
A "continue" statement does the same thing except for the last step (those threads *should* execute the loop increment). And a "return" statement removes all the currently running threads from all bitmasks from every "if" statement and "for" loop it's inside of.

This is kind of interesting - it means an early return doesn't actually stop the function or perform a jmp or ret. The function still continues executing, albeit with a modified predication bitmask, because there might still be some threads "alive." It also means that "if" statements don't actually need to have any jumps in them - in the general case, both the "then" block and the "else" block will be executed, so instead of jumps you can just modify the predication bitmasks - and emit straight-line code. (Of course, you'll want the "then" block and the "else" block to both jump to the end if they find that they start executing with an empty predication bitmask, but this isn't technically necessary - it's just an optimization.)
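To make the bookkeeping concrete, here's a small sketch (masks simulated as plain bytes rather than k registers, names invented) of the sequence an "if"/"else" compiles down to:

#include <cstdint>

// The two masks an "if" statement spills to the stack, per the list above.
struct SavedIfMasks {
    uint8_t incoming; // restored once the whole "if" statement is done
    uint8_t elseMask; // computed up front, loaded when the "else" block starts
};

// 'exec' plays the role of k1; 'condMask' has a bit set for every lane whose
// condition evaluated to true.
void ifElseBookkeeping(uint8_t& exec, uint8_t condMask) {
    SavedIfMasks saved;
    saved.incoming = exec;
    saved.elseMask = (uint8_t)(exec & ~condMask);

    exec = (uint8_t)(exec & condMask);
    // ... "then" block runs here; a break/return inside may clear bits in
    //     exec, saved.elseMask, and saved.incoming ...

    exec = saved.elseMask;
    // ... "else" block runs here ...

    exec = saved.incoming; // threads come back to life after the "if"
}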

Shared Variables


When you're using the SIMT approach, one thing that becomes useful is the ability to interact with external memory. GPU threads don't really perform I/O as such; instead they communicate with the outside world by reading/writing global memory. This is a bit of a problem for SIMT-generated code, because it assumes that everything has a vector type - one element per thread. But, when interacting with external memory, all "threads" see the same values - a shared int is just an int, not a vector of ints.

That means we now have a 3rd kind of value classification. Previously, we had l-values and r-values, but l-values can be further split into vector-l-values and scalar-l-values. A pointer type now needs to know statically whether it points to a vector-l-value or a scalar-l-value. (This information needs to be preserved as we pass it from l-value pointers through the & and * operators.) In the language, this looks like "pointer<uint64 | shared>".

It turns out that, beyond the classical type-checking analysis, it's actually pretty straightforward to deal with scalar-l-values. They are strictly simpler than vector-l-values.

In the language, you can declare something like:

variable<shared> x: uint64;
x = 4;

which means that it is shared among all the threads. If you then refer to x, that reference expression becomes a scalar-l-value, and produces a vector of pointers, all of which point to x's (shared) storage. The "=" in the "x = 4;" statement now has to be made aware that:
  1. If the left side is a vector-l-value, then the scatter operation needs to offset each pointer in the vector to point to the specific place inside the destination vectors that the memory operations should write to
  2. But, if the left side is a scalar-l-value, then no such offset needs to occur. The pointers already point to the one single shared memory location. Everybody points to the right place already.

(And, of course, same thing for the right side of the assignment, which can be either a vector-l-value, a scalar-l-value, or an r-value.)

Comparisons and Booleans


AVX-512 of course has vector compare instructions. The result of these vector comparisons *isn't* another vector. Instead, you specify one of the bitmask registers to receive the result of the comparison. This is useful if the comparison is the condition of an "if" statement, but it's also reasonable for a language to have a boolean type. If the boolean type is represented as a normal vector holding 0s and 1s, there's an elegant way to convert between the comparison and the boolean.

The comparison instructions look like:

vpcmpleq %zmm1,%zmm0,%k2{%k1}

If you were to speak this aloud, what you'd say is "do a vector packed compare for less-than-or-equal-to on the quadwords in zmm0 and zmm1, put the result in k2, and predicate the whole operation on the value of k1." Importantly, the operation itself is predicated, and the result can be put into a different predication register. This means that, after you execute this thing, you know which threads executed the instruction (because k1 is still there) but you also know the result of the comparison (because it's in k2).

So, what you can do is: use k1 to broadcast a constant 0 into a vector register, and then use k2 to broadcast a constant 1 into the same vector register. This will leave a 1 in all the spots where the test succeeded, and a 0 in all the remaining spots. Pretty cool!

If you want to go the other way, to convert from a boolean to a mask, you can just compare the boolean vector to a broadcasted 0, and compare for "not equal." Pretty straightforward.
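As a sketch of both directions in intrinsics form (names invented; 'exec' plays the role of k1 and 'cmp' plays the role of k2):

#include <immintrin.h>

// Mask -> boolean vector: merge 0 into every live lane, then 1 into every
// lane where the comparison passed. Dead lanes of 'dst' are left untouched.
__m512i maskToBool(__m512i dst, __mmask8 exec, __mmask8 cmp) {
    dst = _mm512_mask_set1_epi64(dst, exec, 0);
    dst = _mm512_mask_set1_epi64(dst, cmp, 1);
    return dst;
}

// Boolean vector -> mask: "not equal to zero", predicated on the live lanes.
__mmask8 boolToMask(__mmask8 exec, __m512i boolVec) {
    return _mm512_mask_cmpneq_epi64_mask(exec, boolVec, _mm512_setzero_si512());
}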

Miscellanea


I'm using my own calling convention (and ABI) to pass values into and out of functions. It's for simplicity - the x64 calling convention is kind of complicated if you're using vector registers for everything. One of the most useful decisions I made was to formalize this calling convention by encoding it in a C++ class in the compiler. Rather than having various different parts of the compiler just assume they knew where parameters were stored, it was super useful to create a single source of truth about the layout of the stack at call frame boundaries. I ended up changing the layout a few different times, and having this single source of truth meant that each change only required updating a single class, rather than making changes scattered all over the compiler.
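A hypothetical sketch of what that single source of truth might look like (the layout and names here are invented, not the real ABI):

#include <cstddef>

// Invented sketch: the one class that answers "where does everything live
// at a call boundary?" in this made-up ABI, where every value gets one
// vector-sized slot.
class CallFrameLayout {
public:
    static constexpr size_t slotSize = 64; // one ZMM-sized slot per value

    explicit CallFrameLayout(size_t parameterCount)
        : parameterCount(parameterCount) {}

    size_t returnValueOffset() const { return 0; }
    size_t parameterOffset(size_t index) const { return slotSize * (1 + index); }
    size_t frameSize() const { return slotSize * (1 + parameterCount); }

private:
    size_t parameterCount;
};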

Inventing my own ABI also means that there will be a boundary, where the harness will have to call the generated code. At this boundary, there has to be a trampoline, where the contents of the stack gets rejiggered to set it up for the generated code to look in the right place for stuff. And, this trampoline can't be implemented in C++, because it has to do things like align the stack pointer register, which you can't do in C++. AVX-512 requires vectors to be loaded and stored at 64-byte alignment, but Windows only requires 16-byte stack alignments. So, in my own ABI I've said "stack frames are all aligned to 64-byte boundaries" which means the trampoline has to enforce this before the entry point can be run. So the trampoline has to be written in assembly.

The scatter/gather operations (which are required for l-values to work) only operate on 32-bit and 64-bit values inside the AVX-512 registers. This means that the only types the language can offer are 32-bit and 64-bit types. An AVX-512 vector, which is 512 bits = 64 bytes wide, can hold 8 64-bit values or 16 32-bit values. However, the entire SIMT programming model requires you to pick up front how many "threads" will be executing at once. If some calculations in your program can compute 8 values at a time, and some other calculations can compute 16 values at a time, it doesn't matter - you have to pessimize and only use 8-at-a-time. So, if the language contains 64-bit types, then the max number of "threads" you can run at once is 8. If the language only contains 32-bit types (and you get rid of 64-bit type support, including 64-bit pointers), then you can run 16 "threads" at once. I chose to include 64-bit types and run 8 "threads" at a time, because I didn't want to limit myself to the first 4GB of memory (the natural stack and heap are already farther than 4GB apart from each other in virtual address space, so I'd have to, like, mess with Windows's VM subsystem to allocate my own stack/heap and put them close to each other, and yuck, I'll just use 64-bit pointers thankyouverymuch).

Conclusion


And that's kind of it. I learned a lot along the way - there seem to be good reasons why, in many shading languages (which use this SIMT model),
  • Support for 8-bit and 16-bit types is rare - the scatter/gather operations might not support them.
  • Support for 64-bit types is also rare - the smaller your types, the more parallelism you get, for a particular vector bit-width.
  • Memory loads and stores turn into scatter/gather operations instead.
    • A sophisticated compiler could optimize this, and turn some of them into vector loads/stores instead.
    • This might be why explicit support for pointers is relatively rare in shading languages - no pointers means you can _always_ use vector load/store operations instead of scatter/gather operations (I think).
  • You can't treat memory as a big byte array and memcpy() stuff around; instead you need to treat it logically and operate on well-typed fields, so the predication registers can do the right thing.
  • Shading languages usually don't have support for function pointers, because calling them ends up becoming a loop (with a complicated pointer coalescing phase, no less) in the presence of non-uniformity. Instead, it's easy for the language to just say "You know what? All calls have to be direct. Them's the rules."
  • Pretty much every shading language has a concept of multiple address spaces. The need for them naturally arises when you have local variables which are stored in vectors, but you also need to interact with global memory, which every thread "sees" identically. Address spaces and SIMT are thoroughly intertwined.
  • I thought it was quite cool how AVX-512 complemented the existing (scalar) instruction set. E.g., all the math operations in the language use vector operations, but you still use the normal call/ret instructions. You use the same rsp/rbp registers to interact with the stack. The vector instructions can still use the SIB byte. The broadcast instruction broadcasts from a scalar register to a vector register. Given that AVX-512 came out of the Larrabee project, it strikes me as a very Intel-y way to build a GPU instruction set.