Wednesday, November 9, 2022

Video Splitter

I record ~all the games I play. Not for any particular reason, but really just because sometimes it's fun to go back and re-watch them, to repeat a positive experience. Usually, I just upload the raw footage directly to YouTube, but this time I wanted to see if I could do a little better.

Problem Statement

I just finished playing Cyberpunk 2077, and recorded 77 hours of footage. This footage is split across 41 video files. Also, when recording the footage, I wasn't particularly meticulous about stopping recording when I had to step away for a little bit. Therefore, there are a bunch of times in the videos when I stepped away for a few minutes, and nothing much is happening on-screen.

I want to take these videos, and concatenate them into a few long videos. Ideally the result would be a single video, but YouTube doesn't allow you to upload anything longer than 10 hours, so the result will be a few 10-hour-long videos. Also, I'd like to identify the periods of time in the footage when nothing much is happening, and remove those periods from the result.

Also, just for the sake of convenience, I'd like to do this with as few libraries as possible, focusing just on using the software that's built into my Mac.

Plan

There are 3 phases:
  1. Feature Extraction
  2. Partitioning
  3. Writing out the final result.
Let's take these one at a time.

Feature Extraction

This is the part where I analyze the videos to pull out useful information from them. This entails going through all the videos frame-by-frame, and mapping each frame to a set of metrics. I ended up using 5 metrics:
  1. The number of words that appear in the frame
  2. The cross-correlation from the previous frame to the frame in question
  3. The optical-flow from the previous frame to the frame in question
  4. The standard deviation of the optical flow
  5. The average luminance of the frame
Let's consider these one-by-one.

Number of Words


This is pretty straightforward to gather. Apple's Vision framework contains an API to recognize the text in an image. There's one other step beyond simply using this API, though - the results contain strings, but each string may contain many words. I'm interested in the number of words, rather than the number of strings in the image. So, I use another Apple API, CFStringTokenizer, to pull out the words from each string. Then I simply count the number of words. Easy peasy.
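As a rough sketch of how those two APIs fit together (wordCount(in:) is my own hypothetical helper, and I'm assuming the frame is already available as a CGImage):

import Vision
import Foundation

// Sketch: count the words Vision recognizes in a single frame.
// wordCount(in:) is a hypothetical helper, not an Apple API.
func wordCount(in frame: CGImage) throws -> Int {
    let request = VNRecognizeTextRequest()
    let handler = VNImageRequestHandler(cgImage: frame, options: [:])
    try handler.perform([request])

    var count = 0
    for case let observation as VNRecognizedTextObservation in request.results ?? [] {
        // Each observation's best candidate is a string that may contain many words.
        guard let string = observation.topCandidates(1).first?.string else { continue }
        let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                                string as CFString,
                                                CFRange(location: 0, length: string.utf16.count),
                                                kCFStringTokenizerUnitWord,
                                                CFLocaleCopyCurrent())
        // Advance token by token until the tokenizer runs out of words.
        while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
            count += 1
        }
    }
    return count
}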

Cross Correlation



The goal here is to cross-correlate adjacent frames of video, to see roughly how much is changing from frame to frame. This one was the most difficult to implement. 

Cross correlation is usually defined a little differently than what I'm interested in. Classically, cross correlation is a function that takes 2 functions as input, and produces another function. The idea is that you take the 2 functions, multiply them together, and integrate the result, to produce a particular value. However, you often want to displace the two input functions away from each other first. That displacement is why the output of cross-correlation is itself a function - the input to the output function is the displacement by which the two input functions are separated before multiplying and integrating. (I'm assuming here that the functions are real-valued.)

For me, my 2 functions are discrete - there are X and Y inputs, and the output is the color value of that pixel. So, instead of multiplying and integrating like you would for continuous functions, this operation actually becomes simply a dot product. I'm also using just the luminance of the image, so that if a color changes chroma but its luminance stays the same, that still counts as a high cross-correlation. Also, having a 1-dimensional result is a little more convenient than a 3-dimensional result (which is what I'd get by treating red, green, and blue as distinct).

However, I don't want my output to be a whole function -  I just want a single value. Usually, in the sciences, they do this by maximizing the value of the output function - often by trying every input, and reporting the maximum result achievable. This would be great: if I'm turning the camera in the game, this operation would find the location where the previous frame best matches up with the current frame. Unfortunately, it's too slow, though - for a 4K image, there are 35,389,440 inputs to try, and each trial operates on 2 entire 4K images. So, instead, I just set displacement = 0, and assume that adjacent video frames aren't changing a huge amount from frame to frame.

From reading Wikipedia's article about cross correlation, it looks like what I want is the "zero normalized cross-correlation" which normalizes the values in the image around the mean, and divides by the standard deviation. The idea is that if the image gets brighter as a whole, but nothing else changes, that should count the same as if it didn't get brighter. It's measuring relative changes, rather than absolute changes.
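Written out (this is just my own transcription of the steps below, with the displacement fixed at 0, I_1 and I_2 being the two frames' luminance images, and N the number of pixels):

\text{ZNCC}(I_1, I_2) = \frac{1}{N} \sum_{x,y} \frac{\left(I_1(x,y) - \mu_1\right)\left(I_2(x,y) - \mu_2\right)}{\sigma_1 \, \sigma_2}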

So this all boils down to:
  1. Calculate the luminance of both images
  2. Calculate the average and standard deviation of the luminance for each image. This ignores geometry and just treats all the pixels as an unordered set.
  3. For each image, create a new "zero-normalized" image, which is (old image - mean) / standard deviation
  4. Perform a dot product of the two zero-normalized images. The result of this is a single scalar.
  5. Divide the scalar by the number of pixels
Okay, how to actually do this in code?

Luminance


Calculating the luminance is actually a little tricky. The most natural way I found was to convert the image into the XYZ colorspace, whose Y channel represents luminance. I'm doing this using MPSImageConversion, which can convert images between any 2 color spaces. It operates on MTLTextures, so I had to bounce through Core Image to actually produce them (via CIRenderDestination.) I then broadcast the Y channel to every channel of the result, which isn't strictly necessary, but makes it more convenient to use later - I can't forget which channel is the correct channel to use. I did this broadcast using CIFilter.colorMatrix.
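The broadcast itself is simple; a sketch (broadcastLuminance(of:) is my own helper name, and I'm assuming its input is already a CIImage holding XYZ data):

import CoreImage
import CoreImage.CIFilterBuiltins

// Sketch: copy the Y (luminance) channel of an XYZ image into R, G, and B.
func broadcastLuminance(of xyzImage: CIImage) -> CIImage {
    let broadcast = CIFilter.colorMatrix()
    broadcast.inputImage = xyzImage
    broadcast.rVector = CIVector(x: 0, y: 1, z: 0, w: 0)   // R output = Y input
    broadcast.gVector = CIVector(x: 0, y: 1, z: 0, w: 0)   // G output = Y input
    broadcast.bVector = CIVector(x: 0, y: 1, z: 0, w: 0)   // B output = Y input
    broadcast.aVector = CIVector(x: 0, y: 0, z: 0, w: 1)   // leave alpha alone
    return broadcast.outputImage!
}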

Mean


Okay, so now we've got luminance, let's calculate the mean, which is pretty straightforward - Core Image already has CIFilter.areaAverage, which produces a 1x1 image of the average. We can tile that 1x1 image using CIFilter.affineTile so it's as big as the input image.
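A sketch of that, using the CIFilterBuiltins Swift interface (meanImage(of:) is my own helper name; the affine tile repeats the 1x1 average pixel infinitely, so I crop it back down to the original extent):

import CoreImage
import CoreImage.CIFilterBuiltins

// Sketch: an image of the same extent as `input`, where every pixel holds the mean of `input`.
func meanImage(of input: CIImage) -> CIImage {
    let average = CIFilter.areaAverage()
    average.inputImage = input
    average.extent = input.extent                 // produces a 1x1 image containing the mean

    let tile = CIFilter.affineTile()
    tile.inputImage = average.outputImage
    tile.transform = .identity                    // repeat that single pixel everywhere
    return tile.outputImage!.cropped(to: input.extent)
}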

Subtraction


Subtracting the mean from an image is actually kind of tricky - both CIDifferenceBlendMode and CISubtractBlendMode seem to try really hard to not produce negative values. What I had to do in the end was to add the negative of the second image (instead of subtracting). Adding is just CIFilter.additionCompositing, and negating is just CIFilter.multiplyCompositing with an image full of -1s. However, you can't use CIConstantColorGenerator to fill an image with -1s, because that will clamp to 0. Instead you have to actually create a new CIImage from a 1x1 CPU-side bitmap buffer that holds a -1, and then use CIFilter.affineTile to make it big enough. Also, in the CIContext, you have to set the working format to one that can represent negative values (normally Core Image uses unsigned values); I'm using CIFormat.RGBAf.
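Concretely, the -1 image construction looks something like this (a sketch; I'm assuming a 4-channel 32-bit-float RGBAf pixel layout, and minusOnesImage(covering:) is my own helper name):

import CoreImage
import CoreImage.CIFilterBuiltins

// Sketch: a 1x1 RGBAf image whose pixel is (-1, -1, -1, 1), tiled to cover `extent`.
// CIConstantColorGenerator would clamp the -1s to 0, so the bitmap is built by hand.
func minusOnesImage(covering extent: CGRect) -> CIImage {
    let pixel: [Float] = [-1, -1, -1, 1]
    let data = pixel.withUnsafeBufferPointer { Data(buffer: $0) }
    let onePixel = CIImage(bitmapData: data,
                           bytesPerRow: pixel.count * MemoryLayout<Float>.size,
                           size: CGSize(width: 1, height: 1),
                           format: .RGBAf,
                           colorSpace: nil)
    let tile = CIFilter.affineTile()
    tile.inputImage = onePixel
    tile.transform = .identity
    return tile.outputImage!.cropped(to: extent)
}

// Then "image - mean" is: multiply the mean image by this -1 image, and add the
// result to the original image, in a CIContext created with
// CIContext(options: [.workingFormat: CIFormat.RGBAf]) so the negatives survive.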

Standard Deviation


Okay, so the next step is to calculate standard deviation. The standard deviation is the square root of variance, and to calculate variance, we take all the pixels, subtract the mean from them like we just did above, square the result, then find the average of the resulting values. Luckily, we already did all the hard parts - squaring the result is just CIFilter.multiplyCompositing, and we can use the same CIFilter.areaAverage & CIFilter.affineTile to find the average. No problem. We can then take the square root to find standard deviation by using CIFilter.gammaAdjust with a gamma of 0.5.
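A sketch of that chain (diff here stands for the "luminance minus mean" image from the previous step; standardDeviation(ofZeroMean:) is my own helper name):

import CoreImage
import CoreImage.CIFilterBuiltins

// Sketch: a 1x1 image holding the standard deviation of the zero-mean image `diff`.
func standardDeviation(ofZeroMean diff: CIImage) -> CIImage {
    let square = CIFilter.multiplyCompositing()
    square.inputImage = diff
    square.backgroundImage = diff                 // diff * diff = squared deviations

    let average = CIFilter.areaAverage()
    average.inputImage = square.outputImage
    average.extent = diff.extent                  // 1x1 image holding the variance

    let squareRoot = CIFilter.gammaAdjust()
    squareRoot.inputImage = average.outputImage
    squareRoot.power = 0.5                        // x^0.5 = square root -> standard deviation
    return squareRoot.outputImage!                // read back on the CPU later, for the division
}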

Division


We can't actually divide by the standard deviation using Core Image as far as I can tell - CIDivideBlendMode doesn't seem to do a naive division like we want. However, because the standard deviation is constant across the whole image, we can hoist that division out of the dot product computation. The dot product results in a single scalar, and the standard deviation for an image is a single scalar, so we can just calculate these things independently, and then divide them on the CPU afterwards. No problem.

Dot Product


Okay, so now we've got our zero-normalized images. Let's do a dot product and average the result! This is pretty easy too - a dot product is just CIFilter.multiplyCompositing, and averaging the result is CIFilter.areaAverage.
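And to get that final 1x1 image back as an actual number on the CPU (which is also where the divisions by the standard deviations happen), I render it into a tiny float buffer; a sketch:

import CoreImage

// Sketch: read the single RGBAf pixel of a 1x1 CIImage back as a Float.
func scalarValue(of onePixelImage: CIImage, context: CIContext) -> Float {
    var pixel = [Float](repeating: 0, count: 4)
    context.render(onePixelImage,
                   toBitmap: &pixel,
                   rowBytes: pixel.count * MemoryLayout<Float>.size,
                   bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                   format: .RGBAf,
                   colorSpace: nil)
    return pixel[0]   // the Y value was broadcast to every channel, so any channel works
}

// The context needs the signed working format mentioned earlier:
// let context = CIContext(options: [.workingFormat: CIFormat.RGBAf])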

Phew! All that Core Image work just to get a single value!

Optical Flow




Optical flow between 2 images produces an (x, y) displacement vector that indicates, for each pixel in the first image, where it moved to in the second image. I'm interested in this because as I rotate the camera around in the game, that should cause most of those displacements to be pointing in roughly the same direction across the whole image. So, the average of the optical flow should tell me if I'm moving the camera or not.

On the other hand, if I'm walking forward in the game, then pixels at the top of the screen will move up, pixels on the right will move more to the right, etc. In that situation, the displacements will all cancel each other out! That's why I'm also interested in the standard deviation of the flow. If the average is 0, but the standard deviation is high, that means I'm walking forward (or backward). If the standard deviation is 0, but the average is high, that means I'm turning the camera.

Calculating this is pretty straightforward. Apple's Vision framework contains an API to calculate optical flow. We can then calculate its average and standard deviation using the same method we used above, in the cross correlation section.

Optical flow is a 2-dimensional value - pixels can move in the X and Y dimensions. I'm not really interested in the direction of the movement, though; I'm more interested in the amount of movement. So, after calculating the average and the standard deviation, I take their magnitudes, to turn them into scalar values.
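A sketch of the Vision side of this (I'm assuming CGImage inputs; the resulting pixel buffer is a two-channel float image of per-pixel (dx, dy) values, which then goes through the same mean/standard-deviation machinery as before, with hypot() applied at the end):

import Vision
import CoreVideo

// Sketch: generate the optical flow between two frames.
// (I believe the flow is computed from the handler's image toward the targeted
// image, but it's worth double-checking the direction.)
func opticalFlow(from previous: CGImage, to current: CGImage) throws -> CVPixelBuffer? {
    let request = VNGenerateOpticalFlowRequest(targetedCGImage: current, options: [:])
    let handler = VNImageRequestHandler(cgImage: previous, options: [:])
    try handler.perform([request])
    guard let observation = request.results?.first as? VNPixelBufferObservation else { return nil }
    return observation.pixelBuffer   // 2 channels: x and y displacement per pixel
}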

Average Luminance



This is pretty straightforward - in fact, we already calculated this above in the cross-correlation section. It's useful because the menus have a black background, so they are darker than regular gameplay. Also, the luminance of menus is very consistent, as opposed to regular gameplay, where luminance is going up and down all the time. So, just looking at a graph of average luminance, you can kind of already see where the menus are in the video.

Partitioning

Alright, now we've extracted 5 features from each frame of video. Each of these features is a single scalar value. The next task is to try to use statistical analysis to determine which parts of the video are the boring parts that I should cut out, and which parts are full of action and should be kept in.

Originally, I thought I could just use cross-correlation to do this. I thought that if the cross-correlation between adjacent frames is low, that means I've cut to a menu or something, and that would be a good place to cut the video up. However, this turned out not to work very well, because menus actually come on screen with an animation, so they don't actually have low cross-correlation. Also, regular gameplay has a bunch of flashes in it (things explode, the camera can turn quickly, visual effects distort the screen, etc.).

Instead, I wanted to model the data I had gathered using a piecewise constant function. E.g., during gameplay, the 5 features will adhere to a particular distribution, and during menus or boring parts, the 5 features will adhere to a different distribution. I'm modelling these distributions as normal distributions, but with different means. I'm trying to partition the data by time, and calculate a new normal distribution for each partition, such that each partition's distribution fits its data as well as possible. The trick here is to find the partitioning points. I'm looking for a statistical method of finding places where the data is discontinuous.

Originally, I thought this would be a classic K-means clustering problem, but after implementing it, it turned out not to work very well. No matter how hard I tried, the partitions overlapped a lot, and looked pretty random. So that didn't work.



Bayesian Information Criterion



Next, I discovered something called the Bayesian information criterion (BIC). This comes from the science of model selection, which is essentially exactly what I'm trying to do. I have a bunch of data, and I have a family of models in mind (disjoint normal distributions, one for each partition) and I'm trying to select among the family for the best model for my data.

The BIC is a measure of how well data fits a model. Importantly, though, it tries to mitigate overfitting. For example, in my data, if every data point was in its own partition, then the model would perfectly fit the data; but this wouldn't actually solve my problem. The Bayesian information criterion has 2 terms - one that reflects how well the data fits the model, and another one that reflects how many parameters the model has. The better the data fits the model, the better the BIC; but the more parameters in the model, the worse the BIC. It's essentially a way of balancing fitting vs overfitting.

There are 2^(n-1) different ways of partitioning n values into contiguous runs, and for me, n is the number of frames in 77 hours of video. So exhaustively calculating the BIC for every possible partitioning is clearly too expensive. I instead opted to use a greedy approach. Given a particular partitioning, we can try to add a new split at every location (which is O(n) locations), and calculate the BIC as-if we split at that position. We then pick the location which results in the best BIC, and add it into the partitioning. We keep doing this until we find there is no single new splitting location which will improve the BIC (because adding another split would overfit the model).

If you preprocess the data to precalculate prefix-sums, you can answer "sum all the values from here to there" queries in O(1) time. This allows you to compute the BIC for a single partition in O(1) time (if you expand out the polynomial in the formula).

Therefore, determining the BIC for a particular partitioning is O(number of partitions). Every time we pick a splitting point, the number of partitions increases by 1. We want to calculate a BIC once for each candidate splitting point, which is O(number of partitions * number of frames). Even better, calculating the BIC at every candidate splitting point is an embarrassingly parallel problem, and we can parallelize this across all the cores in our machine. It turns out this is fast enough - the longest single video I had is 5 hours long, and this algorithm completed on that video in around 10 minutes on my 20-core machine.
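Here's the shape of the whole thing as a sketch. The scoring is my own simplified version of the BIC for this model - roughly n·ln(RSS/n) + penalty·k·ln(n), where RSS is the residual sum of squares over all partitions and k is the number of partitions - and the real version parallelizes the candidate loop:

import Foundation

// Sketch: greedy BIC-based partitioning of a 1-dimensional signal.
struct Partitioner {
    let values: [Double]
    let prefixSum: [Double]     // prefixSum[i] = sum of values[0..<i]
    let prefixSumSq: [Double]   // prefixSumSq[i] = sum of squares of values[0..<i]
    let penalty: Double         // multiplier on the "number of parameters" term

    init(values: [Double], penalty: Double) {
        self.values = values
        self.penalty = penalty
        var sums = [0.0], squares = [0.0]
        for v in values {
            sums.append(sums.last! + v)
            squares.append(squares.last! + v * v)
        }
        prefixSum = sums
        prefixSumSq = squares
    }

    // Residual sum of squares of one partition [start, end), against its own mean, in O(1).
    func partitionRSS(_ start: Int, _ end: Int) -> Double {
        let n = Double(end - start)
        let sum = prefixSum[end] - prefixSum[start]
        let sumSq = prefixSumSq[end] - prefixSumSq[start]
        return sumSq - sum * sum / n
    }

    func bic(splits: [Int]) -> Double {
        let boundaries = [0] + splits + [values.count]
        var rss = 0.0
        for i in 0..<(boundaries.count - 1) {
            rss += partitionRSS(boundaries[i], boundaries[i + 1])
        }
        let n = Double(values.count)
        let k = Double(boundaries.count - 1)
        return n * log(max(rss / n, 1e-12)) + penalty * k * log(n)
    }

    // Greedily add the single split that improves the BIC the most, until nothing helps.
    func partition() -> [Int] {
        guard values.count > 1 else { return [] }
        var splits: [Int] = []
        var bestScore = bic(splits: splits)
        while true {
            var bestCandidate: (index: Int, score: Double)?
            for candidate in 1..<values.count where !splits.contains(candidate) {
                let score = bic(splits: (splits + [candidate]).sorted())
                if bestCandidate == nil || score < bestCandidate!.score {
                    bestCandidate = (candidate, score)
                }
            }
            guard let found = bestCandidate, found.score < bestScore else { return splits }
            splits = (splits + [found.index]).sorted()
            bestScore = found.score
        }
    }
}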

There are 2 tweaks that I ended up having to do to the above:
  1. The BIC assumes that the data you have is one-dimensional. However, my data is 5-dimensional; I extracted 5 features from each frame of the video, so each data point has 5 components. There may be a way to generalize the BIC to higher dimensions, but I instead opted to do something similar: for each 5-dimensional data point, create a single one-dimensional synthetic data point that represents it. This is meaningless, but isn't without precedent: there are lots of examples of people using this approach to average together multiple benchmarks into a single synthetic score. I opted to combine the 5 values using the geometric mean (see the sketch after this list), because that is sensitive to ratios of the data points, rather than the magnitudes of the data points themselves. (Arithmetic mean is definitely wrong, because calculating the arithmetic mean involves adding together the 5 values, but the 5 values have different units, so they can't be added.)
    1. I also decided to use 1 - cross correlation instead of cross correlation itself, because all the other values have a baseline near 0, but cross correlation has a baseline near 1 (because most frames are similar to their adjacent frames). This makes all the values behave a bit more similarly, and makes ratios more meaningful.
  2. The BIC formula involves a tradeoff between fitting the data and overfitting the data. The way overfitting is measured is that it's proportional to the number of parameters in the model. For me, each different partition has (I believe) 2 parameters: the mean of the data within that partition, and the constant variance of the errors. However, using this value causes the data to be overfitted significantly, so I instead multiplied that by 10, which ended up with a pretty good result. Doing it this way ended up with the average partition in each video being around 30 - 60 seconds long, regardless of the length of the video being partitioned. So that's a pretty cool result.
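For reference, here's the synthetic-value calculation from tweak 1 as a sketch (the tiny epsilon is mine, just to keep the logarithm finite when a feature happens to be exactly 0):

import Foundation

// Sketch: collapse the 5 per-frame features into one synthetic value
// via the geometric mean (= exp of the mean of the logs).
func syntheticValue(wordCount: Double,
                    crossCorrelation: Double,
                    flowAverageMagnitude: Double,
                    flowStandardDeviation: Double,
                    averageLuminance: Double) -> Double {
    let features = [wordCount,
                    1 - crossCorrelation,     // flipped so its baseline sits near 0, like the others
                    flowAverageMagnitude,
                    flowStandardDeviation,
                    averageLuminance]
    let epsilon = 1e-9
    let meanOfLogs = features.map { log($0 + epsilon) }.reduce(0, +) / Double(features.count)
    return exp(meanOfLogs)
}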

Writing out the Result



The last step is to pick a threshold: for each partition, I have to determine whether that partition should be present in the final video. I already have a model, where each partition is associated with the average value of the synthetic one-dimensional calculated values inside that partition. The way I picked which regions were in and which were out was just by inspecting a bunch of partitions, to see what was happening in the video during that time. I found that, if the average synthetic value was less than around 0.12145, the partition was boring and should not be included.

Writing the result is pretty straightforward - I'm using AVAssetWriter and passing it frames from the input file (which was read using AVAssetReader).
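The core of that loop looks roughly like this (a condensed sketch: single video track, passthrough settings, no audio, no error handling, and shouldKeep(_:) is a hypothetical stand-in for "is this timestamp inside a kept partition?"; the real version also has to shift timestamps so the kept pieces butt up against each other):

import AVFoundation

// Sketch: copy video samples from one file to another, dropping the boring ones.
func rewrite(input: URL, output: URL, shouldKeep: @escaping (CMTime) -> Bool) throws {
    let asset = AVURLAsset(url: input)
    let track = asset.tracks(withMediaType: .video).first!

    let reader = try AVAssetReader(asset: asset)
    let readerOutput = AVAssetReaderTrackOutput(track: track, outputSettings: nil)   // passthrough
    reader.add(readerOutput)

    let writer = try AVAssetWriter(outputURL: output, fileType: .mov)
    let writerInput = AVAssetWriterInput(mediaType: .video, outputSettings: nil)     // passthrough
    writer.add(writerInput)

    reader.startReading()
    writer.startWriting()
    writer.startSession(atSourceTime: .zero)

    let queue = DispatchQueue(label: "writer")
    writerInput.requestMediaDataWhenReady(on: queue) {
        while writerInput.isReadyForMoreMediaData {
            guard let sample = readerOutput.copyNextSampleBuffer() else {
                if reader.status == .failed { print("read error:", reader.error as Any) }
                writerInput.markAsFinished()
                writer.finishWriting {}
                return
            }
            if shouldKeep(CMSampleBufferGetPresentationTimeStamp(sample)) {
                writerInput.append(sample)
            }
        }
    }
}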

Results

All this work seems, somewhat surprisingly, to give pretty good results. Not perfect, but better than I was expecting. In the first few hours of gameplay that I reviewed, it neatly cut out:
  1. A section when I was reading an in-game lore book thing (it was just showing text on the screen for a minute or so)
  2. A section when the game was paused
  3. A section when I was fussing with my inventory for a few minutes
  4. A section when I was looking at my character in the mirror
The remarkable part of this is that it cut out all these things wholesale: from right when they started, to right when they ended. So you see the character walk into the elevator, and then they're immediately walking out of the elevator. And it didn't cut out any of the action or interesting parts. It also seems to account for the duration of events: there was a point when I read a different in-game lore book, but I only read it for a few seconds, and it kept that part in - presumably because it wasn't worth another partition.

Wednesday, May 12, 2021

Understanding CVDisplayLink

I found it actually somewhat difficult to understand how to use CVDisplayLink. But, after a while of playing around with it, I think I've got a pretty good handle on it. It's not too complicated.

The main use of a CVDisplayLink is to have a callback that runs once per vsync of a screen. It's also stateful, so you can stop and start the callback stream.

Creation

When you create one of these objects, you have to tell the system which screen to match - because different screens can have different refresh rates. The type CVDisplayLink accepts to do this is CGDirectDisplayID. You can get this from an NSScreen* as follows:

NSDictionary<NSDeviceDescriptionKey, id> *deviceDescription = theScreen.deviceDescription;

NSNumber *directDisplayIDNumber = deviceDescription[@"NSScreenNumber"];

CGDirectDisplayID directDisplay = directDisplayIDNumber.unsignedIntValue;

Then, you can use CVDisplayLinkCreateWithCGDisplay() to create the object for that display.

Setup

The callback can be either a block or a C function. The block doesn't need a void* userInfo object because that context is implicitly captured by the block. So, you just say:

CVDisplayLinkSetOutputHandler(displayLink, ^CVReturn (CVDisplayLinkRef displayLink, const CVTimeStamp *inNow, const CVTimeStamp *inOutputTime, CVOptionFlags flagsIn, CVOptionFlags *flagsOut) {

    ...

    return kCVReturnSuccess;

});

And then you start it with just CVDisplayLinkStart(displayLink). Easy peasy. There are also functions for stopping, retaining, and releasing the CVDisplayLink.

Interpreting the Arguments

It actually took me quite a while to figure out what each of the arguments means. The docs say that flagsIn and flagsOut are 0, and the displayLink is the CVDisplayLink that you started, so there are only really two interesting arguments: inNow, and inOutputTime, both of which are of type CVTimeStamp. inNow represents the time that this callback is being run, and inOutputTime represents the time that anything you draw in the callback is supposed to show up at.

So, let's dig into CVTimeStamp. The version and reserved fields are 0, and the flags field tells you which of the fields in the CVTimeStamp are valid. I don't know what SMPTE time is, but it never seems to be set/valid, so I'm going to ignore it. These are the fields that remain:

  • hostTime
  • rateScalar
  • videoRefreshPeriod
  • videoTime
  • videoTimeScale

The thing you have to realize is that there are two timelines happening concurrently: "host" time and "video" time. So, a "point" in time actually has two different representations: one for each of the timelines.

The hostTime field uses the same tick count that mach_absolute_time() uses. To convert it to seconds, you have to use mach_timebase_info(). And, the "meaning" of the hostTime field is the current time as measured by your application - exactly what mach_absolute_time() returns.

The videoTime field does not use those same tick counts. Instead, it uses the videoTimeScale field. It's a rational number: videoTime / videoTimeScale = seconds. videoRefreshPeriod is a rational number too, using the same denominator, but it represents the delta between adjacent video frames.
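Converting both representations into seconds looks like this (a sketch in Swift; the 1,000,000,000 is because mach_timebase_info() gives a ratio of host ticks to nanoseconds):

import CoreVideo
import Darwin

// Sketch: the two timelines in a CVTimeStamp, each converted to seconds.
func hostSeconds(_ timeStamp: CVTimeStamp) -> Double {
    var info = mach_timebase_info_data_t()
    mach_timebase_info(&info)
    // hostTime ticks -> nanoseconds -> seconds
    return Double(timeStamp.hostTime) * Double(info.numer) / Double(info.denom) / 1_000_000_000
}

func videoSeconds(_ timeStamp: CVTimeStamp) -> Double {
    // videoTime / videoTimeScale = seconds; videoRefreshPeriod uses the same denominator.
    return Double(timeStamp.videoTime) / Double(timeStamp.videoTimeScale)
}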

For CVDisplayLink, the "video" time represents time as measured by vsyncs. You can think of vsyncs as an independent clock - it ticks every so often, and those ticks don't have to be in exact cadence with any of the other clocks on the system. They're supposed to be, but when you actually measure them, they won't perfectly line up, because of course nothing is that perfect. So, if videoRefreshPeriod / videoTimeScale equals 1/60, and you record adjacent frames' hostTimes and convert them to seconds using mach_timebase_info(), you'll get something that's close to 1/60, but it won't be exact, because nothing is ever that exact all the time.

So that's what rateScalar tries to measure. It's the only field that is floating point, and it measures the speed of the video timeline relative to the speed of the host timeline. Ideally, it would always be 1.0, but, of course, nothing is ever that perfect. It's not sensitive to workload, just as time doesn't dilate when you start asking your computer to do some work.

The video time is time based on vsyncs, not time based on the window server render loop or the core animation render loop. If some other application loads up a big Core Animation scene, your CVDisplayLink isn't going to tick slower.

Also, I assume the fact that videoRefreshPeriod is passed into each callback indicates that displays can change their refresh rate ... but I'm not sure.

Thursday, March 7, 2019

Addition Font


Shaping


When laying out text using a font file, the code points in the string are mapped one-to-one to glyphs inside the file. A glyph is a little picture inside the font file, and is identified by an ID, which is a number from 0 to 65535. However, there’s a step after character mapping but before rasterization: shaping.

For example, one application of shaping is seen in Arabic text. In Arabic, each letter has four different forms, depending on which letters are next to it. For example, two letters in their isolated form look like ف and ي but as soon as you put them together, they form new shapes and look like في. This type of modification isn’t possible if characters are naively mapped to glyphs and then rasterized directly. Instead, there needs to be a step in the middle to modify the glyph forms so the correct thing is rasterized.

This “middle step” is called shaping, and is implemented by three tables inside the OpenType font file: GSUB, GPOS, and GDEF. Let’s consider GSUB alone.

GSUB


The GSUB table, or “glyph substitution” table, is designed to let font authors replace glyphs with other glyphs. It describes a transformation where the input is a sequence of glyphs and the output is a different sequence of glyphs. It is made up of a collection of constituent “lookup tables,” each of which has a “type.”

Type 1 (“single substitution”) provides a map from glyph to glyph. This is used, for example, when someone enables the ‘swsh’ feature: the font can substitute the regular ampersand with a fancy ampersand. In that situation, the map would contain a mapping from the regular ampersand to the fancy ampersand (possibly along with some additional mappings, too).

Type 2 (“multiple substitution”) provides a map from glyph to sequence-of-glyphs. This is used, for example, if diacritic (accent) marks are represented as separate glyphs inside the font. The font can replace the “è” glyph with the “e” glyph followed by the ◌̀ glyph (and then the GPOS table later can position the two glyphs physically on top of each other).

Type 4 (“ligature substitution”) provides a map of sequence-of-glyphs to single glyph (the opposite of type 2). This is used for ligatures, so if you have a fancy “ffi” ligature, you can represent all three of those letters in the same fancy glyph.

Type 5 (“contextual substitution”) is special. It doesn’t do any replacements directly, but instead maps a sequence of glyphs to a list of other tables that should be applied at specific points in the glyph sequence. So it can say things like “in the glyph sequence ‘abcde’, apply table #7 at index 2, and then when you’re done with that, apply table #13 at index 4.” Tables #7 and #13 can be any of the types above, so you could use this table to say something like “swap out the ‘d’ for an ‘f’, but only if it appears in the sequence ‘abcde’.” This sort of thing is used to implement the “contextual alternates” feature.

There are also three other types, but they’re not particularly relevant, so I’m going to ignore them.

So, the inputs to the text system are a set of features and an input string of glyphs (The characters have already been mapped to glyphs via the “cmap” table). Features are mapped to a set of lookup tables, each of which is of a type listed above. Each of those lookup tables describes a map where the keys are sequences of glyphs, so the runtime iterates through the glyph sequence until it finds a sequence that’s an input to one of the tables. The runtime then performs that glyph replacement according to the rules of the tables, and continues iterating.
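Here’s a toy model of that loop, just to make the control flow concrete (this is not real OpenType code - it only handles flat input-sequence to output-sequence lookups, i.e. types 1, 2, and 4, with no contextual recursion):

// Toy model of the GSUB application loop: walk the glyph stream, and whenever a
// lookup's input sequence matches at the current position, splice in its output
// and keep going. Real shapers are far more involved than this.
struct Lookup {
    let input: [Int]     // sequence of glyph IDs to match
    let output: [Int]    // glyph IDs to substitute in their place
}

func apply(lookups: [Lookup], to glyphs: [Int]) -> [Int] {
    var result: [Int] = []
    var index = 0
    while index < glyphs.count {
        if let match = lookups.first(where: { glyphs[index...].starts(with: $0.input) }) {
            result.append(contentsOf: match.output)
            index += match.input.count
        } else {
            result.append(glyphs[index])
            index += 1
        }
    }
    return result
}

// A type-1-style single substitution and a type-4-style ligature, with made-up glyph IDs:
// apply(lookups: [Lookup(input: [5], output: [200]),
//                 Lookup(input: [10, 10, 12], output: [300])],
//       to: [5, 10, 10, 12])   // -> [200, 300]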

Turing Complete


So this is pretty cool, but it turns out that the contextual substitution lookup type is really powerful. This is because the table that it references can be itself, which means it can be recursive.

Let’s pretend we have a lookup table named 42 (presumably because it’s the 42nd lookup table inside the font), and it’s a contextual substitution lookup table. This table maps glyph sequences to tuples of (lookup table to recurse to, offset in glyph sequence to recurse). Let’s say we design it with these two mappings:

Table42 {
    A A : (Table42, 1);
    A B : (Table100, 1);
}

If the runtime is operating on a glyph sequence of “AAAAB”, the first “AA” will match the first rule, so the system recurses and runs Table42 on the stream “AAAB”. Its first “AA” matches again, and so on. This continues until you get to the end of the string, where “AB” matches, and then Table100 is run on the string “B”.

This is “tail recursion,” and can be used to implement all different types of loops. Also, each mapping in the table acts as an “if” statement because it only executes if the pattern is matched.

You can use the glyph stream as memory by reading and writing to it; that is, after all, what the shaping algorithm is designed to do. You can delete a glyph by using Type 2 to map it to an empty sequence. You can insert a glyph by using Type 2 to map the preceding glyph to a sequence of [itself, the new glyph you want to insert]. And, once you’ve inserted a glyph, you can check for its presence by using the “if” statements described above.

So that’s a pretty powerful virtual machine. I think the above is sufficient to prove Turing-completeness.

Caveat


So it turns out the example (“Table42”) above doesn’t actually work in DirectWrite. This is because, in DirectWrite, inner matches have to be entirely contained within outer matches. So when the outer call to Table42 matched “AA”, the inner call to Table42 can only match within that specific “AA”. This means it’s impossible to, for example, find the first even glyphID and move it to the beginning. So, DirectWrite’s implementation isn’t Turing complete. However, it does work in HarfBuzz and CoreText, so those implementations are Turing complete.

But even in HarfBuzz and CoreText, there are hard limits on the recursion depth. HarfBuzz sets its limit to 6. Therefore, the above example will only work on strings of length 7 or fewer. HarfBuzz is open source, though, so I simply used a custom build of HarfBuzz which bumps up this limit to 4 billion. This let me recurse to my heart’s content. A limit of 6 is probably a good thing; I don’t think users generally expect their text engines to be running arbitrary computation during layout. But I want to go beyond it.

DSL


After making the above realizations, I decided to try to implement a nontrivial algorithm using only the GSUB table in a font. I wanted to try to implement addition. The input glyph stream would be of the form “=1234+5678=” and the shaping process would turn that string into “6912”.

When thinking about how to do this, I started jotting down some ideas on paper for what the lookup tables should be, and the things I was writing down were very similar to that “Table42” example above. However, writing down tables of types 1, 2, and 4 was quite cumbersome, because what I really wanted to describe were things like “move this glyph to the beginning of the sequence” rather than individual insertions or deletions.

I looked at the “fea” language, which is how these lookups are traditionally written by font designers. However, after reading the parser, it looks like it doesn’t support recursive or mutually-recursive lookups.

So, I did what any good programmer does in this situation: I invented a new domain-specific language.

The DSL has two types of statements. The first is a way of giving a set of glyphs a name. I wanted to be able to address all of the digits without having to write out every individual digit. So, there’s a statement that looks like this:

digit: 1 2 3 4 5 6 7 8 9 10;

Note that those numbers are glyph IDs, not code points. In my specific font, the “0” character is mapped to glyph 1, “1” is mapped to glyph 2, etc.

Then, you can describe a lookup using the syntax above:

digitMove {
    digit digit: (1, digitMove), (1, digitMove2);
    digit plus: (1, digitMove2);
}

The lookup is named digitMove, and it has two rules inside it. The first rule matches any two digits next to each other, and if they match, it invokes the lookup named digitMove (which is the name of the current lookup, so this is recursive) at index 1, and then after that, invokes the lookup named digitMove2 at index 1.

Each of these stanzas gets translated fairly trivially to a lookup of type 5.

These rules are recursive, as above, but they need a terminal form so that the recursion will eventually end. Those are described like this:

flagPayload {
    digit digit plus digit: flag \3 \0 \1 \2;
}

This rule is a terminal because there are no parentheses on the right side of the colon. The right side represents a glyph sequence to replace the match with. The values on the right side are either a literal glyph ID, a glyph set that contains exactly one glyph, or a backreference, which starts with a backslash. “\3” means “whatever glyph was at index 3 in the matched sequence” (indices start at 0). So the rule above would turn the glyph sequence “34+5” into the sequence “F534+”. (The flag glyph is read and removed in a later stage of processing).

Translating one of these rules to lookups is nontrivial. I tried a few things, but ended up with the following design:

For each output glyph, right to left:

  • If it’s a backreference, duplicate the glyph it’s referencing, and perform a sequence of swaps to move the new glyph all the way to the right.
  • If it’s a literal glyph, insert it at the beginning, and perform a sequence of swaps to move it all the way to the right.

This means we need to have a way to do the following operations:

  • Duplicate. This is a type 4 lookup that maps every glyph to a sequence of two of that glyph.
  • Swap. This has two pieces: a type 5 lookup that has a rule for every combination of two glyphs, and each rule maps each glyph to a lookup of type 1 which replaces it with the appropriate glyph. This means you need n of these inner (type 1) lookups, allowing you to map any glyph to any other glyph. However, the encoding format allows us to encode each of these inner lookups in constant space in the font, so these inner lookups don’t take that much space. Instead, the outer type 5 lookup takes n^2 space.
  • Insert a literal. If you implemented this by simply making a type 2 that mapped every glyph to that same glyph + the literal, you would need n^2 space because there would be n of these tables. Instead, you can cut down the size by doing it in two phases: inserting a flag glyph (which is O(n) space using a lookup type 2) and mapping that glyph to any constant value (also O(n) space using a type 1).

Above, I’m worried about space constraints in the file because pointers in the file are (in general) 2 bytes, meaning the maximum size that anything can be is 2^16. If n^2 space is needed, that means n can only be 2^8 = 256, which isn’t that big. Most fonts have on the order of 256 glyphs. Therefore, we need to reduce the places where we require O(n^2) space as much as possible. LookupType 7 helps somewhat, because it allows you to use 32-bit pointers in one specific place, but it only helps in that one place.

My font only has 14 glyphs in it, so I didn’t end up near any of these limits, but it’s still important to watch out for out-of-bounds problems.

So, given all that, we can make a parser which builds an AST for the language, and we can build an intermediate representation which represents the bytes in the file, and we can make a lowering phase which lowers the AST to the IR. Then we can serialize the IR and write out the data to the file.

Addition


So, once the language was up and running, I had to actually write a program that represented addition. It works in four phases.

First, define some glyphs:

digit: 1 2 3 4 5 6 7 8 9 10;
flag: 13;
plus: 11;
equals: 12;
digit0: 1;

Then, parse the string. If the string didn’t match the form “=digits+digits=” then I wanted nothing to happen. You can do this by recursing across the string, and if you find that it matches the pattern, insert a flag, and then when all the calls return, move the flag leftward.

parse {
    equals digit: (1, parseLeft), (0, afterParse);
}

parseLeft {
    digit digit: (1, parseLeft), (0, moveFlagAcrossDigit);
    digit plus digit: (2, parseRight), (0, moveFlagAcrossPlus);
}

parseRight {
    digit digit: (1, parseRight), (0, moveFlagAcrossDigit);
    digit equals: flag \0 \1;
}

moveFlagAcrossDigit {
    digit flag digit: \1 \0 \2;
}

moveFlagAcrossPlus {
    digit plus flag digit: \2 \0 \1 \3;
}

afterParse {
    equals flag: (0, removeFlag), (0, startDigitMove);
}

removeFlag {
    equals flag: \0;
}

The next step is to pair up the glyphs. For example, this would turn “1234+5678” into “15263748+”.

startDigitMove {
    equals digit: (1, digitMove), (1, startPhase2), (0, removeEquals);
}

removeEquals {
    equals digit: \1;
}

digitMove {
    digit digit: (1, digitMove), (1, digitMove2);
    digit plus: (1, digitMove2);
}

digitMove2 {
    digit digit: (1, digitMove2), (0, swapDigits);
    digit plus digit: (2, digitMove2), (0, digitMove3);
    digit plus equals: digit0 \0 \1 \2;
    plus digit: (1, digitMove2), (0, swapPlusDigit);
    plus equals: digit0 \0 \1;
    digit equals: \0 \1;
}

swapDigits {
    digit digit: \1 \0;
}

digitMove3 {
   digit plus digit: \2 \0 \1;
}

swapPlusDigit {
    plus digit: \1 \0;
}

The next step is to see if there were any glyphs left over on the right side that didn’t get moved. This happens if the right side is longer than the left side. For example, if the input string is “12+3456” we now would have “1526+34”. We want to turn this into “03041526”, pairing the leftover digits with zeroes.

startPhase2 {
    digit digit: (0, phase2), (0, beginPhase3);
}

phase2 {
    digit digit: (0, phase2Move), (0, checkPayload);
}

phase2Move {
    digit digit digit digit: (2, phase2Move), (0, movePayload);
    digit digit plus equals: \0 \1 \2 \3;
    digit digit plus digit: (3, payload), (0, flagPayload);
}

payload {
    digit digit: (1, payload), (0, swapDigits);
    digit equals: \0 \1;
}

flagPayload {
    digit digit plus digit: flag \3 \0 \1 \2;
}

movePayload {
    digit digit flag digit: \2 \3 \0 \1;
}

checkPayload {
    flag digit digit digit: (0, rearrangePayload), (0, phase2);
}

rearrangePayload {
    flag digit digit digit: digit0 \1 \2 \3;
}

The last step is to actually perform the addition. This works like a ripple carry adder. We want to take the glyphs two-at-a-time, and add them, and produce a carry. Then the next pair of glyphs will add, and include the carry. We start the process by introducing a carry = 0.

beginPhase3 {
    digit digit: (0, phase3), (0, removeZero);
}

phase3 {
    digit digit digit digit: (2, phase3), (0, addPair);
    digit digit plus equals: (0, insertCarry), (0, addPair);
}

insertCarry {
    digit digit plus equals: \0 \1 digit0;
}

removeZero {
    digit0 digit: (1, removeZero), (0, removeSingleZero);
}

removeSingleZero {
    digit0 digit: \1;
}

addPair {
    1 1 1: 1 1;
    1 1 2: 1 2;
    1 2 1: 1 2;
    1 2 2: 1 3;
    1 3 1: 1 3;
    1 3 2: 1 4;
    … more here
}

HarfBuzz


So I’m doing this whole thing using the HarfBuzz shaper, as described above. This is because it’s open source, so I can find where I’m hitting limits and increase the limits. It turned out that I not only had to increase the HB_MAX_NESTING_LEVEL to 4294967294, but I also was running into more limits. I ended up just taking all the limits in hb-buffer.hh, hb-machinery.hh, and hb-ot-layout-common.hh and increasing them by a factor of 10.

There’s one more piece that was necessary to get it to work. Inside apply_lookup() in hb-ot-layout-gsubgpos.hh, there’s a section if (end <= int (match_positions[idx])). It looks to me like this section is detecting if a recursive call caused the glyph sequence to get shorter than the size of the match. Inside this block, it says  /* There can't be any further changes. */ break; which seems to stop the recursion (which seems incorrect to me, but I’m not a HarfBuzz developer, so I could be wrong). In order to get this whole system to work, I had to comment out the “break” statement.

So that’s it! After doing that, the system works, and the font correctly adds numbers. The font has 75 shaping rules and is 32KB large.

The glyph paths (contours) were taken from the Retroscape font.

Font file download

Saturday, March 2, 2019

Wide color vs HDR

Over the past few years, there’s been something of a renaissance in display technology. It started with retina displays and now is extending to wide color and HDR. Wide color has been added to Apple’s devices, and HDR support has arrived in Windows. These are similar technologies, but they aren’t the same.

HDR allows you to display the same colors you could display without it, but at a higher luminosity (colloquially: brightness). This is sort of like the difference between one red lightbulb and two red lightbulbs. Looking at two red lightbulbs doesn’t change the color of the red; it’s just brighter.

Wide color, on the other hand, lets you see colors that weren’t possible to see before. It can display colors that are more saturated than could otherwise have been displayed.

HDR monitors use the same color primaries as non-HDR monitors, but the luminosity of each of those primaries can grow beyond 1.0. On the other hand, wide color monitors use different, more saturated primaries.


[Interactive 3D gamut diagram - click and drag to rotate]
The colorful cube is sRGB, normalized to the luminosity of an iPad Pro screen. The white lines describe the gamut of an iPad Pro screen using the Display-P3 color space. The light blue describes the gamut of an ASUS ROG PG27UQ monitor, which is both HDR and wide color. The purple describes the gamut of a SurfaceBook laptop. The coordinate system is XYZ, but transformed such that sRGB is a unit cube.

In the above diagram, luminosity is roughly equivalent to distance in the +X+Y+Z direction. The chroma (hue and saturation) of a point is roughly the angle between two lines, one of which goes through the origin and the point, and the other goes through the origin and pure white. Therefore, wider colors are characterized by the three primary axes pointing in more opposite directions, whereas luminosity is roughly how far those lines extend.

You can see this above. The black and white points are shared between sRGB and Display P3, but the Display P3 monitor can show more points around the middle. So it isn’t more luminous, but it is wider. The ASUS monitor is both wide and HDR, so its axes open up widely, and also extend very far. A monitor that’s HDR but not wide would have the same primaries as sRGB, but would extend out far like the ASUS monitor.

Luminosity isn’t just tangentially related to color; in fact, each color has exactly one luminosity value. If you take a color and convert it to the XYZ color space, the Y component is luminosity. So, an HDR monitor can show colors with Y components significantly larger than non-HDR monitors can. A wide color monitor can’t, but it can show colors with X and Z values beyond what non-wide monitors can show.

This is kind of interesting, because the sRGB spec says that its white point is defined to be 80 nits (which is the unit of luminosity). However, over the decades, monitors have gotten brighter, presumably because, psychologically, consumers prefer to buy brighter displays over dimmer displays. Nowadays, most monitors are around 200-300 nits. Therefore, if you strictly adhere to the spec, an sRGB color value (r, g, b) should be some particular point in XYZ space, but in practice, because everyone bought brighter monitors, those same color values (r, g, b) are actually a point with a much greater Y value in XYZ. So different displays have different primaries, but they also have different luminosities, which affect how far from the origin the sRGB white point lands in XYZ. You can see this in the above diagram - the SurfaceBook’s maximum white point is significantly smaller than the color cube, which is because the Surface Book reports a luminosity of only 270 nits. The diagram above is normalized to the luminosity of an iPad Pro, which is measured by laptopmag.com to be 368 nits.

You can get all this information on Windows by using the IDXGIOutput6::GetDesc1() API call. This call gives you a lot of information, and it’s a little bit difficult to decipher. The redPrimary, greenPrimary, and bluePrimary give you the direction of each of the primaries in XYZ space. Each one is reported as an (x, y) tuple, which is the result of the calculation X/(X+Y+Z) and Y/(X+Y+Z), respectively[6]. Notice that you’re only given two pieces of information; that means that this isn’t a 3D point in XYZ space, but it’s rather a line. The line can be given in parametric form:

X(t) = x * t
Y(t) = y * t
Z(t) = (1-x-y) * t

As you can see, this passes through the origin and extends outward in some direction forever. Therefore, these xy values give you direction, but not magnitude.

To get magnitude, you need to consider the white point. The white point also has a direction, given in xy coordinates, which tells you which direction that farthest corner of the cube lies, but it doesn’t tell you how far along that line the corner is. To figure this out, you have to use the luminance figures reported by that API. Luminance is the Y channel of XYZ, so if you know the Y value and the direction of the line, you can solve for X and Z. Then, once you know that point, you can solve the maximum extents of the primaries by using the formula of redPrimary + greenPrimary + bluePrimary = whitePoint. That gives you the entire cube.

Calculating the cube for iOS requires less information. The Display P3 color space is supposed to match the colors representable on the monitor, so we can interrogate the color space instead of the monitor’s reported info. You can construct a CGColor using CGColorSpace.displayP3 and then use CGColor’s conversion function to turn it into an XYZ color. You can then scale the result by the luminosity of the display (which I looked up from laptopmag.com).
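A sketch of that conversion for one primary (genericXYZ is the XYZ color space CoreGraphics exposes; the scaling by display luminosity described above happens afterwards):

import CoreGraphics

// Sketch: where does Display P3's red primary land in XYZ?
let p3 = CGColorSpace(name: CGColorSpace.displayP3)!
let xyz = CGColorSpace(name: CGColorSpace.genericXYZ)!

let p3Red = CGColor(colorSpace: p3, components: [1, 0, 0, 1])!
let converted = p3Red.converted(to: xyz, intent: .absoluteColorimetric, options: nil)!
let xyzComponents = Array(converted.components!.prefix(3))   // (X, Y, Z)

// Then scale by the display's luminosity (e.g. the iPad Pro's 368 nits) relative to
// whatever the diagram is normalized to, and repeat for the green and blue primaries.
print(xyzComponents)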

Here's the full text of the Swift Playground I used to calculate the Windows information:
import Foundation
import CoreGraphics
import GLKit

func calculateWhitePoint() -> (CGFloat, CGFloat, CGFloat) {
    let xWhite = CGFloat(0.3125)
    let yWhite = CGFloat(0.329101563)
    let zWhite = 1 - xWhite - yWhite
    let luminance = CGFloat(658.345215)
    let normalizedLuminance = luminance / 374

    let t = normalizedLuminance / yWhite
    let XWhite = xWhite * t
    let YWhite = yWhite * t
    let ZWhite = (1 - xWhite - yWhite) * t

    return (XWhite, YWhite, ZWhite)
}

func convertXYZToRGB(X: CGFloat, Y: CGFloat, Z: CGFloat) -> (CGFloat, CGFloat, CGFloat) {
    let r = 3.2406 * X - 1.5372 * Y - 0.4986 * Z
    let g = -0.9689 * X + 1.8758 * Y + 0.0415 * Z
    let b = 0.0557 * X - 0.2040 * Y + 1.0570 * Z
    return (r, g, b)
}

let (XWhite, YWhite, ZWhite) = calculateWhitePoint()

// X(t) = x * t
// Y(t) = y * t
// Z(t) = (1 - x - y) * t

let xRed = Float(0.674804688)
let yRed = Float(0.316406250)
let zRed = 1 - xRed - yRed
let xGreen = Float(0.1953125)
let yGreen = Float(0.708007813)
let zGreen = 1 - xGreen - yGreen
let xBlue = Float(0.151367188)
let yBlue = Float(0.046875)
let zBlue = 1 - xBlue - yBlue

// Red primary (XRed, YRed, ZRed): s * (xRed, yRed, zRed)
// Green primary (XGreen, YGreen, ZGreen): t * (xGreen, yGreen, zGreen)
// Blue primary (XBlue, YBlue, ZBlue): u * (xBlue, yBlue, zBlue)

// XWhite = XRed + XGreen + XBlue
// YWhite = YRed + YGreen + YBlue
// ZWhite = ZRed + ZGreen + ZBlue

// XWhite = s * xRed + t * xGreen + u * xBlue
// YWhite = s * yRed + t * yGreen + u * yBlue
// ZWhite = s * zRed + t * zGreen + u * zBlue

// [xRed, xGreen, xBlue]   [s]   [XWhite]
// [yRed, yGreen, yBlue] * [t] = [YWhite]
// [zRed, zGreen, zBlue]   [u]   [ZWhite]

let matrix = GLKMatrix3MakeAndTranspose(xRed, xGreen, xBlue, yRed, yGreen, yBlue, zRed, zGreen, zBlue)
let inverted = GLKMatrix3Invert(matrix, nil)
let solution = GLKMatrix3MultiplyVector3(inverted, GLKVector3Make(Float(XWhite), Float(YWhite), Float(ZWhite)))
let s = solution.x
let t = solution.y
let u = solution.z

let XRed = s * xRed
let YRed = s * yRed
let ZRed = s * zRed
let XGreen = t * xGreen
let YGreen = t * yGreen
let ZGreen = t * zGreen
let XBlue = u * xBlue
let YBlue = u * yBlue
let ZBlue = u * zBlue

// Let's check our work
XRed + XGreen + XBlue
XWhite
YRed + YGreen + YBlue
YWhite
ZRed + ZGreen + ZBlue
ZWhite
XRed / (XRed + YRed + ZRed)
xRed
YRed / (XRed + YRed + ZRed)
yRed
XGreen / (XGreen + YGreen + ZGreen)
xGreen
YGreen / (XGreen + YGreen + ZGreen)
yGreen
XBlue / (XBlue + YBlue + ZBlue)
xBlue
YBlue / (XBlue + YBlue + ZBlue)
yBlue

// 0 0 0 -> 0 0 0
// 1 0 0 -> XRed, YRed, ZRed
// 0 1 0 -> XGreen, YGreen, ZGreen
// 0 0 1 -> XBlue, YBlue, ZBlue
// 1 1 0 -> XRed + XGreen, YRed + YGreen, ZRed + ZGreen
// 0 1 1 -> XGreen + XBlue, YGreen + YBlue, ZGreen + ZBlue
// 1 0 1 -> XRed + XBlue, YRed + YBlue, ZRed + ZBlue
// 1 1 1 -> XRed + XGreen + XBlue, YRed + YGreen + YBlue, ZRed + ZGreen + ZBlue

let _000 = convertXYZToRGB(X: 0, Y: 0, Z: 0)
let _100 = convertXYZToRGB(X: CGFloat(XRed), Y: CGFloat(YRed), Z: CGFloat(ZRed))
let _010 = convertXYZToRGB(X: CGFloat(XGreen), Y: CGFloat(YGreen), Z: CGFloat(ZGreen))
let _001 = convertXYZToRGB(X: CGFloat(XBlue), Y: CGFloat(YBlue), Z: CGFloat(ZBlue))
let _110 = convertXYZToRGB(X: CGFloat(XRed + XGreen), Y: CGFloat(YRed + YGreen), Z: CGFloat(ZRed + ZGreen))
let _011 = convertXYZToRGB(X: CGFloat(XGreen + XBlue), Y: CGFloat(YGreen + YBlue), Z: CGFloat(ZGreen + ZBlue))
let _101 = convertXYZToRGB(X: CGFloat(XRed + XBlue), Y: CGFloat(YRed + YBlue), Z: CGFloat(ZRed + ZBlue))
let _111 = convertXYZToRGB(X: CGFloat(XRed + XGreen + XBlue), Y: CGFloat(YRed + YGreen + YBlue), Z: CGFloat(ZRed + ZGreen + ZBlue))

/*
0, 0, 0, 0, 1, 0, // left front
0, 0, 0, 1, 0, 0, // bottom front
1, 0, 0, 1, 1, 0, // right front
0, 1, 0, 1, 1, 0, // top front
*/
print("\(_000.0), \(_000.1), \(_000.2), \(_010.0), \(_010.1), \(_010.2), // left front")
print("\(_000.0), \(_000.1), \(_000.2), \(_100.0), \(_100.1), \(_100.2), // bottom front")
print("\(_100.0), \(_100.1), \(_100.2), \(_110.0), \(_110.1), \(_110.2), // right front")
print("\(_010.0), \(_010.1), \(_010.2), \(_110.0), \(_110.1), \(_110.2), // top front")

/*
0, 0, 1, 0, 1, 1, // left back
0, 0, 1, 1, 0, 1, // bottom back
1, 0, 1, 1, 1, 1, // right back
0, 1, 1, 1, 1, 1, // top back
*/
print("\(_001.0), \(_001.1), \(_001.2), \(_011.0), \(_011.1), \(_011.2), // left back")
print("\(_001.0), \(_001.1), \(_001.2), \(_101.0), \(_101.1), \(_101.2), // bottom back")
print("\(_101.0), \(_101.1), \(_101.2), \(_111.0), \(_111.1), \(_111.2), // right back")
print("\(_011.0), \(_011.1), \(_011.2), \(_111.0), \(_111.1), \(_111.2), // top back")

/*
0, 1, 0, 0, 1, 1, // top left
0, 0, 0, 0, 0, 1, // bottom left
1, 1, 0, 1, 1, 1, // top right
1, 0, 0, 1, 0, 1  // bottom right
*/
print("\(_010.0), \(_010.1), \(_010.2), \(_011.0), \(_011.1), \(_011.2), // top left")
print("\(_000.0), \(_000.1), \(_000.2), \(_001.0), \(_001.1), \(_001.2), // bottom left")
print("\(_110.0), \(_110.1), \(_110.2), \(_111.0), \(_111.1), \(_111.2), // top right")
print("\(_100.0), \(_100.1), \(_100.2), \(_101.0), \(_101.1), \(_101.2), // bottom right")