Wednesday, November 9, 2022

Video Splitter

I record ~all the games I play. Not for any particular reason, but really just because sometimes it's fun to go back and re-watch them, to repeat a positive experience. Usually, I just upload the raw footage directly to YouTube, but this time I wanted to see if I could do a little better.

Problem Statement

I just finished playing Cyberpunk 2077, and recorded 77 hours of footage. This footage is split across 41 video files. Also, when recording the footage, I wasn't particularly meticulous about stopping recording when I had to step away for a little bit. Therefore, there are a bunch of times in the videos when I stepped away for a few minutes, and nothing much is happening on-screen.

I want to take these videos, and concatenate them into a few long videos. Ideally the result would be a single video, but YouTube doesn't allow you to upload anything longer than 10 hours, so the result will be a few 10-hour-long videos. Also, I'd like to identify the periods of time in the footage when nothing much is happening, and remove those periods from the result.

Also, just for the sake of convenience, I'd like to do this with as few libraries as possible, focusing just on using the software that's built into my Mac.

Plan

There are 3 phases:
  1. Feature Extraction
  2. Partitioning
  3. Writing out the final result.
Let's take these one at a time.

Feature Extraction

This is the part where I analyze the videos to pull out useful information from them. This entails going through all the videos frame-by-frame, and mapping each frame to a set of metrics. I ended up using 5 metrics:
  1. The number of words that appear in the frame
  2. The cross-correlation from the previous frame to the frame in question
  3. The optical-flow from the previous frame to the frame in question
  4. The standard deviation of the optical flow
  5. The average luminance of the frame
Let's consider these one-by-one.

Number of Words


This is pretty straightforward to gather. Apple's Vision framework contains API to recognize the text in an image. There's one other step beyond simply using this API, though - the results contain strings, but each string may contain many words. I'm interested in the number of words, rather than the number of strings in the image. So, I use another Apple API - CFStringTokenizer to pull out the words from the string. Then I simply count the number of words. Easy peasy.
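In sketch form, it looks something like this (simplified, and not exactly the code I ran):

    import CoreGraphics
    import Foundation
    import Vision

    // A simplified sketch: count the words Vision recognizes in one frame.
    func wordCount(in frame: CGImage) throws -> Int {
        let request = VNRecognizeTextRequest()
        let handler = VNImageRequestHandler(cgImage: frame, options: [:])
        try handler.perform([request])

        var count = 0
        for observation in request.results ?? [] {
            guard let string = observation.topCandidates(1).first?.string else { continue }
            // Each recognized string may contain many words; let
            // CFStringTokenizer split it apart.
            let tokenizer = CFStringTokenizerCreate(
                kCFAllocatorDefault,
                string as CFString,
                CFRangeMake(0, string.utf16.count),
                kCFStringTokenizerUnitWord,
                CFLocaleCopyCurrent())
            while !CFStringTokenizerAdvanceToNextToken(tokenizer).isEmpty {
                count += 1
            }
        }
        return count
    }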

Cross Correlation



The goal here is to cross-correlate adjacent frames of video, to see roughly how much is changing from frame to frame. This one was the most difficult to implement. 

Cross correlation is usually defined a little differently than what I'm interested in. Classically, cross correlation is an operation that takes 2 functions as input, and produces another function. The idea is that you multiply the 2 input functions together and integrate the result, to produce a particular value. However, you often want to displace the two input functions relative to each other first. That displacement is why the output of cross-correlation is a function - the output function's input represents the displacement the two input functions are shifted by before multiplying and integrating. (I'm assuming here that the functions are real-valued.)

For me, my 2 functions are discrete - the inputs are X and Y coordinates, and the output is the color value of that pixel. So, instead of multiplying and integrating like you would for continuous functions, this operation simply becomes a dot product. I'm also using just the luminance of the image, so that if a color changes chroma but its luminance stays the same, that still counts as a high cross-correlation. Also, having a 1-dimensional result is a little more convenient than the 3-dimensional result I'd get by treating red, green, and blue as distinct.

However, I don't want my output to be a whole function - I just want a single value. Usually, in the sciences, this is done by maximizing the value of the output function - often by trying every input, and reporting the maximum result achievable. This would be great: if I'm turning the camera in the game, this operation would find the location where the previous frame best matches up with the current frame. Unfortunately, it's too slow - for a 4K image, there are 35,389,440 inputs to try, and each trial operates on 2 entire 4K images. So, instead, I just set displacement = 0, and assume that adjacent video frames aren't changing a huge amount from frame to frame.

From reading Wikipedia's article about cross correlation, it looks like what I want is the "zero normalized cross-correlation" which normalizes the values in the image around the mean, and divides by the standard deviation. The idea is that if the image gets brighter as a whole, but nothing else changes, that should count the same as if it didn't get brighter. It's measuring relative changes, rather than absolute changes.
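If I've got that right, the zero-displacement version I'm computing works out to the following, where mean and stddev are each image's mean and standard deviation of luminance, and N is the number of pixels:

    ZNCC(A, B) = (1 / N) * sum over pixels (x, y) of [ (A(x, y) - mean_A) * (B(x, y) - mean_B) / (stddev_A * stddev_B) ]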

So this all boils down to:
  1. Calculate the luminance of both images
  2. Calculate the average and standard deviation of the luminance for each image. This ignores geometry and just treats all the pixels as an unordered set.
  3. For each image, create a new "zero-normalized" image, which is (old image - mean) / standard deviation
  4. Perform a dot product of the two zero-normalized images. The result of this is a single scalar.
  5. Divide the scalar by the number of pixels
Okay, how to actually do this in code?

Luminance


Calculating the luminance is actually a little tricky. The most natural way I found was to convert the image into the XYZ colorspace, whose Y channel represents luminance. I'm doing this using MPSImageConversion, which can convert images between any 2 color spaces. It operates on MTLTextures, so I had to bounce through Core Image to actually produce them (via CIRenderDestination.) I then broadcast the Y channel to every channel of the result, which isn't strictly necessary, but makes it more convenient to use later - I can't forget which channel is the correct channel to use. I did this broadcast using CIFilter.colorMatrix.
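The broadcast step, in sketch form (this omits the MPSImageConversion setup, and assumes the XYZ conversion leaves the X, Y, and Z components in the red, green, and blue slots, so Y sits in green):

    import CoreImage
    import CoreImage.CIFilterBuiltins

    // A sketch: copy the Y (luminance) component, assumed to be in the green
    // slot after the XYZ conversion, into every channel of the output.
    func broadcastLuminance(_ xyzImage: CIImage) -> CIImage {
        let matrix = CIFilter.colorMatrix()
        matrix.inputImage = xyzImage
        matrix.rVector = CIVector(x: 0, y: 1, z: 0, w: 0)
        matrix.gVector = CIVector(x: 0, y: 1, z: 0, w: 0)
        matrix.bVector = CIVector(x: 0, y: 1, z: 0, w: 0)
        matrix.aVector = CIVector(x: 0, y: 0, z: 0, w: 1)
        return matrix.outputImage!
    }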

Mean


Okay, so now we've got luminance, let's calculate the mean, which is pretty straightforward - Core Image already has CIFilter.areaAverage which produces a 1x1 image of the average. We can tile that 1x1 image using CIFilter.affineTile so it's as big as the input image.
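A sketch of that (the function name is just for illustration):

    import CoreGraphics
    import CoreImage
    import CoreImage.CIFilterBuiltins

    // A sketch: produce an image, the same size as the input, in which every
    // pixel holds the input's average value.
    func meanImage(of input: CIImage) -> CIImage {
        // areaAverage reduces the whole extent to a single 1x1 pixel.
        let average = CIFilter.areaAverage()
        average.inputImage = input
        average.extent = input.extent

        // affineTile repeats that pixel indefinitely; crop back to the input's size.
        let tile = CIFilter.affineTile()
        tile.inputImage = average.outputImage
        tile.transform = .identity
        return tile.outputImage!.cropped(to: input.extent)
    }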

Subtraction


Subtracting the mean from an image is actually kind of tricky - Both CIDifferenceBlendMode and CISubtractBlendMode seem to try really hard to not produce negative values. What I had to do in the end was to add the negative of the second image (instead of subtracting). Adding is just CIFilter.additionCompositing, and negating is just CIFilter.multiplyCompositing with an image full of -1s. However, you can't use CIConstantColorGenerator to fill an image with -1s, because that will clamp to 0. Instead you have to actually create a new CIImage from a 1x1 CPU-side bitmap buffer that holds a -1, and then use CIFilter.affineTile to make it big enough. Also, in the CIContext, you have to set the working format to one that can represent negative values (normally Core Image uses unsigned values); I'm using CIFormat.RGBAf.
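Put together, the subtraction looks something like this sketch (again, not my exact code):

    import CoreGraphics
    import CoreImage
    import CoreImage.CIFilterBuiltins
    import Foundation

    // A sketch of "image minus its mean", done as image + (-1 * mean), because
    // the subtraction blend modes clamp away negative values.
    func subtractMean(from image: CIImage, mean: CIImage) -> CIImage {
        // Build a 1x1 float image containing -1 in every channel...
        var minusOne: [Float] = [-1, -1, -1, -1]
        let data = Data(bytes: &minusOne, count: MemoryLayout<Float>.size * 4)
        let negativeOnePixel = CIImage(
            bitmapData: data,
            bytesPerRow: MemoryLayout<Float>.size * 4,
            size: CGSize(width: 1, height: 1),
            format: .RGBAf,
            colorSpace: nil)

        // ...tile it to cover the image's extent...
        let tile = CIFilter.affineTile()
        tile.inputImage = negativeOnePixel
        tile.transform = .identity
        let negativeOnes = tile.outputImage!.cropped(to: image.extent)

        // ...negate the mean by multiplying it with the -1 image...
        let negate = CIFilter.multiplyCompositing()
        negate.inputImage = mean
        negate.backgroundImage = negativeOnes

        // ...and add the negated mean to the original image.
        let add = CIFilter.additionCompositing()
        add.inputImage = image
        add.backgroundImage = negate.outputImage
        return add.outputImage!
    }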

Standard Deviation


Okay, so the next step is to calculate standard deviation. The standard deviation is the square root of variance, and to calculate variance, we take all the pixels, subtract the mean from them like we just did above, square the result, then find the average of the result values. Luckily, we already did all the hard parts - squaring the result is just CIFilter.multiplyCompositing, and we can use the same CIFilter.areaAverage & CIFilter.affineTile to find the average. No problem. We can then take the square root to find standard deviation by using CIFilter.gammaAdjust with a gamma of 0.5.
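A sketch, reusing the mean-subtracted image from the previous step:

    import CoreImage
    import CoreImage.CIFilterBuiltins

    // A sketch: variance is the mean of the squared deviations, and the square
    // root is gammaAdjust with a power of 0.5. The result is a 1x1 image that
    // can be tiled or read back as a scalar.
    func standardDeviationImage(deviations: CIImage) -> CIImage {
        // Square the deviations by multiplying the image with itself.
        let square = CIFilter.multiplyCompositing()
        square.inputImage = deviations
        square.backgroundImage = deviations

        // Average the squared deviations to get the variance.
        let variance = CIFilter.areaAverage()
        variance.inputImage = square.outputImage
        variance.extent = deviations.extent

        // Square root via gamma = 0.5 gives the standard deviation.
        let squareRoot = CIFilter.gammaAdjust()
        squareRoot.inputImage = variance.outputImage
        squareRoot.power = 0.5
        return squareRoot.outputImage!
    }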

Division


We can't actually divide by the standard deviation using Core Image as far as I can tell - CIDivideBlendMode doesn't seem to do a naive division like we want. However, because the standard deviation is constant across the whole image, we can hoist that division out of the dot product computation. The dot product results in a single scalar, and the standard deviation for an image is a single scalar, so we can just calculate these things independently, and then divide them on the CPU afterwards. No problem.

Dot Product


Okay, so now we've got our zero-normalized images. Let's do a dot product and average the result! This is pretty easy too - a dot product is just CIFilter.multiplyCompositing, and averaging the result is CIFilter.areaAverage.
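Here's a sketch of those last steps (a hypothetical helper, not my exact code), folding in the hoisted division from the previous section. It assumes the inputs are the mean-subtracted, not-yet-divided luminance images, and that the CIContext was created with a float working format like RGBAf:

    import CoreGraphics
    import CoreImage
    import CoreImage.CIFilterBuiltins

    // A sketch of the final cross-correlation value for two mean-subtracted images.
    func crossCorrelation(_ a: CIImage, _ b: CIImage,
                          stdDevA: Float, stdDevB: Float,
                          context: CIContext) -> Float {
        // Pixel-wise multiply is the elementwise part of the dot product.
        let multiply = CIFilter.multiplyCompositing()
        multiply.inputImage = a
        multiply.backgroundImage = b

        // areaAverage sums and divides by the pixel count in one go.
        let average = CIFilter.areaAverage()
        average.inputImage = multiply.outputImage
        average.extent = a.extent

        // Read the single resulting pixel back to the CPU.
        var pixel = [Float](repeating: 0, count: 4)
        context.render(average.outputImage!,
                       toBitmap: &pixel,
                       rowBytes: MemoryLayout<Float>.size * 4,
                       bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                       format: .RGBAf,
                       colorSpace: nil)

        // Hoisted division by the (constant) standard deviations.
        return pixel[0] / (stdDevA * stdDevB)
    }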

Phew! All that Core Image work just to get a single value!

Optical Flow




Optical flow between 2 images produces an (x, y) displacement vector that indicates, for each pixel in the first image, where it moved to in the second image. I'm interested in this because as I rotate the camera around in the game, that should cause most of those displacements to be pointing in roughly the same direction across the whole image. So, the average of the optical flow should tell me if I'm moving the camera or not.

On the other hand, if I'm walking forward in the game, then pixels at the top of the screen will move up, pixels on the right will move more to the right, etc. In that situation, the displacements will all cancel each other out! That's why I'm also interested in the standard deviation of the flow. If the average is 0, but the standard deviation is high, that means I'm walking forward (or backward). If the standard deviation is 0, but the average is high, that means I'm turning the camera.

Calculating this is pretty straightforward. Apple's Vision framework contains API to calculate optical flow. We can then calculate its average and standard deviation using the same method we used above, in the cross correlation section.

Optical flow is a 2-dimensional value - pixels can move in the X and Y dimension. I'm not really interested in the direction of the movement, though; I'm more interested in the amount of movement. So, after calculating the average and the standard deviation, I take the magnitude of them, to turn them into scalar values.
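The flow request itself, in sketch form (the averaging and standard deviation then reuse the same Core Image machinery as the cross-correlation section):

    import CoreImage
    import CoreVideo
    import Vision

    // A sketch: ask Vision for the optical flow between two frames. The
    // observation's pixel buffer holds a 2-channel (dx, dy) float image.
    func opticalFlow(from previous: CVPixelBuffer, to current: CVPixelBuffer) throws -> CIImage? {
        let request = VNGenerateOpticalFlowRequest(targetedCVPixelBuffer: current, options: [:])
        let handler = VNImageRequestHandler(cvPixelBuffer: previous, options: [:])
        try handler.perform([request])

        guard let observation = request.results?.first else { return nil }
        return CIImage(cvPixelBuffer: observation.pixelBuffer)
    }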

Average Luminance



This is pretty straightforward - in fact, we already calculated this above in the cross-correlation section. It's useful because the menus have a black background, so they are darker than regular gameplay. Also, the luminance of menus is very consistent, as opposed to regular gameplay, where luminance is going up and down all the time. So, just looking at a graph of average luminance, you can kind of already see where the menus are in the video.

Partitioning

Alright, now we've extracted 5 features from each frame of video. Each of these features is a single scalar value. The next task is to try to use statistical analysis to determine which parts of the video are the boring parts that I should cut out, and which parts are full of action and should be kept in.

Originally, I thought I could just use cross-correlation to do this. I thought that if the cross-correlation between adjacent frames is low, that means I've cut to a menu or something, and that would be a good place to cut the video up. However, this turned out not to work very well, because menus actually come on screen with an animation, so they don't have low cross-correlation. Also, regular gameplay has a bunch of flashes in it (things explode, the camera can turn quickly, visual effects distort the screen, etc.).

Instead, I wanted to model the data I had gathered using a piecewise constant function. E.g., during gameplay, the 5 features will adhere to a particular distribution, and during menus or boring parts, the 5 features will adhere to a different distribution. I'm modelling these distributions as normal distributions, but with different means. I'm trying to partition the data by time, and calculate a new normal distribution for each partition, such that each partition's distribution fits its data as well as possible. The trick here is to find the partitioning points. I'm looking for a statistical method of finding places where the data is discontinuous.

Originally, I thought this would be a classic K-means clustering problem, but after implementing it, it turned out not to work very well. No matter how hard I tried, the partitions overlapped a lot, and looked pretty random. So that didn't work.



Bayesian Information Criterion



Next, I discovered something called the Bayesian information criterion (BIC). This comes from the science of model selection, which is essentially exactly what I'm trying to do. I have a bunch of data, and I have a family of models in mind (disjoint normal distributions, one for each partition) and I'm trying to select among the family for the best model for my data.

The BIC is a measure of how well data fits a model. Importantly, though, it tries to mitigate overfitting. For example, in my data, if every data point was in its own partition, then the model would perfectly fit the data; but this wouldn't actually solve my problem. The Bayesian information criterion has 2 terms - one that reflects how well the data fits the model, and another one that reflects how many parameters the model has. The better the data fits the model, the better the BIC; but the more parameters in the model, the worse the BIC. It's essentially a way of balancing fitting vs overfitting.
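For reference, the standard form of the criterion (where lower is better) is:

    BIC = k * ln(n) - 2 * ln(L)

where k is the number of parameters in the model, n is the number of data points, and L is the maximized likelihood of the model given the data.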

There are 2^n different ways of partitioning n values, and for me, n is the number of frames in 77 hours of video. So exhaustively calculating the BIC for every possible partitioning is clearly too expensive. I instead opted to use a greedy approach. Given a particular partitioning, we can try to add a new split at every location (which is O(n) locations), and calculate the BIC as if we split at that position. We then pick the location which results in the best BIC, and add it into the partitioning. We keep doing this until there is no single new splitting location which will improve the BIC (because adding another split would overfit the model).

If you preprocess the data to precalculate prefix-sums, you can answer "sum all the values from here to there" queries in O(1) time. This allows you to compute the BIC for a single partition in O(1) time (if you expand out the polynomial in the formula).
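A sketch of that preprocessing and the O(1) per-partition cost (not my exact code):

    // A sketch of the prefix-sum trick: with running sums of x and x^2, the
    // squared error of any segment around its own mean is O(1) to compute.
    struct PrefixSums {
        private var sum: [Double] = [0]     // sum[i] = x[0] + ... + x[i-1]
        private var sumSq: [Double] = [0]   // sumSq[i] = x[0]^2 + ... + x[i-1]^2

        init(_ values: [Double]) {
            for v in values {
                sum.append(sum.last! + v)
                sumSq.append(sumSq.last! + v * v)
            }
        }

        // Sum of squared deviations from the mean over values[lo..<hi], in O(1).
        // This is the residual piece that feeds the BIC's fit term.
        func segmentCost(_ lo: Int, _ hi: Int) -> Double {
            let n = Double(hi - lo)
            let s = sum[hi] - sum[lo]
            let sq = sumSq[hi] - sumSq[lo]
            return sq - s * s / n   // sum(x^2) - (sum(x))^2 / n = sum((x - mean)^2)
        }
    }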

Therefore, determining the BIC for a particular partitioning is O(number of partitions). Every time we pick a splitting point, the number of partitions increases by 1. We want to calculate a BIC once for each candidate splitting point, which is O(number of partitions * number of frames). Even better, calculating the BIC at every candidate splitting point is an embarrassingly parallel problem, and we can parallelize this across all the cores in our machine. It turns out this is fast enough - the longest single video I had is 5 hours long, and this algorithm completed on that video in around 10 minutes on my 20-core machine.
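The greedy loop itself is simple. Something like this sketch, where bic is a stand-in for the scoring just described (treating a lower score as better, per the usual convention):

    // A sketch of the greedy search: repeatedly add the single split that most
    // improves the BIC, and stop when no candidate helps.
    func greedySplits(frameCount: Int, bic: ([Int]) -> Double) -> [Int] {
        var splits: [Int] = []
        var bestScore = bic(splits)
        while true {
            var bestCandidate: (split: Int, score: Double)? = nil
            // Score every candidate split position (this part is embarrassingly parallel).
            for candidate in 1..<frameCount where !splits.contains(candidate) {
                let score = bic((splits + [candidate]).sorted())
                if bestCandidate == nil || score < bestCandidate!.score {
                    bestCandidate = (candidate, score)
                }
            }
            // Stop as soon as no single additional split improves the criterion.
            guard let best = bestCandidate, best.score < bestScore else { break }
            splits.append(best.split)
            splits.sort()
            bestScore = best.score
        }
        return splits
    }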

There are 2 tweaks that I ended up having to do to the above:
  1. The BIC assumes that the data you have is one-dimensional. However, my data is 5-dimensional; I extracted 5 features from each frame of the video, so each data point has 5 components. There may be a way to generalize the BIC to higher dimensions, but I instead opted to do something similar: for each 5-dimensional data point, create a single one-dimensional synthetic data point that represents it. This is meaningless, but isn't without precedent: there are lots of examples of people using this approach to average together multiple benchmarks into a single synthetic score. I opted to combine the 5 values using the geometric mean (see the sketch after this list), because that is sensitive to ratios of the data points, rather than the magnitudes of the data points themselves. (Arithmetic mean is definitely wrong, because calculating the arithmetic mean involves adding together the 5 values, but the 5 values have different units, so they can't be added.)
    1. I also decided to use 1 - cross correlation instead of cross correlation itself, because all the other values have a baseline near 0, but cross correlation has a baseline near 1 (because most frames are similar to their adjacent frames). This makes all the values behave a bit more similarly, and makes ratios more meaningful.
  2. The BIC formula involves a tradeoff between fitting the data and overfitting the data. The way overfitting is measured is that it's proportional to the number of parameters in the model. For me, each different partition has (I believe) 2 parameters: the mean of the data within that partition, and the constant variance of the errors. However, using this value causes the data to be overfitted significantly, so I instead multiplied it by 10, which gave a pretty good result. Doing it this way, the average partition in each video ended up being around 30 - 60 seconds long, regardless of the length of the video being partitioned. So that's a pretty cool result.
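Here's a sketch of that combination; the parameter names are just illustrative:

    import Foundation

    // A sketch of collapsing the 5 per-frame features into one synthetic value
    // with the geometric mean (assumes all features are non-negative).
    func syntheticValue(wordCount: Double,
                        crossCorrelation: Double,
                        flowMagnitude: Double,
                        flowStdDev: Double,
                        luminance: Double) -> Double {
        // Use 1 - cross correlation so its baseline sits near 0 like the others.
        let features = [wordCount, 1 - crossCorrelation, flowMagnitude, flowStdDev, luminance]
        // Geometric mean: the nth root of the product of the n values.
        let product = features.reduce(1, *)
        return pow(product, 1.0 / Double(features.count))
    }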

Writing out the Result



The last step is to pick a threshold: for each partition, I have to determine whether that partition should be present in the final video. I already have a model, where each partition is associated with the average value of the synthetic one-dimensional calculated values inside that partition. The way I picked which regions were in and which were out was just by inspecting a bunch of partitions, to see what was happening in the video during that time. I found that, if the average synthetic value was less than around 0.12145, the partition was boring and should not be included.

Writing the result is pretty straightforward - I'm using AVAssetWriter and passing it frames from the input file (which was read using AVAssetReader).
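In sketch form, simplified to a single video track with passthrough settings and a hypothetical shouldKeep callback; this version just drops samples without any retiming, error handling, or audio:

    import AVFoundation

    // A sketch: copy one video track from input to output, skipping samples
    // whose presentation times fall in boring partitions.
    func writeTrimmedCopy(from inputURL: URL, to outputURL: URL,
                          shouldKeep: (CMTime) -> Bool) async throws {
        let asset = AVURLAsset(url: inputURL)
        guard let track = try await asset.loadTracks(withMediaType: .video).first else { return }

        let reader = try AVAssetReader(asset: asset)
        let output = AVAssetReaderTrackOutput(track: track, outputSettings: nil) // passthrough
        reader.add(output)

        let writer = try AVAssetWriter(outputURL: outputURL, fileType: .mov)
        let input = AVAssetWriterInput(mediaType: .video, outputSettings: nil)   // passthrough
        writer.add(input)

        guard reader.startReading(), writer.startWriting() else { return }
        writer.startSession(atSourceTime: .zero)

        // Pull sample buffers off the reader and append only the ones worth keeping.
        while let sample = output.copyNextSampleBuffer() {
            guard shouldKeep(CMSampleBufferGetPresentationTimeStamp(sample)) else { continue }
            while !input.isReadyForMoreMediaData {
                try await Task.sleep(nanoseconds: 10_000_000)
            }
            guard input.append(sample) else { break }
        }

        input.markAsFinished()
        await writer.finishWriting()
    }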

Results

All this work seems, somewhat surprisingly, to give pretty good results. Not perfect, but better than I was expecting. In the first few hours of gameplay that I reviewed, it neatly cut out:
  1. A section when I was reading an in-game lore book thing (it was just showing text on the screen for a minute or so)
  2. A section when the game was paused
  3. A section when I was fussing with my inventory for a few minutes
  4. A section when I was looking at my character in the mirror
The remarkable part of this is that it cut out all these things wholesale: from right when they started, to right when they ended. So you see the character walk into the elevator, and then they're immediately walking out of the elevator. And it didn't cut out any of the action or interesting parts. It also seems to take the duration of events into account: there was a point when I read a different in-game lore book, but I only read it for a few seconds, and it kept that part in, presumably because it wasn't worth another partition.