Litherum: 2013

Friday, November 29, 2013

Complex Text Handling in WebKit, Part 2: Fonts

Before we go any further, I’d like to talk a little bit about what information is contained within font files. Ultimately, font files dictate the ground truth for how to lay out a particular run of text. There are two common formats of font files: TrueType fonts and OpenType fonts. TrueType was developed by Apple, and when a licensing agreement between Apple and Microsoft couldn’t be reached, Microsoft built on top of TrueType to make OpenType. Both of these formats are similar and use the same underlying file structure. My experience is with TrueType, so that’s what I’m going to focus on here.

A TrueType font file is made up of a collection of sections, each called a “table.” There is also an directory table (which appears first) which describes the locations and sizes of all of the other tables. Each table is denoted by 4 ascii characters which describe what the table is describing. For example, the table that describes the character-to-glyph mapping has a tag of “cmap” and the table that describes naming information has a tag of “name”. Because unknown tables will simply be ignored by any program that is parsing the file, the format can be extended with different kinds of tables.

One of the duties of a font system is to translate a sequence of characters (read: code points) into a sequence of glyphs. Each glyph is identified by a number (which can later be used to get the information needed to draw the glyph). It is important to note that the set of codepoints that the font knows about is arbitrary and different from font file to font file. In addition, the glyph IDs themselves are arbitrary and different from font file to font file. (e.g. If you ask the font system which glyph should be drawn for the letter ‘a’, you cannot use that glyph ID with another font - it may point to a different glyph entirely!). The code point to glyph mapping is done using the “cmap” table in the font file. The specifics of how the lookup is completed are not really all that complicated, but I’ll let you read [1] if you’re interested.

Okay cool, so now we have a glyph ID for the particular code point that we’re trying to draw. We now have to figure out where in the font file the relevant information regarding this glyph is kept. Each glyph can contain a different number of contours and control points, and glyph information is not padded, so we have to look up where the relevant glyph information is kept. This information is inside the “loca” table. This table is simply an array of offsets, so figuring out where to start looking is very easy.

Alright, now we have an offset, and we now know where to start looking for information about how to draw our glyph. The offset is an offset into the “glyf” table, which describes the contours that make up the outline of the relevant glyph. TrueType has two different notions of glyphs: simple glyphs and compound glyphs. Compound glyphs are simply created by drawing multiple other glyphs with different positions and scales. There might be a compound glyph for “ñ” which is composed of drawing the glyph for “n” and the glyph for “~” on top of each other.

A simple glyph is described by a collection of contours. A contour is described by a sequence of quadratic curves laid out end-to-end. A single quadratic curve is characterized by three points: two endpoints and a single control point which defines the tangents at the endpoints. Because the curves in a single contour are laid out end-to-end, the endpoint of one curve would be a duplicate of the beginning-point of the next curve. Also, if we assume that the entire contour is smooth, meaning that the ending derivative of one curve is the same as the beginning derivative of the next curve, we don’t even need to explicitly state the location of the shared endpoint. We can figure out where the shared endpoint must be given the two control points that straddle it.

Therefore, the way that each contour is described is as a list of points, coupled with a boolean for each point which describes if that point is “on” the contour or “off” the contour. Points that are on the contour are endpoints of a curve, and points that are off the contour are control points. Any arbitrary collection of these (point, on/off boolean) tuples will uniquely define a contour. Note that kinks are allowed because the font author can explicitly specify each of the three points for each quadratic curve (but doesn’t have to). If there is no control point between two on-contour points, a straight line is drawn between the two points. A glyph simply consists of a bunch of these contours.

There’s one last piece we need to know in order to draw a glyph: TrueType dictates that rasterizers use the non-zero winding number rule in order to dictate whether a particular point is inside or outside the glyph. This rule is well-documented, but works like this: From the point you’re trying to determine, draw a ray in any direction. Every time the ray crosses a contour from left-to-right, increment a counter. Every time the ray crosses a contour from right-to-left, decrement the counter. If the counter ends up with a non-zero value, the point is inside the glyph. Note that this rule assumes that glyphs “hold water” meaning that there are no gaps. This means that contours can intersect, and that font authors can cut holds out of closed contours.

It is also worth stating here that all of the coordinates that points are described in this table are integral, meaning that this table can only describe locations on a grid. However, the dimensions of this grid are specified in the “head” (font header) table (sort of. Keep reading). Therefore, different fonts can have different precisions for their contours. It also means that the font system has to scale these coordinates to a pixel-based coordinate system. This scaling is straightforward: Let’s call the user-selected font point size “p”. One point is equal to 1/72 of an inch, so divide p by 72 to get the number of inches the grid should take up, then multiply that by the output device’s pixels per inch (so now we have pixels) and then scale the grid to take up that number of pixels. The TrueType spec defines that this scaling is done in fixed-point with a resolution of 1/64.

It should also be noted that the actual value in the “head” table is “units per Em”. In old days of movable type, the printer would have lots of little squares of metal with raised letters on them. When the printer wanted to print a book, he/she would arrange the little squares to spell out the page of the book, dip the raised letters in ink, then stamp out the page of the book. Therefore, each letter had to fit on one of these little squares. The size of these squares is called an “Em”. On a computer, when you say that you want to you a 12 pt font, one pt equals 1/72 of an inch, so you’re saying to make the EM square 12/72 of an inch big on your screen. The scalar then uses the “units per Em” value to figure out how to scale the glyph. However, nothing is stopping a point on the contour from being greater than the “units per Em” value. This means that the glyph may extend outside of the Em square. This is valid, and even expected, behavior.

There’s one major piece of drawing glyphs that I haven’t discussed. There is a problem when the font size is roughly the size of just a couple pixels on the screen. When that happens, the pixels don’t have enough definition to describe the character properly. The contours described so far describe the platonic ideal of what the glyph should look like, but if the glyph is on a low resolution screen, it may be beneficial to munge about with the control points so that the glyph maps better to the screen. This process is called “font hinting.” Truetype exposes a mechanism to do this, called “glyph instructions.” Yes, that’s right, this is an entire language interpreter that TrueType engines employ to munge about with glyph control points. The font author writes a program in this programming language, and the font runtime executes this program at draw time to move control points around. This programming language is complete with flow control, functions, variables, etc. The only difference between regular programming languages is that the only way to output information is to move control points around - you can’t print output to the console or anything like that. The instruction set for this language is actually quite interesting - there are many domain-specific instructions for making it convenient to move a collection of points along the same vector and things like that. It’s definitely worth taking a look at if you’re interested in domain-specific languages.

Another thing that a font file has to describe is the origin point and advance point for a particular glyph. This is actually fairly straightforward - you render one glyph with respect to its own origin. Then, you “move the pen” to that glyph’s advance point, and start on the next glyph. This is the way that glyphs don’t get bunched up together when you draw them. By convention, the first point in the font file for a glyph is that glyph’s origin, and the last point is it’s advance point.

By the way, the TrueType spec makes no mention of antialiasing, subpixel rendering, or colors. It is under the assumption that a pixel is either 100% on or 100% off, and that pixels are monochrome. That tech is implemented in the rasterizer to make fonts look beautiful. In order to do this correctly, the font system needs to figure out the coverage of a particular square region with regards to the contours. This requires multiple integrals to calculate exactly (which is obviously too slow) so estimations are used instead. Colored glyphs (such as emoji) are rendered using bitmap glyphs, which are completely different.

In this discussion, I’ve described only a few of the tables in TrueType files. However, there are many more. One of the more interesting one is the “bsln” table for describing where various baselines are for the characters (Some languages, like Bengali, use an upper baseline, and some, like Chinese, use a centered baseline). Another interesting one (“lcar”) is due to the fact that ligatures represent multiple characters in a single glyph. Because of this, there might be positions within the glyph that represent logical breaks (so that the caret or selection might appear partway through the glyph). Another (“avar”) is that a font might contain multiple sets of glyph information across various axes (such as font weight). Another (“trak”) describes how to achieve various tracking metrics for glyphs. There are many extension tables that have been added to TrueType, and even more that have been added to OpenType.

[1] https://developer.apple.com/fonts/ttrefman/RM06/Chap6cmap.html

Sunday, November 10, 2013

Complex Text Handling in WebKit, Part 1: Encoding Systems, Code Points, and Code Units

Complex text handling is very complicated, due to the various languages and writing systems in the world. This is meant for programmers who have little exposure to complex text.

The first concept to mention is the code point. A code point is a generalization of what what you are probably familiar with as a “character.” Because computers have no concept of language, words, or characters, a sentence or paragraph is represented as a sequence of numbers which describe the characters which eventually get drawn. The atomic unit of this sequence is called a “code point.” Now, the meaning of each individual code point is described in an encoding system. For example, using the ASCII encoding system, the code point with value 121 means the letter z, and the code point with value 65 means the letter A. http://www.asciitable.com contains a description of each code point in the ASCII encoding system (there are only 128 of them), and which character that code point maps to. Because there are only 128 code points, ASCII is called a 7-bit encoding system. (As an aside: however, by convention, the 7 bits are not packed, and instead are mapped to bytes by simply setting the high bit in the byte to 0. This will become important later. Historically, in the days of mainframes and such, this high bit was used for parity, but not any more.)

However, there are other encoding systems which might map the code point with value 121 to some other character. This means that, if someone hands you a string, you have to know which encoding system the string is encoded with in order to know the meaning of that string (or, indeed, to draw the string). There used to be many competing encoding systems and the world was very confusing and chaotic. However, the Unicode consortium was founded to create one gigantic encoding system (http://www.unicode.org/charts/) that would unify all of the various competing encoding systems. This encoding system contains roughly 1114112 code points, so, as you can imagine, most (but not all!) characters of most writing systems on earth are contained within it. The Unicode consortium also designed the code points such that all the ASCII characters have the same code points in Unicode as well, and it has some other nice properties too. Unicode has essentially become the standard for all computerized strings.

That’s all well and good, but because there are so many code points, the unicode encoding system would need each code point to contain 21 bits in order to have enough of them. However, most strings only use a small fraction of all the code points within them. Therefore, most unicode strings have an encoding associated with them. There are three commonly used unicode encodings: UTF-8, UTF-16, and UTF-32. UTF-32 is the most straightforward (and also the least common): it simply represents each code point as a 32-bit unsigned integer. Because the high 11 bits of each of these integers would be zeroes, this encoding is not very efficient which is why it is rarely used.

UTF-8 and UTF-16, however, use sequences of numbers to encode each code point. In this context, each of these numbers is called a “code unit.” UTF-8 uses 8-bit code units, and UTF-16 uses 16-bit code units. Because these code units have less bits than each unicode code point requires, there are some unicode code points that are encoded as multiple code units. I won’t go too deep on how exactly these encodings work, but I will mention some key properties of both of them.

In UTF-8, all of the code points that happen to be encoded as a single code unit have the high bit set to 0. This means that there are 127 such code points. These 127 code points are the exact same code points that ASCII encodes. Because ASCII doesn’t pack together its information, this means that a valid ASCII string is also a valid UTF-8 string. The other code points, however, are encoded such that the number of code units needed to represent the code point is contained within the first byte. For example, if two code units are needed, the first byte is 110xxxxx, and if 5 code units are needed, the first byte is 111110xx. By counting the number of leading 1s in the first code unit, you can determine the number of code units needed to represent the code point. In addition, the subsequent code units all start with the bit prefix of “10”. For example, all the code points that require 4 code units to encode use this pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for a grand total of 21 bits of entropy. It also means that no code units will be equal to 0 except for code point 0. Therefore, a null-terminated string encoded with UTF-8 is still null terminated (with no premature nulls).

In UTF-16, all code points can be encoded with either 1 or 2 code units (because each code unit contains 16 bits). The encoding is created such that, for a code point that requires two code units, the first code unit will always be in the range 0xD800..0xDBFF (called the “lead surrogates”) and the second code unit will always be within the range 0xDC00..0xDFFF (called the “trail surrogates”). This means that, when inspecting a string, you come across a high surrogate, you know that you must read a second code unit to know the relevant code point. If the second code unit isn’t a trail surrogate, the string is corrupt. All of the code points that only require one code unit are simply encoded directly. This means that it is impossible to represent a code point with value 0xD800 itself (because that is a high surrogate). The Unicode consortium thought of this, and made all code points between 0xD800..0xDFFF invalid code points. Therefore, if you see a code point in this range in a UTF-32 string, you know the string is corrupt.

There is one more wrench to throw into our plans: endianness. UTF-16 and UTF-32 use multi-byte code units, which means that they might be represented differently on machines with different endian systems. Therefore, when someone hands you a string, you need to know:

Which encoding system used? (The answer is almost always “Unicode”)
If Unicode, is it encoded with UTF-8, UTF-16, or UTF-32? (The answer is almost never “UTF-32”)
If UTF-16 or UTF-32, which endianness of the code units is used?

However, a convention arose regarding that last question, called a “byte order mark.” There is a character, U+FEFF (written in hex), which means “zero width no-break space.” Because this character is whitespace and has a 0 width, it has no effect on the output of drawing the string. However, if you reverse the bytes, U+FFFE is an invalid character. The convention is to put U+FEFF as the first code point in your text file. If the text system reads that character correctly, it will have no effect on the document drawn, and if the text system reads the character incorrectly, it knows to switch its endianness and try again. This system also works with UTF-32 text, because U+FFFE0000 is, again, an invalid character.

Internally, WebKit uses UTF-16 machine-endianness strings (Though there is a compile-time flag to enable using UTF-8 strings internally). This has one interesting repercussion: in order to keep String::size() fast, the function returns the number of code units in the string, not the number of code points.

Thursday, July 18, 2013

Election Computer Science Problem

Problem: There is an election. You are given a stream of votes, in the form of [a, b, a, a, c, b, c, c, a, b, ...]. You do not know ahead of time the number of votes nor the number of contestants. Figure out who wins the election in linear time and constant space without mutating the original input. You may assume that one candidate wins (ends up with more than 50% of the votes).

The key observation here is that, because one contestant claims more than half of the votes, you can pair each vote for the winner with a vote for a loser. In addition, once you come across one of these pairs, you can remove the pair from the stream without affecting the outcome of the election. Also, if you have a pair of losing votes, you can remove those as well from the stream without affecting the election. The only thing you can’t do is remove a pair of votes for the winner. One easy way to enforce this is to only choose pairs of votes for distinct candidates.

Therefore, one algorithm is to successively remove pairs of votes for distinct candidates, until only votes for one candidate are left. A naive implementation of this algorithm requires rewinding the stream. You can do a little better if you assume that the first vote is a vote for the winner. If you increment a counter every time you come across a vote for this contestant, and decrement it every time you come across a vote for any other contestant, when the counter reaches 0 again, you can effectively start the stream anew at this location. This algorithm works even when the first vote is a vote for a loser, because removing pairs of losing votes does not affect the outcome of the election. Eventually, when you reach the end of the stream, whoever has been incrementing your counter has votes leftover, and is therefore the winner.

Saturday, May 25, 2013

Speeding up Counting Sort with Histopyramids

I was listening to Matt Swoboda’s 2012 GDC talk, where he mentions using histopyramids to optimize the marching cubes algorithm. You can also use histopyramids to speed up counting sort, as he describes.

A histopyramid is essentially a mip chain of a tensor, except instead of the reduction operator being an average, the reduction operator is addition. Because our input is only one dimensional, a histopyramid can more simply be described as adding elements pairwise in our input array, then recursing. For example, if our input list is this:

[3, 1, 4, 1, 5, 9, 2, 6]

then the next level of the histopyramid would be

[4, 5, 14, 8]

and the subsequent two levels are

[9, 22]

[31]

It can easily be seen that there are O(log(n)) levels in the histopyramid, and that the last element is the sum of all the elements in the original list. You can also see that, if you have a parallel processor, creating each level from the successive level has constant time delay, which means the delay for creating the entire histopyramid is O(log(n)). Memory allocation is easy because n + n/2 + n/4 + n/8 + ... + 1 just equal 2n, so you can simply allocate one array that’s twice the size of the input, then populate the second half of the input from the first half. There has to be synchronization between generating each level of the histopyramid, however, and because the OpenCL ‘barrier’ function only operates on thread groups and not globally, this has to be done with multiple kernel invocations.

Okay, so how do we use this to actually sort the list? Well, let’s associate each value in the histopyramid with the slice of the input array that contribute to its value. The last value in the histopyramid would therefore correspond to the entire input list. In the previous level of the histopyramid, the two values correspond to the first half of the input array and the second half of the input array, respectively, and so on and so forth.

Now, let’s say that we create a histopyramid of the frequency histogram. In such a histopyramid, each value in the histopyramid corresponds to the size of the output array that its slice encompasses. In our example above, this would mean that there are 31 values in the output array, 9 of which are between [0..3] and 22 of which are between [4..7] (because these are the bounds of the slices that each value corresponds to). If you look one level deeper, you can see that 4 of the output values are between [0..1], 5 values are between [2..3], 14 values are between [4..5], etc.

This means that, if i’m trying to populate a particular cell in the output list, I first ask “is the index of my cell less than 9 or greater than 9?” This is relevant because I know that the first 9 values are between [0..3]. If my index is greater than 9, I then shrink my range to the 22 that I’m about to inspect (by subtracting 9 from my cell index), and ask if my index is now less than 14 or greater than 14. Eventually, I find the cell in the frequency histogram that my index corresponds to, so I can then fill in my cell with the frequency histogram’s cell’s index. You can see that this algorithm is O(log(n)).

Therefore, we have three passes. The length of the input array is ‘n’, and the size of the maximum value in the input array is ‘m’.

- To initialize the frequency histogram and the histopyramid to 0, 2m threads perform a O(1) computation

To create the frequency histogram, n threads perform a O(1) computation (using atomic_inc). Because atomic operations have to be serialized, this has a delay of mode(input).
To create the histopyramid, there are log(m) passes, the biggest of which requires m/2 threads, and they each perform a O(1) computation
To create the output array, n threads perform a O(log(m)) computation.

Therefore, overall, the overall delay of the algorithm is O(mode(input) + log(m)). Also, max(n, 2m) threads are required to achieve this delay. The memory complexity is the same as for the serial case, because the histopyramid is the same size as the frequency histogram (so it’s just twice as big, which is the same asymptotic function).

Counting Sort's Relation to Bead Sort

Counting sort is a sorting algorithm that doesn’t use comparisons to re-order its objects. Instead, counting sort creates a frequency histogram, then recreates the sorted list from the frequency histogram. For example, if you know that the input array consists of 3x of value A and 4x of value B, and that A < B, you know the sorted array is just AAABBBB. An obvious limitation of this is that the memory complexity of the sort is O(maximum value). Counting sort is the same as bucket sort, if our buckets have a size of 1.

Implementing counting sort on a scalar processor looks something like this (using my own made-up syntax):

max := maximum(input) // Assumed to be known at compile time

histogram[] := allocate max cells

initialize histogram[] to 0

foreach input as x:

histogram[x]++

foreach histogram as (index, x):

for 0 to x:

emit index

However, I’m interested in implementing this on a parallel processor (such as a graphics card). The first loop is straightforward to implement, as it is embarrassingly parallel (the increment can be done with atomic operations). Each iteration of the second loop, however, depends on the previous iteration, which makes that loop entirely serial.

Interestingly enough, Wikipedia describes this second loop a little differently. Its description looks like this:

strictly_increasing := scanl1((+), histogram) // scanl1 function taken from Haskell

foreach input as x:

output[atomic_inc(&strictly_increasing[x])] = x // where atomic_inc returns the original value

In this approach, the last loop is also embarrasingly parallel, however, the scanl1 isn’t. Therefore, it’s not actually any better for a parallel processor.

There is, however, a way around this. What if we tried to calculate the strictly_increasing array directly? Instead of simply incrementing the histogram count of a particular value, we can modify the original loop to to increment all the histogram values to the right of the particular value. Incrementing each one of these histogram values is independent of each other, so we can actually do all these increments in parallel. This means that, instead of assigning each thread to a particular item in the input array, we can assign one thread to the combination of an item in the input array and a cell in the histogram. This means that we go to a two-dimensional work size. Each thread, therefore, simply executes “if (histogram_index > input_element) atomic_inc(&histogram[histogram_index]);”.

Looking at that code, it would appear that this is a constant-time sort. However, it isn’t. Those atomic increment operators need to be serialized on the memory bus. Therefore, the delay of this algorithm is on the order of whichever value needs to be incremented the most. Because the output of this histogram is strictly increasing, the rightmost value is the one that gets incremented the most. It also gets incremented for every element in the input list. This means that this algorithm is simply O(n).

Interestingly enough, we can actually use almost the exact same kernel to transform this histogram into the sorted list. The value of each value in the histogram corresponds to how many values in the input array are less than the index of that value in the histogram. Because we also know the length of the input array, we can determine, for each value in the histogram, how many values in the output array are greater than that particular index into the histogram. Well, if we know that k elements are greater than some threshold, why don’t we increment the last k elements in the output array? If we step along the thresholds one by one, the last elements in the array will be incremented according to their ultimate size. Therefore, if we go through all the thresholds (which are each of the indices in the histogram), we can recreate the sorted list. We can stepping through the histogram in one dimension, step through output indices in the other dimension. Each work item can check to see if it should increment its cell, and does so accordingly. This is essentially exactly the same code as the previous pass; the only difference is that the bounds are reversed. The number of work items is also the same because the bounds of the work grid are just flipped.

This second pass isn’t constant time either, for the same reason that the previous pass isn’t constant time. In particular, the increments that end up creating the greatest element have to be serialized. That means that this second pass is O(m), where m is the greatest value in the input array. Therefore, overall, this algorithm is O(n+m).

There is a caveat, however. Moving to a two-dimensional work size means creating a whole lot of threads. I’m running this on my graphics card, which is a parallel processor; however, my graphics card has some limitations. In particular, it only has two execution units, which means that I can only ::actually:: execute two work groups at the same time. Any more than that, and the GPU will have to context switch between the two groups. My graphics card also has a maximum work group size of 1024 elements. Because we have a two-dimensional work size, this means that my 1024 elements should be accessed as a 32x32 grid. We have two execution units, so our grid is effectively 32x64 elements big. This means that, if our maximum value that we want to sort is roughly the same as the length of the input list, we can only sort lists of about sqrt(32*64)=45 elements long without context switching the GPU. Context switching is bad because it means that our parallel threads aren’t actually executing in parallel, which is bad for our parallel sort. Therefore, moving to a two-dimensional work item size means we need more threads than a modern GPU can provide.

Sunday, May 12, 2013

Design of Mesa 3D Part 9: Vertex Buffer Objects

After the last series of posts, we now have a pretty good handle on how shaders are compiled. The next thing that a GL program typically does is create buffers and upload them. All of the relevant calls are in src/mesa/main/bufferobj.c.

We keep track of all the buffer objects in a hash table stored inside ctx->Shared->BufferObjects. This means that creating a buffer object is easy; just ask the hash table what its lowest unused key is (which is potentially an O(n) exhaustive search), ask the driver for a NewBufferObject(), then insert it into the hash table with that key. We have to surround that piece with a mutex because shared contexts may be executing buffer commands on different threads, and buffers are shared. The implementation of the software driver's NewBufferObject() function, or _mesa_new_buffer_object(), just malloc's a gl_buffer_object and initializes it with some default values.

Here's the layout of gl_buffer_object, defined in src/mesa/main/mtypes.h:

struct gl_buffer_object
{
GLint RefCount;
GLuint Name;
GLenum Usage; /**< GL_STREAM_DRAW_ARB, GL_STREAM_READ_ARB, etc. */
GLsizeiptrARB Size; /**< Size of buffer storage in bytes */
GLubyte *Data; /**< Location of storage either in RAM or VRAM. */
/** Fields describing a mapped buffer */
/*@{*/
GLbitfield AccessFlags; /**< Mask of GL_MAP_x_BIT flags */
GLvoid *Pointer; /**< User-space address of mapping */
GLintptr Offset; /**< Mapped offset */
GLsizeiptr Length; /**< Mapped length */
/*@}*/
GLboolean Written; /**< Ever written to? (for debugging) */
};

_mesa_BindBuffer() is also fairly simple; it immediately delegates to bind_buffer_object(). We switch over the target to find which pointer in the context we're going to be updating, and then see if we can take an early out if the buffer is already bound properly. Otherwise, we look up the buffer in our hash map, and ask the driver to create the buffer if it didn't already exist. We then reference the buffer we found in our hash map, actually change the pointer in the context to point to the buffer, and ask the driver to BindBuffer(), which has no default implementation.

Alright, now let's allocate some space for the data with _mesa_BufferData(). After checking that the input data is good, we just hand the call off to the driver. The software driver (_mesa_buffer_data()) implements this by simply realloc'ing the Data pointer in the object, setting the fields for the size and usage of the buffer, and, if the input pointer is not NULL, memcpy()'ing the data into it. _mesa_BufferSubData() and _mesa_buffer_sub_dat()a work almost exactly the same way (with additional bounds checking).

_mesa_MapBuffer() creates an access bit string based on the access enum provided, does some error checking, then delegates to the driver's MapBuffer() function. If the driver was successful, we fill in the gl_buffer_object's AccessFlags field with the bit string we previously computed, then return the Pointer field of the gl_buffer_object. The driver function, mesa_buffer_map(), disregards the access parameter, and returns a pointer to the original data.

Using a software renderer sure makes it simple to deal with buffers! I'll get into how hardware drivers implement this in another post. For now, it's onward to glDraw* and the tnl pipeline!

Monday, May 6, 2013

Runtime of std::hash

One of the questions that I like to interview candidates for my team with involves hashing strings. I was talking with someone about it, and they insisted that computing a hash function takes constant time, (O(1)). I then asked how good the has function could really be if it didn't look at all the characters. He insisted that you could look at all the characters, but still have the function be constant time.

I decided that this would be a good opportunity to see how it's really done. The new C++ standard, C++11, includes std::hash, which does the computation that we're interested in. The implementation that FreeBSD's libstdc++ is also open source, so I can just go look at the source.

The symbol is defined in string.h, so let's take a look at that file. We can immediately see that hash<basic_string<_CharT, _Traits, _Allocator> >::operator() delegates to __do_string_hash() which delegates to __murmur2_or_cityhash(). That function is defined in include/memory, and has a really interesting trick to it, using templates:

// We use murmur2 when size_t is 32 bits, and cityhash64 when size_t
// is 64 bits. This is because cityhash64 uses 64bit x 64bit
// multiplication, which can be very slow on 32-bit systems.
template <class _Size, size_t = sizeof(_Size)*__CHAR_BIT__>
struct __murmur2_or_cityhash;

As you can see, we're creating a new templated value called size_t. We can then use template specialization to define various implementations of the function. In particular, we have these two implementations:

template <class _Size>
struct __murmur2_or_cityhash<_Size, 32>
{
_Size operator()(const void* __key, _Size __len);
};

// murmur2
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 32>::operator()(const void* __key, _Size __len)

We've also got a cityhash64 implementation.

template <class _Size>
struct __murmur2_or_cityhash<_Size, 64>
{
_Size operator()(const void* __key, _Size __len);

...

// cityhash64
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 64>::operator()(const void* __key, _Size __len)

Cool! Alright, now let's read those functions to try to see what the running time is. First, let's take murmor. The main loop in this function looks like this:

for (; __len >= 4; __data += 4, __len -= 4)

In addition, __len isn't modified in the loop. Well, that makes it simple; this is clearly O(n). What about cityhash? Its main loop looks like this:

do {
...
__len -= 64;
} while (__len != 0);

__len isn't modified elsewhere within the loop. Well, looks like we've got an answer! std::hash is O(n).

Sunday, April 28, 2013

Design of Mesa 3D Part 8: Shader Compilation Example

So I've discussed how shaders get compiled, but I wanted to give a concrete example showing the stages a shader goes through. Here's the simple shader:

#version 120

float dontinlineme(float x) {
while (x > 0) {
if ((x / 5) * 5 == x) {
return x;
}
x--;
}
return x;
}
float inlineme(float x) {
return (x / 5) * 5;
}

attribute float attr;

void main() {
gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
float x = dontinlineme(attr);
float y = inlineme(attr+1.0);
if (x == y) {
gl_Position.yz = vec2(float(x), float(y));
} else {
gl_Position.yxz = vec3(0.4, 0.5, 0.6);
}
}

Here's the AST that is created that represents the shader:

FUNCTION main ( scope=0x80547b680
) param scope = 0x80547b680
{ locals=0x80547b710 outer=0x80547b680
//child 0 of 4:
EXPR: locals=0x80547b788 outer=0x80547b680
ASSIGNMENT locals=0x80547c0a8 outer=0x80547b680
VAR' gl_Position (in scope 0x7fffffffacb8) locals=0x80547b878 outer=0x80547b680
:= at 0x80547b7a0 locals=0x80547c0a8 outer=0x80547b680
LITERAL (0.000000 0.000000 0.000000 1.000000 )
//child 1 of 4:
{ locals=0x80547c270 outer=0x80547b680
//child 0 of 1:
DECL (locals=0x80547c528 outer=0x80547b680) float x (0x80547c290) (in scope 0x80547b680) := INITIALIZER
COMMA-SEQ at 0x80547c2e8 locals=0x805480ca0 outer=0x80547b680
//child 0 of 3:
DECL (locals=0x805480dd8 outer=0x805480ca0) float __resultTmp (0x805480e28) (in scope 0x805480ca0) ;
//child 1 of 3:
{{ // new scope locals=0x805480ea0 outer=0x805479098: x
//child 0 of 4:
DECL (locals=0x805482910 outer=0x805480ea0) float x (0x8054829c8) (in scope 0x805480ea0) :=
VAR' attr (in scope 0x7fffffffad10) locals=0x8054829a0 outer=0x80547b680
//child 1 of 4:
WHILE LOOP: locals = 0x805480fa8
WHILE cond:
EXPR: locals=0x8054810b0 outer=0x805480fa8
VAR' x (in scope 0x805480ea0) locals=0x805481248 outer=0x805480fa8
> at 0x8054810c8 locals=0x805481140 outer=0x805480fa8
LITERAL (0 )
WHILE body:
{{ // new scope locals=0x805481298 outer=0x805480fa8:
//child 0 of 2:
IF
ASM: vec4_multiply at 0x805481538 locals=0x8054853a0 outer=0x8038e47c0
ASM at 0x805481538 locals=0x8054853a0 outer=0x8038e47c0
//child 0 of 2:
COMMA-SEQ at 0x8054853b8 locals=0x8054857c0 outer=0x805481298
//child 0 of 3:
DECL (locals=0x8054858f8 outer=0x8054857c0) float __resultTmp (0x805485948) (in scope 0x8054857c0) ;
//child 1 of 3:
{{ // new scope locals=0x8054859a0 outer=0x8038e7b68: b bInv
//child 0 of 5:
DECL (locals=0x805486170 outer=0x8054859a0) const float b (0x805486228) (in scope 0x8054859a0) :=
LITERAL (5 )
//child 1 of 5:
{ locals=0x805485b20 outer=0x8054859a0
//child 0 of 1:
DECL (locals=0x805485bb0 outer=0x8054859a0) float bInv (0x805486290) (in scope 0x8054859a0) ;
}
//child 2 of 5:
ASM: float_rcp at 0x8054863a8 locals=0x805485bd8 outer=0x8054859a0
ASM at 0x8054863a8 locals=0x805485bd8 outer=0x8054859a0
//child 0 of 2:
FIELD x of
VAR' bInv (in scope 0x8054859a0) locals=0x805485d70 outer=0x8054859a0
//child 1 of 2:
VAR' b (in scope 0x8054859a0) locals=0x805485d98 outer=0x8054859a0
//child 3 of 5:
ASM: vec4_multiply at 0x805486408 locals=0x805485dc0 outer=0x8054859a0
ASM at 0x805486408 locals=0x805485dc0 outer=0x8054859a0
//child 0 of 3:
VAR' __resultTmp (in scope 0x8054857c0) locals=0x805485fb0 outer=0x8054857c0
//child 1 of 3:
VAR' x (in scope 0x805480ea0) locals=0x805485fd0 outer=0x805481298
//child 2 of 3:
VAR' bInv (in scope 0x8054859a0) locals=0x805485f80 outer=0x8054859a0
//child 4 of 5:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x8054857c0)
//child 1 of 2:
LITERAL (5 )
== at 0x8054813b8 locals=0x805481520 outer=0x805481298
VAR' x (in scope 0x805480ea0) locals=0x8054818a8 outer=0x805481298
THEN
{{ // new scope locals=0x8054818d0 outer=0x805481298:
//child 0 of 1:
{ locals=0x805481ff8 outer=0x8054818d0
//child 0 of 2:
ASSIGNMENT locals=0x805482100 outer=0x805481ff8
VAR' __resultTmp (in scope 0x805480ca0) locals=0x805482208 outer=0x805480ca0
:= at 0x805482010 locals=0x805482100 outer=0x805481ff8
VAR' x (in scope 0x805480ea0) locals=0x805482228 outer=0x8054818d0
//child 1 of 2:
RETURN
}
}}
ELSE
EXPR: locals=0x805481a20 outer=0x805481298
(oper-void)
ENDIF
//child 1 of 2:
EXPR: locals=0x805481ae0 outer=0x805481298
COMMA-SEQ at 0x805481af8 locals=0x80548b698 outer=0x805481298
//child 0 of 3:
DECL (locals=0x80548b7d0 outer=0x80548b698) float __resultTmp (0x80548b820) (in scope 0x80548b698) ;
//child 1 of 3:
{{ // new scope locals=0x80548b878 outer=0x8041e1398:
//child 0 of 3:
EXPR: locals=0x80548b980 outer=0x8041e1398
ASSIGNMENT locals=0x80548ba10 outer=0x8041e1398
VAR' __resultTmp (in scope 0x80548b698) locals=0x80548be88 outer=0x80548b698
:= at 0x80548b998 locals=0x80548ba10 outer=0x8041e1398
VAR' x (in scope 0x805480ea0) locals=0x80548bea8 outer=0x805481298
//child 1 of 3:
EXPR: locals=0x80548bb68 outer=0x8041e1398
ASSIGNMENT locals=0x80548bbf8 outer=0x8041e1398
VAR' x (in scope 0x805480ea0) locals=0x80548bec8 outer=0x805481298
:= at 0x80548bb80 locals=0x80548bbf8 outer=0x8041e1398
ASM: vec4_subtract at 0x80548bc70 locals=0x80548e378 outer=0x8038e1490
ASM at 0x80548bc70 locals=0x80548e378 outer=0x8038e1490
//child 0 of 2:
VAR' x (in scope 0x805480ea0) locals=0x80548e560 outer=0x805481298
//child 1 of 2:
LITERAL (1.000000 )
//child 2 of 3:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x80548b698)
}}
END WHILE LOOP
//child 2 of 4:
{ locals=0x805482570 outer=0x805480ea0
//child 0 of 2:
ASSIGNMENT locals=0x805482678 outer=0x805482570
VAR' __resultTmp (in scope 0x805480ca0) locals=0x805482780 outer=0x805480ca0
:= at 0x805482588 locals=0x805482678 outer=0x805482570
VAR' x (in scope 0x805480ea0) locals=0x8054827a0 outer=0x805480ea0
//child 1 of 2:
RETURN
}
//child 3 of 4:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x805480ca0)
}
//child 2 of 4:
{ locals=0x80547c660 outer=0x80547b680
//child 0 of 1:
DECL (locals=0x80547cc28 outer=0x80547b680) float y (0x80547c688) (in scope 0x80547b680) := INITIALIZER
COMMA-SEQ at 0x80547c6e0 locals=0x80548f4e8 outer=0x80547b680
//child 0 of 3:
DECL (locals=0x80548f620 outer=0x80548f4e8) float __resultTmp (0x80548f670) (in scope 0x80548f4e8) ;
//child 1 of 3:
{{ // new scope locals=0x80548f6c8 outer=0x80547abf0: x
//child 0 of 3:
DECL (locals=0x805490570 outer=0x80548f6c8) float x (0x805490758) (in scope 0x80548f6c8) :=
ASM: vec4_add at 0x805490588 locals=0x805492b70 outer=0x8038de1d8
ASM at 0x805490588 locals=0x805492b70 outer=0x8038de1d8
//child 0 of 2:
VAR' attr (in scope 0x7fffffffad10) locals=0x805492d58 outer=0x80547b680
//child 1 of 2:
LITERAL (1.000000 )
//child 1 of 3:
{ locals=0x80548ffd0 outer=0x80548f6c8
//child 0 of 2:
ASSIGNMENT locals=0x8054900d8 outer=0x80548ffd0
VAR' __resultTmp (in scope 0x80548f4e8) locals=0x8054901e0 outer=0x80548f4e8
:= at 0x80548ffe8 locals=0x8054900d8 outer=0x80548ffd0
ASM: vec4_multiply at 0x805490150 locals=0x805493158 outer=0x8038e47c0
ASM at 0x805490150 locals=0x805493158 outer=0x8038e47c0
//child 0 of 2:
COMMA-SEQ at 0x805493170 locals=0x805493578 outer=0x80548f6c8
//child 0 of 3:
DECL (locals=0x8054936b0 outer=0x805493578) float __resultTmp (0x805493700) (in scope 0x805493578) ;
//child 1 of 3:
{{ // new scope locals=0x805493758 outer=0x8038e7b68: b bInv
//child 0 of 5:
DECL (locals=0x805493f28 outer=0x805493758) const float b (0x805493fe0) (in scope 0x805493758) :=
LITERAL (5 )
//child 1 of 5:
{ locals=0x8054938d8 outer=0x805493758
//child 0 of 1:
DECL (locals=0x805493968 outer=0x805493758) float bInv (0x805494048) (in scope 0x805493758) ;
}
//child 2 of 5:
ASM: float_rcp at 0x805494160 locals=0x805493990 outer=0x805493758
ASM at 0x805494160 locals=0x805493990 outer=0x805493758
//child 0 of 2:
FIELD x of
VAR' bInv (in scope 0x805493758) locals=0x805493b28 outer=0x805493758
//child 1 of 2:
VAR' b (in scope 0x805493758) locals=0x805493b50 outer=0x805493758
//child 3 of 5:
ASM: vec4_multiply at 0x8054941c0 locals=0x805493b78 outer=0x805493758
ASM at 0x8054941c0 locals=0x805493b78 outer=0x805493758
//child 0 of 3:
VAR' __resultTmp (in scope 0x805493578) locals=0x805493d68 outer=0x805493578
//child 1 of 3:
VAR' x (in scope 0x80548f6c8) locals=0x805493d88 outer=0x80548f6c8
//child 2 of 3:
VAR' bInv (in scope 0x805493758) locals=0x805493d38 outer=0x805493758
//child 4 of 5:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x805493578)
//child 1 of 2:
LITERAL (5 )
//child 1 of 2:
(oper-void)
}
//child 2 of 3:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x80548f4e8)
}
//child 3 of 4:
IF
VAR' x (in scope 0x80547b680) locals=0x80547ceb0 outer=0x80547b680
== at 0x80547dc90 locals=0x80547d0c0 outer=0x80547b680
VAR' y (in scope 0x80547b680) locals=0x80547cf88 outer=0x80547b680
THEN
{{ // new scope locals=0x80547d288 outer=0x80547b680:
//child 0 of 1:
EXPR: locals=0x80547d300 outer=0x80547d288
ASSIGNMENT locals=0x80547db88 outer=0x80547d288
FIELD yz of
VAR' gl_Position (in scope 0x7fffffffacb8) locals=0x80547d3f0 outer=0x80547d288
:= at 0x80547d318 locals=0x80547db88 outer=0x80547d288
COMMA-SEQ at 0x80547dc00 locals=0x8054998f8 outer=0x80547d288
//child 0 of 3:
DECL (locals=0x805499a30 outer=0x8054998f8) vec2 __resultTmp (0x805499a80) (in scope 0x8054998f8) ;
//child 1 of 3:
{{ // new scope locals=0x805499ad8 outer=0x80380d510: x y
//child 0 of 5:
DECL (locals=0x80549a248 outer=0x805499ad8) const float x (0x80549a398) (in scope 0x805499ad8) :=
COMMA-SEQ at 0x80549a260 locals=0x80549cb98 outer=0x80547d288
//child 0 of 3:
DECL (locals=0x80549ccd0 outer=0x80549cb98) float __resultTmp (0x80549cd20) (in scope 0x80549cb98) ;
//child 1 of 3:
{{ // new scope locals=0x80549cd78 outer=0x80380cb18:
//child 0 of 2:
EXPR: locals=0x80549ce08 outer=0x80380cb18
ASSIGNMENT locals=0x80549ce98 outer=0x80380cb18
VAR' __resultTmp (in scope 0x80549cb98) locals=0x80549cff8 outer=0x80549cb98
:= at 0x80549ce20 locals=0x80549ce98 outer=0x80380cb18
VAR' x (in scope 0x80547b680) locals=0x80549d018 outer=0x80547d288
//child 1 of 2:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x80549cb98)
//child 1 of 5:
DECL (locals=0x80549a570 outer=0x805499ad8) const float y (0x80549a6c8) (in scope 0x805499ad8) :=
COMMA-SEQ at 0x80549a588 locals=0x80549f860 outer=0x80547d288
//child 0 of 3:
DECL (locals=0x80549f998 outer=0x80549f860) float __resultTmp (0x80549f9e8) (in scope 0x80549f860) ;
//child 1 of 3:
{{ // new scope locals=0x80549fa40 outer=0x80380cb18:
//child 0 of 2:
EXPR: locals=0x80549fad0 outer=0x80380cb18
ASSIGNMENT locals=0x80549fb60 outer=0x80380cb18
VAR' __resultTmp (in scope 0x80549f860) locals=0x80549fcc0 outer=0x80549f860
:= at 0x80549fae8 locals=0x80549fb60 outer=0x80380cb18
VAR' y (in scope 0x80547b680) locals=0x80549fce0 outer=0x80547d288
//child 1 of 2:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x80549f860)
//child 2 of 5:
EXPR: locals=0x805499be0 outer=0x805499ad8
ASSIGNMENT locals=0x805499c70 outer=0x805499ad8
FIELD x of
VAR' __resultTmp (in scope 0x8054998f8) locals=0x80549a0e8 outer=0x8054998f8
:= at 0x805499bf8 locals=0x805499c70 outer=0x805499ad8
VAR' x (in scope 0x805499ad8) locals=0x805499e30 outer=0x805499ad8
//child 3 of 5:
EXPR: locals=0x805499e60 outer=0x805499ad8
ASSIGNMENT locals=0x805499ef0 outer=0x805499ad8
FIELD y of
VAR' __resultTmp (in scope 0x8054998f8) locals=0x80549a108 outer=0x8054998f8
:= at 0x805499e78 locals=0x805499ef0 outer=0x805499ad8
VAR' y (in scope 0x805499ad8) locals=0x80549a0b0 outer=0x805499ad8
//child 4 of 5:
LABEL (null)
}}
//child 2 of 3:
VAR __resultTmp (in scope 0x8054998f8)
}}
ELSE
{{ // new scope locals=0x80547ddb0 outer=0x80547b680:
//child 0 of 1:
EXPR: locals=0x80547de28 outer=0x80547ddb0
ASSIGNMENT locals=0x80547e698 outer=0x80547ddb0
FIELD yxz of
VAR' gl_Position (in scope 0x7fffffffacb8) locals=0x80547df18 outer=0x80547ddb0
:= at 0x80547de40 locals=0x80547e698 outer=0x80547ddb0
LITERAL (0.400000 0.500000 0.600000 )
}}
ENDIF
}

Here's the intermediate representation that this gets turned into:

NEW SCOPE

COPY

VAR gl_Position at ENV_PARAM[0] store 0x805458248

FLOAT 0 0 0 1

VAR_DECL x (0x80547c290) at TEMP[-7] store 0x805480b40

COPY

VAR x at TEMP[-7] store 0x805480b40

VAR_DECL __resultTmp (0x805480e28) at TEMP[-7] store 0x805482c58

CALL dontinlineme_1

VAR __resultTmp at TEMP[-7] store 0x805482c58

VAR_DECL y (0x80547c688) at TEMP[-7] store 0x80548f390

COPY

VAR y at TEMP[-7] store 0x80548f390

VAR_DECL __resultTmp (0x80548f670) at TEMP[-7] store 0x805490958

NEW SCOPE

VAR_DECL x (0x805490758) at TEMP[-7] store 0x805492a10

COPY

VAR x at TEMP[-7] store 0x805492a10

IR_ADD (0x805492d98, 0x805492e08) (store 0x805492f18)

VAR attr.x??? at LOCAL_PARAM[16] store 0x80547b650

FLOAT 1 1 1 1

COPY

VAR __resultTmp at TEMP[-7] store 0x805490958

IR_MUL (0x805496db0, 0x805496e20) (store 0x805496f30)

VAR_DECL __resultTmp (0x805493700) at TEMP[-7] store 0x805494308

NEW SCOPE

VAR_DECL b (0x805493fe0) at TEMP[-7] store 0x8054963c0

COPY

VAR b at TEMP[-7] store 0x8054963c0

FLOAT 5 5 5 5

VAR_DECL bInv (0x805494048) at TEMP[-7] store 0x805496650

IR_RCP (0x8054966f0, 0x0) (store 0x8054968b0)

VAR b at TEMP[-7] store 0x8054963c0

IR_MUL (0x805496950, 0x8054969c0) (store 0x805494308)

VAR x at TEMP[-7] store 0x805492a10

VAR bInv at TEMP[-7] store 0x805496650

LABEL: __endOfFunc_/_

VAR __resultTmp at TEMP[-7] store 0x805494308

FLOAT 5 5 5 5

IR_NOP (0x0, 0x0) (store 0x0)

LABEL: __endOfFunc_inlineme_

VAR __resultTmp at TEMP[-7] store 0x805490958

COND

IR_EQUAL (0x805497580, 0x805497510) (store 0x805497660)

VAR x at TEMP[-7] store 0x805480b40

VAR y at TEMP[-7] store 0x80548f390

THEN

NEW SCOPE

COPY

SWIZZLE .yz?? of (store 0x8054997f8)

VAR gl_Position at ENV_PARAM[0] store 0x805458248

SWIZZLE .xxyw of (store 0x8054a2c40)

VAR_DECL __resultTmp (0x805499a80) at TEMP[-7] store 0x80549a988

NEW SCOPE

VAR_DECL x (0x80549a398) at TEMP[-7] store 0x80549ca40

COPY

VAR x at TEMP[-7] store 0x80549ca40

VAR_DECL __resultTmp (0x80549cd20) at TEMP[-7] store 0x80549d180

NEW SCOPE

COPY

VAR __resultTmp at TEMP[-7] store 0x80549d180

VAR x at TEMP[-7] store 0x805480b40

LABEL: __endOfFunc_float_

VAR __resultTmp at TEMP[-7] store 0x80549d180

VAR_DECL y (0x80549a6c8) at TEMP[-7] store 0x80549f708

COPY

VAR y at TEMP[-7] store 0x80549f708

VAR_DECL __resultTmp (0x80549f9e8) at TEMP[-7] store 0x80549fe48

NEW SCOPE

COPY

VAR __resultTmp at TEMP[-7] store 0x80549fe48

VAR y at TEMP[-7] store 0x80548f390

LABEL: __endOfFunc_float_

VAR __resultTmp at TEMP[-7] store 0x80549fe48

COPY

SWIZZLE .x??? of (store 0x8054a24b0)

VAR __resultTmp at TEMP[-7] store 0x80549a988

VAR x at TEMP[-7] store 0x80549ca40

COPY

SWIZZLE .y??? of (store 0x8054a2710)

VAR __resultTmp at TEMP[-7] store 0x80549a988

SWIZZLE .xxzw of (store 0x8054a2820)

VAR y at TEMP[-7] store 0x80549f708

LABEL: __endOfFunc_vec2_

VAR __resultTmp at TEMP[-7] store 0x80549a988

ELSE

NEW SCOPE

COPY

SWIZZLE .yxz? of (store 0x8054a4e48)

VAR gl_Position at ENV_PARAM[0] store 0x805458248

SWIZZLE .yxzw of (store 0x8054a4f88)

FLOAT 0.4 0.5 0.6 0.6

ENDIF

LABEL: __endOfFunc__main

And here's the actual generated code:

# Vertex Program/Shader

0: MOV OUTPUT[0], CONST[0];

1: CAL 22; # dontinlineme_1

2: MOV TEMP[0].x, TEMP[0].yyyy;

3: ADD TEMP[0].w, INPUT[16].xxxx, CONST[0].wwww;

4: MOV TEMP[1].z, CONST[1].xxxx;

5: RCP TEMP[1].w, TEMP[1].zzzz;

6: MUL TEMP[1].y, TEMP[0].wwww, TEMP[1].wwww;

7: MUL TEMP[0].z, TEMP[1].yyyy, CONST[1].xxxx;

8: MOV TEMP[0].y, TEMP[0].zzzz;

9: SEQ TEMP[0].z, TEMP[0].xxxx, TEMP[0].yyyy;

10: IF TEMP[0].zzzz; # (if false, goto 19);

11: MOV TEMP[1].z, TEMP[0].xxxx;

12: MOV TEMP[0].w, TEMP[1].zzzz;

13: MOV TEMP[1].w, TEMP[0].yyyy;

14: MOV TEMP[1].z, TEMP[1].wwww;

15: MOV TEMP[1].x, TEMP[0].wwww;

16: MOV TEMP[1].y, TEMP[1].zzzw;

17: MOV OUTPUT[0].yz, TEMP[1].xxyw;

18: ELSE; # (goto 21)

19: MOV OUTPUT[0].xyz, CONST[2].yxzw;

20: ENDIF;

21: END

22: BGNSUB; # dontinlineme_1

23: MOV TEMP[0].z, INPUT[16].xxxx;

24: BGNLOOP; # (end at 40)

25: SLE TEMP[0].w, TEMP[0].zzzz, CONST[0].xxxx;

26: IF TEMP[0].wwww; # (if false, goto 29);

27: BRK (TR); # (goto 41);

28: ENDIF;

29: MOV TEMP[1].y, CONST[1].xxxx;

30: RCP TEMP[1].z, TEMP[1].yyyy;

31: MUL TEMP[1].x, TEMP[0].zzzz, TEMP[1].zzzz;

32: MUL TEMP[1].y, TEMP[1].xxxx, CONST[1].xxxx;

33: SEQ TEMP[1].x, TEMP[1].yyyy, TEMP[0].zzzz;

34: IF TEMP[1].xxxx; # (if false, goto 37);

35: MOV TEMP[0].y, TEMP[0].zzzz;

36: RET (TR);

37: ENDIF;

38: MOV TEMP[1].y, TEMP[0].zzzz;

39: SUB TEMP[0].z, TEMP[0].zzzz, CONST[0].wwww;

40: ENDLOOP; # (goto 24)

41: MOV TEMP[0].y, TEMP[0].zzzz;

42: RET (TR);

43: ENDSUB; # dontinlineme_1

InputsRead: 0x10000 (0b1,00000000,00000000)

OutputsWritten: 0x0 (0b0)

NumInstructions=44

NumTemporaries=0

NumParameters=0

NumAttributes=0

NumAddressRegs=0

SamplersUsed: 0x0 (0b0)

Samplers=[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]

param list 0x802cb4280

dirty state flags: 0x0

param[0] sz=4 CONST (null) = {0, 0, 0, 1}

param[1] sz=1 CONST (null) = {5, 5, 5, 5}

param[2] sz=3 CONST (null) = {0.4, 0.5, 0.6, 0.6}

Hopefully that helps make things clearer.