Friday, November 29, 2013

Complex Text Handling in WebKit, Part 2: Fonts

Before we go any further, I’d like to talk a little bit about what information is contained within font files. Ultimately, font files dictate the ground truth for how to lay out a particular run of text. There are two common formats of font files: TrueType fonts and OpenType fonts. TrueType was developed by Apple, and when a licensing agreement between Apple and Microsoft couldn’t be reached, Microsoft built on top of TrueType to make OpenType. Both of these formats are similar and use the same underlying file structure. My experience is with TrueType, so that’s what I’m going to focus on here.

A TrueType font file is made up of a collection of sections, each called a “table.” There is also a directory table (which appears first) that describes the locations and sizes of all of the other tables. Each table is denoted by a tag of four ASCII characters that identifies what the table describes. For example, the table that describes the character-to-glyph mapping has a tag of “cmap” and the table that describes naming information has a tag of “name”. Because unknown tables are simply ignored by any program parsing the file, the format can be extended with new kinds of tables.
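
To make that concrete, here is a minimal sketch of the structures at the front of the file, written as C++ structs (field names follow the spec; every value in the file is stored big-endian, so a real parser has to byte-swap on little-endian machines):

    #include <cstdint>

    struct OffsetTable {           // appears once, at the very beginning of the file
        uint32_t sfntVersion;      // 0x00010000 for TrueType outlines
        uint16_t numTables;        // how many directory entries follow
        uint16_t searchRange;      // binary-search helpers derived from numTables
        uint16_t entrySelector;
        uint16_t rangeShift;
    };

    struct TableDirectoryEntry {   // one per table, immediately after the OffsetTable
        uint32_t tag;              // four ASCII characters, e.g. 'cmap' or 'name'
        uint32_t checkSum;
        uint32_t offset;           // byte offset of the table from the start of the file
        uint32_t length;           // length of the table in bytes
    };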

One of the duties of a font system is to translate a sequence of characters (read: code points) into a sequence of glyphs. Each glyph is identified by a number (which can later be used to get the information needed to draw the glyph). It is important to note that the set of code points that a font knows about is arbitrary and differs from font file to font file. In addition, the glyph IDs themselves are arbitrary and differ from font file to font file. (For example, if you ask the font system which glyph should be drawn for the letter ‘a’, you cannot reuse that glyph ID with another font - it may point to a different glyph entirely!) The code-point-to-glyph mapping is done using the “cmap” table in the font file. The specifics of how the lookup is completed are not really all that complicated, but I’ll let you read [1] if you’re interested.
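
Just to give a flavor of it, here is a rough sketch of the lookup in the common “format 4” cmap subtable, assuming the subtable’s parallel segment arrays have already been parsed out of the file (error handling and the binary search the spec suggests are omitted):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Returns 0 (the "missing glyph") when the font has no glyph for the code point.
    uint16_t glyphForCodePoint(uint16_t c,
                               const std::vector<uint16_t>& endCode,
                               const std::vector<uint16_t>& startCode,
                               const std::vector<int16_t>& idDelta,
                               const std::vector<uint16_t>& idRangeOffset,
                               const std::vector<uint16_t>& glyphIdArray)
    {
        size_t segCount = endCode.size();
        for (size_t i = 0; i < segCount; ++i) {      // the spec suggests a binary search
            if (endCode[i] < c)
                continue;                            // not this segment yet
            if (startCode[i] > c)
                return 0;                            // no segment covers this code point
            if (idRangeOffset[i] == 0)
                return (c + idDelta[i]) & 0xFFFF;    // simple delta mapping
            // Otherwise idRangeOffset is a byte offset (relative to its own position
            // in the file) into glyphIdArray; with the arrays stored separately,
            // that works out to the index below.
            size_t index = idRangeOffset[i] / 2 + (c - startCode[i]) - (segCount - i);
            uint16_t glyph = glyphIdArray[index];
            return glyph ? (glyph + idDelta[i]) & 0xFFFF : 0;
        }
        return 0;
    }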

Okay cool, so now we have a glyph ID for the particular code point that we’re trying to draw. We now have to figure out where in the font file the relevant information regarding this glyph is kept. Each glyph can contain a different number of contours and control points, and glyph data is not padded to a fixed size, so we have to look up where each glyph’s data begins. This information is inside the “loca” (index-to-location) table. This table is simply an array of offsets, indexed by glyph ID, so figuring out where to start looking is very easy.
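
A sketch of that lookup, assuming the offsets have already been read into an array (and already doubled if the font uses the “short” loca format, which stores offset/2 in 16 bits; the “long” format stores the offsets directly, and the “head” table records which format the font uses):

    #include <cstdint>
    #include <vector>

    struct GlyphLocation {
        uint32_t offset;   // byte offset into the "glyf" table
        uint32_t length;   // 0 for glyphs with no outline at all, e.g. the space glyph
    };

    // "loca" has one more entry than there are glyphs, so the next entry always
    // tells you where this glyph's data ends.
    GlyphLocation locateGlyph(uint16_t glyphId, const std::vector<uint32_t>& loca)
    {
        return { loca[glyphId], loca[glyphId + 1] - loca[glyphId] };
    }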

Alright, now we have an offset, and we know where to start looking for information about how to draw our glyph. The offset is an offset into the “glyf” table, which describes the contours that make up the outline of the relevant glyph. TrueType has two different notions of glyphs: simple glyphs and compound glyphs. Compound glyphs are created simply by drawing multiple other glyphs with different positions and scales. For example, there might be a compound glyph for “ñ” which is composed by drawing the glyph for “n” and the glyph for “~” on top of each other.

A simple glyph is described by a collection of contours. A contour is described by a sequence of quadratic curves laid out end-to-end. A single quadratic curve is characterized by three points: two endpoints and a single control point, which defines the tangents at the endpoints. Because the curves in a single contour are laid out end-to-end, the endpoint of one curve is a duplicate of the starting point of the next curve. Also, if we assume that the contour is smooth at that point, meaning that the ending derivative of one curve is the same as the beginning derivative of the next curve, we don’t even need to explicitly state the location of the shared endpoint. We can figure out where the shared endpoint must be given the two control points that straddle it.
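
For reference, the curve traced between on-curve endpoints P0 and P1 with control point C is the standard quadratic Bézier:

    B(t) = (1 - t)^2 * P0  +  2 * t * (1 - t) * C  +  t^2 * P1,    for 0 <= t <= 1

and when two control points C1 and C2 appear back-to-back with no explicit on-curve point between them, the implied shared endpoint is simply their midpoint, (C1 + C2) / 2.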

Therefore, each contour is described as a list of points, coupled with a boolean for each point that says whether that point is “on” the contour or “off” the contour. Points that are on the contour are endpoints of a curve, and points that are off the contour are control points. Any such sequence of (point, on/off boolean) tuples uniquely defines a contour. Note that kinks are allowed, because the font author can explicitly specify all three points of each quadratic curve (but doesn’t have to). If there is no control point between two on-contour points, a straight line is drawn between the two points. A glyph simply consists of a bunch of these contours.
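
To make that concrete, here is a rough sketch of how one contour’s point list might be turned back into individual curves. It assumes, purely for brevity, that the contour starts with an on-curve point (a real parser can’t assume that, since a contour is allowed to start off-curve):

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Point { float x, y; };
    struct ContourPoint { Point p; bool onCurve; };

    void walkContour(const std::vector<ContourPoint>& pts,
                     const std::function<void(Point)>& lineTo,
                     const std::function<void(Point, Point)>& quadTo) // (control, end)
    {
        size_t n = pts.size();
        for (size_t i = 1; i <= n; ++i) {
            const ContourPoint& cur = pts[i % n];        // wrap around to close the contour
            if (cur.onCurve) {
                lineTo(cur.p);                           // two on-curve points in a row: straight line
            } else {
                const ContourPoint& next = pts[(i + 1) % n];
                Point end = next.onCurve
                    ? next.p                             // explicit on-curve endpoint
                    : Point{ (cur.p.x + next.p.x) / 2,   // implied endpoint: midpoint of
                             (cur.p.y + next.p.y) / 2 }; // two consecutive control points
                quadTo(cur.p, end);
                if (next.onCurve)
                    ++i;                                 // we consumed the endpoint too
            }
        }
    }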

There’s one last piece we need to know in order to draw a glyph: TrueType dictates that rasterizers use the non-zero winding number rule to determine whether a particular point is inside or outside the glyph. This rule is well-documented, but works like this: from the point you’re trying to classify, draw a ray in any direction. Every time the ray crosses a contour from left-to-right, increment a counter. Every time the ray crosses a contour from right-to-left, decrement the counter. If the counter ends up with a non-zero value, the point is inside the glyph. Note that this rule assumes that glyphs “hold water,” meaning that there are no gaps in the contours. It also means that contours can intersect, and that font authors can cut holes out of closed contours.
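
Here is a sketch of that test, operating on contours that have already been flattened into closed polygons (straight edges only), which is roughly what a scanline rasterizer works with:

    #include <cstddef>
    #include <vector>

    struct Point { float x, y; };

    // Twice the signed area of the triangle (a, b, p): positive if p is to the
    // left of the directed edge a -> b, negative if to the right.
    static float sideOfEdge(Point a, Point b, Point p)
    {
        return (b.x - a.x) * (p.y - a.y) - (p.x - a.x) * (b.y - a.y);
    }

    bool insideGlyph(Point p, const std::vector<std::vector<Point>>& contours)
    {
        int winding = 0;
        for (const auto& contour : contours) {
            size_t n = contour.size();
            for (size_t i = 0; i < n; ++i) {
                Point a = contour[i];
                Point b = contour[(i + 1) % n];            // wrap to close the contour
                if (a.y <= p.y) {
                    if (b.y > p.y && sideOfEdge(a, b, p) > 0)
                        ++winding;                         // upward crossing, point to its left
                } else {
                    if (b.y <= p.y && sideOfEdge(a, b, p) < 0)
                        --winding;                         // downward crossing, point to its right
                }
            }
        }
        return winding != 0;                               // non-zero means "inside"
    }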

It is also worth stating here that all of the coordinates used to describe points in this table are integral, meaning that this table can only describe locations on a grid. However, the dimensions of this grid are specified in the “head” (font header) table (sort of. Keep reading). Therefore, different fonts can have different precisions for their contours. It also means that the font system has to scale these coordinates to a pixel-based coordinate system. This scaling is straightforward: let’s call the user-selected font point size “p”. One point is equal to 1/72 of an inch, so divide p by 72 to get the number of inches the grid should take up, then multiply that by the output device’s pixels per inch (so now we have pixels), and then scale the grid to take up that number of pixels. The TrueType spec defines that this scaling is done in fixed point with a resolution of 1/64.
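
For example, at 12 pt on a 96 dpi display with 2,048 units per Em (a common value), one Em is (12 / 72) * 96 = 16 pixels, so each font unit is 16 / 2048 = 0.0078125 pixels. A sketch of the conversion, producing the 1/64-pixel fixed-point values the spec calls for (unitsPerEm comes from the “head” table; see the next paragraph):

    #include <cmath>
    #include <cstdint>

    // font design units -> device pixels in 26.6 fixed point (units of 1/64 pixel)
    int32_t funitsToFixedPixels(int32_t funits, uint16_t unitsPerEm,
                                float pointSize, float dpi)
    {
        float pixelsPerEm = pointSize / 72.0f * dpi;        // points -> inches -> pixels
        float pixels = funits * pixelsPerEm / unitsPerEm;   // rescale the design grid
        return static_cast<int32_t>(std::lround(pixels * 64.0f));
    }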

It should also be noted that the actual value in the “head” table is “units per Em.” In the old days of movable type, a printer would have lots of little squares of metal with raised letters on them. When the printer wanted to print a book, he or she would arrange the little squares to spell out a page, dip the raised letters in ink, then stamp out the page. Therefore, each letter had to fit on one of these little squares. The size of these squares is called an “Em.” On a computer, when you say that you want to use a 12 pt font, one pt equals 1/72 of an inch, so you’re saying to make the Em square 12/72 of an inch big on your screen. The scaler then uses the “units per Em” value to figure out how to scale the glyph. However, nothing is stopping a point on a contour from being greater than the “units per Em” value. This means that the glyph may extend outside of the Em square. This is valid, and even expected, behavior.

There’s one major piece of drawing glyphs that I haven’t discussed. There is a problem when the font size is roughly the size of just a couple pixels on the screen. When that happens, the pixels don’t have enough definition to describe the character properly. The contours described so far describe the platonic ideal of what the glyph should look like, but if the glyph is drawn on a low-resolution screen, it may be beneficial to munge about with the control points so that the glyph maps better to the pixel grid. This process is called “font hinting.” TrueType exposes a mechanism to do this, called “glyph instructions.” Yes, that’s right, there is an entire language interpreter that TrueType engines employ to munge about with glyph control points. The font author writes a program in this programming language, and the font runtime executes this program at draw time to move control points around. This programming language is complete with flow control, functions, variables, etc. The only difference from regular programming languages is that the only way to output information is to move control points around - you can’t print output to the console or anything like that. The instruction set for this language is actually quite interesting - there are many domain-specific instructions making it convenient to, say, move a collection of points along the same vector. It’s definitely worth taking a look at if you’re interested in domain-specific languages.

Another thing that a font file has to describe is the origin point and advance point for a particular glyph. This is actually fairly straightforward - you render one glyph with respect to its own origin. Then, you “move the pen” to that glyph’s advance point, and start on the next glyph. This is how glyphs avoid getting bunched up together when you draw them. By convention, the first point in the font file for a glyph is that glyph’s origin, and the last point is its advance point.
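
A sketch of that “pen” model (the advances array is assumed to already hold each glyph’s advance, scaled to pixels, and drawGlyph is a stand-in for whatever actually rasterizes a glyph):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct PenPosition { float x, y; };

    void drawRun(const std::vector<uint16_t>& glyphs,
                 const std::vector<float>& advances,
                 PenPosition pen,
                 const std::function<void(uint16_t, PenPosition)>& drawGlyph)
    {
        for (size_t i = 0; i < glyphs.size(); ++i) {
            drawGlyph(glyphs[i], pen);   // render with respect to the glyph's own origin
            pen.x += advances[i];        // "move the pen" so glyphs don't bunch up
        }
    }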

By the way, the TrueType spec makes no mention of antialiasing, subpixel rendering, or colors. It assumes that a pixel is either 100% on or 100% off, and that pixels are monochrome. That tech is implemented in the rasterizer to make fonts look beautiful. In order to do this correctly, the font system needs to figure out the coverage of a particular square region with respect to the contours. Calculating that exactly requires evaluating integrals (which is obviously too slow), so approximations are used instead. Colored glyphs (such as emoji) are rendered using bitmap glyphs, which are completely different.

In this discussion, I’ve described only a few of the tables in TrueType files. However, there are many more. One of the more interesting ones is the “bsln” table, which describes where the various baselines are for the characters (some languages, like Bengali, use an upper baseline, and some, like Chinese, use a centered baseline). Another interesting one (“lcar”) exists because ligatures represent multiple characters in a single glyph; there might be positions within the glyph that represent logical breaks (so that the caret or a selection might appear partway through the glyph). Another (“avar”) exists because a font might contain multiple sets of glyph information across various axes (such as font weight). Another (“trak”) describes how to achieve various tracking settings for glyphs. There are many extension tables that have been added to TrueType, and even more that have been added to OpenType.

[1] https://developer.apple.com/fonts/ttrefman/RM06/Chap6cmap.html

Sunday, November 10, 2013

Complex Text Handling in WebKit, Part 1: Encoding Systems, Code Points, and Code Units

Complex text handling is a deep topic, owing to the variety of languages and writing systems in the world. This post is meant for programmers who have had little exposure to complex text.

The first concept to mention is the code point. A code point is a generalization of what you are probably familiar with as a “character.” Because computers have no concept of language, words, or characters, a sentence or paragraph is represented as a sequence of numbers that describe the characters which eventually get drawn. The atomic unit of this sequence is called a “code point.” Now, the meaning of each individual code point is described by an encoding system. For example, using the ASCII encoding system, the code point with value 121 means the letter y, and the code point with value 65 means the letter A. http://www.asciitable.com contains a description of each code point in the ASCII encoding system (there are only 128 of them), and which character that code point maps to. Because there are only 128 code points, ASCII is called a 7-bit encoding system. (As an aside: by convention, the 7 bits are not packed together; instead, each code point is mapped to a byte by simply setting the high bit of the byte to 0. This will become important later. Historically, in the days of mainframes and such, this high bit was used for parity, but not any more.)

However, there are other encoding systems which might map the code point with value 121 to some other character. This means that, if someone hands you a string, you have to know which encoding system the string is encoded with in order to know the meaning of that string (or, indeed, to draw the string). There used to be many competing encoding systems, and the world was very confusing and chaotic. However, the Unicode consortium was founded to create one gigantic encoding system (http://www.unicode.org/charts/) that would unify all of the various competing encoding systems. This encoding system contains 1,114,112 possible code points, so, as you can imagine, most (but not all!) characters of most writing systems on earth are contained within it. The Unicode consortium also designed the code points such that all of the ASCII characters have the same values in Unicode, and it has some other nice properties too. Unicode has essentially become the standard for all computerized strings.

That’s all well and good, but because there are so many code points, the Unicode encoding system needs 21 bits to be able to represent every code point. However, most strings use only a small fraction of all the possible code points. Therefore, most Unicode strings have an encoding associated with them. There are three commonly used Unicode encodings: UTF-8, UTF-16, and UTF-32. UTF-32 is the most straightforward (and also the least common): it simply represents each code point as a 32-bit unsigned integer. Because the high 11 bits of each of these integers are always zeroes, this encoding is not very space-efficient, which is why it is rarely used.

UTF-8 and UTF-16, however, use sequences of numbers to encode each code point. In this context, each of these numbers is called a “code unit.” UTF-8 uses 8-bit code units, and UTF-16 uses 16-bit code units. Because these code units have fewer bits than a Unicode code point requires, some code points are encoded as multiple code units. I won’t go too deep into how exactly these encodings work, but I will mention some key properties of both of them.

In UTF-8, all of the code points that happen to be encoded as a single code unit have the high bit set to 0. This means that there are 128 such code points, and they are the exact same code points that ASCII encodes. Because ASCII doesn’t pack together its information, this means that a valid ASCII string is also a valid UTF-8 string. The other code points, however, are encoded such that the number of code units needed to represent the code point is contained within the first byte. For example, if two code units are needed, the first byte is 110xxxxx, and if 5 code units are needed, the first byte is 111110xx. By counting the number of leading 1s in the first code unit, you can determine the number of code units needed to represent the code point. In addition, the subsequent code units all start with the bit prefix of “10”. For example, all the code points that require 4 code units to encode use this pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, for a grand total of 21 bits of entropy. It also means that no code unit will be equal to 0 except for code point 0. Therefore, a null-terminated string encoded with UTF-8 is still null-terminated (with no premature nulls).
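
Here is a rough sketch of a decoder following the scheme just described, to show how little work is involved (validation of overlong forms, surrogate values, and truncated input is omitted):

    #include <cstdint>

    // Decode one code point from a UTF-8 buffer. Counts the leading 1 bits of the
    // first byte to learn how many code units the code point uses, then accumulates
    // 6 bits from each continuation byte (which looks like 10xxxxxx).
    uint32_t decodeOneUTF8(const uint8_t* s, int* unitsConsumed)
    {
        uint8_t first = s[0];
        if (first < 0x80) {                 // 0xxxxxxx: the ASCII-compatible range
            *unitsConsumed = 1;
            return first;
        }
        int units = 2;
        uint8_t mask = 0x1F;                // payload bits left in the first byte
        for (uint8_t probe = 0x20; first & probe; probe >>= 1) {
            ++units;                        // each extra leading 1 bit = one more unit
            mask >>= 1;
        }
        uint32_t codePoint = first & mask;
        for (int i = 1; i < units; ++i)
            codePoint = (codePoint << 6) | (s[i] & 0x3F);   // 10xxxxxx continuation
        *unitsConsumed = units;
        return codePoint;
    }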

In UTF-16, every code point can be encoded with either 1 or 2 code units (because each code unit contains 16 bits). The encoding is constructed such that, for a code point that requires two code units, the first code unit will always be in the range 0xD800..0xDBFF (called the “lead surrogates”) and the second code unit will always be in the range 0xDC00..0xDFFF (called the “trail surrogates”). This means that when, while inspecting a string, you come across a lead surrogate, you know that you must read a second code unit to know the relevant code point. If the second code unit isn’t a trail surrogate, the string is corrupt. All of the code points that only require one code unit are simply encoded directly. This means that it is impossible to represent a code point with value 0xD800 itself (because that value is reserved for a lead surrogate). The Unicode consortium thought of this, and made all code points in the range 0xD800..0xDFFF invalid code points. Therefore, if you see a code point in this range in a UTF-32 string, you know the string is corrupt.
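
A sketch of the decoding side: to encode, a code point above U+FFFF has 0x10000 subtracted from it and the remaining 20 bits are split across the two surrogates, so the decoder just reverses that (validation of a missing or corrupt trail surrogate is omitted):

    #include <cstdint>

    uint32_t decodeOneUTF16(const uint16_t* s, int* unitsConsumed)
    {
        uint16_t lead = s[0];
        if (lead < 0xD800 || lead > 0xDBFF) {   // ordinary code point: one unit
            *unitsConsumed = 1;
            return lead;
        }
        uint16_t trail = s[1];                  // should be in 0xDC00..0xDFFF
        *unitsConsumed = 2;
        return 0x10000 + ((uint32_t(lead - 0xD800) << 10) | (trail - 0xDC00));
    }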

There is one more wrench to throw into our plans: endianness. UTF-16 and UTF-32 use multi-byte code units, which means that they might be represented differently on machines with different endianness. Therefore, when someone hands you a string, you need to know:
  • Which encoding system is used? (The answer is almost always “Unicode”)
  • If Unicode, is it encoded with UTF-8, UTF-16, or UTF-32? (The answer is almost never “UTF-32”)
  • If UTF-16 or UTF-32, which endianness do the code units use?
However, a convention arose regarding that last question, called a “byte order mark.” There is a character, U+FEFF (written in hex), which means “zero width no-break space.” Because this character is whitespace and has zero width, it has no effect on the output of drawing the string. However, if you swap its bytes, you get U+FFFE, which is guaranteed never to be a valid character. The convention is to put U+FEFF as the first code point in your text file. If the text system reads that character correctly, it has no effect on the document drawn, and if the text system reads the character incorrectly, it knows to switch its endianness and try again. This system also works with UTF-32 text, because 0xFFFE0000 is, again, not a valid code point.
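
In practice, this means a text system can sniff the encoding and endianness by looking at the first few bytes of the input. A sketch (the UTF-8 byte order mark, EF BB BF, also exists, even though UTF-8 has no endianness to signal):

    #include <cstddef>
    #include <cstdint>

    enum class DetectedEncoding { Unknown, UTF8, UTF16BE, UTF16LE, UTF32BE, UTF32LE };

    // The UTF-32 checks must come before the UTF-16 ones, because a little-endian
    // UTF-32 BOM (FF FE 00 00) starts with the same two bytes as a little-endian
    // UTF-16 BOM (FF FE).
    DetectedEncoding sniffBOM(const uint8_t* b, size_t length)
    {
        if (length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return DetectedEncoding::UTF32BE;
        if (length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return DetectedEncoding::UTF32LE;
        if (length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return DetectedEncoding::UTF8;          // U+FEFF encoded as UTF-8
        if (length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return DetectedEncoding::UTF16BE;
        if (length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return DetectedEncoding::UTF16LE;
        return DetectedEncoding::Unknown;           // no BOM; fall back to other heuristics
    }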

Internally, WebKit uses UTF-16 strings in machine endianness (though there is a compile-time flag to enable using UTF-8 strings internally). This has one interesting repercussion: in order to keep String::size() fast, that function returns the number of code units in the string, not the number of code points.
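
To make the distinction concrete, here is a small sketch (countCodePoints is a hypothetical helper, not a WebKit function): the emoji U+1F600 is one code point but two UTF-16 code units, so a size() that counts code units reports 2 for a string containing just that emoji.

    #include <cstddef>
    #include <cstdint>

    size_t countCodePoints(const uint16_t* s, size_t codeUnitLength)
    {
        size_t codePoints = 0;
        for (size_t i = 0; i < codeUnitLength; ++i) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                ++i;                       // lead surrogate: skip its trail surrogate
            ++codePoints;
        }
        return codePoints;
    }

    // const uint16_t grin[] = { 0xD83D, 0xDE00 };   // U+1F600 as UTF-16
    // countCodePoints(grin, 2) == 1, even though the code unit length is 2.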