Sunday, November 10, 2013

Complex Text Handling in WebKit, Part 1: Encoding Systems, Code Points, and Code Units

Complex text handling is a deep topic, owing to the many languages and writing systems in the world. This series is meant for programmers who have had little exposure to complex text.

The first concept to mention is the code point. A code point is a generalization of what you are probably familiar with as a “character.” Because computers have no concept of language, words, or characters, a sentence or paragraph is represented as a sequence of numbers which describe the characters that eventually get drawn. The atomic unit of this sequence is called a “code point.” Now, the meaning of each individual code point is described by an encoding system. For example, in the ASCII encoding system, the code point with value 121 means the letter y, and the code point with value 65 means the letter A. http://www.asciitable.com contains a description of each code point in the ASCII encoding system (there are only 128 of them), and which character that code point maps to. Because there are only 128 code points, ASCII is called a 7-bit encoding system. (As an aside: by convention, the 7 bits are not packed together; instead, each code point is mapped to a byte by simply setting the high bit of the byte to 0. This will become important later. Historically, in the days of mainframes and such, this high bit was used for parity, but not any more.)
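To make the idea concrete, here is a minimal sketch (plain C++, not WebKit code) that treats an ASCII string as what it really is: a sequence of small code point values, one per byte, whose meaning comes from the ASCII table.

    // A minimal sketch: an ASCII string is just a sequence of small code point
    // values, one per byte, interpreted via the ASCII table.
    #include <cstdio>

    int main() {
        const unsigned char codePoints[] = { 72, 101, 108, 108, 111 }; // "Hello"
        for (unsigned char codePoint : codePoints)
            std::printf("code point %d -> character '%c'\n", codePoint, codePoint);
        return 0;
    }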

However, there are other encoding systems which might map the code point with value 121 to some other character. This means that, if someone hands you a string, you have to know which encoding system the string uses in order to know the meaning of that string (or, indeed, to draw it). There used to be many competing encoding systems, and the world was very confusing and chaotic. However, the Unicode Consortium was founded to create one gigantic encoding system (http://www.unicode.org/charts/) that would unify all of the various competing encoding systems. This encoding system contains 1,114,112 code points, so, as you can imagine, most (but not all!) characters of most writing systems on earth are contained within it. The Unicode Consortium also designed the code points such that all the ASCII characters have the same values in Unicode, and it has some other nice properties too. Unicode has essentially become the standard for all computerized strings.

That’s all well and good, but because there are so many code points, each Unicode code point needs 21 bits to represent. However, most strings only use a small fraction of all the available code points. Therefore, most Unicode strings have an encoding associated with them. There are three commonly used Unicode encodings: UTF-8, UTF-16, and UTF-32. UTF-32 is the most straightforward (and also the least common): it simply represents each code point as a 32-bit unsigned integer. Because the high 11 bits of each of these integers are always zeroes, this encoding is not very efficient, which is why it is rarely used.
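As a rough illustration of what UTF-32 looks like in memory (plain C++, not WebKit code): one 32-bit integer per code point, with the high 11 bits always zero.

    // A rough sketch: UTF-32 stores one 32-bit integer per code point.
    #include <cstdio>

    int main() {
        // Code points U+0041 ('A'), U+00E9 ('é'), U+20AC ('€'), U+1D11E (musical G clef).
        const char32_t text[] = { 0x41, 0xE9, 0x20AC, 0x1D11E };
        for (char32_t codePoint : text)
            std::printf("U+%04X\n", static_cast<unsigned>(codePoint));
        // Every value fits in 21 bits, so the high 11 bits of each integer are 0.
        return 0;
    }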

UTF-8 and UTF-16, however, use sequences of numbers to encode each code point. In this context, each of these numbers is called a “code unit.” UTF-8 uses 8-bit code units, and UTF-16 uses 16-bit code units. Because these code units have fewer bits than a Unicode code point requires, some Unicode code points are encoded as multiple code units. I won’t go too deep into how exactly these encodings work, but I will mention some key properties of both of them.

In UTF-8, all of the code points that happen to be encoded as a single code unit have the high bit set to 0. This means that there are 128 such code points. These 128 code points are the exact same code points that ASCII encodes. Because ASCII stores each code point in its own byte (with the high bit set to 0), a valid ASCII string is also a valid UTF-8 string. The other code points, however, are encoded such that the number of code units needed to represent the code point is contained within the first byte. For example, if two code units are needed, the first byte is 110xxxxx, and if three code units are needed, the first byte is 1110xxxx. By counting the number of leading 1s in the first code unit, you can determine the number of code units needed to represent the code point. In addition, the subsequent code units all start with the bit prefix “10”. For example, all the code points that require 4 code units to encode use this pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, for a grand total of 21 bits of entropy. It also means that no code unit will be equal to 0 except the one encoding code point 0. Therefore, a null-terminated string encoded with UTF-8 is still null-terminated (with no premature nulls).
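The following is a sketch of the encoding direction just described (plain C++, not WebKit’s actual code); it assumes the code point is already known to be valid, i.e. at most 0x10FFFF and not a surrogate.

    // A minimal sketch of the UTF-8 bit patterns described above.
    // Assumes codePoint <= 0x10FFFF; real code would also reject surrogates.
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> encodeUTF8(uint32_t codePoint) {
        std::vector<uint8_t> units;
        if (codePoint <= 0x7F) {
            units.push_back(codePoint);                          // 0xxxxxxx
        } else if (codePoint <= 0x7FF) {
            units.push_back(0xC0 | (codePoint >> 6));            // 110xxxxx
            units.push_back(0x80 | (codePoint & 0x3F));          // 10xxxxxx
        } else if (codePoint <= 0xFFFF) {
            units.push_back(0xE0 | (codePoint >> 12));           // 1110xxxx
            units.push_back(0x80 | ((codePoint >> 6) & 0x3F));   // 10xxxxxx
            units.push_back(0x80 | (codePoint & 0x3F));          // 10xxxxxx
        } else {
            units.push_back(0xF0 | (codePoint >> 18));           // 11110xxx
            units.push_back(0x80 | ((codePoint >> 12) & 0x3F));  // 10xxxxxx
            units.push_back(0x80 | ((codePoint >> 6) & 0x3F));   // 10xxxxxx
            units.push_back(0x80 | (codePoint & 0x3F));          // 10xxxxxx
        }
        return units;
    }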

In UTF-16, all code points can be encoded with either 1 or 2 code units (because each code unit contains 16 bits). The encoding is designed such that, for a code point that requires two code units, the first code unit will always be in the range 0xD800..0xDBFF (called the “lead surrogates”) and the second code unit will always be in the range 0xDC00..0xDFFF (called the “trail surrogates”). This means that if, while inspecting a string, you come across a lead surrogate, you know that you must read a second code unit to determine the relevant code point. If the second code unit isn’t a trail surrogate, the string is corrupt. All of the code points that only require one code unit are simply encoded directly. This means that it is impossible to represent a code point with value 0xD800 itself (because that value is a lead surrogate). The Unicode Consortium thought of this, and made all code points in the range 0xD800..0xDFFF invalid. Therefore, if you see a code point in this range in a UTF-32 string, you know the string is corrupt.
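Here is a corresponding sketch for UTF-16 (again plain C++, not WebKit code), showing how a code point above 0xFFFF gets split into a lead/trail surrogate pair; it assumes the code point is valid.

    // A sketch of UTF-16 encoding: code points above 0xFFFF are split into a
    // lead/trail surrogate pair; everything else is stored directly.
    // Assumes codePoint is valid (<= 0x10FFFF and not itself a surrogate value).
    #include <cstdint>
    #include <vector>

    std::vector<uint16_t> encodeUTF16(uint32_t codePoint) {
        if (codePoint <= 0xFFFF)
            return { static_cast<uint16_t>(codePoint) };
        uint32_t v = codePoint - 0x10000;                               // 20 bits remain
        uint16_t lead = static_cast<uint16_t>(0xD800 | (v >> 10));      // high 10 bits
        uint16_t trail = static_cast<uint16_t>(0xDC00 | (v & 0x3FF));   // low 10 bits
        return { lead, trail };
    }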

There is one more wrench to throw into our plans: endianness. UTF-16 and UTF-32 use multi-byte code units, which means that they might be represented differently on machines with different byte orders. Therefore, when someone hands you a string, you need to know:
  • Which encoding system is used? (The answer is almost always “Unicode”)
  • If Unicode, is it encoded with UTF-8, UTF-16, or UTF-32? (The answer is almost never “UTF-32”)
  • If UTF-16 or UTF-32, which endianness of the code units is used?
However, a convention arose regarding that last question, called a “byte order mark.” There is a character, U+FEFF (written in hex), which means “zero width no-break space.” Because this character is whitespace and has zero width, it has no effect on the output of drawing the string. However, if you reverse the bytes, you get 0xFFFE, which is guaranteed never to be a valid character. The convention is to put U+FEFF as the first code point in your text file. If the text system reads that character correctly, it has no effect on the document drawn, and if the text system reads the character incorrectly, it knows to switch its endianness and try again. This system also works with UTF-32 text, because the byte-swapped value 0xFFFE0000 is, again, not a valid code point.
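A text system might apply this convention along the lines of the following sketch (the function name is illustrative, not a real API): peek at the first two bytes of the file and compare them against the two possible byte orders of U+FEFF.

    // Illustrative only: detect UTF-16 endianness from a byte order mark.
    #include <cstddef>
    #include <cstdint>

    enum class Endianness { BigEndian, LittleEndian, Unknown };

    Endianness detectUTF16Endianness(const uint8_t* bytes, size_t length) {
        if (length < 2)
            return Endianness::Unknown;
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Endianness::BigEndian;     // BOM read as 0xFEFF
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Endianness::LittleEndian;  // BOM read byte-swapped as 0xFFFE
        return Endianness::Unknown;           // no BOM; caller picks a default
    }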

Internally, WebKit uses UTF-16 machine-endianness strings (though there is a compile-time flag to enable using UTF-8 strings internally). This has one interesting repercussion: in order to keep String::size() fast, the function returns the number of code units in the string, not the number of code points.
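To see why that matters, here is a plain C++ example (not WebKit’s String class) of the same repercussion: a UTF-16 string containing a surrogate pair has more code units than code points.

    // Not WebKit's String class: a plain UTF-16 string literal with a surrogate pair.
    #include <cstddef>
    #include <cstdio>

    int main() {
        const char16_t text[] = u"a\U0001D11E";                  // 'a' + U+1D11E (musical G clef)
        size_t codeUnits = sizeof(text) / sizeof(char16_t) - 1;  // exclude the null terminator
        std::printf("code units: %zu\n", codeUnits);             // prints 3, though there are only 2 code points
        return 0;
    }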
