Saturday, May 20, 2017

Relationship Between Glyphs and Code Points

Recently, there have been some discussions about various Unicode concepts like surrogate pairs, variation selectors, and combining characters, so I thought I could shed some light on how these pieces all fit together and their relationship with the things we actually see on screen.

tl;dr: The relationship between what you see on the screen and the Unicode string behind it is completely arbitrary.

The biggest piece to understand is the difference between Unicode's specs and the contents of font files. A string is a sequence of code points. Certain code points have certain meanings, which can affect things like the width of the rendered string, caret placement, and editing commands. Once you have a string and you want to render it, you partition it into runs, where each run can be rendered with a single font (and has other properties, like a single direction throughout the run). You then map code points to glyphs one-to-one, and then you run a Turing-complete "shaping" pass over the sequence of glyphs and advances. Once you've got your shaped glyphs and advances, you can finally render them.

Code Points


Alright, so what's a code point? A code point is just a number. There are many specs which describe a mapping of number to meaning. Most of them are language-specific, which makes sense because, in any given document, there will likely only be a single language. For example, in the GBK encoding, character number 33088 (which is 0x8140 in hex) represents the δΈ‚ character in Chinese. Unicode includes another such mapping. In Unicode, this same character number represents the θ…€ character in Chinese. Therefore, the code point number alone is insufficient unless you know what encoding it is in.
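
To make this concrete, here's a minimal JavaScript sketch using the web's TextDecoder API (illustrative only; "gbk" decoder support varies by environment — browsers have it, Node.js needs full ICU):

const bytes = new Uint8Array([0x81, 0x40]);        // the number 0x8140 as two bytes
console.log(new TextDecoder("gbk").decode(bytes)); // "δΈ‚" — 0x8140 interpreted as GBK
console.log(String.fromCodePoint(0x8140));         // "θ…€" — 0x8140 interpreted as Unicode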

Unicode is special because it aims to include characters from every writing system on the planet. Therefore, it is a convenient choice for an internal encoding inside text engines. For example, if you didn't have a single internal encoding, all your editing commands would have to be reimplemented for each supported encoding. For this reason (and potentially some others), it has become the standard encoding for most text systems.

UTF-32


In Unicode, there are 0x110000 (just over a million) possible code points, the highest being U+10FFFF, though most of them haven't been assigned a meaning yet. This means you need 21 bits to represent a code point. One way to do this is to use a 32-bit type and pad it out with zeroes. This is called UTF-32 (which is just a way of mapping a 21-bit number to a sequence of bytes so it can be stored). If you have one of these strings on disk or in memory, you need to know the endianness of each of these 4-byte numbers so that you can properly interpret them. You should already have an out-of-band mechanism to know what encoding the string is in, so this same mechanism is often re-used to describe the endianness of the bytes. (On the Web, this is HTTP headers or the <meta> tag.) There's also a neat hack called the Byte Order Mark, for when you don't have any out-of-band data.
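
As a sketch of what endianness means here, the same code point stored as UTF-32 in both byte orders, using JavaScript's DataView (which lets you pick the byte order explicitly):

const codePoint = 0x1F4A9;           // πŸ’©, a code point above 0xFFFF
const buffer = new ArrayBuffer(4);
const view = new DataView(buffer);
const hex = () => [...new Uint8Array(buffer)].map(b => b.toString(16).padStart(2, "0"));

view.setUint32(0, codePoint, false); // big-endian (UTF-32BE)
console.log(hex());                  // ["00", "01", "f4", "a9"]

view.setUint32(0, codePoint, true);  // little-endian (UTF-32LE)
console.log(hex());                  // ["a9", "f4", "01", "00"]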

UTF-16


Unfortunately, including 11 bits of 0s for every character is kind of wasteful. There is a more efficient encoding, called UTF-16. In this encoding, each code point is encoded as either a single 16-bit number or a pair of 16-bit numbers. For code points which fit into a 16-bit number naturally, the encoding is the identity function. Unfortunately, there are 0x100000 (just over a million) remaining code points which don't fit into a 16-bit number themselves. Because there are 20 bits of entropy in these remaining code points, we can split each one into a pair of 10-bit numbers, and then encode that pair as two successive "code units." Once you've done that, you need a way of knowing, if someone hands you a 16-bit number, whether it's a standalone code point or part of a pair. This is done by reserving two 10-bit ranges inside the character mapping. By saying that code points 0xD800 - 0xDBFF are invalid, and code points 0xDC00 - 0xDFFF are invalid, we can now use these ranges to encode those 20-bit numbers. So, if someone hands you a 16-bit number in one of those ranges, you know you need to read a second 16-bit number, mask the 10 low bits of each, shift them together, and add 0x10000 to get the real code point (otherwise, the number is equal to the code point it represents).
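
Here's a minimal sketch of that math, with hypothetical helper names (modern JavaScript engines expose the same conversion as String.fromCodePoint and codePointAt):

function toSurrogatePair(codePoint) {
  const v = codePoint - 0x10000;     // the remaining 20 bits
  return [0xD800 + (v >> 10),        // high surrogate: top 10 bits
          0xDC00 + (v & 0x3FF)];     // low surrogate: bottom 10 bits
}

function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x1F4A9);           // πŸ’©
console.log(high.toString(16), low.toString(16));       // "d83d" "dca9"
console.log(fromSurrogatePair(high, low).toString(16)); // "1f4a9"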

There are some interesting details here. The first is that the two 10-bit ranges are distinct. It would have been possible to re-use the same 10-bit range for both items in the pair (and use its position in the pair to determine its meaning). However, if an item went missing from a long string of these surrogates, that would cause every code point after the missing one to be misinterpreted. By using distinct ranges, if you come across an unpaired surrogate (like two high surrogates next to each other), most text systems will simply consider the first surrogate alone, treat it like an unsupported character, and resume processing correctly at the next surrogate.

UTF-8


There's also another encoding called UTF-8, which represents code points as sequences of 1, 2, 3, or 4 bytes. Because it uses bytes, endianness is irrelevant. However, the encoding is more complicated, and it can be less efficient for some strings than UTF-16. It does have the nice property, however, that no code point's encoding (other than U+0000 itself) contains a 0 byte, which means it is compatible with C strings.
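
A minimal sketch of the variable lengths, using the web's TextEncoder (which always produces UTF-8):

const utf8 = s => [...new TextEncoder().encode(s)].map(b => b.toString(16).padStart(2, "0"));

console.log(utf8("e"));  // ["65"]                   — 1 byte
console.log(utf8("Γ©"));  // ["c3", "a9"]             — 2 bytes
console.log(utf8("δΈ‚")); // ["e4", "b8", "82"]       — 3 bytes
console.log(utf8("πŸ’©")); // ["f0", "9f", "92", "a9"] — 4 bytes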

"πŸ’©".length === 2


Because UTF-16's encoding is somewhat simple, yet fairly compact, many text systems, including Web browsers, ICU, and Cocoa strings, use it internally. This decision has actually had kind of a profound impact on the web. It is the reason that the "length" attribute of a string holding a single emoji often returns 2: the "length" attribute returns the number of code units in the UTF-16 string, not the number of code points. (If it wanted to return the number of code points, it would require linear time to compute.) The assignment of numbers to "characters" isn't completely arbitrary, and some things we think of as emoji actually have values less than 0x10000, so they fit in a single code unit. This is why some emoji have a length of two but some have a length of one.
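
You can see the difference between code units and code points directly in JavaScript:

console.log("πŸ’©".length);                      // 2 — UTF-16 code units (one surrogate pair)
console.log([..."πŸ’©"].length);                 // 1 — the string iterator walks code points
console.log("πŸ’©".codePointAt(0).toString(16)); // "1f4a9"
console.log("☃".length);                       // 1 — U+2603 fits in a single code unit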

Combining code points


Unicode also includes the concept of combining marks. The idea is that if you want the character "Γ©", you can represent it as the "e" character followed by U+0301 COMBINING ACUTE ACCENT. This is so that every combination of diacritical marks and base characters doesn't have to be encoded in Unicode separately. It's important because, once a code point is assigned a meaning, it can never ever be un-assigned.

To make matters worse, there is also a standalone code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE, which means the same thing. When doing string comparisons, these two strings need to compare as equal. Therefore, string comparisons aren't just raw byte comparisons.
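
In JavaScript, for example, the naive comparison fails; a sketch (the fix appears in the Normalization section below):

const precomposed = "\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed  = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);            // false — different code points
console.log(precomposed.length, decomposed.length); // 1 2 — yet they render identically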

This idea can occur even without these zero-width combining marks. In Korean, adjacent letters in words are grouped up to form blocks. For example, the letters γ…‚ ㅏ γ…‚ join to form the Korean word for "rice": 밥 (read from top left to bottom right). Unicode includes a code point for each letter of the alphabet (γ…‚ is U+3142 HANGUL LETTER PIEUP), as well as a code point for each joined block (밥 is U+BC25). It also includes conjoining letters, so 밥 can be represented as a single code point, but can also be represented by the string:

U+1107 HANGUL CHOSEONG PIEUP
U+1161 HANGUL JUNGSEONG A
U+11B8 HANGUL JONGSEONG PIEUP

This means you can have two strings which are treated identically by the text system, and look visually identical (they literally have the same glyph drawn on screen), but have different lengths in JavaScript.
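
A sketch of that in JavaScript:

const syllable = "\uBC25";             // 밥 as the single precomposed code point
const letters  = "\u1107\u1161\u11B8"; // 밥 as three conjoining letters

console.log(syllable.length, letters.length); // 1 3
console.log(syllable === letters);            // false, despite identical rendering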

Normalization


One way to perform these string comparisons is to use Unicode's notion of "normalization." The idea is that strings which are conceptually equal should normalize to the same sequence of code points. There are a few different normalization algorithms, depending on whether you want the string to be exploded as much as possible into its constituent parts, or combined to be as short as possible, etc.
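
JavaScript exposes this as String.prototype.normalize(); a sketch using the examples above:

console.log("\u00E9".normalize("NFD") === "e\u0301");            // true — fully decomposed
console.log("e\u0301".normalize("NFC") === "\u00E9");            // true — fully composed
console.log("\u1107\u1161\u11B8".normalize("NFC") === "\uBC25"); // true — Hangul composes too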

Fonts


When reading text, people see pictures, not numbers. Or, put another way, computer monitors are not capable of showing you numbers; instead, they can only show pictures. All the picture information for text is contained within fonts. Unicode doesn't describe what information is included in a font file.

When people think of emoji, they usually think of the little color pictures inside our text. These little color pictures come from font files. A font file can do whatever it wants with the string it is tasked with rendering. It can draw emoji without color. It can draw non-emoji with color. The idea of color in a glyph is orthogonal to whether or not a code point is classified as "emoji."

Similarly, a font can include a ligature, which draws multiple code points as a single glyph ("glyph" just means "picture"). A font can also draw a single code point as multiple glyphs (for example, the accent in "Γ©" may be implemented as a separate glyph from the "e"). But it doesn't have to. The choice of which glyphs to use where is totally an implementation detail of the font. The choice of which glyphs include color is totally an implementation detail of the font. Some ligatures get caret positions inside them; others don't.

For example, Arabic is a cursive script, which means that the letters flow together from one to the next. Here are two images of two different fonts (Geeza Pro and Noto Nastaliq Urdu) rendering the same string, where each glyph is painted in a different color. You can see that the two fonts show the string with different numbers of glyphs. Sometimes diacritics are contained within their base glyph, but sometimes not.


Variation Selectors


There are other classes of code points which are invisible and are added after a base code point to modify it. One example is the pair Variation Selector 15 and Variation Selector 16. The problem these try to solve is the fact that some code points may be drawn in either text style (☃︎) or emoji style (☃️). Variation Selector 16 is an invisible code point that means "please draw the base character like an emoji," while Variation Selector 15 means "please draw the base character like text." The platform also has a default representation, which is used when no variation selector is present. Unicode includes a table of which code points should be able to accept these variation selectors (but, like everything Unicode creates, it affects but doesn't dictate implementations).
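
In code, the selectors are just trailing code points; whether the three strings below actually render differently depends on the platform's fonts:

const textStyle  = "\u2603\uFE0E"; // U+2603 SNOWMAN + VS15: "draw me like text"
const emojiStyle = "\u2603\uFE0F"; // U+2603 SNOWMAN + VS16: "draw me like an emoji"
const defaulted  = "\u2603";       // no selector: the platform picks a default style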

These variation selectors are a little special because they are the only combining code points I know of that can interact with the "cmap" table in the font, and can therefore affect font selection. This means that a font can say "I support the snowman code point, but not the emoji style of it." Many text systems have special processing for these variation selectors.

Zero-Width-Joiner Sequences


Rendering on old platforms is also important when Unicode defines new emoji. Some Unicode characters, such as "πŸ‘¨β€πŸ‘©β€πŸ‘§", are a collection of things (people) which can already be represented with other code points. This specific "emoji" is actually the string of code points:

U+1F468 MAN
U+200D ZERO WIDTH JOINER
U+1F469 WOMAN
U+200D ZERO WIDTH JOINER
U+1F467 GIRL

The zero width joiners are necessary for backwards compatibility. If someone had a string somewhere that was just a list of people in a row, the creation of this new "emoji" shouldn't magically join them up into a family. The benefit of using the collection of code points is that older systems showing the new string will show something understandable instead of just an empty square. Fonts often implement these as ligatures. Unicode specifies which sequences should be represented by a single glyph, but, again, it's up to each implementation to actually do that, and implementations vary.
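
A sketch in JavaScript (how it renders depends on font support):

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // MAN ZWJ WOMAN ZWJ GIRL
console.log(family.length);      // 8 — three surrogate pairs plus two joiners
console.log([...family].length); // 5 code points; a supporting font draws 1 glyph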

Caret Positions


Similarly to how Unicode describes sequences of code points which should visually combine into a single thing, Unicode also describes what a "character" is, in the sense of what most people mean when they say "character." Unicode calls this a "grapheme cluster." Part of the ICU library (which implements pieces of Unicode) creates iterators which will give you all the locations where lines can break, where words begin and end (in Chinese this is hard), and where characters' boundaries lie. If you give it the string of "e" followed by U+0301 COMBINING ACUTE ACCENT, it should tell you that these code points are part of the same grapheme cluster. It does this by ingesting data tables which Unicode publishes.
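
Newer JavaScript engines expose this segmentation, backed by the same Unicode data, as Intl.Segmenter; a sketch:

const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
const clusters = [...graphemes.segment("e\u0301x")].map(s => s.segment);
console.log(clusters.length); // 2 — "e" plus its accent form one cluster; "x" is the other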

However, this isn't quite sufficient to know where to put the caret when the user presses the arrow keys, delete key, or forward-delete key (Fn + delete on macOS). Consider the following string in Hindi "ΰ€•ि". This is composed of the following two code points:

U+0915 DEVANAGARI LETTER KA
U+093F DEVANAGARI VOWEL SIGN I

Here, if you select the text or use arrow keys, the entire string is selected as a unit. However, if you place the caret after the string and press delete, only the U+093F is deleted. This is particularly confusing because this vowel sign is actually drawn to the left of the letter, so it isn't even adjacent to the caret when you press delete. (Hindi is a left-to-right script.) If you place the caret just before the string and press the forward-delete key (Fn + delete), both code points get deleted. The user expectations for the results of these kinds of editing commands are somewhat platform-specific, and aren't entirely codified in Unicode currently.
Try it out here:
==> ΰ€•ि <==
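
Below is a rough sketch of the backward-delete behavior, assuming (as many editors do) that delete removes one code point while selection moves by grapheme cluster:

const str = "\u0915\u093F";    // U+0915 KA followed by U+093F VOWEL SIGN I
console.log(str.slice(0, -1)); // "ΰ€•" — dropping the trailing code point leaves bare KA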

Simplified and Traditional Chinese


The Chinese language is many thousands of years old. In the 1950s and 1960s, the Chinese government (PRC) decided that their characters had too many strokes, and simplifying the characters would increase literacy rates. So, they decided to change how about 1/3 of the characters were written. Some of the characters were untouched, some were touched only very slightly, and some were completely changed.

When Unicode started codifying these characters, it had to decide whether or not to give the simplified characters new code points. For the characters which were completely unchanged, it is obvious that they shouldn't get their own code points. For the characters which were entirely changed, it is obvious that they should. But what about the characters which changed only slightly? These were decided on a case-by-case basis, and some of the slightly-changed characters did not receive their own new code points.

This is really problematic for a text engine, because there is a discernible difference between the two forms, and if you show the wrong one, it's wrong. This means that the text engine has to know, out-of-band, which one to show.

Here's an example showing the same code point with two different "lang" tags.
Simplified Chinese:
ι›ͺ
Traditional Chinese:
ι›ͺ

There are a few different mechanisms for this. HTML includes the "lang" attribute, which includes whether or not the language is supposed to be simplified or traditional. This is used during font selection. On macOS and iOS, every Chinese face actually includes two font files: one for Simplified Chinese and one for Traditional Chinese. (For example, PingFang SC and PingFang TC.) Browsers use the language of the element when deciding which of these fonts to use. If the lang tag isn't present or doesn't include the information browsers need, browsers will use the language the machine is configured to use.
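
A minimal HTML sketch; whether the two lines actually look different depends on the fonts the platform supplies:

<p lang="zh-Hans">ι›ͺ</p> <!-- may select a Simplified face, e.g. PingFang SC -->
<p lang="zh-Hant">ι›ͺ</p> <!-- may select a Traditional face, e.g. PingFang TC -->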

Rather than including two separate fonts for every face, another mechanism to implement this is by using font features. This is part of that "shaping" step I mentioned earlier. This shaping step can include a set of key/value pairs provided by the environment. CSS controls this with the font-variant-east-asian property. This works by having the font include glyphs for both kinds of Chinese, and the correct one is selected as part of text layout. This only works, however, with text renderers which support complex shaping and font features.
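
A sketch of the CSS for this route; it only has a visible effect if the font actually carries both sets of glyphs:

.traditional-chinese {
  font-variant-east-asian: traditional; /* ask shaping to pick traditional forms */
}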

I think there's at least one other way to have a single font file be able to draw both simplified and traditional forms, but I can't remember what it is right now.
