Unicode has code points for characters in many different alphabets. For example, there is a single code point for ‘n’ (U+006E; Unicode code points are written in hexadecimal, so this corresponds to an ID of 110) and there is a different code point for ñ (U+00F1). However, it is clearly not feasible to assign a code point to every combination of character and accent, especially when you consider that some languages place multiple accents on a single character. Therefore, Unicode has the concept of combining marks. The idea is simple: add a mark to a character by following the character’s code point with the mark’s code point. In the above example, it is possible to write the same ñ as the sequence U+006E U+0303 (because U+0303 is the code point for “combining tilde”). This means that certain conceptual “characters” can be represented by more than one code point sequence. The technical name for one of these conceptual “characters” is a “grapheme cluster.”
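The two spellings of ñ are easy to demonstrate. Here is a quick illustrative sketch in Python (the post is about WebKit, which is C++, but Python makes the point concisely):

```python
# Both of these are the same conceptual character (grapheme cluster),
# but they use different code point sequences.
precomposed = "\u00F1"   # ñ as the single code point U+00F1
decomposed = "n\u0303"   # 'n' (U+006E) followed by U+0303 COMBINING TILDE

print(precomposed, decomposed)     # both render as ñ
print(precomposed == decomposed)   # False: a naive comparison sees
                                   # different code point sequences
print(len(precomposed), len(decomposed))  # 1 vs 2 code points
```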
Think about this for a second. It means that it is possible to represent the exact same string in multiple different ways (the literal exact same string, with the exact same meaning). This means that comparing strings for equality (or sorting a collection of strings) must take this into account. To deal with this, Unicode defines normalization (also called canonicalization): converting a given string into a canonical form, which you can then use for equality or comparison tests. (Note: string comparison is still dependent on locale. See ICU’s collation functions as well as [1].)
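Python’s standard library exposes Unicode normalization directly, so the equality problem and its fix can be shown in a few lines. NFC composes sequences into precomposed code points where possible; NFD decomposes into base character plus combining marks:

```python
import unicodedata

a = "\u00F1"    # precomposed ñ
b = "n\u0303"   # n + combining tilde

# A naive comparison fails even though the strings mean the same thing.
assert a != b

# After normalizing both strings to the same form, comparison works.
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```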
Let’s consider the Arabic script. Arabic is a cursive script, which means that the letters in a word flow together. This means that a letter looks different depending on where it appears within a word. In particular, a letter is written in one of four ways: one if it joins no adjacent letter (isolated), one if it joins only the letter that follows it (initial), one if it joins letters on both sides (medial), and one if it joins only the letter that precedes it (final). Again, this is the literal same character, just written differently. It also means that, when choosing a glyph for a particular character, you have to take the surrounding characters into account. Therefore, a simple code point to glyph mapping will not suffice. Many other scripts have similarly complicated rules.
Here is an example of the four forms for a particular character: (* see bottom for how these characters are shown)
Isolated | ج |
Initial | ج |
Medial | ج |
Final | ج |
All in all, you can start to see how code point to glyph conversion is a nontrivial process. Luckily, on OS X (the port of WebKit I’m most familiar with) there is a library called CoreText which does this conversion for us. In particular, the CoreText API that WebKit uses is of the form “you give me a sequence of code points and a font, and I’ll give you a sequence of glyphs with a location for each glyph.” Once WebKit has that, it can pass it along to the 2D drawing subsystem. It should also be noted that CoreText has higher-level APIs that can handle line wrapping and typesetting, but WebKit can’t use those: they assume that you already know the region where you want the text drawn, whereas the height of a div is determined by however high the text inside it wraps to be. So WebKit only asks for the locations of glyphs laid out in a single row, and does this once for each row that needs to be measured or drawn.
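The shape of that lower-level API can be sketched with a toy model. Everything below is hypothetical and vastly simplified (a real shaper handles ligatures, contextual forms, kerning, bidi, and more; CoreText’s actual interface is a C API built around CTLine and CTRun objects); the point is only the contract: code points and a font go in, positioned glyphs come out.

```python
from dataclasses import dataclass

@dataclass
class PositionedGlyph:
    glyph_id: int  # index into the font's glyph table
    x: float       # horizontal offset from the start of the run

# A toy "font": maps a single code point to (glyph id, advance width).
# This mapping and the widths are made up for illustration.
TOY_FONT = {"h": (1, 10.0), "i": (2, 4.0)}

def shape(text, font):
    """Toy stand-in for shaping: code points in, positioned glyphs out."""
    glyphs, x = [], 0.0
    for ch in text:
        glyph_id, advance = font[ch]
        glyphs.append(PositionedGlyph(glyph_id, x))
        x += advance  # each glyph starts where the previous one ended
    return glyphs

run = shape("hi", TOY_FONT)
# The caller (WebKit, in the analogy) now has glyph ids and positions
# for one row of text, ready to hand to the 2D drawing subsystem.
```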
The problem now becomes figuring out which runs of code points to draw at which locations.
* The different forms of the above character are created by using the Zero Width Joiner code point. This is a code point which has zero size and an empty glyph, but is considered joinable for the purposes of glyph selection. Thanks to Dan Bernstein for this example.
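The four joining contexts from the table can be constructed in logical order with U+200D ZERO WIDTH JOINER. This is illustrative only: whether the contextual glyph is actually selected depends on the font and the shaping engine doing the rendering.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER: zero-width, but joinable
JEEM = "\u062C"  # ARABIC LETTER JEEM (the character in the table above)

# In logical (memory) order; the joiner stands in for a neighboring
# joinable letter on that side.
forms = {
    "isolated": JEEM,              # no joinable neighbors
    "initial":  JEEM + ZWJ,        # joins the following letter
    "medial":   ZWJ + JEEM + ZWJ,  # joins on both sides
    "final":    ZWJ + JEEM,        # joins the preceding letter
}

for name, seq in forms.items():
    print(name, [f"U+{ord(c):04X}" for c in seq])
```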
[1] http://www.w3.org/International/wiki/Case_folding