Friday, February 27, 2015

End-To-End Tour of Text Layout/Rendering

I’ve found that there is a fair amount of confusion regarding how text rendering actually ends up working. In particular, it seems that the design of the system in general often goes unnoticed.

First of all, there is a major division between the layout of text and the rendering of text. These two problems are very different, and are usually solved by different libraries authored by different people. The interface between these two systems is straightforward: A sequence of glyphs, along with a location for each glyph (along with a font).

A font file contains within it information relevant to both the layout and rendering of text. A font contains a collection of glyphs. A glyph contains a sequence of contours that define its shape, as well as metrics for things like the bounds of the glyph. A program that wants to lay out or render text usually refers to a particular glyph with a 16-bit unsigned integer which represents an index into the font. Locations are represented in whatever format the rendering library represents 2D points. A font is usually represented as an opaque handle to an object that is managed by a library.

Here's an overview:
  1. Layout
    1. Paragraph layout
    2. Line layout
      1. Bidirectional processing
    3. Run layout
      1. Character -> glyph mapping
      2. Font fallback
      3. Advanced shaping
  2. Rendering
    1. Hinting
    2. Rasterization
    3. Colorization
    4. Blending / Compositing
Because the problems of layout and rendering are so different, let’s tackle them one at a time. As just described, the output of layout is a collection of glyphs and positions. The input, however, can vary depending on the library you’re using. In particular, some libraries allow for laying out entire paragraphs, while others only allow for laying out lines. Both flavors, however, require a string representing the text you’re trying to work with.

Certainly, paragraph layout will involve laying out lines, once the bounds of the lines are known. The first step to laying out a paragraph is to determine the locations of all the line-breaking opportunities in the text. This is usually done with ICU’s BreakIterator API. Note that this algorithm is language-specific, and is entirely dependent on the source string - no layout is necessary to find these opportunities.

Once you’ve found all the opportunities, you must choose which opportunities you actually act upon to break the lines. A naive choice is simply the furthest most opportunity which fits in the available width (there are more advanced line-breaking algorithms, such as the Knuth-Plass algorithm). Note that choosing where to line break requires you to find the width of substrings of the input, which requires laying out these substrings.

At this point, you have partitioned your input into lines. Before you start laying out the lines, you must perform bidirectional text processing on each line. Note that this bidirectional processing is performed on each line, rather than on the text as a whole. This bidirectional processing is necessary because certain languages (such as English) are defined to be left-to-right languages, while other languages (such as Arabic) are defined to be right-to-left languages. If you have a RTL word inside a LTR line, the characters in the RTL word must progress from right to left, regardless from the surrounding line's direction. Strings are defined to be in logical order, which means that the first character of the RTL word follows the last letter of the previous LTR word. Therefore, you must partition your line into runs, where each run has a direction, which is what the bidi algorithm does. The bidi algorithm has to handle multiple nested levels of flipped directionality phrases, and is therefore nontrivial. It is defined in UAX #9 (standardized by the Unicode Consortium). This algorithm involves looking up attributes of each code point (such as whether its LTR or RTL), which is implemented by ICU. ICU also has an API which implements the entire bidi algorithm (which WebKit can't use because ICU only understands strings, not arbitrary HTML content)

Laying out a single bidi run consists of two passes. The first pass simply performs a one-to-one mapping from code point in the input string to a glyph. This one-to-one mapping is defined in the font file, in the ‘cmap’ table. Note that, at this point, you know whether or not your font supports the particular code points or languages that the input string uses (not all fonts support all code points). If the chosen font doesn't support part of the input string, the library tries to find another, similar, font which does support the input. If no such font is found, the engine will simply use glyph 0, or ".notdef", which is usually drawn as a empty rectangle, colloquially called the "tofu." Note that this glyph fallback can cause strange renderings, where one glyph in the middle of a word might come from a different font and have a different apparent weight or size.

Once you have your glyphs, you can simply line their baselines up vertically and squish them together horizontally. At this point, you run the second pass:

The second pass involves performing context-sensitive updates and replacements to the glyph/location sequences. This pass is called “shaping.” These updates and replacements are defined in the font file. Note that these operations are entirely performed in glyph-space; the original Unicode string can be discarded from this point forward. However, the particular format and tables that define these updates and replacements differ between the various font formats.

There are two main players (Okay, well, three, but I don’t care about Adobe): Apple and Microsoft. In the beginning, TrueType was created by Apple, but then Microsoft forked it to create OpenType. Meanwhile, Apple added features to TrueType, so now there is a shared core that is in both TrueType and OpenType, but both formats contain their own specific extensions. Both Apple’s and Microsoft’s extensions to the font file includes rules for “advanced” layout, which specify the updates and replacements to the glyph/location sequence.

Apple’s advanced font technology, Apple Advanced Typography, uses the ‘morx’ table inside TrueType fonts. This table heavily relies on a concept called “font features.” A font feature is a flag you can pass to the layout engine which affects how the layout engine performs this advanced layout. The user supplies which features they want or do not want, and each feature is mapped to a specific bit inside a 32-bit bitfield (via the ‘feat’ table). Then, this feature vector is mapped to a sub-feature vector by running the bitfield through a sequence of bitwise AND and OR operations, which are defined inside the ‘morx’ table. Then, the rest of the ‘morx’ table consists of a sequence of subtables which define a particular substitution or modification to perform, as well as which bit in the sub-feature vector that their operation is predicated on. These subtables have an associated coverage subtable, which filters the operation to only be performed on particular glyphs. There are subtable formats for the following types of updates:
  • Rearrangement. This doesn’t insert or delete glyphs, but just allows the ordering to be modified.
  • Contextual: Replace a sequence of glyphs with another sequence
  • Ligatures: Replace a sequence of glyphs with a single glyph
  • Noncontextual: Replace a single glyph with a different single glyph
  • Insertion: Insert a glyph into the sequence
Note that each font file must define their own substitutions; there is no reuse between different fonts. There is no knowledge of “language” here; the text engine just follows instructions defined in the file. Also note that these substitutions can be thought of as a finite state machine, which makes authoring this table somewhat similar to programming (and is, therefore, difficult).

AAT also defines a ‘kerx’ table for defining context-sensitive kerning (spacing) between sequences of glyphs.

On the other hand, Microsoft’s advanced font technology, Uniscribe, uses the ‘GPOS’ and ‘GSUB’ tables inside OpenType fonts. The ‘GPOS’ table defines positioning rules for glyphs, and the ‘GSUB’ table defines substitution rules for glyphs. Each font supports a collection of scripts (languages), and each script defines a collection of features. However, as opposed to AAT, these features actually have language semantics. Microsoft manages a registry of the meaning of each of these features. Each feature contains a collection of lookup tables, which contains a particular substitution or modification to perform. These lookup tables have an associated coverage subtable, which filters the operation to only be performed on particular glyphs. There are lookup formats for the following types of updates:
  • (GPOS) Single Adjustment: Move a glyph
  • (GPOS) Pair Adjustment: Adjust kerning
  • (GPOS) Cursive Attachment: Describe how to connect adjacent glyphs (by defining glyphs entry position and exit position)
  • (GPOS) Mark To (Base,Ligature,Mark) Attachment: Put a diacritic on top of a glyph, ligature, or other diacritic
  • (GPOS) Contextual Positioning: Move glyphs based on context
  • (GPOS) Extensions
  • (GSUB) Single Substitution: Substitute one glyph with another
  • (GSUB) Multiple Substitution: Replace a single glyph with more than one glyph
  • (GSUB) Alternate Substitution: Optional glyph substitution for aesthetics
  • (GSUB) Ligature Substitution: Replace more than one glyph with a single glyph
  • (GSUB) Contextual Substitution: Replace a sequence of glyphs with a different sequence
  • (GSUB) Extensions
As you can see, there are many more different types of substitutions or updates that OpenType describes. In particular, these operations have particular semantics associated with them, so the font engine can make more informed decisions based on insight it has into what the font is trying to accomplish with a particular operation. This is similar to the different approach to font features - rather than arbitrary bits in a bitfield identified by a string inside another table (as is the case for AAT), the features in OpenType have semantics, so the font engine has more insight into the designer’s goal.

Alright, we’ve finished layout. We’ve got our sequence of glyphs and their associated positions. It’s time to rasterize these glyphs!

Within the font file, there is a table which describes the shape of each glyph. The specific table, however, is different between TrueType and OpenType. TrueType uses the ‘glyf’ table, which defines the contours of each glyph as a sequence of quadratic bezier curves, laid back to front. TrueType also has the concept of a "complex" glyph, which defines a new glyph by using a collection of other glyphs which are transformed into the same coordinate space. Authors can use this to, for example, design a single diacritic mark, and place it on top of existing letters to make accented letters. Therefore, TrueType glyphs are composable.

OpenType uses a format that Adobe wrote, called the Compact Font Format, which defines the contours of each glyph as a sequence of cubic bezier curves, laid back to front. In fact, OpenType simply embeds the contents of a CFF font file inside its ‘CFF’ table. CFF is, however, a font format by itself, with its own metadata, which means that there is a fair amount of duplication between the CFF data and the surrounding OpenType font. In addition, because CFF was created by Adobe and OpenType was created (mostly) by Microsoft, there are conventions that differ between the two formats. (Also, the encoding format in CFF is drop-dead stupid. Seriously, take a look at it sometime. Maybe the design decisions made sense 30 years ago, but nowadays the encoding format is a huge pain to work with).

Alright, now we’ve got some contours. TrueType specifies a nonzero winding rule, so now we can figure out which regions need to get filled.

But before we do that, there is one more step we need to perform - hinting. TrueType defines hinting in the form of an arbitrary program which the font defines which can move the control points of the contours around. That’s right - this program is defined as a sequence of instructions which are executed by a virtual machine. This virtual machine has an instruction set, and is pretty interesting from a domain-specific language standpoint. There are actually a collection of different programs, including programs which are specific to a particular glyph, a program for the font as a whole, and a program which is run every time the font point size or transformation changes.

For example, there are instructions for setting the “freedom vector” and moving a contour control point in the direction of the freedom vector. There are also instructions for rounding a point to the nearest grid location, along with fairly sophisticated rounding modes. The instruction set also supports functions, looping, and conditionals, as well as a stack where items can be pushed and popped. There’s also a sort-of heap, called the “twilight zone.”

Alright, so now we’ve grid-fitted the contours. At this point, many rasterization engines will perform some proprietary secret-sauce “autohinting” beyond what the font instructions say. The idea is pick attributes for the contours which we think will probably make the font look better. However, it’s important to note that, whatever operations the rasterization library performs at this step, those operations are not informed by the font author, and so are simply guesses at what might look better.

Now, we are finally ready to rasterize our contours. There are a few different algorithms to perform this rasterization. The most common algorithm is performed on the CPU, which loops over every pixel and computes a coverage value. The most straightforward approach of this is to iterate through each scanline one by one, and use the mathematical formulas of the contours to compute intersection points with each of the scanlines. Then, you can sort the intersection points, and as you iterate through a scanline, you can see if you’ve crossed an intersection point. (Side note - every implementation I’ve seen that does this uses 1-2 letter variable names throughout. It is very frustrating.) Once you have the coverage, you can somehow combine this with a source color, and then somehow combine that with the color that is already in the destination, and then write that result into a destination framebuffer.

A different rasterization algorithm is the Loop-Blinn algorithm, which triangulates glyphs and uses shaders on the GPU to render the contours in a scalable way. Another algorithm created by Kokojima first tessellates the glyph, and then uses the GPU’s stencil buffer rather than triangulating the glyph.

Antialiasing is achieved when the combination of coverage and foreground and background colors allows the two colors to be scaled along with the coverage value. Subpixel antialiasing is achieved when coverage geometry is carried through the foreground color computation and is used when blending with the background. Blending the foreground and background colors is usually done using the Porter-Duff blending modes.

Sub pixel antialiasing is usually achieved (in Freetype and Skia, anyway) as a horizontal convolution applied to the coverage bitmap. The weights of the convolution kernel look somewhat Gaussian.

After all that, I want to give an overview of which libraries out there perform which of the above tasks.

  • Freetype is a font rasterization library. It performs the scanline conversion step, but has no concept of color (so doesn’t perform the color blending step). It doesn’t know what shaping is. Its layout facilities are extremely limited, and you shouldn’t use them. The Xft library allows the X11 server to use FreeType on behalf of client applications. Freetype supports subpixel antialiasing. It also has a fairly confusing and opaque API (FT_GlyphSlot what???)
  • Harfbuzz is a shaping library. It only performs the glyph sequence -> glyph sequence mapping. As of some point in the recent past, it only worked for OpenType fonts natively (though I believe it will delegate to CoreText on OS X if it needs to)
  • CoreText is Apple’s text layout engine. It performs everything (including paragraph layout) up to the point where it produces a sequence of glyphs and positions.
  • CoreGraphics, also called Quartz, is Apple’s 2D graphics rasterization library. Apple doesn’t use Freetype to rasterize fonts; instead it uses CoreGraphics. CoreGraphics natively performs font rasterization, but also performs other arbitrary 2D graphics operations. It supports sub pixel antialiasing.
  • FontConfig is a library which is used to discover fonts. It monitors directories which fonts are installed in, and lets users query for arbitrary font attributes. The X11 server can be configured to use this library.
  • Cairo is a 2D drawing library which can perform arbitrary 2D graphics operations. It uses a driver architecture, with various backends. It has a backend for X11 and Xft, which just delegates all its drawing calls to the X11 server. There are also backends for Quartz on OS X, PDF, SVG, and a bunch more, as well as its own software path where it can render into a memory buffer. As far as its font rendering is concerned, it has four backends: One for FreeType, one for Quartz, one for Windows, and one where it natively renders glyph contours using its own path rendering API. Note that if you use this last backend, you won’t get subpixel antialiasing (since Cairo doesn’t support subpixel antialiasing on arbitrary paths). GTK uses Cairo for all its 2D graphics rendering, and Firefox used to use it.
  • Pango is a library which performs all of the above steps, but does so by delegating to other libraries. For font selection and shaping it can delegate to FreeType, FontConfig, and Harfbuzz, and for rendering it can delegate to Cairo or Xft, as well as being able to delegate to CoreText or Uniscribe. It is used by GTK.
  • Qt is an entire UI framework, and it seems to delegate its text rendering to either CoreText, GDI, or FreeType. 
  • Skia is an arbitrary 2D graphics library, similar in API to Cairo or CoreGraphics, used by Android and Chrome. It performs its own glyph rendering, but uses Harfbuzz to perform shaping. It supports subpixel antialiasing and is owned by Google.
  • ICU is a library which has no concept of fonts or glyphs, but does have a deep understanding of codepoints and languages. You can use ICU to determine line breaking opportunities, which language a particular codepoint belongs to, details of a particular codepoint, converting a string between different encodings, understanding strings’ sorting order, localizing dates and times, uppercasing or lowercasing arbitrary glyphs, etc…
  • libiconv is a library which allows converting a string between different encoding formats. It is as old as dirt.
  • Windows has a bunch of deprecated APIs (like Uniscribe and GDI) that perform the above steps poorly. Don't use them.
  • On Windows, the best practices for all the layout steps and font discovery steps are performed by DirectWrite. This library has no concept of drawing, and instead requires a collection of delegate callbacks in order to draw.
  • Those drawing delegate callbacks are implemented by Direct2D, which is currently the best library to perform 2D drawing on Windows. This library is implemented on top of Direct3D, the same API that AAA 3D games use, and is therefore hardware-accelerated.

No sources, because that’s how I roll.

1 comment:

  1. Could you please explain how Skia would interact with CoreText? I'd like to know which one is the last to actually draw on the screen. I thought it would be the operating system via Core Text and Skia just calls native drawing API to access the screen, but since Skia is drawing the characters I don't know what's the role of Core Text or how the operating system intervenes in the final drawing step. From your article I gather that Core Text provides glyphs and positions and Skia draws them? But Skia still has to call native API right? Which ones (on MacOs)? Thank you.