Litherum: Complex Text Handling in WebKit, Part 4: Line Breaking

So, here we are, looking at a web browser. We are interested in laying out and rendering text. Step One: figure out where lines break. Once we have partitioned our string into lines, we can later lay out the internals of each line. We have two libraries at our disposal: CoreText and ICU.

ICU has a variety of functions in it, all aimed at exposing the fundamental rules of language or characters. It knows about each language’s rules for where lines are able to be broken. For example, in English, lines can generally be broken at whitespace. However, in Chinese, lines can be broken between almost any two characters. ICU provides a collection of functions which we can use to find the next breakable boundary in a string. These functions operate on the string itself, not on the glyphs that the string translates into.

CoreText has functions which translate a string of code points into a string of glyphs and a string of positions that each of those glyphs should be drawn at. We can use these positions to determine the effective width of each character. Width measurement itself is quite a complicated algorithm, so I’ll write a post about that piece sometime later. Until then, believe that we have a function (Font::width()) to which the input is a string and the output is the width that that string would appear on the screen.

The basic algorithm should now be fairly clear: come up with a collection of line breaking candidates with ICU, then choose the furthest candidate at which the text so far still fits within the width of the box that we are putting text inside. Easier said than done.
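To make the naive version concrete, here is a minimal sketch of that greedy choice. It is not WebKit code: chooseBreak() and fixedWidth() are names I made up for illustration, the candidate offsets are assumed to come from something like ICU, and the width function is a fake stand-in for Font::width() that pretends every character is 10 units wide.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for Font::width(): pretend every character is 10 units wide.
float fixedWidth(const std::string& text)
{
    return 10.0f * static_cast<float>(text.size());
}

// Returns the offset at which the line that begins at `start` should end.
// `candidates` holds sorted offsets where a break is allowed (assumed to come
// from a break iterator such as ICU's) and should include text.size().
std::size_t chooseBreak(const std::string& text, std::size_t start,
                        const std::vector<std::size_t>& candidates,
                        float availableWidth,
                        const std::function<float(const std::string&)>& width)
{
    std::size_t best = text.size();
    bool foundFit = false;
    for (std::size_t candidate : candidates) {
        if (candidate <= start)
            continue;
        if (width(text.substr(start, candidate - start)) > availableWidth) {
            // This candidate overflows. Use the last one that fit, or, if
            // nothing fit at all, let the first word stick out of the box.
            return foundFit ? best : candidate;
        }
        best = candidate;
        foundFit = true;
    }
    return best;
}
```

Note that this sketch re-measures the line from its start at every candidate and counts trailing spaces toward the width, which is exactly the kind of simplification the real code cannot afford.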

The problem is that any inline element can contribute to the width of a line. For example, if you have <div>lorem <img src="a.jpg"> ipsum</div>, you have to include the <img> when you are performing your width calculations. We also have to handle style attributes such as letter-spacing and word-spacing. We want to collapse whitespace, but not if we’re in a <pre> tag. If a single word is too wide to fit within the box, then we permit the word to extend beyond the width of the enclosing box, but not otherwise. We also don’t want to store a separate, cleaned-up version of the text that we’re rendering, so instead we have to keep a fair amount of state around as we iterate through the string. We also only want to iterate across the entire string a single time, as the string could be arbitrarily long.

All in all, the general structure is in LineBreaker::nextSegmentBreak(). We walk items in document order, handling each in turn as we come to it. The state that we are keeping track of is encapsulated in the BreakingContext object, which has a simple accessor, atEnd(), that tells us whether we have found the end of a line and our DOM traversal can end.

If the node that we are inspecting is a text node, we iterate through each of the characters of the string looking for the next line break candidate. We can’t exit this loop until we have gone past the width of the box, so we have this abstraction of the LineWidth class, which holds committed and uncommitted widths. As we build up a word, we add to the uncommitted width, then when we hit a candidate line break we “commit” the width, moving it to the committed width measurement. When we find that we have gone past the edge of the box, our committed width is the width of the text that we are going to use. The structure of BreakingContext::handleText() should then make some sense: a big loop over each of the characters in the string, an if statement inside the loop to determine if we are at a line breaking candidate, and an if statement with a return inside it to determine if we are past the end of the box. Whenever we hit the end of a word, we update the m_lineBreak Iterator so that when we return we have the furthest candidate.
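The committed/uncommitted bookkeeping is easier to see in a toy version. The sketch below is my own simplification, not the real LineWidth or BreakingContext::handleText(): the class and function names are hypothetical, every character is assumed to have the same width, and the only break candidates are spaces.

```cpp
#include <cstddef>
#include <string>

// Simplified, hypothetical version of the committed/uncommitted width bookkeeping.
class LineWidthSketch {
public:
    explicit LineWidthSketch(float availableWidth) : m_availableWidth(availableWidth) {}
    void addUncommittedWidth(float delta) { m_uncommittedWidth += delta; }
    void commit()
    {
        m_committedWidth += m_uncommittedWidth;
        m_uncommittedWidth = 0;
    }
    bool fitsOnLine() const { return m_committedWidth + m_uncommittedWidth <= m_availableWidth; }
    float committedWidth() const { return m_committedWidth; }

private:
    float m_availableWidth;
    float m_committedWidth { 0 };
    float m_uncommittedWidth { 0 };
};

struct BreakResult {
    std::size_t lineBreak; // offset of the chosen break (plays the role of m_lineBreak)
    float width;           // committed width of the text that stays on the line
};

// One pass over the text: commit at every candidate, return once we overflow.
BreakResult handleTextSketch(const std::string& text, float availableWidth, float charWidth)
{
    LineWidthSketch width(availableWidth);
    std::size_t lineBreak = 0;
    bool haveCandidate = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == ' ') {
            // A word just ended: everything measured so far is now safe to keep.
            width.commit();
            lineBreak = i;
            haveCandidate = true;
        }
        width.addUncommittedWidth(charWidth);
        // Past the edge of the box: fall back to the furthest candidate.
        // With no candidate yet, keep going so the long word overflows.
        if (!width.fitsOnLine() && haveCandidate)
            return { lineBreak, width.committedWidth() };
    }
    width.commit();
    return { text.size(), width.committedWidth() };
}
```

With a 120-unit box and 10-unit characters, "lorem ipsum dolor" breaks at offset 11 with a committed width of 110, because the overflow is only discovered while measuring "dolor" and the uncommitted part is simply thrown away.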

This loop is also where we handle collapsing whitespace. There is a data structure, MidpointState, which keeps a vector of Iterators. The iterators come in pairs: one iterator signifies the beginning of a run of whitespace to skip, and the successive iterator signifies the end of that run. When we are at the second space in a row (Not the first! We don’t want to skip all spaces, just successive ones) we append to this list, and whenever we hit a non-whitespace character after that we append to this list as well. We will see in another post how this data structure gets used. One consequence of this is that, if someone wants to draw a single line of text (which happens in certain places), they will skip the line breaking algorithm, which means they won’t get any whitespace removal. We also cache widths of words in a WordMeasurement array to speed up later phases of layout.
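Here is a rough sketch of the pairing idea, using plain string offsets instead of Iterators. Both functions are hypothetical and are not how MidpointState is actually implemented or consumed; applyMidpoints() in particular exists only to show what the pairs mean, since the real code never builds a cleaned-up copy of the text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Records [skipStart, skipEnd) offset pairs for runs of successive spaces.
// The first space of a run is kept; skipping starts at the second one.
std::vector<std::size_t> collectMidpoints(const std::string& text)
{
    std::vector<std::size_t> midpoints;
    bool previousWasSpace = false;
    bool skipping = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        bool isSpace = text[i] == ' ';
        if (isSpace && previousWasSpace && !skipping) {
            midpoints.push_back(i); // second space in a row: start skipping
            skipping = true;
        } else if (!isSpace && skipping) {
            midpoints.push_back(i); // non-whitespace: stop skipping
            skipping = false;
        }
        previousWasSpace = isSpace;
    }
    if (skipping)
        midpoints.push_back(text.size()); // close a trailing run so pairs stay balanced
    return midpoints;
}

// Illustration only: shows which characters survive once the pairs are applied.
std::string applyMidpoints(const std::string& text, const std::vector<std::size_t>& midpoints)
{
    std::string result;
    std::size_t position = 0;
    for (std::size_t i = 0; i + 1 < midpoints.size(); i += 2) {
        result.append(text, position, midpoints[i] - position);
        position = midpoints[i + 1];
    }
    result.append(text, position, std::string::npos);
    return result;
}
```

The nice property is that the original string is never modified; the list of pairs is enough for a later phase to know which ranges to ignore.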

One interesting bit is that ICU’s line breaking functions are context-sensitive, meaning that we need to keep track of a couple of characters before each string that we give to ICU. However, we are operating on a DOM tree where the previous context might come from another DOM node and not be available. Therefore, we keep our LazyLineBreakIterator (which wraps ICU) up to date with the current characters that we are processing. If we are traversing the DOM and come across some replaced content (such as an image), we represent that as U+FFFC (OBJECT REPLACEMENT CHARACTER).
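As a sketch of what "keeping the context up to date" could look like, here is a tiny class that remembers the last two UTF-16 code units it has seen across node boundaries. The class and its method names are invented for this example and are not the LazyLineBreakIterator interface; the only hard fact in it is that U+FFFC is the Unicode OBJECT REPLACEMENT CHARACTER.

```cpp
#include <array>
#include <string>

constexpr char16_t objectReplacementCharacter = 0xFFFC;

// Hypothetical helper: carries a two-character prior context between DOM nodes.
class PriorContextSketch {
public:
    void update(char16_t last)
    {
        m_context[0] = m_context[1];
        m_context[1] = last;
    }

    // Called after a text node has been processed.
    void consumeText(const std::u16string& text)
    {
        for (char16_t c : text)
            update(c);
    }

    // Called when the traversal passes replaced content such as an image.
    void consumeReplacedElement() { update(objectReplacementCharacter); }

    char16_t lastCharacter() const { return m_context[1]; }
    char16_t secondToLastCharacter() const { return m_context[0]; }

    // The string a context-sensitive break iterator would see for the next node:
    // the remembered characters followed by the new text.
    std::u16string withContext(const std::u16string& text) const
    {
        std::u16string result;
        for (char16_t c : m_context) {
            if (c)
                result.push_back(c);
        }
        result += text;
        return result;
    }

private:
    std::array<char16_t, 2> m_context { { 0, 0 } };
};
```

For the lorem/image/ipsum example, after consuming "lorem " and the image, the text " ipsum" would be presented to the break iterator with a space and U+FFFC in front of it.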

An interesting optimization is in the nextBreakablePosition() function, which uses ICU to determine where the next breakable position in a string is. However, ICU is known to be quite slow, so we try to avoid calling into it as much as possible. For example, if the current character is a space, we don’t have to call into ICU to tell us that we can break there. We can also make similar rules for most of the ASCII code points and simply store that information in a table (named asciiLineBreakTable). Only if the code point is something outside of this table do we have to consult with ICU. The actual consultation of ICU is behind a cache of TextBreakIterators inside a LazyLineBreakIterator.
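The shape of that fast path looks roughly like the sketch below. This is a deliberately tiny approximation: the table here is my own and only knows about space, tab, and newline, whereas the real asciiLineBreakTable encodes far more rules, and the slow path is a caller-supplied callback standing in for ICU rather than an actual ICU call.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical, much-reduced break table: true means "a line may break at this
// ASCII character". Only whitespace is marked in this sketch.
std::array<bool, 128> makeAsciiBreakTable()
{
    std::array<bool, 128> table {};
    table[' '] = true;
    table['\t'] = true;
    table['\n'] = true;
    return table;
}

// Stand-in for the expensive lookup (ICU in the real code): given the text and
// the offset of the first non-ASCII character, returns the next breakable offset.
using SlowBreakLookup = std::function<std::size_t(const std::u32string&, std::size_t)>;

std::size_t nextBreakablePositionSketch(const std::u32string& text, std::size_t start,
                                        const SlowBreakLookup& slowLookup)
{
    static const std::array<bool, 128> asciiTable = makeAsciiBreakTable();
    for (std::size_t i = start; i < text.size(); ++i) {
        char32_t c = text[i];
        if (c >= 128)
            return slowLookup(text, i); // outside the table: ask the slow path
        if (asciiTable[c])
            return i;                   // answered from the table, no slow call
    }
    return text.size();
}
```

For purely ASCII text such as "hello world", the callback is never invoked at all, which is the whole point of the optimization.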

This algorithm (LineBreaker::nextLineBreak()) has a variety of outputs, but the most important is just an Iterator which signifies where the end of the line is. Once we have this iterator, we can lay out all the items in the current line. Control returns to RenderBlockFlow::layoutRunsAndFloatsInRange(), and we move onward to constructing BiDi runs (constructBidiRunsForSegment()), passing in our end-of-line iterator.

Litherum

Tuesday, April 8, 2014
