Saturday, July 11, 2015

Knuth-Plass Line Breaking Algorithm

Line breaking, with regards to paragraph layout, isn’t a very flashy topic. Most people don’t even notice the location of line breaks in a paragraph of written text. However, to people who care, line breaking can be done badly.

There are many different aspects of a set of line breaks in some text. The most important principle is that the text of the line should match the width it's given as best as possible. If there is too much or too little space on a line, the line visually looks jarring. We want to avoid rapid changes of extra space on a line, so we don’t have a loose line followed directly by a tight line. We want to hyphenate words if it makes sense, but only do so sparingly and if required. In addition, we never want to hyphenate two adjacent lines. Similarly, we don’t want to hyphenate the second-to-last line of a paragraph, because there is plenty of space on that last line to accept the full word. One last, more subtle, thing we want to do is to avoid starting a word to the right of where the next line ends, because it would look like that word is floating all alone.

All of these things are desirable, but many line breaking implementations only consider a few, or none, of these qualities. The line breaking algorithm used in web browsers is simply to treat each line independently, and place as much text as possible on that line without exceeding the maximum width.

If we only consider the primary principle of filling up the available space, it may seem that our greedy algorithm is optimal. However, choosing a particular line breaking opportunity affects all subsequent lines, so it may turn out that the best choice for a particular line actually has a negative affect on the entire rest of the paragraph. Instead of optimizing each line independently, it would be better if we tried to optimize the whole paragraph.

Donald Knuth, probably the most famous computer scientist in the world, worked with Michael Plass to create an algorithm which does so. The algorithm is actually quite simple, but it does have some tricky parts.

Their algorithm has two parts: a classification stage and an execution stage. The classification stage translates each item in the text into one of three types:
  • A box. This represents a character or word and has a particular width.
  • Glue. This represents a space character, and is characterized by three values:
    • A natural width
    • A stretchability value. Higher values here mean that, when a line is stretched, more of that stretch is allocated to this character
    • A shrinkability value. This is the same thing, but for when a line is shrunk.
  • A penalty item. This represents a hyphenation opportunity within a word. It is characterized by two values (In the text it has three, but the third one isn’t really interesting in this discussion):
    • A width to use if a line break is chosen at this item.
    • A cost associated with choosing a line break at this item

There are actually all sorts of diverse typographical situations which can be represented with particular patterns of the items in this classification; Knuth names quite a few of them. Once we have done this classification, we want to choose a set of line breaks which correspond to glue items or penalty items. In order to choose a good set of line breaks, we have to introduce three new concepts.

The first is the concept of the “adjustment ratio” of a line, given a particular breakpoint. The adjustment ratio is simply the excess space in the line divided by the sum of the stretchability factors of all the glue items in the line. If the line is too long rather than too short, then the calculation is the same thing except with using shrinkability factors instead.

The authors then define the “badness” of a line to be a particular increasing function of the line’s adjustment ratio. The particular shape of this function they chose involves the cube root of the adjustment ratio, and they admit that this function was chosen arbitrarily and “has shown that good results are obtained.”

Then, the authors go further and define the “demerits” of a line. This is an increasing function of badness, but also includes extra information, such as the cost of a penalty item, if one was chosen to end the line. In addition, it includes a term which is positive if this is the second penalty item chosen in a row to end lines with. This function involves squaring the sum of badness and penalty, which the authors admit is arbitrary but “it works well in practice.”

The goal of the execution phase of the algorithm is to minimize the sum of the demerits of each line. The way it does this is actually very straightforward dynamic programming. Scan through the string, and each time you encounter a breaking opportunity, look at prior line breaking opportunities and consider a candidate line between these two opportunities. Compute the demerits of this line candidate. Select the line candidate which has the minimum demerits, and associate it with this particular line breaking opportunity. Continue through the string in this manner until you get to the end. The selection of lines which ends up with the minimum total demerits is the output of the algorithm.

A dynamic programming approach is valid because we are summing the total demerits and optimizing the entire prefix of the paragraph. This optimized prefix grows until it contains the entire text. Given an optimized prefix, if we consider a line starting at the end of that prefix, the particular decision we have made that led up to the beginning of that line don’t matter; the only thing that matters is the sum of the demerits up to that point.

Naively implemented, this algorithm is O(n2). However, there is an optimization which can improve the running time of the algorithm. The fundamental observation is that, given a line candidate’s ending position, there are only a few candidate starting positions which are not terrible. In particular, when we have selected a particular breaking candidate and are looking through prior opportunities, we can limit our search to only the prior opportunities which are not terrible. We can define “not terrible” as “having demerits below a threshold.” If we come across a line candidate that is wildly too long, we can simply forget the breaking opportunity at its beginning in our searches, because there will never be another viable line candidate starting at this opportunity. If our line candidate is wildly too short, we can abort our search, assuming we are searching in the direction of increasing indices in the string (but we can't forget this starting opportunity because it may be valuable in subsequent searches). Therefore, for each ending opportunity in our text, our search will investigate a relatively low (and vaguely constant-ish) amount of line candidates. So, if we choose an appropriate terribleness-threshold, our algorithm is O(n).

It’s worth mentioning what happens when we pick a bad terribleness threshold. If our threshold is too high, our algorithm will take much longer to run (bounding on O(n2)), but it may find a more globally-correct solution. If our threshold is too low, the algorithm will find that it is impossible to find acceptable line breaks all the way to the end of the paragraph. In this case, there is nothing we can do but try again with a higher threshold (or the greedy algorithm, which is guaranteed to find a result).

Aside: Calculating widths of line candidates can be done quickly with the established technique of the subsequence sum algorithm. This algorithm has you construct an array which contains a running sum of the widths of all the prefixes of the string (which can be done in O(n)). Then, the width of any subsequence is simply the difference between the two appropriate elements in this array. (I've seen this algorithm used to implement box blurs on images in linear time)

There is one bit of trickiness, however, with the line breaking algorithm. So far, we have assumed that the only affect an optimal prefix of a paragraph has on subsequent lines is its demerits. However, this is not true in the presence of floats. With floats, our available width can change partway down the page. Also, the location of this change might be affected by the layout of the current line, because floating elements occur in-line with text (meaning: if the floating element gets pushed down to a successive line, the location that the width change also gets pushed down to the successive line). This means that the algorithm is sensitive not only to the demerits created so far, but also the number of lines created and the location of all floating elements so far.

Because of this, we need to remember multiple ways to get to a particular line breaking candidate if those ways differ in either line count or floating element placement (and possibly things like differing line heights due to fallback fonts). The presence of these “duplicate” elements can cause our search space to grow dramatically. Luckily, we can combat this growing search space by setting our terribleness-threshold appropriately.

Overall, we have a few tools at our disposal. The first is how we classify our input text. The algorithm can be adapted to many typographic scenarios simply by inserting extra items in our stream of classified items. Next, we can change the values (like nominal width, or stretchiness factor) which characterize each item in our stream. Third, we can choose different functions for calculating badness and demerits. Finally, we can select a terribleness-threshold which will balance runtime and the likelihood of the algorithm completing. Some of these parameters are probably language-dependent, which makes selecting them even more difficult.