Saturday, July 11, 2015

Knuth-Plass Line Breaking Algorithm

Line breaking, with regards to paragraph layout, isn’t a very flashy topic. Most people don’t even notice the location of line breaks in a paragraph of written text. However, to people who care, line breaking can be done badly.

There are many different aspects of a set of line breaks in some text. The most important principle is that the text of the line should match the width it's given as best as possible. If there is too much or too little space on a line, the line visually looks jarring. We want to avoid rapid changes of extra space on a line, so we don’t have a loose line followed directly by a tight line. We want to hyphenate words if it makes sense, but only do so sparingly and if required. In addition, we never want to hyphenate two adjacent lines. Similarly, we don’t want to hyphenate the second-to-last line of a paragraph, because there is plenty of space on that last line to accept the full word. One last, more subtle, thing we want to do is to avoid starting a word to the right of where the next line ends, because it would look like that word is floating all alone.

All of these things are desirable, but many line breaking implementations only consider a few, or none, of these qualities. The line breaking algorithm used in web browsers is simply to treat each line independently, and place as much text as possible on that line without exceeding the maximum width.

If we only consider the primary principle of filling up the available space, it may seem that our greedy algorithm is optimal. However, choosing a particular line breaking opportunity affects all subsequent lines, so it may turn out that the best choice for a particular line actually has a negative affect on the entire rest of the paragraph. Instead of optimizing each line independently, it would be better if we tried to optimize the whole paragraph.

Donald Knuth, probably the most famous computer scientist in the world, worked with Michael Plass to create an algorithm which does so. The algorithm is actually quite simple, but it does have some tricky parts.

Their algorithm has two parts: a classification stage and an execution stage. The classification stage translates each item in the text into one of three types:
  • A box. This represents a character or word and has a particular width.
  • Glue. This represents a space character, and is characterized by three values:
    • A natural width
    • A stretchability value. Higher values here mean that, when a line is stretched, more of that stretch is allocated to this character
    • A shrinkability value. This is the same thing, but for when a line is shrunk.
  • A penalty item. This represents a hyphenation opportunity within a word. It is characterized by two values (In the text it has three, but the third one isn’t really interesting in this discussion):
    • A width to use if a line break is chosen at this item.
    • A cost associated with choosing a line break at this item

There are actually all sorts of diverse typographical situations which can be represented with particular patterns of the items in this classification; Knuth names quite a few of them. Once we have done this classification, we want to choose a set of line breaks which correspond to glue items or penalty items. In order to choose a good set of line breaks, we have to introduce three new concepts.

The first is the concept of the “adjustment ratio” of a line, given a particular breakpoint. The adjustment ratio is simply the excess space in the line divided by the sum of the stretchability factors of all the glue items in the line. If the line is too long rather than too short, then the calculation is the same thing except with using shrinkability factors instead.

The authors then define the “badness” of a line to be a particular increasing function of the line’s adjustment ratio. The particular shape of this function they chose involves the cube root of the adjustment ratio, and they admit that this function was chosen arbitrarily and “has shown that good results are obtained.”

Then, the authors go further and define the “demerits” of a line. This is an increasing function of badness, but also includes extra information, such as the cost of a penalty item, if one was chosen to end the line. In addition, it includes a term which is positive if this is the second penalty item chosen in a row to end lines with. This function involves squaring the sum of badness and penalty, which the authors admit is arbitrary but “it works well in practice.”

The goal of the execution phase of the algorithm is to minimize the sum of the demerits of each line. The way it does this is actually very straightforward dynamic programming. Scan through the string, and each time you encounter a breaking opportunity, look at prior line breaking opportunities and consider a candidate line between these two opportunities. Compute the demerits of this line candidate. Select the line candidate which has the minimum demerits, and associate it with this particular line breaking opportunity. Continue through the string in this manner until you get to the end. The selection of lines which ends up with the minimum total demerits is the output of the algorithm.

A dynamic programming approach is valid because we are summing the total demerits and optimizing the entire prefix of the paragraph. This optimized prefix grows until it contains the entire text. Given an optimized prefix, if we consider a line starting at the end of that prefix, the particular decision we have made that led up to the beginning of that line don’t matter; the only thing that matters is the sum of the demerits up to that point.

Naively implemented, this algorithm is O(n2). However, there is an optimization which can improve the running time of the algorithm. The fundamental observation is that, given a line candidate’s ending position, there are only a few candidate starting positions which are not terrible. In particular, when we have selected a particular breaking candidate and are looking through prior opportunities, we can limit our search to only the prior opportunities which are not terrible. We can define “not terrible” as “having demerits below a threshold.” If we come across a line candidate that is wildly too long, we can simply forget the breaking opportunity at its beginning in our searches, because there will never be another viable line candidate starting at this opportunity. If our line candidate is wildly too short, we can abort our search, assuming we are searching in the direction of increasing indices in the string (but we can't forget this starting opportunity because it may be valuable in subsequent searches). Therefore, for each ending opportunity in our text, our search will investigate a relatively low (and vaguely constant-ish) amount of line candidates. So, if we choose an appropriate terribleness-threshold, our algorithm is O(n).

It’s worth mentioning what happens when we pick a bad terribleness threshold. If our threshold is too high, our algorithm will take much longer to run (bounding on O(n2)), but it may find a more globally-correct solution. If our threshold is too low, the algorithm will find that it is impossible to find acceptable line breaks all the way to the end of the paragraph. In this case, there is nothing we can do but try again with a higher threshold (or the greedy algorithm, which is guaranteed to find a result).

Aside: Calculating widths of line candidates can be done quickly with the established technique of the subsequence sum algorithm. This algorithm has you construct an array which contains a running sum of the widths of all the prefixes of the string (which can be done in O(n)). Then, the width of any subsequence is simply the difference between the two appropriate elements in this array. (I've seen this algorithm used to implement box blurs on images in linear time)

There is one bit of trickiness, however, with the line breaking algorithm. So far, we have assumed that the only affect an optimal prefix of a paragraph has on subsequent lines is its demerits. However, this is not true in the presence of floats. With floats, our available width can change partway down the page. Also, the location of this change might be affected by the layout of the current line, because floating elements occur in-line with text (meaning: if the floating element gets pushed down to a successive line, the location that the width change also gets pushed down to the successive line). This means that the algorithm is sensitive not only to the demerits created so far, but also the number of lines created and the location of all floating elements so far.

Because of this, we need to remember multiple ways to get to a particular line breaking candidate if those ways differ in either line count or floating element placement (and possibly things like differing line heights due to fallback fonts). The presence of these “duplicate” elements can cause our search space to grow dramatically. Luckily, we can combat this growing search space by setting our terribleness-threshold appropriately.

Overall, we have a few tools at our disposal. The first is how we classify our input text. The algorithm can be adapted to many typographic scenarios simply by inserting extra items in our stream of classified items. Next, we can change the values (like nominal width, or stretchiness factor) which characterize each item in our stream. Third, we can choose different functions for calculating badness and demerits. Finally, we can select a terribleness-threshold which will balance runtime and the likelihood of the algorithm completing. Some of these parameters are probably language-dependent, which makes selecting them even more difficult.

Friday, July 10, 2015

Generating x86_64 Machine Code at Runtime

As I discussed previously[1], there is a particular file format for representing programs on disk. This file format includes a significant amount of machinery for the dynamic loader to resolve symbols at runtime rather than compile-time. However, once your program is already running, it is possible to actually author more machine code, and to run that new code directly.

In order to do this, you have to write machine code into a buffer. I’m using an x86_64 machine, so that’s the format that I’ll be using. x86_64 machine code has a very long history which has significantly affected the instruction format.

The instructions originally came from the Intel 8086, back in 1979. That processor is a 16-bit processor which included 8 16-bit data registers: AX, BX, CX, DX, BP, SP, SI, and DI. The first four of these, AX, BX, CX, and DX, are general purpose registers. BP is the “base pointer” and always points to the base of the current stack frame. Similarly, SP is the “stack pointer” and always points to the top of the current stack frame. The stack grows downward, so SP always has a lower value then BP. SI and DI are usually used for pointers or indexes into memory.

There are also some other special-purpose registers, such as IP for “instruction pointer” which represents where control flow is at, and FLAGS. Instructions such as conditionals will flip bits in FLAGS, which other instructions like conditional jumps will inspect.

The 8086 allowed addressing a maximum of one megabyte (= 2^20) of memory. However, the registers can only hold 16 bits, which means the maximum representable value is 2^16. Therefore, pointers are represented as a combination of two registers (specifically, you shift the contents of one register by 4 bits and add that to the other register). There are 4 “segment registers” that handle this kind of addressing: CS, DS, ES, and SS.

The Intel 386, in 1985, introduced 32-bit wide addressing and registers. This was achieved by putting the processor in a new mode. In this mode, the same registers are available, but they are all 32-bits wide. These wider registers are named with a leading “E,” which stands for “extended.” For example, we have “EAX” and “EBP.” However, the old 16-bit registers are still available, aliased to the low 16-bits of the new wider registers. That means that if you write into AX, then read from EAX, you would get what you had written.

Indeed, this register aliasing was not a new concept; the 8086 already allowed you to access the four data registers by their low and high bytes. “AL” is the low byte of “AX” and “AH” is the high byte of “AX.” It is for this reason that the “X” in the 16-bit registers stands for “eXtended.”

It’s important to realize here that the number of registers you can interact with has stayed at a constant 8. Therefore, you only need three bits to represent a particular register. This is an important quality of the instruction format.

The instructions have a fairly complicated format, described at [2]. I’ll summarize it here:
  1. Instructions start with a sequence of up to 4 prefix bytes
  2. Then comes a 1 - 3 byte opcode. This is simply a number which represents the instruction
  3. The rest of the format is dependent on the arguments specific to each individual opcode. Register operands are encoded in 1 - 2 subsequent bytes, called the ModR/M byte and the SIB byte, respectively
  4. An optional displacement value and an optional immediate value follow
One important thing to note is that, because we only need 3 bits to identify a register, we can identify two registers in the ModR/M byte and have an additional 2 bits left over. These two bits are used to specify that the operand should actually come from memory, at the location pointed to by the register, with an optional displacement (supplied in the last bullet above). This particular encoding is quite versatile, as it means that you can have your instruction operate on memory by simply changing a few bits in its encoding.

Of course, you aren’t ::really:: operating on memory. When you think you are operating on memory, the hardware is actually generating load and store micro-ops and performing them. You just don’t see it.

The 32-bit Intel architecture was the standard desktop computer architecture until 2003, when AMD extended it to 64-bits. Similar to the older extension to 32-bits, the registers got wider, but the low 32-bits of each register alias the old names. The new names replace the leading “E” with a leading “R” so we have names like “RSP” and “RCX. In addition, however, they also added 8 more registers simply named R8 - R15.

This addition of new registers proved problematic for instruction encoding. In particular, we were using 3 bits to identify a register, but now we need 4 bits. Rather than dramatically change the encoding, AMD decided to prefix instructions that need to use these 64-bit registers with a “REX prefix.” This prefix is a byte where the high nibble is a “4,” and the low nibble includes that 4th bit needed to properly identify a register. Because instructions can have multiple operands which may each need to identify a register, each of the bits in the low nibble of the REX Prefix corresponds to a different operand. Therefore, if you’re trying to figure out which register an instruction is naming, you have to concatenate one bit from the REX prefix and three bits from the ModR/M byte.

AMD also decided to make the REX prefix optional, so instructions which only needed to operate with 32-bits and the 32-bit registers don’t need to specify it. You may think that this means that the new 64-bit instructions are completely backwards-compatible with 32-bit code; however, that is incorrect. In 32-bit mode, the REX prefix itself is a valid opcode. Therefore, if the machine is in 32-bit mode or 64-bit mode, it will interpret that byte differently.

That’s pretty much it when it comes to formatting instructions. If you want to generate commands at runtime, it isn’t too difficult to figure out how to output them in the format of x86_64 machine code. However, which instructions should you emit?

Well, the instructions are grouped into categories[2]. General purpose instructions include things like PUSH, MOV, CALL, ADD, SAR (shift arithmetic right), JMP, etc. These are pretty self-explanatory.

There are also a bunch of floating-point instructions which use the x87 floating point unit present in all modern processors. This unit used to be an optional coprocessor which you could add to your computer, but has been bundled in all processors since the 487.

The unit is actually pretty interesting. Instead of individually accessible registers, the unit uses a stack of 8 floating-point registers. Instead of naming registers individually, you push and pop items onto the stack, and math operations will replace the top items on the stack with the result of the operation. Also, these registers are 80-bits wide, allowing for significantly more precision than a double. (I’ll probably write another whole post later just about floating point precision)

Intel’s first foray into SIMD (Single Instruction Multiple Data) is with a collection of MMX instructions, first available with the Pentium. This defined 8 new 64-bit registers, named MM0 - MM7, which are aliased with low 64-bits of the x87 register stack. Instead of representing a single 64-bit value, each register represented a vector of smaller types (2x 32-bit numbers, 4x 16-bit numbers, or 8x 8-bit numbers). MMX instructions would operate on these vectors pairwise. So, if you added two MMX registers together, and chose to interpret those registers each as 2x 32-bit numbers, you would actually get two additions for the price of one, since the first components of the vectors gets added and the second components of the vectors gets added. Unfortunately, MMX only supports integral math (not floating-point math) which makes it not very useful. Also, 64-bits is not a very wide vector, making it even less useful.

Intel’s next attempt of SIMD is the SSE set of instructions, initially released with the Pentium 3, which was extended several times to include SSE2, SSE3, SSSE3, and SSE4. This sets up a new set of 8 128-bit registers (not aliased to anything, this time) named XMM0 - XMM7. Similar to general-purpose registers, 64-bit machines double this to 16 new registers. These instructions operate on floating point data, and as of SSE2, also can operate on integral data, thereby obsoleting MMX. In Haswell and beyond, there are fused-multiply-add instructions, allowing you to perform a multiply and an add in a single instruction.

Similarly, Intel created the AVX and AVX2 instructions which use 8 new registers named YMM0 - YMM7 (yet again, 16 new registers on 64-bit machines). These registers are 256 bits wide, and the low 128 bits alias the XMM registers. AVX was first released in 2011 with Sandy Bridge processors. There is a plan to release AVX-512 with 512-bit-wide registers, but it hasn’t shipped yet.

And that’s pretty much it. There are some other instructions regarding transactional memory and halfwidth float conversion. You can detect which of these instructions the processor accepts by issuing the CPUID instruction, which will return a bit mask of capabilities. But, overall, the amount of instructions really isn't that daunting.

We are now very close to being able to write instructions at runtime. However, we need to be able to pass data between our regular program and our generated code. The way this is normally achieved is with arguments and return values to functions. At the assembly level, these concepts are achieved with simply leaving data in a predefined place, and then jumping into a function. The code in the function will expect the data to be in the predefined place. This means that the caller and the callee have to both agree about what the signature of the function is.

There is a pattern, though, of where particular items are placed. In fact, each operating system will pick a particular “ABI,” or Application Binary Interface, which defines rules about where to place all these items. Everyone on either side of a caller/callee relationship needs to adhere to this. Also, the ABI chosen is different for different operating systems, which is yet another reason why you can't run code compiled for one OS on another OS, even if they are running on the same machine.

The Intel 32-bit ABI[3] is pretty simple. All function arguments simply get dumped onto the stack, one by one. Return values are a little tricky, though. If the return type is something which fits in a register, than it is simply placed in EAX. Otherwise, the caller will actually create space for the callee to fill in, and will place a pointer to this item as a hidden first argument to the function.

The Intel 64-bit ABI[4] is significantly more complicated, because they wanted to pass more items in registers, which would reduce cache pressure. The ABI involves classifying each argument based on its type and position in the argument list, and then placing each item in either a register or on the stack.

There’s one last piece to code generation which requires some thought: linking. Much of the complexity of the Mach-O file format is due to the linker needing to fixup symbols at runtime to point where they should point. However, our codegen is occurring at runtime, after the linker has fixed up all the symbols. This means that, if we want to look up the address for a function, we actually already have the address for it! All we need to do is just write out &function and it will be correctly be called.

All right, so now we’ve pretty much got everything we need. First, we need to allocate some memory which has the execute permission set.

void *ptr = mmap(NULL, length, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0)

Next, we generate instructions, convert them to machine code, and pack them out into that buffer, one by one. When we’re done generating code, we can call it by simply casting the pointer to our buffer to the correct function pointer type (keep in mind the ABI!) and calling it!

((int(*)(int, int))ptr)(4, 17)

We just made a JIT[5].


Wednesday, July 8, 2015

Producing Executables

The way that most people produce an executable is to describe the behavior you want your code to have, then run that description through a compiler, which produces an executable. It’s worth an investigation into the actual executable file itself.

The goal of an executable file is to hold a description of a program. Our computers can execute programs written in a particular language, which is the machine’s particular instruction set. A native executable contains a simple sequence of instructions in the machine’s native language. Because the language that we store is the same language that the machine can consume, we can simply map the bytes of the file into memory, point the machine at it, and say “go.”

Not so fast, though. Software is composable, meaning that your program might want to call functions that exist inside other files. One kind of library, a static library, actually has all its code copied into your executable before the final executable is produced. Dynamic libraries, on the other hand, aren’t copied into your executable, but must instead be present when your program first runs. This is done by the dynamic loader, which is the same thing which maps your code from the file into memory.

But how does the dynamic loader know which libraries to load? Well, your executable must list them somehow. This list is actually kept inside your executable file (rather than in an accompanying file). Which means that your executable file isn’t ::just:: a bunch of instructions, but instead has to have some sort of way to represent a variety of data structures, just one of which is the sequence of machine instructions.

Note that the format for this file isn’t machine-specific, but is instead OS-specific. The physical machine only cares about your instructions; everything else is just input to the dynamic loader, which is just some software that gets run when you want to execute a program. Because the dynamic loader is specific to each operating system, the format of this file is specific to each operating system.

I’ll be discussing the format on Mac OS X, which is the Mach-O file format. The format is described at [1], but I’ll summarize it here. First up is a header, with a magic number and some identifiers of the hardware machine type that this file is specific to. Then, there is a sequence of “load commands,” which really aren’t commands at all. These are simply a way to chunk up the file into a bunch of different pieces.

One such load command is LC_SEGMENT_64, which references the raw bytes which should be mapped into memory. Among other things, this chunk of the file contains a reference to the raw machine code that the instruction pointer will point into. I’ll spend more time discussing this in a bit.

One of the tasks of the loader is to resolve references between libraries. This means that a library must have a way of describing where all its symbols reside, and where all its external references are so the linker can overwrite them. A Mach-O file describes these locations with the LC_SYMTAB and LC_DYSYMTAB, and LC_DYLD_INFO_ONLY load commands.

Here are some more other interesting load commands:
  • LC_LOAD_DYLINKER: The path to the dynamic linker itself
  • LC_UUID: A random number. If you regenerate the file, this will be different, so utilities know that they need to reload their data
  • LC_LOAD_DYLIB : There will be one of these load commands for each dynamic library that this file depends on. It contains the full path of the dependent library, sot he loader can recursively load it when necessary.
  • LC_VERSION_MIN_MACOSX: Specifies the minimum OS X version that the executable should be able to run on
  • LC_SOURCE_VERSION: Version of the source code
  • LC_MAIN: Offset of the entry point
Alright, let’s talk about LC_SEGMENT_64. Each one of these load commands gets mapped into memory, and each one specifies the permissions that it should be mapped with. In executables, there is one of these, called __PAGEZERO, which has no permissions and is mapped at virtual address 0. The fact that this gets mapped with no permissions is why your program crashes if you try to read or write from address 0.

The __TEXT segment is where your raw machine code lives, and is mapped such that you cannot write to it. A segment contains 0 or more sections, and the __TEXT segment contains a variety of sections. (On the other hand, __PAGEZERO contains 0 sections.) Sections exist so you can mark up specific parts of your mapped memory as being used for a particular purpose and specify relocations that must be performed. For example, read-only C string data can be placed inside the __TEXT segment along with your code because both don’t have write privileges, but you can mark that the two sections, __cstring and __text, should be treated differently by utilities.

There is also another segment, __DATA, which gets mapped with read and write privileges, but no execute privileges. This is for any global variables which get statically initialized.

The last segment I’ll mention, __LINKEDIT, maps all the linker-specific data structures in the file. There are no sections; instead, other linker-specific segments will reference data inside this segment.

You can see how all the various data structures the dynamic linker needs are described within the Mach-O format. The format holds machine code itself, but it also holds a bunch of other information such as symbol tables, relocation information, library dependency information, etc.

There’s one last neat feature of the format that I’d like to describe: the fact that the format is extensible. There are lots of different kinds of load commands, segments, and sections. In particular, the linker accepts a flag, -sectcreate, which will create a section at link time in any given segment and populate it with the contents of a file. Therefore, you can say -sectcreate __DATA arbitrarysectname file.dat and the contents of file.dat will automatically be mapped into your process at load time. You can then access the file by marking up a pointer with the following syntax:

extern const uint8_t fooStart __asm("section$start$__DATA$arbitrarysectname");
extern const uint8_t fooEnd __asm("section$end$__DATA$arbitrarysectname");

and those variables, fooStart and fooEnd, will live at the beginning / ending of the newly created section. You can iterate over the data in the section by just looping a pointer from &fooStart to &fooEnd.