Friday, July 10, 2015

Generating x86_64 Machine Code at Runtime

As I discussed previously[1], there is a particular file format for representing programs on disk. This file format includes a significant amount of machinery for the dynamic loader to resolve symbols at runtime rather than compile-time. However, once your program is already running, it is possible to actually author more machine code, and to run that new code directly.

In order to do this, you have to write machine code into a buffer. I’m using an x86_64 machine, so that’s the format that I’ll be using. x86_64 machine code has a very long history which has significantly affected the instruction format.

The instructions originally came from the Intel 8086, back in 1979. That processor is a 16-bit processor which included 8 16-bit data registers: AX, BX, CX, DX, BP, SP, SI, and DI. The first four of these, AX, BX, CX, and DX, are general purpose registers. BP is the “base pointer” and always points to the base of the current stack frame. Similarly, SP is the “stack pointer” and always points to the top of the current stack frame. The stack grows downward, so SP always has a lower value then BP. SI and DI are usually used for pointers or indexes into memory.

There are also some other special-purpose registers, such as IP for “instruction pointer” which represents where control flow is at, and FLAGS. Instructions such as conditionals will flip bits in FLAGS, which other instructions like conditional jumps will inspect.

The 8086 allowed addressing a maximum of one megabyte (= 2^20) of memory. However, the registers can only hold 16 bits, which means the maximum representable value is 2^16. Therefore, pointers are represented as a combination of two registers (specifically, you shift the contents of one register by 4 bits and add that to the other register). There are 4 “segment registers” that handle this kind of addressing: CS, DS, ES, and SS.

The Intel 386, in 1985, introduced 32-bit wide addressing and registers. This was achieved by putting the processor in a new mode. In this mode, the same registers are available, but they are all 32-bits wide. These wider registers are named with a leading “E,” which stands for “extended.” For example, we have “EAX” and “EBP.” However, the old 16-bit registers are still available, aliased to the low 16-bits of the new wider registers. That means that if you write into AX, then read from EAX, you would get what you had written.

Indeed, this register aliasing was not a new concept; the 8086 already allowed you to access the four data registers by their low and high bytes. “AL” is the low byte of “AX” and “AH” is the high byte of “AX.” It is for this reason that the “X” in the 16-bit registers stands for “eXtended.”

It’s important to realize here that the number of registers you can interact with has stayed at a constant 8. Therefore, you only need three bits to represent a particular register. This is an important quality of the instruction format.

The instructions have a fairly complicated format, described at [2]. I’ll summarize it here:
  1. Instructions start with a sequence of up to 4 prefix bytes
  2. Then comes a 1 - 3 byte opcode. This is simply a number which represents the instruction
  3. The rest of the format is dependent on the arguments specific to each individual opcode. Register operands are encoded in 1 - 2 subsequent bytes, called the ModR/M byte and the SIB byte, respectively
  4. An optional displacement value and an optional immediate value follow
One important thing to note is that, because we only need 3 bits to identify a register, we can identify two registers in the ModR/M byte and have an additional 2 bits left over. These two bits are used to specify that the operand should actually come from memory, at the location pointed to by the register, with an optional displacement (supplied in the last bullet above). This particular encoding is quite versatile, as it means that you can have your instruction operate on memory by simply changing a few bits in its encoding.

Of course, you aren’t ::really:: operating on memory. When you think you are operating on memory, the hardware is actually generating load and store micro-ops and performing them. You just don’t see it.

The 32-bit Intel architecture was the standard desktop computer architecture until 2003, when AMD extended it to 64-bits. Similar to the older extension to 32-bits, the registers got wider, but the low 32-bits of each register alias the old names. The new names replace the leading “E” with a leading “R” so we have names like “RSP” and “RCX. In addition, however, they also added 8 more registers simply named R8 - R15.

This addition of new registers proved problematic for instruction encoding. In particular, we were using 3 bits to identify a register, but now we need 4 bits. Rather than dramatically change the encoding, AMD decided to prefix instructions that need to use these 64-bit registers with a “REX prefix.” This prefix is a byte where the high nibble is a “4,” and the low nibble includes that 4th bit needed to properly identify a register. Because instructions can have multiple operands which may each need to identify a register, each of the bits in the low nibble of the REX Prefix corresponds to a different operand. Therefore, if you’re trying to figure out which register an instruction is naming, you have to concatenate one bit from the REX prefix and three bits from the ModR/M byte.

AMD also decided to make the REX prefix optional, so instructions which only needed to operate with 32-bits and the 32-bit registers don’t need to specify it. You may think that this means that the new 64-bit instructions are completely backwards-compatible with 32-bit code; however, that is incorrect. In 32-bit mode, the REX prefix itself is a valid opcode. Therefore, if the machine is in 32-bit mode or 64-bit mode, it will interpret that byte differently.

That’s pretty much it when it comes to formatting instructions. If you want to generate commands at runtime, it isn’t too difficult to figure out how to output them in the format of x86_64 machine code. However, which instructions should you emit?

Well, the instructions are grouped into categories[2]. General purpose instructions include things like PUSH, MOV, CALL, ADD, SAR (shift arithmetic right), JMP, etc. These are pretty self-explanatory.

There are also a bunch of floating-point instructions which use the x87 floating point unit present in all modern processors. This unit used to be an optional coprocessor which you could add to your computer, but has been bundled in all processors since the 487.

The unit is actually pretty interesting. Instead of individually accessible registers, the unit uses a stack of 8 floating-point registers. Instead of naming registers individually, you push and pop items onto the stack, and math operations will replace the top items on the stack with the result of the operation. Also, these registers are 80-bits wide, allowing for significantly more precision than a double. (I’ll probably write another whole post later just about floating point precision)

Intel’s first foray into SIMD (Single Instruction Multiple Data) is with a collection of MMX instructions, first available with the Pentium. This defined 8 new 64-bit registers, named MM0 - MM7, which are aliased with low 64-bits of the x87 register stack. Instead of representing a single 64-bit value, each register represented a vector of smaller types (2x 32-bit numbers, 4x 16-bit numbers, or 8x 8-bit numbers). MMX instructions would operate on these vectors pairwise. So, if you added two MMX registers together, and chose to interpret those registers each as 2x 32-bit numbers, you would actually get two additions for the price of one, since the first components of the vectors gets added and the second components of the vectors gets added. Unfortunately, MMX only supports integral math (not floating-point math) which makes it not very useful. Also, 64-bits is not a very wide vector, making it even less useful.

Intel’s next attempt of SIMD is the SSE set of instructions, initially released with the Pentium 3, which was extended several times to include SSE2, SSE3, SSSE3, and SSE4. This sets up a new set of 8 128-bit registers (not aliased to anything, this time) named XMM0 - XMM7. Similar to general-purpose registers, 64-bit machines double this to 16 new registers. These instructions operate on floating point data, and as of SSE2, also can operate on integral data, thereby obsoleting MMX. In Haswell and beyond, there are fused-multiply-add instructions, allowing you to perform a multiply and an add in a single instruction.

Similarly, Intel created the AVX and AVX2 instructions which use 8 new registers named YMM0 - YMM7 (yet again, 16 new registers on 64-bit machines). These registers are 256 bits wide, and the low 128 bits alias the XMM registers. AVX was first released in 2011 with Sandy Bridge processors. There is a plan to release AVX-512 with 512-bit-wide registers, but it hasn’t shipped yet.

And that’s pretty much it. There are some other instructions regarding transactional memory and halfwidth float conversion. You can detect which of these instructions the processor accepts by issuing the CPUID instruction, which will return a bit mask of capabilities. But, overall, the amount of instructions really isn't that daunting.

We are now very close to being able to write instructions at runtime. However, we need to be able to pass data between our regular program and our generated code. The way this is normally achieved is with arguments and return values to functions. At the assembly level, these concepts are achieved with simply leaving data in a predefined place, and then jumping into a function. The code in the function will expect the data to be in the predefined place. This means that the caller and the callee have to both agree about what the signature of the function is.

There is a pattern, though, of where particular items are placed. In fact, each operating system will pick a particular “ABI,” or Application Binary Interface, which defines rules about where to place all these items. Everyone on either side of a caller/callee relationship needs to adhere to this. Also, the ABI chosen is different for different operating systems, which is yet another reason why you can't run code compiled for one OS on another OS, even if they are running on the same machine.

The Intel 32-bit ABI[3] is pretty simple. All function arguments simply get dumped onto the stack, one by one. Return values are a little tricky, though. If the return type is something which fits in a register, than it is simply placed in EAX. Otherwise, the caller will actually create space for the callee to fill in, and will place a pointer to this item as a hidden first argument to the function.

The Intel 64-bit ABI[4] is significantly more complicated, because they wanted to pass more items in registers, which would reduce cache pressure. The ABI involves classifying each argument based on its type and position in the argument list, and then placing each item in either a register or on the stack.

There’s one last piece to code generation which requires some thought: linking. Much of the complexity of the Mach-O file format is due to the linker needing to fixup symbols at runtime to point where they should point. However, our codegen is occurring at runtime, after the linker has fixed up all the symbols. This means that, if we want to look up the address for a function, we actually already have the address for it! All we need to do is just write out &function and it will be correctly be called.

All right, so now we’ve pretty much got everything we need. First, we need to allocate some memory which has the execute permission set.

void *ptr = mmap(NULL, length, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0)

Next, we generate instructions, convert them to machine code, and pack them out into that buffer, one by one. When we’re done generating code, we can call it by simply casting the pointer to our buffer to the correct function pointer type (keep in mind the ABI!) and calling it!

((int(*)(int, int))ptr)(4, 17)

We just made a JIT[5].

[1] http://litherum.blogspot.com/2015/07/producing-executables.html
[2] http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
[3] http://www.sco.com/developers/devspecs/abi386-4.pdf
[4] http://people.freebsd.org/~obrien/amd64-elf-abi.pdf
[5] https://github.com/litherum/JIT

No comments:

Post a Comment