Wednesday, February 6, 2013

JavaScriptCore Primitive Encoding


The JavaScriptCore interpreter has a primitive type (called JSValue, inside WebKit/Source/JavaScriptCore/runtime/JSCJSValue.h) which can represent the following types in JavaScript:

  • Int32
  • Double
  • Cell (meaning: a pointer to another data structure)
  • Boolean
  • Null, undefined, empty, recently deleted (I'm grouping all these together since they're conceptually similar and have only one valid value each, similar to the "Unit Type")

The size of a JSValue, however, is 64-bits, no matter what type it is holding. It is not a polymorphic type. Instead, it is implemented as a union.

JavaScriptCore runs on 32-bit machines and 64-bit machines. The layout of the type is slightly different for the two architectures. I'll start with 32-bit machines.

The overall bit-layout of a JSValue on a 32 bit machine is as follows:

|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|
|----------------tag----------------|--------------payload--------------|

The format for everything that isn't a double is obvious - assign each type a unique tag, and put it's 32-bit payload in bytes 5-8 (since ints and pointers are both 32 bits). But then, how do we encode a double, since a double alone takes up the whole 64-bits? JavaScriptCore is tricky about this - it assigns the tags starting at 0xFFFFFFFF and counts down. The reason for this is that a double encodes "QNaN" if bits 1-13 are set (counting from the left). Therefore, if any of the tags are set, the bits in the JSValue (when interpreted as a double) would encode a NaN. There are also more NaN values available, so it's possible to encode an ::actual:: double NaN. So, to tell what type a JSValue is, just see if the tag matches, and if it doesn't, the value is a double.

----------------

Now, for a 64-bit machine. The types are all the same size, except that now pointers are 64 bits (ints are still 32 bits even on a 64-bit machine). So, how can we tag the pointer, even though it takes up the whole 64-bits? Well, JavaScriptCore makes an observation that in most pointers, the top two bytes are usually 0 (Yes, it is true that addresses are arbitrary due to virtual memory, but usually machines don't get that high when allocating addresses. To get up that high in the first place would require this process claiming 281 terabytes of memory). We can, then, recognize a pointer as having its top two bytes == 0. So then, how do we distinguish a pointer whose top two bytes are 0 from a double whose top two bytes are 0? Well, we can artificially increment the top two bytes of the double. So, to encode a double, we'd first reinterpret the bits as if they were an int64, then add 0x0001000000000000. To get back the original double, just subtract that value and reinterpret the bits as a double. This won't overflow because the only values that would overflow have the leftmost 16 bits set, but that's a NaN. The same argument can be used to show that the leftmost two bytes won't ever end up being 0xFFFF either.

|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|
|00000000000000000-----------pointer value------------------------------|
| 0x0001 - 0xFFFE ------------double value------------------------------|

Well, what about the other types that a JSValue can represent? Our old tagging mechanism (used on 32-bit processors) won't work because our tag has now been shortened to only only 16 bits, but in order to encode a NaN, the first 13 have to be set. This means that we only have 3 bits to play with, which isn't enough (we might want to add more than 8 extra types). But, never fear! Just like there are invalid double values, there are also invalid pointer values! Namely, pointer values that would map to the first page of memory (which is usually mapped with no permissions, so these addresses are invalid). Therefore, we can use the rightmost few bits as a tag. All the types except for Int don't actually encode much (don't have much entropy), so JavaScriptCore uses combinations of 0x2, 0x4, and 0x8 to create the necessary tags. However, for ints, it actually does use one of the NaN values (using a leftmost 0xFFFF tag) which encodes int32s, so the rightmost 4 bytes can be the payload. This would be faster than putting the int32 in the middle of the 64-bit value (something like making bytes 3-6 specify the payload) because you wouldn't need extra shifting operations to recreate the int.

|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx|
|FFFFFFFFFFFFFFFFF|00000000000000000|-----------integer value-----------|
|00000000000000000|-------------------------------------------------TAG-|

No comments:

Post a Comment