Saturday, December 19, 2015

Antialiasing of Core Graphics and Direct2D

Sunday, December 13, 2015

Methods of writing Windows apps


In the beginning, there was just Win32. This is a C-based API which has been around since Windows 95. In this API, UI widgets (such as buttons) were known as “Windows” (represented in code as HWND) and windows could have windows inside of them. There was no concept of automatic layout; instead, each window had to be explicitly placed and sized by the programmer.

When you have a window, you can create a “device context” which is associated with that window. This device context is what we now would call a 2D drawing context, and it has API functions which let you draw lines and circles and text and that kind of stuff. All these drawing operations were performed on the CPU by a library named “GDI.”

Interactivity was fairly primitive. Applications would create their own runloop. Each turn of the runloop calls GetMessage(), which corresponds with an event that occurred. However, because some messages need to be handled by the system, the runloop just turns around and delivers the event back to the system. Before doing this, you register a callback function which the system will invoke when it wants to. This is where you are told things like “you should paint your contents now” or “the user pressed a key on the keyboard.” Because each message has an enum type, this callback function will switch over the type, and react to all the messages it understands.

This API is 20 years old, and should not be used for new projects. However, it is still supported today on current Windows operating systems.

Microsoft Foundation Classes (MFC) was the next suggested method of writing Windows apps. This is a C++ wrapper around the Win32 API; it was a new way of doing what people were already doing. Similar to Win32, it shouldn't be used anymore.


.NET represents a new approach to Windows desktop software development. There is a new language, a new object model, a new ABI, garbage collection, etc. There are two ways that .NET applications could be used to make graphical applications: WinForms and WPF. WinForms simply internally uses the Win32 API, and therefore isn’t very interesting.

WPF, however, represents a clean break from Win32 GUI programming. This API doesn’t use GDI for drawing, doesn’t use HWNDs, and doesn't require manual layout. Instead, development starts with a declarative language called XAML. This is an XML-based language where regions of the screen are represented as elements in XAML. Each element has an associated .NET class (in the System.Windows namespace) which allows you to do everything in code that the XAML file does. These elements aren’t HWNDs; instead, they are their own unrelated type.

Because HWNDs aren’t involved, drawing isn’t done with GDI. Instead, it is done entire on top of Direct3D, which means that the drawing routines are hardware accelerated. It also means that WPF controls look different from their GDI counterparts; a WPF app is recognizable.


WinRT is the suggested method of creating a GUI application. WinRT’s GUI model is set up similarly to WPF, except that it has an entirely new set of elements (which reside in the Windows.UI namespace). It still employs XAML for representing elements in a declarative form.

For drawing a specific control, WinRT uses Direct2D, a hardware-accelerated drawing library. This library is conceptually on top of Direct3D, but it is also available for arbitrary developer use.

Once an individual element is drawn, multiple elements are composited together using Windows Composition. This library allows for creating a retained-mode hardware-accelerated tree of visual objects. This API is internally used to composite all the controls in the window. It even has a public API so you can interact with the composition tree directly.

As opposed to Win32, WinRT has an application model. You don’t have to write your own event loop with WinRT. Applications are implemented as a subclass of Windows::ApplicationModel::Core::IFrameworkView. This class has 5 functions which the application overrides in order to react to lifecycle events. The event loop is implemented in CoreApplication::Run, which takes an IFrameworkViewSource, which is a factory for IFrameworkViews.

Event handling is performed explicitly. Class definitions in C++/CX have properties, methods, and events. If you want to react to an event, you simply construct an Windows::Foundation::EventHandler object, and use the += operator to attach it to the object’s event field. The EventHandler object is a wrapper around a function pointer or a lambda. As an example, UIElement has a “PointerMoved” event.

Layout is achieved by encapsulating different layout algorithms inside different controls. For example, there is a Windows::UI::XAML::Controls::Grid control, which arranges its contents in a grid. Different controls arrange their contents according to their own internal algorithms. The most basic example is the Canvas class, which allows the programmer to explicitly lay out its child controls (so you’re not limited by the classes that come with the library).

Inside these callbacks, you can modify any part of the XAML tree you want in order to change the rendering of your app. From code, the way you identify elements in the XAML tree is pretty interesting. Each XAML file is represented in native code by a class which extends Windows::UI::Xaml::Controls::Page. When you build your project, the XAML file is used to create an auto generated header. Nodes in the XAML file can have a “name” attribute, which represents the name that native code would refer to the node as. The auto generated header contains each of these variables of the right type and name. Then, the auto generated header’s content is mixed into your Page subclass by using the support for “partial” classes. This allows the class declaration to be defined in two files; the conceptual class is the union of the two classes. Therefore, these auto generated variables automatically become class members, and are therefore in scope for all the methods of the class. In the end, this means that you don’t have to do any typing; you just immediately have access to your nodes by name.

Once you have access to the elements, you can change any of their properties, run any of their methods, and listen for any of their events.

Saturday, December 12, 2015

Binary Compatibility of Windows Programs


Around 1993, there was an issue around binary compatibility of object-oriented programs. The problem is that callers and callees must agree on the internal layout of structs in memory. This means, however, that if a callee is updated (as libraries often are) and changes this memory layout, the program will not behave as expected.

Microsoft came up with a solution to this problem by creating an “object model” (named “COM”) which objects that traverse library boundaries should adhere to. The object model has a few rules:
  • Objects must not have any data that is exposed. Therefore, the only thing publicly available from an object is functions to call on the object.
  • The object’s functions are represented in memory as an array of pointers to functions. The order of this array matches the order that the functions are declared in source code.
Therefore, given these requirements, everybody can agree on how to interact with objects; all interactions with the objects must go through the object’s functions, and everyone agrees on where the functions exist.

You may recognize this design as an “interface.” The function array is a vtable. Therefore, mapping this idea to C and C++ is very natural. Because the only things that exist in the objects is a vtable, there are no problems with padding or alignment of data members or anything like that. Inheritance is implemented by simply adding more function pointers on to the end of the array.

This memory layout also means that many languages can interact with these COM objects.

There are two more interesting pieces to this object model: there are 3 functions that every object must implement. This can be thought of as the root of the inheritance hierarchy. In C++, this base class is known as IUnknown. The three functions are:
  • AddRef()
  • QueryInterface()
  • Release()
AddRef() and Release() are for reference counting; every COM object is reference counted using these functions.

QueryInterface() is pretty interesting. It is a way of converting one object into another type. The idea is that all COM types have an associated ID number. When you call QueryInterface(), you pass in an identifier of a type that you want. If the object knows how to represent itself as that type, it returns an interface of the correct type. Therefore, it’s sort of like a dynamic_cast, except that the source and destination types don’t have to be related at all. This is actually a really powerful concept, because it provides a mechanism for interoperability between APIs. A new library can be released, and its objects might want to support being used as the objects in another library; this can naturally be achieved by supporting QueryInterface().

In every COM library I’ve used, functions use out params instead of return values. Instead, they return an HRESULT error code. Therefore, it’s common to see every call to a COM object wrapped in a block which checks the return code.


Around 10 years later, Java was becoming popular. One of Java’s goals was to allow programs to run on any machine without requiring recompilation. It achieved this goal by compiling to a “virtual machine” bytecode. Then, at runtime, an interpreter would run the bytecode.

This approach means that all memory accesses and function calls operate through an interpreter, so the interpreter can make sure that nothing bad happens (for example: a program can never segfault). It also means that the virtual machine is free to implement any object model it wants; all object accesses pass through it, so it is free to, for example, move objects around in memory whenever it wants. This naturally leads to a garbage-collected language, where the garbage collector is implemented in the virtual machine.

(Now, I’d like to make an aside about this “managed code” business. It isn’t really true that all Java programs are interpreted; the virtual machine will JIT the program and point the program counter into the JITted code. Whenever the interpreter needs to do some work, it just makes the generated bytecode call back into its own code. Therefore, nothing is being done here that couldn’t be done with a simple software library. Indeed; this managed “runtime” is the same fundamental idea as the raw C “runtime”; it’s just that many operations which would be raw instructions in C are instead implemented as function calls into the runtime. If you are diligent about all your object accesses, you can make your C++ program garbage collected. “Managed code” doesn’t really mean anything mechanical; instead, it is a way of thinking about the runtime model of your program.)

Microsoft wanted to be able to use this model of “managed code” in their own apps, so they invented a system quite similar to Java’s. You would compile your apps to a “CLI” form, which is a platform-agnostic spec describing a virtual machine. Then, the Microsoft CLR interprets and runs your program. A new language (C#) was invented which nicely maps all its concepts to CLI concepts.

There are a couple extra benefits of having a well-specified intermediary form. One is that there can be multiple languages which compile to your CLI. Indeed, Microsoft has many languages which can be compiled to the CLI, such as F# and VB.NET. Another benefit is that the standard library can exist in the runtime, which means that all of those languages can call into the standard library for free. Each language doesn’t need its own standard library.

A pretty interesting benefit of this virtual machine is that it solves the binary compatibility problem. The fundamental problem with binary compatibility is that one library might reach inside an object vended by another library; with a virtual machine, that “reaching” must go through the runtime. The runtime can then make sure that everything will work correctly. So, we now have another solution for the binary compatibility problem.


Remember how I mentioned earlier how that this concept of “runtime” is really just a library of functions that you should call when you want to do things (like call a function or access a property of an object)? Well, C++ supports linking with libraries… there is no reason why one shouldn’t be able to call into these runtime libraries from a C++ program.

The difficulty, however, is being diligent with your accesses. Every time you do anything with one of these “managed” objects, you must call through the runtime. This is a classic case of where the language can help you; you could imagine augmenting C++ to be able to naturally represent the concept of a managed object, and enforcing all interactions with such objects to be done through the runtime.

This is exactly what C++/CLI is. It is a superset of the C++ language, and it understands what managed objects are. These managed objects are fundamentally different from C++ objects; they are represented using different syntax. For example, when you create a managed object, you use the keyword “gcnew” instead of “new”, in order to remind you that these objects are garbage collected. A reference to a managed object uses the ^ character instead of the * character. Creating a reference to a managed object is done with the % character instead of the & character. These new types are a supplement to the language; all the regular C++ stuff still exists. This way, the compiler forces you to interact with the CLI runtime in the correct way.

Windows Runtime

The currently accepted way to write Windows programs is using the Windows Runtime (WinRT). You may think that WinRT is a runtime just like the .NET runtime discussed above; however, this isn’t quite right. WinRT objects are integrated with the .NET runtime, and are considered “managed” in the same way that .NET objects are considered “managed.” However, they aren’t binary compatible with .NET objects; you can’t interact with WinRT objects using the same functions you use to interact with .NET objects.

Therefore, C++/CLI doesn’t work for WinRT objects. Instead, a similarly-looking language, C++/CX, was invented to interact with these WinRT objects. It keeps some of the syntax from C++/CLI (such as ^ and %) but replaces others (such as “ref new” instead of “gcnew”).

Tuesday, October 20, 2015

Vertical Text

To start off, latin scripts are generally not written vertically. For example, let’s say you’re writing on the spine of a book. Usually, you wouldn’t stack the letters on top of each other; instead you would lay the book down horizontally and then write on the spine. In this situation, you’re simply writing horizontally on a canvas which has been rotated.

However, many East Asian scripts are written vertically. In this case, the characters are stacked on top of each other. Laying out text like this is a little different than laying it out horizontally.

You’re probably familiar with the concept of each character having an advance, which represents the displacement from one character to the next. Well, this displacement vector is different if you’re stacking letters on top of each other. In particular, there are two different kinds of advances: horizontal advances and vertical advances. With CoreText, CTFontGetAdvancesForGlyphs() takes a CTFontOrientation parameter, where you can specify if you want horizontal advances or vertical advances.

In general, the pen starts out horizontally centered above the letter, and the vertical advance takes you to a place that is horizontally centered below the letter.

This is actually the same concept as a “baseline”, where pen keeps returning to the same line after each character. The difference is that this baseline is in the middle of the letters, rather than underneath (some of) them. This is known as an “ideographic baseline,” which is distinct from the “alphabetic baseline” we are used to. A font file can define where these (and more) baselines should occur in the ‘bsln’ or ‘BASE’ table.

However, when you tell a glyph to draw at a particular point, that local origin isn’t at these pen locations.

Therefore, if you are going to draw a glyph, you need some way of going from this pen location to the glyph’s local origin. With CoreText, you can do this with the API function CTFontGetVerticalTranslationsForGlyphs().

So that should be enough to lay out a line of pure text, but in Web content, you may have stacked letters in the same line as non stacked letters (for example: an English word in a Chinese sentence). Laying out these two situations correctly is possible because of the relationship between the alphabetic baseline and the ideographic baseline.

Keep in mind baselines progress in the direction of the line, not the characters. The ideographic baseline and the alphabetic baseline are parallel to each other. The font's ascent and descent tell you how much space the letters take up orthogonally to the baseline. Therefore, in vertical text, the ascent measures to the right of the baseline and the descent measures to the left of the baseline, even though the letters are stacked.

Monday, October 19, 2015

A Deeper Look at OpenType and TrueType Font Features

Font Features are a way to provide arbitrary changes to a sequence of glyphs. Text layout happens in a couple steps:
  1. Codepoints get mapped, one-to-one, to glyphs. This turns your string into a sequence of glyph IDs. Each of these glyph IDs is associated with an “advance” which is a 2D vector from the origin of each glyph to the next.
  2. Font features run arbitrary modifications on the sequence of glyphs and advances
  3. Each glyph is rendered at its respective location.
The two popular font file formats, TrueType and OpenType, have many things in common. They both share the same notion of how to map code points to glyphs. They both share the notion that a glyph is represented a a sequence of Bezier paths laid tail-to-front. However, the two file formats model font features in very different ways.


OpenType’s model is simpler, so let’s start there. OpenType encapsulates font features into two tables, GPOS and GSUB. GPOS is for glyph positioning, and GSUB is for glyph substitution. Both of these tables actually share the same overall structure. The shared structure is as follows:
  • The table is represented by a collection of “scripts.”
  • A “script” is represented by a collection of “language systems.”
  • A “language system” is represented by a collection of “features.”
  • A “feature” is represented by a collection of “lookups.”
  • A “lookup” is represented by a collection of typed tables.
The only difference between the GPOS and GSUB tables is the last step - these typed tables. For positioning, these typed tables describe things like “adjust the positions of the following glyphs by the following amounts” or “move the following pairs of glyphs closer together.” For substitution, the typed tables describe things like “replace the following glyphs with these other glyphs” or “match this pattern of glyphs and replace it with this other pattern of glyphs.” The form of these tables is specific to the positioning or substitution you are trying to achieve.

Each of the typed tables has an associated “coverage” table. This simply holds a set of glyphs that the typed table acts upon. When the software is laying out some glyphs, if none of the glyphs appear in the coverage table, the software can safely ignore that typed table.

Everything else is shared between GPOS and GSUB, and is just an aggregation of the level below it. Each “feature” has a four-letter code associated with it, such as “liga” or “smcp.” These codes are reported in the program UI, and the user is allowed to switch on or off each of these codes independently. If the user switched off a particular code, the stuff inside that feature’s “lookup” tables is not performed. The lookup tables are performed one by one, in the order the file describes.

Scripts and language systems each have their own four-character tags associated with them. The reason for the distinction is that many different languages may share the same script. For example, both English and Spanish share the Latin script. Because the languages and scripts are expected to have very different glyph patterns, separating out the features into these buckets seems like a good idea. Also, there is a default script tag and a default language system. Microsoft keeps a registry of all of the scripts and language systems, as well as their associated 4-character tags.


TrueType font features are completely different. In particular, there are many tables, each with a specialized purpose. For example, instead of the general “GPOS” table, there is a much more specific “kern” (or “kerx”) table, which is only designed for kerning glyphs. Similarly, there is a separate, more-specific, “opbd” table for the optical bounds of the font. Many of the OpenType features have corresponding specific-purpose tables inside TrueType. These tables do not provide the same flexibility as the GPOS table.

However, unlike most of the TrueType tables, the “morx” table is quite general. Conceptually, it describes the same sort of things that the OpenType GSUB table describes. However, its model of font features is very different.

The “morx” table doesn’t mess around with any of that “script” and “language system” stuff. Instead, the table is represented by a collection of “chains.” Each chain is comprised of a collection of typed subtables. These typed subtables describe things like “Rearrange this sequence of glyphs according to these predefined rules” or “insert a specific glyph into the middle of this particular glyph sequence.”

The rest of the chain is devoted to describing which typed subtables should be performed.
Similar to before, each chain gets a coverage table. However, in addition, there is a pretty interesting mechanism to determine which features cause which typed subtable to be performed.

Inside the chain, a sequence of bitwise operations is described, which results in a particular bitfield. Each typed subtable in the chain also has an associated bitfield. If the bitwise and of those two values is nonzero, then the typed subtable is applied.

That sequence of bitwise operations is where individual features come into play. Instead of representing a particular feature as a four-character tag, like in OpenType, TrueType represents a feature as a tuple of (feature type, feature selector). The font describes a sequence of features and corresponding bit masks to be ANDed / ORed. When a feature is enabled in the application’s UI, its associated bitwise operations are performed. When the feature is disabled, its associated bitwise operations are not performed.

The software starts with a known bit pattern. It then runs through the sequence. For each item in the sequence, if its associated font feature is enabled, it performs the appropriate bitwise operations. When it’s done, the result is a particular bit pattern value.

Each of the typed subtables also has an associated pattern value. If the bitwise AND of the calculated bit pattern and a subtable’s bit pattern is nonzero, then the table is performed.

Exclusive and Default Features

In OpenType, the font file has no way of expressing that certain features conflict or that certain features should be on by default. Each feature is simply represented by a four-character tag, and it is either enabled or disabled. The software may (or may not) decide to enable certain features by default based entirely on the semantics of the feature itself (not based on the font file). For example, the OpenType feature “rlig” is mean for “required ligatures” and is generally enabled by default.

The representation of a font feature in TrueType conveys more information than an OpenType font feature. In TrueType, a feature is represented by a tuple of a “type” and a “selector.” For a particular “type,” there will be a variety of different selectors which are relevant. Many selectors are only relevant to a particular type. An example would be “type”: “kLigaturesType” and “selector”: “kHistoricalLigaturesOnSelector”. Another example would be “type”: “kCharacterShapeType” and “selector”: “kTraditionalCharactersSelector”.

There is a fairly interesting difference between these two examples, though. The first type, kLigaturesType, and “on” and “off” selectors. The second type, kCharacterShapeType, doesn’t. This means that kCharacterShape is what is known as an “exclusive” type, where setting a value for it overrides all other values for it. On the other hand, kLigaturesType is a nonexclusive type, so you can set both kRareLigaturesOnSelector and kCommonLigaturesOnSelector for it. For all the feature types your font knows about, you can declare whether they are exclusive or nonexclusive in the font’s ‘feat’ table, and the software will behave accordingly.

In addition, you can mark a feature as on by default if you set the chain’s default flags to intersect with the feature’s enable flags. Logically, this makes sense - if the feature turns on something which is already on, then that feature should be on by default. However, because the feature is on by default, the software might decide not to show it in the list of available font features. This makes sense for something like required ligatures; turning that feature off would make the font completely illegible, so the software shouldn’t even give the user that option. This behavior allows the font file to determine which features get turned on by default, rather than the software making a guess.

This also means that, for all your exclusive features, you need to designate one selector as the default, even if that selector doesn't actually trigger any typed subtables to be performed. Again, this logically makes sense, and allows you to perform some shaping by default.

Saturday, July 11, 2015

Knuth-Plass Line Breaking Algorithm

Line breaking, with regards to paragraph layout, isn’t a very flashy topic. Most people don’t even notice the location of line breaks in a paragraph of written text. However, to people who care, line breaking can be done badly.

There are many different aspects of a set of line breaks in some text. The most important principle is that the text of the line should match the width it's given as best as possible. If there is too much or too little space on a line, the line visually looks jarring. We want to avoid rapid changes of extra space on a line, so we don’t have a loose line followed directly by a tight line. We want to hyphenate words if it makes sense, but only do so sparingly and if required. In addition, we never want to hyphenate two adjacent lines. Similarly, we don’t want to hyphenate the second-to-last line of a paragraph, because there is plenty of space on that last line to accept the full word. One last, more subtle, thing we want to do is to avoid starting a word to the right of where the next line ends, because it would look like that word is floating all alone.

All of these things are desirable, but many line breaking implementations only consider a few, or none, of these qualities. The line breaking algorithm used in web browsers is simply to treat each line independently, and place as much text as possible on that line without exceeding the maximum width.

If we only consider the primary principle of filling up the available space, it may seem that our greedy algorithm is optimal. However, choosing a particular line breaking opportunity affects all subsequent lines, so it may turn out that the best choice for a particular line actually has a negative affect on the entire rest of the paragraph. Instead of optimizing each line independently, it would be better if we tried to optimize the whole paragraph.

Donald Knuth, probably the most famous computer scientist in the world, worked with Michael Plass to create an algorithm which does so. The algorithm is actually quite simple, but it does have some tricky parts.

Their algorithm has two parts: a classification stage and an execution stage. The classification stage translates each item in the text into one of three types:
  • A box. This represents a character or word and has a particular width.
  • Glue. This represents a space character, and is characterized by three values:
    • A natural width
    • A stretchability value. Higher values here mean that, when a line is stretched, more of that stretch is allocated to this character
    • A shrinkability value. This is the same thing, but for when a line is shrunk.
  • A penalty item. This represents a hyphenation opportunity within a word. It is characterized by two values (In the text it has three, but the third one isn’t really interesting in this discussion):
    • A width to use if a line break is chosen at this item.
    • A cost associated with choosing a line break at this item

There are actually all sorts of diverse typographical situations which can be represented with particular patterns of the items in this classification; Knuth names quite a few of them. Once we have done this classification, we want to choose a set of line breaks which correspond to glue items or penalty items. In order to choose a good set of line breaks, we have to introduce three new concepts.

The first is the concept of the “adjustment ratio” of a line, given a particular breakpoint. The adjustment ratio is simply the excess space in the line divided by the sum of the stretchability factors of all the glue items in the line. If the line is too long rather than too short, then the calculation is the same thing except with using shrinkability factors instead.

The authors then define the “badness” of a line to be a particular increasing function of the line’s adjustment ratio. The particular shape of this function they chose involves the cube root of the adjustment ratio, and they admit that this function was chosen arbitrarily and “has shown that good results are obtained.”

Then, the authors go further and define the “demerits” of a line. This is an increasing function of badness, but also includes extra information, such as the cost of a penalty item, if one was chosen to end the line. In addition, it includes a term which is positive if this is the second penalty item chosen in a row to end lines with. This function involves squaring the sum of badness and penalty, which the authors admit is arbitrary but “it works well in practice.”

The goal of the execution phase of the algorithm is to minimize the sum of the demerits of each line. The way it does this is actually very straightforward dynamic programming. Scan through the string, and each time you encounter a breaking opportunity, look at prior line breaking opportunities and consider a candidate line between these two opportunities. Compute the demerits of this line candidate. Select the line candidate which has the minimum demerits, and associate it with this particular line breaking opportunity. Continue through the string in this manner until you get to the end. The selection of lines which ends up with the minimum total demerits is the output of the algorithm.

A dynamic programming approach is valid because we are summing the total demerits and optimizing the entire prefix of the paragraph. This optimized prefix grows until it contains the entire text. Given an optimized prefix, if we consider a line starting at the end of that prefix, the particular decision we have made that led up to the beginning of that line don’t matter; the only thing that matters is the sum of the demerits up to that point.

Naively implemented, this algorithm is O(n2). However, there is an optimization which can improve the running time of the algorithm. The fundamental observation is that, given a line candidate’s ending position, there are only a few candidate starting positions which are not terrible. In particular, when we have selected a particular breaking candidate and are looking through prior opportunities, we can limit our search to only the prior opportunities which are not terrible. We can define “not terrible” as “having demerits below a threshold.” If we come across a line candidate that is wildly too long, we can simply forget the breaking opportunity at its beginning in our searches, because there will never be another viable line candidate starting at this opportunity. If our line candidate is wildly too short, we can abort our search, assuming we are searching in the direction of increasing indices in the string (but we can't forget this starting opportunity because it may be valuable in subsequent searches). Therefore, for each ending opportunity in our text, our search will investigate a relatively low (and vaguely constant-ish) amount of line candidates. So, if we choose an appropriate terribleness-threshold, our algorithm is O(n).

It’s worth mentioning what happens when we pick a bad terribleness threshold. If our threshold is too high, our algorithm will take much longer to run (bounding on O(n2)), but it may find a more globally-correct solution. If our threshold is too low, the algorithm will find that it is impossible to find acceptable line breaks all the way to the end of the paragraph. In this case, there is nothing we can do but try again with a higher threshold (or the greedy algorithm, which is guaranteed to find a result).

Aside: Calculating widths of line candidates can be done quickly with the established technique of the subsequence sum algorithm. This algorithm has you construct an array which contains a running sum of the widths of all the prefixes of the string (which can be done in O(n)). Then, the width of any subsequence is simply the difference between the two appropriate elements in this array. (I've seen this algorithm used to implement box blurs on images in linear time)

There is one bit of trickiness, however, with the line breaking algorithm. So far, we have assumed that the only affect an optimal prefix of a paragraph has on subsequent lines is its demerits. However, this is not true in the presence of floats. With floats, our available width can change partway down the page. Also, the location of this change might be affected by the layout of the current line, because floating elements occur in-line with text (meaning: if the floating element gets pushed down to a successive line, the location that the width change also gets pushed down to the successive line). This means that the algorithm is sensitive not only to the demerits created so far, but also the number of lines created and the location of all floating elements so far.

Because of this, we need to remember multiple ways to get to a particular line breaking candidate if those ways differ in either line count or floating element placement (and possibly things like differing line heights due to fallback fonts). The presence of these “duplicate” elements can cause our search space to grow dramatically. Luckily, we can combat this growing search space by setting our terribleness-threshold appropriately.

Overall, we have a few tools at our disposal. The first is how we classify our input text. The algorithm can be adapted to many typographic scenarios simply by inserting extra items in our stream of classified items. Next, we can change the values (like nominal width, or stretchiness factor) which characterize each item in our stream. Third, we can choose different functions for calculating badness and demerits. Finally, we can select a terribleness-threshold which will balance runtime and the likelihood of the algorithm completing. Some of these parameters are probably language-dependent, which makes selecting them even more difficult.

Friday, July 10, 2015

Generating x86_64 Machine Code at Runtime

As I discussed previously[1], there is a particular file format for representing programs on disk. This file format includes a significant amount of machinery for the dynamic loader to resolve symbols at runtime rather than compile-time. However, once your program is already running, it is possible to actually author more machine code, and to run that new code directly.

In order to do this, you have to write machine code into a buffer. I’m using an x86_64 machine, so that’s the format that I’ll be using. x86_64 machine code has a very long history which has significantly affected the instruction format.

The instructions originally came from the Intel 8086, back in 1979. That processor is a 16-bit processor which included 8 16-bit data registers: AX, BX, CX, DX, BP, SP, SI, and DI. The first four of these, AX, BX, CX, and DX, are general purpose registers. BP is the “base pointer” and always points to the base of the current stack frame. Similarly, SP is the “stack pointer” and always points to the top of the current stack frame. The stack grows downward, so SP always has a lower value then BP. SI and DI are usually used for pointers or indexes into memory.

There are also some other special-purpose registers, such as IP for “instruction pointer” which represents where control flow is at, and FLAGS. Instructions such as conditionals will flip bits in FLAGS, which other instructions like conditional jumps will inspect.

The 8086 allowed addressing a maximum of one megabyte (= 2^20) of memory. However, the registers can only hold 16 bits, which means the maximum representable value is 2^16. Therefore, pointers are represented as a combination of two registers (specifically, you shift the contents of one register by 4 bits and add that to the other register). There are 4 “segment registers” that handle this kind of addressing: CS, DS, ES, and SS.

The Intel 386, in 1985, introduced 32-bit wide addressing and registers. This was achieved by putting the processor in a new mode. In this mode, the same registers are available, but they are all 32-bits wide. These wider registers are named with a leading “E,” which stands for “extended.” For example, we have “EAX” and “EBP.” However, the old 16-bit registers are still available, aliased to the low 16-bits of the new wider registers. That means that if you write into AX, then read from EAX, you would get what you had written.

Indeed, this register aliasing was not a new concept; the 8086 already allowed you to access the four data registers by their low and high bytes. “AL” is the low byte of “AX” and “AH” is the high byte of “AX.” It is for this reason that the “X” in the 16-bit registers stands for “eXtended.”

It’s important to realize here that the number of registers you can interact with has stayed at a constant 8. Therefore, you only need three bits to represent a particular register. This is an important quality of the instruction format.

The instructions have a fairly complicated format, described at [2]. I’ll summarize it here:
  1. Instructions start with a sequence of up to 4 prefix bytes
  2. Then comes a 1 - 3 byte opcode. This is simply a number which represents the instruction
  3. The rest of the format is dependent on the arguments specific to each individual opcode. Register operands are encoded in 1 - 2 subsequent bytes, called the ModR/M byte and the SIB byte, respectively
  4. An optional displacement value and an optional immediate value follow
One important thing to note is that, because we only need 3 bits to identify a register, we can identify two registers in the ModR/M byte and have an additional 2 bits left over. These two bits are used to specify that the operand should actually come from memory, at the location pointed to by the register, with an optional displacement (supplied in the last bullet above). This particular encoding is quite versatile, as it means that you can have your instruction operate on memory by simply changing a few bits in its encoding.

Of course, you aren’t ::really:: operating on memory. When you think you are operating on memory, the hardware is actually generating load and store micro-ops and performing them. You just don’t see it.

The 32-bit Intel architecture was the standard desktop computer architecture until 2003, when AMD extended it to 64-bits. Similar to the older extension to 32-bits, the registers got wider, but the low 32-bits of each register alias the old names. The new names replace the leading “E” with a leading “R” so we have names like “RSP” and “RCX. In addition, however, they also added 8 more registers simply named R8 - R15.

This addition of new registers proved problematic for instruction encoding. In particular, we were using 3 bits to identify a register, but now we need 4 bits. Rather than dramatically change the encoding, AMD decided to prefix instructions that need to use these 64-bit registers with a “REX prefix.” This prefix is a byte where the high nibble is a “4,” and the low nibble includes that 4th bit needed to properly identify a register. Because instructions can have multiple operands which may each need to identify a register, each of the bits in the low nibble of the REX Prefix corresponds to a different operand. Therefore, if you’re trying to figure out which register an instruction is naming, you have to concatenate one bit from the REX prefix and three bits from the ModR/M byte.

AMD also decided to make the REX prefix optional, so instructions which only needed to operate with 32-bits and the 32-bit registers don’t need to specify it. You may think that this means that the new 64-bit instructions are completely backwards-compatible with 32-bit code; however, that is incorrect. In 32-bit mode, the REX prefix itself is a valid opcode. Therefore, if the machine is in 32-bit mode or 64-bit mode, it will interpret that byte differently.

That’s pretty much it when it comes to formatting instructions. If you want to generate commands at runtime, it isn’t too difficult to figure out how to output them in the format of x86_64 machine code. However, which instructions should you emit?

Well, the instructions are grouped into categories[2]. General purpose instructions include things like PUSH, MOV, CALL, ADD, SAR (shift arithmetic right), JMP, etc. These are pretty self-explanatory.

There are also a bunch of floating-point instructions which use the x87 floating point unit present in all modern processors. This unit used to be an optional coprocessor which you could add to your computer, but has been bundled in all processors since the 487.

The unit is actually pretty interesting. Instead of individually accessible registers, the unit uses a stack of 8 floating-point registers. Instead of naming registers individually, you push and pop items onto the stack, and math operations will replace the top items on the stack with the result of the operation. Also, these registers are 80-bits wide, allowing for significantly more precision than a double. (I’ll probably write another whole post later just about floating point precision)

Intel’s first foray into SIMD (Single Instruction Multiple Data) is with a collection of MMX instructions, first available with the Pentium. This defined 8 new 64-bit registers, named MM0 - MM7, which are aliased with low 64-bits of the x87 register stack. Instead of representing a single 64-bit value, each register represented a vector of smaller types (2x 32-bit numbers, 4x 16-bit numbers, or 8x 8-bit numbers). MMX instructions would operate on these vectors pairwise. So, if you added two MMX registers together, and chose to interpret those registers each as 2x 32-bit numbers, you would actually get two additions for the price of one, since the first components of the vectors gets added and the second components of the vectors gets added. Unfortunately, MMX only supports integral math (not floating-point math) which makes it not very useful. Also, 64-bits is not a very wide vector, making it even less useful.

Intel’s next attempt of SIMD is the SSE set of instructions, initially released with the Pentium 3, which was extended several times to include SSE2, SSE3, SSSE3, and SSE4. This sets up a new set of 8 128-bit registers (not aliased to anything, this time) named XMM0 - XMM7. Similar to general-purpose registers, 64-bit machines double this to 16 new registers. These instructions operate on floating point data, and as of SSE2, also can operate on integral data, thereby obsoleting MMX. In Haswell and beyond, there are fused-multiply-add instructions, allowing you to perform a multiply and an add in a single instruction.

Similarly, Intel created the AVX and AVX2 instructions which use 8 new registers named YMM0 - YMM7 (yet again, 16 new registers on 64-bit machines). These registers are 256 bits wide, and the low 128 bits alias the XMM registers. AVX was first released in 2011 with Sandy Bridge processors. There is a plan to release AVX-512 with 512-bit-wide registers, but it hasn’t shipped yet.

And that’s pretty much it. There are some other instructions regarding transactional memory and halfwidth float conversion. You can detect which of these instructions the processor accepts by issuing the CPUID instruction, which will return a bit mask of capabilities. But, overall, the amount of instructions really isn't that daunting.

We are now very close to being able to write instructions at runtime. However, we need to be able to pass data between our regular program and our generated code. The way this is normally achieved is with arguments and return values to functions. At the assembly level, these concepts are achieved with simply leaving data in a predefined place, and then jumping into a function. The code in the function will expect the data to be in the predefined place. This means that the caller and the callee have to both agree about what the signature of the function is.

There is a pattern, though, of where particular items are placed. In fact, each operating system will pick a particular “ABI,” or Application Binary Interface, which defines rules about where to place all these items. Everyone on either side of a caller/callee relationship needs to adhere to this. Also, the ABI chosen is different for different operating systems, which is yet another reason why you can't run code compiled for one OS on another OS, even if they are running on the same machine.

The Intel 32-bit ABI[3] is pretty simple. All function arguments simply get dumped onto the stack, one by one. Return values are a little tricky, though. If the return type is something which fits in a register, than it is simply placed in EAX. Otherwise, the caller will actually create space for the callee to fill in, and will place a pointer to this item as a hidden first argument to the function.

The Intel 64-bit ABI[4] is significantly more complicated, because they wanted to pass more items in registers, which would reduce cache pressure. The ABI involves classifying each argument based on its type and position in the argument list, and then placing each item in either a register or on the stack.

There’s one last piece to code generation which requires some thought: linking. Much of the complexity of the Mach-O file format is due to the linker needing to fixup symbols at runtime to point where they should point. However, our codegen is occurring at runtime, after the linker has fixed up all the symbols. This means that, if we want to look up the address for a function, we actually already have the address for it! All we need to do is just write out &function and it will be correctly be called.

All right, so now we’ve pretty much got everything we need. First, we need to allocate some memory which has the execute permission set.

void *ptr = mmap(NULL, length, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0)

Next, we generate instructions, convert them to machine code, and pack them out into that buffer, one by one. When we’re done generating code, we can call it by simply casting the pointer to our buffer to the correct function pointer type (keep in mind the ABI!) and calling it!

((int(*)(int, int))ptr)(4, 17)

We just made a JIT[5].


Wednesday, July 8, 2015

Producing Executables

The way that most people produce an executable is to describe the behavior you want your code to have, then run that description through a compiler, which produces an executable. It’s worth an investigation into the actual executable file itself.

The goal of an executable file is to hold a description of a program. Our computers can execute programs written in a particular language, which is the machine’s particular instruction set. A native executable contains a simple sequence of instructions in the machine’s native language. Because the language that we store is the same language that the machine can consume, we can simply map the bytes of the file into memory, point the machine at it, and say “go.”

Not so fast, though. Software is composable, meaning that your program might want to call functions that exist inside other files. One kind of library, a static library, actually has all its code copied into your executable before the final executable is produced. Dynamic libraries, on the other hand, aren’t copied into your executable, but must instead be present when your program first runs. This is done by the dynamic loader, which is the same thing which maps your code from the file into memory.

But how does the dynamic loader know which libraries to load? Well, your executable must list them somehow. This list is actually kept inside your executable file (rather than in an accompanying file). Which means that your executable file isn’t ::just:: a bunch of instructions, but instead has to have some sort of way to represent a variety of data structures, just one of which is the sequence of machine instructions.

Note that the format for this file isn’t machine-specific, but is instead OS-specific. The physical machine only cares about your instructions; everything else is just input to the dynamic loader, which is just some software that gets run when you want to execute a program. Because the dynamic loader is specific to each operating system, the format of this file is specific to each operating system.

I’ll be discussing the format on Mac OS X, which is the Mach-O file format. The format is described at [1], but I’ll summarize it here. First up is a header, with a magic number and some identifiers of the hardware machine type that this file is specific to. Then, there is a sequence of “load commands,” which really aren’t commands at all. These are simply a way to chunk up the file into a bunch of different pieces.

One such load command is LC_SEGMENT_64, which references the raw bytes which should be mapped into memory. Among other things, this chunk of the file contains a reference to the raw machine code that the instruction pointer will point into. I’ll spend more time discussing this in a bit.

One of the tasks of the loader is to resolve references between libraries. This means that a library must have a way of describing where all its symbols reside, and where all its external references are so the linker can overwrite them. A Mach-O file describes these locations with the LC_SYMTAB and LC_DYSYMTAB, and LC_DYLD_INFO_ONLY load commands.

Here are some more other interesting load commands:
  • LC_LOAD_DYLINKER: The path to the dynamic linker itself
  • LC_UUID: A random number. If you regenerate the file, this will be different, so utilities know that they need to reload their data
  • LC_LOAD_DYLIB : There will be one of these load commands for each dynamic library that this file depends on. It contains the full path of the dependent library, sot he loader can recursively load it when necessary.
  • LC_VERSION_MIN_MACOSX: Specifies the minimum OS X version that the executable should be able to run on
  • LC_SOURCE_VERSION: Version of the source code
  • LC_MAIN: Offset of the entry point
Alright, let’s talk about LC_SEGMENT_64. Each one of these load commands gets mapped into memory, and each one specifies the permissions that it should be mapped with. In executables, there is one of these, called __PAGEZERO, which has no permissions and is mapped at virtual address 0. The fact that this gets mapped with no permissions is why your program crashes if you try to read or write from address 0.

The __TEXT segment is where your raw machine code lives, and is mapped such that you cannot write to it. A segment contains 0 or more sections, and the __TEXT segment contains a variety of sections. (On the other hand, __PAGEZERO contains 0 sections.) Sections exist so you can mark up specific parts of your mapped memory as being used for a particular purpose and specify relocations that must be performed. For example, read-only C string data can be placed inside the __TEXT segment along with your code because both don’t have write privileges, but you can mark that the two sections, __cstring and __text, should be treated differently by utilities.

There is also another segment, __DATA, which gets mapped with read and write privileges, but no execute privileges. This is for any global variables which get statically initialized.

The last segment I’ll mention, __LINKEDIT, maps all the linker-specific data structures in the file. There are no sections; instead, other linker-specific segments will reference data inside this segment.

You can see how all the various data structures the dynamic linker needs are described within the Mach-O format. The format holds machine code itself, but it also holds a bunch of other information such as symbol tables, relocation information, library dependency information, etc.

There’s one last neat feature of the format that I’d like to describe: the fact that the format is extensible. There are lots of different kinds of load commands, segments, and sections. In particular, the linker accepts a flag, -sectcreate, which will create a section at link time in any given segment and populate it with the contents of a file. Therefore, you can say -sectcreate __DATA arbitrarysectname file.dat and the contents of file.dat will automatically be mapped into your process at load time. You can then access the file by marking up a pointer with the following syntax:

extern const uint8_t fooStart __asm("section$start$__DATA$arbitrarysectname");
extern const uint8_t fooEnd __asm("section$end$__DATA$arbitrarysectname");

and those variables, fooStart and fooEnd, will live at the beginning / ending of the newly created section. You can iterate over the data in the section by just looping a pointer from &fooStart to &fooEnd.


Sunday, June 28, 2015

AppKit’s relationship with CoreAnimation

The entirety of this discussion comes from [1]

Before CoreAnimation, AppKit had a particular drawing model. In this drawing model, AppKit has a single screen-sized canvas. When AppKit wants to populate a region of this canvas, it would construct a CoreGraphics drawing context targeting the canvas, and ask each of your views which intersect the dirty region of the canvas to recursively draw using that drawing context. Your app would then would then call functions like drawLine(), drawCircle(), drawText(), etc. on the drawing context, and those commands would be realized on the canvas. The canvas then shows up on the screen. There is a simple callback you can call to mark a region dirty.

CoreAnimation, instead, uses a different paradigm, which involves a scene graph of 2D layers. Each layer can have its own canvas. CoreAnimation supports a similar drawing model as AppKit, where when CoreAnimation wants to populate a region of a layer, it will construct a CoreGraphics drawing context targeting the layer, and then ask either subclasses or the layer’s delegate to draw using that drawing context. Once all the canvases in the scene graph are populated, CoreAnimation takes care of actually updating the screen.

Before CoreAnimation, your window is visually atomic, meaning that any time anything changes, you must redraw everything dependent on that change with CoreGraphics. However, with CoreAnimation, you chop up all your drawing into layers, which are finer-grained than the entire window. Therefore, when a region of a layer changes, only bits of that layer will need to be redrawn (and all the other layers don’t need to be touched).

In addition, layers can all have differing sizes and positions. This means that, if something moves, and it has its own layer, you can just move the layer, and never have to touch CoreGraphics at all!

I’d like to take a minute to describe the pros and cons of layers. In particular, layers can be moved extremely quickly compared to AppKit views because the CoreGraphics content doesn’t have to be repainted. However, it is very common to have overlapping layers, which means you may be paying the memory cost of a pixel many times. If you have tons of overlapping views, you can definitely run out of memory. This means that, if you ask for a layer, you may be denied. You should gracefully handle that.

So that’s pretty cool, but CoreAnimation was invented after AppKit, which means that, if AppKit wants to use CoreAnimation, it has to figure out a way to make the two models work with each other. There are two models for this interaction: layer-backed views and layer-hosted views.

Layer-hosted views are conceptually simpler. A layer-hosted view is a leaf view in the AppKit view hierarchy, but it has its own (possibly complex) CoreAnimation layer tree rooted at it. Therefore, there isn’t much interoperability here; instead, when AppKit encounters this view, it knows to just let CoreAnimation handle it from there.

A layer-backed view has a CoreAnimation layer, but also has child views (which have their own layers). Thus, the AppKit view hierarchy is intact; each view just has this extra property of its backing layer. Now, layers and views have some duplicate functionality: they both have things like position and size and the ability to ask the application to draw, etc. AppKit owns all the layers which back views, and will automatically synchronize all the state changes between the view and the layer. This means if you move the view to a particular place, AppKit will implement that by moving the layer to that place. When the layer needs to be drawn, AppKit will ask the view to draw into the layer’s context. It can do this because AppKit implements the layer’s delegate, so when the layer needs something, it calls up to AppKit.

You opt-in to using a layer by using the NSView property wantsLayer. However, note that the name is “wants.” In particular, because of the memory concerns described above, you may want a layer but not get one. This is okay, though, because there isn’t any reason why your drawRect: method should behave differently depending on if its drawing into a layer or not. If you ask for a layer and don’t get one, it just means that animations will be slower, but your app will continue to behave correctly. Therefore, we have graceful degradation.

AppKit also exposes a getter which returns the layer associated with any view (and may return nil for regular views). This means, however, that you now have access to both pieces of state that AppKit tries to keep synchronized: properties on the view and properties on the layer. Therefore, when you use layer-backed views, there is a set of properties on the layers that you shouldn’t modify (because modifying them will cause CoreAnimation to get out of synch with AppKit).

But it also means that you can set any of the ::other:: properties of the CoreAnimation layer which AppKit isn’t responsible for. This is pretty powerful, as you can set the contents of the layer to an image (with a simply property assignment) or give the layer a border, etc. In fact, you may actually have a layer where the entire visual contents of the layer can be described simply by setting properties on the layer! In this case, your drawRect: function would be empty, which means that there is no reason for CoreAnimation to create a CoreGraphics context for the layer at all.

CoreAnimation actually optimizes for this case by allowing you to implement the displayLayer: method on the layer’s delegate. If you do this, it will be called ::instead:: of drawSublayer:inContext:. Note how displayLayer: doesn’t pass you a CoreGraphics context; instead, you are expected to simply set some properties on the layer which will populate its contents.

AppKit exposes this with the updateLayer: method on NSView. You opt-in to using this method by setting the wantsUpdateLayer property to true. If you do this, then AppKit will cause CoreAnimation to call the layer’s delegate’s displayLayer: method, which AppKit implements by calling the view’s updateLayer: method. This is where you can set any properties on the layer you want (except for the ones AppKit manages). Note that, if you implement this, you should ::also:: implement drawRect for graceful degradation if there isn’t enough memory to create a layer for your view.

Before CoreAnimation, animations in AppKit were implemented by an “animator” proxy object. You would set properties on this object as if it was a view, and the animator would take care of updating the real view over time. However, there was no hardware acceleration, which means that each time the animator would change something on the view, the view would be asked to redraw (with CoreGraphics). However, in CoreAnimation, we would like to draw with CoreGraphics as infrequently as possible. Specifically: when we draw into a layer, the backing store of that layer can be animated much faster than we can draw into the layer. Therefore, it’s valuable for AppKit not to tell our view to paint on each frame of an animation, and instead animate the layer. You can prescribe this behavior with the layerContentsRedrawPolicy property. Note that this degrades gracefully; this property is only consulted if you actually get a layer.

Also, because AppKit owns particular properties of the backing layers, you shouldn’t use CoreAnimation to animate those, as that will cause the states to get out of sync. Instead, you should use the AppKit animator proxy object.


Saturday, June 27, 2015

Surprising CSS Layouts

There are a few properties which all come into play when determining a layout described in CSS. However, these can interact in surprising ways. In this discussion, I’m only going to describe styling block elements. Here are the players:
  • Positioning. Blocks can be “positioned” which means that their position is stated explicitly relative to a “containing block.” You can think of it as the containing block creates a local coordinate system with the origin in its top left, and a positioned descendent states its position by using the “left” “top” “bottom” or “right” properties, which describe locations inside that coordinate system. Note that the containing block for an element is not simply the element’s parent; instead, it is the nearest ancestor with a “position” of “absolute” “relative” or “fixed.”
  • Stacking: You can specify a “z-order” on an element, which describes painting order. However, the z-order of an element is only relevant inside that element’s “stacking context.” From the outside looking in, a stacking context is atomic (flattened), and from the inside looking out, only items inside the stacking context have their painting order sorted. Specifying “z-order” on an element also creates a stacking context on that element (and its descendants belong to this stacking context)
  • Clipping: You can specify “overflow: hidden” on an element, and that will cause the element’s contents to be clipped to the bounds of the element. However, positioned descendants whose containing block is outside the overflow:hidden element won't be clipped.
  • Scrolling: Same thing as clipping, except the element has scrollbars and lets you scroll around to see the overflow. You can trigger scrolling with “overflow: scroll” but note how this is the same property that you would use for clipping so you can’t specify both at once.
  • 3D rendering: You can specify 3D transformations on an element, which specify the transform between the element’s local coordinate space and the coordinate space of the element’s parent. Doing so creates a containing block and a stacking context. It also creates a “3D rendering context” which is conceptually the shared 3D space that all elements in the context live in. If you’re on the outside looking in on a 3D rendering context, the contents of the 3D rendering context all get flattened into your local coordinate system when drawn.
Now that we’ve described the concepts at play here, we can combine them. There are a few combinations I’d like to call out as particularly surprising.
  1. You can have overflow: scroll between an element and its containing block. In the following example, both green and blue boxes’ DOM elements are inside the overflow: scroll element, but the green box is position: absolute and its containing block is outside the overflow: scroll. This means that, even though the green box’s DOM element is within the overflow: scroll, it doesn’t move when you scroll. Try scrolling the grey box to see what I mean.
  2. Overflow: scroll doesn’t create a stacking context, which means that its contents are in the same stacking context as its siblings. This means that some element from outside the overflow: scroll can decide to nestle in between the overflow: scroll’s contents (in the z-dimension). In this example, the red box is a sibling of the overflow: scroll, but the green and blue boxes are descendants of the overflow: scroll. Scroll around the grey box to see what I mean.
  3. Clipping doesn’t follow z-order. This means that you can clip an element in one z-index to an element in a completely different z-index. Therefore, you can have some other external element be displayed between the clipper and the clippee. In this example, the green box is being clipped by the red box. However, the blue box, which is external and a sibling to the red one, can get between the two. Note that all the boxes in this example have width: 100px; height: 100px;.
  4. Every time you specify a 3D transformation, it creates a new 3D rendering context, which causes flattening. This means that if you have nested 3D transformations, each transformation gets flattened into the plan of its parent, all the way up the DOM. There is a way to opt-out of this behavior (to have a shared 3D rendering context) but it requires an extra CSS property, transform-style: preserve-3D. In this example, the blue square is a descendent of the green square, and both have 3D transformations to rotate them around the Y axis. The first example does not have transform-style: preserve-3D, but the second example does. Mouse over (or tap) each of them to see how they are set up.

Sunday, June 21, 2015

3D on the web

Describing 3D on the web seems a little complicated at first, but it’s actually fairly straightforward. The concepts involved actually match existing concepts in CSS quite well. The CSS Transforms spec doesn’t actually add that much complexity.

The first thing I want to mention is that transforms are a completely different concept than either animations or transitions. Animations and transitions simply let you describe CSS values to change over time; transforms are just another few CSS properties you can specify on an element. The fact that these different specs are often used together is coincidental.

Before we start discussing transforms, I’d like to take a look back at what HTML and CSS were like before transforms. There is a document, described as a tree of nodes, usually written in HTML. There are also some style key/value pairs, usually written in CSS, which get applied to particular nodes in the document. When a browser wants to actually show a webpage on screen, it simply runs through the document, telling each node to paint. Therefore, nodes get drawn in document order. Because of this, nodes which occur later in the document appear to be on top of previous nodes. You can see this in the picture below: the blue square appears on top of the green square because it appears later in the document.
<div class="square"
    style="left: 0px;
           top: 0px;
           background-color: green;"></div>
<div class="square"
    style="left: 50px;
           top: 50px;
           background-color: blue;"></div>

So, if you have two nodes, and you want to make one appear on top of the other, you have to simply move one after the other in the document.

But wait - shouldn’t the concept of things being on top of other things be a part of style, instead of the document itself? Think about what we just did: we just modified the document itself, just to change the style of how it’s presented. This is, conceptually, a bad move.

In order to address this, CSS has the concept of z-order. This is a CSS attribute which you can put on a div to change the apparent stacking of the elements. Positive z-order values mean “closer to the user.” Here’s the same example as before, but this time using z-order to change the apparent stacking of the boxes:
<div class="square"
    style="z-index: 2;
           left: 0px;
           top: 0px;
           background-color: green;"></div>
<div class="square"
    style="z-index: 1;
           left: 50px;
           top: 50px;
           background-color: blue;"></div>

When a browser encounters a z-order, it’s important to realize that it isn’t actually moving things in the z-dimension. Instead, it just sorts items by their z-order before drawing them. This isn’t actually 3D; it’s just a reordering.

The reordering only applies to elements which have a z-order specified. If you don’t have a z-order, you’re drawn just like normal, as a part of drawing your closest ancestor who ::does:: have a z-order. Therefore, when you specify z-order on some elements, you’re partitioning the document into chunks, the chunks get sorted against each other, and then drawn in turn.

But what happens if you have two z-ordered nodes, and one is a child of the other? Do you just want to disregard everything except the shallowest z-order declaration? Coming up with a global z-ordering for the entire document would be very difficult to do for complicated pages. Instead, we want some way of making some sort of a namespace for stacking, where we can say that within a namespace, chunks will be sorted, but outside of a namespace, that entire namespace is treated as atomic. CSS does this with the concept of a “stacking context.” Using stacking contexts is a good way to encapsulate parts of a webpage.

In CSS, there are many ways to create stacking contexts[1]. There are two straightforward ways:
  1. The “isolation” CSS property. All it does is create a stacking context on any element it applies to.
  2. Specifying z-order itself.
The fact that a stacking context is created any time you specify z-order means that we will never have nested z-orders in the same stacking context. Therefore, for any given stacking context, it’s trivial to sort chunks, because none of the chunks intersect.

Here’s an example of stacking contexts. Note that, if you look at the raw z-order values, the red square should be on the bottom and the yellow square should be on the top. However, because these two elements are contained within their own stacking context, they only get sorted with regard to each other. Then, the red/yellow combination gets treated atomically with respect to the outer stacking context.
<div class="square"
    style="z-index: 2;
           left: 0px;
           top: 0px;
           background-color: green;"></div>
<div style="z-index: 3; position: relative;">
    <div class="square"
        style="z-index: 1;
               left: 50px;
               top: 50px;
               background-color: red;"></div>
    <div class="square"
        style="z-index: 5;
               left: 100px;
               top: 100px;
               background-color: yellow;"></div>
<div class="square"
    style="z-index: 4;
           left: 150px;
           top: 150px;
           background-color: blue;"></div>

You can even think about this holistically with the concept of a z-order tree (which WebKit has). The non-leaf nodes in the z-order tree are stacking contexts. The leaf nodes are chunks of the document which can be rendered atomically. When you want to render a webpage, you can do a simple traverse of this z-order tree.

Alright, let’s now talk about transforms. Transforms are a paint-time property, which means they don’t affect the layout of content. (Web browsers use two passes: layout and rendering. Laying out content determines where everything should go, and rendering actually draws it. When layout happens, transforms are ignored. Then, just before we want to paint an element, we factor in transforms at that point.)

There are two kinds of transforms: 2D transforms and 3D transforms. 2D transforms are actually conceptually very simple - when you want to paint something, just paint it somewhere else. We’ve already got a 2D graphics context; we just need to adjust the context’s 2D CTM. No big deal. All 2D drawing libraries support CTMs. However, 3D transforms are a little more complicated.

If you have content inside a 3D transform, that content doesn’t even know it. Therefore, inside a 3D transform, there is a plane where content gets drawn to. This drawing is the same drawing that we do to draw the element normally.
<div style="position: relative;
            perspective: 800px;">
    <div class="inner"
         style="position: absolute;
                transform: rotateY(20deg);
                background-color: green;">


Outside the transformed content, however, we have to flatten the 3D parts into the frame buffer (eventually to be shown on the screen). This flattening needs to occur whenever we need to draw something with a 3D transform into a frame buffer

So what happens if we have nested transforms? Well, like I said before, content that is in-between the two transformed elements doesn’t know that it’s transformed; it just paints like any normal 2D element. That means, when we go to draw the inner transformed element, it will be flattened into the plane of the outer transformed element - NOT the plane of the root document!

This is the concept of a “3D rendering context,” similar to a stacking context. Here, each time you specify a 3D transform, you are creating a 3D rendering context on that element. Anything that’s drawn as a child of that element gets flattened into the plane of the context. 

You can see that in the following markup. The blue square is a child of the green square, and both have a rotation transform applied to them. You can see that the blue square is being rotated, but we only see it after the rotation is projected onto its parent plane. This projection is the same operation that the green square undergoes to be shown on to the root document (our monitors). (Note that the flashing occurs because the blue and green squares are coplanar, so you're only ever seeing half the blue square at a time)
<div style="position: relative;
            perspective: 800px;">
    <div class="inner"
         style="position: absolute;
                transform: rotateY(20deg);
                background-color: green;">
        <div class="inner"
             style="position: absolute;
                    transform: rotateY(20deg);
                    background-color: blue;">

This kind of sucks. Every other place (modeling software, game engines, etc.) that describes a hierarchy of transformations doesn’t project everything into the plane of its parent. The reason why CSS has to do it is because we have to preserve the concept of a “document” which is 2D.

However, all is not lost. The CSS designers thought of this problem, and created another CSS property, transform-style, which gives you more control over which elements belong to which 3D rendering context. In particular, there are 2 values: flat and preserve-3d. Flat specifies the behavior described above. Preserve-3d specifies that this element (and its descendants) should belong to the 3D rendering context of its parent. With this value, no flattening occurs, and your descendants live in the same 3D space as your parent.
<div style="position: relative;
            perspective: 800px;">
    <div class="inner"
         style="position: absolute;
                transform: rotateY(20deg);
                background-color: green;
                transform-style: preserve-3d;">
        <div class="inner"
             style="position: absolute;
                    transform: rotateY(20deg);
                    background-color: blue;">

So, if you only want one 3D space, and all your transforms to nest inside it, specify transform-style: preserve-3d on all your nodes except the root one.


Thursday, May 21, 2015


There are many explanations of color spaces, but all the discussions of it I have heard have missed some points which I think are important when thinking about color spaces. In particular, there are many related pieces of a color space which all interact to define a color.

First off, a set of colors exists in the world. Physically, a color is a distribution of light frequencies at varying intensities. You can think of these colors as a “space,” similar to a vector space, but not necessarily linear. The axes in this “space” may be stretched or squished according to arbitrary formulas; however, the axes do span the space (using the mathematical definition of “span”).

A color space defines axes (again: not necessarily linear) through this space. Note that, like a vector space, these axes theoretically extend infinitely, which means that the space of representable colors extends infinitely in all directions. If you only consider these axes, all colors can be represented in all of these sets of axes. Because of this property, if you have two of these sets of axes, you can always represent all points in both of them.

However, color spaces define their own “gamut” which is simply the bounds of the color space. This means that there is a watertight seal around the space of colors which the color space can represent. Because two colorspaces might not have the same gamut, this means that there can be colors that are representable in one that are not representable in the other. (Before we introduced gamut, this was not possible; but now that we have this artificial limit of each space, it is possible). However, gamut is just the outermost bounds; all colors within this bounds are representable.

Now, let’s talk about computers. Computers can’t represent infinitely long decimals. They have to quantize (round) all the numbers they work with to particular representable values. For example, an int cannot hold the value 0.5, and a float cannot hold the value 16777217. Therefore, when computers represent a color in a color space, they can only represent discrete points inside the space, rather than all the points, continuously.

So, there is this problem of how to choose which specific points are representable in a color space. You may want to just divide up the space evenly into little grid squares, but this has a problem. In particular, depending on the color, our eyes are better or worse at determining slight variations in color. This means that, around the part of the gamut where our eyes are really good at distinguishing color, we want to make the grid squares small, so the representable values better match what we can actually see. Conversely, around the parts where our eyes don’t work so well, it is wasteful to have the grid cells be smaller than what we can actually distinguish. In effect, what we want to do is take some of those representable values, and squish them so they are denser around part of the gamut, and less dense around other parts.

So, color spaces, define formulas of how to do this. In particular, if you have a value on a particular axis, they define how far down that axis to travel. This function is often non linear. This formula allows you to connect some arbitrary number with a location in the gamut. The bounds of the gamut are often described in terms of these numbers.

“Gamma” is just the particular function (y = x ^ gamma) that sRGB uses that does this.

Now, if you consider this mathematically, all we are doing is representing the same space with different axes, which, mathematically, is completely possible and lossless. It just means that, in this other coordinate system, the numbers which represent the bounds of the gamut will (might) be different, even though the gamut itself has not changed.

That’s fine, except that when we do math on colors (which we do all the time in pixel shaders), we expect the space to behave like a linear vector space. If we have a color, and we double it, we expect to be twice as far from the origin as we were before. However, this means that the space we do math in has to be linear. So, if we want to do math on colors, we have to stretch and squish the axes in the opposite way, in order to get the squished space back to being linear. However, we are now no longer in the original color space! (Because each color space includes the squishiness characteristics of its axes.)

Therefore, if we have a color in a non-linear colorspace, and we have to do math on it, but at the end have a color in that same color space, we must convert the color into a linear space, do math, then convert it back. We don’t want to keep all colors in a linear color space all the time, because that would mean that we would have to cover the entire gamut with grid cells that are as small as the smallest one in the non-linear color space, which means we need more bits to represent our colors (smaller grid cells means more of them).

It also means that we should use a very high precision number format to do math in, and only convert back to nonlinear when we’re done. Each conversion we do is a little bit lossy. Shaders execute with floating point precision (or half float precision) which is much more precision than a single 8-bit uint (which is typically used to represent colors in sRGB).

Now, it turns out that all this stuff was being implemented when computers all used CRT monitors. Now, it turns out that the voltage response of a CRT monitor is almost exactly the inverse of the squishiness of sRGB. This means, if you feed your sRGB-encoded image directly to a CRT, it will, by virtue of physics of electron beams, just happen to exactly undo the sRGB encoding and display the correct thing. This means that, back then, all images were stored in this form, both because 1) No conversion was needed before feeding it directly to the monitor, and 2) the density of information was in all the right places. This practice continues to this day. (It means that when digital cameras take a picture, they actually perform an encoding into sRGB before saving out to disk). Therefore, if you naively have a pixel shader which samples a photograph without doing any correction, you probably are doing the wrong thing.


In OpenGL, if you are in a fragment shader and you are reading data from a texture whose contents are in a non-linear color space, you must convert that color you read into a linear color space before you do any math with it. But wait - doesn’t texture filtering count as math? Yes, it does. Doesn’t this mean that I have to convert the colors into a linear color space before texture filtering occurs? Yes, it does. One of the texture formats you can mark a texture with is GL_SRGB8 (sRGB is nonlinear), and when you do this, the hardware will run the linearizing conversion on the data read before filtering (Actually implementations are free to do this conversion after filtering, but they should do it before). Then, throughout the duration of the fragment shader, all your math is done in a linear color space, using floats.

But what about the color you output from you fragment shader? Doesn’t blending count as math? Yes, it does. When you enable GL_FRAMEBUFFER_SRGB, OpenGL lets you use sRGB textures when you create a framebuffer. You can also query whether or not your frame buffer is using sRGB or not using glGetFramebufferAttachmentParameter(). When you set this up, the GPU will convert from your linear colorspace into sRGB, and it will do this such that blending occurs correctly (linearly).

The OpenGL spec includes the exact formulas which are run when you convert into and out of sRGB from a linear colorspace. I’m not sure if the linearized sRGB colorspace has a proper name (but it is definitely not sRGB!)

As alluded to earlier, an alternative would be to keep everything in a linear space all the time, but this means that you need more bits to represent colors. In practice the general wisdom is to use at least 10 bits of information per channel. This means, however, that if you are trying to represent your color in 32 bits, you only have 2 bits left over for alpha (using the GL_RGB10_A2 format). If you need more alpha than that (almost certainly), you will likely need to go up to 16 bits per channel, which means doubling your memory use. At that point, you might as well use half floats so you can get HDR as well (assuming your hardware supports it).

Alpha isn’t a part of color. No color space includes mention of alpha. Alpha is ancillary data that does not live within a color space, the way that color does. We put alpha into the w component of a vec4 out of convenience, not because it is ontologically part of the color. Therefore, when you are converting a color between colorspaces, make sure you don’t touch your alpha component (The GL spec follows this rule when describing how it converts into and out of sRGB).