Monday, July 31, 2017

Wide and Deep Color in Metal and OpenGL

“Wide Color” and “Deep Color” refer to different things. A color space is “wide” if it has a gamut that is bigger than sRGB. “Gamut” roughly corresponds to how saturated a color it is possible to represent. The wider the color space, the more saturated the colors it can represent.

“Deep color” refers to the number of representable values in a particular encoding of a color space. An encoding of a color space is “deep” if it has more than 2^24 representable values.

Consider widening a color space without making it deeper. In this situation, you have the same number of representable colors, but these individual points are being stretched farther apart. Therefore, the density of representable colors decreases. This is a problem because it means that our eyes might be able to distinguish between adjacent colors with a higher granularity than the granularity at which they are represented. This commonly leads to “banding,” where what should be a smooth gradient of color over an area appears to our eyes as having stripes of individual colors.

Consider deepening a color space without making it wider. In this situation, you are squeezing more and more points within the same volume of colors, making the density of these points increase. Now, adjacent points may be so close that our eyes may not be able to distinguish them. This results in image quality that isn’t any better, but the amount of data required to store the image is higher, resulting in wasted space.

The trick is to do both at once. Widening the gamut, and increasing the number of representable values within that gamut, keeps the density of points roughly equivalent. More information is required to store the image, and the image looks more vibrant to our eyes.

OpenGL

Originally, OpenGL itself didn’t specify what color space its result pixels are in. At the time it was created, this meant that by default, the results were interpreted as sRGB. However, sRGB is a non-linear color space, which means that math on pixel values is meaningless. Unfortunately, alpha blending is math on pixel values, which meant that, by default, blend operations (and all math done in pixel shaders, unless this math was explicitly fixed by the shader author) were broken.

One solution is to simply make the operating system interpret the pixel results as in “linear sRGB.” Indeed, macOS lets you do this by setting the colorSpace property of an NSWindow or CGLayer. Unfortunately, this doesn’t give good results because these pixel results are in 24-bit color, and all of these representable colors should be (roughly) perceptually equidistant from each other. Our eyes, though, are better at perceiving color differences in low-light, which means that dark colors need a higher density of representable values than bright colors. So, in “linear sRGB,” the density of representable values is constant, so we actually don’t have enough definition for dark colors to look good. Increasing the density of representable values would solve the problem for dark colors, but it would make bright colors waste information. (This extra information would probably cost GPU bandwidth, which would probably be fine for just displaying the image on a monitor, but not all GPUs support rendering to > 24-bit color…)
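To make the density argument concrete, here is a rough sketch that counts how many 8-bit codes land in the darkest 1% of linear intensity under each encoding. The 1% cutoff and the helper names are my own choices for illustration; the decode function is the standard sRGB piecewise curve.

```python
def srgb_to_linear(v: float) -> float:
    """Decode an sRGB-encoded value in [0, 1] to linear light."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4


def codes_below(threshold: float, decode) -> int:
    """Count the 8-bit codes whose decoded linear value is below threshold."""
    return sum(1 for code in range(256) if decode(code / 255) < threshold)


DARK = 0.01  # arbitrary cutoff: the darkest 1% of linear intensity

# "Linear sRGB": the stored value already is the linear value.
linear_codes = codes_below(DARK, lambda v: v)
# Regular sRGB: the stored value is gamma encoded.
srgb_codes = codes_below(DARK, srgb_to_linear)

print("codes in the darkest 1%:", linear_codes, "(linear) vs", srgb_codes, "(sRGB)")
```

The linear encoding spends only a handful of codes on that dark region, while the gamma-encoded one spends several times as many, which is the reason dark gradients band in “linear sRGB” at 8 bits per channel.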

So the colors in the framebuffer need to be in regular sRGB, not linear sRGB. But this means that blending math is meaningless! OpenGL solved this by creating an extension, EXT_framebuffer_sRGB (which later got promoted to be part of OpenGL Core), which says “whenever you want to perform blending, read the contents of the sRGB destination color from the framebuffer, convert it to a float, linearize it, perform the blend, delinearize it, convert it back to 24-bit color, and store it to the framebuffer.” This way, the final results are always in sRGB, but the blending is done in linear space. This ugly processing only happens on the framebuffer color, not on the output of the fragment shader, so your fragment shader can assume that everything is in linear space, and any math it performs will be meaningful.
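The linearize / blend / delinearize sequence can be sketched on the CPU for a single channel. This is only an illustration of the math, not how any driver implements it; the function names are mine, and the curves are the standard sRGB piecewise functions.

```python
def srgb_to_linear(v: float) -> float:
    """Decode an sRGB-encoded value in [0, 1] to linear light."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v: float) -> float:
    """Encode a linear value in [0, 1] with the sRGB curve."""
    if v <= 0.0031308:
        return v * 12.92
    return 1.055 * v ** (1 / 2.4) - 0.055


def blend(src: float, dst: float, alpha: float) -> float:
    """Plain source-over blending of one channel."""
    return src * alpha + dst * (1 - alpha)


def blend_naive(src_encoded: float, dst_encoded: float, alpha: float) -> float:
    """Broken: blends the encoded values directly."""
    return blend(src_encoded, dst_encoded, alpha)


def blend_srgb_framebuffer(src_linear: float, dst_encoded: float, alpha: float) -> float:
    """Linearize the destination, blend in linear space, re-encode the result."""
    dst_linear = srgb_to_linear(dst_encoded)
    return linear_to_srgb(blend(src_linear, dst_linear, alpha))


# 50% white over black, one channel.
naive = blend_naive(1.0, 0.0, 0.5)
correct = blend_srgb_framebuffer(1.0, 0.0, 0.5)
print(naive, correct)
```

The naive path stores 0.5, while the linear-space path stores a noticeably brighter encoded value, so the two approaches disagree visibly on something as simple as a half-transparent white quad.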

The trigger to perform this processing is a special format for the framebuffer (so it’s an opt-in feature). Now, in OpenGL, the default framebuffer is not created by OpenGL. Instead, it is created by the Operating System and handed as-is to OpenGL. This means that you have to tell the OS, not OpenGL, to create a framebuffer with one of these special formats. On iOS, you do this by setting the drawableColorFormat of the GLKView. Note that opting in to sRGB is not orthogonal to using other formats - only certain formats are compatible with the sRGB processing.

On iOS, as far as I can tell, OpenGL does not support wide or deep color (because you can’t tell the OS how to interpret the pixel results of OpenGL like you can on macOS - all OpenGL pixels are assumed to be in sRGB). CAEAGLLayer doesn't have a "colorSpace" property. I can’t find any extended-range formats.

Metal

On iOS, Metal supports the same type of sRGB / non-sRGB formats that OpenGL does. You can set the MTKView’s colorPixelFormat to one of the sRGB formats, which has the same effect as it does in OpenGL. Setting it to a non-sRGB format means that blending is performed as-is, which is broken; however, the sRGB formats perform the correct linearization / delinearization for sRGB.

iOS doesn’t support the same sort of color space annotation that macOS does. In particular, a UIWindow or a CALayer doesn’t have a “colorspace” property. Because of this, all colors are expected to be in sRGB. For non-deep and non-wide color, using the regular sRGB pixel formats is sufficient, and these will clamp to the sRGB gamut (meaning clamped between 0 and 1).

And then wide color came along. As noted earlier, wide color and deep color need to happen together, so they aren’t controllable independently. However, there is a conundrum: Because the programmer can’t annotate a particular layer with what color space the values should be interpreted as, how do you represent colors outside of sRGB? The solution is for the color space to be extended beyond the 0 - 1 range. This way, colors within 0 - 1 are interpreted as sRGB as they always have been. However, colors outside that range represent the new wider colors. It’s important to note that, because the new gamut completely includes sRGB, values must be able to be negative as well as greater than 1. A completely saturated red in the display’s native color space (which is similar to P3) has negative components for green and blue.

The mechanism for enabling this is similar to OpenGL: you select a new special pixel format. The new pixel formats have “_XR” in their name, for “extended range.” These formats aren’t clamped to 0 - 1. sRGB also applies here; the new extended range pixel formats have sRGB variants, which perform a similar gamma function as they did before in OpenGL. This gamma function is extended (in the natural way) to values greater than 1. For values less than 0, this gamma curve is flipped around to curve pointing downward (this makes it an “odd” function).
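Here is a small sketch of what “extended in the natural way” and “odd function” mean for the encoding direction. It uses the standard sRGB piecewise curve and simply mirrors it through the origin; the function names are mine, and this is my reading of the behavior described above rather than anything lifted from Metal itself.

```python
import math


def srgb_encode(linear: float) -> float:
    """Standard sRGB encoding curve, applied to any non-negative input."""
    if linear <= 0.0031308:
        return linear * 12.92
    return 1.055 * linear ** (1 / 2.4) - 0.055


def extended_srgb_encode(linear: float) -> float:
    """Mirror the curve through the origin so that f(-x) == -f(x)."""
    return math.copysign(srgb_encode(abs(linear)), linear)


for value in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(value, extended_srgb_encode(value))
```

Inputs above 1 just keep following the same power curve, and negative inputs produce the negated result of their positive counterparts.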

Using these new pixel formats causes your colors to go from 8 bits per channel to 10 bits per channel, which means that there are 4 times as many representable values per channel. Because values can now go below 0 and above 1, some of those values are spent outside the 0 - 1 range, and the number of representable values within 0 - 1 roughly doubles. According to Apple’s pixel format documentation, the non-sRGB variant covers roughly -0.75 to 1.25, and the sRGB variant reaches a larger maximum (roughly 1.67) because of the gamma curve.

On macOS, there is a way to explicitly tell the system how to interpret the color values in an NSWindow or a CALayer using the colorspace property. This works because there is a secondary pass which will convert the pixels into the color space of the monitor. (Presumably iOS doesn’t have this pass for performance reasons, which leads to the restriction on which color space pixel values are interpreted in.) Therefore, to output colors using P3, simply assign the appropriate color space value to the CALayer you are using with Metal. If you do this, remember that “1.0” doesn’t represent sRGB’s 1.0; instead, it represents the most saturated color in the new color space. If you don’t also change your rendering code to compensate for this, your colors will be stretched across the gamut, leading to oversaturated colors and ugly renderings. You can solve this by setting the property to the new “Extended sRGB” color space of CGColor, which gives you the same rendering as iOS (and allows values > 1.0). Note that if you do this, you can’t render to an integer pixel format, because those are clipped at 1.0; instead, you’ll have to render to a floating-point pixel format so that you can have values > 1.0.

So, on iOS, you have one switch which turns on both deep color and wide color, and on macOS, you have two switches, one of which turns on wide color and one of which turns on deep color.

Sunday, May 28, 2017

Chromaticity Diagrams

Humans can see light of the wavelengths between around 380 nm and 780 nm. We see many photons at a time, and we recognize the collection of photons as a particular color. Each photon has a frequency, which means that a particular color is the effect of how much power the photons have at each particular frequency. Put another way, a color is a distribution of power throughout the visible wavelengths of light. For example, if 1/3 of your power is at 700 nm and 2/3 of your power is at 400 nm, the color is a deep purple. This color can be described by the 2-dimensional function:

Different curves over this domain represent different colors. Here is the curve for daylight:

So, if we want to represent a color, we can describe the power function over the domain of visible wavelengths. However, we can do better if we include some biology.

Biology

We have three types of cells (called “cones”) in our eyes which react to light. Each of the three kinds of cones is sensitive to different wavelengths of light. Cones are concentrated in the center of our eye (the “fovea”) and are sparse in our peripheral vision, so this model is only accurate to describe the colors we are directly looking at. Here is a graph of the sensitivities of the three kinds of cones:

Here, you can see that the “S” cones are mostly sensitive to light at around 430 nm, but they still respond to light within a window of about 75 nm around it. You can also see that if all the light entering your eye is at 540 nm, the M cones will respond the most, the L cones will respond strongly (but not as much as the M cones), and the S cones will respond almost not at all.

This means that the entire power distribution of light entering your eye is encoded as three values on its way to your brain. There are significantly fewer degrees of freedom in this encoding than there are in the source frequency distribution. This means that information is lost. Put another way, there are many different frequency distributions which get encoded the same way by your cones’ response.

This is actually a really interesting finding. It means we can represent color by three values instead of a whole function across the frequency spectrum, which results in a significant space savings. It also means that, if you have two colors which appear to “match,” their frequency distributions may not match, so if you perform a modification the same way to both colors, they may cease to match.

If you think about it, though, this is the principle that computer monitors and TVs use. They have phosphors in them which emit light at a particular frequency. When we watch TV, the frequency diagram of the light we are viewing contains three spikes at the three frequencies of phosphors. However, the images we see appear to match our idea of nature, which is represented by a much more continuous and flat frequency diagram. Somehow, the images we see on TV and the images we see in nature match.

Describing color

So color can be represented by a triple of numbers: (Response strength of S cones, Response strength of M cones, Response strength of L cones). Taken together, the achievable combinations of these three values represent every color we can perceive.

It would be great if we could simply represent a color by the response strength of each of the particular kinds of cones in our eyes; however, this is difficult to measure. Instead, let’s pick frequencies of light which we can easily produce. Let’s also select these frequencies such that they will correspond as well as possible to each of the three cones. By varying the power these lights produce, we should be able to produce many of the possible triples, and therefore many of the colors we can see.

In 1931, two experiments attempted to “match” colors using lights of 435.8 nm, 546.1 nm, and 700 nm (let’s call them “blue,” “green,” and “red” lamps). The first two wavelengths are easily created by using mercury vapor tubes, and correspond to the S and M cones, respectively. The last wavelength corresponds to the L cones, and, though it isn’t easily created with mercury, it is insensitive to small errors because the L cones’ response is close to flat in this neighborhood.

So, which colors should be matched? Every color can be decomposed to a collection of power values at particular frequencies. Therefore, if we could find a way to match every frequency in the range observable by humans, this data would be sufficient to match any color. For example, if you have a color with a peak at 680 nm and a peak at 400 nm, and you know that 680 nm light corresponds to our lamp powers of (a, b, c) and 400 nm corresponds to our lamp powers of (d, e, f), then the lamp powers (a + d, b + e, c + f) should match our color.
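The additivity in that example is easy to sketch. The per-wavelength lamp powers below are made-up placeholders (not measured data), and `match_color` is a hypothetical helper; the point is only that the match for a mixture is the power-weighted sum of the per-wavelength matches.

```python
# Hypothetical lamp powers (red, green, blue) needed to match one unit of
# power at each wavelength. These numbers are invented for illustration.
MATCHES = {
    680: (0.017, 0.000, 0.000),  # plays the role of (a, b, c)
    400: (0.000, 0.000, 0.012),  # plays the role of (d, e, f)
}


def match_color(spectrum: dict) -> tuple:
    """spectrum maps wavelength in nm to power; returns summed lamp powers."""
    total = [0.0, 0.0, 0.0]
    for wavelength, power in spectrum.items():
        for i, lamp_power in enumerate(MATCHES[wavelength]):
            total[i] += power * lamp_power
    return tuple(total)


# A color with one unit of power at 680 nm and one unit at 400 nm.
print(match_color({680: 1.0, 400: 1.0}))
```

With a full table covering every visible wavelength, the same loop would turn any power distribution into a lamp triple.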

This was performed by two people: Guild and Wright, using 7 and 10 observers, respectively (and averaging the results). They went through every frequency of light in the visible range, and found how much power they had to make each of the lamps emit in order to match the color.

However, they found something a little upsetting. Consider the challenge of matching light at wavelength 510 nm. At this wavelength, we can see that the S cones would react near 0, and that the M cones would react maybe 20% more than the L cones. So, we are looking for how much power our primaries should emit to construct this same response in our cones.

(The grey bars are our primaries, and the light blue bar is our target)

Our primaries lie at 435.8 nm, 546.1 nm, and 700 nm. So, the blue lamp should be at or near 0; so far so good. If we select a power of the green light which gives us the correct M cone response, we find that it causes too high of an L cone response (because of how the cones overlap). Adding more of the red light only causes the problem to grow. Therefore, because the cones overlap, it is impossible to achieve a color match with this wavelength of light using these primaries.

The solution is to subtract the red light instead of adding it. The reason we couldn’t find a match before is that our green light added too much L cone response. If we could remove some of the L cone’s response, our color would match. We can do this by changing what we match against: instead of matching against 510 nm alone, we match against the sum of 510 nm plus some of our red lamp. This has the effect of subtracting out some of the L cone response, and lets us match our color.

Using this approach, we can construct a graph, where for each wavelength, the three powers of the three lights are plotted. It will include negative values where matches would otherwise be impossible.

X Y Z color space

Once we have this, we can now represent any color by a triple, possibly negative, where each value in the triple represents the power of our particular primary. However, the fact that these values can be negative kind of sucks. In particular, machines were created which can measure colors, but the machines would have to be more complicated if some of the values could be negative.

Luckily, the power of light follows mathematical operations. In particular, addition and scalar multiplication hold. Color A plus color B yields a consistent result, no matter what frequency distribution color A is represented by. The same is true for scaling a color by a constant. This means that we are actually dealing with a vector space. A vector space can be transformed via a linear transformation.

So, the power values of each of the lights at each frequency were transformed such that the resulting values were positive at each frequency. This new, transformed, vector space, is called X Y Z, and is not physically-based.
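As a sketch of what such a transformation looks like, here is the matrix that, as far as I know, is the commonly quoted one relating the CIE 1931 RGB primaries to X Y Z. The input triple is an invented example with a negative red component, not a measured value, and `rgb_to_xyz` is just a name I picked.

```python
# Commonly quoted CIE 1931 RGB -> XYZ matrix (to the best of my knowledge).
CIE_RGB_TO_XYZ = [
    [0.49000, 0.31000, 0.20000],
    [0.17697, 0.81240, 0.01063],
    [0.00000, 0.01000, 0.99000],
]
SCALE = 1 / 0.17697


def rgb_to_xyz(rgb: tuple) -> tuple:
    """Apply the linear transformation to a (red, green, blue) lamp triple."""
    return tuple(
        SCALE * sum(m * c for m, c in zip(row, rgb)) for row in CIE_RGB_TO_XYZ
    )


# Illustrative triple with a negative red component (made-up numbers).
needs_negative_red = (-0.05, 0.13, 0.03)
print(rgb_to_xyz(needs_negative_red))
```

For this particular input, all three outputs come out positive even though the red lamp power was negative. That only holds for triples that correspond to real colors; an arbitrary triple can still map to negative values.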

Given these new non-physical primaries, you can construct a similar graph. It shows, for each frequency of light, how much of each primary is necessary to represent that frequency.

Chromaticity graphs

So, for each frequency of light, we have an associated (X, Y, Z) triple. Let’s plot it on a 3-D graph!

The white curve is our collection of (X, Y, Z) triples. (The red is our unit axes.)

Remember that every visible color is represented as a linear combination of the vectors from the origin to points on this (one-dimensional) curve.

The origin is black, because it represents 0 power. The hue and saturation of the color is described by the orientation of the point, not the distance the point is from the origin. If we want to represent this space in two dimensions, it would make sense to eliminate the brightness component and instead only show the hue and saturation. This can be done by projecting each point onto the X + Y + Z = 1 plane, as seen by the green on the above chart.
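The projection onto the X + Y + Z = 1 plane is just a division by the sum. A minimal sketch, where the function name is mine and the white point values are the commonly quoted D65 tristimulus values:

```python
def chromaticity(X: float, Y: float, Z: float) -> tuple:
    """Project an (X, Y, Z) triple onto the X + Y + Z = 1 plane; return (x, y)."""
    total = X + Y + Z
    if total == 0:
        raise ValueError("black has no chromaticity")
    return (X / total, Y / total)


# Commonly quoted D65 white point, and the same color at half the brightness.
bright = chromaticity(0.95047, 1.0, 1.08883)
dim = chromaticity(0.95047 / 2, 1.0 / 2, 1.08883 / 2)
print(bright, dim)
```

Both calls give the same (x, y) pair, which is the sense in which the projection throws away brightness and keeps only hue and saturation. The third coordinate is redundant because z = 1 - x - y.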

Note that this shape is convex. This is particularly interesting: any point on the contour of the shape, plus any other point on the contour of the shape, yields a point within the interior of the shape (when projected back to our projection plane). Recall that all visible colors are equal to a linear combination of points on the contour of the curve. Therefore, all visible colors equal all the points in the interior of this curve. Points on the exterior of this curve represent colors with negative cone response for at least one type of cone (which cannot happen).

So, the inside of this horseshoe shape represents every visible color. This shape is usually visualized by projecting it down to the (X, Y) plane. That yields the familiar diagram:

The inside of this horseshoe represents every color we can see. Also, you can notice that, because X Y Z was constructed so that every visible color has positive coordinates, and the projection we are viewing is onto the X + Y + Z = 1 plane, all the points on the diagram are below the X + Y = 1 line.

Color spaces

A color space is usually represented as three primary colors, as well as a white point (or a maximum bounds on the magnitude of each primary color). The colors in the color space are usually represented as a linear combination of the primary colors (subject to some maximum). In our chromaticity diagram, we aren’t concerned with brightness, so we can ignore these maximum values (and associated white points). Because we know the representable colors in a color space are a linear combination of the primaries, we can plot the primaries in X, Y, Z color space and project them to the same X + Y + Z = 1 plane. Using the same logic we used above, we know that the representable colors in the color space are on the interior of the triangle realized by this projection.
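Checking whether a chromaticity falls inside a color space’s triangle is a small geometry exercise. The sketch below uses the published (x, y) chromaticities of the sRGB and Display P3 primaries; the helper names and the sample green point are my own choices.

```python
SRGB = ((0.640, 0.330), (0.300, 0.600), (0.150, 0.060))
DISPLAY_P3 = ((0.680, 0.320), (0.265, 0.690), (0.150, 0.060))


def cross(o: tuple, a: tuple, b: tuple) -> float:
    """2-D cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_gamut(point: tuple, triangle: tuple) -> bool:
    """True if point lies inside (or on the edge of) the triangle."""
    a, b, c = triangle
    signs = (cross(a, b, point), cross(b, c, point), cross(c, a, point))
    has_negative = any(s < 0 for s in signs)
    has_positive = any(s > 0 for s in signs)
    return not (has_negative and has_positive)


white = (0.3127, 0.3290)       # D65 white point
saturated_green = (0.28, 0.65)  # arbitrary sample point

print(in_gamut(white, SRGB), in_gamut(saturated_green, SRGB))
print(in_gamut(white, DISPLAY_P3), in_gamut(saturated_green, DISPLAY_P3))
```

The white point is inside both triangles, while the sample green is outside sRGB but inside Display P3, which is the kind of color the wider gamut buys you.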

You can see the result of this projection for the primaries of sRGB in the shaded triangle in the above chart. As you can see, there are many colors that human eyes can see which aren’t representable within sRGB. The chart also allows you to toggle the bounding triangle for the Display P3 color space (which uses the DCI-P3 primaries), which Apple recently released on some of its devices. You can see how Display P3 includes more colors than sRGB.

Because the shape of all visible colors isn’t a triangle, it isn’t possible to create a color space where each primary is a visible color and the colorspace encompasses every visible color. If your color space encompasses every visible color, the primaries must lie outside of the horseshoe and are therefore not visible. If your primaries lie inside the horseshoe, there are visible colors which cannot be captured by your primaries. Having your primaries be real physical colors is valuable so that you can, for example, actually build physical devices which include your primaries (like the phosphors in a computer monitor). You can get closer to encompassing every visible color if you increase the number of primaries to 4 or 5, at the cost of making each color "fatter."

Keep in mind that these chromaticity diagrams (which are the ones in 2D above) are only useful for plotting individual points. Specifically, 2-D distances across this diagram are not meaningful. Points that are close together on the diagram may not be visually similar, and points which are visually similar may not be close together on the above diagram.

Also, when reading these horseshoe graphs, realize that they are simply projections of a 3D graph onto a somewhat-arbitrary plane. A better visualization of color would include all three dimensions.

Saturday, May 20, 2017

Relationship Between Glyphs and Code Points

Recently, there have been some discussions about various Unicode concepts like surrogate pairs, variation selectors, and combining clusters, and I thought I could shed some light on how these pieces all fit together and their relationship with the things we actually see on screen.

tl;dr: The relationship between what you see on the screen and the Unicode string behind it is completely arbitrary.

The biggest piece to understand is the difference between Unicode's specs and the contents of font files. A string is a sequence of code points. Certain code points have certain meanings, which can affect things like the width of the rendered string, caret placement, and editing commands. Once you have a string and you want to render it, you partition it into runs, where each run can be rendered with a single font (and has other properties, like a single direction throughout the run). You then map code points to glyphs one-to-one, and then you run a Turing-complete "shaping" pass over the sequence of glyphs and advances. Once you've got your shaped glyphs and advances, you can finally render them.

Code Points

Alright, so what's a code point? A code point is just a number. There are many specs which describe a mapping of number to meaning. Most of them are language-specific, which makes sense because, in any given document, there will likely only be a single language. For example, in the GBK encoding, character number 33088 (which is 0x8140 in hex) represents the 丂 character in Chinese. Unicode includes another such mapping. In Unicode, this same character number represents the 腀 character in Chinese. Therefore, the code point number alone is insufficient unless you know what encoding it is in.
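You can see the same number meaning two different things with a couple of lines of Python. This is just a sketch using the standard library's "gbk" codec; the variable names are mine.

```python
# The two bytes 0x81 0x40 form character number 0x8140 in GBK.
raw = bytes([0x81, 0x40])
as_gbk = raw.decode("gbk")

# The same number, interpreted directly as a Unicode code point.
as_unicode = chr(0x8140)

print(as_gbk, "is U+%04X in Unicode" % ord(as_gbk))
print(as_unicode, "is U+%04X in Unicode" % ord(as_unicode))
```

The first line prints a character whose Unicode code point is nowhere near 0x8140, which is the whole point: the number only means something once you know which mapping it belongs to.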

Unicode is special because it aims to include characters from every writing system on the planet. Therefore, it is a convenient choice for an internal encoding inside text engines. For example, if you didn't have a single internal encoding, all your editing commands would have to be reimplemented for each supported encoding. For this reason (and potentially some others), it has become the standard encoding for most text systems.

UTF-32

In Unicode, there are over 1 million available options for code points (0 through 0x10FFFF), though most of those haven't been assigned a meaning yet. This means you need 21 bits to represent a code point. One way to do this is to use a 32-bit type and pad it out with zeroes. This is called UTF-32 (which is just a way of mapping a 21-bit number to a sequence of bytes so it can be stored). If you have one of these strings on disk or in memory, you need to know the endianness of each of these 4-byte numbers so that you can properly interpret it. You should already have an out-of-band mechanism to know what encoding the string is in, so this same mechanism is often re-used to describe the endianness of the bytes. (On the Web, this is HTTP headers or the <meta> tag.) There's also this neat hack called the Byte Order Mark, if you don't have any out-of-band data.
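Here is a quick sketch of the endianness issue and the Byte Order Mark trick, using Python's built-in codecs. The `decode_utf32` helper is a hypothetical name of my own; real decoders handle more cases.

```python
import codecs

pile_of_poo = chr(0x1F4A9)

little = pile_of_poo.encode("utf-32-le")  # least significant byte first
big = pile_of_poo.encode("utf-32-be")     # most significant byte first
print(little.hex(), big.hex())


def decode_utf32(data: bytes) -> str:
    """Use a leading Byte Order Mark to decide how to read the rest."""
    if data.startswith(codecs.BOM_UTF32_LE):
        return data[4:].decode("utf-32-le")
    if data.startswith(codecs.BOM_UTF32_BE):
        return data[4:].decode("utf-32-be")
    raise ValueError("no BOM; endianness must come from out-of-band data")


print(decode_utf32(codecs.BOM_UTF32_BE + big) == pile_of_poo)
```

The same four bytes appear in opposite order in the two encodings, and the BOM is simply a known code point placed at the front so a reader can tell which order it is looking at.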

UTF-16

Unfortunately, including 11 bits of 0s for every character is kind of wasteful. There is a more efficient encoding, called UTF-16. In this encoding, each code point may be encoded as either a single 16-bit number or a pair of 16-bit numbers. For code points which fit into a 16-bit number naturally, the encoding is the identity function. Unfortunately, there are over a million (0x100000) code points remaining which don't fit into a 16-bit number themselves. Because there are 20 bits of entropy in these remaining code points, we can split it into a pair of 10-bit numbers, and then encode this pair as two successive "code units." Once you've done that, you need a way of knowing, if someone hands you a 16-bit number, if it's a standalone code point or if it's part of a pair. This is done by reserving two 10-bit ranges inside the character mapping. By saying that code points 0xD800 - 0xDBFF are invalid, and code points 0xDC00 - 0xDFFF are invalid, we can now use these ranges to encode these 20-bit numbers. So, if someone hands you a 16-bit number, if it's in the first of those ranges, you know you need to read a second 16-bit number, mask the 10 low bits of each, shift them together, and add to 0x10000 to get the real code point (otherwise, the number is equal to the code point it represents).
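The mask-and-shift dance is short enough to write out. This is a sketch of the arithmetic described above, working on lists of integers rather than real strings; the function names are mine, and passing unpaired surrogates through untouched is just one possible policy.

```python
def encode_utf16(code_point: int) -> list:
    """Return the one or two 16-bit code units for a code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid code points")
    if code_point < 0x10000:
        return [code_point]
    value = code_point - 0x10000  # 20 bits remain
    return [0xD800 | (value >> 10), 0xDC00 | (value & 0x3FF)]


def decode_utf16(units: list) -> list:
    """Return code points; unpaired surrogates are passed through as-is."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            high_bits = (unit & 0x3FF) << 10
            low_bits = units[i + 1] & 0x3FF
            result.append(0x10000 + high_bits + low_bits)
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


print([hex(u) for u in encode_utf16(0x1F4A9)])
print([hex(c) for c in decode_utf16([0xD83D, 0xD83D, 0xDCA9])])
```

The second call feeds in a stray high surrogate followed by a valid pair. Because the two ranges are distinct, the decoder can tell the first unit is unpaired and still recover the code point that follows it.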

There are some interesting details here. The first is that the two 10-bit ranges are distinct. It could have been possible to re-use the same 10-bit range for both items in the pair (and use its position in the pair to determine its meaning). However, if you have an item missing from a long string of these surrogates, it may cause every code point after the missing one to be wrong. By using distinct ranges, if you come across an unpaired surrogate (like two high surrogates next to each other), most text systems will simply consider the first surrogate alone, treat it like an unsupported character, and resume processing correctly at the next surrogate.

UTF-8

There's also another one called UTF-8, which represents code points as either 1, 2, 3, or 4 byte sequences. Because it uses bytes, endianness is irrelevant. However, the encoding is more complicated and it can be less efficient for some strings than UTF-16. It does have the nice property, however, that no byte within a UTF-8 string can be 0 (unless the string contains the code point U+0000 itself), which means it is compatible with C strings.
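A quick way to see the variable-length behavior is to encode a few characters and count bytes. The sample characters are arbitrary picks of mine, written as escapes so the snippet does not depend on the file's own encoding.

```python
def utf8_length(character: str) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(character.encode("utf-8"))


samples = ["A", "\u00e9", "\u4e02", "\U0001F4A9"]
for character in samples:
    print("U+%04X" % ord(character), utf8_length(character), "byte(s)")

# No zero bytes appear unless the string contains U+0000 itself.
encoded = "".join(samples).encode("utf-8")
print(0 in encoded)
```

ASCII stays at one byte, accented Latin letters take two, most CJK characters take three, and everything above U+FFFF takes four.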

"💩".length === 2

Because its encoding is somewhat simple, but fairly compact, many text systems including Web browsers, ICU, and Cocoa strings use UTF-16. This decision has actually had kind of a profound impact on the web. It is the reason that the "length" attribute on emoji returns 2: the "length" attribute returns the number of code units in the UTF-16 string, not the number of code points. If it wanted to return the number of code points, it would require linear time to compute. The choice of which number represents which "character" (or emoji) isn't completely arbitrary, but some things we think of as emoji actually have a number value less than 0x10000. This is why some emoji have a length of two but some have a length of one.
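Python strings count code points rather than UTF-16 code units, so to reproduce what JavaScript's "length" reports you have to encode first. A small sketch, with helper names of my own:

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's length reports."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text: str) -> int:
    """Number of code points, which is what Python's len reports."""
    return len(text)


pile_of_poo = "\U0001F4A9"  # above 0x10000, needs a surrogate pair
snowman = "\u2603"          # below 0x10000, fits in one code unit

for text in (pile_of_poo, snowman):
    print(utf16_length(text), code_point_length(text))
```

Both strings are a single code point, but only the first one costs two code units.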

Combining code points

Unicode also includes the concept of combining marks. The idea is that if you want to have the character "é", you can represent it as the "e" character followed by U+301 COMBINING ACUTE ACCENT. This is so that every combination of diacritic marks and base characters doesn't have to be encoded in Unicode. It's important because, once a code point is assigned a meaning, it can never ever be un-assigned.

To make matters worse, there is also a standalone code point U+E9 LATIN SMALL LETTER E WITH ACUTE. When doing string comparisons, these two strings need to be equal. Therefore, string comparisons aren't just raw byte comparisons.

This idea can happen even without these zero-width combining marks. In Korean, adjacent letters in words are grouped up to form blocks. For example, the letters ㅂ ㅏ ㅂ join to form the Korean word for "rice": 밥 (read from top left to bottom right). Unicode includes a code point for each letter of the alphabet (ㅂ is U+3142 HANGUL LETTER PIEUP), as well as a code point for each joined block (밥 is U+BC25). It also includes joining letters, so 밥 can be represented as a single code point, but can also be represented by the string:

U+1107 HANGUL CHOSEONG PIEUP
U+1161 HANGUL JUNGSEONG A
U+11B8 HANGUL JONGSEONG PIEUP

This means, in JavaScript, you can have two strings which are treated exactly equally by the text system, and look visually identical (they literally have the same glyph drawn on screen), but have different lengths in JavaScript.

Normalization

One way to perform these string comparisons is to use Unicode's notion of "normalization." The idea is that strings which are conceptually equal should be normalized to the same sequence of code points. There are a few different normalization algorithms, depending on if you want the string to be exploded as much as possible into its constituent parts, or if you want it to be combined to be as short as possible, etc.
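Python ships the normalization algorithms in its standard library, so the comparisons above can be tried directly. The `canonically_equal` helper is a hypothetical name of mine; it picks NFC, the "as short as possible" form, but NFD would work equally well for comparing.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed_e = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT
precomposed_e = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
print(decomposed_e == precomposed_e, canonically_equal(decomposed_e, precomposed_e))

# The three joining Hangul letters listed earlier collapse to one code point.
jamo = "\u1107\u1161\u11b8"
composed = unicodedata.normalize("NFC", jamo)
print(len(jamo), len(composed), "U+%04X" % ord(composed))
```

A raw comparison says the two spellings of the accented letter differ, while the normalized comparison says they match, and the Hangul sequence shrinks from three code points to one.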

Fonts

When reading text, people see pictures, not numbers. Or, put another way, computer monitors are not capable of showing you numbers; instead, they can only show pictures. All the picture information for text is contained within fonts. Unicode doesn't describe what information is included in a font file.

When people think of emoji, people usually think of it as the little color pictures inside our text. These little color pictures come from font files. A font file can do whatever it wants with the string it is tasked with rendering. It can draw emoji without color. It can draw non-emoji with color. The idea of color in a glyph is orthogonal to whether or not a code point is classified as "emoji."

Similarly, a font can include a ligature, which draws multiple code points as a single glyph ("glyph" just means "picture"). A font can also draw a single code point as multiple glyphs (for example, an accent over é may be implemented as a separate glyph from e). But it doesn't have to. The choice of what glyphs to use where is totally an implementation detail of the font. The choice of which glyphs include color is totally an implementation detail of the font. Some ligatures get caret positions inside them; others don't.

For example, Arabic is a handwritten script, which means that the letters flow together from one to the next. Here are two images of two different fonts (Geeza Pro and Noto Nastaliq Urdu) rendering the same string, where each glyph is painted in a different color. You can see that both fonts show the string with a different number of glyphs. Sometimes diacritics are contained within their base glyph, but sometimes not.

Variation Selectors

There are other classes of code points which are invisible and are added after a base code point to modify it. One example is Variation Selector 15 and Variation Selector 16. The problem these try to solve is the fact that some code points may be drawn in either text style (☃︎) or emoji style (☃️). Variation Selector 16 is an invisible code point that means "please draw the base character like an emoji" while #15 means "please draw the base character like text." The platform also has a default representation which is used when no variation selector is present. Unicode includes a table of which code points should be able to accept these variation selectors (but, like everything Unicode creates, it affects but doesn't dictate implementations).

These variation selectors are a little special because they are the only combining code points I know of that can interact with the "cmap" table in the font, and therefore can affect font selection. This means that a font can say "I support the snowman code point, but not the emoji style of it." Many text systems have special processing for these variation selectors.

Zero-Width-Joiner Sequences

Rendering on old platforms is also important when Unicode defines new emoji. Some Unicode characters, such as "👨‍👩‍👧" are a collection of things (people) which can already be represented with other code points. This specific "emoji" is actually the string of code points:

U+1F468 MAN
U+200D ZERO WIDTH JOINER
U+1F469 WOMAN
U+200D ZERO WIDTH JOINER
U+1F467 GIRL

The zero width joiners are necessary for backwards compatibility. If someone had a string somewhere that was just a list of people in a row, the creation of this new "emoji" shouldn't magically join them up into a family. The benefit of using the collection of code points is that older systems showing the new string will show something understandable instead of just an empty square. Fonts often implement these as ligatures. Unicode specifies which sequences should be represented by a single glyph, but, again, it's up to each implementation to actually do that, and implementations vary.
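The code point list above can be checked with a few lines of Python. This is only a sketch for inspecting the string; the `people` list is a rough stand-in for what a system without the ligature would end up showing, not a model of any real renderer:

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

def names(s):
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(c) for c in s]

# Without a family ligature, the joiners are invisible and
# the three people are drawn one after another.
people = [c for c in family if c != ZWJ]

print(names(family))
print(names(people))
```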

Caret Positions

Similarly to how Unicode describes sequences of codepoints which should visually combine to a single thing, Unicode also describes what a "character" is, in the sense of what most people mean when they say "character." Unicode calls this a "grapheme cluster." Part of the ICU library (which implements pieces of Unicode) creates iterators which will give you all the locations where lines can break, words can be formed (in Chinese this is hard), and characters' boundaries lie. If you give it the string of "e" followed by U+301 COMBINING ACUTE ACCENT, it should tell you that these codepoints are part of the same grapheme cluster. It does this by ingesting data tables which Unicode creates.
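To get a feel for the idea without ICU, here is a deliberately simplified Python sketch. It is not the Unicode segmentation algorithm that ICU implements: it only attaches code points whose general category is a mark (Mn, Mc, Me) to the code point before them, and it ignores joiner sequences, Hangul, and everything else the real data tables cover. The function name is made up for this example.

```python
import unicodedata

def approximate_clusters(s):
    """Group each mark (general category M*) with the preceding code point.

    A rough approximation only; real grapheme cluster boundaries
    need the full Unicode rules and data tables.
    """
    clusters = []
    for c in s:
        if clusters and unicodedata.category(c).startswith("M"):
            clusters[-1] += c   # mark: extend the current cluster
        else:
            clusters.append(c)  # anything else starts a new cluster
    return clusters

# "e" + COMBINING ACUTE ACCENT stays together; "x" stands alone.
print(approximate_clusters("e\u0301x"))
```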

However, this isn't quite sufficient to know where to put the caret when the user presses the arrow keys, delete key, or forward-delete key (Fn + delete on macOS). Consider the following string in Hindi "कि". This is composed of the following two code points:

U+915 DEVANAGARI LETTER KA
U+93F DEVANAGARI VOWEL SIGN I

Here, if you select the text or use arrow keys, the entire string is selected as a unit. However, if you place the caret after the string and press delete, only the U+93F is deleted. This is particularly confusing because this vowel sign is actually drawn to the left of the letter, so it isn't even adjacent to the caret when you press delete. (Hindi is a left-to-right script.) If you place the caret just before the string and press the forward delete key (Fn + delete), both code points get deleted. The user expectations for the results of these kinds of editing commands are somewhat platform-specific, and aren't entirely codified in Unicode currently.
Try it out here:
==> कि <==
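The asymmetry described above can be modeled in a few lines of Python. This only mimics the behavior described in this post for this one string; the function names are made up, the cluster detection is a crude "base plus following marks" approximation, and real platforms differ in their rules.

```python
import unicodedata

KI = "\u0915\u093F"  # DEVANAGARI LETTER KA + DEVANAGARI VOWEL SIGN I

def cluster_end(s, start):
    """Index just past the approximate cluster that begins at `start`."""
    end = start + 1
    while end < len(s) and unicodedata.category(s[end]).startswith("M"):
        end += 1
    return end

def backspace(s, caret):
    """Delete key: remove only the single code point before the caret."""
    if caret == 0:
        return s, 0
    return s[:caret - 1] + s[caret:], caret - 1

def forward_delete(s, caret):
    """Forward delete: remove the whole cluster after the caret."""
    if caret >= len(s):
        return s, caret
    return s[:caret] + s[cluster_end(s, caret):], caret

print(backspace(KI, len(KI)))  # leaves only the letter KA
print(forward_delete(KI, 0))   # leaves an empty string
```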

The Chinese language is many thousands of years old. In the 1950s and 1960s, the Chinese government (PRC) decided that their characters had too many strokes, and simplifying the characters would increase literacy rates. So, they decided to change how about 1/3 of the characters were written. Some of the characters were untouched, some were touched only very slightly, and some were completely changed.

When Unicode started codifying these characters, they had to figure out whether or not to give these simplified characters new code points. For the code points which were completely unchanged, it is obvious they shouldn't get their own code points. For code points which were entirely changed, it is obvious that they should get their own code points. However, what about the characters which changed only slightly? The characters were decided on a case-by-case basis, and some of these slightly-changed characters did not receive their own new code points.

This is really problematic for a text engine, because there is a discernible difference between the two forms, and if you show the wrong one, it's wrong. This means that the text engine has to know out-of-band which one to show.

Here's an example showing the same code point with two different "lang" tags.
Simplified Chinese:

There are a few different mechanisms for this. HTML includes the "lang" attribute, which includes whether or not the language is supposed to be simplified or traditional. This is used during font selection. On macOS and iOS, every Chinese face actually includes two font files: one for Simplified Chinese and one for Traditional Chinese. (For example, PingFang SC and PingFang TC.) Browsers use the language of the element when deciding which of these fonts to use. If the lang tag isn't present or doesn't include the information browsers need, browsers will use the language the machine is configured to use.

Rather than including two separate fonts for every face, another mechanism to implement this is by using font features. This is part of that "shaping" step I mentioned earlier. This shaping step can include a set of key/value pairs provided by the environment. CSS controls this with the font-variant-east-asian property. This works by having the font include glyphs for both kinds of Chinese, and the correct one is selected as part of text layout. This only works, however, with text renderers which support complex shaping and font features.

I think there's at least one other way to have a single font file be able to draw both simplified and traditional forms, but I can't remember what it is right now.

Wednesday, November 16, 2016

Single Screen GPU Handoff

Over the past few years, a number of laptops have been released with two graphics cards. The idea is that one is low-power and one is high-power. When you want long battery life, you can use the low-power GPU, but when you want high performance, you can use the high-power GPU. However, there is a wrinkle: the laptop only has one screen.

The screen’s contents have to come from somewhere. One way to implement this system would be to daisy-chain the two GPUs, thereby keeping the screen always plugged into the same GPU. In this system, the primary GPU (which the screen is plugged into) would have to be told to give the results of the secondary GPU to the screen.

A different approach is to connect both GPUs in parallel with a switch between them. The system will decide when to flip the switch between each of the GPUs. When the screen is connected to one GPU, the other GPU can be turned off completely.

The question, then, is how this looks to a user application. I’ll be investigating three different scenarios here. Note that I’m not discussing what happens if you drag a window between two different monitors each plugged into a separate card; instead, I’m discussing the specific hardware which allows multiple graphics cards to display to the same monitor.

OpenGL on macOS

On macOS, you can tell which GPU your OpenGL context is running on by running glGetString(GL_VENDOR). When you create your context, you declare whether or not you are capable of using the low-power GPU (the high-power GPU is the default). macOS has the design where if any context requires the high-power GPU, the whole system is flipped to use it. This is observable by using gfxCardStatus. This means that the whole system may switch out from under you while your app is running because of something a completely different app did.

For many apps, this isn’t a problem because macOS will copy your OpenGL resources between the GPUs, which means your app may be able to continue without caring that the switch occurred. This works because the OpenGL context itself survives the switch, but the internal renderer changes. Because the context is still alive, your app can likely continue.

The problem, though, is with OpenGL extensions. Different renderers support different extensions, and app logic may depend on the presence of an extension. On my machine, the high-powered GPU supports both GL_EXT_depth_bounds_test and GL_EXT_texture_mirror_clamp, but the low-powered one doesn’t. Therefore, if an app relies on an extension, and the renderer changes in the middle of operation, the app may malfunction. The way to fix this is to listen to the NSWindowDidChangeScreenNotification in the default NSNotificationCenter. When you receive this notification, re-interrogate the OpenGL context for its supported extensions. Note that switching in both directions may occur - the system switches to the high-power GPU when some other app is launched, and the system switches back when that app is quit.

You only have to do this if you opt in to running on the low-power GPU. If you don't opt in, your app runs on the high-power GPU and is itself one of the apps keeping the system there, so the system will never switch back while your app is alive.
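The "re-interrogate after a switch" logic is independent of the graphics API, so here is a language-neutral sketch of it in Python. Everything in it is an assumption made for illustration: `FeatureCache` is a hypothetical helper, and the two extension strings are made up. In a real app, the string would come from the OpenGL context (for example, the space-separated result of glGetString(GL_EXTENSIONS) in a legacy profile), and `refresh` would be called from the notification handler.

```python
def parse_extensions(ext_string):
    """Split a space-separated extension string into a set of names."""
    return set(ext_string.split())

class FeatureCache:
    """Hypothetical helper that remembers which optional paths are usable."""

    def __init__(self, ext_string):
        self.refresh(ext_string)

    def refresh(self, ext_string):
        # Call this again whenever the renderer may have changed.
        exts = parse_extensions(ext_string)
        self.depth_bounds = "GL_EXT_depth_bounds_test" in exts
        self.mirror_clamp = "GL_EXT_texture_mirror_clamp" in exts

# Made-up extension strings, for illustration only.
HIGH_POWER = ("GL_EXT_depth_bounds_test GL_EXT_texture_mirror_clamp "
              "GL_EXT_texture_filter_anisotropic")
LOW_POWER = "GL_EXT_texture_filter_anisotropic"

cache = FeatureCache(HIGH_POWER)
print(cache.depth_bounds, cache.mirror_clamp)
cache.refresh(LOW_POWER)  # simulate the system switching GPUs
print(cache.depth_bounds, cache.mirror_clamp)
```

The point of the design is that the app never caches an answer about an extension for longer than the renderer that gave it.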

Metal on macOS

Metal takes a different approach. When you want to create a MTLDevice, you must choose which GPU your device reflects. There is an API call, MTLCopyAllDevices(), which will simply return a list, and you are free to interrogate each device in the list to determine which one you want to run on. In addition, there’s a MTLCreateSystemDefaultDevice() which will simply pick one for you. On my machine, this “default device” isn’t magical - it is simply exactly equal (by pointer equality) to one of the items in the list that MTLCopyAllDevices() returns. On my machine, it returns the high-powered GPU.

However, MTLDevices don’t have the concept of an internal renderer. In fact, even if you cause the system to change the active GPU (using the above approach of making another app create an OpenGL context), your MTLDevice still refers to the same device that it did when you created it.

I was suspicious of this, so I ran a performance test. I created a shader which got 28 fps on the high-powered GPU and 11 fps on the low-powered one. While this program was running on the low-powered GPU, I opened up an OpenGL app which I knew would cause the system to switch to the high-powered GPU, and I saw that the app’s fps didn’t change. Therefore, the Metal device doesn’t migrate to a new GPU when the system switches GPUs.

Another interesting thing I noticed during this experiment was that the Metal app was responsive throughout the entire test. This means that the rendering was being performed on the low-power GPU, but the results were being shown on the high-power GPU. I can only guess that this means that the visual results of the rendering are being copied between GPUs every frame. This would also seem to mean that both GPUs were on at the same time, which seems like it would be bad for battery life.

DirectX 12 on Windows 10

I recently bought a Microsoft Surface Book which has the same kind of setup: one low-power GPU and one high-power GPU. Similarly to Metal, when you create a DirectX 12 context, you have to select which adapter you want to use. IDXGIFactory4::EnumAdapters1() returns a list of adapters, and you are free to interrogate them and choose which one you prefer. However, there is no separate API call to get the default adapter; there is simply a convention that the first device in the list is the one you should be using, and that it is the low-power GPU.

As I stated above, on macOS, switching to the discrete GPU is all-or-nothing - the screen’s signal is either coming from the high-power GPU or the low-power GPU.  I don’t know whether or not this is true on Windows 10 because I don’t know of a way to observe it there.

However, an individual DirectX 12 context won’t migrate between GPUs on Windows 10. This is observable with a test similar to the one described above. Automatic migration occurred on previous versions of Windows, but it doesn’t occur now.

Therefore, the model here is similar to Metal on macOS, so it seems like the visual results of rendering are copied between the two cards, and that both cards are kept on at the same time if there are any contexts executing on the high-power GPU.

However, the Surface Book has an interesting design: the high-power GPU is in the bottom part of the laptop, near the keyboard, and the laptop’s upper (screen) half can separate from the lower half. This means that the high-power GPU can be removed from the system.

Before the machine’s two parts can be separated, the user must press a special button on the keyboard which is more than just a physical switch. It causes software to run which inspects all the contexts on the machine to determine if any app is using the high-powered GPU on the bottom half of the machine. If it is being used by any app, the machine refuses to separate from the base (and shows a pop up asking the user to please quit the app, or presumably just destroy the DirectX context). There is currently no way for the app to react to the button being pressed so that it could destroy its context. Instead, currently, the user must quit the app.

However, it is possible to lose your DirectX context in other ways. For example, if a user connects to your machine via Terminal Services (similar to VNC), the system will switch from a GPU-accelerated environment to a software-rendering environment. To an app, this looks like the call to IDXGISwapChain3::Present() returning DXGI_ERROR_DEVICE_REMOVED or DXGI_ERROR_DEVICE_RESET. Apps should react to this by destroying their device and re-querying the system for the present devices. This sort of thing will also happen when Windows Update updates GPU drivers or when some older Windows versions (before Windows 10) perform a global low-power to high-power (or vice-versa) switch. So, a well-formed app should already be handling the DEVICE_REMOVED error. Unfortunately, this doesn’t help the use case of separating the two pieces of the Surface Book.

Thanks to Frank Olivier for lots of help with this post.

Friday, September 30, 2016

Variation Fonts Demo

Try opening this in a recent Safari nightly build.

The first line shows the text with no variations.
The second line animates the weight.
The third line animates the width.
The fourth line animates both.

hamburgefonstiv
hamburgefonstiv
hamburgefonstiv
hamburgefonstiv

Thursday, September 22, 2016

Variable Fonts in CSS Draft

Recently, the CSS Working Group in the W3C resolved to pursue adding support for variable fonts within CSS. A draft has been added to the CSS Fonts Level 4 spec. Your questions and comments are extremely appreciated, and will help shape the future of variation fonts support in CSS! Please add them to either a new CSS GitHub issue, tweet at @Litherum, email to mmaxfield@apple.com, or use any other means to get in contact with anyone at the CSSWG! Thank you very much!

Here is what CSS would look like using the current draft:

1. Use a preinstalled font with a semibold weight:

`<div style="font-weight: 632;">hamburgefonstiv</div>`

2. Use a preinstalled font with a semicondensed width:

`<div style='font-stretch: 83.7%;'>hamburgefonstiv</div>`

3. Use the "ital" axis to enable italics:

```
<!-- Note: No change! The browser can enable variation italics automatically. -->
<div style="font-style: italic;">hamburgefonstiv</div>
```

4. Set the "fancy" axis to 9001:

`<div style="font-variation-settings: 'fncy' 9001;">hamburgefonstiv</div>`

5. Animate the weight and width axes together:

```
@keyframes zooming {
  from { font-variation-settings: 'wght' 400, 'wdth' 85; }
  to { font-variation-settings: 'wght' 800, 'wdth' 105; }
}
<div style="animation-duration: 3s; animation-name: zooming;">hamburgefonstiv</div>
```

6. Use a variation font as a web font (without fallback):

```
@font-face {
  /* Note that this is identical to what you currently do today! */
  font-family: "VariationFont";
  src: url("VariationFont.otf");
}
<div style="font-family: 'VariationFont';">hamburgefonstiv</div>
```

7. Use a variation font as a web font (with fallback):

```
@font-face {
  font-family: 'FancyFont';
  src: url("FancyFont.otf") format("opentype-variations"),
       url("FancyFont-600.otf") format("opentype");
  font-weight: 600;
  /* Old browsers would fail to parse "615",
     so it would be ignored and 600 remains.
     New browsers would parse it correctly so 615 would win.
     Note that, because of the font selection
     rules, the font-weight descriptor above may
     be sufficient thereby making the font-weight
     descriptor below unnecessary. */
  font-weight: 615;
}
#fancy {
  font-family: "FancyFont";
  font-weight: 600;
  font-weight: 615;
}
<div id="fancy">hamburgefonstiv</div>
```

8. Use two variations of the same variation font:

```
@font-face {
  font-family: "VariationFont";
  src: url("VariationFont.otf");
  font-weight: 400;
}
<div style="font-family: VariationFont; font-weight: 300;">hamburgefonstiv</div>
<div style="font-family: VariationFont; font-weight: 700;">hamburgefonstiv</div>
```

9. Combine two variation fonts together as if they were a single font: one for weights 1-300 and another for weights 301-999:

```
@font-face {
  font-family: "SegmentedVariationFont";
  src: url("SegmentedVariationFont-LightWeights.otf");
  font-weight: 1;
}
@font-face {
  /* There is complication here due to the peculiar nature of the font selection rules. */
  /* Note how this block uses the same source file as the block below. */
  font-family: "SegmentedVariationFont";
  src: url("SegmentedVariationFont-HeavyWeights.otf");
  font-weight: 301;
}
@font-face {
  font-family: "SegmentedVariationFont";
  src: url("SegmentedVariationFont-HeavyWeights.otf");
  font-weight: 999;
}
```
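To see which declared face a requested weight lands on, here is a Python sketch of the weight-matching step as I read it from the CSS Fonts font selection rules, applied to the three declared weights above (1, 301, 999). The function name is made up, it only models the case where each face declares a single weight, and it should be treated as an approximation of the spec rather than a statement of what any browser does.

```python
def pick_weight(desired, available):
    """Choose one of the declared weights for a desired font-weight.

    Assumed search order (single-valued descriptors only):
      400..500: up to 500 first, then lighter, then heavier than 500
      below 400: lighter first, then heavier
      above 500: heavier first, then lighter
    """
    if not available:
        raise ValueError("no faces declared")
    if desired in available:
        return desired
    lighter = sorted((w for w in available if w < desired), reverse=True)
    heavier = sorted(w for w in available if w > desired)
    if 400 <= desired <= 500:
        up_to_500 = [w for w in heavier if w <= 500]
        beyond_500 = [w for w in heavier if w > 500]
        order = up_to_500 + lighter + beyond_500
    elif desired < 400:
        order = lighter + heavier
    else:
        order = heavier + lighter
    return order[0]

declared = [1, 301, 999]
for requested in (200, 350, 450, 700):
    print(requested, "->", pick_weight(requested, declared))
```

With these declarations, a request for 200 resolves to the face declared at 1 (the light file), while 350, 450, and 700 resolve to faces backed by the heavy file.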