Monday, December 15, 2014

Intel GPU Accelerated 2D Graphics

Recently, I’ve been reading the code to an open source OpenGL implementation called Mesa 3D, which includes a driver for Intel graphics cards. While reading the Intel 3D driver, I started wondering how accelerated 2D graphics works on the same card.

Intel’s documentation mentions that their GPUs have a 2D engine, and it takes the form of a blitting engine ([1] Section 4.1.2). That’s all you get as far as 2D graphics are concerned. The blitter is actually fairly simple compared to the 3D pipeline. It only contains 26 possible commands, and only three of those are used for setting up state. Each of the rest specifies a particular hardware-specific kind of blit to perform, and tells the hardware to execute it.

All my research is being done in the open source stack on FreeBSD, which means that applications are using X11 to get their drawing on screen. X11 has a client/server model, where the application acts as a client and connects to the X11 server, which is actually connected to the display hardware. In this model, the application programmer asks the server to create a Drawable, usually a window on screen, and then asks the server to create a Graphics Context which targets that Drawable. Then, the application sends drawing commands to the X11 server for it to execute. Therefore, the X11 server is the thing actually executing the drawing commands.

Aside: DRI, the mechanism that Mesa 3D uses, bypasses this by asking the X11 server to make a Drawable available directly to the client, and the client is then responsible for drawing into it and notifying the X11 server when it is done drawing. That, however, is currently only used for OpenGL hardware acceleration, and not what I’m concerned with here. Cairo, also, has an Intel-specific DRI backend, so they have their own implementation of hardware-accelerated 2D graphics. Again, that's not what I'm concerned with here.

The X11 server is capable of executing these drawing commands in software. However, there are X11 modules which allow the server to delegate the implementation of these drawing commands to a library that is capable of using hardware acceleration. In the Intel case, we’re looking at a module called xf86-video-intel [2].

Because this is an Xorg module, getting a debug build is a little trickier than simply linking your own application with a locally-built copy, which is what I have been doing with libdrm and Mesa. Instead, after you build xf86-video-intel, you have to tell the existing Xorg to look in your build directory for the shared object. I achieved this by creating a tiny /etc/X11/xorg.conf (the file didn’t exist before) (note that this requires root permissions, because you’re modifying the system X server):

Section "Files"
ModulePath   "/path/to/new/lib/dir"
ModulePath   "/usr/local/lib/xorg/modules"
EndSection

After restarting the server, it picked up my locally built copy of xf86-video-intel, and fell back to the system directory for all other modules. I was then able to see any logging that I had enabled in the module by inspecting /var/log/Xorg.0.log. You can also debug the module by SSHing into the machine from another one, and attaching to X in a debugger as root.

Entry Points

Anyway, back to the problem at hand. I’m interested in how the Intel-specific library executes 2D drawing commands, but what are the specific 2D drawing commands that can be executed? This is where we start looking at xcb/xproto.h and xcb/render.h from libxcb. The first file describes commands in the core X11 protocol, which is fairly simplistic and doesn’t have things like antialiasing support or smoothed text support. The second file describes commands in the xrender extension, which adds these features and some others.

Anyway, there aren’t actually that many relevant commands. The core protocol has things like xcb_poly_point(), xcb_poly_line(), xcb_poly_segment(), and xcb_poly_arc() to draw points, lines, connected lines, and arcs (it also has some matching functions for filling as well as drawing). Importantly, it also has xcb_put_image() and xcb_get_image() for uploading and downloading bitmaps from the X11 server. It also has xcb_poly_text_8() and xcb_poly_text_16() for text handling. Attributes of these drawing commands (such as using a miter line join or a bevel line join) can be set with the function xcb_change_gc().

The X11 render extension has a concept of compositing rather than drawing, where a compositing operator is a particular function that combines a source pixel and a destination pixel to produce a new destination pixel. Its drawing API is a little bit simpler: it has xcb_render_composite(), xcb_render_trapezoids(), and xcb_render_triangles() for compositing rectangles, trapezoids, and triangles, respectively. It also has its own text functions, xcb_render_composite_glyphs_8(), xcb_render_composite_glyphs_16(), and xcb_render_composite_glyphs_32(). In addition, it has support for gradients with xcb_render_create_linear_gradient(), xcb_render_create_radial_gradient(), and xcb_render_create_conical_gradient().

Which Pixels to Modify?

So, how does xf86-video-intel accelerate these calls? Throughout this discussion, it’s important to remember that our library can fall back to the existing software path whenever it wants (as long as it can map the frame buffer to main memory), which means that not all modes of operation need to be supported. I’ll be focusing on the SNA portion of the library, since that’s the latest active implementation.

When drawing, there are two problems that the library has to deal with. The first is, given a source pixel and a destination pixel, figuring out what the resulting new color should be. The second problem is figuring out which pixels we should apply the previous computation to. Let’s talk about the latter problem.

First, let’s tackle the core X11 calls, because those don’t have to perform any complicated compositing. The library has a concept of operations, which our drawing commands will be turned into. There are only a few kinds of operations: sna_composite_op, sna_composite_spans_op, sna_fill_op, and sna_copy_op. These operations contain a vtable of functions named blt(), box(), boxes(), thread_boxes(), and done(). The idea is, for example, if you want to fill some boxes on the screen, you would make a sna_fill_op, repeatedly call box() or boxes(), and then eventually call done(). Note how there is no function that operates on anything other than a rectangle that is aligned with the screen - that is important.

The same function that populates this vtable is the one that actually initializes the command. Function pointers for these construction functions are set up in the very beginning, when we are initializing our context. This means that we can call our function pointer, and it will set up an operation as well as filling out this vtable.

Let’s take a very simple case: we are trying to draw a single diagonal line across our window. We can set up our fill operation with sna_fill_init_blt() which will call through the aforementioned function pointer. However, our operation doesn’t have any functions which can draw lines. This means that we actually walk along our line, pixel by pixel, accumulating a big array of boxes, where each box represents a single pixel which needs to be filled (in sna_poly_zero_line_blt()). When we’re done accumulating, we call the boxes() function pointer, and then call the done() function pointer. Drawing points and segments works the same way.

Arcs work a little differently. Instead of walking the arc pixel by pixel, our library delegates that iteration to Xorg itself. We call sna_fill_init_blt() like we did before, but next we actually change the interface that we expose to Xorg as a whole (see sna_poly_arc()). We override our FillSpans and PolyPoint functions with specific, different ones, and then call miZeroPolyArc(). This function synchronously calls our own FillSpans and PolyPoint functions. A span is a contiguous sequence of pixels aligned on a row, and because both spans and individual pixels are rectangles, we can implement both callbacks in terms of our boxes() operation function pointer. When we’re done, we set our Xorg interface back to what it was originally, and call the done() operation function pointer. Filling polygons and arcs works this way as well.

Putting an image is quite different because it involves a large pixel payload. That call is implemented using the drm_intel_bo_alloc_userptr() function in libdrm. This function constructs a GPU buffer around a piece of application memory without copying it; it simply maps the memory into the GPU’s address space. This is only possible because Intel GPUs share memory with the CPU, rather than having their own discrete memory. Once the GPU has access to the user’s buffer, the library uses the copy_boxes() operation to move the data with the GPU (see try_upload_blt()). Once it’s done copying, it destroys the GPU buffer it created. Getting an image works similarly. However, if the GPU buffer is tiled, we can’t take the previous code path; instead we map the buffer and copy the data in software. This part of the code uses compiler attributes to use AVX or SSE on the CPU.

When compositing, we still create an operation like before, by calling sna->render.composite_spans(). However, there are a few kinds of primitives that we can be compositing, and our operations can only operate on rectangles. This means we need some way of turning our primitives into sequences of rectangles. We do this by iterating in software over every pixel in our primitive (see tor_blt()). This is expensive, so we spin up some threads to do it in parallel (see trapezoid_span_converter()). Each of these threads shares the operation data structure, but gets its own buffer to populate with rects which cover its pixels. Once a thread’s buffer is populated, it calls the thread_boxes() operation function pointer (which internally deals with locks). The main thread waits on all the threads, and then calls the done() operation function pointer.

How to Modify the Pixels?

Now, the other problem: what goes on under the hood of an operation? The operations seem to attempt to use the blitter if possible, but fall back otherwise. The fill operation is actually pretty straightforward since there is a blitter command that does exactly what we want. sna_blt_fill_init() sets some state which represents the current command to XY_SCANLINE_BLT, and emits an XY_SETUP_*_BLT command. Then, sna_blt_fill() sets the function pointers to functions that will output the command followed by rectangular screen coordinates (see _sna_blt_fill_boxes()). A copy operation works the same way with XY_SRC_COPY_BLT_CMD.

Compositing has the same general flow, except not all blend modes work with blitting. We detect the cases where the blend mode is supported in sna_blt_composite(), and then populate the same data structure. Currently it looks like filling, clearing, and putting are the only blend modes that are supported. These use the commands XY_COLOR_BLT, XY_SCANLINE_BLT, XY_SRC_COPY_BLT_CMD, and XY_FULL_MONO_PATTERN_BLT.

When drawing text, Xorg hands us the bitmap that we should draw. The bitmap is small, so we use the command XY_TEXT_IMMEDIATE_BLT to allow us to specify the bitmap literally inside the command stream.

If we determine that we can’t use the blitter, we must use the 3D pipeline. However, we can do a little better than naively drawing a triangle fan with a passthrough vertex shader. In particular, we can turn all of the 3D pipeline stages off (including the vertex shader!) except for the WM stage (which includes the fragment shader). We only need a finite number of fragment shaders, so we can compile them all ahead of time (or even possibly at authorship time! See wm_kernels[]). And, instead of issuing a triangle fan to the hardware, we can use the special RECTLIST primitive that is designed exactly for our needs. Indeed, this is what the library does when it falls back to the 3D pipeline.

There isn’t actually much to do when setting up our 3D pipeline. We need to choose which of the few precompiled kernels we will be using, set up our binding table for our shader, and emit some setup commands to the hardware. These commands are: MI_FLUSH (optionally), GEN4_3DSTATE_BINDING_TABLE_POINTERS, GEN4_3DSTATE_PIPELINED_POINTERS, and GEN4_3DSTATE_VERTEX_ELEMENTS.

Then, when we get a call to emit some boxes, we check two things. First, if we’re about to run out of space in our command buffer, we flush it. Then, if we’re at the beginning of the command buffer, we need to emit GEN4_3DSTATE_VERTEX_BUFFERS and GEN4_3DPRIMITIVE. Once we’ve done that, we can actually emit our box rect.

Actually figuring out how to emit the rect is the job of gen4_choose_composite_emitter(). This function has blocks for AVX or SSE compiler attributes in order to choose the right emitter code. Ultimately this code boils down to appending/prepending the vertex data on to an existing buffer.

All in all, there’s a lot there, but it’s not too complicated.


Monday, December 8, 2014

Design of Mesa 3D Part 11: Intel Command Buffers

The hardware of an Intel graphics chip actually mirrors the OpenGL pipeline pretty well. There are hardware units that correspond to each stage of the 3D pipeline. There is a stage for the vertex shader, a stage for the geometry shader, a stage for the rasterizer, etc., and each of these stages is connected to the next. The Intel manual specifies the pipeline present in the chip (Volume 2, section 2.2):

  1. Command Streamer: The single point of entry for GPU commands.
  2. Vertex Fetcher: Deals with index buffers.
  3. Vertex Shader: Same as OpenGL.
  4. Geometry Shader: Same as OpenGL.
  5. Clipper: Clips to the viewport.
  6. Strip/Fan: Turns higher-level primitives (such as triangle strips) into triangles.
  7. Windower/Masker: Performs rasterization and runs the fragment shader.

When you send commands to the GPU, the Command Streamer receives them, and, if necessary, sends them down the pipeline. When a hardware unit encounters a command that it cares about, it performs the command, and then optionally continues sending the command on its way. Some commands don’t need to be further propagated down the pipeline, so those are squelched.

But what do these commands look like anyway? Well, they are roughly analogous to OpenGL calls, which means that most of these commands are simply setting up state and configuring the hardware in a particular way. Here are some examples of commands:

  • 3DSTATE_INDEX_BUFFER: In the future, please use this buffer as an index buffer for vertices.
  • 3DSTATE_BASE_ADDRESS: All pointers are relative to this address.
  • URB_FENCE: Please divide up your scratch space in the following way between the various pipeline stages.
  • 3DSTATE_PIPELINED_POINTERS: Here is the state that every draw call needs, for each hardware unit.
  • 3D_PRIMITIVE: Invoke the pipeline; do the draw.

There are many more, but that’s the general idea.

There is also a pool of processing cores which execute shaders. These are called EUs, for “Execution Units.” My 4th generation card has 10 of these EUs, and each one can run a maximum of 5 threads at a time, for a total of 50 possible concurrent threads! There is a dispatcher unit which submits jobs to the EUs, and pipeline stages (such as the WM stage for the fragment shader) interact with the dispatcher to run threads.

I also want to be very explicit: the commands in the command buffer that the CS sees are not the same as the assembly commands that the EUs execute. The EUs execute shader commands, which are things like “add” and “shift.” The commands in the command buffer are for setting up the entire pipeline and invoking the pipeline as a whole.

So how do these command buffers get populated in the Intel Mesa driver? Well, any time any state changes that the pipeline cares about, the driver gets a callback. The driver simply remembers which items are dirty in a bit flag inside the context. Then, when it comes time to draw, we look at all the dirty bits.

Inside brw_state_upload.c, there is a list of “atoms.” Each atom is a tuple of a bit flag and a function pointer. Inside brw_upload_state(), we iterate through all the atoms, comparing each atom’s bit flag to the dirty bit flag. If there is a match, we call the callback function.

There seem to be two general kinds of callback functions: ones that set up state, and ones that append a command to the command buffer. Many commands simply let the GPU know about some state struct that has been set up in GPU memory, so creating the struct has to happen before the command targeting it is issued.

The first kind of callback is pretty straightforward: we call brw_state_batch() to allocate some space, cast the result pointer to the struct type we are trying to fill, and then fill it with information from the context. brw_state_batch() is kind of interesting, because it allocates from the end of the current command buffer, growing backwards. This is so that we don’t have to allocate an entire memory buffer for each struct that the GPU needs to know about. I saw a comment explaining that the minimum size of a buffer is 4096 bytes (likely because that is the page size), and it would be silly to waste that.

The second type of callback is also fairly straightforward, except for one point. Issuing the command itself is simple: move a pointer forward by the size of the command, check that it didn’t overflow, and if it didn’t, write the command into the space we just created. The interesting part is that the command includes a pointer to GPU memory. In userland, we don’t actually know where any buffer will end up in graphics memory, so we don’t know what pointer value to write.

This is where drm_intel_bo_emit_reloc() comes into play. You pass this command the place where you want to write the pointer, and the location that the pointer should ultimately point to. Both of these pieces are specified as a drm_intel_bo* and an offset. libdrm will simply keep track of these relocation requests. Then, when we actually flush the buffer (aka, drm_intel_bo_mrb_exec()), we pass the relocation list to the kernel. The kernel will then patch up the relocation points en route to the GPU.

Tuesday, December 2, 2014

Design of Mesa 3D Part 10: Intel’s Device Driver

FreeBSD 10.1 is out [1] which means I’ve got a good opportunity to take another look at Mesa. Since I last looked at it, the FreeBSD Ports system has been updated to the latest version of Mesa, version 10.3.3 [2], which is 3 major versions past where I was looking before. Needless to say, much has changed.

Once again, I’d like to discuss the players here.

The first player is libglapi. This is a pretty simple library which contains two thread-local variables: one which holds the current OpenGL dispatch table, and another which holds the current OpenGL context. Mesa can set and get the first one with _glapi_set_dispatch() and _glapi_get_dispatch(). The dispatch table holds a collection of function pointers, and every OpenGL API call gets directed through a dispatch table lookup. Mesa can set and get the current OpenGL context with _glapi_set_context() and _glapi_get_context(). Inside libglapi, contexts are treated as a void*.

Another player is EGL. EGL is a platform abstraction that encapsulates X11 on FreeBSD (but is portable, so it will encapsulate WGL on Windows, etc.). EGL has a concept of a display and a rendering surface, and is responsible for creating an OpenGL context which uses them. It’s also responsible for making an OpenGL context “current,” as well as performing the buffer swap at the end of the frame to present your drawn framebuffer. If you want to create an onscreen surface, you have to pass the relevant function a void* which represents the platform-specific renderable that EGL should draw to.

Because EGL is a platform abstraction, it is driver-based. Drivers in Mesa have a single symbol which is either a function that populates a vtable, or the vtable itself. Currently, there are only two EGL drivers in Mesa, but only one seems to have an implementation: the DRI2-based one. This driver has two parts: the main part of the driver, and the part that deals with X11. (If your platform doesn’t use X11, there are other pieces that can replace it; for example, there is a Wayland part which can be used instead of the X11 piece.) The main part of the driver knows about the X11 part, but the calls from it to the X11 piece are behind a #ifdef. The X11 part doesn’t know about the main part, but instead fills out a vtable with function pointers for the main part to call.

The X11 part uses XCB [3] to interact with the X server. Because the user of EGL had to already set up their rendering destination, they already have a connection to the X11 server, so this part piggybacks off of that connection. (If you’re rendering to a pbuffer, this part makes its own XCB connection to the X server). Its responsibility is to handle all of the requirements of DRI that interact with the X server. Luckily, there isn’t very much to do there. Look in xcb/dri2.h for a list of all the calls that are necessary. Relevant ones are:
  • xcb_dri2_query_version(): Returns the version of the DRI infrastructure. Mine says 1.3.
  • xcb_dri2_connect(): This returns two important strings:
    • The driver name that Mesa should use to actually drive the hardware. Mine is i965. Mesa turns this into “dri/” and will dlopen() it.
    • The file to open to send DRM commands to. All DRM commands are implemented as ioctl()s on a file descriptor. This is the path to the file to open to get the file descriptor. Mine is /dev/dri/card0
  • xcb_dri2_authenticate(): This is the access control that DRI uses. Once you’ve opened the DRM device file, you can start sending commands directly to the hardware. However, before you start, you have to send one particular command, drmGetMagic(), which will return an arbitrary number. You then have to supply this number to xcb_dri2_authenticate() so the X server can authenticate you (using the same authentication mechanisms that any X client uses). Once you’ve done this, you can start sending commands to the hardware via the fd you opened previously.
  • xcb_dri2_create_drawable(): No return value. You pass the onscreen window XID and this will make the window’s buffers available to DRM.
  • xcb_dri2_destroy_drawable(): The reverse as above.
  • xcb_dri2_get_buffers(): You pass an array of attachments which represent buffers that the X server knows about and will use for compositing the screen. (For example, XCB_DRI2_ATTACHMENT_BUFFER_BACK_LEFT). xcb_dri2_get_buffers() will return a bunch of information regarding each buffer, including the “pitch” of the buffer (the number of bytes between successive rows, though this is hardware-dependent as buffers can be tiled on Intel cards), the CPP (number of bytes per pixel), some flags, and, most importantly, the buffer “name” which is a number which represents the device-specific ID of the buffer.
  • xcb_dri2_swap_buffers(): Once you’re done with a frame, you have to let the X server know that.
  • There are some others, but I’ve omitted them because they’re not very relevant.
So, the X11 part of the EGL driver performs that initial handshake with the X server, and populates a driver_name field, which the main part of the EGL driver uses to open the correct .so. This .so represents the OpenGL driver (not the EGL driver!). Drivers export a vtable of function pointers.

Then, the client program starts making OpenGL calls, which all get redirected through libglapi’s function table. In our case, Mesa implements the GL calls. During most OpenGL function calls, the OpenGL driver doesn’t actually get touched, because most GL calls simply set state in the context. For example, glViewport(), glClearColor(), glCreateShader(), and glShaderSource() all just set some state. Even glCompileShader() doesn’t actually touch the i965 driver; Mesa simply performs a compilation to a high-level intermediate representation (“hir”). In i965’s case, the driver only gets touched inside glLinkProgram(), which performs the final compilation for the card. Even glGenBuffers() and glBindBuffer() don’t actually touch the driver; it’s not until glBufferData() that buffers are actually created on the card.

glClear() is pretty interesting; because I only have a gen4 card, there isn’t a mechanism for quickly clearing the framebuffer. Instead, Mesa has a “meta” API where it implements certain API calls in terms of others. In glClear()’s case, it draws two triangles which cover the screen, with a trivial fragment shader.

glDrawArrays() and glDrawElements() obviously both touch the driver. This is where the bulk of the work is done, and where the driver reads all the state that Mesa has been setting up for it.

There’s one more player, though, that I only mentioned briefly: libdrm [4]. This is how the OpenGL driver actually interacts with the GPU. It’s a platform-specific library which includes very low-level primitives for dealing with the GPU, and is implemented entirely by ioctl()s on a file descriptor (which I described how to open earlier). The library has one section for each card it supports, and you (obviously) shouldn’t use the API calls for a card that you do not have. There is one .h file which isn’t in a platform-specific directory (xf86drm.h), but the i965 Mesa driver never seems to call these non-platform-specific functions (except for drmGetMagic(), as described above). Instead, everything seems to come from inside the intel folder.

Once you’re at the level of DRM, the nouns and verbs of OpenGL are no longer relevant, since you’re dealing directly with the card. Indeed, the i965 OpenGL driver even has some nouns which DRM doesn’t know about. DRM has a concept of a buffer on the card, and these buffers are used for everything. It’s straightforward to see how a VBO simply uses a buffer on the card, but a compiled program is also simply placed inside a buffer on the card. A frame buffer is simply a buffer on the card. A command buffer, which OpenGL has no notion of, but the i965 OpenGL driver does (and calls a batch buffer), is simply a buffer on the card. A texture is simply a buffer on the card. Therefore, the part of DRM that handles buffers is the most important part. The API to this subsystem lives in intel_bufmgr.h, and I’ll list a few of the more interesting calls here:
  • drm_intel_bufmgr_gem_init() creates a drm_intel_bufmgr*, which contains a vtable inside it that the buffer-specific calls go through
  • drm_intel_bufmgr_destroy() the reverse of above
  • drm_intel_bufmgr_set_debug(), once called, will set state which will cause successive functions to dump debug output.
  • drm_intel_bo_alloc() makes a drm_intel_bo*, which represents a buffer on the card. This call allocates the buffer.
  • drm_intel_bo_gem_create_from_name(): you pass in the name you got from xcb_dri2_get_buffers(), and it will return a drm_intel_bo* which represents that buffer. This is how you interact with the framebuffer.
  • drm_intel_bo_reference() / drm_intel_bo_unreference(): Buffer objects are reference-counted
  • drm_intel_bo_map() / drm_intel_bo_unmap(): self-explanatory
  • drm_intel_bo_subdata(): Upload data to the buffer at a particular offset
  • drm_intel_bo_get_subdata(): Download data from the buffer at a particular offset
  • drm_intel_bo_exec() / drm_intel_bo_mrb_exec(): Execute the contents of the buffer. This is what performs the glDrawElements() call.
  • drm_intel_bo_set_tiling() / drm_intel_bo_get_tiling(): Intel cards have hardware support for tiling. This is used when the buffer represents a renderbuffer or texture.
As you can start to see, these function calls are all that are really necessary for implementing the core part of the i965 OpenGL device driver. Uploading a texture is simply a drm_intel_bo_alloc() and a drm_intel_bo_subdata(). Using a shader is simply a CPU-side compilation, then an alloc/subdata to get it on the card. For a command buffer, you can map it, then write commands into the buffer. When you’re done, exec it.

These calls are all passed almost directly to the kernel, meaning that this is about as far down as we can go and still be in userland.

All this information is enough to write a program that uses the GPU directly but doesn’t go through Mesa. Here are the steps:
  1. xcb_connect() to the X server.
  2. xcb_dri2_connect() to get the DRI device file path.
  3. xcb_create_window() to make your window.
  4. xcb_map_window() to show it
  5. At this point, you should wait for an expose event
  6. xcb_dri2_create_drawable() to make it available from DRI
  7. xcb_dri2_get_buffers() with XCB_DRI2_ATTACHMENT_BUFFER_BACK_LEFT to get the name of the buffer that backs the window
  8. open() the DRI device file path
  9. drmGetMagic() to get the magic number which you will use to call…
  10. xcb_dri2_authenticate(). After this point, you can call drm functions.
  11. drm_intel_bufmgr_gem_init() to make an intel-specific bufmgr
  12. drm_intel_bo_gem_create_from_name() with the output of xcb_dri2_get_buffers() that you previously called, to create an Intel-specific buffer object which represents the backbuffer of the screen.
  13. drm_intel_bo_map() to map the buffer
  14. use the backbuffer->virtual pointer to write into the frame buffer. You probably should call drm_intel_bo_get_tiling() so you know where to write stuff.
  15. drm_intel_bo_unmap() when you’re done
  16. drm_intel_bo_unreference() when you don’t need the drm_intel_bo anymore
  17. xcb_dri2_swap_buffers() to tell the X server that you’ve finished the frame
  18. drm_intel_bufmgr_destroy() when you’re done with the bufmgr
  19. close() the DRM file descriptor
  20. xcb_dri2_destroy_drawable() when you’re done with DRI2 entirely
  21. xcb_disconnect() from the X server entirely
And there you go. You can see how you might take this structure, but instead of mapping the frame buffer and drawing into it with the CPU, you could allocate a command buffer and write commands into it, and get the GPU to draw into the frame buffer for you.


Saturday, November 22, 2014

Complex Text Handling in WebKit, Part 7: Width Calculations

There’s one big piece of complex text handling that I haven’t gone into detail with, and that is how we actually calculate the width of a string of text. Calculating the ascent and descent is pretty straightforward - the font knows what its ascent and descent is. However, calculating widths is more involved.

In order to talk about widths, I first need to talk about fonts. WebKit has a different concept of a font than your operating system does. To WebKit, a font comprises a sequence of FontData objects, which are used for fallback. When you say, for example, <div style="font-family: Helvetica, Arial;">text</div>, “Helvetica” refers to a FontData object and “Arial” refers to a FontData object.

There are two subclasses of FontData: SegmentedFontData and SimpleFontData. SimpleFontData simply represents an operating-system font object (CTFontRef on OS X). SegmentedFontData refers to a collection of SimpleFontData objects, where each SimpleFontData is associated with a unicode range. This is how the unicode-range property in @font-face declarations is implemented. Because a Font is actually a collection of platform fonts, when we want to measure the width of a string using a Font, we’re actually specifying a collection of fonts.

There are actually two text code paths in WebKit: a simple path and a complex path. This is because there are many languages where text layout doesn’t require any complex shaping algorithms. In English, each codepoint has a particular glyph associated with it, and each glyph has an associated advance width, and the glyphs are simply put next to each other one by one. For these cases, we can simply make a map of codepoint to glyph, and glyph to width, and use these maps exclusively when calculating text width. If we use the simple path, there is no need to touch platform APIs at all (except to initially fill the map).
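
To make the shape of those caches concrete, here is a minimal sketch; the names and the stub platform functions are invented, not WebKit’s actual ones:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the platform calls that the simple path
// only needs on a cache miss; the names and values are illustrative.
using Glyph = unsigned short;
Glyph platformGlyphForCodepoint(char32_t c) { return static_cast<Glyph>(c % 1000); }
float platformAdvanceForGlyph(Glyph g) { return 8.0f + g % 4; }

struct SimplePathCaches {
    std::unordered_map<char32_t, Glyph> glyphFor;  // codepoint -> glyph
    std::unordered_map<Glyph, float> advanceFor;   // glyph -> advance width

    float widthOf(const std::u32string& text) {
        float width = 0;
        for (char32_t c : text) {
            auto it = glyphFor.find(c);
            if (it == glyphFor.end())              // miss: touch the platform once
                it = glyphFor.emplace(c, platformGlyphForCodepoint(c)).first;
            Glyph g = it->second;
            auto wit = advanceFor.find(g);
            if (wit == advanceFor.end())
                wit = advanceFor.emplace(g, platformAdvanceForGlyph(g)).first;
            width += wit->second;                  // glyphs simply sit side by side
        }
        return width;
    }
};
```

After the first lookup, measuring the same characters again never leaves this structure.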

However, some languages, such as Arabic, have complex shaping rules about how to lay out glyphs. In particular, the sequence of glyphs chosen for an Arabic letter might be dependent on the letters surrounding it. For this case, we can’t simply make a cache; we need to invoke the platform’s text shaping API.

Therefore, when we want to measure text (Font::width()), the first thing we have to do is determine which codepath to use. There are a variety of things this function (Font::codePath()) considers, but one thing it does is iterate through all the code points to see if any come from these complex languages.

As far as WebKit is concerned, the complex codepath is actually easier to understand, since most of the work is delegated to the platform. In particular, each platform has their own implementation of Font::floatWidthForComplexText(), so I can only talk about OS X’s implementation.

The first thing OS X’s implementation of this function does is determine contiguous ranges within the string that are drawn with the same font. This is where font fallback is handled. For each character in the string, it iterates through all the FontDatas inside the Font, testing if that FontData can render the character. This test is performed by creating a CTLine containing only the single character, then iterating through the constituent CTRuns, asking which specific font the platform would use when drawing that particular CTRun. If any of the CTRun’s fonts aren’t the same as the font we are inspecting, the font we are inspecting cannot be used, and we keep searching down the fallback list.

Given this test, it finds contiguous ranges of characters that can be rendered with the same font, and creates a CTLine around the sequence of characters. It then iterates through that CTLine’s constituent CTRuns, remembering the glyphs, advances, and string indices for each glyph in the CTRun, as well as the FontData which gave rise to this run.

However, we’re not done yet. We first have to adjust all the information that CoreText gave us to take into account things like character spacing, tab width, synthetic bold, rounding, collapsing whitespace, etc. This algorithm is straightforward: just iterate through each of the glyphs in each of the runs we found. While we do this, we accumulate all the character advances.

And voilà! We have our width.

The simple text codepath requires much more WebKit code, but the benefit is that we don’t have to interact with platform APIs as often. In addition, the simple text codepath is all platform independent (except for initially populating the caches, as mentioned earlier). This codepath is centered around the WidthIterator class. The idea for WidthIterator::advanceInternal() is that we walk through the string codepoint by codepoint, asking which FontData the current character should be drawn with. We can then ask the FontData what the relevant glyph and advance is for the character. We have to perform the same fixup here regarding character spacing, tab width, etc. that we had to do in the complex case.

So, how do we know what FontData the current character should be drawn with? Well, we build up a large data structure representing all the fonts on the system. The first thing we do is divide up the set of all codepoints into “pages” that each have a set size of code points (currently 256). A GlyphPage contains a pair of SimpleFontData and Glyph ID for each code point contained within the page (there is also a compaction mechanism for when all the codepoints in a page share the same SimpleFontData).
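
The paging arithmetic itself is simple. A hypothetical sketch, where only the 256-codepoint page size is taken from the real code and everything else is made up:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative model of a GlyphPage; the real class also has the
// compaction mechanism mentioned above, which is omitted here.
constexpr uint32_t GlyphPageSize = 256;

struct SimpleFontData;          // opaque for the purposes of this sketch
using Glyph = uint16_t;

struct GlyphPage {
    // One (font, glyph) pair per codepoint in the page.
    std::array<const SimpleFontData*, GlyphPageSize> fonts {};
    std::array<Glyph, GlyphPageSize> glyphs {};
};

// Which page a codepoint lives on, and its slot within that page.
constexpr uint32_t pageNumberFor(char32_t codepoint) { return codepoint / GlyphPageSize; }
constexpr uint32_t slotFor(char32_t codepoint) { return codepoint % GlyphPageSize; }
```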

Then, we build up a tree of these GlyphPages. The nodes in the tree are instances of the GlyphPageTreeNode class, which contains a map from FontData to GlyphPageTreeNode representing its children, as well as a reference to a GlyphPage, and some other less interesting members.

The first level of the tree (just below the root) is actually special. Each of the nodes in the first level is associated with a page number; if you’re interested in the 5th page (meaning codepoints 5 * 256 to 6 * 256) you simply start at the 5th node in this initial level. These are actually held in a static HashMap.

Now, once you have an initial GlyphPageTreeNode, you progress down the tree (in FontGlyphs::glyphDataAndPageForCharacter()) by supplying a FontData object to the node’s children map. The FontData objects are supplied one by one from the Font object (Remember how I said earlier that Fonts are actually a sequence of FontData objects?). You stop when you have hit a node whose page contains a valid glyph and SimpleFontData for your codepoint. If we need a child that doesn’t exist, we create it and populate it at that point (GlyphPageTreeNode::initializePage()). initializePage() works by either calling into CoreText or CoreGraphics in GlyphPage::fill().

So, once we’ve descended through the tree, we’ve found the SimpleFontData that we should draw our character with. How do we then know what the character’s advance width is? Well, SimpleFontData has a GlyphMetricsMap in it which caches these width values. A GlyphMetricsMap contains a map from an int to a GlyphMetricsPage, which simply contains a constant-sized array of advances. If you ask the GlyphMetricsMap what the width for a particular glyph is, it might return cGlyphSizeUnknown, which means that SimpleFontData must then ask the platform what the advance width is, which it does by either calling into CoreText or CoreGraphics.
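
A minimal sketch of that caching pattern, assuming a flat map where the real code uses fixed-size pages, and a stub in place of the CoreText/CoreGraphics call:

```cpp
#include <cassert>
#include <unordered_map>

using Glyph = unsigned short;

// Sentinel meaning "not measured yet"; the constant name follows the post,
// the value and everything else here is illustrative.
constexpr float cGlyphSizeUnknown = -1.0f;

float platformAdvance(Glyph) { return 10.0f; }   // stand-in for the platform call

struct GlyphMetricsMap {
    std::unordered_map<Glyph, float> advances;   // the real map is paged; this one is flat

    float widthForGlyph(Glyph g) {
        auto it = advances.find(g);
        float w = (it == advances.end()) ? cGlyphSizeUnknown : it->second;
        if (w == cGlyphSizeUnknown) {            // miss: ask the platform once, then cache
            w = platformAdvance(g);
            advances[g] = w;
        }
        return w;
    }
};
```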

After we’ve answered all these questions, flow eventually returns to WidthIterator, which continues looping and accumulating glyph advance widths. When the loop concludes, we’ve got our total width.

Complex Text Handling in WebKit, Part 6: Run Layout

Previously we saw how text got transformed from lines to bidi runs, but we still have farther to go until we know exactly where to place each glyph. We have just created a sequence of bidi runs which represent a line, and we have to transform that into a data structure which represents the layout of each run.

The overarching problem that we are trying to solve is to figure out what things to draw and where to draw them when we want to draw a webpage. The usual way to draw declarative scenes like this is with a scene graph (in a webpage’s case, a tree). However, overloading the DOM tree to include this rendering information is a bad fit. Instead, we have separated concerns by creating a render tree which represents the scene graph of a DOM tree. The DOM tree is tasked with representing the structure of a webpage (which is what, for example, JavaScript interacts with) and the render tree is tasked with representing the answer to our original question - what to draw and where to draw it. RenderObjects have styles, but DOM nodes don’t; the CSS cascade defines how to assign every RenderObject its own style.

However, the render tree doesn’t go any deeper than a RenderText representing a text node. The RenderText object itself doesn’t have any layout information inside it about where to actually put each of the lines or runs or glyphs. Instead, the block-level container of the RenderText (a RenderBlockFlow instance) has a separate data structure which represents the layout information of all of its inline children.

For example, if you have <div>Lorem <b>ipsum</b></div>, the div itself is a DOM node, and it has a corresponding RenderObject in the render tree. The RenderObject would be a RenderBlockFlow instance, which would have two children: a RenderText (with the string "Lorem " inside it) and a RenderInline (representing the <b>). Both of these two classes inherit from RenderObject. Note how there is no <b> specific subclass of RenderInline; the difference here is that the RenderInline has a different style than the RenderBlockFlow (namely, a style with a bold font). The RenderInline would have a RenderText child, with the string "ipsum" in it. However, none of these RenderTexts nor the RenderInline have information regarding layout. Instead, the RenderBlockFlow has a RenderLineBoxList member, which represents the layout information for all of the RenderBlockFlow’s inline children. The RenderLineBoxList is simply a list of InlineFlowBoxes.

There are only a few classes in this class hierarchy, so it makes sense to talk about them. InlineBox is the base class for all objects in the InlineBox tree. InlineFlowBox is a subclass which has pointers to children, so this class is used for all the non-leaves in the tree. RootInlineBox is a subclass of InlineFlowBox which represents the entire line, and has functions for dealing with the entire line. InlineTextBox is a subclass of InlineBox (so it can’t have any children) which represents a run of text with a particular directionality. InlineBox has fields which contain layout information.

Back to the algorithm. We’ve got a BidiRunList, and we want to turn that into an InlineBox tree, which includes populating the layout inside InlineBox (such as (x, y) and (width, height)). RenderBlockFlow::createLineBoxesFromBidiRuns() is responsible for this. We initially call RenderBlockFlow::constructLine() which simply creates the InlineBoxTree, then we lay out the tree in a horizontal pass and a vertical pass.

The horizontal pass occurs in computeInlineDirectionPositionsForLine(). Constructing the InlineBox tree isn’t very difficult; for every run in the BidiRunList, we create the appropriate InlineBox subclass instance (by delegating to the renderer, see createInlineBoxForRenderer()) and walk up the RenderObject tree creating InlineBoxes for our ancestors until we encounter a place to join up to the existing tree. We can walk up the RenderObject tree because the BidiRun has a pointer to the RenderObject that the BidiRun represents a portion of. We can then set the BidiRun’s InlineBox pointer to be the leaf we just created.

Then comes our horizontal layout pass. This happens in two phases: setting widths for each InlineBox, and then setting positions for each InlineBox. Setting widths for an InlineTextBox mainly involves calling Font::width() on the RenderObject’s relevant substring. The RenderObject has a RenderStyle, which contains the Font to use while measuring the substring’s width. If the InlineBox isn’t an InlineTextBox (for example, an image or something), we can simply ask the renderer for its width. Once each InlineBox has a width, placing them horizontally is easy - simply accumulate the widths and set each InlineBox’s X position to the value of the accumulator. Note that the width pass iterates over BidiRuns, but the placement pass is recursive over the InlineBox tree.
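
The placement half of the horizontal pass boils down to a running accumulator. A sketch with an invented box struct (the real pass recurses over the InlineBox tree and handles text direction, which is omitted here):

```cpp
#include <cassert>
#include <vector>

struct InlineBoxStub { float width = 0; float x = 0; };

// Assumes widths have already been set by the measurement phase.
void placeHorizontally(std::vector<InlineBoxStub>& boxes) {
    float accumulator = 0;
    for (auto& box : boxes) {
        box.x = accumulator;       // each box starts where the previous ones end
        accumulator += box.width;
    }
}
```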

Then comes the vertical pass. The idea of the vertical pass is that we want to align the baseline for all the elements in the line. This is already true within a particular InlineTextBox (due to font fallback, not all the characters in an InlineTextBox have to be drawn with the same font), so we have to align all the InlineBoxes so that the baselines align. The vertical pass occurs in computeBlockDirectionPositionsForLine(), and has two passes: a height pass and a position pass.

The height pass is a fairly straightforward recursive pass over the InlineBox tree. For every child, we call RootInlineBox::ascentAndDescentForBox() with the child. For text boxes, we simply iterate through the specific fonts that actually get used when drawing the text (which we know from calculating the InlineTextBox’s width earlier in the horizontal pass), keeping the minimum and maximum ascents and descents for the fonts. For replaced elements (like images), we simply ask the renderer what its height is. We also include things like line-height here. While we are recursing through all our InlineBoxes, we keep track of the maximum ascent and descent (which end up being the ascent and descent for the line as a whole).

Then, we place boxes vertically by calling placeBoxesInBlockDirection(). This is a recursive pass where we place each box’s Y position based on the box’s ascent compared to the line’s ascent. We also handle things here like ruby and emphasis marks.
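
The essence of the two vertical passes can be sketched in a few lines: compute the line’s ascent as the maximum box ascent, then offset each box so that all the baselines coincide. The struct here is invented, and real code also folds in line-height, ruby, and emphasis marks:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct BoxMetrics { float ascent = 0; float descent = 0; float y = 0; };

void placeVertically(std::vector<BoxMetrics>& boxes) {
    float lineAscent = 0;
    for (const auto& b : boxes)
        lineAscent = std::max(lineAscent, b.ascent);
    for (auto& b : boxes)
        b.y = lineAscent - b.ascent;   // shorter boxes start lower; baselines line up
}
```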

Now we’ve got layout information regarding each run. Later, when we paint, we can simply recurse through the tree and draw each element at its computed position.

Complex Text Handling in WebKit, Part 5: Bidirectional Processing

In a previous post, I described the first step to text layout: line breaking. A line is characterized by two DOM iterators which represent the beginning DOM location and the ending DOM location of the line. An iterator is simply a pointer to a DOM node and an integral offset within that node (which is only used for text nodes; if the node is a tag or something like that, the offset is 0).

The next phase of line layout is bidirectional processing. Bidi processing is needed, for example, if you have a word from a right-to-left language inside a sentence written in a left-to-right language. In this scenario, we don’t want to naively draw the text character by character, because then the characters in the RTL language will be written backwards, causing the result to be indecipherable. Instead, we want to find substrings inside the line that look like they are in a differently directed language, and process those substrings with the correct directionality.

The situation gets more complicated, however, when you consider the case of a LTR word in a RTL phrase in a LTR sentence. Indeed, these nestings can be arbitrarily deep. The model that bidirectional processing uses to represent this is that each character is assigned a “depth”, and even depths represent LTR contexts and odd depths represent RTL contexts. In the previous example, the LTR word would have a depth of 2, the surrounding phrase would have a depth of 1, and the sentence would have a depth of 0. If the base direction of the line is RTL, you simply start your bidirectional processing at a depth of 1.

Once you have all your depths, you can collect sequences of equal depths and handle these sequences independently. Once you have done this, the depth values themselves cease to be important, and you are left with simply a collection of runs, where each run has uniform directionality. In the above example, there would be 5 runs, with alternating directionality.
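
A sketch of that coalescing step, assuming the depths have already been resolved (which is the hard part that UAX #9 describes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Even depths are LTR contexts, odd depths are RTL contexts.
struct Run { std::size_t start; std::size_t stop; int depth; };

// Coalesce maximal stretches of equal depth into runs.
std::vector<Run> runsFromDepths(const std::vector<int>& depths) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < depths.size(); ++i) {
        if (runs.empty() || runs.back().depth != depths[i])
            runs.push_back({i, i + 1, depths[i]});
        else
            runs.back().stop = i + 1;      // extend the current run
    }
    return runs;
}
```

For the LTR-word-in-RTL-phrase-in-LTR-sentence example, depths 0 0 1 1 2 1 0 yield the five alternating runs described above.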

What’s more, there are ways to provide operators inside the text stream that provide signals to the bidi algorithm. There are 3 pairs of formatting operators, each with a start code point and an end code point: embeddings, overrides, and isolates. Embeddings allow you to be explicit about pushing and popping additional levels onto the depth stack. Overrides allow you to bypass the bidi algorithm, forcing the directionality of the inner text. Isolates specify that the algorithm should be applied recursively and independently to the inner text. There are specific code points which represent each of these operators, as well as HTML5 DOM elements.

The algorithm that actually comes up with all of these depth values involves tables regarding which directionality any given character is, and rules regarding how to respond when you encounter any particular class of character. The algorithm itself is fairly complicated and has to, for example, deal with characters that do not have strong directionality (like punctuation) and with things like making sure that matching brackets always face each other. Luckily, there is a specification for this algorithm, called “UAX #9” from the Unicode consortium. The fact that it is Unicode who is speccing this makes sense, since the algorithm operates on properties of codepoints, which Unicode has defined, and since all text processing programs (both inside web browsers and out) would need to implement this algorithm.

ICU implements this algorithm; however, its implementation is not suitable for WebKit because WebKit needs to handle arbitrary DOM content in a line, rather than simple text. Because of this, WebKit implements its own version of this algorithm, in constructBidiRunsForSegment(), which uses an InlineBidiResolver to populate a BidiRunList. The algorithm itself is inside BidiResolver::createBidiRunsForLine(); it loops over its content, calls into ICU on each codepoint, updates internal state, and periodically appends runs to the BidiRunList whenever it has finished a run.

From here on out, all layout is performed in terms of these BidiRuns, so it’s worth looking into the data structure a little bit. A BidiRun is a linked list node, so it’s possible to iterate through all the runs in a line by just calling next() on a run. In addition, it has a RenderObject which represents the renderer that the node is created from (encountering a new renderer causes a run to unconditionally be ended), and an InlineBox pointer which represents layout information which will be filled in later. It also has start and stop indices into the renderer, so you know which part of the renderer’s string this run represents, as well as the bidi level.

That’s it for bidirectional processing. Now that we have our BidiRunList, we can create InlineBoxes for the BidiRuns, and then lay out the InlineBoxes.

Saturday, June 7, 2014

Design of Mesa Round 2!

The last I wrote about Mesa was over a year ago. During that time, I have started afresh at trying to learn about how Mesa works. Last time I was writing, my graphics card didn’t have a Mesa driver that could run on my operating system, which made learning about this difficult. I recently bought a machine with a G41 chipset [1], which allows for use of the GPU on Intel Core 2 processors. This GPU has an open source Mesa driver, so after recompiling Mesa with debugging symbols, I was ready to go. This time, I have a focus on how Mesa interacts with hardware.

The first thing to realize is that there are actually four players here. The most obvious player is Mesa [2] , the project which actually implements the OpenGL API. Mesa doesn’t interact with the graphics card directly, however; there is a low-level library called libdrm [3] which provides an interface to the hardware. This library isn’t much more than a thin layer around system calls (though there are a couple places where it does a little more accounting. More on that later). It is important to note that most of libdrm is platform-specific, though there are a few non-platform-specific bits. The third player is the kernel itself, which has to receive the system calls which libdrm makes and do something with them. The fourth player is the X11 server, which doesn’t interact much but is involved in some of the setup steps.

You can actually run OpenGL commands two ways:

  • GLX [4] is an X11 extension which serializes OpenGL commands and forwards them through the X display protocol to the X server. The X server then executes these commands. As you can imagine, the extra serialization, IPC, and deserialization overhead make this path suboptimal. However, on setups where the X11 server is running on a different machine than the X11 client, this is the only option.
  • DRI [5] is a concept that allows X11 clients to access the display hardware without going through the X11 server. This doesn’t have the aforementioned overhead, but only works if the server and client are running on the same machine (which is almost always the case).

I am going to be completely ignoring GLX and will focus entirely on DRI (version 2).

There are actually three different versions of DRI. Mesa will try to use each one, in turn, until it finds one that works. It does this first by asking the X11 server if it supports the official protocol name using the standard method. If so, Mesa will issue a DRI2Connect [6] request (in DRI2Connect()) which will return the name of the DRI driver to use (i965) and a file path to the device file representing the GPU. The X11 server can reply with these things because the X11 driver is DRI-aware, and the DRI messages are fulfilled by the X11 driver. Mesa goes through these steps for each X11 screen (because each screen might be connected to a different graphics card).

Mesa then dlopen()s the driver .so (driOpenDriver()), and looks for a couple symbols with well-known names. These symbols include function pointers for creating DRIScreen and DRIContext objects (which, in turn, have function pointers for all of the GL calls). Mesa then saves these function pointers in a table so that API calls can be routed to the correct place.

The kernel represents the graphics card as a device inside /dev/ (with a path like /dev/dri/card0). Commands sent to the graphics card are implemented in terms of ioctl() calls on a file descriptor to that device file. Issuing these ioctl() calls is largely the job of libdrm. One example is that there is a DRM call named drmGetVersion() which simply ends up calling ioctl(fd, DRM_IOCTL_VERSION, buffer), where DRM_IOCTL_VERSION is just a #define’d constant that matches what the kernel is looking for. The kernel will both read and write the contents of the buffer supplied to the ioctl, which is how arguments are passed. Most of these #define’d constants are platform-specific (e.g. DRM_I915_GETPARAM).
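
The calling convention is easy to mimic with a fake kernel handler. Everything here is invented (the real constants and structs live in drm.h and i915_drm.h), but the pattern of the kernel both reading and writing the argument buffer is the point:

```cpp
#include <cassert>
#include <cstring>

// Invented stand-ins for a DRM ioctl number and its argument struct.
constexpr unsigned long FAKE_IOCTL_VERSION = 0x00;

struct FakeVersion {
    int major, minor;
    char name[16];
};

// Stand-in for the kernel's ioctl handler: it reads the request number
// and writes results back through the same buffer.
int fakeIoctl(int /*fd*/, unsigned long request, void* arg) {
    if (request != FAKE_IOCTL_VERSION)
        return -1;
    auto* v = static_cast<FakeVersion*>(arg);
    v->major = 1;
    v->minor = 6;
    std::strcpy(v->name, "fakedrm");
    return 0;
}

// The libdrm-style wrapper: allocate a buffer, issue the ioctl, read results.
int fakeGetVersion(int fd, FakeVersion& out) {
    return fakeIoctl(fd, FAKE_IOCTL_VERSION, &out);
}
```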

The last piece I want to discuss today is the integration between DRM and the X11 server. This is actually fairly simple: Part of the DRI X11 extension includes a DRI2GetBuffers request, which will return an array of information regarding the buffers which the X11 server has allowed DRM to render to. This is an array because you may have a double-buffered context, for example, so DRM needs information regarding both the front buffer and the back buffer. Among this information is a handle for each of the buffers. The Mesa i965 driver then runs the drm_intel_bo_gem_create_from_name() function inside libdrm (using the DRM_IOCTL_GEM_OPEN ioctl) which creates a buffer object (bo) given the handle, which the driver can then use to render to. Mesa is free to populate that buffer object however it wants, while all the X11 server has to do is composite the region of memory that Mesa is rendering to. This just reiterates the fact that the X11 server driver needs to be DRI-aware.


Tuesday, April 8, 2014

Complex Text Handling in WebKit, Part 4: Line Breaking

So, here we are, looking at a web browser. We are interested in laying out and rendering text. Step One: figure out where lines break. Once we have partitioned our string into lines, we can later lay out the internals of each line. We have two libraries at our disposal: CoreText and ICU.

ICU has a variety of functions in it, all aimed at exposing the fundamental rules of languages and characters. It knows about each language’s rules for where lines may be broken. For example, in English, lines may break at any whitespace character, while in Chinese, lines may break between almost any two characters. ICU provides a collection of functions which we can use to find the next breakable boundary in a string. These functions operate on the string itself, not on the glyphs that the string translates into.

CoreText has functions which translate a string of code points into a string of glyphs and a string of positions that each of those glyphs should be drawn at. We can use these positions to determine the effective width of each character. Width measurement itself is quite a complicated algorithm, so I’ll write a post about that piece sometime later. Until then, just assume that we have a function (Font::width()) whose input is a string and whose output is the width at which that string would appear on the screen.

The basic algorithm should now be fairly clear: come up with a collection of line breaking candidates with ICU, then choose the furthest candidate that still fits within the width of the box that we are putting text inside. Easier said than done.
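
Ignoring every complication that follows, the greedy core can be sketched like this, with a fixed per-character width and spaces as the only break candidates (both of which are gross simplifications):

```cpp
#include <cassert>
#include <string>

constexpr float charWidth = 7.0f;   // pretend every character is the same width

// Returns the index one past the last character of the first line.
std::size_t firstLineBreak(const std::string& text, float boxWidth) {
    std::size_t lastCandidate = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        bool pastEdge = (i + 1) * charWidth > boxWidth;
        if (text[i] == ' ') {
            if (pastEdge)
                return lastCandidate ? lastCandidate : i;  // break at furthest candidate
            lastCandidate = i;                             // commit this boundary
        }
    }
    // A single overlong word with no candidates is allowed to overflow the box.
    return (lastCandidate && text.size() * charWidth > boxWidth) ? lastCandidate
                                                                 : text.size();
}
```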

The problem is that any inline element can contribute to the width of a line. For example, if you have <div>lorem <img src="a.jpg"> ipsum</div>, you have to include the <img> when you are performing your width calculations. We also have to handle style attributes such as letter-spacing and word-spacing. We want to collapse whitespace, but not if we’re in a <pre> tag. If a single word is too wide to fit within the box, then we permit the word to extend beyond the width of the enclosing box, but not otherwise. We also don’t want to store a separate, cleaned-up version of the text that we’re rendering, so instead we have to keep a fair amount of state around as we iterate through the string. We also only want to iterate across the entire string a single time, as the string could be arbitrarily long.

All in all, the general structure is in LineBreaker::nextSegmentBreak(). We walk items in file order, handling each in turn as we come to it. Our state that we are keeping track of is encapsulated in the BreakingContext object, which has a simple accessor atEnd() which determines if we have found the end of a line and our DOM traversal can end.

If the node that we are inspecting is a text node, we iterate through each of the characters of the string looking for the next line break candidate. We can’t exit this loop until we have gone past the width of the box, so we have this abstraction of the LineWidth class, which holds committed and uncommitted widths. As we build up a word, we add to the uncommitted width, then when we hit a candidate line break we “commit” the width, moving it to the committed width measurement. When we find that we have gone past the edge of the box, our committed width is the width of the text that we are going to use. The structure of BreakingContext::handleText() should then make some sense: a big loop over each of the characters in the string, an if statement inside the loop to determine if we are at a line breaking candidate, and an if statement with a return inside it to determine if we are past the end of the box. Whenever we hit the end of a word, we update the m_lineBreak iterator so that when we return we have the furthest candidate.

This loop is also where we handle collapsing whitespace. There is a data structure, MidpointState, which keeps a vector of Iterators. The iterators come in pairs - one iterator signifies the beginning of a collection of whitespace to skip and the successive iterator signifies the end of that run of whitespace. When we are at the second space in a row (Not the first! We don’t want to skip all spaces, just successive ones) we append to this list, and whenever we hit a non-whitespace character we append to this list as well. We will see in another post how this data structure gets used. One consequence of this is that, if someone wants to draw a single line of text (which happens in certain places), they will skip the line breaking algorithm, which means they won’t get any whitespace removal. We also cache widths of words in a WordMeasurement array to speed up later phases of layout.
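
A sketch of just the pairing rule (start skipping at the second consecutive space, stop at the next non-space); the real MidpointState stores DOM iterators rather than string indices:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns a flat list of indices; consecutive pairs delimit
// [skip-start, skip-end) ranges of collapsible whitespace.
std::vector<std::size_t> collapseMidpoints(const std::string& text) {
    std::vector<std::size_t> midpoints;
    bool inSkippedRun = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        bool secondSpace = i && text[i] == ' ' && text[i - 1] == ' ';
        if (secondSpace && !inSkippedRun) {
            midpoints.push_back(i);        // start skipping at the second space
            inSkippedRun = true;
        } else if (inSkippedRun && text[i] != ' ') {
            midpoints.push_back(i);        // stop skipping at the next non-space
            inSkippedRun = false;
        }
    }
    return midpoints;
}
```

For "a   b" this yields the pair (2, 4): the first space is kept, and the two that follow it are skipped.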

One interesting bit is that ICU’s line breaking functions are context-sensitive, meaning that we need to keep track of a couple characters before each string that we give to ICU. However, we are operating on a DOM tree where the previous context might come from another DOM node and not be available. Therefore, we keep our LazyLineBreakIterator (which wraps ICU) up to date with the current characters that we are processing. If we are traversing the DOM and come across some replaced content (such as an image), we represent that as U+FFFD (OBJECT REPLACEMENT CHARACTER).

An interesting optimization is in the nextBreakablePosition() function, which uses ICU to determine where the next breakable position in a string is. However, ICU is known to be quite slow, so we try to avoid calling into it as much as possible. For example, if the current character is a space, we don’t have to call into ICU to tell us that we can break there. We can also make similar rules for most of the ASCII code points and simply store that information in a table (named asciiLineBreakTable). Only if the codepoint is something outside of this table do we have to consult with ICU. The actual consultation of ICU is behind a cache of TextBreakIterators inside a LazyLineBreakIterator.
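
The shape of that fast path, with a deliberately dumb rule (break after spaces only) standing in for the real asciiLineBreakTable, which covers character pairs, and a stub standing in for ICU:

```cpp
#include <cassert>

// Pretend ICU: the real call would consult a TextBreakIterator.
bool icuIsBreakable(char32_t /*before*/, char32_t /*at*/) { return false; }

// Pretend table: the real asciiLineBreakTable is indexed by pairs of
// ASCII characters; here we only "know" that a break is allowed after a space.
constexpr bool asciiTableSaysBreakable(char32_t before) { return before == ' '; }

bool isBreakable(char32_t before, char32_t at) {
    if (before < 0x80 && at < 0x80)            // both ASCII: answer from the table
        return asciiTableSaysBreakable(before);
    return icuIsBreakable(before, at);         // anything else: consult "ICU"
}
```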

This algorithm (LineBreaker::nextLineBreak()) has a variety of outputs, but the most important one is just an iterator which signifies where the end of the line is. Once we have this iterator, we can lay out all the items in the current line. Control returns to RenderBlockFlow::layoutRunsAndFloatsInRange(), and we move onward to constructing BiDi runs (constructBidiRunsForSegment()), passing in our end-of-line iterator.

Friday, February 7, 2014

Complex Text Handling in WebKit, Part 3: Codepoint to Glyph Mapping

Previously, I described how a sequence of bytes can be interpreted as a sequence of code points, and how glyphs are described in font files. The conversion, however, between a codepoint and a glyph is not one-to-one. Not only is the conversion not one-to-one, but it is not even trivial.

Unicode has codepoints for characters in many different alphabets. For example, there is a single codepoint for 'n' (U+006E; Unicode code points are written in hexadecimal, so this corresponds to an ID of 110) and there is a different code point for ñ (U+00F1). However, clearly it is not feasible to assign a code point for every combination of character and accent if you consider the fact that certain languages have multiple accents on a single character. Therefore, Unicode has the concept of combining marks. The concept is simple: add a mark to a character by concatenating the character’s codepoint with the mark’s codepoint. In the above example, it is possible to write the same character as the sequence U+006E U+0303 (because U+0303 is the codepoint for “combining tilde”). This means that certain conceptual “characters” can be represented with more than one code point. The technical name for these conceptual “characters” is “grapheme cluster.”

Think about this for a second. This means that it is possible to represent the exact same string in multiple different ways (the literal exact same string. They have the exact same meaning). This means that comparing equality of strings (or sorting a collection of strings) must take this into account. In order to combat this, Unicode has a concept of “canonicalization” which will convert a given string into a canonicalized form which you can then use for equality or comparison tests. (Note: string comparison is still dependent on locale. See ICU’s collation functions as well as [1])
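
To see why canonicalization matters for equality, here is a toy normalizer that knows exactly one composition, the n + combining tilde pair from above. Real normalization (e.g. ICU’s unorm2 API) handles thousands of such mappings plus mark reordering:

```cpp
#include <cassert>
#include <string>

// Compose U+006E U+0303 into U+00F1, so both spellings of ñ compare equal.
std::u32string toyCanonicalize(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x0303 && !out.empty() && out.back() == 0x006E)
            out.back() = 0x00F1;       // n + combining tilde -> ñ
        else
            out.push_back(c);
    }
    return out;
}
```

Comparing the raw code points would say the two spellings differ; comparing their canonicalized forms says they are the same string.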

Let’s consider the Arabic script. Arabic is a cursive script, which means that characters flow together. This means that a character looks different depending on whether it’s in the middle of a word or not. In particular, a character is written in one of four ways: one if it does not have any adjacent characters, one if it has an adjacent character on its left side, one if it has an adjacent character on its right side, and one if it has adjacent characters on both sides. Again, this is the literal same character, just written differently. It also means that, when choosing a glyph for a particular character, you have to take the surrounding characters into account. Therefore, a simple codepoint to glyph mapping will not suffice. There are many other languages with complicated shaping rules.

Here is an example of the four forms for a particular character: (* see bottom for how these characters are shown)

All in all, you can start to see how code point-to-glyph conversion is a nontrivial process. Luckily, in WebKit (on OS X, which is what I’m most familiar with), there is a library called CoreText which does this conversion for us. In particular, the CoreText API that WebKit uses is of the form “give me a sequence of code points and a font, and I’ll give you back a sequence of glyphs with a location for each glyph.” Once WebKit has that, it can pass it along to the 2D drawing subsystem. It should also be noted that CoreText has higher-level APIs that can handle line wrapping and typesetting, but WebKit can’t use those: they assume you already know the region where the text will be drawn, whereas the height of a div is however tall the text inside it wraps to be. Instead, WebKit only asks for the locations of glyphs laid out in a single row, and does this once for each row that needs to be measured or drawn.

The problem now becomes figuring out which runs of code points to draw at which locations.

* The different forms of the above characters were created by using the Zero Width Joiner code point. This is a code point that has zero width and an empty glyph, but is considered joinable for the purposes of glyph selection. Thanks to Dan Bernstein for this example.


Monday, January 6, 2014

How Multisampling Works in OpenGL

I’ve always been a little confused about how multisampling exactly works in OpenGL, and I’ve always seemed to be able to get away without knowing how it works. Recently I did some reading and some playing around with it, and I thought I’d explain it here. I’m speaking from the perspective of OpenGL 4.3 (though my machine only supports 4.1, it should work the same in both). There is a lot of outdated information on the Internet, so I thought I’d specify the version up front.

Multisampled Textures and Renderbuffers

Multisampling only makes logical sense if you’re rendering into a destination that is itself multisampled. When you bind a texture for the first time, if you bind it to GL_TEXTURE_2D_MULTISAMPLE, that texture is defined to be a multisampled texture (and similarly for renderbuffers). A multisampled texture works much like a regular texture, except that it has no mipmaps. In lieu of mipmaps, each texel gets a number of slots that values can be written into; in that sense it’s similar to a texture array.

You create storage for a multisampled texture with glTexImage2DMultisample() instead of glTexImage2D(). There are four main differences between these two calls:

  • You can't specify pixel data to initialize the texture
  • You don't specify a LOD number (there are no mipmaps to choose between)
  • You do specify the number of slots per texel (the number of samples)
  • You do specify a fixedsamplelocations boolean, which I explain later
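You don't need GL to picture what this allocates: a multisampled texture is essentially a 2D grid where each texel is a small array of slots. A toy model in plain Python (not real OpenGL, just the mental model of glTexImage2DMultisample()):

```python
def alloc_multisample(width, height, samples):
    """Model of multisampled texture storage: width x height texels,
    each holding `samples` slots, and no mipmap chain at all."""
    return [[[None] * samples for _x in range(width)] for _y in range(height)]

tex = alloc_multisample(4, 4, 8)   # a 4x4 texture with 8 samples per texel
assert len(tex) == 4               # rows of texels
assert len(tex[0]) == 4            # texels per row
assert len(tex[0][0]) == 8         # slots per texel
```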

Reading from a Multisampled Texture

Shaders can read from multisampled textures, though they work differently than regular textures. In particular, there is a new sampler type, sampler2DMS, that refers to a multisampled texture. You can’t use any of the regular texture sampling operations with this new type. Instead, you can only use texelFetch(), which means that you don’t get any filtering at all. The sampler2DMS overload of texelFetch() takes an additional argument, in place of the LOD argument, specifying which slot you want to read from.
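In the slot model above, this read is just indexing (a sketch in Python rather than GLSL; the layout is my model, not the hardware's):

```python
def texel_fetch_ms(tex, x, y, sample):
    """Model of texelFetch(sampler2DMS, ivec2(x, y), sample):
    an unfiltered read of one slot of one texel."""
    return tex[y][x][sample]

# A 1x1 "texture" with 4 samples holding distinct RGBA colors.
tex = [[[(1.0, 0.0, 0.0, 1.0),
         (0.0, 1.0, 0.0, 1.0),
         (0.0, 0.0, 1.0, 1.0),
         (1.0, 1.0, 1.0, 1.0)]]]
assert texel_fetch_ms(tex, 0, 0, 2) == (0.0, 0.0, 1.0, 1.0)
```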

Writing to a Multisampled Texture or Renderbuffer

The only way you can write to a multisampled texture or renderbuffer is by attaching it to a framebuffer and issuing a draw call.

Normally (with multisampling off), if you run a glDraw*() call, a fragment shader invocation is run for every fragment whose center is deemed to be inside the geometry that you’re drawing. You can think of this as each fragment having a single sample located at its center. With multisampling on, there are a collection of sample points located throughout the fragment. If any of these sample points lies within the rendered geometry, an invocation of the fragment shader is run (but only a single invocation for the entire fragment - more on this later). The fragment shader outputs a single output value for the relevant texel in the renderbuffer, and this same value is copied into each slot that corresponds to each sample that was covered by the rendered geometry. Therefore, the number of samples is dictated by the number of slots in the destination renderbuffer. Indeed, if you try to attach textures/renderbuffers with differing slot counts to the same framebuffer, the framebuffer won’t be complete.
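The rule above can be sketched in a few lines: test coverage per sample, run the fragment shader once per fragment, and copy its single output into every covered slot. A toy model (the sample positions and the half-plane "geometry" are made up for illustration):

```python
def shade_fragment(frag_x, frag_y, sample_positions, inside, shader, texel_slots):
    """One fragment: test each sample point against the geometry; if any is
    covered, run the fragment shader ONCE and replicate its output into
    every covered slot."""
    covered = [i for i, (sx, sy) in enumerate(sample_positions)
               if inside(frag_x + sx, frag_y + sy)]
    if covered:
        color = shader(frag_x, frag_y)   # a single invocation per fragment
        for i in covered:
            texel_slots[i] = color
    return covered

# Geometry: the half-plane x < 10.5 (think of one triangle edge crossing a pixel).
inside = lambda x, y: x < 10.5
positions = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
slots = [(0.0, 0.0, 0.0, 0.0)] * 4
covered = shade_fragment(10, 10, positions, inside,
                         lambda x, y: (1.0, 0.0, 0.0, 1.0), slots)
# Only the two left-hand samples (x offset 0.25) fall inside the geometry,
# and only their slots received the shader's output.
assert covered == [0, 2]
assert slots == [(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0),
                 (1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]
```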

There is even a fragment shader input variable, gl_SampleMaskIn, which is a bitmask of which samples are covered by this fragment. It’s actually an array of ints because you might have more than 32 samples per pixel (though I’ve never seen that). You can also modify which slots will be written to by using the gl_SampleMask fragment shader output variable. However, you can’t use this variable to write to slots that wouldn’t have been written to originally (slots corresponding to samples that aren’t covered by your geometry).
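That array-of-ints layout is easy to model: sample i lives in bit (i % 32) of word (i // 32). A sketch of how such a mask would be assembled from the covered sample indices:

```python
def build_sample_mask(covered_samples, num_samples):
    """Model of a gl_SampleMaskIn-style mask: one bit per covered sample,
    packed into 32-bit words so that >32 samples per pixel still fit."""
    words = [0] * ((num_samples + 31) // 32)
    for i in covered_samples:
        words[i // 32] |= 1 << (i % 32)
    return words

assert build_sample_mask([0, 2], 4) == [0b101]          # samples 0 and 2 of 4
assert build_sample_mask([1, 35], 64) == [0b10, 0b1000]  # spills into word 1
```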

Eventually, you want to render to the screen. When you create your OpenGL context, you can specify that you want a multisampled pixel format. This means that when you render to framebuffer 0 (corresponding to the screen), you are rendering into a multisampled renderbuffer. However, the screen itself can only show a single color per pixel. Therefore, one of the last stages in the OpenGL pipeline is a “resolve” step that averages all of the slots in a given texel. Note that this only happens when you’re drawing to the screen (a multisampled framebuffer object gets resolved when you blit it to a non-multisampled one).
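The resolve step is just a per-texel average over the slots (a sketch; real hardware may average in a different color space, e.g. after sRGB decoding):

```python
def resolve(slots):
    """Model of the multisample resolve: average every slot of a texel
    down to the single color the screen can show."""
    n = len(slots)
    return tuple(sum(channel) / n for channel in zip(*slots))

# Two of four samples covered by red geometry, two left at the black clear color:
# the resolved pixel is half red, a smooth edge.
slots = [(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0),
         (1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0)]
assert resolve(slots) == (0.5, 0.0, 0.0, 1.0)
```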

Note that because we’re writing to all of the specific samples covered by the geometry (and not using something like alpha to determine coverage) that adjacent geometry works properly. If two adjacent triangles cover the same fragment, some of the samples will be written to by one of the triangles, and the rest of the samples will be written to by the other triangle. Therefore, when averaging all these samples, you get a nice blend of both triangles.

There is a related feature of OpenGL called Sample Shading. This allows you to run multiple invocations of your fragment shader for each fragment, where each invocation corresponds to a subset of the samples in the fragment. You turn this feature on by saying glEnable(GL_SAMPLE_SHADING) (multisampling itself can also be turned on and off with glEnable(), but it defaults to on). Then, you can configure how many invocations of the fragment shader you’d like to run with glMinSampleShading(). You pass this function a normalized float, where 1.0 means run one invocation of the fragment shader for each sample, and 0.5 means run one invocation for every two samples. There is a GLSL input variable, gl_SampleID, which tells you which invocation you are running. Therefore, if you set the minimum sample shading rate to 1.0, gl_SampleMaskIn will always be a power of two (exactly one bit set).
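The spec only requires at least ceil(rate × samples) invocations; how samples are assigned to invocations is up to the implementation. A sketch of one plausible partitioning (the round-robin split here is an assumption, not mandated by OpenGL):

```python
import math

def partition_samples(num_samples, min_rate):
    """Model of sample shading: at rate `min_rate`, at least
    ceil(min_rate * num_samples) shader invocations run, and each
    invocation handles a disjoint subset of the samples."""
    invocations = max(1, math.ceil(min_rate * num_samples))
    groups = [[] for _ in range(invocations)]
    for s in range(num_samples):
        groups[s % invocations].append(s)   # assumed round-robin assignment
    return groups

# rate 1.0: one invocation per sample, so each invocation sees a single sample
# (hence gl_SampleMaskIn with exactly one bit set).
assert partition_samples(4, 1.0) == [[0], [1], [2], [3]]
# rate 0.5 with 4 samples: two invocations, two samples each.
assert partition_samples(4, 0.5) == [[0, 2], [1, 3]]
```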

For example, if you want to copy one multisampled texture into another, you could bind the destination texture to the framebuffer, turn on sample shading with a minimum rate of 1.0, draw a fullscreen quad, and have your fragment shader say “outColor = texelFetch(s, ivec2(gl_FragCoord.xy), gl_SampleID);”

There is a function, glGetMultisamplefv(), which lets you query for the location of a particular sample within a fragment. You can also get the number of samples with glGet() and GL_SAMPLES. You can then upload this data to your fragment shader in a Uniform Buffer if you want to use it during shading. However, the function that creates a multisampled texture, glTexImage2DMultisample(), takes a boolean argument, fixedsamplelocations, which dictates whether or not the implementation has to keep the sample count and arrangement the same for each fragment. If you specify GL_FALSE for this value, then the output of glGetMultisamplefv() doesn’t apply.

It’s also worth noting that multisampling is orthogonal to the various framebuffer attachments. It works the same for depth buffers as it does for color buffers.

Now, when you play a game at "8x MSAA," you know exactly what's going on!