Monday, December 15, 2014

Intel GPU Accelerated 2D Graphics

Recently, I’ve been reading the code to an open source OpenGL implementation called Mesa 3D, which includes a driver for Intel graphics cards. While reading the Intel 3D driver, I started wondering how accelerated 2D graphics works on the same card.

Intel’s documentation mentions that their GPUs have a 2D engine, which takes the form of a blitting engine ([1] Section 4.1.2). That’s all you get as far as 2D graphics are concerned. The blitter is fairly simple compared to the 3D pipeline. It has only 26 possible commands, and only three of those are used for setting up state. Each of the remaining commands specifies a particular hardware-specific kind of blit and tells the hardware to execute it.

All my research is being done in the open source stack on FreeBSD, which means that applications are using X11 to get their drawing on screen. X11 has a client/server model, where the application acts as a client and connects to the X11 server, which is actually connected to the display hardware. In this model, the application programmer asks the server to create a Drawable, usually a window on screen, and then asks the server to create a Graphics Context which targets that Drawable. Then, the application sends drawing commands to the X11 server for it to execute. Therefore, the X11 server is the thing actually executing the drawing commands.

Aside: DRI, the mechanism that Mesa 3D uses, bypasses this by asking the X11 server to make a Drawable available directly to the client, and the client is then responsible for drawing into it and notifying the X11 server when it is done drawing. That, however, is currently only used for OpenGL hardware acceleration, and not what I’m concerned with here. Cairo, also, has an Intel-specific DRI backend, so they have their own implementation of hardware-accelerated 2D graphics. Again, that's not what I'm concerned with here.

The X11 server is capable of executing these drawing commands in software. However, there are X11 modules which allow the server to delegate the implementation of these drawing commands to a library that is capable of using hardware acceleration. In the Intel case, we’re looking at a module called xf86-video-intel [2].

Because this is an Xorg module, getting a debug build is a little trickier than simply linking your own application with a locally-built copy, which is what I have been doing with libdrm and Mesa. Instead, after you build xf86-video-intel, you have to tell the existing Xorg to look in your build directory for the shared object. I achieved this by creating a tiny /etc/X11/xorg.conf (the file didn’t exist before). Note that this requires root permissions, because you’re modifying the system X server:

Section "Files"
    ModulePath   "/path/to/new/lib/dir"
    ModulePath   "/usr/local/lib/xorg/modules"
EndSection

After restarting the server, it picked up my locally built copy of xf86-video-intel, and fell back to the system directory for all other modules. I was then able to see any logging that I had enabled in the module by inspecting /var/log/Xorg.0.log. You can also debug the module by SSHing into the machine from another one, and attaching to X in a debugger as root.

Entry Points

Anyway, back to the problem at hand. I’m interested in how the Intel-specific library executes 2D drawing commands, but what are the specific 2D drawing commands that can be executed? This is where we start looking at xcb/xproto.h and xcb/render.h from libxcb. The first file describes commands in the core X11 protocol, which is fairly simplistic and doesn’t have things like antialiasing support or smoothed text support. The second file describes commands in the xrender extension, which adds these features and some others.

Anyway, there aren’t actually that many relevant commands. The core protocol has things like xcb_poly_point(), xcb_poly_line(), xcb_poly_segment(), and xcb_poly_arc() to draw points, connected lines, disjoint line segments, and arcs (it also has some matching functions for filling as well as drawing). Importantly, it also has xcb_put_image() and xcb_get_image() for uploading and downloading bitmaps from the X11 server. It also has xcb_poly_text_8() and xcb_poly_text_16() for text handling. Attributes of these drawing commands (such as using a miter line join or a bevel line join) are set with the function xcb_change_gc().

The X11 render extension has a concept of compositing rather than drawing, which represents a particular function that combines a source pixel and a destination pixel in order to produce a new destination pixel. Its drawing API is a little bit simpler: it has xcb_render_composite(), xcb_render_trapezoids(), and xcb_render_triangles() for compositing rectangles, trapezoids, and triangles, respectively. It also has its own text functions, xcb_render_composite_glyphs_8(), xcb_render_composite_glyphs_16(), and xcb_render_composite_glyphs_32(). In addition, it has support for gradients with xcb_render_create_linear_gradient(), xcb_render_create_radial_gradient(), and xcb_render_create_conical_gradient().
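
To make "combine a source pixel and a destination pixel" concrete, here is a minimal sketch of one such function, the "over" operator, working on premultiplied 32-bit ARGB pixels. The names mul_un8() and composite_over() are my own for illustration and do not come from Xorg or the driver; the real code paths are far more general.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/*
 * "Over": result = src + dst * (1 - src_alpha), applied to each channel.
 * Both pixels are premultiplied ARGB32 (alpha in the top byte).
 */
static uint32_t composite_over(uint32_t src, uint32_t dst)
{
    uint8_t src_alpha = (uint8_t)(src >> 24);
    uint8_t inv_alpha = (uint8_t)(255 - src_alpha);
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;

        /* A premultiplied channel can never exceed its alpha. */
        assert(s <= src_alpha);

        uint32_t c = s + mul_un8((uint8_t)d, inv_alpha);
        if (c > 255)
            c = 255; /* only reachable if dst was not premultiplied */
        result |= c << shift;
    }
    return result;
}
```

An opaque source simply replaces the destination, a fully transparent source leaves it alone, and everything in between blends the two.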

Which Pixels to Modify?

So, how does xf86-video-intel accelerate these calls? Throughout this discussion, it’s important to remember that our library can fall back to the existing software path whenever it wants (as long as it can map the frame buffer to main memory), which means that not all modes of operation need to be supported. I’ll be focusing on the SNA portion of the library, since that’s the latest active implementation.

When drawing, there are two problems that the library has to deal with. The first is, given a source pixel and a destination pixel, figuring out what the resulting new color should be. The second problem is figuring out which pixels we should apply the previous computation to. Let’s talk about the latter problem.

First, let’s tackle the core X11 calls, because those don’t have to perform any complicated compositing. The library has a concept of operations, which our drawing commands will be turned into. There are only a few kinds of operations: sna_composite_op, sna_composite_spans_op, sna_fill_op, and sna_copy_op. These operations contain a vtable for functions named blt(), box(), boxes(), thread_boxes(), and done(). The idea is, for example, if you want to fill some boxes on the screen, you would make a sna_fill_op, repeatedly call box() or boxes(), and then eventually call done(). Note how there is no function that operates on anything other than a rectangle that is aligned with the screen - that is important.

The same function that populates this vtable is the one that actually initializes the command. Function pointers for these construction functions are set up in the very beginning, when we are initializing our context. This means that we can call our function pointer, and it will set up an operation as well as filling out this vtable.
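
Here is a rough sketch of that pattern, assuming a made-up software backend that fills pixels in an ordinary array. Every name in it (box_t, fill_op, backend, sw_fill_init() and friends) is hypothetical; the real sna_fill_op has more state and talks to the GPU instead.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t x1, y1, x2, y2; } box_t; /* x2 and y2 are exclusive */

typedef struct fill_op fill_op;
struct fill_op {
    /* The "vtable": filled in by whichever constructor set up the op. */
    void (*box)(fill_op *op, const box_t *b);
    void (*boxes)(fill_op *op, const box_t *b, int n);
    void (*done)(fill_op *op);
    /* State private to this (software) backend. */
    uint32_t *pixels;
    int stride; /* in pixels */
    uint32_t color;
    int finished;
};

static void sw_box(fill_op *op, const box_t *b)
{
    for (int y = b->y1; y < b->y2; y++)
        for (int x = b->x1; x < b->x2; x++)
            op->pixels[y * op->stride + x] = op->color;
}

static void sw_boxes(fill_op *op, const box_t *b, int n)
{
    for (int i = 0; i < n; i++)
        sw_box(op, &b[i]);
}

static void sw_done(fill_op *op)
{
    op->finished = 1;
}

/* The constructor initializes the operation and populates its vtable. */
static int sw_fill_init(fill_op *op, uint32_t *pixels, int stride, uint32_t color)
{
    assert(op && pixels && stride > 0);
    op->box = sw_box;
    op->boxes = sw_boxes;
    op->done = sw_done;
    op->pixels = pixels;
    op->stride = stride;
    op->color = color;
    op->finished = 0;
    return 1;
}

/* The context holds a pointer to the constructor, chosen once at startup. */
typedef struct {
    int (*fill)(fill_op *op, uint32_t *pixels, int stride, uint32_t color);
} backend;

static void backend_init(backend *be)
{
    be->fill = sw_fill_init;
}
```

The caller only ever sees be->fill() followed by box(), boxes(), and done(), so a different backend can be swapped in by changing what backend_init() installs.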

Let’s take a very simple case: we are trying to draw a single diagonal line across our window. We can set up our fill operation with sna_fill_init_blt(), which will call through the aforementioned function pointer. However, our operation doesn’t have any functions which can draw lines. This means that we actually walk along our line, pixel by pixel, accumulating a big array of boxes, where each box represents a single pixel which needs to be filled (in sna_poly_zero_line_blt()). When we’re done accumulating, we call the boxes() function pointer, and then call the done() function pointer. Drawing points and segments works the same way.
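
As an illustration of walking a line and collecting one box per pixel, here is a sketch using the textbook Bresenham algorithm. This is not the code in sna_poly_zero_line_blt(), which also has to deal with clipping, cap styles, and X11's exact pixelization rules; line_to_boxes() and box_t are names I made up.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x1, y1, x2, y2; } box_t; /* x2 and y2 are exclusive */

/*
 * Walk from (x0, y0) to (x1, y1), both endpoints included, and store a
 * 1x1 box for every pixel visited. At most `max` boxes are written; the
 * return value is the number of pixels on the line.
 */
static int line_to_boxes(int x0, int y0, int x1, int y1, box_t *out, int max)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    int n = 0;

    assert(max >= 0);

    for (;;) {
        if (n < max) {
            out[n].x1 = x0;
            out[n].y1 = y0;
            out[n].x2 = x0 + 1;
            out[n].y2 = y0 + 1;
        }
        n++;
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
    return n;
}
```

The resulting array is exactly the kind of thing that could be handed to a boxes() call in one go.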

Arcs work a little differently. Instead of walking the arc, pixel by pixel, our library delegates to Xorg itself to do that iteration. We call sna_fill_init_blt() like we did before, but next we actually change the interface that we expose to Xorg as a whole (see sna_poly_arc()). We override our FillSpans and PolyPoint functions with specific, different ones, and we then call miZeroPolyArc(). This function will synchronously call our own FillSpans and PolyPoint functions, both of which end up calling the boxes() operation function pointer. A span is a contiguous sequence of pixels that are aligned on a row. Because both these spans and individual pixels are rectangles, we can implement these callbacks in terms of our boxes() operation function pointer. When we’re done, we set our Xorg interface back to what it was originally, and call the done() operation function pointer. Filling polygons and arcs works this way as well.
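
The override-call-restore dance looks roughly like the sketch below. To keep it short, I use a filled disc instead of an arc, and generic_fill_disc() is a stand-in I wrote for the server's own rasterizer, not a real Xorg function. The struct gc here is likewise a toy, not Xorg's GC.

```c
#include <assert.h>

typedef struct { int x, y, width; } span_t;   /* one run of pixels on a row */
typedef struct { int x1, y1, x2, y2; } box_t; /* x2 and y2 are exclusive */

struct gc;
typedef void (*fill_spans_fn)(struct gc *gc, const span_t *spans, int n);

struct gc {
    fill_spans_fn fill_spans; /* the hook the generic rasterizer calls */
    box_t *boxes;             /* where our hook accumulates rectangles */
    int nboxes, max_boxes;
};

/* Stand-in for the server's generic code: one span per row of a disc. */
static void generic_fill_disc(struct gc *gc, int cx, int cy, int r)
{
    assert(r >= 0);
    for (int dy = -r; dy <= r; dy++) {
        int dx = 0;
        while ((dx + 1) * (dx + 1) + dy * dy <= r * r)
            dx++;
        span_t s = { cx - dx, cy + dy, 2 * dx + 1 };
        gc->fill_spans(gc, &s, 1);
    }
}

/* Our hook: a span is just a rectangle that is one pixel tall. */
static void accel_fill_spans(struct gc *gc, const span_t *spans, int n)
{
    for (int i = 0; i < n && gc->nboxes < gc->max_boxes; i++) {
        box_t *b = &gc->boxes[gc->nboxes++];
        b->x1 = spans[i].x;
        b->y1 = spans[i].y;
        b->x2 = spans[i].x + spans[i].width;
        b->y2 = spans[i].y + 1;
    }
}

static void accel_fill_disc(struct gc *gc, int cx, int cy, int r)
{
    fill_spans_fn saved = gc->fill_spans; /* remember the old interface */
    gc->fill_spans = accel_fill_spans;    /* install our hook */
    generic_fill_disc(gc, cx, cy, r);     /* calls back into the hook */
    gc->fill_spans = saved;               /* put everything back */
}
```

After accel_fill_disc() returns, gc->boxes holds the rectangles that would be passed to the boxes() operation function pointer.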

Putting an image is significantly different because it involves a sizable pixel payload. That call is implemented by using the drm_intel_bo_alloc_userptr() function in libdrm. This function will construct a GPU buffer around a piece of application memory without copying it; it simply maps the memory into the GPU’s address space. This is only possible because Intel GPUs share memory with the CPU, rather than having their own discrete memory. Once the GPU has access to the user’s buffer, the library will then use the copy_boxes() operation to use the GPU to move the data (see try_upload_blt()). Once it’s done copying, it will destroy the GPU buffer it created. Getting an image works similarly. However, if the GPU buffer is tiled, we can’t take the previous code path; instead we map the buffer and copy the data in software. This part of the code uses compiler attributes to use AVX or SSE on the CPU.
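
To see why a tiled buffer can't be treated like a plain row-after-row image, here is a small sketch comparing the two address calculations. I'm assuming the X-major tiling layout from Intel's documentation (4 KiB tiles that are 512 bytes wide and 8 rows tall) and ignoring the address swizzling that some hardware adds on top. The function names are mine, not the driver's.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TILE_WIDTH_BYTES = 512, /* assumed X-major tile geometry */
    TILE_HEIGHT_ROWS = 8,
    TILE_SIZE_BYTES = 4096
};

/* Linear layout: each row starts stride_bytes after the previous one. */
static size_t linear_offset(size_t x_bytes, size_t y, size_t stride_bytes)
{
    return y * stride_bytes + x_bytes;
}

/*
 * X-major tiled layout (no swizzling): the surface is cut into tiles, the
 * tiles are stored one after another, and rows are contiguous only inside
 * a tile.
 */
static size_t x_tiled_offset(size_t x_bytes, size_t y, size_t stride_bytes)
{
    assert(stride_bytes % TILE_WIDTH_BYTES == 0);

    size_t tiles_per_row = stride_bytes / TILE_WIDTH_BYTES;
    size_t tile_index = (y / TILE_HEIGHT_ROWS) * tiles_per_row
                      + x_bytes / TILE_WIDTH_BYTES;

    return tile_index * TILE_SIZE_BYTES
         + (y % TILE_HEIGHT_ROWS) * TILE_WIDTH_BYTES
         + x_bytes % TILE_WIDTH_BYTES;
}
```

With a 1024-byte stride, the second row of a linear buffer starts 1024 bytes in, but in the tiled layout it starts only 512 bytes in, so a software copy has to go through a translation like this for every run of bytes.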

When compositing, we still create an operation like before, by calling sna->render.composite_spans(). However, there are a few kinds of primitives that we can be compositing, but our operations can only operate on rectangles. This means we have to have some way of turning our primitives into sequences of rectangles. We do this by iterating in software over every pixel in our primitive (see tor_blt()). This is expensive, so we spin up some threads to do it in parallel (see trapezoid_span_converter()). Each one of these threads shares the operation data structure, but gets its own buffer to populate with rects which cover its pixels. Once a thread’s buffer is populated, it calls the thread_boxes() operation function pointer (which internally deals with locks). The main thread waits on all the threads, and then calls the done() operation function pointer.
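
Here is a much simplified sketch of turning a trapezoid into rectangles. It ignores antialiasing coverage entirely and just produces one box per pixel row, and it takes a band of rows so that different bands could be handed to different threads, each with its own output buffer. trapezoid_t and trapezoid_band_to_boxes() are hypothetical names, and the edge interpolation is my own simplification rather than what tor_blt() does.

```c
#include <assert.h>

typedef struct { int x1, y1, x2, y2; } box_t; /* x2 and y2 are exclusive */

/* Horizontal top and bottom; left and right edges given by their x at each. */
typedef struct {
    int top, bottom;
    int left_top, left_bottom;
    int right_top, right_bottom;
} trapezoid_t;

/*
 * Emit one box per pixel row of `t` that falls inside the band
 * [band_top, band_bottom). At most `max` boxes are written; the return
 * value is the number of non-empty rows found.
 */
static int trapezoid_band_to_boxes(const trapezoid_t *t, int band_top,
                                   int band_bottom, box_t *out, int max)
{
    int h = t->bottom - t->top;
    int y0 = band_top > t->top ? band_top : t->top;
    int y1 = band_bottom < t->bottom ? band_bottom : t->bottom;
    int n = 0;

    assert(h > 0 && max >= 0);

    for (int y = y0; y < y1; y++) {
        /* Linearly interpolate both edges at this row. */
        int xl = t->left_top + (t->left_bottom - t->left_top) * (y - t->top) / h;
        int xr = t->right_top + (t->right_bottom - t->right_top) * (y - t->top) / h;
        if (xr <= xl)
            continue; /* nothing covered on this row */
        if (n < max) {
            out[n].x1 = xl;
            out[n].y1 = y;
            out[n].x2 = xr;
            out[n].y2 = y + 1;
        }
        n++;
    }
    return n;
}
```

Splitting the rows [0, 4) into the bands [0, 2) and [2, 4) gives the same boxes as doing the whole thing at once, which is what makes the work easy to parallelize.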

How to Modify the Pixels?

Now, the other problem: what goes on under the hood of an operation? The operations seem to attempt to use the blitter if possible, but fall back otherwise. The fill operation is actually pretty straightforward since there is a blitter command that does exactly what we want. sna_blt_fill_init() sets some state which represents the current command to XY_SCANLINE_BLT, and emits an XY_SETUP_*_BLT command. Then, sna_blt_fill() sets the function pointers to functions that will output the command followed by rectangular screen coordinates (see _sna_blt_fill_boxes()). A copy operation works the same way with XY_SRC_COPY_BLT_CMD.
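
The shape of the resulting command stream is roughly "one setup command, then one small packet per box". The sketch below shows that shape only. The opcode values are placeholders I invented, not the real hardware encodings, the real setup command carries more fields than a color, and batch_t, fill_begin(), and fill_boxes() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder opcodes: NOT the real XY_SETUP_*_BLT / XY_SCANLINE_BLT values. */
enum { CMD_SETUP = 0x1000, CMD_FILL_BOX = 0x2000 };
enum { BATCH_DWORDS = 64 };

typedef struct { int16_t x1, y1, x2, y2; } box_t;

typedef struct {
    uint32_t dwords[BATCH_DWORDS];
    size_t used;
} batch_t;

static void emit(batch_t *b, uint32_t dw)
{
    assert(b->used < BATCH_DWORDS);
    b->dwords[b->used++] = dw;
}

/* Pack a coordinate pair into one dword: y in the high half, x in the low. */
static uint32_t pack_xy(int16_t x, int16_t y)
{
    return ((uint32_t)(uint16_t)y << 16) | (uint16_t)x;
}

/* State that stays the same for every box is sent once, up front. */
static void fill_begin(batch_t *b, uint32_t color)
{
    emit(b, CMD_SETUP);
    emit(b, color);
}

/* Each box then costs only a command dword plus two coordinate dwords. */
static void fill_boxes(batch_t *b, const box_t *boxes, int n)
{
    for (int i = 0; i < n; i++) {
        emit(b, CMD_FILL_BOX);
        emit(b, pack_xy(boxes[i].x1, boxes[i].y1));
        emit(b, pack_xy(boxes[i].x2, boxes[i].y2));
    }
}
```

Keeping the color and destination in the setup command is what makes the per-box cost so small.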

Compositing has the same general flow, except not all blend modes work with blitting. We detect the cases where the blend mode is supported in sna_blt_composite(), and then populate the same data structure. Currently it looks like filling, clearing, and putting are the only blend modes that are supported. These use the commands XY_COLOR_BLT, XY_SCANLINE_BLT, XY_SRC_COPY_BLT_CMD, and XY_FULL_MONO_PATTERN_BLT.

When drawing text, Xorg hands us the bitmap that we should draw. The bitmap is small, so we use the command XY_TEXT_IMMEDIATE_BLT to allow us to specify the bitmap literally inside the command stream.

If we determine that we can’t use the blitter, we must use the 3D pipeline. Even so, we can do a little better than naively drawing a triangle fan with a passthrough vertex shader. In particular, we can turn all of the 3D pipeline stages off (including the vertex shader!) except for the WM stage (which includes the fragment shader). We only have a finite number of fragment shaders that we will need, so we can compile them all ahead of time (or even possibly at authorship time! See wm_kernels[]). And, instead of issuing a triangle fan to the hardware, we can use the special RECTLIST primitive that is designed exactly for our needs. Indeed, this is what the library does when you fall back to the 3D pipeline case.

There isn’t actually much to do when setting up our 3D pipeline. We need to choose which of the few precompiled kernels we will be using, set up our binding table for our shader, and emit some setup commands to the hardware. These commands are: MI_FLUSH (optionally), GEN4_3DSTATE_BINDING_TABLE_POINTERS, GEN4_3DSTATE_PIPELINED_POINTERS, and GEN4_3DSTATE_VERTEX_ELEMENTS.

Then, when we get a call to emit some boxes, we check two things. First, if we’re about to run out of space in our command buffer, we flush it. Then, if we’re at the beginning of the command buffer, we need to emit GEN4_3DSTATE_VERTEX_BUFFERS and GEN4_3DPRIMITIVE. Once we’ve done that, we can actually emit our box rect.
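
Those two checks can be sketched like this. The batch is tiny so the flush is easy to trigger, the header opcode is a placeholder standing in for the per-batch commands, and batch_flush() only resets the buffer where a real driver would submit it to the kernel. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { BATCH_DWORDS = 16, HEADER_DWORDS = 2, BOX_DWORDS = 3 };
enum { CMD_BATCH_HEADER = 0xA000 }; /* placeholder, not a real opcode */

typedef struct { int16_t x1, y1, x2, y2; } box_t;

typedef struct {
    uint32_t dwords[BATCH_DWORDS];
    int used;
    int flushes; /* how many times the batch was submitted */
} batch_t;

/* A real driver would hand the buffer to the kernel here. */
static void batch_flush(batch_t *b)
{
    b->used = 0;
    b->flushes++;
}

static void emit_box(batch_t *b, const box_t *box)
{
    /* Check 1: if the box would not fit, flush and start a new batch. */
    if (b->used + BOX_DWORDS > BATCH_DWORDS)
        batch_flush(b);

    /* Check 2: a fresh batch needs its per-batch commands first. */
    if (b->used == 0) {
        b->dwords[b->used++] = CMD_BATCH_HEADER;
        b->dwords[b->used++] = 0; /* header payload, unused in this sketch */
    }

    assert(b->used + BOX_DWORDS <= BATCH_DWORDS);
    b->dwords[b->used++] = 0; /* stand-in for the per-box command dword */
    b->dwords[b->used++] = ((uint32_t)(uint16_t)box->y1 << 16) | (uint16_t)box->x1;
    b->dwords[b->used++] = ((uint32_t)(uint16_t)box->y2 << 16) | (uint16_t)box->x2;
}
```

Because the header is tied to "the batch is empty" rather than "this is the first box", it gets re-emitted automatically after every flush.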

Actually figuring out how to emit the rect is the job of gen4_choose_composite_emitter(). This function has blocks for AVX or SSE compiler attributes in order to choose the right emitter code. Ultimately this code boils down to appending/prepending the vertex data on to an existing buffer.
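
For a solid fill, emitting a rect might look something like the sketch below: three vertices appended to a buffer, with the hardware inferring the fourth corner of the RECTLIST rectangle. The vertex order (bottom-right, bottom-left, top-left) is my understanding of what the hardware expects, so treat it as an assumption. Real emitters also write texture coordinates, and vertex_buffer_t and emit_rect() are names I made up.

```c
#include <assert.h>

typedef struct {
    float *data;
    int used;     /* floats written so far */
    int capacity; /* total floats available */
} vertex_buffer_t;

/*
 * Append one rectangle as three (x, y) vertices. Returns 1 on success and
 * 0 if the buffer is full, in which case the caller would flush and retry.
 */
static int emit_rect(vertex_buffer_t *vb, float x1, float y1, float x2, float y2)
{
    assert(vb && vb->data);

    if (vb->used + 6 > vb->capacity)
        return 0;

    float *v = vb->data + vb->used;
    v[0] = x2; v[1] = y2; /* bottom-right */
    v[2] = x1; v[3] = y2; /* bottom-left */
    v[4] = x1; v[5] = y1; /* top-left */
    vb->used += 6;
    return 1;
}
```

Six floats per rectangle is half of what two triangles would need, which is part of the appeal of RECTLIST here.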

All in all, there’s a lot there, but it’s not too complicated.

