Monday, December 15, 2014

Intel GPU Accelerated 2D Graphics

Recently, I’ve been reading the code to an open source OpenGL implementation called Mesa 3D, which includes a driver for Intel graphics cards. While reading the Intel 3D driver, I started wondering how accelerated 2D graphics works on the same card.

Intel’s documentation mentions that their GPUs have a 2D engine, and it takes the form of a blitting engine ([1] Section 4.1.2). That’s all you get as far as 2D graphics are concerned. The blitter is actually fairly simple compared to the 3D pipeline. It only contains 26 possible commands, and only three of those commands are used for setting up state. Each of the rest specifies a particular hardware-specific kind of blit and tells the hardware to execute it.

All my research is being done in the open source stack on FreeBSD, which means that applications are using X11 to get their drawing on screen. X11 has a client/server model, where the application acts as a client and connects to the X11 server, which is actually connected to the display hardware. In this model, the application programmer asks the server to create a Drawable, usually a window on screen, and then asks the server to create a Graphics Context which targets that Drawable. Then, the application sends drawing commands to the X11 server for it to execute. Therefore, the X11 server is the thing actually executing the drawing commands.

Aside: DRI, the mechanism that Mesa 3D uses, bypasses this by asking the X11 server to make a Drawable available directly to the client; the client is then responsible for drawing into it and notifying the X11 server when it is done drawing. That, however, is currently only used for OpenGL hardware acceleration, and is not what I’m concerned with here. Cairo also has an Intel-specific DRI backend, so it has its own implementation of hardware-accelerated 2D graphics. Again, that's not what I'm concerned with here.

The X11 server is capable of executing these drawing commands in software. However, there are X11 modules which allow the server to delegate the implementation of these drawing commands to a library that is capable of using hardware acceleration. In the Intel case, we’re looking at a module called xf86-video-intel [2].

Because this is an Xorg module, getting a debug build is a little trickier than simply linking your own application with a locally-built copy, which is what I have been doing with libdrm and Mesa. Instead, after you build xf86-video-intel, you have to tell the existing Xorg to look in your build directory for the shared object. I achieved this by creating a tiny /etc/X11/xorg.conf (the file didn’t exist before) (note that this requires root permissions, because you’re modifying the system X server):

Section "Files"
ModulePath   "/path/to/new/lib/dir"
ModulePath   "/usr/local/lib/xorg/modules"
EndSection

After restarting the server, it picked up my locally built copy of xf86-video-intel, and fell back to the system directory for all other modules. I was then able to see any logging that I had enabled in the module by inspecting /var/log/Xorg.0.log. You can also debug the module by SSHing into the machine from another one, and attaching to X in a debugger as root.

Entry Points


Anyway, back to the problem at hand. I’m interested in how the Intel-specific library executes 2D drawing commands, but what are the specific 2D drawing commands that can be executed? This is where we start looking at xcb/xproto.h and xcb/render.h from libxcb. The first file describes commands in the core X11 protocol, which is fairly simplistic and doesn’t have things like antialiasing support or smoothed text support. The second file describes commands in the xrender extension, which adds these features and some others.

Anyway, there aren’t actually that many relevant commands. The core protocol has things like xcb_poly_point(), xcb_poly_line(), xcb_poly_segment(), and xcb_poly_arc() to draw points, connected lines, separate line segments, and arcs (it also has matching functions for filling as well as drawing). Importantly, it also has xcb_put_image() and xcb_get_image() for uploading and downloading bitmaps to and from the X11 server, and xcb_poly_text_8() and xcb_poly_text_16() for text handling. Attributes of these drawing commands (such as using a miter line join or a bevel line join) are set with the function xcb_change_gc().
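
To make this concrete, here is a minimal sketch of a client using the core protocol: it creates a window and a Graphics Context, adjusts the GC with xcb_change_gc(), and asks the server to draw a line. The window setup is pared down and error handling is omitted, so treat it as an illustration rather than a complete client.

/* Minimal sketch: create a window and a GC, tweak the GC with
 * xcb_change_gc(), and ask the X server to draw a diagonal line.
 * Error handling and the event loop are omitted. */
#include <stdint.h>
#include <unistd.h>
#include <xcb/xcb.h>

int main(void)
{
    xcb_connection_t *c = xcb_connect(NULL, NULL);
    xcb_screen_t *screen = xcb_setup_roots_iterator(xcb_get_setup(c)).data;

    /* A 256x256 window on the default screen. */
    xcb_window_t window = xcb_generate_id(c);
    uint32_t win_values[] = { screen->white_pixel, XCB_EVENT_MASK_EXPOSURE };
    xcb_create_window(c, XCB_COPY_FROM_PARENT, window, screen->root,
                      0, 0, 256, 256, 0, XCB_WINDOW_CLASS_INPUT_OUTPUT,
                      screen->root_visual,
                      XCB_CW_BACK_PIXEL | XCB_CW_EVENT_MASK, win_values);
    xcb_map_window(c, window);

    /* A Graphics Context targeting the window; set line width and join
     * style through xcb_change_gc(). Values must follow the bit order of
     * the mask. */
    xcb_gcontext_t gc = xcb_generate_id(c);
    uint32_t fg[] = { screen->black_pixel };
    xcb_create_gc(c, gc, window, XCB_GC_FOREGROUND, fg);
    uint32_t attrs[] = { 4, XCB_JOIN_STYLE_MITER };
    xcb_change_gc(c, gc, XCB_GC_LINE_WIDTH | XCB_GC_JOIN_STYLE, attrs);

    /* The server executes the actual drawing: a diagonal line. A real
     * client would wait for an expose event before doing this. */
    xcb_point_t points[] = { { 0, 0 }, { 255, 255 } };
    xcb_poly_line(c, XCB_COORD_MODE_ORIGIN, window, gc, 2, points);
    xcb_flush(c);

    pause();
    xcb_disconnect(c);
    return 0;
}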

The X11 render extension has a concept of compositing rather than drawing: a compositing operator is a particular function that combines a source pixel and a destination pixel to produce a new destination pixel. Its drawing API is a little bit simpler: it has xcb_render_composite(), xcb_render_trapezoids(), and xcb_render_triangles() for compositing rectangles, trapezoids, and triangles, respectively. It also has its own text functions, xcb_render_composite_glyphs_8(), xcb_render_composite_glyphs_16(), and xcb_render_composite_glyphs_32(). In addition, it has support for gradients with xcb_render_create_linear_gradient(), xcb_render_create_radial_gradient(), and xcb_render_create_conical_gradient().
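
For comparison, a composite request looks roughly like the sketch below. It assumes src_picture and dst_picture are Pictures that were already created with xcb_render_create_picture() (which in turn needs the pict-format query dance, elided here).

#include <xcb/xcb.h>
#include <xcb/render.h>

/* Composite a 100x100 region of src_picture over dst_picture at (10, 10),
 * using the OVER operator and no mask. Both pictures are assumed to exist
 * already. */
void composite_example(xcb_connection_t *c,
                       xcb_render_picture_t src_picture,
                       xcb_render_picture_t dst_picture)
{
    xcb_render_composite(c, XCB_RENDER_PICT_OP_OVER,
                         src_picture,
                         XCB_NONE,          /* no mask picture */
                         dst_picture,
                         0, 0,              /* source x, y */
                         0, 0,              /* mask x, y (unused) */
                         10, 10,            /* destination x, y */
                         100, 100);         /* width, height */
    xcb_flush(c);
}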

Which Pixels to Modify?


So, how does xf86-video-intel accelerate these calls? Throughout this discussion, it’s important to remember that our library can fall back to the existing software path whenever it wants (as long as it can map the frame buffer to main memory), which means that not all modes of operation need to be supported. I’ll be focusing on the SNA portion of the library, since that’s the latest active implementation.

When drawing, there are two problems that the library has to deal with. The first is, given a source pixel and a destination pixel, figuring out what the resulting new color should be. The second problem is figuring out which pixels we should apply the previous computation to. Let’s talk about the latter problem.

First, let’s tackle the core X11 calls, because those don’t have to perform any complicated compositing. The library has a concept of operations, which our drawing commands will be turned into. There are only a few kinds of operations: sna_composite_op, sna_composite_spans_op, sna_fill_op, and sna_copy_op. These operations contain a vtable of functions named blt(), box(), boxes(), thread_boxes(), and done(). The idea is, for example, if you want to fill some boxes on the screen, you would make a sna_fill_op, repeatedly call box() or boxes(), and then eventually call done(). Note how there is no function that operates on anything other than a rectangle that is aligned with the screen - that is important.
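
The shape of these operations is roughly the following. This is an illustrative sketch with simplified names and signatures, not the driver’s actual definitions.

#include <stdint.h>

/* Illustrative sketch of a fill operation: a little state plus a vtable of
 * callbacks that all operate on screen-aligned boxes. Loosely modelled on
 * sna_fill_op; the real struct has more fields and different signatures. */
struct box { int16_t x1, y1, x2, y2; };

struct fill_op {
    void *backend_state;   /* whatever the backend (blitter or 3D) needs */
    void (*blt)(struct fill_op *op, int16_t x, int16_t y,
                int16_t width, int16_t height);
    void (*box)(struct fill_op *op, const struct box *b);
    void (*boxes)(struct fill_op *op, const struct box *b, int nbox);
    void (*done)(struct fill_op *op);
};

/* Typical usage: construct the op, emit boxes, then finish it. */
static void fill_two_boxes(struct fill_op *op)
{
    struct box b[2] = { { 0, 0, 32, 32 }, { 64, 64, 96, 96 } };
    op->boxes(op, b, 2);
    op->done(op);
}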

The same function that populates this vtable is the one that actually initializes the command. Function pointers for these construction functions are set up in the very beginning, when we are initializing our context. This means that we can call our function pointer, and it will set up an operation as well as filling out this vtable.

Let’s take a very simple case: we are trying to draw a single diagonal line across our window. We can set up our fill operation with sna_fill_init_blt(), which will call through the aforementioned function pointer. However, our operation doesn’t have any functions which can draw lines. This means that we actually walk along our line, pixel by pixel, accumulating a big array of boxes, where each box represents a single pixel which needs to be filled (in sna_poly_zero_line_blt()). When we’re done accumulating, we call the boxes() function pointer, and then call the done() function pointer. Drawing points and segments works the same way.
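
Continuing the sketch above, walking a 45-degree line pixel by pixel and feeding the accumulated 1x1 boxes to the operation might look like this. It is a simplified stand-in for what sna_poly_zero_line_blt() does; the real code performs a full Bresenham-style walk and handles clipping.

/* Fill a perfectly diagonal line starting at (x0, y0) by accumulating one
 * 1x1 box per pixel and flushing batches of them through boxes(). Boxes use
 * a half-open convention here, so a single pixel is a 1x1 box. Only the
 * 45-degree case is handled; arbitrary slopes need a real line walk. */
static void fill_diagonal_line(struct fill_op *op, int16_t x0, int16_t y0,
                               int16_t length)
{
    enum { BATCH = 128 };
    struct box batch[BATCH];
    int n = 0;

    for (int16_t i = 0; i < length; i++) {
        batch[n].x1 = x0 + i;
        batch[n].y1 = y0 + i;
        batch[n].x2 = x0 + i + 1;
        batch[n].y2 = y0 + i + 1;
        if (++n == BATCH) {       /* flush a full batch of pixel boxes */
            op->boxes(op, batch, n);
            n = 0;
        }
    }
    if (n)
        op->boxes(op, batch, n);  /* flush the remainder */
    op->done(op);
}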

Arcs work a little differently. Instead of walking the arc pixel by pixel, our library delegates to Xorg itself to do that iteration. We call sna_fill_init_blt() like we did before, but next we actually change the interface that we expose to Xorg as a whole (see sna_poly_arc()). We override our FillSpans and PolyPoint functions with specific, different ones, and then call miZeroPolyArc(). This function will synchronously call our own FillSpans and PolyPoint functions. A span is a contiguous sequence of pixels aligned on a row; because both spans and individual pixels are rectangles, we can implement these callbacks in terms of our boxes() operation function pointer. When we’re done, we set our Xorg interface back to what it was originally, and call the done() operation function pointer. Filling polygons and arcs works this way as well.

Putting an image is significantly different, because it involves a sizeable pixel payload. That call is implemented using the drm_intel_bo_alloc_userptr() function in libdrm. This function constructs a GPU buffer around a piece of application memory without copying it; it simply maps the memory into the GPU’s address space. This is only possible because Intel GPUs share memory with the CPU, rather than having their own discrete memory. Once the GPU has access to the user’s buffer, the library will then use the copy_boxes() operation to make the GPU move the data (see try_upload_blt()). Once it’s done copying, it will destroy the GPU buffer it created. Getting an image works similarly. However, if the GPU buffer is tiled, we can’t take this code path; instead we map the buffer and copy the data in software. This part of the code uses compiler attributes to use AVX or SSE on the CPU.
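
A rough sketch of the upload side of that path, using libdrm’s userptr wrapper, is below. The bookkeeping the driver does around this (and try_upload_blt() itself) is much more involved, and the page-alignment note is my assumption about the kernel’s userptr path rather than something established here.

#include <intel_bufmgr.h>
#include <i915_drm.h>   /* for I915_TILING_NONE */

/* Wrap an application-provided pixel buffer in a GPU buffer object without
 * copying it, so the blitter can read from it directly. The pointer and
 * size are assumed to be page-aligned; error handling is omitted. */
static drm_intel_bo *wrap_user_pixels(drm_intel_bufmgr *bufmgr,
                                      void *pixels, unsigned long size)
{
    return drm_intel_bo_alloc_userptr(bufmgr, "put-image payload",
                                      pixels,
                                      I915_TILING_NONE,
                                      0,       /* stride: untiled */
                                      size,
                                      0);      /* flags */
}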

When compositing, we still create an operation like before, by calling sna->render.composite_spans(). However, there are a few kinds of primitives that we can be compositing, while our operations can only operate on rectangles. This means we have to have some way of turning our primitives into sequences of rectangles. We do this by iterating in software over every pixel in our primitive (see tor_blt()). This is expensive, so we spin up some threads to do it in parallel (see trapezoid_span_converter()). Each of these threads shares the operation data structure, but gets its own buffer to populate with rects which cover its pixels. Once a thread’s buffer is populated, it calls the thread_boxes() operation function pointer (which internally deals with locks). The main thread waits on all the threads, and then calls the done() operation function pointer.

How to Modify the Pixels?


Now, the other problem: what goes on under the hood of an operation? The operations seem to attempt to use the blitter if possible, and fall back otherwise. The fill operation is actually pretty straightforward, since there is a blitter command that does exactly what we want. sna_blt_fill_init() sets some state recording the current command as XY_SCANLINE_BLT, and emits an XY_SETUP_*_BLT command. Then, sna_blt_fill() sets the function pointers to functions that will output the command followed by rectangular screen coordinates (see _sna_blt_fill_boxes()). A copy operation works the same way with XY_SRC_COPY_BLT_CMD.

Compositing has the same general flow, except not all blend modes work with blitting. We detect the cases where the blend mode is supported in sna_blt_composite(), and then populate the same data structure. Currently it looks like filling, clearing, and putting are the only blend modes that are supported. These use the commands XY_COLOR_BLT, XY_SCANLINE_BLT, XY_SRC_COPY_BLT_CMD, and XY_FULL_MONO_PATTERN_BLT.

When drawing text, Xorg hands us the bitmap that we should draw. The bitmap is small, so we use the command XY_TEXT_IMMEDIATE_BLT to allow us to specify the bitmap literally inside the command stream.

If we determine that we can’t use the blitter, we must fall back to the 3D pipeline. However, we can do a little better than naively drawing a triangle fan with a passthrough vertex shader. In particular, we can turn all of the 3D pipeline stages off (including the vertex shader!) except for the WM stage (which includes the fragment shader). We only need a finite number of fragment shaders, so we can compile them all ahead of time (or even possibly at authorship time! See wm_kernels[]). And, instead of issuing a triangle fan to the hardware, we can use the special RECTLIST primitive that is designed exactly for our needs. Indeed, this is what the library does in the 3D pipeline fallback case.

There isn’t actually much to do when setting up our 3D pipeline. We need to choose which of the few precompiled kernels we will be using, set up our binding table for our shader, and emit some setup commands to the hardware. These commands are: MI_FLUSH (optionally), GEN4_3DSTATE_BINDING_TABLE_POINTERS, GEN4_3DSTATE_PIPELINED_POINTERS, and GEN4_3DSTATE_VERTEX_ELEMENTS.

Then, when we get a call to emit some boxes, we check two things. First, if we’re about to run out of space in our command buffer, we flush it. Then, if we’re at the beginning of the command buffer, we need to emit GEN4_3DSTATE_VERTEX_BUFFERS and GEN4_3DPRIMITIVE. Once we’ve done that, we can actually emit our box rect.

Actually figuring out how to emit the rect is the job of gen4_choose_composite_emitter(). This function has blocks for AVX or SSE compiler attributes in order to choose the right emitter code. Ultimately this code boils down to appending/prepending the vertex data on to an existing buffer.

All in all, there’s a lot there, but it’s not too complicated.

[1] https://01.org/linuxgraphics/sites/default/files/documentation/g45_vol_1a_core_updated_0.pdf
[2] http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/

Monday, December 8, 2014

Design of Mesa 3D Part 11: Intel Command Buffers

The hardware of an Intel graphics chip actually mirrors the OpenGL pipeline pretty well. There are hardware units that correspond to each stage of the 3D pipeline: a stage for the vertex shader, a stage for the geometry shader, a stage for the rasterizer, and so on, and each of these stages is connected to the next. The Intel manual specifies the pipeline present in the chip (Volume 2, section 2.2):

  1. Command Streamer: The single point of entry for GPU commands.
  2. Vertex Fetcher: Deals with index buffers.
  3. Vertex Shader: Same as OpenGL.
  4. Geometry Shader: Same as OpenGL.
  5. Clipper: Clips to the viewport.
  6. Strip/Fan: Turns higher-level primitives (such as triangle strips) into triangles.
  7. Windower/Masker: Performs rasterization and runs the fragment shader.

When you send commands to the GPU, the Command Streamer receives them, and, if necessary, sends them down the pipeline. When a hardware unit encounters a command that it cares about, it performs the command, and then optionally continues sending the command on its way. Some commands don’t need to be further propagated down the pipeline, so those are squelched.

But what do these commands look like anyway? Well, they are roughly analogous to OpenGL calls, which means that most of these commands are simply setting up state and configuring the hardware in a particular way. Here are some examples of commands:

  • 3DSTATE_INDEX_BUFFER: In the future, please use this buffer as an index buffer for vertices.
  • 3DSTATE_BASE_ADDRESS: All pointers are relative to this address.
  • URB_FENCE: Please divide up your scratch space in the following way between the various pipeline stages.
  • 3DSTATE_PIPELINED_POINTERS: Here is the state that every draw call needs, for each hardware unit.
  • 3D_PRIMITIVE: Invoke the pipeline; do the draw.

There are many more, but that’s the general idea.

There is also a pool of processing cores which execute shaders. These are called EUs, for “Execution Units.” My 4th generation card has 10 of these EUs, and each one can run a maximum of 5 threads at a time, for a total of 50 possible concurrent threads! There is a dispatcher unit which submits jobs to the EUs, and pipeline stages (such as the WM stage for the fragment shader) interact with the dispatcher to run threads.

I also want to be very explicit: the commands in the command buffer that the CS sees are not the same as the assembly commands that the EUs execute. The EUs execute shader commands, which are things like “add” and “shift.” The commands in the command buffer are for setting up the entire pipeline and invoking the pipeline as a whole.

So how do these command buffers get populated in the Intel Mesa driver? Well, any time any state changes that the pipeline cares about, the driver gets a callback. The driver simply remembers which items are dirty in a bit flag inside the context. Then, when it comes time to draw, we look at all the dirty bits.

Inside brw_state_upload.c, there is a list of “atoms.” Each atom is a tuple of a bit flag and a function pointer. Inside brw_upload_state(), we iterate through all the atoms, comparing the atom’s bit flag to the dirty bit flag. If there is a match, we call the callback function.
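
In sketch form, the atom mechanism boils down to something like the following. The names here are simplified; the real brw_tracked_state splits its dirty bits into separate Mesa, brw, and cache masks.

#include <stdint.h>

/* Simplified sketch of the "atom" idea in brw_state_upload.c: each atom
 * pairs a dirty-bit mask with an emit callback. */
struct context;    /* stand-in for the driver's context structure */

struct atom {
    uint64_t dirty_mask;                 /* state changes this atom cares about */
    void (*emit)(struct context *ctx);   /* set up state or emit a command */
};

static void upload_state(struct context *ctx, uint64_t dirty_bits,
                         const struct atom *atoms, int natoms)
{
    for (int i = 0; i < natoms; i++) {
        /* If any bit this atom cares about is dirty, run its callback. */
        if (atoms[i].dirty_mask & dirty_bits)
            atoms[i].emit(ctx);
    }
}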

There seem to be two general kinds of callback functions: ones that set up state, and ones that append a command to the command buffer. Many commands simply let the GPU know about some state struct that has been set up in GPU memory, so creating that struct has to be done before the command targeting it is issued.

The first kind of callback is pretty straightforward: we call brw_state_batch() to allocate some space, cast the resulting pointer to the struct type that we are trying to fill, and then fill it with information from the context. brw_state_batch() is kind of interesting, because it allocates from the end of the current command buffer, growing backwards. This is so that we don’t have to allocate an entire memory buffer for each struct that the GPU needs to know about. There’s a comment I saw about how the minimum size of a buffer is 4096 bytes (likely because that is the page size), and it would be silly to waste that.

The second kind of callback is also fairly straightforward, except for one point. Issuing the command itself is easy: move a pointer forward by a specified amount, check that it didn’t overflow, and if it didn’t, write into the space we just created. However, there’s one part that is interesting: the command includes a pointer to GPU memory. In userland, we don’t actually know where any buffer will end up in graphics memory, so we don’t know what pointer value to write.

This is where drm_intel_bo_emit_reloc() comes into play. You pass this function the place where you want to write the pointer, and the location that the pointer should ultimately point to. Both of these pieces are specified as a drm_intel_bo* and an offset. libdrm will simply keep track of these relocation requests. Then, when we actually flush the buffer (via drm_intel_bo_mrb_exec()), we pass the relocation list to the kernel. The kernel will then patch up the relocation points en route to the GPU.
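
As a sketch, emitting a pointer into the batch and recording the matching relocation might look like the following. The helper name and the batch-offset bookkeeping are mine; the driver wraps this pattern in its OUT_BATCH/OUT_RELOC macros.

#include <stdint.h>
#include <intel_bufmgr.h>
#include <i915_drm.h>   /* for the I915_GEM_DOMAIN_* flags */

/* Write a 32-bit placeholder into the (mapped) batch buffer at byte offset
 * batch_offset, and ask libdrm to patch that location with the final GPU
 * address of target (plus target_offset) when the batch is executed. */
static void emit_pointer_to(drm_intel_bo *batch, uint32_t batch_offset,
                            drm_intel_bo *target, uint32_t target_offset)
{
    uint32_t *slot = (uint32_t *)((char *)batch->virtual + batch_offset);

    /* This value is only a guess based on where the buffer last lived; the
     * kernel rewrites it using the relocation record below if it moved. */
    *slot = target->offset + target_offset;

    drm_intel_bo_emit_reloc(batch, batch_offset,
                            target, target_offset,
                            I915_GEM_DOMAIN_RENDER,   /* read domains */
                            0);                       /* no write domain */
}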

Tuesday, December 2, 2014

Design of Mesa 3D Part 10: Intel’s Device Driver

FreeBSD 10.1 is out [1] which means I’ve got a good opportunity to take another look at Mesa. Since I last looked at it, the FreeBSD Ports system has been updated to the latest version of Mesa, version 10.3.3 [2], which is 3 major versions past where I was looking before. Needless to say, much has changed.

Once again, I’d like to discuss the players here.

The first player is libglapi. This is a pretty simple library which contains two thread-local variables: one which holds the current OpenGL dispatch table, and another which holds the current OpenGL context. Mesa can set and get the first one with _glapi_set_dispatch() and _glapi_get_dispatch(). The dispatch table holds a collection of function pointers, and every OpenGL API call gets directed through a dispatch table lookup. Mesa can set and get the current OpenGL context with _glapi_set_context() and _glapi_get_context(). Inside libglapi, contexts are treated as a void*.

Another player is EGL. EGL is a platform abstraction that encapsulates X11 on FreeBSD (but it is portable, so it would encapsulate WGL on Windows, and so on). EGL has a concept of a display and a rendering surface, and is responsible for creating an OpenGL context which uses them. It’s also responsible for making an OpenGL context “current,” as well as performing the buffer swap at the end of the frame to present your drawn framebuffer. If you want to create an onscreen surface, you have to pass the relevant function a void* which represents the platform-specific renderable that EGL should draw to.

Because EGL is a platform abstraction, it is driver-based. Drivers in Mesa have a single symbol which is either a function that populates a vtable, or is the vtable itself. Currently, there are only two EGL drivers in Mesa, but only one seems to have an implementation: the DRI2-based one. This driver has two parts: the main part of the driver, and the part that deals with X11. (If your platform doesn’t use X11, there are other pieces that can replace it; for example, there is a Wayland part which can be used instead of the X11 piece.) The main part of the driver knows about the X11 part, but the calls from it to the X11 piece are behind an #ifdef. The X11 part doesn’t know about the main part; instead, it fills out a vtable with function pointers for the main part to call.

The X11 part uses XCB [3] to interact with the X server. Because the user of EGL had to already set up their rendering destination, they already have a connection to the X11 server, so this part piggybacks off of that connection. (If you’re rendering to a pbuffer, this part makes its own XCB connection to the X server). Its responsibility is to handle all of the requirements of DRI that interact with the X server. Luckily, there isn’t very much to do there. Look in xcb/dri2.h for a list of all the calls that are necessary. Relevant ones are:
  • xcb_dri2_query_version(): Returns the version of the DRI infrastructure. Mine says 1.3.
  • xcb_dri2_connect(): This returns two important strings:
    • The driver name that Mesa should use to actually drive the hardware. Mine is i965. Mesa turns this into “dri/i965_dri.so” and will dlopen() it.
    • The file to open to send DRM commands to. All DRM commands are implemented as ioctl()s on a file descriptor. This is the path to the file to open to get the file descriptor. Mine is /dev/dri/card0
  • xcb_dri2_authenticate(): This is the access control that DRI uses. Once you’ve opened the DRM device file, you can start sending commands directly to the hardware. However, before you start, you have to send one particular command, drmGetMagic(), which will return an arbitrary number. You then have to supply this number to xcb_dri2_authenticate() so the X server can authenticate you (using the same authentication mechanisms that any X client uses). Once you’ve done this, you can start sending commands to the hardware via the fd you opened previously.
  • xcb_dri2_create_drawable(): No return value. You pass the onscreen window XID and this will make the window’s buffers available to DRM.
  • xcb_dri2_destroy_drawable(): The reverse of the above.
  • xcb_dri2_get_buffers(): You pass an array of attachments which represent buffers that the X server knows about and will use for compositing the screen. (For example, XCB_DRI2_ATTACHMENT_BUFFER_BACK_LEFT). xcb_dri2_get_buffers() will return a bunch of information regarding each buffer, including the “pitch” of the buffer (the number of bytes between successive rows, though this is hardware-dependent as buffers can be tiled on Intel cards), the CPP (number of bytes per pixel), some flags, and, most importantly, the buffer “name” which is a number which represents the device-specific ID of the buffer.
  • xcb_dri2_swap_buffers(): Once you’re done with a frame, you have to let the X server know that.
  • There are some others, but I’ve omitted them because they’re not very relevant.
So, the X11 part of the EGL driver performs that initial handshake with the X server, and populates a driver_name field, which the main part of the EGL driver uses to open the correct .so. This .so represents the OpenGL driver (not the EGL driver!). Drivers export a vtable of function pointers.
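
In sketch form, that handshake looks roughly like the function below. Error handling is thinned out, and the real X11 platform code in Mesa is more careful about reply lengths.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <xcb/xcb.h>
#include <xcb/dri2.h>
#include <xf86drm.h>

/* DRI2 handshake against a root window: learn the DRM device path, open it,
 * and authenticate the resulting file descriptor with the X server.
 * Returns the DRM fd, or -1 on failure. */
static int dri2_handshake(xcb_connection_t *c, xcb_window_t root)
{
    xcb_dri2_connect_cookie_t cc =
        xcb_dri2_connect(c, root, XCB_DRI2_DRIVER_TYPE_DRI);
    xcb_dri2_connect_reply_t *conn = xcb_dri2_connect_reply(c, cc, NULL);
    if (!conn)
        return -1;

    /* The reply carries the driver name (e.g. "i965") and the device file
     * path (e.g. "/dev/dri/card0"); this sketch only uses the latter. */
    char device[256] = { 0 };
    int len = xcb_dri2_connect_device_name_length(conn);
    if (len > (int)sizeof(device) - 1)
        len = sizeof(device) - 1;
    memcpy(device, xcb_dri2_connect_device_name(conn), len);
    free(conn);

    int fd = open(device, O_RDWR);
    if (fd < 0)
        return -1;

    /* Ask DRM for a magic cookie, then hand it to the X server so it can
     * authorize this file descriptor to talk to the hardware. */
    drm_magic_t magic;
    drmGetMagic(fd, &magic);
    xcb_dri2_authenticate_cookie_t ac = xcb_dri2_authenticate(c, root, magic);
    xcb_dri2_authenticate_reply_t *auth =
        xcb_dri2_authenticate_reply(c, ac, NULL);
    if (!auth || !auth->authenticated) {
        close(fd);
        free(auth);
        return -1;
    }
    free(auth);
    return fd;
}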

Then, the client program starts making OpenGL calls, which all get redirected through libglapi’s function table. In our case, Mesa implements the GL calls. During most OpenGL function calls, the OpenGL driver doesn’t actually get touched, because most GL calls are simply setting state in the context. For example, glViewport(), glClearColor(), glCreateShader(), and glShaderSource() all just set some state. Even glCompileShader() doesn’t actually touch the i965 driver; Mesa simply performs a compilation to a high-level intermediate representation (“hir”). In i965’s case, the driver only gets touched inside glLinkProgram(), which performs the final compilation for the card. Even glGenBuffers() and glBindBuffer() don't actually touch the driver; it’s not until glBufferData() that buffers are actually created on the card.

glClear() is pretty interesting; because I only have a gen4 card, there isn’t a mechanism for quickly clearing the framebuffer. Instead, Mesa has a “meta” API where it implements certain API calls in terms of others. In glClear()’s case, it implements it by drawing two triangles which cover the screen, with a trivial fragment shader.

glDrawArrays() and glDrawElements() obviously both touch the driver. This is where the bulk of the work is done, and where the driver reads all the state that Mesa has been setting up for it.

There’s one more player, though, that I only mentioned briefly: libdrm [4]. This is how the OpenGL driver actually interacts with the GPU. It’s a platform-specific library which includes very low-level primitives for dealing with the GPU, and is implemented entirely by ioctl()s on a file descriptor (which I described how to open earlier). The library has one section for each card it supports, and you (obviously) shouldn’t use the API calls for a card that you do not have. There is one .h file which isn’t in a platform-specific directory (xf86drm.h), but the i965 Mesa driver never seems to call these non-platform-specific functions (except for drmGetMagic(), as described above). Instead, everything seems to come from inside the intel folder.

Once you’re at the level of DRM, the nouns and verbs of OpenGL are no longer relevant, since you’re dealing directly with the card. Indeed, the i965 OpenGL driver even has some nouns which DRM doesn’t know about. DRM has a concept of a buffer on the card, and these buffers are used for everything. It’s straightforward to see how a VBO simply uses a buffer on the card, but a compiled program is also simply placed inside a buffer on the card. A frame buffer is simply a buffer on the card. A command buffer, which OpenGL has no notion of, but the i965 OpenGL driver does (and calls a batch buffer), is simply a buffer on the card. A texture is simply a buffer on the card. Therefore, the part of DRM that handles buffers is the most important part. The API to this subsystem lives in intel_bufmgr.h, and I’ll list a few of the more interesting calls here:
  • drm_intel_bufmgr_gem_init() creates a drm_intel_bufmgr*, which contains a vtable inside it that the buffer-specific calls go through
  • drm_intel_bufmgr_destroy() the reverse of above
  • drm_intel_bufmgr_set_debug(), once called, will set state which will cause successive functions to dump debug output.
  • drm_intel_bo_alloc() makes a drm_intel_bo*, which represents a buffer on the card. This call allocates the buffer.
  • drm_intel_bo_gem_create_from_name(): you pass in the name you got from xcb_dri2_get_buffers(), and it will return a drm_intel_bo* which represents that buffer. This is how you interact with the framebuffer.
  • drm_intel_bo_reference() / drm_intel_bo_unreference(): Buffer objects are reference-counted
  • drm_intel_bo_map() / drm_intel_bo_unmap(): self-explanatory
  • drm_intel_bo_subdata(): Upload data to the buffer at a particular offset
  • drm_intel_bo_get_subdata(): Download data from the buffer at a particular offset
  • drm_intel_bo_exec() / drm_intel_bo_mrb_exec(): Execute the contents of the buffer. This is what performs the glDrawElements() call.
  • drm_intel_bo_set_tiling() / drm_intel_bo_get_tiling(): Intel cards have hardware support for tiling. This is used when the buffer represents a renderbuffer or texture.
As you can start to see, these function calls are all that are really necessary for implementing the core part of the i965 OpenGL device driver. Uploading a texture is simply a drm_intel_bo_alloc() and a drm_intel_bo_subdata(). Using a shader is simply a CPU-side compilation, then an alloc/subdata to get it on the card. For a command buffer, you can map it, then write commands into the buffer. When you’re done, exec it.
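
For example, a texture upload in sketch form. The fixed four-bytes-per-pixel assumption and the lack of tiling are simplifications; the real driver also picks a tiling mode with drm_intel_bo_set_tiling() and computes a hardware-specific layout.

#include <intel_bufmgr.h>

/* Upload width*height 32-bit pixels into a freshly allocated, untiled
 * buffer object. Error handling is minimal. */
static drm_intel_bo *upload_texture(drm_intel_bufmgr *bufmgr,
                                    const void *pixels,
                                    int width, int height)
{
    unsigned long size = (unsigned long)width * height * 4;
    drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "texture", size, 4096);
    if (!bo)
        return NULL;
    drm_intel_bo_subdata(bo, 0, size, pixels);   /* copy the pixels onto the card */
    return bo;
}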

These calls are all passed almost directly to the kernel, meaning that this is about as far down as we can go and still be in userland.

All this information is enough to write a program that uses the GPU directly but doesn’t go through Mesa. Here are the steps:
  1. xcb_connect() to the X server.
  2. xcb_dri2_connect() to get the DRI device file path.
  3. xcb_create_window() to make your window.
  4. xcb_map_window() to show it
  5. At this point, you should wait for an expose event
  6. xcb_dri2_create_drawable() to make it available from DRI
  7. xcb_dri2_get_buffers() with XCB_DRI2_ATTACHMENT_BUFFER_BACK_LEFT to get the name of the buffer that backs the window
  8. open() the DRI device file path
  9. drmGetMagic() to get the magic number which you will use to call…
  10. xcb_dri2_authenticate(). After this point, you can call drm functions.
  11. drm_intel_bufmgr_gem_init() to make an intel-specific bufmgr
  12. drm_intel_bo_gem_create_from_name() with the output of xcb_dri2_get_buffers() that you previously called, to create an intel-specific buffer object which represents the backbuffer of the screen.
  13. drm_intel_bo_map() to map the buffer
  14. use the backbuffer->virtual pointer to write into the frame buffer. You probably should call drm_intel_bo_get_tiling() so you know where to write stuff.
  15. drm_intel_bo_unmap() when you’re done
  16. drm_intel_bo_unreference() when you don’t need the drm_intel_bo anymore
  17. xcb_dri2_swap_buffers() to tell the X server that you’ve finished the frame
  18. drm_intel_bufmgr_destroy() when you’re done with the bufmgr
  19. close() the DRM file descriptor
  20. xcb_dri2_destroy_drawable() when you’re done with DRI2 entirely
  21. xcb_disconnect() from the X server entirely
And there you go. You can see how you might take this structure, but instead of mapping the frame buffer and drawing into it with the CPU, you could allocate a command buffer and write commands into it, and get the GPU to draw into the frame buffer for you.
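
To make the middle of that sequence concrete, here is a sketch of steps 11 through 15. It assumes the earlier steps already produced the DRM file descriptor plus the buffer name and pitch from xcb_dri2_get_buffers(), and it assumes the backbuffer is untiled, which is exactly why step 14 suggests checking the tiling first.

#include <stddef.h>
#include <stdint.h>
#include <intel_bufmgr.h>

/* Map the window's backbuffer and fill it with a solid color on the CPU.
 * name and pitch come from xcb_dri2_get_buffers(); this sketch assumes the
 * buffer is untiled, so rows really are pitch bytes apart. */
static void fill_backbuffer(int drm_fd, uint32_t name, uint32_t pitch,
                            int width, int height)
{
    drm_intel_bufmgr *bufmgr = drm_intel_bufmgr_gem_init(drm_fd, 4096);
    drm_intel_bo *bo =
        drm_intel_bo_gem_create_from_name(bufmgr, "backbuffer", name);

    uint32_t tiling, swizzle;
    drm_intel_bo_get_tiling(bo, &tiling, &swizzle);
    /* A real program would handle tiled layouts here; we just assume linear. */

    drm_intel_bo_map(bo, 1 /* writable */);
    uint8_t *base = bo->virtual;
    for (int y = 0; y < height; y++) {
        uint32_t *row = (uint32_t *)(base + (size_t)y * pitch);
        for (int x = 0; x < width; x++)
            row[x] = 0xff2288ff;   /* an arbitrary XRGB color */
    }
    drm_intel_bo_unmap(bo);

    drm_intel_bo_unreference(bo);
    drm_intel_bufmgr_destroy(bufmgr);
}

After this, xcb_dri2_swap_buffers() (step 17) is what lets the X server know the frame is finished.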

[1] https://www.freebsd.org/releases/10.1R/announce.html
[2] http://www.mesa3d.org/relnotes/10.3.3.html
[3] http://xcb.freedesktop.org
[4] http://dri.freedesktop.org/wiki/