- Command Streamer: The single point of entry for GPU commands
- Vertex Fetcher: Deals with index buffers
- Vertex Shader: Same as OpenGL
- Geometry Shader: Same as OpenGL
- Clipper: Clips to the viewport
- Strip/Fan: Turns higher-level primitives (such as triangle strips) into triangles
- Windower/Masker: Performs rasterization and runs the fragment shader.
When you send commands to the GPU, the Command Streamer receives them and, if necessary, sends them down the pipeline. When a hardware unit encounters a command it cares about, it executes that command and then optionally forwards it further along. Commands that don’t need to propagate any further down the pipeline are squelched.
But what do these commands look like anyway? Well, they are roughly analogous to OpenGL calls, which means that most of these commands are simply setting up state and configuring the hardware in a particular way. Here are some examples of commands:
- 3DSTATE_INDEX_BUFFER: In the future, please use this buffer as an index buffer for vertices.
- 3DSTATE_BASE_ADDRESS: All pointers are relative to this address.
- URB_FENCE: Please divide up your scratch space in the following way between the various pipeline stages.
- 3DSTATE_PIPELINED_POINTERS: Here is the state that every draw call needs, for each hardware unit.
- 3D_PRIMITIVE: Invoke the pipeline; do the draw.
There are many more, but that’s the general idea.
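To make this a little more concrete, here is a sketch of what a command in the batch buffer might look like at the DWORD level: a header that packs an opcode together with a payload length, followed by that many payload DWORDs. The opcode values and bit positions below are invented for illustration; they are not the real Gen4 encoding.

```c
#include <stdint.h>

/* Hypothetical command encoding (NOT the real Gen4 bit layout):
 * a header DWORD packing an opcode with the payload length, followed
 * by that many payload DWORDs. The opcode values are made up. */
enum fake_opcode {
    FAKE_3DSTATE_INDEX_BUFFER = 0x1,
    FAKE_3D_PRIMITIVE         = 0x2,
};

/* Opcode in the top 16 bits, payload length (in DWORDs) in the bottom 16. */
static inline uint32_t cmd_header(uint32_t opcode, uint32_t len_dwords)
{
    return (opcode << 16) | (len_dwords & 0xffffu);
}

static inline uint32_t cmd_opcode(uint32_t header) { return header >> 16; }
static inline uint32_t cmd_len(uint32_t header)    { return header & 0xffffu; }
```

The real encodings are considerably more involved (sub-opcodes, pipeline type bits, and so on), but the shape is the same: a self-describing header followed by a fixed-length payload, which is what lets each hardware unit skip over commands it doesn’t care about.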
There is also a pool of processing cores which execute shaders. These are called EUs, for “Execution Units.” My 4th generation card has 10 of these EUs, and each one can run a maximum of 5 threads at a time, for a total of 50 possible concurrent threads! There is a dispatcher unit which submits jobs to the EUs, and pipeline stages (such as the WM stage for the fragment shader) interact with the dispatcher to run threads.
I also want to be very explicit: the commands in the command buffer that the CS sees are not the same as the assembly instructions that the EUs execute. The EUs execute shader instructions, which are things like “add” and “shift.” The commands in the command buffer set up the entire pipeline and invoke the pipeline as a whole.
So how do these command buffers get populated in the Intel Mesa driver? Well, any time any state that the pipeline cares about changes, the driver gets a callback. The driver simply remembers which items are dirty in a bit flag inside the context. Then, when it comes time to draw, we look at all the dirty bits.
Inside brw_state_upload.c, there is a list of “atoms.” Each atom is a tuple of a bit flag and a function pointer. Inside brw_upload_state(), we iterate through all the atoms, comparing the atom’s bit flag to the dirty bit flag. If there is a match, we call the callback function.
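In code, that walk over the atoms can be sketched like this. The struct layout and names here are simplified stand-ins for the driver’s real tracked-state machinery, not its actual definitions:

```c
#include <stdint.h>

/* A sketch of the "atom" scheme: each atom pairs a dirty-bit mask with a
 * callback; on each draw we walk the list and fire every callback whose
 * mask intersects the context's dirty bits. Names are invented stand-ins
 * for the real structures in brw_state_upload.c. */

struct fake_context {
    uint64_t dirty;   /* dirty bits accumulated since the last draw */
};

struct atom {
    uint64_t dirty_mask;                  /* state changes this atom cares about */
    void (*emit)(struct fake_context *);  /* appends state or a command to the batch */
};

static void upload_state(struct fake_context *ctx,
                         const struct atom *atoms, int n)
{
    for (int i = 0; i < n; i++) {
        if (atoms[i].dirty_mask & ctx->dirty)
            atoms[i].emit(ctx);
    }
    ctx->dirty = 0;  /* everything relevant has been (re)emitted */
}

/* Tiny demo callback that just counts how many times it fired. */
static int emit_count;
static void demo_emit(struct fake_context *ctx) { (void)ctx; emit_count++; }
```

The appeal of this design is that a draw only re-emits the state that actually changed; an atom whose bits stayed clean is skipped entirely.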
There seem to be two general kinds of callback functions: ones that set up state, and ones that append a command to the command buffer. Many commands simply let the GPU know about some state struct that has been set up in GPU memory, so the struct has to be created before the command targeting it is issued.
The first kind of callback is pretty straightforward: we call brw_state_batch() to allocate some space, cast the resulting pointer to the struct type we are trying to fill, and then fill it with information from the context. brw_state_batch() is kind of interesting, because it allocates from the end of the current command buffer, growing backwards. This is so that we don’t have to allocate an entire memory buffer for each struct that the GPU needs to know about. There’s a comment I saw about how the minimum size of a buffer is 4096 bytes (likely because that is the page size), and it would be silly to waste that.
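Here’s a sketch of that backwards-growing allocator. The struct and names are simplified stand-ins, not the driver’s actual implementation: commands grow forward from the front of the buffer, state structs are carved out of the back, and the two meet in the middle.

```c
#include <stddef.h>
#include <stdint.h>

/* A sketch of the brw_state_batch() idea: one page-sized buffer holds
 * commands (written forward from the front) and state structs (allocated
 * backward from the end), so each state struct doesn't cost its own
 * 4096-byte buffer. Simplified stand-in, not the driver's real type. */
struct fake_batch {
    uint8_t *buf;        /* backing storage */
    size_t   size;       /* total size, e.g. 4096 */
    size_t   cmd_used;   /* bytes of commands written from the front */
    size_t   state_top;  /* offset of the lowest state alloc; starts at size */
};

/* Allocate sz bytes for a state struct from the end of the buffer,
 * aligned down to 'align' (a power of two). Returns NULL if the
 * allocation would collide with the command stream. */
static void *state_batch_alloc(struct fake_batch *b, size_t sz, size_t align)
{
    size_t offset = (b->state_top - sz) & ~(align - 1);
    if (offset < b->cmd_used)
        return NULL;  /* buffer full: caller would flush and retry */
    b->state_top = offset;
    return b->buf + offset;
}
```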
The second kind of callback is also fairly straightforward, except for one point. Issuing the command itself is simple: move a pointer forward by the command’s size, check whether it overflowed, and, if it didn’t, write the command into the space we just reserved. The interesting part is that the command includes a pointer to GPU memory; in userland, we don’t actually know where any buffer will end up in graphics memory, so we don’t know what pointer value to write.
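The emit path itself can be sketched like this, with made-up names (the real driver does this through its batchbuffer macros rather than a function like the one below):

```c
#include <stddef.h>
#include <stdint.h>

/* A sketch of emitting a command: reserve space by advancing the write
 * offset, fail on overflow, and hand back a pointer to fill in. The
 * names are invented; the real driver would flush the batch and start a
 * new one instead of returning NULL. */
struct cmd_buffer {
    uint32_t *map;    /* mapped batch buffer, addressed in DWORDs */
    size_t    used;   /* DWORDs written so far */
    size_t    size;   /* total capacity in DWORDs */
};

static uint32_t *batch_emit(struct cmd_buffer *b, size_t n_dwords)
{
    if (b->used + n_dwords > b->size)
        return NULL;                    /* would overflow the batch */
    uint32_t *out = b->map + b->used;   /* space we just reserved */
    b->used += n_dwords;
    return out;
}
```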
This is where drm_intel_bo_emit_reloc() comes into play. You pass this function the place where you want to write the pointer and the location that the pointer should ultimately point to, both specified as a drm_intel_bo* plus an offset. libdrm simply keeps track of these relocation requests. Then, when we actually flush the buffer (via drm_intel_bo_mrb_exec()), we pass the relocation list to the kernel, and the kernel patches up the relocations en route to the GPU.
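The scheme can be sketched like so. The types below are simplified stand-ins for libdrm’s actual buffer-object and relocation structures; the point is the division of labor: userland records “at this slot, write the final address of that buffer plus this delta,” and the kernel does the patching once real GPU addresses exist.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for libdrm's drm_intel_bo / relocation machinery. */
struct fake_bo {
    uint32_t *map;       /* CPU mapping of the buffer's contents */
    uint64_t  gpu_addr;  /* final GPU address, known only at exec time */
};

struct fake_reloc {
    struct fake_bo *batch;       /* buffer containing the pointer slot */
    size_t          slot_dword;  /* which DWORD in that buffer to patch */
    struct fake_bo *target;      /* buffer the pointer should reference */
    uint32_t        delta;       /* offset within the target */
};

/* What the kernel conceptually does at exec time: now that every
 * buffer's gpu_addr is final, walk the relocation list and patch each
 * recorded slot with the target's real address plus its delta. */
static void apply_relocs(struct fake_reloc *relocs, int n)
{
    for (int i = 0; i < n; i++) {
        struct fake_reloc *r = &relocs[i];
        r->batch->map[r->slot_dword] =
            (uint32_t)(r->target->gpu_addr + r->delta);
    }
}
```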