Sunday, April 28, 2013

Design of Mesa 3D Part 7: Shader Assembly Emission

The last stage in glCompileShader is to actually emit assembly commands that can be executed by the shader VM. We've previously created an intermediate representation of nodes that represent the shader; now the task is to serialize this tree into something similar to an object file. I believe that the actual assembly language is the ARB assembly language, which can then be translated by a driver into platform-specific instructions. This architecture is similar to HLSL's "assembly" language. This takes place in emit(), defined in src/mesa/shader/slang/slang_emit.c. You pass a slang_ir_node to the function; the initial node is the root of the IR tree.

This function is a big switch statement switching over the IR opcode. All of the math operators (abs, sin, min, add, less-than, etc.) fall through to the same case, which calls emit_arith(). This function is actually really straightforward; it calls emit() on all of its children, allocates a node to store the result by calling alloc_node_storage(), then calls emit_instruction() on the operation itself.
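To make the order of operations concrete, here's a minimal sketch of that children-first flow. Everything in it (the toy_ names, the tiny node struct, the "R" register naming, the text output) is made up for illustration; the real emit_arith() works on slang_ir_node and prog_instruction, not strings.

```c
#include <stdio.h>
#include <string.h>

enum toy_op { TOY_VAR, TOY_ADD, TOY_MUL };

struct toy_node {
   enum toy_op op;
   struct toy_node *child[2];  /* NULL for TOY_VAR */
   int store;                  /* register index, or -1 for "not allocated yet" */
};

struct toy_emit_info {
   int next_temp;     /* next unused register index (hypothetical allocator) */
   char text[256];    /* emitted pseudo-instructions, one per line */
};

static void toy_emit(struct toy_emit_info *e, struct toy_node *n)
{
   static const char *names[] = { "VAR", "ADD", "MUL" };
   size_t used;

   if (n->op == TOY_VAR)
      return;                  /* variables already have storage */

   /* 1. Emit code for the operands first. */
   toy_emit(e, n->child[0]);
   toy_emit(e, n->child[1]);

   /* 2. Give this node somewhere to put its result. */
   if (n->store < 0)
      n->store = e->next_temp++;

   /* 3. Emit the operation itself. */
   used = strlen(e->text);
   snprintf(e->text + used, sizeof e->text - used, "%s R%d, R%d, R%d\n",
            names[n->op], n->store, n->child[0]->store, n->child[1]->store);
}
```

Feeding this the tree for a * b + c, with the three variables already living in R0 through R2, emits the MUL before the ADD, because the ADD can't be emitted until its operand has a register.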

alloc_node_storage() is also fairly straightforward; it's used to allocate temporaries that don't have the Store parameter in the slang_ir_node struct set. This is a code block from the beginning of the function:


   if (!n->Store) {
      assert(defaultSize > 0);
      n->Store = _slang_new_ir_storage(PROGRAM_TEMPORARY, -1, defaultSize);
   }

Therefore, an invariant is that n->Store should always be set for an IR node after this function is called on it. _slang_new_ir_storage() is a simple constructor that just copies the register file, index, and size into a newly allocated slang_ir_storage struct. I've already copied and pasted the definition of slang_ir_storage_ in this post. One of the interesting things is that the index parameter in slang_ir_storage_ is allowed to be -1, which means that the actual location doesn't matter; just put it anywhere (as long as it's in the correct register file). Because of this, alloc_node_storage() must choose a real index for all the -1 indexes. It does this by calling _slang_alloc_temp(), defined in src/mesa/shader/slang/slang_vartable.c. This function calls alloc_reg() to actually do the allocation, then fills in the slang_ir_storage with the appropriate values for the newly allocated register. alloc_reg() uses the struct table defined at the top of the same file. This struct holds metadata about which parts of which register files are free. Here's the definition:


typedef enum {
   FREE,
   VAR,
   TEMP
} TempState;

/**
 * Variable/register info for one variable scope.
 */
struct table
{
   int Level;
   int NumVars;
   slang_variable **Vars;  /* array [NumVars] */
   TempState Temps[MAX_PROGRAM_TEMPS * 4];  /* per-component state */
   int ValSize[MAX_PROGRAM_TEMPS * 4];     /**< For debug only */
   struct table *Parent;  /** Parent scope table */
};

The algorithm that alloc_reg() uses is quite straightforward; it simply walks Temps trying to find 4 successive components that are marked as FREE. Once it's found them, it marks them all as TEMP. So that's pretty simple.
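Here's a small sketch of that scan, as I understand it. The toy_ names and the MAX_TEMPS value are mine, and I'm assuming the scan moves in register-aligned steps of 4 components; the real alloc_reg() has more parameters and bookkeeping than this.

```c
#define MAX_TEMPS 8   /* hypothetical; stands in for MAX_PROGRAM_TEMPS */

typedef enum { FREE, VAR, TEMP } TempState;

struct toy_table {
   TempState Temps[MAX_TEMPS * 4];   /* per-component state */
};

/* Find the first register whose 4 components are all FREE, mark them TEMP,
 * and return the index of its first component, or -1 if nothing is free. */
static int toy_alloc_reg(struct toy_table *t)
{
   int i, j;

   for (i = 0; i + 4 <= MAX_TEMPS * 4; i += 4) {
      int all_free = 1;

      for (j = 0; j < 4; j++) {
         if (t->Temps[i + j] != FREE) {
            all_free = 0;
            break;
         }
      }
      if (all_free) {
         for (j = 0; j < 4; j++)
            t->Temps[i + j] = TEMP;
         return i;
      }
   }
   return -1;
}

/* Release a register previously returned by toy_alloc_reg(). */
static void toy_free_reg(struct toy_table *t, int first)
{
   int j;

   for (j = 0; j < 4; j++)
      t->Temps[first + j] = FREE;
}
```

Since the return value is a component index, dividing it by 4 gives the register number. A register with even one component in use (say, by a VAR) gets skipped entirely.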

Back to emitting code. Emitting a single instruction is handled with the emit_instruction() function. This function takes an opcode and 4 slang_ir_storage nodes: one for the destination and 3 for the inputs. You would think that this function would be trivial; however, because of indirect register inputs + outputs, it isn't. If the output or any of the inputs are indirect, this function has to deal with it. I'll skip over how we deal with this for now, but once we have our input and output registers, the code just looks like this:

   inst = new_instruction(emitInfo, opcode);
   if (!inst)
      return NULL;

   if (dst)
      storage_to_dst_reg(&inst->DstReg, dst);

   for (i = 0; i < 3; i++) {
      if (src[i])
         storage_to_src_reg(&inst->SrcReg[i], src[i]);
   }

new_instruction() is a trivial function: if we're at the end of our output array, grow the buffer, then just get a pointer to the next available instruction in the array, and initialize it. The instruction stream is attached to the gl_program object stored in emitInfo->prog; this will become important when we call functions. storage_to_dst_reg() and storage_to_src_reg() are also rather simple: they fill in the register file and index, as well as a swizzle. Here are the prog_src_register and prog_dst_register structs, defined in src/mesa/shader/prog_instruction.h.

struct prog_src_register
{
   GLuint File:4; /**< One of the PROGRAM_* register file values. */
   GLint Index:(INST_INDEX_BITS+1); /**< Extra bit here for sign bit.
                                     * May be negative for relative addressing.
                                     */
   GLuint Swizzle:12;
   GLuint RelAddr:1;
   /** Take the component-wise absolute value */
   GLuint Abs:1;
   /**
    * Post-Abs negation.
    * This will either be NEGATE_NONE or NEGATE_XYZW, except for the SWZ
    * instruction which allows per-component negation.
    */
   GLuint Negate:4;
};

/**
 * Instruction destination register.
 */
struct prog_dst_register
{
   GLuint File:4;      /**< One of the PROGRAM_* register file values */
   GLuint Index:INST_INDEX_BITS;  /**< Unsigned, never negative */
   GLuint WriteMask:4;
   GLuint RelAddr:1;
   /**
    * \name Conditional destination update control.
    *
    * \since
    * NV_fragment_program, NV_fragment_program_option, NV_vertex_program2,
    * NV_vertex_program2_option.
    */
   /*@{*/
   /**
    * Takes one of the 9 possible condition values (EQ, FL, GT, GE, LE, LT,
    * NE, TR, or UN).  Dest reg is only written to if the matching
    * (swizzled) condition code value passes.  When a conditional update mask
    * is not specified, this will be \c COND_TR.
    */
   GLuint CondMask:4;
   /**
    * Condition code swizzle value.
    */
   GLuint CondSwizzle:12;
   /**
    * Selects the condition code register to use for conditional destination
    * update masking.  In NV_fragment_program or NV_vertex_program2 mode, only
    * condition code register 0 is available.  In NV_vertex_program3 mode,
    * condition code registers 0 and 1 are available.
    */
   GLuint CondSrc:1;
   /*@}*/
   GLuint pad:28;
};

As you can see, the instruction is optimized for size by using bitfields.
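The 12-bit Swizzle field makes sense once you notice that it's four 3-bit selectors, one per output component (the 4-bit WriteMask is similarly one bit per component). Here's a sketch of that packing. The toy_ names are mine; as far as I can tell, Mesa's own macros for this do the same shifts, but treat the details as an assumption.

```c
enum { TOY_SWZ_X = 0, TOY_SWZ_Y = 1, TOY_SWZ_Z = 2, TOY_SWZ_W = 3 };

/* Pack four 3-bit component selectors into the low 12 bits. */
static unsigned toy_make_swizzle(unsigned a, unsigned b, unsigned c, unsigned d)
{
   return a | (b << 3) | (c << 6) | (d << 9);
}

/* Extract the selector for output component idx (0..3). */
static unsigned toy_get_swizzle(unsigned swz, unsigned idx)
{
   return (swz >> (idx * 3)) & 0x7;
}
```

So a source operand written as .zzxw would be toy_make_swizzle(TOY_SWZ_Z, TOY_SWZ_Z, TOY_SWZ_X, TOY_SWZ_W), and the identity swizzle .xyzw is just the selectors 0, 1, 2, 3 in order.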

Alright, let's talk about indirect registers. This indirection is done using the ARL instruction, or Address Register Load. The spec (Section 2.14.5.3) states that it simply performs a load into the address register, which is then used for future loads and stores. This is used for doing array accesses where the index is a variable; that requires loading the value of the variable into the address register, then doing an operation using the address register as an offset into the array. However, we only have one address register (only the x component is actually used). What happens if we want to say something like x[i] + y[j]? The add instruction uses the address register explicitly, but the two operands should have different offsets. This means that we have to first load x[i] into a temporary, then run temp + y[j]. Allocating this temporary register uses the same call that it did above. It then emits a MOV instruction using the address register. A similar codepath occurs for an indirect destination register; however, if the destination is relative, all of the relative sources will be put into temporaries, so we can use an indirect destination here. The RelAddr bit in the prog_dst_register and prog_src_register structs determines if we should use the address register. After we emit the actual instruction that we're trying to perform, we have to then free the temporary registers that we've allocated.
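Here's a sketch of the x[i] + y[j] case. It emits simplified pseudo-assembly as text (not exact ARB syntax), and all the toy_ names are hypothetical. It only handles relative sources, not a relative destination, and it doesn't free the temporary afterward the way the real code does.

```c
#include <stdio.h>
#include <string.h>

struct toy_operand {
   const char *name;       /* register or array name */
   const char *index_var;  /* non-NULL means name[index_var] */
};

struct toy_stream {
   char text[512];   /* emitted pseudo-assembly, one instruction per line */
   int next_temp;
};

static void toy_append(struct toy_stream *s, const char *line)
{
   size_t used = strlen(s->text);
   snprintf(s->text + used, sizeof s->text - used, "%s\n", line);
}

/* Load the address register from o.index_var; write "name[A0.x]" to out. */
static void toy_use_address(struct toy_stream *s, struct toy_operand o,
                            char *out, size_t out_size)
{
   char line[64];
   snprintf(line, sizeof line, "ARL A0.x, %s", o.index_var);
   toy_append(s, line);
   snprintf(out, out_size, "%s[A0.x]", o.name);
}

/* Emit "op dst, a, b" where a and b may each be indexed by a variable. */
static void toy_emit_binary(struct toy_stream *s, const char *op,
                            const char *dst,
                            struct toy_operand a, struct toy_operand b)
{
   char a_text[32], b_text[32], line[128];

   snprintf(a_text, sizeof a_text, "%s", a.name);
   snprintf(b_text, sizeof b_text, "%s", b.name);

   if (a.index_var && b.index_var) {
      /* Two relative sources but one address register: copy the first
       * source into a temporary so only the second needs A0.x. */
      char elem[32];
      toy_use_address(s, a, elem, sizeof elem);
      snprintf(a_text, sizeof a_text, "T%d", s->next_temp++);
      snprintf(line, sizeof line, "MOV %s, %s", a_text, elem);
      toy_append(s, line);
   } else if (a.index_var) {
      toy_use_address(s, a, a_text, sizeof a_text);
   }
   if (b.index_var)
      toy_use_address(s, b, b_text, sizeof b_text);

   snprintf(line, sizeof line, "%s %s, %s, %s", op, dst, a_text, b_text);
   toy_append(s, line);
}
```

For r = x[i] + y[j] this emits an ARL for i, a MOV of x[A0.x] into T0, an ARL for j, and finally the ADD of T0 and y[A0.x]. The address register gets overwritten in between, which is exactly why the first operand has to be copied out.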

Cool; that's how we do math. Register loads and stores work the same way. IR_SEQ instructions work exactly the way you would expect. A variable declaration tries to call _slang_alloc_var(), which works similarly to _slang_alloc_temp(). The IR_NOT operator is implemented as v = v == 0, which is cool.

All right, what about comparisons? Because performing less-than and greater-than operations doesn't make sense on structs and vectors, they are handled by emit_arith(). Equality comparisons work almost exactly the same way, except that we have to be able to compare structs and vectors, etc. Comparing two floats is straightforward; just call emit_instruction(). Comparing two vectors is a little more complicated, because the comparison instruction returns a vector of outputs, one for each component. We can solve this by computing the dot product of the output with itself, and looking at the result. This requires allocating a temporary. Now, what about structs? This just allocates an accumulator, and walks through the size of the object, adding the output of the comparisons to the accumulator. Then, we can use the dot product trick again. Note that this won't work with arrays with padding; this is kind of an interesting problem (which doesn't look like it is solved in this version of Mesa).
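The dot product trick is easy to simulate on the CPU. In this sketch I'm assuming the per-component comparison is a "set on not equal" that writes 1.0 or 0.0 into each component; whether the real code starts from not-equal or equal is something I haven't pinned down, so treat that choice (and the toy_ names) as mine.

```c
/* Per-component "set on not equal": 1.0 where a and b differ, else 0.0. */
static void toy_sne(const float a[4], const float b[4], float out[4])
{
   int i;
   for (i = 0; i < 4; i++)
      out[i] = (a[i] != b[i]) ? 1.0f : 0.0f;
}

/* 4-component dot product. */
static float toy_dp4(const float a[4], const float b[4])
{
   return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Collapse the vector of per-component results into a single boolean. */
static int toy_vec4_equal(const float a[4], const float b[4])
{
   float diff[4];   /* plays the role of the temporary register */

   toy_sne(a, b, diff);
   /* diff . diff counts the mismatched components; zero means all equal. */
   return toy_dp4(diff, diff) == 0.0f;
}
```

Because every component of diff is exactly 0.0 or 1.0, the dot product is an exact small integer, so comparing it against zero is safe here.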

Alright, how about loops? There is an IR_LOOP instruction, which triggers a call to emit_loop(). There is a flag in the slang_emit_info structure which determines if we should emit so-called "high level" instructions. If so, we can simply emit an OPCODE_BGNLOOP instruction, which is pretty cool. Before we do that, we save the number of previously-emitted instructions to use as a label to jump to, should we need to. Then we can just emit the body of the loop (the 0th child of the IR loop), and then, in the high-level case, emit OPCODE_ENDLOOP. Otherwise, we emit an OPCODE_BRA (branch) instruction, and set its target to the beginning of the loop. Once we've done that, we have to walk through the instructions in the loop, looking for IR_BREAK and IR_CONT nodes, and replacing them with OPCODE_BRA instructions. Now we're done!
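Here's a sketch of the low-level (non-BGNLOOP) path with the break/continue patch-up. The toy_ types are hypothetical stand-ins for the real instruction array, and I'm assuming that break jumps to the instruction just past the backward branch while continue jumps back to the top.

```c
#define TOY_MAX_INST 32

enum toy_opcode { TOY_NOP, TOY_BRA, TOY_BRK, TOY_CONT };

struct toy_inst {
   enum toy_opcode op;
   int target;          /* branch target index, or -1 if unused */
};

struct toy_prog {
   struct toy_inst inst[TOY_MAX_INST];
   int count;
};

/* Append one instruction and return its index (caller checks capacity). */
static int toy_emit_inst(struct toy_prog *p, enum toy_opcode op)
{
   p->inst[p->count].op = op;
   p->inst[p->count].target = -1;
   return p->count++;
}

/* Emit a loop whose body is the given opcodes.  Returns the index of the
 * first instruction after the loop, or -1 if the program is full. */
static int toy_emit_loop(struct toy_prog *p,
                         const enum toy_opcode *body, int body_len)
{
   int start, back, end, i;

   if (p->count + body_len + 1 > TOY_MAX_INST)
      return -1;

   start = p->count;                 /* remembered "label" for the loop top */
   for (i = 0; i < body_len; i++)
      toy_emit_inst(p, body[i]);

   back = toy_emit_inst(p, TOY_BRA); /* jump back to the top */
   p->inst[back].target = start;
   end = p->count;

   /* Patch placeholders now that both ends of the loop are known. */
   for (i = start; i < back; i++) {
      if (p->inst[i].op == TOY_BRK) {
         p->inst[i].op = TOY_BRA;
         p->inst[i].target = end;
      } else if (p->inst[i].op == TOY_CONT) {
         p->inst[i].op = TOY_BRA;
         p->inst[i].target = start;
      }
   }
   return end;
}
```

The patching has to happen after the body is emitted, since the break target isn't known until we've seen how long the loop is.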

Sampling from a texture is simply an instruction, so that doesn't add much complexity.

The last piece I'd like to get into is function calls. Because setting up all the arguments and the return value was done when creating the IR (along with as much inlining as possible), calling functions is actually fairly simple. Because instruction streams are attached to gl_program objects, we save the current gl_program object (originally in emitInfo->prog) and create a new program by calling new_subroutine(), which delegates to ctx->Driver.NewProgram(). Then, we can emit a label for the new function, call emit() on the function body, and emit a return instruction just in case. We also might surround the function with OPCODE_BGNSUB and OPCODE_ENDSUB instructions, if emitInfo->EmitBeginEndSub is set. Once we've emitted the new function, we set the active program back to the original saved value and emit the OPCODE_CAL instruction to that stream.
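The save/switch/restore dance can be sketched like this. The toy_ structs are hypothetical, opcodes are just strings, and I've left out the label; the point is only which stream each instruction lands in.

```c
#define TOY_MAX_OPS  16
#define TOY_MAX_SUBS 4

struct toy_program {
   const char *ops[TOY_MAX_OPS];
   int count;
};

struct toy_call_emitter {
   struct toy_program *prog;               /* stream currently being written */
   struct toy_program subs[TOY_MAX_SUBS];  /* storage for new subroutines */
   int num_subs;
};

/* Append an opcode to whichever program is currently active. */
static void toy_emit_op(struct toy_call_emitter *e, const char *op)
{
   if (e->prog->count < TOY_MAX_OPS)
      e->prog->ops[e->prog->count++] = op;
}

/* Emit a subroutine into its own program, then a CAL in the caller.
 * Returns the subroutine's index, or -1 if there is no room for it. */
static int toy_emit_call(struct toy_call_emitter *e,
                         const char **body, int body_len, int begin_end_sub)
{
   struct toy_program *saved = e->prog;   /* remember the caller's stream */
   int sub, i;

   if (e->num_subs >= TOY_MAX_SUBS)
      return -1;
   sub = e->num_subs++;
   e->prog = &e->subs[sub];               /* switch to the new stream */
   e->prog->count = 0;

   if (begin_end_sub)
      toy_emit_op(e, "BGNSUB");
   for (i = 0; i < body_len; i++)
      toy_emit_op(e, body[i]);
   toy_emit_op(e, "RET");                 /* return, just in case */
   if (begin_end_sub)
      toy_emit_op(e, "ENDSUB");

   e->prog = saved;                       /* back to the caller */
   toy_emit_op(e, "CAL");
   return sub;
}
```

After a call, the caller's stream has gained exactly one instruction (the CAL), and everything else lives in the subroutine's own program.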

Cool! Now we've got a stream of instructions that our VM can execute. Before getting into VM execution, the OpenGL pipeline, or linking shaders, I'd like to show the life of an example function, with all its intermediate forms along the way of compilation. I think that'll make the shader compilation steps clearer.
