Texture Types
First of all, there are many kinds of textures. The simplest kind of texture to understand is a 2D texture, whose purpose is to act like a rectangular image. The type of each element in the image is configurable; you can specify that it’s a float, or an int, or 4 floats (one for each channel of RGBA), etc. These “elements” are usually called “texels.” There are also 1D textures and 3D textures, which behave analogously.
Then, you’ve got 1D texture arrays and 2D texture arrays, which are not simply arrays-of-textures. Instead, they are distinct types, where each element in the array is a texture of the relevant kind. They are their own resource types because GPUs can operate on them in hardware, so the array doesn’t have to be implemented in software. As such, the hardware restricts every element in the array to have the same dimensions. If you don’t like this requirement, you can create a software array of textures; it will go slower, but the requirement won’t apply. (Or you could even have an array of texture arrays!)
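If it helps to see the distinction in code, here’s a minimal C++ sketch; the type names are made up for illustration and don’t come from any real API:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only -- not a real graphics API. A "texel format" says what each
// element of the image is: a float, an int, 4 floats for RGBA, and so on.
enum class TexelFormat { R32Float, R32Int, RGBA8Unorm, RGBA32Float };

// A 2D texture: a rectangular grid of texels of a single format.
struct Texture2DDesc {
    TexelFormat format;
    uint32_t width;
    uint32_t height;
};

// A 2D texture array: a single resource whose elements are all 2D textures.
// Because the hardware indexes the layers directly, every layer shares the
// same width, height, and format.
struct Texture2DArrayDesc {
    TexelFormat format;
    uint32_t width;
    uint32_t height;
    uint32_t layerCount;
};

// A software array-of-textures has no such restriction (each entry can have
// its own dimensions), but indexing it is ordinary indirection rather than a
// hardware-accelerated path.
using SoftwareTextureArray = std::vector<Texture2DDesc>;
```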
Mipmaps
There’s one other important piece to textures: mipmaps. Generally, textures are mapped onto arbitrary 3D geometry, which means that the number of on-screen pixels the texture gets stretched over is totally arbitrary. Using regular projection matrices, the farther the geometry is from the viewer, the fewer pixels the texture is mapped onto.
Consider drawing a single pixel of geometry that is far away from the camera. Here, the entire texture is squished to fit into a small number of pixels, so that single pixel will be covered by many texels. So, if the renderer wanted to compute an accurate color for that pixel, it would have to average all the covered texels together. However, what if that geometry moves closer to the camera, such that each pixel contains only ~1 texel? In this situation, no averaging is necessary; you can just do a single read of the texture data.
So, if the texture is big relative to the size it’s drawn on-screen, that’s a problem, but if it’s small, that’s no problem. Think about that for a second - big data sizes are a problem, but small data sizes are okay. So what if the system could just reduce the big texture to a small texture as a preprocess? In fact, if there were a collection of reductions of various sizes, there would always be a size appropriate for the number of pixels being drawn.
That’s exactly what a mipmap is. If a 2D texture has dimensions m * n, the object also has storage for an additional level of m/2 * n/2, another of m/4 * n/4, etc., down to a single texel. This doesn’t even waste that much memory: each level of a 2D texture has a quarter as many texels as the one above it, and x + x/4 + x/16 + … = (4/3) * x, so the whole chain costs only about a third more than the base level alone. (Even in the worst case - a 1D texture, where each level is half the size of the previous - the series x + x/2 + x/4 + … = 2 * x bounds the overhead at one extra texture’s worth.) This storage scheme also assumes that texture sizes are always powers of two, which is generally required, though nowadays many implementations have extensions that relax this requirement.
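To see the arithmetic, here’s a small C++ sketch (illustrative bookkeeping, not any API’s actual allocation logic) that counts the texels in a full mip chain:

```cpp
#include <cstdint>
#include <cstdio>

// Count the texels in a full mip chain of a 2D texture. Each successive level
// halves each dimension (never dropping below one texel) down to 1x1.
uint64_t MipChainTexelCount(uint32_t width, uint32_t height) {
    uint64_t total = 0;
    for (;;) {
        total += uint64_t(width) * height;
        if (width == 1 && height == 1) break;
        if (width > 1) width /= 2;
        if (height > 1) height /= 2;
    }
    return total;
}

int main() {
    uint64_t base = 1024ull * 1024ull;                 // texels in the base level
    uint64_t chain = MipChainTexelCount(1024, 1024);   // texels in the whole chain
    std::printf("overhead factor: %f\n", double(chain) / double(base));  // ~1.333
}
```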
So, naïvely, addressing a 2D texture requires 3 components: x, y, and which mipmap level. 3D textures require 4 components, and 1D textures require 2 components. 2D texture arrays require 4 components (there’s an extra one for the layer in the array) and 1D texture arrays require 3 components. With these components, the system only has to do a single read at runtime - no looping over texels required.
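As a sketch of just that bookkeeping (the struct names here are hypothetical):

```cpp
#include <cstdint>

// Hypothetical coordinate tuples for a single, unfiltered texel read.
// Real APIs pack these differently; this just shows the component counts.
struct Texel1DCoord      { uint32_t x; uint32_t level; };                      // 2 components
struct Texel2DCoord      { uint32_t x, y; uint32_t level; };                   // 3 components
struct Texel3DCoord      { uint32_t x, y, z; uint32_t level; };                // 4 components
struct Texel1DArrayCoord { uint32_t x; uint32_t layer; uint32_t level; };      // 3 components
struct Texel2DArrayCoord { uint32_t x, y; uint32_t layer; uint32_t level; };   // 4 components
```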
Automatic Miplevel Selection
The shader API, however, can calculate the mipmap level for you, so you don’t have to do that yourself in the shader (though you can if you want to). The key is to figure out how many texels per pixel the texture is getting squished down to. If the answer is 2, you should use the second mipmap level (level 1). If the answer is 4, you should use the third (level 2), since each level is half as large as the previous along each axis.
So how does the system know how many texels cover your pixel? Well, if you think about it, this is the screen-space derivative of the sampling coordinate in the base level. Stated differently, it’s the rate of change of the texture coordinate (in texels) across the screen. So, how do you calculate this?
If the code you’re writing is differentiable, you could calculate it yourself in closed form and just use that. However, the system can approximate it automatically, using the fact that the GPU scheduler can schedule fragment shader threads however it likes. If the scheduler dispatches fragment shader threads in 2x2 blocks, each thread in the block can share data with the others. Approximating the derivative is then easy: it’s the change in the sampling coordinate divided by the change in screen-space position, i.e. the difference between the sampling coordinates of adjacent pixels divided by the difference between their screen-space coordinates. Because the pixels are adjacent, that screen-space difference is just 1, so the derivative is computed by simply subtracting the sampling coordinates of adjacent pixels. The pixels in the 2x2 block can share the result. (Of course, this sharing only works if every fragment shader in the 2x2 block is at the same point in the shader, so they can cooperate.)
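Here’s a rough CPU-side C++ simulation of that trick; the quad layout and the coordinate values are made up, but the subtraction is the whole idea:

```cpp
#include <cstdio>

struct Vec2 { float u, v; };

// Simulate a 2x2 block of fragment shader invocations. Each thread knows its
// own sampling coordinate (here already scaled to texel units); the derivative
// is approximated by subtracting a neighbor's coordinate, because adjacent
// pixels are exactly one screen-space unit apart.
// Quad layout:  [0] = (x, y)    [1] = (x+1, y)
//               [2] = (x, y+1)  [3] = (x+1, y+1)
void QuadDerivatives(const Vec2 quad[4], Vec2* ddx, Vec2* ddy) {
    ddx->u = quad[1].u - quad[0].u;  // change in u per pixel moving right
    ddx->v = quad[1].v - quad[0].v;  // change in v per pixel moving right
    ddy->u = quad[2].u - quad[0].u;  // change in u per pixel moving down
    ddy->v = quad[2].v - quad[0].v;  // change in v per pixel moving down
}

int main() {
    // A made-up quad whose texel coordinates grow by ~3 texels per pixel
    // horizontally and ~0.5 texels per pixel vertically.
    Vec2 quad[4] = {{10.0f, 4.0f}, {13.0f, 4.1f}, {10.2f, 4.5f}, {13.2f, 4.6f}};
    Vec2 ddx, ddy;
    QuadDerivatives(quad, &ddx, &ddy);
    std::printf("ddx = (%g, %g), ddy = (%g, %g)\n", ddx.u, ddx.v, ddy.u, ddy.v);
}
```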
So, the system does this subtraction of adjacent sampling coordinates to estimate the derivative, and takes the log base 2 of the result to select which miplevel to use. That result may not be exactly integral, so the sampler describes whether to round to the nearest integer miplevel or to read both straddling miplevels and take a weighted average. You can also short-circuit this computation by explicitly specifying the derivatives to use (which means the derivatives won’t be automatically calculated, but everything else works the same way), or by just specifying which miplevel to use directly.
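Putting that selection step into a sketch - this assumes we already have a scalar “texels per pixel” value (how to get it is the next section), and it ignores magnification, i.e. values below 1:

```cpp
#include <algorithm>
#include <cmath>

// Given rho = roughly how many texels of the base level fall across one pixel,
// pick which mip level(s) to read. A sketch of the idea, not any spec's exact
// rules; magnification (rho < 1) is simply clamped away here.
struct MipSelection {
    int   lowerLevel;  // the more detailed of the two straddling levels
    int   upperLevel;  // the less detailed one
    float blend;       // 0 = use lowerLevel only, 1 = use upperLevel only
};

MipSelection SelectMipLevel(float rho, int levelCount, bool nearestMipmapMode) {
    float lambda = std::log2(std::max(rho, 1.0f));             // texels/pixel -> level
    lambda = std::clamp(lambda, 0.0f, float(levelCount - 1));
    if (nearestMipmapMode) {
        int level = int(std::lround(lambda));                  // snap to a single level
        return {level, level, 0.0f};
    }
    int lower = int(std::floor(lambda));                       // read both straddling levels...
    int upper = std::min(lower + 1, levelCount - 1);
    return {lower, upper, lambda - float(lower)};              // ...and weight between them
}
```

Explicitly supplied derivatives or an explicit miplevel just replace how lambda gets computed; everything after that stays the same.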
Dimension Reduction
But I’ve breezed over one of the details here - 2D textures have 2-dimensional texel coordinates, and screens also have 2-dimensional coordinates, so the “derivative” is really a pair of 2-dimensional vectors: the rate of change of the texel coordinate in the horizontal screen direction, and in the vertical screen direction. How do we reduce these to a single miplevel? The Vulkan spec doesn’t actually describe exactly how to reduce each derivative vector into a single scalar, but it does say in section 15.6.7:
ρ_x and ρ_y may be approximated with functions f_x and f_y, subject to the following constraints:
f_x is continuous and monotonically increasing in each of m_ux, m_vx, and m_wx
f_y is continuous and monotonically increasing in each of m_uy, m_vy, and m_wy
max(|m_ux|, |m_vx|, |m_wx|) <= f_x <= sqrt(2) * (|m_ux| + |m_vx| + |m_wx|)
max(|m_uy|, |m_vy|, |m_wy|) <= f_y <= sqrt(2) * (|m_uy| + |m_vy| + |m_wy|)
So, you reduce each n-dimensional derivative to a scalar by making up any formula that fits the above requirements. You apply the function twice - once to the horizontal screen-space derivative, and once to the vertical screen-space derivative.
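For example, the Euclidean length of the derivative vector satisfies those constraints, and so does a plain max of the absolute values. Here’s a hedged 2D sketch (it leaves out the third m_w component that a 3D texture would add):

```cpp
#include <cmath>

// Screen-space derivative of the texel coordinate (u, v), in texels per pixel:
// there's one of these for the horizontal screen direction and one for the
// vertical screen direction.
struct Gradient2D { float du, dv; };

// One possible reduction function f: the Euclidean length of the derivative
// vector. It is continuous, monotonically increasing in |du| and |dv|, at
// least max(|du|, |dv|), and at most sqrt(2) * (|du| + |dv|) -- so it fits the
// spec's constraints. (max(|du|, |dv|) by itself would also fit.)
float Rho(Gradient2D g) {
    return std::sqrt(g.du * g.du + g.dv * g.dv);
}

// Apply it once per screen direction to get the two scale factors discussed next.
void ScaleFactors(Gradient2D ddx, Gradient2D ddy, float* rhoX, float* rhoY) {
    *rhoX = Rho(ddx);
    *rhoY = Rho(ddy);
}
```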
So this tells you (roughly) how many texels fit in the pixel vertically, and how many texels fit in the pixel horizontally. But these values don’t have to be the same. Imagine looking out in first-person across a rendered floor. There are many texels squished vertically, but not that many horizontally.
This is called anisotropy. The amount of anisotropy is just the ratio of these two values. By default - that is, with anisotropic filtering disabled - texture sampling just uses the larger of the two values when figuring out which miplevel to use, which avoids aliasing in the more-squished direction at the cost of over-blurring the other one. Remember - miplevels are zero-indexed, so the smaller the index, the more data is in that level, and the smaller miplevel means the higher level of detail; picking by the larger value therefore errs toward lower detail. However, there are techniques in this area - anisotropic filtering - that involve doing extra work to improve the quality of the result.
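In sketch form, loosely following the Vulkan spec’s formulation (treat the exact formula as approximate, and the names as mine):

```cpp
#include <algorithm>
#include <cmath>

// eta is the (clamped) anisotropy ratio; the level of detail is based on the
// larger scale factor, divided by that ratio. With anisotropic filtering off,
// eta is effectively 1, so lambda is just log2(rhoMax).
float AnisotropicLod(float rhoX, float rhoY, float maxAnisotropy) {
    float rhoMax = std::max(rhoX, rhoY);
    float rhoMin = std::min(rhoX, rhoY);
    if (rhoMin <= 0.0f) return std::log2(std::max(rhoMax, 1.0f));
    float eta = std::min(rhoMax / rhoMin, maxAnisotropy);  // how stretched the footprint is
    // The extra work: the hardware takes roughly eta samples spread along the
    // squished direction, which is what lets it get away with a sharper level.
    return std::log2(std::max(rhoMax / eta, 1.0f));
}
```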
Wrapping Things Up
At this point, the sampler provides shader authors with some control over the miplevel selection. The sampler / optional arguments can include a “LOD Bias,” which gets added to this value, so the author can get higher or lower detail as necessary. The sampler / optional arguments can also include a “LOD Clamp,” which is applied here if, for example, not all the miplevels of the texture have their contents populated yet.
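In sketch form, that’s just arithmetic on the computed level (parameter names are made up here):

```cpp
#include <algorithm>

// Apply the sampler's LOD bias and LOD clamp to the automatically computed
// level. A positive bias errs toward blurrier levels; a minimum clamp can keep
// sampling away from miplevels whose contents haven't been populated yet.
float ApplyLodBiasAndClamp(float lambda, float lodBias, float minLod, float maxLod) {
    return std::clamp(lambda + lodBias, minLod, maxLod);
}
```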
So, now that you have a miplevel, you can do the rest of the operation. If the sampler says the sampling coordinate is normalized, you denormalize it by multiplying by the dimensions of the miplevel, and modulus / mirror / whatever the sampler tells you to do. Then, depending on the sampler settings, you either round the denormalized coordinates to the nearest integer, or you read all the straddling texels and perform a weighted average. Then, if the sampler tells you to, you do it all again at the next miplevel, and perform yet another weighted average.
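Here’s a hedged C++ sketch of that filtering step for a single-channel texture, assuming normalized coordinates, clamp-to-edge addressing, and texel centers at half-integer positions (the detail the next paragraph talks about):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One mip level of a single-channel 2D texture, with clamp-to-edge reads.
struct MipLevel {
    int width, height;
    std::vector<float> texels;  // row-major, width * height entries
    float At(int x, int y) const {
        x = std::clamp(x, 0, width - 1);   // clamp-to-edge addressing
        y = std::clamp(y, 0, height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// "Linear" filtering within one level: denormalize, find the four straddling
// texels, and take a weighted average based on distance to their centers.
float SampleBilinear(const MipLevel& level, float u, float v) {
    float x = u * level.width - 0.5f;      // -0.5 because texel centers sit at
    float y = v * level.height - 0.5f;     // half-integer positions
    int x0 = int(std::floor(x)), y0 = int(std::floor(y));
    float fx = x - x0, fy = y - y0;        // weights toward the next texel over
    float top    = level.At(x0, y0)     * (1 - fx) + level.At(x0 + 1, y0)     * fx;
    float bottom = level.At(x0, y0 + 1) * (1 - fx) + level.At(x0 + 1, y0 + 1) * fx;
    return top * (1 - fy) + bottom * fy;
}

// "Do it all again at the next miplevel": filter both straddling levels and
// take yet another weighted average between them.
float SampleTrilinear(const MipLevel& lower, const MipLevel& upper,
                      float u, float v, float blend) {
    return SampleBilinear(lower, u, v) * (1 - blend) + SampleBilinear(upper, u, v) * blend;
}
```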
There’s one last tiny detail I’ve skipped over, which is that texel values are considered to lie at the center of the texel. So, if you have a 1D texture with 2 texels, one black and one white, then 1/4 of the way through the texture is full black, 3/4 of the way through is full white, and from 1/4 to 3/4 there’s a gradient from black to white. But what gets drawn from 0 to 1/4 and from 3/4 to 1? What about values less than 0 or greater than 1? The sampler allows this to be configured. The modulus / mirroring operation results in a value that is either in the interior of the texture or within 1 texel of the edge. Those edge texels either get their values from being repeated / mirrored / whatever, or they can just be set to a constant “border color.” That color is fed as input to the weighted average calculation like any other texel, so everything just works correctly.
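To round things out, the address-mode resolution might look something like this per texel index - names are illustrative, and the border case is reported back so the caller can substitute the border color before the weighted average:

```cpp
#include <algorithm>

enum class AddressMode { Repeat, MirroredRepeat, ClampToEdge, ClampToBorder };

// Resolve a possibly out-of-range texel index against an address mode.
// Returns false when the border color should be used instead of a real texel.
bool ResolveTexelIndex(int i, int size, AddressMode mode, int* resolved) {
    switch (mode) {
        case AddressMode::Repeat:
            *resolved = ((i % size) + size) % size;         // wrap around
            return true;
        case AddressMode::MirroredRepeat: {
            int period = 2 * size;
            int m = ((i % period) + period) % period;       // position within one back-and-forth
            *resolved = (m < size) ? m : (period - 1 - m);  // reflect the second half
            return true;
        }
        case AddressMode::ClampToEdge:
            *resolved = std::clamp(i, 0, size - 1);         // repeat the edge texel
            return true;
        case AddressMode::ClampToBorder:
            if (i < 0 || i >= size) return false;           // caller supplies the border color
            *resolved = i;
            return true;
    }
    return false;
}
```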