Points go through many transformations as they flow through the 3D pipeline. They are transformed from local space to world space, from world space to camera space, and from camera space to clip space - transformations performed by the model, view, and projection matrices, respectively. The idea here is pretty straightforward - your points are in some arbitrary coordinate system, but the 3D engine only understands points in one particular coordinate system (called Normalized Device Coordinates).
The concept of a coordinate system should be straightforward. A coordinate system is one example of a “vector space,” where there are some number of bases, and points inside the vector space are represented as linear combinations of the bases. One particular spatial coordinate system has bases of (a vector pointing to the right, a vector pointing up, and a vector pointing outward).
However, that is just one example of a coordinate system. There are (infinitely) many coordinate systems which all represent the same space. When some collection of bases spans a space, it means that any point in that space can be described as a sum of scalar multiples of these bases. Which means that there is a particular coordinate system that spans 3D space which has its origin at my eyeball, and there is another coordinate system which spans the same space, but its origin is a foot in front of me. Because both of these coordinate systems span 3D space, any point in 3D space can be written in terms of either of these bases.
Which means, given a point in one coordinate system, it is possible to find what that point would be if written in another coordinate system. This means that we’re taking one coordinate system, and transforming it somehow to turn it into another coordinate system. If you think of it like graph theory, coordinate systems are like nodes and transformations are like edges, where the transformations get you from one coordinate system to another.
Linear transformations can be represented by matrices - in fact, every linear transformation has a representative matrix. Note that not all coordinate systems are linear - it is straightforward to think of a coordinate system where, as you travel down an axis, the points you encounter do not have linearly growing components. (Color spaces, in particular, have this characteristic). However, when dealing with space, we can still span the entire vector space by using linear bases, so there is no need for anything more elaborate. Therefore, everything is linear, which means that a point's coordinates are simply dot products with the bases, and transformations between coordinate systems can be characterized by matrices.
Another benefit of matrices is that you can join two matrix transformations together into a single matrix transformation (which represents the concatenation of the two individual transformations) by simply multiplying the matrices. Therefore, we can run a point through a chain of transformation matrices essentially for free, by simply multiplying all the matrices together ahead of time.
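Here is a minimal sketch of that pre-multiplication, assuming the GLM library (which the postscript at the end of this article mentions); the specific positions, field of view, and planes below are made-up values purely for illustration.

#include <cstdio>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

int main() {
    // Place the object somewhere in the world (local -> world).
    glm::mat4 model = glm::translate(glm::mat4(1.0f), glm::vec3(2.0f, 0.0f, -5.0f));
    // Describe the camera (world -> camera).
    glm::mat4 view = glm::lookAt(glm::vec3(0.0f, 0.0f, 3.0f),   // camera position
                                 glm::vec3(0.0f, 0.0f, 0.0f),   // point it looks at
                                 glm::vec3(0.0f, 1.0f, 0.0f));  // up direction
    // Describe the projection (camera -> clip).
    glm::mat4 projection = glm::perspective(glm::radians(60.0f), // vertical field of view
                                            16.0f / 9.0f,        // aspect ratio
                                            0.1f, 100.0f);       // near and far planes

    // Concatenate the whole chain once, ahead of time...
    glm::mat4 mvp = projection * view * model;

    // ...so that each point only needs a single matrix multiply.
    glm::vec4 clip = mvp * glm::vec4(0.0f, 0.0f, 0.0f, 1.0f);  // the object's local-space origin
    std::printf("clip = (%f, %f, %f, %f)\n", clip.x, clip.y, clip.z, clip.w);
    return 0;
}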
When an artist models some object, he/she will do it in a local coordinate system. In order to place the object in a world, that local coordinate system needs to be transformed there. Therefore, a transformation matrix (called the Model matrix) is created which represents the transformation from the local coordinate system to the world coordinate system, and then the points in the model are transformed (multiplied) by this matrix. If we want to move the object around in 3D space, we just modify the transformation matrix and keep everything else the same. The resulting transformed points will just work accordingly.
However, the pixels that end up getting drawn on the screen are relative to some camera, so we need to then transform our points in world space into camera space. This is done by multiplying with the View matrix. It is the same concept as the Model matrix, but we’re just putting points in the coordinate system of the camera. If the camera moves around, we update the View matrix, and leave everything else the same.
As an aside, our 3D engine only understands points in a [-1, 1] range (it will scale up these resulting points to the size of the screen as a last step), which means that your program doesn't have to worry about the resolution of the screen. This requirement is pretty straightforward to satisfy - simply multiply by a third matrix, called the Projection matrix, which scales the visible region of the scene so that it lands in this target range. So far so good.
Now, the end goal is to transform our points such that the X and Y coordinates of the end-result points represent the locations of the pixels to light up, and the Z coordinate of the end-result point represents a depth value for how deep that point is. This depth value is useful for making sure we only draw the closest point if there are a few candidates which all end up on the same (X, Y) coordinate on the screen. Now, if the coordinate system that we have just created (by multiplying the Model, View and Projection matrices) satisfies this requirement, then we can just be done now. This is what is known as an “orthographic projection,” where points that are farther away aren't pulled in toward the center of the screen by perspective. And that's fine.
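As a concrete sketch of that orthographic idea (the half-extents and the sample point below are made-up values): the Projection matrix is just a diagonal scale, applied independently per axis.

#include <cstdio>

int main() {
    // Say the visible region is a box around the camera with these half-extents.
    // The "projection" is then a diagonal matrix whose entries are 1/W, 1/H, 1/D,
    // i.e. an independent scale per axis.
    const float W = 10.0f, H = 10.0f, D = 100.0f;

    const float x = 3.0f, y = -7.5f, z = 40.0f;   // a camera-space point inside the box
    const float ndcX = x / W;                      //  0.300
    const float ndcY = y / H;                      // -0.750
    const float ndcZ = z / D;                      //  0.400 - depth, still linear

    std::printf("NDC = (%.3f, %.3f, %.3f)\n", ndcX, ndcY, ndcZ);
    return 0;
}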
However, if we're trying to model people's field of view, it gets wider the farther out from the viewer we get, like in this top-down picture. (Note that from here on out, I'm going to disregard the Y dimension so I can draw pictures. The Y dimension is conceptually the same as the X dimension.) So, what we really want is for all the points along a ray from the origin to end up with the same X coordinate.
Let's consider, like a camera, that there is a “film” plane orthogonal to the view vector. What we are really trying to find is distance along this plane, normalized so that the edges of the plane are -1 and 1. (Note that this is a different model than the simple [-1, 1] scaling we were doing with the orthographic projection matrix, so don't do that simple scaling here. We'll take care of it later.)
Let's call the point we're considering P = (X0, Z0). All the points that will end up with the same screen coordinates as P will lie along a ray. We can describe this ray with a parametric equation Pall(t) = t * (X0, Z0). If we consider the intersection that this ray has with the film plane at a particular depth D, we can find what we are looking for.
t * Z0 = D
t = D / Z0
Xr = t * X0
Xr = D / Z0 * X0 = (D * X0) / Z0
We then want to normalize to make sure that the rays on the edge of the field of vision map to -1 and 1. We know that
tan(θ) = Xi / D
Xi = tan(θ) * D
We want to calculate
s = Xr / Xi
s = ((D * X0) / Z0) / (tan(θ) * D)
s = (X0 / Z0) / tan(θ)
s = X0 * cot(θ) / Z0
This last line really should be written as
s(X0, Z0) = X0 * cot(θ) / Z0
to illustrate that we are computing a formula for which resulting pixel gets lit up on our screen (in the x dimension), and that the result is sensitive to the input point.
Note that this resulting coordinate space is called Normalized Device Coordinates. (Normalized because it goes from -1 to 1)
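As a quick numeric sanity check of this formula (a sketch only - theta is the half-angle of the field of view, per the tan(θ) = Xi / D relationship above, and the sample points are arbitrary):

#include <cmath>
#include <cstdio>

int main() {
    const double theta = std::atan(1.0);              // 45 degree half-angle
    const double cotTheta = 1.0 / std::tan(theta);    // cot(45°) = 1

    auto s = [&](double x0, double z0) { return x0 * cotTheta / z0; };

    // A point exactly on the right edge of the view (X0 = Z0 * tan(θ)) lands at 1...
    std::printf("edge:   %.3f\n", s(5.0 * std::tan(theta), 5.0));   // 1.000
    // ...a point straight ahead lands at 0...
    std::printf("center: %.3f\n", s(0.0, 5.0));                     // 0.000
    // ...and pushing the same X0 farther away slides it toward the center.
    std::printf("near:   %.3f\n", s(2.0, 4.0));                     // 0.500
    std::printf("far:    %.3f\n", s(2.0, 8.0));                     // 0.250
    return 0;
}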
This result should make sense - the Ds cancel out because it doesn't matter how far away the film plane is: both the ray's intersection with the plane and the plane's half-width grow proportionally with D, so only the field of view matters. Also, as X0 increases, so does our scalar, which means as points move to the right, they will move to the right on our screen. As Z0 increases (points get farther away), it makes sense that they should move closer to the view ray (but never cross it). Because our coordinate system puts -1 at the left edge of the screen and 1 at the right edge, dividing a point by its Z value makes sense - it moves the point toward the center of the screen, by an amount which grows the farther away the point is, which is what you would expect.
However, it would be really great if we could represent this formula in matrix notation, so that we can insert it into our existing matrix multiplication pipeline (Projection matrix * View matrix * Model matrix). Unfortunately, that “/ Z0” in the formula is very troublesome, because it can't be represented by a matrix. So what do we do?
Well, we can put as much into matrix form as possible, and then do that division at the very end. This makes our transformation pipeline look like this:
(A matrix representing some piece of projection * (View matrix * (Model matrix * point))) / Z0
Let’s call that matrix representing some piece of projection, the Projection matrix.
Because of the associative property of matrix multiplication, we can rewrite our pipeline to look like this:
(Projection matrix * View matrix * Model matrix) * point / Z0
That projection matrix, so far, is of the form:
[cot(θ) 0]
[ ?? ??]
(Remember I'm disregarding the Y component here). Multiplying this matrix by the point (X0, Z0) yields X0 * cot(θ) for the X-coordinate, which, once we divide by Z0 at the very end, gives us the correct answer.
So far so good. However, depth screws everything up.
Depth
For the depth component, we're trying to figure out what those ??s should be in the above matrix. Remember that our destination needs to fit within the range [-1, 1] (because that is the depth range that our 3D engine works with). Also, remember that, at the end of the pipeline, whatever comes out of this matrix multiplication gets divided by Z0. If we say that the first ?? in the above matrix is “a” and the second ?? in the above matrix is “b”, then, if we just consider the depth value, what we have is
(X0 * a + Z0 * b) / Z0 = depth
= sqrt(X0^2 + Z0^2) ???
That square root thing really sucks. However, recall that this depth computation is only used to do inequality comparisons (to tell if one point is closer than the other). The exact depth value is not used; we only care that depths increase the farther away from the origin that you get. So, we can rewrite this as
(X0 * a + Z0 * b) / Z0 = some increasing function as Z0 grows
X0 * a / Z0 + b = some increasing function as Z0 grows
or,
X0 / Z0 * a + b = some increasing function as Z0 grows
If we make “a” negative, then our formula will increase as Z0 grows. So far so good.
Now, we want to pick values for a and b such that the depth range we're interested in maps to [-1, 1]. Note, however, that we can't say the naive thing that, at Z0 = 0, depth = 0. This is because of that divide by Z0 in the above formula. If Z0 = 0, we are dividing by 0. Therefore, we have to pick some other distance which will map to the near end of the depth range. Let's call that distance N.
At this point, we will make a simplifying assumption that there is a plane at distance N (similar to the film plane described above) and that the depth everywhere on this plane is the same constant value. (It's valuable, as a design consideration, to be able to model this “too-close” threshold as a simple plane.) The assumption means, however, that depth cannot be sensitive to horizontal position (because it is constant all over the near plane). This means that “a” must equal 0. But wait, didn't we just say that “a” needed to be negative?
Homogeneous Coordinates to the rescue!
Homogeneous coordinates are already super valuable in our Model and View matrices, because we cannot model translations without them. So, this means that we're actually already using them in all our matrix calculations. To recap, homogeneous coordinates are expressed just like regular coordinates, except that positions have an extra “1” component tacked on to the end. So our point at (X0, Z0) is actually written as
[X0]
[Z0]
[ 1]
where the “1” is known as the “w” component.
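To see why that trailing 1 buys us translation, here is a tiny sketch in the same 2-D (X, Z) setting used throughout this derivation (the translation amounts and the point are arbitrary): a translation lives in the matrix's third column, and only the w component lets that column contribute to the result.

#include <cstdio>

int main() {
    // Row-major 3x3 matrix: translate by (tx, tz) = (4, -2).
    const double m[3][3] = {
        {1, 0,  4},
        {0, 1, -2},
        {0, 0,  1},
    };
    const double p[3] = {3, 5, 1};  // homogeneous point (X0, Z0, w) = (3, 5, 1)

    double out[3] = {0, 0, 0};
    for (int row = 0; row < 3; ++row)
        for (int col = 0; col < 3; ++col)
            out[row] += m[row][col] * p[col];

    // Prints (7, 3, 1): the translation column landed in the result because w was 1.
    std::printf("(%g, %g, %g)\n", out[0], out[1], out[2]);
    return 0;
}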
And our projection matrix so far is written as
[cot(θ) 0 0]
[ 0 b c]
Therefore, our depth value is now
(b * Z0 + c * 1) / Z0 = some increasing function as Z0 grows
b + c / Z0 = some increasing function as Z0 grows
Now, we can make our function increasing by setting “c” to be negative. In particular, we have some value for Z0, namely N, at which the result of this expression should be -1. We also have another depth, let's call it F, at which the result of this expression should be 1. This means that we have a near plane and a far plane, and points on those planes get mapped to -1 and 1, respectively. This also means that, between the near and the far plane, depths are not linearly increasing. We also know, because we're dividing by Z0, that the near plane can't be at 0. The far plane, however, can be at any arbitrary location (as long as it's farther than the near plane).
So, we have a system of equations. We can substitute in the values just described:
b + (c / N) = -1 and b + (c / F) = 1
b = -1 - (c / N)
(-1 - (c / N)) + (c / F) = 1
(c / F) - (c / N) = 2
c - (c * F / N) = 2 * F
c * N - c * F = 2 * F * N
c * (N - F) = 2 * F * N
c = (2 * F * N) / (N - F)
b = -1 - ((2 * F * N) / (N - F)) / N
b = -1 - ((2 * F) / (N - F))
b = ((F - N) / (N - F)) - ((2 * F) / (N - F))
b = (F - N - 2 * F) / (N - F)
b = (-F - N) / (N - F)
b = - (N + F) / (N - F)
b = (F + N) / (F - N)
We can then check our work to make sure the values substituted in get the correct result:
((F + N) / (F - N) * N + ((2 * F * N) / (N - F))) / N
((F + N) / (F - N)) + ((2 * F) / (N - F))
((F + N) / (F - N)) - ((2 * F) / (F - N))
((F + N - 2 * F) / (F - N))
(N - F) / (F - N)
- (F - N) / (F - N)
-1
and
(((F + N) / (F - N)) * F + ((2 * F * N) / (N - F))) / F
((F + N) / (F - N)) + ((2 * N) / (N - F))
((F + N) / (F - N)) - ((2 * N) / (F - N))
(F + N - 2 * N) / (F - N)
(F - N) / (F - N)
1
So, our final matrix:
[cot(θ) 0 0 ]
[ 0 (F + N) / (F - N) (2 * F * N) / (N - F)]
Note that, because F > N, the denominator of “c” is negative, making the whole thing negative. This is what we expected (c must be negative in order for the function to be increasing).
It’s worth thinking about the shape of the depth values that are output by this formula, in particular, the formula is of the form
(negative constant) / variable + constant
which means the graph is a flipped 1/x graph, like this
The constants can scale the curve vertically, and move it vertically, but we can't do anything horizontally. In particular, we picked constants such that the points (N, -1) and (F, 1) are on the graph. The depth values between these limits are not linear, and they get denser the farther out we go. Also note that it is possible to put the far plane at Z = infinity, in which case we scale and bias the curve such that its horizontal asymptote lies at a depth of 1.
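To get a feel for this shape, here is a quick sample of the curve (a sketch, with N = 1 and F = 100 chosen purely for illustration):

#include <cstdio>

int main() {
    const double N = 1.0, F = 100.0;
    const double b = (F + N) / (F - N);       // ~ 1.0202
    const double c = (2.0 * F * N) / (N - F); // ~ -2.0202 (negative, as expected)

    // Sample the depth curve d(Z0) = b + c / Z0 at a few distances.
    const double samples[] = {1.0, 2.0, 5.0, 10.0, 50.0, 100.0};
    for (double z : samples) {
        double depth = b + c / z;
        std::printf("Z0 = %6.1f  ->  depth = %+.4f\n", z, depth);
    }
    return 0;
}

With these values, more than half of the [-1, 1] output range is already used up by Z0 = 2 - exactly the denser-the-farther-out behavior described above.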
There's one last piece to consider here, and that is that Z0 divide. In particular, this is the Z0 that is the input to the projection matrix, not the Z component of the original point. This makes it kind of cumbersome to transform your point, remember one particular value from it, then do another transform. It also means that the output of the matrix transformation step is a point and an associated value to divide it by. Wouldn't it be simpler if we could just combine both of those items into a vector? Maybe we could put this divisor value into an extra component of the resulting post-transformed point. Well, we can do that by adding a new row to our projection matrix
[0 1 0]
This means that the last component, the w component, of the resulting vector will simply get set to Z0. Then, the result of the matrix transformations is just a single (homogeneous) vector with a non-one w component. The division step is then just to divide this vector by its w component, thereby bringing the w component back down to 1.
In the end, our projection matrix is this:
[cot(θ) 0 0 ]
[ 0 (F + N) / (F - N) (2 * F * N) / (N - F)]
[ 0 1 0 ]
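Here is a minimal end-to-end sketch of this 2-D (X, Z) version of the matrix, just to see the whole mechanism in one place: build the matrix, multiply a homogeneous (X0, Z0, 1) point through it, then do the w-divide. The field of view, near/far planes, and sample point are arbitrary illustration values.

#include <cmath>
#include <cstdio>

int main() {
    const double theta = std::atan(1.0);          // 45 degree half-angle of the field of view
    const double N = 1.0, F = 100.0;              // near and far planes
    const double cotTheta = 1.0 / std::tan(theta);
    const double b = (F + N) / (F - N);
    const double c = (2.0 * F * N) / (N - F);

    // Row-major projection matrix, exactly as written above.
    const double proj[3][3] = {
        {cotTheta, 0.0, 0.0},
        {0.0,      b,   c  },
        {0.0,      1.0, 0.0},  // copies Z0 into the w component
    };

    const double point[3] = {2.0, 4.0, 1.0};      // homogeneous (X0, Z0, 1)

    double clip[3] = {0.0, 0.0, 0.0};
    for (int row = 0; row < 3; ++row)
        for (int col = 0; col < 3; ++col)
            clip[row] += proj[row][col] * point[col];

    // The w-divide (done by the GPU after the vertex shader).
    double ndcX  = clip[0] / clip[2];
    double depth = clip[1] / clip[2];
    std::printf("NDC x = %.4f, depth = %.4f, w = %.4f\n", ndcX, depth, clip[2]);
    return 0;
}

This prints an NDC x of 0.5, matching what the bare formula s(X0, Z0) = X0 * cot(θ) / Z0 gives for the same point - the matrix plus w-divide is just that formula, rearranged - along with a depth of about 0.515.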
So, there we have it. We have some matrices (Model, View, and Projection) which we can multiply together ahead of time to create a single transform which will transform our input point into clip space. The result of this is a single vector with a w component not equal to 1 (this vector is the output of the vertex shader, stored in gl_Position). Then, the GPU divides the vector by its w component in dedicated hardware - very fast. The reason we don't do this w divide ourselves in the vertex shader is 1. it's faster to let dedicated hardware do it, and 2. the hardware needs access to w in order to do perspective-correct interpolation. This division is where perspective happens. The resulting vector (with w = 1) is in Normalized Device Coordinates. The X and Y coordinates of this vector are simply the screen coordinates of the point (once scaled and biased by the screen's resolution; remember, these values are in [-1, 1]). The Z component is a non-linear, increasing depth value, where depth values get denser the farther out you go. This depth value is used for depth comparisons.
P.S. Note that, if you're using an orthographic Projection matrix, you can simply make the last row of the matrix [0 0 1] so that the w-divide does nothing.
P.P.S. Note that if you take the limit of that matrix, as F approaches infinity, you get
[cot(θ) 0 0 ]
[ 0 1 -2 * N]
[ 0 1 0 ]
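As a quick sanity check of that limit (a sketch, with N = 1 chosen purely for illustration): the depth row of this matrix, after the w-divide, works out to depth(Z0) = (Z0 - 2 * N) / Z0 = 1 - 2 * N / Z0, which is -1 at the near plane and approaches (but never reaches) 1 as Z0 grows.

#include <cstdio>

int main() {
    const double N = 1.0;
    const double samples[] = {1.0, 2.0, 10.0, 1000.0, 1.0e6};
    for (double z : samples) {
        // Depth produced by the F -> infinity matrix, after dividing by w (= Z0).
        double depth = 1.0 - 2.0 * N / z;
        std::printf("Z0 = %10.1f  ->  depth = %+.6f\n", z, depth);
    }
    return 0;
}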
P.P.P.S. Note that if you look at the implementation of this in a software package, such as
GLM, you'll see some negatives in its version of the matrix. This is because of the right-handed coordinate system convention, where points farther away are in the negative-Z half-space, and as you go farther away, points' Z coordinates get more and more negative. In the derivation described above, I was using a left-handed coordinate system, where the viewer looks in the direction of positive, increasing Z.