Litherum: May 2015

Thursday, May 21, 2015

Colorspaces

There are many explanations of color spaces, but all the discussions of it I have heard have missed some points which I think are important when thinking about color spaces. In particular, there are many related pieces of a color space which all interact to define a color.

First off, a set of colors exists in the world. Physically, a color is a distribution of light frequencies at varying intensities. You can think of these colors as a “space,” similar to a vector space, but not necessarily linear. The axes in this “space” may be stretched or squished according to arbitrary formulas; however, the axes do span the space (using the mathematical definition of “span”).

A color space defines axes (again: not necessarily linear) through this space. Note that, like a vector space, these axes theoretically extend infinitely, which means that the space of representable colors extends infinitely in all directions. If you only consider these axes, all colors can be represented in all of these sets of axes. Because of this property, if you have two of these sets of axes, you can always represent all points in both of them.

However, color spaces define their own “gamut” which is simply the bounds of the color space. This means that there is a watertight seal around the space of colors which the color space can represent. Because two colorspaces might not have the same gamut, this means that there can be colors that are representable in one that are not representable in the other. (Before we introduced gamut, this was not possible; but now that we have this artificial limit of each space, it is possible). However, gamut is just the outermost bounds; all colors within this bounds are representable.

Now, let’s talk about computers. Computers can’t represent infinitely long decimals. They have to quantize (round) all the numbers they work with to particular representable values. For example, an int cannot hold the value 0.5, and a float cannot hold the value 16777217. Therefore, when computers represent a color in a color space, they can only represent discrete points inside the space, rather than all the points, continuously.

So, there is this problem of how to choose which specific points are representable in a color space. You may want to just divide up the space evenly into little grid squares, but this has a problem. In particular, depending on the color, our eyes are better or worse at determining slight variations in color. This means that, around the part of the gamut where our eyes are really good at distinguishing color, we want to make the grid squares small, so the representable values better match what we can actually see. Conversely, around the parts where our eyes don’t work so well, it is wasteful to have the grid cells be smaller than what we can actually distinguish. In effect, what we want to do is take some of those representable values, and squish them so they are denser around part of the gamut, and less dense around other parts.

So, color spaces, define formulas of how to do this. In particular, if you have a value on a particular axis, they define how far down that axis to travel. This function is often non linear. This formula allows you to connect some arbitrary number with a location in the gamut. The bounds of the gamut are often described in terms of these numbers.

“Gamma” is just the particular function (y = x ^ gamma) that sRGB uses that does this.

Now, if you consider this mathematically, all we are doing is representing the same space with different axes, which, mathematically, is completely possible and lossless. It just means that, in this other coordinate system, the numbers which represent the bounds of the gamut will (might) be different, even though the gamut itself has not changed.

That’s fine, except that when we do math on colors (which we do all the time in pixel shaders), we expect the space to behave like a linear vector space. If we have a color, and we double it, we expect to be twice as far from the origin as we were before. However, this means that the space we do math in has to be linear. So, if we want to do math on colors, we have to stretch and squish the axes in the opposite way, in order to get the squished space back to being linear. However, we are now no longer in the original color space! (Because each color space includes the squishiness characteristics of its axes.)

Therefore, if we have a color in a non-linear colorspace, and we have to do math on it, but at the end have a color in that same color space, we must convert the color into a linear space, do math, then convert it back. We don’t want to keep all colors in a linear color space all the time, because that would mean that we would have to cover the entire gamut with grid cells that are as small as the smallest one in the non-linear color space, which means we need more bits to represent our colors (smaller grid cells means more of them).

It also means that we should use a very high precision number format to do math in, and only convert back to nonlinear when we’re done. Each conversion we do is a little bit lossy. Shaders execute with floating point precision (or half float precision) which is much more precision than a single 8-bit uint (which is typically used to represent colors in sRGB).

Now, it turns out that all this stuff was being implemented when computers all used CRT monitors. Now, it turns out that the voltage response of a CRT monitor is almost exactly the inverse of the squishiness of sRGB. This means, if you feed your sRGB-encoded image directly to a CRT, it will, by virtue of physics of electron beams, just happen to exactly undo the sRGB encoding and display the correct thing. This means that, back then, all images were stored in this form, both because 1) No conversion was needed before feeding it directly to the monitor, and 2) the density of information was in all the right places. This practice continues to this day. (It means that when digital cameras take a picture, they actually perform an encoding into sRGB before saving out to disk). Therefore, if you naively have a pixel shader which samples a photograph without doing any correction, you probably are doing the wrong thing.

OpenGL

In OpenGL, if you are in a fragment shader and you are reading data from a texture whose contents are in a non-linear color space, you must convert that color you read into a linear color space before you do any math with it. But wait - doesn’t texture filtering count as math? Yes, it does. Doesn’t this mean that I have to convert the colors into a linear color space before texture filtering occurs? Yes, it does. One of the texture formats you can mark a texture with is GL_SRGB8 (sRGB is nonlinear), and when you do this, the hardware will run the linearizing conversion on the data read before filtering (Actually implementations are free to do this conversion after filtering, but they should do it before). Then, throughout the duration of the fragment shader, all your math is done in a linear color space, using floats.

But what about the color you output from you fragment shader? Doesn’t blending count as math? Yes, it does. When you enable GL_FRAMEBUFFER_SRGB, OpenGL lets you use sRGB textures when you create a framebuffer. You can also query whether or not your frame buffer is using sRGB or not using glGetFramebufferAttachmentParameter(). When you set this up, the GPU will convert from your linear colorspace into sRGB, and it will do this such that blending occurs correctly (linearly).

The OpenGL spec includes the exact formulas which are run when you convert into and out of sRGB from a linear colorspace. I’m not sure if the linearized sRGB colorspace has a proper name (but it is definitely not sRGB!)

As alluded to earlier, an alternative would be to keep everything in a linear space all the time, but this means that you need more bits to represent colors. In practice the general wisdom is to use at least 10 bits of information per channel. This means, however, that if you are trying to represent your color in 32 bits, you only have 2 bits left over for alpha (using the GL_RGB10_A2 format). If you need more alpha than that (almost certainly), you will likely need to go up to 16 bits per channel, which means doubling your memory use. At that point, you might as well use half floats so you can get HDR as well (assuming your hardware supports it).

Alpha isn’t a part of color. No color space includes mention of alpha. Alpha is ancillary data that does not live within a color space, the way that color does. We put alpha into the w component of a vec4 out of convenience, not because it is ontologically part of the color. Therefore, when you are converting a color between colorspaces, make sure you don’t touch your alpha component (The GL spec follows this rule when describing how it converts into and out of sRGB).

Deriving Projection Matrices

There are many transformations which points go through when they flow through the 3D pipeline. Points are transformed from local space to world space, from world space to camera space, and from camera space to clip space, performed by the model, view, and projection matrices, respectively. The idea here is pretty straightforward - your points are in some arbitrary coordinate system, but the 3D engine only understands points in one particular coordinate system (called Normalized Device Coordinates).

The concept of a coordinate system should be straightforward. A coordinate system is one example of a “vector space,” where there are some number of bases, and points inside the vector space are represented as linear combinations of the bases. One particular spacial coordinate system has bases of (a vector pointing to the right, a vector pointing up, and a vector pointing outward).

However, that is just one example of a coordinate system. There are (infinitely) many coordinate systems which all represent the same space. When some collection of bases span a space, it means that any point in that space can be described as the sum of scalar functions of these bases. Which means that there is a particular coordinate system that spans 3D space which has its origin at my eyeball, and there is another coordinate system which spans the same space, but its origin is a foot in front of me. Because both of these coordinate systems span 3D space, any point in 3D space can be written in terms of either of these bases.

Which means, given a point in one coordinate system, it is possible to find what that point would be if written in another coordinate system. This means that we’re taking one coordinate system, and transforming it somehow to turn it into another coordinate system. If you think of it like graph theory, coordinate systems are like nodes and transformations are like edges, where the transformations get you from one coordinate system to another.

Linear transformations are represented by matrices, because matrices can represent all linear transformations. Note that not all coordinate systems are linear - it is straightforward to think of a coordinate system where, as you travel down an axis, the points you encounter do not have linearly growing components. (Color spaces, in particular, have this characteristic). However, when dealing with space, we can still span the entire vector space by using linear bases, so there is no need for anything more elaborate. Therefore, everything is linear, which means that our points are represented as simply dot products with the bases. This also means that transformations between coordinate systems can be characterized by matrices, and all linear transformations have a representative matrix.

Another benefit of matrices if that you can join two matrix transformations together into a single matrix transformation (which represents the concatenation of the two individual transformations) by simply multiplying the matrices. Therefore, we can run a point through a chain of transformation matrices for free, by simply multiplying all the matrices together ahead of time.

When an artist models some object, he/she will do it in a local coordinate system. In order to place the object in a world, that local coordinate system needs to be transformed there. Therefore, a transformation matrix (called the Model matrix) is created which represents the transformation from the local coordinate system to the world coordinate system, and then the points in the model are transformed (multiplied) by this matrix. If you want to move the object around in 3D space, we just modify the transformation matrix, and keep everything else the same. The resulting multiplied points will just work accordingly.

However, the pixels that end up getting drawn on the screen are relative to some camera, so we need to then transform our points in world space into camera space. This is done by multiplying with the View matrix. It is the same concept as the Model matrix, but we’re just putting points in the coordinate system of the camera. If the camera moves around, we update the View matrix, and leave everything else the same.

As an aside, our 3D engine only understands points in a [-1, 1] range (it will scale up these resulting points to the size of the screen as a last step) which means that your program doesn’t have to worry about the resolution of the screen. This requirement is pretty straightforward to satisfy - simply multiply by a third matrix, called the Projection matrix, which just scales by something which will renormalize the points into this target range. So far so good.

Now, the end goal is to transform our points such that the X and Y coordinates of the end-result points represent the locations of the pixels which to light up, and the Z coordinate of the end-result point represents a depth value for how deep that point is. This depth value is useful for making sure we only draw the closest point if there are a few candidates which all end up on the same (X, Y) coordinate on the screen. Now, if the coordinate system that we have just created (by multiplying the Model, View and Projection matrices) satisfies this requirement, then we can just be done now. This is what is known as an “orthographic projection,” where points that are deeper away aren’t drawn in to the center of the screen by perspective. And that’s fine.

However, if we’re trying model people’s field of view, it gets wider the farther out from the viewer we get, like in this top-down picture. (Note that from here on out, I’m going to disregard the Y dimension so I can draw pictures. The Y dimensions is conceptually the same as the X dimension) So, what we really want are all the points in rays from the origin to have the same X coordinate.

Let’s consider, like a camera, that there is a “film” plane orthogonal to the view vector. What we are really trying to find is distance along this plane, normalized so that the edges of the plane are -1 and 1. (Note that this is a different model the the simple [-1, 1] scaling we were doing with the orthographic projection matrix, so don’t do that simple scaling here. We’ll take care of it later.)

Let’s call the point we’re considering P = (X₀, Z₀). All the points that will end up with the same screen coordinates as P will lie along a ray. We can describe this ray with a parametric equation P_all(t) = t * (X₀, Z₀). If we consider the intersection that this ray has with the film plane at a particular depth D, we can find what we are looking for.

t * Z₀ = D
t = D / Z₀

Xr = t * X₀
Xr = D / Z₀ * X₀ = (D * X₀) / Z₀

We then want to normalize to make sure that the rays on the edge of the field of vision map to -1 and 1. We know that

tan(θ) = Xi / D
Xi = tan(θ) * D

We want to calculate

s = Xr / Xi
s = ((D * X₀) / Z₀) / (tan(θ) * D)
s = (X₀ / Z₀) / tan(θ)
s = X₀ * cot(θ) / Z₀

This last line really should be written as

s(X₀, Z₀) = X₀ * cot(θ) / Z₀

to illustrate that we are computing a formula for which resulting pixel gets lit up on our screen (in the x dimension), and that the result is sensitive to the input point.

Note that this resulting coordinate space is called Normalized Device Coordinates. (Normalized because it goes from -1 to 1)

This result should make sense - the Ds cancel out because it doesn’t matter how far the film plane is, the rays' horizontal distance is all proportional to the field of view. Also, as X₀ increases, so does our scalar, which means as points move to the right, they will move to the right on our screen. As Z₀ increases (points get farther away), it makes sense that they should move closer to the view ray (but never cross it). Because our coordinate system puts -1 at the left edge of the screen and 1 at the right edge, dividing a point by its Z value makes sense - it moves the point toward the center of the screen, by an amount which grows the farther the point is away, which is what you would expect.

However, it would be really great if we could represent this formula in matrix notation, so that we can insert it in to our existing matrix multiplication pipeline (Model matrix * View matrix * Projection matrix). However, that “ / Z₀” in the formula is very troublesome, because it can’t be represented by a matrix. So what do we do?

Well, we can put as much into matrix form as possible, and then do that division at the very end. This makes our transformation pipeline look like this:

(A matrix representing some piece of projection * (View matrix (Model matrix * point))) / Z₀

Let’s call that matrix representing some piece of projection, the Projection matrix.

Because of the associative property of matrix multiplication, we can rewrite our pipeline to look like this:

(Projection matrix * View matrix * Model matrix) * point / Z₀

That projection matrix, so far, is of the form:

[cot(θ) 0]
[ ?? ??]

(Remember I’m disregarding the Y component here). Multiplying the matrix by point (X₀, Y₀) yields X₀ * cot(θ) for the X-coordinate, which, once we divide by Z₀ at the very end, gives us the correct answer.

So far so good. However, depth screws everything up.

Depth

For the depth component, we’re trying to figure out what those ??s should be in the above matrix. Remember that our destination needs to fit within the range [-1, 1] (because that is the depth range that our 3D engine works with). Also, remember that, at the end of the pipeline, whatever the result of this matrix multiplication is, is getting divided by Z₀. If we say that the first ?? in the above matrix is “a” and the second ?? in the above matrix is “b”, then, if we just consider the depth value, what we have is

(X₀ * a + Z₀ * b) / Z₀ = depth
= sqrt(X₀² + Z₀²)???????

That square root thing really sucks. However, recall that this depth computation is only used to do inequality comparisons (to tell if one point is closer than the other). The exact depth value is not used; we only care that depths increase the farther away from the origin that you get. So, we can rewrite this as

(X₀ * a + Z₀ * b) / Z₀ = some increasing function as Z₀ grows
X₀ * a / Z₀ + b = some increasing function as Z₀ grows
or,
X₀ / Z₀ * a + b = some increasing function as Z grows

If we make “a” negative, then our formula will increase as Z₀ grows. So far so good

Now, we want to pick values for a and b such that the depth range we’re interested in maps to [-1, 1]. Note, however, that we can’t say the naive thing that, at Z₀ = 0, depth = 0. This is because of that divide by Z₀ in the above formula. If Z₀ = 0, we are dividing by 0. Therefore, we have to pick some other distance which will map to a 0 depth. Let’s call that distance N.

At this point, we will make a simplifying assumption that there is a plane at distance N (similar to the film plane described above) and that the depth everywhere on this plane is 0. (It’s valuable, as a design consideration, to be able to model this “too-close” threshold as a simple plane) The assumption means, however, that depth cannot be sensitive to horizontal position (because it is 0 all over the near plane). This means that “a” must equal 0. But wait, didn’t we just say that “a” needed to be negative?

Homogeneous Coordinates to the rescue!

Homogeneous coordinates are already super valuable in our Model and View matrices, because we cannot model translations without them. So, this means that we’re actually already using them in all our matrix calculations. To recap, homogenous coordinates are expressed just like regular coordinates, except that positions have an extra “1” component tacked on to the end. So our point at (X₀, Y₀) is actually written as

[X₀]
[Y₀]
[ 1]

where the “1” is known as the “w” component.

And our projection matrix so far is written as

[cot(θ) 0 0]
[ 0 b c]

Therefore, our depth value is now
(b * Z₀ + c * 1) / Z₀ = some increasing function as Z₀ grows
b + c / Z₀ = some increasing function as Z₀ grows

Now, we can make our function increasing by setting “c” to be negative. In particular, we have some value for Z₀, namely N, at which the result of this expression should be -1. We also have another depth, let’s call it F, at which the result of this expression should be 1. This means that we have a near plane and a far plane, and points on those planes get mapped to -1 and 1 respectively. This also means that, between the near and the far plane, depths are not linearly increasing. We also know, because we’re dividing by Z₀, the near plane can’t be at 0. The far plane, however, can be at any arbitrary location (as long as its farther than the near plane).

So, we have a system of equations. We can substitute in the values just described:

b + (c / N) = -1 and b + (c / F) = 1
b = -1 - (c / N)

(-1 - (c / N)) + (c / F) = 1
(c / F) - (c / N) = 2
c - (c * F / N) = 2 * F
c * N - c * F = 2 * F * N
c * (N - F) = 2 * F * N
c = (2 * F * N) / (N - F)

b = -1 - ((2 * F * N) / (N - F)) / N
b = -1 - ((2 * F) / (N - F))
b = ((F - N) / (N - F)) - ((2 * F) / (N - F))
b = (F - N - 2 * F) / (N - F)
b = (-F - N) / (N - F)
b = - (N + F) / (N - F)
b = (F + N) / (F - N)

We can then check out work to make sure the values substituted in get the correct result:

((F + N) / (F - N) * N + ((2 * F * N) / (N - F))) / N
((F + N) / (F - N)) + ((2 * F) / (N - F))
((F + N) / (F - N)) - ((2 * F) / (F - N))
((F + N - 2 * F) / (F - N))
(N - F) / (F - N)
- (F - N) / (F - N)
-1

and

(((F + N) / (F - N)) * F + ((2 * F * N) / (N - F))) / F
((F + N) / (F - N)) + ((2 * N) / (N - F))
((F + N) / (F - N)) - ((2 * N) / (F - N))
(F + N - 2 * N) / (F - N)
(F - N) / (F - N)
1

So, our final matrix:
[cot(θ) 0 0 ]
[ 0 (F + N) / (F - N) (2 * F * N) / (N - F)]

Note that, because F > N, the denominator of “c” is negative, making the whole thing negative. This is what we expected (c must be negative in order for the function to be increasing).

It’s worth thinking about the shape of the depth values that are output by this formula, in particular, the formula is of the form

(negative constant) / variable + constant

which means the graph is a flipped 1/x graph, like this

The constants can scale the curve vertically, and move it vertically, but we can’t do anything horizontally. In particular, we picked constants such that the points (N, -1) and (F, 1) are on the graph. The depth values between these limits are not linear, and are denser the farther out we go. Also note that it is possible to put the far plane at Z = infinity, in which case we scale and bias the curve such that the horizontal asymptote lies at Z = 1.

There’s one last piece to consider here, and that is that Z₀ divide. In particular, this is the Z₀ that is the input to the projection matrix, not the Z component of the original point. This makes it kind of cumbersome to transform your point, remember one particular value from it, then do another transform. It also means that the output of the matrix transformation step is a point and an associated value to divide it by. Wouldn’t it be simpler if we could just combine both of those items into a vector? Maybe we could put this divisor value into an extra component of the resulting post-transformed point. Well, we can do that by adding a new row to our projection matrix

[0 1 0]

This means that that last component, the w component, of the resulting vector, will simply get set to Z₀. Then, the result of the matrix transformations is just a single (homogenous) vector with a non-one w component. The division step is then just to divide this vector by its w component, thereby bringing the w component back down to 1.

In the end, our projection matrix is this:
[cot(θ) 0 0 ]
[ 0 (F + N) / (F - N) (2 * F * N) / (N - F)]
[ 0 1 0 ]

So, there we have it. We have some matrices (Model, View, and Projection) which we can multiply together ahead of time to create a single transform which will transform our input point into clip space. The result of this is a single vector, with a W component not equal to 1 (This vector is the output of the vertex shader, being stored in gl_Position). Then, the hardware will divide the vector by its w component (in hardware on the GPU - very fast). The reason why we don't do this W divide in the vertex shader is 1. It's faster to let dedicated hardware do it, and 2. The hardware needs access to w in order to do perspective-correct interpolation. This division is where perspective happens. This resulting vector (with W = 1) is in Normalized Device Coordinates. The X and Y coordinates of this vector are simply the screen coordinates of the point (once scaled and biased by the screen’s resolution, remember these values are between [-1 and 1]). The Z component is a non-linear, increasing depth value, where depth values get denser as you go further out. This depth value is used for depth comparisons.

P.S. Note that, if you're using an orthographic Projection matrix, you can simply make the last row of the matrix [0 0 1] so that the w-divide does nothing.

P.P.S. Note that if you take the limit of that matrix, as F approaches infinity, you get

[cot(θ) 0 0 ]
[ 0 1 -2 * N]
[ 0 1 0 ]

P.P.P.S. Note that if you look at the implementation of this in a software package, such as GLM, you'll see some negatives in its version of the matrix. This is because of the right-hand coordinate system, where points farther away are in the negative-Z half-space, and as you go further away, points' Z coordinates get more and more negative. In the derivation described above, I was using a left-hand coordinate system, where the viewer looks in the direction of positive, increasing Z.

Monday, May 4, 2015

How Cocoa Apps Are Written

Cocoa is the framework on Mac OS X and iOS that you use if you want to make an app. It serves as a high-level API to anything and everything that you ask the platform to do. If you use Cocoa, then all the code in your app is simply business logic, directing Cocoa to do all the heavy lifting for you.

First off, apps have GUIs. On OS X, you’ve got a Window (or multiple) that display(s) stuff. On iOS, you’ve got the whole screen (which, for the sake of brevity, I’m going to call a Window).

One possible system would be to give the developer a canvas the size of the window and tell him/her to go wild. (This actually seems to be the approach that Wayland uses). However, this doesn’t have the benefit of composability or reusability. Indeed, a better approach would be to section off a portion of the window and deal with that piece in isolation - a sort of mini-window inside the window. (Ironically, Microsoft Windows even uses this terminology somewhat.) Then, we can re-use this mini-window. In order to make these things composable, we could put mini-windows inside mini-windows. That way, we can build large pieces out of a collection of small pieces.

In Cocoa, these things are called Views. A view is a rectangular portion of the window which contains its own conceptual canvas. Views are arranged in a tree, so you can put Views inside Views - these subviews will appear within the superview. A View can draw into its own canvas, and all drawing operations are clipped to the view’s bounds. Each View has a “frame” attribute, which refers to where it lies within its superview. Also note that Views are drawn in order, so if sibling views that overlap, the last sibling appears on top.

Now, you can create different kinds of Views by subclassing the View class. Apple has actually already done this for many kinds of commonly used Views. For example, if you want to show some text, use a TextView, which has a “string” property. If you want to show an image, use an ImageView, which has an “image” property. Cocoa includes many of these prebuilt View subclasses. Cocoa is also smart enough to realize that when you change the content of one of these built in views, the window is dirty and needs to be repainted, and it handles all of the machinery required to do that. There are also built-in Views, such as StackView, which don’t draw anything themselves, but are simply designed to be a container to position all of their subviews.

You can also create your own custom View by creating a new subclass of View. If you want it to display something within the bounds of your View, override the drawRect: method. In this method, you can use the Cocoa 2D drawing functions such as calling “stroke” on a BezierPath. All 2D drawing APIs use a “context” object, and in Cocoa, the context is stashed in a static variable before drawRect: so the Cocoa drawing functions can interrogate it.

Views are bidirectional, meaning that they transfer information from the program to the user, but they also transfer information from the user to the program. This has the form of mouse events, touch events, keyboard events, and stuff like that. If the user interacts with a View, Cocoa calls a callback on the View. This is how buttons work.

Apps also have data. There is some graph of objects which represents the source of truth for the app. This data model is conceptually distinct from the Views, which are all about presentation. Surely the data itself is different from how it is presented. This data model is often persisted to some stable store so that it can be loaded for the next time the app is started. The way to do this with Cocoa is to use Core Data. You can give a tree of predicates to Core Data and ask it to create in-memory objects representing all database records that pass the expression. Once you have the in-memory objects, you can update them or delete them, and CoreData is responsible for updating the database to reflect that. You can also ask Core Data to insert a new record, and then you manually populate itself just like any other update. In order for Core Data to know what database scheme to use (It is in charge of managing all aspects of the database - the only interface you have to it is by way of the objects it produces), you have to provide it with a data model, which describes all the entities and relations between entities. In short, it is a template for your object graph. Core Data will then take it from there.

Okay, so now we’ve got data and a presentation of that data. Because these things are conceptually distinct, we need some code to populate information inside the View hierarchy given our ground truth data model. We also need some code to react to user events (usually by modifying the data model). This code is generally called “Controller” code.

There we have it - we have a data model, Controller code, and a hierarchy of Views. The Controller is the intermediary between the data and the Views, and information passes through it in both directions (from views to data and from data to Views). The data doesn’t know anything about the Views, and the Views don’t know anything about the underlying data model. The Views are just dumb displays of information, and the data is just that dumb data itself.

Therefore, the Controller is where the app is. All the executive logic about what to modify, when, under which situations, is in the Controller. The important bit - the transitions - occur in the Controller. The controller is said to own its data and its Views. This is actually enforced with reference counting - Controllers have strong references to its subtree of Views and its data.

Now, it turns out that apps usually have one portion of their UI operate on a portion of data fairly independently from the rest of the app. It makes sense - data is organized visually, so one subtree of views operates on one subset of data in the app. This means that we can actually apply the same composable logic to Controllers as we did for Views. The Controller tree, however, is much more sparse than the View tree, as you don’t need a controller for every last button and text field in your app. Instead, you have a hierarchy of conceptual pieces of your app, each one gets a Controller, and each Controller gets a subtree of Views. The Views are all connected up into a single big tree (so that everything gets drawn properly), but only certain Views correspond to a matching Controller.

This actually affects how you think about the internal workings of your app. Let’s say one Controller somewhere realizes that some View somewhere needs to be updated. If the Controller owns the View in question, great - it just updates the view and is done. However, if the Controller doesn’t own the View, it must notify either its parent Controller or one of its children Controllers that something happened. Now - this part is critical - it is up to this ::other:: Controller to determine how to react to the message it just received. This means that updates to a particular View can’t come from just anywhere; instead, all updates must go through one place whose job it is to make sure the View is showing the correct information. This makes it straightforward to implement policy regarding either presentation or persistence of data. Your app has just turned from a giant monolithic item to a network of messages flowing between pieces, each of which has their own concerns.

I’ll now take this opportunity to mention that Views are implemented by the UIView & NSView classes (for iOS & OS X, respectively), and Controllers are implemented by the UIViewController & NSViewController classes. The ViewController classes have a strong references to a “view” which is the root of their owning subtree. They also have a weak reference to their “parentViewController” and strong reference to an array of “childViewControllers.”

Now, it turns out that most apps set up a template view hierarchy at app launch time, and then keep it around indefinitely. This, coupled with Views’ tree structure makes them ripe for a declarative description of a view hierarchy. Once you have a declarative description, you can create an editor to build the hierarchy as if it is data. This exists as part of Xcode called Interface Builder. The declarative description of a View hierarchy is contained within .xib files. Interface Builder has a tree view on the left where you can describe your hierarchy, and a details view on the right which allows you to describe attributes and properties that each view should have (such as initial text content, font color, position, etc.).

So what happens at runtime with these .xib files? Well, Cocoa has this concept of a “bundle.” Bundles are a folder where a framework or app has all of its constituent files. Certainly, shared libraries or the executable is within a bundle, but any required data files are inside a bundle as well. Artwork and shaders required by the framework or app go inside the bundle. (A bit of terminology: A Framework exists on disk as a bundle, and one of the files inside the bundle is a library - either static or shared.) Bundles all have an Info.plist file, a sort of manifest, describing their contents. (A plist file is just a hierarchical file with a particular schema which allows typed key/value pairs, readable by Cocoa. It stands for “property list.”) When you start an app or use a Framework, the app / Framework has access to all the files in its particular bundle.

Now, the main() of a Cocoa app usually just has a single call to NSApplicationMain() in it. One of the things that occurs within this call is that the Info.plist file is read, and inside it is the filename of a .xib file. Cocoa will then open the .xib file and instantiate all the views inside it and set up the attributes on the views and relationships between the views that are described by the file. This means that when you run your app, you can have something that looks halfway decent without even writing a single line of code! You can even tell Cocoa to use a particular View subclass (as a string) and Cocoa will use the magic of Objective C’s dynamism at runtime to instantiate that subclass instead. Cool stuff!

This is all well and good, but there are two new problems: from the perspective of code that is running in a Controller,

How do we refer to a View which Interface Builder has created in order to push information to it?
Users will interact with the views which Interface Builder has created. However, all the interactions which the user performs are interactions with Views, not Controllers. How does flow control get from Views (as caused by a user action) to Controllers?

Interface Builder has a solution to each of these problems. The solution to the first problem is called an IBOutlet. In your controller, you annotate a variable (with type of some View subclass) with the keyword “@IBOutlet” and then tell Interface Builder to “connect” this variable with a particular View inside Interface Builder (done by dragging and dropping). This actually just sets a string inside the .xib. At runtime, due to the magic of Objective-C’s (and Swift’s) dynamic nature, Cocoa can look up the variable by name (with just a string!) and set it to whatever it wants.

The second problem is solved by something similar called an IBAction. This is the same kind of idea, except this time it’s performed on functions instead of variables. In your Controller, you annotate a function with “@IBAction,” and by dragging and dropping, you associate it with a View inside Interface Builder, which sets a string inside the .xib. Then, at runtime, Cocoa will actually set 2 variables on the View: a “target” and an “action.” The “target” is a weak reference to the controller you identified when you dragged and dropped, and the “action” is the selector to call on it when the View (really, “Control,” a subclass of View) is activated. “Activation” means different things to different Controls - a button activates when the user clicks on it, a text field activates when the user presses enter or when it loses focus, etc. Note that the “target” reference is weak, because it usually points up to the owning Controller.

But let’s think a little more about this “target” thing. In particular, what happens when you drag onto a class that is completely unrelated to Interface Builder, so Interface Builder has no concept of it? In this case, the dragging operation fails, and no IBAction is created. The IBAction (and IBOutlet) dragging operation only succeeds on classes which Interface Builder recognizes.

Well, which classes does Interface Builder recognize? Well, .xibs also have a collection of objects (not Views) which are instantiated along with everything else inside the .xib. If you specify the class of one of these objects as a subclass of ViewController, you can then set the “view” property on that object to a particular view in the hierarchy. (At runtime, one instance of this ViewController will be created along with all the Views, and the “view” property will be set accordingly.) If you set the class of this ViewController, then this is a class that Interface Builder understands, and you can then drag IBActions and IBOutlets to it.

There is actually another way to tell Interface Builder about a class, and that is the magical “File’s Owner.” Each .xib file has an object in it called “File’s Owner.” Now, the actual interface to the Cocoa function that opens and instantiates .xib files requires an extra argument called “owner.” Anything in the .xib file which refers to this “File’s Owner” will then be set to this object. For the main .xib file (the one that is listed in Info.plist), the File’s Owner is set to one of the autogenerated class stubs that Xcode creates for you when you create the project (and is therefore sensitive to the checkboxes you provide during the creation wizard). It is important, though, to realize that you can change the recorded type of “File’s Owner” inside a .xib file. Once you’ve done this, you have told Interface Builder about the type of file which will be opening this .xib, which means you can then drag IBOutlets and IBActions onto it. Note that this “File’s Owner” is usually the controller for a subtree of views.

You can also create secondary .xib files, and instantiate them (read: cause Cocoa to instantiate all the objects listed within) from code. When you do this you can set the owner argument. Also, when you do this, you immediately have access to all the objects which Cocoa created, so you are free to set any upward-pointing weak references inside the new tree as you will.

We are now at a point where we can characterize the setup of Cocoa apps that use Interface Builder. Apps have a main .xib file which is instantiated at startup. There is one object which acts as the “File’s Owner” of the .xib file, and Xcode creates stubs for this class at project creation time. During .xib instantiation, upward-pointing weak references are set up inside Views created therein, and downward-pointing strong (or weak, you have the option for either when you drag and drop) references are set up inside any classes that Interface Builder knows about. When the user interacts with views, messages are passed upward via weak references, and the ViewController is free to do whatever it wants with the message, possibly routing it to a parent ViewController via a weak reference (and, hopefully, an interface which allows for reusability and testability) or down to another child ViewController via a strong reference (or handled itself). ViewControllers own their data and View subtrees, and can interact with them as they want, thanks to IBOutlets which were set up by Cocoa at creation time. ViewControllers can instantiate child ViewControllers and View subtrees in code by calling into Cocoa and passing self as the file’s owner. Then these controllers can interact with the newly created instances. All the while, a ViewController can interact with the data that it owns, and the data is responsible for persisting itself to a database. Some Views have a weak reference to a delegate or a dataSource, which are usually implemented by a higher-level ViewController.

Throughout this discussion I have neglected to mention layout; that will be the subject for another post. Overall, though, there must be some way for each view to know where it should end up relative to its parent, so that all views get a place (and size) on screen. This computation must be repeated whenever the owning window or a superview resizes (otherwise, you wouldn’t be able to implement a behavior of something like “stick to the right edge”). This is an entire subsystem inside Cocoa named Auto Layout.