Camera Calibration - From 3D World to Image

Understanding the mathematics behind camera calibration and how 3D world points are transformed into 2D images

This article was originally published at drstarry.github.io.

Introduction

Have you ever thought about how we capture the beautiful world around us? At least to me, it is quite mysterious. From a modern camera's point of view, it transforms a point from an arbitrary world coordinate system to camera-frame coordinates, and then to a 2D image with a certain pixel structure.

In this article, I'm going to talk about how these two transformations work mathematically.

Basics

The following image shows the basic projection model, which adopts the pinhole camera as an approximation.

Camera System (source)

Notations:

  • COP refers to the center of projection, also called the optical center of the camera, which is the origin of our camera coordinate system.
  • PP is the image plane in the camera frame.
  • f refers to the focal length, the distance from the COP to the image plane (denoted d in the image above).

In this representation, we put the image plane in front of the COP to avoid mathematically flipping the image. As a result, we have x pointing right, y pointing up, and z pointing inwards.

Now we can project a point (x, y, z) in the world onto the image plane at (-dx/z, -dy/z, -d), derived from similar triangles. The location on the image is simply (x', y') = (-dx/z, -dy/z), obtained by dropping the z value.
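As a quick sanity check, here is a minimal NumPy sketch of this similar-triangles formula. The helper name project_pinhole and the sample numbers are mine, and the scene point is placed at negative z so the signs in the formula work out:

```python
import numpy as np

def project_pinhole(point, d):
    """Project a 3D point onto the image plane at z = -d (pinhole model)."""
    x, y, z = point
    # Similar triangles: the ray through the COP meets the plane z = -d
    # at (-d*x/z, -d*y/z, -d); dropping the constant z gives (x', y').
    return np.array([-d * x / z, -d * y / z])

# A point 4 units along the viewing direction (negative z in this sketch).
print(project_pinhole(np.array([2.0, 1.0, -4.0]), d=1.0))  # -> [0.5  0.25]
```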

Degree of Freedom

Mathematically, the degrees of freedom of a random vector is the number of dimensions of its domain, or essentially the number of "free" components: how many components must be known before the vector is fully determined.

We will abbreviate this as DoF and use it to measure each step of the transformations that follow.

Homogeneous Coordinates

Note that the operation of converting the location to (x', y') is not actually a linear transformation, because division by a non-constant z is non-linear: z is an independent value for each point, so the transformation would have to be done case by case. To fix this, we introduce a new coordinate representation:

(x, y, z) \rightarrow \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}

The 1 we appended is called the homogeneous component of the vector; later we will also use other values.

Converting from homogeneous coordinates back to non-homogeneous ones is easy: divide by the w value, the scale factor.

\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} \rightarrow (x/w, y/w, z/w)

Now we can use matrix multiplication with homogeneous coordinates to perform the projection (recall that f is the focal length):

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1/f & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x \\ y \\ z/f \end{bmatrix} \rightarrow (fx/z, fy/z) = (u, v)

where (u, v) can be thought of as the pixel coordinates on the 2D image. We can also scale the projection matrix and get an invariant result:

\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} fx \\ fy \\ z \end{bmatrix} \rightarrow (fx/z, fy/z)
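Here is a short NumPy sketch of the projection in this matrix form, using the second (scaled) matrix above; the numbers are made up for illustration:

```python
import numpy as np

f = 2.0  # focal length

# The second (scaled) projection matrix from above.
P = np.array([[f, 0.0, 0.0, 0.0],
              [0.0, f, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

X = np.array([3.0, 6.0, 4.0, 1.0])  # homogeneous point (x, y, z, 1)
u_h = P @ X                          # -> [f*x, f*y, z]
u, v = u_h[:2] / u_h[2]              # divide by the scale factor w = z
print(u, v)                          # -> 1.5 3.0, i.e. (f*x/z, f*y/z)
```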

We will use homogeneous coordinates to perform many powerful transformations in the following sections.

Camera Calibration

Geometric calibration divides into two parts: extrinsic and intrinsic.

The extrinsic part maps from an arbitrary world coordinate system to the camera's 3D coordinate system; this mapping is also called the camera pose.

The intrinsic part maps from 3D coordinates in the camera frame to the 2D image plane via projection.

Extrinsic Parameters

Before we talk about extrinsic parameters, we need to understand how we represent a point in a certain coordinate system.

Coordinate System (source)

As the picture above shows, P is a point in coordinate system A:

P_A = \begin{bmatrix} x_A \\ y_A \\ z_A \end{bmatrix}

which is equivalent to

\overline{OP} = (x_A \cdot \overline{i_A}) + (y_A \cdot \overline{j_A}) + (z_A \cdot \overline{k_A})

Translation

Now if we want to translate P from coordinate system A to B,

Translation (source)

we simply take the location in coordinate system A and add the origin of A expressed in coordinate system B.

P_B = P_A + O_{A,B}

The same rule applies in the other direction.

P_A = O_{B,A} + P_B

Fortunately, we learned in the last section that homogeneous coordinates let us represent such operations as matrix multiplications. Thus we have

P_B = P_A + O_{A,B}

which is equivalent to

\begin{bmatrix} P_B \\ 1 \end{bmatrix} = \begin{bmatrix} I & O_{A,B} \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} P_A \\ 1 \end{bmatrix}

where I is the 3x3 identity matrix.

There are 3 DoFs in translation.
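A minimal NumPy sketch of this homogeneous translation, with translation_matrix as a hypothetical helper and made-up coordinates:

```python
import numpy as np

def translation_matrix(o_ab):
    """4x4 homogeneous matrix [[I, O_AB], [0^T, 1]]."""
    T = np.eye(4)
    T[:3, 3] = o_ab
    return T

P_A = np.array([1.0, 2.0, 3.0, 1.0])   # point in frame A (homogeneous)
O_AB = np.array([10.0, 0.0, -5.0])     # origin of A expressed in frame B
P_B = translation_matrix(O_AB) @ P_A
print(P_B[:3])                         # -> [11.  2. -2.]
```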

Rotation

Besides translation, the point can also be rotated relative to another coordinate system:

Combined Transformation (source)

We can represent this as

\overline{OP} = \begin{bmatrix} i_A & j_A & k_A \end{bmatrix} \begin{bmatrix} x_A \\ y_A \\ z_A \end{bmatrix} = \begin{bmatrix} i_B & j_B & k_B \end{bmatrix} \begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix}

or

P_B = R_{A,B} \cdot P_A

where R_{A,B} describes frame A in the coordinate system of frame B.

Using homogeneous coordinates, we can represent rotation as

\begin{bmatrix} P_B \\ 1 \end{bmatrix} = \begin{bmatrix} R_{A,B} & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} P_A \\ 1 \end{bmatrix}

There are 3 DoFs in rotation.
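A short NumPy sketch of rotation in homogeneous form; the rotation_z helper (rotation about the z-axis) is just one illustrative choice of R_{A,B}:

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Embed R_{A,B} into the 4x4 homogeneous form [[R, 0], [0^T, 1]].
H = np.eye(4)
H[:3, :3] = rotation_z(np.pi / 2)

P_A = np.array([1.0, 0.0, 0.0, 1.0])
print((H @ P_A)[:3])   # ~ [0. 1. 0.]: the x-axis maps onto the y-axis
```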

Rigid Transformation

Combining the previous observations, we can unify translation and rotation into a single rigid transformation:

\begin{bmatrix} P_B \\ 1 \end{bmatrix} = \begin{bmatrix} I & O_{A,B} \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R_{A,B} & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} P_A \\ 1 \end{bmatrix} = \begin{bmatrix} R_{A,B} & O_{A,B} \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} P_A \\ 1 \end{bmatrix} = T_{A,B}\begin{bmatrix} P_A \\ 1\end{bmatrix}

where T_{A,B} is the rigid transformation from A to B. In the other direction, we have

\begin{bmatrix} P_A \\ 1 \end{bmatrix} = T_{A,B}^{-1}\begin{bmatrix} P_B \\ 1\end{bmatrix}

The entire 4x4 matrix \begin{bmatrix} R_{A,B} & O_{A,B} \\ 0^T & 1 \end{bmatrix} is the extrinsic parameter matrix, finally!

In total, we have 6 DoFs in the extrinsic part: 3 for translation and 3 for rotation.
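Since R is orthogonal, the inverse has the closed form T_{A,B}^{-1} = \begin{bmatrix} R^T & -R^T O_{A,B} \\ 0^T & 1 \end{bmatrix}. A NumPy sketch with hypothetical helpers rigid_transform and rigid_inverse:

```python
import numpy as np

def rigid_transform(R, t):
    """Assemble T_{A,B} = [[R, O_AB], [0^T, 1]] as a 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rigid_inverse(T):
    """Closed-form inverse of a rigid transform: [[R^T, -R^T t], [0^T, 1]]."""
    R, t = T[:3, :3], T[:3, 3]
    return rigid_transform(R.T, -R.T @ t)

R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])        # 90-degree rotation about z
t = np.array([1.0, 2.0, 3.0])
T_AB = rigid_transform(R, t)

P_A = np.array([1.0, 1.0, 1.0, 1.0])
P_B = T_AB @ P_A                        # frame A -> frame B
print(np.allclose(rigid_inverse(T_AB) @ P_B, P_A))   # -> True
```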

Intrinsic Parameters

Recall that in the homogeneous coordinates section we derived a perspective projection, but it was the ideal version:

(fx/z, fy/z) = (u, v)

It is ideal because, in the real world:

  • pixels are measured in some arbitrary spatial unit, and f is often expressed in pixels, so we use a scale factor \alpha to represent f
  • pixels are not always square, so we need another scale factor \beta
  • the pixel axes may be skewed, with an angle \theta between them
  • the optical center is not always at the image center, so we add an offset to each dimension, u_0 and v_0

Real World Camera (source)

Our pixel equations then become

u = \alpha \cdot (x/z) - \alpha \cot(\theta) \cdot (y/z) + u_0, \quad v = (\beta/\sin(\theta)) \cdot (y/z) + v_0

Now let's convert it to homogenous coordinates,

\begin{bmatrix} z \cdot u \\ z \cdot v \\ z\end{bmatrix} = \begin{bmatrix} \alpha & -\alpha \cot(\theta) & u_0 & 0 \\ 0 & \beta/\sin(\theta) & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}

There are 5 DoFs in this model: \alpha, \beta, \theta, u_0 and v_0.

We can also write the same model in a simpler form:

\begin{bmatrix} f & s & x'_c \\ 0 & af & y'_c \\ 0 & 0 & 1\end{bmatrix}

where f is the focal length in pixels, a is the aspect ratio between the two pixel dimensions, s is the skew, and x'_c, y'_c are the offsets. Again there are 5 DoFs.

The matrix that takes camera-frame 3D coordinates to homogeneous pixel coordinates holds the intrinsic parameters; it is commonly denoted K.
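A NumPy sketch of building this intrinsic matrix and projecting a camera-frame point; intrinsic_matrix is a hypothetical helper, and all parameter values are invented:

```python
import numpy as np

def intrinsic_matrix(alpha, beta, theta, u0, v0):
    """K for the model above; the all-zero fourth column is dropped."""
    return np.array([[alpha, -alpha / np.tan(theta), u0],
                     [0.0,    beta / np.sin(theta),  v0],
                     [0.0,    0.0,                   1.0]])

# theta = 90 degrees means no skew; all numbers are made up.
K = intrinsic_matrix(alpha=800.0, beta=800.0, theta=np.pi / 2,
                     u0=320.0, v0=240.0)

P_cam = np.array([0.1, -0.2, 2.0])   # 3D point in the camera frame
uvw = K @ P_cam                       # homogeneous pixel coordinates
u, v = uvw[:2] / uvw[2]               # divide by w = z
print(u, v)                           # ~ 360.0 160.0
```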

Combining them

Now we take everything together: translation, rotation, and the intrinsic projection form our final camera parameters, as shown below. There are 11 DoFs in total (6 extrinsic + 5 intrinsic).

Complete Camera Model (source)
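Finally, a NumPy sketch of composing the full pipeline M = K [R | t] and projecting a world point to pixels; every value here is invented for illustration:

```python
import numpy as np

# Complete model M = K [R | t]: 5 intrinsic + 6 extrinsic = 11 DoFs.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])      # intrinsics (no skew here)
R = np.eye(3)                              # rotation (3 DoFs)
t = np.array([0.0, 0.0, 2.0])              # translation (3 DoFs)

M = K @ np.hstack([R, t.reshape(3, 1)])    # 3x4 camera matrix

P_world = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous world point
uvw = M @ P_world                          # world -> homogeneous pixels
u, v = uvw[:2] / uvw[2]
print(u, v)                                # -> 420.0 290.0
```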

Reference

  1. Class materials from Introduction to Computer Vision, Georgia Tech & Udacity
  2. Degrees of freedom, Wikipedia