In Search Of Depth: Camera Calibration and Stereo Rectification
By Vikko Smit (QdepQ)
QdepQ Systems has teamed up with NOOR and their corporate partner Nikon Professional Services to create a stereo rig to shoot stereo content for the FotoInMotion project. It is a unique opportunity to explore the cutting edge of 3D technology that is key to making more powerful visual content for digital users.
Why stereo images?
The most powerful immersive effects come from a sense of the space within a photograph. Many photographers, knowingly or not, already use depth indicators in their photos. Perspective, for instance, is a strong indicator of space. But weaker effects like bokeh or defocus are also a popular choice. And even though most photographers are currently still not using devices with multiple cameras, they are seeking strong immersive effects in their images.
In the FotoInMotion tool, we try to estimate depth from a single picture in order to apply these strong immersive effects that require depth information. Of course, the quality of the depth estimation has to be measured, and that's where stereo images make the difference.
There are already a lot of stereo datasets available, but many focus on automotive applications or are in low resolution. This dataset will be captured in high resolution on content that is relevant to the pilot cases. With this dataset, we will obtain a ground truth to improve our depth reconstruction.
The stereo rig
QdepQ Systems is collaborating with NOOR and their partner Nikon Professional Services to create a rig specifically suitable to shooting relevant high-quality stereo content for the FotoInMotion project. The goal is for the professional photographers to work as naturally and effortlessly as they can, whilst still acquiring the added depth information from the stereo photo sets.
Each rig is built around a set of two professional Nikon Z6 full-frame cameras, each fitted with a 35mm f/1.8 prime lens. The mirrorless Z6 and prime lens combination allows the rig to contain two cameras and still be light enough to use comfortably handheld, which would be much harder with DSLR cameras and zoom lenses. Because both cameras are the same model, all the camera settings can be synced easily by swapping a memory card between them.
Prime lenses are fitted for multiple optical reasons. Firstly, compared to zoom lenses they offer a larger depth of field, the range of distances over which objects are in focus, which is very important for our dataset. Secondly, they have fewer moving parts, which eliminates many variables and makes them easier to calibrate. Thirdly, due to their large aperture they allow for shorter shutter times, which helps against motion blur. Lastly, the zoom factor and field of view are fixed, so the operator does not have to worry about keeping two lenses in sync.
It is critical that both cameras take a picture at exactly the same time. To realise this, a trigger mechanism was created by modifying two Nikon remote triggers so that both cameras respond to the trigger of either. This way the photographer can use the rig as intuitively as they would their normal camera.
To guarantee that the cameras remain in a fixed position relative to each other, they are mounted in cages, which in turn are secured together with two mounting plates. This way the cameras only need to be calibrated each time the rig is assembled (after transport). The bottom mounting plate also allows the whole rig to be placed securely on a tripod if the photographer wants to use it that way.
What about camera calibration?
The pinhole model
A visual representation of the pinhole camera model. Source: Wikipedia
The pinhole camera model describes an idealised projection of 3D space onto an image. The aperture is a single point in space and no lenses are used to redirect and focus the light. In the ideal world it describes, the captured image has none of the distortions usually caused by the mechanics and physics of a real camera.
The intrinsic parameters relate to the internal mechanics of the camera. They consist of scaling factors for horizontal and vertical pixels, a skew coefficient between the x and y axes (which is often 0) and an offset to the center of the image.
The extrinsic parameters of a camera relate 'real world' 3D coordinates to image coordinates. They combine a rotation matrix with a translation to the origin of the world coordinate system.
The most common way to calibrate cameras is to shoot a set of checkerboards. The crossings of the squares form an easy-to-detect grid pattern, from which you can build a system of equations that estimates the parameters of the setup and produces a calibration matrix for the camera. With this calibration matrix, you can compensate for all the variations in the parameters of the camera and treat the captured images as if they were shot in the preferred ideal world: by a pinhole camera.
Example of the camera calibration toolbox in MATLAB
When the two matrices are combined, a mapping is formed from a 3D point in the world to a pixel coordinate in the camera.
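This combined mapping can be written as p ~ K [R | t] X: apply the extrinsics, then the intrinsics, then divide by depth. A minimal NumPy sketch with illustrative values (not the actual rig's parameters):

```python
# Project a 3D world point to a pixel coordinate via p ~ K [R | t] X.
import numpy as np

K = np.array([[1500.0, 0, 960.0],    # intrinsics: focal length, principal point
              [0, 1500.0, 540.0],
              [0, 0, 1.0]])
R = np.eye(3)                        # extrinsics: camera aligned with world
t = np.array([0.0, 0.0, 0.0])

X = np.array([0.2, -0.1, 2.0])       # a point 2 m in front of the camera
p = K @ (R @ X + t)                  # homogeneous pixel coordinates
u, v = p[:2] / p[2]                  # divide by depth to get the pixel
print(u, v)                          # → 1110.0 465.0
```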
Once both cameras are calibrated and share the same model of the observed world, transformations can be calculated from left pixel coordinates to right pixel coordinates and vice versa.
By comparing the left and right images in left-image coordinates, the pixel shift (disparity) of each point indicates how close it is to the camera: the larger the shift, the closer the point. This shift is used to form a depth map that is further used as a metric for depth estimation.
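For a rectified pair, the relation between shift and distance is Z = f · B / d, where f is the focal length in pixels, B the baseline between the cameras and d the disparity. A small numeric sketch with illustrative values (not the rig's actual focal length or baseline):

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d.
f = 1500.0        # focal length in pixels (illustrative)
B = 0.12          # baseline between the two cameras in metres (assumed)

for d in (10.0, 50.0, 200.0):        # disparity in pixels
    Z = f * B / d
    print(f"disparity {d:5.0f} px -> depth {Z:.2f} m")
```

The output illustrates the inverse relation: a 10 px shift corresponds to a point 18 m away, while a 200 px shift corresponds to one only 0.9 m from the rig.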
So how does it look?
First, a left-to-right stereo disparity map was computed for each image using the pre-computed calibration data. For each pixel, this map contains the value of the shift relative to its counterpart in the right image, which is inversely proportional to the depth of that point in space. Not all pixels can be used for stereo comparison: they need to be both visible and distinct in the left and the right pictures. Since the disparity cannot be computed for every pixel of the image, only those with a valid disparity are given a value in the disparity map. Missing data can be filled in using various interpolation techniques, also known as depth inpainting.