3D position estimation of a known object using a single camera
In Computer Vision, there are many ways to find the 3D location of an object, such as using a stereo camera, Lidar, or Radar. But sometimes it is possible to achieve 3D perception with just a single camera. The one condition for finding the 3D location with a single camera is that the real-world size of the object whose position we want to estimate must be known. Keep in mind that an object may appear with a different apparent size in the image when its orientation changes. In this article, to avoid the added complexity of estimating the object's orientation, we will estimate the 3D position of a ball, since a ball has the same apparent size no matter which direction it is viewed from.
Here is how it works:
- We calculate the relation between 3D points in the camera coordinate system and 2D pixel coordinates in the image. This step is called camera calibration, more specifically intrinsic parameter calculation.
- Then we use an object segmentation algorithm to find the object of interest.
- We measure the size of the object in the image, in pixels.
- We use the size of the object and the pixel position of the center of the object to estimate the 3D position using the intrinsic parameters obtained in step 1.
This article covers the basic background knowledge you need and then explains the algorithm itself. There is also a link to a C++ implementation using OpenCV at the end of this article.
Camera Coordinate System
The camera coordinate system is a 3D Cartesian coordinate system with its origin at the optical center of the camera and its Z-axis along the optical axis of the camera. It is used to describe the 3D position of an object with respect to the camera.
Pixel Coordinate System
Pixel coordinates are simply the position of each pixel in the image, measured from the top-left corner.
Camera Calibration
Camera calibration is the process of estimating the intrinsic and extrinsic parameters of the camera. In simple terms, the extrinsic parameters describe the relation between a given 3D world coordinate system and the position of the camera, while the intrinsic parameters describe the relation between the 3D coordinates of a point in the camera coordinate system and its 2D pixel coordinates in the image captured by the camera.
For our goal, we only need the intrinsic parameters of the camera. The equations below describe the intrinsic parameters mathematically.
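Written out, a 3D point (X, Y, Z) in the camera coordinate system projects to the pixel coordinates (u, v) through the intrinsic parameters as:
u = fx * (X / Z) + cx
v = fy * (Y / Z) + cy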
The X, Y, Z values are the Cartesian coordinates of a point in the real world expressed in the camera coordinate system, i.e., with the camera at the origin and the Z-axis along the optical axis of the camera.
Focal length parameters:
fx and fy are the focal lengths in pixels along the x-axis and y-axis. In an ideal pinhole camera model, the focal length is a single value in units of distance and is the same for both axes. But in a real digital camera, due to factors such as non-square sensor pixels and distortion caused by the lens, the effective focal lengths along the x and y axes are not always the same.
Here is an equation to understand the "focal length in pixels" concept:
fx = F * w / W
- fx is the focal length in pixels for the x-axis
- F is the focal length of the camera in mm
- W is the width of the camera sensor in mm
- w is the width of the image in pixels
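For example, with an assumed focal length F = 4 mm, sensor width W = 6.4 mm, and image width w = 3200 pixels (hypothetical values), fx = 4 * 3200 / 6.4 = 2000 pixels.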
Using these focal lengths in pixels, the intrinsic matrix maps points from real-world distance units (millimeters) to pixel units in the image.
Principal Point parameters:
The principal point (cx, cy) gives the offset in pixels between the two coordinate systems (the pixel coordinate system and the camera coordinate system).
The optical axis of the camera coordinate system usually passes close to the center of the image, while the pixel coordinate system has its origin in the top-left corner. Hence, cx is usually about half the width of the image and cy about half the height of the image.
How can we get the intrinsic matrix?
OpenCV provides the tools and steps to perform camera calibration. You can learn how to calibrate the camera and obtain the intrinsic parameters from the OpenCV calibration tutorial.
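As a rough illustration, here is a minimal C++ sketch of chessboard-based calibration with OpenCV; the board dimensions, square size, and image file names are assumptions you would replace with your own setup.

```cpp
// Minimal chessboard calibration sketch (C++ / OpenCV).
// Assumes a 9x6 inner-corner chessboard with 25 mm squares and a few captured images.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const cv::Size boardSize(9, 6);   // inner corners of the assumed board
    const float squareSize = 25.0f;   // square size in mm (assumed)

    // 3D corner positions of the board in its own coordinate system (Z = 0 plane).
    std::vector<cv::Point3f> boardPoints;
    for (int i = 0; i < boardSize.height; ++i)
        for (int j = 0; j < boardSize.width; ++j)
            boardPoints.emplace_back(j * squareSize, i * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;  // 3D points per image
    std::vector<std::vector<cv::Point2f>> imagePoints;   // detected 2D corners per image
    cv::Size imageSize;

    std::vector<std::string> files = {"calib_01.jpg", "calib_02.jpg", "calib_03.jpg"};  // assumed paths
    for (const auto& f : files) {
        cv::Mat img = cv::imread(f, cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();

        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            // Refine corner locations to sub-pixel accuracy.
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(boardPoints);
        }
    }

    // Estimate the intrinsic matrix K (fx, fy, cx, cy) and the lens distortion coefficients.
    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, distCoeffs, rvecs, tvecs);

    std::cout << "RMS reprojection error: " << rms << "\n";
    std::cout << "Intrinsic matrix K:\n" << K << "\n";
    return 0;
}
```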
Object Segmentation
We need object segmentation and not just object detection because we need not only the position of the object in the image but also its width in pixels, so that we can compare it with the real width of the object.
There are different ways to do object segmentation, for example with neural networks like Mask R-CNN. For simpler cases like detecting a colored ball, we can just do color segmentation of the image followed by shape analysis.
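As an illustration, here is a minimal C++ sketch of the color-segmentation approach; the HSV thresholds are placeholder values for a hypothetical orange ball and would need tuning for your own object and lighting.

```cpp
// Minimal color-segmentation sketch (C++ / OpenCV) for a colored ball.
// The HSV range below is an assumed example for an orange ball and must be tuned.
#include <opencv2/opencv.hpp>

cv::Mat segmentBall(const cv::Mat& bgrImage) {
    cv::Mat hsv, mask;
    cv::cvtColor(bgrImage, hsv, cv::COLOR_BGR2HSV);

    // Keep only pixels inside the assumed color range of the ball.
    cv::inRange(hsv, cv::Scalar(5, 120, 120), cv::Scalar(20, 255, 255), mask);

    // Clean up small noise in the binary mask.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, kernel);
    return mask;  // white pixels belong to the ball
}
```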
Size Estimation
Once we know, from segmentation, which pixels belong to the object, we can count them to get the object's area in the image. For a ball, the diameter in pixels can then be recovered from the area (d = 2 * sqrt(area / pi)), or obtained directly by fitting a minimum enclosing circle to the segmented region.
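Here is a sketch of one way to measure the ball from the binary mask produced above: pick the largest contour and fit a minimum enclosing circle to it. The helper name and the largest-contour assumption are illustrative choices, not the only option.

```cpp
// Sketch: estimate the ball's center and diameter in pixels from a binary mask.
#include <opencv2/opencv.hpp>
#include <vector>

// Returns true if a ball-like blob was found; outputs its pixel center and diameter.
bool measureBall(const cv::Mat& mask, cv::Point2f& center, float& diameterPix) {
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return false;

    // Assume the largest contour is the ball.
    size_t best = 0;
    double bestArea = 0.0;
    for (size_t i = 0; i < contours.size(); ++i) {
        double area = cv::contourArea(contours[i]);
        if (area > bestArea) { bestArea = area; best = i; }
    }

    float radius = 0.0f;
    cv::minEnclosingCircle(contours[best], center, radius);
    diameterPix = 2.0f * radius;
    return true;
}
```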
Calculate the 3D position
Now let’s use the example of finding the 3D position of a colored ball of a known size in an image. Let us assume the ball is 8 cm in diameter.
Z = (fx * 0.08) / d_pix
X = ((x - cx) * Z) / fx
Y = ((y - cy) * Z) / fy
- d_pix: the diameter of the ball in pixels
- x, y: the pixel coordinates of the center of the ball in the image
- X, Y, Z: the 3D position of the ball in the camera coordinate system (in meters, since the diameter 0.08 is given in meters)
- cx, cy: the principal point
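Putting the formulas together, here is a minimal sketch of the back-projection step; the function name is hypothetical and the 0.08 m diameter is the assumed ball size from the example above.

```cpp
// Sketch: back-project the ball's pixel center and pixel diameter to a 3D point
// in the camera coordinate system, using the intrinsic parameters.
#include <opencv2/opencv.hpp>

// ballDiameterM: real diameter of the ball in meters (0.08 m in this example).
cv::Point3f ballPosition3D(const cv::Point2f& center, float diameterPix,
                           float fx, float fy, float cx, float cy,
                           float ballDiameterM = 0.08f) {
    float Z = (fx * ballDiameterM) / diameterPix;   // depth along the optical axis
    float X = ((center.x - cx) * Z) / fx;           // offset right of the optical axis
    float Y = ((center.y - cy) * Z) / fy;           // offset below the optical axis (image y grows downward)
    return cv::Point3f(X, Y, Z);                    // meters, camera coordinate system
}
```

With fx, fy, cx, cy read from the intrinsic matrix obtained during calibration, and the center and pixel diameter coming from the segmentation step, this returns the ball's position in meters relative to the camera.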
This method of finding the 3D position might not always be feasible, but when the conditions are met, it could save you a lot of money.