

 

Implementing a Simple 2D Object Tracker


Introduction

This article will implement a relatively simple object tracker that is able to track 2D objects. To make this process more interesting, you will use the ladybug as an object to track. First, figure out what an object tracker is and how it works. Assume that you see a ladybug on a green leaf and you want to know where it moves. You can focus your eye on the ladybug and track its movement and orientation (note that you can move yourself at the same time). You can write a program that will do that for you automatically; that program is called an object tracker.



Interesting link: Tracking is not limited to 2D objects. For example, here you can find a demo of a tracker that is able to track a human's head motion in 3D space using a USB web camera.

Compiling the Code

To compile the code, you need to have Visual C++ 6.0 or later installed. Visual C++ Express Edition also fits your needs.

In this program, you use many functions that are provided with the OpenCV library (Intel Open Source Computer Vision Library), so you have to download it (about 18 MB) and install it on your computer.

After you've installed OpenCV, you need to tell Visual Studio where OpenCV is located. In your Visual Studio main window, open the Tools->Options menu item and click the Projects and Solutions -> VC++ Directories menu item. In the "Show directories for:" combo box, select "Library files" and add the following line to the list:

C:\Program Files\OpenCV\lib

In the "Show directories for:" combo box, select "Include files" and add the following lines to the list:

C:\Program Files\OpenCV\cv\include
C:\Program Files\OpenCV\cxcore\include
C:\Program Files\OpenCV\otherlibs\highgui
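
You also need to link against the OpenCV import libraries. The supplied project files should already reference them, but if you create your own project, one way to do it is a #pragma comment directive. This is only a sketch; the library names below are the usual OpenCV 1.x names and may differ for your version:

// Link against the OpenCV import libraries (names assumed for OpenCV 1.x).
#pragma comment(lib, "cv.lib")
#pragma comment(lib, "cxcore.lib")
#pragma comment(lib, "highgui.lib")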

Now, open the workspace file (tracker2d.dsw or tracker2d.sln), compile, and run it.

Running the Demo

When you run the program, you will see a console window. You will be asked to choose what image alignment method to use:

Please choose an image alignment method:
1 - forwards additive
2 - forwards compositional
3 - inverse additive
4 - inverse compositional
Your choice (1-4)? >

Type a number between 1 and 4 to choose the desired method. For example, type 4 to choose the inverse compositional method. I will discuss image alignment methods later.

Then, you will be asked if additional windows (template and confidence map) should be displayed.

Show template and confidence map windows?
Press Y to show. Any other key to not show.

Press Enter to not show those windows. I will discuss later why template and confidence map windows are needed.

Then, you will see a window titled "Video" displaying a ladybug on a green leaf (see Figure 1). If you do not see the "Video" window, it is behind the console window. Click "Video" in the task bar to show it. Position the "Video" window so you can see what is written to the console, because the program prints some diagnostic information there. Please ensure that the "Video" window has the input focus, because all subsequent user input will be directed to this window (not to the console window).

Figure 1: The "Video" window.

What can you see in the "Video" window? You can see a ladybug looking to the west and a blue bounding rectangle around it. You can also see the current FPS (frames per second) and the current tracking time, which shows how much time is needed to estimate the motion parameters of the ladybug between two subsequent frames.

Now, see what the console window shows:

Current motion type: still.
Press any key to choose another motion type.

You are asked to press any key to change the type of motion. Do that (see Figure 2). Now the motion type is translation, and the ladybug starts moving along the X and Y axes. You might say that ladybugs can't walk backwards, and that's true. Assume that the ladybug is not moving at all, but the camera moves to the west, so you get the effect of the ladybug moving backwards.

Current motion type: translation.
Press any key to choose another motion type.

You can see that sometimes the tracker almost loses the ladybug. This is because you use a dynamic template, so error accumulates over the sequence. I will discuss later what a dynamic template is.

Figure 2: Motion type is translation.

Now, press any key to choose the next type of motion (see Figure 3).

Current motion type: rotation.
Press any key to choose another motion type.

Figure 3: Motion type is rotation.

Because ladybugs, unfortunately, are very difficult to rotate, you assume that your camera rotates around the ladybug.

There are three other predefined types of motion (see Figures 4-6): scaling, combined motion (translation, rotation, and scaling at the same time), and slow translation. The last motion type is used to show that the tracker can lose an object even with very slow motion, when its accuracy is not enough to estimate that motion with a dynamic template.

Figure 4: Scale.

Figure 5: Combined motion.

Figure 6: Slow translation.

Finally, you will be asked to press any key to exit the application.

Synthetic Video Capture

You will use a simple class that simulates frame capturing from a digital camera. You will name this class CSynthVideoCapture. Here is the declaration of the class (see synthvid.h):

/* Class: CSynthVideoCapture
 * Brief: This class is used to generate synthetic video
 *        showing a moving object.
 */
class CSynthVideoCapture
{
    // Grant access to private members to CObjDetectorStub
    // because it should know the internal state of an object
    // to "find" its position.
    friend class CObjDetectorStub;
public:
    // Construction/destruction
    CSynthVideoCapture(const char* texture_name,
                       const CvSize frame_size);
    ~CSynthVideoCapture();

    // Operations
    bool IsValid();
    IplImage* GrabFrame();
    void SetMotionType(MotionType motion_type);

private:
    void DrawObject();

    MotionType m_motion_type;

    struct _obj_pose
    {
        float x;
        float y;
        float angle;
        float s;
    } m_object_pose;

    // Transformation matrix.
    CvMat* m_M;
    // Coordinates of pixel in coordinate frame of ellipse.
    CvMat* m_X;
    // Coordinates of pixel in coordinate frame of image.
    CvMat* m_Z;

    IplImage* m_pTexture;      // Pointer to the texture image.
    IplImage* m_pFrame;        // Pointer to the frame image.
    CvRect m_bounding_rect;    // Current object's bounding rectangle.
};

First of all, you declare CObjDetectorStub as a friend class. You will return to that class later and see why it is declared as a friend.

Then, you see a constructor method that takes two parameters: texture name and frame size.

CSynthVideoCapture(const char* texture_name,
                   const CvSize frame_size);

The texture name is the name of the JPG file that contains the ladybug image. You can find the image in data\bug.jpg.

The frame size parameter defines how large the video frames will be. You will use 320x240 frames because that is a standard frame size for USB cameras.

Then, you see several operations:

bool IsValid();
IplImage* GrabFrame();
void SetMotionType(MotionType motion_type);

The IsValid() method is used to determine whether the texture was loaded successfully and the video sequence can be generated.

GrabFrame() is used to retrieve the current frame of the video sequence. The frame is an image in RGB format. In OpenCV, you use the IplImage structure to store an image.

And, the SetMotionType() method is used to set the current type of motion. Below are the available motion types; I already discussed them in the Running the Demo section.

// Available types of object motion.
enum MotionType
{
    STILL,          // No motion.
    TRANSLATION,    // Translations along X and Y axis.
    ROTATION,       // Rotation around object center.
    SCALE,          // Zooming.
    COMBINED,       // Translation, rotation, and zooming.
    TOO_SLOW        // Slow translation.
};
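
For example, here is a hedged usage sketch of the capture class on its own (the demo's actual main loop is more involved and also drives the tracker):

// Create the synthetic capture, switch the motion type, and grab a frame.
CSynthVideoCapture capture("data\\bug.jpg", cvSize(320, 240));
if(capture.IsValid())
{
    capture.SetMotionType(TRANSLATION);       // the ladybug "moves" along X and Y
    IplImage* pFrame = capture.GrabFrame();   // 320x240 RGB frame with the ladybug drawn
    // ... pass pFrame to the tracker or display it with cvShowImage() ...
}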

Look at what you have in the private section of the class.

The DrawObject() method is used to draw the ladybug onto the frame image. Then, you see the _obj_pose structure:

struct _obj_pose
{
    float x;
    float y;
    float angle;
    float s;
} m_object_pose;

This structure stores the current position and orientation of the ladybug. And, in the last part of the class declaration, you see these lines:

CvMat* m_M;    // Transformation matrix.
CvMat* m_X;    // Coordinates of pixel in coordinate frame of ellipse.
CvMat* m_Z;    // Coordinates of pixel in coordinate frame of image.
IplImage* m_pTexture;      // Pointer to the texture image.
IplImage* m_pFrame;        // Pointer to the frame image.
CvRect m_bounding_rect;    // Current object's bounding rectangle.

These are internally used matrices and images. In OpenCV, matrices are stored in CvMat structures, and images are stored in IplImage structures.

The m_M, m_X, and m_Z members are used for coordinate transformations. m_pTexture contains the texture of the ladybug (bug.jpg). m_pFrame contains the current frame image. m_bounding_rect contains the current bounding rectangle of the ladybug.

The source code for the CSynthVideoCapture class is located in synthvid.cpp.
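
For illustration, here is a hedged sketch of how the pose stored in m_object_pose could be turned into a 2x3 transformation matrix. It assumes the angle is stored in radians; the actual DrawObject() code in synthvid.cpp may build m_M differently:

// Build a 2x3 rotation+scale matrix around the origin, then add the translation.
CvMat* M = cvCreateMat(2, 3, CV_32FC1);
cv2DRotationMatrix(cvPoint2D32f(0.0f, 0.0f),
                   m_object_pose.angle * 180.0 / CV_PI,   // cv2DRotationMatrix expects degrees
                   m_object_pose.s, M);
cvmSet(M, 0, 2, m_object_pose.x);   // translation along X
cvmSet(M, 1, 2, m_object_pose.y);   // translation along Y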

Object Detector Stub

Another thing you have to use is an object detector, and now you will see why. An object tracker is able to track an object, but it can't find an object on the video frame. If the tracker loses the object, it can't find it again. So, you have to find the object yourself and tell the tracker: "I know for sure that the ladybug is located in this position (bounding rectangle is attached), and it looks, for example, to the west. Please track its position and orientation on the subsequent frames of the video sequence."

Because this article is focused on object tracking only, you won't implement a ladybug detector; instead, you just create its simulator and call it the CObjDetectorStub class (see synthvid.h). This class should somehow determine the position of the ladybug; that's why you give it access to the private members of the CSynthVideoCapture class.

/* Class: CObjDetectorStub.
 * Brief: This class is used to "find" the object on the first
 *        frame of the video sequence.
 * Details:
 *  In real life, the object's position is detected with some
 *  detection algorithm.
 *  The detector can find an object only in a predefined position.
 *  For example, a face detector can find a face if it is in a
 *  frontal view and positioned vertically.
 *  The detector usually can provide just the bounding rectangle of
 *  the object.
 *
 *  To make our detector stub closer to real life, we allow it
 *  to "find" the ladybug in some predefined pose only (when its
 *  head points to the west direction).
 */
class CObjDetectorStub
{
public:
    CObjDetectorStub(CSynthVideoCapture& capture):
        m_refCapture(capture) {}
    ~CObjDetectorStub() {}

    bool FindObject(CvRect& obj_rect)
    {
        // If the ladybug looks approximately to the west,
        // we can "find" it.
        // Otherwise, we return a negative status.
        if(fabs(m_refCapture.m_object_pose.angle) <= 0.2)
        {
            obj_rect = m_refCapture.m_bounding_rect;
            return true;
        }
        return false;
    }

private:
    // Reference to the capture.
    CSynthVideoCapture& m_refCapture;
};

Simple 2D Object Tracker

In this section, you will see how your 2D tracker is implemented.

As you already know, the initial position of the object is found with an object detector (you use CObjDetectorStub for that purpose). You initialize the tracker by defining the initial position of the object model. But what is that model?

There is no universal model for every object, because a model reflects object shape, material, and other important properties of the object. For your ladybug object, you take a simple elliptic model, because the ladybug's shape is similar to an ellipse. The math expression for the filled ellipse is:

(x^2)/(a^2) + (y^2)/(b^2) <= 1, where x and y are the coordinates of a pixel inside the ellipse, and a and b are the horizontal and vertical radii of the ellipse, respectively.

To see the model, run the demo again, choose the inverse compositional method (type 4), and press Y to show the confidence map (see Figure 7) and template windows. The confidence map window shows the model of the ladybug. Bright pixels belong to the ladybug model. You use a non-uniform pixel intensity: the intensity is higher toward the center of the ellipse, reflecting the probability that a given pixel belongs to the object. Pixels near the edge of the ellipse have a much smaller probability of belonging to the object. This is used in tracking, where pixels with high probability contribute more to motion estimation than others.



Figure 7: Elliptic model of ladybug.
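
For illustration, a confidence map with this property could be generated directly from the ellipse equation. This is only a minimal sketch (width and height stand for the object's bounding-box size); the weights actually stored in m_pConfMap by UpdateConfMap() may be computed differently:

// Build a confidence map from the ellipse equation: weight is maximal (255)
// at the center and falls to 0 at the edge of the ellipse.
IplImage* conf = cvCreateImage(cvSize(width, height), IPL_DEPTH_8U, 1);
cvZero(conf);
float a = width / 2.0f, b = height / 2.0f;     // horizontal and vertical radii
for(int y = 0; y < height; y++)
{
    for(int x = 0; x < width; x++)
    {
        float dx = (x - a) / a, dy = (y - b) / b;
        float r2 = dx*dx + dy*dy;              // (x^2)/(a^2) + (y^2)/(b^2)
        if(r2 <= 1.0f)
            CV_IMAGE_ELEM(conf, uchar, y, x) = (uchar)(255 * (1.0f - r2));
    }
}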

When initializing the model for the first frame, you create a confidence map image and a template image (see Figure 8). The template image is a gray-scale copy of the image in the area of the object. The template is updated for each frame: after you estimate the motion of the object between two subsequent frames, you update the image of the object. This is called a dynamic template. For comparison, a static template is created for the first frame only and is not updated for the subsequent frames.
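
A dynamic template update can be as simple as copying the object's area of the gray-scale frame over the old template. This is only an assumed helper, not the article's exact UpdateTemplate() code, and it assumes the template image has the same size as the object rectangle:

// Overwrite the old (dynamic) template with the object's area of the current frame.
void update_template(IplImage* gray_frame, CvRect obj_rect, IplImage* templ)
{
    cvSetImageROI(gray_frame, obj_rect);   // restrict the source to the object area
    cvCopy(gray_frame, templ);             // copy it into the template image
    cvResetImageROI(gray_frame);
}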

The disadvantage of a dynamic template is that it has no "memory." If motion is estimated incorrectly, the error is accumulated. For example, refer to Figure 6 (slow translation). In that type of motion, inter-frame translations are very small, and the accuracy of the alignment algorithm is not enough to estimate the motion correctly, so the error accumulates very rapidly.

Figure 8: Template image.

The template and confidence map are used for motion estimation. If lighting conditions are constant, you can assume that the appearance of the template does not change on the next frame; only its position and orientation change. By minimizing the difference between the template and the next frame, you can estimate the motion parameters.

The declaration for the CSimple2DTracker class is located in the simptrck.h file:

/* Class: CSimple2DTracker
 * Brief: This class can be used to track 2D objects using four
 *        available methods:
 *        Lucas-Kanade method, Baker-Dellaert-Matthews method,
 *        forwards compositional method, and Hager-Belhumeur method.
 */
class CSimple2DTracker
{
public:
    // Construction/destruction
    CSimple2DTracker(MethodName method, CvSize frame_size);
    ~CSimple2DTracker();

    // Operations
    bool InitModel(IplImage* pFrame, CvRect& obj_rect);
    bool TrackModel(IplImage* pFrame);
    void DrawModelPosition(IplImage* pFrame);
    IplImage* _GetImgT(){ return m_pImgT; }
    IplImage* _GetConfMap(){ return m_pConfMap; }

private:
    void UpdateTemplate(IplImage* pFrame, CvRect template_rect);
    void UpdateConfMap(CvRect& bounding_rect);

    MethodName m_Method;    // Method name.
    CvSize m_FrameSize;     // Size of input frame.

    CvMat* m_C;    // Composition of transformation matrices.
    // Warp matrix retrieved from image alignment method.
    CvMat* m_W;
    CvMat* m_Q;    // Temporarily used matrix.
    // Pixel coordinates in coordinate frame of template T.
    CvMat* m_X;
    // Pixel coordinates in coordinate frame of image I.
    CvMat* m_Z;

    // Size of object's initial bounding rectangle.
    CvSize m_ObjSize;

    // Incoming frame converted to gray-scale format.
    IplImage* m_pImgI;
    IplImage* m_pImgT;       // Template image T.
    IplImage* m_pConfMap;    // Confidence map image.

    // Current template's bounding rectangle.
    CvRect m_template_rect;

    // Pointer to currently selected image alignment method.
    CImageAlignmentMethodBase* m_pAlignMethod;
};

The class constructor takes two parameters: the name of the image alignment method and the size of the frame.

CSimple2DTracker(MethodName method, CvSize frame_size);

The image alignment method can be one of the following: forwards additive, forwards compositional, inverse additive, and inverse compositional. I will discuss those methods in the following section.

enum MethodName
{
    FORWARDS_ADDITIVE      = 1,    // Lucas-Kanade
    FORWARDS_COMPOSITIONAL = 2,    // forwards compositional
    INVERSE_ADDITIVE       = 3,    // Hager-Belhumeur
    INVERSE_COMPOSITIONAL  = 4     // Baker-Dellaert-Matthews
};

The frame size should be the size of the frame retrieved from CSynthVideoCapture.

Next, you see several operations:

// Operations
bool InitModel(IplImage* pFrame, CvRect& obj_rect);
bool TrackModel(IplImage* pFrame);
void DrawModelPosition(IplImage* pFrame);
IplImage* _GetImgT(){ return m_pImgT; }
IplImage* _GetConfMap(){ return m_pConfMap; }

The InitModel() method initializes the tracker with an initial object position found with the object detector. TrackModel() tracks the object on the current frame of the video sequence. And, DrawModelPosition() is a helper method that is used to visualize the current position of the object.

The _GetImgT() and _GetConfMap() methods are two helper methods that return pointers to the template and confidence map images, respectively.
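
Putting the pieces together, a typical detect-then-track loop could look like the hedged sketch below. It is a simplified illustration of how the classes described above fit together; the demo's real main() also handles keyboard input, FPS display, and the extra windows:

// Detect the ladybug once, then track it on every subsequent frame.
CSynthVideoCapture capture("data\\bug.jpg", cvSize(320, 240));
CObjDetectorStub detector(capture);
CSimple2DTracker tracker(INVERSE_COMPOSITIONAL, cvSize(320, 240));

cvNamedWindow("Video");
IplImage* pFrame = capture.GrabFrame();

CvRect obj_rect;
if(detector.FindObject(obj_rect))           // "find" the ladybug on the first frame
    tracker.InitModel(pFrame, obj_rect);

while((pFrame = capture.GrabFrame()) != 0)
{
    tracker.TrackModel(pFrame);             // estimate inter-frame motion
    tracker.DrawModelPosition(pFrame);      // draw the blue bounding rectangle
    cvShowImage("Video", pFrame);
    if(cvWaitKey(10) == 27) break;          // Esc exits
}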

Image Alignment Methods

The user can select one of four available image alignment methods. An image alignment method is used to estimate an object's motion parameters between two subsequent frames. Because this program uses four image alignment methods, it is a good idea to write a common base class for them. You can call this class CImageAlignmentMethodBase (see simptrck.h):

/* Class: CImageAlignmentMethodBase
 * Brief: This class defines base interface for all image
 *        alignment methods.
 */
class CImageAlignmentMethodBase
{
public:
    virtual bool AlignImage(IplImage* pImgT, IplImage* pConfMap,
                            CvRect template_rect, IplImage* pImgI,
                            CvMat* W, int step) = 0;
};

As you see, this class is abstract, and it has a declaration of the AlignImage() method that should be implemented by all derived image alignment method classes.

See what parameters it has:

  • pImgT: Pointer to the template image
  • pConfMap: Pointer to the confidence map
  • template_rect: The template's bounding rectangle
  • pImgI: Pointer to the gray-scale copy of the current frame
  • W: Resulting warp matrix
  • step: Step for template walking cycle

The image alignment method takes the template, confidence map, and current frame as input parameters. It estimates the inter-frame motion parameters and returns the warp matrix W. The matrix W transforms the template to produce the warped template, which has the same position and orientation as the object on the current frame (see Figure 9).



Figure 9: Motion estimation between two frames.
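
For illustration, once W is known, a warped template image could be produced with a single OpenCV call. This is only a sketch that assumes W is (or can be reduced to) a 2x3 affine matrix mapping template coordinates to frame coordinates; the tracker's real code may apply the warp per-pixel instead:

// Warp the template into frame coordinates using the estimated matrix W.
IplImage* warped = cvCloneImage(pImgI);    // same size and format as the frame
cvWarpAffine(pImgT, warped, W,
             CV_INTER_LINEAR + CV_WARP_FILL_OUTLIERS,
             cvScalarAll(0));              // pixels outside the template become black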

The step parameter is used to speed up image alignment. Its value is 2, which means that every second pixel (along the X and Y axes) of the template is skipped while estimating motion parameters. It is possible to take every pixel of the template into account, but tracking will be slower. It is also possible to use a step larger than 2.
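
A minimal sketch of such a template walking cycle is shown below (hypothetical variable names; it assumes the confidence map is an 8-bit single-channel image):

// Only every step-th pixel along X and Y contributes to the error/gradient sums.
for(int y = 0; y < pImgT->height; y += step)
{
    for(int x = 0; x < pImgT->width; x += step)
    {
        // Pixel weight taken from the confidence map (0 = outside the elliptic model).
        float w = CV_IMAGE_ELEM(pConfMap, uchar, y, x) / 255.0f;
        if(w == 0.0f) continue;

        // ... accumulate the weighted error and gradient terms for this pixel ...
    }
}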

In this program, you use four image alignment methods:

  • forwards additive (Lucas-Kanade) method: This method is implemented as the CForwardsAdditiveMethod class.
  • forwards compositional method: This method is implemented as the CForwardsCompositionalMethod class.
  • inverse additive (Hager-Belhumeur) method: This method is implemented as the CInverseAdditiveMethod class.
  • inverse compositional (Baker-Dellaert-Matthews) method: This method is implemented as the CInverseCompositionalMethod class.

These methods are declared in the forwadditive.h, forwcomp.h, invadditive.h, and invcomp.h header files. I won't describe these methods in detail here. The interested reader can find more information and sample source code in the Image Alignment Algorithms article.

Here, I just mention how pixel weights are taken into account when estimating motion. In Figure 7, you can see that the template pixels have non-uniform weights. Some pixels in the template can be more reliable than others. You assume that pixels near the center of the elliptic model are more reliable than the pixels near the edge, so the pixels near the center should contribute more to motion estimation than others. You can incorporate weights into the image similarity function (a weighted sum of squared differences between the warped current frame and the template):

E = sum over x of w(x) * [ I(W(x; p)) - T(x) ]^2

where w(x) is the weight of pixel x, T is the template, I is the current frame, and W(x; p) is the warp with parameters p. Derivation of image alignment algorithms with weights can be found here.

Another thing I should mention is an additional termination criterion for the image alignment algorithms. As you can see from the Image Alignment Algorithms (Part II) article, almost half of the iterations are not needed during the minimization process, because the mean error value is not reduced on them. To avoid such iterations, you add a new termination criterion. You remember the smallest mean error value reached so far and check whether the error becomes smaller on the next iteration. If the mean error doesn't become smaller for five consecutive iterations, you exit the minimization process.

/*
 * Check termination criterion #1 - long oscillations.
 */
if(iter == 1)
{
    min_err_val = mean_error;
    cvCopy(m_W, m_BestW);
}
else
{
    if(mean_error >= min_err_val)
    {
        if(bad_iter > MAX_BAD_ITER) break;
        bad_iter++;
    }
    else
    {
        bad_iter = 0;
        min_err_val = mean_error;
        cvCopy(m_W, m_BestW);
    }
}

Conclusion

In this article, you implemented a simple 2D object tracker with a dynamic template and template pixel weights. Inter-frame object motion can be estimated using one of four available image alignment algorithms: forwards additive, forwards compositional, inverse additive, or inverse compositional. The base for this application is the OpenCV library.

Downloads

  • tracker2d_src.zip