Registration and Tracking of Objects with Computer Vision for
Autonomous Vehicles
Except where reference is made to the work of others, the work described in this thesis is
my own or was done in collaboration with my advisory committee. This thesis does not
include proprietary or classified information.
Andrew Nevin
Certificate of Approval:
David M. Bevly, Co-Chair
Associate Professor
Mechanical Engineering
Thaddeus A. Roppel, Co-Chair
Associate Professor
Electrical and Computer Engineering
Thomas Denney
Professor
Electrical and Computer Engineering
Alan Scottedward Hodel, Co-Chair
Associate Professor (deceased)
Electrical and Computer Engineering
George T. Flowers
Dean
Graduate School
Registration and Tracking of Objects with Computer Vision for
Autonomous Vehicles
Andrew Nevin
A Thesis
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Master of Science
Auburn, Alabama
May 9, 2009
Registration and Tracking of Objects with Computer Vision for
Autonomous Vehicles
Andrew Nevin
Permission is granted to Auburn University to make copies of this thesis at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
Vita
Andrew Nevin, son of Ronald and Brenda Nevin, was born June 8, 1983, in Nashville,
Tennessee. He graduated from Montgomery Bell Academy in 2002. He attended Auburn
University in September 2002 and graduated cum laude with a Bachelor of Electrical Engi-
neering degree in December 2006. He enrolled in the Graduate School, Auburn University,
in January 2006, working as a graduate teaching assistant for Electrical Engineering Lab
II and a graduate research assistant. He graduated with a Master of Science in Electrical
Engineering degree in May 2009.
Thesis Abstract
Registration and Tracking of Objects with Computer Vision for
Autonomous Vehicles
Andrew Nevin
Master of Science, May 9, 2009
(B.S., Auburn University, 2006)
70 Typed Pages
Directed by Scotte Hodel and David Bevly
For this research, an autonomous leader-follower vehicle caravan is being designed us-
ing computer vision. The focus of this research is to investigate computer vision methods of
identifying objects detected by the leader vehicle and determining the follower vehicle's position relative to the objects. Many object registration and tracking methods are presented in this thesis, such as Hough transforms, optical flow, and normalized cross correlation. A novel statistical-gradient method is presented to combine feature-based methods and area-based methods for object registration and tracking. These methods are evaluated against both synthetic and captured video data. This registration method is shown to be effective with both synthetic data and acquired data.
Acknowledgments
The author would like to thank his advisors Dr. Alan Scottedward Hodel and Dr.
David Bevly, as well as his committee members Dr. Thaddeus Roppel and Dr. Thomas
Denney for their guidance and patience through this research. Additionally, he would like
to thank Dr. Stanley Reeves for instruction in Digital Image Processing and Advanced
Digital Image Processing. Without the course instruction, advice, and motivation given to
the author, this research would not have been possible.
The author wishes to express his appreciation for his grandparents Robert and Louise
Howell, his parents Ronald and Brenda Nevin, his brother Ryan, and two sisters Ashley
and Caroline for all of their continuous love, support, and encouragement throughout his
entire educational career.
Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.
Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.
Table of Contents

List of Figures

1 Introduction
1.1 Motivation
1.2 Previous Work
1.3 Approach

2 Digital Image Processing Background
2.1 Processing Pixels
2.2 Implementation of filters and edge detection
2.3 Image Filters
2.4 Multi-resolution processing
2.5 Edge Detection Methods
2.6 Hough Transform
2.7 Object Registration
2.7.1 Definition
2.7.2 Classifications
2.7.3 Methods of Object Registration
2.7.4 Normalized Cross Correlation
2.8 Optical Flow

3 Experiments and Results
3.1 Synthetic Data
3.2 Acquired Data
3.2.1 Edge Detection
3.2.2 Hough Transform
3.2.3 Normalized Cross Correlation
3.3 Comments

4 Statistical-gradient Object Registration
4.0.1 Area-Based
4.0.2 Feature-Based
4.0.3 Code Outline
4.0.4 Validation using Synthetic Images
4.0.5 Validating on Acquired Images
4.0.6 Motion Prediction
4.0.7 Stop Sign Registration
4.0.8 Multiple Green Signs Registration

5 Conclusions

6 Future work

Bibliography

Appendices
A Nomenclature
B Orientation Results of Red Block
C Orientation Results of The Blue Block
D Tolerance Values for Statistical Analysis
List of Figures

2.1 Gaussian noise is present in the image in a). A 5 x 5 pixel mask is used as the low-pass filter. The result is in b). Notice the blurring and removal of some of the noise as well as the letters in the image. A 2 x 2 pixel mask is used for the median filter. The result is in c). All of the noise has been removed, but there is a little deformation of the letters.
2.2 Sobel edge detection
2.3 Points on the Cartesian coordinate system
2.4 Points from the Cartesian system in Fig 2.3 transformed into parameter space
2.5 The Hough transform from the image in a) is shown as an image in b). The lines are extracted with the inverse Hough transform and plotted on top of the original image in c).
2.6 Normalized cross correlation output
2.7 Barber pole
2.8 All three partial derivatives at the center are calculated from the average of the first four differences of adjacent pixels. The column index is denoted as j, the row index is denoted as i, and the time index is denoted as k.
2.9 Optical flow calculated between the two input images. The block in the top right corner is moving diagonally down from left to right, and the block in the bottom right corner is moving from right to left.
3.1 Synthetic image for object registration
3.2 (a) The magnitude of the x and y gradients using the Sobel kernels. (b) Lines extracted from the Hough transform are plotted on top of the grayscale input image.
3.3 This is a surface plot of the magnitude of the normalized correlation coefficients. The red block is on the left, and the red straight line is on the right. Notice that the maximum coefficient is in the middle of the actual red block.
3.4 The output from the digital camera. Included in the image are the four blocks to be detected. There is a black background on the rink, there is a blue and white surface, and there are some pebbles.
3.5 Edge detection on the acquired data
3.6 Hough transform on the acquired data
3.7 The output is a surface plot of the normalized cross correlation coefficients. The red surface indicates a high correlation coefficient and blue represents a low correlation coefficient.
4.1 The plot on the left contains 10 randomly generated line segments. The plot on the right contains the line segments that are perpendicular to each other.
4.2 Synthetic image for object registration
4.3 The output from the algorithm used on the input from Fig 4.2
4.4 The output from the line segment algorithm on the lines extracted from the Hough transform on the output from the object identification method
4.5 Four outputs of the registered red block
4.6 The center of the registered red block in ten selected frames from the image sequence
4.7 Four outputs of the registered blue block
4.8 Plots of the X estimation, Y estimation, and Z estimation of the red block
4.9 Plots of the X estimation, Y estimation, and Z estimation of the blue block
4.10 Four outputs from the object orientation method
4.11 A frame of the video sequence of the green signs
4.12 Four outputs from the video sequence of the green signs
Chapter 1
Introduction
1.1 Motivation
The desire and necessity to build autonomous vehicles is increasing every year. Au-
tonomous vehicles can be designed to replace humans on jobs that are dangerous, time-
consuming, or difficult. Recent events have shown the value of using highly automated technology on the battlefield. Autonomous aircraft, terrestrial, and marine vehicles have accomplished important missions while reducing risk to personnel [6]. They can be used in rescue operations, such as collapsed mines or building fires, where it is dangerous to send workers. Other tasks include avoiding obstacles, keeping a safe distance behind other vehicles, regulation of speed according to traffic conditions and road characteristics, and
driving and parking within urban environments [1]. Autonomous vehicles can be sent along
a dangerous path to send supplies to soldiers in need without risking injuries or fatalities
to others. Research in vehicle following has attracted the attention of several researchers
in the past decades, particularly in the USA and Europe, where safety, energy consumption, and traffic congestion are the primary motivators [2], [3]. In a recent study by the Virginia Department of Motor Vehicles and Virginia Commonwealth University, researchers conclude that between 25 and 50 percent of all motor vehicle crashes in the U.S. are caused by driver distraction. Rubbernecking (slowing down to look at another accident) accounts for 16 percent of accidents caused by distractions. Driver fatigue (12 percent), looking at scenery (10 percent), other passengers or children (9 percent), adjusting the radio (7 percent), and reading maps or other documents (2 percent) account for the majority of the remaining distraction-related crashes. Cell-phone use
is another distraction that is increasing the number of accidents. Driver fatigue accounts
for 100,000 crashes each year. Aggressive driving is another frequent cause of accidents. It
is evident that there is a need to reduce the number of accidents each year. Creating autonomous
or semi-autonomous vehicles is an important step toward reducing accidents.
The first step to creating an autonomous vehicle is self-localization. Self-localization is the process by which an autonomous vehicle calculates its position relative to the environment. GPS is typically used to estimate the vehicle's current position; however, in indoor environments, dense trees, or urban areas, the line of sight between the satellites and the vehicle can be blocked. Other sensors, such as an IMU, LIDAR, and vision, can be used to estimate the position of the vehicle when GPS is not available.
A desire to create an autonomous leader-follower caravan for military convoys is grow-
ing. In [4], the authors discuss the current state of research in these convoys. The leader-
follower capability allows one man-driven vehicle to be followed by one or more unmanned
vehicles in a convoy-like operation. The man-driven lead vehicle establishes the route for
the unmanned follower vehicles by sending the waypoints or GPS breadcrumb coordinates
to the followers. Additionally, the follower vehicles are instructed to trail the leader at a
specified distance [4]. The motivation behind the leader-follower convoy is to reduce the
risk of soldiers in dangerous situations.
For this research, an autonomous leader-follower vehicle caravan is being designed
using computer vision. As the leader vehicle navigates through a landscape, it will need to
recognize and identify features it may "see", which include obstacles that it may have to detour around. The leader vehicle identifies these features using digital image processing
techniques such as gradient detection, Hough transforms, and local averages and variances
of pixel values. In this research it is assumed that the leader estimates the orientation
and distance of an object, and transmits the data to the follower vehicle. The follower
vehicle searches the landscape for the object identi ed by the leader vehicle, and calculates
the orientation and distance from the object. The follower will then attempt to correct its position based on the leader vehicle's data. The focus of this research is to investigate computer vision methods of identifying objects detected by the leader vehicle and determining the follower vehicle's position relative to the objects.
1.2 Previous Work
In [5], the authors use a leader vehicle, which is not equipped with GPS or inter-
vehicle communication networks, and a follower vehicle equipped with an on-board laser
scanner and odometry sensors. The leader position is estimated using the measurement
data from the laser scanner. In [6], the authors present a leader-follower design for multiple
autonomous underwater vehicles operating in formation to perform tedious and dangerous
tasks, such as minesweeping. The leader vehicle broadcasts its position to the follower vehi-
cles through underwater acoustic communication. The follower vehicles move into formation
relative to the leader to scan for mines. This work requires that the leader vehicle be close
to the follower vehicles. In [7], the authors use cameras for an autonomous vehicle to follow
an astronaut in outer space. They tag the astronaut with a green color blob. A green color
lter is used to convert the data from the camera on the follower to a binary image. The
orientation and size of the detected color blob is used to control the distance between the
astronaut and the follower. This follower robot helps the astronaut carry moon rocks and
other heavy items that would otherwise burden the astronaut. However, in many operations it may be impractical to attach a highly visible signal to the leader vehicle. Additionally, in some leader-follower operations, the follower may be far enough back that the lead vehicle is not visible.
In [8], the authors present the design of a different algorithm using visual sensors. The follower vehicle is equipped with only a camera. The tracking objective requires information related to the leader's motion, which is extracted from the video data from the follower
vehicle. A reference image is captured and stored in the memory of the leader vehicle while
it is in the desired position and orientation relative to the follower. The control algorithm
attempts to match this reference image as closely as possible. For this research, a leader-
follower caravan is designed that does not require the follower to be close to the leader.
For example, if the leader vehicle travels down a path and somehow determines this is the
optimal path for other vehicles, then the leader vehicle will transmit data describing the
path to the follower vehicle.
Many di erent approaches have been implemented for identifying and tracking objects
through vision sensors. The most common algorithms include normalized cross correlation,
optical flow, edge detection, or the Hough transform. In [17], the authors use a fast normalized cross correlation algorithm for self-localization of a robot. In [9], the authors use an optical flow algorithm that can be used for road navigation. However, after the optical flow is calculated, no other information is extracted for navigation of a following vehicle.
In [19], the authors use an area-based algorithm to statistically measure a small window
of points between two images for registering Synthetic Aperture Radar (SAR) images. A
feature-based method, the Hough transform, can be used to detect the shapes of objects.
In [11], the authors use the Hough transform to identify differently shaped objects. In [12],
the authors use the Hough transform to detect ellipses.
Simultaneous Localization and Mapping (SLAM) is another active area of research. In
SLAM, it is important to continuously localize and estimate 3D positions of new landmarks
in an unknown environment. In [14], a high speed CMOS camera is used to implement
SLAM in real-time on a robotic arm.
1.3 Approach
First, the algorithms for the follower vehicle will be developed. This development in-
cludes computer vision algorithms. This research will start with the underlying assumptions
that the leader vehicle has previously passed down a path, has identi ed good features to
track through vision sensors, and has transmitted the data to the follower vehicle. The fol-
lower vehicle will then try to locate these specific objects with vision sensors and navigate
along the same path as the leader vehicle. The follower vehicle could travel down this path
in seconds, hours, or days after the leader vehicle has traversed it. Before experimenting
with images from the landscape and the actual vehicle, a smaller-scale experiment is performed.
A digital camera is placed on a cart and pushed across a rink. There are three surfaces:
a smooth blue one, a white sandy one, and pebbles. There are also four blocks scattered
around the surfaces: a red block, a blue block, a black block, and a tan-colored block.
It is assumed that the leader vehicle detected and located the red block as a good feature
to track. After demonstrating successful object registration, the algorithms are applied to
video data taken from a vehicle on the road. The video data obtained includes stop signs
and green signs periodically placed 200 feet apart. In this thesis, different methods of object registration in a sequence of images using digital image processing techniques are reviewed, and a proposal for a fusion of two methods to optimize the identification of objects for tracking is tested.
Chapter 2
Digital Image Processing Background
An image may be defined as a two-dimensional function, f(x, y), where x and y are spatial (plane) coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity or gray level of the image at that point [15]. Digital image processing is the use of computer algorithms to perform image processing on digital images. The purpose of computer vision is to recognize specific events and perform actions from visual inputs without relying on a human observer. This chapter provides the background on basic digital image processing algorithms, such as the implementation of filters, multi-resolution processing, edge detection, the Hough transform, normalized cross correlation, and optical flow.
2.1 Processing Pixels
Any given pixel at coordinates (x, y) has four horizontal and vertical neighbors described by the following notation:

(x+1, y), (x-1, y), (x, y+1), (x, y-1)    (2.1)

There are also four diagonal pixel neighbors defined as:

(x+1, y+1), (x+1, y-1), (x-1, y+1), (x-1, y-1)    (2.2)
These equations for a pixel only describe binary or grayscale images. It is a little more computationally expensive to process colored images. A colored image is composed of three channels: red, green, and blue. The value of a pixel can be represented as a vector:

p(x,y) = \begin{bmatrix} R(x,y) \\ G(x,y) \\ B(x,y) \end{bmatrix}    (2.3)
For example, a color image can have dimensions 256 x 256 x 3 pixels in a memory array in a computer, but only 256 x 256 pixels for a binary or grayscale image. To convert a colored image to grayscale, the magnitude of the vector in (2.3) is computed:

p_{grayscale} = \sqrt{R(x,y)^2 + G(x,y)^2 + B(x,y)^2}    (2.4)
Transforming a colored image into grayscale has many advantages. Eliminating two-thirds of the pixel data allows for faster processing of certain operations, but in some cases it is desirable to use all of the color information obtained from a scene. Basic arithmetic operations can be performed on pixels and their neighboring pixels, such as expected values, edge detection, and low-pass or median filtering.
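As a concrete illustration of equation (2.4), the following short sketch (written in Python with NumPy purely for illustration; the thesis does not specify an implementation language) converts an RGB image to grayscale by taking the per-pixel vector magnitude.

import numpy as np

def rgb_to_grayscale(rgb):
    """Convert an M x N x 3 RGB array to grayscale using the per-pixel
    vector magnitude of equation (2.4)."""
    rgb = rgb.astype(np.float64)
    # sqrt(R^2 + G^2 + B^2) computed for every pixel
    return np.sqrt(np.sum(rgb ** 2, axis=2))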
2.2 Implementation of filters and edge detection
More advanced digital image processing techniques can also be implemented. Image filters can be used for several purposes: color extraction, noise subtraction, or smoothing an image. Edge detection can be used for image restoration, calculating optical flow, or shape detection. Both operations can be implemented in the spatial domain or the frequency domain. This section provides the background on implementing filtering and edge detection operations using two-dimensional convolution in both domains.
Two-dimensional convolution between an image f(x, y) and a template h(x, y) is defined as:

f(x,y) * h(x,y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m,n)\, h(x-m,\, y-n)    (2.5)
This operation can also be performed in the frequency domain. The two-dimensional Fourier transform of an image f(x, y) is given by:

F(u,v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi (ux/M + vy/N)}    (2.6)

The inverse Fourier transform of F(u, v) is given by:

f(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u,v)\, e^{j 2\pi (ux/M + vy/N)}    (2.7)

Two-dimensional convolution can then be implemented by multiplying the Fourier transforms of the image f(x, y) and the template h(x, y) and taking the inverse Fourier transform of the product, which results in an output image g(x, y):

g(x,y) = \mathcal{F}^{-1}[H(u,v)\, F(u,v)]    (2.8)

Computationally, it is easier and faster to implement two-dimensional convolution in the frequency domain rather than in the spatial domain.
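As an illustration of equation (2.8), the sketch below (again Python with NumPy, assumed for illustration only) filters an image by zero-padding the template, multiplying the Fourier transforms element-wise, and inverting the product.

import numpy as np

def convolve_frequency_domain(f, h):
    """Two-dimensional convolution via the frequency domain, per eq. (2.8)."""
    M, N = f.shape
    F = np.fft.fft2(f)                 # Fourier transform of the image
    H = np.fft.fft2(h, s=(M, N))       # Fourier transform of the zero-padded template
    g = np.fft.ifft2(H * F)            # element-wise product, then inverse transform
    return np.real(g)                  # imaginary part is numerical noise for real inputs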
2.3 Image Filters
In dealing with acquired data from a camera, the majority of images are corrupted by additive noise. Noise appears as high-frequency content in an image. There are two common methods for removing noise: low-pass or median filtering. A low-pass filter uses a mask with dimensions M x N. When a low-pass filter is applied to an image, each pixel value p is replaced by the average of the M x N surrounding pixels. A 3 x 3 low-pass filter mask is:

p(x,y) = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}    (2.9)
Figure 2.1: Gaussian noise is present in the image in (a). A 5 x 5 pixel mask is used as the low-pass filter; the result is in (b). Notice the blurring and the removal of some of the noise as well as the letters in the image. A 2 x 2 pixel mask is used for the median filter; the result is in (c). All of the noise has been removed, but there is a little deformation of the letters.
The resulting image is smoother and has fewer sharp transitions in intensity. However, the filter has a blurring effect on the image once it has been applied. The drawback of using a low-pass filter is that other important high-frequency content, i.e. edges, is blurred as well. The larger the filter mask, the more noise will be smoothed out, but the blurring of edges will increase as well. A median filter is another method to remove noise from an image without blurring edges. The median of a set of values is the number separating the higher half of a sample from the lower half. A median filter has M x N dimensions, and the pixel value assigned is the median value of the surrounding pixels within the M x N neighborhood. This filter is very useful for removing sharp impulse noise, but does not remove or blur edges. In Fig 2.1, there is a noisy image in (a), a result from the low-pass averaging filter in (b), and a result from the median filter in (c). The low-pass averaging filter does reduce the influence of noise, but blurs the entire image. The median filter removes all noise, but some of the edges around the letters are removed as well.
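A minimal sketch of the two filters is shown below (Python with NumPy and SciPy, used here only as an assumed illustration; the mask sizes are arbitrary choices, not the exact values used in Fig 2.1).

import numpy as np
from scipy.ndimage import uniform_filter, median_filter

def denoise(img):
    """Apply a low-pass (averaging) filter and a median filter to a grayscale image."""
    low_passed = uniform_filter(img.astype(float), size=5)  # 5 x 5 averaging mask
    median_filtered = median_filter(img, size=3)            # 3 x 3 median mask
    return low_passed, median_filtered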
2.4 Multi-resolution processing
A common problem in computer vision is identifying objects that are small relative to
the surroundings. Large distance from the observer to the object is the main cause of this.
In these cases, it can be difficult to extract valuable information, such as the gradients. For example, if a vehicle is approaching a stop sign, it is desirable to recognize the sign as soon as possible. From far away, the stop sign looks like a red blob; as the distance between the vehicle and the sign decreases, the size of the stop sign in the image increases, which makes it easier to calculate the gradients. The gradients are important in this case because the shape of the stop sign is unique, which serves as a method to identify it. Multi-resolution processing takes an image f(x, y) as an input, performs interpolation on it, and outputs an image g(x, y). Interpolation is the process of using known data to estimate values at unknown locations. The three most commonly used methods of interpolation are nearest-neighbor, bilinear, and bicubic interpolation.
Nearest-neighbor interpolation is computationally easy to implement compared with the other methods. The intensity value to be estimated at p(x, y) takes on the value of the intensity of its nearest neighbor.

Bilinear interpolation uses the four nearest neighbors of p(x, y) to estimate the intensity value at location (x, y) on the new resolution grid. It is described by:

v(x,y) = ax + by + cxy + d    (2.10)

where the four coefficients are determined from four equations (one for each neighboring pixel) with four unknowns.
The intensity value at location (x, y) using bicubic interpolation is determined by:

v(x,y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij}\, x^{i} y^{j}    (2.11)

where the sixteen coefficients are solved for from the sixteen equations with sixteen unknowns obtained from the nearest sixteen neighbors. This method is the most computationally expensive of all of the interpolation methods.
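To illustrate equation (2.10), the sketch below (Python with NumPy, an assumption for illustration) solves the four-equation system formed by the four nearest neighbors and evaluates the bilinear model at a non-integer location.

import numpy as np

def bilinear_interpolate(img, x, y):
    """Estimate the intensity at a non-integer location (x, y) from the four
    nearest neighbors, using the model v(x, y) = a x + b y + c x y + d of eq. (2.10)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    # One equation per neighboring pixel: [x, y, x*y, 1] . [a, b, c, d] = v
    A = np.array([[x0, y0, x0 * y0, 1],
                  [x1, y0, x1 * y0, 1],
                  [x0, y1, x0 * y1, 1],
                  [x1, y1, x1 * y1, 1]], dtype=float)
    v = np.array([img[y0, x0], img[y0, x1], img[y1, x0], img[y1, x1]], dtype=float)
    a, b, c, d = np.linalg.solve(A, v)
    return a * x + b * y + c * x * y + d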
2.5 Edge Detection Methods
Edges in physical objects often result in large local gradients in an image. Edge detection is the process of extracting the points where there is a sharp change, or gradient, in image brightness. The first-order gradient of a digital image f(x, y) is defined as the vector:

\nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix}    (2.12)
The magnitude of the gradient is defined as:

\mathrm{mag}(\nabla f) = \sqrt{G_x^2 + G_y^2}    (2.13)
The edge detection result is the magnitude of the gradients in the x and y directions of an image. Detecting edges is useful for feature extraction or object recognition. There are many methods to detect edges: the Sobel, Canny, Prewitt, and Roberts methods, as discussed in [10]. The main difference between these methods is the assigned weights of the kernels used to compute the gradients.
The Sobel operator calculates the gradient of an image intensity at every point. The
Sobel kernels used to calculate the gradients in the x and y direction are:
\nabla_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}    (2.14)

\nabla_y = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}    (2.15)
If there is no intensity change over a 3 x 3 pixel area, the convolution with the kernel sums to 0, which indicates that there is no change in intensity over that particular area of the image. Performing two-dimensional convolution with the kernel in (2.14) and an input image yields the gradient in the x-direction. Likewise, performing two-dimensional convolution with the kernel in (2.15) and an input image yields the gradient in the y-direction. The Prewitt kernels are
\nabla_x = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}    (2.16)

\nabla_y = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}    (2.17)
The weight value 2 on the Sobel kernels results in a smoother output. A Sobel edge detection
result is displayed in Fig 2.2. The input is a synthetic image of rectangles. Gx and Gy are
shown as images, as well as the magnitude of the gradients.
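The sketch below (Python with NumPy and SciPy, assumed for illustration) applies the Sobel kernels of (2.14) and (2.15) by spatial convolution and forms the gradient magnitude of equation (2.13).

import numpy as np
from scipy.ndimage import convolve

def sobel_gradient_magnitude(img):
    """Return the Sobel gradient magnitude of a grayscale image, eq. (2.13)."""
    kx = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # eq. (2.14)
    ky = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # eq. (2.15)
    gx = convolve(img.astype(float), kx)
    gy = convolve(img.astype(float), ky)
    return np.sqrt(gx ** 2 + gy ** 2)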
Figure 2.2: Sobel edge detection: (a) input image, (b) horizontal edges, (c) vertical edges, (d) magnitude of gradients.
2.6 Hough Transform
Background
The Hough Transform is used to detect straight lines in an image. Consider the slope-
intercept form y = mx + b, where m is the slope, and b is the y-intercept. It can describe
a straight line between a pair of points in Cartesian space. Computers have a problem calculating m for vertical lines, whose slope does not exist (division by zero). Instead, a point is described in parameter space by the variables ρ and θ, where ρ is the distance between the line and the origin, and θ is the angle from the origin to the closest point on the line. This method for describing a line in parameter space is known as the Hough transform. The equation for a straight line can be written as:

y = -\frac{\cos\theta}{\sin\theta}\, x + \frac{\rho}{\sin\theta}    (2.18)
which is simplified to:

\rho = x\cos\theta + y\sin\theta    (2.19)
The parameter space of the image is divided into accumulator cells. The accumulator value A(i, j) in a cell represents the number of points in the Cartesian coordinate system that lie on the corresponding straight line. As the number of accumulator cells increases, so does the accuracy of the collinearity of these points in the Cartesian coordinate system. Below is the procedure for implementing the Hough transform:

1. Compute the gradients of the image and choose a threshold to create a binary image.

2. Create sub-divisions for the accumulator cells in the parameter space.

3. Evaluate the value of each accumulator cell. A higher value indicates a strong edge in the Cartesian coordinate system.
Several points are plotted on the Cartesian coordinate system in Fig 2.3; a line connects three of the points on the plot. The same points are represented in parameter space via the Hough transform in Fig 2.4. Each point from the Cartesian plot is represented as a cosine wave in parameter space. Wherever one cosine wave intersects another, the intersection indicates a line in Cartesian space; the more cosine-wave intersections at one location, the stronger the line in Cartesian space, or the edge in the image. An example of implementing the Hough transform on an image is in Fig 2.5.
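To make the accumulator procedure concrete, here is a minimal sketch (Python with NumPy, assumed for illustration; the cell counts are arbitrary) that votes every nonzero pixel of a binary edge image into a (ρ, θ) accumulator using equation (2.19).

import numpy as np

def hough_accumulator(binary_img, n_theta=180, n_rho=400):
    """Accumulate votes in (rho, theta) space for every nonzero pixel, per eq. (2.19)."""
    rows, cols = binary_img.shape
    rho_max = np.hypot(rows, cols)
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta)
    accumulator = np.zeros((n_rho, n_theta), dtype=int)
    ys, xs = np.nonzero(binary_img)                       # edge pixel coordinates
    for x, y in zip(xs, ys):
        for t_idx, theta in enumerate(thetas):
            rho = x * np.cos(theta) + y * np.sin(theta)   # eq. (2.19)
            r_idx = int(round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)))
            accumulator[r_idx, t_idx] += 1
    return accumulator, thetas

Cells with high counts correspond to strong lines and can then be mapped back to line segments in the image.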
2.7 Object Registration
2.7.1 Definition
Sets of data, acquired by sampling the same scene or object at different times or from different perspectives, will be in different coordinate systems. Object registration is the process of transforming the different sets of data into one coordinate system. The original image is known as the reference image. The image to be mapped is known as the target
Figure 2.3: Points on the Cartesian coordinate system
Figure 2.4: Points from the Cartesian system in Fig 2.3 transformed into parameter space.
Figure 2.5: (a) Input image. (b) The Hough transform from the image in (a) shown as an image. (c) The lines extracted with the inverse Hough transform plotted on top of the original image.
image. There are four main groups into which image registration is divided: different viewpoints, different times, different sensors, and scene-to-model registration. Different viewpoints (multiview analysis) is the case in which images of the same scene are acquired from different viewpoints; the goal is to gain a larger 2D or 3D representation of the scene. Different times (temporal analysis) is the case in which images of the same scene are acquired at different times. Different sensors (multimodal analysis) is the case in which images of the same scene are acquired by different sensors; the goal is to integrate the information obtained from the different sensors to gain a detailed scene representation. Scene-to-model registration compares the images of a scene to a computer model of the same scene. The model can be a computer representation of the scene (for example, digital elevation maps). The goal is to localize the acquired image in the scene/model and to compare them.
2.7.2 Classifications

Area-based algorithms look at the structure of the image with correlation metrics, Fourier properties, and other means of structural analysis. Instead of looking at the overall structure of images, feature-based algorithms concentrate on image features such as lines, curves, points, line intersections, and boundaries. Standard edge detection methods such as the Canny or the Laplacian detector can be implemented to extract these features.
2.7.3 Methods of Object Registration
Object registration can be implemented in either the spatial or the frequency domain. Traditionally, image registration is applied to a set of images of the same scene taken at different angles, and the transformation from one image to the next is estimated. In this research, however, the camera is moving, which introduces more complexities: the objects change size, color intensities can be brighter or darker due to shadows, and additive noise can disrupt image registration.
Spatial-domain
Image registration in the spatial domain chooses matching sets of control points between images. Features, structures, and textures are the matching criteria. After the control points have been selected, the transformation from one to the other is estimated. Consider the four pairs of control points to be (v, w) and (x, y) for the input image and the reference image respectively. The bilinear approximation is given by:

x = c_1 v + c_2 w + c_3 v w + c_4    (2.20)

y = c_5 v + c_6 w + c_7 v w + c_8    (2.21)

The coefficients describe the transformation from the reference image to the input image.

In practice, obtaining a reference image with only stationary elements is not always possible, and building a reference from a set of images containing one or more moving objects becomes necessary. This applies particularly to situations describing busy scenes or in cases where frequent updating is required [15].
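As an illustration of equations (2.20) and (2.21), the sketch below (Python with NumPy, an illustrative assumption) solves for the eight coefficients from four control-point pairs.

import numpy as np

def solve_bilinear_transform(src_pts, dst_pts):
    """Solve eqs. (2.20)-(2.21) for c1..c8 given four control-point pairs.
    src_pts holds (v, w) points; dst_pts holds the corresponding (x, y) points."""
    A = np.array([[v, w, v * w, 1] for v, w in src_pts], dtype=float)
    xs = np.array([x for x, _ in dst_pts], dtype=float)
    ys = np.array([y for _, y in dst_pts], dtype=float)
    c1_to_4 = np.linalg.solve(A, xs)   # maps (v, w) to x
    c5_to_8 = np.linalg.solve(A, ys)   # maps (v, w) to y
    return np.concatenate([c1_to_4, c5_to_8])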
Frequency
Some algorithms use properties in the frequency domain to directly determine shifts between two images. The phase correlation method is applied to a pair of images and produces a third image with a single peak; the location of this peak corresponds to the relative translation between the two images. The phase correlation algorithm is described by the following steps:

1. Compute the 2D Fourier transform of each image.

2. Calculate the cross-power spectrum by taking the complex conjugate of the second image's Fourier transform, multiplying the Fourier transforms together element-wise, and normalizing the product element-wise.

3. Obtain the normalized cross correlation by applying the inverse Fourier transform.

4. Determine the location of the peak.
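The four steps above can be sketched as follows (Python with NumPy, assumed for illustration); the small constant guards against division by zero in the normalization.

import numpy as np

def phase_correlation(img1, img2):
    """Estimate the (row, col) translation between two equally sized images
    using the phase correlation steps listed above."""
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    cross_power = F1 * np.conj(F2)                     # conjugate of the second transform
    cross_power /= (np.abs(cross_power) + 1e-12)       # element-wise normalization
    correlation = np.real(np.fft.ifft2(cross_power))   # inverse transform yields a peak image
    return np.unravel_index(np.argmax(correlation), correlation.shape)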
2.7.4 Normalized Cross Correlation
Image cross correlation is defined as:

\gamma(x,y) = \frac{\sum_{s,t} [f(s,t) - \bar{f}(s,t)]\,[h(x+s,\,y+t) - \bar{h}]}{\left\{ \sum_{s,t} [f(s,t) - \bar{f}(s,t)]^2 \, \sum_{s,t} [h(x+s,\,y+t) - \bar{h}]^2 \right\}^{1/2}}    (2.22)
where h (the template) is the image of the object to be located within f. The primary reason to use image correlation is for matching: the cross correlation coefficient is maximum where the template and image are co-located, i.e., where the template is displaced on top of the matching region. A generic output for normalized cross correlation is shown in Fig 2.6. The sharp peak represents the maximum cross correlation coefficient (see [15] for details). A disadvantage of using image correlation is that there is a high probability that a window containing a smooth area without any prominent details will be matched incorrectly with other smooth areas in the reference image due to its non-saliency [10]. It is clear that normalized cross correlation is not the ideal approach to feature tracking, since it is not invariant with respect to imaging scale, rotation, and perspective distortions [16].
Figure 2.6: Normalized cross correlation output
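A direct, unoptimized sketch of equation (2.22) is given below (Python with NumPy, assumed for illustration); in practice, FFT-based implementations are normally used for speed.

import numpy as np

def normalized_cross_correlation(f, h):
    """Slide template h over image f and return the coefficient of eq. (2.22)
    at every valid offset."""
    f = f.astype(float)
    h = h.astype(float)
    th, tw = h.shape
    h_zero = h - h.mean()
    out = np.zeros((f.shape[0] - th + 1, f.shape[1] - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = f[y:y + th, x:x + tw]
            w_zero = window - window.mean()
            denom = np.sqrt(np.sum(w_zero ** 2) * np.sum(h_zero ** 2))
            out[y, x] = np.sum(w_zero * h_zero) / denom if denom > 0 else 0.0
    return out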
2.8 Optical Flow
Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. By calculating optical flow in a sequence of images, one can detect the motion of objects or people. This detection could be very useful in many situations, such as automatic cruise control in a vehicle, tracking the slow motion of geographical structures (i.e. glaciers), and other situations where one wants to track moving objects. There are many methods to calculate optical flow, such as the Horn-Schunck method or the Lucas-Kanade method. The main difference between these two methods is the constraint added to solve the problem of one equation with two unknowns. For this research, the Horn-Schunck method is arbitrarily chosen to be studied.
In an ideal environment, a point on any object in motion does not change intensity over time. Therefore the derivative of intensity with respect to time must be zero:

\frac{dE}{dt} = 0    (2.23)

The intensity of a pixel is defined as E(x, y, t), where x and y are the position coordinates and t is time. Expanding the total derivative of E(x, y, t) yields the equation:

\frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0    (2.24)
To simplify the notation, denote the partial derivatives of intensity with respect to x, y, and t as E_x, E_y, and E_t respectively, and replace dx/dt with u and dy/dt with v. The variable u denotes the velocity in the x-direction, and the variable v denotes the velocity in the y-direction. The resulting equation is shown below.

E_x u + E_y v + E_t = 0    (2.25)

In vector form:

\begin{bmatrix} E_x & E_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -E_t    (2.26)
The component of movement in the direction of the brightness gradient is:

-\frac{E_t}{\sqrt{E_x^2 + E_y^2}}    (2.27)

This constraint is known as the aperture effect. The velocity components computed lie in the direction of the brightness gradient. This effect is evident on a barber pole as in Fig 2.7. The pole is actually rotating from left to right, but the gradients appear to move up. This effect leads to inaccurate estimates of the direction of the velocity of the object.
Figure 2.7: Barber pole
There is a small problem with equation (2.26): E_x, E_y, and E_t can be estimated, but u and v are unknown and must be calculated. Therefore, there is one equation and two unknowns, and an additional constraint is needed to solve it.

Horn and Schunck introduce a smoothness constraint, which is the assumption that if an object is in motion, the neighboring pixels are moving with similar velocities, rather than every point on the brightness pattern moving independently. For example, if an image sequence contains a moving blue ball, then it is assumed that all of the pixels that comprise the ball are moving in the same direction. Horn and Schunck enforce the smoothness constraint by minimizing the square of the gradient magnitudes of the optical flow velocity:

\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 \quad \text{and} \quad \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2    (2.28)
The first step in calculating optical flow is estimating the partial derivatives E_x, E_y, and E_t. Each estimate is the average of four first differences taken over adjacent pixels. The partial derivatives are estimated by:

E_x \approx \frac{1}{4}(E_{i,j+1,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i+1,j,k} + E_{i,j+1,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i+1,j,k+1})    (2.29)

E_y \approx \frac{1}{4}(E_{i+1,j,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i,j+1,k} + E_{i+1,j,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i,j+1,k+1})    (2.30)

E_t \approx \frac{1}{4}(E_{i,j,k+1} - E_{i,j,k} + E_{i+1,j,k+1} - E_{i+1,j,k} + E_{i,j+1,k+1} - E_{i,j+1,k} + E_{i+1,j+1,k+1} - E_{i+1,j+1,k})    (2.31)
The box in Fig 2.8 is a visualization of the pixels used to estimate the partial derivatives.
Figure 2.8: All three partial derivatives at the center are calculated from the average of the first four differences of adjacent pixels. The column index is denoted as j, the row index is denoted as i, and the time index is denoted as k.
The Laplacians of the flow velocities u and v need to be estimated as well. The estimates can be written as

\nabla^2 u \approx \kappa(\bar{u}_{i,j,k} - u_{i,j,k}) \quad \text{and} \quad \nabla^2 v \approx \kappa(\bar{v}_{i,j,k} - v_{i,j,k})    (2.32)
where the local averages of u and v are:

\bar{u}_{i,j,k} = \frac{1}{6}[u_{i-1,j,k} + u_{i,j+1,k} + u_{i+1,j,k} + u_{i,j-1,k}] + \frac{1}{12}[u_{i-1,j-1,k} + u_{i-1,j+1,k} + u_{i+1,j+1,k} + u_{i+1,j-1,k}]    (2.33)

\bar{v}_{i,j,k} = \frac{1}{6}[v_{i-1,j,k} + v_{i,j+1,k} + v_{i+1,j,k} + v_{i,j-1,k}] + \frac{1}{12}[v_{i-1,j-1,k} + v_{i-1,j+1,k} + v_{i+1,j+1,k} + v_{i+1,j-1,k}]    (2.34)
After the estimations have been made for the partial derivatives and the Laplacians, the
problem becomes a minimization of the error in the constant brightness equation in (2.25)
and the departure from smoothness in (2.28).
The total error to be minimized becomes:

\varepsilon^2 = \iint (\alpha^2 \varepsilon_c^2 + \varepsilon_b^2)\, dx\, dy    (2.35)

where ε_c is the error from the departure from smoothness, ε_b is the error from the brightness equation, and α is a weighting value. The value of α plays a significant role only for areas where the brightness gradient is small, preventing haphazard adjustments to the estimated flow velocity occasioned by noise in the estimated derivatives [13]. Using a least squares minimization on (2.35), and substituting the estimates for the partial derivatives and Laplacians, yields the iterative solution:

u^{n+1} = \bar{u}^n - \frac{E_x(E_x \bar{u}^n + E_y \bar{v}^n + E_t)}{\alpha^2 + E_x^2 + E_y^2}    (2.36)

v^{n+1} = \bar{v}^n - \frac{E_y(E_x \bar{u}^n + E_y \bar{v}^n + E_t)}{\alpha^2 + E_x^2 + E_y^2}    (2.37)

Optical flow is calculated using synthetic images as inputs. The inputs and results are shown in Fig 2.9.
Figure 2.9: Optical flow calculated between the two input images: (a) frame 1, (b) frame 2, and (c) the resulting optical flow. The block in the top right corner is moving diagonally down from left to right, and the block in the bottom right corner is moving from right to left.
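A compact sketch of the Horn-Schunck iteration of equations (2.36) and (2.37) is given below (Python with NumPy and SciPy, assumed for illustration; the derivative estimates are a coarser stand-in for equations (2.29)-(2.31)).

import numpy as np
from scipy.ndimage import convolve

def horn_schunck(E1, E2, alpha=1.0, n_iter=100):
    """Estimate optical flow (u, v) between two grayscale frames."""
    E1 = E1.astype(float)
    E2 = E2.astype(float)
    kx = np.array([[-1, 1], [-1, 1]], dtype=float)
    ky = np.array([[-1, -1], [1, 1]], dtype=float)
    Ex = (convolve(E1, kx) + convolve(E2, kx)) / 4.0   # rough stand-in for eq. (2.29)
    Ey = (convolve(E1, ky) + convolve(E2, ky)) / 4.0   # rough stand-in for eq. (2.30)
    Et = E2 - E1                                       # rough stand-in for eq. (2.31)
    # Weighted neighborhood average of eqs. (2.33)-(2.34)
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6],
                    [1/12, 1/6, 1/12]])
    u = np.zeros_like(E1)
    v = np.zeros_like(E1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ex * u_bar + Ey * v_bar + Et) / (alpha ** 2 + Ex ** 2 + Ey ** 2)
        u = u_bar - Ex * common   # eq. (2.36)
        v = v_bar - Ey * common   # eq. (2.37)
    return u, v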
Chapter 3
Experiments and Results
This chapter presents an overview of validating the digital image processing techniques discussed in the previous chapter. Synthetic data and acquired data are both used in this validation. Synthetic data is useful for testing digital image processing algorithms because it can be free from noise, artifacts from video coding, and background clutter. After the object registration methods are validated on the synthetic data, they are applied to the acquired data from the digital camera. Results are discussed.
3.1 Synthetic Data
Sobel Edge Detection
The synthetic image in Fig 3.1 is used as the input to the Sobel edge detection algo-
rithm. The magnitude of the gradients in the x and y direction is displayed in Fig 3.2(a).
Figure 3.1: Synthetic image for object registration
Figure 3.2: (a) The magnitude of the x and y gradients using the Sobel kernels. (b) Lines extracted from the Hough transform are plotted on top of the grayscale input image.
Hough Transform
The result from the Sobel edge detection is used as the input to the Hough transform.
The lines are extracted from the Hough transform and plotted on top of the original input
image in grayscale in Fig 3.2(b). The Hough transform detects the strongest edges or gradi-
ents in the image. Not all edges were detected by the Hough transform in this experiment.
The Hough transform extracted the longest lines in the image, which compose the larger
blocks. None of the edges of the block in the lower left-hand corner were extracted. Even if
the Hough transform detects all of the edges, there is no method to determine which block
is which from this information alone.
Normalized Cross Correlation
The red block is cropped and stored in memory. This red block is assigned as the function h(x, y) in the normalized cross correlation equation. The image in Fig 3.1 is stored as the function f(x, y). After the cross correlation coefficients are calculated for all possible shifts of h(x, y) onto f(x, y), the maximum coefficient occurs where there is a match between h(x, y) and f(x, y). The output is shown as a surface plot of all of the normalized cross correlation coefficients in Fig 3.3. The red block is the group of coefficients to the left. The group to the right is from the red straight line. The highest correlation value is 1, which is near the middle of the red block. This method successfully identified the template h(x, y) within the image f(x, y).
Figure 3.3: This is a surface plot of the magnitude of the normalized correlation coefficients. The red block is on the left, and the red straight line is on the right. Notice that the maximum coefficient is in the middle of the actual red block.
3.2 Acquired Data
Digital video was obtained in an arena designed for a robotic competition. Four dif-
ferent colored blocks were placed in front of a robot. The background and rink introduce
additional difficulties such as changing colors on parts of the floor, different textured surfaces such as pebbles, and a window. A digital camera was placed on top of a robot in
the rink, and was pushed forward for a couple of seconds. A total of 299 images were used
as data to process. A frame taken from this video is in Fig 3.4. In this section, the edge
detection, Hough transform, and the normalized cross correlation operations are applied to
this image.
Figure 3.4: The output from the digital camera. Included in the image are the four blocks
to be detected. There is a black background on the rink, there is a blue and white surface,
and there are some pebbles.
3.2.1 Edge Detection
The Sobel edge detection discussed in the previous chapter is used on the image in Fig
3.4. The magnitude of the gradients is shown in Fig 3.5. As one can see, the presence of the pebbles introduces some noise. The edges of the window, the rink, and the change of color on the floor are detected, as well as the blocks.
Figure 3.5: Edge detection on the acquired data
3.2.2 Hough Transform
The Hough transform as discussed in the previous chapter is applied to the edge detec-
tion result in Fig 3.5. The results of the Hough transform are in Fig 3.6. As expected, the
strongest lines are extracted, which are the lines comprising the rink and the window. It is apparent at this point that the edge detection and Hough transform operations alone are not very useful for registering objects.
Figure 3.6: Hough transform on the acquired data
3.2.3 Normalized Cross Correlation
Finally, the normalized cross correlation operation is performed on the acquired data. The red block is stored as the template h(x, y) for normalized cross correlation in equation (2.22). The input image from Fig 3.4 is stored as the function f(x, y). The output is a matrix of correlation values displayed as a surface plot in Fig 3.7. The red color on the surface plot coincides with a high correlation value, and the blue color represents a low correlation value. There is strong correlation across the lower borders of the rink, the edge of the pebble surface, and some bricks in the background. The red block has a medium correlation coefficient (light blue color).
3.3 Comments
The edge detection and Hough transform are useful methods to extract the gradients of the images. Normalized cross correlation works well with the synthetic data, but has issues with the acquired data. One issue with normalized cross correlation is that h(x, y) and f(x, y) must be in grayscale, so intensity information in RGB space is lost in the process. As a result, there are many areas away from the block with high correlation coefficients. It is evident that an algorithm is needed that uses these digital image processing techniques in combination to form a robust computer vision algorithm.

Figure 3.7: The output is a surface plot of the normalized cross correlation coefficients. The red surface indicates a high correlation coefficient and blue represents a low correlation coefficient.
Chapter 4
Statistical-gradient Object Registration
This chapter presents a new method for object registration. This method detects objects by combining two identification methods in series. There are two common types of methods used to identify objects: area-based and feature-based, as discussed in [10]. Area-based methods look at the structure of the image with correlation metrics or Fourier analysis. Feature-based methods concentrate on image features such as edges, line intersections, and curves. The first identification method used here is based on local averages and variances of surrounding pixels. The result is a binary image, which is passed to the second method, which uses the Hough transform to extract the gradients.
4.0.1 Area-Based
This area-based method uses six parameters to classify an object: the local averages and variances for the red, green, and blue channels in a given pixel window of a digital image. A template h(x, y) contains the object to be located in an image f(x, y). The object in h(x, y) is stored in the computer as a three-dimensional array (M x N x 3). The three dimensions are composed of pixel values for the red, green, and blue channels. Statistical analysis is used to characterize each color matrix. First, the average value for each color matrix is calculated. Then the variance for each pixel in the template h(x, y) is calculated. After the template h(x, y) has been classified with this method, the image f(x, y) is searched for areas that match the parameters of the template h(x, y). If a pixel in f(x, y) matches the parameters given by h(x, y), then the location of the pixel is mapped into a separate two-dimensional output matrix with the same row-column size as the image f(x, y). A tolerance value is utilized that allows values with a small deviation from the calculated parameters to be accepted as a match. Refer to the appendix for specific tolerance values. Note that this method fails to incorporate any information regarding the shape of the object. This fault could have minimal effect; however, if there are many objects of similar color, then it would be difficult to determine with confidence that the object detected is the desired one. Therefore, the result from this area-based method is tested with the Hough transform to match the shape of the object.
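A minimal sketch of this statistical matching step is shown below (Python with NumPy, assumed for illustration; the brute-force window scan and the tolerance handling are simplified relative to the thesis software, whose exact tolerance values are listed in the appendix).

import numpy as np

def area_based_match(f, h, tol_mean, tol_var):
    """Mark pixels of color image f whose local RGB means and variances match
    those of template h within the given tolerances; returns a binary mask."""
    M, N = h.shape[:2]
    t_mean = h.reshape(-1, 3).mean(axis=0)   # three average values
    t_var = h.reshape(-1, 3).var(axis=0)     # three variances
    rows, cols = f.shape[:2]
    mask = np.zeros((rows, cols), dtype=bool)
    for y in range(rows - M):
        for x in range(cols - N):
            window = f[y:y + M, x:x + N].reshape(-1, 3)
            mean_ok = np.all(np.abs(window.mean(axis=0) - t_mean) < tol_mean)
            var_ok = np.all(np.abs(window.var(axis=0) - t_var) < tol_var)
            if mean_ok and var_ok:
                mask[y, x] = True
    return mask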
4.0.2 Feature-Based
The output matrix from the proposed area-based method is then used as the input to the Hough transform. After the Hough transform is performed, the output lines are sent to an algorithm that checks the angle of intersection between the line segments. From the synthetic and acquired data discussed in Chapter 3, it is evident that the objects are cubes; therefore, the intersections of the gradients of the image are perpendicular. To calculate the angle of intersection, the slope of each line is calculated. After all of the slopes are calculated, the algorithm compares the negative inverse of each slope to all of the other slopes. If they are equal within a controllable window of tolerance, then the lines are perpendicular. If there is no match, the line segment is thrown out. If the line intersections are perpendicular, then the probability that the detected region is the template h(x, y) is high. If the angles are not perpendicular, then there is some error.
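The sketch below (Python with NumPy, illustrative only) implements an equivalent form of the negative-inverse-slope test: it compares line orientations and accepts pairs whose orientations differ by 90 degrees within a tolerance, which avoids the division-by-zero problem of comparing slopes directly.

import numpy as np

def find_perpendicular_pairs(slopes, tol_deg=3.0):
    """Return index pairs (i, j) of line segments whose orientations are
    perpendicular within tol_deg degrees."""
    angles = np.degrees(np.arctan(np.asarray(slopes, dtype=float)))
    pairs = []
    for i in range(len(angles)):
        for j in range(i + 1, len(angles)):
            diff = abs(angles[i] - angles[j])
            diff = min(diff, 180.0 - diff)       # orientation difference, 0..90 degrees
            if abs(diff - 90.0) < tol_deg:
                pairs.append((i, j))
    return pairs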
An experiment with randomly distributed line segments is used to test the gradient method for determining whether line segments are perpendicular within 3 degrees of tolerance. The plot in Fig 4.1 contains the randomly created line segments, and the output is a plot of the lines that are perpendicular. Other sets of lines appear nearly perpendicular to each other, which is the reason that the tolerance of error (in degrees) can be adjusted. This procedure works on these randomly created line segments, so the next step is to use this method on the real line segments extracted from the Hough transform.
Figure 4.1: The plot on the left contains 10 randomly generated line segments. The plot on the right contains the line segments that are perpendicular to each other.
4.0.3 Code Outline
This section provides an outline of the software used to implement the statistical-gradient object registration method; a short sketch of these steps follows the outline.

Inputs: the template h(x, y) and the image f(x, y)
Output: an image of the detected object with lines drawn around it

1. Calculate the dimensions Nh x Mh of the template h(x, y) and Nf x Mf of the input image f(x, y).

2. Calculate the average value and variance of the pixel values in the template h(x, y) for each RGB channel.

3. Create an averaging mask with the row-column dimensions Nh x Mh.

4. Increase the resolution of f(x, y) to 2Nf x 2Mf with nearest-neighbor interpolation.

5. Perform two-dimensional convolution with the averaging mask and the input image f(x, y) for each RGB channel separately.

6. Calculate the variance for each pixel in the input image f(x, y) within the two-dimensional window spanned by the size of the template h(x, y), and the variance of the output from the two-dimensional convolution of the input image f(x, y) with the averaging mask, for each RGB channel separately.

7. Set the window of tolerance for accepting values as a match with the template.

8. Search the average values and variances from the input image f(x, y) in each RGB channel and compare them with the six parameters calculated from the template h(x, y).

9. If the values satisfy all six parameters, store the pixel locations in a separate output matrix out(x, y) with row-column dimensions 2Nf x 2Mf.

10. Use the output matrix out(x, y) as the input to the Hough transform.

11. Calculate the slope in degrees relative to the horizontal for all line segments.

12. Verify that the line segments are perpendicular.

13. Calculate the center and area of the object.

14. Plot all perpendicular line segments onto the input image f(x, y) with row-column dimensions 2Nf x 2Mf.
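The outline might be realized along the following lines (Python, a hypothetical sketch that reuses the illustrative helpers defined earlier, area_based_match and hough_accumulator, and omits line extraction and plotting for brevity; it is not the author's actual implementation).

import numpy as np

def register_object(f, h, tol_mean, tol_var):
    """Statistical-gradient registration following the outline above (steps 1-10, 13)."""
    # Steps 1-2: template size and its six statistical parameters are
    # computed inside area_based_match.
    # Steps 3-4: nearest-neighbor upsampling of f to 2Nf x 2Mf.
    f2 = np.repeat(np.repeat(f, 2, axis=0), 2, axis=1)
    # Steps 5-9: statistical (area-based) matching produces a binary mask.
    mask = area_based_match(f2, h, tol_mean, tol_var)
    # Step 10: Hough transform of the binary mask.
    accumulator, thetas = hough_accumulator(mask.astype(np.uint8))
    # Steps 11-12 (line extraction and the perpendicularity test) are omitted here.
    # Step 13: centroid and area of the matched region.
    ys, xs = np.nonzero(mask)
    center = (float(xs.mean()), float(ys.mean())) if xs.size else None
    area = int(xs.size)
    return center, area, accumulator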
To experimentally validate this statistical-gradient object registration method, synthetic data (Fig 4.2) and acquired data (Fig 3.4) are used as the input images. Synthetic data is useful for testing digital image processing algorithms because there is no noise, there are no artifacts from video coding, and the background is less busy. After the object registration method works on the synthetic data, it is used on the acquired data from the digital camera.
4.0.4 Validation using Synthetic Images
A synthetic red-block is created and placed in the bottom left-hand corner. There is
also a line with the same characteristics along with blue and green blocks added to the
image. The RGB values for the red-block were set to the values in Table 4.1. A 6 x 6 pixel
block was created to resemble the real red block on the real surface from the acquired data.
The dimensions of the block were chosen because it was desired to have a block that was
smaller than other blocks in the image. Having a smaller block introduces a higher level of
di culty for digital image processing algorithms. Table 4.1 describes the synthetic block
according to its expected value and variance of each red, green, and blue parameters. An
33
image is created with the red block in Table 4.1 along with a green block, blue rectangle,
and a red line with the same characteristics as the synthetic block described by Table 4.1.
This image is displayed in Fig 4.2.
Table 4.1: Synthetic Block Characteristics
Expected Value Variance
Red 219 0
Green 60 0
Blue 54 0
Figure 4.2: Synthetic image for object registration
Statistical-gradient Object Registration
This method is tested on the synthetic image in Fig 4.2, and the output image is shown
in Fig 4.3. As seen in Fig 4.3, the red block was detected.
4.0.5 Validating on Acquired Images
The next step is to test this method on the acquired data, which has more background
noise. To demonstrate the algorithm on more realistic data, digital video was obtained in
an arena designed for a robotic competition. A total of 299 frames were extracted for processing. The thirty-seventh frame is shown in Fig 3.4. The statistical-gradient method is
applied to the image sequence.
Figure 4.3: The output from the algorithm used on the input from Fig 4.2.
Registering the Red Block
The red block was copied from the input image in Fig 3.4 and saved as the template h(x;y). The pixel values of each RGB channel of h(x;y) were processed to obtain the per-channel statistics given in Table 4.2. The statistical-gradient registration method is performed on the entire image sequence (299 frames) taken from the digital camera. The gradient result can be seen in Fig 4.4; all of the lines are perpendicular within a 9-degree tolerance. The same template is used for the entire image sequence, and this registration method located the template h(x;y) in all 299 of the input images f(x;y). If h(x;y) is smaller than the object in f(x;y), it appears to have minimal effect on the outputs.
Figure 4.4: The output from the line segment algorithm on the lines extracted from the Hough transform on the output from the object identification method.
Table 4.2: Red Block Characteristics
Channel  Expected Value  Variance
Red      202.30          85.08
Green    71.41           40.07
Blue     32.90           58.00
Four output frames are displayed in Fig 4.5. A vector from the origin to the centroid of the block is plotted on top of each image. The location of the center of the registered object is an output, as is the area of the object. The area is used to infer whether the object is moving closer or farther away; the area of the block generally increases as the camera moves closer.
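A minimal sketch of how the centroid and area could be read off the binary output matrix out(x, y); the function name and the use of NumPy are my own choices, not the thesis implementation.

    import numpy as np

    def centroid_and_area(out):
        """Centroid (x, y) and area in pixels of the registered region in out(x, y)."""
        rows, cols = np.nonzero(out)
        area = rows.size                      # number of matched pixels
        if area == 0:
            return None, 0
        center = (cols.mean(), rows.mean())   # (x, y) center; Fig 4.5 overlays the vector
        return center, area                   # from the origin to this point on the frame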
(a) Frame 37 (b) Frame 64
(c) Frame 91 (d) Frame 129
Figure 4.5: Four outputs of the registered red block
Registering the Blue Block
Additionally, the blue block in the image is registered. The template h(x;y) is the blue block, and the six statistical parameters used to classify the object are given in Table 4.3. Four output frames from the video sequence are shown in Fig 4.7.
Figure 4.6: The center of the registered red block in ten selected frames from the image sequence.
Table 4.3: Blue Block Characteristics
Channel  Expected Value  Variance
Red      51.08           40.28
Green    59.94           16.62
Blue     82.26           38.85
4.0.6 Motion Prediction
It is assumed that the objects are moving at a constant velocity; therefore it is reasonable to predict where an object will be in the next frame. If the location of the object can be accurately estimated, then the search area of the input frame f(x;y) can be significantly reduced. Knowledge of the location of the object would therefore reduce the processing time, as well as eliminate false detections. The open-loop estimation of a moving object in an image sequence can be described by the following equation:
\[
\mathrm{Obj}_{k+1}(x, y, z) =
\begin{bmatrix}
x_k + (x_k - x_{k-1}) \\
y_k + (y_k - y_{k-1}) \\
z_k + (z_k - z_{k-1})
\end{bmatrix}
\tag{4.1}
\]
where x and y refer to the coordinates of the center of the registered object, and z refers to the size (area) of the registered object. Fig 4.8 and Fig 4.9 contain plots
(a) Frame 37 (b) Frame 64
(c) Frame 91 (d) Frame 129
Figure 4.7: Four outputs of the registered blue block
of the estimations of the centers of the red and blue blocks using equation (4.1). The largest prediction error for the x-coordinate of the object's center is about 20 pixels, and the error for the y-coordinate is of the same magnitude. The error in the prediction of the block's area is significantly higher, due to the nonlinear growth of the area as the block gets closer to the camera. The mean, variance, and standard deviation of the motion prediction errors for the red and blue blocks are given in Table 4.4 and Table 4.5, respectively. To further reduce these prediction errors, a Kalman filter could be implemented.
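Equation (4.1) is a constant-velocity extrapolation and can be sketched in a couple of lines; the function name is mine and the example values are invented for illustration.

    def predict_next(prev, curr):
        """Open-loop constant-velocity prediction of (x, y, z) per equation (4.1)."""
        return tuple(c + (c - p) for p, c in zip(prev, curr))

    # e.g. a center/area of (120, 84, 300) in frame k-1 and (124, 86, 330) in frame k
    # predicts (128, 88, 360) for frame k+1.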
Figure 4.8: Plots of the X estimation, Y estimation, and Z estimation of the red block.
Table 4.4: Motion Prediction Error for the Red Block
     Expected Value (pixels)  Variance (pixels²)  Standard Deviation (pixels)
X    2.04                     30.79               5.54
Y    1.39                     58.87               7.67
Z    11.61                    42,825              206.94
Figure 4.9: Plots of the X estimation, Y estimation, and Z estimation of the blue block.
4.0.7 Stop Sign Registration
To further validate the statistical-gradient registration method, video data was recorded with a web cam placed on top of a car. The video data was divided into 55 images and shows the car driving toward a stop sign. There is a railroad crossing sign behind the stop sign, as well as many trees and dirt that add clutter to the background. The stop sign is the desired object to track in this video sequence. Two shades of red are stored as the template h(x;y) to search for within the input images, because the color intensity gets brighter as the distance between the vehicle and the sign decreases. The results from the statistical-gradient registration method are shown in Fig 4.10. The white lettering on the stop sign causes the algorithm either to detect the stop sign in two sections or to find only blocks of red on the sign. When the stop sign is far away, as in Frame 11, the sign is so small that the gradients (green lines) do not compose an octagon.
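One way to realize the two-shade search is to run the statistical test once per template and take the union of the two binary outputs. The sketch below reuses the hypothetical register_object function from the earlier outline, with the stop-sign tolerances from Table D.1; the template names are placeholders, and this is not the code used in this research.

    def register_stop_sign(frame, template_dark_red, template_bright_red):
        """Match both shades of red and accept a pixel if either template matches."""
        out_far = register_object(frame, template_dark_red, mean_tol=15, var_tol=40)
        out_near = register_object(frame, template_bright_red, mean_tol=15, var_tol=40)
        return out_far | out_near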
Table 4.5: Motion Prediction Error for the Blue Block
     Mean (pixels)  Variance (pixels²)  Standard Deviation (pixels)
X    0.9052         78.02               8.8
Y    0.9118         23.10               4.81
Z    32.73          11,930              109.22
4.0.8 Multiple Green Signs Registration
To further validate the statistical-gradient registration method, video data was taken from the dashboard inside a vehicle; the scene contains multiple green signs on the side of the road, spaced 200 feet apart. Fig 4.11 contains a frame from the video. The green sign is used as the template h(x;y), and the output images are shown in Fig 4.12.
(a) Frame 11 (b) Frame 17
(c) Frame 28 (d) Frame 38
Figure 4.10: Four outputs from the object orientation method.
Figure 4.11: A frame of the video sequence of the green signs.
(a) Frame 1 (b) Frame 153
(c) Frame 229 (d) Frame 278
Figure 4.12: Four outputs from the video sequence of the green signs
Chapter 5
Conclusions
This thesis has presented multiple object registration methods to address the detection of objects in different settings, such as indoor and outdoor environments. These registration methods work well on noiseless synthetic data, but prove troublesome on noise-corrupted data obtained from a camera. Quantization noise was seen to be a major source of corrupted data in digital image processing, along with blurring from motion of the camera or of an object.
Optical flow was seen to be difficult to apply to object registration and tracking for a number of reasons. For the image sequences used in this research, the camera was in motion, so all objects and the background appear to move. This motion makes it difficult to register objects that are still. If other objects were in motion, their relative velocities could then be segmented.
Feature-based methods such as edge detection and the Hough transform provided useful results to begin classifying objects; however, a higher level of processing is needed to register the object. Statistics of areas of pixels can also provide useful information for object registration. Combining these two approaches helps ensure that the template h(x;y) has been correctly matched in the input image f(x;y).
The red and blue blocks from the first experimental study were registered in every frame of the video sequence. The resolution is increased via nearest-neighbor interpolation to enlarge the gradients when the block is distant from the camera. As the block moves closer to the camera, the gradients become slightly distorted due to the resolution increase, as was seen in Fig 4.5(d).
The algorithm for the stop sign and green sign images also exhibited some limitations, because the template h(x;y) is small relative to the objects as the camera moves closer to them, and the gradients of the objects also change. When the stop sign is far away, its shape is not an octagon but a red blur; as the stop sign gets closer, the octagon shape becomes recognizable.
Chapter 6
Future work
Future work entails more extensive testing of the algorithm, as well as implementing it in real time. Experimental validation through a leader-follower demonstration is another goal. One experimental set-up could be to visually tag a leader robot with a specific mark for which the follower robot would search. After that demonstration is successful, a larger-scale experiment could fuse the vision data with other sensors, such as an IMU or GPS, to follow the leader vehicle.

Another goal is to write algorithms for the leader vehicle. At this point in the research, an algorithm has been written to register objects; however, an algorithm for the leader vehicle to identify good features to track is also important. The leader vehicle must survey the environment and decide which objects or other features of the image are good features to track. After the leader vehicle has done this, it will transmit the data to the follower vehicle.
Appendices
Appendix A
Nomenclature
f(x;y) - image as a two dimensional function
h(x;y)- template
p(x;y)- pixel value at location (x;y) in an image
Gx- gradient with respect to x of an image
Gy- gradient with respect to y of an image
F(u;v) - image represented in the frequency domain
A(i;j)- accumulator value of Hough transform at location (i;j) in parameter space
ρ - the distance from the line to the origin for the Hough transform
θ - the angle of the vector from the origin to the closest point
u - average velocity in the x-direction
v - average velocity in the y-direction
γ - two-dimensional normalized cross-correlation coefficient
Appendix B
Orientation Results of Red Block
Appendix C
Orientation Results of the Blue Block
Appendix D
Tolerance Values for Statistical Analysis
Table D.1: Tolerances from the experiments
Template     Expected Value Tolerance  Variance Tolerance
Red block    15                        50
Blue block   40                        70
Stop sign    15                        40
Green signs  40                        150
Green signs 40 150
59