Registration and Tracking of Objects with Computer Vision for Autonomous Vehicles

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

Andrew Nevin

Certificate of Approval:

David M. Bevly, Co-Chair
Associate Professor
Mechanical Engineering

Thaddeus A. Roppel, Co-Chair
Associate Professor
Electrical and Computer Engineering

Thomas Denney
Professor
Electrical and Computer Engineering

Alan Scottedward Hodel, Co-Chair
Associate Professor (deceased)
Electrical and Computer Engineering

George T. Flowers
Dean
Graduate School

Registration and Tracking of Objects with Computer Vision for Autonomous Vehicles

Andrew Nevin

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
May 9, 2009

Registration and Tracking of Objects with Computer Vision for Autonomous Vehicles

Andrew Nevin

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author

Date of Graduation

Vita

Andrew Nevin, son of Ronald and Brenda Nevin, was born June 8, 1983, in Nashville, Tennessee. He graduated from Montgomery Bell Academy in 2002. He attended Auburn University in September 2002 and graduated cum laude with a Bachelor of Electrical Engineering degree in December 2006. He enrolled in the Graduate School, Auburn University, in January 2006, working as a graduate teaching assistant for Electrical Engineering Lab II and a graduate research assistant. He graduated with a Master of Science in Electrical Engineering degree in May 2009.

Thesis Abstract

Registration and Tracking of Objects with Computer Vision for Autonomous Vehicles

Andrew Nevin

Master of Science, May 9, 2009
(B.S., Auburn University, 2006)

70 Typed Pages

Directed by Scotte Hodel and David Bevly

For this research, an autonomous leader-follower vehicle caravan is being designed using computer vision. The focus of this research is to investigate computer vision methods of identifying objects detected by the leader vehicle and determining the follower vehicle's position relative to the objects. Many object registration and tracking methods are presented in this paper, such as Hough transforms, optical flow, and normalized cross correlation. A novel statistical-gradient method is presented to combine feature-based methods and area-based methods for object registration and tracking. These methods are evaluated against both synthetic and captured video data. This registration method is shown to be effective with both synthetic data and acquired data.

Acknowledgments

The author would like to thank his advisors Dr. Alan Scottedward Hodel and Dr. David Bevly, as well as his committee members Dr. Thaddeus Roppel and Dr. Thomas Denney, for their guidance and patience through this research. Additionally, he would like to thank Dr. Stanley Reeves for instruction in Digital Image Processing and Advanced Digital Image Processing. Without the course instruction, advice, and motivation given to the author, this research would not have been possible.
The author wishes to express his appreciation for his grandparents Robert and Louise Howell, his parents Ronald and Brenda Nevin, his brother Ryan, and his two sisters Ashley and Caroline for all of their continuous love, support, and encouragement throughout his entire educational career.

Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX), together with the departmental style file aums.sty.

Table of Contents

List of Figures
1 Introduction
  1.1 Motivation
  1.2 Previous Work
  1.3 Approach
2 Digital Image Processing Background
  2.1 Processing Pixels
  2.2 Implementation of filters and edge detection
  2.3 Image Filters
  2.4 Multi-resolution processing
  2.5 Edge Detection Methods
  2.6 Hough Transform
  2.7 Object Registration
    2.7.1 Definition
    2.7.2 Classifications
    2.7.3 Methods of Object Registration
    2.7.4 Normalized Cross Correlation
  2.8 Optical Flow
3 Experiments and Results
  3.1 Synthetic Data
  3.2 Acquired Data
    3.2.1 Edge Detection
    3.2.2 Hough Transform
    3.2.3 Normalized Cross Correlation
  3.3 Comments
4 Statistical-gradient Object Registration
  4.0.1 Area-Based
  4.0.2 Feature-Based
  4.0.3 Code Outline
  4.0.4 Validation using Synthetic Images
  4.0.5 Validating on Acquired Images
  4.0.6 Motion Prediction
  4.0.7 Stop Sign Registration
  4.0.8 Multiple Green Signs Registration
5 Conclusions
6 Future work
Bibliography
Appendices
  A Nomenclature
  B Orientation Results of Red Block
  C Orientation Results of the Blue Block
  D Tolerance Values for Statistical Analysis

List of Figures

2.1 Gaussian noise is present in the image in (a).
A 5 x 5 pixel mask is used as the low-pass filter. The result is in (b). Notice the blurring and the removal of some of the noise as well as the letters in the image. A 2 x 2 pixel mask is used for the median filter. The result is in (c). All of the noise has been removed, but there is a little deformation of the letters.
2.2 Sobel Edge detection
2.3 Points on the Cartesian coordinate system
2.4 Points from the Cartesian system in Fig 2.3 transformed into parameter space
2.5 The Hough transform from the image in (a) is shown as an image in (b). The lines are extracted with the inverse Hough transform and plotted on top of the original image in (c).
2.6 Normalized cross correlation output
2.7 Barber pole
2.8 All three partial derivatives at the center are calculated from the average of the first four differences of adjacent pixels. The column index is denoted as j, the row index is denoted as i, and the time index is denoted as k.
2.9 Optical flow calculated between the two input images. The block in the top right corner is moving diagonally down from left to right, and the block in the bottom right corner is moving from right to left.
3.1 Synthetic image for object registration
3.2 (a) The magnitude of the x and y gradients using the Sobel kernels. (b) Lines extracted from the Hough transform are plotted on top of the grayscale input image.
3.3 This is a surface plot of the magnitude of the normalized correlation coefficients. The red block is on the left, and the red straight line is on the right. Notice that the maximum coefficient is in the middle of the actual red block.
3.4 The output from the digital camera. Included in the image are the four blocks to be detected. There is a black background on the rink, there is a blue and white surface, and there are some pebbles.
3.5 Edge detection on the acquired data
3.6 Hough transform on the acquired data
3.7 The output is a surface plot of the normalized cross correlation coefficients. The red surface indicates a high correlation coefficient and blue represents a low correlation coefficient.
4.1 The plot on the left contains 10 randomly generated line segments. The plot on the right shows the line segments that are perpendicular to each other.
4.2 Synthetic image for object registration
4.3 The output from the algorithm used on the input from Fig 4.2
4.4 The output from the line segment algorithm on the lines extracted from the Hough transform on the output from the object identification method
4.5 Four outputs of the registered red block
4.6 The center of the registered red block in ten selected frames from the image sequence
4.7 Four outputs of the registered blue block
4.8 Plots of the X estimation, Y estimation, and Z estimation of the red block
4.9 Plots of the X estimation, Y estimation, and Z estimation of the blue block
4.10 Four outputs from the object orientation method
4.11 A frame of the video sequence of the green signs
4.12 Four outputs from the video sequence of the green signs

Chapter 1
Introduction

1.1 Motivation

The desire and necessity to build autonomous vehicles is increasing every year. Autonomous vehicles can be designed to replace humans on jobs that are dangerous, time-consuming, or difficult. Recent events have shown the value of using highly automated technology on the battlefield. Autonomous aircraft, terrestrial, and marine vehicles have accomplished important missions while reducing risk to personnel [6]. They can be used in rescue operations, such as collapsed mines or building fires, where it is dangerous to send workers. Other tasks include avoiding obstacles, keeping a safe distance behind other vehicles, regulating speed according to traffic conditions and road characteristics, and driving and parking within urban environments [1]. Autonomous vehicles can be sent along a dangerous path to deliver supplies to soldiers in need without risking injuries or fatalities to others.

Research in vehicle following has attracted the attention of several researchers in the past decades, particularly in the USA and Europe, where safety, energy consumption, and traffic congestion are the primary motivators [2], [3]. In a recent study by the Virginia Department of Motor Vehicles and Virginia Commonwealth University, researchers conclude that between 25 and 50 percent of all motor vehicle crashes in the U.S. are caused by driver distraction. Rubbernecking (slowing down to look at another accident) accounts for 16 percent of accidents caused by distractions. Driver fatigue (12 percent), looking at scenery (10 percent), other passengers or children (9 percent), adjusting the radio (7 percent), and reading maps or other documents (2 percent) account for the majority of accidents caused by distraction. Cell-phone use is another distraction that is increasing the number of accidents. Driver fatigue accounts for 100,000 crashes each year. Aggressive driving is another frequent cause of accidents. It is evident that there is a need to reduce the number of accidents each year. Creating autonomous or semi-autonomous vehicles is an important step toward reducing accidents.

The first step to creating an autonomous vehicle is self-localization. Self-localization is the process by which an autonomous vehicle calculates its position relative to the environment. GPS is typically used to estimate the vehicle's current position; however, in indoor environments, dense trees, or urban areas, the line of sight between the satellites and the vehicle can be blocked. Other sensors, such as an IMU, LIDAR, and vision, can be used to estimate the position of the vehicle when GPS is not available.

A desire to create an autonomous leader-follower caravan for military convoys is growing. In [4], the authors discuss the current state of research in these convoys. The leader-follower capability allows one man-driven vehicle to be followed by one or more unmanned vehicles in a convoy-like operation. The man-driven lead vehicle establishes the route for the unmanned follower vehicles by sending the waypoints or GPS breadcrumb coordinates to the followers.
Additionally, the follower vehicles are instructed to trail the leader at a specified distance [4]. The motivation behind the leader-follower convoy is to reduce the risk to soldiers in dangerous situations.

For this research, an autonomous leader-follower vehicle caravan is being designed using computer vision. As the leader vehicle navigates through a landscape, it will need to recognize and identify features it may "see", which include obstacles that it may have to detour around. The leader vehicle identifies these features using digital image processing techniques such as gradient detection, Hough transforms, and local averages and variances of pixel values. In this research it is assumed that the leader estimates the orientation and distance of an object and transmits the data to the follower vehicle. The follower vehicle searches the landscape for the object identified by the leader vehicle and calculates its orientation and distance from the object. The follower will then attempt to correct its position based on the leader vehicle's data. The focus of this research is to investigate computer vision methods of identifying objects detected by the leader vehicle and determining the follower vehicle's position relative to the objects.

1.2 Previous Work

In [5], the authors use a leader vehicle, which is not equipped with GPS or inter-vehicle communication networks, and a follower vehicle equipped with an on-board laser scanner and odometry sensors. The leader position is estimated using the measurement data from the laser scanner. In [6], the authors present a leader-follower design for multiple autonomous underwater vehicles operating in formation to perform tedious and dangerous tasks, such as minesweeping. The leader vehicle broadcasts its position to the follower vehicles through underwater acoustic communication. The follower vehicles move into formation relative to the leader to scan for mines. This work requires that the leader vehicle be close to the follower vehicles.

In [7], the authors use cameras for an autonomous vehicle to follow an astronaut in outer space. They tag the astronaut with a green color blob. A green color filter is used to convert the data from the camera on the follower to a binary image. The orientation and size of the detected color blob are used to control the distance between the astronaut and the follower. This follower robot helps the astronaut carry moon rocks and other heavy items that would otherwise burden the astronaut. However, in many operations it may be impractical to attach a highly visible signal to the leader vehicle. Additionally, in some leader-follower operations, the follower may be far enough back that the lead vehicle is not visible.

In [8], the authors present the design of a different algorithm using visual sensors. The follower vehicle is equipped with only a camera. The tracking objective requires information related to the leader's motion, which is extracted from the video data from the follower vehicle. A reference image is captured and stored in the memory of the leader vehicle while it is in the desired position and orientation relative to the follower. The control algorithm attempts to match this reference image as closely as possible. For this research, a leader-follower caravan is designed that does not require the follower to be close to the leader.
For example, if the leader vehicle travels down a path and somehow determines this is the optimal path for other vehicles, then the leader vehicle will transmit data describing the path to the follower vehicle.

Many different approaches have been implemented for identifying and tracking objects through vision sensors. The most common algorithms include normalized cross correlation, optical flow, edge detection, or the Hough transform. In [17], the authors use a fast normalized cross correlation algorithm for self-localization of a robot. In [9], the authors use an optical flow algorithm that can be used for road navigation. However, after the optical flow is calculated, no other information is extracted for navigation of a following vehicle. In [19], the authors use an area-based algorithm to statistically measure a small window of points between two images for registering Synthetic Aperture Radar (SAR) images. A feature-based method, the Hough transform, can be used to detect the shapes of objects. In [11], the authors use the Hough transform to identify differently shaped objects. In [12], the authors use the Hough transform to detect ellipses. Simultaneous Localization and Mapping (SLAM) is another active area of research. In SLAM, it is important to continuously localize and estimate 3D positions of new landmarks in an unknown environment. In [14], a high-speed CMOS camera is used to implement SLAM in real time on a robotic arm.

1.3 Approach

First, the algorithms for the follower vehicle will be developed. This development includes computer vision algorithms. This research starts with the underlying assumptions that the leader vehicle has previously passed down a path, has identified good features to track through vision sensors, and has transmitted the data to the follower vehicle. The follower vehicle will then try to locate these specific objects with vision sensors and navigate along the same path as the leader vehicle. The follower vehicle could travel down this path seconds, hours, or days after the leader vehicle has traversed it.

Before experimenting with images from a landscape and the actual vehicle, a smaller scale experiment is performed. A digital camera is placed on a cart and pushed across a rink. There are three surfaces: a smooth blue one, a white sandy one, and pebbles. There are also four blocks scattered around the surfaces: a red block, a blue block, a black block, and a tan colored block. It is assumed that the leader vehicle detected and located the red block as a good feature to track. After demonstrating successful object registration, the algorithms are applied to video data taken from a vehicle on the road. The video data obtained includes stop signs and green signs periodically placed 200 feet apart.

In this thesis, different methods of object registration in a sequence of images using digital image processing techniques are reviewed, and a proposal for a fusion of two methods to optimize the identification of objects for tracking is tested.

Chapter 2
Digital Image Processing Background

An image may be defined as a two-dimensional function, f(x, y), where x and y are spatial (plane) coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity or gray level of the image at that point [15]. Digital image processing is the use of computer algorithms to perform image processing on digital images. The purpose of computer vision is to recognize specific events and perform actions through visual inputs instead of relying on humans.
This chapter provides the background on basic digital image processing algorithms, such as the implementation of filters, multi-resolution processing, edge detection, the Hough transform, normalized cross correlation, and optical flow.

2.1 Processing Pixels

Any given pixel at coordinates (x, y) has four horizontal and vertical neighbors described by the following notation:

(x+1, y), (x-1, y), (x, y+1), (x, y-1)   (2.1)

There are also four diagonal pixel neighbors defined as:

(x+1, y+1), (x+1, y-1), (x-1, y+1), (x-1, y-1)   (2.2)

These equations for a pixel only describe binary or grayscale images. It is a little more computationally expensive to process color images. A color image is composed of three channels: red, green, and blue. The value of a pixel can be represented as a vector:

p(x, y) = \begin{bmatrix} R(x, y) \\ G(x, y) \\ B(x, y) \end{bmatrix}   (2.3)

For example, a color image can occupy 256 x 256 x 3 pixels in a memory array in a computer, but a binary or grayscale image of the same size occupies 256 x 256 pixels. To convert a color image to grayscale, the magnitude of the vector in (2.3) is computed:

p_{\mathrm{grayscale}} = \sqrt{R(x, y)^2 + G(x, y)^2 + B(x, y)^2}   (2.4)

Transforming a color image into grayscale has many advantages. Eliminating two-thirds of the pixel data allows for faster processing time on certain operations. But in some cases, it is desirable to use all color information obtained from a scene. Basic arithmetic operations can be performed on pixels and neighboring pixels, such as expected values, edge detection, and low-pass or median filtering.

2.2 Implementation of filters and edge detection

More advanced digital image processing techniques can be implemented. Image filters can be used for several purposes: color extraction, noise subtraction, or smoothing an image. Edge detection can be used for image restoration, calculating optical flow, or shape detection. Implementing these two operations can be done in the spatial domain or the frequency domain. This section provides the background on implementing filtering and edge detection operations using two-dimensional convolution in both domains. Two-dimensional convolution between an image f(x, y) and a template h(x, y) is defined as:

f(x, y) \ast h(x, y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, h(x-m, y-n)   (2.5)

This operation can also be performed in the frequency domain. The two-dimensional Fourier transform of an image f(x, y) is given by:

F(u, v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (ux/M + vy/N)}   (2.6)

The inverse Fourier transform of F(u, v) is given by:

f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{j 2\pi (ux/M + vy/N)}   (2.7)

Now two-dimensional convolution can be implemented by multiplying the Fourier transforms of the image f(x, y) and the template h(x, y), and then taking the inverse Fourier transform of the product, which results in an output image g(x, y):

g(x, y) = \mathcal{F}^{-1}[H(u, v) F(u, v)]   (2.8)

Computationally, it is easier and faster to implement two-dimensional convolution in the frequency domain rather than in the spatial domain.

2.3 Image Filters

In dealing with acquired data from a camera, the majority of images are sensitive to additive noise. Noise is a high-frequency occurrence in images. There are two common methods for removing noise: low-pass or median filtering. A low-pass filter uses a mask with dimensions M x N. When a low-pass filter is applied to an image, each pixel value p is replaced by the average value of the M x N surrounding pixels.
A 3 x 3 low-pass filter mask would look like:

p(x, y) = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}   (2.9)

(a) Input image with noise  (b) Low-pass filter result  (c) Median filter result
Figure 2.1: Gaussian noise is present in the image in (a). A 5 x 5 pixel mask is used as the low-pass filter. The result is in (b). Notice the blurring and the removal of some of the noise as well as the letters in the image. A 2 x 2 pixel mask is used for the median filter. The result is in (c). All of the noise has been removed, but there is a little deformation of the letters.

The resulting image is smoother and has fewer sharp transitions of intensity. However, the filter has a blurring effect on the image once it has been applied. The downside of using a low-pass filter is that other important high-frequency content, i.e. edges, is blurred as well. The larger the filter mask, the more noise will be smoothed out, but the blurring of edges will increase as well.

A median filter is another method to remove noise from an image, but without blurring edges. The median of a set of values is the number separating the higher half of a sample from the lower half. A median filter has M x N dimensions, and the pixel value assigned is the median value of the pixels within the surrounding M x N neighborhood. This filter is very useful for removing sharp impulse noise, but does not remove or blur edges. In Fig 2.1, there is a noisy image in (a), a result from the low-pass averaging filter in (b), and a result from the median filter in (c). The low-pass averaging filter result does reduce the influence of noise, but blurs the entire image. The median filter removes all noise, but some of the edges around the letters are removed as well.

2.4 Multi-resolution processing

A common problem in computer vision is identifying objects that are small relative to the surroundings. A large distance from the observer to the object is the main cause of this. In these cases, it can be difficult to extract valuable information, such as the gradients. For example, if a vehicle is approaching a stop sign, it is desirable to recognize the sign as soon as possible. From far away, the stop sign looks like a red blob, and as the distance between the vehicle and the sign decreases, the size of the stop sign in the image increases, which makes it easier to calculate the gradients. The gradients are important in this case, because the shape of the stop sign is unique, which serves as a method to identify it.

Multi-resolution processing takes an image f(x, y) as an input, performs interpolation on it, and outputs an image g(x, y). Interpolation is the process of using known data to estimate values at unknown locations. The three most commonly used methods of interpolation are nearest-neighbor, bilinear, and bicubic interpolation. Nearest-neighbor interpolation is computationally easy to implement compared with the other methods. The intensity value at p(x, y) to be estimated takes on the value of the intensity of its nearest neighbor. Bilinear interpolation uses the four nearest neighbors of p(x, y) to estimate the intensity value at location (x, y) on the new resolution grid. It is described by:

v(x, y) = ax + by + cxy + d   (2.10)

where the four coefficients are determined from four equations (one for each neighboring pixel) with four unknowns.
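The thesis does not reproduce code for these interpolation schemes, so the following NumPy sketch is illustrative only. The function names, the fixed integer scale factor, and the toy image are assumptions, and the bilinear routine simply applies the four-neighbor weighted average implied by (2.10) rather than solving explicitly for the coefficients a, b, c, and d.

```python
import numpy as np

def upscale_nearest(img, factor=2):
    """Nearest-neighbor interpolation: each output pixel copies its closest input pixel."""
    rows, cols = img.shape
    r_idx = (np.arange(rows * factor) / factor).astype(int)
    c_idx = (np.arange(cols * factor) / factor).astype(int)
    return img[np.ix_(r_idx, c_idx)]

def upscale_bilinear(img, factor=2):
    """Bilinear interpolation: each output pixel is a weighted average of its
    four nearest input neighbors, the same estimate as v(x, y) = ax + by + cxy + d."""
    rows, cols = img.shape
    out_r = np.linspace(0, rows - 1, rows * factor)
    out_c = np.linspace(0, cols - 1, cols * factor)
    r0 = np.floor(out_r).astype(int); c0 = np.floor(out_c).astype(int)
    r1 = np.minimum(r0 + 1, rows - 1); c1 = np.minimum(c0 + 1, cols - 1)
    dr = (out_r - r0)[:, None]          # fractional row offsets
    dc = (out_c - c0)[None, :]          # fractional column offsets
    top = img[np.ix_(r0, c0)] * (1 - dc) + img[np.ix_(r0, c1)] * dc
    bot = img[np.ix_(r1, c0)] * (1 - dc) + img[np.ix_(r1, c1)] * dc
    return top * (1 - dr) + bot * dr

if __name__ == "__main__":
    img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4 x 4 "image"
    print(upscale_nearest(img).shape)    # (8, 8)
    print(upscale_bilinear(img).shape)   # (8, 8)
```

Either routine could stand in for the resolution-doubling step used later in the registration pipeline (step 4 of the code outline in Section 4.0.3), which uses nearest-neighbor interpolation for speed.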
The intensity value at location (x, y) using bicubic interpolation is determined by:

v(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij} x^i y^j   (2.11)

where the sixteen coefficients are solved for from the sixteen equations with sixteen unknowns obtained from the nearest sixteen neighbors. This method is the most computationally expensive of all of the interpolation methods.

2.5 Edge Detection Methods

Edges in physical objects often result in large local gradients in an image. Edge detection is the process of extracting points where there is a sharp change, or gradient, of image brightness. The first-order gradient of a digital image f(x, y) is defined as the vector:

\nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix}   (2.12)

The magnitude of the gradient is defined as:

\mathrm{mag}(\nabla f) = \sqrt{G_x^2 + G_y^2}   (2.13)

The edge detection result is the magnitude of the gradients in the x and y directions of an image. Detecting edges is useful for feature extraction or object recognition. There are many methods to detect edges: the Sobel, Canny, Prewitt, and Roberts methods, as discussed in [10]. The main difference between these methods is the assigned weights of the kernels used to compute the gradients. The Sobel operator calculates the gradient of the image intensity at every point. The Sobel kernels used to calculate the gradients in the x and y directions are:

\nabla_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}   (2.14)

\nabla_y = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}   (2.15)

If there is no change in a 3 x 3 pixel area, the kernel response sums to 0, which means that there is no change in intensity over that particular area in the image. Performing two-dimensional convolution with the kernel in (2.14) and an input image will yield the gradient in the x-direction. Likewise, performing two-dimensional convolution with the kernel in (2.15) and an input image will yield the gradient in the y-direction. The Prewitt kernels are

\nabla_x = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}   (2.16)

\nabla_y = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}   (2.17)

The weight value of 2 in the Sobel kernels results in a smoother output. A Sobel edge detection result is displayed in Fig 2.2. The input is a synthetic image of rectangles. G_x and G_y are shown as images, as well as the magnitude of the gradients.

(a) Input image  (b) Horizontal edges  (c) Vertical edges  (d) Magnitude of gradients
Figure 2.2: Sobel Edge detection

2.6 Hough Transform

Background

The Hough transform is used to detect straight lines in an image. Consider the slope-intercept form y = mx + b, where m is the slope and b is the y-intercept. It can describe a straight line between a pair of points in Cartesian space. Computers have a problem calculating m for vertical lines, which have a slope that does not exist (division by zero). Instead, a point is described in parameter space by the variables \rho and \theta, where \rho is the distance between the line and the origin, and \theta is the angle from the origin to the closest point on the line. This method for describing a line in parameter space is known as the Hough transform. The equation for a straight line can be written as:

y = -\frac{\cos\theta}{\sin\theta} x + \frac{\rho}{\sin\theta}   (2.18)

which is simplified to:

\rho = x \cos\theta + y \sin\theta   (2.19)

The parameter space of the image is divided into accumulator cells. The accumulator value A(i, j) in a cell represents the number of points in the Cartesian coordinate system that lie on a straight line. As the number of accumulator cells increases, so does the accuracy of the collinearity of these points in the Cartesian coordinate system. The procedure for implementing the Hough transform is as follows:
1. Compute the gradients of the image and choose a threshold to create a binary image.
2. Create sub-divisions for the accumulator cells in the parameter space.
3. Evaluate the value of each accumulator cell. A higher value indicates a strong edge in the Cartesian coordinate system.

Several points are plotted in the Cartesian coordinate system in Fig 2.3. There is a line connecting three of the points on the plot. The same points that are represented in the Cartesian coordinate system are represented in parameter space via the Hough transform in Fig 2.4. Each point from the Cartesian plot is represented as a cosine wave in parameter space. Wherever a cosine wave intersects another cosine wave indicates a line in Cartesian space. The more cosine wave intersections at one location, the stronger the line in Cartesian space, or the edge in an image. An example of implementing the Hough transform on an image is in Fig 2.5.

Figure 2.3: Points on the Cartesian coordinate system

Figure 2.4: Points from the Cartesian system in Fig 2.3 transformed into parameter space (\theta in radians versus \rho).

(a) Input  (b) Hough transform  (c) Line segments extracted
Figure 2.5: The Hough transform from the image in (a) is shown as an image in (b). The lines are extracted with the inverse Hough transform and plotted on top of the original image in (c).

2.7 Object Registration

2.7.1 Definition

Sets of data, acquired by sampling the same scene or object at different times or from different perspectives, will be in different coordinate systems. Object registration is the process of transforming the different sets of data into one coordinate system. The original image is known as the reference image. The image to be mapped is known as the target image.

There are four main groups into which image registration is divided: different viewpoints, different times, different sensors, and scene-to-model registration. Different viewpoints (multiview analysis) refers to images of the same scene acquired from different viewpoints. The goal is to gain a larger 2D or 3D representation of the scene. Different times (temporal analysis) refers to images of the same scene acquired at different times. Different sensors (multimodal analysis) refers to images of the same scene acquired by different sensors. The goal is to integrate the information obtained from the different sensors to gain a more detailed scene representation. Scene-to-model registration compares the images of a scene to a computer model of the same scene. The model can be a computer representation of the scene (for example, digital elevation maps). The goal is to localize the acquired image in the scene/model and to compare them.

2.7.2 Classifications

Area-based algorithms look at the structure of the image with correlation metrics, Fourier properties, and other means of structural analysis. Instead of looking at the overall structure of images, feature-based algorithms concentrate on image features such as lines, curves, points, line intersections, and boundaries. Standard edge detection methods such as the Canny or the Laplacian detector can be implemented to extract these features.

2.7.3 Methods of Object Registration

Object registration can be implemented in either the spatial or the frequency domain. Traditionally, image registration is used between a set of images of the same scene at different angles.
The transformation from one image to the next is estimated. However, in this research the camera is moving, which introduces more complexity. The objects change size, color intensities can be brighter or darker due to shadows, and additive noise can disrupt image registration.

Spatial domain

Image registration in the spatial domain chooses matching sets of control points between images. Features, structures, and textures are the matching criteria. After the control points have been selected, the transformation from one image to the other is estimated. Consider four pairs of control points (v, w) and (x, y) for the input image and the reference image, respectively. The bilinear approximation is given by:

x = c_1 v + c_2 w + c_3 v w + c_4   (2.20)

y = c_5 v + c_6 w + c_7 v w + c_8   (2.21)

The coefficients describe the transformation from the reference image to the input image. In practice, obtaining a reference image with only stationary elements is not always possible, and building a reference from a set of images containing one or more moving objects becomes necessary. This applies particularly to situations describing busy scenes or in cases where frequent updating is required [15].

Frequency domain

Some algorithms use properties in the frequency domain to directly determine shifts between two images. The phase correlation method is applied to a pair of images and produces a third image with a single peak. The location of this peak corresponds to the relative translation between the two images. The phase correlation algorithm is described by the following steps:

1. Compute the 2D Fourier transform of each image.
2. Calculate the cross-power spectrum by taking the complex conjugate of the second image's Fourier transform, multiplying the Fourier transforms together element-wise, and normalizing the product element-wise.
3. Obtain the normalized cross correlation by applying the inverse Fourier transform.
4. Determine the location of the peak.

2.7.4 Normalized Cross Correlation

Image cross correlation is defined as:

\gamma(x, y) = \frac{\sum_{s,t} [f(s,t) - \bar{f}(s,t)][h(x+s, y+t) - \bar{h}]}{\sqrt{\sum_{s,t} [f(s,t) - \bar{f}(s,t)]^2 \sum_{s,t} [h(x+s, y+t) - \bar{h}]^2}}   (2.22)

where h (the template) is the image of the object to be located within f. The primary reason to use image correlation is for matching. The cross correlation coefficient will be maximum where the template and image are co-located, or displaced on top of each other. A generic output for normalized cross correlation is shown in Fig 2.6. The sharp peak represents the maximum cross correlation coefficient (see [15] for details). There is a disadvantage to using image correlation in that there is a high probability that a window containing a smooth area without any prominent details will be matched incorrectly with other smooth areas in the reference image due to its non-saliency [10]. It is clear that normalized cross correlation is not the ideal approach to feature tracking, since it is not invariant with respect to imaging scale, rotation, and perspective distortions [16].

Figure 2.6: Normalized cross correlation output

2.8 Optical Flow

Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. By calculating optical flow in a sequence of images, one can detect the motion of objects or people. This detection could be very useful in many situations, such as automatic cruise control in a vehicle, slow motion of geographical structures (i.e. glaciers), and other situations where someone wants to track moving objects.
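Before turning to the details of optical flow, the matching step behind (2.22) can be made concrete. The sketch below is not the thesis's implementation; it is a direct, unoptimized NumPy loop over every template placement, with illustrative array sizes, and it uses the standard zero-mean form of the correlation coefficient.

```python
import numpy as np

def normalized_cross_correlation(f, h):
    """Slide template h over image f and return the coefficient of (2.22)
    for every valid placement (both arrays are grayscale)."""
    f = f.astype(float)
    h = h.astype(float)
    M, N = h.shape
    h_zero = h - h.mean()                      # template minus its mean
    h_energy = np.sqrt((h_zero ** 2).sum())
    out = np.zeros((f.shape[0] - M + 1, f.shape[1] - N + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            win = f[y:y + M, x:x + N]
            win_zero = win - win.mean()        # local window minus its local mean
            denom = np.sqrt((win_zero ** 2).sum()) * h_energy
            if denom > 0:
                out[y, x] = (win_zero * h_zero).sum() / denom
    return out

# Usage: the peak of the output marks where the template best matches.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((60, 80))
    template = image[20:30, 40:55]            # crop a patch to act as h(x, y)
    gamma = normalized_cross_correlation(image, template)
    print(np.unravel_index(gamma.argmax(), gamma.shape))   # expected near (20, 40)
```

In practice, fast variants such as the one used in [17] avoid this double loop, since the brute-force search is slow for large images.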
There are many methods to calculate optical flow, such as the Horn-Schunck method or the Lucas-Kanade method. The main difference between these two methods is the constraint added to solve the problem of one equation with two unknowns. For this research, the Horn-Schunck method is arbitrarily chosen to be studied.

In an ideal environment, a point on any object in motion does not change intensity over time. Therefore the derivative of intensity with respect to time must be zero:

\frac{dE}{dt} = 0   (2.23)

The intensity of a pixel is defined as E(x, y, t), where x and y are the position coordinates and t is time. Expanding the total derivative of E(x, y, t) yields the equation:

\frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0   (2.24)

To simplify the notation, denote the partial derivatives of intensity with respect to x, y, and t as E_x, E_y, and E_t respectively, and replace dx/dt with u and dy/dt with v. The variable u denotes the velocity in the x-direction, and the variable v denotes the velocity in the y-direction. The resulting equation is shown below.

E_x u + E_y v + E_t = 0   (2.25)

In vector form:

\begin{bmatrix} E_x & E_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -E_t   (2.26)

The component of movement in the direction of the brightness gradient is:

\frac{-E_t}{\sqrt{E_x^2 + E_y^2}}   (2.27)

This constraint is known as the aperture effect. The velocity component that can be computed lies in the direction of the brightness gradient. This effect is evident on a barber pole, as in Fig 2.7. The pole is actually rotating from left to right, but the gradients appear to move up. This effect leads to inaccurate estimates of the direction of the velocity of the object.

Figure 2.7: Barber pole

There is a small problem with equation (2.26). E_x, E_y, and E_t can be estimated; however, u and v are unknown and must be calculated. Therefore, there is one equation and two unknowns. An additional constraint is needed to solve this equation. Horn and Schunck introduce a smoothness constraint, which is the assumption that if an object is in motion, neighboring pixels move with similar velocities, rather than every point on the brightness pattern moving independently. For example, if an image sequence contains a moving blue ball, then it is assumed that all of the pixels that comprise the ball are moving in the same direction. Horn and Schunck introduce the smoothness constraint by minimizing the square of the gradient magnitudes of the optical flow velocity:

\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 \quad \text{and} \quad \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2   (2.28)

The first step in calculating optical flow is estimating the partial derivatives E_x, E_y, and E_t. Each estimate is the average of the first four differences taken over neighboring pixels. The partial derivatives are described by:

E_x \approx \frac{1}{4}(E_{i,j+1,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i+1,j,k} + E_{i,j+1,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i+1,j,k+1})   (2.29)

E_y \approx \frac{1}{4}(E_{i+1,j,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i,j+1,k} + E_{i+1,j,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i,j+1,k+1})   (2.30)

E_t \approx \frac{1}{4}(E_{i,j,k+1} - E_{i,j,k} + E_{i+1,j,k+1} - E_{i+1,j,k} + E_{i,j+1,k+1} - E_{i,j+1,k} + E_{i+1,j+1,k+1} - E_{i+1,j+1,k})   (2.31)

The box in Fig 2.8 is a visualization of the pixels used to estimate the partial derivatives.

Figure 2.8: All three partial derivatives at the center are calculated from the average of the first four differences of adjacent pixels. The column index is denoted as j, the row index is denoted as i, and the time index is denoted as k.

The Laplacians of the flow velocities u and v need to be estimated as well.
The estimates can be written as

\nabla^2 u \approx \kappa(\bar{u}_{i,j,k} - u_{i,j,k}) \quad \text{and} \quad \nabla^2 v \approx \kappa(\bar{v}_{i,j,k} - v_{i,j,k})   (2.32)

where the local averages of u and v are:

\bar{u}_{i,j,k} = \frac{1}{6}[u_{i-1,j,k} + u_{i,j+1,k} + u_{i+1,j,k} + u_{i,j-1,k}] + \frac{1}{12}[u_{i-1,j-1,k} + u_{i-1,j+1,k} + u_{i+1,j+1,k} + u_{i+1,j-1,k}]   (2.33)

\bar{v}_{i,j,k} = \frac{1}{6}[v_{i-1,j,k} + v_{i,j+1,k} + v_{i+1,j,k} + v_{i,j-1,k}] + \frac{1}{12}[v_{i-1,j-1,k} + v_{i-1,j+1,k} + v_{i+1,j+1,k} + v_{i+1,j-1,k}]   (2.34)

After the estimates have been made for the partial derivatives and the Laplacians, the problem becomes a minimization of the error in the constant brightness equation in (2.25) and the departure from smoothness in (2.28). The total error to be minimized becomes:

\varepsilon^2 = \int\!\!\int (\alpha^2 \varepsilon_c^2 + \varepsilon_b^2)\, dx\, dy   (2.35)

where \varepsilon_c is the error from the departure from smoothness, \varepsilon_b is the error from the brightness equation, and \alpha is a weighting value. The value of \alpha plays a significant role only in areas where the brightness gradient is small, preventing haphazard adjustments to the estimated flow velocity occasioned by noise in the estimated derivatives [13]. Using a least squares minimization on (2.35), and substituting the estimates for the partial derivatives and Laplacians, yields the iterative solution:

u^{n+1} = \bar{u}^n - \frac{E_x(E_x \bar{u}^n + E_y \bar{v}^n + E_t)}{\alpha^2 + E_x^2 + E_y^2}   (2.36)

v^{n+1} = \bar{v}^n - \frac{E_y(E_x \bar{u}^n + E_y \bar{v}^n + E_t)}{\alpha^2 + E_x^2 + E_y^2}   (2.37)

Optical flow is calculated using synthetic images as inputs. The inputs and results are shown in Fig 2.9.

(a) Frame 1  (b) Frame 2  (c) Optical flow
Figure 2.9: Optical flow calculated between the two input images. The block in the top right corner is moving diagonally down from left to right, and the block in the bottom right corner is moving from right to left.

Chapter 3
Experiments and Results

This chapter presents an overview of validating the digital image processing techniques discussed in the previous chapter. Synthetic data and acquired data are both used in this validation. Synthetic data is useful for testing digital image processing algorithms, because it can be free from noise, artifacts from video coding, and background clutter. After the object registration methods are validated on the synthetic data, they are used on the acquired data from the digital camera. Results are discussed.

3.1 Synthetic Data

Sobel Edge Detection

The synthetic image in Fig 3.1 is used as the input to the Sobel edge detection algorithm. The magnitude of the gradients in the x and y directions is displayed in Fig 3.2(a).

Figure 3.1: Synthetic image for object registration

Figure 3.2: (a) The magnitude of the x and y gradients using the Sobel kernels. (b) Lines extracted from the Hough transform are plotted on top of the grayscale input image.

Hough Transform

The result from the Sobel edge detection is used as the input to the Hough transform. The lines are extracted from the Hough transform and plotted on top of the original input image in grayscale in Fig 3.2(b). The Hough transform detects the strongest edges or gradients in the image. Not all edges were detected by the Hough transform in this experiment. The Hough transform extracted the longest lines in the image, which compose the larger blocks. None of the edges of the block in the lower left-hand corner were extracted. Even if the Hough transform detects all of the edges, there is no method to determine which block is which from this information alone.

Normalized Cross Correlation

The red block is cropped and stored in memory. This red block is assigned as the function h(x, y) in the normalized cross correlation equation. The image in Fig 3.1 is stored as the function f(x, y).
After the cross correlation coefficients are calculated for all possible shifts of h(x, y) onto f(x, y), the maximum coefficient indicates where h(x, y) matches f(x, y). The output is shown as a surface plot of all of the normalized cross correlation coefficients in Fig 3.3. The red block is the group of coefficients to the left. The group to the right is from the red straight line. The highest correlation value is 1, which occurs near the middle of the red block. This method successfully identified the template h(x, y) within the image f(x, y).

Figure 3.3: This is a surface plot of the magnitude of the normalized correlation coefficients. The red block is on the left, and the red straight line is on the right. Notice that the maximum coefficient is in the middle of the actual red block.

3.2 Acquired Data

Digital video was obtained in an arena designed for a robotic competition. Four differently colored blocks were placed in front of a robot. The background and rink introduce additional difficulties such as changing colors on parts of the floor, differently textured surfaces such as pebbles, and a window. A digital camera was placed on top of a robot in the rink, which was pushed forward for a couple of seconds. A total of 299 images were used as data to process. A frame taken from this video is in Fig 3.4. In this section, the edge detection, Hough transform, and normalized cross correlation operations are applied to this image.

Figure 3.4: The output from the digital camera. Included in the image are the four blocks to be detected. There is a black background on the rink, there is a blue and white surface, and there are some pebbles.

3.2.1 Edge Detection

The Sobel edge detection discussed in the previous chapter is used on the image in Fig 3.4. The magnitude of the gradients is in Fig 3.5. As one can see, the presence of the pebbles introduces some noise. The edges of the window, the rink, and the change of color on the floor are detected, as well as the blocks.

Figure 3.5: Edge detection on the acquired data

3.2.2 Hough Transform

The Hough transform as discussed in the previous chapter is applied to the edge detection result in Fig 3.5. The results of the Hough transform are in Fig 3.6. As expected, the strongest lines are extracted, which are the lines comprising the rink and the window. It is apparent at this point that the edge detection and Hough transform operations alone are not very useful in registering objects.

Figure 3.6: Hough transform on the acquired data

3.2.3 Normalized Cross Correlation

Finally, the normalized cross correlation operation is performed on the acquired data. The red block is stored as the template h(x, y) for normalized cross correlation in equation (2.22). The input image from Fig 3.4 is stored as the function f(x, y). The output is a matrix of correlation values displayed as a surface plot in Fig 3.7. The red color on the surface plot coincides with a high correlation value, and the blue color represents a low correlation value. There is strong correlation across the lower borders of the rink, the edge of the pebble surface, and some bricks in the background. The red block has a medium correlation coefficient (light blue color).

3.3 Comments

The edge detection and Hough transform are useful methods to extract the gradients of the images. Normalized cross correlation works well with the synthetic data, but has issues with the acquired data.
An issue in using normalized cross correlation is that h(x, y) and f(x, y) need to be in grayscale. Intensity information in RGB space is lost in this process. As a result, there are many areas away from the block with high correlation coefficients. It is evident that there needs to be an algorithm that uses these digital image processing techniques in combination to form a robust computer vision algorithm.

Figure 3.7: The output is a surface plot of the normalized cross correlation coefficients. The red surface indicates a high correlation coefficient and blue represents a low correlation coefficient.

Chapter 4
Statistical-gradient Object Registration

This chapter presents a new method for object registration. This method detects objects by combining a series of two identification methods. There are two common types of methods that are used to identify objects: area-based and feature-based, as discussed in [10]. Area-based classifications look at the structure of the image with correlation metrics or Fourier analysis. Feature-based methods concentrate on the overall structure of the image, such as edges, line intersections, and curves. The first identification method used here is based on the local averages and variances of surrounding pixels. The result is a binary image, which is used with the second method, which uses the Hough transform to extract the gradients.

4.0.1 Area-Based

This area-based method uses six parameters to classify an object: the local averages and variances for the red, green, and blue channels in a given pixel window of a digital image. A template h(x, y) contains the object that is to be located in an image f(x, y). The object in h(x, y) is stored in the computer as a three-dimensional array (M x N x 3). The three dimensions are composed of pixel values for the red, green, and blue channels. Statistical analysis is used to characterize each color matrix. First, the average value for each color matrix is calculated. Then the variance for each pixel in the template h(x, y) is calculated. After the template h(x, y) has been classified with this method, the image f(x, y) is searched for areas that match the parameters of the template h(x, y). If a pixel in f(x, y) matches the parameters given by h(x, y), then the location of the pixel is mapped into a separate two-dimensional output matrix with the same row-column size as the image f(x, y). A tolerance value is utilized that allows values with a small deviation from the calculated parameters to be accepted as a match. Refer to the appendix for specific tolerance values. Note that this method fails to incorporate any information regarding the shape of the object. This fault could have minimal effect; however, if there are many objects of similar color, then it would be difficult to determine with confidence that the object detected is the desired one. Therefore, the result from this area-based method is tested with the Hough transform to match the shape of the object.

4.0.2 Feature-Based

The output matrix from the proposed area-based method is then used as the input to the Hough transform. After the Hough transform is performed, the output lines are sent to an algorithm that checks the angle of intersection between the line segments. From the synthetic and acquired data discussed in Chapter 3, it is evident that the objects are cubes. Therefore, the intersections of the gradients of the image are perpendicular. To calculate the angle of intersection, the slope of each line is calculated.
After all of the slopes are calculated, the algorithm compares the negative inverse of each slope to the slopes of all the other line segments. If they are equal within a controllable window of tolerance, then the lines are perpendicular. If there is no match, the line segment is thrown out. If the line intersections are perpendicular, then the probability that the detected region is the template h(x, y) is high. If the angles are not perpendicular, then there is some error. An experiment with randomly distributed line segments is used to test the gradient method for determining whether line segments are perpendicular within 3 degrees of tolerance. The plot in Fig 4.1 contains the randomly created line segments and the output, which is a plot of the lines that are perpendicular. Other sets of lines appear nearly perpendicular to each other, which is the reason that the tolerance of error (in degrees) can be adjusted. This procedure works on these randomly created line segments, so the next step is to use this method on the real line segments extracted from the Hough transform.

Figure 4.1: The plot on the left contains 10 randomly generated line segments. The plot on the right shows the line segments that are perpendicular to each other.

4.0.3 Code Outline

This section provides an outline of the software used to implement the statistical-gradient object registration method; an illustrative code sketch of these steps is given at the end of this subsection.

Inputs: the template h(x, y) and the image f(x, y)
Output: an image of the detected object with lines drawn around it

1. Calculate the dimensions N_h x M_h of the template h(x, y) and N_f x M_f of the input image f(x, y).
2. Calculate the average value and variance of the pixel values in the template h(x, y) for each RGB channel.
3. Create an averaging mask with the row-column dimensions N_h x M_h.
4. Increase the resolution of f(x, y) to 2N_f x 2M_f with nearest-neighbor interpolation.
5. Perform two-dimensional convolution with the averaging mask and the input image f(x, y) for each RGB channel separately.
6. Calculate the variance for each pixel in the input image f(x, y) within the two-dimensional window spanned by the size of the template h(x, y), using the output of the two-dimensional convolution of the input image f(x, y) with the averaging mask, for each RGB channel separately.
7. Set the window of tolerance for accepting values as a match with the template.
8. Search the average values and variances from the input image f(x, y) in each RGB channel and compare them with the six parameters calculated from the template h(x, y).
9. If the values satisfy all six parameters, store the pixel locations in a separate output matrix out(x, y) with row-column dimensions 2N_f x 2M_f.
10. Use the output matrix out(x, y) as the input to the Hough transform.
11. Calculate the slope in degrees relative to the horizontal for all line segments.
12. Verify that the line segments are perpendicular.
13. Calculate the center and area of the object.
14. Plot all perpendicular line segments onto the input image f(x, y) with row-column dimensions 2N_f x 2M_f.

To experimentally validate this statistical-gradient object registration method, synthetic data (Fig 4.2) and acquired data (Fig 3.4) are used as the input images. Synthetic data is useful for testing digital image processing algorithms, because there is no noise, no artifacts from the video coding, and a less busy background. After the object registration method works on the synthetic data, it is used on the acquired data from the digital camera.
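The original software is not reproduced in the thesis, so the following Python/NumPy/OpenCV sketch is only an illustrative condensation of steps 1-14, not the author's implementation. The function names, tolerance values, and Hough parameters are assumptions, and the nearest-neighbor resolution doubling of step 4 is omitted for brevity.

```python
import numpy as np
import cv2  # used only for filtering and the probabilistic Hough transform

def template_stats(h):
    """Step 2: the six parameters -- mean and variance of each RGB channel of h."""
    flat = h.reshape(-1, 3).astype(float)
    return flat.mean(axis=0), flat.var(axis=0)

def area_based_mask(f, h, tol_mean=20.0, tol_var=60.0):
    """Steps 3, 5-9: mark pixels of f whose local RGB statistics, computed over
    a window the size of h, match the template parameters within tolerance."""
    mh, nh = h.shape[:2]
    mean_t, var_t = template_stats(h)
    f = f.astype(float)
    kernel = np.ones((mh, nh)) / (mh * nh)              # averaging mask (step 3)
    match = np.ones(f.shape[:2], dtype=bool)
    for c in range(3):
        ch = f[:, :, c]
        local_mean = cv2.filter2D(ch, -1, kernel)       # local average (step 5)
        local_var = cv2.filter2D(ch ** 2, -1, kernel) - local_mean ** 2  # step 6
        match &= np.abs(local_mean - mean_t[c]) < tol_mean
        match &= np.abs(local_var - var_t[c]) < tol_var
    return (match * 255).astype(np.uint8)               # binary output matrix (step 9)

def perpendicular_segments(segments, tol_deg=9.0):
    """Steps 11-12: keep segments roughly perpendicular to at least one other segment."""
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180.0
              for x1, y1, x2, y2 in segments]
    keep = []
    for i, a in enumerate(angles):
        diffs = [abs(abs(a - b) - 90.0) for j, b in enumerate(angles) if j != i]
        if diffs and min(diffs) < tol_deg:
            keep.append(segments[i])
    return keep

def register(f, h):
    """Steps 10 and 13: Hough lines on the binary mask, then centroid and area."""
    mask = area_based_mask(f, h)
    hough = cv2.HoughLinesP(mask, 1, np.pi / 180, threshold=30,
                            minLineLength=10, maxLineGap=5)
    segments = perpendicular_segments([l[0] for l in hough]) if hough is not None else []
    ys, xs = np.nonzero(mask)
    center = (xs.mean(), ys.mean()) if xs.size else None
    return center, xs.size, segments                    # center, area, kept segments
```

The centroid and the count of matched pixels returned here stand in for the object center and area outputs of steps 13 and 14; plotting the kept segments onto the input frame completes the outline.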
4.0.4 Validation using Synthetic Images

A synthetic red block is created and placed in the bottom left-hand corner. There is also a line with the same characteristics, along with blue and green blocks added to the image. The RGB values for the red block were set to the values in Table 4.1. A 6 x 6 pixel block was created to resemble the real red block on the real surface from the acquired data. The dimensions of the block were chosen because it was desired to have a block that was smaller than the other blocks in the image. Having a smaller block introduces a higher level of difficulty for digital image processing algorithms. Table 4.1 describes the synthetic block according to the expected value and variance of its red, green, and blue parameters. An image is created with the red block of Table 4.1 along with a green block, a blue rectangle, and a red line with the same characteristics as the synthetic block described by Table 4.1. This image is displayed in Fig 4.2.

Table 4.1: Synthetic Block Characteristics

          Expected Value   Variance
  Red     219              0
  Green   60               0
  Blue    54               0

Figure 4.2: Synthetic image for object registration

Statistical-gradient Object Registration

This method is tested on the synthetic image in Fig 4.2, and the output image is shown in Fig 4.3. As seen in Fig 4.3, the red block was detected.

4.0.5 Validating on Acquired Images

The next step is to test this method on the acquired data, which has more background noise. To demonstrate the algorithm on more realistic data, digital video was obtained in an arena designed for a robotic competition. A total of 299 images were used as data to process. The thirty-seventh frame is shown in Fig 3.4. The statistical-gradient method is applied to the image sequence.

Figure 4.3: The output from the algorithm used on the input from Fig 4.2.

Registering the Red Block

The red block was copied from the input image in Fig 3.4 and saved as the template h(x, y). The pixel values of each RGB channel of h(x, y) were processed with digital image processing techniques. The statistical values from this operation are given in Table 4.2. The statistical-gradient registration method is performed on the entire image sequence (299 frames) taken from the digital camera. The gradient result can be seen in Fig 4.4. All of the lines are perpendicular within 9 degrees of tolerance. The same template is used for the entire image sequence. This registration method located the template h(x, y) in all 299 of the input images f(x, y). If h(x, y) is smaller than the object in f(x, y), it appears to have minimal effect on the outputs.

Figure 4.4: The output from the line segment algorithm on the lines extracted from the Hough transform on the output from the object identification method.

Table 4.2: Red Block Characteristics

          Expected Value   Variance
  Red     202.30           85.08
  Green   71.41            40.07
  Blue    32.90            58.00

Four output frames are displayed in Fig 4.5. A vector from the origin to the centroid of the block is plotted on top of the image. The location of the center of the registered object is an output, as well as the area of the object. The area of the object is used to interpret whether the object is moving closer or farther away. The area of the block generally increases as the camera moves closer.

(a) Frame 37  (b) Frame 64  (c) Frame 91  (d) Frame 129
Figure 4.5: Four outputs of the registered red block

Registering the Blue Block

Additionally, the blue block in the image is registered as well. The template h(x, y) is the blue block.
Registering the Blue Block

Additionally, the blue block in the image is registered. The template h(x, y) is the blue block. The six statistical parameters used to classify the object are in Table 4.3. Four output frames from the video sequence are shown in Fig 4.7.

Table 4.3: Blue Block Characteristics
           Expected Value   Variance
    Red          51.08         40.28
    Green        59.94         16.62
    Blue         82.26         38.85

Figure 4.7: Four outputs of the registered blue block ((a) Frame 37, (b) Frame 64, (c) Frame 91, (d) Frame 129).

4.0.6 Motion Prediction

It is assumed that the objects are moving at a constant velocity; therefore it is reasonable to predict where an object will be in the next frame. If the location of the object can be accurately estimated, then the search area of the input frame f(x, y) can be significantly reduced. Knowledge of the location of the object would reduce the processing time, as well as eliminate false detections. The open-loop estimation of a moving object in an image sequence can be described by the following equation:

\[
\mathrm{Obj}_{k+1}(x, y, z) =
\begin{bmatrix}
x_k + (x_k - x_{k-1}) \\
y_k + (y_k - y_{k-1}) \\
z_k + (z_k - z_{k-1})
\end{bmatrix}
\tag{4.1}
\]

where x and y refer to the coordinates of the center of the registered object, and z refers to the size of the registered object. Fig 4.8 and Fig 4.9 contain plots of the estimated centers of the blue and red blocks obtained with equation (4.1). The largest prediction error for the x-coordinate of the center of the object is about 20 pixels; the error is about the same for the y-coordinate. The error in the prediction of the area of the block is significantly higher, due to the nonlinear growth of the area of the block as it gets closer to the camera. The mean, variance, and standard deviation of the motion prediction errors for the red and blue blocks are in Table 4.4 and Table 4.5, respectively. To further reduce these prediction errors, a Kalman filter could be implemented.

Figure 4.8: Plots of the X estimation, Y estimation, and Z estimation of the red block.

Figure 4.9: Plots of the X estimation, Y estimation, and Z estimation of the blue block.

Table 4.4: Motion Prediction Error for the Red Block
         Expected Value (pixels)   Variance (pixels^2)   Standard Deviation (pixels)
    X             2.04                   30.79                      5.54
    Y             1.39                   58.87                      7.67
    Z            11.61                  42,825                    206.94

Table 4.5: Motion Prediction Error for the Blue Block
         Mean (pixels)   Variance (pixels^2)   Standard Deviation (pixels)
    X        0.9052             78.02                      8.8
    Y        0.9118             23.10                      4.81
    Z       32.73              11,930                    109.22
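As a concrete reading of the constant-velocity predictor in equation (4.1), the sketch below advances the state (x, y, z) by one frame from the two most recent measurements; the function name and the example numbers are illustrative only, and a Kalman filter would replace this open-loop update as noted above.

    import numpy as np

    def predict_next(prev, curr):
        # Open-loop constant-velocity prediction of equation (4.1):
        # prev and curr hold [x, y, z] for frames k-1 and k, where (x, y) is the
        # registered object's center and z its size; returns the frame k+1 estimate.
        prev = np.asarray(prev, dtype=float)
        curr = np.asarray(curr, dtype=float)
        return curr + (curr - prev)

    # Illustrative numbers: the center moves 5 px right and 2 px down while the
    # area grows by 30 px^2 between frames.
    print(predict_next([100.0, 80.0, 400.0], [105.0, 82.0, 430.0]))  # [110. 84. 460.]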
4.0.7 Stop Sign Registration

To further validate the statistical-gradient registration method, video data was recorded with a web cam placed on top of a car. The video data was divided into 55 images. The video shows the car driving toward a stop sign. There is a railroad crossing sign behind the stop sign, as well as many trees and dirt that add clutter to the background. The stop sign is the desired object to track in this video sequence. Two shades of red are stored as the template h(x, y) to search for within the input images, because the color intensity gets brighter as the distance between the vehicle and the sign decreases. The results from the statistical-gradient registration method are shown in Fig 4.10. The white lettering on the stop sign causes the algorithm to detect the stop sign in two sections or to find only blocks of red on the sign. When the stop sign is far away, as in Frame 11, the sign is so small that the gradients (green lines) do not compose an octagon.

Figure 4.10: Four outputs from the object orientation method ((a) Frame 11, (b) Frame 17, (c) Frame 28, (d) Frame 38).

4.0.8 Multiple Green Signs Registration

To further validate the statistical-gradient registration method, video data taken from the dashboard inside a vehicle was used; it contains multiple green signs on the side of the road, located 200 feet apart. Fig 4.11 contains a frame from the video. The green sign is used as the template h(x, y). The output images are in Fig 4.12.

Figure 4.11: A frame of the video sequence of the green signs.

Figure 4.12: Four outputs from the video sequence of the green signs ((a) Frame 1, (b) Frame 153, (c) Frame 229, (d) Frame 278).
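The text does not spell out how several simultaneous matches within one frame are separated; one plausible way to do so (an assumption, not the author's documented method) is to label the connected regions of the acceptance mask and register each region on its own, as sketched below.

    import numpy as np
    from scipy.ndimage import label

    def register_each_region(out):
        # Split a boolean acceptance mask into connected regions and return the
        # centroid and pixel area of each, so every green sign in the frame is
        # registered separately.
        labels, count = label(out)
        regions = []
        for k in range(1, count + 1):
            rows, cols = np.nonzero(labels == k)
            regions.append(((rows.mean(), cols.mean()), rows.size))
        return regions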
Chapter 5
Conclusions

This thesis has presented multiple object registration methods to address the detection of objects in different settings, such as indoor and outdoor environments. These registration methods work well on noiseless synthetic data, but prove to be troublesome on noise-corrupted data obtained from a camera. Quantization noise was seen to be a major source of corrupted data in digital image processing, as well as blurring from motion of the camera or of an object.

Optical flow was seen to be difficult to implement for object registration and tracking for a number of reasons. For the image sequences used in this research, the camera was in motion, so all objects and the background are moving. This motion makes it difficult to register objects that are still. If other objects were in motion, their relative velocities could then be segmented.

Feature-based methods such as edge detection and the Hough transform provided useful results for beginning to classify objects; however, a higher level of computer processing is needed to register the object. Statistics of areas of pixels can also provide useful information for object registration. Combining these two methods ensures that the template h(x, y) has been successfully matched in the input image f(x, y).

The red and blue blocks from the first experimental study were registered in every frame of the video sequence. The resolution is changed via nearest-neighbor interpolation to enlarge the gradients when the block is distant from the camera. When the block is closer to the camera, the gradients become slightly distorted due to the resolution increase, as was seen in Fig 4.5(d). The algorithm also exhibited some limitations on the stop sign and green sign images, because the template h(x, y) is small relative to the objects as the camera moves closer to them, and the gradients of the objects also change. When the stop sign is far away, the shape is not an octagon but a red blur; as the stop sign gets closer, the octagon shape is recognizable.

Chapter 6
Future work

Future work entails more extensive testing of the algorithm, as well as implementing it in real time. Experimental validation through a leader-follower demonstration is another goal for the future. One experimental set-up could be to visually tag a leader robot with a specific mark that the follower robot would search for. After that demonstration is successful, a larger-scale experiment could fuse the vision system with other sensors, such as an IMU or GPS, to follow the leader vehicle.

Another goal is to write algorithms for the leader vehicle. At this point in the research, an algorithm has been written to register objects; however, an algorithm for the leader vehicle to identify good features to track is also important. The leader vehicle must survey the environment and decide which objects or other features of the image are good features to track. After the leader vehicle has done this, it will transmit this data to the follower vehicle.
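The feature-selection step described for the leader vehicle remains future work; as one possible starting point (an assumption, not part of the work in this thesis), the Shi-Tomasi corner detector in OpenCV returns image points that are well suited for tracking and could be transmitted to the follower.

    import cv2

    def candidate_features(frame_bgr, max_corners=100):
        # Return up to max_corners Shi-Tomasi corner locations that are good
        # candidates for the leader vehicle to track.
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                          qualityLevel=0.01, minDistance=10)
        return [] if corners is None else corners.reshape(-1, 2)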
Appendices

Appendix A
Nomenclature

f(x, y) - image as a two-dimensional function
h(x, y) - template
p(x, y) - pixel value at location (x, y) in an image
Gx - gradient with respect to x of an image
Gy - gradient with respect to y of an image
F(u, v) - image represented in the frequency domain
A(i, j) - accumulator value of the Hough transform at location (i, j) in parameter space
ρ - the distance from the line to the origin for the Hough transform
θ - the angle of the vector from the origin to the closest point on the line
u - average velocity in the x-direction
v - average velocity in the y-direction
γ - two-dimensional normalized cross correlation coefficient

Appendix B
Orientation Results of the Red Block

Appendix C
Orientation Results of the Blue Block

Appendix D
Tolerance Values for Statistical Analysis

Table D.1: Tolerances from the experiments
                  Expected Value Tolerance   Variance Tolerance
    Red block               15                       50
    Blue block              40                       70
    Stop sign               15                       40
    Green signs             40                      150
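The tolerance windows of Table D.1 are the values passed to the acceptance test of the registration method; the short sketch below shows one way to organize them, where the dictionary layout and the statistical_match function from the sketch in Section 4.0.3 are illustrative assumptions.

    # Tolerance windows from Table D.1, keyed by experiment.
    TOLERANCES = {
        "red block":   {"mean_tol": 15, "var_tol": 50},
        "blue block":  {"mean_tol": 40, "var_tol": 70},
        "stop sign":   {"mean_tol": 15, "var_tol": 40},
        "green signs": {"mean_tol": 40, "var_tol": 150},
    }

    # Example: mask = statistical_match(f, h, **TOLERANCES["red block"])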