Enhanced Visual Navigation for Mobile Robots

by

Brian West

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 4, 2014

Approved by

Thaddeus Roppel, Chair, Associate Professor of Electrical and Computer Engineering
Prathima Agrawal, Samuel Ginn Distinguished Professor of Electrical and Computer Engineering
John Hung, Professor of Electrical and Computer Engineering

Abstract

Navigation solely by vision has been a goal in robotics for many years. It is desirable for a robot to plan and navigate the same way we do naturally as humans; however, computation speed, changing illumination, complex algorithms, and a host of other problems have long plagued progress. One of the most prominent issues in using visible light outdoors and indoors is the unpredictable and changing ambient lighting. Shadows can vastly affect the robot's ability to navigate on a simple path, and often look like separate objects to a robot. Features detected and described under perfect light may not register under dim light. This work focuses on a method developed to address these issues and construct a navigable path using color information, despite ambient and direct lighting. The results show a vast improvement over traditional intensity-based algorithms both in navigation and localization.

Acknowledgment

For my wife, who has sacrificed much for me to pursue my dreams. I couldn't have done it without her love, support, and encouragement. Thank you Mary Grace.

Table of Contents

Abstract
Acknowledgment
List of Tables
List of Figures
List of Abbreviations
Chapter 1: Introduction
Chapter 2: Literature Survey
  2.1: Computer Vision
    2.1.1: A Brief History
    2.1.2: Feature Detection and Tracking
    2.1.3: Optical Flow
  2.2: Color Invariance
  2.3: Simultaneous Localization and Mapping
  2.4: Path Planning and Obstacle Avoidance
Chapter 3: Hardware
  3.1: Camera
  3.2: Computer
Chapter 4: Algorithms
  4.1: Color Invariance
    4.1.1: L-2 Normalization
    4.1.2: Log-Chromaticity Projection
    4.1.3: Combining Models for Path Classification
  4.2: Scale Invariant Feature Transformation
  4.3: Optical Flow
  4.4: Particle Filter
  4.5: Simultaneous Localization and Mapping
    4.5.1: Localization
    4.5.2: Mapping
  4.6: Path Planning and Obstacle Avoidance
Chapter 5: Experiments and Results
  5.1: Color Invariance and Path Classification
  5.2: Color Invariance with Scale Invariant Feature Transform
  5.3: Color Invariance with Optical Flow
  5.4: Color Invariance with Simultaneous Localization and Mapping
  5.5: Color Invariance with Path Planning
Chapter 6: Summary and Future Work
  6.1: Summary
  6.2: Future Work
References
List of Tables

Table 1: Sum of absolute differences between k-means image segmentation, color invariant path detection, and color based path detection
Table 2: Comparison of the number of feature matches between grayscale and log-chromaticity projection images

List of Figures

Figure 1: L-2 norm image comparison
Figure 2: Log-chromaticity projection image comparison
Figure 3: Color invariant path detection
Figure 4: Cost map computed by pixel density
Figure 5: Comparison of full frame feature detection, and the proposed ground plane restrictions
Figure 6: Comparison of binary image before and after perspective correction spatial transformation
Figure 7: Illustration of binary map building
Figure 8: Path detection method comparison
Figure 9: Comparison of keypoint detectors on traditional grayscale vs. log-chromaticity projection
Figure 10: Comparison of SIFT feature matching (1)
Figure 11: Comparison of SIFT feature matching (2)
Figure 12: Path estimation from optical flow global motion measurements
Figure 13: View map built from optical flow motion estimation and color invariance path detection
Figure 14: Google Earth image of path traversed in Figures 12, 13, 15, and 16
Figure 15: Path estimation from SIFT global motion measurements
Figure 16: View map built from SIFT motion estimation and color invariance path detection
Figure 17: Particle filter demonstration using color invariant features (1)
Figure 18: Particle filter demonstration using color invariant features (2)
Figure 19: Particle filter demonstration using color invariant features (3)
Figure 20: Path planning demonstration with color invariance

List of Abbreviations

SIFT  Scale Invariant Feature Transform
SSD   Sum of Squared Differences
SLAM  Simultaneous Localization and Mapping
RGB   Red-Green-Blue
SAD   Sum of Absolute Differences

Chapter 1: Introduction

Since the dawn of automation, man has dreamed of a way to create machines capable of perceiving and interacting with their surroundings. From the first creation of Zadoc P. Dederick's Steam Man [1], we have been seeking new ways to make our creations navigate, manipulate, and understand the world we live in. Since then, there have been many devices constructed for mobile robots to sense their environments. Tactile sensors are used to feel surroundings by registering touch, force, or pressure, but must make physical contact [2]. Sonar sensors "ping" the environment searching for sound waves which bounce off of objects [3], but are inherently noisy. Laser and lidar systems use light in a similar manner to produce point clouds [4], but can be very costly, with the most reliable state-of-the-art systems costing thousands of dollars. Infrared has been used in this manner as well, and also with a camera to view pattern distortion (such as in the Microsoft Kinect) [5], but does not work well outside due to solar interference.

Using the visible light spectrum in the context of computer vision as a sole means of data acquisition for robots has been avoided for many years due to its various problems. Computation speed is just starting to catch up to the increasingly complex algorithms required, and lighting conditions can cause conventional methods to fail. However, while radar, sonar, tactile and other methods of navigation and localization have proven useful, it is desirable for a robot to plan and navigate the same way we do naturally as humans. We have built our environments to support the transfer of information through sight, and enabling robots to see the world as we do will simplify programming, human-robot interaction, and especially navigation.
There is also a considerable cost advantage in using a simple camera over many costly sensors in tandem. Most vision-based systems actually work best with relatively low resolution imaging, reducing the complexity of algorithms and drastically reducing processing time. In fact, simple, inexpensive webcams have proven sufficient in most applications. Expanding vision-based technology will therefore also improve accessibility at all levels of education.

One of the most prominent issues in using visible light as the sole means of data acquisition outdoors is the unpredictable and changing ambient lighting. Shadows can vastly affect the robot's ability to navigate on a simple path, much less paths with complex coloring and texture. Shadowing can also present challenges indoors, as shadows often look like separate objects to a robot. To confront this issue, it is necessary to construct a color-invariant image of the robot's surroundings. This simply means that despite changes in ambient and direct lighting, the color information should remain largely the same. During the course of this research, a method for obtaining a color-invariant image was developed and tested in tandem with computer vision techniques for mobile robot navigation, and will be presented throughout the remainder of this thesis.

Chapter 2: Literature Survey

This chapter serves to provide essential background information regarding the fields of robotics and computer vision. First, a brief history of computer vision and its use in robotics is presented. Second, key methods for providing and interpreting information for navigation and motion estimation are reviewed. Next, the topic of color invariance is explored both in history and in current research. Finally, current robust methods of localization, pose estimation, map building, and path planning with obstacle avoidance pertaining to mobile robots are discussed.

2.1 Computer Vision

In the most general sense, computer vision is the practice of describing the environment and providing information about its properties, including shape, illumination, objects, and colors. Though some researchers include non-visible light (such as infrared or lidar) as a subset of the field of computer vision, the usage within this thesis refers strictly to information gathered in the visible light spectrum by means of a camera.

2.1.1 A Brief History

Computer vision first found its mark in the 1970s, as the agenda for artificial intelligence was becoming more aggressive. It was originally thought that pairing visual input with "intelligent" computers and systems would be an easy step toward a functioning human-like perception and decision making system. According to Boden, a professor at the Massachusetts Institute of Technology once asked a student to "spend the summer linking a camera to a computer and getting the computer to describe what it saw" [6]. Computer vision quickly distinguished itself from the field of image processing, as full scene understanding and three dimensional reconstructions were pursued.

By the 1980s, techniques for stereo correspondence had been established, and motion estimation started taking place in the form of optical flow. Image pyramids (using different levels of resolution) were being used in coarse-to-fine searches and scale-space property inspection, and many algorithms were unified using the same basic mathematical infrastructure [7].
In the 1990s, work was devoted to topics such as structure from motion, facial recognition, image segmentation, and invariant descriptors. Physics-based vision was born out of using color and intensity measurements along with physical models, and optical flow and stereo vision methods were vastly improved. The computer animation business boomed, as animators could reconstruct models from images [7].

Computer vision is becoming increasingly prevalent in today's society. Character recognition is used to read postal codes and license plates [8], industrial parts are quickly assessed for quality control, object recognition is used in self-checkout lines [9], and three dimensional models are created from sequences of images. Computer vision has also found its way into the automotive world, detecting objects and obstacles, and researchers have even built self-driving cars [10]. It is also used for surveillance in diverse applications such as building security, traffic analysis, and locating drowning victims in pools. The most recent research comes from feature-based object recognition and navigation partnered with machine learning and Bayesian techniques [7].

2.1.2 Feature Detection and Tracking

Methods of feature detection and tracking throughout this thesis have been restricted to the Scale Invariant Feature Transform (SIFT) algorithm. SIFT was first proposed by Lowe in [11], and has become a staple in the computer vision community and a baseline against which new keypoint detectors and descriptors are measured. There have been many spin-offs of the SIFT algorithm, such as Principal Component Analysis (PCA)-SIFT [12], Gradient Location-Orientation Histogram (GLOH) [13], and Speeded Up Robust Features (SURF) [14], all of which improve upon computation and work marginally better. More recent methods such as BRISK [15] involve different sampling methods for the descriptors and sub-pixel level accuracy, accomplished by quickly searching at lower resolutions and interpolating for fast sampling and feature recognition. For ease of use and prototyping reasons, all feature detection and tracking in this thesis is performed using a SIFT-type algorithm, as its speed and robustness are more than adequate for the purpose of this research.

2.1.3 Optical Flow

Optical (or optic) flow is a method of motion estimation that tracks the motion of color or intensity differences at each pixel of an image; it first appeared in the 1970s [16]. The most common and general way of doing this is by minimizing the sum of squared differences (SSD) between two subsequent images. This is routinely done by summing over the overlapping regions of the images and looking for minima, or by using Markov random fields and looking for a global minimum. Both methods can be refined in incremental steps to achieve sub-pixel estimation, and both methods have even been used together in instances when the type of motion is known [17]. For example, if a robot is navigating using a camera, the assumption can be made that most motion is from the movement of the robot within a static scene. Local optical flow calculations can be made, and the entire scene can be parameterized to form a global estimation of motion to update the robot's position. Computationally, performing a per-pixel search over the entire range of possible displacements is a slow process, and more efficient means of estimation have been developed more recently. These newer methods also utilize gradient descent and coarse-to-fine searches along with the classical optical flow techniques to attain motion estimation [17].
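To make the classical SSD formulation above concrete, the following MATLAB sketch estimates a single global displacement between two frames by exhaustively searching a small window of candidate shifts. It is a minimal sketch only: the frame file names, the search radius, and the overlap normalization are illustrative assumptions, and this is not the optical flow implementation used later in this work (Section 4.3 relies on RoboRealm's block-matching module).

% Minimal SSD block-matching sketch: estimate one global displacement between
% two frames by exhaustively searching a small window of candidate shifts.
I1 = im2double(rgb2gray(imread('frame1.png')));   % previous frame (assumed file name)
I2 = im2double(rgb2gray(imread('frame2.png')));   % current frame (assumed file name)
[rows, cols] = size(I1);

maxShift = 8;                                     % search radius in pixels (assumption)
bestSSD  = inf;
best     = [0 0];
for dy = -maxShift:maxShift
    for dx = -maxShift:maxShift
        % Overlapping region of the two frames under this candidate shift
        r1 = max(1, 1+dy) : min(rows, rows+dy);
        c1 = max(1, 1+dx) : min(cols, cols+dx);
        d  = I1(r1, c1) - I2(r1-dy, c1-dx);
        ssd = sum(d(:).^2) / numel(d);            % normalize by the overlap area
        if ssd < bestSSD
            bestSSD = ssd;
            best    = [dx dy];
        end
    end
end
fprintf('Estimated global motion: dx = %d, dy = %d\n', best(1), best(2));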
2.2 Color Invariance

Shadows can vastly affect the robot's ability to navigate on a simple path, even more so on paths with complex coloring and texture. Shadowing can present challenges indoors as well as outdoors, as shadows often look like separate objects to a robot. To confront this issue, it is necessary to construct a color invariant image of the robot's surroundings. This simply means that despite changes in ambient and direct lighting, the color information remains largely the same. Xu et al. propose a method to remove shadows from an image using the L-2 normalization of the RGB color values [18]. Another method, based on log-chromaticity space projection, is presented by Finlayson et al. [19]. This approach uses log-domain characteristics to create a color invariant image, projected into intensity values. Finlayson made the observation that similar colors in an image are projected onto lines having similar slopes in log-chromaticity space despite ambient and direct lighting. When projected in the direction orthogonal to the lighting, a color invariant image is obtained.

2.3 SLAM

Simultaneous Localization and Mapping (SLAM) has long been described as a "chicken and egg" problem. For a robot to accurately localize itself, it must have an unbiased map representation. However, to build an unbiased map, a robot must have an accurate pose estimate, accomplished by successfully localizing itself. SLAM has the goal, as described by Siegwart et al., of "starting from an arbitrary initial point, a mobile robot should be able to explore autonomously the environment with its on-board sensors, gain knowledge about it, interpret the scene, build an appropriate map, and localize itself relative to this map" [20]. There has been much research done to try to resolve the technical challenges of the SLAM problem. It is understood that being able to successfully and completely answer the characteristic questions "what does the world look like?" and "where am I in the world?" will give a robot full and true autonomy. Though there are many methods utilizing SLAM, three main algorithms are integrated into nearly all of them, spanning numerous publications: Extended Kalman Filter (EKF) SLAM [21], Graph-SLAM [22], and particle-filter SLAM [23].

EKF SLAM is known as a more traditional approach, and was the first proposed method for obtaining usable results. EKF SLAM uses an extended state vector containing the robot's pose and the feature locations in the map. As the robot moves, measurements are recorded, and the state vector is updated using an extended Kalman filter. As new features are discovered and described, the state vector grows, along with the noise covariance matrix, making EKF SLAM computationally inefficient.

Graph-Based SLAM, initially proposed in [22], uses the notion that the SLAM problem can be viewed as a set of robot locations and features with constraints of relative position between them. Error propagation can be corrected by allowing these constraints to "flex" or adjust to updated locations. Therefore, finding the state of minimal energy is the solution to the SLAM problem.

Finally, particle filter SLAM works by randomly sampling the position error distribution [23]. Each particle has a corresponding estimate of the robot's path as well as locations of every feature in the map.
With every measurement update, an "importance factor" is assigned based on the probability of observing that measurement. According to this importance factor, the distribution is resampled with the same number of particles, and feature locations are updated.

2.4 Path Planning and Obstacle Avoidance

The concept of path planning and avoiding obstacles has been around since autonomous robots first appeared, even before they were mobile [20]. Industrial robots had to be able to accurately plan their routes and avoid collision with walls, objects, and other robots. While some roboticists may think of path planning as a solved problem, it has actually transitioned into an optimization problem.

Graph-search based methods such as A* [24], D* [25], Voronoi diagrams [26], cell decomposition, breadth-first and depth-first search, Dijkstra's algorithm [27], and all of their variants have long seen use in nearly every robotic platform imagined. These algorithms often work in free space using connectivity graphs, and are all appropriate for static (and sometimes slowly changing dynamic) environments. Artificial potential field (APF) path planning has also proven useful in many robotic systems [26], and simplifies calculations in most cases, as the path simply follows gradient descent. Of course, potential fields always carry the risk of running into local minima, for which there are several proven solutions [28, 29, 30].

Chapter 3: Hardware

3.1 Camera

Images were acquired using two cameras throughout this project. The first is an 8-megapixel Motorola Droid X2 phone camera, and the other is a Nikon Coolpix AW100. Both feature automatic brightness and contrast adjustment, as is industry standard [31]. It is important to note that the camera selection is irrelevant, as the process developed and tested is designed to work with any type of color-capable camera with a resolution of at least 240x320, since all images are sampled at this resolution.

3.2 Computer

All computing was done on a desktop machine equipped with a 3.6 GHz i7 processor. While the developed techniques have only been tested via post-processing, combining the image processing techniques with the latest and fastest feature detectors, descriptors, and matching algorithms should make real-time performance relatively straightforward to obtain. It is also important to note that nearly all code was written in MATLAB, a notoriously slow programming language. MATLAB was chosen for its invaluable selection of built-in functions for prototyping purposes.

At a resolution of 240x320, one frame is processed for color invariance in 0.0156 seconds on average. This is fast enough to produce over sixty frames per second, leaving time for additional implementations of SIFT (or other feature detection and description algorithms), particle filter software, and nearly any path planning and obstacle avoidance method desired.

Chapter 4: Algorithms

4.1 Color Invariance

Two models were developed to work concurrently to produce the best results in changing lighting conditions: L-2 normalization and log-chromaticity space projection. It is important to note that any method should ideally work in a real-time vision system. Therefore, many different image processing techniques were explored and tested, but point-wise operations were heavily favored because they allow easy and fast matrix manipulation.
4.1.1 L-2 Normalization

The first model, derived from a method proposed by Xu et al. [18] to remove shadows from single images, involves computing the L-2 norm using the standard red-green-blue (RGB) values for each frame as inputs. Normalizing each pixel value gives the effect of setting all pixels to the same "intensity". It is important to note that this is not the same intensity value given in an indexed or grayscale image, but rather a standardization of RGB values for each pixel in the image. This was accomplished using the standard L-2 norm formula:

\[ r' = \frac{r}{\sqrt{r^2 + g^2 + b^2}}, \qquad g' = \frac{g}{\sqrt{r^2 + g^2 + b^2}}, \qquad b' = \frac{b}{\sqrt{r^2 + g^2 + b^2}} \tag{1} \]

where r, g, and b are the red, green, and blue input values, respectively, and r', g', and b' are the normalized red, green, and blue outputs, respectively. Figure 1 shows the result of this operation.

Figure 1, L-2 norm image comparison. Original image (left) compared to L-2 color normalized image (right).

4.1.2 Log-Chromaticity Projection

The second model is based on the log-chromaticity space projection presented by G.D. Finlayson et al. [19]. Chromaticity is defined as a description of color "quality" regardless of luminance. Chromaticity has two independent components: hue and colorfulness (or saturation). This approach uses log-domain characteristics to create a color invariant image, projected into intensity values. Finlayson made the observation that similar colors in an image are projected onto lines having similar slopes in log-chromaticity space despite ambient and direct lighting (and consequently, the property of luminance). When projected in the direction orthogonal to the lighting, a color-invariant image is obtained:

\[ \mathrm{inv} = \cos\theta \, \log\!\left(\frac{r}{b}\right) + \sin\theta \, \log\!\left(\frac{g}{b}\right) \tag{2} \]

In this equation, r, g, and b are the red, green, and blue input values, respectively, θ is the projecting direction, and inv is the color-invariant image product. The value of θ was chosen to be 43.58°, shown experimentally by Li et al. to be an acceptable approximation. To finish out the model, an exponential operation is applied to the color-invariant image and histogram equalization is performed to make the matrix easier to view and work with. The result is shown in Figure 2.

Figure 2, Comparison of original image (left) with log-chromaticity projected image (right). Note that shadowing in the path is nearly invisible, while color differences (such as in the bottom right of the image) are preserved.

4.1.3 Combining Models for Path Classification

Deciding what is and is not a path can be accomplished in a number of different ways. Using only color information, region properties can be parameterized and the algorithm trained on specific surfaces, vanishing points can be determined, or texture analysis can be done. The proposed method assumes that the robot is initialized on a traversable path, or region. The image region known to be path is then sampled and analyzed, and all regions within the frame having similar properties of normalized color and chromaticity are considered to be traversable terrain. This is accomplished by first creating two binary images from the two models presented using thresholding, and applying the AND operator to combine them. Since both models are susceptible to different types of error, the AND operator significantly reduces the amount of noise in the image. Next, the largest blob is found and kept, while the rest of the blobs are discarded. The result is a binary image, where the terrain is classified as traversable or not on a per-pixel basis. This step is illustrated in Figure 3.

Figure 3, Color invariant path detection. Binary mask overlaid on original image: blue tinted regions represent detected path, and red tinted regions represent non-path.

Because there could potentially be obstacles or other important occlusions in the frame, the binary path is not simply flood filled to close gaps. Rather, a cost map can be created based on pixel density to quantify how safe a region is to travel. The downside to this is that even large obstacles in the background may not register until the robot gets closer, and a continuing and diminishing path in the distance will be down-weighted by the surrounding non-traversable pixels in the image. An example of such a cost map is shown in Figure 4.

Figure 4, Cost map computed by pixel density. Blue levels represent low cost, versus red indicating high cost.
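The following MATLAB sketch pulls the pieces of Section 4.1 together into one pass over a frame. It is not the code developed for this thesis; the file name, the bottom-center patch taken as known path, the blue channel used as the denominator of the log ratios, and the two tolerances are all illustrative assumptions.

% Sketch of the combined color-invariance path classifier (illustrative only).
I = im2double(imread('scene.png'));              % RGB frame, values in [0,1] (assumed file)
R = I(:,:,1); G = I(:,:,2); B = I(:,:,3);

% Model 1: L-2 normalization, Eq. (1)
mag = sqrt(R.^2 + G.^2 + B.^2) + eps;
r2 = R./mag;  g2 = G./mag;  b2 = B./mag;

% Model 2: log-chromaticity projection, Eq. (2), then exponentiate and equalize
theta  = 43.58 * pi/180;                         % projection direction from Section 4.1.2
invImg = cos(theta)*log((R+eps)./(B+eps)) + sin(theta)*log((G+eps)./(B+eps));
invImg = histeq(mat2gray(exp(invImg)));

% Sample a region assumed to be path: a bottom-center patch (an assumption)
[rows, cols, ~] = size(I);
pr  = round(0.85*rows):rows;  pc = round(0.4*cols):round(0.6*cols);
ref = [mean2(r2(pr,pc)) mean2(g2(pr,pc)) mean2(b2(pr,pc)) mean2(invImg(pr,pc))];

% Threshold each model against the sampled statistics (tolerances are assumptions)
tolRGB = 0.05;  tolInv = 0.08;
mask1 = abs(r2-ref(1)) < tolRGB & abs(g2-ref(2)) < tolRGB & abs(b2-ref(3)) < tolRGB;
mask2 = abs(invImg - ref(4)) < tolInv;

% AND the two models and keep only the largest blob (Section 4.1.3)
path = bwareafilt(mask1 & mask2, 1);             % largest connected component
imshow(path);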
4.2 SIFT

Traditionally, feature matching algorithms are computed on entire image frames to extract as many usable features as possible. Due to spatial ambiguities, keypoints found close together in two-dimensional Euclidean distance may actually be far apart in the third dimension. To overcome these ambiguities, many SLAM algorithms rely on camera calibration to estimate the distance a keypoint lies from the camera. Calculating locations of keypoints in three dimensions requires complex and computationally inefficient affine transform solutions.

To avoid making these calculations, the SIFT keypoint detection and descriptors are only computed on the binary classified output from the method of path identification in section 4.1. This effectively restricts all keypoints to two-dimensional space, i.e., those strictly lying on the ground plane within the detected path (see Figure 5). The important assumption is made that the ground is relatively flat, meaning that rapid changes in incline or altitude are not made. This allows a single projective spatial transformation (discussed in section 4.5) to estimate the global two-dimensional coordinates of each keypoint.

Figure 5, Comparison of full frame feature detection, and the proposed ground plane restrictions.

The SIFT algorithm first detects keypoints using Noble's harmonic mean keypoint detector [36], defined as:

\[ N = \frac{\det(H_r)}{\operatorname{trace}(H_r) + \varepsilon} \tag{3} \]

where Hr is the matrix used by the classic Harris corner detector [32], defined as:

\[ H_r = \begin{bmatrix} \sum_W I_x^2 & \sum_W I_x I_y \\ \sum_W I_x I_y & \sum_W I_y^2 \end{bmatrix} \tag{4} \]

where Ix and Iy are the image gradients, accumulated over a local window W. Noble's harmonic mean detector was chosen to simplify use by eliminating the need for tuning the Harris detector. Once keypoints are located, a normalized Laplacian operator is used to find the characteristic scale for each corresponding keypoint. Local maxima are found and a threshold is used to determine whether that keypoint is a "corner." This allows the keypoint descriptor to be invariant to scale. After the characteristic scales are computed, a Gaussian blur function is applied to the local area around each keypoint, according to its scale. Next, the gradient is calculated at each keypoint, and a weighted histogram is computed to find the keypoint's main orientation. The main orientation is the rotation describing the keypoint, and falls into one of eight bins for searching. In the case where a keypoint does not have a main orientation (i.e., the histogram is flat), the keypoint is rejected.

A SIFT keypoint descriptor, similar to Lowe's method [11], is then applied by sampling the local image around each keypoint according to its characteristic scale and main orientation. The result is a 128-value vector describing a keypoint. These vectors are unique to each keypoint, and are compared in subsequent images to find matches. Keypoints are tracked by searching for similar descriptors in an expected location window. That is, when a robot is moving, keypoints are expected to move only a small amount, so only a small region around the initially detected keypoint is searched for the corresponding match in the next image. This helps speed up computation time when matching features.
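A compact MATLAB sketch of the corner measure in Eqs. (3) and (4) is given below. The input file name, the Gaussian window size, the threshold, and the use of regional maxima for non-maximum suppression are assumptions for illustration rather than the exact settings used in this work.

% Noble's harmonic-mean corner measure of Eqs. (3)-(4) (illustrative sketch).
I = im2double(rgb2gray(imread('scene.png')));   % assumed input image

% Image gradients (MATLAB's gradient: first output is the horizontal derivative)
[Ix, Iy] = gradient(I);

% Structure-tensor entries of Hr, accumulated over a Gaussian window
g   = fspecial('gaussian', 9, 2);               % window size and sigma are assumptions
Sxx = imfilter(Ix.^2,  g, 'replicate');
Syy = imfilter(Iy.^2,  g, 'replicate');
Sxy = imfilter(Ix.*Iy, g, 'replicate');

% Noble's measure: det(Hr) / (trace(Hr) + eps), Eq. (3)
N = (Sxx.*Syy - Sxy.^2) ./ (Sxx + Syy + eps);

% Keep local maxima above a threshold as candidate keypoints (threshold is an assumption)
corners = imregionalmax(N) & (N > 0.01 * max(N(:)));
[rowIdx, colIdx] = find(corners);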
4.3 Optical Flow

Optical flow was implemented using RoboRealm computer vision software. RoboRealm's optical flow module uses a block matching strategy similar to the method presented in [35]. Since only the global motion is of interest, the global horizontal and vertical vectors can be saved to a file for post-processing, or serially transmitted to any other software.

4.4 Particle Filter

A particle filter was used to filter the results from the SIFT algorithm outputs. Particle filters are based on Monte Carlo methods [33] of randomly sampling the robot's two-dimensional belief distribution. This method allows the system to model any distribution or non-linear relationship, a powerful tool when tackling the SLAM problem. At each time step (or image frame), K particles contain estimates of the robot's past and present poses X_t[k] and global feature positions having means μ_{t,i}[k] and covariance matrices Σ_{t,i}[k]. Each particle is then defined as:

\[ \mathrm{Particle}[k] = X_t[k];\ (\mu_{t,0}[k], \Sigma_{t,0}[k]);\ (\mu_{t,1}[k], \Sigma_{t,1}[k]);\ \ldots;\ (\mu_{t,n-1}[k], \Sigma_{t,n-1}[k]) \tag{5} \]

where k is the particle index, t is the time step, and n is the total number of features in the map.

Because keypoint locations are restricted to the ground plane, global locations of keypoints are estimated based on a linearized relationship between the robot's pose when viewing the keypoint and the keypoint's location in the image frame. Furthermore, the assumption is made that the robot's pose within the image is exactly at the bottom center of the frame. Each keypoint's location is then linearized by finding the polar coordinates of the keypoint from the origin, taken at the robot's pose. Each feature, then, has a perceived global location of:

\[ x_i = X_t(x, y) + r_t^{6/5}\,[\sin\theta_t,\ \cos\theta_t] \tag{6} \]

where x_i is the location of the ith feature, X_t is the pose of the robot at time step t, r_t is the radial distance of the feature in the image frame at time step t, and θ_t is the angle between the keypoint and the image centerline. The exponent 6/5 on r_t is the relationship between radial distances in pixels versus actual units in the global scene, and was chosen experimentally.

At each time step, keypoint mean locations and covariance matrices are updated for each particle using the Kalman filter update rule. With each movement u_t, each particle updates its robot pose estimation and makes another observation z_t. Next, each particle's importance factor w_t[k] is computed as the probability of z_t given the particle's previous states and observations:

\[ w_t[k] = \frac{p(\mathrm{Particle}[k] \mid z_t,\ u_t,\ n_t)}{p(\mathrm{Particle}[k] \mid z_{t-1},\ u_t,\ n_{t-1})} \tag{7} \]

Montemerlo et al. show in [23] that this representation can be mathematically approximated with the integral:

\[ w_t[k] \approx \int p(z_t \mid X_t,\ \mathrm{Particle}[k]_t,\ n_t)\, p(X_t)\, dX_t \tag{8} \]

This calculation comes with the common assumption of a uniform particle distribution.

Based on this importance factor, particles are then resampled to replace those with low importance factors with those having higher ones, and means and covariance matrices are updated. Resampling the particles each time with updated importance factors on a per-particle basis allows the algorithm to assess different observation data associations concurrently. This makes the particle filter SLAM solution more robust than traditional EKF methods, as particles with erroneous estimates will lie further from the center of the belief distribution, and are thus less likely to be chosen when resampling.
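The propagate-weight-resample loop described above can be sketched in a few lines of MATLAB. The sketch below is deliberately simplified: it uses a single 2-D position observation with hand-picked control, observation, and noise values (all assumptions), and it omits the per-feature Kalman updates and data association of the full FastSLAM-style filter.

% Minimal particle-filter update sketch: propagate, weight, resample.
K = 500;                                   % number of particles (assumption)
X = zeros(K, 3);                           % particle poses [x y theta], start at the origin
u = [0.5, 0.02];                           % control u_t: forward distance, heading change (assumption)
z = [0.52, 0.03];                          % observation z_t: position from multilateration (assumption)
sigma_u = [0.05, 0.01];                    % motion noise standard deviations (assumption)
sigma_z = 0.1;                             % measurement noise standard deviation (assumption)

% Motion update: propagate each particle with the noisy control
X(:,3) = X(:,3) + u(2) + sigma_u(2)*randn(K,1);
d      = u(1) + sigma_u(1)*randn(K,1);
X(:,1) = X(:,1) + d .* cos(X(:,3));
X(:,2) = X(:,2) + d .* sin(X(:,3));

% Importance factor: probability of the observation given each particle
d2 = (X(:,1) - z(1)).^2 + (X(:,2) - z(2)).^2;
w  = exp(-d2 / (2*sigma_z^2));
w  = w / sum(w);

% Low-variance (systematic) resampling with the same number of particles
idx = zeros(K, 1);
c   = cumsum(w);
u0  = rand / K;
for k = 1:K
    idx(k) = find(c >= u0 + (k-1)/K, 1);
end
X = X(idx, :);

poseEstimate = mean(X, 1);                 % filtered pose estimate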
4.5 SLAM

4.5.1 Localization

As mentioned in section 4.2, the SLAM problem is greatly simplified by restricting keypoints to the ground plane and within the perceived path. Each feature found is stored in a state vector containing a given identification number, descriptor information, and global x and y coordinates. Given the location estimates of the SIFT descriptors from the particle filter output, a pose estimate can be generated.

4.5.2 Map Building

For new maps (meaning the robot has never seen or navigated in this area before), the map is first initialized, and the starting pose of the robot is always [0 0 0] (x, y, θ). Using filtered motion estimation information from the output of the optical flow/SIFT motion estimators, the robot's pose change is updated for every movement.

Using pose information, a projective spatial transformation is applied to the binary output of traversable terrain to correct for perspective:

\[ \begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = T \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{9} \]

where T has the form

\[ T = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & 1 \end{bmatrix} \]

and h_{00}, h_{01}, ..., h_{21} = p are the parameters of the transformation to be solved. This can be done by iteratively solving for the parameters p by minimizing the non-linear least squares problem, by using robust least squares, or by RANSAC [34] (or its variants). A comparison of the binary image before and after this transformation is shown in Figure 6.

Figure 6, Comparison of binary image before (left) and after (right) perspective correction spatial transformation.

The resulting transformed image is then zero padded to create a "chunk." The chunk is then rotated and shifted to match the robot's pose estimation, and the chunk is ORed with the global map. The result is a binary map with real, not approximated, edges of the perceived path. Also, because of the OR operation applied, permanent obstacles are preserved while moving obstacles become filled. An example of this operation is shown in Figure 7.

Figure 7, Illustration of binary map building.

If instead a cost map is desired, the same process can be applied to the cost map information generated using the pixel density function output. The results can then be added and normalized to obtain a global cost map. This map will have the same features as the binary map, including real edges and obstacle preservation (see section 4.1.3).
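A minimal MATLAB sketch of the map-chunk construction in Section 4.5.2 follows. The four point correspondences used to fit the projective transform, the chunk and map sizes, and the pose values are placeholders rather than calibration data from this work.

% Sketch of perspective correction and binary map building (illustrative only).
bw = imread('path_mask.png') > 0;                  % binary traversable-terrain image (assumed file)

% Fit the projective transform T of Eq. (9) from four image-to-ground
% correspondences; the coordinates below are placeholders, not calibration data.
imgPts    = [ 60 240; 260 240;   1  20; 320  20];  % pixel locations [x y]
groundPts = [100 200; 140 200; 100   1; 140   1];  % top-down map locations [x y]
T = fitgeotrans(imgPts, groundPts, 'projective');

% Warp the binary image into a top-down view ("chunk")
chunk = imwarp(double(bw), T, 'OutputView', imref2d([200 240])) > 0.5;

% Rotate and shift the chunk to the current pose estimate, then OR it into the map
pose  = [400 300 12];                              % placeholder pose [x y theta(deg)]
chunk = imrotate(chunk, pose(3), 'nearest', 'crop');
globalMap = false(1000, 1000);                     % global binary view map
r = pose(2) + (1:size(chunk,1));
c = pose(1) + (1:size(chunk,2));
globalMap(r, c) = globalMap(r, c) | chunk;         % OR preserves permanent obstacles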
4.6 Path Planning and Obstacle Avoidance

One long-standing sensor problem in the area of robotics is determining not just how to plan an appropriate path, but where in the world the path is allowed to lie. Certainly other sensor systems have advantages over computer vision techniques when it comes to identifying large obstacles such as rocks or trees, but they fall short of accurately painting the whole scene. Consider a robot navigating in an urban environment: range-finding sensors will certainly do a good job locating people, poles, and cars, but may not notice any difference between roads, medians, and sidewalks. Similarly, in rural areas, there may be no perceivable difference in a point cloud between the road or path and the surrounding field. If a robot using only range-finding sensors ventured into a mud pit, it would surely never recover.

Some systems supplement range-finders with computer vision, such as the famous robot Stanley that won the DARPA Grand Challenge in 2005. Stanley used an array of five laser sensors to scan the surrounding terrain and determine where the vehicle could navigate. Because of the limited range of the sensors (about 22 meters), the vehicle was limited to a speed of 25 mph. The Stanford team supplemented the laser data by analyzing the color of the path deemed navigable by the laser data, and searched a video feed for areas with similar color characteristics [10]. Due to the video's much longer range of view, the vehicle was able to path plan while traveling up to 35 mph, going on to be the first robot to ever complete the challenge.

Following this notion, similar techniques using only color information (discussed in section 4.1) are applied to obtain a per-pixel binary image of perceived traversable areas. As previously discussed, there are many different ways to interpret the binary information for use in path planning.

Perhaps the simplest and most straightforward way to use the binary image is to consider the "traversable" pixels the workspace, where any point is safe to travel. This allows any path planning algorithm such as A* or D* (and their variants) to be directly applied. Note that this can be computed on a local or global scale. Applying the algorithm frame by frame, under the assumption that the robot's pose is located at the bottom center of the frame, allows the robot to navigate dynamic and permanent obstacles. Applying the same spatial transformation that is applied to feature and view mapping, using the robot's priors, gives a transformation to and from the global workspace. This allows a path to be computed in the global workspace and transformed to the local workspace for finer control and obstacle avoidance.

Of course, the downside to using a strict binary local workspace is the influence of noise. Without any further processing or decision making, a single pixel declared not traversable can hinder the robot's efficiency in traveling, and perhaps even impede it completely. To avoid this, the binary image may be quickly and efficiently converted to a cost map or potential field map by applying a Gaussian blur function to the entire image. Inverting the result effectively computes the pixel density, which is proportional (given gains and other parameters) to the desired cost or potential field map. Cost maps have the advantage of smoothing out the rigid motions and hindrances of a binary workspace while keeping the arithmetic simple when applying the same path planning algorithms. A* and D* can just as easily be computed on pixels containing a value between zero and one as on pixels that are strictly zero or one.

Potential field path planning has also proven useful in many robotic systems, and simplifies calculations in most cases, as the path simply follows gradient descent. This map can be calculated by extending the cost map by one step: "tilting" the map by either raising the bottom or lowering the top values (and all values in between according to a linear function), or tilting towards the target location. Of course, potential fields always carry the possibility of running into local minima; for the sake of demonstration, the proposed solution is to always place the target location at a global minimum and minimize the cost of travelling to that location.
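The conversion from binary mask to cost map to tilted potential field described above can be sketched as follows in MATLAB. The blur window, tilt gain, goal location, and the simple one-pixel gradient descent are assumptions chosen for illustration, not the parameters used in this work.

% Sketch: binary traversability mask -> cost map -> tilted potential field -> descent.
bw = imread('path_mask.png') > 0;                  % traversable pixels are true (assumed file)
[rows, cols] = size(bw);

% Cost map: Gaussian blur of the non-traversable pixels approximates pixel
% density, giving high cost near obstacles and edges and low cost on open path.
g    = fspecial('gaussian', 31, 8);                % window and sigma are assumptions
cost = imfilter(double(~bw), g, 'replicate');

% Potential field: "tilt" the cost map toward a goal so the goal is the global minimum.
goal = [40, round(cols/2)];                        % hand-selected goal [row col] (assumption)
[C, R] = meshgrid(1:cols, 1:rows);
tilt = hypot(R - goal(1), C - goal(2)) / hypot(rows, cols);
U = cost + 0.5 * tilt;                             % tilt gain of 0.5 is an assumption

% Follow the negative gradient from the robot's position at the bottom center.
[Gx, Gy] = gradient(U);
p = [rows, round(cols/2)];
route = p;
for k = 1:500
    step = sign(-[Gy(p(1), p(2)), Gx(p(1), p(2))]); % one-pixel step downhill [drow dcol]
    p = max(min(p + step, [rows cols]), [1 1]);
    route(end+1, :) = p;                            %#ok<AGROW>
    if hypot(p(1)-goal(1), p(2)-goal(2)) < 2, break; end
end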
Chapter 5: Experiments and Results

As the title of this thesis suggests, the methods proposed are designed to enhance the current methods of visual navigation, not replace them. To this end, a series of tests were designed and comparisons will be drawn.

5.1 Color Invariance and Path Classification

To evaluate the method for path classification, a series of images will be compared to the tried-and-true image segmentation method of k-means clustering. It should be noted that the k-means clustering is hand-tuned for each picture to ensure that the most accurate account of the true path is given and compared to the color-invariant method proposed. Because both images being compared are binary, the simplest method, the sum of absolute differences (SAD), was used, in the form:

\[ E_{SAD} = \sum_{i}\sum_{j} \left| X_{k\text{-}means}(i,j) - X_{compared}(i,j) \right| \tag{10} \]

where X_{k-means} is the k-means binary image and X_{compared} is the path-extracted image using the methods of color invariant path classification and path classification without color invariance, respectively. Table 1 shows these results, along with Figure 8, which shows the comparison of classifications. Note that in one instance the proposed color invariant path classification actually outperforms k-means clustering.

Image   E_SAD with color invariant path detection   E_SAD without color invariant path detection
1       1764                                         19341
2       6716                                         15637
3       8775                                         19218
4       415                                          8129
5       5403                                         10165
6       1208                                         11746
7       4518                                         18027
8       4925                                         17318
9       11775                                        24049
10      Outperformed k-means                         --

Table 1, Sum of absolute differences between k-means image segmentation, color invariant path detection, and color based path detection.

Along with comparative accuracy in determining where the path lies, color invariant path detection also performs much faster than k-means clustering, as discussed in section 3.2.

Figure 8, Path detection method comparison. Methods compared using hand-tuned k-means clustering (column 1), color invariance as proposed in this thesis (column 2), and color classification (column 3) on multiple surface types. Rows 1 through 10 correspond with the SAD data in Table 1. Note that color invariance outperforms color classification and is comparable to k-means clustering, even outperforming it in some instances, such as row 10.
5.2 Color Invariance with SIFT

Although SIFT feature descriptors already have some color invariance properties, even more color invariance can be achieved by performing SIFT on the log-chromaticity projection, as the values in the chromaticity space are mapped to intensity values. Figure 9 shows a comparison of a Harris corner detector (red asterisks) and Shi and Tomasi's minimum eigenvalue corner detector (blue asterisks) applied to the traditional grayscale intensity image and to the log-chromaticity projected intensity image.

Figure 9, Comparison of keypoint detectors on traditional grayscale (left) vs. log-chromaticity projection (right).

It is easy to see that there are many more keypoints detected in the grayscale (left) image which fall directly onto shadow edges and features. It is precisely these types of keypoints a robot wants to avoid when building a map, as they directly rely upon the luminance, weather, and even wind direction and speed, all of which are time-variant.

Another measure of performance critical to localization derived from SIFT is the number of keypoints that are matched between images. Ideally, it is preferred to have a relatively small number of very strong keypoints that can be uniquely identified across multiple frames. Figures 10 and 11 show two sets of images captured at different times of day, with different lighting conditions. Comparing the grayscale SIFT feature matching with the log-chromaticity projection SIFT feature matching shows that in many cases log-chromaticity meets or exceeds the number of matches obtained with traditional grayscale techniques. Table 2 shows a comparison of the number of feature matches between grayscale and log-chromaticity projection images among seven different data sets (the first two rows correspond to Figures 10 and 11, respectively).

Figure 10, Comparison of SIFT feature matching (1). Original image (top), grayscale image (middle), and log-chromaticity projected image (bottom).

Figure 11, Comparison of SIFT feature matching (2). Original image (top), grayscale image (middle), and log-chromaticity projected image (bottom). Note the incorrect matches from the grayscale images due to shadowing.

Number of grayscale matches   Number of log-chromaticity matches
10                            25
4                             4
25                            22
125                           103
34                            18
1                             1
3                             4

Table 2, Comparison of the number of feature matches between grayscale and log-chromaticity projection images.

5.3 Color Invariance with Optical Flow

Perhaps one of the most significant contributions the path detection alone could make is the ability to build a map with optical flow. As discussed in sections 2.1.3 and 4.3, optical flow motion estimation is accomplished by measuring changes in pixel values, so little consideration is given to ambient conditions. A robust optical flow algorithm will give the same results despite hard or soft shadowing, or the type of environment.

For motion estimation on a mobile robot, the assumption is usually made that the scene the robot is in is mostly stationary; that is, most objects or obstacles are static. Extending this concept, this also means that most of the motion contributing to optical flow measurements comes from the robot itself moving through the scene. For a robot, the sensing is analogous to humans feeling acceleration. Imagine closing your eyes while riding in a car: you know when you move, and even which way you move, but you have no idea what objects you are moving in reference to.

Unless the surface being traveled on is homogeneous (which is atypical for outdoor environments), it is very difficult to model the traversable areas using optical flow. However, when combined with the color invariant path recognition presented in this thesis, it is easy to create view and cost maps for initial mapping purposes (it should be noted that without proper filters and estimators, error propagation will show large drifts from the true traveled path, as with any motion estimation). The map building procedure described in section 4.5.2 is applied along with global motion estimates from the optical flow algorithm to produce the results shown in Figure 12, and the view map in Figure 13. These results can be compared to the actual path as shown by Google Earth in Figure 14.

Figure 12, Path estimation from optical flow global motion measurements.

Figure 13, View map built from optical flow motion estimation and color invariance path detection. The green dot represents the starting point and the red dot represents the end point.

Figure 14, Google Earth image of path traversed in Figures 12, 13, 15, and 16. The green dot represents the starting point and the red dot represents the end point (used with permission under Google Earth policies).
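Before moving to the SLAM results, the following MATLAB sketch shows how per-frame global motion vectors (from optical flow or SIFT matching) can be accumulated into the kind of path estimate plotted in Figures 12 and 15. The input file format, the pixel-to-meter scale, and the heading gain are assumptions; the maps above were produced with the filtered estimators of Chapter 4 rather than this simplified dead reckoning.

% Sketch: accumulate per-frame global motion estimates into a 2-D path.
M = csvread('global_motion.csv');          % columns [dx dy] per frame, in pixels (assumed file/format)
scale  = 0.02;                             % meters of travel per pixel of vertical flow (assumption)
kTheta = 0.002;                            % radians of heading change per pixel of lateral flow (assumption)

pose = [0 0 0];                            % starting pose [x y theta], as in Section 4.5.2
traj = pose;
for t = 1:size(M, 1)
    pose(3) = pose(3) + kTheta * M(t, 1);  % lateral image motion -> heading change
    d       = scale  * M(t, 2);            % vertical image motion -> forward translation
    pose(1) = pose(1) + d * cos(pose(3));
    pose(2) = pose(2) + d * sin(pose(3));
    traj(end+1, :) = pose;                 %#ok<AGROW>
end
plot(traj(:,1), traj(:,2)); axis equal; title('Integrated path estimate');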
5.4 Color Invariance with SLAM

Once keypoints have been detected, described, and matched, the SLAM problem remains the same. Using the path detection along with SLAM offers the advantage of being able to build not just a feature map and the path traveled, but also a view map as produced in section 5.3. An example of this is shown in Figure 15, where the same path as in Figure 12 is built instead using SIFT feature detection with the SLAM algorithm discussed in section 4.5.

Figure 15, Path estimation from SIFT global motion measurements.

Figure 16, View map built from SIFT motion estimation and color invariance path detection. The green dot represents the starting point and the red dot represents the end point.

While using a particle filter is in theory a simple concept, the logistics of keeping track of thousands of features are non-trivial. Global feature locations should be updated upon every instance of detection; however, features do not necessarily appear in subsequent frames. Thus, time constraints prevented the use of actual global feature locations. Instead, visual odometry data was taken to be the "actual" path and global feature locations were simulated using an arbitrary number of features. Gaussian observation noise was added to simulate actual measurements in the x and y directions. The motion (visual odometry) parameter was taken to be the measured movement of matched keypoints, and the observation parameter was taken to be the position obtained by multilateration from observed keypoints.

Figures 17 through 19 show, for multiple paths, the true path via Google Earth, the visual odometry path (taken as the actual path for simulation purposes) with simulated feature locations and pose estimations after filtering with the described particle filter, and the standard deviations of the particle filter clouds in two directions.

Figure 17, Particle filter demonstration using color invariant features (1). a) True traversed path (Google Earth), b) visual odometry path (blue) overlaid with particle filter estimated path (red), and c) standard deviations of particle filter cloud.

Figure 18, Particle filter demonstration using color invariant features (2). a) True traversed path (Google Earth), b) visual odometry path (blue) overlaid with particle filter estimated path (red), and c) standard deviations of particle filter cloud.

Figure 19, Particle filter demonstration using color invariant features (3). a) True traversed path (Google Earth), b) visual odometry path (blue) overlaid with particle filter estimated path (red), and c) standard deviations of particle filter cloud.

5.5 Color Invariance with Path Planning

Given the binary output from color invariant path detection, practically any navigation algorithm may be run. Taking the bottom center of the image to be the robot's location (and thus the start point), a path was formed to a hand-selected goal (for demonstration purposes) using different algorithms, as shown in Figure 20.

Figure 20, Path planning demonstration with color invariance. Original image (top) compared to binary (middle) and pixel density (bottom) applications of the D* algorithm.
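As one example of running a planner directly on the binary output, the MATLAB sketch below uses a geodesic distance transform and a strictly downhill walk instead of the D* planner shown in Figure 20. The file name, start, goal, and connectivity are assumptions for illustration.

% Sketch: obstacle-free route on the binary mask via a geodesic distance transform.
bw = imread('path_mask.png') > 0;                     % traversable pixels are true (assumed file)
[rows, cols] = size(bw);

start = [rows, round(cols/2)];                        % robot at the bottom center of the frame
goal  = [30, 200];                                    % hand-selected goal [row col] (assumption)

% Geodesic distance to the goal, measured through traversable pixels only.
D = bwdistgeodesic(bw, goal(2), goal(1), 'quasi-euclidean');

% Walk strictly downhill on D from the start to recover a route.
p = start;
route = p;
while ~isnan(D(p(1), p(2))) && D(p(1), p(2)) > 0
    r = max(p(1)-1,1):min(p(1)+1,rows);               % 8-connected neighborhood
    c = max(p(2)-1,1):min(p(2)+1,cols);
    patch = D(r, c);
    patch(patch >= D(p(1), p(2))) = NaN;              % allow only strictly downhill moves
    if all(isnan(patch(:))), break; end               % start is disconnected from the goal
    [~, idx] = min(patch(:));
    [ri, ci] = ind2sub(size(patch), idx);
    p = [r(ri), c(ci)];
    route(end+1, :) = p;                              %#ok<AGROW>
end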
Chapter 6: Summary and Future Work

This chapter serves to summarize the work presented in this thesis and to suggest future work for continuing the presented topic of study.

6.1 Summary

This thesis has shown that using color invariant properties of scene images can significantly reduce perception errors from traditional computer vision techniques. The fast computation of log-chromaticity projection and L-2 normalization serves to specifically improve upon visual navigation techniques for mobile robots operating outdoors. The results also show that using computer vision as a sole means of data acquisition is not only feasible, but offers advantages such as path recognition and color information over other sensory methods.

6.2 Future Work

As this work was devoted to prototyping a new method, little effort was given to the computation speed of the particular algorithms used. Other research has shown very efficient and fast algorithms that perform all the necessary steps to obtain the results achieved from the prototype methods, and could be implemented for real-time use.

As mentioned in section 5.4, considerable effort needs to be given toward a robust global feature tracking algorithm. There is not only a complex geometric relationship between feature locations in a video frame and their respective global coordinates, but correctly updating observed locations for use in a particle filter is a difficult task.

There have also been considerable accomplishments in the research of color-specific descriptors, and comparisons should be drawn between the methods presented in this thesis and current state-of-the-art color descriptors.

References

[1] Dederick, Z. (1868). Improvement in steam-carriage. US Patent No. 75874 A. Washington, DC: U.S. Patent and Trademark Office.
[2] Goto, T., Inoyama, T., & Takeyasu, K. (1974). Precise insert operation by tactile controlled robot. Industrial Robot: An International Journal, 1(5), 225-228.
[3] Leonard, J. J., & Durrant-Whyte, H. F. (1992). Directed sonar sensing for mobile robot navigation (Vol. 448). Dordrecht: Kluwer Academic Publishers.
[4] Raibert, M., Blankespoor, K., Nelson, G., & Playter, R. (2008, July). BigDog, the rough-terrain quadruped robot. In Proceedings of the 17th World Congress (pp. 10823-10825).
[5] Zhang, Z. (2012). Microsoft Kinect sensor and its effect. MultiMedia, IEEE, 19(2), 4-10.
[6] Boden, M. A. (2006). Mind as machine: A history of cognitive science (Vol. 1). Oxford University Press.
[7] Szeliski, R. (2010). Computer vision: Algorithms and applications. Springer.
[8] Schantz, H. F. (1982). History of OCR, Optical Character Recognition. Recognition Technologies Users Association.
[9] Bobbit, R., Connell, J., Haas, N., Otto, C., Pankanti, S., & Payne, J. (2011, January). Visual item verification for fraud prevention in retail self-checkout. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on (pp. 585-590). IEEE.
[10] Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., ... & Mahoney, P. (2006). Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9), 661-692.
[11] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
[12] Ke, Y., & Sukthankar, R. (2004, June). PCA-SIFT: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on (Vol. 2, pp. II-506). IEEE.
[13] Mikolajczyk, K., & Schmid, C. (2004). Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63-86.
[14] Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In Computer Vision - ECCV 2006 (pp. 404-417). Springer Berlin Heidelberg.
[15] Leutenegger, S., Chli, M., & Siegwart, R. Y. (2011, November). BRISK: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2548-2555). IEEE.
[16] Nakayama, K., & Loomis, J. M. (1974). Optical velocity patterns, velocity-sensitive neurons, and space perception: A hypothesis. Perception, 3(1), 63-80.
[17] Sun, D., Roth, S., & Black, M. J. (2010, June). Secrets of optical flow estimation and their principles. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2432-2439). IEEE.
[18] Xu, L., Qi, F., & Jiang, R. (2006, October). Shadow removal from a single image. In Intelligent Systems Design and Applications, 2006. ISDA'06. Sixth International Conference on (Vol. 2, pp. 1049-1054). IEEE.
[19] Finlayson, G. D., Hordley, S. D., Lu, C., & Drew, M. S. (2006). On the removal of shadows from images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(1), 59-68.
[20] Siegwart, R., Nourbakhsh, I. R., & Scaramuzza, D. (2011). Introduction to autonomous mobile robots. MIT Press.
[21] Smith, R. C., & Cheeseman, P. (1986). On the representation and estimation of spatial uncertainty. The International Journal of Robotics Research, 5(4), 56-68.
[22] Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4(4), 333-349.
[23] Montemerlo, M., Thrun, S., Koller, D., & Wegbreit, B. (2002, July). FastSLAM: A factored solution to the simultaneous localization and mapping problem. In AAAI/IAAI (pp. 593-598).
[24] Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2), 100-107.
[25] Stentz, A. (1994, May). Optimal and efficient path planning for partially-known environments. In Robotics and Automation, 1994. Proceedings., 1994 IEEE International Conference on (pp. 3310-3317). IEEE.
[26] Latombe, J. (1991). Robot motion planning. Norwood, MA: Kluwer Academic.
[27] Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269-271.
[28] Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. The International Journal of Robotics Research, 5(1), 90-98.
[29] Feder, H. J. S., & Slotine, J. J. (1997, April). Real-time path planning using harmonic potentials in dynamic environments. In Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on (Vol. 1, pp. 874-881). IEEE.
[30] Li, G., Yamashita, A., Asama, H., & Tamura, Y. (2012, August). An efficient improved artificial potential field based regression search method for robot path planning. In Mechatronics and Automation (ICMA), 2012 International Conference on (pp. 1227-1232). IEEE.
[31] Lee, M. H. (1996). U.S. Patent No. 5,546,134. Washington, DC: U.S. Patent and Trademark Office.
[32] Harris, C., & Stephens, M. (1988, August). A combined corner and edge detector. In Alvey Vision Conference (Vol. 15, p. 50).
[33] Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247), 335-341.
[34] Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395.
[35] Davis, C. Q., Karu, Z. Z., & Freeman, D. M. (1995, November). Equivalence of subpixel motion estimators based on optical flow and block matching. In Computer Vision, 1995. Proceedings., International Symposium on (pp. 7-12). IEEE.
[36] Noble, J. A. (1988). Finding corners. Image and Vision Computing, 6(2), 121-128.