An Indoor Spatial Query Evaluation System based on RFID and Particle Filters by Jiao Yu A thesis submitted to the Graduate Faculty of Auburn University in partial ful?llment of the requirements for the Degree of Master of Science Auburn, Alabama May 4, 2013 Keywords: Indoor spatial query, RFID, Particle ?lter Copyright 2013 by Jiao Yu Approved by Wei-Shinn Ku, Associate Professor of Computer Science and Software Engineering David Umphress, Associate Professor of Computer Science and Software Engineering Xiao Qin, Associate Professor of Computer Science and Software Engineering Abstract People spend a signi?cant amount of time in indoor spaces (e.g., o?ce buildings, subway systems, etc.) in their daily lives. Therefore, it is important to develop e?cient indoor spa- tial query algorithms for supporting various location-based applications. However, indoor spaces di?er from outdoor spaces because users have to follow the indoor ?oor plan for their movements. In addition, positioning in indoor environments is mainly based on sensing de- vices (e.g., RFID readers) rather than GPS devices. Consequently, we cannot apply existing spatial query evaluation techniques devised for outdoor environments for this new challenge. Because particle ?lters can be employed to estimate the state of a system that changes over time using a sequence of noisy measurements made on the system, in this research, we pro- pose the particle ?lter-based location inference method as the basis for evaluating indoor spatial queries with noisy RFID raw data. Furthermore, two novel models, indoor walk- ing graph model and anchor point indexing model, are created for tracking object locations in indoor environments. Based on the inference method and tracking models, we develop innovative indoor range and k nearest neighbor (kNN) query algorithms. We validate our solution through extensive simulations with real-world parameters. Our experimental re- sults show that the proposed algorithms can evaluate indoor spatial queries e?ectively and e?ciently. ii Acknowledgments This research has been funded in part by the National Science Foundation (NSF) Grants CNS-0831502 (CT) and CNS-0855251 (CRI). iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Indoor Spatial Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 RFID-Based Track and Trace . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Particle Filter-Based Location Inference . . . . . . . . . . . . . . . . . . . . 10 3.3 Symbolic Model-Based Location Inference . . . . . . . . . . . . . . . . . . . 12 3.3.1 Symbolic Model and Deployment Graph . . . . . . . . . . . . . . . . 12 3.3.2 Symbolic Model based Indoor Range Query . . . . . . . . . . . . . . 15 3.3.3 Symbolic Model based Indoor kNN Query . . . . . . . . . . . . . . . 17 4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1 Event-Driven Raw Data Collector . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Indoor Walking Graph Model and Anchor Point Indexing Model . . . . . . . 22 4.3 Query Aware Optimization Module . . . . . . . . . . . . . . . . . . . . . . . 24 4.4 Particle Filter-Based Preprocessing Module . . . . . . . . . . . . . . . . . . . 26 4.5 Cache Management Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.6 Query Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.6.1 Indoor Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 iv 4.6.2 Indoor kNN Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.1 Simulator Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 E?ects of Query Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3 E?ects of k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.4 E?ects of Number of Particles . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.5 E?ects of Number of Moving Objects . . . . . . . . . . . . . . . . . . . . . . 39 5.6 E?ects of Activation Range . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6 Continuous Spatial Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.1 Continuous Indoor Range Query . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.2 Continuous Indoor kNN Query . . . . . . . . . . . . . . . . . . . . . . . . . 43 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 v List of Figures 3.1 An example of particle ?ltering. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Floor plan and connectivity graph. . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Accessibility graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4 An example of the symbolic model. . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.5 Range query example. [29] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6 Distance based pruning. [30] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 Overall system structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Example of ?ltering out kNN query non-candidate objects. Note that si(li) is the minimum (maximum) shortest network distance from q to the uncertain region of oi. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3 Example of ?ltering out range query non-candidate objects. . . . . . . . . . . . 25 4.4 Example of indoor range query. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.5 Example of indoor kNN query. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.1 The simulator structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.2 E?ects of query window size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3 E?ects of k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 vi 5.4 The impact of the number of particles. . . . . . . . . . . . . . . . . . . . . . . . 38 5.5 The impact of the number of moving objects. . . . . . . . . . . . . . . . . . . . 38 5.6 The impact of activation range. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.1 Mapping process to identify critical devices. . . . . . . . . . . . . . . . . . . . . 41 vii List of Tables 3.1 Symbolic notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5.1 Default values of parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 viii Chapter 1 Introduction Today most people spend a signi?cant portion of their time daily in indoor spaces such as subway systems, o?ce buildings, shopping malls, convention centers, and many other structures. In addition, indoor spaces are becoming increasingly large and complex. For instance, the New York City Subway has 468 stations and contains 209 miles (337 km) of routes [28]. In 2011, the subway system delivered over 1.64 billion rides, averaging approximately 5.3 million rides on weekdays [15]. Therefore, users will have more and more demand for launching spatial queries for ?nding friends or Points Of Interest (POI) in indoor places. However, existing spatial query evaluation techniques for outdoor environments (either based on Euclidean distance or network distance) [18, 6, 16, 19, 12] cannot be applied in indoor spaces because these techniques assume that user locations can be acquired from GPS signals or cellular positioning, but the assumption does not hold in covered indoor spaces. Furthermore, indoor spaces are usually modelled di?erently from outdoor spaces. In indoor environments, user movements are enabled or constrained by entities and topologies such as doors, walls, and hallways. Radio Frequency Identi?cation (RFID) technologies have become increasingly popular over the last decade with applications in areas such as supply chain management [20], health care, and transportation. In indoor environments, RFID is mainly employed to support track and trace applications. Generally, RFID readers are deployed in critical locations while objects carry RFID tags. When a tag passes the detection range of a reader, the reader recognizes the presence of the tag and generates a record in the back end database. However, the raw data collected by RFID readers is inherently unreliable [21, 8], with false negatives as a result of RF interference, limited detection range, tag orientation, and other 1 environmental phenomena [26]. In addition, readers cannot cover all areas of interest because of their high cost or privacy concerns [24]. Therefore, we cannot directly utilize RFID raw data to evaluate commonly used spatial query types (e.g., range and kNN) for achieving high accuracy results in indoor environments. In this research, we consider the setting of an indoor environment where a number of RFID readers are deployed in hallways. Each user is attached with an RFID tag, which can be identi?ed by a reader when the user is within the detection range of the reader. Given the history of RFID raw readings from all the readers, we are in the position to design a system that can e?ciently answer indoor spatial queries. Particle ?lters are sequential Monte Carlo methods based on point mass representations of probability densities, which can be applied to any state-space model [1]. Particle ?lters can be employed to estimate the state of a system that changes over time using a sequence of noisy measurements made on the system. In this paper we propose the particle ?lter-based location inference method, the indoor walking graph model, and the anchor point indexing model for inferring object locations from noisy RFID raw data. On top of the location inference, indoor range and kNN queries can be evaluated e?ciently by our algorithms with high accuracy. The contributions of this study are as follows: ? We design the particle ?lter-based location inference method as the basis for evaluating indoor spatial queries. ? We propose two novel models, the indoor walking graph model and the anchor point indexing model, and an RFID-based system for tracking object locations in indoor environments. ? Indoor spatial query evaluation algorithms for range and kNN queries are developed based on the proposed system. 2 ? We demonstrate the e?ciency and e?ectiveness of our approach by comparing the per- formance of our system with the symbolic model-based solution [30] through extensive simulations using real-world parameters. The rest of this paper is organized as follows. In chapter 2, we survey previous works for indoor object monitoring and spatial queries. Background knowledge of particle ?lters and the symbolic model-based location inference is provided in chapter 3. In chapter 4 we introduce our particle ?lter-based indoor spatial query evaluation system. The experimental validation of our design is presented in chapter 5. Chapter 6 proposes tentative solutions for more complicated continuous spatial query types. And Chapter 7 concludes this paper with a discussion of future work. 3 Chapter 2 Related Work In this section, we review previous work related to indoor spatial queries and RFID data cleansing. 2.1 Indoor Spatial Queries Outdoor spatial queries, e.g., range and kNN queries, have been extensively studied both for Euclidean space [18, 6] and road networks [16, 19, 12]. However, due to the inherent di?erences in spatial characteristics, indoor spatial queries need di?erent models and cannot directly apply mature techniques from their outdoor counterparts. Therefore, indoor spatial queries are drawing more and more research attentions from industry and academia. For answering continuous range queries in indoor environments, Jensen et al. [9] proposed using the positioning device deployment graph to represent the connectivity of rooms and hallways from the perspective of positioning devices. Basically, entities that can be accessed without having to be detected by any positioning device are represented by one cell in the graph, and edges connecting two cells in the graph represent the positioning device(s) which separate them. Based on the graph, initial query results can be easily processed with the help of an indexing scheme also proposed by the authors [29]. Query results are returned in two forms: certain results and uncertain results. To reduce the workload of maintaining and updating the query results, Yang et al. further proposed the concept of critical devices. Only from the ENTER and LEAVE observations of its critical devices can a query?s results be a?ected. However, the probability model utilized in Yang?s work is very simple: a moving object is uniformly distributed over all the reachable locations constrained by its maximum speed in a given indoor space. This simple probability model is incapable of taking advantage of the 4 moving object?s previous moving patterns, such as direction and speed, which would make the location prediction more reasonable and precise. In addition, Yang et al. [30] also addressed the problem of kNN queries over moving objects in indoor spaces. Unlike another previous work [14] which de?nes nearest neighbors by the minimal number of doors to go through, they proposed a novel distance metric, minimum indoor walking distance, as the underlying metric for indoor kNN queries. Moreover, Yang et al. provided the formal de?nition for Indoor Probabilistic Threshold kNN Query (PTkNN) as ?nding a result set with k objects which have a higher probability than the threshold probability T. Indoor distance-based pruning and probability threshold-based pruning are proposed in Yang?s work to speed up PTkNN query processing. Similarly, the paper employs the same simple probabilistic model as in [29], and therefore has the same de?ciencies in probability evaluation. 2.2 RFID-Based Track and Trace RFID is a very popular electronic tagging technology that allows objects to be auto- matically identi?ed at a distance using an electromagnetic challenge-and-response exchange of data [23]. An RFID-based system consists of a large number of low-cost tags that are at- tached to objects, and readers which can identify tags without a direct line-of-sight through RF communications. RFID technologies enable exceptional visibility to support numerous track and trace applications in di?erent ?elds [31]. However, the raw data collected by RFID readers is inherently noisy and inconsistent [21, 8]. Therefore, middleware systems are re- quired to correct readings and provide cleansed data [7]. In addition to the unreliable nature of RFID data streams, another limitation is that due to the high cost of RFID readers, RFID readers are mostly deployed such that they have disjoint activation ranges in the settings of indoor tracking. Furthermore, privacy (i.e., readers are deployed in hallways rather than rooms in o?ce buildings) is also an important concern [26]. 5 To overcome the above limitations, RFID data cleansing is a necessary step to produce consistent data to be utilized by high-level applications. Tran et al. [22] used a sampling- based method called particle ?ltering to infer clean and precise event streams from noisy raw data produced by mobile RFID readers. Three enhancements are proposed in their work to make traditional particle ?lter techniques scalable. However, their work is mainly designed for warehouse settings where objects remain static on shelves, which is quite di?erent from our setting where objects move around in a building. Therefore, Tran?s approach of adapting and applying particle ?lters cannot be directly applied to our settings. Another limitation of [22] is that they did not explore further utilization of the output event streams for high- level applications. Chen et al. [3, 10] employed a di?erent sampling method called Markov Chain Monte Carlo (MCMC) to infer objects? locations on shelves in warehouses. Their method takes advantage of the spatial and temporal redundancy of raw RFID readings, and also considers environmental constraints such as the capacity of shelves, to make the sampling process more precise. Their work also focuses on warehouse settings; thus is not suitable for our problem of general indoor settings. The works in [17, 25, 13] target settings such as o?ce buildings, which are similar to our problem. They use particle ?lters in their preprocessing module to generate probabilistic streams, on which complex event queries such as ?Is Joe meeting with Mary in Room 203?? can be processed. However, their goal is to answer event queries instead of spatial queries, which is di?erent from the goal of this research. Furthermore, a hot research topic of the robotics research community, simultaneous localization and mapping (SLAM), also makes extensive utilization of particle ?lters [27, 2]. 6 Chapter 3 Preliminary In this section, brief introductions to the mathematical background of particle ?lters, particle ?lter-based location inference, and symbolic model-based location inference [29] are provided. Particle ?lters are the main technique utilized in this paper to infer the posterior probability distributions of objects? locations. We ?rst introduce the mathematical derivation of particle ?lters. Then, we present particle ?lter-based location inference for supporting indoor spatial queries. To the best of our knowledge, symbolic model-based location inference is the only method of drawing the probability distribution of an object?s location for the purpose of indoor spatial queries in the literature. Therefore, we describe it here in order to compare it with our methods. Table 1 summarizes the notations used in this paper. 3.1 Particle Filters In this section, we describe the formal mathematical statements of the Sampling Impor- tance Resampling (SIR) ?lter (the original particle ?ltering algorithm) [5], which provides a technical context for later chapters. A particle ?lter is a method that can be applied to nonlinear recursive Bayesian ?ltering problems [1]. The system under investigation is often modeled as a state vector, which con- tains all relevant information about the system. At least two models are critical in analyzing and making inferences about a dynamic system: the system model and the measurement model, which are given by Equations (3.1) and (3.2), respectively. xk = fk(xk 1;vk 1) (3.1) 7 Equation (3.1) describes how the system evolves from the state vector xk 1 at time k?1 to state vector xk at time k. fk is a possible nonlinear function, and vk 1 is an independently identically distributed (i.i.d.) process noise sequence. zk = hk(xk;uk) (3.2) Equation (3.2) describes how observation zk relates to the true state xk of the system, where hk is a possible nonlinear function and uk is an i.i.d. measurement noise sequence. The objective of the particle ?lter method is to construct a discrete approximation to the probability density function (pdf) p(xk|z1:k), that is, the probability of the system true state at time k given previous observations from time 1 to k. Particle ?lters approximate this pdf function by a set of random samples with associated weights. Simply put, every Symbol Meaning q An indoor query point oi The object with ID i C A set of candidate objects D A set of sensing devices G The indoor walking graph pi A probability distribution function for oi in terms of all possible loca- tions api An anchor point with ID i Ns The total number of particles for an object umax The maximum walking speed of a person lmax The maximum walking distance of a person during a certain period of time UR(oi) The uncertain region of object oi si The minimum shortest network dis- tance li The maximum shortest network dis- tance Areai The size of a given region i Table 3.1: Symbolic notations. 8 particle represents a hypothesis (sample) of the system true state; According to whether the hypothesis is consistent with the observation, each particle is assigned a di?erent weight. The weights are determined by the principle of importance sampling with the importance density to be p(xk|xk 1) for the SIR ?lter [1]. Given {xik 1;wik 1}Nsi=1 where {xik 1;i = 1;:::;Ns} is a set of support points (particles) with associated weights {wik 1;i = 1;:::;Ns}, the support points update formula and weight update formula for the SIR ?lter are: xik ? p(xk|xik 1) (3.3) wik ? wik 1p(zk|xik) (3.4) From Equation (3.3), we can see that the particle xik at time k is sampled from the conditional pdf p(xk|xik 1) with xik 1 being its parent particle from time k?1. Theoretically, p(xk|xik 1) is related to and can be inferred from the system model (3.1). Equation (3.4) means that the new weight wik is proportional to the old weight wik 1 augmented by the obser- vation likelihood p(zk|xik), which can be inferred from the measurement model (3.2). Thus, particles which are more likely to cause an observation consistent with the true observation result zk will gain higher weights than others. The posterior ?ltered density p(xk|z1:k) can be approximated as p(xk|z1:k) ? Ns? i=1 wik (xk ?xik) (3.5) Equations (3.3) and (3.4) are conceptual processes of how to iteratively calculate par- ticles and their weights; however, in real world applications it is hard to derive analytical forms of p(xk|xik 1) and p(zk|xik). In our application, particles update their locations accord- ing to the object motion model employed in our work. Simply put, the object motion model assumes objects move forward with constant speeds, and can either enter rooms or continue 9 Algorithm 1 Resampling Algorithm 1. {{xj?k ;wjk}Nsj=1=RESAMPLE[{xik;wik}Nsi=1]} 2. Initialize the CDF: c1 = 0 3. for i = 2 to Ns do 4. Construct CDF: ci = ci 1 + wik 5. end for 6. Start at the bottom of the CDF: i = 1 7. Draw a starting point: u1 ??[0;N 1s ] 8. for j = 1 to Ns do 9. Move along the CDF: uj = u1 + N 1s (j ?1) 10. while uj > ci do 11. i = i + 1 12. end while 13. Assign sample: xj?k = xik 14. Assign weight: wjk = N 1s 15. end for to move along hallways. Once an object is inside a room, it will continue to reside in the room with probability 0.9, or start to move out of the room with probability 0.1. Weights of particles are updated according to the device sensing model [3] used in this research. Resampling is a method to solve the degeneration problem in particle ?lters. Degen- eration means that with more iterations only a few particles would have dominant weights while the majority of others would have near-zero weights. The basic idea of resampling is to eliminate low weight particles, replicate high weight particles, and generate a new set of particles {xi?k }Nsi=1 with equal weights. In SIR ?lters, the resampling step is performed at every time index. The algorithm of resampling is shown in Algorithm 1. 3.2 Particle Filter-Based Location Inference Now we are ready to explain how we apply particle ?lters to the problem of RFID-based indoor location inferences. We will use Figure 3.1 as an example. In Figure 3.1, d1, d2 and d3 are RFID readers which partition the hallway into four di?erent sections labelled H1, H2, H3, and H4, respectively. Suppose from raw readings we know that a tag is ?rst seen at d2 at time t0, then later is seen at d3 at time t1. We want to predict its location in a probabilistic form after the tag leaves the activation range of d3. 10 After leaving d3, the person carrying the tag is more likely to keep his/her original moving direction and move towards H4 rather than backward to H3. By their very nature, particle ?lters will produce ?ltered results consistent with our expectation. The rest of this section will explain why particle ?lters are able to predict this trend. We assume particle ?lters start running at t0. At ?rst, particles represent samples drawn from the initial pdf p(x0) of the person?s location. In other words, each particle represents a hypothesis of the person?s state with its own location, moving direction, and speed. At t0 the person?s tag is detected by d2, which means that the person must be somewhere within the detection range of d2. Initially, particles are distributed randomly within the detection range of d2 as shown in Figure 3.1(a). Every particle randomly picks its moving direction and speed. For simulating people?s indoor movements, we set particles? speed to be a Gaussian distribution of = 1 m/s and = 0:1. After the initial distribution, particles update their locations according to their own speeds and directions. Some particles may move right to H3 or possibly enter rooms R3 and R7; some particles may move left to H2, R2, and R6. Up to time t1, particles already become dispersed as shown in Figure 3.1(b). At t1 a new reading is generated by d3, when the person entered d3?s activation range. At every new observation, particle ?lters are going to perform the steps of reweighting and resampling. For readings from d3, particles that are within the detection range of d3 are assigned high weights, while particles elsewhere are assigned a very low weight. Next in the resampling step, particles are sampled with a probability proportional to their weights. Thus after resampling, most particles are replicates of previous highly-weighted particles; that is, the ones within the detection range of d3. The (a) Particles Distribution at t0 (b) Particles Distribution at t1 Figure 3.1: An example of particle ?ltering. 11 newly generated particles maintain the moving direction of those highly weighted particles, which is from left to right. Therefore, at this step after analyzing two devices? readings, particle ?lters already gain some knowledge of the true moving direction and speed of the person. After the person leaves d3?s activation range but before any new observation, particle ?lters are going to predict the person?s location to be more likely in H4 rather than in H3 or H2, because most particles now are moving in the direction of the hallway from left to right. 3.3 Symbolic Model-Based Location Inference In this section, we introduce symbolic model based location inference and the corre- sponding range/kNN query evaluation algorithms proposed in [29] and [30]. In chapter 5, we did extensive experiments to compare the performance of our particle ?lter based query evaluation algorithms with the state-of-art symbolic model based algorithms, and the results reveal that our algorithm can achieve a much higher accuracy. 3.3.1 Symbolic Model and Deployment Graph Symbolic models are di?erent from traditional geometric coordinate models, the advan- tage of symbolic models lies in its ability to capture the semantics associated with indoor entities [9]. In symbolic models, a base graph describes the topology of an indoor space in which each separate partition such as a room, a staircase, or a hallway is represented as a vertex. All the space outside of the whole indoor space is also represented as one vertex, while edges capture the connectivity (undirected, such as a door connecting two rooms) or accessibility (directed, such as a one way entrance/exit) between two vertices. Figure 3.2 and 3.3 show an example of a ?oor plan and its symbolic model: the connectivity graph or the accessibility graph. On the foundation of a base graph model for an indoor space, a deployment graph can be constructed according to the deployment of a particular positioning technology [9]. Basically, entities that can be accessed without having to be detected by any device are 12 Figure 3.2: Floor plan and connectivity graph. Figure 3.3: Accessibility graph. 13 represented by one cell in the graph, and edges connecting two cells in the graph represent the device(s) which separate them. We refer readers to [9] to see the detailed algorithm of the RFID reader deployment graph construction. Below, an example of a possible RFID reader deployment in an indoor space and its corresponding deployment graph is shown in Figure 3.4.      (a) Reader deployment in an in- door space.               (b) Reader deployment graph. Figure 3.4: An example of the symbolic model. In Figure 3.4(a), since one can enter the hallway in the middle from the room on the right through door1 without being detected by any RFID reader, they are represented as Cell9 in Figure 3.4(b); the same is true for the two rooms on the top left corner of Figure 3.4(a) as Cell4. Cell3 represents the staircase and is separated from other rooms or hallways by a pair of RFID readers (reader1 and reader1?). All the outside space is represented by Cell10. The work in [29] de?nes three types of positioning devices: ? Undirected Partitioning Device: it separates two cells but cannot di?erentiate the moving directions of objects, such as reader4. ? Directed Partitioning Device: it consists of an entry/exit pair of devices, and is able to not only partition cells but also infer the moving directions of objects by the reading sequence. An example is reader1 and reader1? in Figure 3.4. 14 ? Presence Device: it simply senses objects within its detection range, but does not par- tition the space into di?erent cells. For example, reader3 is such a device in Figure 3.4. Symbolic model-based location inference assumes an object?s position is uniformly dis- tributed over all possible locations. More speci?cally, we discuss several cases here to better explain how this probability model works: Case 1: If an object is currently being observed by an RFID reader, then its possible loca- tion is anywhere in the detection range of the reader. Case 2: If an object leaves a presence device, it must still be in the same cell as the presence device. For example, in Figure 3.4, if an object leaves reader3, it must be inside cell9 before being detected by any other reader. Case 3: If an object leaves a directed partitioning device pair, the cell the object is entering can be inferred from the reading sequence. For example, if an object is seen at reader1? and then reader1, it must enter cell3. Case 4: If an object leaves an undirected partitioning device, it can be in either of the cells that the device partitions. For example, if an object leaves reader4, it can either be in cell4 or cell9. Note that this inference method is very conservative in the sense that it will identify all the possible locations an object can be, but is unable to further di?erentiate an object?s location within all possible cells. Therefore, we choose to apply the more e?ective particle ?lter-based location inference technique in our design. 3.3.2 Symbolic Model based Indoor Range Query To e?ectively answer range queries with symbolic model and deployment graph, Yang etc. [29] proposed to use an indexing scheme with several hash tables. We are going to 15 introduce the indexing scheme and how to evaluate range queries with the help of the hash tables. DHT: A device hash table (DHT) maps a deviceID to the set of objects active in its reading range. CDHT: A cell deterministic hash table (CDHT) maps a cellID to the set of objects that are known deterministically in it. CNHT: A cell nondeterministic hash table (CNHT) maps a cellID to the set of objects that are possibly in it. OHT: An object hash table (OHT) maps an objectID to a tuple (STATE;t;IDset). STATE is an enumerate type which has three values: active (the object is in the active range of a reader), deterministic (the object is deterministically located in a cell), or nonde- terministic (the object is possibly in a set of cells). t is the start time of the current state, and IDset is a device ID, a cell ID, or a set of cell IDs corresponding to the three di?erent states. It is very obvious that from the four cases of location inference from section 3.3.1, case 1 in which the object enters the range of a reader, should generate an entry in the DHT, case 2 and 3 in which the object leaves a presence device or directed partitioning device, should generate an entry in CDHT, and case 4 in which the object leaves an undirected partitioning device should generate an entry in CNHT. To answer a snapshot range query, the evaluation algorithm ?rst identi?es cells and readers that overlap with the query range, then look up DHT, CDHT, CNHT to get the result set. The algorithm categorizes the results into two types: deterministic (objects that are for sure within the query range) and nondeterministic (objects that are possibly in the query range). Cells and devices that are fully contained in the query range contribute to the deterministic result set from their CDHT and DHT, contribute to the nondeterministic 16 result set from CNHT. Cells and devices that partially intersect with the query range can only contribute to the nondeterministic result set. However, there is one more complication here: If an object?s state is nondeterministic, and cells of its IDset are all fully contained in the query range, then the object is also in the deterministic result set. Figure 3.5 shows an example of two range queries. For query 2, if an object was last seen by device 16, it is in nondeterministic state and its cellID is (12, 13); query 2 happen to fully cover cell 12 and 13, so this object should be in the deterministic result set. Figure 3.5: Range query example. [29] 3.3.3 Symbolic Model based Indoor kNN Query With uniform distribution model at its core, symbolic model based indoor kNN query is not an easy task since there are many di?erent combinations with di?erent probabilities. Yang etc [30] proposed kNN query with user speci?ed probability threshold, which can be used to ?lter out objects whose probability of being inside the result set is lower than the threshold. By pruning the non-candidate objects, the number of possible result sets is considerably reduced, and a large percent of the calculations can be thus saved in the ?nal stage of probability evaluation of possible result sets. 17 First of all, a novel distance metric is proposed in Yang?s work?minimum indoor walking distance. This new metric, as its name suggests, represents the minimum distance needed to walk from any two points in an indoor space. A precomputed door-to-door distance graph is used in their algorithm. The source/destination point ?rst searches for the nearest door, and the shortest distance between the two doors is then calculated according to the door-to-door graph. Thus the shortest distance between source and destination points can be calculated accordingly. A major contribution of Yang?s work is that they proposed two e?ective pruning methods of non-candidate objects: distance based pruning and probability threshold based pruning. Distance based pruning will ?lter out objects whose uncertain regions are too far away from the query point that at least k objects are for sure closer to the query point. For example, ?gure 3.6 shows one kNN query with four objects currently in the indoor enviroment. Even the shortest distance from q to object 4?s uncertain region is larger than the furthest distance of q to the uncertain regions of objects 1, 2, and 3. Therefore, object 4 can be safely pruned if this is a 3NN query. We borrow this idea in our system design to optimize particle ?lters based kNN queries in section 4.3, which has more formal mathematical de?nitions of distances. Probability threshold based pruning is one step further than the former pruning method, which calculates the probability of an object being in the result set, and ?lter out those whose probability is lower than threshold. After pruning, probability evaluation of all di?erent combinations of candidate objects is performed, and low probability combinations are eliminated. Suppose there arem objects left after initial pruning, then there are Ckm di?erent possible result sets needing to be evaluated. Yang etc. approximated continuous integration of pdf functions to be the sum of many discrete intervals. As their previous work [29], the main drawback of this approach is that all the probability evaluations are based on the simple but inaccurate uniform distribution assumptions. Also, the computation intensity of approximating the integration can get high when m is large. 18 Figure 3.6: Distance based pruning. [30] 19 Chapter 4 Design In this section, we will introduce the design of an RFID-based indoor range and kNN query evaluation system, which incorporates ?ve modules: event-driven raw data collector, query aware optimization module, particle ?lter-based preprocessing module, cache man- agement module, and query evaluation module. In addition, we introduce the underlying framework of two models: indoor walking graph model and anchor point indexing model. We will elaborate the function of each module and model in the following sections. Figure 4.1 shows the overall structure of our system design. Raw readings are ?rst fed into and processed by the event-driven raw data collector module, which then provides aggregated readings for each object at every second to the query aware optimization module, particle ?lter-based preprocessing module, and cache management module. The query aware optimization module ?lters out non-candidate objects according to registered queries and objects? most recent readings, and outputs a candidate set C to the particle ?lter-based preprocessing module. The particle Filter-based preprocessing module cleanses the noisy raw data for each object in C, stores the resulting probabilistic data in a hash table, and passes the hash table to the query evaluation module. At the same time, particle ?lter- based preprocessing module and the cache management module communicates data when necessary. The cache management module also requests data from the event-driven raw data collector module in order to age out old entries. At last, the query evaluation module answers registered range and kNN queries based on the hash table that contains ?ltered data. 20 &DFKH0DQDJHPHQW0RGXOH (YHQWGULYHQ 5DZ'DWD&ROOHFWRU4XHU\(YDOXDWLRQ0RGXOH 4XHU\$ZDUH2SWLPL]DWLRQ0RGXOH 3DUWLFOH)LOWHUEDVHG3UHSURFHVVLQJ0RGXOH Figure 4.1: Overall system structure. 4.1 Event-Driven Raw Data Collector In this section, we describe the event-driven raw data collector which is the front end of the entire system. The data collector module is responsible for storing RFID raw readings in an e?cient way for the following query processing tasks. Considering the characteristics of particle ?ltering, readings of one detecting device alone cannot e?ectively infer an object?s moving direction and speed, while readings of two or more detecting devices can. We de?ne events in this context as the object either entering (ENTER event) or leaving (LEAVE event) the reading range of an RFID reader. To minimize the storage space for every object, the data collector module only stores readings during the most recent ENTER, LEAVE, ENTER events, and removes earlier readings. In other words, our system only stores readings of up to the two most recent consecutive detecting devices for every object. For example, if an object is previously identi?ed by di and dj, readings from di and dj are stored in the data collector. When the object is entering the detection range of a new device dk, the data collector will record readings from dk while removing older readings from di. The data collector module is also responsible for aggregating the raw readings to more concise entries with a time unit of one second. The reasons are twofold: RFID readers usually have a high reading rate of tens of samples per second. However, particle ?lters do not need such a high observation frequency. An update frequency of once per second would provide a good enough resolution. Therefore, aggregation of the raw readings can further save storage without compromising accuracy. Another advantage of aggregating is to signi?cantly mitigate the e?ects of missing readings. With tens of samples per second, 21 as long as an object is detected at least once during a second, an entry marking that event is inserted into the aggregated results. It is very unlikely that all the readings of an object during one second are totally missed by a reader. Thus aggregation can greatly reduce the detecting errors of false negatives. It is worth noting that since this research focuses on snapshot queries launched at the present time, the data collector module can be designed as above to save storage space. For systems which are required to answer historical queries, the data collector module needs to be modi?ed accordingly to keep a longer reading history. 4.2 Indoor Walking Graph Model and Anchor Point Indexing Model This section introduces the underlying assumptions and backbone models of our system, which forms the basis for understanding subsequent sections. We propose two novel models in our system, indoor walking graph model and anchor point indexing model, for tracking object locations in indoor environments. Indoor Walking Graph Model: we assume our system setting is a typical o?ce building where the width of hallways can be fully covered by the detection range of sensing devices (which is usually true since the detection range of RFID readers can be as long as 3 meters), and RFID readers are deployed only along the hallways. In this case the hallways can simply be modelled as lines, since from RFID reading results alone, the locations along the width of hallways cannot be inferred. Furthermore, since no RFID readers are deployed inside rooms, the resolution of location inferences cannot be higher than a single room. Based on the above assumptions, we propose an indoor walking graph model. The indoor walking graph G?N;E? is abstracted from the regular walking patterns of people in an indoor environment, and can represent any accessible path in the environment. The graph G comprises a set N of nodes together with a set E of edges. By restricting object movements and particle movements to be only on the edges E of G, we can greatly simplify the object movement model while at the same time still preserving the inference accuracy of 22 particle ?lters. Also, the distance metric used in this paper, e.g., in kNN query evaluations, can simply be the shortest spatial network distance on G, which can then be calculated by many well-known spatial network shortest path algorithms [16, 19] as shown in Figure 4.2. Anchor Point Indexing Model: the indoor walking graph edges E are by nature continuous. To simplify the representation of an object?s location distribution on E, we propose an e?ective spatial indexing method: anchor point-based indexing. We de?ne anchor points as a set AP of prede?ned points on E with a uniform distance (such as 1 meter) to each other. An example of anchor points is shown in Figure 4.2. In essence, the model of anchor points is a scheme of trying to discretize objects? locations. After particle ?ltering is ?nished for an object oi, every particle of oi is assigned to its nearest anchor point, so that the inferred object location can only be on discrete locations instead of anywhere on E. For an anchor point apj with a nonzero number n of particles, pi(oi:location = apj) = n=Ns, where pi is the probability distribution function that oi is at apj and Ns is the total number of particles for oi. A hash table APtoObjHT is maintained in our system with the key to be the coordi- nates of an anchor point apj and returned value the list of each object and its probability q Figure 4.2: Example of ?ltering out kNN query non-candidate objects. Note that si(li) is the minimum (maximum) shortest network distance from q to the uncertain region of oi. 23 at the anchor point (?oi;pi(apj)?). For instance, an entry of APtoObjHT would look like: (8:5;6:2);{?o1;0:14?;?o3;0:03?; ?o7;0:37?}, which means at the anchor point with coordinate (8.5, 6.2), there are three pos- sible objects (o1, o3, and o7), with probabilities of 0.14, 0.03, and 0.37, respectively. With the help of the above anchor point indexing model, the query evaluation module can simply refer to the hash table APtoObjHT to determine objects? location distributions. 4.3 Query Aware Optimization Module To answer every range query or kNN query, a naive approach is to calculate the prob- ability distribution of every object?s location currently in the indoor setting. However, if query ranges cover only a small fraction of the whole area, then there will be a considerable percentage of objects who are guaranteed to be not in the result set of any query. We call those objects that have no chance to be in any result set ?non-candidate objects?. The computational cost of running particle ?lters for non-candidate objects should be saved. In this section we present two e?cient methods to ?lter out non-candidate objects for range query and kNN query, respectively. Range Query: to decrease the computational cost, we employ a simple approach based on the Euclidian distance instead of the minimum indoor walking distance [30] to ?lter out non-candidate objects. An example of the optimization process is shown in Figure 4.3. For every object oi, its most recent detecting device d and last reading time stamp tlast are ?rst retrieved from the data collector module. We assume the maximum walking speed of people to be umax. Within the time period from tlast to the present time tcurrent, the maximum walking distance of a person is lmax = umax ?(tcurrent ?tlast). We de?ne oi?s uncertain region UR(oi) to be a circle centered at d with radius r = lmax+d:range. If UR(oi) does not overlap with any query range then oi is not a candidate and should be ?ltered out. On the contrary, if UR(oi) overlaps with one or more query ranges then we add oi to the result candidate 24 G VUHDGLQJUDQJH TXHU\UDQJH Figure 4.3: Example of ?ltering out range query non-candidate objects. set C. In Figure 4.3, the only object in the ?gure should be ?ltered out since its uncertain region does not intersect with any range query currently evaluated in the system. kNN Query: by employing the idea of distance-based pruning in [30], we perform a similar distance pruning for kNN queries to identify candidate objects. We use si(li) to denote the minimum (maximum) shortest network distance (with respect to the indoor walking graph) from a given query point q to the uncertain region of oi: si = min p2UR(oi) dshortestpath(q;p);li = max p2UR(oi) dshortestpath(q;p): (4.1) Let f be the k-th minimum of all objects? li values. If si of object oi is greater than f, object oi can be safely pruned since there exist at least k objects whose entire uncertain regions are de?nitely closer to q than oi?s shortest possible distance to q. Figure 4.2 is an example pruning process for a 2NN query: There are 3 objects in total in the system. We can see l1 < l2 < l3 and consequently f = l2 in this case; s3 is greater than f, so o3 has no chance to be in the result set of the 2NN query. We run the distance pruning for every kNN query and add possible candidate objects to C. Finally, a candidate set C is produced by this module, containing objects that might be in the result set of one or more range queries or kNN queries. C is then fed into the particle ?lter preprocessing module which will be explained in the next section. 25 4.4 Particle Filter-Based Preprocessing Module We design a particle ?lter-based algorithm (Algorithm 2) with the prior knowledge of the indoor walking graph G?N;E?, anchor points set AP, and the deployment information of sensing devices D. This algorithm receives the output candidates set C from the query aware optimization module as input, infers the probability distribution of candidate objects? locations, and smooths out the result by assigning particles? locations to the nearest anchor point. For every candidate in C, the Particle Filter algorithm ?rst retrieves its most recent readings detected by up to two RFID readers (this number can be adjusted by users for supporting other query types) from the data collector module. If the object has just entered the system and is only detected by one sensing device, the algorithm still runs, although one device?s readings alone can hardly determine the object?s moving direction. If the object is undetected by any device for a long time, it is highly likely that the object stays in a room. In this case if the particle ?lter continues running for a while without any observation, particles will become dispersed over a large area and the ?ltering result will become unusable. Line 6 restricts the particle ?lter from running more than 60 seconds beyond the last active reading. The particle ?lter method consists of 3 steps: initialization, particle updating, and par- ticle resampling. In the ?rst step, a set of particles are generated and uniformly distributed on the graph edges within the detection range of di, and each particle picks its own moving direction and speed as in line 5. In our system, particles? speeds are drawn from a Gaussian distribution with = 1 m/s and = 0:1. In the updating step, lines 8 to 16 are particles? location updates; at every time interval (1 second), particles move along the graph edges according to their speed and direction. Particles pick a random direction at intersections; if particles are inside rooms, they continue to stay inside with probability 0.9 and move out with probability 0.1. After location updating, lines 21 to 27 update particles? weights according to their consistency with reading results. In other words, particles within the 26 Algorithm 2 Particle Filter(C) 1. for each object oi of C do 2. retrieve oi?s aggregated readings from the data collector module 3. t0, td = the starting/ending time of the aggregated readings 4. di, dj = the second most/most recent detecting devices for oi //dj may not exist 5. generate particles for oi within di:activationRange 6. tmin = min(td + 60;tcurrent) 7. for every second tj from t0 to tmin do 8. for every particle pm of oi do 9. Let pm move along graph edges E with pm:speed and pm:direction 10. if pm meets intersection then 11. pm randomly choose a direction 12. end if 13. if pm resides in a room node of G then 14. pm moves out of room with probability 0.1 15. end if 16. end for 17. retrieve the aggregated reading entry reading of tj 18. if reading:Device=null then 19. continue 20. else 21. for every particle pm of oi do 22. if pm ? reading:Device:activationRange then 23. assign a high weight to pm 24. else 25. assign a low weight to pm 26. end if 27. end for 28. normalize the weights of all particles of oi 29. Resampling() // Algorithm 1 30. end if 31. end for 32. assign particles of oi to their nearest anchor points 33. for each anchor point ap with a nonzero number of particles n do 34. calculate probability pi(oi:location = ap) = n=Ns 35. update Hash Table APtoObjHT 36. end for 37. end for detecting device?s range are assigned a high weight, while others are assigned a low weight. In the resampling step, particles? weights are ?rst normalized as in line 28. We then employ the Resampling algorithm to replicate highly weighted particles and remove lowly weighted particles as in line 29. 27 Lines 33 to 36 discretize the ?ltered probabilistic data and build the hash table APtoObjHT as described in Section 4.2. 4.5 Cache Management Module The cache management module is optional for the system functionality, but will improve the query evaluation performance if queries are frequent and geographically adjacent to previous queries. We design the cache management module to store the particle states of objects from the Particle Filter algorithm. Consequently, insertion to the cache happens every time when Algorithm 2 is done for an object oi. In case near future queries need to determine the location distribution for the same object oi again, we do not need to run the Particle Filter algorithm from the start; instead, previous computation is reused by retrieving the particles of oi from the cache and resuming the Particle Filter algorithm from the cache-stored time stamp. We also need to design a proper life time for entries in the cache. Intuitively, we know that moving patterns from a distant past provide little help to current location inferences. The same is true for particle ?ltering. Suppose for an object oi, the cache stores its particles due to a previous query at time tprev. In addition, assume the situation in the period from tprev to tcurrent, oi is detected by two or more readers. According to Algorithm 2, the old particles in the cache are useless since we only focus on readings of the most recent two readers after tprev. Furthermore, in order to make the ?ltering process among objects consistent (i.e., the particle ?ltering for each object is based on the readings of the most recent two devices), we decide to discard processed particles of oi from the cache every time oi is detected by a new device. Otherwise the particle ?ltering for some objects will be based on readings of more than two devices. If the cache management module is implemented, the Particle Filter algorithm needs to be slightly modi?ed by looking up the cache ?rst before running from t0. If there is a cache hit, then particle ?lters should run from the cache-stored time stamp. After the particle 28 ?ltering step in the Particle Filter algorithm, the object ID, particle states, and current time stamp are inserted into the cache. 4.6 Query Evaluation In this section we are going to discuss how to evaluate range and kNN queries e?ciently with the ?ltered probabilistic data in the hash table APtoObjHT. For kNN queries, without loss of generality, the query point is approximated to the nearest edge of the indoor walking graph for simplicity. 4.6.1 Indoor Range Query To evaluate indoor range queries, the ?rst thought would be to determine the anchor points within the range, then answer the query by returning objects and their associated probabilities indexed by those anchor points. However, with further consideration, we can see that since anchor points are restricted to be only on graph edges, they are actually the 1D projection of 2D spaces; the loss of one dimension should be compensated in the query evaluation process. Figure 4.4 shows an example of how the compensation is done with respect to two di?erent types of indoor entities: hallways and rooms. In Figure 4.4, query q is a rectangle which intersects with both the hallway and room R1, but does not directly contain any anchor point. We denote the left part of q which overlaps with the hallway as qh, and the right part which overlaps with R1 as qr. We Figure 4.4: Example of indoor range query. 29 Algorithm 3 Indoor Range Query(q) 1. resultSet=? 2. cells=getIntersect(q) 3. for every cell in cells do 4. if cell.type=HALLWAY then 5. anchorpoints=cell.getCoveredAP(q) 6. ratio=cell.getWidthRatio(q) 7. else if cell.type=ROOM then 8. anchorpoints=cell.getInsideAP() 9. ratio=cell.getAreaRatio(q) 10. end if 11. result=? 12. for each ap in anchorpoints do 13. result=result+APtoObjHT.get(ap) 14. end for 15. result=result*ratio 16. resultSet=resultSet+result 17. end for 18. return resultSet ?rst look at how to evaluate the hallway part of q. The anchor points which fall within q?s vertical range are marked red in Figure 4.4, and should be considered for answering qh. Since in our assumptions no di?erentiation along the width of hallways can be inferred about an object?s true location, objects in hallways can be anywhere along the width of hallways with equal probability. With this assumption, the ratio of wqh (the width of qh) and wh (the width of hallway) will indicate the probability of objects in hallways within the vertical range of q being in qh. For example, if an object oi is in the hallway and in the vertical range of q with probability p1, which can be calculated by summing up the probabilities indexed by the red anchor points, then the probability of this object being in qh is pi(oi:location ? qh) = p1 ?wqh=wh. Then we look at the room part of q. The anchor points within room R1 should represent the whole 2D area of R1, and again we assume objects inside rooms are uniformly distributed. Similar to the hallway situation, the ratio of qr?s area to R1?s area is the probability of an object in R1 happening to be in qr. For example, if oi?s probability of being in R1 is p2, then its probability of being in qr is pi(oi:location ? qr) = p2 ? Areaqr=AreaR1, where p2 can be 30 calculated by summing up the indexed probabilities of oi on all the anchor points inside R1 and Areai stands for the size of a given region i. Algorithm 3 summarizes the above procedures. In line 15, we de?ne the multiply opera- tion for resultSet to adjust the probabilities for all objects in it by the multiplying constant. In line 16, we de?ne the addition operation for resultSet to be: if an object probability pair ?oi;p? is to be added, we check whether oi already exists in resultSet. If so, we just add p to the probability of oi in resultSet; otherwise, we insert ?oi;p? to resultSet. For instance, sup- pose resultSet originally contains {(o1;0:2);(o2;0:15)}, and result stores {(o2;0:1);(o3;0:05)}. resultSet is updated to be {(o1;0:2);(o2;0:25);(o3;0:05)} after the addition in line 16. 4.6.2 Indoor kNN Query For indoor kNN queries, we present an e?cient evaluation method with statistical accu- racy. Unlike previous work [30, 4], which involves heavy computation and returns multiple result sets for users to choose, our method is user friendly and returns a relatively small number of candidate objects. Our method works as follows: starting from the query point q, anchor points are searched in ascending order of their distance to q; the search expands from q one achor point forward per iteration, until the sum of the probability of all objects indexed by the searched anchor points is no less than k. The result set has the form of ?(o1;p1);(o2;p2);:::(om;pm)? where ?mi=1 pi ? k. The number of returned objects will be at least k. From the sense of statistics, the probability pi associated with object oi in the result set is the probability of oi being in the kNN result set of q. The algorithm of the indoor kNN query evaluation method in our work is shown in Algorithm 4. In Algorithm 4, lines 1 and 2 are initial setups. Line 3 adds two entries to a vector V, whose elements store the edge segments expanding out from query point q. In the following for loop, line 5 ?nds the next unvisited anchor point further away from q. If all anchor points are already searched on an edge segment e, lines 6 to 11 remove e and add all adjacent unvisited edges of e.node to V. Line 12 updates the result set by adding ?object 31 Algorithm 4 Indoor kNN Query(q, k) 1. resultSet=? 2. ninj= nd segment(q) 3. vector V=?(ni;q);(nj;q)? // elements in V have the form (node, prevNode) 4. for every entry e in V do 5. anchorpoint= nd nextAnchorPoint(e) // return the next unsearched anchor point from e.prevNode to e.node 6. if anchorpoint=? then 7. remove e from V 8. for each unvisited adjacent node nx of e.node do 9. add (nx, e.node) to V 10. end for 11. continue 12. end if 13. resultSet=resultSet+APtoObjHT.get(anchorpoint) 14. probtotal=resultSet.getTotalProb() //calculate the probability sum of all objects in resultSet 15. if probtotal >= k then 16. break 17. end if 18. end for 19. return resultSet ID, probability? pairs indexed by the current anchor point to it. In lines 13 to 16, the total probability of all objects in the result set is checked, and if it equals or exceeds k, the algorithm ends and returns the result set. Note that the stopping criteria of our kNN algorithm do not require emptying the frontier edges in V. An example kNN query is shown in Figure 4.5, which is a snapshot of the running status of Algorithm 4. In Figure 4.5, red arrows indicate the searching directions expanding from q, and red anchor points indicate the points that have already been searched. Note that the Figure 4.5: Example of indoor kNN query. 32 edge segment from q to n3 is already removed from V and new edges n3n4, n3n5 are currently in V as well as n3n2. The search process is to be continued until the total probability of a result set is no less than k. 33 Chapter 5 Experiment In this section, we evaluate the performance of the proposed RFID and particle ?lter- based indoor spatial query evaluation system using the data generated by real-world param- eters, and compare the results with the symbolic model-based solution [30]. We implemented the proposed algorithms and related experimental components in C++. All the experiments were conducted on an Ubuntu Linux server equipped with an Intel Xeon 2.4GHz processor and 16GB memory. The settings of our experiment validation include 30 rooms and 4 hall- ways on a single ?oor, in which all rooms are connected to one or more hallways by doors. A total of 19 RFID readers are deployed on hallways with uniform distance to each other. The whole simulator consists of seven components, including true trace generator, raw reading generator, particle ?lter module, symbolic model module, ground truth query evalu- ation, top-k success module, and KL divergence module. Figure 5.1 shows the relationship of di?erent components in the simulation system. The true trace generator module is respon- sible for generating the ground truth traces of moving objects and records the true location of each object every second. We let each object randomly select a room as its destination, and walk along the shortest path on the indoor walking graph from its current location to the destination node. We simulate the objects? speeds using a Gaussian distribution with 6\PEROLF0RGHO0RGXOH ./'LYHUJHQFH0RGXOH 7RS.6XFFHVV0RGXOH *URXQG7UXWK4XHU\(YDOXDWLRQ5DZ5HDGLQJ*HQHUDWRU7UXH7UDFH*HQHUDWRU 3DUWLFOH)LOWHU0RGXOH Figure 5.1: The simulator structure. 34 = 1 m/s and = 0:1. At the same time, the raw reading generator module checks whether each object is detected by a reader according to the deployment of readers and the current location of the object. Whenever a reading occurs, the raw reading generator will feed the reading, including detection time, tag ID, and reader ID, to the two probabilistic query eval- uation modules (particle ?lter module and symbolic model module). We also implemented the ground truth query evaluation module in order to form a basis to evaluate the accuracy of the results returned by the two probabilistic query evaluation modules. 5.1 Simulator Implementation The query results are evaluated by the following metrics: (1) We calculated the top- 1 and top-2 success rate of particle ?lters inferred objects? locations with respect to their true locations. The top-k success rate is a percentage of the number of objects whose true locations match the top k predicted locations of the reconstructed distribution over the total number of objects. Note that this metric only applies to the particle ?lter-based method. (2) For range queries, we employed the metric of Kullback-Leibler (KL) divergence to measure the accuracy of query results based on the two di?erent probabilistic models. KL divergence is a metric commonly used to evaluate the di?erence between two probability distributions [11]. KL divergence is de?ned in Equation 5.1 with two probability distributions P and Q of a discrete random variable. In the following experiments, smaller KL divergence indicates better accuracy of the results with regard to the ground truth query results. (3) For kNN queries, KL divergence is no longer a suitable metric since the result sets returned from the symbolic model module do not contain object-speci?c probability information. Instead, we simply count the hit rates of the results returned by the two probabilistic methods over the ground truth result set. We only consider the maximum probability result set generated by the symbolic model module when calculating hit rate. DKL(P||Q) = ? i P(i)ln P(i)Q(i) (5.1) 35 Parameters Default Values Number of particles 64 Query window size 2% Number of moving objects 200 k 3 Activation range 2 meters Table 5.1: Default values of parameters. In all the following experimental result ?gures, we utilize PF and SM to represent the curves of the particle ?lter-based method and the symbolic model-based method, respectively. The default parameters of all the experiments are listed in Table 5.1. 5.2 E?ects of Query Window Size We ?rst evaluate the e?ects of query window size on the accuracy of range queries. The window size is measured by percentage with respect to the total area of the simulated indoor space. 100 query windows are randomly generated as rectangles at each time stamp, and the results are averaged over 50 di?erent time stamps. As shown in Figure 5.2, the KL divergence of both methods does not seem to be a?ected by the query window size, but the KL divergence of the particle ?lter-based method is signi?cantly below that of the symbolic model-based method. 5.3 E?ects of k In this experiment we evaluate the accuracy of kNN query results with respect to the value of k. We choose 30 random indoor locations as kNN query points and issue queries on 0 0.2 0.4 0.6 0.8 1 1.2 1%2%3%4%5% KL Divergence Query Window Size PFSM Figure 5.2: E?ects of query window size. 36 these query points at 50 di?erent time stamps. As k goes from 2 to 9, we can see in Figure 5.3 that the average hit rate of the symbolic model-based method grows slowly. As k increases, the number of objects returned by the method increases as well, resulting in a higher chance of hits. On the contrary, the average hit rate of the particle ?lter-based method is relatively stable with respect to the value of k, and the particle ?lter-based method always outperforms the symbolic model-based method in terms of the average hit rate. 5.4 E?ects of Number of Particles From the mathematical analysis of particle ?lters in Section 3.1, it is known that if the number of particles is too small, the accuracy of particle ?lters will degenerate due to insu?cient samples. On the opposite, keeping a large number of particles is not a good choice either since the computation cost may become overwhelming, as the accuracy improvement is no longer obvious when the number of particles is beyond a certain threshold. In this subsection, we conduct extensive experiments to exploit the e?ects of the number of particles on query evaluation accuracy in order to determine an appropriate size of particle set for the application of indoor spatial queries. As shown in Figure 5.4, we can see that when the number of particles is very small, the particle ?lter-based method has a larger KL divergence for range queries and a smaller average hit rate for kNN queries than the symbolic model-based method. As the number of particles grows beyond 8, the performance of the particle ?lter-based method begins to exceed the symbolic model-based method. Another observation is that when the number of particles is beyond 64, the top-k success rates of our solution are relatively stable. In 0 0.2 0.4 0.6 0.8 1 2 3 4 5 6 7 8 9 Average Hit Rate k PFSM Figure 5.3: E?ects of k. 37 0 0.5 1 1.5 2 2 4 8 16 32 64 128 256 512 KL Divergence Number of Particles PFSM 0 0.2 0.4 0.6 0.8 1 2 4 8 16 32 64 128 256 512 Average Hit Rate Number of Particles PFSM 0 0.2 0.4 0.6 0.8 1 2 4 8 16 32 64 128 256 512 Top-k Success Rate Number of Particles Top-1Top-2 (a) KL divergence. (b) kNN success ra-tio. (c) Top-k successrate. Figure 5.4: The impact of the number of particles. 0 0.2 0.4 0.6 0.8 1 1.2 200 400 600 800 1000 KL Divergence Number of Objects PFSM 0 0.2 0.4 0.6 0.8 1 200 400 600 800 1000 Average Hit Rate Number of Objects PFSM 0 0.2 0.4 0.6 0.8 1 200 400 600 800 1000 Top-k Success Rate Number of Objects Top-1Top-2 (a) KL divergence. (b) kNN success ra-tio. (c) Top-k successrate. Figure 5.5: The impact of the number of moving objects. 0 0.5 1 1.5 2 0.5 1 1.5 2 2.5 KL Divergence Activation Range(m) PFSM 0 0.2 0.4 0.6 0.8 1 0.5 1 1.5 2 2.5 Average Hit Rate Activation Range(m) PFSM 0 0.2 0.4 0.6 0.8 1 0.5 1 1.5 2 2.5 Top-k Success Rate Activation Range(m) Top-1Top-2 (a) KL divergence. (b) kNN success ra-tio. (c) Top-k successrate. Figure 5.6: The impact of activation range. addition, both the KL divergence and the average hit rate change slowly when the number of particles grows beyond 64. We conclude that in our application, the appropriate size of particle set is around 60, which guarantees a good accuracy while not costing too much in computation. 38 5.5 E?ects of Number of Moving Objects In this subsection, we evaluate the scalability of our proposed algorithm by varying the number of moving objects from 200 to 1000. All the result data are collected by averaging an extensive number of queries over di?erent query locations and time stamps. Figure 5.5 shows a comparison of the query results from the two probabilistic methods, and also the top-k success rates of particle ?lters? inferred locations. The KL divergence of the two methods and top-k success rates of the particle ?lter-based method are relatively stable, but the average hit rate of kNN queries decreases for both methods. The decrease of kNN hit rate is due to more objects being distributed in the same indoor space. A ?ner resolution algorithm is required to accurately answer kNN queries. In all, our solution demonstrates good scalability in terms of accuracy when the number of objects increases. 5.6 E?ects of Activation Range Finally, we evaluated the e?ects of reader?s activation range by varying the range from 50 cm to 250 cm. The results are reported in Figure 5.6. As the activation range increases, the performance of the two methods gets better because uncertain regions not covered by any reader essentially get reduced. In addition, even when the activation range is small (e.g., 100 cm), the particle ?lter-based method is still able to achieve relatively high accuracy. Therefore, the particle ?lter-based method is more suitable than the symbolic model-based method when the physical constrains limit readers? activation ranges. 39 Chapter 6 Continuous Spatial Queries This chapter presents tentative solutions to evaluate continuous indoor range query and continuous indoor kNN query, but their accuracy and performance has not been validated by simulation yet. Our ongoing research will conduct further simulation to compare the performance of our proposed methods to current state-of-art methods [29, 30]. 6.1 Continuous Indoor Range Query In this section, we aim to solve the problem of continuous indoor range query on ?ltered probabilistic data. We give users the option of setting a probability threshold to truncate low probability candidate objects. To e?ciently monitor the result set, we use similar concept "critical device" as in [29], which can save considerable computations than constantly repeating the snapshot algorithm. We de?ne critical devices for a query to be the set of devices whose readings will a?ect the query results. Our continuous monitoring algorithm is distinct from Yang?s work [29] in two aspects: 1. We leverage the Indoor Walking Graph to simplify the identi?cation process of critical devices; 2. The probability updating process is particle ?lters based, which are more accurate and very di?erent from their approach in nature. To identify critical devices for a range query, we propose an approach consisting of two steps, mapping and searching. For the mapping step, we categorize two di?erent cases and treat them di?erently: case 1, the whole query range is within one room or adjacent rooms, and we map the query range to the door(s) of the containing room(s). The reason is that doors are the only entrances/exits of rooms. Case 2, the query range also overlaps with hallways, and it is mapped to an edge segment of Indoor Walking Graph edges E lying along 40 the hallway. For the searching step, an expansion starting from the mapped points or edge segments is performed along E until reaching the activation range of an RFID reader or deadend. Notice that in the second case of mapping step, we do not specially deal with the room part of queries, since the subsequent expansion from the mapped edge segment will eventually go through the doors of a?ected rooms. Figure 6.1 shows an example of the mapping process for two queries q1 and q2. Since q1 is fully contained in room R1, it is mapped to a point at the door of R1. q2 intersects with not only rooms but also hallway, therefore it is mapped to an edge segment along hallway as marked red in ?gure 6.1; For the initial evaluation of a query, we change the optimization algorithm in section 4.3 of the snapshot query to fully take advantage of critical devices. For an object to be in the query range, it must be mostly recently detected by a critical device or any device that is bounded by the critical devices. Therefore, we no longer need to calculate the maximum bounding circle of each object to determine whether it is a candidate as in section 4.3. Other than the di?erence in identifying the candidate object set, other parts of the initial evaluation algorithm are the same as its snapshot counterparts. After initial evaluation, we continuously monitor the candidate set by performing particle ?lters for them at every time step. Note here for continuous query evaluation, particle ?lters algorithm should run with cache in order to avoid repetitive calculations and thus gain higher speed. T T Figure 6.1: Mapping process to identify critical devices. 41 Algorithm 5 Continuous Range Query(q) 1. Dcriticaldevices = getCriticalDevices(q) 2. C = ? 3. for every reader in or bounded by Dcriticaldevices do 4. C = C?DtoObj(reader) 5. end for 6. Particle Filter with Cache(C) 7. Rinit=Indoor Range Query(q) 8. for every second from treg to tunreg do 9. for every oi detected by any reader in Dcriticaldevices do 10. if oi ? C then 11. C.remove(oi) 12. else 13. C.add(oi) 14. end if 15. end for 16. for every oi ? C do 17. if oi.particles are all outside the bounded region of Dcriticaldevices then 18. C.remove(oi) 19. end if 20. end for 21. Particle Filter with Cache(C) 22. R=Indoor Range Query(q) 23. end for At the same time, the candidate set may change due to candidates moving out or non candidate objects moving into the critical device bounded region. Therefore, how to e?ciently update the candidate set becomes a critical problem. Again, we rely on critical devices to gain information about possible outgoing and incoming objects. During continuous monitoring, if a candidate object enters the activation range of critical devices, or none of its particles still reside in the bounded region, then we assume it is moving out and should be removed from the candidate set. On the other hand, if a non candidate object enters the range of critical devices, we assume it is moving into the bounded region, and should be added to the candidate set. Algorithm 5 summarizes our proposed continuous indoor range query algorithm. Lines 1 to 6 initializes the critical devices and candidate set for query q. In line 5 we use a new hash table DtoObj, which maps a device to objects whose most recent readings are from 42 this device. Lines 9 to 20 are updating the candidate set according to the readings of critical devices, and also particles? presence within the bounded region. Line 21 run Algorithm 2 with cache to update candidate objects? location distribution probability. Line 22 calculates the result set using Algorithm 3. Note there is no need to recompute the anchor point set a?ecting query q in line 22, since it is already calculated in line 7 and remain unchanged until the query unregisters from the system. 6.2 Continuous Indoor kNN Query Similar to continuous indoor range query, how to update the candidate set of continuous indoor kNN query is crucial. To reduce the overhead of computing candidate set at every time step, we bu?er certain number of extra candidates than necessary, and only recompute the candidate set according the optimization approach in section 4.3 when the total number of candidates is less than k. Recalling from section 4.3, by examining the minimum/maximum shortest network dis- tance from the query point q to an object?s maximum speed circle, the snapshot optimization approach excludes objects whose minimum shortest network distance are larger than the kth largest minimum shortest network distance. Note here the candidate set identi?ed by this method contains k + m candidates, where m >= 0. Based on this snapshot optimization approach, we extend it to include at least k + y candidates where y is a user con?gurable parameter. Obviously, y represents a compromise between the size of candidate set and the recomputation frequency. We accomplish this by calculating the k + yth largest minimum shortest network distance among all objects, and use this value as a threshold to cut o? non-candidate objects. In this way, the candidate set would contain k + y + m objects, ensuring that we have at least k + y candidates. During continuous monitoring, we need to make sure the candidate set gets updated accordingly when a candidate object moves away or non candidate object moves towards q. We use a similar concept of ?critical device? as in continuous indoor range query, although 43 the critical devices here may change each time the candidate set is recomputed. The iden- ti?cation process of critical devices goes like following: after calculating the candidate set, a search is performed from q along E to cover all the maximum speed circle of candidate objects, until reaching readers (critical devices) or dead ends. As we can see, critical devices form a bounded region where at least k + y candidate objects are for sure inside it. On every new reading of critical devices, if the involved object is non candidate, we assume it is entering the bounded region and moving towards q. Accordingly, this object should be added to the candidate set. Otherwise, if it is a candidate object, we assume it is moving out of the bounded region and should be removed from the candidate set. As long as the total number of candidates is no less than k, there is no need to recompute the candidate set or critical devices. And the query result can be calculated according to Algorithm 4 for every time step. However, if there are more outgoing objects than incoming objects, the total number of candidates may fall below k. In this case, we restart the optimization algorithm in section 4.3 to get a new candidate set of at least k + y objects. 44 Chapter 7 Conclusion In this paper, we introduced an RFID and particle ?lter-based indoor spatial query evaluation system. In order to evaluate indoor spatial queries with unreliable data collected by RFID readers, we proposed the particle ?lter-based location inference method, the in- door walking graph model, and the anchor point indexing model for cleansing noisy RFID raw data. After the data cleansing process, indoor range and kNN queries can be evalu- ated e?ciently and e?ectively by our algorithms. Our experiments with data generated by real-world parameters demonstrate that our solution outperforms the symbolic model-based method signi?cantly in query result accuracy. For future work, we plan to conduct further analysis of our system with more perfor- mance evaluation metrics. In addition, we intend to extend our framework to support more spatial query types such as continuous range, continuous kNN, closest-pairs, etc. 45 Bibliography [1] M. S. Arulampalam, S. Maskell, N. J. Gordon, and T. Clapp. A tutorial on particle ?lters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174?188, 2002. [2] L. Carlone, M. E. K. Ng, J. Du, B. Bona, and M. Indri. Rao-Blackwellized Particle Filters multi robot SLAM with unknown initial correspondences and limited communi- cation. In ICRA, pages 243?249, 2010. [3] H. Chen, W.-S. Ku, H. Wang, and M.-T. Sun. Leveraging spatio-temporal redundancy for RFID data cleansing. In SIGMOD Conference, pages 51?62, 2010. [4] R. Cheng, L. Chen, J. Chen, and X. Xie. Evaluating probability threshold k-nearest- neighbor queries over uncertain data. In EDBT, pages 672?683, 2009. [5] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F on Radar and Signal Processing, 140(2):107?113, apr 1993. [6] G. R. Hjaltason and H. Samet. Distance Browsing in Spatial Databases. ACM Trans. Database Syst., 24(2):265?318, 1999. [7] S. R. Je?ery, M. J. Franklin, and M. N. Garofalakis. An Adaptive RFID Middleware for Supporting Metaphysical Data Independence. VLDB J., 17(2):265?289, 2008. [8] S. R. Je?ery, M. N. Garofalakis, and M. J. Franklin. Adaptive Cleaning for RFID Data Streams. In VLDB, pages 163?174, 2006. [9] C. S. Jensen, H. Lu, and B. Yang. Graph Model Based Indoor Tracking. In Mobile Data Management, pages 122?131, 2009. [10] W.-S. Ku, H. Chen, H. Wang, and M.-T. Sun. A Bayesian Inference-Based Framework for RFID Data Cleansing. IEEE Trans. Knowl. Data Eng., In press, 2012. [11] S. Kullback and R. A. Leibler. On Information and Su?ciency. Annals of Mathematical Statistics, 22:49?86, 1951. [12] K. C. K. Lee, W.-C. Lee, B. Zheng, and Y. Tian. ROAD: A New Spatial Object Search Framework for Road Networks. IEEE Trans. Knowl. Data Eng., 24(3):547?560, 2012. [13] J. Letchner, C. R?e, M. Balazinska, and M. Philipose. Access Methods for Markovian Streams. In ICDE, pages 246?257, 2009. 47 [14] D. Li and D. L. Lee. A Lattice-Based Semantic Location Model for Indoor Navigation. In MDM, pages 17?24, 2008. [15] Metropolitan Transportation Authority. Subway and Bus Ridership Statistics 2011. http://mta.info/nyct/facts/ridership/index.htm. Retrieved on October 22, 2012. [16] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query Processing in Spatial Network Databases. In VLDB, pages 802?813, 2003. [17] C. R?e, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated proba- bilistic streams. In SIGMOD Conference, pages 715?728, 2008. [18] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In SIGMOD Conference, pages 71?79, 1995. [19] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In SIGMOD Conference, pages 43?54, 2008. [20] B. L. D. Santos and L. S. Smith. RFID in the Supply Chain: Panacea or Pandora?s Box? Commun. ACM, 51(10):127?131, 2008. [21] L. Sullivan. RFID Implementation Challenges Persist, All This Time Later. Informa- tionWeek, October 2005. [22] T. T. L. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. J. Shenoy. Probabilistic Inference over RFID Streams in Mobile Environments. In ICDE, pages 1096?1107, 2009. [23] R. Want. The Magic of RFID. ACM Queue, 2(7):40?48, 2004. [24] E. Welbourne, L. Battle, G. Cole, K. Gould, K. Rector, S. Raymer, M. Balazinska, and G. Borriello. Building the Internet of Things Using RFID: The RFID Ecosystem Experience. IEEE Internet Computing, 13(3):48?55, 2009. [25] E. Welbourne, N. Khoussainova, J. Letchner, Y. Li, M. Balazinska, G. Borriello, and D. Suciu. Cascadia: a system for specifying, detecting, and managing r?d events. In MobiSys, pages 281?294, 2008. [26] E. Welbourne, K. Koscher, E. Soroush, M. Balazinska, and G. Borriello. Longitudinal study of a building-scale r?d ecosystem. In MobiSys, pages 69?82, 2009. [27] J. Welle, D. Schulz, T. Bachran, and A. B. Cremers. Optimization techniques for laser- based 3D particle ?lter SLAM. In ICRA, pages 3525?3530, 2010. [28] Wikipedia. New York City Subway. http://en.wikipedia.org/wiki/New_York_ City_Subway. Retrieved on October 22, 2012. [29] B. Yang, H. Lu, and C. S. Jensen. Scalable continuous range monitoring of moving objects in symbolic indoor space. In CIKM, pages 671?680, 2009. 48 [30] B. Yang, H. Lu, and C. S. Jensen. Probabilistic threshold k nearest neighbor queries over moving objects in symbolic indoor space. In EDBT, pages 335?346, 2010. [31] L. Yang, J. Cao, W. Zhu, and S. Tang. A hybrid method for achieving high accuracy and e?ciency in object tracking using passive r?d. In PerCom, pages 109?115, 2012. 49