Approximate Evaluation of Queries for Spatial, Uncertain and Probabilistic
Databases
by
Haiquan Chen
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 6, 2011
Keywords: Data Cleansing, Spatial Databases, RFID, Probabilistic Databases, Query
Processing
Copyright 2011 by Haiquan Chen
Approved by
Wei-Shinn Ku, Chair, Assistant Professor of Computer Science and Software Engineering
Xiao Qin, Associate Professor of Computer Science and Software Engineering
Weikuan Yu, Assistant Professor of Computer Science and Software Engineering
Abstract
In modern geographic information systems, route search represents an important class
of queries. In route search related applications, users may want to define a number of
traveling rules (traveling preferences) when they plan their trips. However, these traveling
rules are not considered in most existing techniques. In this dissertation we propose a novel
spatial query type, the multi-rule partial sequenced route (MRPSR) query, which enables
efficient trip planning with user defined traveling rules. The MRPSR query provides a
unified framework that subsumes the well-known trip planning query (TPQ) and the optimal
sequenced route (OSR) query. The difficulty in answering MRPSR queries lies in how to
integrate multiple choices of points-of-interest (POI) with traveling rules when searching for
satisfying routes. We prove that the MRPSR query is NP-hard and then provide three algorithms
by mapping traveling rules to an activity-on-vertex network. Afterwards, we extend all the
proposed algorithms to road networks. By utilizing both real and synthetic POI datasets, we
investigate the performance of our algorithms. The results of extensive simulations show that
our algorithms are able to answer MRPSR queries effectively and efficiently on the underlying
road networks. Compared to the Light Optimal Route Discoverer (LORD) based brute-force
solution, the response time of our algorithms is significantly reduced, while the computed
routes are only slightly longer than the shortest route.
The past few years have witnessed the emergence of an increasing number of applica-
tions for tracking and tracing based on Radio Frequency Identification (RFID) technologies.
However, raw RFID readings are usually of low quality and may contain numerous anomalies.
An ideal solution for RFID data cleansing should address the following issues. First, in many
applications, duplicate readings of the same object are very common. The solution should
take advantage of the resulting data redundancy for data cleaning. Second, prior knowledge
about the environment (e.g., prior object distribution, false negative rates of readers) may
help improve data quality, and a desired solution must be able to take into account such
knowledge. Third, the solution should take advantage of physical constraints in target
applications (e.g., the number of objects in the same location cannot exceed a given value)
to elevate the accuracy of data cleansing. Several RFID data cleansing techniques exist;
however, none of them supports all the aforementioned features. In this dissertation
we propose a Bayesian inference based framework for cleaning RFID raw data. We first de-
sign an n-state detection model and formally prove that the 3-state model can maximize the
system performance. In addition, we devise a Metropolis-Hastings sampler with Constraints
(MH-C), which incorporates constraint management to clean RFID data with high efficiency
and accuracy. Moreover, to support real-time object monitoring, we present the Streaming
Bayesian Inference (SBI) method to cope with real-time RFID data streams. Finally, we
evaluate the performance of our solutions through extensive experiments.
Acknowledgments
I am sincerely grateful to my supervisor, Dr. Wei-Shinn Ku, for his academic guidance,
valuable comments, continuous support, and encouragement during my doctoral studies. I
am very appreciative of his personableness and endless help in the achievement of this work.
Without his support, this dissertation would not have been possible. I owe my deepest
gratitude to him.
I would like to thank my committee members, Dr. Xiao Qin and Dr. Weikuan Yu, for
their time, patience, and advice, which helped me improve this work.
Special thanks to Dr. Stanley Reeves for his time, his kindness, and for giving me the
courage to realize my goal. I do believe it is God who leads me towards achieving this.
Finally, I am deeply grateful to my parents. It is their love and support that made this
dissertation possible.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Spatial Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 RFID Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Data Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Prior Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 Overview of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Approximate Query Evaluation on Spatial Databases . . . . . . . . . . . . . . . 10
2.1 The Multi-Rule Partial Sequenced Route Query . . . . . . . . . . . . . . . . 10
2.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Properties of the MRPSR Query . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Percentage of the Constrained Categories . . . . . . . . . . . . . . . . 14
2.2 Activity-on-Vertex Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Nearest Neighbor Search in Road Networks . . . . . . . . . . . . . . . 18
2.3.2 Nearest Neighbor-based Partial Sequenced Route Algorithm . . . . . 19
2.3.3 NNPSR with Light Optimal Route Discoverer Algorithm . . . . . . . 19
2.3.4 Advanced A* Search-based Partial Sequenced Route Algorithm . . . 20
2.3.5 Comparison between NNPSR, NNPSR-LORD and AASPSR . . . . . 23
2.4 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Effect of the Percentage of the Constrained Categories . . . . . . . . 27
2.4.3 Effect of the Average Category Cardinality . . . . . . . . . . . . . . . 30
2.4.4 Effect of the Number of Query Categories . . . . . . . . . . . . . . . 31
3 Approximate Query Evaluation on RFID Databases . . . . . . . . . . . . . . . . 37
3.1 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 A Bayesian Inference Based Approach . . . . . . . . . . . . . . . . . 38
3.1.2 The Goal and the Obstacles . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 RFID Reader Detection Models . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 Physical Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Problems of the 2-State Detection Model . . . . . . . . . . . . . . . . 42
3.2.3 The n-state Detection Model . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 A Case Study: The 3-State Model . . . . . . . . . . . . . . . . . . . . 44
3.3 Entropy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Entropy versus Read Rate . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.2 Entropy versus Number of States . . . . . . . . . . . . . . . . . . . . 49
3.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Sample Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.3 Metropolis-Hastings and Gibbs Sampling . . . . . . . . . . . . . . . . 55
3.4.4 Metropolis-Hastings sampler with Constraints . . . . . . . . . . . . . 57
3.5 Real-Time Cleansing on Data Streams . . . . . . . . . . . . . . . . . . . . . 59
3.5.1 Streaming Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6.2 Simulator Implementation . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.3 The Performance Analysis of MH-C . . . . . . . . . . . . . . . . . . . 64
3.6.4 Visualization of Query Results . . . . . . . . . . . . . . . . . . . . . . 68
3.6.5 The Performance Analysis of SBI . . . . . . . . . . . . . . . . . . . . 70
4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1 Spatial Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1.1 Nearest Neighbor Query . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1.2 Route Planning Query . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 RFID Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
List of Figures
1.1 Two possible routes (solid and dashed arrows) of an MRPSR query. . . . . . . 3
1.2 Spatial overlapping of detection regions. . . . . . . . . . . . . . . . . . . . . . . 5
2.1 The AOV network of Q represents POI categories as vertices and prerequisites
as edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 NN search by the incremental network expansion [43]. . . . . . . . . . . . . . . 18
2.3 Two trips generated by NNPSR (the dashed route) and ASPSR (the solid route)
with the traveling rule Bank → Restaurant. . . . . . . . . . . . . . . . . . . 20
2.4 A trip search by ASPSR with the traveling rule Bank → Restaurant. . . . . 22
2.5 A trip search by AASPSR(1) with the traveling rule Bank → Restaurant. . 22
2.6 A trip search by AASPSR(2) with the traveling rule Bank → Restaurant. . 23
2.7 An example where AASPSR generates a longer route than NNPSR. . . . . . . . 23
2.8 Real Datasets from the state of California. . . . . . . . . . . . . . . . . . . . . . 26
2.9 Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of PCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of PCC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.11 Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the average category cardinality. . . . . . . . . . . . . . . 31
2.12 Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the average category cardinality. . . . . . . . . . . . . . . 32
2.13 Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the number of query categories. . . . . . . . . . . . . . . . 33
2.14 Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the number of query categories. . . . . . . . . . . . . . . . 34
3.1 An illustration of the relationship between read rate and distance. . . . . . . . . 42
3.2 An illustration of the simplified 2-state model. . . . . . . . . . . . . . . . . . . . 43
3.3 The family of the n-state detection model. . . . . . . . . . . . . . . . . . . . . . 44
3.4 The 3-state detection model of RFID readers. . . . . . . . . . . . . . . . . . . . 45
3.5 The detection region overlap interpreted by the 3-state detection model. . . . . 46
3.6 Relationship between entropy and read rate in the 2-state and 3-state models. . 50
3.7 Relationship between entropy and the number of states in a detection model. . . 51
3.8 Taking advantage of the correlation between samples to improve sampling efficiency. 56
3.9 The simulator structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.10 MCMC versus SIS on sampling time. . . . . . . . . . . . . . . . . . . . . . . . . 65
3.11 The impact of the number of qualified samples on accuracy for MCMC and SIS. 66
3.12 The impact of RFID data redundancy degree on accuracy for MCMC and SIS. . 66
3.13 The impact of the number of managed racks per reader on accuracy for MCMC
and SIS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.14 The results of the location queries answered by MCMC and SIS. . . . . . . . . . 69
3.15 The results of the remaining capacity queries answered by MCMC. . . . . . . . 70
3.16 Performance comparison of SBI (1), SBI (3) and SBI (5) on sampling time. . . . 71
3.17 The impact of the number of qualified samples on accuracy for SBI (1), SBI (3)
and SBI (5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.18 The impact of RFID data redundancy degree on accuracy for SBI (1), SBI (3)
and SBI (5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
List of Tables
1.1 RFID readings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 POI categories and partial sequence rules in an example MRPSR query Q. . . . 15
2.2 Symbolic notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 The feature comparison among proposed algorithms. . . . . . . . . . . . . . . . 25
2.4 The category cardinalities used in our California dataset. . . . . . . . . . . . . . 28
3.1 Symbolic notations of Section 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 A snippet of the RFID raw data. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Symbolic notations utilized in Algorithm 3. . . . . . . . . . . . . . . . . . . . . 58
3.4 The matrix of: (a) true distribution and (b) raw readings. . . . . . . . . . . . . 62
3.5 The parameters for our simulations. . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 1
Introduction
1.1 Spatial Databases
In Geographic Information Systems (GIS) related research [8,22,49,52,55], significant
efforts have been spent on nearest neighbor (NN) queries, range queries as well as their
variants [38,40,57,68]. While these query types are building blocks for many existing appli-
cations, more advanced spatial query types must be studied for future GIS systems. Route
queries [40,52] are an important class of spatial queries for users to request an efficient path
by specifying a source and a destination. As an essential component, route queries are
widely supported by many of today's popular online map service providers (e.g., Google
Maps, MapQuest, Yahoo! Maps, Bing Maps). By issuing a route query to a map service
provider, users will obtain a recommended route on the map with an estimated mileage and
turn-by-turn driving instructions. Li et al. [40] proposed solutions for Trip Planning Queries
(TPQ). With TPQ, the user specifies a set of Point of Interest (POI) types and asks for the
optimal route (with minimum distance) from her starting location to a specified destination
which passes through exactly one POI of each POI type. On the other hand, Sharifzadeh
et al. [52] presented the OSR query, where the user asks for an optimal route that starts
from her location and passes through a number of POIs (each of a different type) in a
particular order (sequence) imposed on all the POI types to be visited. However, both TPQ
and OSR queries fail to consider the sub-sequences of POI types which occur naturally in
many GIS applications. To remedy this, in this chapter, we propose a novel route query type,
1. http://maps.google.com/
2. http://www.mapquest.com/
3. http://maps.yahoo.com/
4. http://maps.bing.com/
the Multi-Rule Partial Sequenced Route (MRPSR) query [10]. Our objective is to assist
users in planning trips that involve multiple POIs belonging to different POI categories
(types) and satisfy a number of user defined traveling rules in road networks, all within a
short response time. Our MRPSR query aims at unifying the well-known TPQ and OSR queries.
1.1.1 Motivation
As a motivating application, consider the scenario as shown in Figure 1.1. Alice is
planning a trip that starts from her home and involves visiting the following POI categories:
a bank, a restaurant, a gas station, and a movie theater. In addition, Alice imposes the
following traveling rules on her trip:
1. Visit a bank to withdraw money before having lunch at a restaurant.
2. Fill up gas before going to watch a movie.
In order to fulfill these two traveling rules, the returned trip must follow two sub-
sequences simultaneously: (a) traveling to a bank before going to a restaurant and (b)
visiting a movie theater after filling up the gas tank in a gas station. Aside from these two
sequences, Alice is free to visit any of the other POI categories in any order she pleases
and furthermore, they can be interleaved in any order with the two rule-based sequences.
Figure 1.1 illustrates two possible satisfying routes in a road network with different travel
distances.
User defined traveling rules can be formulated as sub-sequences of POI categories in
MRPSR queries. Such sub-sequences (or partial sequences) exist inherently in many GIS
applications or can be specified by users as external constraints. Therefore, MRPSR queries
are useful in numerous fields such as automotive navigation systems, transportation planning,
supply chain management, online Web mapping services, etc.
Note that the MRPSR query differs from the Traveling Salesman Problem (TSP). In
both cases a least-cost route is sought. However, with TSP a set of POIs (e.g., cities) is given
and each element must be visited exactly once. On the other hand, with MRPSR each POI
is associated with a category and one may select any element of that category. For example,
if the route should include a gas station visit, then one may choose any one of the available
gas stations.
Figure 1.1: Two possible routes (solid and dashed arrows) of an MRPSR query.
1.1.2 Contribution
In this dissertation we present the MRPSR query and provide three fast approximation
algorithms which are designed to efficiently compute satisfying routes with the near-optimal
travel distance in road networks. The contributions of our work are as follows:
• We formally define the Multi-Rule Partial Sequenced Route (MRPSR) query and prove
the MRPSR problem to be NP-hard.
• By casting traveling rules into an activity-on-vertex network, we utilize topological
sorting [33] to integrate traveling rules with multiple choices of POIs and study the
solvability of MRPSR queries.
• We propose the Nearest Neighbor-based Partial Sequenced Route query (NNPSR)
algorithm. The NNPSR algorithm uses activity-on-vertex networks to guide the search
to retrieve a near-optimal route satisfying all the traveling rules in road networks.
• We integrate NNPSR with the Light Optimal Route Discoverer (LORD) algorithm [52]
to create NNPSR-LORD, which further reduces the trip distance produced by the
NNPSR algorithm.
• We also design an Advanced A* Search-based Partial Sequenced Route query (AASPSR(k))
algorithm. AASPSR(k) takes advantage of the location of the destination as well as
traveling rules to generate an efficient trip plan in road networks.
• We compare the performance of NNPSR, AASPSR(k) and NNPSR-LORD analytically.
• By using real and synthetic POI datasets, we compare experimentally the performance
of NNPSR, AASPSR(k), NNPSR-LORD and the LORD-based brute-force solution in
the road network of California.
1.2 RFID Databases
Radio Frequency Identification (RFID) is a very popular electronic tagging technology
that allows objects to be automatically identified using an electromagnetic challenge/response
exchange [66]. An RFID-driven system consists of a large number of low-cost and identifiable
tags that are attached to objects, and readers which can identify tags at a distance through
RF communications. An increasing number of major retailers such as Wal-Mart, The Home
Depot, Kroger, and Costco have installed RFID based inventory management systems in
their warehouses and distribution centers. RFID technologies enable unprecedented visi-
bility to support numerous track and trace applications. However, RFID researchers and
practitioners are facing a challenging problem: the raw data collected by RFID readers are
inherently unreliable [31,56]. Therefore, middleware systems [30] are required to correct
readings and provide cleansed data. Most previous solutions [16,21,29,31,44] for cleansing
RFID raw data focused on smoothing the readings generated by a group of readers. However,
these existing solutions suffer from three major limitations:
• Data redundancy introduced by overlapping detection regions of multiple stationary
readers (spatial redundancy) or continuous readings over time of a single mobile reader
(temporal redundancy) is not utilized to improve reading accuracy.
• Prior knowledge about tagged objects and RFID readers is not effectively utilized to
improve reading accuracy.
• Constraints in target applications (e.g., the maximal capacity of a room or a shelf) are
not effectively utilized to cleanse the data.
Figure 1.2: Spatial overlapping of detection regions.
          Zone 1  Zone 2  Zone 3  Zone 4  Zone 5  Zone 6
          Reader  Reader  Reader  Reader  Reader  Reader
Obj 1        1       0       0       0       0       0
Obj 2        0       1       1       0       0       0
Obj 3        0       0       0       1       0       0
Obj 4        0       0       1       1       0       0
 ...        ...     ...     ...     ...     ...     ...

Table 1.1: RFID readings.
To address these limitations, in this dissertation we propose a Bayesian inference
based framework that elevates the accuracy and efficiency of data cleansing by taking full
advantage of data redundancy, prior knowledge, and application constraints [12].
1.2.1 Data Redundancy
Two types of redundancy may arise in RFID related applications: spatial redundancy,
where an object is detected by multiple readers in its neighborhood, and temporal redun-
dancy, where an object is detected multiple times by a single reader over time.
Spatial Redundancy: In order to reduce the complexity of data analysis, previous works [16,
21,29,31,44,67] assume that each object is read once, and read by one reader only. Clearly,
this assumption is difficult to enforce, and more importantly, it oversimplifies the reality.
Because RFID readings are of low quality, many applications have to employ redundant
readers to cover the target area completely to improve localization accuracy, which means
objects are read by multiple readers simultaneously.
Indeed, in RFID systems, spatial redundancy is very common. Figure 1.2 shows an
example of spatial redundancy where the target area is divided into six zones (using a one-
dimensional model) and an RFID reader is located in the center of each zone. Spatial overlap
of readers' detection regions leads to duplicate readings, i.e., an object is in the detection
regions of multiple readers. A possible set of readings is shown in Table 1.1, wherein 1's
denote successful detections and 0's denote missed detections. The table shows two effects of redundancy:
• Object 2 is detected by the reader in Zone 2 and also by the reader in Zone 3, which
makes it difficult to tell the exact location of Object 2. However, since an object
cannot appear in more than one zone at the same time, at least one of the readings
must be spatially redundant.
• Object 3 is detected in Zone 4 only. However, it does not necessarily mean that Object
3 is in Zone 4 for sure. It is possible that the reader in the zone where Object 3 is
located simply failed to detect it.
At first glance, spatial redundancy causes confusion as it introduces inconsistent
information (e.g., about the location of Object 2). However, redundant readings may supply
the necessary information for the system to derive the location of an object when its intended
reader fails to detect it (e.g., Object 3). Thus, the challenge is how to take advantage of
redundancy while avoiding its undesirable effect in data cleansing.
Temporal Redundancy: Besides employing multiple stationary readers, many applica-
tions monitor the target area using a mobile reader (e.g., a handheld or a robot-mounted
reader [61]) to take continuous readings on its route. Because the exact location of the
mobile reader is always changing when the reader reports raw data, the detection regions
at different time points may overlap, introducing temporal data redundancy of readings.
However, if we treat the same reader at different time points as a set of distinct readers,
the temporal redundancy problem can be reduced to the spatial redundancy problem. In
Table 1.1, for example, a mobile reader traversing from zone 1 to zone 6 can be regarded as
the zone 1 reader while moving in zone 1, as the zone 2 reader while moving in zone 2, and
so on. Therefore, we will mainly focus on spatial redundancy in this dissertation.
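The reduction above can be sketched in a few lines. This is our own illustration rather than the dissertation's implementation; the input format (a time-ordered list of (zone, detected-tags) pairs for one mobile reader) is a hypothetical encoding:

```python
def to_virtual_readers(mobile_readings):
    """Reduce temporal redundancy to spatial redundancy: the same mobile
    reader observed in different zones is treated as a set of distinct
    virtual (per-zone) readers, yielding a table shaped like Table 1.1."""
    table = {}
    for zone, tags in mobile_readings:  # time-ordered (zone, tags) pairs
        for tag in tags:
            # Record which virtual readers detected each tag.
            table.setdefault(tag, set()).add(zone)
    return table

# A mobile reader sweeping zones 1 to 3 over time:
readings = [(1, {"Obj 1"}), (2, {"Obj 1", "Obj 2"}), (3, {"Obj 2"})]
print(to_virtual_readers(readings))  # {'Obj 1': {1, 2}, 'Obj 2': {2, 3}}
```

The resulting table has exactly the shape of the spatial-redundancy readings in Table 1.1, so any cleansing method for the spatial case applies unchanged.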
1.2.2 Prior Knowledge
As false negatives and false positives abound in raw RFID readings [31,56], in order
to recover the true information, the data cleansing system should take prior knowledge into
account. Prior knowledge may include, for example, the fact that the detection areas of the
readers in Zone 2 and Zone 3 overlap significantly, that the position of the reader in Zone 4
makes it more likely to detect objects in Zone 3 than objects in Zone 5, or that the reader
in Zone 3 has a high false negative rate. Such information, when properly integrated with
the readings, is extremely valuable for data cleansing.
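To make this concrete, the sketch below shows how Bayes' rule can fold such prior knowledge into the readings for a single object. The detection model (a reader detects an object in its own zone with probability 0.8, in an adjacent zone with probability 0.3, and never farther away) and the uniform prior are hypothetical numbers of our own choosing, not values from the dissertation:

```python
# Hypothetical detection probabilities by distance between the reader's
# zone and the object's true zone (own zone, adjacent zone, farther).
P_OWN, P_ADJ, P_FAR = 0.8, 0.3, 0.0

def detect_prob(reader_zone, obj_zone):
    d = abs(reader_zone - obj_zone)
    return P_OWN if d == 0 else P_ADJ if d == 1 else P_FAR

def posterior(readings, n_zones=6):
    """P(object in zone z | one 0/1 reading per reader), via Bayes' rule
    with a uniform prior and conditionally independent readers."""
    weights = []
    for z in range(1, n_zones + 1):
        w = 1.0  # uniform prior over zones
        for reader, hit in enumerate(readings, start=1):
            p = detect_prob(reader, z)
            w *= p if hit else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# Object 3 from Table 1.1: detected by the Zone 4 reader only.
post = posterior([0, 0, 0, 1, 0, 0])
print([round(p, 3) for p in post])  # Zone 4 carries most of the mass
```

On the Table 1.1 readings for Object 3, the posterior concentrates on Zone 4 but keeps some mass on Zones 3 and 5: the possible missed detection is accounted for rather than ignored, which is exactly the behavior argued for above.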
1.2.3 Constraints
Environmental constraints can be utilized to improve data cleansing. For example, the
maximal capacity of each zone (the number of objects that can reside in the same zone) is a
constraint. If each zone represents a rack or shelf in a warehouse, one possible constraint is
the total size or weight of the objects which the rack can hold. In addition to these physical
constraints, information obtained from other channels can be translated into constraints.
For instance, if an extra source indicates that two particular objects are in the same zone,
this may help cleanse the data of these two objects, as well as of other objects, once the
information is integrated with the readings and other constraints.
1.2.4 Overview of Our Approach
In this chapter, we propose an innovative approach to cleansing RFID raw data which
is able to take full advantage of duplicate readings and integrate prior knowledge as well as
environmental constraints. Our approach is based on Bayesian inference. We introduce an
n-state detection model and prove by entropy analysis that the 3-state detection model can
maximize the system performance. Furthermore, we devise a Metropolis-Hastings sampler
with constraints to approximate the posterior efficiently (Metropolis-Hastings sampling is a
Markov Chain Monte Carlo method [4,46]). Besides, we present the Streaming Bayesian
Inference (SBI) method to deal with RFID data streams in support of real-time object
monitoring. Consequently, our framework is able to answer the location query for all tagged
objects in a target area in a real-time fashion based on the cleansed RFID raw data.
The contributions of our work are as follows:
• By using Bayesian inference, we derive a universal framework for computing the
posterior probabilities (of the location of each object).
• Based on the physical characteristics of RFID readers, we propose an n-state detection
model to capture likelihoods, which enables us to take full advantage of duplicate
readings.
• By investigating the impact of the number of states in a detection model on the system
entropy, we formally prove that the system entropy is minimized when the 3-state
model is adopted. In other words, having more states (greater than three) can in fact
degrade the overall system performance.
• We devise MH-C, an improved Metropolis-Hastings sampler, to sample from the
posterior while taking the environmental constraints into consideration.
• In order to enable real-time object monitoring, we present the Streaming Bayesian
Inference method to cope with the real-time RFID data stream.
• We demonstrate the efficiency and effectiveness of our approach through extensive
simulations.
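To illustrate the flavor of constrained sampling on a toy instance (two objects, three zones, at most one object per zone), the sketch below runs a Metropolis-Hastings chain over joint location assignments and rejects any proposal that violates the capacity constraint. The per-object weights stand in for posterior evidence and are hypothetical; this demonstrates the general constrained-MH idea, not the dissertation's MH-C algorithm itself:

```python
import random

random.seed(0)
N_ZONES, CAPACITY = 3, 1     # toy instance: 3 zones, at most 1 object each
# Hypothetical unnormalized per-object location weights (e.g., derived
# from readings via Bayes' rule); WEIGHT[obj][zone].
WEIGHT = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]

def score(state):
    """Unnormalized target density of a joint assignment object -> zone."""
    s = 1.0
    for obj, zone in enumerate(state):
        s *= WEIGHT[obj][zone]
    return s

def feasible(state):
    """Capacity constraint: no zone holds more than CAPACITY objects."""
    return all(state.count(z) <= CAPACITY for z in range(N_ZONES))

def mh_with_constraints(steps=20000):
    state = [0, 1]           # any feasible starting assignment
    counts = {}
    for _ in range(steps):
        # Symmetric proposal: move one random object to a random zone.
        cand = list(state)
        cand[random.randrange(len(cand))] = random.randrange(N_ZONES)
        # Constraint management: infeasible proposals have density zero
        # and are rejected outright; otherwise apply the usual MH test.
        if feasible(cand) and random.random() < min(1.0, score(cand) / score(state)):
            state = cand
        counts[tuple(state)] = counts.get(tuple(state), 0) + 1
    return counts

counts = mh_with_constraints()
print(max(counts, key=counts.get))  # the most frequently sampled assignment
```

The chain only ever visits feasible assignments, and its visit frequencies approximate the constrained posterior, so the most frequently sampled assignment estimates the most probable feasible configuration.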
Chapter 2
Approximate Query Evaluation on Spatial Databases
2.1 The Multi-Rule Partial Sequenced Route Query
In this section, we formulate the proposed multi-rule partial sequenced route query and
then discuss the properties of the proposed query type.
2.1.1 Problem Formulation
Definition Given n disjoint sets of POI categories {C1, C2, ..., Cn}, each containing a
number of POIs in R^2, the MRPSR query is to search for a route that satisfies the following
three requirements:
1. The route traverses exactly one POI in each category;
2. The total traveling distance is minimized;
3. The route conforms to the given constraints (i.e., traveling rules).
While the first two requirements are commonly seen in other types of route queries [40,
52], the third requirement is unique. Here, the issue is how we should properly define a
constraint. Without loss of generality, we assume that each constraint can be mapped into
a partial sequence rule, defined as follows.
Definition A partial sequence rule is defined as an ordered subset of categories Ck1 →
Ck2 → ··· → Ckm, which specifies the order of visits among the categories Cki in the subset.
For instance, a user may issue an MRPSR query with a constraint that he would like to
withdraw money at a bank before going for grocery shopping and dinner. This constraint
can be converted to the following two partial sequence rules:
1. CBank → CSupermarket;
2. CBank → CRestaurant.
These two rules enforce that a bank must be visited before a supermarket and before a
restaurant on the trip, but they do not put a restriction on the order between the supermarket
and the restaurant.
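As a quick illustration of what these rules mean operationally, the helper below checks whether a candidate visiting order of categories obeys a set of partial sequence rules. The encoding (category names as strings, each rule as a tuple in precedence order) is our own sketch, not the dissertation's implementation:

```python
def satisfies_rules(route, rules):
    """Return True iff the visiting order `route` (a list of category
    names) obeys every partial sequence rule (a tuple of categories that
    must appear in that relative order, not necessarily adjacent)."""
    position = {cat: i for i, cat in enumerate(route)}
    for rule in rules:
        if any(cat not in position for cat in rule):
            return False
        # Consecutive pairs of the rule must appear in increasing order.
        if any(position[a] > position[b] for a, b in zip(rule, rule[1:])):
            return False
    return True

rules = [("Bank", "Supermarket"), ("Bank", "Restaurant")]
print(satisfies_rules(["Bank", "Restaurant", "Supermarket"], rules))  # True
print(satisfies_rules(["Restaurant", "Bank", "Supermarket"], rules))  # False
```

Both orders that start at the bank satisfy the two rules above; any order placing the restaurant or supermarket before the bank does not.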
Notice that if no restriction is placed on the format of the user's constraints, the
translation itself is a challenging artificial intelligence research problem [45]. Natural
language can be ambiguous and ungrammatical. Automatic translation requires algorithms
that can deal not only with the ambiguity but also with the parsing and interpretation of a
large dynamic vocabulary, which is not likely to be accomplished in real time. With the help of
input forms, the types of the user's constraints can be limited so that the translation from
the constraints to the partial sequence rules can be handled with ease. With the notion of
the partial sequence rules, the compatibility of a set of partial sequence rules can be defined
as follows.
Definition A set of partial sequence rules is defined to be compatible if and only if there
is a total order of the categories Ci that satisfies the order specified in each of the rules in the set.
For instance, the set of rules {C1 → C2, C2 → C3, C3 → C1} is not compatible since
it is impossible to satisfy all three rules at the same time. When all the travel
constraints are represented as a set of partial sequence rules, the original definition of the
MRPSR query can be formulated as follows.
Definition Given a set of POI categories and a set of partial sequence rules, a MRPSR
query is defined to return the route with the minimal total traveling distance that satisfies
the order specified in each of the partial sequence rules.
2.1.2 Properties of the MRPSR Query
The following theorem shows that the MRPSR query provides a unified framework that sub-
sumes the well-known trip planning techniques, including the trip planning query (TPQ) [40]
and the optimal sequenced route (OSR) query [52].
Theorem 2.1. The problems of the trip planning query and the optimal sequenced route
query are special cases of the problem of the multi-rule partial sequenced route query.
Proof. According to [40], the problem of the trip planning query is identical to the problem
of the multi-rule partial sequenced route query when the set of partial sequence rules is
empty. In addition, according to [52], the problem of the optimal sequenced route for a
given sequence of categories of POIs is the same as the problem of the multi-rule partial
sequenced route query when the set of partial sequence rules contains one partial sequence
rule specifying the same order.
From Theorem 2.1, we obtain the following important property for the MRPSR query.
Corollary 2.2. The problem of the multi-rule partial sequenced route query is NP-hard.
Proof. According to [40], the problem of the trip planning query is NP-hard. Since the
problem of the trip planning query is a special case of the problem of the multi-rule partial
sequenced route query (Theorem 2.1), it follows immediately that the problem of the multi-
rule partial sequenced route query is NP-hard.
Corollary 2.2 implies that when the search space is large, it is advisable to quickly find
a suboptimal route that satisfies the given partial sequence rules instead of the route with
the minimal total distance.
The set of the partial sequence rules plays an important role in the MRPSR query. As
indicated in Theorem 2.1 and Corollary 2.2, if the set is empty, the search space will be
large and the MRPSR query is NP-hard. However, if the rule specifies the total order of
the categories, the MRPSR problem can be solved in polynomial time [52]. Intuitively, the
tighter the set of rules is, the smaller the search space will be and the easier the MRPSR
query can be answered. While it is difficult to quantify the level of tightness for a set of
partial sequence rules, we provide Theorem 2.3 to see if a given set of rules will possibly
lead to a solution. Theorem 2.3 shows the relationship between the solvability of a MRPSR
query and the compatibility of a given set of rules.
Theorem 2.3. If a multi-rule partial sequenced route query is solvable, then the correspond-
ing set of the partial sequence rules must be compatible.
Proof. The proof is by contradiction. Assume that the set of rules is not compatible;
then, according to Definition 3, there is no ordered sequence of categories that satisfies all the
rules. In other words, no matter how the POIs are selected, it is impossible to order them
so that the resulting sequence meets all of the constraints.
Notice that Theorem 2.3 does not guarantee that a compatible set of partial sequence
rules can always lead to a solution for a corresponding MRPSR query because some categories
may contain no POI. If each category contains at least one POI, the inverse of Theorem 2.3
(i.e., the compatible set of rules implies the solvability of the corresponding MRPSR query)
will also be true. According to Definition 2.1.1, if the partial sequence rules are compatible,
then there must exist at least one total order of categories that satisfies the order
specified in each of the rules. Let one such order be {C1, C2, ..., Cn}. Since no
category is empty, we can arbitrarily pick one POI pi from each category Ci to compose a
route {p1, p2, ..., pn} which traverses exactly one POI in each category and conforms
with the given traveling rules. According to Definition 2.1.1, if there is only one such route,
we have our answer; if not, the one with the minimal traveling distance is the answer.
In Section 2.2, we elaborate on how to verify whether a set of partial sequence
rules is compatible.
2.1.3 Percentage of the Constrained Categories
Definition Given a MRPSR query, the Percentage of the Constrained Categories (PCC) is
defined as the ratio of the number of categories that appear in the set of traveling rules
to the total number of categories to be visited in the query.
PCC measures the extent to which a MRPSR query is constrained by traveling rules.
According to the definition of PCC, the trip planning query (TPQ) [40] can be considered
as a MRPSR query with a PCC of 0% while the optimal sequenced route (OSR) query [52]
can be treated as a MRPSR query with a PCC of 100%.
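Since PCC is a simple ratio, it can be computed directly from the query; a small illustrative sketch (names are hypothetical):

```python
def pcc(query_categories, rules):
    """Percentage of Constrained Categories: the fraction of queried
    categories that appear in at least one traveling rule (a, b)."""
    constrained = {c for rule in rules for c in rule}
    return len(constrained & set(query_categories)) / len(query_categories)
```

An empty rule set yields a PCC of 0% (the TPQ case), while rules touching every queried category yield a PCC of 100% (the OSR case).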
2.2 Activity-on-Vertex Networks
In order to plan a route which can fulfill all the user defined partial sequence rules, we
need a solution to combine all the provided traveling rules and verify if they are compatible.
The relationship between all the given traveling rules can be represented as a directed graph
in which the vertices represent POI categories and the directed edges represent prerequisites.
This graph has an edge ⟨i, j⟩ if and only if category i is an immediate prerequisite for
category j in one of the rules. The resulting graph is called an Activity-On-Vertex (AOV)
network [27]. The following theorem relates an AOV network to the
compatibility of the traveling rules.
Theorem 2.4. The partial sequence rules are compatible if and only if the corresponding
AOV network is a directed acyclic graph.
Proof. Definition 3 indicates that the rules are compatible if and only if there is a category
sequence that satisfies the order specified in each of the traveling rules. Let that category
sequence be the feasible sequence of tasks that satisfies all of the orders. According to [27],
an AOV network has a feasible sequence of tasks if and only if its precedence relations
are both transitive and irreflexive; in other words, the corresponding AOV network
must be a directed acyclic graph.
Table 2.1 lists the POI categories and partial sequence rules specified by an example
MRPSR query Q. The corresponding AOV network for Q is shown in Figure 2.1.
Data Type   Name             Prerequisites
C1          Bank             None
C2          Bookstore        None
C3          Restaurant       C1, C2
C4          Gas Station      None
C5          Hospital         C4
C6          Shopping Center  C5
C7          Church           C3, C6
C8          Coffee Shop      C3
C9          Gift Shop        C7, C8
C10         Park             C7
Table 2.1: POI categories and partial sequence rules in an example MRPSR query Q.
Figure 2.1: The AOV network of Q represents POI categories as vertices and prerequisites
as edges.
After we represent all the partial sequence rules in a MRPSR query as an AOV network,
provided that the AOV network is directed and acyclic, topological ordering (topological
sorting) can be used to generate a feasible complete ordering of POI categories that is
compatible with every partial sequence rule in the MRPSR query. In graph theory, a topo-
logical order of a directed acyclic graph (DAG) is a linear ordering of its vertices in which
each vertex comes before all vertices to which it has outbound edges. Each DAG has at least
one topological order. The algorithm to find a topological order is as follows. The first step is
to list a vertex in the network that has no predecessor. The second step is to delete
this vertex and all edges leading out of it from the AOV network. These two steps
are repeated until either all the vertices have been listed or all remaining vertices have
predecessors, so that none of them can be removed. In the latter case, the AOV network has
a cycle and the trip is infeasible, i.e., the partial sequence rules are not compatible. If a
topological order has the property that all pairs of consecutive vertices in it are connected
by AOV edges, these edges form a directed Hamiltonian path in the AOV network [33]. If a
Hamiltonian path exists, the topological order is unique and no other order respects the
edges of the path. Otherwise, the AOV network has two or more valid topological orderings,
since in that case it is always possible to form a second valid ordering by swapping two
consecutive vertices that are not connected by an edge. To support both cases, we keep a
counter of the number of immediate predecessors for each vertex and represent the network
by its adjacency lists. We can then carry out the deletion of all outgoing edges of a vertex v
by decreasing the predecessor count of every vertex on its adjacency list. Whenever the count
of a vertex drops to zero (in-degree = 0), we place the vertex on a list (Lzero) of vertices
with a zero count. As mentioned in Section 1.1, the traveling rules (the AOV network) may
not cover all the user-selected POI categories. To create a complete trip plan (i.e., a plan
that covers all requested categories), we add all the requested POI types that are not included
in the AOV network to the list Lzero. The complexity of topological sort is O(n + e), where
n is the number of vertices and e is the number of edges; the sort finishes in linear time.
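The counter-based procedure above can be sketched as follows (a minimal Python version; `topological_order` and its signature are ours, not the dissertation's Algorithm listings):

```python
from collections import deque

def topological_order(categories, rules):
    """Topological sort over the AOV network using in-degree counters
    and adjacency lists, O(n + e).  A rule (a, b) means category a
    must precede category b."""
    indeg = {c: 0 for c in categories}
    adj = {c: [] for c in categories}
    for a, b in rules:
        adj[a].append(b)
        indeg[b] += 1
    # Lzero: vertices with no remaining predecessors; categories not
    # covered by any rule start here, so the plan stays complete.
    lzero = deque(c for c in categories if indeg[c] == 0)
    order = []
    while lzero:
        c = lzero.popleft()
        order.append(c)
        for b in adj[c]:        # "delete" c's outgoing edges
            indeg[b] -= 1
            if indeg[b] == 0:
                lzero.append(b)
    # Fewer vertices than categories means a cycle was left behind.
    return order if len(order) == len(categories) else None
```

A `None` result signals a cycle in the AOV network, i.e., an incompatible rule set (Theorem 2.4).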
2.3 Algorithm Design
With the AOV network in hand, we can compute a trip plan satisfying
all the traveling rules. In this section, we propose three approximate algorithms to answer a
MRPSR query: the Nearest Neighbor-based Partial Sequenced Route (NNPSR) algorithm,
the Nearest Neighbor-based Partial Sequenced Route with Light Optimal Route Discov-
erer [52] (NNPSR-LORD) algorithm, and the Advanced A* Search-based Partial Sequenced
Route (AASPSR(k)) algorithm. NNPSR applies AOV networks to capture traveling rules
and launches successive nearest neighbor queries to answer a given MRPSR query. NNPSR-
LORD utilizes the Light Optimal Route Discoverer [52] to optimize the route obtained by
NNPSR. In AASPSR(k), as a hybrid scheme of NNPSR and ASPSR, distance heuristic
functions are integrated with NNPSR to answer a MRPSR query. All of the proposed algo-
rithms aim to find the near-optimal route which follows all of the traveling rules. Table 2.2
summarizes our set of notations.
Symbol       Meaning
A            The adjacency list representation of an AOV network
C            The set of all the user selected categories
R            The set of all the traveling rules
P            A set of POIs
Q            The priority queue
S            The starting point of a MRPSR query
D            The destination of a MRPSR query
q            The query point of a nearest neighbor query
Ci           A POI category
Ci.P         All the POIs of a category
Lzero        A list of AOV vertices with a zero count
Lroute       A list of the POI sequence of a trip plan
PNN          The query result of a nearest neighbor query
DistE(x,y)   The Euclidean distance between points x and y
DistN(x,y)   The network distance between points x and y
Table 2.2: Symbolic notations.
2.3.1 Nearest Neighbor Search in Road Networks
In practice, users usually move along the underlying road network rather than traveling
freely through obstacles (e.g., buildings, rivers, etc.). Network distance computation
and nearest neighbor queries in road networks have been well studied [32, 43]. As the ba-
sic building block of our proposed algorithms, in this subsection we briefly review how to
answer a nearest neighbor query in road networks with the incremental network expansion
approach [43].
Figure 2.2: NN search by the incremental network expansion [43].
Figure 2.2 demonstrates the nearest neighbor search by incremental network expan-
sion [43]. In Figure 2.2, the blank point q is the query point; the white points A, B, C, D, E, F
denote road network junctions; and the triangles P1, P2, P3 represent POIs. (P1, P2, P3) are
in ascending order of their Euclidean distance to q. Incremental network expansion performs
network expansion from q and examines POIs in the order they are encountered. Specifically,
the road segment CD that covers q is found first and all POIs on CD are retrieved.
A priority queue Q = ⟨(C,5), (D,7)⟩ is then initialized. Since no POI is covered by CD,
the node C closest to q is de-queued and its adjacent nodes, A and E, are inserted into Q
with their accumulated distances from q, i.e., Q = ⟨(A,9), (E,10), (D,7)⟩. No POI is found
on CA or CE. Then D, the node closest to q in the current Q, is expanded and its
adjacent nodes, B and F, are en-queued, after which Q = ⟨(A,9), (E,10), (B,15), (F,10)⟩.
Afterwards, P3 is discovered on DB with an accumulated distance of 9, while no POI is
found on DF. This distance offers an upper bound that restricts the search space. Because the
next node to expand is A and the distance from q to A is already 9, which is no less than
the upper bound, the algorithm terminates and returns P3 as the nearest neighbor of q with
a network distance of 9.
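The walk-through above can be reproduced with a short sketch (Python; the data layout, with POIs stored per directed edge at an offset from the first endpoint, is our simplification of [43]):

```python
import heapq

def incremental_nn(seeds, graph, edge_pois):
    """Incremental network expansion NN search (simplified sketch).
    seeds:     [(dist_from_q, node)] for the endpoints of the edge
               covering q (POIs on that edge are checked beforehand)
    graph:     {node: [(neighbor, edge_length), ...]}
    edge_pois: {(u, v): [(poi_id, offset_from_u), ...]}
    """
    best_poi, best_dist = None, float("inf")
    settled = {}
    heap = list(seeds)
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d >= best_dist:
            break                    # no unexplored node can yield a closer POI
        if u in settled:
            continue
        settled[u] = d
        for v, w in graph.get(u, []):
            # POIs on edge (u, v) are reached via u at distance d + offset
            for poi, offset in edge_pois.get((u, v), []):
                if d + offset < best_dist:
                    best_poi, best_dist = poi, d + offset
            if v not in settled:
                heapq.heappush(heap, (d + w, v))
    return best_poi, best_dist
```

Seeding the queue with (C, 5) and (D, 7) and placing P3 two units from D on edge DB reproduces the example: P3 is returned at network distance 9.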
2.3.2 Nearest Neighbor-based Partial Sequenced Route Algorithm
Here we devise the Nearest Neighbor-based Partial Sequenced Route (NNPSR) query al-
gorithm by utilizing both the Lzero list and the nearest neighbor query (i.e., the incremental
network expansion [43] based implementation) to generate a trip satisfying all the traveling
rules. With NNPSR, we first search for the nearest POI to the query point q (initially the
starting point) whose category is included in Lzero. The retrieved nearest POI PNN is stored
in a route list Lroute and the category of PNN (i.e., PNN.C) is removed from Lzero. Next,
we update the adjacency list, which may add new zero-count vertices to Lzero. In addition,
the query point q is updated to the location of PNN. The process repeats until all
the selected categories are contained in the route. The complete algorithm of NNPSR is
formalized in Algorithm 1.
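Algorithm 1 is not reproduced here; the NNPSR loop can be sketched as follows (Python, substituting Euclidean nearest-neighbor search for the incremental network expansion of Section 2.3.1; all names are ours):

```python
from math import dist

def nnpsr(start, categories, rules, pois):
    """NNPSR sketch: repeatedly take the nearest POI whose category is
    currently unconstrained (in Lzero), then release its successors.
    pois: {category: [(x, y), ...]}; rules: [(before, after), ...]
    """
    indeg = {c: 0 for c in categories}
    succ = {c: [] for c in categories}
    for a, b in rules:
        succ[a].append(b)
        indeg[b] += 1
    lzero = [c for c in categories if indeg[c] == 0]
    route, q = [], start
    while lzero:
        # nearest POI over all currently unconstrained categories
        cat, p = min(((c, p) for c in lzero for p in pois[c]),
                     key=lambda cp: dist(q, cp[1]))
        route.append((cat, p))       # append PNN to Lroute
        lzero.remove(cat)            # remove PNN.C from Lzero
        for b in succ[cat]:          # update predecessor counts
            indeg[b] -= 1
            if indeg[b] == 0:
                lzero.append(b)
        q = p                        # PNN becomes the next query point
    return route
```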
2.3.3 NNPSR with Light Optimal Route Discoverer Algorithm
Given a complete POI category sequence to visit, the Light Optimal Route
Discoverer (LORD) algorithm [52] is guaranteed to find a route of minimum distance. Since
we can obtain a complete POI sequence after each execution of the NNPSR algorithm, we can
further optimize the trip by applying LORD to the category sequence found by NNPSR. LORD
is a threshold-based algorithm and requires less memory space compared with Dijkstra's
shortest path solution. The first step in LORD is to issue consecutive nearest neighbor
queries to find the greedy route that follows the given POI category sequence from the
starting point. Then, the length of the greedy route becomes a constant threshold value Tc.
In addition, LORD keeps a variable threshold Tv, which decreases after each
iteration; LORD discards all the POIs whose distance to the starting point exceeds
Tv. Afterward, LORD iteratively builds and maintains a set of partial sequenced routes
in the reverse sequence (i.e., from the end points toward the starting point). During each
iteration, POIs from the next category in the reverse order are added to the head of each of
these partial sequenced routes to bring them closer to the starting point. The two thresholds
are utilized to prune non-promising routes and thereby reduce the search space.
After executing the NNPSR algorithm, we acquire a sequence of POIs. Since each
POI belongs to an individual POI category, we also obtain a POI category sequence as
the input of LORD. In most cases, the NNPSR-LORD solution outperforms the original
NNPSR algorithm in terms of route distance. More detailed performance evaluations are
presented in Section 2.4.
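LORD itself is involved, but its first step, deriving the constant threshold Tc from the greedy route, can be sketched as follows (Python, Euclidean distances for brevity; names are hypothetical):

```python
from math import dist

def greedy_threshold(start, category_seq, pois):
    """Follow the given category sequence greedily with successive NN
    queries; the length of this greedy route becomes LORD's constant
    threshold Tc, used to prune non-promising partial routes."""
    q, total = start, 0.0
    for c in category_seq:
        p = min(pois[c], key=lambda cand: dist(q, cand))
        total += dist(q, p)
        q = p
    return total
```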
2.3.4 Advanced A* Search-based Partial Sequenced Route Algorithm
Figure 2.3: Two trips generated by NNPSR (the dashed route) and ASPSR (the solid route)
with the traveling rule Bank → Restaurant.
Although the NNPSR and NNPSR-LORD algorithms can fulfill the traveling rules and
reduce the travel distance of a trip, they do not consider the location of the destination
when greedily generating the route sequence. Consider the example shown in Figure 2.3,
where S and D denote the start point and the destination of the trip. Suppose we have a
traveling rule Bank → Restaurant. The dashed route returned by NNPSR is much
longer than another feasible trip (the solid route) which considers the location of the
destination. Therefore, another approach is to limit the trip planning within a range
defined by S and D (e.g., an ellipse whose two focal points are S and D).
The A* Search-based Partial Sequenced Route (ASPSR) algorithm considers the location
of the destination in its heuristic function. Similar to the admissible heuristic of the A*
algorithm [48], in ASPSR we retrieve, in each category included in Lzero, the POI p with
the minimum cost DistE(S,p) + DistE(p,D). Afterward, the POI p with the lowest cost
is added to the route list Lroute and the category of p is withdrawn from Lzero.
Then both A and Lzero are updated and the location of p is set as the new query point.
The process reiterates until all the user selected categories are covered.
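The selection step of ASPSR can be sketched as follows (Python; a minimal illustration of the heuristic only, not the full loop):

```python
from math import dist

def aspsr_pick(S, D, lzero, pois):
    """Among all categories currently in Lzero, pick the POI minimizing
    DistE(S, p) + DistE(p, D), i.e., the A*-style cost that biases the
    search toward the ellipse with focal points S and D."""
    return min(((c, p) for c in lzero for p in pois[c]),
               key=lambda cp: dist(S, cp[1]) + dist(cp[1], D))
```

POIs near the line segment from S to D minimize DistE(S, p) + DistE(p, D), which is exactly what pulls ASPSR toward the major axis of the ellipse.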
However, ASPSR may produce roundabout routes. Consider the example shown in
Figure 2.4. Because the POIs closer to the major axis of the ellipse (with S and D as the
two focal points) have a lower distance cost, Bank1, Restaurant1, and GasStation1 are
picked sequentially in ascending order of their costs. Consequently, a detour occurs in
which the user has to travel far away from D to visit Restaurant1 and GasStation1
before finally reaching D. Therefore, we need to improve ASPSR to solve this problem.
The improved version of ASPSR is named the Advanced A* Search-based Partial
Sequenced Route (AASPSR) algorithm. In the remainder of this dissertation, we use
AASPSR(k) to denote our AASPSR algorithm, where the parameter k specifies the
number of POIs we retain for each category. AASPSR(k) can be considered a hybrid
scheme combining ASPSR and NNPSR. Specifically, AASPSR(k) first computes Ci.P*
for each category Ci in C such that every POI in Ci.P* is a POI with the top-k minimum
Figure 2.4: A trip search by ASPSR with the traveling rule Bank → Restaurant.
Figure 2.5: A trip search by AASPSR(1) with the traveling rule Bank → Restaurant.
traveling distance sum from S to D in Ci. In particular, if k = 1, only the one POI with the
shortest traveling distance sum from S to D is added to Ci.P*. On the contrary, if
k = ∞, then all the POIs on the underlying road network are included in Ci.P* for each
Ci. After Ci.P* has been generated for each category, we launch NNPSR on the selected
POIs in each Ci.P* only. Starting from S, we search for the nearest POI
p in Ci.P* (Ci ∈ Lzero). Afterward, p is inserted into Lroute and the location of p becomes
the query point of the following NN query. Next, we remove the category of p from Lzero and
recompute the adjacency list. The whole process repeats until Lzero becomes empty. The
complete algorithm of AASPSR(k) is illustrated in Algorithm 2. For comparison purposes,
the trips generated by AASPSR(1) and by AASPSR(2) are demonstrated in Figure 2.5 and
Figure 2.6, respectively.
Figure 2.6: A trip search by AASPSR(2) with the traveling rule Bank → Restaurant.
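The Ci.P* filtering step of AASPSR(k) can be sketched as follows (Python, Euclidean distances; `topk_per_category` is our name for the step that precedes the NNPSR pass over the reduced candidate sets):

```python
import heapq
from math import dist

def topk_per_category(S, D, pois, k):
    """Compute Ci.P* for each category: the k POIs with the smallest
    traveling distance sum DistE(S, p) + DistE(p, D).  NNPSR is then
    executed over these reduced candidate sets only."""
    return {c: heapq.nsmallest(k, ps, key=lambda p: dist(S, p) + dist(p, D))
            for c, ps in pois.items()}
```

With k = 1, only the single best POI per category survives; as k grows, the candidate sets approach the full POI sets and AASPSR(k) behaves increasingly like NNPSR.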
2.3.5 Comparison between NNPSR, NNPSR-LORD and AASPSR
In order to analyze the performance of the three aforementioned algorithms, we have
the following two definitions:
Definition A MRPSR query is called strictly constrained if its PCC value is relatively
high.
Definition A MRPSR query is called loosely constrained if its PCC value is relatively
low.
Figure 2.7: An example where AASPSR generates a longer route than NNPSR.
Theorem 2.5. Given a MRPSR query, AASPSR does not necessarily return a shorter route
than NNPSR.
Proof. We prove it by the counter-example shown in Figure 2.7. In Figure 2.7, the
triangle, rectangle, and pentagon each represent a different category of POIs, and S and
D denote the start point and destination of the trip, respectively. Additionally, we have a
traveling rule triangle → rectangle → pentagon. Consequently, the NNPSR-created
trip plan T1 = {S, px1, py1, pz1, D} (the solid route) is shorter than the AASPSR-created
trip plan T2 = {S, px2, py2, pz2, D} (the dashed route). The existence of traveling
rules leads to a roundabout route in AASPSR, which only chooses POIs inside the
ellipse. Therefore, AASPSR performs worse than NNPSR in terms of route distance in this
scenario.
In each category, AASPSR(k) only chooses the k POIs with the minimum traveling
distance sum from the start point to the destination for the subsequent NN search.
Consequently, if too many traveling rules are imposed, i.e., there are many restrictions
on the category order, AASPSR(k) is very likely to generate a roundabout route.
Therefore, for strictly constrained MRPSR queries, which have a higher PCC, NNPSR usu-
ally returns a shorter route than AASPSR(k). On the other hand, for loosely constrained
MRPSR queries, which have a lower PCC, such as TPQ (PCC equals zero), AASPSR(k)
usually outperforms NNPSR in terms of route distance.
By taking advantage of the LORD algorithm [52], NNPSR-LORD can further shorten
the length of the route retrieved by the NNPSR algorithm. Consequently, NNPSR-LORD
can consistently outperform NNPSR in terms of route distance. However, as far as the
response time is concerned, NNPSR-LORD, compared with NNPSR and AASPSR, needs
much more computational time, especially in road networks. This is due to its extensive
usage of network distance functions. Therefore, NNPSR-LORD is only applicable to non-
time sensitive applications. A complete comparison among proposed algorithms is shown in
Table 2.3.
                                 NNPSR   AASPSR      AASPSR      NNPSR-LORD
                                         (small k)   (large k)
Strictly constrained MRPSR         ✓                    ✓            ✓
  (e.g., OSR)
Loosely constrained MRPSR                   ✓                        ✓
  (e.g., TPQ)
Further optimized route                                              ✓
Time sensitive application         ✓        ✓
Non-time sensitive application                          ✓            ✓
Table 2.3: The feature comparison among proposed algorithms.
2.4 Experimental Validation
2.4.1 Experimental Setup
This section reports the experimental results. We implemented the NNPSR,
NNPSR-LORD, and AASPSR(k) algorithms on road networks to evaluate their performance
with respect to the returned route distance and the response time needed to generate the
corresponding routes. In addition, to highlight the benefits of our three approximate
approaches, we used the LORD-based brute-force solution as the baseline, which applies
LORD [52] to each possible permutation of all categories to obtain the optimal sequenced
route for each particular category sequence. For each run of the LORD-based brute-force
solution, we compared the distances of all the possible optimal sequenced routes and recorded
the minimum route distance and overall response time.
As discussed before, AASPSR(k) exhibits more characteristics of NNPSR as k
increases (in particular, AASPSR(k) degrades to NNPSR if k = ∞) and more charac-
teristics of AASPSR(1) as k decreases. Therefore, in this section we focus on the
performance of AASPSR(1) (the terms AASPSR and AASPSR(1) are used interchangeably
in this section). We varied the following parameters to obtain their effects on the route dis-
tance and response time: the Percentage of the Constrained Categories (PCC), the average
category cardinality, and the number of query categories. PCC describes the percentage of
the number of categories involved in traveling rules over the total number of categories to be
visited in a query. The average category cardinality is the average number of POIs over all
categories while the number of query categories is the total number of categories to be visited
in the query. For each result of the NNPSR, AASPSR and NNPSR-LORD algorithms, 100
MRPSR queries were launched with a starting point and a destination generated randomly
on the road network, and then the results were averaged. All the experiments were conducted
on a Linux machine with an Intel Core2 Quad CPU (Q9400 2.66GHz) and 4GB memory.
Fig. 2.8(a) Road network of California. Fig. 2.8(b) California points of interest distribution.
Figure 2.8: Real datasets from the state of California.
Road Network Dataset
To investigate the performance of our proposed algorithms for road networks, first we
obtained the road network dataset of the state of California from [1]. As shown in Figure
2.8(a), the road network of California contains 21,048 nodes and 22,830 edges. Each node is
described by a tuple ⟨Node ID, Longitude, Latitude⟩ and each edge is represented by
a tuple ⟨Edge ID, Start Node ID, End Node ID, L2 Distance⟩.
California Point of Interest Dataset
We collected the points of interest of the state of California from [2], as shown in Fig-
ure 2.8(b). This California dataset has 63 different categories, including airports, hospitals,
schools, populated places, etc., which correspond to more than 100,000 points of interest.
Each category exhibits a distinct density and distribution. Each point of interest is rep-
resented as a tuple ⟨Category Name, Longitude, Latitude⟩. The cardinalities of all the
categories used in our research are shown in Table 2.4.
To merge the points of interest in the real dataset with the road network, we adopted
the map format where each point of interest was first mapped to a point on an edge and
then represented by the distance from this point to the start node of that edge.
Synthetic Point of Interest Datasets
To control different cardinalities and distributions of categories, we also applied synthetic
datasets in our experiments. We generated different numbers of points of interest for different
datasets and uniformly distributed the points of interest on the edges of the California road
networks.
Traveling Rules
Without loss of generality, for the real dataset, rules were generated between Building
and Populated place, Church and Hospital, and Locale and Park, i.e., the rules can be
represented as follows: Building → Populated place, Church → Hospital, and Locale → Park.
For the synthetic datasets, rules were generated between two arbitrary categories at random.
2.4.2 Effect of the Percentage of the Constrained Categories
In our first experiment, we varied the Percentage of the Constrained Categories (PCC)
to investigate the performance of NNPSR, AASPSR, NNPSR-LORD and the LORD-based
brute-force solution in terms of the route distance and response time. Since our proposed
Category          Size
Airport            995
Area               287
Bar                278
Building          4110
Church            7680
Hospital           835
Locale           13481
Park              6728
School           11173
Populated place   6900
Summit            5594
Valley            7596
Table 2.4: The category cardinalities used in our California dataset.
Fig. 2.9(a) California dataset; Fig. 2.9(b) Synthetic dataset.
Figure 2.9: Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of PCC.
MRPSR query subsumes the TPQ and OSR queries, the MRPSR queries exhibit the char-
acteristics of TPQ queries when PCC decreases and the characteristics of OSR queries when
PCC increases. Our results are based on the California POI dataset and the synthetic POI
datasets, respectively. In the synthetic POI dataset, the average category cardinality is 6000.
Furthermore, we assumed that the number of query categories is 6. Figure 2.9 illustrates the
relationship between route distance and PCC for NNPSR, AASPSR, NNPSR-LORD, and
LORD-based brute-force algorithms.
Fig. 2.10(a) NNPSR and AASPSR (California dataset); Fig. 2.10(b) NNPSR-LORD and
LORD-based brute-force (California dataset); Fig. 2.10(c) NNPSR and AASPSR (synthetic
dataset); Fig. 2.10(d) NNPSR-LORD and LORD-based brute-force (synthetic dataset).
Figure 2.10: Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of PCC.
In Figure 2.9, the route distance increases with PCC for all the algorithms.
This is because a higher PCC imposes more restrictions on the order of the
categories, which leads to a longer route. Note that the route distance of AASPSR changes
remarkably with PCC in contrast to NNPSR and NNPSR-LORD. The lower the PCC, the
better AASPSR works compared with NNPSR. With a higher PCC, the route distance of
AASPSR increases dramatically. In other words, AASPSR is only suitable for planning a
trip with a low PCC, such as TPQ (PCC equals zero). This is because AASPSR only picks a
single POI in each category for the subsequent NN search. Consequently, given a higher PCC
(there are more restrictions on the category order), a longer route may be needed to traverse
all the POIs picked up in the first step. In addition, NNPSR-LORD outperforms NNPSR
and AASPSR in terms of route distance given any PCC. The reason is that NNPSR-LORD
employs LORD to obtain the shortest route under the specific order of categories in the
route found by NNPSR.
Figure 2.10 plots the response time against PCC for NNPSR, AASPSR, NNPSR-LORD,
and the LORD-based brute-force method. Notice that in Figure 2.10(b) and (d) we plot
the response time on a logarithmic scale because the response times of NNPSR-LORD
and the LORD-based brute-force solution are of different orders of magnitude. First, as
shown in Figure 2.10, all our proposed algorithms significantly reduce the response time
compared with the LORD-based brute-force solution. Specifically, NNPSR and AASPSR
are in some cases more than 10,000 times faster than the LORD-based brute-force method,
while NNPSR-LORD is more than 100 times faster. Second, the response times decrease as
PCC increases, because a higher PCC shrinks the POI search space.
2.4.3 Effect of the Average Category Cardinality
Next, we studied the effect of the average category cardinality by varying the cardinality
from 2000 to 14000 using synthetic datasets. Here we assumed that the number of query
categories is 6. Figure 2.11 shows the route distances of NNPSR, AASPSR, NNPSR-LORD,
and the LORD-based brute-force method where PCC equals 33% and 66%, respectively. As
Figure 2.11 shows, the route distance of each algorithm decreases as the average category
cardinality increases. The reason is that a denser category distribution provides more POI
choices, which results in a lower probability of detours. Notice that AASPSR has
Fig. 2.11(a) PCC = 33%; Fig. 2.11(b) PCC = 66%.
Figure 2.11: Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the average category cardinality.
relatively poor performance with either a PCC of 33% or 66%, because AASPSR outperforms NNPSR in terms of route distance only when PCC is very low; a PCC of 33% or 66% is large enough to degrade the performance of AASPSR. Figure 2.12 shows the response time for the above algorithms. As Figure 2.12 demonstrates, the response time of each algorithm increases as the average category cardinality grows, because a higher density of each POI category lengthens the computation time.
2.4.4 Effect of the Number of Query Categories
In this subsection, we changed the number of query categories to 3, 6, 9 and 12 to
investigate the impact of the number of query categories on the performance of NNPSR,
AASPSR, NNPSR-LORD, and the LORD-based brute-force method. Our experiments are
based on the California POI and the synthetic POI datasets, respectively. In the synthetic
POI dataset, the average category cardinality is assumed to be 6000. In addition, we assume
that PCC equals 66%. Figures 2.13 and 2.14 illustrate the experimental results. As shown
in Figure 2.13, when the number of query categories increases, the route distance of each algorithm grows dramatically, because with more categories to visit, more POIs must be traversed in a trip. Notice that the number
Fig. 2.12(a) NNPSR and AASPSR (PCC = 33%). Fig. 2.12(b) NNPSR-LORD and LORD-based brute-force (PCC = 33%). Fig. 2.12(c) NNPSR and AASPSR (PCC = 66%). Fig. 2.12(d) NNPSR-LORD and LORD-based brute-force (PCC = 66%).
Figure 2.12: Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-force as a function of the average category cardinality.
of query categories has a significant impact on whether AASPSR outperforms NNPSR in terms of route distance. For the three-category cases, AASPSR returns a shorter route than NNPSR. In contrast, for the 6-, 9-, or 12-category cases, AASPSR reports a longer route than NNPSR. The reason is that fewer categories lead to a lower probability for
Fig. 2.13(a) California dataset Fig. 2.13(b) Synthetic dataset
Figure 2.13: Route distance of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-
force as a function of the number of query categories.
a detour to occur when AASPSR tries to traverse all the POIs selected in its first step. On the other hand, as Figure 2.14 shows, when the number of query categories increases, the response time grows accordingly, because all the algorithms need more time to process more categories when answering an MRPSR query. In particular, AASPSR consistently needs more response time than NNPSR, irrespective of the number of query categories.
Fig. 2.14(a) NNPSR and AASPSR (California dataset). Fig. 2.14(b) NNPSR-LORD and LORD-based brute-force (California dataset). Fig. 2.14(c) NNPSR and AASPSR (synthetic dataset). Fig. 2.14(d) NNPSR-LORD and LORD-based brute-force (synthetic dataset).
Figure 2.14: Response time of NNPSR, AASPSR, NNPSR-LORD, and LORD-based brute-force as a function of the number of query categories.
Algorithm 1 Nearest Neighbor-based Partial Sequenced Route query(C, R, S, D)
1: Set Lroute = ∅ and q = S
2: Integrate all elements in R into an AOV adjacency list A and put all vertices with zero count in Lzero
3: if the AOV network is a DAG then
4:   Add all elements of C \ A into Lzero
5:   while Lzero ≠ ∅ do
6:     P = ∅
7:     for each Ci ∈ Lzero do
8:       P = Ci.P ∪ P
9:     end for
10:    Identify the road segment ninj covering q.
11:    Find all the POIs in P on ninj.
12:    if there exists at least one POI in P on ninj then
13:      Update PNN with the POI Pk with the smallest DistN(q, Pk).
14:    else
15:      Q = <(ni, DistN(q, ni)), (nj, DistN(q, nj))>
16:      De-queue the node n in Q with the smallest DistN(q, n)
17:      while DistN(q, n) < Threshold do
18:        for each non-visited adjacent node nk of n do
19:          Find all the POIs in P on the road segment nnk.
20:          Update PNN with the POI p* in P with the smallest network distance found so far
21:          Update Threshold with DistN(q, p*)
22:          En-queue (nk, DistN(q, nk)) in Q
23:        end for
24:        De-queue the node n in the updated Q with the smallest DistN(q, n)
25:      end while
26:    end if
27:    q = PNN
28:    Lroute = Lroute ∪ {PNN}
29:    Remove PNN.C from Lzero
30:    Update A and Lzero
31:  end while
32:  return Lroute
33: else
34:  Report cycles in R
35: end if
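The AOV bookkeeping in steps 2 and 28-30 amounts to Kahn-style topological processing of the rule graph: a category enters Lzero once all of its predecessors in R have been visited, and visiting it may release its successors. A minimal sketch of that bookkeeping (the category names and rule set below are illustrative, not from the experiments):

```python
from collections import defaultdict

# Kahn-style bookkeeping for the AOV network of traveling rules: l_zero holds
# the categories whose predecessors have all been visited; visiting one may
# release its successors. Category names and rules here are illustrative.

def visit_order(categories, rules):
    """Yield categories in an order consistent with the rules, or detect a cycle."""
    indeg = {c: 0 for c in categories}
    succ = defaultdict(list)
    for before, after in rules:          # rule: `before` must precede `after`
        succ[before].append(after)
        indeg[after] += 1
    l_zero = [c for c in categories if indeg[c] == 0]
    order = []
    while l_zero:
        c = l_zero.pop()                 # NNPSR picks the nearest POI among these
        order.append(c)
        for s in succ[c]:
            indeg[s] -= 1
            if indeg[s] == 0:
                l_zero.append(s)
    if len(order) < len(categories):
        raise ValueError("cycle in rules")  # the AOV network is not a DAG
    return order

order = visit_order(["gas", "bank", "restaurant"], [("gas", "restaurant")])
assert order.index("gas") < order.index("restaurant")
```

The actual algorithms interleave this bookkeeping with nearest-neighbor search over the road network, which this sketch omits.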
Algorithm 2 Advanced A* Search-based Partial Sequenced Route query(C, R, S, D, k)
1: Set Lroute = ∅ and q = S
2: Integrate all elements in R into an AOV adjacency list A and put all vertices with zero count in Lzero
3: if the AOV network is a DAG then
4:   Add all elements of C \ A into Lzero
5:   for each category Ci ∈ C do
6:     for each POI pj ∈ Ci.P do
7:       Costj = DistE(S, pj) + DistE(pj, D)
8:     end for
9:     Sort the POIs pj ∈ Ci in ascending order of Costj and add the top-k pj ∈ Ci with the minimum Costj into Ci.P*
10:  end for
   {ASPSR search done; the subsequent NNPSR search starts}
11:  while Lzero ≠ ∅ do
12:    P = ∅
13:    for each Ci ∈ Lzero do
14:      P = P ∪ Ci.P*
15:    end for
16:    Identify the road segment ninj covering q.
17:    Find all the POIs in P on ninj.
18:    if there exists at least one POI in P on ninj then
19:      Update PNN with the POI Pk with the smallest DistN(q, Pk).
20:    else
21:      Q = <(ni, DistN(q, ni)), (nj, DistN(q, nj))>
22:      De-queue the node n in Q with the smallest DistN(q, n)
23:      while DistN(q, n) < Threshold do
24:        for each non-visited adjacent node nk of n do
25:          Find all the POIs in P on the road segment nnk.
26:          Update PNN with the POI p* in P with the smallest network distance found so far
27:          Update Threshold with DistN(q, p*)
28:          En-queue (nk, DistN(q, nk)) in Q
29:        end for
30:        De-queue the node n in the updated Q with the smallest DistN(q, n)
31:      end while
32:    end if
33:    q = PNN
34:    Lroute = Lroute ∪ {PNN}
35:    Remove PNN.C from Lzero
36:    Update A and Lzero
37:  end while
38:  return Lroute
39: else
40:  Report cycles in R
41: end if
Chapter 3
Approximate Query Evaluation on RFID Databases
3.1 Bayesian Inference
Symbol       Meaning
H            The random vector representing the locations of all objects
hi           The random variable representing the location of object oi
Z            The raw data reported by the RFID readers
zij          The raw datum (0 or 1) reported by the reader in zone j for object oi
post(H|Z)    The posterior probability of the location vector H given the raw data Z
p(zij|hi)    The likelihood that the zone-j reader reports the value zij for object oi given that object oi is in zone hi
p(hj)        The prior probability that object oj is in zone hj
Table 3.1: Symbolic notations of Section 3.1.
\[
\mathrm{post}(H \mid Z) = \mathrm{post}\left(h_1, h_2, \ldots, h_n \,\middle|\, \begin{pmatrix} z_{11} & \cdots & z_{1m} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{nm} \end{pmatrix}\right) \propto p\left(\begin{pmatrix} z_{11} & \cdots & z_{1m} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{nm} \end{pmatrix} \,\middle|\, h_1, h_2, \ldots, h_n\right) \cdot p(h_1, h_2, \ldots, h_n) \tag{3.1}
\]

\[
\mathrm{post}(H \mid Z) \propto \prod_i p(z_{i1}, z_{i2}, \ldots, z_{im} \mid h_1, h_2, \ldots, h_n) \cdot p(h_1, h_2, \ldots, h_n) \tag{3.2}
\]

\[
\mathrm{post}(H \mid Z) \propto \prod_i p(z_{i1} \mid h_i) \cdot p(z_{i2} \mid h_i) \cdots p(z_{im} \mid h_i) \cdot \prod_j p(h_j) \tag{3.3}
\]
In this section, we develop a Bayesian inference-based approach to handle redundant
readings and prior knowledge, and we analyze the challenges when applying this approach.
Table 3.1 summarizes the notations used in this section.
3.1.1 A Bayesian Inference Based Approach
Bayesian inference is a statistical inference technique that estimates the probability of a hypothesis x based on observations y. Bayes' theorem states that the posterior is proportional to the product of the likelihood and the prior, which can be represented as p(x|y) ∝ p(y|x) p(x).
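As a minimal numerical illustration of this proportionality (the zones and probabilities below are made up, not taken from the experiments), a discrete posterior is obtained by multiplying likelihood and prior pointwise and normalizing:

```python
# Discrete Bayesian inference: posterior ∝ likelihood × prior.
# The hypotheses and probabilities below are illustrative only.

def posterior(likelihood, prior):
    """Return the normalized posterior p(x|y) from p(y|x) and p(x)."""
    unnorm = {x: likelihood[x] * prior[x] for x in prior}
    z = sum(unnorm.values())          # normalizing constant
    return {x: v / z for x, v in unnorm.items()}

# Two candidate locations for an object, with a uniform prior.
prior = {"zone1": 0.5, "zone2": 0.5}
# Likelihood of the observed reading under each hypothesis.
likelihood = {"zone1": 0.95, "zone2": 0.05}

post = posterior(likelihood, prior)
print(post)  # zone1 receives probability 0.95, zone2 receives 0.05
```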
Suppose there are m zones and n objects in our monitoring environment, with a reader mounted at the center of each zone. Let oi represent the object with ID i. For each oi, its location is represented by a random variable hi. Therefore, a possible distribution of the n objects over the m zones can be denoted as an instance of the random vector H = (h1, h2, ..., hn), where hi is the ID of the zone containing object oi. For example, h1 = 2 denotes that object o1 is in zone 2 in the current instance. For the reader in zone j, the raw datum (0 or 1) it receives from the RFID tag of object oi is denoted as zij. The raw data matrix for each complete scan from the m readers can then be represented as an n × m matrix Z = [zij]. Thus Bayes' theorem can be represented as Equation 3.1, where post(H|Z) denotes the posterior probability of the location vector H given the raw data Z, and a valid hypothesis is one that satisfies all constraints:
\[
\mathrm{post}(H \mid Z) = \alpha \cdot \prod_i p(z_{i1} \mid h_i) \cdot p(z_{i2} \mid h_i) \cdots p(z_{im} \mid h_i) \cdot \prod_j p(h_j) \tag{3.4}
\]

post(H|Z) = 0 : H is not valid
post(H|Z) > 0 : H is valid
post(H1|Z) > post(H2|Z) : H1 is more likely than H2
In particular, if zij = 1 in a raw data matrix but the actual location of object oi is not zone j, then zij is a false positive. Take Table 1.1 as an example: at least one of z22 and z23 is a false positive because object 2 cannot be in zone 2 and zone 3 simultaneously. Similarly, at least one of z43 and z44 is a false positive.
To compute post( ?H|Z), we make some independence assumptions of random variables.
RFID reader transmissions or tag transmissions may lead to collisions because readers and
tags communicate over a shared wireless channel. Reader collisions happen when neighboring
readers communicate with a tag simultaneously [18] and tag collisions occur when multiple
tags transmit to a reader at the same time [20]. However, the two kinds of collisions can be
effectively prevented by arbitration protocols (e.g. by scheduling adjacent readers to operate
at different times) [26,42,64]. Therefore, we assume each reader detects the tags of different
objects independently (i.e., whether a reader can successfully detect the tag of a certain
object does not interfere with whether the reader can successfully detect that of another
object) in this research. Based on the assumption, we can derive Equation 3.2.
Furthermore, because we employ MH-C to take constraints into account (i.e., to ensure that each generated sample satisfies all the constraints), here we can simply assume independence between the different hi (i.e., the locations of objects). In addition, we assume that the detections of the same object by different readers are independent, and that the prior distribution of each object does not depend on that of the other objects. Therefore, we can obtain Equation 3.3. Rewriting Equation 3.3 with the normalizing constant, denoted as α, yields Equation 3.4, which shows how to compute the posterior of each sample. To be specific, p(zij|hi) is the likelihood, i.e., the probability that the reader in zone j reports the value zij for object oi given that object oi is actually in the zone with ID hi. Furthermore, p(hj) denotes the prior probability that object oj is in the zone with ID hj. The prior probability can be interpreted as the assumed distribution before the RFID raw data are acquired.
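The per-sample posterior of Equation 3.4 can be sketched directly as a product of per-reader likelihoods and per-object priors. In this sketch, `likelihood(z, h, j)` stands in for p(zij|hi) and `uniform_prior` for p(hj); the toy 2-state likelihood and the numbers are illustrative placeholders, not the detection model of Section 3.2:

```python
# Sketch of the unnormalized posterior in Equation 3.4 for one hypothesis H.
# `likelihood(z, h, j)` stands in for p(z_ij | h_i) and `prior(h)` for p(h_j);
# both are placeholders to be supplied by a concrete detection model.

def sample_weight(H, Z, likelihood, prior):
    """Unnormalized posterior of location vector H given raw matrix Z."""
    w = 1.0
    for i, h_i in enumerate(H):          # product over objects
        for j, z_ij in enumerate(Z[i]):  # product over zones (readers)
            w *= likelihood(z_ij, h_i, j)
    for h in H:                          # product of priors
        w *= prior(h)
    return w

# Toy 2-state model: a reader detects only its own zone, with rate 0.9.
def toy_likelihood(z, h, j):
    r = 0.9
    p1 = r if h == j else 0.0
    return p1 if z == 1 else 1.0 - p1

def uniform_prior(h):
    return 0.5                           # two zones, uniform prior

Z = [[1, 0]]                             # one object, read only by reader 0
print(sample_weight([0], Z, toy_likelihood, uniform_prior))  # 0.9 * 1.0 * 0.5
```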
3.1.2 The Goal and the Obstacles
Based on Equation 3.4, given the raw readings Z and a hypothesis H (the location of each object), we can derive the probability of the hypothesis. However, finding just one valid hypothesis would provide nothing more than a biased answer to queries against the uncertain data. To address this issue, we need to query against all valid hypotheses, which is unrealistic because there are numerous valid hypotheses in most cases. Thus, our goal is to create a large sample set of valid hypotheses, each associated with a weight computed by Equation 3.4: (H1, w1), (H2, w2), ..., (Hn, wn). The sample set of valid hypotheses as
a whole enables us to answer queries with high credibility. To achieve this goal, we must
overcome the following obstacles:
• A prerequisite for effective hypothesis sampling is the ability to compute the posterior probability of each hypothesis precisely. Therefore, we propose the n-state detection model in Section 3.2 to capture likelihoods in an affordable and accurate way.
• The hypothesis space is high dimensional, and the posterior probability is difficult to sample from. Thus, we need a sampling technique with desirable efficiency. We apply a Markov Chain Monte Carlo (MCMC) method because MCMC can maintain the correlation between samples, resulting in improved sampling efficiency.
• We need to incorporate constraint management in sampling. We propose a sampler called the Metropolis-Hastings sampler with Constraints (MH-C), which improves the naive Metropolis-Hastings (MH) sampler. Each sample generated by MH-C automatically satisfies all the constraints.
3.2 RFID Reader Detection Models
The major difficulty in computing the posterior of each sample (Equation 3.4) lies in how
to accurately estimate the likelihood p(zij|hi). To do so, we introduce the n-state detection
model to capture likelihood in an affordable and precise way.
3.2.1 Physical Characteristics
An RFID reader sends RF signals to communicate with (passive) tags to retrieve a list of
IDs in its detection range. For a typical warehouse application, the frequency channel of 915
MHz is usually employed for the communication because of its long detection range, which is
about 10-20 feet [66]. However, RFID data acquisition and transmission are unreliable [20,
29,31,44]. In our experiment, we investigated the change of the read rate over distance using
regular RFID readers and tags. The results are illustrated in Figure 3.1. The model of the
tags is Gen2 RFID Smart Label and the model of the reader is MPR-6000 (antenna 902-928
MH), provided by WJ Communications Inc. Our test environment is a lab with many metal
objects (tables, desks and computer equipment), representing a noisy environment.
As shown in Figure 3.1, the overall detection range of a reader can be separated into a major detection region and a minor detection region. In the major detection region, from 0 to almost 5 feet, the read rate stays at a level of around 95%; in the minor detection region, approximately from 5 to 13 feet, the read rate drops off almost linearly. Furthermore, the read rate deteriorates to zero in the region more than 13 feet away from the reader, which is defined as beyond the overall detection range [31]. The read rate falls as the distance increases, similar to the way the strength of a wireless signal fades as it propagates in open space.
Figure 3.1: An illustration of the relationship between read rate and distance.
3.2.2 Problems of the 2-State Detection Model
Taking the scenario in Figure 1.2 as an example, one way to estimate likelihood is as
follows:
\[
p(z_{ij} = 1 \mid h_i) = \begin{cases} r & \text{if } h_i \in \{j-1, j, j+1\} \\ 0 & \text{otherwise} \end{cases} \tag{3.5}
\]

\[
p(z_{ij} = 0 \mid h_i) = \begin{cases} 1 - r & \text{if } h_i \in \{j-1, j, j+1\} \\ 1 & \text{otherwise} \end{cases} \tag{3.6}
\]
where r is the average read rate. Intuitively, this means that an object oi has the same probability of being detected by a reader whether oi is in the zone j that the reader is associated with or in one of the neighboring zones (j−1 or j+1). Compared with Figure 3.1, this 2-state detection model is inherently inaccurate, as it fails to capture any change of the read rate within the overall detection region. Consequently, when predicting the location of an object, the resulting system is unable to differentiate between the reader's own zone and its neighboring zones, because all of these zones share an identical read rate.
42
In order to solve this problem, current works are forced to adopt a simplified 2-state detection model, which is shown in Figure 3.2. This simplified 2-state model assumes that readers' detection regions do not overlap, i.e., a reader is only able to detect the objects in its own zone. Then the likelihood can be estimated as follows:
\[
p(z_{ij} = 1 \mid h_i) = \begin{cases} r & \text{if } h_i = j \\ 0 & \text{otherwise} \end{cases} \tag{3.7}
\]

\[
p(z_{ij} = 0 \mid h_i) = \begin{cases} 1 - r & \text{if } h_i = j \\ 1 & \text{otherwise} \end{cases} \tag{3.8}
\]
This simplified 2-state model, however, has two problems. First, it is unrealistic to
assume that we can divide the space into non-overlapping detection regions. Second, the
model does not support applications that use redundant information (such as redundant
readings) to offset the unreliability of raw RFID readings.
Figure 3.2: An illustration of the simplified 2-state model.
3.2.3 The n-state Detection Model
In order to take full advantage of duplicate readings, we propose an n-state detection
model, which is illustrated in Figure 3.3. In Figure 3.3, the overall detection region of
an RFID reader is divided into several sub-regions, each of which corresponds to a zone
associated with a unique read rate. As far as a specific detection model is concerned, the
difference in the read rate between any two adjacent sub-regions is a constant, i.e., the read rates of the different states constitute an arithmetic sequence. Take the 4-state model as an example, and suppose the highest read rate in the model is x. The first state (counted outward with increasing detection distance) has a read rate of x, the second state a read rate of 2x/3, the third state a read rate of x/3, and the fourth state a read rate of zero. Thus, for a specific reader, the n-state detection model assigns each correlated zone a distinct read rate according to its distance from the reader. In particular, the simplest member of the n-state family is the 2-state model, in which an identical read rate is assumed over the entire detection region of each reader.
Figure 3.3: The family of the n-state detection model.
Notice that in practice, the value of n depends on how zones are divided within the overall detection regions of the RFID readers. This is because an n-state model in fact implies that every 2(n−2) + 1 = 2n−3 zones correlate with each other (assuming that all the zones are in a 1-dimensional distribution). For example, if it is known as prior knowledge that one object can be read simultaneously by up to five readers, we have to choose n = 4 to incorporate the correlation among every 5 successive zones.
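The arithmetic read-rate schedule of the n-state model can be written down directly; this small sketch (parameter values chosen for illustration) lists the rate of each state, from the reader outward:

```python
# Read rates of the n-state detection model: the ith state (i = 1..n, counted
# outward from the reader) has rate (n - i) * x / (n - 1), an arithmetic
# sequence descending from x to 0.

def state_read_rates(n, x):
    return [(n - i) * x / (n - 1) for i in range(1, n + 1)]

print(state_read_rates(4, 0.9))   # rates x, 2x/3, x/3, 0 for x = 0.9
```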
3.2.4 A Case Study: The 3-State Model
We elaborate on the 3-state model as a case study. Suppose one reader can only detect its own zone and the two neighboring zones. This assumption implies that, for a particular reader, there are three distinct location-based states of an object: in the same zone as the reader, in a neighboring zone, and in any of the other zones. To capture this correlation,
we have to choose n = 3 and the resulting model is the 3-state detection model, where the
overall detection range of a reader is divided into two sub-regions, as shown in Figure 3.4.
Specifically, the major detection region, the minor detection region, and the zero-read-rate region in Figure 3.4 correspond to the zone where the reader is located, the neighboring zones, and all the other zones, respectively. Therefore, the motivating scenario (Figure 1.2), interpreted by the 3-state model, can be illustrated as in Figure 3.5.
Figure 3.4: The 3-state detection model of RFID readers.
In Figure 3.5, by using the 3-state detection model, not only can the duplicate readings be incorporated, but a zone and its neighboring zones can also be differentiated, because they have distinct read rates. To be specific, if object oi is in zone j, then not only zij but also zi(j−1) and zi(j+1) have a considerable chance to be 1 (false positives), because object oi lies in the major detection region of the reader in zone j and may also lie in the minor detection regions of the readers in zones j−1 and j+1. All the other readers are unable to detect the tag of object oi, since object oi is entirely beyond their overall detection regions, leading those readers to report 0 for object oi. If we denote the mean read rate in the major detection region as rmajor and the mean read rate in the minor detection region as rminor, the estimate of the likelihood using the 3-state model can be represented in Equation 3.9 and Equation 3.10.
Figure 3.5: The detection region overlap interpreted by the 3-state detection model.
\[
p(z_{ij} = 1 \mid h_i) = \begin{cases} r_{\text{major}} & \text{if } h_i = j \\ r_{\text{minor}} & \text{if } h_i \in \{j-1, j+1\} \\ 0 & \text{otherwise} \end{cases} \tag{3.9}
\]

\[
p(z_{ij} = 0 \mid h_i) = \begin{cases} 1 - r_{\text{major}} & \text{if } h_i = j \\ 1 - r_{\text{minor}} & \text{if } h_i \in \{j-1, j+1\} \\ 1 & \text{otherwise} \end{cases} \tag{3.10}
\]
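Equations 3.9 and 3.10 translate directly into a small likelihood function; the read-rate values below are placeholders to be calibrated from measurements like those in Figure 3.1:

```python
# Likelihood p(z_ij | h_i) under the 3-state model (Equations 3.9 and 3.10).
# r_major and r_minor would be measured empirically; the defaults here are
# illustrative placeholders.

def likelihood_3state(z, h, j, r_major=0.95, r_minor=0.475):
    """Probability that the reader in zone j reports z for an object in zone h."""
    if h == j:
        p1 = r_major                 # object in the reader's own zone
    elif h in (j - 1, j + 1):
        p1 = r_minor                 # object in a neighboring zone
    else:
        p1 = 0.0                     # beyond the overall detection range
    return p1 if z == 1 else 1.0 - p1

assert likelihood_3state(1, 5, 5) == 0.95    # own zone, positive reading
assert likelihood_3state(1, 4, 5) == 0.475   # neighboring zone, positive reading
assert likelihood_3state(0, 9, 5) == 1.0     # out of range, zero reading
```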
The advantage of the 3-state detection model over the 2-state detection model can be
measured by the system entropy. We also answer the question whether having even more
states (more than 3) can further benefit the system.
3.3 Entropy Analysis
We use entropy to measure the uncertainty in a system after invalid system states
have been eliminated by a data cleansing method. Generally, applying an efficient data
cleansing method will lead to systems with smaller entropy. In this section, we first show
the advantage of the 3-state model over the 2-state model and then prove that the 3-state
model can maximize the system performance compared with other detection models with
even more than 3 states. A snippet of the RFID raw data is shown in Table 3.2, and the
actual location of an object oi is denoted as a random variable L.
3.3.1 Entropy versus Read Rate
Entropy of the 2-State Model
Suppose that y denotes the read rate in the 2-state detection model. According to the
right side of Equation 3.4, the probability mass function of L in the 2-state model can be
represented as:
\[
p(L = l) = \begin{cases} \alpha \, (1-y)\, y\, (1-y)\, \beta & \text{if } l = j \\ \alpha \, (1-y)(1-y)\, y\, \beta & \text{if } l \in \{j-1, j+1\} \\ 0 & \text{otherwise} \end{cases} \tag{3.11}
\]
where α is the normalizing constant and β represents the prior probability (we assume a uniform prior distribution) in Equation 3.4. Because the entropy H of a discrete random variable X with possible values {x1, x2, ..., xn} can be represented as
\[
H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i),
\]
where p(xi) denotes the probability mass function of X, we can calculate the entropy H of L as:
         ...   Zone j−1 reader   Zone j reader   Zone j+1 reader   ...
 ...     ...        ...              ...              ...          ...
 Obj i   ...         0                1                0           ...
 ...     ...        ...              ...              ...          ...
Table 3.2: A snippet of the RFID raw data.
\[
\begin{aligned}
H(L) = {} & -\alpha (1-y)(1-y)\, y\, \beta \cdot \ln\!\left(\alpha (1-y)(1-y)\, y\, \beta\right) \\
& -\alpha (1-y)\, y\, (1-y)\, \beta \cdot \ln\!\left(\alpha (1-y)\, y\, (1-y)\, \beta\right) \\
& -\alpha (1-y)(1-y)\, y\, \beta \cdot \ln\!\left(\alpha (1-y)(1-y)\, y\, \beta\right)
\end{aligned} \tag{3.12}
\]
Because the probabilities over all the locations sum to 1, we can derive Equation 3.13. Applying Equation 3.13 to Equation 3.12 (α and β cancel out), we obtain Equation 3.14.
\[
\alpha\beta = \frac{1}{3(1-y)^2 y} \tag{3.13}
\]
\[
H(L) = -3 \cdot \tfrac{1}{3} \ln \tfrac{1}{3} = \ln 3 \approx 1.098 \tag{3.14}
\]
Entropy of the 3-State Model
Figure 3.5 corresponds to the 3-state model scenario. Suppose x is the read rate in the
major detection region. Then the read rate in the minor detection region can be denoted as
x/2. Thus, according to the right side of Equation 3.4, the probability mass function of L
can be represented as follows:
\[
p(L = l) = \begin{cases} \alpha \left(1-\frac{x}{2}\right) x \left(1-\frac{x}{2}\right) \beta & \text{if } l = j \\[4pt] \alpha \left(1-\frac{x}{2}\right)(1-x)\,\frac{x}{2}\, \beta & \text{if } l \in \{j-1, j+1\} \\[4pt] 0 & \text{otherwise} \end{cases}
\]
Similarly, α is the normalizing constant and β represents the prior probability in Equation 3.4. Therefore, we can calculate the entropy H of L as:
\[
\begin{aligned}
H(L) = {} & -\alpha \left(1-\tfrac{x}{2}\right)(1-x)\tfrac{x}{2}\,\beta \cdot \ln\!\left(\alpha \left(1-\tfrac{x}{2}\right)(1-x)\tfrac{x}{2}\,\beta\right) \\
& -\alpha\, x \left(1-\tfrac{x}{2}\right)^2 \beta \cdot \ln\!\left(\alpha\, x \left(1-\tfrac{x}{2}\right)^2 \beta\right) \\
& -\alpha \left(1-\tfrac{x}{2}\right)(1-x)\tfrac{x}{2}\,\beta \cdot \ln\!\left(\alpha \left(1-\tfrac{x}{2}\right)(1-x)\tfrac{x}{2}\,\beta\right)
\end{aligned} \tag{3.15}
\]
Since the probabilities over all locations sum to 1, we can obtain Equation 3.16.
\[
\alpha\beta = \frac{1}{x\left(1-\frac{x}{2}\right)\left(2-\frac{3x}{2}\right)} \tag{3.16}
\]
Combining Equation 3.16 and Equation 3.15, we have:
\[
H(L) = -2 \cdot \frac{1-x}{4-3x} \ln \frac{1-x}{4-3x} - \frac{2-x}{4-3x} \ln \frac{2-x}{4-3x}
\]
In Figure 3.6, we plot the relationship between the reconstruction entropy and read rate
under the 2-state and 3-state models, respectively. As Figure 3.6 illustrates, the entropy
decreases accordingly with the increase of read rate, which indicates that the system will have
less uncertainty with more reliable readers. Moreover, Figure 3.6 shows that the entropy in
the 3-state model is always smaller than that in the 2-state model. For example, if x = 0.95,
the entropy in the 3-state model is 0.395 while the entropy in the 2-state model is 1.098.
This observation reveals that the 3-state model can be more informative in object localization
than the 2-state model.
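The comparison can be checked numerically from the two closed forms (Equation 3.14 for the 2-state model and the final 3-state expression above); this sketch is for verification only and is not part of the original experiments:

```python
import math

# Entropy of L under the 2-state model (Equation 3.14, a constant ln 3) and
# the 3-state model, as a function of the major-region read rate x.

def entropy_2state():
    # Equation 3.14: three equally likely candidate locations, H = ln 3.
    return math.log(3)

def entropy_3state(x):
    a = (1 - x) / (4 - 3 * x)        # probability of each neighboring zone
    b = (2 - x) / (4 - 3 * x)        # probability of the reader's own zone
    return -sum(p * math.log(p) for p in (a, a, b) if p > 0)

for x in (0.5, 0.8, 0.95):
    print(x, entropy_2state(), entropy_3state(x))
# The 3-state entropy stays below ln 3 ≈ 1.098 for every read rate x in (0, 1).
```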
3.3.2 Entropy versus Number of States
Here we investigate the relationship between system entropy and the number of states
in a detection model. Suppose an n-state model is adopted with the highest read rate of
x. Thus, the read rate in the ith state (counted with the increase of the detection distance)
can be represented as (n−i)x/(n−1). Combined with Equation 3.4 and Table 3.2, we can obtain the
Figure 3.6: Relationship between entropy and read rate in the 2-state and 3-state models.
\[
p(L = l) = \begin{cases}
\alpha \left(1-\frac{x}{n-1}\right)\left(1-\frac{2x}{n-1}\right)\cdots\left(1-\frac{(n-2)x}{n-1}\right) \cdot x \cdot \left(1-\frac{(n-2)x}{n-1}\right)\cdots\left(1-\frac{2x}{n-1}\right)\left(1-\frac{x}{n-1}\right) \beta & \text{if } l = j \\[6pt]
\alpha \left(1-\frac{x}{n-1}\right)\left(1-\frac{2x}{n-1}\right)\cdots\left(1-\frac{(n-2)x}{n-1}\right) \cdot \frac{(n-1-k)x}{n-1} \cdot \left(1-\frac{(n-2)x}{n-1}\right)\cdots\left(1-\frac{(n-k)x}{n-1}\right)(1-x)\left(1-\frac{(n-2-k)x}{n-1}\right)\cdots\left(1-\frac{2x}{n-1}\right)\left(1-\frac{x}{n-1}\right) \beta & \text{if } l \in \{j-k, j+k\},\ k \in \{1, 2, \ldots, n-2\} \\[6pt]
0 & \text{otherwise}
\end{cases} \tag{3.17}
\]
probability mass function of L, represented as Equation 3.17. Equation 3.17 shows that in an n-state model, a successful reading "1" by a certain reader for an object in fact implies that the object may be in any of the 2(n−2) + 1 = 2n−3 correlated zones (including the zone the reader is associated with), each with a non-zero probability.
Based on Equation 3.17, we plot the relationship between entropy and the number of states in a detection model in Figure 3.7, where we assume x equals 0.95 (the most common case). According to Figure 3.7, the 3-state detection model minimizes the system entropy and leads to the maximum system performance. In other words, having more than 3 states in a detection model can actually deteriorate the system performance. Therefore, our experiments mainly focus on the 3-state model.
Figure 3.7: Relationship between entropy and the number of states in a detection model.
3.4 Sampling
By using Bayesian inference, we derive the posterior, as shown in Equation 3.4. Since
Equation 3.4 is easy to compute but hard to sample from, we need an efficient method to
draw samples from the posterior distribution. In this section, we briefly review Markov Chain Monte Carlo (MCMC) and then show why MCMC is chosen in our solution. Next, we discuss the difference between the Metropolis-Hastings sampler and the Gibbs sampler, the two most commonly used MCMC samplers. Finally, we propose the Metropolis-Hastings sampler with Constraints (MH-C).
3.4.1 Preliminary
Monte Carlo Sampling
Monte Carlo Sampling is a technique to evaluate an integration of a random variable
by performing statistical sampling experiments, i.e., drawing a set of samples from a target
probability density function [4]. Suppose X is a random variable and its density distribution
51
is defined as p(x). f(x) is an objective function defined on X. Then, we can utilize Equa-
tion 3.18 to approximate the integration of f(x), which can be interpreted as the expected
value of f(x) on X.
integraldisplay
f(x)p(x)dx ? 1N
summationdisplay
i
f(x(i)),X?p(x) (3.18)
In Equation 3.18, f(x^(i)) is the value of the objective function at x^(i), the value of the random variable X in the ith sampling step, and N denotes the total number of samples. Equation 3.18 reflects that the expected value can be estimated by the average of the samples under the Monte Carlo method. This estimate is unbiased and converges to the exact value of the integration according to the strong law of large numbers [46].
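As a quick illustration of Equation 3.18 (the target distribution and objective function are chosen for convenience, not taken from the chapter), a sample average approximates the expectation:

```python
import random

# Monte Carlo estimate of E[f(X)] = ∫ f(x) p(x) dx (Equation 3.18), shown
# with X uniform on [0, 1] and f(x) = x^2, whose exact expectation is 1/3.

def mc_estimate(f, sampler, n):
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)                        # reproducible draws
est = mc_estimate(lambda x: x * x, random.random, 100_000)
print(est)                            # close to 1/3 for large n
```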
Rejection Sampling
In rejection sampling, instead of sampling directly from the target distribution p(x), we
sample indirectly from another proposal density distribution q(x) which is easier to sample.
The criterion for choosing a q(x) is that its support¹ should be larger than that of p(x) and that Equation 3.19 can be satisfied:
\[
p(x) \le C \cdot q(x) \tag{3.19}
\]
where C is a bounding constant that guarantees Equation 3.19 holds for any x within the support of p(x). The exact sampling procedure is as follows. Suppose that in the ith cycle of sampling, we draw x* according to q(x). We then accept x* as x^(i) with probability equal to the acceptance rate α; if x* is accepted, we move to the next cycle, and otherwise we draw another sample from q(x) and repeat the above steps. The acceptance rate α can be computed by Equation 3.20.
\[
\alpha = \frac{p(x^*)}{C \cdot q(x^*)} \tag{3.20}
\]
¹The support of a distribution f(x) is the closure of a set X such that f(x) ≠ 0, ∀x ∈ X.
Notice that each sample generated by rejection sampling is independent.
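The accept/reject loop of Equations 3.19 and 3.20 can be sketched for a concrete pair of distributions (chosen here purely for illustration): target p(x) = 2x on [0, 1], uniform proposal q(x) = 1, and bound C = 2, so the acceptance rate for a candidate x* is simply x*.

```python
import random

# Rejection sampling sketch (Equations 3.19 and 3.20): draw from the target
# p(x) = 2x on [0, 1] using a uniform proposal q(x) = 1 with bound C = 2,
# so the acceptance rate for a candidate x* is p(x*) / (C q(x*)) = x*.

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = random.random()          # candidate drawn from q
        if random.random() < x:      # accept with probability p(x)/(C q(x))
            samples.append(x)
    return samples

random.seed(1)
s = rejection_sample(50_000)
print(sum(s) / len(s))               # sample mean ≈ E[X] = 2/3 under p
```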
Importance Sampling
Instead of sampling directly from the target distribution p(x), we sample indirectly from
another proposal density distribution q(x). Here, we only require that the support of q(x)
subsumes that of p(x). Also, Equation 3.18 can be rewritten as Equation 3.21.
\[
\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \tag{3.21}
\]
In Equation 3.21, p(x)/q(x) is defined as the importance weight of the sample x, which
can be denoted as w(x). Consequently, the Monte Carlo estimate of an integration based on
the importance sampling can be represented as Equation 3.22.
\[
\int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_i f(x^{(i)})\, w(x^{(i)}) = \sum_i f(x^{(i)})\, \tilde{w}(x^{(i)}) \tag{3.22}
\]
In Equation 3.22, \(\tilde{w}(x^{(i)})\) is the normalized weight of the sample x^(i), which can be computed by Equation 3.23 [17].
\[
\tilde{w}(x^{(i)}) = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})} \tag{3.23}
\]
Like rejection sampling, all the samples generated according to importance sampling
are also independent. This inhibits the possibility of using the correlation between samples
to improve the sampling efficiency.
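Equations 3.21-3.23 can be sketched end to end for an illustrative target p(x) = 2x on [0, 1] with a uniform proposal (these distributions and the objective f(x) = x² are placeholders, not from the chapter):

```python
import random

# Importance sampling sketch (Equations 3.21-3.23): estimate E_p[f(X)] for
# the target p(x) = 2x on [0, 1] by drawing from a uniform proposal q(x) = 1
# and reweighting with normalized importance weights.

random.seed(2)
N = 100_000
xs = [random.random() for _ in range(N)]   # draws from q
w = [2 * x for x in xs]                    # weights w(x) = p(x) / q(x)
total = sum(w)
w_norm = [wi / total for wi in w]          # normalized weights (Equation 3.23)

# Self-normalized estimate of E_p[X^2]; the exact value is ∫ x^2 · 2x dx = 1/2.
est = sum((x * x) * wi for x, wi in zip(xs, w_norm))
print(est)
```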
Markov Chain
A Markov chain is a stochastic process consisting of possible states of random variables [46]. It can be denoted as a sequence of states X1, X2, X3, ..., Xn satisfying
\[
p(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = p(X_{n+1} = x \mid X_n = x_n) \tag{3.24}
\]
where p(x|y) is the transition probability from state y to state x.
Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a technique to generate samples from the state
space by simulating a Markov chain. The formed Markov chain is constructed in such a way
that the chain spends more time in the regions with higher importance (i.e., the Markov chain
converges to the posterior as its stationary distribution). In other words, the equilibrium
distribution of the Markov chain converges to the target distribution. Once the number of samples is sufficiently large, the samples become fair draws from the posterior. Consequently, we are able to approximate a sophisticated posterior by deliberately constructing such a Markov chain of samples.
3.4.2 Sample Correlation
In our design, we choose MCMC instead of other sampling techniques because MCMC
maintains the correlation among samples. In MCMC, the next sample depends on the current
sample. Before we elaborate on how we can take advantage of sample correlation to improve
the efficiency of sampling in our scenario, we define two terms as follows.
Definition We call any sample generated by the sampler a candidate sample. A qualified
sample is a candidate sample that satisfies all constraints.
The existence of constraints leads to the uniqueness of our problem. Samples must sat-
isfy all the constraints to become qualified. Note that in most sampling problems, we prefer
independent samples, that is, the current draw of a sample is independent from the previ-
ous draw. In our scenario, however, the sampling techniques which generate independent
samples (e.g., importance sampling) may suffer from low sampling efficiency due to the loss
of correlation between adjacent samples. Figure 3.8 illustrates how the correlation between
samples can be utilized to improve the sampling efficiency. The qualified sampling space is a
subset of the complete sampling space. Suppose sample point A is the current sample in the
qualified sampling space. For an independent sampler, the next sample could be any point
in the complete sampling space. However, the next sample is useful only if it happens to fall
into the qualified sampling space. In contrast, if an MCMC-based sampler is employed, the next sample will be chosen according to the proposal distribution at point A, i.e., the next sample will be in the area denoted by the dotted circle centered at point A. Therefore,
compared to other independent samplers, the probability that the next sample generated by
MCMC falls into the qualified sampling space is considerably increased.
Note that although MCMC improves sampling efficiency, a sample generated by MCMC
may not necessarily be a qualified sample. As in Figure 3.8, point B is a sample generated
by MCMC after point A. However, point B is outside the qualified sampling space. Conse-
quently, we have to sample again to acquire point C, which is a qualified sample, and then
add it into the Markov chain as the next state (i.e., the Markov chain moves from point A
to point C).
3.4.3 Metropolis-Hastings and Gibbs Sampling
The Metropolis-Hastings (MH) sampler and the Gibbs sampler are the two most com-
mon MCMC samplers. MH conducts a sequence of random walks using a proposal distribu-
tion and decides whether to reject the proposed moves using the rejection sampling. In the
applications of Bayesian inference, the normalizing constant is usually extremely difficult to
Figure 3.8: Taking advantage of the correlation between samples to improve sampling effi-
ciency.
compute. MH avoids the computation of the constant. It approximates the posterior by
using only the ratio of the posterior, where the constant is canceled out.
Recall that the random vector representing the locations of objects is denoted as $\vec{H}$ and the posterior distribution is $post(\vec{H} \mid Z)$. Suppose $\vec{H}_{t-1}$ is the state immediately preceding the state $\vec{H}_t$ in the formed Markov chain. According to the MH algorithm, a sample $\vec{H}_q$ is first drawn from a proposal distribution $q(\vec{H}_q \mid \vec{H}_{t-1})$, i.e., $\vec{H}_q$ is a random deviation from $\vec{H}_{t-1}$. In our research, we use a uniform proposal distribution whose support is defined as the step length. The proposal sample $\vec{H}^\ast$ can be denoted as $\vec{H}_{t-1} + \vec{H}_q$. Then MH accepts $\vec{H}^\ast$ as the next state $\vec{H}_t$ with probability $\min\{1,\; post(\vec{H}^\ast \mid Z)\,/\,post(\vec{H}_{t-1} \mid Z)\}$.
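A minimal one-dimensional sketch of this random-walk construction follows; the target density is an illustrative stand-in for the posterior, not the dissertation's model.

```python
import math
import random

def metropolis_hastings(log_post, x0, step, n_samples):
    """Random-walk MH with a uniform proposal of half-width `step`.

    Only the *ratio* of posterior values appears in the acceptance
    test, so any normalizing constant cancels out.
    """
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + random.uniform(-step, step)   # uniform random-walk proposal
        delta = log_post(x_prop) - log_post(x)
        # Accept with probability min(1, post(x_prop) / post(x)).
        if delta >= 0 or random.random() < math.exp(delta):
            x = x_prop
        samples.append(x)
    return samples

# Illustrative unnormalized target: a normal with mean 2, variance 1.
log_post = lambda x: -((x - 2.0) ** 2) / 2.0
samples = metropolis_hastings(log_post, x0=0.0, step=1.0, n_samples=20000)
kept = samples[1000:]                              # discard burn-in samples
mean = sum(kept) / len(kept)                       # close to 2
```

Working with log-posteriors, as above, avoids numerical underflow when the posterior values are tiny; only their ratio matters.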
Here we briefly compare the MH sampler with the Gibbs sampler. The Gibbs sampler requires that the conditional (marginal) distributions of each variable are known and easy to sample from. MH relies only on the ratio of the posterior and does not require sampling from any conditional distribution. Because we have already derived the closed form of the posterior as Equation 3.4 and are able to calculate likelihoods easily according to the proposed n-state model, it is much more straightforward to use the MH sampler rather than the Gibbs sampler in our design.
3.4.4 Metropolis-Hastings Sampler with Constraints
Although the naive MH algorithm can evaluate the posterior by forming a Markov chain in the sampling space, it does not take constraints into account. If we impose constraints on samples, many of them must be rejected because they are inapplicable, i.e., they are not qualified samples.
To incorporate constraints in sampling, we propose a Metropolis-Hastings sampler with Constraints (MH-C). With MH-C, each zone is associated with multiple variables called resource descriptors. The current value of a resource descriptor represents how much of the associated resource is still available. Suppose we have a variable, denoted as $des_{zone_i}$, to keep track of the currently available vacancy in zone i. The initial value of $des_{zone_i}$ is set to the maximal capacity of zone i. Thus, whether an object j with the volume $vol_{object_j}$ can be stored in zone i can be examined by:

$$des_{zone_i} = des_{zone_i} - vol_{object_j} \tag{3.25}$$
The proposed resource allocation is feasible only if $des_{zone_i}$ is no less than zero. Otherwise, we have to re-sample until a new allocation meets all the constraints. Consequently, the problem of deciding whether an allocation is feasible (or compatible) reduces to the problem of monitoring the value of each descriptor.
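The descriptor update of Equation 3.25 amounts to a simple feasibility check; the helper name `try_allocate` below is hypothetical, introduced only for this sketch.

```python
def try_allocate(descriptors, zone, volume):
    """Tentatively place an object of the given volume into `zone`.

    Returns True and decrements the zone's descriptor when the zone
    still has room; returns False (descriptor unchanged) when the
    allocation would drive the descriptor below zero, in which case
    the sampler must re-sample that dimension.
    """
    if descriptors[zone] - volume < 0:
        return False
    descriptors[zone] -= volume
    return True

# Initial descriptor values are the maximal capacities of the zones.
descriptors = {1: 2, 2: 1}
assert try_allocate(descriptors, 1, 1)        # fits: descriptor 2 -> 1
assert try_allocate(descriptors, 1, 1)        # fits: descriptor 1 -> 0
assert not try_allocate(descriptors, 1, 1)    # zone 1 full: re-sample
assert try_allocate(descriptors, 2, 1)        # zone 2 still has room
```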
With MH-C, because each sample is a $D_{object}$-dimensional vector, a proposal sample is generated iteratively, dimension by dimension. If any descriptor for the current allocation is less than zero, there is no chance for the current partial sample to become a qualified sample. Therefore, we can discard the current value and choose another value for that dimension by re-sampling. As far as the proposal distribution is concerned, we construct a random walk chain by choosing a uniform proposal distribution within the step length. A detailed description of the MH-C algorithm is given in Algorithm 3 and the related notations are summarized in Table 3.3.
Symbol                  Meaning
Z                       The raw data matrix from RFID readers
S                       The sample set
$\vec{C}$               The current sample in the Markov chain
$\vec{P}$               The proposal sample in the Markov chain
$C_j$                   The jth dimension of $\vec{C}$
$P_j$                   The jth dimension of $\vec{P}$
E                       The number of effective samples
B                       The number of samples in the burn-in phase
S                       The step length for the uniform proposal distribution
$D_{object}$            The total number of monitored objects
$D_{zone}$              The total number of zones
Jitter                  A random number between 0 and 1
Rand(a, b)              Generate a random integer between a and b based on a uniform distribution
$Post(\vec{H} \mid Z)$  The posterior probability of the sample $\vec{H}$ given raw data Z

Table 3.3: Symbolic notations utilized in Algorithm 3.
In Algorithm 3, line 1 initializes the sample set and takes the RFID raw data as input. Line 2 loads the n-state detection model of the readers in order to compute the likelihood. Line 3 initializes all the resource descriptors and line 4 randomly chooses a qualified sample as the first state of the Markov chain. Lines 6 to 17 generate a random sample as the proposal sample dimension by dimension (object by object). Lines 8 to 13 correspond to the sampling process. First, we obtain a random integer based on the current value and the random proposal value. The valid range of the value on each dimension of the sample vector, represented as $h_i$, is [1, $D_{zone}$]. Therefore, if the proposal value $P_j$ falls outside this range, we reflect the value back into the range. Then, we check all the related
descriptors to make sure their updated values are no less than zero. If any value is less than zero, the current allocation would violate the corresponding constraints. Consequently, we go back to line 8 to re-sample until an allocation on that dimension is feasible, as shown in line 15. Note that our algorithm guarantees that each proposal sample is also a qualified sample. After a complete proposal sample is generated, lines 18 to 21 accept this proposal sample as the next state of the Markov chain with the probability given by the posterior ratio of the proposal sample over the current sample. Line 22 adds the next state into the sample set. Line 23 resets all the resource descriptors to make sure that the examination of descriptors for the next proposal sample is correct. Note that line 5 sets the upper bound of the sampling loop to E + B in order to guarantee that the final number of samples in the sample set is E + B; the first B samples are excluded as burn-in samples and consequently only the remaining E samples are taken into account to reconstruct the posterior.
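The reflection of an out-of-range proposal value back into [1, $D_{zone}$] can be sketched as follows; the helper name `reflect` and the exact bouncing rule are illustrative assumptions, since the text only states that overflowing values are reflected into the range.

```python
def reflect(value, low, high):
    """Reflect an out-of-range proposal value back into [low, high]."""
    while value < low or value > high:
        if value < low:
            value = 2 * low - value       # bounce off the lower bound
        else:
            value = 2 * high - value      # bounce off the upper bound
    return value

D_zone = 200
assert reflect(205, 1, D_zone) == 195     # overflow reflects off the top
assert reflect(-3, 1, D_zone) == 5        # underflow reflects off the bottom
assert reflect(42, 1, D_zone) == 42       # in-range values are unchanged
```

Reflection, unlike clamping, keeps the proposal distribution symmetric near the boundaries, which preserves the simple form of the MH acceptance ratio.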
3.5 Real-Time Cleansing on Data Streams
$$post(\vec{H} \mid C) = post\left(h_1, h_2, \ldots, h_n \,\middle|\, \begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nm} \end{bmatrix}\right) \propto p\left(\begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nm} \end{bmatrix} \,\middle|\, h_1, h_2, \ldots, h_n\right) \cdot p(h_1, h_2, \ldots, h_n) \tag{3.26}$$

$$post(\vec{H} \mid C) = \alpha \cdot \prod_i \big( p(c_{i1} \mid h_i) \cdot p(c_{i2} \mid h_i) \cdots p(c_{im} \mid h_i) \big) \cdot \prod_j p(h_j) \tag{3.27}$$

$$p(c_{ij} \mid h_i) = \frac{N!}{c_{ij}!\,(N - c_{ij})!} \cdot p(z_{ij} \mid h_i)^{c_{ij}} \cdot \big(1 - p(z_{ij} \mid h_i)\big)^{N - c_{ij}} \tag{3.28}$$
In many real-world applications, RFID raw data is incrementally becoming available as
data streams and the locations of objects may change over time [14,61,62]. In support of
real-time object monitoring where RFID data are generated quickly and automatically, we
propose a Streaming Bayesian Inference (SBI) approach. SBI combines Bayesian inference
with MH-C and employs a counted matrix in the sliding window to cleanse the RFID data
and predict the locations of tagged objects in a real-time fashion.
3.5.1 Streaming Bayesian Inference
First of all, we assume that there exists a sliding window during which the locations of all objects can be considered unchanged and within which the process of posterior reconstruction can be finished. In the previous section, we illustrated that MH-C is
a very efficient way to draw samples from the posteriors. As demonstrated in Figure 3.10, in
most cases, MH-C can generate the reconstruction of posteriors for all objects in no longer
than 20 seconds. Therefore, it is reasonable to set the size of the sliding window according
to the running time of MH-C. On the other hand, our sliding window based approach can
meet the common requirement in data stream processing algorithms which dictates that the
time complexity must be linear.
Once the size of the sliding window is determined, the posterior distributions of loca-
tions of all objects within a certain sliding window can be computed based on Bayesian
inference, as shown in Equation 3.26. While the non-streaming Bayesian inference approach
directly utilizes the RFID raw data matrix as the input (as illustrated in Equation 3.1), the
streaming Bayesian inference method takes the RFID counted matrix as the input as shown
in Equation 3.26. In Equation 3.26, C is the counted matrix, which is generated by counting the number of successful readings (i.e., '1's) for each combination of objects and readers over all the raw data matrices during a sliding window, i.e., $c_{ij} = \sum_t z_{ijt}$, where $z_{ijt}$ denotes the reported value of $z_{ij}$ at the time point t during the sliding window. Furthermore, if we assume that each reader detects different objects independently and the prior distribution of each object does not depend on that of the others, we obtain Equation 3.27, where α is a normalizing constant. In other words, given a counted matrix with respect to a certain
sliding window, we can compute the posterior of each sample $\vec{H}$ within this sliding window, $Post(\vec{H} \mid C)$, by using Equation 3.27. Furthermore, noticing that the read operations of each reader at different time points are independent of each other, we can conclude that $c_{ij}$ in fact follows a binomial distribution, i.e., $c_{ij} \mid h_i \sim B(N, p(z_{ij} \mid h_i))$, where N is the total number of read operations within the sliding window. Therefore, $p(c_{ij} \mid h_i)$ can be computed according to the binomial distribution as shown in Equation 3.28. Notice that $p(z_{ij} \mid h_i)$ in Equation 3.28 can be estimated efficiently based on our n-state detection model.
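As a sketch, the counted matrix and the binomial likelihood of Equation 3.28 might be computed as follows (the tiny 2x2 window is a made-up example; the read rate stands in for $p(z_{ij} \mid h_i)$ from the detection model):

```python
import math

def counted_matrix(raw_matrices):
    """Sum 0/1 raw reading matrices over a window: c_ij = sum_t z_ijt."""
    n, m = len(raw_matrices[0]), len(raw_matrices[0][0])
    C = [[0] * m for _ in range(n)]
    for Z in raw_matrices:
        for i in range(n):
            for j in range(m):
                C[i][j] += Z[i][j]
    return C

def binomial_likelihood(c_ij, N, p_read):
    """p(c_ij | h_i) under B(N, p(z_ij | h_i)), as in Equation 3.28."""
    return math.comb(N, c_ij) * p_read ** c_ij * (1.0 - p_read) ** (N - c_ij)

# A window of N = 3 raw matrices for 2 objects x 2 readers.
window = [[[1, 0], [0, 1]],
          [[1, 0], [0, 0]],
          [[0, 0], [0, 1]]]
C = counted_matrix(window)                         # [[2, 0], [0, 2]]
lik = binomial_likelihood(C[0][0], N=3, p_read=0.95)
```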
3.5.2 Algorithm Description
The complete algorithm of our SBI method is formalized in Algorithm 4. In Algorithm 4, line 1 generates the counted matrix, C, by counting the number of successful readings, i.e., $c_{ij} = \sum_t z_{ijt}$. Subsequently, all samples are generated using MH-C. To be specific, line 2 randomly chooses a qualified sample within the support of $Post(\vec{H} \mid C)$ as the first state of the Markov chain. Lines 3 to 4 correspond to the sampling process. Line 4 draws a proposal sample, $\vec{H}^\ast_i$, according to MH-C. In line 5, $Post(\vec{H}^\ast_i \mid C)$ is computed using Equation 3.27 and Equation 3.28 according to the binomial distribution $B(N, p(z_{ij} \mid h_i))$ and the n-state detection model. Line 6 calculates the posterior ratio of the proposal sample over the current state, and then accepts $\vec{H}^\ast_i$ as the next state with the acceptance ratio $Post(\vec{H}^\ast_i \mid C)\,/\,Post(\vec{H}_{i-1} \mid C)$.
3.6 Experimental Validation
In this section, we applied our method to a warehouse application, i.e., each object
corresponds to a case and each zone corresponds to a rack. To capture the likelihood,
without loss of generality, we assumed that the 3-state detection model was adopted in this
application. We implemented MCMC (we use the terms MH-C and MCMC interchangeably
in this section) and our Streaming Bayesian Inference (SBI) approach to evaluate their
performance. For comparison with our MCMC-based solution, we extended the SIS-based
approach in [67] to incorporate duplicate readings (i.e., their work does not consider duplicate
readings and only focuses on the distribution of the missing cases). For the experiments on SBI(N), where N denotes the number of arrivals of raw data at different time points during each sliding window, we varied N from 1 (SBI (1)) and 3 (SBI (3)) to 5 (SBI (5)). In particular, when N equals 1, SBI(N) degrades to our MCMC-based solution (the non-streaming method). Therefore, we employed SBI (1) as a baseline to show the advantage of our SBI-based approaches.
3.6.1 Data Representation
In this dissertation, we store the true distributions of objects and the RFID raw readings at each time point in matrices. For simplicity of presentation, suppose we have six tagged objects in the target area. A possible true distribution matrix is shown in Table 3.4(a), where the rows represent cases and the columns represent racks. To be specific, if a case is on a rack, the corresponding position is 1, and 0 otherwise. Therefore, according to Table 3.4(a), the first case is on the first rack, the second case is on the fourth rack, and so on. On the other hand, a possible RFID raw reading matrix is shown in Table 3.4(b) in the same format. By comparing the two matrices, we can easily see that there are notable differences (existence of noise and duplicate readings) between them. Specifically, duplicate readings of the fourth and fifth cases (consecutive 1's) occur in the raw reading matrix.
1 0 0 0 0        1 0 0 0 0
0 0 0 1 0        0 0 0 1 0
0 0 1 0 0        0 1 0 0 0
0 1 0 0 0        1 1 0 0 0
0 0 0 0 1        0 0 0 1 1
1 0 0 0 0        0 1 0 0 0

(a)              (b)
Table 3.4: The matrix of: (a) true distribution and (b) raw readings.
Figure 3.9: The simulator structure.
3.6.2 Simulator Implementation
Our simulator consists of seven components, as displayed in Figure 3.9. The true matrix generator randomly produces distribution matrices as the true distributions, while the noisy matrix generator provides the noisy matrices as the RFID raw data in the same format. The MCMC and SIS modules then reconstruct the distribution for each case using the noisy matrix as input. Our simulator generates the synthetic RFID raw data with duplicate readings according to the physical characteristics of RFID readers [31]. The 3-state detection model was used to capture the likelihood. In addition, as the prior distribution we assume that each case is on each rack with the same probability. The parameters used in our simulations are shown in Table 3.5. All our experiments were conducted on a Linux machine
with an Intel Pentium 4 2.4GHz processor and 2GB of memory.
In order to investigate the impact of redundant readings on reconstruction accuracy, we
define the data redundancy degree as follows:
Definition The data redundancy degree is the probability that a reader successfully detects an object not in the zone associated with that particular reader.
The data redundancy degree can reflect the probability that redundant readings occur in
RFID raw data, i.e., redundant readings (false positives) are actually the successful detection
of objects which are not in the zone that a certain reader belongs to. Taking the 3-state
model as an example, we can utilize the read rate in the minor detection region to define
the data redundancy degree. A larger data redundancy degree indicates a higher probability
that a reader can detect objects in the neighboring zones (or racks).
We employed K-L divergence, the top-1 success rate, and the top-2 success rate to
evaluate the reconstruction accuracy. K-L divergence is a metric commonly used to evaluate
the difference between two distributions. The definition of the top-k success rate is as follows:
Definition The top-k success rate is the percentage of cases whose true locations match the top k predicted locations of the reconstructed distribution, over the total number of cases.
Our simulator calculated the K-L divergence from the reconstructed distribution to the true distribution; a smaller K-L divergence value indicates a higher accuracy of the reconstructed distribution. Specifically, the recovered matrices (reconstructed distributions) were inverted by the matrix inversion module to facilitate the computation of K-L
divergence. The K-L divergence module was used to compare the reconstructed distributions
with the true distributions and compute the corresponding K-L divergence values for MCMC
and SIS. On the other hand, the top-k analysis module was responsible for calculating the
top-k success rate.
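The two accuracy metrics can be sketched as follows (the helper names and the toy two-case example are illustrative, not the simulator's actual modules):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def top_k_success_rate(true_locations, reconstructed, k):
    """Fraction of cases whose true rack is among the k most probable racks."""
    hits = 0
    for true_rack, dist in zip(true_locations, reconstructed):
        top_k = sorted(range(len(dist)), key=lambda r: dist[r], reverse=True)[:k]
        if true_rack in top_k:
            hits += 1
    return hits / len(true_locations)

# Two cases over three racks; case 0 is truly on rack 0, case 1 on rack 2.
recon = [[0.7, 0.2, 0.1],      # rack 0 ranked first: a top-1 hit
         [0.1, 0.5, 0.4]]      # rack 2 ranked second: a top-2 hit only
rate1 = top_k_success_rate([0, 2], recon, k=1)   # 0.5
rate2 = top_k_success_rate([0, 2], recon, k=2)   # 1.0
```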
For each experimental result in this section, we randomly generated the true distribution
matrix and the corresponding RFID raw reading matrix 100 times. During each run of
distribution reconstruction, we recorded the sampling time, and computed the average K-L
divergence, the top-1 success rate and the top-2 success rate of all the 5000 cases (i.e., each
result is obtained by averaging over 5000 tagged objects in the whole target area).
3.6.3 The Performance Analysis of MH-C
In this section we focus on the performance of MCMC and SIS with respect to recon-
struction efficiency and accuracy.
Parameter          Value
$D_{object}$       5000
$D_{zone}$         200
B                  50
S                  30
$Vol_{object}$     1
$Capacity_{zone}$  50

Table 3.5: The parameters for our simulations.
The Reconstruction Efficiency
Here we investigated the performance of MCMC and SIS in terms of the average sampling time. Compared to SIS, the average sampling time of MCMC is remarkably reduced across different numbers of qualified samples, as illustrated in Figure 3.10. For example, with 5000 qualified samples the sampling time of MCMC is 11.58 seconds while the sampling time of SIS is 230.78 seconds. This is because MCMC takes advantage of the current qualified sample to generate the next qualified sample (i.e., it preserves the correlation between samples). Consequently, MCMC takes less time than SIS to produce the same number of qualified samples.
[Figure: sampling time (sec) plotted against the number of qualified samples for MCMC and SIS.]
Figure 3.10: MCMC versus SIS on sampling time.
The Reconstruction Accuracy
In this experiment, we varied the number of qualified samples, data redundancy degree,
and the number of managed racks per reader to investigate their effects on the reconstruction
accuracy.
(a) Average K-L divergence of 5000 cases. (b) Top-1 success rate of 5000 cases. (c) Top-2 success rate of 5000 cases.
Figure 3.11: The impact of the number of qualified samples on accuracy for MCMC and SIS.
(a) Average K-L divergence of 5000 cases. (b) Top-1 success rate of 5000 cases. (c) Top-2 success rate of 5000 cases.
Figure 3.12: The impact of RFID data redundancy degree on accuracy for MCMC and SIS.
The Impact of the Number of Qualified Samples
We first increased the number of qualified samples from 500 to 9000 to investigate the
performance of MCMC and SIS on reconstruction accuracy. Here we assumed that the read
rate in the major detection region is 95% and the number of racks managed by a reader is
1. As demonstrated in Figure 3.11(a), with the increase of the number of qualified samples,
the K-L divergence values of both approaches kept decreasing. However, MCMC always
(a) Average K-L divergence of 5000 cases. (b) Top-1 success rate of 5000 cases. (c) Top-2 success rate of 5000 cases.
Figure 3.13: The impact of the number of managed racks per reader on accuracy for MCMC
and SIS.
outperformed SIS across all tested sample numbers. In particular, when we drew 500 qualified samples, the K-L divergence of MCMC was 0.86 while the K-L divergence of SIS was 3.78. When we drew 9000 qualified samples, the K-L divergence of MCMC dropped to a remarkable 0.64, compared with 2.77 for the SIS solution. Also, as far as the top-1 success rate is concerned, with the increase of the number of qualified samples, the top-1 success rate of MCMC increased from 0.50 to 0.70 while that of SIS increased from 0.36 to 0.55, as shown in Figure 3.11(b). In addition, the top-2 success rate of MCMC rose from 0.70 to 0.89 while that of SIS rose from 0.60 to 0.76, as demonstrated in Figure 3.11(c).
The Impact of the Data Redundancy Degree
Next, we studied the impact of the data redundancy degree on the performance of MCMC and SIS in terms of reconstruction accuracy. We varied the data redundancy degree from 0.325 to 0.475, corresponding to read rates in the major detection region from 65% (the least reliable reader) to 95% (the most reliable reader). For each experiment, we drew 5000 qualified samples and assumed that the number of racks managed by a reader is 1. Figure 3.12 illustrates the results. As the data redundancy degree grows, the reconstruction accuracy of both MCMC and SIS improves. Specifically, as demonstrated in Figure 3.12(a), MCMC always maintained a lower K-L
divergence value than SIS, reflecting a more precise prediction. Furthermore, as shown in Figure 3.12(b), the top-1 success rate of MCMC increased from 0.54 to 0.65 with the increase of the data redundancy degree, while the top-1 success rate of SIS increased from 0.42 to 0.51. Figure 3.12(c) demonstrates how the top-2 success rate increased for both MCMC and SIS as we raised the data redundancy degree.
The Impact of the Number of Managed Racks per Reader
We evaluated the performance of MCMC and SIS by varying the number of managed racks per reader. In order to deploy readers in a warehouse more efficiently, users may want to assign multiple racks to a single reader. Taking into account the fact that the overall detection region of a regular RFID reader rarely exceeds 20 feet [31], we varied the number of racks managed by a reader from 1 to 6. As Figure 3.13(a) demonstrates, when each rack had its own reader, the K-L divergence values of MCMC and SIS were 0.68 and 3.11, respectively. When a reader monitored more racks, the K-L divergence values of both MCMC and SIS deteriorated. When a reader was responsible for detecting cases on six racks, the K-L divergence values rose to 1.66 for MCMC and 4.01 for SIS. Moreover, as demonstrated in Figure 3.13(b), as the number of managed racks per reader grew, the top-1 success rate of MCMC decreased from 0.65 to 0.55, while the top-1 success rate of SIS dropped from 0.51 to 0.41. Also, Figure 3.13(c) depicts how the top-2 success rates of MCMC and SIS decreased correspondingly.
3.6.4 Visualization of Query Results
The Location Query
The results of the location queries returned by MCMC and SIS for the first six cases
are demonstrated in Figure 3.14. For the first case, the correct location is the first rack.
MCMC predicted a probability of 0.47 on that rack while SIS predicted a probability of only
0.01. Similarly, for the second case, the true location is the fourth rack. MCMC predicted
a probability of 0.70 on that rack while SIS predicted a probability of 0.98. For the third
case, the accurate location is the third rack. MCMC predicted a probability of 0.53 on that
rack while SIS predicted a probability of 0.98. For the fourth case, the exact location is
the second rack. MCMC predicted a probability of 0.51 on that rack while SIS predicted a
probability of 0.32. For the fifth case, the precise location is the fifth rack. MCMC predicted
a probability of 0.53 on that rack while SIS predicted a probability of 0.99. Finally, for the sixth case, the correct location is the first rack. MCMC predicted a probability of 0.20 on that rack while SIS predicted a probability of only 0.05. In summary, MCMC tends to generate a much smoother probability distribution than SIS. Consequently, compared with SIS, MCMC provides a superior overall prediction of the distribution of all the objects monitored by an RFID system.
(a) First case (true location: rack 1). (b) Second case (true location: rack 4). (c) Third case (true location: rack 3). (d) Fourth case (true location: rack 2). (e) Fifth case (true location: rack 5). (f) Sixth case (true location: rack 1).
Figure 3.14: The results of the location queries answered by MCMC and SIS.
The Remaining Capacity Query
We used the available length on a rack as the acquirable volume of that rack to demonstrate the remaining capacity query. The results of the remaining capacity queries answered by MCMC for the first five racks are shown in Figure 3.15. Note that the remaining capacity of each rack is 3 because each rack accommodates exactly 4 cases according to the true distribution matrix used for this experiment (the first six rows of the matrix are as shown in Table 3.4(a)). As illustrated in Figure 3.15, the remaining volumes on racks 1 and 4 were reported correctly. In contrast, the remaining volumes on racks 2 and 3 were underestimated and the remaining volume on rack 5 was overestimated.
(a) Racks 1, 2 and 3 (true value: 3). (b) Racks 4 and 5 (true value: 3).
Figure 3.15: The results of the remaining capacity queries answered by MCMC.
3.6.5 The Performance Analysis of SBI
In this section we evaluated the performance of SBI (1), SBI (3) and SBI (5) with respect
to reconstruction efficiency and accuracy.
The Reconstruction Efficiency
Here we investigated the performance of SBI (1), SBI (3) and SBI (5) in terms of the average sampling time. As N increases, the average sampling time of SBI (N) grows almost linearly for the same number of qualified samples, as illustrated in Figure 3.16. For
70
s48 s49s48s48s48 s50s48s48s48 s51s48s48s48 s52s48s48s48 s53s48s48s48 s54s48s48s48 s55s48s48s48 s56s48s48s48 s57s48s48s48s49s48s48s48s48
s48
s50s48
s52s48
s54s48
s56s48
s49s48s48
s49s50s48
s49s52s48
s49s54s48
s49s56s48
s32s83s66s73s32s40s49s41
s32s83s66s73s32s40s51s41
s32s83s66s73s32s40s53s41
s78s117s109s98s101s114s32s111s102s32s81s117s97s108s105s102s105s101s100s32s83s97s109s112s108s101s115
s83
s97
s109
s112
s108s105s110
s103
s32s84
s105s109
s101
s32s40
s115
s101
s99
s41
s32
Figure 3.16: Performance comparison of SBI (1), SBI (3) and SBI (5) on sampling time.
example, with 5000 qualified samples the average sampling time of SBI (1) is 11.58 seconds, while the average sampling times of SBI (3) and SBI (5) are 45.10 seconds and 67.25 seconds, respectively. This is because a larger N requires more calculations to estimate the likelihood of each sample based on the binomial distribution B(N, p(z_ij | h_i)), and thus more time is spent computing the posterior of each sample.
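The per-entry likelihood term above is the standard binomial probability mass function. A minimal sketch of its evaluation follows; the function name is ours for illustration, not from the dissertation:

```python
from math import comb

def binomial_likelihood(k, n, p):
    """Probability of observing k successful readings out of n read
    attempts, when each attempt succeeds independently with read rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)
```

Evaluating this term for every candidate sample is what makes the sampling time grow with N.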
The Reconstruction Accuracy
In this experiment, we focus on the performance of SBI (1), SBI (3) and SBI (5) on
reconstruction accuracy.
The Impact of the Number of Qualified Samples
We first studied the relationship between the number of qualified samples and the reconstruction accuracy under SBI (1), SBI (3) and SBI (5). Here we assumed that the read rate in the major detection region is 95% and that the number of racks managed by each reader is 1. As shown in Figure 3.17(a), the K-L divergence of SBI (N) keeps decreasing as N grows. In particular, when we drew 5000 qualified samples, the K-L divergence of SBI (1) is 0.68, whereas the corresponding K-L divergences of SBI (3) and SBI (5) are 0.39 and 0.19, respectively. Furthermore, a larger N also elevates the top-1 success rate of SBI (N), as illustrated in Figure 3.17(b). For instance, with 5000 qualified samples the top-1 success rates of SBI (3) and SBI (5) are 0.80 and 0.86, respectively, while the top-1 success rate of SBI (1) is 0.65. In addition, as Figure 3.17(c) demonstrates, the top-2 success rate can also be improved by choosing a larger N. For example, with 5000 qualified samples the top-2 success rates of SBI (1), SBI (3) and SBI (5) are 0.85, 0.92 and 0.94, respectively. In summary, SBI (3) and SBI (5) always outperformed SBI (1) in terms of accuracy because they estimate the posterior of each sample more precisely by taking into consideration a larger number of RFID raw readings of each object at a particular location. Therefore, more precise localization can be achieved by inferring jointly over the readings generated at different time points.
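The K-L divergence used as the accuracy metric here compares the reconstructed location distribution against the true one. A minimal sketch (the function name is ours):

```python
from math import log

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same support.
    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A value of 0 means the reconstruction matches the true distribution exactly, so the smaller values reported for SBI (5) indicate a closer reconstruction.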
The Impact of the Data Redundancy Degree
[Figure 3.17 plots, against the number of qualified samples (0-10000), the K-L divergence, the top-1 success rate and the top-2 success rate of SBI (1), SBI (3) and SBI (5). Panels: (a) average K-L divergence of 5000 cases; (b) top-1 success rate of 5000 cases; (c) top-2 success rate of 5000 cases.]
Figure 3.17: The impact of the number of qualified samples on accuracy for SBI (1), SBI (3) and SBI (5).
Next we investigated the performance of SBI (1), SBI (3) and SBI (5) on reconstruction accuracy by varying the data redundancy degree. For each experiment, we drew 5000 qualified samples and assumed that the number of racks managed by a reader is 1. As demonstrated in Figure 3.18(a), as N grows the K-L divergence of SBI (N) keeps decreasing, reflecting a more precise prediction. In particular, when the data redundancy degree is 0.475, the K-L divergences of SBI (3) and SBI (5) are 0.38 and 0.17, respectively, while the K-L divergence of SBI (1) is 0.68. Moreover, as shown in Figure 3.18(b), when N
[Figure 3.18 plots, against the redundancy degree (0.300-0.500), the K-L divergence, the top-1 success rate and the top-2 success rate of SBI (1), SBI (3) and SBI (5). Panels: (a) average K-L divergence of 5000 cases; (b) top-1 success rate of 5000 cases; (c) top-2 success rate of 5000 cases.]
Figure 3.18: The impact of RFID data redundancy degree on accuracy for SBI (1), SBI (3) and SBI (5).
increases from 1 to 3 to 5, the top-1 success rate is elevated accordingly. For example, with a data redundancy degree of 0.475, the top-1 success rate of SBI (1) is 0.65, while the corresponding top-1 success rates of SBI (3) and SBI (5) jump to 0.79 and 0.84, respectively. Likewise, Figure 3.18(c) demonstrates that the top-2 success rate can be raised by choosing a larger N.
Algorithm 3 Metropolis-Hastings Sampler with Constraints
1: Set S = ∅ and take raw data Z
2: Load the n-state detection model
3: Initialize all the resource descriptors to their maximal capacity
4: Initialize \vec{C} by randomly choosing a qualified sample within the support of Post(\vec{H}|Z) as the starting point
5: for Cycle = 2 to E + B do
6:   for j = 1 to D_object do
7:     repeat
8:       P_j = C_j + Rand(-S, S) {generate a new integer from the current value plus a proposal offset within the step length}
9:       if P_j < 1 then
10:        P_j = 1 + (1 - P_j) {overflow and reflection}
11:      end if
12:      if P_j > D_zone then
13:        P_j = D_zone - (P_j - D_zone) {overflow and reflection}
14:      end if
15:    until the value of every resource descriptor related to the referred zone is no less than zero after the proposed allocation of the current object is committed
16:    j ← j + 1
17:  end for
18:  Generate a random number Jitter between 0 and 1
19:  if Jitter ≤ min(1, Post(\vec{P}|Z) / Post(\vec{C}|Z)) then
20:    \vec{C} = \vec{P} {Metropolis-Hastings}
21:  end if
22:  Add \vec{C} into S as the next sample
23:  Reset all the resource descriptors
24:  Cycle ← Cycle + 1
25: end for
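The core loop of the sampler above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the posterior is abstracted as a callable, names are ours, and we assume zone capacities are ample enough that a feasible proposal is always reachable:

```python
import random

def mh_sampler_with_constraints(posterior, n_objects, n_zones, capacity,
                                step=1, n_cycles=1000, seed=0):
    """Metropolis-Hastings with a reflecting random-walk proposal over integer
    zone assignments, rejecting per-object proposals that exceed capacity.

    posterior(state): unnormalized posterior of a zone assignment (> 0).
    capacity: list indexed by zone id 1..n_zones (index 0 unused).
    """
    rng = random.Random(seed)
    # round-robin start; assumed feasible for the given capacities
    current = [1 + (i % n_zones) for i in range(n_objects)]
    samples = []
    for _ in range(n_cycles):
        proposal, counts = [], [0] * (n_zones + 1)
        for j in range(n_objects):
            while True:
                p = current[j] + rng.randint(-step, step)
                if p < 1:                      # overflow and reflection (low)
                    p = 1 + (1 - p)
                if p > n_zones:                # overflow and reflection (high)
                    p = n_zones - (p - n_zones)
                if counts[p] < capacity[p]:    # capacity constraint check
                    break
            counts[p] += 1
            proposal.append(p)
        # Metropolis-Hastings acceptance test
        if rng.random() <= min(1.0, posterior(proposal) / posterior(current)):
            current = proposal
        samples.append(list(current))
    return samples
```

Counting zone frequencies over the returned samples approximates the posterior location distribution of each object.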
Algorithm 4 Streaming Bayesian Inference (SBI) Method
Require: Raw data matrices Z_t1, Z_t2, ..., Z_tN at time points t1, t2, ..., tN within a sliding window W, and the n-state detection model
1: Compute the counted matrix C by counting the number of successful readings in Z_t1, Z_t2, ..., Z_tN within W
2: Randomly choose a qualified sample \vec{H}_0 within the support of Post(\vec{H}|C) as the first state of the Markov chain
3: for each cycle i do
4:   Draw a proposal sample \vec{H}'_i
5:   Compute Post(\vec{H}'_i|C) according to Equation 3.27 and Equation 3.28
6:   Accept \vec{H}'_i as the next state of the Markov chain, \vec{H}_i, with probability min(1, Post(\vec{H}'_i|C) / Post(\vec{H}_{i-1}|C)) {Metropolis-Hastings sampler with constraints}
7: end for
Ensure: All the accepted states of the resulting Markov chain, \vec{H}_0, \vec{H}_1, \vec{H}_2, ..., \vec{H}_i, ..., \vec{H}_n {by counting over the samples, we reconstruct the posteriors of the locations of all objects w.r.t. the sliding window W}
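Step 1 of Algorithm 4 is simply an element-wise sum over the window's binary raw reading matrices. A minimal sketch (the function name is ours):

```python
def counted_matrix(raw_matrices):
    """Given N binary raw reading matrices Z_t1..Z_tN from a sliding window,
    return C with C[i][j] = number of time points at which reader i
    successfully read object j."""
    rows, cols = len(raw_matrices[0]), len(raw_matrices[0][0])
    C = [[0] * cols for _ in range(rows)]
    for Z in raw_matrices:
        for i in range(rows):
            for j in range(cols):
                C[i][j] += Z[i][j]
    return C
```

Each entry C[i][j], together with N, is what feeds the binomial likelihood B(N, p(z_ij | h_i)) when Post(\vec{H}|C) is evaluated.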
Chapter 4
Related Work
4.1 Spatial Databases
4.1.1 Nearest Neighbor Query
The nearest neighbor query is a very important query type for supporting GIS appli-
cations. With the R-tree family [7,24,51] of spatial indices, depth first search (DFS) [47]
and best first search (BFS) [25] have been the prevalent branch-and-bound techniques for
processing nearest neighbor queries. The DFS method recursively expands the intermediate
nodes for searching NN candidates. At each newly visited index node, DFS computes the or-
dering metrics for all its child nodes and applies pruning strategies to remove non-promising
branches. When the search reaches a leaf node, the data objects are retrieved and the NN
candidates are updated. On the other hand, the BFS method employs a priority queue to
store nodes to be explored through the search process. The nodes in the queue are sorted
according to their minimum distance (MINDIST) to the query point. During the search
process, BFS repeatedly dequeues the top entry in the queue and enqueues its child nodes
with their MINDIST into the queue. When a data entry is dequeued, it is included in the
result set.
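The best-first traversal described above can be sketched with a priority queue keyed on MINDIST. The tree encoding below is a simplified stand-in for an R-tree (names and structure are ours, for illustration only); the key property shown is that the first data point dequeued is the nearest neighbor:

```python
import heapq
from itertools import count

def mindist(q, rect):
    """Squared MINDIST from point q to an axis-aligned rectangle
    ((xmin, ymin), (xmax, ymax)); 0 if q lies inside the rectangle."""
    (xmin, ymin), (xmax, ymax) = rect
    dx = max(xmin - q[0], 0, q[0] - xmax)
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return dx * dx + dy * dy

def best_first_nn(q, root):
    """Best-first NN search. Entries are ('node', rect, children) or
    ('point', p). Because MINDIST lower-bounds the distance to any point
    inside a node, the first 'point' popped off the heap is the NN."""
    tie = count()  # tie-breaker so the heap never compares entries
    heap = [(0.0, next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)
        if entry[0] == 'point':
            return entry[1]
        for child in entry[2]:
            if child[0] == 'point':
                p = child[1]
                d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d, next(tie), child))
            else:
                heapq.heappush(heap, (mindist(q, child[1]), next(tie), child))
    return None
```

Dequeuing repeatedly instead of returning at the first point yields an incremental ranking of neighbors, which is the distance-browsing behavior of BFS.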
Recently nearest neighbor search solutions have been extended to support queries on
spatial networks. Jensen et al. [32] proposed data models and graph representations for
NN queries in road networks and designed corresponding solutions. Papadias et al. [43]
presented solutions for NN queries in spatial network databases by progressively expanding
road segments around a query point. A network Voronoi diagram based solution for NN
search in road network databases was proposed in [37]. Sharifzadeh et al. [54] extended
the Voronoi diagram based approach for spatial data streams by using approximate Voronoi
cell computation. To speed up NN search, Samet et al. [50] proposed a solution that explores the entire spatial network by pre-computing the shortest paths between all the vertices in the network and using a shortest path quadtree to capture spatial coherence. With their approach, the shortest paths between vertices need to be computed only once and can then be reused to answer different NN queries on a given spatial network. However, such pre-computation based approaches suffer from high overhead and adapt poorly to network updates. To overcome this pitfall, Lee et al. [39] presented ROAD, an efficient and flexible query framework based on search-space pruning that uses shortcuts to accelerate network traversals and object abstracts to guide them.
4.1.2 Route Planning Query
In many GIS applications (e.g., logistics and supply chain management), users have to
plan a trip to a number of locations with several sequence rules and the goal is to find the
optimal route that minimizes the total traveling distance. One related query type is named
the optimal sequenced route (OSR) query proposed by Sharifzadeh et al. [52]. OSR query
retrieves a route of minimum length starting from a given source location and passing through
a number of locations (with different types) in a particular order (sequence) imposed on all
the POI types. In [53], a pre-computation approach was provided to answer OSR queries by taking advantage of a family of AW-Voronoi diagrams for different POI types. However, because the searches in [53] are based on NN queries, the performance of the method needs to be evaluated in terms of traveling distance in addition to response time. A multi-type nearest neighbor (MTNN) query solution was proposed in [41] by Ma et al. Given a query point and a collection of locations (with different types), an MTNN query finds the shortest path for the query point such that exactly one instance of each type is visited during the trip. MTNN can be treated as an extended solution of OSR that exploits a page-level upper bound. In contrast, Li et al. [40] designed solutions for another query type,
Trip Planning Queries (TPQ). With TPQ, the user specifies a set of POI types and asks for
the optimal route from her starting location to a specified destination which passes through
exactly one POI of each POI type. Notice that, compared to an OSR query, no order is imposed on the types of POIs to be visited in a TPQ query. Terrovitis et al. [58] studied the α-autonomy shortest path and k-stops shortest path problems for spatial databases. Given a source point and a destination point, the first query retrieves a sequence of points from the database where the distance between any two consecutive points on the path is not greater than α. The second query searches for the optimal path from an origin to an end which passes
through exactly k intermediate points in the database. Tian et al. [59] proposed skyline
path queries in road networks based on multiple route search criteria (i.e., shortest traveling
distance and shortest traveling time). By taking into account the probability that each
POI type satisfies the user's particular need, an interactive approach was proposed in [34]. In order to capture the characteristics of ever-changing road networks, Tian et al. [60] proposed the continuous min-cost path query and presented a system, PathMon, to monitor min-cost routes in dynamic road networks. However, none of the aforementioned solutions can support MRPSR queries.
Another problem related to MRPSR is the sequential ordering problem (SOP) [19], stated as follows: given a graph G with n vertices and directed weighted edges, find a minimal-cost Hamiltonian path from the start vertex to the terminal vertex that also observes precedence constraints. Nevertheless, a Hamiltonian path is not required in MRPSR, and the types of the visited locations are considered by our solution.
4.2 RFID Databases
Many systems have been developed to manage data with uncertainty. The usual ap-
proach to address uncertainty is to augment the classical relational model with attribute-level
or tuple-level probability values [3,5,6,13,15]. On the other hand, a more general approach based on sampling has been proposed for managing incomplete and uncertain data [11,28,67]. The
idea is simple and intuitive: we construct random samples while observing the prior statisti-
cal knowledge and constraints about the data. Thus, each sample is one possible realization
in the space of uncertainty, and the entire set of samples reveals the distribution of the
uncertain data we want to model. Queries and inferences are then conducted against this
distribution.
An important application that drives the recent surge of interest in managing incomplete
and uncertain data is RFID data management. Chawathe et al. [9] proposed a system
architecture of a distributed RFID system and discussed related data management challenges
such as inferences and online warehousing. An expressive temporal data model for RFID
data is defined by Wang and Liu [65] to support tracking and monitoring queries. Gonzalez
et al. [23] designed a warehousing model which provides RFID data compression and path-
dependent aggregates. Based on the proposed warehousing framework, they also developed
techniques for summarizing and indexing RFID data and various query methods. Because
RFID raw data usually contains anomalies [20], solutions have been proposed to clean the
input data from readers. Jeffery et al. [29] presented a framework for building data cleaning
infrastructure to support pervasive applications. An adaptive smoothing filter (SMURF) for
RFID data cleaning was proposed in [31]. SMURF focuses on a sliding-window aggregate
that interpolates for lost readings. Cocci et al. [14] provided a data interpretation and
compression scheme over RFID streams. A probabilistic model [61] was provided to capture
the dynamics of readers and objects, as well as noisy readings, in RFID stream processing. Tran
et al. [63] presented PODS, a system to support stream processing for captured uncertain
data using continuous random variables based on Gaussian Mixture Models (GMM). An
approximate framework to evaluate complex queries involving conditioning and aggregation
operations on uncertain data streams was proposed in [62]. Our work, however, focuses on
how to leverage spatio-temporal data redundancy to elevate the localization accuracy for all
the tagged objects in a target area for real-time RFID tracking and monitoring applications.
The works in [35,36,67] are the most relevant research to this dissertation. For correcting
erroneous RFID raw data, Khoussainova et al. [35,36] presented a system that corrects input data errors automatically using application-defined global integrity constraints. The system corrects the input data by inserting missing tuples when necessary and assigning to each tuple in a group of conflicting tuples the probability that it is correct. This maximum-entropy based solution is practical; however, compared with sampling-based approaches, it is unable to capture all application-related prior knowledge and dependencies. Useful information can be recovered from noisy RFID data by exploiting constraints and sequential importance sampling methods [67]. Nevertheless, the work in [67] does not consider the duplicate readings caused by the overlapped detection regions of RFID readers.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
Geographic information systems are getting increasingly sophisticated and route queries
with traveling rules represent a significant class of spatial queries. Existing solutions only
focus on trips with a complete POI category sequence or without any sequence. However, GIS
users usually want to set a number of traveling preferences when they plan their trips. In this
dissertation we propose the MRPSR query and design three fast approximation algorithms
to efficiently compute routes which can fulfill all the traveling rules with a near-optimal
travel distance based on the underlying road networks. With extensive simulations, we show
that our techniques can generate satisfying trips that are very close to the shortest routes with remarkably short response times.
The data reported by RFID devices are known to be unreliable. In this dissertation we
propose a Bayesian inference based framework for cleansing RFID raw data which can take
full advantage of the spatio-temporal redundancy in the raw data. Our approach takes into
account prior knowledge (prior distribution and the likelihood) to evaluate location queries
for all tagged objects in a target area. Specifically, we propose the n-state model to capture the likelihood and validate that the 3-state model maximizes system performance.
Moreover, we devise MH-C to efficiently sample from the posterior distribution under envi-
ronmental constraints. To deal with the RFID raw data available as data streams, we present
our streaming based solution, the streaming Bayesian inference method, in support of real-
time tracking and monitoring. Finally, we demonstrate the advantages of our approach in
terms of efficiency and accuracy by extensive simulations.
5.2 Future Work
For future work, we intend to extend our proposed algorithms in support of MRPSR
queries on dynamic road networks in which traffic information (e.g., travel time, traffic
congestion, etc.) is incrementally becoming available as a data stream. As for our study
on RFID databases, we plan to extend our n-state model to cover 2D RFID reader array
deployment and identify the obstacles when integrating a 2D n-state model with our Bayesian
inference based framework.
Bibliography
[1] Digital Chart of the World Server.
[2] U.S. Geological Survey.
[3] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006.
[4] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 50(1-2):5–43, 2003.
[5] P. Andritsos, A. Fuxman, and R. J. Miller. Clean Answers over Dirty Databases: A Probabilistic Approach. In ICDE, page 30, 2006.
[6] L. Antova, C. Koch, and D. Olteanu. Query Language Support for Incomplete Information in the MayBMS System. In VLDB, pages 1422–1425, 2007.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331, 1990.
[8] C. Beeri, Y. Kanza, E. Safra, and Y. Sagiv. Object Fusion in Geographic Information Systems. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 816–827, 2004.
[9] S. S. Chawathe, V. Krishnamurthy, S. Ramachandran, and S. E. Sarma. Managing RFID Data. In VLDB, pages 1189–1195, 2004.
[10] H. Chen, W.-S. Ku, M.-T. Sun, and R. Zimmermann. The Multi-Rule Partial Sequenced Route Query. In GIS, page 10, 2008.
[11] H. Chen, W.-S. Ku, and H. Wang. Cleansing Uncertain Databases Leveraging Aggregate Constraints. In ICDE Workshops, pages 128–135, 2010.
[12] H. Chen, W.-S. Ku, H. Wang, and M.-T. Sun. Leveraging Spatio-Temporal Redundancy for RFID Data Cleansing. In SIGMOD Conference, pages 51–62, 2010.
[13] R. Cheng, S. Singh, and S. Prabhakar. U-DBMS: A Database System for Managing Constantly-evolving Data. In VLDB, pages 1271–1274, 2005.
[14] R. Cocci, T. Tran, Y. Diao, and P. J. Shenoy. Efficient Data Interpretation and Compression over RFID Streams. In ICDE, pages 1445–1447, 2008.
[15] N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. The VLDB Journal, 16(4):523–544, 2007.
[16] A. Deshpande, C. Guestrin, and S. Madden. Using Probabilistic Models for Data Management in Acquisitional Environments. In CIDR, pages 317–328, 2005.
[17] A. Doucet, S. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10:197–208, 2000.
[18] D. W. Engels and S. E. Sarma. The Reader Collision Problem. In IEEE SMC, 2002.
[19] L. F. Escudero. An Inexact Algorithm for the Sequential Ordering Problem. European Journal of Operational Research, 37(2):236–249, 1988.
[20] C. Floerkemeier and M. Lampe. Issues with RFID Usage in Ubiquitous Computing Applications. In Pervasive, pages 188–193, 2004.
[21] M. J. Franklin, S. R. Jeffery, S. Krishnamurthy, F. Reiss, S. Rizvi, E. Wu, O. Cooper, A. Edakkunni, and W. Hong. Design Considerations for High Fan-In Systems: The HiFi Approach. In CIDR, pages 290–304, 2005.
[22] B. George, S. Kim, and S. Shekhar. Spatio-temporal Network Databases and Routing Algorithms: A Summary of Results. In Proceedings of the 10th International Symposium on Advances in Spatial and Temporal Databases (SSTD), pages 460–477, 2007.
[23] H. Gonzalez, J. Han, X. Li, and D. Klabjan. Warehousing and Analyzing Massive RFID Data Sets. In ICDE, page 83, 2006.
[24] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD'84, Proceedings of Annual Meeting, pages 47–57, 1984.
[25] G. R. Hjaltason and H. Samet. Distance Browsing in Spatial Databases. ACM Trans. Database Syst., 24(2):265–318, 1999.
[26] J. Ho, D. W. Engels, and S. E. Sarma. HiQ: A Hierarchical Q-learning Algorithm to Solve the Reader Collision Problem. In SAINT Workshops, pages 88–91, 2006.
[27] E. Horowitz, S. Sahni, and S. Anderson-Freed. Fundamentals of Data Structures in C. W. H. Freeman, 1993.
[28] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. MCDB: A Monte Carlo Approach to Managing Uncertain Data. In SIGMOD, pages 687–700, 2008.
[29] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom. Declarative Support for Sensor Data Cleaning. In Pervasive, pages 83–100, 2006.
[30] S. R. Jeffery, M. J. Franklin, and M. N. Garofalakis. An Adaptive RFID Middleware for Supporting Metaphysical Data Independence. VLDB J., 17(2):265–289, 2008.
[31] S. R. Jeffery, M. N. Garofalakis, and M. J. Franklin. Adaptive Cleaning for RFID Data Streams. In VLDB, pages 163–174, 2006.
[32] C. S. Jensen, J. Kolář, T. B. Pedersen, and I. Timko. Nearest Neighbor Queries in Road Networks. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS), pages 1–8, 2003.
[33] A. B. Kahn. Topological Sorting of Large Networks. Commun. ACM, 5(11):558–562, 1962.
[34] Y. Kanza, R. Levin, E. Safra, and Y. Sagiv. An Interactive Approach to Route Search. In GIS, pages 408–411, 2009.
[35] N. Khoussainova, M. Balazinska, and D. Suciu. Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints. In MobiDE, pages 43–50, 2006.
[36] N. Khoussainova, M. Balazinska, and D. Suciu. Probabilistic Event Extraction from RFID Data. In ICDE, pages 1480–1482, 2008.
[37] M. R. Kolahdouzan and C. Shahabi. Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 840–851, 2004.
[38] W.-S. Ku, R. Zimmermann, H. Wang, and C.-N. Wan. Adaptive Nearest Neighbor Queries in Travel Time Networks. In Proceedings of the 13th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS), pages 210–219, 2005.
[39] K. C. K. Lee, W.-C. Lee, and B. Zheng. Fast Object Search on Road Networks. In EDBT, pages 1018–1029, 2009.
[40] F. Li, D. Cheng, M. Hadjieleftheriou, G. Kollios, and S.-H. Teng. On Trip Planning Queries in Spatial Databases. In Proceedings of the 9th International Symposium on Advances in Spatial and Temporal Databases (SSTD), pages 273–290, 2005.
[41] X. Ma, S. Shekhar, H. Xiong, and P. Zhang. Exploiting a Page-Level Upper Bound for Multi-Type Nearest Neighbor Queries. In Proceedings of the 14th ACM International Symposium on Geographic Information Systems (ACM-GIS), pages 179–186, 2006.
[42] J. Myung, W. Lee, J. Srivastava, and T. K. Shih. Tag-Splitting: Adaptive Collision Arbitration Protocols for RFID Tag Identification. IEEE Trans. Parallel Distrib. Syst., 18(6):763–775, 2007.
[43] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query Processing in Spatial Network Databases. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), pages 802–813, 2003.
[44] J. Rao, S. Doraiswamy, H. Thakkar, and L. S. Colby. A Deferred Cleansing Method for RFID Data Analytics. In VLDB, pages 175–186, 2006.
[45] R. Reddy. To Dream the Possible Dream. Commun. ACM, 39(5):105–112, 1996.
[46] S. M. Ross. Introduction to Probability Models, Ninth Edition. Academic Press, 2006.
[47] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 71–79, 1995.
[48] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.
[49] H. Samet. Issues, Developments, and Challenges in Spatial Databases and Geographic Information Systems (GIS). In Proceedings of the Ninth ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS), page 1, 2001.
[50] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable Network Distance Browsing in Spatial Databases. In SIGMOD Conference, pages 43–54, 2008.
[51] T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB), pages 507–518, 1987.
[52] M. Sharifzadeh, M. R. Kolahdouzan, and C. Shahabi. The Optimal Sequenced Route Query. The VLDB Journal, accepted for publication, 2008.
[53] M. Sharifzadeh and C. Shahabi. Processing Optimal Sequenced Route Queries Using Voronoi Diagrams. GeoInformatica, 12(4):411–433, 2008.
[54] M. Sharifzadeh and C. Shahabi. Approximate Voronoi Cell Computation on Spatial Data Streams. VLDB J., 18(1):57–75, 2009.
[55] S. Shekhar, M. Coyle, B. Goyal, D.-R. Liu, and S. Sarkar. Data Models in Geographic Information Systems. Commun. ACM, 40(4):103–111, 1997.
[56] L. Sullivan. RFID Implementation Challenges Persist, All This Time Later. InformationWeek, October 2005.
[57] Y. Tao, D. Papadias, and Q. Shen. Continuous Nearest Neighbor Search. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pages 287–298, 2002.
[58] M. Terrovitis, S. Bakiras, D. Papadias, and K. Mouratidis. Constrained Shortest Path Computation. In Proceedings of the 9th International Symposium on Advances in Spatial and Temporal Databases (SSTD), pages 181–199, 2005.
[59] Y. Tian, K. C. K. Lee, and W.-C. Lee. Finding Skyline Paths in Road Networks. In GIS, pages 444–447, 2009.
[60] Y. Tian, K. C. K. Lee, and W.-C. Lee. Monitoring Minimum Cost Paths on Road Networks. In GIS, pages 217–226, 2009.
[61] T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. Probabilistic Inference over RFID Streams in Mobile Environments. In ICDE, 2009.
[62] T. T. L. Tran, A. McGregor, Y. Diao, L. Peng, and A. Liu. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. 2010.
[63] T. T. L. Tran, L. Peng, B. Li, Y. Diao, and A. Liu. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. In SIGMOD Conference, pages 159–170, 2010.
[64] J. Waldrop, D. W. Engels, and S. E. Sarma. Colorwave: An Anticollision Algorithm for the Reader Collision Problem. In ICC, pages 1206–1210, 2003.
[65] F. Wang and P. Liu. Temporal Management of RFID Data. In VLDB, pages 1128–1139, 2005.
[66] R. Want. The Magic of RFID. ACM Queue, 2(7):40–48, 2004.
[67] J. Xie, J. Yang, Y. Chen, H. Wang, and P. S. Yu. A Sampling-Based Approach to Information Recovery. In ICDE, pages 476–485, 2008.
[68] J. Zhang, M. Zhu, D. Papadias, Y. Tao, and D. L. Lee. Location-based Spatial Queries. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 443–454, 2003.