Context-Based File Systems and Spatial Query Applications by Ji Zhang A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy Auburn, Alabama May 5, 2013 Keywords: Context-Based, File System, Voronoi Diagram Copyright 2013 by Ji Zhang Approved by Xiao Qin, Associate Professor of Computer Science and Software Engineering Wei-Shinn Ku, Associate Professor of Computer Science and Software Engineering Sanjeev Baskiyar, Associate Professor of Computer Science and Software Engineering Saad Biaz, Associate Professor of Computer Science and Software Engineering Abstract This dissertation presents studies related to I/O techniques in data-intensive computa- tion and advanced solutions of spatial queries. There is a lack of general mechanisms for integrating multiple file system techniques and, therefore, the dissertation first illustrates a framework for Context-Based File Systems (CBFSs), which simplifies the development of context-based file systems and applications. Unlike existing informed-based context- aware systems, the framework provides a unifying informed-based mechanism that abstracts context-specific solutions as views, thereby allowing applications to make view selections according to their behaviours. The framework can not only eliminate overheads induced by traditional context analysis, but also simplify the interactions between file systems and applications. In addition, o?oading a portion of a program to an active storage is another way to improve I/O performance in a cluster by significantly reducing data traffic. In the of- floading study, we design a general o?oading framework or ORCA that enables programmers without I/O o?oading experience to complete the o?oading development. In the second part of this dissertation, we propose two novel spatial queries, multi- criteria optimal location query and keyword-spatial query. In our approaches, Voronoi dia- gram techniques are utilized for efficiently answering the queries. Besides two intuitive ap- proaches, weexploretwoadvancedsolutions, RealRegionasBoundary(RRB)andMinimum Bounding Rectangle as Boundary (MBRB), which are based on our proposed Overlapping Voronoi Diagram (OVD) model. High complexity of Voronoi diagram overlap computation in RRB motivates us to reduce costs of the overlap operation by replacing real boundaries of Voronoi diagrams with their Minimum Bounding Rectangles (MBR). Moreover, we employ a filter-and-refine strategy in an evaluation system for the keyword-spatial query. The system is comprised of Keyword Constraint Filter (KCF), Keyword and Spatial Refinement (KSR), ii and a ranker. KCF calculates the keyword relevancy of spatial objects, and KSR refines intermediate results by considering both spatial and keyword constraints. The extensive experimental results show that the queries can be efficiently and effectively evaluated by the proposed solutions. iii Acknowledgments This dissertation would not have been completed without invaluable guidance, help and experience sharing from the people who constantly support and encourage me during my study at Auburn University. First and foremost, my most sincere and deepest gratitude goes to my advisor, Dr. Xiao Qin, for his great efforts and trust in my work. I will never forget his extensive knowledge in thefieldofcomputerscienceandinexhaustibleenthusiasmforresearch, whichkeepsinspiring and driving me to accomplish my research. When working on the paper of Context-Based File Systems (CBFSs), he gave me numerous valuable advices and suggestions, including setting up accurate motivations behind the research, building a multiple context model, and a concrete design that introduces a creative concept, view, to file systems. I am also tremendously grateful to be advised by Dr. Wei-Shinn Ku. His solid under- standing in spatial queries, unlimited patience in answering my questions and meticulous working style impressed me in our discussion and meetings. His insightful comments and suggestions helped and enlightened me with literature reviews, appropriate topic targeting, idea extension and demonstration, and experimental validation. I owe my gratitude to Dr. Min-Te Sun, whose creative thinking and constructive crit- icism helped me considerably improve the way how the idea of the Overlapping Voronoi Diagram (OVD) model is extended and organized in our paper. His extensive experience and strong understanding in mathematics boosts my confidence to present mathematical details, such as property analysis and proofs, in the dissertation. I would like to thank my committee members, Dr. Sanjeev Baskiyar and Dr. Saad Biaz, who reviewed my proposal and dissertation. They gave me a number of valuable suggestions, by which my dissertation had been substantially improved. I am equally grateful to Dr. iv Fa Foster Dai, who gave me helpful comments and suggestions on my dissertation as the university reader. Working with our research group is fantastic. I would like to thank Xiaojun Ruan, Zhiyang Ding, Shu Yin, James Majors, Yun Tian, Jiong Xie, Yixian Yang, Maen Al Assaf, Xunfei Jiang, Ajit Chavan, Tausif Muzaffar, Sanjay Kulkarni and Yuanqi Chen who helped me with paper writing, experimental result collection and group discussions. I would like to thank all the professors and students in the Department of Computer Science and Software Engineering, who create and maintain an excellent atmosphere for study and research. Finally and most importantly, the endless love and support from my family is the most powerful strength that keeps me fighting for my research. My mother Yingxiang Hu, my father Lixin Zhang and my wife Xunfei Jiang always stay with me, cheering for achievement and overcoming all difficulties. v To my parents and Xunfei Jiang vi Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation of the Framework for Context-Based File Systems . . . . . . . . 2 1.2 Motivation of the O?oading Framework . . . . . . . . . . . . . . . . . . . . 4 1.3 Multi-Criteria Optimal Location Queries . . . . . . . . . . . . . . . . . . . . 6 1.4 Spatial Keyword Queries on Networks . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1 Context-Based File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Context-Aware Systems . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.3 View-Enhanced Systems . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 O?oading Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Multi-Criteria Optimal Location Queries . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Reverse Nearest Neighbor Queries . . . . . . . . . . . . . . . . . . . . 16 2.3.2 Optimal Location Queries . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Spatial Keyword Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 k Nearest Neighbor Queries . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Text Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.3 Spatial Keyword Queries . . . . . . . . . . . . . . . . . . . . . . . . . 20 vii 3 Frog: a Framework for Context-Based File Systems . . . . . . . . . . . . . . . . 22 3.1 Context-Based File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.2 Context-Aware vs. Context-Based File Systems . . . . . . . . . . . . 24 3.1.3 The Frog Framework for Context-Based File Systems . . . . . . . . . 25 3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.3 Three Frog-based CBFS Designs . . . . . . . . . . . . . . . . . . . . . 32 3.2.4 Interactions with Applications . . . . . . . . . . . . . . . . . . . . . . 33 3.2.5 Application Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.6 Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1 The BAVFS File System . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 The BHVFS File System . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4.1 File Creation/Deletion Rate . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 BAVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.3 BHVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5.1 Kernel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5.2 Work in a Distributed Environment . . . . . . . . . . . . . . . . . . . 54 3.5.3 Optimization for View-unaware Applications . . . . . . . . . . . . . . 55 3.5.4 Context Exposure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4 ORCA: An O?oading Framework for I/O-Intensive Applications on Clusters . . 59 4.1 The ORCA I/O-O?oading Framework . . . . . . . . . . . . . . . . . . . . . 60 4.1.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 viii 4.1.2 Structure of Applications in ORCA . . . . . . . . . . . . . . . . . . . 61 4.2 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2.1 Data-Intensive Module Identification . . . . . . . . . . . . . . . . . . 63 4.2.2 O?oading a Program . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2.3 Controlling an Execution Path . . . . . . . . . . . . . . . . . . . . . . 65 4.2.4 Data Sharing among Storage and Computing Nodes . . . . . . . . . . 65 4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.2 Workflow of an Application in ORCA . . . . . . . . . . . . . . . . . . 67 4.3.3 O?oading APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.4 Sharing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.4.1 Experimental Testbed for ORCA . . . . . . . . . . . . . . . . . . . . 72 4.4.2 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.3 PostgreSQL: A case study . . . . . . . . . . . . . . . . . . . . . . . . 76 4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.5.1 Overall Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 78 4.5.2 Network Load Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 81 4.5.3 CPU Usage Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.6 Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.6.1 O?oading Module Identification . . . . . . . . . . . . . . . . . . . . . 88 4.6.2 Data Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5 MOLQ: Multi-Criteria Optimal Location Query with Overlapping Voronoi Dia- grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1.2 Voronoi Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 ix 5.1.3 Fermat-Weber Point . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.2 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.2.1 Sequential Scan Object Combinations . . . . . . . . . . . . . . . . . . 96 5.2.2 Sequential Scan Locations . . . . . . . . . . . . . . . . . . . . . . . . 97 5.3 The OVD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.1 An OVD Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.3.2 Overlapped Voronoi Diagram Definition . . . . . . . . . . . . . . . . 99 5.3.3 Algebraic Structure of MOVD . . . . . . . . . . . . . . . . . . . . . . 103 5.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.4.1 Framework of the MOVD-based Solution . . . . . . . . . . . . . . . . 106 5.4.2 The RRB Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.4.3 The MBRB Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.4.4 A Cost-Bound approach in Optimizer . . . . . . . . . . . . . . . . . . 114 5.5 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.5.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.5.2 Cost-Bound Approach Evaluation . . . . . . . . . . . . . . . . . . . . 117 5.5.3 Overlapping Two Voronoi Diagrams . . . . . . . . . . . . . . . . . . . 118 5.5.4 Overlapping Multiple Voronoi Diagrams . . . . . . . . . . . . . . . . 119 5.5.5 MOLQ Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6 Efficient Evaluation of Spatial Keyword Queries on Spatial Networks . . . . . . 122 6.1 Query Type Definition and Background . . . . . . . . . . . . . . . . . . . . . 122 6.1.1 Foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.1.2 Spatial Keyword Ranker . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.1.3 Spatial Keyword kNN Queries . . . . . . . . . . . . . . . . . . . . . . 126 6.1.4 Spatial Keyword Range Queries . . . . . . . . . . . . . . . . . . . . . 127 6.1.5 Network Voronoi Diagram . . . . . . . . . . . . . . . . . . . . . . . . 127 6.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 x 6.2.1 Framework of Query Evaluation . . . . . . . . . . . . . . . . . . . . . 130 6.2.2 Keyword Constraint Filter . . . . . . . . . . . . . . . . . . . . . . . . 131 6.2.3 The Network Expansion-Based SKkNN Query Algorithm . . . . . . . 134 6.2.4 The Voronoi Diagram-Based SKkNN Query Algorithm . . . . . . . . 137 6.2.5 The Spatial Keyword Range Query Algorithm . . . . . . . . . . . . . 140 6.3 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.3.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.3.2 Data Set Size Experiment . . . . . . . . . . . . . . . . . . . . . . . . 143 6.3.3 Number of Keywords Experiment . . . . . . . . . . . . . . . . . . . . 144 6.3.4 Number of k Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.3.5 Query Range Experiment . . . . . . . . . . . . . . . . . . . . . . . . 146 6.3.6 Page Access Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 147 7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 7.1 Framework for Context-Based File Systems . . . . . . . . . . . . . . . . . . . 149 7.2 O?oading Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 7.3 MOLQ Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 7.4 Spatial Keyword Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 xi List of Figures 1.1 An example of residential location selection. The object weights are indicated as . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 A sample spatial network of hotels close to an airport. . . . . . . . . . . . . . . 9 3.1 A comparison of context-aware systems. . . . . . . . . . . . . . . . . . . . . . . 24 3.2 The framework of a Frog-based CBFS. . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Overview of a Frog-based CBFS. . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4 Frog-based CBFS designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 Interactions between processes and Frog. . . . . . . . . . . . . . . . . . . . . . . 34 3.6 Examples of view-aware applications. . . . . . . . . . . . . . . . . . . . . . . . . 35 3.7 Application modification comparisons. . . . . . . . . . . . . . . . . . . . . . . . 36 3.8 Two consistency mechanisms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.9 File locking among views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.10 Structure of BAVFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.11 Name processing in two views. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.12 The structure of BHVFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 xii 3.13 File creation/deletion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.14 Random and sequential read evaluation in BAVFS. . . . . . . . . . . . . . . . . 50 3.15 Random read and write evaluation in BHVFS. . . . . . . . . . . . . . . . . . . . 51 3.16 VED and views allocation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.17 Two block mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.1 The architecture of commodity clusters, where a number of nodes are connected witheachotherthroughinterconnects. Wefocusonclustersenhancedwithactive storage nodes that have computing capability. . . . . . . . . . . . . . . . . . . . 60 4.2 An o?oading domain is a logic processing unit, in which a pair of computing and o?oading modules are coordinated. I/O-bound modules are assigned to and executed on storage nodes; CPU-bound modules are handled by computing nodes. ORCA overlaps the executions of CPU-bound and I/O-bound modules. . 61 4.3 The execution flow of a data-intensive application running in the ORCA o?oad- ing framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4 A simple example of o?oad_call . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.5 The execution flow of official PostgreSQL . . . . . . . . . . . . . . . . . . . . . 77 4.6 The execution flow of o?oading PostgreSQL in ORCA. The computing node handles the parser, rule system, and optimizer; the executor is o?oaded to the storage node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.7 ORCA-based applications vs. Official applications. Execution times of the five real-world benchmark applications running on the homogeneous cluster (i.e., the first testbed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 xiii 4.8 ORCA-based applications vs. Official applications. Execution times of the five real-world benchmark applications running on the heterogeneous cluster (i.e., the second testbed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.9 Network load imposed by both official and ORCA-based PostgreSQL accessing different databases on the homogeneous cluster (i.e., the first testbed). . . . . . 83 4.10 Network load imposed by the four real-world applications and their ORCA-based counterpartsaccessing800MBdatasetsonthehomogeneouscluster(i.e., thefirst testbed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.11 Network load imposed by both official and ORCA-based PostgreSQL accessing different databases on the heterogeneous cluster (i.e., the second testbed). . . . 85 4.12 Network load imposed by the four real-world applications and their ORCA-based counterparts accessing the 800 MB datasets on the heterogeneous cluster (i.e., the second testbed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.13 CPU load imposed by the five real-world ORCA-based applications in the storage nodes of the homogeneous cluster (i.e., the first testbed). . . . . . . . . . . . . . 87 4.14 CPU load imposed by the five real-world ORCA-based applications in the storage nodes of the heterogeneous cluster (i.e., the second testbed). . . . . . . . . . . . 87 5.1 Ordinary Voronoi diagrams and OVDs. . . . . . . . . . . . . . . . . . . . . . . . 99 5.2 The Framework of the MOVD-based solution. . . . . . . . . . . . . . . . . . . . 106 5.3 Overlapping two MOVDs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.4 Weighted Voronoi diagrams (the numbers indicate weights). . . . . . . . . . . . 112 5.5 Data structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 xiv 5.6 Optimal locations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.7 The CB approach evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.8 Execution time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.9 Number of OVRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.10 Memory consumption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.11 Varying number of object types. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.12 Three object types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.13 Four object types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.1 An example spatial network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.2 Voronoi diagram examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.3 Framework of the spatial keyword query evaluation system. . . . . . . . . . . . 131 6.4 Inverted index structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 6.5 An example of KeywordMatch. . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.6 A VDkNN query example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.7 Execution times of NEkNN and VDkNN queries as a function of data set size. . 143 6.8 Execution times of NEkNN and VDkNN queries as a function of number of keywords. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.9 Execution times of NEkNN and VDkNN queries as a function of number of k. . 146 6.10 Execution time of SKR queries as a function of query range. . . . . . . . . . . . 146 6.11 Page Access evaluation with different data set sizes. . . . . . . . . . . . . . . . . 148 xv List of Tables 4.1 The ORCA O?oading Programming Interface . . . . . . . . . . . . . . . . . . . 70 4.2 Hardware and Software Configurations . . . . . . . . . . . . . . . . . . . . . . . 73 4.3 Configuration of the two Testbeds for ORCA . . . . . . . . . . . . . . . . . . . 73 4.4 Real-World Benchmark Applications . . . . . . . . . . . . . . . . . . . . . . . . 75 5.1 Symbolic Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.1 A sample data set of hotels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.2 Symbolic notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.3 Simulator configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.4 Data sets employed in experiments. . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.5 Default values of parameters used in experiments. . . . . . . . . . . . . . . . . . 143 xvi Chapter 1 Introduction With rapid growth of data volume, methods of efficiently processing large amount of data have been reported in the past decade. In order to overcome the challenging problem, various advanced I/O techniques and approaches have been designed to alleviate the I/O bottleneck. Fast file system [80] and log-structure file system [99] increase the throughput of read and write operations by applying update-in-place and update-out-of-place strategies, respectively. I/O prefetching [90] and buffering [86] techniques are proposed for further performance improvement by conducting I/O behaviour analysis and prediction. O?oading techniques that o?oad programs instead of transferring huge data over network are designed to avoid network traffic in a distributed computing environment or save energy on mobile devices. Inaddition toI/Otechniques, investigatingspatial queryapplicationsis anotherfocused field in this dissertation. The spatial queries are not only data-intensive applications that evaluate the queries on large data sets, but also real world applications that have impacts on our lives. For example, top k nearest neighbor query provides us with the best candidates, such as restaurants, hotels or museums, based on customized criteria [107]. Range query returns all qualified candidates within a particular area. This dissertation consists of two parts. The first part contains the studies related to I/O techniques; the second part presents the advanced solutions of novel spatial queries. This chapter is organized as follows. Sections 1.1 and 1.2 elaborate motivations of our frameworks for context-based file systems and o?oading applications. Sections 1.3 and 1.4 illustrate the problem statements and motivations of novel spatial queries with simple examples. Finally, section 1.5 outlines the dissertation organization. 1 1.1 Motivation of the Framework for Context-Based File Systems Context-aware computing allows modern storage systems to adjust I/O schemes and mechanisms according to specific contexts (e.g., read-intensive contexts). Taking context information into account enables storage systems to dynamically optimize I/O performance, energy efficiency, etc. For instance, quFiles create multiple physical representations of files for data-specific contexts [122]. Prefetching can be optimized (e.g., adjusting the number of prefetched blocks) by identifying and modeling future access patterns in a real-time man- ner [111]. When it comes to energy saving, a server may have two operating contexts: a full-utilization mode during daytime and an energy-saving mode during nights [73]. To re- duce Solid-State Drive (SSD) maintenance costs, an HDD-SSD hybrid file system may have two contexts: a performance critical context and an SSD lifetime-extending context [112]. Both contexts can be implemented on SSDs and Hard-Disk Drives (HDDs), respectively. Context-aware techniques have been investigated in previous studies. However, the fol- lowing four problems make the existing context-aware techniques impractical in file systems. First, the context-aware techniques are too complicated to be implemented. Various heuris- tic or informed approaches have been proposed to identify contexts. Appropriately selecting an approach for a context-aware file system requires extensive experience. Second, the exist- ing informed-based approaches expose different interfaces to applications. Migrating them from one approach to another is difficult. Third, a context of a file system can be divided into several finer-grained contexts, which make it easy to optimize I/O operations. Last, the existing approaches suffer from the back-end analysis, which is likely to incur a significant overhead during I/O operations in file systems. In this study, we propose a framework, Frog, for Context-Based File Systems (CBFSs) that encapsulates a number of solutions, each of which is dedicatedly designed for a particular context. The context-specific solutions are abstracted as views in the internal representation of Frog-based CBFSs. Since the views are independent of each other, a new context can be supported in the CBFSs by creating a new view. 2 Frog is a unifying informed-based framework for statically mapping contexts to appro- priate solutions in CBFSs. The framework simplifies the interfaces between file systems and applications. Rather than propagating hints through a dedicated interface (i.e., ioctl [90] or newly developed ctx_pread [111]), the framework allows applications to choose contexts by inserting view names in file path strings. Without knowledge of extra I/O interfaces, application developers are able to write view-aware codes. Frog does not dynamically identify any context based on hints, thereby avoiding over- heads of the back-end analysis and shortening I/O response times. The contexts maintained in Frog-based CBFSs are configured off-line; view-aware applications are in charge of context selections. The performance of programs running in the CBFSs are automatically optimized by appropriate context-specific solutions implemented in form of views. Frog takes care of details of integrating multiple solutions, keeping consistency among views, and avoiding resource contention (see Section 3.2.6). This allows system developers to focus on context- specific solutions without really paying any attention to the integration issues. Frog is compatible with existing non-view-aware applications. The compatibility is enforced by setting a default view for non-view-aware applications that generally do not represent any specific context. Moreover, the default view is an important abstraction, on which the consistency and locking mechanism rely. Frog offers three approaches to integrating multiple context-specific solutions by con- sidering metadata and physical data management. In the shared-data approach, physical data are shared among views, whereas in the shared-nothing approach, each view maintains a replica of files. The third approach is a hybrid in the sense that it is a combination of the shared-data and shared-nothing approaches. We also illustrate flexible interactions between Frog and applications by two concrete examples. Applications can access a file through dif- ferent views based on their contexts. In addition, an application can update a file through a view while another application can read the file through another view later. 3 We implement two prototypes for the shared-data and shared-nothing approaches, re- spectively. The first prototype is the Bi-context Archiving Virtual File System (BAVFS) and the second prototype is the Bi-context Hybrid Virtual File System (BHVFS). BAVFS separates the contexts of sequential and random reads, and applies aggressive and conser- vative prefetching in each view. BHVFS manages two views for both read-intensive context and write-intensive context. Specifically, the update-in-place (e.g., fast file system [80]) and update-out-of-place (e.g., log-structured file system [99]) strategies are used for each context. 1.2 Motivation of the O?oading Framework Although o?oading techniques have been applied to a wide range of computing plat- forms (e.g. parallel file systems [38] [93] and object-based storage [36]), there is lack of a general o?oading framework tailored for I/O intensive applications running on clusters. Moreover, none of existing works pay any attention on o?oading application development fromdevelopers?perspective. Basedonourexperience, writinganappropriateo?oadingpro- gramisdifficultandtime-consuming. Inthischapter, weproposeanewo?oadingframework called ORCA to map I/O-intensive code to a cluster that consists of computing and storage nodes. The following two factors motivate us to develop the ORCA o?oading framework: heavy network traffic is imposed by transferring data from storage to computing nodes in clusters, and writing an o?oading program without any general framework is difficult. Due to the nature of I/O intensive applications, heavy network traffic is caused by retrieving massive amount of data between computing and storage nodes in clusters. During the data staging phase, data to be processed by applications running on computing nodes must be loaded from storage nodes through interconnections. Transferring huge amount of data can slow down the performance of the applications. This I/O problem becomes 4 even worse for clusters using the Ethernet, where all nodes share network bandwidth in the clusters. The second motivation driving us is that studies of o?oading development are missing. Even for an experienced developer, a number of issues related to the o?oading development are difficult to solve, including appropriately designing o?oading programs, accurately de- ciding I/O-bound modules of programs, controlling execution paths and efficiently sharing data. Our goal is to address the above two issues by developing the ORCA framework to au- tomatically o?oad I/O-bound modules of an application to active storage nodes in a cluster. The o?oading framework deals with configurations, execution-path control, o?oading exe- cutable code, and data sharing. Our framework coupled with an application programming interface (API) and a run-time system enables programmers without any I/O o?oading ex- perience to easily write new I/O-intensive code or extend existing code running efficiently on clusters. The main contributions of this work are: We describe the ORCA framework centered around an o?oading API and a run-time system (see Sec. 4.2). We discuss the implementation details, including the issues of configurations, program- ming interface, and data sharing (see Sec. 4.3). We develop a testbed to evaluate real-world applications in our run-time system (see Sec. 4.4). We present experimental results to show that both homogeneous and heterogeneous clusters powered by ORCA experience reduced amount of network bandwidth used to transfer data among computing and storage nodes (see Sec. 4.5). Online resources. The source code and documentation of our I/O o?oading frame- workarefreelyavailableatftp://ftp.eng.auburn.edu/pub/jzz0014/offloading/library/ 5 1.3 Multi-Criteria Optimal Location Queries Numerousspatialqueries,includingnearestneighborandreversenearestneighborqueries, had been extensively studied; however, there are still spatial queries applied in real appli- cations, such as location decision making, that cannot be efficiently addressed by existing spatial query types. A typical example is making residential location decisions, such as find- ing home locations that would maximize residential satisfaction, which is a critical part of community planning and development [85]. In order to attract more customers, an optimal location would be selected based on a comprehensive consideration of a number of factors, such as transportation accessibility (the ease of reaching a bus or subway station), the dis- tance to an elementary school, or the distance to a supermarket where residents can purchase food and living necessities. Bus St op 1 Community 3 Community 2 Community 1 5 < 1 , 2 > 5 8 6 5 5 10 6 3 Supermarket 1 < 2 , 2 > Supermarket 2 <2, 1> School 2 <3 , 1 > School 1 < 3 , 2 > B us St op 2 < 1 , 1 > Figure 1.1: An example of residential location selection. The object weights are indicated as . Fig. 3.6 displays a simple example of residential location selection in a city. We assume that there are two schools, two bus stops, and two supermarkets in the city. Their locations are indicated by symbols. The figure also shows three potential community locations. Lines connect communities to their closest bus stop, school, and supermarket, respectively. The numbers on the lines indicate the distance between two locations. If we assume that the 6 optimal location for a new community is the place that minimizes the total distance to its closest school, bus stop, and supermarket, the best location is Community 1, the total distance (16) of which is shorter than that of Community 2 (19) or Community 3 (18). Tradeoffs of multiple factors are actually considered in real residential location selec- tion [126]. The importance of schools, bus stops, and supermarkets varies greatly among different people. For example, some people may prefer living near a school because it is convenient to drive their children to school. In addition, objects of a particular type are con- sidered differently. When selecting a school, the ones that provide higher quality programs are more attractive than others. In order to take these differences into consideration, a type weight wt and object weight wo are associated with each object. With the weights indicated in the figure, the best choice is Community 3 (33) if the weighted distance of a community and an object is calculated as the product of distance and the two weights. Instead of associating a single weight with an object, type weight and object weight are set individually in the example because various weight functions are allowed to be applied to the query. This will be described in Section 5.1. In the example, a multiplicatively-based weight function is applied to both type weight and object weight. In order to efficiently answer the query, we propose an advanced solution that utilizes Overlapped Voronoi Diagram (OVD) model and Fermat-Weber techniques. The OVD model comprehendslocationinformationandobjectweightswo ofspatialobjectsbyoverlappingthe diagrams generated from the objects. With the OVD model, the closest objects of different types to a particular location can be efficiently retrieved. Fermat-Weber techniques are used for finding the optimal location of given objects. In addition, due to a surprisingly large number of Fermat-Weber problems generated in our solution, we propose a cost-bound iterative solution that is able to significantly reduce the computation complexity of the original iterative solution [127]. The contributions of this study are summarized below: 7 1. We identify a novel query type that is able to find optimal locations comprehensively considering multiple criteria. 2. We build an OVD model, analyze its properties systematically, and create an algebraic structure of its overlap operations. 3. Based on the OVD model, We propose a Real Region as Boundary (RRB) solution that is able to efficiently evaluate the novel query. 4. In the proposed Minimum Bounding Rectangle as Boundary (MBRB) solution, we optimize the overlap operation by avoiding overlapping region calculations. 5. Facing large amount of Fermat-Weber problems, we propose a cost-bound iterative solution that is able to significantly reduce the computation complexity of the original iterative solution. 6. We evaluate the performance of the proposed solutions through extensive experiments with real-world data sets. 1.4 Spatial Keyword Queries on Networks A Spatial Keyword (SK) query is an approach of searching qualified spatial objects by considering both the query requester?s location and user specified keywords. Taking both spatial and keyword requirements into account, the goal of a spatial keyword query is to efficiently find results that satisfy all the conditions. However, all existing solutions for SK queries are designed based on Euclidean distance [40, 53, 139, 136], which is not realistic since most users move on spatial networks. Moreover, all current approaches of SK queries are limited to finding objects that fully match the given keywords. Nevertheless, the objects with fully matched keywords could be far away from the query point. In this research, we design novel SK query techniques based on spatial networks. In addition, we take both fully and partially matched query results into account in the process of keyword searching. 8 This new SK query mechanism enables users to not only retrieve qualified results on spatial networks, but also obtain partially matched objects when there are not enough fully matched results in the vicinity of the requester. 4 miles 5 miles 6 miles 4 miles 5 miles3 miles 5 miles InternetBreakfast 5 PoolFitness 3 1 InternetBreakfast 6 ParkingFitness Breakfast 4 InternetParking2 InternetFitness Breakfast 3 miles Figure 1.2: A sample spatial network of hotels close to an airport. Figure 1.2 illustrates an example: a tourist who flies to Atlanta may want to search for two hotels which provide both ?Internet" and ?Breakfast" amenities and have the shortest driving distance to the Atlanta airport. In addition, the tourist may also search for all the hotels which are within 10 miles of the airport and provide the two amenities in order to compare the hotels? reviews and prices. For retrieving the qualified hotels, the tourist will launch a Spatial Keyword k Nearest Neighbor (SKkNN) query with ranking parameters for the first search, and the query results are hotels 1 and 3. A Spatial Keyword Range (SKR) query will be executed for the second inquiry, and the answers are hotels 1, 3, and 6. In this chapter, we focus on solving the two aforementioned spatial query types by devising three novel solutions which employ the inverted index technique, shortest path search algo- rithms, and network Voronoi diagrams. Particularly, the inverted index is used to maintain 9 the relationships between spatial objects and their attached keywords for quickly retriev- ing spatial objects whose features match the given keywords. In addition, we propose both a network expansion-based approach and a Voronoi diagram-based approach to efficiently answer SKkNN queries on spatial networks. The contributions of this study are as follows: 1. We provide formal definitions of spatial keyword kNN and range queries on spatial networks. 2. We develop two novel approaches for efficiently processing SKkNN query and one approach for SKR query on spatial networks. 3. Our SKkNN solution is able to return partially matched query results based on the output of the spatial keyword ranker. 4. We evaluate the performance of the proposed SKkNN and SKR algorithms through extensive experiments with both real-world and synthetic data sets. 1.5 Dissertation Organization This dissertation is organized as follows. In Chapter 2, related work reported in the literature is briefly reviewed. Chapter 3 introduces the concept of context-based file systems, and illustrates the design of a framework, Frog, that is able to simplify the development of context-based file systems. In Chapter 4, we develop an o?oading framework, ORCA, that is able to not only significantly reduce the network traffic by o?oading programs over network, but also help developers to deal with complex issues involved in o?oading development. In Chapter 5, we first propose an Overlapping Voronoi Diagram (OVD) model. Then, two advanced solutions based on the model are introduced and investigated. In Chapter 6, two advanced solutions, network expansion-based and Voronoi diagram-based approaches, are proposed and evaluated. Finally, Chapter 7 summarizes the contributions of this dissertation and comments on future directions on the research. 10 Chapter 2 Related Work As Chapter 1 mentioned, a variety of I/O techniques and spatial queries have been extensively studied in the past. This chapter briefly reviews existing approaches that most relevant to file system design, o?oading techniques, and spatial queries. 2.1 Context-Based File Systems 2.1.1 File Systems Two completely different update strategies were implemented in file systems: update-in- place and update-out-of-place. An update-in-place file system (e.g., FFS [80], IBM?s JFS [8], SGI?s XFS [10], ReiserFS [9] and Ext 4 [78]) usually commits an update at a place where data is stored. On the other hand, an update-out-of-place file system (e.g., LFS [99]) appends updates in a log file, which is reorganized by a separate cleaner program in a batch manner. Hybrid file systems integrate multiple I/O techniques in a single file system. Identifying the characteristics of a file system workload, MFS separates data into control, data, and log storage [82]. The control and data storage manage data that has not been recently modified. The log storage is designed for the crash recovery purpose. DualFS is a journaling file system that separates metadata from data using partitions or devices [91]. Update-in- place and update-out-of-place strategies are applied to the metadata and data partitions, respectively. HyLog is a hybrid file system that incorporates the FFS and LFS strategies [124]. To avoid the cleaning overhead of LFS, HyLog divides disk space into fixed-size segments. Hot segments - containing hot pages - are organized in LFS to achieve high write performance. 11 Cold segments - containing cold pages - are managed in FFS. When hot pages become cold, HyLog moves the cold pages to the cold partition to avoid extra cleaning overhead. The nature of pages, regardless of hot or cold, is determined by a separating algorithm. Similar to HyLog, hFS is a hybrid file system combining FFS and LFS [138]. In hFS, files are classified into two categories: large and small files. Large files are stored in FFS to offer competitive read performance by avoiding fragmentations incurred by small files. Metadata and small files are organized in the LFS fashion because read operations accessing small amounts of data are likely to respond by retrieving one or two blocks. Unlike the existing file systems, our Frog-based CBFS allows applications to interact with file systems. The CBFS exhibits not only file information but also a number of contexts in which the context-specific solutions are applied. View-aware applications are able to benefit from the context-specific solutions by appropriately selecting views. Thestackablefilesystemswereproposedasaneasierwayofimplementingandextending file systems [100] [134] [135]. By using unionfs [133], files in several directories appearing to be merged are actually managed separately. Our design is different from these systems in the sense that Frog-based CBFS combines multiple context-specific solutions by duplicating either a part of or an entire file system. In light of contexts, applications can pick views that offer optimization solutions to meet the needs of the applications. 2.1.2 Context-Aware Systems Context-aware computing provides ample opportunities to improve the performance of adaptive systems. For example, in a context-aware mobile computing system, context- sensitive data (e.g., locations and users? activities) can be simply collected by sensors. Based on the context-sensitive information, mobile devices can identify contexts and adjust device behaviors accordingly [69][109]. Collecting context-sensitive data is non-trivial at storage server ends. A context-aware prefetching study was conducted at storage servers, in which access patterns can be identified 12 via hints provided by applications through dedicated programming interfaces [111]. Rather than relying on hints, our Frog-based CBFS handles context selections made by applications that are aware of future I/O access needs. The CBFS can avoid the overheads of context analysis on storage servers. A number of context-aware middle-ware systems were proposed to make context-aware computingpossible[30][50][98]. Ourapproachisdifferentfromthesesystemsinthatwefocus on context-based file systems, in which view-aware applications can achieve context-specific I/O optimizations. 2.1.3 View-Enhanced Systems Conceptual view models have been widely used in software design. Along with multiple views, data can be logically presented in a diverse manner. A typical example is a relational database management system, in which views - constructed on top of hierarchical structures - are available for applications [95]. Gehani et al. proposed a file system interface for an object-oriented database [47] where objects can be accessed through both file manipulations and database operations. MVSS introduces a multi-view model for active storage systems to provide flexible ser- vices by dynamically generating in-memory views. A set of views are mounted at distinct points in a file system namespace [76]. Unlike MVSS, which implements multiple views at the storage system level, Frog supports multiple views at the file system level, where it takes care of the details of integrating multiple solutions. Frog handles metadata and physical data management, consistency issues, and resource contention, which can be applied to either a native file system in the kernel or a storage system in a distributed environment. Moreover, Frog statically duplicates metadata and physical data across views; thereby avoiding the overhead of dynamic methods (e.g., making decisions about when and what types of views are created). 13 The issues of context-aware adaptation were addressed in quFiles [122]. In particular, various contexts in quFiles are accessed in the form of views, in which an optimal policy is selected for a specific I/O context. To provide flexible and transparent services, quFiles dynamically or statically creates views. One view is set as the default view. Compared with quFiles, Frog diversifies not only physical representations but also metadata management as well as I/O operations. This salient feature of Frog makes it possible to integrate multiple I/O optimizations (e.g., prefetching) for particular contexts. More importantly, we show how to make use of Frog to implement feasible designs of context-based file systems. 2.2 O?oading Techniques The concept of active disks was proposed by Acharya et al. [16]. In their active disk architecture, processing power and memory are deployed into individual disks, to which a portion of application computation can be o?oaded by using a stream-based programming model. Their simulation results show that significant performance improvement can be achieved by reducing data traffic. Riedel et al. designed a similar system that moves an application?s processing to disk drives. In addition, they developed an analytical model, evaluating a wide range of applications that may benefit from the active storage [96]. Keeton et al presented an "intelligent" disk (IDISKs) architecture for decision support database servers [63]. Lim et al. designed an active-disk-based file system (ADFS), in which application- specific operations can be executed by disk processors [75]. When a file is loaded, only results processed by an application are returned. Moreover, the ADFS file system also o?oads a part of file system functionalities (e.g. lookup) to active disks. This approach is able to significantly reduce the workload of central file management. Chiu et al. presented a distributed architecture that utilizes smart disks equipped with processing power, on- disk memory, and network interfaces [29]. A set of representative I/O-intensive workloads 14 are evaluated on the architecture. Their experimental results suggest that the distributed architecture outperforms partially distributed and centralized systems. The idea of active storage was implemented in the Lustre file system by Felix et al. [38]. Afterwards, piernas et al proposed an active storage for Lustre in the user space [93]. Both approaches reduce data movements and improve computing capability. Compared with the kernel-based implementation, the user-space approach is faster, more flexible and readily deployable. In addition, motivated by requirements of specific applications, piernas et al. designed and evaluated an efficient way to manage complex striped files in active storage [92]. Du designed an intelligence storage system that combines active disks and object-based storage device (OSD) [36]. Du?s study mainly focused on fundamental changes in exist- ing storage systems; he proposed a number of future research directions in the realm of OSD-based storage. Son et al. investigated an active storage in the context of parallel file systems [110]. Based on the analysis of parallel applications, they designed an enhanced programming interface that enables application codes to embed in the parallel file system. Moreover, their approach also provides server-to-server communication for reduction and aggregation. Althoughtheo?oadingtechniqueshavebeenextensivelyexploredintheaforementioned studies, our approach differs from the above solutions with respect to the following three is- sues. First, recognizing that there is a lack of generic o?oading framework, we propose a general o?oading programming model that can be applied to either sequential or paral- lel applications (e.g., multi-thread and multi-process programs). We introduce the concept of o?oading domains to represent computation consequences. Server-to-server communica- tions proposed by Son et al. [110] can be converted into domain-to-domain communications. Second, developing o?oading-oriented applications is non-trivial from programmers? per- spective. In this study, we address a number of critical implementation issues raised in the development of o?oading-oriented programs. These issues, which are crucial to program designs and performance, include I/O-bound module identification, execution path control, 15 and data sharing in an o?oading domain. Last, developing o?oading-oriented programs in C language is time consuming due to the complexity of the programming language and; therefore, we propose an approach that is able to share dynamic and static data (e.g., codes). 2.3 Multi-Criteria Optimal Location Queries 2.3.1 Reverse Nearest Neighbor Queries Korn and Muthukrishnan [66] proposed the influence set notion based on reverse nearest neighbor (RNN) queries. They presented a precomputation-based approach for solving RNN queries and an R-tree based method (RNN-tree) for large data sets. In order to decrease index maintenance costs in [66], Yang and Lin [131] presented the Rdnn-tree which combines the R-tree with the RNN-tree and leads to significant savings in dynamically maintaining the index structure. The solutions in [66, 131] can be employed to evaluate both the monochro- matic RNN query and the bichromatic RNN query; however, these precomputation-based techniques incur extra maintenance costs for data updates. Therefore, several solutions with- out precomputation were proposed. For discovering influence sets in dynamic environments, Stanoi et al. [113] presented techniques to process bichromatic RNN queries without pre- computation. The design is to dynamically construct the influence region of a given query point q where the influence region is defined as a polygon in space which encloses all RNNs of q. For the monochromatic RNN query, Tao et al. [117] developed algorithms for evaluat- ing RkNN with arbitrary values of k on dynamic multidimensional data sets by utilizing a data-partitioning index. The algorithms were later extended to support continuous RkNN search [118], which returns the RkNN results for every point on a line segment. There are some other works related to RNN query evaluation. Retrieving RNN aggrega- tions (such as COUNT or MAX DISTANCE) over data streams was introduced in [67]. Yiu et al. [132] proposed pruning-based methods to find RNNs in large graphs. The algorithms for efficient RNN search in generic metric spaces were presented in [119]. The techniques 16 require no detailed representations of objects and can be applied as long as the similarity be- tween two objects can be computed and the similarity metric satisfies the triangle inequality. Cheema et al. [27] studied the problem of continuous monitoring of reverse k nearest neigh- bors queries in Euclidean space as well as in spatial networks. While the aforementioned approaches work well for R(k)NN queries, they cannot be utilized directly to evaluate the unique query type studied in this chapter due to the fundamental differences between query definitions. 2.3.2 Optimal Location Queries One group of optimal location queries (OLQ) are defined with an optimization function which maximizes the influence of a facility. Given a set of sites, a set of weighted objects, and a spatial region Q, the optimal-location query defined in [37] returns a location in Q with maximum influence based on the L1 distance where the influence of a location is the total weight of its RNNs. Xia et al. [129] proposed pruning techniques based on a metric named minExistDNN to retrieve the top-t most influential sites according to the total weights of their RNNs inside a given spatial region Q. The Optimal Location Selection (OLS) search was introduced in [46], which retrieves target objects in a target object set DB that are outside a spatial region R but have maximal optimality with a given data object set DA and a critical distance dc. Here, the optimality of a target object b2DB located outside R is defined as the number of the data objects from DA that are inside R and meanwhile have their distances to b not exceeding dc. Another group of location optimization queries are defined with a different optimization function which minimizes the average distance between a client and the nearest facility. Zhang et al. [137] proposed the Min-Dist Optimal Location Query (MDOLQ). Given a set S of sites, a set O of weighted objects, and a spatial region Q, MDOLQ returns a location for building a new site inQ, which minimizes the average distance from each object to its closest site according to theL1 distance. They provide a progressive algorithm that quickly suggests 17 a location, tells the maximum error it may have, and continuously refines the result. When the algorithm finishes, the exact answer can be found. Because user movements are usually confinedtounderlyingspatialnetworksinpractice,Xiaoet al.[130]extendedOLQtosupport queries on road networks. They design a unified framework that addresses three variants of optimal location queries. By observing that users can only choose from some candidate locations to build a new facility in many real applications, Qi et al. [94] introduced the Min-dist Location Selection Query (MLSQ) based on the studies in [137, 130]. Given a set of clients and a set of existing facilities, MLSQ finds a location from a given set of potential locations for establishing a new facility where the average distance between a client and her nearest facility is minimized. MND, a method, for efficiently solving MLSQ, employs a single value to describe a region that encloses the nearest existing facilities of a group of clients, is presented in [94] . However, these studies differ from the proposed query type in definition and optimization functions. Consequently, we cannot use them for answering our novel query type. 2.4 Spatial Keyword Queries 2.4.1 k Nearest Neighbor Queries In spatial databases,k nearest neighbor (kNN) and range queries are fundamental query types. Thesetwotypesofspatialquerieshavebeenextensivelystudiedandappliedinvarious geographic information system (GIS) applications. By employing the R-tree [51, 21] based indices, depthfirstsearch(DFS)[101]andbestfirstsearch(BFS)[55]havebeentheprevalent branch-and-bound algorithms for processing nearest neighbor queries in Euclidean spaces. However, neitherDFSnorBFScanbeappliedtospatialnetworksforansweringkNNqueries. For answering spatial queries on road networks, Papadias et al. [87] developed a Eu- clidean restriction and a network expansion framework to efficiently prune the search space. Based on the proposed frameworks, solutions for nearest neighbor queries are designed in the context of spatial network databases. In addition, a network Voronoi diagram-based 18 solution for kNN searches in spatial network databases is presented in [65] by partitioning a large network to small Voronoi regions and pre-computing distances both within and across the regions. Moreover, Jensen et al. [59] proposed a data model and definitions of abstract functionality needed for moving kNN queries in road networks and designed corresponding solutions. Because most Dijkstra?s algorithm based kNN solutions have been shown to be efficient only for short distances, Hu et al. [56] proposed an efficient index, distance signa- ture, for distance computation and query processing over long distances. Their technique discretizes the distances between objects and network nodes into categories and then en- codes these categories to accelerate kNN search process. Furthermore, in order to speed up kNN searches, Samet et al. [102] designed an algorithm to explore the entire network by pre-computing the shortest paths between all the vertices in the network and employing a shortest path quadtree to capture spatial coherence. With the algorithm, the shortest paths between all possible vertices can be computed only once to answer various kNN queries on a given spatial network. Nevertheless, all the aforementioned techniques mainly focused on the distance metric. They did not consider text description (keywords) of spatial objects in their query evaluation processes. 2.4.2 Text Retrieval Text retrieval is another important topic related to spatial keyword queries. There are two main indexing techniques, inverted files and signature files, widely utilized in text retrieval systems. According to experiments made by Zobel et al. [141, 140], signature files require a much larger space to store index structures, and are more expensive to construct and update than inverted files. In addition, Baeza-Yates and Ribeiro-Neto [20] also stated that inverted files outperform signature files in most cases. Although these aforesaid methods perform quite well in text retrieval applications, none of them can efficiently process spatial keyword queries. In other words, it is impractical to answer spatial keyword queries by simply employing approaches introduced in this or the 19 previous subsection. An effective way to handle spatial keyword queries is to combine the two groups of techniques as discussed in the following subsection. 2.4.3 Spatial Keyword Queries As local search services become more and more popular, many solutions [25, 31, 40, 53, 28, 139, 136] have been developed to evaluate spatial keyword queries by integrating index techniques previously used in spatial queries and text search. Location-based web search is studied by Zhou et al. [139] to find web pages related to a spatial region. They described three different hybrid indexing structures of integrating inverted files and R*-trees together. According to their experiments, the best scheme is to build an inverted index on the top of R*-trees. In other words, the algorithm first sets up an inverted index for all keywords, and then creates an R*-tree for each keyword. This method performs well in spatial keyword queries in their experiments; however, its maintenance cost is high. When an object insertion or deletion occurs, the solution has to update the R*-trees of all the keywords of the object. Cong et al. [31] illustrated a hybrid index structure, the IR-tree, which is a combination of an R-tree and inverted files to process location-aware text retrieval and provide k best candidates according to a rank system. They also proposed the DIR-tree and the CIR-tree, two extensions of the IR-tree, which take both minimizing areas of enclosing rectangles and maximizing text similarities into account during construction procedures. Recently, Cary et al. [25] proposed an efficient approach of answering top-k spatial boolean queries. They combined an R-tree with an inverted index to search the k best candidates which satisfy a group of boolean constraints. However, with their method, only candidates which completely meet boolean constraints will be found. The ones merely matching part of the constraints will simply be discarded because of strict constraints or an error input by mistake. Felipe et al. [40] developed a novel index, IR2-Tree which integrates an R-tree and signature files together, to answer top-k spatial keyword queries. They record signature 20 information in each node of R-trees in order to decide whether there is any object which satisfies both spatial and keyword constraints simultaneously. However, the size of space for storing signatures in each node is decided before IR2-Tree construction. Once the IR2-Tree has been built, it is impossible to enlarge the space unless reconstructed. If the number of keywords grows quickly, a system will spend a lot of time on repeatedly rebuildingIR2-Tree. Hariharan et al. [53] proposed an indexing mechanism, KR*-tree, which combines an R*-tree and an inverted index. The difference between their solution and [40] is that they only store related keywords in each node of an R*-tree in order to avoid merging operations to find candidates containing all keywords. Consequently, the number of keywords that appear in each node varies. However, such a complicated indexing technique has a high maintenance cost as well. If an object with new keywords is inserted, the method not only has to add new keywords to corresponding nodes from leaf to root of the R*-tree, but also update the inverted index (KR*-tree List). Althoughthereareanumberofpreviousstudiesonspatialkeywordqueries,thesolutions can only evaluate queries in Euclidean spaces. This limitation is due to the adoption of the R-tree (or its variants), which cannot index spatial objects based on network distances, into their hybrid index structures. In addition, it is infeasible to provide partial results with existing solutions, which consider both spatial and keyword constraints simultaneously. 21 Chapter 3 Frog: a Framework for Context-Based File Systems ThischapterpresentstheFrogframeworkforContext-BasedFileSystems(CBFSs) that aim at simplifying the development of context-based file systems and applications. Unlike existing informed-based context-aware systems, Frog is a unifying informed-based framework that abstracts context-specific solutions as views, thereby allowing applications to make view selections according to application behaviors. The framework can not only eliminate overheads induced by traditional context analysis, but also simplify the interac- tions between the context-based file systems and applications. Rather than propagating data through solution-specific interfaces, views in Frog can be selected by inserting their names in file path strings. With Frog in place, programmers can migrate an applications from one solution to another by switching among views rather than changing programming interfaces. Since the data consistency and resource protection are automatically enforced by the framework, file system developers can focus their attention on the context-specific solutions. We implement two prototypes to demonstrate the strengths and overheads of our design. The Bi-context Archiving Virtual File System (BAVFS) utilizes conservative and aggressive prefetching for the contexts of random and sequential reads. The Bi-context Hybrid Virtual File System (BHVFS) combines the update-in-place and update-out-of-place solutions for read-intensive and write-intensive contexts. Our experimental results show that the benefits of Frog-based CBFSs outweigh the overheads introduced by integrating multiple context- specific solutions. Thischapterisorganizedasfollows. Section3.1introducesbasicconceptsofthecontext- based file system, and compares it with context-aware systems. Section 3.2 illustrates the 22 design of the context-based file system and the Frog framework. The implementation details of the two prototypes are described in section 3.3. The experimental results are shown in section 3.4. We discuss several interesting topics related to our design in section 3.5. 3.1 Context-Based File Systems 3.1.1 Basic Concepts An I/O Context is a set of interrelated conditions, under which I/O operations are performed. A file system may leverage access-pattern information exploited from an I/O context (i.e., conditions) to improve I/O performance. For instance, data can be prefetched before being required by applications in a sequential-read context. An I/O Context-Specific Solution is a set of techniques or mechanisms deployed to maximize benefits (e.g., performance and energy efficiency) of I/O operations in a given context. Regularly, a context-specific solution that fits one context may be inadequate for another context. For example, a file system may benefit from a prefetching solution in a sequential-read context, whereas the solution can cause performance degradation in a random-read context due to retrieving unused data. A Context-Based File System (CBFS) is a file system that encapsulates an array of I/O context-specific solutions. A CBFS exposes the solutions to and receives context selections from applications. When a context is selected by an application, I/O operations will be processed by the context-specific solution of CBFS. A Context-Based Application is a program that is able to identify its contexts, perceive contexts provided by systems, and select the best context leading to good perfor- mance. Differing from context-aware applications, context-based applications can interact with context-based systems to select contexts based on both I/O operations and available contexts offered by the systems. 23 3.1.2 Context-Aware vs. Context-Based File Systems In this subsection, we show differences between context-based file systems and context- aware file systems. Context-aware file systems can be roughly divided into two categories: heuristic-based and informed-based systems. Fig. 3.1 illustrates that three types of file systems encapsulate a set of context-specific solutions. The methods of context identification and solution selection are completely different among the three approaches. App Coordinator Solution 1 Solut ion nDB I/Os App Coordinator Solution 1 Solut ion n I/Os & Descriptions App Solution 1 Solut ion n Context Choices Heuristic-Based Context-aware File System Informed-Based Context-aware File System Context-Based File System Context Choices Context Choices Context Information ... ... ... Figure 3.1: A comparison of context-aware systems. In a heuristic-based file system, a coordinator keeps track of system-wide I/O operations in an internal database. By analyzing historical I/O accesses, the coordinator identifies a context for current I/O operations. The advantage of heuristic-based systems is that context processing is transparent to applications. Any application can be supported by heuristic- based systems without modifying the application?s source code. Heuristic-based systems have three major drawbacks. First, maintaining I/O records increases the complexity of system design, and incurs extra costs on behavior analysis. Second, accuracy of context selections relies heavily on the analysis. Because applications may exhibit different I/O patterns, recording and analyzing their behaviors may have negative impacts on context identifications. Third, applicationswithrandombehaviors(e.g., triggeredbyevents)produce noises to the context identification process. 24 In an informed-based file system, a coordinator receives I/O operations with their de- scription (e.g., represented by data structures) through dedicated interfaces. The description is transformed into context choices based on particular rules. The informed-based systems do not suffer from the overhead of back-end analysis; accuracy of context identification de- pends on applications. However, the systems have to incorporate additional interfaces to collect extra information from applications. Currently, the interfaces are abstracted in var- ious ways. The lack of consistent information propagation prevents the informed approach from widespread deployment. Context-based file systems, or CBFSs, aim to overcome the aforementioned problems in context-aware file systems. CBFS can be envisioned as a special informed-based system withoutrelyingoncontextanalysis. Thecontextselectionsarepropagatedfromapplications. In addition, CBFS provides unifying interfaces between file systems and context-based ap- plications, meaning that the applications can easily and freely migrate among context-based file systems. 3.1.3 The Frog Framework for Context-Based File Systems Encapsulating multiple context-specific solutions in a file system is challenging. The methodsoforganizingmetadataandphysicaldatainthesolutionsmaybedifferent, implying that a potential conflict may allow only one method to be applied due to the fact that most systems have only one copy of metadata and physical data. If CBFS maintains multiply copies of data, each of which is managed by solutions individually, other issues are raised by data duplications. Take the consistency issue for example; once data handled by a solution has been updated, other solutions have to change their managed data accordingly. To simplify context-based file system design, we propose a general framework, Frog, that provides (1) a unifying interaction mechanism between systems and applications, and (2) a solution encapsulation mechanism. Fig. 3.2 shows that the context-specific solutions are abstracted as views in Frog. Frog is implemented as an intermediate layer between 25 A pp Fr og View 1 View n View C ho ices Fr og -based C B F S ... Figure 3.2: The framework of a Frog-based CBFS. applications and views. Frog and applications exchange the view information through a unifying interface at runtime. Propagating a view from applications to Frog indicates that thecorrespondingcontextisselectedbytheapplications. Fromtheperspectiveofapplication development, the unifying interface significantly reduces the costs of solution migrations, replacing an existing solution with a new one so that only the view information needs to change. Importantly, Frogmakesiteasytocreateindividualspacesforcontext-specificsolutions, each of which maintains data in its own space. Separating data spaces can allow a solution to be plugged in as if there were only one solution in the file system. Frog is in charge of solution maintenance issues such as consistency. Due to the independence among solutions, one or all of the solutions can have their own heuristic-based methods, e.g., tracking and analyzing I/O behaviors in specific contexts. In Section 3.5.3, we present a special case in which heuristic-based methods are utilized to boost the I/O performance of view-unaware applications. 3.2 Design In Frog, we creatively introduce the view concept into file systems. Three essential attributes, file metadata management, physical data management and I/O operations, are encapsulated in views. In the subsection, we highlight the overview of Frog as well as new 26 concepts by illustrating Frog-based CBFSs. Next, due to the diversity of physical data management, three Frog-based CBFS designs are proposed. Finally, we discuss three types of overheads in Frog-based CBFSs. Since our study mainly focuses on designing a framework for CBFSs, we do not demon- strate context-specific solutions in details. In real CBFS development, existing solutions can be selected to be implemented in views. We will show two concrete prototypes as CBFS examples in the next subsection. 3.2.1 Overview Fig. 3.3 shows the hierarchical structure in the metadata space of Frog. Directories, views, and files are three types of abstractions represented by white, gray, and black boxes, respectively. These directories and files are called View-Enhanced Directories (VEDs) and View-Enhanced Files (VEFs). To concisely and accurately illustrate our design, we focus on the differences between Frog-based CBFSs and traditional file systems. We do not discuss the design principles of Frog-based CBFSs that are similar to those of the traditional file systems. File View Directoryview1 view2 root dir1 dir2 ...F1'? Fn?? View2'? ...F1'? Fn?? View1'? ...F1' Fn? View2' ...F1' Fn? View1' Figure 3.3: Overview of a Frog-based CBFS. A Frog-based CBFS is organized in a tree structure, the root of which is an instance of VED. A VED manages other VEDs and a set of views, but does not directly contain any VEF. The number of views in a VED is configured when the CBFS is developed. Once the 27 CBFS is mounted to a point of the naming space, the number of views cannot be modified (dynamic configuration will be considered in future work). Note that views under a VED are the instances of different view types. Fig. 3.3 illustrates an example of two view types configured in a Frog-based CBFS. view1, view1? and view1? are three instances of a view type, and view2, view2? and view2? are instances of another view type. In the root node, view1 and view2 are two instances of different types. Throughout this section, views are referred to as view instances if the views are not explicitly indicated as view types. Views are an intermediate layer between VEDs and VEFs. A view is a container that organizes only VEFs rather than VEDs or other views. Thus, views do not contribute to the hierarchy construction. The views under a VED expose logically identical interfaces and metadata of VEFs (e.g., file names and file sizes) to applications; however, the implementa- tion details (e.g., managing VEFs in a B-tree or a hash table) of views may vary. VEFs are the basic units that refer to physical data on devices. In traditional file systems, a file is a logically unique unit that has only one copy of metadata. However, Frog maintains multiple metadata copies (VEFs) for a logical file. VEFs are always the leaf nodes in the tree structure. It is worth noting that two special cases of Frog-based CBFSs are the Single-context File System (SFS) and Bi-context File System (BFS). SFS is configured with one view type. Only one view instance is created in VEDs. SFS degrades to a traditional file system that does not duplicate any metadata of logical files. The only difference between SFS and the traditional file system is that SFS separates the file metadata and encapsulates them in views. Unlike SFS, BFS is the simplest form of Frog-based CBFSs that maintains duplication. Compared with other Frog-based CBFSs with more than two views, BFS incurs the minimum overhead in data duplication management. Fig. 3.3 shows an example of BFS, where the root VED has two VEDs (i.e., dir1 and dir2) and two views (i.e., view1 and view2). These two views do not have any VEF. dir1 has two views (i.e., view1? and view2?), each of which contains a number of VEFs. The file 28 information exposed from view1? and view2? is identical. dir2 has organization similar to dir1. 3.2.2 Operations Now we discuss the operations of VEDs, views, and VEFs at the file system level. Let us emphasize on new features supported by Frog. In order to show compatibility with existing applications, we set views of a particular type in VEDs as default views. The default views play a key role in consistency maintenance and locking mechanism design. Frog-based CBFS Initialization/Finalization Let us start the operation description with the initialization/finalization processes in Frog-based CBFSs when the file systems are mounted/unmounted at a point in the nam- ing space. Before deploying a Frog-based CBFS, a format program should be executed to construct a super block, set up metadata and data partitions, and create a root VED and its sub-views. The format program builds an empty file system, the information of which is stored on a disk. When the CBFS is mounted, the root VED and sub-views are loaded into main memory. Then, the CBFS is ready to serve I/O requests. When the CBFS is unmounted, all metadata should be permanently serialized to disks. VED Operations VED operations supported by Frog-based CBFSs are directory creation, deletion, and listing. These operations differing from those of traditional file systems are that Frog must consider views in addition to directories. In the creation process, a new VED is created and appended to a children list of its parent. Views under the new VED are also generated. Hence, by default, an empty VED has two special directories ("." and "..") and a number of empty views. Views are not created under "." and ".." directories. 29 In traditional file systems, only empty directories are allowed to be deleted. Frog follows this principle; a VED can be deleted if it does not contain any other VEDs (except for "." and "..") and all sub-views are empty. In the process of VED deletion, its sub-views are automatically removed. The directory listing operation (through readdir system call) will be discussed in the default view subsection (see Section 3.2.2), where we show how to make trade-off between supporting compatibility for existing applications and exposing views to view-aware appli- cations. View Operations In Frog, view supports creation, deletion and listing operations. The listing operation can be individually issued, but the creation and deletion have to be committed by VED creation and deletion operations. In the view creation, a new view that maintains an empty VEFlistiscreated. TherelationshipbetweenthenewviewanditsparentVEDisdetermined by the VED creation operation. A deletion operation can only delete empty views that do not have any VEFs. When a listing operation (through readdir system call) is issued on a view, the names of VEFs in the view will be returned. The names of views are pre-determined by their types. For example, in Fig. 3.3, view1, view1? and view1? are instances of a view type. These views can share an identical name (e.g., "view1") without any name conflicts because they are in different VEDs. If the names of view types are different, views in a VED cannot conflict with each other as well; however, similar to "." and ".." directories, the view names pollute the naming space in VEDs, so that the VED names cannot be the same as the view names. The name conflicts will be detected before a VED is created. The operation will be denied if any name conflict exists. The conflicts cannot occur during view creation since there is no sibling-VED existed when a view is created. 30 VEF Operations Due to metadata duplication across views, creating/deleting a file in a view must trigger the view synchronization in order to keep metadata consistent. This synchronization process takes place in any operations, in which the metadata will be modified. Moreover, Frog must consider synchronization issues of concurrent VEF accesses, which significantly differs from tradition file systems. For example, two files may be concurrently created by applications through two views. In another example, one file may be concurrently modified by applications through two views. Without concurrent controls, the metadata consistency will be violated. To simplify our design, the consistency problem is solved by applying a lock mechanism. Two locks are set for VEF creation/deletion and read/write operations, respectively. The creation/deletion lock, setting in the parent VED, has to be acquired before any VEF is created/deleted. The read/write lock mechanism will discussed in Section 3.2.6. Default View Configuration In our design, we classify applications into two categories: view-aware and non-view- aware applications. Obviously, existing applications are non-view-aware. The challenge of supporting both types of applications is that views are visible in view-aware application and invisible in non-view-aware ones. To show compatibility with non-view-aware applications, we set up view instances of a particular type as Default Views in VEDs, which is used when view names are missing from file paths. We describe the following typical scenarios to demonstrate how the default views work. File paths of two types of applications are different. In Fig. 3.3, VEF F1? under dir1 is referred to as the path "/root/dir1/view1?/F1?" in view-aware applications, and "/root/dir1/F1?"innon-view-awareapplications. Thedifferencebetweenthesetwofilepaths is that the view name is visible in the path string in view-aware applications. 31 Directory View File Metadata Physical Data ...... root view1 view2 F1 FnF1 Fn Hybrid ...F1 Fn view3 File 1* ... File n*File 1 ... File n ...... root view1 view2 Shared-Nothing F1 FnF1 Fn File 1 ... File n File 1* ... File n* ...... root view1 view2 Shared-Data F1 Fn F1 Fn File 1 ... File n Figure 3.4: Frog-based CBFS designs. The default views are used when view names are missing from file paths. For example, when the open system call is issued on "/root/dir1/F1?", the last entity in the path string must be the file name F1?, and the next to the file name can be either a VED or a view. This can be decided by the name because view names are pre-determined. If the entity is a VED, the default view name (either view1? or view2?) will be inserted to form a view-aware path. The semantics of directory listing are different between view-aware and non-view-aware applications. When listing a VED, view-aware applications expect to retrieve the names of sub-VEDs and sub-views; whereas non-view-aware applications obtains the names of sub- directories and files. In case that existing applications cannot be modified, our design adapts view-aware applications to non-view-aware semantics, so that the view names will not be exposed in directory listing. Another reason we compromise the view-aware semantics is that similar to "." and ".." directories, the view names are pre-determined. It is reasonable to assume that view names are known as a priori in applications. 3.2.3 Three Frog-based CBFS Designs Metadata management has been discussed in the previous subsections. If we take into accountphysicaldataorganizations, therearethreepotentialdesignsofCBFSs(seeFig.3.4). Note that other forms of CBFSs might be derived from these three designs. 32 Shared-Data CBFS shares one copy of physical data among multiple views. F1s in view1 and view2 refer to the address of File1 stored on disks. Shared-data CBFS simplifies the physical data management by sharing data among views. Shared-data CBFS is applica- ble to the integration of context-specific solutions that are not conflicting in physical data management. BAVFS - a good example of shared-data CBFS - integrates aggressive and conservative prefetching in two views. Theoretically, these two prefetching techniques do not interfere with physical data management. Instead, the techniques address the issues of read operations (e.g., how much data is prefetched each time) by duplicating metadata. Shared-Nothing CBFS advocates a data duplication approach, where each copy is managed by a view; thus, VEFs in views refer to the addresses of data in separate copies. A typical example of shared-nothing CBFS is BHVFS, which maintains data replicas for both update-in-place and update-out-of-place strategies that conflict with each other in data organization. Compared with shared-data CBFS, shared-nothing CBFS has to pay extra overheads to physical data synchronization among views. Hybrid CBFSis a combination of shared-data and shared-nothing CBFSs. In a hybrid CBFS, some views share data whereas others separately manage duplicated data. An ex- ample of hybrid CBFS is to combine update-in-place and update-out-of-place strategies with prefetching techniques, in which update-out-of-place is applied in view1, and update-in-place in view2 and view3. In addition, conservative prefetching is applied in view2 for random reads; aggressive prefetching is used in view3 for sequential reads. 3.2.4 Interactions with Applications Besides the diversity of Frog-based CBFS designs, Frog is incredibly flexible to support a wide range of applications. Thanks to the availability of all views in Frog, one view or multiple views can be chosen by applications. In the current design, views (or contexts) are pre-determined, Frog does not expose the view information by a dedicated interface. Instead, applications provide their view selection by inserting the view names in the file path 33 Pro cess 1 View 1 View 2 read wr ite read w rite Frog -based C B FS Pro cess 2 Pro cess 3 readwrite Figure 3.5: Interactions between processes and Frog. string. Fig. 3.5 displays two typical scenarios of interactions between processes and a Frog- based CBFS. On top of the figure, process 1 concurrently communicates with both views; in particular, it would choose the best view for an operation. For example, if the update-out- of-place strategy is employed in view1 and update-in-place in view2, the process can commit writes through view1 and retrieve data using view2 to achieve better performance. Fig. 3.5 shows another example at the bottom. The data written by process 2 through view1 can be retrieved by process 3 through view2. Theconcreteexamplesofview-awareapplicationsareshowninFig.3.6. foo writes"hello world" in a given file, and bar reads a string from a given file (see the top two functions in Fig. 3.6). In the first case at bottom left, an application processes two files from two views (i.e., update file1 from view1 and read file2 from view2). The application can switch from one view to another at runtime. In the second case at bottom right, an application updates data in a file and another application later retrieves the data from the file. These two applications can be supported by two different views. 34 foo (char* path){ char* str = hello world ; FILE* fh = fopen(path, w ); fwrite(str, strlen(str)+1, 1, fh); fclose(fh); } bar(char* path){ char buf[20]; FILE* fh = fopen(path, r ); fread(buf, sizeof(buf), 1, fh); fclose(fh); } Application 1: void main (){ foo ( .../view1/file1 ); bar( .../view2/file2 ); } Application 2: void main(){ foo( .../view1/file1 ); } Application 3: void main(){ bar( .../view2/file1 ); } Figure 3.6: Examples of view-aware applications. 3.2.5 Application Comparisons Recall that context-based file systems are different from context-aware file systems (see Section 3.1.2). Now we further illustrate their differences by comparing applications run- ning in these systems. Apparently, the applications running on heuristic-based systems are context-unaware applications, which do not identify and disclose context information. On the other hand, applications running on context-aware and context-based systems are context-aware. In contrast to context-aware applications, context-based applications can identify contexts based on a combination of their behaviors and contexts perceived from systems. Such ability offers context-based applications a powerful adaptability that enables the applications to perform well in diverse and changeable systems. The ease of context-specific solution migration estimated by the efforts paid on solu- tion replacement is different among the three types of applications. Because heuristic-based applications are transparent to solutions, no application modification is necessary when ap- plications are migrated from one solution to another; however, the other two types of appli- cations are aware of solution replacements. Thanks to a unifying interface provided by Frog, 35 context-based applications only change view names, whereas context-aware applications may have to shift to a new interface applied by the new solution. Fig. 3.7 illustrates a concrete example where an application must be modified in case of a solution replacement. In this example, a prefetching solution proposed by Patterson et al. [90] has been employed in the file system. For some reasons (e.g., hardware updates such as replacing HDDs with SSDs), the solution needs to be replaced by another prefetching solution (for example, [111]). In this case, both the file system and application have to be modified accordingly. context-aware-app() { char buf[20]; int fd = open( /tmp/file , ...); context-based-app() { char buf[20]; int fd = open( /tmp/view1/file , ...); read(fd, buf, sizeof(buf)); close(fd); } //initialize context parameter structure context ctx; ctx_action(CTX_BEG); ctx_pread(ctx, buf); ctx_action(CTX_ENG); //initialize hints HINT_STRU hints; hints.filename = /tmp/file ; hints.pattern = SEQ_READs; ioctl (fd, REQ, &hints); read(fd, buf, sizeof(buf)); close(fd); } int fd = open( /tmp/view2/file , ...); code code Codes related to the old solution Codes related to the new solution Figure 3.7: Application modification comparisons. An intuitive modification method is to remove codes related to the old solution, and to write new codes for the new solution. The old codes are indicated by strike-through texts; 36 the new codes are highlighted by gray background. In the context-aware applications (on the left-hand side in Fig. 3.7), a total of 6 lines of codes are removed and 5 lines are newly written, which is more than 2-line changes in the context-based applications (on the right- hand side in Fig. 3.7). Please note that changes of real-world context-aware applications are more complex than those demonstrated in this example. 3.2.6 Overheads Due to data duplication, Frog introduces three types of overheads that are data consis- tency, resource contention and data duplication. Data Consistency In Frog, views are managed in separate spaces. Once an update is issued in a view, the update will be synchronized with other views in order to keep all of the views consistent. Due to replicated metadata, all three designs mentioned earlier have to address the issue of metadata synchronization. The more views configured in Frog, the higher the consistency maintenance overhead is. Importantly, the overhead of physical data synchronization differs significantly among the designs. Shared-data CBFSs do not conduct any physical data synchronization, whereas shared-nothing CBFSs have to synchronize all updates among the views. Hybrid CBFSs can be anywhere between shared-data and shared-nothing CBFSs. Files are normally significantly larger than their metadata [17] [116]; thus, data consistency overhead is dominated by physical data synchronization if there exists one. Moreover, we also consider crash recovery issues that Frog can automatically reconcile among views; even some of them are in inconsistent states. We propose two mechanisms for metadata protection. Centralized Journaling encapsulates the journaling processing in the framework as shown in Fig. 3.8. The Frog will create a separate area that tracks metadata updates in a journal file [106]. The updates will be written to disk before committing them to the main file system. If the file system crashes during the processing, the journal file will 37 bring the system to a consistent state. We can adopt an existing journaling method in Frog. For physical data updates, we can simply apply ordered update mechanism that data blocks will be committed to disk before the metadata [78]. When committing metadata updates to the main file system, the updates will be com- mitted to one view. Other views will cache the updates in memory until the disk is not busy. In this way, the file system has at least one view in consistent state, and the costs of synchronization is hidden in back end. In addition, the parent directory should maintain a bit map to indicate that which view is in consistent state. When the system recovers from a crash, file information will be retrieved from the view in consistent state. The consistent view can be either statically (e.g. the default view) or dynamically selected by a particular criteria. CentralizedJournalingreducesthedifficultyofsystemdevelopmentbyprovidingaunify- ing journaling mechanism. Developers can focus on the functionalities of the context-specific solutions. On the other hand, the flexibility of system design is compromised. All meta- data updates have to be journaled, which prevents from integrating a non-journaled solution (probably for performance consideration). In addition, the entire file system is running un- der a journaling mechanism. Other metadata protection methods, such as soft update [79], log-structure file systems [99] and copy-on-write [1] [23], can not be utilized in the file system. The other mechanism is Distributed Journaling that provides developers with flexible journaling choices. Distributed Journaling mechanism separates file updates from directory updates. The directory updates are journaled by Frog as centralized journaling mechanism does; however, all file updates are directly forwarded to views, which protect metadata individually. During the update distribution, one view is selected as a consistent view, and commits the updates to disk. Other views cache the updates in memory. In each view, the view-specific solution selects the metadata, even physical data, protection mechanisms. Under distributed journaling mechanism, any metadata protection technique can be used in views. However, the metadata consistency is actually protected by the consistent 38 Frog-based CBFS Metadata updates Decentralized Me chanism Journalin g View 1 View 2 Frog-based CB FS Metadata up dates Centralized Mechanism A view in consiste nt state A view in inconsistent state Directory Journaling File Journali ng View 1 View 2 File Journali ng Figure 3.8: Two consistency mechanisms. view. If the view is dynamically selected, the consistency is actually protected by the weakest mechanism among views. The bit map in the directory has to be updated after view updates, which will be processed by the directory journal in Frog. If a crash occurs between directory and file updates, the system will recover from a consistent view indicated by the bit map in the last update, which means that the latest file update is lost. Resource Contention Resource contention in traditional file systems takes place when two concurrent requests (e.g., a read and write) are issued on a file. A file locking mechanism can be utilized to solve this problem. Before issuing any I/O operation, applications must successfully acquire a lock. We have to carefully address the contention problem in the context-based file system because two requests are allowed to be issued on a logical file from two views by view-aware applications. In a traditional file system, the locking mechanism is carried out in two steps. First, applications retrieve a file descriptor by opening a file. Then, a file lock operation is issued through flock or fcntl on the file descriptor. Frog follows the two steps. When opening a 39 FD1 FD2 FLB User SpaceFile System Space FO2FO1 lockfcb ... ... lock fcb... ... ... FCB1 flb... View 1 ... FCB2 flb... View 2 ... FCB3 flb... View 3 Figure 3.9: File locking among views. file, Frog will create a FO (File Object) that interacts with applications. The FDs (File Descriptor) held by applications refers to the FOs, shown in Fig. 3.9. The FO has two key fields: lock refers to a FLB (File Locking Block) that manages the lock associated with the logical file. fcb points to FCB (File Control Block) (e.g., an inode) in the tree structure. Only the flb field of the FCB in the default view (e.g., view 1) refers to the FLB. The flb of the FCB in other views are always NULL. The details of the lock operation is described in Algorithm 1. The fundamental idea is that only one lock is created for a logically unique file. When a locking operation is issued, FCB in the default view is loaded into memory. Before an FLB is associated with a logic file, the lock fields in FOs and flb fields in FCBs are NULL. When a lock operation is issued, Frog checks the associated FO at first. If it refers to a FLB, the locking request will be committed on the FLB. Otherwise, the flb field of the corresponding FCB in the default view will be checked. If a FLB has been created, the lock field of the FO will refer to the FLB. Otherwise, a new FLB will be created, and the locking operation will be issued on the newly created FLB. 40 Algorithm 1 flock(fd, op) 1. Get fo by fd 2. if fo.lock != NULL then 3. lock (fo.lock, op) 4. return SUCCESS 5. end if 6. Find fcb in the default view, which manages a logically unique file with fo.fcb 7. if fcb.flb == NULL then 8. Create a new file locking block lb 9. fcb.flb = lb 10. end if 11. fo.lock = fcb.flb 12. lock (fo.lock, op) 13. return SUCCESS Duplications Frog requires extra disk and memory space to accommodate metadata duplications. Increasing the number of views in the file system can drive the disk space overhead up. The space overhead for physical data duplication greatly varies among designs. As shown in Fig. 3.4, the shared-data CBFS does not have any physical data replicas; on the other hand, the number of data replicas maintained by the shared-nothing CBFS is equal to the number of views. The reason that we create metadata replicas in all three designs and do not carry out any de-duplication technique is that the size of metadata is relatively small compared to the overall capacity of file systems. According to the study that the mean and median fullness (ratio of usage to capacity) of file systems are between 40% to 50% [17]. Disk space is available to metadata replicas in most cases. In addition, disk space is seemingly wasted for physical data duplications; however, data reliability of file systems is improved by the virtue of data replicas. When one data copy is lost due to bad sectors, the data can be retrieved from other copies. Benefiting from flexibility of system design, data replicas can be distributed across multiple disks or servers to further boost system reliability. Typical 41 storage systems that offer data replicas include RAID-1 (data is mirrored) [89], GFS [48] and HDFS [108] (3 copies are created by default). 3.3 Case Studies Weconducttwocasestudiestoevaluatetheeffectiveness,generality,andflexibilityofthe Frog framework. In the first case study, we outline the implementation issues of BAVFS - the shared-data design of a context-based file system that applies a dual-mode (i.e., aggressive and conservative) prefetching for the context of sequential and random small reads. In the second case study, we focus on BHVFS - the shared-nothing design that separates read- intensive contexts from write-intensive ones, in each of which either the update-in-place or update-out-of-place strategy is employed to improve I/O performance. The prototypes of BAVFS and BHVFS are implemented on the top of the FUSE file system [115], which is popular in the file systems research community [57][114][125]. Di- rectories are used to emulate views in a hierarchy structure. Our prototypes enforce the differences between the two abstractions. We adopt the distributed journaling mechanism without directory journaling in the framework of both prototypes because the underlying file systems (Ext 4 in our experiments) take care of metadata protection. The operation interfaces implemented in the prototypes include: open, create, unlink, read, write, release, mkdir, rmdir, readdir, getattr, and lock. We may further optimize the prototypes. For example, considering alphabetical order in the metatada management of BAVFS, applications can sort data returned from readdir. Nevertheless, we implement the prototypes in a straightforward way while focusing on the strengths and overheads of context-based file systems. 42 3.3.1 The BAVFS File System Overview The data explosion problem has been observed in the past few years. The number and diversity of files are growing rapidly [45]. Recent studies [17][116] show that more than 50% of all files in a file system are small (e.g., smaller than 4 KBytes); thus, it is critical to develop modern file systems that can efficiently process large number of small files [24][88][104]. Two common types of data access patterns in I/O-intensive applications are sequential and random reads. For example, all small files in a directory are sequentially scanned by antivirus softwares, whereas one of these files is simultaneously opened by an editor. Se- quential data retrieval can be improved by prefetching techniques [68] [74] [90]; however, the unpredictability of future accesses is a natural barrier to adopting the prefetching techniques. Unnecessary prefetched data can significantly degrade system performance. We implement the dual-mode I/O prefetching in a file system. A similar dual-mode prefetching mechanism can be found in the realm of instruction retrievals [72]. Aggressive prefetching is applied in the sequential context, thereby avoids unnecessary prefetches due to random data retrievals; on the other hand, conservative prefetching is employed in the random context. BAVFS is different from existing prefetching approaches in that BAVFS provides a context-based prefetching approach, where not only data in operating files but also in files to be accessed in future can be prefetched based on the contexts. Implementation To shorten relocation times in the sequential view, small files in a directory are ag- gregated into large files. As shown in Fig. 3.10, the physical data of files are merged into big .data files. The .meta files store the metadata in the views. The .data and .meta files appear in pair, and the order of metadata in .meta keeps the same with the ones in the 43 corresponding .data file. When a file is required, its metadata will be loaded in memory under the random view; whereas the entire .meta file will be loaded in the sequential view. Rand om View File1 File2 File3 Sequent ial View .data F1 F2 F3 Files on disks .meta File1 File2 File3 Figure 3.10: Structure of BAVFS. An extendible hash method is used in file name mapping [39] (see Fig. 3.11). In both views, 128-bit hash values are generated from file names. The first 16 bits form the names of .meta files; the remaining 112 bits represent a unique index number in the .meta files. Each .meta file manages a set of entities, each of which is made up of an index number and file metadata information, including the location of physical data (i.e., :log file name, offset, and file size). An update operation may be expensive. For instance, if the current block of a file is not sufficient to accommodate an update due to an increase in file size, the entire file must be moved to another location. Though such data movement incurs extra I/O costs, the design is reasonably effective for write-once-read-many access patterns [108]. Moreover, although two copies of metadata are managed in the two views, they actually share one copy of metadata on disk. The difference of metadata management in the views only exists in memory. When a read operation is issued, 30 KBytes of data is prefetched and cached in memory in the sequential view. On the other hand, no data is prefetched in the random view. The 44 File Names 16 bits 128-bit string hash 112 bits ... Metadata ... ... ... ... ... ... ... ... ... ... ... 16 112 112 Metadata Metadata Metadata Metadata Metadata .meta file .meta file Figure 3.11: Name processing in two views. reads are directly forwarded to the underlying file system. The random view is set as the default view of BAVFS. 3.3.2 The BHVFS File System Overview The efficiency of data retrieval and storage is critical in modern file systems. Two prevailing file systems, FFS [80] and LFS [99], embrace optimization techniques for reads or writes; however, few existing file systems are able to provide competitive performance on both reads like FFS does and writes like LFS does. To maximize the strengths of LFS and FFS, hFS [138] incorporates them for both types of operations. Since either LFS or FFS technique is selected in hFS for a file according to the file size, the file can only benefit from one technique rather than two. In addition, quFile proposes a method that encapsulates physical representations of files for contexts at the file level. Enjoying the benefits from FFS and LFS cannot be addressed by quFile at the file system level. The aforementioned limitations of hFS and quFile motivate us to develop a BHVFS that can exhibits the strengths of both FFS and LFS. Their two completely different strategies are adopted in read-intensive and write-intensive views, each of which shows the strengths and weaknesses of either FFS or LFS. If the views are appropriately selected by 45 view-aware applications, the strengths of both views are maximally leveraged to achieve high performance for both reads and writes. Implementation Fig. 3.12 shows that both views individually manage metadata in the metadata space. File metadata in a directory is organized in form of a balanced search tree. In the physical data space, the FFS view applies the update-in-place strategy that overwrites data in the corresponding files; on the other hand, the LFS view uses the update-out-of-place strategy that appends updates to big log files. In addition, the LFS view maintains meta files to record the metadata information. We do not create a cleaner program to reorganize the log files as traditional log-structured file systems do because in most cases reads are issued in the FFS view, in which data are logically and continuously managed. FFS View LFS View F1 F2 F3 File 2 File 3 File 1 log File 1 File 2 File 3 F1 F2 F3 Metadata Space Physical Data Space meta ... ... ... Figure 3.12: The structure of BHVFS. 3.4 Evaluations In this subsection, we examine the performance of context-based file systems by evalu- ating BAVFS and BHVFS using our testbed. We mainly focus on 1) investigating strengths derived from context decomposition; and 2) quantifying overheads incurred by introduc- ing multiple views. BAVFS-CP and BAVFS-AP denote conservative and aggressive prefetching views in BAVFS. BHVFS-FFS and BHVFS-LFS denote FFS and LFS views 46 in BHVFS. To fairly compare BAVFS and BHVFS with existing file systems, we have to consider the negative impacts imposed by the FUSE module and virtual file systems (VFS). Rather than minimizing the negative impacts of FUSE and VFS, we create the following four baseline file systems using FUSE. FS-CP and FS-AP are traditional file systems that apply identical metadata management, data layout and I/O operations with BAVFS-CP and BAVFS-AP, respectively. The only difference is that FS-CP and FS-AP do not contain views. FS-FFS and FS-LFS are fuse-based FFS and LFS. WeevaluatethesefilesystemsonaUbuntudesktopcomputerwith2.2GHzIntelCeleron CPU, 1 GBytes memory and two 160 GBytes Sata disks [11]. The Linux kernel version is 2.6.35. Ext 4 [78] is configured as the underlying file system, on which FUSE-based file systems are running. In the experiments, BAVFS is running on a single disk; whereas two disks are utilized for BHVFS, in which views maintain their data in separate disks. We conduct the following three groups of experiments. Metadata Operations. The first group is designed to measure the performance of file creation and deletion. Random and Sequential Reads in BAVFS. We evaluate the impacts of aggressive and conservative prefetchings on sequential and random reads in BAVFS. Random Reads/Writes in BHVFS. We quantify the strengths and weaknesses of LFS and FFS in the context of BHVFS by issuing random reads and writes to the file system. During each experiment, we focus on testing a single view instead of the entire file sys- tem. This testing strategy allows us to demonstrate the performance of basic I/O operations in the light of multiple views. In the figures shown in the following subsections, BAVFS- CP, BAVFS-AP, BHVFS-LFS, and BHVFS-FFS indicate results of issuing operations on the views. In order to show the performance of BAVFS and BHVFS, we conduct the experiments by switching the default views that support non-view-aware applications with appropriate view selection. Note that the existing benchmarks and applications are non-view-aware. In 47 the two prototypes, write operations in the default view are pushed to disks by fsync, and update operations on the other view are processed by a thread at the back end. 3.4.1 File Creation/Deletion Rate To test file creation/deletion rate, we use postmark [62] - a file system benchmark - with an initial set of 1,000,000 small files. We set file sizes anywhere between 1 to 2 KBytes. Fig. 3.13 reveals the file creation/deletion rates (i.e., the number of files created/deleted per second) without other I/O transactions. Creation/Deletion Rate (#/s) File Systems/Views BAVFS?CPBAVFS?APBAVFSFS?CPFS?AP0 2000 4000 6000 8000 10000 12000 Creation RateDeletion Rate (a) BAVFS BHVFS?LFSBHVFS?FFSBHVFSFS?LFSFS?FFS0 2000 4000 6000 8000 10000 12000 14000 Creation/Deletion Rate (#/s) File Systems/Views Creation Rate Deletion Rate (b) BHVFS Figure 3.13: File creation/deletion. Fig. 3.13(a) displays the comparisons among BAVFS, FS-CP, and FS-AP. We observe that the creation/deletion rates of BAVFS-CP (713/9345) and BAVFS-AP (712/9345) are slightly lower than those of FS-CP (727/9433) and FS-AP (726/9433). The differences in creation/deletion rates are less than 3%, because BAVFS does not actually maintain any replica on disks; even metadata are simply duplicated in memory. When a file is created or deleted, only one update is committed to disks. The creation/deletion process in BAVFS is similar to those in FS-CP and FS-AP. Fig. 3.13(b) shows that the file creation/deletion rates of BHVFS-LFS are 664/10526, which are 18%/3% slower than that of FS-LFS (809/10869). Compared with FS-FFS 48 (339/7575), BHVFS-FFS (299/6802) suffers little (i.e., 11%/10%) performance degrada- tion in file creation/deletion. Thanks to committing updates to separate disks, the overhead is hidden by the back-end synchronization process. Moreover, creating/deleting files in the LFS manner is better than FFS, because LFS only appends updates at the end of .meta files rather than generating and removing files. BHVFS performs better than FS-FFS in both file creation and deletion, since BHVFS derives the performance from the LFS view. The performance degradation of BAVFS and BHVFS can be attributed to metadata management. Apart from metadata, physical data duplications may contribute to the per- formance degradation. When a file is created or deleted, the file object should be constructed or destroyed in the two views; however, the execution time of creation and deletion is not doubled in all cases. The overheads can be avoided or reduced by a careful design of the file systems. 3.4.2 BAVFS Small random and sequential reads are evaluated in BAVFS, FS-CP, and FS-AP. We create 2,000,000 files of size 1-2 KBytes by a modified postmark that does not execute transactions and delete files after file creation. We use ClamAV [64], an open source antivirus engine as a representative application with sequential access pattern, and Grep [3] as an example with random access pattern, searching a keyword "Frog" from 200,000 files in a random order. Fig. 3.14 displays the execution time of the two applications. FS-CP (797/821) performs better than BAVFS-CP (857/842); FS-AP (1048/745) is better than BAVFS-AP (1096/755) for random and sequential access patterns, because view- maintenance costs adversely affect the performance of BAVFS. In addition, random reads are more efficient in BAVFS-CP and FS-CP. Conservative prefetching does not prefetch unnecessary data for random accesses. On the other hand, high hit rates of aggressive prefetching make BAVFS-AP and FS-AP perform better than BAVFS-CP and FS-CP in the case of sequential accesses. 49 Execution Time (s) File Systems/Views BAVFS?CPBAVFS?APBAVFS FS?CP FS?AP0 200 400 600 800 1000 1200 1400 Grep: Random Reads ClamAV: Sequential Reads Figure 3.14: Random and sequential read evaluation in BAVFS. If views are selected appropriately, BAVFS performs 9% better than FS-CP with re- spect to sequential reads, and 21% better than FS-AP when it comes to random reads. The performance degradation is lower than 8% for random reads with FS-CP, and 1% for sequen- tial reads with FS-AP, because BAVFS experiences extra overhead induced by views. Our empirical study shows that the benefits of BAVFS outweighs its duplication overheads. 3.4.3 BHVFS We compare the random read and write performance of BHVFS with those of FS-LFS and FS-FFS. To reduce the data consistency overhead, we deploy two disks in this group of experiments. Two views maintain their data on two separate disks. We use postmark to create 20,000 files, each of which is 100-200 KBytes in size. There are 2,000,000 writes and 20,000 reads issued to the tested file systems. The buffer size in each transaction is 4 KBytes. Since postmark creates files in a sequential way, we modify the file creation process so that data are randomly appended to initial files. In doing so, data among files are interleaved in 50 a big log file in the LFS view. We do not create a cleaner program in FS-LFS in order to fairly compare it with the BHVFS-LFS view. Access Rate (KBytes/s) File Systems/Views BHVFS?LFSBHVFS?FFSBHVFS FS?LFS FS?FFS0 500 1000 1500 2000 2500 3000 3500 Random Reads Random Writes Figure 3.15: Random read and write evaluation in BHVFS. Fig. 3.15 reveals the access rate (measured in KBytes/Second) of I/O operations. The write rates of BHVFS-LFS and FS-LFS are much higher than those of BHVFS-FFS and FS- FFS, because written data are appended to the big log files. No seek is needed for individual files on the disks. In BHVFS, updates in the FFS view are committed to another disk by the back-end process; the rate of BHVFS-LFS is not as low as the half of FS-LFS. The overheads are attributed to the synchronization process as well as resource contentions (e.g., acquiring a lock on an update pending queue and memory consumption). Due to a continuous organization, the read rates of BHVFS-FFS and FS-FFS are much higher than those of BHVFS-LFS and FS-LFS. When a read is issued to BHVFS-LFS or FS- LFS, the data is retrieved from a number of pieces triggering multiple seeks. Moreover, the difference in read rates between BHVFS-LFS and FS-LFS, as well as BHVFS-FFS and FS- FFS, is negligible. The way of retrieving data in BHVFS is similar to that in the traditional 51 file systems. The only difference is that BHVFS has to handle one more level (i.e., views emulated by directories in our experiments) to reach files. More importantly, the read rate of BHVFS (1460) is six times better than that of FS- LFS (223), and the write rate of BHVFS (2640) is 30% worse than that of FS-LFS (3420). The read rate of BHVFS is 1% worse than that of FS-FFS (1470); whereas the write rate is 16% better than that of FS-FFS (2280). After a comprehensive comparison, we conclude that BHVFS is better than both FS-LFS and FS-FFS; the benefits obtained by BHVFS outweigh its overhead. 3.5 Discussions CBFS - the context-based file system - represents a broad research topic that cannot be fully covered in this document. We make an effort to raise a number of intriguing issues that are important for future studies. In this subsection, we first present two extensions that implement context-based file systems in kernel and distributed environments. Then, we compare several approaches to improving the performance of view-unaware applications. Last, we discuss context exposure methods. 3.5.1 Kernel Implementation When it comes to kernel implementation, we must address two issues, namely, view allocation and block allocation. Currently, BAVFS and BHVFS are built in the user space, where the view allocation issue is not a concern. Directories are used to emulate views for a simple design. In the Frog framework, when a directory is created, its sub-views are automatically created. The observation indicates that a directory allocation is always followedbyafixednumberofviewallocations. Thus, bundlingdirectoryandviewallocations is possibly more efficient than individually allocating them. Fig 3.16 displays two allocation methods by presenting VED and views layout on disks. In the separated layout, VED and views are located in two separate spaces in the metadata 52 region. VED and views allocations are independent with each other. This method simplifies the system design because it does not modify the VED allocation method. A new view allocation method is integrated into the system. A group layout indicates a group allocation method in which a directory and views are allocated at the same time. The views are implemented in form of the sub-views of the directory. The group allocation method can avoid multiple allocations that may trigger multiple disk I/O operations. Block allocation is another critical issue in a file system design. I/O operations can be optimized with a better data locality provided by block allocation methods. Block allocation might not be a big issue in a multiple-disk environment (e.g., BHVFS in our experiment), because each view has its own disk to run on. Existing block allocations can be applied in views. However, block allocation becomes complicated when two or more views run on a single disk. Applications may switch among views, which may incur disk I/Os at positions that are far away. In addition, committing an update among views may issue disk I/Os at multiple positions. Now we discuss a feasible design that reduces I/O costs induced by multiple views. The fundamental idea is that Frog uses a block-mapping mechanism to map virtual blocks in views to actual blocks on disk. Fig. 3.17 depicts two block layouts in which the mapping is efficient. The mapping process can be completed by simple numeric calculation. The upper figure shows that views have their own block space on disk. Frog maps the virtual blocks in views to their corresponding block space on disk. In this mapping, blocks are continuously stored on the disk, which offers good sequential reads in either view. As we mentioned Separated Layout Group Layout VED View View ViewVED ...... ...View A Group A Group VED View View ViewVED ...... View VED Space View Space VED Figure 3.16: VED and views allocation. 53 View 1 Virtual Block Virtual Block Virtual Block View 2 Virtual Block Virtual Block Virtual Block ...Blocks on a disk Block Space for View 1 Block Space for View 2 View 1 Virtual Block Virtual Block Virtual Block View 2 Virtual Block Virtual Block Virtual Block ... ...Blocks on a disk Interleaved Block Space for Two Views Figure 3.17: Two block mappings. earlier, updates and frequent view switches are likely to issue I/Os among positions that are far away, thereby causing long disk seek times. The lower figure shows an interleaved block mapping. This method is motivated by the observation that once an update is committed in a view, the update may be committed in other views in the near future. An interleaved layout increases the possibility that two updates may be completed on two continuous blocks on a disk, which avoids seek time in the second update. However, sequential reads in a view suffer from multiple seek overhead. 3.5.2 Work in a Distributed Environment Distributed environments help in reducing the complexity of managing metadata/block allocations, where two or more views are running on a single disk. In a distributed environ- ment where multiple computing resources and devices are coordinated, a context-based file system is likely to have each view located in a standalone disk. More importantly, data are usually duplicated in distributed environments for fault tolerance purpose (e.g., GFS [48] and HDFS [108]). All the issues related to replica maintenance have been investigated in the distributed systems. Adopting existing replica-management facilities can considerably 54 reduce the difficulty of the CBFS design. Thus, we conclude that integrating Frog into a replica-enable distributed file system is attractive. It is worth noting that after the above integration, generating replicas in CBFS may be sightly different from that in distributed file systems. Rather than byte-to-byte duplications, each copy of replicas is independently managed by views. The content and format of a copy in a view may be different from the ones in other views. If the quality of the copies is not identical, the fault tolerance of entire system will be compromised. For example, a bi- context file system has two copies of a video file. A copy in the high resolution view is used forprovidinghighqualityservicestoVIPs, whereasanothercopyinthelowresolutionviewis for regular customers. If the low quality copy is lost, the copy can be recovered from its high quality copy. However, the lost high-quality file cannot be recovered from the low quality copy. The information that does not exist in the low quality file will be lost permanently. To address this problem, we can create the low quality copy along with a complement file that stores the difference between high- and low- quality copies. In doing so, the high quality copy can be recovered from the low quality copy. The generation methods of complement files vary from application to application; these methods are out of the scope of our study. 3.5.3 Optimization for View-unaware Applications Recall that CBFS provides compatibility to view-unaware applications by setting up a default view. All I/O requests issued by view-unaware applications are committed on the default view. Due to the unawareness of views, the view-unaware applications do not select any appropriate view. As a result, it is not guaranteed that the default view is the best view tailored for the applications, meaning that the view-unaware applications may not perform very well as view-aware applications in CBFS. 55 To improve the performance of view-unaware applications, we propose the following approaches to decide the most appropriate views on the behalf of the view-unaware appli- cations. These approaches are categorized into the following two camps (i.e., static view selection and dynamic view selection). Static View Selection. In this group of view-selection approaches, the decisions of view selections are statically made. The best view of a particular type of requests is explored during the development of CBFS; an appropriate view can be selected according to the request type. For example, file I/O requests are classified into file creation, file deletion, reads and writes. If a request is issued on BHVFS, the creation/deletion and write requests should be handled by BHVFS-LFS; read requests should be processed by BHVFS-FFS. Dynamic View Selection. In the process of dynamic view selection, the views are determined at run-time. A dynamic-view-selection mechanism runs at the back-end of CBFS to collect the run-time information regarding types of I/O operations. When a request is issued, aviewcanbechosenbythedynamic-view-selectionmechanismbasedontherun-time data analysis. Compared with the static approaches, the dynamic methods are automatic and flexible on view selections at the cost of extra overheads caused by data collection and analysis. Both the static and dynamic approaches can be incorporated into CBFS to choose the most appropriate views for view-unaware applications. From the perspective of I/O access pattern recognition, these dynamic-view-selection methods can be envisioned as experienced-based or informed-based approaches. Experienced-based View Selection. In an experienced-based view selection mech- anism, historical data on access patterns are collected and used for making view selections. The effectiveness of these methods relies heavily on the access-pattern similarity between future requests and previously processed requests. Although a diversity of view selection policies can be incorporated in a dynamic view-selection mechanism, such an approach in- troduces extra overhead of historical data collection and maintenance. 56 Informed View Selection. Differing from the experienced-based methods, informed- view-selection approaches rely on view-selection hints provided by applications. In case the source code of existing view-unaware applications cannot be modified, the informed-view- selection approaches become impractical in CBFS. It is worth of attention that giving view names in a file path can be considered as one of informed-view-selection methods adopted for view-aware applications. Although the static methods are not flexible in nature, their efficiency isvery high because oflow overhead. If flexibility is moreimportant than efficiency, a dynamic experience-based approach should be adopted. 3.5.4 Context Exposure We discuss the methods of context exposure and perception in context-based systems and applications. In BAVFS and BHVFS, view set-ups in file systems are known as a priori during application development. The applications do not really perceive the contexts supported by the systems. However, context perception ability enables applications to be more adaptable in diverse computing environments. In what follows, let us discuss two approaches to providing applications with context information in Frog. The first approach is Context Detection, in which view-aware applications are able to detect the existence of particular views by issuing operations (e.g., stat) on the views. The systems return information of the views if they exist; otherwise, an error code (e.g., ENOENT) is returned. The view detection operations are invisible to view-unaware ap- plications because they can not issue any operation on views. Therefore, both types of applications can perform well in this approach. A limitation occurring in this method is that view names are required to be known in order to form a string path to views. In addi- tion, detecting the existence of a number of views may cause multiple I/O operations that downgrade system performance. The second approach is View Listing, where views are listed through directory listing operations. In this approach, a new type value (e.g., DT_VIEW in d_type field of struct 57 dirent) is added for views. After retrieving all objects under a directory, applications can find views by comparing their type values. As we mentioned in Section 3.2.2, the method exposes views to both types of applications, thereby possibly making view-unaware applications work incorrectly. For example, an application counts the number of files under a directory by calculating the number of objects that are not directories. If the views are returned along withfilesinthedirectory-listingoperation, theapplicationwillretrieveincorrectinformation. Therefore, extra efforts on view-unaware application correctness should be paid in the case of view listing. 58 Chapter 4 ORCA: An O?oading Framework for I/O-Intensive Applications on Clusters This chapter presents an o?oading framework, ORCA, to map I/O-intensive code to a cluster that consists of computing and storage nodes. To reduce data transmission among computing and storage nodes, our o?oading framework partitions and schedules CPU-bound and I/O-bound modules to computing nodes and active storage nodes, respectively. From developers? perspective, ORCA helps them to deal with execution path control, o?oading executable code, and data sharing over a network. Powered by the o?oading programming interfaces, developers without any I/O o?oading or network programming experience are allowed to write new I/O-intensive code running efficiently on clusters. We implement the ORCA framework on a cluster to quantitatively evaluate the per- formance improvements offered by our approaches. We run five real-world applications on both homogeneous and heterogeneous computing environments. Experimental results show that ORCA significantly speeds up the performance of all the five tested applications. More- over, the results also confirm that ORCA considerably reduces network burden imposed by I/O-intensive applications. This chapter is organized as follows. Section 4.1 illustrates the ORCA framework. Sec- tion 4.2 demonstrates four important design issues. Section 4.3 describes the implementation details of the framework. Sections 4.4 and 4.5 show our experimental results. Our experience on this study is discussed in section 4.6. 59 4.1 The ORCA I/O-O?oading Framework We will begin this subsection by highlighting the main idea of our ORCA o?oading framework for I/O-intensive applications. Then, we will discuss structures of applications designed to gain maximum benefit from the I/O o?oading framework. 4.1.1 System Architecture Computing Nodes Storage Nodes Network in terco nnect Network interc on nec t Figure 4.1: The architecture of commodity clusters, where a number of nodes are connected with each other through interconnects. We focus on clusters enhanced with active storage nodes that have computing capability. Fig. 4.1 illustrates the architecture of commodity clusters, where a number of nodes are connected with each other through interconnects. In this work, we focus on clusters enhanced with active storage nodes that have computing capability. In our study, a cluster has two types of nodes: (1) computing nodes that deal with CPU-bound jobs and (2) storage nodes that are responsible for storing data and processing I/O-bound jobs. In existing clusters, parallel file systems (see, for example, Lustre [105] and PVFS [84]) are employed to distributed data across multiple storage nodes. To support data-intensive applications, the parallel file systems need to transfer files back and forth between computing 60 and storage nodes. Although high peak aggregate I/O bandwidth can be achieved by access- ing multiple storage nodes in parallel, moving data between computing and storage nodes will inevitably slow down the performance of I/O-intensive applications. Our preliminary evidence shows that reducing the amount of data transferred among nodes is a practical approach to boosting the overall performance of clusters. 4.1.2 Structure of Applications in ORCA App 1 Computing Node 1 App 2 Storage Node 1 Offloading Domain Offloading Domain Offloading Domain Offloading Domain Computing Node 2 Storage Node 2 Figure 4.2: An o?oading domain is a logic processing unit, in which a pair of computing and o?oading modules are coordinated. I/O-bound modules are assigned to and executed on storage nodes; CPU-bound modules are handled by computing nodes. ORCA overlaps the executions of CPU-bound and I/O-bound modules. Fig. 4.2 depicts our I/O o?oading framework, in which I/O-bound modules are assigned to and executed on storage nodes. The goal of this framework is to reduce the amount of data transferred from storage nodes to computing nodes. Our o?oading idea is inspired by the observation that I/O-intensive applications (see Section 4.4 for real-world examples) can be partitioned into CPU-bound and I/O-bound modules. CPU-bound modules are handled by computing nodes; whereas I/O-bound modules, running on storage nodes, are referred to as o?oading parts. To achieve high performance, the framework makes an effort to overlap 61 the executions of CPU-bound and I/O-bound modules on computing and storage nodes in a cluster. We now introduce the concept of o?oading domains, which are used to group CPU- bound and I/O-bound modules. An o?oading domain is a logic processing unit, in which a pair of computing and o?oading modules are coordinated. An application may contain either only one or multiple o?oading domains. The number of the o?oading domains in an application heavily depends on the application?s design and the number of o?oading modules. O?oading domains are independent of, and isolated from each other in the sense that one o?oading domain can not be interfered with by others. Moreover, two simple applications that create two o?oading domains are demonstrated in fig. 4.2. App 1 is a multi-process program, in which the CPU-bound modules in both o?oading domains are allocated on two computing nodes. The corresponding I/O-bound modules are executed on storage node 1. This is a typical n-to-1 model that multiple com- puting nodes and a storage node are used by App 1. On the other hand, App 2 shows a typical 1-to-n model that a computing node and multiple storage nodes are utilized. The CPU-bound modules of App 2 can be created as either threads or processes. The I/O-bound modules are o?oaded to separated storage nodes. More complex applications can be derived from the two simple examples. CPU-bound and I/O-bound modules in an o?oading domain are, in some cases, serially and synchronously executed. Thus, while CPU-bound modules are running on computing nodes, theirI/O-boundcounterpartsmustbewaitingandviceversa. InthecasewhereCPU- bound and I/O-bound modules in an o?oading domain are asynchronous, our framework can overlap the executions of the CPU-bound and I/O-bound modules on computing and storage nodes to achieve high performance in a cluster. 62 4.2 Design Issues Before developing the proposed ORCA I/O-o?oading framework, we need to address the following four design issues. How to identify I/O-bound modules in an application? (see Sec. 4.2.1) How to o?oad an I/O-bound module to an active storage node? (see Sec. 4.2.2) How to transfer an execution to a storage node? (see Sec. 4.2.3) How to share data between CPU-bound and I/O-bound modules within an o?oading domain? (see Sec. 4.2.4) 4.2.1 Data-Intensive Module Identification The first step in partitioning a data-intensive application is to identify the I/O-bound modules of the application. Intuitively, I/O-bound modules need to process huge amount of data, meaningthatI/Otimeshoulddominatetheperformanceofsuchmodules. Ontheother hand, CPU-bound modules spend the majority of their time using CPUs to do calculations. A profiling and performance analysis tool can be employed to evaluate whether modules in a data-intensive application are CPU-bound or I/O-bound. With the performance analysis tool in place, programmers can evaluate whether applying the o?oading technique improves overall application performance. Such an evaluation process should take into account various aspects such as computing workload, I/O workload, and network traffic. 4.2.2 O?oading a Program The second design issue is that of an efficient way of o?oading an executable file to an active storage node. Two practical approaches to o?oading executable modules are dynamic o?oading and static o?oading. The main idea of dynamic o?oading is to automatically transfer an executable file and its configurations to storage nodes in a cluster before loading 63 the file into the memory. In this method, the o?oading platform must be aware of details of the run-time system implementation (e.g., programming languages and libraries) if the run-time system is platform dependent. If the run-time system is platform independent(e.g., implemented in scripts or java), the o?oading platform does not have to consider run-time system details. Thus, the level of difficulty in implementing the o?oading technique using dynamic distributions highly relies on the nature of the applications to be supported by the framework. Dynamic o?oading introduces another challenge - version management - for platform- dependent applications. In heterogeneous environments, all types of executable files, each of which is dedicated to a specific hardware platform, need to be precompiled. To invoke I/O-bound modules o?oaded to storage nodes, applications must detect the type of hard- ware/software in the storage nodes and choose a proper version of the I/O-bound module to be o?oaded on the fly. Moreover, this dynamic-distribution approach suffers from repeatedly transferring I/O-bound modules from computing to storage nodes. Although storage nodes are able to cache and reuse o?oaded modules, it is time consuming for computing nodes to decide whether the cached ones on the storage end are valid and updated. Unlike dynamic o?oading, static o?oading configures o?oaded I/O modules a priori. Static o?oading encompasses three distinct procedures if active storage systems are hetero- geneousinnature. ThefirstprocedureistomanuallycompileI/O-boundmodulesforvarious hardware and run-time systems in heterogeneous storage systems. The second procedure is to write specific configuration files. The last procedure is to deploy the configuration files along with I/O-bound modules onto target storage nodes. Although these three procedures are seemingly complicated, they can be automatically completed by a simple yet efficient tool in our o?oading framework. Moreover, the static o?oading approach greatly simplifies the design of our o?oading framework, because there is no need to address the platform- dependent issues. In this approach, when an application starts processing, its o?oaded I/O-bound modules have already been compiled and installed on storage nodes. 64 4.2.3 Controlling an Execution Path Thethirddesignissueisamechanismfortransferringexecutionsbackandforthbetween a pair of CPU-bound and I/O-bound modules in an o?oading domain. To deal with this issue, we considered and compared two candidate mechanisms - CORBA and RPC. CORBA [14] - a distributed programming model - is able to accommodate a number of components implemented by different languages. These components usually execute on different machines and communicate with each other through networks. However, CORBA?s extreme complexity often prevents beginner programmers from learning and using it. It normally requires at least several months for programmers to become familiar with its fun- damentals [54]. Another downside of adopting CORBA in our o?oading solution is that storage nodes must be equipped with powerful processors in order to host a complex CORBA implementation. Using CORBA in our framework is likely to lead to very expensive stor- age nodes with high-performance CPUs. Otherwise, weak processors in storage nodes will constantly be busy running CORBA middleware. A feasible option for our framework is Remote Procedure Call (RPC), which is a broadly accepted method of invoking a function to execute in a remote machine. Thanks to RPC?s simplicity, it is easy for any programmer to learn and use. There are many RPC libraries implemented by various general-purpose programming languages. RPC was applied to im- plement Network File System (NFS) [103], MapReduce [33] and Hadoop [13]. Because of this, we choose RPC rather than CORBA to implement in our o?oading framework. 4.2.4 Data Sharing among Storage and Computing Nodes The last issue is data sharing between a pair of CPU-bound and I/O-bound modules in an o?oadingdomain. Shared data include both global variables and code segments. Amajor challenge is that in an o?oading domain, global variables can not be shared by CPU-bound and I/O-bound modules allocated to different computing and storage nodes. 65 An intuitive solution for the above challenge is to establish a synchronization mecha- nism to allow a pair of modules in an o?oading domain to notify each other when any global variableisupdated. Forexample, ifaCPU-boundmodulemodifiesshareddataonacomput- ing node, a notification along with the updated data will be delivered to the corresponding I/O-bound module on a storage node. Asecondsolutionismotivatedbyanobservationthatinsomecases, I/O-boundmodules are synchronized with their CPU-bound modules. In a synchronization process, o?oaded I/O-bound modules are unable to access global data on storage nodes until control is re- gained from CPU-bound modules on computing nodes. Thus, CPU-bound modules can transfer updated shared variables to I/O-bound modules by appending the shared variables with o?oading requests. In our approach, the framework updates global variables before processing o?oading requests. In other words, the changes that occur at o?oaded modules can be treated as results in response messages. Code segments are considered to be a special type of global data. In applications implemented by compiled languages, function objects can not be shared directly. The reason for this is that addresses of an function in a pair of CPU-bound and I/O-bound modules may be different after being loaded into the main memory. On the other hand, in interpreted applications, functions are parsed by names rather than addresses. Hence, both the CPU- bound and I/O-bound modules in an o?oading domain are able to obtain identical functions by their names. In this subsection, we only highlighted the basic idea of data sharing supported in our o?oading framework. Please refer to Section 4.3.4 for implementation details on the data- sharing mechanism . 4.3 Implementation Details In this subsection, we describe the implementation details of our ORCA o?oading framework and explain how to run o?oading applications on clusters. 66 4.3.1 Configuration Recall that we took the static o?oading approach (see Section 4.2.2) by adopting the pre-configuration method to o?oad I/O-bound modules to storage nodes. The following five stepsarerequiredtorunadata-intensiveapplicationinourORCAI/O-o?oadingframework. 1. Design a data-intensive application and identify I/O-bound modules to be o?oaded to storage nodes. 2. Convert the application into its o?oading version by using the o?oading programming interface (API) described in Section 4.3.3. Developers may need to write configuration files. 3. Create executable files for target storage nodes if the executables are implemented by compiled languages. If the application is developed by interpreted languages, then source files are executable. 4. Copy executable and configuration files to specified directories on computing and stor- age nodes. 5. Start I/O-bound modules on storage nodes followed by computing nodes. This order is important because o?oaded modules must provide services to CPU-bound modules in an initial phase. 4.3.2 Workflow of an Application in ORCA Normally, o?oaded I/O-bound modules (see the right-hand side of Fig. 4.3) can be dis- patched to multiple storage nodes. Storage nodes to which I/O-bound modules are o?oaded depend on an allocation policy. For example, a typical policy is to allocate o?oaded modules to the storage nodes where data is located [33] [13]. Another possible policy is to equally distribute o?oaded modules across storage nodes. After an o?oaded module completes, 67 it returns execution control back to the corresponding CPU-bound module of a computing node. An I/O-bound module Init A CPU-bound module Init1 3 1 2 4 5 6 7 Figure 4.3: The execution flow of a data-intensive application running in the ORCA o?oad- ing framework. Fig. 4.3 shows a workflow of an application running in our ORCA o?oading framework. For simplicity, we only demonstrate the application with a single o?oading invocation. The o?oading framework manages to control the execution of the application in the following seven steps: 1. Both CPU-bound and I/O-bound modules are initialized. 2. O?oaded I/O-bound modules are suspended and wait for o?oading requests issued from the computing node. 3. The CPU-bound module starts its execution on the computing node. 4. When an o?oading invocation is required, the CPU-bound module sends a request to the I/O-bound module. The request includes the network address of a target storage node, the name of the o?oading entry, and input parameters. Network addresses of storage nodes can be listed in a configure file so that the storage nodes can be 68 quickly accessed. Names of o?oading entries can be hard-coded in the application, just like calling a function. All input parameters are transformed to a data stream to be transferred through the network. 5. After receiving an o?oading request, the I/O-bound module is activated. 6. TheI/O-boundmodulesendsaresponsebacktotheCPU-boundmodule. Theresponse contains the computation node?s network address and results. The network address can also be obtained from the configuration file. 7. After receiving a response from the storage node, the computing node continues its processing. 4.3.3 O?oading APIs The current version of the o?oading framework provides an application programming interface (API) for C and C++ languages. Similar APIs can be implemented in other languages like java or python. Our ORCA o?oading framework provides four API sets summarized in Table 4.1. The init function in the first group initializes and sets up the o?oading environments. Programs must execute init before issuing any o?oading requests. First, init decides the role?aCPU-boundorI/O-boundmodule?thattheprogramplaysbyidentifyingadedicated command-lineargument. Aftertheroledecision, initremovesthededicatedargumentwhich cannot longer be accessed. Then, a serial of MARSHAL and UNMARSHAL functions for primitive data types (e.g., char and unsigned short) are registered for supporting primitive types serialization. The second set of function in Table 4.1 is to register o?oading entries. In C/C++ applications, o?oading entries are addresses of functions in o?oaded I/O-bound modules. After compilation, all functions are converted into addresses; an identical function may 69 Table 4.1 The ORCA O?oading Programming Interface Interface & Description void init () Initialize the system. void register_function (func_addr) build a map from function addresses to their names. func_name find_name_by_func_addr (func_addr) Get a function name by a given address. func_addr find_func_by_name (func_name) Get a function address by a given name. void MARSHAL (void* obj, char**buf, int* len) Serialize an object pointed by obj into a data stream. The address and size of the data stream are specified by buf and len. void UNMARSHAL (void* obj, char*buf, int len) Un-serialize an object pointed by obj from a data stream. The address and size of the data stream are specified by buf and len. void offload_call (addr, func_name, ins, outs) Invoke an o?oading procedure named by func_name. addr indicates a networkaddress(e.g., anIPaddress)ofthetargetnode. Theinputparameters and results are specified by ins and outs. have different addresses in CPU-bound and I/O-bound modules. In order to exchange of- floading entries between a pair of CPU-bound and I/O-bound modules, we enable applica- tions to call register_function to register functions and then exchange function names instead of addresses. Addresses are automatically converted to names in CPU-bound mod- ules and reverse in o?oaded I/O-bound modules by calling find_name_by_func_addr and find_func_by_name respectively. The goal of the third API set in Table 4.1 is to send and receive parameters and results. Both MARSHAL and UNMARSHAL accept input parameters object in the type of void * in order to adapt all types of objects. The following two parameters specify the buffer of the data stream and its length. All data being exchanged between CPU-bound and I/O-bound modules must implement corresponding MARSHAL and UNMARSHAL functions that are automatically called by the o?oading framework. If a function pointer need to be serialized or un-serialized, the 70 pointer has to be converted to the function name by a second set of interfaces and then processed as a regular string. These functions must be registered during initialization as well. A simple example void domain_entry(String files []) { for (int inde x = 0; index < 2; index++) { unsigned int result; struct offloa ding_para ins = {files[index], marshal_string , unmarshal_string }; struct offloa ding_para outs = {&result , marshal_uint, unmarshal_uint}; offloa d_call(?192.168.0.1?, word _counter , ins, outs); } } struct offloa ding_para { void * obj; MA RSHAL marshal; UNMARSHAL unma rshal; }; Definition Figure 4.4: A simple example of o?oad_call offload_call is a real action for calling an o?oad. The parameter addr indicates the network address (e.g., an IP address) of the node where an o?oading part will take place. func_name specifies an o?oading entrance. ins and outs are an input and output parameters defined as instances of the offloading_para structure. Fig. 4.4 gives the definition for and an example of offloading_para recording an object and corresponding MARSHAL and UNMARSHAL functions. When offload_call is invoked, the o?oading library will automatically serialize and un-serialize input and output parameters so that the computation and o?oading parts are able to successfully communicate with each other. 4.3.4 Sharing Data Recall that the complexity of o?oading programs heavily depends on data sharing mechanisms (see Section 4.2.4). Because our goal is to keep o?oading programs simple, our 71 framework offers a simple yet efficient way of passing data as input and output parameters. We consider two key issues regarding data sharing. The first one is how to share global data between computing and storage nodes. All data accessed by both nodes should be overseen by input parameters and results (see Sec- tion 4.2.4), and is required to be deeply copied in MARSHAL and UNMARSHAL instead of merely copying object points. This is because objects created in address spaces are totally different in the two parts. A function pointer is the data that maintains the address of the func- tion. The address has to be converted to the function name in MARSHAL and recovered in UNMARSHAL, since the function name keeps consistent in both CPU-bound and I/O-bound modules. The conversion can be completed by Dynamically Loaded (DL) libraries1 if the function is defined as extern. The CPU-bound and I/O bound modules are responsible for handling global updates. The second issue is how to share code segments. Function entries or executable objects are a special type of data in programs. The framework can not simply copy binary codes and transfer them to another node, because the code might be not executable. In our implementation, we link all object codes to each part, regardless of whether the codes are used or not; therefore, programmers do not need to identify which functions belong to either parts or both. To transfer a function entry, we build a map between function names and addresses, thereby placing function names in o?oading requests and responses. Both computing and active storage nodes can resolve function names and addresses by using the o?oading API. 4.4 Evaluations 4.4.1 Experimental Testbed for ORCA Wesetupahomogeneousclusterandaheterogeneousclusterastwotestbedstoevaluate real-world applications supported by our ORCA o?oading framework. Both clusters are 1http://tldp.org/HOWTO/Program-Library-HOWTO/dl-libraries.html 72 comprised of 16 nodes, which form 8 independent o?oading domains (see Fig. 4.2 for an example of o?oading domains). All nodes are connected by the Gigabit Ethernet. Table6.3summarizesthehardwareandsoftwareconfigurationsofthetwotypesofnodes - Type I and Type II - used in our testbeds. Type I nodes have better CPU performance and largermainmemorythanTypeIInodes. Interestingly,measurementscollectedbyhdparm[4] indicate that Type II nodes have higher sequential I/O throughput (130.35 MBytes/Sec.) than Type I nodes (106.94 MBytes/Sec.). Table 4.2 Hardware and Software Configurations Name Hardware Software Type I 1 Intel Xeon 2.4 GHz processor Ubuntu 10.04 1 2 GBytes of RAM Linux kernel 2.6.23 1 1 GigaBit Ethernet network card 1 Seagate 160 GBytes Sata disk (ST3160318AS [7]) Type II 1 Intel Celeron 2.2 GHz processor Ubuntu 10.04 1 1 GBytes of RAM Linux kernel 2.6.23 1 1 GigaBit Ethernet network card 1 WD 500 GBytes Sata disk (WD5000AAKS [12]) The homogeneous cluster is made up of 16 Type I nodes; the heterogeneous cluster contains 8 Type I nodes and 8 Type II nodes. In the second testbed, computing nodes are Type I and storage nodes are Type II. The configuration details of the testbeds are specified in Table 6.3. Table 4.3 Configuration of the two Testbeds for ORCA Computing Nodes Storage Nodes Homogeneous testbed 8 Type I 8 Type I Heterogeneous testbed 8 Type I 8 Type II 73 4.4.2 Benchmark Applications Applications We tested five benchmarks (see Table 4.4), which are well-known data-intensive appli- cations. PostgreSQL, Word Count(WC), Sort, and Grep were downloaded from their official websites, whereas the Inverted Index application was implemented by our research group at Auburn. In our experiments, we ran the baseline applications on computing nodes and loaded data from the storage nodes through the Network File System (NFS) service [103]. NFS is an RPC-based solution commonly used in clusters. Due to the efficiency and ease of management of NSF, numerous commercial products rely on it to manage massive amount of data. For example, Oracle 11g - the latest DBMS product developed by Oracle corporation - is supported on Oracle Real Application Clusters (RAC) that provides shared storage through the NFS service [5]. Kassick et al. studied the impact of I/O coordination on an NFS-based environment [61]. So, we decided to test all the baseline applications using the NFS service in our testbeds. We choose PostgreSQL as the first benchmark, because o?oading I/O-bound modules of a database engine to storage nodes has been proposed by a number of researchers. This approach aims to accelerate I/O processing in database systems. For example, Pitman et al. studied a scheme to o?oad relational operations in DB2 to Active Storage Fabrics (ASF), thereby increasing parallel capability of data retrieval [41]. Choudhary et al. selected DBMS as a major benchmark application [29], in which five operations (i.e., scan, join, sort, group-by and aggregate) are o?oaded to smart disks, to evaluate the performance impact of distributed smart disks on I/O-intensive workloads. Abouzeid et al. developed HadoopDB that combines MapReduce and DBMS technologies by considering a combination of efficiency, scalability, fault tolerance and flexibility of databases [15]. HadoopDB o?oads parts of the database workload to storage servers as Mapper tasks in a hybrid system. The 74 major difference between our o?oading framework and the above studies lies in the fact that ours o?oads a database engine to storage nodes in shared nothing clusters. We applied the ORCA o?oading framework to the five benchmark applications, each of which has an I/O-bound module running on storage nodes. In particular, the "executor" is definedasano?oadedmoduleinPostgreSQL.TheORCAframeworko?oadstheI/O-bound modules of the other applications to storage nodes. Table 4.4 describes the implementation of these benchmarks. Table 4.4 Real-World Benchmark Applications Applications Descriptions PostgreSQL 9.0 [6] It is a relational database management system. The o?oad- ing framework o?oads the "executor" module to storage nodes. The I/O-bound module receives an execution plan and per- forms queries. The CPU-bound module manages connections to clients, converts SQL statements to execution plans and sends results back to clients. Word Count in GNU coreutils 7.4 [2] It counts the number of words in a set of files. Our frame- work partitions the Word Count application into an I/O-bound module that calculates word occurrences in one file, and a CPU- bound module that sums the occurrences up. Sort in GNU coreutils 7.4 [2] It sorts lines of a text file in alphabetical order. Our framework treats the entire Sort application as an o?oaded I/O-bound module that receives a file name and stores sorted text in a file. GNU Grep 2.7 [3] It searches through a file for lines which contains a given key- word. The I/O-bound module in Grep finds desired lines in a file; the CPU-bound module in Grep transfers keywords and file names to the I/O-bound module. Inverted Index (our benchmark) It loads a set of files and builds a map between words to their occurrences. In the Inverted Index application, its I/O-bound modules constructs a map for each file; a CPU-bound module transfers file names to the I/O-bound module. Data Preparation To measure performance of PostgreSQL running in the ORCA o?oading framework, we created four databases with sizes of 5 GBytes, 10 GBytes, 15 GBytes and 20 GBytes. No 75 index was generated in these databases; therefore, PostgreSQL had to directly access data in the tables rather than merely checking index structures during query processing. Each database is made up of 1,000 tables, each of which has 100 integer attributes. Tuples are equally distributed across these tables, so a larger database has more tuples in each table. We generated 1,000 queries, each of which scans only one table. Together, these queries cover all the tables in the database. For the other four benchmark applications, we created five text files of relatively smaller sizes (i.e., 400 MBytes, 600 MBytes, 800 MBytes, 2 GBytes and 4 GBytes). Each text file containedanumberofrandomlygeneratedwords. Duetothelimitationofthemainmemory, we tested the inverted index application using the first three text files on the homogeneous cluster. This was because frequent page faults made I/O noise in the experiments when the input file size was larger than the main memory. We also tested the other four applications on the heterogeneous cluster. 4.4.3 PostgreSQL: A case study We briefly described how official and o?oading PostgreSQL worked in our experiments. WechosePostgreSQLasanexample,becauseitisacomplicatedapplicationwithanumberof independent modules. Also, boundaries of I/O-intensive modules are highly distinguishable. We can easily partition PostgreSQL into computation and o?oading parts. Compare ORCA with ASF, SDs, and HadoopDB O?oading I/O-bound modules of a database engine to storage nodes is an idea proposed by other researchers. This approach can be applied to further accelerate I/O processing in database systems. Pitman et al., for example, studied a scheme that o?oads relational op- erations in DB2 to Active Storage Fabrics (ASF) in order to increase performance of parallel data retrievals [41]. To evaluate the performance improvement of distributed Smart Disks (SDs) for I/O-intensive workloads, Choudhary et al. chose DBMS as a major benchmark 76 application, in which five operations (i.e., scan, join, sort, group-by and aggregate) are of- floaded to SDs [29]. In a study conducted by Abouzeid et al., parts of database workload are o?oaded to the storage as Mapper tasks in a hybrid system called HadoopDB, which combines MapReduce and DBMS technologies for a combination of efficiency, scalability, fault tolerance and flexibility of databases [15]. The major difference between our ORCA and the aforementioned studies is that ORCA o?oads a database engine on shared nothing clusters. Official PostgreSQL Computation Node Storage Node Data Official PostgreSQL Parser Rule System Optimizer ExecutorResult Query Figure 4.5: The execution flow of official PostgreSQL PostgreSQL is an open source relational database management system. We chose the latest stable release, PostgreSQL 9.0, as a target application in these experiments. Fig. 4.5 shows that there are four components in the PostgreSQL backend program, which mainly supports SQL queries in the background. Parser checks a query string for valid syntax and creates a parse tree after the validation process. The rule system applies a group of rules to rewrite the parse tree to an execution plan. The optimizer creates an optimal execution plan; the executor runs the query [49]. 77 O?oading PostgreSQL in ORCA Storage Node Offloading Part Computation Node Computation Part Data I/O Executor Parser Rule System Optimizer Query Result Figure 4.6: The execution flow of o?oading PostgreSQL in ORCA. The computing node handles the parser, rule system, and optimizer; the executor is o?oaded to the storage node. In the query procedure, the executor is an I/O-intensive program that reads and/or writes a large amount of data from and/or to storage systems while processing expensive operations (e.g., scanning or joining tables). Fig. 4.6 illustrates the execution flow of o?oad- ing PostgreSQL in ORCA, which o?oads the executor to storage nodes. ORCA does not modify other modules, such as access methods and disk space managers related to storage systems, because an o?oading PostgreSQL can access the same data files. In the ORCA- based PostgreSQL, the executor receives an execution plan from the remote optimizer and transfers results back to the backend program. 4.5 Experimental Results 4.5.1 Overall Performance Evaluation Homogeneous Clusters Fig. 4.7 illustrates the execution times of the five applications (see Table 4.4) to compare the ORCA-enabled cluster against the same cluster without I/O o?oading. The results plot- ted in Fig. 4.7 show that the ORCA o?oading framework significantly reduces the execution 78 times of all five tested applications. For example, when data size is 4 GBytes, our scheme can reduce execution time of PostgreSQL, and Grep by 37.8% and 47.4%, respectively. 5 10 15 200 50 100 150 200 250 300 350 400 450 Execution Time (s) Data Size (GBytes) Offloading PostgreSQL Official PostgreSQL (a) PostgreSQL 400 600 800 2000 40000 20 40 60 80 100 120 140 Execution Time (s) Data Size (MBytes) Offloading WC Official WC (b) Word Count 400 600 800 2000 40000 50 100 150 Execution Time (s) Data Size (MBytes) Offloading Sort Official Sort (c) Sort 400 600 800 2000 40000 10 20 30 40 50 60 70 80 90 100 Execution Time (s) Data Size (MBytes) Offloading Grep Official Grep (d) Grep 400 600 8000 5 10 15 20 25 Execution Time (s) Data Size (MBytes) Offloading Inverted Index Inverted Index (e) Inverted Index 400 800 2,0004,00010,00020,0000 20 40 60 80 100 120 140 160 Difference of Execution Time (s) Data Size (MBytes) Official ? Offloading (f) Performance Improvement vs. Data Size Figure 4.7: ORCA-based applications vs. Official applications. Execution times of the five real-world benchmark applications running on the homogeneous cluster (i.e., the first testbed). 79 The applications without the ORCA support are slowed down by remotely accessing huge amount of data, because the data must be transferred from storage nodes to computing nodes. Our framework solves this performance problem by o?oading I/O-bound modules to storagenodes, therebysubstantiallyreducingI/Otimethroughlocaldataaccesses. Although the applications running in ORCA have to exchange input/output parameters among com- puting and storage nodes, the data size of input/output parameters is significantly smaller than the dataset size. Fig. 4.7(f) shows the impact of data size on performance improvement gained by our ORCA o?oading framework. We measured the performance of the four applications (i.e., WC, Sort, Grep and Inverted Index) on four datasets (400MB, 800MB, 2GB, and 4GB) and PostgreSQL on two large datasets (10GB and 20GB). We plotted performance improvement in terms of execution-time reduction in Fig. 4.7(f). This reveals that the performance im- provements achieved by ORCA become more pronounced as the datasets grow in size. When data size is small, the non-o?oading-enabled applications can take advantage of continuous I/O operations optimized by the NFS service. For example, NFS can cache entire datasets in the main memory so that the datasets can be repeatedly processed without further re- mote I/O accesses. Unfortunately, when the datasets grow in size, the non-ORCA-enabled applications benefit very little from caching due to limited caching ability in computing nodes. Heterogeneous Clusters Fig. 4.8 shows execution times of the five benchmark applications supported by the ORCA framework on a heterogeneous cluster, in which computing nodes and storage nodes have different performance. The results plotted in Fig. 4.8 are consistent with those shown in Fig. 4.7. In other words, we observed that ORCA reduces the execution times of the benchmarks. Moreover, the performance improvements are more distictive as the data size increases (see fig. 4.8(f)). 80 Comparing Figs. 4.7 and 4.8, we concluded that the heterogeneous cluster offers slightly better performance than the homogeneous one. This is because the storage nodes (i.e., Type II nodes) in the heterogeneous cluster have higher I/O bandwidth than the storage nodes (i.e., Type I nodes) in the homogeneous cluster. Although Type I nodes are superior to Type II nodes in terms of CPU speed and memory capacity, higher I/O throughput of Type II nodes cause the heterogeneous cluster to outperform its homogeneous counterpart. 4.5.2 Network Load Evaluation Homogeneous Clusters Fig. 4.9 shows network load caused by both official and ORCA-based PostgreSQL when data size is set to 5GB, 10GB, 15GB, and 20GB, respectively. The results confirm that the ORCA o?oading framework significantly reduces the network load of the homogeneous cluster running PostgreSQL. When the non-ORCA-based PostgreSQL is running, transfer- ring data from the storage to computing nodes keeps the network resources very busy (e.g., from 30MB/s to 60MB/s). The performance bottleneck of non-ORCA-based applications is dominated by network resources. This problem becomes even worse as data size grows. Fig. 4.10 shows network load imposed by the other four applications and their ORCA- based counterparts processing an 800MB dataset. WC, Grep, and Inverted Index share a similar network traffic pattern with PostgreSQL. Fig. 4.10(b) shows that the data transmis- sion rate in Sort is constantly changing between 0 and 65MB/s. The reason is that after loading a certain amount of data, Sort becomes CPU-bound rather than I/O-bound to han- dle the sorting process. During the short period of sorting process, the network resource is sitting idle and waiting for the next I/O request. 81 5 10 15 200 50 100 150 200 250 300 350 Execution Time (s) Data Size (GBytes) Offloading PostgreSQL Official PostgreSQL (a) PostgreSQL 400 600 8000 5 10 15 20 25 30 Execution Time (s) Data Size (MBytes) Offloading WC Official WC (b) Word Count 400 600 8000 5 10 15 20 25 30 Execution Time (s) Data Size (MBytes) Offloading Sort Official Sort (c) Sort 400 600 8000 1 2 3 4 5 6 7 8 9 10 Execution Time (s) Data Size (MBytes) Offloading Grep Official Grep (d) Grep 400 600 8000 2 4 6 8 10 12 Execution Time (s) Data Size (MBytes) Offloading Inverted Index Inverted Index (e) Inverted Index 400 800 5,000 10,000 20,0000 10 20 30 40 50 60 70 80 90 100 Difference of Execution Time (s) Data Size (MBytes) Official ? Offloading (f) Performance Improvement vs. Data Size Figure 4.8: ORCA-based applications vs. Official applications. Execution times of the five real-world benchmark applications running on the heterogeneous cluster (i.e., the second testbed). 82 0 20 40 60 80 100 120 140 160 1800 5 10 15 20 25 30 35 40 45 50 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (a) 5 GBytes 0 50 100 150 200 2500 10 20 30 40 50 60 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (b) 10 GBytes 0 50 100 150 200 250 3000 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (c) 15 GBytes 0 50 100 150 200 250 300 350 4000 10 20 30 40 50 60 70 80 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (d) 20 GBytes Figure 4.9: Network load imposed by both official and ORCA-based PostgreSQL accessing different databases on the homogeneous cluster (i.e., the first testbed). Heterogeneous Clusters Figs. 4.11 and 4.12 show the network traffic patterns of the five applications and their ORCA-based counterparts running on the heterogeneous cluster. The empirical results indi- cate that regardless of the type of a cluster, the non-ORCA-based applications keep network resources very busy in transferring data between computing and storage nodes. We observe that the data transmission rate (ranging from 50MB/s to 70MB/s) of the non-ORCA-based PostgreSQL on the heterogeneous cluster is constantly higher than that of the homogeneous cluster. This observation implies that for non-ORCA-based applications, 83 0 5 10 15 20 25 30 35 40 450 5 10 15 20 25 30 35 40 45 Data Transmission Rate (MBytes/s) Time (s) Offloading WC Official WC (a) Word Count 0 5 10 15 20 25 30 35 40 450 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Sort Official Sort (b) Sort 0 5 10 15 20 25 300 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Grep Official Grep (c) Grep 0 5 10 15 20 25 300 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Inverted Index Official Inverted Index (d) Inverted Index Figure 4.10: Network load imposed by the four real-world applications and their ORCA- based counterparts accessing 800 MB datasets on the homogeneous cluster (i.e., the first testbed). improving I/O bandwidth of storage nodes can reduce the applications? execution times by increasing network utilization. In addition, the network links of the homogeneous and heterogeneous clusters are not saturated, because the data transmission rates are below the maximum network bandwidth (i.e., 1 Gbps)in both cases. Data retrieved from the storage nodes can be immediately delivered to the computing nodes; therefore, accessing data in the heterogeneous cluster is faster than in its homogeneous counterpart. 84 0 20 40 60 80 1000 10 20 30 40 50 60 70 80 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (a) 5 GBytes 0 20 40 60 80 100 120 140 160 1800 10 20 30 40 50 60 70 80 90 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (b) 10 GBytes 0 50 100 150 200 2500 10 20 30 40 50 60 70 80 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (c) 15 GBytes 0 50 100 150 200 250 300 3500 10 20 30 40 50 60 70 80 90 Data Transmission Rate (MBytes/s) Time (s) Offloading PostgreSQL Official PostgreSQL (d) 20 GBytes Figure 4.11: Network load imposed by both official and ORCA-based PostgreSQL accessing different databases on the heterogeneous cluster (i.e., the second testbed). 4.5.3 CPU Usage Evaluation Homogeneous Clusters The goal of this set of experiments was to assess the performance impact of o?oaded I/O-bound modules on storage nodes in ORCA. This goal was achieved by evaluating CPU usage of storage nodes in the homogeneous cluster running the five data-intensive applica- tions. Evaluating CPU usage of storage nodes is very important, because o?oaded I/O- bound modules may have side effect on other I/O services running on the storage nodes. Fig. 4.13 illustrates CPU utilization of PostgreSQL processing a 10 GB dataset and the other applications processing an 800 MB dataset. We observe that the CPU usage of ORCA, 85 0 5 10 15 20 25 300 5 10 15 20 25 30 35 40 45 Data Transmission Rate (MBytes/s) Time (s) Offloading WC Official WC (a) Word Count 0 5 10 15 20 25 30 350 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Sort Official Sort (b) Sort 0 5 10 15 200 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Grep Official Grep (c) Grep 0 5 10 15 200 10 20 30 40 50 60 70 Data Transmission Rate (MBytes/s) Time (s) Offloading Inverted Index Official Inverted Index (d) Inverted Index Figure 4.12: Network load imposed by the four real-world applications and their ORCA- based counterparts accessing the 800 MB datasets on the heterogeneous cluster (i.e., the second testbed). in most cases, is below 30% although there were two cases where the CPU utilization reaches 40% and 70% for a few seconds (see Fig. 4.13(c)). These two cases have little negative impact on storage nodes. Of all the five tested applications, Grep (see Fig. 4.13(d)) had the least overall impact on other services running on storage nodes. Overall, we concluded that our ORCA I/O-o?oading framework has minimal negative impact on any services running on storage nodes in homogeneous clusters. We confirm that improving performance of data-intensive applications in ORCA comes at the cost of increasing CPU usage in storage nodes. Fig. 4.13 indicates that different o?oaded I/O-bound modules lead to different CPU-usage increases in storage nodes. An 86 0 20 40 60 801001201401601800 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading PostgreSQL (a) PostgreSQL 0 5 10 15 20 25 30 350 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading WC (b) Word Count 0 5 10 15 20 250 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Sort (c) Sort 0 2 4 6 8 10 12 14 16 180 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Grep (d) Grep 0 2 4 6 8 10 12 140 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Iverted Index (e) Inverted Index Figure 4.13: CPU load imposed by the five real-world ORCA-based applications in the storage nodes of the homogeneous cluster (i.e., the first testbed). 0 20 40 60 80 1001201400 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading PostgreSQL (a) PostgreSQL 0 5 10 15 20 25 30 350 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading WC (b) Word Count 0 5 10 15 20 25 30 350 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Sort (c) Sort 0 2 4 6 8 10 12 14 160 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Grep (d) Grep 0 5 10 150 1020 3040 5060 7080 90100 CPU Usage (%) Time (s) Offloading Iverted Index (e) Inverted Index Figure 4.14: CPU load imposed by the five real-world ORCA-based applications in the storage nodes of the heterogeneous cluster (i.e., the second testbed). 87 increase in CPU utilization of storage nodes heavily relies on the nature of the I/O-bound modules. For example, o?oaded modules can vary from a simple problem solver (e.g., calcu- lating the number of word occurrences in a file) to a complicated procedure (e.g., searching qualified tuples by scanning an entire table). Ideally, ORCA provides a good tradeoff be- tween reducing execution time of a data-intensive application and increasing CPU usage on storage nodes. Heterogeneous Clusters Now we are in a position to evaluate ORCA?s CPU usage of storage nodes in heteroge- neous clusters. Fig. 4.14 shows CPU usages of storage nodes running the o?oaded modules for the five benchmark applications. The results suggest that WC and Inverted Index give rise to a high CPU usage (i.e., >90%). The Sort application repeatedly pushes the CPU usage up to 100% and then drops down to nearly 0%. PostgreSQL and Grep keep CPU us- age at a moderate level (i.e., 50%-60%) and a low level (i.e., <18%), respectively. If storage nodes in ORCA have low-performance CPUs or o?oaded modules are CPU-intensive, then the o?oaded modules can cause high CPU utilization in the storage nodes. 4.6 Experience In this study, we successfully extend PostgreSQL, Word Count, Sort, Grep, and Inverted Index using the ORCA APIs (see Sec. 4.3.3) in our real-world shared-nothing clusters. We have learned a great deal about the design and implementation of the ORCA o?oading framework. A summary of our experience and lessons accumulated during the development of ORCA and ORCA-based applications. 4.6.1 O?oading Module Identification When we prepared to develop the first ORCA-based o?oading application (see ORCA- based PostgreSQL in Sec. 4.4.3), we encountered a challenging issue: which modules should 88 beo?oaded. Inparticular,developershavetofigureoutwhichmodulesprocesslargevolumes of data. After the initial analysis of PostgreSQL, we treated the entire DB engine as an o?oading module. The second challenge is to minimize the computational complexity of o?oaded modules, because the o?oaded tasks in ORCA are executed on storage nodes. After examining the DB engine, we discovered that only the executor accesses a large amount of data and; therefore, the executor was chosen as an o?oaded module. The third issue we faced is to decouple the o?oaded module with other modules. The decoupling process should simplify both o?oading interfaces and data dependency. Based on our further analysis of PostgreSQL, we determined that the execution tree is the most appropriate interface defined in the o?oaded module. The execution tree is a good interface because of the twofold reason. First, this module contains a single data structure. Second, the execution tree specifies a module boundary between the optimizer and executor modules. Let us discuss the issue of data dependency in the following subsection. 4.6.2 Data Sharing Recall that o?oaded modules located in storage nodes and others modules in the host computing nodes can not share in-memory data (see Section 4.2.4 and Section 4.3.4). In the implementation of the ORCA-based PostgreSQL, no in-memory data set is shared between modules residing in computing nodes and o?oaded modules in storage nodes. The data of the execution tree located in both computing and storage nodes must be synchronized by the ORCA o?oading framework. However, the metadata of databases, originally stored in files and loaded into main memory at in the initialization phase, has to be shared by among computing and storage nodes. For example, the optimizer uses catelog data to make optimization decisions; the executor requires metadata (e.g., the locations of tables) to carry out the execution plan. Considering that the size of metadata is very small, we have these files shared by the NFS services in the ORCA-based PostgreSQL. 89 The o?oaded code in storage nodes and other code in computing nodes are linked together as an executable program in ORCA. The ORCA-based program is dispatched to both computing and storage nodes. ORCA offers an additional command-line argument to distinguish whether or not a program is an o?oaded program. This process is handled by the init function. 90 Chapter 5 MOLQ: Multi-Criteria Optimal Location Query with Overlapping Voronoi Diagrams Chapters3and4arefocusedonI/Otechniquesandfilesystems. Now, wearepositioned to investigate spatial query applications, which are data-intensive in nature. In this chap- ter, we present a novel optimal location selection problem, Multi-Criteria Optimal Location Query (MOLQ), which can be applied to a wide range of applications. After providing a for- mal definition of the novel query type, we explore two intuitive approaches that sequentially scan all possible object combinations and locations in the search space. Then, we propose an Overlapping Voronoi Diagram (OVD) model that defines OVDs and Minimum OVDs, and construct an algebraic structure under an OVD overlap operation. Based on the OVD model, we design an advanced approach to answer the query. Due to the high complexity of Voronoi diagram overlap computation, we improve the overlap operation by replacing the real boundaries of Voronoi diagrams with their Minimum Bounding Rectangles (MBR). We also propose a cost-bound iterative approach that efficiently processes a large number of Fermat-Weber problems. Our experimental results show that the proposed algorithms can efficiently evaluate the novel query type. This chapter is organized as follows. Section 5.1 formally defines our novel multi-criteria optimal location query. Two intuitive solutions are described in section 5.2. The OVD model is illustrated in section 5.3. Our advanced solution is proposed in section 5.4. The experimental results are shown in section 5.5. 91 5.1 Preliminaries 5.1.1 Definitions A spatial object is defined by the triple , wherel is its location in the search space, and wt and wo are the type weight and object weight associated with the object. E =fP1;:::;Pngdenotes a universal set of objects, where Pi =fp1i;:::;pmi gdenotes a set of objects of a particular type. G =fpu1;:::;pvng, where pu1 2P1;:::;pvn2Pn, denotes an object group, the objects of which are different types. &t and&o are weight functions applied to type weight and object weight. Notations used in this chapter are summarized in Table I. Table 5.1 Symbolic Notations. Symbol Meaning Pi An object set G An object group pui A spatial object in Pi wt Type weight wo Object weight &t A type weight function &o An object weight function jSj The number of elements in the set S An error bound A distance bound A stopping rule used in iterative approaches [121, 127] d(:;:) Euclidean distance between two objects E A set of object sets or groups V A set of Voronoi diagrams R The search space VD(Pi) Voronoi diagram of Pi Dom(pj) Dominance region of pj in a Voronoi diagram OVD An overlapped Voronoi diagram OVR An overlapped Voronoi region MOVD A minimum overlapped Voronoi diagram 92 Weighted Distance of Two Points Given a point q, a spatial object p, a type weight function &t, and an object weight function &o, weighted distance considers both the distance between two points d(:;:) and the weights of p. The formal definition is as follows: WD(q;p;&t;&o) = &t( &o( d(q; p:l); p:wo ); p:wt) (5.1) Weighted Distance from a Query Point to an Object Group Given a point q, an object group G =fpu1;:::;pvng, a type weight function &t, and object weight functions =f&o1;:::;&ong, we define the weighted distance from q to G as the sum of WD(q, psi, &t, &oi), where psi 2G, &oi 2 . The formal definition is WGD(q;G;&t; ) = X psi2G;&oi2 WD(q;psi;&t;&oi) (5.2) Minimum Weighted Distance from a Query Point to Object Groups Given a point q, a set of object sets E = fP1;:::;Png, a type weight function &t, and object weight functions =f&o1;:::;&ong, we define the minimum weighted distance from q to object combinations of E as: MWGD(q;E;&t; ) = min(fWGD(q;G;&t; )jG2P1 ::: Png) (5.3) Multi-criteria Optimal Location Query (MOLQ) Given a set of object sets E =fP1;:::;Png, a type weight function &t, and object weight functions = f&o1;:::;&ong where &oi is applied to an object pui 2Pi, the query is to find an optimal location l in the search space R that minimizes MWGD(l;E;&t; ). 93 MOLQ(E;&t; ) = l; where MWGD(l;E;&t; ) = min(fMWGD(l0;E;&t; )jl02Rg) (5.4) 5.1.2 Voronoi Diagram Ordinary Voronoi Diagram Given a set of objects Pi = fp1i;:::;pmi g, the ordinary Voronoi diagram VDO(Pi) is defined as a number of dominance regionsfDomO(pui)jpui 2Pig, each of which is dominated by an object pui. All locations in DomO(pui) are closer to pui than other objects. DomO(pui) =fljd(l;pui:l) d(l;pvi:l);u6= v;pui;pvi 2Pig (5.5) Weighted Voronoi Diagram InaweightedVoronoidiagram, generatorshavedifferentweightsreflectingtheirvariable properties. Given a set of objects Pi =fp1i;:::;pmi gand a weight function &, the dominance regions are measured by weighted distance. VDW(Pi) =fDomW(pui)jpui 2Pig where DomW(pui) =flj&( d( l; pui:l); pui:wo ) &( d( l; pvi:l); pvi:wo );u6= v; pui;pvi 2Pig (5.6) 5.1.3 Fermat-Weber Point Given a point group G = fpu1;:::;pvng in a d-dimensional space Rd, the Fermat-Weber point is the point q which minimizes the cost function [26]. c(q;G) = X psi2G psi:wt d(q;psi:l) (5.7) 94 The point exists for any point set and is unique except for the event that all the points lie on a single line [52]. In the noncollinear case, the cost function is strictly convex [121]. The solution to the three-point Fermat-Weber problem had been proposed in [58]. In the collinear case of any point set, an optimal point can be found in linear time [26]; however, to the best of our knowledge, if the number of points is greater than three, no exact solution has been reported for non-collinear cases. Instead, an iterative approach is used as an approximate solution proposed in [121, 127]. This approach converges monotonically to the unique optimal location during iterations. The iterative approach starts with an arbitrary location q0 (q0 =2 G) in Rd. In each iteration, a new locationqi = f(qi 1;G) is produced based on a locationqi 1 found before the iteration. According to the monotonic convergence property,qi is closer to the Fermat-Weber point thanqi 1; hence, theoretically, the Fermat-Weber point is located at limn!1fn(q0;G). The function f is described below. f(q;G) = 8 >< >: P psi2Gfg s i(q) p s i:lg if q =2G q Otherwise (5.8) where gsi(q) = p s i:w t d(q;psi:l) f X ps0i02G ps0i0:wt d(q;ps0i0:l)g 1 (5.9) Three existing stopping rules for the iterative method are widely adopted. ?ster and Love developed a generalized bounding method, by which the result is limited within a spec- ified rectangular distance to the optimal location [120]. Verkhovsky and Polyakov adopted the difference of the costs between two successive iterations as the stopping rule in their experiments [123]. Setting an acceptable deviation from the cost of the optimal location as the stopping rule is widely used in applications [97]. For example, given an error bound , the location after the nth iterations ln, and the optimal location l1, the iteration procedure 95 will stop when c(ln;G) c(l1;G)c(l1;G) , where c(l1;G) is approximated by a lower bound of the cost at ln: lb(ln) = dX k=1 (min x ( X psi2G psi:wtjl n:xk ps i:l:xkjjx p s i:l:xkj d(ln;psi:l) )) (5.10) 5.2 Basic Algorithms 5.2.1 Sequential Scan Object Combinations One basic algorithm to solve MOLQ sequentially checks optimal locations of all object combinations. In particular, given E =fP1;:::;Png, the optimal locations l?s of all combina- tionsfpu1;:::;pvng, where pu1 2P1;:::;pvn 2Pn, are calculated by the Fermat-Weber method, which considers both object locations and their type weights. The answer to the query is the best location among these l?s. We call this algorithm the Sequential Scan Combinations (SSC) algorithm. Algorithm 2 SSC(E, &t, ) 1. Ubound =1 2. l = < 0;0 > 3. for 2P1 ::: Pn do 4. Calculate the optimal location l1 of 5. if WGD(l1,fpu1;ps2g, &t, ) < Ubound then 6. Calculate the optimal location l2 of 7. Cost = WGD(l2,fpu1;:::;pvng, &t, ) 8. if Cost < Ubound then 9. Ubound = Cost 10. l = l2 11. end if 12. end if 13. end for 14. return l Since the computation of SSC is expensive, we can set an upper bound to reduce the complexity of the algorithm by filtering out a portion of combinations whose optimal loca- tions cannot be the answer. For an example, two combinations (object groups), G1 and G2, 96 will be evaluated sequentially in SSC. We assume the optimal location of G1 is at l1. The weighted distance from l1 to G1 is denoted by d1. Before processing G2 =, we first setd1 as an upper bound and calculate the optimal locationl2 of, which costs much less than computing an optimal location of multiple points. If the weighted distance from l2 to < pu1;ps2 > is greater than d1, the weighted distance from any location to G2 must be greater than d1, thus calculating the optimal location of G2 can be avoided. During SSC processing, the upper bound is initialized to infinity and will be reduced to the total weighted distance of the best solution found so far. The detailed steps of SSC are described in Algorithm 2. 5.2.2 Sequential Scan Locations Another intuitive algorithm is to convert an infinite search space to a finite number of locations by dividing the search space into a grid. We assume that an answer with a distance-bound is acceptable for the query. We calculate the weighted distance from all the line intersections of the grid to their nearest objects of different types. The best location is selected as the answer to the query. We call this algorithm the Sequential Scan Locations (SSL) algorithm. Algorithm 3 SSL(E, , &t, ) 1. Ubound =1 2. l = < 0;0 > 3. for pu1 2P1 do 4. for l0 in the range(pu1;Ubound) do 5. /* l0 stands for the location of an intersection */ 6. Cost = WGD(l0, the nearest objects of l0, &t, ) 7. if Cost < Ubound then 8. Ubound = Cost 9. l = l0 10. end if 11. end for 12. end for 13. return l 97 Since the number of intersections, depending on , could be extremely large, we employ an upper bound to reduce the number of intersections checked in SSL. In particular, an upper bound is initialized to infinity and will be trimmed to the total weighted distance of the best solution found so far. Since Equation 5.2 is non-decreasing, given an object group G = fpu1;:::;pvng, if the weighted distance between a location l and pu1 is greater than the upper bound, the total weighted distance from l to G must be greater than the upper bound as well, and l cannot be the answer to the query. Based on this observation, SSL only scans the intersections within a range of pu1, which is limited by the upper bound. The steps of SSL are formalized in Algorithm 3. In the for loop in line 4, expanding l0 from pu1 to all directions is more efficient because the range may be reduced by finding a better location during the scan. 5.3 The OVD Model Before describing our MOVD-based solution, we will first introduce the OVD model. In thissubsection, westartwithasimpleOVDexamplewhichprovidesabasicunderstandingof the model. Then, we formally define OVD and Minimum OVD (MOVD) and systematically analyze their properties. We build an algebraic structure of MOVD on an overlap operation +. 5.3.1 An OVD Example Fig. 5.1(a) and Fig. 5.1(b) display two ordinary Voronoi diagrams generated by schools and supermarkets, respectively. The shaded areas in the figures are dominance regions of generators p3 and q1. Fig. 5.1(c) shows an OVD that overlaps the two ordinary Voronoi diagrams. Apparently, the OVD is comprised of a number of overlapping regions, each of which is generated by overlapping two ordinary Voronoi polygons. For example, the doubly shaded area in Fig. 5.1(c) is the overlapping region in both shaded regions of two ordinary 98 p1 p2 p3 p4 p5 p6 (a) Schools q1 q2 q3 q4 q5 (b) Supermarkets p1 p2 p3 p4 p5 p6 q1 q2 q3 q4 q5 (c) An OVD Figure 5.1: Ordinary Voronoi diagrams and OVDs. Voronoi diagrams. According to the properties of Voronoi diagrams,p3 andq1 are the closest school and supermarket to any locations in the doubly shaded region. 5.3.2 Overlapped Voronoi Diagram Definition Overlapped Voronoi Diagram (OVD) GivenasetofobjectsetsE =fP1;:::;PngandasetofVoronoidiagramsV =fVD(Pi)jPi2 Eg, where VD(Pi) can be either an ordinary or a weighted Voronoi diagram generated by Pi in the search space R. Overlapped Voronoi Diagram (OVD) is a set of Overlapped Voronoi Regions (OVR), OVD(E) =fOVRj j1 j mg (5.11) where OVRj is OVR(pu1;:::;pvn) =fljl2Dom(pu1);:::;l2Dom(pvn);pu1 2P1;:::;pvn2Png (5.12) 99 Property 1. An OVD may have one or more empty set OVRs (e.g., OVRj =;). Proof. By definition, anOVRis the intersection of dominance regions from different Voronoi diagrams. Potentially, these dominance regions may not overlap each other (see the domi- nance regions of p1 in Fig. 5.1(a) and q5 in Fig. 5.1(b)). If this is the case, no locations fall into both dominance regions, thus their overlapping region is an empty set. Minimum OVD (MOVD) aMinimumOverlappedVoronoiDiagram(MOVD)isanOVD,inwhichallemptyOVRs have been removed. An OVD is an MOVD if it does not have any empty OVR. The formal definition of MOVD is: MOVD(E) = OVD(E) f;g (5.13) In the extreme case that E is an empty set, no Voronoi diagrams are overlapped, and the search space is not decomposed into subregions. We define this case as: MOVD(;) = OVD(;) =fRg (5.14) OVD/MOVD Properties A number of properties and proofs can be derived from OVD/MOVD definitions. These properties are the basis of the OVD/MOVD model utilized in our MOVD-based solution. Property 2. jMOVD(E)j jOVD(E)j= QPi2EjPij. Proof. By Equation 5.12, OVRs are generated by a combination of selected Voronoi regions. So the number of OVRs in OVD(E) is the product of the number of Voroni regions in these Voronoi diagrams. Because all the possible empty sets have been removed, the size of MOVD(E) is less than or equal to OVD(E). 100 Property 3. Any MOVD fully covers the entire search space R. OVRj2MOVD(E)[ OVRj = R (5.15) Proof. According to the Voronoi diagram property that a Voronoi diagram fully covers the entire search space, 8l 2 R, there must exist Voronoi regions {Dom(psi) 2 VD(Pi)jl 2 Dom(psi);psi 2Pi;Pi 2E}. By Equation 5.12, the location l is at the OVR(pu1;:::;pvn), and an OVD fully covers the entire search space. Moreover, because MOVD(E) only removes empty sets from OVD(E), MOVD(E) covers the entire search space as well. Property 4. The overlapping area of two different OVRs is a subset of their common bound- aries. Proof. By Equation 5.12, an OVR is the overlapping region of {Dom(pu1),..., Dom(pvn)}. If we have two OVRs from an OVD that OVR = Tps i2fp u1;:::;pvngDom(psi), and OVR0 = T ps0i 2fpu01 ;:::;pv0ngDom(p s0 i ) , then the overlapping area of OVR and OVR 0 is OVR \ OVR0 = ( \ psi2fpu1;:::;pvng Dom(psi)) \ ( \ ps0i 2fpu01 ;:::;pv0ng Dom(ps0i )) = n\ i=1 (Dom(psi) \ Dom(ps0i )) (5.16) According to the properties of the Voronoi diagram, Dom(psi)\Dom(ps0i ), where psi 6= ps0i ;psi;ps0i 2 Pi, is either their common boundaries or an empty set. Moreover, if OVR and OVR0 are different, there must exist a psi and ps0i that are different. The boundaries of an OVR are comprised of the boundaries of corresponding Voronoi regions. Hence, the overlapping region of OVR and OVR0 is a subset of their common boundaries. Property 5. Given a type weight function &t, and object weight functions =f&o1;:::;&ong, a point q in OVR(pu1;:::;pvn), the total weighted distance from q to the corresponding object 101 group G =fpu1;:::;pvng is the minimum weighted distance from q to all object combinations G0, where G02P1 ::: Pn. WGD(q;G;&t; ) = min(fWGD(q;G0;&t; )jG02P1 ::: Pn g) (5.17) Proof. IfVD(P1) is generated byP1 and weight function&o1 2 , a pointq inOVR(pu1;:::;pvn) must fall in Dom(pu1) of VD(P1) so that pu1 is the closest point in P1 to q. WD(q;pu1;&t;&o1) is the minimum weighted distance from q to any points in P1. We can get the same result in other sets Pi2E. After summing them up, we obtain Property 5 that WGD(q;G;&t; ) has the minimum distance. Property 6. |MOVD(E)| is bigger than or equal to |VD(Pi)|, where Pi2E. jMOVD(E)j jVD(Pi)j (5.18) Proof. Overlapping two Voronoi diagrams is a process in which one Voronoi diagram is decomposed by another Voronoi diagram. Each Voronoi region is divided into a number of subregions, unless two Voronoi regions from different VDs are exactly the same, or one region contains the other. In these extreme cases, the Voronoi region remains unchanged. Thus after overlapping Voronoi diagrams, the number of overlapping regions in an MOVD is either greater than or equal to the basic Voronoi diagrams. Property 7. When E is made up of only one object set E =fPg, then MOVD(E) = OVD(E) = VD(P) (5.19) Proof. This property is straightforward. If E has only one object set P, there is no other Voronoi diagram overlapped onVD(P). ObviouslyVD(P) does not have any empty regions. So OVD(E) and MOVD(E) are identical to VD(P). This property not only states an 102 extreme case of definitions, but also highlights basic units in the OVD/MOVD model. All OVDs are generated from these building blocks. 5.3.3 Algebraic Structure of MOVD After theoretically introducing the OVD/MOVD model, we will mainly focus on the overlap operation. We create an algebraic structure of MOVD by exploring MOVD space under the overlap operation and discussing its properties. The implementation details of the operation will be presented in Section 5.4. MOVD space MOVD space is a universal set of MOVDs that are fed into and produced by the over- lap operation. Given a universal set of object sets E = fP1;:::;Png, the universal set of MOVD(E) is defined as U(MOVD(E)) =fMOVD(Ei)jEi Eg (5.20) Property 8. The number of MOVDs existing in the universal space is as follows: jU(MOVD(E))j= jEjX i=0 jEj i (5.21) Proof. By definition, MOVD space consists of a number of MOVDs, each of which is gener- ated by a subset of E; thus the number of MOVDs in the space equals the number of subsets in E, which is presented as Equation 5.21. The case thatiequals 0 indicates a special subset, the empty set, defined in Equation 5.14. 103 Overlap operation + We define a binary operation + that overlaps two given MOVDs. The result of + is a new MOVD generated by the union of generator sets of input MOVDs. The formal definition is: given MOVD(Ei) and MOVD(Ej), where Ei;Ej E, then MOVD(Ei) +MOVD(Ej) = MOVD(Ei[Ej) (5.22) A general implementation (RRB) of the operation will be discussed in Section 5.4.2. + Operation Properties By properties of the union operation on sets, we can obtain the following three laws. Property 9. Idempotent Law MOVD(Ei) +MOVD(Ei) = MOVD(Ei) (5.23) Property 10. Commutative Law MOVD(Ei) +MOVD(Ej) = MOVD(Ej) +MOVD(Ei) (5.24) Property 11. Associate Law (MOVD(Ei)+MOVD(Ej)) +MOVD(Ek) = MOVD(Ei) + (MOVD(Ej) +MOVD(Ek)) (5.25) Corollary 1. MOVD(Ei), where Ei E, is unique. Proof. Accordingtothecommutativeandassociatelawsofoperation +, theorderofoverlap- ping Voronoi diagrams does not cause the result to change. Thus MOVD(Ei) is unique. Property 12. MOVD(;) is an identity element. 104 Proof. MOVD(;) equalsfRgthat leaves MOVDs unchanged under operation +. The fol- lowing equation can be easily proved by the definition of +. MOVD(Ei) +MOVD(;) = MOVD(Ei) (5.26) Property 13. Closure: the universal MOVD space of E is closed under operation +. Proof. By definition, given any MOVD(Ei) and MOVD(Ej), where Ei;Ej E, the result of overlapping them is MOVD(Ei[Ej). Obviously, Ei[Ej is still a subset of E, so the result is an element of U(MOVD(E)). Definition Sequential Overlap Operations nX i=1 MOVD(Ei) = MOVD(E1) +:::+MOVD(En) = MOVD( n[ i=1 Ei) (5.27) Definition Partial Order If MOVD(Ei) = MOVD(Ej) +MOVD(Ek) then, MOVD(Ei) MOVD(Ej) MOVD(Ei) MOVD(Ek) (5.28) The partial order definition formalizes a comparison model for evaluating how much in- formationMOVDsmaintain. AsEquation5.28shows,MOVD(Ei)isgeneratedbyMOVD(Ej) and MOVD(Ek). MOVD(Ei) has more information (i.e., objects) than either MOVD(Ej) or MOVD(Ek). We use to denote the relationship. Property 14. MOVD(Ei) + MOVD(Ej) = MOVD(Ei) if MOVD(Ei) MOVD(Ej). 105 Proof. The following equation proves Property 14 by applying the partial order definition that decomposes MOVD(Ei) into MOVD(Ej) and MOVD(Ek), and the commutative and idempotent laws of operation +. MOVD(Ei) +MOVD(Ej) = MOVD(Ej) +MOVD(Ek) +MOVD(Ej) = MOVD(Ej) +MOVD(Ej) +MOVD(Ek) = MOVD(Ej) +MOVD(Ek) = MOVD(Ei) (5.29) 5.4 Design After introducing the OVD model, we now start to illustrate our MOVD-based solution for the query. 5.4.1 Framework of the MOVD-based Solution {VD(P1),?, VD(Pn)} 2. MOVD Overlapper 3. Optimizer (Algorithm 6) 1. VD Generator MOVD ({P1,?, Pn}) Evaluation SystemQuery Inputs {P1, ?, Pn} POI Datasets Object Weight Functions RRB (Algorithm 3 & 4) MBRB(Algorithm 3 & 5) Type Weight Function g57t An Optimal Location Result Figure 5.2: The Framework of the MOVD-based solution. Fig. 5.2 illustrates the framework of our solution. The inputs are POI datasets (Pi2E), objectweightfunctions =f&o1;:::;&ongandatypeweightfunction&t. Theresultisanoptimal location of the query. 106 In the evaluation system, the query is sequentially processed by three modules. In particular, based on POIs of particular types and the object weight functions, VD Generator generatesVoronoidiagramsthatarethebasicMOVDsusedinthenextstep(seeProperty7). Then, a new MOVD is produced by overlapping the basic MOVDs with MOVD Overlapper (see Equation 5.27). Finally, Optimizer sequentially scans OVRs in the new MOVD, finding a locally optimal location in each OVR, and returns the best of these locations as the query result. Essentially, two solutions are proposed in Fig. 5.2, illustrated by two paths from the VD Generator to the Optimizer. The solutions apply either Real Region as Boundary (RRB) or Minimum Bounding Rectangle as Boundary (MBRB) approaches in the MOVD Overlapper. RRB and MBRB are two MOVD overlapping approaches that will be described in the following two subsections. A cost-bound approach used in Optimizer will be presented in Section 5.4.4. The Voronoi diagram generation approaches used in the VD Generator can be found in [83]. 5.4.2 The RRB Approach In this subsection, we describe the RRB approach for MOVD overlapping operation. Since the basic MOVDs are identical to Voronoi diagrams (see Property 7), the generation methods of which has been extensively studied, we will mainly focus on the process of creating an MOVD from two MOVDs. For better explanation, overlapping two basic MOVD is illustrated by the simple example in Fig. 5.3. A plane-sweep-based algorithm is designed in the RRB approach. As the typical plane sweepapproach[32], theRRBapproachmaintainsaneventqueueandtwosweepingstatuses. The event queue consists of a number of event points that are maximum and minimum values of projections of OVRs on the y axis. These maximum and minimum points are called start and end points, which indicate that when the sweeping line arrives at these points, the corresponding OVR starts or ends its intersection with the sweeping line. The event points 107 Sweeping Line q 1 q 2 q 3 q 4 q 5 q 6 St art q 1 ,q 2 St art q 3 St art q 4 ,q 6 End q 3 End q 4 ,q 5 , q 6 St art q 5 End q 1 ,q 2 p 1 p 2 p 3 p 4 p 5 St art p 4 ,p 5 St art p 1 ,p 2 St art p 3 End p 1 ,p 2 End p 3 End p 4 ,p 5 Y St atus St atus Figure 5.3: Overlapping two MOVDs. of both MOVDs are sorted by their y-coordinates in descending order. The sweeping line vertically scans the plane from top to bottom, so that the start point of an OVR will be reached before its end point. The status structures are set up to record OVRs that intersect with the sweeping line. Two status structures are maintained individually and respectively for MOVDs. OVRs also have a range (minimum and maximum values) of projections on the x axis. During the sweeping process, when an end point is arrived at, the corresponding OVR is removed from the status structure. When the sweeping line reaches a start point, the corresponding OVR is inserted into the status structure. Moreover, overlapping regions of the new OVR and OVRs in the other status structure are required to be detected. The detection process first identifies potential OVRs; the range of which overlaps with the new OVR on the x axis. Then, the overlapped region of the two OVRs is calculated. The details are described in Algorithms 4 and 5. The essential idea of the algorithms is that minimum and maximum values on the x and y axes are an outer boundary of OVR. Two OVRs cannot overlap each other if their outer boundaries do not overlap. Overlapped outer boundary detection significantly reduces overlapping region calculations by avoiding the overlapping of two OVRs (e.g., regions of p1 and q5 in Fig. 5.3), which are actually far away from each other. 108 As shown in Algorithm 4, the overlap operation receives two MOVDs as input parame- ters and produces a new MOVD. From lines 1-4, Result, EventQueue, Status and Status0 are initialized to be empty sets. Status keeps the status for MOVD(E), and Status0 for MOVD(E0). Then, in lines 5-6, events are inserted into EventQueue and sorted. Finally, from lines 7-14, all events are iteratively handled by Algorithm 5. Algorithm 4 Overlap(MOVD(E), MOVD(E0)) 1. Result =; 2. EventQueue =; 3. Status =; 4. Status0 =; 5. Push events of MOVD(E) and MOVD(E0) into EventQueue 6. Sort(EventQueue) 7. while ( EventQueue6=;) do 8. e = EventQueue.pop() 9. if ( e is from MOVD(E) ) then 10. EventHandler(e, Status, Status0, Result) 11. else 12. EventHandler(e, Status0, Status, Result) 13. end if 14. end while 15. return Result Algorithm 5 describes the event handler that receives the following four parameters. e is an event object. Current is the status structure of MOVD, from which the event occurs. Other refers to the other status structure. Result is the MOVD produced by the overlap operation. As shown in Fig. 5.5, an MOVD manages a list of OVRs, each of which is represented as , where region maintains the shape of the OVR and pois is a list of objects associated with the OVR. If a start event occurs, the corresponding OVR is first inserted intoCurrentstatus. Then, potentially overlapped OVRs inOther are detected by comparing their RangeX with the current OVR. RangeX denotes the range of possible x-coordinates of OVRs. If their RangeX overlap, the overlapped region is calculated in line 5. If the new-generated overlapped region is not empty, a pair of the region and its associated 109 pois will be appended to Result. In the second branch, an end event takes place and the corresponding OVR is removed from Current. Algorithm 5 EventHandler(e, Current, Other, Result) 1. if e is a start event then 2. Insert e:ovr into Current 3. for ovr2Other do 4. if RangeX(e:ovr) \ RangeX(ovr)6=; then 5. region = e:ovr:region \ ovr:region 6. if region6=; then 7. pois = e:ovr:pois [ ovr:pois 8. Result.append() 9. end if 10. end if 11. end for 12. else 13. Remove e:ovr from Current 14. end if 15. return A general overlapping approach is not presented; however, the RRB approach can be modifiedtobeageneralapproachusedfortheOVDmodelifline7isremovedandonlyregion is appended to Result in line 8. pois contains the additional information for our specific query type. Algorithm 5 does not specify any methods for overlapping region calculation in line 5. The reason is that the shape of OVRs in a general model is difficult to predict. The case is worse after overlapping because the OVRs become more complex. Furthermore, overlap methods for regions vary greatly as well. The overlap methods for polygons are different from the ones for circles. The overlap methods applied in the model cannot be determined until the shapes of regions have been decided. We will discuss this issue in Section 5.4.3. The RRB approach is an output-sensitive algorithm, the complexity of which depends on the size of the results, or more exactly the number of OVRs existing in the new MOVD. We denote the average size of input MOVDs by n. There are totally 4 n events, and sorting them in order takes O(nlgn) time. There are 2 n start and end events handled 110 by Algorithm 5. If status structures are organized as a balanced search tree that sorts OVRs in order by their start x-coordinates, inserting or deleting an OVR from status can be completed inO(lgn) time. The total cost of maintaining status isO(nlgn) as well. If status structures record the start and end x-coordinates of OVRs, a range specified by the points that are either immediately smaller than minimum or greater than maximum x-coordinate of the current OVR can be figured out in O(lgn) time. The OVRs, whose event points are located at the range are potentially required to overlap the current OVR. Moreover, we denote the number of OVRs in the result by I and costs of overlapping region computation by . The costs of calculating the overlap regions is I. In the worst case, I becomes n2, so that the total costs of operation + is O( n2). 5.4.3 The MBRB Approach According to the variety of weight functions specified in the query inputs, various Voronoi diagrams are generated by the VD Generator. In addition to the ordinary Voronoi diagrams shown in Fig. 5.1, two typical weighted Voronoi diagrams are displayed in Fig. 5.4. The generation methods of additively and multiplicatively Voronoi diagrams have been pre- sented in [22, 60, 18, 44, 19, 35, 81]. More practical Voronoi diagrams can be found in [83]. Although the generation methods of weighted Voronoi diagrams had been extensively studied, efficiently maintaining the shape of OVRs is extremely difficult since they are not in regular shapes. In general, their boundaries have to be modelled by a number of curves. More importantly, overheads of overlapping region calculations would be highly expensive due to the complexity of boundary representation. To overcome this difficulty, we propose the MBRB approach that combines Algorithm 4 with an alternate event handler, MBRBHandler, for the overlap operation. The MBRB approach is motivated by an observation that the shapes of OVRs are not used in Optimizer. Instead, the POI locations and their weights are the criteria for optimal location selection; therefore, we set Minimum Bounding Rectangles (MBR) of OVRs as their shapes in this 111 (1) (2) (3) (4) (2) (4) (a) Additively (1) (4) (2) (4) (2) (3) (2) (2) (1)(3) (3)(2) (b) Multiplicatively Figure 5.4: Weighted Voronoi diagrams (the numbers indicate weights). approach. Two OVRs will be treated as overlapped if their MBRs are overlapped. This approach is able to significantly reduce the cost of the overlap operation by simplifying boundary maintenance and avoiding region overlapping calculations; however, the approach suffers from that unnecessary OVRs (false positives which are not really overlapped) would be appended to the new MOVD. ThedatastructureusedinMBRBHandlerisillustratedinFig.5.5. AnOVRisindicated as , where MBR is comprised of minimum and maximum points on the x and y axes, and pois is a list of objects associated with the OVR. OVR 1 ... OVR m MB R 1 ...... MB R m O VR 1 O VR m Region 1 ...... Region m A n MO VD in R R B An MO VD in MBR B PO I Datas ets ... ... ... ... ... P 1 P n ... Figure 5.5: Data structures. 112 The MBRBHandler is described in Algorithm 6. In the branch that handles start event processing, the handler only detects whether two MBRs are overlapped. If this is the case, the MBRs are overlapped and the objects associated with the two OVRs are merged. The new OVR is appended to the result. The final branch remains unchanged. Algorithm 6 MBRBHandler(e, Current, Other, Results) 1. if e is a start event then 2. Insert e:ovr into Current 3. for ovr2Other do 4. if RangeX(e:ovr)TRangeX(ovr)6=; then 5. mbr = e:ovr:MBR T ovr:MBR 6. pois = e:ovr:pois S ovr:pois 7. Results.append() 8. end if 9. end for 10. else 11. Remove e:ovr from Current 12. end if 13. return Compared to the RRB approach, the complexity of region overlapping decreases in constant time, but the size of output I increases, the impact on the performance of which is difficult to be evaluated. The upper bound of I is n2; therefore, the complexity of the MBRB approach becomes O(n2) in the worst case. It is worth noting that the basic principle of our solution is that the search space is decomposed into a number of OVRs, in which a locally optimal location is found by Optimizer; however, theshapesofOVRsarenotcalculated. Howdoesoursolutiondetermine an optimal location in an OVR? Our solution does not limit the locally optimal location in a particular OVR. Instead, we look for it in the entire search space. As shown in Fig. 5.6, if an optimal location Lk is found in OVRk, Lk will undoubtedly be appended to the candidate list. If the optimal locationLi is outside ofOVRi, according to Property 3,Li must be located in another OVR, say OVRj, which must have an optimal location Lj. Lj must be identical or better than Li. Appending them to the candidate list does not change the global optimum since only the 113 OVRi OVRjL i Lj OVRk Lk Figure 5.6: Optimal locations. best one will be returned as the query result. Thus appending Li to the candidate list does not change the global optimum. 5.4.4 A Cost-Bound approach in Optimizer An optimal location q that minimizes MWGD(q;E;&t; ) is found in the third step of the proposed framework. The framework does not specify a weight function for type weight calculations; however, we mainly focus on a multiplicatively-based weight function, which is one of the practical methods used in real applications. Other weight functions can be applied in the framework as well. If applying a multiplicatively-based weight function to type weights, the problem of finding an optimal location in each OVR is converted into a typical Fermat-Weber problem in two-dimensional space. The objects associated with OVRs are the points in the Fermat- Weber problems. The weights of the points are specified by the type weight function &t. The object weights are integrated into the distance from a location to points. As mentioned in Section 5.1.3, the problem had been solved theoretically. The optimal location in three- point cases and multiple-collinear-point cases can be found in constant and linear time, respectively. An approximate iterative approach has been proposed for other cases [127]. 114 In the RRB and MBRB approaches, we observe that large number of OVRs will be created by MOVD Overlapper (see Property 2). In addition, the number of the problems increases rapidly when the number of objects grows. A basic approach is to sequentially calculate optimal locations of these Fermat-Weber problems, and select the best one as the query result; however, applying the iterative method to the Fermat-Weber problems is surprisingly expensive. Therefore, we propose a cost-bound approach, in which an optimal cost is set as a global lower bound. During the processing of a Fermat-Weber problem, a local lower bound of the cost in each iteration will be calculated. If the local lower bound is greater than the global lower bound, no matter how many iterations will be processed, its local optimal cost cannot be better than the global lower bound. Thus the following iterations can be avoided, even though the stopping condition has not been satisfied. The definition and the cost-bound approach of the problem are formally described as below. Optimum Location of Multiple Fermat-Weber Problems Given a set of object groups E =fG1;:::;Gng, where Gi (jGij 3) contains points of a Fermat-Weber problem, a type weight functions &t and object weight functions . Let lj denote the optimal location of Gj under a stopping condition , the optimal location of E is a location l2fljj1 j ngthat minimizes WGD(lj, Gj, &t, ). A Cost-Bound Approach The cost-bound approach receives a set of object groups E, a type weight function &t, object weight functions , and a stopping condition . The weights of the objects are indicated by &t. The distance from a location to points is calculated by their Euclidean distance and . The number of points in the Fermat-Weber problems (jGij) is unnecessarily fixed. In Algorithm 7, the global lower bound, Cbound, is initialized to infinity and reduced to the minimum cost of the optimal location found so far. The algorithm sequentially checks 115 Algorithm 7 CostBoundApproach(E, &t, , ) 1. Cbound =1 2. l = < 0;0 > 3. for Gi2E do 4. Initialize li to the center of Gi 5. if jGij= 3 or Gi is a collinear case then 6. Calculate the optimal location li of Gi 7. else 8. Let Gi = 9. Calculate the optimal location l0 of 10. if WGD(l0,fpu1;ps2g, &t, ) > Cbound then 11. Continue 12. end if 13. repeat 14. li = f (li, Gi) /* Iterating, see Equation 5.8 */ 15. Lbound = lb(li) /* see Equation 5.10 */ 16. until is satisfied or Lbound Cbound 17. end if 18. Cost = WGD(li, Gi, &t, ) 19. if Cbound > Cost then 20. Cbound = Cost 21. l = li 22. end if 23. end for 24. return l the Fermat-Weber problems, in each of which a local optimal location is found in line 4-17. In the branch of the iterative method inside the loop, an optimal location of the first two points in Gi is first detected in line 8-12, as SSC solution does. If a better result of Gi potentially exists, a local lower bound is calculated in each iteration in line 15. If the local lower bound is greater than Cbound, the iteration will stop in line 16. The Cost-bound approach can be used in the SSC solution as well. 5.5 Experimental Validation In this subsection, we evaluate the performance of the OVD model and proposed query solutions with real-world data sets. We implemented the proposed algorithms in C++. All data was loaded into the main memory during the execution of the simulations. All the 116 experiments were conducted on a Red Hat Enterprise Linux server equipped with four Intel Xeon X5550 2.67 GHz processors and 24GB of memory. All simulation results were recorded after the system model reached a steady state. 5.5.1 Data Sets In our experiments, the data sets were downloaded from GeoNames1. We retrieved 230,762 streams (STM), 200,996 schools (SCH), 166,788 populated places (PPL), 225,553 churches (CH) and 110,289 buildings (BLDG) in the United States. By default, we set the type weight wt and object weight wo to 1. The multiplicatively-based weight functions are used as &t and . GPC library2 is used for polygon overlapping calculations. 5.5.2 Cost-Bound Approach Evaluation Weevaluatethebasic(Original)andcost-bound(CB)approachesbyvaryingthenumber of Fermat-Weber problems and the error bound . The basic approach sequentially calculates the optimum locations of all Fermat-Weber problems, and selects the best location for the result. The number of points in each Fermat-Weber problem is fixed to 5. The coordinates and type weights of points are randomly generated from 0 to 10. The iterative method for a Fermat-Weber problem will stop when the deviation from the optimal cost is less than the error bound (see Section 5.1.3) [97]. Fig. 5.7 displays the execution time of the two approaches. As either the problem size increases or decreases, the execution time of both approaches rises. Obviously, the growth rate of the original approach is higher than the cost-bound approach because a significant number of unnecessary iterations can be avoided by setting a cost bound, which makes the cost-bound approach more efficient, even though it has to pay extra overhead on lower bound calculation in each iteration. 1http://www.geonames.org/ 2http://www.cs.man.ac.uk/ toby/gpc/ 117 2 4 6 8 10 0.0010.010.11 5 10 15Time (s)OriginalCB Fermat Weber Problems (10K) ? (10-2) Figure 5.7: The CB approach evaluation. 1 2 4 8 16 1 2 4 8 1610010 1102 103104 105 Time (s) RRBMBRB STM (10K) SCH (10K) Figure 5.8: Execution time. 1 2 4 8 16 1 2 4 8 1610210 3104 10510 6 OVRs (K) RRBMBRB STM (10K) SCH (10K) Figure 5.9: Number of OVRs. 1 2 4 8 16 1 2 4 8 16102 10310 410 5 Mem (MB) RRBMBRB STM (10K) SCH (10K) Figure 5.10: Memory consumption. 5.5.3 Overlapping Two Voronoi Diagrams Two overlap approaches, RRB and MBRB, on two regular Voronoi diagrams are eval- uated with various data set sizes. The Voronoi diagrams are generated by two object sets, which are randomly selected from STM and SCH. Their sizes are indicated by x and y axes in Fig. 5.8-5.10. From Fig. 5.8, we observe that the MBRB performs better than the RRB in terms of execution time. In RRB, the OVRs in the new-generated MOVD are determined by region overlapping calculation (polygon overlapping calculation in this experiment), which is more expensive than the MBR detection (rectangle overlapping calculation) used by MBRB. Therefore, MBRB is able to complete the overlapping processing in a shorter time. Also, due to replacing real regions of OVRs with their MBRs, MBRB will generate more OVRs 118 than RRB, as shown in Fig. 5.9. Two OVRs that are not really overlapped with each other may be determined to be overlapped by the MBR detection. However, Fig. 5.10 shows that MBRB consumes less memory than RRB. The reason is that although MBRB has more OVRs, the regions (MBRs) of which can be represented by two points while all vertices of polygons have to be recorded in RRB. According to Fig. 5.10, the total number of points managed by MBRB approach is smaller than RRB. 5.5.4 Overlapping Multiple Voronoi Diagrams 102 103 104 105 106 1 2 3 4 5 6Number of Objects in Each Type Number of Object Types RRBMBRB (a) Availability of multiple types. 101 102 103 104 105 1 2 3 4 5 6 Time (s) Number of Object Types RRBRRB* MBRB (b) Execution time. 102 103 104 105 106 1 2 3 4 5 6 OVRs (K) Number of Object Types RRBRRB* MBRB (c) Number of OVRs. 102 103 104 105 1 2 3 4 5 6 Mem (MB) Number of Object Types RRBRRB* MBRB (d) Memory consumption. Figure 5.11: Varying number of object types. In our experiment, we examine the overlap operation by varying the number of Voronoi diagrams. These Voronoi diagrams are generated by objects randomly selected from E = fSTM;SCH;PPL;CH;BLDGg. In addition to evaluating the performance, we explore the availability of the overlap operation, which is described by the maximum size of objects 119 that can be processed on the test bed. All data is assumed to be loaded into the memory (24GB). Fig.5.11(a)demonstratestheavailabilityoftheoverlapoperationbyvaryingthenumber ofobjecttypes. Whenthenumberofobjecttypesincreases,themaximumnumbersofobjects in both the RRB and MBRB approaches drop rapidly. The more Voronoi diagrams overlap, the more OVRs are generated which requires more memory. Moreover, the dropping rate of the MBRB approach is higher than RRB because the MBRB approach consumes more memory when the number of object types is greater than three as shown in Fig. 5.11(d). Fig. 5.11(b), 5.11(c) and 5.11(d) display corresponding execution time, the number of OVRs and memory consumption of both approaches with parameters that lie on the availability line. In order to fairly compare the two approaches, the performance of the RRB approach with parameters identical to the MBRB approach is highlighted by RRB*. As we expect, the MBRB approach always generates more OVRs than RRB* as shown in Fig. 5.11(c). Moreover, in Fig. 5.11(b), when the number of object types is greater than 4, RRB* performs better than the MBRB approach because the computation complexity induced by surprisingly large number of OVRs dominates the entire process in the MBRB approach, which has greater impact than the benefits obtained from the region overlapping calculation. A turning point in terms of memory consumption is observed between 2 and 3 in Fig. 5.11(d). When overlapping three or more Voronoi diagrams, the MBRB approach consumes more memory due to the large number of OVRs, in which the total number of points is more than the vertices managed by RRB*. 5.5.5 MOLQ Evaluation We evaluate the solutions for MOLQ queries with three and four object types that are popular applications in the real world. The type weights are randomly generated from 0 to 10. The objects are randomly selected from E =fSTM;SCH;PPLgin three-type case and E =fSTM;SCH;PPL;CHgin four-type case, respectively. 120 101 102 103 104 1 2 3 4 5 Time (s) Number of Objects in Each Type (K) SSCRRB MBRB Figure 5.12: Three object types. 0 200 400 600 800 1000 1200 100 200 300 400 500 Time (s) Number of Objects in Each Type SSLSSC RRBMBRB Figure 5.13: Four object types. Fig. 5.12 displays the performance of SSC, proposed RRB and MBRB solutions. The cost-bound approach is used in all the three solutions. We do not compare with the SSL approach, since it only provides an approximate solution while the other three approaches produce exact solutions for the queries with three object types. As Fig. 5.12 shows, RRB and MBRB perform better than SSC because they avoid a significant number of object combinations. Overlapping Voronoi diagrams is a process of filtering out combinations that cannot be the closest objects of any location. Another observation is that MBRB completes the query processing in a shorter time than RRB. The evidence has been shown in Fig. 5.7 and 5.11; the benefit obtained by MOVD Overlapper in MBRB is larger than the overhead paid in Optimizer. In the query with four object types, only approximate results can be provided by the four approaches. The distance-bound in SSL approach is fixed at 1 km. The error bound is set to be 0.001. Fig. 5.13 shows the execution time of the four solutions, in which the RRB solution has the best performance. Although the execution time of overlapping processing in the MBRB approach is slightly shorter than RRB as shown in Fig. 5.11(b), more OVRs generated by MBRB increase the computation complexity of the Fermat-Weber calculation. 121 Chapter 6 Efficient Evaluation of Spatial Keyword Queries on Spatial Networks In this chapter, we present efficient approaches to answer spatial keyword queries on spatial networks. In particular, we systematically introduce formal definitions of Spatial Keyword k Nearest Neighbor (SKkNN) and Spatial Keyword Range (SKR) queries. Then, we illustrate the framework of a spatial keyword query evaluation system, which consists of Keyword Constraint Filter (KCF), Keyword and Spatial Refinement (KSR), and the spatial keyword ranker. KCF employs an inverted index to calculate keyword relevancy of spatialobjects, andKSRrefinesintermediateresultsbyconsideringbothspatialandkeyword constraints with the spatial keyword ranker. In addition, we design novel algorithms for evaluating SKkNN and SKR queries. These algorithms employ the inverted index technique, shortest path search algorithm, and network Voronoi diagrams. Finally, we apply both real- world and synthetic data sets to evaluate the performance of the proposed solutions. Our extensive experimental results show that the proposed SKkNN and SKR algorithms can efficiently answer spatial keyword queries. This chapter is organized as follows. Section 6.1 formally defines the Spatial Keyword queries. Section 6.2 illustrates our proposed solutions. Section 6.3 shows our experimental results. 6.1 Query Type Definition and Background In order to explain definitions and algorithms in the following subsections, we prepare a sample data set of hotels in Table 6.1 and an example spatial network in Figure 6.1. All the hotels have three attributes which include their names, amenities, and distances from a specific location q. In Figure 6.1, road segments are assigned weights that stand for their 122 7 2 3 4 6 5 1 1 3 4 4 5 2 3 252 44 6 2 1 11 2 3 3 2 3 4 5 6 7 8910 Figure 6.1: An example spatial network. individual costs (e.g., distance or time). The location q and hotels are symbolized with a triangle and squares on road segments, respectively. Table 6.1: A sample data set of hotels. Name Dn(q;:) Amenities H1 3 Internet, Fitness Center, Pets Allowed, Parking H2 8 Pool, Parking, Room Service H3 13 Internet, Fitness Center, Pets Allowed, Parking H4 10 Parking, Airport Shuttle H5 8 Pets Allowed, Breakfast, Hot Tub, Restaurant Onsite H6 15 Internet, Pets Allowed, Restaurant Onsite H7 8 Fitness Center, Hot Tub, Parking 6.1.1 Foundation In this subsection, we introduce the foundation of spatial keyword queries. In a SK query, a spatial object is defined as a pair , where l is a location in the search space and t is a text description (e.g., amenities and features of a hotel) of the corresponding object. Table 6.2 summarizes notations used in this chapter. 123 Distance on Spatial Networks Spatial networks are composed by undirected weighted graphs G = (V;E), where V is a set of vertices and E is a set of edges. In general, the weight of each edge is determined by a metric measured in physical distance or time cost for traveling the road segment [71, 70]. The distance between two objects Dn(:;:) on spatial networks is the summation of all segment weights on the shortest path connecting the two objects. For example, in Figure 6.1, Dn(H6;H7) = Dn(H6;n8) +Dn(n8;H7) = 7. Matched Keywords Matched-keywords is a set of keywords which are in both sets of p:t and K, where p is a given spatial object, and K is a set of keywords specified by a user. For example, given a hotel H2 with keywords {?Pool", ?Parking", ?Room Service"} and a set of keywords K {?Pool", ?Parking", ?Breakfast"}, MK(H2;K) is an intersection of H2:t and K, {?Pool", ?Parking"}. The formal definition of the MK function is shown in Equation (6.1). MK(p;K) =fki2Kjki2p:tg (6.1) Fully Matched Keyword Search With a given data set, the purpose of Fully Matched Keyword Search (FMKS) is to find objects whose descriptions completely match with a set of keywords K specified by a requester. As shown in Equation (6.2), the descriptions of search results of FMKS may be either identical to K or a superset of K. For example, given keywords ?Internet" and ?Pets Allowed" and the data set in Table 6.1, the result set of the FMKS isfH1;H3;H6g. FMKS(P;K) =fpi2PjK pi:tg (6.2) 124 Table 6.2: Symbolic notations. Symbol Meaning P A set of spatial objects Q A spatial keyword query K A set of search keywords q The location of a requester I An inverted index k The requested number of objects in the result of a SKkNN query r The search range of a SKR query s The ranking score of an object jSj The number of elements in set S d(:;:) The Euclidean distance between two points Dn(:;:) The shortest network distance between two points R The result set of a query E The explored region of a VDkNN query C The set of candidate spatial objects Partially Matched Keyword Search With a given data set, the purpose of Partially Matched Keyword Search (PMKS) is to retrieve objects which match at least one keyword in the user defined keyword set as shown in Equation (6.3). For example, given keywords ?Internet" and ?Pets Allowed" and the data set in Table 6.1, the results of the PMKS are fH1;H3;H5;H6g. The difference in search results from the previous FMKS is H5, which matches only one keyword (?Pets Allowed") and is a valid answer of this PMKS. PMKS(P;K) =fpi2Pj9kj2pi:t and kj2Kg (6.3) Weighted Keyword Relevancy We use a weight function TR to calculate keyword relevancy of a specific spatial object p[128]. We assume that each keywordki in a keyword setK is assigned with a weightw(ki), which indicates its importance in queries. Consequently, given an object p and a keyword set K, we have the following equation: 125 TR(p;K) = X ki2MK(p;K) w(ki) (6.4) For special cases where all keywords share identical weight, Equation (6.5) can be de- rived from Equation (6.4) where w(ki) = 1 and jMK(p;K)j is the number of keywords in MK(p;K). TR(p;K) = X ki2MK(p;K) 1 =jMK(p;K)j (6.5) 6.1.2 Spatial Keyword Ranker Aspatialkeywordrankerisdesignedtodeterminetherankingofagivenspatialobjectin a SKkNN query by employing both metrics, spatial network distance and keyword relevancy. We utilize a ranking function RK to compute how well an object matches a SKkNN query. Given a query Q and an object p, the ranking function is defined as follows: RK(Q;p) = 1 TR(p:t;Q:K) 2 Dn(p:l;Q:l) (6.6) In Equation (6.6), 1 and 2 are parameters of each part of the function [53], and their values depend on user preferences. For example, if a user is more concerned about keyword match, 1 can be set to a larger value than 2 in order to make keyword relevancy dominant in the RK function. Moreover, intuitively, an object with either shorter distance or higher keywordrelevancywouldhaveahigherrankinginqueryresults. Therefore,TRhasapositive influence on the RK function while Dn has a negative one. 6.1.3 Spatial Keyword kNN Queries Based on the spatial keyword ranker, the purpose of a spatial keyword kNN query is to retrieve k objects which have top k ranking values. 126 Definition Given a SKkNN query Q and an object set P, we define SKkNN(P, Q) as following: RK(pi) RK(pj) where pi2SKkNN(P;Q) and pj2PnfSKkNN(P;Q)gandjSKkNN(P;Q)j= k (6.7) We utilize the data set in Table 6.1 and spatial network in Figure 6.1 to demonstrate a SKkNN query example. Assume a visitor wants to find the two nearest hotels that have the amenities, ?Internet" and ?Pets Allowed" from q. Partially matched results are acceptable when there are not enough fully matched objects in the vicinity. In addition, all keywords have identical weight and the values of 1 and 2 are 0.8 and 0.2, respectively. The result set for this query isfH1;H5gwhere H5 has only one matched keyword. Besides, if 4 hotels are requested instead of 2, the result set will befH1;H5;H3;H6g. 6.1.4 Spatial Keyword Range Queries The purpose of a SK Range query is to find all the objects that fully match the given keywords within a user specified distance. Definition Let P be a set of objects. Given a query location q, a search range r, and a set of keywords K, a SK range query is defined as follows: SKR(q;r;K) =fpi2PjK pi:t and Dn(pi;q) rg (6.8) Assume a tourist wants to find all the hotels bearing the keywords ?Internet" and ?Pets Allowed" within 10 miles from q on the sample spatial network. The answer isfH1gbased on the example data set (Table 1). Furthermore, if the range is extended to 20 miles, the result set will befH1;H3;H6g. 6.1.5 Network Voronoi Diagram We employ network Voronoi diagrams in our approach for efficiently evaluating spatial keyword queries. A Voronoi diagram divides a metric space into disjoint polygons (Voronoi 127 (a) An ordinary Voronoi diagram. (b) a network Voronoi diagram. Figure 6.2: Voronoi diagram examples. polygons) based on the distances to a specified set of points (generators) in the space. The nearest neighbor of any point inside a polygon is the generator of the polygon. The Voronoi Polygon (VP) and the Voronoi Diagram (VD) in the Euclidean plane can be formally defined as follows. Given a set of generators P =fp1;:::;png R2, where 2 n<1and pi6= pj for i6= j;i;j2In =f1;:::;ng. The region given by: VP(pi) =fpjd(p;pi) d(p;pj)gfor j6= i;j2Ing (6.9) is called the Voronoi polygon associated with pi where d(p;pi) denotes the minimum Euclidean distance between p and pi. In addition, the set given by: VD(P) =fVP(p1);:::;VP(pn)g (6.10) is called the Voronoi diagram generated by P. Figure 6.2(a) demonstrates a Voronoi diagram in the Euclidean plane. The Network Voronoi Diagram (NVD) is defined based on a planar geometric graph where the locations of objects are restricted to the links that connect the nodes of the graph. Distancesbetweenobjectsaredefinedasthelengthoftheshortestpathinthegraph(network distance) [83]. In our problem, spatial networks can be modeled as a geometric graph where 128 the intersections are symbolized by nodes of the graph and edges are represented by the links connecting the nodes. Furthermore, the weights of links are the distances between corresponding nodes. The network Voronoi diagram can be formally defined as follows. Consider a geometric graph G(N;L) consisting of a set of nodes N =fp1;:::;pn;pn+1;:::;plg, where the first n elements are the generators (i.e., P =fp1;:::;png), and a set of links L =fl1;:::;lkgwhich form a connected network. We define the distance from a point p on a link in L to a node pi in N, Dn(p;pi), by the length of the shortest path from p to pi. For all j2Innfig, let Dom(pi;pj) =fpjp2 k[ i=1 li;Dn(p;pi) Dn(p;pj)g (6.11) b(pi;pj) =fpjp2 k[ i=1 li;Dn(p;pi) = Dn(p;pj)g (6.12) We call the set Dom(pi;pj) the dominance region of pi over pj on links in L and the set b(pi;pj) the bisector (border) points between pi and pj on links in L. Accordingly, the Voronoi link set associated with pi and the network Voronoi diagram are defined as follows, respectively: Vlink(pi) = \ j2Innfig Dom(pi;pj) (6.13) NVD(P) =fVlink(p1);:::;Vlink(pn)g (6.14) where Vlink(pi) specifies all the points in all the links in L that are closer to pi than any other generator point in N. By properly connecting adjacent bisector points of a generator to each other without crossing any of the links, we can generate a bounding polygon, named Network Voronoi Polygon (NVP) [83, 65]. Figure 6.2(b) shows an NVD example where each line style corresponds to a Voronoi link set of a generator (NVPs are created by connecting adjacent bisector points). 129 6.2 System Design In this subsection, we design a spatial keyword query evaluation system which is com- prised of Keyword Constraint Filter (KCF), Keyword and Spatial Refinement (KSR), and the spatial keyword ranker. For the proposed spatial keyword query algorithms, if two or more objects have the same ranking score, our algorithms will sort the objects based on their distances to the query point (i.e., in an ascending order). In addition, in order to simplify the explanation, we assume all keywords have the same weight. 6.2.1 Framework of Query Evaluation Before presenting the details of our spatial keyword query algorithms, we briefly intro- duce the framework of our system. As illustrated in Figure 6.3, the spatial keyword query evaluation system comprises three main components which are Keyword Constrain Filter (KCF), Keyword and Spatial Refinement (KSR) and the spatial keyword ranker. This sys- tem receives both spatial data sets and spatial keyword constraints as inputs and produces results after a two-step computation. The system employs a filter-and-refine strategy to answer SK queries. The two key steps are KCF and KSR. KCF receives spatial data sets and keyword constraints and filters out objects that do not match any user specified keyword. Because spatial network distance computation is expensive, we do not take spatial constraints into account in this step. The main purpose of KCF is to reduce the number of candidate objects in order to decrease computation costs in the next step. In the second step, KSR receives inputs from KCF and refines the intermediate results based on both keyword and spatial constraints. Afterward, KSR returns the qualified objects sorted by their ranking scores provided by the ranker. 130 SKQ Evaluation System KeywordConstraint Filter Keyword and Spatial Refine Ranker Results Data Sets SpatialConstraintsKeywordConstraintsSK Query Figure 6.3: Framework of the spatial keyword query evaluation system. 6.2.2 Keyword Constraint Filter Inverted Indexing Structure Inverted indexes are primarily designed to support keyword searches from a set of text files. In our system, we utilize inverted indexes to search for objects related to specific keywordsfromspatialdatabases. AsillustratedinFigure6.4, anindexoftermsismaintained in our system where each term is a unique keyword, and each postings list contains a number of object identifiers. Each postings list is in sorted order (based on object identifiers) to facilitate the efficient search of objects related to a specific keyword. If an object has multiple keywords, its identifier will appear in each corresponding postings list. In addition, inverted indexes are independent of other dedicated index structures, such as R-trees or grids, in spatial databases. Keyword Match Algorithm Based on the proposed problem, we design a keyword match algorithm by employing the inverted index-based merge technique [77] to calculate keyword relevancy of spatial objects. With the keyword match algorithm, we measure keyword relevancy of a spatial object by counting the number of matched-keywords. The more matched-keywords the object has, 131 POI1 POI2 POI3 POI4 POI5 POI6 POI7 POI8 POI9 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Keyword 1 Keyword 2 ... Keyword n id2 ... id7 ... id4 idM ... ... idM ... ... idN Figure 6.4: Inverted index structure. the higher its keyword relevancy is. This algorithm receives an inverted index and a set of keywords as input parameters, and then returns the keyword relevancy of objects which match with at least one keyword. InAlgorithm8,mergeListisalist,ofwhicheachelementcomprisesapair whereidis an object identifier andoccurrenceis the corresponding keyword relevancy. With the for loop in line 1, the algorithm iteratively retrieves object lists of matched-keywords from the inverted index structure and merges these object lists into mergeList. This merge process, illustrated in lines 5 to 22, is an essential part which supports partially matched- keyword search. The worst-case time complexity of Algorithm 8 is O(jKj jPj), wherejPj is the number of spatial objects in the data set and jKj stands for the number of search keywords. Figure6.5illustrateshowKeywordMatchworks. WeutilizethedatasetinTable6.1and spatial network in Figure 6.1 for explanation. Assume a query with keywords ?Internet", ?Pets Allowed", and ?Parking" is evaluated. Algorithm 8 first finds object lists that are related to these keywords by searching the inverted index. As shown in Figure 6.5, three 132 Algorithm 8 KeywordMatch(I, K) 1. for each term in I do 2. if (term2K) then 3. iterator iterA = mergeList.begin 4. iterator iterB = term.idList.begin 5. while (iterA != mergeList.end and iterB != term.idList.end) do 6. if (iterA.id > iterB.id) then 7. newNode {iterB.id, 1} 8. insert newNode at previous position of iterA 9. iterB++ 10. else if (iterA.id == iterB.id) then 11. iterA.occurrence += 1 12. iterA++ 13. iterB++ 14. else 15. iterA++ 16. end if 17. end while 18. while (iterB != term.idList.end) do 19. newNode {iterB.id, 1} 20. append newNode to the end of mergeList 21. iterB++ 22. end while 23. end if 24. end for 25. return mergeList ... ... ... ... ... Internet H1 H3 H6 ... ... ... ... ... ... Pets Allowed H1 H3 H5 H6 ... ... ... ... Parking H1 H2 H3 H4 H7 ... ... ... ... ... (H1, 1) (H3, 1) (H6, 1) (H1, 2) (H3, 2) (H5, 1) (H6, 2) (H2, 1) (H3, 3) (H5, 1) (H6, 2) (H1, 3) (H4, 1) (H7, 1) Merge List An Empty ListInverted Index Figure 6.5: An example of KeywordMatch. 133 object lists,fH1;H3;H6g,fH1;H3;H5;H6gandfH1;H2;H3;H4;H7g, are retrieved from the inverted index. Then it merges these lists into a mergeList. In the first round, KeywordMatch simply copies objects in the ?Internet" list to merge- List, and each object is marked by one occurrence. Then, in the second round, Keyword- Match compares objects in the ?Pets Allowed" list with mergeList. If an object appears in the ?Pets Allowed" list but does not exist in mergeList, it will be inserted into mergeList with occurrence marked by one. On the contrary, if an object already exists in mergeList, its counter will be increased by one. The third round of merging the ?Parking" list is processed in the same way. After the iterations, mergeList contains the final result with seven objects and their keyword relevancy shown at the bottom of the dashed rectangle in Figure 6.5. 6.2.3 The Network Expansion-Based SKkNN Query Algorithm In this subsection, we explain our algorithm for processing spatial keyword k nearest neighbor query based on network expansion techniques [87, 34]. As present in Section 6.1.3, the algorithm receives an inverted index I, a query point q, the value of k, and a set of keywords K as input parameters and returns the top k objects by considering both keyword and spatial constraints. Forsearchingtheshortestpathbetweenobjectsonspatialnetworks,Dijkstra?salgorithm- basedapproaches[34,42,43]havebeenwidelyutilizedinvariousapplications. Givenasource point and a group of destinations, the algorithm recursively expands the unvisited paths and records distances of intermediate nodes. During the search, a distance record of a node will be updated if there is a shorter path than the present one. Such a process is continued until all the destinations have arrived, and the distances of all other possible paths are longer than their current distances. In addition, a solution named Incremental Network Expansion (INE) is presented in [87] by extending Dijkstra?s algorithm to compute k nearest neighbors in a network space. Specifically, INE first locates the network segmentei, which covers the query point q, and retrieves all objects on ei. If any object pi is found on ei, pi will be inserted into 134 the result set. Furthermore, the endpoint of ei, which is closer to q, will be expanded while the second endpoint of ei will be placed in a priority queue Qp. INE repeats the process by iteratively expanding the first node in Qp and inserting newly discovered nodes into Qp until k objects are retrieved. We develop a Network Expansion-based SKkNN (NEkNN) solution by leveraging INE. There are two main steps in the NEkNN algorithm. The first step is to filter out objects which do not match any user specified keywords by employing Algorithm 8. Then, we mark all the objects in mergeList in the spatial network as candidates (e.g., set a bit of these points of interest). The next step is to expand the network from q with INE and the ranking function (Section 6.1.2). When an object pi is discovered, NEkNN checks whether pi is a candidate object or not. If pi is a candidate object, NEkNN calculates its ranking score s by executing the ranking function (otherwise the algorithm ignores pi). Meanwhile, NEkNN keeps a result set R which is sorted in descending order based on the ranking score. If R has fewer than k objects and pi is a candidate object, pi is inserted into R. Otherwise, NEkNN compares the ranking score of pi with the last object pj in R. pj will be replaced by pi if pi:s>pj:s. In addition, whenjRj k, NEkNN calculates ranking scores for network nodes as well by assuming that they match all the search keywords to restrict the search space. In other words, any spatial objectpi, which is further away fromq than a network nodeni, must have a lower ranking score than ni even if pi matches all the search keywords. Consequently, NEkNN iterates the search process until R contains k objects and the next network node to be expanded in Qp has an equal or lower ranking score than the last object in R. ByemployingthedatasetinTable6.1andspatialnetworkinFigure6.1, wedemonstrate an example to retrieve the two nearest hotels that have the amenities, ?Internet" and ?Pets Allowed" from q with NEkNN. We assume that all keywords have identical weight and the values of 1 and 2 are 0.5 and 0.5, respectively. First, NEkNN executes Algorithm 8 and marks candidate hotels H1, H3, H5, and H6 on the network. Then, NEkNN locates the segment n5n7 that covers q. Since no hotel is covered by n5n7, the node (n5) closer 135 Algorithm 9 NEkNN(I;q;k;K) 1. mergeList = KeywordMatch(I;K) 2. if mergeList ==;then 3. return ; 4. end if 5. mark each object in mergeList in the spatial network 6. ninj = the segment covers q 7. if ninj covers candidate objects then 8. calculate their ranking scores and insert them into R 9. end if 10. fp1;:::;pkgare the top k objects in R sorted in descending order of their ranking scores 11. smin = pk:s // if pk =;, smin = 1 12. Qp = <(ni;Dn(q;ni));(nj;Dn(q;nj))> // sorted in ascending order of Dn 13. repeat 14. de-queue the first node nf in Qp 15. if jRj k then 16. calculate nf:s by assuming that nf fully matches K 17. else 18. nf:s = 0 19. end if 20. for each non-visited adjacent node nx of nf do 21. search nxnf 22. if nxnf covers candidate objects then 23. for each candidate object pi do 24. if jRjpk:s then 28. replace pk by pi 29. end if 30. end if 31. end for 32. end if 33. en-queue (nx;Dn(q;nx)) 34. end for 35. smin = pk:s // if pk =;, smin = 1 36. until smin . The expansion of n7 reaches n9 and Qp = <(n2;4);(n9;6)>. Next, the expansion of n2 136 reaches n1 and n3, after which Qp = <(n9;6);(n1;11);(n3;13)> and H5 is found on n2n3. Afterward, H5 is inserted into R with s = 3:5. Subsequently, n9 is expanded and Qp = <(n1;11);(n3;13);(n8;13);(n10;14)>. The ranking score of the next node in Qp (n1) is -4.5; the algorithm terminates because R contains two hotels and H5:s > n1:s. The complete algorithm of NEkNN is formalized in Algorithm 9. 6.2.4 The Voronoi Diagram-Based SKkNN Query Algorithm Although NEkNN is able to restrict the search space and retrieve top k objects based on their ranking scores, the main limitation of NEkNN is that it has to explore a large portion of the network when candidate objects are not densely distributed in the network. Therefore, we propose a Voronoi diagram-based SKkNN (VDkNN) solution by leveraging the network Voronoi diagram (NVD) [65] to improve performance. In order to be independent of the density and distribution of candidate objects, we first partition the spatial network into small regions by generating a network Voronoi diagram over all the spatial objects (points of interest). Each cell of the NVD is centered by one spatial object and contains the nodes that are closest to that object in network distance. Afterward for each NVD cell, we pre-compute the distances between all the edges of the cell to its center as well as the distances only across the border points of the adjacent cells. Consequently, for a new cell, we can quickly extend the searched region to the border points without expanding all the internal network segments. With the NVD of the search space, for a SKkNN query, VDkNN first filters out un- qualified objects with Algorithm 8 and marks all the objects in mergeList in the NVD as candidates. Then, VDkNN finds the network Voronoi polygon NVP(pi) that contains q where pi is the generator of the polygon. This step can be accomplished by employing a spatial index (e.g., the R-tree), which is generated based on the NVD cells. Next, we verify that pi is a candidate object. If pi is a candidate object, VDkNN calculates its ranking score by running the ranking function. In addition, VDkNN maintains a result set R which 137 is sorted in descending order according to the ranking score. When R contains fewer than k objects, newly discovered candidate objects are inserted into R. However, if R already includes k objects, VDkNN replaces the kth object pk of R when a newly retrieved candidate object has a higher s than pk. Also, VDkNN keeps a queue Qn which stores the neighbors (adjacent cells) of pi and a set E which consists of all the searched cells (i.e., E covers the current explored region). Subsequently, VDkNN searches the adjacent cells of E (i.e., NVP(pi)) stored in Qn for the next candidate object. Every time after a cell NVP(pj) been explored, the neighboring generatorsofpj areunionedwithQn, NVP(pj)isunionedwithE, andRisupdatedaccording to the aforementioned rules if pj is a candidate object. Moreover, when jRj k, VDkNN calculatestherankingscoreofalltheborderpointsofthecurrentexploredregionbyassuming that they match all the search keywords to restrict the search space. VDkNN iterates the search process until R contains k objects and the ranking scores of all the border points of E are equal or worse than the ranking score of the kth object in R (i.e., there will not be any changes in R even if we search further). Figure 6.6 illustrates an example of retrieving two nearest points of interest (POI) which match keywords in K from q with VDkNN. First, VDkNN executes Algorithm 8 and marks candidate POIs on the NVD. Then, VDkNN locates the network Voronoi polygon, NVP(p1), which contains q. Next, VDkNN verifies that p1 is a candidate POI and inserts the neighboring generators, p2, p5, p6, p7, p8, p9, and p10 into Qn. Also, E covers NVP(p1). Afterward VDkNN searches the objects in Qn for the next candidate POI. Assume that NVP(p2) is the second explored NVP and both p1 and p2 are candidate POIs. Then, R contains p1 and p2, E covers NVP(p1) and NVP(p2) (the shaded region in Figure 6.6), and Qn comprises nine generators (p3 to p11). Since R covers two POIs, VDkNN computes the ranking score of all the border points of E (b1 to b12) by assuming that they match all the search keywords inK. Here we suppose thatsmin >bmax and the algorithm terminates. The complete algorithm of VDkNN is formalized in Algorithm 10. 138 Algorithm 10 VDkNN(I;q;k;K) 1. Generate a NVD based on P 2. mergeList = KeywordMatch(I;K) 3. if mergeList ==;then 4. return ; 5. end if 6. mark each object in mergeList in the NVD 7. NVP(pi) = the NVP covers q 8. en-queue the neighboring generators of pi into Qn 9. E[NVP(pi) 10. if pi is a candidate object then 11. calculate s of pi and insert pi into R 12. end if 13. fp1;:::;pkgare the top k objects in R sorted in descending order of their ranking scores 14. smin = pk:s // if pk =;, smin = 1 15. if jRj k then 16. calculate s of all the border nodes of E by assuming that they fully match K 17. bmax = the largest s of all the border nodes of E 18. else 19. bmax = 0 20. end if 21. while smin pk:s then 28. replace pk by pj 29. end if 30. end if 31. end if 32. Qn [all the neighboring generators of pj 33. E[NVP(pj) 34. smin = pk:s // if pk =;, smin = 1 35. if jRj k then 36. calculate s of all the border nodes of E by assuming that they fully match K 37. bmax = the largest s of all the border nodes of E 38. else 39. bmax = 0 40. end if 41. end while 42. return R 139 1 2 3 4 5 67 8 9 10 11 Figure 6.6: A VDkNN query example. 6.2.5 The Spatial Keyword Range Query Algorithm As defined in Section 6.1.4, given a query pointq, a search ranger and a set of keywords K, SKR query is to retrieve all the objects which fully match all the keywords within r. SKR query first calculates the keyword relevancy of objects by utilizing KeywordMatch (Algorithm 8). Then, it retrieves objects which fully match all the given keywords and stores the qualified objects in C. Afterward, it calls Dijkstra?s algorithm for calculating distances fromq to all the candidate objects. Finally, SKR query removes objects which are out of the search range from R. The complete algorithm of SKR query is illustrated in Algorithm 11. The worst-case running time of Algorithms 2, 3, and 4 on a spatial network with a set of nodes N is O(jKj jPj+jNj2) by considering both the keyword match and spatial network search subroutines. 140 Algorithm 11 SKR(I;q;r;K) 1. mergeList = KeywordMatch(I;K) 2. for each object o in mergeList do 3. if o.occurrence ==jKjthen 4. C.append(o) 5. end if 6. end for 7. if (jCj!= 0 ) then 8. R = ShortestPath(q;C) 9. end if 10. filter out objects beyond r in R 11. return R 6.3 Experimental Validation In this subsection, we evaluate the performance of our spatial keyword query solutions with both real-world and synthetic data sets. We implemented the proposed algorithms and related experimental components in C++. The inverted index structure was loaded into the main memory during the execution of simulations. All the experiments were conducted on an Ubuntu Linux server equipped with an Intel Xeon 2.4GHz processor and 2GB memory. More details of the simulation environment are shown in Table 6.3. All simulation results were recorded after the system model reached a steady state. Table 6.3: Simulator configurations. Configurations Hardware Intel Xeon X3430 2.4GHz processor 2GB RAM 160GB SATA disk Software Ubuntu 10.04 Linux kernel 2.6.23 g++ 4.4.3 141 6.3.1 Data Sets Inourexperiments, areal-worlddatasetwasdownloadedfromedigitalz1, whichprovides a wide range of general data sets for free. We retrieved 9,483 restaurants in the state of California and utilized their menus (e.g., pizza, hamburger, steak, etc.) and cuisine (e.g., American, Chinese, French, etc.) as keywords for searches. The data sets of road networks in both the state of California (containing 21,692 roads and 21,047 intersections) and the continental United States (containing 179,178 roads and 175,812 intersections) were downloaded from the US Census Bureau (TIGER/Line Shapefiles)2. Forthesyntheticdataset, wegeneratedaround160,000restaurants, ofwhichthedensity follows the real-world data set in order to investigate the scalability of our algorithms. In addition, each restaurant has a similar number of keywords to the real-world data set. The network of the continental United States is used with the synthetic data set. Table 6.4 summarizes the numbers of the two data sets. Table 6.4: Data sets employed in experiments. Data Sets Total Number of Restaurants Total Number of Keywords Total Number of Roads Total Number of Intersections Real 9,483 34,091 21,692 21,047 Synt. 160,000 575,200 179,178 175,812 Table 6.5 displays the default values of parameters in our experiments. We varied an essential parameter in each experiment set in order to evaluate its impact on the performance of the proposed algorithms. Other parameters were kept constant during all the experiments in the same set. The default values of parameters are used in experiments if we do not explicitly specify other values. The selection of 1 and 2 values depends on preference for keyword relevancy and distance of users. We fixed the ratio of 1 to 2 (1=20) in all the experiments. 1http://www.edigitalz.com/ 2http://www.census.gov/geo/www/tiger/ 142 Table 6.5: Default values of parameters used in experiments. Parameters Default Values 1 0.04762 2 0.95238 k 5 jKj 2 r 20 km q randomly selected 6.3.2 Data Set Size Experiment In this experiment, we evaluate NEkNN, VDkNN, and SKR queries with various data set sizes. The main purpose of this experiment is to analyze the influence of different data set sizes on query execution time. For both real-world and synthetic data sets, we generate five subsets of restaurants with an increasing number of data objects. The number of restaurants in consecutive subsets is increased by 2,000 for real-world data sets and 35,000 for synthetic data sets. Objects in smaller subsets are included in bigger ones. 0 500 1000 1500 2000 2500 0 2 4 6 8 10 Execution Time ( ?s) Number of Restaurants (K) NEkNNVDkNN SKR (a) Real-world data set 0 500 1000 1500 2000 2500 0 20 40 60 80 100 120 140 160 Execution Time ( ?s) Number of Restaurants (K) NEkNNVDkNN SKR (b) Synthetic data set Figure 6.7: Execution times of NEkNN and VDkNN queries as a function of data set size. The results of real-world and synthetic data sets are demonstrated in Figure 6.7(a) and Figure6.7(b), respectively. VDkNNalwaysoutperformsNEkNNwiththedefaultparameters in all the experiments. From Figure 6.7, we observe that the execution time of most queries increases linearly with the increment of data set size. The reason is that more POIs and 143 keywords have to be processed in these queries. In addition, the difference of execution time between NEkNN and VDkNN queries decreases gradually as data set size grows. In other words, the time costs of these two solutions become close with a larger data set. The reason that faster performance degradation is caused in VDkNN is that it has an extra overhead of searching on Voronoi diagrams in addition to the cost of processing POIs, which is suffered by both solutions. The higher density of POIs on spatial networks, the more border nodes are generated in Voronoi polygons. Hence, VDkNN has to spend more time on border node related calculation when it tests the stopping condition. Another observation is that queries run faster on a bigger synthetic data set (e.g., 20,000 data objects) than a smaller real-world data set (e.g., 9,483 data objects). The reason is that the density of POIs is a dominant factor in these queries. Although there are more POIs involved in the keyword-match process in the synthetic data set, there are fewer candidates which are discovered in the search area due to lower POI density. 6.3.3 Number of Keywords Experiment The number of keywords is an essential parameter of both NEkNN and VDkNN queries. In order to investigate the impact of the number of keywords on query performance, we vary the number of specified keywords on both data sets. We conduct experiments from queries with one keyword to ones with five keywords by adding a new keyword after each experi- ment. Figure 6.8 shows that the execution time of queries increases when more keywords are specified by users. In order to retrieve partially matched query results, POIs that match any of the given keywords have to be taken into account. Consequently, more keywords will increase the number of POIs to be processed in the keyword match and query evaluation processes. The difference in execution time between NEkNN and VDkNN remains nearly constant in all queries. As the number of keywords becomes larger, more POIs are considered in the keyword match process in both solutions. Moreover, varying the number of keywords does 144 1000 1200 1400 1600 1800 2000 2200 2400 2600 0 1 2 3 4 5 6 Execution Time ( ?s) Number of Keywords NEkNNVDkNN (a) Real-world data set 1000 1500 2000 2500 3000 0 1 2 3 4 5 6 Execution Time ( ?s) Number of Keywords NEkNNVDkNN (b) Synthetic data set Figure 6.8: Execution times of NEkNN and VDkNN queries as a function of number of keywords. not directly enlarge or shrink the search area of both methods (i.e., the score of a POI is determined by the ranking function). Therefore, no apparent change of the difference in query performance between the two methods is observed. However, VDkNN always exceeds NEkNN in execution time in this experiment. 6.3.4 Number of k Experiment Next, we evaluate the impact of k on the performance of NEkNN and VDkNN queries with the two data sets. We vary the value of k from 5 to 30 with an increment of five. Figure 6.9 illustrates that the execution time of queries increases as the number of k be- comes larger. With both NEkNN and VDkNN, a larger search area has to be processed in order to retrieve more qualified results when we increase the k value. The performance difference between NEkNN and VDkNN becomes clear when k increases. Such a difference is proportional to the k value if POIs and networks are equally distributed. Apparently, given specific keywords, the cost of the keyword match process of NEkNN and VDkNN is identical. Therefore, the performance gain of VDkNN queries is from searches on the NVD where VDkNN can retrieve the top k candidates faster than NEkNN. When k increases, the search area is enlarged correspondingly and VDkNN is able to achieve more performance gains in the expanded search region. 145 1600 1700 1800 1900 2000 2100 2200 2300 0 5 10 15 20 25 30 35 Execution Time ( ?s) Number of k NEkNNVDkNN (a) Real-world data set 1600 1800 2000 2200 2400 2600 2800 0 5 10 15 20 25 30 35 Execution Time ( ?s) Number of k NEkNNVDkNN (b) Synthetic data set Figure 6.9: Execution times of NEkNN and VDkNN queries as a function of number of k. 6.3.5 Query Range Experiment We examine the effect that varying the query range would have on the performance of SKR queries. In the experiments, SKR queries with various query ranges are evaluated in three different cases, which are queries with one (SKR-1), two (SKR-2), and three (SKR-3) keywords. Both Figures 6.10(a) and 6.10(b) illustrate that the execution time of queries grows exponentially with increasing query range. This is because the search area expands equally in all directions. 300 350 400 450 500 550 600 650 700 0 10 20 30 40 50 60Execution Time ( ?s) Range (km) SKR-1SKR-2 SKR-3 (a) Real-world data set 500 600 700 800 900 1000 1100 1200 1300 1400 0 10 20 30 40 50 60Execution Time ( ?s) Range (km) SKR-1SKR-2 SKR-3 (b) Synthetic data set Figure 6.10: Execution time of SKR queries as a function of query range. Interestingly, the execution time of the queries on real data sets are very close. Two factorsmainlyaffectSKR.ThefirstoneisthenumberofPOIsinvolvedinthekeywordmatch step. More POIs will be processed if more keywords are given. Furthermore, POIs that are 146 fully keyword-matched are qualified candidates in SKR and a large number of partially keyword-matched POIs are filtered out by KCF. Consequently, fewer candidate POIs need to be processed in the range search phase. The two factors offset each other in range queries with relatively small search distances and data sets. However, when large amounts of POIs are searched with SKR, the overhead of the keyword match process becomes dominant in execution time. As shown in Figure 6.10(b), SKR-1 becomes the best and SKR-3 is the worst. 6.3.6 Page Access Experiment Finally, we evaluate the number of page accesses by our proposed solutions. In these tests, we mainly focus on the comparison of NEkNN and VDkNN queries. Given a specific query, both solutions have the keyword match process. An identical number of keywords are retrieved from data sets. In addition, the POIs detected by NEkNN are required to be processed in VDkNN as well, and vice versa. The only difference is that NEkNN searches on spatial networks, whereas VDkNN explores on NVDs. Therefore, we evaluate the page access regarding network retrieval in these experiments. The page size is set to be 4 KB. The size of intersections or border nodes is 20 Bytes, containing their identifiers and coordinates. The road segments have a size of 20 Bytes as well, encompassing their identifiers, identifiers of two endpoints, and the length of the road segment. A single page can accommodate either 200 nodes or road segments. The nodes and segments are stored continuously in pages. During a query process, each page is loaded only once. Figure 6.11(a) and Figure 6.11(b) display the number of page accesses of NEkNN and VDkNN queries in real-world and synthetic data sets, respectively. The trend shared by the two figures is that as the data set size grows, the number of page accesses decreases in NEkNN, whereas it increases in VDkNN. The main reason is that NEkNN searches in a smaller area for qualified results in a larger dataset. Fewer intersections and road segments are retrieved by NEkNN. On the other hand, NVD becomes more complex when more 147 0 10 20 30 40 2 4 6 8 9.5Number of Page Access Number of Restaurants (K) NEkNNVDkNN (a) Real-world data set 0 10 20 30 40 20 50 90 120160Number of Page Access Number of Restaurants (K) NEkNNVDkNN (b) Synthetic data set Figure 6.11: Page Access evaluation with different data set sizes. border nodes and connections between borders are generated. Therefore, more page access is required in VDkNN with a larger POI data set. 148 Chapter 7 Conclusion and Future Work In this dissertation, we have demonstrated our novel design of frameworks for context- based file systems, o?oading application development, and advanced solutions to Multi- Criteria Optimal Location Query and Spatial Keyword Query. This chapter concludes the dissertation study by summarizing the contributions and future work. 7.1 Framework for Context-Based File Systems We present a general informed-based framework, Frog, for context-based file systems, where contexts are encapsulated in views. Frog integrates context-specific solutions that may conflict with each other in terms of metadata management, physical data organization, and I/O operations. We show the generality, transparency, diversity, and flexibility of Frog by implementing two Bi-Context File Systems, namely, BAVFS and BHVFS. In the two case studies, we first demonstrate that BAVFS optimizes performance of sequential and random reads on small files by the virtue of dual-mode prefetching. Then, we illustrate that BHVFS speed up random reads and writes by incorporating the update-in-place and update-out-of- place strategies. Our experimental results show that the benefits gained from context-based file systems far outweigh the overhead induced by creating and maintaining duplications for multiple views. A few open issues in Frog and context-based file systems will be addressed in our future work. Duplicating and managing metadata (potentially physical data) is the fundamental idea behind the design of Frog. Creating data replicas can substantially improve data relia- bility; thus, it is intriguing to quantitatively study the reliability impacts on context-based file systems. Moreover, managing data in multiple views is a challenge in disk scheduling. 149 Disk scheduling policies applied in one view may either positively or negatively affect disk scheduling in another view. We will investigate disk scheduling optimization in a few partic- ular contexts. Last, but not least, we plan to design a view-allocation mechanism that will make a tradeoff between performance and space effectiveness in a native context-based file system. 7.2 O?oading Framework The emergence of active storage coupled with computation capability inspires us to o?oad I/O-bound modules of data-intensive applications to active storage nodes in a clus- ter computing system. In this study, we proposed a programming framework - ORCA - to automatically o?oad I/O-bound modules of applications to storage nodes in a cluster. The ORCA o?oading framework handles configurations, execution-path control, o?oading executable code, and data sharing. An ORCA application programming interface (API) and a run-time system in the framework allow programmers without any I/O o?oading experi- ence to easily write new I/O-bound modules or partition existing code to run efficiently on clusters. The proposed ORCA framework can achieve the following two objectives for data- intensive applications running in both homogeneous and heterogeneous clusters. First, our ORCA framework accelerates data-intensive applications by allocating I/O-bound modules to active storage nodes in clusters. Second, ORCA can significantly reduce network burden imposed by transferring massive amounts of data from storage nodes to computing nodes. In this work, we pay attention to o?oading domains, each of which contains a pair of a CPU-bound module and an I/O-bound module. We introduce the o?oading domain as an important concept, because this pair structure is a simple yet power model repre- senting a wide range of data-intensive applications. For future research directions, this o?oading-domain model will be extended to a multi-o?oading-domain model in which mul- tiple o?oading domains can be properly coordinated. In light of this new model, we will 150 upgrade the o?oading management in the ORCA framework. In addition, we plan to imple- ment a dispatch management module to allocate I/O-bound processes to appropriate storage nodes. The o?oading and dispatch management modules will address the challenge of how to control multiple collaborative o?oading domains. Another intriguing research issue is taking energy consumption into account when ap- plying ORCA to o?oad I/O-bound modules on a data-redundancy cluster. Modern data centres often use data redundancy techniques, generating and keeping several copies of data to provide good services in terms of data throughput and availability. A number of energy- saving approaches are proposed to balance energy consumption and I/O performance of cluster storage systems. In a future study, we will investigate the trade-off issue among energy consumption, performance, and availability when we deploy the ORCA framework in a cluster storage system employing data-redundancy techniques. 7.3 MOLQ Evaluation We formulated a novel optimal location selection problem. Except for designing two straightforward approaches that sequentially scan all object combinations and possible lo- cations, we propose an MOVD-based approach (RRB) that efficiently answers the query. Moreover, in order to minimize the costs induced by region overlapping, we propose the MBRB approach, in which MBRs are used as the boundaries of OVRs, since overlapping two rectangles is much cheaper than overlapping two arbitrary regions. In addition, a cost- bound iterative approach is proposed to efficiently process large number of Fermat-Weber problems. We demonstrate the excellent performance of the proposed approaches through extensive simulations. For integrity consideration, we plan to create and evaluate an inverse operation " " that removes an MOVD from another MOVD. Moreover, we will evaluate "+" and " " operations by varying the object weights. Overlapping weighted Voronoi diagrams is more 151 expensivethanoverlappingtworegulardiagramsbecauseofdifficultyinrepresentingVoronoi regions. 7.4 Spatial Keyword Query Geographic information systems are becoming increasingly sophisticated, and spatial keyword search represents an important class of queries. Most existing solutions for evalu- ating spatial keyword queries are based on Euclidean distance and cannot provide partially matched results. In this research, we introduce efficient techniques to answer spatial keyword k nearest neighbor and spatial keyword range queries on spatial networks. We demonstrate the excellent performance of the proposed algorithms through extensive simulations. For future work, we plan to extend our spatial keyword query evaluation framework to support other common spatial query types such as spatial join, reverse nearest neighbor, spatial skyline, etc. 152 Bibliography [1] BRTFS: The Linux B-tree Filesystem. http://domino.watson.ibm.com/library/ CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf. [2] Gnu core utilities. http://www.gnu.org/software/coreutils/. [3] Gnu Grep. http://www.gnu.org/software/grep/. [4] hdparm. http://en.wikipedia.org/wiki/Hdparm. [5] Oracle 11g release 1 rac on linux using nfs. http://www.oracle-base.com/articles/ 11g/OracleDB11gR1RACInstallationOnLinuxUsingNFS.php. [6] Postgresql. http://www.postgresql.org/. [7] Seagate product manual of barracuda 7200.12 serial ata. http://www.seagate. com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.12/ 100529369b.pdf. [8] The IBM JFS project. http://www.ibm.com/developerworks/wikis/display/ WikiPtype/JFS. [9] The ReiserFS project. http://http://marc.info/?l=reiserfs-devel. [10] The SGI XFS project. http://oss.sgi.com/projects/xfs/. [11] WD1600AAJS specification. http://wdc.custhelp.com/app/answers/detail/ search/1/a_id/1400#. [12] Wd5000aaks specification. http://www.wdc.com/en/products/products.aspx?id= 110. [13] Apache hadoop. http://lucene.apache.org/hadoop/, 2006. [14] Corba. http://www.corba.org/, 2010. [15] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow., 2:922?933, August 2009. [16] A. Acharya, M. Uysal, and J. Saltz. Active disks: programming model, algorithms and evaluation. SIGPLAN Not., 33(11):81?91, Oct. 1998. 153 [17] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A five-year study of file-system metadata. ACM Transactions on Storage (TOS), 3(3), Oct. 2007. [18] F. Anton, D. Mioc, and C. M. Gold. Dynamic additively weighted voronoi diagrams made easy. In CCCG, 1998. [19] P. F. Ash and E. D. Bolker. Generalized dirichlet tessellations. Geometriae Dedicata, 20:209?243, 1986. [20] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval - the con- cepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England, 2011. [21] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In SIGMOD Conference, pages 322?331, 1990. [22] J.-D. Boissonnat and C. Delage. Convex Hull and Voronoi Diagram of Additively Weighted Points. In ESA, pages 367?378, 2005. [23] J. Bonwick. ZFS: The last word in file systems. http://www.opensolaris.org/os/ community/zfs/docs/zfs_last.pdf. [24] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, and T. Ludwig. Small-file access in parallel file systems. In Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pages 1?11, may 2009. [25] A. Cary, O. Wolfson, and N. Rishe. Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries. In SSDBM, pages 87?95, 2010. [26] R. Chandrasekaran and A. Tamir. Algebraic Optimization: The Fermat-Weber Loca- tion Problem. Math. Program., 46:219?224, 1990. [27] M. A. Cheema, W. Zhang, X. Lin, Y. Zhang, and X. Li. Continuous reverse k nearest neighbors queries in euclidean space and in spatial networks. VLDB J., 21(1):69?95, 2012. [28] Y.-Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geographic web search engines. In SIGMOD Conference, pages 277?288, 2006. [29] S.Chiu,W.-k.Liao,andA.Choudhary. Designandevaluationofdistributedsmartdisk architecture for i/o-intensive workloads. ICCS?03, pages 230?241, Berlin, Heidelberg, 2003. Springer-Verlag. [30] W. Chu, W. Li, T. Mo, and Z. Wu. A Context-Source Abstraction Layer for Context- aware Middleware. In Information Technology: New Generations (ITNG), 2011 Eighth International Conference on, pages 1064 ?1065, april 2011. [31] G. Cong, C. S. Jensen, and D. Wu. Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects. PVLDB, 2(1):337?348, 2009. 154 [32] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer, 3rd edition, 2008. [33] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107?113, 2008. [34] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269?271, 1959. [35] P. Dong. Generating and updating multiplicatively weighted Voronoi diagrams for point, line and polygon features in GIS. Computers & Geosciences, 34(4):411?421, 2008. [36] D. H. C. Du. Intelligent storage for information retrieval. NWESP ?05, pages 214?, Washington, DC, USA, 2005. IEEE Computer Society. [37] Y. Du, D. Zhang, and T. Xia. The Optimal-Location Query. In SSTD, pages 163?180, 2005. [38] F. E.J., F. K., R. K., and N. J. Active Storage Processing in a Parallel File System. In Proc. of the 6th LCI International Conference on Linux Clusters: The HPC Revolution, 2006. [39] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong. Extendible hashing-a fast access method for dynamic files. ACM Trans. Database Syst., 4(3):315?344, Sept. 1979. [40] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword Search on Spatial Databases. In ICDE, pages 656?665, 2008. [41] B. G. Fitch, A. Rayshubskiy, M. C. Pitman, T. J. C. Ward, and R. S. Germain. Using the active storage fabrics model to address petascale storage challenges. PDSW ?09, pages 47?54, New York, NY, USA, 2009. ACM. [42] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM, 34(3):596?615, 1987. [43] L. Fu, D. Sun, and L. Rilett. Heuristic shortest path algorithms for transportation applications: State of the art. In Computers and Operations Research, volume 33, pages 3324?3343, 2006. [44] M. Gahegan and I. Lee. Data structures and algorithms to support interactive spa- tial analysis using dynamic Voronoi diagrams. Computers, Environment and Urban Systems, 24(6):509?537, 2000. [45] J. F. Gantz. The Diverse and Exploding Digital Universe. IDC white paper, 2:1?16, 2008. [46] Y. Gao, B. Zheng, G. Chen, and Q. Li. Optimal-Location-Selection Query Processing in Spatial Databases. IEEE Trans. Knowl. Data Eng., 21(8):1162?1177, 2009. 155 [47] N. H. Gehani, H. V. Jagadish, and W. D. Roome. OdeFS: A File System Interface to an Object-Oriented Database. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ?94, pages 249?260, San Francisco, CA, USA, 1994. [48] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29?43, Oct. 2003. [49] T. P. G. D. Group. Postgresql developer?s guide. http://www.postgresql.org/docs/ 9.0/interactive/index.html. [50] T. Gu, H. K. Pung, and D. Q. Zhang. A middleware for building context-aware mobile services. InProceedings of IEEE Vehicular Technology Conference (VTC-Spring 2004), volume 5, pages 2656?2660, Milan, Italy, 17-19 May 2004. [51] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47?57, 1984. [52] J. B. S. Haldane. Note on the Median of a Multivariate Distribution. Biometrika, 35:414?415, 1948. [53] R. Hariharan, B. Hore, C. Li, and S. Mehrotra. Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems. In SSDBM, page 16, 2007. [54] M. Henning. The rise and fall of corba. Queue, 4(5), 2006. [55] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Trans. Database Syst., 24(2):265?318, 1999. [56] H. Hu, D. L. Lee, and V. C. S. Lee. Distance Indexing on Road Networks. In VLDB, pages 894?905, 2006. [57] S. Jain, F. Shafique, V. Djeric, and A. Goel. Application-level isolation and recovery with solitude. SIGOPS Oper. Syst. Rev., 42(4):95?107, Apr. 2008. [58] G. Jalal and J. Krarup. Geometrical Solution to the Fermat Problem with Arbitrary Weights. Annals OR, 123(1-4):67?104, 2003. [59] C. S. Jensen, J. Kol?rvr, T. B. Pedersen, and I. Timko. Nearest Neighbor Queries in Road Networks. In GIS, pages 1?8, 2003. [60] M. I. Karavelas and M. Yvinec. Dynamic Additively Weighted Voronoi Diagrams in 2D. In ESA, pages 586?598, 2002. [61] R. Kassick, F. Boito, and P. Navaux. Impact of i/o coordination on a nfs-based parallel file system with dynamic reconfiguration. pages 199?206, oct. 2010. [62] J.Katcher. PostMark: a NewFileSystem Benchmark. Technical Report TR3022, pages 1?8, 1997. 156 [63] K. Keeton, D. A. Patterson, and J. M. Hellerstein. A case for intelligent disks (idisks). SIGMOD Rec., 27(3):42?52, Sept. 1998. [64] T. Kojm. ClamAV. http://www.clamav.net, 2004. [65] M. R. Kolahdouzan and C. Shahabi. Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In VLDB, pages 840?851, 2004. [66] F. Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries. In SIGMOD Conference, pages 201?212, 2000. [67] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse Nearest Neighbor Aggregates Over Data Streams. In VLDB, pages 814?825, 2002. [68] D. Kotz and C. S. Ellis. Practical prefetching techniques for parallel file systems. In Proceedings of the first international conference on Parallel and distributed information systems, PDIS ?91, pages 182?189, Los Alamitos, CA, USA, 1991. IEEE Computer Society Press. [69] A. Krause, A. Smailagic, and D. P. Siewiorek. Context-aware mobile computing: learn- ing context- dependent personal preferences from a wearable sensor array. Mobile Computing, IEEE Transactions on, 5(2):113 ? 127, feb. 2006. [70] W.-S. Ku, R. Zimmermann, H. Wang, and T. Nguyen. Annatto: Adaptive nearest neighbor queries in travel time networks. In MDM, page 50, 2006. [71] W.-S. Ku, R. Zimmermann, H. Wang, and C.-N. Wan. Adaptive nearest neighbor queries in travel time networks. In GIS, pages 210?219, 2005. [72] M. Lee, S. Min, C. Park, Y. Bae, H. Shin, and C. Kim. A dual-mode instruction prefetch scheme for improved worst case and average case program execution times. In Real-Time Systems Symposium, 1993., Proceedings., pages 98 ?105, dec 1993. [73] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61?65, Mar. 2010. [74] C. Li, K. Shen, and A. E. Papathanasiou. Competitive prefetching for concurrent sequential I/O. SIGOPS Oper. Syst. Rev., 41(3):189?202, Mar. 2007. [75] H. Lim, V. Kapoor, C. Wighe, and D. H.-C. Du. Active disk file system: A distributed, scalable file system. MSS ?01, pages 101?, Washington, DC, USA, 2001. IEEE Com- puter Society. [76] X. Ma and A. L. N. Reddy. MVSS: Multi-View Storage System. In Distributed Com- puting Systems, 2001. 21st International Conference on., pages 31?38, apr 2001. [77] C. D. Manning, P. Raghavan, and H. Sch?tze. Introduction to information retrieval. Cambridge University Press, 2008. 157 [78] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier. The new ext4 filesystem: Current status and future plans. In Proceedings of the 2007 Linux Symposium, pages 21?33, June 2007. [79] M. K. McKusick and G. R. Ganger. Soft updates: a technique for eliminating most synchronous writes in the fast filesystem. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC ?99, pages 24?24, Berkeley, CA, USA, 1999. USENIX Association. [80] M. K. McKusick, W. N. Joy, S. J. Le?er, and R. S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems (TOCS), 2:181?197, August 1984. [81] L. Mu. Polygon Characterization With the Multiplicatively Weighted Voronoi Dia- gram. The Professional Geographer, 56(2):223?239, 2004. [82] K. Muller and J. Pasquale. A high performance multi-structured file system design. Proceedings of the 13th ACM symposium on Operating systems principles, 25(5):56?67, Sept. 1991. [83] A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. Probability and Statistics. Wiley, NYC, 2nd edition, 2000. [84] R. B. R. P. H. Carns, W. B. Ligon III and R. Thakur. Pvfs: a parallel file system for linux clusters. Proceedings of the 4th annual Linux Showcase and Conference, pages 28?28, 2000. [85] F. Pagliara, J. Preston, and D. Simmonds. Residential Location Choice: Models and Applications. 2010. [86] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: a unified I/O buffering and caching system. ACM Trans. Comput. Syst., 18(1):37?66, Feb. 2000. [87] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query Processing in Spatial Network Databases. In VLDB, pages 802?813, 2003. [88] S. V. Patil, G. A. Gibson, S. Lang, and M. Polte. GIGA+: scalable directories for shared file systems. In Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing ?07, PDSW ?07, pages 26?29, New York, NY, USA, 2007. ACM. [89] D.A.Patterson, G.Gibson, andR.H.Katz. Acaseforredundantarraysofinexpensive disks (RAID). Proceedings of the 1988 ACM SIGMOD international conference on Management of data, 17(3):109?116, June 1988. [90] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79?95, Dec. 1995. 158 [91] J. Piernas, T. Cortes, and J. M. Garc?a. DualFS: a new journaling file system without meta-data duplication. In Proceedings of the 16th international conference on Super- computing, ICS ?02, pages 137?146, New York, NY, USA, 2002. ACM. [92] J. Piernas and J. Nieplocha. Efficient management of complex striped files in active storage. Euro-Par ?08, pages 676?685, Berlin, Heidelberg, 2008. Springer-Verlag. [93] J. Piernas, J. Nieplocha, and E. J. Felix. Evaluation of active storage strategies for the lustre parallel file system. SC ?07, pages 28:1?28:10, New York, NY, USA, 2007. ACM. [94] J. Qi, R. Zhang, L. Kulik, D. Lin, and Y. Xue. The Min-dist Location Selection Query. In ICDE, 2012. [95] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill Sci- ence, 2002. [96] E. Riedel, G. A. Gibson, and C. Faloutsos. Active storage for large-scale data mining and multimedia. VLDB ?98, pages 62?73, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. [97] J. G. M. Robert F. Love and G. O. Wesolowsky. Facilities location, models and meth- ods. 1988. [98] M.Rom?n,C.Hess,R.Cerqueira,A.Ranganathan,R.H.Campbell,andK.Nahrstedt. A Middleware Infrastructure for Active Spaces. IEEE Pervasive Computing, 1(4):74? 83, Oct. 2002. [99] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log- structured file system. ACM Trans. Comput. Syst., 10:26?52, February 1992. [100] D. S. H. Rosenthal. Evolving the Vnode Interface. USENIX Conference Proceedings, pages 107?118, 1990. [101] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In SIGMOD Conference, pages 71?79, 1995. [102] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In SIGMOD Conference, pages 43?54, 2008. [103] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implemen- tation or the sun network filesystem. In Proc. of USENIX Summer Technical Conf., 1985. [104] J. Schindler, S. Shete, and K. A. Smith. Improving throughput for small disk requests with proximal I/O. In Proceedings of the 9th USENIX conference on File and stroage technologies, FAST?11, pages 10?10, Berkeley, CA, USA, 2011. USENIX Association. [105] P. Schwan. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium, 2003. 159 [106] M. I. Seltzer, G. R. Ganger, M. K. McKusick, K. A. Smith, C. A. N. Soules, and C. A. Stein. Journaling versus soft updates: asynchronous meta-data protection in file systems. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC ?00, pages 6?6, Berkeley, CA, USA, 2000. USENIX Association. [107] S. Shekhar and S. Chawla. Spatial Databases: A Tour. Prentice Hall, 2003. [108] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 0:1?10, 2010. [109] D. Siewiorek, A. Smailagic, J. Furukawa, A. Krause, N. Moraveji, K. Reiger, J. Shaffer, andF.L.Wong. SenSay: acontext-awaremobilephone. InWearable Computers, 2003. Proceedings. Seventh IEEE International Symposium on, pages 248 ? 249, oct. 2003. [110] S. W. Son, S. Lang, P. Carns, R. Ross, R. Thakur, B. Ozisikyilmaz, P. Kumar, W.- K. Liao, and A. Choudhary. Enabling active storage on parallel i/o software stacks. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1 ?12, may 2010. [111] G. Soundararajan, M. Mihailescu, and C. Amza. Context-aware prefetching at the storage server. In USENIX 2008 Annual Technical Conference on Annual Technical Conference, ATC?08, pages 377?390, Berkeley, CA, USA, 2008. USENIX Association. [112] G. Soundararajan, V. Prabhakaran, M. Balakrishnan, and T. Wobber. Extending SSD lifetimes with disk-based write caches. InProceedings of the 8th USENIX conference on File and storage technologies, FAST?10, pages 8?8, Berkeley, CA, USA, 2010. USENIX Association. [113] I. Stanoi, M. Riedewald, D. Agrawal, and A. E. Abbadi. Discovery of Influence Sets in Frequently Updated Databases. In VLDB, pages 99?108, 2001. [114] S. Sundararaman, L. Visampalli, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Refuse to crash with Re-FUSE. In Proceedings of the sixth conference on Computer systems, EuroSys ?11, pages 77?90, New York, NY, USA, 2011. ACM. [115] M. Szeredi. File system in user space(FUSE). http://fuse.sourgeforge.net. [116] A. S. Tanenbaum, J. N. Herder, and H. Bos. File size distribution on UNIX systems: then and now. ACM SIGOPS Operating Systems Review, 40:100?104, January 2006. [117] Y. Tao, D. Papadias, and X. Lian. Reverse kNN Search in Arbitrary Dimensionality. In VLDB, pages 744?755, 2004. [118] Y. Tao, D. Papadias, X. Lian, and X. Xiao. Multidimensional reverse k NN search. VLDB J., 16(3):293?316, 2007. [119] Y. Tao, M. L. Yiu, and N. Mamoulis. Reverse Nearest Neighbor Search in Metric Spaces. IEEE Trans. Knowl. Data Eng., 18(9):1239?1252, 2006. 160 [120] H. ?ster and R. Love. A generalization of the rectangular bounding method for contin- uous location models. Computers & Mathematics with Applications, 44(1-2):181?191, 2002. [121] Y. Vardi and C.-H. Zhang. A modified Weiszfeld algorithm for the Fermat-Weber location problem. Mathematical Programming, 90:559?566, 2001. [122] K. Veeraraghavan, J. Flinn, E. B. Nightingale, and B. Noble. quFiles: The right file at the right time. ACM Transactions on Storage (TOS), 6:12:1?12:28, September 2010. [123] B. S. Verkhovsky and Y. S. Polyakov. Feedback algorithm for the single-facility min- isum problem. Annals of the European Academy of Sciences, 1:127?136, 2003. [124] W. Wang, Y. Zhao, and R. Bunt. HyLog: A High Performance Approach to Managing Disk Layout. In Proceedings of the 3rd USENIX Conference on File and Storage Tech- nologies, FAST ?04, pages 145?158, Berkeley, CA, USA, 2004. USENIX Association. [125] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: a scalable, high-performance distributed file system. InProceedings of the 7th symposium on Operating systems design and implementation, OSDI ?06, pages 307?320, Berkeley, CA, USA, 2006. USENIX Association. [126] G. Weisbrod, M. Ben-Akiva, and S. Lerman. Tradeoffs in Residential Location De- cisions: Transportation versus Other Factors. Transportation Policy and Decision- Making, 1(1), 1980. [127] E. Weiszfeld and F. Plastria. On the point for which the sum of the distances to n given points is minimum. Annals OR, 167(1):7?41, 2009. [128] D. Wu, M. L. Yiu, C. S. Jensen, and G. Cong. Efficient Continuously Moving Top-K Spatial Keyword Query Processing. In ICDE, 2011. [129] T. Xia, D. Zhang, E. Kanoulas, and Y. Du. On Computing Top-t Most Influential Spatial Sites. In VLDB, pages 946?957, 2005. [130] X. Xiao, B. Yao, and F. Li. Optimal location queries in road network databases. In ICDE, pages 804?815, 2011. [131] C. Yang and K.-I. Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries. In ICDE, pages 485?492, 2001. [132] M. L. Yiu, D. Papadias, N. Mamoulis, and Y. Tao. Reverse Nearest Neighbors in Large Graphs. IEEE Trans. Knowl. Data Eng., 18(4):540?553, 2006. [133] E. Zadok. UnionFS: A Stackable Unification File System. http://www.fsl.cs. sunysb.edu/project-unionfs.html. [134] E. Zadok, I. Badulescu, and A. Shender. Extending file systems using stackable tem- plates. In Proceedings of the annual conference on USENIX Annual Technical Confer- ence, ATEC ?99, pages 5?5, Berkeley, CA, USA, 1999. USENIX Association. 161 [135] E. Zadok and J. Nieh. FIST: a language for stackable file systems. SIGOPS Oper. Syst. Rev., 34(2):38?, Apr. 2000. [136] D. Zhang, Y. M. Chee, A. Mondal, A. K. H. Tung, and M. Kitsuregawa. Keyword Search in Spatial Databases: Towards Searching by Document. In ICDE, pages 688? 699, 2009. [137] D. Zhang, Y. Du, T. Xia, and Y. Tao. Progressive Computation of the Min-Dist Optimal-Location Query. In VLDB, pages 643?654, 2006. [138] Z. Zhang and K. Ghose. hFS: a hybrid file system prototype for improving small file and metadata performance. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 41:175?187, March 2007. [139] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W.-Y. Ma. Hybrid Index Structures for Location-based Web Search. In CIKM, pages 155?162, 2005. [140] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006. [141] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted Files Versus Signature Files for Text Indexing. ACM Trans. Database Syst., 23(4):453?490, 1998. 162