Auburn University Electronic Theses and Dissertations

Exploring Index and Query Optimization Techniques for Database Applications




Tang, Liang

Type of Degree



Computer Science


A database index is a data structure that speeds up data retrieval operations on a database table. The cost of indexing is the additional writes and storage space required to maintain the data structure. In most relational database management systems (RDBMSs), these data structures let queries locate data quickly without searching every row of a table. Bringing in an appropriate indexing technique for the specific data type elevates the read and write performance of a database; as a result, indexing techniques play a significant role in database applications. Once an index is built, the database is able to answer queries. Generally, a query is a request for information from a database. It can be as simple as "finding the address of the headquarters of company ZZ," or more complex, like "finding the average total amount of penalties for football players who live in Auburn or Opelika, incur more than 3 penalties, and captain fewer than 2 teams." To resolve queries quickly, we introduce query optimization: techniques that try to determine the most efficient way to execute a specific query by considering all possible query strategies. The goal of query optimization is to find a way to process a given query in minimum time. The main contribution of this dissertation is to explore and study efficient indexing and query optimization techniques for specific problems. Three concrete database applications are analyzed and explored, and indexing and querying techniques are proposed for each in order to enhance database performance. First, two watchtower-based, parameter-tunable indexing methods are introduced for efficient spatial processing over sparse distributions of Points of Interest (POIs), exploiting mobile users' check-in data collected from location-aware social networks.
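The indexing trade-off described above (extra work on writes in exchange for fast reads) can be illustrated with a minimal, hypothetical sketch: a hash index over one column of a toy in-memory table, so lookups no longer scan every row.

```python
# Minimal sketch of the index trade-off on a toy table (all names hypothetical).
rows = [
    {"id": 1, "city": "Auburn"},
    {"id": 2, "city": "Opelika"},
    {"id": 3, "city": "Auburn"},
]

# Building the index costs additional work and storage on every write...
city_index = {}
for pos, row in enumerate(rows):
    city_index.setdefault(row["city"], []).append(pos)

# ...but a lookup now touches only the matching rows, not every row.
def find_by_city(city):
    return [rows[p] for p in city_index.get(city, [])]

print([r["id"] for r in find_by_city("Auburn")])  # → [1, 3]
```

A real RDBMS typically uses a B-tree rather than a hash table so it can also serve range queries, but the cost structure is the same.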
In the proposed frameworks, the network traversal can terminate earlier by retrieving the precomputed distance information. More importantly, observing that people's movements often exhibit strong spatial patterns, we employ Bayesian Information Criterion (BIC)-based cluster analysis to model mobile users' check-in data as a mixture of 2-dimensional Gaussian distributions, where each cluster corresponds to a geographical hot zone. POI watchtowers are then deployed differently in hot zones and non-hot zones. Moreover, the optimal watchtower deployment mechanism is discussed in order to achieve the desired balance between off-line pre-computation cost and on-line query efficiency. Finally, the superiority of our solutions over state-of-the-art approaches is demonstrated using real data collected from Gowalla with large-scale road networks. Next, a novel probabilistic inference query machine is introduced to perform statistical inference on social networks. Social network analysis has attracted increasing interest from researchers due to the rapidly growing availability of massive social network data. The link prediction problem is one of the most fundamental problems in social network analysis and has therefore been extensively studied. In this dissertation, LinkProbe, a framework that quantitatively predicts the existence of links in large-scale social networks based on Markov Logic Networks (MLNs), is designed and implemented. Unlike other probabilistic graphical models, such as Bayesian Networks (BNs) or Conditional Random Fields (CRFs), MLNs allow undirected relationships with cycles and long-range (more than one hop) dependencies, which are essential and abound in social networks. Extensive experiments with real social datasets verify the effectiveness and efficiency of the proposed framework.
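The BIC-based model selection mentioned above can be sketched as follows: among candidate Gaussian mixtures with different numbers of components, pick the one minimizing BIC = k ln n − 2 ln L, which penalizes extra clusters. The candidate log-likelihoods below are illustrative placeholders, not values from the dissertation's Gowalla experiments.

```python
import math

def bic(log_likelihood, num_params, n):
    # BIC = k ln(n) - 2 ln(L); lower is better
    return num_params * math.log(n) - 2.0 * log_likelihood

def params_2d_gmm(k):
    # each 2-D Gaussian: 2 mean + 3 free covariance params, plus k-1 mixing weights
    return k * 5 + (k - 1)

# hypothetical fitted log-likelihoods for mixtures of k components
candidates = {1: -5200.0, 2: -4700.0, 3: -4650.0, 4: -4640.0}
n = 1000  # number of check-ins

# k=4 fits slightly better than k=3, but BIC's penalty rejects the extra cluster
best_k = min(candidates, key=lambda k: bic(candidates[k], params_2d_gmm(k), n))
print(best_k)  # → 3
```

Each of the `best_k` fitted components would then correspond to one geographical hot zone.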
Finally, a method is proposed to resolve the probabilistic skyline query problem using the MapReduce programming model in a distributed machine environment. The method filters unqualified objects (those not needed by users) during the early MapReduce cycle, so that only qualified objects are passed on to compute the actual skyline probabilities. To let multiple machines filter objects simultaneously, the high-dimensional data is partitioned by hyper-angular partitioning. Several efficient filtering strategies are proposed based on the features of this angular partitioning. Comparisons against other state-of-the-art approaches are demonstrated in the experiments.
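To illustrate the filter-then-merge structure of this approach, the sketch below computes an ordinary (non-probabilistic) 2-D skyline with angular partitioning: a "map" step buckets points by polar angle and discards locally dominated points in parallel, and a "reduce" step merges the surviving candidates. This is a single-machine simulation under assumed minimization semantics, not the dissertation's distributed implementation.

```python
import math

def dominates(a, b):
    # a dominates b when a is no worse in every dimension and strictly better in one
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def angular_partition(point, num_parts):
    # hyper-angular partitioning in 2-D: bucket non-negative points by polar angle
    theta = math.atan2(point[1], point[0])  # in [0, pi/2]
    return min(int(theta / (math.pi / 2) * num_parts), num_parts - 1)

def local_skyline(points):
    # map-side filter: drop any point dominated within its own partition
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def skyline(points, num_parts=4):
    # "map": group by angular partition and filter locally; "reduce": merge survivors
    parts = {}
    for p in points:
        parts.setdefault(angular_partition(p, num_parts), []).append(p)
    candidates = [p for part in parts.values() for p in local_skyline(part)]
    return local_skyline(candidates)

pts = [(1, 9), (2, 8), (4, 4), (9, 1), (5, 6), (8, 2)]
print(sorted(skyline(pts)))  # → [(1, 9), (2, 8), (4, 4), (8, 2), (9, 1)]
```

Here (5, 6) is eliminated inside its partition before the merge step ever sees it; in the probabilistic setting, the same early filtering prunes objects whose skyline probability cannot reach the user's threshold.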