This Is AuburnElectronic Theses and Dissertations

Learning based approaches to achieve better statistics in Big Data




Paul, Parag

Type of Degree

PhD Dissertation


Computer Science and Software Engineering


Statistics in Big data has multiple applications. There are industrial systems that depend on quick and accurate insights into possible anomalies and irregularities on a large scale, for them to take better business decisions. Query Optimizers in Relational Database Management Systems(RDBMS) have traditionally used some form of sample based sketches or histograms to do cardinality estimations for join outputs and other predicates involved in the nodes of a SQL query. These approximations allow the optimizer component to find out near optimal join order for a tree of multiway joins. They also help in designing and planning in advance for the system resources like memory, number of cores(more relevant in modern cloud native configurations that utilize server chips with more than 100 cores). Other than query optimizers, optimal statistics can also be used to answer questions within some error guarantees. Many Big Data and Decision Support Systems(DSS) need the ability to give faster turnaround time. With the advent of Industrial Internet of Things(IOT), Edge devices and Autonomous systems, it is important that we can build better statistical models that work within the constraints of space available, memory requirements, error tolerance and speed of computation. The explosion of data has reached a point where it has already surpassed the growth rate of compute and memory capabilities which have not grown as fast. We hereby propose multiple solutions and approaches to solve problems that are inflicting big data systems. We solve two of the problems related to statistics generation in such production systems. First we achieve a consistent level of estimation errors for a given workload, by creating statistics tied to the reduction of overall error in the system. Statistics by definition is an approximation of the actual information and hence is prone to errors when queried. The second problem we solve is that of being able to generate statistics in distributed streaming systems with limited resources in a streaming fashion. Once we are loaded with the power of fast and accurate statistics, we show the impact of optimized workloads on the power consumption metrics of such a system. We solve the accuracy problem using regression algorithms to build statistical histograms that moves the state-of-the-art pivot of q-error or maximal estimation error to that of overall error by observing the slope and intercept of the errors. By minimizing the regression line of estimation errors, we guarantee a overall smoother workload than previous approaches that pivoted their solution towards reducing the maximal error. We present scenarios where this could be ineffective and result in larger error margins on the average workload, but provide lower maximal error guarantees. We introduce two new metrics, Q-Regression pair(Slope and Intercept) of the regression line on the estimation errors and QRegrArea(area under the regression line). The lower the QRegrArea, the lower the overall estimation errors. We prove experimentally that our algorithms are indeed effective on real workloads and various mathematical distribution families (Normal, Uniform, Pareto, Random, Laplace, Cauchy, Zipfian). The dataset used from real world encompass census data and public information from Big Data sets. For the second problem, we propose another algorithm, named as AUPASS. The algorithm can be used to create statistics using low footprint in terms of resources, in a streaming fashion, with the ability to losslessly merge the intermediate statistics in a distributed system, so that the summary view on the control node can be generated in a meaningful fashion. This has immense impact for modern day Industrial Internet of Things where the compute is limited and memory requirements are stringent. The algorithm uses sketch based summaries that can be integrated across nodes without loss of critical information. Algorithms like Count Min Sketch, Hyperloglog, Reservoir Sampling and KLL Sketches have been utilized to generate the summaries. The results show that the proposed algorithms make a significant improvement over the state of the art. The combined power of the two independent proposals have the singular affect of being memory/compute efficient, stateless and energy efficient. The results of the experiments demonstrate that a better equipped optimizer will create better execution plans and eventually lower the energy requirements of any given system.