Popularity-Aware Storage Systems for Big Data Applications
Type of Degree: PhD Dissertation
Computer Science and Software Engineering
Recommendation algorithms play an increasingly dominant role in big data services like Netflix and YouTube. In streaming applications, enormous volumes of personal and industrial data inevitably flood into data centers. This dissertation focuses on popularity-aware techniques, built on recommendation algorithms, that improve data-processing performance. We make contributions centered on data reconstruction, cache replacement, malware detection, and distributed denial-of-service (DDoS) detection.
The first contribution of this dissertation is a popularity calculator coupled with a scheduler, where we advocate for erasure-coded data storage systems to archive warm data. Unlike hot or cold data, warm data must be treated in a distinctive way to optimize system performance and storage-space utilization. We employ two machine-learning algorithms to offer online data reconstruction in erasure-coded storage systems. We also combine factors including data item size and storage location to adjust the popularity value. The final popularity value indicates the malware detection priority. Our system relies on a big data storage mechanism that groups files into multiple clusters, each containing files that share similar features. Furthermore, we equip the prediction module with item-size records and storage locations so that it connects the closest users and projects the files that are likely to be accessed in the near future. The prediction module computes similarities among users in order to set the priority levels of the data blocks to be reconstructed. Our experimental results confirm that our system reduces the average waiting time of data recovery while maintaining high data-access performance for online users.
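As an illustration only, the idea of ranking blocks for reconstruction by predicted popularity might be sketched as follows. The abstract does not name the two machine-learning algorithms or the adjustment formula, so the cosine-similarity predictor, the function names, and the size and location weights below are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse access-count dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reconstruction_order(active, history, lost_blocks, size_mb, is_local):
    """Rank lost blocks by predicted popularity for one active user.

    history maps user -> {block: access count}. All weights are hypothetical.
    """
    scores = {}
    for block in lost_blocks:
        # Assumed predictor: accesses by other users, weighted by similarity.
        pop = sum(cosine(history[active], h) * h.get(block, 0)
                  for user, h in history.items() if user != active)
        # Assumed adjustments: favor smaller blocks and nearby storage.
        pop /= 1.0 + size_mb[block] / 64.0
        pop *= 1.5 if is_local[block] else 1.0
        scores[block] = pop
    # Highest adjusted popularity is reconstructed first.
    return sorted(lost_blocks, key=lambda b: scores[b], reverse=True)
```

In this sketch, a block that similar users access often is rebuilt early, while a large or remote block with the same predicted demand is pushed back.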
The second contribution is a popularity-driven cache replacement policy, PDC, designed for big data storage caching systems, in which predictions of future accesses are leveraged to improve cache-replacement performance for big data applications. PDC applies data recommendation algorithms to gauge popularity values for data objects from active users' access history. These popularity values determine replacement priorities when replacement decisions are made.
The last contribution of the dissertation is a similarity-based DDoS detection module. Inspired by a dynamic analysis of changes in active users' access behavior, we propose a DDoS anomaly detection model that discovers DDoS attack sources by examining users' similarities. The overarching goal of our solution is to pinpoint DDoS attacks at low cost by monitoring the similarity between active users and existing users. The model achieves this goal in three key steps. First, a sample user set is created. Then, active users' requests are tracked to assess similarity measures between each active user and the sample users. Finally, if the deviation of similarity exceeds prescribed thresholds, the affected users are flagged as anomalous.
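The three detection steps can be sketched roughly as below. This is a minimal illustration, not the dissertation's model: the Jaccard similarity over requested resources, the interpretation of "deviation" as the change from a user's earlier similarity, the function names, and the 0.5 threshold are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of requested resources."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def flag_anomalous(baseline, current, samples, threshold=0.5):
    """Flag users whose similarity to the sample users shifts sharply.

    baseline and current map user -> set of requested resources;
    samples is the sample user set from step one (a list of sets).
    """
    flagged = []
    for user, requests in current.items():
        # Users with no earlier record are compared against themselves.
        earlier = baseline.get(user, requests)
        # Step two: similarity between the active user and each sample user.
        deviations = [abs(jaccard(requests, s) - jaccard(earlier, s))
                      for s in samples]
        # Step three: flag the user if any deviation exceeds the threshold.
        if deviations and max(deviations) > threshold:
            flagged.append(user)
    return flagged
```

Comparing each active user against a small sample set, rather than against every other user, is what keeps the cost of this check low in the sketch.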