This Is Auburn: Electronic Theses and Dissertations

Assessment of Multiple MapReduce Strategies for Fast Analytics of Small Files




Zhou, Fang

Type of Degree

Master's Thesis


Computer Science


Hadoop, an open-source implementation of MapReduce, is widely used because of its ease of programming, scalability, and availability. The Hadoop Distributed File System (HDFS) and Hadoop MapReduce are two important components of Hadoop; Hadoop MapReduce processes the data stored in HDFS. With the explosive development of cloud computing, a growing number of business and scientific applications take advantage of Hadoop. The files processed in Hadoop are no longer limited to very large ones. Large numbers of small files from both business and scientific domains, such as document files, bioinformatics files, and geographic information files, are now processed by MapReduce. In this situation, the MapReduce performance of Hadoop degrades severely. Although Hadoop itself and other frameworks provide some MapReduce strategies, they are not designed specifically for small files. In addition, there is no theoretical analysis for evaluating MapReduce strategies for small files. In this thesis, I analyze existing MapReduce strategies for small files and use theoretical and empirical methods to determine which strategy is best for processing small files. The experimental results demonstrate the correctness and efficiency of the analysis.