Accelerate MapReduce’s Failure Recovery through Timely Detection and Work-conserving Logging
Type of Degree: Dissertation
MapReduce has become an indispensable part of the growing market for big data analytics. As the representative implementation of MapReduce, Hadoop/YARN strives to provide outstanding performance in terms of job turnaround time, scalability, fault tolerance, and so on. For fault tolerance specifically, YARN is equipped with a speculation mechanism, which can regenerate data properly in the presence of system failures. However, we found that the existing speculation mechanism has major drawbacks that hinder efficient job recovery from system failures, especially for small jobs, which account for the predominant share of MapReduce jobs in practical use. As our experiments show, a single node failure can slow a job down by up to 9.2 times. To address this issue, we conducted a comprehensive study of the fundamental causes of the breakdown of the existing speculation mechanism. To tackle those causes, we developed a set of techniques, including an optimized speculation mechanism, a centralized failure monitor and analyzer, a progress-conserving mechanism for MapTasks, and a refined scheduling policy under failures. To evaluate our framework, we conducted a set of experiments that measure the performance of each individual component and of the framework as a whole. Our experimental results show that the new framework delivers a dramatic performance improvement over the original YARN when dealing with task and node failures.