This Is Auburn Electronic Theses and Dissertations

RTAH: Resource and Thermal Aware Hadoop




Gautam, Dudeja

Type of Degree



Computer Science


The amount of unstructured data on the Internet, often called Big Data, grows every day. Because Big Data is unstructured, a large-scale distributed batch-processing infrastructure such as Hadoop is used instead of traditional databases. Hadoop is an open-source framework that uses the MapReduce programming model to process large data sets. Hadoop's true power emerges when it runs on a cluster of machines in a data center. Its master-slave architecture lets the master node direct the slave nodes to store and process data. When a client application submits a job to Hadoop, the scheduler on the master node assigns tasks to every available slave so that the job is processed in parallel. Many existing Hadoop schedulers consider neither the workload distribution nor its thermal impact and the overall heat distribution in the data center. The result is an uncontrolled rise in temperature, followed by massive power expenditure on cooling, which now accounts for about 25% of total investment in data centers. With the cooling costs of large-scale data centers increasing exponentially, thermal management must be adequately addressed. Recent studies have identified heat recirculation within the data center as a critical cause of the temperature rise: the temperature of a server i rises not only because of its own workload but also because of the workloads of its neighboring servers. Based on a thorough investigation of Hadoop's available schedulers, we propose a new resource- and thermal-aware scheduler that schedules tasks to minimize the peak inlet temperature across all nodes, reducing the power consumed by air-conditioning units and, eventually, the cooling costs of the data center. The proposed dynamic scheduler schedules a job based on the current CPU and disk utilization, the number of running tasks, and the feedback given by all slave nodes at run time.
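
The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes each slave reports CPU utilization, disk utilization, and running-task count in a hypothetical `NodeReport`, and that inlet temperature follows a simple linear recirculation model (supply temperature plus a weighted sum of every server's power draw). The weights, the load threshold, the recirculation matrix, and the power figures are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeReport:
    """Hypothetical run-time feedback from one slave node."""
    name: str
    cpu_util: float     # fraction in [0, 1]
    disk_util: float    # fraction in [0, 1]
    running_tasks: int
    max_tasks: int      # task slots on the node


def load_score(r, w_cpu=0.5, w_disk=0.3, w_tasks=0.2):
    """Weighted resource load; the weights are assumed, not from the thesis."""
    return (w_cpu * r.cpu_util
            + w_disk * r.disk_util
            + w_tasks * r.running_tasks / r.max_tasks)


def inlet_temps(powers, recirc, supply_temp):
    """Assumed linear model: inlet[i] = supply + sum_j recirc[i][j] * powers[j]."""
    return [supply_temp + sum(d * p for d, p in zip(row, powers))
            for row in recirc]


def pick_node(reports, powers, recirc, supply_temp, task_power, max_load=0.8):
    """Among nodes with a free slot and acceptable load, return the one whose
    extra task yields the lowest predicted peak inlet temperature, or None."""
    best, best_peak = None, float("inf")
    for i, r in enumerate(reports):
        if r.running_tasks >= r.max_tasks or load_score(r) > max_load:
            continue  # resource-aware filter
        trial = list(powers)
        trial[i] += task_power  # predicted power if the task runs on node i
        peak = max(inlet_temps(trial, recirc, supply_temp))
        if peak < best_peak:
            best, best_peak = r, peak
    return best


# Illustrative data only: three slaves, power in watts, recirc in deg C per watt.
reports = [
    NodeReport("slave-1", 0.90, 0.40, 3, 4),
    NodeReport("slave-2", 0.20, 0.30, 1, 4),
    NodeReport("slave-3", 0.10, 0.10, 4, 4),  # no free slot
]
powers = [250.0, 120.0, 180.0]
recirc = [
    [0.010, 0.005, 0.002],
    [0.005, 0.010, 0.005],
    [0.002, 0.005, 0.010],
]
chosen = pick_node(reports, powers, recirc, supply_temp=18.0, task_power=30.0)
print(chosen.name if chosen else "no eligible node")
```

With these made-up numbers the sketch selects slave-2: slave-3 has no free slot, and placing the task on the already hot slave-1 would produce a higher predicted peak inlet temperature. The off-diagonal entries of the matrix stand in for heat recirculation between neighboring servers.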