Thermal-aware Resource Management in Energy Efficient Clusters

Taneja, Shubbhi

View/Open

Dissertation_STaneja_2018.pdf (3.491Mb)

Date

2018-07-25

Author

Taneja, Shubbhi

Type of Degree

PhD Dissertation

Department

Computer Science and Software Engineering

Abstract

Thermal management is a pressing need in most electronic devices today, from portable devices to high-performance servers, and thermal issues pose barriers to the safe operation of data centers. The goal of thermal management is to reduce thermal hotspots and non-uniform on-chip temperatures that may reduce the longevity of hardware. In this dissertation research, we develop a thermal-aware job scheduling strategy called tDispatch, tailored for MapReduce applications running on Hadoop clusters. The scheduling idea of tDispatch is motivated by a profiling study of CPU-intensive and I/O-intensive jobs from the perspective of thermal efficiency. We show that CPU-intensive and I/O-intensive jobs have different thermal and performance impacts on the multicore processors and hard drives of Hadoop cluster nodes. After quantifying the thermal behavior of Hadoop jobs on the master and data nodes of a cluster, we propose a scheduler that performs job-to-node mappings for CPU-intensive and I/O-intensive jobs. We apply our strategy to several MapReduce applications with different resource consumption profiles. Our experimental results show that tDispatch creates opportunities to cool down multicore processors and disks in Hadoop clusters deployed in modern data centers. Our findings can be applied to other thermal-efficient job schedulers that are aware of the thermal behavior of CPU-intensive and I/O-intensive applications submitted to Hadoop clusters.

Characterizing the thermal profiles of cluster nodes is an integral part of any approach that addresses thermal emergencies in a data center. As the power density of today’s data centers grows, preventing thermal emergencies becomes a vital issue. Most existing thermal models use CPU utilization to estimate power consumption, which in turn facilitates outlet-temperature predictions. Such utilization-based thermal models may introduce errors in predicting power usage because the mappings from system utilization to outlet temperatures are inaccurate. To address this concern, we eliminate the utilization model as a middleman from the thermal model. We propose a thermal model, tModel, that projects outlet temperatures from inlet temperatures and directly measured multicore temperatures rather than relying on a utilization model. The proposed model estimates the outlet air temperature of cluster nodes in order to predict their cooling costs. We validate the accuracy of our model against data gathered by thermal sensors in our cluster. Our results demonstrate that tModel estimates the outlet temperatures of cluster nodes with much higher accuracy than CPU-utilization-based models. We further show that tModel can be used to estimate the cooling cost of data centers from the predicted outlet temperatures.

High energy efficiency of applications helps reduce the operational costs of data centers. For a wide range of applications, the overall computational cost can be significantly reduced if an exact solution is not required. Approximate computing is a paradigm that leverages the forgiving nature of many applications to improve the energy efficiency of cluster nodes. We propose a framework called tHadoop2 for MapReduce applications running on Hadoop clusters. To facilitate the development of tHadoop2, we incorporated an existing thermal-aware workload placement module called tHadoop into it. Our framework consists of three key components: tHadoop, a thermal monitoring and profiling module, and an approximation-aware thermal manager. We investigated the thermal behavior of a MapReduce application called Pi running on Hadoop clusters by varying two input parameters: the number of maps and the number of sampling points per map. Our profiling results show that Pi exhibits inherent resilience in terms of the number of precision digits present in its value. Note that this result quality varies with the application type. Other MapReduce applications can be scrutinized by exploring their characteristics and finding opportunities for acceptable inexactness in outputs. Nevertheless, the proposed framework, coupled with approaches for making tradeoffs, is generally applicable to any MapReduce application.
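
The tradeoff between the two Pi input parameters and the precision of the result can be illustrated with a small sketch. This is not the dissertation's code and does not run on Hadoop: it is a single-process Python stand-in that uses plain pseudo-random sampling, and the names estimate_pi, matching_digits, num_maps, and samples_per_map are assumptions introduced here for illustration only.

```python
import math
import random


def estimate_pi(num_maps: int, samples_per_map: int, seed: int = 0) -> float:
    """Estimate pi by sampling points in the unit square.

    Each simulated "map" draws samples_per_map points and counts how many
    fall inside the quarter circle; the "reduce" step is the running sum.
    """
    inside = 0
    for map_id in range(num_maps):
        rng = random.Random(seed + map_id)  # separately seeded stream per map
        for _ in range(samples_per_map):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                inside += 1
    total = num_maps * samples_per_map
    return 4.0 * inside / total


def matching_digits(estimate: float) -> int:
    """Count leading digits of estimate that agree with math.pi."""
    a, b = f"{estimate:.15f}", f"{math.pi:.15f}"
    count = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        if ca != ".":
            count += 1
    return count


# Hypothetical sweep: fewer samples mean less work but fewer correct digits.
for samples in (100, 10_000):
    value = estimate_pi(num_maps=4, samples_per_map=samples)
    print(samples, value, matching_digits(value))
```

Under these assumptions, lowering either parameter reduces the amount of computation at the cost of fewer matching digits, which is the kind of acceptable inexactness that an approximation-aware manager could trade against thermal and energy goals.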

URI

http://hdl.handle.net/10415/6376