RTAH: Resource and Thermal Aware Hadoop

by

Gautam Dudeja

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
August 2, 2014

Keywords: Data Centers, Thermal Energy, Hadoop, Map-Reduce

Copyright 2014 by Gautam Dudeja

Approved by

Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering
Alvin Lim, Associate Professor of Computer Science and Software Engineering

Abstract

The amount of unstructured data, also known as Big Data, on the Internet is growing every day. Because Big Data is unstructured, a large-scale distributed batch-processing infrastructure like Hadoop is used instead of traditional databases. Hadoop is an open source framework that uses the MapReduce programming model to process large data sets. Hadoop's true power lies in running on a cluster of machines in a data center. Hadoop's master-slave architecture enables the master node to control the slave nodes that store and process the data. When a client application submits a job to Hadoop, the scheduler in the master node schedules tasks on every available slave to process the job in parallel. Many existing Hadoop schedulers do not consider the workload distribution, its thermal impact, or the overall heat distribution in the data center, which leads to an unstructured increase in temperature and, in turn, massive power expenditure on cooling the data center, which now accounts for about 25% of the total investment in data centers.

With the exponential increase in the cooling costs of large-scale data centers, thermal management must be adequately addressed. Recent studies have identified heat re-circulation within the data center as one of the critical reasons behind the temperature rise: for a server i, not only server i's workload but also its neighboring servers' workloads contribute to its temperature rise. Based on a thorough investigation of Hadoop's available schedulers, we propose a new resource and thermal aware scheduler that schedules tasks to minimize the peak inlet temperature across all nodes and reduce the power consumed by air conditioning units, and eventually the cooling costs of the data center. The proposed dynamic scheduler schedules a job based on the current CPU and disk utilization, the number of tasks running, and the feedback given by all slave nodes at run time. This resource information is fed to a simulator to find the temperature of each slave node as well as its neighbors' contributions. This resource and thermal aware scheduler is implemented on top of Hadoop's FIFO scheduler. We test our scheduler and the FIFO scheduler by running a set of standard Hadoop benchmark applications such as WordCount, Distributed Grep, Pi and TeraSort at different temperature thresholds, utilization thresholds and cluster sizes. The experimental results show that our scheduler achieves an average inlet temperature reduction of 2.5°C over the default FIFO scheduler, which saves about 15% of the cooling cost with little performance overhead.

Acknowledgments

I would never have been able to finish my dissertation without the guidance of my committee members, help from friends, and support from my family. I would like to express my deepest gratitude to my adviser, Dr. Xiao Qin, for his excellent guidance, caring, patience, and for providing me with an excellent atmosphere for doing this work. He has been a true mentor and adviser in every sense of the term.
I could not have imagined having a better adviser and mentor for my Master's thesis studies. I would also like to thank Dr. Cheryl Seals and Dr. Alvin Lim for serving as members of my advisory committee, and for their consistent encouragement and insightful comments. I also owe much gratitude to Dr. Daniela Marghitu, Dr. David Umphress and Dr. Kai Chang for shaping my graduate student career for the better. In addition, I would like to thank all the great professors here in the Auburn CSSE department for challenging me and teaching me everything I know about computer science today. My sincere thanks to all those in my lab group, Xunfei, Yuenqi and Tausif, for their ongoing support and guidance. I owe very special thanks to Ajit Chavan, another lab mate, who helped me go from strength to strength and has been a true friend. Last but certainly not least, I would like to thank God and my family, who are miles away wishing me well every second. In addition to my family, I also thank my Auburn family, my friends Mohit, Sarthak, Amrit, Harshit, Vaibhav, Shubbhi, Kanika and Yasmeen, for their moral support and for being there in all the good and hard times. My hope is that my actions and this thesis are done in a way that would please them.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Hadoop Map-Reduce Framework
1.2 Hadoop Distributed File System
1.3 Existing Schedulers in Hadoop
1.4 Why is Thermal Management Required in Data Centers?
1.5 Contribution
1.6 Organization
2 Apache Hadoop Framework
2.1 Hadoop Top Architecture
2.2 HDFS
2.2.1 NameNode
2.2.2 DataNode
2.2.3 Backup Node or Secondary NameNode
2.2.4 Replica and Block Management
2.3 MapReduce
2.4 HeartBeat Mechanism
3 Data Center Layout and Existing Thermal Models
3.1 System Model
3.2 Data Center Layout and Heat Distribution
3.3 Related Work
4 RTAH: Resource and Thermal Aware Scheduler
4.1 FIFO Scheduler
4.2 RTAH Design
4.2.1 HeartBeat Mechanism Modification
4.2.2 Thermal Simulator
4.3 Implementation
5 Results and Interpretation
5.1 Set up
5.1.1 Hardware
5.1.2 Software
5.1.3 Cluster Size and Data Set
5.1.4 Benchmarks
5.2 Results
5.2.1 Temperature Reduction
5.2.2 Cooling Cost Estimation
6 Conclusion and Future Work
6.1 Conclusions
6.2 Extension
Bibliography

List of Figures

2.1 Hadoop Top Architecture
2.2 HDFS Architecture
2.3 Map-Reduce Flow
3.1 An Overview of a data center
3.2 Coefficient of Performance
4.1 An anatomy of a typical FIFO scheduler
4.2 Cross-Interference between neighbor servers
4.3 RTAH Design Implementation
5.1 Map Task Distribution for WordCount
5.2 Peak Inlet Temperature for WordCount
5.3 Map Task Distribution for Grep
5.4 Peak Inlet Temperature for Grep
5.5 Map Task Distribution for TeraSort
5.6 Peak Inlet Temperature for TeraSort
5.7 Pi Estimation Map Distribution and Peak Inlet Temperature Reduction
5.8 Map Distribution vs. Node
5.9 Peak Inlet vs. Nodes
5.10 Peak Inlet vs. Nodes
5.11 Cooling Cost Comparison
5.12 Cooling Cost Savings
6.1 Power Controller Interaction with NameNode
List of Tables

1.1 Improving Energy Efficiency in Data Centers
4.1 Model Notation
5.1 Server Specifications
5.2 Cooling Cost Reduction Comparison for Different Benchmarks

List of Abbreviations

RTAH: Resource and Thermal Aware Hadoop
I/O: Input and Output
CPU: Central Processing Unit
CRAC: Computer Room Air Conditioning
HDFS: Hadoop Distributed File System

Chapter 1
Introduction

Cloud computing, often referred to simply as the cloud, is the leading form of emerging utility computing, which provides the basic level of computing service considered essential to meet the everyday needs of the general community. Cloud computing is the next generation of computation. Maybe clouds can save the world; possibly people can have everything they need on the cloud. Cloud computing is the next natural step in the evolution of on-demand information technology services and products. The Cloud is a metaphor for the Internet. It is a style of computing in which IT-related capabilities are provided as a service, allowing users to access technology-enabled services from the Internet (i.e., the Cloud) without knowledge of, expertise with, or control over the technology infrastructure that supports them. Gartner Inc. has predicted that by year-end 2016, more than 50% of Global 1000 companies will have stored customer-sensitive data in the public cloud [5]. IDC has predicted that 80% of new commercial enterprise apps will be deployed on cloud platforms [6]. Without any doubt, we can say that the phrase "cloud computing" has become the buzzword of the computing world.

Studies by the Software Engineering Institute [7] show that cloud computing environments can be either public or private. These computing environments represent the way the services are offered to clients. In a public environment the services are offered to clients either free or for a fee. Private environments are generally limited to organizations in which services are deployed behind the organization's firewall. Computing environments generally take one of the three popular service models described below:

1. Infrastructure-as-a-Service (IaaS): The computational infrastructure consists of a set of virtual machines that have computation and storage capacity and are available for access over the Internet. In this model, clients can run computation-intensive or data-intensive jobs using a variety of interfaces that facilitate the interaction. The services are provided over an infrastructure, and clients' programs do not have rights to access or modify it. Some examples of IaaS include Amazon CloudFormation, Rackspace Cloud and Google Compute Engine.

2. Platform-as-a-Service (PaaS): This service model provides the basic platform upon which clients can write their own applications and deploy them. The platform includes operating systems, libraries, environments, services and supporting tools provided by the service provider. Like IaaS, PaaS does not allow user programs to alter the underlying infrastructure. However, they can modify and change the settings that are within the application's scope and environment.

3. Software-as-a-Service (SaaS): In this service model, software developed by the client, the provider or a third party is provided as a service to the client.
The client does not run the application locally; instead, it uses an API to communicate with the application that runs remotely on the cloud platform. These APIs cannot modify the underlying cloud infrastructure, but they can still be used to customize and change the application's configuration. A few examples of SaaS include GMail, GDocs and Office 365.

Providing these services to supply the demand for computational power as a utility has grown to unheard-of volumes, and that has led to building infrastructure to handle the huge amount of data being produced every second. The amount of unstructured data, also known as Big Data, on the Internet is growing every day. The sheer volume of data being stored today is exploding. In the year 2000, 800,000 petabytes (1 PB = 1000^5 bytes) of data were stored in the world, and by the year 2012 this had grown to 8 zettabytes (1 ZB = 1000^7 bytes). Social networking and e-commerce sites like Facebook, Twitter, Amazon and YouTube generate and process petabytes of data every day. Big Data is large and unstructured, so it is really hard to process and analyze using traditional database models like RDBMS. In cloud computing, MapReduce is the programming model used for processing and analyzing large data sets. MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.

1.1 Hadoop Map-Reduce Framework

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The model does not work by sharing the data arbitrarily between the nodes. Instead, the data elements in MapReduce are immutable. The data is written only once and read many times. The data read from the input files in HDFS is processed and converted to intermediate values and further processed to generate outputs. Any changes to the input files during this process are not reflected in the actual files.

As the name suggests, MapReduce programs process the input data in two stages: a Map stage and a Reduce stage. In the mapping stage, the mapper takes one item at a time from the input list of data elements that are fetched from HDFS and transforms it into an intermediate output data element. The Map operations are parallelized when the input file set is first split into several pieces called File Splits or Input Splits. Every mapper has exactly one input split; the number of mappers created depends on the number of input splits. Splitting the input file set helps parallelize the processing, as the mappers do not have to synchronize and contend to read the file. Moreover, mappers do not have identities of their own, so all mappers are the same and are not aware of each other's existence, let alone communicate with each other. Every mapper that receives an input split processes it in a specified format. The input split parser (or RecordReader) in the mapper parses the split and generates the key-value pairs. The key-value pairs are processed in parallel by the mappers, one at a time, to generate exactly one intermediate key-value pair for every (key, value) pair. The output (key, value) pairs of the mapper serve as input to the reducer. When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer.
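As an illustration of the mapping stage just described, the following is a minimal word-count mapper written against Hadoop's standard Java MapReduce API. It is a generic sketch for illustration, not code taken from this thesis: the RecordReader hands the mapper a (byte offset, line) pair, and the mapper emits an intermediate (word, 1) pair for every word it finds.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output key: a word; output value: the count 1 for that occurrence.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break the line into tokens and emit one intermediate pair per word.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```

Each mapper instance processes exactly one input split and never communicates with other mappers, which is what makes the map stage trivially parallel.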
The reducer receives the intermediate data generated by the mappers as input, combines the values of all mapper outputs for a key and generates a single output corresponding to the input data fetched by the mappers. Each reducer reduces a set of keys unique to it, so reducers are like mappers in the sense that they do not have to communicate with each other and remain anonymous to one another.

1.2 Hadoop Distributed File System

Hadoop includes a distributed file system called the Hadoop Distributed File System (HDFS). It is designed for storing large files, on the order of petabytes, with streaming data access, running on commodity hardware. It follows the write-once, read-many model, and since analyses are performed over the whole data set, the time to read it should be very fast. The write-once, read-many model relaxes concurrency control issues, makes data coherency simple and enables high throughput. Data writes are restricted to one writer at a time.

In the HDFS architecture, the NameNode manages file system operations and maps data blocks to DataNodes. The DataNodes ask the NameNode for the type of file operations to perform. The NameNode returns values to the functions called from the DataNode. The NameNode maintains and administers changes to the file system namespace.

Data replication and organization: One of the key features of HDFS is that it provides fault tolerance. It does so by replicating file blocks. The number of replicas can be provided by the user at input time. HDFS replication is rack aware so as to use network bandwidth intelligently. The location of the DataNodes is identified by the NameNode through rack IDs.

Failure detection and prevention: Failures in HDFS can occur in NameNodes, DataNodes or through network connectivity problems. HDFS uses heartbeat messages to detect the presence of a connection between the NameNode and a DataNode. Each DataNode sends messages to the NameNode indicating it is alive. If the NameNode stops receiving these messages from a DataNode, it marks it as dead and stops sending it requests. Since the dead node no longer responds to messages, the data present on that node is considered lost. If the loss of a node causes the replication factor to go below the minimum value, the NameNode starts the process of replicating the lost data on other nodes.

1.3 Existing Schedulers in Hadoop

The performance of a master-worker system like a MapReduce system is closely tied to the task scheduler on the master. Hadoop schedulers are designed as jar modules and can be easily plugged in to any Hadoop distribution. Although there has been a lot of work on schedulers, they are still in the early stages of their life compared to OS schedulers. Still, there are quite a few popular schedulers that are worth mentioning.

The FIFO scheduler is the default scheduler in Hadoop, which uses a single queue for scheduling tasks (partitioned jobs) in a FIFO manner. Yahoo's capacity scheduler uses multiple queues for scheduling. It schedules jobs and assigns resources to jobs based on the resource capacity allocated to each job queue and the usage density of those capacities. Facebook's fair scheduler uses multiple queues for allocating different resources in the cluster. The fair scheduler maintains a pool of jobs, with each pool having a dedicated number of Map and Reduce slots. It runs a job by using the map and reduce slots, and if a pool is not running any job then the free slots can be allocated to other pools.
The Dynamic Priority Scheduler is a parallel task scheduler that allows users to control their allocated capacity by adjusting their spending over time.

1.4 Why is Thermal Management Required in Data Centers?

Energy consumption in data centers has seen a steady increase, as has the power density within data-center-like environments. Almost half of this power consumption is cooling power, which signals the urgent need for efficient solutions to reduce energy consumption and green the data centers. The Green Grid, a consortium of data center technology companies, is trying to establish industry standards for maintaining and achieving energy and thermal efficiency [18]. No existing frameworks are thermal efficient, although resource and workload management is innate to achieving thermal efficiency. The high maintenance cost is predominantly due to high electricity and cooling costs, which account for 25% of the total investment. In fact, the cooling costs of a data center are higher than the cost of the entire IT equipment it supports.

In data centers the main contributions to power are computing power and cooling power. The power usage effectiveness (PUE) [18] of a data center is the ratio of the total power consumed to the power consumed by the IT equipment; a larger PUE indicates larger cooling power consumption, since cooling is the single largest non-computing power consumer. Thermal aware resource management is therefore of prominent importance for improving cooling energy efficiency. According to the US Department of Energy, the average data center PUE is around 1.7, which means that around 30% - 40% of the total energy is cooling cost. Hadoop is popularly used by enterprises that usually run average data centers, so incorporating thermal awareness into it would benefit them. None of the Hadoop schedulers developed so far considers the sources of temperature to manage the power consumption of a node, and most importantly the thermal model of the data center, while scheduling jobs. Although there have been many works on scheduling tasks in the data center to make it more thermal aware, none of these schedulers have been implemented in Hadoop to see their performance in reality.

Table 1.1: Improving energy efficiency in Data Centers

Design: Cold aisle, hot aisle arrangement; aisle isolation.
Equipment: Low power states and DVFS (e.g., Intel's SpeedStep, AMD's PowerNow); low power electronics (Intel Atom).
Framework: Popular commercial frameworks, e.g., Hadoop, Rocks and Moab, are not thermal-aware.

1.5 Contribution

In this research we propose to incorporate thermal and resource awareness into the Hadoop MapReduce framework (RTAH) so as to contribute to reducing the cooling power of the data centers it runs in. The proposed scheme dynamically schedules tasks on the slave nodes with the sole intention of reducing the inlet temperature of each slave node, which leads to a reduction in air conditioning cost. The cooling energy depends on two factors: 1) the cooling demand, driven by the power distribution and the redline temperature, and 2) the cooling behavior, i.e., the behavior of the Computer Room Air Conditioner in meeting the cooling demand. Of particular concern is the re-circulation and intermixing of the hot air generated by the servers running the jobs with the cold air supplied by the CRAC. The heat re-circulation depends on the data center layout and can cause hot spots, which in turn increase the cooling demand. The proposed algorithm uses the slave nodes' CPU and disk utilization feedback to the master node.
The second most critical piece of information the algorithm utilizes is the heat re-circulation matrix, which represents the temperature effect of servers on each other. The algorithm tries to account for all heat-generating sources, as well as their re-circulation effect, when making its decision, in order to make sure the inlet temperatures of the slave nodes in a rack stay below a redline. Finally, we evaluate and compare our thermal scheduler (RTAH) with the native Hadoop scheduler (FIFO) on our lab cluster using standard benchmarks and applications like WordCount, Distributed Grep, Pi and TeraSort at different temperature thresholds and cluster sizes. The experimental results show that our scheduler achieves a peak inlet temperature reduction of 2.5°C and a power consumption reduction of 14%. We also used government data for states with data center locations, and we observed a cost saving of around half a million dollars annually for a medium-size data center with a cluster size of 12,000.

1.6 Organization

The thesis is organized as follows. Chapter 2 explains Hadoop's architecture, HDFS and the MapReduce framework in Hadoop. Chapter 3 explains the problem and existing thermal solutions. Chapter 4 describes our thermal aware scheduler. Chapter 5 analyzes the results and performance of our scheduler. Chapter 6 discusses possible future work and concludes the thesis.

Chapter 2
Apache Hadoop Framework

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop enables a computing solution that is scalable, cost effective, flexible and fault tolerant [6].

2.1 Hadoop Top Architecture

Hadoop is implemented using the relatively simple Client-Master-Slave design pattern. There are two masters in the architecture, which are responsible for controlling the slaves across the cluster. The first master is the NameNode, which is dedicated to managing HDFS and controlling the slaves that store the data. The second master is the JobTracker, which manages the parallel processing of HDFS data on the slave nodes using the MapReduce programming model. The rest of the cluster is made up of slave nodes, which run both the DataNode and TaskTracker daemons. DataNodes obey the commands from their master NameNode and store parts of the HDFS data, decoupled from the metadata in the NameNode. TaskTrackers, on the other hand, obey the commands from the JobTracker and do all the computing work assigned by the JobTracker. Finally, client machines are neither masters nor slaves. The role of the client machine is to give jobs to the masters: to load data into HDFS, submit MapReduce jobs describing how that data should be processed, and then retrieve or view the results of the job when it is finished.

Figure 2.1: Hadoop Top Architecture

Figure 2.1 [24] shows the basic organization of a Hadoop cluster. The client machines communicate with the NameNode to add, move, manipulate or delete files in HDFS. The NameNode in turn calls the DataNodes to store, delete or make replicas of the data being added to HDFS. When the client machines want to process the data in HDFS, they communicate with the JobTracker to submit a job that uses MapReduce. The JobTracker divides the job into map/reduce tasks and assigns them to TaskTrackers for processing. Typically, all nodes in a Hadoop cluster are arranged in air-cooled racks in a data center. The racks are linked with each other by rack switches running on TCP/IP.
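To make the client's role concrete, the following is a minimal driver sketch using Hadoop's standard Java job-submission API, showing how a client might configure and submit a MapReduce job. It is a generic illustration, not code from this thesis; the WordCountMapper and WordCountReducer classes are assumed word-count examples like the ones sketched elsewhere in this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Describe the job: which mapper/reducer to run and the output types.
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // assumed word-count mapper
        job.setReducerClass(WordCountReducer.class);    // assumed word-count reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input lives in HDFS; one map task is created per input split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the JobTracker and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```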
2.2 HDFS

The Hadoop Distributed File System is the file system designed for Hadoop to store large data sets reliably and to stream that data to user applications at high throughput, rather than providing low-latency access. Hadoop is written in Java, and that makes it incredibly portable across platforms and operating systems. Like other distributed file systems such as Lustre and PVFS, HDFS stores the metadata and the data separately. As shown in Figure 2.2 [3], the NameNode stores the metadata and the DataNodes store the application data. But unlike Lustre and PVFS, HDFS stores replicas of the data to provide high-throughput data access from multiple sources, and the data redundancy also increases the fault tolerance of HDFS. When HDFS replicates, it does not replicate the entire file; it divides the files into fixed-size blocks, and the blocks are placed and replicated on the DataNodes. The default block size in Hadoop is 64 MB and is configurable.

Figure 2.2: HDFS Architecture

2.2.1 NameNode

The NameNode is the master of HDFS that maintains and manages the blocks present on the DataNodes (slave nodes). It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is just one NameNode in Gen1 Hadoop, which makes it the single point of failure of the entire Hadoop HDFS cluster. The HDFS architecture is built in such a way that user data is never stored on the NameNode. The following are the functions of a NameNode:

The NameNode maintains and executes the file system namespace. If there are any modifications to the file system namespace or its properties, this is tracked by the NameNode.

It directs the DataNodes (slave nodes) to execute the low-level I/O operations.

It keeps a record of how the files in HDFS are divided into blocks and on which nodes these blocks are stored, and by and large the NameNode manages the cluster configuration.

It maps a file name to a set of blocks and maps a block to the DataNodes where it is located.

It records the metadata of all the files stored in the cluster, e.g., the location, the size of the files, permissions, hierarchy, etc.

With the help of a transactional log, the EditLog, the NameNode records each and every change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.

The NameNode is also responsible for taking care of the replication factor of all the blocks. If there is a change in the replication factor of any of the blocks, the NameNode will record this in the EditLog.

The NameNode regularly receives a Heartbeat and a Blockreport from all the DataNodes in the cluster to make sure that the DataNodes are working properly. A Blockreport contains a list of all blocks on a DataNode. In the case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and also manages the communication traffic to the DataNodes.
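The division of labor between the NameNode and the DataNodes is hidden behind Hadoop's FileSystem API. The sketch below is a generic illustration using the standard org.apache.hadoop.fs classes, not code from this thesis; the path and configuration values are arbitrary examples. It writes and reads back a small file: the NameNode resolves the path and block locations, while the DataNodes serve the actual bytes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // fs.default.name (e.g., hdfs://namenode:9000) tells the client where the NameNode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // illustrative path

        // Write once: the NameNode allocates blocks, the DataNodes store them.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }

        // Read many: block locations come from the NameNode, bytes from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```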
2.2.2 DataNode

A DataNode stores data in HDFS. A functional file system has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode and spins until that service comes up. It then responds to requests from the NameNode for file system operations. The following are the functions of a DataNode:

DataNodes perform the low-level read and write requests from the file system's clients.

They are also responsible for creating blocks, deleting blocks and replicating them based on the decisions taken by the NameNode.

They regularly send a report on all the blocks present in the cluster to the NameNode.

DataNodes also enable pipelining of data; they forward data to other specified DataNodes.

DataNodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.

The DataNode stores each block of HDFS data in a separate file in its local file system. When a DataNode starts, it scans through its local file system, creates a list of all the HDFS data blocks that relate to each of these local files and sends a Blockreport to the NameNode.

2.2.3 Backup Node or Secondary NameNode

The NameNode is the single point of failure of the Hadoop cluster, so HDFS copies the namespace of the NameNode periodically to persistent storage for reliability; this process is called checkpointing. Along with the namespace it also maintains a log of the actions that change the namespace; this log is called the journal. The checkpoint node copies the namespace and journal from the NameNode and applies the transactions in the journal to the namespace to create the most up-to-date view of the NameNode's namespace. The backup node, in contrast, copies the namespace, accepts a journal stream of namespace changes and applies the transactions to the namespace stored in its storage directory. It also keeps up-to-date information about the namespace in memory and synchronizes itself with the namespace. When the NameNode fails, HDFS picks up the namespace from either the BackupNode or the CheckpointNode.

2.2.4 Replica and Block Management

HDFS makes replicas of a block with a strategy designed to enhance both performance and reliability. By default the replica count is 3; the first block is placed on the node of the writer, the second is placed in the same rack but on a different node, and the third replica is placed in a different rack. In the end, no DataNode contains more than one replica of a block, and no rack contains more than two replicas of the same block. The nodes are chosen on the basis of proximity to the writer when placing the blocks. There are situations in which blocks might be over-replicated or under-replicated. In the case of over-replication, the NameNode deletes replicas within the same rack first, starting from the DataNode with the least available space. In the case of under-replication, the NameNode maintains a priority queue of blocks to replicate, and the priority is highest for the least-replicated blocks. There are tools in HDFS to maintain the balance and integrity of the data. The Balancer is a tool that balances data placement based on node disk utilization in the cluster. The Block Scanner is a tool used to check integrity using checksums. Distcp is a tool used for inter/intra-cluster copying.

2.3 MapReduce

The JobTracker is the master to which applications submit MapReduce jobs. The JobTracker creates the map tasks based on the input splits and assigns the tasks to the TaskTracker nodes in the cluster. The JobTracker is aware of the data block locations in the cluster and of the machines that are near the data. The JobTracker assigns a task to a TaskTracker that has the data, and if it cannot, it schedules the task on the node nearest to the data to optimize network bandwidth.
The TaskTracker sends a HeartBeat message to the JobTracker periodically to let the JobTracker know that it is healthy; in the message it includes the available memory, CPU frequency and so on. If a TaskTracker fails to send a HeartBeat to the JobTracker, the JobTracker assumes that the TaskTracker is down and schedules its tasks on another node that is in the same rack as the failed node.

Figure 2.3: Map-Reduce Flow

Figure 2.3 [4] shows the data flow of MapReduce across a couple of nodes. The steps below explain the flow of MapReduce [4].

Split the file: First the data in HDFS is split up and read in the specified InputFormat. The InputFormat can be specified by the user, and the chosen InputFormat reads the files in the directory, selects the files to be split into InputSplits and gives them to a RecordReader to read the records as (key, value) pairs that are processed in further steps. Standard InputFormats provided by MapReduce are:

- TextInputFormat reads text files, where the byte offset is the key and the line contents are the value.
- KeyValueInputFormat reads (key, value) pairs where keys and values are separated by a tab character.
- SequenceFileInputFormat is a Hadoop-specific high-performance binary format where key and value are user defined.

The InputSplit is the unit of work that comprises a single map task in a MapReduce program. The job submitted by the client is divided into a number of tasks equal to the number of InputSplits. The default InputSplit size is 64 MB and can be configured by modifying the split size parameter. InputSplits enable the parallel processing of MapReduce by scheduling the map tasks on other nodes in the cluster at the same time. When HDFS splits the file into blocks, the task assigned to a node accesses the data locally.

Read the records in the InputSplit: Although the InputSplit is ready to be processed, it still does not make sense to the MapReduce program, as its input is not in key-value format. The RecordReader actually loads the data and converts it to the (key, value) pairs expected by the Mapper task. Each record read by the RecordReader results in a call to the map() method of the Mapper.

Process the records: When the Mapper gets the key-value pair from the RecordReader, it calls the map() function to process the input key-value pair and output an intermediate key-value pair. While the mappers are reading their share of the data and processing it in parallel across the cluster, they do not communicate with each other, as they have no data to share. Along with the key-value pair, the Mapper also gets a couple of objects which indicate where to forward the output and where to report the status of the task. A Combiner combines all the pairs with the same key before sending the intermediate data to the Reducer; it is in some ways a mini Reducer.

Partition and Shuffle: The mappers output the (key, value) pairs that are the input for the reducers. At this stage the mappers begin exchanging their intermediate outputs; this process is called shuffling. A reducer reduces the intermediate values with the same key, so all intermediate outputs with the same key end up in the same partition. The partitioner determines which partition a given (key, value) pair goes to. The intermediate data are sorted before they are presented to the Reducer.

Reduce the mapper's output: For every key in the reducer's assigned partition, a reduce() function is called. Because the reducer reduces the partition with the same key, it iterates over the partition to generate the output. The OutputFormat specifies the format of the output records, and the reporter object reports the status. The RecordWriter writes the data to the file specified by the OutputFormat.
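To complement the mapper sketched in Chapter 1, the following is a minimal word-count reducer against the same Hadoop Java API, again a generic illustration rather than code from this thesis. It receives one key together with all of its shuffled values and emits a single (word, total) pair.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of counts emitted by the mappers for that word.
// Output: the word and the sum of those counts.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // one output record per unique key
    }
}
```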
2.4 HeartBeat Mechanism

As we know, once an input file is loaded onto the Hadoop cluster, the file is sliced into blocks and these blocks are distributed across the cluster. At this point the JobTracker and TaskTrackers come into the picture: to process the data, the JobTracker assigns tasks to the TaskTrackers. Suppose that while processing is going on, one DataNode in the cluster goes down. The NameNode must know that this DataNode is down, otherwise it cannot continue processing using the replicas. To make the NameNode aware of the status (active/inactive) of the DataNodes, each DataNode sends a heartbeat signal every few seconds (3 seconds by default). Based on these heartbeat signals, the JobTracker assigns tasks to the TaskTrackers that are active. If a TaskTracker does not send a signal within the timeout window (10 minutes by default), the JobTracker treats it as inactive and looks for an idle one to assign the task to. If there are no idle TaskTrackers, the JobTracker waits until a TaskTracker becomes idle.

Chapter 3
Data Center Layout and Existing Thermal Models

In this chapter we begin by describing the contemporary system model. Then we briefly describe the general layout of a data center as well as the heat circulation and cooling energy associated with a typical data center. Next we present a review of the thermal aware scheduling algorithms proposed in the literature.

3.1 System Model

We start with some standard assumptions which are fairly generic for a data-center-like setting. We consider a data center with a non-uniform temperature distribution within the data center room. We assume the servers are heterogeneous, so they contribute non-uniformly to the heat re-circulation in the data center room. Further, we assume the data center runs data-processing jobs using the Hadoop framework. Finally, we assume the data-processing jobs consist of delay-tolerant batch processing jobs, such as background log analysis, that do not have strict completion time requirements; they can be delayed by a bounded amount of time. Cooling energy savings are possible by being able to temporally spread the workload and assign it to the computing equipment in a way that reduces the heat re-circulation in the data center room and therefore the load on the cooling systems.

3.2 Data Center Layout and Heat Distribution

We consider typical CRAC-cooled data centers, in which the servers are arranged in chassis, which are in turn arranged in racks, with the racks organized in rows. In each aisle between two rows, either the front panels or the rear panels face each other, and the aisles are termed cold and hot aisles respectively. The computing nodes or chassis consume power and generate heat according to the amount of power they consume. Computer Room Air Conditioners (CRAC) provide the cooling through perforated tiles in the floor [12].

Figure 3.1: An Overview of a data center.

Figure 3.1 represents a typical data center. We can see that the cold aisle is towards the inlet side of the servers, while the outlet side faces the hot aisle. Cold air is blown from the CRAC and passes through perforated tiles placed in the cold aisles towards the inlet sides. Ideally the hot air from the hot aisle should go directly back to the CRAC, but in practice some heat gets re-circulated back to the inlet side of the computing nodes, which leads to an imbalance in thermal management and non-uniformity in the temperature distribution.
Figure 3.1 shows that this re-circulation of heat creates hot spots towards the inlet sides of the racks, the reason being the intermixing of hot and cold air. As the temperature towards the inlet side of a server increases, the CRAC supply temperature has to go down to bring the inlet temperature back down, which increases power consumption and has lately become a large chunk of a data center's operational cost.

The temperature of the cooled air supplied by the CRAC, Tsup, should be within the limits of the redline temperature Tred. Tred depends on the client using the data center. Due to the re-circulation of heat from the hot aisle to the cold aisle, the inlet temperature vector of the servers, Tin, is represented as in Equation 3.1:

$T_{in} = T_{sup} + \Delta T$ (3.1)

Here ΔT is the temperature rise of the servers due to heat re-circulation, which means a server receives heat not only from its own outlet but also from neighboring machines. ΔT is directly proportional to the workload that the particular server is processing at the time. Cooling energy consumption depends on Tsup. The cooling energy of the CRAC can be modeled by its coefficient of performance (CoP), Equation 3.3, which is the ratio of the heat removed (i.e., computing energy) over the work required to remove that heat (i.e., cooling energy). The total power consumed (PTotal) is the sum of the total computing power (PC) and the power used for cooling (PAC); the lighting costs and other energy costs have a negligible contribution to the total cost (Equation 3.2) [11], [20].

$P_{Total} = P_{AC} + P_{C}$ (3.2)

$CoP = \frac{\text{heat removed}}{\text{energy consumed to remove the heat}}$ (3.3)

The CoP model is given by Equation 3.4, where Tsup is the supply temperature. CoP increases with the supplied temperature (Figure 3.2). A higher CoP means less work is needed to cool the servers, and hence less power consumption [12].

$CoP(T_{sup}) = 0.0068\,T_{sup}^{2} + 0.0008\,T_{sup} + 0.458$ (3.4)

The power consumed by the CRAC can be expressed in terms of the computing power and the CoP as

$P_{AC} = \frac{P_{C}}{CoP}$ (3.5)

Figure 3.2: Coefficient of Performance.

In Hadoop, PC is driven by the total number of tasks given to Hadoop to process. These tasks boil down to a few machine instructions, which take different amounts of time to execute, but at the cluster level all TaskTrackers execute similar instructions for the same kind of task, so it can be inferred that all TaskTrackers are uniformly utilized. [22] shows that there is a linear relationship between the power consumption of a node and the CPU utilization of the machine. Since there is also a linear relationship between temperature and power consumption, by transitivity CPU usage and temperature also hold a linear relationship. The CPU temperature in turn has a linear relationship with the outlet temperature of the server (Equation 3.6).

$P = a\,C_{util} + b, \qquad P \propto C_{util}, \qquad T_{cpu} \propto P, \qquad T_{out} \propto T_{cpu}$ (3.6)

It is not only the CPU temperature but also the disk temperature that contributes to the outlet temperature. [8] shows that an active disk contributes almost 1.2 - 1.3 degrees. The disk contribution is not linear, so to account for it in the thermal model the disk is treated as having two binary states, active or idle; an active disk contributes a constant temperature rise at the server's outlet. To achieve the eventual goal of reducing cooling cost, the model minimizes Tinlet and hence minimizes the need to bring down Tsup, which reduces the power consumption of the CRAC.
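To make the cooling model concrete, the following small calculation is a sketch based only on Equations 3.4 and 3.5 above; the computing load and the supply temperatures chosen are arbitrary example values, not measurements from this thesis. It shows how raising the CRAC supply temperature raises the CoP and therefore lowers the cooling power needed for the same computing load.

```java
public class CoolingCostExample {

    // CoP model from Equation 3.4.
    static double cop(double tSupCelsius) {
        return 0.0068 * tSupCelsius * tSupCelsius + 0.0008 * tSupCelsius + 0.458;
    }

    // Cooling power from Equation 3.5: P_AC = P_C / CoP.
    static double coolingPowerKw(double computingPowerKw, double tSupCelsius) {
        return computingPowerKw / cop(tSupCelsius);
    }

    public static void main(String[] args) {
        double pc = 100.0; // assumed computing load of 100 kW
        for (double tSup : new double[] {15.0, 20.0, 25.0}) {
            System.out.printf("Tsup = %.0f C  CoP = %.2f  cooling power = %.1f kW%n",
                    tSup, cop(tSup), coolingPowerKw(pc, tSup));
        }
    }
}
```

A scheduler that keeps inlet temperatures below the redline at a higher Tsup therefore directly reduces PAC.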
3.3 Related Work

Considerable work has been done on developing efficient thermal aware scheduling algorithms for highly active data centers. The algorithms differ in terms of the workload type, the optimality or complexity of the solution, and the data center thermal model parameters.

XInt-GA and XInt-SQP: XInt-GA and XInt-SQP are based on the heat re-circulation coefficients for all pairs of nodes in the data center, taking into consideration the data center's physical layout and thermodynamic conditions.

$D = [d_{i,j}]_{N \times N}$ (3.7)

Here N is the total number of cluster machines and d_{i,j} is the fraction of heat that flows from node j to node i. [22] refers to this as the HRM; it can be calculated from data center airflow simulations and the data center's physical layout. [23] proposed XInt-GA and XInt-SQP to minimize the peak inlet temperatures, resulting in less cooling power. They structured the problem as a minimization of the peak inlet temperature through task assignment and solved it using XInt-GA, a genetic algorithm (meta-heuristic) approach, as well as XInt-SQP, a sequential quadratic programming approach.

HP and Duke University in [16] and [2] built online measurement and control techniques to enhance energy efficiency. The Supply and Return Heat Indexes (SHI and RHI) characterize the energy efficiency of the CRAC in a data center. Using Computational Fluid Dynamics (CFD) tools, which are thermodynamic and fluid simulators, they identified the configurations that result in hot spots and non-uniformity in heat distribution. Using those CFD simulations they dynamically distributed the workload to achieve more thermal efficiency.

[1] proposed Thermal Aware Server Provisioning (TASP), which decreases the heat generated through server provisioning. It also uses CFD models to formulate the heat re-circulation model. The authors formulate the problem as choosing the active server set among a set of servers so as to minimize total energy. They solve this problem using Mixed Integer Programming (i.e., TASP-MIP) and propose a greedy approximation algorithm (i.e., TASP-LRH).

MinHR, proposed by [13], is a power budgeting policy. The goal is to minimize the total amount of heat re-circulated in the data center before returning to the CRAC and to increase the power budget. It uses the Heat Re-circulation Factor (HRF) as a parameter, which models the heat re-circulation of one chassis onto all other servers (i.e., for a data center whose servers have homogeneous power consumption, the HRF for each server is equal to the sum of the corresponding row of the heat re-circulation matrix D). MinHR is a heuristic solution that ranks chassis based on the ratio of the HRF of each individual chassis to the sum of all HRFs, and assigns tasks to the chassis with the lowest HRF ratio.

Inverse-temperature: This algorithm was proposed by [19], where imbalances in temperature are resolved heuristically by redistributing workload inversely proportional to the chassis outlet temperature. Thermal load balancing is achieved by providing cooling inlet air for each rack below the redline temperature, maintaining uniformity in inlet temperature and also dynamically responding to thermal emergencies that cause uneven temperatures.
We can see that all the above algorithms introduce a solution for achieving thermal balance in the data center. The solutions differ in their approach to handling the critical issue of thermal efficiency. However, none of them is flexible enough to be used in a real framework like Hadoop, so the proposed solutions remain confined primarily to simulated territory. In the next chapter we propose a dynamic scheduler implemented on a real distributed framework, Hadoop, which uses an XInt-GA-like approach of minimizing the peak inlet temperature while also considering the heat circulation matrix generated by one-time CFD simulations of the data center's physical layout.

Chapter 4
RTAH: Resource and Thermal Aware Scheduler

The scheduler in Hadoop is primarily designed to share a MapReduce cluster among several jobs. Over time it has grown to support hierarchical scheduling, preemption and multiple ways of organizing and weighing jobs. Hadoop schedulers sit behind a pluggable interface, and it is possible to switch between them. By default Hadoop ships with three schedulers: the FIFO, Fair and Capacity schedulers. All of these schedulers are structured to handle resources optimally, but none of them has any thermal management aspect. RTAH is built on top of the FIFO scheduler, the default scheduler Hadoop uses for processing jobs. We chose FIFO because it is already an optimized scheduler, and adding a thermal aware aspect to it is a natural extension.

4.1 FIFO scheduler

In Hadoop, jobs are divided into tasks based on the split size of the input data. Those tasks are then stored in a task queue. The JobTracker assigns tasks from the task queue to one of the TaskTrackers in the cluster that is healthy and free to perform the task. The interface between the JobTracker and the TaskTracker is the HeartBeat mechanism, which we introduced in Chapter 2. The JobClient submits the job to the cluster; as Figure 4.1 from [15] shows, the job is put into an internal queue from where the job scheduler picks it up and initializes it. Initialization involves creating a task queue whose length is the total number of splits, i.e., the total data size divided by the configured block size.

Figure 4.1: An anatomy of a typical FIFO scheduler.

After the splits are calculated, the JobClient creates a map task for each split, and each task is given an id. TaskTrackers run a simple loop that periodically (the period can be customized) sends heartbeat method calls to the JobTracker. The heartbeat message is the means of communication between the TaskTracker and the JobTracker. The key fields this message contains are the maximum number of Map and Reduce tasks, the total physical and virtual memory, the CPU frequency and CPU time, the response id, and the state of the TaskTracker's health. The heartbeat message is sent periodically by the TaskTracker to the JobTracker, and if the JobTracker does not get a heartbeat message from a TaskTracker for an interval of time, it labels that node as unhealthy and allocates the tasks given to that TaskTracker to other healthy TaskTrackers. In acknowledgement of the heartbeat, the JobTracker will further assign Map/Reduce tasks based on the number of free slots the TaskTracker has. TaskTrackers usually have 2 map and 2 reduce slots, which means a TaskTracker can only run 2 Map or 2 Reduce tasks at a time. Generally, the number of Map/Reduce slots in a TaskTracker is configured based on the number of cores it has; one slot per core is a widely followed setting. If a node has at least 1 free Map slot then the node gets a Mapper task, else it will do a Reduce task. Before processing a heartbeat, the JobTracker profiles the TaskTracker on the basis of its previous heartbeat information.
- Process the HeartBeat: The JobTracker receives a HeartBeat from a TaskTracker with all the above-mentioned data and accepts it only if an allowed TaskTracker sent it. If the HeartBeat is a duplicate, the JobTracker ignores it. Otherwise, it processes the HeartBeat, prepares a response and checks for tasks to execute.

- When the TaskTracker is ready to run a task, the JobTracker gets the list of tasks that are either setup or clean-up tasks.

- Choose a task from the list: The scheduler iterates through the setup task list and chooses a task if the task is runnable, not running, or is a failed task. It removes a task from the list if the task has been scheduled, killed, completed, or has run or failed on this TaskTracker before.

- Check for a flaky TaskTracker: After it obtains the Map or Reduce task, it checks whether many tasks have failed on this TaskTracker before; if yes, then it does not schedule the task. Otherwise, it marks the task as schedulable.

- Assign the created Map/Reduce task: Get the TaskTracker's total Map and Reduce slots and the total Map and Reduce slots across the pool to calculate the Map and Reduce load factors. Find a new task from the FIFO queue, ensure it has all the resources, and process the tasks in the order: failed task, non-running task, speculative task, task with no location information.

- Launch the task: After the task is assigned, it launches the task on the TaskTracker and starts a timer. If the TaskTracker does not respond to this task for too long, it marks the task as failed. The JobTracker further checks for any jobs to be killed or cleaned up, or tasks that need to be committed. In the HeartBeat response, it checks whether any restart information should be included before sending it to the TaskTracker.

4.2 RTAH Design

The most important requirement of RTAH is to have the necessary information about each TaskTracker's health and its current disk and CPU utilization communicated to the JobTracker. A temperature simulator consumes this information to calculate the temperature information the scheduler needs in order to keep the peak inlet temperature below the redline.

4.2.1 HeartBeat Mechanism Modification

One of our main design goals was to build a propagation system between the TaskTracker and the JobTracker to get the required TaskTracker information to the scheduler in real time. There are a few options among the mechanisms that already exist in Hadoop, such as Metrics, Counters and the HeartBeat. Metrics are not part of Hadoop's internal MapReduce interface; in fact they run as an external process. We chose to use the HeartBeat because of its lightweight nature. There are two HeartBeat interfaces in a Hadoop cluster: one between the JobTracker and the TaskTracker, and another between the DataNode and the NameNode. For our scheduler's purposes we used the MapReduce interface.

Each TaskTracker maintains an instance of a class called ResourceStatus, which is composed of all the resource information about the TaskTracker's resources, such as virtual memory, physical memory and the number of map/reduce slots. We modified the ResourceStatus class and added a few more methods to calculate the CPU utilization, disk utilization, CPU core temperature and disk temperature. Since Hadoop is a Java-based framework running on a JVM, it cannot directly access the machine's resources. So we used Linux resource calculator plugins, such as /proc/stat, which provide the CPU and disk information. For the disk temperature we used another Linux application, called hddtemp, to extract the disk temperature.
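The following sketch illustrates the kind of probing the modified ResourceStatus performs. It is a simplified stand-alone illustration under the same assumptions as the thesis (a Linux host exposing /proc/stat and an hddtemp daemon listening on its default port 7634); the class and method names are hypothetical and are not Hadoop's or the thesis's actual code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.net.Socket;

// Hypothetical helper showing how CPU utilization and disk temperature
// could be sampled on Linux before being attached to a heartbeat.
public class NodeResourceProbe {

    // Parse the aggregate "cpu" line of /proc/stat: user nice system idle ...
    // Utilization here is the non-idle share of jiffies since boot;
    // a real implementation would diff two samples taken a few seconds apart.
    static double cpuUtilization() throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/stat"))) {
            String[] f = reader.readLine().trim().split("\\s+");
            long idle = Long.parseLong(f[4]);
            long total = 0;
            for (int i = 1; i < f.length; i++) {
                total += Long.parseLong(f[i]);
            }
            return 1.0 - (double) idle / total;
        }
    }

    // hddtemp in daemon mode reports lines like "|/dev/sda|DiskModel|38|C|".
    static int diskTemperatureCelsius() throws Exception {
        try (Socket socket = new Socket("localhost", 7634);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            String[] fields = in.readLine().split("\\|");
            return Integer.parseInt(fields[3]);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("CPU utilization: %.2f, disk temperature: %d C%n",
                cpuUtilization(), diskTemperatureCelsius());
    }
}
```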
The advantage of using the heartbeat to communicate the data from the TaskTracker to the JobTracker is that it leverages an existing lightweight mechanism that is very well suited for real-time data propagation. We also recognize that the data is only as fresh as the heartbeat period, which is a minimum of 3 seconds. The JobTracker keeps a HashMap from each TaskTracker's name to the last TaskTrackerStatus received from that TaskTracker. The ResourceStatus instances can be accessed through TaskTrackerStatus.getResourceStatus().

4.2.2 Thermal Simulator

We have described the plan of the model as well as the basic components necessary for it. In this section we present the assumptions and the notation used in the model. The assumptions are the following:

1. The initial temperature is consistent throughout the data center and equal to the room temperature.

2. The airflow is static in all parts of the data center.

3. The strength of the supplied cooling is linearly proportional to the distance from the vent.

4. The disk is assumed to be in one of two states, either active or idle.

5. We assume that in the active state the disk contributes a constant rise in outlet temperature.

Table 4.1: Model Notation

i: server node index
Q: heat generated (J)
p: density of air (kg/m3)
f: flow rate (m3/s)
Tsup: air temperature as supplied from the cooling unit
A (n x n): heat re-circulation / cross-interference matrix
Tout: outlet temperature (C)
Tin: inlet temperature (C)
ΔT: change/rise in temperature (C)
Tred: redline/threshold temperature
Pc: computational power
R: ratio of distances
d_ij: cross-interference matrix entry
d_i: distance of the server from the AC vent (m)
d: height of the room (m)
Tinitial: initial room temperature (C)
Kdisk: constant temperature rise contributed by the disk
Tin_possible: possible inlet temperature if a job is assigned
Cutil: CPU utilization
Dutil: disk utilization
PAC: cooling cost / power consumption of the AC unit
CoP: coefficient of performance

At the JobTracker, a temperature simulator class gathers the CPU and disk utilization from the hash structure maintained by the JobTracker. For a server i, the outlet temperature is the sum of Tinitial and Trise. Here Trise is linearly related to the CPU utilization of server i, as in Equation 4.2, where Kdisk is the constant contribution of an active disk to the outlet temperature Tout. We know that the inlet temperature of a server node (TaskTracker) is affected by the CRAC temperature Tsup, by its own outlet temperature, and also by the contribution of other server nodes through the heat re-circulation effect. Equation 4.1 shows the actual representation of the inlet temperature of a server at any given point in time. In [21], Tin is defined as dependent on Tsup and a vector which models the exact strength of Tsup at each height. Here R represents the height factor of the server with respect to the CRAC; it is calculated from the ratio of the server's height above the CRAC to the ceiling height above the CRAC. This is in line with the fact that the higher the server's placement above the CRAC, which sits at the floor, the less cooling effect it receives from the cooling device. In Equation 4.1, A is a heat re-circulation or cross-interference vector which sums up the temperature rise of server node i due to the outlet temperatures of its neighbors, sitting at placements i+1, i+2 or i-1, say. The heat re-circulation vector or matrix is calculated through a thorough evaluation of the physical layout of the data center using CFD tools, as proposed in [17] and [13].

Figure 4.2: Cross-Interference between neighbor servers

In Figure 4.2, A is an n x n matrix, where n is the total number of cluster nodes in the cluster.
Each element d_{i,j} of the cross-interference matrix represents the fractional contribution of heat from server i's outlet to server j's inlet temperature.

    R = 1 - d_i / d

    Tin_i = Tinitial - R * Tsup + (A * Tout)_i        (4.1)

This equation can be reorganized to solve for the outlet temperature:

    Trise_i = Kdisk + a * Cutil_i + b
    Tout_i  = Tinitial + Trise_i                      (4.2)

The main reason for Tout to rise is the transfer of heat in the system from the inlet side to the outlet side. Components such as the CPU chip and the disks heat up while processing the workload, and the resulting heat transferred to the air raises Tout.

4.3 Implementation

The algorithm, whose goal is to minimize the inlet temperature, uses the following input parameters:

1. A[][]: the cross-interference (heat re-circulation) matrix
2. The CPU and disk utilization reported in the heartbeat message
3. height[]: a vector holding the height of each server with respect to the CRAC at the floor
4. The number of map tasks currently running on Task Tracker i, also supplied via the heartbeat

As stated previously, when the Task Tracker sends the HeartBeat message, we incorporate the CPU usage and disk utilization information in it. The Task Tracker's Java thread polls the OS for the temperature and utilization and adds them to the HeartBeat message. For the disk and CPU utilization we use the iostat command. The HeartBeat interval determines the accuracy of the real-time information available at the Job Tracker.

The algorithm also relies on the following assumptions:

1. Keep Tin_i below Tred.
2. Each Task Tracker is assigned one task at a time.
3. All map tasks in a job are identical in terms of their resource requirements, because they all perform the same operation.
4. All map tasks process input blocks of the same size.
5. Each task is assigned to at most one Task Tracker.
6. The above constraints imply that the temperature rise per map task is the same for a given server.
7. The heat re-circulation effect is more pronounced among servers in the same rack.

The first assumption is the decision-making condition in scheduling: keeping the inlet temperature below the redline means there is no need to lower the CRAC supply temperature Tsup. The FIFO scheduler by default assigns one map task to a Task Tracker whenever the Task Tracker requests a task from the task queue. Since each map task is created to perform a similar operation on a similar-sized data block, the tasks are assumed to contribute similarly to the temperature rise. Each task is assigned to at most one Task Tracker, so no duplicate work is done.

As shown in Figure 4.3:

1. The HeartBeat messages are sent periodically by Task Tracker i with the CPU and disk temperatures and utilizations, along with the other HeartBeat fields. The disk temperature, CPU temperature, and utilization are extracted on the Task Tracker with system commands.
2. The Job Tracker calls the Simulator, which receives the heartbeat message and extracts the Task Tracker status.
3. Initially, no Task Tracker has a task, so all of them are available to accept one. Each time a Task Tracker requests a task by advertising its availability through the heartbeat, it also registers itself with the Simulator. Thus, if Task Tracker i has just sent a heartbeat and asked for a task, it registers itself in the Simulator.

Figure 4.3: RTAH design implementation

4. The Simulator then extracts the message and builds a map structure that stores the required heartbeat fields (Cutil and Dutil), with each value keyed by Task Tracker i's address.
5. From then on, each time Task Tracker i sends a heartbeat, the Simulator updates its map structure to keep the information current.

6. The Simulator now has a structure with all the fields required to evaluate the thermal model. It uses the CPU and disk usage to calculate the outlet temperature of server i and maintains another map structure storing the outlet and inlet temperatures for each Task Tracker. The Simulator uses the model in Equation 4.2 to calculate the outlet temperature: Cutil enters a linear model for the outlet temperature, while Dutil indicates whether the disk is active or idle and, when active, contributes a constant amount to the outlet temperature. The Simulator then uses the outlet temperature to calculate the inlet temperature of the server itself and of its neighboring Task Trackers using Equation 4.1. Tsup is a constant supply temperature, and A is the cross-interference matrix whose element d_{i,j} gives the factor by which server i contributes to the inlet temperature of a neighboring Task Tracker j. The farther away the neighboring Task Tracker, the smaller the factor in the matrix; the most prominent heat re-circulation effect is observed among servers within the same rack rather than across racks.

7. Meanwhile, the scheduler iterates through the setup task list and chooses a task, or removes a task if it has been completed or killed.

8. The Simulator exposes a boolean method, shouldGo, that tells the scheduler whether it should assign a task to the Task Tracker requesting one. The method uses the difference between the inlet and outlet temperatures. Suppose Task Tracker i has requested a task; then, as in Equation 4.3,

    ΔT_i = Tout_i - Tin_i        (4.3)

9. The method then calculates the temperature rise per map task using Equation 4.4. This follows from the assumption above that all map tasks are similar, so they contribute nearly equally to the outlet temperature of server i:

    Temperature rise per map task = ΔT_i / map_i        (4.4)

where map_i is the number of map tasks currently running on Task Tracker i.

10. The method then calculates and maintains another structure holding the possible outlet temperature if the job were assigned to that tracker, Tpossible_out.

11. For server i, the method calculates the average of d_{i,j} over j ∈ [1...n], where n is the number of Task Trackers in the cluster, as in Equation 4.5:

    Average factor for server i = (Σ_{j=1..n} d_{i,j}) / n        (4.5)

12. From the matrix, the scheduler then looks for the neighbor nodes whose cross-interference value exceeds the average for that server, as computed in Equation 4.5. Once the neighboring servers that could be critically affected by the heat re-circulation effect are known, it uses Tpossible_out to calculate the inlet temperature of all those neighbor nodes, stored as Tpossible_in.

13. Now comes the decision of whether to flag the scheduler that it is acceptable to assign a job to server i. The criterion is that Tpossible_in for server i and for all of its neighbor nodes remains below Tred, as expressed in Equation 4.6:

    criteria = Tpossible_in