Design of a Hybrid Memory System for General-Purpose Graphics Processing Units

by

Patrick Carpenter

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
August 4, 2012

Keywords: GPU, Non-volatile memory, Phase-change memory, Hybrid simulator

Copyright 2012 by Patrick Carpenter

Approved by
Weikuan Yu, Chair, Assistant Professor of Computer Science and Software Engineering
Xiao Qin, Associate Professor of Computer Science and Software Engineering
Soo-Young Lee, Professor of Electrical and Computer Engineering

Abstract

Addressing a limited power budget is a prerequisite for maintaining the growth of computer system performance into and beyond the exascale. Two technologies with the potential to help solve this problem are general-purpose programming on graphics processors and fast non-volatile memories. Combining these technologies could yield devices capable of extreme-scale computation at lower power. The goal of this project is to design a simulator supporting a hybrid memory system, containing both dynamic random-access memory (DRAM) and phase-change random-access memory (PCRAM), to replace the graphics global memory. Because of the proprietary nature of graphics hardware and the relative immaturity of phase-change memory, it is necessary to develop an appropriate simulation framework to conduct further research. In this work, GPGPU-Sim and a modified version of DRAMSim2 are combined in the design of a hybrid simulator named GPUHM-Sim. The design, implementation and validation of GPUHM-Sim are the primary contributions of this work.

Acknowledgments

This author would very much like to acknowledge the steadfast support and encouragement provided by his advisor and mentor, Dr. Weikuan Yu, as well as the expertise and enthusiasm of Drs. Xiao Qin and Soo-Young Lee as vital members of the graduate committee. Without the combined experience and insights of these individuals, to whom this author is greatly indebted, this work would not have been possible.

This work is thanks in large part to a collaboration with Dr. Dong Li of Oak Ridge National Laboratory and Dr. Xipeng Shen of the College of William and Mary. The author is very grateful for these collaborators' passion for excellence in research and unceasing diligence.

The author is similarly grateful for the guidance and assistance provided by those fellow students (in no particular order: Cristian Cira, Adarsh Jain, Xuechao Li, Zhuo Liu, Xinyu Que, Yuan Tian, Bin Wang, Yandong Wang, Cong Xu) with whom, during his graduate studies in the Parallel Architecture and System Laboratory, he had the distinct privilege and honor of working, and without whom the quality of this work would be significantly diminished. Moreover, the assistance and recognition provided to the Laboratory by its academic and industrial partners and sponsors (in no particular order: Auburn University, Department of Energy, HPC Advisory Council, Mellanox Technologies, National Aeronautics and Space Administration, National Science Foundation, NVIDIA Corporation, Oak Ridge National Laboratory, TeraGrid) has done much to make possible the continued research activities of the Laboratory and is deeply appreciated.

Finally, the author would like to thank the Department of Computer Science and Software Engineering and Auburn University for furnishing exciting opportunities, excellence in instruction and the environment in which this work was carried out.
Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
2 Background
  2.1 General-Purpose Computation on Graphics Processing Units
    2.1.1 The Rise of Graphics Processors for High Performance Computing
    2.1.2 NVIDIA's Compute Unified Device Architecture
    2.1.3 Advantages and Disadvantages
  2.2 Non-Volatile Memory Technologies
    2.2.1 Varieties of Fast Non-Volatile Memory Technology
    2.2.2 Advantages and Disadvantages
  2.3 The General-Purpose Graphics Processor and Non-Volatile Memory
3 Phase-Change Random Access Memory in a Hybrid Graphics Global Memory
  3.1 Proposal, Plan and Goals
  3.2 Simulation Frameworks
    3.2.1 Simulating Graphics Devices with GPGPU-Sim
    3.2.2 Simulating Phase-Change Memory with DRAMSim2
  3.3 Integration, Verification and Validation
    3.3.1 Integration of Simulation Frameworks
    3.3.2 Verification Methodology
    3.3.3 Verification Results
4 Related and Future Work
  4.1 Related Work
  4.2 Conclusions and Future Work
Bibliography

List of Figures

2.1 Price-performance comparison of various memory technologies
3.1 GPGPU-Sim architecture model
3.2 High-level integration plan
3.3 Intermediate integration plan
3.4 Low-level integration plan
3.5 Global memory read bandwidth from SHOC for several devices
3.6 Global memory write bandwidth from SHOC for several devices
3.7 Peak single-precision performance from SHOC for several devices
3.8 Peak double-precision performance from SHOC for several devices
3.9 Instructions per cycle achieved versus computational intensity
3.10 Instructions per cycle from hybrid simulator versus computational intensity
3.11 Memory fetch latency from hybrid simulator versus computational intensity
3.12 Comparison of global memory read bandwidth for original and hybrid simulators

List of Tables

2.1 Characteristics of non-volatile memory technologies
3.1 SHOC v.1.1.1 Benchmark Descriptions
3.2 System configuration overview for SHOC benchmark experiments

Chapter 1
Introduction

The demand for computer systems with ever-increasing levels of performance has driven, and continues to drive, innovation at the forefront of high performance computing. Engineers and scientists demand such systems in order to solve some of the world's most challenging and exciting problems (climate change [4], nuclear fusion, the design of life-saving pharmaceutical drugs, and others), both at larger scales (e.g., climate models at global rather than continental scales) and at finer granularities (e.g., ab initio rather than parametric models to predict protein structure). Feasibly solving such problems at large scales and fine granularities expands the scope of questions which can meaningfully be addressed via computational techniques, accelerating the rate of scientific and engineering achievement, and all that goes with it. By pioneering innovations in the performance of computer systems and computer applications, all computing professionals, and the high performance computing community in particular in the context of this work, not only advance their own science and technology, but catalyze advances of the entire technical ecosystem.

Over the years, innovations targeting performance have taken many forms and relied on diverse insights. For instance, improvements have been achieved from advances in the underlying technology (e.g., vacuum tubes versus transistors and the integrated circuit, solid-state drives versus hard disk drives, the historical shrinking of feature size with advances in the manufacturing process), in the hardware architecture (e.g., superscalar and pipelined processor architectures, multi-channel main memory), in hardware targeted at specific application domains and computing paradigms (e.g., application-specific integrated circuits and field-programmable gate arrays, various parallel architectures reflecting designations in Flynn's taxonomy), and in software and algorithms themselves. Indeed, architectural and system innovations can target, and have targeted, components at virtually every level (e.g., processors, peripherals, memory, disk, network). To date, the story of performance improvements in computing systems and the applications which they service is one of essentially unbridled success.

In fact, innovations have consistently resulted in explosive growth in the raw power of computing systems. Since the early 1970s, Moore's Law has accurately reflected the increasing transistor counts of computer microprocessors. For many years, these increases came in the form of increased instruction-level parallelism, which provided performance benefits to a very general class of computer applications, and relied on shrinking feature size and adjusting voltage as necessary [15]. However, this process began to lead, somewhat recently in the fairly short history of modern computing, to an increase in power dissipation, to the point that the community has begun to move towards increased thread-level parallelism (via multi- and many-core technologies) in order to maintain exponential growth of computing capacity.
Unlike instruction-level parallelism, thread-level parallelism does not provide performance benefits to traditional sequential applications; rather, applications must typically be designed in such a way as to exploit thread-level parallelism for performance benefit. (It is worth noting that the distinction is not entirely clear-cut, since applications can be written in such a way as not to benefit from instruction-level parallelism, and since several techniques, including framework- and compiler-aided techniques, exist to ease the development of applications that exploit thread-level parallelism; in general, however, the distinction between instruction-level and thread-level parallelism is a useful one for understanding challenges particularly relevant to the high-performance computing community.) Despite this shortcoming of thread-level parallelism, it has allowed for the development of truly powerful computing systems which benefit applications of actual import; indeed, Los Alamos National Laboratory's Roadrunner, which supported massive thread-level parallelism through 122,400 processor cores, including cores from both Opteron processors and Cell processors, was the first computer system to break the petaflop barrier in 2008, performing over 10^15 floating-point operations per second [26].

As the high performance computing community strives to deliver an exaflop computer system, that is, one which can perform 10^18 floating-point operations per second, it must overcome several obstacles, not least among which is a severely limited power budget. When Roadrunner was delivered in 2008, it required a total power of approximately 2.35 megawatts and achieved a sustained efficiency of around 437.43 million floating-point operations per second per watt, making it not only the fastest computer system in the world, but also the third most power-efficient [26] [25]. In order to build and operate an exaflop computing system, the high performance computing community must satisfy a variety of constraints, including the two which are of particular interest in the context of this work: performance and power. By a straightforward argument, the performance, in terms of floating-point operations per second, of an exaflop computer must be roughly three orders of magnitude larger than that of Roadrunner ca. 2008. On the other hand, realistic power budgets for future exaflop systems are generally held to be around 20 megawatts [13], around one order of magnitude larger than the power consumption of Roadrunner ca. 2008. What this means, in practical terms, is that a realistic exaflop computing system must boast a power efficiency around two orders of magnitude higher than that of Roadrunner ca. 2008, or in the neighborhood of over 40,000 million floating-point operations per second per watt. As of November 2011, the most power-efficient petascale computing system, located at the GSIC Center at the Tokyo Institute of Technology, achieved a maximum sustained power efficiency of around 958.35 million floating-point operations per second per watt: slightly more than twice the efficiency of Roadrunner ca. 2008, and still roughly 40 times less than what will be required in order to break the exaflop computing barrier. Novel solutions will be required to meet power and performance budgets.
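As a back-of-the-envelope check of the efficiency gap just described (assuming the 20-megawatt budget cited from [13]; a slightly larger budget yields the more conservative 40,000 Mflop/s/W figure quoted above):

\[
\frac{10^{18}\ \mathrm{flop/s}}{2\times 10^{7}\ \mathrm{W}} = 5\times 10^{10}\ \mathrm{flop/s/W} = 50{,}000\ \mathrm{Mflop/s/W},
\qquad
\frac{50{,}000}{437.43} \approx 114,
\]

that is, roughly two orders of magnitude above the efficiency of Roadrunner ca. 2008.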
Several avenues exist along which the high performance computing community has been pursuing solutions to the problem of reducing the power consumption of the world's most powerful computer systems. At least three possibilities include increasing the flops per watt of computer processors (for instance, by using special-purpose accelerators such as graphics processing units or, more generally, adopting designs which employ massive thread-level parallelism on microprocessors of simpler design), making fundamental changes to the memory of computer systems in order to promote power savings (such as the use of 3D stacked memory or emerging fast non-volatile memory technologies to reduce dynamic power consumption), and improving software so that it can reduce the power drawn by under-utilized computer hardware (e.g., techniques such as fine-grained dynamic voltage and frequency scaling might be successfully applied to many hardware components in future exascale systems). To satisfy the constraint on total power consumption, it is likely that a combination of such technologies will be useful.

Of these technologies, two have emerged which have enjoyed significant attention in recent years, which promise to aid in the pursuit of the exascale, and which form the basis of this work: the use of graphics processing units for general-purpose computation, and fast non-volatile memory technologies, in particular phase-change random access memory, as replacements for the volatile memory technologies which are today commonplace. The former is thanks to a combination of several factors: the evolution of the graphics processor into a device excelling at computations involving high computational intensity as well as massive thread- and data-parallelism; initial interest, culminating in the release by NVIDIA of its Compute Unified Device Architecture (CUDA) in 2007, in the use of graphics devices for applications in scientific computing; and the impressive power efficiency of these devices, which can help compute nodes reach power efficiencies greater than those of traditional compute nodes by a factor of 4 or more, to name only a few. Meanwhile, the latter offers a non-volatile alternative to traditional volatile memory technologies such as dynamic and static random-access memory, while providing orders of magnitude of performance improvement over other non-volatile memory technologies (e.g., flash and hard disk). The non-volatility of these emerging technologies refers to the fact that power is not required to maintain the integrity of the data they store (whereas dynamic random access memory must be periodically refreshed due to leakage), meaning that the proper use of these technologies can reduce the total power consumption of the memory system by addressing static power (that is, power which is consumed when the memory system is not being used to read or write data). Of course, one of the key benefits of these technologies is the non-volatility itself; however, this aspect is not treated in this work. In summary, both of these technologies appear promising in terms of leading to the exascale, which leads to the fundamental idea of this project. By carefully combining these technologies so as to bolster their strengths and diminish their weaknesses, it is hoped that a hybrid system, one capable of massive computational capacity at lower power, can be developed.

Several questions must be answered in order to design an effective and efficient hybrid system of this kind: What kind of non-volatile memory ought to be used? How can memory controllers effectively and efficiently manage data placement, access, and migration?
More generally, how can proposed architectures and designs be compared quantitatively, in the absence of actual hardware implementations? The design and implementation of a suitable evaluation framework for hybrid graphics global memory is of the utmost importance, and, as detailed later, is the main focus and contribution of this work.

Many factors must be taken into consideration in order to properly analyze and understand the performance of programs executing on graphics devices; these include relevant hardware and architectural details, execution configurations of the graphics programs, as well as inherent and emergent computational aspects of the program itself. Understanding hardware and architectural influences is complicated by at least two considerations: first, although NVIDIA hardware and its associated CUDA have, to date, been the predominant driver of the use of graphics processors for general-purpose computation, competing offerings from ATI (now owned by AMD) and the OpenCL standard (among others) are gaining momentum; second, NVIDIA's hardware and low-level driver interfaces are entirely proprietary, and the physical instruction-set architecture varies from device to device (in practice, these issues do not represent huge obstacles to understanding device performance characteristics, to effectively and efficiently leveraging NVIDIA hardware, or to enjoying easy application portability across hardware devices, but they are worth mentioning). The performance characteristics of fundamental algorithms and data structures are equally as important for computing on graphics devices as they are for computing on traditional processors; it is worth noting that these characteristics may need to be reinterpreted and re-evaluated in light of changed assumptions regarding hardware properties. Such complications can render infeasible first-principles analyses of specific scenarios involving graphics devices and the applications they service. In many cases, other methods are appropriate.

A useful way to gain some understanding of the performance behavior of computer hardware, including graphics hardware, is to empirically evaluate the performance of many programs with differing computational qualities on one or more hardware systems with differing performance characteristics. Benchmarking, that is, evaluating the performance of standard applications while varying the hardware system on which these applications are executed, is a tried and tested technique. Several benchmark suites exist for evaluating the performance of systems using graphics devices, including the Scalable HeterOgeneous Computing (SHOC) suite [8] developed at Oak Ridge National Laboratory, Rodinia, Parboil, and others. As part of this work, experiments are conducted to evaluate the performance of the SHOC benchmark suite. Performance data are collected and used in order to evaluate the performance of the proposed simulation testbed.

The goal of this project is to design a hybrid simulation framework, called the Graphics Processing Unit Hybrid Memory Simulator (GPUHM-Sim), to enable flexible and reconfigurable simulation of hybrid global memory systems inside graphics devices.
For a variety of reasons (including the proprietary nature of graphics hardware, the relative immaturity of phase-change memory technology, and others), it has been convenient to adopt appropriate simulation frameworks, which have been integrated to produce GPUHM-Sim: GPGPU-Sim [3] for the graphics device, and a modified version of DRAMSim2 [23] for the phase-change memory. Each of these simulators represents a complex software artifact whose functioning must be at least minimally understood for it to be of use in evaluating competing experimental designs. First, GPGPU-Sim and DRAMSim2 are carefully analyzed in order to understand the functioning of these two systems. Then, a detailed integration plan is formulated which seeks to create an extensible interface between these two simulators that preserves the flow of information. The design emphasizes strict separation of interface and implementation, and supports hybridization at several levels of the memory request pipeline (across memory partition units, or within a device itself via changes to the external memory simulation framework). An initial integration of these simulators is then carried out, and significant effort is spent evaluating the quality of the integration in terms of the resiliency, performance and output quality of the hybrid simulator. Performance results are collected from simulations using the original version of GPGPU-Sim as well as the initial version of GPUHM-Sim, and the results are compared. Results from these initial efforts have been good, with GPUHM-Sim deviating from GPGPU-Sim by at most a few percent for similar memory configurations. The integration of these two simulation frameworks, constituting the design of GPUHM-Sim, the validation and verification of the resulting hybrid simulator, and the definition of a novel architectural simulation framework for hybrid graphics global memory inside graphics processing units, are the primary contributions of this work.

The rest of this thesis is arranged as follows. Chapter 2 expands on the information presented in the introduction to provide additional background, context, and motivation for the project which is the subject of this work. Furthermore, it is the hope of this author that this chapter can serve as a useful starting point, or guide, for readers wishing to better acquaint themselves with the topics of modern high performance computing, general-purpose computation on graphics processing units, or fundamental notions of emerging non-volatile random access memory technologies. Chapter 3 contains a complete report of the goals, work, and results of the project which is the main subject of this work. First, the project motivation, goals and scope are restated in a clear and concise fashion. Then, the simulators GPGPU-Sim and DRAMSim2 are described and analyzed. Finally, the integration plan is presented in detail, and the results of the initial integration are described and evaluated. Chapter 4 consists of a review of more-or-less closely related work, conclusions that can be drawn from the work and the project, and an outline of future goals and directions.

Chapter 2
Background

This chapter expands on the information presented in Chapter 1 in order to provide additional background, context, and motivation for the project which is the subject of this work.
Furthermore, it is the hope of this author that this chapter can serve as a useful starting point, or guide, for readers wishing to better acquaint themselves with the topics of modern high performance computing, general-purpose computation on graphics processing units, or fundamental notions of emerging non-volatile random access memory technologies. Interested readers are encouraged to consult referenced sources directly.

Section 2.1 introduces the graphics processor as a general-purpose computational device, and recounts some of the key developments leading to its current widespread adoption in the high performance computing community. Specifically, parallelism and how it applies to graphics devices, an overview of NVIDIA's CUDA, advantages and disadvantages of graphics devices for general-purpose computation, and some representative graphics devices and examples of systems which use them, are discussed.

Section 2.2 attempts to provide a minimal introduction to the landscape of fast non-volatile memory technologies which are of primary interest to this work, and to compare these to slower non-volatile as well as volatile devices. The section concludes with a brief review of some actual products which are beginning to enter the market or which are slated for increased production in the near future.

Section 2.3 examines the graphics memory hierarchy and candidate non-volatile memory technologies, and attempts to provide an adequate answer to the following question, which was posed in Chapter 1 as a question of fundamental importance in beginning to carry out the intended project of research: which graphics storage target(s), if any, are well-suited to integration with non-volatile memory technologies, and which non-volatile memory technology (or technologies) is appropriate for which graphics storage target(s)? By evaluating hardware characteristics and their relation to power and performance, a tentative answer is provided which establishes a clear direction for the remaining work.

2.1 General-Purpose Computation on Graphics Processing Units

This section provides background information related to the use of graphics processing units for general-purpose computations. In particular, issues related to programmability and performance are described. These and other issues inform decisions related to the design of GPUHM-Sim.

2.1.1 The Rise of Graphics Processors for High Performance Computing

Over the last 20 to 30 years, graphics processors have evolved from fixed-function, special-purpose computational devices, into devices programmable through a restricted graphics shader application programming interface, and recently into fully programmable computational coprocessors [21]. Because of their humble origins as devices used for graphics rendering, graphics processors have developed with a set of performance characteristics different from those of traditional processors, corresponding to the differences in engineering requirements. By way of comparison, graphics processors are very high-throughput, high-latency devices. This is due to the fact that graphics cards were designed primarily for graphics, and humans are more sensitive to lower throughput (e.g., lower screen resolutions or smaller monitors) than they are to longer latency (e.g., viewing graphics many milliseconds after the GPU began processing the scene).
For instance, modern NVIDIA Fermi GPUs can achieve over one teraflops of single-precision floating-point performance and over 100 gigabytes per second of sustained on-device memory bandwidth. Additionally, graphics processors dedicate much more hardware real estate to arithmetic-logic and floating-point units than to control or cache units, meaning that they excel at numerically intensive applications but not at ones which exhibit complicated control logic. Again by way of comparison to traditional processors, graphics hardware exhibits incredible amounts of thread- and memory-level parallelism. For instance, modern NVIDIA Fermi graphics cards contain in the neighborhood of 450 processor cores, many of which can simultaneously access a word of global memory over a wide memory bus. However, this hardware performance comes at the cost of programmability; achieving good performance on graphics processors takes effort and expertise. In many respects, graphics devices have evolved into systems not dissimilar from those constituting an object of study in the fields of high performance and scientific computing.

2.1.2 NVIDIA's Compute Unified Device Architecture

NVIDIA's CUDA programming model and toolkit, released in late 2007, allow developers to write general-purpose programs to be executed, in part, on NVIDIA hardware [17] [16]. CUDA represents a significant step forward in terms of programmability and has led to drastically increased adoption of general-purpose computing on graphics devices by programmers in several areas of application. It provides a minimal set of extensions to the standard C++ programming language, in addition to several application programming interfaces, for efficiently mapping general-purpose computation to NVIDIA's graphics devices. More specifically, CUDA allows programmers explicit access to the hierarchies of processors and memories on GPUs through similarly hierarchical software abstractions. Within a traditional C++ program, application programmers define a special function, called a kernel, to be executed by the GPU. Then, from within the main (host) code, the application programmer may copy data from host memory to (graphics) device global memory and back again, and in addition may invoke, or launch, the kernel, which triggers execution on the GPU over the system I/O bus. Memory transfers and kernel executions happen (or can be made to happen through appropriate flags to appropriate functions) asynchronously with respect to execution of host code, allowing for increased performance via overlapping of communication and computation.

It is at this point useful to introduce, or revisit, terminology related to the hardware architecture of NVIDIA's graphics devices which will be employed in explaining details of NVIDIA's CUDA. NVIDIA Fermi devices are organized hierarchically in terms of both processing and memory models. At the highest level, the GPU represents the entire computing device, and contains a device-global memory area referred to as global memory. Modern graphics devices typically include on the order of gigabytes of such memory, which provides high-throughput, high-latency, device-wide read/write access. At the next level, groups of processor cores, referred to as streaming multiprocessors, of which NVIDIA Fermi devices typically have around 15, contain multiprocessor-local shared memory, which is treated as an explicitly software-managed cache.
The size of shared memory varies, but is typically around 64 kilobytes per multiprocessor; regarding performance, shared memory typically has much lower latency than global memory. At the lowest level, processor cores, also referred to as streaming processors, of which NVIDIA Fermi devices have 32 per multiprocessor, can access large numbers of processor-local registers, which represent the fastest available memory on GPUs.

Having briefly outlined some of the key architectural features of representative CUDA-enabled hardware devices, the focus returns to CUDA. When a kernel is launched on a device, many threads (the number and arrangement of which are specified as part of the kernel's invocation) are created and distributed to hardware scheduling units corresponding to the streaming multiprocessors. The set of all threads created in response to a kernel launch is referred to as a grid. Grids are further subdivided into thread blocks, such that all the threads belonging to a given thread block are scheduled on the same multiprocessor. Groups of threads within blocks, referred to as warps, are scheduled in a single-instruction, multiple-thread manner across the cores belonging to a multiprocessor. Using data provided automatically by the CUDA framework, threads may access their position in the grid (which thread block they belong to, as well as their position within their thread block) and issue instructions accordingly. Note that for the threads of a warp to make efficient use of the hardware, they must abide by the hardware limitation already noted and execute the same instructions; however, they may operate on independent data elements.

The introduction of CUDA 4.0 between February and April 2011 gives a strong indication of the direction in which the community can expect graphics technology to be heading. NVIDIA [18] cites three main areas of improvement in CUDA 4.0: usability, multi-device programming capabilities, and developer tools. Each of these areas bears discussion in terms of its impact on general-purpose programming on graphics processing units.

Usability and programmability: For the purposes of this discussion, the most important innovations in this category include how host threads share graphics devices and no-copy pinning of system memory. With CUDA 4.0, multiple host threads can share a single graphics device, and a single host thread can use multiple graphics devices; this represents a significant paradigm shift in terms of structuring applications to leverage varying graphics resources. No-copy pinning of system memory improves further upon the benefits of pinned host memory.

Multi-device support: Unified Virtual Addressing (UVA) and GPUDirect 2.0 bring improvements in both programmability (UVA brings the host/device domain closer to a truly global address space) and performance (by reducing graphics devices' dependence on the host to perform communication and I/O). The importance of these advances to increased adoption of general-purpose programming on graphics devices should not be understated.

Developer tools: With the release of CUDA 4.0, NVIDIA added multi-device support to the cuda-gdb debugger.

In summary, we see a trend towards increased use of, and reliance on, GPUs in high-performance scientific computing applications.
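To make the programming model described in this section concrete, the following minimal CUDA C++ sketch (illustrative only; it is not taken from the thesis, and the kernel and variable names are invented for this example) shows a kernel definition, host-to-device memory copies, and a kernel launch with an explicit execution configuration. Each thread uses its block and thread indices, provided automatically by the CUDA framework, to select the data element it operates on.

// Minimal CUDA example: each thread computes one element of c = a + b.
// Error checking is omitted for brevity; all names are illustrative.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Compute this thread's position in the grid from its block index,
    // the block size, and its index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard threads beyond the array bounds
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (global) memory and copy the inputs from host memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch a grid of thread blocks; the execution configuration specifies
    // the number and arrangement of threads created for this kernel.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back to host memory and clean up.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Note that consecutive threads in this sketch touch consecutive words of global memory, which is also the access pattern favored by the hardware, as discussed later in this chapter.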
The most current version of CUDA, CUDA 4.1 [19], was released in late 2011, and offers improvements in application development features (notably, an LLVM-based compiler which results in faster code for many applications), library support (e.g., over a thousand new image processing functions were added to the NVIDIA Performance Primitives library), and developer tools (more features in cuda-gdb and cuda-memcheck for debugging). Although representing a modest incremental increase in the overall framework, these changes, as well as those made for the release of CUDA 4.0, are useful in evaluating the current and near-future needs of the community, as perceived by NVIDIA: more efficient and effective application development through more advanced language features and generally better tool and development environment support.

2.1.3 Advantages and Disadvantages

Several advantages, as well as important disadvantages, are associated with the use of graphics processing units for accelerating applications in high-performance scientific computing. Some of the most important advantages motivating their use have already been mentioned; briefly, they include the following: high peak floating-point operation rates; high peak energy and cost efficiency; and large peak memory bandwidth. Together, these advantages make graphics devices potentially very attractive for solving the problems of extreme-scale computing.

The chief disadvantages associated with the use of graphics devices in the aforementioned capacity are related to the difficulty of achieving near-peak performance for real-world applications. Broadly speaking, graphics devices suffer from three fundamental limiters of performance: the I/O bottleneck across the PCIe bus; the single-instruction, multiple-thread model of parallelism; and the sensitivity of various memory areas to access patterns. Each of these limiters deserves some discussion.

Applications which employ graphics accelerators to perform computations must take into account the cost of transferring data to and from the graphics device. Currently, graphics devices are connected to the host system via the PCIe bus, which provides relatively low-bandwidth, high-latency communication between the host and available devices. Typically, this problem can be overcome using a variety of techniques, including, but not necessarily limited to, the following: ensuring that computations to be performed on the device are such that the benefit from using the device outweighs the cost of communication; overlapping computation on the host with computation on the device (some recent interesting work in this direction can be found in [5]); and overlapping computation on the device with communication to and/or from the device (a feature introduced in recent versions of CUDA). Other efforts seek to eliminate the potential for this bottleneck entirely by integrating traditional processors and graphics processors; this is the basis for the notion of the accelerated processing unit [9].

The single-instruction, multiple-thread model of parallelism required by graphics devices presents other challenges to achieving excellent performance for on-device computations. As described earlier in this section, threads belonging to a block which has been scheduled to a streaming multiprocessor are grouped into warps, which are then scheduled to the multiprocessor's array of streaming processors. These streaming processors share a common instruction cache and operate in lockstep.
Therefore, to process an instruction from each thread in a warp in a single cycle, all threads must be executing the same instruction (i.e., have a common instruction pointer). A performance problem arises when threads within a warp exhibit divergent behavior; that is, when threads within a warp take different control-flow paths through the kernel function, the hardware is incapable of processing the threads in lockstep. CUDA guarantees correctness of kernels exhibiting divergence by effectively splitting divergent warps into multiple warps containing dummy threads, which essentially represent no-op instructions (the actual mechanism underlying the treatment of this situation is more complex, but unimportant for the purposes of this discussion). Control-flow divergence can easily result in performance degradation by a factor of two (for if/else-type divergence) and could result in much more if nested branching is present. Fortunately, the hardware is able to rejoin split warps eventually (at a post-dominator, usually the immediate post-dominator) in order to avoid excessive loss of performance. Warp divergence can be addressed at the application level by writing code which does not exhibit much divergence; at the hardware architecture level, warp divergence can be mitigated by techniques which exploit the existence of many warps processing the same instruction stream [11] [10]. At the compiler level, kernels can be statically analyzed and instructions reordered to reduce divergence [7]. Also, interesting research has investigated the benefits to be had by exploiting inter-warp divergence (as opposed to intra-warp divergence, to which the above discussion applies), referred to as warp specialization [2].
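The two kernels below (an illustrative sketch, not code from the thesis) contrast the behavior just described. In the first, even- and odd-numbered threads within the same warp take different control-flow paths, so the hardware must execute the two paths one after the other; in the second, the branch condition is uniform across each 32-thread warp, so no intra-warp divergence occurs even though different warps take different paths.

// Illustrative CUDA kernels contrasting divergent and non-divergent branching.

// Divergent: threads within the same warp take different paths, so the
// hardware serializes execution of the two paths.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 2.0f * i;      // taken by half of each warp
    else
        out[i] = 0.5f * i;      // taken by the other half of each warp
}

// Non-divergent: the condition is constant within each warp (warpSize is 32
// on the devices discussed here), so every warp follows a single path.
__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;
    if (warpId % 2 == 0)
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
}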
Several memory areas are available inside graphics processors, each of which offers unique opportunities and challenges. At the lowest level in the memory hierarchy, each streaming multiprocessor contains a register file providing thread-local, low-latency access to data. However, the number of registers is limited, and so in practice not all data can be stored there. At the next level, streaming multiprocessor-local shared memory serves as an explicitly managed cache, and provides low-latency accesses when accessed according to an appropriate pattern. Words in shared memory are stored in a number of banks, each of which can be accessed simultaneously and with low latency. However, when multiple requests for words in shared memory correspond to the same shared memory bank, referred to as a shared memory bank conflict, the requests are serialized. The number of simultaneous requests for distinct words residing in the same bank is referred to as the degree of the shared memory bank conflict; the performance of the shared memory system is inversely proportional to the maximum degree of any bank conflict occurring during a round of simultaneous access requests (note, however, that in the case of all requests accessing the same word of shared memory, no performance penalty is incurred). Furthermore, the amount of shared memory available to threads scheduled onto a given multiprocessor is limited, necessitating other options.

At the next level in the graphics memory hierarchy, there exist global memory, constant memory, and texture memory areas (note that each of these is realized by the same physical hardware, and differs from the others only in how it is treated by the architecture). Constant memory provides read-only access (on the device; it is written by the host) to all threads and all blocks. It is cached on each streaming multiprocessor, and on cache hits provides low-latency access when all simultaneous requests are for the same word of constant memory; other access patterns result in degrees of serialization similar to shared memory. Texture memory, like constant memory, provides read-only access to all threads and is cached on streaming multiprocessors. However, texture fetches can provide low-latency access for rounds of simultaneous requests exhibiting spatial locality; in other words, for rounds of requests accessing words of texture memory which are "near" each other according to the texture geometry. Global memory provides both read and write access to all threads and all blocks, and (along with constant and texture memories) represents the memory area through which host applications may communicate data to and from the device. Global memory is typically not cached (for more details on the capabilities of modern graphics devices, see [17] [16]) and provides high-latency access. CUDA developers address the long latencies of global memory accesses in several important ways. First and foremost, by ensuring that memory accesses are coalesced (the requirements for coalescing vary with the graphics hardware, but coalescing can be guaranteed by having threads access linearly contiguous and aligned chunks of global memory; see [16] for more information), high levels of memory throughput can be achieved. In order to avoid performance degradation from the inevitably long latencies, it is important to have both a large number of warps ready to execute instructions and a large enough number of instructions that do not access global memory. Together, these ensure that processor cores have enough work to do to effectively mask memory access latency.
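The following sketch (illustrative only, not from the thesis) contrasts a coalesced access pattern, in which consecutive threads of a warp read linearly contiguous and aligned words of global memory, with a strided pattern that scatters each warp's accesses across many memory segments and therefore forces the hardware to issue many more memory transactions.

// Illustrative CUDA kernels contrasting coalesced and strided global memory
// access patterns.

// Coalesced: thread k of a warp reads word k of a contiguous, aligned chunk,
// so a warp's accesses can be combined into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch words that are 'stride' elements apart,
// so each warp's accesses fall in many distinct memory segments and the
// achieved bandwidth drops accordingly.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // scattered index defeats coalescing
}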
2.2 Non-Volatile Memory Technologies

This section provides a brief introduction to the fast non-volatile memory technologies which are candidates for integration with the graphics processor main memory. These memory technologies vary in terms of power, performance, and access characteristics, and as such may be more or less suitable for use in a variety of contexts. By covering the characteristics of the various alternatives, the decision made for this project can be given strong support.

2.2.1 Varieties of Fast Non-Volatile Memory Technology

Modern fast non-volatile memory technologies, which have received some attention in the recent research literature, are different from both other non-volatile memory technologies, such as flash, and volatile memory technologies, such as the dynamic and static varieties of random access memory. They differ from the former in that they generally offer faster, lower-density storage; compared to the latter, they can provide a variety of benefits to real systems, not least of which is persistent storage of critical data. These storage technologies have been, and continue to be, enabled by innovative hardware designs. This section introduces the reader to three of these technologies: spin-transfer torque random access memory, a variety of magnetoresistive memory; random access memory based on memristors, a recently developed circuit element complementary to the resistor, capacitor, and inductor; and phase-change random access memory.

The memristor is a recently introduced basic circuit element, like the resistor, capacitor, and inductor, capable of maintaining internal state, making it suitable for storing data, among other things [6]. Each of the four basic circuit elements can be defined in terms of relationships among fundamental circuit variables, which include current, voltage, charge, and flux. The resistor relates current and voltage, according to Ohm's law; the capacitor relates charge and voltage; and the inductor relates flux and current. Note also that charge is related to current, and flux to voltage, by definition. The memristor relates flux and charge; together with the other relationships, it completes the set of relationships between the fundamental circuit variables. By relating charge and flux, memristors are able to act as variable resistance elements whose resistance depends on the currents to which they have been subjected; that is, they maintain measurable state based on applied signals.

Spin-transfer torque memory is a kind of magnetoresistive memory and differs from conventional dynamic memory technologies in that it stores information using magnetic tunnel junctions rather than electrical charges; each junction consists of three layers: one reference ferromagnetic layer, one free ferromagnetic layer, and a tunnel barrier layer [28]. While the magnetic direction of the reference layer remains fixed, the magnetic direction of the free layer can be changed; if the directions of the free and reference layers are aligned, resistance is low, and if the directions are opposite, resistance is high, and these two resistance levels are used to represent the logically set and reset states.

Phase-change memory stores information by exploiting a unique property of a certain kind of alloy (chalcogenide alloys) that has two stable states, crystalline and amorphous [28]. The low-resistance crystalline state can be achieved by heating such an alloy to a temperature between the crystallization and melting temperatures, and a bit whose alloy is in this state is interpreted as logically set, or equal to one. The high-resistance amorphous state can be achieved by heating the alloy to a temperature above the melting temperature, after which the temperature is quickly reduced; a bit whose alloy is in this state is interpreted as logically reset, or equal to zero. Because the difference in electrical resistance between these two states is so large, it is possible to use intermediate states to encode several bits of information instead of one; devices exploiting this are referred to as multi-level cell devices, while devices storing a single bit per cell are referred to as single-level cell devices.

2.2.2 Advantages and Disadvantages

All non-volatile memory technologies share a common property which makes them desirable compared to volatile memory technologies such as static and dynamic random access memory: the ability to store data in a persistent fashion without needing to be periodically refreshed. Devices with this property have at least two advantages: first, they can be used to protect data from interruptions in the supply of power, to avoid the loss of important data; second, they can use less power than their volatile counterparts for storing data which are accessed according to appropriate patterns (the appropriateness of access patterns for such technologies is discussed later, and represents a key consideration in much of the ongoing work), since there is no need to refresh stored data. It is the second of these advantages which the greater project, of which the work described in this thesis represents only one (albeit significant) aspect, seeks to exploit.
It is worth noting, however, that the first advantage mentioned above, the suitability of non-volatile memory technologies for use as a resilient or fast persistent data store, may also prove useful in achieving present and future goals of high-performance computing; indeed, such an advantage might even be exploited by graphics devices, although this possibility is not explored in this work.

Where non-volatile memories pay the price for these advantages, however, is in performance (in terms of both access latency and dynamic power), or, more accurately, in price-performance (that is, performance compared to cost, which can be discussed in terms of monetary cost or memory density). In other words, non-volatile memory technologies typically suffer from lower levels of performance compared to volatile counterparts of similar cost. Figure 2.1, courtesy of Motoyuki Ooishi, compares the write latency and memory capacity of several memory technologies, both volatile and non-volatile. Memory technologies with higher write latencies and memory capacities are typically better suited to applications requiring large amounts of persistent storage (e.g., disk), whereas technologies with lower write latencies and memory capacities are better suited to applications requiring small amounts of fast storage (e.g., caches).

Traditional non-volatile memory devices such as NOR and NAND flash require that cells be erased before being written, which significantly increases the latencies associated with writing to these devices; these latencies, along with large memory capacities, make them good candidates for applications requiring persistent storage. Of the memory technologies which are more suitable for working memory, magnetic random access memory technologies offer a memory capacity comparable to that of static random access memory; this implies that they are, perhaps, better suited to situations where static random access memory proves useful (e.g., in caches; indeed, a recent research effort in this direction, which is closely aligned with the goals of the project motivated here, is touched upon later). Phase-change random access memory has a memory capacity, hence a cost, more comparable to that of dynamic random access memory; this implies that it might be used in much the same way as dynamic random access memory, if the order-of-magnitude write latency barrier can be overcome.

Figure 2.1: Price-performance comparison of various memory technologies.

Table 2.1 provides an incomplete and, in parts, partially outdated summary of memory technology characteristics, and is based, in part, on data compiled by Perez and De Rose [22].

                     PCRAM        RRAM       MRAM
  Cost (cell area)   8-16 F^2     > 5 F^2    37 F^2
  Read latency       48 ns        < 10 ns    < 10 ns
  Write latency      40-150 ns    10 ns      12.5 ns
  Energy             100 pJ       2 pJ       0.02 pJ
  Endurance          10^8         10^5       > 10^15

Table 2.1: Characteristics of non-volatile memory technologies

2.3 The General-Purpose Graphics Processor and Non-Volatile Memory

In this chapter, two technologies with significant potential to be of great benefit to efforts in the high-performance computing community, the use of graphics processing units to accelerate general-purpose computations of scientific interest and the use of fast non-volatile memory technologies to improve aspects of traditional memory hierarchies, have been briefly introduced. In this section, the choice of phase-change memory as the non-volatile technology, and of the graphics global memory as the storage target, is motivated, and other directions which might have been selected are mentioned.
In particular, the available benefits, as well as possible drawbacks, of the project are defined.

As has already been discussed, non-volatile memory technologies differ in terms of both access latency and storage density, as well as in many other respects. Depending on such characteristics, a particular technology may be more or less well suited to various applications (e.g., cache, main memory, permanent storage). There are at least two graphics memory areas, the multiprocessor-local shared memory and the global memory (including the global storage areas for constant and texture memory), where non-volatile memory might be fruitfully introduced, and these areas differ fundamentally in terms of their performance properties. These properties must inform the selection of the non-volatile memory technology.

As an explicitly managed cache, shared memory offers potentially low-latency access to a relatively small amount of on-chip storage. As such, non-volatile memory technologies such as magnetoresistive (e.g., spin-transfer torque) memories and ferroelectric memories, which offer relatively lower latencies and storage densities, might make for appropriate replacements for the static random access memory currently employed for that purpose. Indeed, that possibility [24] has already received investigation, and is reviewed later.

As a sort of graphics device main memory area, global memory uses as its underlying technology GDDR dynamic random access memory, which is similar in most respects to DDR dynamic random access memory, and offers a relatively large amount of high-latency storage. As such, non-volatile memory technologies such as phase-change random access memories, which offer somewhat longer latencies and storage densities similar to those of dynamic random access memory, might make for appropriate replacements of the dynamic random access memory currently employed for that purpose. It is this possibility that provides the basis for the collaborative project in support of which the primary contributions described in this thesis are made.

Thus far, only the possibility of using phase-change memory in place of dynamic random-access memory in the graphics global memory area has been advanced as a motivation for its use; however, advantages and, potentially, disadvantages are associated with its use, and understanding these is key to fully understanding the design decisions directly affecting the work which is the primary focus of this thesis. The benefits available from using phase-change memory in the proposed way include lower static power and the opportunity to exploit non-volatility for permanent or semi-permanent storage of data. The former is due to the lack of any need to refresh memory contents, which is itself a consequence of the latter. At the present time, power savings alone (rather than the usefulness, per se, of having non-volatile storage within graphics devices) has been the primary motivator underlying development of a hybrid simulator; an example of exploiting non-volatility for graphics devices [14] is reviewed in a later chapter.

Although memristor-based memories are being actively developed and researched, at the present time the available technologies are too immature, especially in terms of manufacturing processes and chip yield, to warrant further consideration here; their relatively lower write endurance, compared to the alternatives, is another factor weighing against their use in the intended capacity.
However, storage devices based on memristors should be expected to play an important and exciting role in the future of memory technologies.

A few downsides to using phase-change memory, which make using this technology within the graphics global memory area an interesting research challenge, are worth mentioning. First, while phase-change memory has a read latency and storage density similar to those of dynamic random-access memory, its write latency is several times higher. Similarly, while static power consumption is much lower for phase-change memory, the power consumed during writes is typically much higher (although there appear to be design trade-offs involved in producing phase-change memory devices such that lower write energies could be had by decreasing the device's non-volatility [29], this possibility is not given further consideration here). Overcoming the higher active power consumption and access latencies for writes involves developing effective data placement and migration strategies; additionally, the increased write latencies can be addressed by the usual mechanism inside graphics devices: high bandwidth from massive memory parallelism and regular access patterns (i.e., coalesced global memory accesses), along with sufficient amounts of computation to mask latency. A hybrid global memory system, using both dynamic and phase-change random access memories, could use phase-change memory for mostly read-friendly memory objects, avoiding these disadvantages.

Chapter 3
Phase-Change Random Access Memory in a Hybrid Graphics Global Memory

This chapter contains a complete report of the goals, work, and results of the project which is the main subject of this work. First, the project motivation, goals and scope are restated in a clear and concise fashion. Then, the simulators GPGPU-Sim and DRAMSim2 are described and analyzed. Finally, the integration plan is presented in detail, and the results of the initial integration are described and evaluated.

Section 3.1 attempts to more clearly define this project in terms of both planned work and desired outcomes, and to situate this work in terms of other ongoing work which is being carried out as part of the larger collaboration (of which this project represents only one aspect). Section 3.2 provides background on the two simulation frameworks used in carrying out this work, GPGPU-Sim and DRAMSim2, and exposes those features of each simulator which are most relevant in the context of this project. Section 3.3 reports on the subsequent analysis, design and initial implementation of a hybrid simulator supporting further research goals, and the efforts spent in order to assess and improve the quality and accuracy of the result.

3.1 Proposal, Plan and Goals

This project seeks to design an architectural simulator for hybrid global memory inside graphics processing units. Such a framework is needed to carry out research involving the use of phase-change random-access memory to reduce the power consumption of the graphics global memory system. This should be possible thanks to non-volatile memory's not requiring static power to retain data. However, at least two limitations of phase-change random access memory must be overcome in order for it to be viable as a part of the graphics global memory system: a large energy requirement for writes, as well as high latencies for writes [27]. Overcoming these limitations, however, is not the focus of the work described in this thesis.
Given the proprietary nature of NVIDIA graphics hardware, the relative immaturity of phase-change memory technology, and the difficulty inherent in constructing an actual hybrid device and associated memory controller logic, direct experimentation is deemed inappropriate. On the other hand, detailed, first-principles analysis could prove unwieldy for so complex a system as the one being proposed. Simulation, however, is appealing, given the existence of simulation frameworks for both graphics devices and phase-change memory devices. Because the architectural changes investigated in this work are new, no simulation framework currently exists, insofar as this author is aware, that is suitable for simulating a general-purpose graphics device with a customized, hybrid global memory system. The initial efforts aimed at developing such a capability are the main contribution of this work and are described in the following sections.

3.2 Simulation Frameworks

In this section, the two simulation frameworks on which the described hybrid simulator is based are introduced. Given that much of the work involved in carrying out this initial integration relies on some (indeed, in some aspects, significant) understanding of the purpose and functioning of these constituents, the material presented here is of fundamental importance to the following sections.

3.2.1 Simulating Graphics Devices with GPGPU-Sim

GPGPU-Sim [3] is a cycle-accurate simulation framework for graphics devices and the general-purpose compute kernels which they execute. At a high level, GPGPU-Sim is organized into three modules (cuda-sim, gpgpu-sim and intersim) which perform distinct roles necessary to the simulation of real CUDA programs on NVIDIA hardware. In GPGPU-Sim, cuda-sim is responsible for the functional simulation of CUDA programs; that is, its purpose is to ensure that the execution of real CUDA programs behaves in a manner logically consistent with how they should behave on correctly functioning NVIDIA hardware. Despite its being a vital component of the GPGPU-Sim architecture, it has not been necessary, for the purposes of this work, to make alterations to this module; indeed, it is not within the scope of this project (nor, perhaps, should it ever be) to change the logical behavior of CUDA programs executing on devices with hybrid memory (i.e., to cause results to differ from those obtained using a device without a hybrid global memory system). As such, this module does not receive a great deal of attention in this work.

The gpgpu-sim and intersim modules are responsible for the timing simulation of CUDA programs in GPGPU-Sim; in other words, they use models of NVIDIA hardware to estimate device performance for the simulated kernel. As illustrated in Figure 3.1 [1], gpgpu-sim models the performance of the processor chips and global memory, while intersim models the complex interconnection network connecting processors and off-chip global memory. Since the motivation for performing this study is to evaluate tradeoffs in power and performance, these modules are of great interest. In particular, it is useful for the purposes of the discussion in section 3.3 to describe in some detail the memory request lifecycle modeled by GPGPU-Sim. At a high level, shader cores interpret CUDA instructions, some of which require global memory accesses to be performed. Shader cores generate memory fetch requests accordingly, forwarding these to the interconnection network.
Figure 3.1: GPGPU-Sim architecture model.

From there, memory fetch requests are routed to the appropriate memory partition unit, which consists of a memory controller, a dedicated L2 cache for texture fetches, and a memory device. If a request is a texture fetch, then the L2 cache is checked before the request is scheduled to the memory device. Memory fetches which must access the device incur some additional latency, which is modeled by GPGPU-Sim. When a request has been satisfied, notification is pushed back up the memory system, first from the memory partition unit to the interconnection network, and ultimately to the shader cores.

3.2.2 Simulating Phase-Change Memory with DRAMSim2

DRAMSim2 [23] is a cycle-accurate memory simulator, developed at the University of Maryland, for modeling DDR2/3 systems. DRAMSim2 can be used either as a standalone simulator for trace-based modeling of memory systems or, through a library interface, as the memory model in other simulation frameworks. It is in this latter capacity that DRAMSim2 is used in this work. Use of the DRAMSim2 library interface consists of instantiating an instance of the MemorySystem class and subsequently adding memory transactions and cycling the simulator. Cycling the simulator causes pending transactions to advance through the DRAMSim2 simulation model to completion. Upon completion of a transaction, DRAMSim2 invokes a callback routine, provided upon initialization of the MemorySystem object, alerting the host application to the completion of the request.

DRAMSim2 was incorporated into GPGPU-Sim because the GPGPU-Sim memory model is insufficient for correctly modeling either the power of the memory system or the performance characteristics of phase-change memory technology. The choice of DRAMSim2 was motivated by collaborators' experience with this memory simulation framework, and by the ability of DRAMSim2 to model the power and performance characteristics of phase-change (and potentially other non-volatile) memories.

3.3 Integration, Verification and Validation

This section describes the design and initial implementation of a hybrid simulator enabling research into the use of novel memory technologies inside graphics processing units, as well as some of the early efforts to ensure and assess quality.

3.3.1 Integration of Simulation Frameworks

Figure 3.2 shows the general strategy for integration at a high level of abstraction (more specifically, at the level of the gpgpu-sim performance simulator). The idea is a very simple one: to use the DRAMSim2 memory simulator, via its library interface, instead of the memory simulation included in the gpgpu-sim performance simulator. Issues arising at this level of abstraction include the nature of the replacement (i.e., whether DRAMSim2 should be used exclusively or only for phase-change memory) and the general compatibility between GPGPU-Sim and DRAMSim2, among others.

Figure 3.2: High-level integration plan

Figure 3.3 shows the integration plan at the level of the memory request flow inside GPGPU-Sim. Here, DRAMSim2's library interface is observed to provide functionality similar to that of the component responsible for modeling a memory controller and controlled device within the gpgpu-sim performance simulation model.

Figure 3.3: Intermediate integration plan
Figure 3.4: Low-level integration plan
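To make the preceding description concrete, the following is a minimal sketch of the DRAMSim2 library usage pattern summarized in section 3.2.2. It is written against the public DRAMSim2 headers, but it is illustrative only: the file names, capacity and addresses are placeholders, the multi-channel wrapper is shown rather than the MemorySystem class mentioned above, exact signatures may vary between DRAMSim2 versions, and none of this code is taken from GPUHM-Sim.

// Sketch of driving DRAMSim2 through its library interface (illustrative only).
#include "DRAMSim.h"
#include <stdint.h>
#include <stdio.h>

// Host-side object that receives completion notifications from DRAMSim2.
class host_model {
public:
    void read_complete(unsigned id, uint64_t addr, uint64_t cycle) {
        printf("read  0x%llx done at cycle %llu\n",
               (unsigned long long) addr, (unsigned long long) cycle);
    }
    void write_complete(unsigned id, uint64_t addr, uint64_t cycle) {
        printf("write 0x%llx done at cycle %llu\n",
               (unsigned long long) addr, (unsigned long long) cycle);
    }
};

int main() {
    host_model host;

    // Instantiate a memory system from device and system .ini files
    // (paths and the 2048 MB capacity are placeholders).
    DRAMSim::MultiChannelMemorySystem *mem = DRAMSim::getMemorySystemInstance(
        "device.ini", "system.ini", ".", "example", 2048);

    // Provide the completion callbacks at initialization time.
    mem->RegisterCallbacks(
        new DRAMSim::Callback<host_model, void, unsigned, uint64_t, uint64_t>(
            &host, &host_model::read_complete),
        new DRAMSim::Callback<host_model, void, unsigned, uint64_t, uint64_t>(
            &host, &host_model::write_complete),
        NULL /* optional power-reporting callback */);

    // Add transactions (first argument: is this a write?) and cycle the
    // simulator; pending transactions advance one memory clock per update().
    mem->addTransaction(false, 0x100000);   // read
    mem->addTransaction(true,  0x200000);   // write
    for (int cycle = 0; cycle < 1000; ++cycle)
        mem->update();
    return 0;
}

In the integrated simulator, this usage pattern is hidden behind an adapter class, described next, so that the remainder of the gpgpu-sim memory pipeline need not be aware of DRAMSim2.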
Given this similarity, the strategy is to use multiple instances of DRAMSim2's library interface (one per memory controller and associated device), while leaving the other stages of memory request processing unaffected.

Figure 3.4 shows the integration plan at the level of the memory controller and device. As mentioned previously, the DRAMSim2 library interface is quite similar to that of the component in gpgpu-sim which is responsible for modeling a memory controller and controlled device; within the original gpgpu-sim module, this component is implemented as a class named dram_t. In order to give gpgpu-sim a means of using DRAMSim2 to model the memory controller and associated device, the interface common to both DRAMSim2's library and the dram_t class is used to define an abstract parent class, named ram_t, which replaces dram_t in the integrated simulation framework. To provide access to DRAMSim2, a new child class, named dramsim2_t, is implemented so as to forward memory requests coming from the interconnect or L2 cache to DRAMSim2, and to forward requests completed by DRAMSim2 back to the interconnect or L2 cache. In other words, the dramsim2_t class is responsible for initializing an instance of the DRAMSim2 library interface and acting as an intermediary between the gpgpu-sim memory model and this instance of the DRAMSim2 library.

3.3.2 Verification Methodology

In order to perform a baseline evaluation of the correctness of the initial implementation of the design described in the previous subsection, several experiments were performed and their results compared. First, empirical data were collected using the Scalable HeterOgeneous Computing (SHOC) benchmark suite [8] on locally available graphics hardware; these data provide some context with which to evaluate later results. Then, the results of a proof-of-concept investigation performed using the original version of GPGPU-Sim with varying underlying memory latencies are used to illustrate the relationship between instructions per cycle, arithmetic intensity, and memory latency. Finally, a comparison is performed between the results output by the original GPGPU-Sim and those output by the hybrid GPGPU-Sim/DRAMSim2 framework.

The SHOC benchmark suite, developed at Oak Ridge National Laboratory [20], comprises a variety of programs with differing computational properties which exercise various areas of graphics hardware performance. Benchmarks in the suite are characterized according to the degree to which they are representative of synthetic microbenchmarks on the one hand (corresponding to SHOC level 0 benchmarks) or production applications on the other (corresponding to SHOC level 2 benchmarks). Table 3.1 describes the benchmarks which constitute version 1.1.1 of the SHOC suite.

Benchmark | Level | Description
BusSpeedDownload | 0 | Measures the bandwidth of the PCIe bus for transfers from the host to the device.
BusSpeedReadback | 0 | Measures the bandwidth of the PCIe bus for transfers from the device to the host.
DeviceMemory | 0 | Measures read and write bandwidths of device global, texture and shared memories.
MaxFlops | 0 | Measures the maximum floating-point computation rate of the device, for both single- and double-precision computations.
FFT | 1 | Measures performance of a 1D Fast Fourier Transform kernel.
MD | 1 | Measures performance of a pairwise n-body kernel from a Lennard-Jones potential calculation application.
Reduction | 1 | Measures performance for a large global sum reduction kernel.
Scan | 1 | Measures performance of a parallel prefix sum kernel.
SGEMM | 1 | Measures performance for device versions of the SGEMM BLAS routine on square matrices.
Sort | 1 | Measures performance of a radix sort kernel over unsigned integer key-value pairs.
Spmv | 1 | Measures performance of a kernel which computes products of sparse matrices with dense vectors.
Stencil2D | 1 | Measures performance for a 2D, 9-point stencil kernel.
Triad | 1 | Measures bandwidth for a large vector dot product kernel.
S3D | 2 | Measures bandwidth for S3D's getrates kernel.
Table 3.1: SHOC v.1.1.1 Benchmark Descriptions

Furthermore, benchmarks can be executed in one of three configurations (Serial, Embarrassingly Parallel and True Parallel) which differ in terms of how many graphics processors and compute nodes are evaluated. All results in this section are for the Serial configuration, and as such provide information most relevant at the level of a single graphics device within a single compute node.

The SHOC benchmark suite has been used to evaluate the performance of several graphics devices. Table 3.2 presents some of the relevant information related to the configuration of the systems on which experiments were conducted.

GPU | Node | OS | CPU | RAM
GTX 280 | hawc1 | 64-bit CentOS 6.0 | Intel Q8200 | 8GB DDR2
GTX 470 | hawc1 | 64-bit CentOS 6.0 | Intel Q8200 | 8GB DDR2
GTX 480 | hawc4 | 64-bit CentOS 6.0 | Intel i7-950 | 8GB DDR3
Tesla M2050 | eagles-0-0 | 64-bit CentOS 5.5 | Intel Xeon 5650 | 24GB DDR3
Tesla C2070 | hawc4 | 64-bit CentOS 6.0 | Intel i7-950 | 8GB DDR3
Table 3.2: System configuration overview for SHOC benchmark experiments.

The Heterogeneous Auburn Working Cluster (hawc) was constructed in Fall 2011, with generous hardware donations by NVIDIA and with Laboratory funds and equipment, to support research and teaching objectives related to Auburn's designation as an NVIDIA CUDA teaching center. It consists of four compute nodes (hawc1, hawc2, hawc3, and hawc4) publicly accessible via the head node at hawc.cse.eng.auburn.edu. A 64-bit version of CentOS 6 has been installed on each node, as have basic parallel development packages and tools to facilitate parallel programming using OpenMP, MPI and NVIDIA's CUDA version 4.0. Each node is equipped with a multi-core processor (an Intel Q8200 on hawc1 and Intel i7-950s on each of nodes hawc2, hawc3 and hawc4), two graphics devices (currently, two GTX 470s on hawc1, two GTX 480s on each of hawc2 and hawc3, and a GTX 480 and a Tesla C2070 on hawc4), and eight gigabytes of main memory. These nodes are connected by gigabit ethernet through hawc.cse.eng.auburn.edu.

In order to gain some rough understanding (in addition to that which comes from knowledge and analysis of graphics architecture) of the relationship between instructions per cycle, memory latency and arithmetic intensity, an experiment was performed using GPGPU-Sim to simulate a CUDA kernel of tunable computational intensity. The kernel designed for this evaluation allows arithmetic intensity to be varied by computing, with 2^20 threads linearly arranged in 2^10 thread blocks (hence, with 2^10 threads per block), an iterative map over 2^20 single-precision floating-point numbers for iteration counts of 1, 2, 4, 8, 16, 32 and 64. The number of global memory accesses per thread is fixed at two, and accesses are performed in a completely coalesced manner.
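The following is a minimal sketch, in CUDA C++, of a kernel consistent with the description above. It is illustrative only: the thesis does not specify the exact floating-point map, constants or kernel name used in the experiment, so the body shown here was chosen merely to exhibit the stated structure (one coalesced global read, a tunable number of map iterations, and one coalesced global write per thread).

// Illustrative tunable-intensity kernel (sketch; not the exact kernel used).
__global__ void iterative_map(const float *in, float *out, int iterations)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // linear thread index
    float x = in[tid];                                // global access 1: coalesced read
    for (int i = 0; i < iterations; ++i)
        x = x * 1.0009765625f + 0.5f;                 // simple iterative map
    out[tid] = x;                                     // global access 2: coalesced write
}

// Launched with 2^10 blocks of 2^10 threads (2^20 threads in total), sweeping
// the iteration count over 1, 2, 4, 8, 16, 32 and 64, for example:
//   iterative_map<<<1024, 1024>>>(d_in, d_out, 16);

Varying the iteration count while holding the two global accesses per thread fixed is what produces the range of computational intensities estimated next.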
Computational intensities corresponding to each of these iteration counts are estimated, based on a simple static analysis, to be proportional to 6, 9, 15, 27, 51, 99 and 195, where computational intensity is defined as the number of arithmetic operations divided by the number of global memory reads and writes (fixed at two for this kernel).

To simulate varying memory latencies using GPGPU-Sim, initialization parameters for the default GPGPU-Sim memory system were changed to scale the underlying memory latency parameters (specifically, tCCD, tRRD, tRCD, tRAS, tRP, tRC, CL, WL, tCDLR, and tWR) according to the desired increase in global memory latency; in other words, to increase the end-to-end memory latency by a factor of n, each of these parameters was scaled by n. This simplification was motivated by the assumed approximate linearity of the end-to-end latency of memory systems as a function of the underlying device latencies. Experiments used v.3.0.1 of the GPGPU-Sim tool, and aside from the indicated changes, the initialization used was identical to the initialization files provided as part of the GPGPU-Sim distribution for the Quadro FX 5800 graphics processor.

Finally, simulation experiments were carried out using the SHOC benchmark suite to evaluate the performance of the hybrid simulator compared to the original simulator. In particular, results for two benchmarks, the MaxFlops and DeviceMemory level-zero benchmarks, are reported. These benchmarks were selected because they have clear behavior and performance characteristics, and differences in performance variation can be used to extract meaningful information. Specifically, variations in the DeviceMemory benchmarks should indicate sensitivity to memory system characteristics, while variations in the MaxFlops benchmarks should be indicative of potential impacts on the performance of compute-bound kernels. Reported results are for the implementations found in the Serial suite of CUDA benchmarks belonging to SHOC v.1.1.1. The initialization configuration files for GPGPU-Sim were those for the Fermi configuration, provided as part of the GPGPU-Sim v3.0.2 standard distribution.

Figure 3.5: Global memory read bandwidth from SHOC for several devices

3.3.3 Verification Results

Figures 3.5, 3.6, 3.7 and 3.8 show the global memory read and write bandwidths and the single- and double-precision peak processing rates, respectively, for several graphics processors belonging to the systems previously discussed. These figures provide a baseline against which simulation results are assessed.

At least two interesting trends can be observed from the global memory peak bandwidth results: first, the relatively older graphics processors (the GTX 280 and S1070) have comparatively higher global memory read bandwidths than global memory write bandwidths, whereas the opposite is true for the relatively newer devices (the GTX 470, GTX 480, M2050 and C2070); second, the devices designed exclusively for general-purpose computation (the M2050 and C2070) compare less favorably in terms of both peak read and peak write bandwidth to CUDA-enabled graphics devices designed to accommodate graphics applications' performance demands.
Figure 3.6: Global memory write bandwidth from SHOC for several devices
Figure 3.7: Peak single-precision performance from SHOC for several devices
Figure 3.8: Peak double-precision performance from SHOC for several devices

Moreover, it is useful to identify at least two trends in the results for peak single- and double-precision floating-point processing rates: first, single-precision processing is faster by a factor of at least two for all tested devices; second, the factor by which a device's peak single-precision processing rate exceeds its peak double-precision processing rate depends on whether the device was designed specifically for general-purpose (scientific) computing (the M2050 and the C2070), in which case the factor is approximately two, or not (the GTX 280, S1070, GTX 470 and GTX 480), in which case the factor is approximately ten. The first observation follows from the fact that, at the hardware level, any device capable of performing a double-precision computation in a certain time should be capable of performing two single-precision computations in that time (since double-precision arithmetic units are typically composed of at least two single-precision units). The second observation underscores one of the major design differences between older-generation devices from NVIDIA and the new Fermi devices: devices tailor-made for general-purpose (scientific) computing often rely on higher-precision computations than are typically required for traditional graphics applications, and as such, hardware support for fast double-precision arithmetic is a valuable feature.

Figure 3.9: Instructions per cycle achieved versus computational intensity

Figure 3.9 shows the results from the proof-of-concept simulation work described previously. Performance in terms of instructions per cycle is evaluated for three different scenarios: the original GPGPU-Sim characteristic device latencies, denoted by the trend labeled 1x; a configuration corresponding to device access latencies increased by approximately half an order of magnitude (a factor of 3), denoted by the trend labeled 3x; and a configuration corresponding to device access latencies increased by approximately one order of magnitude (a factor of 10), denoted by the trend labeled 10x. The required changes were made to the gpgpusim.config initialization file; see [1] for details. At low computational intensities, the impact of uniformly longer access latencies is significant; increasing latencies by a factor of ten causes a factor of five decrease in the achieved instructions per cycle.
As computational intensity increases, processing efficiency increases for two very closely related reasons: first, a greater fraction of the time spent on the kernel is spent on computation, so that the effects of a constant number of memory accesses are diminished by comparison; second, the effect of increasing computational load on the total time required depends on the amount of communication (i.e., memory access latency) which can be hidden: at lower computational intensities, more latency is available to be hidden, and additional computation will overlap more effectively. For relatively large intensities (still modest compared to some real-world application kernels, such as those present in the SHOC suite), memory access latencies become much less of a factor, as the performance in each of the three cases asymptotically tends toward the device's maximum processing rate (the maximum processing rate of the Quadro FX 5800, the configuration file for which was used to generate these trends, is 960 instructions per cycle).

These results have at least two implications for future research work aimed at performance tuning of hybrid memory system designs: first, that performance may be expected to degrade, for realistic graphics devices and computational kernels, by no more than about five percent in the worst case, in which phase-change memory completely replaces dynamic random-access memory in the hybrid design (this represents a worst case since these experiments increase both read and write access latencies, whereas only the write latency, and not the read latency, of phase-change memory is high compared to dynamic random-access memory); second, that for application kernels with modest to intense computational requirements, the worst-case performance penalty of using phase-change memory is reduced and, asymptotically, becomes insignificant. This second point is especially promising, since it provides a clear point of reference to which candidate application kernels can be compared, and could prove useful both in the development of the hybrid memory controller and in determining the applicability of the proposed architecture to real-world applications.

Figure 3.10: Instructions per cycle from hybrid simulator versus computational intensity

Figure 3.10 shows the results of executing the variable-intensity kernel used above on the original and hybrid graphics simulators. Note that the hybrid simulator utilizes a version of DRAMSim2 configured to simulate dynamic (not phase-change) random-access memory (the initialization is the default one and is comparable to GPGPU-Sim's default configuration for global memory devices); this has been done in order to allow a meaningful comparison between the output of the hybrid and original simulators. As can be seen in the figure, the performance of the hybrid simulator closely matches (to within a maximum error of three percent, at the lowest computational intensity) the output of the original version of GPGPU-Sim. Given the strikingly distinct nature of the simulation frameworks underlying the two simulators, this close correspondence is taken as a very positive sign that the initial implementation is correct in its essential details.
Note also that these measurements closely correspond to measurements of single-precision floating-point processing rates for the GTX 480, which has both the same number of cores and the same core frequency as the device represented by the supplied configuration file (480 cores at 700 MHz, with dual single-precision issue, yields around 1340 GFLOPS). Despite the close correspondence between the hybrid and original simulator results, this author has found it worthwhile to attempt an explanation of the remaining discrepancy. Figure 3.11 shows the difference in memory-device-level performance which can be used to help explain this discrepancy.

Figure 3.11: Memory fetch latency from hybrid simulator versus computational intensity

Although, on average, both simulators require essentially the same number of cycles to complete a memory fetch request, the maximum memory fetch latency recorded by the original simulator is considerably larger than the maximum memory fetch latency recorded by the hybrid simulator. This could indicate comparatively greater variability in the original memory simulator than is found in DRAMSim2. Greater variability in memory fetch request latency could easily translate into longer running times when parallelism is involved. To see that this is a possibility, consider the following two scenarios. Let X be the random variable for the following experiment: a fair die is rolled, and the result is multiplied by two. Let Y be the random variable for the following experiment: a fair die is rolled twice, and the results are added together. By elementary probability, E[X] = 7 and Var[X] = 35/3 ≈ 11.67; similarly, E[Y] = 7 and Var[Y] = 35/6 ≈ 5.83. Suppose that two parallel processors must each process a single task, and that the time required to complete each task is modeled by X. In this case, the makespan (the time from the initial assignment of both tasks until the last processor to finish finishes) is a random variable MX with E[MX] = 161/18 ≈ 8.94. If the time to complete each task were instead modeled by Y, the makespan would be a random variable MY with E[MY] = 5425/648 ≈ 8.37 (a short enumeration verifying these values is given below). Therefore, despite having the same average, the process with less variability can result in an overall reduction in the total time to complete tasks in parallel, and it is this author's suggestion that the small discrepancy in observed instructions per cycle could be attributable, in whole or in part, to this phenomenon, particularly in light of the marked variance in the memory fetch request latency results. Whether this variability might become a source of concern later is another question, and one which may merit further consideration.

Figure 3.12: Comparison of global memory read bandwidth for original and hybrid simulators

Figure 3.12 shows the global memory read bandwidth for the original and hybrid graphics simulators. As in the other experiment, the hybrid simulator is observed to achieve a slightly higher memory bandwidth than was possible in the original GPGPU-Sim, possibly for the reasons already hypothesized; in any event, the discrepancy is minor enough to be attributable to a variety of factors, including slight variations in the device timing parameters or in simulation internals or scheduling policies.
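As a quick check of the makespan expectations used in the variability argument above, the following short program (standard C++ only, illustrative and not part of GPUHM-Sim) reproduces E[MX] = 161/18 ≈ 8.94 and E[MY] = 5425/648 ≈ 8.37 by enumerating all equally likely die outcomes.

// Brute-force verification of the two makespan expectations quoted above.
#include <algorithm>
#include <cstdio>

int main() {
    // Scenario X: each task takes 2*d time units for a single die roll d.
    double mx = 0.0;
    for (int d1 = 1; d1 <= 6; ++d1)
        for (int d2 = 1; d2 <= 6; ++d2)
            mx += std::max(2 * d1, 2 * d2);
    mx /= 36.0;                 // E[MX] = 161/18, approximately 8.94

    // Scenario Y: each task takes a + b time units for two die rolls a and b.
    double my = 0.0;
    for (int a = 1; a <= 6; ++a)
        for (int b = 1; b <= 6; ++b)
            for (int c = 1; c <= 6; ++c)
                for (int d = 1; d <= 6; ++d)
                    my += std::max(a + b, c + d);
    my /= 1296.0;               // E[MY] = 5425/648, approximately 8.37

    printf("E[MX] = %.4f, E[MY] = %.4f\n", mx, my);
    return 0;
}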
Returning to Figure 3.12, note also that the difference between the achieved read and write bandwidths is consistent with the experimental results from real hardware presented earlier. The following modifications had to be made to the SHOC DeviceMemory benchmark in order to collect these results: first, automatic tuning of the number of iterations inside device kernels based on measured runtime had to be disabled, since the simulator does not run in real time and excessively long initial runs can cause the benchmark to abort; second, and as a result of the first modification, device iterations had to be limited manually, and as such, the reported memory bandwidths should not be mistaken for peak bandwidths (indeed, only between a third and a half of the total peak memory bandwidth is achieved in simulation, compared to experimental results for the GTX 480).

Chapter 4
Related and Future Work

This chapter contains a brief summary of some notable research efforts whose goals are closely aligned with the goals of this project, as well as some closing remarks on the work described herein and on the future work to be carried out to complete the project's ultimate goals. Specifically, section 4.1 describes research endeavors involving the application of novel memory technologies to general-purpose graphics processing, including attempts to introduce both hardware transactional memory and spin-transfer torque memory, a kind of magnetoresistive memory technology. Section 4.2 briefly summarizes the key motivations and contributions of the project and, in particular, of the work discussed in detail in the previous chapter. Also, future work and possibilities for further research are identified.

4.1 Related Work

Recently, research carried out at the University of Virginia [24] investigated the potential of spin-transfer torque random-access memory, a kind of magnetoresistive memory technology, as a replacement for the static random-access memory used for shared memory in graphics processors. One of the motivations for that project, power savings, is the driving factor behind the present project. Other motivations for using spin-transfer torque memory as a replacement for shared memory include both area savings and increased capacity (since this variety of memory technology has a smaller cell size than static random-access memory; see the relevant sections of chapter 2 for more details). The goal of the study was to evaluate the potential of this non-volatile memory technology for use in graphics devices; longer write latencies were justified there, as here, by the observation that massive thread parallelism can effectively reduce the negative impact of high-latency transfers. GPGPU-Sim was used to simulate performance, while CACTI was used for power and area estimation. The results were quite positive: it was found that the use of non-volatile memory technology resulted in appreciable power (and, in their case, area and capacity) improvements. The affirmative finding in this study adds considerably to the confidence in the project currently being carried out. The present work differs from that work in that the target for replacement is the graphics global memory rather than the graphics shared memory.

W. Fung et al. [12] studied potential benefits from using hardware transactional memory to implement an inter-multiprocessor synchronization primitive free from data races and deadlocks.
Using GPGPU-Sim, they demonstrated that it was possible to capture much of the benefit of fine-grained locking mechanisms, enabling significantly improved performance compared to equivalent sequences of serial transactions. Although the methods employed to evaluate this novel hardware architecture are in many ways similar to those supporting this project, the motivations and goals are starkly dissimilar.

A team of researchers from International Business Machines has filed a patent application [14] on the use of a dedicated non-volatile memory storage device for graphics devices, in order to mitigate performance degradation due to the I/O bottleneck, and particularly the disk I/O bandwidth. This work differs from the present work in that it seeks only to augment graphics devices to avoid performance bottlenecks, instead of exploiting the power advantages of non-volatile memory technologies.

4.2 Conclusions and Future Work

The notion of utilizing non-volatile memory technology, specifically phase-change random-access memory, as a replacement for graphics global memory has been introduced and motivated with considerations of graphics and memory performance characteristics. The need for an appropriate means of evaluating competing design decisions was then explained as a key challenge in enabling future work in this direction. Appropriate simulation frameworks (GPGPU-Sim for graphics devices and DRAMSim2 for phase-change memory) were identified and justified, and the integration of these frameworks into a hybrid simulation framework, GPUHM-Sim, has been described in detail. The design and initial implementation of this integrated simulation framework are the main contributions of this work, and they support future research goals. Ultimately, the project will help provide useful insight into the application of non-volatile memory technologies to general-purpose graphics processing devices.

The work described in this thesis represents only a part (albeit an important one) of a larger research effort aimed at evaluating the benefits of a proposed novel architecture. A great deal of additional work must be performed in support of these objectives. Additionally, while the design and architecture of the integrated simulation framework are essentially complete in their major features, there remain both design and implementation challenges which, when overcome, will greatly facilitate use of the tool. In any event, GPUHM-Sim should provide a much-needed capability to simulate novel hybrid architectures, enabling further research work in this direction.

Bibliography

[1] T. Aamodt et al., "GPGPU-Sim 3.x Manual," http://gpgpu-sim.ece.ubc.ca/GPGPU-Sim 3.x Manual, 2012.
[2] M. Bauer, H. Cook, B. Khailany, "CudaDMA: optimizing GPU memory bandwidth via warp specialization," In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[3] A. Bakhoda, G. Yuan, W. Fung, H. Wong, T. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 163-174, 2009.
[4] P. Carpenter, "Performance in Scientific Computing with a Performance Characterization of the Parallel Ocean Program on Ranger," Bachelor's Thesis, Auburn University, 2010.
[5] S. Che, J.W. Sheaffer, K. Skadron, "Dymaxion: optimizing memory access patterns for heterogeneous systems,"
In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[6] L. Chua, "Memristor - The missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, iss. 5, pp. 507-519, 1971.
[7] B.R. Coutinho, D.N. Sampaio, F.M.Q. Pereira, W. Meira, "Divergence Analysis and Optimizations," In Proceedings of 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 320-329, 2011.
[8] A. Danalis, G. Marin, C. McCurdy, J. Meredith, P. Roth, K. Spafford, V. Tipparaju, J. Vetter, "The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite," In Proceedings of the Third Workshop on General-Purpose Computation on Graphics Processors (GPGPU 2010), 2010.
[9] D. Foley, "A Low-Power Integrated x86-64 and Graphics Processor for Mobile Computing Devices," IEEE Journal of Solid-State Circuits, vol. 47, iss. 1, pp. 220-231, 2012.
[10] W. Fung, T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA-17), pp. 25-36, 2011.
[11] W. Fung, I. Sham, G. Yuan, T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," In Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture (MICRO40), 2007.
[12] W. Fung, I. Singh, A. Brownsword, T. Aamodt, "Hardware Transactional Memory for GPU Architectures," In Proceedings of the 44th IEEE/ACM International Symposium on Microarchitecture (MICRO44), 2011.
[13] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R.S. Williams, K. Yelick, "Exascale Computing Study: Technology Challenges in Achieving Exascale Systems," Information Processing Techniques Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA) of the United States of America, 2008.
[14] A.W. Herr, A.T. Lake, R.T. Tabrah, "Non-volatile storage for graphics hardware," United States Patent Application Publication, Pub. No. US 2011/0292058 A1, 2011.
[15] N.S. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V. Narayanan, "Leakage current: Moore's law meets static power," IEEE Computer, vol. 36, iss. 12, pp. 68-75, 2003.
[16] NVIDIA Corporation, "CUDA Best Practices Guide," developer.nvidia.com, 2011.
[17] NVIDIA Corporation, "CUDA Programming Guide," developer.nvidia.com, 2011.
[18] NVIDIA Corporation, "CUDA Toolkit 4.0," developer.nvidia.com, 2011.
[19] NVIDIA Corporation, "CUDA Toolkit 4.1," developer.nvidia.com, 2011.
[20] Oak Ridge National Laboratory, "Scalable HeterOgeneous Computing (SHOC) Benchmark Suite," http://ft.ornl.gov/doku/shoc/start, 2012.
[21] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips, "GPU Computing," In Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, 2008.
[22] T. Perez, C.A.F. De Rose, "Non-Volatile Memory: Emerging Technologies And Their Impacts on Memory Systems," Technical Report No. 060, Pontificia Universidade Catolica do Rio Grande do Sul, 2010.
[23] P. Rosenfeld, E. Cooper-Balis, B. Jacob, "DRAMSim2: A Cycle Accurate Memory Simulator," Computer Architecture Letters, vol. 10, iss. 1, pp. 16-19, 2011.
[24] P. Satyamoorthy, "STT-RAM for Shared Memory in GPUs," Master's Thesis, The University of Virginia, 2011.
[25] The Green500, "The Green500 List :: Environmentally Responsible Supercomputing," http://www.green500.org/, 2011.
[26] Top500.org, "TOP500 Supercomputing Sites," http://www.top500.org/, 2011.
[27] H. Wong, S. Raoux, S. Kim, J. Liang, J. Reifenberg, B. Rajendran, M. Asheghi, K. Goodson, "Phase Change Memory," In Proceedings of the IEEE, vol. 98, iss. 12, pp. 2201-2227, 2010.
[28] C.J. Xue, Y. Zhang, Y. Chen, G. Sun, J.J. Yang, H. Li, "Emerging non-volatile memories: opportunities and challenges," In Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '11), 2011.
[29] G. Yue-Feng, S. Zhi-Tang, L. Yun, L. Yan, "Programming voltage reduction in phase change cells with conventional structure," In Proceedings of the International Conference on Electric Information and Control Engineering (ICEICE 2011), pp. 2469-2471, 2011.