Modeling and Optimization of Parallel Matrix-based Computations on GPU

by

Andrew White

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 5, 2013

Keywords: GPU, matrix-based computations, parallelization procedure, execution metrics

Copyright 2013 by Andrew White

Approved by
Soo-Young Lee, Chair, Professor of Electrical and Computer Engineering
Victor Nelson, Professor of Electrical and Computer Engineering
Chwan-Hwa Wu, Professor of Electrical and Computer Engineering

Abstract

As graphics processing units (GPUs) are continually being utilized as coprocessors, the demand for optimally utilizing them for various applications continues to grow. This work narrows the gap between programmers and minimum execution time for matrix-based computations on a GPU. To minimize execution time, computation and communication time must be considered. (In this work, computation time refers to the amount of time required for the GPU to execute a given computation; communication time refers to the amount of time required for a CPU to GPU or a GPU to CPU data transfer.) For computation, the placement of data in GPU memory significantly affects computation time and therefore is considered. Various matrix-based computation patterns are examined with respect to the layout of GPU memory. A computation pattern refers to the order in which each GPU thread computes a result. From examination of computation patterns, an optimized computation pattern, a pattern which reduces computation time, is derived. After the optimized computation pattern is derived, the access pattern to GPU memory, or order in which data is accessed, is considered. From the optimized access pattern, fine-tuning is performed on the GPU code, such as minimization of index calculations and loop unrolling, to further reduce computation time and resource allocation. After fine-tuning, input parameters which yield the minimum computation time for a matrix-based computation on the GPU are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Execution metrics are formulated, as functions of the input parameters, to represent the executional behavior of the GPU. The execution metrics are utilized to derive the optimal input parameters, which are input parameters that yield the minimum computation time.

The matrix-based computations considered are matrix-vector multiplication (Mv), matrix-matrix multiplication (MM), convolution, and the conjugate gradient method. A parallelization procedure is developed to minimize execution time by deriving the optimal placement of data, optimized computation pattern, optimized access pattern, and optimal input parameters. Each step of the procedure is developed through analysis of Mv and MM. The procedure includes an accurate communication model and estimates for CPU and GPU data transfers. With accurate estimates for communication and computation times, partitioning computation for CPU and GPU execution is performed to minimize execution time. Convolution and conjugate gradient are utilized to verify the validity of the procedure. Therefore, the overall goal of this work is to develop a parallelization procedure which minimizes execution time of matrix-based computations executing on the GPU.

Acknowledgments

First, I would like to thank God for blessing me with the opportunity to complete this work.
Countless prayers from family, friends, and myself enabled me to finish this dissertation. I would like to thank my wife, Abigail, who has always been supportive through the many years. Her dedication is truly inspiring. My family, particularly my parents, Bob and Ann, provided financial assistance and emotional support without which none of this would be possible. The lengthy phone calls about this work with my father also provided an outside perspective for improvement, and I am sincerely grateful.

I would like to thank the entire Department of Electrical Engineering at Auburn University. Specifically, I would like to acknowledge my advisor, Dr. Soo-Young Lee, for always being supportive. This work would not have been possible without his constant help and guidance. Through countless meetings, emails, and phone calls, Dr. Lee always believed we could finish this work and went above and beyond the expectations of an advisor. I would also like to acknowledge Dr. Victor Nelson and Dr. Chwan-Hwa Wu for their thorough review of this work as well as their availability to discuss any problems encountered along the way. Lastly, Dr. Amnon Meir, the outside reader of this work, provided quick and helpful revisions for completion. It was always a pleasure meeting with these professors and I truly appreciate the help.

Lastly, I would like to acknowledge the graduate students with whom I have shared an office for many years at Auburn. It was always enjoyable and relaxing to discuss life, and of course work, with Dr. Chris Wilson over lunch. I also appreciate the conversations with Siwei Wang and Praveen Venkataramani, which provided me a wider understanding of the rest of the world and some social interaction in an otherwise non-social environment.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Problem Definition
  1.2 Review
  1.3 Motivation
  1.4 Objectives
  1.5 Organization
2 GPU, CUDA and Terms
  2.1 GPU History
  2.2 GPU Architecture
  2.3 GPU Performance
  2.4 CUDA Environment
  2.5 Terms
3 Modeling
  3.1 Global Memory Layout
    3.1.1 Partition Camping
  3.2 Shared Memory Layout
    3.2.1 Bank Conflicts
  3.3 Execution Metrics
    3.3.1 Global Memory Accesses
    3.3.2 Active Threads
    3.3.3 Fragmented Threads
    3.3.4 Global Memory Partitions
  3.4 Communication Time
    3.4.1 CPU-GPU
    3.4.2 GPU-CPU
  3.5 Computation Time
    3.5.1 CPU
    3.5.2 GPU
4 Parallelization Procedure
  4.1 Placement of Data
    4.1.1 Mv
    4.1.2 MM
  4.2 Computation Patterns
    4.2.1 Mv
    4.2.2 MM
  4.3 Access Patterns
    4.3.1 Mv
    4.3.2 MM
  4.4 Fine-tuning
    4.4.1 Mv
    4.4.2 MM
  4.5 Input Parameters
    4.5.1 Mv
    4.5.2 MM
    4.5.3 GPU Computation Summary
  4.6 Computation Partitioning
    4.6.1 Mv
    4.6.2 MM
5 Performance Analysis
  5.1 Convolution
    5.1.1 Placement of Data
    5.1.2 Computation Patterns
    5.1.3 Access Patterns
    5.1.4 Fine-tuning
    5.1.5 Input Parameters
    5.1.6 Computation Partitioning
  5.2 Conjugate Gradient
6 Conclusion
Bibliography

List of Figures

2.1 Architecture of the T10 GPU.
2.2 Memory organization of the T10 GPU.
2.3 A square block of size 64.
2.4 Partitioning of a square block of size 64 into warps.
2.5 Partitioning of a square block of size 64 into half-warps.
3.1 Global memory connection of the T10 GPU. Memory is divided into 8 equal-sized partitions. Each address represents 256 bytes.
3.2 An example of partition camping occurring: global memory accesses by 2 HWs to 2 partitions.
3.3 An example of partition camping not occurring: global memory accesses by 2 HWs to 2 partitions.
3.4 Shared memory layout in the T10 GPU. Memory is divided into 16 equal-sized partitions. Each column is 4 bytes wide.
3.5 Threads within a HW accessing 16 shared memory banks. No bank conflicts occur.
3.6 Threads within a HW accessing one shared memory bank. 16 bank conflicts occur.
3.7 Modeling MM with input parameters: average percent difference between the maximum and minimum computation time (ms) of a group defined by n and a subset of the input parameters.
3.8 Comparison of measured and estimated CPU to GPU communication time (ms) on the T10 GPU.
3.9 Comparison of measured and estimated GPU to CPU communication time (ms) on the T10 GPU.
3.10 Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of Mv on the CPU.
3.11 Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of MM on the CPU.
3.12 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=3.
3.13 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=63.
3.14 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=513.
4.1 Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv.
Each memory transaction is 4 bytes. The accesses to A are uncoalesced and 4 accesses are required for each iteration.
4.2 Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv utilizing shared memory. Each memory transaction is 4 bytes. The accesses to A are coalesced and 4 accesses are required every 4th iteration.
4.3 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.4 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.5 Computation patterns: two computation patterns for Mv for computing multiple Cjs.
4.6 Average computation time (ms) varying the number of total threads for Mv on the T10 GPU.
4.7 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.8 Computation patterns: four computation patterns for MM for computing multiple Cijs.
4.9 Computation patterns: hybrid computation patterns for MM for computing multiple Cijs.
4.10 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.11 Access patterns: example of block-level access patterns to A for matrix-based computations. Each column is one partition of global memory. Each row is 512 values of type float.
4.12 Example of loading A into shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. No bank conflicts occur. dBlk.x = 16.
4.13 Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. Bank conflicts occur. dBlk.x = 16.
4.14 Example of loading A into shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank conflicts occur. dBlk.x = 16.
4.15 Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank conflicts occur. dBlk.x = 16.
4.16 Comparison of access patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.17 Comparison of access patterns: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.18 Comparison of fine-tuning: Maximum bandwidth (GB/s) for Mv on the T10 GPU.
4.19 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.20 Comparison of input parameters: Computation time (ms) of all input parameters for Mv on the T10 GPU.
4.21 Comparison of input parameters: Computation time (ms) of all input parameters for MM on the T10 GPU.
4.22 Comparison of effective bandwidth (GB/s) for Mv on the T10 GPU.
4.23 Comparison of effective bandwidth (GB/s) for MM on the T10 GPU.
4.24 Comparison of execution time (ms) for Mv on the T10 GPU.
4.25 Comparison of execution time (ms) for MM on the T10 GPU.
5.1 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.2 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.3 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.4 Computation patterns: two computation patterns for convolution.
5.5 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.6 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.7 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.8 Access patterns: example of a 3x3 filter, B, stored in global memory. Each row of a partition is 4 values of type float.
5.9 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.10 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.11 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.12 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.13 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.14 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.15 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=3.
5.16 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=63.
5.17 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=513.
5.18 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.19 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.20 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.21 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 3.
5.22 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 63.
5.23 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 513.
5.24 Comparison of execution time (ms) for 8 iterations of the conjugate gradient method on the T10 GPU.
5.25 Comparison of execution time (ms) for 256 iterations of the conjugate gradient method on the T10 GPU.

List of Tables

3.1 Measured results of executing Listing 3.1 which demonstrate the effect of partition camping on the T10 GPU. v is the number of values each thread summed.
3.2 Measured results of executing Listing 3.2 which demonstrate the effect of bank conflicts on the T10 GPU. v is the number of values summed by each thread.

Chapter 1
Introduction

1.1 Problem Definition

As the market for massively multithreaded architectures continues to grow, so does the development of general-purpose graphics processing unit (GPGPU) applications. Many programming models, such as NVIDIA's Compute Unified Device Architecture (CUDA), try to ease the gap between programmers and GPUs. While much research has been done in optimizing written GPU applications, little has been done to bridge the gap between programmers and minimum execution time for matrix-based computations on a GPU. Therefore, a parallelization procedure is necessary to provide programmers a guide to achieve minimum execution time for matrix-based computations on a GPU.

To develop a parallelization procedure, it is necessary to consider computation and communication time. Accurate estimates for computation and communication time are necessary to minimize execution time. To assist in modeling GPU computation time, the layout of GPU memory must be examined for various matrix-based computation patterns. Therefore, it is necessary to determine in which type of GPU memory data is placed. After the placement of data is determined, the order in which threads executing on the GPU compute results, the computation pattern, is considered. Therefore, a computation pattern which reduces the computation time with respect to the layout of GPU memory for matrix-based computations must be derived. After deriving the optimized computation pattern, it is necessary to derive an optimized access pattern to GPU memory since the order in which threads access memory affects computation time. Utilizing the optimized access pattern, the code must be fine-tuned to reduce computation and resource allocation.
From the fine-tuned code, it is necessary to derive the input parameters for the GPU computation which yield the minimum computation time. To derive optimal input parameters, an accurate model of the computational behavior of the GPU is required. Since modeling GPU computation using input parameters alone is insufficient, execution metrics must be formulated as functions of the input parameters to model the behavior. From the execution metrics, optimal input parameters must then be derived which yield the minimum computation time for matrix-based computations.

Communication estimates between the CPU and GPU are also necessary to develop a parallelization procedure which minimizes execution time. Accurate communication estimates, in addition to accurate computation estimates, are necessary to determine which computations are performed by the CPU and GPU to minimize execution time. Therefore, a parallelization procedure which minimizes execution time of matrix-based computations on the GPU must consider the placement of data, computation patterns, access patterns, input parameters, communication time, and computation partitioning.

1.2 Review

In early GPU research, Fatahalian and others performed a study of matrix-matrix multiplication on GPU architectures [1]. It was shown at the time that GPUs suffer from memory bandwidth limitations, which has later been expounded upon by many research groups. At that time, work showed that CPUs were better candidates for applications that feature data reuse, such as matrix-matrix multiplication. Since then, GPU architecture has evolved, creating an increasingly favorable atmosphere for GPUs being utilized as coprocessors.

Shortly after 2004, work began on automatically tuning matrix-matrix multiplication for GPUs [2]. However, results showed that automatic tuning of GPU programs severely reduced performance. After that, a Microsoft research group proposed a system to program GPUs [3]. Results showed that CPUs were several times faster than GPUs at executing computations such as Mv. After that, research began to focus on modeling GPU performance [4]. At the time, results showed accurate estimations of GPU computation time. However, the advent of CUDA and CUDA-enabled devices obsoleted this work.

Researchers at the Georgia Institute of Technology attempted to predict GPU performance in very fine detail by examining the PTX code (the NVIDIA assembly-code equivalent) to determine the number of computation and memory access cycles and using these to determine the GPU computation time [5, 6]. This work was utilized to predict the optimal number of active cores for the GPU and to disable some of the cores to reduce energy consumption [7]. Although the work provides a detailed algorithm to count the number of instructions from PTX code, calculating the cycles and predicting the result can be a tedious job. Analysis at a higher level allows a more general approach and can reduce the time required to yield a minimum execution time. In addition, while the predictions for older-model GPUs such as the FX5600 are mostly accurate, the behavior of the predicted time on newer GPUs, such as 200-series architectures, is not the same as the behavior of the measured time. Moreover, the work provided a static approach to predicting the computation time but left optimizations up to the programmer. Most importantly, the layout of the global memory in the GPU was not considered, which largely affects the computation time for matrix-based computations on GPUs.
Researchers at the International Institute of Information Technology in Hyderabad, India, focused on explaining the behavior of current NVIDIA GPUs and worked on developing a model to illustrate the performance of an NVIDIA GPU [8]. Their work focused on creating a simulator for the GPU in the future which would aid in writing optimized applications. However, their research, as stated in their publications, does not aim to provide a procedure to programmers which minimizes execution time for GPU applications.

Researchers at the Center for Reliable and High-Performance Computing at the University of Illinois at Urbana-Champaign have also spent time researching GPUs and their performance [9-13]. Their work includes a discussion of balancing shared resource usage to achieve high performance on the GeForce 8800 [9]. They focused on generating enough threads to hide memory latency, choosing applications with a high percentage of floating-point operations such as MM, and reducing the number of global memory accesses. They use this example to show that shared memory is useful for reducing redundant loads and thus reducing the pressure on memory bandwidth. In addition, some of their research focuses on resource utilization by the kernel, such as shared memory, registers, and occupancy [10, 11]. Similar to previous work, PTX code is examined for modeling computation time. The research focused on minimizing the number of trials necessary to find optimal input parameters rather than deriving them. In addition, the research did not consider the layout of global memory or the access pattern and thus does not work well for memory-bound problems such as many matrix-based computations.

In addition to the previous work mentioned, the group at the Center for Reliable and High-Performance Computing at the University of Illinois at Urbana-Champaign developed a program called CUDA-lite [14]. This was developed to be an enhancement to CUDA that automates the task of selecting the appropriate memory to use and the coding of data transfers between memories. The tool was written to coalesce memory accesses, which was a roadblock to GPU performance at the time. However, NVIDIA continues to relax the requirements to achieve coalesced reads/writes with every SDK released, so there is a dwindling need for a programmer to use an automated compiler to handle such a task. In addition, requiring programmers to learn other tools in addition to CUDA creates an additional burden. Regardless, it should be assumed that programmers understand the basic concepts of the memory hierarchy and thus do not need additional tools to optimize the access patterns if some basic guidelines are followed when writing applications. Lastly, this work does not consider the global memory layout and can produce poor results in such cases as utilizing shared memory to coalesce accesses.

The most recent research by this group includes an adaptive performance modeling tool [12] and a CUDA application survey [13]. In [13], a survey is performed of different applications and the suitability of CUDA-enabled GPUs. While this provides a useful summary of the architecture and how the GPU operates, along with analysis as to why certain applications are optimal, it does not satisfy the need for a procedure to minimize computation time. In [12], a compiler-based approach is utilized to estimate GPU computation time. They, too, analyzed the effects on computation time from utilizing various GPU memories.
While this was a major step in automatic selection of optimal applications, it does not provide a procedure which yields optimal GPU code and a derivation of input parameters that yield the minimum computation time.

With the release of CUDA, NVIDIA released CUBLAS [15], an optimized implementation of the BLAS [16] routines for GPUs. However, the source for CUBLAS is closed and little work is published on how each routine is implemented. Alternatively, MAGMA [17] released an optimized implementation of the BLAS routines, similar to LAPACK [18], for GPUs. Several works [19-34] provide insights into optimized GPU applications. However, this research aims to provide users optimized BLAS routines rather than provide programmers a procedure for optimizing GPU computations and minimizing execution time. Other work [35-48] also focuses on optimizing individual computations, such as Mv or MM, but is limited to specific implementations or ignores certain aspects of the GPU such as memory layout.

Lastly, work has been performed that highlights the programming experiences with CUDA as well as the architectures of GPUs [49-52]. [49] provides a summary of research done with different applications on GPUs utilizing CUDA. The work highlights the benefits of CUDA as well as what is necessary to make applications suitable for a GPU, such as exposing sufficient amounts of fine-grained parallelism to occupy the device, the ability to block computation, the efficiency of data-parallel programs without thread divergence, and finally, usage of shared memory. [51] focuses on the architecture of Tesla GPUs and how it has evolved over the last decade from fixed-function graphics pipelines to programmable processors with computing power greater than multi-core CPUs. One of the newest GPU architectures, Fermi, is outlined in [52], which also discusses the demand for increasing GPU capabilities and the future of GPUs. The research shows the benefit of co-processing architectures (such as CUDA) and explains the reason the market for these architectures and their usability will continue to grow. Because of this expanding market, research is needed to provide users with a procedure for achieving the minimum execution time.

1.3 Motivation

With the development of environments such as CUDA, GPUs are increasingly becoming a viable option for parallel processing. With the increased interest in utilizing the GPU as a coprocessor, it is necessary that programmers be able to write applications for this platform without the need to fully understand the underlying architecture of the GPUs. Often, GPU manufacturers do not disclose certain aspects of the hardware, or sometimes even the software, to maintain a competitive advantage. Therefore, researchers are constantly testing and analyzing the new generations of GPUs to understand how to fully utilize them. It is necessary to provide programmers a procedure for minimizing the execution time while also providing the logical reasoning behind it.

Previous research does not provide programmers a parallelization procedure for matrix-based computations which yields minimum execution time. More specifically, previous research does not account for the layout of GPU global memory or the access pattern to that memory, which is determined by the code, and thus optimized access patterns are not derived to minimize computation time. In addition, previous research does not consider the computation pattern of the executing threads, which affects computation time.
Therefore, optimized computation patterns are not derived to achieve the minimum time. In previous research, input parameters are determined through testing the application, and thus optimal input parameters are not derived for various computations to yield the minimum computation time. Therefore, it is necessary to develop a parallelization procedure for matrix-based computations executing on a GPU which minimizes execution time by considering the placement of data, computation patterns, access patterns, input parameters, communication time, and computation partitioning.

1.4 Objectives

Since the primary goal of this work is to develop a procedure which yields the minimum execution time for matrix-based computations on a GPU, the objectives are to

• examine the layout of GPU memory,
• formulate execution metrics to accurately represent the computational behavior of the GPU,
• model and estimate CPU to GPU and GPU to CPU communication time,
• model and estimate CPU and GPU computation time,
• determine the optimal placement of data in GPU memory,
• analyze computation patterns for GPUs and derive the optimized computation pattern,
• derive the optimized access pattern for GPU memory,
• fine-tune code to minimize computation and resource allocation,
• from the execution metrics, derive the optimal input parameters for the GPU,
• determine the optimal partitioning of computation between the CPU and GPU,
• verify these objectives by comparing the procedure with several matrix-based computations.

This is accomplished with consideration to grid and block partitioning, data arrangement and partitioning, code arrangement, and data transfers. Therefore, the intellectual contribution of this work is a parallelization procedure which minimizes execution time for matrix-based computations on a GPU.

1.5 Organization

Chapter 2 is an introduction to GPUs, CUDA, and terms utilized in this work. An examination of the layout of GPU memory and the formulation of execution metrics to accurately represent the computational behavior of the GPU are presented in Chapter 3. In addition, modeling of communication and computation times is presented in Chapter 3. From the layout of GPU memory and formulation of execution metrics, the parallelization procedure to minimize execution time for matrix-based computations executing on a GPU is developed in Chapter 4. The parallelization procedure considers the placement of data, computation patterns, access patterns, fine-tuning, input parameters, communication, and computation partitioning. The procedure is developed through and applied to Mv and MM in Chapter 4. Results are included to illustrate the impact on time for each step of the procedure. In Chapter 5, the procedure is applied to convolution and the conjugate gradient method. The application of the procedure to convolution is shown. Results illustrate the performance of the procedure applied to convolution and the conjugate gradient method. Chapter 6 is a summary and conclusion of the presented work.

Chapter 2
GPU, CUDA and Terms

This chapter includes a brief history of GPUs, a comparison of CPU and GPU architectures, and the performance of GPUs. Following is a summary of CUDA, the NVIDIA GPU programming environment, and lastly, terms are introduced.

2.1 GPU History

As GPUs continue to become low-cost, parallel processing architectures, additional programmers will turn to these architectures to minimize execution time.
Therefore, there is an increasing need for optimizing applications written for GPUs. While much work has been done by NVIDIA and other researchers using the CUDA environment to explain the impact of different application parameters, much information about how GPUs function is not public knowledge. Therefore, it is necessary to expand upon the previous research into optimizing applications and develop a procedure for minimizing execution time to ease future programmers into writing efficient CUDA applications for GPUs.

Over the last 30 years, GPU architecture has evolved to a massively parallel design. In the 1980s, GPUs were large, expensive systems that typically cost in the range of $50,000 and were capable of processing 50 million pixels per second. By the 1990s, GPU costs decreased significantly, causing them to be more widespread as they were deployed in small workstations as PC accelerators. During this time, graphics APIs such as DirectX became popular and programmers began utilizing the fixed functions to perform other tasks. In the 2000s, GPUs dropped in price to the $100 range and became available on every computer. Those GPUs were capable of processing 1 billion pixels per second. More importantly, those GPUs became programmable for general-purpose computing using such languages as CUDA and OpenCL. This advancement in graphics performance has been driven by the market demand for high-quality, real-time graphics in computer applications, namely the gaming industry. The result is that, over time, graphics architecture has changed from being a simple pipeline for drawing wire-frame diagrams to a highly parallel computing chip [53] [52] [51].

In order to fully utilize today's chips and maximize performance, GPU algorithms need several key components: access to data with minimal bank conflicts, SIMD parallelism, and many arithmetic computations. GPU algorithms are best suited for computationally intensive applications that require little inter-process communication and include, but are not limited to, physical modeling, computational engineering, matrix algebra, convolution, correlation, and sorting. Therefore, matrix-based computations are ideally suited for GPUs.

2.2 GPU Architecture

GPUs, as opposed to CPUs, are not well-suited for all algorithms due to their architecture. CPUs are designed for optimal sequential performance, and application performance increases with increasing clock frequencies. However, the rate at which clock speeds increase is beginning to slow. Because of the importance of sequential performance, CPU cores include the full x86 instruction set with complex branching. In addition, CPUs have large cache memories to decrease instruction and data latencies for applications. The larger cache memories and the need for complex control logic force designers to use less of the available silicon for ALUs. On the other hand, GPUs are designed for optimal parallel performance. While GPU clock speeds are slower than CPU speeds, application performance increases with the number of cores. The lack of large cache memories and complex control logic decreases the sequential performance in comparison to CPUs but allows designers to implement more ALUs on the available silicon, thus increasing the numeric computing capability. Lastly, since GPUs are designed with a focus on parallelism, their structure includes many more, smaller cores than CPUs.
The NVIDIA GPU architecture (from this point forward, GPU refers to an NVIDIA GPU) is a collection of streaming multiprocessors (SMs), with each SM having a number of cores, or streaming processors (SPs). The control logic and instruction cache for all SPs within an SM are shared, as illustrated in Figure 2.1a. This work utilizes the Tesla T10 (T10) GPU, which is built on the NVIDIA 200-series architecture. The T10 architecture has a total of 30 SMs with each SM having 8 SPs. A newer architecture, Fermi, includes only 16 SMs but each SM has 32 SPs. In both architectures, each SP consists of an ALU and FPU, as illustrated in Figure 2.1b. The 200-series architecture has a 24-bit ALU while the Fermi architecture includes a 32-bit ALU. However, the 200-series architecture is capable of 32-bit precision arithmetic through the use of multiple arithmetic instructions.

Figure 2.1: Architecture of the T10 GPU. (a) Streaming Multiprocessor. (b) Streaming Processor. Source: http://www.anandtech.com/show/2549/2

In current GPU systems, the GPU and CPU have separate memory spaces, with a GPU consisting of three separate types of memory: global, constant, and shared. Figure 2.2 depicts the organization of GPU memory. Global and constant memory are accessible by all SPs, while shared memory is partitioned for each SM. Therefore, each SP in the same SM can share data through the shared memory. However, SPs from one SM cannot share data with SPs from a different SM.

Figure 2.2: Memory organization of the T10 GPU.

Global memory is the slowest memory available on the GPU and is 4GB in size for the Tesla T10 GPU. It is utilized as a transfer medium from the host (CPU) to the device (GPU) or from the device to the host. The GPU can read and write the global memory while the CPU can copy memory to or from the global memory. Global memory is persistent through kernel calls while constant and shared memory are not. A CUDA application can transfer data from the system memory at 4GB/s and at the same time upload data to the system memory at 4GB/s. Constant memory is much faster and smaller than the global memory and is only 64KB in size on the Tesla T10. However, the GPU cannot write to constant memory; therefore, it is simply used to transfer constant data from the CPU to the GPU. Shared memory is also faster and smaller than global memory and can be written to and read from by the GPU. There are 16KB of shared memory per SM for the Tesla T10. However, shared memory cannot be accessed by the CPU; therefore, it is typically used as a scratch-pad memory for each SP. All computational results generated by the GPU must be stored in global memory to be accessible by the CPU. In some applications, shared memory can also be used for sharing data amongst SPs in the same SM. In addition to the three types of memory, each thread has access to registers and local memory (a subsection of global memory used to store variables when the maximum number of registers is exceeded), both of which are read/write. The shared memory can be read/written by all threads in the same SM. However, threads from different SMs cannot access the same shared memory, while the constant memory can be read by any thread. The global memory can be read/written by any thread.
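To make the roles of these memory spaces concrete, the sketch below shows how a CUDA program typically allocates global memory, loads read-only coefficients into constant memory, and stages data in shared memory. This is an illustrative sketch only, not code from this work; the kernel name, array names, and sizes are hypothetical.

#include <cuda_runtime.h>

#define N 512

// Constant memory: written by the CPU (via cudaMemcpyToSymbol), read-only on the GPU.
__constant__ float c_coef[16];

// Kernel: global memory (d_in, d_out) is visible to all SMs; shared memory (s_tile)
// belongs to the block's SM and is used here as a scratch pad.
__global__ void scale(const float *d_in, float *d_out)
{
    __shared__ float s_tile[256];                  // 1KB of the SM's 16KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index into d_in/d_out

    s_tile[threadIdx.x] = d_in[i];                 // global memory to shared memory
    __syncthreads();                               // all threads in the block see the tile

    d_out[i] = s_tile[threadIdx.x] * c_coef[0];    // results must return to global memory
}

int main(void)
{
    float h_in[N], h_out[N], h_coef[16];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;
    for (int i = 0; i < 16; i++) h_coef[i] = 2.0f;

    float *d_in, *d_out;                            // global memory allocations
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice); // CPU to GPU
    cudaMemcpyToSymbol(c_coef, h_coef, 16 * sizeof(float));            // CPU to constant

    scale<<<N / 256, 256>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost); // GPU to CPU
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The sketch reflects the constraints described above: only global memory is visible to the CPU, constant memory is written only by the CPU, and shared memory is private to the SM executing the block.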
2.3 GPU Performance

Because of differences in architecture, GPUs have continually outperformed CPUs in terms of floating-point operations per second. "As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 to 1" [53]. The fundamental difference in the design of CPUs versus GPUs has created this large increase in performance in terms of FLOPS. While the CPU is designed to optimize sequential performance, GPUs are designed to optimize parallel instruction throughput. Therefore, GPUs need thousands of threads executing in parallel to achieve full efficiency where a CPU may only need a few. However, each thread of a GPU is lightweight in comparison and requires little overhead to create. The NVIDIA 200-series architecture supports 1024 threads per SM for a total of 30,720 threads running in parallel for the entire chip. In addition, new generations of hardware, such as the Fermi, support even more threads running concurrently.

Today's Tesla C1060, currently sold by NVIDIA, includes one Tesla T10 processor built on the 200-series architecture. The T10 includes 240 cores, each with a clock speed of 1.33GHz, and is connected to 4GB of DRAM. The C1060 fits in a standard PCIe dual slot and consumes around 160 watts of power. Because of these low-cost, low-power GPUs, massively parallel applications are quickly being ported for GPGPU use.

To date, more than 200 million GPUs have been deployed, providing economical parallel processing around the globe. GPGPU hardware is currently being used in a variety of applications, including MRI products. In the past, typical parallel processing research was focused on using large clusters, but actual clinical MRI machines needed to be much smaller than that. Because of this, groups such as the National Institutes of Health (NIH) would not fund parallel processing research since it was considered to be limited to large cluster-based machines. However, current MRI machine manufacturers ship MRI products with GPUs, and the NIH funds research using GPUs as coprocessors [53].

Previously, a large drawback of GPGPU applications was the necessary programming. In the past, programmers had to learn to program the GPUs using drivers and assembly language. These languages also did not include many useful general computing instructions, like integer or bit operations, since they were not implemented at the time. In addition, communication between any of the processors in the GPU was limited, making any data sharing difficult. While this is still somewhat of a limitation, it has been greatly improved, since the original design, through CUDA.

2.4 CUDA Environment

Since GPUs were starting to be used as massively parallel processors, NVIDIA developed a programming interface to utilize them known as CUDA. CUDA stands for Compute Unified Device Architecture and is based on the C/C++ language. It enables programmers to write applications in either C or C++ to utilize the GPU as a massively parallel co-processor. In addition, since MATLAB version 2010b, there is support for m-files to use the CUDA environment, allowing access for computation on the GPU. CUDA provides constructs for memory transfers between the CPU and GPU for sharing data. It is based on an SPMD programming style which executes the same program on multiple parts of the data. This is different than SIMD since the processing units do not have to be executing the same instruction at the same time.
In the CUDA programming environment, there is a host, which is the CPU, and a device, which is the GPU. The NVIDIA CUDA programming language is a large reason more and more industries are looking at GPUs as portable parallel processing systems. The CUDA extension of the C programming language has enabled all programmers to create parallel applications without any special hardware or software knowledge beyond C. However, the knowledge of the underlying architecture needed to achieve high speedups has inhibited typical programmers from using GPUs as coprocessors.

2.5 Terms

Using CUDA, a C/C++ program consists of host code, code written to execute on the CPU, and device code, code written to execute on the GPU. Device code is divided into individual functions known as kernels. To compile kernels, NVIDIA provides a C compiler known as NVCC. It separates the host and device code and only performs compilation on the device code while using gcc for the host code. In addition, NVIDIA provides a low-level programming language, parallel thread execution (PTX), similar to assembly language, for writing kernels without exposing any of the underlying instruction set.

Kernels can be written for any computational operation involving the GPU and are similar to functions in C. They are executed using a specified number of threads. Since the GPU is an SPMD structure, all threads execute the same code on different portions of data and each thread has an identifier known as the thread index. Thread identifiers are used for control logic and data access. A collection of threads is a block, as shown in Figure 2.3. A block can be partitioned in 1, 2, or 3 dimensions. The number of threads in the x- and y-dimension of each block is defined by dBlk.x and dBlk.y, respectively.

Figure 2.3: A square block of size 64.

Threads in the same block can cooperate through atomic operations and synchronization. However, threads in separate blocks cannot since there is no guarantee to which SM a block is assigned. Each block also has an identifier similar to a thread's, known as the block index. The collection of blocks is considered the grid and can be partitioned in 1 or 2 dimensions. The number of blocks in the x- and y-dimension of the grid is defined by dGrd.x and dGrd.y, respectively. Each kernel call can define different grid and block sizes and dimensions, but these cannot be changed dynamically during kernel execution.

In the T10 processor, each grid can have a maximum of 65,536 blocks in each dimension and each block can have a maximum of 512 threads, yielding a maximum total of 2^41 threads. The dimensions of a grid and block determine the total number of threads operating in the GPU, so each needs to be chosen to ensure the GPU is fully occupied. Since blocks can execute in any order with respect to other blocks, CUDA applications are very scalable, which allows applications to adapt to new hardware without changing any of the existing code. However, architecture changes that often accompany new hardware can cause issues with scalability.
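As a concrete illustration of these terms, the sketch below shows a trivial kernel (device code) launched from host code with a 2-dimensional grid of 2-dimensional blocks; the grid and block dimensions passed at the launch correspond to the input parameters dGrd.x, dGrd.y, dBlk.x, and dBlk.y. The kernel, array name, and sizes are hypothetical and are not taken from this work.

#include <cuda_runtime.h>

// Device code: each thread uses its thread and block indices to select one element.
__global__ void addOne(float *d_A, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x position within the matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y position within the matrix
    if (row < n && col < n)
        d_A[row * n + col] += 1.0f;                  // row-major indexing
}

int main(void)
{
    const int n = 512;                // square matrix, as assumed throughout this work
    float *d_A;
    cudaMalloc(&d_A, n * n * sizeof(float));
    cudaMemset(d_A, 0, n * n * sizeof(float));

    // Input parameters: dBlk.x = dBlk.y = 16 (a square block of 256 threads) and
    // dGrd.x = dGrd.y = n/16 blocks in each dimension of the grid.
    dim3 dBlk(16, 16);
    dim3 dGrd(n / dBlk.x, n / dBlk.y);
    addOne<<<dGrd, dBlk>>>(d_A, n);

    cudaDeviceSynchronize();          // wait for the kernel to finish
    cudaFree(d_A);
    return 0;
}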
Once the partitioning for the grid and blocks is determined, the scheduler organizes threads into groups of 32 known as warps, as shown in Figure 2.4. Since warps waiting for long-latency operations, such as intensive arithmetic operations or memory accesses, are not selected for execution, this provides a type of latency hiding. While those warps are waiting for their operations to finish executing, other warps that are not waiting are scheduled to execute. This type of zero-overhead thread scheduling ensures that the maximum instruction throughput is realized. This is one major reason GPUs do not dedicate as much chip area to cache memories and branch prediction mechanisms as CPUs, thus allowing many more ALUs to fit in the same size of silicon.

Figure 2.4: Partitioning of a square block of size 64 into warps.

A warp is divided into half-warps (HWs), a group of 16 threads in row-major order, as shown in Figure 2.5. Memory accesses are issued at the HW level. The number of HWs is defined by the dimensions of the grid and blocks, the input parameters, which are specified by the programmer. The four input parameters are defined as dGrd.x, dGrd.y, dBlk.x, and dBlk.y. dGrd.x and dGrd.y are the dimensions of the grid in the x- and y-dimensions, respectively. dBlk.x and dBlk.y are the dimensions of each block in the x- and y-dimensions, respectively.

Figure 2.5: Partitioning of a square block of size 64 into half-warps.

Lastly, the focus of this work is on matrix-based computations. All matrices are assumed to be square and n is used to denote the width or height of a square matrix.

Chapter 3
Modeling

Developing a parallelization procedure to minimize execution time requires an accurate model of communication and computation times. Several modeling techniques, including fine-grain, vector-based, and table-based modeling, were initially applied to model GPU computational behavior. Fine-grain modeling was performed by measuring computation times for arithmetic operations and extrapolating for larger computations. Vector-based modeling was performed by measuring computation times for vector-based arithmetic operations, such as a dot product, and extrapolating for larger computations. Lastly, table-based modeling was performed by measuring computation time for various numbers of blocks and threads and extrapolating for other input parameters. However, none of these techniques provided reasonable models of the GPU computational behavior, partly due to the layout of the GPU's memory. Therefore, this chapter includes an examination of the layout of GPU memory and the formulation of execution metrics to accurately represent the computational behavior of the GPU.

Section 3.1 is an examination of the layout of global memory on the GPU. Included in this section are the effects of the layout on computation time. Section 3.2 is an examination of the layout of shared memory and its effects on computation time. The beginning of Section 3.3 provides the reasoning for utilizing execution metrics to model GPU computational behavior. Instead of utilizing input parameters to model behavior, execution metrics are formulated as functions of the input parameters for modeling. Section 3.4 is a model and estimation of CPU to GPU and GPU to CPU communication time. Results are included to prove the validity of the estimation. CPU computation time is estimated in Section 3.5 through curve-fitting measured data, and results are included in the section.

3.1 Global Memory Layout

One reason the aforementioned modeling techniques for GPU computational behavior yield inaccurate models is the layout of global memory on the GPU. Due to the layout of global memory, a problem referred to as partition camping exists. Partition camping has been defined by NVIDIA [54], although its effects on various applications have not been studied until recently [55] [56] [57].
3.1.1 Partition Camping

A description of the GPU architecture is necessary to understand partition camping. The global memory on an 8-series and a 200-series GPU is divided into 6 and 8 partitions, respectively. On both GPU architectures, each partition is 256 bytes wide. All partitions are connected to a memory controller allowing access by all SMs, as depicted in Figure 3.1. Data is stored in row-major order and each address in the figure occupies one row of a partition.

Figure 3.1: Global memory connection of the T10 GPU. Memory is divided into 8 equal-sized partitions. Each address represents 256 bytes.

Accesses to global memory partitions are performed at the HW level. Therefore, partition camping occurs when multiple HWs attempt to access the same partition, but not necessarily the same address, at a given time. Figure 3.2 illustrates partition camping occurring when two HWs access two consecutive partitions, where Figure 3.1 depicts in which partition each address resides. Since addresses A and I are in the same partition, HW 1 must wait until HW 0 finishes accessing address A to access address I. The same occurs for the second accesses, to addresses B and J. Therefore, partition camping occurs and the memory accesses are serialized.

Figure 3.2: An example of partition camping occurring: global memory accesses by 2 HWs to 2 partitions.

Figure 3.3 illustrates the access of two partitions by two HWs which eliminates partition camping. For the first access, HW 0 accesses address A, which resides in partition 0, and HW 1 accesses address J in partition 1. Partition camping does not occur since the accesses by each HW occur in different partitions. Similarly, for the second access, each HW accesses different partitions and partition camping does not occur. Figures 3.2 and 3.3 demonstrate the theoretical effects on computation time due to partition camping on a GPU, such that partition camping should be minimized to minimize computation time.

Figure 3.3: An example of partition camping not occurring: global memory accesses by 2 HWs to 2 partitions.

Effects

For most GPU computations, thousands of HWs are executing at a given time. To minimize partition camping, and thus the computation time, it is necessary that HWs utilize all partitions of global memory. In addition, it is necessary that the HWs be evenly distributed to the partitions.

The code in Listing 3.1 is a portion of a kernel tested to measure the effects of utilizing a varying number of partitions.

1 int offset = floor(BlkIdx / BlksPerPart) * 64;
2 for(j = 0; j < v; j++)
3     temp += A[j * 512 + offset + threadIdx.y];

Listing 3.1: Code to test the effects of partition camping.

Line 1 calculates which partition of global memory each block accesses by utilizing a linear block index, BlkIdx, and the number of assigned blocks per partition, BlksPerPart. The quotient of the two is multiplied by 64 since each value is stored as a float and a partition is 256 bytes wide on a 200-series GPU. The length of a vector in the matrix that is summed is represented by v. Adding the thread index in the y-dimension, threadIdx.y, minimizes the possible effect of memory request merging [58] [59].
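For reference, a complete kernel built around Listing 3.1 might look like the sketch below. The surrounding declarations, the way BlkIdx is obtained, and the write of the result are assumptions made here for illustration and are not taken from the dissertation; only the loop body comes from Listing 3.1. The commented launch configuration mirrors the one used for the measurements that follow (120 square blocks of 256 threads).

#include <cuda_runtime.h>

// Sketch of a full kernel around Listing 3.1. Each block is steered toward one
// 256-byte-wide partition: offset selects a 64-float (256-byte) aligned column of A,
// whose rows are 512 floats (2,048 bytes) wide, so consecutive groups of
// BlksPerPart blocks share the same partition.
__global__ void partitionTest(const float *A, float *out, int v, int BlksPerPart)
{
    int BlkIdx = blockIdx.y * gridDim.x + blockIdx.x;   // linear block index (assumed)
    float temp = 0.0f;

    int offset = (int)floorf((float)BlkIdx / BlksPerPart) * 64;
    for (int j = 0; j < v; j++)
        temp += A[j * 512 + offset + threadIdx.y];

    // Store a value dependent on temp so the compiler cannot remove the loop.
    out[BlkIdx * blockDim.x * blockDim.y
        + threadIdx.y * blockDim.x + threadIdx.x] = temp;
}

// Assumed host-side launch for the measurements described below:
//   dim3 dBlk(16, 16);            // square block of 256 threads
//   dim3 dGrd(120, 1);            // 120 blocks, 4 per SM on the 30-SM T10
//   partitionTest<<<dGrd, dBlk>>>(d_A, d_out, v, BlksPerPart);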
The computation time and effective bandwidth from executing the code in Listing 3.1 on the T10 GPU are given in Table 3.1.¹ Each time was measured using a square block of 256 threads and 120 blocks to yield the maximum occupancy for each SM. Since threads are assigned to warps in row-major order, each thread in a HW sums identical values. However, each HW in a block computes a different sum.

Table 3.1: Measured results of executing Listing 3.1, which demonstrate the effect of partition camping on the T10 GPU. v is the number of values each thread summed.

(a) Measured computation time (ms)
            number of partitions
  v          8       4       2       1
  512       0.4     0.9     1.7     3.4
  1024      0.9     1.7     3.4     6.7
  2048      1.7     3.4     6.8    13.5
  4096      3.3     6.8    13.5    26.9
  8192      6.8    13.4    26.5    53.8
  16384    13.7    26.5    52.3   107.5

(b) Effective bandwidth (GB/s)
            number of partitions
  v          8       4       2       1
  512      69.4    34.4    17.2     8.7
  1024     68.4    34.6    17.2     8.7
  2048     69.9    34.5    17.2     8.7
  4096     71.7    34.4    17.4     8.7
  8192     69.6    35.1    17.7     8.7
  16384    68.5    35.4    17.9     8.7

The bandwidth decreases linearly as the number of partitions utilized decreases, due to partition camping. Since the number of HWs remains constant in the tests, using fewer partitions increases partition camping, which increases the computation time. For MM, all partitions are utilized at some point when n ≥ 512, where n is the height or width of a square matrix. However, the order in which HWs access each partition varies depending on the computation pattern, which is determined by how the code is written. Therefore, partition camping and its effects are considered in the parallelization procedure.

¹These results suggest the memory request merging mentioned in [58] and [59] does not occur.

3.2 Shared Memory Layout

Shared memory is divided into equal-sized banks. For the T10 GPU, there are 16 banks, as depicted in Figure 3.4. Each bank consists of 256 rows, each 4 bytes wide, and all banks can be accessed simultaneously.

Figure 3.4: Shared memory layout in the T10 GPU. Memory is divided into 16 equal-sized banks. Each column is 4 bytes wide.

Bank conflicts occur at the thread level. If all threads within a HW access different banks, as illustrated in Figure 3.5, no bank conflicts occur and only one access is necessary.

Figure 3.5: Threads within a HW accessing 16 shared memory banks. No bank conflicts occur.

3.2.1 Bank Conflicts

If two or more addresses of a memory access to shared memory fall into the same memory bank, a bank conflict occurs. Since memory accesses are issued at the HW level, threads within a HW accessing differing addresses in the same bank cause bank conflicts, and the accesses are serialized. In Figure 3.6, all threads within a HW access different rows of shared memory but the same column. Since the column resides in a single shared memory bank, 16 bank conflicts occur. Therefore, 16 accesses to shared memory are issued to service the HW. However, the T10 GPU supports shared memory broadcasting: if threads within a HW access the same address in shared memory, no bank conflicts occur.

Figure 3.6: Threads within a HW accessing one shared memory bank. 16 bank conflicts occur.

Effects

The portion of the kernel utilized to test the effects of bank conflicts is depicted in Listing 3.2.
1 __shared__ float As[dBlkx][dBlky];
2 int offset = blockIdx.x % 8 * 64;
3 int soffset = threadIdx.x % Banks;
4 for(j = 0; j < v; j++) {
5     As[threadIdx.y][threadIdx.x] = d_A[j * 512 + offset + threadIdx.y];
6     temp += As[threadIdx.x][soffset];
7 }

Listing 3.2: Code to test the effects of bank conflicts.

The number of banks each HW accesses is defined by Banks. In Line 3, soffset is calculated to specify which column of shared memory is accessed; therefore, the number of bank conflicts is fixed. blockIdx.x and threadIdx.x are the indices of each block and thread in the x-dimension, respectively. Table 3.2 shows the effect of bank conflicts on computation time.

Table 3.2: Measured results of executing Listing 3.2, which demonstrate the effect of bank conflicts on the T10 GPU. v is the number of values summed by each thread.

(a) Computation time (ms)
            number of banks
  v         16       8       4       2       1
  512       0.5     0.5     0.6     0.9     1.5
  1024      0.9     1.0     1.3     1.8     2.9
  2048      1.7     1.9     2.5     3.6     5.8
  4096      3.4     3.8     5.0     7.2    11.5
  8192      6.6     7.6     9.9    14.3    23.0
  16384    12.9    15.1    19.6    28.5    46.0

(b) Effective bandwidth (GB/s)
            number of banks
  v         16       8       4       2       1
  512      65.1    58.6    45.8    31.8    19.9
  1024     66.6    60.4    46.5    32.2    20.1
  2048     67.7    61.4    47.1    32.6    20.2
  4096     69.1    61.7    47.3    32.7    20.3
  8192     71.2    61.9    47.6    32.8    20.4
  16384    72.4    62.1    47.7    32.9    20.4

Similar to the effects of partition camping in global memory, shared memory bank conflicts significantly affect computation time. As the number of banks increases, the number of bank conflicts decreases, as does the computation time. Although the change in computation time is not linearly proportional to the number of banks utilized, the effects of bank conflicts are clearly illustrated. Therefore, bank conflicts and their effects are considered in the parallelization procedure.

3.3 Execution Metrics

In this section, execution metrics are formulated to represent the computational behavior of the GPU and assist in modeling. Figure 3.7 shows the necessity for execution metrics, as the input parameters, the dimensions of the grid and blocks, do not accurately model computational behavior. Initially, all measured computation times of MM utilizing varying input parameters are grouped by n; the average percent difference between the maximum and minimum time within a group is shown. Grouping by n yields an average difference of 163% between the maximum and minimum time of a group. Next, all measured computation times utilizing varying input parameters are grouped by n and dBlk.x; the average percent difference between the maximum and minimum of a group is 109%. The order of the groupings is determined by the input parameter which yields the smallest average percent difference between the maximum and minimum time of a group. Continuing until 3 of the 4 input parameters are exhausted for grouping yields an average percent difference of 28%. Therefore, modeling the computational behavior of the GPU utilizing input parameters alone is inaccurate. Because of this, execution metrics are formulated as functions of the input parameters to accurately represent computational behavior.

Figure 3.7: Modeling MM with input parameters: average percent difference between the maximum and minimum computation time (ms) of a group defined by n and a subset of the input parameters.
Four input parameters, dGrd.x, dGrd.y, dBlk.x and dBlk.y, and the size of each matrix, n, are used to formulate the execution metrics. Input parameters and n are assumed to be powers of two in this work. The product of the four input parameters yields the total number of threads assigned for execution on the GPU² as

    Thds^total_GPU = dGrd.x · dGrd.y · dBlk.x · dBlk.y.    (3.1)

²The notation used in Equation (3.1) is used throughout this work. The unit being defined, threads (Thds) or blocks (Blks), is specified first. The superscript represents the type of unit being defined: total (total), active (active) or fragmented (frag.). The subscript represents the execution unit being defined: GPU (GPU) or SM (SM).

The execution metrics can be formulated for any matrix-based computation. However, in this work, the focus of the execution metrics pertains to Mv, MM, and convolution.

3.3.1 Global Memory Accesses

Thds^total_GPU is used to formulate an execution metric representing the number of global memory accesses. The number of memory accesses significantly affects the computation time for matrix-based computations since the time is largely dependent on memory access latency. In general, for matrix-based computations, all matrices reside in global memory and portions of matrices may reside in shared memory. For MM, there are 2n³ values read from global memory, although consecutive reads from threads in the same block can be combined. The number of writes to global memory for MM is n², and consecutive writes from threads in the same block can be combined. Therefore, the amount of time for writing to global memory for MM is significantly less than the amount of time for reading from global memory. This is similar for other matrix-based computations such as Mv and 2D convolution. Therefore, the number of writes to global memory is ignored in this work.

Memory accesses are issued at the HW level, and the GPU combines consecutive thread accesses within a HW into one access, known as coalescing. Each memory access is issued as either a 32-, 64- or 128-byte memory transaction. Therefore, the number of reads is the sum of 32B, 64B and 128B transactions,

    Gld = Gld_32B + Gld_64B + Gld_128B.

Initially, each transaction is considered a 128B access by the GPU's memory controller and is reduced if possible. If the threads in a HW access between 9 and 16 consecutive and aligned memory locations of float or integer values, only one 64B memory transaction is issued. If the HW accesses fewer than 9 consecutive memory locations of float or integer values, one 32B memory transaction is issued. If the HW accesses memory locations which are not consecutive and 128B-aligned, multiple 32B and/or 64B memory transactions are issued until all threads in the HW have been serviced. Since each HW is a group of 16 threads, the number of HWs that execute on the GPU is defined as

    HWs_# = Thds^total_GPU / 16.

This equation assumes there are at least 16 threads in a block. From the input parameters, and considering coalescing, the number of 32B and 64B reads for matrix-based computations can be formulated³.

For a naive implementation of Mv utilizing only global memory, thread_j, to compute c_j where c = A·b, computes row_j · b where row_j is one row of A. Therefore, the neighboring thread, thread_(j+1), computes row_(j+1) · b. If dBlk.x ≥ 16, each HW accesses 16 values of A and 1 value of b. Since only 1 value of b is accessed, a 32B transaction is issued.
The 16 values of A reside in differing rows of A, and therefore the accesses are uncoalesced and require 16 32B transactions. This is repeated n times for each HW. Therefore, the number of 32B transactions for this implementation of Mv is

    Gld_32B = (dBlk.x + 1)·n² / dBlk.x    if dBlk.x ≤ 8
            = 17n² / 16                   if dBlk.x ≥ 16.    (3.2)

This equation assumes only global memory is utilized for Mv. The equation is modified for varying memories as shown in Section 4.1.

³128B reads do not exist in matrix-based computations when using values of type int or float.

For a naive implementation of MM utilizing only global memory, to compute C_ij where C = A·B, thread_ij accesses row_i and col_j. If dBlk.x > 1, the neighboring thread, thread_i(j+1), accesses row_i and col_(j+1). Since both threads are in the same HW, if dBlk.x ≥ 16, n 32B transactions are issued to read row_i and n 64B transactions to read col_j ... col_(j+15). However, if dBlk.x ≤ 8, only 32B transactions are issued. Therefore, the total number of reads performed to global memory is defined as

    Gld_32B = n³/dBlk.x + n³/16    if dBlk.x ≤ 8
            = n³/16                if dBlk.x ≥ 16    (3.3)

    Gld_64B = n³/16                if dBlk.x ≥ 16.    (3.4)

These equations assume only global memory is utilized for MM. The equations are modified for varying memories as shown in Section 4.1. Assuming dBlk.x ≥ 16, half of the reads are 32B transactions and half are 64B. Since memory access latency to constant and shared memory is much shorter than to global memory, the number of accesses performed to constant and shared memory is ignored in this work as it has an insignificant impact on computation time.

For a naive implementation of 2D convolution⁴, thread_ij computes C_ij where C = A ∗ B. In image processing, A is an image and B is a filter; therefore, C is the result of applying a filter to an image and C_ij is one pixel of the resulting image. In this work, all filters, B, are assumed to be square, and the size, height or width, of a filter is represented by FS. For a naive implementation of convolution utilizing only global memory, to compute C_ij, thread_ij accesses FS² values of A and FS² values of B. Since each thread within a HW accesses the same value of B in each iteration of computation, the number of 32B transactions for accessing B is defined only by FS. If dBlk.x ≥ 16, neighboring threads within a HW access neighboring values of A and therefore the accesses are coalesced into one 64B transaction. However, if dBlk.x ≤ 8, no 64B transactions occur and the number of 32B transactions is dependent on FS and dBlk.x. Therefore, the number of global memory accesses is approximately⁵

    Gld_32B = (n²·FS²/16)·(1 + 16/dBlk.x)    if dBlk.x ≤ 8
            = n²·FS²/16                      if dBlk.x ≥ 16    (3.5)

    Gld_64B = n²·FS²/16                      if dBlk.x ≥ 16.    (3.6)

⁴From this point forward, convolution refers to 2D convolution of two matrices where C = A ∗ B.
⁵Due to boundary checking, not all reads are issued.

These equations assume only global memory is utilized for convolution. The equations are modified for varying memories as shown in Section 5.1.1. The number of accesses performed to shared memory is ignored in this work as it has an insignificant impact on computation time. In addition to formulating the number of reads to global memory, it is necessary to formulate the number of threads executing in parallel on the GPU; a short sketch collecting the read-count formulas above into code is given below.
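To make the read-count formulas concrete, the sketch below collects Equations (3.2) through (3.6) into a small plain-C helper. The function names and the use of double for the counts are assumptions made for illustration; the formulas themselves are those given above and apply only to the naive, global-memory-only implementations.

#include <stdio.h>

/* Equation (3.2): 32B reads for the naive global-memory Mv kernel. */
static double mv_gld32(double n, double dBlk_x) {
    return (dBlk_x <= 8) ? (dBlk_x + 1) * n * n / dBlk_x
                         : 17.0 * n * n / 16.0;
}

/* Equations (3.3) and (3.4): 32B and 64B reads for the naive MM kernel. */
static double mm_gld32(double n, double dBlk_x) {
    return (dBlk_x <= 8) ? n * n * n / dBlk_x + n * n * n / 16.0
                         : n * n * n / 16.0;
}
static double mm_gld64(double n, double dBlk_x) {
    return (dBlk_x >= 16) ? n * n * n / 16.0 : 0.0;
}

/* Equations (3.5) and (3.6): reads for the naive convolution kernel. */
static double conv_gld32(double n, double FS, double dBlk_x) {
    double base = n * n * FS * FS / 16.0;
    return (dBlk_x <= 8) ? base * (1.0 + 16.0 / dBlk_x) : base;
}
static double conv_gld64(double n, double FS, double dBlk_x) {
    return (dBlk_x >= 16) ? n * n * FS * FS / 16.0 : 0.0;
}

int main(void) {
    double n = 512, dBlk_x = 16;
    printf("Mv   Gld32B = %.0f\n", mv_gld32(n, dBlk_x));
    printf("MM   Gld32B = %.0f, Gld64B = %.0f\n",
           mm_gld32(n, dBlk_x), mm_gld64(n, dBlk_x));
    printf("Conv (FS=3) Gld32B = %.0f, Gld64B = %.0f\n",
           conv_gld32(n, 3, dBlk_x), conv_gld64(n, 3, dBlk_x));
    return 0;
}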
3.3.2 Active Threads

The number of threads executing at a given time on a GPU, or the number of active threads, is defined by Thds^active_GPU. To formulate the number of active threads, it is necessary to formulate the number of active blocks per SM, Blks^active_SM. Blks^active_SM is dependent on the number of registers used per block (RegsPerBlk), the amount of shared memory used per block (SMemPerBlk), the maximum number of blocks allowable, and the maximum number of threads allowable. The number of registers used per thread is dependent on the compiler and determined after compilation. The amount of shared memory used per block is defined by the kernel and determined after compilation. The maximum number of blocks and threads that can be active varies depending on the GPU. The T10 GPU consists of 30 SMs, and each SM can have a maximum of 8 blocks or 1024 threads. Assuming there are more blocks assigned for execution than SMs, dGrd.x · dGrd.y > 30, Blks^active_SM is formulated as

    Blks^active_SM = min( ⌊16384/RegsPerBlk⌋, ⌊16384/SMemPerBlk⌋, 8, 1024/(dBlk.x·dBlk.y) ).    (3.7)

Therefore, from Equation (3.7), the number of active blocks on the GPU at a given time is

    Blks^active_GPU = min( 30 · Blks^active_SM, dGrd.x · dGrd.y ).    (3.8)

The number of active threads on the GPU is formulated as the product of Equation (3.8) and the number of threads per block. Therefore,

    Thds^active_GPU = Blks^active_GPU · dBlk.x · dBlk.y.    (3.9)

3.3.3 Fragmented Threads

If the total number of threads on a GPU is not evenly divisible by the number of active threads (Thds^total_GPU mod Thds^active_GPU ≠ 0)⁶, fragmentation occurs. Fragmented threads are threads which are not executed with the same amount of parallelization as active threads. From (3.8) and (3.9), two execution metrics are formulated to represent the number of fragmented blocks and threads: Blks^frag._GPU and Thds^frag._GPU, respectively. The number of fragmented blocks on the GPU is

    Blks^frag._GPU = (dGrd.x · dGrd.y) mod Blks^active_GPU    (3.10)

and the number of fragmented threads on the GPU is

    Thds^frag._GPU = Blks^frag._GPU · dBlk.x · dBlk.y.    (3.11)

⁶x mod y denotes the remainder of x/y.

3.3.4 Global Memory Partitions

In addition to formulating execution metrics to represent the number of global memory accesses and the numbers of active and fragmented threads, two execution metrics are formulated to represent the layout of global memory. The global memory on a 200-series GPU architecture is divided into 8 partitions, where each partition is 256B wide. To represent the number of global memory partitions accessed per block, an execution metric, PartsPerBlk, is formulated as

    PartsPerBlk = dBlk.x / 64.    (3.12)

Since HWs are formed in row-major order, the x-dimension of the block is used. Since values of type float are 4 bytes and each global memory partition is 256 bytes wide, 64 values are stored per row of each partition. Therefore, Equation (3.12) represents the number of global memory partitions accessed per block. From PartsPerBlk, the number of global memory partitions accessed by the active blocks on the GPU, PartsPerGPU, is formulated as

    PartsPerGPU = min( Blks^active_GPU · PartsPerBlk, 8 )    if Blks^active_GPU < dGrd.x
                = min( dGrd.x · PartsPerBlk, 8 )             if Blks^active_GPU ≥ dGrd.x.    (3.13)

PartsPerGPU is dependent on the number of global memory partitions accessed per block and Blks^active_GPU. The equation defines a maximum of 8 partitions being accessed since there are 8 partitions of global memory in the T10 GPU.
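As a concrete illustration, the sketch below evaluates Equations (3.7) through (3.13) for the T10 (30 SMs, 16384 registers and 16384 bytes of shared memory per SM, at most 8 blocks or 1024 threads per SM). The struct, function names, and the example values in main are assumptions for illustration only.

#include <stdio.h>

/* Input parameters and per-block resource usage (names are assumed). */
typedef struct {
    long dGrd_x, dGrd_y, dBlk_x, dBlk_y;
    long RegsPerBlk, SMemPerBlk;   /* registers and bytes of shared memory per block */
} Params;

static long lmin(long a, long b) { return a < b ? a : b; }

/* Equation (3.7): active blocks per SM on the T10. */
static long blks_active_sm(const Params *p) {
    long lim = lmin(16384 / p->RegsPerBlk, 16384 / p->SMemPerBlk);
    lim = lmin(lim, 8);
    return lmin(lim, 1024 / (p->dBlk_x * p->dBlk_y));
}

static void gpu_metrics(const Params *p) {
    long blksTotal  = p->dGrd_x * p->dGrd_y;
    long blksActive = lmin(30 * blks_active_sm(p), blksTotal);   /* (3.8)  */
    long thdsActive = blksActive * p->dBlk_x * p->dBlk_y;        /* (3.9)  */
    long blksFrag   = blksTotal % blksActive;                    /* (3.10) */
    long thdsFrag   = blksFrag * p->dBlk_x * p->dBlk_y;          /* (3.11) */

    double partsPerBlk = (double)p->dBlk_x / 64.0;               /* (3.12) */
    double parts = (blksActive < p->dGrd_x) ? blksActive * partsPerBlk
                                            : p->dGrd_x  * partsPerBlk;
    double partsPerGPU = parts < 8.0 ? parts : 8.0;              /* (3.13) */

    printf("active blocks %ld, active threads %ld, fragmented threads %ld, "
           "partitions accessed %g\n", blksActive, thdsActive, thdsFrag, partsPerGPU);
}

int main(void) {
    Params p = { 64, 64, 64, 4, 2048, 4096 };   /* example values (assumed) */
    gpu_metrics(&p);
    return 0;
}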
3.4 Communication Time

Estimating the communication time between the CPU and GPU is necessary for a parallelization procedure to minimize execution time. For small matrix-based computations, it is possible that the communication time is higher than the CPU computation time. Therefore, it is necessary to accurately model the communication time between the CPU and GPU. In addition, communication and computation time estimates are necessary to determine the partitioning of computation between the CPU and GPU. Communication time, T_comm., is modeled linearly with a setup time, T_su, and a maximum transfer rate, T_x. Therefore,

    T_comm. = b / T_x + T_su    (3.14)

where b is the number of bytes being transferred. Measurements show that T_x is dependent on the type of transfer (CPU to GPU or GPU to CPU). In addition, analysis of measured data suggests T_su is a function of b. Therefore,

    T_su = c_0 · b + c_1    (3.15)

where c_0 is the time per byte to set up a transfer and c_1 is a constant. Both are determined through curve-fitting measured data. For Mv, MM, and convolution, there are two CPU-GPU transfers and one GPU-CPU transfer.

3.4.1 CPU-GPU

From NVIDIA, the advertised transfer rate between the CPU and GPU is 4 GB/s. However, the maximum transfer rate, T_x, measured for CPU to GPU communication is 3.5 GB/s. For transfers smaller than 256 KB, measured data shows c_0 is approximately 525 ms/GB and c_1 is approximately 0.014 ms. For transfers between 256 KB and 1 MB, the data shows c_0 is approximately 3.2 ms/GB and c_1 is approximately 0.061 ms. Lastly, for transfers of 1 MB and larger, the data shows c_0 is approximately 3.2 ms/GB and c_1 is approximately 0.19 ms. The varying values of c_0 and c_1 are presumably due to CPU cache size and memory layout as well as the layout of GPU memory. In this work, n ≥ 512, as discussed in Section 4.1, and therefore all CPU to GPU matrix transfers are larger than 1 MB. Figure 3.8 depicts the measured and estimated time for two CPU to GPU transfers. Two transfers are measured and estimated since the GPU requires a matrix and a vector for Mv and two matrices for MM and convolution. The nonuniform behavior illustrated in the figure arises because c_0 and c_1 vary with the size of the data.

Figure 3.8: Comparison of measured and estimated CPU to GPU communication time (ms) on the T10 GPU.

For Mv, 4n² + 4n bytes are transferred for the matrix, A, and vector, b, since values of type float are utilized. For MM, 8n² bytes are transferred for the two matrices, A and B. For convolution, 4n² + 4FS² bytes are transferred for the two matrices, A and B. The maximum and average percent error between the measured and estimated time are 5.0% and 1.0%, respectively.

3.4.2 GPU-CPU

The maximum transfer rate measured for GPU to CPU communication is 3.1 GB/s. For transfers smaller than 1 MB, measured data shows c_0 is approximately 210 ms/GB and c_1 is approximately 0.02 ms. For transfers larger than 1 MB, the data shows c_0 is approximately 6.9 ms/GB and c_1 is approximately 0.21 ms. The varying values of c_0 and c_1 are presumably due to CPU cache size and memory layout as well as the layout of GPU memory. Figure 3.9 depicts the measured and estimated time for one GPU to CPU transfer.

Figure 3.9: Comparison of measured and estimated GPU to CPU communication time (ms) on the T10 GPU.

For Mv, one GPU to CPU transfer of 4n bytes is necessary for the vector, c. For MM and convolution, one transfer of 4n² bytes is necessary for the matrix, C. The maximum and average percent error between the measured and estimated time are 8.9% and 1.4%, respectively.
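The model in Equations (3.14) and (3.15), together with the fitted constants reported above, can be written as a short estimator. The piecewise thresholds and constants below are the measured values quoted in Sections 3.4.1 and 3.4.2; the function and variable names are assumptions for illustration.

#include <stdio.h>

#define GB (1024.0 * 1024.0 * 1024.0)

/* Estimated CPU-to-GPU transfer time in ms for b bytes (Equations (3.14)-(3.15)). */
static double t_cpu_to_gpu_ms(double b) {
    double Tx = 3.5 * GB;              /* measured peak rate, bytes/s      */
    double c0, c1;                     /* setup-time fit, ms per GB and ms */
    if (b < 256.0 * 1024.0)        { c0 = 525.0; c1 = 0.014; }
    else if (b < 1024.0 * 1024.0)  { c0 = 3.2;   c1 = 0.061; }
    else                           { c0 = 3.2;   c1 = 0.19;  }
    return (b / Tx) * 1000.0 + c0 * (b / GB) + c1;
}

/* Estimated GPU-to-CPU transfer time in ms for b bytes. */
static double t_gpu_to_cpu_ms(double b) {
    double Tx = 3.1 * GB;
    double c0 = (b < 1024.0 * 1024.0) ? 210.0 : 6.9;
    double c1 = (b < 1024.0 * 1024.0) ? 0.02  : 0.21;
    return (b / Tx) * 1000.0 + c0 * (b / GB) + c1;
}

int main(void) {
    double n = 4096.0;
    /* Mv: A and b to the GPU, c back to the CPU. */
    double up   = t_cpu_to_gpu_ms(4.0 * n * n + 4.0 * n);
    double down = t_gpu_to_cpu_ms(4.0 * n);
    printf("Mv, n = %.0f: upload %.2f ms, download %.3f ms\n", n, up, down);
    return 0;
}

For the matrix sizes considered in this work (n ≥ 512), only the 1 MB-and-larger branch is exercised for the matrix transfers.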
3.5 Computation Time

For a parallelization procedure to minimize the execution time of matrix-based computations on a GPU, it is necessary to model CPU and GPU computation time. Computation time in this work refers to the time for memory accesses and computation. Determining which computations are performed by the CPU and which by the GPU requires accurate computation and communication time estimates. The execution time for matrix-based computations is defined as the sum of the communication and computation times,

    T_exec. = T_comm. + T_comp.

The computation time, T_comp., is the sum of the CPU and GPU computation times, T^CPU_comp. and T^GPU_comp., respectively, as shown in Equation (3.16):

    T_comp. = T^CPU_comp. + T^GPU_comp.    (3.16)

3.5.1 CPU

Due to CPU cache memories, branch prediction mechanisms, hyper-threading and other instruction and data latency-hiding optimizations, there is significant research in modeling CPU computation time. Since the goal of this work is to minimize the execution time of matrix-based computations on a GPU, an estimate of CPU computation time is obtained through analysis of measured data. In this work, the ATLAS [60] package is utilized for matrix-based computations executing on the CPU. ATLAS is an ongoing research project which provides C interfaces to an optimized BLAS implementation as well as some routines from LAPACK [18]. ATLAS is currently used in MAPLE, MATLAB and Octave, and is planned for use in Mathematica.

Computation time for Mv and MM is measured utilizing the ATLAS interface to the sgemv (Mv) and sgemm (MM) BLAS functions, respectively. Bandwidth is utilized to estimate CPU computation time and is calculated as the number of bytes accessed for each computation divided by the computation time. For Mv, 2n² float values are accessed, and therefore the effective bandwidth is 8n² bytes / T^CPU_comp.. From measured data, the maximum bandwidth is dependent on n and was measured between 2 GB/s and 8 GB/s. Therefore, for Mv, the bandwidth is estimated as min(max(16384/n, 2), 8) GB/s and

    T^CPU_comp. = 8n² / ( 1024³ · min(max(16384/n, 2), 8) )  seconds

for the optimized BLAS implementation of Mv. Figure 3.10 depicts the measured and estimated time for the optimized BLAS implementation of Mv on the CPU.

Figure 3.10: Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of Mv on the CPU.

The maximum and average percent error between the measured and estimated time are 3.0% and 1.4%, respectively. For MM, 2n³ float values are accessed, and therefore the effective bandwidth is 8n³ bytes / T^CPU_comp.. From measured data, the maximum bandwidth varies between 50 GB/s and 54 GB/s. From the median bandwidth measured (51 GB/s),

    T^CPU_comp. = 8n³ / ( 1024³ · 51 )  seconds

for the optimized BLAS implementation of MM. Figure 3.11 depicts the measured and estimated time for the optimized BLAS implementation of MM on the CPU.

Figure 3.11: Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of MM on the CPU.

The maximum and average percent error between the measured and estimated time are 5.6% and 3.2%, respectively.
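The two CPU-time estimates above translate directly into code. The sketch below is an illustrative C rendering of the Mv and MM expressions; the 51 GB/s median MM bandwidth and the clamped Mv bandwidth fit are the measured values quoted above, while the function names are assumptions.

#include <stdio.h>

/* Estimated CPU time, in seconds, for the optimized BLAS Mv (sgemv). */
static double t_cpu_mv(double n) {
    double bw = 16384.0 / n;            /* fitted bandwidth in GB/s          */
    if (bw < 2.0) bw = 2.0;             /* clamp to the measured 2-8 GB/s range */
    if (bw > 8.0) bw = 8.0;
    return 8.0 * n * n / (1024.0 * 1024.0 * 1024.0 * bw);
}

/* Estimated CPU time, in seconds, for the optimized BLAS MM (sgemm). */
static double t_cpu_mm(double n) {
    return 8.0 * n * n * n / (1024.0 * 1024.0 * 1024.0 * 51.0);
}

int main(void) {
    for (double n = 512; n <= 16384; n *= 2)
        printf("n = %5.0f: Mv %.4f s, MM %.2f s\n", n, t_cpu_mv(n), t_cpu_mm(n));
    return 0;
}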
Since BLAS routines do not include convolution, the time for a non-optimized C implementation is measured. For convolution, 2n²FS² float values are accessed; therefore, the effective bandwidth is calculated as 8n²FS² bytes / T^CPU_comp.. From measured data, the maximum bandwidth is estimated as ( (3·FS − n)/20000 + 2.7 ) GB/s and therefore

    T^CPU_comp. = 8n²FS² / ( 1024³ · ( (3·FS − n)/20000 + 2.7 ) )  seconds

for the non-optimized C implementation of convolution. Figure 3.12 depicts the measured and estimated time for the non-optimized C implementation of convolution with a filter size of 3 on the CPU.

Figure 3.12: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 3.

The maximum and average percent error between the measured and estimated time are 9.6% and 7.0%, respectively. Figure 3.13 depicts the time for convolution with a filter size of 63.

Figure 3.13: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 63.

The maximum and average percent error between the measured and estimated time are 4.0% and 2.1%, respectively. Figure 3.14 depicts the time for convolution with a filter size of 513.

Figure 3.14: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 513.

The maximum and average percent error between the measured and estimated time are 15.1% and 8.6%, respectively. Therefore, for the implementations of Mv, MM, and convolution, the worst-case percent error between the measured and estimated CPU computation time is 15.1%.

3.5.2 GPU

Similar to CPU computation time, GPU computation time, T^GPU_comp., is dependent on the implementation of the matrix-based computation. Since the parallelization procedure minimizes the execution time of matrix-based computations executing on a GPU, it is assumed the procedure yields the minimum computation time. Therefore, the time is estimated assuming the theoretical bandwidth to global memory. The theoretical bandwidth to global memory is calculated as the product of the memory clock rate and the data bus width. The T10 GPU has a 792 MHz memory clock rate and a 512-bit data bus. Since the memory is double data rate, the theoretical bandwidth to global memory is approximately 94 GB/s. According to NVIDIA [61], 70-80% of theoretical bandwidth "is very good" for memory-bound computations.

Since GPU computation time is estimated assuming theoretical bandwidth, it is necessary to determine the number of bytes accessed from global memory for each matrix-based computation. The number of bytes accessed is dependent on the memory utilized and the size of each memory transaction. Therefore, the GPU computation time is estimated in the last step of the parallelization procedure, after the type of memory is determined and the number and size of memory transactions are calculated.
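The theoretical-bandwidth figure used for the GPU estimate follows directly from the clock rate and bus width; a one-line check (in C, names assumed):

#include <stdio.h>

int main(void) {
    /* T10: 792 MHz memory clock, 512-bit bus, double data rate. */
    double clock_hz  = 792e6;
    double bus_bytes = 512.0 / 8.0;
    double bw_bytes  = clock_hz * bus_bytes * 2.0;     /* x2 for double data rate */
    printf("theoretical global memory bandwidth: %.1f GB/s\n",
           bw_bytes / (1024.0 * 1024.0 * 1024.0));      /* about 94 GB/s */
    return 0;
}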
Chapter 4
Parallelization Procedure

This chapter describes the development of the parallelization procedure and its impact on minimizing execution time for matrix-based computations on the GPU. The procedure is developed to be general in that it applies to any matrix-based computation. Each step of the procedure is developed through research, testing, and analysis of measured data for Mv, MM, or both. The procedure is applied to convolution and the conjugate gradient method in Chapter 5 for verification. The procedure considers the placement of data in GPU memory, computation patterns of the GPU, access patterns to GPU memory, fine-tuning of GPU code, input parameters to the GPU, and computation partitioning between the CPU and GPU. Each of these items is discussed individually in Sections 4.1 - 4.6.

In each section of this chapter, one step of the parallelization procedure is developed through analysis of Mv, MM, or both. Results are included at the end of each section to demonstrate the effectiveness of each step, and each section includes the kernel for each matrix-based computation to illustrate the application of that step of the procedure. All results in this chapter were gathered using the NVIDIA Tesla S1070, which includes a pair of T10 GPUs connected to a quad-core 2.26 GHz Intel Xeon E5520 CPU, at the Alabama Supercomputer Authority [62]. The height or width of a square matrix, n, is varied in the results from 512 to 16384 as defined by Equation (4.1) in Section 4.1. dBlk.x is varied in the results from 8 to 512, the maximum allowable value for the T10.

The first step is to determine the optimal placement of data in GPU memory. Therefore, Section 4.1 is a discussion of the three types of memory in the GPU and the appropriateness of each memory for various computations. Utilizing GPU memories other than global memory can reduce the number of accesses to global memory and therefore reduce GPU computation time. This is the first step of the procedure, as it is the most common optimization applied to GPU computations to minimize time. In addition, other steps of the procedure, such as deriving the optimized computation and access patterns and the optimal input parameters, are dependent on which memories are utilized; therefore, this is the first step. An algorithm is presented to determine the optimal placement of data in GPU memory to minimize GPU computation time.

Step 2, in Section 4.2, is a discussion of the various computation patterns determined by the kernel. The kernel defines which result(s) each thread computes, and therefore an examination is performed of various computation patterns of Mv and MM. From measured data, it is shown that for Mv, if thread_j computes one value of c, c_j, where c = Ab, the computation time is bound by the number of threads and the data size. Likewise, for MM, if thread_ij computes one value of C, C_ij, where C = AB, the computation time is bound by computation and memory accesses. Increasing the overlap of computation and memory accesses reduces the computation time. Therefore, the optimized computation pattern which reduces GPU computation time is derived. Since the access pattern is dependent on the computation pattern, this step of the procedure is included before deriving the optimized access pattern.

The optimized access pattern is dependent on the placement of data in GPU memory and the computation pattern. Therefore, Step 3 is included to derive the optimized access pattern. Access patterns to GPU memory are discussed in Section 4.3, and the optimized access pattern is derived to minimize partition camping and eliminate bank conflicts. This step is developed through analysis of Mv and MM.
For Mv, bank conflicts occur when accessing shared memory and affect computation time as shown in Section 3.2. For Mv and MM, all HWs begin accessing matrix A from the same partition of global memory; thus partition camping occurs. From Section 3.1, partition camping affects computation time. Therefore, the optimized access pattern is derived to minimize partition camping and eliminate bank conflicts.

After the optimal placement of data is determined and the optimized computation and access patterns are derived, fine-tuning of the kernels is performed to reduce the amount of computation and resource allocation, as shown in Section 4.4. This step, Step 4, is developed from analysis of MM, as the optimized computation pattern for MM utilizes many registers. Due to the register usage, this step is included in the procedure to reduce the register requirement of the kernels such that more threads execute in parallel, thus reducing the execution time. Included in this step is the minimization of index calculations. Developed from analysis of measured data for MM, reducing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions and thus reduces the computation time.

After fine-tuning the kernels, analysis of Mv and MM demonstrates the necessity of optimal input parameters. Therefore, Step 5 of the procedure is included to derive the optimal input parameters, which are the input parameters that yield the minimum GPU computation time. From measured data, computation time varies significantly depending on the number of blocks and the size of each block. Therefore, from the formulation of execution metrics in Section 3.3, optimal input parameters are derived in Section 4.5. This step of the procedure is performed after Steps 1-4 since each of the aforementioned steps affects the execution metrics; the execution metrics depend on such things as the allocation of registers and shared memory.

The last step, Step 6, of the parallelization procedure to minimize execution time is to determine the partitioning of computation between the CPU and GPU, as shown in Section 4.6. From measured data for Mv, it is shown that the communication time may exceed the CPU computation time. Therefore, optimally determining which computations are performed by the CPU and which by the GPU is included in the procedure. Optimally determining the partitioning of computation requires accurate estimates of communication and computation times, as discussed in Sections 3.4 and 3.5, respectively. An algorithm utilizing the communication and computation times is included for determining the optimal partitioning of computation.

Therefore, the parallelization procedure to minimize execution time of matrix-based computations utilizing the GPU consists of
1. determining the optimal placement of data in GPU memory,
2. deriving the optimized computation pattern which reduces computation time,
3. deriving the optimized access pattern to data in GPU memory,
4. fine-tuning GPU code to reduce computation time,
5. deriving optimal input parameters which yield the minimum computation time for the GPU, and
6. optimally partitioning computation between the CPU and GPU.

4.1 Placement of Data

To minimize GPU computation time, and thus the execution time, it is necessary to consider which GPU memories are utilized. Three types of memory exist in the T10 GPU: global, constant and shared. Global memory is the only memory available for both CPU to GPU and GPU to CPU transfers.
It is the main memory of the device due to its read/write capability, large size, accessibility by both the CPU and GPU, and its accessibility by all executing threads. It is the largest of the available memories, at 4 GB in the T10. However, it is the slowest memory, with a peak theoretical bandwidth of 94 GB/s. Since some global memory is reserved by the system, not all 4 GB are accessible. Therefore, using powers of two for n, the height or width of a square matrix, the maximum value of n is 16384. The minimum value of n in this work is determined such that one row of a matrix spans at least one global memory partition. Therefore,

    512 ≤ n ≤ 16384.    (4.1)

Constant memory is available for CPU to GPU transfers and is used for storing constant data; it is a read-only memory from the GPU's perspective. Similar to shared memory, it is much faster than global memory and can reduce computation time by reducing the number of global memory accesses. The T10 GPU includes 64 KB of constant memory that is accessible by all threads. However, since it is read-only by the GPU, it can only be used for transferring data from the CPU to the GPU. In computations such as MM, when n ≥ 512, utilizing constant memory is not possible as the matrices exceed its size. However, for Mv, constant memory can be utilized to store the vector, and thus reduce the number of global memory reads as well as the partition camping effect.

Shared memory is read/write for the GPU but not accessible to the CPU. Therefore, to utilize shared memory, the CPU must transfer data to the GPU's global memory and the GPU must transfer the data from global memory to shared memory. Shared memory is much faster than global memory and is widely used as temporary storage, or scratchpad space, for threads. Utilizing shared memory for computations such as MM can greatly reduce the computation time as many of the accesses to global memory can be eliminated: data is transferred from the CPU to the GPU's global memory and then the GPU copies portions of the data that are reused to shared memory, greatly reducing the number of global memory accesses. The T10 GPU includes 16 KB of shared memory per SM, for a total of 480 KB of shared memory. However, shared memory is allocated per SM, so threads executing on one SM cannot read or write the shared memory of another SM. Even if it were possible, with n ≥ 512 the limited size would not accommodate one matrix. However, significant reductions in computation time can be realized in computations which reuse data by partitioning the reused data into smaller blocks and then utilizing shared memory. For MM, data is reused and utilizing shared memory reduces the demand on the global memory bus, thus reducing computation time. Shared memory can also be utilized to coalesce accesses to global memory for data which is not reused, such as the matrix in Mv, and therefore reduce computation time.

From the summaries of each memory, Listing 4.1 is developed to determine the placement of data.

1 if(data_reused)
2     if(sizeof(data_reused) ≤ sizeof(cmem))
3         utilize cmem for data_reused
4     if(sizeof(data_reused) > sizeof(cmem))
5         partition data_reused into blocks ≤ sizeof(smem)
6         utilize smem for data_reused
7 else
8     if(uncoalesced memory accesses)
9         utilize smem to coalesce accesses
10    utilize gmem for data

Listing 4.1: Algorithm for determining the placement of data in GPU memory.

From Lines 7-10, data which is not reused for computation is stored in global memory (gmem). Since constant memory (cmem) and shared memory (smem) have similar access times, data which is reused (data_reused) and smaller than the maximum amount of constant memory is stored in constant memory, as shown in Lines 2 and 3. Data which is reused but larger than constant memory is partitioned into blocks no greater than the size of shared memory, as shown in Lines 4-6. However, the CPU does not have access to shared memory, so reused data must first be transferred to global memory; the GPU kernel then partitions the reused data into blocks and transfers the data from global to shared memory. Lines 8-9 refer to the access pattern to data which is not reused. If there is no data reuse, such as for A in Mv, shared memory is utilized to coalesce the uncoalesced memory accesses and thus reduce computation time.
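As an illustration only, the decision logic of Listing 4.1 can be expressed as a small host-side helper. The enum, the 64 KB constant-memory capacity of the T10, and the function name are assumptions made for this sketch; in the procedure itself the placement is decided per computation as described in the following subsections.

#include <stddef.h>

typedef enum { USE_GMEM, USE_CMEM, USE_SMEM_BLOCKED, USE_GMEM_WITH_SMEM_STAGING } Placement;

/* Schematic rendering of Listing 4.1 for one piece of data. */
static Placement place_data(size_t bytes, int reused, int coalesced)
{
    const size_t CMEM_BYTES = 64 * 1024;       /* T10 constant memory      */

    if (reused) {
        if (bytes <= CMEM_BYTES)
            return USE_CMEM;                   /* Lines 2-3 of Listing 4.1 */
        return USE_SMEM_BLOCKED;               /* Lines 4-6: tile through shared memory */
    }
    if (!coalesced)
        return USE_GMEM_WITH_SMEM_STAGING;     /* Lines 8-9: stage via shared memory */
    return USE_GMEM;                           /* Line 10                  */
}

For Mv, for example, b occupies 4n bytes, which is at most 64 KB for n ≤ 16384 and is reused, so it maps to constant memory, while A is not reused and its accesses are uncoalesced, reproducing the choices made in Section 4.1.1.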
4.1.1 Mv

Listing 4.2 is the kernel for the global memory implementation of Mv on a GPU, where each thread computes one c_j. As mentioned in Section 4.2, computation time increases as each thread computes additional values of c.

1 __global__ void Mv(float* A, float* b, float* c, int n) {
2     int index = blockIdx.x * blockDim.x + threadIdx.x;
3     float temp = 0.0;
4     for(int j = 0; j < n; j++) {
5         temp += A[index * n + j] * b[j];
6     }
7     c[index] = temp;
8 }

Listing 4.2: The global memory implementation of Mv on a GPU.

Each thread computes the index of the c_j to compute in Line 2. Partial c_js are computed by each thread in Line 5 and the final c_j is stored in global memory in Line 7. The number of reads of A and b for this kernel is defined by Equation (3.2). Assuming dBlk.x ≥ 16, the number of reads is 17n²/16. The CPU code for this implementation of Mv is shown in Listing 4.3.

1 ...
2 float* d_A;
3 cudaMalloc((void**) &d_A, sizeof(float) * n * n);
4 float* d_b;
5 cudaMalloc((void**) &d_b, sizeof(float) * n);
6 float* d_c;
7 cudaMalloc((void**) &d_c, sizeof(float) * n);
8 cudaMemcpy(d_A, h_A, sizeof(float) * n * n, cudaMemcpyHostToDevice);
9 cudaMemcpy(d_b, h_b, sizeof(float) * n, cudaMemcpyHostToDevice);
10 Mv<<< dGrd.x, dBlk.x >>>(d_A, d_b, d_c, n);
11 cudaMemcpy(h_c, d_c, sizeof(float) * n, cudaMemcpyDeviceToHost);
12 ...

Listing 4.3: CPU code for the global memory implementation of Mv.

The notation h denotes CPU (host) memory and d denotes GPU (device) memory. Lines 2-7 allocate the matrix and vectors in global memory of the GPU. Lines 8 and 9 transfer A (h_A) and b (h_b) from the CPU to global memory (d_A and d_b). Line 10 launches the kernel for execution using dGrd.x blocks and dBlk.x threads per block; pointers to A, b and c, and the size of the data, n, are passed to the kernel. Line 11 transfers c from GPU global memory, d_c, to the CPU, h_c.

Lines 2-3 of Listing 4.1, which determines the placement of data, specify that reused data is placed in constant memory if it is no larger than constant memory. b, which is reused, is of length n, and the upper limit of n is defined by Equation (4.1). Therefore, since n ≤ 16384, b is stored in constant memory. Listing 4.4 shows the modification of the kernel in Listing 4.2 necessary to utilize constant memory.

1 __constant__ float b[n];
2 __global__ void Mv(float* A, float* c, int n) {
3 ...

Listing 4.4: The constant memory implementation of Mv on a GPU.

Line 1 allocates constant memory of size n for b. Constant memory cannot be dynamically allocated, and therefore n must be defined at compile time for execution.
Pointers to A and c, which reside in global memory, and the height or width of A, n, are passed to the kernel, as shown in Line 2 of the kernel definition. To transfer b from the CPU to GPU constant memory, Line 9 of Listing 4.3 is modified as shown in Listing 4.5.

8 ...
9 cudaMemcpyToSymbol(b, h_b, sizeof(float) * n);
10 Mv<<< dGrd.x, dBlk.x >>>(d_A, d_c, n);
11 ...

Listing 4.5: CPU code for the constant memory implementation of Mv.

Line 10 of Listing 4.3 is modified such that the kernel is no longer passed a pointer to b. No other modifications to the kernel or CPU code are necessary. Utilizing constant memory for b eliminates global memory accesses to b. Therefore, the number of global memory reads for the constant memory implementation of Mv is

    Gld_32B = n².

This is a 6.25% reduction of global memory reads compared to the global memory implementation of Mv. In addition, with fewer accesses to global memory, partition camping is reduced.

For the constant memory implementation of Mv, thread_j, to compute c_j, computes row_j · b where row_j is one row of A. Therefore, the neighboring thread, thread_(j+1), computes row_(j+1) · b. An example of a HW accessing A is depicted in Figure 4.1. In this example, the HW is only 4 threads and each memory access is 4 bytes. Since the values of A needed in each iteration reside in different rows, the accesses to A are uncoalesced and the HW performs 4 reads in each iteration.

Figure 4.1: Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv. Each memory transaction is 4 bytes. The accesses to A are uncoalesced and 4 accesses are required for each iteration.

Assuming dBlk.x ≥ 16, each HW accesses 16 values of A from global memory. The 16 values of A reside in different rows of A, and therefore the accesses are uncoalesced and require 16 32B transactions¹. This is repeated n times for each HW. Coalescing the accesses to A significantly reduces the number of transactions and thus the computation time. Reducing the number of transactions also reduces the effects of partition camping.

¹32B is the minimum size of a memory transaction on the T10 GPU.

Lines 8-9 of Listing 4.1, which defines the placement of data, specify that shared memory is utilized to coalesce uncoalesced accesses for data which is not reused. Since the accesses to A in the global and constant memory implementations of Mv are uncoalesced, shared memory is utilized to coalesce the accesses, as shown in Figure 4.2. In the figure, the HW is only 4 threads.

Figure 4.2: Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv utilizing shared memory. Each memory transaction is 4 bytes. The accesses to A are coalesced and 4 accesses are required every 4th iteration.

In the first iteration, the HW performs 4 reads of 4 bytes each and stores the data in shared memory. Therefore, in the next iteration, the next value needed for computation is read from shared memory instead of global memory. This reduces the number of global memory accesses by a factor of 4. Listing 4.6 is the kernel for the shared memory implementation of Mv.
1 __constant__ float b[n];
2 __global__ void Mv(int n, float* A, float* c) {
3     __shared__ float As[dBlkx][dBlkx];
4     int index = blockIdx.x * blockDim.x;
5     float temp = 0.0;
6     for(int j = 0; j < n; j += blockDim.x) {
7         for(int k = 0; k < blockDim.x; k++) {
8             As[k][threadIdx.x] = A[(index + k) * n + threadIdx.x + j];
9         }
10        __syncthreads();
11        for(int k = 0; k < blockDim.x; k++) {
12            temp += As[threadIdx.x][k] * b[j + k];
13        }
14        __syncthreads();
15    }
16    c[index + threadIdx.x] = temp;
17 }

Listing 4.6: The shared memory implementation of Mv on a GPU. Constant memory is utilized for b.

In Line 3, shared memory of size dBlk.x² is allocated per block. Therefore, for this kernel, the amount of shared memory per block is

    SMemPerBlk = dBlk.x² · 4.    (4.2)

The first row each block accesses from A is calculated in Line 4. In Lines 7-9, HWs load the values of A into shared memory in a coalesced manner. Line 10 is a synchronization instruction: HWs in a block stop executing at this instruction until all HWs within the block reach it. This ensures that shared memory is loaded before the following computation. In Lines 11-13, threads utilize the shared memory values to compute a portion of c_j. Line 14 ensures each HW completes the current iteration of computation before shared memory is reloaded. Similar to constant memory, shared memory cannot be allocated dynamically, and therefore dBlk.x must be defined at compile time (denoted dBlkx in the kernel).

As mentioned, the shared memory implementation of Mv, Listing 4.6, significantly reduces the number of global memory accesses by coalescing the accesses to A. The number of global memory accesses for the shared memory implementation of Mv is

    Gld_32B = n² / dBlk.x    if dBlk.x ≤ 8

and

    Gld_64B = n² / 16        if dBlk.x ≥ 16.

Assuming dBlk.x ≥ 16, the total number of reads for this implementation is n²/16, which is 16 times fewer than the number of reads for the constant memory implementation. Therefore, the shared memory implementation of Mv significantly reduces the computation time compared to the global and constant memory implementations.

Figure 4.3 depicts the maximum effective bandwidth of the global memory implementation of Mv (Listing 4.2) and the shared and constant memory implementation (Listing 4.6). Effective bandwidth is calculated as 8n² bytes / T^GPU_comp., as discussed in Section 3.5.

Figure 4.3: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.

As illustrated in the figure, the shared and constant memory implementation yields a significant speedup compared to the global memory implementation. As mentioned, utilizing shared memory for Mv coalesces memory accesses to global memory, and utilizing constant memory reduces the number of global memory transactions and therefore reduces partition camping. The shared and constant memory implementation yields a speedup, on average, of 4.79 compared to the global memory implementation. The minimum (n = 4096) and maximum (n = 16384) speedups are 2.51 and 10.58, respectively. Therefore, for Mv, this step of the procedure significantly reduces GPU computation time.
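Equation (4.2) ties the block shape directly to occupancy: the per-block shared-memory requirement grows with dBlk.x², and by Equation (3.7) it limits how many blocks an SM can hold. The short sketch below is an illustration rather than part of the procedure; it prints the requirement for the power-of-two block widths used in this work, and with 16 KB of shared memory per SM on the T10 it shows that dBlk.x = 64 is the largest width the kernel in Listing 4.6 can use.

#include <stdio.h>

int main(void) {
    const long smem_per_sm = 16 * 1024;          /* T10: 16 KB shared memory per SM */
    /* Equation (4.2): SMemPerBlk = dBlk.x^2 * 4 for the kernel in Listing 4.6. */
    for (long dBlk_x = 8; dBlk_x <= 512; dBlk_x *= 2) {
        long smem_per_blk = dBlk_x * dBlk_x * 4;
        printf("dBlk.x = %3ld: SMemPerBlk = %7ld bytes, blocks per SM (smem limit) = %ld\n",
               dBlk_x, smem_per_blk,
               smem_per_blk <= smem_per_sm ? smem_per_sm / smem_per_blk : 0);
    }
    return 0;
}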
4.1.2 MM

Listing 4.7 is the kernel for the global memory implementation of MM, where thread_ij computes C_ij.

1 __global__ void MM(float* A, float* B, float* C, int n) {
2     int Row = blockIdx.y * blockDim.y + threadIdx.y;
3     int Col = blockIdx.x * blockDim.x + threadIdx.x;
4     float temp = 0.0;
5     for(int j = 0; j < n; j++) {
6         temp += A[Row * n + j] * B[j * n + Col];
7     }
8     C[Row * n + Col] = temp;
9 }

Listing 4.7: The global memory implementation of MM on a GPU where each thread computes one C_ij.

Pointers to A, B and C and the height or width of each matrix, n, are passed to the kernel in Line 1. Lines 2 and 3 compute the row and column indices of the C_ij computed by each thread, respectively. The x- and y-indices of each block are specified by blockIdx.x and blockIdx.y, respectively; block indices in the x- and y-dimensions range from 0 to dGrd.x − 1 and 0 to dGrd.y − 1, respectively. The x- and y-indices of each thread are specified by threadIdx.x and threadIdx.y, respectively; thread indices in the x- and y-dimensions range from 0 to dBlk.x − 1 and 0 to dBlk.y − 1, respectively. The x- and y-dimensions of each block are specified by blockDim.x (dBlk.x) and blockDim.y (dBlk.y), respectively. In Line 6, each thread computes a partial C_ij, and the final C_ij is stored in global memory in Line 8. The number of global memory reads for the global memory implementation of MM is defined by Equations (3.3) and (3.4). Assuming dBlk.x ≥ 16, all accesses to A and B are coalesced and 2n³/16 accesses are performed.

Lines 4-6 of Listing 4.1, which determines the placement of data, specify that shared memory is utilized for reused data which is larger than constant memory. Line 5 specifies that the reused data is partitioned into blocks smaller than the size of shared memory. Therefore, A and B are partitioned into blocks of shared memory to reduce the number of global memory accesses, as shown in Listing 4.8. Lines 3 and 4 allocate shared memory for blocks of A and B. Therefore, for this kernel,

    SMemPerBlk = (dBlk.x · dBlk.y + dBlk.x²) · 4.    (4.3)

1 __global__ void MM(float* A, float* B, float* C, int n,
2                    int nele_x, int nele_y) {
3     __shared__ float As[dBlky][dBlkx];
4     __shared__ float Bs[dBlkx][dBlkx];
5     int Row = blockIdx.y * blockDim.y + threadIdx.y;
6     int Col = blockIdx.x * blockDim.x + threadIdx.x;
7     float temp = 0.0;
8     for(int j = 0; j < n; j += blockDim.x) {
9         As[threadIdx.y][threadIdx.x] = A[Row * n + (j + threadIdx.x)];
10        Bs[threadIdx.y][threadIdx.x] = B[(j + threadIdx.y) * n + Col];
11        __syncthreads();
12        for(int k = 0; k < blockDim.x; k++) {
13            temp += As[threadIdx.y][k] * Bs[k][threadIdx.x];
14        }
15        __syncthreads();
16    }
17    C[Row * n + Col] = temp;
18 }

Listing 4.8: The shared memory implementation of MM on a GPU.

Lines 9 and 10 load blocks of A and B into shared memory. This kernel assumes dBlk.x = dBlk.y; slight modifications to Line 10 are required for loading B if dBlk.x ≠ dBlk.y. Line 11 forces all threads within a block to wait until shared memory is loaded by all HWs within that block. Partial C_ijs are computed in Line 13. Line 15 forces all threads within a block to wait until all HWs have performed their respective computations with the current data in shared memory before new data is loaded. Calculating the number of reads for the shared memory implementation of MM in Listing 4.8 yields

    Gld_32B = (n³ / dBlk.x) · (1/dBlk.x + 1/dBlk.y)    if dBlk.x ≤ 8    (4.4)

and

    Gld_64B = (n³ / 16) · (1/dBlk.x + 1/dBlk.y)        if dBlk.x ≥ 16.    (4.5)

If dBlk.x ≥ 16, all global memory accesses are 64B, and the number of accesses to global memory is reduced by a factor of 2·dBlk.x·dBlk.y / (dBlk.x + dBlk.y) compared to the global memory implementation.
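As a quick check of the reduction factor quoted above, the sketch below evaluates Equations (3.3)-(3.4) and (4.5) for a few block shapes (plain C, names and example values assumed); for a 16 x 16 block the shared-memory kernel performs 16 times fewer global memory reads.

#include <stdio.h>

int main(void) {
    double n = 2048.0;
    double shapes[][2] = { {16, 16}, {32, 8}, {64, 4} };

    for (int i = 0; i < 3; i++) {
        double bx = shapes[i][0], by = shapes[i][1];
        /* Global-memory MM, dBlk.x >= 16: Equations (3.3) and (3.4). */
        double reads_gmem = n * n * n / 16.0 + n * n * n / 16.0;
        /* Shared-memory MM, dBlk.x >= 16: Equation (4.5) (all reads are 64B). */
        double reads_smem = n * n * n / 16.0 * (1.0 / bx + 1.0 / by);
        printf("dBlk = %2.0fx%-2.0f: %.3g -> %.3g reads (factor %.1f)\n",
               bx, by, reads_gmem, reads_smem, reads_gmem / reads_smem);
    }
    return 0;
}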
Figure 4.4 depicts the maximum effective bandwidth of the global memory implementation of MM (Listing 4.7) and the shared memory implementation (Listing 4.8). For MM, effective bandwidth is calculated as 8n³ bytes / T^GPU_comp., as discussed in Section 3.5.

Figure 4.4: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.

Similar to Mv, the shared memory implementation yields a significant speedup compared to the global memory implementation. Utilizing shared memory reduces the number of global memory transactions through data reuse, thus reducing computation time and increasing effective bandwidth. The shared memory implementation yields a speedup, on average, of 7.32 compared to the global memory implementation. The minimum (n = 16384) and maximum (n = 512) speedups are 5.96 and 10.46, respectively. Therefore, for MM, Step 1 of the procedure significantly reduces GPU computation time.

4.2 Computation Patterns

Computation patterns are determined by the code and represent the manner and order in which results are computed. For Mv, if thread_j computes one value of c, c_j, where c = Ab, there are no varying computation patterns. Likewise, for MM, if thread_ij computes one value of C, C_ij, where C = AB, there are no varying computation patterns. However, if threads compute more or fewer than one value, the order in which threads compute creates computation patterns.

Since partition camping is affected by the computation pattern, a derivation of the computation pattern which minimizes partition camping is necessary. In addition, computation patterns can affect the number of accesses to global memory. Therefore, it is necessary to derive the optimal computation patterns for matrix-based computations. Deriving the optimal pattern requires an examination of the various patterns.

4.2.1 Mv

For Mv, there are two computation patterns if thread_j computes multiple values of c, as depicted in Figure 4.5. A computation pattern where each thread computes neighboring values of c, grouped, is depicted in Figure 4.5a. A computation pattern where each thread computes values of c which are not neighboring, spread, is depicted in Figure 4.5b.

Figure 4.5: Computation patterns: two computation patterns for Mv for computing multiple c_js. (a) Grouped. (b) Spread.

Figure 4.6 depicts the average computation time for Mv on the T10 GPU if each thread computes one or more values of c utilizing the spread computation pattern. dGrd.x · dBlk.x is the total number of threads. Each thread computes at least one value of c, and therefore the maximum number of total threads is n. The average computation time decreases as the number of total threads increases. The minimum time is achieved, for all data sizes tested, when the number of total threads equals n. This is due to the memory bus not being fully utilized when dGrd.x · dBlk.x ≤ 8192.²

²Discussion of the number of threads necessary to saturate the memory bus is in Section 4.5.

Figure 4.6: Average computation time (ms) varying the number of total threads for Mv on the T10 GPU.

Optimized

To increase the number of total threads, it is necessary that multiple threads be utilized to compute one c_j.
Therefore, the optimized computation pattern consists of each thread computing a part of c_j, as depicted in Listing 4.9.

1 __constant__ float b[n];
2 __global__ void Mv(float* A, float* c, int n) {
3     __shared__ float As[dBlkx][dBlkx];
4     float temp = 0.0;
5     int index = blockIdx.x * blockDim.x;
6     for(int j = 0; j < n; j += blockDim.x) {
7         for(int k = 0; k < blockDim.x / blockDim.y; k++) {
8             As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] =
                  A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + threadIdx.x + j];
9         }
10        __syncthreads();
11        for(int k = 0; k < blockDim.x / blockDim.y; k++) {
12            temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y]
                      * b[j + k + threadIdx.y * blockDim.x / blockDim.y];
13        }
14        __syncthreads();
15    }
16    if(threadIdx.y != 0) {
17        As[threadIdx.y-1][threadIdx.x] = temp;
18    }
19    __syncthreads();
20    if(threadIdx.y == 0) {
21        for(int k = 0; k < blockDim.y-1; k++) {
22            temp += As[k][threadIdx.x];
23        }
24        c[index + threadIdx.x] = temp;
25    }
26 }

Listing 4.9: The shared memory implementation of Mv on a GPU utilizing the optimized computation pattern. Constant memory is utilized for b.

In Listing 4.9, each thread computes a part of a c_j and each block computes dBlk.x c_js. Shared memory is allocated in Line 3 and therefore

    SMemPerBlk = dBlk.x² · 4.

In Line 5, index is calculated, which specifies the group of c_js computed by each block. Lines 7-9 load parts of A into shared memory. All threads within a block are synchronized in Line 10 to ensure shared memory is loaded before computation. Each thread computes a part of a c_j in Lines 11-13. Threads are again synchronized in Line 14 to prevent shared memory from being reloaded before each thread completes computation with the current data. At the end of the for-loop in Line 15, each thread holds a part of a c_j in temp. Threads with threadIdx.y ≠ 0 store their parts of c_j into shared memory; the shared memory utilized for portions of A is reused for the parts of c_j computed by each thread. Threads with threadIdx.y = 0 sum the parts of c_j stored in shared memory in Lines 20-23. Lastly, in Line 24, these threads store the final c_js in c in global memory.

Since b is stored in constant memory, global memory reads are issued only for accessing A. In Line 8, the number of accesses to A by each HW is dependent on dBlk.x, dBlk.y, and threadIdx.y. Therefore,

    Gld_32B = n² / dBlk.x    if dBlk.x ≤ 8    (4.6)

and

    Gld_64B = n² / 16        if dBlk.x ≥ 16    (4.7)

which is equivalent to the number of accesses in the shared memory implementation of Mv without the optimized computation pattern (Listing 4.6).

Figure 4.7 illustrates the maximum effective bandwidth of Mv before (Listing 4.6) and after (Listing 4.9) utilizing the optimized computation pattern. As discussed in Section 3.5, for Mv, effective bandwidth is calculated as 8n² bytes / T^GPU_comp..

Figure 4.7: Comparison of computation patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.

Since the optimized computation pattern increases the number of total threads, the effective bandwidth is significantly increased. The optimized computation pattern yields a speedup, on average, of 1.80 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 4096) speedups are 1.59 and 2.23, respectively.
Therefore, Step 2 of the parallelization procedure, deriving the optimized computation pattern, is valid and necessary for minimizing GPU computation time of Mv. 4.2.2 MM For MM, rather than Mv, increasing the number of total threads does not necessarily yield the minimum computation time due to saturation of the memory bus. With a satu- rated memory bus, the e?ects of fragmentation and partition camping are more apparent. Therefore, if threadij computes more than Cij, the computation pattern, or order in which the threads compute results, a?ects the computation time. Four computation patterns for MM are illustrated in Figure 4.8 if each thread com- putes multiple Cijs. Figure 4.8a depicts a computation pattern where each thread computes (a) Grouped. (b) Column-major. (c) Row-major. (d) Hybrid. Figure 4.8: Computation patterns: four computation patterns for MM for computing mul- tiple Cijs. neighboring Cijs, de?ned as grouped. Figures 4.8b and c, de?ned as column-major and row- major, respectively, depict computation patterns where each thread computes multiple Cijs per column or row, respectively. Merging column- and row-major patterns yields the last computation pattern, de?ned as hybrid, as depicted in Figure 4.8d. Grouped The grouped computation pattern, depicted in Figure 4.8a, causes a reduction of mem- ory coalescing. Coalescing as many memory reads as possible at the HW level is ideal to minimize the computation time. Considering an example where each thread is to compute 60 4 Cijs in a grouped pattern, Cij, Ci(j+1), C(i+1)j and C(i+1)(j+1), to compute the ?rst Cij, threadij accesses one row of A, rowi and one column of B, colj. Since threads are assigned to warps in row-major order, the neighboring thread of threadij, threadi(j+1), accesses the same row of A, rowi, but a di?ering column of B, colj+2 to compute Ci(j+2) 3. In general, each HW read will only use 1x of the bytes where x is the number of Cijs computed in the x-dimension, thus reducing the number of coalesced memory accesses to B by a factor of x. Therefore, this computation pattern is not included in the results as it demonstrates the e?ects of memory coalescing more than partition camping. Column-major The row- and column-major computation patterns do not cause a reduction of mem- ory coalescing, as the grouped pattern does, if each block has at least 8 threads in the x-dimension, dBlk:x ? 8. For the column-major computation pattern, depicted in Figure 4.8b, threadij computes C(i+y?dGrd:y?dBlk:y)j for y = 0 to nele y where nele y is the number of Cijs to compute. For this pattern, colj is repeatedly read by threadij while the row that is read varies from i to (nele y?dGrd:y ?dBlk:y). If n%512 = 0, all accesses to colj remain in the same partition. Row-major For the row-major computation pattern, depicted in Figure 4.8c, threadij computes Ci(j+x?dGrd:x?dBlk:x) for x = 0 to nele x where nele x is the number of Cijs to compute. There- fore, to compute Cij to Ci(j+dBlk:x?1), threadij to threadi(j+dBlk:x?1) read rowi and the column reads vary in a coalesced pattern from colj to colj+dBlk:x?1. However, for computing addi- tional Cijs, the accesses to each partition for reading B vary depending on which Cij is being computed. Accesses to each partition for reading A always vary as the program exe- cution continues as it is not possible for an entire row of A to be stored in one partition if 3threadi(j+1) computes Ci(j+2) instead of Ci(j+1) because of the grouped computation pattern. 
In this case, threadij is computing Ci(j+1) as well as C(i+1)j and C(i+1)(j+1). 61 n > 8 without transposing A. However, an entire column of B is stored in one partition if n%512 = 0. For this computation pattern, as the program execution continues, each HW is accessing varying partitions to read B dependent on the current Cij being computed. Hybrid For the hybrid computation pattern, depicted in Figure 4.8d, it is necessary to determine the order each Cij is computed. Two computation patterns are de?ned and illustrated in Figure 4.9: hybrid-column and hybrid-row. The hybrid-column computation pattern allows each thread to compute multiple Cijs in each row and column but computes each ?rst in the y-dimension. The hybrid-row is similar except each thread computes each Cij ?rst in the x- dimension. Therefore, the column-major computation pattern is a subset of hybrid-column, and the row-major is a subset of hybrid-row. Because of this, hybrid-column and hybrid-row computation patterns are presented in the results, and row- and column-major are not. (a) Hybrid-column. (b) Hybrid-row. Figure 4.9: Computation patterns: hybrid computation patterns for MM for computing multiple Cijs. Similar to row- and column-major computation patterns, both hybrid patterns, de?ned in Figure 4.9, bene?t from memory coalescing if dBlk:x ? 8. If (dGrd:x?dBlk:x)%512 = 0, then all Cijs computed by threadij will be in the same partition. Therefore, with regards to the distribution of the HWs to all partitions, and from the GPU?s perspective, the hybrid- row and hybrid-column patterns are identical. In addition, in this case, both patterns are identical to the column-major pattern. However, if (dGrd:x ? dBlk:x)%512 ? 0, there is a di?erence between the hybrid-row and hybrid-column pattern. In both, threadij computes C(i+y?dGrd:y?dBlk:y)(j+x?dGrd:x?dBlk:x) 62 for y = 0 to nele y and x = 0 to nele x where nele x and nele y are the number of Cijs to compute in the x- and y-dimensions, respectively. If all partitions are not initially accessed by the HWs, the hybrid-row pattern accesses the unused partitions quicker than the hybrid- column pattern thus reducing partition camping. Optimized (column-major) Memory coalescing and the order of computation are considered to derive the optimized computation pattern for MM. The row- and column-major patterns, in addition to the hybrid patterns, do not reduce memory coalescing as the grouped pattern does. From the kernel in Listing 4.8, the shared memory implementation of MM on a GPU, global memory is ac- cessed in Lines 9 and 10. Computation is then performed in Line 13 before reloading shared memory. In this pattern, little overlap exists between accessing global memory and compu- tation. Therefore, the optimized computation pattern consists of an algorithm maximizing the overlap of global memory accesses and computation such as the one described in [63] and utilized in [19], [20], and [21]. In this implementation, only one matrix is placed in shared memory. In addition, the aforementioned implementations assume column-major storage for matrices. In this work, row-major storage is assumed due to the row-major storage of the GPU. Although the algorithm for MM is similar to the aforementioned implementations, other steps of the parallelization procedure augment the algorithm to further minimize GPU computation time. Utilizing this algorithm and the optimized computation pattern depicted in Figure 4.8b yields Listing 4.10. 
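As a reading aid for that listing, the following host-side sketch (illustrative only, not one of the numbered listings in this chapter) enumerates the elements of C produced by a single thread; the index expressions are copied from the kernel's output indexing, and all input values are examples.

    #include <stdio.h>

    /* Which elements of C one thread produces under the column-major pattern
     * used by Listing 4.10: one column of C per thread, n_ele rows per thread. */
    static void thread_outputs(int n, int n_ele,
                               int blockDimX, int blockDimY,
                               int blockIdxX, int blockIdxY,
                               int threadIdxX, int threadIdxY)
    {
        /* Column of C owned by this thread (same expression as the kernel). */
        int col = blockIdxX * blockDimX * blockDimY
                + threadIdxY * blockDimX + threadIdxX;
        /* n_ele results per thread, one per row of the slab owned by blockIdx.y. */
        for (int i = 0; i < n_ele; i++) {
            int row = blockIdxY * n_ele + i;
            printf("C[%d][%d]  (linear index %d)\n", row, col, row * n + col);
        }
    }

    int main(void)
    {
        /* Example: n = 512, n_ele = 16, 8x16 blocks, block (1,2), thread (3,0). */
        thread_outputs(512, 16, 8, 16, 1, 2, 3, 0);
        return 0;
    }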
In this implementation of MM, all memory accesses are coalesced since the column-major computation pattern is utilized. Shared memory is allo- cated for parts of A in Line 2. Therefore, the amount of shared memory allocated is modi?ed from Equation (4.3) to yield SMemPerBlk = (n ele?dBlk:x)?4 (4.8) 63 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 for(int i = 0; i < n_ele; i++) { 5 temp[i] = 0.0; 6 } 7 for(int j = 0; j < n / blockDim.x; j++) { 8 for(int i = 0; i < n_ele / blockDim.y; i++) { 9 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n rcurvearrowse + j * blockDim.x)]; 10 } 11 __syncthreads(); 12 for(int i = 0; i < blockDim.x; i++) { 13 for(int k = 0; k < n_ele; k++) { 14 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 15 } 16 } 17 __syncthreads(); 18 } 19 for (int i = 0; i < n_ele; i++) { 20 C[i * n + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + threadIdx.x + blockIdx.y * n_ele * n] = temp[i]; 21 } 22 } Listing 4.10: The shared memory implementation of MM on a GPU utilizing the optimized computation pattern. 64 where n ele is the number of Cijs computed per thread. In Lines 3-6, each thread allocates and initializes temporary variables for each Cij being computed. A portion of A is loaded into shared memory in Lines 8-10. This implementation assumes dBlk:y ? n ele. Slight modi?cations are made to loading A into shared memory in Lines 8-10 if dBlk:y > n ele. Threads within a block synchronize in Line 11 to ensure A is loaded. In Lines 13-15, each thread accesses B once and performs computation with that value and shared memory n ele times. B is not read in every iteration of k in Line 14 as the access to B is independent of the value of k. Therefore, the overlap between accessing global memory and computation is greatly increased. In addition, this implementation requires that only portions of one matrix be loaded into shared memory thus reducing the amount of shared memory necessary. The number of global memory accesses to A and B is dependent on n, dBlk:x, dBlk:y, and n ele. After simplifying, Gld32B = n 3 dBlk:x parenleft.alt3 1 dBlk:x?dBlk:y + 1 n eleparenright.alt3 if dBlk:x ? 8 (4.9) and Gld64B = n 3 16 parenleft.alt3 1 dBlk:x?dBlk:y + 1 n eleparenright.alt3 if dBlk:x ? 16: (4.10) Therefore, the number of global memory accesses is reduced by a factor of n ele(dBlk:x+dBlk:y)n ele+dBlk:x?dBlk:y compared to the shared memory implementation of MM without the optimized computation pattern (Listing 4.8). Figure 4.10 depicts maximum e?ective bandwidth of MM before (Listing 4.8) and af- ter (Listing 4.10) utilizing the optimized computation pattern. The e?ective bandwidth is calculated in Section 3.5 as 8n3bytesTGPU comp: . 65 512 1024 2048 4096 8192 16384500 600 700 800 900 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.10: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. Similar to Mv, the optimized computation pattern yields a signi?cant speedup. In the ?gure, the optimized computation pattern yields a speedup, on average, of 1.32 compared to the non-optimized pattern. The speedup is due to the reuse of data by the optimized computation pattern. 
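The reduction in global memory traffic given by Equations (4.9) and (4.10), and the reduction factor quoted above, can be checked numerically. The host-side sketch below is illustrative only; the block dimensions and n_ele are example values (assumptions for illustration), not measured configurations.

    #include <stdio.h>

    int main(void)
    {
        /* Example values (assumptions for illustration only). */
        double n = 4096.0, dBlkx = 16.0, dBlky = 4.0, n_ele = 16.0;

        /* Eq. (4.10): 64B global loads for the optimized kernel when dBlk.x >= 16. */
        double gld64 = (n * n * n / 16.0) * (1.0 / (dBlkx * dBlky) + 1.0 / n_ele);

        /* Reduction factor vs. Listing 4.8:
         * n_ele*(dBlk.x + dBlk.y) / (n_ele + dBlk.x*dBlk.y). */
        double reduction = n_ele * (dBlkx + dBlky) / (n_ele + dBlkx * dBlky);

        printf("estimated 64B loads: %.3e  reduction factor: %.2f\n",
               gld64, reduction);
        return 0;
    }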
As mentioned, the optimized computation pattern reduces the number of global memory transactions and therefore, the e?ective bandwidth is increased. The minimum (n = 4096) and maximum (n = 512) speedups are 1.28 and 1.41, respectively. Therefore, Step 2 of the parallelization procedure is valid and necessary for reducing GPU computation time of MM. 4.3 Access Patterns The third step of the parallelization procedure speci?es that the optimized access pattern to data in memory be derived. The optimized access patterns are derived to minimize partition camping (Section 3.1) and eliminate bank con?icts (Section 3.2). Therefore, the GPU computation time is reduced. For matrix-based computations such as Mv and MM, all HWs begin by initially reading the ?rst value of their respective rows of A. Therefore, distributing the accesses to A at the block level to all partitions reduces partition camping. Figure 4.11a, assuming n ? 512, 66 illustrates an access pattern to A of 8 blocks for Mv or MM. At the beginning of execution, blocki reads the ?rst value of rowi from A, which is stored in partition 0. As execution continues, blocki accesses rowi of A where rowi is stored across all partitions. However, since all blocks begin accessing the ?rst value of A, all blocks begin reading from partition 0 and thus, partition camping occurs. Therefore, an optimized access pattern which minimizes partition camping for accessing A is derived by distributing blocks to all partitions at the beginning of execution. This is illustrated in Figure 4.11b. 0 1 2 3 4 5 6 7 partition block 1 block 2 block 3 block 4 block 5 block 6 block 7 block 0 (a) Standard access pattern. 0 1 2 3 4 5 6 7 partition block 1 block 2 block 3 block 4 block 5 block 6 block 7 block 0 (b) Optimized access pattern. Figure 4.11: Access patterns: example of block-level access patterns to A for matrix-based computations. Each column is one partition of global memory. Each row is 512 values of type ?oat. 4.3.1 Mv To implement Figure 4.11b for the shared memory implementation of Mv utilizing the optimized computation pattern, the kernel in Listing 4.9 is modi?ed to yield the kernel in Listing 4.11. This kernel utilizes the optimized access pattern. In Line 6, the partition from which each block begins accessing A, P, is calculated. The for-loop from Line 6 in Listing 4.9 is split into 2 for-loops, Line 7 and 17 of Listing 4.11, such that blocks begin accessing A from varying partitions, P. P is computed using blockId:x mod 8 since there are 8 global memory partitions. This is multiplied by 64 since each partition is 256 bytes wide and values are of type ?oat. The allocation of shared memory in Line 3 of Listing 4.11 is greater than the allocation of shared memory in Listing 4.9. 
Therefore, Equation (4.2), which de?nes the amount of 67 1 __constant__ float b[n]; 2 __global__ void Mv(float* A, float* c, int n) { 3 __shared__ float As[blockDim.x][blockDim.x+1]; 4 float temp = 0.0; 5 int index = blockIdx.x * blockDim.x; 6 int P = (blockIdx.x % 8) * 64; 7 for(int j = P; j < n; j+= blockDim.x) { 8 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 9 As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] = rcurvearrowse A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + rcurvearrowse threadIdx.x + j]; 10 } 11 __syncthreads(); 12 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 13 temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y] rcurvearrowse * b[j + k + threadIdx.y * blockDim.x / blockDim.y]; 14 } 15 __syncthreads(); 16 } 17 for(int j = 0; j < P; j+= blockDim.x) { 18 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 19 As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] = rcurvearrowse A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + rcurvearrowse threadIdx.x + j]; 20 } 21 __syncthreads(); 22 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 23 temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y] rcurvearrowse * b[j + k + threadIdx.y * blockDim.x / blockDim.y]; 24 } 25 __syncthreads(); 26 } 27 if(threadIdx.y != 0) { 28 As[threadIdx.y-1][threadIdx.x] = temp; 29 } 30 __syncthreads(); 31 if(threadIdx.y == 0) { 32 for(k = 0; k < blockDim.y-1; k++) { 33 temp += As[k][threadIdx.x]; 34 } 35 c[index + threadIdx.x] = temp; 36 } 37 } Listing 4.11: The shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern. Constant memory is utilized for b. 68 shared memory used per block for the shared memory implementation of Mv utilizing the optimized computation pattern is modi?ed to yield SMemPerBlk = (dBlk:x2 +dBlk:x)?4: (4.11) Although dBlk:x2 ? 4 bytes of shared memory are necessary for the kernel in Listing 4.11, (dBlk:x2+dBlk:x)?4 bytes are allocated. The additional allocation in the kernel is to reduce bank con?icts. An example of bank con?icts for Mv is depicted in Figures 4.12-4.15. In the ?gures, dBlk:x = 16. Figure 4.12 depicts shared memory being loaded with A for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. This implementation utilizes dBlk:x2 ?4 bytes of shared memory. Since the threads within a HW access di?erent banks of shared memory, no bank con?icts occur for loading shared memory. 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values stored in shared memory for iteration k bank bank bank iteration 2 ...iteration 1iteration 0 Figure 4.12: Example of loading A into shared memory for the shared memory implementa- tion of Mv utilizing the optimized computation pattern but not the optimized access pattern. No bank con?icts occur. dBlk:x = 16. However, for reading the values from shared memory, the HW accesses varying rows and one column of shared memory in each iteration as depicted in Figure 4.13. As illustrated, each value resides in the same shared memory bank and thus bank con?icts occur. 69 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 bank bank bank values read from shared memory for iteration k iteration 2 ...iteration 1iteration 0 Figure 4.13: Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. 
Bank con?icts occur. dBlk:x = 16. Figure 4.14 depicts shared memory being loaded with A for the shared memory imple- mentation of Mv that utilizes the optimized computation and access pattern. (dBlk:x2 + dBlk:x)?4 bytes of shared memory are allocated. Since the threads within a HW access all shared memory banks, no bank con?icts occur. 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values stored in shared memory for iteration k bank bank bank iteration 2 ...iteration 1iteration 0 Figure 4.14: Example of loading A into shared memory for the shared memory implementa- tion of Mv utilizing the optimized computation and access pattern. No bank con?icts occur. dBlk:x = 16. For reading the values from shared memory, the HW accesses varying rows and the same column of shared memory in each iteration as depicted in Figure 4.15. As illustrated, each value resides in a di?erent shared memory bank and therefore, no bank con?icts occur. 70 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values read from shared memory for iteration k bank bank bank iteration 0 iteration 1 iteration 2 ... Figure 4.15: Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank con?icts occur. dBlk:x = 16. Therefore, the optimized access pattern for Mv reduces partition camping and eliminates bank con?icts. Figure 4.16 depicts maximum e?ective bandwidth of Mv before (Listing 4.9) and after (Listing 4.11) utilizing the optimized access pattern. E?ective bandwidth is calculated as 8n2bytesTGPU comp: as discussed in Section 3.5. 512 1024 2048 4096 8192 163840 50 100 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.16: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for Mv on the T10 GPU. In the ?gure, the optimized access pattern yields a speedup, on average, of 2.28 compared to the non-optimized pattern. The signi?cant speedup is due to the optimized access pattern reducing partition camping and eliminating bank con?icts. The minimum (n = 512) and 71 maximum (n = 2048) speedups are 2.06 and 2.81, respectively. The signi?cant increase in e?ective bandwidth shows the validity and necessity of Step 3 of the procedure for Mv, deriving the optimized access pattern. 4.3.2 MM To reduce partition camping by implementing Figure 4.11b for the shared memory implementation of MM utilizing the optimized computation pattern, the kernel in Listing 4.10 is modi?ed to yield Listing 4.12. This kernel utilizes the optimized access pattern. In Listing 4.12, dBlk:y ? n ele. Slight modi?cations are performed to Lines 9-11 and 21-23 for loading A if dBlk:y > n ele. In this kernel, additional computation is required compared to Listing 4.10 to determine which value of A each block reads at the beginning of execution. Line 7 of Listing 4.12 assigns P, a variable used to specify which partition is accessed for reading A at the beginning of execution. To calculate P, a linear block index is calculated as blkIdx:x+dGrd:x?blkIdx:y. The linear block index mod 8, the number of partitions, is calculated such that blocks are assigned to each partition at the beginning of execution in round-robin order. This is multiplied by 64blockDim:x since there are 64 values of type ?oat per row of each partition. This kernel utilizes the same amount of shared memory as the kernel in Listing 4.8. 
Therefore, Equation (4.8) de?nes the amount of shared memory utilized per block. Splitting the for-loop in Line 7 of Listing 4.10 as shown in Lines 8-20 of Listing 4.12 distributes the blocks to all partitions for reading A at the beginning of execution. Therefore, this optimized access pattern reduces partition camping and thus reduces the computation time. Shared memory is allocated in Line 2 of Listing 4.12 the same as in Line 2 of Listing 4.10. Therefore, the amount of shared memory allocated per block is de?ned by Equation (4.8). In Lines 10 and 22, values from A are loaded into shared memory. If dBlk:x ? 16, threads within a HW have the same value for threadIdx:y. Therefore, no bank con?icts occur for loading shared memory as all banks are utilized. When reading from As in Lines 15 and 72 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 for(int i = 0; i < n_ele; i++) { 5 temp[i] = 0.0; 6 } 7 int P = ((blockIdx.x + gridDim.x * blockIdx.y) % 8) * (64 / blockDim.x); 8 for(int j = P; j < n / blockDim.x; j++) { 9 for(int i = 0; i < n_ele / blockDim.y; i++) { 10 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n + j * rcurvearrowse blockDim.x)]; 11 } 12 __syncthreads(); 13 for(int i = 0; i < blockDim.x; i++) { 14 for(int k = 0; k < n_ele; k++) { 15 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 16 } 17 } 18 __syncthreads(); 19 } 20 for(int j = 0; j < P; j++) { 21 for(int i = 0; i < n_ele / blockDim.y; i++) { 22 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n + j * rcurvearrowse blockDim.x)]; 23 } 24 __syncthreads(); 25 for(int i = 0; i < blockDim.x; i++) { 26 for(int k = 0; k < n_ele; k++) { 27 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 28 } 29 } 30 __syncthreads(); 31 } 32 for (int i = 0; i < n_ele; i++) { 33 C[i * n + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + threadIdx.x + blockIdx.y * n_ele * n] = temp[i]; 34 } 35 } Listing 4.12: The shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern. 73 27, each HW reads the same value, row k and column i. Since the T10 GPU supports shared memory broadcasting, threads within a HW accessing the same address in shared memory cause no bank con?icts. Therefore, no further optimizations to the access pattern are necessary. Figure 4.17 depicts maximum e?ective bandwidth of MM before (Listing 4.10) and after (Listing 4.12) utilizing the optimized access pattern. For MM, e?ective bandwidth is calculated in Section 3.5 as 8n3bytesTGPU comp: . 512 1024 2048 4096 8192 16384740 760 780 800 820 840 860 880 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.17: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. The optimized access pattern yields a speedup, on average, of 1.01 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 512) speedups are 1.00 and 1.04, respectively. 
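The partition arithmetic behind this access pattern can be illustrated with a small host-side sketch (example sizes; partition_of is a hypothetical helper, and the 8-partition, 256-byte-wide layout is the T10 organization described in Section 3.1). With the offset P of Listing 4.11, eight consecutive blocks begin reading A from eight different partitions; with P = 0, all of them would begin in partition 0, the situation of Figure 4.11a. Listing 4.12 applies the same idea to MM through a linear block index.

    #include <stdio.h>

    /* Partition holding element index i of A: 8 partitions, each 256 bytes
     * (64 floats) wide, assuming n is a multiple of 512. */
    static int partition_of(long long floatIndex)
    {
        return (int)((floatIndex / 64) % 8);
    }

    int main(void)
    {
        int n = 4096, dBlkx = 32;              /* example sizes; n % 512 == 0 */
        for (int b = 0; b < 8; b++) {
            int P = (b % 8) * 64;              /* offset used in Listing 4.11 */
            long long first = (long long)b * dBlkx * n + P;  /* first element of A read by block b */
            printf("block %d starts in partition %d\n", b, partition_of(first));
        }
        return 0;
    }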
Comparing results from Figure 4.16 with Figure 4.17, the optimized access pattern yields a signi?cant speedup for Mv but a modest speedup for MM. Since bank con?icts do not occur for MM, these results suggest that bank con?icts a?ect computation time more signi?cantly than partition camping. Regardless, the optimized access pattern increases maximum e?ective bandwidth, on average, for both. Therefore, Step 3 of the parallelization procedure, deriving the optimized access pattern, is valid and necessary. 74 4.4 Fine-tuning The fourth step of the parallelization procedure speci?es that ?ne-tuning is performed on each kernel to reduce computation and resource allocation. Fine-tuning is de?ned as optimizations to the kernel to reduce time, such as reducing index calculations and loop un- rolling. Reducing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions thus reducing the computation time. Loop unrolling also reduces computation time by increasing the instruction mix of mem- ory transactions and computation instructions. In addition, loop unrolling can reduce the number of registers used per thread since the compiler can reuse registers. When the com- piler exceeds the maximum register allocation per thread, local memory is allocated. Local memory is a section of global memory reserved for each thread as storage for temporary variables. The usage of local memory instead of registers greatly increases the computation time as global memory accesses are orders of magnitude slower than register accesses. Since the implementation of MM utilizes many registers, the e?ect is more apparent than for Mv. Since loop unrolling can reduce the number of registers used per thread due to register reuse, it can eliminate allocation of local memory thus greatly reducing the computation time. For matrix-based computations, index calculations are necessary for each thread to determine which value of each matrix is accessed. Index calculations typically consist of several multiply and addition instructions inside a loop. The kernel de?nitions of matrix- based computations specify that pointers to each matrix are passed. Since each thread contains a pointer to each matrix, some of the multiplication instructions can be reduced to addition. Listing 4.13 depicts sample code for a typical matrix-based computation. In 1 __global__ void Example(float* A, float* B, int n) { 2 for(int i = 0; i < n; i++) { 3 for(int j = 0; j < n; j++) { 4 B[threadIdx.y * n + j] = A[i * n + threadIdx.x]; 5 } 6 } 7 } Listing 4.13: An example of a matrix-based computation without ?ne-tuning. 75 Line 4, to calculate a part of the index of B, each thread calculates threadIdx:y ?n which is performed n2 times. Since the value is constant for each thread, this calculation can be performed once before the for-loops. Likewise, each thread adds threadIdx:x which is calculated n2 times for a part of the index of A. This calculation can also be performed once outside of the for-loops. Similarly, each thread calculates i?n which is performed n2 times for a part of the index of A. Therefore, this calculation can be replaced by adding n to the pointer to A inside the outer for-loop. Now, each thread performs only n additions of n. The kernel in Listing 4.13 is modi?ed accordingly to yield the ?ne-tuned kernel, with respect to index calculations, in Listing 4.14. 
1 __global__ void Example(float* A, float* B, int n) { 2 A += threadIdx.x; 3 B += threadIdx.y * n; 4 for(int i = 0; i < n; i++) { 5 for(int j = 0; j < n; j++) { 6 B[j] = A[0]; 7 } 8 A += n; 9 } 10 } Listing 4.14: An example of a matrix-based computation with index calculations minimized. As mentioned, loop unrolling can reduce computation time by increasing the instruction mix and reducing the number of registers used per thread. Reducing the number of registers used per thread can signi?cantly reduce computation time since the number of active threads can be dependent on register usage. Therefore, NVIDIA provides a #pragma unroll directive to instruct the compiler to unroll a loop. From the CUDA programming guide [64], ?By default, the compiler unrolls small loops with a known trip count.? However, it is unde?ned as to what constitutes a small loop. Therefore, #pragma unroll is inserted before each loop in the kernel. The compiler unrolls inner loops ?rst and stops unrolling loops once the maximum instruction limit is reached. Listing 4.14 is modi?ed to instruct the compiler to unroll the for-loops as depicted in Listing 4.15. 76 1 __global__ void Example(float* A, float* B, int n) { 2 A += threadIdx.x; 3 B += threadIdx.y * n; 4 #pragma unroll 5 for(int i = 0; i < n; i++) { 6 #pragma unroll 7 for(int j = 0; j < n; j++) { 8 B[j] = A[0]; 9 } 10 A += n; 11 } 12 } Listing 4.15: An example of a matrix-based computation after ?ne-tuning. Index calculations are minimized and all loops are unrolled 4.4.1 Mv Fine-tuning is performed to the shared memory implementation of Mv on a GPU uti- lizing the optimized computation and access pattern, Listing 4.11, to yield Listing 4.16. 1 __constant__ float b[n]; 2 __global__ void Mv(float* A, float* c, int n) { 3 __shared__ float As[blockDim.x][blockDim.x+1]; 4 float temp = 0.0; 5 int index = blockIdx.x * blockDim.x; 6 int P = (blockIdx.x % 8) * 64; 7 int calc1 = blockDim.x / blockDim.y; 8 int calc2 = calc1 * threadIdx.y; 9 A += threadIdx.x + index * n + calc2 * n + P; 10 c += index + threadIdx.x; 11 #pragma unroll 12 for(int j = P; j < n; j+= blockDim.x) { 13 #pragma unroll 14 for(int k = 0; k < calc1; k++) { 15 As[k + calc2][threadIdx.x] = A[0]; 16 A += n; 17 } 18 A += blockDim.x; 19 A -= n * calc1; 20 __syncthreads(); 21 #pragma unroll 22 for(int k = 0; k < calc1; k++) { 23 temp += As[threadIdx.x][k + calc2] * b[j + k + calc2]; 24 } 25 __syncthreads(); 26 } 27 A -= n; 28 #pragma unroll 77 29 for(int j = 0; j < P; j+= blockDim.x) { 30 #pragma unroll 31 for(int k = 0; k < calc1; k++) { 32 As[k + calc2][threadIdx.x] = A[0]; 33 A += n; 34 } 35 A += blockDim.x; 36 A -= n * calc1; 37 __syncthreads(); 38 #pragma unroll 39 for(int k = 0; k < calc1; k++) { 40 temp += As[threadIdx.x][k + calc2] * b[j + k + calc2]; 41 } 42 __syncthreads(); 43 } 44 if(threadIdx.y != 0) { 45 As[threadIdx.y-1][threadIdx.x] = temp; 46 } 47 __syncthreads(); 48 if(threadIdx.y == 0) { 49 #pragma unroll 50 for(k = 0; k < blockDim.y-1; k++) { 51 temp += As[k][threadIdx.x]; 52 } 53 c[0] = temp; 54 } 55 } Listing 4.16: The shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. Constant memory is utilized for b. No modi?cations are performed to the allocation of shared memory thus Equation (4.11) de?nes shared memory usage. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (4.6) and (4.7). 
calc1 and calc2 in Lines 7 and 8 are commonly used calculations in the for-loops. By de?ning temporary variables for the commonly used calculations, the amount of computation in the for-loops is reduced. Index calculations are reduced by initializing pointers to A and c in Lines 9 and 10. Likewise, pointer arithmetic is performed in Lines 16, 18, 19, 27, 33, 35, and 36 to further reduce index calculations. In Listing 4.16, for-loops are preceded with the #pragma unroll directive in Lines 11, 13, 21, 28, 30, 38, and 49. This instructs the compiler to unroll as many loops as possible to increase the instruction mix and reduce the number of registers used per thread. 78 Figure 4.18 depicts maximum e?ective bandwidth of Mv before (Listing 4.11) and after (Listing 4.16) ?ne-tuning. As mentioned, e?ective bandwidth is calculated as 8n2bytesTGPU comp: . 512 1024 2048 4096 8192 1638420 40 60 80 100 120 140 160 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 4.18: Comparison of ?ne-tuning: Maximum bandwidth (GB/s) for Mv on the T10 GPU. In the ?gure, the kernel, after ?ne-tuning, yields a speedup, on average, of 1.05 compared to before ?ne-tuning. The minimum (n = 16384) and maximum (n = 1024) speedups are 1.00 and 1.12, respectively. As mentioned, the usage of local memory instead of registers greatly increases the computation time as global memory accesses are orders of magnitude slower than register accesses. However, for Mv, a small amount of registers are allocated and therefore, local memory is not utilized. Therefore, ?ne-tuning Mv yields a reduction in only the amount of index calculations and thus a modest speedup. Regardless, the maximum e?ective bandwidth is increased and therefore, Step 4, ?ne-tuning the kernel, is valid for Mv. 4.4.2 MM Fine-tuning is performed to the shared memory implementation of MM on a GPU uti- lizing the optimized computation and access pattern, Listing 4.12, to yield Listing 4.17. In this kernel, dBlk:y ? n ele. Slight modi?cations are performed to Lines 15-18 and 37-40 for loading A if dBlk:y > n ele. 
79 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 #pragma unroll 5 for(int i = 0; i < n_ele; i++) { 6 temp[i] = 0.0; 7 } 8 int P = ((blockIdx.x + gridDim.x * blockIdx.y) % 8) * (64 / blockDim.x); 9 A += P * blockDim.x + threadIdx.x + (blockIdx.y * N_ele + threadIdx.y) rcurvearrowse * n; 10 B += P * n * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x; 11 C += threadIdx.x + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + blockIdx.y * N_ele * n; 12 #pragma unroll 13 for(int j = P; j < n / blockDim.x; j++) { 14 #pragma unroll 15 for(int i = 0; i < n_ele / blockDim.y; i++) { 16 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[0]; 17 A += blockDim.y * n; 18 } 19 A -= blockDim.y * n * N_ele / blockDim.y; 20 __syncthreads(); 21 #pragma unroll 22 for(int i = 0; i < blockDim.x; i++) { 23 #pragma unroll 24 for(int k = 0; k < n_ele; k++) { 25 temp[k] += As[k][i] * B[0]; 26 } 27 B += n; 28 } 29 A += blockDim.x; 30 __syncthreads(); 31 } 32 A -= n; 33 B -= n * n; 34 #pragma unroll 35 for(int j = 0; j < P; j++) { 36 #pragma unroll 37 for(int i = 0; i < n_ele / blockDim.y; i++) { 38 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[0]; 39 A += blockDim.y * n; 40 } 41 A -= blockDim.y * n * N_ele / blockDim.y; 42 __syncthreads(); 43 #pragma unroll 44 for(int i = 0; i < blockDim.x; i++) { 45 #pragma unroll 80 46 for(int k = 0; k < n_ele; k++) { 47 temp[k] += As[k][i] * B[0]; 48 } 49 B += n; 50 } 51 A += blockDim.x; 52 __syncthreads(); 53 } 54 #pragma unroll 55 for (int i = 0; i < n_ele; i++) { 56 C[0] = temp[i]; 57 C += n; 58 } 59 } Listing 4.17: The shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. No modi?cations are performed to the allocation of shared memory, thus Equation (4.8) de?nes shared memory usage. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (4.9) and (4.10). Index calculations are reduced by initializing pointers to A, B, and C in Lines 9-11. Likewise, pointer arithmetic is performed in Lines 17, 19, 27, 29, 32, 33, 39, 41, 49, 51, and 57 to further reduce index calculations. The for-loops in Listing 4.17 are preceded with the #pragma unroll directive in Lines 4, 12, 14, 21, 23, 34, 36, 43, 45, and 54 to increase the instruction mix and reduce the number of registers used per thread. However, the for-loops in Lines 13 and 35 are dependent on P which is calculated in Line 8 and dependent on block indices. Since P is calculated at run-time, its value is unknown during compilation. Therefore, the for-loops in Lines 13 and 35 are not unrolled by the compiler since the trip count is unknown. Figure 4.19 depicts maximum e?ective bandwidth of MM before (Listing 4.12) and after (Listing 4.17) ?ne-tuning. As discussed, maximum e?ective bandwidth is 8n3bytesTGPU comp: for MM. 81 512 1024 2048 4096 8192 16384700 800 900 1000 1100 1200 1300 1400 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 4.19: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. For MM, register utilization is high since each thread is computing multiple results and utilizing a register for each partial result. As shown in Figure 4.19 the kernel after ?ne-tuning yields a speedup, on average, of 1.51 compared to before ?ne-tuning. 
The minimum (n = 512) and maximum (n = 16384) speedups are 1.43 and 1.53, respectively. Since fine-tuning the kernel reduces the amount of register utilization, a significant improvement in maximum effective bandwidth is realized for MM. Therefore, Step 4 of the parallelization procedure, fine-tuning the kernel, is necessary and valid for MM.

4.5 Input Parameters

The penultimate step of the parallelization procedure, Step 5, specifies that optimal input parameters are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Analysis of measured data for Mv and MM shows that input parameters significantly affect the GPU computation time. Therefore, optimal input parameters are derived as the input parameters which yield the minimum computation time. The derivation is performed utilizing the execution metrics formulated in Section 3.3. Since input parameters are dependent on the kernel, the derivation is performed after the kernel is written, which considers the placement of data, optimized computation and access patterns, and fine-tuning adjustments. A list of five steps is presented, in order, to derive the optimal input parameters:

1. Saturate the memory bus.
2. Maximize shared memory utilization.
3. Minimize the amount of shared memory used per block.
4. Maximize the number of global memory partitions accessed.
5. Minimize the number of fragmented blocks.

The order of deriving optimal input parameters is developed through analysis of measured data for Mv and MM. Step 1 is listed to maximize the overlap of computation and memory accesses by ensuring the memory bus is fully utilized. Step 2 ensures that data which is loaded into shared memory is reused as much as possible. Step 3 is listed such that a minimum amount of shared memory is assigned per block. This increases the number of blocks assigned to each SM. In addition, this reduces the amount of time each HW in a block must wait for synchronization since there are fewer HWs per block. Step 4 maximizes the number of global memory partitions accessed such that partition camping is reduced. Lastly, Step 5 ensures the contribution to computation time from fragmented blocks is minimized.

Due to limitations of the T10 GPU, a maximum of 512 threads can be assigned per block. Therefore,

dBlk.x × dBlk.y ≤ 512.  (4.12)

Since threads are assigned to HWs in row-major order and the minimum global memory transaction size is 32B,

dBlk.x ≥ 8.  (4.13)

Since values of type float are used, this equation ensures that one 32B transaction can service half of the threads in a HW.

Step 1 specifies that the memory bus is saturated, which requires analysis of Thds^active_GPU, which is dependent on Blks^active_GPU. In the GPU documentation provided by NVIDIA [64], it is stated that each global memory access takes approximately 400-800 clock cycles. Therefore, a minimum of 400/8 warps per partition, or 12800 Thds^active_GPU, is needed to fully saturate the memory bus. Since the maximum value of Thds^active_GPU is determined by the product of the number of SMs and the maximum number of threads per SM,

12800 ≤ Thds^active_GPU ≤ 30720.  (4.14)

Therefore, optimal input parameters are derived such that Thds^active_GPU is greater than the minimum necessary to fully saturate the memory bus. Optimal input parameters for each matrix-based computation are derived in the following sections. Sections 4.5.1 and 4.5.2 include the derivation of optimal input parameters for Mv and MM, respectively.
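Step 1 can be expressed as a simple feasibility check. The sketch below is illustrative (the helper names are hypothetical); it uses the T10 limits cited above (30 SMs, 16384 bytes of shared memory, at most 8 blocks and 1024 threads per SM) and, for brevity, ignores the register limit of Equation (3.7).

    #include <stdio.h>

    static int imin3(int a, int b, int c)
    {
        return a < b ? (a < c ? a : c) : (b < c ? b : c);
    }

    /* Returns 1 if the launch configuration keeps at least 12800 threads
     * active on the GPU (the lower bound of Eq. (4.14)). */
    static int bus_saturated(int dBlkx, int dBlky, int dGrdx, int dGrdy,
                             int smemPerBlk)
    {
        int blksPerSM  = imin3(16384 / smemPerBlk, 8, 1024 / (dBlkx * dBlky));
        int blksGPU    = 30 * blksPerSM;          /* 30 SMs on the T10 */
        int totalBlks  = dGrdx * dGrdy;
        int activeBlks = totalBlks < blksGPU ? totalBlks : blksGPU;
        int activeThds = activeBlks * dBlkx * dBlky;
        return activeThds >= 12800;
    }

    int main(void)
    {
        /* Example values: n = 4096, 8x16 blocks, n_ele = 16,
         * SMemPerBlk = n_ele * dBlk.x * 4 bytes as in Eq. (4.8). */
        int n = 4096, dBlkx = 8, dBlky = 16, n_ele = 16;
        int smem = n_ele * dBlkx * 4;
        printf("saturated: %d\n",
               bus_saturated(dBlkx, dBlky, n / (dBlkx * dBlky), n / n_ele, smem));
        return 0;
    }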
4.5.1 Mv In this section, optimal input parameters are derived for the kernel in Listing 4.16, which is the shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. The amount of shared memory allocated per block is de?ned by Equation (4.11). Since each SM has a maximum of 16KB of shared memory, the maximum value of dBlk:x is 32. Combining with the minimum value of dBlk:x from Equation (4.13) yields 8 ? dBlk:x ? 32: (4.15) 84 From the kernel in Listing 4.16 and the maximum value of dBlk:y determined from solving for dBlk:y in Equation (4.12), dBlk:y ? minparenleft.alt3 512dBlk:x;dBlk:xparenright.alt3 (4.16a) dGrd:x = ndBlk:x (4.16b) dGrd:y = 1: The number of registers used per thread for the kernel varies from 10 to 15 dependent on dBlk:x and dBlk:y. Assuming the maximum is used, BlksactiveSM from Equation (3.7) is not limited by register usage. Therefore, Equation (3.8) simpli?es to BlksactiveSM = minparenleft.alt3? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3: (4.17) Since dGrd:y = 1, Equation (3.8) simpli?es to BlksactiveGPU = minparenleft.alt130?BlksactiveSM ;dGrd:xparenright.alt1: Since ThdsactiveGPU = BlksactiveGPU ?dBlk:x?dBlk:y; (4.18) then, if dGrd:x ? BlksactiveGPU , ThdsactiveGPU = dGrd:x?dBlk:x?dBlk:y: Substituting dGrd:x = ndBlk:x from Equation (4.16b) and the lower limit of n de?ned by Equation (4.1) yields ThdsactiveGPU ? 512?dBlk:y: 85 Since dBlk:x ? 32, the maximum value of dBlk:y from Equation (4.16a) yields ThdsactiveGPU ? 8192: Substituting into Equation (4.18) and substituting for BlksactiveGPU yields 8192 ? minparenleft.alt130?BlksactiveSM ;dGrd:xparenright.alt1?dBlk:x?dBlk:y: (4.19) If dGrd:x ? 30 ? BlksactiveSM , rearranging Equation (4.19) and solving for dBlk:y yields the lower limit as 8192 n ? dBlk:y: (4.20) The upper limit of dBlk:y is not modi?ed and therefore is de?ned by Equation (4.16a). However, if dGrd:x > 30?BlksactiveSM , then substituting for BlksactiveSM from Equation (4.17) into Equation (4.19) yields 8192 ? 30?minparenleft.alt3? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3?dBlk:x?dBlk:y: (4.21) FromEquation(4.11), SMemPerBlk = (dBlk:x2+dBlk:x)?4. AftersubstitutingSMemPerBlk into Equation (4.21), solving for dBlk:y, given the possible values of dBlk:x from Equation (4.15), yields the lower limit as maxparenleft.alt34; 64dBlk:xparenright.alt3 ? dBlk:y: (4.22) Combining lower limits of dBlk:y from Equations (4.20) and (4.22) with the upper limit from Equation (4.16a) yields maxparenleft.alt38192n ;4; 64dBlk:xparenright.alt3 ? dBlk:y ? minparenleft.alt3 512dBlk:x;dBlk:xparenright.alt3: (4.23) 86 Step 2 of deriving optimal input parameters speci?es shared memory utilization is max- imized to ensure data is reused as much as possible. From Lines 22 and 39 of the kernel in Listing 4.16, the trip count for the inner loops where shared memory is accessed is dependent on dBlk:xdBlk:y. Therefore, the maximum of dBlk:x from Equation (4.15) and the minimum of dBlk:y from Equation (4.23) yield dBlk:x = 32 dBlk:y = maxparenleft.alt38192n ;4parenright.alt3 dGrd:x = n32 dGrd:y = 1: (4.24) Given n, all input parameters are constant and therefore, no further steps are necessary to derive optimal input parameters for Mv. However, as realized through measured data for MM, additional steps are necessary to derive optimal input parameters for MM. 
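For reference, Equation (4.24) can be evaluated directly. The following host-side sketch (hypothetical helper and type names, example range of n) prints the derived launch configuration for Mv; in CUDA, these values are passed to the kernel of Listing 4.16 as dim3 block(dBlk.x, dBlk.y) and dim3 grid(dGrd.x, dGrd.y) at launch.

    #include <stdio.h>

    typedef struct { int dBlkx, dBlky, dGrdx, dGrdy; } LaunchParams;

    /* Optimal Mv input parameters per Eq. (4.24). */
    static LaunchParams mv_optimal_params(int n)
    {
        LaunchParams p;
        p.dBlkx = 32;                                 /* maximum allowed by shared memory */
        p.dBlky = (8192 / n > 4) ? 8192 / n : 4;      /* max(8192/n, 4) */
        p.dGrdx = n / 32;
        p.dGrdy = 1;
        return p;
    }

    int main(void)
    {
        for (int n = 512; n <= 16384; n *= 2) {
            LaunchParams p = mv_optimal_params(n);
            printf("n=%5d  block=(%d,%d)  grid=(%d,%d)\n",
                   n, p.dBlkx, p.dBlky, p.dGrdx, p.dGrdy);
        }
        return 0;
    }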
Regardless, Equation (4.24) is the derivation of optimal input parameters for the shared memory im- plementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. Figure 4.20 depicts measured computation time of all input parameters for GPU com- putation of Mv. The ?gure illustrates the necessity of optimal input parameters as the GPU computation time of Mv varies signi?cantly depending on the input parameters. In the ?g- ures in this section, the x-axis represents the number of threads per block (dBlk:x?dBlk:y). Therefore, there are several combinations of dBlk:x and dBlk:y which yield an equivalent number of threads per block. 87 8 16 32 64 128 256 512 10?1 dBlk.x?dBlk.y tim e( ms ) n=512 Optimal 8 16 32 64 128 256 512 10?1 dBlk.x?dBlk.y tim e( ms ) n=1024 Optimal 8 16 32 64 128 256 512 10?0.7 10?0.5 10?0.3 10?0.1 dBlk.x?dBlk.y tim e( ms ) n=2048 Optimal 8 16 32 64 128 256 512 100 dBlk.x?dBlk.y tim e( ms ) n=4096 Optimal 8 16 32 64 128 256 512 100.5 100.7 100.9 dBlk.x?dBlk.y tim e( ms ) n=8192 Optimal 8 16 32 64 128 256 512 101.1 101.3 101.5 101.7 dBlk.x?dBlk.y tim e( ms ) n=16384 Optimal Figure 4.20: Comparison of input parameters: Computation time (ms) of all input parame- ters for Mv on the T10 GPU. 88 As illustrated, derivation of optimal input parameters yields computation time, on av- erage, within 1.8% of the minimum measured time. In the worst-case (n = 8192), utilizing optimal input parameters yields time within 4.1% of the minimum. Best-case utilization of optimal input parameters (n = {1024;2048;16384}) yields the minimum. Therefore, Step 5 of the parallelization procedure, deriving optimal input parameters, yields minimum or near-minimum GPU computation times for Mv. 4.5.2 MM In this section, optimal input parameters are derived for the kernel in Listing 4.17, which is the shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. From the kernel and Equations (4.12) and (4.13), 8 ?dBlk:x ? 512 (4.25a) 1 ?dBlk:y ? 512dBlk:x (4.25b) dGrd:x = ndBlk:x?dBlk:y (4.25c) dGrd:y = nn ele (4.25d) where 1 ? n ele ? 16 due to resource constraints. From Equation (4.8), SMemPerBlk = (n ele?dBlk:x)?4. From compilation of the kernel, RegsPerThd varies from 14 to 38. From the minimum of SMemPerBlk and RegsPerBlk, dGrd:x?dGrd:y > 30?BlksactiveSM . Therefore, Equation (3.8) simpli?es to BlksactiveGPU = 30?minparenleft.alt3? 16384RegsPerBlk?;? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3: (4.26) 89 Substituting BlksactiveGPU from Equation (4.26) and the minimum of ThdsactiveGPU from Equa- tion (4.14) into Equation (3.9) yields 12800 ? 30?minparenleft.alt3? 16384RegsPerBlk?;? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3?dBlk:x?dBlk:y: (4.27) Solving for dBlk:y modi?es the lower limit of dBlk:y from Equation (4.25b) to yield max uni239B uni239C uni239D 12800 30dBlk:x? 16384RegsPerBlk? ; 12800 30dBlk:x? 16384(n ele?dBlk:x)?4? ; 64dBlk:x uni239E uni239F uni23A0 ? dBlk:y ? 512dBlk:x: (4.28) Step 2 of deriving optimal input parameters speci?es shared memory utilization is max- imized to ensure data is reused as much as possible. From Lines 25 and 47 of the kernel in Listing 4.17, the trip count for the inner loops where shared memory is accessed is dependent on n ele. Since 1 ? n ele ? 16, n ele = 16. Substituting n ele = 16 into Equation (4.25d) yields dGrd:y = n16 (4.29) Since SMemPerBlk = (n ele ? dBlk:x) ? 4, dBlk:x ? 64. 
However, if dBlk:x = 64 and dBlk:y is the maximum from Equation (4.28), then BlksactiveGPU = 30 due to register usage. Therefore, ThdsactiveGPU = 7680 and is less than the minimum from Equation (4.14). Therefore, dBlk:x ? 32. However, if dBlk:x ? 16, due to register usage, only two values exist for dBlk:y which satisfy Equation (4.28). Therefore, Equation (4.28) simpli?es to 64 dBlk:x ? dBlk:y ? 512 dBlk:x if dBlk:x = 8: (4.30) and dBlk:y = uni23A7uni23AAuni23AAuni23AA uni23AAuni23A8 uni23AAuni23AAuni23AAuni23AA uni23A9 32 if dBlk:x = 16 2 if dBlk:x = 32: 90 Step 3 of deriving optimal input parameters speci?es the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces time each HW in a block waits for synchronization since there are fewer HWs per block. Since n ele = 16, SMemPerBlk = 64dBlk:x. Therefore, the minimum value of dBlk:x from Equation (4.25a) is utilized. Since dBlk:x = 8, Equations (4.30) and (4.25c) are evaluated to yield 8 ?dBlk:y ? 64 (4.31a) dGrd:x = n8dBlk:y (4.31b) Step 4 of deriving optimal input parameters speci?es the maximum number of global memory partitions, PartsPerGPU, are accessed to reduce partition camping. PartsPerGPU is dependent on PartsPerBlk, dGrd:x, and BlksactiveGPU . Since dBlk:x and n ele are constant, RegsPerThd is constant. From compilation, 32 RegsPerThd are allocated so RegsPerBlk = 256dBlk:y. Since RegsPerBlk is the only limiting factor of Equation (4.26), the equation simpli?es to BlksactiveGPU = 1920dBlk:y: (4.32) If dGrd:x > BlksactiveGPU , substituting dBlk:x = 8 and BlksactiveGPU from Equation (4.32) into Equation (3.13) and solving for dBlk:y yields4 dBlk:y ? 16: (4.33) If dGrd:x ? BlksactiveGPU , substituting dBlk:x = 8 and dGrd:x from Equation (4.31b) into Equa- tion (3.13) and simplifying yields PartsPerGPU = minparenleft.alt3 n64dBlk:y;8parenright.alt3: (4.34) 4The solution is dBlk:y ? 30. However, dBlk:y is a power of two so it is rounded down to the nearest power of two. 91 Substituting the minimum value of dBlk:y from Equation (4.31a) into Equation (4.34) yields dGrd:xdBlk:x64 ? minparenleft.alt3 n512;8parenright.alt3: (4.35) Substituting dBlk:x = 8 and dGrd:x from Equation (4.31b) into Equation (4.35) and solving for dBlk:y yields dBlk:y ? n64minparenleft.alt1 n 512;8parenright.alt1 : (4.36) Combining the upper limits of dBlk:y from Equations (4.33) and (4.36) with the lower limit of dBlk:y from Equation (4.31a) yields 8 ? dBlk:y ? minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0: (4.37) Step 5 of deriving optimal input parameters speci?es the number of fragmented blocks, Blksfrag:GPU from Equation (3.10), is minimized. Substituting dGrd:x from Equation (4.31b), dGrd:y from Equation (4.29), and BlksactiveGPU from Equation (4.32) into Equation (3.10) yields n2 128dBlk:y mod 1920 dBlk:y: (4.38) In general, xy can be expressed as x = Qy + R where Q is the quotient, Q = ?xy?, and R is the remainder. Since Blksfrag:GPU = R, Blksfrag:GPU = x ? Qy. Substituting for x, y, and Q from Equation (4.38) and simplifying yields Blksfrag:GPU = n 2 128dBlk:y ?? n2 245760? 
1920 dBlk:y: (4.39) Therefore, to minimize Blksfrag:GPU, the maximum value of dBlk:y from Equation (4.37) is utilized and dBlk:y = minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0: (4.40) 92 Since dBlk:x = 8 and from Equations (4.40), (4.25c), and (4.29), the optimal input parameters for the shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning are dBlk:x = 8 dBlk:y = minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0 dGrd:x = n8Blk:y dGrd:y = n16: (4.41) Figure 4.21 depicts measured GPU computation time of all input parameters for MM. The ?gure illustrates the necessity of optimal input parameters as the computation time of MM varies signi?cantly depending on the input parameters. 8 16 32 64 128 256 512 100 101 dBlk.x?dBlk.y tim e( ms ) n=512 Optimal 8 16 32 64 128 256 512 101 102 dBlk.x?dBlk.y tim e( ms ) n=1024 Optimal 93 8 16 32 64 128 256 512 102 103 dBlk.x?dBlk.y tim e( ms ) n=2048 Optimal 8 16 32 64 128 256 512 103 dBlk.x?dBlk.y tim e( ms ) n=4096 Optimal 8 16 32 64 128 256 512 104 dBlk.x?dBlk.y tim e( ms ) n=8192 Optimal 8 16 32 64 128 256 512 105 106 dBlk.x?dBlk.y tim e( ms ) n=16384 Optimal Figure 4.21: Comparison of input parameters: Computation time (ms) of all input parame- ters for MM on the T10 GPU. Derivation of optimal input parameters yields computation time, on average, within 0.5% of the minimum measured time. In the worst-case (n = 512), utilizing optimal in- put parameters yields time within 1.2% of the minimum. Best-case utilization of optimal input parameters (n = {1024;8192;16384}) yields the minimum. As depicted, deriving op- timal input parameters yields minimum or near-minimum GPU computation time for MM. Therefore, Step 5 of the parallelization procedure is valid and necessary for MM. 94 4.5.3 GPU Computation Summary After Step 5, the parallelization procedure is developed to determine the optimal parti- tioning of computation between the CPU and GPU. Therefore, the procedure with regards to minimizing GPU computation time is complete. Therefore, this section illustrates max- imum e?ective bandwidth of each aforementioned step of the parallelization procedure as it pertains to GPU computation time. In addition, a comparison of the procedure to the- oretical bandwidth and CUBLAS e?ective bandwidth is illustrated. For each ?gure in this subsection, Na ve denotes the global memory implementation, Sh. Mem. denotes the shared memory implementation, O.C.P. denotes the optimized computation pattern utilizing shared memory, and O.A.P. denotes the optimized access pattern utilizing the optimized compu- tation pattern and shared memory. P.P. denotes the entire parallelization procedure with regards to GPU computation which utilizes shared memory, optimized computation and access patterns, ?ne-tuning adjustments, and optimal input parameters. Mv Figure 4.22 depicts e?ective bandwidth of Mv for each step of the parallelization pro- cedure in addition to CUBLAS e?ective bandwidth and theoretical bandwidth. 512 1024 2048 4096 8192 16384 101 102 n eff ect ive ba nd wi dt h( GB /s ) TheoreticalCUBLAS P.P.O.A.P. O.C.P.Sh. Mem. Na??ve Figure 4.22: Comparison of e?ective bandwidth (GB/s) for Mv on the T10 GPU. 95 The parallelization procedure yields a speedup, on average, of 18.98 compared to the na??ve implementation. The minimum (n = 4096) and maximum (n = 16384) speedups are 11.92 and 37.35, respectively. 
As illustrated, each step of the procedure increases the e?ective bandwidth. On average, the procedure yields a speedup of 1.47 compared to the CUBLAS implementation with a minimum (n = 8192) and maximum (n = 1024) speedup of 0.96 and 2.17, respectively. Compared to theoretical bandwidth, the procedure yields e?ective bandwidth, on average, as 63.7% of the theoretical bandwidth. Utilizing the procedure yields a worst-case (n = 512) e?ective bandwidth of 21.3% of theoretical. Best-case utilization of the procedure (n = 8192) yields an e?ective bandwidth of 79.1% of the theoretical. The procedure is veri?ed to yield increases in speedup compared to the optimized CUBLAS GPU implementation of Mv. More importantly, the procedure is veri?ed to yield signi?cant increases in speedup compared to the na??ve GPU implementation of Mv. MM Figure 4.23 depicts e?ective bandwidth of MM for each step of the parallelization pro- cedure in addition to the theoretical and CUBLAS e?ective bandwidth. 512 1024 2048 4096 8192 16384 102 103 n eff ect ive ba nd wi dt h( GB /s ) TheoreticalCUBLAS P.P.O.A.P. O.C.P.Sh. Mem. Na??ve Figure 4.23: Comparison of e?ective bandwidth (GB/s) for MM on the T10 GPU. Similar to Mv, the procedure applied to MM yields a signi?cant speedup, on average, of 14.58 compared to the na??ve implementation. The minimum (n = 16384) and maximum 96 (n = 512) speedups are 12.36 and 21.56, respectively. In Figure 4.23, each step of the procedure increases the e?ective bandwidth. The largest increase in bandwidth occurs after the ?rst step of the procedure. This is due to the ?rst step signi?cantly reducing the number of global memory transactions. Compared to the CUBLAS implementations, the procedure yields a speedup, on average, of 1.07 with a minimum (n = 4096) and maximum (n = 512) speedup of 0.95 and 1.39, respectively. As illustrated, the procedure yields bandwidth near the theoretical. On average, the procedure yields 84.2% of the theoretical bandwidth. In the worst-case (n = 512), utilizing the procedure yields e?ective bandwidth of 73.2% of the theoretical. Best-case utilization of the procedure (n = 16384) yields an e?ective bandwidth of 88.3% of the theoretical. Similar to Mv, the procedure is veri?ed to yield increases in speedup compared to the optimized CUBLAS GPU implementation of MM. Likewise, the procedure is veri?ed to yield signi?cant increases in speedup compared to the na??ve GPU implementation of MM. Therefore, for Mv and MM, the procedure yields GPU computation times competitive with the CUBLAS implementation and signi?cantly lower than the na??ve implementation. 4.6 Computation Partitioning The last step, Step 6, of the parallelization procedure to minimize execution time is to determine the partitioning of computation between the CPU and GPU. For small matrix- based computations, the communication time may exceed the CPU computation time as measured for certain data sizes of Mv. Therefore, determining which computations are per- formed by the CPU and GPU requires accurate estimates of communication and computation times. From accurate estimates of communication and computation times, it is possible to de- termine the appropriate computations to be performed by the CPU and GPU. The algorithm in Listing 4.18 determines which matrix-based computations are performed by the CPU and GPU. TCPUcomp: and Tcomm: are presented in Chapter 3. 97 1 if TCPUcomp: ? TGPUcomp: +Tcomm: 2 perform computation on CPU. 3 if TCPUcomp: > TGPUcomp: +Tcomm: 4 perform computation on GPU. 
Since GPU computation time is estimated assuming maximum bandwidth, it is necessary to determine the number of bytes accessed from global memory for each matrix-based computation. The theoretical bandwidth is calculated by dividing the number of bytes necessary for computation by the number of bytes accessed from global memory and multiplying this fraction by the maximum bandwidth to global memory (94 GB/s).

4.6.1 Mv

For Mv, the optimal input parameters define dBlk.x = 32 from Equation (4.24). Therefore, from Equation (4.7), the 64B memory transactions access 4n^2 bytes of global memory. Since 2n^2 float values (8n^2 bytes) are necessary for computation and 4n^2 bytes are accessed from global memory, the theoretical bandwidth for Mv is (8n^2 / 4n^2)(94 GB/s) = 188 GB/s. Therefore, T_GPUcomp = 8n^2 / (1024^3 · 188). Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds as

T_comm = 5.00×10^-9 · n +
    3.02×10^-9 · n^2 + 5.00×10^-5   if n < 256
    1.08×10^-9 · n^2 + 9.60×10^-5   if 256 ≤ n < 512
    1.08×10^-9 · n^2 + 2.27×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substitution, if

8n^2 / (1024^3 · min(max(16384/n, 2), 8)) ≤ 8n^2 / (1024^3 · 188) + T_comm,

computation is performed on the CPU to minimize execution time. Solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 2048. Therefore, if n > 2048, the computation is performed on the GPU to minimize execution time.
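The sketch below evaluates the Mv partitioning rule with the constants given above for the T10 system and reproduces the n ≤ 2048 crossover. The function names are illustrative, and the CPU bandwidth term min(max(16384/n, 2), 8) GB/s is taken from the substituted inequality above as reconstructed here.

```c
#include <stdio.h>
#include <stdbool.h>

/* Sketch of the Mv partitioning model (times in seconds, n the matrix size). */
static double mv_t_gpu(double n)  { return 8.0*n*n / (1073741824.0 * 188.0); }

static double mv_t_cpu(double n)                /* CPU (ATLAS) estimate */
{
    double bw = 16384.0 / n;                    /* effective bandwidth, GB/s */
    if (bw < 2.0) bw = 2.0;
    if (bw > 8.0) bw = 8.0;
    return 8.0*n*n / (1073741824.0 * bw);
}

static double mv_t_comm(double n)               /* two H2D + one D2H transfer */
{
    double t = 5.00e-9 * n;
    if      (n < 256.0) t += 3.02e-9*n*n + 5.00e-5;
    else if (n < 512.0) t += 1.08e-9*n*n + 9.60e-5;
    else                t += 1.08e-9*n*n + 2.27e-4;
    return t;
}

int main(void)
{
    /* CPU wins while T_CPUcomp <= T_GPUcomp + T_comm (Lines 1-2 of Listing 4.18). */
    for (int n = 512; n <= 16384; n *= 2) {
        bool cpu = mv_t_cpu(n) <= mv_t_gpu(n) + mv_t_comm(n);
        printf("n=%5d -> %s\n", n, cpu ? "CPU" : "GPU");
    }
    return 0;
}
```

Running this sketch selects the CPU for n ≤ 2048 and the GPU for larger n, matching the threshold derived above.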
Figure 4.24 illustrates measured execution time for Mv. CPU (ATLAS) execution time is the measured time of the CPU utilizing ATLAS to perform Mv. GPU (CUBLAS) execution time is the measured time for communication between the CPU and GPU in addition to the measured GPU computation time utilizing CUBLAS.

Figure 4.24: Comparison of execution time (ms) for Mv on the T10 GPU.

As mentioned, if n ≤ 2048, Mv is performed on the CPU and, therefore, the procedure yields execution time identical to the ATLAS implementation. If n > 2048, Mv is performed on the GPU utilizing the kernel developed in the procedure (Listing 4.16). The procedure yields a speedup, on average, of 1.68 compared to the ATLAS implementation, 2.63 compared to the naïve GPU implementation (Listing 4.2), and 1.39 compared to the CUBLAS implementation. The speedup compared to the naïve and CUBLAS implementations in Figure 4.24 is less than in Figure 4.22 due to communication time. For Mv, the communication time is a significant portion of execution time. Therefore, large speedups measured for Mv computation have less of an impact on execution time, since the execution time includes communication and computation time.

From Figure 4.24, the procedure yields execution time, on average, within 0.8% of the theoretical minimum execution time. Utilizing the procedure yields a worst-case (n = 8192) time within 1.9% of the theoretical minimum. In the best case (n = {512, 1024, 2048}), utilizing the procedure yields the theoretical minimum. This is possible since the theoretical minimum for n ≤ 2048 is identical to the CPU utilizing ATLAS, because the theoretical minimum for the CPU is calculated utilizing the effective bandwidths of the CPU. Therefore, Step 6 of the parallelization procedure, determining the optimal partitioning of computation, is valid and necessary for Mv.

4.6.2 MM

For MM, 2n^3 float values are necessary for computation. If n_ele = 16, as defined in Section 4.5.2 (optimal input parameters), data is reused via shared memory. Since each thread in a HW reuses 16 float values, 8n^3 / 16 bytes are necessary from global memory. Assuming the maximum global memory bandwidth of 94 GB/s, the theoretical bandwidth for MM is 1504 GB/s. Therefore, T_GPUcomp = 8n^3 / (1024^3 · 1504). Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds as

T_comm =
    8.02×10^-9 · n^2 + 5.00×10^-5   if n < 256
    4.14×10^-9 · n^2 + 1.42×10^-4   if 256 ≤ n < 512
    3.38×10^-9 · n^2 + 5.94×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substitution, if

8n^3 / (1024^3 · 51) ≤ 8n^3 / (1024^3 · 1504) + T_comm,

computation is performed on the CPU to minimize execution time. Solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 64. Therefore, if n > 64, the computation is performed on the GPU to minimize execution time.

Figure 4.25 illustrates measured execution time for MM. CPU (ATLAS) execution time is the measured time of the CPU utilizing ATLAS to perform MM. GPU (CUBLAS) execution time is the measured time for communication between the CPU and GPU in addition to the measured GPU computation time utilizing CUBLAS.

Figure 4.25: Comparison of execution time (ms) for MM on the T10 GPU.

For MM, the parallelization procedure yields execution time similar to the CUBLAS implementation for all data sizes tested; the speedup, on average, is 1.02. The procedure yields a speedup, on average, of 18.81 compared to the ATLAS implementation and 10.60 compared to the naïve implementation. Similar to Mv, the speedup compared to the naïve and CUBLAS implementations in Figure 4.25 is less than in Figure 4.23 due to communication time. However, differing from Mv, the communication time is not a significant portion of execution time. Therefore, the reduction in speedup due to communication is less than for Mv. The procedure yields execution time, on average, within 10.9% of the theoretical minimum execution time. The worst case utilizing the procedure (n = 4096) yields time within 11.4% of the theoretical minimum. The best case utilizing the procedure (n = 512) yields time within 10.4% of the theoretical minimum. For MM, Step 6 of the procedure, determining the partitioning of computation, is not strictly necessary since n > 64 for all data sizes tested. However, Step 6 is necessary to minimize execution time for Mv and thus is included.

Chapter 5
Performance Analysis

The parallelization procedure is developed and verified through Mv and MM. Therefore, Section 5.1 of this chapter applies the parallelization procedure to convolution. Each step is illustrated in a subsection and results are presented.
In addition to veri?cation through convolution, in Section 5.2, execution time of the conjugate gradient method [65] utilizing the parallelization procedure is compared with execution time of the CPU utilizing ATLAS and the GPU utilizing CUBLAS. All results in this chapter were gathered using the NVIDIA Tesla S1070 which includes a pair of T10 GPUs connected to a quad-core 2.26GHz Intel Xeon E5520 CPU at Alabama Supercomputer Authority [62]. The height or width of a square matrix, n, is varied from 512 to 16384 as de?ned by Equation (4.1). dBlk:x is varied from 8 to 512, the maximum allowable value for the T10. 5.1 Convolution As mentioned, the parallelization procedure developed in Chapter 4 is applied to con- volution to show the validity of the procedure. Three various ?lter sizes (FS = {3;63;513}) were selected for testing convolution. 5.1.1 Placement of Data Step 1 of the parallelization procedure is to determine the optimal placement of data in GPU memory. Listing 5.1 is the kernel for the global memory implementation of convolution on a GPU where each thread computes one Cij. The size of the ?lter, or the height or width of B, is passed to the kernel as FS. The radius of the ?lter is computed in Line 2. Each thread 103 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 int radius = (FS-1)/2; 3 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 4 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 5 float temp = 0.0; 6 for(int k = 0; k < FS; k++) { 7 int Col = indexCol + k - radius; 8 for(int l = 0; l < FS; l++) { 9 int Row = indexRow + l - radius; 10 if(Col >= 0 && Col < n) { 11 if(Row >= 0 && Row < n) { 12 temp += A[Row * n + Col] * B[l * FS + k]; 13 } 14 } 15 } 16 } 17 C[indexRow * n + indexCol] = temp; 18 } Listing 5.1: The global memory implementation of convolution on a GPU. computes the column and row index of the Cij to compute in Lines 3 and 4. Lines 10 and 11 perform boundary checking to ensure no address is accessed which is out of range. k and l determine which row and column of the ?lter, respectively, are being accessed. The order of the for-loops in Lines 6 and 8 determines the order in which computation is performed and thus a?ects the computation time. The order of computation, or the computation pattern, is discussed in Section 5.1.2. In Line 3 of Listing 5.1, indexCol is calculated utilizing threadIdx.x. Therefore, Col, Line 7, is also dependent on threadIdx.x. Since HWs are assigned in row-major order (threadIdx.x ?rst), accesses to A in Line 12 are coalesced. The number of global memory accesses for accessing A is dependent on the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Line 4. Therefore, the number of global memory accesses for accessing A is approximately1 Gld32B = n 2FS2 dBlk:x if dBlk:x ? 8 1For simplicity, the equations are approximations. Due to boundary checking, not all reads de?ned by the equations are necessary or issued. 104 and Gld64B = n 2FS2 16 if dBlk:x ? 16: Since each thread in a HW accesses the same value of B in Line 12, the number of global memory accesses for accessing B is approximately1 Gld32B = n 2FS2 16 (5.1) regardless of the value of dBlk:x. Lines 4-6 of Listing 4.1, the algorithm to determine the placement of data, specify that shared memory is utilized for reused data which is larger than the size of constant memory. It is not assumed the ?lter is smaller than the size of constant memory 2. 
Therefore, Line 5 speci?es that data which is reused is partitioned into blocks smaller than the size of shared memory. Therefore, A and B are partitioned into blocks of shared memory to reduce the number of global memory accesses as shown in Listing 5.2. Shared memory is allocated for parts of A and B in Lines 2 and 3, respectively. Therefore, SMemPerBlk = (2?dBlk:y ?dBlk:x+dBlk:y)?4: (5.2) Every dBlk:y iteration, parts of A and B are loaded into shared memory as shown in Lines 12-28. Since all HWs within a block access the same part of B, only some threads are utilized to load B into shared memory. In this kernel, threads of each HW with threadIdx.x=0 are utilized to load part of B into shared memory as shown in Lines 14-18. Boundary checking is performed in Lines 15, 19, 20, 23, 29, and 30. Similar to the global memory implementation, the order of the for-loops in Lines 8 and 10 determines the computation pattern, which is discussed in Section 5.1.2. 2If constant memory is utilized to store the ?lter, the maximum size of the ?lter is limited to 128. 105 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y * 2][blockDim.x]; 3 __shared__ float Bs[blockDim.y]; 4 int radius = (FS-1)/2; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Col = indexCol + k - radius; 10 for(int l = 0; l < FS; l++) { 11 int Row = indexRow + l - radius; 12 if((l % blockDim.y) == 0) { 13 __syncthreads(); 14 if(threadIdx.x == 0) { 15 if(l + threadIdx.y < FS) { 16 Bs[threadIdx.y] = B[(l + threadIdx.y) * FS + k]; 17 } 18 } 19 if(Col >= 0 && Col < n) { 20 if(Row >= 0 && Row < n) { 21 As[threadIdx.y][threadIdx.x] = A[Row * n + Col]; 22 } 23 if((Row + blockDim.y) >= 0 && (Row + blockDim.y) < n ) { 24 As[threadIdx.y + blockDim.y][threadIdx.x] = A[(Row + rcurvearrowse blockDim.y) * n + Col]; 25 } 26 } 27 __syncthreads(); 28 } 29 if(Col >= 0 && Col < n ) { 30 if(Row >= 0 && Row < n) { 31 temp += As[threadIdx.y + (l % blockDim.y)][threadIdx.x] * Bs[(l rcurvearrowse % blockDim.y)]; 32 } 33 } 34 } 35 } 36 C[indexRow * n + indexCol] = temp; 37 } Listing 5.2: The shared memory implementation of convolution on a GPU. 106 Since global memory is accessed every dBlk:y iteration, the number of global memory accesses for accessing A is dependent on the size of B, FS, and the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Lines 6 and 11. Therefore, the number of global memory accesses for accessing A is approximately3 Gld32B = 2n 2FS dBlk:x ? FS dBlk:y? if dBlk:x ? 8 (5.3) and4 Gld64B = 2n 2FS 16 ? FS dBlk:y? if dBlk:x ? 16: (5.4) If dBlk:x ? 16, this reduces the number of global memory accesses for accessing A by a factor of FS2? FS dBlk:y? compared to the global memory implementation. Since global memory is accessed every dBlk:y iteration, the number of global memory accesses for accessing B is approximately3 Gld32B = 16n 2FS dBlk:x2 ? FS dBlk:y? if dBlk:x ? 8 (5.5) and Gld32B = n 2FS dBlk:x ? FS dBlk:y? if dBlk:x ? 16: (5.6) This reduces the number of 32B accesses for accessing B by a factor of FSdBlk:x16? FS dBlk:y? compared to the global memory implementation. Figures 5.1-5.3 depict maximum e?ective bandwidth of the global memory implemen- tation of convolution (Listing 5.1) and the shared memory implementation (Listing 5.2). 
Effective bandwidth for convolution is calculated in Section 3.5 as (8n^2 FS^2 bytes) / T_GPUcomp. Figure 5.1 depicts the maximum effective bandwidth for a filter size, FS, of 3.

3 For simplicity, the equations are approximations. Due to boundary checking, not all reads defined by the equations are necessary or issued.
4 Since FS is not a power of two, some reads are 32B transactions if dBlk.x ≥ 16.

Figure 5.1: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.

As illustrated, the reduction in global memory accesses significantly increases the effective bandwidth for convolution with a small filter size. The shared memory implementation yields a speedup, on average, of 2.74 compared to the global memory implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 2.37 and 2.86, respectively. Figure 5.2 depicts the maximum effective bandwidth for a filter size of 63.

Figure 5.2: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.

For a filter size of 63, utilizing shared memory increases the effective bandwidth due to the reduction in global memory accesses. However, comparing Figures 5.1 and 5.2 shows the increase in bandwidth is smaller as the filter size increases. Regardless, in Figure 5.2, the shared memory implementation yields a speedup, on average, of 1.25 compared to the global memory implementation. The minimum (n = 512) and maximum (n = 4096) speedups are 1.22 and 1.26, respectively. Although the increase in effective bandwidth for the shared memory implementation is less than for a small filter size, Step 1 of the procedure is valid for a filter size of 63. Lastly, the maximum effective bandwidth is illustrated in Figure 5.3 for a filter size of 513.

Figure 5.3: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.

Similar to a filter size of 63, the shared memory implementation yields a speedup, on average, of 1.27 compared to the global memory implementation for a large filter size. Likewise, the minimum (n = 512) and maximum (n = 8192) speedups are 1.18 and 1.32, respectively. These results show that Step 1 of the procedure yields larger reductions of computation time for smaller filter sizes. Regardless, Step 1 is verified to reduce computation time for all data and filter sizes tested.

5.1.2 Computation Patterns

Step 2 of the procedure specifies that the optimized computation pattern is derived. As mentioned, computation patterns are determined by the code and represent the manner and order in which results are computed. For convolution, if thread_ij computes C_ij where C = A ∗ B, the order in which C_ij is computed creates varying computation patterns. The filter, B, is centered over each value of the input, A, and the products of the overlapping values of A and B are summed to compute one value of C, C_ij. This is depicted in Figure 5.4.

Figure 5.4: Computation patterns: two computation patterns for convolution. (a) Column-major. (b) Row-major.

The order in which overlapping values are multiplied
determines the computation pattern. A pattern where each partial sum is computed ?rst in the y-dimension, as depicted in Figure 5.4a, is the column-major computation pattern. Figure 5.4b depicts the computation pattern where each value is computed ?rst in the x- dimension, row-major. Column-major The shared memory implementation of convolution utilizing the column-major compu- tation pattern depicted in Figure 5.4a is shown in Listing 5.2 in Section 5.1.1. The amount of shared memory allocated per block is de?ned by Equation (5.2) and the number of global memory accesses for accessing A and B is approximated5 in Equations (5.3)-(5.6). 5Due to boundary checking, not all reads are issued. 110 Optimized (row-major) The shared memory implementation of convolution utilizing the row-major computation pattern depicted in Figure 5.4b requires several minor modi?cations to Listing 5.2. There- fore, the shared memory implementation of convolution utilizing the row-major computation pattern is shown in Listing 5.3. Shared memory is allocated for parts of A and B in Lines 2 and 3, respectively. Therefore, SMemPerBlk = (2?dBlk:x?dBlk:y +dBlk:x)?4: (5.7) Every dBlk:x iteration, parts of A and B are loaded into shared memory as shown in Lines 12-28. Since all HWs within a block access the same part of B, only some threads are utilized to load B into shared memory. In this kernel, threads of each HW with threadIdx.y=0 are utilized to load part of B into shared memory as shown in Line 14. Boundary checking is performed in Lines 15, 19, 20, 23, 29, and 30. Since global memory is accessed every dBlk:x iteration, the number of global memory accesses for accessing A is dependent on the size of B, FS, and the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Lines 6 and 9. Therefore, the number of global memory accesses for accessing A is approximately8 Gld32B = 2n 2FS dBlk:x ? FS dBlk:x? if dBlk:x ? 8 (5.8) and6 Gld64B = 2n 2FS 16 ? FS dBlk:x? if dBlk:x ? 16: (5.9) If dBlk:x ? 16, this reduces the number of global memory accesses for accessing A by a factor of FS2? FS dBlk:x? compared to the global memory implementation. However, if dBlk:x = 6Since FS is not a power of two, some reads are 32B transactions if dBlk:x ? 16. 
111 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int radius = (FS-1)/2; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Row = indexRow + k - radius; 10 for(int l = 0; l < FS; l++) { 11 int Col = indexCol + l - radius; 12 if((l % blockDim.x) == 0) { 13 __syncthreads(); 14 if(threadIdx.y == 0) { 15 if(l + threadIdx.x < FS) { 16 Bs[threadIdx.x] = B[k * FS + l + threadIdx.x]; 17 } 18 } 19 if(Row >= 0 && Row < n) { 20 if(Col >= 0 && Col < n) { 21 As[threadIdx.y][threadIdx.x] = A[Row * n + Col]; 22 } 23 if((Col + blockDim.x) >= 0 && (Col + blockDim.x) < n ) { 24 As[threadIdx.y][threadIdx.x + blockDim.x] = A[Row * n + Col + rcurvearrowse blockDim.x]; 25 } 26 } 27 __syncthreads(); 28 } 29 if(Row >= 0 && Row < n ) { 30 if(Col >= 0 && Col < n) { 31 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l rcurvearrowse % blockDim.x)]; 32 } 33 } 34 } 35 } 36 C[indexRow * n + indexCol] = temp; 37 } Listing 5.3: The shared memory implementation of convolution on a GPU utilizing the optimized computation pattern. 112 dBlk:y, there is no reduction in the number of accesses compared to the shared memory implementation utilizing the column-major computation pattern. Since global memory is accessed every dBlk:x iteration, the number of global memory accesses for accessing B is approximately8 Gld32B = n 2FS dBlk:x?dBlk:y ? FS dBlk:x? if dBlk:x ? 8: (5.10) and Gld64B = n 2FS 16dBlk:y ? FS dBlk:x? if dBlk:x ? 16: (5.11) If dBlk:x ? 16, 32B accesses for accessing B are replaced with 64B accesses compared to the shared memory implementation utilizing the column-major computation pattern. The number of accesses to global memory is reduced by a factor of 16dBlk:y? FS dBlk:y? dBlk:x? FSdBlk:x? . Therefore, if dBlk:x = dBlk:y = 16, this computation pattern reduces the number of accesses to global memory by a factor of 16 and the number of bytes accessed by a factor of 8. Therefore, the row-major pattern is the optimized computation pattern. Figures 5.5-5.7 depict maximum e?ective bandwidth of convolution before (Listing 5.2) and after (Listing 5.3) utilizing the optimized computation pattern. The optimal placement of data is utilized in both listings. As mentioned, e?ective bandwidth for convolution is 8n2FS2bytes TGPUcomp: . Figure 5.5 depicts maximum e?ective bandwidth for a ?lter size, FS, of 3. 113 512 1024 2048 4096 8192 1638460 65 70 75 80 85 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.5: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=3. Shown in Figure 5.5, Step 2 of the procedure yields a modest speedup, on average, of 1.03 compared to the non-optimized pattern. As mentioned, for the optimized computation pattern, the number of accesses to global memory is reduced by a factor of 16dBlk:y? FS dBlk:y? dBlk:x? FSdBlk:x? compared to the non-optimized pattern. Since maximum e?ective bandwidth is illustrated and dBlk:x and dBlk:y vary, the optimized computation pattern yields a modest speedup. The minimum (n = 16384) and maximum (n = 512) speedups are 1.02 and 1.07, respectively. For a ?lter size of 63, the maximum e?ective bandwidth is illustrated in Figure 5.6. 
512 1024 2048 4096 8192 1638489.5 90 90.5 91 91.5 92 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.6: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=63. 114 In Figure 5.6, the e?ective bandwidth, before and after utilizing the optimized compu- tation pattern, shows a di?ering behavior for the varying data sizes. However, given the scale of the ?gure, the di?erence is small and not explored further. Similar to a small ?lter size, the optimized computation pattern for a ?lter size of 63 yields a near-constant speedup, on average, of 1.01 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 512) speedups are 1.00 and 1.02, respectively. Lastly, Figure 5.7 depicts maximum e?ective bandwidth for a ?lter size of 513. 512 1024 2048 4096 8192 1638490 95 100 105 110 115 120 125 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.7: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=513. AsillustratedinFigure5.7, theoptimizedcomputationpatternyieldsaconstantspeedup of 1.03 compared to the non-optimized pattern. The e?ective bandwidth for all ?lter sizes tested is roughly equivalent and a modest increase is shown when utilizing the optimized computation pattern. Although modest speedups are measured for all ?lter sizes, the band- width is increased for all data and ?lter sizes tested. Therefore, Step 2 of the parallelization procedure is shown to yield increases in e?ective bandwidth, and thus reduce the computa- tion time, for convolution. 115 5.1.3 Access Patterns Step 3 of the parallelization procedure is to derive the optimized access pattern that minimizes partition camping and eliminates bank con?icts. For Mv and MM, HWs begin execution by reading the ?rst value of their respective rows of A. However, for convolution, HWs begin by reading varying columns of their respective rows of A. Therefore, HWs are spread to varying partitions for accessing A without an optimized access pattern. Figure 5.8a illustrates a 3x3 ?lter, B, stored in global memory. Padding the ?lter with zeros such that each row resides in the next partition of global memory reduces partition camping. This is illustrated in Figure 5.8b. In addition, padding the ?lter aligns each 0 1 2 3 row of filter filter values partition (a) Non-padded ?lter in global memory. 0 1 2 3 row of filter filter values partition (b) Padded ?lter in global memory. Figure 5.8: Access patterns: example of a 3x3 ?lter, B, stored in global memory. Each row of a partition is 4 values of type ?oat. row to a speci?ed boundary. From the CUDA programming guide [64], ?Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).? Therefore, padding the ?lter with zeros ensures the data is naturally aligned thus reducing the number of global memory transactions. Similarly, padding A with zeros eliminates the boundary checking instructions in Lines 15, 19, 20, 23, 29, and 30 of Listing 5.3. From the CUDA programming guide [64], ?If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each 116 branch path taken.? Eliminating boundary checking instructions via padding eliminates thread divergence. 
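The following host-side sketch illustrates the padding described above, assuming row-major float arrays; the function names are illustrative. It mirrors the indexing used later in Listing 5.4 (a padded row length of n + FS - 1 for A, and padding = 64 - FS % 64 zeros per filter row so each row of B is aligned to a 256-byte partition boundary). The padded buffers would then be transferred to the GPU with cudaMemcpy.

```c
#include <stdlib.h>
#include <string.h>

/* Zero-pad the n x n input A by the filter radius on every side so the kernel
 * needs no boundary checks; the padded matrix is (n + FS - 1) x (n + FS - 1). */
static float *pad_input(const float *A, int n, int FS)
{
    int radius = (FS - 1) / 2;
    int np = n + FS - 1;                         /* padded width and height */
    float *Ap = (float *)calloc((size_t)np * np, sizeof(float));
    for (int i = 0; i < n; i++)                  /* copy A into the centre  */
        memcpy(&Ap[(i + radius) * np + radius], &A[i * n], n * sizeof(float));
    return Ap;
}

/* Pad each row of the FS x FS filter B with zeros up to the next multiple of
 * 64 floats (256 bytes) so consecutive rows fall in consecutive partitions
 * of global memory and remain naturally aligned. */
static float *pad_filter(const float *B, int FS, int *padded_row /* out */)
{
    int padding = 64 - FS % 64;
    *padded_row = FS + padding;
    float *Bp = (float *)calloc((size_t)FS * (size_t)(*padded_row), sizeof(float));
    for (int k = 0; k < FS; k++)                 /* copy each filter row */
        memcpy(&Bp[k * (*padded_row)], &B[k * FS], FS * sizeof(float));
    return Bp;
}
```
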
Therefore, modifying Listing 5.3 to accommodate padding of A and B yields Listing 5.4 which is the shared memory implementation of convolution utilizing the optimized computation and access pattern. Shared memory is allocated in Lines 2 and 3 of 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int padding = 64 - FS % 64; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Row = indexRow + k; 10 for(int l = 0; l < FS; l++) { 11 int Col = indexCol + l; 12 if((l % blockDim.x) == 0) { 13 __syncthreads(); 14 if(threadIdx.y == 0) { 15 Bs[threadIdx.x] = B[k * (FS + padding) + l + threadIdx.x]; 16 } 17 As[threadIdx.y][threadIdx.x] = A[Row * (n + FS - 1) + Col]; 18 As[threadIdx.y][threadIdx.x + blockDim.x] = A[Row * (n + FS - 1) rcurvearrowse + Col + blockDim.x]; 19 __syncthreads(); 20 } 21 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l % rcurvearrowse blockDim.x)]; 22 } 23 } 24 C[indexRow * n + indexCol] = temp; 25 } Listing 5.4: The shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern. Listing 5.4 the same as in Lines 2 and 3 of Listing 5.3. Therefore, the amount of shared memory allocated per block is de?ned by Equation (5.7). In Line 4, padding instead of the ?lter radius is calculated. padding speci?es the number of zeros added to each row of the ?lter. The index calculations for accessing A, Lines 9 and 11, no longer include the radius of the ?lter, as A is padded. Since A is padded by the size of the ?lter, the length of each row is n+FS ?1 rather than n as shown in Lines 17 and 18. 117 Figures 5.9-5.11 depict maximum e?ective bandwidth of convolution before (Listing 5.3) and after (Listing 5.4) utilizing the optimized access pattern. The optimal placement of data and optimized computation patterns are utilized. Figure 5.9 depicts maximum e?ective bandwidth for a ?lter size of 3. 512 1024 2048 4096 8192 1638460 80 100 120 140 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.9: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for con- volution on the T10 GPU. FS=3. As mentioned, the optimized access pattern, which consists of padding A and B with zeros, reduces partition camping. In addition, the optimized access pattern eliminates bound- ary checking, and thus, reduces thread divergence. As illustrated in Figure 5.9, the optimized access pattern yields a speedup, on average, of 1.54 compared to the non-optimized pattern for a small ?lter size. The minimum (n = 512) and maximum (n = 16384) speedups are 1.27 and 1.67, respectively. Similar to Mv and MM, the optimized access pattern, which reduces partition camping, signi?cantly increases bandwidth for convolution with a small ?lter size. Maximum e?ective bandwidth for a ?lter size of 63 is depicted in Figure 5.10. 118 512 1024 2048 4096 8192 1638490 100 110 120 130 140 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.10: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=63. Similar to a ?lter size of 3, the optimized access pattern for a ?lter size of 63 yields a speedup, on average, of 1.60 compared to the non-optimized pattern. The minimum (n = 512) and maximum (n = 16384) speedups are 1.58 and 1.61, respectively. 
Lastly, Figure 5.11 depicts maximum e?ective bandwidth for a ?lter size of 513. 512 1024 2048 4096 8192 1638490 100 110 120 130 140 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.11: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=513. As shown in Figure 5.11, the optimized access pattern yields a speedup, on average, of 1.46 compared to the non-optimized pattern. The minimum (n = 512) and maximum (n = 16384) speedups are 1.19 and 1.59, respectively. For each ?lter size tested, Step 3 of 119 the parallelization procedure, derive the optimized access pattern, yields signi?cant increases in maximum e?ective bandwidth. In addition, the optimized access pattern yields similar increases in speedup for all data and ?lter sizes tested. 5.1.4 Fine-tuning Step 4 of the parallelization procedure is to ?ne-tune the kernel by unrolling loops and minimizing index calculations. As mentioned, loop unrolling reduces computation time by increasing the instruction mix of memory transactions and computation instructions. Minimizing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions, thus reducing the computation time. Step 4, ?ne-tuning, is applied to the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern, Listing 5.4, to yield Listing 5.5. No modi?cations are performed to the allocation of shared memory and therefore, Equation (5.7) de?nes the amount of shared memory utilized. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (5.8)-(5.11). In Line 8, calc1 is calculated outside of the for-loops as it is a commonly used calculation in the for-loops for index calculations of A and B. Index calculations are reduced by initializing pointerstoA,B, andCin Lines 9-11, thus eliminating the need for Row and Col from Listing 5.4. Pointer arithmetic is performed in Lines 21, 23, 29, and 30 to further reduce index calculations. In Listing 5.5, for-loops are preceded with the #pragma unroll directive in Lines 12 and 14 to increase the instruction mix and reduce the number of registers used per thread. Since the loops are dependent only on FS, the compiler unrolls the loops until the maximum instruction limit is reached. Therefore, the amount of loop unrolling performed is dependent on FS. 
120 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int padding = 64 - FS % 64; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 int calc1 = blockDim.x * (floorf(FS / blockDim.x) + 1); 9 A += indexRow * (n + FS-1) + indexCol; 10 B += threadIdx.x; 11 C += indexRow * n + indexCol; 12 #pragma unroll 13 for(int k = 0; k < FS; k++) { 14 #pragma unroll 15 for(int l = 0; l < FS; l++) { 16 if((l % blockDim.x) == 0) { 17 __syncthreads(); 18 if(threadIdx.y == 0) { 19 Bs[threadIdx.x] = B[0]; 20 } 21 B += blockDim.x; 22 As[threadIdx.y][threadIdx.x] = A[0]; 23 A += blockDim.x; 24 As[threadIdx.y][threadIdx.x + blockDim.x] = A[0]; 25 __syncthreads(); 26 } 27 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l % rcurvearrowse blockDim.x)]; 28 } 29 A += n + FS-1 - calc1; 30 B += FS + padding - calc1; 31 } 32 C[0] = temp; 33 } Listing 5.5: The shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after ?ne-tuning 121 Figures 5.12-5.14 depict the maximum e?ective bandwidth of convolution before (Listing 5.4) and after (Listing 5.5) ?ne-tuning. Figure 5.12 depicts maximum e?ective bandwidth for a ?lter size, FS, of 3. 512 1024 2048 4096 8192 1638480 90 100 110 120 130 140 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.12: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=3. In Figure 5.12, the kernel after ?ne-tuning yields a modest speedup, on average, of 1.04 compared to before ?ne-tuning. The minimum (n = 16384) and maximum (n = 512) speedups are 0.99 and 1.22, respectively. Similar to Mv, the ?gure suggests Step 4 of the procedure, ?ne-tuning the kernel, has a minimal impact on computation time for a small ?lter size. However, as illustrated in Figure 5.13, Step 4 of the procedure increases bandwidth more for larger ?lter sizes. 122 512 1024 2048 4096 8192 163840 200 400 600 800 1000 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.13: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=63. The kernel after ?ne-tuning yields a speedup, on average, of 6.62 compared to before ?ne-tuning for a ?lter size of 63. The minimum (n = 512) and maximum (n = 16384) speedups are 6.58 and 6.64, respectively. The speedup is signi?cantly higher than for a ?lter size of 3 presumably due to loop unrolling. As mentioned, the compiler automatically unrolls small loops. The results in Figures 5.12 and 5.13 suggest that for a small ?lter size, the compiler automatically unrolls both loops depicted in Listing 5.4. Therefore, ?ne-tuning the kernel has a minimal impact for a small ?lter size. However, as the ?lter size increases, results indicate the compiler does not automatically unroll the for-loops. Therefore, the procedure yields signi?cant speedups by forcing the compiler to unroll the loops. Lastly, Figure 5.14 depicts e?ective bandwidth for a ?lter size of 513. 123 512 1024 2048 4096 8192 163840 200 400 600 800 1000 1200 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.14: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=513. Similar to a ?lter size of 63, for a ?lter size of 513, the kernel after ?ne-tuning yields a speedup, on average, of 6.98 compared to before ?ne-tuning. 
The minimum (n = 512) and maximum (n = 8192) speedups are 6.91 and 7.00, respectively. Again, the speedup is significantly higher for a filter size of 513 than for a filter size of 3. Since the amount of loop unrolling performed is dependent on FS, the results suggest that loop unrolling occurs without instructing the compiler for small filter sizes but not for larger sizes. Therefore, significant speedup from Step 4 is attained only for larger filter sizes, and the speedup from Step 4 of the procedure, fine-tuning the kernel, is dependent on the filter size. Regardless, Step 4 of the procedure yields an increase in effective bandwidth, and thus a decrease in computation time, on average, for all data and filter sizes tested.

5.1.5 Input Parameters

Step 5 of the parallelization procedure specifies that optimal input parameters are derived. A list of five steps to derive optimal input parameters is presented in Section 4.5. Since the derivation is Step 5 of the procedure, optimal input parameters are derived for the kernel in Listing 5.5, which is the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after fine-tuning. From the kernel and Equations (4.12) and (4.13),

8 ≤ dBlk.x ≤ 512 (5.12a)
1 ≤ dBlk.y ≤ 512 / dBlk.x (5.12b)
dGrd.x = n / dBlk.x (5.12c)
dGrd.y = n / dBlk.y. (5.12d)

From Equation (5.7), SMemPerBlk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4. From compilation of the kernel, RegsPerThd = 10 and is constant regardless of n, FS, and the input parameters. Therefore, Equation (3.7) simplifies to

Blks_active:SM = min(8, 1024 / (dBlk.x · dBlk.y)). (5.13)

Substituting Blks_active:SM from Equation (5.13), dGrd.x from Equation (5.12c), and dGrd.y from Equation (5.12d) into Equation (3.8) yields

Blks_active:GPU = min(30 · min(8, 1024 / (dBlk.x · dBlk.y)), n^2 / (dBlk.x · dBlk.y)). (5.14)

Since n ≥ 512 from Equation (4.1) and dBlk.x · dBlk.y ≤ 512 from Equation (4.12), Equation (5.14) simplifies to

Blks_active:GPU = 30 · min(8, 1024 / (dBlk.x · dBlk.y)). (5.15)

Substituting Blks_active:GPU from Equation (5.15) and the minimum of Thds_active:GPU from Equation (4.14) into Equation (3.9) yields

12800 ≤ 30 · min(8, 1024 / (dBlk.x · dBlk.y)) · dBlk.x · dBlk.y. (5.16)

If 1024 / (dBlk.x · dBlk.y) ≤ 8, Equation (5.16) is satisfied. Solving Equation (5.16) for dBlk.y when 1024 / (dBlk.x · dBlk.y) > 8 yields dBlk.y ≥ 12800 / (240 · dBlk.x). Since dBlk.y is a power of two, this simplifies to dBlk.y ≥ 64 / dBlk.x. Combining this with the limits on dBlk.y from Equation (5.12b) yields

max(64 / dBlk.x, 1) ≤ dBlk.y ≤ 512 / dBlk.x. (5.17)

Step 2 of deriving optimal input parameters specifies that shared memory utilization is maximized to ensure data is reused as much as possible. From Line 15 of the kernel in Listing 5.5, the trip count of the inner loop, where shared memory is accessed, is dependent on FS. From the number of accesses to global memory in Equations (5.8)-(5.11), dBlk.x ≥ 16. Since loads are performed ⌈FS / dBlk.x⌉ times in the inner loop, dBlk.x should be at least FS; therefore, dBlk.x ≥ max(16, FS). However, if dBlk.x > FS, dBlk.x · dBlk.y values are accessed for A and dBlk.x values are accessed for B, but only FS values are necessary. Therefore, dBlk.x should be at least FS but no larger than the next power of two, so dBlk.x = max(16, 2^⌈log2(FS)⌉). If FS is greater than the upper limit of dBlk.x from Equation (5.12a), then dBlk.x = 512.
Therefore,

dBlk.x = min(max(16, 2^⌈log2(FS)⌉), 512). (5.18)

Step 3 of deriving optimal input parameters specifies that the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces the time each HW in a block waits for synchronization, since there are fewer HWs per block. Since SMemPerBlk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4 and dBlk.x is constant from Equation (5.18), the minimum value of dBlk.y from Equation (5.17) is utilized. Since dBlk.y = max(64 / dBlk.x, 1) and from Equations (5.18), (5.12c), and (5.12d),

dBlk.x = min(max(16, 2^⌈log2(FS)⌉), 512)
dBlk.y = max(64 / dBlk.x, 1)
dGrd.x = n / dBlk.x
dGrd.y = n / dBlk.y. (5.19)

Given n and FS, all input parameters are constant and, therefore, no further steps are necessary to derive the optimal input parameters. So, Equation (5.19) defines the optimal input parameters for the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after fine-tuning.
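As a concrete check of Equation (5.19), the following host-side sketch computes the optimal input parameters for the fine-tuned convolution kernel given n and FS; the function name is illustrative.

```c
#include <stdio.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Sketch of Equation (5.19): optimal input parameters for the fine-tuned
 * shared memory convolution kernel, given data size n and filter size FS. */
static void conv_optimal_params(int n, int FS,
                                int *dBlk_x, int *dBlk_y,
                                int *dGrd_x, int *dGrd_y)
{
    int pow2 = 1;
    while (pow2 < FS) pow2 *= 2;              /* 2^ceil(log2(FS)) */
    *dBlk_x = imin(imax(16, pow2), 512);      /* Equation (5.18) */
    *dBlk_y = imax(64 / *dBlk_x, 1);
    *dGrd_x = n / *dBlk_x;
    *dGrd_y = n / *dBlk_y;
}

int main(void)
{
    const int fs[] = { 3, 63, 513 };
    for (int i = 0; i < 3; i++) {
        int bx, by, gx, gy;
        conv_optimal_params(16384, fs[i], &bx, &by, &gx, &gy);
        printf("FS=%3d  dBlk=(%d,%d)  dGrd=(%d,%d)\n", fs[i], bx, by, gx, gy);
    }
    return 0;
}
```

For the filter sizes tested, this yields dBlk = (16, 4) for FS = 3, (64, 1) for FS = 63, and (512, 1) for FS = 513, with the grid dimensions following from n.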
Figures 5.15-5.17 depict measured computation time of all input parameters for GPU computation of convolution. In the figures in this section, the x-axis represents the number of threads per block (dBlk.x · dBlk.y); therefore, several combinations of dBlk.x and dBlk.y yield an equivalent number of threads per block. Figure 5.15 depicts time for a filter size of 3.

Figure 5.15: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=3.

For a filter size of 3, derivation of optimal input parameters yields computation time, on average, within 9.6% of the minimum measured time. Although the percentage difference between the optimal input parameters and the minimum time is greater than for Mv and MM, the worst case utilizing optimal input parameters (n = 8192) yields time within 12.1% of the minimum. In the best case (n = 512), utilizing optimal input parameters yields time within 5.3% of the minimum. Computation time for a filter size of 63 is illustrated in Figure 5.16.

Figure 5.16: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=63.

As illustrated in Figure 5.16, for a filter size of 63, derivation of optimal input parameters yields computation time, on average, within 0.2% of the minimum measured time. The percentage difference between optimal input parameters and the minimum time is much less for a filter size of 63 than for a filter size of 3. In addition, the percentage difference is roughly equivalent to the difference measured for Mv and MM. In the worst case utilizing optimal input parameters (n = 512), the time is within 1.0% of the minimum measured. The time for the best case utilizing optimal input parameters (n = {1024, 2048}) is the minimum measured. Figure 5.17 depicts computation time for a filter size of 513.

Figure 5.17: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=513.

For a filter size of 513, derivation of optimal input parameters yields computation time, on average, within 0.3% of the minimum measured time. In the worst case (n = 512), utilizing optimal input parameters yields time within 1.9% of the minimum measured. In the best case (n = {1024, 2048, 4096, 8192, 16384}), utilizing optimal input parameters yields the minimum measured time. Similar to a filter size of 63, the percentage difference between optimal input parameters and the minimum time is much less for a filter size of 513 than for a filter size of 3. Likewise, the percentage difference is roughly equivalent to the difference measured for Mv and MM. For all data and filter sizes tested, the worst-case time utilizing optimal input parameters occurs for a filter size of 3 and is within 12.1% of the minimum measured.

GPU Computation Summary

After Step 5, the only remaining step of the parallelization procedure is to determine the optimal partitioning of computation between the CPU and GPU; the portion of the procedure concerned with minimizing GPU computation time is therefore complete. So, this section illustrates the maximum effective bandwidth of each aforementioned step of the parallelization procedure as it pertains to GPU computation time. In addition, a comparison of the procedure to the theoretical bandwidth is illustrated. For each figure in this subsection, Naïve denotes the global memory implementation, Sh. Mem. denotes the shared memory implementation, O.C.P. denotes the optimized computation pattern utilizing shared memory, and O.A.P. denotes the optimized access pattern utilizing the optimized computation pattern and shared memory. P.P. denotes the entire parallelization procedure with regards to GPU computation, which utilizes shared memory, the optimized computation and access patterns, fine-tuning adjustments, and optimal input parameters. Figures 5.18-5.20 depict the maximum effective bandwidth of convolution for each step of the parallelization procedure, in addition to the theoretical bandwidth. Figure 5.18 depicts the maximum effective bandwidth for a filter size, FS, of 3.

Figure 5.18: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.

The parallelization procedure yields a speedup, on average, of 4.07 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 3.74 and 4.39, respectively.
The procedure yields effective bandwidth, on average, of 39.0% of the theoretical bandwidth. The worst case (n = 512) and best case (n = 16384) utilizing the procedure yield effective bandwidths of 39.0% and 48.2% of the theoretical, respectively. As mentioned, the procedure yields computation time utilizing optimal input parameters, on average, within 9.6% of the minimum. Therefore, as shown in Figure 5.18, the maximum effective bandwidth is slightly higher for the O.A.P. than for the P.P., which includes the optimal input parameters. Regardless, the procedure yields a significant speedup compared to the naïve implementation. Figure 5.19 depicts effective bandwidth for a filter size of 63.

Figure 5.19: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.

In Figure 5.19, the large difference between the O.A.P. and the procedure is due to the effect of fine-tuning on computation time being dependent on the filter size. For a small filter size (Figure 5.18), the O.A.P. slightly outperforms the procedure for some data sizes. However, for a filter size of 63 (Figure 5.19), the procedure significantly outperforms the O.A.P. for all data sizes. Regardless, the procedure yields a speedup, on average, of 13.31 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 12.85 and 13.52, respectively. The procedure yields effective bandwidth, on average, of 64.2% of the theoretical bandwidth. In the worst case (n = 512), the effective bandwidth is 62.6% of the theoretical. In the best case (n = 16384), the procedure yields 64.6% of the theoretical bandwidth. Lastly, effective bandwidth for a filter size of 513 is depicted in Figure 5.20.

Figure 5.20: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.

Similar to a filter size of 63, the large difference between the O.A.P. and the procedure is due to the effect of fine-tuning on computation time being dependent on the filter size. The procedure yields a speedup, on average, of 13.38 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 9.82 and 15.04, respectively. The procedure yields effective bandwidth, on average, of 68.7% of the theoretical bandwidth. In the worst case utilizing the procedure (n = 512), the effective bandwidth is 65.9% of the theoretical. In the best case (n = 16384), the procedure yields 69.5% of the theoretical bandwidth. Therefore, for all data and filter sizes tested, the parallelization procedure significantly reduces GPU computation time compared to the naïve implementation.

5.1.6 Computation Partitioning

The last step of the parallelization procedure, Step 6, is to determine the optimal partitioning of computation between the CPU and GPU. For convolution, 2n^2 FS^2 float values are necessary for computation. If FS ≤ 15, each thread in a HW reuses FS float values from global memory. However, if FS > 15, each thread in a HW reuses 16 float values from global memory. Therefore, 8n^2 FS^2 / min(FS, 16) bytes are necessary from global memory and the theoretical bandwidth is min(FS, 16) · 94 GB/s. Therefore, T_GPUcomp = 8n^2 FS^2 / (1024^3 · min(FS, 16) · 94).
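The short sketch below evaluates this GPU time estimate; the function name is illustrative, and the constants are those stated above (94 GB/s global memory bandwidth, reuse of min(FS, 16) values per thread).

```c
#include <stdio.h>

/* Sketch of the convolution GPU time estimate: 2*n*n*FS*FS float values are
 * needed, each thread in a HW reuses min(FS,16) values from global memory,
 * and the resulting theoretical bandwidth is min(FS,16)*94 GB/s. */
static double conv_t_gpu(double n, double FS)
{
    double reuse = FS < 16.0 ? FS : 16.0;
    double bytes_needed = 8.0 * n * n * FS * FS;       /* 2*n^2*FS^2 floats  */
    double bandwidth = reuse * 94.0 * 1073741824.0;    /* bytes per second   */
    return bytes_needed / bandwidth;                   /* T_GPUcomp, seconds */
}

int main(void)
{
    printf("n=16384, FS=63: T_GPUcomp = %.3f s\n", conv_t_gpu(16384.0, 63.0));
    return 0;
}
```
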
Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds. Since communication time is dependent on FS and n, if FS < 256,

T_comm = 3.02×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 5.00×10^-5   if n < 256
    3.06×10^-9 · n^2 + 9.60×10^-5   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 4.17×10^-4   if n ≥ 512.

If 256 ≤ FS < 512,

T_comm = 1.08×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 9.60×10^-5   if n < 256
    3.06×10^-9 · n^2 + 1.42×10^-4   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 4.63×10^-4   if n ≥ 512.

If FS ≥ 512,

T_comm = 1.08×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 2.27×10^-4   if n < 256
    3.06×10^-9 · n^2 + 2.73×10^-4   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 5.94×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substituting the CPU computation time estimate, if

8n^2 FS^2 / (1024^3 · B_CPU) ≤ 8n^2 FS^2 / (1024^3 · min(FS, 16) · 94) + T_comm,

where B_CPU denotes the CPU effective bandwidth (in GB/s) estimated in Chapter 3, computation is performed on the CPU to minimize execution time. Assuming the minimum filter size (FS = 3), solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 64. Therefore, if FS = 3 and n > 64, computation is performed on the GPU to minimize execution time. If FS ≥ n, solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 11. Therefore, if FS ≥ n > 11, computation is performed on the GPU.

Figures 5.21-5.23 illustrate measured execution time for convolution. Since BLAS routines do not include convolution, CPU execution time is the measured time of a non-optimized C implementation. Likewise, CUBLAS does not include convolution, so it is excluded. Figure 5.21 depicts execution time for convolution with a filter size of 3.

Figure 5.21: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 3.

In Figure 5.21, the parallelization procedure yields a speedup, on average, of 5.30 compared to the CPU implementation and 1.61 compared to the naïve implementation. The procedure yields execution time, on average, within 11.9% of the theoretical minimum execution time. In the worst case utilizing the procedure (n = 8192), execution time is within 13.2% of the theoretical minimum. The best case utilizing the procedure (n = 512) yields time within 11.1% of the theoretical minimum. Similar to Mv, the speedup compared to the naïve implementation in Figure 5.21 is less than in Figure 5.18 due to communication time. For convolution with a small filter size, the communication time is a significant portion of execution time. Therefore, large speedups measured for computation of convolution with a small filter size have less of an impact on execution time, since the execution time includes communication and computation time. Execution time for convolution with a filter size of 63 is illustrated in Figure 5.22.

Figure 5.22: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 63.
The procedure yields a speedup, on average, of 323.21 compared to the CPU implemen- tation, and 12.33 compared to the na??ve implementation. In Figure 5.22, for a ?lter size of 63, communication time is a small portion of execution time compared to a ?lter size of 3. Therefore, the procedure yields a speedup in execution time similar to the speedup achieved for computation only. The procedure yields execution time, on average, within 33.0% of the theoretical minimum execution time. Worst-case utilizing the procedure (n = 512) yields time within 33.4% of the theoretical minimum. In the best-case (n = 1024), the procedure yields time within 32.9% of the theoretical minimum. The signi?cant speedup compared to the CPU implementation is presumably due to the non-optimized CPU implementation. Lastly, ?gure 5.23 depicts execution time for convolution with a ?lter size of 513. 137 512 1024 2048 4096 8192 1638410 2 104 106 108 1010 n me asu red ex ecu tio nt im e( ms ) CPU GPU (na??ve) P.P. Figure 5.23: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 513. As illustrated in Figure 5.23, the procedure yields a speedup, on average, of 424.27 compared to the CPU implementation, and 13.36 compared to the na??ve implementation. Similar to a ?lter size of 63, the procedure yields a speedup in execution time similar to the speedup achieved for computation only. This is due to communication time being a small portion of execution time. The procedure yields execution time, on average, within 31.3% of the theoretical minimum execution time. In the worst-case (n = 512), the procedure yields time within 34.0% of the theoretical minimum. In the best-case utilizing the procedure (n = 16384), the time is within 30.5% of the theoretical minimum. As mentioned for FS = 63, the signi?cant speedup compared to the CPU implementation is presumably due to the non- optimized CPU implementation. For all data and ?lter sizes tested, the parallelization procedure yields a worst-case speedup of execution time, on average, of 1.61 compared to the na??ve GPU implementation. Therefore, results show the procedure is valid for convolution. 138 5.2 Conjugate Gradient This section is an illustration of execution time of the parallelization procedure applied to an application utilizing a matrix-based computation. The conjugate gradient method [65] is utilized to further validate the procedure. The conjugate gradient method is an iterative algorithm for solving a system of linear equations consisting of several vector-based computations (Level 1 BLAS) and Mv (Level 2 BLAS). The GPU (CUBLAS) implementation presented in the results utilizes CUBLAS routines for each vector- and matrix-based computation. Therefore, execution time is the sum of GPU computation and communication time. The CPU (ATLAS) implementation utilizes BLAS routines via ATLAS for all vector- and matrix-based computation and no communication is necessary. The na??ve implementation and parallelization procedure utilize the GPU for matrix-based computations and CPU for vector-based computations. The na??ve implementation utilizes a na??ve GPU kernel. The procedure utilizes the GPU kernel developed by the procedure. Since the matrix in Mv is constant, it is necessary to transfer it to the GPU once. However, for the vector in Mv, it is necessary to transfer it to the GPU in each iteration. Likewise, it is necessary to transfer the result of Mv to the CPU in each iteration. 
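The sketch below illustrates the hybrid partitioning described above for the conjugate gradient method: the matrix is copied to the GPU once, each iteration transfers the search direction to the GPU, performs Mv with the kernel developed in Chapter 4 (referred to here as mv_kernel, an illustrative name and signature), copies the product back, and performs the Level 1 BLAS work on the CPU. Error checking and convergence testing are omitted.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Mv kernel developed by the procedure in Chapter 4 (assumed available). */
__global__ void mv_kernel(const float *A, const float *x, float *y, int n);

static float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

void conjugate_gradient(const float *A, const float *b, float *x, int n,
                        int iters, dim3 dGrd, dim3 dBlk)
{
    float *dA, *dp, *dAp;
    cudaMalloc(&dA,  (size_t)n * n * sizeof(float));
    cudaMalloc(&dp,  n * sizeof(float));
    cudaMalloc(&dAp, n * sizeof(float));
    cudaMemcpy(dA, A, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice); /* once */

    float *r  = (float *)malloc(n * sizeof(float));
    float *p  = (float *)malloc(n * sizeof(float));
    float *Ap = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }   /* x0 = 0 */
    float rr = dot(r, r, n);

    for (int it = 0; it < iters; it++) {
        cudaMemcpy(dp, p, n * sizeof(float), cudaMemcpyHostToDevice);  /* per-iteration transfer */
        mv_kernel<<<dGrd, dBlk>>>(dA, dp, dAp, n);                     /* Mv on the GPU */
        cudaMemcpy(Ap, dAp, n * sizeof(float), cudaMemcpyDeviceToHost);

        float alpha = rr / dot(p, Ap, n);              /* Level 1 BLAS work on the CPU */
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rr_new = dot(r, r, n);
        float beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
    cudaFree(dA); cudaFree(dp); cudaFree(dAp);
}
```
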
For the conjugate gradient method, the number of iterations necessary for the solution to converge depends on the data size and the input. Therefore, execution time is depicted for 8 and 256 iterations. Figure 5.24 illustrates the measured execution time for the conjugate gradient method through 8 iterations.

[Plot: measured execution time (ms) vs. n for the CPU (ATLAS), GPU (naïve), GPU (CUBLAS), and P.P. implementations.]
Figure 5.24: Comparison of execution time (ms) for 8 iterations of the conjugate gradient method on the T10 GPU.

As illustrated in Figure 5.24, the parallelization procedure yields a significant speedup, on average, of 11.15 compared to the ATLAS implementation; the minimum (n = 512) and maximum (n = 16384) speedups are 3.16 and 22.30, respectively. The procedure yields a speedup, on average, of 6.84 compared to the naïve implementation; the minimum (n = 4096) and maximum (n = 16384) speedups are 4.30 and 12.88, respectively. The procedure yields a speedup, on average, of 3.00 compared to the CUBLAS implementation; the minimum (n = 16384) and maximum (n = 512) speedups are 1.12 and 8.02, respectively. Although communication time is included in Figure 5.24, the procedure still yields a significant speedup in execution time for a small number of iterations of the conjugate gradient method.

Figure 5.25 illustrates the measured execution time for the conjugate gradient method through 256 iterations.

[Plot: measured execution time (ms) vs. n for the CPU (ATLAS), GPU (naïve), GPU (CUBLAS), and P.P. implementations.]
Figure 5.25: Comparison of execution time (ms) for 256 iterations of the conjugate gradient method on the T10 GPU.

The parallelization procedure yields a speedup, on average, of 26.89 compared to the ATLAS implementation; the minimum (n = 512) and maximum (n = 16384) speedups are 2.67 and 61.56, respectively. The procedure yields a speedup, on average, of 16.17 compared to the naïve implementation; the minimum (n = 1024) and maximum (n = 16384) speedups are 9.48 and 38.74, respectively. The procedure yields a speedup, on average, of 5.88 compared to the CUBLAS implementation; the minimum (n = 16384) and maximum (n = 512) speedups are 1.01 and 15.99, respectively. Similar to a small number of iterations of the conjugate gradient method, the procedure yields a significant speedup of execution time for a large number of iterations.

Comparing Figures 5.24 and 5.25 shows that the speedup achieved by the procedure increases as the number of iterations increases. This is due to the amortization of the communication time for transferring the matrix to the GPU. As mentioned, for the conjugate gradient method the matrix is constant and must be transferred to the GPU only once. As the number of iterations increases, the contribution of this transfer to execution time decreases. Therefore, the speedup in execution time achieved by the procedure increases with the number of iterations.

As illustrated in Figures 5.24 and 5.25, the parallelization procedure yields the minimum execution time of the conjugate gradient method for all data sizes and iteration counts tested.
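To state the amortization argument explicitly, using notation introduced here for illustration (not in the original analysis), let $T_{A\rightarrow GPU}$ denote the one-time transfer of the matrix to the GPU and $T_{iter.}$ the per-iteration cost of the Mv kernel, the vector transfers, and the CPU vector operations. Then, for $k$ iterations,
$$T_{exec.}(k) \approx T_{A\rightarrow GPU} + k\,T_{iter.},$$
so the fraction of execution time spent on the matrix transfer, $T_{A\rightarrow GPU}/T_{exec.}(k)$, shrinks as $k$ grows from 8 to 256, consistent with the larger speedups observed in Figure 5.25 than in Figure 5.24.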
Chapter 6
Conclusion

In this work, a parallelization procedure is developed to minimize execution time for matrix-based computations on a GPU. The procedure considers such factors as the placement of data, computation patterns, access patterns, fine-tuning, input parameters, communication time, and computation partitioning. The procedure is applied to Mv, MM, convolution, and the conjugate gradient method.

An accurate examination of the layout of GPU memory is presented. Partition camping in global memory is examined and its effects on GPU computation time are shown. Bank conflicts in shared memory are analyzed and their effects are shown. Following the memory layout, execution metrics are formulated as functions of the input parameters to accurately represent the computational behavior of the GPU, and the necessity of the execution metrics is shown. In addition to representing the computational behavior of the GPU, communication time between the CPU and GPU is modeled and estimates are presented. Computation time is modeled for the CPU utilizing ATLAS and for the GPU assuming a maximum bandwidth.

From the layout of GPU memory, the execution metrics, and the communication and computation time estimates, a parallelization procedure is developed through analysis and testing of Mv and MM to minimize execution time for matrix-based computations. The procedure determines the placement of data in GPU memory and derives the optimized computation and access patterns to reduce computation time. Fine-tuning is performed on the GPU code to further reduce computation time. From the execution metrics, optimal input parameters are derived to yield the minimum computation time. Results show that utilizing the optimal input parameters for Mv yields computation time, on average, within 1.8% of the minimum measured. For MM, results show that the optimal input parameters yield time, on average, within 0.5% of the minimum. Therefore, the derivation of optimal input parameters yields minimum or near-minimum computation time for each matrix-based computation tested.

For Mv, results show the parallelization procedure yields a speedup of GPU computation, on average, of 36.37 compared to the ATLAS CPU implementation, 18.98 compared to the naïve GPU implementation, and 1.47 compared to the CUBLAS GPU implementation. For MM, results show the procedure yields a speedup of GPU computation, on average, of 24.47 compared to the ATLAS CPU implementation, 14.58 compared to the naïve GPU implementation, and 1.07 compared to the CUBLAS GPU implementation.

Since the parallelization procedure is developed through analysis of measured data for Mv and MM, it is applied to convolution and the conjugate gradient method for further validation. Results show for convolution that the optimal input parameters yield time, on average, within 9.6% of the minimum for small filters and within 1.0% for large filters. For convolution with a small filter, results show the procedure yields a speedup of GPU computation, on average, of 26.75 compared to the non-optimized CPU implementation and 4.07 compared to the naïve GPU implementation. For convolution with a large filter, results show the procedure yields a speedup of GPU computation, on average, of 424.92 compared to the non-optimized CPU implementation and 13.38 compared to the naïve GPU implementation. For a small number of iterations of the conjugate gradient method, results show the procedure yields a speedup of execution time, on average, of 11.15 compared to the ATLAS CPU implementation, 6.84 compared to the naïve GPU implementation, and 3.00 compared to the CUBLAS implementation.
For a large number of iterations of the conjugate gradient method, results show the procedure yields a speedup of execution time, on average, of 26.89 compared to the ATLAS CPU implementation, 16.17 compared to the naïve GPU implementation, and 5.88 compared to the CUBLAS implementation. Therefore, the parallelization procedure developed minimizes execution time for matrix-based computations on a GPU.

Bibliography

[1] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of gpu algorithms for matrix-matrix multiplication," in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, ser. HWWS '04. New York, NY, USA: ACM, 2004, pp. 133-137. [Online]. Available: http://doi.acm.org/10.1145/1058129.1058148
[2] C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 185-196. [Online]. Available: http://dx.doi.org/10.1109/PACT.2005.10
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: using data parallelism to program gpus for general-purpose uses," SIGARCH Comput. Archit. News, vol. 34, no. 5, pp. 325-335, Oct. 2006.
[4] W. Liu, W. Müller-Wittig, and B. Schmidt, "Performance predictions for general-purpose computation on gpus," in Parallel Processing, 2007. ICPP 2007. International Conference on, Sept. 2007, p. 50.
[5] S. Hong and H. Kim, "An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness," in Proceedings of the 36th annual international symposium on Computer architecture, ser. ISCA '09, 2009, pp. 152-163.
[6] ——. (2009) Memory-level and thread-level parallelism aware gpu architecture performance analytical model. [Online]. Available: http://www.cc.gatech.edu/~hyesoon/hong_report09.pdf
[7] ——, "An integrated gpu power and performance model," SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 280-289, June 2010.
[8] K. Kothapalli, R. Mukherjee, M. Rehman, S. Patidar, P. Narayanan, and K. Srinathan, "A performance prediction model for the cuda gpgpu platform," in High Performance Computing (HiPC), 2009 International Conference on, Dec. 2009, pp. 463-472.
[9] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded gpu using cuda," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, ser. PPoPP '08. New York, NY, USA: ACM, 2008, pp. 73-82. [Online]. Available: http://doi.acm.org/10.1145/1345206.1345220
[10] S. Ryoo, C. I. Rodrigues, S. S. Stone, J. A. Stratton, S.-Z. Ueng, S. S. Baghsorkhi, and W.-m. W. Hwu, "Program optimization carving for gpu computing," J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1389-1401, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2008.05.011
[11] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu, "Program optimization space pruning for a multithreaded gpu," in Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, ser. CGO '08. New York, NY, USA: ACM, 2008, pp. 195-204. [Online]. Available: http://doi.acm.org/10.1145/1356058.1356084
[12] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu, "An adaptive performance modeling tool for gpu architectures," SIGPLAN Not., vol. 45, no. 5, pp. 105-114, Jan. 2010.
[13] W.-M. Hwu, C. Rodrigues, S. Ryoo, and J. Stratton, "Compute unified device architecture application suitability," Computing in Science Engineering, vol. 11, no. 3, pp. 16-26, May-June 2009.
[14] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu, "Languages and compilers for parallel computing," J. N. Amaral, Ed. Berlin, Heidelberg: Springer-Verlag, 2008, ch. CUDA-Lite: Reducing GPU Programming Complexity, pp. 1-15. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-89740-8_1
[15] CUBLAS, NVIDIA, 2013. [Online]. Available: https://developer.nvidia.com/cublas
[16] "An updated set of basic linear algebra subprograms (blas)," ACM Trans. Math. Softw., vol. 28, no. 2, pp. 135-151, Jun. 2002. [Online]. Available: http://doi.acm.org/10.1145/567806.567807
[17] MAGMA, Innovative Computing Laboratory at the University of Tennessee, 2013. [Online]. Available: http://icl.cs.utk.edu/magma/docs/
[18] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[19] R. Nath, S. Tomov, and J. Dongarra, "Accelerating gpu kernels for dense linear algebra," in Proceedings of the 9th international conference on High performance computing for computational science, ser. VECPAR '10. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 83-92. [Online]. Available: http://dl.acm.org/citation.cfm?id=1964238.1964250
[20] ——, "An improved magma gemm for fermi graphics processing units," Int. J. High Perform. Comput. Appl., vol. 24, no. 4, pp. 511-515, November 2010. [Online]. Available: http://dx.doi.org/10.1177/1094342010385729
[21] Y. Li, J. Dongarra, and S. Tomov, "A note on auto-tuning gemm for gpus," in Proceedings of the 9th International Conference on Computational Science: Part I, ser. ICCS '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 884-892. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-01970-8_89
[22] R. Nath, S. Tomov, T. T. Dong, and J. Dongarra, "Optimizing symmetric dense matrix-vector multiplication on gpus," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 6:1-6:10. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063392
[23] R. Nath, S. Tomov, and J. Dongarra, "Blas for gpus," in Scientific Computing with Multicore and Accelerators, 2010, ch. 4, pp. 57-80.
[24] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense linear algebra solvers for multicore with gpu accelerators," in Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-8.
[25] S. Tomov, R. Nath, and J. Dongarra, "Accelerating the reduction to upper hessenberg, tridiagonal, and bidiagonal forms through hybrid gpu-based computing," Parallel Comput., vol. 36, no. 12, pp. 645-654, Dec. 2010.
[26] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid gpu accelerated manycore systems," Parallel Comput., vol. 36, no. 5-6, pp. 232-240, Jun. 2010.
[27] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, "A hybridization methodology for high-performance linear algebra software for gpus," vol. 2, pp. 473-484, 2011.
[28] ——, "Faster, cheaper, better - a hybridization methodology to develop linear algebra software for gpus," 2010.
[29] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The plasma and magma projects," in Journal of Physics: Conference Series, vol. 180, 2009.
[30] F. Song, S. Tomov, and J. Dongarra, "Efficient support for matrix computations on heterogeneous multi-core and multi-gpu architectures," 2011.
[31] J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning gemm kernels for the fermi gpu," Parallel and Distributed Systems, IEEE Transactions on, vol. 23, no. 11, pp. 2045-2057, 2012.
[32] M. Horton, S. Tomov, and J. Dongarra, "A class of hybrid lapack algorithms for multicore and gpu architectures," in Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, ser. SAAHPC '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 150-158. [Online]. Available: http://dx.doi.org/10.1109/SAAHPC.2011.18
[33] H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra, "A scalable high performant cholesky factorization for multicore with gpu accelerators," LAPACK Working Note #223.
[34] A. D. Malony, S. Biersdorff, S. Shende, H. Jagode, S. Tomov, G. Juckeland, R. Dietrich, D. Poole, and C. Lamb, "Parallel performance measurement of heterogeneous parallel systems with gpus," in Proceedings of the 2011 International Conference on Parallel Processing, ser. ICPP '11, 2011, pp. 176-185.
[35] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on gpus," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '10, 2010, pp. 115-126.
[36] D. Grewe and A. Lokhmotov, "Automatically generating and tuning gpu code for sparse matrix-vector multiplication from a high-level representation," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ser. GPGPU-4, 2011, pp. 12:1-12:8.
[37] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the gpu: conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, no. 3, pp. 917-924, Jul. 2003. [Online]. Available: http://doi.acm.org/10.1145/882262.882364
[38] B. Boyer, J.-G. Dumas, and P. Giorgi, "Exact sparse matrix-vector multiplication on gpu's and multicore architectures," in Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, ser. PASCO '10, 2010, pp. 80-88.
[39] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving gpu performance via large warps and two-level warp scheduling," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44 '11, 2011, pp. 308-317.
[40] A. Buluç, J. R. Gilbert, and C. Budak, "Gaussian elimination based algorithms on the gpu," 2008.
[41] L. Y. Kah, A. Akoglu, I. Guven, and E. Madenci, "High performance linear equation solver using nvidia gpus," in Adaptive Hardware and Systems (AHS), 2011 NASA/ESA Conference on, 2011, pp. 367-374.
[42] A. El Zein and A. Rendell, "From sparse matrix to optimal gpu cuda sparse matrix vector product implementation," in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, 2010, pp. 808-813.
[43] D. Q. Ren and R. Suda, "Power efficient large matrices multiplication by load scheduling on multi-core and gpu platform with cuda," in Computational Science and Engineering, 2009. CSE '09. International Conference on, vol. 1, 2009, pp. 424-429.
[44] S. Solomon and P. Thulasiraman, "Performance study of mapping irregular computations on gpus," in Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-8.
[45] G. Cummins, R. Adams, and T. Newell, "Scientific computation through a gpu," in Southeastcon, 2008. IEEE, 2008, pp. 244-246.
[46] Y. Sun and Y. Tong, "Cuda based fast implementation of very large matrix computation," in Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010 International Conference on, 2010, pp. 487-491.
[47] N. P. Karunadasa and D. N. Ranasinghe, "Accelerating high performance applications with cuda and mpi," in Industrial and Information Systems (ICIIS), 2009 International Conference on, 2009, pp. 331-336.
[48] A. Bustamam, K. Burrage, and N. Hamilton, "Fast parallel markov clustering in bioinformatics using massively parallel graphics processing unit computing," in Parallel and Distributed Methods in Verification, 2010 Ninth International Workshop on, and High Performance Computational Systems Biology, Second International Workshop on, 2010, pp. 116-125.
[49] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with cuda," Micro, IEEE, vol. 28, no. 4, pp. 13-27, July-Aug. 2008.
[50] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing cuda workloads using a detailed gpu simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, April 2009, pp. 163-174.
[51] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "Nvidia tesla: A unified graphics and computing architecture," Micro, IEEE, vol. 28, no. 2, pp. 39-55, March-April 2008.
[52] J. Nickolls and W. Dally, "The gpu computing era," Micro, IEEE, vol. 30, no. 2, pp. 56-69, March-April 2010.
[53] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors, 1st ed., 2010.
[54] G. Ruetsch and P. Micikevicius, Optimizing Matrix Transpose in CUDA, NVIDIA, 2009.
[55] A. M. Aji, M. Daga, and W.-c. Feng, "Bounding the effect of partition camping in gpu kernels," in Proceedings of the 8th ACM International Conference on Computing Frontiers, ser. CF '11. New York, NY, USA: ACM, 2011, pp. 27:1-27:10. [Online]. Available: http://doi.acm.org/10.1145/2016604.2016637
[56] J. Wu and J. JaJa, "Optimized strategies for mapping three-dimensional ffts onto cuda gpus," in Innovative Parallel Computing (InPar), 2012, May 2012, pp. 1-12.
[57] Y. Yang, P. Xiang, J. Kong, and H. Zhou, "A gpgpu compiler for memory optimization and parallelism management," in Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, ser. PLDI '10. New York, NY, USA: ACM, 2010, pp. 86-97. [Online]. Available: http://doi.acm.org/10.1145/1806596.1806606
[58] A. Dasgupta, "Cuda performance analyzer," Master's thesis, Georgia Institute of Technology, May 2011. [Online]. Available: http://smartech.gatech.edu/jspui/bitstream/1853/39555/1/dasgupta_aniruddha_s_201105_mast.pdf
[59] N. B. Lakshminarayana and H. Kim, "Effect of instruction fetch and memory scheduling on gpu performance," in Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
[60] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimization of software and the ATLAS project," Parallel Computing, vol. 27, no. 1-2, pp. 3-35, 2001. [Online]. Available: http://math-atlas.sourceforge.net/
[61] T. Bradley and P. Micikevicius. (2009) Advanced c for cuda. NVIDIA. [Online]. Available: http://www.gputechconf.com/object/gtc2009-on-demand.html
[62] HPC User Manual, The Alabama Supercomputer Authority, 2010.
[63] V. Volkov and J. W. Demmel, "Benchmarking gpus to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE conference on Supercomputing, ser. SC '08. Piscataway, NJ, USA: IEEE Press, 2008, pp. 31:1-31:11. [Online]. Available: http://dl.acm.org/citation.cfm?id=1413370.1413402
[64] CUDA Programming Guide, NVIDIA, 2013. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[65] M. R. Hestenes and E. Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems," Journal of Research of the National Bureau of Standards, vol. 49, pp. 409-436, Dec. 1952.