Modeling and Optimization of Parallel Matrix-based Computations on GPU

by

Andrew White

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 5, 2013

Keywords: GPU, matrix-based computations, parallelization procedure, execution metrics

Copyright 2013 by Andrew White

Approved by
Soo-Young Lee, Chair, Professor of Electrical and Computer Engineering
Victor Nelson, Professor of Electrical and Computer Engineering
Chwan-Hwa Wu, Professor of Electrical and Computer Engineering

Abstract

As graphics processing units (GPUs) are continually being utilized as coprocessors, the demand for optimally utilizing them for various applications continues to grow. This work narrows the gap between programmers and minimum execution time for matrix-based computations on a GPU. To minimize execution time, computation and communication time must be considered. (In this work, computation time refers to the amount of time required for the GPU to execute a given computation; communication time refers to the amount of time required for a CPU to GPU or a GPU to CPU data transfer.) For computation, the placement of data in GPU memory significantly affects computation time and therefore is considered. Various matrix-based computation patterns are examined with respect to the layout of GPU memory. A computation pattern refers to the order in which each GPU thread computes a result. From examination of computation patterns, an optimized computation pattern, a pattern which reduces computation time, is derived. After the optimized computation pattern is derived, the access pattern to GPU memory, or order in which data is accessed, is considered. From the optimized access pattern, fine-tuning is performed on the GPU code, such as minimization of index calculations and loop unrolling, to further reduce computation time and resource allocation. After fine-tuning, input parameters which yield the minimum computation time for a matrix-based computation on the GPU are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Execution metrics are formulated, as functions of the input parameters, to represent the executional behavior of the GPU. The execution metrics are utilized to derive the optimal input parameters, which are input parameters that yield the minimum computation time.

The matrix-based computations considered are matrix-vector multiplication (Mv), matrix-matrix multiplication (MM), convolution, and the conjugate gradient method. A parallelization procedure is developed to minimize execution time by deriving the optimal placement of data, optimized computation pattern, optimized access pattern, and optimal input parameters. Each step of the procedure is developed through analysis of Mv and MM. The procedure includes an accurate communication model and estimates for CPU and GPU data transfers. With accurate estimates for communication and computation times, partitioning computation for CPU and GPU execution is performed to minimize execution time. Convolution and conjugate gradient are utilized to verify the validity of the procedure. Therefore, the overall goal of this work is to develop a parallelization procedure which minimizes execution time of matrix-based computations executing on the GPU.

Acknowledgments

First, I would like to thank God for blessing me with the opportunity to complete this work.
Countless prayers from family, friends, and myself enabled me to finish this dissertation. I would like to thank my wife, Abigail, who has always been supportive through the many years. Her dedication is truly inspiring. My family, particularly my parents, Bob and Ann, provided financial assistance and emotional support without which none of this would be possible. The lengthy phone calls about this work with my father also provided an outside perspective for improvement, and I am sincerely grateful.

I would like to thank the entire Department of Electrical Engineering at Auburn University. Specifically, I would like to acknowledge my advisor, Dr. Soo-Young Lee, for always being supportive. This work would not have been possible without his constant help and guidance. Through countless meetings, emails, and phone calls, Dr. Lee always believed we could finish this work and went above and beyond the expectations of an advisor. I would also like to acknowledge Dr. Victor Nelson and Dr. Chwan-Hwa Wu for their thorough review of this work as well as their availability to discuss any problems encountered along the way. Lastly, Dr. Amnon Meir, the outside reader of this work, provided quick and helpful revisions for completion. It was always a pleasure meeting with these professors and I truly appreciate the help.

Lastly, I would like to acknowledge the graduate students with whom I have shared an office for many years at Auburn. It was always enjoyable and relaxing to discuss life, and of course work, with Dr. Chris Wilson over lunch. I also appreciate the conversations with Siwei Wang and Praveen Venkataramani, which provided me a wider understanding of the rest of the world and some social interaction in an otherwise non-social environment.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Problem Definition
  1.2 Review
  1.3 Motivation
  1.4 Objectives
  1.5 Organization
2 GPU, CUDA and Terms
  2.1 GPU History
  2.2 GPU Architecture
  2.3 GPU Performance
  2.4 CUDA Environment
  2.5 Terms
3 Modeling
  3.1 Global Memory Layout
    3.1.1 Partition Camping
  3.2 Shared Memory Layout
    3.2.1 Bank Conflicts
  3.3 Execution Metrics
    3.3.1 Global Memory Accesses
    3.3.2 Active Threads
    3.3.3 Fragmented Threads
    3.3.4 Global Memory Partitions
  3.4 Communication Time
    3.4.1 CPU-GPU
    3.4.2 GPU-CPU
  3.5 Computation Time
    3.5.1 CPU
    3.5.2 GPU
4 Parallelization Procedure
  4.1 Placement of Data
    4.1.1 Mv
    4.1.2 MM
  4.2 Computation Patterns
    4.2.1 Mv
    4.2.2 MM
  4.3 Access Patterns
    4.3.1 Mv
    4.3.2 MM
  4.4 Fine-tuning
    4.4.1 Mv
    4.4.2 MM
  4.5 Input Parameters
    4.5.1 Mv
    4.5.2 MM
    4.5.3 GPU Computation Summary
  4.6 Computation Partitioning
    4.6.1 Mv
    4.6.2 MM
5 Performance Analysis
  5.1 Convolution
    5.1.1 Placement of Data
    5.1.2 Computation Patterns
    5.1.3 Access Patterns
    5.1.4 Fine-tuning
    5.1.5 Input Parameters
    5.1.6 Computation Partitioning
  5.2 Conjugate Gradient
6 Conclusion
Bibliography

List of Figures

2.1 Architecture of the T10 GPU.
2.2 Memory organization of the T10 GPU.
2.3 A square block of size 64.
2.4 Partitioning of a square block of size 64 into warps.
2.5 Partitioning of a square block of size 64 into half-warps.
3.1 Global memory connection of the T10 GPU. Memory is divided into 8 equal-sized partitions. Each address represents 256 bytes.
3.2 An example of partition camping occurring: global memory accesses by 2 HWs to 2 partitions.
3.3 An example of partition camping not occurring: global memory accesses by 2 HWs to 2 partitions.
3.4 Shared memory layout in the T10 GPU. Memory is divided into 16 equal-sized partitions. Each column is 4 bytes wide.
3.5 Threads within a HW accessing 16 shared memory banks. No bank conflicts occur.
3.6 Threads within a HW accessing one shared memory bank. 16 bank conflicts occur.
3.7 Modeling MM with input parameters: average percent difference between the maximum and minimum computation time (ms) of a group defined by n and a subset of the input parameters.
3.8 Comparison of measured and estimated CPU to GPU communication time (ms) on the T10 GPU.
3.9 Comparison of measured and estimated GPU to CPU communication time (ms) on the T10 GPU.
3.10 Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of Mv on the CPU.
3.11 Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of MM on the CPU.
3.12 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=3.
3.13 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=63.
3.14 Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS=513.
4.1 Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv.
Each memory transaction is 4 bytes. The accesses to A are uncoalesced and 4 accesses are required for each iteration.
4.2 Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv utilizing shared memory. Each memory transaction is 4 bytes. The accesses to A are coalesced and 4 accesses are required every 4th iteration.
4.3 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.4 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.5 Computation patterns: two computation patterns for Mv for computing multiple Cjs.
4.6 Average computation time (ms) varying the number of total threads for Mv on the T10 GPU.
4.7 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.8 Computation patterns: four computation patterns for MM for computing multiple Cijs.
4.9 Computation patterns: hybrid computation patterns for MM for computing multiple Cijs.
4.10 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.11 Access patterns: example of block-level access patterns to A for matrix-based computations. Each column is one partition of global memory. Each row is 512 values of type float.
4.12 Example of loading A into shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. No bank conflicts occur. dBlk.x = 16.
4.13 Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. Bank conflicts occur. dBlk.x = 16.
4.14 Example of loading A into shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank conflicts occur. dBlk.x = 16.
4.15 Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank conflicts occur. dBlk.x = 16.
4.16 Comparison of access patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.
4.17 Comparison of access patterns: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.18 Comparison of fine-tuning: Maximum bandwidth (GB/s) for Mv on the T10 GPU.
4.19 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.
4.20 Comparison of input parameters: Computation time (ms) of all input parameters for Mv on the T10 GPU.
4.21 Comparison of input parameters: Computation time (ms) of all input parameters for MM on the T10 GPU.
4.22 Comparison of effective bandwidth (GB/s) for Mv on the T10 GPU.
4.23 Comparison of effective bandwidth (GB/s) for MM on the T10 GPU.
4.24 Comparison of execution time (ms) for Mv on the T10 GPU.
4.25 Comparison of execution time (ms) for MM on the T10 GPU.
5.1 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.2 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.3 Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.4 Computation patterns: two computation patterns for convolution.
5.5 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.6 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.7 Comparison of computation patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.8 Access patterns: example of a 3x3 filter, B, stored in global memory. Each row of a partition is 4 values of type float.
5.9 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.10 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.11 Comparison of access patterns: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.12 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.13 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.14 Comparison of fine-tuning: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.15 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=3.
5.16 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=63.
5.17 Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=513.
5.18 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.
5.19 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.
5.20 Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.
5.21 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 3.
5.22 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 63.
5.23 Comparison of execution time (ms) for convolution on the T10 GPU. FS = 513.
5.24 Comparison of execution time (ms) for 8 iterations of the conjugate gradient method on the T10 GPU.
5.25 Comparison of execution time (ms) for 256 iterations of the conjugate gradient method on the T10 GPU.

List of Tables

3.1 Measured results of executing Listing 3.1 which demonstrate the effect of partition camping on the T10 GPU. v is the number of values each thread summed.
3.2 Measured results of executing Listing 3.2 which demonstrate the effect of bank conflicts on the T10 GPU. v is the number of values summed by each thread.

Chapter 1
Introduction

1.1 Problem Definition

As the market for massively multithreaded architectures continues to grow, so does the development of general-purpose graphics processing unit (GPGPU) applications. Many programming models, such as NVIDIA's Compute Unified Device Architecture (CUDA), try to ease the gap between programmers and GPUs. While much research has been done in optimizing written GPU applications, little has been done to bridge the gap between programmers and minimum execution time for matrix-based computations on a GPU. Therefore, a parallelization procedure is necessary to provide programmers a guide to achieve minimum execution time for matrix-based computations on a GPU.

To develop a parallelization procedure, it is necessary to consider computation and communication time. Accurate estimates for computation and communication time are necessary to minimize execution time. To assist in modeling GPU computation time, the layout of GPU memory must be examined for various matrix-based computation patterns. Therefore, it is necessary to determine in which type of GPU memory data is placed. After the placement of data is determined, the order in which threads executing on the GPU compute results, the computation pattern, is considered. Therefore, a computation pattern which reduces the computation time with respect to the layout of GPU memory for matrix-based computations must be derived. After deriving the optimized computation pattern, it is necessary to derive an optimized access pattern to GPU memory since the order in which threads access memory affects computation time. Utilizing the optimized access pattern, the code must be fine-tuned to reduce computation and resource allocation.
From the fine-tuned code, it is necessary to derive the input parameters for the GPU computation which yield the minimum computation time. To derive optimal input parameters, an accurate model of the computational behavior of the GPU is required. Since modeling GPU computation using input parameters alone is insufficient, execution metrics must be formulated as functions of the input parameters to model the behavior. From the execution metrics, optimal input parameters must then be derived which yield the minimum computation time for matrix-based computations.

Communication estimates between the CPU and GPU are also necessary to develop a parallelization procedure which minimizes execution time. Accurate communication estimates, in addition to accurate computation estimates, are necessary to determine which computations are performed by the CPU and GPU to minimize execution time. Therefore, a parallelization procedure which minimizes execution time of matrix-based computations on the GPU must consider the placement of data, computation patterns, access patterns, input parameters, communication time, and computation partitioning.

1.2 Review

In early GPU research, Fatahalian and others performed a study of matrix-matrix multiplication on GPU architectures [1]. It was shown at the time that GPUs suffer from memory bandwidth limitations, which has later been expounded upon by many research groups. At that time, work showed that CPUs were better candidates for applications that feature data reuse, such as matrix-matrix multiplication. Since then, GPU architecture has evolved, creating an increasingly favorable atmosphere for GPUs being utilized as coprocessors.

Shortly after 2004, work began on automatically tuning matrix-matrix multiplication for GPUs [2]. However, results showed that automatic tuning of GPU programs severely reduced performance. After that, a Microsoft research group proposed a system to program GPUs [3]. Results showed that CPUs were several times faster than GPUs at executing computations such as Mv. After that, research began to focus on modeling GPU performance [4]. At the time, results showed accurate estimations of GPU computation time. However, the advent of CUDA and CUDA-enabled devices obsoleted this work.

Researchers at the Georgia Institute of Technology attempted to predict GPU performance in very fine detail by examining the PTX code (the NVIDIA assembly-code equivalent) to determine the number of computation and memory access cycles and using these to determine the GPU computation time [5, 6]. This work was utilized to predict the optimal number of active cores for the GPU and to disable some of the cores to reduce energy consumption [7]. Although the work provides a detailed algorithm to count the number of instructions from PTX code, calculating the cycles and predicting the result can be a tedious job. Analysis at a higher level allows a more general approach and can reduce the time required to yield a minimum execution time. In addition, while the predictions for older-model GPUs such as the FX5600 are mostly accurate, the behavior of the predicted time on newer GPUs, such as 200-series architectures, is not the same as the behavior of the measured time. Moreover, the work provided a static approach to predicting the computation time but left optimizations up to the programmer. Most importantly, the layout of the global memory in the GPU was not considered, which largely affects the computation time for matrix-based computations on GPUs.
Researchers at the International Institute of Information Technology in Hyderabad, India, focused on explaining the behavior of current NVIDIA GPUs and worked on developing a model to illustrate the performance of an NVIDIA GPU [8]. Their work focused on creating a simulator for the GPU in the future which would aid in writing optimized applications. However, their research, as stated in their publications, does not aim to provide a procedure to programmers which minimizes execution time for GPU applications.

Researchers at the Center for Reliable and High-Performance Computing at the University of Illinois at Urbana-Champaign have also spent time researching GPUs and their performance [9-13]. Their work includes a discussion of balancing shared resource usage to achieve high performance on the GeForce 8800 [9]. They focused on generating enough threads to hide memory latency, choosing applications with a high percentage of floating-point operations such as MM, and reducing the number of global memory accesses. They use this example to show that shared memory is useful for reducing redundant loads and thus reducing the pressure on memory bandwidth. In addition, some of their research focuses on resource utilization by the kernel, such as shared memory, registers, and occupancy [10, 11]. Similar to previous work, PTX code is examined for modeling computation time. The research focused on minimizing the number of trials necessary to find optimal input parameters rather than deriving them. In addition, the research did not consider the layout of global memory or the access pattern and thus does not work well for memory-bound problems such as many matrix-based computations.

In addition to the previous work mentioned, the group at the Center for Reliable and High-Performance Computing at the University of Illinois at Urbana-Champaign developed a program called CUDA-lite [14]. This was developed to be an enhancement to CUDA that automates the task of selecting the appropriate memory to use and the coding of data transfers between memories. The tool was written to coalesce memory accesses, which was a roadblock to GPU performance at the time. However, NVIDIA continues to relax the requirements to achieve coalesced reads/writes with every SDK released, so there is a dwindling need for a programmer to use an automated compiler to handle such a task. In addition, requiring programmers to learn other tools in addition to CUDA creates an additional burden. Regardless, it should be assumed that programmers understand the basic concepts of the memory hierarchy and thus do not need additional tools to optimize the access patterns if some basic guidelines are followed when writing applications. Lastly, this work does not consider the global memory layout and can produce poor results in such cases as utilizing shared memory to coalesce accesses.

The most recent research by this group includes an adaptive performance modeling tool [12] and a CUDA application survey [13]. In [13], a survey is performed of different applications and the suitability of CUDA-enabled GPUs. While this provides a useful summary of the architecture and how the GPU operates, along with analysis as to why certain applications are optimal, it does not satisfy the need for a procedure to minimize computation time. In [12], a compiler-based approach is utilized to estimate GPU computation time. They, too, analyzed the effects on computation time from utilizing various GPU memories.
While this was a major step in automatic selection of optimal applications, it does not provide a procedure which yields optimal GPU code and a derivation of input parameters that yield the minimum computation time.

With the release of CUDA, NVIDIA released CUBLAS [15], an optimized implementation of the BLAS [16] routines for GPUs. However, the source for CUBLAS is closed and little work is published on how each routine is implemented. Alternatively, MAGMA [17] released an optimized implementation of the BLAS routines, similar to LAPACK [18], for GPUs. Several works [19-34] provide insights into optimized GPU applications. However, this research aims to provide users optimized BLAS routines rather than provide programmers a procedure for optimizing GPU computations and minimizing execution time. Other work [35-48] also focuses on optimizing individual computations, such as Mv or MM, but is limited to specific implementations or ignores certain aspects of the GPU such as memory layout.

Lastly, work has been performed that highlights the programming experiences with CUDA as well as the architectures of GPUs [49-52]. [49] provides a summary of research done with different applications on GPUs utilizing CUDA. The work highlights the benefits of CUDA as well as what is necessary to make applications suitable for a GPU, such as exposing sufficient amounts of fine-grained parallelism to occupy the device, the ability to block computation, the efficiency of data-parallel programs without thread divergence, and finally, usage of shared memory. [51] focuses on the architecture of Tesla GPUs and how it has evolved over the last decade from fixed-function graphics pipelines to programmable processors with computing power greater than multi-core CPUs. One of the newest GPU architectures, Fermi, is outlined in [52], which also discusses the demand for increasing GPU capabilities and the future of GPUs. The research shows the benefit of co-processing architectures (such as CUDA) and explains the reason the market for these architectures and their usability will continue to grow. Because of this expanding market, research is needed to provide users with a procedure for achieving the minimum execution time.

1.3 Motivation

With the development of environments such as CUDA, GPUs are increasingly becoming a viable option for parallel processing. With the increased interest in utilizing the GPU as a coprocessor, it is necessary that programmers be able to write applications for this platform without the need to fully understand the underlying architecture of the GPUs. Often, GPU manufacturers do not disclose certain aspects of the hardware, or sometimes even the software, to maintain a competitive advantage. Therefore, researchers are constantly testing and analyzing the new generations of GPUs to understand how to fully utilize them. It is necessary to provide programmers a procedure for minimizing the execution time while also providing the logical reasoning behind it.

Previous research does not provide programmers a parallelization procedure for matrix-based computations which yields minimum execution time. More specifically, previous research does not account for the layout of GPU global memory or the access pattern to that memory, which is determined by the code, and thus optimized access patterns are not derived to minimize computation time. In addition, previous research does not consider the computation pattern of the executing threads, which affects computation time.
Therefore, optimized computation patterns are not derived to achieve the minimum time. In previous research, input parameters are determined through testing the application, and thus optimal input parameters are not derived for various computations to yield the minimum computation time. Therefore, it is necessary to develop a parallelization procedure for matrix-based computations executing on a GPU which minimizes execution time by considering the placement of data, computation patterns, access patterns, input parameters, communication time, and computation partitioning.

1.4 Objectives

Since the primary goal of this work is to develop a procedure which yields the minimum execution time for matrix-based computations on a GPU, the objectives are to

• examine the layout of GPU memory,
• formulate execution metrics to accurately represent the computational behavior of the GPU,
• model and estimate CPU to GPU and GPU to CPU communication time,
• model and estimate CPU and GPU computation time,
• determine the optimal placement of data in GPU memory,
• analyze computation patterns for GPUs and derive the optimized computation pattern,
• derive the optimized access pattern for GPU memory,
• fine-tune code to minimize computation and resource allocation,
• from the execution metrics, derive the optimal input parameters for the GPU,
• determine the optimal partitioning of computation between the CPU and GPU,
• verify these objectives by comparing the procedure with several matrix-based computations.

This is accomplished with consideration to grid and block partitioning, data arrangement and partitioning, code arrangement, and data transfers. Therefore, the intellectual contribution of this work is a parallelization procedure which minimizes execution time for matrix-based computations on a GPU.

1.5 Organization

Chapter 2 is an introduction to GPUs, CUDA, and terms utilized in this work. An examination of the layout of GPU memory and the formulation of execution metrics to accurately represent the computational behavior of the GPU are presented in Chapter 3. In addition, modeling of communication and computation times is presented in Chapter 3. From the layout of GPU memory and formulation of execution metrics, the parallelization procedure to minimize execution time for matrix-based computations executing on a GPU is developed in Chapter 4. The parallelization procedure considers the placement of data, computation patterns, access patterns, fine-tuning, input parameters, communication, and computation partitioning. The procedure is developed through and applied to Mv and MM in Chapter 4. Results are included to illustrate the impact on time for each step of the procedure. In Chapter 5, the procedure is applied to convolution and the conjugate gradient method. The application of the procedure to convolution is shown. Results illustrate the performance of the procedure applied to convolution and the conjugate gradient method. Chapter 6 is a summary and conclusion of the presented work.

Chapter 2
GPU, CUDA and Terms

This chapter includes a brief history of GPUs, a comparison of CPU and GPU architectures, and the performance of GPUs. Following is a summary of CUDA, the NVIDIA GPU programming environment, and lastly, terms are introduced.

2.1 GPU History

As GPUs continue to become low-cost, parallel processing architectures, additional programmers will turn to these architectures to minimize execution time.
Therefore, there is an increasing need for optimizing applications written for GPUs. While much work has been done by NVIDIA and other researchers using the CUDA environment to explain the impact of different application parameters, much information about how GPUs function is not public knowledge. Therefore, it is necessary to expand upon the previous research into optimizing applications and develop a procedure for minimizing execution time to ease future programmers into writing efficient CUDA applications for GPUs.

Over the last 30 years, GPU architecture has evolved to a massively parallel design. In the 1980s, GPUs were large, expensive systems that typically cost in the range of $50,000 and were capable of processing 50 million pixels per second. By the 1990s, GPU costs decreased significantly, causing them to be more widespread as they were deployed in small workstations as PC accelerators. During this time, graphics APIs such as DirectX became popular and programmers began utilizing the fixed functions to perform other tasks. In the 2000s, GPUs dropped in price to the $100 range and became available on every computer. Those GPUs were capable of processing 1 billion pixels per second. More importantly, those GPUs became programmable for general-purpose computing using such languages as CUDA and OpenCL. This advancement in graphics performance has been driven by the market demand for high-quality, real-time graphics in computer applications, namely the gaming industry. The result is that, over time, graphics architecture has changed from being a simple pipeline for drawing wire-frame diagrams to a highly parallel computing chip [53] [52] [51].

In order to fully utilize today's chips and maximize performance, GPU algorithms need several key components: access to data with minimal bank conflicts, SIMD parallelism, and many arithmetic computations. GPU algorithms are best suited for computationally intensive applications that require little inter-process communication and include, but are not limited to, physical modeling, computational engineering, matrix algebra, convolution, correlation, and sorting. Therefore, matrix-based computations are ideally suited for GPUs.

2.2 GPU Architecture

GPUs, as opposed to CPUs, are not well-suited for all algorithms due to their architecture. CPUs are designed for optimal sequential performance, and application performance increases with increasing clock frequencies. However, the rate at which clock speeds increase is beginning to slow. Because of the importance of sequential performance, CPU cores include the full x86 instruction set with complex branching. In addition, CPUs have large cache memories to decrease instruction and data latencies for applications. The larger cache memories and the need for complex control logic force designers to use less of the available silicon for ALUs. On the other hand, GPUs are designed for optimal parallel performance. While GPU clock speeds are slower than CPU speeds, application performance increases with the number of cores. The lack of large cache memories and complex control logic decreases the sequential performance in comparison to CPUs but allows designers to implement more ALUs on the available silicon, thus increasing the numeric computing capability. Lastly, since GPUs are designed with a focus on parallelism, their structure includes many more, smaller cores than CPUs.
The NVIDIA GPU architecture (from this point forward, GPU refers to an NVIDIA GPU) is a collection of streaming multiprocessors (SMs), with each SM having a number of cores, or streaming processors (SPs). The control logic and instruction cache for all SPs within an SM are shared, as illustrated in Figure 2.1a. This work utilizes the Tesla T10 (T10) GPU, which is built on the NVIDIA 200-series architecture. The T10 architecture has a total of 30 SMs with each SM having 8 SPs. A newer architecture, Fermi, includes only 16 SMs but each SM has 32 SPs. In both architectures, each SP consists of an ALU and FPU, as illustrated in Figure 2.1b. The 200-series architecture has a 24-bit ALU while the Fermi architecture includes a 32-bit ALU. However, the 200-series architecture is capable of 32-bit precision arithmetic through the use of multiple arithmetic instructions.

Figure 2.1: Architecture of the T10 GPU. (a) Streaming Multiprocessor. (b) Streaming Processor. Source: http://www.anandtech.com/show/2549/2

In current GPU systems, the GPU and CPU have separate memory spaces, with a GPU consisting of three separate types of memory: global, constant, and shared. Figure 2.2 depicts the organization of GPU memory. Global and constant memory are accessible by all SPs, while shared memory is partitioned for each SM. Therefore, each SP in the same SM can share data through the shared memory. However, SPs from one SM cannot share data with SPs from a different SM.

Figure 2.2: Memory organization of the T10 GPU.

Global memory is the slowest memory available on the GPU and is 4GB in size for the Tesla T10 GPU. It is utilized as a transfer medium from the host (CPU) to the device (GPU) or from the device to the host. The GPU can read and write the global memory while the CPU can copy memory to or from the global memory. Global memory is persistent through kernel calls while constant and shared memory are not. A CUDA application can transfer data from the system memory at 4GB/s and at the same time upload data to the system memory at 4GB/s. Constant memory is much faster and smaller than the global memory and is only 64KB in size on the Tesla T10. However, the GPU cannot write to constant memory; therefore, it is simply used to transfer constant data from the CPU to the GPU. Shared memory is also faster and smaller than global memory and can be written to and read from by the GPU. There are 16KB of shared memory per SM for the Tesla T10. However, shared memory cannot be accessed by the CPU; therefore, it is typically used as a scratch-pad memory for each SP. All computational results generated by the GPU must be stored in global memory to be accessible by the CPU. In some applications, shared memory can also be used for sharing data amongst SPs in the same SM. In addition to the three types of memory, each thread has access to registers and local memory (a subsection of global memory used to store variables when the maximum number of registers is exceeded), both of which are read/write. The shared memory can be read/written by all threads in the same SM. However, threads from different SMs cannot access the same shared memory, while the constant memory can be read by any thread. The global memory can be read/written by any thread.
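To make the roles of these memory spaces concrete, the sketch below shows how a CUDA program typically allocates global memory, loads read-only coefficients into constant memory, and stages data in shared memory. This is an illustrative sketch only, not code from this work; the kernel name, array names, and sizes are hypothetical.

#include <cuda_runtime.h>

#define N 512

// Constant memory: written by the CPU (via cudaMemcpyToSymbol), read-only on the GPU.
__constant__ float c_coef[16];

// Kernel: global memory (d_in, d_out) is visible to all SMs; shared memory (s_tile)
// belongs to the block's SM and is used here as a scratch pad.
__global__ void scale(const float *d_in, float *d_out)
{
    __shared__ float s_tile[256];                  // 1KB of the SM's 16KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index into d_in/d_out

    s_tile[threadIdx.x] = d_in[i];                 // global memory to shared memory
    __syncthreads();                               // all threads in the block see the tile

    d_out[i] = s_tile[threadIdx.x] * c_coef[0];    // results must return to global memory
}

int main(void)
{
    float h_in[N], h_out[N], h_coef[16];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;
    for (int i = 0; i < 16; i++) h_coef[i] = 2.0f;

    float *d_in, *d_out;                            // global memory allocations
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice); // CPU to GPU
    cudaMemcpyToSymbol(c_coef, h_coef, 16 * sizeof(float));            // CPU to constant

    scale<<<N / 256, 256>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost); // GPU to CPU
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The sketch reflects the constraints described above: only global memory is visible to the CPU, constant memory is written only by the CPU, and shared memory is private to the SM executing the block.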
2.3 GPU Performance

Because of differences in architecture, GPUs have continually outperformed CPUs in terms of floating-point operations per second. "As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 to 1" [53]. The fundamental difference in the design of CPUs versus GPUs has created this large increase in performance in terms of FLOPS. While the CPU is designed to optimize sequential performance, GPUs are designed to optimize parallel instruction throughput. Therefore, GPUs need thousands of threads executing in parallel to achieve full efficiency where a CPU may only need a few. However, each thread of a GPU is lightweight in comparison and requires little overhead to create. The NVIDIA 200-series architecture supports 1024 threads per SM for a total of 30,720 threads running in parallel for the entire chip. In addition, new generations of hardware, such as the Fermi, support even more threads running concurrently.

Today's Tesla C1060, currently sold by NVIDIA, includes one Tesla T10 processor built on the 200-series architecture. The T10 includes 240 cores, each with a clock speed of 1.33GHz, and is connected to 4GB of DRAM. The C1060 fits in a standard PCIe dual slot and consumes around 160 watts of power. Because of these low-cost, low-power GPUs, massively parallel applications are quickly being ported for GPGPU use.

To date, more than 200 million GPUs have been deployed, providing economical parallel processing around the globe. GPGPU hardware is currently being used in a variety of applications, including MRI products. In the past, typical parallel processing research was focused on using large clusters, but actual clinical MRI machines needed to be much smaller than that. Because of this, groups such as the National Institutes of Health (NIH) would not fund parallel processing research since it was considered to be limited to large cluster-based machines. However, current MRI machine manufacturers ship MRI products with GPUs, and the NIH funds research using GPUs as coprocessors [53].

Previously, a large drawback of GPGPU applications was the necessary programming. In the past, programmers had to learn to program the GPUs using drivers and assembly language. These languages also did not include many useful general computing instructions, like integer or bit operations, since they were not implemented at the time. In addition, communication between any of the processors in the GPU was limited, making any data sharing difficult. While this is still somewhat of a limitation, it has been greatly improved, since the original design, through CUDA.

2.4 CUDA Environment

Since GPUs were starting to be used as massively parallel processors, NVIDIA developed a programming interface to utilize them known as CUDA. CUDA stands for Compute Unified Device Architecture and is based on the C/C++ language. It enables programmers to write applications in either C or C++ to utilize the GPU as a massively parallel co-processor. In addition, since MATLAB version 2010b, there is support for m-files to use the CUDA environment, allowing access for computation on the GPU. CUDA provides constructs for memory transfers between the CPU and GPU for sharing data. It is based on an SPMD programming style which executes the same program on multiple parts of the data. This is different than SIMD since the processing units do not have to be executing the same instruction at the same time.
In the CUDA programming environment, there is a host, which is the CPU, and a device, which is the GPU. The NVIDIA CUDA programming language is a large reason more and more industries are looking at GPUs as portable parallel processing systems. The CUDA extension of the C programming language has enabled all programmers to create parallel applications without any special hardware or software knowledge beyond C. However, the knowledge of the underlying architecture needed to achieve high speedups has inhibited typical programmers from using GPUs as coprocessors.

2.5 Terms

Using CUDA, a C/C++ program consists of host code, code written to execute on the CPU, and device code, code written to execute on the GPU. Device code is divided into individual functions known as kernels. To compile kernels, NVIDIA provides a C compiler known as NVCC. It separates the host and device code and only performs compilation on the device code while using gcc for the host code. In addition, NVIDIA provides a low-level programming language, parallel thread execution (PTX), similar to assembly language, for writing kernels without exposing any of the underlying instruction set.

Kernels can be written for any computational operation involving the GPU and are similar to functions in C. They are executed using a specified number of threads. Since the GPU is an SPMD structure, all threads execute the same code on different portions of data and each thread has an identifier known as the thread index. Thread identifiers are used for control logic and data access. A collection of threads is a block, as shown in Figure 2.3. A block can be partitioned in 1, 2, or 3 dimensions. The number of threads in the x- and y-dimension of each block is defined by dBlk.x and dBlk.y, respectively.

Figure 2.3: A square block of size 64.

Threads in the same block can cooperate through atomic operations and synchronization. However, threads in separate blocks cannot since there is no guarantee to which SM a block is assigned. Each block also has an identifier similar to a thread's, known as the block index. The collection of blocks is considered the grid and can be partitioned in 1 or 2 dimensions. The number of blocks in the x- and y-dimension of the grid is defined by dGrd.x and dGrd.y, respectively. Each kernel call can define different grid and block sizes and dimensions, but these cannot be changed dynamically during kernel execution.

In the T10 processor, each grid can have a maximum of 65,536 blocks in each dimension and each block can have a maximum of 512 threads, yielding a maximum total of 2^41 threads. The dimensions of a grid and block determine the total number of threads operating in the GPU, so each needs to be chosen to ensure the GPU is fully occupied. Since blocks can execute in any order with respect to other blocks, CUDA applications are very scalable, which allows applications to adapt to new hardware without changing any of the existing code. However, architecture changes that often accompany new hardware can cause issues with scalability.
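As a concrete illustration of these terms, the sketch below shows a trivial kernel (device code) launched from host code with a 2-dimensional grid of 2-dimensional blocks; the grid and block dimensions passed at the launch correspond to the input parameters dGrd.x, dGrd.y, dBlk.x, and dBlk.y. The kernel, array name, and sizes are hypothetical and are not taken from this work.

#include <cuda_runtime.h>

// Device code: each thread uses its thread and block indices to select one element.
__global__ void addOne(float *d_A, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x position within the matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y position within the matrix
    if (row < n && col < n)
        d_A[row * n + col] += 1.0f;                  // row-major indexing
}

int main(void)
{
    const int n = 512;                // square matrix, as assumed throughout this work
    float *d_A;
    cudaMalloc(&d_A, n * n * sizeof(float));
    cudaMemset(d_A, 0, n * n * sizeof(float));

    // Input parameters: dBlk.x = dBlk.y = 16 (a square block of 256 threads) and
    // dGrd.x = dGrd.y = n/16 blocks in each dimension of the grid.
    dim3 dBlk(16, 16);
    dim3 dGrd(n / dBlk.x, n / dBlk.y);
    addOne<<<dGrd, dBlk>>>(d_A, n);

    cudaDeviceSynchronize();          // wait for the kernel to finish
    cudaFree(d_A);
    return 0;
}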
Once the partitioning for the grid and blocks is determined, the scheduler organizes threads into groups of 32 known as warps, as shown in Figure 2.4. Since warps waiting for long-latency operations, such as intensive arithmetic operations or memory accesses, are not selected for execution, this provides a type of latency hiding. While those warps are waiting for their operations to finish executing, other warps that are not waiting are scheduled to execute. This type of zero-overhead thread scheduling ensures that the maximum instruction throughput is realized. This is one major reason GPUs do not dedicate as much chip area to cache memories and branch prediction mechanisms as CPUs, thus allowing many more ALUs to fit in the same size of silicon.

Figure 2.4: Partitioning of a square block of size 64 into warps.

A warp is divided into half-warps (HWs), a group of 16 threads in row-major order, as shown in Figure 2.5. Memory accesses are issued at the HW level. The number of HWs is defined by the dimensions of the grid and blocks, the input parameters, which are specified by the programmer. The four input parameters are defined as dGrd.x, dGrd.y, dBlk.x, and dBlk.y. dGrd.x and dGrd.y are the dimensions of the grid in the x- and y-dimensions, respectively. dBlk.x and dBlk.y are the dimensions of each block in the x- and y-dimensions, respectively.

Figure 2.5: Partitioning of a square block of size 64 into half-warps.

Lastly, the focus of this work is on matrix-based computations. All matrices are assumed to be square and n is used to denote the width or height of a square matrix.

Chapter 3
Modeling

Developing a parallelization procedure to minimize execution time requires an accurate model of communication and computation times. Several modeling techniques, including fine-grain, vector-based, and table-based modeling, were initially applied to model GPU computational behavior. Fine-grain modeling was performed by measuring computation times for arithmetic operations and extrapolating for larger computations. Vector-based modeling was performed by measuring computation times for vector-based arithmetic operations, such as a dot product, and extrapolating for larger computations. Lastly, table-based modeling was performed by measuring computation time for various numbers of blocks and threads and extrapolating for other input parameters. However, none of these techniques provided reasonable models of the GPU computational behavior, partly due to the layout of the GPU's memory. Therefore, this chapter includes an examination of the layout of GPU memory and the formulation of execution metrics to accurately represent the computational behavior of the GPU.

Section 3.1 is an examination of the layout of global memory on the GPU. Included in this section are the effects of the layout on computation time. Section 3.2 is an examination of the layout of shared memory and its effects on computation time. The beginning of Section 3.3 provides the reasoning for utilizing execution metrics to model GPU computational behavior. Instead of utilizing input parameters to model behavior, execution metrics are formulated as functions of the input parameters for modeling. Section 3.4 is a model and estimation of CPU to GPU and GPU to CPU communication time. Results are included to prove the validity of the estimation. CPU computation time is estimated in Section 3.5 through curve-fitting measured data, and results are included in the section.

3.1 Global Memory Layout

One reason the aforementioned modeling techniques for GPU computational behavior yield inaccurate models is the layout of global memory on the GPU. Due to the layout of global memory, a problem referred to as partition camping exists. Partition camping has been defined by NVIDIA [54], although its effects on various applications have not been studied until recently [55] [56] [57].
3.1.1 Partition Camping

A description of the GPU architecture is necessary to understand partition camping. The global memory on an 8-series and a 200-series GPU is divided into 6 and 8 partitions, respectively. On both GPU architectures, each partition is 256 bytes wide. All partitions are connected to a memory controller allowing access by all SMs, as depicted in Figure 3.1. Data is stored in row-major order and each address in the figure occupies one row of a partition.

Figure 3.1: Global memory connection of the T10 GPU. Memory is divided into 8 equal-sized partitions. Each address represents 256 bytes.

Accesses to global memory partitions are performed at the HW level. Therefore, partition camping occurs when multiple HWs attempt to access the same partition, but not necessarily the same address, at a given time. Figure 3.2 illustrates partition camping occurring when two HWs access two consecutive partitions, where Figure 3.1 depicts in which partition each address resides. Since addresses A and I are in the same partition, HW 1 must wait until HW 0 finishes accessing address A to access address I. The same occurs for the second accesses, to addresses B and J. Therefore, partition camping occurs and the memory accesses are serialized.

Figure 3.2: An example of partition camping occurring: global memory accesses by 2 HWs to 2 partitions.

Figure 3.3 illustrates the access of two partitions by two HWs which eliminates partition camping. For the first access, HW 0 accesses address A, which resides in partition 0, and HW 1 accesses address J in partition 1. Partition camping does not occur since the accesses by each HW occur in different partitions. Similarly, for the second access, each HW accesses different partitions and partition camping does not occur. Figures 3.2 and 3.3 demonstrate the theoretical effects on computation time due to partition camping on a GPU, such that partition camping should be minimized to minimize computation time.

Figure 3.3: An example of partition camping not occurring: global memory accesses by 2 HWs to 2 partitions.

Effects

For most GPU computations, thousands of HWs are executing at a given time. To minimize partition camping, and thus the computation time, it is necessary that HWs utilize all partitions of global memory. In addition, it is necessary that the HWs be evenly distributed to the partitions.

The code in Listing 3.1 is a portion of a kernel tested to measure the effects of utilizing a varying number of partitions.

1 int offset = floor(BlkIdx / BlksPerPart) * 64;
2 for(j = 0; j < v; j++)
3     temp += A[j * 512 + offset + threadIdx.y];

Listing 3.1: Code to test the effects of partition camping.

Line 1 calculates which partition of global memory each block accesses by utilizing a linear block index, BlkIdx, and the number of assigned blocks per partition, BlksPerPart. The quotient of the two is multiplied by 64 since each value is stored as a float and a partition is 256 bytes wide on a 200-series GPU. The length of a vector in the matrix that is summed is represented by v. Adding the thread index in the y-dimension, threadIdx.y, minimizes the possible effect of memory request merging [58] [59].
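For reference, a complete kernel built around Listing 3.1 might look like the sketch below. The surrounding declarations, the way BlkIdx is obtained, and the write of the result are assumptions made here for illustration and are not taken from the dissertation; only the loop body comes from Listing 3.1. The commented launch configuration mirrors the one used for the measurements that follow (120 square blocks of 256 threads).

#include <cuda_runtime.h>

// Sketch of a full kernel around Listing 3.1. Each block is steered toward one
// 256-byte-wide partition: offset selects a 64-float (256-byte) aligned column of A,
// whose rows are 512 floats (2,048 bytes) wide, so consecutive groups of
// BlksPerPart blocks share the same partition.
__global__ void partitionTest(const float *A, float *out, int v, int BlksPerPart)
{
    int BlkIdx = blockIdx.y * gridDim.x + blockIdx.x;   // linear block index (assumed)
    float temp = 0.0f;

    int offset = (int)floorf((float)BlkIdx / BlksPerPart) * 64;
    for (int j = 0; j < v; j++)
        temp += A[j * 512 + offset + threadIdx.y];

    // Store a value dependent on temp so the compiler cannot remove the loop.
    out[BlkIdx * blockDim.x * blockDim.y
        + threadIdx.y * blockDim.x + threadIdx.x] = temp;
}

// Assumed host-side launch for the measurements described below:
//   dim3 dBlk(16, 16);            // square block of 256 threads
//   dim3 dGrd(120, 1);            // 120 blocks, 4 per SM on the 30-SM T10
//   partitionTest<<<dGrd, dBlk>>>(d_A, d_out, v, BlksPerPart);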
The computation time and effective bandwidth from executing the code in Listing 3.1 on the T10 GPU are given in Table 3.1.¹ Each time was measured using a square block of 256 threads and 120 blocks to yield the maximum occupancy for each SM. Since threads are assigned to warps in row-major order, each thread in a HW sums identical values. However, each HW in a block computes a different sum.

Table 3.1: Measured results of executing Listing 3.1, which demonstrate the effect of partition camping on the T10 GPU. v is the number of values each thread summed.

(a) Measured computation time (ms)
            number of partitions
  v          8       4       2       1
  512       0.4     0.9     1.7     3.4
  1024      0.9     1.7     3.4     6.7
  2048      1.7     3.4     6.8    13.5
  4096      3.3     6.8    13.5    26.9
  8192      6.8    13.4    26.5    53.8
  16384    13.7    26.5    52.3   107.5

(b) Effective bandwidth (GB/s)
            number of partitions
  v          8       4       2       1
  512      69.4    34.4    17.2     8.7
  1024     68.4    34.6    17.2     8.7
  2048     69.9    34.5    17.2     8.7
  4096     71.7    34.4    17.4     8.7
  8192     69.6    35.1    17.7     8.7
  16384    68.5    35.4    17.9     8.7

The bandwidth decreases linearly as the number of partitions utilized decreases, due to partition camping. Since the number of HWs remains constant in the tests, using fewer partitions increases partition camping, which increases the computation time. For MM, all partitions are utilized at some point when n ≥ 512, where n is the height or width of a square matrix. However, the order in which HWs access each partition varies depending on the computation pattern, which is determined by how the code is written. Therefore, partition camping and its effects are considered in the parallelization procedure.

¹These results suggest the memory request merging mentioned in [58] and [59] does not occur.

3.2 Shared Memory Layout

Shared memory is divided into equal-sized banks. For the T10 GPU, there are 16 banks, as depicted in Figure 3.4. Each bank consists of 256 rows, each 4 bytes wide, and all banks can be accessed simultaneously.

Figure 3.4: Shared memory layout in the T10 GPU. Memory is divided into 16 equal-sized banks. Each column is 4 bytes wide.

Bank conflicts occur at the thread level. If all threads within a HW access different banks, as illustrated in Figure 3.5, no bank conflicts occur and only one access is necessary.

Figure 3.5: Threads within a HW accessing 16 shared memory banks. No bank conflicts occur.

3.2.1 Bank Conflicts

If two or more addresses of a memory access to shared memory fall into the same memory bank, a bank conflict occurs. Since memory accesses are issued at the HW level, threads within a HW accessing differing addresses in the same bank cause bank conflicts, and the accesses are serialized. In Figure 3.6, all threads within a HW access different rows of shared memory but the same column. Since the column resides in a single shared memory bank, 16 bank conflicts occur. Therefore, 16 accesses to shared memory are issued to service the HW. However, the T10 GPU supports shared memory broadcasting: if threads within a HW access the same address in shared memory, no bank conflicts occur.

Figure 3.6: Threads within a HW accessing one shared memory bank. 16 bank conflicts occur.

Effects

The portion of the kernel utilized to test the effects of bank conflicts is depicted in Listing 3.2.
1 __shared__ float As[dBlkx][dBlky];
2 int offset = blockIdx.x % 8 * 64;
3 int soffset = threadIdx.x % Banks;
4 for(j = 0; j < v; j++) {
5     As[threadIdx.y][threadIdx.x] = d_A[j * 512 + offset + threadIdx.y];
6     temp += As[threadIdx.x][soffset];
7 }

Listing 3.2: Code to test the effects of bank conflicts.

The number of banks each HW accesses is defined by Banks. In Line 3, soffset is calculated to specify which column of shared memory is accessed; therefore, the number of bank conflicts is fixed. blockIdx.x and threadIdx.x are the indices of each block and thread in the x-dimension, respectively. Table 3.2 shows the effect of bank conflicts on computation time.

Table 3.2: Measured results of executing Listing 3.2, which demonstrate the effect of bank conflicts on the T10 GPU. v is the number of values summed by each thread.

(a) Computation time (ms)
            number of banks
  v         16       8       4       2       1
  512       0.5     0.5     0.6     0.9     1.5
  1024      0.9     1.0     1.3     1.8     2.9
  2048      1.7     1.9     2.5     3.6     5.8
  4096      3.4     3.8     5.0     7.2    11.5
  8192      6.6     7.6     9.9    14.3    23.0
  16384    12.9    15.1    19.6    28.5    46.0

(b) Effective bandwidth (GB/s)
            number of banks
  v         16       8       4       2       1
  512      65.1    58.6    45.8    31.8    19.9
  1024     66.6    60.4    46.5    32.2    20.1
  2048     67.7    61.4    47.1    32.6    20.2
  4096     69.1    61.7    47.3    32.7    20.3
  8192     71.2    61.9    47.6    32.8    20.4
  16384    72.4    62.1    47.7    32.9    20.4

Similar to the effects of partition camping in global memory, shared memory bank conflicts significantly affect computation time. As the number of banks increases, the number of bank conflicts decreases, as does the computation time. Although the change in computation time is not linearly proportional to the number of banks utilized, the effects of bank conflicts are clearly illustrated. Therefore, bank conflicts and their effects are considered in the parallelization procedure.

3.3 Execution Metrics

In this section, execution metrics are formulated to represent the computational behavior of the GPU and assist in modeling. Figure 3.7 shows the necessity for execution metrics, as the input parameters, the dimensions of the grid and blocks, do not accurately model computational behavior. Initially, all measured computation times of MM utilizing varying input parameters are grouped by n; the average percent difference between the maximum and minimum time within a group is shown. Grouping by n yields an average difference of 163% between the maximum and minimum time of a group. Next, all measured computation times utilizing varying input parameters are grouped by n and dBlk.x; the average percent difference between the maximum and minimum of a group is 109%. The order of the groupings is determined by the input parameter which yields the smallest average percent difference between the maximum and minimum time of a group. Continuing until 3 of the 4 input parameters are exhausted for grouping yields an average percent difference of 28%. Therefore, modeling the computational behavior of the GPU utilizing input parameters alone is inaccurate. Because of this, execution metrics are formulated as functions of the input parameters to accurately represent computational behavior.

Figure 3.7: Modeling MM with input parameters: average percent difference between the maximum and minimum computation time (ms) of a group defined by n and a subset of the input parameters.
Four input parameters, dGrd.x, dGrd.y, dBlk.x and dBlk.y, and the size of each matrix, n, are used to formulate the execution metrics. Input parameters and n are assumed to be powers of two in this work. The product of the four input parameters yields the total number of threads assigned for execution on the GPU² as

    Thds^total_GPU = dGrd.x · dGrd.y · dBlk.x · dBlk.y.    (3.1)

²The notation used in Equation (3.1) is used throughout this work. The unit being defined, threads (Thds) or blocks (Blks), is specified first. The superscript represents the type of unit being defined: total (total), active (active) or fragmented (frag.). The subscript represents the execution unit being defined: GPU (GPU) or SM (SM).

The execution metrics can be formulated for any matrix-based computation. However, in this work, the focus of the execution metrics pertains to Mv, MM, and convolution.

3.3.1 Global Memory Accesses

Thds^total_GPU is used to formulate an execution metric representing the number of global memory accesses. The number of memory accesses significantly affects the computation time for matrix-based computations since the time is largely dependent on memory access latency. In general, for matrix-based computations, all matrices reside in global memory and portions of matrices may reside in shared memory. For MM, there are 2n³ values read from global memory, although consecutive reads from threads in the same block can be combined. The number of writes to global memory for MM is n², and consecutive writes from threads in the same block can be combined. Therefore, the amount of time for writing to global memory for MM is significantly less than the amount of time for reading from global memory. This is similar for other matrix-based computations such as Mv and 2D convolution. Therefore, the number of writes to global memory is ignored in this work.

Memory accesses are issued at the HW level, and the GPU combines consecutive thread accesses within a HW into one access, known as coalescing. Each memory access is issued as either a 32-, 64- or 128-byte memory transaction. Therefore, the number of reads is the sum of 32B, 64B and 128B transactions,

    Gld = Gld_32B + Gld_64B + Gld_128B.

Initially, each transaction is considered a 128B access by the GPU's memory controller and is reduced if possible. If the threads in a HW access between 9 and 16 consecutive and aligned memory locations of float or integer values, only one 64B memory transaction is issued. If the HW accesses fewer than 9 consecutive memory locations of float or integer values, one 32B memory transaction is issued. If the HW accesses memory locations which are not consecutive and 128B-aligned, multiple 32B and/or 64B memory transactions are issued until all threads in the HW have been serviced. Since each HW is a group of 16 threads, the number of HWs that execute on the GPU is defined as

    HWs_# = Thds^total_GPU / 16.

This equation assumes there are at least 16 threads in a block. From the input parameters, and considering coalescing, the number of 32B and 64B reads for matrix-based computations can be formulated³.

For a naive implementation of Mv utilizing only global memory, thread_j, to compute c_j where c = A·b, computes row_j · b where row_j is one row of A. Therefore, the neighboring thread, thread_(j+1), computes row_(j+1) · b. If dBlk.x ≥ 16, each HW accesses 16 values of A and 1 value of b. Since only 1 value of b is accessed, a 32B transaction is issued.
The 16 values of A reside in differing rows of A, and therefore the accesses are uncoalesced and require 16 32B transactions. This is repeated n times for each HW. Therefore, the number of 32B transactions for this implementation of Mv is

    Gld_32B = (dBlk.x + 1)·n² / dBlk.x    if dBlk.x ≤ 8
            = 17n² / 16                   if dBlk.x ≥ 16.    (3.2)

This equation assumes only global memory is utilized for Mv. The equation is modified for varying memories as shown in Section 4.1.

³128B reads do not exist in matrix-based computations when using values of type int or float.

For a naive implementation of MM utilizing only global memory, to compute C_ij where C = A·B, thread_ij accesses row_i and col_j. If dBlk.x > 1, the neighboring thread, thread_i(j+1), accesses row_i and col_(j+1). Since both threads are in the same HW, if dBlk.x ≥ 16, n 32B transactions are issued to read row_i and n 64B transactions to read col_j ... col_(j+15). However, if dBlk.x ≤ 8, only 32B transactions are issued. Therefore, the total number of reads performed to global memory is defined as

    Gld_32B = n³/dBlk.x + n³/16    if dBlk.x ≤ 8
            = n³/16                if dBlk.x ≥ 16    (3.3)

    Gld_64B = n³/16                if dBlk.x ≥ 16.    (3.4)

These equations assume only global memory is utilized for MM. The equations are modified for varying memories as shown in Section 4.1. Assuming dBlk.x ≥ 16, half of the reads are 32B transactions and half are 64B. Since memory access latency to constant and shared memory is much shorter than to global memory, the number of accesses performed to constant and shared memory is ignored in this work as it has an insignificant impact on computation time.

For a naive implementation of 2D convolution⁴, thread_ij computes C_ij where C = A ∗ B. In image processing, A is an image and B is a filter; therefore, C is the result of applying a filter to an image and C_ij is one pixel of the resulting image. In this work, all filters, B, are assumed to be square, and the size, height or width, of a filter is represented by FS. For a naive implementation of convolution utilizing only global memory, to compute C_ij, thread_ij accesses FS² values of A and FS² values of B. Since each thread within a HW accesses the same value of B in each iteration of computation, the number of 32B transactions for accessing B is defined only by FS. If dBlk.x ≥ 16, neighboring threads within a HW access neighboring values of A and therefore the accesses are coalesced into one 64B transaction. However, if dBlk.x ≤ 8, no 64B transactions occur and the number of 32B transactions is dependent on FS and dBlk.x. Therefore, the number of global memory accesses is approximately⁵

    Gld_32B = (n²·FS²/16)·(1 + 16/dBlk.x)    if dBlk.x ≤ 8
            = n²·FS²/16                      if dBlk.x ≥ 16    (3.5)

    Gld_64B = n²·FS²/16                      if dBlk.x ≥ 16.    (3.6)

⁴From this point forward, convolution refers to 2D convolution of two matrices where C = A ∗ B.
⁵Due to boundary checking, not all reads are issued.

These equations assume only global memory is utilized for convolution. The equations are modified for varying memories as shown in Section 5.1.1. The number of accesses performed to shared memory is ignored in this work as it has an insignificant impact on computation time. In addition to formulating the number of reads to global memory, it is necessary to formulate the number of threads executing in parallel on the GPU; a short sketch collecting the read-count formulas above into code is given below.
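To make the read-count formulas concrete, the sketch below collects Equations (3.2) through (3.6) into a small plain-C helper. The function names and the use of double for the counts are assumptions made for illustration; the formulas themselves are those given above and apply only to the naive, global-memory-only implementations.

#include <stdio.h>

/* Equation (3.2): 32B reads for the naive global-memory Mv kernel. */
static double mv_gld32(double n, double dBlk_x) {
    return (dBlk_x <= 8) ? (dBlk_x + 1) * n * n / dBlk_x
                         : 17.0 * n * n / 16.0;
}

/* Equations (3.3) and (3.4): 32B and 64B reads for the naive MM kernel. */
static double mm_gld32(double n, double dBlk_x) {
    return (dBlk_x <= 8) ? n * n * n / dBlk_x + n * n * n / 16.0
                         : n * n * n / 16.0;
}
static double mm_gld64(double n, double dBlk_x) {
    return (dBlk_x >= 16) ? n * n * n / 16.0 : 0.0;
}

/* Equations (3.5) and (3.6): reads for the naive convolution kernel. */
static double conv_gld32(double n, double FS, double dBlk_x) {
    double base = n * n * FS * FS / 16.0;
    return (dBlk_x <= 8) ? base * (1.0 + 16.0 / dBlk_x) : base;
}
static double conv_gld64(double n, double FS, double dBlk_x) {
    return (dBlk_x >= 16) ? n * n * FS * FS / 16.0 : 0.0;
}

int main(void) {
    double n = 512, dBlk_x = 16;
    printf("Mv   Gld32B = %.0f\n", mv_gld32(n, dBlk_x));
    printf("MM   Gld32B = %.0f, Gld64B = %.0f\n",
           mm_gld32(n, dBlk_x), mm_gld64(n, dBlk_x));
    printf("Conv (FS=3) Gld32B = %.0f, Gld64B = %.0f\n",
           conv_gld32(n, 3, dBlk_x), conv_gld64(n, 3, dBlk_x));
    return 0;
}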
3.3.2 Active Threads

The number of threads executing at a given time on a GPU, or the number of active threads, is defined by Thds^active_GPU. To formulate the number of active threads, it is necessary to formulate the number of active blocks per SM, Blks^active_SM. Blks^active_SM is dependent on the number of registers used per block (RegsPerBlk), the amount of shared memory used per block (SMemPerBlk), the maximum number of blocks allowable, and the maximum number of threads allowable. The number of registers used per thread is dependent on the compiler and determined after compilation. The amount of shared memory used per block is defined by the kernel and determined after compilation. The maximum number of blocks and threads that can be active varies depending on the GPU. The T10 GPU consists of 30 SMs, and each SM can have a maximum of 8 blocks or 1024 threads. Assuming there are more blocks assigned for execution than SMs, dGrd.x · dGrd.y > 30, Blks^active_SM is formulated as

    Blks^active_SM = min( ⌊16384/RegsPerBlk⌋, ⌊16384/SMemPerBlk⌋, 8, 1024/(dBlk.x·dBlk.y) ).    (3.7)

Therefore, from Equation (3.7), the number of active blocks on the GPU at a given time is

    Blks^active_GPU = min( 30 · Blks^active_SM, dGrd.x · dGrd.y ).    (3.8)

The number of active threads on the GPU is formulated as the product of Equation (3.8) and the number of threads per block. Therefore,

    Thds^active_GPU = Blks^active_GPU · dBlk.x · dBlk.y.    (3.9)

3.3.3 Fragmented Threads

If the total number of threads on a GPU is not evenly divisible by the number of active threads (Thds^total_GPU mod Thds^active_GPU ≠ 0)⁶, fragmentation occurs. Fragmented threads are threads which are not executed with the same amount of parallelization as active threads. From (3.8) and (3.9), two execution metrics are formulated to represent the number of fragmented blocks and threads: Blks^frag._GPU and Thds^frag._GPU, respectively. The number of fragmented blocks on the GPU is

    Blks^frag._GPU = (dGrd.x · dGrd.y) mod Blks^active_GPU    (3.10)

and the number of fragmented threads on the GPU is

    Thds^frag._GPU = Blks^frag._GPU · dBlk.x · dBlk.y.    (3.11)

⁶x mod y denotes the remainder of x/y.

3.3.4 Global Memory Partitions

In addition to formulating execution metrics to represent the number of global memory accesses and the numbers of active and fragmented threads, two execution metrics are formulated to represent the layout of global memory. The global memory on a 200-series GPU architecture is divided into 8 partitions, where each partition is 256B wide. To represent the number of global memory partitions accessed per block, an execution metric, PartsPerBlk, is formulated as

    PartsPerBlk = dBlk.x / 64.    (3.12)

Since HWs are formed in row-major order, the x-dimension of the block is used. Since values of type float are 4 bytes and each global memory partition is 256 bytes wide, 64 values are stored per row of each partition. Therefore, Equation (3.12) represents the number of global memory partitions accessed per block. From PartsPerBlk, the number of global memory partitions accessed by the active blocks on the GPU, PartsPerGPU, is formulated as

    PartsPerGPU = min( Blks^active_GPU · PartsPerBlk, 8 )    if Blks^active_GPU < dGrd.x
                = min( dGrd.x · PartsPerBlk, 8 )             if Blks^active_GPU ≥ dGrd.x.    (3.13)

PartsPerGPU is dependent on the number of global memory partitions accessed per block and Blks^active_GPU. The equation defines a maximum of 8 partitions being accessed since there are 8 partitions of global memory in the T10 GPU.
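As a concrete illustration, the sketch below evaluates Equations (3.7) through (3.13) for the T10 (30 SMs, 16384 registers and 16384 bytes of shared memory per SM, at most 8 blocks or 1024 threads per SM). The struct, function names, and the example values in main are assumptions for illustration only.

#include <stdio.h>

/* Input parameters and per-block resource usage (names are assumed). */
typedef struct {
    long dGrd_x, dGrd_y, dBlk_x, dBlk_y;
    long RegsPerBlk, SMemPerBlk;   /* registers and bytes of shared memory per block */
} Params;

static long lmin(long a, long b) { return a < b ? a : b; }

/* Equation (3.7): active blocks per SM on the T10. */
static long blks_active_sm(const Params *p) {
    long lim = lmin(16384 / p->RegsPerBlk, 16384 / p->SMemPerBlk);
    lim = lmin(lim, 8);
    return lmin(lim, 1024 / (p->dBlk_x * p->dBlk_y));
}

static void gpu_metrics(const Params *p) {
    long blksTotal  = p->dGrd_x * p->dGrd_y;
    long blksActive = lmin(30 * blks_active_sm(p), blksTotal);   /* (3.8)  */
    long thdsActive = blksActive * p->dBlk_x * p->dBlk_y;        /* (3.9)  */
    long blksFrag   = blksTotal % blksActive;                    /* (3.10) */
    long thdsFrag   = blksFrag * p->dBlk_x * p->dBlk_y;          /* (3.11) */

    double partsPerBlk = (double)p->dBlk_x / 64.0;               /* (3.12) */
    double parts = (blksActive < p->dGrd_x) ? blksActive * partsPerBlk
                                            : p->dGrd_x  * partsPerBlk;
    double partsPerGPU = parts < 8.0 ? parts : 8.0;              /* (3.13) */

    printf("active blocks %ld, active threads %ld, fragmented threads %ld, "
           "partitions accessed %g\n", blksActive, thdsActive, thdsFrag, partsPerGPU);
}

int main(void) {
    Params p = { 64, 64, 64, 4, 2048, 4096 };   /* example values (assumed) */
    gpu_metrics(&p);
    return 0;
}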
3.4 Communication Time

Estimating the communication time between the CPU and GPU is necessary for a parallelization procedure to minimize execution time. For small matrix-based computations, it is possible that the communication time is higher than the CPU computation time. Therefore, it is necessary to accurately model the communication time between the CPU and GPU. In addition, communication and computation time estimates are necessary to determine the partitioning of computation between the CPU and GPU. Communication time, T_comm., is modeled linearly with a setup time, T_su, and a maximum transfer rate, T_x. Therefore,

    T_comm. = b / T_x + T_su    (3.14)

where b is the number of bytes being transferred. Measurements show that T_x is dependent on the type of transfer (CPU to GPU or GPU to CPU). In addition, analysis of measured data suggests T_su is a function of b. Therefore,

    T_su = c_0 · b + c_1    (3.15)

where c_0 is the time per byte to set up a transfer and c_1 is a constant. Both are determined through curve-fitting measured data. For Mv, MM, and convolution, there are two CPU-GPU transfers and one GPU-CPU transfer.

3.4.1 CPU-GPU

From NVIDIA, the advertised transfer rate between the CPU and GPU is 4 GB/s. However, the maximum transfer rate, T_x, measured for CPU to GPU communication is 3.5 GB/s. For transfers smaller than 256 KB, measured data shows c_0 is approximately 525 ms/GB and c_1 is approximately 0.014 ms. For transfers between 256 KB and 1 MB, the data shows c_0 is approximately 3.2 ms/GB and c_1 is approximately 0.061 ms. Lastly, for transfers of 1 MB and larger, the data shows c_0 is approximately 3.2 ms/GB and c_1 is approximately 0.19 ms. The varying values of c_0 and c_1 are presumably due to CPU cache size and memory layout as well as the layout of GPU memory. In this work, n ≥ 512, as discussed in Section 4.1, and therefore all CPU to GPU matrix transfers are larger than 1 MB. Figure 3.8 depicts the measured and estimated time for two CPU to GPU transfers. Two transfers are measured and estimated since the GPU requires a matrix and a vector for Mv and two matrices for MM and convolution. The nonuniform behavior illustrated in the figure arises because c_0 and c_1 vary with the size of the data.

Figure 3.8: Comparison of measured and estimated CPU to GPU communication time (ms) on the T10 GPU.

For Mv, 4n² + 4n bytes are transferred for the matrix, A, and vector, b, since values of type float are utilized. For MM, 8n² bytes are transferred for the two matrices, A and B. For convolution, 4n² + 4FS² bytes are transferred for the two matrices, A and B. The maximum and average percent error between the measured and estimated time are 5.0% and 1.0%, respectively.

3.4.2 GPU-CPU

The maximum transfer rate measured for GPU to CPU communication is 3.1 GB/s. For transfers smaller than 1 MB, measured data shows c_0 is approximately 210 ms/GB and c_1 is approximately 0.02 ms. For transfers larger than 1 MB, the data shows c_0 is approximately 6.9 ms/GB and c_1 is approximately 0.21 ms. The varying values of c_0 and c_1 are presumably due to CPU cache size and memory layout as well as the layout of GPU memory. Figure 3.9 depicts the measured and estimated time for one GPU to CPU transfer.

Figure 3.9: Comparison of measured and estimated GPU to CPU communication time (ms) on the T10 GPU.

For Mv, one GPU to CPU transfer of 4n bytes is necessary for the vector, c. For MM and convolution, one transfer of 4n² bytes is necessary for the matrix, C. The maximum and average percent error between the measured and estimated time are 8.9% and 1.4%, respectively.
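The model in Equations (3.14) and (3.15), together with the fitted constants reported above, can be written as a short estimator. The piecewise thresholds and constants below are the measured values quoted in Sections 3.4.1 and 3.4.2; the function and variable names are assumptions for illustration.

#include <stdio.h>

#define GB (1024.0 * 1024.0 * 1024.0)

/* Estimated CPU-to-GPU transfer time in ms for b bytes (Equations (3.14)-(3.15)). */
static double t_cpu_to_gpu_ms(double b) {
    double Tx = 3.5 * GB;              /* measured peak rate, bytes/s      */
    double c0, c1;                     /* setup-time fit, ms per GB and ms */
    if (b < 256.0 * 1024.0)        { c0 = 525.0; c1 = 0.014; }
    else if (b < 1024.0 * 1024.0)  { c0 = 3.2;   c1 = 0.061; }
    else                           { c0 = 3.2;   c1 = 0.19;  }
    return (b / Tx) * 1000.0 + c0 * (b / GB) + c1;
}

/* Estimated GPU-to-CPU transfer time in ms for b bytes. */
static double t_gpu_to_cpu_ms(double b) {
    double Tx = 3.1 * GB;
    double c0 = (b < 1024.0 * 1024.0) ? 210.0 : 6.9;
    double c1 = (b < 1024.0 * 1024.0) ? 0.02  : 0.21;
    return (b / Tx) * 1000.0 + c0 * (b / GB) + c1;
}

int main(void) {
    double n = 4096.0;
    /* Mv: A and b to the GPU, c back to the CPU. */
    double up   = t_cpu_to_gpu_ms(4.0 * n * n + 4.0 * n);
    double down = t_gpu_to_cpu_ms(4.0 * n);
    printf("Mv, n = %.0f: upload %.2f ms, download %.3f ms\n", n, up, down);
    return 0;
}

For the matrix sizes considered in this work (n ≥ 512), only the 1 MB-and-larger branch is exercised for the matrix transfers.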
3.5 Computation Time

For a parallelization procedure to minimize the execution time of matrix-based computations on a GPU, it is necessary to model CPU and GPU computation time. Computation time in this work refers to the time for memory accesses and computation. Determining which computations are performed by the CPU and which by the GPU requires accurate computation and communication time estimates. The execution time for matrix-based computations is defined as the sum of the communication and computation times,

    T_exec. = T_comm. + T_comp.

The computation time, T_comp., is the sum of the CPU and GPU computation times, T^CPU_comp. and T^GPU_comp., respectively, as shown in Equation (3.16):

    T_comp. = T^CPU_comp. + T^GPU_comp.    (3.16)

3.5.1 CPU

Due to CPU cache memories, branch prediction mechanisms, hyper-threading and other instruction and data latency-hiding optimizations, there is significant research in modeling CPU computation time. Since the goal of this work is to minimize the execution time of matrix-based computations on a GPU, an estimate of CPU computation time is obtained through analysis of measured data. In this work, the ATLAS [60] package is utilized for matrix-based computations executing on the CPU. ATLAS is an ongoing research project which provides C interfaces to an optimized BLAS implementation as well as some routines from LAPACK [18]. ATLAS is currently used in MAPLE, MATLAB and Octave, and is planned for use in Mathematica.

Computation time for Mv and MM is measured utilizing the ATLAS interface to the sgemv (Mv) and sgemm (MM) BLAS functions, respectively. Bandwidth is utilized to estimate CPU computation time and is calculated as the number of bytes accessed for each computation divided by the computation time. For Mv, 2n² float values are accessed, and therefore the effective bandwidth is 8n² bytes / T^CPU_comp.. From measured data, the maximum bandwidth is dependent on n and was measured between 2 GB/s and 8 GB/s. Therefore, for Mv, the bandwidth is estimated as min(max(16384/n, 2), 8) GB/s and

    T^CPU_comp. = 8n² / ( 1024³ · min(max(16384/n, 2), 8) )  seconds

for the optimized BLAS implementation of Mv. Figure 3.10 depicts the measured and estimated time for the optimized BLAS implementation of Mv on the CPU.

Figure 3.10: Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of Mv on the CPU.

The maximum and average percent error between the measured and estimated time are 3.0% and 1.4%, respectively. For MM, 2n³ float values are accessed, and therefore the effective bandwidth is 8n³ bytes / T^CPU_comp.. From measured data, the maximum bandwidth varies between 50 GB/s and 54 GB/s. From the median bandwidth measured (51 GB/s),

    T^CPU_comp. = 8n³ / ( 1024³ · 51 )  seconds

for the optimized BLAS implementation of MM. Figure 3.11 depicts the measured and estimated time for the optimized BLAS implementation of MM on the CPU.

Figure 3.11: Comparison of measured and estimated computation time (ms) for the optimized BLAS implementation of MM on the CPU.

The maximum and average percent error between the measured and estimated time are 5.6% and 3.2%, respectively.
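The two CPU-time estimates above translate directly into code. The sketch below is an illustrative C rendering of the Mv and MM expressions; the 51 GB/s median MM bandwidth and the clamped Mv bandwidth fit are the measured values quoted above, while the function names are assumptions.

#include <stdio.h>

/* Estimated CPU time, in seconds, for the optimized BLAS Mv (sgemv). */
static double t_cpu_mv(double n) {
    double bw = 16384.0 / n;            /* fitted bandwidth in GB/s          */
    if (bw < 2.0) bw = 2.0;             /* clamp to the measured 2-8 GB/s range */
    if (bw > 8.0) bw = 8.0;
    return 8.0 * n * n / (1024.0 * 1024.0 * 1024.0 * bw);
}

/* Estimated CPU time, in seconds, for the optimized BLAS MM (sgemm). */
static double t_cpu_mm(double n) {
    return 8.0 * n * n * n / (1024.0 * 1024.0 * 1024.0 * 51.0);
}

int main(void) {
    for (double n = 512; n <= 16384; n *= 2)
        printf("n = %5.0f: Mv %.4f s, MM %.2f s\n", n, t_cpu_mv(n), t_cpu_mm(n));
    return 0;
}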
Since BLAS routines do not include convolution, the time for a non-optimized C implementation is measured. For convolution, 2n²FS² float values are accessed; therefore, the effective bandwidth is calculated as 8n²FS² bytes / T^CPU_comp.. From measured data, the maximum bandwidth is estimated as ( (3·FS − n)/20000 + 2.7 ) GB/s and therefore

    T^CPU_comp. = 8n²FS² / ( 1024³ · ( (3·FS − n)/20000 + 2.7 ) )  seconds

for the non-optimized C implementation of convolution. Figure 3.12 depicts the measured and estimated time for the non-optimized C implementation of convolution with a filter size of 3 on the CPU.

Figure 3.12: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 3.

The maximum and average percent error between the measured and estimated time are 9.6% and 7.0%, respectively. Figure 3.13 depicts the time for convolution with a filter size of 63.

Figure 3.13: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 63.

The maximum and average percent error between the measured and estimated time are 4.0% and 2.1%, respectively. Figure 3.14 depicts the time for convolution with a filter size of 513.

Figure 3.14: Comparison of measured and estimated computation time (ms) for the non-optimized C implementation of convolution on the CPU. FS = 513.

The maximum and average percent error between the measured and estimated time are 15.1% and 8.6%, respectively. Therefore, for the implementations of Mv, MM, and convolution, the worst-case percent error between the measured and estimated CPU computation time is 15.1%.

3.5.2 GPU

Similar to CPU computation time, GPU computation time, T^GPU_comp., is dependent on the implementation of the matrix-based computation. Since the parallelization procedure minimizes the execution time of matrix-based computations executing on a GPU, it is assumed the procedure yields the minimum computation time. Therefore, the time is estimated assuming the theoretical bandwidth to global memory. The theoretical bandwidth to global memory is calculated as the product of the memory clock rate and the data bus width. The T10 GPU has a 792 MHz memory clock rate and a 512-bit data bus. Since the memory is double data rate, the theoretical bandwidth to global memory is approximately 94 GB/s. According to NVIDIA [61], 70-80% of theoretical bandwidth "is very good" for memory-bound computations.

Since GPU computation time is estimated assuming theoretical bandwidth, it is necessary to determine the number of bytes accessed from global memory for each matrix-based computation. The number of bytes accessed is dependent on the memory utilized and the size of each memory transaction. Therefore, the GPU computation time is estimated in the last step of the parallelization procedure, after the type of memory is determined and the number and size of memory transactions are calculated.
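The theoretical-bandwidth figure used for the GPU estimate follows directly from the clock rate and bus width; a one-line check (in C, names assumed):

#include <stdio.h>

int main(void) {
    /* T10: 792 MHz memory clock, 512-bit bus, double data rate. */
    double clock_hz  = 792e6;
    double bus_bytes = 512.0 / 8.0;
    double bw_bytes  = clock_hz * bus_bytes * 2.0;     /* x2 for double data rate */
    printf("theoretical global memory bandwidth: %.1f GB/s\n",
           bw_bytes / (1024.0 * 1024.0 * 1024.0));      /* about 94 GB/s */
    return 0;
}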
Chapter 4
Parallelization Procedure

This chapter describes the development of the parallelization procedure and its impact on minimizing execution time for matrix-based computations on the GPU. The procedure is developed to be general in that it applies to any matrix-based computation. Each step of the procedure is developed through research, testing, and analysis of measured data for Mv, MM, or both. The procedure is applied to convolution and the conjugate gradient method in Chapter 5 for verification. The procedure considers the placement of data in GPU memory, computation patterns of the GPU, access patterns to GPU memory, fine-tuning of GPU code, input parameters to the GPU, and computation partitioning between the CPU and GPU. Each of these items is discussed individually in Sections 4.1 - 4.6.

In each section of this chapter, one step of the parallelization procedure is developed through analysis of Mv, MM, or both. Results are included at the end of each section to demonstrate the effectiveness of each step, and each section includes the kernel for each matrix-based computation to illustrate the application of that step of the procedure. All results in this chapter were gathered using the NVIDIA Tesla S1070, which includes a pair of T10 GPUs connected to a quad-core 2.26 GHz Intel Xeon E5520 CPU, at the Alabama Supercomputer Authority [62]. The height or width of a square matrix, n, is varied in the results from 512 to 16384 as defined by Equation (4.1) in Section 4.1. dBlk.x is varied in the results from 8 to 512, the maximum allowable value for the T10.

The first step is to determine the optimal placement of data in GPU memory. Therefore, Section 4.1 is a discussion of the three types of memory in the GPU and the appropriateness of each memory for various computations. Utilizing GPU memories other than global memory can reduce the number of accesses to global memory and therefore reduce GPU computation time. This is the first step of the procedure, as it is the most common optimization applied to GPU computations to minimize time. In addition, other steps of the procedure, such as deriving the optimized computation and access patterns and the optimal input parameters, are dependent on which memories are utilized; therefore, this is the first step. An algorithm is presented to determine the optimal placement of data in GPU memory to minimize GPU computation time.

Step 2, in Section 4.2, is a discussion of the various computation patterns determined by the kernel. The kernel defines which result(s) each thread computes, and therefore an examination is performed of various computation patterns of Mv and MM. From measured data, it is shown that for Mv, if thread_j computes one value of c, c_j, where c = Ab, the computation time is bound by the number of threads and the data size. Likewise, for MM, if thread_ij computes one value of C, C_ij, where C = AB, the computation time is bound by computation and memory accesses. Increasing the overlap of computation and memory accesses reduces the computation time. Therefore, the optimized computation pattern which reduces GPU computation time is derived. Since the access pattern is dependent on the computation pattern, this step of the procedure is included before deriving the optimized access pattern.

The optimized access pattern is dependent on the placement of data in GPU memory and the computation pattern. Therefore, Step 3 is included to derive the optimized access pattern. Access patterns to GPU memory are discussed in Section 4.3, and the optimized access pattern is derived to minimize partition camping and eliminate bank conflicts. This step is developed through analysis of Mv and MM.
For Mv, bank conflicts occur when accessing shared memory and affect computation time as shown in Section 3.2. For Mv and MM, all HWs begin accessing matrix A from the same partition of global memory; thus partition camping occurs. From Section 3.1, partition camping affects computation time. Therefore, the optimized access pattern is derived to minimize partition camping and eliminate bank conflicts.

After the optimal placement of data is determined and the optimized computation and access patterns are derived, fine-tuning of the kernels is performed to reduce the amount of computation and resource allocation, as shown in Section 4.4. This step, Step 4, is developed from analysis of MM, as the optimized computation pattern for MM utilizes many registers. Due to the register usage, this step is included in the procedure to reduce the register requirement of the kernels such that more threads execute in parallel, thus reducing the execution time. Included in this step is the minimization of index calculations. Developed from analysis of measured data for MM, reducing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions and thus reduces the computation time.

After fine-tuning the kernels, analysis of Mv and MM demonstrates the necessity of optimal input parameters. Therefore, Step 5 of the procedure is included to derive the optimal input parameters, which are the input parameters that yield the minimum GPU computation time. From measured data, computation time varies significantly depending on the number of blocks and the size of each block. Therefore, from the formulation of execution metrics in Section 3.3, optimal input parameters are derived in Section 4.5. This step of the procedure is performed after Steps 1-4 since each of the aforementioned steps affects the execution metrics; the execution metrics depend on such things as the allocation of registers and shared memory.

The last step, Step 6, of the parallelization procedure to minimize execution time is to determine the partitioning of computation between the CPU and GPU, as shown in Section 4.6. From measured data for Mv, it is shown that the communication time may exceed the CPU computation time. Therefore, optimally determining which computations are performed by the CPU and which by the GPU is included in the procedure. Optimally determining the partitioning of computation requires accurate estimates of communication and computation times, as discussed in Sections 3.4 and 3.5, respectively. An algorithm utilizing the communication and computation times is included for determining the optimal partitioning of computation.

Therefore, the parallelization procedure to minimize execution time of matrix-based computations utilizing the GPU consists of
1. determining the optimal placement of data in GPU memory,
2. deriving the optimized computation pattern which reduces computation time,
3. deriving the optimized access pattern to data in GPU memory,
4. fine-tuning GPU code to reduce computation time,
5. deriving optimal input parameters which yield the minimum computation time for the GPU, and
6. optimally partitioning computation between the CPU and GPU.

4.1 Placement of Data

To minimize GPU computation time, and thus the execution time, it is necessary to consider which GPU memories are utilized. Three types of memory exist in the T10 GPU: global, constant and shared. Global memory is the only memory available for both CPU to GPU and GPU to CPU transfers.
It is the main memory of the device due to its read/write capability, large size, accessibility by both the CPU and GPU, and its accessibility by all executing threads. It is the largest of the available memories, at 4 GB in the T10. However, it is the slowest memory, with a peak theoretical bandwidth of 94 GB/s. Since some global memory is reserved by the system, not all 4 GB are accessible. Therefore, using powers of two for n, the height or width of a square matrix, the maximum value of n is 16384. The minimum value of n in this work is determined such that one row of a matrix spans at least one global memory partition. Therefore,

    512 ≤ n ≤ 16384.    (4.1)

Constant memory is available for CPU to GPU transfers and is used for storing constant data; it is a read-only memory from the GPU's perspective. Similar to shared memory, it is much faster than global memory and can reduce computation time by reducing the number of global memory accesses. The T10 GPU includes 64 KB of constant memory that is accessible by all threads. However, since it is read-only by the GPU, it can only be used for transferring data from the CPU to the GPU. In computations such as MM, when n ≥ 512, utilizing constant memory is not possible as the matrices exceed its size. However, for Mv, constant memory can be utilized to store the vector, and thus reduce the number of global memory reads as well as the partition camping effect.

Shared memory is read/write for the GPU but not accessible to the CPU. Therefore, to utilize shared memory, the CPU must transfer data to the GPU's global memory and the GPU must transfer the data from global memory to shared memory. Shared memory is much faster than global memory and is widely used as temporary storage, or scratchpad space, for threads. Utilizing shared memory for computations such as MM can greatly reduce the computation time as many of the accesses to global memory can be eliminated: data is transferred from the CPU to the GPU's global memory and then the GPU copies portions of the data that are reused to shared memory, greatly reducing the number of global memory accesses. The T10 GPU includes 16 KB of shared memory per SM, for a total of 480 KB of shared memory. However, shared memory is allocated per SM, so threads executing on one SM cannot read or write the shared memory of another SM. Even if it were possible, with n ≥ 512 the limited size would not accommodate one matrix. However, significant reductions in computation time can be realized in computations which reuse data by partitioning the reused data into smaller blocks and then utilizing shared memory. For MM, data is reused and utilizing shared memory reduces the demand on the global memory bus, thus reducing computation time. Shared memory can also be utilized to coalesce accesses to global memory for data which is not reused, such as the matrix in Mv, and therefore reduce computation time.

From the summaries of each memory, Listing 4.1 is developed to determine the placement of data.

1 if(data_reused)
2     if(sizeof(data_reused) ≤ sizeof(cmem))
3         utilize cmem for data_reused
4     if(sizeof(data_reused) > sizeof(cmem))
5         partition data_reused into blocks ≤ sizeof(smem)
6         utilize smem for data_reused
7 else
8     if(uncoalesced memory accesses)
9         utilize smem to coalesce accesses
10    utilize gmem for data

Listing 4.1: Algorithm for determining the placement of data in GPU memory.

From Lines 7-10, data which is not reused for computation is stored in global memory (gmem). Since constant memory (cmem) and shared memory (smem) have similar access times, data which is reused (data_reused) and smaller than the maximum amount of constant memory is stored in constant memory, as shown in Lines 2 and 3. Data which is reused but larger than constant memory is partitioned into blocks no greater than the size of shared memory, as shown in Lines 4-6. However, the CPU does not have access to shared memory, so reused data must first be transferred to global memory; the GPU kernel then partitions the reused data into blocks and transfers the data from global to shared memory. Lines 8-9 refer to the access pattern to data which is not reused. If there is no data reuse, such as for A in Mv, shared memory is utilized to coalesce the uncoalesced memory accesses and thus reduce computation time.
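As an illustration only, the decision logic of Listing 4.1 can be expressed as a small host-side helper. The enum, the 64 KB constant-memory capacity of the T10, and the function name are assumptions made for this sketch; in the procedure itself the placement is decided per computation as described in the following subsections.

#include <stddef.h>

typedef enum { USE_GMEM, USE_CMEM, USE_SMEM_BLOCKED, USE_GMEM_WITH_SMEM_STAGING } Placement;

/* Schematic rendering of Listing 4.1 for one piece of data. */
static Placement place_data(size_t bytes, int reused, int coalesced)
{
    const size_t CMEM_BYTES = 64 * 1024;       /* T10 constant memory      */

    if (reused) {
        if (bytes <= CMEM_BYTES)
            return USE_CMEM;                   /* Lines 2-3 of Listing 4.1 */
        return USE_SMEM_BLOCKED;               /* Lines 4-6: tile through shared memory */
    }
    if (!coalesced)
        return USE_GMEM_WITH_SMEM_STAGING;     /* Lines 8-9: stage via shared memory */
    return USE_GMEM;                           /* Line 10                  */
}

For Mv, for example, b occupies 4n bytes, which is at most 64 KB for n ≤ 16384 and is reused, so it maps to constant memory, while A is not reused and its accesses are uncoalesced, reproducing the choices made in Section 4.1.1.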
4.1.1 Mv

Listing 4.2 is the kernel for the global memory implementation of Mv on a GPU, where each thread computes one c_j. As mentioned in Section 4.2, computation time increases as each thread computes additional values of c.

1 __global__ void Mv(float* A, float* b, float* c, int n) {
2     int index = blockIdx.x * blockDim.x + threadIdx.x;
3     float temp = 0.0;
4     for(int j = 0; j < n; j++) {
5         temp += A[index * n + j] * b[j];
6     }
7     c[index] = temp;
8 }

Listing 4.2: The global memory implementation of Mv on a GPU.

Each thread computes the index of the c_j to compute in Line 2. Partial c_js are computed by each thread in Line 5 and the final c_j is stored in global memory in Line 7. The number of reads of A and b for this kernel is defined by Equation (3.2). Assuming dBlk.x ≥ 16, the number of reads is 17n²/16. The CPU code for this implementation of Mv is shown in Listing 4.3.

1 ...
2 float* d_A;
3 cudaMalloc((void**) &d_A, sizeof(float) * n * n);
4 float* d_b;
5 cudaMalloc((void**) &d_b, sizeof(float) * n);
6 float* d_c;
7 cudaMalloc((void**) &d_c, sizeof(float) * n);
8 cudaMemcpy(d_A, h_A, sizeof(float) * n * n, cudaMemcpyHostToDevice);
9 cudaMemcpy(d_b, h_b, sizeof(float) * n, cudaMemcpyHostToDevice);
10 Mv<<< dGrd.x, dBlk.x >>>(d_A, d_b, d_c, n);
11 cudaMemcpy(h_c, d_c, sizeof(float) * n, cudaMemcpyDeviceToHost);
12 ...

Listing 4.3: CPU code for the global memory implementation of Mv.

The notation h denotes CPU (host) memory and d denotes GPU (device) memory. Lines 2-7 allocate the matrix and vectors in global memory of the GPU. Lines 8 and 9 transfer A (h_A) and b (h_b) from the CPU to global memory (d_A and d_b). Line 10 launches the kernel for execution using dGrd.x blocks and dBlk.x threads per block; pointers to A, b and c, and the size of the data, n, are passed to the kernel. Line 11 transfers c from GPU global memory, d_c, to the CPU, h_c.

Lines 2-3 of Listing 4.1, which determines the placement of data, specify that reused data is placed in constant memory if it is no larger than constant memory. b, which is reused, is of length n, and the upper limit of n is defined by Equation (4.1). Therefore, since n ≤ 16384, b is stored in constant memory. Listing 4.4 shows the modification of the kernel in Listing 4.2 necessary to utilize constant memory.

1 __constant__ float b[n];
2 __global__ void Mv(float* A, float* c, int n) {
3 ...

Listing 4.4: The constant memory implementation of Mv on a GPU.

Line 1 allocates constant memory of size n for b. Constant memory cannot be dynamically allocated, and therefore n must be defined at compile time for execution.
Pointers to A and c, which reside in global memory, and the height or width of A, n, are passed to the kernel, as shown in Line 2 of the kernel definition. To transfer b from the CPU to GPU constant memory, Line 9 of Listing 4.3 is modified as shown in Listing 4.5.

8 ...
9 cudaMemcpyToSymbol(b, h_b, sizeof(float) * n);
10 Mv<<< dGrd.x, dBlk.x >>>(d_A, d_c, n);
11 ...

Listing 4.5: CPU code for the constant memory implementation of Mv.

Line 10 of Listing 4.3 is modified such that the kernel is no longer passed a pointer to b. No other modifications to the kernel or CPU code are necessary. Utilizing constant memory for b eliminates global memory accesses to b. Therefore, the number of global memory reads for the constant memory implementation of Mv is

    Gld_32B = n².

This is a 6.25% reduction of global memory reads compared to the global memory implementation of Mv. In addition, with fewer accesses to global memory, partition camping is reduced.

For the constant memory implementation of Mv, thread_j, to compute c_j, computes row_j · b where row_j is one row of A. Therefore, the neighboring thread, thread_(j+1), computes row_(j+1) · b. An example of a HW accessing A is depicted in Figure 4.1. In this example, the HW is only 4 threads and each memory access is 4 bytes. Since the values of A needed in each iteration reside in different rows, the accesses to A are uncoalesced and the HW performs 4 reads in each iteration.

Figure 4.1: Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv. Each memory transaction is 4 bytes. The accesses to A are uncoalesced and 4 accesses are required for each iteration.

Assuming dBlk.x ≥ 16, each HW accesses 16 values of A from global memory. The 16 values of A reside in different rows of A, and therefore the accesses are uncoalesced and require 16 32B transactions¹. This is repeated n times for each HW. Coalescing the accesses to A significantly reduces the number of transactions and thus the computation time. Reducing the number of transactions also reduces the effects of partition camping.

¹32B is the minimum size of a memory transaction on the T10 GPU.

Lines 8-9 of Listing 4.1, which defines the placement of data, specify that shared memory is utilized to coalesce uncoalesced accesses for data which is not reused. Since the accesses to A in the global and constant memory implementations of Mv are uncoalesced, shared memory is utilized to coalesce the accesses, as shown in Figure 4.2. In the figure, the HW is only 4 threads.

Figure 4.2: Example of a HW, consisting of 4 threads, accessing A in each iteration of Mv utilizing shared memory. Each memory transaction is 4 bytes. The accesses to A are coalesced and 4 accesses are required every 4th iteration.

In the first iteration, the HW performs 4 reads of 4 bytes each and stores the data in shared memory. Therefore, in the next iteration, the next value needed for computation is read from shared memory instead of global memory. This reduces the number of global memory accesses by a factor of 4. Listing 4.6 is the kernel for the shared memory implementation of Mv.
1 __constant__ float b[n];
2 __global__ void Mv(int n, float* A, float* c) {
3     __shared__ float As[dBlkx][dBlkx];
4     int index = blockIdx.x * blockDim.x;
5     float temp = 0.0;
6     for(int j = 0; j < n; j += blockDim.x) {
7         for(int k = 0; k < blockDim.x; k++) {
8             As[k][threadIdx.x] = A[(index + k) * n + threadIdx.x + j];
9         }
10        __syncthreads();
11        for(int k = 0; k < blockDim.x; k++) {
12            temp += As[threadIdx.x][k] * b[j + k];
13        }
14        __syncthreads();
15    }
16    c[index + threadIdx.x] = temp;
17 }

Listing 4.6: The shared memory implementation of Mv on a GPU. Constant memory is utilized for b.

In Line 3, shared memory of size dBlk.x² is allocated per block. Therefore, for this kernel, the amount of shared memory per block is

    SMemPerBlk = dBlk.x² · 4.    (4.2)

The first row each block accesses from A is calculated in Line 4. In Lines 7-9, HWs load the values of A into shared memory in a coalesced manner. Line 10 is a synchronization instruction: HWs in a block stop executing at this instruction until all HWs within the block reach it. This ensures that shared memory is loaded before the following computation. In Lines 11-13, threads utilize the shared memory values to compute a portion of c_j. Line 14 ensures each HW completes the current iteration of computation before shared memory is reloaded. Similar to constant memory, shared memory cannot be allocated dynamically, and therefore dBlk.x must be defined at compile time (denoted dBlkx in the kernel).

As mentioned, the shared memory implementation of Mv, Listing 4.6, significantly reduces the number of global memory accesses by coalescing the accesses to A. The number of global memory accesses for the shared memory implementation of Mv is

    Gld_32B = n² / dBlk.x    if dBlk.x ≤ 8

and

    Gld_64B = n² / 16        if dBlk.x ≥ 16.

Assuming dBlk.x ≥ 16, the total number of reads for this implementation is n²/16, which is 16 times fewer than the number of reads for the constant memory implementation. Therefore, the shared memory implementation of Mv significantly reduces the computation time compared to the global and constant memory implementations.

Figure 4.3 depicts the maximum effective bandwidth of the global memory implementation of Mv (Listing 4.2) and the shared and constant memory implementation (Listing 4.6). Effective bandwidth is calculated as 8n² bytes / T^GPU_comp., as discussed in Section 3.5.

Figure 4.3: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.

As illustrated in the figure, the shared and constant memory implementation yields a significant speedup compared to the global memory implementation. As mentioned, utilizing shared memory for Mv coalesces memory accesses to global memory, and utilizing constant memory reduces the number of global memory transactions and therefore reduces partition camping. The shared and constant memory implementation yields a speedup, on average, of 4.79 compared to the global memory implementation. The minimum (n = 4096) and maximum (n = 16384) speedups are 2.51 and 10.58, respectively. Therefore, for Mv, this step of the procedure significantly reduces GPU computation time.
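Equation (4.2) ties the block shape directly to occupancy: the per-block shared-memory requirement grows with dBlk.x², and by Equation (3.7) it limits how many blocks an SM can hold. The short sketch below is an illustration rather than part of the procedure; it prints the requirement for the power-of-two block widths used in this work, and with 16 KB of shared memory per SM on the T10 it shows that dBlk.x = 64 is the largest width the kernel in Listing 4.6 can use.

#include <stdio.h>

int main(void) {
    const long smem_per_sm = 16 * 1024;          /* T10: 16 KB shared memory per SM */
    /* Equation (4.2): SMemPerBlk = dBlk.x^2 * 4 for the kernel in Listing 4.6. */
    for (long dBlk_x = 8; dBlk_x <= 512; dBlk_x *= 2) {
        long smem_per_blk = dBlk_x * dBlk_x * 4;
        printf("dBlk.x = %3ld: SMemPerBlk = %7ld bytes, blocks per SM (smem limit) = %ld\n",
               dBlk_x, smem_per_blk,
               smem_per_blk <= smem_per_sm ? smem_per_sm / smem_per_blk : 0);
    }
    return 0;
}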
4.1.2 MM

Listing 4.7 is the kernel for the global memory implementation of MM, where thread_ij computes C_ij.

1 __global__ void MM(float* A, float* B, float* C, int n) {
2     int Row = blockIdx.y * blockDim.y + threadIdx.y;
3     int Col = blockIdx.x * blockDim.x + threadIdx.x;
4     float temp = 0.0;
5     for(int j = 0; j < n; j++) {
6         temp += A[Row * n + j] * B[j * n + Col];
7     }
8     C[Row * n + Col] = temp;
9 }

Listing 4.7: The global memory implementation of MM on a GPU where each thread computes one C_ij.

Pointers to A, B and C and the height or width of each matrix, n, are passed to the kernel in Line 1. Lines 2 and 3 compute the row and column indices of the C_ij computed by each thread, respectively. The x- and y-indices of each block are specified by blockIdx.x and blockIdx.y, respectively; block indices in the x- and y-dimensions range from 0 to dGrd.x − 1 and 0 to dGrd.y − 1, respectively. The x- and y-indices of each thread are specified by threadIdx.x and threadIdx.y, respectively; thread indices in the x- and y-dimensions range from 0 to dBlk.x − 1 and 0 to dBlk.y − 1, respectively. The x- and y-dimensions of each block are specified by blockDim.x (dBlk.x) and blockDim.y (dBlk.y), respectively. In Line 6, each thread computes a partial C_ij, and the final C_ij is stored in global memory in Line 8. The number of global memory reads for the global memory implementation of MM is defined by Equations (3.3) and (3.4). Assuming dBlk.x ≥ 16, all accesses to A and B are coalesced and 2n³/16 accesses are performed.

Lines 4-6 of Listing 4.1, which determines the placement of data, specify that shared memory is utilized for reused data which is larger than constant memory. Line 5 specifies that the reused data is partitioned into blocks smaller than the size of shared memory. Therefore, A and B are partitioned into blocks of shared memory to reduce the number of global memory accesses, as shown in Listing 4.8. Lines 3 and 4 allocate shared memory for blocks of A and B. Therefore, for this kernel,

    SMemPerBlk = (dBlk.x · dBlk.y + dBlk.x²) · 4.    (4.3)

1 __global__ void MM(float* A, float* B, float* C, int n,
2                    int nele_x, int nele_y) {
3     __shared__ float As[dBlky][dBlkx];
4     __shared__ float Bs[dBlkx][dBlkx];
5     int Row = blockIdx.y * blockDim.y + threadIdx.y;
6     int Col = blockIdx.x * blockDim.x + threadIdx.x;
7     float temp = 0.0;
8     for(int j = 0; j < n; j += blockDim.x) {
9         As[threadIdx.y][threadIdx.x] = A[Row * n + (j + threadIdx.x)];
10        Bs[threadIdx.y][threadIdx.x] = B[(j + threadIdx.y) * n + Col];
11        __syncthreads();
12        for(int k = 0; k < blockDim.x; k++) {
13            temp += As[threadIdx.y][k] * Bs[k][threadIdx.x];
14        }
15        __syncthreads();
16    }
17    C[Row * n + Col] = temp;
18 }

Listing 4.8: The shared memory implementation of MM on a GPU.

Lines 9 and 10 load blocks of A and B into shared memory. This kernel assumes dBlk.x = dBlk.y; slight modifications to Line 10 are required for loading B if dBlk.x ≠ dBlk.y. Line 11 forces all threads within a block to wait until shared memory is loaded by all HWs within that block. Partial C_ijs are computed in Line 13. Line 15 forces all threads within a block to wait until all HWs have performed their respective computations with the current data in shared memory before new data is loaded. Calculating the number of reads for the shared memory implementation of MM in Listing 4.8 yields

    Gld_32B = (n³ / dBlk.x) · (1/dBlk.x + 1/dBlk.y)    if dBlk.x ≤ 8    (4.4)

and

    Gld_64B = (n³ / 16) · (1/dBlk.x + 1/dBlk.y)        if dBlk.x ≥ 16.    (4.5)

If dBlk.x ≥ 16, all global memory accesses are 64B, and the number of accesses to global memory is reduced by a factor of 2·dBlk.x·dBlk.y / (dBlk.x + dBlk.y) compared to the global memory implementation.
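As a quick check of the reduction factor quoted above, the sketch below evaluates Equations (3.3)-(3.4) and (4.5) for a few block shapes (plain C, names and example values assumed); for a 16 x 16 block the shared-memory kernel performs 16 times fewer global memory reads.

#include <stdio.h>

int main(void) {
    double n = 2048.0;
    double shapes[][2] = { {16, 16}, {32, 8}, {64, 4} };

    for (int i = 0; i < 3; i++) {
        double bx = shapes[i][0], by = shapes[i][1];
        /* Global-memory MM, dBlk.x >= 16: Equations (3.3) and (3.4). */
        double reads_gmem = n * n * n / 16.0 + n * n * n / 16.0;
        /* Shared-memory MM, dBlk.x >= 16: Equation (4.5) (all reads are 64B). */
        double reads_smem = n * n * n / 16.0 * (1.0 / bx + 1.0 / by);
        printf("dBlk = %2.0fx%-2.0f: %.3g -> %.3g reads (factor %.1f)\n",
               bx, by, reads_gmem, reads_smem, reads_gmem / reads_smem);
    }
    return 0;
}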
Figure 4.4 depicts the maximum effective bandwidth of the global memory implementation of MM (Listing 4.7) and the shared memory implementation (Listing 4.8). For MM, effective bandwidth is calculated as 8n³ bytes / T^GPU_comp., as discussed in Section 3.5.

Figure 4.4: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for MM on the T10 GPU.

Similar to Mv, the shared memory implementation yields a significant speedup compared to the global memory implementation. Utilizing shared memory reduces the number of global memory transactions through data reuse, thus reducing computation time and increasing effective bandwidth. The shared memory implementation yields a speedup, on average, of 7.32 compared to the global memory implementation. The minimum (n = 16384) and maximum (n = 512) speedups are 5.96 and 10.46, respectively. Therefore, for MM, Step 1 of the procedure significantly reduces GPU computation time.

4.2 Computation Patterns

Computation patterns are determined by the code and represent the manner and order in which results are computed. For Mv, if thread_j computes one value of c, c_j, where c = Ab, there are no varying computation patterns. Likewise, for MM, if thread_ij computes one value of C, C_ij, where C = AB, there are no varying computation patterns. However, if threads compute more or fewer than one value, the order in which threads compute creates computation patterns.

Since partition camping is affected by the computation pattern, a derivation of the computation pattern which minimizes partition camping is necessary. In addition, computation patterns can affect the number of accesses to global memory. Therefore, it is necessary to derive the optimal computation patterns for matrix-based computations. Deriving the optimal pattern requires an examination of the various patterns.

4.2.1 Mv

For Mv, there are two computation patterns if thread_j computes multiple values of c, as depicted in Figure 4.5. A computation pattern where each thread computes neighboring values of c, grouped, is depicted in Figure 4.5a. A computation pattern where each thread computes values of c which are not neighboring, spread, is depicted in Figure 4.5b.

Figure 4.5: Computation patterns: two computation patterns for Mv for computing multiple c_js. (a) Grouped. (b) Spread.

Figure 4.6 depicts the average computation time for Mv on the T10 GPU if each thread computes one or more values of c utilizing the spread computation pattern. dGrd.x · dBlk.x is the total number of threads. Each thread computes at least one value of c, and therefore the maximum number of total threads is n. The average computation time decreases as the number of total threads increases. The minimum time is achieved, for all data sizes tested, when the number of total threads equals n. This is due to the memory bus not being fully utilized when dGrd.x · dBlk.x ≤ 8192.²

²Discussion of the number of threads necessary to saturate the memory bus is in Section 4.5.

Figure 4.6: Average computation time (ms) varying the number of total threads for Mv on the T10 GPU.

Optimized

To increase the number of total threads, it is necessary that multiple threads be utilized to compute one c_j.
Therefore, the optimized computation pattern consists of each thread computing a part of c_j, as depicted in Listing 4.9.

1 __constant__ float b[n];
2 __global__ void Mv(float* A, float* c, int n) {
3     __shared__ float As[dBlkx][dBlkx];
4     float temp = 0.0;
5     int index = blockIdx.x * blockDim.x;
6     for(int j = 0; j < n; j += blockDim.x) {
7         for(int k = 0; k < blockDim.x / blockDim.y; k++) {
8             As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] =
                  A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + threadIdx.x + j];
9         }
10        __syncthreads();
11        for(int k = 0; k < blockDim.x / blockDim.y; k++) {
12            temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y]
                      * b[j + k + threadIdx.y * blockDim.x / blockDim.y];
13        }
14        __syncthreads();
15    }
16    if(threadIdx.y != 0) {
17        As[threadIdx.y-1][threadIdx.x] = temp;
18    }
19    __syncthreads();
20    if(threadIdx.y == 0) {
21        for(int k = 0; k < blockDim.y-1; k++) {
22            temp += As[k][threadIdx.x];
23        }
24        c[index + threadIdx.x] = temp;
25    }
26 }

Listing 4.9: The shared memory implementation of Mv on a GPU utilizing the optimized computation pattern. Constant memory is utilized for b.

In Listing 4.9, each thread computes a part of a c_j and each block computes dBlk.x c_js. Shared memory is allocated in Line 3 and therefore

    SMemPerBlk = dBlk.x² · 4.

In Line 5, index is calculated, which specifies the group of c_js computed by each block. Lines 7-9 load parts of A into shared memory. All threads within a block are synchronized in Line 10 to ensure shared memory is loaded before computation. Each thread computes a part of a c_j in Lines 11-13. Threads are again synchronized in Line 14 to prevent shared memory from being reloaded before each thread completes computation with the current data. At the end of the for-loop in Line 15, each thread holds a part of a c_j in temp. Threads with threadIdx.y ≠ 0 store their parts of c_j into shared memory; the shared memory utilized for portions of A is reused for the parts of c_j computed by each thread. Threads with threadIdx.y = 0 sum the parts of c_j stored in shared memory in Lines 20-23. Lastly, in Line 24, these threads store the final c_js in c in global memory.

Since b is stored in constant memory, global memory reads are issued only for accessing A. In Line 8, the number of accesses to A by each HW is dependent on dBlk.x, dBlk.y, and threadIdx.y. Therefore,

    Gld_32B = n² / dBlk.x    if dBlk.x ≤ 8    (4.6)

and

    Gld_64B = n² / 16        if dBlk.x ≥ 16    (4.7)

which is equivalent to the number of accesses in the shared memory implementation of Mv without the optimized computation pattern (Listing 4.6).

Figure 4.7 illustrates the maximum effective bandwidth of Mv before (Listing 4.6) and after (Listing 4.9) utilizing the optimized computation pattern. As discussed in Section 3.5, for Mv, effective bandwidth is calculated as 8n² bytes / T^GPU_comp..

Figure 4.7: Comparison of computation patterns: Maximum effective bandwidth (GB/s) for Mv on the T10 GPU.

Since the optimized computation pattern increases the number of total threads, the effective bandwidth is significantly increased. The optimized computation pattern yields a speedup, on average, of 1.80 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 4096) speedups are 1.59 and 2.23, respectively.
Therefore, Step 2 of the parallelization procedure, deriving the optimized computation pattern, is valid and necessary for minimizing GPU computation time of Mv. 4.2.2 MM For MM, rather than Mv, increasing the number of total threads does not necessarily yield the minimum computation time due to saturation of the memory bus. With a satu- rated memory bus, the e?ects of fragmentation and partition camping are more apparent. Therefore, if threadij computes more than Cij, the computation pattern, or order in which the threads compute results, a?ects the computation time. Four computation patterns for MM are illustrated in Figure 4.8 if each thread com- putes multiple Cijs. Figure 4.8a depicts a computation pattern where each thread computes (a) Grouped. (b) Column-major. (c) Row-major. (d) Hybrid. Figure 4.8: Computation patterns: four computation patterns for MM for computing mul- tiple Cijs. neighboring Cijs, de?ned as grouped. Figures 4.8b and c, de?ned as column-major and row- major, respectively, depict computation patterns where each thread computes multiple Cijs per column or row, respectively. Merging column- and row-major patterns yields the last computation pattern, de?ned as hybrid, as depicted in Figure 4.8d. Grouped The grouped computation pattern, depicted in Figure 4.8a, causes a reduction of mem- ory coalescing. Coalescing as many memory reads as possible at the HW level is ideal to minimize the computation time. Considering an example where each thread is to compute 60 4 Cijs in a grouped pattern, Cij, Ci(j+1), C(i+1)j and C(i+1)(j+1), to compute the ?rst Cij, threadij accesses one row of A, rowi and one column of B, colj. Since threads are assigned to warps in row-major order, the neighboring thread of threadij, threadi(j+1), accesses the same row of A, rowi, but a di?ering column of B, colj+2 to compute Ci(j+2) 3. In general, each HW read will only use 1x of the bytes where x is the number of Cijs computed in the x-dimension, thus reducing the number of coalesced memory accesses to B by a factor of x. Therefore, this computation pattern is not included in the results as it demonstrates the e?ects of memory coalescing more than partition camping. Column-major The row- and column-major computation patterns do not cause a reduction of mem- ory coalescing, as the grouped pattern does, if each block has at least 8 threads in the x-dimension, dBlk:x ? 8. For the column-major computation pattern, depicted in Figure 4.8b, threadij computes C(i+y?dGrd:y?dBlk:y)j for y = 0 to nele y where nele y is the number of Cijs to compute. For this pattern, colj is repeatedly read by threadij while the row that is read varies from i to (nele y?dGrd:y ?dBlk:y). If n%512 = 0, all accesses to colj remain in the same partition. Row-major For the row-major computation pattern, depicted in Figure 4.8c, threadij computes Ci(j+x?dGrd:x?dBlk:x) for x = 0 to nele x where nele x is the number of Cijs to compute. There- fore, to compute Cij to Ci(j+dBlk:x?1), threadij to threadi(j+dBlk:x?1) read rowi and the column reads vary in a coalesced pattern from colj to colj+dBlk:x?1. However, for computing addi- tional Cijs, the accesses to each partition for reading B vary depending on which Cij is being computed. Accesses to each partition for reading A always vary as the program exe- cution continues as it is not possible for an entire row of A to be stored in one partition if 3threadi(j+1) computes Ci(j+2) instead of Ci(j+1) because of the grouped computation pattern. 
In this case, threadij is computing Ci(j+1) as well as C(i+1)j and C(i+1)(j+1). 61 n > 8 without transposing A. However, an entire column of B is stored in one partition if n%512 = 0. For this computation pattern, as the program execution continues, each HW is accessing varying partitions to read B dependent on the current Cij being computed. Hybrid For the hybrid computation pattern, depicted in Figure 4.8d, it is necessary to determine the order each Cij is computed. Two computation patterns are de?ned and illustrated in Figure 4.9: hybrid-column and hybrid-row. The hybrid-column computation pattern allows each thread to compute multiple Cijs in each row and column but computes each ?rst in the y-dimension. The hybrid-row is similar except each thread computes each Cij ?rst in the x- dimension. Therefore, the column-major computation pattern is a subset of hybrid-column, and the row-major is a subset of hybrid-row. Because of this, hybrid-column and hybrid-row computation patterns are presented in the results, and row- and column-major are not. (a) Hybrid-column. (b) Hybrid-row. Figure 4.9: Computation patterns: hybrid computation patterns for MM for computing multiple Cijs. Similar to row- and column-major computation patterns, both hybrid patterns, de?ned in Figure 4.9, bene?t from memory coalescing if dBlk:x ? 8. If (dGrd:x?dBlk:x)%512 = 0, then all Cijs computed by threadij will be in the same partition. Therefore, with regards to the distribution of the HWs to all partitions, and from the GPU?s perspective, the hybrid- row and hybrid-column patterns are identical. In addition, in this case, both patterns are identical to the column-major pattern. However, if (dGrd:x ? dBlk:x)%512 ? 0, there is a di?erence between the hybrid-row and hybrid-column pattern. In both, threadij computes C(i+y?dGrd:y?dBlk:y)(j+x?dGrd:x?dBlk:x) 62 for y = 0 to nele y and x = 0 to nele x where nele x and nele y are the number of Cijs to compute in the x- and y-dimensions, respectively. If all partitions are not initially accessed by the HWs, the hybrid-row pattern accesses the unused partitions quicker than the hybrid- column pattern thus reducing partition camping. Optimized (column-major) Memory coalescing and the order of computation are considered to derive the optimized computation pattern for MM. The row- and column-major patterns, in addition to the hybrid patterns, do not reduce memory coalescing as the grouped pattern does. From the kernel in Listing 4.8, the shared memory implementation of MM on a GPU, global memory is ac- cessed in Lines 9 and 10. Computation is then performed in Line 13 before reloading shared memory. In this pattern, little overlap exists between accessing global memory and compu- tation. Therefore, the optimized computation pattern consists of an algorithm maximizing the overlap of global memory accesses and computation such as the one described in [63] and utilized in [19], [20], and [21]. In this implementation, only one matrix is placed in shared memory. In addition, the aforementioned implementations assume column-major storage for matrices. In this work, row-major storage is assumed due to the row-major storage of the GPU. Although the algorithm for MM is similar to the aforementioned implementations, other steps of the parallelization procedure augment the algorithm to further minimize GPU computation time. Utilizing this algorithm and the optimized computation pattern depicted in Figure 4.8b yields Listing 4.10. 
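As a reading aid for that listing, the following host-side sketch (illustrative only, not one of the numbered listings in this chapter) enumerates the elements of C produced by a single thread; the index expressions are copied from the kernel's output indexing, and all input values are examples.

    #include <stdio.h>

    /* Which elements of C one thread produces under the column-major pattern
     * used by Listing 4.10: one column of C per thread, n_ele rows per thread. */
    static void thread_outputs(int n, int n_ele,
                               int blockDimX, int blockDimY,
                               int blockIdxX, int blockIdxY,
                               int threadIdxX, int threadIdxY)
    {
        /* Column of C owned by this thread (same expression as the kernel). */
        int col = blockIdxX * blockDimX * blockDimY
                + threadIdxY * blockDimX + threadIdxX;
        /* n_ele results per thread, one per row of the slab owned by blockIdx.y. */
        for (int i = 0; i < n_ele; i++) {
            int row = blockIdxY * n_ele + i;
            printf("C[%d][%d]  (linear index %d)\n", row, col, row * n + col);
        }
    }

    int main(void)
    {
        /* Example: n = 512, n_ele = 16, 8x16 blocks, block (1,2), thread (3,0). */
        thread_outputs(512, 16, 8, 16, 1, 2, 3, 0);
        return 0;
    }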
In this implementation of MM, all memory accesses are coalesced since the column-major computation pattern is utilized. Shared memory is allo- cated for parts of A in Line 2. Therefore, the amount of shared memory allocated is modi?ed from Equation (4.3) to yield SMemPerBlk = (n ele?dBlk:x)?4 (4.8) 63 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 for(int i = 0; i < n_ele; i++) { 5 temp[i] = 0.0; 6 } 7 for(int j = 0; j < n / blockDim.x; j++) { 8 for(int i = 0; i < n_ele / blockDim.y; i++) { 9 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n rcurvearrowse + j * blockDim.x)]; 10 } 11 __syncthreads(); 12 for(int i = 0; i < blockDim.x; i++) { 13 for(int k = 0; k < n_ele; k++) { 14 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 15 } 16 } 17 __syncthreads(); 18 } 19 for (int i = 0; i < n_ele; i++) { 20 C[i * n + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + threadIdx.x + blockIdx.y * n_ele * n] = temp[i]; 21 } 22 } Listing 4.10: The shared memory implementation of MM on a GPU utilizing the optimized computation pattern. 64 where n ele is the number of Cijs computed per thread. In Lines 3-6, each thread allocates and initializes temporary variables for each Cij being computed. A portion of A is loaded into shared memory in Lines 8-10. This implementation assumes dBlk:y ? n ele. Slight modi?cations are made to loading A into shared memory in Lines 8-10 if dBlk:y > n ele. Threads within a block synchronize in Line 11 to ensure A is loaded. In Lines 13-15, each thread accesses B once and performs computation with that value and shared memory n ele times. B is not read in every iteration of k in Line 14 as the access to B is independent of the value of k. Therefore, the overlap between accessing global memory and computation is greatly increased. In addition, this implementation requires that only portions of one matrix be loaded into shared memory thus reducing the amount of shared memory necessary. The number of global memory accesses to A and B is dependent on n, dBlk:x, dBlk:y, and n ele. After simplifying, Gld32B = n 3 dBlk:x parenleft.alt3 1 dBlk:x?dBlk:y + 1 n eleparenright.alt3 if dBlk:x ? 8 (4.9) and Gld64B = n 3 16 parenleft.alt3 1 dBlk:x?dBlk:y + 1 n eleparenright.alt3 if dBlk:x ? 16: (4.10) Therefore, the number of global memory accesses is reduced by a factor of n ele(dBlk:x+dBlk:y)n ele+dBlk:x?dBlk:y compared to the shared memory implementation of MM without the optimized computation pattern (Listing 4.8). Figure 4.10 depicts maximum e?ective bandwidth of MM before (Listing 4.8) and af- ter (Listing 4.10) utilizing the optimized computation pattern. The e?ective bandwidth is calculated in Section 3.5 as 8n3bytesTGPU comp: . 65 512 1024 2048 4096 8192 16384500 600 700 800 900 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.10: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. Similar to Mv, the optimized computation pattern yields a signi?cant speedup. In the ?gure, the optimized computation pattern yields a speedup, on average, of 1.32 compared to the non-optimized pattern. The speedup is due to the reuse of data by the optimized computation pattern. 
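The reduction in global memory traffic given by Equations (4.9) and (4.10), and the reduction factor quoted above, can be checked numerically. The host-side sketch below is illustrative only; the block dimensions and n_ele are example values (assumptions for illustration), not measured configurations.

    #include <stdio.h>

    int main(void)
    {
        /* Example values (assumptions for illustration only). */
        double n = 4096.0, dBlkx = 16.0, dBlky = 4.0, n_ele = 16.0;

        /* Eq. (4.10): 64B global loads for the optimized kernel when dBlk.x >= 16. */
        double gld64 = (n * n * n / 16.0) * (1.0 / (dBlkx * dBlky) + 1.0 / n_ele);

        /* Reduction factor vs. Listing 4.8:
         * n_ele*(dBlk.x + dBlk.y) / (n_ele + dBlk.x*dBlk.y). */
        double reduction = n_ele * (dBlkx + dBlky) / (n_ele + dBlkx * dBlky);

        printf("estimated 64B loads: %.3e  reduction factor: %.2f\n",
               gld64, reduction);
        return 0;
    }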
As mentioned, the optimized computation pattern reduces the number of global memory transactions and therefore, the e?ective bandwidth is increased. The minimum (n = 4096) and maximum (n = 512) speedups are 1.28 and 1.41, respectively. Therefore, Step 2 of the parallelization procedure is valid and necessary for reducing GPU computation time of MM. 4.3 Access Patterns The third step of the parallelization procedure speci?es that the optimized access pattern to data in memory be derived. The optimized access patterns are derived to minimize partition camping (Section 3.1) and eliminate bank con?icts (Section 3.2). Therefore, the GPU computation time is reduced. For matrix-based computations such as Mv and MM, all HWs begin by initially reading the ?rst value of their respective rows of A. Therefore, distributing the accesses to A at the block level to all partitions reduces partition camping. Figure 4.11a, assuming n ? 512, 66 illustrates an access pattern to A of 8 blocks for Mv or MM. At the beginning of execution, blocki reads the ?rst value of rowi from A, which is stored in partition 0. As execution continues, blocki accesses rowi of A where rowi is stored across all partitions. However, since all blocks begin accessing the ?rst value of A, all blocks begin reading from partition 0 and thus, partition camping occurs. Therefore, an optimized access pattern which minimizes partition camping for accessing A is derived by distributing blocks to all partitions at the beginning of execution. This is illustrated in Figure 4.11b. 0 1 2 3 4 5 6 7 partition block 1 block 2 block 3 block 4 block 5 block 6 block 7 block 0 (a) Standard access pattern. 0 1 2 3 4 5 6 7 partition block 1 block 2 block 3 block 4 block 5 block 6 block 7 block 0 (b) Optimized access pattern. Figure 4.11: Access patterns: example of block-level access patterns to A for matrix-based computations. Each column is one partition of global memory. Each row is 512 values of type ?oat. 4.3.1 Mv To implement Figure 4.11b for the shared memory implementation of Mv utilizing the optimized computation pattern, the kernel in Listing 4.9 is modi?ed to yield the kernel in Listing 4.11. This kernel utilizes the optimized access pattern. In Line 6, the partition from which each block begins accessing A, P, is calculated. The for-loop from Line 6 in Listing 4.9 is split into 2 for-loops, Line 7 and 17 of Listing 4.11, such that blocks begin accessing A from varying partitions, P. P is computed using blockId:x mod 8 since there are 8 global memory partitions. This is multiplied by 64 since each partition is 256 bytes wide and values are of type ?oat. The allocation of shared memory in Line 3 of Listing 4.11 is greater than the allocation of shared memory in Listing 4.9. 
Therefore, Equation (4.2), which de?nes the amount of 67 1 __constant__ float b[n]; 2 __global__ void Mv(float* A, float* c, int n) { 3 __shared__ float As[blockDim.x][blockDim.x+1]; 4 float temp = 0.0; 5 int index = blockIdx.x * blockDim.x; 6 int P = (blockIdx.x % 8) * 64; 7 for(int j = P; j < n; j+= blockDim.x) { 8 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 9 As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] = rcurvearrowse A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + rcurvearrowse threadIdx.x + j]; 10 } 11 __syncthreads(); 12 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 13 temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y] rcurvearrowse * b[j + k + threadIdx.y * blockDim.x / blockDim.y]; 14 } 15 __syncthreads(); 16 } 17 for(int j = 0; j < P; j+= blockDim.x) { 18 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 19 As[k + threadIdx.y * blockDim.x / blockDim.y][threadIdx.x] = rcurvearrowse A[(index + k + threadIdx.y * blockDim.x / blockDim.y) * n + rcurvearrowse threadIdx.x + j]; 20 } 21 __syncthreads(); 22 for(int k = 0; k < blockDim.x / blockDim.y; k++) { 23 temp += As[threadIdx.x][k + threadIdx.y * blockDim.x / blockDim.y] rcurvearrowse * b[j + k + threadIdx.y * blockDim.x / blockDim.y]; 24 } 25 __syncthreads(); 26 } 27 if(threadIdx.y != 0) { 28 As[threadIdx.y-1][threadIdx.x] = temp; 29 } 30 __syncthreads(); 31 if(threadIdx.y == 0) { 32 for(k = 0; k < blockDim.y-1; k++) { 33 temp += As[k][threadIdx.x]; 34 } 35 c[index + threadIdx.x] = temp; 36 } 37 } Listing 4.11: The shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern. Constant memory is utilized for b. 68 shared memory used per block for the shared memory implementation of Mv utilizing the optimized computation pattern is modi?ed to yield SMemPerBlk = (dBlk:x2 +dBlk:x)?4: (4.11) Although dBlk:x2 ? 4 bytes of shared memory are necessary for the kernel in Listing 4.11, (dBlk:x2+dBlk:x)?4 bytes are allocated. The additional allocation in the kernel is to reduce bank con?icts. An example of bank con?icts for Mv is depicted in Figures 4.12-4.15. In the ?gures, dBlk:x = 16. Figure 4.12 depicts shared memory being loaded with A for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. This implementation utilizes dBlk:x2 ?4 bytes of shared memory. Since the threads within a HW access di?erent banks of shared memory, no bank con?icts occur for loading shared memory. 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values stored in shared memory for iteration k bank bank bank iteration 2 ...iteration 1iteration 0 Figure 4.12: Example of loading A into shared memory for the shared memory implementa- tion of Mv utilizing the optimized computation pattern but not the optimized access pattern. No bank con?icts occur. dBlk:x = 16. However, for reading the values from shared memory, the HW accesses varying rows and one column of shared memory in each iteration as depicted in Figure 4.13. As illustrated, each value resides in the same shared memory bank and thus bank con?icts occur. 69 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 bank bank bank values read from shared memory for iteration k iteration 2 ...iteration 1iteration 0 Figure 4.13: Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation pattern but not the optimized access pattern. 
Bank con?icts occur. dBlk:x = 16. Figure 4.14 depicts shared memory being loaded with A for the shared memory imple- mentation of Mv that utilizes the optimized computation and access pattern. (dBlk:x2 + dBlk:x)?4 bytes of shared memory are allocated. Since the threads within a HW access all shared memory banks, no bank con?icts occur. 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values stored in shared memory for iteration k bank bank bank iteration 2 ...iteration 1iteration 0 Figure 4.14: Example of loading A into shared memory for the shared memory implementa- tion of Mv utilizing the optimized computation and access pattern. No bank con?icts occur. dBlk:x = 16. For reading the values from shared memory, the HW accesses varying rows and the same column of shared memory in each iteration as depicted in Figure 4.15. As illustrated, each value resides in a di?erent shared memory bank and therefore, no bank con?icts occur. 70 9 10 11 12 13 14 15876543210 9 10 11 12 13 14 15876543210 9 10 11 12 13 1415876543210 values read from shared memory for iteration k bank bank bank iteration 0 iteration 1 iteration 2 ... Figure 4.15: Example of reading shared memory for the shared memory implementation of Mv utilizing the optimized computation and access pattern. No bank con?icts occur. dBlk:x = 16. Therefore, the optimized access pattern for Mv reduces partition camping and eliminates bank con?icts. Figure 4.16 depicts maximum e?ective bandwidth of Mv before (Listing 4.9) and after (Listing 4.11) utilizing the optimized access pattern. E?ective bandwidth is calculated as 8n2bytesTGPU comp: as discussed in Section 3.5. 512 1024 2048 4096 8192 163840 50 100 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.16: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for Mv on the T10 GPU. In the ?gure, the optimized access pattern yields a speedup, on average, of 2.28 compared to the non-optimized pattern. The signi?cant speedup is due to the optimized access pattern reducing partition camping and eliminating bank con?icts. The minimum (n = 512) and 71 maximum (n = 2048) speedups are 2.06 and 2.81, respectively. The signi?cant increase in e?ective bandwidth shows the validity and necessity of Step 3 of the procedure for Mv, deriving the optimized access pattern. 4.3.2 MM To reduce partition camping by implementing Figure 4.11b for the shared memory implementation of MM utilizing the optimized computation pattern, the kernel in Listing 4.10 is modi?ed to yield Listing 4.12. This kernel utilizes the optimized access pattern. In Listing 4.12, dBlk:y ? n ele. Slight modi?cations are performed to Lines 9-11 and 21-23 for loading A if dBlk:y > n ele. In this kernel, additional computation is required compared to Listing 4.10 to determine which value of A each block reads at the beginning of execution. Line 7 of Listing 4.12 assigns P, a variable used to specify which partition is accessed for reading A at the beginning of execution. To calculate P, a linear block index is calculated as blkIdx:x+dGrd:x?blkIdx:y. The linear block index mod 8, the number of partitions, is calculated such that blocks are assigned to each partition at the beginning of execution in round-robin order. This is multiplied by 64blockDim:x since there are 64 values of type ?oat per row of each partition. This kernel utilizes the same amount of shared memory as the kernel in Listing 4.8. 
Therefore, Equation (4.8) de?nes the amount of shared memory utilized per block. Splitting the for-loop in Line 7 of Listing 4.10 as shown in Lines 8-20 of Listing 4.12 distributes the blocks to all partitions for reading A at the beginning of execution. Therefore, this optimized access pattern reduces partition camping and thus reduces the computation time. Shared memory is allocated in Line 2 of Listing 4.12 the same as in Line 2 of Listing 4.10. Therefore, the amount of shared memory allocated per block is de?ned by Equation (4.8). In Lines 10 and 22, values from A are loaded into shared memory. If dBlk:x ? 16, threads within a HW have the same value for threadIdx:y. Therefore, no bank con?icts occur for loading shared memory as all banks are utilized. When reading from As in Lines 15 and 72 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 for(int i = 0; i < n_ele; i++) { 5 temp[i] = 0.0; 6 } 7 int P = ((blockIdx.x + gridDim.x * blockIdx.y) % 8) * (64 / blockDim.x); 8 for(int j = P; j < n / blockDim.x; j++) { 9 for(int i = 0; i < n_ele / blockDim.y; i++) { 10 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n + j * rcurvearrowse blockDim.x)]; 11 } 12 __syncthreads(); 13 for(int i = 0; i < blockDim.x; i++) { 14 for(int k = 0; k < n_ele; k++) { 15 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 16 } 17 } 18 __syncthreads(); 19 } 20 for(int j = 0; j < P; j++) { 21 for(int i = 0; i < n_ele / blockDim.y; i++) { 22 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[threadIdx.x + rcurvearrowse (blockIdx.y * n_ele + threadIdx.y) * n + (i * blockDim.y * n + j * rcurvearrowse blockDim.x)]; 23 } 24 __syncthreads(); 25 for(int i = 0; i < blockDim.x; i++) { 26 for(int k = 0; k < n_ele; k++) { 27 temp[k] += As[k][i] * B[(blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x) + (i * n + j * n * rcurvearrowse blockDim.x)]; 28 } 29 } 30 __syncthreads(); 31 } 32 for (int i = 0; i < n_ele; i++) { 33 C[i * n + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + threadIdx.x + blockIdx.y * n_ele * n] = temp[i]; 34 } 35 } Listing 4.12: The shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern. 73 27, each HW reads the same value, row k and column i. Since the T10 GPU supports shared memory broadcasting, threads within a HW accessing the same address in shared memory cause no bank con?icts. Therefore, no further optimizations to the access pattern are necessary. Figure 4.17 depicts maximum e?ective bandwidth of MM before (Listing 4.10) and after (Listing 4.12) utilizing the optimized access pattern. For MM, e?ective bandwidth is calculated in Section 3.5 as 8n3bytesTGPU comp: . 512 1024 2048 4096 8192 16384740 760 780 800 820 840 860 880 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 4.17: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. The optimized access pattern yields a speedup, on average, of 1.01 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 512) speedups are 1.00 and 1.04, respectively. 
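The partition arithmetic behind this access pattern can be illustrated with a small host-side sketch (example sizes; partition_of is a hypothetical helper, and the 8-partition, 256-byte-wide layout is the T10 organization described in Section 3.1). With the offset P of Listing 4.11, eight consecutive blocks begin reading A from eight different partitions; with P = 0, all of them would begin in partition 0, the situation of Figure 4.11a. Listing 4.12 applies the same idea to MM through a linear block index.

    #include <stdio.h>

    /* Partition holding element index i of A: 8 partitions, each 256 bytes
     * (64 floats) wide, assuming n is a multiple of 512. */
    static int partition_of(long long floatIndex)
    {
        return (int)((floatIndex / 64) % 8);
    }

    int main(void)
    {
        int n = 4096, dBlkx = 32;              /* example sizes; n % 512 == 0 */
        for (int b = 0; b < 8; b++) {
            int P = (b % 8) * 64;              /* offset used in Listing 4.11 */
            long long first = (long long)b * dBlkx * n + P;  /* first element of A read by block b */
            printf("block %d starts in partition %d\n", b, partition_of(first));
        }
        return 0;
    }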
Comparing results from Figure 4.16 with Figure 4.17, the optimized access pattern yields a signi?cant speedup for Mv but a modest speedup for MM. Since bank con?icts do not occur for MM, these results suggest that bank con?icts a?ect computation time more signi?cantly than partition camping. Regardless, the optimized access pattern increases maximum e?ective bandwidth, on average, for both. Therefore, Step 3 of the parallelization procedure, deriving the optimized access pattern, is valid and necessary. 74 4.4 Fine-tuning The fourth step of the parallelization procedure speci?es that ?ne-tuning is performed on each kernel to reduce computation and resource allocation. Fine-tuning is de?ned as optimizations to the kernel to reduce time, such as reducing index calculations and loop un- rolling. Reducing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions thus reducing the computation time. Loop unrolling also reduces computation time by increasing the instruction mix of mem- ory transactions and computation instructions. In addition, loop unrolling can reduce the number of registers used per thread since the compiler can reuse registers. When the com- piler exceeds the maximum register allocation per thread, local memory is allocated. Local memory is a section of global memory reserved for each thread as storage for temporary variables. The usage of local memory instead of registers greatly increases the computation time as global memory accesses are orders of magnitude slower than register accesses. Since the implementation of MM utilizes many registers, the e?ect is more apparent than for Mv. Since loop unrolling can reduce the number of registers used per thread due to register reuse, it can eliminate allocation of local memory thus greatly reducing the computation time. For matrix-based computations, index calculations are necessary for each thread to determine which value of each matrix is accessed. Index calculations typically consist of several multiply and addition instructions inside a loop. The kernel de?nitions of matrix- based computations specify that pointers to each matrix are passed. Since each thread contains a pointer to each matrix, some of the multiplication instructions can be reduced to addition. Listing 4.13 depicts sample code for a typical matrix-based computation. In 1 __global__ void Example(float* A, float* B, int n) { 2 for(int i = 0; i < n; i++) { 3 for(int j = 0; j < n; j++) { 4 B[threadIdx.y * n + j] = A[i * n + threadIdx.x]; 5 } 6 } 7 } Listing 4.13: An example of a matrix-based computation without ?ne-tuning. 75 Line 4, to calculate a part of the index of B, each thread calculates threadIdx:y ?n which is performed n2 times. Since the value is constant for each thread, this calculation can be performed once before the for-loops. Likewise, each thread adds threadIdx:x which is calculated n2 times for a part of the index of A. This calculation can also be performed once outside of the for-loops. Similarly, each thread calculates i?n which is performed n2 times for a part of the index of A. Therefore, this calculation can be replaced by adding n to the pointer to A inside the outer for-loop. Now, each thread performs only n additions of n. The kernel in Listing 4.13 is modi?ed accordingly to yield the ?ne-tuned kernel, with respect to index calculations, in Listing 4.14. 
1 __global__ void Example(float* A, float* B, int n) { 2 A += threadIdx.x; 3 B += threadIdx.y * n; 4 for(int i = 0; i < n; i++) { 5 for(int j = 0; j < n; j++) { 6 B[j] = A[0]; 7 } 8 A += n; 9 } 10 } Listing 4.14: An example of a matrix-based computation with index calculations minimized. As mentioned, loop unrolling can reduce computation time by increasing the instruction mix and reducing the number of registers used per thread. Reducing the number of registers used per thread can signi?cantly reduce computation time since the number of active threads can be dependent on register usage. Therefore, NVIDIA provides a #pragma unroll directive to instruct the compiler to unroll a loop. From the CUDA programming guide [64], ?By default, the compiler unrolls small loops with a known trip count.? However, it is unde?ned as to what constitutes a small loop. Therefore, #pragma unroll is inserted before each loop in the kernel. The compiler unrolls inner loops ?rst and stops unrolling loops once the maximum instruction limit is reached. Listing 4.14 is modi?ed to instruct the compiler to unroll the for-loops as depicted in Listing 4.15. 76 1 __global__ void Example(float* A, float* B, int n) { 2 A += threadIdx.x; 3 B += threadIdx.y * n; 4 #pragma unroll 5 for(int i = 0; i < n; i++) { 6 #pragma unroll 7 for(int j = 0; j < n; j++) { 8 B[j] = A[0]; 9 } 10 A += n; 11 } 12 } Listing 4.15: An example of a matrix-based computation after ?ne-tuning. Index calculations are minimized and all loops are unrolled 4.4.1 Mv Fine-tuning is performed to the shared memory implementation of Mv on a GPU uti- lizing the optimized computation and access pattern, Listing 4.11, to yield Listing 4.16. 1 __constant__ float b[n]; 2 __global__ void Mv(float* A, float* c, int n) { 3 __shared__ float As[blockDim.x][blockDim.x+1]; 4 float temp = 0.0; 5 int index = blockIdx.x * blockDim.x; 6 int P = (blockIdx.x % 8) * 64; 7 int calc1 = blockDim.x / blockDim.y; 8 int calc2 = calc1 * threadIdx.y; 9 A += threadIdx.x + index * n + calc2 * n + P; 10 c += index + threadIdx.x; 11 #pragma unroll 12 for(int j = P; j < n; j+= blockDim.x) { 13 #pragma unroll 14 for(int k = 0; k < calc1; k++) { 15 As[k + calc2][threadIdx.x] = A[0]; 16 A += n; 17 } 18 A += blockDim.x; 19 A -= n * calc1; 20 __syncthreads(); 21 #pragma unroll 22 for(int k = 0; k < calc1; k++) { 23 temp += As[threadIdx.x][k + calc2] * b[j + k + calc2]; 24 } 25 __syncthreads(); 26 } 27 A -= n; 28 #pragma unroll 77 29 for(int j = 0; j < P; j+= blockDim.x) { 30 #pragma unroll 31 for(int k = 0; k < calc1; k++) { 32 As[k + calc2][threadIdx.x] = A[0]; 33 A += n; 34 } 35 A += blockDim.x; 36 A -= n * calc1; 37 __syncthreads(); 38 #pragma unroll 39 for(int k = 0; k < calc1; k++) { 40 temp += As[threadIdx.x][k + calc2] * b[j + k + calc2]; 41 } 42 __syncthreads(); 43 } 44 if(threadIdx.y != 0) { 45 As[threadIdx.y-1][threadIdx.x] = temp; 46 } 47 __syncthreads(); 48 if(threadIdx.y == 0) { 49 #pragma unroll 50 for(k = 0; k < blockDim.y-1; k++) { 51 temp += As[k][threadIdx.x]; 52 } 53 c[0] = temp; 54 } 55 } Listing 4.16: The shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. Constant memory is utilized for b. No modi?cations are performed to the allocation of shared memory thus Equation (4.11) de?nes shared memory usage. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (4.6) and (4.7). 
calc1 and calc2 in Lines 7 and 8 are commonly used calculations in the for-loops. By de?ning temporary variables for the commonly used calculations, the amount of computation in the for-loops is reduced. Index calculations are reduced by initializing pointers to A and c in Lines 9 and 10. Likewise, pointer arithmetic is performed in Lines 16, 18, 19, 27, 33, 35, and 36 to further reduce index calculations. In Listing 4.16, for-loops are preceded with the #pragma unroll directive in Lines 11, 13, 21, 28, 30, 38, and 49. This instructs the compiler to unroll as many loops as possible to increase the instruction mix and reduce the number of registers used per thread. 78 Figure 4.18 depicts maximum e?ective bandwidth of Mv before (Listing 4.11) and after (Listing 4.16) ?ne-tuning. As mentioned, e?ective bandwidth is calculated as 8n2bytesTGPU comp: . 512 1024 2048 4096 8192 1638420 40 60 80 100 120 140 160 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 4.18: Comparison of ?ne-tuning: Maximum bandwidth (GB/s) for Mv on the T10 GPU. In the ?gure, the kernel, after ?ne-tuning, yields a speedup, on average, of 1.05 compared to before ?ne-tuning. The minimum (n = 16384) and maximum (n = 1024) speedups are 1.00 and 1.12, respectively. As mentioned, the usage of local memory instead of registers greatly increases the computation time as global memory accesses are orders of magnitude slower than register accesses. However, for Mv, a small amount of registers are allocated and therefore, local memory is not utilized. Therefore, ?ne-tuning Mv yields a reduction in only the amount of index calculations and thus a modest speedup. Regardless, the maximum e?ective bandwidth is increased and therefore, Step 4, ?ne-tuning the kernel, is valid for Mv. 4.4.2 MM Fine-tuning is performed to the shared memory implementation of MM on a GPU uti- lizing the optimized computation and access pattern, Listing 4.12, to yield Listing 4.17. In this kernel, dBlk:y ? n ele. Slight modi?cations are performed to Lines 15-18 and 37-40 for loading A if dBlk:y > n ele. 
79 1 __global__ void MM( float *A, float *B, float *C, int n) { 2 __shared__ float As[n_ele][blockDim.x]; 3 float temp[N_ele]; 4 #pragma unroll 5 for(int i = 0; i < n_ele; i++) { 6 temp[i] = 0.0; 7 } 8 int P = ((blockIdx.x + gridDim.x * blockIdx.y) % 8) * (64 / blockDim.x); 9 A += P * blockDim.x + threadIdx.x + (blockIdx.y * N_ele + threadIdx.y) rcurvearrowse * n; 10 B += P * n * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + rcurvearrowse threadIdx.y * blockDim.x + threadIdx.x; 11 C += threadIdx.x + blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * rcurvearrowse blockDim.x + blockIdx.y * N_ele * n; 12 #pragma unroll 13 for(int j = P; j < n / blockDim.x; j++) { 14 #pragma unroll 15 for(int i = 0; i < n_ele / blockDim.y; i++) { 16 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[0]; 17 A += blockDim.y * n; 18 } 19 A -= blockDim.y * n * N_ele / blockDim.y; 20 __syncthreads(); 21 #pragma unroll 22 for(int i = 0; i < blockDim.x; i++) { 23 #pragma unroll 24 for(int k = 0; k < n_ele; k++) { 25 temp[k] += As[k][i] * B[0]; 26 } 27 B += n; 28 } 29 A += blockDim.x; 30 __syncthreads(); 31 } 32 A -= n; 33 B -= n * n; 34 #pragma unroll 35 for(int j = 0; j < P; j++) { 36 #pragma unroll 37 for(int i = 0; i < n_ele / blockDim.y; i++) { 38 As[threadIdx.y + i * blockDim.y][threadIdx.x] = A[0]; 39 A += blockDim.y * n; 40 } 41 A -= blockDim.y * n * N_ele / blockDim.y; 42 __syncthreads(); 43 #pragma unroll 44 for(int i = 0; i < blockDim.x; i++) { 45 #pragma unroll 80 46 for(int k = 0; k < n_ele; k++) { 47 temp[k] += As[k][i] * B[0]; 48 } 49 B += n; 50 } 51 A += blockDim.x; 52 __syncthreads(); 53 } 54 #pragma unroll 55 for (int i = 0; i < n_ele; i++) { 56 C[0] = temp[i]; 57 C += n; 58 } 59 } Listing 4.17: The shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. No modi?cations are performed to the allocation of shared memory, thus Equation (4.8) de?nes shared memory usage. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (4.9) and (4.10). Index calculations are reduced by initializing pointers to A, B, and C in Lines 9-11. Likewise, pointer arithmetic is performed in Lines 17, 19, 27, 29, 32, 33, 39, 41, 49, 51, and 57 to further reduce index calculations. The for-loops in Listing 4.17 are preceded with the #pragma unroll directive in Lines 4, 12, 14, 21, 23, 34, 36, 43, 45, and 54 to increase the instruction mix and reduce the number of registers used per thread. However, the for-loops in Lines 13 and 35 are dependent on P which is calculated in Line 8 and dependent on block indices. Since P is calculated at run-time, its value is unknown during compilation. Therefore, the for-loops in Lines 13 and 35 are not unrolled by the compiler since the trip count is unknown. Figure 4.19 depicts maximum e?ective bandwidth of MM before (Listing 4.12) and after (Listing 4.17) ?ne-tuning. As discussed, maximum e?ective bandwidth is 8n3bytesTGPU comp: for MM. 81 512 1024 2048 4096 8192 16384700 800 900 1000 1100 1200 1300 1400 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 4.19: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for MM on the T10 GPU. For MM, register utilization is high since each thread is computing multiple results and utilizing a register for each partial result. As shown in Figure 4.19 the kernel after ?ne-tuning yields a speedup, on average, of 1.51 compared to before ?ne-tuning. 
The minimum (n = 512) and maximum (n = 16384) speedups are 1.43 and 1.53, respectively. Since fine-tuning the kernel reduces the amount of register utilization, a significant improvement in maximum effective bandwidth is realized for MM. Therefore, Step 4 of the parallelization procedure, fine-tuning the kernel, is necessary and valid for MM.

4.5 Input Parameters

The penultimate step of the parallelization procedure, Step 5, specifies that optimal input parameters are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Analysis of measured data for Mv and MM shows that input parameters significantly affect the GPU computation time. Therefore, optimal input parameters are derived as the input parameters which yield the minimum computation time. The derivation is performed utilizing the execution metrics formulated in Section 3.3. Since input parameters are dependent on the kernel, the derivation is performed after the kernel is written, which considers the placement of data, optimized computation and access patterns, and fine-tuning adjustments. A list of five steps is presented, in order, to derive the optimal input parameters:

1. Saturate the memory bus.
2. Maximize shared memory utilization.
3. Minimize the amount of shared memory used per block.
4. Maximize the number of global memory partitions accessed.
5. Minimize the number of fragmented blocks.

The order of deriving optimal input parameters is developed through analysis of measured data for Mv and MM. Step 1 is listed to maximize the overlap of computation and memory accesses by ensuring the memory bus is fully utilized. Step 2 ensures that data which is loaded into shared memory is reused as much as possible. Step 3 is listed such that a minimum amount of shared memory is assigned per block. This increases the number of blocks assigned to each SM. In addition, this reduces the amount of time each HW in a block must wait for synchronization since there are fewer HWs per block. Step 4 maximizes the number of global memory partitions accessed such that partition camping is reduced. Lastly, Step 5 ensures the contribution to computation time from fragmented blocks is minimized.

Due to limitations of the T10 GPU, a maximum of 512 threads can be assigned per block. Therefore,

dBlk.x × dBlk.y ≤ 512.  (4.12)

Since threads are assigned to HWs in row-major order and the minimum global memory transaction size is 32B,

dBlk.x ≥ 8.  (4.13)

Since values of type float are used, this equation ensures that one 32B transaction can service half of the threads in a HW.

Step 1 specifies that the memory bus is saturated, which requires analysis of Thds^active_GPU, which is dependent on Blks^active_GPU. In the GPU documentation provided by NVIDIA [64], it is stated that each global memory access takes approximately 400-800 clock cycles. Therefore, a minimum of 400/8 warps per partition, or 12800 Thds^active_GPU, is needed to fully saturate the memory bus. Since the maximum value of Thds^active_GPU is determined by the product of the number of SMs and the maximum number of threads per SM,

12800 ≤ Thds^active_GPU ≤ 30720.  (4.14)

Therefore, optimal input parameters are derived such that Thds^active_GPU is greater than the minimum necessary to fully saturate the memory bus. Optimal input parameters for each matrix-based computation are derived in the following sections. Sections 4.5.1 and 4.5.2 include the derivation of optimal input parameters for Mv and MM, respectively.
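Step 1 can be expressed as a simple feasibility check. The sketch below is illustrative (the helper names are hypothetical); it uses the T10 limits cited above (30 SMs, 16384 bytes of shared memory, at most 8 blocks and 1024 threads per SM) and, for brevity, ignores the register limit of Equation (3.7).

    #include <stdio.h>

    static int imin3(int a, int b, int c)
    {
        return a < b ? (a < c ? a : c) : (b < c ? b : c);
    }

    /* Returns 1 if the launch configuration keeps at least 12800 threads
     * active on the GPU (the lower bound of Eq. (4.14)). */
    static int bus_saturated(int dBlkx, int dBlky, int dGrdx, int dGrdy,
                             int smemPerBlk)
    {
        int blksPerSM  = imin3(16384 / smemPerBlk, 8, 1024 / (dBlkx * dBlky));
        int blksGPU    = 30 * blksPerSM;          /* 30 SMs on the T10 */
        int totalBlks  = dGrdx * dGrdy;
        int activeBlks = totalBlks < blksGPU ? totalBlks : blksGPU;
        int activeThds = activeBlks * dBlkx * dBlky;
        return activeThds >= 12800;
    }

    int main(void)
    {
        /* Example values: n = 4096, 8x16 blocks, n_ele = 16,
         * SMemPerBlk = n_ele * dBlk.x * 4 bytes as in Eq. (4.8). */
        int n = 4096, dBlkx = 8, dBlky = 16, n_ele = 16;
        int smem = n_ele * dBlkx * 4;
        printf("saturated: %d\n",
               bus_saturated(dBlkx, dBlky, n / (dBlkx * dBlky), n / n_ele, smem));
        return 0;
    }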
4.5.1 Mv In this section, optimal input parameters are derived for the kernel in Listing 4.16, which is the shared memory implementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. The amount of shared memory allocated per block is de?ned by Equation (4.11). Since each SM has a maximum of 16KB of shared memory, the maximum value of dBlk:x is 32. Combining with the minimum value of dBlk:x from Equation (4.13) yields 8 ? dBlk:x ? 32: (4.15) 84 From the kernel in Listing 4.16 and the maximum value of dBlk:y determined from solving for dBlk:y in Equation (4.12), dBlk:y ? minparenleft.alt3 512dBlk:x;dBlk:xparenright.alt3 (4.16a) dGrd:x = ndBlk:x (4.16b) dGrd:y = 1: The number of registers used per thread for the kernel varies from 10 to 15 dependent on dBlk:x and dBlk:y. Assuming the maximum is used, BlksactiveSM from Equation (3.7) is not limited by register usage. Therefore, Equation (3.8) simpli?es to BlksactiveSM = minparenleft.alt3? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3: (4.17) Since dGrd:y = 1, Equation (3.8) simpli?es to BlksactiveGPU = minparenleft.alt130?BlksactiveSM ;dGrd:xparenright.alt1: Since ThdsactiveGPU = BlksactiveGPU ?dBlk:x?dBlk:y; (4.18) then, if dGrd:x ? BlksactiveGPU , ThdsactiveGPU = dGrd:x?dBlk:x?dBlk:y: Substituting dGrd:x = ndBlk:x from Equation (4.16b) and the lower limit of n de?ned by Equation (4.1) yields ThdsactiveGPU ? 512?dBlk:y: 85 Since dBlk:x ? 32, the maximum value of dBlk:y from Equation (4.16a) yields ThdsactiveGPU ? 8192: Substituting into Equation (4.18) and substituting for BlksactiveGPU yields 8192 ? minparenleft.alt130?BlksactiveSM ;dGrd:xparenright.alt1?dBlk:x?dBlk:y: (4.19) If dGrd:x ? 30 ? BlksactiveSM , rearranging Equation (4.19) and solving for dBlk:y yields the lower limit as 8192 n ? dBlk:y: (4.20) The upper limit of dBlk:y is not modi?ed and therefore is de?ned by Equation (4.16a). However, if dGrd:x > 30?BlksactiveSM , then substituting for BlksactiveSM from Equation (4.17) into Equation (4.19) yields 8192 ? 30?minparenleft.alt3? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3?dBlk:x?dBlk:y: (4.21) FromEquation(4.11), SMemPerBlk = (dBlk:x2+dBlk:x)?4. AftersubstitutingSMemPerBlk into Equation (4.21), solving for dBlk:y, given the possible values of dBlk:x from Equation (4.15), yields the lower limit as maxparenleft.alt34; 64dBlk:xparenright.alt3 ? dBlk:y: (4.22) Combining lower limits of dBlk:y from Equations (4.20) and (4.22) with the upper limit from Equation (4.16a) yields maxparenleft.alt38192n ;4; 64dBlk:xparenright.alt3 ? dBlk:y ? minparenleft.alt3 512dBlk:x;dBlk:xparenright.alt3: (4.23) 86 Step 2 of deriving optimal input parameters speci?es shared memory utilization is max- imized to ensure data is reused as much as possible. From Lines 22 and 39 of the kernel in Listing 4.16, the trip count for the inner loops where shared memory is accessed is dependent on dBlk:xdBlk:y. Therefore, the maximum of dBlk:x from Equation (4.15) and the minimum of dBlk:y from Equation (4.23) yield dBlk:x = 32 dBlk:y = maxparenleft.alt38192n ;4parenright.alt3 dGrd:x = n32 dGrd:y = 1: (4.24) Given n, all input parameters are constant and therefore, no further steps are necessary to derive optimal input parameters for Mv. However, as realized through measured data for MM, additional steps are necessary to derive optimal input parameters for MM. 
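For reference, Equation (4.24) can be evaluated directly. The following host-side sketch (hypothetical helper and type names, example range of n) prints the derived launch configuration for Mv; in CUDA, these values are passed to the kernel of Listing 4.16 as dim3 block(dBlk.x, dBlk.y) and dim3 grid(dGrd.x, dGrd.y) at launch.

    #include <stdio.h>

    typedef struct { int dBlkx, dBlky, dGrdx, dGrdy; } LaunchParams;

    /* Optimal Mv input parameters per Eq. (4.24). */
    static LaunchParams mv_optimal_params(int n)
    {
        LaunchParams p;
        p.dBlkx = 32;                                 /* maximum allowed by shared memory */
        p.dBlky = (8192 / n > 4) ? 8192 / n : 4;      /* max(8192/n, 4) */
        p.dGrdx = n / 32;
        p.dGrdy = 1;
        return p;
    }

    int main(void)
    {
        for (int n = 512; n <= 16384; n *= 2) {
            LaunchParams p = mv_optimal_params(n);
            printf("n=%5d  block=(%d,%d)  grid=(%d,%d)\n",
                   n, p.dBlkx, p.dBlky, p.dGrdx, p.dGrdy);
        }
        return 0;
    }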
Regardless, Equation (4.24) is the derivation of optimal input parameters for the shared memory im- plementation of Mv on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. Figure 4.20 depicts measured computation time of all input parameters for GPU com- putation of Mv. The ?gure illustrates the necessity of optimal input parameters as the GPU computation time of Mv varies signi?cantly depending on the input parameters. In the ?g- ures in this section, the x-axis represents the number of threads per block (dBlk:x?dBlk:y). Therefore, there are several combinations of dBlk:x and dBlk:y which yield an equivalent number of threads per block. 87 8 16 32 64 128 256 512 10?1 dBlk.x?dBlk.y tim e( ms ) n=512 Optimal 8 16 32 64 128 256 512 10?1 dBlk.x?dBlk.y tim e( ms ) n=1024 Optimal 8 16 32 64 128 256 512 10?0.7 10?0.5 10?0.3 10?0.1 dBlk.x?dBlk.y tim e( ms ) n=2048 Optimal 8 16 32 64 128 256 512 100 dBlk.x?dBlk.y tim e( ms ) n=4096 Optimal 8 16 32 64 128 256 512 100.5 100.7 100.9 dBlk.x?dBlk.y tim e( ms ) n=8192 Optimal 8 16 32 64 128 256 512 101.1 101.3 101.5 101.7 dBlk.x?dBlk.y tim e( ms ) n=16384 Optimal Figure 4.20: Comparison of input parameters: Computation time (ms) of all input parame- ters for Mv on the T10 GPU. 88 As illustrated, derivation of optimal input parameters yields computation time, on av- erage, within 1.8% of the minimum measured time. In the worst-case (n = 8192), utilizing optimal input parameters yields time within 4.1% of the minimum. Best-case utilization of optimal input parameters (n = {1024;2048;16384}) yields the minimum. Therefore, Step 5 of the parallelization procedure, deriving optimal input parameters, yields minimum or near-minimum GPU computation times for Mv. 4.5.2 MM In this section, optimal input parameters are derived for the kernel in Listing 4.17, which is the shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning. From the kernel and Equations (4.12) and (4.13), 8 ?dBlk:x ? 512 (4.25a) 1 ?dBlk:y ? 512dBlk:x (4.25b) dGrd:x = ndBlk:x?dBlk:y (4.25c) dGrd:y = nn ele (4.25d) where 1 ? n ele ? 16 due to resource constraints. From Equation (4.8), SMemPerBlk = (n ele?dBlk:x)?4. From compilation of the kernel, RegsPerThd varies from 14 to 38. From the minimum of SMemPerBlk and RegsPerBlk, dGrd:x?dGrd:y > 30?BlksactiveSM . Therefore, Equation (3.8) simpli?es to BlksactiveGPU = 30?minparenleft.alt3? 16384RegsPerBlk?;? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3: (4.26) 89 Substituting BlksactiveGPU from Equation (4.26) and the minimum of ThdsactiveGPU from Equa- tion (4.14) into Equation (3.9) yields 12800 ? 30?minparenleft.alt3? 16384RegsPerBlk?;? 16384SMemPerBlk?;8; 1024dBlk:x?dBlk:yparenright.alt3?dBlk:x?dBlk:y: (4.27) Solving for dBlk:y modi?es the lower limit of dBlk:y from Equation (4.25b) to yield max uni239B uni239C uni239D 12800 30dBlk:x? 16384RegsPerBlk? ; 12800 30dBlk:x? 16384(n ele?dBlk:x)?4? ; 64dBlk:x uni239E uni239F uni23A0 ? dBlk:y ? 512dBlk:x: (4.28) Step 2 of deriving optimal input parameters speci?es shared memory utilization is max- imized to ensure data is reused as much as possible. From Lines 25 and 47 of the kernel in Listing 4.17, the trip count for the inner loops where shared memory is accessed is dependent on n ele. Since 1 ? n ele ? 16, n ele = 16. Substituting n ele = 16 into Equation (4.25d) yields dGrd:y = n16 (4.29) Since SMemPerBlk = (n ele ? dBlk:x) ? 4, dBlk:x ? 64. 
However, if dBlk:x = 64 and dBlk:y is the maximum from Equation (4.28), then BlksactiveGPU = 30 due to register usage. Therefore, ThdsactiveGPU = 7680 and is less than the minimum from Equation (4.14). Therefore, dBlk:x ? 32. However, if dBlk:x ? 16, due to register usage, only two values exist for dBlk:y which satisfy Equation (4.28). Therefore, Equation (4.28) simpli?es to 64 dBlk:x ? dBlk:y ? 512 dBlk:x if dBlk:x = 8: (4.30) and dBlk:y = uni23A7uni23AAuni23AAuni23AA uni23AAuni23A8 uni23AAuni23AAuni23AAuni23AA uni23A9 32 if dBlk:x = 16 2 if dBlk:x = 32: 90 Step 3 of deriving optimal input parameters speci?es the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces time each HW in a block waits for synchronization since there are fewer HWs per block. Since n ele = 16, SMemPerBlk = 64dBlk:x. Therefore, the minimum value of dBlk:x from Equation (4.25a) is utilized. Since dBlk:x = 8, Equations (4.30) and (4.25c) are evaluated to yield 8 ?dBlk:y ? 64 (4.31a) dGrd:x = n8dBlk:y (4.31b) Step 4 of deriving optimal input parameters speci?es the maximum number of global memory partitions, PartsPerGPU, are accessed to reduce partition camping. PartsPerGPU is dependent on PartsPerBlk, dGrd:x, and BlksactiveGPU . Since dBlk:x and n ele are constant, RegsPerThd is constant. From compilation, 32 RegsPerThd are allocated so RegsPerBlk = 256dBlk:y. Since RegsPerBlk is the only limiting factor of Equation (4.26), the equation simpli?es to BlksactiveGPU = 1920dBlk:y: (4.32) If dGrd:x > BlksactiveGPU , substituting dBlk:x = 8 and BlksactiveGPU from Equation (4.32) into Equation (3.13) and solving for dBlk:y yields4 dBlk:y ? 16: (4.33) If dGrd:x ? BlksactiveGPU , substituting dBlk:x = 8 and dGrd:x from Equation (4.31b) into Equa- tion (3.13) and simplifying yields PartsPerGPU = minparenleft.alt3 n64dBlk:y;8parenright.alt3: (4.34) 4The solution is dBlk:y ? 30. However, dBlk:y is a power of two so it is rounded down to the nearest power of two. 91 Substituting the minimum value of dBlk:y from Equation (4.31a) into Equation (4.34) yields dGrd:xdBlk:x64 ? minparenleft.alt3 n512;8parenright.alt3: (4.35) Substituting dBlk:x = 8 and dGrd:x from Equation (4.31b) into Equation (4.35) and solving for dBlk:y yields dBlk:y ? n64minparenleft.alt1 n 512;8parenright.alt1 : (4.36) Combining the upper limits of dBlk:y from Equations (4.33) and (4.36) with the lower limit of dBlk:y from Equation (4.31a) yields 8 ? dBlk:y ? minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0: (4.37) Step 5 of deriving optimal input parameters speci?es the number of fragmented blocks, Blksfrag:GPU from Equation (3.10), is minimized. Substituting dGrd:x from Equation (4.31b), dGrd:y from Equation (4.29), and BlksactiveGPU from Equation (4.32) into Equation (3.10) yields n2 128dBlk:y mod 1920 dBlk:y: (4.38) In general, xy can be expressed as x = Qy + R where Q is the quotient, Q = ?xy?, and R is the remainder. Since Blksfrag:GPU = R, Blksfrag:GPU = x ? Qy. Substituting for x, y, and Q from Equation (4.38) and simplifying yields Blksfrag:GPU = n 2 128dBlk:y ?? n2 245760? 
1920 dBlk:y: (4.39) Therefore, to minimize Blksfrag:GPU, the maximum value of dBlk:y from Equation (4.37) is utilized and dBlk:y = minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0: (4.40) 92 Since dBlk:x = 8 and from Equations (4.40), (4.25c), and (4.29), the optimal input parameters for the shared memory implementation of MM on a GPU utilizing the optimized computation and access pattern after ?ne-tuning are dBlk:x = 8 dBlk:y = minuni239Buni239D16; n64minparenleft.alt1 n 512;8parenright.alt1 uni239E uni23A0 dGrd:x = n8Blk:y dGrd:y = n16: (4.41) Figure 4.21 depicts measured GPU computation time of all input parameters for MM. The ?gure illustrates the necessity of optimal input parameters as the computation time of MM varies signi?cantly depending on the input parameters. 8 16 32 64 128 256 512 100 101 dBlk.x?dBlk.y tim e( ms ) n=512 Optimal 8 16 32 64 128 256 512 101 102 dBlk.x?dBlk.y tim e( ms ) n=1024 Optimal 93 8 16 32 64 128 256 512 102 103 dBlk.x?dBlk.y tim e( ms ) n=2048 Optimal 8 16 32 64 128 256 512 103 dBlk.x?dBlk.y tim e( ms ) n=4096 Optimal 8 16 32 64 128 256 512 104 dBlk.x?dBlk.y tim e( ms ) n=8192 Optimal 8 16 32 64 128 256 512 105 106 dBlk.x?dBlk.y tim e( ms ) n=16384 Optimal Figure 4.21: Comparison of input parameters: Computation time (ms) of all input parame- ters for MM on the T10 GPU. Derivation of optimal input parameters yields computation time, on average, within 0.5% of the minimum measured time. In the worst-case (n = 512), utilizing optimal in- put parameters yields time within 1.2% of the minimum. Best-case utilization of optimal input parameters (n = {1024;8192;16384}) yields the minimum. As depicted, deriving op- timal input parameters yields minimum or near-minimum GPU computation time for MM. Therefore, Step 5 of the parallelization procedure is valid and necessary for MM. 94 4.5.3 GPU Computation Summary After Step 5, the parallelization procedure is developed to determine the optimal parti- tioning of computation between the CPU and GPU. Therefore, the procedure with regards to minimizing GPU computation time is complete. Therefore, this section illustrates max- imum e?ective bandwidth of each aforementioned step of the parallelization procedure as it pertains to GPU computation time. In addition, a comparison of the procedure to the- oretical bandwidth and CUBLAS e?ective bandwidth is illustrated. For each ?gure in this subsection, Na ve denotes the global memory implementation, Sh. Mem. denotes the shared memory implementation, O.C.P. denotes the optimized computation pattern utilizing shared memory, and O.A.P. denotes the optimized access pattern utilizing the optimized compu- tation pattern and shared memory. P.P. denotes the entire parallelization procedure with regards to GPU computation which utilizes shared memory, optimized computation and access patterns, ?ne-tuning adjustments, and optimal input parameters. Mv Figure 4.22 depicts e?ective bandwidth of Mv for each step of the parallelization pro- cedure in addition to CUBLAS e?ective bandwidth and theoretical bandwidth. 512 1024 2048 4096 8192 16384 101 102 n eff ect ive ba nd wi dt h( GB /s ) TheoreticalCUBLAS P.P.O.A.P. O.C.P.Sh. Mem. Na??ve Figure 4.22: Comparison of e?ective bandwidth (GB/s) for Mv on the T10 GPU. 95 The parallelization procedure yields a speedup, on average, of 18.98 compared to the na??ve implementation. The minimum (n = 4096) and maximum (n = 16384) speedups are 11.92 and 37.35, respectively. 
As illustrated, each step of the procedure increases the e?ective bandwidth. On average, the procedure yields a speedup of 1.47 compared to the CUBLAS implementation with a minimum (n = 8192) and maximum (n = 1024) speedup of 0.96 and 2.17, respectively. Compared to theoretical bandwidth, the procedure yields e?ective bandwidth, on average, as 63.7% of the theoretical bandwidth. Utilizing the procedure yields a worst-case (n = 512) e?ective bandwidth of 21.3% of theoretical. Best-case utilization of the procedure (n = 8192) yields an e?ective bandwidth of 79.1% of the theoretical. The procedure is veri?ed to yield increases in speedup compared to the optimized CUBLAS GPU implementation of Mv. More importantly, the procedure is veri?ed to yield signi?cant increases in speedup compared to the na??ve GPU implementation of Mv. MM Figure 4.23 depicts e?ective bandwidth of MM for each step of the parallelization pro- cedure in addition to the theoretical and CUBLAS e?ective bandwidth. 512 1024 2048 4096 8192 16384 102 103 n eff ect ive ba nd wi dt h( GB /s ) TheoreticalCUBLAS P.P.O.A.P. O.C.P.Sh. Mem. Na??ve Figure 4.23: Comparison of e?ective bandwidth (GB/s) for MM on the T10 GPU. Similar to Mv, the procedure applied to MM yields a signi?cant speedup, on average, of 14.58 compared to the na??ve implementation. The minimum (n = 16384) and maximum 96 (n = 512) speedups are 12.36 and 21.56, respectively. In Figure 4.23, each step of the procedure increases the e?ective bandwidth. The largest increase in bandwidth occurs after the ?rst step of the procedure. This is due to the ?rst step signi?cantly reducing the number of global memory transactions. Compared to the CUBLAS implementations, the procedure yields a speedup, on average, of 1.07 with a minimum (n = 4096) and maximum (n = 512) speedup of 0.95 and 1.39, respectively. As illustrated, the procedure yields bandwidth near the theoretical. On average, the procedure yields 84.2% of the theoretical bandwidth. In the worst-case (n = 512), utilizing the procedure yields e?ective bandwidth of 73.2% of the theoretical. Best-case utilization of the procedure (n = 16384) yields an e?ective bandwidth of 88.3% of the theoretical. Similar to Mv, the procedure is veri?ed to yield increases in speedup compared to the optimized CUBLAS GPU implementation of MM. Likewise, the procedure is veri?ed to yield signi?cant increases in speedup compared to the na??ve GPU implementation of MM. Therefore, for Mv and MM, the procedure yields GPU computation times competitive with the CUBLAS implementation and signi?cantly lower than the na??ve implementation. 4.6 Computation Partitioning The last step, Step 6, of the parallelization procedure to minimize execution time is to determine the partitioning of computation between the CPU and GPU. For small matrix- based computations, the communication time may exceed the CPU computation time as measured for certain data sizes of Mv. Therefore, determining which computations are per- formed by the CPU and GPU requires accurate estimates of communication and computation times. From accurate estimates of communication and computation times, it is possible to de- termine the appropriate computations to be performed by the CPU and GPU. The algorithm in Listing 4.18 determines which matrix-based computations are performed by the CPU and GPU. TCPUcomp: and Tcomm: are presented in Chapter 3. 97 1 if TCPUcomp: ? TGPUcomp: +Tcomm: 2 perform computation on CPU. 3 if TCPUcomp: > TGPUcomp: +Tcomm: 4 perform computation on GPU. 
Since GPU computation time is estimated assuming maximum bandwidth, it is necessary to determine the number of bytes accessed from global memory for each matrix-based computation. The theoretical bandwidth is calculated by dividing the number of bytes necessary for computation by the number of bytes accessed from global memory and multiplying this fraction by the maximum bandwidth to global memory (94 GB/s).

4.6.1 Mv

For Mv, the optimal input parameters define dBlk.x = 32 from Equation (4.24). Therefore, from Equation (4.7), the 64B memory transactions access 4n^2 bytes of global memory. Since 2n^2 float values (8n^2 bytes) are necessary for computation and 4n^2 bytes are accessed from global memory, the theoretical bandwidth for Mv is (8n^2 / 4n^2)(94 GB/s) = 188 GB/s. Therefore, T_GPUcomp = 8n^2 / (1024^3 · 188). Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds as

T_comm = 5.00×10^-9 · n +
    3.02×10^-9 · n^2 + 5.00×10^-5   if n < 256
    1.08×10^-9 · n^2 + 9.60×10^-5   if 256 ≤ n < 512
    1.08×10^-9 · n^2 + 2.27×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substitution, if

8n^2 / (1024^3 · min(max(16384/n, 2), 8)) ≤ 8n^2 / (1024^3 · 188) + T_comm,

computation is performed on the CPU to minimize execution time. Solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 2048. Therefore, if n > 2048, the computation is performed on the GPU to minimize execution time.
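The sketch below evaluates the Mv partitioning rule with the constants given above for the T10 system and reproduces the n ≤ 2048 crossover. The function names are illustrative, and the CPU bandwidth term min(max(16384/n, 2), 8) GB/s is taken from the substituted inequality above as reconstructed here.

```c
#include <stdio.h>
#include <stdbool.h>

/* Sketch of the Mv partitioning model (times in seconds, n the matrix size). */
static double mv_t_gpu(double n)  { return 8.0*n*n / (1073741824.0 * 188.0); }

static double mv_t_cpu(double n)                /* CPU (ATLAS) estimate */
{
    double bw = 16384.0 / n;                    /* effective bandwidth, GB/s */
    if (bw < 2.0) bw = 2.0;
    if (bw > 8.0) bw = 8.0;
    return 8.0*n*n / (1073741824.0 * bw);
}

static double mv_t_comm(double n)               /* two H2D + one D2H transfer */
{
    double t = 5.00e-9 * n;
    if      (n < 256.0) t += 3.02e-9*n*n + 5.00e-5;
    else if (n < 512.0) t += 1.08e-9*n*n + 9.60e-5;
    else                t += 1.08e-9*n*n + 2.27e-4;
    return t;
}

int main(void)
{
    /* CPU wins while T_CPUcomp <= T_GPUcomp + T_comm (Lines 1-2 of Listing 4.18). */
    for (int n = 512; n <= 16384; n *= 2) {
        bool cpu = mv_t_cpu(n) <= mv_t_gpu(n) + mv_t_comm(n);
        printf("n=%5d -> %s\n", n, cpu ? "CPU" : "GPU");
    }
    return 0;
}
```

Running this sketch selects the CPU for n ≤ 2048 and the GPU for larger n, matching the threshold derived above.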
Figure 4.24 illustrates measured execution time for Mv. CPU (ATLAS) execution time is the measured time of the CPU utilizing ATLAS to perform Mv. GPU (CUBLAS) execution time is the measured time for communication between the CPU and GPU in addition to the measured GPU computation time utilizing CUBLAS.

Figure 4.24: Comparison of execution time (ms) for Mv on the T10 GPU.

As mentioned, if n ≤ 2048, Mv is performed on the CPU and, therefore, the procedure yields execution time identical to the ATLAS implementation. If n > 2048, Mv is performed on the GPU utilizing the kernel developed in the procedure (Listing 4.16). The procedure yields a speedup, on average, of 1.68 compared to the ATLAS implementation, 2.63 compared to the naïve GPU implementation (Listing 4.2), and 1.39 compared to the CUBLAS implementation. The speedup compared to the naïve and CUBLAS implementations in Figure 4.24 is less than in Figure 4.22 due to communication time. For Mv, the communication time is a significant portion of execution time. Therefore, large speedups measured for Mv computation have less of an impact on execution time, since the execution time includes communication and computation time.

From Figure 4.24, the procedure yields execution time, on average, within 0.8% of the theoretical minimum execution time. Utilizing the procedure yields a worst-case (n = 8192) time within 1.9% of the theoretical minimum. In the best case (n = {512, 1024, 2048}), utilizing the procedure yields the theoretical minimum. This is possible since the theoretical minimum for n ≤ 2048 is identical to the CPU utilizing ATLAS, because the theoretical minimum for the CPU is calculated utilizing the effective bandwidths of the CPU. Therefore, Step 6 of the parallelization procedure, determining the optimal partitioning of computation, is valid and necessary for Mv.

4.6.2 MM

For MM, 2n^3 float values are necessary for computation. If n_ele = 16, as defined in Section 4.5.2 (optimal input parameters), data is reused via shared memory. Since each thread in a HW reuses 16 float values, 8n^3 / 16 bytes are necessary from global memory. Assuming the maximum global memory bandwidth of 94 GB/s, the theoretical bandwidth for MM is 1504 GB/s. Therefore, T_GPUcomp = 8n^3 / (1024^3 · 1504). Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds as

T_comm =
    8.02×10^-9 · n^2 + 5.00×10^-5   if n < 256
    4.14×10^-9 · n^2 + 1.42×10^-4   if 256 ≤ n < 512
    3.38×10^-9 · n^2 + 5.94×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substitution, if

8n^3 / (1024^3 · 51) ≤ 8n^3 / (1024^3 · 1504) + T_comm,

computation is performed on the CPU to minimize execution time. Solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 64. Therefore, if n > 64, the computation is performed on the GPU to minimize execution time.

Figure 4.25 illustrates measured execution time for MM. CPU (ATLAS) execution time is the measured time of the CPU utilizing ATLAS to perform MM. GPU (CUBLAS) execution time is the measured time for communication between the CPU and GPU in addition to the measured GPU computation time utilizing CUBLAS.

Figure 4.25: Comparison of execution time (ms) for MM on the T10 GPU.

For MM, the parallelization procedure yields execution time similar to the CUBLAS implementation for all data sizes tested; the speedup, on average, is 1.02. The procedure yields a speedup, on average, of 18.81 compared to the ATLAS implementation and 10.60 compared to the naïve implementation. Similar to Mv, the speedup compared to the naïve and CUBLAS implementations in Figure 4.25 is less than in Figure 4.23 due to communication time. However, differing from Mv, the communication time is not a significant portion of execution time. Therefore, the reduction in speedup due to communication is less than for Mv. The procedure yields execution time, on average, within 10.9% of the theoretical minimum execution time. The worst case utilizing the procedure (n = 4096) yields time within 11.4% of the theoretical minimum. The best case utilizing the procedure (n = 512) yields time within 10.4% of the theoretical minimum. For MM, Step 6 of the procedure, determining the partitioning of computation, is not strictly necessary since n > 64 for all data sizes tested. However, Step 6 is necessary to minimize execution time for Mv and thus is included.

Chapter 5
Performance Analysis

The parallelization procedure is developed and verified through Mv and MM. Therefore, Section 5.1 of this chapter applies the parallelization procedure to convolution. Each step is illustrated in a subsection and results are presented.
In addition to veri?cation through convolution, in Section 5.2, execution time of the conjugate gradient method [65] utilizing the parallelization procedure is compared with execution time of the CPU utilizing ATLAS and the GPU utilizing CUBLAS. All results in this chapter were gathered using the NVIDIA Tesla S1070 which includes a pair of T10 GPUs connected to a quad-core 2.26GHz Intel Xeon E5520 CPU at Alabama Supercomputer Authority [62]. The height or width of a square matrix, n, is varied from 512 to 16384 as de?ned by Equation (4.1). dBlk:x is varied from 8 to 512, the maximum allowable value for the T10. 5.1 Convolution As mentioned, the parallelization procedure developed in Chapter 4 is applied to con- volution to show the validity of the procedure. Three various ?lter sizes (FS = {3;63;513}) were selected for testing convolution. 5.1.1 Placement of Data Step 1 of the parallelization procedure is to determine the optimal placement of data in GPU memory. Listing 5.1 is the kernel for the global memory implementation of convolution on a GPU where each thread computes one Cij. The size of the ?lter, or the height or width of B, is passed to the kernel as FS. The radius of the ?lter is computed in Line 2. Each thread 103 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 int radius = (FS-1)/2; 3 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 4 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 5 float temp = 0.0; 6 for(int k = 0; k < FS; k++) { 7 int Col = indexCol + k - radius; 8 for(int l = 0; l < FS; l++) { 9 int Row = indexRow + l - radius; 10 if(Col >= 0 && Col < n) { 11 if(Row >= 0 && Row < n) { 12 temp += A[Row * n + Col] * B[l * FS + k]; 13 } 14 } 15 } 16 } 17 C[indexRow * n + indexCol] = temp; 18 } Listing 5.1: The global memory implementation of convolution on a GPU. computes the column and row index of the Cij to compute in Lines 3 and 4. Lines 10 and 11 perform boundary checking to ensure no address is accessed which is out of range. k and l determine which row and column of the ?lter, respectively, are being accessed. The order of the for-loops in Lines 6 and 8 determines the order in which computation is performed and thus a?ects the computation time. The order of computation, or the computation pattern, is discussed in Section 5.1.2. In Line 3 of Listing 5.1, indexCol is calculated utilizing threadIdx.x. Therefore, Col, Line 7, is also dependent on threadIdx.x. Since HWs are assigned in row-major order (threadIdx.x ?rst), accesses to A in Line 12 are coalesced. The number of global memory accesses for accessing A is dependent on the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Line 4. Therefore, the number of global memory accesses for accessing A is approximately1 Gld32B = n 2FS2 dBlk:x if dBlk:x ? 8 1For simplicity, the equations are approximations. Due to boundary checking, not all reads de?ned by the equations are necessary or issued. 104 and Gld64B = n 2FS2 16 if dBlk:x ? 16: Since each thread in a HW accesses the same value of B in Line 12, the number of global memory accesses for accessing B is approximately1 Gld32B = n 2FS2 16 (5.1) regardless of the value of dBlk:x. Lines 4-6 of Listing 4.1, the algorithm to determine the placement of data, specify that shared memory is utilized for reused data which is larger than the size of constant memory. It is not assumed the ?lter is smaller than the size of constant memory 2. 
Therefore, Line 5 speci?es that data which is reused is partitioned into blocks smaller than the size of shared memory. Therefore, A and B are partitioned into blocks of shared memory to reduce the number of global memory accesses as shown in Listing 5.2. Shared memory is allocated for parts of A and B in Lines 2 and 3, respectively. Therefore, SMemPerBlk = (2?dBlk:y ?dBlk:x+dBlk:y)?4: (5.2) Every dBlk:y iteration, parts of A and B are loaded into shared memory as shown in Lines 12-28. Since all HWs within a block access the same part of B, only some threads are utilized to load B into shared memory. In this kernel, threads of each HW with threadIdx.x=0 are utilized to load part of B into shared memory as shown in Lines 14-18. Boundary checking is performed in Lines 15, 19, 20, 23, 29, and 30. Similar to the global memory implementation, the order of the for-loops in Lines 8 and 10 determines the computation pattern, which is discussed in Section 5.1.2. 2If constant memory is utilized to store the ?lter, the maximum size of the ?lter is limited to 128. 105 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y * 2][blockDim.x]; 3 __shared__ float Bs[blockDim.y]; 4 int radius = (FS-1)/2; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Col = indexCol + k - radius; 10 for(int l = 0; l < FS; l++) { 11 int Row = indexRow + l - radius; 12 if((l % blockDim.y) == 0) { 13 __syncthreads(); 14 if(threadIdx.x == 0) { 15 if(l + threadIdx.y < FS) { 16 Bs[threadIdx.y] = B[(l + threadIdx.y) * FS + k]; 17 } 18 } 19 if(Col >= 0 && Col < n) { 20 if(Row >= 0 && Row < n) { 21 As[threadIdx.y][threadIdx.x] = A[Row * n + Col]; 22 } 23 if((Row + blockDim.y) >= 0 && (Row + blockDim.y) < n ) { 24 As[threadIdx.y + blockDim.y][threadIdx.x] = A[(Row + rcurvearrowse blockDim.y) * n + Col]; 25 } 26 } 27 __syncthreads(); 28 } 29 if(Col >= 0 && Col < n ) { 30 if(Row >= 0 && Row < n) { 31 temp += As[threadIdx.y + (l % blockDim.y)][threadIdx.x] * Bs[(l rcurvearrowse % blockDim.y)]; 32 } 33 } 34 } 35 } 36 C[indexRow * n + indexCol] = temp; 37 } Listing 5.2: The shared memory implementation of convolution on a GPU. 106 Since global memory is accessed every dBlk:y iteration, the number of global memory accesses for accessing A is dependent on the size of B, FS, and the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Lines 6 and 11. Therefore, the number of global memory accesses for accessing A is approximately3 Gld32B = 2n 2FS dBlk:x ? FS dBlk:y? if dBlk:x ? 8 (5.3) and4 Gld64B = 2n 2FS 16 ? FS dBlk:y? if dBlk:x ? 16: (5.4) If dBlk:x ? 16, this reduces the number of global memory accesses for accessing A by a factor of FS2? FS dBlk:y? compared to the global memory implementation. Since global memory is accessed every dBlk:y iteration, the number of global memory accesses for accessing B is approximately3 Gld32B = 16n 2FS dBlk:x2 ? FS dBlk:y? if dBlk:x ? 8 (5.5) and Gld32B = n 2FS dBlk:x ? FS dBlk:y? if dBlk:x ? 16: (5.6) This reduces the number of 32B accesses for accessing B by a factor of FSdBlk:x16? FS dBlk:y? compared to the global memory implementation. Figures 5.1-5.3 depict maximum e?ective bandwidth of the global memory implemen- tation of convolution (Listing 5.1) and the shared memory implementation (Listing 5.2). 
Effective bandwidth for convolution is calculated in Section 3.5 as (8n^2 FS^2 bytes) / T_GPUcomp. Figure 5.1 depicts the maximum effective bandwidth for a filter size, FS, of 3.

3 For simplicity, the equations are approximations. Due to boundary checking, not all reads defined by the equations are necessary or issued.
4 Since FS is not a power of two, some reads are 32B transactions if dBlk.x ≥ 16.

Figure 5.1: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.

As illustrated, the reduction in global memory accesses significantly increases the effective bandwidth for convolution with a small filter size. The shared memory implementation yields a speedup, on average, of 2.74 compared to the global memory implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 2.37 and 2.86, respectively. Figure 5.2 depicts the maximum effective bandwidth for a filter size of 63.

Figure 5.2: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.

For a filter size of 63, utilizing shared memory increases the effective bandwidth due to the reduction in global memory accesses. However, comparing Figures 5.1 and 5.2 shows the increase in bandwidth is smaller as the filter size increases. Regardless, in Figure 5.2, the shared memory implementation yields a speedup, on average, of 1.25 compared to the global memory implementation. The minimum (n = 512) and maximum (n = 4096) speedups are 1.22 and 1.26, respectively. Although the increase in effective bandwidth for the shared memory implementation is less than for a small filter size, Step 1 of the procedure is valid for a filter size of 63. Lastly, the maximum effective bandwidth is illustrated in Figure 5.3 for a filter size of 513.

Figure 5.3: Comparison of GPU memories: Maximum effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.

Similar to a filter size of 63, the shared memory implementation yields a speedup, on average, of 1.27 compared to the global memory implementation for a large filter size. Likewise, the minimum (n = 512) and maximum (n = 8192) speedups are 1.18 and 1.32, respectively. These results show that Step 1 of the procedure yields larger reductions of computation time for smaller filter sizes. Regardless, Step 1 is verified to reduce computation time for all data and filter sizes tested.

5.1.2 Computation Patterns

Step 2 of the procedure specifies that the optimized computation pattern is derived. As mentioned, computation patterns are determined by the code and represent the manner and order in which results are computed. For convolution, if thread_ij computes C_ij where C = A ∗ B, the order in which C_ij is computed creates varying computation patterns. The filter, B, is centered over each value of the input, A, and the products of the overlapping values of A and B are summed to compute one value of C, C_ij. This is depicted in Figure 5.4.

Figure 5.4: Computation patterns: two computation patterns for convolution. (a) Column-major. (b) Row-major.

The order in which overlapping values are multiplied
determines the computation pattern. A pattern where each partial sum is computed ?rst in the y-dimension, as depicted in Figure 5.4a, is the column-major computation pattern. Figure 5.4b depicts the computation pattern where each value is computed ?rst in the x- dimension, row-major. Column-major The shared memory implementation of convolution utilizing the column-major compu- tation pattern depicted in Figure 5.4a is shown in Listing 5.2 in Section 5.1.1. The amount of shared memory allocated per block is de?ned by Equation (5.2) and the number of global memory accesses for accessing A and B is approximated5 in Equations (5.3)-(5.6). 5Due to boundary checking, not all reads are issued. 110 Optimized (row-major) The shared memory implementation of convolution utilizing the row-major computation pattern depicted in Figure 5.4b requires several minor modi?cations to Listing 5.2. There- fore, the shared memory implementation of convolution utilizing the row-major computation pattern is shown in Listing 5.3. Shared memory is allocated for parts of A and B in Lines 2 and 3, respectively. Therefore, SMemPerBlk = (2?dBlk:x?dBlk:y +dBlk:x)?4: (5.7) Every dBlk:x iteration, parts of A and B are loaded into shared memory as shown in Lines 12-28. Since all HWs within a block access the same part of B, only some threads are utilized to load B into shared memory. In this kernel, threads of each HW with threadIdx.y=0 are utilized to load part of B into shared memory as shown in Line 14. Boundary checking is performed in Lines 15, 19, 20, 23, 29, and 30. Since global memory is accessed every dBlk:x iteration, the number of global memory accesses for accessing A is dependent on the size of B, FS, and the number of threads per HW with di?ering threadIdx.y values. This is because each HW has threadIdx.y unique values for Row from Lines 6 and 9. Therefore, the number of global memory accesses for accessing A is approximately8 Gld32B = 2n 2FS dBlk:x ? FS dBlk:x? if dBlk:x ? 8 (5.8) and6 Gld64B = 2n 2FS 16 ? FS dBlk:x? if dBlk:x ? 16: (5.9) If dBlk:x ? 16, this reduces the number of global memory accesses for accessing A by a factor of FS2? FS dBlk:x? compared to the global memory implementation. However, if dBlk:x = 6Since FS is not a power of two, some reads are 32B transactions if dBlk:x ? 16. 
111 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int radius = (FS-1)/2; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Row = indexRow + k - radius; 10 for(int l = 0; l < FS; l++) { 11 int Col = indexCol + l - radius; 12 if((l % blockDim.x) == 0) { 13 __syncthreads(); 14 if(threadIdx.y == 0) { 15 if(l + threadIdx.x < FS) { 16 Bs[threadIdx.x] = B[k * FS + l + threadIdx.x]; 17 } 18 } 19 if(Row >= 0 && Row < n) { 20 if(Col >= 0 && Col < n) { 21 As[threadIdx.y][threadIdx.x] = A[Row * n + Col]; 22 } 23 if((Col + blockDim.x) >= 0 && (Col + blockDim.x) < n ) { 24 As[threadIdx.y][threadIdx.x + blockDim.x] = A[Row * n + Col + rcurvearrowse blockDim.x]; 25 } 26 } 27 __syncthreads(); 28 } 29 if(Row >= 0 && Row < n ) { 30 if(Col >= 0 && Col < n) { 31 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l rcurvearrowse % blockDim.x)]; 32 } 33 } 34 } 35 } 36 C[indexRow * n + indexCol] = temp; 37 } Listing 5.3: The shared memory implementation of convolution on a GPU utilizing the optimized computation pattern. 112 dBlk:y, there is no reduction in the number of accesses compared to the shared memory implementation utilizing the column-major computation pattern. Since global memory is accessed every dBlk:x iteration, the number of global memory accesses for accessing B is approximately8 Gld32B = n 2FS dBlk:x?dBlk:y ? FS dBlk:x? if dBlk:x ? 8: (5.10) and Gld64B = n 2FS 16dBlk:y ? FS dBlk:x? if dBlk:x ? 16: (5.11) If dBlk:x ? 16, 32B accesses for accessing B are replaced with 64B accesses compared to the shared memory implementation utilizing the column-major computation pattern. The number of accesses to global memory is reduced by a factor of 16dBlk:y? FS dBlk:y? dBlk:x? FSdBlk:x? . Therefore, if dBlk:x = dBlk:y = 16, this computation pattern reduces the number of accesses to global memory by a factor of 16 and the number of bytes accessed by a factor of 8. Therefore, the row-major pattern is the optimized computation pattern. Figures 5.5-5.7 depict maximum e?ective bandwidth of convolution before (Listing 5.2) and after (Listing 5.3) utilizing the optimized computation pattern. The optimal placement of data is utilized in both listings. As mentioned, e?ective bandwidth for convolution is 8n2FS2bytes TGPUcomp: . Figure 5.5 depicts maximum e?ective bandwidth for a ?lter size, FS, of 3. 113 512 1024 2048 4096 8192 1638460 65 70 75 80 85 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.5: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=3. Shown in Figure 5.5, Step 2 of the procedure yields a modest speedup, on average, of 1.03 compared to the non-optimized pattern. As mentioned, for the optimized computation pattern, the number of accesses to global memory is reduced by a factor of 16dBlk:y? FS dBlk:y? dBlk:x? FSdBlk:x? compared to the non-optimized pattern. Since maximum e?ective bandwidth is illustrated and dBlk:x and dBlk:y vary, the optimized computation pattern yields a modest speedup. The minimum (n = 16384) and maximum (n = 512) speedups are 1.02 and 1.07, respectively. For a ?lter size of 63, the maximum e?ective bandwidth is illustrated in Figure 5.6. 
512 1024 2048 4096 8192 1638489.5 90 90.5 91 91.5 92 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.6: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=63. 114 In Figure 5.6, the e?ective bandwidth, before and after utilizing the optimized compu- tation pattern, shows a di?ering behavior for the varying data sizes. However, given the scale of the ?gure, the di?erence is small and not explored further. Similar to a small ?lter size, the optimized computation pattern for a ?lter size of 63 yields a near-constant speedup, on average, of 1.01 compared to the non-optimized pattern. The minimum (n = 2048) and maximum (n = 512) speedups are 1.00 and 1.02, respectively. Lastly, Figure 5.7 depicts maximum e?ective bandwidth for a ?lter size of 513. 512 1024 2048 4096 8192 1638490 95 100 105 110 115 120 125 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.7: Comparison of computation patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=513. AsillustratedinFigure5.7, theoptimizedcomputationpatternyieldsaconstantspeedup of 1.03 compared to the non-optimized pattern. The e?ective bandwidth for all ?lter sizes tested is roughly equivalent and a modest increase is shown when utilizing the optimized computation pattern. Although modest speedups are measured for all ?lter sizes, the band- width is increased for all data and ?lter sizes tested. Therefore, Step 2 of the parallelization procedure is shown to yield increases in e?ective bandwidth, and thus reduce the computa- tion time, for convolution. 115 5.1.3 Access Patterns Step 3 of the parallelization procedure is to derive the optimized access pattern that minimizes partition camping and eliminates bank con?icts. For Mv and MM, HWs begin execution by reading the ?rst value of their respective rows of A. However, for convolution, HWs begin by reading varying columns of their respective rows of A. Therefore, HWs are spread to varying partitions for accessing A without an optimized access pattern. Figure 5.8a illustrates a 3x3 ?lter, B, stored in global memory. Padding the ?lter with zeros such that each row resides in the next partition of global memory reduces partition camping. This is illustrated in Figure 5.8b. In addition, padding the ?lter aligns each 0 1 2 3 row of filter filter values partition (a) Non-padded ?lter in global memory. 0 1 2 3 row of filter filter values partition (b) Padded ?lter in global memory. Figure 5.8: Access patterns: example of a 3x3 ?lter, B, stored in global memory. Each row of a partition is 4 values of type ?oat. row to a speci?ed boundary. From the CUDA programming guide [64], ?Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).? Therefore, padding the ?lter with zeros ensures the data is naturally aligned thus reducing the number of global memory transactions. Similarly, padding A with zeros eliminates the boundary checking instructions in Lines 15, 19, 20, 23, 29, and 30 of Listing 5.3. From the CUDA programming guide [64], ?If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each 116 branch path taken.? Eliminating boundary checking instructions via padding eliminates thread divergence. 
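The following host-side sketch illustrates the padding described above, assuming row-major float arrays; the function names are illustrative. It mirrors the indexing used later in Listing 5.4 (a padded row length of n + FS - 1 for A, and padding = 64 - FS % 64 zeros per filter row so each row of B is aligned to a 256-byte partition boundary). The padded buffers would then be transferred to the GPU with cudaMemcpy.

```c
#include <stdlib.h>
#include <string.h>

/* Zero-pad the n x n input A by the filter radius on every side so the kernel
 * needs no boundary checks; the padded matrix is (n + FS - 1) x (n + FS - 1). */
static float *pad_input(const float *A, int n, int FS)
{
    int radius = (FS - 1) / 2;
    int np = n + FS - 1;                         /* padded width and height */
    float *Ap = (float *)calloc((size_t)np * np, sizeof(float));
    for (int i = 0; i < n; i++)                  /* copy A into the centre  */
        memcpy(&Ap[(i + radius) * np + radius], &A[i * n], n * sizeof(float));
    return Ap;
}

/* Pad each row of the FS x FS filter B with zeros up to the next multiple of
 * 64 floats (256 bytes) so consecutive rows fall in consecutive partitions
 * of global memory and remain naturally aligned. */
static float *pad_filter(const float *B, int FS, int *padded_row /* out */)
{
    int padding = 64 - FS % 64;
    *padded_row = FS + padding;
    float *Bp = (float *)calloc((size_t)FS * (size_t)(*padded_row), sizeof(float));
    for (int k = 0; k < FS; k++)                 /* copy each filter row */
        memcpy(&Bp[k * (*padded_row)], &B[k * FS], FS * sizeof(float));
    return Bp;
}
```
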
Therefore, modifying Listing 5.3 to accommodate padding of A and B yields Listing 5.4 which is the shared memory implementation of convolution utilizing the optimized computation and access pattern. Shared memory is allocated in Lines 2 and 3 of 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int padding = 64 - FS % 64; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 for(int k = 0; k < FS; k++) { 9 int Row = indexRow + k; 10 for(int l = 0; l < FS; l++) { 11 int Col = indexCol + l; 12 if((l % blockDim.x) == 0) { 13 __syncthreads(); 14 if(threadIdx.y == 0) { 15 Bs[threadIdx.x] = B[k * (FS + padding) + l + threadIdx.x]; 16 } 17 As[threadIdx.y][threadIdx.x] = A[Row * (n + FS - 1) + Col]; 18 As[threadIdx.y][threadIdx.x + blockDim.x] = A[Row * (n + FS - 1) rcurvearrowse + Col + blockDim.x]; 19 __syncthreads(); 20 } 21 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l % rcurvearrowse blockDim.x)]; 22 } 23 } 24 C[indexRow * n + indexCol] = temp; 25 } Listing 5.4: The shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern. Listing 5.4 the same as in Lines 2 and 3 of Listing 5.3. Therefore, the amount of shared memory allocated per block is de?ned by Equation (5.7). In Line 4, padding instead of the ?lter radius is calculated. padding speci?es the number of zeros added to each row of the ?lter. The index calculations for accessing A, Lines 9 and 11, no longer include the radius of the ?lter, as A is padded. Since A is padded by the size of the ?lter, the length of each row is n+FS ?1 rather than n as shown in Lines 17 and 18. 117 Figures 5.9-5.11 depict maximum e?ective bandwidth of convolution before (Listing 5.3) and after (Listing 5.4) utilizing the optimized access pattern. The optimal placement of data and optimized computation patterns are utilized. Figure 5.9 depicts maximum e?ective bandwidth for a ?lter size of 3. 512 1024 2048 4096 8192 1638460 80 100 120 140 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.9: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for con- volution on the T10 GPU. FS=3. As mentioned, the optimized access pattern, which consists of padding A and B with zeros, reduces partition camping. In addition, the optimized access pattern eliminates bound- ary checking, and thus, reduces thread divergence. As illustrated in Figure 5.9, the optimized access pattern yields a speedup, on average, of 1.54 compared to the non-optimized pattern for a small ?lter size. The minimum (n = 512) and maximum (n = 16384) speedups are 1.27 and 1.67, respectively. Similar to Mv and MM, the optimized access pattern, which reduces partition camping, signi?cantly increases bandwidth for convolution with a small ?lter size. Maximum e?ective bandwidth for a ?lter size of 63 is depicted in Figure 5.10. 118 512 1024 2048 4096 8192 1638490 100 110 120 130 140 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.10: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=63. Similar to a ?lter size of 3, the optimized access pattern for a ?lter size of 63 yields a speedup, on average, of 1.60 compared to the non-optimized pattern. The minimum (n = 512) and maximum (n = 16384) speedups are 1.58 and 1.61, respectively. 
Lastly, Figure 5.11 depicts maximum e?ective bandwidth for a ?lter size of 513. 512 1024 2048 4096 8192 1638490 100 110 120 130 140 150 n eff ect ive ba nd wi dt h( GB /s ) Non-optimized Optimized Figure 5.11: Comparison of access patterns: Maximum e?ective bandwidth (GB/s) for convolution on the T10 GPU. FS=513. As shown in Figure 5.11, the optimized access pattern yields a speedup, on average, of 1.46 compared to the non-optimized pattern. The minimum (n = 512) and maximum (n = 16384) speedups are 1.19 and 1.59, respectively. For each ?lter size tested, Step 3 of 119 the parallelization procedure, derive the optimized access pattern, yields signi?cant increases in maximum e?ective bandwidth. In addition, the optimized access pattern yields similar increases in speedup for all data and ?lter sizes tested. 5.1.4 Fine-tuning Step 4 of the parallelization procedure is to ?ne-tune the kernel by unrolling loops and minimizing index calculations. As mentioned, loop unrolling reduces computation time by increasing the instruction mix of memory transactions and computation instructions. Minimizing index calculations, particularly in inner loops, reduces the time necessary to issue global memory transactions, thus reducing the computation time. Step 4, ?ne-tuning, is applied to the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern, Listing 5.4, to yield Listing 5.5. No modi?cations are performed to the allocation of shared memory and therefore, Equation (5.7) de?nes the amount of shared memory utilized. Since modi?cations are performed only to index calculations and loop unrolling, the number of global memory reads is de?ned by Equations (5.8)-(5.11). In Line 8, calc1 is calculated outside of the for-loops as it is a commonly used calculation in the for-loops for index calculations of A and B. Index calculations are reduced by initializing pointerstoA,B, andCin Lines 9-11, thus eliminating the need for Row and Col from Listing 5.4. Pointer arithmetic is performed in Lines 21, 23, 29, and 30 to further reduce index calculations. In Listing 5.5, for-loops are preceded with the #pragma unroll directive in Lines 12 and 14 to increase the instruction mix and reduce the number of registers used per thread. Since the loops are dependent only on FS, the compiler unrolls the loops until the maximum instruction limit is reached. Therefore, the amount of loop unrolling performed is dependent on FS. 
120 1 __global__ void Conv(float* A, float* B, float* C, int n, int FS) { 2 __shared__ float As[blockDim.y][blockDim.x * 2]; 3 __shared__ float Bs[blockDim.x]; 4 int padding = 64 - FS % 64; 5 int indexCol = blockIdx.x * blockDim.x + threadIdx.x; 6 int indexRow = blockIdx.y * blockDim.y + threadIdx.y; 7 float temp = 0.0; 8 int calc1 = blockDim.x * (floorf(FS / blockDim.x) + 1); 9 A += indexRow * (n + FS-1) + indexCol; 10 B += threadIdx.x; 11 C += indexRow * n + indexCol; 12 #pragma unroll 13 for(int k = 0; k < FS; k++) { 14 #pragma unroll 15 for(int l = 0; l < FS; l++) { 16 if((l % blockDim.x) == 0) { 17 __syncthreads(); 18 if(threadIdx.y == 0) { 19 Bs[threadIdx.x] = B[0]; 20 } 21 B += blockDim.x; 22 As[threadIdx.y][threadIdx.x] = A[0]; 23 A += blockDim.x; 24 As[threadIdx.y][threadIdx.x + blockDim.x] = A[0]; 25 __syncthreads(); 26 } 27 temp += As[threadIdx.y][threadIdx.x + (l % blockDim.x)] * Bs[(l % rcurvearrowse blockDim.x)]; 28 } 29 A += n + FS-1 - calc1; 30 B += FS + padding - calc1; 31 } 32 C[0] = temp; 33 } Listing 5.5: The shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after ?ne-tuning 121 Figures 5.12-5.14 depict the maximum e?ective bandwidth of convolution before (Listing 5.4) and after (Listing 5.5) ?ne-tuning. Figure 5.12 depicts maximum e?ective bandwidth for a ?lter size, FS, of 3. 512 1024 2048 4096 8192 1638480 90 100 110 120 130 140 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.12: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=3. In Figure 5.12, the kernel after ?ne-tuning yields a modest speedup, on average, of 1.04 compared to before ?ne-tuning. The minimum (n = 16384) and maximum (n = 512) speedups are 0.99 and 1.22, respectively. Similar to Mv, the ?gure suggests Step 4 of the procedure, ?ne-tuning the kernel, has a minimal impact on computation time for a small ?lter size. However, as illustrated in Figure 5.13, Step 4 of the procedure increases bandwidth more for larger ?lter sizes. 122 512 1024 2048 4096 8192 163840 200 400 600 800 1000 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.13: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=63. The kernel after ?ne-tuning yields a speedup, on average, of 6.62 compared to before ?ne-tuning for a ?lter size of 63. The minimum (n = 512) and maximum (n = 16384) speedups are 6.58 and 6.64, respectively. The speedup is signi?cantly higher than for a ?lter size of 3 presumably due to loop unrolling. As mentioned, the compiler automatically unrolls small loops. The results in Figures 5.12 and 5.13 suggest that for a small ?lter size, the compiler automatically unrolls both loops depicted in Listing 5.4. Therefore, ?ne-tuning the kernel has a minimal impact for a small ?lter size. However, as the ?lter size increases, results indicate the compiler does not automatically unroll the for-loops. Therefore, the procedure yields signi?cant speedups by forcing the compiler to unroll the loops. Lastly, Figure 5.14 depicts e?ective bandwidth for a ?lter size of 513. 123 512 1024 2048 4096 8192 163840 200 400 600 800 1000 1200 n eff ect ive ba nd wi dt h( GB /s ) Standard Fine-tuned Figure 5.14: Comparison of ?ne-tuning: Maximum e?ective bandwidth (GB/s) for convolu- tion on the T10 GPU. FS=513. Similar to a ?lter size of 63, for a ?lter size of 513, the kernel after ?ne-tuning yields a speedup, on average, of 6.98 compared to before ?ne-tuning. 
The minimum (n = 512) and maximum (n = 8192) speedups are 6.91 and 7.00, respectively. Again, the speedup is significantly higher for a filter size of 513 than for a filter size of 3. Since the amount of loop unrolling performed is dependent on FS, the results suggest that loop unrolling occurs without instructing the compiler for small filter sizes but not for larger sizes. Therefore, significant speedup from Step 4 is attained only for larger filter sizes, and the speedup from Step 4 of the procedure, fine-tuning the kernel, is dependent on the filter size. Regardless, Step 4 of the procedure yields an increase in effective bandwidth, and thus a decrease in computation time, on average, for all data and filter sizes tested.

5.1.5 Input Parameters

Step 5 of the parallelization procedure specifies that optimal input parameters are derived. A list of five steps to derive optimal input parameters is presented in Section 4.5. Since the derivation is Step 5 of the procedure, optimal input parameters are derived for the kernel in Listing 5.5, which is the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after fine-tuning. From the kernel and Equations (4.12) and (4.13),

8 ≤ dBlk.x ≤ 512 (5.12a)
1 ≤ dBlk.y ≤ 512 / dBlk.x (5.12b)
dGrd.x = n / dBlk.x (5.12c)
dGrd.y = n / dBlk.y. (5.12d)

From Equation (5.7), SMemPerBlk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4. From compilation of the kernel, RegsPerThd = 10 and is constant regardless of n, FS, and the input parameters. Therefore, Equation (3.7) simplifies to

Blks_active:SM = min(8, 1024 / (dBlk.x · dBlk.y)). (5.13)

Substituting Blks_active:SM from Equation (5.13), dGrd.x from Equation (5.12c), and dGrd.y from Equation (5.12d) into Equation (3.8) yields

Blks_active:GPU = min(30 · min(8, 1024 / (dBlk.x · dBlk.y)), n^2 / (dBlk.x · dBlk.y)). (5.14)

Since n ≥ 512 from Equation (4.1) and dBlk.x · dBlk.y ≤ 512 from Equation (4.12), Equation (5.14) simplifies to

Blks_active:GPU = 30 · min(8, 1024 / (dBlk.x · dBlk.y)). (5.15)

Substituting Blks_active:GPU from Equation (5.15) and the minimum of Thds_active:GPU from Equation (4.14) into Equation (3.9) yields

12800 ≤ 30 · min(8, 1024 / (dBlk.x · dBlk.y)) · dBlk.x · dBlk.y. (5.16)

If 1024 / (dBlk.x · dBlk.y) ≤ 8, Equation (5.16) is satisfied. Solving Equation (5.16) for dBlk.y when 1024 / (dBlk.x · dBlk.y) > 8 yields dBlk.y ≥ 12800 / (240 · dBlk.x). Since dBlk.y is a power of two, this simplifies to dBlk.y ≥ 64 / dBlk.x. Combining this with the limits on dBlk.y from Equation (5.12b) yields

max(64 / dBlk.x, 1) ≤ dBlk.y ≤ 512 / dBlk.x. (5.17)

Step 2 of deriving optimal input parameters specifies that shared memory utilization is maximized to ensure data is reused as much as possible. From Line 15 of the kernel in Listing 5.5, the trip count of the inner loop, where shared memory is accessed, is dependent on FS. From the number of accesses to global memory in Equations (5.8)-(5.11), dBlk.x ≥ 16. Since loads are performed ⌈FS / dBlk.x⌉ times in the inner loop, dBlk.x should be at least FS; therefore, dBlk.x ≥ max(16, FS). However, if dBlk.x > FS, dBlk.x · dBlk.y values are accessed for A and dBlk.x values are accessed for B, but only FS values are necessary. Therefore, dBlk.x should be at least FS but no larger than the next power of two, so dBlk.x = max(16, 2^⌈log2(FS)⌉). If FS is greater than the upper limit of dBlk.x from Equation (5.12a), then dBlk.x = 512.
Therefore,

dBlk.x = min(max(16, 2^⌈log2(FS)⌉), 512). (5.18)

Step 3 of deriving optimal input parameters specifies that the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces the time each HW in a block waits for synchronization, since there are fewer HWs per block. Since SMemPerBlk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4 and dBlk.x is constant from Equation (5.18), the minimum value of dBlk.y from Equation (5.17) is utilized. Since dBlk.y = max(64 / dBlk.x, 1) and from Equations (5.18), (5.12c), and (5.12d),

dBlk.x = min(max(16, 2^⌈log2(FS)⌉), 512)
dBlk.y = max(64 / dBlk.x, 1)
dGrd.x = n / dBlk.x
dGrd.y = n / dBlk.y. (5.19)

Given n and FS, all input parameters are constant and, therefore, no further steps are necessary to derive the optimal input parameters. So, Equation (5.19) defines the optimal input parameters for the shared memory implementation of convolution on a GPU utilizing the optimized computation and access pattern after fine-tuning.
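As a concrete check of Equation (5.19), the following host-side sketch computes the optimal input parameters for the fine-tuned convolution kernel given n and FS; the function name is illustrative.

```c
#include <stdio.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Sketch of Equation (5.19): optimal input parameters for the fine-tuned
 * shared memory convolution kernel, given data size n and filter size FS. */
static void conv_optimal_params(int n, int FS,
                                int *dBlk_x, int *dBlk_y,
                                int *dGrd_x, int *dGrd_y)
{
    int pow2 = 1;
    while (pow2 < FS) pow2 *= 2;              /* 2^ceil(log2(FS)) */
    *dBlk_x = imin(imax(16, pow2), 512);      /* Equation (5.18) */
    *dBlk_y = imax(64 / *dBlk_x, 1);
    *dGrd_x = n / *dBlk_x;
    *dGrd_y = n / *dBlk_y;
}

int main(void)
{
    const int fs[] = { 3, 63, 513 };
    for (int i = 0; i < 3; i++) {
        int bx, by, gx, gy;
        conv_optimal_params(16384, fs[i], &bx, &by, &gx, &gy);
        printf("FS=%3d  dBlk=(%d,%d)  dGrd=(%d,%d)\n", fs[i], bx, by, gx, gy);
    }
    return 0;
}
```

For the filter sizes tested, this yields dBlk = (16, 4) for FS = 3, (64, 1) for FS = 63, and (512, 1) for FS = 513, with the grid dimensions following from n.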
Figures 5.15-5.17 depict measured computation time of all input parameters for GPU computation of convolution. In the figures in this section, the x-axis represents the number of threads per block (dBlk.x · dBlk.y); therefore, several combinations of dBlk.x and dBlk.y yield an equivalent number of threads per block. Figure 5.15 depicts time for a filter size of 3.

Figure 5.15: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=3.

For a filter size of 3, derivation of optimal input parameters yields computation time, on average, within 9.6% of the minimum measured time. Although the percentage difference between the optimal input parameters and the minimum time is greater than for Mv and MM, the worst case utilizing optimal input parameters (n = 8192) yields time within 12.1% of the minimum. In the best case (n = 512), utilizing optimal input parameters yields time within 5.3% of the minimum. Computation time for a filter size of 63 is illustrated in Figure 5.16.

Figure 5.16: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=63.

As illustrated in Figure 5.16, for a filter size of 63, derivation of optimal input parameters yields computation time, on average, within 0.2% of the minimum measured time. The percentage difference between optimal input parameters and the minimum time is much less for a filter size of 63 than for a filter size of 3. In addition, the percentage difference is roughly equivalent to the difference measured for Mv and MM. In the worst case utilizing optimal input parameters (n = 512), the time is within 1.0% of the minimum measured. The time for the best case utilizing optimal input parameters (n = {1024, 2048}) is the minimum measured. Figure 5.17 depicts computation time for a filter size of 513.

Figure 5.17: Comparison of input parameters: Computation time (ms) of all input parameters for convolution on the T10 GPU. FS=513.

For a filter size of 513, derivation of optimal input parameters yields computation time, on average, within 0.3% of the minimum measured time. In the worst case (n = 512), utilizing optimal input parameters yields time within 1.9% of the minimum measured. In the best case (n = {1024, 2048, 4096, 8192, 16384}), utilizing optimal input parameters yields the minimum measured time. Similar to a filter size of 63, the percentage difference between optimal input parameters and the minimum time is much less for a filter size of 513 than for a filter size of 3. Likewise, the percentage difference is roughly equivalent to the difference measured for Mv and MM. For all data and filter sizes tested, the worst-case time utilizing optimal input parameters occurs for a filter size of 3 and is within 12.1% of the minimum measured.

GPU Computation Summary

After Step 5, the only remaining step of the parallelization procedure is to determine the optimal partitioning of computation between the CPU and GPU; the portion of the procedure concerned with minimizing GPU computation time is therefore complete. So, this section illustrates the maximum effective bandwidth of each aforementioned step of the parallelization procedure as it pertains to GPU computation time. In addition, a comparison of the procedure to the theoretical bandwidth is illustrated. For each figure in this subsection, Naïve denotes the global memory implementation, Sh. Mem. denotes the shared memory implementation, O.C.P. denotes the optimized computation pattern utilizing shared memory, and O.A.P. denotes the optimized access pattern utilizing the optimized computation pattern and shared memory. P.P. denotes the entire parallelization procedure with regards to GPU computation, which utilizes shared memory, the optimized computation and access patterns, fine-tuning adjustments, and optimal input parameters. Figures 5.18-5.20 depict the maximum effective bandwidth of convolution for each step of the parallelization procedure, in addition to the theoretical bandwidth. Figure 5.18 depicts the maximum effective bandwidth for a filter size, FS, of 3.

Figure 5.18: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=3.

The parallelization procedure yields a speedup, on average, of 4.07 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 3.74 and 4.39, respectively.
The procedure yields effective bandwidth, on average, of 39.0% of the theoretical bandwidth. The worst case (n = 512) and best case (n = 16384) utilizing the procedure yield effective bandwidths of 39.0% and 48.2% of the theoretical, respectively. As mentioned, the procedure yields computation time utilizing optimal input parameters, on average, within 9.6% of the minimum. Therefore, as shown in Figure 5.18, the maximum effective bandwidth is slightly higher for the O.A.P. than for the P.P., which includes the optimal input parameters. Regardless, the procedure yields a significant speedup compared to the naïve implementation. Figure 5.19 depicts effective bandwidth for a filter size of 63.

Figure 5.19: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=63.

In Figure 5.19, the large difference between the O.A.P. and the procedure is due to the effect of fine-tuning on computation time being dependent on the filter size. For a small filter size (Figure 5.18), the O.A.P. slightly outperforms the procedure for some data sizes. However, for a filter size of 63 (Figure 5.19), the procedure significantly outperforms the O.A.P. for all data sizes. Regardless, the procedure yields a speedup, on average, of 13.31 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 12.85 and 13.52, respectively. The procedure yields effective bandwidth, on average, of 64.2% of the theoretical bandwidth. In the worst case (n = 512), the effective bandwidth is 62.6% of the theoretical. In the best case (n = 16384), the procedure yields 64.6% of the theoretical bandwidth. Lastly, effective bandwidth for a filter size of 513 is depicted in Figure 5.20.

Figure 5.20: Comparison of effective bandwidth (GB/s) for convolution on the T10 GPU. FS=513.

Similar to a filter size of 63, the large difference between the O.A.P. and the procedure is due to the effect of fine-tuning on computation time being dependent on the filter size. The procedure yields a speedup, on average, of 13.38 compared to the naïve implementation. The minimum (n = 512) and maximum (n = 16384) speedups are 9.82 and 15.04, respectively. The procedure yields effective bandwidth, on average, of 68.7% of the theoretical bandwidth. In the worst case utilizing the procedure (n = 512), the effective bandwidth is 65.9% of the theoretical. In the best case (n = 16384), the procedure yields 69.5% of the theoretical bandwidth. Therefore, for all data and filter sizes tested, the parallelization procedure significantly reduces GPU computation time compared to the naïve implementation.

5.1.6 Computation Partitioning

The last step of the parallelization procedure, Step 6, is to determine the optimal partitioning of computation between the CPU and GPU. For convolution, 2n^2 FS^2 float values are necessary for computation. If FS ≤ 15, each thread in a HW reuses FS float values from global memory. However, if FS > 15, each thread in a HW reuses 16 float values from global memory. Therefore, 8n^2 FS^2 / min(FS, 16) bytes are necessary from global memory and the theoretical bandwidth is min(FS, 16) · 94 GB/s. Therefore, T_GPUcomp = 8n^2 FS^2 / (1024^3 · min(FS, 16) · 94).
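The short sketch below evaluates this GPU time estimate; the function name is illustrative, and the constants are those stated above (94 GB/s global memory bandwidth, reuse of min(FS, 16) values per thread).

```c
#include <stdio.h>

/* Sketch of the convolution GPU time estimate: 2*n*n*FS*FS float values are
 * needed, each thread in a HW reuses min(FS,16) values from global memory,
 * and the resulting theoretical bandwidth is min(FS,16)*94 GB/s. */
static double conv_t_gpu(double n, double FS)
{
    double reuse = FS < 16.0 ? FS : 16.0;
    double bytes_needed = 8.0 * n * n * FS * FS;       /* 2*n^2*FS^2 floats  */
    double bandwidth = reuse * 94.0 * 1073741824.0;    /* bytes per second   */
    return bytes_needed / bandwidth;                   /* T_GPUcomp, seconds */
}

int main(void)
{
    printf("n=16384, FS=63: T_GPUcomp = %.3f s\n", conv_t_gpu(16384.0, 63.0));
    return 0;
}
```
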
Communication requires two CPU to GPU transfers and one GPU to CPU transfer. Substituting for b, c0, and c1 into Equations (3.14) and (3.15) and simplifying yields the communication time in seconds. Since communication time is dependent on FS and n, if FS < 256,

T_comm = 3.02×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 5.00×10^-5   if n < 256
    3.06×10^-9 · n^2 + 9.60×10^-5   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 4.17×10^-4   if n ≥ 512.

If 256 ≤ FS < 512,

T_comm = 1.08×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 9.60×10^-5   if n < 256
    3.06×10^-9 · n^2 + 1.42×10^-4   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 4.63×10^-4   if n ≥ 512.

If FS ≥ 512,

T_comm = 1.08×10^-9 · FS^2 +
    5.00×10^-9 · n^2 + 2.27×10^-4   if n < 256
    3.06×10^-9 · n^2 + 2.73×10^-4   if 256 ≤ n < 512
    2.30×10^-9 · n^2 + 5.94×10^-4   if n ≥ 512.

From Lines 1-2 of Listing 4.18, computation is partitioned for CPU execution if T_CPUcomp ≤ T_GPUcomp + T_comm. Therefore, after substituting the CPU computation time estimate, if

8n^2 FS^2 / (1024^3 · B_CPU) ≤ 8n^2 FS^2 / (1024^3 · min(FS, 16) · 94) + T_comm,

where B_CPU denotes the CPU effective bandwidth (in GB/s) estimated in Chapter 3, computation is performed on the CPU to minimize execution time. Assuming the minimum filter size (FS = 3), solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 64. Therefore, if FS = 3 and n > 64, computation is performed on the GPU to minimize execution time. If FS ≥ n, solving for n yields T_CPUcomp ≤ T_GPUcomp + T_comm if n ≤ 11. Therefore, if FS ≥ n > 11, computation is performed on the GPU.

Figures 5.21-5.23 illustrate measured execution time for convolution. Since BLAS routines do not include convolution, CPU execution time is the measured time of a non-optimized C implementation. Likewise, CUBLAS does not include convolution, so it is excluded. Figure 5.21 depicts execution time for convolution with a filter size of 3.

Figure 5.21: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 3.

In Figure 5.21, the parallelization procedure yields a speedup, on average, of 5.30 compared to the CPU implementation and 1.61 compared to the naïve implementation. The procedure yields execution time, on average, within 11.9% of the theoretical minimum execution time. In the worst case utilizing the procedure (n = 8192), execution time is within 13.2% of the theoretical minimum. The best case utilizing the procedure (n = 512) yields time within 11.1% of the theoretical minimum. Similar to Mv, the speedup compared to the naïve implementation in Figure 5.21 is less than in Figure 5.18 due to communication time. For convolution with a small filter size, the communication time is a significant portion of execution time. Therefore, large speedups measured for computation of convolution with a small filter size have less of an impact on execution time, since the execution time includes communication and computation time. Execution time for convolution with a filter size of 63 is illustrated in Figure 5.22.

Figure 5.22: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 63.
The procedure yields a speedup, on average, of 323.21 compared to the CPU implemen- tation, and 12.33 compared to the na??ve implementation. In Figure 5.22, for a ?lter size of 63, communication time is a small portion of execution time compared to a ?lter size of 3. Therefore, the procedure yields a speedup in execution time similar to the speedup achieved for computation only. The procedure yields execution time, on average, within 33.0% of the theoretical minimum execution time. Worst-case utilizing the procedure (n = 512) yields time within 33.4% of the theoretical minimum. In the best-case (n = 1024), the procedure yields time within 32.9% of the theoretical minimum. The signi?cant speedup compared to the CPU implementation is presumably due to the non-optimized CPU implementation. Lastly, ?gure 5.23 depicts execution time for convolution with a ?lter size of 513. 137 512 1024 2048 4096 8192 1638410 2 104 106 108 1010 n me asu red ex ecu tio nt im e( ms ) CPU GPU (na??ve) P.P. Figure 5.23: Comparison of execution time (ms) for convolution on the T10 GPU. FS = 513. As illustrated in Figure 5.23, the procedure yields a speedup, on average, of 424.27 compared to the CPU implementation, and 13.36 compared to the na??ve implementation. Similar to a ?lter size of 63, the procedure yields a speedup in execution time similar to the speedup achieved for computation only. This is due to communication time being a small portion of execution time. The procedure yields execution time, on average, within 31.3% of the theoretical minimum execution time. In the worst-case (n = 512), the procedure yields time within 34.0% of the theoretical minimum. In the best-case utilizing the procedure (n = 16384), the time is within 30.5% of the theoretical minimum. As mentioned for FS = 63, the signi?cant speedup compared to the CPU implementation is presumably due to the non- optimized CPU implementation. For all data and ?lter sizes tested, the parallelization procedure yields a worst-case speedup of execution time, on average, of 1.61 compared to the na??ve GPU implementation. Therefore, results show the procedure is valid for convolution. 138 5.2 Conjugate Gradient This section is an illustration of execution time of the parallelization procedure applied to an application utilizing a matrix-based computation. The conjugate gradient method [65] is utilized to further validate the procedure. The conjugate gradient method is an iterative algorithm for solving a system of linear equations consisting of several vector-based computations (Level 1 BLAS) and Mv (Level 2 BLAS). The GPU (CUBLAS) implementation presented in the results utilizes CUBLAS routines for each vector- and matrix-based computation. Therefore, execution time is the sum of GPU computation and communication time. The CPU (ATLAS) implementation utilizes BLAS routines via ATLAS for all vector- and matrix-based computation and no communication is necessary. The na??ve implementation and parallelization procedure utilize the GPU for matrix-based computations and CPU for vector-based computations. The na??ve implementation utilizes a na??ve GPU kernel. The procedure utilizes the GPU kernel developed by the procedure. Since the matrix in Mv is constant, it is necessary to transfer it to the GPU once. However, for the vector in Mv, it is necessary to transfer it to the GPU in each iteration. Likewise, it is necessary to transfer the result of Mv to the CPU in each iteration. 
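The sketch below illustrates the hybrid partitioning described above for the conjugate gradient method: the matrix is copied to the GPU once, each iteration transfers the search direction to the GPU, performs Mv with the kernel developed in Chapter 4 (referred to here as mv_kernel, an illustrative name and signature), copies the product back, and performs the Level 1 BLAS work on the CPU. Error checking and convergence testing are omitted.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Mv kernel developed by the procedure in Chapter 4 (assumed available). */
__global__ void mv_kernel(const float *A, const float *x, float *y, int n);

static float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

void conjugate_gradient(const float *A, const float *b, float *x, int n,
                        int iters, dim3 dGrd, dim3 dBlk)
{
    float *dA, *dp, *dAp;
    cudaMalloc(&dA,  (size_t)n * n * sizeof(float));
    cudaMalloc(&dp,  n * sizeof(float));
    cudaMalloc(&dAp, n * sizeof(float));
    cudaMemcpy(dA, A, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice); /* once */

    float *r  = (float *)malloc(n * sizeof(float));
    float *p  = (float *)malloc(n * sizeof(float));
    float *Ap = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }   /* x0 = 0 */
    float rr = dot(r, r, n);

    for (int it = 0; it < iters; it++) {
        cudaMemcpy(dp, p, n * sizeof(float), cudaMemcpyHostToDevice);  /* per-iteration transfer */
        mv_kernel<<<dGrd, dBlk>>>(dA, dp, dAp, n);                     /* Mv on the GPU */
        cudaMemcpy(Ap, dAp, n * sizeof(float), cudaMemcpyDeviceToHost);

        float alpha = rr / dot(p, Ap, n);              /* Level 1 BLAS work on the CPU */
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rr_new = dot(r, r, n);
        float beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
    cudaFree(dA); cudaFree(dp); cudaFree(dAp);
}
```
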
For the conjugate gradient method, the number of iterations necessary for the solution to converge depends on the data size and the input. Therefore, execution time is depicted for 8 and 256 iterations. Figure 5.24 illustrates the measured execution time for the conjugate gradient method through 8 iterations.

[Plot: measured execution time (ms) vs. n for the CPU (ATLAS), GPU (naïve), GPU (CUBLAS), and P.P. implementations.]
Figure 5.24: Comparison of execution time (ms) for 8 iterations of the conjugate gradient method on the T10 GPU.

As illustrated in Figure 5.24, the parallelization procedure yields a significant speedup, on average, of 11.15 compared to the ATLAS implementation; the minimum (n = 512) and maximum (n = 16384) speedups are 3.16 and 22.30, respectively. The procedure yields a speedup, on average, of 6.84 compared to the naïve implementation; the minimum (n = 4096) and maximum (n = 16384) speedups are 4.30 and 12.88, respectively. The procedure yields a speedup, on average, of 3.00 compared to the CUBLAS implementation; the minimum (n = 16384) and maximum (n = 512) speedups are 1.12 and 8.02, respectively. Although communication time is included in Figure 5.24, the procedure still yields a significant speedup in execution time for a small number of iterations of the conjugate gradient method.

Figure 5.25 illustrates the measured execution time for the conjugate gradient method through 256 iterations.

[Plot: measured execution time (ms) vs. n for the CPU (ATLAS), GPU (naïve), GPU (CUBLAS), and P.P. implementations.]
Figure 5.25: Comparison of execution time (ms) for 256 iterations of the conjugate gradient method on the T10 GPU.

The parallelization procedure yields a speedup, on average, of 26.89 compared to the ATLAS implementation; the minimum (n = 512) and maximum (n = 16384) speedups are 2.67 and 61.56, respectively. The procedure yields a speedup, on average, of 16.17 compared to the naïve implementation; the minimum (n = 1024) and maximum (n = 16384) speedups are 9.48 and 38.74, respectively. The procedure yields a speedup, on average, of 5.88 compared to the CUBLAS implementation; the minimum (n = 16384) and maximum (n = 512) speedups are 1.01 and 15.99, respectively. Similar to a small number of iterations of the conjugate gradient method, the procedure yields a significant speedup of execution time for a large number of iterations.

Comparing Figures 5.24 and 5.25 shows that the speedup achieved by the procedure increases as the number of iterations increases. This is due to the amortization of the communication time for transferring the matrix to the GPU. As mentioned, for the conjugate gradient method the matrix is constant and must be transferred to the GPU only once. As the number of iterations increases, the contribution of this transfer to execution time decreases. Therefore, the speedup in execution time achieved by the procedure increases with the number of iterations.

As illustrated in Figures 5.24 and 5.25, the parallelization procedure yields the minimum execution time of the conjugate gradient method for all data sizes and iteration counts tested.
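To state the amortization argument explicitly, using notation introduced here for illustration (not in the original analysis), let $T_{A\rightarrow GPU}$ denote the one-time transfer of the matrix to the GPU and $T_{iter.}$ the per-iteration cost of the Mv kernel, the vector transfers, and the CPU vector operations. Then, for $k$ iterations,
$$T_{exec.}(k) \approx T_{A\rightarrow GPU} + k\,T_{iter.},$$
so the fraction of execution time spent on the matrix transfer, $T_{A\rightarrow GPU}/T_{exec.}(k)$, shrinks as $k$ grows from 8 to 256, consistent with the larger speedups observed in Figure 5.25 than in Figure 5.24.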
Chapter 6
Conclusion

In this work, a parallelization procedure is developed to minimize execution time for matrix-based computations on a GPU. The procedure considers such factors as the placement of data, computation patterns, access patterns, fine-tuning, input parameters, communication time, and computation partitioning. The procedure is applied to Mv, MM, convolution, and the conjugate gradient method.

An accurate examination of the layout of GPU memory is presented. Partition camping in global memory is examined and its effects on GPU computation time are shown. Bank conflicts in shared memory are analyzed and their effects are shown. Following the memory layout, execution metrics are formulated as functions of the input parameters to accurately represent the computational behavior of the GPU, and the necessity of the execution metrics is shown. In addition to representing the computational behavior of the GPU, communication time between the CPU and GPU is modeled and estimates are presented. Computation time is modeled for the CPU utilizing ATLAS and for the GPU assuming a maximum bandwidth.

From the layout of GPU memory, the execution metrics, and the communication and computation time estimates, a parallelization procedure is developed through analysis and testing of Mv and MM to minimize execution time for matrix-based computations. The procedure determines the placement of data in GPU memory and derives the optimized computation and access patterns to reduce computation time. Fine-tuning is performed on the GPU code to further reduce computation time. From the execution metrics, optimal input parameters are derived to yield the minimum computation time. Results show that utilizing the optimal input parameters for Mv yields computation time, on average, within 1.8% of the minimum measured. For MM, results show that the optimal input parameters yield time, on average, within 0.5% of the minimum. Therefore, the derivation of optimal input parameters yields minimum or near-minimum computation time for each matrix-based computation tested.

For Mv, results show the parallelization procedure yields a speedup of GPU computation, on average, of 36.37 compared to the ATLAS CPU implementation, 18.98 compared to the naïve GPU implementation, and 1.47 compared to the CUBLAS GPU implementation. For MM, results show the procedure yields a speedup of GPU computation, on average, of 24.47 compared to the ATLAS CPU implementation, 14.58 compared to the naïve GPU implementation, and 1.07 compared to the CUBLAS GPU implementation.

Since the parallelization procedure is developed through analysis of measured data for Mv and MM, it is applied to convolution and the conjugate gradient method for further validation. Results show for convolution that the optimal input parameters yield time, on average, within 9.6% of the minimum for small filters and within 1.0% for large filters. For convolution with a small filter, results show the procedure yields a speedup of GPU computation, on average, of 26.75 compared to the non-optimized CPU implementation and 4.07 compared to the naïve GPU implementation. For convolution with a large filter, results show the procedure yields a speedup of GPU computation, on average, of 424.92 compared to the non-optimized CPU implementation and 13.38 compared to the naïve GPU implementation. For a small number of iterations of the conjugate gradient method, results show the procedure yields a speedup of execution time, on average, of 11.15 compared to the ATLAS CPU implementation, 6.84 compared to the naïve GPU implementation, and 3.00 compared to the CUBLAS implementation.
For a large number of iterations of the conjugate gradient method, results show the procedure yields a speedup of execution time, on average, of 26.89 compared to the ATLAS CPU implementation, 16.17 compared to the naïve GPU implementation, and 5.88 compared to the CUBLAS implementation. Therefore, the parallelization procedure developed minimizes execution time for matrix-based computations on a GPU.

Bibliography

[1] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of gpu algorithms for matrix-matrix multiplication," in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, ser. HWWS '04. New York, NY, USA: ACM, 2004, pp. 133-137. [Online]. Available: http://doi.acm.org/10.1145/1058129.1058148
[2] C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 185-196. [Online]. Available: http://dx.doi.org/10.1109/PACT.2005.10
[3] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: using data parallelism to program gpus for general-purpose uses," SIGARCH Comput. Archit. News, vol. 34, no. 5, pp. 325-335, Oct. 2006.
[4] W. Liu, W. Müller-Wittig, and B. Schmidt, "Performance predictions for general-purpose computation on gpus," in Parallel Processing, 2007. ICPP 2007. International Conference on, Sept. 2007, p. 50.
[5] S. Hong and H. Kim, "An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness," in Proceedings of the 36th annual international symposium on Computer architecture, ser. ISCA '09, 2009, pp. 152-163.
[6] ——. (2009) Memory-level and thread-level parallelism aware gpu architecture performance analytical model. [Online]. Available: http://www.cc.gatech.edu/~hyesoon/hong_report09.pdf
[7] ——, "An integrated gpu power and performance model," SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 280-289, June 2010.
[8] K. Kothapalli, R. Mukherjee, M. Rehman, S. Patidar, P. Narayanan, and K. Srinathan, "A performance prediction model for the cuda gpgpu platform," in High Performance Computing (HiPC), 2009 International Conference on, Dec. 2009, pp. 463-472.
[9] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded gpu using cuda," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, ser. PPoPP '08. New York, NY, USA: ACM, 2008, pp. 73-82. [Online]. Available: http://doi.acm.org/10.1145/1345206.1345220
[10] S. Ryoo, C. I. Rodrigues, S. S. Stone, J. A. Stratton, S.-Z. Ueng, S. S. Baghsorkhi, and W.-m. W. Hwu, "Program optimization carving for gpu computing," J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1389-1401, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2008.05.011
[11] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu, "Program optimization space pruning for a multithreaded gpu," in Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, ser. CGO '08. New York, NY, USA: ACM, 2008, pp. 195-204. [Online]. Available: http://doi.acm.org/10.1145/1356058.1356084
[12] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu, "An adaptive performance modeling tool for gpu architectures," SIGPLAN Not., vol. 45, no. 5, pp. 105-114, Jan. 2010.
[13] W.-M. Hwu, C. Rodrigues, S. Ryoo, and J. Stratton, "Compute unified device architecture application suitability," Computing in Science Engineering, vol. 11, no. 3, pp. 16-26, May-June 2009.
[14] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu, "Languages and compilers for parallel computing," J. N. Amaral, Ed. Berlin, Heidelberg: Springer-Verlag, 2008, ch. CUDA-Lite: Reducing GPU Programming Complexity, pp. 1-15. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-89740-8_1
[15] CUBLAS, NVIDIA, 2013. [Online]. Available: https://developer.nvidia.com/cublas
[16] "An updated set of basic linear algebra subprograms (blas)," ACM Trans. Math. Softw., vol. 28, no. 2, pp. 135-151, Jun. 2002. [Online]. Available: http://doi.acm.org/10.1145/567806.567807
[17] MAGMA, Innovative Computing Laboratory at the University of Tennessee, 2013. [Online]. Available: http://icl.cs.utk.edu/magma/docs/
[18] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[19] R. Nath, S. Tomov, and J. Dongarra, "Accelerating gpu kernels for dense linear algebra," in Proceedings of the 9th international conference on High performance computing for computational science, ser. VECPAR '10. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 83-92. [Online]. Available: http://dl.acm.org/citation.cfm?id=1964238.1964250
[20] ——, "An improved magma gemm for fermi graphics processing units," Int. J. High Perform. Comput. Appl., vol. 24, no. 4, pp. 511-515, November 2010. [Online]. Available: http://dx.doi.org/10.1177/1094342010385729
[21] Y. Li, J. Dongarra, and S. Tomov, "A note on auto-tuning gemm for gpus," in Proceedings of the 9th International Conference on Computational Science: Part I, ser. ICCS '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 884-892. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-01970-8_89
[22] R. Nath, S. Tomov, T. T. Dong, and J. Dongarra, "Optimizing symmetric dense matrix-vector multiplication on gpus," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 6:1-6:10. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063392
[23] R. Nath, S. Tomov, and J. Dongarra, "Blas for gpus," in Scientific Computing with Multicore and Accelerators, 2010, ch. 4, pp. 57-80.
[24] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense linear algebra solvers for multicore with gpu accelerators," in Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-8.
[25] S. Tomov, R. Nath, and J. Dongarra, "Accelerating the reduction to upper hessenberg, tridiagonal, and bidiagonal forms through hybrid gpu-based computing," Parallel Comput., vol. 36, no. 12, pp. 645-654, Dec. 2010.
[26] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid gpu accelerated manycore systems," Parallel Comput., vol. 36, no. 5-6, pp. 232-240, Jun. 2010.
[27] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, "A hybridization methodology for high-performance linear algebra software for gpus," vol. 2, pp. 473-484, 2011.
[28] ——, "Faster, cheaper, better - a hybridization methodology to develop linear algebra software for gpus," 2010.
[29] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The plasma and magma projects," in Journal of Physics: Conference Series, vol. 180, 2009.
[30] F. Song, S. Tomov, and J. Dongarra, "Efficient support for matrix computations on heterogeneous multi-core and multi-gpu architectures," 2011.
[31] J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning gemm kernels for the fermi gpu," Parallel and Distributed Systems, IEEE Transactions on, vol. 23, no. 11, pp. 2045-2057, 2012.
[32] M. Horton, S. Tomov, and J. Dongarra, "A class of hybrid lapack algorithms for multicore and gpu architectures," in Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, ser. SAAHPC '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 150-158. [Online]. Available: http://dx.doi.org/10.1109/SAAHPC.2011.18
[33] H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra, "A scalable high performant cholesky factorization for multicore with gpu accelerators," LAPACK Working Note #223.
[34] A. D. Malony, S. Biersdorff, S. Shende, H. Jagode, S. Tomov, G. Juckeland, R. Dietrich, D. Poole, and C. Lamb, "Parallel performance measurement of heterogeneous parallel systems with gpus," in Proceedings of the 2011 International Conference on Parallel Processing, ser. ICPP '11, 2011, pp. 176-185.
[35] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on gpus," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '10, 2010, pp. 115-126.
[36] D. Grewe and A. Lokhmotov, "Automatically generating and tuning gpu code for sparse matrix-vector multiplication from a high-level representation," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ser. GPGPU-4, 2011, pp. 12:1-12:8.
[37] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the gpu: conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, no. 3, pp. 917-924, Jul. 2003. [Online]. Available: http://doi.acm.org/10.1145/882262.882364
[38] B. Boyer, J.-G. Dumas, and P. Giorgi, "Exact sparse matrix-vector multiplication on gpu's and multicore architectures," in Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, ser. PASCO '10, 2010, pp. 80-88.
[39] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving gpu performance via large warps and two-level warp scheduling," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44 '11, 2011, pp. 308-317.
[40] A. Buluç, J. R. Gilbert, and C. Budak, "Gaussian elimination based algorithms on the gpu," 2008.
[41] L. Y. Kah, A. Akoglu, I. Guven, and E. Madenci, "High performance linear equation solver using nvidia gpus," in Adaptive Hardware and Systems (AHS), 2011 NASA/ESA Conference on, 2011, pp. 367-374.
[42] A. El Zein and A. Rendell, "From sparse matrix to optimal gpu cuda sparse matrix vector product implementation," in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, 2010, pp. 808-813.
[43] D. Q. Ren and R. Suda, "Power efficient large matrices multiplication by load scheduling on multi-core and gpu platform with cuda," in Computational Science and Engineering, 2009. CSE '09. International Conference on, vol. 1, 2009, pp. 424-429.
[44] S. Solomon and P. Thulasiraman, "Performance study of mapping irregular computations on gpus," in Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-8.
[45] G. Cummins, R. Adams, and T. Newell, "Scientific computation through a gpu," in Southeastcon, 2008. IEEE, 2008, pp. 244-246.
[46] Y. Sun and Y. Tong, "Cuda based fast implementation of very large matrix computation," in Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010 International Conference on, 2010, pp. 487-491.
[47] N. P. Karunadasa and D. N. Ranasinghe, "Accelerating high performance applications with cuda and mpi," in Industrial and Information Systems (ICIIS), 2009 International Conference on, 2009, pp. 331-336.
[48] A. Bustamam, K. Burrage, and N. Hamilton, "Fast parallel markov clustering in bioinformatics using massively parallel graphics processing unit computing," in Parallel and Distributed Methods in Verification, 2010 Ninth International Workshop on, and High Performance Computational Systems Biology, Second International Workshop on, 2010, pp. 116-125.
[49] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with cuda," Micro, IEEE, vol. 28, no. 4, pp. 13-27, July-Aug. 2008.
[50] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing cuda workloads using a detailed gpu simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, April 2009, pp. 163-174.
[51] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "Nvidia tesla: A unified graphics and computing architecture," Micro, IEEE, vol. 28, no. 2, pp. 39-55, March-April 2008.
[52] J. Nickolls and W. Dally, "The gpu computing era," Micro, IEEE, vol. 30, no. 2, pp. 56-69, March-April 2010.
[53] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors, 1st ed., 2010.
[54] G. Ruetsch and P. Micikevicius, Optimizing Matrix Transpose in CUDA, NVIDIA, 2009.
[55] A. M. Aji, M. Daga, and W.-c. Feng, "Bounding the effect of partition camping in gpu kernels," in Proceedings of the 8th ACM International Conference on Computing Frontiers, ser. CF '11. New York, NY, USA: ACM, 2011, pp. 27:1-27:10. [Online]. Available: http://doi.acm.org/10.1145/2016604.2016637
[56] J. Wu and J. JaJa, "Optimized strategies for mapping three-dimensional ffts onto cuda gpus," in Innovative Parallel Computing (InPar), 2012, May 2012, pp. 1-12.
[57] Y. Yang, P. Xiang, J. Kong, and H. Zhou, "A gpgpu compiler for memory optimization and parallelism management," in Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, ser. PLDI '10. New York, NY, USA: ACM, 2010, pp. 86-97. [Online]. Available: http://doi.acm.org/10.1145/1806596.1806606
[58] A. Dasgupta, "Cuda performance analyzer," Master's thesis, Georgia Institute of Technology, May 2011. [Online]. Available: http://smartech.gatech.edu/jspui/bitstream/1853/39555/1/dasgupta_aniruddha_s_201105_mast.pdf
[59] N. B. Lakshminarayana and H. Kim, "Effect of instruction fetch and memory scheduling on gpu performance," in Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
[60] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimization of software and the ATLAS project," Parallel Computing, vol. 27, no. 1-2, pp. 3-35, 2001. [Online]. Available: http://math-atlas.sourceforge.net/
[61] T. Bradley and P. Micikevicius. (2009) Advanced c for cuda. NVIDIA. [Online]. Available: http://www.gputechconf.com/object/gtc2009-on-demand.html
[62] HPC User Manual, The Alabama Supercomputer Authority, 2010.
[63] V. Volkov and J. W. Demmel, "Benchmarking gpus to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE conference on Supercomputing, ser. SC '08. Piscataway, NJ, USA: IEEE Press, 2008, pp. 31:1-31:11. [Online]. Available: http://dl.acm.org/citation.cfm?id=1413370.1413402
[64] CUDA Programming Guide, NVIDIA, 2013. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[65] M. R. Hestenes and E. Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems," Journal of Research of the National Bureau of Standards, vol. 49, pp. 409-436, Dec. 1952.