Modeling and Optimization of Parallel Matrix-based Computations on GPU
Type of Degreedissertation
MetadataShow full item record
As graphics processing units (GPUs) are continually being utilized as coprocessors, the demand for optimally utilizing them for various applications continues to grow. This work narrows the gap between programmers and minimum execution time for matrix-based computations on a GPU. To minimize execution time, computation and communication time must be considered. For computation, the placement of data in GPU memory significantly affects computation time and therefore is considered. Various matrix-based computation patterns are examined with respect to the layout of GPU memory. A computation pattern refers to the order in which each GPU thread computes a result. From examination of computation patterns, an optimized computation pattern, a pattern which reduces computation time, is derived. After the optimized computation pattern is derived, the access pattern to GPU memory, or order in which data is accessed, is considered. From the optimized access pattern, fine-tuning is performed to the GPU code such as minimization of index calculations and loop unrolling to further reduce computation time and resource allocation. After fine-tuning, input parameters which yield the minimum computation time for a matrix-based computation on the GPU are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Execution metrics are formulated, as functions of the input parameters, to represent the executional behavior of the GPU. The execution metrics are utilized to derive the optimal input parameters which are input parameters that yield the minimum computation time. The matrix-based computations considered are matrix-vector multiplication (Mv), matrix-matrix multiplication (MM), convolution, and the conjugate gradient method. A parallelization procedure is developed to minimize execution time by deriving the optimal placement of data, optimized computation pattern, optimized access pattern, and optimal input parameters. Each step of the procedure is developed through analysis of Mv and MM. The procedure includes an accurate communication model and estimates for CPU and GPU data transfers. With accurate estimates for communication and computation times, partitioning computation for CPU and GPU execution is performed to minimize execution time. Convolution and conjugate gradient are utilized to verify the validity of the procedure. Therefore, the overall goal of this work is to develop a parallelization procedure which minimizes execution time of matrix-based computations executing on the GPU.