Mitigating GPU Memory Divergence for Data-Intensive Applications
Type of Degree: Dissertation
Graphics Processing Units (GPUs) have proven to be a viable technology for a wide variety of general-purpose applications that exploit their massive computing capability and high computational efficiency. In GPUs, threads are organized into warps, and the threads in a warp execute in lock-step. GPUs deliver massive parallelism by alternating the execution of many concurrent warps and overlapping the long-latency off-chip memory accesses of some warps with the computation of other warps. Following the success of GPU acceleration for compute-intensive high-performance computing applications, the arrival of the big data era has energized a new trend of GPU acceleration for data-intensive applications. Pioneering works have demonstrated that GPU-based implementations of data-intensive applications can provide significant performance improvements over traditional CPU-based implementations. However, due to the complexity of managing GPU on-chip resources through high-level programming languages and the complicated memory access patterns in data-intensive applications, it often takes tremendous effort to optimize these applications for high performance. Memory divergence is a major performance bottleneck that prevents data-intensive applications from achieving high performance on GPUs. In the lock-step execution model, memory divergence refers to the case where intra-warp accesses cannot be coalesced into one or two cache blocks. Even though the impact of memory divergence can be alleviated through various software techniques, architectural support for memory divergence mitigation is still highly desirable to ease the programming and optimization of GPU-accelerated data-intensive applications. When memory divergence occurs, a warp incurs up to warp-size (e.g., 32) independent cache accesses.
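To make the definition of memory divergence concrete, the following minimal sketch counts how many distinct cache blocks the accesses of one warp fall into. The 128-byte block size, the base address, and the names `blocks_touched` and `is_divergent` are illustrative assumptions, not values or identifiers taken from the dissertation.

```python
WARP_SIZE = 32    # threads per warp, matching the example in the text
LINE_BYTES = 128  # assumed cache block size (illustrative)


def blocks_touched(addresses, line_bytes=LINE_BYTES):
    """Return the number of distinct cache blocks hit by a warp's accesses."""
    return len({addr // line_bytes for addr in addresses})


def is_divergent(addresses, line_bytes=LINE_BYTES):
    """Treat an access as divergent if it needs more than two cache blocks."""
    return blocks_touched(addresses, line_bytes) > 2


# Coalesced pattern: thread t reads the 4-byte word at base + 4 * t.
coalesced = [0x1000 + 4 * t for t in range(WARP_SIZE)]

# Divergent pattern: thread t reads from its own 128-byte block.
strided = [0x1000 + LINE_BYTES * t for t in range(WARP_SIZE)]

print(blocks_touched(coalesced), is_divergent(coalesced))
print(blocks_touched(strided), is_divergent(strided))
```

Under these assumptions the coalesced pattern needs a single block, while the strided pattern needs one block per thread, which is the worst case of warp-size independent cache accesses.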
Such a burst of divergent accesses not only generates a large volume of long-latency off-chip memory operations, but also exposes three new architectural challenges: intra-warp associativity conflicts, partial caching, and memory occlusion. Specifically, intra-warp associativity conflicts are caused by the pathological behavior of the current cache indexing method, which concentrates divergent intra-warp memory accesses into a few cache sets. Divergent memory accesses are often associated with high intra-warp locality, but current cache management cannot manage all cache lines of a warp as a unit, so data with high intra-warp locality is often only partially cached. Memory occlusion is a structural hazard in the GPU pipeline that occurs when the available Miss Status Holding Register (MSHR) entries are insufficient to track all the memory requests of a divergent load. In current GPUs, the common response to all three challenges is to replay the memory accesses that miss because of associativity conflicts, intra-warp locality loss, or MSHR unavailability. However, replaying memory accesses stalls execution in the Load/Store (LD/ST) units and eventually reduces instruction throughput in the warp schedulers, severely degrading the performance of memory-divergent benchmarks on GPUs.
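The associativity-conflict problem can be illustrated with a small sketch. It assumes a hypothetical cache with 32 sets and 128-byte lines, and compares conventional modulo indexing with a generic textbook XOR hash. The XOR function here is only a stand-in for the idea of hashed indexing; it is not claimed to be the exact Bitwise-XOR variant evaluated in the dissertation, and it is not the proposed indexing method.

```python
from collections import Counter

NUM_SETS = 32     # assumed number of cache sets (illustrative)
LINE_BYTES = 128  # assumed cache line size (illustrative)


def conventional_index(addr):
    """Conventional indexing: low-order bits of the block address."""
    return (addr // LINE_BYTES) % NUM_SETS


def xor_index(addr):
    """Generic XOR hash: index bits XORed with the low-order tag bits."""
    block = addr // LINE_BYTES
    return (block % NUM_SETS) ^ ((block // NUM_SETS) % NUM_SETS)


def sets_used(addresses, index_fn):
    """Map each cache set to the number of warp accesses that land in it."""
    return Counter(index_fn(a) for a in addresses)


# One divergent load: thread t accesses base + t * (NUM_SETS * LINE_BYTES),
# for example a column walk over a matrix with 4096-byte rows.
column_walk = [0x1000 + t * NUM_SETS * LINE_BYTES for t in range(32)]

conv = sets_used(column_walk, conventional_index)
xored = sets_used(column_walk, xor_index)
print(len(conv), max(conv.values()))
print(len(xored), max(xored.values()))
```

With this stride, conventional indexing sends all 32 accesses of the warp to a single set, so any associativity below 32 forces lines of the same warp to evict one another. The hashed index spreads the same accesses over 32 different sets.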
This dissertation introduces three novel and light-weight architectural modifications to independently solve the three challenges: 1) a Full Permutation (FUP) based GPU cache indexing method is presented to uniformly disperse intra-warp accesses into all available cache sets so that associativity conflicts can be eliminated; 2) a Divergence-Aware Cache (DaCache) Management technique is designed to orchestrate warp scheduling and cache management, make caching decisions at the granularity of individual warps, reduce partial caching of high intra-warp locality, and resist inter- and intra-warp cache thrashing; and 3) a Memory Occlusion Aware Warp Scheduler (OAWS) is proposed to dynamically predict the MSHR consumption of each divergent load instruction and only schedule warps that will not incur memory occlusion. The proposed techniques are implemented in a cycle-accurate GPGPU simulator and compared with closely related state-of-the-art techniques. Specifically, FUP is compared with the conventional indexing method, Bitwise-XOR, and Prime Number Displacement; DaCache is compared with two representative thrashing-resistant cache management techniques, Dynamic Insertion Policy (DIP) and Re-Reference Interval Prediction (RRIP); and OAWS is compared with state-of-the-art warp scheduling techniques that can mitigate the impact of memory occlusion, including Static Warp Limiting (SWL), Cache Conscious Wavefront Scheduling (CCWS), and Memory Aware Scheduling and Cache Access Re-execution (MASCAR). Data-intensive workloads from various publicly available GPU benchmark suites are used for performance evaluation. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of the proposed techniques and the viability of mitigating memory divergence through architectural support.
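The scheduling rule described for the occlusion-aware scheduler (predict the MSHR demand of a load, then issue only warps whose demand fits) can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's design: the per-PC last-value predictor, the pessimistic default of 32 entries, and the names `Warp`, `LastValuePredictor`, and `pick_warp` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    next_load_pc: int  # program counter of the load this warp would issue


class LastValuePredictor:
    """Hypothetical predictor: remember the last MSHR demand seen per PC."""

    def __init__(self, default=32):
        self.default = default  # pessimistic guess: one entry per thread
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc, self.default)

    def update(self, pc, observed_entries):
        self.table[pc] = observed_entries


def pick_warp(ready_warps, free_mshrs, predictor):
    """Return the first ready warp whose predicted demand fits, else None."""
    for warp in ready_warps:
        if predictor.predict(warp.next_load_pc) <= free_mshrs:
            return warp
    return None


predictor = LastValuePredictor()
predictor.update(0x40, 32)  # load at PC 0x40 was fully divergent
predictor.update(0x80, 2)   # load at PC 0x80 was nearly coalesced

warps = [Warp(warp_id=0, next_load_pc=0x40), Warp(warp_id=1, next_load_pc=0x80)]
chosen = pick_warp(warps, free_mshrs=8, predictor=predictor)
print(chosen.warp_id if chosen else None)
```

With 8 free MSHR entries, warp 0 is skipped because its load is predicted to need 32 entries, and warp 1 is issued instead. When no warp fits, the sketch returns `None`, so nothing is issued and no access has to be replayed.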
This dissertation also identifies room for further optimization of the proposed solutions and other promising opportunities for future research on GPU architecture.