dc.description.abstract | The rapid growth of artificial intelligence (AI) and deep learning (DL) workloads has created an urgent need for more efficient and high-performance AI accelerators, both at the edge and in cloud data centers. The computational and memory demands of large models, such as ChatGPT and Sora, have far outpaced advancements in semiconductor technology, leading to the emergence of the memory wall and area wall. These challenges necessitate the exploration of new technologies and methodologies. This dissertation presents a comprehensive investigation into emerging memory technologies, innovative architectural designs, and optimization methodologies aimed at improving energy efficiency, performance, and area utilization in AI accelerators.
First, we introduce a high-performance AI accelerator that incorporates spin transfer torque magnetic RAM (STT-MRAM) as the on-chip memory system. Through model-driven design space exploration, we develop a novel scratchpad-assisted buffer architecture that
optimizes memory retention time, read/write latency, and energy efficiency by dynamically adjusting for process and temperature variations. Our STT-MRAM-based design (STT-AI) achieves a 75% reduction in area and 3% power savings compared to SRAM-based systems, with minimal trade-offs in accuracy, demonstrating its suitability for modern AI workloads.
Next, we address the limitations of existing accelerators in handling large-batch AI training and inference due to memory bandwidth and capacity constraints. We propose a design technology co-optimization (DTCO)-enabled memory system utilizing spin-orbit torque magnetic RAM (SOT-MRAM) to significantly increase on-chip memory capacity. The limitations posed by STT-MRAM are also addressed by introducing SOT-MRAM. This workload-aware memory system shifts AI accelerators from being memory-bound to achieving system-level peak performance. Our results show an 8× improvement in energy efficiency and 9× reduction in latency for computer vision benchmarks, along with substantial gains in natural language processing tasks, while consuming just 50% of the area compared to SRAM at the same capacity.
Finally, to address the limitations of large monolithic designs, we explore the potential of chiplet-based architectures for AI accelerators. The vast design space and complex trade-offs between power, performance, area, and cost (PPAC) require a systematic optimization approach. We introduce an optimization framework, Chiplet-Gym, which integrates heuristic-based methods, such as simulated annealing (SA), with learning-based algorithms, such as reinforcement learning (RL), to evaluate and optimize chiplet-based AI accelerator designs by accounting for resource allocation, placement, and packaging architecture. Our results indicate that reinforcement learning demonstrates greater stability and achieves a 16% higher cost model value than simulated annealing. The framework-suggested design
choice delivers a 1.52×improvement in throughput, a 0.27×reduction in energy, and a 0.89×lower cost compared to monolithic designs at iso-area, underscoring the potential of chiplet architectures for the next-generation AI hardware. | en_US |