A Low-Power Analog Bus for On-Chip Digital Communication

by

Farah Naz Taher

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
August 3, 2013

Keywords: System-on-Chip Design, Low Power Design, On-chip Communication, Bus Architectures, Power Management, Analog Bus

Copyright 2013 by Farah Naz Taher

Approved by

Vishwani D. Agrawal, Chair, James J. Danaher Professor of Electrical Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering
Adit D. Singh, James B. Davis Professor of Electrical and Computer Engineering

Abstract

At present, the performance and efficiency of a system-on-chip (SoC) design depend significantly on the on-chip global communication across various modules on the chip. On-chip communication is mostly implemented using a bus architecture that runs long distances, covering a significant area of the integrated circuit. Difficult challenges in designing a large SoC, e.g., one containing many processor cores, include hardware area, power dissipation, routing complexity, congestion and latency of the communication network. In this work, we propose an analog bus for digital data. In our scheme we replace the n wires of an n-bit digital bus carrying data between cores with just one (or a few) wire(s) carrying analog signal(s) encoding $2^n$ levels of voltage. This analog bus uses digital-to-analog converter (DAC) drivers and analog-to-digital converter (ADC) receivers. Such an on-chip communication scheme can potentially save hardware area and power. The reduction in the number of wires saves chip area, and the reduction in total intrinsic wire capacitance consequently reduces bus power consumption. The scheme should also reduce signal interference and crosstalk by eliminating the need for multiple line drivers and buffers. In spite of the overheads of the DACs and ADCs, the savings in power consumption from our scheme are significant.
We have carried out simulated experiments that serve as a proof of concept by evaluating the power consumption of a single wire with DAC/ADC encoding in comparison to an n-bit digital bus of a large system. SPICE simulation for an ideal case shows that the ratio of bus power consumed by the proposed analog scheme to that of a typical digital scheme (without bus encoding or differential signaling) is given by $P_{analog}/P_{digital} = 1/(3n)$. For a 500 MHz frequency and a 1 mm intermediate wire, the analog bus replacing a 4-bit parallel bus consumes 16 µW, compared with 219 µW for the parallel bus. The analog bus replacing an 8-bit parallel bus consumes 18 µW, compared with 470 µW for the 8-bit parallel bus.

Acknowledgments

"The more you learn, the more you realize how little you know" has become the code I live by, especially in the last two years. In the process of attaining this MS degree, I have realized at every step that whatever little I have achieved was not possible without the amazing people I have in my life as mentors, friends and family.

First and foremost I want to thank my advisor, Dr. Vishwani Agrawal, for being there for me from my first day in Auburn. He is a great mentor, guide, and teacher. He has always been very supportive, and guided me with encouragement, patience and judicious advice. I would like to thank Dr. Adit Singh, not only for being on my thesis committee, but also for the two wonderful courses I had the privilege of taking with him. He has always been helpful and kind. I also thank Dr. Victor Nelson for agreeing to be on my thesis committee, and for giving his detailed feedback on the thesis. I express my sincere appreciation and gratitude to Mr. Charles Ellis for giving me the opportunity to work in the Alabama Microelectronics Science and Technology Center. He helped me out in a difficult time by providing me with a research assistantship. I thank Dr. Suraj Sindia for all his help and suggestions. He has always been selflessly helping everyone in need.
I would like to take the opportunity to thank all my teachers, from my school, to North South University, to here at Auburn University. I thank Mustafa Munawar Shihab for being my brother and my best friend. All the hard work we did together now feels worth it as we complete our Masters, and achieve a goal together yet again. Thank you, Brother.

No words are adequate to express my gratefulness to my family for their unconditional love and support. I am more than lucky to have such a family who understands and supports my goal. I thank my mother Shahnaz Sultana and my father Abu Taher Chowdhury for all the sacrifices they have made, and for all their encouragement that made me what I am today. I thank my sister Mayesha Naz Taher for all the love and courage she gave me. I thank my husband Muhammad Asaduzzaman Shanto for all his love, patience, and support. I am lucky to have such a supportive and patient life partner. I dedicate my work to my awesome family.

Finally, I thank the Almighty for this wonderful life, and for the wonderful people He has filled it up with.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Problem Statement
  1.4 Organization
2 Overview: Power
  2.1 Power Dissipation
    2.1.1 Dynamic Power
    2.1.2 Static Power
  2.2 Power Reduction Techniques
    2.2.1 Technology Scaling
    2.2.2 Transistor Sizing
    2.2.3 Interconnect Optimization
    2.2.4 Clock Gating
    2.2.5 Power Gating
    2.2.6 Supply Voltage and Threshold Voltage Scaling
    2.2.7 Multi-Voltage Design
    2.2.8 Variable Supply and Threshold Voltages
    2.2.9 Technology Mapping
    2.2.10 Floorplanning, Cell Placement and Wire Routing
3 On-Chip Communication
  3.1 Bus Architecture
  3.2 Bus Topology
  3.3 Issues With Parallel Bus
    3.3.1 Routing Complexity
    3.3.2 Area
    3.3.3 Power Dissipation
    3.3.4 Performance
    3.3.5 Signal Integrity and Crosstalk
  3.4 Possible Solutions
4 Previous Work
  4.1 NOC
  4.2 SerDes
    4.2.1 Construction of SerDes
    4.2.2 SERDES Approaches
5 Analog Bus
  5.1 Concept
  5.2 Structure
  5.3 Proposition
  5.4 Vswing
  5.5 Theoretical Analysis
    5.5.1 Voltage
    5.5.2 Power
6 Data Conversion
  6.1 Analog to Digital Converter
  6.2 Digital to Analog Converter
  6.3 Design considerations
7 Evaluation
  7.1 Experiment Setup
  7.2 Power Analysis: Replacement of 4-Bit Parallel Bus
  7.3 Power Analysis: Replacement of 8-Bit Parallel Bus
  7.4 Discussion of Results
8 Conclusion
  8.1 Challenges and Future Work
    8.1.1 Design suitable converters
    8.1.2 Encoding Scheme
    8.1.3 Combination of Analog Bus with other schemes
    8.1.4 Mixed-Signal Compression of Digital Test Data
Bibliography

List of Figures

1.1 A brief chronology of the major milestones in the development of VLSI [65].
1.2 Four dimensions of optimization in VLSI design.
2.1 Switching Power [27].
2.2 Short-Circuit Power [27].
2.3 Static power [27].
2.4 Clock gating.
3.1 IBM Cell ring bus communication architecture [53].
3.2 Bus structure [53].
3.3 Shared bus [53].
3.4 Hierarchical bus [53].
3.5 Ring bus [53].
3.6 Split bus [53].
3.7 Crossbar bus [53].
3.8 Partial crossbar/matrix bus [53].
3.9 Tristate buffer bus [53].
4.1 Various communication architectures [8].
4.2 SerDes [28].
4.3 SerDes structure [28].
4.4 Example of serialization [37].
4.5 The Silent scheme [38].
5.1 Total interconnect length (m/cm²) - Metal 1 and five intermediate levels, active wiring only [60].
5.2 Parallel bus and analog bus.
5.3 Vswing
6.1 Signals resulting from A/D and D/A conversion in a mixed-signal system [5].
6.2 Basic ADC structure [30].
6.3 Basic DAC structure [30].
7.1 4-bit parallel bus.
7.2 Analog bus replacing 4-bit parallel bus of Figure 7.1.
7.3 Experimental setup for analog bus replacing a 4-bit parallel bus.
7.4 4-bit input patterns.
7.5 4-bit digital input converted to analog data.
7.6 Parallel bus vs. analog bus (bus width = 4, frequency = 1 GHz).
7.7 Parallel bus vs. analog bus (bus width = 4, frequency = 500 MHz).
7.8 An analog bus to replace an 8-bit parallel bus.
7.9 8-bit input patterns.
7.10 8-bit digital input converted to analog data.
7.11 Parallel bus vs. analog bus (bus width = 8, frequency = 500 MHz).

List of Tables

2.1 Strategies for low power designs [26].
2.2 Trade-offs associated with power management techniques [26].
4.1 Comparison overview of advantages/disadvantages of SerDes architectures [39].
5.1 Bit-wise noise tolerance of analog bus.
5.2 Random data patterns and transition analysis.
5.3 Comparison of parallel, serial and analog buses.
7.1 Setup
7.2 Comparison of power consumption of 4-bit parallel bus and analog bus for frequency = 1 GHz.
7.3 Comparison of power consumption of 4-bit parallel bus and analog bus for frequency = 500 MHz.
7.4 Comparison of power consumption of 8-bit parallel bus and analog bus for frequency = 500 MHz.
7.5 Power consumption of 4-bit and 8-bit buses.
7.6 Converter design survey [52].

Chapter 1
Introduction

The transistor, one of the most important discoveries of the 20th century and the heart of electronics, was invented at Bell Labs in New Jersey in 1947 by John Bardeen, Walter Brattain, and William Shockley. The second gigantic step, the invention of the integrated circuit, took place simultaneously at Fairchild and Texas Instruments from 1957 to 1959.
So, it has been more than sixty years since the invention of the bipolar transistor and more than fifty years since the invention of integrated circuit (IC) technology, and the electronics industry has grown extraordinarily, with a massive impact on the way people live and work. In the last thirty years or so, the area of the industry with by far the most development has been the VLSI of silicon chips. A brief chronology of the major milestones in the development of the VLSI industry is depicted in Figure 1.1.

Figure 1.1: A brief chronology of the major milestones in the development of VLSI [65].

In 1965, Gordon Moore observed that integrated circuit (IC) complexity evolved exponentially, that manufacturers had been doubling the density of components per integrated circuit at regular intervals, and that they would carry on doing so as far as the eye could perceive [47-49]. As an outcome of these observations, in the 1970s a scaling algorithm known as Moore's Law was developed [46]. It stated that device feature sizes would decrease by a factor of 0.7 every three years. The accuracy of Moore's Law in predicting growth in IC complexity made it a reliable method for calculating future trends, as well as for setting the pace of innovation and competition. But in the latest technology nodes it appears that Moore's Law and the semiconductor industry are in the middle of a perfect storm [42]. Semiconductor growth is presently limited by overall electronics growth, and the "smaller the better" approach is no longer viable. Innovation will surely go on, and go on strong, but not through the traditional scaling of feature sizes, which is at or close to saturation.

1.1 Motivation

Until recent years, power was a second-order optimization issue in chip design, following the first-order concerns of area, timing and testability. But now, for most System-on-Chip (SoC) designs, the power budget is one of the most significant design objectives of a project.
Reliability issues are becoming increasingly vital for SoC design because of the use of nanometer technology. Exceeding a power budget can be fatal, causing poor reliability, reduced battery life, and increased temperature. Increased temperature decreases mean time to failure exponentially, degrades timing, and increases leakage. It also introduces packaging and cooling challenges.

Chip design has four distinctive features:
I. Computation
II. Memory
III. Communication
IV. Input/Output

Figure 1.2: Four dimensions of optimization in VLSI design.

To continue the performance growth, the microprocessor industry has shifted to multi-core scaling by increasing the number of cores per die each generation. Many researchers believe this core scaling will continue into hundreds or maybe thousands of cores per chip [9, 17]. Increased processing power and data-intensive applications have drawn attention to the communication aspect of the system. Continuous voltage scaling has decreased the noise margin, making interconnects susceptible to crosstalk, power supply noise, process variation, and radiation defects. The design of SoCs is becoming increasingly difficult, as adding more and more functionality tightens the already complex size, performance, and power consumption constraints.

On-chip global communication is required for data and control transfer across various modules on the chip, and it significantly determines the performance of integrated circuits in current technology. Global and intermediate bus architectures do not follow transistor scaling, which makes long-range on-chip data communication challenging in terms of latency, throughput, and power. A difficult challenge at present is the routing complexity and congestion of parallel buses that span large distances on the chip, connecting various modules placed all around the chip.
Buses not only have to compete with the power grid, clocks and other global signals for global resources; the process of boosting their performance by inserting drivers, repeaters and registers also makes them considerably area-hungry. These performance-enhancing techniques increase power dissipation due to increased capacitances. The power consumed by the interconnect for on-chip global communication now accounts for a significant fraction of the total power of a system, and this fraction is expected to grow as technology scales further. To address this issue of increased energy consumption, circuit techniques such as low-swing signaling and bit encoding can be used. As switching activity determines the dynamic power dissipation, some methods attempt to reduce the number of transitions on the bus. Techniques like Adaptive Supply Voltage Links are deployed at the system level for energy-efficient on-chip global communication.

1.2 Contribution

Improvement of overall performance cannot be achieved by a single technology improvement. It is a product of all the technologies from semiconductor to system design. This work focuses on methods for possible reductions of the power consumption and area of the bus architecture for on-chip communication. Modern SoC devices need a significant amount of data transfer and computing power, which implies that the number of on-chip modules will increase, as will the number of on-chip buses connecting them. Due to technology scaling, the delay and power dissipation of on-chip communication are becoming one of the major bottlenecks in current SoC designs. This thesis proposes the design of an on-chip analog bus to replace the current parallel bus. The reduction in the number of wires saves chip area, and the reduction in total intrinsic wire capacitance consequently reduces bus power consumption. The scheme should also reduce signal interference and crosstalk by eliminating the need for multiple line drivers and buffers.
An analog bus can even be useful for short chip-to-chip interconnections in order to reduce pin and trace counts. This analog bus uses digital-to-analog converter (DAC) drivers and analog-to-digital converter (ADC) receivers. We replace the n wires of an n-bit digital bus carrying data between cores with just one (or a few) wire(s) carrying analog signal(s) encoding $2^n$ levels of voltage. Such an on-chip communication scheme can potentially save area and power in spite of the additional DACs and ADCs used. Theoretical and experimental work has been done to validate the significant power saving that can be achieved by implementing this method.

We have carried out simulated experiments by evaluating the power consumption of a single wire with DAC/ADC encoding in comparison to an n-bit digital bus of a large system. SPICE simulation for an ideal case shows that the ratio of bus power consumed by the proposed analog scheme to that of a typical digital scheme (without bus encoding or differential signaling) can be given by $P_{analog}/P_{digital} = 1/(3n)$. For a 500 MHz frequency and a 1 mm intermediate wire, the analog bus replacing a 4-bit parallel bus consumes 16 µW, compared with 219 µW for the parallel bus. The analog bus replacing an 8-bit parallel bus consumes 18 µW, compared with 470 µW for the 8-bit parallel bus.

1.3 Problem Statement

The objective of this work is to develop a low-power analog bus for on-chip communication to replace the existing parallel digital bus.

1.4 Organization

The thesis is organized as follows:
Chapter 2 introduces the reader to sources of power consumption in CMOS design and various existing low power design techniques.
Chapter 3 explains on-chip communication and the bottlenecks of this area in detail.
Chapter 4 discusses previous contributions in the field of low power on-chip communication. The main focus is on the SerDes approach.
Chapter 5 introduces the concept of analog bus for digital on-chip communication.
Chapter 6 discusses the theory of analog-to-digital and digital-to-analog converters.
Chapter 7 evaluates the proposed scheme with the results obtained during the experimental implementation.
Chapter 8 concludes the thesis with the challenges of the proposition and suggestions for future research.

Chapter 2
Overview: Power

Power dissipation is one of the most important factors in the choice of technology in VLSI design. According to Pollack's Rule, performance increase is roughly proportional to the square root of the increase in complexity [9, 10]. Since each technology generation doubles the number of transistors on a chip, the scale of integration depends on the increased device and power density. Designers pay special attention to applying power reduction techniques, as the maximum power can limit the scale of integration. Power reduction techniques focus on the total power for both active and standby modes of the circuit. The total power in a design consists of dynamic power and static power. The components of power consumption in integrated circuits, consisting of registers, control, data path logic, clock tree, memory etc., are design and application dependent [12, 25, 27, 59].

2.1 Power Dissipation

The total power consumption of a CMOS circuit is

$P_{Total} = P_{Dynamic} + P_{Leakage} + P_{Short\,circuit}$

where:
$P_{Dynamic}$ = dynamic switching power dissipated while charging or discharging the parasitic capacitances during a node voltage transition.
$P_{Leakage}$ = combination of the sub-threshold leakage power due to the non-ideal off-state characteristics of the MOSFET switches, and the gate leakage power caused by carrier tunneling through the thin gate oxides.
$P_{Short\,circuit}$ = transitory power dissipated during an input signal transition when both the pull-up and the pull-down networks of a CMOS gate are simultaneously on.

Figure 2.1: Switching Power [27].

2.1.1 Dynamic Power

Dynamic power is primarily due to switching capacitances and short circuit power.
The primary source of dynamic power consumption is switching power, which is the power required to charge and discharge the output capacitance of a gate (Figure 2.1). Glitches present in the signals increase switching activity by 15% to 20% [12]. The switching power of a single gate can be expressed as

$P_D = \alpha f_s C_L V_{DD} V_{swing}$

where $\alpha$ is the switching activity, $f_s$ is the operation frequency, $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, and $V_{swing}$ is the voltage swing.

Internal power also contributes to dynamic power. Internal power consists of the short circuit current that arises when both the NMOS and PMOS transistors are on, and also the current required for charging the internal capacitances [12, 27, 56]. The short circuit power occurs for a short time during each transition, so the overall dynamic power is dominated by switching power. Switching power is not a function of transistor size, but rather a function of switching activity and load capacitance; thus it is data dependent. Methods of reducing active power often focus on reducing $V_{DD}$, as dynamic power depends on $V_{DD}$ quadratically when the swing equals the supply voltage. Measures are also taken to reduce the capacitance and the wire lengths [12, 27, 56].

Short-Circuit Power

In static CMOS circuits, during transitions of the input signals, the non-zero rise and fall times of the input signals cause both the pull-up and pull-down network transistors to be on simultaneously for a short period of time, thereby forming a DC current path between the power supply and ground (Figure 2.2). The DC current in the circuit during this input signal transient is called the short-circuit current. Short-circuit current is a function of the rise/fall times of the input and output signals and the output load. The short-circuit current is significant if the rise and fall times of the input signals are considerably larger than the output rise and fall times, because the short circuit current path has the opportunity to exist for a longer period of time [35].
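The switching-power expression of Section 2.1.1 can be sketched numerically as follows. This is only an illustrative calculation: the function name and every parameter value (activity 0.5, 500 MHz, 200 fF load, 1 V supply) are hypothetical and are not taken from the experiments in this thesis.

```python
def switching_power(alpha, f_s, c_load, v_dd, v_swing=None):
    """Switching power of one node in watts.

    Implements alpha * f_s * C_L * V_DD * V_swing. If no swing is
    given, a full-rail swing (V_swing = V_DD) is assumed.
    """
    if v_swing is None:
        v_swing = v_dd
    return alpha * f_s * c_load * v_dd * v_swing


# Hypothetical node: values chosen only for illustration.
p_full = switching_power(alpha=0.5, f_s=500e6, c_load=200e-15, v_dd=1.0)
p_half = switching_power(alpha=0.5, f_s=500e6, c_load=200e-15, v_dd=0.5)

print(f"full supply: {p_full * 1e6:.1f} uW")
print(f"half supply: {p_half * 1e6:.1f} uW")
print(f"ratio: {p_full / p_half:.1f}")
```

With a full-rail swing, halving the supply voltage divides the switching power by four, which is the quadratic dependence noted in the text.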
Figure 2.2: Short-Circuit Power [27].

Short-circuit power, which is due to the nonzero rise and fall times of the input waveforms, contributes less than 10% of the total dynamic power. Short-circuit power can be reduced by matching input and output rise and fall times.

2.1.2 Static Power

Static power, also known as leakage power, is related to the requirement of sustaining the logic values of circuit nodes between switching events. Static power dissipation is generally due to current leakage mechanisms within the circuit (even in the off state) and does not contribute to any computation. A transistor switch is fundamentally a resistive/capacitive network between the power supply and ground. Current is drawn from the power supply even when a transistor operates in the cut-off region, due to the non-ideal off-state characteristics (a finite resistance) of a transistor. The leakage currents are dominated by weak inversion and reverse-biased pn junction diode currents in long channel devices [35].

Leakage can contribute a large portion of the average power consumption for low performance applications, particularly when a chip has long idle modes without being fully off [12, 27, 57]. In today's technology, leakage can account for 10% to 30% of the total power when a chip is active. Unfortunately, as CMOS technology scaling proceeds, the mechanisms that cause leakage are becoming worse. Static power dissipation plays a vital role in determining how long and how far Moore's Law can continue unabated.

There are four main sources of leakage currents in a CMOS gate: Sub-threshold Leakage, Gate Leakage, Gate Induced Drain Leakage and Reverse Bias Junction Leakage (Figure 2.3).

Sub-threshold Leakage is the current that flows from the drain to the source of a transistor operating in the weak inversion region when a CMOS gate is not turned completely off.
The equation can be given as follows:

$I_{SUB} = \mu C_{ox} V_{th}^{2} \frac{W}{L} e^{\frac{V_{GS} - V_T}{n V_{th}}}$

where $\mu$ is the carrier mobility, $C_{ox}$ is the gate oxide capacitance per unit area, W and L are the width and length of the transistor, $V_{th}$ is the thermal voltage, $V_T$ is the threshold voltage, and n is a function of the device fabrication process that ranges from 1.0 to 2.5. This equation tells us that sub-threshold leakage depends exponentially on the difference between $V_{GS}$ and $V_T$. Sub-threshold leakage increases exponentially with decreasing $V_T$ and increasing temperature, which complicates the problem of designing low power systems. It is also dependent on transistor channel length in short channel devices [27].

Figure 2.3: Static power [27].

Gate Leakage is the current which flows directly from the gate through the oxide to the substrate due to gate oxide tunneling and hot carrier injection. Leakage current has increased exponentially with the reduction in gate oxide thickness. The gate oxide thickness ($T_{OX}$), which is only a few atoms thick, makes tunneling current substantial. Starting with the 90 nm node, gate leakage can be nearly one-third as much as sub-threshold leakage. High-k dielectric materials are required to keep gate leakage in control [12, 27, 57].

Gate Induced Drain Leakage is the current which flows from the drain to the substrate, induced by a high field effect in the MOSFET drain caused by a high $V_{DG}$.

Reverse Bias Junction Leakage is caused by generation of electron/hole pairs in the depletion regions and minority carrier drift [12, 27, 57].

2.2 Power Reduction Techniques

Until recent times, power was a second-order problem in chip design, the first-order considerations being cost, area, and timing. Now, for most System-on-Chip (SoC) designs, the power budget is extremely significant. Issues like thermal limits, packaging constraints, battery life, and cooling options are now key factors in the success of a product. Today some of the most powerful microprocessor chips can dissipate an average power density of 50-75 watts per square centimeter. This power density creates problems not only with packaging and cooling, but also with decreased reliability. Exceeding the power budget can be fatal to a design, as it can cause unacceptably poor reliability due to excessive power density and make the design fail before the required time.

There is a conflict between the reduction and the balance of dynamic and static power. Supply voltage is reduced to lower dynamic power, and threshold voltage is reduced to sustain performance. But this process raises the leakage current. Technology has moved to a point where both static and dynamic power reduction are important, and a balance needs to be struck between the techniques [27].

To optimize the power consumption in VLSI design, designers take various approaches to power management, using diverse strategies at various levels of the design process (Table 2.1).

Table 2.1: Strategies for low power designs [26].

Design Level              Strategies
Operating System Level    Partitioning, power down
Software Level            Regularity, locality, concurrency
Architecture Level        Pipelining, redundancy, data encoding
Circuit/Logic Level       Logic styles, transistor sizing and energy recovery
Technology Level          Threshold reduction, multi-threshold devices
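The conflict between dynamic and static power noted in this section can be illustrated with a rough calculation. The sketch below is an assumption-laden illustration, not the model used in this thesis: it lumps all pre-exponential terms of the sub-threshold expression into a hypothetical constant `i0`, takes the thermal voltage as roughly 26 mV, uses n = 1.5, and compares only leakage current (not leakage power) for an off transistor with zero gate-source voltage.

```python
import math

THERMAL_VOLTAGE = 0.026  # approximate kT/q in volts near room temperature


def dynamic_power(alpha, f_s, c_load, v_dd):
    """Switching power with full-rail swing: alpha * f_s * C_L * V_DD^2."""
    return alpha * f_s * c_load * v_dd ** 2


def subthreshold_current(i0, v_gs, v_t, n=1.5, v_thermal=THERMAL_VOLTAGE):
    """Sub-threshold current i0 * exp((V_GS - V_T) / (n * v_thermal)).

    i0 is a hypothetical constant that lumps the pre-exponential
    terms (mobility, oxide capacitance, W/L, thermal voltage squared).
    """
    return i0 * math.exp((v_gs - v_t) / (n * v_thermal))


def compare(v_dd_old, v_dd_new, v_t_old, v_t_new):
    """Return (dynamic power ratio, off-state leakage current ratio)."""
    dyn_ratio = (dynamic_power(1.0, 1.0, 1.0, v_dd_new)
                 / dynamic_power(1.0, 1.0, 1.0, v_dd_old))
    leak_ratio = (subthreshold_current(1.0, 0.0, v_t_new)
                  / subthreshold_current(1.0, 0.0, v_t_old))
    return dyn_ratio, leak_ratio


# Hypothetical scaling step: supply 1.0 V -> 0.8 V, threshold 0.4 V -> 0.3 V.
dyn_ratio, leak_ratio = compare(1.0, 0.8, 0.4, 0.3)
print(f"dynamic power scales by {dyn_ratio:.2f}")
print(f"off-state leakage current scales by {leak_ratio:.1f}")
```

Under these assumed numbers the dynamic power falls to about two thirds of its original value, while the off-state leakage current grows by roughly an order of magnitude, which is why the two reductions have to be balanced against each other.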
Some of these power reduction techniques are compared in Table 2.2.

Table 2.2: Trade-offs associated with power management techniques [26].

Power Reduction Technique | Power Benefit | Timing Penalty | Area Penalty | Architecture Impact | Design Impact | Verification Impact | Implementation Impact
Multi Vt optimization | Medium | Little | Little | Low | Low | None | Low
Clock Gating | Medium | Little | Little | Low | Low | None | Low
Multi supply voltage | Large | Some | Little | High | Medium | Low | Medium
Power Shut off | Huge | Some | Some | High | High | High | High
Dynamic and Adaptive Voltage Frequency Scaling | Large | Some | Some | High | High | High | High
Substrate Biasing | Large | Some | Some | Medium | None | None | High

2.2.1 Technology Scaling

Technology scaling is the most common optimization method used. If the dimensions, voltages and doping are scaled by a factor $\alpha$, the electric field configuration in the scaled device will be exactly the same as it was in the larger device, but speed increases by the scale factor and the power density remains constant. In recent technologies the supply voltage has reached 1V. This imposes physical limits on further scaling, because the silicon band-gap energy and the built-in potentials of the device do not change with scaling. Further threshold voltage scaling with manageable leakage is not possible due to thermodynamic limitations. To accommodate the slower voltage scaling, the electric field is increased by an additional factor $\varepsilon > 1$. As this method reduces reliability and increases power consumption, alternative methods should be chosen to overcome the issue [31,57].

2.2.2 Transistor Sizing

Transistor sizing is a significant method for reducing junction capacitance and overall gate capacitance. There are several methods that minimize the area of the circuit, which reduces power while maintaining performance [56].

Figure 2.4: Clock gating.

2.2.3 Interconnect Optimization

With every technology generation, the local interconnect capacitance decreases, but the global interconnect capacitance increases.
The increasing die size increases the global interconnect length, as well as its capacitance and delay. To optimize interconnect power, the width, height and spacing of wires are optimized. The research done in this thesis also contributes to this issue. A significant amount of power can be saved by interconnect optimization [56].

2.2.4 Clock Gating

For any general purpose microprocessor, only a small portion of the circuit is active at a given time. Turning off the idle portion of the circuit is an effective way to save dynamic power (Figure 2.4). The clock has the highest toggle rate and consumes a significant portion of the total dynamic power. The clock gating approach, where the clock is turned off when not required, can save a significant amount of power without changing any logic function of the circuit [12,27,56].

2.2.5 Power Gating

While clock gating reduces dynamic power, power gating reduces static leakage. Here, power rails are disconnected when transistors are in idle mode. Power gating itself consumes power, so it is worthwhile only when a unit is idle for a sufficient number of clock cycles [12].

2.2.6 Supply Voltage and Threshold Voltage Scaling

Reducing supply voltage reduces dynamic power as well as short circuit power. Delay increases with reduced supply voltage; as a result, threshold voltage also has to be reduced. But reducing threshold voltage increases leakage current. A tradeoff has to be made among performance, dynamic power and static power [12,27].

2.2.7 Multi-Voltage Design

Voltage scaling also increases the delay of the gates in the design. In a System-on-Chip design, different blocks have different constraints and performance objectives. A block that does not need to run particularly fast can have a lower supply voltage than a speed-critical block. This method is called multi-voltage design [20,27].
Some methods of multi-voltage design are: Static Voltage Scaling (SVS), Multi-level Voltage Scaling (MVS), Dynamic Voltage and Frequency Scaling (DVFS), and Adaptive Voltage Scaling (AVS) [27].

2.2.8 Variable Supply and Threshold Voltages

To meet circuit timing constraints, high supply voltage and low threshold voltage are necessary. But low supply voltage reduces dynamic power, and high threshold voltage reduces leakage power. To reduce overall power while meeting timing constraints, high VDD/low Vth is used in critical paths, and low VDD/high Vth is used where sufficient timing slack is available [12].

2.2.9 Technology Mapping

Logic can be implemented by different combinations of cells. In technology mapping, a logic netlist is mapped to a standard cell library within a given technology. Nets with high activity can be assigned to pins with lower input capacitance. Switching activity can be reduced by refactoring, whereas balancing path delays can reduce glitches [12].

2.2.10 Floorplanning, Cell Placement and Wire Routing

A significant portion of the total capacitance in a design is made up of wire capacitances. The capacitance of a wire depends on its length, and wire lengths in a chip depend greatly on the quality of global wire routing, floorplanning and cell placement. Additional buffers to drive long wires also contribute to extra power consumption. Several techniques are applied to reduce the power consumption due to long global wires [12]. The technique discussed in this thesis also reduces wire routing.

Chapter 3
On-Chip Communication

There is no turning back from the era of multi-million gate chips that the semiconductor industry has entered. Traditionally, the design and development of System-on-Chip (SoC) technology focused on the computational aspects of the problem. But as the number of elements on a single chip and their performance requirements continued to increase, computation-based design shifted to communication-based design.
Nowadays, the communication architecture plays a key role in the area, performance, and energy consumption of the overall system [36,44].

The System-on-Chip (SoC) approach enables an increasing number of IP cores to be integrated on a single chip. A large number of different kinds of blocks, each of the size of a few hundred thousand gates, comprise the computational resources. For such a complex design, the communication architecture is vital and has to be efficient [34,44]. Conventionally, on-chip communication schemes are of two types - point-to-point (P2P) and bus-based communication architecture. An SoC bus architecture is shown in Figure 3.1.

Figure 3.1: IBM Cell ring bus communication architecture [53].

3.1 Bus Architecture

A bus is a collection of signals (wires) that connects two or more IP components for the purpose of data communication. On-chip communication is mostly implemented using a bus architecture in SoC designs. Figure 3.2 shows a typical bus system, where a variety of devices are tied to the bus to communicate with each other. Use of a standard internal bus design around particular modules facilitates design reuse. The performance of the SoC design depends greatly on the efficiency of the bus structure [53].

Figure 3.2: Bus structure [53].

3.2 Bus Topology

The bus architecture topologies can be classified as follows.

Shared bus. The simplest bus architecture commonly found in SoCs is the shared bus, to which several master and slave devices can be connected. A bus arbiter periodically examines requests from the master interfaces and grants access to one master according to the bus protocol specification. The bus bandwidth can be limited by increased load on global bus lines.

Figure 3.3: Shared bus [53].

Figure 3.4: Hierarchical bus [53].

The advantages of a shared bus are its simple topology, low area cost, and efficient implementation. Its disadvantages are the large load per data line, delay, and energy consumption, which limit its bandwidth.
Low-voltage swing signaling techniques can overcome these disadvantages [44]. Figure 3.3 illustrates a shared bus.

Hierarchical Bus. In a hierarchical bus, several shared buses are connected by bridges to form a hierarchy. Components are placed in the hierarchy according to their performance level. Hence, low and high performance components are placed on low and high performance buses, respectively. The AMBA bus and the CoreConnect bus are examples of this bus architecture. A hierarchical bus architecture offers larger throughput than a shared bus, as it has decreased load per bus and allows transactions to proceed in parallel on different buses. Communications can proceed in a pipelined manner. However, the additional overhead of transactions across the bridge during a transfer may make the bus inaccessible to other components [44]. Figure 3.4 illustrates a hierarchical bus.

Ring Bus. In the ring bus architecture, each node component communicates using a ring interface implemented by a token-passing protocol. Ring-based buses are widely used in numerous architectures such as network processors and ATM switches [44]. Figure 3.5 illustrates a ring bus.

Figure 3.5: Ring bus [53].

Other Architectures. Some other bus architectures are the split bus, full crossbar bus, partial crossbar bus, tri-state buffer based bus, etc., as illustrated in Figures 3.6 through 3.9.

Figure 3.6: Split bus [53].

Figure 3.7: Crossbar bus [53].

Figure 3.8: Partial crossbar/matrix bus [53].

Figure 3.9: Tristate buffer bus [53].

3.3 Issues With Parallel Bus

Computation-based design shifted to communication-based design as communication became the most critical aspect of system performance and cost. Whenever a system is envisioned, it includes a bus system with various devices coupled to it. The communication architecture, consisting of wires, repeaters, and bus components, can consume up to 50% of the total chip power [53].
Design, customization, exploration, verification and implementation of the communication architecture take up a significant portion of the system design cycle. A number of trends have forced system architectures to evolve, and the required buses have evolved with them. These trends include application convergence, integration of IP blocks in a single chip, process evolution, time to market pressure, etc. [2,14].

A parallel bus is a large number of wires bundled together that enable data to be transmitted in parallel [53]. Key issues of bus architecture design are power consumption, performance, design time reduction, ease of use, and silicon efficiency. The complexities of parallel bus architecture are explained below [2,14,28,53].

3.3.1 Routing Complexity

The bus architecture has to compete for global resources with the clock, the power grid and other global signals. The length of interconnect is increasing due to the increasing number of modules that span large distances on the system on chip. The number of buses required is also increasing as the number of IP cores increases. As a result, routing an on-chip parallel bus is getting complicated due to increasing congestion [2,8,14,28].

3.3.2 Area

Besides routing complexity, a parallel bus also occupies a large silicon area, as a number of drivers, repeaters and registers are inserted along with the interconnect. The use of a wider metal pitch and protective shields to reduce coupling also consumes area [2,8,14,28].

3.3.3 Power Dissipation

Integrated circuits designed with battery constraints in mind make energy efficient global communication techniques necessary. Every additional element attached to the circuit to construct a bus architecture adds to the overall capacitance. The power consumed by the bus architecture is a significant fraction of the total power consumption of the integrated circuit. An increasing number of cores creates an increased number of bus lines, which corresponds to increased capacitance.
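To make the dependence on the number of bus lines concrete, the sketch below applies a common first-order estimate of switching power (activity factor times capacitance times supply voltage squared times frequency, per line) and sums it over the bus. The function name and every number used here (200 fF per line, 1 V supply, 500 MHz clock, activity factor 0.5) are illustrative assumptions, not measurements from this work.

```python
def bus_dynamic_power(n_lines, cap_per_line_f, vdd, freq_hz, activity=0.5):
    """First-order switching power of a parallel bus, in watts.

    n_lines        : number of bus wires
    cap_per_line_f : switched capacitance per wire in farads (assumed)
    vdd            : supply voltage in volts
    freq_hz        : clock frequency in hertz
    activity       : average fraction of cycles in which a wire toggles
    """
    return n_lines * activity * cap_per_line_f * vdd ** 2 * freq_hz


# Illustrative comparison: power grows linearly with the bus width.
for width in (4, 8, 16, 32):
    power_w = bus_dynamic_power(width, 200e-15, 1.0, 500e6)
    print(f"{width:2d}-bit bus: {power_w * 1e6:.0f} uW")
```

In this simple model the total is proportional to the number of wires, so any scheme that removes wires or lowers their activity reduces bus power by the same proportion.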
Furthermore, fringe capacitance increases as interconnects get closer to each other. The repeaters, buffers, etc. inserted to improve performance and throughput also consume a large amount of energy [28].

3.3.4 Performance

Bandwidth is limited, but shared by all elements. Skew and jitter on the parallel bus make synchronization complicated, and therefore lead to bandwidth limitations. As technology scales, the RC delay of the interconnects gets worse. To counter this, more repeaters and buffers are inserted, which on the other hand increases power consumption due to the additional elements. Another method to reduce delay is to increase the pitch. This method reduces the delay, but the increase in area is significant [8,28].

3.3.5 Signal Integrity and Crosstalk

Increased package density and feature size reduction cause complexity in on-chip communication. The most important signal integrity problems are crosstalk, signal skew, overshoot, and reflection. An interconnect in a parallel bus not only serves as a conductor of electrons but also introduces additional resistance, capacitance, and inductance. Crosstalk induces delay and noise too. Crosstalk between neighboring lines in a parallel bus creates data-dependent signal delay, further limiting the transmission bandwidth [28,53].

3.4 Possible Solutions

Bus architectures cannot directly track process and system architecture evolution. The architectures have to balance the various driving forces. A prominent technique to reduce parallel bus issues in inter-core bus communication is reducing the number of transitions occurring on each of the bus lines by bus encoding procedures [7]. This reduces the effective activity on the lines, and the number of lines that need to be run between two cores. Alternate schemes for power reduction include low voltage and differential signaling [23], all of which try to limit the signal swing on the bit lines, thereby reducing power.
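As a minimal sketch of how a bus encoding can reduce transitions, the example below counts wire toggles for a word sequence sent as-is and sent with bus-invert coding, a well-known encoding in which a word is transmitted inverted, together with one extra "invert" line, whenever more than half of the data lines would otherwise toggle. Bus-invert coding is chosen here purely for illustration; the text does not state which encoding reference [7] describes, and the function names and the test sequence are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def transitions_plain(words):
    """Total wire toggles when words are driven unmodified (bus starts at 0)."""
    prev, total = 0, 0
    for w in words:
        total += hamming(prev, w)
        prev = w
    return total


def transitions_bus_invert(words, n_bits):
    """Total wire toggles with bus-invert coding, including the invert line."""
    mask = (1 << n_bits) - 1
    prev_data, prev_inv, total = 0, 0, 0
    for w in words:
        # Invert when more than half of the data lines would toggle.
        inv = 1 if 2 * hamming(prev_data, w) > n_bits else 0
        data = (~w & mask) if inv else w
        total += hamming(prev_data, data) + (inv ^ prev_inv)
        prev_data, prev_inv = data, inv
    return total


# Hypothetical worst-case pattern for an 8-bit bus: alternate all-0 and all-1.
sequence = [0x00, 0xFF, 0x00, 0xFF]
print("plain:     ", transitions_plain(sequence))
print("bus-invert:", transitions_bus_invert(sequence, 8))
```

Note that this particular scheme adds one wire instead of removing any, so it trades a small area cost for lower switching activity; the receiver recovers each word by inverting the data lines whenever the invert line is set.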
Another solution is the replacement of parallel buses with an on-chip serial link [29].

Chapter 4
Previous Work

Point-to-point (P2P) and bus-based communication architectures are the two types of on-chip communication schemes widely considered. In P2P communication, intellectual property (IP) cores communicate with each other through dedicated channels, which provides the highest performance. This architecture, however, experiences scalability issues because of complexity, design effort and cost. A bus architecture connects multiple IP cores, reducing the complexity of dedicated communication. Still, bus-based architecture also suffers from limited scalability in terms of performance and power efficiency [36].

A prominent technique to reduce power in inter-core bus communication is reducing the number of transitions occurring on each of the bus lines by bus encoding procedures [7]. This reduces the effective activity on the lines, and the number of lines that need to be run between two cores. Alternate schemes for power reduction include low voltage and differential signaling [23], all of which try to limit the signal swing on the bit lines, thereby reducing power. Techniques such as Adaptive Supply Voltage Links are employed at the system level for energy-efficient on-chip global communication. Another solution to the problems of parallel buses is to replace them with an on-chip serial link [29].

4.1 NOC

The network-on-chip (NOC) methodology is a solution to the design productivity problems in communication-centric on-chip design. The NOC architecture is an m × n mesh of switches and resources, with the resources placed in the slots formed by the switches [22]. The NOC communication infrastructure connects the resources via a network of switches, which communicate with each other using addressed data packets routed to their destination by the switch fabric.

Figure 4.1: Various communication architectures [8].
Communication among IP cores is carried out by generating and forwarding packets through the network structure [8,36]. Here, the hardware resources are developed independently as standalone blocks, and the NOC is created by connecting the blocks in the network. The configurable network, being a flexible platform, can be modified as the workload requires, while maintaining the generality of the application [8,36]. Figure 4.1 shows the structures of bus, P2P and network-on-chip architectures [36].

The NOC architecture offers scalability, design reuse, and predictability. A large number of IP cores can be connected without using global wires, as communication can be achieved by routing packets. The approach provides a highly scalable communication architecture. NOC offers great potential for reuse: the network, and the IP cores that comply with it, can be reused in various applications. The architecture is structured, which facilitates controlled and optimized electrical parameters [8]. Multiple routes and redundancy are possible in this architecture.

The disadvantages of NOC are area and speed overheads. There is an area overhead because of the switches used and because the fixed wire layout is not always optimal. The internal network, with its packetization, routing and switching, may add latency to the system. Synchronization is imperative in this system [34].

Figure 4.2: SerDes [28].

4.2 SerDes

A promising solution for on-chip communication that may replace parallel buses is an on-chip serial link. A parallel link comprises n wires that can carry n bits of data simultaneously through the link. Serializer/De-serializer (SerDes) is a widely used technique for replacing multiple lines of an on-chip bus with a single on-chip line to achieve high speed serial communication. It is illustrated in Figures 4.2 and 4.3. In this architecture, n parallel data bits are serialized on the transmitter side.
The data transfer takes place at a speed which is n-times higher than the data rate of the parallel data. On the receiver side, the data have to be de-serialized to reproduce the n-bit parallel word. In general, n wires can be compressed into m wires where m