Polynomial-Time Algorithms for Designing Dual-Voltage Energy Efficient Circuits

by

Mridula Allani

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
December 12, 2011

Keywords: Energy efficient design, Dual-Vdd design, Level converter, Gate slack, Static timing analysis (STA)

Copyright 2011 by Mridula Allani

Approved by
Vishwani D. Agrawal, Chair, James J. Danaher Professor of Electrical and Computer Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering
Adit Singh, James B. Davis Professor of Electrical and Computer Engineering

Abstract

In this work, we propose a technique that uses dual supply voltages in digital designs in order to obtain a reduction in energy consumption. Three new algorithms are proposed for finding and assigning the low voltage in dual-voltage designs. Given a circuit and a supply voltage, the first algorithm finds an optimal lower supply voltage, and the other two algorithms assign that lower voltage to individual gates. A linear-time algorithm described in the literature is used for computing slacks for all gates in a circuit for a given supply voltage. Based on the computed gate slacks and the lower supply voltage, the gates in the circuit are divided into three groups. No gate in the first group can be assigned the lower supply. All gates in the second group can be simultaneously set to the lower supply while maintaining positive slack for all gates. The gates in the third group are assigned the low voltage in small subgroups, and the gate slacks are recalculated after each such voltage assignment. Thus, the overall complexity of this reduced-power dual-voltage assignment procedure is O(n^2), but in practice the computation time is observed to be close to linear in the circuit size. SPICE simulations of ISCAS'85 benchmark circuits using the PTM model for 90-nm bulk CMOS technology show up to 60% energy savings.

Acknowledgments

This thesis would not have been possible without the support of my advisor, Dr. Vishwani Agrawal. I am thankful for his guidance and encouragement. It has been a pleasure working with him. I thank my advisory committee members, Dr. Victor Nelson and Dr. Adit Singh. Also, I am grateful to the Auburn Transition Leadership Institute (ATLI) and the Department of Electrical Engineering at Auburn University for supporting my graduate study. I sincerely appreciate Mrs. G. Kanaka Durga and Mrs. B. Sarala for their guidance during my college years. I also thank all my teachers who have inspired me throughout my life.

Graduate study at Auburn University has been a learning experience. I thank my colleagues Farhana, Jia, Kim, Manish, Nitin, Priya and Suraj for their valuable inputs when required. Without them my graduate experience would not have been this enriching. I am indebted to my parents, brother, sister and all friends for their love and support. Also, I am very thankful to my husband, Abhishek Bichal, for being with me through thick and thin. Finally, I thank everyone who directly or indirectly helped and inspired me during the course of my graduate study.

Table of Contents

Abstract  ii
Acknowledgments  iii
List of Figures  vi
List of Tables  x
1 Introduction  1
1.1 Motivation  2
1.2 Unique Contributions  3
1.3 Problem Statement  3
1.4 Organization  4
2 Background  5
2.1 Introduction  5
2.1.1 Power Dissipation in CMOS Circuits  6
2.2 Power Reduction in CMOS Circuits  13
2.2.1 Technology Scaling  13
2.2.2 Transistor Sizing  15
2.2.3 Supply Voltage Scaling  15
2.2.4 Clock gating  17
2.2.5 Power gating  18
2.2.6 Dual/multi-threshold designs  20
2.2.7 Variable threshold CMOS (VTCMOS)  21
2.2.8 Dynamic threshold scaling  21
2.2.9 Variable supply and threshold voltages  22
2.2.10 Transistor stacking  22
2.2.11 Minimum leakage vector method  23
2.3 Conclusion  24
3 Overview of Static Timing Analysis Algorithms  25
3.1 Algorithm 1  30
4 Slack-Based Algorithm For Finding an Optimal Lower Supply Voltage  35
4.1 Algorithm 2  45
5 Slack-Based Algorithm For Dual Voltage Assignment Without Using Level Converters  53
5.1 Introduction  53
5.2 An Example of a Chain of Inverters  54
5.3 Algorithm 3  56
5.4 Results  58
6 Asynchronous Level Converters in Dual-Voltage Design  65
7 Slack-Based Algorithm For Dual Voltage Assignment Using Level Converters  77
7.1 An Example of a Chain of Inverters  78
7.2 Algorithm 4  80
7.3 Results  80
8 Conclusion and Future Work  87
8.1 Using the Proposed Algorithms at Higher Abstraction Levels  87
8.2 Accommodate Level Converter Overheads  88
8.3 Dual Threshold Design  89
8.4 Multiple-Supply Voltage Design  89
8.5 Simultaneous Multiple VDD and Multiple Vth Designs  89
8.6 Leakage Energy Reduction  90
8.7 Process Variations  90
8.8 Short Paths  91
Bibliography  92

List of Figures

2.1 Switching current in an inverter.  7
2.2 Short circuit current in an inverter.  9
2.3 Short circuit current characteristics of an inverter.  10
2.4 Leakage currents in an nMOS transistor.  11
2.5 Scaling of an nMOS transistor.  14
2.6 Clock gating of a flip-flop.  17
2.7 Clock gating of a clock-tree.  17
2.8 Power gating implementation using header and footer sleep transistors.  18
2.9 A stack of nMOS transistors.  23
3.1 Generalization of a synchronous circuit.  26
3.2 Example of a combinational circuit.  28
3.3 Traditional STA.  29
3.4 New STA.  31
4.1 Power reduction versus ratio of low voltage and high voltage [56, 73].  36
4.2 Optimum supply voltages and minimum power dissipation for multiple VDD values [56, 73].  36
4.3 Delay increment versus slack for gates of c880 circuit, VH = 1.2V, VL = 0.58V.  38
4.4 Su vs. VL for ISCAS'85 benchmark circuits.  42
4.5 Energy savings per gate vs. VL for c880 when level converters are not allowed. G and P+G show the number of gates. Energy savings per gate is (VH^2 - VL^2)/VH^2.  42
4.6 Energy savings per gate vs. VL for c880 when level converters are allowed. G and P+G show the number of gates. Energy savings per gate is (VH^2 - VL^2)/VH^2.  43
4.7 Energy versus voltage for c880 when level converters are not allowed.  51
4.8 Energy versus voltage for c880 when level converters are allowed.  51
4.9 Energy versus voltage for c1355 when level converters are not allowed.  52
4.10 Energy versus voltage for c1355 when level converters are allowed.  52
5.1 A chain of ten inverters.  55
5.2 Energy and delay measurements at various values of V1 and V2 for the inverter chain of Figure 5.1.  55
5.3 Delay increment versus initial slack for gates of c880 circuit, VH = 1.2V, VL = 0.49V.  59
5.4 Delay increment versus final slack for gates of c880 circuit, VH = 1.2V, VL = 0.49V.  59
5.5 Delay increment versus initial slack for gates of c880 circuit when Tc is increased by 5%, VH = 1.2V, VL = 0.67V.  60
5.6 Delay increment versus final slack for gates of c880 circuit when Tc is increased by 5%, VH = 1.2V, VL = 0.67V.  60
5.7 Delay increment versus initial slack for gates of c499 circuit, VH = 1.2V, VL = 0.91V.  61
5.8 Delay increment versus final slack for gates of c499 circuit, VH = 1.2V, VL = 0.91V.  61
5.9 An illustration conforming to the topological constraints on gates with slacks greater than Su.  63
5.10 An illustration violating the topological constraints on gates with slacks greater than Su.  63
6.1 Experimental setup for level converter.  66
6.2 Equivalent setup for level converter experiments.  66
6.3 Energy per cycle versus VL, VH = 1.2V.  67
6.4 Dual voltage with level converter inserted before each fan-in of a high-voltage gate.  68
6.5 Dual voltage with level converter inserted after the output of a low-voltage gate.  68
6.6 Dual Cascode Voltage Switch (DCVS) level converter [75, 96].  69
6.7 Pass transistor logic level converter [31, 50, 51].  70
6.8 Conventional Type II level converter [49, 97].  71
6.9 Conventional single supply level converter [5, 31, 53].  72
6.10 Recently published single supply level converter [5].  73
6.11 Contention mitigated level converter [23, 53, 89].  74
6.12 Level-shifting NAND2 with one high-voltage and one low-voltage input. M1 has a higher threshold voltage than M2 [19].  75
7.1 Energy and delay measurements at various values of V1 and V2 for the inverter chain of Figure 7.2.  79
7.2 A chain of ten inverters with a level converter.  79
7.3 Delay increment versus initial slack for gates of c880 circuit, VH = 1.2V, VL = 0.58V.  82
7.4 Delay increment versus final slack for gates of c880 circuit, VH = 1.2V, VL = 0.58V.  82
7.5 Delay increment versus initial slack for gates of c880 circuit when Tc is increased by 5%, VH = 1.2V, VL = 0.6V.  83
7.6 Delay increment versus final slack for gates of c880 circuit when Tc is increased by 5%, VH = 1.2V, VL = 0.6V.  83
7.7 Delay increment versus initial slack for gates of c499 circuit, VH = 1.2V, VL = 0.91V.  84
7.8 Delay increment versus final slack for gates of c499 circuit, VH = 1.2V, VL = 0.91V.  84
8.1 Trends in parameter variations [70].  90

List of Tables

4.1 Optimal lower supply voltage values (VL), VL1, VL2, their arithmetic mean, their geometric mean, and energy savings estimate when level converters are not allowed for ISCAS'85 benchmark circuits. VH = 1.2V.  47
4.2 Optimal lower supply voltage values (VL), VL1, VL2, their arithmetic mean, their geometric mean, and energy savings estimate when level converters are allowed for ISCAS'85 benchmark circuits. VH = 1.2V.  48
4.3 Optimal lower supply voltage values (VL) and energy savings estimate using Algorithm 2 when level converters are not allowed for ISCAS'85 benchmark circuits. VH = 1.2V.  49
4.4 Optimal lower supply voltage values (VL) and energy savings estimate using Algorithm 2 when level converters are allowed for ISCAS'85 benchmark circuits. VH = 1.2V.  50
5.1 Optimal lower supply voltage values (VL) and energy savings using Algorithms 2 and 3 for ISCAS'85 benchmark circuits, VH = 1.2V.  57
5.2 Optimal lower supply voltage values (VL) and energy savings using Algorithms 2 and 3 for ISCAS'85 benchmark circuits, when Tc is increased by 5%, VH = 1.2V.  57
6.1 DCVS level converter, VH = 1.2V.  69
6.2 Pass transistor logic level converter, VH = 1.2V.  70
6.3 Conventional Type II level converter, VH = 1.2V.  71
6.4 Conventional single supply level converter, VH = 1.2V.  72
6.5 Recently published single supply level converter, VH = 1.2V.  73
6.6 Contention mitigated logic level converter, VH = 1.2V.  74
7.1 Optimal lower supply voltage values (VL) and energy savings using Algorithms 2 and 4 for ISCAS'85 benchmark circuits using the contention mitigated level converter, VH = 1.2V.  81
7.2 Optimal lower supply voltage values (VL) and energy savings using Algorithms 2 and 4 for ISCAS'85 benchmark circuits using the contention mitigated level converter, when Tc is increased by 5%, VH = 1.2V.  81

Chapter 1

Introduction

Power and performance are the main design constraints in modern VLSI design. An effective design is obtained through the optimization of these two parameters. Decreasing the supply voltage results in decreased performance. Hence, a trade-off is necessary between power consumption and circuit delay.

The use of multiple supply voltages to reduce energy consumption is a very commonly used technique in CMOS circuits. This results from the fact that the dynamic power of a CMOS circuit is directly proportional to the square of its supply voltage [9, 73]. The underlying idea of this technique is to trade timing slack for power reduction under given timing requirements. Generally, the gates on the critical path are kept at the high supply voltage and the gates on non-critical paths are put at the lower supply voltage, thus avoiding timing violations. Multiple supply voltage design using timing slack is a popular technique for power reduction [8, 10, 13, 21, 27, 41, 42, 43, 63, 78, 80, 82, 84, 86, 100]. Several references [12, 27, 62, 63] discuss dual-voltage designs in FPGAs. In this work we use a linear-time slack analysis algorithm [41] to calculate gate slacks.

Apart from using multiple supply voltages, we can combine the multiple supply voltage technique with multiple thresholds and transistor sizing to achieve additional power savings. Such work is described in articles [6, 18, 30, 37, 78, 83, 85, 86].
The trade-offs between energy savings and increase in delay when using gate sizing and multiple supplies are studied in [87]. Authors of [102] propose an algorithm that assigns voltage islands, generated using slack distribution and voltage partitioning algorithms, in order to reduce the power consumption and peak temperature. Article [64] proposes a linear-time approximation algorithm for partitioning circuit components into various voltage islands during high-level synthesis. Other works [11, 15, 26, 57, 58, 98] deal with the voltage partitioning problem.

In some instances, not meeting the timing requirements of an application is not catastrophic. These are called soft deadlines. Hua and Qu [29] describe an algorithm for energy efficient designs using dual voltage to meet such soft deadlines.

1.1 Motivation

There are many algorithms for assigning a fixed low voltage value to the gates of a circuit, but relatively few algorithms to find the lower voltage. Many of the works related to voltage assignment have used a low voltage VL equal to 70% of the high voltage VH [6, 21, 50, 56, 73]. The works described in [10, 50, 84] claim that the optimal value of VL for minimizing total power is 50% of VH. Authors in [101] describe an algorithm that finds the lowest feasible supply voltages for gates, according to their slacks, from a set of given voltages. An algorithm to find an optimum VL value is described in [41]. The authors assign a low voltage value to a group of gates based on a modification of the CVS algorithm and then calculate energy over a set of low voltages. The VL value resulting in minimum energy is chosen as the lower voltage. This algorithm requires the voltage assignment to be done for each voltage value and is exhaustive in nature. We understand that the lower voltage for minimum energy operation of a dual-supply design depends on the circuit topology and is not a fixed value for all circuits. These reasons motivate us to propose a new algorithm that finds VL accurately and with linear complexity.

Authors in [92, 94] propose two ways of assigning a lower voltage in a dual-voltage design. The algorithm in [92] is called Clustered Voltage Scaling (CVS). This method puts a topological restriction on the resultant dual-voltage design which does not allow a low voltage gate at the input of a high voltage gate. Following this, an Enhanced Clustered Voltage Scaling (ECVS) algorithm was proposed in [94]. This algorithm allows a low voltage gate to drive a high voltage gate with the inclusion of an asynchronous level converter at the boundary. Reference [93] describes the methodology to synthesize circuits for the CVS and ECVS structures. Authors of [88] describe three algorithms for dual voltage designs based on linear programming models.

Several other algorithms have been proposed which aim at power reduction using dual voltage designs. Not many aim at energy-efficient designs. Also, the complexity of these algorithms is often polynomial. Thus, we are motivated to propose two quadratic-time algorithms for dual-voltage assignment, one of which allows level converters and the other of which does not.

1.2 Unique Contributions

An algorithm is proposed that finds a lower supply voltage maximizing the energy savings from a specific group of gates in a given circuit. The run time of this algorithm is proportional to the number of gates in the circuit, as each gate's slack is compared to the difference between the low-voltage and high-voltage delays of that gate. Thus, its complexity is linear, i.e.,
O(n), where n is the number of gates. Two other algorithms are proposed which assign the lower supply voltage found, using the gate slacks. The gate slacks are recalculated after each iteration of voltage assignment. Both the voltage assignment and the slack calculation algorithms are linear in time. Thus, the overall complexity of this method is quadratic, i.e., O(n^2), where n is the number of gates. In practice, however, the computation time is observed to be close to linear. All algorithms are implemented in the Perl programming language. Energy savings of up to 60% are seen for ISCAS'85 benchmark circuits, and when the critical timing is relaxed, savings of up to 70% are observed. Such high savings have not been reported in earlier work. Sufficient theoretical and experimental work has been done to validate these results.

1.3 Problem Statement

The aim of this work is to:

1. Develop a linear-time algorithm that finds the optimal lower voltage for the dual-Vdd design of CMOS circuits.

2. Develop new algorithms for voltage assignment in dual-Vdd design for given high and low voltages, using linear-time gate slack analysis to reduce computation time.

1.4 Organization

The proposed work is organized as follows. Chapter 2 provides background information on the various existing low power/energy-efficient design techniques. The sources of energy consumption in CMOS designs are also detailed. Chapter 3 gives an overview of static timing analysis algorithms and illustrates the algorithms proposed in [24, 41] with examples. In Chapter 4, we propose a new linear-time algorithm to find the lower supply voltage using gate slacks for dual-voltage designs. Relevant theorems are proved and experimental data are discussed. Chapter 5 proposes a new quadratic-time algorithm for low voltage assignment based on gate slacks which uses certain constraints that put restrictions on circuit topology, as proposed in [92]. Relevant results are discussed. In Chapter 6, various level converter designs are studied for energy and delay. The level converter with the least energy consumption is selected, as we are concentrating on energy-efficient designs. In Chapter 7, we propose another new quadratic-time algorithm for low voltage assignment based on gate slacks which uses the selected level converter to lift the topological constraints imposed earlier. Finally, Chapter 8 concludes the work with suggestions for future research. The need to include level converter overheads in the algorithm is realized from the results of Chapter 7. Hence, we propose this for future work.

Chapter 2

Background

2.1 Introduction

Power and performance are the main design constraints in modern VLSI design. An effective design is obtained through the optimization of these two parameters. This requirement for higher performance and lower power directly results from the increasing demand for portable electronic devices. Fortunately, VLSI technology has been able to integrate millions of gates into a single chip of small area, thus contributing towards the reduction in size and increase in portability of the electronic devices incorporating these chips. Also, this enormous level of integration on a small area has increased the speed of the devices, owing to greater current drive at the transistor level, shorter propagation paths at the gate level, and several other factors. But the continuous scaling has led to increased power dissipation in CMOS circuits.
Hence, we need to find an optimum design where the power dissipation is acceptable and the computational speeds are not disappointing. The power and timing efficiency can be captured by the following metrics [73]:

1. Average power
2. Power per MIPS (million instructions per second)
3. Energy
4. Energy-delay product
5. Energy-delay squared
6. Peak power

The choice of the design metric used for design optimization depends on the application and its performance specifications and power budgets. The various mechanisms causing power dissipation in CMOS circuits and the popular power reduction techniques are discussed in the following sections of this chapter.

2.1.1 Power Dissipation in CMOS Circuits

Power dissipation in CMOS digital circuits can be classified as static power and dynamic power. The static dissipation arises due to the leakage current through the resistive paths from the power supply to ground during steady state. The static power exists even in the inactive blocks of a digital circuit to hold the logic states between the switching events [73]. The leakage current through the transistors results from substrate injection and sub-threshold effects, and is primarily determined by fabrication technology considerations. Dynamic power is due to the switching of capacitive loads and the short circuit current that results when both the pFET and nFET of a CMOS gate are ON during switching transitions. The above-mentioned sources of power consumption are summed up in the equation

P = α C VDD ΔV f + ISC VDD + Ileak VDD    (2.1)

where α is the activity factor, C is the total capacitance, VDD is the supply voltage, f is the clock frequency, ΔV is the input voltage swing [9, 73], ISC is the short circuit current and Ileak is the leakage current. In equation 2.1, the first term corresponds to the switching component of power, the second is due to the short circuit current and the last term defines the power dissipated by the leakage current. When the input voltage swing is equal to VDD, the switching power becomes

PSwitching = α C VDD^2 f    (2.2)

This switching component of power is due to the logic transitions at the gates as shown in Figure 2.1. When the input goes from 1-to-0, the pMOS transistor is switched ON and the capacitance at the output node, C, is charged from the supply VDD. When the input transitions from 0-to-1, the nodal capacitance is discharged into the ground through the nMOS transistor. Thus, during each transition the power given by equation 2.2 is expended.

Figure 2.1: Switching current in an inverter.

The short circuit current flows from the supply to the ground during a logic transition for the short duration for which both the nMOS and pMOS trees of the gate are turned ON. It is shown in Figure 2.2. The graphs in Figure 2.3 show the characteristics of the short circuit current. We see that during a rise transition at the input of an inverter, when Vin is at 0V, the pMOS transistor is ON and the nMOS transistor is OFF. Once the input rises above Vtn, the nMOS transistor turns ON. As the input keeps rising and becomes greater than VDD - |Vtp|, the pMOS transistor turns OFF. Thus, during the period when the input is between Vtn and VDD - |Vtp|, both the nMOS and pMOS transistors are ON. This causes a current to flow from the supply to the ground through the ON-resistances of the two transistors. Similarly, we can interpret the fall transition in Figure 2.3. Hence, the short circuit current depends on the rise and fall times of the inputs.
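Before looking at each component in more detail, the relative sizes of the three terms in equation 2.1 can be illustrated with a small numeric sketch. The Python fragment below is only an illustration with hypothetical parameter values; it is not part of the thesis tool flow, which is written in Perl.

```python
# Illustrative evaluation of equation (2.1); all parameter values are hypothetical.

def cmos_power(alpha, cap, vdd, swing, freq, i_sc, i_leak):
    """Return the (switching, short-circuit, leakage) power components in watts."""
    p_switch = alpha * cap * vdd * swing * freq  # first term of (2.1)
    p_sc = i_sc * vdd                            # short-circuit term
    p_leak = i_leak * vdd                        # static (leakage) term
    return p_switch, p_sc, p_leak

# Example: 10 fF load, full-swing signaling (swing = VDD), 1 GHz clock.
p_sw, p_sc, p_lk = cmos_power(alpha=0.2, cap=10e-15, vdd=1.2,
                              swing=1.2, freq=1e9, i_sc=1e-6, i_leak=5e-9)
print(p_sw, p_sc, p_lk)  # with full swing, the first term reduces to equation (2.2)
```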
The short circuit power of an unloaded inverter, shown in Figure 2.2, is found to be approximately

PSC = (β/12) (VDD - 2Vth)^3 τ f    (2.3)

where β is the transistor coefficient, Vth is the threshold voltage of the transistor (assuming both pMOS and nMOS have the same threshold voltage), τ is the rise/fall time and f is the frequency [72, 77]. From equations 2.1 through 2.3 we observe that the dynamic power consumed by digital CMOS circuits can be reduced by adjusting the supply voltage, threshold voltages, node capacitance, activity factor, clock frequency, low voltage swing signaling techniques, etc.

Finally, the third component of power is the static power. This power is consumed to maintain the logic levels at the gate outputs when the inputs are in steady state. The main sources of leakage currents are the source/substrate or drain/substrate p-n junction band-to-band tunneling leakage current (ID), the gate direct tunneling current (IG), which is due to the tunneling of an electron or a hole from the substrate to the gate through the gate oxide, and the subthreshold leakage current from drain to source through the channel of an OFF transistor (ISUB) [35, 72, 73, 74].

Figure 2.2: Short circuit current in an inverter.

The subthreshold leakage or weak inversion conduction occurs when the gate voltage is below Vth. This is due to the diffusion current of the minority carriers in the channel. Consider a CMOS inverter. When the input is low, the nMOS transistor is turned OFF and the output voltage is high. Even when VGS is 0V, there is still a current flowing in the channel of the OFF nMOS device due to the VDD potential between the drain and the source. This component of leakage is the dominant component due to the low threshold voltages being used. The subthreshold current is given by the following equations

ISUB = A e^((q/(n k T)) (VGS - Vth0 - γ' VSB + η VDS)) (1 - e^(-q VDS/(k T)))    (2.4)

A = μ0 Cox' (W/Leff) (kT/q)^2 e^1.8    (2.5)

where γ' is the linearized body-effect coefficient, VSB is the source-to-bulk voltage of the transistor, VGS is the gate-to-source voltage of the transistor, η is the DIBL coefficient, VDS is the drain-to-source voltage of the transistor, Cox' is the gate oxide capacitance, μ0 is the zero-bias mobility, Vth0 is the zero-body-bias threshold voltage and n is the subthreshold swing coefficient of the transistor [35, 72, 73, 74].

Figure 2.3: Short circuit current characteristics of an inverter.

The gate-direct tunneling current density is given by the equation

JDT = A (Vox/tox)^2 e^(-B (1 - (1 - Vox/φox)^(3/2)) / (Vox/tox))    (2.6)

where Vox is the potential drop across the thin oxide, φox is the barrier height for the tunneling electron, and tox is the oxide thickness.

Figure 2.4: Leakage currents in an nMOS transistor.

The various components of the gate direct tunneling current are the edge direct tunneling currents, which go into the source and drain extension regions (Igso and Igdo), the gate-to-channel current (Igc), which gets divided into the gate-to-source (Igsc) and the gate-to-drain (Igdc) currents, and the gate-to-substrate leakage current (Igb) [72]. The gate direct tunneling current increases exponentially with the reduction in oxide thickness and supply voltage. This component of leakage is fast becoming the dominant part. In 90nm devices, gate leakage can be 1/3 of the subthreshold leakage. It can become equal to the subthreshold leakage in 65nm devices [35]. Using a high-k dielectric for the gate oxide can help reduce the gate leakage.
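As a numeric illustration of the subthreshold model of equations 2.4 and 2.5, the sketch below evaluates the leakage of an OFF nMOS transistor in an inverter whose output is high (VGS = 0, VDS = VDD). All device parameters are hypothetical placeholders, not values from the thesis experiments, and the Python fragment is illustrative only.

```python
# Illustrative evaluation of the subthreshold current, equations (2.4)-(2.5).
# Every device parameter below is a hypothetical placeholder.
import math

Q = 1.602e-19  # electron charge (C)
K = 1.381e-23  # Boltzmann constant (J/K)

def i_sub(vgs, vds, vsb, vth0, n, gamma_p, eta, mu0, cox, w, l_eff, temp=300.0):
    """Subthreshold leakage current per equations (2.4) and (2.5)."""
    vt = K * temp / Q                                    # thermal voltage kT/q
    a = mu0 * cox * (w / l_eff) * vt**2 * math.exp(1.8)  # prefactor A, eq (2.5)
    exponent = (vgs - vth0 - gamma_p * vsb + eta * vds) / (n * vt)
    return a * math.exp(exponent) * (1.0 - math.exp(-vds / vt))

# OFF nMOS in an inverter whose output is high: VGS = 0, VDS = VDD = 1.2 V.
print(i_sub(vgs=0.0, vds=1.2, vsb=0.0, vth0=0.3, n=1.5, gamma_p=0.1,
            eta=0.05, mu0=0.03, cox=1.5e-2, w=2e-7, l_eff=9e-8))
```

The exponential dependence on Vth0 and on the DIBL term η·VDS in the exponent is what makes the threshold and supply voltage choices so influential on leakage, as discussed in the following subsections.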
The band-to-band tunneling (BTBT) current from the drain and source to the substrate is the reverse-biased pn-junction current consisting of the diffusion/drift current near the edge of the depletion region and the current due to electron-hole pair generation in the depletion region. In the case of an inverter, when the input is low the pMOS is ON, the nMOS is OFF, and the output is high. This high voltage at the output makes the drain-to-substrate voltage of the nMOS transistor equal to VDD. This results in leakage from the drain to the substrate. This component of leakage current increases significantly when the wells or the substrate are highly doped, or when the drain and source are at a much higher potential than the substrate. The band-to-band tunneling leakage can be reduced by forward biasing the substrate or by reducing the doping near the drain-substrate and source-substrate junctions [72, 73, 74, 76]. The BTBT current density is given by

JBTBT = A (E Vrev / Eg^(1/2)) e^(-B Eg^(3/2) / E)    (2.7)

A = sqrt(2m*) q^3 / (4 π^3 ħ^2)    (2.8)

B = 4 sqrt(2m*) / (3 q ħ)    (2.9)

E = sqrt( 2 q Na Nd (Vrev + Vbi) / (εsi (Na + Nd)) )    (2.10)

where Na and Nd are the doping concentrations on the p and n sides, εsi is the permittivity of silicon, m* is the effective mass of the electron, q is the charge of an electron, ħ is 1/(2π) times Planck's constant, E is the electric field at the junction, Eg is the energy-band gap, Vrev is the applied reverse bias, and Vbi is the built-in voltage across the junction [76].

In addition to the above leakage mechanisms, other secondary effects also exist. The gate induced drain leakage IGIDL is the current flowing from the drain to the substrate because of the high electric field in the drain region due to the high drain-to-gate voltage [35, 76]. In short-channel devices, due to the proximity of the drain and the source, the depletion regions at the drain-substrate and source-substrate junctions extend into the channel. As the channel length is reduced, if the doping is kept constant, the separation between the depletion region boundaries decreases. An increase in the reverse bias across the junctions also pushes the junctions nearer to each other. When the combination of channel length and reverse bias leads to the merging of the depletion regions, punchthrough currents Ipunchthrough flow from drain to source [76].

Leakage currents increase exponentially with the scaling of CMOS. They increase with the decrease in the threshold voltages and the lengths of the transistors. The operating temperature rises during device operation, which causes an exponential increase in the leakage currents.

The power dissipated by CMOS digital circuits can be reduced while retaining the required functionality and performance. An outline of a few of the techniques proposed for power reduction in CMOS circuits is given in the following section.

2.2 Power Reduction in CMOS Circuits

Minimizing power consumption in digital CMOS circuits can be done either by slightly altering the make of the transistors at the fabrication level, or by making changes at the implementation level, or in the architecture of the basic building blocks of a complex CMOS system. At the design level, power reduction can be achieved by making low power dissipation the key objective of the design. This allows a simpler design with a reduction in the number of primitive gates and the resistive paths connecting them. This reduction in the number of dissipating components decreases the overall power dissipation of the system while performing the same required operations.
2.2.1 Technology Scaling

This technique proposes to scale all the voltages and linear dimensions by a constant factor α (< 1) as shown in Figure 2.5 [73, 74, 76]. Since the electric field in the device remains constant, the device current and wire capacitances are all scaled by the same factor.

Figure 2.5: Scaling of an nMOS transistor.

The energy scales by a factor α^3, as both voltage and current are scaled by α. The delay is improved by a factor α. Hence the energy-delay product decreases by a factor α^4. But this method requires all voltages to be scaled down, including the threshold voltage, and the threshold voltage is limited by the leakage current through the OFF transistors and the allowable static power [25, 73, 74]. In recent technologies, the supply voltage has reached 1V. This has imposed physical limitations on scaling. The built-in potentials of the device and the Si bandgap energy do not change with scaling. They can be adjusted to the required levels by increasing the doping or forward biasing the substrate. Because of thermodynamic limitations, the threshold voltage of a device cannot be scaled further while keeping the leakage currents manageable. Reaching this ultimate level of scaling can be postponed by increasing the electric field in the device by a factor greater than one, but this increases the power dissipation and decreases the reliability of the device. Hence, there is a limit to this scaling process due to various physical phenomena, and alternative methods should be chosen to overcome this [73].

2.2.2 Transistor Sizing

We can reduce the junction capacitance and the overall gate capacitance of the transistors, which cause loading on the input, by making the transistors much smaller than the applied load capacitance [25, 72]. Reduction in size reduces the current drive of the transistors, thereby decreasing the system performance. This technique provides a trade-off between energy and speed, as smaller sizing results in lower energy and greater delay. Gates in the critical paths are sized according to the performance requirements, and the gates in the non-critical paths are sized to minimize area and reduce power dissipation [72].

2.2.3 Supply Voltage Scaling

This method employs a reduction in supply voltage. Decreasing the supply voltage results in large savings in power because the switching power is proportional to the square of the supply voltage. Supply voltage scaling also reduces the leakage power, since the gate leakage and the GIDL and DIBL leakage components are also reduced [72]. As the capacitance and the threshold voltage are kept constant, the performance of the system decreases. At larger voltages, reducing the supply voltage reduces energy for an acceptable change in delay. At voltages near the device threshold, small supply changes cause a large change in delay for a modest change in energy. Hence, this technique is best suited to higher voltage ranges; for voltages comparable to the threshold voltage this method cannot be used. When using a lower VDD, we can use parallel and/or pipelined architectures to maintain decent circuit speeds [9, 25, 73].

Parallelism allows one to build N functional units and perform N operations at the same time. This can decrease the time required for the operation of these N functions when compared with the same system having only one functional unit. Besides saving energy, parallelism decreases the system delay. Parallelism allows a greater number of functions to be performed simultaneously.
In systems with a single functional unit, these operations must be carried out sequentially and thus require more time. Thus, we can decrease the supply voltages of each of the parallel blocks while keeping their delays equal to that of the single-block delay, and eventually decrease the energy consumption of the system. Additionally, parallelism allows us to shut down some blocks when they are not being used. Hence, an energy-efficient system can be designed.

But not all systems can be designed to extract parallelism. The system design, in some cases, may not give the desired level of parallelism, i.e., the number of functions being performed simultaneously might not be as large as required. Besides, designing such alternative systems requires a lot of creative insight and may also extend the time required for designing.

We can also apply different voltages to different functional blocks based on the performance requirements of those specific blocks. There are many algorithms proposed in the literature to do this. The various voltage scaling schemes can be classified as follows [35]:

1. Static Voltage Scaling (SVS) - different blocks and subsystems are given different, fixed supply voltages. A higher supply voltage is used for blocks falling on the critical path of the circuit and a lower supply voltage on non-critical paths [72]. If two voltages are used, it is called the dual-VDD technique, and when more than two supplies are used, it is called the multi-VDD technique.

2. Multi-level Voltage Scaling (MVS) - an extension of the static voltage scaling case where a block or subsystem is switched between two or more voltage levels. Only a few fixed, discrete levels are supported for different operating modes.

3. Dynamic Voltage and Frequency Scaling (DVFS) - an extension of MVS where a larger number of voltage levels are dynamically switched to follow changing workloads. The highest VDD is used when the functional block is required to perform at the fastest frequency. When the performance demand is lower, a lower VDD is used for the same block [72]. The voltages must be limited to the range over which delay and voltage track monotonically [35]. A minimum clock speed that meets the workload requirements is determined, and then the lowest supply voltage that will support that clock speed is determined [35]. Articles [14, 28, 45, 46, 67, 69] discuss different techniques used for employing dual/multiple supply voltage scaling.

4. Adaptive Voltage Scaling (AVS) - an extension of DVFS where a control loop is used to adjust the voltage. A feedback system is implemented between the voltage scaling power supply and a delay-sensing performance monitor [35].

Also, the supply voltage scaling used to reduce dynamic power increases the gate delay. To mitigate this performance degradation, the threshold voltages of the transistors are also lowered with the supply voltage. This increases the leakage currents in the CMOS devices.

Figure 2.6: Clock gating of a flip-flop.

Figure 2.7: Clock gating of a clock-tree.

2.2.4 Clock gating

In synchronous CMOS circuits, at the block level, if the clock is gated to the functional blocks, the inactive blocks are effectively turned OFF when the clock pulse stops. This avoids the switching of the nodes in the inactive blocks of the system while maintaining
When it is 0, the clock signal is cut o from the circuit and the switching activity in the ip- op or clock-tree is stopped. Clock gating decreases the switching activity in the ip- ops, gates in the fanout of the ip- ops, and the clock-tree and thus, decreases the power. The performance impact of clock gating is small [73]. 2.2.5 Power gating Power-gating is a technique used to reduce the subthreshold leakage power of CMOS circuits. It does not reduce gate leakage current [39]. A high-Vt pMOS transistor (header) or a high-Vt nMOS transistor (footer) controlled by a sleep bar or sleep signal is used to isolate the supply rails, VDD or VSS, respectively, from the rest of the logic when operating in sleep mode. This prevents the oating nodes at the logic outputs thus stopping the ow of short circuit currents into the active blocks connected to the power-gated block outputs [72, 81]. The two implementations of power gating are shown in Figure 2.8. Usually, any one of the schemes is implemented. Largely, pMOS header is used as compared to nMOS footer to reduce the leakage currents. 18 The sleep transistor has to be designed carefully so that the voltage drop across that is not very large during the active state. This ensures that the e ective supply voltage to the logic block is maintained at VDD [73]. Also, there is a dynamic power consumption associated with the turning ON/OFF of the sleep transistor. The power savings associated with the stand-by mode of a power gated logic block should be more than the power consumed while turning-ON/OFF of the sleep transistor [73]. The grouping of logic blocks for power gating implementation is also important. We should make sure to minimize the number of sleep transistors used while increasing the leakage energy savings [73]. Several algorithms have been published in the literature to address this problem. The high-Vt sleep transistors decrease the subthreshold leakage currents. In technologies less than 60nm, gate leakage is expected to become larger than the subthreshold leakage current [39, 72]. In footer transistor implementation, once the logic block output has settled to VDD, the footer induces reverse biased gate leakage current due to its bias condition. Also, nMOSFETs connected to primary inputs are other sources of gate leakage current (input gate leakage current). Assuming that the primary inputs are driven by logic blocks that are active in, the footer induces large reverse biased gate leakage current (70% of forward biased one [39]) due to the large voltage di erence between its gate and source/drain. The reverse biased gate leakage current of pMOSFET is negligible; implying that a header would be less leaky. In the header implementation, the nMOS transistors driven by logic 1 inputs experience a forward biased input gate leakage current which might imply that a footer which passes a reverse biased current is superior to a header [39, 60]. Though di erent mechanisms of leakage exist in each of the power gating schemes and to di erent extents, when total leakage current is considered, using a header transistor proves to be bene cial [39]. A header is recommended when a switch is sized such that lower delay penalty can be tolerated, but a footer is superior when delay penalty is larger [39]. When a low-Vt sleep transistor is used instead of a high-Vt, a footer is always better than a header. 
This is because the sub-threshold leakage current of the switch now makes up a large portion of the total leakage currents, due to its exponential dependency on the threshold voltage, and the sub-threshold leakage current is smaller in a footer than in a header due to the smaller size of a footer with the same delay penalty [39]. The power consumption of power gating circuits can be further reduced by controlling the primary inputs. By providing logic 1 to all primary inputs of a power-gated circuit with a footer (similarly logic 0 in the case of a header), input gate leakage currents can be virtually eliminated. However, the gate leakage current of a footer goes up due to the elevated voltage of the virtual ground. This is because the additional transistors in the input control circuits induce current that flows through the footer in addition to the current from the logic block [39]. Also, when the logic is cut off from the ground using a leakier footer transistor, the difference in potential between this virtual ground and VSS will cause ground noise, causing signal integrity problems [35, 81].

Thus, depending on the application requirements, a header or a footer sleep transistor is used. For designs with greater area constraints, an nMOS footer is a better option. The availability of a less leaky transistor design in a given technology is important for the power gating implementation. A thick oxide transistor can be used where an increase in area can be tolerated. A high-Vt thin oxide sleep transistor is preferred where area has higher priority; the higher leakage of the thin oxide can be reduced by a longer gate and/or reverse body bias techniques [35].

2.2.6 Dual/multi-threshold designs

The threshold voltage of a transistor is given by

Vth = Vth0 + γ (sqrt(|2φF + VSB|) - sqrt(|2φF|))    (2.11)

where Vth0 is the zero source-bulk bias threshold voltage, φF is the substrate Fermi potential, and γ is the body-effect coefficient. This shows that reverse body biasing a transistor increases the threshold voltage and thus decreases the leakage currents [73]. In this method, as with the dual-VDD technique, slower transistors with high Vth are used in non-critical paths and low-Vth transistors are used in critical paths to achieve the desired performance. Increasing the channel length or its doping density increases the threshold voltage. A thicker oxide layer can also be used to increase the threshold voltage, but this reduces only the subthreshold leakage and not the gate direct tunneling [72].

2.2.7 Variable threshold CMOS (VTCMOS)

In this technique a zero body bias is used in the active mode and a high reverse body bias is applied in the standby mode to increase the threshold voltage, thereby decreasing the standby leakage. The capability of the reverse body biasing technique to reduce the leakage current lowers as the technology scales [36, 72]. This is because of the exponential increase in the band-to-band tunneling current at the source-substrate and drain-substrate junctions in scaled technologies [36, 72]. Authors in [91] have described a technique which reduces the standby leakage by using high-Vth devices; during the active mode a forward body bias is applied to reduce the Vth. This reduces the short channel effects, limiting the reduction of the leakage current. It has been shown that forward body biasing with high Vth and reverse body biasing reduces leakage 20 times, while reverse body biasing alone with low Vth reduces leakage only 3 times [72].
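As a quick numeric illustration of the body-effect relation in equation 2.11, which underlies both the dual-threshold and VTCMOS techniques above, the sketch below shows how a reverse body bias raises the threshold voltage. The parameter values are illustrative only and are not taken from the thesis.

```python
# Numeric illustration of the body-effect relation (2.11): a reverse body bias
# (VSB > 0 for an nMOS device) raises the threshold voltage and hence lowers
# subthreshold leakage.  All parameter values are hypothetical.
import math

def vth(vsb, vth0=0.30, gamma=0.40, phi_f=0.35):
    """Threshold voltage with body effect, equation (2.11)."""
    return vth0 + gamma * (math.sqrt(abs(2 * phi_f + vsb)) - math.sqrt(abs(2 * phi_f)))

print(vth(0.0))  # zero body bias: Vth = Vth0 = 0.30 V
print(vth(0.5))  # 0.5 V reverse body bias: Vth rises to about 0.40 V
```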
2.2.8 Dynamic threshold scaling

Body biasing techniques are used to change the threshold voltage according to the performance requirements. When the performance demand is low, the clock frequency is lowered and the threshold voltage is increased to reduce run-time leakage. In standby mode the threshold voltage is increased to its maximum limit to reduce the standby leakage. A zero body bias with the lowest possible threshold is used when the performance requirements are the highest [72].

2.2.9 Variable supply and threshold voltages

The threshold voltage and the supply voltage can be varied simultaneously to get the required power and timing efficiency. The ON current of a transistor is given by

IDS = μ Cox (W/L) (VGS - Vth)^2 / 2    (2.12)

where μ is the mobility, Cox is the gate capacitance, Vth is the threshold voltage, and VGS is the gate-to-source voltage. As we reduce the supply voltage, the input voltage swing, VGS, is also reduced. To maintain the performance, the current drive, IDS, has to be maintained. For this the threshold voltage also has to be decreased. If we use low VDD and low Vth, the power dissipation will increase because lowering Vth results in an exponential increase in the subthreshold leakage current. Hence, we require a lower VDD and higher Vth on non-critical paths to reduce power, and maintain performance on the critical paths. We can also use dynamic supply scaling and dynamic threshold scaling together to reduce power [73, 74].

2.2.10 Transistor stacking

Using a stack of transistors in the OFF state reduces the leakage current because of the negative gate-to-source biasing, the body-effect induced threshold voltage increase, and the increased threshold voltage due to the reduced drain-to-source voltage. The time required for the leakage current in a transistor stack to converge to its final value depends on the intermediate node capacitances and the threshold voltages of the transistors [99]. Turning off more than one transistor in the stack increases the source voltage of the stack and thus reverse biases the source. Consider a stack of four nMOS transistors as shown in Figure 2.9 [72].

Figure 2.9: A stack of nMOS transistors.

All four of these transistors are in the OFF state. If only one transistor is OFF, the voltage at the source node of that transistor will be zero, as the other transistors are ON and act as a short circuit path to ground. When all the transistors are OFF, the source voltages of the OFF transistors, except the one directly connected to the ground, will be non-zero, causing self-reverse biasing of the transistors. This reverse bias reduces the leakage currents through the transistors [72].

2.2.11 Minimum leakage vector method

The leakage current of a logic block depends on the inputs, as these determine the number of ON and OFF transistors. The minimum leakage is seen when the resistance between the supply and ground through the transistors is maximum [73]. Hence, there is a combination of inputs for which the leakage power is the least. This input combination is called the minimum leakage vector (MLV). Various algorithms have been described in the literature to find the MLV of a circuit [2, 3]. After determining the MLV for a logic block, it is applied to its inputs in the standby state to reduce the standby leakage.

2.3 Conclusion

A reduction of power dissipation is important to save energy. However, it is not the only concern. Lower power dissipation decreases the heat generated within the chip, in between the individual transistors.
This reduction in heat allows a further increase in the level of integration, thus contributing towards a further reduction in size. Besides, the lower dissipation allows the packaging to be done using light and inexpensive plastics, rather than much costlier and heavier ceramic cases. Further, in larger devices like PCs and servers, fans are provided to cool the system during operation. This is not practical in mobile devices like laptops and personal digital assistants (PDAs). Hence, to increase the portability of electronic devices, power dissipation must be lowered. This also allows their manufacturing to be done at lower cost and reduces the chip area.

Chapter 3

Overview of Static Timing Analysis Algorithms

Static timing analysis was first described in [4, 20, 24, 34, 55]. The author of [55] describes a linear-time algorithm that finds the shortest/longest paths in order of their total delay. Static Timing Analysis (STA) is a technique to verify whether a digital design meets the given timing specifications. An alternate approach used to verify the timing is timing simulation, which can verify the functionality as well as the timing of the design. The term timing analysis is used to refer to either of these two methods - static timing analysis or timing simulation [7]. Timing Analysis (TA) is defined by [24] as "a design automation program which provides an alternative to the hardware debugging of timing problems". In STA, the analysis of the design is carried out statically and does not depend upon input vectors. In simulation-based timing analysis, a stimulus is applied to the primary inputs, the resulting behavior is observed and verified, then time is advanced with new inputs applied, and the new behavior is observed and verified [7]. This makes STA faster than simulation-based methods for large circuits.

Any synchronous circuit can be seen as consisting of combinational and sequential blocks such that a combinational block receives data from a clocked flip-flop and outputs data to another clocked flip-flop, as shown in Figure 3.1. The input and output flip-flops can operate on different clocks. The timing analysis program verifies whether all combinational paths meet the required timing specifications, that is, that data signals arriving at storage elements do not cause any set-up or hold-time violations. The setup time is the time taken by the data signal to arrive at a flip-flop; this should be within the given clock period. The hold time is the time for which the data should not change after a clock transition so that there is no unexpected pass-through of data through a flip-flop. That is, the data needs to be
In order to meet the speci cations, the data signal should not be too late that it does not meet the set-up time requirements and not too fast that it violates the hold time of the previous signal. Thus, the entire design is analyzed and the required timing checks are performed for all possible paths of the design. Thus, STA is a complete and exhaustive method for verifying the timing of a design [7] as opposed to the simulation based timing analysis which does not verify the parts of the circuit which are not excited by input stimulus. The basic timing analysis algorithm is analogous to the Program Evaluation and Review Techniques (PERT) [68] algorithm when run non-probabilistically. The application of PERT to logic design is described in [47]. A combinational circuit can be viewed as a Directed Acyclic Graph (DAG). In this DAG, the various nodes represent the logic blocks and edges represent the interconnects. For example, the combinational circuit shown in Figure 3.2 can be redrawn as a DAG shown in Figures 3.3 and 3.4. In these gures, each node is associated 26 with a weight, which represents the largest delay of the gate, and the edges do not have any weights associated with them. The delays of the gates can be speci ed as minimum and maximum delays or as rise and fall delays or in any other manner. Always the maximum of the delays is considered for the longest path analysis and the minimum for the shortest path analysis. When the interconnect delays are speci ed, then the maximum/minimum of the sum of a node and its output edge weights is considered. In the proposed work, longest path analysis is considered. The primary inputs (PI) and primary outputs (PO) as assumed to emerge from the source and sink nodes respectively. The source and sink nodes do not have any weights associated with them. In the static timing analysis program proposed in [24], we traverse from the primary inputs to the primary outputs through each gate such that we assign the longest path delay through this gate from the primary inputs as its weight. Once the weights of all nodes are assigned, we traverse in the reverse direction, i.e., from the primary outputs to the primary inputs to assign the slack of each node. The slack of each node is calculated as the minimum di erence between the required arrival time at the gate output so that the primary outputs meet the timing constraints and the actual arrival time of the gate output. This is best described by Figure 3.3. In Figure 3.3, PI represent the primary inputs, PO the primary outputs and A through J represent logic gates. Each logic gate is characterized by a pair of delays, (mindelay, maxdelay), which represent the minimum and maximum delays through the gate, respectively. The actual arrival times, required arrival times and the slacks of each gate are computed. The results are recorded at each node. For example, for node E, minimum delay is 1, maximum delay is 2, actual arrival time is 7, actual arrival time is 11 and the slack is 4. These values are noted at the top of node E in the DAG of Figure 3.3 as f(1,2), 7, 11, 4g. Now, consider gate E. We have to nd the longest path delay through this gate. We begin with the primary inputs and assume that the start of the algorithm represents the time 0 units. Gate E has outputs of gates A and B as its inputs. It is observed that output of gate 27 Figure 3.2: Example of a combinational circuit A arrives 3 units earlier than gate B, which arrives at 5 time units from the start. 
Hence, the longest path delay from the primary inputs to the inputs of gate E is 5 time units. The maximum delay of gate E is 2 units, so the maximum time from the start until the output of E stabilizes is the sum of 5 and 2, i.e., 7 units. Hence, the path delay from the primary inputs to the output of node E is 7 units. The path delays of all nodes are calculated in this way. From the output of E we can take two different paths to the primary outputs, shown in green and red. The path that goes through gates G and I, shown in green, yields the longest path, with a path delay of 13 units. The longest path delay of the circuit, shown in yellow, is 17 units.

Figure 3.3: Traditional STA.

Now, to find the slacks of the gates, suppose the required arrival times of both primary outputs are 17 units, the longest path delay of the circuit. The output of node I arrives at 13 units, i.e., its slack is 17 − 13 = 4 units. Similarly, the output of node J arrives at 17 units and its slack is 0 units. Since the maximum delay of node I is 4, the inputs of I should arrive by 13 units of time. The path delay of gate G is 9, i.e., its actual arrival time is 9 units. Hence, the slack of gate G is 13 − 9 = 4 units. During the backward traversal from the sink node, node E can be reached from node J, node I or node G. The required arrival time at node J's input is 14 units and node E's output arrives at 7 units, so node E's slack would be 7 units. But node I's required arrival time is 13 units, which makes the slack of node E 6 units. Further, node G's required arrival time is 11 units, which makes the slack of node E 4 units. Thus, we obtain three values for the slack of node E. We take the minimum of these values because all three timing requirements must be met. Hence, the actual slack of gate E is 4 units. The slacks of all gates are found as described above. It is observed that the gates on the longest path of the circuit, also called the critical path, have zero slack when the required time at the primary outputs equals the critical path delay.

The two primary outputs in the example are assumed to have the same required arrival time. In practice, they can have different required arrival times, which adds minor changes to the slack calculation. If the required arrival time at node I is 15 while node J remains at 17, then the slack of gate I becomes 2 units instead of 4. This algorithm requires us to traverse the graph twice, once forward to calculate the path delays and once backward to calculate the gate slacks. Thus, the time taken is proportional to the number of gates, and the complexity of the algorithm is linear, O(n), where n is the number of gates.
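The two passes just described can be summarized by a short sketch. The following Python fragment is only an illustration of the forward/backward traversal, assuming the circuit has already been levelized and is represented by per-gate fan-in lists and maximum delays; the function name and data structures are hypothetical and are not the implementation used in this work.

from collections import defaultdict

def classic_sta_slacks(gates, fanin, max_delay, required_time):
    """Two-pass slack computation in the style of [24].
    gates         : gate names in levelized (topological) order
    fanin[g]      : gates feeding g (empty if g is fed only by primary inputs)
    max_delay[g]  : maximum delay of gate g
    required_time : required arrival time at every primary output
    Returns slack[g] = required arrival time - actual arrival time at g's output."""
    # Derive fan-out lists; in this sketch a gate with no fan-out drives a primary output.
    fanout = defaultdict(list)
    for g in gates:
        for p in fanin[g]:
            fanout[p].append(g)

    # Forward pass: latest (actual) arrival time at each gate output.
    arrival = {}
    for g in gates:
        arrival[g] = max((arrival[p] for p in fanin[g]), default=0.0) + max_delay[g]

    # Backward pass: latest allowed (required) arrival time at each gate output.
    required = {}
    for g in reversed(gates):
        req = required_time if not fanout[g] else float("inf")
        for s in fanout[g]:
            # The inputs of a successor must settle max_delay[s] before its own deadline.
            req = min(req, required[s] - max_delay[s])
        required[g] = req

    return {g: required[g] - arrival[g] for g in gates}

Each gate is visited once in each pass, which is the O(n) behavior noted above.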
3.1 Algorithm 1

We use the O(n) slack calculation algorithm proposed in [40, 41, 42, 43, 44] to find the gate slacks for a given circuit. The aim is to find the longest path delay from the primary inputs to the primary outputs through a given gate; the gate slack is then calculated as the difference between the critical path delay and this longest path delay. The step-by-step approach to implementing this algorithm is described below using the example shown in Figure 3.2. The directed acyclic graph (DAG) for this circuit is shown in Figure 3.4. A given circuit is first levelized. We start from the primary inputs and traverse to the output of each gate such that we always choose the longest delay path. Simultaneously, we traverse in the opposite direction to find the longest delay path from the primary outputs to the respective gate output. The sum of the longest delay from the primary inputs to the gate output and the longest delay from the primary outputs to the gate output gives the longest path delay through that gate.

Let $T_{PI}(i)$ be the longest time for the data signal to arrive at the output of gate i from a primary input, and $T_{PO}(i)$ be the longest time for the data signal to reach a primary output from the output of gate i. The delay of the longest path [41, 42, 43] through gate i is given by

$D_{P,i} = T_{PI}(i) + T_{PO}(i)$    (3.1)

The critical path delay of the circuit is given by

$T_c = \max\{D_{P,j}\} \quad \forall$ gate $j$    (3.2)

The slack for gate i is then calculated as

$S_i = T_c - D_{P,i}$    (3.3)

Figure 3.4: New STA.

Consider Figure 3.4. Here, PI represents the primary inputs, PO the primary outputs, and A through J represent logic gates. As in Figure 3.3, for node E, the minimum delay is 1, the maximum delay is 2, $T_{PI}$ is 7, $T_{PO}$ is 6 and the slack is 4. These values are noted at the top of node E in the DAG of Figure 3.4 as {(1,2), 7, 6, 4}. Note that in this case we record the path delay from the primary outputs as well. Also, the slack of gate E is the same as the slack obtained from the algorithm described in [24].

To begin with, we levelize the circuit. The source node and the primary inputs are placed in Level 0. All the nodes that have primary inputs as their inputs are grouped in Level 1; so nodes A, B, C, D and H form Level 1. The gates fed by the outputs of Level 1 gates are considered to be in Level 2; that is, nodes E, F and G are in Level 2. Again, the outputs of Level 2 gates feed Level 3 gates, so nodes G, H, I and J form Level 3. Note that G falls in Levels 2 and 3; we take the maximum of the two, so G falls in Level 3. Similarly, H also falls in Level 3. Level 4 gates are those that have Level 3 gates as their inputs, which places nodes I and J in Level 4. Finally, the primary outputs and the sink node form Level 5, the highest level. These levels are separated by dashed vertical lines in Figure 3.4.

Again, as in the previous algorithm, consider gate E. To find the longest path delay through this gate, begin with the primary inputs and assume that the start of the algorithm represents time 0. Gate E has the outputs of gates A and B as its inputs. The output of gate A arrives 3 units earlier than that of gate B, which arrives at 5 time units from the start. Hence, the longest path delay from the primary inputs to the inputs of gate E is 5 time units. The maximum delay of gate E is 2 units, so the maximum time from the start until the output of E stabilizes is 5 + 2 = 7 units. Hence, the path delay from the primary inputs to the output of node E is 7 units. The path delays of all nodes are calculated in this way. From the output of E we can take two different paths to the primary outputs, shown in green and red. The path that goes through gates G and I, shown in green, yields the longest path, with a path delay of 13 units. The longest path delay of the circuit, shown in yellow, is 17 units. As we traverse from the primary inputs toward the gate, we concurrently traverse from the primary outputs toward the gate to find the longest path delay from the primary outputs to the gate output. Thus, to find the longest path delay from the primary outputs to the output of gate E, we start from the sink node and try to reach node E. We can reach E from node J, from node I, or from node G.
The path from node J, shown by a red line, results in a delay of 3 units at E, the maximum delay of gate J. Similarly, the path from node I, also a red line, results in a delay of 4 units at E. The path through node G, the green line, starts from the sink, goes through node I, then through node G, and ends at node E; the delay of this path is 6 units, the sum of the maximum delays of nodes I and G. Of these three paths leading to node E, the longest path delay is 6 units. The longest path delay from the primary inputs to the primary outputs through gate E is thus 7 + 6 = 13 units.

Now, to find the slack of each gate, we subtract the longest path delay thus obtained from the critical path delay of the circuit. The critical path delay is found similarly and is 17 units. Thus, the slack of gate E is 17 − 13 = 4 units. The slacks of all other gates are found in the same way. Again, it is observed that the gates on the longest path of the circuit, also called the critical path, have zero slack when the required time at the primary outputs equals the critical path delay. Also, all gates that share the same longest path have the same slack. This is in contrast to the slacks derived from the previous algorithm because, here, slack is calculated as the difference between the critical path delay and the longest path delay through a gate, and gates having the same longest path have the same path delay, whereas the slack defined in [24] is the difference between the required arrival time and the actual arrival time at each gate output. From Figures 3.3 and 3.4, we observe that the slacks of all gates derived in both ways are the same.

The algorithm described in the previous section requires complete forward and backward traversals. This new algorithm requires only a partial traversal of the graph, up to the node output, to calculate the slack of one gate. To find the slacks of all gates we still require complete traversals of the graph in both directions, but to calculate the slack of an individual gate we need not wait until the forward traversal is complete; traversal in both directions can be done simultaneously. Thus, the time taken by this algorithm is lower than that of the previous algorithm. The complexity of the algorithm is still linear, O(n), where n is the number of gates, as the time remains proportional to the number of gates. This algorithm finds the slacks corresponding to the longest path delay through a gate. Similarly, the slacks corresponding to the shortest path delay can be calculated by considering the minimum gate delays and the paths yielding the shortest delay. The shortest delay through gate E in the earlier example would then be 3 units starting from the source and 1 unit from the sink. The shortest path problem is considered when we are required to check for signals arriving earlier than the required time. The linear complexity of this algorithm makes it effective for large circuits, in contrast to other techniques such as delay simulation, which requires large numbers of test patterns, and path tracing.
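For concreteness, a minimal sketch of this slack computation is given below, in the same illustrative style as the earlier fragment. It reuses the assumed levelized gate list, fan-in lists and maximum-delay table; these names and structures are hypothetical and are not the actual implementation of this work.

from collections import defaultdict

def slacks_by_longest_path(gates, fanin, max_delay):
    """Slack computation of Eqs. (3.1)-(3.3).
    T_PI(i): longest delay from the primary inputs to the output of gate i.
    T_PO(i): longest delay from the output of gate i to a primary output
             (sum of the downstream gate delays on the longest such path)."""
    fanout = defaultdict(list)
    for g in gates:
        for p in fanin[g]:
            fanout[p].append(g)

    t_pi, t_po = {}, {}
    # Forward traversal over the levelized gate list (primary inputs at time 0).
    for g in gates:
        t_pi[g] = max((t_pi[p] for p in fanin[g]), default=0.0) + max_delay[g]
    # Backward traversal; a gate with no fan-out is taken to drive a primary output.
    for g in reversed(gates):
        t_po[g] = max((t_po[s] + max_delay[s] for s in fanout[g]), default=0.0)

    d_p = {g: t_pi[g] + t_po[g] for g in gates}   # Eq. (3.1)
    t_c = max(d_p.values())                       # Eq. (3.2): critical path delay
    return {g: t_c - d_p[g] for g in gates}       # Eq. (3.3): gate slacks

Both traversals visit each gate once, so the sketch retains the O(n) behavior; in an actual implementation the two traversals can proceed concurrently, as described above.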
Chapter 4
Slack-Based Algorithm For Finding an Optimal Lower Supply Voltage

As discussed in Chapter 2, using multiple supply voltages is a widely used technique to reduce power consumption in electronic circuits. In such designs, a level converter is required to boost the logic '1' level at the output of a low-voltage gate to the high-voltage logic '1' level whenever the gate has a high-voltage gate in its fan-out. If we never allow a low-voltage gate to drive a high-voltage gate, then a level converter is not required. Level converters are discussed in detail in Chapter 7.

The authors of [56, 73] describe a method to find the lower supply voltage in a dual-voltage design. The power reduction ratio R is the ratio of the power dissipated in a dual-voltage design to that of a single-voltage design operating at the higher supply voltage, and is expressed as

$R = 1 - \frac{C_{V_L}}{C}\left\{1 - \left(\frac{V_L}{V_H}\right)^2\right\}$    (4.1)

where $C_{V_L}$ is the total capacitance of the low-voltage cells and C is the total capacitance of the single-voltage circuit [56]. If p(t) is the number of paths whose delay is t when $V_L = V_H$, and the path delay t is normalized to the cycle time, then the power reduction for various distribution functions p(t) is as shown in Figure 4.1 [56]. Figure 4.1 shows that the minimum value of R depends on p(t), and that R is minimized when $V_L$ falls between $0.6V_H$ and $0.7V_H$; thus $V_L$ should lie in this range to minimize power dissipation. The same authors, in [21], derive a relationship between the optimum voltage ratios in multiple-$V_{DD}$ circuits. For dual supplies $\{V_1, V_2\}$, $V_2/V_1 = 0.5 + 0.5\,V_{th}/V_1$. For three supplies $\{V_1, V_2, V_3\}$, $V_2/V_1 = V_3/V_2 = 0.6 + 0.4\,V_{th}/V_1$. And for four supplies $\{V_1, V_2, V_3, V_4\}$, $V_2/V_1 = V_3/V_2 = V_4/V_3 = 0.7 + 0.3\,V_{th}/V_1$. The authors claim that this rule of thumb gives optimum supplies whose power is within 1% of the true minimum. These results show that the power savings tend to saturate as the number of supplies is increased, and also that the savings decrease as the supply voltage is scaled down and as $V_{th}/V_{DD}$ increases.

Figure 4.1: Power reduction versus ratio of low voltage and high voltage, [56, 73].
Figure 4.2: Optimum supply voltages and minimum power dissipation for multiple VDD values, [56, 73].

An algorithm to find the optimum $V_{DD}$ for each node is described in [10]. It is estimated by finding

$\max \sum_{v} (V_H^2 - V_L^2)\, C(v)\, E(v)\, k_v$    (4.2)

where $k_v = 1$ if $V_{DD}(v) \le V_L$, and $k_v = 0$ if $V_{DD}(v) > V_L$. The works described in [10, 50, 84] claim that the optimal value of $V_L$ for minimizing total power is 50% of $V_H$. Several linear programming algorithms have been proposed in [8, 32, 33, 41, 79] to select the optimum value of the lower voltage. The authors of [101] describe an algorithm to find, from a given set of voltages, the lowest feasible supply voltage for each gate according to its slack. Another slack-based algorithm has been proposed in [41], in which the authors find the optimum low-voltage value based on an estimate of the minimum-energy state of the circuit.

In our work, we propose an algorithm to find the lower supply voltage using the gate slack. The slack of a gate is defined as the difference between the critical path delay of the circuit and the delay of the longest path through the gate. Thus, each gate has its own slack, and gates with the same slack fall on the same path unless two paths have equal delays. Before describing our algorithm for assigning the low voltage, we propose three theorems to categorize the gates of any given circuit based on their slacks. The slack of each gate in a given circuit is plotted against the difference between its low-voltage delay and its high-voltage delay, as shown in Figure 4.3.
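Purely as an illustration of how the data behind such a plot could be assembled, the sketch below pairs each gate's slack at $V_H$ with its delay increment at the candidate $V_L$. The per-gate delay tables at the two voltages are assumed inputs (obtained, for example, from library characterization or SPICE), and the function name is hypothetical; this is not the code used to generate Figure 4.3.

def delay_increment_vs_slack(slack_vh, delay_vh, delay_vl):
    """Assemble the (slack, delay increment) point for each gate.
    slack_vh[g] : slack of gate g with every gate at VH (e.g., from the
                  slack sketch of Chapter 3)
    delay_vh[g] : maximum delay of gate g at the high supply VH
    delay_vl[g] : maximum delay of gate g at the candidate low supply VL
    Points whose delay increment exceeds the slack lie above the 45-degree
    line of the plot."""
    return {g: (slack_vh[g], delay_vl[g] - delay_vh[g]) for g in slack_vh}

# Hypothetical example: a gate with 4 units of slack whose delay grows from
# 2 to 3.1 units at VL yields the point (4.0, 1.1), below the 45-degree line.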
Theorem 4.1 The gates that fall above the 45° line in the "delay increment versus slack" plot cannot be assigned the lower supply voltage without violating the positive slack constraint. These gates fall in group 1.

Figure 4.3: Delay increment versus slack for gates of the c880 circuit, $V_H = 1.2$ V, $V_L = 0.58$ V.

Proof The slacks used in the "delay increment versus slack" plot are the slacks when all gates are at the higher supply voltage. That means the gate delays are the high-voltage delays $d_{h_i}$, where i is the gate number. When we put a gate at the low voltage, its delay becomes the low-voltage delay $d_{l_i}$. The slack of a gate is the difference between the critical path delay and the sum of the gate delays along the longest path through that gate. This can be written as

$slack_h = T_c - \sum_{i=1}^{n} d_{h_i}$    (4.3)

where the sum is taken over the n gates on the longest path through the gate under consideration. When this gate is put at the low voltage, its high-voltage delay in the sum is replaced by its low-voltage delay, and its slack becomes

$slack_l = T_c - \left(\sum_{i=1}^{n} d_{h_i} - d_{h_i} + d_{l_i}\right)$    (4.4)

$\Rightarrow slack_l = T_c - \sum_{i=1}^{n} d_{h_i} - (d_{l_i} - d_{h_i})$    (4.5)

$\Rightarrow slack_l = slack_h - (d_{l_i} - d_{h_i})$    (4.6)

The above equation shows that the slack is reduced by the amount $d_{l_i} - d_{h_i}$ when gate i is put at the low voltage. Now suppose $slack_h$