PATH DELAY TUNING FOR PERFORMANCE GAIN IN THE FACE OF RANDOM MANUFACTURING VARIATIONS

by
Kautalya Mishra

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 9, 2011

Keywords: Process variability, Path delay improvement, Performance tuning

Copyright 2011 by Kautalya Mishra

Approved by
Adit Singh, Chair, James B. Davis Professor of Electrical and Computer Engineering
Vishwani Agrawal, James J. Danaher Professor of Electrical and Computer Engineering
Victor Nelson, Professor of Electrical and Computer Engineering

Abstract

Moore's law, which predicts that transistor densities in an integrated circuit double every two years, has held true for the past fifty years and is predicted to hold true in the coming few years. But as technologies shrink to smaller dimensions and scaling becomes more aggressive, a number of factors are beginning to hinder the drive towards miniaturization. Process variability, which introduces parametric variations in a device, is one such factor that today seriously limits clock rates in large synchronous designs. These variations are predicted to increase significantly with device scaling as silicon technology approaches the end of the roadmap.

Prior research on process variability shows that the factors contributing to process variations affect the threshold voltage (Vth) of transistors. Varying Vth corresponds to varying path delays in a circuit, a few of which are significant outliers, with path delays much larger than the nominal delays the circuits were designed to have. Because these variations are distributed randomly across a chip, it is hard to predict before fabrication which paths in a design will be the slowest, and hence hard to make room at design time for a speedup methodology.
With circuits today including several hundred million transistors, virtually every design will have a dozen of these outlier transistors that limit the maximum clock frequency at which the chip can be correctly operated. This thesis studies a new design architecture that allows for tuning and speedup of exceptionally slow paths in a chip, to recover the lost performance and significantly increase the average clock speed attainable by the manufactured parts.

Acknowledgments

I'm very thankful and honored to have had the opportunity to work with Dr. Adit Singh, whose deep insight, energy and conviction were a source of inspiration and motivation. His words of advice and his work ethic have not just helped me work better academically, but also perform better in other areas of life. I'm deeply indebted to him for having given me the opportunity to pursue research with him, and thankful for the lessons learned and the knowledge gained.

I'd like to thank Dr. Vishwani Agrawal for being on my thesis defense committee. Over the period of my Master's studies I've learned a number of lessons on dedication and humility while observing Dr. Agrawal both in and out of the classroom. His inquisitive nature and approachability have been instrumental in helping me overcome a number of brick walls in my understanding of various concepts.

I'd also like to thank Dr. Victor P. Nelson for being on my thesis defense committee, and for the knowledge imparted through his courses, which has been of utmost importance as it satisfied a prerequisite for entering the industry and starting my electrical engineering career.

I'd like to thank Auburn University for providing me with the facilities and software required for my research work, and the people at Auburn who've been a part of my memorable stay there.
Last but not least, I'd like to thank my family for their love, support and words of wisdom, and for letting me know that they are always there for me no matter what.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
2 Literature Survey
  2.1 Historical Overview of Process Variability
  2.2 Sources of Variability
    2.2.1 Random Dopant Fluctuation (RDF)
    2.2.2 Line-Edge Roughness (LER) and Line-Width Roughness (LWR)
    2.2.3 Oxide thickness variation due to interface roughness
    2.2.4 Polysilicon granularity (PSG)
    2.2.5 High K dielectric morphology
    2.2.6 Other sources
  2.3 Impact of Process Variability
3 Effects of Outlier Presence on Path Delays
  3.1 Normal distribution
  3.2 VOL
  3.3 Effects of process variability on high performance applications
4 CMOS tunable gate architecture
  4.1 CMOS tunable gates
  4.2 Tuning strategy
  4.3 Sizing of tuning transistors
  4.4 Faulty tuning transistors
  4.5 NAND gate simulations
5 SPICE Netlist Extraction
  5.1 Mentor Graphics Tools
    5.1.1 ModelSim
    5.1.2 Leonardo Spectrum
    5.1.3 Design Architect
    5.1.4 IC Station
    5.1.5 LVS
    5.1.6 PEX
  5.2 Tuning transistors
6 Tuning implemented on very large circuits
  6.1 Building large circuits
    6.1.1 Difficulties
    6.1.2 Basic algorithm
    6.1.3 Standard circuit used to create larger circuits
  6.2 Simulation
    6.2.1 Primitive model files
    6.2.2 Single EXOR tree simulations
    6.2.3 Tuning
    6.2.4 EXOR tree simulation: example
    6.2.5 Picking 'N' sub-circuits to create a larger circuit
7 Simulation Results
  7.1 Plots
  7.2 Observation
  7.3 Benefits obtained
  7.4 Conclusion
8 Power Dissipated with tuning
  8.1 Static power dissipation
  8.2 Simulation difficulty
  8.3 Algorithm
  8.4 Simulation
    8.4.1 Bins
    8.4.2 SPICE
    8.4.3 Picking 10,000 sub-circuits to create a larger circuit
  8.5 Simulation results
9 Conclusion
Bibliography

List of Figures

1.1 Processor clock frequency versus Time
2.1 Random Dopant Fluctuations in sub-micron technologies
2.2 Definitions of LER and LWR
3.1 Threshold voltage variation with scaling
3.2 Standard inverter with an input and an output load
3.3 VOL versus Delay
3.4 Process variability effects on pre-optimized designs
4.1 Tunable CMOS architecture
4.2 Huffman model of Sequential circuits with tunable gates
4.3 Rising transition speed-up
4.4 Falling transition speed-up
4.5 Voltage division on tuned nodes
4.6 Tunable NAND gate
5.1 Tuning circuit
5.2 Inverter as a tuning circuit
5.3 Inverted inverter as a tuning circuit
6.1 EXOR tree circuit
7.1 Simulation case: Vdd=1.5V sigma=0.075
7.2 Simulation case: Vdd=1.2V sigma=0.075
7.3 Simulation case: Vdd=1.5V sigma=0.150
7.4 Simulation case: Vdd=1.2V sigma=0.150

List of Tables

3.1 Normal distribution - Statistics
3.2 Delay versus VOL
4.1 Tunable NAND gate simulations
6.1 Number of circuits tuned
6.2 Tunable EXOR tree simulation
6.3 Performance gain for a Single EXOR tree
7.1 Performance gain for a circuit with 10,000 EXOR trees
8.1 Bins
8.2 Power dissipated with tuning

List of Abbreviations

SiO2     Silicon-dioxide
μ        Mean
σ        Standard deviation
.pm      Primitive Model file
VHSIC    Very High Speed Integrated Circuits
CMOS     Complementary Metal-Oxide Semiconductor
GND      Ground
HfO2     Hafnium-dioxide
ITRS     International Technology Roadmap for Semiconductors
LER      Line Edge Roughness
LWR      Line Width Roughness
MOSFET   Metal-Oxide Semiconductor Field Effect Transistor
MOS      Metal-Oxide Semiconductor
NMOS     N-type MOS transistor
OPC      Optical Proximity Correction
PMOS     P-type MOS transistor
PSG      Polysilicon granularity
PSM      Phase-Shifting Mask
RDF      Random Dopant Fluctuations
RET      Resolution Enhancement Techniques
SDE      Source-Drain Extension
Si       Silicon
TOX      Gate oxide thickness
Vdd      Supply voltage
VHDL     VHSIC Hardware Description Language
VOL      Difference between Vdd and Vth
Vth      Threshold voltage

Chapter 1
Introduction

The electronics industry has come a long way since the integrated circuit was invented in 1958. Gordon E. Moore's predictions, popularly known as Moore's Law, of transistor densities in an integrated circuit doubling every two years have held true so far, and are expected to hold true for at least the coming decade. Increased complexity, faster clock rates, increased storage capacity and the ability to fit much more into the same area have been the major driving forces favoring scaling.
But sustaining the scaling trend forecast by Moore's Law now faces serious challenges [1]: the fundamental limitations imposed by the inherent physical properties of the underlying materials, required tolerances down to a few atomic dimensions in the manufacturing processes, and statistical anomalies arising from the finite number of atoms constituting device structures in highly scaled technologies. Even though the scaling trend predicted by the International Technology Roadmap for Semiconductors (ITRS) is expected to continue for at least another decade, performance gains measured in terms of processor speeds are expected to saturate or even fall back a little. Figure 1.1 shows the variation of processor clock frequency with time.

Figure 1.1: Processor clock frequency versus Time

A majority of the work done to compensate for the drop in processor speeds has been in the field of multiprocessing, where microprocessors are designed with multiple cores to improve performance through parallelism. Indeed, impressive gains have been observed with dual and quad core processors, which have even inspired thoughts of incorporating a dozen or even several hundred cores in future advanced designs. But according to Amdahl's law, and backed by extensive research, additional processors yield only diminishing returns in most general purpose applications. Hence, in the long run, having multiple cores is not going to be a viable solution to improving performance. It is imperative that focus be given to addressing the challenges of scaling that limit clock frequencies.

At the small dimensions at which devices are fabricated today, manufacturing variations that produce changes in device parameters are found to occur randomly and in significant amounts.
Because of their random nature, these variations are beyond the control of designers, who cannot make prefabrication changes to their designs to overcome the effects of these variations. Such random process variations occur significantly in nanoscale devices, and are expected to be more significant in the 22nm technology and beyond.

Large complex designs today have several million transistors, and it is statistically likely that every manufactured part will have dozens of slow transistors that are performance outliers, lying in the far ends of the distribution, 4-6 standard deviations or more away from the nominal value. The presence of such an outlier along a path can significantly push the path delay up, and hence limit the frequency at which the device can be operated.

The focus of this thesis is to study a new tuning technique that enables post-manufacture speedup of exceptionally slow paths, and hence provides the ability to push clock frequencies up to recover much of the lost performance. The aim is to bring the outlier path delays down as close as possible to the nominal average case delay, without changing the logic levels at which the gates switch, while ensuring that the extra power dissipation is within acceptable limits.

Also, variations in processor speeds make it imperative that every chip or core be individually "speed binned" and operated at the fastest clock rate that it can reliably sustain, to take advantage of this tuning. Tuning, again, is done post manufacture and not during the design stage. Process variability, being random, introduces parametric variations randomly, and hence takes away any scope for predicting where the outliers might be in order to make changes during the design stage.

It is also assumed in this work that the digital circuitry is designed in CMOS logic. This assumption is valid as CMOS logic continues to be the most power efficient logic, and is hence expected to remain the dominant logic with which to design digital circuits.
The aim of this work has been to enable tuning by adding a minimum amount of extra circuitry in each CMOS gate to speed up a particular transition at the output. Hence, once an exceptionally slow transistor is identified, cells along the path are programmed to speed up the slowest transition. This tuning, limited to only a few cells that are statistical outliers, can help push the clock frequencies up by a significant amount. This also happens with minimal impact on the total power dissipated in the chip, as will be discussed in the following chapters.

The remainder of this thesis is arranged as follows. A literature survey is presented in Chapter 2, and the effects of the presence of an outlier in Chapter 3. The CMOS tunable gate architecture that is to be included in the design is introduced in Chapter 4, followed by the steps taken to extract the SPICE netlist in Chapter 5, an outline of the method and the algorithm used to show the effectiveness of the design in Chapter 6, the simulation results in Chapter 7, power dissipation simulation and results in Chapter 8, and finally a conclusion in Chapter 9.

Chapter 2
Literature Survey

This chapter presents a brief study of present and past research work on process variability. The tuning architecture proposed in this work is novel, but the subject of process variability as such has been studied in the past. A historical overview of process variability is presented first. That is followed by sections on the critical sources of variability. A final section on the impact of process variability is also presented.

2.1 Historical Overview of Process Variability

The semiconductor industry has distinguished itself by its rapid pace of improvement, which it owes largely to the industry's ability to exponentially decrease the minimum feature sizes used to fabricate integrated circuits [1]. But as the feature sizes get smaller, the variability associated with fabrication becomes more significant.
Given the large amount of research literature dedicated to it today, process variability may seem like a new challenge. But in fact process variability has always been a critical aspect of semiconductor fabrication [11]. Random variations in a semiconductor were first discussed in Shockley's 1961 paper titled "Problems related to p-n junctions in silicon," which considers the effect of statistical spatial fluctuations of donor and acceptor atoms and the resulting impact on the breakdown voltage of the junction [12]. Since then a number of other works on the effects of random ion implantation have appeared. The most prominent include the work of Schemmert and Zimmer on the sensitivity of threshold voltage (Vth) to implantation energy [13], and the work of Yokoyama et al. [14] on studying Vth sensitivity using a Monte Carlo approach.

As can be seen, process variation has been a subject of interest for a long time. It has, though, become more significant in recent years as feature sizes get smaller and as we slowly approach the end of the roadmap, which poses certain fundamental limitations on further scaling.

2.2 Sources of Variability

Process variability in advanced CMOS technologies may be categorized into two types: local or intra-die variations, and global or inter-die variations [8]. Local variability occurs over a short distance and normally within a die, while global variability occurs over larger distances and is normally seen from die to die and from wafer to wafer. Global variability causes a shift in the mean value of sensitive design parameters such as channel length (L), channel width (W), layer thickness, resistivity, doping concentration, and body effect, while local variability introduces (1) systematic variability and (2) random variability.
Systematic variability includes variability caused by optical proximity corrections (OPC), phase-shifting masks (PSM), layout-induced strain, and the well proximity effect, and can be addressed by more controlled Resolution Enhancement Techniques (RET) and layout designs. Random variability has several sources, of which the critical ones are Random Dopant Fluctuations (RDF) [8, 11, 15, 16], Line-edge and Line-width roughness (LER and LWR) [8, 11, 17, 18, 19, 20], variations in gate oxide thickness (TOX) due to interface roughness [8, 11, 22], polysilicon granularity [8, 11, 21, 22], and high-K dielectrics with metal gates [8, 11, 21, 22].

Of the above variability types, random variability is the most critical because its impact on circuit performance is becoming increasingly significant for technology nodes below 45nm. The large outlier delays observed in almost every device, which are almost four to five times above the average-case worst-case delays, arise out of random variability. Also, addressing random variability requires innovative process and circuit design techniques and device modeling. The focus of this work, hence, has been to address the issue of random local variability. A brief description of some important sources of random variability is presented below.

2.2.1 Random Dopant Fluctuation (RDF)

RDF has an increasingly significant effect on the MOS Vth in sub-micron technologies. In MOSFETs, transistor channels are doped with dopant atoms to control the Vth. To keep short channel effects from degrading the Vth in technology nodes below 90nm, the doping concentrations are kept very high. For a MOSFET device with effective channel length Leff, effective width Weff, channel doping concentration NCH, and source-drain extension (SDE) junction depth xj, the total number of dopant atoms in the channel, according to Saha, K. S.
[8], is given by:

$N_{total,chan} = N_{CH} \cdot (W_{eff} \cdot L_{eff}) \cdot x_j$    (2.1)

According to Equation 2.1, as dimensions are scaled the number of dopant atoms, Ntotal,chan, comes down drastically even though the doping concentration goes up, because the transistor dimensions Leff, Weff and xj reduce. Technology scaling involves reducing transistor area by half with every generation. Hence, Ntotal,chan decreases exponentially over technology generations; the 45nm technology node has only around 100 dopant atoms in its channel [11]. Any small change in the number of atoms hence implies a significant effective change in the doping concentration, and hence a significant difference between the Vth of one transistor and another. This is what is referred to as process variability. Its contribution towards device mismatch can be studied through the following equation,

$\sigma V_{tran} = \left( \frac{\sqrt[4]{4 q^3 \epsilon_{Si} \phi_B}}{2} \right) \cdot \frac{T_{OX}}{\epsilon_{ox}} \cdot \left( \frac{\sqrt[4]{N}}{\sqrt{W_{eff} \cdot L_{eff}}} \right) = \frac{1}{\sqrt{2}} \cdot \left( \frac{C_2}{\sqrt{W_{eff} \cdot L_{eff}}} \right)$    (2.2)

where $q$, $\epsilon_{Si}$, $\epsilon_{ox}$, and $\phi_B$ are the electron charge, the permittivity of silicon, the permittivity of SiO2, and the built-in potential of the S/D-to-channel PN junction of MOSFETs, respectively. $C_2$ is a constant that depends on NCH and TOX, and hence depends on Ntotal,chan. Even though Ntotal,chan decreases with scaling, Weff and Leff reduce as well, and hence the device mismatch effectively becomes bigger with scaling [15].

Figure 2.1: Random Dopant Fluctuations in sub-micron technologies

A 3D numerical model with an adaptive local meshing scheme that allows prediction of Vth for arbitrary dopant profiles (see Figure 2.1) was developed, and simulation results for 45nm and 65nm were compared to the observed variations [16]. RDF was found to account for 65% of the total σVt in 65nm silicon, and 60% of the total σVt in 45nm silicon.
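As a rough numerical illustration of Equation 2.1, the following minimal Python sketch computes the channel dopant count for two device geometries. The function names and all numeric values are hypothetical choices made for illustration and are not taken from [8] or [11]. The relative fluctuation 1/sqrt(N) additionally assumes Poisson-distributed dopant counts, which is a common simplification and not a result stated in this work.

```python
import math


def dopant_count(n_ch_per_cm3, w_eff_nm, l_eff_nm, x_j_nm):
    """Equation 2.1: N_total,chan = N_CH * (W_eff * L_eff) * x_j."""
    nm3_per_cm3 = 1e21  # (1e7 nm per cm) cubed
    volume_cm3 = (w_eff_nm * l_eff_nm * x_j_nm) / nm3_per_cm3
    return n_ch_per_cm3 * volume_cm3


def relative_fluctuation(n_atoms):
    """Assumed Poisson statistics: sigma_N / N = 1 / sqrt(N)."""
    return 1.0 / math.sqrt(n_atoms)


# Hypothetical devices: the scaled device has a higher doping
# concentration but a much smaller channel volume.
older = dopant_count(1e18, w_eff_nm=200, l_eff_nm=100, x_j_nm=50)
scaled = dopant_count(3e18, w_eff_nm=90, l_eff_nm=45, x_j_nm=20)

for name, n in (("older", older), ("scaled", scaled)):
    print(f"{name}: {n:.0f} dopant atoms, "
          f"relative fluctuation {relative_fluctuation(n):.1%}")
```

With these assumed numbers the scaled device has fewer dopant atoms despite its higher doping concentration, so the same absolute change in atom count is a larger relative change, which is the trend the text describes.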
2.2.2 Line-Edge Roughness (LER) and Line-Width Roughness (LWR)

A second important factor responsible for process variability arises out of variations in gate patterning, leading to non-ideal (rough) line edges. LER is a result of sub-wavelength lithography, which the semiconductor industry has been using for patterning transistors since the 0.25 µm technology node. For example, fabrication processes previously used light of wavelength λ = 248nm to pattern transistors with minimum feature size (Critical Dimension) CDmin = 250nm and 180nm. The wavelength decreased to 193nm for the 130nm technology, and has remained the same since, even for 65nm transistors.

Figure 2.2: Definitions of LER and LWR

Until extreme ultra-violet technology becomes available, sub-wavelength lithography will continue to be used for patterning, and will cause LER and LWR effects in scaled MOSFETs [8, 11]. Impacts of LER and LWR include increases in sub-threshold current and degradation in the Vth characteristics [17, 18, 19, 20]. Diaz et al. [17] found that transistor performance degrades as the observed LER values increase. Experiments performed by Kim et al. [18] showed that LER effects began for gate lengths below 85nm. They also observed a four-order increase in the standard deviation of the sub-threshold current for the smallest gate lengths in their study. Fukutome et al. [19] observed in their experiments that the roughness of extension edges induced by gate LER depended on the implanted dose, halos (pockets), and various co-implantations. They showed an improvement of 4nm in Vth roll-off with a decrease in the average LER, and confirmed that co-implants induced a degradation of 5mV in the standard deviation of Vth. Another important observation noted in experiments performed by Asenov et al.
[20] was that LER and RDF effects act in a statistically independent manner, and that LER induced fluctuations have a stronger channel length dependence and are hence expected to supplant RDF as the dominant source of variation. The Vth mismatch due to LER depends on the variability in Weff of the MOSFETs and is given by [11]:

$\sigma V_{TH,LER} \propto \frac{1}{\sqrt{W_{eff}}} < \sigma V_{TH,RDD}$    (2.3)

2.2.3 Oxide thickness variation due to interface roughness

Conventional CMOS technologies with Silicon-dioxide (SiO2) gate dielectrics are subject to process variability introduced by Si/SiO2 and SiO2/polysilicon-gate interface roughness through TOX variation [8]. In advanced MOSFETs with TOX of about 1nm, the interface roughness is found to be comparable to the TOX, and hence the variations associated with it can be about 50%. Such variations in the TOX introduce variations in the gate current Ig, which produces a voltage drop in the polysilicon gate that changes the Vth significantly.

Studies by Asenov et al. [22] show that intrinsic Vth fluctuations induced by local TOX variations become comparable (~30mV) to the voltage fluctuations introduced by RDF for conventional MOS devices with dimensions of 30nm and below. Process variations introduced by TOX variations due to interface roughness are also found to be statistically independent of process variations introduced by RDF [8, 11]. The effective contribution of the two effects towards Vth variation is hence given by:

$\sigma V_{TH,total} = \sqrt{\left( \sigma V_{TH}^{TOX} \right)^2 + \left( \sigma V_{TH}^{RDD} \right)^2}$    (2.4)

2.2.4 Polysilicon granularity (PSG)

PSG enhances gate dopant diffusion along the grain boundaries, which leads to non-uniform polysilicon gate doping and potential localized penetration of the dopants through the gate oxide into the channel region. The most significant source of fluctuation within the polysilicon gate, though, is the Fermi level pinning at the boundaries between grains due to the high density of defect states.
Local variations in the potential of up to 0.6V within the gate are reflected as fluctuations in the surface potential (φs) within the MOSFET channel, leading to Vth mismatch between devices. In sub-65nm MOSFETs these fluctuations are comparable to those introduced by RDF [8, 11, 21, 22].

2.2.5 High K dielectric morphology

Advanced CMOS technologies use high-K dielectrics like HfO2 with metal gates to provide a thicker physical TOX, which reduces gate leakage, while ensuring the thin electrical TOX required for the continued scaling of MOSFETs. Their inclusion, however, introduces significant process variability. One cause is the Si/high-K and high-K/MG interface roughness, which causes mobility degradation and TOX variation. Another is the phase separation created between the crystallized grains and the amorphous SiOx matrix, which causes fluctuations in the channel potential under the gate. Also, polycrystalline HfO2 with random grain orientations causes the dielectric constant to vary across the gate oxide. Hence, the surface potential φs varies with L and causes variability in Vth. This Vth variability due to the presence of high-K dielectrics increases with scaling, and is expected to be more than 30 mV for 10nm MOSFETs with TOX of about 4nm [8, 11, 21, 22].

2.2.6 Other sources

Other sources of process variability, besides the critical sources mentioned above, are variations associated with patterning proximity effects; variations associated with polish, such as shallow-trench isolation and gate and interconnect variation; variations associated with strain, high stress capping layers, and embedded silicon-germanium; and variations associated with implants and anneals.

2.3 Impact of Process Variability

Random variations introduced through RDF, LER/LWR, TOX variation, PSG, dielectric variations, and other sources have a critical impact on the yield, performance and reliability of the manufactured circuits.
As a result, parameters defined for transistors at the design stage vary significantly, post-manufacture, from their given nominal values. These variations are expected to only get worse with scaling.

All of the sources mentioned above primarily degrade Vth. RDF and LER contributions towards Vth variation are described in Equation 2.2 and Equation 2.3; of the two sources, LER is expected to become more critical as scaling continues. TOX variation due to interface roughness has been shown to degrade threshold voltage as well, and for transistors with 1nm TOX the Vth variation introduced is almost comparable to that introduced by RDF. PSG and high-K dielectric effects degrade the threshold voltage Vth by affecting the surface potential (φs).

Hence the focus of this work has been on studying threshold voltage variability in circuits designed with CMOS logic. Only CMOS logic is assumed as it is the most power efficient of all logic types, and is expected to continue to be the dominant logic type for design in future technologies as well.

Even though every process variability source affects Vth in a statistically independent manner, in this work values of Vth are drawn from a single normal distribution that is assumed to include the effects of all sources. Hence, for the purpose of simulation, every transistor in every circuit simulated has its Vth drawn from a normal distribution whose mean and standard deviation are specified by the technology generation in which the transistors are designed.

In conclusion, random process variability is an increasing concern as dimensions shrink and scaling continues. Different sources that contribute to variability have been studied and their effects analyzed. The next chapter studies the impact of variability on path delays.

Chapter 3
Effects of Outlier Presence on Path Delays

Extensive prior research suggests Deep Sub-Micron Technologies (DSM) are prone to high process variability [1-24].
Distributions representing parametric values of transistors, post fabrication, indicate a high standard deviation, implying a wide spread of the distribution, and a trend of the standard deviation increasing with scaling.

3.1 Normal distribution

Figure 3.1 compares the Vth distributions of two different technology generations. As can be seen, the distribution gets wider and shorter as we scale from one technology generation to the next. Assuming a normal distribution for the spread, the results shown in Table 3.1 can be arrived at [25]:

Range (a,b)   Population within (μ+aσ, μ+bσ)   Approximately 1 in
(0,1)         34.13%                           3
(1,2)         13.59%                           7
(2,3)         2.14%                            47
(3,4)         0.13%                            770
(4,5)         0.003%                           30000
(5,6)         0.000028%                        3.5 million
(6,∞)         0.0000001%                       1 billion

Table 3.1: Normal distribution - Statistics

Table 3.1 gives the percentage of transistors that fall within the specified ranges on the right side of the distribution. A similar table can be constructed for the left side of the distribution; the distribution being symmetrical, the percentages would be the same on both sides.

Figure 3.1: Threshold voltage variation with scaling

Vth greater than nominal, i.e. falling on the right side of the distribution, corresponds to a slow transistor, while Vth less than nominal, i.e. falling on the left side of the distribution, corresponds to a fast transistor. It is the slow transistors that are of interest, as the presence of one on a path pushes the path delay up by more than the speedup achieved by a symmetrically placed low Vth transistor, and is hence responsible for the slowdown.

Assume a nominal threshold voltage μ = 0.5V, a standard deviation σ = 50mV, and Vdd = 1V, for a particular deep sub-micron technology [26]. Here μ+5σ = 0.75V. By Table 3.1, 1 in 3.5 million transistors lie beyond 5 standard deviations, i.e.
approximately 300 in a design with a billion transistors. Most designs today have billions of transistors in them, and hence, statistically, there is every chance of finding at least a few hundred such outlier transistors, lying at least 5 standard deviations away from the mean, in every fabricated chip.

Figure 3.2: Standard inverter with an input and an output load

3.2 VOL

In the above example, transistors beyond 5 standard deviations have Vth greater than 0.75V, i.e. the difference between Vdd and Vth, defined as VOL, is less than 0.25V. As Vth gets larger, it gets closer to the supply voltage Vdd, and the delays increase exponentially. To get a measure of this increase, simulations were done on a SPICE netlist of an inverter (Figure 3.2) with an input load and an output load, by varying Vth of the pull-up PMOS from 0.2V through the nominal value of 0.4V to 0.7V, for a supply voltage Vdd of 1V. The simulations used 180nm technology files, and the SPICE netlist was extracted using the various Mentor Graphics tools. More on the technology and tools used is discussed later.

Vdd = 1V
VOL = Vdd-Vth (V)   Vth (V)   Delay (ps)   Change from nominal (%)
0.3                 0.7       377          508
0.4                 0.6       167          170
0.5                 0.5       95           53
0.6                 0.4       62           0
0.7                 0.3       44           -29
0.8                 0.2       33           -46

Table 3.2: Delay versus VOL

Table 3.2 gives a sense of how delay varies with Vth. As Vth increases, inverter delays increase exponentially. For an increase of 0.2V above the nominal Vth, the increase in delay is close to three times the nominal delay value. Hence, for transistors that lie beyond five standard deviations, the increase in delays will be more than four times the nominal.

Figure 3.3: VOL versus Delay

The presence of such an outlier in a path can push up delays by a significant amount. To illustrate using an oversimplified example, assume an inverter chain of length 8, with all inverters having a nominal delay "X". The total delay of the inverter chain would be "8X". Now assume a "4X"
increase in the delay of an outlier inverter that is on such a chain. The total delay is now pushed to "12X", implying a 50% increase in path delay, and hence a 33% decrease in the clock frequency.

It is important to note that even though the distribution is symmetric, implying the presence of an equal number of fast transistors along a path in the circuit, the total delays on a path do not average out. Figure 3.3, which plots delay against VOL, illustrates that for an equal change in Vth, the speedup is a lot smaller than the slowdown. Hence, there is a very small chance that a slow outlier node in a path would be compensated by a fast switching node.

Figure 3.4: Process variability effects on pre-optimized designs

3.3 Effects of process variability on high performance applications

The presence of a slow outlier on a path doesn't necessarily imply that the path becomes critical. The path itself may be short in comparison to other paths, in which case the large delay introduced by an outlier node might not make the path critical. But most high performance applications today are speed optimized in the design stage to have approximately equal path delays so as to save on power. In such high performance designs, where all paths are close to critical, the presence of an outlier node on any path can push the critical path delay up by a significant amount.

Figure 3.4 illustrates the case. It represents a pipelined architecture, with the rectangles representing registers and arrow lengths representing path delays. The minimum clock period of a design equals the worst case delay. Hence, in the presence of an outlier, even though average case delays do not change by much, the clock period increases solely because of the exponential increase in the delay of a few outlier paths.
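The figures quoted in this chapter can be checked numerically. The following is a minimal sketch, not part of the original simulations: it uses the Gaussian tail probability to estimate how many transistors lie beyond a given number of standard deviations, and then repeats the oversimplified inverter chain example. The function names and the use of a one-sided tail are assumptions made for illustration.

```python
import math


def tail_fraction(k):
    """One-sided fraction of a normal population lying beyond k standard deviations."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def expected_outliers(n_transistors, k):
    """Expected number of transistors with Vth above mean + k*sigma."""
    return n_transistors * tail_fraction(k)


def chain_slowdown(n_stages, extra_delay):
    """Effect of one slow stage in a chain of n_stages unit-delay gates.

    extra_delay is the additional delay of the outlier stage, in units of
    the nominal gate delay X. Returns (relative delay increase,
    relative clock frequency decrease).
    """
    nominal = float(n_stages)
    slow = nominal + extra_delay
    return (slow - nominal) / nominal, 1.0 - nominal / slow


if __name__ == "__main__":
    print(1.0 / tail_fraction(5))        # about 1 in 3.5 million
    print(expected_outliers(1e9, 5))     # a few hundred per billion transistors
    print(chain_slowdown(8, 4))          # 8X -> 12X
```

Under these assumptions the sketch agrees with the values used in the text: roughly 1 in 3.5 million transistors beyond 5 standard deviations, about 300 per billion, and a 50% delay increase with a 33% frequency loss for the chain.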
In conclusion, statistically every fabricated design with close to a billion transistors will have at least a few hundred outliers, whose presence on any path can push the clock period up by a significant amount and hence restrict the frequency at which the circuit can be operated. The aim of post manufacture tuning is to bring the clock period down close to the average case delays.

Chapter 4
CMOS tunable gate architecture

This chapter introduces the tunable CMOS gate architecture, which forms the basis of this work. The tunable gate design is first introduced, followed by the tuning strategy, a study on the sizing of the tuning transistors, a rare corner case involving faulty tuning transistors, and a final simulation result that shows the effectiveness of the tuning methodology on a single NAND gate.

4.1 CMOS tunable gates

Every CMOS gate has a pull-up network with PMOS transistors, and a pull-down network with NMOS transistors. Under static conditions either the pull-up network is "ON" or the pull-down network is "ON". The output load capacitance of the gate charges to Vdd when the pull-up network is "ON", and discharges when the pull-down network is "ON". It is this charging or discharging time that determines the speed at which the gate switches. The presence of a parallel path can speed up the charge time or the discharge time; this principle is the foundation of the tuning strategy.

A parallel path for charging and discharging the output load capacitance is provided to compensate for an otherwise slow transistor. To introduce this parallel path in the design, a parallel PMOS transistor with tuning capability is added to the pull-up network, and a parallel NMOS transistor with tuning capability is added to the pull-down network. The resulting architecture resembles the design shown in Figure 4.1.
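The parallel path principle can be illustrated with a first-order RC model. This is only a sketch under simplifying assumptions: each conducting network is treated as a linear resistor, the gate delay is taken as the 50% point of a single RC charge, and the added parasitic capacitance of the tuning transistor is ignored. All component values below are hypothetical and are not taken from the simulations in this work.

```python
import math


def parallel(r_a, r_b):
    """Equivalent resistance of two resistances connected in parallel."""
    return r_a * r_b / (r_a + r_b)


def rc_delay(resistance, capacitance):
    """Time for a single RC stage to reach 50% of its final value."""
    return math.log(2.0) * resistance * capacitance


if __name__ == "__main__":
    r_slow = 20e3    # hypothetical resistance of a slow outlier network (ohms)
    r_tune = 16e3    # hypothetical resistance of the tuning transistor (ohms)
    c_load = 10e-15  # hypothetical output load capacitance (farads)

    untuned = rc_delay(r_slow, c_load)
    tuned = rc_delay(parallel(r_slow, r_tune), c_load)
    print(untuned, tuned)  # the tuned delay is always the smaller of the two
```

Because the parallel combination is always smaller than either resistance alone, turning on the tuning transistor can only reduce the charging (or discharging) time in this model; the actual gain depends on the added parasitics and on the transistor sizing discussed later in this chapter.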
To include the parallel path for charging or discharging, the corresponding tuning transistor has to be turned "ON". This is only done after diagnosis has been performed and the slow outliers identified. The gate terminals of the tuning transistors are connected to a switch which can be turned "ON" or "OFF". The switches are "OFF" when no tuning is required, and "ON" while tuning is done. Once a tuning transistor is turned "ON", the logic of that particular gate switches from a CMOS type to a pseudo-NMOS type. The tuning transistors have to be sized appropriately to provide good speed up without affecting the output voltage logic levels. The sizing of the tuning transistors is discussed in the subsequent sections of this chapter.

Figure 4.1: Tunable CMOS architecture

It is important to note that every node in the design comprising a pull-up and a pull-down network has to have this tuning capability (Figure 4.2). This is because no prefabrication predictions can be made on the location of the outliers; outliers are created randomly and can hence fall anywhere on the chip. The chances of a gate having both a faulty PMOS and a faulty NMOS are very small; if this occurs, the fabricated chip is discarded as yield loss.

There is also a rare and interesting case: either the pull-up network or the pull-down network may have a high resistance slow outlier while the corresponding tuning transistor, which would have to be turned "ON" to bring down the gate delay, is a low resistance, low Vth outlier. In such a case tuning would not be possible because of the possible degradation of the voltage logic levels. This occurrence again is very rare and is considered a yield loss. More is discussed on this case later.

Figure 4.2: Huffman model of sequential circuits with tunable gates

There are drawbacks to adding tuning transistors.
The most significant is the additional parasitic capacitance, which pushes path delays up. But it will be seen in subsequent sections of this work that, on tuning the slow outliers, the worst case path delays of the fabricated chip can be pulled well below the worst case path delays of a fabricated circuit that has no tuning capability. Also, turning "ON" a particular tuning transistor to speed up the slow transition slows down the complementary transition. But it will be seen that the slowdown is much less than the speedup.

There are other important effects of including tuning transistors, among them additional power dissipation and an increase in the area required. It will be seen that these effects are minimal and within affordable limits for adding tuning circuitry. More is discussed in the subsequent chapters.

Such a tuning architecture has been studied before by Ashouei et al. [5], but only as a defect tolerance methodology. In the presence of a fault, say in the pull-up network, the entire pull-up network is disconnected from the power rail and is replaced by a properly sized single always "ON" pull-up transistor, hence converting the CMOS gate to a pseudo-NMOS structure. A similar technique is adopted to replace a faulty pull-down network with an always "ON" pull-down transistor. This method is similar to the one proposed in this work, with the difference that here no additional circuitry is required to disconnect a network. Such a defect tolerant methodology, however, has not been applied for performance tuning. As will be seen, the potential exists to combine both defect tolerance and performance tuning in aggressive CMOS technologies.

4.2 Tuning strategy

As discussed in previous chapters, statistically there is every chance that a large design will have at least a few hundred outlier transistors that lie far out in the distribution, and a single one of these on a path pushes the path delay up by a significant amount.
The slow transistor could be either in the pull-up network or in the pull-down network; the chances of both the pull-up and the pull-down network having a slow transistor are very small, in which case the chip would count as faulty and would fall under yield loss.

The presence of a slow transistor in the pull-up network pushes up the rising delay, as it takes longer for the output capacitance to charge through it. To counter the slow transistor, the tuning PMOS connected in parallel is turned "ON" to provide a separate path for charging the output capacitance, as shown in Figure 4.3. The red dot in the pull-up network indicates the presence of a slow PMOS transistor.

Figure 4.3: Rising transition speed-up

Figure 4.4: Falling transition speed-up

In the same way, the presence of a slow transistor in the pull-down network pushes up the falling delay, as it takes longer for the output capacitance to discharge through it. This is countered by turning "ON" the NMOS tuning transistor, which provides an additional path for discharge, as shown in Figure 4.4.

It is assumed here that an algorithm exists to diagnose the slow transistors in the design. Once diagnosed, the appropriate tuning transistors are turned "ON" to bring down the gate delays.

4.3 Sizing of tuning transistors

Sizing of the tuning transistors is an important factor determining the effectiveness of the tuning strategy. The tuning transistors should be sized large enough to provide good speed up, but at the same time small enough to ensure that logic levels are maintained. For instance, assume the pull-up network to be slow. To speed up the rise time, the tunable PMOS transistor is turned "ON". Now, when the pull-down network is activated, the circuit behaves like a voltage division network (Figure 4.5), with the output voltage being the supply voltage times the ratio of the pull-down resistance to the total resistance.
To ensure that this output voltage stays within the bounds of logic "0", the resistance of the tuning PMOS should be high, i.e. the W:L ratio of the transistor should be small.

A similar voltage division circuit is set up when an outlier transistor exists in the pull-down network and the NMOS tuning transistor is turned "ON" while the pull-up network is activated. Hence, a trade-off between a large W:L ratio and a small W:L ratio is required for the strategy to work correctly. For the purpose of simulations in this work, a 4:1 resistance ratio between the tuning transistor and the corresponding complementary network is maintained. This ensures a LOW voltage of Vdd/5 and a HIGH voltage of 4Vdd/5, which are within acceptable ranges of logic "0" and logic "1", respectively.

Figure 4.5: Voltage division on tuned nodes

4.4 Faulty tuning transistors

It is important to note that the tuning circuitry is also exposed to process variability, as it is fabricated along with the other functional transistors of the chip. Hence, it could itself be a victim of process variability. A faulty tuning transistor is of relevance only if it is required during the tuning process; otherwise it is of no importance.

There is a difference, though, in how a defective tuning transistor is perceived. As mentioned earlier, process variability affects the Vth distribution. In other words, the Vth of a transistor could be high, corresponding to a slow, high resistance, low leakage transistor, or it could be low, corresponding to a fast, low resistance, high leakage transistor. For a transistor in the pull-up network or the pull-down network, a high Vth implies a faulty slow transistor, as it takes longer to switch; a low Vth transistor is fast to switch and is hence not an outlier candidate.
But for a tuning transistor, a high Vth corresponds to a high resistance transistor, and even though it is slow, the effective resistance of the slow outlier in parallel with the tuning transistor is still lower, and hence some speed up will be achieved. On the other hand, a low Vth tuning transistor, even though faster to switch, could degrade the output voltage logic levels so much that the output doesn't switch anymore. In such a case the outlier transistor cannot be tuned without degrading the voltage logic levels. Hence, for tuning transistors, it is the low Vth transistor that is considered defective and not the high Vth transistor. This occurrence is again not common, and if it does occur, the fabricated chip is discarded as a yield loss.

Figure 4.6: Tunable NAND gate

4.5 NAND gate simulations

To test the effectiveness of the tuning strategy, simulations were performed on a single NAND gate with input and output loads. Mentor Graphics tools were used to generate the NAND gate SPICE circuit with input and output loads. A gate level structural netlist was first created as a Verilog file. This Verilog file was fed as an input to Design Architect to create design view points. The design view points created in Design Architect are used in IC Station to generate the circuit layout. Once the layout was generated, it was verified for overflows and shorts, after which a layout versus schematic check was done using the LVS tool. Capacitances were finally extracted through Calibre PEX.

Once the final SPICE netlist was generated, the tuning transistors connected to the NAND gate were added. This addition had to be done manually, as the standard cell libraries do not have tunable gates in them. The width over length ratios for the tuning transistors were fixed so as to have a 4:1 resistance ratio between the tuning transistor and the complementary network of the gate.
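The effect of the 4:1 resistance ratio on the static logic levels follows from the voltage division described in Section 4.3. The sketch below is a simplified illustration that treats the tuning transistor and the complementary network as linear resistors; the function name and its parameters are assumptions made for this example.

```python
def tuned_logic_levels(vdd, ratio):
    """Static output levels of a tuned node, from simple voltage division.

    ratio is R_tuning / R_complementary_network (4.0 for the 4:1 sizing).
    Returns (v_low, v_high):
      v_low  - output when the tuning PMOS conducts against an active pull-down
      v_high - output when the tuning NMOS conducts against an active pull-up
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    v_low = vdd / (1.0 + ratio)
    v_high = vdd * ratio / (1.0 + ratio)
    return v_low, v_high


if __name__ == "__main__":
    print(tuned_logic_levels(1.0, 4.0))  # Vdd/5 and 4*Vdd/5
```

A larger ratio moves both levels closer to the rails but implies a weaker tuning transistor and hence less speed up, which is the trade-off that the sizing has to balance.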
Additional capacitance was added to the gate outputs to include the effects of the tuning transistors. More is discussed on the SPICE netlist extraction in the following chapter.

The threshold voltages of the transistors were first assumed to be nominal, and the rise and fall time delays were noted. Another set of simulations was performed on the same NAND gate, but with one pull-up PMOS transistor having a threshold voltage 0.1V above nominal; rise and fall time delays were noted before and after tuning. The same simulations were repeated for the Vth of the PMOS being 0.2V above nominal, and 0.3V above nominal. Figure 4.6 shows the tunable NAND gate structure, and Table 4.1 provides the results of the above simulations.

180nm Technology
Vth (PMOS)   Rising Delay        Falling Delay       Delay change (%)
             Untuned   Tuned     Untuned   Tuned
0.4          113       89        130       142       -25.67
0.5          174       125       129       140       19.54
0.6          305       188       128       139       38.36
0.7          703       296       127       138       57.89

Table 4.1: Tunable NAND gate simulations

The transistor W:L ratios in the NAND gate are the standard ratios defined by the cell libraries. The W:L ratios of the tuning transistors were fixed manually to ensure a 4:1 ratio between the tunable PMOS and pull-down resistances, and between the tunable NMOS and pull-up resistances. The tabulated delay change values are consistent with comparing the untuned rising delay against the larger of the two tuned delays. As can be seen, the performance gain for Vth = 0.4V is negative, as the complementary delays are pushed up, but sufficient gains are obtained as the Vth of the PMOS increases; for Vth = 0.7V the reduction in gate delay is close to 58%.

In conclusion, the methodology seems to be very effective in pulling back gate delays by a significant amount. Its effectiveness on a large circuit with long chains of multiple gates still needs to be tested. The steps taken in generating the SPICE netlist for the purpose of simulations are discussed in the next chapter.

Chapter 5
SPICE Netlist Extraction

Extracting the SPICE netlist for the purpose of simulations is a critical part of the work done here.
The tools used and the manual additions implemented are described in this chapter. Mentor Graphics tools, consisting of ModelSim, Leonardo Spectrum, Design Architect, IC Station, LVS and PEX, have been used for extracting the transistor level SPICE netlist.

The circuits studied in this work include basic inverter chains, NAND gate chains and EXOR tree circuits. Any larger circuit is constructed out of these simple circuits. The reason for studying these simple circuits is explained later. The chapter starts with a brief description of the tools in the order in which they were used, followed by a description of the method used to include the tuning transistors.

5.1 Mentor Graphics Tools

5.1.1 ModelSim

The behavioral model of the circuit is first written in either VHDL or Verilog and then simulated in ModelSim. This VHDL/Verilog description forms the seed from which the transistor netlist is ultimately extracted.

Process variability, in the context of this work, was studied on basic inverter chains, NAND gate chains and EXOR tree circuits. As the structure of these circuits is known, a gate level structural description was written in Verilog. A behavioral description is not provided, to ensure that no optimization is done on the circuit.

5.1.2 Leonardo Spectrum

Once the behavioral/structural description of the circuit is created in VHDL/Verilog, Leonardo Spectrum is used to generate the structural description of the circuit in Verilog format, with the basic details of the gate delays specified through the .sdf file. It is to be ensured that no optimization is done while synthesizing an input structural netlist to generate a Verilog gate level structural description; optimization, if done, could change the gates used and a few interconnections, without of course changing the output functionality.
5.1.3 Design Architect

The next step in the extraction process is to load the Verilog file generated by Leonardo Spectrum into Design Architect to generate a schematic of the circuit. While generating the schematic it will be seen that standard cells with fixed width and length ratios, for a given drive, are used to construct the circuit. It is not possible to change the width or length information of the standard cells. This is the reason why tuning transistors have to be included manually at the end of the extraction process. More on this is discussed later.

Design Architect also generates design view points, which are used by IC Station in extracting the layout information.

5.1.4 IC Station

The next step is to open the schematic in IC Station to generate the layout. While generating the layout, all overflows are to be connected, all shorts removed, and input and output pins added. This ensures that the design verifications in IC Station are carried out correctly.

5.1.5 LVS

The LVS tool is used next to perform the layout versus schematic check. This is done by comparing the transistor dimensions and interconnects of the layout with the transistor dimensions and interconnects of the schematic.

5.1.6 PEX

The last step in the SPICE extraction of the functional transistors is carried out through PEX. Capacitances are extracted by the tool, which makes use of lookup tables in the technology file package. A final transistor level SPICE netlist with these extracted capacitances is generated. It is this SPICE netlist that is used in HSPICE for simulation.

5.2 Tuning transistors

So far, tuning transistors have not been included in the design. There are difficulties associated with including tuning transistors in the design, the most important being that standard cells do not come with the tuning capability. A number of different techniques were studied to automate the addition of tuning transistors.
As can be seen from Figure 5.1, the tuning transistors together closely resemble the architecture of an inverter, with the difference that the gate terminals are connected to separate voltage sources instead of being connected together to an input voltage source, and that the input and output nodes are connected together. With this in mind the following architectures were implemented.

Figure 5.1: Tuning circuit

Figure 5.2: Inverter as a tuning circuit

Figure 5.2 shows the close resemblance of an inverter connected to a gate output with the actual tuning architecture. It was proposed that every gate be connected to an additional inverter and the SPICE netlist extraction be done. Once the SPICE netlist was extracted, one would manually alter the connections between the inverter and the gate to transform the inverter architecture into a tuning architecture.

The problem associated with this architecture is that additional capacitance is added at the gate output nodes from the gate terminals of the tuning transistors. It is only the diffusion capacitance of the tuning transistors that is to be added. To overcome this problem the architecture shown in Figure 5.3 was proposed.

Figure 5.3: Inverted inverter as a tuning circuit

Figure 5.3 better resembles the tuning circuitry because the diffusion capacitance of the transistors in the inverter is added directly to the node output capacitance. It was proposed that the inverter input node and its capacitance be removed once the SPICE netlist extraction was complete.

But this architecture could also not be implemented. As it was later found, while generating the schematic, inverter transistor width and length values could not be altered; only the standard cell values of width and length would be interpreted by IC Station in generating the transistor layout.
It is important that the transistor dimensions of the tuning circuitry be altered to have a 4:1 resistance ratio between it and the corresponding complementary network in the node, for good speed up to be achieved without disturbing the voltage logic levels.

As a result it was decided that all tuning transistors be included manually in the SPICE netlist after the functional netlist was extracted from the various design tools described earlier. The transistor dimensions were also manually fixed to the required ratio. The additional parasitic capacitance needed to account for the tuning transistors is also added manually to every node.

The value of this additional capacitance at every node is determined by extracting the diffusion capacitance of the transistors in an inverter. The diffusion capacitance here happens to be the gate output capacitance of the inverter. But this extracted capacitance corresponds to the standard cell dimensions of the inverter, for which L=180nm, W=450nm for the NMOS and W=990nm for the PMOS. To scale the capacitance, the ratio by which the widths of the transistors scale is calculated. The width of the PMOS was scaled to 270nm and that of the NMOS to 180nm. Hence, on average, the dimensions were scaled down by a factor of 3. The inverter output capacitance, which happens to be the diffusion capacitance, is then divided by 3, and this value is added to every node capacitance.

This strategy of adding tuning transistors was possible in this work because simple circuits constructed entirely from a single gate type, such as a NAND gate or a NOT gate, were studied. For complex circuits with a mix of several gates, the above strategy would fail, as it would be very time consuming to manually fix the dimensions of every tuning transistor connected to different types of gates; these dimensions would vary from one gate type to another.
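The capacitance scaling described above amounts to a short calculation, sketched here. The widths are the ones quoted in the text; the assumption that diffusion capacitance scales linearly with transistor width, the function names, and the example capacitance value are illustrative only.

```python
def average_width_scale(standard_widths_nm, tuning_widths_nm):
    """Average ratio of standard-cell widths to tuning-transistor widths."""
    ratios = [s / t for s, t in zip(standard_widths_nm, tuning_widths_nm)]
    return sum(ratios) / len(ratios)


def tuning_node_capacitance(inverter_output_cap, scale):
    """Extra capacitance added per node, assuming capacitance scales with width."""
    return inverter_output_cap / scale


if __name__ == "__main__":
    # (PMOS, NMOS) widths: standard inverter cell versus tuning transistors
    scale = average_width_scale([990.0, 450.0], [270.0, 180.0])
    print(scale)          # a little above 3, rounded to 3 in the text
    c_inv = 1.5e-15       # hypothetical extracted inverter output capacitance (F)
    print(tuning_node_capacitance(c_inv, round(scale)))
```

The PMOS and NMOS widths scale by different amounts (about 3.7 and 2.5), so the factor of 3 is an average rather than an exact ratio for either device.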
Future work involves incorporating tunable gates into the standard cell libraries to make this automation possible.

Once the SPICE netlist extraction was complete, the SPICE file was simulated using HSPICE. The next chapter describes the algorithm implemented to study process variability on very large circuits.

Chapter 6
Tuning implemented on very large circuits

The preceding chapters indicate the potential for improving the performance of circuits through tuning. However, to truly assess the effectiveness of such a strategy, it is necessary to create an environment where the circuit has several million transistors. This chapter presents the work done to create such an environment, and then simulates every circuit before and after tuning to calculate worst case delays and measure the performance improvement.

6.1 Building large circuits

Simulating large circuits is necessary because it is in such large circuits that the effects of process variability become apparent. In a small circuit, because of the small number of transistors involved, statistical outliers will not occur in most cases. Chips fabricated today have a few hundred outlier transistors in them because of the close to a billion transistors that are present on the chip.

6.1.1 Difficulties

There are a number of problems associated with simulating large circuits. First, every node in the circuit has to have the tuning capability, which means adding a PMOS and an NMOS transistor manually to every node. This has to be done manually because the library files that are used in generating circuit layouts and node capacitances do not have gates with tuning capability in them. Doing the addition manually in a circuit that has at least several thousand gates would be very tedious.
Secondly, every circuit has to be simulated for different nominal threshold voltages and standard deviations, and for different supply voltages, with each simulation repeated at least 1000 times to obtain an average case.

It is also necessary to simulate a circuit at different sizes, to get a sense of the approximate size beyond which there would be sufficient performance benefit to make this tuning technique applicable.

A single HSPICE simulation of a large standard circuit takes several hours even on a fast computer. Such simulations are hence not very practical.

6.1.2 Basic algorithm

To get around this problem of being unable to simulate large circuits, the following algorithm is adopted. First a small standard circuit with fewer than 200 transistors is simulated, with Vth for every transistor in the circuit drawn from a normal distribution. The same circuit is simulated again, but with different transistor threshold voltages, again drawn randomly from a normal distribution. This process of drawing Vth and simulating is done 20,000 times, and for each simulation the delays of the circuit are calculated. Now, to construct a larger circuit from these smaller sub-circuits, a random set of "N" sub-circuits is drawn from the collection of 20,000 circuits, and the largest delay among the "N" sub-circuits selected determines the delay of the larger circuit. "N" represents the number of small sub-circuits selected at a time to create the larger circuit, and is allowed to vary from 10 to 10,000. The random pick of "N" sub-circuits is done 1000 times to get an average case for every size "N". In this way large circuits can be constructed from smaller ones, and their delays calculated without having to simulate them.

6.1.3 Standard circuit used to create larger circuits

The small standard circuit picked in this work is an EXOR tree with 8 inputs and 1 output. The tree has 7 EXOR gates, each constructed with 4 NAND gates. Every NAND gate has the tuning capability added to it.
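The sampling procedure of Section 6.1.2 can be sketched as follows. This is an illustrative reconstruction rather than the MATLAB/PERL code used in this work: the function names are hypothetical, drawing the "N" sub-circuits without replacement is an assumption, and the delay population in the demonstration is a random placeholder standing in for the 20,000 simulated sub-circuit delays.

```python
import random
import statistics


def large_circuit_delay(pool, n, rng):
    """Delay of one large circuit: the slowest of n sub-circuits drawn from pool."""
    # Assumption: sub-circuits are drawn without replacement.
    return max(rng.sample(pool, n))


def average_large_circuit_delay(pool, n, trials, rng):
    """Average worst-case delay over repeated random constructions of size n."""
    return statistics.mean(large_circuit_delay(pool, n, rng) for _ in range(trials))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder population; in the thesis these would be the worst-case
    # delays measured from the 20,000 HSPICE runs of the small circuit.
    pool = [rng.lognormvariate(0.0, 0.25) for _ in range(20000)]
    for n in (10, 100, 1000, 10000):
        print(n, average_large_circuit_delay(pool, n, trials=100, rng=rng))
```

Running the same procedure on the delay populations of the untuned and the tuned circuits gives the two worst-case delay estimates that are compared for each size "N".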
There are 8 paths to the output, and each path has 6 active NAND gates on it. The advantage of using an EXOR tree is that it is very testable, which makes diagnosis easier, since a single bit change at the input is reflected at the output. This property is used to activate one path at a time: pulses are sent through every input, one input at a time, with enough spacing between pulses to allow the output to settle, even in the presence of a slow outlier, before the next input is pulsed to activate the next path.

Figure 6.1A shows an EXOR gate constructed using NAND gates. The NAND gates here are all tunable. Figure 6.1B shows an 8 input EXOR tree. Figure 6.1C represents a large circuit constructed out of "N" EXOR trees, with "N" varying from 10 to 10,000.

Figure 6.1: EXOR tree circuit

6.2 Simulation

This section details the logic used while programming in MATLAB and PERL to create the environment described in the preceding section. The EXOR tree circuit was first constructed and synthesized as a SPICE file. This file was labeled "without tuning circuitry". Another EXOR tree SPICE file labeled "with tuning circuitry" was created, with tuning transistors connected to every NAND gate output. Two files were created so that a comparison could be drawn between the delays of the original circuit and the delays of the tuned circuit.

To start simulations on HSPICE, however, the primitive model (".pm") files needed to run with the SPICE netlists have to be created. The standard ".pm" file that is made available in the package cannot be used directly, as it details the parameters of a single transistor. It is, however, used as a seed to construct the ".pm" files required for the EXOR tree simulations. The following subsection details the method used to generate primitive model files that incorporate variability.

6.2.1 Primitive model files

As mentioned, the ".pm" files are included in the SPICE netlists before being simulated.
To ensure that every transistor has a Vth drawn from a normal distribution, and is hence different from every other transistor in the circuit, the ".pm" file for an EXOR tree circuit contains one instance of the standard single-transistor ".pm" description for every transistor in the circuit, with the Vth of every instance different from the others. In the 8-input EXOR tree circuit with tuning circuitry included, there are 84 PMOS transistors and 84 NMOS transistors. Hence, there are 84 different descriptions of PMOS and 84 different descriptions of NMOS in a single ".pm" file to be included with the circuit SPICE netlist; these descriptions vary only in the value of Vth.

It was also mentioned earlier, in the section on the basic algorithm implemented to create large circuits, that 20,000 copies of the EXOR tree are to be simulated. The transistors in all of these 20,000 copies must also vary in Vth from every other transistor in any of the copies. Hence, a MATLAB program was written to generate 20,000 * 84 (= 1,680,000) values of PMOS Vth and 20,000 * 84 (= 1,680,000) values of NMOS Vth, drawn from a normal distribution with a specific μ (mean) and σ (standard deviation). These values are then assigned one by one to a ".pm" file. Every time 84 transistor instances have been written, a new ".pm" file is created to hold the next set of 84 instances. In this way 20,000 different ".pm" files, with Vth for every transistor drawn from a normal distribution, were created and stored in a specific folder, each file numbered sequentially from 1 to 20,000. The entire process of creating the 20,000 ".pm" files was repeated for different σ values.

This process of generating ".pm" files was also repeated for different values of Vdd. The reason is that, while drawing Vth from a distribution, there is a possibility of the value being greater than Vdd or less than zero. This would not be correct and needs to be avoided.
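The generation step described above can be sketched in a few lines. This is only an illustration in Python rather than the MATLAB program used in the thesis. The function names, the mean of 0.4V, and the choice of redrawing out-of-range values (rather than clipping them) are assumptions; the bounds of 0.1V and 1.0V are the ones used for Vdd=1.2V.

```python
import random

def draw_vth(mean, sigma, lower, upper, rng):
    """Draw one Vth from N(mean, sigma), redrawing until it lies in [lower, upper].

    Redrawing is an assumption; the thesis only states that bounds are set.
    """
    while True:
        vth = rng.gauss(mean, sigma)
        if lower <= vth <= upper:
            return vth

def generate_vth_sets(n_copies, n_per_copy, mean, sigma, lower, upper, seed=0):
    """Return n_copies lists, each holding n_per_copy bounded Vth values."""
    rng = random.Random(seed)
    return [
        [draw_vth(mean, sigma, lower, upper, rng) for _ in range(n_per_copy)]
        for _ in range(n_copies)
    ]

# Small demonstration: 5 copies of 84 NMOS thresholds for Vdd = 1.2 V.
# The mean of 0.4 V is a placeholder, not a value taken from the thesis.
sets = generate_vth_sets(5, 84, mean=0.4, sigma=0.150, lower=0.1, upper=1.0)
print(len(sets), len(sets[0]))
```

Each inner list corresponds to the Vth values that would be written into one ".pm" file; the same routine would be run once for PMOS and once for NMOS.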
In other words, bounds are set for the Vth pick, depending on the value of Vdd. The Vdd values chosen for simulation are 1.2V and 1.5V. For Vdd=1.2V, the lower bound was fixed at 0.1V and the upper bound at 1.0V, while for Vdd=1.5V, the lower bound was 0.1V and the upper bound 1.3V. Hence ".pm" files were generated for every combination of Vdd and σ. All in all there were 4 sets of files, each with 20,000 ".pm" files, for the cases Vdd=1.2V σ=0.075V, Vdd=1.2V σ=0.150V, Vdd=1.5V σ=0.075V, and Vdd=1.5V σ=0.150V.

It is to be noted that the circuit labeled "without tuning circuitry" also uses the same ".pm" files in spite of having fewer transistors; this circuit has 56 PMOS transistors and 56 NMOS transistors. To ensure that the transistors belonging to a gate are the same in the two circuits labeled "without tuning circuitry" and "with tuning circuitry", the tuning transistors alone are labeled from 57 to 84. This is important because if an outlier is to exist, it must fall at the same location in the two circuits. Only then can a comparison be made between the delays of the two circuits.

6.2.2 Single EXOR tree simulations

The SPICE netlist files are now ready for simulation. It is to be noted that even though there are 20,000 different ".pm" files for a specific σ, there is only one copy of each SPICE netlist file: one for "without tuning circuitry" and one for "with tuning circuitry". Only the ".pm" file that the SPICE file points to is modified every time. A MATLAB program was written to open a SPICE netlist file, modify the ".pm" file to be pointed to, run the SPICE file on HSPICE, and extract the path delay information to be stored in a file; HSPICE was invoked from the MATLAB program. The path delay information was stored in a file labeled with the same number as the corresponding ".pm" file.
This file had information on the delay of every path and also the worst case delay, found by taking the largest delay in the simulation, along with the worst case path number and transition type. The worst case delays were also stored in a matrix in MATLAB to speed up their use by another MATLAB program.

The SPICE file had commands in it to extract the delay of every rising and falling transition. It can also extract the power dissipated, but this command is used only later, while simulating for power. In this way 20,000 different delay files for the 20,000 copies were created and stored in a specific folder, along with the worst case delays being stored in a matrix in MATLAB (the matrix size was 20000 x 1). This process of simulating was repeated for different Vdd and σ values. All in all there were 4 cases:

CASE 1  Vdd = 1.2V  σ = 0.075V
CASE 2  Vdd = 1.2V  σ = 0.150V
CASE 3  Vdd = 1.5V  σ = 0.075V
CASE 4  Vdd = 1.5V  σ = 0.150V

For all of the above cases, 2 basic simulations were done on each of the 20,000 copies: one on "without tuning circuitry" and the other on "with tuning circuitry". In the process two different worst case delay matrices were also constructed. It is to be noted that no tuning has been performed yet.

6.2.3 Tuning

Once all simulations were done for both "without tuning circuitry" and "with tuning circuitry", and the worst case delays calculated and stored in a matrix, tuning was implemented. Only those circuits which have outlier transistors in them need to be tuned. To pick outliers, a delay cut-off was established, and all worst case delays larger than the cut-off were declared outliers. This cut-off was selected so as to have between 100 and 150 outlier cases out of the 20,000 circuits simulated for every case. Once the cut-off was established, a search was performed through a MATLAB program to extract those circuit numbers that have their worst case delays greater than the cut-off.
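A minimal sketch of this outlier search is given below, in Python rather than MATLAB. The thesis does not say how the cut-off value itself was arrived at, so `pick_cutoff`, which simply takes the delay just below the desired number of slowest circuits, is an assumption, as are the function names and the synthetic delays.

```python
def pick_cutoff(worst_delays, target_count):
    """Choose a cut-off so that about target_count delays lie strictly above it.

    Assumed rule: use the (target_count + 1)-th largest delay as the cut-off.
    With tied delays, fewer than target_count circuits may exceed it.
    """
    ranked = sorted(worst_delays, reverse=True)
    return ranked[target_count]

def outlier_indices(worst_delays, cutoff):
    """Return 1-based circuit numbers whose worst case delay exceeds the cut-off."""
    return [i + 1 for i, d in enumerate(worst_delays) if d > cutoff]

# Synthetic worst case delays (ns) for 100 circuits, for illustration only.
delays = [1.0 + 0.01 * i for i in range(100)]
cutoff = pick_cutoff(delays, 5)
print(outlier_indices(delays, cutoff))
```

Circuit numbers are 1-based here so that they match the sequential numbering of the ".pm" and delay files.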
The delay file for an outlier circuit is then opened, and the value of the worst case delay, along with the path number of the largest delay and the transition type, is extracted. With this information in hand, the corresponding ".pm" file is opened, and a search is performed for the largest Vth values in the file. This search was done through a PERL script that extracts every line one by one from the ".pm" file and searches for the pattern "Vth0". The PERL script is invoked from MATLAB, and hence forms a part of the same MATLAB code. The search is restricted to the transistors that fall on the critical path, and only to those that are involved in the slowest transition. This is done to speed up the search. Details on which transistor lies on which path, for a particular transition, were predetermined and stored in the MATLAB code.

Since a comparison needs to be made between the Vth of a PMOS and the Vth of an NMOS, the average Vth of the NMOS transistors in a particular gate is compared with the Vth of a PMOS transistor. This is done because, in NAND gates, the NMOS transistors are connected in series, while the PMOS transistors are connected in parallel. Hence for a falling transition to occur, the discharge happens through the series connection of the NMOS transistors, whereas the slowest charging occurs through a single PMOS transistor. Hence, the Vth of a PMOS transistor and the average Vth of the two NMOS transistors in a particular gate were calculated and extracted by the MATLAB program, for every gate in the critical path. Gates on this critical path are then sorted in descending order of Vth.

To tune the circuit, the tuning transistor connected to the gate with the highest Vth value is turned "ON". If the outlier transistor happens to be a PMOS transistor, the PMOS tuning transistor of that gate is turned "ON".
On the other hand, if the outlier transistor happens to be an NMOS transistor, the NMOS tuning transistor of that gate is turned "ON".

The tuning transistors in the SPICE file are labeled with the number of the gate they belong to. This makes the search for the tuning transistors connected to the gate under study easier. Once the tuning transistor number is established, the voltage source connected to the gate terminal of the tuning transistor is determined. Here again, the voltage source is labeled conveniently to make the search possible. For an NMOS tuning transistor, the voltage source is initially connected to GND (logic "0"). To enable tuning, the value of the voltage is switched to Vdd (logic "1"). In the case of a PMOS tuning transistor, the voltage source, which was initially connected to Vdd (logic "1"), is switched to GND (logic "0") to enable tuning.

Once the tuning voltage source values are modified in the SPICE file, the MATLAB program invokes HSPICE to simulate the SPICE file. The delays are calculated again, and the worst case delay is noted. If the worst case delay falls below the cut-off, the circuit is said to be tuned, and the next outlier circuit is picked for tuning. If not, the second gate in the sorted list of gates on the critical path is tuned. Again, HSPICE is invoked to simulate the circuit, and the path delays are calculated and extracted by the program. If the worst case delay stays above the cut-off again, the next gate in the sorted list is tuned. This process of picking a gate for tuning is carried on until tuning is achieved. If all single gate tuning cases are exhausted, the program checks whether tuning multiple gates on the critical path brings the worst case delay down. In most cases, speedup was achieved by tuning the first gate in the sorted list of gates on the critical path.
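The single gate search just described can be sketched as follows. This is an illustrative Python outline, not the MATLAB program of the thesis: the gate records with keys `id`, `pmos_vth` and `nmos_vths` are hypothetical, and `simulate` is a stand-in callback for "modify the voltage sources, run HSPICE, and return the worst case delay".

```python
from statistics import mean

def rank_gates(gates):
    """Sort critical-path gates by their dominant Vth, largest first.

    Each gate is a dict with hypothetical keys 'id', 'pmos_vth', 'nmos_vths'.
    The PMOS Vth is compared with the average Vth of the series NMOS pair.
    """
    ranked = []
    for g in gates:
        nmos_avg = mean(g["nmos_vths"])
        if g["pmos_vth"] >= nmos_avg:
            ranked.append((g["pmos_vth"], g["id"], "PMOS"))
        else:
            ranked.append((nmos_avg, g["id"], "NMOS"))
    ranked.sort(reverse=True)
    return [(gid, kind) for _, gid, kind in ranked]

def tune_single_gate(gates, simulate, cutoff):
    """Turn on one tuning transistor at a time until the cut-off is met."""
    for gid, kind in rank_gates(gates):
        if simulate({(gid, kind)}) < cutoff:
            return gid, kind
    return None  # single gate tuning exhausted; multiple gate tuning would follow

# Toy example: gate 3 has a slow PMOS; tuning it restores the delay.
gates = [
    {"id": 1, "pmos_vth": 0.45, "nmos_vths": [0.40, 0.44]},
    {"id": 2, "pmos_vth": 0.38, "nmos_vths": [0.50, 0.46]},
    {"id": 3, "pmos_vth": 0.95, "nmos_vths": [0.41, 0.39]},
]

def toy_simulate(enabled):
    # Stand-in for an HSPICE run: returns a worst case delay in ns.
    return 1.8 if (3, "PMOS") in enabled else 8.0

print(tune_single_gate(gates, toy_simulate, cutoff=2.5))
```

Trying the gates in descending order of Vth means that, when the outlier is on the critical path, the first simulation is usually sufficient.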
For the case Vdd = 1.2V σ = 0.150V, out of the 120 cases tuned, 112 were tuned by tuning the first gate in the sorted list. For 7 circuits, tuning was achieved by tuning 2 gates in the critical path. 1 circuit could not be tuned; this is because of the occurrence of the special case. Table 6.1 gives the number of circuits tuned for each of the cases studied.

CASE                         Number of circuits tuned
Vdd = 1.2V  σ = 0.075V       136
Vdd = 1.2V  σ = 0.150V       120
Vdd = 1.5V  σ = 0.075V       144
Vdd = 1.5V  σ = 0.150V       113

Table 6.1: Number of circuits tuned

Once tuning is done, a final third matrix with the tuned delay values is constructed and stored in MATLAB. Together the three matrices are used in measuring the performance benefit. The motive behind storing values in a matrix is to give other MATLAB programs faster access to the data.

It is to be noted that once an outlier path in a circuit is tuned, there is no possibility of another path becoming an outlier as a result of tuning. It is true that the complementary transition is slowed down, but this slowdown would never be large enough to create an outlier. Only outlier cases, with path delays significantly larger than average case delays, are being tuned.

6.2.4 EXOR tree simulation: example

Table 6.2 represents one EXOR tree simulation for the case Vdd = 1.2V and σ = 0.150V. The table shows Rising Input 4 to have the largest delay (7.93ns) before tuning. The increase in the delays after adding tuning transistors is attributed to the parasitic capacitances introduced by the additional transistors. On tuning, the critical path delay is brought down considerably, to 1.786ns, a 77.5% reduction in the path delay, which is very significant. The example demonstrated here is that of an extreme outlier. On average, the benefits obtained on single EXOR tree outlier circuits were smaller. Table 6.3 gives the average single EXOR tree circuit outlier benefits.
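As a quick check of the reduction quoted in the example above, the percentage can be recomputed from the two worst case delays, 7.930ns before tuning and 1.786ns after. The helper name `percent_reduction` is introduced here purely for illustration.

```python
def percent_reduction(before_ns, after_ns):
    """Relative reduction in delay, in percent of the original delay."""
    return 100.0 * (before_ns - after_ns) / before_ns

# Worst case delay without tuning circuitry versus worst case delay after tuning.
print(round(percent_reduction(7.930, 1.786), 1))  # 77.5
```

The reduction is measured against the circuit without any tuning circuitry, so it already accounts for the extra parasitic capacitance that the tuning transistors add.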
Input     Transition   Without any tuning   Untuned delays   Tuned delays
                       circuitry (ns)       (ns)             (ns)
INPUT 1   RISING       1.301                1.373            1.377
          FALLING      1.222                1.295            1.295
INPUT 2   RISING       1.030                1.092            1.089
          FALLING      1.295                1.371            1.371
INPUT 3   RISING       1.499                1.161            1.079
          FALLING      0.9493               1.005            1.030
INPUT 4   RISING       7.930                8.323            1.716
          FALLING      0.9136               0.9644           0.9996
INPUT 5   RISING       0.8768               0.9277           0.9302
          FALLING      0.8978               0.9489           0.9482
INPUT 6   RISING       0.8381               0.878            0.8881
          FALLING      1.689                1.788            1.786
INPUT 7   RISING       0.9795               1.034            1.036
          FALLING      1.148                1.216            1.215
INPUT 8   RISING       1.010                1.068            1.067
          FALLING      1.310                1.384            1.382
Worst case             7.930                8.323            1.786

Table 6.2: Tunable EXOR tree simulation

CASE                         PATH DELAY IMPROVEMENT
Vdd = 1.2V  σ = 0.150V       70%
Vdd = 1.2V  σ = 0.075V       15%
Vdd = 1.5V  σ = 0.150V       27%
Vdd = 1.5V  σ = 0.075V       10%

Table 6.3: Performance gain for a single EXOR tree

It is to be noted that the entire process of finding outlier circuits, extracting details of the slowest path delays, sorting gates along a path in descending order of Vth, modifying the value of the voltage sources connected to tuning transistors, and invoking HSPICE to simulate the SPICE netlist, repeated until speedup is achieved, was done entirely by a single MATLAB program. Another MATLAB program was written to pick a subset of "N" EXOR trees to create a larger circuit. The "N" EXOR trees constitute the sub-circuits put together to form the larger circuit. More is discussed on this in the following subsection.

6.2.5 Picking "N" sub-circuits to create a larger circuit

Once 20,000 simulations were done for every case, on the two circuits labeled "without tuning circuitry" and "with tuning circuitry", and tuning was then done on the outlier circuits to bring about speedup, larger circuits were constructed by randomly picking "N" sub-circuits and then finding the largest of their "N" worst case delays. This largest worst case delay becomes the worst case delay of the larger circuit. Picking up "N"
circuits here refers to drawing "N" worst case delays from the three worst case delay matrices that were constructed earlier for every case. This is done by generating "N" random numbers in MATLAB, and then extracting the matrix values at those indices. Once the three values are extracted, the benefit obtained on tuning is calculated. This process of picking a specific number "N" of sub-circuits is repeated 1000 times to get an average case; doing only a few picks may include only special cases and not cover the average case. The above process of picking "N" sub-circuits 1000 times was repeated for different values of "N" ranging from 10 to 10,000.

The next chapter presents the results of simulation done on large circuits, with plots and explanations of the trends observed in the curves.

Chapter 7 Simulation Results

This chapter presents the results of circuit tuning on large circuits of varying sizes. For every size "N" the pick and delay evaluation was done 1000 times, and the average delay for the 1000 cases was calculated.

7.1 Plots

The following plots, Figure 7.1, Figure 7.2, Figure 7.3, and Figure 7.4, show path delay versus circuit size for the cases Vdd=1.5V sigma=0.075, Vdd=1.2V sigma=0.075, Vdd=1.5V sigma=0.150, and Vdd=1.2V sigma=0.150 respectively. These path delays are the average path delays of the 1000 randomly generated circuits. Curves in red correspond to circuits without any tuning capability, curves in blue to circuits with tuning capability in them but without any tuning done, and curves in green to tuned circuits.

7.2 Observation

It is observed that the blue curves are always above the red curves. This behavior is attributed to the parasitic capacitance that the tuning transistors add to the load capacitance of every gate. With increased capacitance it takes longer for every load capacitance to charge and discharge.
As circuit size gets larger, the chance of the presence of a larger extreme outlier in the circuit gets higher. This behavior explains the exponential increase in the delays of circuits with increasing circuit size, indicated here by the exponential rise of the red and blue curves.

Figure 7.1: Simulation case : Vdd=1.5V sigma=0.075
Figure 7.2: Simulation case : Vdd=1.2V sigma=0.075
Figure 7.3: Simulation case : Vdd=1.5V sigma=0.150
Figure 7.4: Simulation case : Vdd=1.2V sigma=0.150

The aim of tuning is to get the path delays to fall well below the outlier path delay value. As can be seen, little or no benefit is obtained in small circuits, as the chance that an outlier will exist is small. This explains the initial behavior of the green curve, which tends to lie above the red curve. But beyond a certain circuit size we observe good returns, as the green curve seems to saturate while the red and the blue curves keep increasing.

Delays of circuits for a σ=0.150V distribution are greater than for a σ=0.075V distribution. This is attributed to the fact that distributions with a larger σ have larger spreads, and as the spread gets wider the chance of picking a slow outlier is greater.

It is also observed that while the green curve appears to be saturating, a slight increase towards the end of the plot may be seen. This is attributed to the fact that a few EXOR trees out of the 20,000 random copies could not be tuned. The reason is the presence of a high resistance slow outlier in a network and a low resistance fast transistor in the tuning circuitry of the same network, an interesting case that was studied earlier. If tuning is done on such a node, the logic voltage of the node would be disturbed, and the gate would very often not switch at all. However this special case is very rare, and only 1 out of 20,000 cases was found to exhibit this behavior in the Vdd=1.2V and σ=0.150V case.
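The curves discussed in this section are averages over repeated random picks, as described in Section 6.2.5. A minimal Python sketch of that sampling step is given below. It assumes that the same random indices are applied to all three worst case delay matrices and that indices are drawn with replacement; the function name and the synthetic delay lists are illustrative only.

```python
import random

def average_large_circuit_delays(no_tc, untuned, tuned, n, trials, seed=0):
    """Average worst case delay of a circuit built from n random sub-circuits.

    no_tc, untuned and tuned are equal-length lists of per-sub-circuit worst
    case delays (without tuning circuitry, with it but untuned, and tuned).
    Returns one average per list, taken over `trials` random picks.
    """
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(trials):
        idx = [rng.randrange(len(no_tc)) for _ in range(n)]
        for k, delays in enumerate((no_tc, untuned, tuned)):
            # The slowest sub-circuit sets the delay of the large circuit.
            sums[k] += max(delays[i] for i in idx)
    return tuple(s / trials for s in sums)

# Synthetic data (ns): 99 ordinary sub-circuits and one outlier that tunes well.
no_tc = [1.0] * 99 + [8.0]
untuned = [1.05] * 99 + [8.3]
tuned = [1.05] * 99 + [1.8]
print(average_large_circuit_delays(no_tc, untuned, tuned, n=100, trials=200))
```

Because the large circuit's delay is the maximum over its sub-circuits, a larger "N" makes it more likely that an outlier is included in the pick, which is why the untuned curves keep rising with circuit size.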
7.3 Benefits obtained

Table 7.1 shows the average case path delay improvements obtained for a circuit size of 10,000 sub-circuits under the different simulation environments.

CASE                         PATH DELAY REDUCTION
Vdd = 1.2V  σ = 0.150V       59.45%
Vdd = 1.2V  σ = 0.075V       15.82%
Vdd = 1.5V  σ = 0.150V       39.52%
Vdd = 1.5V  σ = 0.075V       6.80%

Table 7.1: Performance gain for a circuit with 10,000 EXOR trees

7.4 Conclusion

Clearly the potential for performance improvement exists. Even though the simulations were done in the 180nm technology node, the distributions used for the purpose of simulation mimic current and future technology spreads. Hence similar or better benefits can be expected with smaller technology nodes. The effects of the implemented design on power dissipation are studied in the following chapter.

Chapter 8 Power Dissipated with tuning

The tuning technique implemented provides a performance benefit measured in terms of gain in clock frequency. Its cost in extra power dissipated is studied in this chapter.

8.1 Static power dissipation

In a circuit, once an outlier is identified, the required tuning transistor of that node is turned "ON" and remains "ON" throughout. Hence when the complementary network in the same node is activated to an "ON" state, there exists a direct path for current to flow from Vdd to GND; in other words, static power dissipation exists. To measure the amount of power dissipated, simulations were done on a circuit size of 10,000 sub-circuits for the case Vdd=1.2V and σ=0.150V, as this is the case impacted the most by process variability.

8.2 Simulation difficulty

While simulating for power it is important to simulate the circuit at a certain clock frequency, as dynamic power is a function of frequency. Most chips fabricated are run at a clock period that provides a 10% buffer over the critical delay in the chip, thus operating at a period 1.1 times the worst case delay.
With respect to the work done here, a circuit of size 10,000 would have to be selected randomly out of the 20,000 sub-circuits, and then simulated again at a clock period that is 1.1 times the worst case delay of that circuit to estimate power. This process of selecting 10,000 random sub-circuits out of 20,000, finding the worst case delay of the pick, then simulating it with a clock period that is 1.1 times the worst case delay, and repeating the process 1000 times to get an average case, would effectively require another 10 million (10,000 * 1000) simulations of the EXOR tree sub-circuits, which is very tedious.

8.3 Algorithm

To get around the above problem, a different algorithm is adopted. First, 1000 circuits of size 10,000 sub-circuits are drawn randomly from the 20,000 sub-circuits, and their worst case delays noted. These 1000 circuits are then speed binned depending on the value of the worst case delay of the pick. For circuits without any tuning capability, 5 bins were created: those with worst case delays less than 9ns, between 9ns and 10ns, between 10ns and 11ns, between 11ns and 12ns, and between 12ns and 13ns. Every circuit in a bin can be operated with a clock period which is 1.1 times the largest bin delay. Hence, the bin labeled less than 9ns would have circuits simulated with a clock period of 9.9ns. Similarly, circuits in the bin labeled between 11ns and 12ns would be simulated with a clock period of 13.2ns. In the same way bins were created for circuits with tuning capability added but no tuning done.

Every one of the 20,000 circuits is now simulated again for the different bin clock periods. In other words, for circuits without any tuning capability added, the 20,000 EXOR trees would be simulated at clock periods of 9.9ns, 11ns, 12.1ns, 13.2ns, and 14.3ns. The average power dissipated is evaluated for every simulation.
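The mapping from a worst case delay to its bin clock period can be sketched as below. This is an illustrative Python fragment; the bin edges and the 1.1 guard band come from the text, while the function name and the treatment of a delay lying exactly on a bin edge (assigned to the faster bin) are assumptions.

```python
import bisect

# Upper edges (ns) of the five bins for circuits without tuning circuitry.
BIN_UPPER_EDGES_NS = [9.0, 10.0, 11.0, 12.0, 13.0]
GUARD_BAND = 1.1  # clock period = 1.1 times the largest delay in the bin

def bin_clock_period(worst_delay_ns, upper_edges=BIN_UPPER_EDGES_NS):
    """Return the clock period (ns) of the speed bin containing worst_delay_ns."""
    k = bisect.bisect_left(upper_edges, worst_delay_ns)
    if k == len(upper_edges):
        raise ValueError("delay exceeds the slowest bin")
    return round(GUARD_BAND * upper_edges[k], 3)

for delay in (8.5, 9.4, 11.5, 12.9):
    print(delay, "->", bin_clock_period(delay))
```

With only five distinct clock periods, each of the 20,000 sub-circuits needs to be simulated for power five times, and any randomly assembled circuit can then reuse those results.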
To create a circuit with 10,000 sub-circuits and evaluate the power dissipated, 10,000 random sub-circuits are drawn from the 20,000 sub-circuits. The worst case delay of this circuit determines the bin that the circuit falls in. Once the bin is identified, the power dissipated by every sub-circuit is obtained from the pre-simulated power values for that clock period, and these are added together to give the total power dissipated in the circuit.

Tuned circuits were also simulated for different cases. In the first case only 20 out of 20,000 circuits were tuned, and in another case 85 out of 20,000 circuits were tuned. These circuits were operated both at the clock period at which they were operating before any tuning was done, and at 1.1 times the worst case delay of the circuit after tuning was done.

In this way the number of simulations has been cut down from 10 million to less than a hundred thousand.

8.4 Simulation

To carry out the above algorithm, the SPICE files generated to measure delays are simulated again with references to the ".pm" files, this time, however, to also include power measurement. The same set of ".pm" files used to measure delays is used again here. Power measurement is done by including a SPICE command that measures the instantaneous power dissipated in a device over a time period. The total instantaneous power is then averaged over the clock period to get the average power dissipated. This power dissipation result includes both the dynamic and the leakage power.

8.4.1 Bins

As mentioned earlier, bins were created for simulating for power to reduce the number of simulations required. Table 8.1 shows the bins created for the two circuits. As can be seen, bins for "without tuning circuitry" start and end with a smaller clock period than bins for "with tuning circuitry".
This is due to the fact that circuits with tuning circuitry in them have larger gate delays, because of the parasitic capacitances introduced by the addition of tuning transistors.

without tuning circuitry     with tuning circuitry
8-9ns                        9-10ns
9-10ns                       10-11ns
10-11ns                      11-12ns
11-12ns                      12-13ns
12-13ns                      13-14ns

Table 8.1: Bins

An important point to note is that no clock is actually applied to the circuit input. It is the width of the input pulse that represents the clock period. This is due to the fact that a clock, if applied, must have a period which is greater than the worst case pulse width required by the circuit to function correctly.

8.4.2 SPICE

Once the bin periods are established, a MATLAB program is written to simulate every one of the 20,000 sub-circuits at a clock period of 1.1 times the bin delay. The values of the power dissipated in every circuit are extracted and stored in a matrix.

The next set of simulations is done on tuned circuits. The challenge here is that an outlier circuit first needs to be tuned to bring about speedup, and then a search needs to be done to decide which bin the tuned circuit falls in. Once the bin is chosen, the tuned circuit is simulated again, but this time at a clock period which is 1.1 times the bin period. In the beginning, simulation is done for delay measurement. Hence, the period at which simulation is done is kept large enough to accommodate the outlier delays. When finally simulating for power, the values of the pulse widths in the SPICE netlist had to be modified to the bin period before the netlist was simulated again to measure power. This modification of pulse widths was done through a PERL script, invoked from the MATLAB program for power simulations. While simulating with a circuit size of 10,000, it was found that all circuits once tuned had a worst case delay in the range of 6-7ns. Hence, for tuned circuits, only 1 bin, with a period of 7.7ns, was created.
Once the tuned circuits are simulated, the power dissipation values are stored in a third matrix. These matrices, created for power, are then used, along with the three delay matrices created earlier, in generating larger circuits. Power dissipated with tuning, but with no modification done to the pulse widths, was also calculated and stored for the purpose of comparison, which is studied in the following sections.

Tuning was also done for two cases differing in the number of circuits tuned. In one case only 20 out of the 20,000 circuits were tuned, and in another 85 out of the 20,000 circuits were tuned.

8.4.3 Picking 10,000 sub-circuits to create a larger circuit

To generate a larger circuit of size 10,000 sub-circuits, a random set of 10,000 numbers was generated by a MATLAB program. With these as indices, the delays of the circuits were found from the three delay matrices, and the bins were then chosen, to decide which bin to pick the power dissipation values from. Once the power dissipation values for the 10,000 circuits were picked, they were added to give the total power dissipated in the large circuit with 10,000 EXOR tree circuits. This process of picking a random set of 10,000 numbers and calculating power is repeated 1000 times to get an average case.

8.5 Simulation results

CASE                          Without any tuning circuitry   Untuned     Tuned       Change
20 gates Tuned                0.0656 W                       0.0652 W    0.0654 W    0.24%
85 gates Tuned                0.0657 W                       0.0654 W    0.0657 W    0.41%
20 gates Tuned and Speeded    0.0655 W                       0.0653 W    0.0902 W    38.2%
85 gates Tuned and Speeded    0.0656 W                       0.0652 W    0.107 W     64.1%

Table 8.2: Power dissipated with tuning

Table 8.2 shows that the percentage increase in the power dissipated with tuning alone, and no speedup done, is less than 1%. In the cases above more gates are being tuned than prescribed earlier: only the 1 in 3.5 million that lie beyond 5 sigma should be tuned. But in our case we are forced to tune a little more because of the smaller circuit size involved.
In spite of tuning more, the excess power dissipation observed is still less than 1%, and it would be a lot lower in the 1 in 3.5 million case. On tuning 85 gates instead of 20, the increase in power was also not very substantial (0.41% against 0.24%). It is also observed that the power dissipation increases by 38% and 64% on operating the tuned circuits at their maximum clock frequencies. This is expected because the circuits are now being operated at approximately 38% and 64% higher clock frequencies.

Also, the power dissipated in circuits without any tuning capability included is greater than in circuits with tuning capability included, in spite of the larger leakage power produced in the latter, because the former are operated at a higher clock frequency. Circuits with tuning capability included but no tuning done have larger circuit delays than circuits without any tuning circuitry added, because of the additional capacitance added by the tuning transistors. Hence operating them at a lower clock frequency dissipates less power.

The above results are in line with the expected results. On average, keeping the frequency and the Vdd supply voltage the same, switching from NMOS logic to CMOS logic reduced the power dissipation by approximately 2 orders of magnitude. Hence, tuned gates should also, on average, consume power that goes up by 2 orders of magnitude, because effectively the logic changes from CMOS to a pseudo-NMOS type. Hence, assuming that the power consumed per tuned gate increases by 2 orders of magnitude, a circuit with 280,000 gates of which 25 are tuned would see its power consumption increase by about 25 * 100 / 280,000 = 0.89%, which is very close to the obtained results.

The next chapter presents the final conclusions and observations on the work done.

Chapter 9 Conclusion

In this work a new design methodology that targets the slowdown brought about by process variability, and allows for post-fabrication speedup, is studied.
Random process variability is bound to occur and affect every chip fabricated as scaling goes beyond 22nm. Statistically, every fabricated chip will have dozens of outlier transistors that limit the maximum clock frequency at which the chip can be operated. Such outliers are often considered defects, and the problem needs to be addressed in order to reap the benefits of scaling.

The design methodology incorporates the idea of adding redundant transistors to the circuit which can be turned "ON" in case an outlier is detected. Tuning is limited to outlier gates, which in a circuit of 100 million gates would amount to a few hundred gates. This tuning is beneficial in pulling worst case path delays back close to the average case delay, at the expense of a very small increase in the power dissipated. Simulation results indicate that worst case path delays were brought down by 50%, while power dissipation only increased by a meager 0.44%.

While reviewing the proposed tuning architecture, there were a few concerns about the area overhead, the difficulty of diagnosis, and the implementation of tuning itself. It is true that there is an area overhead, but this overhead is compensated by the savings obtained in silicon itself, through the fault tolerance capability of this architecture. The implemented architecture is fault tolerant, as turning "ON" a tuning transistor could provide a path for charging or discharging the gate output capacitance of an otherwise faulty node. Hence, even though additional area is used up by the tuning circuitry, tuning when implemented could increase the yield and buy back a portion of the silicon used.

Diagnosis of outliers is indeed difficult, but a scheme that enables it is being developed and studied. The following is a brief description of the work being done. The scheme first involves running path delay tests on the circuit to detect the slow paths. Vector pairs that would detect slow paths are determined by running ATPG.
When tested on a chip, an incorrect output logic value measured within a particular time frame for the vector pair applied would implicate a set of gates that could be outlier candidates. Different sets of outlier candidates are determined for different vector pair inputs and the corresponding outputs where the fault was detected. Starting with this set of implicated gates, an algorithm is applied to filter out the actual outliers from the set. This diagnosis scheme is being worked on, and when ready would make it possible to exercise the tuning strategy.

Tuning is envisaged to be implemented by overlaying a programmable, memory-like mesh on the chip, with the gate terminals of the tuning transistors connected to the memory array. The memory array is programmable in the sense that a tuning transistor can be turned "ON" by programming the memory appropriately. To make this implementation affordable, the memory mesh could be fabricated in thin-film technology, where amorphous Si is used in the fabrication process. Such a technology would have a very low drive, but for the sake of tuning it would suffice. Hence, it is valid to assume that such a tuning architecture has scope for applicability in future technology nodes.

The work described in the thesis is only the beginning. The cases studied were naive in the sense that a simple circuit with the same gates and equal path delays is assumed, and adding tuning transistors and the additional loading capacitance had to be done manually and with small approximations. Also, simulations were done in 180nm technology, which might seem to make the obtained results irrelevant, but the distributions worked with are pertinent to the advanced technologies, and hence the results are close to applicable. Future work involves using 45nm technology for the purpose of simulations, adding tunable gates to libraries so that their extraction can be automated, and working with standard circuits to make the case more authentic and applicable.
In addition, a definite diagnosis strategy that enables the detection of outlier gates in a chip, and a tuning methodology that enables implementation of the tuning scheme, need to be studied.

Bibliography

[1] http://www.itrs.net/links/2009ITRS/Home2009.html, The International Technology Roadmap for Semiconductors, 2009.
[2] S. Nassif, K. Bowman, "Design for Variability in DSM Technologies," ISQED 2000.
[3] S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, Vol. 19, Issue 4, pp. 23-29, August 1999.
[4] M. Ashouei, M. Nisar, A. Chatterjee, A. D. Singh, A. Diril, "Probabilistic Self-Adaptation of Nanoscale CMOS Circuits: Yield Maximization under Increased Intra-Die Variations," International Conference on VLSI Design, Bangalore, India, pp. 711-716, January 2007.
[5] M. Ashouei, A. D. Singh, A. Chatterjee, "A Defect-Tolerant Architecture for End-of-Roadmap CMOS," European Test Symposium, Freiburg, Germany, May 2007.
[6] A. D. Singh, "Scan Based Testing of Dual/Multi Core Processors for Small Delay Defects," in Proc. International Test Conference, 2008.
[7] A. D. Singh, "A Self-Timed Structural Test Methodology for Timing Anomalies due to Defects and Process Variations," in Proc. International Test Conference, 2005.
[8] K. S. Saha, "Modeling Process Variability in Scaled CMOS Technology," IEEE Design and Test of Computers, Vol. 27, Issue 2, pp. 8-16, March/April 2010.
[9] B. Cheng, D. Dideban, N. Moezi, C. Millar, G. Roy, X. Wang, S. Roy, A. Asenov, "Statistical-Variability Compact-Modeling Strategies for BSIM4 and PSP," IEEE Design and Test of Computers, Vol. 27, Issue 2, pp. 26-35, March/April 2010.
[10] V. Wang, K. Agarwal, S. R. Nassif, K. J. Nowka, D. Markovic, "A Design Model for Random Process Variability," in Proc. ISQED, pp. 734-737, 2008.
[11] C. Kenyon, A. Kornfeld, K. Kuhn, M. Liu, A. Maheshwari, W. Shih, S. Sivakumar, G. Taylor, P. VanDerVoorn, K. Zawadzki, "Managing Process Variation in Intel's 45nm CMOS Technology," Intel Technology Journal, Vol. 12, Issue 2, pp. 92-110, June 2008.
[12] W. Shockley, "Problems related to P-N Junctions in Silicon," Solid-State Electronics, Vol. 2, pp. 35-67, January 1961.
[13] W. Schemmert, G. Zimmer, "Threshold-Voltage Sensitivity of Ion-Implanted MOS Transistors due to Process Variations," Electronics Letters, Vol. 10, Issue 9, pp. 151-152, May 1974.
[14] K. Yokoyama, A. Yoshii, S. Horiguchi, "Threshold-sensitivity Minimization of Short-Channel MOSFET's by Computer Simulation," IEEE Journal of Solid-State Circuits, Vol. 15, Issue 4, pp. 574-579, August 1980.
[15] P. A. Stolk, F. P. Widdershoven, D. B. M. Klaassen, "Modeling Statistical Dopant Fluctuations in MOS Transistors," IEEE Transactions on Electron Devices, Vol. 45, Issue 9, pp. 1960-1971, September 1998.
[16] K. Kuhn, J. Kelin, "Reducing Variation in Advanced Logic Technologies: Approaches to Process and Design for Manufacturability of Nanoscale CMOS," IEEE International Electron Devices Meeting, IEDM Technical Digest, pp. 471-474, December 2007.
[17] C. H. Diaz, H. J. Tao, Y. C. Ku, A. Yen, K. Young, "An Experimentally Validated Analytical Model for Gate Line Edge Roughness (LER) Effects on Technology Scaling," IEEE Electron Device Letters, Vol. 22, Issue 6, pp. 287-289, June 2001.
[18] H. W. Kim, J. Y. Lee, J. Shin, S. G. Woo, H. K. Cho, J. T. Moon, "Experimental Investigation of the Effect of LWR on Sub-100-nm Device Performance," IEEE Transactions on Electron Devices, Vol. 51, Issue 12, pp. 1984-1988, December 2004.
[19] H. Fukutome, Y. Momiyama, T. Kubo, Y. Tagawa, T. Aoyama, H. Arimoto, "Direct Evaluation of Gate Line Edge Roughness Impact on Extension Profiles," IEEE Transactions on Electron Devices, Vol. 53, Issue 11, pp. 2755-2763, November 2006.
[20] A. Asenov, A. R. Brown, J. H. Davies, S. Kaya, G. Slavcheva, "Intrinsic Parameter Fluctuations in Decananometer MOSFETs introduced by Gate Line Edge Roughness," IEEE Transactions on Electron Devices, Vol. 50, Issue 5, pp. 1254-1260, May 2003.
[21] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak, D. J. Pearson, N. J. Rohrer, "High-performance CMOS Variability in the 65-nm Regime and Beyond," IBM Journal of Research and Development, Vol. 50, Issue 4/5, pp. 433-449, July 2006.
[22] A. Asenov, S. Kaya, J. H. Davies, "Intrinsic Vth Fluctuations in Decananometer MOSFETs Due to Local TOX Variations," IEEE Transactions on Electron Devices, Vol. 49, Issue 1, pp. 112-119, January 2002.
[23] S. Changhwan, S. Xin, K. L. Tsu-Jae, "Study of Random Dopant Fluctuation (RDF) Effects for the Trigate Bulk MOSFET," IEEE Transactions on Electron Devices, Vol. 56, Issue 7, July 2009.
[24] A. T. Putra, A. Nishida, S. Kamohara, T. Hiramoto, "Random Threshold Voltage Variability Induced by Gate-Edge Fluctuations in Nanoscale Metal-Oxide-Semiconductor Field-Effect-Transistors," Applied Physics Express 2, Vol. 024501, January 2009.
[25] http://en.wikipedia.org/wiki/Normal_distribution, Normal distribution.
[26] http://ptm.asu.edu/modelcard/HP/32nm_HP.pm, 32nm High Performance PTM models.
[27] A. D. Singh, K. Mishra, A. Faraz, A. Chatterjee, "Path Delay Tuning for Performance Gain in the face of Random Manufacturing Variations," in Proc. International Conference on VLSI Design, Bangalore, India, January 2011.