Better Than Worst Case Timing Design With Latch Buffers On Short Paths

by

Ravi Kanth Uppu

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
December 14, 2013

Keywords: Process Variation, Hold Latches, Clock Period Scaling, Critical Functional Flops, Shadow Flops

Copyright 2013 by Ravi Kanth Uppu

Approved by

Adit D. Singh, Chair, James B. Davis Professor, Electrical and Computer Engineering
Vishwani Agrawal, James J. Danaher Professor, Electrical and Computer Engineering
Victor P. Nelson, Professor and Assistant Chair, Electrical and Computer Engineering

Abstract

With continued advances in CMOS technology, parameter variations are emerging as a major design challenge. Irregularities during the fabrication of a microprocessor and variations of voltage and temperature during its operation make it increasingly difficult to meet aggressive performance targets under strict power budgets. Traditional adaptive techniques that compensate for Process, Voltage, and Temperature (PVT) variations need safety margins and cannot respond to rapid environmental changes. In this thesis, we present a novel better-than-worst-case (BTWC) design technique, which eliminates worst-case safety margins through in situ detection of variation-induced delay errors. In our design, we pair a delay-error tolerant flip-flop with every timing-critical functional flop so that the clock period can be scaled to the point of first failure of a die under low-power operation, a concept adopted from Razor [12][13]. Thus, all margins due to global and local PVT variations are eliminated, resulting in significant energy savings. In addition, the clock period can be scaled even lower than the first failure point into the sub-critical region, deliberately tolerating a targeted error rate and thereby providing additional energy savings. In the context of this design, a timing error is therefore not a catastrophic system failure but a trade-off between the overhead of error correction and the additional performance benefit of clock frequency scaling. Earlier BTWC designs such as Razor [12][13] introduce shadow flip-flops, triggered by a delayed clock, in parallel with the functional flip-flops for timing error detection through duplication and comparison. This arrangement suffers from the "short path" problem, whereby the activation of paths shorter than the timing skew between the duplicated flip-flops can cause false errors to be flagged. The traditional solution is to add buffers to the short paths whose delay is less than the clock skew between the duplicated error-detection flip-flops. However, this approach adds considerable area and power overhead, particularly in the presence of significant process variations [18]. The proposed design studies the use of latches to introduce extra delay on short paths; holding short paths stable for the first phase of the clock allows the design to achieve a skew of half a clock period between the functional and shadow flip-flops without short path errors. We present a generic algorithm that characterizes all the path groups and places latches in appropriate path segments of the circuit to ensure that all short paths driving duplicated flip-flops are delayed by half a clock cycle. Unit delay simulations for benchmark designs with and without process variations are presented.
Average performance improvement (API) and best-case performance improvement (BPI) figures are presented, with an overall average performance improvement of about 15% and a best-case performance improvement of 32% at the cost of an acceptable area overhead. Error correction for the above approach is handled by an architectural replay mechanism. Furthermore, this design can be proven effective for detecting spurious transitions caused by single event upsets. However, this requires augmenting all the functional flip-flops with shadow flops, leading to additional area overhead.

Acknowledgments

First and foremost I offer my sincerest gratitude to my supervisor, Dr. Adit Singh, who has supported me throughout my thesis with his patience and knowledge whilst allowing me the room to work in my own way. I attribute the level of my Masters degree to his encouragement and effort; without him this thesis would not have been completed or written. His insightful suggestions in understanding and approaching a subject have not only helped in the completion of my thesis but also helped me confront challenges on a larger scale. I also wish to thank my advisory committee members, Dr. Vishwani Agrawal and Dr. Fa Foster Dai, for their guidance and advice on this work. My masters at Auburn University has helped me gain valuable practical experience. Courses like CAD Tools, VLSI Testing, Low Power Design, and Computer Architecture have not only delivered the intuition required of a hardware engineer but also helped me become accustomed to several simulation tools, for which I am grateful to each and every professor. Graduate study at Auburn University has been a learning experience. I thank my brother Ravi Tej for his constant support and my colleagues Suraj and Praveen for their valuable inputs and patience. Without them my graduate experience would not have been this enriching. I am indebted to my parents, brother, sister, and all friends for their love and support. I also thank everyone who directly or indirectly helped and inspired me during the course of my graduate study.

Table of Contents

Abstract . . . ii
Acknowledgments . . . iv
List of Figures . . . vii
List of Tables . . . ix
1 Introduction . . . 1
1.1 Background . . . 2
1.2 Motivation . . . 5
1.2.1 NTC Barriers . . . 5
1.3 Problem Statement . . . 8
1.4 Contribution of This Thesis . . . 9
1.5 Organization of Thesis . . . 10
2 Literature Survey . . . 11
2.1 Requirement for BTWC designs . . . 11
2.1.1 Previous BTWC designs . . . 13
2.2 Razor . . . 14
2.2.1 Razor I Overview . . . 15
2.2.2 Razor II . . . 16
2.2.3 Razor Limitations . . . 18
2.3 CRISTA . . . 21
2.4 CSCD for High Speed and Area Efficient Arithmetic . . . 22
2.5 Need for reasonable area, power overhead and Resilient Error Detection Circuit . . . 24
3 Level-sensitive latch based scheme for hold-fixing . . . 25
3.0.1 Short Path Constraint . . . 25
3.0.2 Latch Based Buffering . . . 27
4 Latch Placement for designs with and without timing variation . . . 30
4.1 Latch Placement Algorithm . . . 30
4.2 Latch Placement for a Design without inducing Variability . . . 33
4.3 Latch Placement for a Design with Timing Variability . . . 37
4.4 Inducing Variation in a given circuit . . . 39
5 Latch placement for an 8-bit ripple carry adder subject to timing variation . . . 42
6 Design Flow Approach For Latch Insertion in a Given Verilog Netlist . . . 45
7 Simulation Results . . . 50
7.1 Conclusion . . . 54
Bibliography . . . 56

List of Figures

1.1 Technology scaling trends of supply voltage and energy . . . 3
1.2 Energy and delay in different supply voltage operating regions . . . 6
1.3 Impact of voltage scaling on gate delay variation . . . 8
2.1 Razor flip-flop . . . 15
2.2 Pipeline augmented with Razor latches and control lines . . . 17
2.3 Current Sensor Circuitry . . . 23
3.1 Short Path Constraints . . . 26
3.2 Latch insertion scheme for satisfying the short path constraint . . . 27
3.3 Timing model waveforms for error detection using a delayed clock . . . 28
3.4 Schematic of the proposed design . . . 29
4.1 Latch Placement Flow . . . 31
4.2 Flow Chart for Latch Insertion . . . 34
4.3 32-bit Ripple Carry Adder Delay Distribution for Random Inputs . . . 35
4.4 Effective Placement of a latch for a design without variation . . . 36
4.5 Process Variation effects on Vth at different technology nodes . . . 38
4.6 Increase in delay of an inverter for multi-Vth operated at various voltage corners . . . 41
5.1 8-bit Ripple Carry Adder . . . 42
5.2 Last four full adder modules of an 8-bit Ripple Carry Adder . . . 43
5.3 Last four full adder modules of an 8-bit Ripple Carry Adder with latches placed . . . 44
6.1 Design flow for the latch insertion . . . 45
7.1 Delay Spread of a 32-bit Ripple Carry Adder for 1 million vectors . . . 50
List of Tables

4.1 Effects of variation on the average of rising and falling transitions of an inverter operated at multiple voltages . . . 40
7.1 Unit Delay Simulations for a 32-bit Ripple Carry Adder with the Latch Placement and Clock Frequency Scaling . . . 51
7.2 Simulation Results for s444 with the proposed design operated at Multiple Voltage Corners . . . 52
7.3 Simulation Results for s510 with the proposed design operated at Multiple Voltage Corners . . . 52
7.4 Simulation Results for s641 with the proposed design operated at Multiple Voltage Corners . . . 53
7.5 Simulation Results for s1196 with the proposed design operated at Multiple Voltage Corners . . . 53
7.6 Simulation Results for 32-Bit RCA with the proposed design operated at Multiple Voltage Corners . . . 54

Chapter 1
Introduction

With the advent of the deep submicron (DSM) era, more and more system-on-chip (SoC) designs are being fabricated in sub-100nm technologies. Process, voltage, and temperature (PVT) variations comprise systematic and random variations in process characteristics across dies, supply voltage droop, and temperature variation. Rising PVT variations at advanced process nodes widen the worst-case timing margins of a design, degrading performance significantly. These variations differ significantly in temporal and spatial scales. Process variations are static in nature, while voltage and temperature variations are highly sensitive to workload behavior, albeit at very different time scales. All sources of variation affect different parts of a microprocessor die in myriad ways with complex interactions, and they differ in their temporal and spatial characteristics. Microarchitectural techniques designed to mitigate parameter variations must clearly account for these differing characteristics. At each semiconductor process node, the performance impact of environmental and process variations becomes a larger portion of the cycle time of the product. With continued scaling of CMOS technology, the number of relevant sources of variation and their magnitudes have increased. In an attempt to account for this, additional static timing corners (additional margin on top of the actual clock period) are being added to the design flow. As additional sources of variation become important, either the total guard-band applied during static timing analysis is increased, or the risk of impacting yield grows. This situation arises from the different delay sensitivities of each path in a design and the inability of currently available design automation tools to handle the unique sensitivities of each path on the chip without running 2^n timing corners, where n is the number of independent variables of interest. Some paths are predominantly sensitive to metal (interconnect) delay while others are predominantly sensitive to silicon (device) delay. Assuming that these paths will track the same from one manufacturing lot to another can lead to silicon that fails its timing requirements. Clearly a better method for dealing with variations is required.
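To make the corner-explosion argument concrete, the short sketch below (our illustration, not from the thesis; the variable names are hypothetical) enumerates the 2^n static timing corners implied by n independent two-valued variation sources:

# Illustrative sketch (not from the thesis): enumerating the 2^n static
# timing corners implied by n independent two-valued variation sources.
from itertools import product

variables = ["metal_delay", "device_delay", "supply_voltage"]  # n = 3, hypothetical
corners = list(product(("slow", "fast"), repeat=len(variables)))
print(len(corners))  # 2^3 = 8 corners, each requiring a full STA run
for corner in corners:
    print(dict(zip(variables, corner)))

Each added independent variable doubles the number of sign-off runs, which is why per-path sensitivity tracking is impractical in a corner-based flow.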
Accounting for all these variations, processors are designed with worst-case operating margins, losing substantial performance. Although processors are traditionally designed to tolerate worst-case conditions, it is unlikely that all nonidealities will take effect at once, pushing a processor to the brink of erroneous behavior. Thus, there exists considerable potential to increase the power efficiency or performance of processors by relaxing the traditional, conservative requirement of correctness in the worst case and instead designing processors for the average case. Such better-than-worst-case designs work normally in the average case and have recovery mechanisms to ensure correct operation when errors occur. Design methodologies with reasonable overhead that can exploit the skewed behavior of the combinational delay profile under extreme variability, which is much more apparent in low-power operation, would reap significant performance benefits.

1.1 Background

Over the past four decades, the number of transistors on a chip has increased exponentially in accordance with Moore's law [1]. This has led to progress in diversified computing applications, such as health care, education, security, and communications. A number of societal projections and industrial roadmaps are driven by the expectation that these rates of improvement will continue, but the impediments to growth are more formidable today than ever before. The largest of these barriers is related to energy and power dissipation, and it is not an exaggeration to state that developing energy-efficient solutions is critical to the survival of the semiconductor industry. Extensions of today's solutions can only go so far, and without improvements in energy efficiency, CMOS is in danger of running out of steam. When we examine history, we readily see a pattern: generations of previous technologies, ranging from vacuum tubes to bipolar-based to NMOS-based technologies, were replaced by their successors when their energy overheads became prohibitive. However, there is no clear successor to CMOS today. The available alternatives are far from being commercially viable, and none has gained sufficient traction or provided the economic justification for overthrowing the large investments made in the CMOS-based infrastructure. Therefore, there is a strong case supporting the position that solutions to the power conundrum must come from enhanced devices, design styles, and architectures, rather than a reliance on the promise of radically new technologies becoming commercially viable. In our view, the solution to this energy crisis is the universal application of aggressive low-voltage operation across all computation platforms. This can be accomplished by targeting so-called "near-threshold operation" and by proposing novel methods to overcome the barriers that have historically relegated ultralow-voltage operation to niche markets.

Figure 1.1: Technology scaling trends of supply voltage and energy

CMOS-based technologies have continued to march in the direction of miniaturization per Moore's law. New silicon-based technologies such as FinFET devices [2] and 3-D integration [3] provide a path to increasing transistor counts in a given footprint. However, using Moore's law as the metric of progress has become misleading, since improvements in packing densities no longer translate into proportionate increases in performance or energy efficiency.
Starting around the 65 nm node, device scaling no longer delivers the energy gains that drove the semiconductor growth of the past several decades, as shown in Figure 1.1. The supply voltage has remained essentially constant since then, dynamic energy efficiency improvements have stagnated, and leakage currents continue to increase. Heat removal limits at the package level have further restricted more advanced integration. Together, such factors have created a curious design dilemma: more gates can now fit on a die, but a growing fraction of them cannot actually be used due to strict power limits. At the same time, we are moving to a "more than Moore" world, with a wider diversity of applications than the microprocessors or ASICs of ten years ago. Tomorrow's design paradigm must enable designs catering to applications that span from high-performance processors and portable wireless applications to sensor nodes and medical implants. Energy considerations are vital over this entire spectrum. The aim of the designer in this era is to overcome the challenge of energy-efficient computing and unleash performance from the reins of power to re-enable Moore's law in the semiconductor industry. The use of ultralow-voltage operation, and in particular subthreshold operation (Vdd < Vth), was first proposed over three decades ago, when the theoretical lower limit of Vdd was found to be 36 mV [4]. However, the challenges that arise from operating in this regime have kept subthreshold operation confined to a handful of minor markets, such as wristwatches and hearing aids. To the mainstream designer, ultralow-voltage design has remained little more than a fascinating concept with no practical relevance. However, given the current energy crisis in the semiconductor industry and stagnated voltage scaling, we foresee the need for a radical paradigm shift in which ultralow-voltage operation is applied across application platforms and forms the basis for renewed energy efficiency. Various design methodologies have been proposed to achieve effective dynamic frequency or voltage scaling. These include adaptive design techniques, which tune the system parameters (supply voltage and frequency of operation) during the dynamic operation of a processor, specific to the native speed of each die and its run-time computational workload. The other class comprises the BTWC designs, which not only eliminate the additional margin that accounts for PVT variations but are also capable of running the clock at much more aggressive rates under low-power operation.

1.2 Motivation

For designs with aggressive voltage scaling, it is important to determine the Vdd at which the energy per operation (or instruction) is optimal. In the superthreshold regime (Vdd > Vth), energy is highly sensitive to Vdd due to the quadratic scaling of switching energy with Vdd. Hence, voltage scaling down to the near-threshold regime (Vdd ≈ Vth) yields an energy reduction on the order of 10X at the expense of approximately 10X performance degradation, as seen in Figure 1.2 [7]. However, the dependence of energy on Vdd becomes more complex as voltage is scaled below Vth. In subthreshold (Vdd < Vth), circuit delay increases exponentially with decreasing Vdd, causing leakage energy (the product of leakage current, Vdd, and delay) to increase in a near-exponential fashion. This rise in leakage energy eventually dominates any reduction in switching energy, creating the energy minimum seen in Figure 1.2.
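The preceding discussion can be condensed into a first-order energy-per-operation model (our summary, not an equation appearing in the thesis; C_eff, I_leak, and t_delay denote effective switched capacitance, leakage current, and cycle delay):

\[
E_{op}(V_{dd}) \;\approx\; \underbrace{C_{eff}\,V_{dd}^{2}}_{\text{switching}} \;+\; \underbrace{I_{leak}(V_{dd})\,V_{dd}\,t_{delay}(V_{dd})}_{\text{leakage}}
\]

The switching term falls quadratically as Vdd is lowered, while in subthreshold t_delay (and hence the leakage term) grows near-exponentially, which is what produces the interior minimum of Figure 1.2.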
The identification of an energy minimum led to interest in processors that operate at this energy-optimal supply voltage [5],[6],[8], which typically lies in the subthreshold regime, though delay rises by 50-100X over the same region. While acceptable in ultralow-energy sensor-based systems, this delay penalty is not tolerable for a broader set of applications. Hence, although introduced roughly 30 years ago, ultralow-voltage design remains confined to a small set of markets with little or no impact on mainstream semiconductor products.

1.2.1 NTC Barriers

Although near-threshold computing (NTC) provides excellent energy-frequency tradeoffs, it brings its own set of complications. NTC faces three key barriers that must be overcome for widespread use: performance loss, performance variation, and functional failure.

Figure 1.2: Energy and delay in different supply voltage operating regions.

A. Performance Loss

The performance loss observed in NTC, while not as severe as that in subthreshold operation, poses one of the most formidable challenges to NTC viability. In an industrial 45 nm technology, the fanout-of-four inverter delay (FO4, a commonly used metric for the intrinsic speed of a semiconductor process technology) at an NTC supply of 400 mV is 10X slower than at the nominal 1.1 V. There have been several recent advances in architectural and circuit techniques that can regain some of this loss in performance. New technology optimizations that opportunistically leverage the significantly improved silicon wearout characteristics (e.g., oxide breakdown) observed at low-voltage NTC can be used to regain a substantial portion of the lost performance.

B. Increased Performance Variation

In the near-threshold regime, the dependencies of MOSFET drive current on Vth, Vdd, and temperature approach exponential. As a result, NTC designs display a dramatic increase in performance uncertainty. Figure 1.3 shows that performance variation due to global process variation alone increases by approximately 5X, from 30% (1.3X) [9] at nominal operating voltage to as much as 400% (5X) at 400 mV. Operating at this voltage also heightens sensitivity to temperature and supply ripple, each of which can add another factor of 2X to the performance variation, resulting in a total performance uncertainty of 20X. Compared to a total performance uncertainty of 1.5X at nominal voltage, the increased performance uncertainty of NTC circuits looms as a daunting challenge that has caused most designers to pass over low-voltage design entirely. Simply adding margin so that all chips meet the needed performance specification in the worst case is effective in nominal-voltage design. In NTC design this approach results in some chips running at 1/10th their potential performance, which is wasteful both in performance and in energy due to leakage currents. The proposed BTWC design presents a new architectural approach that dynamically adapts the performance of a design to the intrinsic and environmental conditions of process, voltage, and temperature, and that is capable of tracking the wide performance range observed in NTC operation. This method is complemented by circuit-level techniques for diminishing the variation of NTC circuits and for efficient adaptation of performance.

C. Increased Functional Failure

The increased sensitivity of NTC circuits to variations in process, temperature, and voltage impacts not only performance but also circuit functionality.
In particular, mismatch in device strength due to local process variations arising from phenomena such as random dopant fluctuations (RDF) and line edge roughness (LER) can compromise state-holding elements based on positive feedback loops. Mismatch in the loop's elements will cause it to develop a natural inclination for one state over the other, a characteristic that can lead to hard functional failure or soft timing failure. This issue has been most pronounced in SRAM, where high yield requirements and the use of aggressively sized devices result in prohibitive sensitivity to local variation. As our proposed design is confined to datapath units in a processor, this barrier can be neglected. However, we can make the design robust to SEUs (single event upsets) by augmenting every functional flop with a shadow flop, which can detect any soft errors at the penalty of increased hardware.

Figure 1.3: Impact of voltage scaling on gate delay variation.

1.3 Problem Statement

From the above discussion, it is evident that innovative design methodologies are in great demand: methodologies that not only overcome the timing margin in clocked sequential synchronous designs but are also capable of aggressive frequency scaling when operated at near-threshold voltage with reasonable hardware overhead. One such design is Razor [12][13], which has its own limitations, due to which the innate benefits of the design could not be exploited. In this thesis we try to overcome those limitations by using a novel hold mechanism at key internal nodes of the combinational profile. The following are the features of the proposed design:

1) Exploiting the skewed behavior of the combinational delay profile in a sequential design under extreme timing variability (apparent in low-power operation).
2) Increased timing margin for the shadow flops (ΔT = T_period/2), allowing aggressive frequency scaling in the face of extreme variability.
3) Reasonable hardware overhead, as dynamic latches (tristate buffers) are used to buffer the short paths, which effectively eliminates short path violations without being susceptible to process variation.
4) An increase in clock frequency of close to 20%, excluding the timing margins.
5) Effectiveness for designs with both inherently skewed and balanced path delays in the combinational profile.

1.4 Contribution of This Thesis

A novel design methodology was proposed that allows aggressive clock frequency scaling with a minimal area overhead; it was analyzed for multiple circuits with and without timing variation. Unit delays were assigned to the gates in the design by characterizing the delay variation of an inverter at a 45nm process node at multiple voltage corners, with the threshold voltage varying from 0.1V to 0.9V. The proposed design was implemented in both combinational and sequential designs with inherently skewed and balanced path groups. A generic algorithm was implemented that characterizes all the path groups and places latches in appropriate path segments of the circuit to ensure that all short paths driving duplicated flip-flops are delayed by half a clock cycle. Unit delay simulations for benchmark designs with and without process variations are presented. Average performance improvement (API) and best-case performance improvement (BPI) figures are presented, with an overall average performance improvement of about 15% and a best-case performance improvement of 32% at the cost of an acceptable area overhead.
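As a rough illustration of the path-group characterization just described, the sketch below (ours, not the thesis implementation; the data structures and delay numbers are hypothetical) flags the path segments whose minimum delay to a duplicated flip-flop falls below half the clock period, marking them as candidates for latch insertion:

# Illustrative sketch only; not the thesis implementation. A "path group"
# here is the set of paths ending at one monitored (duplicated) flip-flop.
def find_latch_candidates(path_groups, t_period):
    """Return, per endpoint flop, the paths whose min delay violates T/2."""
    threshold = t_period / 2.0  # shadow flop samples half a cycle later
    candidates = {}
    for endpoint, paths in path_groups.items():
        # Each path is (segment_name, min_delay); short paths must be held
        # stable through the first clock phase, i.e. delayed past T/2.
        short = [p for p in paths if p[1] < threshold]
        if short:
            candidates[endpoint] = short
    return candidates

# Hypothetical unit-delay data for two endpoints of a small adder.
groups = {"sum3_ff": [("a3-xor", 3), ("carry-chain", 9)],
          "sum7_ff": [("a7-xor", 3), ("carry-chain", 17)]}
print(find_latch_candidates(groups, t_period=20))  # paths needing a latch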
The key contribution of this thesis is achieving a timing window of T_period/2 by augmenting each critical functional flop with a negative-edge-triggered shadow flop. This allows the clock period to be scaled roughly 33% below the critical path delay (a monitored path gets up to T + T/2 to settle, so T can shrink to about two-thirds of the critical delay), without including the PVT margin.

1.5 Organization of Thesis

The thesis is organized as follows. In Chapter 2, we discuss the general background of earlier better-than-worst-case design techniques and their pros and cons. Chapter 3 presents a discussion of short path constraints and of overcoming them using latches at appropriate segment delays. Chapter 4 presents the latch placement algorithm and the constraints on latch placement for designs both with and without timing variability. In Chapter 5, the proposed BTWC design is illustrated using an 8-bit ripple carry adder. A design flow approach of the proposed BTWC design for a generic sequential circuit is discussed in Chapter 6. Simulation results, conclusions, and future work are discussed in Chapter 7.

Chapter 2
Literature Survey

This chapter presents a brief study of present and past research on better-than-worst-case delay designs. Though BTWC designs [10-17] were implemented earlier either for performance improvement or for power-aware computation, any such design has to ensure that the circuit operates without errors even in the worst-case scenario. Thus, efficient error-resilient logic and a mechanism that can recover from a timing-critical error can be thought of as metrics for designs based on BTWC delay. The completion detection and correction in this work is novel, and can be implemented on designs whose delay distribution is skewed. Also, it is important to keep in mind that power and performance are competing design metrics, i.e., one needs to trade power for performance or vice versa.

2.1 Requirement for BTWC designs

After decades of astonishing improvement in integrated circuit performance, digital circuits have come to a point at which there are many problems ahead of us. Two main problems are power consumption and process variations. In the past few decades, circuit designs have followed Moore's law and the number of transistors on a chip has doubled every two years. As we fit more transistors into a given area and clock them faster and faster, power density increases exponentially. The most effective way to reduce power density is to minimize the supply voltage, since dynamic power scales as CV²f. Currently, we have been successful in containing power density within tolerable levels, but this will not last. One barrier comes from the threshold voltage. In order to maintain the same performance, we have to reduce the threshold voltage together with the supply voltage. However, reducing the threshold voltage leads to an exponential increase in off-state leakage current. Leakage current has become so significant that further reductions in threshold voltage and supply voltage have slowed or even stopped. Without voltage scaling, the power density of a chip will increase without bound. If a solution to this ever-increasing power consumption cannot be found, Moore's law will no longer apply and we will no longer experience the tremendous increases in performance seen in the past. The other problem of the IC industry is process variations [17,18]. As transistor sizes approach atomic dimensions, it is very difficult to fabricate exactly what we specify.
For example, a variation in dopant implantation on the order of a few atoms may translate into a huge difference in dopant concentration, and may cause a noticeable shift in the threshold voltage. Because traditional designs dictate that a circuit must always function correctly in all circumstances, the huge process variations present today force designers to allocate more voltage margin on top of the typical case in order to ensure proper operation. To make things worse, temperature, die-to-die, and IR-drop variations further increase the safety margins needed. The general practice of conservative overdesign has become a barrier to low-power design. Because these large margins exist only to prevent the rare worst-case scenario from failing, a large amount of energy can be saved if these margins are eliminated and error-resilient logic that can dynamically tune itself to all kinds of variations is used instead. We will then be able to run a chip at the lowest possible energy consumption or the highest possible performance. Power consumption and variations have become two of the most important roadblocks to extending Moore's law, and these problems must be addressed before we can continue to improve the performance of electronics at the amazing pace enjoyed in past decades. Thus, even a minor improvement in performance without trading off power is a significant contribution in the face of the above-mentioned bottlenecks of digital circuits.

2.1.1 Previous BTWC designs

A number of better-than-worst-case (BTWC) designs have been proposed in the past to allow circuits to save power by operating under normal conditions rather than conservative worst-case limits. One class of BTWC techniques specifies multiple safe voltage and frequency levels at which a design may operate and allows switching between these levels. Examples in this class are the correlating VCO (voltage-controlled oscillator) [28][29] and design-time DVS (dynamic voltage scaling) [32]. As voltage changes, the correlating VCO adapts the frequency of a circuit to a level slightly below the critical frequency. The clock frequency for a given voltage is selected to incorporate a margin for process and temperature variations, as well as noise in the power supply network. Thus, scaling past the critical point is not allowed. Similarly, design-time DVS provides the capability to switch between multiple voltage/frequency operating points to match user or application requirements. As with the correlating VCO, each operating point incorporates a conservative margin to ensure that errors do not occur. Another class of BTWC designs uses canary circuits to detect when arrival at the critical point is imminent, thus revealing the extent of safe scaling. Delay-line speed detectors [30] work by propagating a signal transition down a path that is slightly longer than the critical path of a circuit. Scaling is allowed to proceed until the point where the transition no longer reaches the end of the delay line before the clock period expires. While this circuit enables scaling, no scaling is allowed past the critical path delay plus a safety margin. Another similar circuit technique uses multiple latches that strobe a signal in close succession to locate the critical operating point of a design. The third latch of a triple latch monitor [31] is always assumed to capture the correct value, while the first two latches indicate how close the current operating point is to the critical point.
A design is considered tuned when the values in the first two latches do not match but the values in the last two latches do match, indicating that the sampling point of the third latch exceeds the critical delay of the circuit by only a small margin. All the BTWC techniques mentioned above have similar limitations: they allow scaling up to, but never beyond, the critical operating point. However, with increasing variability in circuits, there is also high potential for benefit (e.g., in terms of power) when scaling is allowed to proceed past the critical point. Razor [12,13,14] actually allows voltage scaling past the critical point, since it incorporates error detection and correction mechanisms to handle the case when errors occur. CRISTA [6] allows aggressive voltage scaling by isolating and predicting the set of paths that may become critical under process variation, ensuring that they are activated rarely, and avoiding possible delay failures in the critical paths by dynamically switching to two-cycle operation. On the other hand, CSCD [10] provides carry completion signaling in low-cost ripple carry adders, which allows the control logic to schedule the next addition as soon as an earlier one is complete, thereby achieving the average-case, rather than worst-case, addition delay over a set of computations. While all the above-mentioned designs can be considered BTWC, only Razor, CRISTA, and CSCD fall under the less conservative class, as each has a unique way to sense timing path completion, and the challenge is to design the least conservative technique. Instead of applying this approach to general-purpose logic, as attempted by the Razor team and others, we have explored its use in specific widely used computations such as addition and multiplication.

2.2 Razor

The Razor approach, proposed in [12], aims at reducing power consumption by shrinking the PVT timing margin to zero and beyond (since timing-critical paths are not activated in every cycle), by building in a system capability to detect occasional errors due to slow signal paths and recover from them. The timing margins are removed by reducing the supply voltage, slowing the circuit down to a point where a small, acceptable number of errors is observed. As long as the power saving from reduced-voltage operation exceeds the extra power needed, on average, by the occasional error detection and recovery cycle, such a scheme provides a net power saving. The challenge clearly lies in designing an efficient, low-cost error detection and recovery capability to support this approach. The original Razor design [12] has changed and evolved [13,14] significantly over time in an attempt to achieve practicality. Even so, the potential power savings from eliminating timing margins appear limited.

2.2.1 Razor I Overview

The key concept in Razor I [12] is to sample the input data of the flip-flop at two different points in time. The earlier, speculative sample is stored in a conventional positive-edge-triggered, master-slave flip-flop. This main flip-flop is augmented with a so-called shadow latch, which remains transparent through the high phase of the clock and latches its value at the negative level of the clock. The Razor I implementation is shown in Figure 2.1.

Figure 2.1: Razor flip-flop

Thus, the shadow latch gets additional time, equal to the high phase of the clock, to capture the correct state of the data. An error is flagged when the data captured at the main flip-flop differs from the shadow-latch data.
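A minimal behavioral sketch of this double-sampling check (our illustration, not the thesis's or Razor's circuit-level design; timing is idealized and names are made up) is:

# Behavioral sketch of Razor-style double sampling (our illustration, not
# the transistor-level design). waveform(t) gives the combinational output
# value at time t within one clock cycle of period T.
def razor_sample(waveform, T):
    main = waveform(0.0)        # speculative sample at the rising edge
    shadow = waveform(T / 2.0)  # shadow latch holds the end-of-high-phase value
    error = (main != shadow)    # mismatch flags a timing error
    return (shadow if error else main), error

# A late-settling signal: its final value 1 arrives at t = 2, after the
# rising edge of a T = 10 clock but well before the shadow sample at t = 5.
late = lambda t: 1 if t >= 2.0 else 0
print(razor_sample(late, T=10.0))  # -> (1, True): corrected via shadow value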
As the setup and hold constraints of the main flip-flop are allowed to be violated, an additional detector is required to flag occurrences of metastability at the output of the main flip-flop. The error pins of the individual Razor I flip-flops are then OR-ed together to generate a pipeline restore signal, which overwrites the main flip-flop with the correct data from the shadow latch at the next positive edge of the clock. Since the shadow latch data is used to overwrite the state of the main flip-flop, it must be ensured, using conventional worst-case techniques, that the data in the shadow latch is always correct. There are key design issues that complicate the deployment of Razor I in high-performance, aggressively clocked microprocessors. The primary difficulty is the generation and propagation of the pipeline restore signal. The restore signal is evaluated at the output of a high fan-in OR-tree and must be suitably buffered and routed to every flip-flop in the pipeline stage before the next rising edge of the clock. This imposes significant timing constraints on the restore signal, and the error recovery path can itself become critical when the supply voltage is scaled. This limits the voltage headroom available for Razor, especially in aggressively clocked designs. The design of the metastability detector is also difficult under rising process variations, as it is required to respond to metastable flip-flop outputs across all process, voltage, and temperature corners. Consequently, it requires the use of larger devices, which adversely impacts the area and power overhead of the Razor I flip-flop. There is the additional risk of metastability on the restore signal itself, which can propagate to the pipeline control logic, potentially leading to system failure.

2.2.2 Razor II

Razor II [13], shown in Figure 2.2, was designed to address the design and timing issues of Razor I by moving the responsibility for recovery entirely into the microarchitectural domain. The Razor II approach introduces two novel components, described in the following paragraphs. Instead of performing both error detection and correction in the flip-flop, Razor II performs only detection in the flip-flop, while correction is performed through architectural replay. This allows a significant reduction in the complexity and size of the Razor flip-flop, although at the cost of an increased IPC penalty during recovery. Architectural replay is a conventional technique that often already exists in high-performance microprocessors to support speculative operation such as out-of-order execution and branch prediction. Hence, it is possible to overload the existing framework to support replay in the event of timing errors. In addition, this technique precludes the need for a pipeline restore signal, thereby significantly relaxing the timing constraints on the error recovery path. This feature makes Razor II highly amenable to deployment in high-performance processors.

Figure 2.2: Pipeline augmented with Razor latches and control lines

In addition, the Razor II flip-flop uses a positive level-sensitive latch instead of a master-slave flip-flop. Flip-flop operation is enforced by flagging any transition on the input data during the positive clock phase as a timing error. Elimination of the master latch significantly reduces the clock pin capacitance of the flip-flop, bringing down its power and area overhead.
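The detection rule just stated is simple enough to sketch directly (our illustration under idealized timing, not the thesis's or Razor II's circuit; a 50% duty cycle is assumed):

# Sketch of Razor II-style detection (our illustration): any transition on
# the latch input during the positive clock phase is flagged as a timing error.
def razorii_error(transition_times, T, duty=0.5):
    """transition_times: times within one cycle at which the data toggles."""
    high_end = T * duty  # clock is high on [0, high_end)
    return any(0.0 <= t < high_end for t in transition_times)

print(razorii_error([2.0], T=10.0))  # True: data still settling in the high phase
print(razorii_error([7.5], T=10.0))  # False: a low-phase transition is legal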
2.2.3 Razor Limitations

Designs such as Razor that allow scaling past the critical operating point [25] must be mindful of two aspects of error recovery: error detection and error correction. Razor detects an error when the value latched by the shadow latch differs from the value latched by the main flip-flop. This happens when the logic signal has not settled to its final value before the setup time of the main flip-flop. If the signal transitions again before the shadow latch latches, an error will be detected. For error correction, the Razor flip-flop must not only detect a timing violation but must also latch the correct value in the shadow latch. This simply implies that the correct value must arrive by the setup time of the shadow latch for all Razor flip-flops in a design. So, Razor may not be able to correct errors if a) detection fails (i.e., both the main flip-flop and the shadow latch have the same incorrect value), or b) detection succeeds, but the value latched in the shadow latch is not the correct value. To guarantee correctness, Razor requires two conditions to be met on the circuit delay behavior: the long path constraint and the short path constraint. The long path constraint (eqn. 2.1) states that the maximum delay through a logic stage protected by Razor must be less than the clock period (T) plus the skew between the two clocks (the clock for the main flip-flop and the clock for the shadow latch):

delay_max < T + t_skew    (2.1)

The short path constraint (eqn. 2.2) states that the minimum delay through the logic stage must exceed the clock skew plus the hold time of the shadow latch:

delay_min > t_skew + t_hold    (2.2)

Failure to satisfy the short path constraint leads to false positive error detections when the logic output changes in response to new circuit inputs before the shadow latch has sampled the previous output. Combining the short and long path constraints (eqn. 2.4) demonstrates that Razor can only guarantee correctness when the range of possible delays for a circuit output falls within a window bounded below by t_skew + t_hold and above by T + t_skew.
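The two constraints can be checked mechanically; the sketch below (our illustration with made-up numbers, restating eqns. 2.1 and 2.2 as reconstructed above) classifies a path's minimum and maximum delays against them:

# Our illustration of the Razor delay constraints (eqns. 2.1 and 2.2);
# the numbers are made up and times are in arbitrary units.
def check_razor_constraints(d_min, d_max, T, t_skew, t_hold):
    long_ok = d_max < T + t_skew        # eqn. 2.1: correct value reaches shadow latch
    short_ok = d_min > t_skew + t_hold  # eqn. 2.2: no false positive on short paths
    return {"long_path_ok": long_ok, "short_path_ok": short_ok}

# With a half-period skew (the scheme studied in this thesis), unbuffered
# short paths violate eqn. 2.2 and must be held stable by an inserted latch.
print(check_razor_constraints(d_min=2, d_max=12, T=10, t_skew=5, t_hold=1))
# -> {'long_path_ok': True, 'short_path_ok': False}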