A High Throughput Multiplier Design Exploiting Input Based Statistical Distribution in Completion Delays

by Ravi Tej Uppu

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
Aug 3, 2013

Keywords: Wallace Multiplier, Completion Sensing, Delay Distribution, Clock Period Scaling, Throughput, Hold and Release

Copyright 2013 by Ravi Tej Uppu

Approved by
Adit D. Singh, Chair, James B. Davis Professor, Electrical and Computer Engineering
Vishwani Agrawal, James J. Danaher Professor, Electrical and Computer Engineering
Fa Foster Dai, Professor and Associate Director, Electrical and Computer Engineering

Abstract

The primary goal of this work is to achieve optimum performance for a multiplier design while dissipating no more static power than slower conventional designs, and reducing it where possible. The design exploits the input-based statistical distribution of the completion delays of a circuit to optimize performance. Design methodologies such as Razor [3,4,5] minimize power dissipation by slowing down circuits so as to eliminate timing slack, to the point where occasional timing errors are observed. The main challenge is the design of efficient mechanisms to detect and recover from these infrequent errors. We present a novel design for the widely used Wallace multiplier built with 4:2 compressors, where, because of the highly skewed input-based statistical distribution of completion delays, the potential for power and performance gains is significantly higher; clock periods can potentially be reduced by a factor of 3 or more, with very rare timing violations for random input distributions. To support this, we present a novel low cost error recovery approach that latches and holds logic values at key internal circuit nodes during every clock cycle beyond the next clock edge. This allows generation of the correct outputs for that clock period one clock cycle later in case of a timing error. Meanwhile, very fast error evaluation, exploiting a unique characteristic of carry ripple addition, allows this hold to be quickly released if no error is detected, ensuring no impact on the circuit timing in error free operation. While an additional area overhead of 10% was observed after implementing the design in a 32x32 Wallace multiplier, a 2.5x improvement in average performance was achieved. Spice simulations sweeping the clock period over 10000 vectors show that the optimum average performance improvement is achieved at a reduced clock period of 3.75ns, against the nominal clock period of 9.5ns; the vectors that trigger the critical path were obtained from [8]. Deployed in a logic circuit, this design would also prove to be a variation tolerant design.

Acknowledgments

My advisor and committee were the people most directly involved with the completion of my thesis. I would like to express my appreciation and sincere thanks to my advisor Dr. Adit Singh, who patiently shaped this work right from the beginning. I benefited greatly from his ability to approach problems from many different directions. His advice and attitude towards life will remain a guiding light for me throughout my career. I also wish to thank my advisory committee members, Dr. Vishwani Agrawal and Dr. Fa Foster Dai, for their guidance and advice on this work.
My work could not have been completed without the excellent course work offered in VLSI Design and Testing and other related courses at Auburn University, for which I am grateful to each and every professor. Graduate study at Auburn University has been a learning experience. I thank my brother Ravi Kanth for his constant support, and my colleagues Suraj and Praveen for their valuable inputs and patience. Without them my graduate experience would not have been this enriching. I am indebted to my parents, brother, sister and all friends for their love and support. I also thank everyone who directly or indirectly helped and inspired me during the course of my graduate study. Finally, I would like to thank AMD for giving me an opportunity to implement the design in TSMC 28nm technology and letting me use the relative numbers, besides giving a start to my career.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Contribution of This Thesis
1.5 Organization of Thesis
2 Literature Survey
2.1 Requirement for BTWC designs
2.1.1 Previous BTWC designs
2.2 Razor
2.2.1 Razor I Overview
2.2.2 RazorII
2.2.3 Razor Limitations
2.3 CRISTA
2.4 CSCD for High Speed and Area Efficient Arithmetic
2.5 Need for Fast and Resilient Error Detection Circuit
3 Brief Overview on Adders
3.1 Basic Adders
3.1.1 Half Adder
3.1.2 Full Adder
3.2 Ripple Carry Adder (RCA)
3.2.1 Performance Analysis of a 4-bit RCA
3.2.2 Best Case Performance
3.2.3 Worst Case Performance
3.2.4 Average Case Performance
3.3 Carry Look Ahead Adder
3.3.1 Performance Analysis of 4-bit CLA
3.4 Carry Select Adder
4 Overview on Multipliers
4.1 The Multiplier
4.2 Partial Product Generation
4.3 Partial Product Accumulation
4.3.1 Array Multiplier
4.3.2 Carry-Save Multiplier
4.3.3 Wallace Multiplier
4.4 Final Addition
5 Proposed BTWC Design
5.1 The Internal Signal Hold until Completion Confirmed Scheme
5.2 Timing Error Detection
5.3 Fast Comparator to monitor output completion
5.3.1 Equality Comparator
5.3.2 A Pass Logic Single Stage Comparator
6 Implementation of 32x32 Wallace Multiplier with the proposed BTWC Design
7 Simulation Results and Conclusion
7.1 Simulation Results using TSMC 180nm technology
7.2 Simulation Results using TSMC 28nm technology
7.3 Conclusion
Bibliography

List of Figures

1.1 Delay Distribution of 32 bit RCA
2.1 Razor flip-flop
2.2 Pipeline augmented with Razor latches and control lines
2.3 Current Sensor Circuitry
3.1 Half Adder
3.2 Full Adder Block
3.3 Full Adder using two Half Adders
3.4 Full Adder with fast carry out
3.5 4-bit Ripple Carry Adder
3.6 4-bit Ripple Carry Adder
3.7 4-bit adder best case performance
3.8 4-bit adder worst case performance
3.9 Illustrating the probability of carry rippling through a 5 stage RCA
3.10 Partial Full Adder
3.11 4-bit Carry Look Ahead Adder
3.12 Carry Select Adder
4.1 Binary Multiplication - an example
4.2 4x4 Array Multiplier
4.3 Delay Distribution of an Array Multiplier for 50,000 vectors
4.4 A 4x4 carry-save multiplier
4.5 Transforming a Partial-Product tree (a) into a Wallace tree (b,c,d), using an iterative covering process
4.6 Wallace Multiplier
4.7 Illustration of Wallace Multiplier
4.8 4:2 Compressor
4.9 Illustration of Wallace Multiplier with 4:2 compressors
4.10 Delay Distribution of a Ripple Carry Adder
4.11 Delay Distribution of 32-bit Wallace Multiplier with final RCA distribution for 100000 vectors
5.1 Pipelined Schematic for Wallace Multiplier Operation
5.2 Timing Model Waveforms for Hold Until Completion
5.3 Timing Error Detection Circuitry
5.4 Implementation of the above Verilog code by Mentor Graphics Synthesis tool
5.5 4-bit Equality comparator
5.6 4-bit Equality comparator
6.1 Illustration of Wallace Multiplier with the Proposed Design

List of Tables

7.1 Spice Simulation Results with Reduced Clock Periods
7.2 Relative numbers for 28nm Multiplier Design (Performance and Area Optimized)

Chapter 1
Introduction

Many factors introduce variation into the behavior of CMOS-based processor designs. Non-idealities such as voltage fluctuations in the power supply network, temperature fluctuations in the operating environment, manufacturing variations in parameters such as gate length and doping concentration, and cross-coupling noise all affect the timing behavior of a processor, making it statistical in nature rather than deterministic.

In traditional processor designs, great pains are taken to ensure that a processor always produces correct results, even when subjected to a worst-case combination of non-idealities. This means that conservative guardbands are incorporated into design parameters to ensure correct behavior in all possible scenarios. Designing for such a conservative operating point incurs a considerable overhead in terms of power spent to ensure correctness. Making matters worse, variation in CMOS-based circuits is expected to increase in coming technology generations [International Technology Roadmap for Semiconductors 2008, http://public.itrs.net], resulting in the need for even more conservative designs. Consequently, the already expensive cost of pristine computation will continue to increase in the future. Although processors are traditionally designed to tolerate worst-case conditions, it is unlikely that all non-idealities will take effect at once, pushing a processor to the brink of erroneous behavior.
Thus, there exists considerable potential to increase the power efficiency or performance of processors by relaxing the traditional, conservative requirement of correctness in the worst case, and instead designing processors for the average case. Such better-than-worst-case designs operate normally in the average case and have recovery mechanisms to ensure correct operation when errors occur.

1.1 Background

Digital circuits are classified into synchronous and asynchronous circuits. Synchronous circuits (and systems) include clocked elements like flip-flops, latches and registers. The input of these elements should be stable before the next active edge of the clock. One major problem with these systems is the distribution of the clock signal, which must be in phase, without significant skew, at every place on the chip. This makes designing a complex synchronous circuit operating at high frequency difficult due to timing skew and delays. Besides, a design should also accommodate a timing window for PVT-induced variation, during which the system is completely inactive, which in turn contributes to the already significant leakage power.

Given these constraints, it is imperative to design an innovative technique which can exploit not only the timing window that accommodates PVT-induced variation but also the slack available in completion delays. Such a solution is possible if the design has a mechanism which can not only sense the failing paths but also effectively recover from them by providing an additional clock cycle. The authors in [ref ] deploy voltage scaling and frequency scaling techniques to reduce power dissipation or improve performance by exploiting the timing slack in the completion delays of non-critical paths. The Razor approach, proposed in [3], aims at reducing power consumption by minimizing the timing margin (used to accommodate PVT-induced variation) to zero and beyond, by building in a system capability to detect occasional errors due to slow signal paths and recover from them. The timing margins are removed by reducing the supply voltage to slow down the circuit to a point where a small, acceptable number of errors are observed. As long as the power saving from the reduced voltage operation exceeds the extra power needed, on average, by the occasional error detection and recovery cycle, such a scheme can provide a net power saving. The challenge clearly is in designing an efficient, low cost error detection and recovery capability to support this approach. The original Razor design [3] has changed and evolved [4,5,7] significantly over time in an attempt to achieve practicality. Even so, the potential power savings from eliminating timing margins appear limited. Besides, CRISTA [6] tries to exploit the input-based statistical distribution of completion delays, using an algorithm to synthesize a circuit such that a critical path is triggered only occasionally, i.e., aiming to obtain a skewed distribution of completion delays. This technique uses Shannon's expansion theorem to synthesize the circuit and resizes the gates on a critical path to make it further critical.
Figure 1.1: Delay Distribution of 32 bit RCA

Instead of applying this approach of minimizing circuit timing slack with recovery from occasional errors to general purpose logic, as attempted by the Razor team and others, we have implemented it in specific widely used computations such as addition and multiplication, as in [1], where, because of the inherent characteristics of the applications, the potential for power savings can be much greater. Depending on the hardware implementation, these computations can display a wide range of completion delays depending on the applied inputs. For example, a simple low cost (in hardware and power) 32-bit ripple carry adder (RCA) can generate a result with no carry propagation delays for some input cases, while requiring 32 stages of carry propagation in the worst case. Importantly, for random inputs, this delay distribution is highly skewed, which can allow a significant speedup in the operating clock period while still ensuring only a small error and recovery rate. This is seen in Figure 1.1, which simulates addition completion delays for a 32-bit RCA designed at the gate level, assuming, for simplicity, a fixed unit delay for every gate. Observe that the median addition delay is only 10 units, compared to the worst-case delay of 64 units. Note this does not mean that a carry ripples through only 3-4 full adder stages in the median case, because strings of carries are generated and propagate in parallel at the same time. From the point of generation, a carry needs to see 01 or 10 inputs at the subsequent full adder stage for propagation, resulting in only a 50% chance that it will propagate through the next stage. The probability of a carry propagating through n stages is (1/2)^n, which dies out quickly with increasing n. Consequently, it is observed in Figure 1.1 that, compared to the worst case delay of 64 units, fewer than 0.1% of random inputs result in an addition delay greater than 25 units. This suggests that the addition can potentially be run at 2.5X the worst case speed, with only one in a thousand operations, on average, requiring correction and recovery. Of course, the overall system utilizing such an adder must be designed to support and work with this occasional recovery operation. Fortunately, architectural level handling of exceptions through pipeline stalls and out-of-order instruction issue is now well understood and commonplace in processors to handle other forms of speculative execution; our design can be considered just another speculative implementation. Note that the greatly reduced average clock period results in a significant energy saving per computation, because of the saving in static power otherwise wasted during the mostly inactive part of the longer clock period in a traditional design. Moreover, this performance gain can be traded off for lower power consumption by reducing the supply voltage to slow down the circuitry instead of speeding up the clock, as in the Razor [3] approach.
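The skew behind Figure 1.1 is easy to reproduce at a high level. The short Python sketch below is a simplified model of our own (not the gate-level simulation used to produce the figure): it measures the longest generate-propagate carry chain for random 32-bit operands, whose length tracks the ripple delay at roughly two gate delays per stage. Because each additional stage is survived with probability 1/2, the sampled tail shrinks geometrically, mirroring the (1/2)^n argument above.

```python
import random

def longest_carry_chain(a, b, width=32):
    # Length of the longest generate-propagate run: a chain starts where a
    # stage generates a carry (input bits 1,1) and keeps rippling while the
    # following stages propagate it (bits 10 or 01); a 0,0 stage kills it.
    longest = chain = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                  # generate: a new chain starts here
            chain = 1
        elif (x ^ y) and chain:    # propagate: survives with probability 1/2
            chain += 1
        else:                      # kill, or no live carry to pass on
            chain = 0
        longest = max(longest, chain)
    return longest

random.seed(1)  # reproducible sampling
d = sorted(longest_carry_chain(random.getrandbits(32), random.getrandbits(32))
           for _ in range(100_000))
print("median chain:", d[len(d) // 2],
      "  99.9th percentile:", d[int(0.999 * len(d))], "  worst possible: 32")
```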
1.2 Motivation

Current trends in CMOS VLSI design methodologies show steady scaling of feature sizes to meet the speed and performance requirements of complex applications in the areas of communication, aerospace, defence and consumer electronics. Extensive scaling of the CMOS process has given rise to many problems such as leakage current, short circuit current, large process variations, etc. [2]. It is proving to be difficult to achieve higher performance with transistor channel lengths scaled to a few nanometres. These problems have to be addressed while designing circuits for low power and high speed applications. Also, with CMOS technology approaching its limits, new approaches to processes and design architectures are needed. Apart from scaling, there has been much focus on very low power consumption and area efficient design methodologies in developing consumer applications. As performance and battery life requirements become more crucial in small and lightweight systems like laptops and hand-held devices, power consumption is becoming a critical problem. For example, processors and microcontrollers for these small portable systems should provide higher system performance while keeping power consumption within specified limits. The performance of these chips depends mainly on the Arithmetic Logic Units (ALUs) used in them to perform mathematical operations such as addition, subtraction, division and multiplication. To save silicon area and power, some chips do not include dedicated arithmetic circuits for multiplication and other complex computations, which are considered very area intensive and power hungry; multiplication is instead performed through repeated addition. In such scenarios, there is a critical need for fast and power efficient adder circuits, which are almost constantly active during computation and can have a major impact on overall system performance. Current Sensing Completion Detection for High Speed and Area Efficient Arithmetic [1] is one such design, which exploits a unique characteristic of a ripple carry adder to improve the performance and power of a Booth multiplier.

Our focus is low cost, high throughput integer multiplication. To demonstrate the efficacy of our approach we present the design of a 32x32 Wallace multiplier. Arithmetic circuits, particularly the multiplier, often have a decisive impact on overall timing in electronic systems. Moreover, multiplication, being a very important and commonly used operation, consumes a significant portion of the system's power in modern processors. Traditionally, state-of-the-art multiplier designs have favored performance over hardware cost and power, with numerous high-speed designs proposed in the literature. Many high speed multipliers employ a carry save approach, based on designs proposed by Wallace [10] and Dadda [11], that first quickly reduces the input-based partial products to two numbers without any carry propagation delays, and then quickly adds these two numbers using high speed adders. Enhancements speed up each of these two steps at considerable expense in hardware and power. For example, the high speed multiplier described in [9] employs Booth encoding on the operands to reduce the number of partial products fed to the carry save tree, specialized compressor circuits for carry save partial product reduction, and a mixed carry-select, carry look-ahead topology to add the final two accumulated partial products [12,13,14]. Our goal is to achieve comparable speed for almost all multiplication operations without many of these enhancements for partial product reduction, and with a much lower cost carry ripple adder to generate the final product, only rarely requiring recovery from a timing error due to the activation of a long path.
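To make the carry save idea concrete before turning to the recovery scheme, the following Python sketch illustrates 3:2 reduction of partial products to two numbers. It is illustrative only — a real Wallace tree arranges the compressors column-wise, and our design uses 4:2 compressors — but the principle of carry-free reduction followed by one final addition is the same.

```python
def carry_save_reduce(operands):
    # 3:2 compression: repeatedly replace any three operands by a bitwise
    # sum word and a left-shifted carry word. No carry ever ripples, so
    # each reduction level costs only a single full-adder delay.
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.append(a ^ b ^ c)                            # sum word
        ops.append(((a & b) | (a & c) | (b & c)) << 1)   # carry word
    return ops  # two numbers remain for one final carry-propagate add

# partial products of 11 x 13, each row ANDed and shifted into position
pps = [(0b1011 << j) if (0b1101 >> j) & 1 else 0 for j in range(4)]
s, c = carry_save_reduce(pps)
print(s + c == 0b1011 * 0b1101)  # True: the final addition gives the product
```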
Furthermore, it is critical for the timing error detection and correction circuitry to be fast, low cost and robust. Developing such a timing exception recovery capability is a major contribution of this thesis.

To achieve this error recovery we present a novel low cost approach which involves latching and holding logic values at key internal circuit nodes during every clock cycle beyond the clock edge, to allow recreation of the correct outputs for that clock period one clock cycle later in case of a timing error. Then very fast error evaluation, exploiting a unique characteristic of carry ripple addition, allows this hold to be quickly released, without any impact on the circuit timing in the next clock period, in error free operation. The rare detection of a timing error (a long carry propagation path) extends the hold signal for one or more additional clock cycles to give the computation time to complete. Note that this approach is very different from those previously proposed for Razor or other work in the literature, and is particularly well suited to exploit the architecture of low cost Wallace multipliers.

1.3 Problem Statement

Earlier work addressed the timing slack due to PVT variation (Razor) and proposed CRISTA, a circuit design technique in which the clock frequency can be increased by a factor of two by skewing the delay paths. In this thesis we improve on the following by exploiting the innately skewed delay distribution of a Wallace multiplier with a ripple carry adder as the final vector merger:

1) A better recovery mechanism to quickly recover from occasional timing errors.
2) Exploiting both the skewed distribution property and the PVT timing margin slack of a circuit.
3) Increasing the clock frequency by a factor of two or more.

1.4 Contribution of This Thesis

In this thesis we present a novel design technique to leverage the input-based statistical delay distribution of a ripple carry adder; this technique is then implemented on a Wallace multiplier with a ripple carry adder as the final adder. The key contribution of this thesis is a performance benefit of close to 2.5x compared to the baseline design, i.e., one whose clock period is determined by the worst case delay. The performance achieved is comparable to that of the fastest multipliers, which are hardware intensive. An analysis of how the delays are spread in a ripple carry adder was done using unit delay simulation in Modelsim. The skewed delay distribution of the ripple carry adder was as expected, and this distribution was leveraged by implementing a Wallace multiplier with a ripple carry adder as the final adder. The key concept is to hold the internal signals at the final adder beyond the next clock edge and release them quickly if there is no active carry signal propagating within the ripple carry adder, thus providing an additional two or three clock cycles for the occasional timing paths which extend beyond the shortened clock period. Spice simulation results on this design for 10,000 random vectors with progressively decreasing clock periods show a performance benefit of 2.36x.

1.5 Organization of Thesis

The thesis is organized as follows. In Chapter 2, we discuss the general background of earlier Better than Worst Case design techniques, their pros and cons, and their use in narrower but widely used designs such as adders and multipliers. Chapter 3 presents a detailed overview of the performance of the ripple carry adder, carry look ahead adders and other complex adders. Chapter 4 presents a brief overview of the various stages of a multiplier and how we leverage the unique characteristics of an RCA. In Chapter 5, the proposed BTWC design is discussed.
Implementation of the proposed BTWC design in a Wallace multiplier is discussed in Chapter 6. Simulation results, conclusions and future work are discussed in Chapter 7.

Chapter 2
Literature Survey

This chapter presents a brief study of present and past research on better than worst case delay designs. Though BTWC designs have been implemented for performance improvement or for power aware computation, the design has to ensure that the circuit operates without errors even in the worst case scenario. Thus, the design of efficient error resilient logic and of a mechanism that can recover from a timing critical error can be thought of as the metric for designs based on BTWC delay. The completion detection and correction in this work is novel, and can be implemented in designs whose delay distribution is skewed. Also, it is important to keep in mind that power and performance are competing design metrics, i.e., one needs to trade power for performance or vice versa.

2.1 Requirement for BTWC designs

After decades of astonishing improvement in integrated circuit performance, digital circuits have come to a point where there are many problems ahead of us. The two main problems are power consumption and process variations. In the past few decades, circuit designs have followed Moore's law, and the number of transistors on a chip has doubled every two years. As we fit more transistors into a given area and clock them faster and faster, power density increases exponentially. The most effective way to reduce power density is to minimize the supply voltage, as predicted by the dynamic power relation P ∝ C·V^2·f: power scales only linearly with capacitance and frequency but quadratically with supply voltage, so, for example, halving the supply voltage alone cuts dynamic power to a quarter. Currently, we have been successful in containing power density within tolerable levels, but this will not last. One barrier comes from the threshold voltage. In order to maintain the same performance, we have to reduce the threshold voltage together with the supply voltage. However, reducing the threshold voltage leads to an exponential increase in off-state leakage current. Leakage current has become so significant that further reduction of the threshold voltage and supply voltage has slowed or even stopped. Without voltage scaling, the power density of a chip will increase without bound. If a solution to this ever increasing power consumption cannot be found, Moore's law will come to an end and we will no longer experience the tremendous increases in performance seen in the past.

The other problem of the IC industry is process variations [17,18]. As transistor sizes approach atomic levels, it is very difficult to fabricate exactly what we specify. For example, a variation in dopant implantation on the order of a few atoms may translate to a huge difference in dopant concentration, and may cause a noticeable shift in the threshold voltage. Because traditional designs dictate that a circuit must always function correctly in all circumstances, the huge process variations present today force designers to allocate more voltage margin on top of the typical case in order to ensure proper operation. To make things worse, temperature, die-to-die, and IR drop variations further increase the safety margins needed. The general practice of conservative overdesign has become a barrier to low power design. Because these large margins are used only to prevent the rare worst case scenario from failing, a large amount of energy can be saved if these margins are eliminated in favor of error-resilient logic that can dynamically tune itself to all kinds of variations.
We would then be able to run our chips at the lowest possible energy consumption or the highest possible performance. Power consumption and variations have become two of the most important roadblocks to extending Moore's law, and these problems must be addressed before we can continue to improve the performance of electronics at the amazing pace enjoyed in the past decades. Thus, even a minor improvement in performance without trading off power is a significant contribution in the face of the above mentioned bottlenecks of digital circuits.

2.1.1 Previous BTWC designs

A number of better-than-worst-case (BTWC) designs have been proposed in the past to allow circuits to save power by operating under normal conditions rather than conservative worst case limits. One class of BTWC techniques specifies multiple safe voltage and frequency levels that a design may operate at and allows switching between these states. Examples in this class are the correlating VCO [23,24] and Design-Time DVS [27]. As voltage changes, the correlating VCO adapts the frequency of the circuit to a level slightly below the critical frequency. The clock frequency for a given voltage is selected to incorporate a margin for process and temperature variations, as well as noise in the power supply network. Thus, scaling past the critical point is not allowed. Similarly, Design-Time DVS provides the capability to switch between multiple voltage/frequency operating points to match user or application requirements. As with the correlating VCO, each operating point incorporates a conservative margin to ensure that errors do not occur.

Another class of BTWC designs uses canary circuits to detect when arrival at the critical point is imminent, thus revealing the extent of safe scaling. Delay line speed detectors [25] work by propagating a signal transition down a path that is slightly longer than the critical path of the circuit. Scaling is allowed to proceed until the point where the transition no longer reaches the end of the delay line before the clock period expires. While this circuit enables scaling, no scaling is allowed past the critical path delay plus a safety margin. Another similar technique uses multiple latches which strobe a signal in close succession to locate the critical operating point of a design. The third latch of a triple latch monitor [26] is always assumed to capture the correct value, while the first two latches indicate how close the current operating point is to the critical point. A design is considered tuned when the values in the first two latches do not match but the values in the last two latches do match, indicating that the setup time of the third latch exceeds the critical delay of the circuit by a small margin.

All the BTWC techniques mentioned above have similar limitations: they allow scaling up to, but never beyond, the critical operating point. However, with increasing variability in circuits, there is high potential for benefit (in terms of power, for example) when scaling is allowed to proceed past the critical point. Razor [3,4,5] actually allows voltage scaling past the critical point, since it incorporates error detection and correction mechanisms to handle the case when errors occur.
While CRISTA [6] allows aggressive voltage scaling by isolating and predicting the set of paths that may become critical under process variation and ensuring that they are activated rarely, avoiding possible delay failures in the critical paths by dynamically switching to two-cycle operation, CSCD [1] provides carry completion signaling in low cost ripple carry adders, which allows the control logic to schedule the next addition as soon as an earlier one is complete, thereby achieving the average case, rather than worst case, addition delay over a set of computations. While all the above mentioned designs can be considered BTWC, only Razor, CRISTA and CSCD qualify as less conservative designs, as each has a unique way of sensing timing path completion; the challenge is to design the least conservative technique. Instead of applying this approach to general purpose logic, as attempted by the Razor team and others, we have explored its use in specific widely used computations such as addition and multiplication.

2.2 Razor

The Razor approach, proposed in [3], aims at reducing power consumption by minimizing the PVT timing margin to zero and beyond (since timing critical paths are not activated in every cycle), by building in a system capability to detect occasional errors due to slow signal paths and recover from them. The timing margins are removed by reducing the supply voltage to slow down the circuit to a point where a small, acceptable number of errors are observed. As long as the power saving from the reduced voltage operation exceeds the extra power needed, on average, by the occasional error detection and recovery cycle, such a scheme can provide a net power saving. The challenge clearly is in designing an efficient, low cost error detection and recovery capability to support this approach. The original Razor design [3] has changed and evolved [4,5] significantly over time in an attempt to achieve practicality. Even so, the potential power savings from eliminating timing margins appear limited.

2.2.1 Razor I Overview

The key concept in Razor I [3] is to sample the input data of a flip-flop at two different points in time. The earlier, speculative sample is stored in a conventional positive-edge triggered, master-slave flip-flop. This main flip-flop is augmented with a so-called shadow latch which samples at the negative edge of the clock.

Figure 2.1: Razor flip-flop

Thus, the shadow latch gets additional time, equal to the high phase of the clock, to capture the correct state of the data. An error is flagged when the data captured at the main
The restore signal is evaluated at the output of a high fan-in OR-tree and is suitably bu ered and routed to every ip- op in the pipeline stage before the next rising edge of the clock. This imposes signi cant timing constraints on the restore signal and the error recovery path can itself become critical when the supply voltage is scaled. This limits the voltage headroom available for Razor, especially in aggressively clocked designs. The design of the metastability detector is also di cult under rising pro- cess variations as it is required to respond to metastable ip- op outputs across all process, voltage and temperature corners. Consequently, it requires the use of larger devices which adversely impacts the area and power overhead of the RazorI ip- op. There is the addi- tional risk of metastability at the restore signal which can propagate to the pipeline control logic, potentially leading to system failure. 2.2.2 RazorII Razor II [3] was implemented to e ectively address the design and timing issues in RazorI which moves the responsibility of recovery entirely to the micro-architectural domain. The RazorII approach introduces two novel components which are described in the following paragraphs. 14 Instead of performing both error detection and correction in the ip- op, RazorII per- forms only detection in the ip- op, while correction is performed through architectural replay. This allows signi cant reduction in the complexity and size of the Razor ip- op. although at the cost of increased IPC penalty during recovery. Architectural replay is a conventional technique which often already exists in high-performance microprocessors to support speculative operation such as out-of-order execution and branch prediction. Hence, it is possible to overload the existing framework to support replay in the event of timing errors. In addition, this technique precludes the need for a pipeline restore signal, thereby signi cantly relaxing the timing constraints on the error recovery path. This feature makes RazorII highly amenable to deployment in high-performance processors. Figure 2.2: Pipeline augmented with Razor latches and control lines Besides, the design of the RazorII ip- op uses a positive level-sensitive latch instead of a master-slave ip- op. The ip- op operation is enforced by agging any transition on the input data in the positive clock-phase as a timing error. Elimination of the master latch signi cantly reduces the clock pin capacitance of the ip- op bringing down its power and area overhead. 15 2.2.3 Razor Limitations Designs such as Razor that allow scaling past the critical operating point [ref (rl) J. Patel. Cmos process variations: A critical operation point hypothesis, 2008.] must be mindful of two aspects of error recovery - error detection and correction. Razor detects an error when the value latched by the shadow latch di ers from the value latched by the main ip- op. This happens when the logic signal has not settled to its nal value before the setup time of the main ip- op. If the signal transitions again before the shadow latch latches, an error will be detected. For error correction, the Razor ip- op must not only detect a timing violation, but must also latch the correct value in the shadow latch. This simply implies that the correct value must arrive by the setup time of the shadow latch for all Razor ip- ops in a design. 
So, Razor may not be able to correct errors if (a) detection fails (i.e., both the main flip-flop and the shadow latch hold the same incorrect value), or (b) detection succeeds, but the value latched in the shadow latch is not the correct value. To guarantee correctness, Razor requires two conditions to be met on the circuit delay behaviour: the short path constraint and the long path constraint. The long path constraint (Eqn. 2.1) states that the maximum delay through a logic stage protected by Razor must be less than the clock period (T) plus the skew between the two clocks (the clock for the main flip-flop and the clock for the shadow latch):

delay < T + skew    (2.1)

If the long path constraint is not satisfied, false negative detections can occur, when a timing violation causes both the main flip-flop and the shadow latch to latch the incorrect value. The short path constraint (Eqn. 2.2) states that there must not be a short path through a logic stage protected by Razor that can cause the output of the logic to change before the shadow latch has latched the previous output:

delay_min > skew + hold    (2.2)

Failure to satisfy the short path constraint leads to false positive error detections, when the logic output changes in response to new circuit inputs before the shadow latch has sampled the previous output. Combining the short and long path constraints (Eqn. 2.3) shows that Razor can only guarantee correctness when the range of possible delays for a circuit output falls within a window of size T - hold (Eqn. 2.4):

skew + hold < delay < T + skew    (2.3)

delay_max - delay_min < T - hold    (2.4)

Note that Equation 2.4 implies a tradeoff between the limit of Razor protection and the range of Razor usability. While increasing the skew can reduce the number of uncorrectable errors by protecting longer path delays, it also reduces the range over which Razor can be applied to correct errors, due to violation of the short path constraint.
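As a quick numeric illustration of constraints (2.1), (2.2) and (2.4) — with hypothetical delay values, not figures from the Razor papers — the following Python sketch checks whether a logic stage's delay spread fits inside the Razor correction window:

```python
def razor_can_correct(delay_min, delay_max, T, skew, hold):
    # Long path constraint (Eqn. 2.1): the slowest path must settle before
    # the shadow latch samples, i.e. delay_max < T + skew.
    long_ok = delay_max < T + skew
    # Short path constraint (Eqn. 2.2): the fastest path must not overwrite
    # the previous output before the shadow latch has sampled it.
    short_ok = delay_min > skew + hold
    return long_ok and short_ok

# Hypothetical numbers in ns. A circuit whose delay spread exceeds
# T - hold (Eqn. 2.4) cannot satisfy both constraints at once:
print(razor_can_correct(delay_min=0.2, delay_max=3.4,
                        T=3.0, skew=0.5, hold=0.1))  # False: spread 3.2 > 2.9
print(razor_can_correct(delay_min=0.8, delay_max=3.4,
                        T=3.0, skew=0.5, hold=0.1))  # True:  spread 2.6 < 2.9
```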
The authors of [22] demonstrate the voltage scaling limitations of Razor-based design using two circuits chosen as canonical examples of two contrasting design philosophies: a Kogge-Stone adder (KSA) and a ripple carry adder (RCA). The KSA architecture has timing paths of nearly the same length, and therefore exhibits a critical operating point akin to traditional high performance processor designs. Due to this property of the KSA, it was observed in [ref ] that scaling beyond a certain voltage point leads to a catastrophic failure of the adder (i.e., a 100% error rate). Aggressive voltage scaling, therefore, is not possible for such designs even with an efficient error correction mechanism, as the power consumption may actually increase drastically in spite of voltage scaling. This is because the absolute error rate is high (close to 100%) and the overhead of error recovery for Razor is roughly an order of magnitude more expensive than the overhead of executing an instruction normally [ref ]. So, for designs like the KSA, where timing paths are bunched up (as in traditional high performance processor designs), Razor may not be very effective in terms of power reduction through undervolting (i.e., scaling beyond the voltage at which the first timing violation appears). While some power can be saved by eliminating the voltage guardband, scaling past the critical operating point results in nearly 100% erroneous computations.

The second circuit, the ripple carry adder (RCA), is not subject to catastrophic failure in response to scaling past the point of first error. The delay distribution shown in Figure 1.1 of Chapter 1 implies that the error rate increases gradually as voltage decreases. Although the minimum delay of any path of the RCA equals the delay of the sum path of a full adder, the operational delay ultimately depends on the adder inputs, which generate carry chains from lower to higher order bits. The RCA exhibits maximum delay when the carry chain extends from the least significant bit to the most significant bit. However, on average, carry chains are much shorter, leaving extensive room for aggressive scaling past the point where errors begin to occur. In fact, the error rate reaches close to 100% only at very low voltages.

The behaviour of the RCA may be the desired behaviour for high performance processor designs seeking significant power savings through undervolting. Recent attempts [20,21] at processor designs whose reliability degrades gracefully in the face of voltage scaling try to mimic this behaviour. It might seem obvious that Razor should perform well for architectures that fail gracefully, since such designs do not have a wall of criticality. However, analysis of the results in [22] reveals some serious limitations of using Razor even in such architectures.

Limitations arise due to the potential short path and long path constraint violations discussed earlier. If the long path constraint is not satisfied, false negative detections can occur when the main flip-flop and the shadow latch both latch the incorrect value. Similarly, failure to meet the short path constraint makes Razor unusable over a range of voltages. In fact, the same factor that makes the error behavior of the RCA graceful (its skewed delay distribution) makes Razor less effective. This is because Razor relies on the variation in delay being less than a threshold, and the variation in delay is significantly larger for an RCA design than for a KSA design. In order to make Razor work for circuits that fail gracefully, buffering must be used to increase the delay of short paths, shifting them into the window of correction. This buffering adds area and power overheads to a design, negating some of the power savings afforded by better than worst-case design. Secondly, the required buffering increases the delay of short paths, transforming a circuit from one that fails gracefully to one that fails catastrophically, thus limiting the extent of possible scaling. So, while Razor is ineffective for circuits like the KSA because of massive timing violations in the face of undervolting, it is also not very effective for circuits like the RCA due to the large span between the maximum and minimum circuit delays. The results in [ref razor limits] demonstrate the inadequacies of current better than worst-case design methodologies (like Razor) in terms of voltage/frequency scaling, motivating the need for new techniques for processor design and error handling.

2.3 CRISTA

CRISTA [6] is a low-power variation-tolerant circuit design technique called CRitical path ISolation for Timing Adaptiveness [ref crista], which allows aggressive voltage scaling. The principal idea includes the following:

1) It isolates the critical paths and makes them predictable (based on a few primary inputs) under parametric variation, so that with reduced supply voltage, possible delay errors under single-cycle operation are deterministic and can be avoided by a two-cycle operation.

2) It restricts the occurrence of these two-cycle operations by reducing the activation probability of the critical paths.
3) It increases the delay margin between critical and noncritical paths by both logic synthesis and proper gate sizing, for improved yield, reliability of operation, and aggressive voltage scaling.

As mentioned above, CRISTA induces a skewed distribution of completion delays in a given pipeline stage logic, and further downsizes the critical path gates to increase the timing slack between non-critical and critical paths for performance/power benefits. The occurrence of an error is detected by decode logic which monitors the activation of critical path vectors. Intuitively, the performance or power benefit of such a design technique for designs such as a 32-bit RCA can be minimal: the RCA has an innately skewed distribution, as shown in Figure 1.1, and the decode logic used to flag two-cycle operation imposes an additional hardware penalty. Therefore, though CRISTA can be considered a less conservative design for generic circuits with balanced critical paths, it may not gain large performance benefits in designs with a skewed delay distribution such as the RCA.

2.4 CSCD for High Speed and Area Efficient Arithmetic

This design [1] equips a ripple carry adder with a current sensing capability which observes late settling carry signal nodes in the circuit and indicates when they reach a quiescent state. The incorporation of a computation completion signal into an RCA offers a way for the improved "average case" RCA to signal to the higher level circuitry controlling it that it has completed the operation. Thus, for example, if 32 repeated additions are to be performed to multiply two 32 bit numbers, using completion signalling to initiate the next addition can cut the total multiplication time from 32 worst case addition delays down to 32 average case delays.

The proposed method for implementing the current sensor uses a sense-inverter such as the one shown in Figure 2.3. A CMOS inverter draws current from the power supply as long as its input is midrange between a high and a low voltage, such that both the P and N transistors see a gate-source voltage above their respective threshold voltages. The supply current ceases to flow once the input comes within a threshold of either the high or the low supply voltage. Thus, monitoring the supply current provides a means to determine whether the input has stabilized close to (within a threshold voltage of) a high or low level.

Figure 2.3: Current Sensor Circuitry

In Figure 2.3 the sense current is drawn from a transient current generator, i.e., an inverter whose quiescent IDD current is monitored with respect to the rising clock edge of each capture flop. The Addcomp signal is flagged upon addition completion. Though this design seems promising for circuits with a skewed delay distribution such as an RCA, the loading on the sense circuitry increases linearly with the number of capture flops to be monitored for transition completion, imposing an upper bound on the performance or power benefits.

2.5 Need for Fast and Resilient Error Detection Circuit

Though designs such as Razor, CRISTA and CSCD are less conservative than other BTWC designs, the performance/power benefits for a design with a skewed delay distribution (like an RCA) using any of the above techniques are limited. Thus, to exploit the innately skewed distribution of a much narrower but widely used class of arithmetic designs, i.e., the
multiplier with an RCA as the final vector merger, a fast and resilient error detection technique is required. The major contribution of the work in this thesis is the design of fast error detection logic which makes the performance of a multiplier with a low cost RCA comparable to that of the fastest multipliers.

Chapter 3
Brief Overview on Adders

In this chapter the delay performance of a ripple carry adder in the best and worst case scenarios is discussed. The performance and area of the ripple carry adder are also compared to different flavors of adders such as the Carry Look Ahead (CLA) adder, the Carry Save Adder (CSA), the Kogge-Stone adder, etc.

3.1 Basic Adders

The basic building blocks of any adder are the half adder and the full adder (3:2 compressor). There are several more complex compressors in the literature, but a good insight into the half adder and full adder gives an intuitive basis for characterizing the performance of complex adders.

3.1.1 Half Adder

The half adder adds two single binary digits A and B. It has two outputs, sum (S) and carry (C). The carry signal represents an overflow into the next digit of a multi-digit addition. The simplest half-adder design, pictured in Figure 3.1, incorporates an XOR gate for S and an AND gate for C.

Figure 3.1: Half Adder

In the above figure the worst case delay is one XOR gate delay.

3.1.2 Full Adder

A full adder performs addition of three 1-bit numbers, A, B and Cin, and gives two outputs, the sum S and the carry out Cout. The expressions for sum and carry are given by

S = A ⊕ B ⊕ Cin    (3.1)

Cout = A·B + Cin·(A ⊕ B)    (3.2)

Cout = A·B + A·Cin + B·Cin    (3.3)

Figure 3.2: Full Adder Block

Two half adders can be combined to make a full adder with the addition of an OR gate to combine their carry outputs. Using Eqns. 3.1 and 3.2 we can represent a full adder at the gate level as shown in Figure 3.3. It can be observed from the figure that the critical path of the sum output consists of two XOR gates, whereas the critical path of the carry out consists of an XOR, an AND and an OR gate. Since the carry output is the one that propagates in adders of bit length greater than one, it is important to build a full adder which can generate the carry output faster, as shown in Figure 3.4.

Figure 3.3: Full Adder using two Half Adders

In Figure 3.4 the carry out delay is determined by an AND gate and a 3-input OR gate, which is much faster than the adder of Figure 3.3. Henceforth in this thesis, a full adder refers to the one in Figure 3.4.

Figure 3.4: Full Adder with fast carry out
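As a quick sanity check (ours, not part of the thesis flow), a few lines of Python confirm over all eight input combinations that the two carry-out forms of Eqns. 3.2 and 3.3 agree and that the full adder arithmetic holds:

```python
from itertools import product

for a, b, cin in product((0, 1), repeat=3):
    s = a ^ b ^ cin                            # Eqn. 3.1
    cout_32 = (a & b) | (cin & (a ^ b))        # Eqn. 3.2
    cout_33 = (a & b) | (a & cin) | (b & cin)  # Eqn. 3.3
    assert cout_32 == cout_33                  # the two carry forms agree
    assert a + b + cin == 2 * cout_32 + s      # and the arithmetic holds
print("full adder equations verified")
```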
3.2 Ripple Carry Adder (RCA)

A ripple carry adder consists of 1-bit full adder blocks connected in series, with the carry out of one block serving as the carry in to the next. Figure 3.5 shows the interconnection of a 4-bit ripple carry adder. It can be observed that the critical path of this design is the path from C0 to C4.

Figure 3.5: 4-bit Ripple Carry Adder

3.2.1 Performance Analysis of a 4-bit RCA

Figure 3.6 shows the gate level representation of the 4-bit adder circuit with sum and carry outputs. The addition of two bits at any stage depends on the carry generated by the addition of the two bits in the previous stage. Thus, the sum of the most significant bit is only available after the carry signal has rippled through the adder from the least significant stage to the most significant stage. The critical path of this design is the carry propagation path from C0 to C4, which is 8 gate delays. Extending the design to a 32-bit ripple carry adder gives a critical path delay of 64 gates.

Figure 3.6: 4-bit Ripple Carry Adder

3.2.2 Best Case Performance

The best case performance of a ripple carry adder is seen when no carry is generated at any bit position. In the case of the 4-bit adder in Figure 3.6, the sum outputs are then available after just two XOR gate delays. This can be observed with input values A = 1111, B = 0000 and C0 = 0, which gives C1 = C2 = C3 = C4 = 0 at each bit position. In this case no carry bits are generated, and since no propagation takes place the sum outputs are valid after two gate delays.

Figure 3.7: 4-bit adder best case performance

Delay: S0 = 2 gate delays, S1 = 2 gate delays, S2 = 2 gate delays, S3 = 2 gate delays.

3.2.3 Worst Case Performance

The worst case performance of a ripple carry adder is seen when the input carry Cin is propagated to the next stage by every bit position. Because of the carry propagation from the least significant position to the most significant position, the sum bits settle after varying delays.

Figure 3.8: 4-bit adder worst case performance

Delay: S0 = 2 gate delays, S1 = 4 gate delays, S2 = 6 gate delays, S3 = 8 gate delays.

In the case of the 4-bit adder in Figure 3.6, this can be observed with A = 1111, B = 0000 and C0 = 1, which gives C1 = C2 = C3 = C4 = 1 at each bit position. Since the carry is propagated through every stage, the delays of the sum outputs range from two to eight gate delays.

3.2.4 Average Case Performance

The average case performance of a ripple carry adder can be obtained from the delay distribution of the RCA. From a statistical perspective, consider the probability of having a '10' or '01' combination at the inputs of a full adder: two of the four possible input combinations have this property (a carry can only propagate to the next full adder if the input bits are '10' or '01'), so the probability that a carry propagates to the next full adder is 1/2. Hence, the probability that a carry propagates beyond the nth stage is (1/2)^n; for a 5-bit adder, for example, the critical path is triggered once in 32 cases, as illustrated in Figure 3.9. The skewed nature of the RCA therefore becomes more prominent with increasing RCA bit length.

Figure 3.9: Illustrating the probability of carry rippling through a 5 stage RCA

Figure 3.9 illustrates the probability of a carry being propagated all the way down to Cout in a 5 stage ripple carry adder.
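The best, worst and average cases above can be tied together with a small unit gate-delay bookkeeping sketch in Python. This is our reconstruction of the simple delay convention behind Figures 3.7 and 3.8, not a circuit simulation:

```python
def sum_settle_times(a, b, c0, width=4):
    # Unit gate-delay bookkeeping: each sum needs its two XOR delays, and a
    # live carry chain that reaches a stage adds the two gate delays
    # (AND + OR of the Fig. 3.4 adder) for every stage it has rippled through.
    times = []
    t_carry = 0 if c0 else None   # settle time of a live incoming carry
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        times.append(2 if t_carry is None else t_carry + 2)
        if x & y:                                   # generate: chain restarts
            t_carry = 2
        elif (x ^ y) and t_carry is not None:       # propagate: chain ripples
            t_carry += 2
        else:                                       # kill: chain dies
            t_carry = None
    return times

print(sum_settle_times(0b1111, 0b0000, c0=0))  # best case:  [2, 2, 2, 2]
print(sum_settle_times(0b1111, 0b0000, c0=1))  # worst case: [2, 4, 6, 8]
```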
The carry output has the following logic:

C1 = G0 + P0C0
C2 = G1 + P1C1 = G1 + P1G0 + P1P0C0
C3 = G2 + P2C2 = G2 + P2G1 + P2P1G0 + P2P1P0C0
C4 = G3 + P3C3 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0

The carry look ahead block takes only two gate delays to generate the carry bits C1 through C3. The implementation of C4 becomes more complicated when this 4-bit carry look ahead adder is extended to multiples of 4 bits, such as 16 bits and 32 bits. The carry look ahead adder for 4 bits computes the carry bits at all bit positions simultaneously. The longest delay in the 4-bit carry look ahead adder is 4 gate delays, compared with 8 gate delays in the ripple carry adder. The improvement is modest, and comes at the cost of much additional hardware. From Figure 3.11 it can be observed that the fan-in for generating C4 is 4; if we extend the same concept to generate more carry bits simultaneously, the high fan-in of the carry generation gates contributes additional delay. So in such cases the carry bits are limited to groups of 4 and are extended to the next higher level of blocks that generate the carry bits. For a 16 bit carry look ahead adder the higher numbered bits 4-7, 8-11, and 12-15 are grouped together, and at positions 4, 8, and 12 we would like the carry to be produced as fast as possible without using excessive fan-in. The estimated worst case delay for a 32-bit CLA is 8 gate delays, but the high fan-in in the carry generation block can add additional delay in the adder circuit.

Figure 3.11: 4-bit Carry Look Ahead Adder

3.4 Carry Select Adder

In a ripple carry adder, every full-adder cell has to wait for the incoming carry before an outgoing carry can be generated. One way to get around this linear dependency is to anticipate both possible values of the carry input and evaluate the result for both possibilities in advance. Once the real value of the input carry is known, the correct result is easily selected with a simple multiplexer stage. An implementation of this idea, appropriately called the carry-select adder, is shown in Figure 3.12. Consider the block of adders which is adding bits k to k+3. Instead of waiting on the arrival of the output carry of bit k-1, both the 0 and 1 possibilities are analyzed. From a circuit point of view, this means that two carry paths are implemented. When the incoming carry finally settles, either the result of the 0 path or the 1 path is selected by the multiplexer, which can be performed with minimal delay.

Figure 3.12: Carry Select Adder

As is evident from Figure 3.12, the hardware overhead of the carry-select adder is quite significant, with an additional carry path and a multiplexer.

Chapter 4
Overview on Multipliers

Multiplications are expensive and slow operations. The performance of many computational problems is often dominated by the speed at which a multiplication operation can be executed. Many multiplier designs focus on the performance perspective of a multiplier and may not fit a low-power environment; hence, the fastest multipliers in the literature come at the expense of area, power or routing complexity. This chapter discusses some basic multiplier designs from the literature [ref], ranging from designs similar to decimal multiplication using paper and pencil to the fastest multipliers, which carry significant overhead in area.

4.1 The Multiplier

Consider two unsigned binary numbers X and Y that are M and N bits wide, respectively.
To introduce the multiplication operation, it is useful to express X and Y in binary representation:

X = \sum_{i=0}^{M-1} X_i 2^i \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j   (4.1)

with X_i, Y_j \in \{0, 1\}. The multiplication operation is then defined as follows:

Z = X \times Y = \sum_{k=0}^{M+N-1} Z_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}   (4.2)

The simplest way to perform a multiplication is to use a single two-input adder [9]. For inputs that are M and N bits wide, the multiplication takes M cycles, using an N-bit adder. This shift-and-add algorithm adds together M partial products. Each partial product is generated by multiplying the multiplicand with a bit of the multiplier - which, essentially, is an AND operation - and by shifting the result on the basis of the multiplier bit's position (a behavioral sketch of this algorithm appears at the end of this subsection).

A faster way to implement multiplication is to resort to an approach similar to manually computing a multiplication: all the partial products are generated at the same time and organized in an array, and a multioperand addition is applied to compute the final product. The approach is illustrated in Fig 4.1.

Figure 4.1: Binary Multiplication - an example

This set of operations can be mapped directly into hardware. The resulting structure is called an array multiplier and combines the following three functions: i) Partial-Product Generation, ii) Partial-Product Accumulation and iii) Final Addition.

4.2 Partial Product Generation

Partial products result from the logical AND of the multiplicand X with a multiplier bit Yi. Each row in the partial-product array is either a copy of the multiplicand or a row of zeros. It is obvious that this stage of the multiplication is completed quickly; the maximum delay of the multiplier is contributed by the partial-product accumulation and the final addition, which have a fixed and a variable delay respectively, as discussed in the following subsections.

4.3 Partial Product Accumulation

After the partial products are generated, they must be summed. This accumulation is essentially a multioperand addition. A straightforward way to accumulate partial products is by using a number of adders that form an array - hence the name, array multiplier.

4.3.1 Array Multiplier

The most direct approach to multiplication implementation is the array multiplier. The composition of an array multiplier is shown in Fig 4.2. There is a one-to-one topological correspondence between this hardware structure and the manual multiplication shown in Fig 4.1. The generation of N partial products requires N×M two-input AND gates. Most of the area of the multiplier is devoted to the adding of the N partial products, which requires N-1 M-bit adders. The shifting of the partial products for their proper alignment is performed by simple routing and does not require any logic. The overall structure can be easily compacted into a rectangle, resulting in a very efficient layout.

Figure 4.2: 4x4 Array Multiplier

Due to the array organization, determining the propagation delay of this circuit is not straightforward. Consider the implementation of Fig 4.2, where the partial sum adders are implemented as ripple-carry structures. Performance optimization requires that the critical timing path be identified first. This turns out to be nontrivial: a large number of paths of almost identical length can be identified. One such path, the critical path, is shown in Fig 4.2.
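Before analyzing the array multiplier's delay, the shift-and-add algorithm of Eqn 4.2 can be stated directly in behavioral form. The sketch below is ours (module name and widths are assumptions), and it unrolls the M-cycle loop combinationally purely for illustration.

// Behavioral sketch of the shift-and-add algorithm of Section 4.1
module shift_add_mult #(parameter M = 8, N = 8) (
  input  wire [M-1:0]   x,   // multiplicand
  input  wire [N-1:0]   y,   // multiplier
  output reg  [M+N-1:0] z    // product
);
  integer j;
  always @* begin
    z = {(M+N){1'b0}};
    for (j = 0; j < N; j = j + 1)
      if (y[j])                          // partial product: AND of x with bit y[j]
        z = z + ({{N{1'b0}}, x} << j);   // shifted by the multiplier bit's position
  end
endmodule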
A closer look at the critical path yields an approximate expression for the propagation delay:

t_mult = [(M-1) + (N-1)]·t_carry + (N-1)·t_sum + t_and   (4.3)

where t_carry is the propagation delay between the input and output carry, t_sum is the delay between the carry in and the sum bit of the full adder (as discussed in Chapter 3), and t_and is the delay of the AND gate. From equation 4.3 we can say that the timing paths in an array multiplier are balanced, which is illustrated in the delay distribution of an array multiplier obtained for 50,000 vectors.

Figure 4.3: Delay Distribution of an Array Multiplier for 50,000 vectors

Since all critical paths have the same length, speeding up just one of them - for instance, by replacing one adder by a faster one such as a carry-select adder - does not make much sense from a design standpoint. All critical paths have to be attacked at the same time. From equation 4.3, it can be deduced that the minimization of t_mult requires the minimization of both t_carry and t_sum. In this case, it could be beneficial for t_carry to equal t_sum. This contrasts with the requirements for adder cells discussed before, where a minimal t_carry was of prime importance.

4.3.2 Carry-Save Multiplier

Due to the large number of almost identical critical paths, increasing the performance of the array multiplier through transistor sizing yields marginal benefits. A more efficient realization can be obtained by noticing that the multiplication result does not change when the output carry bits are passed diagonally downwards instead of only to the right, as shown in Fig 4.4.

Figure 4.4: A 4x4 carry-save multiplier

An extra adder is included to merge the final vectors and generate the final result. The resulting multiplier is called a carry-save multiplier, because the carry bits are not immediately added, but rather are "saved" for the next adder stage. In the final stage, carries and sums are merged in a fast carry-propagate (e.g., carry-lookahead or carry-skip) adder stage. (The major contribution of this thesis is to attain performance comparable to a carry save multiplier design with one of the above mentioned adders as the final vector merger, without the associated area overhead.) While this structure has a slightly increased area cost (one extra adder), it has the advantage that its worst case critical path is shorter and uniquely defined, and is expressed as

t_mult = t_and + (N-1)·t_carry + t_merge   (4.4)

4.3.3 Wallace Multiplier

The partial-sum adders can also be rearranged in a treelike fashion, reducing both the critical path and the number of adder cells needed. Consider the simpler example of four partial products, each of which is four bits wide, as shown in Fig 4.5. The number of full adders needed for this operation can be reduced by observing that only column 3 in the array has to add four bits; all other columns are somewhat less complex. This is illustrated in Fig 4.5, where the original matrix of partial products is reorganized into a tree shape to visually illustrate its varying depth. The challenge is to realize the complete matrix with a minimum depth and a minimum number of adder elements. The first type of operator that can be used to cover the array is a full adder, which takes three inputs and produces two outputs: the sum, located in the same column, and the carry, located in the next one. For this reason, the FA is called a 3:2 compressor. It is denoted by a circle covering three bits.
The other operator is the half-adder, which takes two input bits in a column and produces two outputs. The HA is denoted by a circle covering two bits.

Figure 4.5: Transforming a Partial-Product tree (a) into a Wallace tree (b,c,d), using an iterative covering process

To arrive at the minimal implementation, we iteratively cover the tree with FAs and HAs, starting from its densest part. In a first step, we introduce HAs in columns 4 and 3 (Fig 4.5b). The reduced tree is shown in Fig 4.5c. A second round of reductions creates a tree of depth 2 (Fig 4.5d). Only three FAs and three HAs are used for the reduction process, compared with six FAs and six HAs in the carry-save multiplier of Fig 4.4. The final stage consists of a simple two-input adder, for which any adder type can be used (as discussed in the next section, "Final Addition").

The presented structure is called the Wallace tree multiplier, and its implementation is shown in Fig 4.6, along with an illustration of an 8x8 multiplier with the three different stages of a multiplier in Fig 4.7. The tree multiplier realizes substantial hardware savings for larger multipliers. The propagation delay is reduced as well; in fact, it can be shown that the propagation delay through the tree is O(log_{3/2}(N)). While substantially faster than the carry-save structure for large multiplier word lengths, the Wallace multiplier has the disadvantage of being very irregular, which complicates the task of coming up with an efficient layout.

Figure 4.6: Wallace Multiplier

Figure 4.7: Illustration of Wallace Multiplier

There are numerous other ways to accumulate the partial-product tree. A number of compression circuits have been proposed in the literature [ref]. They are all based on the observation that when full adders are used as 3:2 compressors, the number of partial products is reduced to two-thirds of its previous count per multiplier stage. Hence, we can go further and devise the 4:2 compressor shown in Fig 4.8, or even higher order compressors (a behavioral sketch of the 4:2 compressor follows at the end of this section).

Figure 4.8: 4:2 Compressor

By using a higher order compressor like the one shown in Fig 4.8, the delay of the carry save tree (the partial product accumulation stage) is reduced by one XOR gate delay for every two stages, as depicted by the highlighted path in Fig 4.8. The higher level picture is shown in Fig 4.9, which is similar to Fig 4.7 but uses 4:2 compressors besides a Half Adder and a Full Adder.

Figure 4.9: Illustration of Wallace Multiplier with 4:2 compressors

4.4 Final Addition

The final step in completing the multiplication is to combine the results in the final adder. The performance of this "vector-merging" operation is of key importance. The choice of adder style depends on the structure of the accumulation array. A carry look-ahead adder is the preferable option if all input bits to the adder arrive at the same time, as it yields the smallest possible delay; this is the case if a pipeline stage is placed right before the final addition, and pipelining is a technique frequently used in high-performance multipliers. In nonpipelined multipliers, the arrival time profile of the inputs to the final adder is quite uneven due to the varying logic depths of the multiplier tree. Hence, in this thesis we use a Ripple Carry Adder, with the skewed delay distribution shown in Fig 4.10, together with an efficient error detection and correction design.
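As promised above, the 4:2 compressor of Fig 4.8 can be sketched behaviorally as two cascaded full adders. This is a sketch of the textbook arrangement, with names of our choosing; a dedicated XOR/multiplexer implementation such as that of Fig 4.8 saves a further XOR delay over it.

// Behavioral sketch of a 4:2 compressor as two cascaded full adders:
// in1 + in2 + in3 + in4 + cin = sum + 2*(carry + cout)
module compressor_4_2 (input in1, in2, in3, in4, cin,
                       output sum, carry, cout);
  wire s1;
  assign s1    = in1 ^ in2 ^ in3;                         // first FA: sum
  assign cout  = (in1 & in2) | (in2 & in3) | (in1 & in3); // first FA: carry to the next column
  assign sum   = s1 ^ in4 ^ cin;                          // second FA: sum
  assign carry = (s1 & in4) | (s1 & cin) | (in4 & cin);   // second FA: carry
endmodule

Note that cout is independent of cin, so the cin-to-cout path does not ripple along a row of compressors.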
It should be intuitive that the skewed delay distribution of the RCA fits perfectly with Better Than Worst Case (BTWC) design, making the multiplier's performance comparable to that of the fastest multipliers built with a carry select adder or other hybrid adders [ref].

Figure 4.10: Delay Distribution of a Ripple Carry Adder

From the above discussion we can conclude that, even though the partial product reduction stage has a variable delay, that delay is bounded by the number of stages in the carry save tree; hence the delay of the carry save tree can be treated as a constant delay and the final adder delay as the variable delay. So, placing a ripple carry adder with a variable delay in a Wallace multiplier results in a delay distribution which is the RCA's distribution displaced by the carry save tree's fixed delay. The unit delay simulation results for a Wallace multiplier, obtained to extract this delay distribution, are illustrated in Fig 4.11 along with the RCA's distribution.

Figure 4.11: Delay Distribution of 32-bit Wallace Multiplier with final RCA distribution for 100000 vectors

Chapter 5
Proposed BTWC Design

In Chapter 2 a brief overview of Better Than Worst Case designs was given, and in Chapters 3 and 4 several adder and multiplier designs were discussed. In this chapter we describe the key features of how the design is implemented to leverage the unique characteristics of a Ripple Carry Adder (shown in the delay distribution of Fig 4.11), along with the primary requirements of a Better Than Worst Case design, i.e., efficient error detection and the provision of additional clock cycles on an occasional error.

5.1 The Internal Signal Hold until Completion Confirmed Scheme

The schematic in Fig 5.1 illustrates the pipelined operation of the proposed Wallace multiplier design at a high level, where new inputs arrive and the current cycle results are captured and made available on each clock edge. The multiplier itself is broken up into two parts: the front-end logic that generates the partial products along with the carry save tree (these mostly display a fixed delay relatively independent of the inputs), and the carry ripple adder with its highly variable delay. For purposes of this initial discussion we assume that there exists a register of latches between these two parts that can hold the output signals of the carry save tree (the inputs to the ripple carry adder) for a desired duration. (In practice, as we shall see later, this capability can be more cost-effectively implemented using tri-state gates that create dynamic latches.) This (active low) hold signal is activated on each active clock edge, but released (making the latches transparent again) shortly thereafter if no error is detected. Observe that if the hold activated at the start of a clock cycle is released within a period less than the propagation delay of the first (carry save tree) part of the logic, i.e. before new logic values due to the updated inputs propagate to the latch register, the activation of the hold in each cycle does not in any way restrict the propagation of logic signals. Therefore it has no impact on the overall multiplication delay, beyond the extra logic and loading delays introduced by the insertion of the latches.
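As a rough illustration of the hold register just described, the following sketch models a bank of latches that is transparent while the (active low) hold is released and opaque while it is asserted. The width and names here are assumptions; in the actual design this function is realized with tri-stated driving gates forming dynamic latches.

// Sketch: hold register between the carry save tree and the ripple carry adder.
// hold_n = 0 holds the previous cycle's values; hold_n = 1 makes the latches transparent.
module hold_latch #(parameter W = 118) (   // e.g. two 59-bit vectors into the RCA (assumption)
  input  wire         hold_n,
  input  wire [W-1:0] d,                   // carry save tree outputs
  output reg  [W-1:0] q                    // ripple carry adder inputs
);
  always @* if (hold_n) q = d;             // level-sensitive latch behavior
endmodule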
Figure 5.1: Pipelined Schematic for Wallace Multiplier Operation

As was observed in the previous chapter, because of the highly skewed completion delay distribution of the carry ripple adder, even with the system running at a clock 2-3X faster than the worst case delay, errors in the multiplication result due to timing violations on long paths can be expected to be extremely rare. Assume for the moment that a capability for the fast detection of such errors exists, i.e. it is possible to know, quickly after the clock edge that captures a result in the results register, whether or not that result is correct. This information can be used to control the hold/release signals for the latches, as shown in the timing waveforms in Figure 5.2. In case an error is detected, the hold on the latches is not released for the remainder of the cycle, and into the next cycle, until an error free result in the next cycle is confirmed. Observe that since the hold is activated on the same clock edge that updates the input registers, changes from the updated inputs are unable to immediately propagate through the carry save tree to influence the values in the latches. Thus in the hold mode the latches always hold signal values from the previous clock cycle. If not released in the event of an error, the hold latches continue to hold the values from the previous cycle at the inputs of the ripple carry adder, giving that computation an additional cycle to complete. The correct result is then available one cycle after the incorrect result for the same multiplication was captured in the results register. This can even extend to a third (or fourth) cycle for worst-case carry propagation, until an error free result is confirmed. Meanwhile, the error flag can also be used as a signal to the supervisory system for appropriate handling of the input and output values in each clock cycle. Erroneous cycles must be marked as stalls in pipelined operation, with the correct result following in the next cycle. Additionally, the flow of new inputs to the multiplier must be halted corresponding to each stall.

Figure 5.2: Timing Model Waveforms for Hold Until Completion

5.2 Timing Error Detection

In the discussion in the previous section we have assumed that a timing error, caused by a longer carry propagation path in the ripple carry adder than was allowed for in setting the clock period, can be quickly detected following the capture clock edge. The key to this capability lies in an important characteristic associated with the propagating carries. Observe from Fig 5.2 that the latches go into the hold mode and retain the signal values at the input of the ripple carry adder on the clock edge that captures a new multiplication result. This allows the ripple carry adder to continue working on the previous (just captured) result until the hold is released. In most cases, the outputs of the ripple carry adder stabilize before the clock edge, and the correct multiplication result is captured in the results register. Here the outputs of the ripple carry adder (the inputs to the results register) do not change even as the adder is given additional time to operate into the next clock cycle. However, if the outputs of the ripple carry adder have not stabilized at clock time because of a long carry propagation path, then they will continue to change after the clock edge.
This condition can be detected by comparing each of the adder output bits with the corresponding bit captured in the results register at the clock edge; a mismatch indicates a timing error. A timing error detection circuit can therefore be constructed as shown in Figure 5.3.

Figure 5.3: Timing Error Detection Circuitry

Recall that a propagating carry in the carry ripple adder causes a change in the sum output of every full adder stage it propagates through. This is because the sum is formed using EXORs on three inputs, two of which are stable as the carry propagates. Thus if final stable outputs are not captured on a clock edge, at least one ripple carry adder output can be expected to change within a full adder delay of the clock edge. (It is impossible for the adder outputs to show no change over a full adder delay window and yet change with a larger delay.) This allows the error/computation complete indication in Figure 5.3 to be evaluated within a few gate delays of the clock edge, achieving the important requirement of quick completion confirmation that allows the hold on the latches to be released without any performance penalty.

5.3 Fast Comparator to monitor output completion

As mentioned in the previous section, the key design feature in Fig 5.3 which helps in detecting output completion is the ability to detect any carry which is still propagating in the RCA one cycle after the inputs are fed into the multiplier design. The fast detection is primarily due to the full adder logic, which reflects any change in its input carry on its sum output. Hence, by comparing the sum bit of each full adder against the data captured at the end of the previous clock cycle, we can flag an error indicating that a carry is still alive in the RCA and that we need to continue holding the input data to the RCA beyond the previous clock cycle. One major overhead is that as the number of full adders increases, the number of comparators increases and the delay to generate the error signal increases, impeding the current clock cycle's computational data at the input of the RCA. In the proposed design the final vector merger (RCA) has 59 full adders. Besides the XOR gates which flag the per-bit mismatches, we need all the signals generated by the XOR gates to be OR'ed together, since a change in any one sum output is a sufficient condition to flag an error. Since the standard cell library does not support one monolithic OR gate, the synthesis tool places several stages of OR logic (5 stages of 3-input OR gates in the case of our design), delaying the error signal generation by the number of 3-input OR gate levels deduced by the synthesis tool. The following is the Verilog code implemented as one monolithic OR gate.

// Verilog code for the comparator OR gate used to flag an error
module monolithic_OR (a, b);
  input  [58:0] a;
  output        b;
  assign #6 b = |a;  // reduction OR over all 59 mismatch bits
endmodule

Figure 5.4: Implementation of the above Verilog code by the Mentor Graphics synthesis tool

Fig 5.4 shows the implementation of the monolithic OR gate coded in Verilog. As can be seen from Figure 5.4, if each 3-input OR gate is considered to have a delay of 2 units, the total delay of the entire comparator logic is 10 units, implying that the current clock cycle's computation is held by the hold signal until completion detection is done. Even though the overhead is not significant, on average there would be some loss in performance.
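For completeness, the whole detection path described above - the bitwise EXOR comparison against the captured result, followed by the OR reduction - can be sketched as below. The module and signal names are ours, and a synthesized netlist would realize the reduction as the multi-level OR tree whose delay motivates the faster comparators discussed next.

// Sketch of the timing error detection of Fig 5.3: a mismatch between the live
// RCA outputs and the result captured at the clock edge flags a still-propagating carry.
module error_detect #(parameter W = 59) (
  input  wire [W-1:0] adder_out,  // ripple carry adder sum outputs (possibly still settling)
  input  wire [W-1:0] captured,   // results register contents from the clock edge
  output wire         error       // 1 => keep the hold asserted into the next cycle
);
  wire [W-1:0] mismatch = adder_out ^ captured;  // per-bit EXOR comparison
  assign error = |mismatch;                      // OR reduction over all bits
endmodule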
To address this issue we describe some fast comparators from the literature.

5.3.1 Equality Comparator

Most high performance processors use dynamic logic comparators, which are fast compared to static CMOS designs but are sensitive to noise; hence proper care must be taken while designing dynamic logic to maintain adequate noise margins. One such design, a CMOS 4-bit equality comparator [15], is shown in Fig 5.5. In Fig 5.5, when the CLK is low, Node 1 is precharged to VDD. If A<0> and B<0> are both high, then N1 and N2 are on and P1 and P2 are off. Thus, no current path exists during the evaluation period, and Node 1 is kept high. If A<0> is high and B<0> is low, then N1 and P2 are on. Thus, a current path is formed between Node 1 and ground through P2 and N1 during the evaluation period, and Node 1 is pulled down. The operation for A<1> and B<1>, A<2> and B<2>, and A<3> and B<3> is the same. In short, when any pair A<i> and B<i> is not equal, a current path will be formed and Node 1 will go low. By contrast, if A<i> is equal to B<i> for all i, Node 1 will stay high.

Figure 5.5: 4-bit Equality comparator

5.3.2 A Pass Logic Single Stage Comparator

The second dynamic comparator design [16], shown in Figure 5.6, avoids the use of domino-style logic altogether.

Figure 5.6: Pass logic single stage comparator

The pass transistor logic shown within the greyed box in Figure 5.6 passes a high logic level to the gate of the n-transistor Q1 when bits A7 and B7, as well as bits A6 and B6, of the comparands match. The series pulldown structure consisting of the devices Q1, Q2, Q3 and Q4 thus conducts when all 8 bits of the comparands are equal. The output of this comparator, precharged to Vdd by Q0, is thus discharged when all bits of the comparands are equal and the evaluate device, Q5, is on. The n-transistors Q6, Q7, Q8 and Q9 discharge any accumulated charge when partial matches occur. Hence, the OR tree structure shown in Fig 5.4 for our design can be replaced with the above dynamic comparator, which is both robust and cheaper in hardware; but, as mentioned, proper care must be taken to avoid noise due to coupling capacitance.

Chapter 6
Implementation of 32x32 Wallace Multiplier with the proposed BTWC Design

A 32x32 Wallace multiplier with statistical timing was implemented as described in the earlier sections, with a carry save tree built from 4:2 compressors and a ripple carry adder for the final stage. A schematic of the design is shown in Figure 6.1.

Figure 6.1: Illustration of Wallace Multiplier with the Proposed Design

The partial products are generated by performing the AND operation on the appropriate bits from the multiplicand and the multiplier, resulting in 32 partial products, each 32 bits long, arranged as illustrated in Figure 4.9 for 8x8 multiplication. The carry save tree then compresses the partial products in 8 stages using 4:2 compressors to two numbers that are added by the ripple carry adder. As can be observed from the example in Figure 4.9, some of the least significant bits of the multiplication result are already available from the carry save compression and need not be added further. For the 32x32 multiplier, 5 bits are pre-generated in this manner, requiring a 59 bit ripple carry adder to add the remaining bits.
(Note that some of the 5 pre-generated least significant bits have very short logic paths through the carry save tree, and may need to be delayed through buffers to avoid a race condition at the hold latches.) The hold block shown represents the latch and hold mechanism that is initiated on each clock edge to continue holding the inputs of the ripple carry adder until the error detection logic confirms that the correct multiplication result was captured at that clock edge. To minimize performance and hardware overhead, this function is actually optimized in our design as a dynamic signal hold, by tri-stating the driving output gates of the carry save tree.

As described in the previous section, detecting whether or not the result captured on a clock edge is in error is achieved by comparing the captured result with the output of the ripple carry adder driving the results register. Since the inputs of the carry ripple adder are held stable past the active clock edge in every clock cycle (until the hold is released), the adder can, if needed, continue working on the computation for which an initial result was captured on the clock edge. A subsequent mismatch between the ripple adder output and the captured result after the clock edge is an indication that the addition did not complete within the clock period. In our design an error signal is generated by latching this mismatch signal (from the EXOR-OR comparator circuit) using a clock appropriately delayed from the active clock edge. Since a propagating carry in the carry ripple adder causes a change in the sum output of every full adder stage it propagates through, this delayed clock need only incorporate the carry-to-sum output delay of a full adder, plus the comparator delay. In our design this is about ten gate delays, allowing for the high fan-in of the OR gate. Such fast error evaluation/completion detection allows the tri-state/hold condition at the outputs of the carry save tree to be removed well before the new set of partial products for the next computation is ready, thereby avoiding any performance penalty from the hold during an error free cycle.

A 32x32 Wallace multiplier with statistical timing, incorporating the hold until completion detection logic described above, was implemented at the RTL level and synthesized and optimized for timing using Mentor Graphics Spectrum in TSMC 180nm technology. The entire design was then simulated for performance evaluation.

Chapter 7
Simulation Results and Conclusion

7.1 Simulation Results using TSMC 180nm technology

The timing simulation results shown in Table 7.1 were obtained using SPICE and TSMC 180nm technology files. The pessimistic worst-case delay of the multiplier without timing errors was found to be 9.5ns by the Mentor Graphics STA tool. Actual SPICE timing evaluation using known worst case inputs [8] yielded a delay of 9ns. This extremely rare worst case delay was not observed in the SPICE simulation of 10,000 random inputs. Table 7.1 shows the number of timing errors observed for the 10,000 random inputs when the allowed clock period for multiplication completion is reduced from the worst case 9ns to a range between 5ns and 3.5ns. It can be observed that as the clock period decreases, the number of errors with a single clock period penalty increases very quickly. This is consistent with the delay distribution in Figure 4.11. For yet shorter clock periods, errors that require two additional clock cycles also begin to grow rapidly.
Only 4 out of the 10,000 inputs triggered a timing error with the clock period set at 5ns, and understandably no case required more than two clock periods (10ns). 26 errors were triggered with the clock period set at 4ns, with one case requiring more than two cycles (greater than 8ns) to complete. Column 4 in the table shows the effective average time required per multiplication at the corresponding single cycle clock period; for a clock period T with fractions p1 and p2 of the inputs incurring one and two penalty cycles, this is T·(1 + p1 + 2·p2), e.g. 3.75 × (1 + 0.0257 + 2 × 0.0013) ≈ 3.86ns. This can be used to select an optimum clock period, which from the table is 3.75ns for our design; at this clock period the average multiplication time is 3.86ns. Finally, the last column shows the performance improvement over using a worst case clock of 9ns for all multiplications, obtained by dividing 9ns by the average multiplication time in column 4. At the optimum clock period of 3.75ns, the performance improvement is 2.36X.

Table 7.1: SPICE Simulation Results with Reduced Clock Periods

Reduced Clock   Error Count for 10,000 random inputs    Average Clock   Performance Improvement
period (ns)     one cycle penalty   two cycle penalty   Period (ns)     (actual clk / avg clk)
5               4                   0                   5.002           1.79x
4.5             10                  0                   4.5045          1.99x
4               26                  1                   4.015           2.24x
3.85            113                 3                   3.91            2.3x
3.75            257                 13                  3.856           2.36x
3.5             1876                47                  4.2             2.14x

Direct comparisons of the performance of this new design with other designs in the literature are difficult to make, even at the same feature sizes, because of differences in the cell libraries and technology files, particularly the threshold and supply voltages used in the reported simulations. Nevertheless, the fastest multiplier described in the literature in comparable 180nm technology reported a simulated multiplication time of 2.85ns. That design employed a Kogge-Stone fast adder, which is extremely expensive in hardware, along with other enhancements. For the 60-bit final addition required by 32x32 bit multipliers, the chip area and gate count overhead of such fast adders can be 2-4X compared to a ripple carry adder, and this hardware dominates the chip area. Our design as presented incurs a 35-40% overhead in gate count, mostly in the comparator used for error detection. Fortunately comparators, being widely used in many applications, have been extensively improved and optimized; it should be possible to use fast comparator designs such as those in [15, 16] to significantly reduce the overhead of the proposed approach, to about 20%.

7.2 Simulation Results using TSMC 28nm technology

From the simulation results in Table 7.1 it is observed that a multiplier design with a carry save tree and a ripple carry adder as the final adder, deployed with the proposed better-than-worst-case implementation with Hold Until Completion, can on average achieve a performance gain of 2.36X. Assuming a similar speedup for the better-than-worst-case (area optimized) design in 28nm technology, our new multiplier design can achieve performance almost comparable to that of the design optimized for performance, as shown in Table 7.2. It can be observed that the multiplier design optimized for performance dissipates 2.5X the power and requires almost 2.5X the area of our area optimized better-than-worst-case design.
Table 7.2: Relative numbers for 28nm Multiplier Design (Performance and Area Optimized)

Multiplier Design                                        Timing (ps)   Area (units)   Power (Watts)
Optimized for Performance                                1             1              1
Optimized for Area (Traditional Worst Case Design)       3.07          0.41           0.4
Optimized for Area (Better Than Worst Case Design)       1.3           0.41           0.4

7.3 Conclusion

In this thesis a novel design approach was proposed for exploiting the highly skewed statistical variation in completion delays observed in low cost Wallace multipliers implemented with ripple carry adders in the final stage. In the vast majority of cases, such multipliers complete the computation in well under half the worst case delay, making it possible to operate them with a 50-60% shorter clock period with fewer than one timing error every thousand multiplications. To support such operation, we have developed a novel internal signal hold until completion detection recovery scheme that automatically allows such an "overclocked" multiplier to steal one or more additional clock periods as needed to transparently complete the infrequent computation that needs additional time. Of course the overall system utilizing such a multiplier must be designed to support and work with this occasional recovery process. Fortunately, architectural level handling of exceptions through pipeline stalls and out of order instruction issue is now well understood and commonplace in processors handling other forms of speculative execution; our design can be considered just another speculative implementation. Achieving comparable deterministic multiplication speed requires very high speed, long word length adders in the design, which can require 3-4X more gates/area and consume significantly more power. By reducing the clock period to the point of encountering an acceptable number of timing errors, our design, like the Razor [3] approach, eliminates the wasted static (leakage) power during the frequent quiescent intervals observed in traditional designs that reliably allow for rare worst case signal propagation. This greatly reduces the average power consumed per multiplication, which is critical in battery powered mobile applications.

Bibliography

[1] Gadamsetti, B.; Singh, A.D., "Current Sensing Completion Detection for high speed and area efficient arithmetic," Circuits and Systems (APCCAS), 2010 IEEE Asia Pacific Conference on, pp. 240-243, 6-9 Dec. 2010.

[2] Borkar, S.; Karnik, T.; Vivek De, "Design and reliability challenges in nanometer technologies," Design Automation Conference, 2004. Proceedings. 41st, p. 75, 7-11 July 2004.

[3] Ernst, D.; Nam Sung Kim; Das, S.; Pant, S.; Rao, R.; Toan Pham; Ziesler, C.; Blaauw, D.; Austin, T.; Flautner, K.; Mudge, T., "Razor: a low-power pipeline based on circuit-level timing speculation," Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pp. 7-18, 3-5 Dec. 2003.

[4] Blaauw, D.; Kalaiselvan, S.; Lai, K.; Wei-Hsiang Ma; Pant, S.; Tokunaga, C.; Das, S.; Bull, D., "Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance," Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pp. 400-622, 3-7 Feb. 2008.
[5] Fojtik, M.; Fick, D.; Yejoong Kim; Pinckney, N.; Harris, D.; Blaauw, D.; Sylvester, D., "Bubble Razor: An architecture-independent approach to timing-error detection and correction," Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pp. 488-490, 19-23 Feb. 2012.

[6] Ghosh, S.; Bhunia, S.; Roy, K., "CRISTA: A New Paradigm for Low-Power, Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 26, no. 11, pp. 1947-1956, Nov. 2007.

[7] Choudhury, M.; Chandra, V.; Mohanram, K.; Aitken, R., "TIMBER: Time borrowing and error relaying for online timing error resilience," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010, pp. 1554-1559, 8-12 March 2010.

[8] Eriksson, H.; Larsson-Edefors, P.; Eckerbert, D., "Toward architecture-based test-vector generation for timing verification of fast parallel multipliers," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 4, pp. 370-379, April 2006.

[9] J. M. Rabaey, Digital Integrated Circuits. Englewood Cliffs, NJ: Prentice-Hall, 1996.

[10] Wallace, C. S., "A Suggestion for a Fast Multiplier," Electronic Computers, IEEE Transactions on, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.

[11] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Freq., vol. 34, pp. 349-356, 1965.

[12] Vojin G. Oklobdzija, High-Speed VLSI Arithmetic Units: Adders and Multipliers.

[13] Oklobdzija, V.G.; Villeger, D.; Liu, S.S., "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach," Computers, IEEE Transactions on, vol. 45, no. 3, pp. 294-306, Mar 1996.

[14] Vratonjic, M.; Zeydel, B.R.; Oklobdzija, V.G., "Low- and ultra low-power arithmetic units: design and comparison," Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on, pp. 249-252, 2-5 Oct. 2005.

[15] Ergin, O.; Ghose, K.; Kucuk, G.; Ponomarev, D., "A circuit-level implementation of fast, energy-efficient CMOS comparators for high-performance microprocessors," Computer Design: VLSI in Computers and Processors, 2002. Proceedings. 2002 IEEE International Conference on, pp. 118-121, 2002.

[16] Chua-Chin Wang; Hsin-Long Wu; Chih-Feng Wu, "A fast dynamic 64-bit comparator with small transistor count," Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on, vol. 5, pp. 545-548, 2000.

[17] Saha, K. S., "Modeling Process Variability in Scaled CMOS Technology," IEEE Design and Test of Computers, vol. 27, no. 2, pp. 8-16, March/April 2010.

[18] Victoria Wang; Kanak Agarwal; Sani R. Nassif; Kevin J. Nowka; Dejan Markovic, "A Design Model for Random Process Variability," Proceedings 2008 ISQED, pp. 734-737.

[19] J. Patel, "CMOS process variations: A critical operation point hypothesis," 2008.

[20] R. Kumar, "Stochastic processors," in NSF Workshop on Science of Power Management, 2009.

[21] S. Narayanan; G. Lyle; R. Kumar; D. Jones, "Testing the critical operating point (COP) hypothesis using FPGA emulation of timing errors in over-scaled soft-processors," in SELSE 5 Workshop - Silicon Errors in Logic - System Effects, 2009.

[22] J. Sartori; R. Kumar, "Characterizing the voltage scaling limitations of Razor-based designs."

[23] T. D. Burd; T. A. Pering; A. J. Stratakos; R. W. Brodersen,
"A dynamic voltage scaled microprocessor system," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1571-1580, 2000.

[24] V. Gutnik; A. Chandrakasan, "An efficient controller for variable supply-voltage low power processing," pp. 158-159, Jun 1996.

[25] S. Dhar; D. Maksimovic; B. Kranzen, "Closed-loop adaptive voltage scaling controller for standard-cell ASICs," in ISLPED '02: Proceedings of the 2002 International Symposium on Low Power Electronics and Design, pp. 103-107, New York, NY, USA, 2002. ACM.

[26] T. Kehl, "Hardware self-tuning and circuit performance monitoring," pp. 188-192, Oct 1993.

[27] Intel Corporation, "Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor," 2004.