Current Sensing Completion Detection for High Speed and Area Efficient Arithmetic by Balapradeep Gadamsetti A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science. Auburn, Alabama May 09, 2011 Keywords: current sensing completion detection, adders, multipliers Copyright 2011 by Balapradeep Gadamsetti Approved by Adit D. Singh, Chair, James B. Davis Professor of Electrical and Computer Engineering Vishwani D. Agrawal, James J. Danaher Professor of Electrical and Computer Engineering Victor P. Nelson, Professor of Electrical and Computer Engineering ii Abstract Adders and multipliers are the most widely used computational units in integrated circuits. In many compact low power and high speed designs, more complex arithmetic operations such as multiplication are performed through repeated additions. In such applications, providing carry completion signaling in low cost ripple carry adders can allow the control logic to schedule the next addition as soon as an earlier one is complete, thereby achieving the average case, rather than worst case addition delay over a set of computations. Earlier attempts at using current sensing for such carry completion signaling suffered from serious limitations. This thesis presents a new approach for the design of a ripple carry adder with a current sensing capability which observes late settling carry signal nodes in the circuit and indicates when they reach a quiescent state. By this functional circuitry we avoid key problems of more hardware and power. Simulations for new design of ripple carry adder with current sensing completion detection technique show better than 50% speedup, on average, with less than 10% area overhead. The performance improvement with respect to carry look ahead adder which takes 40% more area than our ripple carry adder design is found to be 20%. This could give significant area and power savings in smaller circuit designs. To demonstrate a potential application of ripple carry adder with completion detection capability, we incorporate our carry ripple adder into a Booth multiplier design and study the performance gain over a traditional ripple carry adder based design. Simulation results show that a 32-bit Booth Multiplier using the new completion signaling circuits can outperform a 32-bit Booth Multiplier with ripple carry adder (RCA) by 20-50%, while requiring less than 2% iii additional silicon area. This is comparable to the gains from the best carry look ahead adder designs at a fraction of the area overhead costs. The new approach eliminates the limitations suffered from the previous current sensing techniques. The overall performance of is improved with a minimal area overhead. iv Acknowledgements At the outset I would like to thank my major advisor Dr. Adit Singh for his immense support and guidance in completing this thesis. He has been a constant source of inspiration and encouragement. I am highly grateful to him for his suggestions and cooperation right from the start of this thesis work. His guidance helped me in all the time of research and my studies at Auburn University. I express my gratitude to my thesis committee members, Dr. Vishwani Agrawal and Dr. Victor Nelson, for being part of my graduate committee and co-advising my thesis. Their lectures as part of my graduate coursework helped me in learning the basic knowledge required for my thesis work. I want to express my sincere thanks to my senior Nitin Yogi and my friend Rakshit Venkatesh for helping me academically and giving me valuable suggestions throughout my graduate studies at Auburn University. Finally I owe my deepest gratitude to my parents for their moral support during all the times. They stood by me unconditionally motivating me in every aspect of my life and education. v Table of Contents Abstract ........................................................................................................................................... ii Acknowledgements ........................................................................................................................ iv List of Tables ................................................................................................................................ vii List of Abbreviations ...................................................................................................................... x 1 Introduction .................................................................................................................................. 1 1.1 Background ........................................................................................................................... 2 1.2 Motivation ............................................................................................................................. 3 1.2.1 Background on Adder Circuits ....................................................................................... 5 1.2.2 Ripple Carry Adder and Carry Look Ahead Adder Discussion ................................... 10 1.3 Problem Statement .............................................................................................................. 12 1.4 Contribution of This Thesis................................................................................................. 12 1.5 Organization of Thesis ........................................................................................................ 13 2 Prior Work on CSCD Design ..................................................................................................... 14 2.1 Completion Detection Methods ..................................................................................... 14 2.2 Classical CSCD Design .................................................................................................. 15 2.3 Current Sensor Design Approaches ............................................................................... 18 2.4 Applications Using Current Sensor Completion Detection ........................................... 19 3 Proposed CSCD Design ............................................................................................................. 22 3.1 CMOS Inverter ............................................................................................................... 22 3.2 Sense Inverter Design..................................................................................................... 24 3.3 Current Sensor Design ................................................................................................... 26 3.3.1 Current Sensor for asynchronous designs .................................................................... 26 3.3.2 Current Sensor for synchronous designs ...................................................................... 28 4 Ripple Carry Adder and Carry Look Ahead Adder Analysis .................................................... 33 4.1 Ripple Carry Adder ........................................................................................................ 33 4.2 Full Adder ...................................................................................................................... 34 4.3 Performance Analysis of RCA ....................................................................................... 35 vi 4.3.1 Best Case Performance ................................................................................................. 35 4.3.2 Worst Case Performance .............................................................................................. 36 4.4 Carry Look Ahead Adder ............................................................................................... 37 4.5 Performance Analysis for 4-bit CLA ............................................................................. 38 5 Proposed Ripple Carry Adder Design with Completion Signaling ........................................... 40 5.1 Background .................................................................................................................... 40 5.2 RCA Design ................................................................................................................... 41 5.3 RCA Operation ............................................................................................................... 42 5.4 Calculations .................................................................................................................... 43 5.5 Optimizations ................................................................................................................. 44 5.6 Simulations ..................................................................................................................... 45 5.6.1 Ripple Carry Adder with Sensor Circuitry ................................................................... 45 5.6.2 RCA Vs CLA................................................................................................................ 46 5.7 Power Analysis ............................................................................................................... 48 5.7.1 Average Power Calculations ........................................................................................ 48 5.7.2 Energy Calculations ...................................................................................................... 49 5.8 Results Analysis ............................................................................................................. 49 6 Booth Multiplier......................................................................................................................... 51 6.1 Background .................................................................................................................... 51 6.2 Modified Booth Multiplier ............................................................................................. 53 6.3 Calculations .................................................................................................................... 54 6.4 Simulations ..................................................................................................................... 55 6.4.1 BM with RCA vs BM with RCA and Current Sensor .................................................. 55 6.4.2 BM with CLA vs BM with RCA and Current Sensor .................................................. 56 6.5 Results Analysis ............................................................................................................. 56 7 Conclusion ................................................................................................................................. 57 References ..................................................................................................................................... 59 vii List of Tables Table 1: Simulation results with area and delay optimization for RCA with and without sensor circuitry. ...................................................................................................................................45 Table 2: Simulation results with area optimization for RCA with and without sensor circuitry. .................................................................................................................................................46 Table 3: Simulation results with delay optimization for RCA with and without sensor circuitry. .................................................................................................................................................46 Table 4: Simulation results with Area + Delay optimization for RCA and CLA. .....................47 Table 5: Area Overhead calculations with Area + Delay optimization. ....................................47 Table 6:Simulation results with Area optimization for RCA and CLA. ....................................47 Table 7: Area Overhead calculations with Area + Delay optimization. ....................................47 Table 8: Simulation results with Delay optimization for RCA and CLA. .................................48 Table 9: Area Overhead calculations with Delay optimization. ................................................48 Table 10: Average Power Calculations .....................................................................................48 Table 11: Energy per addition calculations ...............................................................................49 Table 12: Booth Encoding ........................................................................................................52 Table 13: Simulation results with BM with RCA with and without sensor circuitry. ...............55 Table 14: Simulation results with BM with RCA with sensor circuitry and CLA. ...................56 viii List of Figures Figure 1: Ripple Carry Adder ......................................................................................................... 6 Figure 2: Carry Look Ahead Adder ................................................................................................ 7 Figure 3: Carry Skip Adder............................................................................................................. 8 Figure 4: Carry Select Adder .......................................................................................................... 9 Figure 5: Current-Sensing Completion Detection (CSCD) circuitry for a CMOS Block ............ 17 Figure 6: Basic Current Sensors ................................................................................................... 18 Figure 7: Basic CSCD configuration ............................................................................................ 20 Figure 10 : A Simple CMOS Inverter ........................................................................................... 23 Figure 11: V-I Characteristics of CMOS inverter......................................................................... 23 Figure 12: Sense Inverter Implementation ................................................................................... 24 Figure 13: Current Sensor for Asynchronous Designs ................................................................ 27 Figure 14: Waveforms for Asynchronous Sensor ......................................................................... 27 Figure 15: Current Sensor for Synchronous Designs ................................................................... 29 Figure 16: Control Logic Block .................................................................................................... 30 Figure 17: Control Signals ............................................................................................................ 31 Figure 18: Waveforms for Synchronous sensor............................................................................ 32 Figure 19: 4-Bit Full Adder Circuit ............................................................................................. 33 Figure 21: Gate Level Representation of Full Adder Circuit ....................................................... 34 Figure 20 : Full Adder Block ........................................................................................................ 34 Figure 22: Gate level representation of 4-bit ripple carry adder ................................................... 35 Figure 23: Partial Full Adder ....................................................................................................... 37 Figure 24: Carry Look Ahead Adder ........................................................................................... 39 Figure 25: Carry Propagation graph for two 32 bit additions for 100000 random vectors. ........ 41 Figure 26: Ripple Carry Adder with sensor circuitry. ................................................................. 42 Figure 27 : Clock Division ............................................................................................................ 43 Figure 28: Optimization w.r.t Area ............................................................................................... 44 Figure 29: Optimization w.r.t Delay ............................................................................................. 45 ix Figure 30: Booth Multiplier with Ripple Carry Adder and sensor circuitry................................ 54 x List of Abbreviations CSCD Current Sensing Completion Detection RCA Ripple Carry Adder CLA Carry Look Ahead Adder BM Booth Multiplier ALU Arithmetic Logic Unit PFA Partial Full Adder CDG Clock Delay Generator PDA Personal Digital Assistants DSP Digital Signal Processors IC Integrated Circuit 1 Chapter 1 Introduction In recent years there has been a rise in use of hand held devices such as Media Players, Smart Phones, Digital Cameras and Personal Digital Assistants (PDAs). The features on these mobile devices will require physically smaller designs with efficient area and power management. The future of digital design holds ever larger and more complex digital circuits than have been achieved so far. Many functions on these devices are expected to be specific to a target application. A generic Integrated Circuit (IC) with versatile functionality may turn out to be too expensive for a customized application which can do with fewer functions. In practice, many available functions may not be required in the application; these functions consume a significant portion of the component chip area and cost. In such situations the generic IC can be replaced with a smaller circuit with fewer functions matched to the requirements of the customized application. Mobile systems incorporate Digital Signal Processors (DSPs). A DSP is a specialized processor with an optimized architecture for the fast computational needs of digital signal processing. Some of the most common functions performed in the signal processing are signal filtering, convolution and Fourier transforms. In mathematical terms, these functions perform a series of dot products, multiply and accumulate (MAC), additions and multiplications. General- purpose microprocessors can also execute DSP algorithms successfully, but are generally not suitable for use in mobile devices such as mobile phones and PDAs because of power and space 2 constraints. A customized digital signal processor, however, can provide a lower-cost solution, with better performance and lower power consumption. 1.1 Background As performance and battery life requirements are becoming more crucial in small and lightweight systems like laptops and handheld devices, power consumption is becoming a critical problem. For example, processors and microcontrollers for these small portable systems should provide higher system performance while keeping power consumption within specified limits. The performance of these chips is mainly dependent on the Arithmetic Logic Units (ALUs) used in them to perform complex mathematical operations such as addition, subtraction, division and multiplication. To save silicon area and power, some chips do not include dedicated arithmetic circuits for multiplication and other complex computations which are considered to be very area intensive and power hungry; multiplication is performed through repeated addition. In such scenarios, there is a critical need for fast and power efficient adder circuits which are almost constantly active during computation, and can have a major impact on overall system performance. Current trends in the CMOS VLSI design methodologies show steady scaling of feature sizes to meet the speed and performance requirements for complex applications in the areas of communication, aerospace, defense and consumer electronics. Extensive scaling of CMOS process has given rise to many problems such as leakage current, short circuit current, large process variations, etc. It is proving to be difficult to achieve higher performance with transistor channel length scaled to few nanometers. These problems have to be addressed while designing circuits for low power and high speed applications. Also, with CMOS technology approaching 3 its limit, new approaches for processes and design architectures are needed. Apart from scaling, there has been much focus on very low power consumption and area efficient design methodologies in developing consumer applications. 1.2 Motivation Digital circuits are classified into synchronous and asynchronous circuits. Synchronous circuits (and systems) include clocked elements like flip flops, latches and registers. The input of these elements should be stable before the next active edge of the clock arrives. Therefore, the behavior of these circuits can be predicted exactly. One major problem with these systems is the distribution of clock signals which must be in phase without significant skew every place in the chip. This makes designing a complex synchronous circuit operating at high frequency difficult due to timing skew and delays. Asynchronous circuits (and systems) are not governed by a clock circuit driving a global clock signal, but instead they need only to wait for signals that indicate completion of operations from a previous stage using a handshaking approach. In the handshaking approach, a functional block needs to supply a "done" signal indicating that the results of the functional block are available and valid. In synchronous digital systems to achieve higher performances the clock should be operated at higher speeds. But clock skew is consuming a significant portion of the clock cycle and is limiting the maximum clock speed of the system. The clock frequency in these circuits also depends on the worst case delay between any two clocked elements. With these problems alternative ways of speeding up the operating frequency are required. One method is to optimize the combinational block between the clocked elements to reduce the worst case delay. Another 4 method is to use techniques for detecting the operation of the combinational block and generate a signal that indicates when the combinational block or its part has completed computation. This type of completion signaling can also be used in self-timed asynchronous designs, where the data between logic blocks are transferred asynchronously utilizing handshake circuits for signaling completion rather than clocks. Some methods of signaling computation completion proposed in the literature are Activity- Monitoring Completion Detection (AMCD) [23, 24], Current-Sensing Completion Detection (CSCD) [2, 22], and Transition-Monitoring Completion Detection (TMCD) [23, 24]. In Chapter 2 of this thesis AMCD, CSCD, and TMCD methods are discussed from the design point of view. Some benefits and drawbacks of activity monitoring methods are presented. The methods reviewed in Chapter 2 suffer from serious limitations in low power and low voltage applications so in Chapter 3 we propose a new method for completion detection which addresses the limitations suffered by earlier methods and also present a faster current sensor design for synchronous designs. As discussed earlier, the performance of the digital circuits depend on the computation capabilities of ALUs which they incorporate. The major blocks in these ALUs are adder and multiplier units; multipliers also incorporate adders. Various designs of adder circuits have been proposed earlier which are efficient in area or delay. These adders are discussed in section 1.2.1, which gives a detailed analysis with respect to area and delay. Our idea here is to select the adder design which will suit the need for smaller circuit designs for mobile applications. Also we will try to incorporate additional capabilities of completion detection techniques to enhance the performance of the overall design. 5 1.2.1 Background on Adder Circuits Adder circuits are the considered important system blocks for arithmetic logic units used in microprocessors and microcontrollers which perform complex computations in applications. To achieve better performance different types of adder circuits have been designed. Their usages in different applications depend on the following three factors: 1. Delay 2. Area 3. Power consumption It is hard to design adder circuits which take all these factors into consideration and deliver higher system performance. This made researchers look for alternative ways to design circuits in an optimized and customized way to achieve performance by speeding up operations in complex applications. It is well known that faster computation capability will result in a higher system performance. Often computation capability is mainly dependent on adder circuits used in the arithmetic logic units of the processor. Some basic adder circuits in use are listed below. Each of these adders has his own advantages and disadvantages. These are: a) Ripple Carry Adder b) Carry Look Ahead Adder c) Carry Skip Adder d) Carry Select Adder This section presents a detailed discussion of each adder circuit with respect to delay and area and also discusses their limitations in practical use. 6 a) Ripple Carry Adder The Ripple Carry Adder, being the simplest one, uses the least hardware circuitry but suffers from the worst case carry propagation delay. Figure 1 shows a 4-bit ripple carry adder with worst case delay path from C0 to C4. A Higher impact on performance is seen for 32-bit or 64-bit adder circuits where the worst case carry propagates from the LSB to MSB block. The delay of the ripple carry adder increases linearly with the number of bits with a worst case delay of O(n) [13]. The RCA has a very compact design area given by O(n) [13]. This worst case delay makes it slow when large bit sizes are used. The advantage for ripple carry adder is that it uses much less hardware circuitry when compared to all other traditional adder circuits in use. b) Carry Look Ahead Adder The Carry Look Ahead Adder has lower delay but requires much more complex circuitry in achieving its performance. The complex circuitry is added to find in advance the generation and propagation of carry bits. The carry at ith bit position is given by Ci = Gi + Gi-2 Pi-1 + Gi-3 Pi-2 Pi-1 + ??. + G 0 P1 P2 ?.. P i-1 + C0 P1 P2 ?.. P i-1 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder A0 A1 A2 A3 B1 B0 B2 B3 S1 S3 S0 C3 C2 2 C1 2 C0 2 S2 Figure 1: Ripple Carry Adder C4 7 The size and complexity for a large adder with the carry signals generated using this equation is not practical. So the carry generation and propagation bits are computed by making groups of carry blocks (usually four bits) as shown in Figure 2. Such a block generates a group carry which give the carry bit to the next block, giving time to the sum units to perform their calculation. This block also calculates generation and propagation information and provides it to next higher level block in the hierarchy. The added circuitry contributes to more power consumption. Also large fan out within each block circuitry contributes to additional delay in the adder circuit. The delay of a carry look ahead is given by O(log(n)) [1] but will require much larger area given by O(nlog(n) [1]. c) Carry Skip Adder The Carry Skip Adder reduces the carry-propagation time by skipping over groups of consecutive adder stages. The carry skip adder is usually comparable in speed to the carry look- 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder A0 A1 A2 A3 B1 B0 B2 B3 S1 S2 S3 S0 C4 P3 G3 C3 P2 G2 C2 P1 G1 C1 P0 G0 4 -Bit Carry Look Ahead PG GG C0 Figure 2: Carry Look Ahead Adder 8 ahead adders, but it requires less chip area and consumes less power compared to carry look ahead adders. To accelerate the carry propagation it adds additional group carry bypass paths to its ripple path and the carries can bypass the ripple path when the group propagate signal is high. Figure 3 shows the basic structure of an N-bits carry-skip adder with k-bits per group. Each block is composed of ripple carry adder blocks of fixed size and carry skip chain. Each block in the carry skip adder generates a propagation bit, Pi, when Pi = 1, the carry in of the ith block is passed to the (i+1)th block without waiting for the input carry to ripple. The group propagate function is given by Pi = (Ak-1 ? Bk-1) ? (Ak-2 ? Bk-2) ???. ? (A 1 ? B1) ? (A0 ? B0) This scheme is faster than a Ripple Carry adder with a small overhead in adding bypass circuitry to decrease the carry propagation delay. However, the delay is linearly dependent on the number of bits N [7]. G0 A0 B0 Sk-1 Bk-1 Ak-1 . . . G1 Bk B2k-1 A2k-1 . . . S0 . . . . . . S2k-1 Sk . . . . . . Gi AN-k-1 BN-k-1 BN-1 AN-1 . . . SN-1 SN-k-1 . . . . . . Ak or or . . . . . . or P0 P1 Pi and and and Cin Figure 3: Carry Skip Adder Cout 9 d) Carry Select Adder The Carry Select Adder consists of two ripple carry adders and one multiplexer. The two adders are used to calculate the addition twice; one addition is computed assuming carry input ?1? and the other as ?0?. This adder implementation does not wait for propagated carry signal; rather it first generates sum and carry-out pairs by using both possibilities of carry-in signal at each bit position. The correct output is then selected upon the arrival of carry-in signal. The typical delay of this design is given by O(?n) [7] but at the cost of much larger hardware circuitry which is twice that of ripple carry adder as shown in Figure 4 for a 16-bit carry select adder. This design also contributes to much larger power consumption due to its complex circuitry. Figure 4: Carry Select Adder 10 1.2.2 Ripple Carry Adder and Carry Look Ahead Adder Discussion In section 1.2.1 we have analyzed the different available adder circuits but the most suitable adder circuits for small customized applications is the ripple carry adder. The idea in this thesis here is to modify the ripple carry adder by adding additional capabilities to suit the performance and area needs. This modification should result in performance comparable to the other fast adder circuits without suffering from significant additional overhead issues. Although the classical Ripple Carry Adder has the most compact design (O(n) area) [13] of all adders, it is slow due to its long worst case carry-chain length. But this worst case of carry propagation is seen only in very few cases in practice. So designing a circuit that meets this worst case propagation delay will have impact on the system?s performance in most of the other cases where the result is available earlier. The idea here is to design a self timed circuit which performs operations on a much faster rate taking into consideration the worst case performance of the adder circuit. Our proposed idea is to monitor the circuit activity only on the specific selected nodes of the circuit which are considered to be critical for the overall circuit delay. This monitoring is based on current sensing completion detection which relies on the basics of CMOS operation which states that during any active phase of a CMOS circuit the current is drawn from the power supply. This current is also known as switching current. Based on this fact, we can deduce that a circuit has completed all its operations when the switching current ceases or falls below a certain reference threshold value. One point to be considered here is that the current sensing circuitry adds some additional overhead to the circuit design with respect to the area and delay. The challenge is to keep this as small as possible in designing the sensing circuitry so as to have a minimal impact on the overall circuit performance. 11 Carry Look-ahead Adders are significantly faster than their counterpart Ripple Carry Adders but require much larger area (O(nlog(n)) [1]. CLAs speed up the computation of the addition operation by using complex circuitry for calculating the carry generation and propagation information at each bit position in advance. This helps in speeding the addition operation but at the cost of much larger circuitry. This larger circuitry also results in substantially higher power dissipation. In contrast, for ripple carry adder to determine the carry for a specific bit position the carry input at the previous bit position must be calculated, this adds up to much larger delay in the calculation. This implementation of Ripple Carry Adder uses less circuitry when compared to traditional Carry Look Ahead Adder. While timing in designs using conventional RCA designs must allow for the worst case carry ripple delay for every addition, for many input cases the correct result is in fact available a lot earlier. The incorporation of a completion signaling mechanism into a RCA offers a way for improving the ?worst case? RCA performance, which is given by O(n) [13] to ?average case? RCA performance O(log(n)) [13] over a set of repeated additions. This can provide a mechanism for the RCA to signal to the higher level circuitry controlling it that it has completed the current operation and to be ready to schedule the next operation. Thus, for example, if 32 repeated additions are to be performed to multiply two 32 bit numbers, using completion signaling to initialize the next addition (before waiting out the worst case delay) can cut down the total multiplication time from 32 worst case addition delays, to 32 ?average? case delays. This can achieve performance comparable to that attainable from much more expensive carry look ahead adders. These benefits also hold for more typically used Booth?s algorithm based multiplication implementations that reduce the number of additions needed by skipping additions for strings of 0?s and strings of 1?s in the multiplier input. 12 1.3 Problem Statement While the earlier work has established that sensing the current from the power supply of a CMOS block offers a valid means for generating a completion signal, introducing current sensors in the functional power supply has not proven practical. These current sensors suffered from significant issues. In this thesis we try to address the problems encountered in earlier designs. The problems addressed here are 1) Development of new approach for current sensing completion detection. 2) Design of current sensors for Synchronous and Asynchronous designs. 3) Design of adder and multiplier circuits with completion signaling capability. 1.4 Contribution of This Thesis In this thesis we present a novel approach where key individual (late switching) circuit nodes are probed (through sensor lines) and monitored to observe the absence of further transitions. These monitors again detect a quiescent stable state in the probed signals by sensing the current drawn from the power supply, but in this case only the supply current drawn by the sensor circuits need be observed. The overall switching current in the functional block does not need to be monitored. This avoids the key problems associated with the classical Current- Sensing Completion Detection (CSCD) approach. This thesis presents the analysis between a traditional ripple carry adder with carry completion detection and carry look ahead adder. A classical RCA design uses a string of full- adders, consisting of 2 gate delays each, that are interconnected with the (ripple) carry signals. In an experiment running 100000 random input vectors through such a 32 bit ripple carry adder, the 13 average delay is observed to be 12-14 gate delays associated with an average 6-7 bit longest carry chain length. (Other equal size or shorter carry chains can also be simultaneously active for the same computation.) Compared to the worst case 32-bit carry chain length, which is 64 gate delays, the average case delay reflects a dramatic 5-6X performance improvement. This suggests that a RCA with computation completion signaling can potentially incorporate the low power and area of RCAs with average performance comparable to that of CLAs. However, there is an additional overhead associated with the circuitry producing the completion signal; the challenge is to keep this as low as possible. 1.5 Organization of Thesis The thesis is organized as follows. In Chapter 2, we discuss a general background of the earlier techniques for current sensing based completion detection, their pros and cons, and their use in practical applications. Also we discuss their limitations with respect to current technology. In Chapter 3, a novel approach for current sensing completion detection is presented and its operation is discussed. Chapter 4 presents a detailed overview of the performance of Ripple Carry Adder and Carry Look Ahead adders. New design of ripple carry adder with current sensing completion detection is proposed and discussed in Chapter 5. In Chapter 6, a Booth multiplier with new CSCD circuitry is presented and its performance improvement with traditional Booth multiplier is evaluated. Conclusions and future work are discussed in Chapter 7. 14 Chapter 2 Prior Work on CSCD Design Researchers in the past have developed many methods for completion detection. This chapter discusses some of these techniques and past work on current sensing methods, their limitations and how these design methods would fail in the current scenarios where technology is being scaled to nanometers. This part also discusses several possible designs (in standard low- power CMOS process) that can be used as the current sensing elements in CSCD circuitry. 2.1 Completion Detection Methods Transitions in a CMOS digital circuit can be monitored by two methods. The first method is based on the current sensing. A CMOS circuit draws current from the power supply when both N and P transistor are ON. When the current ceases indicates that all the transitions in the circuit are completed. Monitoring the current drawn from the power supply provides a means for signaling the completion of the CMOS block. The second method is voltage monitoring. During transitions the internal node capacitances are charged or discharged. Completion signaling is done by monitoring the voltage transitions on internal node capacitances. In literature various methods are proposed for current and voltage monitoring. The methods used for voltage monitoring are Activity Monitoring Completion Detection (AMCD) and Transition Monitoring Completion Detection (TMCD) [23, 24]. In AMCD, Activity-Monitoring (AM) circuitry is used to observe the transitions on the internal nodes of the CMOS block. The TMCD method is similar to AMCD but offers additional functionality to observe activity from glitches. The issues 15 with these methods are that they add additional capacitances on the monitoring nodes which introduce additional delays in the circuit. Also extra routing lines are needed for the activity monitoring circuitry to observe transitions on the internal node. With respect to performance and area, AMCD method gives better performance than TMCD method while consuming less area and power. The approach used for current monitoring is Current Sensing Completion Detection (CSCD) method where the current drawn from the power supply by the CMOS block in transition is monitored. The current is collected in a current sensor and compared with a pre- selected threshold. When the monitored current is above the threshold value indicates the CMOS block is in operation and when it drops below the threshold completion is signaled. 2.2 Classical CSCD Design In theory, CMOS designs offer a simple way to determine when all switching activity has ceased and the circuit has reached steady state: there should be only minimal leakage current drawn from the power supply. A number of previous researchers have investigated ways to implement Current-Sensing Completion Detection (CSCD) [1-4] using circuits that monitor the power supply of circuit blocks as shown in Figure 5 and indicate when the outputs stabilize. Unfortunately, monitoring the power supply current necessarily involves introducing additional circuitry between the logic gates and the power supply. This can seriously degrade the supply voltage available to the logic, particularly in low-power, low-voltage designs, and thereby significantly impact circuit performance. Furthermore, it remains very challenging to design current monitors with a sufficient dynamic range to provide for the peak switching current demands of a functional block of a hundred or more gates, while still being sensitive enough to reliably indicate when the last gate completes its switching transition. These difficulties have 16 prevented the practical implementation of current sensing based completion detection. Classical CMOS circuits have the inherent property of only consuming power during transitions when both pull up and pull down transistors are on and circuit capacitances have to be charged and discharged. Only a small leakage current is drawn from the power supply in steady state. It is therefore straight forward to see how monitoring current consumption provides real-time information on circuit switching. The implementation of a CSCD design has a few requirements. 1. First, the circuit that is being monitored needs to be capable of being an asynchronous, stand alone, functional block that can take advantage of the completion signaling. 2. Second, the circuit being monitored needs to be combinational logic that contains a subset of signals in which at least one will be in the process of transitioning at any given moment if the operation is still in progress. Assuming these requirements are satisfied, any combinational block can, in principal, be modified to use CSCD circuitry. Previous CSCD designs [1-4], including those implemented for ripple carry adders have all employed circuit architecture conceptually similar to the one shown in Figure 5. Unfortunately, such designs have some significant drawbacks. Perhaps most important is the voltage degradation across the current sensor circuits, which can severely impact the switching delay of the CMOS logic. Furthermore, since typically a single pair of current sensors is used to monitor supply current for the complete functional block (e. g. a 32-bit ripple carry adder), the current sensors must be designed to allow a large peak current to pass through, while being able to accurately discriminate about a threshold current due to a single active gate. Managing such a large dynamic range of supply currents is the classical challenge faced by designers of on chip current sensors [1-4]. The only real solution is to minimize the number of 17 gates in each block being monitored, but that greatly increases the number of current sensors needed, and results in unacceptable area overhead. Earlier CSCD designs which directly monitor the circuit power supply current to create a completion signal are shown in Figure 5. Recall that CMOS circuits only consume power during switching and therefore when the switching stops it can be assumed the calculation is completed. In Figure 5, the power supply current sensors are to detect when the current level has been reduced below a threshold. At this point a ?done? signal is asserted to signify to the higher level control circuitry that the operation has completed. Pull-Up Pull-Down CMOS Data In Current Sensor Vdd Data Out Current Sensor Done Figure 5: Current-Sensing Completion Detection (CSCD) circuitry for a CMOS Block 18 2.3 Current Sensor Design Approaches The two main designs of current sensors for completion detection are resistance based and current mirror based. Figure 6: Basic Current Sensors The first method is resistance based shown in Figure 6a the detected current flows through resistance R1. The value of the resistance R1 should be chosen such that the voltage between base-emitter junction of the BJT is above 0.7 ? 0.8 V. Here a large value resistance is required as the sensed current has low values. This resistance based method has limitations when used in low-voltage applications where supply voltage is as low as 1V. The first being the unacceptable voltage drop across the base emitter junction and second the large value of the resistance R1 which limits the current in the functional path of the logic block. Also future trends concentrating on CMOS designs, inclusion of bipolar transistors is not suggested. The second method is the current mirror based shown in Figure 6b. Transistors M1 and M2 are connected in a push-pull inverter configuration. These transistors operate in saturation and act as current mirror. The sensed current Iin is mirrored and converted into voltage Vout. The disadvantage with this design is large values for W/L ratios are needed since current values are 19 low. Also, overall delay of the current sensor increases due to high gate capacitances, thus making faster transitions invisible to the sensor. 2.4 Applications Using Current Sensor Completion Detection The basic configuration for incorporating CSCD in CMOS block is shown in Figure 7 which incorporates two current sensors. The first one between the logic block and VCC, the second between the GND and the logic block. The current sensors are used to detect a current above a predefined threshold and generate a signal as long as the transient current flow is above this threshold. Ideally two current sensors are needed in both the GND and VCC supply paths. When the circuit drives large loads in those situations the current mostly flows either through the NMOS or PMOS network discharging or charging the output load capacitances. This configuration is expected to work independent of the load variations on the output of the logic function. 20 Figure 7: Basic CSCD configuration The CSCD design also incorporates few other components. These are: 1. Latches: The input latches are used to provide inputs to the logic function. These latches also remove any the differences in the arrival time of the input signals at the logic block. 2. MDG: The Minimum Delay Generator (MDG) is incorporated to guarantee that minimum time has elapsed for the logic function to compute correct outputs. The input to the MDG is from the sequential control which is the moment when the inputs are fed to the logic function. This time is considered as start time for the computation. 3. NOR Gate: The NOR gate is provided inputs from for the current sensors and minimum-delay generator. When the all the inputs from them are active low the completion signal ?done? is signaled. Vdd Gnd Logic Function Input Vcc Output Current Sensor Done MDG Current Sensor L A T H E S Seg. Ctrl. 21 The two important factors which could affect the operation or performance of the logic function being monitored are: 1. Voltage drop across the current sensor which limits the supply voltage to the logic. 2. Resistance offered by the current sensor which limits the current flow to the logic function which slows down the switching speed of the logic function. CSCD as presented here suffer from serious limitations of static power consumption and voltage drop in the power supply voltage provide to the logic function thus making them unsuitable for low-power and low-voltage applications. The next section proposes a novel idea which addresses the issues presented here. 22 Chapter 3 Proposed CSCD Design Previous methods using the technique of current sensing for completion detection introduced additional overhead in delays, degrading the performance of the overall circuit. This thesis presents a novel approach for completion detection design that does not monitor the current in the power supply of the functional block. Instead, only a selected set of individual late settling signal nodes in the circuit are observed by connecting simple inverter sensors to them, and monitoring the current drawn by these sensors from the power rail. 3.1 CMOS Inverter A CMOS inverter is constructed by two complimentary transistors, PMOS and NMOS. It gives an output voltage which is of the opposite logic-level to its input. A CMOS inverter draws current from the power supply when both the transistors are ?ON?. This current is seen when there is transition at the input of the inverter from Logic ?1? to Logic ?0? and vice versa. Figure 8 represents a simple CMOS inverter and Figure 9 shows the plots of V-I characteristics of the inverter. Observe from the V-I graph that the drain current Id is generated only when the signal at the input of the inverter makes a transition from High to Low and vice versa. This simple concept can be used to detect the transitions on the late settling nodes in a combinational circuit and determine when they have reached a steady state. 23 Figure 8 : A Simple CMOS Inverter Figure 9: V-I Characteristics of CMOS inverter 24 3.2 Sense Inverter Design The proposed method for implementing the current sensor involves using a sense-inverter such as the one shown in Figure 10. CMOS inverters draw current from the power supply as long as their input is midrange between a high and a low voltage, shown in Figure 9, such that both the P and N transistors see a gate source voltage above their respective threshold voltages. The supply current does not flow once the input voltage levels go below the threshold of either transistor types. Thus, monitoring the supply current of these inverters provides a means of determining if the input has stabilized close (within a threshold voltage) to a high or low. Figure 10: Sense Inverter Implementation Incorporating a sense-inverter current-sensing circuit implementation into a standard 32-bit RCA involves adding a minimally-sized inverter to each of the 31 carry signals as shown in Figure 10. The carry output of the final stage is neglected in this case. Recall that standard cells in typical cell libraries generally use transistors that are sized up in width by a 5-10X factor for N transistors and 10-20X (or more) for P transistors for performance. Using a minimally-sized inverter in this application ensures that the sense-inverter does not significantly load the carry signals in comparison to the normal load on the lines, minimizing any performance impact. 25 Observe in Figure 10 that only the current in the sense-inverters is monitored. Furthermore the sense inverters have no load at the output, other than parasitic capacitances associated with the small minimum sized transistors. Thus a single current monitor connected on the ground rail is sufficient to provide a reliable switching completion signal. In the case of large capacitive load on outputs, it is possible that the N transistor turns off while the P transistor is still on and charging the output capacitance for significantly longer. This is the reason for the double sensors, one for each supply rail, in classical CSCD designs. Importantly, the entire current-sensing circuitry is only associated with the sense inverters, which are not in the functional path. The power supply of the functional logic is not monitored. Thus impact on functional performance is minimal. Selecting only a limited number of circuit nodes to be monitored also reduces the peak and dynamic range in the current observed at the current sensor. Even so, observe in Figure 10 that the peak current, when all the inputs simultaneously make transitions, can be 31 times (in case of 32 bit Ripple Carry Adder) the largest current seen in an inverter. By adding sense invertors the overall circuit switching current is not monitored. In this way, functional performance in only minimally impacted (by the small additional fan out loading due to the sensors), while the completion of switching at the observed nodes can be reliably detected by monitoring the current drawn by the sensors. The proposed new design approach thus overcomes key limitations of earlier approaches discussed in Chapter 2. 26 3.3 Current Sensor Design In this section we present two designs of current sensors for synchronous and asynchronous circuits. 3.3.1 Current Sensor for asynchronous designs Asynchronous designs do not implement a clock. These circuits are self timed; when the activity in the circuits stops the output ?Done? signal is flagged indicating the completion of the operation. Figure 11 shows the design of a current sensor for asynchronous logic circuits. The size of transistor P1 is chosen to be L\W=5\1 to get sufficient bias current to keep both the transistors N1 and N2 in saturation. So when a small current from the sensor flows, the current is mirrored into Iout. The input voltage to the inverter combination of transistors N4, N3 and P3 gets a voltage equal to Vdd minus the voltage drop across drain and source of transistor P2. Transistor N4 is used to make sure that a small voltage swing at the input of the inverter causes a full swing at the output. Previous CSCD designs have also used a Minimal Delay Generator (MDG) that is also adapted in our design. The purpose of the MDG is to ensure that enough time has elapsed for the current-sensing circuit to begin functioning properly. As shown in Figure 12, the output of the MDG should be asserted and the comparator must be de-asserted in order for the ?done? signal to be asserted. Figure 12 shows the waveforms for the simulation of a 32 stage carry propagation. 27 Figure 11: Current Sensor for Asynchronous Designs Figure 12: Waveforms for Asynchronous Sensor It should be noted that the discharge time of Vsens depends on the time constant approximated by the product of the capacitance on Isens line and the channel resistance of transistor M1 in parallel with channel resistance of transistor P1. Also, the charge time of Vout depends on the time constant calculated by the product of the capacitance on Iout line and the channel resistance of transistor M2 in parallel with channel resistance of transistor P2. For correct operation of the current sensor we need a voltage swing of 0-1V on the Vout node. To achieve this, the resistance seen on Vout line should be high. This high value of resistance increases the charging time constant on Vout node increasing the response time of the current 28 sensor, thus affecting the performance of the circuit. The additional delay added in the current sensor response is found to be 1.2ns but the actual completion is over a lot earlier. This overhead is always added after completion of each operation degrading the overall performance of the circuit. With this degraded performance the asynchronous current sensor cannot be used in practice. The synchronous current sensor presented in Figure 13 has to sense only a differential voltage change when compared to large voltage swing required for asynchronous sensor. After sensing the differential voltage the full voltage swing is provided by the latch operation of inverters connected in regenerative feedback configuration, thus we can see better response by the sensor. The next section presents the design of synchronous current sensor. 3.3.2 Current Sensor for synchronous designs Synchronous designs have a clock so all the operations within the circuits are synchronized with the master clock. The ?Done? signal must be asserted before the active edge of the clock so as to synchronize the next operation. Our current sensor also uses a master clock from which signals to control the current sensor are generated. Referring to Figure 12, the current sensor consists of two inverters, Inv1 and Inv2, connected in a regenerative feedback configuration. The sources of the two PMOS transistors, T5 and T6, of these two inverters are connected to power through another P transistor, T3, which serves as a switch to trigger these inverters. The measured voltages are connected to the sensor through two pass transistors, namely T1 and T2. The gates of T1, T2, T3 and T4 are connected to ?accum?, ?precharge? and ?eval? control signals generated from control logic. 29 Figure 13: Current Sensor for Synchronous Designs During the precharge phase, T4 is on which maintains the inverter inputs at same voltage. In the accumulation phase, T1 and T2 are on. This allows the charge from the measured voltages to build up on the inverter inputs, but the output the inverters are not affected as T3 is turned off, which latches the voltages into the inverter inputs. In the evaluation phase, the power to the inverters is turned on. When the inverters are powered up, T3 turned on, the inverter with the higher input voltage dominates, and its output is pulled low. The outputs from both inverters are connected to another set of inverters Inv3 and Inv4, which serve to buffer the sensor output from the rest of the circuit. If the current from the current generator is greater than the reference current, then the node that was connected to the current generator (output of Inv1) is pulled high. This node is connected to one of the output inverters (Inv3) and pulls its output low. This creates the active-low signal that indicates carry propagation. It should be noted that the current sensor is triggered before the evaluation period and the output of the RCA is valid only when the ?addcomp? signal is high, if not, the next cycle is tested for the addition completion. 30 In practice, the supply current cannot be more than about 20X the detection threshold current. The reduced voltage across the inverters automatically limits the current drawn from the power supply. Note that while this serious degradation in available voltage across the sense- inverters is acceptable in our design, because these sense-gates are not in the functional path, the performance impact of such supply degradation would be unacceptable if the current sensors were in the functional power supply, as in traditional CSCD designs. This is a major reason why past CSCD designs have not been very practical. In fact, design of the current sensor circuits to meet this demanding requirement of handling a very wide range of supply currents without excessively degrading functional performance was the key challenge in earlier designs. 4.3.2.1. Control Logic Block As shown in Figure 14 the control signals are generated using the clock and a delayed clock signal as inputs. The Clock Delay Generator (CDG) is used to generate the delayed clock. The delay of the CDG is set to the evaluation time of the current sensor. Typically we see this delay as an additional overhead on the overall circuit delay over one addition operation. This overhead can be minimized by designing the current sensor with faster invertors in the regenerative configuration. Figure 14: Control Logic Block 31 4.3.2.2. Control Signals The Control logic generates three signals, corresponding to the three phases Precharge, Accumulation and Evaluation, of the current sensor operation. The states of transistors T1, T2, T3 and T4 of Figure 13 are shown in Figure 15. Figure 15: Control Signals 4.3.2.3. Waveforms Figure 16 shows the voltage waveforms at different nodes of the current sensor circuit for a 32 bit addition, where 32 stage carry propagation is generated. These waveforms show analysis for three evaluation phases. 1. Start: Start indicates the time when the inputs are fed to the circuitry. The inputs for transistors T1, T2, T3 and T4 of Figure 13 are shown in Figure 15. 2. Evaluation Phase-1: Here the sensing voltage, V(sens), is greater than the reference voltage, V(ref). This indicates the operation is not completed yet so V(addcomp) is Active LOW. 3. Evaluation Phase-2: The sensing voltage, V(sens), is still greater than the reference voltage, V(ref). This indicates the carry is still in propagation and V(addcomp) is Active LOW. 32 4. Evaluation Phase-3: Now the sensing voltage, V(sens), is less than the reference voltage, V(ref), indicating the operation is completed and V(addcomp) is set to Active HIGH. Figure 16: Waveforms for Synchronous sensor 33 Chapter 4 Ripple Carry Adder and Carry Look Ahead Adder Analysis In this chapter the performance of a ripple carry adder for delay in the best and worst case scenarios is discussed. Also ripple carry adder performance is compared to a carry look ahead adder with carry propagate and carry generate block designed using grouping of 4 bits. 4.1 Ripple Carry Adder A Ripple carry adder consists of blocks of 1-bit full adders connected in series with the carry out of one block serving as carry in to the next block. Figure 17 shows the interconnection of a 4-bit ripple carry adder. It can be observed that the critical path for this design is the path from C0 to C4. 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder 1-Bit Full Adder A0 A1 A2 A3 B1 B0 B2 B3 S1 S3 S0 C3 C2 2 C1 2 C0 2 S2 C4 Figure 17: 4-Bit Full Adder Circuit 34 4.2 Full Adder A full adder performs additions of three 1-bit numbers, A, B, Cin, and gives outputs Sum, and Carry out, Cout. The expressions for sum and carry are given by S = CinBA ?? Cout = ))(()( BAC inBA ???? Figure 19: Gate Level Representation of Full Adder Circuit Figure 19 shows the gate level representation of a full adder circuit with sum and carry outputs. The sum output takes two XOR gate delays and the carry out has delay of two gates; one AND gate and one OR gate. 1-Bit Full Adder B A2 A S Cout Cin Figure 18 : Full Adder Block 35 4.3 Performance Analysis of RCA Figure 20 shows the gate level representation of 4-bit full adder circuit with sum and carry outputs. The addition of two bits at any stage depends on the carry generated by the addition of the two bits in the previous stage. Thus, the sum of the most significant bit is only available after the carry signal has rippled through the adder from the least significant stage to the most significant stage. The critical path for this design is the carry propagation path from C0 to C4 which is 8 gate delays. Extending the design concept for 32- bit ripple carry adder gives us the critical path delay of 64 gates. Figure 20: Gate level representation of 4-bit ripple carry adder 4.3.1 Best Case Performance The best case performance for a ripple carry adder is seen when there is no carry generated at each bit position. In the case of the 4-bit adder in Figure 20 the sum outputs are available just after two xor gate delays. This can be observed with input values A = 1111, B = 0000 and C0=0, which gives C1=C2=C3=C4=0 at each bit position. For this case there are no carry bits generated and since no propagation takes place the sum outputs are valid after two gate delays. 36 Delays: S0 = 2 gate delays S1 = 2 gate delays S3 = 2 gate delays S4 = 2 gate delays 4.3.2 Worst Case Performance The worst case performance for a ripple carry adder is seen when the input carry, Cin is propagated to next stage by each bit position. Because of carry propagation from the least significant position to most significant position the sum bits generation takes variable time delays. In the case of the 4-bit adder in Figure 20, this can be observed with A = 1111, B = 0000 and C0=1, which gives C1=C2=C3=C4=1 at each bit position. Since the carry is propagated the delays for the sum outputs is observed to be two to eight gate delays. Delay: S0 = 2 gate delays 37 S1 = 4 gate delays S3 = 6 gate delays S4 = 8 gate delays 4.4 Carry Look Ahead Adder The carry look ahead adder calculates carry bits simultaneously using complex logic circuitry. This minimizes the worst case delay to calculate the sum bits. Due to large fan out on the gates, the carry generation and propagation bits are calculated by making groups of carry blocks (usually four bits) as shown in Figure 22. The primary block of the carry look ahead adder is Partial Full Adder (PFA). Each block of PFA takes three bits A, B, C; as inputs and generates three outputs G, P and S. These are Generate function, G = A ? B Propagate function, P = BA? Sum function, S = CBA ?? Figure 21: Partial Full Adder 38 4.5 Performance Analysis for 4-bit CLA Figure 22 shows the gate level architecture of a 4-bit carry look ahead adder. In order to construct a 4-bit CLA four PFAs are needed to generate the signals. When Pi = 1, an incoming carry is propagated to the next bit position from Ci to Ci+1. For Pi = 0, carry propagation to the bit position is blocked. Regardless of the value of Pi, when Gi = 1, the carry output from the current position is ?1?. The carry output has the following logic: C1 = G0 + P0C0 C2 = G1 + P1C1 = G1 + P1G0 + P1P0C0 C3 = G2 + P2C2 = G2 + P2G1 + P2P1G0 + P2P1P0C0 C4 = G3 + P3C3 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0 The carry look ahead block takes only two gate delays to generate carry bits from C1 through C3. The implementation of C4 is becomes more complicated when this 4-bit carry look ahead adder is extended to multiples of 4 bits, such as 16 bits and 32 bits. The carry look ahead adder for 4 bits computes carry bits at all the bit positions simultaneously. The longest delay in the 4-bit carry look ahead adder is 4 gate delays, compared with 8 gate delays in the ripple carry adder. The improvement is very modest but at the cost of much additional hardware. From Figure 22 it can be observed that the fan in for generating the C4 is 4, if we extend the same concept to generate more carry bits simultaneously then the high fan-in for generating the carry bits could contribute to additional adder delays. So in such cases the carry bits are limited to groups of 4 and are extend to next higher level of blocks to generate the carry bits. For a 16 bit carry look ahead adder the higher numbered bits 4 - 7, 8 - 11, and 12 - 15 are grouped together. For this in positions 4, 8, and 12 we would like the carry to be produced as fast as possible 39 without using excessive fan-in. The estimated worst case delay for a 32-bit CLA is 8 gate delays but high fan-in in the carry generation block could add additional delays in the adder circuit. Figure 22: Carry Look Ahead Adder 40 Chapter 5 Proposed Ripple Carry Adder Design with Completion Signaling 5.1 Background A classical ripple carry adder design uses a string of full-adders, consisting of 2 gate delays each, that are interconnected with the (ripple) carry signals. In 1946 Burks, Goldstine, and von Neumann published a proof that the expected maximum length of carry propagation in the addition of two binary n digits each does not exceed log2n [26]. In an experiment running 100000 random input vectors through such a 32 bit ripple carry adder, the average delay is observed to be 12-14 gate delays, associated with an average 6-7 bit longest carry chain length as shown in Figure 23. (Other equal size or shorter carry chains can also be simultaneously active for the same computation.) Compared to the worst case 32-bit carry chain length, which is 64 gate delays, the average case delay reflects a dramatic 5-6X performance improvement. This suggests that a ripple carry adder with computation completion signaling can potentially incorporate the low power and area of ripple carry adders with average performance comparable to that of carry look ahead adders. However, there is an additional overhead associated with the circuitry producing the completion signal; the challenge is to keep this as low as possible. 41 Figure 23: Carry Propagation graph for two 32 bit additions for 100000 random vectors. 5.2 RCA Design The new proposed design for ripple carry adder with completion signaling is shown in Figure 24. The new carry completion signaling design discussed in chapter 3 is incorporated in a 32-bit ripple carry adder to produce a carry completion signal. This modified ripple carry adder has a strong delay dependency on carry chain length, which is the critical path for the design. So monitoring the activity on the carry out lines only gives us information about the delays seen in calculating the sum bits for different input vectors. A clock is used for the operation of the current sensor circuitry. As discussed earlier this clock is used to generate control signals. 42 Figure 24: Ripple Carry Adder with sensor circuitry. 5.3 RCA Operation The sense inverters are added to carry out signals from all the output blocks except the final stage. The carry out of the final stage is neglected, as it does not play any role in calculation of sum bits. For 32-bit additions if there is any carry propagation then the carry signals have transitions on their lines. Due to these transitions, current is drawn from the power supply by the sense invertors. This current is captured into the current sensor and is compared with a pre- chosen threshold. Whenever the captured current is above the threshold, the output done line will be active low, indicating carry propagation by the full adder blocks. But when the current falls below the threshold, there are no transitions on the carry out lines and the circuit has reached stable state values. At this moment the done signal goes high and the adder is ready is compute the next addition. In case of asynchronous current sensor designs the done signal is flagged as soon as all the carry lines settle to a stable state. But the design here is synchronous, which includes a clock signal. So we compare current once in each clock cycle and provide a done signal. This clock 43 signal can be used as a master reference clock and another clock with high frequency can be used to create multiple evaluation points which fit into the clock period of the master clock. The general operation for the adder is to load the values to the registers after waiting for the worst case delay; for the RCA this is equivalent to the 64 gate delays associated with the carry propagation path. For loading the values we use a master clock (Clk1) which is operated at a frequency equal to the worst case delay (assuming the set up and hold time constraints are met). The new RCA with sensor circuitry uses a different clock faster, (Clk2/n) by a factor of ?n? with respect to the main clock. So the values are now loaded for the next addition on the rising edge of the clock depending on the value of the ?Done? signal as shown in Figure 25. If, at the first clock edge the ?Done? signal is active high, indicating addition is complete, the new values are loaded to the registers; if not so, then the next clock edge is used for loading the values. Figure 25 : Clock Division 5.4 Calculations Clk1 = Twcd = 32 stage worst case delay without sense invertors Clk2 = Twcd + 0.2ns* +0.4ns** * Delay introduced due to sense invertors on carry signals 44 ** Delay added for evaluation over one complete addition Timed saved = (T1-T2) / T1 Where, T1 = Total time with Clock set to Clk1 T2 = Total time with Clock set to Clk2/n, n = 2, 3, 4. Here to have multiple comparison slots we divide the Clk2 by a factor ?n?. Here ?n? indicates the number of comparisons. 5.5 Optimizations In the analysis different types of optimizations are used in automatic synthesis tool Leonardo spectrum for generating the netlist. Figures 26 and 27 show the gate level representation of a 1-bit full adder block after the optimizations are performed by the synthesis tool. Figure 26: Optimization w.r.t Area 45 Figure 27: Optimization w.r.t Delay 5.6 Simulations 5.6.1 Ripple Carry Adder with Sensor Circuitry The SPICE simulations were performed with 100000 random vectors to explore the correlation between delay for the RCA with and without the sensor circuitry. The Clock Delay Generator (CDG), for the design with the sensor, is set to 0.4ns which is the evaluation time of the current sensor. The netlist parameters were obtained by using three different types of synthesis optimization in automatic synthesis tool Leonardo Spectrum using 180nm technology. 1) Area + Delay Clk1 = Twcd = 5.0ns Clk2 = Twcd + 0.2ns +0.4ns = 5.6ns Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 41.09% 52.18% 53.02% Table 1: Simulation results with area and delay optimization for RCA with and without sensor circuitry. 46 2) Area Clk1 = Twcd = 5.1ns Clk2 = Twcd + 0.2ns +0.4ns = 5.7ns Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 41.21% 52.28% 53.01% Table 2: Simulation results with area optimization for RCA with and without sensor circuitry. 3) Delay Clk1 = Twcd = 4.8ns Clk2 = Twcd + 0.2ns +0.4ns = 5.4ns Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 40.83% 51.96% 52.89% Table 3: Simulation results with delay optimization for RCA with and without sensor circuitry. 5.6.2 RCA Vs CLA The SPICE simulations performed here are to explore the correlation between delay for the RCA with the sensor circuitry and Carry Look ahead adder. These netlist parameters were obtained by using three different types of synthesis optimization in automatic synthesis tool Leonardo Spectrum using 180nm technology. 1) Area + Delay 47 Clk1 = 3.1ns Clk2 = 5.0ns + 0.2ns +0.4ns = 5.6ns Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 5.25% 21.91% 23.68% Table 4: Simulation results with Area + Delay optimization for RCA and CLA. Design RCA + Sensor CLA Overhead #Transistors 1182 1450 22.67% Table 5: Area Overhead calculations with Area + Delay optimization. 2) Area Clk1 = 3.2ns Clk2 = 5.1ns + 0.2ns +0.4ns = 5.7ns Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 6.57% 22.99% 24.74% Table 6: Simulation results with Area optimization for RCA and CLA. . Design RCA + Sensor CLA Overhead #Transistors 1182 1450 22.67% Table 7: Area Overhead calculations with Area + Delay optimization. 3) Delay Clk1 = 2.9ns Clk2 = 4.8ns + 0.2ns +0.4ns = 5.4ns 48 Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 2.33% 19.49% 21.32% Table 8: Simulation results with Delay optimization for RCA and CLA. Design RCA + Sensor CLA Overhead #Transistors 1451 2008 38.39% Table 9: Area Overhead calculations with Delay optimization. 5.7 Power Analysis 5.7.1 Average Power Calculations Table 10 gives the average power (in mW) calculations for the three circuit designs, Ripple carry adder, Ripple carry adder with sensor circuitry and the Carry look ahead adder. Longest Carry Chain Length / Inputs RCA RCA with CSCD CLA Clk/2 Clk/3 Clk/4 0 (0x 22222222 + 0x 11111111) 0.555 0.647 0.681 0.788 0.867 2 (0x 33333333 + 0x 11111111) 0.811 0.883 1.061 1.170 1.441 4 (0x 0F0F0F0F + 0x 01010101) 1.014 1.093 1.270 1.378 1.682 6 (0x 3F3F3F3F + 0x 01010101) 1.668 1.776 1.883 1.990 2.512 8 (0x 00FF00FF + 0x 00010001) 1.143 1.184 1.360 1.446 1.771 10 (0x 03FF03FF + 0x 00010001) 1.446 1.567 1.675 1.780 2.204 12 (0x 0FFF0FFF + 0x 00010001) 1.757 1.943 2.052 2.156 2.719 16 (0x 0000FFFF + 0x 00000001) 1.049 1.161 1.250 1.351 1.835 20 (0x 000FFFFF + 0x 00000001) 1.191 1.264 1.338 1.476 2.290 24 (0x 00FFFFFF + 0x 00000001) 1.322 1.465 1.576 1.605 2.767 28 (0x 0FFFFFFF + 0x 00000001) 1.457 1.545 1.636 1.737 3.234 32 (0x FFFFFFFF + 0x 00000001) 1.583 1.636 1.727 1.856 3.658 Table 10: Average Power Calculations 49 5.7.2 Energy Calculations Table 11 gives the energy per addition (in mJ) calculations for the three circuit designs, Ripple carry adder, Ripple carry adder with sensor circuitry and the Carry look ahead adder. Longest Carry Chain Length / Inputs RCA RCA with CSCD CLA Clk/2 Clk/3 Clk/4 0 (0x 22222222 + 0x 11111111) 0.160 0.186 0.196 0.227 0.260 2 (0x 33333333 + 0x 11111111) 0.467 0.509 0.612 0.674 1.730 4 (0x 0F0F0F0F + 0x 01010101) 0.877 0.945 1.098 1.192 2.018 6 (0x 3F3F3F3F + 0x 01010101) 1.923 2.048 2.171 2.294 5.275 8 (0x 00FF00FF + 0x 00010001) 1.647 1.707 1.960 2.083 3.720 10 (0x 03FF03FF + 0x 00010001) 2.500 2.71 2.896 3.078 4.628 12 (0x 0FFF0FFF + 0x 00010001) 3.545 3.921 4.140 4.349 5.709 16 (0x 0000FFFF + 0x 00000001) 2.721 3.011 3.242 3.504 3.853 20 (0x 000FFFFF + 0x 00000001) 3.776 4.009 4.244 4.680 6.871 24 (0x 00FFFFFF + 0x 00000001) 4.953 5.49 5.872 6.014 8.301 28 (0x 0FFFFFFF + 0x 00000001) 6.301 6.68 7.073 7.508 9.701 32 (0x FFFFFFFF + 0x 00000001) 7.758 8.016 8.460 9.095 10.973 Table 11: Energy per addition calculations 5.8 Results Analysis The design of the 32-bit modified RCA used 1182 transistors in its design, which includes the CSCD circuitry. In contrast, the 32-bit RCA without sensor circuitry used 1079 transistors. This suggests a 9-10% area overhead. The average speedup achieved was between 40-55% in all the three cases of optimizations. The design the 32-bit CLA used 1450 transistors which is 23% more compared to 32-bit RCA with CSCD circuitry. The performance improvement achieved was between 5-25% in all the three cases of optimizations. 50 The modified Ripple carry adder with sensor circuitry consumes 10-20% more power than the standard Ripple carry adder. But the consumption is less when compared to Carry Look Ahead adder which takes 10-50% more power. 51 Chapter 6 Booth Multiplier 6.1 Background The Booth multiplier performs the multiplication of two signed numbers using the signed binary notations. The Booth multiplier tries to decrease the number of partial products generated for an n-bit multiplication. In general, for an n-bit multiplication n partial products are generated and added to form the final result of the product. The basic idea for multiplying two binary numbers is ?shift and add?. That is, for each column in the multiplier, shift the multiplicand the appropriate number of columns and multiply it by the value of the digit in that column of the multiplier, to obtain a partial product. The partial products are then added to obtain the final result. As shown in example below, we get four partial products to add for 4-bit multiplication. 0 0 1 1 (3) x 0 1 1 0 (6) 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 (18) 52 Radix-4 Booth encoding reduces the number of partial products by half. The idea here is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we only take every second column, and multiply by ?1, ?2, or 0, to obtain the same results. In the radix-4 booth algorithm the multiplier is divided into substring of 3 bits and based on the bit pattern, operation is performed as per Booth encoding shown in Table 12. Bit Pattern Operation 000 0 001 1 * Multiplicand 010 1 * Multiplicand 011 2 * Multiplicand 100 -2 * Multiplicand 101 -1 * Multiplicand 110 -1 * Multiplicand 111 0 Table 12: Booth Encoding Booth encoding has an advantage of reducing the number of partial products. For radix-4 encoding the number of partial products is halved. This is important in circuit design as it relates to the propagation delay in the operation of the circuit, and the complexity and power consumption of its implementation. The above example can be multiplied using the Booth encoding as below, we consider the bits in blocks of three, such that each block overlaps the previous block by one bit. The overlap is necessary so that we know what happened in the last block, as the MSB of the block acts like a sign bit. Grouping starts from the LSB, and the first block only uses two bits of the multiplier. 53 Since there is no previous block to overlap a padding of 1-bit is used, which is added to the right towards the least significant bit. 0 1 1 0 0 2nd 1st Padding bit For 1st substring, Addition term as per booth encoding = (-1) * Multiplicand For 2nd substring, Addition term as per booth encoding = (2) * Multiplicand The final product is obtained by adding the two partial products with a 2nd partial product shifted left by two bits. 6.2 Modified Booth Multiplier The Booth multiplier performs the addition of the partial products. To achieve time saving, Ripple Carry Adder (RCA) with completion signaling is used in our Radix-4 Booth Multiplier by replacing the standard RCA in the design as shown in Figure 28. 54 Figure 28: Booth Multiplier with Ripple Carry Adder and sensor circuitry. 6.3 Calculations The Booth multiplier uses a master clock to load values to the registers. We run the master clock at a frequency equal to the worst case delay required for adders. 1. When RCA is used the clock is set to Clk1 Clk1 = 32 stage delay without sense invertors 2. When RCA with completion detection circuitry is used the clock is set to Clk2 Clk2 = Clk1 + 0.2ns* +0.4ns** * Delay added due to sense invertors on carry signals ** Delay added for evaluation over one complete addition Here we divide the Clk2 by a factor ?n?, where ?n? indicates the number of comparisons. So we now operate the Booth multiplier at a faster clock rate given by Clk2/n. 3. When CLA is uses the clock is set to Clk3 Clk3 = delay seen for calculating carry for last block. 55 Timed saved = (T1-T2)/T1 Where, T1 = Total time with Clock set to Clk1 T2 = Total time with Clock set to Clk2/n or Clk3/n, n = 2, 3, 4. It is important to note that when ?0? is added , it?s a simple shift operation without an addition, so a faster clock can be used and when there is addition the worst case delay wait is added. This can improve the overall performance of the Booth multiplier. 6.4 Simulations 6.4.1 BM with RCA vs BM with RCA and Current Sensor For comparison, simulations for 1000 random input vectors with two designs of booth multiplier, one with RCA and other with RCA and current sensor circuitry are performed. In each simulation run two clocks, Clk1 and Clk2, are used with time period scaled by a factor ?n?, where ?n? is number of evaluations per addition. A performance improvement between 30-50% was observed with overall area overhead less than 2% of the silicon area (Table 13). For BM with RCA, the shift operation is done in a Clk1/n (n=2, 3, 4) time period and for additions the worst case delay is applied and the new values are loaded in Clk1 time period. For BM with sensor circuitry, the Clk2/n time period is used and values are loaded only the when add completion signal is high at an active clock edge. The Clock Delay Generator in RCA with Current sensor circuitry is set to a delay of 0.4ns, which is the response time of the Current sensor. Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 33.69% 46.89% 49.25% Table 13: Simulation results with BM with RCA with and without sensor circuitry. 56 6.4.2 BM with CLA vs BM with RCA and Current Sensor For comparison, simulation for 1000 random input vectors with two designs of booth multiplier, one with CLA and other with RCA and current sensor circuitry are performed. In each simulation run two clocks, Clk1 and Clk2, are used with time period scaled by a factor ?n?(n=2, 3, 4). A performance improvement between 15-30% was observed while keeping the overall area overhead less than 2% of the silicon area (Table 14). For BM with CLA, the shift operation is done Clk1/n time period and for additions the worst case delay is applied and the new values are loaded in Clk1 time period. For BM with sensor circuitry uses Clk2/n time period, values are loaded only when add completion signal is high at active clock edge. The Clock Delay Generator in RCA with Current sensor circuitry is set to a delay of 0.4ns i.e. response time of the Current sensor. Clock(ns) Clk2/2 Clk2/3 Clk2/4 Time Saved 16.45% 24.45% 29.98% Table 14: Simulation results with BM with RCA with sensor circuitry and CLA. 6.5 Results Analysis The Booth multiplier when replaced with the RCA with Current sensor circuitry can be operated at a higher clock rate to get improved performance. This average time saving achieved is found to be between 30-50%. When the same design is compared with a booth multiplier which incorporates CLA for addition the performance improvement is between 20-30%. 57 Chapter 7 Conclusion Providing carry completion signaling in low cost ripple carry adders can allow the control logic to schedule the next addition as soon as an earlier one is complete, thereby achieving the average case, rather than worst case addition delay over a set of computations. Earlier attempts at using current sensing for such carry completion signaling suffered from serious limitations. In this thesis we presented a new approach for the design of a ripple carry adder with a current sensing capability which observes late settling carry signal nodes in the circuit and indicates when they reach a quiescent state. Simulations show better than 50% speedup, on average, with less than 10% area overhead. To demonstrate a potential application of such an approach, we incorporated our carry completion adder into a Booth multiplier design and studied the performance gain over a traditional ripple carry adder based design. Simulation results show that a 32-bit Booth Multiplier using the new completion signaling circuits can outperform a 32-bit Booth Multiplier with ripple carry adder (RCA) by 30-50%, while requiring less than 2% additional silicon area, This is comparable to the gains from the best carry look ahead adder designs at a fraction of the area overhead costs. Also, the observed performance improvement over a carry look ahead adder is 20-30%, with an area overhead less than 40% of that required by a carry look ahead adder. This area overhead will save power and area in smaller circuit designs. When the clock is set to average case RCA delay, the average case delay is found to be 8-9 times the longest carry chain length. At first this number appears bigger than the 6-7 long average case carry chain length for 32-bit 58 adders found by our preliminary simulations presented in Figure 23, recall however that our design incorporates the current sensor that limits the earliest completion signal by adding an additional overhead of 0.6ns, 0.2ns for additional load on carry lines and 0.4ns for the current sensor response over one complete addition. Careful designs along with reduction of the response time of the sensor can potentially further improve the average case delay of our design by 10-20%. Future work is focused on investigating other applications of the completion signaling approach presented here. 59 References 1. F. Cheng, S. Unger, and M. Theobald, ?Self-Timed Carry Lookahead Adders?. IEEE Transactions on Computers, vol. 49, no. 7, July 2000. 2. M. E. Dean, D. I. Dill, and M. Horowitz, ?Self-timed logic using current-sensing completion detection (CSCD)?. Journal of CLSI Signal Processing, vol. 7, no. 1-2, pp. 7-16, February 1994. 3. E. Grass and S. Jones, ?Asynchronous circuits based on multiple localized current-sensing completion detection?. Proceedings: Second Working Conference on Asynchronous Design Methodologies. pp. viii+223, 170-177, London, UK, May 1995. 4. O. A. Izosimov, I. I. Shagurin, and V. V. Tsylyov, ?Physical approach to CMOS Self- Timing?. Electronic Letters, vol. 26, pp. 1835-1836, 1990. 5. C. G. Knight, A. D. Singh, and V. P. Nelson, ?An IDDQ Sensor for Concurrent Timing Error Detection?. IEEE Journal of Solid-State Circuits, vol. 33, no. 10, October 1998. 6. H. Lampinen, O. Vain, ?Circuit design for current-sensing completion detection?. ISCAS ?98: Proceedings of the 1998 IEEE International Symposium on Circuits and Systems. Monterey, CA, vol. 2, pp. 185-188,June1998 7. Weste, N., and Eshraghian, K.: ?Principles of CMOS VLSI design? (Addison Wesley, 1993) 8. Beerel, P.A.: ?Asynchronous circuits: an increasingly practical design solution?. Proc. Int. Symp. on Quality Electronic Design, 18?21 March 2002, pp. 367?372 9. Perri, S., Corsonello, P., and Cocorullo, G.: ?VLSI circuits for lowpower high-speed asynchronous addition?, IEEE Trans. VLSI, 2002, 10, (5), pp. 608?613 60 10. De Gloria, A., and Olivieri, M.: ?Completion-detecting carry select addition?, IEE Proc., Comput. Digit. Tech., 2000, 23, pp. 93?100 11. Ruiz, G.A., and Manzano, M.A.: ?Compact 32-bit CMOS adder in multiple-output DCVS logic for self-timed circuits?, IEE Proc., Circuits Devices Syst., 2000, pp. 183?188 12. De Gloria, A., and Olivieri, M.: ? Statistical carry lookahead adders?, IEEE Trans. Comput., 1996, pp. 340?347 7 Escrib!a, J., and Carrasco, J.A.: ?Self-timed Manchester chain carry propagate adder?, Electron. Lett., 1996, 32, (8), pp. 708?710 13. Reitwiesner, G.W.: ?The determination of carry propagation length for binary addition?, IRE Trans. Electron. Comput., 1960, pp. 35?38. 14. Nowick, S.M.: ?Design of a low-latency asynchronous adder using speculative completion?, IEE Proc., Comput. Digit. Tech., 1996, 143, (5), pp. 301?307 15. Garside, J.D.: ?A CMOS VLSI implementation of an asynchronous ALU?. Proc. IFIP Conf. on Asynchronous Design Methodologies, Manchester, UK, 1993, pp. 181?192 16. Ruiz, G.A.: ?Evaluation of three 32-bit CMOS adders in DCVS for self-timed circuits?, IEEE J. Solid-State Circuits, 1998, 33, (4), pp. 604?613 17. Martin, A.J.: ?Asynchronous datapaths and the design of an asynchronous adder?, Form. Methods Syst. Des., 1992, 1, (1), pp. 119?137 18. Cheng, F.-C., Unger, S.H., Theobald, M., and Cho, W.-C.: ?Delay insensitive carry- lookahead adders?. Proc. 10th Int. Conf. VLSI Design, Jan. 1997, pp. 322?328 19. Cheng, F.-C., Unger, S.H., and Theobald, M.: ?Self-timed carrylookahead adders?, IEEE Trans. Comput., 2000, 49, (7), pp. 659?672 20. Hauck, S.: ?Asynchronous design methodologies: an overview?, Proc. IEEE, 1995, 83, (1), pp. 69?93 61 21. Parhami, B.: ?Computer arithmetic: algorithms and hardware design? (Oxford University Press, 2000) 22. H. Lampinen and 0. Vainio, ?Circuit Design for Current- Sensing Completion Detection,? IEEE Intemational Symposium on Circuits and Systems, Monterey, California, USA, Vol. 2, pp. 185-188, May 31 -June 3 1998. 23. E. Grass, V. Bartlett, and I. Kale, ?Completion-Detection Techniques for Asynchronous Circuits,? IEICE Tramactions on Zilfomzation and Systems, Vol. E80-D, No. 3, pp. 24. E. Grass and S. Jones, ?Activity-Monitoring Completion- Detection (AMCD): A New Approach to Achieve Self- Timing,? Electronics Letters, Vol. 32, No. 2, pp. 86-88, Jan. 1996. 25. H. Lampinen, Olli Vainio, ?Current-Sensing Completion Detection Method For Standard Cell Based Digital System Design?, 2002. 26. W. Burks, Herman H. Goldstine, John Von Neumann, ?Preliminary discussion of the logical design of an electronic computing instrument?. 27. Harri Lampinen, Olli Vainio, ?Dynamically Biased Current Sensor for Completion Detection?.