Current Sensing Completion Detection for High Speed and Area Efficient Arithmetic
by
Balapradeep Gadamsetti
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science.
Auburn, Alabama
May 09, 2011
Keywords: current sensing completion detection, adders, multipliers
Copyright 2011 by Balapradeep Gadamsetti
Approved by
Adit D. Singh, Chair, James B. Davis Professor of Electrical and Computer Engineering
Vishwani D. Agrawal, James J. Danaher Professor of Electrical and Computer Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering
ii
Abstract
Adders and multipliers are the most widely used computational units in integrated
circuits. In many compact low power and high speed designs, more complex arithmetic
operations such as multiplication are performed through repeated additions. In such applications,
providing carry completion signaling in low cost ripple carry adders can allow the control logic
to schedule the next addition as soon as an earlier one is complete, thereby achieving the average
case, rather than worst case addition delay over a set of computations. Earlier attempts at using
current sensing for such carry completion signaling suffered from serious limitations. This thesis
presents a new approach for the design of a ripple carry adder with a current sensing capability
which observes late settling carry signal nodes in the circuit and indicates when they reach a
quiescent state. By this functional circuitry we avoid key problems of more hardware and power.
Simulations for new design of ripple carry adder with current sensing completion detection
technique show better than 50% speedup, on average, with less than 10% area overhead. The
performance improvement with respect to carry look ahead adder which takes 40% more area
than our ripple carry adder design is found to be 20%. This could give significant area and power
savings in smaller circuit designs.
To demonstrate a potential application of ripple carry adder with completion detection
capability, we incorporate our carry ripple adder into a Booth multiplier design and study the
performance gain over a traditional ripple carry adder based design. Simulation results show that
a 32-bit Booth Multiplier using the new completion signaling circuits can outperform a 32-bit
Booth Multiplier with ripple carry adder (RCA) by 20-50%, while requiring less than 2%
iii
additional silicon area. This is comparable to the gains from the best carry look ahead adder
designs at a fraction of the area overhead costs. The new approach eliminates the limitations
suffered from the previous current sensing techniques. The overall performance of is improved
with a minimal area overhead.
iv
Acknowledgements
At the outset I would like to thank my major advisor Dr. Adit Singh for his immense
support and guidance in completing this thesis. He has been a constant source of inspiration and
encouragement. I am highly grateful to him for his suggestions and cooperation right from the
start of this thesis work. His guidance helped me in all the time of research and my studies at
Auburn University.
I express my gratitude to my thesis committee members, Dr. Vishwani Agrawal and Dr.
Victor Nelson, for being part of my graduate committee and co-advising my thesis. Their
lectures as part of my graduate coursework helped me in learning the basic knowledge required
for my thesis work.
I want to express my sincere thanks to my senior Nitin Yogi and my friend Rakshit
Venkatesh for helping me academically and giving me valuable suggestions throughout my
graduate studies at Auburn University.
Finally I owe my deepest gratitude to my parents for their moral support during all the
times. They stood by me unconditionally motivating me in every aspect of my life and education.
v
Table of Contents
Abstract ........................................................................................................................................... ii
Acknowledgements ........................................................................................................................ iv
List of Tables ................................................................................................................................ vii
List of Abbreviations ...................................................................................................................... x
1 Introduction .................................................................................................................................. 1
1.1 Background ........................................................................................................................... 2
1.2 Motivation ............................................................................................................................. 3
1.2.1 Background on Adder Circuits ....................................................................................... 5
1.2.2 Ripple Carry Adder and Carry Look Ahead Adder Discussion ................................... 10
1.3 Problem Statement .............................................................................................................. 12
1.4 Contribution of This Thesis................................................................................................. 12
1.5 Organization of Thesis ........................................................................................................ 13
2 Prior Work on CSCD Design ..................................................................................................... 14
2.1 Completion Detection Methods ..................................................................................... 14
2.2 Classical CSCD Design .................................................................................................. 15
2.3 Current Sensor Design Approaches ............................................................................... 18
2.4 Applications Using Current Sensor Completion Detection ........................................... 19
3 Proposed CSCD Design ............................................................................................................. 22
3.1 CMOS Inverter ............................................................................................................... 22
3.2 Sense Inverter Design..................................................................................................... 24
3.3 Current Sensor Design ................................................................................................... 26
3.3.1 Current Sensor for asynchronous designs .................................................................... 26
3.3.2 Current Sensor for synchronous designs ...................................................................... 28
4 Ripple Carry Adder and Carry Look Ahead Adder Analysis .................................................... 33
4.1 Ripple Carry Adder ........................................................................................................ 33
4.2 Full Adder ...................................................................................................................... 34
4.3 Performance Analysis of RCA ....................................................................................... 35
vi
4.3.1 Best Case Performance ................................................................................................. 35
4.3.2 Worst Case Performance .............................................................................................. 36
4.4 Carry Look Ahead Adder ............................................................................................... 37
4.5 Performance Analysis for 4-bit CLA ............................................................................. 38
5 Proposed Ripple Carry Adder Design with Completion Signaling ........................................... 40
5.1 Background .................................................................................................................... 40
5.2 RCA Design ................................................................................................................... 41
5.3 RCA Operation ............................................................................................................... 42
5.4 Calculations .................................................................................................................... 43
5.5 Optimizations ................................................................................................................. 44
5.6 Simulations ..................................................................................................................... 45
5.6.1 Ripple Carry Adder with Sensor Circuitry ................................................................... 45
5.6.2 RCA Vs CLA................................................................................................................ 46
5.7 Power Analysis ............................................................................................................... 48
5.7.1 Average Power Calculations ........................................................................................ 48
5.7.2 Energy Calculations ...................................................................................................... 49
5.8 Results Analysis ............................................................................................................. 49
6 Booth Multiplier......................................................................................................................... 51
6.1 Background .................................................................................................................... 51
6.2 Modified Booth Multiplier ............................................................................................. 53
6.3 Calculations .................................................................................................................... 54
6.4 Simulations ..................................................................................................................... 55
6.4.1 BM with RCA vs BM with RCA and Current Sensor .................................................. 55
6.4.2 BM with CLA vs BM with RCA and Current Sensor .................................................. 56
6.5 Results Analysis ............................................................................................................. 56
7 Conclusion ................................................................................................................................. 57
References ..................................................................................................................................... 59
vii
List of Tables
Table 1: Simulation results with area and delay optimization for RCA with and without sensor
circuitry. ...................................................................................................................................45
Table 2: Simulation results with area optimization for RCA with and without sensor circuitry.
.................................................................................................................................................46
Table 3: Simulation results with delay optimization for RCA with and without sensor circuitry.
.................................................................................................................................................46
Table 4: Simulation results with Area + Delay optimization for RCA and CLA. .....................47
Table 5: Area Overhead calculations with Area + Delay optimization. ....................................47
Table 6:Simulation results with Area optimization for RCA and CLA. ....................................47
Table 7: Area Overhead calculations with Area + Delay optimization. ....................................47
Table 8: Simulation results with Delay optimization for RCA and CLA. .................................48
Table 9: Area Overhead calculations with Delay optimization. ................................................48
Table 10: Average Power Calculations .....................................................................................48
Table 11: Energy per addition calculations ...............................................................................49
Table 12: Booth Encoding ........................................................................................................52
Table 13: Simulation results with BM with RCA with and without sensor circuitry. ...............55
Table 14: Simulation results with BM with RCA with sensor circuitry and CLA. ...................56
viii
List of Figures
Figure 1: Ripple Carry Adder ......................................................................................................... 6
Figure 2: Carry Look Ahead Adder ................................................................................................ 7
Figure 3: Carry Skip Adder............................................................................................................. 8
Figure 4: Carry Select Adder .......................................................................................................... 9
Figure 5: Current-Sensing Completion Detection (CSCD) circuitry for a CMOS Block ............ 17
Figure 6: Basic Current Sensors ................................................................................................... 18
Figure 7: Basic CSCD configuration ............................................................................................ 20
Figure 10 : A Simple CMOS Inverter ........................................................................................... 23
Figure 11: V-I Characteristics of CMOS inverter......................................................................... 23
Figure 12: Sense Inverter Implementation ................................................................................... 24
Figure 13: Current Sensor for Asynchronous Designs ................................................................ 27
Figure 14: Waveforms for Asynchronous Sensor ......................................................................... 27
Figure 15: Current Sensor for Synchronous Designs ................................................................... 29
Figure 16: Control Logic Block .................................................................................................... 30
Figure 17: Control Signals ............................................................................................................ 31
Figure 18: Waveforms for Synchronous sensor............................................................................ 32
Figure 19: 4-Bit Full Adder Circuit ............................................................................................. 33
Figure 21: Gate Level Representation of Full Adder Circuit ....................................................... 34
Figure 20 : Full Adder Block ........................................................................................................ 34
Figure 22: Gate level representation of 4-bit ripple carry adder ................................................... 35
Figure 23: Partial Full Adder ....................................................................................................... 37
Figure 24: Carry Look Ahead Adder ........................................................................................... 39
Figure 25: Carry Propagation graph for two 32 bit additions for 100000 random vectors. ........ 41
Figure 26: Ripple Carry Adder with sensor circuitry. ................................................................. 42
Figure 27 : Clock Division ............................................................................................................ 43
Figure 28: Optimization w.r.t Area ............................................................................................... 44
Figure 29: Optimization w.r.t Delay ............................................................................................. 45
ix
Figure 30: Booth Multiplier with Ripple Carry Adder and sensor circuitry................................ 54
x
List of Abbreviations
CSCD Current Sensing Completion Detection
RCA Ripple Carry Adder
CLA Carry Look Ahead Adder
BM Booth Multiplier
ALU Arithmetic Logic Unit
PFA Partial Full Adder
CDG Clock Delay Generator
PDA Personal Digital Assistants
DSP Digital Signal Processors
IC Integrated Circuit
1
Chapter 1
Introduction
In recent years there has been a rise in use of hand held devices such as Media Players,
Smart Phones, Digital Cameras and Personal Digital Assistants (PDAs). The features on these
mobile devices will require physically smaller designs with efficient area and power
management. The future of digital design holds ever larger and more complex digital circuits
than have been achieved so far. Many functions on these devices are expected to be specific to a
target application. A generic Integrated Circuit (IC) with versatile functionality may turn out to
be too expensive for a customized application which can do with fewer functions. In practice,
many available functions may not be required in the application; these functions consume a
significant portion of the component chip area and cost. In such situations the generic IC can be
replaced with a smaller circuit with fewer functions matched to the requirements of the
customized application.
Mobile systems incorporate Digital Signal Processors (DSPs). A DSP is a specialized
processor with an optimized architecture for the fast computational needs of digital signal
processing. Some of the most common functions performed in the signal processing are signal
filtering, convolution and Fourier transforms. In mathematical terms, these functions perform a
series of dot products, multiply and accumulate (MAC), additions and multiplications. General-
purpose microprocessors can also execute DSP algorithms successfully, but are generally not
suitable for use in mobile devices such as mobile phones and PDAs because of power and space
2
constraints. A customized digital signal processor, however, can provide a lower-cost solution,
with better performance and lower power consumption.
1.1 Background
As performance and battery life requirements are becoming more crucial in small and
lightweight systems like laptops and handheld devices, power consumption is becoming a critical
problem. For example, processors and microcontrollers for these small portable systems should
provide higher system performance while keeping power consumption within specified limits.
The performance of these chips is mainly dependent on the Arithmetic Logic Units (ALUs) used
in them to perform complex mathematical operations such as addition, subtraction, division and
multiplication. To save silicon area and power, some chips do not include dedicated arithmetic
circuits for multiplication and other complex computations which are considered to be very area
intensive and power hungry; multiplication is performed through repeated addition. In such
scenarios, there is a critical need for fast and power efficient adder circuits which are almost
constantly active during computation, and can have a major impact on overall system
performance.
Current trends in the CMOS VLSI design methodologies show steady scaling of feature
sizes to meet the speed and performance requirements for complex applications in the areas of
communication, aerospace, defense and consumer electronics. Extensive scaling of CMOS
process has given rise to many problems such as leakage current, short circuit current, large
process variations, etc. It is proving to be difficult to achieve higher performance with transistor
channel length scaled to few nanometers. These problems have to be addressed while designing
circuits for low power and high speed applications. Also, with CMOS technology approaching
3
its limit, new approaches for processes and design architectures are needed. Apart from scaling,
there has been much focus on very low power consumption and area efficient design
methodologies in developing consumer applications.
1.2 Motivation
Digital circuits are classified into synchronous and asynchronous circuits. Synchronous
circuits (and systems) include clocked elements like flip flops, latches and registers. The input of
these elements should be stable before the next active edge of the clock arrives. Therefore, the
behavior of these circuits can be predicted exactly. One major problem with these systems is the
distribution of clock signals which must be in phase without significant skew every place in the
chip. This makes designing a complex synchronous circuit operating at high frequency difficult
due to timing skew and delays.
Asynchronous circuits (and systems) are not governed by a clock circuit driving a global
clock signal, but instead they need only to wait for signals that indicate completion of operations
from a previous stage using a handshaking approach. In the handshaking approach, a functional
block needs to supply a "done" signal indicating that the results of the functional block are
available and valid.
In synchronous digital systems to achieve higher performances the clock should be
operated at higher speeds. But clock skew is consuming a significant portion of the clock cycle
and is limiting the maximum clock speed of the system. The clock frequency in these circuits
also depends on the worst case delay between any two clocked elements. With these problems
alternative ways of speeding up the operating frequency are required. One method is to optimize
the combinational block between the clocked elements to reduce the worst case delay. Another
4
method is to use techniques for detecting the operation of the combinational block and generate a
signal that indicates when the combinational block or its part has completed computation. This
type of completion signaling can also be used in self-timed asynchronous designs, where the data
between logic blocks are transferred asynchronously utilizing handshake circuits for signaling
completion rather than clocks.
Some methods of signaling computation completion proposed in the literature are Activity-
Monitoring Completion Detection (AMCD) [23, 24], Current-Sensing Completion Detection
(CSCD) [2, 22], and Transition-Monitoring Completion Detection (TMCD) [23, 24]. In Chapter
2 of this thesis AMCD, CSCD, and TMCD methods are discussed from the design point of view.
Some benefits and drawbacks of activity monitoring methods are presented. The methods
reviewed in Chapter 2 suffer from serious limitations in low power and low voltage applications
so in Chapter 3 we propose a new method for completion detection which addresses the
limitations suffered by earlier methods and also present a faster current sensor design for
synchronous designs.
As discussed earlier, the performance of the digital circuits depend on the computation
capabilities of ALUs which they incorporate. The major blocks in these ALUs are adder and
multiplier units; multipliers also incorporate adders. Various designs of adder circuits have been
proposed earlier which are efficient in area or delay. These adders are discussed in section 1.2.1,
which gives a detailed analysis with respect to area and delay. Our idea here is to select the adder
design which will suit the need for smaller circuit designs for mobile applications. Also we will
try to incorporate additional capabilities of completion detection techniques to enhance the
performance of the overall design.
5
1.2.1 Background on Adder Circuits
Adder circuits are the considered important system blocks for arithmetic logic units used in
microprocessors and microcontrollers which perform complex computations in applications. To
achieve better performance different types of adder circuits have been designed. Their usages in
different applications depend on the following three factors:
1. Delay
2. Area
3. Power consumption
It is hard to design adder circuits which take all these factors into consideration and deliver
higher system performance. This made researchers look for alternative ways to design circuits in
an optimized and customized way to achieve performance by speeding up operations in complex
applications. It is well known that faster computation capability will result in a higher system
performance. Often computation capability is mainly dependent on adder circuits used in the
arithmetic logic units of the processor.
Some basic adder circuits in use are listed below. Each of these adders has his own
advantages and disadvantages. These are:
a) Ripple Carry Adder
b) Carry Look Ahead Adder
c) Carry Skip Adder
d) Carry Select Adder
This section presents a detailed discussion of each adder circuit with respect to delay and
area and also discusses their limitations in practical use.
6
a) Ripple Carry Adder
The Ripple Carry Adder, being the simplest one, uses the least hardware circuitry but
suffers from the worst case carry propagation delay. Figure 1 shows a 4-bit ripple carry adder
with worst case delay path from C0 to C4. A Higher impact on performance is seen for 32-bit or
64-bit adder circuits where the worst case carry propagates from the LSB to MSB block. The
delay of the ripple carry adder increases linearly with the number of bits with a worst case delay
of O(n) [13]. The RCA has a very compact design area given by O(n) [13]. This worst case delay
makes it slow when large bit sizes are used. The advantage for ripple carry adder is that it uses
much less hardware circuitry when compared to all other traditional adder circuits in use.
b) Carry Look Ahead Adder
The Carry Look Ahead Adder has lower delay but requires much more complex circuitry in
achieving its performance. The complex circuitry is added to find in advance the generation and
propagation of carry bits. The carry at ith bit position is given by
Ci = Gi + Gi-2 Pi-1 + Gi-3 Pi-2 Pi-1 + ??. + G 0 P1 P2 ?.. P i-1 + C0 P1 P2 ?.. P i-1
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
A0 A1 A2 A3 B1 B0 B2 B3
S1 S3 S0
C3 C2
2
C1
2
C0
2
S2
Figure 1: Ripple Carry Adder
C4
7
The size and complexity for a large adder with the carry signals generated using this
equation is not practical. So the carry generation and propagation bits are computed by making
groups of carry blocks (usually four bits) as shown in Figure 2. Such a block generates a group
carry which give the carry bit to the next block, giving time to the sum units to perform their
calculation. This block also calculates generation and propagation information and provides it to
next higher level block in the hierarchy.
The added circuitry contributes to more power consumption. Also large fan out within each
block circuitry contributes to additional delay in the adder circuit. The delay of a carry look
ahead is given by O(log(n)) [1] but will require much larger area given by O(nlog(n) [1].
c) Carry Skip Adder
The Carry Skip Adder reduces the carry-propagation time by skipping over groups of
consecutive adder stages. The carry skip adder is usually comparable in speed to the carry look-
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
A0 A1 A2 A3 B1 B0 B2 B3
S1 S2 S3 S0
C4 P3 G3 C3 P2 G2 C2 P1 G1 C1 P0 G0
4 -Bit Carry Look Ahead PG GG
C0
Figure 2: Carry Look Ahead Adder
8
ahead adders, but it requires less chip area and consumes less power compared to carry look
ahead adders. To accelerate the carry propagation it adds additional group carry bypass paths to
its ripple path and the carries can bypass the ripple path when the group propagate signal is high.
Figure 3 shows the basic structure of an N-bits carry-skip adder with k-bits per group. Each
block is composed of ripple carry adder blocks of fixed size and carry skip chain. Each block in
the carry skip adder generates a propagation bit, Pi, when Pi = 1, the carry in of the ith block is
passed to the (i+1)th block without waiting for the input carry to ripple. The group propagate
function is given by
Pi = (Ak-1 ? Bk-1) ? (Ak-2 ? Bk-2) ???. ? (A 1 ? B1) ? (A0 ? B0)
This scheme is faster than a Ripple Carry adder with a small overhead in adding bypass
circuitry to decrease the carry propagation delay. However, the delay is linearly dependent on the
number of bits N [7].
G0
A0 B0
Sk-1
Bk-1 Ak-1
. . .
G1
Bk B2k-1 A2k-1
. . .
S0
. . . . . .
S2k-1 Sk
. . . . . .
Gi
AN-k-1 BN-k-1 BN-1 AN-1
. . .
SN-1 SN-k-1
. . . . . .
Ak
or or
. . .
. . .
or
P0 P1 Pi
and and and
Cin
Figure 3: Carry Skip Adder
Cout
9
d) Carry Select Adder
The Carry Select Adder consists of two ripple carry adders and one multiplexer. The two
adders are used to calculate the addition twice; one addition is computed assuming carry input
?1? and the other as ?0?. This adder implementation does not wait for propagated carry signal;
rather it first generates sum and carry-out pairs by using both possibilities of carry-in signal at
each bit position. The correct output is then selected upon the arrival of carry-in signal. The
typical delay of this design is given by O(?n) [7] but at the cost of much larger hardware
circuitry which is twice that of ripple carry adder as shown in Figure 4 for a 16-bit carry select
adder. This design also contributes to much larger power consumption due to its complex
circuitry.
Figure 4: Carry Select Adder
10
1.2.2 Ripple Carry Adder and Carry Look Ahead Adder Discussion
In section 1.2.1 we have analyzed the different available adder circuits but the most
suitable adder circuits for small customized applications is the ripple carry adder. The idea in this
thesis here is to modify the ripple carry adder by adding additional capabilities to suit the
performance and area needs. This modification should result in performance comparable to the
other fast adder circuits without suffering from significant additional overhead issues. Although
the classical Ripple Carry Adder has the most compact design (O(n) area) [13] of all adders, it is
slow due to its long worst case carry-chain length. But this worst case of carry propagation is
seen only in very few cases in practice. So designing a circuit that meets this worst case
propagation delay will have impact on the system?s performance in most of the other cases
where the result is available earlier. The idea here is to design a self timed circuit which performs
operations on a much faster rate taking into consideration the worst case performance of the
adder circuit.
Our proposed idea is to monitor the circuit activity only on the specific selected nodes of
the circuit which are considered to be critical for the overall circuit delay. This monitoring is
based on current sensing completion detection which relies on the basics of CMOS operation
which states that during any active phase of a CMOS circuit the current is drawn from the power
supply. This current is also known as switching current. Based on this fact, we can deduce that a
circuit has completed all its operations when the switching current ceases or falls below a certain
reference threshold value. One point to be considered here is that the current sensing circuitry
adds some additional overhead to the circuit design with respect to the area and delay. The
challenge is to keep this as small as possible in designing the sensing circuitry so as to have a
minimal impact on the overall circuit performance.
11
Carry Look-ahead Adders are significantly faster than their counterpart Ripple Carry
Adders but require much larger area (O(nlog(n)) [1]. CLAs speed up the computation of the
addition operation by using complex circuitry for calculating the carry generation and
propagation information at each bit position in advance. This helps in speeding the addition
operation but at the cost of much larger circuitry. This larger circuitry also results in substantially
higher power dissipation. In contrast, for ripple carry adder to determine the carry for a specific
bit position the carry input at the previous bit position must be calculated, this adds up to much
larger delay in the calculation. This implementation of Ripple Carry Adder uses less circuitry
when compared to traditional Carry Look Ahead Adder.
While timing in designs using conventional RCA designs must allow for the worst case
carry ripple delay for every addition, for many input cases the correct result is in fact available a
lot earlier. The incorporation of a completion signaling mechanism into a RCA offers a way for
improving the ?worst case? RCA performance, which is given by O(n) [13] to ?average case?
RCA performance O(log(n)) [13] over a set of repeated additions. This can provide a mechanism
for the RCA to signal to the higher level circuitry controlling it that it has completed the current
operation and to be ready to schedule the next operation. Thus, for example, if 32 repeated
additions are to be performed to multiply two 32 bit numbers, using completion signaling to
initialize the next addition (before waiting out the worst case delay) can cut down the total
multiplication time from 32 worst case addition delays, to 32 ?average? case delays. This can
achieve performance comparable to that attainable from much more expensive carry look ahead
adders. These benefits also hold for more typically used Booth?s algorithm based multiplication
implementations that reduce the number of additions needed by skipping additions for strings of
0?s and strings of 1?s in the multiplier input.
12
1.3 Problem Statement
While the earlier work has established that sensing the current from the power supply of a
CMOS block offers a valid means for generating a completion signal, introducing current sensors
in the functional power supply has not proven practical. These current sensors suffered from
significant issues. In this thesis we try to address the problems encountered in earlier designs.
The problems addressed here are
1) Development of new approach for current sensing completion detection.
2) Design of current sensors for Synchronous and Asynchronous designs.
3) Design of adder and multiplier circuits with completion signaling capability.
1.4 Contribution of This Thesis
In this thesis we present a novel approach where key individual (late switching) circuit
nodes are probed (through sensor lines) and monitored to observe the absence of further
transitions. These monitors again detect a quiescent stable state in the probed signals by sensing
the current drawn from the power supply, but in this case only the supply current drawn by the
sensor circuits need be observed. The overall switching current in the functional block does not
need to be monitored. This avoids the key problems associated with the classical Current-
Sensing Completion Detection (CSCD) approach.
This thesis presents the analysis between a traditional ripple carry adder with carry
completion detection and carry look ahead adder. A classical RCA design uses a string of full-
adders, consisting of 2 gate delays each, that are interconnected with the (ripple) carry signals. In
an experiment running 100000 random input vectors through such a 32 bit ripple carry adder, the
13
average delay is observed to be 12-14 gate delays associated with an average 6-7 bit longest
carry chain length. (Other equal size or shorter carry chains can also be simultaneously active for
the same computation.) Compared to the worst case 32-bit carry chain length, which is 64 gate
delays, the average case delay reflects a dramatic 5-6X performance improvement. This
suggests that a RCA with computation completion signaling can potentially incorporate the low
power and area of RCAs with average performance comparable to that of CLAs. However, there
is an additional overhead associated with the circuitry producing the completion signal; the
challenge is to keep this as low as possible.
1.5 Organization of Thesis
The thesis is organized as follows. In Chapter 2, we discuss a general background of the
earlier techniques for current sensing based completion detection, their pros and cons, and their
use in practical applications. Also we discuss their limitations with respect to current technology.
In Chapter 3, a novel approach for current sensing completion detection is presented and its
operation is discussed. Chapter 4 presents a detailed overview of the performance of Ripple
Carry Adder and Carry Look Ahead adders. New design of ripple carry adder with current
sensing completion detection is proposed and discussed in Chapter 5. In Chapter 6, a Booth
multiplier with new CSCD circuitry is presented and its performance improvement with
traditional Booth multiplier is evaluated. Conclusions and future work are discussed in Chapter
7.
14
Chapter 2
Prior Work on CSCD Design
Researchers in the past have developed many methods for completion detection. This
chapter discusses some of these techniques and past work on current sensing methods, their
limitations and how these design methods would fail in the current scenarios where technology is
being scaled to nanometers. This part also discusses several possible designs (in standard low-
power CMOS process) that can be used as the current sensing elements in CSCD circuitry.
2.1 Completion Detection Methods
Transitions in a CMOS digital circuit can be monitored by two methods. The first method
is based on the current sensing. A CMOS circuit draws current from the power supply when both
N and P transistor are ON. When the current ceases indicates that all the transitions in the circuit
are completed. Monitoring the current drawn from the power supply provides a means for
signaling the completion of the CMOS block. The second method is voltage monitoring. During
transitions the internal node capacitances are charged or discharged. Completion signaling is
done by monitoring the voltage transitions on internal node capacitances. In literature various
methods are proposed for current and voltage monitoring. The methods used for voltage
monitoring are Activity Monitoring Completion Detection (AMCD) and Transition Monitoring
Completion Detection (TMCD) [23, 24]. In AMCD, Activity-Monitoring (AM) circuitry is used
to observe the transitions on the internal nodes of the CMOS block. The TMCD method is
similar to AMCD but offers additional functionality to observe activity from glitches. The issues
15
with these methods are that they add additional capacitances on the monitoring nodes which
introduce additional delays in the circuit. Also extra routing lines are needed for the activity
monitoring circuitry to observe transitions on the internal node. With respect to performance and
area, AMCD method gives better performance than TMCD method while consuming less area
and power. The approach used for current monitoring is Current Sensing Completion Detection
(CSCD) method where the current drawn from the power supply by the CMOS block in
transition is monitored. The current is collected in a current sensor and compared with a pre-
selected threshold. When the monitored current is above the threshold value indicates the CMOS
block is in operation and when it drops below the threshold completion is signaled.
2.2 Classical CSCD Design
In theory, CMOS designs offer a simple way to determine when all switching activity has
ceased and the circuit has reached steady state: there should be only minimal leakage current
drawn from the power supply. A number of previous researchers have investigated ways to
implement Current-Sensing Completion Detection (CSCD) [1-4] using circuits that monitor the
power supply of circuit blocks as shown in Figure 5 and indicate when the outputs stabilize.
Unfortunately, monitoring the power supply current necessarily involves introducing additional
circuitry between the logic gates and the power supply. This can seriously degrade the supply
voltage available to the logic, particularly in low-power, low-voltage designs, and thereby
significantly impact circuit performance. Furthermore, it remains very challenging to design
current monitors with a sufficient dynamic range to provide for the peak switching current
demands of a functional block of a hundred or more gates, while still being sensitive enough to
reliably indicate when the last gate completes its switching transition. These difficulties have
16
prevented the practical implementation of current sensing based completion detection. Classical
CMOS circuits have the inherent property of only consuming power during transitions when
both pull up and pull down transistors are on and circuit capacitances have to be charged and
discharged. Only a small leakage current is drawn from the power supply in steady state. It is
therefore straight forward to see how monitoring current consumption provides real-time
information on circuit switching.
The implementation of a CSCD design has a few requirements.
1. First, the circuit that is being monitored needs to be capable of being an asynchronous,
stand alone, functional block that can take advantage of the completion signaling.
2. Second, the circuit being monitored needs to be combinational logic that contains a
subset of signals in which at least one will be in the process of transitioning at any given
moment if the operation is still in progress.
Assuming these requirements are satisfied, any combinational block can, in principal, be
modified to use CSCD circuitry. Previous CSCD designs [1-4], including those implemented for
ripple carry adders have all employed circuit architecture conceptually similar to the one shown
in Figure 5. Unfortunately, such designs have some significant drawbacks. Perhaps most
important is the voltage degradation across the current sensor circuits, which can severely impact
the switching delay of the CMOS logic. Furthermore, since typically a single pair of current
sensors is used to monitor supply current for the complete functional block (e. g. a 32-bit ripple
carry adder), the current sensors must be designed to allow a large peak current to pass through,
while being able to accurately discriminate about a threshold current due to a single active gate.
Managing such a large dynamic range of supply currents is the classical challenge faced by
designers of on chip current sensors [1-4]. The only real solution is to minimize the number of
17
gates in each block being monitored, but that greatly increases the number of current sensors
needed, and results in unacceptable area overhead.
Earlier CSCD designs which directly monitor the circuit power supply current to create a
completion signal are shown in Figure 5. Recall that CMOS circuits only consume power during
switching and therefore when the switching stops it can be assumed the calculation is completed.
In Figure 5, the power supply current sensors are to detect when the current level has been
reduced below a threshold. At this point a ?done? signal is asserted to signify to the higher level
control circuitry that the operation has completed.
Pull-Up
Pull-Down
CMOS Data In
Current
Sensor
Vdd
Data Out
Current
Sensor
Done
Figure 5: Current-Sensing Completion Detection (CSCD) circuitry for a CMOS Block
18
2.3 Current Sensor Design Approaches
The two main designs of current sensors for completion detection are resistance based and
current mirror based.
Figure 6: Basic Current Sensors
The first method is resistance based shown in Figure 6a the detected current flows through
resistance R1. The value of the resistance R1 should be chosen such that the voltage between
base-emitter junction of the BJT is above 0.7 ? 0.8 V. Here a large value resistance is required
as the sensed current has low values. This resistance based method has limitations when used in
low-voltage applications where supply voltage is as low as 1V. The first being the unacceptable
voltage drop across the base emitter junction and second the large value of the resistance R1
which limits the current in the functional path of the logic block. Also future trends concentrating
on CMOS designs, inclusion of bipolar transistors is not suggested.
The second method is the current mirror based shown in Figure 6b. Transistors M1 and M2
are connected in a push-pull inverter configuration. These transistors operate in saturation and
act as current mirror. The sensed current Iin is mirrored and converted into voltage Vout. The
disadvantage with this design is large values for W/L ratios are needed since current values are
19
low. Also, overall delay of the current sensor increases due to high gate capacitances, thus
making faster transitions invisible to the sensor.
2.4 Applications Using Current Sensor Completion Detection
The basic configuration for incorporating CSCD in CMOS block is shown in Figure 7
which incorporates two current sensors. The first one between the logic block and VCC, the
second between the GND and the logic block. The current sensors are used to detect a current
above a predefined threshold and generate a signal as long as the transient current flow is above
this threshold.
Ideally two current sensors are needed in both the GND and VCC supply paths. When the
circuit drives large loads in those situations the current mostly flows either through the NMOS or
PMOS network discharging or charging the output load capacitances. This configuration is
expected to work independent of the load variations on the output of the logic function.
20
Figure 7: Basic CSCD configuration
The CSCD design also incorporates few other components. These are:
1. Latches: The input latches are used to provide inputs to the logic function. These
latches also remove any the differences in the arrival time of the input signals at the
logic block.
2. MDG: The Minimum Delay Generator (MDG) is incorporated to guarantee that
minimum time has elapsed for the logic function to compute correct outputs. The input
to the MDG is from the sequential control which is the moment when the inputs are
fed to the logic function. This time is considered as start time for the computation.
3. NOR Gate: The NOR gate is provided inputs from for the current sensors and
minimum-delay generator. When the all the inputs from them are active low the
completion signal ?done? is signaled.
Vdd
Gnd
Logic Function Input
Vcc
Output
Current
Sensor
Done
MDG
Current
Sensor L
A
T
H
E
S Seg. Ctrl.
21
The two important factors which could affect the operation or performance of the logic
function being monitored are:
1. Voltage drop across the current sensor which limits the supply voltage to the logic.
2. Resistance offered by the current sensor which limits the current flow to the logic
function which slows down the switching speed of the logic function.
CSCD as presented here suffer from serious limitations of static power consumption and
voltage drop in the power supply voltage provide to the logic function thus making them
unsuitable for low-power and low-voltage applications. The next section proposes a novel idea
which addresses the issues presented here.
22
Chapter 3
Proposed CSCD Design
Previous methods using the technique of current sensing for completion detection
introduced additional overhead in delays, degrading the performance of the overall circuit. This
thesis presents a novel approach for completion detection design that does not monitor the
current in the power supply of the functional block. Instead, only a selected set of individual late
settling signal nodes in the circuit are observed by connecting simple inverter sensors to them,
and monitoring the current drawn by these sensors from the power rail.
3.1 CMOS Inverter
A CMOS inverter is constructed by two complimentary transistors, PMOS and NMOS. It
gives an output voltage which is of the opposite logic-level to its input. A CMOS inverter draws
current from the power supply when both the transistors are ?ON?. This current is seen when
there is transition at the input of the inverter from Logic ?1? to Logic ?0? and vice versa. Figure 8
represents a simple CMOS inverter and Figure 9 shows the plots of V-I characteristics of the
inverter. Observe from the V-I graph that the drain current Id is generated only when the signal at
the input of the inverter makes a transition from High to Low and vice versa. This simple
concept can be used to detect the transitions on the late settling nodes in a combinational circuit
and determine when they have reached a steady state.
23
Figure 8 : A Simple CMOS Inverter
Figure 9: V-I Characteristics of CMOS inverter
24
3.2 Sense Inverter Design
The proposed method for implementing the current sensor involves using a sense-inverter
such as the one shown in Figure 10. CMOS inverters draw current from the power supply as long
as their input is midrange between a high and a low voltage, shown in Figure 9, such that both
the P and N transistors see a gate source voltage above their respective threshold voltages. The
supply current does not flow once the input voltage levels go below the threshold of either
transistor types. Thus, monitoring the supply current of these inverters provides a means of
determining if the input has stabilized close (within a threshold voltage) to a high or low.
Figure 10: Sense Inverter Implementation
Incorporating a sense-inverter current-sensing circuit implementation into a standard 32-bit
RCA involves adding a minimally-sized inverter to each of the 31 carry signals as shown in
Figure 10. The carry output of the final stage is neglected in this case. Recall that standard cells
in typical cell libraries generally use transistors that are sized up in width by a 5-10X factor for N
transistors and 10-20X (or more) for P transistors for performance. Using a minimally-sized
inverter in this application ensures that the sense-inverter does not significantly load the carry
signals in comparison to the normal load on the lines, minimizing any performance impact.
25
Observe in Figure 10 that only the current in the sense-inverters is monitored. Furthermore the
sense inverters have no load at the output, other than parasitic capacitances associated with the
small minimum sized transistors. Thus a single current monitor connected on the ground rail is
sufficient to provide a reliable switching completion signal. In the case of large capacitive load
on outputs, it is possible that the N transistor turns off while the P transistor is still on and
charging the output capacitance for significantly longer. This is the reason for the double sensors,
one for each supply rail, in classical CSCD designs. Importantly, the entire current-sensing
circuitry is only associated with the sense inverters, which are not in the functional path. The
power supply of the functional logic is not monitored. Thus impact on functional performance is
minimal. Selecting only a limited number of circuit nodes to be monitored also reduces the peak
and dynamic range in the current observed at the current sensor. Even so, observe in Figure 10
that the peak current, when all the inputs simultaneously make transitions, can be 31 times (in
case of 32 bit Ripple Carry Adder) the largest current seen in an inverter.
By adding sense invertors the overall circuit switching current is not monitored. In this
way, functional performance in only minimally impacted (by the small additional fan out loading
due to the sensors), while the completion of switching at the observed nodes can be reliably
detected by monitoring the current drawn by the sensors. The proposed new design approach
thus overcomes key limitations of earlier approaches discussed in Chapter 2.
26
3.3 Current Sensor Design
In this section we present two designs of current sensors for synchronous and asynchronous
circuits.
3.3.1 Current Sensor for asynchronous designs
Asynchronous designs do not implement a clock. These circuits are self timed; when the
activity in the circuits stops the output ?Done? signal is flagged indicating the completion of the
operation. Figure 11 shows the design of a current sensor for asynchronous logic circuits. The
size of transistor P1 is chosen to be L\W=5\1 to get sufficient bias current to keep both the
transistors N1 and N2 in saturation. So when a small current from the sensor flows, the current is
mirrored into Iout. The input voltage to the inverter combination of transistors N4, N3 and P3
gets a voltage equal to Vdd minus the voltage drop across drain and source of transistor P2.
Transistor N4 is used to make sure that a small voltage swing at the input of the inverter causes a
full swing at the output. Previous CSCD designs have also used a Minimal Delay Generator
(MDG) that is also adapted in our design. The purpose of the MDG is to ensure that enough time
has elapsed for the current-sensing circuit to begin functioning properly. As shown in Figure 12,
the output of the MDG should be asserted and the comparator must be de-asserted in order for
the ?done? signal to be asserted. Figure 12 shows the waveforms for the simulation of a 32 stage
carry propagation.
27
Figure 11: Current Sensor for Asynchronous Designs
Figure 12: Waveforms for Asynchronous Sensor
It should be noted that the discharge time of Vsens depends on the time constant
approximated by the product of the capacitance on Isens line and the channel resistance of
transistor M1 in parallel with channel resistance of transistor P1. Also, the charge time of Vout
depends on the time constant calculated by the product of the capacitance on Iout line and the
channel resistance of transistor M2 in parallel with channel resistance of transistor P2. For
correct operation of the current sensor we need a voltage swing of 0-1V on the Vout node. To
achieve this, the resistance seen on Vout line should be high. This high value of resistance
increases the charging time constant on Vout node increasing the response time of the current
28
sensor, thus affecting the performance of the circuit. The additional delay added in the current
sensor response is found to be 1.2ns but the actual completion is over a lot earlier. This overhead
is always added after completion of each operation degrading the overall performance of the
circuit. With this degraded performance the asynchronous current sensor cannot be used in
practice. The synchronous current sensor presented in Figure 13 has to sense only a differential
voltage change when compared to large voltage swing required for asynchronous sensor. After
sensing the differential voltage the full voltage swing is provided by the latch operation of
inverters connected in regenerative feedback configuration, thus we can see better response by
the sensor. The next section presents the design of synchronous current sensor.
3.3.2 Current Sensor for synchronous designs
Synchronous designs have a clock so all the operations within the circuits are synchronized
with the master clock. The ?Done? signal must be asserted before the active edge of the clock so
as to synchronize the next operation. Our current sensor also uses a master clock from which
signals to control the current sensor are generated. Referring to Figure 12, the current sensor
consists of two inverters, Inv1 and Inv2, connected in a regenerative feedback configuration. The
sources of the two PMOS transistors, T5 and T6, of these two inverters are connected to power
through another P transistor, T3, which serves as a switch to trigger these inverters. The
measured voltages are connected to the sensor through two pass transistors, namely T1 and T2.
The gates of T1, T2, T3 and T4 are connected to ?accum?, ?precharge? and ?eval? control signals
generated from control logic.
29
Figure 13: Current Sensor for Synchronous Designs
During the precharge phase, T4 is on which maintains the inverter inputs at same voltage.
In the accumulation phase, T1 and T2 are on. This allows the charge from the measured voltages
to build up on the inverter inputs, but the output the inverters are not affected as T3 is turned off,
which latches the voltages into the inverter inputs. In the evaluation phase, the power to the
inverters is turned on. When the inverters are powered up, T3 turned on, the inverter with the
higher input voltage dominates, and its output is pulled low. The outputs from both inverters are
connected to another set of inverters Inv3 and Inv4, which serve to buffer the sensor output from
the rest of the circuit. If the current from the current generator is greater than the reference
current, then the node that was connected to the current generator (output of Inv1) is pulled high.
This node is connected to one of the output inverters (Inv3) and pulls its output low. This creates
the active-low signal that indicates carry propagation. It should be noted that the current sensor is
triggered before the evaluation period and the output of the RCA is valid only when the
?addcomp? signal is high, if not, the next cycle is tested for the addition completion.
30
In practice, the supply current cannot be more than about 20X the detection threshold
current. The reduced voltage across the inverters automatically limits the current drawn from the
power supply. Note that while this serious degradation in available voltage across the sense-
inverters is acceptable in our design, because these sense-gates are not in the functional path, the
performance impact of such supply degradation would be unacceptable if the current sensors
were in the functional power supply, as in traditional CSCD designs. This is a major reason why
past CSCD designs have not been very practical. In fact, design of the current sensor circuits to
meet this demanding requirement of handling a very wide range of supply currents without
excessively degrading functional performance was the key challenge in earlier designs.
4.3.2.1. Control Logic Block
As shown in Figure 14 the control signals are generated using the clock and a delayed
clock signal as inputs. The Clock Delay Generator (CDG) is used to generate the delayed clock.
The delay of the CDG is set to the evaluation time of the current sensor. Typically we see this
delay as an additional overhead on the overall circuit delay over one addition operation. This
overhead can be minimized by designing the current sensor with faster invertors in the
regenerative configuration.
Figure 14: Control Logic Block
31
4.3.2.2. Control Signals
The Control logic generates three signals, corresponding to the three phases Precharge,
Accumulation and Evaluation, of the current sensor operation. The states of transistors T1, T2,
T3 and T4 of Figure 13 are shown in Figure 15.
Figure 15: Control Signals
4.3.2.3. Waveforms
Figure 16 shows the voltage waveforms at different nodes of the current sensor circuit for a
32 bit addition, where 32 stage carry propagation is generated. These waveforms show analysis
for three evaluation phases.
1. Start: Start indicates the time when the inputs are fed to the circuitry. The inputs for
transistors T1, T2, T3 and T4 of Figure 13 are shown in Figure 15.
2. Evaluation Phase-1: Here the sensing voltage, V(sens), is greater than the reference
voltage, V(ref). This indicates the operation is not completed yet so V(addcomp) is
Active LOW.
3. Evaluation Phase-2: The sensing voltage, V(sens), is still greater than the reference
voltage, V(ref). This indicates the carry is still in propagation and V(addcomp) is
Active LOW.
32
4. Evaluation Phase-3: Now the sensing voltage, V(sens), is less than the reference
voltage, V(ref), indicating the operation is completed and V(addcomp) is set to Active
HIGH.
Figure 16: Waveforms for Synchronous sensor
33
Chapter 4
Ripple Carry Adder and Carry Look Ahead Adder Analysis
In this chapter the performance of a ripple carry adder for delay in the best and worst case
scenarios is discussed. Also ripple carry adder performance is compared to a carry look ahead
adder with carry propagate and carry generate block designed using grouping of 4 bits.
4.1 Ripple Carry Adder
A Ripple carry adder consists of blocks of 1-bit full adders connected in series with the
carry out of one block serving as carry in to the next block. Figure 17 shows the interconnection
of a 4-bit ripple carry adder. It can be observed that the critical path for this design is the path
from C0 to C4.
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
1-Bit
Full Adder
A0 A1 A2 A3 B1 B0 B2 B3
S1 S3 S0
C3 C2
2
C1
2
C0
2
S2
C4
Figure 17: 4-Bit Full Adder Circuit
34
4.2 Full Adder
A full adder performs additions of three 1-bit numbers, A, B, Cin, and gives outputs Sum,
and Carry out, Cout. The expressions for sum and carry are given by
S = CinBA ??
Cout = ))(()( BAC inBA ????
Figure 19: Gate Level Representation of Full Adder Circuit
Figure 19 shows the gate level representation of a full adder circuit with sum and carry
outputs. The sum output takes two XOR gate delays and the carry out has delay of two gates; one
AND gate and one OR gate.
1-Bit
Full Adder
B
A2
A
S
Cout Cin
Figure 18 : Full Adder Block
35
4.3 Performance Analysis of RCA
Figure 20 shows the gate level representation of 4-bit full adder circuit with sum and carry
outputs. The addition of two bits at any stage depends on the carry generated by the addition of
the two bits in the previous stage. Thus, the sum of the most significant bit is only available after
the carry signal has rippled through the adder from the least significant stage to the most
significant stage. The critical path for this design is the carry propagation path from C0 to C4
which is 8 gate delays. Extending the design concept for 32- bit ripple carry adder gives us the
critical path delay of 64 gates.
Figure 20: Gate level representation of 4-bit ripple carry adder
4.3.1 Best Case Performance
The best case performance for a ripple carry adder is seen when there is no carry generated
at each bit position. In the case of the 4-bit adder in Figure 20 the sum outputs are available just
after two xor gate delays. This can be observed with input values A = 1111, B = 0000 and C0=0,
which gives C1=C2=C3=C4=0 at each bit position. For this case there are no carry bits generated
and since no propagation takes place the sum outputs are valid after two gate delays.
36
Delays:
S0 = 2 gate delays
S1 = 2 gate delays
S3 = 2 gate delays
S4 = 2 gate delays
4.3.2 Worst Case Performance
The worst case performance for a ripple carry adder is seen when the input carry, Cin is
propagated to next stage by each bit position. Because of carry propagation from the least
significant position to most significant position the sum bits generation takes variable time
delays. In the case of the 4-bit adder in Figure 20, this can be observed with A = 1111, B = 0000
and C0=1, which gives C1=C2=C3=C4=1 at each bit position. Since the carry is propagated the
delays for the sum outputs is observed to be two to eight gate delays.
Delay:
S0 = 2 gate delays
37
S1 = 4 gate delays
S3 = 6 gate delays
S4 = 8 gate delays
4.4 Carry Look Ahead Adder
The carry look ahead adder calculates carry bits simultaneously using complex logic
circuitry. This minimizes the worst case delay to calculate the sum bits. Due to large fan out on
the gates, the carry generation and propagation bits are calculated by making groups of carry
blocks (usually four bits) as shown in Figure 22.
The primary block of the carry look ahead adder is Partial Full Adder (PFA). Each block of
PFA takes three bits A, B, C; as inputs and generates three outputs G, P and S. These are
Generate function, G = A ? B
Propagate function, P = BA?
Sum function, S = CBA ??
Figure 21: Partial Full Adder
38
4.5 Performance Analysis for 4-bit CLA
Figure 22 shows the gate level architecture of a 4-bit carry look ahead adder. In order to
construct a 4-bit CLA four PFAs are needed to generate the signals. When Pi = 1, an incoming
carry is propagated to the next bit position from Ci to Ci+1. For Pi = 0, carry propagation to the bit
position is blocked. Regardless of the value of Pi, when Gi = 1, the carry output from the current
position is ?1?.
The carry output has the following logic:
C1 = G0 + P0C0
C2 = G1 + P1C1 = G1 + P1G0 + P1P0C0
C3 = G2 + P2C2 = G2 + P2G1 + P2P1G0 + P2P1P0C0
C4 = G3 + P3C3 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
The carry look ahead block takes only two gate delays to generate carry bits from C1
through C3. The implementation of C4 is becomes more complicated when this 4-bit carry look
ahead adder is extended to multiples of 4 bits, such as 16 bits and 32 bits. The carry look ahead
adder for 4 bits computes carry bits at all the bit positions simultaneously. The longest delay in
the 4-bit carry look ahead adder is 4 gate delays, compared with 8 gate delays in the ripple carry
adder. The improvement is very modest but at the cost of much additional hardware. From
Figure 22 it can be observed that the fan in for generating the C4 is 4, if we extend the same
concept to generate more carry bits simultaneously then the high fan-in for generating the carry
bits could contribute to additional adder delays. So in such cases the carry bits are limited to
groups of 4 and are extend to next higher level of blocks to generate the carry bits. For a 16 bit
carry look ahead adder the higher numbered bits 4 - 7, 8 - 11, and 12 - 15 are grouped together.
For this in positions 4, 8, and 12 we would like the carry to be produced as fast as possible
39
without using excessive fan-in. The estimated worst case delay for a 32-bit CLA is 8 gate delays
but high fan-in in the carry generation block could add additional delays in the adder circuit.
Figure 22: Carry Look Ahead Adder
40
Chapter 5
Proposed Ripple Carry Adder Design with Completion Signaling
5.1 Background
A classical ripple carry adder design uses a string of full-adders, consisting of 2 gate delays
each, that are interconnected with the (ripple) carry signals. In 1946 Burks, Goldstine, and von
Neumann published a proof that the expected maximum length of carry propagation in the
addition of two binary n digits each does not exceed log2n [26]. In an experiment running
100000 random input vectors through such a 32 bit ripple carry adder, the average delay is
observed to be 12-14 gate delays, associated with an average 6-7 bit longest carry chain length as
shown in Figure 23. (Other equal size or shorter carry chains can also be simultaneously active
for the same computation.) Compared to the worst case 32-bit carry chain length, which is 64
gate delays, the average case delay reflects a dramatic 5-6X performance improvement. This
suggests that a ripple carry adder with computation completion signaling can potentially
incorporate the low power and area of ripple carry adders with average performance comparable
to that of carry look ahead adders. However, there is an additional overhead associated with the
circuitry producing the completion signal; the challenge is to keep this as low as possible.
41
Figure 23: Carry Propagation graph for two 32 bit additions for 100000 random vectors.
5.2 RCA Design
The new proposed design for ripple carry adder with completion signaling is shown in
Figure 24. The new carry completion signaling design discussed in chapter 3 is incorporated in a
32-bit ripple carry adder to produce a carry completion signal. This modified ripple carry adder
has a strong delay dependency on carry chain length, which is the critical path for the design. So
monitoring the activity on the carry out lines only gives us information about the delays seen in
calculating the sum bits for different input vectors. A clock is used for the operation of the
current sensor circuitry. As discussed earlier this clock is used to generate control signals.
42
Figure 24: Ripple Carry Adder with sensor circuitry.
5.3 RCA Operation
The sense inverters are added to carry out signals from all the output blocks except the final
stage. The carry out of the final stage is neglected, as it does not play any role in calculation of
sum bits. For 32-bit additions if there is any carry propagation then the carry signals have
transitions on their lines. Due to these transitions, current is drawn from the power supply by the
sense invertors. This current is captured into the current sensor and is compared with a pre-
chosen threshold. Whenever the captured current is above the threshold, the output done line will
be active low, indicating carry propagation by the full adder blocks. But when the current falls
below the threshold, there are no transitions on the carry out lines and the circuit has reached
stable state values. At this moment the done signal goes high and the adder is ready is compute
the next addition.
In case of asynchronous current sensor designs the done signal is flagged as soon as all the
carry lines settle to a stable state. But the design here is synchronous, which includes a clock
signal. So we compare current once in each clock cycle and provide a done signal. This clock
43
signal can be used as a master reference clock and another clock with high frequency can be used
to create multiple evaluation points which fit into the clock period of the master clock.
The general operation for the adder is to load the values to the registers after waiting for the
worst case delay; for the RCA this is equivalent to the 64 gate delays associated with the carry
propagation path. For loading the values we use a master clock (Clk1) which is operated at a
frequency equal to the worst case delay (assuming the set up and hold time constraints are met).
The new RCA with sensor circuitry uses a different clock faster, (Clk2/n) by a factor of ?n? with
respect to the main clock. So the values are now loaded for the next addition on the rising edge
of the clock depending on the value of the ?Done? signal as shown in Figure 25. If, at the first
clock edge the ?Done? signal is active high, indicating addition is complete, the new values are
loaded to the registers; if not so, then the next clock edge is used for loading the values.
Figure 25 : Clock Division
5.4 Calculations
Clk1 = Twcd = 32 stage worst case delay without sense invertors
Clk2 = Twcd + 0.2ns* +0.4ns**
* Delay introduced due to sense invertors on carry signals
44
** Delay added for evaluation over one complete addition
Timed saved = (T1-T2) / T1
Where, T1 = Total time with Clock set to Clk1
T2 = Total time with Clock set to Clk2/n, n = 2, 3, 4.
Here to have multiple comparison slots we divide the Clk2 by a factor ?n?. Here ?n?
indicates the number of comparisons.
5.5 Optimizations
In the analysis different types of optimizations are used in automatic synthesis tool
Leonardo spectrum for generating the netlist. Figures 26 and 27 show the gate level
representation of a 1-bit full adder block after the optimizations are performed by the synthesis
tool.
Figure 26: Optimization w.r.t Area
45
Figure 27: Optimization w.r.t Delay
5.6 Simulations
5.6.1 Ripple Carry Adder with Sensor Circuitry
The SPICE simulations were performed with 100000 random vectors to explore the
correlation between delay for the RCA with and without the sensor circuitry. The Clock Delay
Generator (CDG), for the design with the sensor, is set to 0.4ns which is the evaluation time of
the current sensor. The netlist parameters were obtained by using three different types of
synthesis optimization in automatic synthesis tool Leonardo Spectrum using 180nm technology.
1) Area + Delay
Clk1 = Twcd = 5.0ns
Clk2 = Twcd + 0.2ns +0.4ns = 5.6ns
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 41.09% 52.18% 53.02%
Table 1: Simulation results with area and delay optimization for RCA with and without sensor circuitry.
46
2) Area
Clk1 = Twcd = 5.1ns
Clk2 = Twcd + 0.2ns +0.4ns = 5.7ns
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 41.21% 52.28% 53.01%
Table 2: Simulation results with area optimization for RCA with and without sensor circuitry.
3) Delay
Clk1 = Twcd = 4.8ns
Clk2 = Twcd + 0.2ns +0.4ns = 5.4ns
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 40.83% 51.96% 52.89%
Table 3: Simulation results with delay optimization for RCA with and without sensor circuitry.
5.6.2 RCA Vs CLA
The SPICE simulations performed here are to explore the correlation between delay for the
RCA with the sensor circuitry and Carry Look ahead adder. These netlist parameters were
obtained by using three different types of synthesis optimization in automatic synthesis tool
Leonardo Spectrum using 180nm technology.
1) Area + Delay
47
Clk1 = 3.1ns
Clk2 = 5.0ns + 0.2ns +0.4ns = 5.6ns
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 5.25% 21.91% 23.68%
Table 4: Simulation results with Area + Delay optimization for RCA and CLA.
Design RCA + Sensor CLA Overhead
#Transistors 1182 1450 22.67%
Table 5: Area Overhead calculations with Area + Delay optimization.
2) Area
Clk1 = 3.2ns
Clk2 = 5.1ns + 0.2ns +0.4ns = 5.7ns
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 6.57% 22.99% 24.74%
Table 6: Simulation results with Area optimization for RCA and CLA.
.
Design RCA + Sensor CLA Overhead
#Transistors 1182 1450 22.67%
Table 7: Area Overhead calculations with Area + Delay optimization.
3) Delay
Clk1 = 2.9ns
Clk2 = 4.8ns + 0.2ns +0.4ns = 5.4ns
48
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 2.33% 19.49% 21.32%
Table 8: Simulation results with Delay optimization for RCA and CLA.
Design RCA + Sensor CLA Overhead
#Transistors 1451 2008 38.39%
Table 9: Area Overhead calculations with Delay optimization.
5.7 Power Analysis
5.7.1 Average Power Calculations
Table 10 gives the average power (in mW) calculations for the three circuit designs, Ripple
carry adder, Ripple carry adder with sensor circuitry and the Carry look ahead adder.
Longest Carry Chain Length /
Inputs RCA RCA with CSCD CLA
Clk/2 Clk/3 Clk/4
0 (0x 22222222 + 0x 11111111) 0.555 0.647 0.681 0.788 0.867
2 (0x 33333333 + 0x 11111111) 0.811 0.883 1.061 1.170 1.441
4 (0x 0F0F0F0F + 0x 01010101) 1.014 1.093 1.270 1.378 1.682
6 (0x 3F3F3F3F + 0x 01010101) 1.668 1.776 1.883 1.990 2.512
8 (0x 00FF00FF + 0x 00010001) 1.143 1.184 1.360 1.446 1.771
10 (0x 03FF03FF + 0x 00010001) 1.446 1.567 1.675 1.780 2.204
12 (0x 0FFF0FFF + 0x 00010001) 1.757 1.943 2.052 2.156 2.719
16 (0x 0000FFFF + 0x 00000001) 1.049 1.161 1.250 1.351 1.835
20 (0x 000FFFFF + 0x 00000001) 1.191 1.264 1.338 1.476 2.290
24 (0x 00FFFFFF + 0x 00000001) 1.322 1.465 1.576 1.605 2.767
28 (0x 0FFFFFFF + 0x 00000001) 1.457 1.545 1.636 1.737 3.234
32 (0x FFFFFFFF + 0x 00000001) 1.583 1.636 1.727 1.856 3.658
Table 10: Average Power Calculations
49
5.7.2 Energy Calculations
Table 11 gives the energy per addition (in mJ) calculations for the three circuit designs,
Ripple carry adder, Ripple carry adder with sensor circuitry and the Carry look ahead adder.
Longest Carry Chain Length /
Inputs RCA RCA with CSCD CLA
Clk/2 Clk/3 Clk/4
0 (0x 22222222 + 0x 11111111) 0.160 0.186 0.196 0.227 0.260
2 (0x 33333333 + 0x 11111111) 0.467 0.509 0.612 0.674 1.730
4 (0x 0F0F0F0F + 0x 01010101) 0.877 0.945 1.098 1.192 2.018
6 (0x 3F3F3F3F + 0x 01010101) 1.923 2.048 2.171 2.294 5.275
8 (0x 00FF00FF + 0x 00010001) 1.647 1.707 1.960 2.083 3.720
10 (0x 03FF03FF + 0x 00010001) 2.500 2.71 2.896 3.078 4.628
12 (0x 0FFF0FFF + 0x 00010001) 3.545 3.921 4.140 4.349 5.709
16 (0x 0000FFFF + 0x 00000001) 2.721 3.011 3.242 3.504 3.853
20 (0x 000FFFFF + 0x 00000001) 3.776 4.009 4.244 4.680 6.871
24 (0x 00FFFFFF + 0x 00000001) 4.953 5.49 5.872 6.014 8.301
28 (0x 0FFFFFFF + 0x 00000001) 6.301 6.68 7.073 7.508 9.701
32 (0x FFFFFFFF + 0x 00000001) 7.758 8.016 8.460 9.095 10.973
Table 11: Energy per addition calculations
5.8 Results Analysis
The design of the 32-bit modified RCA used 1182 transistors in its design, which includes
the CSCD circuitry. In contrast, the 32-bit RCA without sensor circuitry used 1079 transistors.
This suggests a 9-10% area overhead. The average speedup achieved was between 40-55% in all
the three cases of optimizations.
The design the 32-bit CLA used 1450 transistors which is 23% more compared to 32-bit
RCA with CSCD circuitry. The performance improvement achieved was between 5-25% in all
the three cases of optimizations.
50
The modified Ripple carry adder with sensor circuitry consumes 10-20% more power than
the standard Ripple carry adder. But the consumption is less when compared to Carry Look
Ahead adder which takes 10-50% more power.
51
Chapter 6
Booth Multiplier
6.1 Background
The Booth multiplier performs the multiplication of two signed numbers using the signed
binary notations. The Booth multiplier tries to decrease the number of partial products generated
for an n-bit multiplication. In general, for an n-bit multiplication n partial products are generated
and added to form the final result of the product.
The basic idea for multiplying two binary numbers is ?shift and add?. That is, for each
column in the multiplier, shift the multiplicand the appropriate number of columns and multiply
it by the value of the digit in that column of the multiplier, to obtain a partial product. The partial
products are then added to obtain the final result. As shown in example below, we get four
partial products to add for 4-bit multiplication.
0 0 1 1 (3)
x 0 1 1 0 (6)
0 0 0 0
0 0 1 1
0 0 1 1
0 0 0 0
0 0 1 0 0 1 0 (18)
52
Radix-4 Booth encoding reduces the number of partial products by half. The idea here is
that, instead of shifting and adding for every column of the multiplier term and multiplying by 1
or 0, we only take every second column, and multiply by ?1, ?2, or 0, to obtain the same results.
In the radix-4 booth algorithm the multiplier is divided into substring of 3 bits and based on the
bit pattern, operation is performed as per Booth encoding shown in Table 12.
Bit Pattern Operation
000 0
001 1 * Multiplicand
010 1 * Multiplicand
011 2 * Multiplicand
100 -2 * Multiplicand
101 -1 * Multiplicand
110 -1 * Multiplicand
111 0
Table 12: Booth Encoding
Booth encoding has an advantage of reducing the number of partial products. For radix-4
encoding the number of partial products is halved. This is important in circuit design as it relates
to the propagation delay in the operation of the circuit, and the complexity and power
consumption of its implementation.
The above example can be multiplied using the Booth encoding as below, we consider the
bits in blocks of three, such that each block overlaps the previous block by one bit. The overlap
is necessary so that we know what happened in the last block, as the MSB of the block acts like a
sign bit. Grouping starts from the LSB, and the first block only uses two bits of the multiplier.
53
Since there is no previous block to overlap a padding of 1-bit is used, which is added to the right
towards the least significant bit.
0 1 1 0 0
2nd 1st Padding bit
For 1st substring, Addition term as per booth encoding = (-1) * Multiplicand
For 2nd substring, Addition term as per booth encoding = (2) * Multiplicand
The final product is obtained by adding the two partial products with a 2nd partial product
shifted left by two bits.
6.2 Modified Booth Multiplier
The Booth multiplier performs the addition of the partial products. To achieve time saving,
Ripple Carry Adder (RCA) with completion signaling is used in our Radix-4 Booth Multiplier by
replacing the standard RCA in the design as shown in Figure 28.
54
Figure 28: Booth Multiplier with Ripple Carry Adder and sensor circuitry.
6.3 Calculations
The Booth multiplier uses a master clock to load values to the registers. We run the master
clock at a frequency equal to the worst case delay required for adders.
1. When RCA is used the clock is set to Clk1
Clk1 = 32 stage delay without sense invertors
2. When RCA with completion detection circuitry is used the clock is set to Clk2
Clk2 = Clk1 + 0.2ns* +0.4ns**
* Delay added due to sense invertors on carry signals
** Delay added for evaluation over one complete addition
Here we divide the Clk2 by a factor ?n?, where ?n? indicates the number of comparisons. So
we now operate the Booth multiplier at a faster clock rate given by Clk2/n.
3. When CLA is uses the clock is set to Clk3
Clk3 = delay seen for calculating carry for last block.
55
Timed saved = (T1-T2)/T1
Where, T1 = Total time with Clock set to Clk1
T2 = Total time with Clock set to Clk2/n or Clk3/n, n = 2, 3, 4.
It is important to note that when ?0? is added , it?s a simple shift operation without an
addition, so a faster clock can be used and when there is addition the worst case delay wait is
added. This can improve the overall performance of the Booth multiplier.
6.4 Simulations
6.4.1 BM with RCA vs BM with RCA and Current Sensor
For comparison, simulations for 1000 random input vectors with two designs of booth
multiplier, one with RCA and other with RCA and current sensor circuitry are performed. In
each simulation run two clocks, Clk1 and Clk2, are used with time period scaled by a factor ?n?,
where ?n? is number of evaluations per addition. A performance improvement between 30-50%
was observed with overall area overhead less than 2% of the silicon area (Table 13).
For BM with RCA, the shift operation is done in a Clk1/n (n=2, 3, 4) time period and for
additions the worst case delay is applied and the new values are loaded in Clk1 time period. For
BM with sensor circuitry, the Clk2/n time period is used and values are loaded only the when add
completion signal is high at an active clock edge. The Clock Delay Generator in RCA with
Current sensor circuitry is set to a delay of 0.4ns, which is the response time of the Current
sensor.
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 33.69% 46.89% 49.25%
Table 13: Simulation results with BM with RCA with and without sensor circuitry.
56
6.4.2 BM with CLA vs BM with RCA and Current Sensor
For comparison, simulation for 1000 random input vectors with two designs of booth
multiplier, one with CLA and other with RCA and current sensor circuitry are performed. In
each simulation run two clocks, Clk1 and Clk2, are used with time period scaled by a factor
?n?(n=2, 3, 4). A performance improvement between 15-30% was observed while keeping the
overall area overhead less than 2% of the silicon area (Table 14).
For BM with CLA, the shift operation is done Clk1/n time period and for additions the
worst case delay is applied and the new values are loaded in Clk1 time period. For BM with
sensor circuitry uses Clk2/n time period, values are loaded only when add completion signal is
high at active clock edge. The Clock Delay Generator in RCA with Current sensor circuitry is set
to a delay of 0.4ns i.e. response time of the Current sensor.
Clock(ns) Clk2/2 Clk2/3 Clk2/4
Time Saved 16.45% 24.45% 29.98%
Table 14: Simulation results with BM with RCA with sensor circuitry and CLA.
6.5 Results Analysis
The Booth multiplier when replaced with the RCA with Current sensor circuitry can be
operated at a higher clock rate to get improved performance. This average time saving achieved
is found to be between 30-50%. When the same design is compared with a booth multiplier
which incorporates CLA for addition the performance improvement is between 20-30%.
57
Chapter 7
Conclusion
Providing carry completion signaling in low cost ripple carry adders can allow the control
logic to schedule the next addition as soon as an earlier one is complete, thereby achieving the
average case, rather than worst case addition delay over a set of computations. Earlier attempts at
using current sensing for such carry completion signaling suffered from serious limitations. In
this thesis we presented a new approach for the design of a ripple carry adder with a current
sensing capability which observes late settling carry signal nodes in the circuit and indicates
when they reach a quiescent state. Simulations show better than 50% speedup, on average, with
less than 10% area overhead. To demonstrate a potential application of such an approach, we
incorporated our carry completion adder into a Booth multiplier design and studied the
performance gain over a traditional ripple carry adder based design. Simulation results show that
a 32-bit Booth Multiplier using the new completion signaling circuits can outperform a 32-bit
Booth Multiplier with ripple carry adder (RCA) by 30-50%, while requiring less than 2%
additional silicon area, This is comparable to the gains from the best carry look ahead adder
designs at a fraction of the area overhead costs.
Also, the observed performance improvement over a carry look ahead adder is 20-30%,
with an area overhead less than 40% of that required by a carry look ahead adder. This area
overhead will save power and area in smaller circuit designs. When the clock is set to average
case RCA delay, the average case delay is found to be 8-9 times the longest carry chain length.
At first this number appears bigger than the 6-7 long average case carry chain length for 32-bit
58
adders found by our preliminary simulations presented in Figure 23, recall however that our
design incorporates the current sensor that limits the earliest completion signal by adding an
additional overhead of 0.6ns, 0.2ns for additional load on carry lines and 0.4ns for the current
sensor response over one complete addition. Careful designs along with reduction of the
response time of the sensor can potentially further improve the average case delay of our design
by 10-20%.
Future work is focused on investigating other applications of the completion signaling
approach presented here.
59
References
1. F. Cheng, S. Unger, and M. Theobald, ?Self-Timed Carry Lookahead Adders?. IEEE
Transactions on Computers, vol. 49, no. 7, July 2000.
2. M. E. Dean, D. I. Dill, and M. Horowitz, ?Self-timed logic using current-sensing completion
detection (CSCD)?. Journal of CLSI Signal Processing, vol. 7, no. 1-2, pp. 7-16, February
1994.
3. E. Grass and S. Jones, ?Asynchronous circuits based on multiple localized current-sensing
completion detection?. Proceedings: Second Working Conference on Asynchronous Design
Methodologies. pp. viii+223, 170-177, London, UK, May 1995.
4. O. A. Izosimov, I. I. Shagurin, and V. V. Tsylyov, ?Physical approach to CMOS Self-
Timing?. Electronic Letters, vol. 26, pp. 1835-1836, 1990.
5. C. G. Knight, A. D. Singh, and V. P. Nelson, ?An IDDQ Sensor for Concurrent Timing
Error Detection?. IEEE Journal of Solid-State Circuits, vol. 33, no. 10, October 1998.
6. H. Lampinen, O. Vain, ?Circuit design for current-sensing completion detection?. ISCAS
?98: Proceedings of the 1998 IEEE International Symposium on Circuits and Systems.
Monterey, CA, vol. 2, pp. 185-188,June1998
7. Weste, N., and Eshraghian, K.: ?Principles of CMOS VLSI design? (Addison Wesley, 1993)
8. Beerel, P.A.: ?Asynchronous circuits: an increasingly practical design solution?. Proc. Int.
Symp. on Quality Electronic Design, 18?21 March 2002, pp. 367?372
9. Perri, S., Corsonello, P., and Cocorullo, G.: ?VLSI circuits for lowpower high-speed
asynchronous addition?, IEEE Trans. VLSI, 2002, 10, (5), pp. 608?613
60
10. De Gloria, A., and Olivieri, M.: ?Completion-detecting carry select addition?, IEE Proc.,
Comput. Digit. Tech., 2000, 23, pp. 93?100
11. Ruiz, G.A., and Manzano, M.A.: ?Compact 32-bit CMOS adder in multiple-output DCVS
logic for self-timed circuits?, IEE Proc., Circuits Devices Syst., 2000, pp. 183?188
12. De Gloria, A., and Olivieri, M.: ? Statistical carry lookahead adders?, IEEE Trans. Comput.,
1996, pp. 340?347 7 Escrib!a, J., and Carrasco, J.A.: ?Self-timed Manchester chain carry
propagate adder?, Electron. Lett., 1996, 32, (8), pp. 708?710
13. Reitwiesner, G.W.: ?The determination of carry propagation length for binary addition?, IRE
Trans. Electron. Comput., 1960, pp. 35?38.
14. Nowick, S.M.: ?Design of a low-latency asynchronous adder using speculative completion?,
IEE Proc., Comput. Digit. Tech., 1996, 143, (5), pp. 301?307
15. Garside, J.D.: ?A CMOS VLSI implementation of an asynchronous ALU?. Proc. IFIP Conf.
on Asynchronous Design Methodologies, Manchester, UK, 1993, pp. 181?192
16. Ruiz, G.A.: ?Evaluation of three 32-bit CMOS adders in DCVS for self-timed circuits?, IEEE
J. Solid-State Circuits, 1998, 33, (4), pp. 604?613
17. Martin, A.J.: ?Asynchronous datapaths and the design of an asynchronous adder?, Form.
Methods Syst. Des., 1992, 1, (1), pp. 119?137
18. Cheng, F.-C., Unger, S.H., Theobald, M., and Cho, W.-C.: ?Delay insensitive carry-
lookahead adders?. Proc. 10th Int. Conf. VLSI Design, Jan. 1997, pp. 322?328
19. Cheng, F.-C., Unger, S.H., and Theobald, M.: ?Self-timed carrylookahead adders?, IEEE
Trans. Comput., 2000, 49, (7), pp. 659?672
20. Hauck, S.: ?Asynchronous design methodologies: an overview?, Proc. IEEE, 1995, 83, (1),
pp. 69?93
61
21. Parhami, B.: ?Computer arithmetic: algorithms and hardware design? (Oxford University
Press, 2000)
22. H. Lampinen and 0. Vainio, ?Circuit Design for Current- Sensing Completion Detection,?
IEEE Intemational Symposium on Circuits and Systems, Monterey, California, USA, Vol. 2,
pp. 185-188, May 31 -June 3 1998.
23. E. Grass, V. Bartlett, and I. Kale, ?Completion-Detection Techniques for Asynchronous
Circuits,? IEICE Tramactions on Zilfomzation and Systems, Vol. E80-D, No. 3, pp.
24. E. Grass and S. Jones, ?Activity-Monitoring Completion- Detection (AMCD): A New
Approach to Achieve Self- Timing,? Electronics Letters, Vol. 32, No. 2, pp. 86-88, Jan.
1996.
25. H. Lampinen, Olli Vainio, ?Current-Sensing Completion Detection Method For Standard
Cell Based Digital System Design?, 2002.
26. W. Burks, Herman H. Goldstine, John Von Neumann, ?Preliminary discussion of the logical
design of an electronic computing instrument?.
27. Harri Lampinen, Olli Vainio, ?Dynamically Biased Current Sensor for Completion
Detection?.