A HARDWARE-SOFTWARE PROCESSOR ARCHITECTURE USING
PIPELINE STALLS FOR LEAKAGE POWER MANAGEMENT
Except where reference is made to the work of others, the work described in this thesis is
my own or was done in collaboration with my advisory committee. This thesis does not
include proprietary or classified information.
Khushboo Sheth
Certificate of Approval:
Adit D. Singh
James B. Davis Professor
Electrical and Computer Engineering
Vishwani D. Agrawal, Chair
James J. Danaher Professor
Electrical and Computer Engineering
Victor P. Nelson
Professor
Electrical and Computer Engineering
George T. Flowers
Dean
Graduate School
A HARDWARE-SOFTWARE PROCESSOR ARCHITECTURE USING
PIPELINE STALLS FOR LEAKAGE POWER MANAGEMENT
Khushboo Sheth
A Thesis
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Master of Science
Auburn, Alabama
May 9th, 2009
A HARDWARE-SOFTWARE PROCESSOR ARCHITECTURE USING
PIPELINE STALLS FOR LEAKAGE POWER MANAGEMENT
Khushboo Sheth
Permission is granted to Auburn University to make copies of this thesis at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
Vita
Khushboo U. Sheth, daughter of Mr. Umeshkumar R. Sheth and Mrs. Nisha U. Sheth,
was born in Ahmedabad, Gujarat, India. She attended L. D. College of Engineering,
Gujarat University, starting in 2001 and graduated with a Bachelor of Engineering degree in
Instrumentation and Control Engineering in 2005. In August of the same year, she entered the
Electrical and Computer Engineering graduate program at Auburn University, Alabama.
Thesis Abstract
A HARDWARE-SOFTWARE PROCESSOR ARCHITECTURE USING
PIPELINE STALLS FOR LEAKAGE POWER MANAGEMENT
Khushboo Sheth
Master of Science, May 9th, 2009
(B.E., Gujarat University, 2005)
109 Typed Pages
Directed by Vishwani D. Agrawal
In recent years, power consumption has become a critical design concern for many
VLSI systems. Nowhere is this truer than for portable, battery-operated applications, where
power consumption has perhaps superseded speed and area as the overriding implementation
constraint. As the emphasis on miniaturization continues, the problem of subthreshold
leakage power in CMOS circuits will grow in significance in future technologies.
The leakage current depends exponentially on the threshold
voltage, so that if the threshold voltage is reduced (as it will be in future technologies),
the leakage current increases exponentially. Responding to this challenge, several
low-power techniques at levels ranging from technology to architecture have been proposed
to reduce both dynamic and static power for processors and make them more energy-efficient.
Some of these techniques can be applied to hardware, whereas others are software-based.
In this thesis, we propose a combined hardware-software technique that
can potentially yield considerable leakage energy reduction when power-performance trade-offs
are made in higher-leakage technologies.
A simple method to reduce the power consumption in a processor is to slow down
the clock frequency. The dynamic power falls in proportion to the frequency reduction.
However, the leakage power remains the same. Because a computing task requires the
same number of clock cycles, it now takes more time. The leakage energy therefore
increases, although the dynamic energy remains the same. It is the reduction of leakage
power and energy that is targeted in the present work.
The main idea introduced and investigated is that the power-performance trade-off is
accomplished by inserting empty (no-op) cycles while the clock rate is kept unchanged.
The hardware units are specially designed to save leakage power while processing a no-op
instruction. As an illustration, the hardware of a five-stage pipeline RISC processor is
redesigned to reduce power consumption of the no-op cycles using sleep modes. This largely
eliminates the leakage in those cycles. The expected result is that as more empty cycles are
inserted, the performance drops much as it would with a conventional clock frequency reduction.
However, the computing task that now takes more cycles has only a marginal increase in
leakage energy. In addition, the empty cycles may eliminate many of the pipeline hazards
and thus reduce the performance penalty. In this work, the power supply for the active (i.e.,
non-empty) cycles is not changed; that aspect is left for future investigation.
The control unit of the processor has been designed to interpret an external power management
signal. Based on the power and performance requirements, this signal specifies a
performance slowdown factor. A power block has been added to the processor architecture;
it inserts no-ops into the instruction stream in proportion to the slowdown factor.
The normal clock rate is maintained. The control unit also generates the power signals for the
different blocks of the processor along with the other control signals. These power signals are
applied to various hardware blocks (register file, ALU, and instruction and data caches) of
the processor, which, on the basis of the required activity, are put into one of the low-power
modes such as drowsy or sleep mode. These modes are chosen for the best leakage power
reduction.
We have simulated a modified 32-bit MIPS pipelined processor for 22nm and 65nm
technologies using the Berkeley Predictive Technology Models.
Acknowledgments
I thank my adviser Dr. Vishwani D. Agrawal for his constant support. Without his
patient guidance and encouragement, this work would not have been possible. His technical advice
made my master's studies a meaningful learning experience. I thank my advisory committee
members, Dr. Victor P. Nelson and Dr. Adit D. Singh, for being on my thesis committee and
for their invaluable advice on this research. I thank all my professors at Auburn University
for their guidance and support; it has been a privilege learning from them. I would like to
acknowledge with gratitude and affection the encouragement and support given by my parents
and my brother during my graduate study. I thank my research mates and friends in Auburn
for their assistance and company.
I acknowledge with thanks a teaching assistantship at Auburn, which allowed me to
take a closer look at microprocessors. My work was additionally supported by the NSF
Grant 421128.
Style manual or journal used Journal of Approximation Theory (together with the style
known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.
Computer software used The document preparation package TeX (specifically LaTeX)
together with the departmental style-file aums.sty.
Table of Contents
List of Figures xii
List of Tables xiv
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contribution of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 SOURCES OF POWER CONSUMPTION . . . . . . . . . . . . . . . . . . 5
2.2 DEGREES OF FREEDOM . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Physical Capacitance . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 RECURRING THEMES IN LOW POWER DESIGN . . . . . . . . . . . . 16
2.4 TECHNOLOGY LEVEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Packaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 LAYOUT LEVEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 CIRCUIT LEVEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.1 Dynamic Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.2 Pass-Transistor Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.3 Asynchronous Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.4 Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.5 Design Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.6 Dynamic Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.7 Dual Threshold Voltage . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.8 Dynamic Power Cutoff Technique (DPCT) . . . . . . . . . . . . . . 31
2.6.9 Retiming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 GATE LEVEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7.1 Technology Decomposition and Mapping . . . . . . . . . . . . . . . . 32
2.7.2 Activity Postponement . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7.3 Glitch Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7.4 Input Vector Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7.5 Concurrency and Redundancy . . . . . . . . . . . . . . . . . . . . . 37
2.8 ARCHITECTURE AND SYSTEM LEVELS . . . . . . . . . . . . . . . . . 38
2.8.1 Concurrent Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8.2 Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.8.3 Memory Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.8.4 Programmability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8.5 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.9 ALGORITHM LEVEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.10 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3 Hardware-Software Technique using Pipeline stalls to reduce leakage power 53
3.1 Hardware-Software Technique . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Theoretical and Experiment Results 60
4.1 THEORETICAL RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 Clock Slow-Down Method . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.2 Instruction Slow-Down Method . . . . . . . . . . . . . . . . . . . . . 65
4.1.3 Comparison of Clock Slow-Down and Instruction Slow-Down Methods 68
4.2 MIPS PROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 MIPS Instruction Formats . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.3 Pipeline Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 MODIFIED MIPS PROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1 Blocks of the Original Processor . . . . . . . . . . . . . . . . . . . . 80
4.4.2 Power Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Comparison between the original and modified processors . . . . . . 84
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 Conclusion 87
Bibliography 88
List of Figures
1.1 Power density for the Intel-32 family . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Projected development of power consumption over technology generations [44] 3
2.1 Circuit transition currents (left: charge, right: discharge) . . . . . . . . . . 6
2.2 Short circuit currents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Leakage Currents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Energy and Delay as a function of supply voltage . . . . . . . . . . . . . . . 9
2.5 Primary sources of device capacitance . . . . . . . . . . . . . . . . . . . . . 11
2.6 Sources of interconnect capacitance . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Interpretation of switching activity in synchronous systems . . . . . . . . . 14
2.8 Static and Dynamic implementations of F = (A+B)C . . . . . . . . . . . . 25
2.9 Complementary pass transistor implementation of F = (A+B)C . . . . . . 26
2.10 Asynchronous circuits with handshaking . . . . . . . . . . . . . . . . . . . . 27
2.11 Input Reordering for activity reduction . . . . . . . . . . . . . . . . . . . . . 34
2.12 Cascaded versus Balanced tree gate structures . . . . . . . . . . . . . . . . . 35
2.13 Voltage Scaling and Parallelism for low power . . . . . . . . . . . . . . . . . 39
2.14 Voltage Scaling and Pipelining for low power . . . . . . . . . . . . . . . . . 41
3.1 Using Headers for Power Gating . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Supply Voltage Control Mechanism . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Power Saving Ratio for Clock Slow-Down Method . . . . . . . . . . . . . . 63
4.2 Energy Saving Ratio for Clock Slow-Down Method . . . . . . . . . . . . . . 64
4.3 Distribution of Instruction Energy and NOP Energy for a given time period 66
4.4 Power Saving Ratio for Instruction Slow-Down Method . . . . . . . . . . . 67
4.5 Energy Saving Ratio for Instruction Slow-Down Method . . . . . . . . . . . 68
4.6 Clock Slowdown Method Vs. Instruction Slowdown Method for  = 1 (No
Sleep Mode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Clock Slowdown Method Vs. Instruction Slowdown Method for  = 0.5
(Sleep Mode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Clock Slowdown Method Vs. Instruction Slowdown Method for  = 0.1
(Sleep Mode) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.9 General Architecture of the Processor . . . . . . . . . . . . . . . . . . . . . 74
4.10 Modified Architecture of the Processor . . . . . . . . . . . . . . . . . . . . . 77
4.11 Schematic of the Power Block . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Tables
2.1 Leakage current values for different input combinations of a 3-input NAND
gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 MIPS Instruction Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Format and the meaning of the supported Instructions . . . . . . . . . . . . 73
4.3 A Sequence of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4 External 3-bit signal conditions . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5 Estimated power in microwatts for different blocks of the processor (65nm
CMOS technology, clock period 10ns) . . . . . . . . . . . . . . . . . . . . . 81
4.6 Estimated power in microwatts for different blocks of the processor (22nm
CMOS technology, clock period 10ns) . . . . . . . . . . . . . . . . . . . . . 81
4.7 Estimated power in microwatts for different blocks of the processor (65nm
CMOS technology, clock period 20ns) . . . . . . . . . . . . . . . . . . . . . 82
4.8 Estimated power in microwatts for different blocks of the processor (65nm
CMOS technology, clock period 40ns) . . . . . . . . . . . . . . . . . . . . . 82
4.9 Estimated power in microwatts for different blocks of the processor (22nm
CMOS technology, clock period 20ns) . . . . . . . . . . . . . . . . . . . . . 83
4.10 Estimated power in microwatts for different blocks of the processor (22nm
CMOS technology, clock period 40ns) . . . . . . . . . . . . . . . . . . . . . 83
4.11 Estimated power in microwatts for power block of the processor (65nm CMOS
technology) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.12 Estimated power in microwatts for power block of the processor (22nm CMOS
technology) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.13 Comparing the original and modified processors for 22nm technology for 1
NOP (power in microwatts) . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.14 Comparing the original and modified processors for 65nm technology for 1
NOP (power in microwatts) . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chapter 1
Introduction
The invention of the transistor was a giant leap forward for low-power electronics that
has remained unequaled to date. The operation of vacuum tubes required several hundred
volts and several watts of power. In comparison the transistor required only milliwatts
of power. Since the invention of the transistor, decades ago, through the years leading
to the 21st century, power dissipation, though not entirely ignored, was of little concern.
The greater emphasis was on performance and miniaturization. Applications powered by
batteries - pocket calculators, hearing aids and, most importantly, wristwatches - drove
low-power electronics. In all such applications, it is important to prolong the battery life
as much as possible. Especially now, with the growing trend towards portable computing
and wireless communication, power dissipation has become one of the most critical factors
in the continued development of the microelectronics technology. There are two reasons for
this:
1. One of the most significant low-power design ideas of the last century, the complementary
metal oxide semiconductor (CMOS) technology [111], has served us well for
several decades. However, to make the best cost-performance trade-off [82] we continue
to integrate more functions onto a chip by shrinking the features to the smallest
size permitted by the manufacturing technology. As a result, the dissipation of power
per unit area grows and the accompanying problem of heat removal and cooling worsens.
Examples are the general-purpose microprocessors used in desktop computers
and servers. Even with the scaling down of the supply voltage, power dissipation has
not come down. Figure 1.1 shows the power density for several commercial processors.
As the figure shows, power density is trending toward levels where
cooling mechanisms are unlikely to be effective enough.
Figure 1.1: Power density for the Intel-32 family
2. Portable battery-powered applications of the past were characterized by low com-
putational requirements. The last few years have seen the emergence of portable
applications that require a greater amount of processing power. Two vanguards of
this processing model are the notebook computer and the personal digital assistant
(PDA).
As a result, it is now widely accepted that power efficiency is a design goal on par in
importance with miniaturization and performance. In spite of this acceptance, the practice
of low-power methodologies is being adopted at a slow pace due to the widespread changes
called for by these methodologies. Minimizing energy consumption and power dissipation
calls for conscious effort at each abstraction level and at each phase of the design process.
Figure 1.2: Projected development of power consumption over technology generations [44]
While technology scaling has made it possible to put more transistors onto a single
chip, at the same time allowing them to run faster, new complications continue to arise [1].
The supply voltage, being one of the critical parameters, has been reduced according to the
characteristics of the shrinking MOS device. Therefore, to maintain the transistor switch-
ing speed, the threshold voltage is also scaled down at the same rate as the supply voltage.
As a result, leakage currents increase dramatically with each technology generation. Some
researchers predict a 7.5-fold increase in the leakage current and a 5-fold increase in total
energy dissipation for every new microprocessor chip generation [64]. Because leakage
grows faster than total dissipation, it will become the major component of the total power dissipation.
Figure 1.2 shows static power consumption reaching the level of dynamic power consumption
within a few years' time. Leakage power is becoming a serious problem that needs
handling at all levels of abstraction.
1.1 Problem Statement
In the presence of non-negligible leakage power, the way to design architectures for low
power consumption may have changed. This master's thesis represents one step towards
exploring low power design again. It concentrates on trying to reduce the leakage power at
the architectural level.
1.2 Contribution of Research
We have developed a new hardware-software architecture for processors that helps
reduce leakage power in future higher-leakage technologies by using pipeline stalls.
The architecture includes a power block that inserts NOPs into the instruction stream in order to
reduce the rate at which instructions execute. When a NOP is inserted, the different
blocks of the processor are simultaneously put into a low-power mode, such as sleep mode or drowsy mode,
using power-gating techniques from the literature.
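The contrast between slowing the clock and inserting NOPs can be illustrated with a first-order energy model. The sketch below is only an illustration under stated assumptions, not the analysis developed later in the thesis: the function names, the `sleep_ratio` parameter (the fraction of normal leakage power still drawn during a NOP cycle), and the simplification that NOP cycles add no dynamic energy are all hypothetical.

```python
def clock_slowdown_energy(p_dyn, p_leak, t_task, n):
    """Energy for one task when the clock is slowed by a factor n (n >= 1).

    p_dyn, p_leak: dynamic and leakage power at the full clock rate.
    t_task: task duration at the full clock rate.
    """
    e_dyn = (p_dyn / n) * (n * t_task)  # power falls by n, time grows by n
    e_leak = p_leak * (n * t_task)      # leakage power unchanged, time grows by n
    return e_dyn, e_leak


def nop_insertion_energy(p_dyn, p_leak, t_task, n, sleep_ratio):
    """Energy for the same task when the clock rate is kept and NOP cycles
    stretch the run time by a factor n.

    Assumptions (hypothetical): NOP cycles add no dynamic energy, and during
    a NOP cycle the hardware leaks sleep_ratio * p_leak (0 <= sleep_ratio <= 1).
    """
    e_dyn = p_dyn * t_task
    e_leak = p_leak * t_task * (1.0 + (n - 1.0) * sleep_ratio)
    return e_dyn, e_leak


if __name__ == "__main__":
    # Arbitrary normalized values chosen only for illustration.
    for n in (1, 2, 4):
        print(n,
              clock_slowdown_energy(1.0, 0.5, 1.0, n),
              nop_insertion_energy(1.0, 0.5, 1.0, n, sleep_ratio=0.1))
```

Under these assumptions the two methods give the same leakage energy when `sleep_ratio` is 1 (no sleep mode), and the NOP-based method leaks less as `sleep_ratio` approaches 0.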
1.3 Organization of the thesis
The thesis is organized as follows. In Chapter 2, we survey the different power reduction
techniques that have been used at different abstraction levels: technology,
layout, circuit, gate, architecture, and algorithm. Chapter 3 explains
the new hardware-software technique for leakage power reduction. The modified low-power
architecture of the processor and the results are discussed in Chapter 4. The thesis
concludes with insight into future work in Chapter 5.
Chapter 2
Background
This chapter describes low-power design techniques at abstraction levels ranging from
layout and technology to architecture and system. By reading this chapter, it should become
clear that high-level design decisions - those made at the architecture or system level - have
the most dramatic impact on power consumption.
2.1 SOURCES OF POWER CONSUMPTION
Power dissipated in CMOS circuits consists of several components as indicated be-
low [15] [112]:
\[ P_{total} = P_{switching} + P_{shortcircuit} + P_{static} + P_{leakage} \qquad (2.1) \]
The individual components represent the power required to charge or switch a capacitive
load (Pswitching), short circuit power consumed during output transitions of a CMOS gate as
the input switches (Pshortcircuit), static power consumed by the input switches (Pstatic), and
leakage power consumed by the device (Pleakage). Components Pswitching and Pshortcircuit are
present when a device is actively changing state, while the components Pstatic and Pleakage
are present regardless of state changes.
The largest active component is Pswitching. It is caused by logic transitions.
As the transistors in a digital CMOS circuit transition back and forth between the two
logic levels, the parasitic capacitances are charged and discharged. Current flows through
the channel resistance of the transistors, and electrical energy is converted into heat and
dissipated away (Fig. 2.1).
Figure 2.1: Circuit transition currents (left: charge, right: discharge)
Pswitching is defined as
\[ P_{switching} = C \cdot V_{dd} \cdot V_{swing} \cdot \alpha \cdot f \qquad (2.2) \]
where C represents the capacitance being switched, Vdd is the supply voltage, Vswing
corresponds to the change in the voltage level of the switched capacitance, α represents a
switching activity factor based on the probability of an output transition, and f represents
the frequency of operation. The product α · C is also referred to as the effective switched
capacitance, or Ceff. In most circuits, Vswing is equal to Vdd, so (2.2) is commonly written
as
\[ P_{switching} = C_{eff} \cdot V_{dd}^2 \cdot f \qquad (2.3) \]
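Equations (2.2) and (2.3) can be evaluated directly. The following is a minimal sketch; the function name and the numeric values (1 pF switched capacitance, 1 V supply, activity factor 0.1, 100 MHz clock) are assumptions chosen only to illustrate the quadratic dependence on supply voltage.

```python
def switching_power(c, vdd, f, alpha=1.0, vswing=None):
    """Switching power per equation (2.2): C * Vdd * Vswing * alpha * f.

    If vswing is omitted, a full-rail swing (Vswing = Vdd) is assumed,
    which reduces the expression to equation (2.3) with Ceff = alpha * C.
    Units: farads, volts, hertz -> watts.
    """
    if vswing is None:
        vswing = vdd
    return c * vdd * vswing * alpha * f


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the thesis).
    p_full = switching_power(c=1e-12, vdd=1.0, f=100e6, alpha=0.1)
    p_half = switching_power(c=1e-12, vdd=0.5, f=100e6, alpha=0.1)
    print(f"Vdd = 1.0 V: {p_full * 1e6:.2f} uW")
    print(f"Vdd = 0.5 V: {p_half * 1e6:.2f} uW")
```

With a full-rail swing, halving the supply voltage in this sketch cuts the switching power by a factor of four, while halving the frequency or the activity factor only halves it.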
The term Pshortcircuit arises when short-circuit currents flow directly from supply to
ground because both the n-subnetwork and the p-subnetwork of a CMOS gate conduct simultaneously
(Fig. 2.2). With the input(s) to the gate stable at either logic level, only one of the two
subnetworks conducts and no short-circuit currents flow. But when the output of the gate is
changing in response to a change in the input, both subnetworks conduct simultaneously
for a brief interval. The duration of the interval and the short circuit dissipation both
depend on the input and the output transition (rise or fall) times.
Figure 2.2: Short circuit currents
Pshortcircuit has a complicated derivation, but in a simplified form can be written as [16]
\[ P_{shortcircuit} = I_{mean} \cdot V_{dd} \qquad (2.4) \]
where Imean represents the average current drawn during the input transition. Imean is
minimized for a single gate with short input rise and fall times, and with long output transition
times, thus presenting a trade-off in device sizing. When a set of gates is considered,
it is generally optimal to target equal input and output transition times. For large devices
such as input-output (I/O) buffers or clock drivers, special design considerations are often
used to minimize the overlap current [109]. For properly sized and ratioed gates, the contribution
to overall dynamic power due to Pshortcircuit is on the order of 10% to 20%, although
this factor may increase with increased device scaling [110].
Pstatic is not usually a factor in pure CMOS designs, since static current is not drawn
by a CMOS gate, but certain circuit structures such as sense amplifiers, voltage references,
and constant current sources do exist in CMOS systems and contribute to overall power.
Pleakage is due to leakage currents from reverse-biased PN junctions associated with
the source and drain of MOS transistors, as well as subthreshold conduction currents. The
leakage current flows when the input(s) to, and therefore the outputs of, a gate are not
changing (Fig. 2.3). The leakage component is proportional to device area and temperature.
The subthreshold leakage component is strongly dependent on device threshold voltages,
and becomes an important factor as power supply voltage scaling is used to lower power.
For systems with a high ratio of standby operation to active operation, Pleakage may be
the dominant factor in determining overall battery life: the lower the threshold voltage, the
less completely the MOSFETs in the logic gates are turned off and the higher the
standby leakage current.
Figure 2.3: Leakage Currents
Minimization of these components of power dissipation is important in designing low-power
systems, and there are complex interactions that require trade-offs to be made involving
each.
2.2 DEGREES OF FREEDOM
Active power minimization involves reducing the magnitude of each of the components
in equation (2.2): voltage, physical capacitance, and activity. Optimizing for power invariably
involves an attempt to reduce one or more of these factors. Unfortunately, these
parameters cannot be optimized independently as they are not completely orthogonal. This
section briefly discusses each of the factors, describing their relative importance, as well as
the interactions that complicate the power optimization process [22].
2.2.1 Voltage
With its quadratic relationship to power, voltage reduction offers the most direct and
dramatic means of minimizing energy consumption. Without requiring any special circuits
or technologies, a factor of two reduction in supply voltage yields a factor of four decrease
in energy (see Figure 2.4(a) [18]). Furthermore, this power reduction is a global effect,
experienced not only in one sub-circuit or block of the chip, but throughout the entire
design. Because of this quadratic relationship, designers of low-power systems are often
willing to accept increased physical capacitance or circuit activity in exchange for reduced voltage.
Unfortunately, supply voltage cannot be decreased without bound. In fact, several factors
other than power influence selection of a system supply voltage. The primary factors are
performance requirements and compatibility issues.
Figure 2.4: Energy and Delay as a function of supply voltage
As supply voltage is lowered, circuit delays increase (see Figure 2.4(b)), leading to reduced
system performance. To the first order, device currents are given by:
\[ I_{dd} = \frac{\mu C_{ox}}{2} \frac{W}{L} (V_{dd} - V_t)^2 \qquad (2.5) \]
This leads to circuit delays of the order:
\[ t = \frac{C V_{dd}}{I_{dd}} \propto \frac{V_{dd}}{(V_{dd} - V_t)^2} \qquad (2.6) \]
So, for Vdd >> Vt, delays grow roughly in inverse proportion to the supply voltage as it is lowered. In order to meet
system performance requirements, these delay increases cannot go unchecked. Some techniques
must be applied, either technological or architectural, in order to compensate for this
effect. As Vdd approaches the threshold voltage, however, delay penalties simply become
unmanageable, limiting the advantages of going below a supply voltage of about 2Vt.
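The trade-off between energy and delay implied by the first-order model can be explored numerically. The sketch below is an illustration only: the function names are hypothetical, the delay is the proportionality in (2.6) rather than an absolute value, energy is taken as proportional to the square of the supply voltage, and the threshold of 0.3 V is an assumed value.

```python
def relative_delay(vdd, vt):
    """First-order delay, up to a constant factor: Vdd / (Vdd - Vt)^2."""
    if vdd <= vt:
        raise ValueError("the first-order model only applies for vdd > vt")
    return vdd / (vdd - vt) ** 2


def relative_energy(vdd):
    """Switching energy per transition, up to a constant factor: Vdd^2."""
    return vdd ** 2


def compare(vdd_new, vdd_ref, vt):
    """Return (delay ratio, energy ratio) of vdd_new relative to vdd_ref."""
    delay_ratio = relative_delay(vdd_new, vt) / relative_delay(vdd_ref, vt)
    energy_ratio = relative_energy(vdd_new) / relative_energy(vdd_ref)
    return delay_ratio, energy_ratio


if __name__ == "__main__":
    vt = 0.3  # assumed threshold voltage in volts
    for vdd in (1.2, 0.9, 0.6, 0.45):
        d, e = compare(vdd, 1.2, vt)
        print(f"Vdd = {vdd:.2f} V: delay x{d:.2f}, energy x{e:.2f}")
```

In this model, lowering the supply from 4Vt to 2Vt reduces energy to one quarter while the delay grows by a factor of 4.5, and the delay penalty rises steeply as the supply moves closer to Vt.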
Performance is not, however, the only limiting criterion. When going to non-standard
voltage supplies, there is also the issue of compatibility and inter-operability. Unless the
entire system is being designed completely from scratch it is likely that some amount of
communications will be required with components operating at a standard voltage. This
dilemma is lessened by the availability of highly efficient (> 90%) DC-DC level converters,
but still there is some cost involved in supporting several different supply voltages [98].
This suggests that it might be advantageous for designers to support only a small number of
distinct intrasystem voltages. For example, custom chips in the system could be designed to
operate off a single low voltage (e.g., 2Vt) with level shifting only required for communication
with the outside world. To account for parameter variations within and between chips, the
supply would need to be set relative to the worst-case threshold, Vt,max.
To summarize, reducing supply voltage is paramount to lowering power consumption,
and it often makes sense to increase physical capacitance and circuit activity in order to
further reduce voltage. There are, however, limiting factors such as minimum performance
and compatibility requirements that limit voltage scaling. These factors will likely lead
designers to fix the voltage within a system. Once the supply has been fixed, it remains to
address the issues of minimizing physical capacitance and activity at that operating voltage.
The next two sections address these topics.
Figure 2.5: Primary sources of device capacitance
2.2.2 Physical Capacitance
Dynamic power consumption depends linearly on the physical capacitance being switched.
So, in addition to operating at low voltages, minimizing capacitance offers another technique
for reducing power consumption. In order to properly evaluate this opportunity we
must first understand what factors contribute to the physical capacitance of a circuit. Then
we can consider how those factors can be manipulated to reduce power. The physical capacitance
in CMOS circuits stems from two primary sources: devices and interconnect. For
devices, the most significant contributions come from the gate and junction capacitances as
shown in Figure 2.5. The capacitance associated with the thin gate oxide of the transistor
is usually the larger of the two. This term can be approximated as a parallel-plate (area)
capacitance between the gate and the substrate or channel:
\[ C_g = W L C_{ox} = W L \frac{\epsilon_{ox}}{t_{ox}} \qquad (2.7) \]
In addition, source/drain junction capacitances contribute to the overall device capaci-
tance. These capacitances have both an area and a perimeter component and are non-linear
with the voltage across the junction:
\[ C_j(V) = A C_{j0} \left(1 - \frac{V}{\phi_0}\right)^{-m} + P C_{jsw0} \left(1 - \frac{V}{\phi_0}\right)^{-m} \qquad (2.8) \]
Where A and P are the source/drain area and perimeter, Cj0 and Cjsw0 are equilibrium
bottomwall and sidewall capacitances, φ0 is the junction barrier potential, and m is the
junction grading coefficient.
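The voltage dependence in (2.8) can be seen by evaluating it at a few bias points. This is a minimal sketch: the function name is hypothetical and the parameter values are normalized numbers picked for easy arithmetic, not data for any real process. The junction voltage V is taken as negative under reverse bias.

```python
def junction_capacitance(v, area, perimeter, cj0, cjsw0, phi0, m):
    """Source/drain junction capacitance per equation (2.8).

    v: voltage across the junction (negative for reverse bias); must be < phi0.
    cj0, cjsw0: zero-bias bottomwall and sidewall capacitances.
    phi0: junction barrier potential; m: junction grading coefficient.
    """
    if v >= phi0:
        raise ValueError("the expression is only valid for v < phi0")
    factor = (1.0 - v / phi0) ** (-m)
    return area * cj0 * factor + perimeter * cjsw0 * factor


if __name__ == "__main__":
    # Normalized, assumed parameters for illustration only.
    params = dict(area=1.0, perimeter=4.0, cj0=1.0, cjsw0=0.5, phi0=0.8, m=0.5)
    for v in (0.0, -0.8, -2.4):
        print(f"V = {v:+.1f}: Cj = {junction_capacitance(v, **params):.3f}")
```

The capacitance is largest at zero bias and falls as the reverse bias grows, which is why a single constant value can only approximate it over a given operating range.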
Figure 2.6: Sources of interconnect capacitance
Often, this non-linear capacitance is approximated by a large-signal equivalent lin-
earized capacitance given by:
\[ C_{jeq} = \frac{\int_{V_0}^{V_1} C_j(V)\, dV}{V_1 - V_0} \qquad (2.9) \]
Where V0 and V1 describe the range of typical operating voltages for the junction. In past
technologies, device capacitances dominated over interconnect parasitics. As technologies
continue to scale down, however, this no longer holds true and we must consider the con-
tribution of interconnect to the overall physical capacitance. For the interconnect, there
is the capacitance between each metalization layer and the substrate, as well as coupling
capacitances between the layers themselves (see Figure 2.6). Each of these capacitances in
turn has two components: a parallel-plate component and a fringing component:
Cw = WLCp + 2(W +L)Cf (2.10)
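To see how the two terms of Eq. (2.10) compare, the following Python sketch evaluates both for
a wire of fixed length and shrinking width. The per-area and per-length coefficients are
assumed values, not data for a real process, and the simple model does not capture the
dependence on wire thickness.

```python
# Sketch of Eq. (2.10): C_w = W*L*C_p + 2*(W + L)*C_f.
def wire_capacitance(width, length, c_p, c_f):
    plate = width * length * c_p            # parallel-plate component
    fringe = 2.0 * (width + length) * c_f   # fringing component
    return plate, fringe

C_P = 3e-5    # assumed plate capacitance per unit area, F/m^2
C_F = 4e-11   # assumed fringing capacitance per unit length, F/m

for width_um in (4.0, 2.0, 1.0, 0.5):
    plate, fringe = wire_capacitance(width_um * 1e-6, 1e-3, C_P, C_F)
    share = fringe / (plate + fringe)
    print(f"W = {width_um} um: fringing share = {share:.2f}")
```

Under these assumptions the fringing share of the total grows steadily as the wire narrows.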
Historically, the parallel-plate component, which increases linearly with both the width
and the length of the wire, has been dominant. The fringing component starts to become
significant, however, as the interconnect width becomes narrower and narrower relative
to the wire thickness [13]. With this understanding, we can now consider how to reduce
physical capacitance. From the previous discussion, we recognize that capacitances can be
kept at a minimum by using small devices and short wires. As with voltage, however, we are
not free to optimize capacitance independently. For example, reducing device sizes will not
only reduce physical capacitance, but will also reduce the current drive of the transistors,
making the circuit operate more slowly. This loss in performance might prevent us from
lowering Vdd as much as we might otherwise be able to do. In this scenario, we are giving
up a possible quadratic reduction in power through voltage scaling for a linear reduction
through capacitance scaling. So, if the designer is free to scale voltage, it does not make sense
to minimize physical capacitance without considering the side effects. Similar arguments
can be applied to interconnect capacitance. If voltage and/or activity can be significantly
reduced by allowing some increase in physical interconnect capacitance, then this may result
in a net decrease in power. The key point to recognize is that low-power design is a joint
optimization process in which the variables cannot be manipulated independently.
Figure 2.7: Interpretation of switching activity in synchronous systems
2.2.3 Activity
In addition to voltage and physical capacitance, switching activity also influences dynamic
power consumption. A chip can contain a huge amount of physical capacitance, but
if it does not switch then no dynamic power will be consumed. The activity determines how
often this switching occurs. As mentioned above, there are two components to switching
activity. The first is the data rate, f, which reflects how often, on average, new data arrives at
each node. This data might or might not be different from the previous data value. In this
sense, the data rate f describes how often on average switching could occur. For example,
in synchronous systems f might correspond to the clock frequency (see Figure 2.7).
The second component of activity is the data activity, α. This factor corresponds to
the expected number of transitions that will be triggered by the arrival of each new piece
of data. So, while f determines the average periodicity of data arrivals, α determines how
many transitions each arrival will spark. For circuits that don't experience glitching, α can
be interpreted as the probability that a transition will occur during a single data period.
For certain logic styles, however, glitching can be an important source of signal activity
and, therefore, deserves some mention here [17]. Glitching refers to spurious and unwanted
transitions that occur before a node settles down to its final, steady-state value. Glitching
often arises when paths with unbalanced propagation delays converge at the same point in
the circuit. Since glitching can cause a node to make several power-consuming transitions
instead of one (i.e., α > 1), it should be avoided whenever possible. The data activity α
can be combined with the physical capacitance C to obtain an effective capacitance,
Ceff = αC/2, which describes the average capacitance charged during each 1/f data period.
This reflects the fact that neither the physical capacitance nor the activity alone determines
dynamic power consumption. Instead, it is the effective capacitance, which combines the
two, that truly determines the power consumed by a CMOS circuit:
\[ P = \frac{1}{2}\,\alpha C V_{dd}^2 f = C_{eff} V_{dd}^2 f \qquad (2.11) \]
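A minimal Python sketch of Eq. (2.11) follows. The capacitance, activity, voltage, and
frequency values are assumed purely for illustration, and the function names are hypothetical.

```python
# Sketch of Eq. (2.11): P = (alpha * C / 2) * Vdd^2 * f = C_eff * Vdd^2 * f.
def effective_capacitance(alpha, c_phys):
    return alpha * c_phys / 2.0

def dynamic_power(alpha, c_phys, vdd, f):
    return effective_capacitance(alpha, c_phys) * vdd ** 2 * f

# Assumed example: 1 nF of total node capacitance, alpha = 0.5, 50 MHz.
p_nominal = dynamic_power(0.5, 1e-9, 3.3, 50e6)    # Vdd = 3.3 V
p_half_vdd = dynamic_power(0.5, 1e-9, 1.65, 50e6)  # Vdd halved
print(p_nominal, p_half_vdd)
```

Halving the supply voltage cuts the computed power by a factor of four, whereas halving either
α or C would only halve it.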
This discussion provides the first indication that data statistics can have a significant
effect on power consumption. As with voltage and physical capacitance, we can consider
techniques for reducing switching activity as a means of saving power. For example,
certain data representations, such as sign-magnitude, have an inherently lower activity than
two's-complement [27]. Since sign-magnitude arithmetic is much more complex than
two's-complement, however, there is a price to be paid for the reduced activity in terms of higher
physical capacitance. This is yet another indication that low-power design is truly a joint
optimization problem. In particular, optimization of activity cannot be undertaken
independently without consideration for the impact on voltage and capacitance.
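The difference in activity between the two number representations can be illustrated by
counting bit toggles on a bus. The Python sketch below uses an assumed 8-bit word and an
assumed sample sequence that oscillates around zero; it is an illustration, not the analysis
of [27].

```python
# Sketch: bit toggles for a small signal that oscillates around zero.
def twos_complement(value, bits=8):
    return value & ((1 << bits) - 1)

def sign_magnitude(value, bits=8):
    sign = (1 << (bits - 1)) if value < 0 else 0
    return sign | abs(value)

def count_toggles(words):
    """Total number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

samples = [1, -1, 2, -2, 1, -1, 3, -3]   # assumed example data
tc_toggles = count_toggles([twos_complement(s) for s in samples])
sm_toggles = count_toggles([sign_magnitude(s) for s in samples])
print(tc_toggles, sm_toggles)
```

For this sequence the two's-complement encoding flips most of the upper bits at every sign
change (48 toggles in total), while the sign-magnitude encoding flips mainly the sign bit
(12 toggles).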
2.3 RECURRING THEMES IN LOW POWER DESIGN
Sections 2.1 and 2.2 have provided a strong foundation from which to consider
low-power CMOS design. Specifically, Section 2.1 derived the classical expression for dynamic
power consumption in CMOS. This led to the realization that three primary parameters
(voltage, physical capacitance, and activity) determine the average power consumption of
a digital CMOS circuit. Section 2.2 then went on to describe each of these factors
individually, while emphasizing that design for low power must involve a joint rather than
independent optimization of these three parameters. The upcoming sections present specific
power reduction techniques applicable at various levels of abstraction. Many of these
techniques follow a small number of common themes. The three principal themes are trading
area/performance for power, avoiding waste, and exploiting locality. Probably the most
important theme is trading area/performance for power. As mentioned in Section 2.2.1,
power can be reduced by decreasing the system supply voltage and allowing the
performance of the system to degrade. This is an example of trading performance for power.
If the system designer is not willing to give up the performance, he can consider applying
techniques such as parallel processing to maintain performance at low voltage. Since
many of these techniques incur an area penalty, we can think of this as trading area for
power. Another recurring low-power theme involves avoiding waste. For example, clocking
modules when they are idle is a waste of power. Glitching is another example of wasted
power and can be avoided by path balancing and choice of logic family. Other strategies for
avoiding waste include using dedicated rather than programmable hardware and reducing
control overhead by using regular algorithms and architectures. Avoiding waste can also
take the form of designing systems to meet, rather than beat, performance requirements.
If an application requires 25 MIPS of processing performance, there is no advantage gained
by implementing a 50 MIPS processor at twice the power. Exploiting locality is another
important theme of low-power design. Global operations inherently consume a lot of power.
Data must be transferred from one part of the chip to another at the expense of switching
large bus capacitances. Furthermore, in poorly partitioned designs the same data might
need to be stored in many parts of the chip, wasting still more power. In contrast, a design
partitioned to exploit locality of reference can minimize the amount of expensive global
communications employed in favor of much less costly local interconnect networks. While
not all low-power techniques can be classified as trading off area/performance for power,
avoiding waste, or exploiting locality, these basic themes do describe many of the strategies
that will be presented in the remainder of this chapter. The organization of these upcoming
sections is by level of abstraction. Specifically, beginning with Section 2.4 and ending with
Section 2.9, they cover low-power design methodologies for the technology, layout, circuit,
gate, architecture, and algorithm levels, respectively.
2.4 TECHNOLOGY LEVEL
At the lowest level of abstraction we can consider low-power design strategies in the
context of both packaging and process technologies.
2.4.1 Packaging
Often a significant fraction of the total chip power consumption can be attributed
not to core processing but to driving large off-chip capacitances. This is not surprising
since off-chip capacitances are on the order of tens of picofarads while on-chip capacitances
are in the tens of femtofarads. For conventional packaging technologies, Bakoglu suggests
that pins contribute approximately 13-14 pF of capacitance each (10 pF for the pad and
3-4 pF for the printed circuit board traces) [13]. Since dynamic power is proportional to
capacitance, I/O power can be a significant portion of overall chip power consumption. The
notion that I/O capacitance at the chip level can account for as much as 1/4 to 1/2 of the
overall system power dissipation suggests that reduction of I/O power is a high priority in
multi-chip systems. If the large capacitances associated with inter-chip I/O were drastically
reduced, the I/O component of system power consumption would be reduced proportionally.
Packaging technology can have a dramatic impact on the physical capacitance involved
in off-chip communications. For example, multi-chip modules (MCMs) offer a drastic
reduction in the physical capacitance of off-chip wiring. In an MCM, all of the chips
comprising the system are mounted on a single substrate, and the entire module is placed
in a single package. Utilizing this technology, inter-chip I/O capacitances are reduced to the
same order as on-chip capacitances [20]. This is due not only to the elimination of the highly
capacitive PCB traces, but also to the minimization of on-chip pad driver capacitances due
to reduced off-chip load driving requirements. Thus, utilizing MCM technology, the I/O
component of system power consumption can be kept at a minimum, shifting the focus
of power optimization from I/O considerations to chip core considerations [23]. Actually,
low-power operation is only one of the advantages of MCM technology. In addition, MCMs,
with their reduced chip-level interconnect lengths and capacitances, can significantly reduce
system delays resulting in higher performance, which can then be traded for lower power at
the designer's discretion [13]. So, selection of a packaging technology can have an important
effect on system power consumption.
2.4.2 Process
In addition to packaging considerations, process (or fabrication) technology plays an
important role in determining power consumption. This section presents two important
process-based techniques for reducing power consumption: technology scaling and threshold
voltage scaling.
TECHNOLOGY SCALING
Scaling of physical dimensions is a well-known technique for reducing circuit power
consumption. Basically, scaling involves reducing all vertical and horizontal dimensions
by a factor, S, greater than one. Thus, transistor widths and lengths are reduced, oxide
thicknesses are reduced, depletion region widths are reduced, interconnect widths and
thicknesses are reduced, etc. The first-order effects of scaling can be fairly easily derived [12, 35].
Device gate capacitances are of the form Cg = W L Cox. If we scale down W, L, and tox
by S, then this capacitance will scale down by S as well. Consequently, if system data rates
and supply voltages remain unchanged, this factor of S reduction in capacitance is passed
on directly to power:
\[ \text{Fixed performance, fixed voltage:} \quad P \propto \frac{1}{S} \qquad (2.12) \]
To give a concrete example, at the 1994 International Solid-State Circuits Conference,
MIPS Technologies attributed a 25% reduction in power consumption for their new 64b
RISC processor solely to a migration from a 0.8 μm to a 0.64 μm technology [114]. The
effect of scaling on delays is equally promising. Based on Eq. (2.5), the transistor current
drive increases linearly with S. As a result, propagation delays, which are proportional
to capacitance and inversely proportional to drive current, scale down by a factor of S^2.
Assuming we are only trying to maintain system throughput rather than increase it, the
improvement in circuit performance can be traded for lower power by reducing the supply
voltage. In particular, neglecting Vt effects, the voltage can be reduced by a factor of S^2.
This results in an S^4 reduction in device currents, and along with the capacitance scaling
leads to an S^5 reduction in power:
\[ \text{Fixed performance, variable voltage:} \quad P \propto \frac{1}{S^5} \qquad (2.13) \]
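The two first-order regimes of Eqs. (2.12) and (2.13) can be compared numerically. The Python
sketch below applies them to the feature-size ratio of the migration cited above; the function
names are hypothetical and second-order effects are ignored.

```python
# First-order power scaling under the regimes of Eqs. (2.12) and (2.13).
def power_ratio_fixed_voltage(s):
    return 1.0 / s            # only the capacitance scales

def power_ratio_scaled_voltage(s):
    return 1.0 / s ** 5       # capacitance (1/S) times voltage squared (1/S^4)

s = 0.8 / 0.64                # feature-size ratio, S = 1.25
ratio_fixed = power_ratio_fixed_voltage(s)
ratio_scaled = power_ratio_scaled_voltage(s)
print(ratio_fixed, ratio_scaled)
```

For S = 1.25 the fixed-voltage model predicts a power ratio of about 0.8, that is, roughly a
20% reduction, which is of the same order as the 25% figure reported. If the voltage could
also be scaled as assumed in Eq. (2.13), the ratio would drop to about 0.33.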
This discussion, however, ignores many important second-order effects. For example,
as scaling continues, interconnect parasitics eventually begin to dominate and change the
picture substantially. The resistance of a wire is proportional to its length and inversely
proportional to its thickness and width. Since in this discussion we are considering the
impact of technology scaling on a fixed design, the local and global wire lengths should
scale down by S along with the width and thickness of the wire. This means that the wire
resistance should scale up by a factor of S overall. The wire capacitance is proportional to
its width and length and inversely proportional to the oxide thickness. Consequently, the
wire capacitance scales down by a factor of S. To summarize:
\[ R_w \propto S \quad \text{and} \quad C_w \propto \frac{1}{S} \qquad (2.14) \]
\[ t_{wire} \propto R_w C_w \propto 1 \qquad (2.15) \]
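The relations in (2.14) and (2.15) follow from first-order expressions for wire resistance and
parallel-plate wire capacitance. The derivation below is a sketch: the resistivity ρ, the
permittivity ε, the wire dimensions L_w, W_w, and T_w (length, width, thickness), and the
primed notation for scaled quantities are symbols introduced here for illustration.

```latex
\begin{align*}
R_w' &= \rho\,\frac{L_w/S}{(W_w/S)(T_w/S)}
      = S\,\rho\,\frac{L_w}{W_w T_w} = S\,R_w \\
C_w' &= \varepsilon\,\frac{(W_w/S)(L_w/S)}{t_{ox}/S}
      = \frac{1}{S}\,\varepsilon\,\frac{W_w L_w}{t_{ox}} = \frac{C_w}{S} \\
t_{wire}' &\propto R_w' C_w' = (S\,R_w)\,\frac{C_w}{S} = R_w C_w
\end{align*}
```

The factor of S gained in resistance exactly cancels the factor lost in capacitance, which is
why the intrinsic wire delay is unchanged to first order.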
This means that, unlike gate delays, the intrinsic interconnect delay does not scale
down with physical dimensions. So at some point interconnect delays will start to dominate
over gate delays and it will no longer be possible to scale down the supply voltage. This
means that once again power is reduced solely due to capacitance scaling:
\[ \text{Parasitics dominated:} \quad P \propto \frac{1}{S} \qquad (2.16) \]
Actually, the situation is even worse since the above analysis did not consider second-order
effects such as the fringing component of wire capacitance, which may actually grow
with reduced dimensions. As a result, realistically speaking, power may not scale down at
all, but instead may stay approximately constant with technology scaling or even increase:
\[ \text{Including second-order effects:} \quad P \propto 1 \text{ or more} \qquad (2.17) \]
The conclusion is that technology scaling offers significant benefits in terms of power
only up to a point. Once parasitics begin to dominate, the power improvements slack off or
disappear completely. So we cannot rely on technology scaling to reduce power indefinitely.
We must turn to other techniques for lowering power consumption.
THRESHOLD VOLTAGE REDUCTION
Many process parameters, aside from lithographic dimensions, can have a large impact
on circuit performance. For example, at low supply voltages the value of the threshold
voltage (Vt) is extremely important. Section 2.2.1 revealed that threshold voltage places
a limit on the minimum supply voltage that can be used without incurring unreasonable
delay penalties. Based on this, it would seem reasonable to consider reducing threshold
voltages in a low-power process. Unfortunately, subthreshold conduction and noise margin
considerations limit how low Vt can be set. Although devices are ideally "off" for gate
voltages below Vt, in reality there is always some subthreshold conduction even for Vgs < Vt.
The question is especially important for dynamic circuits, for which subthreshold currents
could cause erroneous charging or discharging of dynamic nodes. The relationship between
gate voltage and subthreshold current is exponential. Each 0.1V reduction in Vgs below Vt
reduces the subthreshold current by approximately one order of magnitude [74]. Therefore,
in order to prevent static currents from dominating chip power and to ensure functionality
of dynamic circuits, threshold voltages should be limited to a minimum of 0.3-0.5V.
Unfortunately, dimensional and threshold scaling are not always viable options. Aside from the
drawbacks of interconnect non-scalability, submicron effects, and subthreshold conduction,
chip designers often don't have complete freedom to arbitrarily scale their fabrication
technology. Instead, economic factors as well as the capabilities of their fabrication facilities
impose limits on minimum lithographic dimensions. For this reason, in order to achieve
widespread acceptance, an ideal low-power methodology should not rely solely on technology
scaling or specialized processing techniques. The methodology should be applicable
not only to different technologies, but also to different circuit and logic styles. Whenever
possible, scaling and circuit techniques should be combined with the high-level methodology
to further reduce power consumption; however, the general low-power strategy should not
require these tricks.
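The exponential dependence of subthreshold current on gate voltage quoted earlier in this
section (about one decade per 0.1 V) shows why the threshold cannot be lowered freely. The
Python sketch below assumes exactly that slope and normalizes the current to its value at
Vgs = Vt; both the normalization and the function name are assumptions made for illustration.

```python
# Sketch: subthreshold current falls about one decade per 0.1 V of (Vt - Vgs).
SWING = 0.1   # volts per decade, the approximate figure quoted in the text

def subthreshold_current(v_gs, v_t, i_at_vt=1.0):
    """Current relative to its value at Vgs = Vt (intended for Vgs <= Vt)."""
    return i_at_vt * 10.0 ** ((v_gs - v_t) / SWING)

# Off-state leakage (Vgs = 0) for a few candidate threshold voltages.
for v_t in (0.1, 0.3, 0.5):
    i_off = subthreshold_current(0.0, v_t)
    print(f"Vt = {v_t} V: I_off / I(Vt) = {i_off:.0e}")
```

Under this model, lowering Vt from 0.5 V to 0.3 V raises the off-state current by two orders
of magnitude.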
2.5 LAYOUT LEVEL
There are a number of layout-level techniques that can be applied to reduce power. The
simplest of these techniques is to select upper level metals to route high activity signals.
The higher level metals are physically separated from the substrate by a greater thickness
of silicon dioxide. Since the physical capacitance of these wires decreases linearly with
increasing tox, there is some advantage to routing the highest activity signals in the higher
level metals. For example, in a typical process metal three will have about a 30% lower
capacitance per unit area than metal two [84]. The DEC Alpha chip takes advantage of this
fact by routing the high activity clock network primarily in third level metal [36]. It should
be noted, however, that the technique is most effective for global rather than local routing,
since connecting to a higher level metal requires more vias, which add area and capacitance
to the circuit. Still, the concept of associating high activity signals with low physical
capacitance nodes is an important one and appears in many different contexts in low-power
design. For example, we can combine this notion with the locality theme of Section 2.3
to arrive at a general strategy for low-power placement and routing. The placement and
routing problem crops up in many different guises in VLSI design. Place and route can be
performed on pads, functional blocks, standard cells, gate arrays, etc. Traditional placement
involves minimizing area and delay. Minimizing delay, in turn, translates to minimizing the
physical capacitance (or length) of wires. In contrast, placement for power concentrates on
minimizing the activity-capacitance product rather than the capacitance alone. In summary,
high activity wires should be kept short or, in a manner of speaking, local. Tools have been
developed that use this basic strategy to achieve about an 18% reduction in power [31, 108].
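The idea of minimizing the activity-capacitance product can be sketched in a few lines of
Python. The example below is a toy model, not the algorithm of the tools cited above: it
assumes a fixed set of candidate routes with known capacitances and simply gives the
lowest-capacitance route to the most active net. All names and values are hypothetical.

```python
# Sketch: pair high-activity nets with low-capacitance routes.
def switched_capacitance(activities, capacitances):
    """Sum of activity * capacitance over all nets."""
    return sum(a * c for a, c in zip(activities, capacitances))

def assign_routes(activities, route_caps):
    """Give the lowest-capacitance route to the most active net."""
    by_activity = sorted(range(len(activities)),
                         key=lambda i: activities[i], reverse=True)
    assignment = [0.0] * len(activities)
    for net, cap in zip(by_activity, sorted(route_caps)):
        assignment[net] = cap
    return assignment

activities = [0.05, 0.5, 0.2]            # assumed transitions per cycle
route_caps = [40e-15, 120e-15, 80e-15]   # assumed route capacitances, F

naive = switched_capacitance(activities, route_caps)
tuned = switched_capacitance(activities, assign_routes(activities, route_caps))
print(naive, tuned)
```

In this example the activity-aware assignment reduces the switched capacitance from 78 fF to
42 fF per cycle, even though the total wire capacitance is unchanged.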
2.6 CIRCUIT LEVEL
Many circuit techniques can lead to reduced power consumption. In this section, we
go beyond the traditional synchronous fully-complementary static CMOS circuit style to
consider the relative advantages and disadvantages of other design strategies. This section
will consider topics related to low-power circuit design.
2.6.1 Dynamic Logic
In static logic, node voltages are always maintained by a conducting path from the node
to one of the supply rails. In contrast, dynamic logic nodes go through periods during which
there is no path to the rails, and voltages are maintained as charge dynamically stored on
nodal capacitances. Figure 2.8 shows an implementation of a complex boolean expression
in both static and dynamic logic. In the dynamic case, the clock period is divided into a
precharge and an evaluation phase. During precharge, the output is charged to Vdd. Then,
during the next clock phase, the NMOS tree evaluates the logic function and discharges the
output node if necessary. Relative to static CMOS, dynamic logic has both advantages and
disadvantages in terms of power.
Figure 2.8: Static and Dynamic implementations of F = (A+B)C
Dynamic design styles have reduced device counts, do not experience short-circuit
power dissipation, and are guaranteed to have a maximum of one transition per clock cycle,
unlike static gates, which experience glitching. However, each of the precharge transistors
in the chip must be driven by a clock signal, requiring a dense clock distribution network
and its associated capacitance and driving circuitry, which contributes significant power
consumption to the chip. Also, as each gate is influenced by the clock, the issues of skew
become important and difficult to handle.
2.6.2 Pass-Transistor Logic
As with dynamic logic, pass-transistor logic offers the possibility of reduced transistor
counts. Figure 2.9 illustrates this fact with an equivalent pass-transistor implementation of
the static logic function of Figure 2.8. Once again, the reduction in transistors results in
lower capacitive loading from devices. This might make pass-transistor logic attractive as
a low-power circuit style.
Figure 2.9: Complementary pass transistor implementation of F = (A+B)C
Like dynamic logic, however, pass-transistor circuits suffer from several drawbacks. At
the voltages attractive for low-power design, the reduced current drive of pass-transistor
logic networks becomes particularly troublesome. Low-threshold processes can lessen this
problem, but at the expense of robustness and static power dissipation.
2.6.3 Asynchronous Logic
Asynchronous logic refers to a circuit style employing no global clock signal for syn-
chronization. Instead, synchronization is provided by handshaking circuitry used as an
interface between gates (see Figure 2.10). While more common at the system level, asyn-
chronous logic has failed to gain acceptance at the circuit level. This has been based on
area and performance criteria. It is worthwhile to re-evaluate asynchronous circuits in the
context of low power. The primary power advantages of asynchronous logic can be
classified as avoiding waste. The clock signal in synchronous logic contains no information;
therefore, power associated with the clock driver and distribution network is in some sense
wasted. Avoiding this power consumption component might offer significant benefits. In
addition, asynchronous logic uses completion signals, thereby avoiding glitching, another
form of wasted power. Finally, with no clock signal and with computation triggered by the
presence of new data, asynchronous logic contains a sort of built-in power-down mechanism
for idle periods.
Figure 2.10: Asynchronous circuits with handshaking
At the small granularity with which it is commonly implemented, the overhead of the
asynchronous interface circuitry dominates over the power saving attributes of the design
style. It should be emphasized, however, that this is mainly a function of the granularity of
the handshaking circuitry. It would certainly be worthwhile to consider using asynchronous
techniques to eliminate the necessity of distributing a global clock between blocks of larger
granularity.
2.6.4 Transistor Sizing
Regardless of the circuit style employed, the issue of transistor sizing for low power
arises. The primary trade-off involved is between performance and cost, where cost is
measured by area and power. Transistors with larger gate widths provide more current
drive than smaller transistors. Unfortunately, they also contribute more device capacitance
to the circuit and, consequently, result in higher power dissipation. Moreover, larger devices
experience more severe short-circuit currents, which should be avoided whenever possible.
In addition, if all devices in a circuit are sized up, then the loading capacitance increases
in the same proportion as the current drive, resulting in little performance improvement
beyond the point of overcoming fixed parasitic capacitance components. In this sense,
large transistors become self-loading and the benefit of large devices must be re-evaluated.
A sensible low-power strategy is to use minimum size devices whenever possible. Along
the critical path, however, devices should be sized up to overcome parasitics and meet
performance requirements. Care should be taken in this sizing process to avoid the waste of
self-loading [24]. By following this approach, Nagendra et al. found that the average power
dissipation of a signed-digit adder could be reduced by 36% with a delay penalty of only
0.3% [76].
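The self-loading effect described in this section can be illustrated with a simple first-order
model in Python. The sketch assumes that every device is scaled by the same factor, that
delay is proportional to C·V/I, and that the parasitic and per-device values are as stated in
the comments; none of these numbers come from the cited work.

```python
# Sketch: uniform up-sizing with a fixed parasitic load (self-loading).
def stage_delay(size, c_fixed, c_dev, i_unit, vdd):
    """Delay ~ C*V/I when every device is scaled by `size`."""
    load = c_fixed + size * c_dev     # device capacitance grows with size
    drive = size * i_unit             # and so does the drive current
    return load * vdd / drive

def switched_load(size, c_fixed, c_dev):
    """Capacitance switched per transition, which sets dynamic power."""
    return c_fixed + size * c_dev

# Assumed values: 10 fF fixed parasitics, 2 fF and 100 uA per unit device, 1 V.
args = dict(c_fixed=10e-15, c_dev=2e-15, i_unit=100e-6, vdd=1.0)
for size in (1, 2, 4, 8, 16):
    print(size, stage_delay(size, **args), switched_load(size, 10e-15, 2e-15))
```

In this model the delay approaches the floor c_dev·vdd/i_unit as the size grows, so each
further doubling buys less speed while the switched capacitance keeps rising linearly.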
2.6.5 Design Style
Another decision which can have a large impact on the overall chip power consumption
is the selection of a design style, e.g., full custom, gate array, standard cell, etc. Not surprisingly,
full-custom design offers the best possibility of minimizing power consumption. In a
custom design, all the principles of low-power including locality, regularity, and sizing can be
applied optimally to individual circuits. Unfortunately, this is a costly alternative in terms
of design time, and can rarely be employed exclusively as a design strategy. Other possible
design styles include gate arrays and standard cells. Gate arrays offer one alternative for
reducing design cycles at the expense of area, power, and performance. While not offering
the flexibility of full-custom design, gate-array CAD tools could nevertheless be altered to
place increased emphasis on power. Standard cell synthesis is another commonly employed
strategy for reducing design time. Current standard cell libraries and tools, however, offer
little hope of achieving low power operation. In many ways, standard cells represent the
antithesis of a low-power methodology. First and foremost, standard cells are often severely
oversized. Most standard cell libraries were designed for maximum performance and
worst-case loading from inter-cell routing. As a result, they experience significant self-loading and
waste correspondingly significant amounts of power.
2.6.6 Dynamic Voltage Scaling
The other technique that is considered for power reduction at the circuit level is dynamic
voltage scaling, where a component is run at a less-than-maximum voltage in order
to conserve power. Dynamic voltage scaling is most commonly used in laptops and other
mobile devices, where energy comes from a battery and thus is limited. As explained in
Section 2.2.1, the switching power dissipated by a chip decreases quadratically with voltage.
A problem with this technique in embedded devices is that batteries have a certain voltage
so extra circuitry must be introduced to scale the voltage down from the original value.
This circuitry may take a while to settle at the desired voltage, hence adding a time-cost in
the μs range on reducing the supply voltage. Dynamic frequency scaling is another power
conservation technique that works on the same principles as dynamic voltage scaling. Both
dynamic voltage scaling and dynamic frequency scaling can be used to prevent computer
system overheating, which can result in program or operating system crashes, and possibly
hardware damage. The speed at which a digital circuit can switch states, that is, go
from "low" (VSS) to "high" (VDD) or vice versa, is proportional to the voltage differential
in that circuit. Reducing the voltage means that circuits switch slower, reducing the
maximum frequency at which that circuit can run. This, in turn, reduces the rate at which
program instructions can be issued, increasing program runtime.
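The trade-off between energy and runtime under dynamic voltage scaling can be sketched with
the two first-order relations used in this section: switching energy proportional to the
square of the supply voltage, and maximum frequency proportional to the supply voltage. The
Python example below assumes both relations hold exactly and ignores leakage; the workload
and operating point are hypothetical.

```python
# Sketch: first-order effect of supply scaling on one fixed workload.
def max_frequency(vdd, v_nom, f_nom):
    """Assumes switching speed is proportional to supply voltage."""
    return f_nom * vdd / v_nom

def task_energy(cycles, c_eff, vdd):
    """Switching energy: cycles * C_eff * Vdd^2 (leakage ignored)."""
    return cycles * c_eff * vdd ** 2

def task_runtime(cycles, vdd, v_nom, f_nom):
    return cycles / max_frequency(vdd, v_nom, f_nom)

# Assumed workload: 1e9 cycles, C_eff = 1 nF, nominal 1.2 V at 1 GHz.
for vdd in (1.2, 0.9, 0.6):
    print(vdd, task_energy(1e9, 1e-9, vdd), task_runtime(1e9, vdd, 1.2, 1e9))
```

Halving the supply voltage in this model quarters the switching energy of the task but
doubles its runtime.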
2.6.7 Dual Threshold Voltage
It may be expected that dynamic voltage scaling will always reduce dynamic power
dissipation in a long period of operating time since when the workload is reduced the supply
voltage could also be reduced to save power. However, in deep submicron technologies, we
need to begin to take leakage power consumption into account as well, especially as
designs with dual threshold voltage cells are becoming widely adopted. A recent report [43]
gives a comprehensive analysis of the consequences of applying Dynamic Voltage Scaling
(DVS) to dual Vt cell design. A typical design scenario could be as follows. In the initial
design, all cells used are low leakage (LL) cells to minimize power consumption. Then we
could replace cells along critical paths by high-speed (HS) cells in order to shorten the delay.
Thus we get a mixed-cell design. Now DVS could be used to further decrease dynamic
power dissipation when a larger delay could be tolerated. When applying DVS, we only
get power gains if total power consumption in the mixed cell design is less than that in the
original single LL cell design. According to the author M. Hans, subthreshold leakage is
supply voltage dependent. An example [43] shows that after applying DVS the dynamic
power dissipation is indeed decreased. However, the leakage consumption is increased at the
same time. This tells us that DVS could have a negative impact on leakage consumption
under certain circumstances and thus careful analysis needs to be done before making design
choices. The literature [10, 25, 33, 41, 52, 54, 70, 81, 95, 101, 107] gives more information on
different experiments that have been tried using DVS, transistor sizing, and dual threshold
voltage techniques.
2.6.8 Dynamic Power Cutoff Technique (DPCT)
Dynamic Power Cutoff Technique (DPCT) is an active leakage power reduction technique.
In this technique, first the switching window for each gate, during which a gate
makes its transitions, is identified by static timing analysis. Then, the circuit is optimally
partitioned into different groups based on the minimal switching window (MSW) of each
gate. Finally, power cutoff transistors are inserted into each group to control the power
connections of that group. Each group is turned on only long enough for a wavefront of
changing signals to propagate through that group. Since each gate is only turned on during a
small timing window within each clock cycle, this significantly reduces active leakage power.
This technique can also save standby leakage and dynamic power. Results on ISCAS'85
benchmark circuits modeled using 70nm Berkeley Predictive Models show up to 90% active
leakage, 99% standby leakage, 54% dynamic power, and 72% total power savings [115, 116].
However, this technique being new, its effect on the noise margin, power grid design and
layout is still not known.
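The first step of the technique, identifying a switching window for each gate, can be sketched
with a simple static timing pass. The Python example below is an illustration under stated
assumptions and is not the procedure of [115, 116]: it assumes fixed gate delays, primary
inputs that all change at time zero, and a window that runs from the earliest possible input
change to the latest possible output change. The netlist is hypothetical.

```python
# Sketch: per-gate switching windows from a simple static timing pass.
def switching_windows(gates, primary_inputs):
    """gates maps name -> (delay, fanins) and must be in topological order."""
    earliest = {pi: 0.0 for pi in primary_inputs}   # earliest output change
    latest = {pi: 0.0 for pi in primary_inputs}     # latest output change
    windows = {}
    for name, (delay, fanins) in gates.items():
        start = min(earliest[f] for f in fanins)    # first input may change
        earliest[name] = start + delay
        latest[name] = max(latest[f] for f in fanins) + delay
        windows[name] = (start, latest[name])
    return windows

# Hypothetical three-gate netlist with primary inputs a, b, c.
gates = {
    "g1": (1.0, ["a", "b"]),
    "g2": (1.0, ["g1", "c"]),
    "g3": (2.0, ["g2", "g1"]),
}
windows = switching_windows(gates, ["a", "b", "c"])
print(windows)
```

Gates whose windows coincide or overlap closely would be natural candidates for sharing one
power cutoff transistor; how the grouping is optimized is beyond this sketch.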
2.6.9 Retiming
Retiming is a classic logic optimization technique for synchronous circuits. Retiming
is the technique of moving the structural location of latches or registers in a digital circuit
to improve its performance, area, and/or power characteristics in such a way that preserves
its functional behavior at its outputs. When originally introduced [58, 59, 60], its emphasis
was on the application to systolic systems. A subsequent paper [61] fully revisits the
concept of retiming and shows how generic synchronous circuits can benefit from it under
three main optimality criteria: (1) minimize the circuit clock period by adding/removing
storage elements, (2) minimize the circuit area by reducing the number of storage elements,
and (3) minimize circuit area under a maximum clock-period constraint. In the last two
decades, retiming has been adopted as a key optimization technique within every major
logic synthesis tool both in academia and industry.
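The mechanics of retiming can be sketched on a small graph model. In the usual formulation,
each edge from u to v carries w registers, and a retiming assigns an integer lag r to every
node so that the edge weight becomes w + r(v) - r(u). The Python example below applies this
rule to a hypothetical three-node ring with assumed node delays; it does not search for an
optimal retiming and assumes there is no register-free cycle.

```python
# Sketch: apply a retiming r to a graph of (u, v, registers) edges.
def retime(edges, r):
    """Retimed register count on each edge: w + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

def is_legal(edges):
    """A retiming is legal if no edge ends up with negative registers."""
    return all(w >= 0 for _, _, w in edges)

def clock_period(edges, delay):
    """Longest combinational delay along register-free (weight 0) edges."""
    zero = {}
    for u, v, w in edges:
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}
    def longest(node):
        if node not in memo:
            tail = max((longest(n) for n in zero.get(node, [])), default=0)
            memo[node] = delay[node] + tail
        return memo[node]
    return max(longest(n) for n in delay)

# Hypothetical ring a -> b -> c -> a with two registers on the feedback edge.
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]
delay = {"a": 1, "b": 2, "c": 3}
retimed = retime(edges, {"a": 0, "b": 0, "c": 1})
print(clock_period(edges, delay), clock_period(retimed, delay), retimed)
```

Moving one register from the output of c to its input shortens the longest register-free path
from 6 to 3 delay units while the number of registers around the loop stays at two.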
Other circuit-level techniques have been discussed by several authors [26, 30, 63, 100].
2.7 GATE LEVEL
As in the case of the circuit level, there are gate-level techniques that can be applied
successfully to reduce power consumption. Once again these techniques reflect the themes
of trading performance and area for power, avoiding waste, and exploiting locality. In this
section we discuss a number of gate-level techniques and give some quantitative indication
of their impact on power. In particular, this section presents techniques for technology
mapping, glitching and activity reduction, input vector control technique and methods for
exploiting concurrency and redundancy in the context of low-power design.
2.7.1 Technology Decomposition and Mapping
Technology decomposition and mapping refers to the process of transforming a gate-level
Boolean description of a logic network into a CMOS circuit. For a given gate-level
network there may be many possible circuit-level implementations. For instance, a three-input
NAND can be implemented as a single complex CMOS gate or as a cascade of simpler
two-input gates. Each mapping may result in different signal activities, as well as physical
capacitances. For example, complex gates tend to exhibit an overall lower physical capacitance
since more signals are confined to internal nodes rather than to the more heavily
loaded output nodes. The concept of technology mapping for low power is to first decompose
the Boolean network such that switching activity is minimized, and then to hide any
high-activity nodes inside complex CMOS gates. In this way, rapidly switching signals are
mapped to the low-capacitance internal nodes, thereby reducing power consumption. Making
a gate too complex, however, can slow the circuit, resulting in a trade-off of performance
for power. Several technology mapping algorithms for low power have been developed and
offer an average power reduction of 12% [102] to 21% [104].
2.7.2 Activity Postponement
While technology mapping attempts to minimize the activity-capacitance product,
other gate-level strategies focus on reducing activity alone. For example, an operation
as simple as reordering the inputs to a boolean network can in some cases reduce the total
network activity (see Figure 2.11) [56]. The basic concept is to postpone introduction of
high activity signals as long as possible. In this way, the fewest gates are affected by the
rapidly switching signals.
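The effect of input ordering can be sketched with a simple probabilistic estimate. The example below is only an illustration under stated assumptions: inputs are statistically independent, gates have zero delay, and a node that is 1 with probability p toggles with probability 2p(1-p) per cycle. The function names and the signal probabilities are hypothetical and are not taken from Figure 2.11.

```python
from itertools import permutations


def toggle_rate(p):
    """Per-cycle toggle probability of a node that is 1 with probability p,
    assuming values in consecutive cycles are independent."""
    return 2.0 * p * (1.0 - p)


def and_chain_activity(input_probs):
    """Sum of toggle rates at the gate outputs of a chain of 2-input AND gates.

    input_probs[0] and input_probs[1] feed the first gate; each later input
    feeds the next gate together with the previous gate's output.
    """
    p_out = input_probs[0]
    total = 0.0
    for p_in in input_probs[1:]:
        p_out *= p_in  # AND of independent signals
        total += toggle_rate(p_out)
    return total


# Hypothetical signal probabilities; "a" has the highest toggle rate (0.5).
probs = {"a": 0.5, "b": 0.2, "c": 0.1}
for order in permutations(probs):
    activity = and_chain_activity([probs[name] for name in order])
    print("".join(order), round(activity, 4))
```

Under these assumptions, the orderings that feed the most active input ("a") into the last gate give the lowest total activity, which is the postponement idea described above.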
Figure 2.11: Input Reordering for activity reduction
2.7.3 Glitch Reduction
Other gate-level activity reduction techniques focus on avoiding the wasted transitions
associated with glitching. Figure 2.12 shows two implementations of the same logic function.
One implementation employs a balanced tree structure, while the other uses a cascaded
gate structure. If we assume equal input arrival times and gate delays, we find that the
cascaded structure undergoes many more transitions than the tree structure before settling
at its steady-state value. In particular, the arrival of the inputs may trigger a transition
at the output of each of the gates. These output transitions may in turn trigger additional
transitions for the gates within their fan-out. This reasoning leads to an upper bound
on glitching that is $O(N^2)$, where N is the depth of the logic network. In contrast, the
path delays in the tree structure are all balanced, and therefore each node makes a single
transition and no power is wasted. This concept can be extended to derive optimum tree
structures for the case of unequal arrival times as well [75]. Some studies have suggested
that eliminating glitching in static circuits could reduce power consumption by as much as
15-20% [17].
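The difference between the two structures can be reproduced with a small unit-delay simulation. This is a minimal sketch rather than a model of the circuits in Figure 2.12: it assumes two-input XOR gates, a delay of one time step per gate, and all inputs changing at the same instant. The helper names (`simulate`, `xor_chain`, `xor_tree`) and the stimulus are illustrative assumptions.

```python
def simulate(gates, inputs_before, inputs_after):
    """Unit-delay simulation; returns the total number of gate-output transitions.

    gates: list of (name, src_a, src_b) XOR gates in topological order.
    Sources are names of primary inputs or of earlier gates.
    """
    # Settle the network on the old inputs.
    values = dict(inputs_before)
    for name, a, b in gates:
        values[name] = values[a] ^ values[b]
    # Apply all new inputs at the same instant.
    values.update(inputs_after)
    transitions = 0
    while True:
        new_values = dict(values)
        for name, a, b in gates:
            # Each gate sees its inputs as they were one time step earlier.
            new_values[name] = values[a] ^ values[b]
        changed = sum(new_values[n] != values[n] for n, _, _ in gates)
        if changed == 0:
            return transitions
        transitions += changed
        values = new_values


def xor_chain(n):
    """Cascaded structure: g1 = x0^x1, gk = g(k-1)^xk."""
    gates = [("g1", "x0", "x1")]
    for k in range(2, n):
        gates.append((f"g{k}", f"g{k - 1}", f"x{k}"))
    return gates


def xor_tree(n):
    """Balanced tree; n must be a power of two."""
    level = [f"x{i}" for i in range(n)]
    gates = []
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            name = f"t{len(gates)}"
            gates.append((name, level[i], level[i + 1]))
            next_level.append(name)
        level = next_level
    return gates


before = {f"x{i}": 0 for i in range(8)}
after = {f"x{i}": 1 for i in range(8)}
after["x7"] = 0  # seven of the eight inputs rise together

print("cascaded:", simulate(xor_chain(8), before, after))
print("tree:    ", simulate(xor_tree(8), before, after))
```

For this stimulus the cascaded structure makes 20 output transitions while the balanced tree makes 3, one per gate on the path from the odd input pair to the output.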
Techniques for reducing glitch power have been described by several authors [5, 6, 49,
50, 66, 67, 68, 69, 86, 87, 88, 89, 90, 105, 106].
Figure 2.12: Cascaded versus Balanced tree gate structures
A mixed integer linear programming (MILP) technique [68] is used to minimize the
leakage as well as glitch power consumption of a static CMOS circuit for any specified
input to output delay. Using dual-threshold devices, the number of high-threshold devices is
maximized and a minimum number of delay elements are inserted to reduce the differential
path delays below the inertial delays of incident gates. The key features of this method
are that the constraint set size for the MILP model is linear in the circuit size and that a
power-performance trade-off is allowed. Experimental results for this technique show 96%, 40%
and 70% reductions of leakage power, dynamic power and total power, respectively, for the
benchmark circuit C7552 implemented in the 70nm BPTM CMOS technology.
In a CMOS circuit, energy consumption per signal transition at a node with capacitance
C is $0.5CV^2$. Keeping the gate delays internal to standard cells fixed, the authors of [106]
determine the routing delays necessary to eliminate all glitches by either path delay
balancing or inertial filtering. To implement these delays they insert the required amounts
of resistance as customized feedthrough cells. In spite of the increased resistance in the
circuit, the overall power is reduced because the resistive delays suppress glitches without
increasing the $0.5CV^2$ energy per transition, and no increase in the critical path delay is
incurred. For the ISCAS '85 benchmark circuit c2670, a 30% saving in
average power consumption has been achieved with a 14% increase in chip area.
Table 2.1: Leakage current values for different input combinations of a 3-input NAND gate
Input State   Subthreshold Leakage (nA)   Gate Leakage (nA)   Total Leakage (nA)
000           0.49                        6.58                7.07
001           0.81                        19.68               21.49
010           0.81                        6.79                7.60
011           2.68                        34.78               37.46
100           0.81                        3.15                3.96
101           2.68                        16.8                19.48
110           2.68                        1.84                4.52
111           16.85                       45.3                62.15
2.7.4 Input Vector Control
Much research has been done to model and estimate the nominal leakage current of a
circuit. Due to the stacking effect, the leakage of a circuit depends on its input combination [3].
Table 2.1 shows the leakage components for all input combinations of a 3-input
NAND gate.
Thus, by finding a minimum leakage vector (MLV) and applying it to the circuit, we can
put the circuit into its low-leakage state. Abdollahi et al. [3] have proposed a
technique to identify the MLV and discussed several ways to apply the vector to the circuit.
They first construct a Boolean network to compute the total leakage and then use
a SAT solver to find an input vector that results in the minimum leakage of the whole
circuit. However, the effectiveness of the method does not rely solely on finding the MLV.
Due to the limited access to internal nodes, it is very difficult to put the ideal value on
every node, especially for very complex circuits. Therefore the authors tried two ways to
increase controllability of the internal nodes, namely adding multiplexers and modifying
gates. Since both methods change the circuit to some extent, new Boolean networks need
to be constructed. Although the authors found an efficient way to identify the MLV and
ways to increase controllability, the actual application of
input vector control is still limited by the primary inputs. Moreover, returning a sleeping circuit
to normal operation requires extra memory elements to store the original states, thus incurring
both area and delay penalties. Another controllability-increasing method is given by Rahman
and Chakrabarti [85].
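For a single gate, the MLV can be read directly from Table 2.1. The sketch below does this by exhaustive search over the eight input states; the total-leakage values are copied from the table, and the function name is illustrative. Exhaustive search grows exponentially with the number of inputs, so it only illustrates the idea and is not the SAT-based method of [3].

```python
# Total leakage (nA) per input state of the 3-input NAND gate, from Table 2.1.
TOTAL_LEAKAGE_NA = {
    "000": 7.07, "001": 21.49, "010": 7.60, "011": 37.46,
    "100": 3.96, "101": 19.48, "110": 4.52, "111": 62.15,
}


def minimum_leakage_vector(leakage_by_vector):
    """Return the (vector, leakage) pair with the smallest leakage."""
    return min(leakage_by_vector.items(), key=lambda item: item[1])


mlv, leakage = minimum_leakage_vector(TOTAL_LEAKAGE_NA)
worst = max(TOTAL_LEAKAGE_NA.values())
print("MLV:", mlv, "leakage (nA):", leakage)
print("worst-case / MLV leakage ratio:", round(worst / leakage, 1))
```

With the values of Table 2.1, the minimum leakage vector is 100 (3.96 nA), roughly a factor of 16 below the worst-case state 111.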
2.7.5 Concurrency and Redundancy
The final technique discussed in this section is the use of concurrency and redundancy
at the gate level in low-power design. The principal concept is to use concurrency and
redundancy to improve performance and then to trade this performance for lower power by
reducing the supply voltage. In some sense, path balancing, which was found to be useful for
reducing glitching activity, can be thought of as a form of gate-level concurrent processing.
Referring to Figure 2.12, the path-balanced tree structure is characterized by several logic
gates computing in parallel, with the gates in their fan-out combining the results. In
contrast, for the linear, cascaded structure, computations must take place sequentially since
the results from the initial stages are needed to compute the output of the later stages. So
by using concurrency, the tree structure achieves a shorter critical path than the cascaded
structure: quantitatively, logarithmic as opposed to linear. This reduced critical path can
be used to improve performance, or this performance can be traded for power by reducing
the operating voltage until the delay of the logarithmic structure matches that of the linear
structure.
The majority of the techniques employing concurrency or redundancy incur an
inherent penalty in area, as well as in physical capacitance and switching activity. At first
glance, a carry-select adder with 50% more physical capacitance and activity than a ripple-carry
adder might not seem low power at all. The key concept is to identify the design
paradigm under which you are working: fixed voltage or variable voltage. If the voltage is
allowed to vary, then it is typically worthwhile to accept increased physical capacitance
and activity for the quadratic power improvement offered by reduced voltage. If, however,
the system voltage has been fixed, then there is nothing gained by employing a carry-select
adder in place of a ripple-carry adder, unless the slower adder does not meet the timing
constraints. In such a situation it is better to use the least complex adder that meets the
performance requirements. This falls under the category of avoiding waste [76].
2.8 ARCHITECTURE AND SYSTEM LEVELS
This chapter has repeatedly suggested that decisions made at a high level (architecture
or system) will have a much larger impact on power than those made at a lower level (e.g.,
gate or circuit). This section provides some evidence to support this claim. In the terminology
of this thesis, architecture refers to the register-transfer (RT) level of abstraction,
where the primitives are blocks such as multipliers, adders, memories, and controllers. This
level of abstraction is also referred to as the micro-architecture level. Having defined the
terminology, this section discusses architectural or RT-level techniques for reducing power
consumption.
2.8.1 Concurrent Processing
Perhaps the most important strategy for reducing power consumption involves employing
concurrent processing at the architecture level. This is a direct trade-off of area and
performance for power. In other words, the designer applies some well-known technique
for improving performance such as parallelism or pipelining, and then swaps this higher
performance for lower power by reducing the supply voltage.
Figure 2.13: Voltage Scaling and Parallelism for low power
PARALLELISM
As a quantitative example, consider the use of parallelism to perform some complex
operation, A (see Figure 2.13(a)). The registers supplying operands and storing results for
A are clocked at a frequency f. Further assume that algorithmic and data dependency
constraints do not prevent concurrency in the calculations performed by A. When the
computation of A is parallelized, Figure 2.13(b) results. The hardware comprising block A
has been duplicated N times, resulting in N identical processors. Since there are now N
processors, a throughput equal to that of the sequential processor, A, can be maintained with a
clocking frequency N times lower than that of A. That is, although each block will produce
a result only 1/Nth as frequently as processor A, there are N such processors producing
results. Consequently, identical throughput is maintained.
The key to this architecture's utility as a power-saving configuration lies in this factor
of N reduction in clocking frequency. In particular, with a clocking frequency of f/N, each
individual processor can run N times slower. Since, to the first order, delay is roughly
inversely proportional to supply voltage, this corresponds to a possible factor of N reduction in supply
voltage. Examining the power consumption relative to the single processor configuration,
we see that capacitances have increased by a factor of N (due to hardware duplication),
while frequency and supply voltage have been reduced by the same factor. Thus, since
$P = CV^2 f$, power consumption is reduced by the square of the concurrency factor, N:
N-way concurrency: $P \propto 1/N^2$    (2.18)
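Equation (2.18) follows from substituting the scaled quantities into $P = CV^2 f$. The sketch below works through the arithmetic with normalized values; the function names and the optional `overhead` parameter (a fractional capacitance increase for signal distribution and merging) are hypothetical, and the first-order voltage scaling is the same simplification used in the text.

```python
def dynamic_power(c, v, f):
    """Dynamic power P = C * V^2 * f."""
    return c * v ** 2 * f


def parallel_power(c, v, f, n, overhead=0.0):
    """First-order power of an N-way parallel design at unchanged throughput.

    Capacitance grows by N (times an assumed overhead factor), while clock
    frequency and supply voltage are both reduced by N.
    """
    return dynamic_power(n * c * (1.0 + overhead), v / n, f / n)


base = dynamic_power(1.0, 1.0, 1.0)  # normalized single-processor design
for n in (1, 2, 4):
    print(n, parallel_power(1.0, 1.0, 1.0, n) / base)
```

With no overhead the ratios are 1, 1/4 and 1/16 for N = 1, 2 and 4. A nonzero `overhead` value shows how distribution and merging logic erode part of this saving.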
Hardware parallelism also has its disadvantages. For instance, complete hardware
duplication entails a severe area penalty. In addition, there is hardware and interconnect
overhead related to signal distribution at the processor inputs and signal merging at the
outputs. These contribute to increased power consumption and tend to limit the utility of
excessive parallelism. Also, the area requirements of full parallelism can be a limiting factor.
Still, other forms of concurrent processing offer some of the power savings of parallelism at
reduced cost.
PIPELINING
Pipelining is another form of concurrent computation that can be exploited for power
reduction. An example of pipelining for low power is shown in Figure 2.14(b). In this
situation, rather than duplicating hardware, concurrency is achieved by inserting pipeline
registers, resulting in an N-stage pipelined version of processor A (assuming processor A
can be pipelined to this extent). In this implementation, maintaining throughput requires
that we sustain clocking frequency, f. Ignoring the overhead of the pipeline registers, the
capacitance, C, also remains constant. The advantage of this configuration is derived from
the greatly reduced computational requirements between pipeline registers. Rather than
performing the entire computation, A, within one clock cycle, only 1/Nth of A need be calculated
per clock cycle. This allows a factor of N reduction in supply voltage and, considering
the constant C and f terms, the dynamic power consumption is reduced by a factor of $N^2$.
Figure 2.14: Voltage Scaling and Pipelining for low power
Thus, for both concurrency techniques, pipelining and parallelism, the supply voltage is
reduced by a factor of N, resulting in a first-order quadratic reduction in power consumption. As with
parallelism, pipelining incurs some overhead, though not nearly as much. In a pipelined
processor, for example, the pipeline registers represent an overhead in both power and
area. Each register must also be clocked, adding to the capacitive loading of the clock
network. As the pipeline depth increases, the area and capacitance associated with the
pipeline registers approach those of the actual processor stages. At that point, further
pipelining becomes unattractive. Still, the overhead associated with pipelining is typically
much less than that of parallelism, making it a useful form of concurrent processing for
power minimization.
Many other techniques have been combined with pipelining to reduce power in
processors. Software constitutes a very important part of today's systems, and a
large portion of system functionality is in the form of instructions as opposed
to gates, so analyzing power consumption from the point of view of instructions is equally
important.
Pipelined processors frequently insert NOP (no-operation) instructions into the pipeline to
generate delay or resolve dependencies. A NOP takes only one clock cycle
and consumes the least power of any instruction, but
its energy varies significantly depending upon its position in the program.
Reference [77] presents an approach for power virus generation using behavioral models of digital
circuits. The technique automatically converts the given behavioral model to an
integer (word-level) constraint model and employs an integer constraint solver to generate
the required power virus vectors. Experimental results on the DLX processor yielded two
power virus sequences that consumed maximum power. NOPs were predominant
in these sequences, which shows that NOP energy varies significantly depending upon its
position in the program as well as the instructions preceding and succeeding it.
P. Kamran et al. [65] describe a technique to optimize dynamic power consumption
by eliminating useless transitions that are generated by the pipeline when a stall happens.
For the NOP instruction to generate as few transitions as possible, the data part of the
instruction is kept the same as that of the preceding instruction. In this way, as a NOP instruction
passes through the pipeline, the same operations are performed on the
same data as in the previous cycle in all stages of the pipeline; therefore only a small number of
transitions is generated as a result of the NOP insertion and propagation. With this technique the authors
demonstrate up to 10% reduction in power consumption for some benchmarks at the cost of
negligible performance and area overhead (below 0.1%).
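The benefit of keeping the preceding instruction's data during a stall can be illustrated by counting bit toggles at a single pipeline register. This is a simplified sketch, not the implementation of [65]: it models only the data field of one register, and the operand values and function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def register_toggles(words):
    """Total bit toggles seen by a register that latches the words in order."""
    return sum(hamming(prev, cur) for prev, cur in zip(words, words[1:]))


def insert_bubble(operands, stall_after, hold_previous):
    """Insert one stall bubble after index stall_after.

    The bubble's data field is either all zeros or a copy of the
    preceding operand.
    """
    bubble = operands[stall_after] if hold_previous else 0
    return operands[:stall_after + 1] + [bubble] + operands[stall_after + 1:]


operands = [0xA5A5, 0xA5F0, 0xA50F]  # hypothetical 16-bit data fields
zeroed = register_toggles(insert_bubble(operands, 0, hold_previous=False))
held = register_toggles(insert_bubble(operands, 0, hold_previous=True))
print("zeroed bubble:", zeroed, "toggles; held data:", held, "toggles")
```

For these operands the zeroed bubble causes 24 bit toggles and the held-data bubble 12, because the bubble itself adds no transitions when it repeats the previous data.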
2.8.2 Power Management
Any power consumed by a system that does not lead to useful results is wasted. For
previous generations of chips, where power was not a major concern, this source of waste
was typically ignored. Strategies for avoiding wasted power in systems are often called
power management strategies. Some power management techniques available to designers
include selective power down, sleep mode, and adaptive clocking/voltage schemes. Selective
power down refers to deactivating processing modules that are not doing any useful work.
A selective power down strategy requires additional logic to monitor the activity of the
various modules within a system. Control lines signaling idle periods are then used to gate
the clocks to the various system modules. As a result, idle modules are not clocked and no
power is wasted. Some care must be taken in applying this strategy to dynamic circuits,
for which clocking is used to maintain information stored on dynamic nodes. Two of the
other methods that could be considered instead of complete power down are as follows:
CLOCK GATING:
Instead of switching off the power supply, the clock signal may be halted in idle devices.
This reduces switching activity, and therefore dynamic power consumption, to zero. Inserting
clock gates is less intrusive to the design than power supply shutdown and
is applicable where power shutdown is not an option. Clock gating does not
reduce power dissipation to zero, since leakage power is unaffected. The designer has to take
into account that the gate increases clock skew and makes testing more complicated. Lastly,
glitches on the gate's control signal must be prevented. For example, a
glitch could cause a temporary false clock turn-off/on, which might add an extra rising
edge to the clock signal behind the gate, so that the behavior of the circuit is not
preserved.
ENABLED FLIP-FLOPS:
Just as clock gating can be seen as a softer alternative to power supply shutdown, enabled
flip-flops are the next less aggressive (and less effective) strategy. Registers are replaced by
counterparts that have an enable signal. When enabled, they behave like
ordinary registers. When disabled, the flip-flops' outputs do not change, which reduces switching
activity in the circuit. The most active signal, the clock, is still active though, resulting in a
great deal of power dissipation. Power management based on enabled
flip-flops can therefore be beneficial, but an implementation based on gated clocks is fundamentally
superior.
LOW-POWER MODES:
Sleep mode is an extension of the selective power-down strategy. Here, the activity of
the entire system is monitored rather than that of the individual modules. If the system
has been idle for some predetermined time-out duration, then the entire system is shut
down and enters what is known as sleep mode. During sleep mode the system inputs are
monitored for activity, which will then trigger the system to wake up and resume processing.
Since there is some overhead in time and power associated with entering and leaving sleep
mode, there are some trade-offs to be made in setting the length of the desired time-out
period.
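One common way to reason about this trade-off is a break-even idle time: sleeping pays off only if the idle period is long enough for the saved power to recover the energy spent entering and leaving sleep mode. The sketch below is an illustration under assumed, hypothetical numbers; it lumps the entry and exit cost into a single transition energy and ignores wake-up latency.

```python
def break_even_idle_time(p_idle_w, p_sleep_w, e_transition_j):
    """Idle duration (s) beyond which sleeping saves energy:
    E_transition / (P_idle - P_sleep)."""
    if p_idle_w <= p_sleep_w:
        raise ValueError("sleep mode must draw less power than idle mode")
    return e_transition_j / (p_idle_w - p_sleep_w)


def energy_with_timeout(idle_s, timeout_s, p_idle_w, p_sleep_w, e_transition_j):
    """Energy (J) spent over one idle period when sleep starts after timeout_s."""
    if idle_s <= timeout_s:
        return p_idle_w * idle_s  # the system never went to sleep
    return (p_idle_w * timeout_s + e_transition_j
            + p_sleep_w * (idle_s - timeout_s))


# Hypothetical figures: 1 W idle, 0.25 W asleep, 1.5 J to enter and leave sleep.
print(break_even_idle_time(1.0, 0.25, 1.5))            # break-even idle time in s
print(energy_with_timeout(10.0, 0.0, 1.0, 0.25, 1.5))  # long idle, sleep at once
print(energy_with_timeout(1.0, 0.0, 1.0, 0.25, 1.5))   # short idle, sleep at once
```

With these numbers the break-even time is 2 s: sleeping immediately costs 4 J instead of 10 J over a 10 s idle period, but 1.75 J instead of 1 J over a 1 s idle period.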
To reduce leakage power, drowsy mode is another option under the power-down strategy.
When in drowsy mode, the information in a cache line is preserved; however, the line must
be reinstated to a high-power mode before its contents can be accessed. A recent paper [40]
shows that with simple architectural techniques, about 80%-90% of the cache lines can be
maintained in a drowsy state without affecting performance by more than 1%.
At ISSCC '94, Intel, MIPS Technologies, and IBM all reported microprocessors that
include selective power down and sleep-mode capabilities [14, 83, 94, 114]. IBM estimated
that selective clocking saved 12-30% in the power consumption of their 80MHz superscalar
PowerPC architecture [83]. In addition, Intel estimated that the combination of
both techniques resulted in an overall 2x reduction in average power when running typical
applications on their design [94].
Another power management strategy involves adapting clocking frequency and/or sup-
ply voltage to meet system performance requirements. Since the performance requirements
of a system typically vary over time as the task it is performing changes, it is wasteful
to run the system at maximum performance, even when a minimum of compute power is
required. Adapting the clocking frequency or supply voltage of the system to reduce performance
(and power) during these periods can result in substantial power savings. Since it
is difficult to directly measure how much performance is actually required at a given point
in time, some indirect performance feedback scheme must be devised. This can be in the
form of a clock slow-down instruction issued by the software application. MIPS Technologies
takes this approach in their 64b RISC processor, achieving a 4x reduction in power through
reduced clock frequency [114].
To summarize, without careful management, large amounts of system power can be
wasted by continuing computations during idle periods. A power-conscious system will
avoid this source of waste either by powering down inactive modules or processors or by
adapting the processing power of the system to meet, rather than exceed, the current
requirements.
Many other and similar power saving techniques at the architectural level have been
discussed by several authors [19, 32, 37, 47, 48, 57, 62, 71, 72, 73, 79, 91, 92, 96, 97, 99, 113].
2.8.3 Memory Partitioning
Farrahi et al. [39] propose a memory partitioning (also called segmentation) scheme
that reduces power by exposing idleness in memory access. The function of memory is
to store data when it is written and return it when read. Farrahi et al. suggest viewing memory
not as a monolithic resource but as a collection of independent memory segments. Each
segment has its own clock and refresh signals. Whenever a memory segment is idle, it can
be put in a sleep mode where the clock is halted or no refreshes are transmitted. Memory
is idle when no useful information is stored in it. It should be kept in mind that memory
is not necessarily idle just because it is not being accessed: it might store vital information
which would be lost if the memory were turned off, although it might also store only
unimportant information. A lifetime
can be assigned to each variable in a memory element. It defines a time interval which
starts when the variable is written and ends when the variable is last read. A segment is
called idle when it contains no live variables. The partitioning technique attempts to store
variables which have overlapping lifetimes in the same segment. Due to this approach, the idle
time of memory segments is increased and power dissipation is reduced.
As with processing
hardware, a distributed array of small local memories is often more power efficient than
a shared, global memory subsystem. In particular, the energy consumed in accessing a
memory is approximately proportional to the number of words stored in that memory. If
this number can be reduced by partitioning the memory, then the total power associated
with memory accesses will also be reduced. For example, assume 10 independent processors
need to access one piece of data each. A single, shared memory will need to contain 10
words and each access will take 10 energy units for a total of 100 units of energy. On the
other hand, if each processor utilizes a local memory containing the one piece of data it
requires, then each of the 10 accesses will consume only one unit of energy for a total of
10 units, a factor of 10 savings in power. This example, though idealized, suggests the
advantage of local or distributed memory structures over global or shared memories.
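The arithmetic of this idealized example can be written out directly. The sketch below assumes, as the text does, that the energy of one access is proportional to the number of words in the memory being accessed; the function names and the unit energy per word are illustrative.

```python
def access_energy(words_in_memory, energy_per_word=1.0):
    """Energy of one access, assumed proportional to the memory size in words."""
    return words_in_memory * energy_per_word


def shared_memory_energy(num_processors):
    """One shared memory holds one word per processor; each processor
    makes a single access."""
    return num_processors * access_energy(num_processors)


def local_memory_energy(num_processors):
    """Each processor makes a single access to its own one-word memory."""
    return num_processors * access_energy(1)


print(shared_memory_energy(10), local_memory_energy(10))
```

For 10 processors this gives 100 energy units for the shared memory and 10 units for the local memories, matching the factor of 10 quoted above.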
2.8.4 Programmability
The previous section suggested that distributed processors, which can be optimized
for specific tasks, might consume less power than general-purpose, programmable processors
that must execute many different tasks. This observation was made in the context
of distributed versus centralized processing; however, it brings up the important issue of
programmable versus dedicated hardware. As an example, consider the implementation of
a linear, time-invariant filter. Such filters involve multiplication of variables by constants.
All required constant multiplications could be implemented on a single array multiplier.
This is a programmable scenario since the multiplier can be programmed to execute any of
the different constant multiplications on the same hardware. Alternatively, we can consider
implementing each of the constant multiplications on dedicated hardware. In this case, since
the multipliers each need to perform multiplication by a single, fixed coefficient, the array
multipliers can be reduced to add-shift operations, with an adder required only for the 1 bits
in the coefficient and the shifting implemented by routing. For coefficients with relatively
few 1's, this dedicated implementation can result in greatly reduced power consumption,
since the waste and overhead associated with the programmable array multiplier are avoided.
This approach was taken by Chandrakasan for the implementation of a video color space
converter for translating YIQ images to RGB [29]. The algorithm consists of a 3x3 constant
matrix multiplication (i.e., nine multiplications by constant coefficients). Chandrakasan not
only replaced the array multipliers with dedicated add-shift hardware, but also scaled the
coefficients to minimize the number of 1 bits in the algorithm. The resulting chip consumed
only 1.5 mW at 1.1V. As this example demonstrates, avoiding excessive or unnecessary
programmability in hardware implementations can lead to significant reductions in power.
Programmability does, however, offer some advantages. For instance, dedicated hardware
typically imposes some area penalty. In the above case, the nine multiplications would each
require their own unique hardware blocks. In contrast, in a programmable implementation a
single array multiplier could be used to perform all nine multiplications. Another advantage
of programmable hardware is that it simplifies the process of making design revisions and
extensions. The behavior of a programmable component can be altered merely by changing
the instructions issued by the software driving the device. A version of the chip relying
primarily on dedicated hardware, however, might require extensive redesign efforts.
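The reduction of a constant multiplication to add-shift operations can be sketched in software. The example below is only an illustration of the idea and not the hardware of [29]: each 1 bit of the coefficient contributes one shifted copy of the input, and the adder count assumes the shifted terms are summed with two-input adders. The function names are hypothetical.

```python
def shift_add_terms(coefficient):
    """Bit positions of the 1 bits in a non-negative constant coefficient."""
    return [i for i in range(coefficient.bit_length()) if (coefficient >> i) & 1]


def multiply_by_constant(x, coefficient):
    """Multiply using only shifts (plain wiring in hardware) and additions."""
    return sum(x << shift for shift in shift_add_terms(coefficient))


def adders_needed(coefficient):
    """Two-input adders needed to sum the shifted terms (an assumed cost model)."""
    return max(len(shift_add_terms(coefficient)) - 1, 0)


for coeff in (0b1010, 0b10000001, 0b11111111):
    print(bin(coeff), multiply_by_constant(13, coeff), adders_needed(coeff))
```

Under this cost model a coefficient such as 10000001 needs a single adder while 11111111 needs seven, which is why scaling the coefficients to reduce the number of 1 bits lowers the hardware cost.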
2.8.5 Data Representation
Another architectural decision that can impact power consumption is the choice of
data representation. In making this decision, the designer typically has several different
alternatives from which to choose, e.g., fixed-point vs. floating-point, sign-magnitude vs.
two's-complement, and uncoded vs. encoded data. Each of these decisions involves a trade-off
in accuracy, ease of design, performance, and power. This section discusses some of the
issues involved in selecting a data representation for low power.
The most obvious trade-off involves deciding upon a fixed- or floating-point representation.
Fixed-point offers the
minimum hardware requirements and, therefore, exhibits the lowest power consumption
of the two. Unfortunately, it also suffers the most from dynamic range limitations, and
the data scaling needed to compensate must be incorporated into the processor micro-code,
which results in some runtime overhead.
Floating-point, in contrast, alleviates the dynamic range difficulties at the expense
of extensive hardware additions. This increased hardware leads to correspondingly higher
capacitances and power consumption. As a result, floating-point should be selected only
when absolutely required by dynamic range considerations. Decisions involving selection of
the word length for the datapath and the accuracy of the floating point should be taken
after a careful analysis of the requirements of an application, rather than being driven by
the desire for an efficient low-power implementation.
Aside from issues of accuracy and word length, the
designer must also select an arithmetic representation for the data. For example, two's-complement,
sign-magnitude, and canonical signed-digit are all possible arithmetic representations
for data. Two's-complement is the most amenable to arithmetic computation
and, therefore, is the most widely used. In this representation, the least significant bits
(LSBs) are data bits, while the most significant bits (MSBs) are sign bits. As a result, the
MSBs contain redundant information, which can lead to wasted activity (and power) for
data signals that experience a large number of sign transitions. In contrast, sign-magnitude
data uses a single bit to describe the sign of the data and so sign transitions cause toggling
in only one bit of the data [27].
A related issue is that of data encoding. Logarithmic
companding can be used instead of floating-point to achieve similar results. Unfortunately,
as with sign-magnitude, many computations (such as additions) do not have straightforward
implementations in the logarithmic domain; however, some computations actually become
simpler and less power consuming in this domain, e.g., multiplications translate to additions.
Applications requiring a large number of multiplications can take advantage of this fact by
using logarithmically encoded data.
Clearly, there are many trade-offs involved in selecting
a data representation for low-power systems. It is unlikely that any one choice would be
ideally suited for all applications. Instead, a careful analysis of the application requirements
in terms of performance and accuracy should be done before selecting the appropriate representation.
Moreover, it might be beneficial to use different data representations in different
parts of the system at the expense of some data conversion overhead.
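The difference in sign-transition activity between two's-complement and sign-magnitude data can be checked by counting bit toggles on a bus. The sketch below is a minimal illustration with a hypothetical sample sequence of small values that alternate in sign; the function names and the 8-bit word length are assumptions.

```python
def twos_complement(value, bits):
    """Encode a signed integer as a two's-complement word of the given width."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits):
    """Encode a signed integer as sign bit plus magnitude.
    Assumes abs(value) fits in bits - 1 bits."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)


def toggles(words):
    """Total number of bit toggles between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


BITS = 8
samples = [1, -1, 2, -2, 1, -1]  # hypothetical small signal crossing zero
tc = toggles([twos_complement(s, BITS) for s in samples])
sm = toggles([sign_magnitude(s, BITS) for s in samples])
print("two's complement:", tc, "toggles; sign-magnitude:", sm, "toggles")
```

For this sequence the two's-complement encoding produces 35 bit toggles and the sign-magnitude encoding 9, since each sign change flips all the redundant sign bits in two's complement but only one bit in sign-magnitude.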
2.9 ALGORITHM LEVEL
Algorithmic-level power reduction techniques focus on minimizing the number of op-
erations, weighted by the cost of those operations. Selection of an algorithm is generally
based on details of an underlying implementation such as the energy cost of an addition
versus a logical operation, the cost of a memory access, and whether locality of reference,
both spatial and temporal can be maximized. The presence and structure of cache mem-
ory, for example, may cause a di erent set of operations to be selected, since the cost of a
50
memory access, relative to that of an arithmetic operation, changes. In general, reducing
the number of operations to be performed is a first-order goal, although in some situations,
recomputation of an intermediate result may be cheaper than spilling to and reloading from
memory. Techniques used by optimizing compilers, such as strength reduction, common
subexpression elimination, and optimizations to minimize memory traffic are also useful in
most circumstances in reducing power. Loop unrolling may also be of benefit, as it results
in minimized loop overhead as well as the potential for intermediate result reuse. Number
representations offer another area for algorithmic power trade-offs. For example, the choice
of a fixed-point or a floating-point representation for data types can make a significant
difference in power consumption during arithmetic operations. Selection of sign-magnitude
versus two's complement representation for certain signal processing applications can result
in significant power reduction if the input samples are uncorrelated and dynamic range is
minimized [28]. Operator precision, or bit length, is another trade-off that can be selected
to minimize power at the expense of accuracy. For some floating-point algorithms, full
precision can be avoided, and mantissa and exponent width reduced below the standard
23 and 8 bits, respectively, of the IEEE single-precision floating-point standard. In [103], the
authors show that for an interesting set of applications involving speech recognition, pattern
classification, and image processing, mantissa bit width may be reduced by more than 50%
to 11 bits with no corresponding loss of accuracy. In addition to improved circuit delays,
energy consumption of the floating-point multiplier was reduced 20% - 70% for mantissa
reductions to 16 and 8 bits, respectively. Truncation of low-order bits of partial sum terms
when performing a 16-bit fixed-point multiplication has been shown to result in power savings
of 30% due mainly to reduction in area [93]. Adaptive bit truncation techniques for
performing motion estimation in a portable video encoder are shown to save 70% of the
power over a full bit width implementation [45].
2.10 SUMMARY
In previous sections we discussed the mechanisms of power dissipation. We discussed
various existing techniques of power reduction at different abstraction levels ranging from
layout and technology to architecture and system and outlined the benefits and limitations
of those techniques.
Chapter 3
Hardware - Software Technique using Pipeline stalls to reduce leakage
power
In the previous chapter we discussed the mechanisms of power dissipation and various
techniques used at different abstraction levels to reduce power. In this chapter we describe
a new architecture- and system-level technique to reduce leakage power.
3.1 Hardware-Software Technique
We already discussed the impact of the technology scaling on power dissipation in
Chapter 1. In particular, due to the scaling down of the threshold voltage, an exponential
growth in subthreshold leakage current is expected with every cranking of the technology
wheel [21]. Similarly, scaling down of the gate geometry (and in particular, oxide thickness)
is resulting in a very rapid growth of the gate leakage current [42]. Without corrective
measures at the device, circuit and/or microarchitecture level, the total standby (leakage)
power may well become the dominant part of the total power consumed by a microprocessor
chip in the future technologies.
To solve this problem of increasing leakage power consumption in the high-leakage
technologies we have proposed a hardware-software technique to reduce leakage power of
microprocessors at the architecture level. We present a simulated experiment to evaluate
this technique for a pipelined processor.
A simple and obvious way to reduce power consumption is to decrease the clock fre-
quency f. Decreasing f causes a proportional decrease in the dynamic power dissipation.
The power consumption over a given period of time is reduced but slowing the clock also
results in slower computations, hence the rate of useful work done is reduced and the system
then operates for a longer period of time to execute the given task. Two of the limiting con-
straints of the clock frequency reduction technique are throughput and peak-performance.
So for the peak-performance constrained and throughput constrained systems, clock fre-
quency reduction is not a viable alternative for power optimization.
As the clock frequency is reduced, the amount of work done in a given period of time
is reduced. This lowers the dynamic power, but the leakage power remains unchanged, so
the longer execution time increases the leakage energy. For future technologies where
leakage power is of concern, this will lead to very high leakage energy dissipation. In order
to reduce this leakage dissipation, instead of reducing the clock frequency of the processor
we reduce the execution rate of the processor by inserting NOPs in the pipeline after every
instruction. The number of NOPs added after each instruction is chosen according to the
performance needed. This does affect the throughput of the processor and so can be applied
only when the processor is not throughput constrained. The technique is more efficient when
the programs executed on the normal processor have many hazards. Such programs take
longer to execute on the normal processor, which must add bubbles in the pipeline
to remove the hazards. When NOPs are added after every instruction, many of these hazards
are resolved by the NOPs themselves, so the execution-time penalty relative to the normal
processor is smaller.
Once a NOP is inserted in the pipeline the control unit decodes the NOP instruction
and generates power signals for the hardware components of the processor that consume
most power and puts them into sleep mode. In the sleep mode the power supply is either
completely or partially cut off. The power signals thus allow significant saving of power
during the cycles when NOPs are being executed.
Power-gating is a technique for reducing leakage power by shutting off the idle blocks.
Implementation of power-gating requires a multi-threshold CMOS process. Logic blocks
are implemented using low-Vt, high-performance transistors whereas high-Vt transistors
(called sleep transistors) connect the gated blocks to the power supply [78].
A microarchitectural technique for power gating of the execution units is explained
in [51]. In that paper, parameterized analytical equations that estimate the break-even
point for application of power-gating techniques are first developed. The potential for
power-gated execution units is then evaluated, for the range of relevant break-even points
determined by the analytical equations, using a state-of-the-art out-of-order superscalar
processor model. The power gating potential of the floating-point and fixed-point units of
this processor is then evaluated using three different strategies to detect opportunities for
entering sleep mode, namely, ideal, time-based, and branch-misprediction-guided.
In the ideal technique, power gating is achieved by using a suitably sized header (Fig.
3.1) or footer transistor for a circuit block that is deemed to be a power-gating candidate.
When the logic detects the onset of a sufficiently long idle period of a target circuit block, a
"sleep" signal is applied to the gate of the header or footer transistor to turn off the supply
voltage of the circuit block. Similarly, once it is determined that the circuit block is being
requested for use, the "sleep" signal is de-asserted to restore the voltage at the virtual Vdd.
In the time-based technique, the execution unit is power-gated after a predetermined
number of idle cycles has been observed, and it is restarted, with a performance penalty,
once a pending operation is detected.
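The time-based policy can be sketched as a small cycle-level simulation. This is only an illustration and not the model used in [51]: the function name, the `idle_threshold` and `wakeup_penalty` parameters, and the busy/idle trace are hypothetical.

```python
def simulate_time_based_gating(trace, idle_threshold, wakeup_penalty):
    """Simulate time-based power gating of one execution unit.

    trace:          sequence of booleans, True when an operation is pending
                    for the unit in that cycle (hypothetical input format).
    idle_threshold: idle cycles observed before the unit is put to sleep.
    wakeup_penalty: stall cycles charged when a sleeping unit is needed again.
    Returns (sleep_cycles, penalty_cycles).
    """
    idle_run = 0         # consecutive idle cycles seen while awake
    asleep = False
    sleep_cycles = 0
    penalty_cycles = 0
    for pending in trace:
        if pending:
            if asleep:
                # Waking up costs performance, as noted in the text.
                penalty_cycles += wakeup_penalty
                asleep = False
            idle_run = 0
        elif asleep:
            sleep_cycles += 1
        else:
            idle_run += 1
            if idle_run >= idle_threshold:
                asleep = True   # gated from the next cycle onward
    return sleep_cycles, penalty_cycles


if __name__ == "__main__":
    # One busy cycle, ten idle cycles, one busy cycle.
    trace = [True] + [False] * 10 + [True]
    print(simulate_time_based_gating(trace, idle_threshold=4, wakeup_penalty=2))
```

A larger `idle_threshold` reduces the number of wake-up penalties but also shortens the time spent asleep, which is the trade-off that the break-even analysis in [51] addresses.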
Figure 3.1: Using Headers for Power Gating
In the other technique given by the author [78], the branch prediction mechanism guides
the gating of execution units. An execution unit is turned off as soon as a branch
misprediction is detected.
The results show that using the time-based approach, floating-point units can be put
to sleep for up to 28% of the execution cycles at a performance loss of 2%. For the more
difficult to power-gate fixed-point units, the branch misprediction guided technique allows
the fixed-point units to be put to sleep for up to an additional 40% of the execution cycles
compared to the simpler time-based technique, with similar performance impact.
For the caches, the preservation of cache state during standby mode is often desirable:
data stored in the caches should not be destroyed, so that secondary memories need not
be accessed on recovery. In addition, memory access time should not be greatly degraded,
which means recovery time should be as small as possible; otherwise it will severely
compromise the system performance. The two most widely cited methods are decay cache
and drowsy cache [64].
Decay cache utilizes the gated-Vdd technique. This technique reduces the leakage power
by using a high threshold (high-Vt) transistor to turn off the power to the memory cell when
Figure 3.2: Supply Voltage Control Mechanism
the cell is set to low-power mode. This high-Vt device drastically reduces the leakage of
the circuit because of the exponential dependence of leakage on Vt. While this method is
very effective at reducing leakage, its main disadvantage lies in that it loses any information
stored in the cell when switched into low-leakage mode. This means that a significant
performance penalty is incurred when data in the cell is accessed and more complex and
conservative cache policies must be employed. This technique reduces energy-delay by 62%
with minimal impact on performance.
Drowsy cache, as described by Kim et al. [55], provides a better solution. It also uses
transistors to separate virtual Vdd from Vdd supply line but still supplies a very low voltage
to the cell when it is turned into low power mode. The cell implementation is shown in
Figure 3.2.
According to the authors [55], the wake-up latency is only a few clock cycles and thus
does not have a major impact on the system performance. For data caches, all cache lines
except the active one are put into drowsy mode every n clock cycles. The integer n is
the window size, that is, how often the cache is put into drowsy mode; the authors found
4000 to be an adequate number for the benchmarks they ran. Since programs typically
access only a small portion of the entire data in the memory, the drowsy cache method can
achieve a significant reduction in leakage power consumption in the long run. With this
method 80%-90% of the data cache lines can be maintained in a drowsy state without
affecting the performance by more than 0.6%, even though moving lines into and out of a
drowsy state incurs a slight performance loss.
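The periodic drowsy policy can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the simulator used in [55]: one access per cycle, a wake-up cost of `wakeup_cycles` (one cycle by default, chosen only for illustration), and all names are hypothetical.

```python
def simulate_drowsy_cache(accesses, num_lines, window, wakeup_cycles=1):
    """Toy model of the periodic drowsy data-cache policy.

    accesses:      non-empty list of cache line indices, one access per cycle.
    num_lines:     total number of cache lines.
    window:        every `window` cycles all lines except the active one
                   are put into drowsy mode (the integer n in the text).
    wakeup_cycles: extra cycles charged when a drowsy line is accessed.
    Returns (drowsy_fraction, extra_cycles).
    """
    awake = set()            # lines currently at full supply voltage
    last_line = None         # the active (most recently accessed) line
    drowsy_line_cycles = 0
    extra_cycles = 0
    for cycle, line in enumerate(accesses):
        if cycle > 0 and cycle % window == 0:
            # Window boundary: keep only the active line awake.
            awake = {last_line}
        if line not in awake:
            extra_cycles += wakeup_cycles   # wake-up latency
            awake.add(line)
        last_line = line
        drowsy_line_cycles += num_lines - len(awake)
    return drowsy_line_cycles / (num_lines * len(accesses)), extra_cycles


if __name__ == "__main__":
    frac, extra = simulate_drowsy_cache([0, 0, 1, 1, 0, 0], num_lines=4, window=2)
    print(f"drowsy fraction = {frac:.3f}, extra cycles = {extra}")
```

A short window keeps more lines drowsy but pays the wake-up cost more often; a long window does the opposite.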
For instruction caches, the situation is slightly different due to the instruction access
characteristics. Putting all cache lines into drowsy mode every n cycles does
not work well for instruction caches. However, the spatial locality property can still be
utilized. Kim et al. [55] proposed a low-leakage instruction cache architecture based on the
subbank method. The basic idea of the subbank method is to divide the cache into several
subbanks and put the inactive subbanks into low-power mode. The proposed architecture
extends the subbank method with Next Subbank Prediction Techniques. A prediction
buffer keeps track of the predicted subbank index and other information for the instruction
fetched one cycle earlier. Thus if the instruction (e.g., a jump instruction) is going to access
a subbank in drowsy mode, that subbank can be woken up one cycle earlier to enhance
the performance. There are also other techniques for instruction caches such as the one
described by Kalla et al. [53]. The authors observed that programs, especially multimedia
applications, tend to spend most of their time in loops and execute only a sequence of
instructions for most of their computations. Based on this observation, they propose a
novel cache replacement policy for instruction caches, which forces instructions in a loop to
be placed in the same subbank and keeps them from being the first candidates for eviction
to secondary memory when misses occur. In this way, only one subbank stays active and the
other subbanks can stay in drowsy mode most of the time.
Power-aware compilation for register file energy reduction has been discussed in a recent
paper [11].
Power-gating techniques as mentioned above can be applied to the different components
of the processor which are not in use when the NOP instruction is inserted into the pipeline
of the processor.
Power gating techniques are discussed by other authors as well [4, 34].
3.2 Conclusion
A software approach of inserting NOPs after every instruction, combined with a hard-
ware approach of power-gating the processor components that are not in use when the NOP
is executed, is explained in this chapter to gain the leakage power savings. Because clock
period is not changed for the non-NOP instructions, we have not assumed any reduction in
the supply voltage, which would have slowed down the hardware. However, there has been
reported work on speculative hardware speed reduction for power saving [38]. Here, voltage
reduction may cause some errors, which are detected and the instructions are reexecuted.
As long as the error rate is small, one can obtain power saving with negligible performance
penalty. Using such procedures with the NOP insertion may be investigated in the future.
Chapter 4
Theoretical and Experiment Results
In the previous chapter, a hardware-software technique for reducing the leakage power
was explained. The theoretical results as well as practical results are discussed in this
chapter. The technique was applied to a 32-bit MIPS pipelined processor using CMOS
circuitry. We assumed the Berkeley Predictive Technology Models for 65nm and 22nm
CMOS technologies [2]. Though this demonstration uses one particular processor, the
technique can be applied to other processors as well.
4.1 THEORETICAL RESULTS
4.1.1 Clock Slow-Down Method
As discussed in the previous chapter, we use the clock frequency reduction
method as our reference method. When we slow down the clock to reduce power, dynamic
power is reduced in proportion to the clock rate whereas leakage power remains unchanged.
However, the computing task now takes longer to complete. As a result, the dynamic
energy consumed by the task stays the same whereas the leakage energy consumed
increases. For normal operation, we assume:
Rated clock frequency as: $f$
Dynamic power as: $P_d$
Static power as: $P_s$
Then,
Total power consumed:
\[ P(1) = \text{Dynamic power} + \text{Static power} = P_d + P_s \tag{4.1} \]
and energy consumed by an $N$-cycle task:
\[ E(N,1) = \text{Power} \times \text{Time} = (P_d + P_s)\,N/f \tag{4.2} \]
For the power saving mode, where the clock frequency is reduced by a factor $n$:
Clock frequency = $f/n$
Dynamic power = $P_d/n$, as dynamic power is reduced in proportion to the clock rate
Static power = $P_s$, as static power remains unchanged
Therefore, in this case, the total power consumed is
\[ P(n) = P_d/n + P_s \tag{4.3} \]
and, since the $N$-cycle task now takes $nN/f$ seconds, the energy consumed is
\[ E(N,n) = (P_d/n + P_s)\,nN/f = (P_d + nP_s)\,N/f \tag{4.4} \]
Using equations 4.1 and 4.3, the power saving ratio is obtained as follows:
\[ P_{ratio} = P(1)/P(n) \tag{4.5} \]
\[ P_{ratio} = n(P_d + P_s)/(P_d + nP_s) \tag{4.6} \]
\[ P_{ratio} = n(k+1)/(k+n), \quad \text{where } k = P_d/P_s \tag{4.7} \]
For low leakage technologies where static power consumption is negligible compared to
dynamic power consumption, $k \gg 1$. In this case,
\[ P_{ratio} \approx n \tag{4.8} \]
For high leakage technologies where static power consumption is higher and of concern,
assuming $k \le 2$, the power ratio we obtain for different values of $k$ is as follows:
\[ P_{ratio} = 3n/(n+2) \quad \text{for } k = 2 \tag{4.9} \]
where $k = 2$ means dynamic power consumed is double the static power consumption.
Further,
\[ P_{ratio} = 2n/(n+1) \quad \text{for } k = 1 \tag{4.10} \]
where $k = 1$ means dynamic power consumed is equal to the static power consumption.
Also,
\[ P_{ratio} = 3n/(2n+1) \quad \text{for } k = 0.5 \tag{4.11} \]
where $k = 0.5$ means dynamic power consumed is half of the static power consumption.
These results are plotted in the graph shown in Figure 4.1.
From Figure 4.1, we observe that for low-leakage technologies, the power saving obtained
is linear in the clock slow-down factor $n$. As the static power increases relative to
the dynamic power, as is the case for future high-leakage technologies,
the power saving obtained by this method diminishes.
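Equation 4.7 is easy to tabulate. The sketch below evaluates it for the three high-leakage cases and for one large value of $k$ standing in for a low-leakage technology; the function name and the choice k = 1000 are illustrative assumptions.

```python
def clock_slowdown_power_ratio(n, k):
    """Power saving ratio P(1)/P(n) of Eq. 4.7.

    n: clock slow-down factor (n = 1 is the rated clock).
    k: ratio of dynamic to static power, k = Pd/Ps.
    """
    return n * (k + 1) / (k + n)


if __name__ == "__main__":
    # k = 1000 is an arbitrary stand-in for a low-leakage technology.
    for k in (0.5, 1, 2, 1000):
        row = ", ".join(f"{clock_slowdown_power_ratio(n, k):.2f}" for n in range(1, 9))
        print(f"k = {k:>6}: {row}")
```

For large $k$ the ratio approaches $n$, while for $k \le 2$ it saturates towards $k + 1$ as $n$ grows, which is the flattening described for the high-leakage curves.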
Figure 4.1: Power Saving Ratio for Clock Slow-Down Method
Similarly, calculating the energy saving ratio from equations 4.2 and 4.4 we get:
\[ E_{ratio} = E(N,1)/E(N,n) \tag{4.12} \]
\[ E_{ratio} = (P_d + P_s)/(P_d + nP_s) = P_{ratio}/n \tag{4.13} \]
\[ E_{ratio} = (k+1)/(k+n), \quad \text{where } k = P_d/P_s \tag{4.14} \]
For low leakage technologies where static power consumption is negligible compared to
dynamic power consumption, $k \gg 1$. In this case,
\[ E_{ratio} \approx 1 = \text{constant} \tag{4.15} \]
Figure 4.2: Energy Saving Ratio for Clock Slow-Down Method
For high leakage technologies where static power consumption is higher and of concern,
assuming $k \le 2$, the energy ratio we obtain for different cases of $k$ is as follows:
\[ E_{ratio} = 3/(n+2) \quad \text{for } k = 2 \tag{4.16} \]
where $k = 2$ means dynamic power consumed is double the static power consumption.
Further,
\[ E_{ratio} = 2/(n+1) \quad \text{for } k = 1 \tag{4.17} \]
where $k = 1$ means dynamic power consumed is equal to the static power consumption.
Also,
\[ E_{ratio} = 3/(2n+1) \quad \text{for } k = 0.5 \tag{4.18} \]
where $k = 0.5$ means dynamic power consumed is half of the static power consumption.
These results are plotted in the graph shown in Figure 4.2.
From Figure 4.2, we observe that for low-leakage technologies, there is no increase in
energy and the energy ratio stays constant at 1. As the static power increases
in comparison to the dynamic power, as for future high-leakage technologies, the energy
consumed by the task grows with the clock slow-down factor.
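As a worked instance of equations 4.7 and 4.14 (the values $k = 1$ and $n = 4$ are chosen here only for illustration and are not taken from the experiments):

```latex
\begin{align*}
P_{ratio} &= \frac{n(k+1)}{k+n} = \frac{4 \cdot 2}{1 + 4} = 1.6, \\
E_{ratio} &= \frac{k+1}{k+n} = \frac{2}{1 + 4} = 0.4 .
\end{align*}
```

So slowing the clock by a factor of 4 cuts power by only a factor of 1.6, while the energy needed for the task grows by a factor of 1/0.4 = 2.5.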
4.1.2 Instruction Slow-Down Method
To differentiate it from the clock slow-down methodology, we will call NOP insertion
the instruction slow-down method. In this new energy saving method the rated clock
frequency ($f$) is maintained. Power management hardware inserts $m$ NOPs after each
instruction. We call $m$ the instruction slowdown factor, where $m = 0$ for normal operation.
Once the NOPs are inserted the management unit provides hardware sleep modes to reduce NOP
power. Power control signals are generated by control logic for each individual unit of the
processor to set them into their individual low-power modes as discussed in Chapter 3. To
analyze this technique, let
$P$ = power consumed by instruction cycles,
$P/f$ = energy consumed per instruction cycle,
$\beta P/f$ = energy consumed per NOP cycle,
where $\beta$ = reduction factor ($0 \le \beta \le 1$) due to power down/sleep modes.
Hence, for this new technique, for a given time period as illustrated in Figure 4.3, we get
\[ \text{Power} = P(1 + m\beta)/(m+1) \tag{4.19} \]
For the normal operation mode in the new instruction slow-down method we have,
Rated clock frequency, f and m = 0 (as no NOP is inserted)
Figure 4.3: Distribution of Instruction Energy and NOP Energy for a given time period
Assuming,
Dynamic power: $P_d$
Static power: $P_s$
\[ \text{Total power consumed} = P(0) = P_d + P_s \tag{4.20} \]
and,
\[ \text{Energy consumed by an $N$-cycle task} = E(N,0) = (P_d + P_s)\,N/f \tag{4.21} \]
For the power saving mode where the rated clock frequency $f$ is maintained, using equation
4.19, we get
\[ \text{Dynamic power} = P_d(1 + m\beta)/(m+1) \tag{4.22} \]
\[ \text{Static power} = P_s(1 + m\beta)/(m+1) \tag{4.23} \]
Hence,
\[ \text{Total power } P(m) = (P_d + P_s)(1 + m\beta)/(m+1) \tag{4.24} \]
However, now a given $N$-cycle task will take longer to complete as NOPs are inserted
after each instruction. The energy consumed by the $N$-cycle task is given by,
\[ E(N,m) = (P_d + P_s)\,\frac{1 + m\beta}{m+1}\,\frac{N(m+1)}{f} = (P_d + P_s)(1 + m\beta)\,N/f \tag{4.25} \]
Figure 4.4: Power Saving Ratio for Instruction Slow-Down Method
From equations 4.20 and 4.24 the power saving ratio for the instruction slow-down method
can be obtained as:
\[ P_{ratio} = P(0)/P(m) \tag{4.26} \]
\[ P_{ratio} = (m+1)/(1 + m\beta) \tag{4.27} \]
The plot of the $P_{ratio}$ as obtained from equation 4.27 is shown in Figure 4.4. For the case
$\beta = 1$, where the NOP cycle consumes the same power as the instruction cycle, we see no
power saving and this method is not effective. When the NOP cycle consumes less power
compared to an instruction cycle, we observe power savings. For the case where $\beta = 0$,
which is the ideal case where the NOP cycle consumes no power at all, we observe that the
power saving is linear in the instruction slow-down factor $m$.
Next, from equations 4.21 and 4.25, we get the energy saving ratio as follows:
\[ E_{ratio} = E(N,0)/E(N,m) \tag{4.28} \]
Figure 4.5: Energy Saving Ratio for Instruction Slow-Down Method
\[ E_{ratio} = 1/(1 + m\beta) \tag{4.29} \]
A plot of the $E_{ratio}$, obtained by equation 4.29, is shown in Figure 4.5. For the case
$\beta = 1$, where the NOP cycle consumes the same power as the instruction cycle, we see
that the energy consumed increases linearly with the instruction slow-down factor $m$. However,
when the NOP cycle consumes less power compared to the instruction cycle, we observe that
the energy overhead decreases. For $\beta = 0$, which is the ideal case where the NOP cycle
consumes no power at all, we observe that there is no increase in energy. Hence, for the
ideal case, where the NOP cycle consumes no power, there is no increase in the energy whereas
the power saving is linear in the instruction slow-down factor $m$.
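The two ratios of equations 4.27 and 4.29 can be evaluated together. The sketch below is only an illustration; the function name is hypothetical and `beta` stands for the NOP-cycle reduction factor defined with equation 4.19.

```python
def instruction_slowdown_ratios(m, beta):
    """Return (P_ratio, E_ratio) for the instruction slow-down method.

    m:    number of NOPs inserted after each instruction (m = 0 is normal).
    beta: NOP-cycle power as a fraction of instruction-cycle power, 0..1.
    """
    if m < 0:
        raise ValueError("m must be non-negative")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    power_ratio = (m + 1) / (1 + m * beta)   # Eq. 4.27
    energy_ratio = 1 / (1 + m * beta)        # Eq. 4.29
    return power_ratio, energy_ratio


if __name__ == "__main__":
    for beta in (0.0, 0.1, 0.5, 1.0):
        p, e = instruction_slowdown_ratios(3, beta)
        print(f"beta = {beta:.1f}, m = 3: P_ratio = {p:.2f}, E_ratio = {e:.2f}")
```

The two extreme cases reproduce the observations above: with `beta = 0` the power ratio equals m + 1 at an energy ratio of 1, and with `beta = 1` the power ratio is 1 while the energy ratio falls to 1/(m + 1).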
4.1.3 Comparison of Clock Slow-Down and Instruction Slow-Down Methods
From equations 4.4 and 4.25, we compute the energy ratio comparing the two methods
as follows:
Figure 4.6: Clock Slowdown Method Vs. Instruction Slowdown Method for $\beta$ = 1 (No Sleep
Mode)
\[ \frac{\text{Energy (clock slow-down)}}{\text{Energy (instruction slow-down)}} = \frac{k + m + 1}{(k+1)(1 + m\beta)} \tag{4.30} \]
where $n = m + 1$ and $k = P_d/P_s$. This ratio can be plotted for different values of $\beta$ as
shown in Figures 4.6, 4.7 and 4.8.
From Figure 4.6 we observe that for $\beta = 1$, when there is no sleep mode applied and the
NOP cycle consumes the same power as the instruction cycle, the new instruction slow-down
technique is not too efficient for high-leakage technologies and it provides no benefit for the
low-leakage technologies. Instruction slow-down shows advantage for $k = P_d/P_s = 0.5$,
i.e., when static power is twice that of dynamic power, but that is only because of the
inefficiency of the clock slow-down method.
Figure 4.7: Clock Slowdown Method Vs. Instruction Slowdown Method for $\beta$ = 0.5 (Sleep
Mode)
For $\beta = 0.5$, where the NOP cycle consumes 50% less power than the instruction cycle, in
Figure 4.7 we observe a break-even point for the case where $k = P_d/P_s = 1$. That is, for
a technology where dynamic power is equal to the static power, the instruction slow-down
technique shows the same energy consumption as the clock slow-down technique. For this
case too the instruction slow-down technique provides no significant benefit for low-leakage
technologies.
For $\beta = 0.1$, where the NOP cycle consumes 90% less power than the instruction cycle, in
Figure 4.8 we observe significant advantage from instruction slow-down except when leakage
is very small ($k = P_d/P_s \gg 1$).
Hence, the instruction slow-down technique is more efficient for high-leakage technologies
than the clock slow-down technique for the case where the NOP cycle consumes 50% or
less power than the normal instruction cycle.
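The comparison can be explored numerically with equation 4.30. In the sketch below the function names are hypothetical, and the break-even expression is not stated in the text: it follows algebraically from setting the ratio of equation 4.30 to 1, which gives $\beta = 1/(k+1)$ for any $m > 0$.

```python
def energy_ratio_clock_vs_instruction(m, k, beta):
    """Eq. 4.30: E(clock slow-down) / E(instruction slow-down), with n = m + 1.

    A value above 1 means instruction slow-down uses less energy.
    m: NOPs per instruction, k = Pd/Ps, beta: NOP power reduction factor.
    """
    return (k + m + 1) / ((k + 1) * (1 + m * beta))


def break_even_beta(k):
    """Value of beta at which Eq. 4.30 equals 1 (derived here, for m > 0)."""
    return 1 / (k + 1)


if __name__ == "__main__":
    for beta in (1.0, 0.5, 0.1):
        ratios = [energy_ratio_clock_vs_instruction(3, k, beta) for k in (0.5, 1, 2)]
        print(f"beta = {beta}: " + ", ".join(f"{r:.2f}" for r in ratios))
    print("break-even beta for k = 1:", break_even_beta(1))
```

For $k = 1$ the break-even value is 0.5, matching the break-even point observed in Figure 4.7.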
Figure 4.8: Clock Slowdown Method Vs. Instruction Slowdown Method for $\beta$ = 0.1 (Sleep
Mode)
The new technique was applied to a 32-bit MIPS processor [8, 9]. The architecture of
the processor is discussed in next section.
4.2 MIPS PROCESSOR
The MIPS architecture [46, 80] is a widely supported processor architecture, with a
vast infrastructure of industry-standard tools, software and services that help ensure rapid,
reliable and cost-effective system-on-chip (SoC) design. The MIPS processor, designed in
1984 by researchers at Stanford University, uses a RISC (Reduced Instruction Set Computer)
instruction set. Compared with their CISC (Complex Instruction Set Computer)
counterparts (such as Intel's Pentium processors), RISC processors typically support
fewer and much simpler instructions.
Table 4.1: MIPS Instruction Formats
Format  Bits 31-26  Bits 25-21  Bits 20-16  Bits 15-11  Bits 10-6  Bits 5-0
R       op          rs          rt          rd          shamt      funct
I       op          rs          rt          imm (bits 15-0)
J       op          address (bits 25-0)
4.2.1 MIPS Instruction Formats
The meanings of the fields in MIPS instructions presented in Table 4.1 are as follows:
op : opcode. Basic operation of the instruction.
rs : the first register source operand.
rt : the second register source operand.
rd : the register destination operand. It gets the result of the operation.
shamt : shift amount.
funct : function. This field selects the specific variant of the operation in the op field.
A compromise chosen by the MIPS designers is to keep all instructions the same
length, thereby requiring different kinds of instruction formats for different kinds of
instructions. The formats above are called R-type (for register), I-type (for immediate), and
J-type (for jump).
For the particular processor used for the experiment the supported instructions are
load word (LW), store word (SW), add (ADD), subtract (SUB), branch on equal (BEQ),
jump (J), and no operation (NOP). LW and SW use the immediate-format (I-format); the
operands are the destination/source register address, the register address storing the base
memory address, and an immediate value for the data memory address offset. BEQ also
follows the I-format such that it takes two register addresses to test for equality and an
immediate value to add to the program counter's value should the equality test pass. ADD
and SUB follow the register-format (R-format) where the operands are the destination
register address and two source register addresses. Finally, J uses the jump-format
(J-format); its operand is an immediate value to store into the program counter. Table 4.2
summarizes the instruction formats.
Table 4.2: Format and the meaning of the supported Instructions
Name  6 Bit  5 Bit  5 Bit  5 Bit  5 Bit  6 Bit  Assembly         Meaning
lw    35     2      1      100                  lw $1, 100($2)   $1 <= DMem[$2+100]
sw    43     2      1      100                  sw $1, 100($2)   DMem[$2+100] <= $1
add   0      2      3      1      0      32     add $1, $2, $3   $1 <= $2 + $3
sub   0      2      3      1      0      34     sub $1, $2, $3   $1 <= $2 - $3
beq   4      1      2      25                   beq $1, $2, 100  if ($1 == $2), then PC <= PC+4+100
j     2      2500                               j 10000          PC <= 10000
nop   0                                         nop              Do Nothing
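The field layouts of Tables 4.1 and 4.2 can be exercised with a short encoder and decoder. This is a sketch limited to the seven supported instructions, so it assumes that opcode 0 is always R-format, opcode 2 is J-format, and every other opcode is I-format. The helper names are hypothetical and the immediate is returned unsigned, without sign extension.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack an R-format word: op[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """Pack an I-format word: op[31:26] rs[25:21] rt[20:16] imm[15:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, address):
    """Pack a J-format word: op[31:26] address[25:0]."""
    return (op << 26) | (address & 0x3FFFFFF)


def decode(word):
    """Split a 32-bit instruction word into its named fields."""
    op = (word >> 26) & 0x3F
    if op == 0:   # R-format (ADD, SUB, and the all-zero NOP)
        return {"format": "R", "op": op,
                "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F, "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if op == 2:   # J-format
        return {"format": "J", "op": op, "address": word & 0x3FFFFFF}
    # Everything else supported here (LW, SW, BEQ) is I-format.
    return {"format": "I", "op": op,
            "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
            "imm": word & 0xFFFF}


if __name__ == "__main__":
    print(hex(encode_i(35, 2, 1, 100)))        # lw $1, 100($2)
    print(decode(encode_r(0, 2, 3, 1, 0, 32)))  # add $1, $2, $3
```

Because the NOP is the all-zero word, it decodes as an R-format instruction with every field equal to zero.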
4.2.2 Architecture
The processor datapath has five stages: instruction fetch (IF), instruction decode (ID),
execute (EX), data memory (M), and write-back (WB).
INSTRUCTION FETCH: The IF stage involves keeping track of the current/next
instruction as well as retrieving the current instruction from memory. In this scenario,
memory is split into separate instruction and data memories in order to avoid a structural
hazard. That is, simultaneous access to memory, one for instructions and the other for
data, is possible in the architecture shown in Figure 4.9.
INSTRUCTION DECODE: In the next cycle, the fetched instruction moves into the
ID stage. There, the instruction is broken up into several fields and input into the control
logic and register file. Various control signals, register values, and intermediate values are
Figure 4.9: General Architecture of the Processor
handed to the EX stage where arithmetic operations are performed (in this case, integer add
and subtract). In addition, the register addresses will be forwarded to the hazard detector
in this stage. If there is a potential hazard in the system, this stage will perform a stall.
EXECUTE: This is the main stage where most of the ALU operations are performed.
Also, this is where the register addresses are forwarded back to the ID stage for hazard
detection.
DATA MEMORY: In the M stage, data is retrieved and/or stored into memory.
WRITE BACK: Finally, in the WB stage, applicable control signals and results, either
from data memory or arithmetic calculations, are fed back to the register file.
4.2.3 Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction
stream from executing during its designated clock cycle. Hazards reduce the performance
from the ideal speedup gained by pipelining. There are three classes of hazards:
Structural Hazards: They arise from resource conflicts when the hardware cannot
support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards: They arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards: They arise from the pipelining of branches and other instructions
that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline. The processor can
stall on different events:
Table 4.3: A Sequence of Instructions
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
A cache miss: A cache miss stalls all the instructions in the pipeline, both before and after
the instruction causing the miss.
A hazard in the pipeline: Eliminating a hazard often requires that some instructions in the
pipeline be allowed to proceed while others are delayed. When an instruction is stalled,
all the instructions issued later than the stalled instruction are also stalled. Instructions
issued earlier than the stalled instruction must continue, since otherwise the hazard will
never clear.
The problem with data hazards, introduced by the sequence of instructions shown
in Table 4.3, can be solved with a simple hardware technique called forwarding. The key
insight in forwarding is that the result is not really needed by SUB until after the ADD
actually produces it. The only problem is to make it available for SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to where
the SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input
latches.
If the forwarding hardware detects that the previous ALU operation has written the
register corresponding to the source for the current ALU operation, control logic selects the
forwarded result as the ALU input rather than the value read from the register file.
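The selection rule above can be written as a small function. This is a behavioral sketch, not the forwarding unit of the processor used here: all names are hypothetical, and the exclusion of register 0 (which is hardwired to zero in MIPS) is an added assumption.

```python
def select_alu_operand(src_reg, regfile_value,
                       prev_writes_reg, prev_dest_reg, prev_alu_result):
    """Choose one ALU input for the instruction now in the EX stage.

    src_reg:         register number this ALU input should come from.
    regfile_value:   value read from the register file (possibly stale).
    prev_writes_reg: True if the previous instruction writes a register.
    prev_dest_reg:   destination register of the previous ALU operation.
    prev_alu_result: its result, held in the EX/MEM pipeline register.
    """
    forward = (prev_writes_reg
               and prev_dest_reg == src_reg
               and prev_dest_reg != 0)   # assumption: never forward $0
    return prev_alu_result if forward else regfile_value


if __name__ == "__main__":
    # ADD R1, R2, R3 produced 7; SUB R4, R5, R1 is now in EX.
    # The register file still holds the old R1 (0) and R5 = 9.
    print(select_alu_operand(5, 9, True, 1, 7))   # R5 comes from the register file
    print(select_alu_operand(1, 0, True, 1, 7))   # R1 is forwarded from EX/MEM
```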
Figure 4.10: Modified Architecture of the Processor
4.3 MODIFIED MIPS PROCESSOR
The architecture of the given MIPS processor is modified in order to reduce power
consumption. In the modified architecture, a Power Block is added in the fetch stage of the
processor, as shown in Figure 4.10. This Power Block has an externally controlled 3-bit
signal that is set by the user. According to the performance required, the 3-bit signal
is set to the number of NOPs to be inserted into the pipeline after every instruction.
With the 3-bit signal, a maximum of 7 NOPs can be inserted into the pipeline, with the bit
conditions shown in Table 4.4.
A schematic of the power block is shown in Figure 4.11.
Figure 4.11: Schematic of the Power Block
Table 4.4: External 3-bit signal conditions
000 Normal Processor Operation
001 1 NOP inserted after every instruction
010 2 NOPS inserted after every instruction
011 3 NOPS inserted after every instruction
100 4 NOPS inserted after every instruction
101 5 NOPS inserted after every instruction
110 6 NOPS inserted after every instruction
111 7 NOPS inserted after every instruction
In this architecture the output of the instruction memory, instead of feeding the pipeline
register for the decode stage directly, is fed to this Power Block. The Power Block then
performs two operations. First, it reads the external instruction slowdown signal and decides
whether the normal stream of instructions should be executed or NOPs should be inserted in
the pipeline after every instruction. Second, it keeps the program counter
pointing at the same instruction until the NOPs are executed. Once the NOPs are inserted
into the pipeline, the control unit decodes the NOP instruction, generates the power
signals, and puts the data memory, register file and the ALU into low-power mode by the
power-gating techniques discussed in Chapter 3. A low-power mode can also be applied to the
instruction memory when the pipeline is executing the inserted NOP instruction. Once the
NOPs are executed, the program counter advances and the next instruction is fetched from
instruction memory, which is then executed followed by NOPs again.
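The behavior of the Power Block can be mimicked in software. This is a behavioral sketch only, not the schematic of Figure 4.11: it assumes straight-line code without branches or hazards, and the names `insert_nops` and `NOP` are hypothetical.

```python
NOP = "nop"


def insert_nops(program, slowdown_bits):
    """Yield (pc, instruction) pairs as issued to the decode stage.

    program:       list of instructions in program order.
    slowdown_bits: value of the external 3-bit signal (0..7), i.e. the
                   number of NOPs issued after every instruction.
    """
    if not 0 <= slowdown_bits <= 7:
        raise ValueError("the slowdown signal is 3 bits wide (0..7)")
    for pc, instruction in enumerate(program):
        yield pc, instruction
        for _ in range(slowdown_bits):
            # The program counter is held while the NOPs are issued.
            yield pc, NOP


if __name__ == "__main__":
    for pc, instr in insert_nops(["lw", "add", "sw"], 0b010):
        print(pc, instr)
```

With the signal set to 000 the original instruction stream passes through unchanged, which corresponds to normal processor operation in Table 4.4.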
4.4 RESULTS
4.4.1 Blocks of the Original Processor
First, the power consumption of the different blocks of the processor was estimated in
order to identify which blocks consume the most power and should be put into low-power
mode. To estimate the power of these blocks, the structural Verilog netlist of each block
was converted into a Rutgers Mode gate-level netlist. The power of this netlist was then
estimated by the tool developed by Jins Alexander at Auburn University [7]. The power
results for the blocks were obtained by applying 1000 random input vectors, with a vector
period of 100ns and rise time of 1ns for combinational circuits and a vector period of 200ns
with rise time of 1ns for sequential circuits. The operating voltage applied is 1V. The fanout
wire load delay format was used, with the delay of each gate in ns. Power was estimated
for each of these blocks in two different technologies, 65nm and 22nm, with a clock period
of 10ns. The results, in microwatts, are shown in Tables 4.5 and 4.6.
From Tables 4.5 and 4.6 we observe that as the technology is scaled down to 22nm
from 65nm, the leakage power increases and the dynamic power decreases. The leakage
power is relatively high in 22nm technology and forms the major part of the total power
consumption. We also observe that for this particular processor the data memory, register
file and ALU consume more power and so power-gating techniques can be applied to these
units to put them into low-power mode when the NOP is executed.
However, the results of Tables 4.5 and 4.6 show that the expected ratio of dynamic and
leakage power is not maintained as the Berkeley Predictive Technology Model  les available
may have been modi ed to account for certain industry processes.
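As a rough check of this trend, the leakage share of the total average power can be computed for a few blocks directly from the values reported in Tables 4.5 and 4.6. The sketch below is only illustrative; the dictionary POWER_UW and the helper leakage_share are hypothetical names, and only three blocks are included.

```python
# (leakage, dynamic) power in microwatts, copied from Tables 4.5 (65nm) and 4.6 (22nm).
POWER_UW = {
    "alu":      {"65nm": (3.760219442, 2.584336),  "22nm": (183.8257594, 1.278279)},
    "data-mem": {"65nm": (5566.433072, 0.044306),  "22nm": (7573.0565, 0.011231)},
    "regfile":  {"65nm": (78306.37693, 19.659017), "22nm": (146416.0234, 5.043241)},
}


def leakage_share(leakage_uw: float, dynamic_uw: float) -> float:
    """Fraction of the total average power that is leakage."""
    return leakage_uw / (leakage_uw + dynamic_uw)


for block, by_node in POWER_UW.items():
    for node in ("65nm", "22nm"):
        leak, dyn = by_node[node]
        print(f"{block:9s} {node}: leakage share = {leakage_share(leak, dyn):.4f}")
```

For the ALU, for example, leakage is a little under 60% of the total at 65nm and more than 99% at 22nm, while for the data memory and register file leakage dominates in both technologies.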
Table 4.5: Estimated power in microwatts for different blocks of the processor (65nm CMOS
technology, clock period 10ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
add-1-word      58                1.30247             0.371326            1.673802
add-nbits       142               3.310713737         1.322882            4.633595
alu             126               3.760219442         2.584336            6.344556
comparator      43                1.174578188         0.288572            1.463151
control-logic   12                0.221470117         0.111679            0.333149
data-mem        649               5566.433072         0.044306            5566.47731
inst-mem        63                0.99482736          0.061841            1.056669
hazard          30                0.754764244         0.17651             0.931275
forward         39                0.919369882         0.223469            1.142839
ex-m            73                3142.959671         4.311972            3147.271695
id-ex           123               5176.235456         6.456847            5182.692315
if-id           150               4201.080184         4.525968            4205.605946
m-wb            71                2997.391392         4.288376            3001.679666
pc              60                1690.834528         2.002712            1692.837221
regfile         4178              78306.37693         19.659017           78326.03902
Table 4.6: Estimated power in microwatts for different blocks of the processor (22nm CMOS
technology, clock period 10ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
add-1-word      58                61.57274038         0.287966            61.860708
add-nbits       142               160.1353579         0.678731            160.814088
alu             126               183.8257594         1.278279            185.104043
comparator      43                57.45039016         0.075063            57.525453
control-logic   12                10.12932989         0.074958            10.204288
data-mem        649               7573.0565           0.011231            7573.067676
inst-mem        63                54.12064638         0.016018            54.136664
hazard          30                35.5                0.047986            35.560068
forward         39                44.38890755         0.059332            44.448239
ex-m            73                8541.498333         3.325714            8544.824086
id-ex           123               14295.63016         4.974914            14300.60528
if-id           150               8512.405679         1.761546            8514.16681
m-wb            71                8259.585127         3.308923            8262.894116
pc              60                3808.628768         0.910683            3809.539368
regfile         4178              146416.0234         5.043241            146421.06
Table 4.7: Estimated power in microwatts for different blocks of the processor (65nm CMOS
technology, clock period 20ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
ex-m            73                3147.808369         2.225534            3150.033997
id-ex           123               5183.414556         3.332566            5186.747294
if-id           150               3977.928776         2.623281            3980.551846
m-wb            71                3001.879202         2.213355            3004.09249
pc              60                1686.125412         1.033658            1687.159063
regfile         4178              78049.01153         15.659671           78064.67265
Table 4.8: Estimated power in microwatts for different blocks of the processor (65nm CMOS
technology, clock period 40ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
ex-m            73                3150.349716         1.131009            3151.480807
id-ex           123               5187.182222         1.693599            5188.875832
if-id           150               3976.784647         1.333143            3978.117835
m-wb            71                3004.232887         1.12482             3005.357692
pc              60                1683.653332         0.525302            1684.178598
regfile         4178              78611.9774          8.016249            78619.99422
To examine the effect of clock frequency on the blocks of the processor, we again applied
1000 random vectors to the sequential blocks of the processor with a vector period of
200ns and a rise time of 1ns. The operating voltage was 1V. The fanout wire load delay
format was used, with the delay of each gate in ns. Results were obtained for two clock
periods, 20ns and 40ns. The results for 65nm technology are shown in Tables 4.7 and 4.8.
From Tables 4.7 and 4.8 we observe that when the clock frequency is halved, the dynamic
power is reduced by about half, whereas the leakage power is not stable. Even though the
dynamic power is halved, the leakage power is so high that the total average power of most
blocks increases as the clock frequency is reduced.
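These observations can be checked numerically against the tabulated values. Dynamic power in CMOS is proportional to the switching frequency, so doubling the clock period should roughly halve it. The following sketch, in which the names P20, P40, dynamic_ratio and total_rises are chosen only for illustration, compares the 20ns and 40ns results for 65nm technology.

```python
# (leakage, dynamic, total) in microwatts for 65nm technology,
# copied from Table 4.7 (clock period 20ns) and Table 4.8 (clock period 40ns).
P20 = {
    "ex-m":    (3147.808369, 2.225534, 3150.033997),
    "id-ex":   (5183.414556, 3.332566, 5186.747294),
    "if-id":   (3977.928776, 2.623281, 3980.551846),
    "m-wb":    (3001.879202, 2.213355, 3004.09249),
    "pc":      (1686.125412, 1.033658, 1687.159063),
    "regfile": (78049.01153, 15.659671, 78064.67265),
}
P40 = {
    "ex-m":    (3150.349716, 1.131009, 3151.480807),
    "id-ex":   (5187.182222, 1.693599, 5188.875832),
    "if-id":   (3976.784647, 1.333143, 3978.117835),
    "m-wb":    (3004.232887, 1.12482, 3005.357692),
    "pc":      (1683.653332, 0.525302, 1684.178598),
    "regfile": (78611.9774, 8.016249, 78619.99422),
}

# Ratio of dynamic power at 20ns to dynamic power at 40ns (ideal value: 2).
dynamic_ratio = {block: P20[block][1] / P40[block][1] for block in P20}

# Blocks whose total average power is higher at the slower clock.
total_rises = [block for block in P20 if P40[block][2] > P20[block][2]]

for block, ratio in dynamic_ratio.items():
    print(f"{block:8s} dynamic(20ns)/dynamic(40ns) = {ratio:.3f}")
print("total power rises for:", total_rises)
```

The dynamic ratios all come out close to 2, while the total power rises for four of the six blocks because the small change in leakage outweighs the saving in dynamic power.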
Table 4.9: Estimated power in microwatts for different blocks of the processor (22nm CMOS
technology, clock period 20ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
ex-m            73                8547.215723         1.716498            8548.93215
id-ex           123               14304.46375         2.567698            14307.03141
if-id           150               8355.727419         0.991802            8356.719278
m-wb            71                8264.964446         1.707831            8266.672492
pc              60                3804.834792         0.47003             3805.304877
regfile         4178              146260.038          4.038493            146264.0762
Table 4.10: Estimated power in microwatts for different blocks of the processor (22nm
CMOS technology, clock period 40ns)
Block Name      Number of gates   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
ex-m            73                8550.218306         0.872319            8551.090956
id-ex           123               14309.11012         1.304896            14310.4149
if-id           150               8356.056176         0.50403             8356.560022
m-wb            71                8267.787285         0.867914            8268.655278
pc              60                3802.842693         0.238868            3803.081578
regfile         4178              146664.9324         2.018337            146666.944
Similarly, the effect of clock frequency on the blocks of the processor for 22nm technology
is shown in Tables 4.9 and 4.10.
4.4.2 Power Block
For the modified processor, the new block added to the architecture is the power block.
The power results for this new block were obtained by applying 1000 random input vectors
with a vector period of 200ns, a rise time of 1ns and an operating voltage of 1V. The fanout
wire load delay format was used, with the delay of each gate in ns. Power was estimated for
two different technologies, 65nm and 22nm, and for clock periods of 10ns, 20ns and 40ns.
The results, in microwatts, are shown in Tables 4.11 and 4.12.
Table 4.11: Estimated power in microwatts for power block of the processor (65nm CMOS
technology)
Number of gates   Clock period (ns)   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
49                10                  43.57989            6.246311            49.826202
49                20                  44.52892            3.089169            47.618098
49                40                  46.42725            1.512446            47.939706
Table 4.12: Estimated power in microwatts for power block of the processor (22nm CMOS
technology)
Number of gates   Clock period (ns)   Avg Leakage Power   Avg Dynamic Power   Total Avg Power
49                10                  335.59347           3.457777            339.051243
49                20                  336.11856           1.712433            337.831007
49                40                  337.27590           0.840803            338.116719
For the power block, too, we observe that the leakage power is higher for 22nm technology
than for 65nm technology. We also observe that as the clock frequency is reduced, the
dynamic power decreases, whereas the leakage power increases. Because leakage is the major
fraction of the total power, the total power changes little: it falls slightly between the
10ns and 20ns clock periods and rises again at 40ns. These results also contradict our
expectations, as the leakage power increases instead of remaining stable when the clock
frequency is reduced.
4.4.3 Comparison between the Original and Modified Processors
The modified processor, which includes the power block and inserts 1 NOP after every
instruction, was compared with the original processor. In each pair of rows, the original
processor runs at twice the clock period of the modified processor. The results obtained
for 22nm technology are shown in Table 4.13 and those for 65nm technology in Table 4.14.
Table 4.13: Comparing the original and modified processors for 22nm technology for 1 NOP
(power in microwatts)
Processor   No. of gates   Clock Period   Avg Leakage Pwr   Avg Dynamic Pwr   Total Avg Pwr
Original    6684           20ns           235042.2          3.352106          235045.5521
Modified    6787           10ns           225510.6          3.533928          225514.1339
Original    6684           40ns           237590.5          1.670603          237592.1706
Modified    6787           20ns           226043.6          1.746964          226045.347
Table 4.14: Comparing the original and modified processors for 65nm technology for 1 NOP
(power in microwatts)
Processor   No. of gates   Clock Period   Avg Leakage Pwr   Avg Dynamic Pwr   Total Avg Pwr
Original    6684           20ns           116676.4          13.29557          116689.6956
Modified    6787           10ns           111695.1          14.057876         111709.1579
Original    6684           40ns           118611.3          6.529785          118617.8298
Modified    6787           20ns           113055.2          6.828938          113062.0289
From Tables 4.13 and 4.14 we observe that an average power saving of about 4.46% is
obtained for the modified processor when 1 NOP is inserted into the pipeline after each
instruction. This saving is smaller than expected. One possible reason is that the
technology files used for the experiment were not reliable. In addition, the tool used to
obtain the results could not simulate large designs and operated only with constant inputs,
whereas the design required pulsed inputs.
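The average saving can be reproduced from the total average power values in Tables 4.13 and 4.14. The sketch below assumes that the saving is computed per pair of rows, relative to the original processor, and then averaged over the four pairs; the names PAIRS and saving_percent are hypothetical.

```python
# Total average power in microwatts as (original, modified) pairs,
# copied from Table 4.13 (22nm) and Table 4.14 (65nm).
PAIRS = {
    "22nm, original 20ns vs modified 10ns": (235045.5521, 225514.1339),
    "22nm, original 40ns vs modified 20ns": (237592.1706, 226045.347),
    "65nm, original 20ns vs modified 10ns": (116689.6956, 111709.1579),
    "65nm, original 40ns vs modified 20ns": (118617.8298, 113062.0289),
}


def saving_percent(original_uw: float, modified_uw: float) -> float:
    """Power saving of the modified processor relative to the original, in percent."""
    return 100.0 * (original_uw - modified_uw) / original_uw


savings = {name: saving_percent(*pair) for name, pair in PAIRS.items()}
average_saving = sum(savings.values()) / len(savings)

for name, value in savings.items():
    print(f"{name}: {value:.2f}%")
print(f"average saving: {average_saving:.2f}%")
```

Under this assumption the individual savings lie between roughly 4% and 5%, and their mean agrees with the reported figure of about 4.46% to within rounding.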
4.5 Summary
This chapter discussed the theoretical results, which show that the new energy-saving
instruction slow-down technique is better than the reference clock slow-down technique for
higher-leakage technologies when a NOP cycle consumes 50% or less of the power of a normal
instruction cycle. The architecture of a 32-bit MIPS processor and the modified architecture
of the new processor were also explained. However, the practical results were not as
expected, because the technology files used for the experiment were unreliable and the tool
used to simulate the design was not efficient.
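The comparison between the two techniques can be illustrated with a simplified energy model. This is a sketch under stated assumptions and not the derivation used in this thesis: each normal cycle is assumed to consume a leakage energy p_leak * t_clk plus a fixed dynamic energy e_dyn, clock slow-down is assumed to double the clock period, and instruction slow-down is assumed to add one NOP cycle whose energy is nop_ratio times that of a normal cycle. All function and parameter names are hypothetical.

```python
def energy_clock_slowdown(p_leak: float, e_dyn: float, t_clk: float) -> float:
    """Energy per instruction when the clock period is doubled (assumed model)."""
    return p_leak * (2.0 * t_clk) + e_dyn


def energy_instruction_slowdown(p_leak: float, e_dyn: float, t_clk: float,
                                nop_ratio: float) -> float:
    """Energy per instruction when one NOP cycle follows it (assumed model).

    nop_ratio is the energy of a NOP cycle relative to a normal cycle.
    """
    e_cycle = p_leak * t_clk + e_dyn
    return e_cycle * (1.0 + nop_ratio)


def stalls_win(p_leak: float, e_dyn: float, t_clk: float, nop_ratio: float) -> bool:
    """True when instruction slow-down uses less energy than clock slow-down."""
    return (energy_instruction_slowdown(p_leak, e_dyn, t_clk, nop_ratio)
            < energy_clock_slowdown(p_leak, e_dyn, t_clk))


# Arbitrary units: leakage-dominated versus dynamic-dominated normal cycles.
high_leakage = stalls_win(p_leak=0.9, e_dyn=0.1, t_clk=1.0, nop_ratio=0.5)
low_leakage = stalls_win(p_leak=0.1, e_dyn=0.9, t_clk=1.0, nop_ratio=0.5)
print("leakage-dominated:", high_leakage)
print("dynamic-dominated:", low_leakage)
```

In this model both techniques take two original clock periods per instruction, and instruction slow-down wins whenever nop_ratio is smaller than the leakage share of a normal cycle's energy. A NOP cycle at 50% of a normal cycle therefore helps only when leakage accounts for more than half of the cycle energy, which is consistent with the technique being aimed at higher-leakage technologies.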
Chapter 5
Conclusion
Leakage power is a major concern in current and future microprocessor designs, since
with technology scaling it is becoming a major component of the total power consumption.
In this thesis, to reduce leakage power in higher-leakage technologies, we propose a new
hardware-software technique in which pipeline stalls are inserted into the processor after
every instruction while the clock rate of the processor is maintained. The hardware units
are designed to save leakage power while a NOP instruction is processed by putting the idle
blocks into sleep mode. This technique is more effective when a NOP cycle consumes less
than 50% of the power of a regular instruction cycle. As future work, the power of the
active cycles of the processor can be studied to further reduce leakage and dynamic power.
Voltage reduction can also be considered for a further reduction in power when the clock
frequency is reduced, provided the performance penalty can be tolerated.
Bibliography
[1] International technology roadmap for semiconductors. http://public.itrs.net.
[2] http://www-device.eecs.berkeley.edu/~ptm: BSIM3 files.
[3] A. Abdollahi, F. Fallah, and M. Pedram, "Leakage Current Reduction in CMOS VLSI
Circuits by Input Vector Control," IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 12, no. 2, pp. 140–154, Feb. 2004.
[4] K. Agarwal, H. Deogun, D. Sylvester, and K. Nowka, "Power Gating with Multiple Sleep
Modes," in Proc. International Symposium on Quality Electronic Design, 2006, pp. 633–637.
[5] V. D. Agrawal, "Low-power design by hazard filtering," in Proc. 10th International Conf. on
VLSI Design, Jan. 1997, pp. 193–197.
[6] V. D. Agrawal, M. L. Bushnell, G. Parthasarathy, and R. Ramadoss, "Digital circuit design
for minimum transient energy and a linear programming method," in Proc. 12th International
Conf. VLSI Design, 1999, pp. 434–439.
[7] J. D. Alexander, "Simulation Based Power Estimation for Digital CMOS Technologies,"
Master's thesis, Auburn University, ECE Department, Dec. 2008.
[8] A. Arthurs and L. Ngo, "Analysis of the MIPS-32 Bit, Pipelined Processor using Synthesized
VHDL." Department of Computer Science and Engineering, University of Arkansas.
[9] P. Ashenden, The Designer's Guide to VHDL. Morgan Kaufmann, 2002.
[10] S. Augsburger and B. Nikolić, "Combining Dual-Supply, Dual-Threshold and Transistor Sizing
for Power Reduction," in Proc. IEEE International Conference on Computer Design, 2002,
pp. 316–321.
[11] J. L. Ayala, A. Veidenbaum, and M. Lopez-Vallejo, "Power-Aware Compilation for Register
File Energy Reduction," International Journal of Parallel Programming, vol. 31, no. 6, pp.
451–467, Dec. 2003.
[12] G. Baccarani, M. Wordeman, and R. Dennard, "Generalized Scaling Theory and its
Application to 1/4 Micrometer MOSFET Design," IEEE Transactions on Electron Devices,
vol. ED-31, no. 4, pp. 452–462, April 1984.
[13] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[14] R. Bechade, R. Flaker, B. Kauffmann, S. Kenyon, C. London, S. Mahin, K. Nguyen, D. Pham,
A. Roberts, S. Ventrone, and T. VonReyn, "A 32b 66MHz 1.8W Microprocessor," in Proc.
IEEE International Solid-State Circuits Conference, Feb. 1994, pp. 208–209.
[15] A. Bellaur and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems, p. 90.
Norwell, MA: Kluwer Publisher, 1996.
[16] A. Bellaur and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems, pp.
135–137. Norwell, MA: Kluwer Publisher, 1996.
[17] L. Benini, M. Favalli, and B. Ricco, "Analysis of Hazard Contributions to Power Dissipation
in CMOS IC's," in Proc. International Workshop on Low-Power Design, Apr. 1994, pp. 27–32.
[18] L. Benini and G. D. Micheli, Dynamic Power Management, Design Techniques and CAD
Tools. Springer, 1998.
[19] L. Benini, G. D. Micheli, E. Macii, M. Poncino, and S. Quer, "Reducing Power of Core Based
Systems by Address Bus Encoding," IEEE Trans. on VLSI Systems, vol. 6, no. 4, pp. 554–562,
Oct. 1998.
[20] D. Benson, Y. Bobra, and B. McWilliams, "Silicon Multichip Modules," in International
Workshop on Low-Power Design, Aug. 1990.
[21] S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, vol. 19, no. 4, pp. 23–29,
July 1999.
[22] T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design. Norwell, MA:
Kluwer Academic Publisher, 2001.
[23] J. Burr and A. Peterson, "Energy Considerations in Multichip Module Based
Multiprocessors," in Proc. International Conference on Computer Design, April 1991, pp. 593–600.
[24] J. Burr and J. Shott, "A 200mV Self-Testing Encoder/Decoder using Stanford
Ultra-Low-Power CMOS," in Proc. of the International Solid State Circuits Conference, 1994, pp. 84–85.
[25] B. H. Calhoun and A. P. Chandrakasan, "Standby Power Reduction Using Dynamic Voltage
Scaling and Canary Flip-Flop Structures," IEEE Journal of Solid-State Circuits, vol. 39, no. 9,
pp. 1504–1511, Sept. 2004.
[26] Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu, "New Paradigm of Predictive
MOSFET and Interconnect Modeling for Early Circuit Design," in Proc. of IEEE Custom
Integrated Circuit Conference, 2000, pp. 201–204.
[27] A. Chandrakasan, R. Allmon, A. Stratakos, and R. W. Brodersen, "Design of Portable
Systems," in Proc. Custom Integrated Circuits Conf., May 1994, pp. 259–266.
[28] A. Chandrakasan and R. W. Brodersen, "Minimizing Power Consumption in Digital CMOS
Circuits," in Proc. IEEE, 1995, pp. 498–523.
[29] A. Chandrakasan, A. Burstein, and R. Brodersen, "A Low Power Chipset for Portable
Multimedia Applications," in Proc. International Solid-State Circuits Conference, 1994, pp. 82–83.
[30] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE
J. Solid State Circuits, vol. 27, no. 4, pp. 473–484, April 1992.
[31] K. Chao and D. Wong, "Low Power Considerations in Floorplan Design," in International
Workshop on Low Power Design, 1994, pp. 45–50.
[32] S. Chatre, C. D. Knutson, D. Margineantu, and C. Schulz-Key, "Improving DLX Performance
by Taking Some of the Reduction Out of RISC," in Proc. of the International Conference on
Technical Informatics, 1996.
[33] C. Chen and M. Sarrafzadeh, "Power Reduction by Simultaneous Voltage Scaling and Gate
Sizing," in Proc. of the Asia and South Pacific Design Automation Conference, 2000, pp.
333–338.
[34] M. H. Chowdhury, J. Gjanci, and P. Khaled, "Innovative Power Gating for Leakage
Reduction," in Proc. International Symposium on Circuits and Systems, 2008, pp. 1568–1571.
[35] R. H. Dennard, F. H. Gaensslen, H. N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc,
"Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE Journal
of Solid State Circuits, vol. SC-9, no. 5, pp. 256–258, Oct. 1974.
[36] D. Dobberpuhl, R. Witek, R. Allmon, R. Anglin, S. Britton, L. Chao, R. Conrad, D. Dever,
B. Gieseke, G. Hoeppner, J. Kowaleski, K. Kuchler, M. Ladd, M. Leary, L. Madden,
E. McLellan, D. Meyer, J. Montanaro, D. Priore, V. Rajagopalan, S. Samudrala, and
S. Santhanam, "A 200MHz 64b Dual-Issue CMOS Microprocessor," in Proc. 39th IEEE
International Solid-State Circuits Conference, Feb. 1992, pp. 106–107.
[37] F. Emnett and M. Biegel, "Power Reduction Through RTL Clock Gating." Presented at
Synopsys Users Group (SNUG), 2000.
[38] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,
K. Flautner, and T. Mudge, "Razor: A Low-Power Pipeline Based on Circuit-Level Timing
Speculation," in Proc. 36th Annual IEEE/ACM International Symp. on Microarchitecture,
Dec. 2003, pp. 7–18.
[39] M. Farrahi, G. E. Tellez, and M. Sarrafzadeh, "Memory segmentation to exploit sleep mode
operation," in Proc. Design Automation Conference, 1995, pp. 36–41.
[40] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy Caches: Simple
Techniques for Reducing Leakage Power," in Proc. International Symposium on Computer
Architecture, 2002, pp. 148–157.
[41] R. Gonzalez, B. M. Gordon, and M. A. Horowitz, "Supply and Threshold Voltage Scaling for
Low Power CMOS," IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, Aug.
1997.
[42] F. Hamzaoglu and M. R. Stan, "Circuit-Level Techniques to Control Gate Leakage for
sub-100nm CMOS," in International Symposium on Low Power Electronics and Design, 2002, pp.
60–63.
[43] M. Hans, "Architectural Aspects of Design for Low Static Power Consumption," Master's
thesis, Computer Science and Engineering Division, Technical University of Denmark, 2004.
[44] J. G. Hansen, "Design of CMOS Cell Libraries for Minimal Leakage Currents," Master's
thesis, Dept. of Informatics and Mathematical Modeling, Technical University of Denmark,
2004.
[45] Z. L. He, K. K. Chan, C. Y. Tsui, and M. L. Liou, "Low Power Motion Estimation Design
using Adaptive Pixel Truncation," in Proc. International Symp. on Low Power Electronics
and Design, 1997, pp. 167–172.
[46] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach.
Morgan Kaufmann, 1990.
[47] S. Hines, J. Green, G. Tyson, and D. Whalley, "Improving Program Efficiency by Packing
Instructions into Registers," in Proc. of the 2005 ACM/IEEE International Symposium on
Computer Architecture, 2005, pp. 260–271.
[48] S. Hines, G. Tyson, and D. Whalley, "Improving the Energy and Execution Efficiency of
a Small Instruction Cache by using an Instruction Register File," in Proc. of the Watson
Conference on Interaction between Architecture, Circuits, and Compilers, 2005, pp. 160–169.
[49] F. Hu, Process-Variation-Resistant Dynamic Power Optimization for VLSI Circuits. PhD
thesis, Auburn University, ECE Department, May 2006.
[50] F. Hu and V. D. Agrawal, "Dual-Transition Glitch Filtering in Probabilistic Waveform Power
Estimation," in Proc. IEEE Great Lakes Symp. on VLSI, Apr. 2005, pp. 357–360.
[51] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose,
"Microarchitectural Techniques for Power Gating of Execution Units," in International Symposium on
Low Power Electronics and Design, 2004, pp. 32–37.
[52] A. Iyer and D. Marculescu, "Power Efficiency of Voltage Scaling in Multiple Clock, Multiple
Voltage Cores," in Proc. IEEE/ACM International Conference on Computer Aided Design,
2002, pp. 379–386.
[53] P. Kalla, X. S. Hu, and J. Henkel, "Distance-Based Recent Use (DRU): An Enhancement to
Instruction Cache Replacement Policies for Transition Energy Reduction," IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 1, pp. 69–80, Jan. 2006.
[54] J. T. Kao and A. P. Chandrakasan, "Dual-Threshold Voltage Techniques for Low-Power
Digital Circuits," IEEE Journal of Solid-State Circuits, vol. 35, no. 7, pp. 1009–1018, July 2000.
[55] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, "Circuit and Microarchitectural
Techniques for Reducing Cache Leakage Power," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 12, no. 2, pp. 167–184, Feb. 2004.
[56] P. E. Landman, Low-Power Architectural Design Methodologies. PhD thesis, University of
California, Berkeley, 1994.
[57] C. Lee, J. K. Lee, T. Hwang, and S. Tsai, "Compiler Optimization on Instruction Scheduling
for Low Power," in Proc. of the 13th International Symposium on System Synthesis, 2000, pp.
55–60.
[58] C. E. Leiserson, Area-Efficient VLSI Computation. PhD thesis, Massachusetts Institute of
Technology, 1983.
[59] C. E. Leiserson, F. M. Rose, and J. B. Saxe, "Optimizing Synchronous Circuitry by Retiming,"
in Proc. Third Caltech Conf., 1983, pp. 86–116.
[60] C. E. Leiserson and J. B. Saxe, "Optimizing Synchronous Systems," Journal of VLSI and
Computer Systems, vol. 1, no. 1, pp. 41–67, 1983.
[61] C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," in Algorithmica, 1991, pp.
5–35.
[62] H. Li, S. Bhunia, Y. Chen, T. N. Vijaykumar, and K. Roy, "Deterministic Clock Gating for
Microprocessor Power Reduction," in Proc. 9th International Symposium on High-Performance
Computer Architecture, Feb. 2003, pp. 113–122.
[63] D. Liu and C. Svensson, "Trading Speed for Low Power by Choice of Supply and Threshold
Voltages," IEEE J. Solid-State Circuits, pp. 10–17, Jan. 1993.
[64] W. Liu, "Techniques of Leakage Power Reduction in Nanoscale Circuits: A Survey." IMM
Report, Dept. of Informatics and Mathematical Modeling, Technical University of Denmark,
2007.
[65] P. Lotfi-Kamran, A. Rahmani, A. Salehpour, A. Afzali-Kusha, and Z. Navabi, "Stall Power
Reduction in Pipelined Architecture Processors," in Proc. of 21st International Conference
on VLSI Design, 2008, pp. 541–546.
[66] Y. Lu, Power and Performance Optimization of Static CMOS Circuits with Process Variation.
PhD thesis, Auburn University, ECE Department, Aug. 2007.
[67] Y. Lu and V. D. Agrawal, "Leakage and Dynamic Glitch Power Minimization Using Integer
Linear Programming for Vth Assignment and Path Balancing," in Proc. International
Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2005, pp.
217–226.
[68] Y. Lu and V. D. Agrawal, "CMOS Leakage and Glitch Minimization for Power-Performance
Tradeoff," Journal of Low Power Electronics, vol. 2, no. 3, pp. 378–387, Dec. 2006.
[69] Y. Lu and V. D. Agrawal, "Total Power Minimization in Glitch-Free CMOS Circuits
Considering Process Variation," in Proc. of the 21st International Conference on VLSI Design, Jan.
2008, pp. 527–532.
[70] S. M. Martin, K. Flautner, T. Mudge, and D. Blaauw, "Combined Dynamic Voltage Scaling
and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads," in
Proc. IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 721–725.
[71] B. Mathew, A. Davis, and M. Parker, "A Low Power Architecture for Embedded Perception,"
in Proc. of International Conference on Compilers, Architecture and Synthesis for Embedded
Systems (CASES), 2004, pp. 46–56.
[72] Y. Meng, T. Sherwood, and R. Kastner, "On the Limits of Leakage Power Reduction in
Caches," in International Symposium on High-Performance Computer Architecture, 2005, pp.
154–165.
[73] B. Moyer, "Low Power Design for Embedded Processors," in Proc. of IEEE, 2001, pp.
1576–1587.
[74] R. S. Muller and T. I. Kamins, Device Electronics for Integrated Circuits. John Wiley and
Sons, 1986.
[75] R. Murgai, R. Brayton, and A. Sangiovanni-Vincentelli, "Decomposition of Logic Functions
for Minimum Transition Activity," in Proc. International Workshop on Low-Power Design,
Apr. 1994, pp. 33–38.
[76] C. Nagendra, U. Mehta, R. Owens, and M. Irwin, "A Comparison of the Power-Delay
Characteristics of CMOS Adders," in Proc. International Workshop on Low Power Design, Apr.
1994, pp. 231–236.
[77] K. Najeeb, V. V. R. Konda, S. S. Hari, V. Kamakoti, and V. M. Vedula, "Power Virus
Generation Using Behavioral Models of Circuits," in Proc. 25th IEEE VLSI Test Symposium,
2007, pp. 35–40.
[78] S. R. Nassif, H. Jiang, and M. Marek-Sadowska, "Benefits and Costs of Power-Gating
Technique," in International Conference on Computer Design, 2005, pp. 559–566.
[79] K. Natarajan, H. Hanson, S. W. Keckler, C. R. Moore, and D. Burger, "Microprocessor
Pipeline Energy Analysis," in Proc. of the 2003 International Symposium on Low power
Electronics and Design, 2003, pp. 282–287.
[80] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The
Hardware/Software Interface, Fourth Edition. Morgan Kaufmann, 2009.
[81] T. Pering, T. Burd, and R. Brodersen, "Dynamic Voltage Scaling and the Design of a
Low-Power Microprocessor System," in Proc. Power-Driven Microarchitecture Workshop, 1998, pp.
251–259.
[82] T. S. Perry, "Gordon Moore's Next Act," IEEE Spectrum, vol. 45, no. 5, May 2008.
[83] D. Pham, M. Alexander, A. Arizpe, B. Burgess, C. Dietz, L. Eisen, R. El-Kareh, J. Eno,
S. Gary, G. Gerosa, B. Goins, J. Golab, R. Golla, R. Harris, B. Ho, Y.-W. Ho, K. Hoover,
C. Hunter, P. Ippolito, R. Jessani, J. Kahle, K. R. Kishore, B. Kuttanna, S. Litch, S. Mallick,
T. Ngo, D. Ogden, C. Olson, S.-H. Park, R. Patel, M. Pham, J. Prado, S. Reeve, R. Reininger,
H. Sanchez, M. Schi i, J. Slaton, G. Thuraisingham, K. Torku, C. Tran, N. Vanderschaaf,
and P. Voldstad, "A 3.0W 75SPECint92 85SPECfp92 Superscalar RISC Microprocessor," in
Proc. International Solid-State Circuits Conference, Feb. 1994, pp. 212–213.
[84] J. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice Hall, 1995.
[85] H. Rahman and C. Chakrabarti, "An Efficient Control Point Insertion Technique for Leakage
Reduction of Scaled CMOS Circuits," IEEE Transactions on Circuits and Systems-II: Express
Briefs, vol. 52, no. 8, pp. 496–500, Aug. 2005.
[86] T. Raja, "A Reduced Constraint Set Linear Program for Low-Power Design of Digital Circuit,"
Master's thesis, Rutgers University, ECE Department, Mar. 2002.
[87] T. Raja, Minimum Dynamic Power CMOS Design with Variable Input Delay Logic. PhD
thesis, Rutgers University, ECE Department, May 2004.
[88] T. Raja, V. D. Agrawal, and M. L. Bushnell, "Minimum Dynamic Power CMOS Circuit
Design by a Reduced Constraint Set Linear Program," in Proc. 16th International Conf.
VLSI Design, Jan. 2003, pp. 527–532.
[89] T. Raja, V. D. Agrawal, and M. L. Bushnell, "CMOS Circuit Design for Minimum Dynamic
Power and Highest Speed," in Proc. 17th International Conf. VLSI Design, Jan. 2004, pp.
1035–1040.
[90] T. Raja, V. D. Agrawal, and M. L. Bushnell, "Transistor Sizing of Logic Gates to Maximize
Input Delay Variability," Journal of Low Power Electronics, vol. 2, no. 1, pp. 121–128, Apr.
2006.
[91] J. Sacha and M. J. Irwin, "Number Representations for Predicting Data bus Power
Dissipation," in Proc. of International Conference of Acoustics Speech and Signal Processing, 1998,
pp. 163–168.
[92] R. Saleh, "Power and Low-Power Design." Dept. of ECE, University of British Columbia.
[93] M. J. Schulte, J. E. Stine, and J. G. Jansen, "Reduced Power Dissipation Through Truncated
Multiplication," in Proc. IEEE Alessandro Volta Memorial Workshop Low-Power Design,
March 1999, pp. 61–69.
[94] J. Schutz, "A 3.3V 0.6µm BiCMOS Superscalar Microprocessor," in Proc. IEEE International
Solid-State Circuits Conference, Feb. 1994, pp. 202–203.
[95] D. Snowdon, S. Ruocco, and G. Heiser, "Power Management and Dynamic Voltage Scaling:
Myths and Facts," in Proc. of the 2005 Workshop on Power Aware Real-time Computing,
2005.
[96] R. Srinivasan, J. Cook, and L. Eisen, "Evaluating Instruction Reorderings and
Transformations for Microarchitecture Power Reduction," in Proc. of the 6th Annual Austin CAS
International Conference, 2005.
[97] S. Steinke, R. Schwarz, L. Wehmeyer, and P. Marwedel, "Low Power Code Generation of
a RISC Processor by Register Pipelining." Technical Report 754, University of Dortmund,
Dept. of Computer Science XII, 2001.
[98] A. Stratakos, R. W. Brodersen, and S. R. Sanders, "High-Efficiency Low-Voltage DC-DC
Conversion for Portable Applications," in International Workshop on Low-Power Design,
April 1994, pp. 105–110.
[99] C. Su, C. Tsui, and A. Despain, "Low Power Architecture Design and Compilation Techniques
for High-Performance Processors," in Proc. of IEEE COMPCON, 1994, pp. 489–498.
[100] S. W. Sun and P. Tsui, "Limitation of CMOS Supply-Voltage Scaling by MOSFET Threshold
Variation," in Proc. IEEE Custom Integrated Circuits Conf., 1994, pp. 43–48.
[101] V. Sundararajan and K. K. Parhi, "Low Power Synthesis of Dual Threshold Voltage CMOS
VLSI Circuits," in Proc. International Symp. on Low Power Electronics and Design, 1999,
pp. 139–144.
[102] V. Tiwari, P. Ashar, and S. Malik, "Technology Mapping for Low Power," in Proc. 30th Design
Automation Conference, June 1993, pp. 74–79.
[103] Y. F. Tong, R. A. Rutenbar, and D. F. Nagle, "Minimizing Floating-Point Power Dissipation
via Bit-Width Reduction," in Power Driven Microarchitecture Workshop in Conjunction with
ISCA'98, June 1998, pp. 114–118.
[104] C. Tsui, M. Pedram, and A. Despain, "Technology Decomposition and Mapping Targeting
Low Power Dissipation," in Proc. 30th Design Automation Conference, June 1993, pp. 68–73.
[105] S. Uppalapati, "Low Power Design of Standard Cell Digital VLSI Circuits," Master's thesis,
Rutgers University, ECE Department, May 2004.
[106] S. Uppalapati, M. L. Bushnell, and V. D. Agrawal, "Glitch-Free Design of Low Power ASICS
using Customized Resistive Feedthrough Cells," in Proc. of the 9th VLSI Design and Test
Symposium, Aug. 2005, pp. 41–48.
[107] K. Usami and M. Horowitz, "Cluster Voltage Scaling Technique for Low Power Design," in
International Symposium on Low Power Design, 1995, pp. 3–8.
[108] H. Vaishnav and M. Pedram, "PCUBE: A Performance Driven Placement Algorithm for Lower
Power Designs," in Proc. of the Euro-DAC, 1993, pp. 72–77.
[109] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on
the Design of Buffer Circuits," IEEE J. Solid-State Circuits, pp. 468–473, Aug. 1984.
[110] S. R. Vemuru and N. Scheinberg, "Short Circuit Power Dissipation Estimation for CMOS
Logic Gates," IEEE Trans. Circuits Syst., pp. 762–765, Nov. 1994.
[111] F. M. Wanlass and C. T. Sah, "Nanowatt Logic Using Field-Effect Metal-Oxide Semiconductor
Triodes," in Proc. International Solid State Circuits Conference, Feb. 1963, pp. 32–33.
[112] N. H. E. Weste and D. Harris, CMOS VLSI Design, A Circuits and Systems Perspective, 3rd
Edition. Addison Wesley, 2004.
[113] W. Ye, N. Vijaykrishnan, M. T. Kandemir, and M. J. Irwin, "The Design and Use of Simple
Power: a Cycle-Accurate Energy Estimation Tool," in Proc. of Design Automation
Conference, 2000, pp. 340–345.
[114] N. K. Yeung, Y.-H. Sutu, T. Y.-F. Su, E. T. Pak, C.-C. Chao, S. Akki, D. D. Yau, and
R. Lodenquai, "The Design of a 55SPECint92 RISC Processor under 2W," in Proc. IEEE
International Solid-State Circuits Conference, Feb. 1994, pp. 206–207.
[115] B. Yu, A New Dynamic Power Cut-Off Technology (DPCT) for Leakage Reduction in Deep
Submicron VLSI CMOS Circuits. PhD thesis, Rutgers University, ECE Department, Oct.
2007.
[116] B. Yu and M. L. Bushnell, "A Novel Dynamic Power Cut-off Technique (DPCT) for Active
Leakage Reduction in Deep Submicron CMOS Circuits," in Proc. International Symp. on Low
Power Electronics and Design, 2006, pp. 214–219.