Design of 3.33GHz CML Processor Datapath
by
Abdullah Al Owahid
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
May 7, 2012
Keywords: CML, CMOS, Processor
Copyright 2012 by Abdullah Al Owahid
Approved by
Fa Foster Dai, Chair, Professor of Electrical and Computer Engineering
Vishwani D. Agrawal, James J. Danaher Professor of Electrical and Computer Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering
Abstract
For almost a decade, processor speed has been stuck at an operating frequency of 2-3GHz
due to the excessive power consumption of CMOS logic gates at higher frequencies, whereas
the speed predicted for the present was 10-15GHz. This led to the idea of multi-core design
in today's processor architecture. However, multi-core design increases communication
overhead, and data dependencies prevent the advantage of many-core design from being
fully exploited. Further, many-core design increases the amount of dark silicon, and the
number of cores cannot be increased beyond a certain limit. Therefore, a novel approach to
processor design using CML logic gates has been proposed.
A handcrafted 16-bit CML microprocessor datapath has been developed at an operating
frequency of 3.33GHz using 130nm CMOS technology. With the same feature size, a CMOS
gate is incapable of operating beyond 1GHz, whereas the CML logic gates were optimized for
12GHz using a bias current of 70% of the peak ft current with a logic swing of 600mV.
Considering the critical path delay, the circuit has been slowed down to operate at 3.33GHz.
All the processor components (decoder, mux, register file, ALU) were deliberately
handcrafted due to the lack of analog synthesizer tools. The reported static power
consumption of the multi-cycle CML processor datapath is 41.264W. However, this is not the
best case and could have been reduced to 50% by implementing multi-input CML logic. The
expected chip area is 2.2mm x 3.45mm and the power density per unit area is 5.44µW/µm².
The estimated performance is 892 MIPS. The supply voltage used is 2.8V. The CML logic
levels were defined as logic-1 = 2.8V and logic-0 = 2.2V. A 1V reference voltage was used
to provide a constant bias for the current source, and the reset signal uses 1.3V and 0.7V
for logic high and low respectively. It has been observed that it is possible to realize an
ultra-high speed processor using existing technology with minimum power consumption in
CML logic.
Acknowledgments
I would like to acknowledge the continuous support and guidance of Dr. Fa Foster Dai.
Without his suggestion and direction it would have been impossible to complete this thesis
work. I would also like to thank my committee members Dr. Vishwani D. Agrawal for his
meaningful suggestions regarding processor architecture and Dr. Victor P. Nelson.
I thank my friends and colleagues - James Clark, Shannon Price, Xin Jin and Baohu Li
for being with me and making life at Auburn enjoyable.
Last but not the least, I would like to thank my family members - my parents whose
love brought me so far, my brother and sister, and especially my wife for her patience.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background and High Speed CML Logic Realization . . . . . . . . . . . . . . . 6
2.1 CML Inverter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 CML Inverter Optimization . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 CML Universal Gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Universal CML Gate Optimization . . . . . . . . . . . . . . . . . . . 19
2.3 CML XOR/XNOR Gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 CML Mux Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 CML D-latch Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Speed, Power, Area and Delay of Basic CML Components . . . . . . . . . . 28
3 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 First Clock Cycle of Every Instruction . . . . . . . . . . . . . . . . . . . . . 31
3.2 R-type Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 R-type ADD/SUB/AND/XOR . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 R-type SLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 R-type SEQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 I-type Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 I-type LW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 I-type SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 I-type ADDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.4 I-type MOVI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 J-type Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 J-type J LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 J-type JZ LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.3 J-type JNZ LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.4 J-type JAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.5 J-type JR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Control Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Component Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 16-bit Register With Enable Input . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Z Register (1-bit Register With Enable Input) . . . . . . . . . . . . . . . . . 49
4.3 4 16-bit 2-to-1 Mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 3 4-bit 2-to-1 Mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 3 16-bit 4-to-1 Mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 16-bit 5-to-1 mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 16-bit ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8 16x16 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.9 Sign 4-to-16 extension (Sign 4) . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.10 Sign 8-to-16 extension (Sign 8) . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.11 Sign 12-to-16 extension (Sign 12) . . . . . . . . . . . . . . . . . . . . . . . . 63
4.12 2 unsigned 1-to-16 extension (Unsigned 16) . . . . . . . . . . . . . . . . . . . 64
4.13 4 1-bit AND gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.14 1 1-bit OR gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Processor Verification and Performance . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Processor Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
1.1 Operating frequency over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Operating Frequency vs. Power of Intel Processor . . . . . . . . . . . . . . . . . 2
1.3 Power density per unit area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Current consumptions for CMOS vs. CML logic . . . . . . . . . . . . . . . . . . 6
2.2 CMOS vs. CML power consumption . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 CML Inverter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Normalized current for CML inverter . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 CML inverter half-circuit small-signal model . . . . . . . . . . . . . . . . . . . . 15
2.6 Post vs. pre layout simulation of CML inverter with 18fF input/output load
capacitance (input changes at 83ps) . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 CML inverter layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Universal CML gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9 Universal CML gate with embedded level shifter . . . . . . . . . . . . . . . . . . 18
2.10 Normalized current for CML universal gate . . . . . . . . . . . . . . . . . . . . 20
2.11 Post vs. pre layout simulation of CML AND with 18fF input/output load
capacitance (input changes at 83ps) . . . . . . . . . . . . . . . . . . . . . . . 20
2.12 CML AND layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.13 CML XOR gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.14 Post vs. pre layout simulation of CML XOR with 18fF input/output load
capacitance (input changes at 83ps) . . . . . . . . . . . . . . . . . . . . . . . 23
2.15 CML XOR layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.16 CML Mux realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.17 Post vs. pre layout simulation of CML Mux with 18fF input/output load
capacitance (input changes at 83ps) . . . . . . . . . . . . . . . . . . . . . . . 25
2.18 CML Mux layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.19 CML D-latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.20 Post vs. pre layout simulation of CML D-latch with 18fF input/output load
capacitance at 6GHz (166ps) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.21 CML D-latch layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Processor datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 First clock cycle of any instruction . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 R-type ADD/SUB/AND/XOR . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 R-type SLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 R-type SEQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 I-type LW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 I-type SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8 I-type ADDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 I-type MOVI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10 J-type J LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11 J-type JZ LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.12 J-type JNZ LABEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.13 J-type JAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.14 J-type JR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Datapath Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Block diagram of MS DFF, MS DFF-EN, 16-bit register . . . . . . . . . . . . . 48
4.3 1-bit register output at 6GHz with 20fF load capacitance (clock period 166ps) . 49
4.4 1-bit 4-to-1 mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 1-bit 4-to-1 mux output with 20fF load capacitance (input changes at 83ps) . . 51
4.6 1-bit 5-to-1 mux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 1-bit 5-to-1 Mux output with 20fF load capacitance (input changes at 166ps) . . 52
4.8 Block diagram of 16-bit ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.9 16-bit CLA block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.10 Critical path delay in 16-bit CLA is 224.7ps (input changes at 500ps) . . . . . . 55
4.11 16-bit ALU Output (input changes at 300ps) . . . . . . . . . . . . . . . . . . . 56
4.12 16x16 Register File Schematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.13 4-to-16 Decoder (input changes at 83ps) . . . . . . . . . . . . . . . . . . . . . 59
4.14 1-bit 16-to-1 Mux Schematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.15 1-bit 16-to-1 Mux Output (input changes at 144ps) . . . . . . . . . . . . . . . . 60
4.16 16x16 Register File Output at 3.33GHz . . . . . . . . . . . . . . . . . . . . . . . 61
4.17 Sign 4 to 16 Extension Output (input changes at 72ps) . . . . . . . . . . . . . . 63
4.18 Unsigned 1 to 16 Extension Output (input changes at 83ps) . . . . . . . . . . . 64
5.1 Handcrafted Processor Schematic . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 MOVI instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 ADDI instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 ADD instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5 Static power consumption of CML processor datapath over 13 clock cycles . . . 71
List of Tables
2.1 Inverter Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 CML AND Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 CML XOR Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 CML Mux Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 CML D-latch Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Power, Area and Delay of Basic Components (post layout simulation with 18fF
load capacitance) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Instruction Set Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Opcode and 15 Different Operations . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Control Signal Table Part-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Control Signal Table Part-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Component Power Dissipation, Expected Area and Delay . . . . . . . . . . . . . 65
List of Abbreviations
ALU Arithmetic Logic Unit
BiCMOS Bipolar and CMOS in same integrated chip
BJT Bipolar Junction Transistor
CML Current Mode Logic
CMOS Complementary Metal Oxide Semiconductor
ECL Emitter Coupled Logic
ISA Instruction Set Architecture
MCML MOS Current Mode Logic
MIPS Million Instructions Per Second
nMOS n-type Metal Oxide Semiconductor
RISC Reduced Instruction Set Computing
SoC System on Chip
SPICE Simulation Program with Integrated Circuit Emphasis
SRAM Static Random Access Memory
Chapter 1
Introduction
This thesis presents the design of a handcrafted 3.33GHz CML processor datapath.
A RISC architecture has been adopted in designing the multi-cycle processor datapath, and
the ISA is 16 bits wide. It is the first ever MCML processor, and unlike CMOS processors
its power dissipation is constant. Once the design is optimized, power dissipation does not
increase with operating frequency. Optimizing CML logic for a higher operating frequency
requires more power than optimizing for a lower frequency. Therefore, once CML logic has
been optimized for a targeted maximum frequency, operating the circuit at a lower frequency
forfeits the benefit in terms of power.
Because of the larger switching noise associated with CMOS circuits, CML is a better
choice for high speed circuit realization [1]. BJT-based CML gates are faster than MOSFET
CML gates due to higher gm, and consume less power, but they have been avoided because of
process difficulty and because they are uncommon in digital circuit design. Therefore
MOSFET CML (MCML) logic has been used.
1.1 Problem Statement
The problem solved in this thesis: Design a high speed low power processor datapath.
1.2 Background and Motivation
As indicated in Figure 1.1, processor speed has been stuck at 2-3GHz for the last 10 years
due to excessive power consumption [2]. Multi-core design has therefore evolved to
increase performance. But the number of cores cannot be increased beyond a certain limit,
and multi-core operation incurs communication overhead. Further, due to data dependencies,
programs cannot be fully parallelized, which inhibits full exploitation of multi-core design.
Figure 1.1: Operating frequency over time
Intel processor speed versus power data were obtained and plotted in Matlab, as depicted
in Figure 1.2 [3].
[Figure 1.2 plot: processor operating frequency (GHz) against power (W). Labeled data
points: Pentium, Pentium MMX, Pentium II, Pentium III 1400S, Pentium 4 1.3, Pentium 4 506,
Pentium 4 HT 661, Pentium D 820, Pentium Extreme Edition 840, Core 2 Duo E4400,
Core 2 Extreme QX6800, Core i3-560, Core i5-670, Core i7-980X Extreme Edition (6 core).]
Figure 1.2: Operating Frequency vs. Power of Intel Processor
Figure 1.2 shows that processor power consumption increases almost exponentially at
operating frequencies beyond 2GHz. It is also notable that the Core i5 has a higher
operating frequency than the Core i7, but the power consumption of the Core i7 is almost
double due to its higher number of cores. So reducing the number of cores will not only
reduce power but can also result in higher performance due to less data dependency.
Power consumption can also be reduced by moving to deep submicron technology. However,
the main drawback of reducing feature size is that it increases power density per unit
area, and chips cannot sustain that power, as shown in Figure 1.3 [4].
Figure 1.3: Power density per unit area
In CML, power density per unit area is lower because CML requires more area than CMOS
due to its load resistances. Therefore, in CML we can achieve a higher operating frequency
with less power density.
A previous mixed signal superscalar processor was developed in 1997 using a 0.5µm BiCMOS
process and required 3.6V and 2.1V power supplies. The reported operating frequency
was 533MHz, and the design used a PowerPC architecture that contained three pipelines and
a large on-chip secondary cache to achieve a peak performance of 1600 MIPS. The 15mm
x 10mm die contained 2.7M transistors (2M CMOS and 0.7M bipolar) and dissipated less
than 85W [5]. All logic circuits were implemented in three-level emitter coupled logic
(ECL), and only the RAM structures were implemented with CMOS circuits.
Although BJTs have less switching noise and are faster than MOS transistors, they are
expensive and not frequently used in digital circuits [1]. This led to the idea of
designing a high speed processor datapath using MCML logic. Further, CMOS SRAM cannot
operate at 3.33GHz using 0.12µm technology. Therefore only the high speed datapath was
developed in this thesis.
1.3 Contribution
CML logic architecture has been discussed in the literature, but it is not guaranteed
to realize digital functions unless optimization has been performed [6]-[8]. Optimized
CML logic can steer the full bias current for different input combinations, resulting in a
full voltage swing that differentiates the logic states. Also, the technology files provide
only transistors, unlike digital technology files, which provide optimized CMOS logic gates.
Therefore, due to the lack of analog synthesizer tools, the CML logic architecture must be
handcrafted and then optimized so that it realizes a digital function at a targeted
frequency. All the basic components were derived first, with a maximum operating frequency
of 12GHz using 130nm CMOS technology. The bias current was chosen to be 70% of the peak ft
current, which gives the highest possible operating frequency without burning out the
transistors during operation. Further, biasing CML gates at 70% of peak ft or less may
incur a 10% propagation delay penalty but can save more than 40% power [1], [7]. The logic
swing was set to 600mV, on the assumption that at this frequency it is advantageous over
CMOS logic, which has a fixed swing of 1V. These basic CML logic designs were later used in
realizing the 16-bit 3.33GHz datapath components. All the processor components (mux,
register file, ALU) were deliberately handcrafted in Cadence Virtuoso. The reported static
power consumption of the multi-cycle CML processor is 41.264W and the power density per
unit area is 5.44µW/µm² = 544W/cm², below traditional CMOS processor power density per
unit area, as indicated in Figure 1.3. The estimated performance of the multi-cycle CML
processor datapath is 892 MIPS and the expected chip area is 2.2mm x 3.45mm.
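The reported power density can be cross-checked from the quoted power and die dimensions. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the implied cycles-per-instruction figure is an inference from the usual relation MIPS = f_clk / (CPI x 10^6), not a number stated in the text.

```python
# Figures quoted in the text.
POWER_W = 41.264        # static power of the datapath (W)
DIE_WIDTH_MM = 2.2      # expected chip width (mm)
DIE_HEIGHT_MM = 3.45    # expected chip height (mm)
CLOCK_HZ = 3.33e9       # operating frequency (Hz)
MIPS = 892.0            # estimated performance


def power_density_uw_per_um2(power_w, width_mm, height_mm):
    """Power density in uW/um^2 (numerically equal to W/mm^2)."""
    area_um2 = (width_mm * 1e3) * (height_mm * 1e3)
    return power_w * 1e6 / area_um2


def power_density_w_per_cm2(power_w, width_mm, height_mm):
    """Power density in W/cm^2."""
    area_cm2 = (width_mm / 10.0) * (height_mm / 10.0)
    return power_w / area_cm2


def implied_cycles_per_instruction(clock_hz, mips):
    """Assumption: MIPS = f_clk / (CPI * 1e6), so CPI = f_clk / (MIPS * 1e6)."""
    return clock_hz / (mips * 1e6)


density_um = power_density_uw_per_um2(POWER_W, DIE_WIDTH_MM, DIE_HEIGHT_MM)
density_cm = power_density_w_per_cm2(POWER_W, DIE_WIDTH_MM, DIE_HEIGHT_MM)
cpi = implied_cycles_per_instruction(CLOCK_HZ, MIPS)
print(f"{density_um:.2f} uW/um^2, {density_cm:.0f} W/cm^2, implied CPI about {cpi:.2f}")
```

Under these inputs the density rounds to the 5.44µW/µm² and 544W/cm² quoted above, and the implied average is a little under four cycles per instruction, which is plausible for a multi-cycle datapath.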
This work has been accepted at IEEE International Symposium On Circuits and Systems
(ISCAS) 2012 Conference [9].
1.4 Organization
The thesis is organized as follows: Chapter 2 provides background and describes
high speed CML logic realization. Chapter 3 introduces the datapath design. Chapter 4
describes CML datapath component realization and verification. Chapter 5 presents processor
verification, performance and comparison. Chapter 6 draws conclusions and discusses future
work.
Chapter 2
Background and High Speed CML Logic Realization
Recently, interest in high speed digital circuits based on MCML/BJT-CML has been
increasing due to their low power consumption [10]. Noise coupling between digital circuitry
and sensitive analog blocks has always been a major obstacle in complete system on chip
(SoC) design [11]. MOS current mode logic (MCML) is a promising alternative to conventional
MOS in mixed signal applications. Many efforts have been made to realize the potential of
MCML [12]-[17]. Even though MCML has been shown to dissipate less power than CMOS at
operating frequencies above 300MHz, designers have been reluctant to adopt MCML in place
of CMOS [14]. The high complexity of MCML and the lack of automation tools made it
impossible to produce robust and power-efficient designs while maintaining low cost and
reasonable time to market.
Figure 2.1: Current consumptions for CMOS vs. CML logic
Figure 2.1 shows typical current consumption for CMOS and CML logic [18]. As
indicated, CMOS is typically beneficial at lower frequencies, whereas CML takes less power
at higher operating frequencies. Figure 2.1 is based on assumption rather than circuit
measurements and is not drawn to scale. It is supposed that CML is typically more
beneficial than CMOS beyond 1GHz.
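The crossover suggested by Figure 2.1 can be illustrated with two first-order power models: CMOS switching power that grows linearly with frequency, and CML power fixed by its tail current. The sketch below assumes these textbook models, and every numeric value in it is hypothetical, chosen only to make the crossover visible; none is taken from the thesis.

```python
def cmos_dynamic_power(activity, c_switched_f, vdd_v, freq_hz):
    """First-order CMOS switching power: P = a * C * Vdd^2 * f."""
    return activity * c_switched_f * vdd_v ** 2 * freq_hz


def cml_static_power(vdd_v, i_bias_a):
    """CML power is set by the constant tail current: P = Vdd * Ibias."""
    return vdd_v * i_bias_a


def crossover_frequency(activity, c_switched_f, vdd_cmos_v, vdd_cml_v, i_bias_a):
    """Frequency at which the two first-order power models are equal."""
    return cml_static_power(vdd_cml_v, i_bias_a) / (
        activity * c_switched_f * vdd_cmos_v ** 2
    )


# Hypothetical illustration values, not taken from the thesis.
ACTIVITY = 1.0
C_SWITCHED_F = 0.5e-12
VDD_CMOS_V = 1.2
VDD_CML_V = 1.2
I_BIAS_A = 1.0e-3

f_cross = crossover_frequency(ACTIVITY, C_SWITCHED_F, VDD_CMOS_V, VDD_CML_V, I_BIAS_A)
print(f"crossover near {f_cross / 1e9:.2f} GHz for these assumed values")
```

Below the crossover the CMOS model dissipates less; above it the constant CML power is lower. Where the crossover actually falls depends on the real switched capacitance and bias current.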
[Figure 2.2 plot: power (mW) against frequency (GHz) for two curves, CMOS divide-by-2
power dissipation and CML divide-by-2 power dissipation.]
Figure 2.2: CMOS vs. CML power consumption
A divide-by-two circuit was realized in CMOS (W = 160nm, 160nm, 200nm, 600nm,
1µm, 1.5µm, 2.5µm, 3µm, 7µm and 10µm for 0.5, 1, 1.2, 1.4, 1.5, 1.6, 1.8, 2, 2.3 and
2.5GHz respectively) and in CML (W = 160nm, 160nm, 160nm, 200nm, 250nm and 300nm, each
with Wtail = 120nm, for 0.5, 1, 2, 2.5, 3 and 4GHz respectively). The channel length
L = 120nm was fixed in both cases, and the power has been plotted in Figure 2.2. At 2.5GHz
the CMOS D flip-flop propagation delay reaches 50% of the input clock cycle, whereas the
CML D flip-flop was able to generate correct output up to 6GHz. In Figure 2.2, CML clearly
dominates over CMOS beyond 1.5GHz. The benefits of the MCML circuit topology over CMOS
are largely independent of technology [6].
Many successful attempts have been made to expose the relationships between the
MCML gate delay and the various design parameters [19]-[21]. These efforts have provided
insight into the design considerations, which are described briefly here for CML inverters
and universal CML logic. There is no straightforward method to optimize CML logic, and an
accurate propagation delay model for CML logic is very hard to derive with pen and paper
due to higher order effects. Therefore, we will rely on some approximations in the later
subsections and compare them with the simulation results in designing high speed CML gates.
2.1 CML Inverter
Figure 2.3: CML Inverter
Figure 2.3 shows a CML inverter, which is typically a differential pair. There exists a
particular biasing voltage Vin (quiescent point) for which Id1 = Id2 = Ibias/2. For the
differential input, which is our logic swing ΔV, it is necessary to tune the circuit such
that when logic-1 (Vin+ΔV/2) is applied to A+ and logic-0 (Vin-ΔV/2) is applied to A-, T1
turns on and T2 turns off, resulting in Id1 = Ibias and Id2 = 0, steering all the current
through T1. A summary of CML inverter operation is given in Table 2.1.
A+          A-          T on    T off   Out+        Out-
Vin-ΔV/2    Vin+ΔV/2    2, 3    1       Vin+ΔV/2    Vin-ΔV/2
Vin+ΔV/2    Vin-ΔV/2    1, 3    2       Vin-ΔV/2    Vin+ΔV/2
Table 2.1: Inverter Operation
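Table 2.1 can be read as a small behavioral model. The sketch below is an idealized full-steering view, assuming the supply and swing values quoted elsewhere in the thesis (Vdd = 2.8V, ΔV = 600mV); the function name and return format are illustrative only.

```python
VDD = 2.8                 # supply voltage quoted in the text (V)
SWING = 0.6               # logic swing, delta V (V)
V_IN = VDD - SWING / 2    # quiescent level, chosen so that logic-1 equals VDD


def cml_inverter(a_plus, a_minus):
    """Idealized full-steering model of Table 2.1.

    Returns (out_plus, out_minus, conducting_transistor). T3, the current
    source, conducts in both cases and is not listed.
    """
    if a_plus == a_minus:
        raise ValueError("inputs must be differential")
    high = V_IN + SWING / 2
    low = V_IN - SWING / 2
    if a_plus > a_minus:
        # Second row of Table 2.1: T1 carries Ibias; Out+ is low, Out- is high.
        return low, high, "T1"
    # First row of Table 2.1: T2 carries Ibias; Out+ is high, Out- is low.
    return high, low, "T2"


logic_1, logic_0 = V_IN + SWING / 2, V_IN - SWING / 2
print(cml_inverter(logic_1, logic_0))
print(cml_inverter(logic_0, logic_1))
```

The model ignores the transition region entirely; it only captures which side of the pair carries the tail current once the input swing is large enough.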
2.1.1 CML Inverter Optimization
If our supply voltage is Vdd, as indicated in Figure 2.3, and the output of the CML gate
drives another gate, then we must maintain:
\begin{align*}
\text{Logic-1} &= V_{dd} = V_{in} + \Delta V/2 \\
\text{Logic-0} &= V_{dd} - \Delta V = V_{in} - \Delta V/2
\end{align*}
Let us assume logic-1 (Vin+ΔV/2) is applied to A+ and logic-0 (Vin-ΔV/2) is applied to
A-. Then, to keep T1 in its forward-active region and T2 in its cut-off region, the
necessary conditions are:
\begin{align*}
V_{gs1} &> V_{th1}; \quad V_{gd1} < V_{th1}; \quad V_{ds1} > V_{gs1} - V_{th1} \\
V_{gs2} &< V_{th2}; \quad V_{gd2} < V_{th2}
\end{align*}
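If we assume that the magnitude of the voltage swing is just large enough to steer all the
current from one side to the other, then ΔV = ΔVmin, which is the minimum swing required
(just large enough to turn off T2):
\begin{align*}
V_{gs2} &= V_{th2} \\
V_{g2} - V_x &= V_{th2} \\
V_{in} - \Delta V_{min}/2 - V_x &= V_{th2} \\
V_{dd} - \Delta V_{min}/2 - \Delta V_{min}/2 - V_x &= V_{th2} \\
V_x &= V_{dd} - \Delta V_{min} - V_{th2} \tag{2.1} \\
V_{gs1} &= V_{g1} - V_x \\
V_{gs1} &= V_{in} + \Delta V_{min}/2 - V_x \\
V_{gs1} &= V_{dd} - V_x \tag{2.2} \\
I_{bias} &= \frac{\mu_n C_{ox}}{2}\,\frac{W}{L}\,(V_{gs1} - V_{th1})^{\alpha} \tag{2.3}
\end{align*}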
where µn is the electron mobility in the nMOS device and Cox is the gate oxide capacitance
per unit area, Cox = εox/tox. The permittivity of SiO2 is εox = 3.9ε0, where the
permittivity of free space is ε0 = 8.85×10^-14 F/cm and tox is the gate oxide thickness.
For a short channel device, such as one with 0.12µm feature size, α ≃ 1.25. To ease our
pen and paper calculations, we will sometimes assume α is 2.
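For matching purposes, T1 and T2 must be the same size so that an equal amount of
current flows on both sides at the quiescent point. Therefore, with Vth1 = Vth2,
Equation 2.1 yields
\begin{align*}
V_{th1} &= V_{dd} - \Delta V_{min} - V_x \\
V_{gs1} - V_{th1} &= V_{dd} - V_x - V_{dd} + \Delta V_{min} + V_x \\
&= \Delta V_{min} \tag{2.4} \\
I_{bias} &= \frac{\mu_n C_{ox}}{2}\,\frac{W}{L}\,(\Delta V_{min})^{\alpha} \tag{2.5}
\end{align*}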
Therefore, Ibias can be expressed as a function of the minimum logic swing ΔVmin, with
Ibias ∝ (ΔVmin)^α. This means that the smaller the logic swing we define, the less bias
current we need to realize a CML inverter. The minimum current required to realize this
logic swing occurs when W is minimum. We could also realize the full swing at a higher
bias current, but we would burn more power. The minimum width Wn that provides just
enough (the smallest) bias current to realize full swing satisfies:
\[
I_L = \frac{\mu_n C_{ox}}{2}\,\frac{W_n}{L}\,(\Delta V_{min})^{\alpha} \tag{2.6}
\]
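The dependence of bias current on swing in Equation 2.5 can be made concrete with a ratio, which avoids needing values for the mobility, oxide capacitance or W/L. This is a minimal sketch assuming everything except the swing is held fixed; the function name is illustrative.

```python
def bias_current_ratio(swing_a_v, swing_b_v, alpha):
    """Ratio Ibias(a) / Ibias(b) from Equation 2.5 with all other terms fixed."""
    return (swing_a_v / swing_b_v) ** alpha


# Doubling the minimum swing from 300 mV to 600 mV:
ratio_short_channel = bias_current_ratio(0.6, 0.3, alpha=1.25)  # alpha quoted for 0.12 um
ratio_square_law = bias_current_ratio(0.6, 0.3, alpha=2.0)      # pen and paper assumption
print(f"alpha = 1.25: x{ratio_short_channel:.2f}, alpha = 2: x{ratio_square_law:.2f}")
```

Under the square-law assumption, doubling the swing quadruples the required bias current, while with α ≃ 1.25 the increase is closer to a factor of 2.4. This is one way to see how much the square-law simplification can overstate the current cost of a larger swing.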
Again, let us consider a CML inverter circuit at its quiescent point, as indicated in
Figure 2.3. In determining our minimum logic swing ΔVmin we need to consider how we
are going to forward bias the transistors, that is, what our overdrive should be. Typically,
at an overdrive voltage of about 300mV the transistor reaches soft saturation, and at about
500mV it reaches hard saturation. The overdrive voltage is defined as Vov = Vgs - Vt. It is
necessary to determine the overdrive voltage because, if we apply logic-1 at A+ and
logic-0 at A- to turn off T2 while T2 has already reached velocity saturation at the
quiescent point, then ΔV/2 is too small to turn T2 off completely. The consequence is that
we cannot realize full swing. Making our logic swing larger can turn T2 off, but we end up
with a greater RC constant and a higher propagation delay. Therefore, we need an overdrive
voltage that pushes the transistor only slightly into the forward active region. After
careful consideration, Vov = Vgs - Vt = 263.9mV was chosen. It is therefore necessary to
expose the relation between logic swing and overdrive voltage. For an input voltage Vg1 at
the gate terminal of T1,
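\begin{align*}
V_{gs1} &= V_{g1} - V_x \\
V_{g1} &= V_{gs1} + V_x \\
V_{g2} &= V_{gs2} + V_x
\end{align*}
For the differential input voltage, which is our logic swing,
\begin{align*}
\Delta V &= V_{g1} - V_{g2} \\
\Delta V &= V_{gs1} - V_{gs2}
\end{align*}
Taking α = 2 for simplicity,
\begin{align*}
I_{d1} &= \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,(V_{gs1} - V_t)^2 \tag{2.7} \\
I_{d2} &= \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,(V_{gs2} - V_t)^2 \\
\sqrt{I_{d1}} - \sqrt{I_{d2}} &= \Delta V \sqrt{\tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}} \\
I_{d1} + I_{d2} - 2\sqrt{I_{d1} I_{d2}} &= \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,\Delta V^2 \\
I_{bias} - 2\sqrt{I_{d1} I_{d2}} &= \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,\Delta V^2 \\
2\sqrt{I_{d1} I_{d2}} &= I_{bias} - \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,\Delta V^2 \\
2\sqrt{I_{d1}\,(I_{bias} - I_{d1})} &= I_{bias} - \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,\Delta V^2 \\
-4 I_{d1}^2 + 4 I_{d1} I_{bias} &= I_{bias}^2 - I_{bias}\,\mu C_{ox}\tfrac{W}{L}\,\Delta V^2
  + \tfrac{1}{4}\left(\mu C_{ox}\tfrac{W}{L}\right)^2 \Delta V^4
\end{align*}
Writing K = µCox(W/L),
\begin{align*}
-4 I_{d1}^2 + 4 I_{d1} I_{bias} &= I_{bias}^2 - I_{bias} K \Delta V^2 + \tfrac{1}{4} K^2 \Delta V^4 \\
4 I_{d1}^2 - 4 I_{d1} I_{bias} + I_{bias}^2 - I_{bias} K \Delta V^2 + \tfrac{1}{4} K^2 \Delta V^4 &= 0 \\
I_{d1} &= \frac{4 I_{bias} \pm \sqrt{16 I_{bias}^2 - 16 I_{bias}^2 + 16 I_{bias} K \Delta V^2 - 4 K^2 \Delta V^4}}{8} \\
I_{d1} &= \frac{2 I_{bias} \pm \Delta V \sqrt{4 I_{bias} K - K^2 \Delta V^2}}{4} \\
I_{d1} &= \frac{I_{bias}}{2} \pm \frac{1}{4}\,\Delta V \cdot 2\sqrt{I_{bias} K}\,\sqrt{1 - \frac{K \Delta V^2}{4 I_{bias}}} \\
I_{d1} &= \frac{I_{bias}}{2} \pm \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}}
\end{align*}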
As logic-1 was applied to A+ and logic-0 to A-, the differential input Vid > 0, resulting
in Id1 > Id2:
\begin{align*}
I_{d1} &= \frac{I_{bias}}{2} + \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}} \tag{2.8} \\
I_{d2} &= \frac{I_{bias}}{2} - \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}} \tag{2.9}
\end{align*}
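At the biasing (quiescent) point, Vid = 0, resulting in Id1 = Id2 = Ibias/2. Therefore
Equation 2.7 yields
\begin{align*}
I_{bias}/2 &= \tfrac{1}{2}\,\mu C_{ox}\tfrac{W}{L}\,(V_{gs1} - V_t)^2 \\
I_{bias}/2 &= (K/2)\,V_{ov}^2 \\
K &= I_{bias}/V_{ov}^2
\end{align*}
Plugging this value of K into Equations 2.8 and 2.9 we get
\begin{align*}
I_{d1} &= \frac{I_{bias}}{2} + \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \tag{2.10} \\
I_{d2} &= \frac{I_{bias}}{2} - \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \tag{2.11}
\end{align*}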
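Now if we want to steer the full current through T1, resulting in Id1 = Ibias and Id2 = 0,
then Equation 2.10 yields
\begin{align*}
I_{bias} &= \frac{I_{bias}}{2} + \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \\
\frac{I_{bias}}{2} &= \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \\
1 &= \frac{\Delta V}{V_{ov}}\sqrt{1 - \left(\frac{\Delta V}{2 V_{ov}}\right)^2} \\
1 &= \left(\frac{\Delta V}{V_{ov}}\right)^2 - \frac{1}{4}\left(\frac{\Delta V}{V_{ov}}\right)^4 \\
1 &= \left(\frac{\Delta V}{V_{ov}}\right)^2 \left(1 - \frac{\Delta V^2}{4 V_{ov}^2}\right) \\
\left(\frac{V_{ov}}{\Delta V}\right)^2 &= \frac{4 V_{ov}^2 - \Delta V^2}{4 V_{ov}^2} \\
4 V_{ov}^4 &= 4 V_{ov}^2\,\Delta V^2 - \Delta V^4 \\
\left(2 V_{ov}^2 - \Delta V^2\right)^2 &= 0 \\
\Delta V &= \pm\sqrt{2}\,V_{ov} \tag{2.12}
\end{align*}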
This means that at a swing of ΔV = +√2·Vov one of the transistors turns fully on, pushing
the other one off, and at ΔV = -√2·Vov the roles are reversed, as indicated in Figure 2.4.
Figure 2.4: Normalized current for CML inverter
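The normalized steering curve can be evaluated directly from Equations 2.10 and 2.11. The sketch below assumes the square-law model (α = 2) used in the derivation and clamps the currents outside the range |ΔV| ≤ √2·Vov, where the equations no longer apply; the function name is illustrative.

```python
import math


def steering_currents(delta_v, v_ov, i_bias=1.0):
    """Drain currents (Id1, Id2) of the differential pair, Equations 2.10 and 2.11.

    Square-law (alpha = 2) model. Outside |delta_v| <= sqrt(2) * v_ov the
    equations no longer apply, so the currents are clamped to full steering.
    """
    limit = math.sqrt(2.0) * v_ov
    if delta_v >= limit:
        return i_bias, 0.0
    if delta_v <= -limit:
        return 0.0, i_bias
    x = (delta_v / 2.0) / v_ov
    offset = (delta_v / 2.0) * (i_bias / v_ov) * math.sqrt(1.0 - x * x)
    return i_bias / 2.0 + offset, i_bias / 2.0 - offset


V_OV = 0.2639  # overdrive voltage chosen in the text (V)
for dv in (-0.4, -0.3, 0.0, 0.3, 0.4):
    id1, id2 = steering_currents(dv, V_OV)
    print(f"dV = {dv:+.1f} V: Id1/Ibias = {id1:.3f}, Id2/Ibias = {id2:.3f}")
```

With `i_bias` left at 1.0 the outputs are already normalized to the bias current, so the two values always sum to one and cross at 0.5 when ΔV = 0.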
At the biasing point shown earlier, Vov = 263.9mV, therefore our minimum logic swing
from the biasing point will be ΔV/2 = ±√2×263.9mV = ±373.21mV. Figure 2.4 shows that at
+300mV from the biasing point T1 passes all the bias current, and at -300mV from the
biasing point T2 turns on completely, resulting in a total logic swing ΔV = 600mV. The 73mV
discrepancy occurs because this is a short channel device and α is not 2, but rather close
to 1.25. Equation 2.12 can be extended further by expressing Vov in terms of bias current
and W.
\begin{align*}
\Delta V &= \pm\sqrt{2}\,V_{ov} \\
\Delta V &= \pm\sqrt{2}\,\sqrt{\frac{I_{bias}}{K}} \\
\Delta V &= \pm\sqrt{\frac{2 I_{bias}}{\mu C_{ox}\frac{W}{L}}} \tag{2.13}
\end{align*}
This means that the larger the W/L ratio, the smaller the input swing needed to switch, so the
faster we can move through the switching region. But making W larger increases parasitic
capacitance, and at higher operating frequency the RC constant will dominate.
In order to achieve faster full swing, ΔV = Ibias * RL, we can increase bias current to
reduce RL. Bias current was tuned to 1.65mA for W=7μm and L=120nm, which is 70% of
peak ft current, and load resistance RL was tuned to 348Ω.
The input gate capacitance of each CML basic component is Cgg = 8fF. It is stated that
wire delay is 0.10ps/μm [22]. Assuming a 100μm wire is required to connect the next stage,
10ps of delay should be added, and typically 1fF is necessary to mimic 1ps of delay. Therefore
10+8 = 18fF was added as output load capacitance (CL).
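The load capacitance arithmetic above can be written out as a small sketch. The function name `load_capacitance_ff` and the constant names are hypothetical; the numbers are the ones quoted in the text (0.10ps/μm wire delay, 1fF per 1ps, Cgg = 8fF).

```python
WIRE_DELAY_PS_PER_UM = 0.10  # wire delay per micrometre, from the text [22]
FF_PER_PS = 1.0              # rule of thumb from the text: 1 fF mimics 1 ps of delay
CGG_FF = 8.0                 # input gate capacitance of the next stage


def load_capacitance_ff(wire_length_um, cgg_ff=CGG_FF):
    """Load capacitance (fF) that mimics a wire of the given length plus next-stage Cgg."""
    wire_delay_ps = wire_length_um * WIRE_DELAY_PS_PER_UM
    wire_cap_ff = wire_delay_ps * FF_PER_PS
    return wire_cap_ff + cgg_ff


if __name__ == "__main__":
    print(f"CL for a 100 um wire: {load_capacitance_ff(100.0):.1f} fF")
```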
Propagation delay, Pd = 0.69RC, where C is the capacitance looking into the drain of
T1/T2.
\begin{align*}
P_d &= 0.69\,R_L\,(C_{intr} + C_L) \\
    &= 0.69\,R_L\,(C_{gd1} + C_{db1} + C_L) \\
    &= 0.69\,\frac{\Delta V}{I_{bias}}\,(C_{gd1} + C_{db1} + C_L) \tag{2.14}
\end{align*}
Figure 2.5: CML inverter half-circuit small-signal model
Here Cintr is the intrinsic capacitance, which equals the gate to drain (Cgd1) plus drain
to bulk (Cdb1) capacitance as shown in Figure 2.5. Equation 2.14 means that the higher the
bias current, the lower the propagation delay. But the bias current is increased by increasing W,
and at higher frequency further increasing the bias current will not reduce propagation delay
because parasitic capacitance will dominate. Hence, the optimized widths obtained are 7μm for
the upper level transistors and 4μm for the current source, all having channel length =
120nm, with bias current Ibias=1.65mA.
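To make the trade-off in equation 2.14 concrete, the following sketch evaluates the delay expression for a few bias currents. It is only an illustration under stated assumptions: the function name `propagation_delay_s` is hypothetical, the intrinsic capacitance value `C_INTR` is a placeholder that is not taken from this work, and the capacitances are held fixed, which ignores the growth of parasitics with W described above.

```python
def propagation_delay_s(delta_v, i_bias, c_intr, c_load):
    """Eq. 2.14: Pd = 0.69 * (dV / Ibias) * (Cintr + CL), all in SI units."""
    r_load = delta_v / i_bias
    return 0.69 * r_load * (c_intr + c_load)


DELTA_V = 0.6      # V, logic swing from the text
I_BIAS = 1.65e-3   # A, bias current from the text
C_LOAD = 18e-15    # F, load capacitance from the text
C_INTR = 10e-15    # F, HYPOTHETICAL placeholder for Cgd1 + Cdb1

if __name__ == "__main__":
    for scale in (0.5, 1.0, 2.0):
        pd = propagation_delay_s(DELTA_V, scale * I_BIAS, C_INTR, C_LOAD)
        print(f"Ibias = {scale * I_BIAS * 1e3:.3f} mA -> Pd = {pd * 1e12:.2f} ps")
```

With fixed capacitances the delay is inversely proportional to the bias current; in the real device Cintr rises with W, which is why the benefit saturates.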
All these intuitive design considerations were included in designing CML logic circuits
that can achieve a maximum operating frequency of 12GHz with 0.12μm feature size. The supply
voltage used was Vdd=2.8V, with logic-1 = 2.8V, logic-0 = 2.2V, a constant biasing voltage Vg =
1V to bias the current source, and logic swing ΔV = 600mV.
Figure 2.6 shows post vs. pre layout simulation of a CML inverter operating at 12GHz
(input changes at 83ps) with 18fF input/output load capacitance. The rising delay of the inverter
in circuit level simulation is 17.47ps and the extracted layout delay is 24.79ps. The power
consumption of the CML inverter is P = Vdd*Ibias = 2.8V * 1.65mA = 4.62mW, which is
constant. Operating this optimized CML inverter at 1MHz and at 12GHz will dissipate the
same amount of power. Therefore power consumption is independent of operating frequency.
Figure 2.6 also shows that layout simulation differs from circuit level simulation by
7.32ps. Layout of the CML inverter was also performed and the reported area is 15.960μm
x 24.450μm, as indicated in Figure 2.7.
Figure 2.6: Post vs. pre layout simulation of CML inverter with 18fF input/output load
capacitance (input changes at 83ps)
Figure 2.7: CML inverter layout
2.2 CML Universal Gate
A universal CML gate that can realize AND, NAND, OR and NOR functions is shown in
Figure 2.8. With the input logic combination indicated in Figure 2.8, the
universal gate realizes an AND function. Reversing the output realizes a NAND
function. Reversing all the inputs and outputs realizes an OR function, and thus a NOR
can be realized as well.
Figure 2.8: Universal CML gate
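The way one differential AND cell yields all four functions can be illustrated with a behavioural sketch. This models only the logic, not the circuit; the helper names (`diff`, `swap`, `cml_and`, and so on) are hypothetical, and a differential signal is represented as a (plus, minus) pair.

```python
from itertools import product


def diff(bit):
    """Encode a logic value as a differential pair (plus, minus)."""
    return (bool(bit), not bool(bit))


def swap(pair):
    """Reverse a differential pair, i.e. exchange the + and - wires."""
    plus, minus = pair
    return (minus, plus)


def cml_and(a, b):
    """Behavioural AND on differential pairs; returns (Out+, Out-)."""
    out_plus = a[0] and b[0]
    return (out_plus, not out_plus)


def cml_nand(a, b):
    return swap(cml_and(a, b))              # reverse the output


def cml_or(a, b):
    return swap(cml_and(swap(a), swap(b)))  # reverse inputs and output


def cml_nor(a, b):
    return cml_and(swap(a), swap(b))        # reverse inputs only


if __name__ == "__main__":
    print("A B AND NAND OR NOR")
    for x, y in product((0, 1), repeat=2):
        a, b = diff(x), diff(y)
        print(x, y, int(cml_and(a, b)[0]), int(cml_nand(a, b)[0]),
              int(cml_or(a, b)[0]), int(cml_nor(a, b)[0]))
```

Because inversion in a differential circuit is just a wire swap, the OR and NOR cases follow from De Morgan's laws at no extra hardware cost.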
Careful consideration was made to choose Vin=2.5V and ΔV=600mV, leading to
Logic-1=2.8V and Logic-0=2.2V as shown earlier in the CML inverter section. The biasing conditions
are the same as before. If we assume Logic-1 is applied to A+ and Logic-0 is applied
to A-, then
Vgs1 > Vth1; Vgd1 < Vth1; Vds1 > Vgs1 - Vth1
Vgs2 < Vth2; Vgd2 < Vth2
For input B, the levels have to be at least VDS lower, such that the conditions VGD3
< Vth and VGD4 < Vth are met when T3 and T4 operate (are in the forward active region).
Therefore logic-1 for input B is Vin + ΔV/2 - VDS and logic-0 is Vin - ΔV/2 - VDS.
Due to the discrepancy of logic levels for the second input, a level shifter is necessary
and has been embedded within each gate as indicated in Figure 2.9. The previous power
of the CML gate was Vdd*Ibias and now with the level shifter it is Vdd*3Ibias; total power
increased by a factor of 3.
Figure 2.9: Universal CML gate with embedded level shifter
T7 and T8 have to be the same size as T1 and T2 for matching purposes, such that
T7/T8 can mimic the voltage drop VDS of T1/T2. To balance the differential architecture,
T5 is necessary with T4, because T3 sees either T1 or T2 as a load; T5 is also used for
preventing breakdown. T6, T9 and T10 are of the same size, and a constant biasing voltage (Vg)
is applied to each so that it acts as a current source.
There are three paths in the universal CML gate, and it is necessary to make sure the total
bias current (Ibias) flows in exactly one of these paths from Vdd to Vss, depending on the input
logic combination. Splitting of Ibias is not acceptable because full swing realization will not be possible.
Table 2.2 shows the path that will be turned on, depending on the input logic combination,
and the realization of the CML AND function. It is notable that the entry at B+/B- is the
unshifted logic level. It is applied to input B at the level shifter, where it is shifted down
by an amount VDS before propagating to input B of the CML AND. Table 2.2 indicates that the
lowest level (2nd level, input B) dominates in path selection.
A+       A-       B+       B-       Out+     Out-     T off     T on      Path
Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      1, 2, 3   5, 4, 6   T5, T4, T6
Vdd-ΔV   Vdd      Vdd      Vdd-ΔV   Vdd-ΔV   Vdd      1, 4, 5   2, 3, 6   T2, T3, T6
Vdd      Vdd-ΔV   Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      1, 2, 3   5, 4, 6   T5, T4, T6
Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   2, 4, 5   1, 3, 6   T1, T3, T6
Table 2.2: CML AND Operation
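The path selection in Table 2.2 can be summarized as a behavioural sketch in which the lower-level input B is examined first. The function name `and_gate_path` is hypothetical; logic values stand in for the voltage levels (1 for Vdd, 0 for Vdd-ΔV), and only the transistor names listed in the table are used.

```python
def and_gate_path(a, b):
    """Return (conducting path, Out+ logic value) for the CML AND of Table 2.2.

    a and b are the logic values on A+ and B+ (1 = Vdd, 0 = Vdd - dV).
    """
    if not b:
        # The lowest level dominates: with B low the current bypasses
        # the A pair entirely, whatever A is.
        path = ("T5", "T4", "T6")
    elif a:
        path = ("T1", "T3", "T6")
    else:
        path = ("T2", "T3", "T6")
    return path, int(bool(a) and bool(b))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            path, out = and_gate_path(a, b)
            print(f"A={a} B={b} -> path {', '.join(path)}  Out+={out}")
```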
2.2.1 Universal CML Gate Optimization
When a path turns on, it is necessary to steer all the bias current (Ibias) to that path,
as stated before, to realize the full logic swing, ΔV = RL * Ibias. The more bias current we
have, the faster the swing is realized. Intuitive techniques to optimize the CML gate have
been shown in the CML inverter optimization subsection. It was found that, using 130nm CMOS
technology, the highest operating frequency we can obtain for the CML inverter is 12GHz with
upper transistors of W=7μm, L=120nm and a bias current of 1.65mA, for which the current source size
is W=4μm, L=120nm. As this operating frequency cannot be exceeded for the chosen parameters
(logic swing ΔV=600mV, supply voltage Vdd=2.8V, logic-1=2.8V, logic-0=2.2V), the
same size has been used for the upper transistors in the CML universal gate. But because this
gate has a second level of input, which contributes more parasitic capacitance, the bias current has
been increased slightly from 1.65mA to 1.753mA (for which the current source size is W=7μm,
L=120nm) to operate at 12GHz.
For the upper level (and 2nd level) transistors, around Vov=200mV has been provided such
that a 300mV swing from the biasing point can turn the transistors off and on. It should
be noted that at the quiescent point Ir1 ≠ Ir2, as indicated in Figure 2.10. The reason is that Out+
has 2 paths at the Q point but Out- has 1 path to Vss. Therefore, typically Ir1 = (1/3)Ir2, as
observed in Figure 2.10. It is shown that Ir1 = 517.5μA and Ir2 = 1.238mA out of the total
bias current of 1.753mA. Therefore, initially, Out+ tends to be close to logic-0 (2.2V), at
2.371V, and Out- tends to be close to logic-1 (2.8V), at 2.62V, at the Q point. The
normalized current plot of the universal CML gate in Figure 2.10 shows that Id1 ≠ Id2 at the
biasing point. But full current steering was made possible at a -300mV to +300mV swing,
which is our objective.
Figure 2.10: Normalized current for CML universal gate
Figure 2.11: Post vs. pre layout simulation of CML AND with 18fF input/output load
capacitance (input changes at 83ps)
Figure 2.11 shows post vs. pre layout simulation of the CML AND gate at 12GHz
(input changes at 83ps). A load capacitance CL=18fF was added to mimic the next stage
input capacitance, Cgg=8fF, plus 10fF to mimic a 100μm wire connecting the next stage, as for
the CML inverter [22]. The circuit level simulation shows a rising delay of 35.6ps whereas
the extracted layout simulation has a rising delay of 37.6ps. The bias current is Ibias=1.753mA and the 2
other current sources supply 2.019mA each, for a total power consumption of 2.8V * (1.753
+ 2.019 + 2.019)mA = 16.2148mW.
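Since the static power of a CML gate is simply the supply voltage times the sum of its constant branch currents, the figures quoted so far can be reproduced with a short sketch. The function name `static_power_mw` is hypothetical; the currents are those stated in the text.

```python
VDD = 2.8  # V, supply voltage from the text


def static_power_mw(branch_currents_ma, vdd=VDD):
    """Static power in mW for a gate whose current sources draw the given currents (mA)."""
    return vdd * sum(branch_currents_ma)


if __name__ == "__main__":
    print(f"Inverter: {static_power_mw([1.65]):.4f} mW")
    print(f"AND gate: {static_power_mw([1.753, 2.019, 2.019]):.4f} mW")
```

No frequency term appears in the expression, which is the frequency independence noted for the CML inverter.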
Figure 2.12: CML AND layout
The reported area of the CML AND gate is 30.150μm x 42.900μm, as indicated in Figure
2.12, and this area should be the same for the CML NAND/OR/NOR. Layouts of all these gates
have been performed and simulated but are not included, because their architectures
do not differ from the CML AND except for the reversal of inputs or outputs.
2.3 CML XOR/XNOR Gate
A CML XOR gate is shown in Figure 2.13 with an embedded level shifter. Table 2.3
briefly describes CML XOR operation and paths.
Logic-1 is represented by Vin+ΔV/2 = 2.8V and logic-0 by Vin-ΔV/2 = 2.2V, where Vin
= 2.5V is the biasing (quiescent) voltage and ΔV=600mV is the logic swing. The bias current is
Ibias=1.757mA and two other current sources supply 2.019mA each, for a total power
consumption of 2.8V * (1.757+2.019+2.019)mA = 16.226mW. Power consumption of the
CML XOR is thus essentially the same as that of the CML AND gate (16.226mW versus
16.2148mW), which is impractical in CMOS realization but possible in CML architecture.
Figure 2.13: CML XOR gate
A+   A-   B+   B-   Out+   Out-   T on      T off
0    1    0    1    0      1      7, 5, 1   2, 3, 4, 6
0    1    1    0    1      0      7, 5, 2   1, 3, 4, 6
1    0    0    1    1      0      7, 6, 4   1, 2, 3, 5
1    0    1    0    0      1      7, 6, 3   1, 2, 4, 5
Table 2.3: CML XOR Operation
For the 00 combination, T1, T5 and T7 will turn on and Ibias will flow down to Vss,
resulting in all other paths being turned off. For the 01 combination, either T2 or T3 will
turn on, depending on A. As A+ is 0 and A- is 1, T5 will be on and T6 will be off, and the
path will be T2, T5 and T7.
As the lowest level dominates the path, for the 10 combination T6 will be on, and as B+
is 0 and B- is 1, T4 will turn on and the path will be T4, T6 and T7. For the 11 combination,
T3, T6 and T7 will be turned on.
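The four cases just described can be collected into a behavioural sketch. The function name `xor_gate_path` is hypothetical; the transistor assignments follow the text and Table 2.3, with A selecting the lower-level transistor (T5 or T6) and B selecting the upper-level one.

```python
def xor_gate_path(a, b):
    """Return (conducting path, Out+ logic value) for the CML XOR.

    a and b are the logic values on A+ and B+.
    """
    if a:
        lower = "T6"
        upper = "T3" if b else "T4"
    else:
        lower = "T5"
        upper = "T2" if b else "T1"
    # T7 is on in every case and carries Ibias down to Vss.
    return (upper, lower, "T7"), int(bool(a) != bool(b))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            path, out = xor_gate_path(a, b)
            print(f"A={a} B={b} -> path {', '.join(path)}  Out+={out}")
```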
Figure 2.14 shows post vs. pre layout simulation of the CML XOR gate with 18fF
input/output load capacitance at 12GHz (input changes at 83ps). It is observed that the circuit
level rise delay is 33.7ps whereas the extracted layout delay is 41.5ps. Therefore, a 7.8ps
discrepancy exists between circuit level and device level simulation.
Figure 2.14: Post vs. pre layout simulation of CML XOR with 18fF input/output load
capacitance (input changes at 83ps)
Figure 2.15: CML XOR layout
The reported area of the CML XOR is 28.590μm x 45.630μm, as indicated in Figure 2.15.
2.4 CML Mux Realization
A CML mux has been realized at the transistor level as indicated in Figure 2.16. The
architecture is very similar to the XOR, and the power dissipation is comparable to that of the CML
AND/XOR gates. Realizing a CML mux with transistors reduces power by at least 1/3 over
realizing the mux by cascading CML AND gates. Realizing the mux with transistors has a twofold
benefit: one is less power and the other is delay reduced by 1/2. Table 2.4 describes CML
mux operation.
Figure 2.16: CML Mux realization
B+ (V)   A+ (V)   S+ (V)   Out+ (V)   T on   T off
2.2      2.2      2.2      2.2        5, 2   6, 1, 3, 4
2.2      2.8      2.2      2.8        5, 1   6, 2, 3, 4
2.8      2.2      2.2      2.2        5, 2   6, 1, 3, 4
2.8      2.8      2.2      2.8        5, 1   6, 2, 3, 4
2.2      2.2      2.8      2.2        6, 4   5, 1, 2, 3
2.2      2.8      2.8      2.2        6, 4   5, 1, 2, 3
2.8      2.2      2.8      2.8        6, 5   4, 1, 2, 3
2.8      2.8      2.8      2.8        6, 5   4, 1, 2, 3
Table 2.4: CML Mux Operation
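The input/output relationship in Table 2.4 is that of a 2-to-1 selector: Out+ follows A+ when S+ is low and B+ when S+ is high. A minimal behavioural sketch follows; the names `cml_mux` and `TABLE_2_4` are hypothetical, and comparing S+ against the 2.5V quiescent level is an assumption of the sketch.

```python
V_MID = 2.5  # V, quiescent level Vin; used here as the decision threshold (assumption)

# Rows of Table 2.4 as (B+, A+, S+, Out+), all in volts.
TABLE_2_4 = [
    (2.2, 2.2, 2.2, 2.2),
    (2.2, 2.8, 2.2, 2.8),
    (2.8, 2.2, 2.2, 2.2),
    (2.8, 2.8, 2.2, 2.8),
    (2.2, 2.2, 2.8, 2.2),
    (2.2, 2.8, 2.8, 2.2),
    (2.8, 2.2, 2.8, 2.8),
    (2.8, 2.8, 2.8, 2.8),
]


def cml_mux(a_v, b_v, s_v):
    """Return the Out+ level: B+ when S+ is high, otherwise A+."""
    return b_v if s_v > V_MID else a_v


if __name__ == "__main__":
    for b_v, a_v, s_v, expected in TABLE_2_4:
        print(f"B+={b_v} A+={a_v} S+={s_v} -> Out+={cml_mux(a_v, b_v, s_v)} "
              f"(table: {expected})")
```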
Figure 2.17 shows post vs. pre layout simulation of CML mux with 18fF input/output
load capacitance at 12GHz (input changes at 83ps). It is observed that circuit level rise
delay is 28.01ps whereas extracted layout delay is 33.99ps. Therefore, 5.98ps discrepancy
exists in between circuit level and device level simulation.
Figure 2.17: Post vs. pre layout simulation of CML Mux with 18fF input/output load
capacitance (input changes at 83ps)
Figure 2.18: CML Mux layout
The reported area for the CML Mux is 36.510μm x 47.520μm, as indicated in Figure
2.18. Total power consumption of the CML Mux is 2.8V * (2.019 + 2.019 + 1.757)mA =
16.226mW.
2.5 CML D-latch Realization
A CML D-latch has been realized in transistors, as indicated in Figure 2.19, with four
levels of hierarchy and a reset input, rather than by cascading CML AND gates as in a
CMOS architecture. This architecture not only reduces power consumption but also reduces
delay. Very surprisingly, power consumption is less than that of the CML AND gate, which cannot
be realized in CMOS architecture. It is notable that the lowest level dominates in selecting
upper level paths. RST+=1.3V (RST-=0.7V) turns on T9 (T7 off) and ties Q+ to Vss,
irrespective of any input combination.
Figure 2.19: CML D-latch
CLK+ (V)   D+ (V)   RST+ (V)   Q(t+1)+ (V)   T on                T off
2.2        2.2      2.2        Q(t)+         8, 7, 6, (3 / 4)    9, 5, (4 / 3), 1, 2, 10, 11
2.2        2.8      2.2        Q(t)+         8, 7, 6, (3 / 4)    9, 5, (4 / 3), 1, 2, 10, 11
2.8        2.2      2.2        2.2           8, 7, 5, 2          9, 6, 1, 3, 4, 10, 11
2.8        2.8      2.2        2.8           8, 7, 5, 1          9, 6, 2, 3, 4, 10, 11
x          x        2.8        2.2           8, 9, 10, 11        1, 2, 3, 4, 5, 6, 7
Table 2.5: CML D-latch Operation
Table 2.5 briefly describes the operation of the CML D-latch. Initially Q+ tends to be
logic-0 (for the reason explained for the CML AND gate). If the clock is high, then whichever
data arrives at D+ is passed to the output Q+. Also, the present state pushes the lower level
transistors to retain their logic states when the clock is not high. Therefore, a reset at the
beginning is not necessary in the CML architecture.
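The behaviour in Table 2.5 (transparent when the clock is high, hold when it is low, reset dominating) can be expressed as a small behavioural model. This is a logic-level sketch only; the class name `DLatch` and its `step` method are hypothetical, and logic values 0 and 1 stand in for the 2.2V and 2.8V levels.

```python
class DLatch:
    """Behavioural model of the level-sensitive CML D-latch with reset."""

    def __init__(self):
        # The text notes that Q+ initially tends to logic-0.
        self.q = 0

    def step(self, clk, d, rst):
        """Apply one set of input levels and return the resulting Q+."""
        if rst:
            self.q = 0   # reset ties Q+ low irrespective of the other inputs
        elif clk:
            self.q = d   # clock high: the latch is transparent
        # clock low: the present state is retained
        return self.q


if __name__ == "__main__":
    latch = DLatch()
    for clk, d, rst in [(1, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 1)]:
        print(f"CLK={clk} D={d} RST={rst} -> Q={latch.step(clk, d, rst)}")
```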
Figure 2.20: Post vs. pre layout simulation of CML D-latch with 18fF input/output load
capacitance at 6GHz (166ps)
Figure 2.20 shows post vs. pre layout simulation of CML D-latch with 18fF input/output
load capacitance at 6GHz. It is observed that circuit level rise delay is 56.813ps whereas
extracted layout delay is 67.367ps. Therefore, 10.554ps discrepancy exists in between circuit
level and device level simulation.
Figure 2.21: CML D-latch layout
Power consumption of the CML D-latch is 2.8V * (2.019 + 2.019 + 1.431)mA =
15.3132mW. The reported area for the CML D-latch is 29.190μm x 57.780μm, as indicated in Figure
2.21.
2.6 Speed, Power, Area and Delay of Basic CML Components
Table 2.6 summarizes the post layout simulation power, area and delay of each basic CML
component with an 18fF load capacitance (10fF for wire, 8fF for next stage Cgg). In later chapters,
higher level circuits are realized using these basic components to develop
a CML processor datapath. Therefore, in measuring theoretical worst case delay it has
been assumed that delay is additive, meaning that if the output of a CML AND gate drives the input
of a CML Mux, the total delay is the sum of the two. The total delay of any complex component
in simulation is found to match the value obtained by counting the basic components in the
critical path and assigning each its delay.
Component         Power (mW)   Area (μm x μm)   Delay (ps)
Inverter          4.62         15.96 x 24.45    24.79
AND/NAND          16.2148      30.15 x 42.90    37.6
OR/NOR            16.2148      30.15 x 42.90    46.09
XOR/XNOR          16.226       28.59 x 45.63    41.5
CML 2-to-1 Mux    16.226       36.51 x 47.52    33.99
CML D-latch       15.3132      29.19 x 57.78    67.367
Table 2.6: Power, Area and Delay of Basic Components (post layout simulation with 18fF
load capacitance)
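Under the additive delay assumption, the worst case delay of a path is the sum of the Table 2.6 delays of the components on it. The sketch below illustrates this; the dictionary keys and the function name `path_delay_ps` are hypothetical, and the example path (an AND gate driving a mux) is the one mentioned in the text.

```python
# Post layout delays in ps, copied from Table 2.6.
DELAY_PS = {
    "inverter": 24.79,
    "and": 37.6,
    "or": 46.09,
    "xor": 41.5,
    "mux": 33.99,
    "dlatch": 67.367,
}


def path_delay_ps(path):
    """Sum the per-component delays along a path (additive delay assumption)."""
    return sum(DELAY_PS[component] for component in path)


if __name__ == "__main__":
    example = ["and", "mux"]
    print(f"{' -> '.join(example)}: {path_delay_ps(example):.2f} ps")
```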
Layout of the processor datapath has not been performed; only circuit level simulation has been carried out.
It is observed that post vs. pre layout simulation differs by at most about 10ps and that wire delay is
10ps (for a 100μm wire connecting the next stage, assuming wire delay is 0.10ps/μm) [22].
Therefore a 20fF load capacitance was added inside every CML logic gate in the datapath simulation
to mimic post layout + wire delay (in intermediate stages Cgg = 8fF has already been
considered by the simulation tool). Area was predicted by counting the number of gates * area
of each gate + 50% area for wiring.
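The area prediction rule (gate count times gate area, plus 50% for wiring) can be sketched as follows. The function name `estimated_area_um2` and the gate counts in the example are hypothetical and do not correspond to any block of the datapath; the gate dimensions are those of Table 2.6.

```python
# Gate dimensions in micrometres (width, height), copied from Table 2.6.
GATE_DIMS_UM = {
    "inverter": (15.96, 24.45),
    "and": (30.15, 42.90),
    "or": (30.15, 42.90),
    "xor": (28.59, 45.63),
    "mux": (36.51, 47.52),
    "dlatch": (29.19, 57.78),
}


def estimated_area_um2(gate_counts, wiring_overhead=0.5):
    """Estimate area in um^2: sum of gate areas plus a fractional wiring overhead."""
    gate_area = sum(
        count * GATE_DIMS_UM[name][0] * GATE_DIMS_UM[name][1]
        for name, count in gate_counts.items()
    )
    return gate_area * (1.0 + wiring_overhead)


if __name__ == "__main__":
    # HYPOTHETICAL gate counts, for illustration only.
    counts = {"mux": 16, "dlatch": 16}
    print(f"Estimated area: {estimated_area_um2(counts):.1f} um^2")
```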
Chapter 3
Datapath
A multi-cycle 16-bit processor datapath has been designed using a RISC architecture
that can execute 15 different operations and a 4-bit opcode has been used. Three different
types of instructions can be performed and instruction set architecture is given in Table 3.1.
Address bus (16-bit)
Type      15-12    11-8   7-4   3-0
R-type    Opcode   Rd     Rs    Rt
I-type    Opcode   Rd     Rs    Immediate operand
J-type    Opcode   Address
Table 3.1: Instruction Set Architecture
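The field layout of Table 3.1 can be illustrated with a short decoding sketch. This is not part of the implemented datapath: the function name `decode` and the returned field names are hypothetical, the instruction type is passed in explicitly rather than derived from the opcode, and the J-type address is assumed to occupy bits 11-0.

```python
def decode(word, fmt):
    """Split a 16-bit instruction word into fields for the given format.

    fmt is "R", "I" or "J"; it is supplied by the caller in this sketch.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    opcode = (word >> 12) & 0xF          # bits 15-12
    if fmt == "J":
        # Assumption: the address fills the remaining 12 bits.
        return {"opcode": opcode, "address": word & 0xFFF}
    fields = {
        "opcode": opcode,
        "rd": (word >> 8) & 0xF,         # bits 11-8
        "rs": (word >> 4) & 0xF,         # bits 7-4
    }
    if fmt == "R":
        fields["rt"] = word & 0xF        # bits 3-0
    elif fmt == "I":
        fields["imm"] = word & 0xF       # bits 3-0
    else:
        raise ValueError("fmt must be 'R', 'I' or 'J'")
    return fields


if __name__ == "__main__":
    print(decode(0x0123, "R"))
    print(decode(0x5A4F, "I"))
    print(decode(0xF0AB, "J"))
```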
Figure 3.1: Processor datapath
The processor datapath is shown in Figure 3.1, where all the components have been
realized in CML logic except the cache memory and control unit. Correct operation of
the datapath has been verified by providing external control stimulus in every clock cycle
and observing/providing data to/from memory. Fifteen different operations that can be
performed are listed in Table 3.2. The Processor datapath does not contain any novel
approach. However this datapath can be re-structured such that power consumption will be
less and operating frequency can be increased.
Type Opcode Description Cycle
0000 ADD Rd=Rs+Rt 4
0001 SUB Rd=Rs-Rt 4
0010 AND Rd=Rs&Rt 4
R-type 0011 XOR Rd=Rs⊕Rt 4
0100 SLT Rd=1 if Rs