# Design of 3.33 GHz CML Processor Datapath 

by

Abdullah Al Owahid

A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the requirements for the Degree of

Master of Science
Auburn, Alabama
May 7, 2012

Keywords: CML, CMOS, Processor

Copyright 2012 by Abdullah Al Owahid

Approved by
Fa Foster Dai, Chair, Professor of Electrical and Computer Engineering
Vishwani D. Agrawal, James J. Danaher Professor of Electrical and Computer Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering


#### Abstract

For almost a decade, processor speed has been stuck at an operating frequency of $2-3 \mathrm{GHz}$ due to the excessive power consumption of CMOS logic gates at higher frequencies, whereas the speed predicted for the present was $10-15 \mathrm{GHz}$. This led to the idea of multi-core design in today's processor architectures. However, multi-core design increases the communication overhead $\beta$, and data dependencies prevent programs from fully exploiting the advantage of many-core design. Furthermore, many-core design increases the amount of dark silicon, and the number of cores cannot be increased beyond a certain limit. Therefore, a novel approach to processor design using CML logic gates is proposed.

A handcrafted 16-bit CML microprocessor datapath has been developed at an operating frequency of 3.33 GHz using 130 nm CMOS technology. With the same feature size, CMOS gates are incapable of operating beyond 1 GHz, whereas the CML logic gates were optimized for 12 GHz using a bias current of $70 \%$ of the peak $f_{t}$ current with a logic swing of 600 mV. Considering the critical path delay, the circuit has been slowed down to operate at 3.33 GHz.

All the processor components (decoder, mux, register file, ALU) were deliberately handcrafted due to the lack of an analog synthesizer tool. The reported static power consumption of the multi-cycle CML processor datapath is 41.264 W. However, this is not the best case and could have been reduced to $50 \%$ by implementing multi-input CML logic. The expected chip area is 2.2 mm x 3.45 mm and the power density per unit area is $5.44 \mu \mathrm{~W} / \mu \mathrm{m}^{2}$. The estimated performance is 892 MIPS. The supply voltage used is 2.8 V. CML logic levels were defined as logic-1 $=2.8 \mathrm{~V}$ and logic-0 $=2.2 \mathrm{~V}$. A 1 V reference voltage was used to bias the current source, and the reset signal uses 1.3 V and 0.7 V for logic high and low, respectively. It has been observed that it is possible to realize an ultra-high speed processor using existing technology with minimum power consumption in CML logic.


## Acknowledgments

I would like to acknowledge the continuous support and guidance of Dr. Fa Foster Dai. Without his suggestion and direction it would have been impossible to complete this thesis work. I would also like to thank my committee members Dr. Vishwani D. Agrawal for his meaningful suggestions regarding processor architecture and Dr. Victor P. Nelson.

I thank my friends and colleagues - James Clark, Shannon Price, Xin Jin and Baohu Li for being with me and making life at Auburn enjoyable.

Last but not the least, I would like to thank my family members - my parents whose love brought me so far, my brother and sister, and especially my wife for her patience.

## Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Problem Statement
1.2 Background and Motivation
1.3 Contribution
1.4 Organization
2 Background and High Speed CML Logic Realization
2.1 CML Inverter
2.1.1 CML Inverter Optimization
2.2 CML Universal Gate
2.2.1 Universal CML Gate Optimization
2.3 CML XOR/XNOR Gate
2.4 CML Mux Realization
2.5 CML D-latch Realization
2.6 Speed, Power, Area and Delay of Basic CML Components
3 Datapath
3.1 First Clock Cycle of Every Instruction
3.2 R-type Instruction
3.2.1 R-type ADD/SUB/AND/XOR
3.2.2 R-type SLT
3.2.3 R-type SEQ
3.3 I-type Instruction
3.3.1 I-type LW
3.3.2 I-type SW
3.3.3 I-type ADDI
3.3.4 I-type MOVI
3.4 J-type Instruction
3.4.1 J-type J LABEL
3.4.2 J-type JZ LABEL
3.4.3 J-type JNZ LABEL
3.4.4 J-type JAL
3.4.5 J-type JR
3.5 Control Signals
4 Component Realization
4.1 16-bit Register With Enable Input
4.2 Z Register (1-bit Register With Enable Input)
4.3 4 16-bit 2-to-1 Mux
4.4 3 4-bit 2-to-1 Mux
4.5 3 16-bit 4-to-1 Mux
4.6 16-bit 5-to-1 mux
4.7 16-bit ALU
4.8 16x16 Register File
4.9 Sign 4-to-16 extension (Sign 4)
4.10 Sign 8-to-16 extension (Sign 8)
4.11 Sign 12-to-16 extension (Sign 12)
4.12 2 unsigned 1-to-16 extension (Unsigned 16)
4.13 4 1-bit AND gate
4.14 1 1-bit OR gate
5 Processor Verification and Performance
5.1 Processor Verification
5.2 Performance
5.3 Comparison
6 Conclusions
6.1 Future Work
Bibliography

## List of Figures

1.1 Operating frequency over time
1.2 Operating Frequency vs. Power of Intel Processor
1.3 Power density per unit area
2.1 Current consumptions for CMOS vs. CML logic
2.2 CMOS vs. CML power consumption
2.3 CML Inverter
2.4 Normalized current for CML inverter
2.5 CML inverter half-circuit small-signal model
2.6 Post vs. pre layout simulation of CML inverter with 18 fF input/output load capacitance (input changes at 83ps)
2.7 CML inverter layout
2.8 Universal CML gate
2.9 Universal CML gate with embedded level shifter
2.10 Normalized current for CML universal gate
2.11 Post vs. pre layout simulation of CML AND with 18 fF input/output load capacitance (input changes at 83ps)
2.12 CML AND layout
2.13 CML XOR gate
2.14 Post vs. pre layout simulation of CML XOR with 18 fF input/output load capacitance (input changes at 83ps)
2.15 CML XOR layout
2.16 CML Mux realization
2.17 Post vs. pre layout simulation of CML Mux with 18 fF input/output load capacitance (input changes at 83ps)
2.18 CML Mux layout
2.19 CML D-latch
2.20 Post vs. pre layout simulation of CML D-latch with 18 fF input/output load capacitance at 6 GHz (166 ps)
2.21 CML D-latch layout
3.1 Processor datapath
3.2 First clock cycle of any instruction
3.3 R-type ADD/SUB/AND/XOR
3.4 R-type SLT
3.5 R-type SEQ
3.6 I-type LW
3.7 I-type SW
3.8 I-type ADDI
3.9 I-type MOVI
3.10 J-type J LABEL
3.11 J-type JZ LABEL
3.12 J-type JNZ LABEL
3.13 J-type JAL
3.14 J-type JR
4.1 Datapath Components
4.2 Block diagram of MS DFF, MS DFF-EN, 16-bit register
4.3 1-bit register output at 6 GHz with 20 fF load capacitance (clock period 166 ps)
4.4 1-bit 4-to-1 mux
4.5 1-bit 4-to-1 mux output with 20 fF load capacitance (input changes at 83 ps)
4.6 1-bit 5-to-1 mux
4.7 1-bit 5-to-1 Mux output with 20fF load capacitance (input changes at 166ps)
4.8 Block diagram of 16-bit ALU
4.9 16-bit CLA block diagram
4.10 Critical path delay in 16-bit CLA is 224.7 ps (input changes at 500 ps)
4.11 16-bit ALU Output (input changes at 300ps)
4.12 16x16 Register File Schematic
4.13 4-to-16 Decoder (input changes at 83ps)
4.14 1-bit 16-to-1 Mux Schematic
4.15 1-bit 16-to-1 Mux Output (input changes at 144ps)
4.16 16x16 Register File Output at 3.33 GHz
4.17 Sign 4 to 16 Extension Output (input changes at 72ps)
4.18 Unsigned 1 to 16 Extension Output (input changes at 83ps)
5.1 Handcrafted Processor Schematic
5.2 MOVI instruction
5.3 ADDI instruction
5.4 ADD instruction
5.5 Static power consumption of CML processor datapath over 13 clock cycles

## List of Tables

2.1 Inverter Operation
2.2 CML AND Operation
2.3 CML XOR Operation
2.4 CML Mux Operation
2.5 CML D-latch Operation
2.6 Power, Area and Delay of Basic Components (post layout simulation with 18fF load capacitance)
3.1 Instruction Set Architecture
3.2 Opcode and 15 Different Operations
3.3 Control Signal Table Part-1
3.4 Control Signal Table Part-2
4.1 Component Power Dissipation, Expected Area and Delay

## List of Abbreviations

ALU Arithmetic Logic Unit

BiCMOS Bipolar and CMOS in same integrated chip

BJT Bipolar Junction Transistor

CML Current Mode Logic

CMOS Complementary Metal Oxide Semi-conductor

ECL Emitter Coupled Logic

ISA Instruction Set Architecture

MCML MOS Current Mode Logic

MIPS Million Instructions Per Second

nMOS n-type Metal Oxide Semi-conductor

RISC Reduced Instruction Set Computing

SoC System on Chip

SPICE Simulation Program with Integrated Circuit Emphasis

SRAM Static Random Access Memory

## Chapter 1

## Introduction

This thesis presents the design of a handcrafted 3.33 GHz CML processor datapath. A RISC architecture has been adopted in designing the multi-cycle processor datapath, and the ISA is 16 bits wide. It is the first ever MCML processor and, unlike CMOS processors, it dissipates constant power. Also, once optimized, its power dissipation does not increase with increasing operating frequency. Optimizing CML logic for a higher operating frequency requires more power than optimizing for a lower frequency. Therefore, once CML logic has been optimized for a targeted maximum frequency, operating the circuit at a lower frequency forfeits the benefit in terms of power.

Due to the larger switching noise associated with CMOS circuits, CML is a better choice for high speed circuit realization [1]. BJT-based CML gates are faster than MOSFET CML gates due to higher $g_{m}$ and consume less power, but they have been avoided because of process difficulty and because they are uncommon in digital circuit design. Therefore MOSFET-CML (MCML) logic has been used.

### 1.1 Problem Statement

The problem solved in this thesis: Design a high speed low power processor datapath.

### 1.2 Background and Motivation

As indicated in Figure 1.1, processor speed has been stuck at $2-3 \mathrm{GHz}$ for the last 10 years due to excessive power consumption [2]. Multi-core design has therefore evolved to increase performance. However, the number of cores cannot be increased beyond a certain limit, and multi-core design incurs a communication overhead $\beta$. Further, due to data dependency, programs cannot be fully parallelized, which inhibits proper exploitation of multi-core design.


Figure 1.1: Operating frequency over time

Intel processor speed vs. power data have been obtained and plotted in MATLAB, as depicted in Figure 1.2 [3].


Figure 1.2: Operating Frequency vs. Power of Intel Processor

Figure 1.2 shows that processor power consumption increases almost exponentially at operating frequencies beyond 2 GHz. It is also notable that the Core i5 has a higher operating frequency than the Core i7, but the power consumption of the Core i7 is almost double due to its higher number of cores. So reducing the number of cores will not only reduce power but can also result in higher performance due to less data dependency.

Power consumption can also be reduced by moving to deep submicron technology. However, the main drawback to reducing feature size is it increases unit power density and chips cannot sustain that power, as shown in Figure 1.3 [4].


Figure 1.3: Power density per unit area

In CML, per unit power density is lower as it requires greater area than CMOS due to load resistances in CML logic. Therefore in CML, we can achieve higher operating frequency with less power density.

A previous mixed signal superscalar processor was developed in 1997, using $0.5 \mu \mathrm{~m}$ BiCMOS process and required 3.6 V and 2.1 V power supplies. The reported operating frequency was 533 MHz and the design used a PowerPC architecture that contained three pipelines and a large on-chip secondary cache to achieve a peak performance of 1600 MIPS. The 15 mm x 10 mm die contained 2.7 M transistors ( 2 M CMOS and 0.7 M bipolar) and dissipated less than 85 W [5]. All logic circuits were implemented in three-level emitter coupled logic (ECL) and only RAM structures were implemented with CMOS circuits.

Although BJTs have less switching noise and are faster than MOS transistors, they are expensive and not frequently used in digital circuits [1]. This led to the idea of designing a high speed processor datapath using MCML logic. Further, CMOS SRAM cannot operate at 3.33 GHz using $0.12 \mu \mathrm{~m}$ technology. Therefore only the high speed datapath was developed in this thesis.

### 1.3 Contribution

CML logic architecture has been discussed in the literature, but it is not guaranteed to realize digital functions unless optimization has been performed [6]-[8]. Optimized CML logic can steer the full bias current for different input combinations, resulting in a full voltage swing that differentiates the logic states. Also, technology files provide only transistors, unlike digital technology files, which provide optimized CMOS logic gates. Therefore, due to the lack of analog synthesizer tools, handcrafting of the CML logic architecture is necessary, and optimization is then required for CML logic to realize a digital function at a targeted frequency. All the basic components were derived first, with a maximum operating frequency of 12 GHz using 130 nm CMOS technology. The bias current was chosen to be $70 \%$ of the peak $f_{t}$ current, which gives the highest possible operating frequency without burning out the transistors during operation. Further, biasing CML gates at $70 \%$ of the peak $f_{t}$ current or less may add $10 \%$ to the propagation delay but can save more than $40 \%$ of the power [1], [7]. The logic swing was set to 600 mV, on the assumption that at this frequency it is advantageous over CMOS logic, which has a fixed swing of 1 V. These basic CML logic designs were later used in realizing the 16-bit 3.33 GHz datapath components. All the processor components (mux, register file, ALU) were deliberately handcrafted in Cadence Virtuoso. The reported static power consumption of the multi-cycle CML processor is 41.264 W and the power density per unit area is $5.44 \mu \mathrm{~W} / \mu \mathrm{m}^{2}=544 \mathrm{~W} / \mathrm{cm}^{2}$, below the power density per unit area of traditional CMOS processors, as indicated in Figure 1.3. The estimated performance of the multi-cycle CML processor datapath is 892 MIPS and the expected chip area is 2.2 mm x 3.45 mm.
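As a quick sanity check, the power density quoted above can be reproduced from the reported static power and the expected chip area. The sketch below is only illustrative arithmetic; the function and variable names are assumptions and do not come from the thesis.

```python
# Figures quoted in the text; the names are illustrative only.
STATIC_POWER_W = 41.264   # static power of the datapath, in watts
CHIP_WIDTH_MM = 2.2       # expected chip width, in millimetres
CHIP_HEIGHT_MM = 3.45     # expected chip height, in millimetres


def power_density(power_w, width_mm, height_mm):
    """Return the power density as (uW per um^2, W per cm^2)."""
    area_um2 = (width_mm * 1e3) * (height_mm * 1e3)  # mm -> um
    area_cm2 = (width_mm * 0.1) * (height_mm * 0.1)  # mm -> cm
    return power_w * 1e6 / area_um2, power_w / area_cm2


uw_per_um2, w_per_cm2 = power_density(STATIC_POWER_W, CHIP_WIDTH_MM, CHIP_HEIGHT_MM)
print(f"{uw_per_um2:.2f} uW/um^2 = {w_per_cm2:.0f} W/cm^2")
```

Both results round to the values stated in the text, which confirms that the two units describe the same density.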

This work has been accepted at IEEE International Symposium On Circuits and Systems (ISCAS) 2012 Conference [9].

### 1.4 Organization

The thesis is organized as follows: Chapter 2 provides background and describes high speed CML logic realization. Chapter 3 introduces the datapath design. Chapter 4 describes CML datapath component realization and verification. Chapter 5 presents processor verification, performance and comparison. Chapter 6 draws conclusions and discusses future work.

## Chapter 2

## Background and High Speed CML Logic Realization

Recently, interest in high speed digital circuits based on MCML/BJT-CML has been increasing due to their low power consumption [10]. Noise coupling between digital circuitry and sensitive analog blocks has always been a major obstacle in complete system on chip (SoC) design [11]. MOS current mode logic (MCML) is a promising alternative to conventional MOS in mixed signal applications. Many efforts have been devoted to realizing the potential of MCML [12]-[17]. Even though MCML has been shown to dissipate less power than CMOS at operating frequencies above 300 MHz, designers have been reluctant to exchange CMOS for MCML [14]. The high complexity of MCML and the lack of automation tools made it impossible to produce robust and power-efficient designs while maintaining low cost and reasonable time to market.


Figure 2.1: Current consumptions for CMOS vs. CML logic

Figure 2.1 shows typical current consumption for CMOS and CML logic [18]. As indicated, CMOS is typically beneficial at lower frequencies, whereas CML consumes less power at higher operating frequencies. Figure 2.1 is based on assumption rather than circuit measurements, and is not drawn to scale. It is supposed that CML is typically more beneficial than CMOS beyond 1 GHz.


Figure 2.2: CMOS vs. CML power consumption

A divide-by-two circuit was realized in CMOS ($\mathrm{W}=160 \mathrm{~nm}, 160 \mathrm{~nm}, 200 \mathrm{~nm}, 600 \mathrm{~nm}$, $1 \mu \mathrm{~m}, 1.5 \mu \mathrm{~m}, 2.5 \mu \mathrm{~m}, 3 \mu \mathrm{~m}, 7 \mu \mathrm{~m}$ and $10 \mu \mathrm{~m}$ for $0.5,1,1.2,1.4,1.5,1.6,1.8,2,2.3$ and 2.5 GHz respectively) and in CML ($\mathrm{W}=160 \mathrm{~nm}, 160 \mathrm{~nm}, 160 \mathrm{~nm}, 200 \mathrm{~nm}, 250 \mathrm{~nm}, 300 \mathrm{~nm}$, each with $W_{\text {tail }}=120 \mathrm{~nm}$, for $0.5,1,2,2.5,3$ and 4 GHz respectively). The channel length $\mathrm{L}=120 \mathrm{~nm}$ was fixed in both cases, and the power has been plotted in Figure 2.2. At 2.5 GHz the CMOS D flip-flop propagation delay reaches $50 \%$ of the input clock cycle, whereas the CML D flip-flop was able to generate correct output up to 6 GHz. It is evident that CML outperforms CMOS beyond 1.5 GHz. The benefits of the MCML circuit topology over CMOS are largely independent of technology [6].

Many successful attempts have been made to expose the relationships between the MCML gate delay and the various design parameters [19]-[21]. These efforts have provided insight into the design considerations, which are described briefly here for CML inverters and universal CML logic. There is no straightforward method to optimize CML logic, and an accurate propagation delay model for CML logic is very hard to derive with pen and paper due to higher order effects. Therefore, we will rely on some approximations in the following subsections and compare them with the simulation results in designing high speed CML gates.

### 2.1 CML Inverter



Figure 2.3: CML Inverter

Figure 2.3 shows a CML inverter, which is typically a differential pair. There exists a particular biasing voltage $V_{i n}$ (quiescent point) for which $I_{d 1}=I_{d 2}=I_{\text {bias }} / 2$. For a differential input, which is our logic swing $\Delta \mathrm{V}$, it is necessary to tune the circuit such that when logic-1 $\left(V_{i n}+\Delta \mathrm{V} / 2\right)$ is applied to $\mathrm{A}+$ and logic-0 $\left(V_{i n}-\Delta \mathrm{V} / 2\right)$ is applied to A-, T1 turns on and T2 turns off, resulting in $I_{d 1}=I_{\text {bias }}$ and $I_{d 2}=0$, steering all the current through T1. A summary of CML inverter operation is given in Table 2.1.

| $\mathrm{A}+$ | A- | T on | T off | Out + | Out- |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $V_{\text {in }}-\Delta \mathrm{V} / 2$ | $V_{\text {in }}+\Delta \mathrm{V} / 2$ | 2,3 | 1 | $V_{i n}+\Delta \mathrm{V} / 2$ | $V_{i n}-\Delta \mathrm{V} / 2$ |
| $V_{\text {in }}+\Delta \mathrm{V} / 2$ | $V_{\text {in }}-\Delta \mathrm{V} / 2$ | 1,3 | 2 | $V_{i n}-\Delta \mathrm{V} / 2$ | $V_{i n}+\Delta \mathrm{V} / 2$ |

Table 2.1: Inverter Operation

### 2.1.1 CML Inverter Optimization

If our supply voltage is $V_{d d}$, as indicated in Figure 2.3, and the output of the CML gate drives another gate, then we must maintain:

$$
\begin{gathered}
\text { Logic-1 }=V_{d d}=V_{i n}+\Delta \mathrm{V} / 2 \\
\text { and Logic-0 }=V_{d d}-\Delta \mathrm{V}=V_{i n}-\Delta \mathrm{V} / 2
\end{gathered}
$$
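The level definitions above can be evaluated directly for the supply and swing used in this work ($V_{d d}=2.8 \mathrm{~V}$, $\Delta \mathrm{V}=600 \mathrm{mV}$). The following is a minimal sketch; the function name `cml_levels` is hypothetical and chosen only for illustration.

```python
def cml_levels(vdd, swing):
    """Return (v_in, logic1, logic0) in volts for a CML gate driving another gate.

    logic1 sits at the supply, logic0 one full swing below it, and the
    quiescent input voltage v_in lies midway between the two levels.
    """
    logic1 = vdd
    logic0 = vdd - swing
    v_in = vdd - swing / 2
    return v_in, logic1, logic0


v_in, logic1, logic0 = cml_levels(vdd=2.8, swing=0.6)
print(f"V_in = {v_in:.1f} V, logic-1 = {logic1:.1f} V, logic-0 = {logic0:.1f} V")
```

With these inputs the quiescent point comes out at 2.5 V, with logic-1 at 2.8 V and logic-0 at 2.2 V.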

Let us assume logic-1 $\left(V_{i n}+\Delta \mathrm{V} / 2\right)$ is applied to $\mathrm{A}+$ and logic-0 $\left(V_{i n}-\Delta \mathrm{V} / 2\right)$ is applied to A-. Therefore, to keep T1 in its forward-active region and T2 in its cut-off region, the necessary conditions are:

$$
\begin{gathered}
V_{g s 1}>V_{t h 1} ; V_{g d 1}<V_{t h 1} ; V_{d s 1}>V_{g s 1}-V_{t h 1} \\
V_{g s 2}<V_{t h 2} ; V_{g d 2}<V_{t h 2}
\end{gathered}
$$

If we assume that the magnitude of the voltage swing is just large enough to steer all the current from one side to the other then $\Delta \mathrm{V}=\Delta V_{\text {min }}$, which is the minimum swing required (just large enough to turn off T2),

$$
\begin{align*}
V_{g s 2} & =V_{t h 2} \\
V_{g 2}-V_{x} & =V_{t h 2} \\
V_{i n}-\Delta V_{\min } / 2-V_{x} & =V_{t h 2} \\
V_{d d}-\Delta V_{\min } / 2-\Delta V_{\min } / 2-V_{x} & =V_{t h 2} \\
V_{x} & =V_{d d}-\Delta V_{\min }-V_{t h 2}  \tag{2.1}\\
V_{g s 1} & =V_{g 1}-V_{x} \\
V_{g s 1} & =V_{i n}+\Delta V_{\min } / 2-V_{x} \\
V_{g s 1} & =V_{d d}-V_{x}  \tag{2.2}\\
I_{b i a s} & =\left(\mu_{n} C_{o x} / 2\right)(W / L)\left(V_{g s 1}-V_{t h 1}\right)^{\alpha} \tag{2.3}
\end{align*}
$$

where $\mu_{n}$ is the electron mobility in the nMOS device and $C_{o x}$ is the capacitance per unit area of the gate oxide, $C_{o x}=\epsilon_{o x} / t_{o x}$. The permittivity of $\mathrm{SiO}_{2}$ is $\epsilon_{o x}=3.9 \epsilon_{0}$, where the permittivity of free space is $\epsilon_{0}=8.85 \cdot 10^{-14} \mathrm{~F} / \mathrm{cm}$, and $t_{o x}$ is the thickness of the gate oxide. For a short channel device, such as one with a $0.12 \mu \mathrm{~m}$ feature size, $\alpha \simeq 1.25$. For ease of pen and paper calculation, we will sometimes assume $\alpha$ is 2.

For matching purposes, T1 and T2 have to be same feature size such that an equal amount of current flows on both sides at the quiescent point. Therefore equation 2.1 yields

$$
\begin{align*}
V_{t h 1} & =V_{d d}-\Delta V_{\min }-V_{x} \\
\text { and, } V_{g s 1}-V_{t h 1} & =V_{d d}-V_{x}-V_{d d}+\Delta V_{\min }+V_{x} \\
& =\Delta V_{\min }  \tag{2.4}\\
I_{b i a s} & =\left(\mu_{n} C_{o x} / 2\right)(W / L)\left(\Delta V_{\min }\right)^{\alpha} \tag{2.5}
\end{align*}
$$

Therefore, $I_{b i a s}$ can be expressed as a function of the minimum logic swing $\Delta V_{\min }$, with $I_{b i a s} \propto\left(\Delta V_{\min }\right)^{\alpha}$. This means that the smaller the logic swing we define, the less bias current we need to realize a CML inverter. The minimum amount of current required to realize this logic swing occurs when W is minimum. In other words, we can also realize this full swing at a higher bias current, but we will burn more power. The minimum width $W_{n}$ that provides a bias current just large enough to realize the full swing is given by:

$$
\begin{equation*}
I_{L}=\left(\mu_{n} C_{o x} / 2\right)\left(W_{n} / L\right)\left(\Delta V_{\min }\right)^{\alpha} \tag{2.6}
\end{equation*}
$$
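To illustrate the scaling implied by Equations 2.5 and 2.6, the sketch below compares the bias current needed for two different swings at a fixed $W / L$, where the process-dependent prefactor $\mu_{n} C_{o x} / 2$ cancels. The function name and the 300 mV comparison point are hypothetical choices made only for illustration.

```python
def bias_current_ratio(swing_new, swing_old, alpha):
    """Ratio I_bias(swing_new) / I_bias(swing_old) at fixed W/L.

    Follows I_bias proportional to (delta_V_min) ** alpha, so the
    prefactor (mu_n * C_ox / 2) * (W / L) cancels in the ratio.
    """
    return (swing_new / swing_old) ** alpha


# Halving the swing from 600 mV to 300 mV (illustrative comparison only).
for alpha in (2.0, 1.25):
    ratio = bias_current_ratio(0.3, 0.6, alpha)
    print(f"alpha = {alpha}: bias current scales by {ratio:.3f}")
```

Under the square-law assumption ($\alpha=2$) halving the swing cuts the required current to a quarter, while for the short channel value $\alpha \simeq 1.25$ the saving is smaller.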

Again, let us consider a CML inverter circuit at its quiescent point, as indicated in Figure 2.3. In determining our minimum logic swing $\Delta V_{\min }$ we need to consider how we are going to forward bias the transistors, that is, what our overdrive should be. Typically, for an overdrive voltage $V_{o v} \geq 300 \mathrm{mV}$ the transistor reaches soft saturation, and when $V_{o v} \geq 500 \mathrm{mV}$ the transistor reaches hard saturation. The overdrive voltage is defined as $V_{o v}=V_{g s}-V_{t}$. It is necessary to determine the overdrive voltage because, if we apply logic-1 at $\mathrm{A}+$ and logic-0 at A- to turn off T2, but T2 has reached velocity saturation at the quiescent point, then $(\Delta \mathrm{V}) / 2$ is too small to completely turn off T2. As a consequence, we cannot realize the full swing. Making the logic swing larger can turn T2 off, but we will end up with a greater RC constant and a higher propagation delay. Therefore, we need an overdrive voltage that pushes the transistor only slightly into the forward active region. The overdrive was carefully chosen as $V_{o v}=V_{g s}-V_{t}=263.9 \mathrm{mV}$. It is therefore necessary to expose the relation between logic swing and overdrive voltage. For an input voltage $V_{g 1}$ at the gate terminal of T1,

$$
\begin{aligned}
V_{g s 1} & =V_{g 1}-V_{x} \\
V_{g 1} & =V_{g s 1}+V_{x} \\
\text { and } V_{g 2} & =V_{g s 2}+V_{x}
\end{aligned}
$$

For differential input voltage, which is our logic swing,

$$
\begin{align*}
\Delta V & =V_{g 1}-V_{g 2} \\
\Delta V & =V_{g s 1}-V_{g s 2} \\
I_{d 1} & =(1 / 2) \mu C_{o x}(W / L)\left(V_{g s 1}-V_{t}\right)^{2} \quad(\alpha=2 \text { for simplicity })  \tag{2.7}\\
I_{d 2} & =(1 / 2) \mu C_{o x}(W / L)\left(V_{g s 2}-V_{t}\right)^{2} \\
\sqrt{I_{d 1}}-\sqrt{I_{d 2}} & =\Delta V \sqrt{(1 / 2)\left(\mu C_{o x}(W / L)\right)} \\
I_{d 1}+I_{d 2}-2 \sqrt{I_{d 1} \cdot I_{d 2}} & =(1 / 2)\left(\mu C_{o x}(W / L)\right) \Delta V^{2} \\
I_{\text {bias }}-2 \sqrt{I_{d 1} \cdot I_{d 2}} & =(1 / 2)\left(\mu C_{o x}(W / L)\right) \Delta V^{2} \\
2 \sqrt{I_{d 1} \cdot I_{d 2}} & =I_{\text {bias }}-(1 / 2)\left(\mu C_{o x}(W / L)\right) \Delta V^{2} \\
2 \sqrt{I_{d 1} \cdot\left(I_{\text {bias }}-I_{d 1}\right)} & =I_{\text {bias }}-(1 / 2)\left(\mu C_{o x}(W / L)\right) \Delta V^{2} \\
-4 I_{d 1}^{2}+4 I_{d 1} \cdot I_{\text {bias }} & =I_{\text {bias }}^{2}-I_{\text {bias }}\left(\mu C_{o x}(W / L)\right) \Delta V^{2}+(1 / 4)\left(\mu C_{o x}(W / L)\right)^{2} \Delta V^{4}
\end{align*}
$$

$$
\begin{gathered}
-4 I_{d 1}^{2}+4 I_{d 1} \cdot I_{\text {bias }}=I_{\text {bias }}^{2}-I_{\text {bias }} \cdot K \Delta V^{2}+(1 / 4) K^{2} \Delta V^{4}, \quad \text { where } K=\mu C_{o x}(W / L) \\
4 I_{d 1}^{2}-4 I_{d 1} \cdot I_{\text {bias }}+I_{\text {bias }}^{2}-I_{\text {bias }} K \Delta V^{2}+(1 / 4) K^{2} \Delta V^{4}=0
\end{gathered}
$$

$$
\begin{aligned}
I_{d 1} & =\left[4 I_{\text {bias }} \pm \sqrt{16 I_{\text {bias }}^{2}-16 I_{\text {bias }}^{2}+16 I_{\text {bias }} K \Delta V^{2}-4 K^{2} \Delta V^{4}}\right] / 8 \\
I_{d 1} & =\left[2 I_{\text {bias }} \pm \Delta V \sqrt{4 I_{\text {bias }} K-K^{2} \Delta V^{2}}\right] / 4 \\
I_{d 1} & =I_{\text {bias }} / 2 \pm(1 / 4) \Delta V \cdot 2 \sqrt{I_{\text {bias }} \cdot K} \sqrt{1-K \Delta V^{2} / 4 I_{\text {bias }}} \\
I_{d 1} & =I_{\text {bias }} / 2 \pm(\Delta V / 2) \sqrt{I_{\text {bias }} \cdot K} \sqrt{1-(\Delta V / 2)^{2} /\left(I_{\text {bias }} / K\right)}
\end{aligned}
$$

As logic-1 was applied to A+ and logic-0 to A-, the differential input $V_{i d}>0$, resulting in $I_{d 1}>I_{d 2}$.

$$
\begin{align*}
& I_{d 1}=I_{\text {bias }} / 2+(\Delta V / 2) \sqrt{\left(I_{\text {bias }} \cdot K\right)} \sqrt{\left[1-(\Delta V / 2)^{2} /\left(I_{\text {bias }} / K\right)\right]}  \tag{2.8}\\
& I_{d 2}=I_{\text {bias }} / 2-(\Delta V / 2) \sqrt{\left(I_{\text {bias }} \cdot K\right)} \sqrt{\left[1-(\Delta V / 2)^{2} /\left(I_{\text {bias }} / K\right)\right]} \tag{2.9}
\end{align*}
$$

At the biasing (quiescent) point, $V_{i d}=0$, resulting $I_{d 1}=I_{d 2}=I_{\text {bias }} / 2$. Therefore equation 2.7 yields

$$
\begin{aligned}
I_{b i a s} / 2 & =(1 / 2) \mu C_{o x}(W / L)\left(V_{g s 1}-V_{t}\right)^{2} \\
I_{b i a s} / 2 & =(K / 2) V_{o v}^{2} \\
K & =I_{b i a s} / V_{o v}^{2}
\end{aligned}
$$

Plugging in the value of K in equation 2.8 we get

$$
\begin{align*}
& I_{d 1}=I_{b i a s} / 2+(\Delta V / 2)\left(I_{b i a s} / V_{o v}\right) \sqrt{\left[1-\left((\Delta V / 2) / V_{o v}\right)^{2}\right]}  \tag{2.10}\\
& I_{d 2}=I_{b i a s} / 2-(\Delta V / 2)\left(I_{b i a s} / V_{o v}\right) \sqrt{\left[1-\left((\Delta V / 2) / V_{o v}\right)^{2}\right]} \tag{2.11}
\end{align*}
$$

Now if we want to steer full current through T1, resulting $I_{d 1}=I_{b i a s}$ and $I_{d 2}=0$, then equation 2.10 yields

$$
\begin{align*}
I_{b i a s} & =I_{b i a s} / 2+(\Delta V / 2)\left(I_{\text {bias }} / V_{o v}\right) \sqrt{\left[1-\left((\Delta V / 2) / V_{o v}\right)^{2}\right]} \\
I_{b i a s} / 2 & =(\Delta V / 2)\left(I_{b i a s} / V_{o v}\right) \sqrt{\left[1-\left((\Delta V / 2) / V_{o v}\right)^{2}\right]} \\
1 & =\frac{\Delta V}{V_{o v}} \cdot \sqrt{1-\left(\frac{\Delta V}{2 V_{o v}}\right)^{2}} \\
& =\left(\frac{\Delta V}{V_{o v}}\right)^{2}-\frac{1}{4} \cdot\left(\frac{\Delta V}{V_{o v}}\right)^{4} \\
& =\left(\frac{\Delta V}{V_{o v}}\right)^{2}\left(1-\frac{\Delta V^{2}}{4 V_{o v}^{2}}\right) \\
\left(\frac{V_{o v}}{\Delta V}\right)^{2} & =\frac{4 V_{o v}^{2}-\Delta V^{2}}{4 V_{o v}^{2}} \\
4 V_{o v}^{4} & =4 V_{o v}^{2} \cdot \Delta V^{2}-\Delta V^{4} \\
\left(2 V_{o v}^{2}-\Delta V^{2}\right)^{2} & =0 \\
\Delta V & = \pm \sqrt{2} V_{o v} \tag{2.12}
\end{align*}
$$

This means that for a swing of $\Delta \mathrm{V}=+\sqrt{2} V_{o v}$ one of the transistors turns fully on, pushing the other one off, and at $\Delta \mathrm{V}=-\sqrt{2} V_{o v}$ the roles are reversed, as indicated in Figure 2.4.


Figure 2.4: Normalized current for CML inverter
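The current steering described by Equations 2.10 to 2.12 can be explored numerically. The sketch below is a minimal square-law ($\alpha=2$) model rather than the simulated device behaviour; the function name is hypothetical, and the clamping outside $|\Delta V| \leq \sqrt{2} V_{o v}$, where the square-law expression no longer applies because one transistor is off, is an assumption of this illustration.

```python
import math


def steering_currents(delta_v, v_ov, i_bias=1.0):
    """Square-law (alpha = 2) drain currents (I_d1, I_d2) of the differential pair.

    delta_v : differential input voltage V_g1 - V_g2, in volts
    v_ov    : overdrive voltage at the quiescent point, in volts
    i_bias  : tail (bias) current; 1.0 yields normalized currents
    """
    limit = math.sqrt(2) * v_ov      # Equation 2.12: full steering point
    if delta_v >= limit:             # assumption: clamp, T2 is fully off
        return i_bias, 0.0
    if delta_v <= -limit:            # assumption: clamp, T1 is fully off
        return 0.0, i_bias
    x = (delta_v / 2) / v_ov
    term = i_bias * x * math.sqrt(1 - x * x)   # second term of Eq. 2.10 / 2.11
    return i_bias / 2 + term, i_bias / 2 - term


V_OV = 0.2639  # overdrive voltage from the text, in volts
for dv_mv in (-400, -200, 0, 200, 400):
    i1, i2 = steering_currents(dv_mv / 1000, V_OV)
    print(f"dV = {dv_mv:5d} mV  I_d1/I_bias = {i1:.3f}  I_d2/I_bias = {i2:.3f}")
```

The two currents always sum to the bias current, split equally at zero differential input, and reach full steering at $\pm \sqrt{2} V_{o v}$ in this idealized model.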

At the biasing point shown earlier, $V_{o v}=263.9 \mathrm{mV}$; therefore our minimum logic swing from the biasing point will be $\Delta \mathrm{V} / 2= \pm \sqrt{2} \cdot 263.9 \mathrm{mV}= \pm 373.21 \mathrm{mV}$. Figure 2.4 shows that at +300 mV swing from the biasing point T1 passes all the bias current, and at -300 mV from the biasing point T2 completely turns on, resulting in a total logic swing $\Delta \mathrm{V}=600 \mathrm{mV}$. The 73 mV discrepancy occurs because it is a short channel device and $\alpha$ is not 2, but rather close to 1.25. Equation 2.12 can further be extended by expressing $V_{o v}$ in terms of bias current and W.

$$
\begin{align*}
\Delta V & = \pm \sqrt{2} V_{\text {ov }} \\
\Delta V & = \pm \sqrt{2} \sqrt{\frac{I_{\text {bias }}}{K}} \\
\Delta V & = \pm \sqrt{\frac{2 I_{\text {bias }}}{\mu C_{o x} \frac{W}{L}}} \tag{2.13}
\end{align*}
$$

This means that the larger the $\mathrm{W} / \mathrm{L}$ ratio, the faster we can move through the switching region. However, making W larger increases the parasitic capacitance, and at higher operating frequencies the RC constant will dominate.

In order to achieve faster full swing, $\Delta \mathrm{V}=I_{\text {bias }} * R_{L}$, we can increase bias current to reduce $R_{L}$. Bias current was tuned to 1.65 mA for $\mathrm{W}=7 \mu \mathrm{~m}$ and $\mathrm{L}=120 \mathrm{~nm}$, which is $70 \%$ of peak $f_{t}$ current, and load resistance $R_{L}$ was tuned to $348 \Omega$.

The input gate capacitance of each basic CML component is $C_{g g}=8 \mathrm{fF}$. The wire delay is stated to be $0.10 \mathrm{ps} / \mu \mathrm{m}$ [22]. Assuming $100 \mu \mathrm{~m}$ of wire is required to connect to the next stage, 10 ps of delay should be added, and typically 1 fF is necessary to mimic 1 ps of delay. Therefore $10+8=18 \mathrm{fF}$ was added as the output load capacitance $\left(C_{L}\right)$.
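The load capacitance estimate above can be written as a small helper so that other wire lengths are easy to try. This is a sketch of the rule of thumb stated in the text; the function and parameter names are hypothetical.

```python
def load_capacitance_ff(gate_cap_ff, wire_len_um, delay_ps_per_um=0.10, ff_per_ps=1.0):
    """Estimate the output load capacitance in fF.

    The wire is modelled by its delay (length times delay per micrometre)
    and converted to an equivalent capacitance using the rule of thumb of
    roughly 1 fF per 1 ps, then added to the next stage's gate capacitance.
    """
    wire_delay_ps = wire_len_um * delay_ps_per_um
    return gate_cap_ff + wire_delay_ps * ff_per_ps


c_load_ff = load_capacitance_ff(gate_cap_ff=8.0, wire_len_um=100.0)
print(f"C_L = {c_load_ff:.0f} fF")
```

For the 8 fF gate capacitance and $100 \mu \mathrm{~m}$ wire assumed in the text, this reproduces the 18 fF load.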

Propagation delay, $P_{d}=0.69 \mathrm{RC}$, where C is the capacitance looking into the drain of T1/T2.

$$
\begin{align*}
P_{d} & =0.69 R_{L}\left(C_{\text {intr }}+C_{L}\right) \\
& =0.69 R_{L}\left(C_{g d 1}+C_{d b 1}+C_{L}\right) \\
& =0.69 \frac{\Delta V}{I_{b i a s}}\left(C_{g d 1}+C_{d b 1}+C_{L}\right) \tag{2.14}
\end{align*}
$$



Figure 2.5: CML inverter half-circuit small-signal model

Here $C_{\text {intr }}$ is the intrinsic capacitance, which equals the gate-to-drain capacitance $\left(C_{g d 1}\right)$ plus the drain-to-bulk capacitance $\left(C_{d b 1}\right)$, as shown in Figure 2.5. Equation 2.14 means that the higher the bias current, the lower the propagation delay. However, the bias current is increased by increasing W, and at higher frequencies a further increase in bias current will not reduce the propagation delay because parasitic capacitance will dominate. Hence, the optimized widths are $7 \mu \mathrm{~m}$ for the upper level transistors and $4 \mu \mathrm{~m}$ for the current source, all with a channel length of 120 nm and a bias current $I_{\text {bias }}=1.65 \mathrm{~mA}$.
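Equation 2.14 can be evaluated with a short first-order estimate. The sketch below assumes the simple $0.69 R C$ model only; the function name is hypothetical, and because the text does not give values for $C_{g d 1}$ and $C_{d b 1}$, the intrinsic capacitance is set to zero here as a placeholder, so the result is a lower bound rather than a prediction of the simulated delay.

```python
def propagation_delay_ps(swing_v, i_bias_a, c_intr_f, c_load_f):
    """First-order delay from Equation 2.14, returned in picoseconds.

    swing_v  : logic swing delta_V in volts
    i_bias_a : bias current in amperes (R_L = swing / I_bias)
    c_intr_f : intrinsic capacitance C_gd1 + C_db1 in farads
    c_load_f : external load capacitance C_L in farads
    """
    r_load = swing_v / i_bias_a
    return 0.69 * r_load * (c_intr_f + c_load_f) * 1e12


# C_intr = 0 is a placeholder assumption; the real value is not given in the text.
base = propagation_delay_ps(0.6, 1.65e-3, 0.0, 18e-15)
doubled_bias = propagation_delay_ps(0.6, 3.3e-3, 0.0, 18e-15)
print(f"delay at 1.65 mA: {base:.2f} ps, at 3.30 mA: {doubled_bias:.2f} ps")
```

At fixed capacitance, doubling the bias current halves the estimated delay. In practice the larger W needed for that current raises $C_{\text {intr }}$, which this sketch deliberately leaves out.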

All these intuitive design considerations were included in designing CML logic circuits that can achieve maximum operating frequency of 12 GHz with $0.12 \mu \mathrm{~m}$ feature size. Supply voltage used was $V_{d d}=2.8 \mathrm{~V}$, logic- $1=2.8 \mathrm{~V}$, logic- $0=2.2 \mathrm{~V}$, constant biasing voltage, $V_{g}=$ 1 V to bias the current source and logic swing $\Delta \mathrm{V}=600 \mathrm{mV}$.

Figure 2.6 shows post vs. pre layout simulation of a CML inverter operating at 12 GHz (input changes at 83 ps) with 18 fF input/output load capacitance. The rising delay of the inverter in circuit level simulation is 17.47 ps and the extracted layout delay is 24.79 ps. The power consumption of the CML inverter is $P=V_{dd} \cdot I_{\text{bias}}=2.8 \mathrm{V} \times 1.65 \mathrm{mA}=4.62 \mathrm{mW}$, which is constant. Operating this optimized CML inverter at 1 MHz and at 12 GHz will dissipate the same amount of power. Therefore power consumption is independent of operating frequency. Figure 2.6 also shows that the layout simulation differs from the circuit level simulation by 7.32 ps. Layout of the CML inverter was also performed and the reported area is $15.960 \mu \mathrm{m} \times 24.450 \mu \mathrm{m}$, as indicated in Figure 2.7.

Figure 2.6: Post vs. pre layout simulation of CML inverter with 18 fF input/output load capacitance (input changes at 83 ps)


Figure 2.7: CML inverter layout

### 2.2 CML Universal Gate

A universal CML gate that can realize AND, NAND, OR and NOR functions is represented in Figure 2.8. With the input logic combination indicated in Figure 2.8, the universal gate realizes an AND function. Reversing the output realizes a NAND function. Reversing all the inputs and outputs realizes an OR function, and thus a NOR can be realized as well.


Figure 2.8: Universal CML gate

Careful consideration was made to choose $V_{in}=2.5 \mathrm{V}$ and $\Delta \mathrm{V}=600 \mathrm{mV}$, leading to Logic-1 $=2.8 \mathrm{V}$ and Logic-0 $=2.2 \mathrm{V}$, as shown earlier in the CML inverter section. The biasing conditions are the same as before. If we assume Logic-1 is applied to A+ and Logic-0 is applied to A-, then,

$$
\begin{gathered}
V_{g s 1}>V_{t h 1} ; V_{g d 1}<V_{t h 1} ; V_{d s 1}>V_{g s 1}-V_{t h 1} \\
V_{g s 2}<V_{t h 2} ; V_{g d 2}<V_{t h 2}
\end{gathered}
$$

For input B, the levels have to be shifted down by at least $V_{DS}$ so that the conditions $V_{GD3}<V_{th}$ and $V_{GD4}<V_{th}$ are met when T3 and T4 operate (are in the forward active region). Therefore Logic-1 for input B is $V_{in}+\Delta \mathrm{V}/2-V_{DS}$ and Logic-0 is $V_{in}-\Delta \mathrm{V}/2-V_{DS}$.

Due to the discrepancy of logic levels for the second input, a level shifter is necessary and has been embedded within each gate, as indicated in Figure 2.9. The previous power of the CML gate was $V_{dd} \cdot I_{\text{bias}}$ and now with the level shifter it is $V_{dd} \cdot 3 I_{\text{bias}}$; total power increased by a factor of 3.

Figure 2.9: Universal CML gate with embedded level shifter

T7 and T8 have to be the same size as T1 and T2 for matching purposes, so that T7/T8 can mimic the voltage drop $V_{DS}$ across T1/T2. To balance the differential architecture, T5 is necessary alongside T4, because T3 sees either T1 or T2 as a load; T5 is also used for preventing breakdown. T6, T9 and T10 are of the same size, and a constant biasing voltage ($V_{g}$) is applied so that each acts as a current source.

There are three paths in the universal CML gate, and it is necessary to make sure that the total bias current ($I_{\text{bias}}$) flows through exactly one of these paths from $V_{dd}$ to $V_{ss}$, depending on the input logic combination. Splitting of $I_{\text{bias}}$ is not acceptable because a full swing could then not be realized.

Table 2.2 shows the path that will be turned on, depending on input logic combination and realization of the CML AND function. It is notable that the entry at $B+/ B$ - is unshifted logic. It is applied to input B at the level shifter and eventually it gets shifted down by an amount $V_{D S}$ and propagates to input B of CML AND. It is indicated in Table 2.2 that lowest level ( $2^{\text {nd }}$ level, input B) dominates in path selection.

| A+ | A- | B+ | B- | Out+ | Out- | T off | T on | Path |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}$ | 1, 2, 3 | 5, 4, 6 | T5, T4, T6 |
| $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}-\Delta V$ | $V_{dd}$ | 1, 4, 5 | 2, 3, 6 | T2, T3, T6 |
| $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}$ | 1, 2, 3 | 5, 4, 6 | T5, T4, T6 |
| $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}-\Delta V$ | $V_{dd}$ | $V_{dd}-\Delta V$ | 2, 4, 5 | 1, 3, 6 | T1, T3, T6 |

Table 2.2: CML AND Operation
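The path selection in Table 2.2 can be summarized as a small behavioural model. The Python sketch below is only an abstraction of the table (logic values 0 and 1 stand for $V_{dd}-\Delta V$ and $V_{dd}$); the function name is illustrative and no electrical behaviour is modelled.

```python
def and_gate_path(a, b):
    """Return (out_plus, conducting path) for logic inputs a, b in {0, 1}.

    Mirrors Table 2.2: the lowest level (input B) dominates path selection.
    """
    if b == 0:
        # B low: current is steered through T5/T4 regardless of A.
        return 0, ("T5", "T4", "T6")
    if a == 0:
        # B high, A low: T2 conducts and pulls Out+ low.
        return 0, ("T2", "T3", "T6")
    # B high, A high: T1 conducts and Out+ stays high.
    return 1, ("T1", "T3", "T6")


for a in (0, 1):
    for b in (0, 1):
        out_plus, path = and_gate_path(a, b)
        print(f"A={a} B={b} -> Out+={out_plus} via {', '.join(path)}")
```

In every case exactly one path carries the bias current, which is the condition the text gives for a full logic swing.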

### 2.2.1 Universal CML Gate Optimization

When a path turns on, it is necessary to steer all the bias current $\left(I_{\text{bias}}\right)$ into that path, as stated before, to realize the full logic swing $\Delta \mathrm{V}=R_{L} \cdot I_{\text{bias}}$. The more bias current we have, the faster the swing is realized. Intuitive techniques to optimize the CML gate have been shown in the CML inverter optimization subsection. Using 130 nm CMOS technology, the highest operating frequency obtained for the CML inverter is 12 GHz, with upper transistors of $\mathrm{W}=7 \mu \mathrm{m}$, $\mathrm{L}=120 \mathrm{nm}$ and a bias current of 1.65 mA, for which the current source size is $\mathrm{W}=4 \mu \mathrm{m}$, $\mathrm{L}=120 \mathrm{nm}$. As this operating frequency cannot be exceeded for the chosen parameters (logic swing $\Delta \mathrm{V}=600 \mathrm{mV}$, supply voltage $V_{dd}=2.8 \mathrm{V}$, logic-1 $=2.8 \mathrm{V}$, logic-0 $=2.2 \mathrm{V}$), the same size has been used for the upper transistors of the CML universal gate. In this case, however, the second input level contributes more parasitic capacitance, so the bias current has been increased slightly from 1.65 mA to 1.753 mA (for which the current source size is $\mathrm{W}=7 \mu \mathrm{m}$, $\mathrm{L}=120 \mathrm{nm}$) to operate at 12 GHz.

For upper level (and $2^{\text{nd}}$ level) transistors, around $V_{ov}=200 \mathrm{mV}$ has been provided so that a 300 mV swing from the biasing point can turn the transistors off and on. It should be noted that, at the quiescent point, $I_{r1} \neq I_{r2}$, as indicated in Figure 2.10. The reason is that Out+ has 2 paths at the Q point but Out- has 1 path to $V_{ss}$. Therefore, typically $I_{r1}=\frac{1}{3} I_{r2}$, as observed in Figure 2.10. It is shown that $I_{r1}=517.5 \mu \mathrm{A}$ and $I_{r2}=1.238 \mathrm{mA}$ out of a total bias current of 1.753 mA. Therefore, initially, Out+ tends to be close to logic-0 (2.2 V), at 2.371 V, and Out- tends to be close to logic-1 (2.8 V), at 2.62 V, at the Q point. As can be seen in the normalized current plot of the universal CML gate in Figure 2.10, $I_{d1} \neq I_{d2}$ at the biasing point. But full current steering was made possible for a -300 mV to +300 mV swing, which is our objective.

Figure 2.10: Normalized current for CML universal gate


Figure 2.11: Post vs. pre layout simulation of CML AND with 18 fF input/output load capacitance (input changes at 83ps)

Figure 2.11 shows post vs. pre layout simulation of the CML AND gate at 12 GHz (input changes at 83 ps). A load capacitance $C_{L}=18 \mathrm{fF}$ was added to mimic the next-stage input capacitance $C_{gg}=8 \mathrm{fF}$ plus 10 fF for the $100 \mu \mathrm{m}$ wire connecting the next stage, as for the CML inverter [22]. The circuit level simulation shows a rising delay of 35.6 ps, whereas the extracted layout simulation has a rising delay of 37.6 ps. The bias current is $I_{\text{bias}}=1.753 \mathrm{mA}$ and two other current sources supply 2.019 mA each, for a total power consumption of $2.8 \mathrm{V} \times(1.753+2.019+2.019) \mathrm{mA}=16.2148 \mathrm{mW}$.
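Because a CML gate draws a constant current, its static power is simply the supply voltage times the sum of its branch currents. The short Python sketch below reproduces the inverter and AND gate figures from the text; the helper name is illustrative.

```python
def static_power_mw(v_dd, branch_currents_ma):
    """Static power in mW: supply voltage (V) times total branch current (mA)."""
    return v_dd * sum(branch_currents_ma)


V_DD = 2.8  # supply voltage (V)

# Inverter: a single current source.
inverter_mw = static_power_mw(V_DD, [1.65])

# AND gate: main bias current plus the two level-shifter current sources.
and_gate_mw = static_power_mw(V_DD, [1.753, 2.019, 2.019])

print(f"inverter = {inverter_mw:.4f} mW, AND = {and_gate_mw:.4f} mW")
```

The same helper applies to the other gates in this chapter by substituting their quoted branch currents.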


Figure 2.12: CML AND layout

The reported area of CML AND gate is $30.150 \mu \mathrm{~m} \times 42.900 \mu \mathrm{~m}$, as indicated in Figure 2.12 and this area should be the same for CML NAND/OR/NOR. Layouts of all these gates have been performed and have been simulated but not included because their architectures are not different from CML AND, just the reverse of inputs or outputs.

### 2.3 CML XOR/XNOR Gate

A CML XOR gate is shown in Figure 2.13 with an embedded level shifter. Table 2.3 briefly describes CML XOR operation and paths.

Logic-1 is represented by $V_{in}+\Delta \mathrm{V}/2=2.8 \mathrm{V}$ and logic-0 by $V_{in}-\Delta \mathrm{V}/2=2.2 \mathrm{V}$, where $V_{in}=2.5 \mathrm{V}$ is the biasing (quiescent) voltage and $\Delta \mathrm{V}=600 \mathrm{mV}$ is the logic swing. The bias current is $I_{\text{bias}}=1.757 \mathrm{mA}$ and two other current sources supply 2.019 mA each, giving a total power consumption of $2.8 \mathrm{V} \times(1.757+2.019+2.019) \mathrm{mA}=16.226 \mathrm{mW}$. The power consumption of the CML XOR is therefore practically the same as that of the CML AND gate (16.2148 mW), which is impractical in a CMOS realization but possible in the CML architecture.

Figure 2.13: CML XOR gate

| A+ | A- | B+ | B- | Out+ | Out- | T on | T off |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 0 | 1 | 0 | 1 | $7,5,1$ | $2,3,4,6$ |
| 0 | 1 | 1 | 0 | 1 | 0 | $7,5,2$ | $1,3,4,6$ |
| 1 | 0 | 0 | 1 | 1 | 0 | $7,6,4$ | $1,2,3,5$ |
| 1 | 0 | 1 | 0 | 0 | 1 | $7,6,3$ | $1,2,4,5$ |

Table 2.3: CML XOR Operation

For the 00 combination, T1, T5 and T7 turn on and $I_{\text{bias}}$ flows down to $V_{ss}$ through them, so all other paths are turned off. For the 01 combination, either T2 or T3 will turn on, depending on A. As A+ is 0 and A- is 1, T5 will be on and T6 will be off, and the path will be T2, T5 and T7.

As the lowest level dominates the path, for the 10 combination T6 will be on, and as B+ is 0 and B- is 1, T4 will turn on and the path will be T4, T6 and T7. For the 11 combination, T3, T6 and T7 will be turned on.
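The four cases described above can be collected into a small behavioural model. The Python sketch below follows the paths given in the text and in Table 2.3; it is an abstraction with an illustrative function name, not a circuit simulation.

```python
def xor_gate_path(a, b):
    """Return (out_plus, conducting transistors) for logic inputs a, b in {0, 1}."""
    if a == 0:
        lower = "T5"                      # A+ low: T5 on, T6 off
        upper = "T1" if b == 0 else "T2"
    else:
        lower = "T6"                      # A+ high: T6 on, T5 off
        upper = "T4" if b == 0 else "T3"
    # Per Table 2.3, Out+ is high when the path runs through T2 or T4.
    out_plus = 1 if upper in ("T2", "T4") else 0
    return out_plus, (upper, lower, "T7")


for a in (0, 1):
    for b in (0, 1):
        out_plus, path = xor_gate_path(a, b)
        print(f"A={a} B={b} -> Out+={out_plus} via {', '.join(path)}")
```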

Figure 2.14 shows post vs. pre layout simulation of the CML XOR gate with 18 fF input/output load capacitance at 12 GHz (input changes at 83 ps). It is observed that the circuit level rise delay is 33.7 ps whereas the extracted layout delay is 41.5 ps. Therefore, a 7.8 ps discrepancy exists between circuit level and device level simulation.

Figure 2.14: Post vs. pre layout simulation of CML XOR with 18 fF input/output load capacitance (input changes at 83 ps)


Figure 2.15: CML XOR layout

Reported area of CML XOR is $28.590 \mu \mathrm{~m} \times 45.630 \mu \mathrm{~m}$ as indicated in Figure 2.15.

### 2.4 CML Mux Realization

A CML mux has been realized at the transistor level, as indicated in Figure 2.16. The architecture is very similar to the XOR, and the power dissipation is the same as that of the CML XOR gate (16.226 mW). Realizing a CML mux at the transistor level reduces power by at least $\frac{1}{3}$ compared with realizing the mux by cascading CML AND gates. The transistor-level mux therefore has a twofold benefit: one is less power and the other is delay reduced by $\frac{1}{2}$. Table 2.4 describes CML mux operation.
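To put the power and delay claims in perspective, the sketch below compares the transistor-level mux with a gate-level alternative using the per-gate figures reported in this chapter. The gate-level structure (two AND gates feeding one OR gate, with the complemented select taken for free from the differential signal) is an assumption made for illustration; the thesis does not spell out the cascaded design.

```python
# Per-gate post-layout figures as reported in this chapter.
POWER_MW = {"AND": 16.2148, "OR": 16.2148, "MUX": 16.226}
DELAY_PS = {"AND": 37.6, "OR": 46.09, "MUX": 33.99}

# ASSUMED gate-level mux: out = (A AND not S) OR (B AND S).
gate_level_power_mw = 2 * POWER_MW["AND"] + POWER_MW["OR"]
gate_level_delay_ps = DELAY_PS["AND"] + DELAY_PS["OR"]  # AND followed by OR

power_ratio = POWER_MW["MUX"] / gate_level_power_mw
delay_ratio = DELAY_PS["MUX"] / gate_level_delay_ps

print(f"power: {POWER_MW['MUX']:.3f} mW vs {gate_level_power_mw:.3f} mW "
      f"(ratio {power_ratio:.2f})")
print(f"delay: {DELAY_PS['MUX']:.2f} ps vs {gate_level_delay_ps:.2f} ps "
      f"(ratio {delay_ratio:.2f})")
```

Under this assumption the transistor-level mux uses about one third of the power and well under half of the delay of the cascaded version.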


Figure 2.16: CML Mux realization

| $\mathrm{B}+(\mathrm{V})$ | $\mathrm{A}+(\mathrm{V})$ | $\mathrm{S}+(\mathrm{V})$ | Out+(V) | T on | T off |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 2.2 | 2.2 | 2.2 | 2.2 | 5,2 | $6,1,3,4$ |
| 2.2 | 2.8 | 2.2 | 2.8 | 5,1 | $6,2,3,4$ |
| 2.8 | 2.2 | 2.2 | 2.2 | 5,2 | $6,1,3,4$ |
| 2.8 | 2.8 | 2.2 | 2.8 | 5,1 | $6,2,3,4$ |
| 2.2 | 2.2 | 2.8 | 2.2 | 6,4 | $5,1,2,3$ |
| 2.2 | 2.8 | 2.8 | 2.2 | 6,4 | $5,1,2,3$ |
| 2.8 | 2.2 | 2.8 | 2.8 | 6,3 | $5,1,2,4$ |
| 2.8 | 2.8 | 2.8 | 2.8 | 6,3 | $5,1,2,4$ |

Table 2.4: CML Mux Operation

Figure 2.17 shows post vs. pre layout simulation of the CML mux with 18 fF input/output load capacitance at 12 GHz (input changes at 83 ps). It is observed that the circuit level rise delay is 28.01 ps whereas the extracted layout delay is 33.99 ps. Therefore, a 5.98 ps discrepancy exists between circuit level and device level simulation.


Figure 2.17: Post vs. pre layout simulation of CML Mux with 18fF input/output load capacitance (input changes at 83ps)


Figure 2.18: CML Mux layout

The reported area for the CML Mux is $36.510 \mu \mathrm{~m} \times 47.520 \mu \mathrm{~m}$, as indicated in Figure 2.18. Total power consumption of the CML Mux is $2.8 \mathrm{~V} *(2.019+2.019+1.757) \mathrm{mA}=$ 16.226 mW .

### 2.5 CML D-latch Realization

A CML D-latch has been realized at the transistor level, as indicated in Figure 2.19, with four levels of hierarchy and a reset input, rather than by cascading CML AND gates, in contrast to CMOS architecture. This architecture not only reduces power consumption but also reduces delay. Notably, its power consumption is less than that of the CML AND gate, which cannot be achieved in CMOS architecture. The lowest level dominates in selecting upper level paths. RST+ $=1.3 \mathrm{V}$ (RST- $=0.7 \mathrm{V}$) turns on T9 (T7 off) and ties Q+ to $V_{ss}$, irrespective of any input combination.

Figure 2.19: CML D-latch

| CLK+ (V) | $\mathrm{D}+(\mathrm{V})$ | RST $+(\mathrm{V})$ | $Q_{t+1}+(\mathrm{V})$ | T on | T off |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 2.2 | 2.2 | 2.2 | $Q_{t}+$ | $8,7,6,(3 / 4)$ | $9,5,(4 / 3), 1,2,10,11$ |
| 2.2 | 2.8 | 2.2 | $Q_{t}+$ | $8,7,6,(3 / 4)$ | $9,5,(4 / 3), 1,2,10,11$ |
| 2.8 | 2.2 | 2.2 | 2.2 | $8,7,5,2$ | $9,6,1,3,4,10,11$ |
| 2.8 | 2.8 | 2.2 | 2.8 | $8,7,5,1$ | $9,6,2,3,4,10,11$ |
| x | x | 2.8 | 2.2 | $8,9,10,11$ | $1,2,3,4,5,6,7$ |

Table 2.5: CML D-latch Operation

Table 2.5 briefly describes the operation of the CML D-latch. Initially Q+ tends to be logic-0 (for the reason explained for the CML AND gate). If the clock is high, whichever data arrives at D+ is passed to the output Q+. Also, the present state drives the lower level transistors to retain their logic states when the clock is not high. Therefore, a reset at start-up is not necessary in the CML architecture.
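The latch behaviour in Table 2.5 reduces to three rules: reset dominates, a high clock makes the latch transparent, and a low clock holds the previous state. The following Python sketch is a purely logical model of those rules (0 and 1 stand for the low and high levels); the function name is illustrative.

```python
def d_latch_next(q, d, clk, rst):
    """Next value of Q+ for a level-sensitive D-latch with reset."""
    if rst:
        return 0      # reset forces Q+ low, irrespective of CLK and D
    if clk:
        return d      # transparent: D+ passes to Q+
    return q          # opaque: previous state is retained


# Apply a short stimulus sequence of (d, clk, rst) and record Q+.
q = 0
trace = []
for d, clk, rst in [(1, 0, 0), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1)]:
    q = d_latch_next(q, d, clk, rst)
    trace.append(q)
print(trace)
```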


Figure 2.20: Post vs. pre layout simulation of CML D-latch with 18 fF input/output load capacitance at 6 GHz (166ps)

Figure 2.20 shows post vs. pre layout simulation of the CML D-latch with 18 fF input/output load capacitance at 6 GHz. It is observed that the circuit level rise delay is 56.813 ps whereas the extracted layout delay is 67.367 ps. Therefore, a 10.554 ps discrepancy exists between circuit level and device level simulation.


Figure 2.21: CML D-latch layout

Power consumption of the CML D-latch is $2.8 \mathrm{V} \times(2.019+2.019+1.431) \mathrm{mA}=15.3132 \mathrm{mW}$. The reported area for the CML D-latch is $29.190 \mu \mathrm{m} \times 57.780 \mu \mathrm{m}$, as indicated in Figure 2.21.

### 2.6 Speed, Power, Area and Delay of Basic CML Components

Table 2.6 summarizes the post-layout simulation power, area and delay of each basic CML component with an 18 fF load capacitance (10 fF for the wire and 8 fF for the next-stage $C_{gg}$). In later chapters, higher level circuits are realized from these basic components to develop a CML processor datapath. Therefore, in measuring the theoretical worst case delay it has been assumed that delay is additive, meaning that if the output of a CML AND gate drives the input of a CML Mux, the total delay is the sum of the two. The total delay of any complex component in simulation is found to match the value obtained by counting the basic components in the critical path and assigning the corresponding delay to each of them.

| Component | Power (mW) | Area ($\mu \mathrm{m} \times \mu \mathrm{m}$) | Delay (ps) |
| :---: | :---: | :---: | :---: |
| Inverter | 4.62 | $15.96 \times 24.45$ | 24.79 |
| AND/NAND | 16.2148 | $30.15 \times 42.90$ | 37.6 |
| OR/NOR | 16.2148 | $30.15 \times 42.90$ | 46.09 |
| XOR/XNOR | 16.226 | $28.59 \times 45.63$ | 41.5 |
| CML 2-to-1 Mux | 16.226 | $36.51 \times 47.52$ | 33.99 |
| CML D-latch | 15.3132 | $29.19 \times 57.78$ | 67.367 |

Table 2.6: Power, Area and Delay of Basic Components (post layout simulation with 18 fF load capacitance)

Layout of the processor datapath has not been performed; only circuit level simulation has been carried out. It is observed that post vs. pre layout simulation differs by about 10 ps at most, and the wire delay is 10 ps (for a $100 \mu \mathrm{m}$ wire connecting the next stage, assuming a wire delay of $0.10 \mathrm{ps} / \mu \mathrm{m}$) [22]. Therefore a 20 fF load capacitance was added inside every CML logic gate in the datapath simulation to mimic post layout plus wire delay (in intermediate stages $C_{gg}=8 \mathrm{fF}$ is already accounted for by the simulation tool). Area was predicted as the number of gates multiplied by the area of each gate, plus $50 \%$ area for wiring.
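The additive delay model and the area estimate described above are easy to express in code. The Python sketch below uses the figures from Table 2.6; the dictionary keys, function names and the example gate counts are illustrative and do not correspond to an actual datapath block.

```python
# Post-layout delay (ps) and footprint (um x um) per basic gate, from Table 2.6.
DELAY_PS = {"INV": 24.79, "AND": 37.6, "OR": 46.09,
            "XOR": 41.5, "MUX": 33.99, "LATCH": 67.367}
SIZE_UM = {"INV": (15.96, 24.45), "AND": (30.15, 42.90), "OR": (30.15, 42.90),
           "XOR": (28.59, 45.63), "MUX": (36.51, 47.52), "LATCH": (29.19, 57.78)}


def critical_path_delay_ps(path):
    """Additive model: total delay is the sum of the gate delays along the path."""
    return sum(DELAY_PS[gate] for gate in path)


def estimated_area_um2(gate_counts, wiring_overhead=0.5):
    """Sum of gate areas plus a fractional wiring overhead (50% in the text)."""
    gates_area = sum(count * SIZE_UM[gate][0] * SIZE_UM[gate][1]
                     for gate, count in gate_counts.items())
    return gates_area * (1.0 + wiring_overhead)


# Example from the text: an AND gate driving a mux.
print(f"AND -> MUX delay: {critical_path_delay_ps(['AND', 'MUX']):.2f} ps")
# Hypothetical block containing four muxes and two latches.
print(f"area: {estimated_area_um2({'MUX': 4, 'LATCH': 2}):.1f} um^2")
```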

## Chapter 3

Datapath

A multi-cycle 16-bit processor datapath has been designed using a RISC architecture that can execute 15 different operations and a 4 -bit opcode has been used. Three different types of instructions can be performed and instruction set architecture is given in Table 3.1.

| Address bus (16-bit) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $15-12$ | $11-8$ | $7-4$ | $3-0$ |  |
| Opcode R-type | Rd | Rs | Rt |  |
| Opcode I-type | Rd | Rs | Immediate operand |  |
| Opcode J-type | Address |  |  |  |

Table 3.1: Instruction Set Architecture
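The field layout in Table 3.1 can be illustrated with a small decoder. The Python sketch below slices a 16-bit instruction word into the fields listed in the table; the dictionary keys and the example encoding are illustrative.

```python
def decode(word):
    """Split a 16-bit instruction word into the fields of Table 3.1."""
    word &= 0xFFFF
    return {
        "opcode": (word >> 12) & 0xF,   # bits 15-12
        "rd": (word >> 8) & 0xF,        # bits 11-8
        "rs": (word >> 4) & 0xF,        # bits 7-4
        "rt_or_imm4": word & 0xF,       # bits 3-0: Rt (R-type) or 4-bit immediate
        "imm8": word & 0xFF,            # bits 7-0: 8-bit immediate operand
        "addr12": word & 0xFFF,         # bits 11-0: 12-bit jump address
    }


# Opcode 0000 with Rd=3, Rs=1, Rt=2 encodes as 0000 0011 0001 0010.
fields = decode(0x0312)
print(fields)
```

Which of the overlapping low-order fields is meaningful depends on the instruction type selected by the opcode.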


Figure 3.1: Processor datapath

The processor datapath is shown in Figure 3.1, where all the components have been realized in CML logic except the cache memory and control unit. Correct operation of the datapath has been verified by providing external control stimulus in every clock cycle and observing/providing data to/from memory. Fifteen different operations that can be performed are listed in Table 3.2. The Processor datapath does not contain any novel approach. However this datapath can be re-structured such that power consumption will be less and operating frequency can be increased.

| Type | Opcode | Description | Cycle |
| :---: | :---: | :---: | :---: |
| R-type | 0000 ADD | Rd=Rs+Rt | 4 |
|  | 0001 SUB | $\mathrm{Rd}=\mathrm{Rs}$-Rt | 4 |
|  | 0010 AND | $\mathrm{Rd}=\mathrm{Rs} \bullet$ Rt | 4 |
|  | 0011 XOR | $\mathrm{Rd}=\mathrm{Rs} \oplus \mathrm{Rt}$ | 4 |
|  | 0100 SLT | $\mathrm{Rd}=1$ if $\mathrm{Rs}<$ Rt else $\mathrm{Rd}=0$ | 4 |
|  | 0101 SEQ | $\mathrm{Rd}=1$ if Rs=Rt else $\mathrm{Rd}=0$ | 4 |
| I-type | 0110 LW | $\mathrm{Rd}=[\mathrm{Rs}+\mathrm{n}] ; \mathrm{n}=4 \mathrm{bit}$ | 5 |
|  | 0111 SW | [Rs+n]=Rd; n=4bit | 4 |
|  | 1000 ADDI | $\mathrm{Rd}=\mathrm{Rs}+\mathrm{n}$; $\mathrm{n}=4 \mathrm{bit}$ | 4 |
|  | 1001 MOVI | $\mathrm{Rd}=\mathrm{n} ; \mathrm{n}=8 \mathrm{bit}$ | 4 |
| J-type | 1010 J LABEL | $\mathrm{PC}=$ LABEL; $\mathrm{n}=12 \mathrm{bit}$ | 2 |
|  | 1011 JZ | $\mathrm{PC}=\mathrm{LABEL}$ if $\mathrm{Rd}=0 ; \mathrm{n}=8 \mathrm{bit}$ | 4 |
|  | 1100 JNZ | $\mathrm{PC}=\mathrm{LABEL}$ if $\mathrm{Rd} \neq 0 ; \mathrm{n}=8$ bit | 4 |
|  | 1101 JAL | \$RA $=\mathrm{PC}+1$ \&PC=LABEL; $\mathrm{n}=12 \mathrm{bit}$ | 3 |
|  | 1110 JR | $\mathrm{PC}=\$ \mathrm{RA}$; least 12-bit unused | 2 |

Table 3.2: Opcode and 15 Different Operations

The datapath contains a 16-bit program counter register, a 16-bit instruction register, a 16x16 register file, a 16-bit ALU, a 16-bit PC+1 register and a 16-bit ALU_out register. PC, IR, ALU_out and PC+1 are 16-bit registers with an enable input that controls the write operation. The 16-bit ALU contains 16-bit CLA, BLA, AND and XOR blocks. In the 16x16 register file, the 4-bit inputs rr1, rr2 and rr3 can each address any of the 16 registers (16 bits each), and the content of the addressed register is propagated to the 16-bit outputs rd1, rd2 and rd3 through three 16-bit 16-to-1 muxes. The 4-bit wrreg input addresses the write register and the 1-bit reg_wr input enables the write operation. It is assumed that memory can be read for mem_en = 1 and rd_wr = 0, and written for mem_en = 1 and rd_wr = 1. It takes 2-5 cycles to perform any of the 15 instructions, as listed in Table 3.2 with brief descriptions.
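The cycle counts in Table 3.2 give a rough throughput estimate. The Python sketch below assumes a uniform instruction mix (each of the 15 instructions equally likely), which is an assumption made here for illustration, and uses the 3.33 GHz clock from the thesis title.

```python
# Cycles per instruction, as listed in Table 3.2.
CYCLES = {
    "ADD": 4, "SUB": 4, "AND": 4, "XOR": 4, "SLT": 4, "SEQ": 4,
    "LW": 5, "SW": 4, "ADDI": 4, "MOVI": 4,
    "J": 2, "JZ": 4, "JNZ": 4, "JAL": 3, "JR": 2,
}
CLOCK_HZ = 3.33e9  # operating frequency from the thesis title

# ASSUMPTION: uniform instruction mix.
average_cpi = sum(CYCLES.values()) / len(CYCLES)
mips = CLOCK_HZ / average_cpi / 1e6

print(f"average CPI = {average_cpi:.3f}, throughput = {mips:.0f} MIPS")
```

Under this uniform-mix assumption the estimate comes out at about 892 MIPS, which agrees with the performance figure quoted in the abstract.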

In Table 3.2 RA is register-15 implemented in hardware. Also, register-0, which contains 0 is implemented in hardware as well. The critical path delay is 5 clock cycles for the LW instruction and the path is $\mathrm{PC} \Rightarrow \mathrm{IR} \Rightarrow$ Reg_file $\Rightarrow$ ALU_out $\Rightarrow$ MEM $\Rightarrow$ Reg_file.

### 3.1 First Clock Cycle of Every Instruction

The first clock cycle is the same for every instruction and the selected path is shown in Figure 3.2, with control logic. As indicated in Figure 3.2, in the first cycle mem_en = 1 and fetch = 1, and data comes into the IR register, assuming that memory is fast enough to be read in one clock cycle. At the same time, A_sel = 01 and B_sel = 00 select the content of PC and the 1-bit value 1 (from the first input of the 16-bit 4-to-1 mux), and the ALU performs PC+1.


Figure 3.2: First clock cycle of any instruction

### 3.2 R-type Instruction

Six different types of instruction can be performed as R-type operations: ADD, SUB, AND, XOR, SLT and SEQ. The ADD, SUB, AND and XOR instructions perform addition, subtraction, bitwise AND and bitwise XOR respectively. SLT stores $R_{d}=1$ if $R_{s}<R_{t}$, else $R_{d}=0$ (it performs $R_{s}-R_{t}$ and checks the MSB). SEQ stores $R_{d}=1$ if $R_{s}=R_{t}$, else $R_{d}=0$ (it performs $R_{s}-R_{t}$ and checks the Z flag). It takes four cycles to perform any R-type instruction.

### 3.2.1 R-type ADD/SUB/AND/XOR

In the second cycle of the R-type ADD/SUB/AND/XOR instructions, A_sel = 10 and B_sel = 01 select rd1 and rd2, and the pc+1 = 1 signal enables PC+1 to be saved in the PC+1 register. Control signal func = 00, 01, 10 and 11 in the second cycle performs ADD, SUB, AND and XOR respectively. The third clock cycle saves the ALU result in the ALU_OUT register, with control signal alu_out = 1. Control signal jal = 1 passes the content of the ALU_OUT register through the 16-bit 2-to-1 mux. At the fourth clock cycle reg_wr = 1 enables a write to the register in the register file addressed by $R_{d}[11:8]=$ wrreg[3:0], selected by control signal jal_rd = 0. In the same cycle pc_wr = 1 enables the PC register to be updated at the end of the fourth clock cycle, ready for the next instruction. The second to fourth cycles of R-type ADD/SUB/AND/XOR are shown in Figure 3.3 with control logic.

Figure 3.3: R-type ADD/SUB/AND/XOR
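The four-cycle sequence for these instructions can be written down as one control word per cycle. The Python sketch below only lists the signals named in the text for each cycle (the first cycle is the common fetch cycle of Section 3.1); signals not mentioned in the text are omitted, and the dictionary representation is an illustrative choice rather than the thesis's control unit.

```python
# ALU function codes for the R-type arithmetic/logic instructions.
FUNC_CODES = {"ADD": "00", "SUB": "01", "AND": "10", "XOR": "11"}


def r_type_control_sequence(op):
    """Return the control signals asserted in cycles 1-4 for ADD/SUB/AND/XOR."""
    func = FUNC_CODES[op]
    return [
        # Cycle 1: fetch the instruction and compute PC+1.
        {"mem_en": 1, "fetch": 1, "A_sel": "01", "B_sel": "00"},
        # Cycle 2: select rd1/rd2, run the ALU, latch PC+1.
        {"A_sel": "10", "B_sel": "01", "func": func, "pc+1": 1},
        # Cycle 3: latch the ALU result and route it towards the register file.
        {"alu_out": 1, "jal": 1},
        # Cycle 4: write Rd and update PC.
        {"reg_wr": 1, "jal_rd": 0, "pc_wr": 1},
    ]


for cycle, signals in enumerate(r_type_control_sequence("SUB"), start=1):
    print(cycle, signals)
```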

### 3.2.2 R-type SLT

At the second cycle PC+1 (computed in the first cycle) gets updated in the PC+1 register, and A_sel = 10, B_sel = 01 select rd1 and rd2 as operands to the ALU, as indicated in Figure 3.4. The function of the ALU is func = 01, meaning it performs subtraction at the second cycle.


Figure 3.4: R-type SLT

At the third clock cycle the result is stored in the ALU_OUT register, with control signal alu_out = 1. Control signal jal = 1 passes the content of the ALU_OUT register through the mux and the MSB gets ANDed with 1. If $R_{s}<R_{t}$ then MSB = 1 and the AND result is 1; otherwise MSB = 0 and the AND result is 0. This result gets unsigned extension to 16-bit data. At the fourth clock cycle, reg_wr = 1 enables the addressed register to be written with the data, and PC is updated as well, as indicated in Figure 3.4.

### 3.2.3 R-type SEQ

At the second cycle PC+1 gets updated in the PC+1 register, rd1 and rd2 get selected as operands to the ALU, and the ALU performs the subtraction. At the third cycle the ALU result gets updated in ALU_OUT and control signal z_en = 1 enables the Z register to be updated. The 1-bit zero flag register (Z) holds 1 if the resultant bits are all zero and 0 otherwise. The content of the Z register (Z = 1 means rd1 = rd2, else rd1 $\neq$ rd2) is ANDed with a 1-bit 1 at the third cycle. This result gets unsigned extension to 16-bit data and is fed to the fourth input of the 16-bit 4-to-1 mux. At the fourth clock cycle reg_src = 11 selects that $4^{\text{th}}$ input of the mux and reg_wr = 1 enables the write operation to the register file addressed by $R_{d}[11:8]=$ wrreg[3:0], which is selected by control signal jal_rd = 0, as shown in Figure 3.5.

Figure 3.5: R-type SEQ
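Both SLT and SEQ reuse the ALU subtraction and then inspect a single bit of state. The Python sketch below models that on 16-bit values; the function names are illustrative, and the MSB test is implemented exactly as described in the text, so it does not correct for signed overflow.

```python
MASK16 = 0xFFFF


def alu_sub(rs, rt):
    """16-bit subtraction; returns (result, zero_flag)."""
    result = (rs - rt) & MASK16
    zero = 1 if result == 0 else 0
    return result, zero


def slt(rs, rt):
    """Rd = MSB of (Rs - Rt), zero-extended to 16 bits."""
    result, _ = alu_sub(rs, rt)
    return (result >> 15) & 1


def seq(rs, rt):
    """Rd = Z flag of (Rs - Rt), zero-extended to 16 bits."""
    _, zero = alu_sub(rs, rt)
    return zero & 1


print(slt(1, 2), slt(2, 1), seq(7, 7), seq(7, 8))
```

Because only the MSB of the difference is examined, operands far enough apart to overflow the 16-bit signed range would give a misleading SLT result in this model.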

### 3.3 I-type Instruction

Four different types of instruction can be performed as immediate type operations. LW adds the content of the $R_{s}$ register to the sign-extended 16-bit immediate data. The computed address points to a location in memory, and the data at this address is loaded into the register, $R_{d}=\left[R_{s}+\mathrm{n}\right]$. Likewise, SW computes the address but stores the content of the register addressed by the 4-bit $R_{d}$ field into the memory location, $\left[R_{s}+\mathrm{n}\right]=R_{d}$. ADDI adds the content of register $R_{s}$ to the 4-bit immediate operand n (sign extended) and stores the sum in register $R_{d}$, $R_{d}=R_{s}+\mathrm{n}$. MOVI converts the 8-bit immediate operand to sign-extended 16-bit data and moves it into register $R_{d}$, $R_{d}=\mathrm{n}$.
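Sign extension of the short immediate fields is central to all four I-type instructions. The Python sketch below shows one way to model it, together with the LW/SW address computation; the function names are illustrative.

```python
def sign_extend(value, bits, width=16):
    """Sign-extend a `bits`-wide field to `width` bits (two's complement)."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        value -= 1 << bits
    return value & ((1 << width) - 1)


def effective_address(rs_value, imm4):
    """LW/SW address: Rs plus the sign-extended 4-bit immediate, modulo 2^16."""
    return (rs_value + sign_extend(imm4, 4)) & 0xFFFF


print(hex(sign_extend(0xF, 4)))            # 4-bit -1
print(hex(sign_extend(0x80, 8)))           # 8-bit -128
print(hex(effective_address(0x0010, 0xF)))  # 0x0010 + (-1)
```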

### 3.3.1 I-type LW



Figure 3.6: I-type LW

LW takes 5 clock cycles to complete, and the second to fifth clock cycles are shown in Figure 3.6. At the second cycle, A_sel = 11 and B_sel = 01 select the sign-extended rt[3:0] bits and the content of register rs[3:0] through rd2, and the ALU performs the addition. At the third clock cycle the result is stored in the ALU_OUT register, and at the fourth clock cycle the computed address (result) points to a location in memory, which is read when control signals IorAddr = 1 and mem_en = 1. Data from memory is written into the register file at the fifth clock cycle and PC is updated, with reg_wr = 1 and pc_wr = 1. The destination register is addressed by rd[11:8], with mux selection signal jal_rd = 0.

### 3.3.2 I-type SW

At the second cycle control signals A_sel = 11 and B_sel = 01 select the sign-extended least significant 4 bits and the content of the $R_{s}$ register as operands to the ALU, which performs addition. The computed address (result) is then updated in ALU_OUT in the third clock cycle, as indicated in Figure 3.7. At the fourth clock cycle, the register file output rd3 = data_to_mem is written to the memory (MEM) location addressed by the computed result ($R_{s}+\mathrm{n}$) when IorAddr = 1, mem_en = 1 and rd_wr = 1, and PC gets updated.

Figure 3.7: I-type SW

### 3.3.3 I-type ADDI



Figure 3.8: I-type ADDI

The second and third clock cycles for ADDI are the same as for LW/SW, but this time the sum is treated as a result rather than an address and gets updated in ALU_OUT in the third clock cycle. The result is then saved to register $R_{d}$ by selecting the path through reg_src = 00 with reg_wr = 1 at the fourth clock cycle, as indicated in Figure 3.8.

### 3.3.4 I-type MOVI

The MOVI operation moves the least significant 8 bits of the instruction into register $R_{d}$ with sign extension, as indicated in Figure 3.9. At the second clock cycle control signals A_sel = 00 and B_sel = 10 select 16-bit 0 (from the first input of the 4-to-1 mux) and the sign-extended 8-bit data as operands to the ALU, which performs addition, meaning the 8-bit data is added to 0; PC+1 is also updated in the PC+1 register. At the third clock cycle the ALU result is saved into the ALU_OUT register, with control signal alu_out = 1. At the fourth clock cycle the data is written into the register file location addressed by $R_{d}[11:8]=$ wrreg[3:0] with reg_wr = 1, and PC is updated.

Figure 3.9: I-type MOVI

### 3.4 J-type Instruction

J-type can perform five different types of jump instruction based on the opcode. J LABEL jumps unconditionally and does not save the program counter value. It takes a 12-bit immediate operand and can jump within the range $[-2048, +2047]$ of memory locations. JZ and JNZ jump based on the content of $R_{d}$: if $R_{d}=0$ then JZ jumps, and if $R_{d} \neq 0$ then JNZ jumps. If the condition is not satisfied, the program counter increments by one. Both JZ and JNZ take $\mathrm{n}=8$ bits as an immediate operand and can jump within the range $[-128, +127]$. The JAL instruction jumps unconditionally but saves (jump and link) the incremented program counter into the \$RA = 1111 register in the register file. \$RA is register 15 in the register file and is implemented in hardware. The JR instruction returns from any location by loading the program counter from \$RA, which was saved earlier by some other instruction, and can jump (return) within the range $[-32768, +32767]$.
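The jump ranges quoted above follow directly from the width of the signed operand. A minimal Python check, with an illustrative helper name:

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field `bits` wide."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for name, bits in [("J / JAL", 12), ("JZ / JNZ", 8), ("JR", 16)]:
    low, high = signed_range(bits)
    print(f"{name}: {bits}-bit operand, range [{low}, {high:+d}]")
```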

### 3.4.1 J-type J LABEL

As indicated in Figure 3.10, J LABEL can be executed in two clock cycles. The first cycle is the same as for any instruction. At the second clock cycle, the sign-extended 12-bit LABEL gets selected through the 16-bit 5-to-1 mux by control signal pc_sel = 001. Also at the second clock cycle, control signal pc_wr = 1 lets the PC register be updated.

Figure 3.10: J-type J LABEL

### 3.4.2 J-type JZ LABEL

JZ LABEL can be executed in four clock cycles, as indicated in Figure 3.11. At the second clock cycle the computed PC+1 gets updated in the PC+1 register, and control signal jz = 1 is activated and remains high until the fourth clock cycle. Also in the second clock cycle, A_sel = 00 and B_sel = 01 select 16-bit 0 (first input of the 16-bit 4-to-1 mux) and the content of the $R_{d}$ register through rd2. The register file input rr2 (which is $R_{d}[11:8]$) is selected through the mux control bit jz | jnz. The ALU performs addition, and if the result is 0 the ALU output z goes high. At the third clock cycle z_en = 1 enables the z register to be updated. At the fourth clock cycle z & jz = 1 (if z = 1) sets the 16-bit 2-to-1 mux control bit to pass Sign8 into the datapath, pc_sel = 010 lets it pass through the 16-bit 5-to-1 mux, and LABEL is updated in the PC register.

Figure 3.11: J-type JZ LABEL

### 3.4.3 J-type JNZ LABEL

JNZ performs a jump to the 8-bit LABEL if $R_{d} \neq 0$, as indicated in Figure 3.12. At the second cycle control signal jnz = 1 connects $R_{d}$ to input rr2 of the register file through the 4-bit 2-to-1 mux. The register file then outputs the content of this register on rd2. A_sel = 00 and B_sel = 01 select 16-bit 0 and rd2, and func = 00 performs addition in this cycle. At the third cycle z_en = 1 lets the z register be updated. At the fourth clock cycle, z & jnz determines whether PC+1 or Sign8 is passed through the 16-bit 2-to-1 mux, and pc_sel = 011 lets it pass. Also in this cycle pc_wr = 1 lets the PC register be updated.

Figure 3.12: J-type JNZ LABEL

### 3.4.4 J-type JAL

JAL can be performed in three clock cycles, as indicated in Figure 3.13. At the first cycle the instruction is fetched into the IR register and PC is incremented by 1, as stated before. At the second clock cycle control signal jal_rd = 1 selects wrreg[3:0] = 1111 (\$RA) instead of $R_{d}[11:8]$, addressing the register that needs to be written. Also at this cycle the PC+1 register gets updated because pc+1 = 1. At the third clock cycle control signal jal = 1 selects PC+1 to pass through the 16-bit 2-to-1 mux and reg_wr = 1 lets it be written to the addressed register. Also, Sign12 (the 12-bit sign-extended LABEL) gets selected in the 16-bit 5-to-1 mux by control signal pc_sel = 001, and pc_wr = 1 enables updating of the PC register.

Figure 3.13: J-type JAL

### 3.4.5 J-type JR

JR can be executed in two clock cycles, as indicated in Figure 3.14. At the second clock cycle, the register file input rr3[3:0] = ra[1111] gets selected by control signal sw_jr = 1. The corresponding content becomes available at the 16-bit rd3 output and is passed through the 16-bit 5-to-1 mux by control signal pc_sel = 100. At the same time the result (the return address coming through rd3) is updated in the PC register, with control signal pc_wr = 1.


Figure 3.14: J-type JR

### 3.5 Control Signals

Control signals have been summarized for all fifteen instructions and listed in Table 3.3 and Table 3.4 for every clock cycle.


Table 3.3: Control Signal Table Part-1


Table 3.4: Control Signal Table Part-2

## Chapter 4

Component Realization

The processor datapath is shown again in Figure 4.1 to indicate the components that must be realized in CML. This chapter demonstrates how each of these datapath components was realized using the basic CML components developed in Chapter 2. Each basic component included a 20 fF load capacitance when designing the datapath components, to mimic post layout simulation and wire delay. The schematics were handcrafted in Cadence Virtuoso Composer Schematic, and Spectre simulation (equivalent to SPICE) was performed to verify correct operation of each block. The tool used for Spectre simulation is Analog Artist and the technology used is 130 nm CMOS. The processor datapath consists of:


Figure 4.1: Datapath Components

1. 4 16-bit registers with enable input (PC, IR, ALU_OUT, PC+1)
2. Z register (1-bit register with enable input)
3. 4 16-bit 2-to-1 mux
4. 3 4-bit 2-to-1 mux
5. 3 16-bit 4-to-1 mux
6. 16-bit 5-to-1 mux
7. 16-bit ALU
8. $16 \times 16$ register file (REG FILE)
9. Sign 4-to-16 extension (Sign 4)
10. Sign 8-to-16 extension (Sign 8)
11. Sign 12-to-16 extension (Sign 12)
12. 2 unsigned 1-to-16 extension (Unsigned 16)
13. 4 1-bit AND gate
14. 1 1-bit OR gate

### 4.1 16-bit Register With Enable Input

A 1-bit register was developed by cascading CML D-latches. As CML comes in complementary form, there are two signals for any input/output. The negative clock pulse was provided to the first D-latch and the positive pulse was given to the second D-latch in order to realize a positive edge triggered D flip-flop. A two input mux was used in front of the master-slave DFF to hold the state when enable is 0 and pass the data to input D of the DFF when enable is 1. The circuit diagrams of the master-slave D flip-flop, the master-slave D flip-flop with an enable input and the 16-bit register with enable input are shown in Figure 4.2.

Figure 4.2: Block diagram of MS DFF, MS DFF-EN, 16-bit register

The MS DFF burns $2 * 15.3132 \mathrm{~mW}=30.6264 \mathrm{~mW}$ and the MS DFF-EN takes $16.226 \mathrm{~mW}+30.6264 \mathrm{~mW}=46.8524 \mathrm{~mW}$. For simplicity, only the 1-bit MS DFF-EN output is shown in Figure 4.3. The 16-bit register output is the same and the delay is identical as well, because the bits operate in parallel. In Figure 4.3, a 101.41 ps rise delay has been observed with 20 fF load capacitance. In simulation, the 16-bit MS DFF-EN (16-bit register) takes $16 * 46.8524 \mathrm{~mW}=749.6384 \mathrm{~mW}$. The minimum setup time was found to be 40 ps for operation at 6 GHz.

The estimated area of the 16-bit register is $16 *(2 *$ D latch area $+$ 2-to-1 mux area$)=16 *(2 *(29.19 \mu \mathrm{m} \times 57.78 \mu \mathrm{m})+36.51 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=16 *(94.89 \mu \mathrm{m} \times 57.78 \mu \mathrm{m})=1518.24 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$. A $50 \%$ additive area for wiring will be added later to the total area of the processor to get the power density per unit area.

Therefore, the 4 16-bit registers in the datapath should dissipate $4 * 749.6384 \mathrm{~mW}=2998.55 \mathrm{~mW}$ with an area of $4 *(1518.24 \mu \mathrm{m} \times 57.78 \mu \mathrm{m})=6072.96 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$.
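The register estimate above can be reproduced with a short calculation. This is only a sketch: the function and constant names are hypothetical, and the per-cell numbers are the ones quoted in this section (the side-by-side width model follows the way the area expression adds cell widths).

```python
# Per-cell figures quoted in Section 4.1 (power in mW, widths in um).
D_LATCH_MW = 15.3132
MUX2_MW = 16.226
D_LATCH_WIDTH_UM = 29.19
MUX2_WIDTH_UM = 36.51


def register_power_mw(bits: int) -> float:
    """Static power of a register: two D-latches plus one enable mux per bit."""
    return bits * (2 * D_LATCH_MW + MUX2_MW)


def register_width_um(bits: int) -> float:
    """Row width when every cell of every bit is placed side by side."""
    return bits * (2 * D_LATCH_WIDTH_UM + MUX2_WIDTH_UM)


if __name__ == "__main__":
    print(f"16-bit register: {register_power_mw(16):.4f} mW, "
          f"{register_width_um(16):.2f} um wide")
    print(f"four registers : {4 * register_power_mw(16):.2f} mW")
```

Because CML power is static, scaling the 1-bit figure by the bit count is expected to track the simulated value closely.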


Figure 4.3: 1-bit register output at 6 GHz with 20 fF load capacitance (clock period 166ps)

### 4.2 Z Register (1-bit Register With Enable Input)

As indicated in section 4.1, the 1-bit register with an enable input has power consumption of 46.8524 mW and an area of $94.89 \mu \mathrm{~m} \times 57.78 \mu \mathrm{~m}$ and reported rise delay is 101.41 ps .

### 4.3 4 16-bit 2-to-1 Mux

A 1-bit 2-to-1 mux has been shown in Figure 2.17, with a power consumption of 16.226 mW, an area of $36.51 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$ and a rise delay of 33.99 ps. A 16-bit 2-to-1 mux is 16 instances of the 1-bit 2-to-1 mux, with 16 times the power consumption, 259.616 mW, and 16 times the area, $584.16 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$. The delay is the same because the instances are parallel in architecture.

Therefore, the power consumption of 4 16-bit 2-to-1 muxes is 1038.464 mW, the area is $2336.64 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$ and the rise delay is 33.99 ps. Because power consumption is static, computing the power of the 16-bit 2-to-1 mux theoretically as 16 * 1-bit 2-to-1 mux produces the same value as simulating the 16-bit 2-to-1 mux and reading the power from the tool. However, in some cases where a larger number of basic CML components has been used, the component is simulated directly to get the power consumption from the tool.

### 4.4 3 4-bit 2-to-1 Mux

As indicated in section 4.3, the power consumption of a 4-bit 2-to-1 mux is 64.904 mW and its area is $146.04 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$. Therefore the power consumption of 3 4-bit 2-to-1 muxes is 194.712 mW, with an area of $438.12 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$ and a rise delay of 33.99 ps.

### 4.5 3 16-bit 4-to-1 Mux

A 1-bit 4-to-1 mux was developed using 3 1-bit 2-to-1 mux, providing $S 0$ as the control bit for first stage and S1 as the control bit for the second stage as indicated in Figure 4.4. Expected power consumption is $3^{*}$ 1-bit 2-to-1 mux power $=3^{*} 16.226 \mathrm{~mW}=48.678 \mathrm{~mW}$ and simulation power obtained is $2.8 \mathrm{~V} * 17.38 \mathrm{~mA}=48.664 \mathrm{~mW}$, which is almost identical.


Figure 4.4: 1-bit 4-to-1 mux

The expected area of a 1-bit 4-to-1 mux is $3 *(36.51 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=109.53 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$. The expected rise delay is $2 * 33.99 \mathrm{ps}=67.98 \mathrm{ps}$, due to the two stages of CML 2-to-1 mux, and the simulated rise delay is 52.51 ps, as indicated in Figure 4.5.

Therefore, the power consumption of the 3 16-bit 4-to-1 muxes is $3 * 16 * 48.664 \mathrm{~mW}=3 * 778.624 \mathrm{~mW}=2335.872 \mathrm{~mW}$, the area is $3 * 16 *(109.53 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=3 *(1752.48 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=5257.44 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$ and the simulated rise delay is 52.51 ps.
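The two-stage structure described in this section can be sketched behaviourally as follows. This is a logic-level illustration only, not the CML circuit; the function names are hypothetical, and the input ordering (D0 to D3 indexed by S0 as the low select bit) is an assumption.

```python
def mux2(d0: int, d1: int, sel: int) -> int:
    """1-bit 2-to-1 mux: pass d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux4(d: list, s0: int, s1: int) -> int:
    """1-bit 4-to-1 mux built from three 2-to-1 muxes.

    S0 drives the two first-stage muxes and S1 drives the second stage.
    """
    low = mux2(d[0], d[1], s0)    # first stage, inputs D0/D1
    high = mux2(d[2], d[3], s0)   # first stage, inputs D2/D3
    return mux2(low, high, s1)    # second stage


# Cost model using the 1-bit 2-to-1 mux figures from Section 4.3.
MUX2_MW, MUX2_DELAY_PS = 16.226, 33.99
expected_power_mw = 3 * MUX2_MW        # three muxes
expected_delay_ps = 2 * MUX2_DELAY_PS  # two stages in series

if __name__ == "__main__":
    print(f"expected power {expected_power_mw:.3f} mW, "
          f"expected delay {expected_delay_ps:.2f} ps")
```

The expected delay is an upper-bound style estimate from stacking two stage delays; the simulated value quoted in the text is lower.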


Figure 4.5: 1-bit 4-to-1 mux output with 20fF load capacitance (input changes at 83ps)

### 4.6 16-bit 5-to-1 mux

A 1-bit 5-to-1 mux was developed using a 1-bit 4-to-1 mux and a 1-bit 2-to-1 mux, as shown in Figure 4.6. The expected power consumption is the sum of the two, $48.664 \mathrm{~mW}+16.226 \mathrm{~mW}=64.89 \mathrm{~mW}$, and the simulated power is $2.8 \mathrm{~V} * 23.18 \mathrm{~mA}=64.904 \mathrm{~mW}$.

The power consumption of the 16-bit 5-to-1 mux is $16 * 64.904 \mathrm{~mW}=1038.464 \mathrm{~mW}$ and the expected area is $16 *($1-bit 4-to-1 $+$ 1-bit 2-to-1$)=16 *(109.53 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}+36.51 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=16 *(146.04 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=2336.64 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$.

The 1-bit 5-to-1 mux output, with the input changing at 166 ps, is shown in Figure 4.7. The simulated rise delay is 59.61 ps, as indicated.


Figure 4.6: 1-bit 5-to-1 mux


Figure 4.7: 1-bit 5-to-1 Mux output with 20fF load capacitance (input changes at 166ps)

### 4.7 16-bit ALU

A 16-bit ALU consists of a 16-bit CLA, 16-bit BLA, 16-bit AND, 16-bit XOR and 16-bit 4-to-1 mux. For the 16-bit inputs A and B, addition, subtraction, AND and XOR are all performed irrespective of the 2-bit function input. Based on the function input, the 16-bit 4-to-1 mux passes the selected result to the output. The output bits are also OR-ed and passed to the Z signal, which is connected to the zero flag register.


Figure 4.8: Block diagram of 16 -bit ALU

A 16-bit carry look ahead adder was designed using the schematic shown in Figure 4.9, where carry generation is $g_{i}$, carry propagation is $p_{i}$ and summation is $s_{i}$ for each 1-bit CLA.


Figure 4.9: 16-bit CLA block diagram

$$
\begin{gathered}
g_{i}=x_{i} \cdot y_{i} ; p_{i}=x_{i} \oplus y_{i} \\
s_{i}=x_{i} \oplus y_{i} \oplus c_{i-1}
\end{gathered}
$$

Level-1 CLA output can be described as:

$$
\begin{gathered}
\mathrm{P}0=\mathrm{p}3 \cdot \mathrm{p}2 \cdot \mathrm{p}1 \cdot \mathrm{p}0 \\
\mathrm{P}1=\mathrm{p}7 \cdot \mathrm{p}6 \cdot \mathrm{p}5 \cdot \mathrm{p}4 \\
\mathrm{P}2=\mathrm{p}11 \cdot \mathrm{p}10 \cdot \mathrm{p}9 \cdot \mathrm{p}8 \\
\mathrm{P}3=\mathrm{p}15 \cdot \mathrm{p}14 \cdot \mathrm{p}13 \cdot \mathrm{p}12 \\
\mathrm{G}0=\mathrm{g}3+(\mathrm{p}3 \cdot \mathrm{g}2)+(\mathrm{p}3 \cdot \mathrm{p}2 \cdot \mathrm{g}1)+(\mathrm{p}3 \cdot \mathrm{p}2 \cdot \mathrm{p}1 \cdot \mathrm{g}0) \\
\mathrm{G}1=\mathrm{g}7+(\mathrm{p}7 \cdot \mathrm{g}6)+(\mathrm{p}7 \cdot \mathrm{p}6 \cdot \mathrm{g}5)+(\mathrm{p}7 \cdot \mathrm{p}6 \cdot \mathrm{p}5 \cdot \mathrm{g}4) \\
\mathrm{G}2=\mathrm{g}11+(\mathrm{p}11 \cdot \mathrm{g}10)+(\mathrm{p}11 \cdot \mathrm{p}10 \cdot \mathrm{g}9)+(\mathrm{p}11 \cdot \mathrm{p}10 \cdot \mathrm{p}9 \cdot \mathrm{g}8) \\
\mathrm{G}3=\mathrm{g}15+(\mathrm{p}15 \cdot \mathrm{g}14)+(\mathrm{p}15 \cdot \mathrm{p}14 \cdot \mathrm{g}13)+(\mathrm{p}15 \cdot \mathrm{p}14 \cdot \mathrm{p}13 \cdot \mathrm{g}12) \\
\mathrm{c}1=\mathrm{g}0+\mathrm{p}0 \cdot \mathrm{c}0 \\
\mathrm{c}2=\mathrm{g}1+\mathrm{p}1 \cdot \mathrm{g}0+\mathrm{p}1 \cdot \mathrm{p}0 \cdot \mathrm{c}0
\end{gathered}
$$

Level-2 CLA output can be described as:

$$
\begin{gathered}
\mathrm{C} 1=\mathrm{G} 0+\mathrm{P} 0 \cdot \mathrm{c} 0 \\
\mathrm{C} 2=\mathrm{G} 1+\mathrm{P} 1 \cdot \mathrm{G} 0+\mathrm{P} 1 \cdot \mathrm{P} 0 \cdot \mathrm{c} 0 \\
\mathrm{C} 3=\mathrm{G} 2+\mathrm{P} 2 \cdot \mathrm{G} 1+\mathrm{P} 2 \cdot \mathrm{P} 1 \cdot \mathrm{G} 0+\mathrm{P} 2 \cdot \mathrm{P} 1 \cdot \mathrm{P} 0 \cdot \mathrm{c} 0
\end{gathered}
$$
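The generate/propagate equations above can be checked with a small bit-level model. This is a behavioural sketch, not the CML netlist: the function name `cla16` is hypothetical, the code treats `c[j]` as the carry into bit `j` of a 4-bit group, and the final carry-out expression $\mathrm{G}3+\mathrm{P}3 \cdot \mathrm{C}3$ is an assumption, since only C1 to C3 are listed in the text.

```python
def cla16(x: int, y: int, c0: int = 0):
    """Two-level carry look-ahead addition of two 16-bit words.

    Returns (16-bit sum, carry out).
    """
    xb = [(x >> i) & 1 for i in range(16)]
    yb = [(y >> i) & 1 for i in range(16)]
    g = [a & b for a, b in zip(xb, yb)]   # carry generate per bit
    p = [a ^ b for a, b in zip(xb, yb)]   # carry propagate per bit

    # Level-1: group propagate/generate for each 4-bit group.
    P, G = [], []
    for k in range(4):
        p0, p1, p2, p3 = p[4 * k:4 * k + 4]
        g0, g1, g2, g3 = g[4 * k:4 * k + 4]
        P.append(p3 & p2 & p1 & p0)
        G.append(g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0))

    # Level-2: carry into each 4-bit group.
    C = [c0,
         G[0] | (P[0] & c0),
         G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0),
         G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
         | (P[2] & P[1] & P[0] & c0)]
    carry_out = G[3] | (P[3] & C[3])      # assumed form of the final carry

    total = 0
    for k in range(4):
        cin = C[k]
        p0, p1, p2, _ = p[4 * k:4 * k + 4]
        g0, g1, g2, _ = g[4 * k:4 * k + 4]
        # Carries into bits 0..3 of this group, all derived from cin.
        c = [cin,
             g0 | (p0 & cin),
             g1 | (p1 & g0) | (p1 & p0 & cin),
             g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & cin)]
        for j in range(4):
            total |= (p[4 * k + j] ^ c[j]) << (4 * k + j)
    return total, carry_out


if __name__ == "__main__":
    s, cout = cla16(0xFFFF, 0x0000, 1)
    print(f"sum={s:04X} carry_out={cout}")
```

The stimulus in the usage line (all ones plus a carry-in) forces the carry to ripple logically through every group, which is the kind of pattern used to sensitize the critical path.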

The power consumption of the 16-bit CLA obtained in circuit simulation is $2.8 \mathrm{~V} * 1.198 \mathrm{~A}=3.3544 \mathrm{~W}$. The critical path in the 16-bit CLA was sensitized by first providing all 0's on the 16-bit inputs A and B, and then all 1's on input A with a 1 on $C_{in}$. The simulation result shows that the rise delay of the 16-bit CLA is 224.7 ps, as indicated in Figure 4.10.

The expected area of the 16-bit CLA is $16 *($1-bit CLA$)+4 *$ Level-1 CLA $+$ Level-2 CLA $=16 *(1 \mathrm{~AND}+3 \mathrm{~XOR})+4 *(19 \mathrm{~AND}+9 \mathrm{~OR})+10 \mathrm{~AND}+6 \mathrm{~OR}=16 *(30.15 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}+3 *(28.59 \mu \mathrm{m} \times 45.63 \mu \mathrm{m}))+4 * 28 *(30.15 \mu \mathrm{m} \times 42.9 \mu \mathrm{m})+16 *(30.15 \mu \mathrm{m} \times 42.9 \mu \mathrm{m})=1854.72 \mu \mathrm{m} \times 45.63 \mu \mathrm{m}+3376.8 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}+482.4 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}=5713.92 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}$.

A 16-bit BLA can be realized using the same Level-1 and Level-2 architecture, but with a small difference in the 1-bit BLA:

$$
\begin{gathered}
g_{i}=\bar{x}_{i} \cdot y_{i} ; p_{i}=\bar{x}_{i}+y_{i} \\
s_{i}=x_{i} \oplus y_{i} \oplus c_{i-1}
\end{gathered}
$$

Figure 4.10: Critical path delay in 16-bit CLA is 224.7 ps (input changes at 500 ps)

The power consumption of the 16-bit BLA found in simulation is $2.8 \mathrm{~V} * 1.198 \mathrm{~A}=3.3544 \mathrm{~W}$, which is the same as the 16-bit CLA, and the critical path delay should be the same as well. The estimated area of the 16-bit BLA is $16 *$ 1-bit BLA $+4 *$ Level-1 CLA $+$ Level-2 CLA $=16 *(1 \mathrm{~AND}+1 \mathrm{~OR}+2 \mathrm{~XOR})+3376.8 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}+482.4 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}=16 * 2 *(\mathrm{AND}+\mathrm{XOR})+3859.2 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}=32 *(58.74 \mu \mathrm{m} \times 45.63 \mu \mathrm{m})+3859.2 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}=5738.88 \mu \mathrm{m} \times 45.63 \mu \mathrm{m}$.

16 -bit AND power consumption would be $16{ }^{*} 16.2148 \mathrm{~mW}=259.4368 \mathrm{~mW}$, with an expected area of $16^{*}(30.15 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m})=482.4 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m}$. Delay will be identical to the 1-bit AND gate which is 37.6 ps .

16 -bit XOR power consumption would be $16 * 16.226 \mathrm{~mW}=259.616 \mathrm{~mW}$, with an expected area of $16^{*}(28.59 \mu \mathrm{~m} \times 45.63 \mu \mathrm{~m})=457.44 \mu \mathrm{~m} \times 45.63 \mu \mathrm{~m}$. Rise delay of the 16 -bit XOR is 41.5 ps .

16 -bit 4 -to- 1 mux power consumption is $16 * 48.664 \mathrm{~mW}=778.624 \mathrm{~mW}$, area is $16{ }^{*}$ $(109.53 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m})=1752.48 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m}$ and simulated rise delay is 52.51 ps .

Total expected area estimated is, $5713.92 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m}+5738.88 \mu \mathrm{~m} \times 45.63 \mu \mathrm{~m}+$ $482.4 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m}+457.44 \mu \mathrm{~m} \times 45.63 \mu \mathrm{~m}+1752.48 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m}=14145.12 \mu \mathrm{~m} \times$ $47.52 \mu \mathrm{~m}$.


Figure 4.11: 16 -bit ALU Output (input changes at 300ps)

Estimated path-delay from input to output of the ALU is $\max (16$-bit CLA, 16 -bit BLA, 16 -bit AND, 16 -bit XOR) +16 -bit 4 -to- 1 mux $=224.7 \mathrm{ps}+52.51 \mathrm{ps}=277.21 \mathrm{ps}$. But critical path delay exists between the ALU input to Z flag where 4 stage of OR have been used.

The Z flag is used in only 3 of the 15 instructions. Simulating the 16-bit ALU with a 300 ps clock period, using the path sensitization indicated in Figure 4.10, produced the correct Out signals at 275.6 ps, while the Z signal took 348 ps. Therefore 12 instructions can be executed at 300 ps (3.33 GHz) and the 3 instructions that need the Z flag can be executed at 350 ps (2.85 GHz).

In Figure 4.11 the ALU first performs addition, ADD FFFF 0000 (with $C_{in}=1$), and the result is 0000 with $\mathrm{Z}=1$ when the mux selection bits are $\mathrm{S0\,S1}=00$. Then for $\mathrm{S0\,S1}=01$ it performs subtraction, SUB 0000 FFFF, and the result is 0001. For $\mathrm{S0\,S1}=10$ it performs 16-bit AND, AND FFFF FFFF, and the result is FFFF. For $\mathrm{S0\,S1}=11$ the ALU performs 16-bit XOR, XOR 0000 FFFF, and the result is FFFF.
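The four test vectors of Figure 4.11 can be replayed against a simple behavioural model. This is a sketch under assumptions: the name `alu16` is hypothetical, the select value is kept as the two-character string used in the text to avoid guessing the bit order, and the carry-in is applied to the adder only.

```python
MASK = 0xFFFF


def alu16(a: int, b: int, select: str, cin: int = 0):
    """Behavioural 16-bit ALU: all four results exist, `select` picks one.

    Returns (result, z) where z is 1 when the selected result is zero.
    """
    results = {
        "00": (a + b + cin) & MASK,  # CLA: addition
        "01": (a - b) & MASK,        # BLA: subtraction, modulo 2**16
        "10": a & b,                 # 16-bit AND
        "11": a ^ b,                 # 16-bit XOR
    }
    out = results[select]
    return out, int(out == 0)


if __name__ == "__main__":
    vectors = [
        ("ADD", 0xFFFF, 0x0000, "00", 1),
        ("SUB", 0x0000, 0xFFFF, "01", 0),
        ("AND", 0xFFFF, 0xFFFF, "10", 0),
        ("XOR", 0x0000, 0xFFFF, "11", 0),
    ]
    for name, a, b, sel, cin in vectors:
        out, z = alu16(a, b, sel, cin)
        print(f"{name} {a:04X} {b:04X} -> {out:04X} Z={z}")
```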

Estimated power consumption of 16 -bit ALU is $3.3544 \mathrm{~W}+3.3544 \mathrm{~W}+259.4368 \mathrm{~mW}+$ $259.616 \mathrm{~mW}+778.624 \mathrm{~mW}+15 * 16.2148 \mathrm{~mW}=8.2496988 \mathrm{~W}$ and simulation result shows 16 bit ALU takes $2.8 \mathrm{~V} * 2.944 \mathrm{~A}=8.2432 \mathrm{~W}$. Estimated area of ALU is $14145.12 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m}$. Time it takes to generate correct operation sensitizing critical path is 275.6 ps for 12 instructions and 348 ps for 3 instructions.

### 4.8 16x16 Register File

The $16 \times 16$ register file consists of a 4-to-16 decoder, a 16-bit AND, 16 16-bit registers and 3 16-bit 16-to-1 muxes, as shown in Figure 4.12.

The register file has 3 4-bit inputs, rr1, rr2 and rr3, to address any of the 16 registers, and the corresponding 16-bit data arrive at rd1, rd2 and rd3, respectively. The 4-bit wrreg[3:0] is used to address any of the 16 registers (16 bits each) and data will be written when control signal reg_wr is 1. Therefore, the wrreg[3:0] input is fed to the 4-to-16 decoder input. For wrreg[3:0] $=0000$ the first output bit of the decoder will be high and the rest will be low. In general, for a particular combination of wrreg[3:0], the decoder output line whose index equals the binary value of wrreg[3:0] will be high and the rest will be low, as indicated in Figure 4.13.


Figure 4.12: 16x16 Register File Schematic

4-to-16 decoder delay is dominated by the 2 stage AND delay. Theoretical estimation of rise delay $=75.2 \mathrm{ps}$ and simulated path delay observed is, rise delay $=72.01 \mathrm{ps}$. Simulated power obtained for 4-to-16 decoder is $2.8 \mathrm{~V} * 278.2 \mathrm{~mA}=778.96 \mathrm{~mW}$ and expected area is 48 * AND gate area $=48 *(30.15 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m})=1447.2 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m}$.

The output of the decoder is connected to a 16 -bit AND and the second input of the 16 -bit AND is reg_wr. Therefore, one of sixteen output lines of the 16 -bit AND will be high if and only if the corresponding decoder line is high and reg_wr is high, to make sure that the selected register can be written only when reg_wr is high. The output of 16 -bit AND was connected to the EN input of 16 registers. 16-bit AND power consumption is 16 * $16.2148 \mathrm{~mW}=259.4368 \mathrm{~mW}$, rise delay $=37.6 \mathrm{ps}$, expected area $=16 *(30.15 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m})$ $=482.4 \mu \mathrm{~m} \mathrm{x} 42.9 \mu \mathrm{~m}$.
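The write-enable generation described above (one-hot decode of wrreg, then AND with reg_wr) can be sketched as follows. The function names are hypothetical and the model is purely logical.

```python
def decode4to16(wrreg: int) -> list:
    """4-to-16 decoder: only the line indexed by wrreg is high."""
    return [int(i == (wrreg & 0xF)) for i in range(16)]


def write_enables(wrreg: int, reg_wr: int) -> list:
    """AND every decoder line with reg_wr to form the 16 register EN inputs."""
    return [line & reg_wr for line in decode4to16(wrreg)]


if __name__ == "__main__":
    print(write_enables(0b0011, 1))  # only EN[3] is high
    print(write_enables(0b0011, 0))  # no register is enabled
```

With reg_wr low, every EN input stays low, so no register can be overwritten regardless of the wrreg value.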

As stated in section 4.1, the 16-bit register area is $1518.24 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$, with a power consumption of 749.6384 mW and a rise delay of 101.41 ps. Therefore, the area of 16 16-bit registers is $16 *(1518.24 \mu \mathrm{m} \times 57.78 \mu \mathrm{m})=24291.84 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$ and the power consumption is $16 * 749.6384 \mathrm{~mW}=11.994 \mathrm{~W}$.

Figure 4.13: 4-to-16 Decoder (input changes at 83ps)

A 1-bit 16-to-1 mux was developed using five 1-bit 4-to-1 muxes arranged in two stages, as indicated in Figure 4.14. The two least significant control bits, S0 and S1, were connected to the first stage of 1-bit 4-to-1 muxes and S2, S3 were connected to the second stage.


Figure 4.14: 1-bit 16-to-1 Mux Schematic

The simulated power consumption obtained for the 1-bit 16-to-1 mux is $2.8 \mathrm{~V} * 86.92 \mathrm{~mA}=243.376 \mathrm{~mW}$. The expected area, from section 4.5, is $5 *(109.53 \mu \mathrm{m} \times 47.52 \mu \mathrm{m})=547.65 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$. The simulated rise delay of the 1-bit 4-to-1 mux is 52.51 ps. Therefore the expected rise delay of the 1-bit 16-to-1 mux is 105.02 ps, whereas the simulated rise delay is 85.39 ps, as indicated in Figure 4.15.
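The 16-to-1 selection tree can be illustrated with a short behavioural model. This is a sketch with hypothetical function names; the 4-to-1 mux is modelled by direct indexing rather than by its internal 2-to-1 stages, and the input ordering is an assumption.

```python
def mux4(d: list, s0: int, s1: int):
    """1-bit 4-to-1 mux, modelled as an index into four inputs."""
    return d[(s1 << 1) | s0]


def mux16(d: list, s0: int, s1: int, s2: int, s3: int):
    """1-bit 16-to-1 mux from five 4-to-1 muxes.

    Four first-stage muxes share S0/S1; one second-stage mux uses S2/S3.
    """
    first_stage = [mux4(d[4 * k:4 * k + 4], s0, s1) for k in range(4)]
    return mux4(first_stage, s2, s3)


if __name__ == "__main__":
    inputs = list(range(16))
    sel = 0b1011
    bits = [(sel >> i) & 1 for i in range(4)]
    print(mux16(inputs, *bits))  # selects input 11
```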


Figure 4.15: 1-bit 16-to-1 Mux Output (input changes at 144ps)

For simplicity, 3 inputs D0, D1 and D2 have been used instead of 16 independent inputs. D0, D1 and D2 were connected to the first three inputs, the fourth input was connected to D0, the fifth input to D1, the sixth input to D2, the seventh input to D0 and so on, and correct operation was verified with the input changing at 144 ps.

16-bit 16 -to- 1 mux power consumption is $16 * 243.376 \mathrm{~mW}=3.894016 \mathrm{~W}$ and expected area is $16^{*}(547.65 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m})=8762.4 \mu \mathrm{~m} \times 47.52 \mu \mathrm{~m}$ with rise delay $=85.39 \mathrm{ps}$.

Therefore, the power consumption of 3 16-bit 16-to-1 muxes is 11.6872 W, with an expected area of $26287.2 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$.


Figure 4.16: 16x16 Register File Output at 3.33 GHz

The $16 \times 16$ register file path delay is the 4-to-16 decoder delay $+$ 16-bit AND delay $+$ 16-bit MS DFF-EN delay $+$ 16-bit 16-to-1 mux delay $=72.01 \mathrm{ps}+37.6 \mathrm{ps}+101.41 \mathrm{ps}+85.39 \mathrm{ps}=296.41 \mathrm{ps}$. The estimated power consumption of the register file is 4-to-16 decoder $+$ 16-bit AND $+$ 16 16-bit registers $+$ 3 16-bit 16-to-1 muxes $=778.96 \mathrm{~mW}+259.4368 \mathrm{~mW}+11.994 \mathrm{~W}+11.6872 \mathrm{~W}=24.719 \mathrm{~W}$. The estimated area of the $16 \times 16$ register file is the 4-to-16 decoder area $1447.2 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}$ $+$ 16-bit AND area $482.4 \mu \mathrm{m} \times 42.9 \mu \mathrm{m}$ $+$ 16 16-bit register area $24291.84 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$ $+$ 3 16-bit 16-to-1 mux area $26287.2 \mu \mathrm{m} \times 47.52 \mu \mathrm{m}$ $=52508.64 \mu \mathrm{m} \times 57.78 \mu \mathrm{m}$.
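The path-delay estimate can be expressed as a simple sum of stage delays and compared against the clock period. This is a sketch: the stage names and the `slack_ps` helper are hypothetical, while the delays are the simulated values quoted in this chapter.

```python
# Simulated stage delays along the register file path (ps).
REG_FILE_PATH_PS = {
    "4-to-16 decoder": 72.01,
    "16-bit AND": 37.6,
    "16-bit MS DFF-EN": 101.41,
    "16-bit 16-to-1 mux": 85.39,
}


def path_delay_ps(stages: dict) -> float:
    """Total delay of stages connected in series."""
    return sum(stages.values())


def slack_ps(stages: dict, clock_period_ps: float) -> float:
    """Positive slack means the path fits inside one clock period."""
    return clock_period_ps - path_delay_ps(stages)


if __name__ == "__main__":
    print(f"path delay: {path_delay_ps(REG_FILE_PATH_PS):.2f} ps")
    print(f"slack at 300 ps: {slack_ps(REG_FILE_PATH_PS, 300.0):.2f} ps")
```

The small positive slack at a 300 ps period is consistent with the register file being verified at 3.33 GHz.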

Figure 4.16 shows verification of the $16 \times 16$ register file at a 300 ps clock period. An analog test bench was set up such that at the first clock cycle it writes FFFF to register-0, but as register-0 is hardwired to zero it still holds 0 afterwards. In consecutive cycles, the data written to registers 1, 2, 3, 4, 5 up to 15 were EEEE, DDDD, 8888, 3333, 2222, 5555, 4444, FFFF, AAAA, 9999, 8888, 7777, 6666, 5555 and 0000, and then control signal reg_wr goes low, meaning that no further writes can be performed. Then $\mathrm{rr1}=\mathrm{rr2}=\mathrm{rr3}=0,1,2,3,4$ and 5 were provided for simplicity, and only the 16-bit rd1 was probed to see the contents of registers 0, 1, ..., 5, which were 0000, EEEE, DDDD, 8888, 3333 and 2222. Then in the following clock cycle reset $=1$ sets all the outputs to logic-0.

### 4.9 Sign 4-to-16 extension (Sign 4)

Sign 4 to 16 extension was developed using a single stage of sixteen inverters. The first 4 bits were passed to inverters and the outputs were flipped so that each inverter works as a buffer, which is possible because of the complementary form of CML logic. The fourth bit was also connected as input to the rest of the inverters and those outputs were flipped as well.

Figure 4.17 shows the outputs of the sign 4 to 16 extension with the input changing at 72 ps. The fourth bit (D3) was copied to inputs D4-D15, and a fragment of the output, Out0-Out7, is shown. The simulated power consumption of the sign 4 to 16 extension is $2.8 \mathrm{~V} * 26.4 \mathrm{~mA}=73.92 \mathrm{~mW}$. The delay is the same as that of a 1-bit inverter with flipped outputs, and the simulated rise delay is 25.5 ps. The expected area is $16 *(15.96 \mu \mathrm{m} \times 24.45 \mu \mathrm{m})=255.36 \mu \mathrm{m} \times 24.45 \mu \mathrm{m}$.


Figure 4.17: Sign 4 to 16 Extension Output (input changes at 72 ps )

### 4.10 Sign 8-to-16 extension (Sign 8)

Sign 8 to 16 extension is similar to sign 4 to 16 extension, with the only difference being that the eighth bit (D7) is copied to D8-D15 bits. Area, power consumption and delay are the same as the sign 4 -to-16 extension.

### 4.11 Sign 12-to-16 extension (Sign 12)

Sign 12 to 16 extension is similar to sign 4 to 16 extension with the only difference being that the twelfth bit (D11) is copied to D12-D15 bits. Area, power consumption and delay are same as sign 4 to 16 extension.
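The three sign extension blocks differ only in which bit is replicated, which a single parameterized function can illustrate. This is a behavioural sketch; the name `sign_extend` and its interface are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Extend a `bits`-wide two's complement value to 16 bits.

    The top input bit (D3, D7 or D11 for bits = 4, 8, 12) is copied
    into every higher output bit, as the buffer wiring does in hardware.
    """
    value &= (1 << bits) - 1
    sign = (value >> (bits - 1)) & 1
    if sign:
        value |= (0xFFFF << bits) & 0xFFFF  # fill the upper bits with ones
    return value


if __name__ == "__main__":
    for bits, sample in ((4, 0b1000), (8, 0x17), (12, 0x800)):
        print(f"sign{bits}({sample:#x}) = {sign_extend(sample, bits):04X}")
```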

### 4.12 2 unsigned 1-to-16 extension (Unsigned 16)

Unsigned 1-to-16 extension was developed using 16 inverters, similar to the sign 4 to 16 extension. The only difference is that the 1-bit input was connected to D0 while D1-D15 were tied down to logic 0 (2.2 V). The outputs were flipped so that the inverters work as buffers, and the area, power and delay are the same as for the sign 4 to 16 extension. Therefore the power consumption of 2 unsigned 1-to-16 extensions is $2 * 73.92 \mathrm{~mW}=147.84 \mathrm{~mW}$ and the expected area is $2 *(255.36 \mu \mathrm{m} \times 24.45 \mu \mathrm{m})=510.72 \mu \mathrm{m} \times 24.45 \mu \mathrm{m}$.


Figure 4.18: Unsigned 1 to 16 Extension Output (input changes at 83ps)

Figure 4.18 shows the unsigned 1-to-16 extension output, where output Q0 passes the 1-bit input D0 and the rest of the inputs are tied to logic-0, which is 2.2 V. The output glitch that is observed fluctuates only between 2.2228 V and 2.2223 V, a range of 0.5 mV. The simulator tool zooms in when plotting the output, so the glitch is not as significant as it looks. Only the first four bits of the output are shown in Figure 4.18.

### 4.13 4 1-bit AND gate

4 1-bit AND gate power consumption is $4^{*} 16.2148 \mathrm{~mW}=64.8592 \mathrm{~mW}$ and area is $4^{*}$ $(30.15 \mu \mathrm{~m} \times 42.90 \mu \mathrm{~m})=120.6 \mu \mathrm{~m} \times 42.9 \mu \mathrm{~m}$ with rise delay $=37.6 \mathrm{ps}$.

### 4.14 1 1-bit OR gate

One 1-bit OR gate power consumption is 16.2148 mW having an area of $30.15 \mu \mathrm{~m} \mathrm{x}$ $42.90 \mu \mathrm{~m}$ with rise delay $=46.09 \mathrm{ps}$ as stated in Table 2.6.

Table 4.1 summarizes the power, expected area and time to generate correct operation for all datapath components. The simulation results presented in this table include a 20 fF load capacitance inside every basic component to mimic post layout simulation and wire capacitance.

| Component | Power (mW) | Expected Area <br> ( $\mu \mathrm{m} \times \mu \mathrm{m}$ ) | Delay (ps) |
| :---: | :---: | :---: | :---: |
| PC, PC+1, ALU_OUT, IR | 2998.55 | $6072.96 \times 57.78$ | 101.41 |
| Z | 46.8524 | $94.89 \times 57.78$ | 101.41 |
| 16-bit 2-to-1 mux (4 used) | 1038.464 | $2336.64 \times 47.52$ | 33.99 |
| 4-bit 2 to 1 mux ( 3 used) | 194.712 | $438.12 \times 47.52$ | 33.99 |
| 16-bit 4 to 1 mux (3 used) | 2335.872 | $5257.44 \times 47.52$ | 52.51 |
| 16-bit 5 to 1 mux | 1038.464 | $2336.64 \times 47.52$ | 59.61 |
| 16-bit ALU | 8243.2 | $14145.12 \times 47.52$ | 275.6 |
| 16x16 Reg file | 24719 | $52508.64 \times 57.78$ | 296.41 |
| Sign 4 | 73.92 | $255.36 \times 24.45$ | 25.5 |
| Sign 8 | 73.92 | $255.36 \times 24.45$ | 25.5 |
| Sign 12 | 73.92 | $255.36 \times 24.45$ | 25.5 |
| Unsigned 16 (2 used) | 147.84 | $510.72 \times 24.45$ | 25.5 |
| 1-bit AND gate (4 used) | 64.8592 | $120.6 \times 42.9$ | 37.6 |
| 1 OR gate | 16.2148 | $30.15 \times 42.90$ | 46.09 |
| Total | 41.065 W | $84618 \times 57.78$ <br> $=2115.45 \times 2311.2$ | 296.41 ps |
|  <br> Frequency | 41.264 W | $2.2 \mathrm{~mm} \times 2.3 \mathrm{~mm}$ <br> $+50 \%$ area for wiring <br> $=2.2 \mathrm{~mm} \times 3.45 \mathrm{~mm}$ | 300 ps |

Table 4.1: Component Power Dissipation, Expected Area and Delay

As indicated in Table 4.1, the estimated power obtained by summing the component power dissipation is 41.065 W, whereas the simulated power is 41.264 W, which is almost identical. Most of the power is dissipated in the $16 \times 16$ register file, which accounts for 24.719 W of the 41.264 W. The $16 \times 16$ register file also dominates timing, with a critical path delay of 296.41 ps. The register file was simulated at 300 ps and has been verified as indicated in Figure 4.16. However, as stated earlier, the ALU critical path delay is 275.6 ps for 12 instructions and 348 ps for the remaining 3 instructions, where the Z flag is used.

The results in Table 4.1 should match post layout simulation, although the layout of the processor datapath has not been performed and all simulations are at the circuit (transistor) level. To mimic post layout simulation and wire capacitance, a 20 fF load capacitance was added inside every building block. Typically, post layout simulation differed from circuit level simulation by 10 ps, as indicated in Chapter 2, and 1 fF can mimic 1 ps of delay. Also, assuming $100 \mu \mathrm{m}$ of wire is necessary to connect to the next stage, an additional 10 fF ($0.1 \mathrm{ps} / \mu \mathrm{m}$) is necessary to mimic wire capacitance [22].

An attempt has been made to estimate the total area by counting the number of basic components used and adding $50 \%$ area for wiring. The expected total area, including wiring, is $2.2 \mathrm{~mm} \times 3.45 \mathrm{~mm}$ ($2200 \mu \mathrm{m} \times 3450 \mu \mathrm{m}$), leading to a power dissipation per unit area of $41.264 \mathrm{~W} /(2200 * 3450) \mu m^{2}=5.44 \mu \mathrm{W} / \mu m^{2}$.
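The totals in Table 4.1 and the power density figure can be reproduced with a few lines of arithmetic. This is a sketch: the dictionary keys and function names are hypothetical, and the numbers are copied from the Power column of Table 4.1 and from the area estimate above.

```python
# Power column of Table 4.1 (mW).
TABLE_4_1_POWER_MW = {
    "PC, PC+1, ALU_OUT, IR": 2998.55,
    "Z": 46.8524,
    "16-bit 2-to-1 mux (4 used)": 1038.464,
    "4-bit 2-to-1 mux (3 used)": 194.712,
    "16-bit 4-to-1 mux (3 used)": 2335.872,
    "16-bit 5-to-1 mux": 1038.464,
    "16-bit ALU": 8243.2,
    "16x16 Reg file": 24719.0,
    "Sign 4": 73.92,
    "Sign 8": 73.92,
    "Sign 12": 73.92,
    "Unsigned 16 (2 used)": 147.84,
    "1-bit AND gate (4 used)": 64.8592,
    "1 OR gate": 16.2148,
}


def total_power_w(table: dict) -> float:
    """Sum of component powers, converted from mW to W."""
    return sum(table.values()) / 1000.0


def power_density_uw_per_um2(power_w: float, width_um: float,
                             height_um: float) -> float:
    """Power per unit area in microwatts per square micrometre."""
    return power_w * 1e6 / (width_um * height_um)


if __name__ == "__main__":
    print(f"summed power : {total_power_w(TABLE_4_1_POWER_MW):.3f} W")
    density = power_density_uw_per_um2(41.264, 2200.0, 3450.0)
    print(f"power density: {density:.2f} uW/um^2")
```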

## Chapter 5

Processor Verification and Performance

### 5.1 Processor Verification

A handcrafted 16-bit microprocessor datapath has been designed in CML logic as indicated in Figure 5.1. An analog test bench was set up to provide 16-bit data from memory using voltage sources (vpulse) and appropriate external control signals were provided to perform verification. Three instructions have been verified, providing consecutive data. At the first cycle reset was triggered and simulation was performed at $300 \mathrm{ps}(3.33 \mathrm{GHz})$ clock period.


Figure 5.1: Handcrafted Processor Schematic

Each instruction took 4 clock cycles to execute and the results are shown in sequence in Figures 5.2-5.4.

Instruction executed in cycle order:


Figure 5.2: MOVI instruction

Reset (first cycle)

MOVI $R_{d}=\mathrm{n}$; n 8-bit (I-type)
1001000100010111 (applied at second cycle)

Explanation: $R_{d}=$ Reg-1 $=23$; move 23 into register-1. Appropriate external control signals were applied at cycles 2, 3, 4 and 5. Only selected signals of register-1 were probed, and the expected result was obtained at the beginning of the sixth clock cycle, as shown in Figure 5.2 (the fifth bit of register-1, reg1_4, is shown in Figure 5.3).

ADDI $R_{d}=R_{s}+\mathrm{n} ; \mathrm{n} 4$-bit (I-type)
1000011100010111 (applied at sixth cycle)


Figure 5.3: ADDI instruction

Explanation: $R_{d}=R_{s}+\mathrm{n}$; add the content of register $R_{s}=0001$ (register-1, which contains 23 from the previous instruction) to $\mathrm{n}=0111$ and save the result in $R_{d}=0111$. Therefore register-7 will contain $23+7=30$. Appropriate control signals were applied at cycles 6, 7, 8 and 9. $R_{d}=30=11110$ was observed at the beginning of the tenth clock cycle, as indicated in Figure 5.3.

ADD $R_{d}=R_{s}+R_{t}$ (R-type)
0000110100010111 (applied at tenth cycle)

Explanation: $R_{d}=R_{s}+R_{t}$; add the content of register $R_{s}=0001$ (register-1) to the content of $R_{t}=0111$ (register-7) and save the result in $R_{d}=1101$ (register-13). Appropriate control signals were applied at cycles 10, 11, 12 and 13, and the content of register-13 was observed to be $23+30=53$ (110101) at the beginning of the $14^{\text{th}}$ clock cycle, as indicated in Figure 5.4.


Figure 5.4: ADD instruction
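The three verified instruction words can be decoded and replayed with a small model. This is a sketch under assumptions: the field layout (opcode in bits 15:12, $R_{d}$ in 11:8, $R_{s}$ in 7:4, $R_{t}$ or the 4-bit immediate in 3:0, the 8-bit immediate in 7:0) is inferred from the bit patterns above, immediates are treated as unsigned for these positive values, and the function names are hypothetical.

```python
OPCODES = {0b1001: "MOVI", 0b1000: "ADDI", 0b0000: "ADD"}


def decode(word: int) -> dict:
    """Split a 16-bit instruction word into its assumed fields."""
    return {
        "op": OPCODES[(word >> 12) & 0xF],
        "rd": (word >> 8) & 0xF,
        "rs": (word >> 4) & 0xF,
        "rt_or_n4": word & 0xF,
        "n8": word & 0xFF,
    }


def execute(program: list) -> list:
    """Run MOVI/ADDI/ADD words on a 16-entry register file."""
    regs = [0] * 16
    for word in program:
        f = decode(word)
        if f["op"] == "MOVI":
            regs[f["rd"]] = f["n8"]
        elif f["op"] == "ADDI":
            regs[f["rd"]] = (regs[f["rs"]] + f["rt_or_n4"]) & 0xFFFF
        else:  # ADD
            regs[f["rd"]] = (regs[f["rs"]] + regs[f["rt_or_n4"]]) & 0xFFFF
    return regs


if __name__ == "__main__":
    regs = execute([0b1001000100010111,
                    0b1000011100010111,
                    0b0000110100010111])
    print(regs[1], regs[7], regs[13])
```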

Figure 5.5 shows the static power consumption of the CML processor datapath over 13 clock cycles (300 ps clock period). The processor datapath peak power of 41.51 W is reached at 306.5 ps, right after the reset signal, and the average power dissipated is 41.264 W, as indicated in Figure 5.5.

### 5.2 Performance

Performance metric MIPS was used to determine processor performance. 11 instructions take 4 clock cycles to execute, so the probability of executing any of these 11 instructions is $11 / 15$. Similarly, 1 instruction (LW) takes 5 clock cycles to execute and thus probability is $1 / 15$. Likewise, 1 instruction (JAL) takes 3 clock cycles to execute so probability is $1 / 15$. Two instructions ( J LABEL, JR) take 2 clock cycles to execute and probability is $2 / 15$.

It has been stated earlier that 3 instructions (SEQ, JZ, and JNZ) require the processor to be slowed down to operate at 2.87 GHz (348 ps) due to the worst path delay. However, this problem can be solved by re-structuring the datapath. Assuming each instruction can be performed at 3.33 GHz (300 ps), the estimated clock per instruction (CPI) is:

Figure 5.5: Static power consumption of CML processor datapath over 13 clock cycles

$$
\begin{aligned}
C P I & =4 * \frac{11}{15}+5 * \frac{1}{15}+3 * \frac{1}{15}+2 * \frac{2}{15} \\
& =3.733
\end{aligned}
$$

Assuming 1 million instructions have been executed,

$$
\begin{aligned}
\text{Execution time} & =\text{Instruction count} \times \text{clock period} \times \mathrm{CPI} \\
& =10^{6} \times 300 \mathrm{~ps} \times 3.733 \text{ clocks/instruction} \\
& =1.1199 \times 10^{-3} \mathrm{~s}
\end{aligned}
$$

$$
\begin{aligned}
\text{MIPS} & =\frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} \\
& =\frac{10^{6}}{1.1199 \times 10^{-3} \mathrm{~s} \times 10^{6}} \\
& =892.93
\end{aligned}
$$
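The estimate above can be reproduced with a short Python sketch. The cycle counts, the uniform instruction mix, the 300 ps clock period and the 1 million instruction count are taken from the text; the names `CYCLE_CLASSES`, `average_cpi` and `mips` are hypothetical. The sketch keeps the CPI as an exact fraction, so it yields about 892.86 MIPS; the value 892.93 quoted above comes from rounding the CPI to 3.733 before multiplying.

```python
from fractions import Fraction

# (clock cycles, number of instructions) for each instruction class in the text:
# 11 instructions at 4 cycles, LW at 5, JAL at 3, and J LABEL / JR at 2.
CYCLE_CLASSES = [(4, 11), (5, 1), (3, 1), (2, 2)]


def average_cpi(classes):
    """Weighted-average CPI, assuming every instruction is equally likely."""
    total = sum(count for _, count in classes)
    return sum(Fraction(cycles * count, total) for cycles, count in classes)


def mips(cpi, clock_period_s):
    """Million instructions per second for a given CPI and clock period."""
    return 1.0 / (float(cpi) * clock_period_s * 1e6)


CLOCK_PERIOD_S = 300e-12       # 3.33 GHz clock
INSTRUCTION_COUNT = 1_000_000  # as assumed in the text

avg_cpi = average_cpi(CYCLE_CLASSES)
exec_time_s = INSTRUCTION_COUNT * CLOCK_PERIOD_S * float(avg_cpi)
mips_value = mips(avg_cpi, CLOCK_PERIOD_S)

print(avg_cpi)                 # 56/15
print(f"{float(avg_cpi):.3f}")  # 3.733
print(f"{exec_time_s:.4e} s")
print(f"{mips_value:.2f} MIPS")
```

Changing the counts in `CYCLE_CLASSES` gives the estimate for a different instruction mix under the same equal-likelihood assumption.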

### 5.3 Comparison

There is no common ground on which to compare the CML processor with a CMOS processor. CMOS at the same feature size can never operate as fast as CML logic, owing to the small swing and differential architecture of CML. The smaller the swing we define, the faster CML logic can operate. In CML there is no dynamic power, only static power dissipation, which is constant and does not increase with frequency, as stated earlier in Chapter 2. The highest operating frequency for a particular technology (minimum feature size) is achieved by designing the differential pair so that, depending on the input combination, it just turns off one side and steers the full current into the other; this is what benefits CML over CMOS at higher frequency.

These days, Intel's latest processors in $90 \mathrm{~nm}$, $65 \mathrm{~nm}$ and $32 \mathrm{~nm}$ process technology can achieve $2-4 \mathrm{GHz}$ operating frequency with a typical power consumption of $150-200 \mathrm{~W}$, and their MIPS varies from 6000 to 10000 on average [3], [23]. The CML processor developed in this thesis has been implemented at the circuit level with additional load capacitance to mimic post-layout simulation and wire delay. Hardware/software coordination has not been performed, nor has a compiler been designed to implement program code (assembly language) to measure execution time. However, theoretical estimation shows the developed CML processor has MIPS $=892$ with a power consumption of 41.264 W. Today's processors are mostly pipelined and superscalar in architecture. Re-structuring the multicycle datapath of the CML processor into a pipeline can gain at least 4 times the MIPS (3568), and making it superscalar can exceed 10000 MIPS while burning $50-70 \mathrm{~W}$ in 120 nm feature size.

Comparing just a single CMOS gate to a single CML gate is misleading, for the following reasons:

1. The swing is not the same. In CML we can define a 100 mV or 600 mV swing, whatever we want, based on our requirement. A 100 mV swing will burn less power than a 600 mV swing. In CMOS, however, the swing is fixed at 1 V.
2. At higher frequency, as long as the swing is less than 1 V, CML is advantageous over CMOS.
3. At lower frequency (below 1 GHz) it depends. Typically, due to its constant power dissipation, CML is a worse choice than CMOS. Reducing the swing to 100 mV or less may benefit CML over CMOS at lower frequency.
4. CMOS at the same feature size cannot operate as fast as CML. Therefore, in order to compare the two at a particular operating frequency we need a smaller feature size for CMOS and a larger feature size for CML, which breaks the common ground for comparison.
5. One possible way to compare CMOS with CML at the same feature size is to optimize CML not at its highest operating frequency but at some lower frequency at which CMOS is capable of operating, so that frequency and feature size are the same. But in this case we are deliberately letting CML lose its benefits. As shown in Figure 2.1, CML power consumption does not increase exponentially with frequency, but rather slowly, similar to a horizontal line. Optimizing for a lower frequency (where CMOS is capable of operating) may reduce the power somewhat, but not by much. The higher constant power dissipation of CML at lower frequency will make the CMOS gate favorable, because CMOS static power is almost zero and its dynamic power falls with the lower operating frequency.

## Chapter 6

## Conclusions

The first ever 3.33 GHz MCML microprocessor datapath has been developed using 130 nm CMOS technology. No prior work exists in the literature on designing processor datapaths using MCML logic. It is the first of its kind, and its power consumption is very low compared to traditional CMOS processors. However, a prior attempt was made to develop a BiCMOS superscalar RISC microprocessor in which three-level ECL logic gates were used along with CMOS gates in the same integrated chip [5].

The reported power consumption of the developed MCML processor is 41.264 W with an estimated area of $2.2 \mathrm{~mm} \times 3.45 \mathrm{~mm}$. The expected power density per unit area is $5.44 \mu \mathrm{~W} / \mu m^{2}$. A RISC architecture was adopted in developing the 16-bit processor datapath. Out of fifteen instructions, twelve can be performed at 3.33 GHz and three at 2.87 GHz, with an estimated MIPS of 892.
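As a quick check, the power density follows directly from the reported power and the estimated chip dimensions. The sketch below uses only those figures; the variable names are hypothetical.

```python
# Power density = average power / chip area, in microwatts per square micrometre.
POWER_W = 41.264    # reported static power of the datapath
WIDTH_UM = 2.2e3    # 2.2 mm expressed in micrometres
HEIGHT_UM = 3.45e3  # 3.45 mm expressed in micrometres

area_um2 = WIDTH_UM * HEIGHT_UM                 # about 7.59e6 um^2
density_uw_per_um2 = POWER_W * 1e6 / area_um2   # W -> uW before dividing

print(f"{density_uw_per_um2:.2f} uW/um^2")  # 5.44 uW/um^2
```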

This thesis work indicates that it is possible to realize superfast processors beyond 20 GHz with minimum power consumption using today's technology. To work with the CML processor, either a voltage conversion is necessary, which can be done by amplifying the small signal swing to 1 V and then shifting it down, or the total system could be implemented in CML [24].

### 6.1 Future Work

It is possible to come up with a CML synthesizer tool such that the entire design can be automated. The idea is that CML gates, which are analog, can be optimized by proper handcrafting for multiple operating frequencies and for multi-input CML logic, similar to digital technology files. Implementing multi-input logic is beneficial because the delay is reduced while the power consumption stays the same. However, to provide enough biasing for each level of transistors, the supply voltage has to be increased. Once CML gates for multi-input logic and multiple frequencies have been optimized by proper handcrafting, the rest of the process can be automated, as is the case for digital circuits. A CML synthesizer tool should be able to pick among the different optimized CML gates such that the design meets its area and timing constraints. This could be a timely and interesting topic for any PhD student in mixed-signal design.

## Bibliography

[1] Vasanth Kakani, Foster F. Dai and R.C Jaeger, "Delay analysis and optimal biasing for high speed low power current mode logic circuits," IEEE International Symposium on Circuits and Systems, vol. 2, pp. II-869-872, May 23-26, 2004.
[2] Dr. Fa Foster Dai's, "ELEC 6190 Introduction to Digital and Analog IC Design," Slide 10, Page 22.
[3] Wikipedia, "http://en.wikipedia.org/wiki/List_of_CPU_power_dissipation".
[4] Dr. Vishwani D. Agrawal's, "ELEC 6270 Low Power Design of Electronic Circuit," Slide no 1 , page 6 .
[5] Cliff A. Maier, James A. Markevitch, Tim Sippel, Earl T. Cohen, Jim Blomgren, James G. Ballard, Jay Pattin, Viki Moldenhauer, Jeffrey A. Thomas and George Taylor, "A 533-MHz BiCMOS superscalar RISC microprocessor," IEEE Journal of Solid-State Circuits, vol. 32, pp. 1625-1634, Nov. 1997.
[6] Armin Tajalli, Eric Vittoz and Yusuf Leblebici, "Ultra low power subthreshold MOS current mode logic circuits using a novel load device concept," European Solid State Circuits Conference, pp. 304-307, Sept. 11-13, 2007.
[7] M. Alioto and G. Palumbo, "Modeling and optimized design of current mode MUX / XOR and D-flip flop," IEEE Transactions of Circuits and Systems-II, vol. 47, no. 5, pp. 452-461, May 2000.
[8] Dr. Fa Foster Dai's, "ELEC 6190 Introduction to Digital and Analog IC Design," Slide 7, Page 4, 10, 11, 16, 17, 22.
[9] Abdullah Al Owahid and Foster F. Dai, "A 41.264W, 3.33GHz Processor Datapath Using Current Mode Logics In 130nm CMOS Technology," IEEE International Symposium on Circuits and Systems, May 20-23, 2012.
[10] H. Rein, "Design consideration for very-high-speed Si-bipolar ICs operating up to 50 Gb/s," IEEE Journal of Solid-State Circuits, vol. 31, pp. 1076-1090, Aug. 1996.
[11] "International Technology RoadMap for Semiconductors," ITRS, Radio Frequency and Analog/Mixed-Signal Technologies for Wireless Communications Tech. Rep., 2005.
[12] S. Khabiri and M. Shams, "A mathematical programming approach to designing MOS current-mode logic circuits," in Proc. IEEE ISCAS, May 2005, vol. 3, pp. 2425-2428.
[13] H. Hassan, M. Anis, and M. Elmasry, "MOS current mode circuits: Analysis, design, and variability," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 8, pp. 885-898, Aug. 2005.
[14] J. Musicer and J. Rabaey, "MOS current mode logic for low power, low noise CORDIC computation in mixed-signal environments," in Proc. ISLPED, 2000, pp. 102-107.
[15] A. Tanabe, M. Umetani, I. Fujiwara, T. Ogura, K. Kataoka, M. Okihara, H. Sakuraba, T. Endoh, and F. Masuoka, "A 10 Gb/s demultiplexer IC in 0.18 um CMOS using current mode logic with tolerance to the threshold voltage fluctuation," IEEE J. Solid-State Circuits, vol. 36, no. 6, pp. 988-996, Jun. 2001.
[16] M. Allam and M. Elmasry, "Dynamic current mode logic (DyCML): A new low-power high-performance logic style," IEEE J. Solid-State Circuits, vol. 36, no. 3, pp. 550-558, Mar. 2001.
[17] M. Yamashima and H. Yamada, "A MOS current mode logic (MCML) circuit for low-power sub-GHz processors," IECE Trans. Electron., vol. E75-C, no. 10, pp. 1181-1187, Oct. 1992.
[18] Dr. Fa Foster Dai's "ELEC 6190 Introduction to Digital and Analog IC Design," Slide 7, Page 2.
[19] M. Alioto and G. Palumbo, "Design strategies for source coupled logic gates," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 50, no. 5, pp. 640-654, May 2003.
[20] A. Ismail and M. Elmasry, "A low power design approach for MOS current mode logic," in Proc. IEEE Int. SOC Conf., Sep. 2003, pp. 143-146.
[21] Osman Musa and Maitham Shams, "An efficient delay model for MOS current mode logic automated design and optimization," IEEE Transactions on Circuits and Systems-I: Regular Papers, vol. 57, no. 8, August 2010.
[22] Xueyang Geng, Fa Foster Dai, J. David Irwin and Richard C. Jaeger, "24-Bit 5.0 GHz Direct Digital Synthesizer RFIC With Direct Digital Modulations in $0.13 \mu \mathrm{~m}$ SiGe BiCMOS Technology," IEEE Journal of Solid-State Circuits, vol. 45, no. 5, pp. 944-954, May 2010.
[23] Wikipedia, "http://en.wikipedia.org/wiki/List_of_Intel_microprocessors".
[24] Xuelian Liu, Hadrian O. Aquino, Alexey Gutin and John. Mcdonald, "A 125-ps Access, 4GHz, 16KB BICMOS SRAM," IEEE International Midwest Symposium on Circuits and Systems, pp. 1222-1225, Aug 2010.

