Design of 3.33GHz CML Processor Datapath

by

Abdullah Al Owahid

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 7, 2012

Keywords: CML, CMOS, Processor

Copyright 2012 by Abdullah Al Owahid

Approved by

Fa Foster Dai, Chair, Professor of Electrical and Computer Engineering
Vishwani D. Agrawal, James J. Danaher Professor of Electrical and Computer Engineering
Victor P. Nelson, Professor of Electrical and Computer Engineering

Abstract

For almost a decade, processor speed has been stuck at operating frequencies of 2-3GHz because of the excessive power consumption of CMOS logic gates at higher frequencies, whereas the speed predicted for the present was 10-15GHz. This led to the idea of multi-core design in today's processor architectures. However, multi-core design increases communication overhead, and data dependencies prevent programs from fully exploiting the advantage of many cores. Furthermore, many-core design increases the amount of dark silicon, and the number of cores cannot be increased beyond a certain limit. Therefore, a novel approach to processor design using CML logic gates is proposed.

A handcrafted 16-bit CML microprocessor datapath has been developed with an operating frequency of 3.33GHz using 130nm CMOS technology. With the same feature size, CMOS gates are incapable of operating beyond 1GHz, whereas the CML logic gates were optimized for 12GHz using a bias current of 70% of the peak ft current with a logic swing of 600mV. Considering the critical path delay, the circuit has been slowed down to operate at 3.33GHz.

All the processor components (decoder, mux, register file, ALU) were deliberately handcrafted because of the lack of an analog synthesizer tool. The reported static power consumption of the multi-cycle CML processor datapath is 41.264W. However, this is not the best case, and it could have been reduced to 50% by implementing multi-input CML logic.
The expected chip area is 2.2mm x 3.45mm and the power density per unit area is 5.44µW/µm². The estimated performance is 892 MIPS. The supply voltage is 2.8V. The CML logic levels are defined as logic-1 = 2.8V and logic-0 = 2.2V. A 1V reference voltage provides the constant bias for the current source, and the reset signal uses 1.3V and 0.7V for its high and low logic levels respectively. It has been observed that an ultra-high-speed processor can be realized in CML logic using existing technology with minimum power consumption.

Acknowledgments

I would like to acknowledge the continuous support and guidance of Dr. Fa Foster Dai. Without his suggestions and direction it would have been impossible to complete this thesis work. I would also like to thank my committee members, Dr. Vishwani D. Agrawal for his meaningful suggestions regarding processor architecture, and Dr. Victor P. Nelson. I thank my friends and colleagues James Clark, Shannon Price, Xin Jin and Baohu Li for being with me and making life at Auburn enjoyable. Last but not least, I would like to thank my family members: my parents, whose love brought me so far, my brother and sister, and especially my wife for her patience.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Problem Statement
  1.2 Background and Motivation
  1.3 Contribution
  1.4 Organization
2 Background and High Speed CML Logic Realization
  2.1 CML Inverter
    2.1.1 CML Inverter Optimization
  2.2 CML Universal Gate
    2.2.1 Universal CML Gate Optimization
  2.3 CML XOR/XNOR Gate
  2.4 CML Mux Realization
  2.5 CML D-latch Realization
  2.6 Speed, Power, Area and Delay of Basic CML Components
3 Datapath
  3.1 First Clock Cycle of Every Instruction
  3.2 R-type Instruction
    3.2.1 R-type ADD/SUB/AND/XOR
    3.2.2 R-type SLT
    3.2.3 R-type SEQ
  3.3 I-type Instruction
    3.3.1 I-type LW
    3.3.2 I-type SW
    3.3.3 I-type ADDI
    3.3.4 I-type MOVI
  3.4 J-type Instruction
    3.4.1 J-type J LABEL
    3.4.2 J-type JZ LABEL
    3.4.3 J-type JNZ LABEL
    3.4.4 J-type JAL
    3.4.5 J-type JR
  3.5 Control Signals
4 Component Realization
  4.1 16-bit Register With Enable Input
  4.2 Z Register (1-bit Register With Enable Input)
  4.3 4 16-bit 2-to-1 Mux
  4.4 3 4-bit 2-to-1 Mux
  4.5 3 16-bit 4-to-1 Mux
  4.6 16-bit 5-to-1 mux
  4.7 16-bit ALU
  4.8 16x16 Register File
  4.9 Sign 4-to-16 extension (Sign 4)
  4.10 Sign 8-to-16 extension (Sign 8)
  4.11 Sign 12-to-16 extension (Sign 12)
  4.12 2 unsigned 1-to-16 extension (Unsigned 16)
  4.13 4 1-bit AND gate
  4.14 1 1-bit OR gate
5 Processor Verification and Performance
  5.1 Processor Verification
  5.2 Performance
  5.3 Comparison
6 Conclusions
  6.1 Future Work
Bibliography

List of Figures

1.1 Operating frequency over time
1.2 Operating Frequency vs. Power of Intel Processor
1.3 Power density per unit area
2.1 Current consumptions for CMOS vs. CML logic
2.2 CMOS vs. CML power consumption
2.3 CML Inverter
2.4 Normalized current for CML inverter
2.5 CML inverter half-circuit small-signal model
2.6 Post vs. pre layout simulation of CML inverter with 18fF input/output load capacitance (input changes at 83ps)
2.7 CML inverter layout
2.8 Universal CML gate
2.9 Universal CML gate with embedded level shifter
2.10 Normalized current for CML universal gate
2.11 Post vs. pre layout simulation of CML AND with 18fF input/output load capacitance (input changes at 83ps)
2.12 CML AND layout
2.13 CML XOR gate
2.14 Post vs. pre layout simulation of CML XOR with 18fF input/output load capacitance (input changes at 83ps)
2.15 CML XOR layout
2.16 CML Mux realization
2.17 Post vs. pre layout simulation of CML Mux with 18fF input/output load capacitance (input changes at 83ps)
2.18 CML Mux layout
2.19 CML D-latch
2.20 Post vs. pre layout simulation of CML D-latch with 18fF input/output load capacitance at 6GHz (166ps)
2.21 CML D-latch layout
3.1 Processor datapath
3.2 First clock cycle of any instruction
3.3 R-type ADD/SUB/AND/XOR
3.4 R-type SLT
3.5 R-type SEQ
3.6 I-type LW
3.7 I-type SW
3.8 I-type ADDI
3.9 I-type MOVI
3.10 J-type J LABEL
3.11 J-type JZ LABEL
3.12 J-type JNZ LABEL
3.13 J-type JAL
3.14 J-type JR
4.1 Datapath Components
4.2 Block diagram of MS DFF, MS DFF-EN, 16-bit register
4.3 1-bit register output at 6GHz with 20fF load capacitance (clock period 166ps)
4.4 1-bit 4-to-1 mux
4.5 1-bit 4-to-1 mux output with 20fF load capacitance (input changes at 83ps)
4.6 1-bit 5-to-1 mux
4.7 1-bit 5-to-1 Mux output with 20fF load capacitance (input changes at 166ps)
4.8 Block diagram of 16-bit ALU
4.9 16-bit CLA block diagram
4.10 Critical path delay in 16-bit CLA is 224.7ps (input changes at 500ps)
4.11 16-bit ALU Output (input changes at 300ps)
4.12 16x16 Register File Schematic
4.13 4-to-16 Decoder (input changes at 83ps)
4.14 1-bit 16-to-1 Mux Schematic
4.15 1-bit 16-to-1 Mux Output (input changes at 144ps)
4.16 16x16 Register File Output at 3.33GHz
4.17 Sign 4 to 16 Extension Output (input changes at 72ps)
4.18 Unsigned 1 to 16 Extension Output (input changes at 83ps)
5.1 Handcrafted Processor Schematic
5.2 MOVI instruction
5.3 ADDI instruction
5.4 ADD instruction
5.5 Static power consumption of CML processor datapath over 13 clock cycles

List of Tables

2.1 Inverter Operation
2.2 CML AND Operation
2.3 CML XOR Operation
2.4 CML Mux Operation
2.5 CML D-latch Operation
2.6 Power, Area and Delay of Basic Components (post layout simulation with 18fF load capacitance)
3.1 Instruction Set Architecture
3.2 Opcode and 15 Different Operations
3.3 Control Signal Table Part-1
3.4 Control Signal Table Part-2
4.1 Component Power Dissipation, Expected Area and Delay
List of Abbreviations

ALU      Arithmetic Logic Unit
BiCMOS   Bipolar and CMOS in same integrated chip
BJT      Bipolar Junction Transistor
CML      Current Mode Logic
CMOS     Complementary Metal Oxide Semiconductor
ECL      Emitter Coupled Logic
ISA      Instruction Set Architecture
MCML     MOS Current Mode Logic
MIPS     Million Instructions Per Second
nMOS     n-type Metal Oxide Semiconductor
RISC     Reduced Instruction Set Computing
SoC      System on Chip
SPICE    Simulation Program with Integrated Circuit Emphasis
SRAM     Static Random Access Memory

Chapter 1
Introduction

This thesis presents the design of a handcrafted 3.33GHz CML processor datapath. A RISC architecture has been adopted in designing the multi-cycle processor datapath, and the ISA is 16 bits long. It is the first MCML processor, and unlike CMOS processors its power dissipation is constant. Once the design is optimized, power dissipation does not increase with increasing operating frequency. Optimizing CML logic for a higher operating frequency requires more power than optimizing for a lower frequency. Therefore, once CML logic has been optimized for a targeted maximum frequency, operating the circuit at a lower frequency loses the benefit in terms of power.

Because of the larger switching noise associated with CMOS circuits, CML is a better choice for high speed circuit realization [1]. BJT-based CML gates are faster than MOSFET CML gates, owing to higher gm, and consume less power, but they have been avoided because of process difficulty and because they are uncommon in digital circuit design. Therefore MOSFET-CML (MCML) logic has been used.

1.1 Problem Statement

The problem solved in this thesis: design a high speed, low power processor datapath.

1.2 Background and Motivation

Processor speed has been stuck at 2-3GHz for the last 10 years because of excessive power consumption, as indicated in Figure 1.1 [2]. Multi-core design has therefore evolved to increase performance. But the number of cores cannot be increased beyond a certain limit, and multi-core design brings communication overhead.
Further, because of data dependency, programs cannot be fully parallelized, which inhibits proper exploitation of multi-core design.

Figure 1.1: Operating frequency over time

Intel processor speed and power data have been obtained and plotted in Matlab, as depicted in Figure 1.2 [3].

Figure 1.2: Operating Frequency vs. Power of Intel Processor (axes: Power (W) and Processor Operating Frequency (GHz); labelled points run from Pentium through Core i7-980X Extreme Edition (6 core))

Figure 1.2 shows that processor power consumption has increased almost exponentially at operating frequencies beyond 2GHz. It is also notable that the Core i5 has a higher operating frequency than the Core i7, but the power consumption of the Core i7 is almost double because of its higher number of cores. So reducing the number of cores will not only reduce power but can also result in higher performance because of less data dependency.

Power consumption can also be reduced by moving to deep submicron technology. However, the main drawback of reducing feature size is that it increases the power density per unit area, and chips cannot sustain that power, as shown in Figure 1.3 [4].

Figure 1.3: Power density per unit area

In CML, the power density per unit area is lower because CML requires greater area than CMOS, owing to the load resistances in CML logic. Therefore, in CML we can achieve a higher operating frequency with less power density.

A previous mixed signal superscalar processor was developed in 1997 using a 0.5µm BiCMOS process, and it required 3.6V and 2.1V power supplies. The reported operating frequency was 533MHz, and the design used a PowerPC architecture that contained three pipelines and a large on-chip secondary cache to achieve a peak performance of 1600 MIPS.
The 15mm x 10mm die contained 2.7M transistors (2M CMOS and 0.7M bipolar) and dissipated less than 85W [5]. All logic circuits were implemented in three-level emitter coupled logic (ECL), and only the RAM structures were implemented with CMOS circuits.

Although BJTs have less switching noise and are faster than MOS transistors, they are expensive and not frequently used in digital circuits [1]. This led to the idea of designing a high speed processor datapath using MCML logic. Further, CMOS SRAM cannot operate at 3.33GHz using 0.12µm technology. Therefore only the high speed datapath was developed in this thesis.

1.3 Contribution

CML logic architecture has been discussed in the literature, but it is not guaranteed to realize digital functions unless optimization has been performed [6]-[8]. Optimized CML logic can steer the full bias current for the different input combinations, resulting in a full voltage swing that differentiates the logic states. Also, the technology files provide only transistors, unlike digital technology files, which provide optimized CMOS logic gates. Therefore, because of the lack of analog synthesizer tools, the CML logic architecture must be handcrafted and then optimized so that the CML logic realizes the digital function at the targeted frequency.

All the basic components were derived first, with a maximum operating frequency of 12GHz using 130nm CMOS technology. The bias current was chosen to be 70% of the peak ft current, which gives the highest possible operating frequency without burning out the transistors in operation. Further, biasing CML gates at 70% of the peak ft current or less may incur a 10% propagation delay penalty but can save more than 40% power [1], [7]. The logic swing was set to 600mV, on the assumption that this is advantageous over CMOS logic, which has a fixed swing of 1V, at this frequency. These basic CML logic designs were later used in realizing the 16-bit 3.33GHz datapath components.
All the processor components (mux, register file, ALU) were deliberately handcrafted in Cadence Virtuoso. The reported static power consumption of the multi-cycle CML processor is 41.264W, and the power density per unit area is 5.44µW/µm² = 544W/cm², which is below the power density per unit area of traditional CMOS processors, as indicated in Figure 1.3. The estimated performance of the multi-cycle CML processor datapath is 892 MIPS and the expected chip area is 2.2mm x 3.45mm.

This work has been accepted at the IEEE International Symposium on Circuits and Systems (ISCAS) 2012 Conference [9].

1.4 Organization

The thesis is organized as follows. Chapter 2 provides background and describes high speed CML logic realization. Chapter 3 introduces the datapath design. Chapter 4 describes CML datapath component realization and verification. Chapter 5 presents processor verification, performance and comparison. Chapter 6 draws conclusions and discusses future work.

Chapter 2
Background and High Speed CML Logic Realization

Recently, interest in high speed digital circuits based on MCML/BJT-CML has been increasing because of their low power consumption [10]. Noise coupling between digital circuitry and sensitive analog blocks has always been a major obstacle in complete system on chip (SoC) design [11]. MOS current mode logic (MCML) is a promising alternative to conventional CMOS in mixed signal applications. Many efforts have been made to realize the potential of MCML [12]-[17]. Even though MCML has been shown to dissipate less power than CMOS at operating frequencies above 300MHz, designers have been reluctant to adopt MCML in place of CMOS [14]. The high complexity of MCML and the lack of automation tools made it impossible to produce robust and power-efficient designs while maintaining low cost and reasonable time to market.

Figure 2.1: Current consumptions for CMOS vs. CML logic

Figure 2.1 shows typical current consumption for CMOS and CML logic [18].
As indicated, CMOS is typically beneficial at lower frequencies, whereas CML takes less power at higher operating frequencies. Figure 2.1 is based on assumption rather than circuit measurements, and it is not drawn to scale. CML is typically supposed to be more beneficial than CMOS beyond 1GHz.

Figure 2.2: CMOS vs. CML power consumption (axes: Frequency (GHz) and Power (mW); curves: CMOS Divide by 2 Power Dissipation, CML Divide by 2 Power Dissipation)

A divide-by-two circuit was realized in both CMOS (W = 160nm, 160nm, 200nm, 600nm, 1µm, 1.5µm, 2.5µm, 3µm, 7µm and 10µm for 0.5, 1, 1.2, 1.4, 1.5, 1.6, 1.8, 2, 2.3 and 2.5GHz respectively) and CML (W = 160nm, 160nm, 160nm, 200nm, 250nm and 300nm, each with Wtail = 120nm, for 0.5, 1, 2, 2.5, 3 and 4GHz respectively). The channel length L = 120nm was fixed in both cases, and the power is plotted in Figure 2.2. At 2.5GHz the CMOS D flip-flop propagation delay reaches 50% of the input clock cycle, whereas the CML D flip-flop was able to generate correct output up to 6GHz. Figure 2.2 indicates that CML dominates over CMOS beyond 1.5GHz.

The benefits of the MCML circuit topology over CMOS are largely independent of technology [6]. Many successful attempts have been made to expose the relationships between the MCML gate delay and the various design parameters [19]-[21]. These efforts have provided insight into the design considerations, which are described briefly here for CML inverters and universal CML logic. There is no straightforward method to optimize CML logic, and an accurate propagation delay model for CML logic is very hard to derive with pen and paper because of higher order effects. Therefore, we rely on some approximations in the later subsections and compare them with the simulation results in designing high speed CML gates.

2.1 CML Inverter

Figure 2.3: CML Inverter

Figure 2.3 shows a CML inverter, which is typically a differential pair.
There exists a particular biasing voltage $V_{in}$ (the quiescent point) for which $I_{d1} = I_{d2} = I_{bias}/2$. For a differential input equal to our logic swing $\Delta V$, the circuit must be tuned such that when logic-1 ($V_{in} + \Delta V/2$) is applied to A+ and logic-0 ($V_{in} - \Delta V/2$) is applied to A-, T1 turns on and T2 turns off, giving $I_{d1} = I_{bias}$ and $I_{d2} = 0$ and steering all the current through T1. A summary of CML inverter operation is given in Table 2.1.

A+            A-            T on    T off   Out+          Out-
Vin - ΔV/2    Vin + ΔV/2    2, 3    1       Vin + ΔV/2    Vin - ΔV/2
Vin + ΔV/2    Vin - ΔV/2    1, 3    2       Vin - ΔV/2    Vin + ΔV/2

Table 2.1: Inverter Operation

2.1.1 CML Inverter Optimization

If our supply voltage is $V_{dd}$, as indicated in Figure 2.3, and the output of the CML gate drives another gate, then we must maintain:

$$\text{Logic-1} = V_{dd} = V_{in} + \Delta V/2 \qquad \text{and} \qquad \text{Logic-0} = V_{dd} - \Delta V = V_{in} - \Delta V/2$$

Let us assume logic-1 ($V_{in} + \Delta V/2$) is applied to A+ and logic-0 ($V_{in} - \Delta V/2$) is applied to A-. To keep T1 in its forward-active region and T2 in its cut-off region, the necessary conditions are:

$$V_{gs1} > V_{th1}; \quad V_{gd1} < V_{th1}; \quad V_{ds1} > V_{gs1} - V_{th1}$$
$$V_{gs2} < V_{th2}; \quad V_{gd2} < V_{th2}$$

If we assume that the magnitude of the voltage swing is just large enough to steer all the current from one side to the other, then $\Delta V = \Delta V_{min}$, the minimum swing required (just large enough to turn off T2):

$$V_{gs2} = V_{th2}$$
$$V_{g2} - V_x = V_{th2}$$
$$V_{in} - \Delta V_{min}/2 - V_x = V_{th2}$$
$$V_{dd} - \Delta V_{min}/2 - \Delta V_{min}/2 - V_x = V_{th2}$$
$$V_x = V_{dd} - \Delta V_{min} - V_{th2} \tag{2.1}$$

$$V_{gs1} = V_{g1} - V_x$$
$$V_{gs1} = V_{in} + \Delta V_{min}/2 - V_x$$
$$V_{gs1} = V_{dd} - V_x \tag{2.2}$$

$$I_{bias} = \frac{\mu_n C_{ox}}{2}\,\frac{W}{L}\,(V_{gs1} - V_{th1})^{\alpha} \tag{2.3}$$

where $\mu_n$ is the electron mobility in the nMOS device and $C_{ox}$ is the capacitance per unit area of the gate oxide, $C_{ox} = \epsilon_{ox}/t_{ox}$. The permittivity of SiO2 is $\epsilon_{ox} = 3.9\,\epsilon_0$, where the permittivity of free space is $\epsilon_0 = 8.85 \times 10^{-14}$ F/cm, and $t_{ox}$ is the thickness of the gate oxide. For a short channel device, such as one with 0.12µm feature size, $\alpha \simeq 1.25$. To ease the pen and paper calculation, we will sometimes assume that $\alpha$ is 2.
For matching purposes, T1 and T2 must be the same size so that an equal amount of current flows on both sides at the quiescent point. Therefore, with $V_{th1} = V_{th2}$, equation 2.1 yields

$$V_{th1} = V_{dd} - \Delta V_{min} - V_x$$

and

$$V_{gs1} - V_{th1} = V_{dd} - V_x - V_{dd} + \Delta V_{min} + V_x = \Delta V_{min} \tag{2.4}$$

$$I_{bias} = \frac{\mu_n C_{ox}}{2}\,\frac{W}{L}\,(\Delta V_{min})^{\alpha} \tag{2.5}$$

Therefore $I_{bias}$ can be expressed as a function of the minimum logic swing $\Delta V_{min}$, with $I_{bias} \propto \Delta V_{min}^{\alpha}$. This means that the smaller the logic swing we define, the less bias current we need to realize a CML inverter. The minimum current required to realize this logic swing occurs when W is minimum. The full swing can also be realized at a higher bias current, but that burns more power. The minimum $W_n$ that provides the smallest bias current sufficient to realize the full swing satisfies:

$$I_L = \frac{\mu_n C_{ox}}{2}\,\frac{W_n}{L}\,(\Delta V_{min})^{\alpha} \tag{2.6}$$

Again, let us consider the CML inverter circuit at its quiescent point, as indicated in Figure 2.3. In determining our minimum logic swing $\Delta V_{min}$ we need to consider how the transistors are forward biased, that is, what the overdrive voltage should be. The overdrive voltage is defined as $V_{ov} = V_{gs} - V_t$. Typically, a transistor reaches soft saturation at an overdrive voltage of about 300mV and hard saturation at about 500mV. The overdrive voltage must be chosen carefully: if we apply logic-1 at A+ and logic-0 at A- to turn off T2, but T2 has already reached velocity saturation at the quiescent point, then $\Delta V/2$ is too small to turn T2 off completely, and the full swing cannot be realized. Making the logic swing larger can turn T2 off, but we end up with a greater RC constant and a high propagation delay. Therefore, we need an overdrive voltage that pushes the transistor only slightly into the forward active region. After careful consideration, $V_{ov} = V_{gs} - V_t = 263.9$mV was chosen. It is therefore necessary to expose the relation between logic swing and overdrive voltage.
For an input voltage $V_{g1}$ at the gate terminal of T1,

$$V_{gs1} = V_{g1} - V_x, \qquad V_{g1} = V_{gs1} + V_x \qquad \text{and} \qquad V_{g2} = V_{gs2} + V_x$$

For the differential input voltage, which is our logic swing,

$$\Delta V = V_{g1} - V_{g2} = V_{gs1} - V_{gs2}$$

Taking $\alpha = 2$ for simplicity,

$$I_{d1} = \frac{1}{2}\,\mu C_{ox}\,\frac{W}{L}\,(V_{gs1} - V_t)^2 \tag{2.7}$$
$$I_{d2} = \frac{1}{2}\,\mu C_{ox}\,\frac{W}{L}\,(V_{gs2} - V_t)^2$$

Taking square roots, subtracting, and then squaring, with $I_{d1} + I_{d2} = I_{bias}$ and $K = \mu C_{ox}(W/L)$:

$$\sqrt{I_{d1}} - \sqrt{I_{d2}} = \Delta V \sqrt{\tfrac{1}{2}\,\mu C_{ox}\,\tfrac{W}{L}}$$
$$I_{d1} + I_{d2} - 2\sqrt{I_{d1} I_{d2}} = \tfrac{1}{2} K\,\Delta V^2$$
$$I_{bias} - 2\sqrt{I_{d1} I_{d2}} = \tfrac{1}{2} K\,\Delta V^2$$
$$2\sqrt{I_{d1}\,(I_{bias} - I_{d1})} = I_{bias} - \tfrac{1}{2} K\,\Delta V^2$$
$$-4 I_{d1}^2 + 4 I_{d1} I_{bias} = I_{bias}^2 - I_{bias} K\,\Delta V^2 + \tfrac{1}{4} K^2 \Delta V^4$$
$$4 I_{d1}^2 - 4 I_{d1} I_{bias} + I_{bias}^2 - I_{bias} K\,\Delta V^2 + \tfrac{1}{4} K^2 \Delta V^4 = 0$$
$$I_{d1} = \frac{4 I_{bias} \pm \sqrt{16 I_{bias}^2 - 16 I_{bias}^2 + 16 I_{bias} K\,\Delta V^2 - 4 K^2 \Delta V^4}}{8}$$
$$I_{d1} = \frac{2 I_{bias} \pm \Delta V \sqrt{4 I_{bias} K - K^2 \Delta V^2}}{4}$$
$$I_{d1} = \frac{I_{bias}}{2} \pm \frac{1}{4}\,\Delta V \cdot 2\sqrt{I_{bias} K}\,\sqrt{1 - \frac{K\,\Delta V^2}{4 I_{bias}}}$$
$$I_{d1} = \frac{I_{bias}}{2} \pm \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}}$$

As logic-1 was applied to A+ and logic-0 to A-, the differential input $V_{id} > 0$, resulting in $I_{d1} > I_{d2}$:

$$I_{d1} = \frac{I_{bias}}{2} + \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}} \tag{2.8}$$
$$I_{d2} = \frac{I_{bias}}{2} - \frac{\Delta V}{2}\sqrt{I_{bias} K}\,\sqrt{1 - \frac{(\Delta V/2)^2}{I_{bias}/K}} \tag{2.9}$$

At the biasing (quiescent) point, $V_{id} = 0$, resulting in $I_{d1} = I_{d2} = I_{bias}/2$. Therefore equation 2.7 yields

$$\frac{I_{bias}}{2} = \frac{1}{2}\,\mu C_{ox}\,\frac{W}{L}\,(V_{gs1} - V_t)^2 = \frac{K}{2}\,V_{ov}^2, \qquad K = \frac{I_{bias}}{V_{ov}^2}$$

Plugging this value of K into equations 2.8 and 2.9 we get

$$I_{d1} = \frac{I_{bias}}{2} + \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\,\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \tag{2.10}$$
$$I_{d2} = \frac{I_{bias}}{2} - \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\,\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2} \tag{2.11}$$

Now if we want to steer the full current through T1, so that $I_{d1} = I_{bias}$ and $I_{d2} = 0$, then equation 2.10 yields

$$I_{bias} = \frac{I_{bias}}{2} + \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\,\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2}$$
$$\frac{I_{bias}}{2} = \frac{\Delta V}{2}\,\frac{I_{bias}}{V_{ov}}\,\sqrt{1 - \left(\frac{\Delta V/2}{V_{ov}}\right)^2}$$
$$1 = \frac{\Delta V}{V_{ov}}\,\sqrt{1 - \frac{\Delta V^2}{4 V_{ov}^2}}$$
$$1 = \left(\frac{\Delta V}{V_{ov}}\right)^2 - \frac{1}{4}\left(\frac{\Delta V}{V_{ov}}\right)^4 = \left(\frac{\Delta V}{V_{ov}}\right)^2 \left(1 - \frac{\Delta V^2}{4 V_{ov}^2}\right)$$
$$\left(\frac{V_{ov}}{\Delta V}\right)^2 = \frac{4 V_{ov}^2 - \Delta V^2}{4 V_{ov}^2}$$
$$4 V_{ov}^4 = 4 V_{ov}^2\,\Delta V^2 - \Delta V^4$$
$$\left(2 V_{ov}^2 - \Delta V^2\right)^2 = 0$$
$$\Delta V = \pm\sqrt{2}\,V_{ov} \tag{2.12}$$

This means that for a swing of $\Delta V = +\sqrt{2}\,V_{ov}$ one of the transistors turns fully on and pushes the other off, and at $\Delta V = -\sqrt{2}\,V_{ov}$ the roles are reversed, as indicated in Figure 2.4.

Figure 2.4: Normalized current for CML inverter

With $V_{ov} = 263.9$mV at the biasing point shown earlier, equation 2.12 gives a minimum swing of $\pm\sqrt{2} \times 263.9$mV $= \pm 373.21$mV about the biasing point. Figure 2.4 shows that at +300mV from the biasing point T1 passes all the bias current, and at -300mV from the biasing point T2 turns on completely, resulting in a total logic swing $\Delta V = 600$mV. The 73mV discrepancy occurs because this is a short channel device, for which $\alpha$ is not 2 but close to 1.25.

Equation 2.12 can be extended further by expressing $V_{ov}$ in terms of the bias current and W:

$$\Delta V = \pm\sqrt{2}\,V_{ov} = \pm\sqrt{2}\,\sqrt{\frac{I_{bias}}{K}}$$
$$\Delta V = \pm\sqrt{\frac{2 I_{bias}}{\mu C_{ox}\,\frac{W}{L}}} \tag{2.13}$$

This means that the larger the W/L ratio, the faster we move through the switching region. But making W larger increases the parasitic capacitance, and at higher operating frequencies the RC constant will dominate. To reach the full swing $\Delta V = I_{bias} R_L$ faster, we can increase the bias current and so reduce $R_L$. The bias current was tuned to 1.65mA for W = 7µm and L = 120nm, which is 70% of the peak ft current, and the load resistance $R_L$ was tuned to 348Ω. The input gate capacitance of each basic CML component is $C_{gg} = 8$fF. The delay contributed by wire capacitance is stated to be 0.10ps/µm [22]. Assuming 100µm of wire is required to connect to the next stage, a 10ps delay should be added, and typically 1fF is needed to mimic 1ps of delay. Therefore 10 + 8 = 18fF was added as the output load capacitance ($C_L$). The propagation delay is $P_d = 0.69RC$, where C is the capacitance looking into the drain of T1/T2.
$$P_d = 0.69R_L(C_{intr} + C_L)$$
$$P_d = 0.69R_L(C_{gd1} + C_{db1} + C_L) = 0.69\frac{\Delta V}{I_{bias}}(C_{gd1} + C_{db1} + C_L) \tag{2.14}$$

Figure 2.5: CML inverter half-circuit small-signal model

Here Cintr is the intrinsic capacitance, which equals the gate-to-drain (Cgd1) plus drain-to-bulk (Cdb1) capacitance, as shown in Figure 2.5. Equation 2.14 means that the higher the bias current, the lower the propagation delay. However, the bias current is increased by increasing W, and at higher frequencies a further increase in bias current will not reduce the propagation delay because parasitic capacitance will dominate. Hence, the optimized widths obtained are 7µm for the upper-level transistors and 4µm for the current source, all with channel length 120nm and bias current Ibias=1.65mA.

All these intuitive design considerations were included in designing CML logic circuits that can achieve a maximum operating frequency of 12GHz with 0.12µm feature size. The supply voltage used was Vdd=2.8V, with logic-1 = 2.8V, logic-0 = 2.2V, a constant biasing voltage Vg = 1V for the current source, and logic swing ΔV = 600mV.

Figure 2.6 shows post vs. pre layout simulation of a CML inverter operating at 12GHz (input changes at 83ps) with 18fF input/output load capacitance. The rising delay of the inverter in circuit-level simulation is 17.47ps and the extracted layout delay is 24.79ps, so the layout simulation differs from the circuit-level simulation by 7.32ps.

Figure 2.6: Post vs. pre layout simulation of CML inverter with 18fF input/output load capacitance (input changes at 83ps)

The power consumption of the CML inverter is P = Vdd*Ibias = 2.8V * 1.65mA = 4.62mW, which is constant. Operating this optimized CML inverter at 1MHz or at 12GHz dissipates the same amount of power; power consumption is therefore independent of operating frequency.

Layout of the CML inverter was also performed and the reported area is 15.960µm x 24.450µm, as indicated in Figure 2.7.
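The current-steering relations (equations 2.10 to 2.12), the delay expression (equation 2.14) and the static power formula can be checked numerically. The following is a minimal Python sketch under the square-law assumption (α = 2) used in the derivation. The function names are illustrative and not part of the thesis, and because the intrinsic capacitances Cgd1 and Cdb1 are not quoted in the text, they are left as a parameter that the caller must supply.

```python
import math


def steering_currents(delta_v, i_bias, v_ov):
    """Return (Id1, Id2) in amperes for a differential input delta_v (volts).

    Implements equations 2.10 and 2.11. Outside |delta_v| = sqrt(2) * v_ov
    (equation 2.12) one transistor is taken to carry the whole bias current.
    """
    limit = math.sqrt(2.0) * v_ov
    if delta_v >= limit:
        return i_bias, 0.0
    if delta_v <= -limit:
        return 0.0, i_bias
    x = (delta_v / 2.0) / v_ov
    offset = (delta_v / 2.0) * (i_bias / v_ov) * math.sqrt(1.0 - x * x)
    return i_bias / 2.0 + offset, i_bias / 2.0 - offset


def propagation_delay(r_load, c_intr, c_load):
    """Equation 2.14: Pd = 0.69 * RL * (Cintr + CL), in seconds.

    c_intr (Cgd1 + Cdb1) is not given in the text and must be supplied.
    """
    return 0.69 * r_load * (c_intr + c_load)


def static_power(v_dd, i_bias):
    """Static power of one current branch, P = Vdd * Ibias, in watts."""
    return v_dd * i_bias


if __name__ == "__main__":
    I_BIAS, V_OV = 1.65e-3, 0.2639  # values quoted in the text
    for dv in (-0.3, 0.0, 0.3):
        id1, id2 = steering_currents(dv, I_BIAS, V_OV)
        print(f"dV = {dv:+.1f} V: Id1/Ibias = {id1 / I_BIAS:.3f}, "
              f"Id2/Ibias = {id2 / I_BIAS:.3f}")
    print(f"P = {static_power(2.8, I_BIAS) * 1e3:.2f} mW")
```

With the square-law model and Vov = 263.9mV, a +300mV input steers roughly 97% of the bias current into T1 rather than all of it, which is in line with the text's remark that the measured full-steering point differs from the α = 2 prediction.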
Figure 2.7: CML inverter layout

2.2 CML Universal Gate

A universal CML gate that can realize the AND, NAND, OR and NOR functions is shown in Figure 2.8. With the input logic combination indicated in Figure 2.8, the universal gate realizes an AND function. Reversing the outputs realizes a NAND function. Reversing all the inputs and outputs realizes an OR function, and thus a NOR can be realized as well.

Figure 2.8: Universal CML gate

Vin=2.5V and ΔV=600mV were chosen carefully, leading to logic-1=2.8V and logic-0=2.2V, as shown earlier in the CML inverter section. The biasing conditions are the same as before. If we assume logic-1 is applied at A+ and logic-0 at A-, then

Vgs1 > Vth1; Vgd1 < Vth1; Vds1 > Vgs1 - Vth1
Vgs2 < Vth2; Vgd2 < Vth2

For input B, the levels have to be lower by at least VDS so that the conditions VGD3 < Vth and VGD4 < Vth are met when T3 and T4 operate (are in the forward active region). Therefore logic-1 for input B is Vin + ΔV - VDS and logic-0 is Vin - ΔV - VDS.

Due to the discrepancy of logic levels for the second input, a level shifter is necessary and has been embedded within each gate, as indicated in Figure 2.9. Previously the power of the CML gate was Vdd*Ibias; with the level shifter it is Vdd*3Ibias, so total power increases by a factor of 3.

Figure 2.9: Universal CML gate with embedded level shifter

T7 and T8 have to be the same size as T1 and T2 for matching purposes, so that T7/T8 can mimic the voltage drop VDS across T1/T2. To balance the differential architecture, T5 is necessary along with T4, because T3 sees either T1 or T2 as a load; T5 is also used for preventing breakdown. T6, T9 and T10 are of the same size, and a constant biasing voltage (Vg) is applied so that each acts as a current source. There are three paths in the universal CML gate, and it is necessary to make sure that the total bias current (Ibias) flows in exactly one of these paths from Vdd to Vss, depending on the input logic combination.
Splitting of Ibias is not acceptable because full swing realization will not be possible. Table 2.2 shows the path that is turned on for each input logic combination and the realization of the CML AND function. Note that the entries at B+/B- are unshifted logic levels: they are applied to input B at the level shifter, are shifted down by VDS, and then propagate to input B of the CML AND. Table 2.2 indicates that the lowest level (2nd level, input B) dominates in path selection.

A+       A-       B+       B-       Out+     Out-     T off     T on      Path
Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      1, 2, 3   5, 4, 6   T5, T4, T6
Vdd-ΔV   Vdd      Vdd      Vdd-ΔV   Vdd-ΔV   Vdd      1, 4, 5   2, 3, 6   T2, T3, T6
Vdd      Vdd-ΔV   Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      1, 2, 3   5, 4, 6   T5, T4, T6
Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   Vdd      Vdd-ΔV   2, 4, 5   1, 3, 6   T1, T3, T6

Table 2.2: CML AND Operation

2.2.1 Universal CML Gate Optimization

When a path turns on, it is necessary to steer all the bias current (Ibias) to that path, as stated before, to realize the full logic swing, ΔV = RL * Ibias. The more bias current we have, the faster the swing is realized. Intuitive techniques to optimize the CML gate have been shown in the CML inverter optimization subsection. It was found that, using 130nm CMOS technology, the highest operating frequency we can obtain for the CML inverter is 12GHz, with upper transistors of W=7µm, L=120nm and a bias current of 1.65mA, for which the current source size is W=4µm, L=120nm. As we cannot exceed this operating frequency for the chosen parameters (logic swing ΔV=600mV, supply voltage Vdd=2.8V, logic-1=2.8V, logic-0=2.2V), the same size has been used for the upper transistors in the CML universal gate. In this case, however, there is a second level of inputs that contributes additional parasitic capacitance, so the bias current has been increased slightly, from 1.65mA to 1.753mA (for which the current source size is W=7µm, L=120nm), to operate at 12GHz.
For the upper level (and 2nd level) transistors, around Vov=200mV has been provided such that a 300mV swing from the biasing point can turn the transistors off and on. It should be noted that at the quiescent point Ir1 ≠ Ir2, as indicated in Figure 2.10. The reason is that Out+ has 2 paths at the Q point but Out- has 1 path to Vss. Therefore, typically Ir1 = (1/3)Ir2, as observed in Figure 2.10, which shows Ir1 = 517.5µA and Ir2 = 1.238mA out of the total bias current of 1.753mA. Therefore, initially, Out+ tends to be close to logic-0 (2.2V), at 2.371V, and Out- tends to be close to logic-1 (2.8V), at 2.62V, at the Q point. As can be seen in the normalized current plot of the universal CML gate in Figure 2.10, Id1 ≠ Id2 at the biasing point. Full current steering was nevertheless made possible over the -300mV to +300mV swing, which is our objective.

Figure 2.10: Normalized current for CML universal gate

Figure 2.11 shows post vs. pre layout simulation of the CML AND gate at 12GHz (input changes at 83ps). A load capacitance CL=18fF was added to mimic the next-stage input capacitance, Cgg=8fF, plus 10fF to mimic 100µm of wire connecting the next stage, as for the CML inverter [22]. The circuit-level simulation shows a rising delay of 35.6ps, whereas the extracted layout simulation has a rising delay of 37.6ps. The bias current is Ibias=1.753mA and the 2 other current sources supply 2.019mA each, for a total power consumption of 2.8V * (1.753 + 2.019 + 2.019)mA = 16.2148mW.

Figure 2.11: Post vs. pre layout simulation of CML AND with 18fF input/output load capacitance (input changes at 83ps)

The reported area of the CML AND gate is 30.150µm x 42.900µm, as indicated in Figure 2.12, and this area should be the same for the CML NAND/OR/NOR. Layouts of all these gates have been performed and simulated but are not included, because their architectures do not differ from the CML AND except for the reversal of inputs or outputs.

Figure 2.12: CML AND layout
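The way a single differential AND structure yields NAND, OR and NOR by swapping wires, and the way the lower-level input B dominates path selection in Table 2.2, can be illustrated with a small behavioral model. This is a minimal Python sketch at the logic level only. The function names and the tuple encoding of a differential signal as (plus, minus) are assumptions made for illustration, not something the thesis defines.

```python
VDD = 2.8     # logic-1 level in volts (from the text)
SWING = 0.6   # logic swing in volts (from the text)


def differential(bit):
    """Encode a boolean as a differential pair (plus, minus)."""
    return (bool(bit), not bit)


def invert(signal):
    """Invert a differential signal by swapping its two wires."""
    plus, minus = signal
    return (minus, plus)


def and_gate(a, b):
    """CML AND on differential inputs, following the truth table in Table 2.2."""
    out = a[0] and b[0]
    return (out, not out)


def and_path(a, b):
    """Conducting path from Table 2.2; the lower-level input B is examined first."""
    if not b[0]:
        return ("T5", "T4", "T6")
    return ("T1", "T3", "T6") if a[0] else ("T2", "T3", "T6")


def nand_gate(a, b):
    """Reverse the outputs of the AND."""
    return invert(and_gate(a, b))


def or_gate(a, b):
    """Reverse all inputs and the outputs of the AND (De Morgan)."""
    return invert(and_gate(invert(a), invert(b)))


def nor_gate(a, b):
    """Reverse the inputs only."""
    return and_gate(invert(a), invert(b))


def level(bit):
    """Map a boolean to its single-ended voltage: Vdd or Vdd - swing."""
    return VDD if bit else VDD - SWING
```

Because inversion is only a wire swap, none of the derived functions needs an extra gate, which is why the text expects the same area and power for AND, NAND, OR and NOR.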
2.3 CML XOR/XNOR Gate

A CML XOR gate with an embedded level shifter is shown in Figure 2.13. Table 2.3 briefly describes the CML XOR operation and paths. Logic-1 is represented by Vin+ΔV/2 = 2.8V and logic-0 by Vin-ΔV/2 = 2.2V, where Vin = 2.5V is the biasing (quiescent) voltage and ΔV=600mV is the logic swing. The bias current is Ibias=1.757mA and the two other current sources supply 2.019mA each, giving a total power consumption of 2.8V * (1.757+2.019+2.019)mA = 16.226mW. The power consumption of the CML XOR is thus essentially the same as that of the CML AND gate (16.226mW versus 16.2148mW), which is impractical in a CMOS realization but possible in the CML architecture.

Figure 2.13: CML XOR gate

A+   A-   B+   B-   Out+   Out-   T on      T off
0    1    0    1    0      1      7, 5, 1   2, 3, 4, 6
0    1    1    0    1      0      7, 5, 2   1, 3, 4, 6
1    0    0    1    1      0      7, 6, 4   1, 2, 3, 5
1    0    1    0    0      1      7, 6, 3   1, 2, 4, 5

Table 2.3: CML XOR Operation

For the 00 combination, T1, T5 and T7 turn on and Ibias flows down to Vss, so all other paths are turned off. For the 01 combination, either T2 or T3 will turn on, depending on A. As A+ is 0 and A- is 1, T5 is on and T6 is off, and the path is T2, T5 and T7. As the lowest level dominates the path, for the 10 combination T6 is on, and as B+ is 0 and B- is 1, T4 turns on and the path is T4, T6 and T7. For the 11 combination, T3, T6 and T7 are turned on.

Figure 2.14 shows post vs. pre layout simulation of the CML XOR gate with 18fF input/output load capacitance at 12GHz (input changes at 83ps). It is observed that the circuit-level rise delay is 33.7ps whereas the extracted layout delay is 41.5ps. Therefore, a 7.8ps discrepancy exists between circuit-level and device-level simulation.

Figure 2.14: Post vs. pre layout simulation of CML XOR with 18fF input/output load capacitance (input changes at 83ps)

The reported area of the CML XOR is 28.590µm x 45.630µm, as indicated in Figure 2.15.

Figure 2.15: CML XOR layout

2.4 CML Mux Realization

A CML mux has been realized at the transistor level as indicated in Figure 2.16.
The architecture is very similar to the XOR, and the power dissipation is comparable to that of the CML AND/XOR gate. Realizing a CML mux with transistors reduces power by at least 1/3 compared with realizing the mux by cascading CML AND gates. Realizing the mux with transistors therefore has a twofold benefit: one is less power and the other is delay reduced by 1/2. Table 2.4 describes the CML mux operation.

Figure 2.16: CML Mux realization

B+ (V)   A+ (V)   S+ (V)   Out+ (V)   T on   T off
2.2      2.2      2.2      2.2        5, 2   6, 1, 3, 4
2.2      2.8      2.2      2.8        5, 1   6, 2, 3, 4
2.8      2.2      2.2      2.2        5, 2   6, 1, 3, 4
2.8      2.8      2.2      2.8        5, 1   6, 2, 3, 4
2.2      2.2      2.8      2.2        6, 4   5, 1, 2, 3
2.2      2.8      2.8      2.2        6, 4   5, 2, 2, 3
2.8      2.2      2.8      2.8        6, 5   4, 1, 2, 3
2.8      2.8      2.8      2.8        6, 5   4, 2, 2, 3

Table 2.4: CML Mux Operation

Figure 2.17 shows post vs. pre layout simulation of the CML mux with 18fF input/output load capacitance at 12GHz (input changes at 83ps). It is observed that the circuit-level rise delay is 28.01ps whereas the extracted layout delay is 33.99ps. Therefore, a 5.98ps discrepancy exists between circuit-level and device-level simulation.

Figure 2.17: Post vs. pre layout simulation of CML Mux with 18fF input/output load capacitance (input changes at 83ps)

Figure 2.18: CML Mux layout

The reported area for the CML Mux is 36.510µm x 47.520µm, as indicated in Figure 2.18. Total power consumption of the CML Mux is 2.8V * (2.019 + 2.019 + 1.757)mA = 16.226mW.

2.5 CML D-latch Realization

A CML D-latch has been realized in transistors, as indicated in Figure 2.19, with four levels of hierarchy and a reset input, rather than by cascading CML AND gates as in a CMOS architecture. This architecture not only reduces power consumption but also reduces delay. Notably, the power consumption is less than that of the CML AND gate, which cannot be achieved in a CMOS architecture. The lowest level dominates in selecting the upper-level paths. RST+=1.3V (RST-=0.7V) turns on T9 (T7 off) and ties Q+ to Vss, irrespective of any input combination.

Figure 2.19: CML D-latch
CLK+ (V)   D+ (V)   RST+ (V)   Qt+1+ (V)   T on               T off
2.2        2.2      2.2        Qt+         8, 7, 6, (3 / 4)   9, 5, (4 / 3), 1, 2, 10, 11
2.2        2.8      2.2        Qt+         8, 7, 6, (3 / 4)   9, 5, (4 / 3), 1, 2, 10, 11
2.8        2.2      2.2        2.2         8, 7, 5, 2         9, 6, 1, 3, 4, 10, 11
2.8        2.8      2.2        2.8         8, 7, 5, 1         9, 6, 2, 3, 4, 10, 11
x          x        2.8        2.2         8, 9, 10, 11       1, 2, 3, 4, 5, 6, 7, 8

Table 2.5: CML D-latch Operation

Table 2.5 briefly describes the operation of the CML D-latch. Initially Q+ tends to be logic-0 (for the reason explained for the CML AND gate). If the clock is high, whatever data arrives at D+ is passed to the output Q+. Also, the present state pushes the lower-level transistors to retain their logic states when the clock is not high. Therefore, a reset at the beginning is not necessary in the CML architecture.

Figure 2.20 shows post vs. pre layout simulation of the CML D-latch with 18fF input/output load capacitance at 6GHz. It is observed that the circuit-level rise delay is 56.813ps whereas the extracted layout delay is 67.367ps. Therefore, a 10.554ps discrepancy exists between circuit-level and device-level simulation.

Figure 2.20: Post vs. pre layout simulation of CML D-latch with 18fF input/output load capacitance at 6GHz (166ps)

Figure 2.21: CML D-latch layout

Power consumption of the CML D-latch is 2.8V * (2.019 + 2.019 + 1.431)mA = 15.3132mW. The reported area for the CML D-latch is 29.190µm x 57.780µm, as indicated in Figure 2.21.

2.6 Speed, Power, Area and Delay of Basic CML Components

Table 2.6 summarizes the post-layout simulation power, area and delay of each basic CML component with an 18fF load capacitance (10fF for wire and 8fF for the next-stage Cgg). In later chapters, higher-level circuits are realized using these basic components to develop a CML processor datapath. Therefore, in measuring the theoretical worst-case delay it has been assumed that delay is additive, meaning that if the output of a CML AND gate drives the input of a CML Mux, the total delay is the sum of the two.
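The additive-delay assumption can be sketched in a few lines of Python using the post-layout delays reported in Table 2.6. The dictionary keys and helper names below are illustrative choices, and the period check is a simplification that ignores setup margins and clock skew.

```python
# Post-layout delays in picoseconds, copied from Table 2.6 (18fF load).
DELAY_PS = {
    "INV": 24.79,
    "AND": 37.6, "NAND": 37.6,
    "OR": 46.09, "NOR": 46.09,
    "XOR": 41.5, "XNOR": 41.5,
    "MUX": 33.99,
    "DLATCH": 67.367,
}


def path_delay_ps(components):
    """Worst-case delay of a chain of basic components, assuming delays add."""
    return sum(DELAY_PS[name] for name in components)


def fits_in_period(components, period_ps):
    """True if the chain settles within one clock period (no margins modeled)."""
    return path_delay_ps(components) <= period_ps


if __name__ == "__main__":
    chain = ["AND", "MUX"]  # the example used in the text
    print(f"{' -> '.join(chain)}: {path_delay_ps(chain):.2f} ps")
```

For the AND-into-Mux example from the text, the estimate is 37.6ps + 33.99ps = 71.59ps.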
The total delay of any complex component in simulation is found to match the value obtained by counting the basic components in the critical path and assigning the corresponding delay to each of them.

Component        Power (mW)   Area (µm x µm)   Delay (ps)
Inverter         4.62         15.96 x 24.45    24.79
AND/NAND         16.2148      30.15 x 42.90    37.6
OR/NOR           16.2148      30.15 x 42.90    46.09
XOR/XNOR         16.226       28.59 x 45.63    41.5
CML 2-to-1 Mux   16.226       36.51 x 47.52    33.99
CML D-latch      15.3132      29.19 x 57.78    67.367

Table 2.6: Power, Area and Delay of Basic Components (post layout simulation with 18fF load capacitance)

Layout of the processor datapath has not been performed; only circuit-level simulation has been carried out. It is observed that post vs. pre layout simulations differ by a maximum of about 10ps and that wire delay is 10ps (for a 100µm wire connecting the next stage, assuming wire delay is 0.10ps/µm) [22]. Therefore a 20fF load capacitance was added inside every CML logic gate in the datapath simulation to mimic post-layout plus wire delay (in intermediate stages Cgg = 8fF has already been considered by the simulation tool). Area was predicted as the number of gates * area of each gate + 50% area for wiring.

Chapter 3
Datapath

A multi-cycle 16-bit processor datapath has been designed using a RISC architecture that can execute 15 different operations; a 4-bit opcode is used. Three different types of instructions are supported, and the instruction set architecture is given in Table 3.1.

Address bus (16-bit)
Type     15-12    11-8   7-4   3-0
R-type   Opcode   Rd     Rs    Rt
I-type   Opcode   Rd     Rs    Immediate operand
J-type   Opcode   Address

Table 3.1: Instruction Set Architecture

Figure 3.1: Processor datapath

The processor datapath is shown in Figure 3.1, where all the components have been realized in CML logic except the cache memory and control unit. Correct operation of the datapath has been verified by providing external control stimulus in every clock cycle and observing/providing data to/from memory. The fifteen different operations that can be performed are listed in Table 3.2.
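The field layout of Table 3.1 can be illustrated with a short decoder. This is a minimal Python sketch: the function name, the returned dictionary keys and the choice to pass the instruction type explicitly are assumptions for illustration, and the J-type address is assumed to occupy all twelve bits below the opcode, which Table 3.1 suggests but does not state.

```python
def decode(instruction, fmt):
    """Split a 16-bit instruction word into fields per Table 3.1.

    fmt is "R", "I" or "J"; the caller is expected to know the type
    from the opcode (the opcode-to-type mapping is not modeled here).
    """
    if not 0 <= instruction <= 0xFFFF:
        raise ValueError("instruction must fit in 16 bits")
    if fmt not in ("R", "I", "J"):
        raise ValueError("fmt must be 'R', 'I' or 'J'")

    opcode = (instruction >> 12) & 0xF           # bits 15-12
    if fmt == "J":
        # Assumption: the address fills bits 11-0.
        return {"opcode": opcode, "address": instruction & 0x0FFF}

    fields = {
        "opcode": opcode,
        "rd": (instruction >> 8) & 0xF,          # bits 11-8
        "rs": (instruction >> 4) & 0xF,          # bits 7-4
    }
    # Bits 3-0 hold Rt for R-type and the immediate operand for I-type.
    fields["rt" if fmt == "R" else "imm"] = instruction & 0xF
    return fields


if __name__ == "__main__":
    # 0x0123: opcode 0000 with Rd=1, Rs=2, Rt=3
    print(decode(0x0123, "R"))
```

With four bits per register field, the format addresses up to sixteen registers, and the same bit positions are reused across R-type and I-type instructions.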
The Processor datapath does not contain any novel approach. However this datapath can be re-structured such that power consumption will be less and operating frequency can be increased. Type Opcode Description Cycle 0000 ADD Rd=Rs+Rt 4 0001 SUB Rd=Rs-Rt 4 0010 AND Rd=Rs?Rt 4 R-type 0011 XOR Rd=Rs?Rt 4 0100 SLT Rd=1 if Rs