Alternative Techniques for Built-In Self-Test of Field Programmable Gate Arrays

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

Aditya Newalkar

Certificate of Approval:

Victor P. Nelson, Professor, Electrical and Computer Engineering
Charles E. Stroud, Chair, Professor, Electrical and Computer Engineering
Foster Dai, Associate Professor, Electrical and Computer Engineering
Stephen L. McFarland, Acting Dean, Graduate School

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
August 8, 2005

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Vita

Aditya Newalkar, son of Anil and Sugandha Newalkar, was born on April 17, 1979, in Mumbai, India. He graduated with a Bachelor of Engineering degree in Electronics Engineering from Mumbai University in December 2000. What started as a student project at the Indian Institute of Technology (IIT), Powai, Mumbai, in 1999 grew into a valuable two-year research experience after his graduation from Mumbai University. While pursuing his Master of Science degree at Auburn University, he was guided by Dr. Charles Stroud of the Electrical and Computer Engineering department. He also worked as an intern at Medtronic Navigation in Louisville, CO.

Thesis Abstract

Alternative Techniques for Built-In Self-Test of Field Programmable Gate Arrays

Aditya Newalkar
Master of Science, August 8, 2005
(B.E., Mumbai University, 2000)
174 Typed Pages
Directed by Charles E. Stroud

In the Built-In Self-Test (BIST) method of testing the logic and interconnect resources of Field Programmable Gate Arrays (FPGAs), the configuration time and the time to retrieve the test results dominate the duration of the test. The techniques presented in this thesis reduce the configuration time and the result retrieval time for BIST by using partial reconfiguration and partial configuration memory readback. Though the work targets Xilinx Virtex-I and Spartan-II FPGAs, the method is general enough to be applied to any FPGA featuring Partial Run Time Reconfiguration (PRTR). We also evaluate the Computer Aided Design (CAD) tools that are mainly used for partial reconfiguration for their usefulness in generating test configurations for the programmable interconnect and logic resources of an FPGA using the BIST method.

Acknowledgments

The author would like to thank Dr. Charles Stroud for giving him insight into the subject of Built-In Self-Test for FPGAs. The author admires his relentless pursuit of quality and is thankful for his patience. Special thanks go to the author's family, Anil, Sugandha, Bhau Newalkar and Siddharth Tambe, for their unconditional love, support and continuous encouragement. The author thanks all his friends from Auburn for making his time enjoyable. Finally, the principles by which Mohandas Gandhi lived his life give the author strength and inspiration.

Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.

Table of Contents

List of Tables
List of Figures

1 Introduction
1.1 FPGA Architecture
1.1.1 Programmable Logic Blocks
1.1.2 Programmable Interconnection Network
1.1.3 Programmable I/O Cells
1.2 Flow of Design with FPGAs
1.3 Advantages of FPGAs
1.4 Reconfigurable Computing
1.4.1 Dynamic Reconfiguration
1.4.2 Static Reconfiguration
1.4.3 Partial Reconfiguration
1.5 Testing of FPGAs
1.6 Built-In Self Test
1.6.1 BIST for FPGAs
1.7 Thesis Statement

2 Review of Partial Reconfiguration and BIST
2.1 Architecture of Virtex-I and Spartan-II FPGAs
2.1.1 PLB Architecture
2.1.2 Interconnect Architecture
2.1.3 Block RAMs
2.2 Configuration of the FPGA
2.2.1 SelectMAP Mode
2.2.2 Boundary Scan Mode
2.2.3 Start-up Sequence
2.3 Configuration Memory Architecture of Virtex-I and Spartan-II FPGAs
2.3.1 Addressing
2.3.2 Frame Organization
2.3.3 Configuration Registers
2.3.4 Full Reconfiguration Bitstream
2.4 Readback
2.4.1 Readback Verification
2.4.2 Readback Capture
2.4.3 Readback Operations
2.5 Partial Reconfiguration
2.5.1 Partial Reconfiguration without Shutdown Sequence
2.5.2 Partial Reconfiguration with Shutdown Sequence
2.5.3 BitGen
2.5.4 JBits
2.6 BIST for FPGAs
2.6.1 Logic BIST
2.6.2 Interconnect BIST
2.6.3 BIST for Xilinx FPGAs
2.6.4 Using JBits API to Generate Interconnect BIST Configurations
2.7 Thesis Statement

3 Partial Reconfiguration and Readback for Logic BIST
3.1 Floorplan of Logic BIST to Aid Partial Reconfiguration
3.2 Generating Partial Reconfiguration Files
3.2.1 Using BitGen
3.3 Generating a Test Plan for Logic BIST
3.4 Experimental Results for Logic BIST
3.5 Partial Configuration Memory Readback to Retrieve the BIST Results
3.5.1 Commands for Partial Configuration Memory Readback
3.6 Summary

4 Generating Routing BIST Configurations using JBits
4.1 Overview of Routing BIST Architecture
4.1.1 Testing the Interconnects in Parallel
4.2 The Routing BIST RTPCores
4.2.1 Configuring the TPG
4.2.2 Configuring the ORA
4.2.3 Routing the WUTs
4.2.4 Populating the PLB Array
4.2.5 Generating the XDL File
4.3 Experimental Results of Routing BIST
4.3.1 Partial Reconfiguration and Routing BIST
4.3.2 Test Phase Sequence
4.4 Calculation of the Total Number of Interconnect BIST Configurations Required
4.4.1 Hex Interconnects
4.4.2 Single Interconnects
4.4.3 Switch Box CIPs
4.4.4 MUX CIPs
4.5 Generating Configurations for Switch-Box CIPs
4.6 Conclusion

5 Summary and Future Work

Bibliography

Appendices
A Steps in Writing Parent RTPCore
B Steps in Writing Child RTPCores
C Complete Program Source
D Complete List of Connections Between the Mux CIPs

List of Tables

2.1 Virtex TAP Controller Pins
2.2 Constants Used in the Address Calculation [Xil03d]
2.3 Variables Used for Address Calculation [Xil03d]
2.4 Calculating the Location of the LUT RAM Bit in Virtex-I Bitstream [Xil03d]
2.5 Equations for Calculating PLB FF Location in the Bitstream [Xil03d]
2.6 PLB Column Frame Organization
2.7 Configuration Registers [Xil03d]
2.8 Command Header Format [Xil02d]
2.9 Configuration Commands and their Usage [Xil03d] [Xil04]
2.10 Readback Commands Required to Perform Readback on PLB Configuration
2.11 Classes Used for Bit Level Manipulation of PLB Elements [Xil01d]
2.12 Classes Used for Bit Level Manipulation of Switch Box CIPs [Xil01d]
2.13 Classes Used for Bit Level Manipulation of Output MUX CIPs [Xil01d]
2.14 Classes Used for Bit Level Manipulation of Input MUX CIPs [Xil01d]
2.15 Model of Interconnect Resources in the Package com.xilinx.JRoute2.Virtex.ResourceDB [Xil01d]
3.1 Command Listing for Scenario 1
3.2 Command Listing for Scenario 2
3.3 Command Listing for Scenario 3
3.4 Command Listing for Scenario 4
3.5 Partial Frames
3.6 Partial Frames
3.7 Sizes of Partial Bitstreams vs. Full Bitstreams
3.8 Comparison of Boundary Scan Access Method and Partial Configuration Memory Readback
3.9 Bitstream for Partial Configuration Memory Readback
4.1 Connectivity between Output Multiplexers and Interconnects
4.2 Input and Output Ports of TPGCounterCore
4.3 Input and Output Ports of ORACore
4.4 Command Line Arguments Available for the JBits Program
4.5 Possible Values of the Command Line Arguments
4.6 Command Listing for Generating Partial Bitstreams for Routing BIST
4.7 Sizes of Partial Bitstreams vs. Full Bitstreams
4.8 Routing BIST and Test Phase Sequence
4.9 Mapping of the CIPs in Various JBits Classes [Xil01d]
4.10 MUX CIPs in Virtex-I Architecture and their Functions [Xil01d]
4.11 MUX CIP Groups Tested in Parallel
4.12 Testing MUX CIPs for Stuck-On and Stuck-Off Faults [SWHA98]
C.1 Input and Output Ports of LUT5
D.1 Mux CIPs Mux28to1 and Connecting Single Interconnects
D.2 MUX CIPs Mux16to1 and Connecting Interconnects

List of Figures

1.1 General Architecture of FPGA
1.2 Typical Architecture of PLB [Str02]
1.3 Typical CIP Structure
1.4 Spatial vs. Temporal Computing [DeH00]
1.5 BIST for FPGA [AS01]
2.1 Internal Architecture of Virtex-I Slice [Xil01b]
2.2 Different Types of CIPs Found in FPGA [SWHA98] [FH03]
2.3 Switch Box CIP and Xilinx Interconnect Architecture
2.4 Block RAM in Virtex-I and Spartan-II FPGAs
2.5 Xilinx Virtex-I and Spartan-II Addressing Scheme [Xil03d]
2.6 Design Flow of the Application with JBits
2.7 JBits Program for Manual Routing
2.8 BIST for FPGA Interconnect Resources [SWHA98]
2.9 FPGA Floorplan for Online Interconnect Testing [AES01]
2.10 FPGA Floorplan with "Galaxy" BIST [SNLA02]
2.11 Complete Testing of Switch Boxes [SWHA98]
2.12 Scan Cell Interfacing with IEEE 1149.1 [HGWS99]
3.1 Floorplan for BIST Test Session
3.2 Four Different Test Plans for Testing Two Slices
4.1 Horizontal and Vertical Interconnect Resources Tested for Shorts and Opens
4.2 Hex and Single Wires Tested for Shorts and Opens
4.3 Configuration of Comparator-Based ORA
4.4 JBits Program for Routing the WUTs
4.5 Boundary Condition for Populating PLB Array in Vertical Direction
4.6 Switch Box CIP and Xilinx Interconnect Architecture
4.7 Sections of Switch Box CIP
4.8 Test Configurations Needed to Completely Test Switch Box CIPs
4.9 Test Configurations Continued...
4.10 Routing BIST Configuration for Testing Mux CIPs
4.11 Testing MUX CIPs in Parallel [RPFZ99]
4.12 Problem of Undetected Faults Due to Invisible Logic in MUX [AS01]
A.1 JBits Program for Instantiating a Counter Core
A.2 JBits Program Continued...
C.1 Configuration of LUT5 RTPCore
C.2 JBits Program for User Interaction and Populating the PLB Array
C.3 JBits Program Continued...
C.4 JBits Program Continued...
C.5 JBits Program Continued...
C.6 JBits Program Continued...
C.7 JBits Program Continued...
C.8 JBits Program Continued...
C.9 JBits Program Continued...
C.10 JBits Program Continued...
C.11 JBits Program Continued...
C.12 JBits Program Continued...
C.13 JBits Program Continued...
C.14 JBits Program Continued...
C.15 JBits Program Continued...
C.16 JBits Program Continued...

Chapter 1
Introduction

Field programmable gate arrays (FPGAs) have evolved from simple programmable logic devices (PLDs) such as programmable array logic (PAL) devices and programmable logic arrays (PLAs). These early devices were used in digital designs as glue logic. As the needs of digital system designers grew from simple decoders to more complicated designs such as protocol resolvers, multiple PLDs were connected through a programmable routing architecture to form the FPGA [BR96]. This architecture gives the user the ability to program the interconnects to realize various types of complex digital designs. Due to their size and programmability, testing modern FPGAs has become a complex and time-consuming task.

1.1 FPGA Architecture

Figure 1.1 shows the general architecture of a typical FPGA. The FPGA consists of uncommitted resources: an N×M array of programmable logic blocks (PLBs), programmable input and output (I/O) cells, a programmable interconnection network, and a configuration memory to program the device.

[Figure 1.1: General Architecture of FPGA. Labels: Programmable I/O Blocks, Programmable Logic Blocks, Programmable Interconnect Network]

1.1.1 Programmable Logic Blocks

The PLBs of most FPGAs contain multiplexers, look-up tables (LUTs) and flip-flops. An important characteristic of a PLB is its functionality, defined as the number of different Boolean functions that it can implement [BFRV92].

[Figure 1.2: Typical Architecture of PLB [Str02]. Labels: LUT/RAM, FF, Output MUXs, PLB Outputs]

The elements in the PLB architecture can be programmed to function in different modes of operation. The LUT can be used either in a LUT mode of operation or in a random access memory (RAM) mode of operation. In the LUT mode, the element can implement combinational logic functions of multiple inputs (typically 3 to 4). In the RAM mode of operation, the PLB can be configured to behave as a synchronous or asynchronous, single-port or dual-port RAM.
The flip-flops can be configured in latch mode or edge-triggered mode, with asynchronous or synchronous preset/clear and a programmable clock enable. The multiplexers can be set to connect the LUT outputs to the flip-flops or to bypass the flip-flops [Str02]. Thus the PLB can implement any combinational or sequential logic function using the logic resources in the architecture, as shown in Figure 1.2. The mode of operation for each element is selected when the device is programmed or configured.

1.1.2 Programmable Interconnection Network

The programmable interconnect network in an FPGA, also called its routing architecture [BFRV92], consists of wire segments of various lengths and programmable switches. There are global routing resources and local routing resources. Global routing resources route signals between PLBs that are separated by other PLBs. Local routing resources route signals between PLBs that are adjacent to each other in the array [SNLA02]. The connections are made via configurable interconnect points (CIPs), also referred to as programmable interconnect points (PIPs). A CIP consists of a transmission gate controlled by a configuration memory bit. As shown in Figure 1.3, the connection between wire segments A and B is made or broken depending on the logic value of the configuration memory bit [Str02].

[Figure 1.3: Typical CIP Structure. Labels: Wire A, Wire B, Configuration Memory Bit]

1.1.3 Programmable I/O Cells

Most of the I/O pins of the FPGA can be configured in input, output or bidirectional mode of operation. The I/O cells can also be programmed as registered or latched I/Os depending on the design. The I/O cells support TTL as well as CMOS I/O standards, thereby eliminating the need for voltage shifters for interfacing [Lat02] [Xil01b].
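
As an illustration of the CIP behavior described in Section 1.1.2, the following Python sketch treats each CIP as a switch that joins two wire segments when its configuration memory bit is 1, and computes which segments are connected to a given starting segment. The function name, the wire labels, and the modeling of the transmission gate as an ideal bidirectional switch are assumptions made for this example only; they do not correspond to any particular FPGA routing architecture.

```python
def connected_segments(cips, config_bits, start):
    """Return the set of wire segments reachable from `start`.

    cips        -- list of (wire_a, wire_b) pairs, one per CIP (hypothetical labels)
    config_bits -- list of 0/1 values, one configuration memory bit per CIP
    start       -- label of the wire segment where the signal is driven
    """
    # Build adjacency only for CIPs whose configuration bit turns the gate on.
    neighbors = {}
    for (wire_a, wire_b), bit in zip(cips, config_bits):
        if bit:
            neighbors.setdefault(wire_a, set()).add(wire_b)
            neighbors.setdefault(wire_b, set()).add(wire_a)

    # Depth-first traversal over the closed switches.
    seen = {start}
    stack = [start]
    while stack:
        wire = stack.pop()
        for nxt in neighbors.get(wire, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Three CIPs in a chain: A-B, B-C, C-D.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
partial = connected_segments(chain, [1, 0, 1], "A")
full = connected_segments(chain, [1, 1, 1], "A")
```

With the middle configuration bit cleared, segment A reaches only segment B; with all three bits set, the four segments form one connected net. Reprogramming the configuration bits is the only thing that changes between the two cases, which is the property the routing architecture relies on.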

1.2 Flow of Design with FPGAs

The circuit designer typically describes the design in a hardware description language (HDL) and synthesizes the circuit description with the help of one or more computer aided design (CAD) tools. The CAD tools generate a bitstream file that contains programming instructions and data to establish the application-specific system functionality of the various programmable resources of the device, such as PLBs, routing architecture and I/O blocks. This bitstream file is then loaded into the FPGA using one of the configuration interfaces provided for the FPGA [Jay01]. The process of loading a design-specific bitstream into one or more FPGAs to define the functional operation of the PLBs, the interconnect resources and the I/O blocks is known as configuring or downloading the bitstream to the device [Xil99]. The significance of the FPGA design lies in the fact that static random access memory (SRAM) based FPGAs can be reconfigured an unlimited number of times, implementing a different design each time. Implementing a different design simply requires overwriting the previous configuration loaded into the SRAM with a new bitstream through the configuration interfaces provided by the FPGA manufacturer [Xil99].

1.3 Advantages of FPGAs

FPGAs provide a low-cost solution for low-volume products where user programmability is needed at deployment time. Application specific integrated circuits (ASICs) edge out FPGAs in high-volume products in terms of unit cost and performance parameters, which are significantly better than those of FPGAs. The reason for the significant difference in performance is that the flexibility provided by the programmability of FPGAs comes at the cost of substantial signal delays and area overhead introduced by the programming circuitry [AR94]. However, FPGAs offer some advantages over ASICs, including:

• Low-cost solution for low-volume applications,
• Low non-recurring engineering (NRE) costs, and
• Rapid prototyping [Mil94].

1.4 Reconfigurable Computing

Traditionally, software is considered to be flexible but relatively slow and inefficient compared to hardware, while hardware is perceived to be customized to the problem and faster than software [DW99]. Hardware can be designed to execute functions concurrently. Therefore, at any given time there are multiple computing elements actively performing their functions. This is referred to as spatial computing. In a conventional processor, instructions are executed serially, using memory or registers to store the program variables. This is called temporal computing. Figure 1.4 shows examples of spatial and temporal computing. A conventional digital signal processor (DSP) takes multiple instruction cycles to execute a filter algorithm (Figure 1.4(b)), while the spatial implementation of the same filter in an FPGA produces a new result every cycle in a pipelined fashion, giving higher throughput (Figure 1.4(a)) [DeH00].

[Figure 1.4: Spatial vs. Temporal Computing [DeH00]. (a) Spatial Computation; (b) Temporal Computation]

Reconfigurable computing tries to bring together the best of both worlds. It relies on devices, such as FPGAs, that are user programmable an unlimited number of times. The general-purpose resources provided in an FPGA, such as LUTs, flip-flops, and SRAMs, can be configured to execute functions spatially. The process of reconfiguration makes it possible to load different functions into the FPGA serially in time, taking advantage of temporal computing [DeH00] [DW99].
Because of this inherent parallelism in the FPGA architecture, these devices frequently show an order of magnitude higher performance than a general-purpose processor [DeH00] [CHW00] [GSB+00] [DW99].

1.4.1 Dynamic Reconfiguration

Dynamic or run-time reconfiguration is a process in which the reconfigurable unit is configured without interrupting the configured system function [ESSA00]. The reason for such an arrangement might be that the design is partitioned into parts that are either too large or too many to fit in the FPGA simultaneously. These partitions of the original design can be loaded into the FPGA without interrupting the function of other partition(s) already loaded into the FPGA. An external agent such as a microprocessor may be used to control which partitions are loaded and in what order [ESSA00].

1.4.2 Static Reconfiguration

Static or compile-time reconfiguration can be defined as the inverse of dynamic reconfiguration: the reconfigurable unit is configured while it is idle or inactive. Most FPGAs are capable of reading the configuration data from an electrically erasable programmable read-only memory (EEPROM) when the power is turned on [Xil99]. This is referred to as power-on configuration and is an example of static reconfiguration.

1.4.3 Partial Reconfiguration

Complete reconfiguration of an FPGA can be an onerous process [HLS98]. The configuration time varies depending on the size of the bitstream to be loaded into the FPGA. This delay could be unacceptable in high-performance systems, which are expected to be reconfigured many times to execute the system function.
If the circuit implemented in one configuration is not significantly different from the one implemented in the next configuration, the configuration time can be reduced if the next configuration bitstream contains only the programming instructions and data for the programmable resources of the FPGA that are configured differently from the previous configuration. The partial reconfiguration bitstream is smaller because it contains only the difference between the prior full configuration and the subsequent reconfiguration [HLS98]. Implementing such a scheme requires that the FPGA have architectural support for configuring only part of its programmable resources, referred to as partial reconfiguration. For applications that need to reconfigure only part of their logic depending on the circumstances, this raises the possibility of gaining a significant time advantage. Fault-tolerant applications are one example of applications that benefit from partially reconfigurable architectures [ESSA00].

1.5 Testing of FPGAs

Commercially available FPGAs have reached gate counts of 8 million, feature banks of RAMs and hundreds of user I/Os, and are capable of running at clock speeds of 400 MHz [Xil02e]. Such high-performance FPGA-based systems, when subjected to aging and environmental stress (temperature, humidity, vibration, cosmic rays and α-rays), are vulnerable to faults [ESSA00]. Therefore, having a good FPGA testing method is ever more essential. Testing of FPGAs poses a different set of challenges than testing of ASICs. The challenge is to test PLBs as well as interconnect resources in all possible modes of operation. This is an important consideration for safety-critical applications: if the test methodology tests only the modes of operation used by a given system function, then when the FPGA is reconfigured to implement a different system function, latent faults may surface and hamper that function [AS01] [SS99].
Testing the PLBs and interconnect in all possible modes of operation is also advantageous for fault-tolerant applications: if a particular mode of PLB operation is found to be faulty, the PLB can still be used in one of its other, fault-free modes. The testing method is required to detect single and multiple faults in PLBs and interconnects. Meeting this requirement entails selecting a method that is capable of in-system testing. Ideally, the testing method should not introduce any area overhead or delay penalties [AS01]. Diagnostics provided by the test method should enable the user to identify and locate the defective module for fault-tolerant applications [SNLA02].

1.6 Built-In Self Test

Built-In Self Test (BIST) is a design-for-testability (DFT) approach in which testing (test pattern generation, application and output response analysis) is accomplished through built-in hardware features [AKS93]. In other words, the BIST circuitry is part of the hardware that it tests. One of the advantages of BIST is the capability of in-system testing without the need for external test equipment. For ASIC testing, BIST incurs area overhead and delay penalties. However, in the context of FPGAs, BIST offers a unique advantage over external or internal dedicated BIST circuitry: SRAM-based FPGAs are reconfigurable and are capable of implementing any given design. For testing, the FPGA is configured as a BIST circuit, the tests are run and the results are obtained. If the device passes the test, it can be reconfigured to implement the desired system function. If one or more faults are detected and identified, the system function can be reconfigured to avoid the fault(s) in fault-tolerant applications. Thus, the BIST circuit "disappears" in the reconfiguration process after testing of the device is complete. Therefore, testability is achieved without any area or performance penalties [AS01].
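
As a rough illustration of the three BIST steps named above (test pattern generation, application, and output response analysis), the following Python sketch applies exhaustive patterns to two blocks and records whether their outputs ever differ. The counter-based pattern generator, the comparison of two identically configured blocks, the function names, and the stuck-at fault injected into one block are simplifications chosen for this example; the sketch is a software model, not the BIST circuitry used in this thesis.

```python
def counter_tpg(n_bits):
    """Exhaustive test pattern generator modeled as an n-bit binary counter."""
    for value in range(1 << n_bits):
        # Each pattern is a list of bits, least significant bit first.
        yield [(value >> i) & 1 for i in range(n_bits)]


def run_bist(block_a, block_b, n_bits):
    """Apply every pattern to both blocks and compare their outputs.

    The mismatch latch, once set, stays set until the end of the test
    sequence. Returns True for Pass and False for Fail.
    """
    mismatch_latch = False
    for pattern in counter_tpg(n_bits):
        if block_a(pattern) != block_b(pattern):
            mismatch_latch = True
    return not mismatch_latch


def good_and(p):
    """Fault-free 3-input AND function (hypothetical block under test)."""
    return p[0] & p[1] & p[2]


def faulty_and(p):
    """Same function with input 2 stuck-at-1 (hypothetical injected fault)."""
    return p[0] & p[1] & 1


pass_result = run_bist(good_and, good_and, 3)
fail_result = run_bist(good_and, faulty_and, 3)
```

In this toy model the fault is exposed by the single pattern in which inputs 0 and 1 are high and input 2 is low. Note that two blocks carrying the same fault would compare equal, so a comparison of this kind only flags differences between the blocks it compares.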

1.6.1 BIST for FPGAs

An example of the structure of BIST for FPGAs is as follows: groups of PLBs are configured as test pattern generators (TPGs), blocks under test (BUTs) and output response analyzers (ORAs), as illustrated in Figure 1.5. The TPGs generate test patterns that are applied as inputs to the BUTs. All BUTs are configured and tested in identical modes of operation. The outputs of the identically programmed BUTs for the set of test patterns generated by the TPGs are compared by the ORAs, giving a single Pass/Fail indication at the end of the BIST sequence depending on whether a mismatch in the BUT outputs was observed [SCKA96] [SKCA96] [AES01] [SNLA02] [AS01]. The structure of a TPG can be as simple as an N-bit counter or a linear feedback shift register (LFSR) [Str02]. ORAs consist of comparators with a latch to retain any mismatch observed by the comparison.

[Figure 1.5: BIST for FPGA [AS01]. Labels: TPG, BUT, ORA, BIST Start, BIST Done, Pass/Fail]

The problem with BIST with regard to PLB and interconnect testing is that the FPGA needs to be reconfigured many times, each time testing a different mode of operation. For example, completely testing the programmable logic in Xilinx 4000 and Spartan series FPGAs requires 24 BIST configurations, while completely testing the interconnects takes 206 configurations [SLS03]. Therefore, testing all modes of operation of the FPGA with BIST requires a large number of time-consuming reconfigurations, which poses a problem for high-performance systems that cannot afford to spend system downtime in lengthy BIST configuration.

1.7 Thesis Statement

Since some FPGA architectures are capable of partial reconfiguration to speed up the reconfiguration process, the work presented in this thesis focuses on optimizing the BIST method for FPGA testing using partial reconfiguration.
Chapter 2 presents more details about the full reconfiguration and partial reconfiguration facilities offered by the newer Xilinx FPGA families such as Spartan-II, Spartan-III, Virtex and Virtex-II. Chapter 2 also reviews the CAD tools used for partial reconfiguration. It presents details about the boundary scan interface used to partially reconfigure these FPGAs, and it provides details about the current state-of-the-art BIST method for testing PLBs as well as interconnect resources in FPGAs. Chapter 3 describes how partial reconfiguration can be used for logic BIST of Xilinx Spartan-II, Spartan-III, Virtex and Virtex-II FPGAs and presents results from actual BIST of PLBs in Spartan-II and Virtex devices to illustrate the improvements obtained with partial reconfiguration. Chapter 4 explores a technique to generate interconnect BIST configurations using the Java application programmer's interface library, JBits. The chapter also describes the experiments performed to generate partial interconnect BIST configurations and the effects of partial reconfiguration on the size of the routing BIST configurations. Chapter 4 also includes estimates of the number of BIST configurations required to test the routing resources in Virtex-I and Spartan-II FPGAs and considers partial reconfiguration for routing BIST. Finally, Chapter 5 presents the summary and conclusions as well as suggestions for future research and development. While this thesis focuses on Xilinx FPGAs, it is important to emphasize that these techniques can be applied to any FPGA that supports partial reconfiguration.

Chapter 2
Review of Partial Reconfiguration and BIST

This chapter begins with a review of the operation and architecture of Xilinx FPGAs. The configuration memory architectures of the Virtex-I and Spartan-II families are then reviewed. These two FPGA families have nearly identical architectures and will be the target of this research.
The differences between partial reconfiguration and full reconfiguration are discussed along with the CAD tools used to generate partial reconfigurations, viz. BitGen and JBits. Finally, a description of how BIST methods are applied to test the programmable logic and interconnect of FPGAs is given. In this section, we review the prior work done using JBits to automatically generate interconnect BIST configurations.

2.1 Architecture of Virtex-I and Spartan-II FPGAs

The PLB and interconnect architectures of Xilinx Virtex-I and Spartan-II are similar. Therefore, unless otherwise specified, any reference to the Virtex-I architecture is assumed to apply to Spartan-II as well.

2.1.1 PLB Architecture

The unit logic cell in the PLB consists of a 4-input LUT, a flip-flop, and additional dedicated logic. A slice consists of two of these unit logic cells. There are two identical slices in a PLB [Xil01b]. The internals of a single Virtex-I slice are shown in Figure 2.1.

Look-Up Table

Each logic cell in the PLB features a 4-input LUT. The LUT can be used in the LUT mode, in a RAM mode or in a shift register mode of operation. In the LUT mode of operation, it acts as a 4-input combinational logic function generator. In the RAM mode of operation, Virtex-I and Spartan-II support implementing a 16x1 synchronous RAM, or combining the two LUTs in a slice to implement a 32x1 or 16x2 synchronous RAM. The LUT can also implement up to a 16-bit shift register in the shift register mode of operation [Xil01b].

Flip-Flops

The flip-flops can be configured as edge-triggered flip-flops or level-sensitive latches, can be set or reset synchronously or asynchronously, and have a clock enable signal. The signals used to set or reset the flip-flops are shared by both flip-flops within the slice. A global reset signal initializes the storage elements [Xil01b]. The value with which the flip-flops are initialized is specified by a specific bit in the configuration bitstream.
The input signals can be applied at the input of the flip-flop through the LUT or directly, bypassing the LUT.

Additional Logic

The dedicated carry logic is used to implement the carry chains found in wide adders and counters. In order to realize wide adders and counters, the carry logic takes its carry input from the previous stage [Xil01b]. A dedicated multiplexer CY, illustrated in Figure 2.1, is utilized to implement wide arithmetic logic functions [Xil03c].

Figure 2.1: Internal Architecture of Virtex-I Slice [Xil01b]

Using multiplexer F5 in Figure 2.1, either of the outputs of the LUTs can be selected, thus implementing a 5-input function generator, a 4:1 multiplexer, or selected functions of up to nine inputs [Xil01b]. The F6 multiplexer, on the other hand, facilitates implementation of any 6-input function, an 8:1 multiplexer, or selected functions of up to 19 inputs [Xil01b].

2.1.2 Interconnect Architecture

CIPs are the programmable switches used to make connections in the global and local routing resources and the PLBs. There are three basic types of CIPs that can be found in the FPGA interconnect network: the cross-point CIP, the break-point CIP and the multiplexer CIP, as illustrated in Figure 2.2. The cross-point CIP connects or disconnects a wire segment in the horizontal plane and a wire segment in the vertical plane, depending on the value loaded in the configuration memory bit. The state of the memory bit controlling the break-point CIP determines whether two segments in the same plane are connected [SWHA98]. The Xilinx FPGAs feature a switch box CIP, referred to as the global routing matrix in the Xilinx literature [Xil01b].
The switch box CIP comprises an array of break-point CIPs that can be programmed to provide a variety of connections among the horizontal and vertical routing resources as well as the PLB inputs and outputs, as shown in Figure 2.2(d) [Xil01b]. The multiplexer or MUX CIP controls the connection to the common interconnect from one of k possible connections. There are k configuration memory bits associated with a MUX CIP. The complete set of switch box CIPs has 24 Single wires emerging from each of its four sides that allow connection to the four neighboring switch box CIPs. The Single lines or x1 lines are part of the local interconnect and provide signal connectivity between adjacent PLBs.

Figure 2.2: Different Types of CIPs Found in FPGA: (a) Break-Point CIP, (b) Cross-Point CIP, (c) Multiplexer CIP, (d) Switch Box CIP [SWHA98] [FH03]

A total of 12 buffered Hex wires at each of the four sides drive signals to the switch box CIP that is six PLBs away. The Hex lines or x6 lines span between PLBs separated by five PLBs and are part of the global interconnect resources. A total of 12 Long wire segments provide connectivity across the horizontal width and vertical length of the chip [Xil01b]. The Long wires often carry signals to multiple PLBs and span all the PLBs in the horizontal or vertical direction. The switch box featured in the Virtex-I and Spartan-II architecture, along with the Hex and Single interconnects, is shown in Figure 2.3.
Figure 2.3: Switch Box CIP and Xilinx Interconnect Architecture

Figure 2.4: Block RAM in Virtex-I and Spartan-II FPGAs (port A signals WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0]; port B signals WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0])

2.1.3 Block RAMs

Large memory blocks are provided in the architecture and are referred to as block RAMs. These block RAMs are located along the two outside columns of the PLB array, and each block RAM occupies the same height as 4 PLBs. Thus a PLB array 64 PLBs high will have 16 block RAMs in each outer column and therefore 32 block RAMs in total. The block RAM, illustrated in Figure 2.4, is a synchronous, dual-port memory with a total capacity of 4096 bits. The widths of the data and address buses are configurable and can be set as per the design requirement.

2.2 Configuration of the FPGA

Typically the bitstream can be downloaded bit-serially or byte-wide, i.e. one bit or one byte of configuration data is written into the configuration memory each clock cycle. How the bitstream is loaded into the FPGA depends upon the configuration mode. The configuration mode can be selected by setting particular logic levels at the mode-select pins of the FPGA. Different configuration modes offer different capabilities, e.g. speed of configuration, partial reconfiguration, etc. Different sequences of events may also take place in different configuration modes. Thus selecting the mode of configuration is an important design decision. Virtex-I and Spartan-II FPGAs support eight different modes of configuration. Partial reconfiguration support is available in the selectMAP and boundary scan modes of configuration [Xil02c].
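Since one bit or one byte is written per clock cycle, the raw download time of a bitstream follows directly from its size, the interface width and the configuration clock. The following is a minimal sketch of that arithmetic; the function name, the bitstream size and the clock frequency are hypothetical values chosen for illustration only, and command overhead and the Start-up Sequence are ignored.

```python
def config_time_s(bitstream_bits, bits_per_cycle, clock_hz):
    """Raw download time in seconds: one transfer of bits_per_cycle per clock."""
    cycles = -(-bitstream_bits // bits_per_cycle)  # ceiling division
    return cycles / clock_hz

# Hypothetical example: a 1,000,000-bit bitstream and a 10 MHz configuration clock.
BITS = 1_000_000
CLOCK_HZ = 10_000_000

serial_time = config_time_s(BITS, 1, CLOCK_HZ)     # bit-serial download
bytewide_time = config_time_s(BITS, 8, CLOCK_HZ)   # byte-wide download

print(f"bit-serial: {serial_time * 1e3:.1f} ms")
print(f"byte-wide : {bytewide_time * 1e3:.2f} ms")
```

Under these assumptions the byte-wide interface needs one eighth of the clock cycles of the bit-serial interface, which is why the choice of configuration mode matters when many BIST configurations must be downloaded.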
2.2.1 SelectMAP Mode

In the selectMAP mode, one byte of the bitstream is written every clock cycle into the configuration data bus interface (pins D[0:7]). To load a given configuration, selectMAP takes the least time among all the configuration modes available [Xil01b]. First the configuration memory is cleared. The configuration control circuitry senses the mode pins and determines the mode of configuration to be selectMAP. The bitstream is then loaded byte-by-byte on every rising edge of the configuration clock. To ensure the integrity of the bitstream, a cyclic redundancy check (CRC) is performed at the end. If the CRC checksum loaded is different from the internally calculated CRC, the configuration sequence is aborted. Otherwise the normal Start-up Sequence is commenced, as will be discussed in subsection 2.2.3 [Xil02d].

2.2.2 Boundary Scan Mode

In the boundary scan mode, one bit of the bitstream is written into the test access port (TAP) of the FPGA each clock cycle. The IEEE 1149.1 test access port and boundary scan architecture is an IEEE standard for in-system testing [IEE90]. The boundary scan has a four-wire interface as shown in Table 2.1. All FPGAs from Xilinx support the boundary scan mode of configuration and contain all the mandatory elements of the IEEE 1149.1 standard: the TAP controller, the instruction register, the instruction decoder, the boundary scan register, and the bypass register [Xil01b] [Xil02e] [Xil03b] [Xil03a].

Table 2.1: Virtex TAP Controller Pins
TDI    Test Data In
TDO    Test Data Out
TMS    Test Mode Select
TCK    Test Clock

The TAP controller is a 16-state finite state machine. The logic value of the TMS pin at the rising edge of TCK determines the next state of the TAP controller. Data can be shifted into the data registers by selecting the data register scan sequence, or into the instruction register by selecting the instruction scan sequence.
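The 16-state TAP controller can be modeled as a simple lookup table indexed by the current state and the TMS value. The sketch below is illustrative only: the state names and transitions are those defined by the IEEE 1149.1 standard rather than anything listed in this chapter, and the names TAP_NEXT and tap_walk are hypothetical.

```python
# Next state for each TAP state, as (next if TMS=0, next if TMS=1).
# Transitions follow the IEEE 1149.1 state diagram.
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def tap_walk(start, tms_bits):
    """Apply one TMS value per rising TCK edge and return the final state."""
    state = start
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state

# Data register scan: reach Shift-DR from reset.
print(tap_walk("Test-Logic-Reset", [0, 1, 0, 0]))
# Instruction register scan: reach Shift-IR from reset.
print(tap_walk("Test-Logic-Reset", [0, 1, 1, 0, 0]))
```

A useful property of this state diagram is that holding TMS high for five TCK cycles returns the controller to Test-Logic-Reset from any state, which gives the host a known starting point before a configuration sequence.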
The Virtex-I and Spartan-II devices implement all the mandatory commands of the IEEE 1149.1 standard as well as additional commands. These additional commands allow read and write access to the configuration memory. The boundary scan interface provides two user-defined serial interfaces to the core of the FPGA. In order to use them, the interfaces must be incorporated in the design. The user-defined serial interfaces are active after the configuration is completed and may be accessed using special instructions to the TAP controller [Xil02b] [Xil02a] [Xil].

The configuration control circuitry senses the mode pins and determines the mode of configuration to be boundary scan. The CFG_IN instruction is loaded into the instruction register to allow write access to the configuration memory. The bitstream is then loaded bit-serially using the boundary scan interface. If the CRC is determined to be correct, the JSTART instruction is loaded into the instruction register, which initiates the Start-up Sequence [Xil02b] [Xil02a].

2.2.3 Start-up Sequence

After the bitstream is completely and successfully written into the configuration memory, the Start-up Sequencer state machine in the FPGA initiates the Start-up Sequence. Start-up is the transition from the configuration mode to the normal operational mode of the FPGA [Xil02d]. The Start-up Sequence includes activation of the global reset for initialization of the device. Xilinx provides a CAD tool, called BitGen, to control the Start-up Sequence according to the options set by the user. Subsection 2.5.3 gives an overview of this tool.

2.3 Configuration Memory Architecture of Virtex-I and Spartan-II FPGAs

The configuration memory of Xilinx Virtex-I FPGAs is divided into sections called frames. A frame contains configuration data for one section of the device, extending vertically from the top to the bottom of the device. Multiple configuration frames grouped together form a column [Xil02d].
The columns can belong to one of the following types:

Center: The center column contains the configuration for the four global clock pins and the routing in the center of the device.

Configurable Logic Blocks: The PLBs are sometimes also referred to as configurable logic blocks (CLBs). This type of column contains the configuration for all the PLBs and routing in that column, along with two I/O blocks (IOBs) at the top and bottom of the column.

IOB: The IOB columns contain the configuration for all the IOBs on the left and right edges of the device.

Block RAM Interconnect: These columns contain the configuration for all the interconnect of the block RAMs of the device.

Block RAM Content: These columns contain the initial data contents with which the block RAMs will be pre-loaded during configuration [Xil03d].

A frame is the smallest unit of reconfiguration. The least amount of data that needs to be written into the configuration memory in order to configure a portion of the FPGA is one frame. The length of the frame increases with the dimensions of the PLB array to account for the increase in programmable logic and routing resources in the array. The length of the frame is written into a dedicated internal register in the full configuration process. As the FPGA is fully configured at least once before partial reconfiguration, it is not necessary to write the frame length for partial reconfiguration [Xil02d].

2.3.1 Addressing

The configuration memory address space is divided into RAM blocks and PLB blocks. The RAM blocks contain the block RAM content columns. The PLB blocks include the Center, PLB, IOB and block RAM interconnect columns. These blocks are then further divided into major and minor addresses, where each configuration column has a unique major address and each frame has a unique minor address within its column [Xil03d].
For the Virtex-I family, the following addressing scheme is in place for the configuration memory, as shown in Figure 2.5 (which also includes the number of frames in each column):

• the address '0' is assigned to the center column,
• the even major addresses of PLB columns are on the left side of the device,
• the higher even major addresses are assigned to the left IOB columns,
• the left block RAM interconnect columns mark the end of the even major addresses,
• the address '1' is assigned to the PLB column at the right side of the center column,
• the odd major addresses of PLB columns are on the right side of the device,
• the higher odd major addresses are assigned to the right IOB columns, and
• the right block RAM interconnect columns mark the end of the PLB address block.

Figure 2.5: Xilinx Virtex-I and Spartan-II Addressing Scheme [Xil03d] (Center column: 8 frames; CLB columns: 48 frames each; left and right IOB columns: 54 frames each; block RAM interconnect columns: 27 frames each; block RAM content columns: 64 frames each)

The major and minor addresses for any slice or LUT bit in any row/column can easily be calculated by inserting the values of the constants in Table 2.2 and Table 2.3 into the formulae given in Table 2.4 and Table 2.5.

Table 2.2: Constants Used in the Address Calculation [Xil03d]
Term         Definition
Chip_Cols    Number of PLB columns on the Virtex-I device.
Chip_Rows    Number of PLB rows on the Virtex-I device.
Chip_Rams    Number of block RAM columns on the Virtex-I device.
RAM_Space    Spacing of block RAM columns (in terms of PLB columns).
FL           Number of 32-bit words in the frame.
RW           1 for Read, 0 for Write.
CLB_Col      Column number of the desired PLB.
CLB_Row      Row number of the desired PLB.
Slice        0 or 1.
FG           0 for the F-LUT, 1 for the G-LUT.
lut_bit      The desired bit from the given LUT. Bits in the LUT are indexed from 0 to 15.
XY           0 for the X flip-flop, 1 for the Y flip-flop.
RAM_Col      Column number of the desired block RAM.
RAM_Row      Row number of the desired block RAM.
ram_bit      The desired bit from the given block RAM. Bits are indexed from 0 to 4095.

Table 2.3: Variables Used for Address Calculation [Xil03d]
Term             Definition
MJA              Frame Major Address.
MNA              Frame Minor Address.
fm_st_wd         The index of the word within a full configuration segment that corresponds to the starting word of the desired frame. A full configuration segment is defined as the following: 1) for PLB/IOB, all PLB, IOB, and RAM interconnect frames beginning at MJA=0, MNA=0 and 2) for block RAM, all RAM content frames for the given RAM column. Words are numbered starting at 0.
fm_wd            The index of the 32-bit word within a frame that contains the desired bit. Words in a frame are numbered starting at 0.
fm_wd_bit_idx    The bit index of the desired bit within frame word fm_wd. Words are indexed in big-endian style, with bit 31 on the left and bit 0 on the right.
fm_bit_idx       Bit index within a frame of the desired bit. Numbered starting with 0 as the left-most (first) bit. Bit numbering within a frame continues across all the words in the frame.

Table 2.4: Calculating the Location of the LUT RAM Bit in Virtex-I Bitstream [Xil03d]
MJA              if (CLB_Col ≤ Chip_Cols / 2) then Chip_Cols - CLB_Col × 2 + 2, else 2 × CLB_Col - Chip_Cols - 1
MNA              lut_bit + 32 - Slice × (2 × lut_bit + 17)
fm_bit_idx       3 + 18 × CLB_Row - FG + RW × 32
fm_st_wd         FL × (8 + (MJA - 1) × 48 + MNA) + RW × FL
fm_wd            floor(fm_bit_idx / 32)
fm_wd_bit_idx    31 + 32 × fm_wd - fm_bit_idx

Table 2.5: Equations for Calculating PLB FF Location in the Bitstream [Xil03d]
MJA              if (CLB_Col ≤ Chip_Cols / 2) then Chip_Cols - CLB_Col × 2 + 2, else 2 × CLB_Col - Chip_Cols - 1
MNA              Slice × (12 × XY - 43) - 6 × XY + 45
fm_bit_idx       (18 × CLB_Row) + 1 + (32 × RW)
fm_st_wd         FL × (8 + (MJA - 1) × 48 + MNA) + RW × FL
fm_wd            floor(fm_bit_idx / 32)
fm_wd_bit_idx    31 + 32 × fm_wd - fm_bit_idx

2.3.2 Frame Organization

The frame can be viewed as being vertically superimposed on the device, with the beginning of the frame at the top of the device. As shown in Table 2.6, the first 18 bits control the two IOBs at the top of the column. The subsequent groups of 18 bits are allocated one per PLB row. Finally, the last 18 bits control the two IOBs at the bottom of the PLB column. The frame data is then padded with '0's to make it an integral multiple of 32-bit words [Xil03d].

Table 2.6: PLB Column Frame Organization
Section    Top 2 IOBs    PLB R1    PLB R2    ...    PLB Rn    Bottom 2 IOBs
Bits       18            18        18        ...    18        18

2.3.3 Configuration Registers

The Virtex-I FPGAs provide configuration registers to control the configuration process. The configuration architecture defines eleven 32-bit configuration registers, summarized in Table 2.7. To configure the FPGA, commands are written into these configuration registers, followed by the data frames containing the configuration data, which are then loaded into the configuration memory of the FPGA [Xil02d].

The configuration register to which the commands are to be written is selected by a 32-bit word called the command header format (Table 2.8) or Type-I header. The word count field in the command header format gives the number of words to be written in the subsequent write sequence. With the command header alone, 2048 32-bit words can be written. The configuration architecture also defines the large block count header extension format, also known as the Type-II header format, that supports larger write sequences [Xil02d].

Command Register (CMD)

The state of the configuration state machine, the Frame Data Register (FDR), and some of the global signals are determined by the command loaded in the command register. The commands are executed each time a new value is loaded into the Frame Address Register (FAR) [Xil03d]. The commands and their functions are summarized in Table 2.9.
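The address formulae of Tables 2.4 and 2.5 can be sketched in code as follows. This is a minimal illustration, assuming the operators in those tables are the usual comparison (≤), multiplication and integer division; the function names and the example device parameters (30 PLB columns, a frame length FL of 14 words) are hypothetical and are not taken from the tables.

```python
def plb_major_address(chip_cols, clb_col):
    """MJA: even addresses on the left half of the device, odd on the right."""
    if 2 * clb_col <= chip_cols:          # CLB_Col <= Chip_Cols / 2
        return chip_cols - 2 * clb_col + 2
    return 2 * clb_col - chip_cols - 1

def frame_position(fl, mja, mna, fm_bit_idx, rw):
    """Entries common to Tables 2.4 and 2.5."""
    fm_st_wd = fl * (8 + (mja - 1) * 48 + mna) + rw * fl
    fm_wd = fm_bit_idx // 32              # floor(fm_bit_idx / 32)
    fm_wd_bit_idx = 31 + 32 * fm_wd - fm_bit_idx
    return {"MJA": mja, "MNA": mna, "fm_bit_idx": fm_bit_idx,
            "fm_st_wd": fm_st_wd, "fm_wd": fm_wd,
            "fm_wd_bit_idx": fm_wd_bit_idx}

def lut_bit_location(chip_cols, fl, clb_col, clb_row, slice_, fg, lut_bit, rw=0):
    """Table 2.4: location of one LUT bit."""
    mja = plb_major_address(chip_cols, clb_col)
    mna = lut_bit + 32 - slice_ * (2 * lut_bit + 17)
    fm_bit_idx = 3 + 18 * clb_row - fg + rw * 32
    return frame_position(fl, mja, mna, fm_bit_idx, rw)

def ff_location(chip_cols, fl, clb_col, clb_row, slice_, xy, rw=0):
    """Table 2.5: location of one PLB flip-flop bit."""
    mja = plb_major_address(chip_cols, clb_col)
    mna = slice_ * (12 * xy - 43) - 6 * xy + 45
    fm_bit_idx = 18 * clb_row + 1 + 32 * rw
    return frame_position(fl, mja, mna, fm_bit_idx, rw)

# Hypothetical device: 30 PLB columns, FL = 14 words.
lut = lut_bit_location(chip_cols=30, fl=14, clb_col=16, clb_row=1,
                       slice_=0, fg=0, lut_bit=0)
ff = ff_location(chip_cols=30, fl=14, clb_col=16, clb_row=2,
                 slice_=1, xy=1, rw=1)
print(lut)
print(ff)
```

With these assumed parameters, the first PLB column to the right of the center (column 16 of 30) receives major address 1 and the column just left of the center receives major address 2, which agrees with the odd/even addressing scheme described for Figure 2.5.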
Table 2.7: Configuration Registers [Xil03d]
Register Name                    R/W    Function
Command (CMD)                    R/W    Controls the operation of the configuration state machine.
Configuration Option (COR)       R/W    Sets various options for events that take place in the configuration and for the behavior of the device after configuration.
Control (CTL)                    R/W    Sets the preferences for the behavior of the device after the configuration.
Cyclic Redundancy Check (CRC)    R/W    Used while configuring the device to load the CRC checksum that is verified against the internally calculated one.
Frame Address (FAR)              R/W    Used to load the frame address of the ensuing frame data. For PLB data, this is automatically incremented after a complete frame is loaded. For block RAM data the frame address has to be incremented manually.
Frame Data Input (FDRI)          W      Writing the configuration data into the configuration memory.
Frame Data Output (FDRO)         R      Reading the configuration data and the states of registers, flip-flops and LUTs from the configuration memory.
Frame Length (FLR)               R/W    Determines the size of the frame in 32-bit words.
Legacy Output (LOUT)             W      For daisy chaining the bitstream of legacy devices.
Mask (MASK)                      R/W    Mask register for writes to the CTL register.
Status (STAT)                    R      Loaded with the current values of various control and status signals.

Table 2.8: Command Header Format [Xil02d]
Field                            Bits     Value
Type                             31:29    001
Write/Read                       28:27    10/01
Destination Register Address     26:13    xxxxxxxxxxxxxx
Byte Address                     12:11    xx
Word Count (32-bit words)        10:0     xxxxxxxxxxx

Table 2.9: Configuration Commands and their Usage [Xil03d] [Xil04]
Command    Code    Description
WCFG       1       Write Configuration Data: Used prior to writing configuration data to the FDRI. It takes the internal configuration state machine through a sequence of states that control the shifting of the FDR and the writing of the configuration memory.
LFRM       3       Last Frame: This command is loaded prior to writing the last (pad) data frame if the GHIGH_B signal was asserted. This command is not necessary if the GHIGH_B signal was not asserted. This allows overlap of the last frame write with the release of the GHIGH_B signal.
RCFG       4       Read Configuration Data: Used prior to reading frame data from the Frame Data Output (FDRO). Similar to the WCFG command in its effect on the Frame Data Register (FDR).
START      5       Begin Start-up Sequence: Starts the Start-up Sequence. This command is also used to start a shutdown sequence prior to partial reconfiguration. The Start-up Sequence begins with the next successful CRC check.
RCAP       6       Reset Capture: Used when performing capture in single-shot mode. This command must be used to reset the capture signal if single-shot capture has been selected.
RCRC       7       Reset CRC: Used to reset the CRC register.
AGHIGH     8       Assert GHIGH_B Signal: Used prior to reconfiguration to prevent contention while writing new configuration data. All PLB outputs and signals are forced to a one.
SWITCH     9       Switch CCLK Frequency: Used to change the frequency of the Master CCLK.

Configuration Option Register (COR)

The bits in the Configuration Option Register (COR) determine the behavior of specific signals used during configuration and the Start-up Sequence. The fifteenth bit of this register is reset by the "SHUTDOWN" command, used for shutting down the FPGA for partial reconfiguration. The "START" command sets this bit to the value '1', initiating the Start-up Sequence [Xil03d]. The BitGen tool, which is used to generate the configuration file for the device, provides options to set/reset bits in this register.

Cyclic Redundancy Check (CRC)

The Cyclic Redundancy Check (CRC) register provides a means of checking for transmission errors in the bitstream. A 16-bit CRC checksum is calculated every time data is written into specific registers, using the following polynomial:

CRC-16 = X^{16} + X^{15} + X^{2} + 1 [Xil02d]

The CRC register is used to store the checksum.
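As an illustration of how a checksum based on this polynomial behaves, the following is a generic bitwise CRC-16 sketch. It is not a model of the device's CRC register: the bit ordering (most significant bit first), the zero initial value and the example payload are assumptions made here, and the device's own treatment of register addresses and data words may differ.

```python
POLY = 0x8005  # X^16 + X^15 + X^2 + 1, with the X^16 term implicit

def crc16(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16, most significant bit first, no reflection (assumed ordering)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Arbitrary example payload (hypothetical data words).
payload = bytes.fromhex("3000800100000007")
checksum = crc16(payload)

# Sender appends the checksum; receiver runs the CRC over data plus checksum.
received = payload + checksum.to_bytes(2, "big")
residue = crc16(received)

# A single flipped bit in transit leaves a non-zero residue.
corrupted = bytearray(received)
corrupted[3] ^= 0x01
bad_residue = crc16(bytes(corrupted))

print(f"checksum = {checksum:04X}, residue = {residue:04X}, "
      f"residue after bit error = {bad_residue:04X}")
```

In this scheme an error-free transfer leaves a residue of zero once the checksum itself has been shifted through, so a receiver only needs to test the final register value against zero.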
In complete reconfiguration, the CRC check is performed twice by loading a pre-calculated CRC block-check value. The second CRC checksum is calculated with the data of the last frame. A non-zero resulting value indicates an error in transmission, in which case configuration is aborted.

Frame Address Register (FAR)

The Frame Address Register contains the address of the frame being loaded. The address is partitioned into block type (PLB or RAM block), major address, and minor address. The minor address is auto-incremented each time a complete data frame is loaded, and the major address is auto-incremented once the last frame of the PLB column is completely loaded. For RAM blocks the major address needs to be loaded separately [Xil03d].

Frame Data Input Register (FDRI)

The Frame Data Input Register is used to specify the size of the configuration data, in words, that will be written to the configuration memory. Type-I or Type-II headers are used depending on how large the data is. The FDRI is used to hold this header information [Xil03d].

Frame Length Register (FLR)

The length of the frame without the pad word is set in terms of 32-bit words in this register. As the devices grow in array size, the frame length increases to incorporate the configuration data of the increased routing and logic resources [Xil03d].

2.3.4 Full Reconfiguration Bitstream

The commands in the bitstream for full reconfiguration of a Virtex-I device can be divided into three command sets. The first command set initializes the internal configuration logic for loading the data frames. A default value is assigned to the CRC register. The frame length is set in the Frame Length Register (FLR). The Configuration Option Register (COR) is loaded with the value that specifies the desired behavior of the device after the configuration. The SWITCH command is loaded into the CMD register to change the configuration clock frequency to the clock frequency specified in the COR. The second command set writes the configuration data frames.
The WCFG (Write Configuration) command is loaded into the CMD register. This, among other things, activates the circuitry that writes the data loaded into the FDRI into the configuration memory cells. The data word count is specified in a command word of Type 1 or, if the data word count is too large, in a command word of Type 2 that follows the command word of Type 1. Typically three large frame sets are loaded, containing the PLB configuration, the block RAM configuration and the last frame data. At the end of the third frame set, the CRC checksum is loaded into the CRC register. The Last Frame command (LFRM) is loaded into the CMD register, indicating to the configuration circuitry that the following frame set will be the last frame. The third command set triggers the Start-up Sequence with the START command, completes the CRC checking and activates the FPGA.

2.4 Readback

Readback is the process of reading data from the configuration memory [Xil02d]. Readback can be utilized to compare the stored configuration against the actual bitstream, as well as to read the current state of all internal PLB and IOB registers, LUTs operating in RAM mode and block RAM values. The former is known as readback verification and the latter is referred to as readback capture. Both verification and capture can be done in one readback sequence.

2.4.1 Readback Verification

Readback verification can be performed without any changes to the configuration memory, through the selectMAP and boundary scan modes [Xil02d]. The readback data can then be verified against a bitmap file generated by the BitGen tool when run with the "readback" option enabled for each design.

2.4.2 Readback Capture

In order to examine the state of the internal logic resources, the readback capture capability must be enabled. An additional readback capture option allows a single capture or multiple captures after the device is configured.
When asserted, the register states are captured in unused space in the configuration memory on the rising edge of the clock signal [Xil] [Xil02d].

Table 2.10: Readback Commands Required to Perform Readback on PLB Configuration
Synchronization word                        AA99 5566h
Packet Header: Write to FAR Register        3000 2001h
Packet Data: Starting Frame Address         0000 0000h
Packet Header: Write to CMD Register        3000 8001h
Packet Data: RCFG                           0000 0004h
Packet Header: Read from FDRO               2800 6000h
Packet Header Type 2: Data Words            48-- ----h

The logic allocation file provided by BitGen indicates the absolute position of the flip-flop bits in the complete readback file. This information will prove to be important when retrieving BIST results, as will be discussed in Chapter 3.

2.4.3 Readback Operations

Readback is performed by reading a data packet from the Frame Data Output Register (FDRO). There are three types of data packets to be read (one for the PLB configuration with capture data and two for the block RAMs). The commands needed to accomplish this are summarized in Table 2.10. The complete configuration memory readback is initiated by writing the starting frame address of (0000 0000)h to the FAR register. The number of words to be read for full readback capture is a function of the size of the device, e.g. for an XCV100 the number of 32-bit readback words would be 22,554 [Xil02d].

The bits in the readback bitstream carry three types of information: 1) configuration data, 2) captured data and 3) pad bytes. The pad bytes align the frame data to a 32-bit word boundary. Note that the readback bitstream does not contain any CRC check information.

2.5 Partial Reconfiguration

The sequence of events taking place while partially reconfiguring the device is considerably different from that of full reconfiguration.
For partial reconfiguration, it is required that the device be fully configured once and that the configuration interface remain active after the full configuration is complete. That is to say, the I/O pins used for the selectMAP mode of reconfiguration must retain their configuration function [Xil02d]. The boundary scan mode is a permanent interface and is always present [Xil03d].

The bitstream performing partial reconfiguration of any logic resource of the FPGA, henceforth simply referred to as a partial bitstream, contains the major and minor addresses of the frame containing the configuration data for that resource. The major and minor addresses of the frame are calculated using the formulae given in Table 2.4 and Table 2.5 [Xil03d]. The partial bitstream contains instructions to write the address to the FAR register. The ensuing instructions load the FDRI register with the number of words to be written into the configuration memory. After these instructions, the frame data follows. There are two ways in which the FPGA can be partially reconfigured: with or without shutdown [Xil02c].

2.5.1 Partial Reconfiguration without Shutdown Sequence

If the device is not shut down, the functions implemented in the parts of the FPGA not affected by partial reconfiguration may continue to work without interruption. The logic changes in the PLB or routing take place once the corresponding frame has been completely written into the device. This mode would be used for operations such as online testing and fault tolerance [AES01].

2.5.2 Partial Reconfiguration with Shutdown Sequence

If the device is shut down, then the parts of the FPGA not affected by partial reconfiguration stop executing the configured function. At the start of the Shutdown Sequence, the dummy word (FFFF FFFF)h and the synchronization word (AA99 5566)h are written. The dummy word provides the clock cycles necessary to initialize the configuration logic. The synchronization word is used to align the bitstream on the 32-bit word boundary [Xil04].
The Shutdown bit is set in the COR register. The START command is then loaded into the CMD register to start the Shutdown Sequence. The CRC value is reset. As the Shutdown Sequence requires all the other logic in the device to be disabled, the clock to all sequential logic is disabled. The AGHIGH command is then loaded into the CMD register to prevent contention on the internal signals while writing the new data. As the GHIGH_B signal is asserted due to the AGHIGH command, the LFRM command is written into the CMD register. This allows the Last Frame packet to be written as the GHIGH_B signal is released [Xil03d].

2.5.3 BitGen

BitGen is a command line tool that converts the netlist file in Xilinx native format (.ncd file) into a configuration bitstream file. The FPGA can then be configured with this bitstream. The tool provides a number of options to control the tasks performed during and after the Start-up Sequence for the implemented design. These include the timing of the start-up signals, the clock rate to be used for configuration, and the assignment, once configuration is over, of some of the I/O pins used during configuration. The BitGen options are set according to the design being implemented. BitGen also provides options for generating a partial reconfiguration bitstream that contains only the difference between the .ncd file and the old bitstream. The use of BitGen for partial reconfiguration during BIST will be discussed in more detail in Chapter 3.

2.5.4 JBits

JBits is a set of Java classes that provide an Application Program Interface (API) into the Xilinx XC4000, Virtex-I and Virtex-II series FPGA bitstreams [GLS99]. The JBits API facilitates writing applications that modify the bitstream on-the-fly, and that configure and read back the FPGA configuration memories [Xil01a]. The API gives the programmer gate level access to the FPGA, and the Java programming language allows a programmer to create many layers of abstraction.
JBits can therefore be used to write custom CAD tools featuring dynamic partial reconfiguration, or traditional CAD tools that perform placement and routing for the supported FPGA families [Xil01a].

The simplest of the applications that can be built with the JBits API would take the bitstream generated by BitGen as input and configure an FPGA board with it. More advanced applications contain circuit designs specified with JBits API calls and generate, along with the bitstream, output in Xilinx Design Language (XDL), which defines the netlist in symbolic format. The design flow with the JBits API is illustrated in Figure 2.6.

Figure 2.6: Design Flow of the Application with JBits

In order to use the JBits API, the programmer writes a Java program that utilizes JBits API calls (henceforth simply referred to as a JBits program) to configure the PLB and routing resources of the FPGA. Upon execution, the JBits program generates the desired configuration of the logic as well as the interconnect resources of the FPGA. Thus, the programmer specifies the design of the circuit using JBits API calls embedded into a regular Java language program [Xil01a]. The JBits program can then be compiled with the Java compiler (javac). The JBits program runs under the Java Virtual Machine (JVM) environment. The output of the program can be a regular bitstream, core files (.ctf files) that may be run on the hardware simulator described in [Xil01c], or an XDL file containing the design description, which can then be viewed and processed using conventional Xilinx tools such as FPGA Editor or BitGen [Xil01a]. Therefore, the application program does not rely on conventional place-and-route (PAR) tools for automatic routing.
The JBits API thus combines the flexibility of an HDL with the functionality of a PAR tool.

The JBits API provides a configuration memory readback API with a facility to read back the state of the logic elements in the PLB [Xil01a]. Tools can therefore be written to interface with the FPGAs. Graphical tools such as BoardScope demonstrate the use of this API for tracing the logic values of flip-flops, LUTs and internal signals [LG98].

For the bitstream produced by the JBits program to work correctly when an FPGA is configured with it, a number of structures internal to the JBits API need to be initialized in a specific sequence. For the simplest applications that can be developed with the JBits API, these internal structures can be initialized with relative ease, as their number is small and the initialization process is well documented in the documentation included with the JBits software [Xil01a]. For advanced applications, however, a large number of structures internal to JBits need to be initialized, and the sequence in which they need to be initialized is not as well documented in [Xil01a] [Xil01d]. The documentation provided with the JBits software is therefore insufficient for a test engineer to write advanced applications. In this thesis, we have developed a sequence of steps that correctly initializes these internal structures. A JBits program that implements this sequence is observed to generate a bitstream that correctly configures the FPGA to produce the desired results, together with a textual representation of the design in XDL.

Definitions Used

As the JBits API is written completely in Java [Xil01d], we find it convenient to follow the terminology of object oriented languages wherever possible.

class: "A class is a blueprint or prototype that defines the variables and the methods common to all objects of a certain kind" [Jav04].

object: "An object is a software bundle of related variables and methods" [Jav04]. The object is an instantiation of a class.
For example, in the JBits API, a physical pin of an FPGA is modeled by the class "Pin". When a reference to a particular pin is to be made, an instance of the class "Pin" is created.

method: "A function defined in a class" [Jav04]. Methods are invoked from the main program to change or to retrieve the state of the object(s).

package: "A package is a collection of related classes and interfaces providing access protection and namespace management" [Jav04]. Packages exist so that the names of identifiers declared in a class of one package do not collide with those declared in a class of a separate package and confuse the compiler. The fully qualified name of a class is the class name prefixed by the name of the package to which it belongs. In this thesis, whenever referring to a class in the JBits API, we identify the class by its fully qualified name.

Device Model

There are two models provided by the JBits API to access and manipulate the FPGA resources. In the first model, each physical FPGA resource (such as a logic element in the PLB, an IOB or a block RAM) is mapped into one of the classes in the package com.xilinx.JBits.Virtex.Bits [Xil01d]. Each class has static 2-dimensional array(s) of integers representing the configuration memory bit(s) associated with the resource(s). Applications that set and reset configuration memory bits, and thereby modify the bitstream on-the-fly, may use this model [Xil01a]. This model is not suitable for designing circuits and specifying routing between the PLBs.

The second model is more suitable for advanced applications that specify the design in the JBits program. It provides a logical abstraction of the underlying FPGA by modeling every logic element in the PLB, IOB or block RAM as a run-time parameterized core (RTPCore) [GL99]. The inputs and outputs of a PLB, IOB or block RAM, and the Single, Hex and Long wires, are modeled as pins.
The RTPCores contain ports and signals, which are the HDL-like features of the JBits API. The logical connections between RTPCores are specified through ports and signals.

If the programmer simply specifies the logic elements within the PLB that are used in the design and the logical connections between them, then in this model the synthesis, placement and routing of these cores is completely automatic and is done by another Java program called JRoute [Xil01d]. Automatic synthesis, placement or routing can be a serious impediment when it does not happen the way the designer intended. The JBits API therefore gives the programmer the ability to specify the placement and routing of the RTPCores. The programmer assigns pins to the ports of an RTPCore that needs to be placed. Note that a port may still have unrouted nets and buses attached to it, and the router will continue to place them.

Model of PLBs

As stated above, in the first model there exists a Java class for every element in the PLB, such as the flip-flop, LUT, multiplexer and combinational logic element. The class has a static 2-dimensional integer array as a data member [Xil01d]. The bits and their positions in the array correspond to the state and location of the configuration memory bit associated with that element. Table 2.11 provides the mapping of classes for the elements contained in a Virtex-I PLB.

In the second model of the PLB, the flip-flops, LUTs, multiplexers and logic gates are viewed as RTPCores, and the inputs and outputs of these elements are viewed as pins. In order to configure the PLB, these RTPCores are instantiated in the program and pins are connected to the ports of the RTPCores. It is sufficient to specify the logical connections between the RTPCores. The router appropriately places the partially routed cores.
Table 2.11: Classes Used for Bit Level Manipulation of PLB Elements [Xil01d]

    PLB Element                     Class
    CIPs related to slice 0         com.xilinx.JBits.Virtex.Bits.S0Control
    CIPs related to slice 1         com.xilinx.JBits.Virtex.Bits.S1Control
    Logic value of slice 0 FF       com.xilinx.JBits.Virtex.Bits.CLB
    Logic value of slice 1 FF       com.xilinx.JBits.Virtex.Bits.CLB
    Logic values of slice 0 LUT     com.xilinx.JBits.Virtex.Bits.LUT
    Logic values of slice 1 LUT     com.xilinx.JBits.Virtex.Bits.LUT

Table 2.12: Classes Used for Bit Level Manipulation of Switch Box CIPs [Xil01d]

    com.xilinx.JBits.Virtex.Bits.BiHexToSingle
    com.xilinx.JBits.Virtex.Bits.UniHexToSingle
    com.xilinx.JBits.Virtex.Bits.SingleToSingle
    com.xilinx.JBits.Virtex.Bits.OutMuxToSingle

Routing Model

There are two models for the routing resources. In the first model, the configuration memory bits corresponding to the switch box CIPs are modeled by static 2-dimensional integer arrays split into four classes, shown in Table 2.12 [Xil01d]. The configuration memory bits corresponding to the output MUX CIPs are modeled by static 2-dimensional integer arrays in eight classes, shown in Table 2.13 [Xil01d]. The configuration memory bits corresponding to the input MUX CIPs are modeled by static 2-dimensional integer arrays split into twenty-eight classes, shown in Table 2.14 [Xil01d]. These classes are useful when explicitly specifying whether a CIP is to be turned on or off, or when explicitly specifying the connection between routing resources.

A separate model exists for applications in which the routing between two PLBs, between PLBs and IOBs, and between PLBs and block RAMs is specified. In this model, interconnect resources are modeled as static one-dimensional integer arrays, as listed in Table 2.15. These integer arrays do not represent the CIPs; instead, they model the actual interconnect resource.
Therefore, in order to use the interconnect resources in conjunction with the HDL-like features in the JBits program, an object is created with the appropriate static integer array passed as an argument.

Table 2.13: Classes Used for Bit Level Manipulation of Output MUX CIPs [Xil01d]

    Class                                  Output
    com.xilinx.JBits.Virtex.Bits.OUT0      1st output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT1      2nd output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT2      3rd output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT3      4th output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT4      5th output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT5      6th output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT6      7th output of the Out Mux
    com.xilinx.JBits.Virtex.Bits.OUT7      8th output of the Out Mux

Table 2.14: Classes Used for Bit Level Manipulation of Input MUX CIPs [Xil01d]

    Class                                  Input
    com.xilinx.JBits.Virtex.Bits.S0BX      BX input of slice 0
    com.xilinx.JBits.Virtex.Bits.S0BY      BY input of slice 0
    com.xilinx.JBits.Virtex.Bits.S0CE      CE input of slice 0
    com.xilinx.JBits.Virtex.Bits.S0Clk     Clk input of slice 0
    com.xilinx.JBits.Virtex.Bits.S0SR      SR input of slice 0
    com.xilinx.JBits.Virtex.Bits.S1BX      BX input of slice 1
    com.xilinx.JBits.Virtex.Bits.S1BY      BY input of slice 1
    com.xilinx.JBits.Virtex.Bits.S1CE      CE input of slice 1
    com.xilinx.JBits.Virtex.Bits.S1Clk     Clk input of slice 1
    com.xilinx.JBits.Virtex.Bits.S1SR      SR input of slice 1
    com.xilinx.JBits.Virtex.Bits.TS0       TS0 input of slice 0
    com.xilinx.JBits.Virtex.Bits.TS1       TS1 input of slice 0
    com.xilinx.JBits.Virtex.Bits.S0F1      1st input of F LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0F2      2nd input of F LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0F3      3rd input of F LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0F4      4th input of F LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0G1      1st input of G LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0G2      2nd input of G LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0G3      3rd input of G LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S0G4      4th input of G LUT in slice 0
    com.xilinx.JBits.Virtex.Bits.S1F1      1st input of F LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1F2      2nd input of F LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1F3      3rd input of F LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1F4      4th input of F LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1G1      1st input of G LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1G2      2nd input of G LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1G3      3rd input of G LUT in slice 1
    com.xilinx.JBits.Virtex.Bits.S1G4      4th input of G LUT in slice 1

Table 2.15: Model of Interconnect Resources in the Package com.xilinx.JRoute2.Virtex.ResourceDB [Xil01d]

    Interconnect Type     Class
    Horizontal Long       Long_Horiz
    Vertical Long         Vert_Horiz
    Horizontal Hex        Hex_Horiz_East, Hex_Horiz_West
    Vertical Hex          Hex_Vert_North, Hex_Vert_South
    Single                Single_East, Single_West, Single_North, Single_South

Routing

To route the Virtex-I and Virtex-II families of FPGAs, the JBits API provides a Java program known as JRoute. The capabilities of JRoute include routing between two PLB pins, a PLB pin and an IOB pin, or a PLB pin and a block RAM pin. JRoute also features the capability of connecting a single source pin to multiple sink pins. The programmer has a choice between automatic routing, template based routing and manual routing of the FPGA [Kel00].

Automatic Routing

The programmer defines the source pin and the sink pin and leaves the task of routing to the automatic router. The call is as follows [Kel00] [Xil01d]:

    route(source, sink);

Template Based Routing

A template contains the direction and the type of a wire; it does not identify the wire itself. The static integers in the class com.xilinx.JRoute2.Virtex.ResourceDB.CenterWires denote the direction and type. For example, SINGLE_EAST represents any Single wire in the east direction. Similarly, HEX_SOUTH is any Hex wire in the south direction.
A template therefore gives the router general guidelines rather than an exact path.

    route(source, sink, Template t);

The source and the sink are objects of class com.xilinx.JBits.CoreTemplate.Pin, and thus model the physical resources of the FPGA. The template is an array of paths existing between the source and the sink. Consider the following example, which instructs the router to take the specified path while routing the slice 0 X flip-flop output (S0_XQ) in (row,col)=(11,12) to the input of the slice 0 F LUT (S0_F1) in (row,col)=(18,5) [Xil01a]:

    int[] template = { TemplateRouter.OUTMUX,
                       TemplateRouter.HEX_WEST,
                       TemplateRouter.HEX_NORTH,
                       TemplateRouter.SINGLE_NORTH,
                       TemplateRouter.SINGLE_WEST,
                       TemplateRouter.INPUT };
    Pin source = new Pin(Pin.CLB, row, col, CenterWires.S0_XQ);
    jroute.route(source, CenterWires.S0_F1, template);

The template router has the limitation that it can only be used to route between two PLBs [Kel00] [Xil01d].

    JBits jbits = super.getJBits();
    ResourceFactory rf = ResourceFactory.getResourceFactory(jbits);
    row = 1; col = 3;
    // The sink resource is the output of the second multiplexer
    Pin outmux = new Pin(Pin.CLB, row, col, CenterWires.OUT[2]);
    Segment seg = rf.getSegment(outmux);
    // First step: Mark the resource as saved
    seg.save();
    Pin src7 = new Pin(Pin.CLB, row, col, CenterWires.S0_X);
    Pin sink7 = new Pin(Pin.CLB, row, col, CenterWires.OUT[2]);
    // Second step: Connect the input of the slice 0 'X' flip-flop
    // to the second output multiplexer
    JBitsConnector.makeConnection(jbits, src7, sink7, ps);

Figure 2.7: JBits Program for Manual Routing

Manual Routing

The programmer can specify each interconnect resource to be used for the routing. It is a two step process, as shown in Figure 2.7. First, the required resource is reserved using calls provided in the class com.xilinx.JBits.JRoute2.Virtex.ResourceFactory. This class keeps track of the utilization of the interconnect resources. The resource can be either saved for immediate use or simply marked as "used" for later use.
In the second step, a call to an appropriate method in com.xilinx.JBits.JRoute2.Virtex.JBitsConnector is made to complete the manual routing [Kel00] [Xil01d]. The fourth argument of the call to the makeConnection method is supplied in order to display the output of the routing on the console.

2.6 BIST for FPGAs

In this section, we explore BIST for FPGAs in more detail. The key points to consider in understanding BIST for FPGAs are: how BIST performs in-system testing of the FPGAs, how complete fault coverage is achieved, the diagnostic resolution that the method provides, and the scalability of the approach. Some of the notable attempts to implement BIST for FPGAs are [RZ00] [RPFZ99] [HGWS99] [SWHA98] [SXCT00] [AES01] [AS01] [SKCA97] [SNLA02] [SLS03] [TM03].

BIST for FPGAs is typically divided into logic BIST and interconnect BIST according to the FPGA resources that it tests. BIST can also be classified according to the state of operation of the FPGA at the time of testing. If part of the FPGA is tested without affecting the normal operation of the rest of the FPGA, it is referred to as on-line BIST; if the FPGA needs to be shut down prior to testing, it is referred to as off-line BIST [AES01] [SNLA02].

2.6.1 Logic BIST

The general idea of off-line logic BIST, as proposed in [SKCA96] [SCKA96], is to configure groups of PLBs as TPGs and ORAs that test the BUTs. As the PLBs can be configured in different modes of operation, the logic BIST must test all of these modes. The process of testing the PLB in a given mode of operation is referred to as a test phase. Completely testing the PLB in all of its modes of operation requires a set of test phases referred to as a test session. Each test phase consists of reconfiguring the FPGA with the BIST circuitry, initiating the BIST sequence, and reading the BIST results from the ORAs [HGWS99].
For in-system testing, BIST configurations would be stored in a system memory, and a system controller would be responsible for loading each configuration into the configuration memory of the FPGA, initializing the TPGs, BUTs and ORAs via a Global Reset, and reading the Pass/Fail results in the ORAs at the end of the BIST sequence. Generating the test patterns and analyzing the output responses are performed concurrently by the BIST circuitry in the device, as shown in Figure 1.5 [AS01]. All the BUTs are configured identically. As the same test stimulus is applied to these functionally and architecturally equivalent BUTs by two or more TPGs, the output responses received from fault-free BUTs should be the same. Therefore, a comparator-based ORA compares results from different BUTs and produces a Pass/Fail result by latching any mismatch observed during the BIST sequence. In the next test session, the roles of the PLBs are changed from BUTs to ORAs or TPGs and vice versa. This approach requires only two test sessions to completely test the PLBs in the FPGA, as long as at least half of the PLBs are BUTs in each test session [AS01]. The combinations of faults that cannot be detected by this method have a negligible chance of occurrence.

The requirement of two logic BIST test sessions to completely test the logic resources is independent of the device size, provided that the number of PLBs required to implement the TPGs for the logic resources (2 × N_TPG) does not exceed the number of PLBs in the array (N). If N_TPG > N/2, then more than two BIST test sessions are required to test the logic resources of the FPGA [SLS03]. There is therefore a potential penalty of additional test sessions for smaller FPGAs. For FPGAs featuring larger PLB arrays, excessive loading of the TPGs might pose a problem, depending on the architecture. Here the solution is to divide the PLB array into quadrants, with each quadrant configured with an independent BIST architecture executing in parallel.
There is no delay associated with this and, therefore, this BIST architecture scales linearly for large PLB arrays [SLS03] [AS01].

This BIST approach has been implemented for the programmable logic resources of commercially available FPGAs such as the XC4000 and Spartan from Xilinx [SLS03], the ORCA 2C from Lucent Technologies [AS01] and the Altera Flex8000 [SCKA96]. In order to extend this approach to any FPGA architecture, 1) the PLB should be capable of implementing the functionality of an LFSR-based or counter-based TPG and an ORA/scan cell, 2) the routing architecture should feature global and local routing elements, and 3) the FPGA should be capable of in-system reconfiguration, as the PLBs need to be reconfigured several times for complete testing. As all these capabilities either can be implemented or are available on most commercially available FPGAs, this approach is architecture independent.

2.6.2 Interconnect BIST

The interconnect BIST tests a group of wire segments controlled by CIPs, known as wires under test (WUTs). The WUTs are tested for the fault models described below. Counter-based TPGs are suitable for generating the test patterns. All 2^n possible combinations of test vectors are applied to the WUTs, provided n is not large [SWHA98]. The mismatch between a good and a faulty WUT is latched by a comparison-based ORA. The general architecture of interconnect BIST is shown in Figure 2.8. The comparison-based ORA has the limitation that a fault will go undetected if the WUTs being compared have equivalent faults, thereby giving equivalent responses to the test patterns applied. Another problem associated with this arrangement is that the diagnostic resolution obtained may be insufficient for fault-tolerant applications.

Figure 2.8: BIST for FPGA Interconnect Resources [SWHA98]

[SNLA02] addresses the fault masking problem associated with the comparison-based ORA through two-testing analysis.
This ensures that every WUT is tested at least twice, each time with a different group of wires and by a different TPG and ORA. The fault masking and diagnostic concerns are alleviated by employing a combination of strategies, such as replacing the comparison-based ORA by a scan-based ORA, comparing the fault signature against a fault dictionary, isolating the faulty wire using progressive deletion, interchanging the roles of TPG and ORA, and using divide-and-conquer to locate faults in the various segments of the wire. The approach described in [SXCT00] implements a parity-based ORA that checks the parity generated by the TPG along with the response of the WUTs.

Fault Models

The following fault models are considered in BIST for interconnect resources:

• Bridging faults between the wires,
• Opens in the wires,
• CIPs stuck-on (stuck-closed),
• CIPs stuck-off (stuck-open),
• Wires stuck-at-0,
• Wires stuck-at-1.

Bridging faults are observed between wires that run in parallel, where there is a likelihood of a short. Parallel wires affected by a bridging fault fail to transmit the correct logic value to the other end only when they are subjected to opposite logic values. Therefore, applying opposite logic values sensitizes the bridging fault between parallel wires [Str02]. As the counter-based TPG applies exhaustive test patterns to the WUTs, opposite logic values are present on each pair of wire segments, with both wire segments monitored by an ORA.

A stuck-off CIP would prevent the transmission of the signal between the wire segments that it connects. If the test patterns applied along the set of WUTs are received correctly by the ORA, the CIPs along that set of WUTs cannot be stuck-off [SNLA02].

A stuck-on CIP would be unable to break the connection between the wire segments that it connects, similar to a bridging fault.
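The sensitization condition for bridging faults, and the fault masking limitation of the comparison-based ORA noted earlier, can be illustrated with a small behavioral simulation. The Java sketch below is not taken from the cited works. The class and method names are hypothetical, and the bridge is modeled as a wired-AND of the two shorted wires, which is an assumption made only for illustration. A counter-style loop applies all 2^n patterns to two sets of n WUTs, and the ORA latches any mismatch between their responses.

```java
import java.util.function.IntUnaryOperator;

// Behavioral sketch of a comparison-based ORA observing two sets of WUTs.
final class ComparisonOraSketch {
    /** Fault-free set of wires: every pattern arrives unchanged. */
    static IntUnaryOperator noFault() {
        return IntUnaryOperator.identity();
    }

    /** Bridge between wires a and b, modeled as a wired-AND (assumed model). */
    static IntUnaryOperator bridge(int a, int b) {
        return v -> {
            int both = ((v >> a) & 1) & ((v >> b) & 1);
            int cleared = v & ~((1 << a) | (1 << b));
            return cleared | (both << a) | (both << b);
        };
    }

    /** Applies all 2^n patterns and latches any mismatch between the sets. */
    static boolean oraFails(int n, IntUnaryOperator setA, IntUnaryOperator setB) {
        boolean mismatchLatched = false;
        for (int pattern = 0; pattern < (1 << n); pattern++) {
            if (setA.applyAsInt(pattern) != setB.applyAsInt(pattern)) {
                mismatchLatched = true; // once set, the latch stays set
            }
        }
        return mismatchLatched;
    }

    public static void main(String[] args) {
        int n = 4;
        System.out.println("one faulty set:   fail = "
                + oraFails(n, noFault(), bridge(1, 2)));
        System.out.println("equivalent fault: fail = "
                + oraFails(n, bridge(1, 2), bridge(1, 2)));
    }
}
```

With a bridge in only one set, the ORA latches a failure as soon as the two bridged wires carry opposite values. With the same bridge in both sets, the responses are identical for every pattern and the fault is masked, which is the case the comparison-based ORA cannot detect.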
In order to detect the CIP stuck-on fault, opposite logic values are applied at the two ends of the CIP, and each end of the wire segment controlled by the CIP is monitored by ORAs. The fault is detected when an incorrect value is read from one of the ends of the CIP [SNLA02].

Interconnects shorted to VDD or GND cause the wire stuck-at-1 or stuck-at-0 faults, respectively. Broken continuity in the metal wires is referred to as an open in the wire. The test to detect wire stuck-at faults and opens is to determine the ability of the wire to transmit both 0 and 1 between its ends.

Previous Implementations of Interconnect BIST

For on-line BIST, horizontal and vertical self-test areas (STARs) are configured in the FPGA without disturbing the system function configured in the other part of the FPGA. The STARs then rove across the FPGA such that different routing resources of the FPGA are brought under test. Once a fault is located, the faulty routing resources can be excluded from further use, or used as long as they do not cause contention with other resources. The TPG, WUT and ORA configuration for roving STARs is shown in Figure 2.9.

Figure 2.9: FPGA Floorplan for Online Interconnect Testing [AES01]

Figure 2.10: FPGA Floorplan with "Galaxy" BIST [SNLA02]. Panels: (a) H-STARs for Testing Horizontal Interconnects; (b) Global-to-global Cross Point CIP Testing.

In off-line BIST, the system function is not configured while the FPGA is being tested. The FPGA interconnects are therefore configured with parallel horizontal STARs and vertical STARs, also referred to as galaxy BIST phases (Figure 2.10). The test patterns are thus applied to the global and local interconnect resources under test in different test phases.
While the number of test phases remains the same, off-line BIST reduces the number of times the FPGA needs to be reconfigured in order to test the routing resources, as compared to the on-line BIST approach [SNLA02].

2.6.3 BIST for Xilinx FPGAs

Work related to BIST for Xilinx FPGAs can be found in [SCKA96] [RFZ97] [RPFZ99] [SXCT00] [SLS03] [RZ00]. All of these target the Xilinx 4000 series FPGA family. The Xilinx 4000 and Spartan-I families of FPGAs do not support partial reconfiguration.

[RFZ97] and [SXCT00] target the Single interconnects that span adjacent PLBs. The methodology described in [RFZ97] is one of the first attempts to target the switch-box CIPs. The paper discusses testing of switch-box CIPs for stuck-on and stuck-off faults in three test phases, as elaborated in Figure 2.11.

The approach presented in [SXCT00] generates a parity bit for WUTs consisting of Single lines. It seeks to eliminate the fault masking inherent in comparison-based ORAs when both sets of WUTs have equivalent faults. Another advantage of this approach is that the TPG has to control only a single set of WUTs instead of two, as in the case shown in Figure 2.8. The Single lines tested by this approach comprise only about 10% of the total routing resources of a Xilinx 4000 series FPGA. These are simpler to test than the MUX CIPs and the global routing resources that crisscross the complete PLB array [SLS03]. The design of the TPG is also more complicated, because of the overhead of calculating parity, than the simple counter-based TPG used in [SLS03].

[SLS03] uses comparison-based ORAs for testing all the logic as well as interconnect resources available in the Xilinx 4000 and Spartan-I. The BIST strategy implemented is derived from earlier works such as [AS01] [HGWS99] [SNLA02]. For logic BIST, two identical TPGs drive the test patterns to the BUTs. The structure of the TPG differs depending on the mode of operation being tested.
The output of each BUT is compared by two comparison-based ORAs, one on each side. The BIST test sessions are column oriented due to the presence of more vertical routing resources than horizontal ones and the presence of dedicated carry chain routing that is vertically oriented. For interconnect BIST, comparison-based ORAs and 2-bit counter-based TPGs are used. The 2-bit counter-based TPG is implemented within a single PLB of the Xilinx 4000 family and generates 4-bit test patterns using the current and next state of the counter. A 0,1 and a 1,0 combination exists in any pair of the four bits of the test pattern at least once in the sequence of four test patterns [SLS03].

Figure 2.11: Complete Testing of Switch Boxes [SWHA98]. Panels: a) Test Phase #1; b) Test Phase #2; c) Test Phase #3; d) CIP Faults Detected.

A minimum of 12 test phases are required to completely test the PLBs, including the dedicated carry logic, and 206 configurations are required to test the programmable interconnect resources [SLS03]. A third of the test configurations of the programmable interconnect are attributed to the presence of wider MUX CIPs that require one configuration for each of their inputs [SLS03] [SNLA02]. Other reasons cited for the higher number of test configurations than reported in earlier works are: reduced observability of the dedicated carry logic for testing, sharing of routing resources with the adjacent PLBs, and local routing resources that do not allow inputs to and outputs from the PLBs to come from and go to buses on any side of the PLB [SLS03].

Usually a scan chain mechanism would be used for retrieving the ORA results, as shown in Figure 2.12.
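The pair-wise property stated earlier for the 2-bit counter-based TPG can be checked by enumeration. The sketch below assumes a binary up-counter whose current state supplies the two high-order bits of the 4-bit pattern and whose next state supplies the two low-order bits. The counting order, the bit ordering and all names are assumptions made for illustration, since the construction in [SLS03] is not reproduced here at that level of detail.

```java
// Sketch: checks that a 2-bit counter's current and next state yield 4-bit
// patterns in which every pair of bits sees both (0,1) and (1,0).
final class TwoBitCounterTpg {
    /** Pattern bits 3:2 hold the current state, bits 1:0 the next state. */
    static int pattern(int state) {
        int next = (state + 1) & 0x3; // assumed binary up-counter
        return ((state & 0x3) << 2) | next;
    }

    /** True if each of the six bit pairs sees 0,1 and 1,0 over the four states. */
    static boolean everyPairSeesOppositeValues() {
        for (int i = 0; i < 4; i++) {
            for (int j = i + 1; j < 4; j++) {
                boolean zeroOne = false;
                boolean oneZero = false;
                for (int state = 0; state < 4; state++) {
                    int p = pattern(state);
                    int bi = (p >> i) & 1;
                    int bj = (p >> j) & 1;
                    if (bi == 0 && bj == 1) zeroOne = true;
                    if (bi == 1 && bj == 0) oneZero = true;
                }
                if (!(zeroOne && oneZero)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (int state = 0; state < 4; state++) {
            String bits = Integer.toBinaryString(0x10 | pattern(state)).substring(1);
            System.out.println("state " + state + " -> pattern " + bits);
        }
        System.out.println("pair-wise property holds: " + everyPairSeesOppositeValues());
    }
}
```

Under these assumptions the four patterns are 0001, 0110, 1011 and 1100, and every pair of bit positions carries opposite values in both polarities, which is what sensitizes bridging faults between any two of the four WUTs.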
As illustrated in [HGWS99], a scan chain interfaced with the IEEE 1149.1 boundary scan interface improves diagnostic resolution. However, due to the lack of architectural features in the Xilinx 4000 and Spartan series PLBs, retrieving BIST results through a scan chain connected to an IEEE 1149.1 boundary scan interface requires additional test phases. Thus, the Pass/Fail results from the ORAs are obtained through complete configuration memory readback.

Figure 2.12: Scan Cell Interfacing with IEEE 1149.1 [HGWS99]

2.6.4 Using JBits API to Generate Interconnect BIST Configurations

Testing the logic resources of an FPGA using the JBits API is claimed in [SMG01]. The approach described does not configure the FPGA with a logic or routing BIST architecture. Instead, the JBits program performs functional level testing of the configuration memory of the FPGA and of the LUTs. The JBits program writes an FPGA configuration into the configuration memory and then reads the configuration data back. While testing the LUTs, the JBits program writes a '0' value into the LUTs and reads back the data from the LUTs. Any mismatch between the two is signaled on the user input/output terminal. The method described for interconnect testing configures two LUTs as 16-bit shift registers. The first shift register drives logic patterns onto the WUT. The second shift register records the outputs. The contents of the second LUT are then read back using configuration memory readback. The approach does not consider any of the fault models for the interconnect test; fault coverage is therefore an issue. The approach considers the MUX in the PLB that drives the wire to be a part of the wire. Therefore, either the MUX or the wire may be faulty if there is a mismatch between the data pattern being driven and the data pattern recorded.
[FH03] explores testing interconnect resources by implementing routing BIST using JBits. The counter-based TPG used is 11 PLBs tall and 2 PLBs wide. The comparison-based ORA used has a height of 20 PLBs. Of the Single, Hex and Long wires, only the Single wires are tested. The approach tests only switch box CIPs. As the Hex wires are not tested, the CIPs at either end of the Hex wires also remain untested. The CIPs are not tested for stuck-on (stuck-closed) faults. Out of 960 switch boxes in a Virtex XCV150 chip, which features a 24×36 array of PLBs, 776 switch boxes are tested. In this approach it is not possible to determine the location of the fault. For the approach to work, an FPGA must feature a PLB array with at least 20 rows, which raises concerns about the scalability of the approach. Finally, the memory required to store the configurations is not optimized through partial reconfiguration.

2.7 Thesis Statement

Improvements in the performance of BIST for FPGAs can be obtained by minimizing the size of the configuration data required for each test phase and by minimizing the amount of readback data that must be read in order to extract the BIST results. With partial reconfiguration, regularity across the test phases can be exploited to generate partial bitstreams that are much smaller than the full configuration bitstreams. With partial configuration memory readback, it is possible to read only the contents of the ORA columns instead of the complete configuration memory. This thesis examines the improvements that can be obtained with partial reconfiguration and readback. It also examines the implications of partial reconfiguration and readback for the design and generation of logic and routing BIST configurations. While this thesis focuses on Xilinx Virtex-I and Spartan-II FPGAs, the proposed methods can be used in any FPGA that supports partial reconfiguration and readback.
The JBits API contains complete functionality to generate logic BIST configurations for Virtex-I and Virtex-II FPGAs. In order to generate logic BIST configurations for the Virtex-I and Virtex-II families of FPGAs using JBits, RTPCores can be designed for counter-based TPGs, BUTs configured in different modes of operation and comparator-based ORAs. The RTPCores can be routed by a parent core. In this thesis, we identify a semi-automatic routing method using JRoute that is suitable for generating routing BIST configurations. The response of the ORAs can be retrieved using the partial configuration memory readback facility offered by the Virtex architecture. Strictly speaking, the JBits API is a collection of user abstractions built over the configuration memory map of the FPGA. Therefore, porting the JBits API to other FPGAs requires knowledge of the configuration memory map of the FPGA, which FPGA manufacturers are reluctant to share with their customers. However, methods exist to identify the correspondence between the bitstream offsets and the resources controlled. Therefore, this method of generating interconnect configurations using JBits may be ported to other FPGAs as long as the contract with JBits is fully implemented for the target FPGA.

Chapter 3
Partial Reconfiguration and Readback for Logic BIST

The performance of BIST depends on the number of test phases and test sessions, the total time to download the test phases, the size of the memory required to store all BIST configurations needed to guarantee 100% stuck-at fault coverage and the time required to retrieve the ORA Pass/Fail results. Partial reconfiguration can be used effectively as a strategy to reduce download time and to minimize the size of the memory required to store all logic and routing BIST configurations. In this chapter we present the experiments performed to quantify the performance improvement, in terms of reduced time, obtained by loading the logic BIST configurations through partial reconfiguration and reading the results through partial configuration memory readback.
In the next chapter we discuss minimizing the configuration time required to load the routing BIST configurations using partial reconfiguration.

In each logic BIST test phase, PLBs configured as BUTs are tested in one of their modes of operation. In the next test phase, the mode of operation of the BUTs is changed. The difference between the configuration bitstreams for the two test phases is expected to be small, as only the BUTs are configured in a different mode of operation, while the TPGs, ORAs and routing usually maintain the same functionality and configuration. Partial readback of the configuration memory would reduce the time required to retrieve the ORA results compared to reading the entire configuration memory.

3.1 Floorplan of Logic BIST to Aid Partial Reconfiguration

The frame organization of the Virtex-I and Spartan-II architecture is column based. If the BUTs are also aligned in columns, then all the changes in the configuration bitstream are limited to the frames associated with the columns of BUTs. This reduces the amount of configuration data that needs to be stored and downloaded: after the first full configuration, the only data that needs to be saved in memory and downloaded for the next BIST configuration are the frames that have changed from the configuration already in place. However, if the floorplan is row oriented, the BUTs whose configuration changes are arranged in rows. This results in a change in every configuration memory column that holds PLB configuration, and hence in the frame data corresponding to those columns. Thus the number of frames that need to be loaded is twice the number that would need to be loaded if the floorplan of the BIST test phase were column oriented, since in the column-oriented case only half as many PLB columns are under test in a given test session.
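The effect of the floorplan on the number of frames touched can be illustrated with a small sketch. It is a simplified model rather than the thesis tooling: BUTs are assumed to occupy every other column (or every other row) of a 16×24 PLB array, the placement of TPGs and ORAs is ignored, and each PLB column is assumed to map to 48 configuration frames, the figure quoted for the XCV50 in Section 3.5.1. The function name and the layout rule are hypothetical.

```python
FRAMES_PER_COLUMN = 48  # assumed: PLB column frame count quoted for the XCV50


def changed_columns(rows, cols, orientation):
    """Return the set of PLB columns whose configuration changes when
    every BUT switches to a new mode of operation."""
    changed = set()
    for r in range(rows):
        for c in range(cols):
            # Assumed layout: BUTs sit in every other column for a
            # column-oriented floorplan, every other row otherwise.
            if orientation == "column":
                is_but = c % 2 == 1
            else:
                is_but = r % 2 == 1
            if is_but:
                changed.add(c)
    return changed


column_plan = changed_columns(16, 24, "column")
row_plan = changed_columns(16, 24, "row")

column_frames = len(column_plan) * FRAMES_PER_COLUMN
row_frames = len(row_plan) * FRAMES_PER_COLUMN
print(column_frames, row_frames)
```

In this model the row-oriented floorplan touches every PLB column, so it needs twice as many frames as the column-oriented one.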
Another important effect of having a column-oriented test session is observed while performing partial configuration memory readback to retrieve the ORA Pass/Fail results at the end of each BIST sequence. In the column-oriented BIST test phases, the ORA results would be confined to (N/2 - 1) × 2 frames, as there are two ORAs per column. In order to retrieve the ORA Pass/Fail results, only the frames that map the flip-flops in the PLBs configured as ORAs need to be read. The flip-flops containing the Pass/Fail results of the test phase lie in different frames and therefore have different minor addresses. As a result, the number of frames holding ORA Pass/Fail results is the number of ORA columns multiplied by two. However, in the row-oriented BIST test phase, it is necessary to read back N × 2 frames to retrieve all the ORA Pass/Fail results, as there would be (N/2 - 1) × 2 ORA Pass/Fail results in every PLB column. Therefore the floorplan of a BIST test session, i.e. the TPGs, BUTs and ORAs, should be column based to aid partial reconfiguration and readback, as shown in Figure 3.1.

[Figure 3.1: Floorplan for BIST Test Session. Panels: (a) Test Session 1, (b) Test Session 2. Each panel shows alternating columns of TPGs, BUTs and ORAs.]

3.2 Generating Partial Reconfiguration Files

In this section we present the process of generating the partial reconfiguration files for logic and interconnect BIST. A JBits program or a C program may be written to produce the XDL files containing the netlist of logic or interconnect BIST circuits for the Virtex-I and Spartan-II families of FPGAs. We used a C program, developed by Dr. Charles Stroud, that automatically generates XDL files for the logic BIST test phases. This program is run with specific command line arguments to directly generate the XDL
The XDL flles are then converted into NCD flles using the program \xdl.exe" as follows: xdl.exe -xdl2ncd outfile.xdl The logic BIST conflguration bitstreams for the target chip are generated from the NCD flles using the BitGen program. The partial bitstreams can be generated following the process described in the next subsection. 3.2.1 Using BitGen In order to generate the partial reconflguration flles, BitGen is run from the com- mand line with the design netlist flle (.ncd) passed as an argument and the command line option \ActiveReconflg" is disabled by default (ActiveReconflg:No). This means that the generated bitstream would contain the shut down instruction (Shutdown and AGHIGH commands) and the GSR signal would get activated after the partial recon- flguration is done [Xil02c] [Xil02d]. After the partial reconflguration is complete, the GSR signal clears the ORA ip- ops and initializes the TPGs and BUTs for the next test phase. The following is an example usage of BitGen for generating partial BIST conflgura- tion flles: bitgen -g ActiveReconfig:No -r \ bist_phase00.bit bist_phase01.ncd bist_phase01.bit 64 Here -r option is used to create the partial bitstream (bist phase01.bit). This flle contains frames that are difierent between the full conflguration bitstream of the new de- sign(bist phase01.ncd)andthecurrently-loadedfullconflgurationbitstream(bist phase00.bit). The \n" simply indicates the continuation of the command on the next line. BitGen Command Line Options The following command line options are useful when generating partial reconflgu- ration bitstream, readback bitstream, interconnect or logic BIST: -b Whenthisoptionisspecifledonthecommandline,BitGenproducesanASCIIversion ofthebitstream(extension.rbt). TheASCIIfllerepresentsthebitstreaminhuman readable format. The ASCII flle can be used for debugging when the bitstream is modifled for the fault-injection emulation. 
-g Persist When this option is specified on the command line, the readback circuit is configured in the output bitstream and the contents of the configuration memory can be retrieved. This option should be enabled for retrieving the ORA Pass/Fail results using configuration memory readback.

-g Readback When this option is specified on the command line, BitGen produces an ASCII file that contains readback commands (extension .rba) and its binary version (extension .rbb), as well as an ASCII file with readback data (extension .rbd) that can be used to verify whether the configuration was loaded properly.

-l When this option is specified on the command line, BitGen produces an ASCII report file (extension .ll), the logic allocation file, that enumerates all the components in the design that can be read back or captured. The file contains the location of the bits in the readback bitstream, the frame address, the frame offset, the type of logic resource and the name of the component. After the ORA frames are retrieved, the location of the ORA Pass/Fail results within each frame is given by the logic allocation file.

3.3 Generating a Test Plan for Logic BIST

As a partial bitstream records only the changes in the configuration memory between two configurations, a partial reconfiguration bitstream can be downloaded only after its reference configuration has been loaded. Therefore, in general, the sequence in which the full and partial bitstreams are downloaded is important. The sequence of the partial bitstreams, and consequently of the logic or interconnect BIST test phases, should be such that a minimum number of resources change their configuration from one phase to the next. If only a small number of logic resources change their configuration, the number of frames that change from one BIST test phase to the next is small. This results in a smaller partial reconfiguration bitstream for the BIST test phase and a faster download time.
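The dependence of a partial bitstream on its reference configuration can be sketched with a toy model. The configuration memory is assumed to be a mapping from frame address to frame contents; the frame contents, addresses and function names below are illustrative placeholders, not the Virtex bitstream format or BitGen's algorithm.

```python
def make_partial(reference, target):
    """Keep only the frames of `target` that differ from `reference`."""
    return {addr: data for addr, data in target.items()
            if reference.get(addr) != data}


def apply_partial(loaded, partial):
    """Overwrite only the frames carried by the partial bitstream."""
    result = dict(loaded)
    result.update(partial)
    return result


# Three hypothetical test phases; only BUT frames change between them.
phase1 = {0: b"TPG", 1: b"BUT-mode1", 2: b"ORA", 3: b"BUT-mode1"}
phase2 = {0: b"TPG", 1: b"BUT-mode2", 2: b"ORA", 3: b"BUT-mode1"}
phase3 = {0: b"TPG", 1: b"BUT-mode2", 2: b"ORA", 3: b"BUT-mode2"}

p12 = make_partial(phase1, phase2)   # one changed frame
p23 = make_partial(phase2, phase3)   # one changed frame

in_order = apply_partial(apply_partial(phase1, p12), p23)
out_of_order = apply_partial(phase1, p23)  # reference phase2 was skipped

print(len(p12), len(p23), in_order == phase3, out_of_order == phase3)
```

Applying the second partial bitstream without first loading its reference leaves frame 1 in its phase-1 state, which is why the download order matters.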
Limitations in the Virtex-I architecture prevent testing both slices in a PLB at the same time. For example, there are only eight output muxes to drive the signals from the two slices to the routing resources. If the implemented design contains any feedback loops or drives multiple outputs of the multiplexer, the number of signal outputs from the BUT that can be compared in a single BIST test phase is further reduced. The BUT has twelve output signals that need to be compared by two different ORAs. The ORA implemented in a single slice is capable of comparing three pairs of output signals from two identically configured BUTs. Due to the limited number of output multiplexers to carry the signals from BUT to ORA, only a single slice can be configured as a BUT. Hence, in a single test phase, one can test a single slice. This doubles the number of test configurations and also affects the generation of a test plan for partial configurations.

In order to study the effect of partial reconfiguration on logic BIST, we generated different logic BIST configurations to bring individual slices under test and configured them in different modes of operation. There are four options available for configuring and testing the two slices.

• The slice not under test maintains the same operation mode from the previous test phase while the slice under test changes its mode of operation (Scenario 1),

• Both slices change the operation mode simultaneously (Scenario 2),

• The slice under test alternates between slice 0 and slice 1 in successive test phases and the slice not under test maintains its mode of operation from the previous configuration (Scenario 3),

• The mode of operation is changed keeping the same slice under test and the slice not under test maintains its mode of operation from the first configuration (Scenario 4).

The C program is capable of generating all four scenarios described above for two consecutive test phases.
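Whichever scenario is chosen, the resulting test plan is an ordered list of configurations in which every configuration after the first is loaded as a partial bitstream referenced to its predecessor. The sketch below only builds command strings in the style used in this chapter and does not invoke BitGen. The design names and the partial-file naming scheme are hypothetical placeholders; the flags are the ones quoted in this chapter (-l, -b, -g Persist:Yes, -g ActiveReconfig:No, -r).

```python
def bitgen_commands(plan):
    """Build a BitGen command list for an ordered test plan.

    plan: list of design base names (no extension), in download order.
    """
    # The first configuration is always loaded as a full bitstream.
    commands = [f"bitgen -l -b {plan[0]}.ncd"]
    pairs = list(zip(plan, plan[1:]))
    for index, (prev, nxt) in enumerate(pairs):
        partial = f"{nxt}_from_{prev}.bit"  # hypothetical naming scheme
        commands.append(
            "bitgen -l -b -g Persist:Yes -g ActiveReconfig:No "
            f"-r {prev}.bit {nxt}.ncd {partial}"
        )
        # A full bitstream of `nxt` is needed only as the reference
        # for the partial bitstream that follows it.
        if index < len(pairs) - 1:
            commands.append(f"bitgen -l -b -g Persist:Yes {nxt}.ncd")
    return commands


plan = ["p1_s0", "p2_s0", "p1_s1", "p2_s1"]  # placeholder design names
commands = bitgen_commands(plan)
for line in commands:
    print(line)
```

For a plan of four configurations this yields one initial full bitstream, three partial bitstreams and two intermediate full bitstreams that serve only as references.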
This program demonstrates the improvement in the time required to load test configurations and is sufficient to draw conclusions about the test plan that would benefit the most from partial reconfiguration. The size of the bitstream containing the full configuration of a test phase, and consequently the configuration time required to load the bitstream, does not vary over the test phases. Therefore, we are able to generalize our results to all logic BIST test phases.

Initially, the FPGA is fully configured with the bitstream containing both slices configured in test phase 1, and slice 0 is tested. The subsequent configurations are partial configurations. The slice under test is indicated by the shaded region in Figure 3.2. Scenario 1 (Figure 3.2(a)) illustrates the case when one of the slices is partially reconfigured with the second test configuration (from test phase 1 to test phase 2), while the other slice maintains its previous configuration. In Scenario 2 (Figure 3.2(b)), the case under investigation is when both slices change their test configurations (from test phase 1 to test phase 2) at the same time. As noted before, only one slice can be tested. In Scenario 3 (Figure 3.2(c)), slice 1 is brought under test while both slice 0 and slice 1 remain in test phase 1. Finally, in Scenario 4 (Figure 3.2(d)), the slice under test is first tested in test phase 1 and then tested in test phase 2, while the other slice maintains its configuration.

Table 3.1 shows the command list executed to prepare the full and partial bitstream files for executing the test plan in Scenario 1. The first four lines of the command listing shown in Table 3.1 generate the corresponding NCD files from the XDL files. On line 5, the full configuration bitstream is generated from the initial configuration, i.e. where slices 0 and 1 are both configured with test phase 1.
Line 6 lists the command line for generating a partial bitstream containing only the frames that differ between the full configuration (LogicBIST_WestP1S0O0.bit) and the new configuration (LogicBIST_WestP2S0O1.ncd). The result is a partial bitstream (LogicBIST_Part_West_P12S00O01.bit). We compile a full configuration using the command on line 7. This is the configuration that the FPGA holds after loading the full configuration bitstream followed by the partial bitstream generated by line 6. The full configuration is needed when generating the next partial configuration. Lines 7 and 9 have the same form as line 5 except that the bitstreams are generated for different test phases. Lines 8 and 10 have the same form as line 6 except that the partial bitstreams contain the difference between LogicBIST_WestP2S0O1.bit and LogicBIST_WestP1S1O0.ncd (line 8) and between LogicBIST_WestP1S1O0.bit and LogicBIST_WestP2S1O1.ncd (line 10).

[Figure 3.2: Four Different Test Plans for Testing Two Slices. Panels: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3, (d) Scenario 4. Each panel shows the test phase of slice 0 and slice 1 for one full configuration followed by three partial configurations.]

Table 3.1: Command Listing for Scenario 1
Line  Command
1.    xdl.exe -xdl2ncd LogicBIST_WestP1S0O0.xdl
2.    xdl.exe -xdl2ncd LogicBIST_WestP2S0O1.xdl
3.    xdl.exe -xdl2ncd LogicBIST_WestP1S1O0.xdl
4.    xdl.exe -xdl2ncd LogicBIST_WestP2S1O1.xdl
5.    bitgen -d -l -b LogicBIST_WestP1S0O0.ncd
6.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S0O0.bit LogicBIST_WestP2S0O1.ncd LogicBIST_Part_West_P12S00O01.bit
7.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP2S0O1.ncd
8.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -g Readback -r LogicBIST_WestP2S0O1.bit LogicBIST_WestP1S1O0.ncd LogicBIST_Part_WestP21S01O01.bit
9.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP1S1O0.ncd
10.   bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -g Readback -r LogicBIST_WestP1S1O0.bit LogicBIST_WestP2S1O1.ncd LogicBIST_Part_WestP12S11O01.bit

In order to generate the logic BIST configurations for Scenario 2, we use the sequence of commands given in Table 3.2. The flow of commands is essentially the same as that of Scenario 1.

Table 3.2: Command Listing for Scenario 2
Line  Command
1.    xdl.exe -xdl2ncd LogicBIST_WestP2S0O0.xdl
2.    xdl.exe -xdl2ncd LogicBIST_WestP2S1O0.xdl
3.    bitgen -d -l -b LogicBIST_WestP1S0O0.ncd
4.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S0O0.bit LogicBIST_WestP2S0O0.ncd LogicBIST_Part_WestP12S00O00.bit
5.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP2S0O0.ncd
6.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP2S0O0.bit LogicBIST_WestP1S1O0.ncd LogicBIST_Part_WestP21S01O00.bit
7.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP1S1O0.ncd
8.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S1O0.bit LogicBIST_WestP2S1O0.ncd LogicBIST_Part_WestP12S11O00.bit

In order to generate the logic BIST configurations for Scenario 3, we use the sequence of commands given in Table 3.3. The flow of commands is essentially the same as that of Scenarios 1 and 2. We do not repeat the process of converting XDL files to NCD files, as the NCD files are already available from Scenarios 1 and 2.

Table 3.3: Command Listing for Scenario 3
Line  Command
1.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S0O0.bit LogicBIST_WestP1S1O0.ncd LogicBIST_Part_WestP11S01O00.bit
2.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP1S1O0.ncd
3.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S1O0.bit LogicBIST_WestP2S0O0.ncd LogicBIST_Part_WestP12S10O00.bit
4.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP2S0O0.ncd
5.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP2S0O0.bit LogicBIST_WestP2S1O0.ncd LogicBIST_Part_WestP22S01O00.bit

In order to generate the logic BIST configurations for Scenario 4, we use the sequence of commands given in Table 3.4. The flow of commands is essentially the same as that of Scenarios 1, 2 and 3. The partial bitstream (LogicBIST_Part_WestP11S01O00.bit) and the full configuration that results from it (LogicBIST_WestP1S1O0.bit) are common to Scenario 3 and Scenario 4, as the sequence of the first two configurations is the same.

Table 3.4: Command Listing for Scenario 4
Line  Command
1.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP1S1O0.bit LogicBIST_WestP2S1O0.ncd LogicBIST_Part_WestP12S11O00.bit
2.    bitgen -d -l -b -w -g Persist:Yes LogicBIST_WestP2S1O0.ncd
3.    bitgen -d -l -b -w -g Persist:Yes -g ActiveReconfig:No -r LogicBIST_WestP2S1O0.bit LogicBIST_WestP2S0O0.ncd LogicBIST_Part_WestP22S10O00.bit

Table 3.5: Partial Frames (Scenarios 1 and 2)
          Scenario 1                           Scenario 2
Config    Slice 0   Slice 1   Frames           Slice 0   Slice 1   Frames
          Phase     Phase                      Phase     Phase
Full      1         1         -                1         1         -
Partial   2         1         138              2         2         264
Partial   1         1         195              1         1         274
Partial   1         2         138              2         2         264

Table 3.6: Partial Frames (Scenarios 3 and 4)
          Scenario 3                           Scenario 4
Config    Slice 0   Slice 1   Frames           Slice 0   Slice 1   Frames
          Phase     Phase                      Phase     Phase
Full      1         1         -                1         1         -
Partial   1         1         92               1         1         92
Partial   2         1         274              1         2         264
Partial   2         2         92               2         2         92

3.4 Experimental Results for Logic BIST

The partial bitstream is composed of the frames that differ between the configuration currently loaded in the FPGA and the new configuration.
Table 3.5 and Table 3.6 give the mode of operation for each slice and the number of configuration data frames of the current configuration that differ from the previous configuration. The first column in the tables indicates that the first configuration is a full configuration while the subsequent configurations are partial. In Scenario 4, the total number of frames of the bitstreams required to execute test phases 1 and 2 is the smallest. However, in Scenario 4 the configuration of both slices is changed at the same time. As more test phases are considered, as opposed to the two considered here, the number of frames changed while switching the phase becomes the dominant factor. Scenario 1 shows a lower number of differing frames when changing the test phase. In this scenario, the slice under test is configured in one test phase while the other slice maintains its configuration in test phase one. As a result, when all the test phases are considered, Scenario 1 may be expected to produce fewer changed frames while switching the test phases. Therefore, Scenario 1 depicts the test plan that benefits the most from partial reconfiguration.

The configuration time is the time to load the FPGA with a specific design. The configuration time is directly proportional to the size of the bitstream. If all other factors, such as the rate at which the configuration bitstream is loaded and the mode in which the FPGA is configured, remain the same, the gain in configuration time is a function of the ratio of the size of the full bitstream in bytes to the size of the partial bitstream in bytes. The "Ratio" field in Table 3.7 gives the size of the full bitstream in bytes divided by the size of the corresponding partial bitstream in bytes.
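The two quantities discussed above can be checked with a few lines of arithmetic. The frame counts are copied from Tables 3.5 and 3.6 and the byte sizes from Table 3.7; the variable and function names are illustrative only.

```python
# Changed frames per partial configuration (Tables 3.5 and 3.6).
changed_frames = {
    "Scenario 1": [138, 195, 138],
    "Scenario 2": [264, 274, 264],
    "Scenario 3": [92, 274, 92],
    "Scenario 4": [92, 264, 92],
}

# Total frames downloaded as partial bitstreams in each scenario.
totals = {name: sum(frames) for name, frames in changed_frames.items()}
smallest = min(totals, key=totals.get)


def size_ratio(full_bytes, partial_bytes):
    """Full-to-partial bitstream size ratio, as in the Ratio column."""
    return full_bytes / partial_bytes


# Two rows of Table 3.7: XC2S50 and XC2S15, West, phase 1, slice 1.
ratio_xc2s50 = round(size_ratio(69985, 6829), 2)
ratio_xc2s15 = round(size_ratio(24797, 6301), 2)

print(totals, smallest, ratio_xc2s50, ratio_xc2s15)
```

With all other factors equal, a ratio of about 10 means the partial bitstream loads in roughly a tenth of the time of the full bitstream.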
From Table 3.7 it is clear that as the devices get bigger (the XC2S15 has a PLB array of 8x12 while the XCV50 and XC2S50 have a PLB array of 16x24), the ratio of the size of the full bitstream to the size of the partial bitstream increases. In addition, the ratio is highest when the partial bitstream spans two logic BIST phases with only a few differences in configuration. Thus the benefits of partial reconfiguration are more pronounced for the logic BIST test configurations of larger devices. The reason is that full reconfiguration becomes more and more expensive as the PLB array grows, while the high architectural regularity of the logic BIST test phases translates into a low number of frames changing their configuration from one logic BIST phase to the next.

Table 3.7: Sizes of Partial Bitstreams vs. Full Bitstreams

XC2S15
Direction  Phase  Slice  Full Bitstream  Partial Bitstream  Ratio  Previous  Previous
                         Size (bytes)    Size (bytes)              Phase     Slice
West       1      0      24797           NA                 NA     NA        NA
West       1      1      24797           6301               3.94   1         0
West       2      1      24797           8597               2.88   1         1
West       2      0      24797           6301               3.94   2         1
West       2      0      24797           8717               2.84   1         0
West       1      1      24797           8597               2.88   2         0
West       2      1      24797           8717               2.84   1         1

XC2S50
Direction  Phase  Slice  Full Bitstream  Partial Bitstream  Ratio  Previous  Previous
                         Size (bytes)    Size (bytes)              Phase     Slice
West       1      0      69985           NA                 NA     NA        NA
West       1      1      69985           6829               10.25  1         0
West       2      1      69985           17179              4.07   1         1
West       2      0      69985           6829               10.25  2         1
West       2      0      69985           17805              3.93   1         0
West       1      1      69985           18561              3.77   2         0
West       2      1      69985           17717              3.95   1         1

XCV50
Direction  Phase  Slice  Full Bitstream  Partial Bitstream  Ratio  Previous  Previous
                         Size (bytes)    Size (bytes)              Phase     Slice
West       1      0      69984           NA                 NA     NA        NA
West       1      1      69984           6828               10.25  1         0
West       2      1      69984           17716              4.07   1         1
West       2      0      69984           6828               10.25  2         1
West       2      0      69984           17804              3.93   1         0
West       1      1      69984           18472              3.77   2         0
West       2      1      69984           17716              3.95   1         1

XC2S50
Direction  Phase  Slice  Full Bitstream  Partial Bitstream  Ratio  Previous  Previous
                         Size (bytes)    Size (bytes)              Phase     Slice
East       1      0      69985           NA                 NA     NA        NA
East       1      1      69985           6389               10.95  1         0
East       2      1      69985           16489              4.24   1         1
East       2      0      69985           6389               10.95  2         1
East       2      0      69985           16577              3.93   1         0
East       1      1      69985           17245              4.22   2         0
East       2      1      69985           16489              4.24   1         1

3.5 Partial Configuration Memory Readback to Retrieve the BIST Results

After the completion of a BIST test phase, the flip-flops in each of the ORAs contain the Pass/Fail result for that test phase. Using the boundary scan interface, the results can then be shifted to the TDO pin and subsequently read by the system controller to determine the faulty or fault-free state of the FPGA. A diagnostic algorithm can be applied to the BIST results to determine the location of the faulty PLBs [AS01]. Using the boundary scan interface, the results can be retrieved using four methods: 1) existing boundary scan registers, 2) user-defined internal scan registers, 3) configuration memory readback, 4) integrated ORA and scan register [HGWS99]. The third and fourth methods have proved to be the most valuable in actual BIST implementations [SLS03] [SNLA02] [AS01].

The Virtex series devices allow partial readback of the configuration memory. The sequence of commands used to accomplish partial configuration memory readback is given in section 2.4. The FAR is set to the major and minor address of the frame containing the ORA flip-flop that holds the Pass/Fail result. As all the logic resources in a single column share a common major address, all the flip-flops in one column have a common major address. The minor addresses of the flip-flops, however, depend on the slice that the flip-flops belong to. The flip-flops in the same slice have a common minor address irrespective of the row or the column containing the slice. Therefore, all the flip-flops lying in a single column of the PLB array and in a single slice have common major and minor addresses. As there are two slices in each PLB, each having two flip-flops, the flip-flops in a PLB column are mapped into four different configuration frames regardless of the number of PLBs in the column.
Thus, the Pass/Fail results from all the ORAs in a single column can be retrieved by reading only four frames. The column-based logic BIST floorplan for the XCV100 would contain M/2 - 1 = 14 ORA columns, where M equals the number of columns of PLBs. Therefore, all the Pass/Fail results from all ORAs in the FPGA would be contained in 56 frames. In the SelectMAP mode, the readback data frame of the XCV100 contains thirteen 32-bit words [Xil02d]. In order to perform the partial configuration memory readback on 56 frames we must first write 56 × 6 + 1 = 337 command words into the FPGA, as the partial configuration memory readback is iterative in nature [MG00] [Xil02d]. Therefore, the total number of command bytes written for the partial configuration memory readback of all the ORAs in a BIST test phase for an XCV100 is 337 × 4 = 1348. The total number of clock cycles required depends on the configuration mode the FPGA is set in. In the SelectMAP mode of configuration, one byte can be read from or written to the configuration memory in each clock cycle. The total number of bytes in the readback to retrieve the Pass/Fail results from all the ORAs in a BIST test phase on the XCV100 would be 56 × 13 × 4 = 2912. The total number of clock cycles to perform partial configuration memory readback for the ORA Pass/Fail results would therefore be 2912 + 1348 = 4260. The number of clock cycles to perform partial configuration memory readback would see an eight-fold increase in the boundary scan mode, which is a serial mode. Therefore, the total number of clock cycles required to perform partial configuration memory readback in boundary scan mode would be 4260 × 8 = 34080.

The full configuration memory readback requires 90216 bytes of readback bitstream to be read and 28 bytes of command words to be written. The number of clock cycles required for full configuration memory readback is 21.2 times the number of clock cycles required for partial configuration memory readback.
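The XCV100 figures above follow from a few multiplications, sketched here so that the steps can be re-run for other frame counts. The function name is hypothetical; the six command words per frame plus one, the four bytes per word and the eight-fold penalty for the serial boundary scan mode are taken from the text.

```python
BYTES_PER_WORD = 4


def partial_readback_cycles(frames, words_per_frame):
    """Return (command_bytes, data_bytes, selectmap_cycles).

    In SelectMAP mode one byte moves per clock cycle, so the cycle
    count equals the number of bytes written plus bytes read.
    """
    command_bytes = (frames * 6 + 1) * BYTES_PER_WORD
    data_bytes = frames * words_per_frame * BYTES_PER_WORD
    return command_bytes, data_bytes, command_bytes + data_bytes


ora_columns = 30 // 2 - 1       # M = 30 PLB columns gives M/2 - 1 = 14
frames = ora_columns * 4        # four flip-flop frames per ORA column

command_bytes, data_bytes, selectmap = partial_readback_cycles(frames, 13)
boundary_scan = selectmap * 8   # serial mode: eight cycles per byte

full_readback_bytes = 90216 + 28
speedup = round(full_readback_bytes / selectmap, 1)

print(command_bytes, data_bytes, selectmap, boundary_scan, speedup)
```

The last value reproduces the 21.2-fold advantage of partial over full configuration memory readback quoted above.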
For the integrated ORA and scan register, the results of a test phase are latched in each ORA. There are (N × M/2 - N) × 4 ORA Pass/Fail results that need to be scanned out, where N is the number of rows in the PLB array. For the XCV100, the number of ORA Pass/Fail results that need to be scanned out would be 1120, which is about 30 times fewer clock cycles than the 34080 required for partial memory readback through the boundary scan interface. The Virtex architecture does not contain a dedicated multiplexer for utilizing internal scan chains. Therefore, implementing this approach on Virtex-I or Spartan-II FPGAs presents a logic overhead of one multiplexer per ORA. Also, the flip-flops in the scan chain and the boundary scan signals need to be routed. The lack of dedicated routing and logic resources to implement user-defined scan chains manifests itself as an increase in the number of test configurations. The number of BUT output signals that can be compared in an ORA reduces from six to four, as two inputs need to be reserved for the scan-in from the previous stage and for the shift control input. Therefore, testing twelve BUT outputs takes three configurations instead of two.

From the above calculations, it can be deduced that the number of clock cycles required for partial configuration memory readback using the SelectMAP mode is 4 × (M/2 - 1) × ffn × [6 + FLR] + 4, where M is the number of columns in the PLB array, ffn is the number of flip-flops in a PLB and FLR is the length of the configuration memory frame in 32-bit words. Therefore, the total number of clock cycles required for partial configuration memory readback is proportional to the product of the number of columns in the PLB array (M), the number of flip-flops in a PLB (ffn) and the length of the configuration memory frame (FLR). In contrast, the number of clock cycles required for the integrated ORA and scan register is proportional to the product of the number of columns in the PLB array (M) and the number of rows in the PLB array (N).
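Both expressions above are easy to evaluate. In the sketch below the function names are hypothetical, and the array sizes for the XCV100 (20×30) and XCV1000 (64×96) are assumptions taken from the Virtex device family rather than from this chapter; the XCV150 size (24×36) is quoted in Section 2.6.4. The closed form is evaluated for the XCV100 only, with FLR taken as thirteen 32-bit words.

```python
def scan_out_results(n_rows, m_cols):
    """ORA Pass/Fail results to shift out: (N * M/2 - N) * 4."""
    return (n_rows * m_cols // 2 - n_rows) * 4


def selectmap_readback_cycles(m_cols, ffn, flr_words):
    """Closed form: 4 * (M/2 - 1) * ffn * (6 + FLR) + 4."""
    return 4 * (m_cols // 2 - 1) * ffn * (6 + flr_words) + 4


# (rows N, columns M); XCV100 and XCV1000 sizes are assumed values.
arrays = {"XCV100": (20, 30), "XCV150": (24, 36), "XCV1000": (64, 96)}

scan_counts = {name: scan_out_results(n, m) for name, (n, m) in arrays.items()}
xcv100_cycles = selectmap_readback_cycles(m_cols=30, ffn=4, flr_words=13)

# Boundary scan partial readback needs eight cycles per byte.
advantage = (xcv100_cycles * 8) / scan_counts["XCV100"]

print(scan_counts, xcv100_cycles, round(advantage, 1))
```

The scan-out counts grow with N × M, whereas the readback cycle count grows with M × ffn × FLR, which is the comparison the text draws.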
As N is always smaller than the product of FLR and the number of flip-flops in a PLB, the integrated ORA scan chain fares better than partial configuration memory readback despite requiring one extra partial configuration. A comparison of the three approaches is given in Table 3.8. The table shows the number of clock cycles required by each method as the PLB array size increases from the XCV100 to the XCV150 to the XCV1000. The full configuration memory readback for the XCV150 requires 121536 bytes to be read and 28 command bytes to be written. Similarly, full configuration memory readback for the XCV1000 requires 745524 bytes to be read and 28 command bytes to be written. For partial configuration memory readback, the XCV150 requires 72 frames, each of 60 bytes, to be retrieved. Therefore, it requires 4320 bytes of data to be read and 1732 command bytes to be written. For partial configuration memory readback, the XCV1000 requires 192 frames, each consisting of 152 bytes, to be retrieved. Therefore, it requires 29184 bytes of data to be read and 4612 command bytes to be written.
Table 3.8: Comparison of Boundary Scan Access Method and Partial Configuration Memory Readback

Device   Method                                    Additional Logic     Number of clock cycles
XCV100   Full Memory Readback (Boundary Scan)      None                 721728
XCV100   Partial Memory Readback (SelectMAP)       None                 4260
XCV100   Partial Memory Readback (Boundary Scan)   None                 34080
XCV100   Integrated ORA and Scan Chain             1 MUX/PLB + routing  1120
XCV150   Full Memory Readback (Boundary Scan)      None                 972512
XCV150   Partial Memory Readback (SelectMAP)       None                 6052
XCV150   Partial Memory Readback (Boundary Scan)   None                 48416
XCV150   Integrated ORA and Scan Chain             1 MUX/PLB + routing  1632
XCV1000  Full Memory Readback (Boundary Scan)      None                 5964416
XCV1000  Partial Memory Readback (SelectMAP)       None                 33796
XCV1000  Partial Memory Readback (Boundary Scan)   None                 270368
XCV1000  Integrated ORA and Scan Chain             1 MUX/PLB + routing  12032

3.5.1 Commands for Partial Configuration Memory Readback

The command set for retrieving the BIST results from Virtex FPGAs using partial configuration memory readback is given in Table 3.9. The FAR register is loaded with the starting frame address. The starting frame address comprises the major address and the minor address of the frame of data to be read back. The number of 32-bit words to be read back is the frame length of the device multiplied by the number of frames to be read back. This value is given in bits (10:0) of the packet header that reads the FDRO register. In this example, column number 11 in the XCV50 is to be read back. The major address of this column is 2. The packet data to be loaded into the FAR register is as follows: bits (26:25) are set to (00)2 to signify that the address lies in one of the PLB columns, bits (24:17) indicate the major address and bits (16:9) indicate the minor address. Therefore, the FAR register is written with the address (00040000)h. The column consists of 48 frames, each of 44 bytes. The readback data is preceded by a 32-bit pad data word. One pad frame is also included when calculating the number of words to be read back.
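The address and packet header values used in this example can be rebuilt from their bit fields. The sketch below is an illustration, not a readback driver. The FAR field positions are those given above. The packet header layout (type in bits 31:29, operation in bits 28:27, register address in bits 26:13, word count in bits 10:0) and the register addresses for FAR, FDRO and CMD are assumptions chosen to be consistent with the header words listed in Table 3.9. The frame length of twelve 32-bit words is likewise inferred from the word count in that table.

```python
READ, WRITE = 0b01, 0b10                    # assumed operation codes
FAR, FDRO, CMD = 0b0001, 0b0011, 0b0100     # assumed register addresses


def far_value(block_type, major, minor):
    """Pack a frame address: type in (26:25), major (24:17), minor (16:9)."""
    return (block_type << 25) | (major << 17) | (minor << 9)


def type1_header(op, register, word_count):
    """Pack a packet header with the word count in bits (10:0)."""
    return (0b001 << 29) | (op << 27) | (register << 13) | word_count


# PLB column with major address 2, starting at minor address 0.
start_address = far_value(block_type=0b00, major=2, minor=0)

# 48 frames plus one pad frame, twelve words each (inferred frame length).
word_count = (48 + 1) * 12

write_far = type1_header(WRITE, FAR, 1)
write_cmd = type1_header(WRITE, CMD, 1)
read_fdro = type1_header(READ, FDRO, word_count)

for value in (write_far, start_address, write_cmd, read_fdro):
    print(f"{value:08X}")
```

Changing the major address or the number of frames regenerates the FAR data word and the FDRO read header for a different ORA column.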
The total number of 32-bit words to be read back is therefore 49 × 12 = 588: 49 frames including the pad frame, each occupying twelve 32-bit words (44 data bytes plus the pad word).

Table 3.9: Bitstream for Partial Configuration Memory Readback

FFFF FFFF    Dummy Word
AA99 5566    Synchronization Word
3000 2001    Packet Header: Write to FAR register
0004 0000    Packet Data: Starting frame address
3000 8001    Packet Header: Write to CMD register
0000 0004    Packet Data: RCFG
2800 624C    Packet Header: Read from FDRO
0000 0000    Flush the pipeline

3.6 Summary

The floorplan of the logic BIST should be column-based to aid partial reconfiguration and partial configuration memory readback. The plan that benefits the most from partial reconfiguration is identified as Scenario 4, where, except for the first configuration, the slice under test changes its mode of operation while the other slice maintains its configuration. The configuration time for loading the logic BIST test phases is significantly reduced by using partial configurations instead of full configurations. As the devices get bigger, the ratio of the size of the equivalent full configuration bitstreams to the size of the partial bitstreams increases. Thus, less time is required to load the test phases. The regularity of the logic BIST architecture across the test phases is advantageous for partial reconfiguration. Partial configuration memory readback is preferred over full configuration memory readback for retrieving the ORA Pass/Fail results; however, it still lags behind the integrated ORA and scan chain method.

Chapter 4

Generating Routing BIST Configurations using JBits

In this chapter we explore how the JBits API can be used to generate test configurations for testing interconnect resources. Though it is possible to generate logic as well as routing BIST configurations using JBits, this thesis implements a method for generating routing BIST configurations for Virtex-I and Spartan-II FPGAs using the JBits API.
One approach to implementing routing BIST with the JBits API is to use RTPCores to configure the PLBs as counter-based TPGs or comparator-based ORAs. These RTPCores are referred to as BIST RTPCores. The routing BIST architecture implemented by the BIST RTPCores, and the fault models it targets, are presented in Section 4.1. The JBits program developed for this thesis automatically generates bitstreams and XDL files containing routing BIST configurations for Virtex-I using the JBits API. The BIST RTPCores used as TPGs and ORAs, and the RTPCores responsible for routing the WUTs between the TPGs and ORAs and for populating the PLB array, are described in Section 4.2. The JBits program outputs a bitstream for Virtex-I FPGAs and an XDL file that enumerates the design specified in the program. The header of the XDL file gives the chip, package and speed grade information [Xil00]. As the architecture of the Virtex-I and Spartan-II FPGAs is the same, the header of the XDL file is modified to reflect the target Spartan-II chip, package and speed grade. The modified XDL file is then processed to generate a bitstream that tests the interconnects of Spartan-II FPGAs. The command line options and the results obtained by running the program are given in Section 4.3, which also presents an estimate of the number of routing BIST configurations required to test all routing resources of Virtex-I and Spartan-II FPGAs. Subsection 4.4.3 presents a set of four BIST configurations that completely test the switch box CIPs encountered in Virtex-I and Spartan-II for CIP stuck-on and CIP stuck-off faults. Section 4.5 proposes the modifications to the current implementation of the JBits program needed to generate the configurations that test the switch box CIPs.
4.1 Overview of Routing BIST Architecture

The BIST RTPCores instantiate the architecture for routing BIST: two counter-based TPGs generate two-bit exhaustive test patterns over eight parallel WUTs, and two comparison-based ORAs compare the WUTs from different TPGs for a mismatch. The TPGs, WUTs and ORAs are shown in Figure 4.1. The BIST RTPCores configure two slices as TPGCounterCores and two slices as ORACores. The BIST RTPCores inherit the functionality of an RTPCore. This enables them to take advantage of HDL-like features and to let the router do most of the routing. When a specific resource under test, such as a Hex wire or a Single wire, is to be specified, the BIST RTPCores refer directly to the physical resources of the FPGA in the design.

Figure 4.1: Horizontal and Vertical Interconnect Resources Tested for Shorts and Opens. (a) H-STAR and V-STAR while Testing Horizontal Hex Interconnects going E-W or Vertical Hex Interconnects going S-N. (b) H-STAR and V-STAR while Testing Horizontal Hex Interconnects going W-E or Vertical Hex Interconnects going N-S.

To establish the fault detection offered by this architecture, we consider the fault models described in Chapter 2. The routing BIST configurations generated by the program test for bridging faults, wire stuck-at faults, and CIP stuck-off and CIP stuck-on faults affecting Single as well as Hex wires. To sensitize the bridging faults between adjacent wires, both logic levels, logic 0 and logic 1, need to be presented at least once in the test phase on the parallel-running Hex and Single wires. To identify the wires that run in parallel, and are thus susceptible to bridging faults, we would need the physical layout of the FPGA. In its absence, we rely on the graphical layout presented by the FPGA Editor.
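The requirement that both logic levels appear on adjacent wires can be illustrated with a small simulation in plain Java. The sketch drives two adjacent WUTs with the two bits of a counter, compares them with the same bits from a second, fault-free TPG, and counts the vectors that expose a short between the two wires. The wired-AND behavior of the short is an assumption made for illustration only; the fault models themselves are those of Chapter 2. All names are hypothetical.

```java
// Illustrative sketch only: the wired-AND short and all names are assumptions.
final class BridgingFaultDemo {
    /** ORA mismatch: compares each wire of TPG A with the same bit from TPG B. */
    static boolean mismatch(int a0, int a1, int b0, int b1) {
        return ((a0 ^ b0) | (a1 ^ b1)) != 0;
    }

    /** Counts the vectors of a 2-bit counter that expose a short between TPG A's two wires. */
    static int detectingVectors() {
        int detected = 0;
        for (int count = 0; count < 4; count++) {
            int bit0 = count & 1;
            int bit1 = (count >> 1) & 1;
            int shorted = bit0 & bit1;  // assumed wired-AND: both wires take the AND value
            if (mismatch(shorted, shorted, bit0, bit1)) {
                detected++;
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        System.out.println("Detecting vectors out of 4: " + detectingVectors());
    }
}
```

Under this assumption only the two vectors that place opposite values on the shorted wires produce a mismatch; the vectors 00 and 11 leave the short unobserved.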
As the exhaustive test patterns contain both combinations of opposite logic values (0,1 and 1,0) at least once, the bridging faults between the parallel wires should be detected. If any of the CIPs along the WUTs is affected by a stuck-open fault, the comparison-based ORA records a mismatch between the outputs of the two identically configured TPGs driving the WUTs. To detect a CIP stuck-closed fault, the TPG controls both segments associated with the open CIP and applies opposite logic values to each segment. Thus both combinations of logic values, 01 and 10, need to be tested.

4.1.1 Testing the Interconnects in Parallel

Multiple Hex and Single interconnects can be driven by an output mux. The TPG drives exhaustive test patterns to the fan-outs of the output mux and tests them in parallel. The number of Hex and Single resources that can be reached from each output multiplexer is listed in Table 4.1. The set of interconnects reachable from one output multiplexer does not overlap with the set reachable from any other output multiplexer. While identifying the Hex and Single interconnects that may be tested in parallel, we notice that only the following types of Hex interconnects may be grouped together: Hex East and Hex South, Hex East and Hex West, Hex North and Hex South, or Hex North and Hex West. Testing East and South interconnects together and West and North interconnects together is simpler than testing East and West interconnects together and North and South interconnects together. When testing the Hex wires, in the first interconnect BIST phase, two identically configured TPGs drive two-bit exhaustive test patterns on four Hex interconnects going South-North and four Hex interconnects going East-West. Two comparison-based ORAs detect any mismatch that results from a fault. In the second interconnect BIST phase, four Hex interconnects going North-South and four Hex interconnects going West-East are tested.
The TPGs and the ORAs are arranged in the PLB array as shown in Figure 4.2(a). When testing the Single interconnects, the TPGs and the ORAs are arranged in the PLB array as shown in Figure 4.2(b).

Table 4.1: Connectivity between Output Multiplexers and Interconnects

                   Output Multiplexer
Interconnect       0   1   2   3   4   5   6   7
Hex Horiz East     1   1   1   1   1   1   1   1
Hex Horiz West     1   1   1   1   1   1   1   1
Hex Vert South     1   1   1   1   1   1   1   1
Hex Vert North     1   1   1   1   1   1   1   1
Single East        1   2   1   2   1   2   1   2
Single West        1   2   1   2   1   2   1   2
Single South       2   1   2   1   2   1   2   1
Single North       2   1   2   1   2   1   2   1

4.2 The Routing BIST RTPCores

The components of a JBits program observe a certain hierarchy. The components at a given level in the hierarchy are designed to perform certain tasks. While designing the BIST RTPCores, the hierarchy should be taken into consideration. This enables the RTPCores to make optimum use of the functionality offered by the JBits API. Observing these restrictions also results in maintainable code. Some of the important design decisions are:

• The parameters that define and modify the behavior of the RTPCore,
• Which RTPCores in the hierarchy should configure PLBs,
• Which RTPCores in the hierarchy should configure the routing,
• How the bitstream and XDL file are generated,
• How the programmer controls routing, and
• Which logical nets and buses are reflected in the XDL file and which are ignored.
Figure 4.2: Hex and Single Wires Tested for Shorts and Opens. (a) Testing Hex Lines for Bridging Faults and Opens. (b) Testing Single Lines for Bridging Faults and Opens.

As these considerations are generic and pertain to every JBits application, a discussion of all of them is beyond the scope of this thesis. In this thesis we restrict ourselves to discussing how these considerations come into play while developing BIST RTPCores. The steps described in Appendix A and Appendix B adhere to the hierarchy in the JBits API and are flexible enough to let the programmer control the routing. The program developed for this thesis is given in Appendix C and contains four classes: SimpleRouteBISTApp, SimpleRouteBIST, TPGCounterCore and ORACore.

SimpleRouteBISTApp is the entry point of the application. Therefore, it takes care of user interaction and command line arguments. Many useful functions pertaining to user interaction are contained in the class com.xilinx.util.JBitsCommandLineApp. Therefore, SimpleRouteBISTApp derives its functionality from this class. SimpleRouteBISTApp performs the following functions:

1. Parses the command line arguments.
2. Populates the entire PLB array with the routing BIST RTPCore generated by SimpleRouteBIST. To accomplish this, the parameters to the RTPCore SimpleRouteBIST are varied, as will be explained in Subsection 4.2.4.
3. Generates the XDL file, CTF file and the bitstream for the target FPGA.

The class SimpleRouteBIST is a parent RTPCore.
The RTPCore derives its functionality from the class com.xilinx.JBits.CoreTemplate.RTPCore, which contains useful functions to define logical nets and ports. The behavior of SimpleRouteBIST depends on two parameters, i and j, which identify the row and column numbers, respectively, in the PLB array where the TPGs should be placed. When testing the Hex lines, the ORA is always six PLBs away from the PLB configured as a TPG. Therefore, when testing Hex lines, the SimpleRouteBIST class configures a slice of the PLB six blocks away as an ORA. When testing Single lines, the SimpleRouteBIST class configures the ORA in a slice of the PLB adjacent to the one configured as a TPG.

The SimpleRouteBIST class then calls the method internalRoute with the parameters width, row and col. The parameter width tells the method the size of the WUT group; it depends on the number of flip-flops contained in a slice, which is two in the case of Virtex-I. The parameters row and col specify the row and column number of the TPG. The method then routes the WUT group of size width × 2 originating from PLB(row,col) to PLB(row,col+6) when generating a routing BIST configuration that tests Hex wires running East-West. When the WUTs are Hex wires running West-East, the coordinates of the PLB configured as an ORA change to (row,col-6). Similarly, when testing vertical Hex wires running North-South, the ORA is placed in PLB(row-6,col). Finally, when the WUTs are vertical Hex wires running South-North, the coordinates of the PLB configured as an ORA are (row+6,col). SimpleRouteBIST performs the following functions:

1. Assigns placement constraints to the child cores depending on the parameters,
2. Configures two slices of a PLB as identical counter-based TPGs,
3. Configures a slice of another PLB as an ORA comparing the outputs of the identical TPGs, and
4. Configures routing between the two PLBs with the specified WUTs.

The classes TPGCounterCore and ORACore are child RTPCores.
The child RTPCores are spawned by the parent RTPCore SimpleRouteBIST. Therefore, the placement constraints are defined by the parent RTPCore. The behavior of these RTPCores depends on the parameter width. The value of this parameter determines how many slices the RTPCore occupies in the PLB array. The function of specifying the routing between the child RTPCores is also performed by SimpleRouteBIST. The architecture of each of these RTPCores is elaborated in Subsections 4.2.1 and 4.2.2. The bitstream manipulation is the responsibility of the top level RTPCore: SimpleRouteBISTApp.

4.2.1 Configuring the TPG

The program uses a counter-based TPG to generate exhaustive test patterns for the WUTs. Virtex-I and Spartan-II PLBs contain four flip-flops, two in each slice. Therefore, a Virtex-I or Spartan-II slice can be configured as a 2-bit counter generating four test vectors. The logical buses and ports defined in the class TPGCounterCore are shown in Table 4.2. The RTPCore SimpleRouteBIST defines the upper level logical buses. The ports connect the upper level logical buses to the internally defined ones. The TPG flip-flops must be configured to be reset during the start-up sequence to ensure that the two TPGs being compared by the ORA are synchronized and produce identical test patterns during the BIST sequence.

Table 4.2: Input and Output Ports of TPGCounterCore

Port    Width (in Bits)    Direction    Function
clk     1                  IN           Counter Clock
dout    2                  OUT          Counter Output
ce      1                  IN           Counter Enable/Disable
sr      1                  IN           Counter Set/Reset

4.2.2 Configuring the ORA

The functionality of the comparator-based ORA is implemented in the class ORACore. The logical input and output ports of the ORACore are shown in Table 4.3. A Virtex-I slice configured as an ORA is shown in Figure 4.3. As stated earlier, each of the two identical TPGs drives two WUTs. Four WUTs are connected to the addr input port of the ORACore.
This input port, when the design is placed and routed, corresponds to the A1, A2, A3 and A4 inputs of the G LUT, so the inputs of the G LUT are connected to the WUTs. The G LUT needs to be configured to output a logic 1 if there is any mismatch between the logic values of the WUTs carrying identical test patterns. Therefore, the LUT is configured with the expression (A1 XOR A2) OR (A3 XOR A4). The Pass/Fail result is latched by the flip-flop. Since the four inputs of the G LUT have been exhausted, we cannot use the Y flip-flop to form a feedback path to the input of the G LUT. Therefore, the F LUT and the X flip-flop are used. The F LUT ORs the output of the G LUT with the output of the X flip-flop, which contains the Pass/Fail result of the test phase. Thus, the output of the G LUT needs to be connected to the Y output of the PLB to form a feedback loop to the A1 input of the F LUT. Here we use the facility provided by the JBits API to allocate FPGA physical resources, such as the Y output pin of the PLB, to any of the logical ports of the RTPCore. Using this facility, the port lutOut is assigned to the Y output pin. The F LUT is configured to implement the logic expression (A1 OR A2). Finally, the output of the X flip-flop (XQ) is connected to the A2 input of the F LUT. The X flip-flop must be configured to be reset during the start-up sequence prior to execution of the BIST sequence.

Table 4.3: Input and Output Ports of ORACore

Port      Width (in Bits)    Direction    Function
clk       1                  IN           Clock Input
addr      4                  IN           Input to the G LUT
lutOut    1                  IN-OUT       Feedback to the Latch
oraOut    1                  IN-OUT       Pass/Fail Output of the ORA
sr        1                  IN           Set/Reset Input

4.2.3 Routing the WUTs

The JBits API gives the programmer the flexibility to specify the physical routing after the RTPCores have been placed. The programmer can skip this step and let JRoute perform a completely automatic logical-to-physical mapping, at the expense of control over the routing.
In order to generate routing BIST configurations, we must maintain control over the process of routing the interconnect resources to be tested. This is the rationale behind the method internalRoute(int i, int j). This method may be called anywhere in the JBits program before the call to the static method connect(Bus) in the class com.xilinx.JBits.CoreTemplate.RTPCore.Bitstream is made. The method first defines the physical source and sink pins along with the WUTs. The output pins of the flip-flops of the two TPGs are defined as source pins. The input pins of the G LUT of the ORA are the sink pins. The static integers corresponding to the WUTs are selected from the class com.xilinx.JRoute2.Virtex.ResourceDB.CenterWires.

These physical resources are then routed using the automatic routing method. Routing RTPCores using this semi-automatic method has several advantages:

1. The routing resources to be tested can be specified under user control,
2. It is not necessary to specify the routing of the other elements in the path that are not under test,
3. The complexity of routing the switch box CIPs is avoided, except while testing CIP stuck-off faults.

Figure 4.3: Configuration of Comparator-Based ORA

4.2.4 Populating the PLB Array

The class SimpleRouteBISTApp takes care of populating the PLB array with instances of the RTPCore SimpleRouteBIST. The command line arguments parsed in the main routine specify the Virtex-I chip and the package being used. The number of columns and rows in the PLB array of the chip is obtained with the static methods defined in the class com.xilinx.JBits.Virtex.Devices: getClbCols(device name) and getClbRows(device name), respectively.
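When the array is populated, every TPG location needs a matching ORA location inside the array. A minimal sketch of the Hex-wire placement rule from Section 4.2 follows, in plain Java without the JBits API. The enum, the method names, the zero-based indexing and the bounds check are assumptions made for illustration; only the offsets of six PLBs per direction are taken from the description of SimpleRouteBIST.

```java
// Illustrative sketch only: names, indexing and the bounds check are hypothetical.
final class OraPlacement {
    /** Direction labels follow the text; offsets are (rowDelta, colDelta) in PLBs. */
    enum HexDirection {
        EAST_WEST(0, 6), WEST_EAST(0, -6), NORTH_SOUTH(-6, 0), SOUTH_NORTH(6, 0);

        final int rowDelta;
        final int colDelta;

        HexDirection(int rowDelta, int colDelta) {
            this.rowDelta = rowDelta;
            this.colDelta = colDelta;
        }
    }

    /** Returns {row, col} of the ORA for a TPG placed at (row, col). */
    static int[] oraLocation(int row, int col, HexDirection dir) {
        return new int[] { row + dir.rowDelta, col + dir.colDelta };
    }

    /** True if the ORA falls inside a rows-by-cols PLB array (zero-based indices assumed). */
    static boolean fits(int row, int col, HexDirection dir, int rows, int cols) {
        int[] ora = oraLocation(row, col, dir);
        return ora[0] >= 0 && ora[0] < rows && ora[1] >= 0 && ora[1] < cols;
    }

    public static void main(String[] args) {
        int rows = 16;  // example array size, chosen only for illustration
        int cols = 24;
        for (HexDirection dir : HexDirection.values()) {
            int[] ora = oraLocation(8, 10, dir);
            System.out.println(dir + ": ORA at (" + ora[0] + "," + ora[1] + "), fits="
                    + fits(8, 10, dir, rows, cols));
        }
    }
}
```

A populating loop could use such a check to skip TPG positions whose ORA would fall outside the array; how the thesis program handles the array edges is not shown here.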
It is important to keep in mind that the JBits API defines the lower left corner of the PLB array as the origin and calculates the X and [...]

    CoutSrcPin = new Pin[w];
    OraPin = new Pin[w];
    WutPin = new Pin[w];
    JRoute jroute = Bitstream.getVirtexRouter();
    CoutSrcPin[0] = new Pin(Pin.CLB, row, col, CenterWires.Slice_XQ[slice]);
    CoutSrcPin[1] = new Pin(Pin.CLB, row, col, CenterWires.Slice_XQ[slice+1]);
    CoutSrcPin[2] = new Pin(Pin.CLB, row, col, CenterWires.Slice_YQ[slice]);
    CoutSrcPin[3] = new Pin(Pin.CLB, row, col, CenterWires.Slice_YQ[slice+1]);
    OraPin[0] = new Pin(Pin.CLB, row, col+6, CenterWires.SliceG1[slice]);
    OraPin[1] = new Pin(Pin.CLB, row, col+6, CenterWires.SliceG2[slice]);
    OraPin[2] = new Pin(Pin.CLB, row, col+6, CenterWires.SliceG3[slice]);
    OraPin[3] = new Pin(Pin.CLB, row, col+6, CenterWires.SliceG4[slice]);
    WutPin[1] = new Pin(Pin.CLB, row, col, CenterWires.Hex_Horiz_East[1]);
    WutPin[0] = new Pin(Pin.CLB, row, col, CenterWires.Hex_Horiz_East[0]);
    WutPin[2] = new Pin(Pin.CLB, row, col, CenterWires.Hex_Horiz_East[2]);
    WutPin[3] = new Pin(Pin.CLB, row, col, CenterWires.Hex_Horiz_East[3]);
    for (int i=0;i [...] =WIDTH/2){
            oraSink.setNet(i, ffd1.getNet(i-(WIDTH/2)));
        }
    }
    */

Figure C.15: JBits Program Continued...
        //Bus OraSink = newBus("oraSink",WIDTH);
        Net passFailOut = newNet("passFailOut");  // Pass/Fail output from X flip-flop
        Net lutOut = newNet("lutOut");            // Output of the G LUT
        Net passFailIn = newNet("passFailIn");    // Logical input of X flip-flop

        // Feedback bus to the F LUT; declared before its nets are assigned
        Bus oraFB = newBus("OraFB",2);
        oraFB.setNet(0, passFailOut);
        oraFB.setNet(1, lutOut);

        ORAProperties op = new ORAProperties();
        op.setIn_addr0(oraSink);        // addr port of G LUT
        op.setIn_addr1(oraFB);          // addr port of F LUT
        op.setOut_lutOut(lutOut);       // Y output of Slice
        op.setIn_clk(clk);
        op.setIn_ce(Net.NoConnect);
        op.setOut_dout(passFailIn);
        op.setOut_doutReg(passFailOut);

        ORACore BISTORACore = new ORACore("ORA00",op);
        addChild(BISTORACore);

        // Place the ORA six PLB columns away from the TPG
        Offset ORAOffset = BISTORACore.getRelativeOffset();
        ORAOffset.setHorOffset(Gran.SLICE, (clbCol+6)*2+clbSlice);
        ORAOffset.setVerOffset(Gran.CLB, clbRow);

        // Configure the G LUT and F LUT of the ORA
        BISTORACore.implement(Expr.G_LUT("(G1 ^ G2) | (G3 ^ G4)"),
                              Expr.F_LUT("~( ((~F1)&F2) | (F1&(~F2)) )"));

        internalRoute(WIDTH, clbRow, clbCol);

        /* Connect Top level nets */
        Bitstream.connect(clk);
        Bitstream.connect(coutSrc);
        Bitstream.connect(ffd);
        Bitstream.connect(coutSrc1);
        Bitstream.connect(ffd1);
        Bitstream.connect(oraFB);
    } /* end implement() */

    private int clbRow, clbCol, clbSlice;
    private Pin [] CoutSrcPin;
    private Pin [] WutPin;
    private Pin [] OraPin;
    private Pin [] OutMuxPin;
}; /* end class SimpleRouteBIST */

Figure C.16: JBits Program Continued...
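The comparator behavior described in Section 4.2.2 can be modeled in a few lines of plain Java, independent of the JBits API. The sketch evaluates the G LUT expression from the listing above, (G1 ^ G2) | (G3 ^ G4), and latches the result in the way Section 4.2.2 describes, with the F LUT ORing the G LUT output and the stored Pass/Fail bit. The class and method names are hypothetical, and the model ignores timing and routing.

```java
// Illustrative behavioral model only: names are hypothetical, not JBits classes.
final class OraModel {
    private boolean fail = false;  // X flip-flop, reset before the BIST sequence

    /** G LUT: true if either pair of WUTs disagrees. */
    static boolean gLut(boolean a1, boolean a2, boolean a3, boolean a4) {
        return (a1 ^ a2) | (a3 ^ a4);
    }

    /** One clock: the F LUT ORs the G LUT output with the stored result. */
    void clock(boolean a1, boolean a2, boolean a3, boolean a4) {
        fail = gLut(a1, a2, a3, a4) | fail;
    }

    /** Pass/Fail result held in the X flip-flop (true means a mismatch was seen). */
    boolean failed() {
        return fail;
    }

    public static void main(String[] args) {
        OraModel ora = new OraModel();
        ora.clock(true, true, false, false);   // both pairs agree
        System.out.println("After matching vector:    failed=" + ora.failed());
        ora.clock(true, false, false, false);  // first pair disagrees
        System.out.println("After mismatching vector: failed=" + ora.failed());
        ora.clock(false, false, true, true);   // agreement again, result stays latched
        System.out.println("After matching vector:    failed=" + ora.failed());
    }
}
```

Once a mismatch has been clocked in, the model keeps reporting a failure for the rest of the sequence, which is the latching behavior the feedback path is meant to provide.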
Appendix D

Complete List of Connections Between the Mux CIPs

Table D.1: Mux CIPs Mux28to1 and Connecting Single Interconnects

                 Number of Distinct Wires Connected
Wires            S0F1, S0G1, S1F4, S1G4    S0F3, S0G3, S1F2, S1G2    S0F4, S0G4, S1F1, S1G1    S0F2, S0G2, S1F3, S1G3
SINGLE EAST      4                         7                         7                         6
SINGLE WEST      7                         6                         5                         6
SINGLE SOUTH     6                         5                         6                         7
SINGLE NORTH     7                         6                         6                         5

Table D.2: MUX CIPs Mux16to1 and Connecting Interconnects

                   Number of Distinct Wires Connected
Wires              S0BX, S0BY    S0CE, S1CE    S1BX, S1BY    S0SR, S1SR    S0Clk, S1Clk    TS0, TS1
HEX HORIZ EAST     0             0             0             0             0               0
HEX HORIZ WEST     0             0             0             0             0               0
HEX VERT NORTH     0             1             0             1             1               0
HEX VERT SOUTH     0             0             0             0             0               0
SINGLE EAST        4             2             4             2             1               0
SINGLE WEST        4             2             4             2             1               0
SINGLE SOUTH       4             3             4             3             2               0
SINGLE NORTH       4             3             4             3             2               0
HEX VERT A0        0             0             0             0             0               1
HEX VERT B0        0             0             0             0             0               1
HEX VERT C0        0             0             0             0             0               1
HEX VERT D0        0             0             0             0             0               1
HEX VERT M0        0             0             0             0             0               1
HEX VERT A1        0             0             0             1             0               0
HEX VERT B1        0             0             0             1             0               0
HEX VERT C1        0             0             0             1             0               0
HEX VERT D1        0             0             0             1             0               0
HEX VERT M1        0             0             0             1             0               0
HEX VERT A2        0             0             0             0             1               0
HEX VERT B2        0             0             0             0             1               0
HEX VERT C2        0             0             0             0             1               0
HEX VERT D2        0             0             0             0             1               0
HEX VERT M2        0             0             0             0             1               0
HEX VERT A3        0             1             0             0             0               0
HEX VERT B3        0             1             0             0             0               0
HEX VERT C3        0             1             0             0             0               0
HEX VERT D3        0             1             0             0             0               0
HEX VERT M3        0             1             0             0             0               0
GCLK0              0             0             0             0             1               0
GCLK1              0             0             0             0             1               0
GCLK2              0             0             0             0             1               0
GCLK3              0             0             0             0             1               0