Built-In Self Test for Digital Signal Processor Cores in Virtex-4 and Virtex-5 Field 
Programmable Gate Arrays 
 
by 
 
Mary Deepti Pulukuri 
 
 
 
 
A thesis submitted to the graduate faculty of 
Auburn University  
in partial fulfillment of the  
requirements for the Degree of 
Master of Science 
 
Auburn, Alabama 
August 9th, 2010 
 
 
 
 
Keywords: Built-In Self-Test, FPGA, DSP, 
Adder, Multiplier 
 
 
Copyright 2010 by Mary Deepti Pulukuri 
 
 
Approved by 
 
Charles Stroud, Chair, Professor of Electrical and Computer Engineering 
Adit Singh, Professor of Electrical and Computer Engineering 
Victor Nelson, Professor of Electrical and Computer Engineering 
Vishwani Agrawal, Professor of Electrical and Computer Engineering 
 
 
Abstract 
 
 
Current Field Programmable Gate Arrays (FPGAs) incorporate special cores, apart from 
logic, such as digital signal processor (DSP) cores. The DSP cores can be cascaded to implement 
complex functions. An effective test approach for testing the logic and configuration memory 
associated with these embedded cores is essential. The thesis presents an effective approach for 
testing digital signal processor cores embedded in Virtex-4 and Virtex-5 FPGAs using Built-In 
Self-Test (BIST) methodology. Since the BIST circuitry can be programmed in the logic present 
inside the FPGA that is not being tested at the time, there is no area overhead or performance 
penalty. 
 The implementation and verification of the developed BIST configurations were carried out on
various families and sizes of Virtex-4 and Virtex-5 FPGAs. The developed BIST configurations 
also detected manufacturing faults in some of the Virtex-4 engineering sample parts.   
  
Acknowledgements 
 
 
 I would like to thank Dr. Stroud for his guidance and support throughout my research for 
my thesis. I am grateful for his teaching that trained me to be a better engineer. I would like to 
thank Dr. Singh, Dr. Nelson and Dr. Agrawal for serving on my committee and for their helpful 
suggestions in improving my thesis. I would like to thank my colleagues Brad, Joey, Jie and 
Alex for their valuable advice and support. 
 
  
Table of Contents 
 
 
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
 1.1 Overview of FPGAs
 1.2 Overview of Built-In Self-Test
 1.3 Overview of FPGA BIST
 1.4 Thesis Statement
2 Background Information
 2.1 Configurable Logic Blocks in Virtex-4 and Virtex-5 FPGAs
 2.2 Architecture of DSP cores in Virtex-4 and Virtex-5 FPGAs
 2.3 Carry Look Ahead (CLA) Adders
 2.4 Booth Multipliers
 2.5 Prior Work Done in Testing CLA Adders
 2.6 Prior Work Done in Testing Booth and Booth/Wallace Multipliers
 2.7 Restatement of Thesis
3 BIST for DSP Cores in Virtex-4 FPGAs
 3.1 BIST Approach for DSPs in Virtex-4 FPGA
  3.1.1 Adder Test
  3.1.2 Multiplier Test
 3.2 BIST Architecture
 3.3 BIST Architecture and Test Sequences
 3.4 BIST Generation
 3.5 Detection of Faulty DSPs and Fault Coverage
 3.6 BIST Timing Analysis
 3.7 Summary
4 BIST for DSP Cores in Virtex-5 FPGAs
 4.1 BIST Approach for DSPs in Virtex-5 FPGAs
  4.1.1 Adder and Multiplier Tests
  4.1.2 Pattern Detector Test
  4.1.3 ALU Logic Mode Test
  4.1.4 Cascade Mode Test
  4.1.5 SIMD Mode Test
  4.1.6 MACC Extend Mode Test
 4.2 BIST Architecture
 4.3 BIST Configurations and Sequences
 4.4 BIST Generation
 4.5 Timing Analysis of BIST
 4.6 Fault Inject Analysis and Fault Coverage
 4.7 Summary
5 Summary and Conclusion
 5.1 Summary of Virtex-4 DSP BIST
 5.2 Summary of Virtex-5 DSP BIST
 5.3 Application to Other FPGAs and Architectures
References
List of Tables 
 
 
2.1 OPMODE Values for Virtex-4 and Virtex-5 FPGAs [9] [10]
2.2 ALUMODE Values Determining the Adder/Subtractor Operation [10]
2.3 Control Values for Logic Functions in Virtex-5 FPGAs [10]
2.4 Test Sequence for a 2-bit CLA [15]
2.5 Test Sequence for a 4-bit CLA [15]
2.6 Test Patterns for an 8-bit Multiplier Using 4×4 Test Algorithm
2.7 Test Patterns for an 8-bit Multiplier Using 5×3 Test Algorithm
3.1 Stuck-at Fault Simulation Results for 48-bit Adders
3.2 Control Register Values for TPG Control
3.3 BIST Sequences
3.4 Weighted Pseudorandom Patterns
3.5 Initially Developed BIST Configurations
3.6 Improvement in Download Time using Partial Reconfiguration
3.7 Faulty DSP Slices in Virtex-4 SX35 and LX60 Engineering Sample Parts
3.8 Configuration File Size and Test Time Increase for Same Edge Clock
3.9 Download File Sizes (in Bits) for an SX55 Device
3.10 BIST Configurations for Virtex-4 DSP BIST
4.1 Multiplier and Adder Test Sequences
4.2 Test Vectors for Testing the 4-bit Patterndetect Logic
4.3 Test Vectors for Testing the 4-bit Patternbdetect Logic
4.4 BIST Configurations for the Pattern Detector
4.5 Values for A and B Pipeline Registers [10]
4.6 Control Register Values for TPG Control
4.7 BIST Configurations for Patterndetect Logic
4.8 BIST Configurations for Virtex-5 DSPs
4.9 BIST Sequences for Virtex-5 DSP
4.10 Variables in Table 4.8
List of Figures 
 
 
1.1 General Architecture of an FPGA
1.2 BIST Architecture [1]
1.3 FPGA BIST Architecture [13]
2.1 Simplified Architecture of a Virtex-4 CLB [6]
2.2 Simplified Architecture of a Virtex-5 CLB [7]
2.3 DSP Tile in Virtex-4 Devices [9]
2.4 DSP Slice in Virtex-5 FPGAs [10]
2.5 Basic Structure of a 4-bit CLA
2.6 Adder Test Algorithm using Twisted Ring Counter [15]
2.7 4×4 Multiplier Test Algorithm [16]
2.8 5×3 Multiplier Test Algorithm [17]
2.9 ORA Design for Multiplier BIST in Virtex-II Pro FPGAs [21]
3.1 Modified Adder Test Algorithm
3.2 2-stage CLA Adder
3.3 Multiplier BIST Approach
3.4 DSP BIST Architecture
3.5 ORA Architecture
3.6 ORA Map for a DSP Tile
3.7 TPG Architecture
3.8 Architecture of the 9-bit LFSRA
3.9 BIST Template as Seen in FPGA Editor
3.10 Maximum BIST Clock Frequency for an SX35 Device when DSPs in Configurations #3 and #5 are Clocked on Falling Edge of the Clock
3.11 Maximum BIST Clock Frequency for an SX35 Device when DSPs in Configuration #3 are Clocked on Falling Edge of the Clock
3.12 Maximum BIST Clock Frequency
3.13 Maximum Clock Frequency for Sub-Arrays
3.14 Routing Paths for the Sub-Arrays with TPG at the Middle of the Array
3.15 TPG Position for the Bottom Sub-Array
3.16 Timing Analysis Based on Clock Edge for Configuration #3
3.17 Timing Analysis for DSP BIST Configurations #2 through #5
4.1 Architecture for a 4-bit Patterndetect Logic
4.2 Architecture for a 4-bit Patternbdetect Logic
4.3 TPG for the Pattern Detector
4.4 Multiplexer Architecture for Selecting the Pattern and the Mask [10]
4.5 Detailed Multiplexer Architecture for Selecting Mask
4.6 Auto Reset Logic [10]
4.7 Overflow and Underflow Logic [10]
4.8 Multiplexer Architecture that Selects Between Direct and Cascade Paths of A and B Ports [10]
4.9 ORA Architecture [24]
4.10 I/O of a Virtex-5 DSP Slice
4.11 DSP ORA Orientation in Virtex-5 FPGAs
4.12 TPG Architecture
4.13 Clock Frequency Based on the Position of the TPGs and the ORAs
4.14 Clock Frequency for the Sub-Arrays Based on the Position of the TPGs
4.15 Clock Frequency for Quarter-Arrays Based on the Position of the TPGs
4.16 Quarter Arrays for LXT330 Device
4.17 Clock Frequencies for all Arrays of the SXT95 Device Based on Positions of the TPGs and ORAs
4.18 Fault Inject Results for Virtex-5 DSP BIST
Chapter 1 
 
Introduction 
With the advancement of semiconductor manufacturing technology and the reduction of 
feature size from 4 microns to 45 nanometers, logic design is becoming denser with the 
integration of billions of transistors on a single integrated circuit. An example of such a dense 
logic circuit is the field programmable gate array (FPGA) [1].
As the complexity of integrated circuits increases, the challenges in testing also increase 
[2]. Testing such complex integrated circuits by the user is a challenging problem since the 
manufacturers of FPGAs provide limited information about the internal circuitry. Hence the
challenge lies in figuring out the architecture of the logic resources and then applying accurate 
tests to ensure complete testing of the FPGA.  
1.1 Overview of FPGAs 
FPGAs have been popular since the mid-1980s for implementing complex digital
logic designs. The ability of the FPGA to be reprogrammed easily and quickly without changing
the fabrication or the wiring makes the use of the FPGA very flexible [3]. Over the years the 
FPGA architecture has increased in size and complexity. The number of transistors in the largest 
FPGA now is over a billion [1].  
Figure 1.1 shows the general architecture of an FPGA. The FPGA is a two dimensional 
array of logic blocks. The logic blocks can be programmed to implement any arbitrary digital 
logic circuit [4]. The basic element of the FPGA is the configurable logic block (CLB) [5]. A
CLB consists of look-up tables (LUTs), multiplexers and flip-flops, and can be configured to
perform any desired combinational or sequential logic function [6]. Combinational logic is
implemented using LUTs and multiplexers. Sequential logic is implemented using flip-flops [8].
The number of inputs to the LUT is fixed for a given FPGA but varies between FPGA families,
ranging from three to six [1]. Some of the LUTs can be programmed to function as small random
access memory (RAM) units. The input/output blocks (IOBs), located in the outermost and center
columns of the device [6] [7], provide access to the logic blocks integrated inside the FPGA. The
number of CLBs and IOBs depends on the family and the size of the device. The number of
CLBs in Xilinx Virtex-4 FPGAs varies from 1,536 to 22,272 [6] and the number of CLBs in
Xilinx Virtex-5 FPGAs varies from 2,400 to 25,920 [7]. The memory modules are 18 Kb dual-port
RAMs in Xilinx Virtex-4 FPGAs [6] and 36 Kb dual-port RAMs in Xilinx Virtex-5 FPGAs [7].
These memory modules, called block RAMs (BRAMs), can be combined to provide larger
memory blocks.
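
As an illustration of how LUTs and multiplexers realize combinational logic, the following Python sketch models a k-input LUT as a stored truth table and combines two 4-input LUTs with a 2:1 multiplexer to form a 5-input function. The names make_lut, lut_read and parity5, and the choice of a parity function, are assumptions made for this example only; the sketch does not describe the actual Virtex-4 or Virtex-5 circuitry.

```python
from itertools import product

def make_lut(func, k):
    """Build the 2**k-entry truth table that a k-input LUT would store."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def lut_read(table, *inputs):
    """Read the LUT output; the inputs form the address, first input is the MSB."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | (bit & 1)
    return table[addr]

# Hypothetical example: 5-input parity built from two 4-input LUTs
# and a 2:1 multiplexer whose select line is the fifth input.
parity4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
not_parity4 = make_lut(lambda a, b, c, d: 1 ^ a ^ b ^ c ^ d, 4)

def parity5(a, b, c, d, e):
    lo = lut_read(parity4, a, b, c, d)       # selected when e == 0
    hi = lut_read(not_parity4, a, b, c, d)   # selected when e == 1
    return hi if e else lo                   # the 2:1 multiplexer
```

Reprogramming the device corresponds, in this model, to loading different truth tables into the same LUTs.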
 The Xilinx Virtex-4 and Xilinx Virtex-5 FPGAs also have digital signal processor (DSP)
cores incorporated in their structures [6] [7]. The DSP cores are used to implement DSP
applications in a faster and more efficient manner compared to the DSP implementation in the
earlier family of Xilinx Virtex-II FPGAs [9]. The DSP core architecture mainly consists of a 2-port
multiplier, a 3-port adder/subtractor unit and a 48-bit accumulator register [9] [10]. The
multiplier in Xilinx Virtex-4 FPGAs is an 18×18-bit two's complement multiplier [9], while Xilinx
Virtex-5 FPGAs incorporate a 25×18-bit two's complement multiplier [10]. A common function
of the DSP core is the multiply and accumulate (MAC) operation. Besides the multiplier and the
adder, the DSP cores in Xilinx Virtex-5 FPGAs also have a 48-bit comparator unit and an
arithmetic and logic unit (ALU) mode of operation that is used to implement 48-bit Boolean
logic functions [10]. Multiplexers select different input/output paths within the DSP core. The
DSP cores can also be cascaded to facilitate the implementation of larger input/output functions
[9] [10]. The number of DSP slices in Virtex-4 FPGAs ranges between 32 and 512 and the
number of DSP slices in Virtex-5 FPGAs ranges between 32 and 1,056.
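
The multiply and accumulate operation mentioned above can be sketched behaviorally in Python. This is a minimal model assuming two's complement operands and a 48-bit accumulator that wraps on overflow; the function names to_signed and mac are hypothetical, and the pipeline registers and multiplexers of the real DSP core are not modeled.

```python
MASK48 = (1 << 48) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(samples, coeffs, a_bits=18, b_bits=18):
    """Accumulate sum(a * b) in a 48-bit register, wrapping on overflow."""
    acc = 0
    for a, b in zip(samples, coeffs):
        prod = to_signed(a, a_bits) * to_signed(b, b_bits)
        acc = (acc + prod) & MASK48          # 48-bit accumulator register
    return to_signed(acc, 48)
```

Calling mac with a_bits=25 gives the operand widths of the Virtex-5 multiplier, while the default of 18 corresponds to Virtex-4.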
 
The logic blocks are interconnected by a series of horizontal and vertical routing lines 
[11]. The routing lines differ in lengths based on the number of logic blocks they span [1]. The 
routing channel between the logic blocks is determined by a matrix of programmable switches 
called programmable interconnect points (PIPs) [1] [11]. The logic blocks and the 
interconnection between them can be easily reprogrammed by changing the configuration data 
that is downloaded to the FPGA [4].  
  
Figure 1.1 General Architecture of an FPGA (labeled elements: IOBs, CLBs, BRAMs, interconnect network, embedded cores like DSPs)
1.2 Overview of Built-In Self-Test 
Because of the increased use of very large scale integrated (VLSI) circuits in digital
systems, the reliability of these circuits is crucial, hence the need for low-cost test methods.
However, the increased complexity of digital systems makes testing expensive [12]. There are
different ways of testing the FPGA. In external testing the circuit that generates test patterns for 
the circuit under test and the circuit that observes the response of the circuit under test are 
external to the circuit under test. In built-in self-test (BIST) the test pattern generation circuit and 
the circuit that analyzes the output response of the circuit under test are internal to the circuit 
under test. In offline testing [1], the FPGA is tested while the system is not in its usual mode of 
operation. In application-dependent testing [1] the FPGA is tested for the specific system 
function that is being implemented. In this case, the design for testability (DFT) circuitry is 
implemented in the digital system that is being implemented in the FPGA. This increases the 
area occupied by the digital system circuit that is being implemented.  
BIST is a DFT technique where test circuitry is implemented in the FPGA itself. Figure 
1.2 shows the general BIST architecture [1]. The test pattern generator (TPG) generates test 
patterns for completely testing the device under test. The output response analyzer (ORA) 
observes the output response of the device under test for the input test patterns and reports any 
failures due to faults in the device under test. The test controller coordinates the operation and 
execution of the TPG, ORA, and device under test. 
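
The interaction of the three BIST components can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the TPG is taken to be a 4-bit linear feedback shift register, the device under test is a hypothetical 2-bit adder, and the ORA compares against a fault-free reference model. None of these choices are taken from [1].

```python
def lfsr4(seed=0b0001):
    """TPG: 4-bit maximal-length LFSR that yields its 15 nonzero states."""
    state = seed
    for _ in range(15):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF

def run_bist(dut, reference):
    """Test controller: apply every TPG pattern and let the ORA latch failures."""
    fail = False                         # ORA flag, stays set once a mismatch occurs
    for pattern in lfsr4():
        a, b = pattern >> 2, pattern & 0b11
        if dut(a, b) != reference(a, b):
            fail = True
    return "fail" if fail else "pass"

good_adder = lambda a, b: (a + b) & 0b111
# Hypothetical fault: output bit 2 of the adder is stuck at 0.
stuck_adder = lambda a, b: (a + b) & 0b011
```

Here run_bist(good_adder, good_adder) reports a pass, whereas run_bist(stuck_adder, good_adder) reports a failure because at least one pattern produces a sum that needs output bit 2.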
 
1.3 Overview of FPGA BIST 
The regularity in the structure of the FPGA makes pseudo-exhaustive testing possible 
without the need for expensive fault simulation [13]. In BIST, the FPGA is tested using a series 
of test configurations. The test configurations are repeatedly programmed into the FPGA to 
ensure that all the operational modes of the FPGA are thoroughly tested and the device functions 
fault-free irrespective of the system function that will be implemented [1] [13].  
Figure 1.2 BIST Architecture [1] (blocks shown: test controller, test pattern generator, device under test, output response analyzer)

Some of the logic blocks in the FPGA that are not under test are configured as TPGs and
ORAs [13]. Faults can go undetected if they lie in a logic block that is configured as part of
the TPG or in the interconnection between the TPG and the block under test (BUT), because a
faulty TPG or faulty interconnect may fail to generate the test patterns needed to completely
test the BUT. To avoid missing any fault due to a faulty TPG, two or more TPGs
are used [1]. Two ORAs observe the output response of every BUT, which is also compared to
the output responses of two other BUTs. As shown in Figure 1.3, each BUT is observed by the ORA
beside it and the ORA below. The bottom BUT in the array is observed by the ORA beside it and
the ORA at the top of the array, which also observes the output response of the top BUT of the
array. This comparison of output responses is called circular comparison and is done to help
locate the faulty BUT [1]. At the end of the BIST sequence, the ORA contents (the BIST results)
can be retrieved to detect and determine individual BUT failures using diagnostic procedures [1]
[13].
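
The circular comparison described above can be sketched in Python. The sketch assumes that ORA j compares the BUT beside it (BUT j) with the BUT above it (BUT j-1), with ORA 0 wrapping around to the bottom BUT. The diagnostic rule in suspect_buts is a simplification for illustration and is not the diagnostic procedure of [1] or [13].

```python
def circular_compare(responses):
    """responses[i] is the output sequence of BUT i; returns one fail flag per ORA.

    ORA j compares BUT j (beside it) with BUT j-1 (above it); ORA 0 wraps
    around to the bottom BUT, which closes the circle.
    """
    n = len(responses)
    return [responses[j] != responses[(j - 1) % n] for j in range(n)]

def suspect_buts(ora_fail):
    """A BUT is suspect when both ORAs that observe it report a mismatch."""
    n = len(ora_fail)
    return [i for i in range(n) if ora_fail[i] and ora_fail[(i + 1) % n]]
```

With five BUTs of which only BUT 2 produces a wrong response, ORAs 2 and 3 fail and suspect_buts returns [2].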
 
1.4 Thesis Statement 
Although some research has been done on BIST for DSP cores in general [14], there is 
little literature on BIST for DSP cores in FPGAs. However, prior research has been done in 
designing BIST algorithms for adders and multipliers. An adder BIST approach is presented in 
[15] and multiplier BIST approaches are presented in [16], [17] and [18]. The challenge in 
testing the DSP cores in the FPGA lies in testing the adder and the multiplier circuitry, as the 
remaining components in the DSP core, such as multiplexers and flip-flops, can be easily tested.   
Figure 1.3 FPGA BIST Architecture [13] (labels: TPG #1, TPG #2, five BUTs, five ORAs, BIST start, Pass/Fail)

The research work presented in this thesis discusses the development, architecture and
operation of BIST implementations for DSPs in Xilinx Virtex-4 and Virtex-5 FPGAs. This is
achieved by improving on the previous work done for adders and multipliers to
generate more effective test patterns and to keep the test time as low as possible, independent of
the specific adder and multiplier architectures. BIST configurations for testing the DSP cores in
Xilinx Virtex-4 and Xilinx Virtex-5 FPGA devices are generated based on the architecture
presented in the data sheets provided by the manufacturer. The resulting BIST configurations are
downloaded and executed in the FPGA. Effectiveness of the BIST configurations is established
via fault injection in the configuration memory of actual hardware [29].
The remaining chapters in the thesis are organized as follows: Chapter 2 presents an 
overview of the architecture of Virtex-4 and Virtex-5 FPGAs as well as their embedded DSP 
cores. It also presents prior work done in testing DSPs, multipliers and adders. Chapter 3
presents the architecture, operation and results of the BIST developed for testing the DSP cores
in Virtex-4 FPGAs. Chapter 4 presents the corresponding BIST for the DSP cores in
Virtex-5 FPGAs, including its timing analysis and fault coverage.
Chapter 5 summarizes and concludes the thesis with suggestions for future work.
Chapter 2 
Background Information 
 The ease of reprogramming an FPGA makes it attractive for the implementation of any 
complex logic system. With the incorporation of embedded memories and specialized cores for 
signal processing, FPGAs can be used for almost any application [25]. This chapter presents the 
architecture of the logic resources used to implement the TPGs and ORAs for BIST for DSP 
cores in Virtex-4 and Virtex-5 FPGAs and explains the architecture of the DSP cores. This
chapter also presents prior work done in testing multiplier and adder logic functions which are 
also used in DSPs. 
2.1 Configurable Logic Blocks in Virtex-4 and Virtex-5 FPGAs 
 The TPGs and ORAs for BIST for DSP cores in Virtex-4 and Virtex-5 FPGAs are 
implemented in CLBs. The Virtex-4 CLB comprises four slices. Each slice is connected to a 
switch matrix through which it accesses the global routing resources. Figure 2.1 shows the 
simplified architecture of a Virtex-4 CLB. The slices are arranged in pairs, with each pair in a
separate column. The two slices in the left column are called SLICEMs because they can also
function as small memories, and the two slices in the right column are called SLICELs since they
function as logic only [6].
 Each slice has two 4-input look up tables (LUTs), two memory elements, a carry chain 
and multiplexers. The memory elements can function as edge-triggered D flip-flops or as level-
sensitive latches. The D flip-flops can either be driven by the output of the LUT or can be driven 
by the inputs to the slice [6]. The multiplexers in the CLB slices are used to combine the LUTs
within a CLB or in different CLBs in order to implement logic functions with more inputs. The
carry chain in the slices enables faster addition and subtraction [6].
 
The Virtex-5 CLB comprises only two slices. Like the Virtex-4 CLB, the Virtex-5 CLB 
slices are connected to the switch matrix through which the global routing resources can be 
accessed. Figure 2.2 shows the simplified architecture of a Virtex-5 CLB. Each slice is arranged in
a separate column, is independent of the other, and has its own carry chain [7]. Each slice
has four 6-input LUTs, four memory elements, multiplexers, and a carry chain. Each slice has
three multiplexers which can be used to combine up to four LUTs to implement logic
functions of up to eight inputs. Logic functions with more inputs can be implemented by using more
slices. The carry chain with its dedicated carry logic enables fast addition and subtraction [7].
The memory elements in Virtex-5 CLBs are similar to the memory elements in Virtex-4 CLBs.
They can be configured as edge-triggered D flip-flops or as level-sensitive latches by
user-controlled configuration memory bits. They can be driven by the output of the LUTs or by the
slice inputs [7].
Figure 2.1 Simplified Architecture of a Virtex-4 CLB [6] (labels: switch matrix; SLICEM slices 0 and 2; SLICEL slices 1 and 3; Cin and Cout carry connections)
Figure 2.2 Simplified Architecture of a Virtex-5 CLB [7] (labels: switch matrix, Slice 0, Slice 1, Cin and Cout carry connections)

2.2 Architecture of DSP cores in Virtex-4 and Virtex-5 FPGAs
Virtex-4 and Virtex-5 FPGAs incorporate DSP cores in their architectures, which can be
used for implementing large math functions and DSP applications such as finite impulse response
(FIR) filters, or to perform complex arithmetic computation without using the general
FPGA logic [9]. The architecture of a DSP tile in Virtex-4 FPGAs is shown in Figure 2.3, where
two DSP slices form a DSP tile. Each DSP slice has an 18×18-bit two's complement multiplier
that generates two 36-bit partial products. The A and B input ports in the DSP slice provide 18-bit
access to each port of the multiplier. The final stage adder of the multiplier is separated from
the multiplier and is incorporated in a separate three-port 48-bit adder/subtractor. The C input
port is shared by both the DSP slices in the DSP tile and provides 48-bit access to the
adder/subtractor through the 48-bit Y and Z multiplexer busses [9]. The partial products from the
multiplication process are sign-extended to 48 bits and summed in the adder/subtractor. The
partial products are fed to the adder/subtractor via the 48-bit X and Y multiplexer busses. The
accumulator register, denoted by P in Figure 2.3, provides the only other 48-bit access to the
adder/subtractor through the X and Z multiplexer busses. Seven OPMODE signals dynamically
control the select inputs to the X, Y and Z multiplexers [9].
 
Figure 2.3 DSP Tile in Virtex-4 Devices [9] 
The adder/subtractor performs P = Z ± (X + Y + Cin) and produces a 48-bit two's complement
result [9]. Here P is the output port, Cin is the carry-in input, and X, Y and Z are the 48-bit
multiplexer busses. The subtract input to the adder/subtractor shown in Figure 2.3 chooses
between the add and subtract operations. A logic 1 on the subtract input chooses
the subtract operation and a logic 0 chooses the add operation [9].
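
The behavior of the three-port adder/subtractor can be sketched in Python as follows. The function name dsp48_addsub is hypothetical, and the model assumes only what is stated above: a 48-bit two's complement result of Z plus or minus (X + Y + Cin), selected by the subtract input.

```python
MASK48 = (1 << 48) - 1

def dsp48_addsub(x, y, z, cin=0, subtract=0):
    """P = Z + (X + Y + Cin) when subtract is 0, P = Z - (X + Y + Cin) when it is 1."""
    operand = x + y + cin
    result = z - operand if subtract else z + operand
    return result & MASK48        # 48-bit two's complement result, wraps around
```

For example, dsp48_addsub(1, 0, 0, subtract=1) returns the all-ones 48-bit word, which is -1 in two's complement.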
The data input paths, the control signal input paths and the output paths of the DSP slice 
have optional pipeline registers. Each pipeline register introduces a delay of one clock cycle in 
the path. The number of pipeline registers in the path can be controlled by user-defined 
configuration memory bits that control the select inputs to the shaded multiplexers in Figure 2.3 
[9]. Maximum clock frequency is achieved when all pipeline registers are included in the I/O 
paths of the DSP slice. The A and B input ports can have up to two pipeline registers in their 
paths. The C and P ports can have up to one pipeline register. The control signal paths that
select the inputs to the adder/subtractor can also have up to one pipeline register.
The output of the multiplier also has a pipeline register (Mreg) as shown in Figure 2.3 (next to 
note 3) [9]. The Mreg introduces a clock cycle delay before the partial products are summed in 
the adder/subtractor.  
The DSP slices in a column of DSPs can be cascaded to form larger DSPs. The B and P 
ports in a slice can be cascaded to the slice above. A user-defined configuration memory bit 
selects the B-input source to be direct or cascaded from the slice below. OPMODE values 
dynamically select cascading of the P port at the input to the Z multiplexer [9]. Table 2.1 
lists all possible OPMODE values that control the inputs to the X, Y and Z multiplexers in
Virtex-4 and Virtex-5 FPGAs. 
The DSP slice in Virtex-5 FPGAs has the same functionality as the DSP slice in Virtex-4
FPGAs, with some additional features. The simplified architecture of a single Virtex-5 DSP slice
is shown in Figure 2.4. The DSPs in Virtex-5 FPGAs incorporate a larger 25×18-bit multiplier.
The A input port of the Virtex-5 DSP slice is 30 bits wide and the least significant 25 bits of the
A port provide 25-bit access to the multiplier [10]. The C port is no longer shared between
DSP slices: each slice has its own 48-bit C port. The A and B ports can be concatenated to
provide another 48-bit access to the adder/subtractor [10].
Table 2.1 OPMODE Values for Virtex-4 and Virtex-5 FPGAs [9] [10]

OPMODE values for the X-multiplexer outputs
Z OPMODE[6:4]   Y OPMODE[3:2]   X OPMODE[1:0]   X output            Comments
xxx             xx              00              0                   Default
xxx             01              01              M                   When the multiplier is used
xxx             xx              10              P
xxx             xx              11              A:B

OPMODE values for the Y-multiplexer outputs
Z OPMODE[6:4]   Y OPMODE[3:2]   X OPMODE[1:0]   Y output            Comments
xxx             00              xx              0                   Default
xxx             01              01              M                   When the multiplier is used
xxx             10              xx              48'hffffffffffff    Used for the logic unit bitwise operations (illegal selection for Virtex-4)
xxx             11              xx              C

OPMODE values for the Z-multiplexer outputs
Z OPMODE[6:4]   Y OPMODE[3:2]   X OPMODE[1:0]   Z output            Comments
000             xx              xx              0                   Default
001             xx              xx              PCIN
010             xx              xx              P
011             xx              xx              C
100             10              00              P                   Used for MACC extend (illegal selection for Virtex-4)
101             xx              xx              17-bit shift PCIN
110             xx              xx              17-bit shift P
111             xx              xx              xx                  Illegal selection for Virtex-4 and Virtex-5
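
As a reading aid for Table 2.1, the following Python sketch decodes a 7-bit OPMODE value into the X, Y and Z multiplexer selections. The function name decode_opmode and the returned strings are hypothetical; the mapping itself is copied from the table, and the check that X and Y select M together reflects the two table rows for the multiplier.

```python
X_MUX = {0b00: "0", 0b01: "M", 0b10: "P", 0b11: "A:B"}
Y_MUX = {0b00: "0", 0b01: "M", 0b10: "all ones (Virtex-5 only)", 0b11: "C"}
Z_MUX = {0b000: "0", 0b001: "PCIN", 0b010: "P", 0b011: "C",
         0b100: "P (MACC extend, Virtex-5 only)",   # table lists this with Y=10, X=00
         0b101: "17-bit shift PCIN", 0b110: "17-bit shift P"}

def decode_opmode(opmode):
    """Return the (X, Y, Z) multiplexer selections for a 7-bit OPMODE value."""
    x = X_MUX[opmode & 0b11]                        # OPMODE[1:0]
    y = Y_MUX[(opmode >> 2) & 0b11]                 # OPMODE[3:2]
    z = Z_MUX.get((opmode >> 4) & 0b111, "illegal") # OPMODE[6:4], 111 is illegal
    # The multiplier produces two partial products, so X and Y select M together.
    if (x == "M") != (y == "M"):
        raise ValueError("X and Y must both select M when the multiplier is used")
    return x, y, z
```

For instance, decode_opmode(0b0100101) yields ("M", "M", "P"), the selection used for a multiply and accumulate.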
The adder/subtractor in Virtex-5 DSPs has been extended to function as a two-input 48-bit
logic unit, but the architecture of the basic adder/subtractor in DSPs of Virtex-5 FPGAs is the
same as the architecture of the adder/subtractor in DSPs of Virtex-4 FPGAs. ALUMODE control
signals select between the adder, subtractor and logic unit operations [10]. The subtract signal does not
exist as a unique input in DSP slices of Virtex-5 FPGAs. Instead, an ALUMODE value of
"0000" selects the add operation defined by the equation P = Z + X + Y + Carryin, where X, Y and Z
are 48-bit multiplexer busses. An ALUMODE value of "0011" selects the subtract operation
defined by the equation P = Z − (X + Y + Carryin) [10]. Table 2.2 lists the ALUMODE values
for all the adder/subtractor equations that can be implemented.
Table 2.2 ALUMODE Values Determining the Adder/Subtractor Operation [10]
ALUMODE[3:0]   DSP operation
0000           Z + X + Y + CARRYIN
0011           Z − (X + Y + CARRYIN)
0001           −Z + (X + Y + CARRYIN) − 1
0010           NOT (Z + X + Y + CARRYIN)
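
The four arithmetic operations of Table 2.2 can be expressed as a short behavioral model. This Python sketch assumes 48-bit wraparound arithmetic; the function name alu_arith is hypothetical and the model ignores pipeline registers and carry outputs.

```python
MASK48 = (1 << 48) - 1

def alu_arith(alumode, x, y, z, carryin=0):
    """Evaluate the Table 2.2 operation selected by ALUMODE[3:0] on 48-bit words."""
    s = x + y + carryin
    if alumode == 0b0000:
        result = z + s            # Z + X + Y + CARRYIN
    elif alumode == 0b0011:
        result = z - s            # Z - (X + Y + CARRYIN)
    elif alumode == 0b0001:
        result = -z + s - 1       # -Z + (X + Y + CARRYIN) - 1
    elif alumode == 0b0010:
        result = ~(z + s)         # NOT (Z + X + Y + CARRYIN)
    else:
        raise ValueError("ALUMODE value is not an arithmetic mode of Table 2.2")
    return result & MASK48
```

Since −Z − 1 equals NOT Z in two's complement, the "0001" row is equivalent to adding X + Y + CARRYIN to the bitwise complement of Z.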
The bitwise logic operations performed by the logic unit include bitwise logical AND, 
OR, NOT, NAND, NOR, XOR and XNOR operations. ALUMODE inputs along with 
OPMODE[3:2] select the type of logical function as summarized in Table 2.3 [10].  Like the 
DSPs in Virtex-4 devices, the DSPs in Virtex-5 devices also have optional pipeline registers in
their I/O paths and control signal paths. The A and B ports can have up to two pipeline registers,
and the C and P ports and the control signal paths can each have one pipeline register. The Mreg
pipeline register at the output of the multiplier can also be included by the user, depending
on the performance desired. Higher performance is achieved when all the pipeline registers are
included. Multiplexers that are controlled by configuration bits select the number of pipeline
registers in these paths [10].
A 48-bit pattern detector is incorporated for comparison of two 48-bit patterns and is used
for applications such as convergent rounding, overflow/underflow detection for saturation
arithmetic, and auto-resetting counters/accumulators [10]. The output of the DSP slice can be
compared with a 48-bit pattern specified by the user. The pattern can be
provided through the C port or can be specified in the configuration memory bits. The output
Patterndetect goes to a logic "1" if the output of the DSP slice matches the pattern, and the output
Patternbdetect goes to a logic "1" if the output of the DSP slice matches the complement of the
pattern [10]. A mask can be used to hide certain bits in the pattern detector. The bits hidden by
the mask are not considered during comparison. Like the pattern, the mask can be provided
through the C port or can be specified in the configuration memory bits [10]. The overflow and
underflow flags are set by the DSP slice when the output of the adder/subtractor goes beyond a
range of patterns determined by the number of 1s in the mask. If N is the number of 1s in the
mask, the pattern values range from positive 2^N to negative 2^N-1. When addition goes beyond 2^N,
the Patterndetect output switches from logic "1" to logic "0", which causes the overflow flag to be
set. When subtraction goes beyond 2^N-1, the Patternbdetect output switches from logic "1" to
logic "0", which causes the underflow flag to be set [10].
Table 2.3 Control Values for Logic Functions in Virtex-5 FPGAs [10] 
OPMODE[3:2] ALUMODE[3:0] Logic function 
00 0100 X XOR Z 
00 0101 X XNOR Z 
00 0110 X XNOR Z 
00 0111 X XOR Z 
00 1100 X AND Z 
00 1101 X AND (NOT Z) 
00 1110 X NAND Z 
00 1111 (NOT X) OR Z 
10 0100 X XNOR Z 
10 0101 X XOR Z 
10 0110 X XOR Z 
10 0111 X XNOR Z 
10 1100 X OR Z 
10 1101 X OR (NOT Z) 
10 1110 X NOR Z 
10 1111 (NOT X) AND Z 
The SIMD (Single Instruction Multiple Data) mode is used to split the 
addition/subtraction/logic unit into two 24-bit (two24) or four 12-bit (four12) 
adder/subtractor/logic units. The adder/subtractor unit has two independent carryout signals in 
the two24 mode and four independent carryout signals in the four12 mode. When the unit is used 
as a single 48-bit adder/subtractor, there is only one carryout signal [10].  
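The effect of the SIMD modes on the carryout signals can be sketched as follows. This is a behavioral illustration of lane-wise addition only, not a model of the DSP slice itself; the function name simd_add and its lanes parameter (1 for the single 48-bit mode, 2 for two24, 4 for four12) are hypothetical.

```python
def simd_add(x: int, y: int, lanes: int):
    """Add two 48-bit words as `lanes` independent fields.

    Returns the 48-bit result and one carryout per field (lowest field first).
    """
    if lanes not in (1, 2, 4):
        raise ValueError("lanes must be 1 (48-bit), 2 (two24) or 4 (four12)")
    width = 48 // lanes
    mask = (1 << width) - 1
    result = 0
    carryouts = []
    for i in range(lanes):
        a = (x >> (i * width)) & mask
        b = (y >> (i * width)) & mask
        total = a + b
        result |= (total & mask) << (i * width)  # carry does not cross into the next field
        carryouts.append(total >> width)
    return result, carryouts


if __name__ == "__main__":
    for lanes in (1, 2, 4):
        print(lanes, simd_add(0xFFF, 0x001, lanes))
```

With these operands, only the four12 case produces a carryout, because the carry out of the lowest 12-bit field is not propagated into the next field.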
The A port, along with the B and P ports, can be cascaded in DSPs of Virtex-5 FPGAs 
[10]. Cascade signals CARRYCASCIN and CARRYCASCOUT are used to implement 96-bit 
adders, subtractors or logic units. The cascade signals such as MULTSIGNIN and 
MULTSIGNOUT are used to extend the multiply and accumulate (MACC) function to create 
96-bit accumulators. The most significant bit of the output of the multiplier is cascaded through 
its MULTSIGNOUT port to the MULTSIGNIN port of the DSP slice above. The OPMODE 
value for the "MACC extension" feature is given in Table 2.1 [10]. 
 
Figure 2.4 DSP Slice in Virtex-5 FPGAs [10] 
2.3 Carry Look Ahead (CLA) Adders  
 Carry look ahead (CLA) adders are widely used in most applications where high speed 
addition is performed. The basic structure of a CLA adder is summarized in Figure 2.5.  Each 
adder cell receives a pair of inputs (Ai and Bi) and a carry-in (Ci ) to generate sum (Si), propagate 
(Pi), and generate (Gi) signals. The Pi and Gi signals along with the carry-in signals produce 
carry-out signals in the look ahead carry unit (LCU) for the subsequent adders. The equations 
shown in Figure 2.5 summarize the logic functions of the CLA adder. The propagate signal can 
be generated by using an OR gate as summarized in the adder equations using the "POR" 
implementation in Figure 2.5. Another way of generating the propagate signal is by using the 
XOR gate used for the sum as summarized in the adder equations using the "PXOR" 
implementation in Figure 2.5.  
Larger CLA adders can be constructed by connecting the carry-out of one 4-bit LCU 
to the carry-in of the next 4-bit CLA unit [19]. This type of CLA adder is called the ripple CLA 
adder. Another approach feeds the propagate (PG) and generate (GG) signals produced in each 
LCU to a second stage LCU to construct a 16-bit CLA adder. Larger CLA adders of 48 bits, like 
the adder used in DSP cores in Virtex-4 and Virtex-5 FPGAs, can be constructed either by 
rippling the carry outputs of the second stage LCU or by adding an additional stage of LCUs. 
Additional LCUs reduce delay at the expense of additional area overhead. 
Figure 2.5 Basic Structure of a 4-bit CLA 
[Figure: four adder cells with inputs Ai, Bi and Ci and outputs Si, Pi and Gi (i = 0 to 3), 
connected to a 4-bit look ahead carry unit with input C0 and outputs C1, C2, C3, C4, PG and GG] 
Look Ahead Carry Unit Logic Equations 
PG = P0·P1·P2·P3 
GG = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 
C1 = G0 + P0·C0 
C2 = G1 + G0·P1 + P1·P0·C0 
C3 = G2 + G1·P2 + G0·P1·P2 + P2·P1·P0·C0 
C4 = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 + P3·P2·P1·P0·C0 
Adder Logic Equations 
POR:   S = A⊕B⊕Cin   P = A+B   G = A·B 
PXOR:  S = P⊕Cin     P = A⊕B   G = A·B 
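The equations of Figure 2.5 can be checked with a short Python sketch that evaluates them bit by bit and compares the result against ordinary integer addition. The function name cla4 and the propagate argument are hypothetical; the logic follows the POR and PXOR adder equations and the look ahead carry unit equations of the figure.

```python
from itertools import product


def cla4(a: int, b: int, c0: int, propagate: str = "POR"):
    """Evaluate a 4-bit CLA adder using the Figure 2.5 equations.

    Returns (sum, C4, PG, GG).
    """
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]
    if propagate == "POR":
        P = [x | y for x, y in zip(A, B)]   # propagate from an OR gate
    elif propagate == "PXOR":
        P = [x ^ y for x, y in zip(A, B)]   # propagate from the sum XOR gate
    else:
        raise ValueError("propagate must be 'POR' or 'PXOR'")

    # Look ahead carry unit (Python evaluates & before |)
    c1 = G[0] | P[0] & c0
    c2 = G[1] | G[0] & P[1] | P[1] & P[0] & c0
    c3 = G[2] | G[1] & P[2] | G[0] & P[1] & P[2] | P[2] & P[1] & P[0] & c0
    c4 = (G[3] | G[2] & P[3] | G[1] & P[2] & P[3] | G[0] & P[1] & P[2] & P[3]
          | P[3] & P[2] & P[1] & P[0] & c0)
    pg = P[0] & P[1] & P[2] & P[3]
    gg = G[3] | G[2] & P[3] | G[1] & P[2] & P[3] | G[0] & P[1] & P[2] & P[3]

    carries = [c0, c1, c2, c3]
    s = sum((A[i] ^ B[i] ^ carries[i]) << i for i in range(4))
    return s, c4, pg, gg


if __name__ == "__main__":
    for mode in ("POR", "PXOR"):
        ok = all(cla4(a, b, c, mode)[0] + 16 * cla4(a, b, c, mode)[1] == a + b + c
                 for a, b, c in product(range(16), range(16), (0, 1)))
        print(mode, "matches integer addition:", ok)
```

Both propagate implementations produce the same sum and carry-out, which is why the choice between them affects the test vectors needed rather than the function of the adder.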
2.4 Booth Multipliers 
 An m×n array multiplier performs multiplication by generating n partial products for each 
of the m bits of the multiplicand. These partial products are summed using an array of adders to 
generate the final result. Booth multipliers reduce the number of partial products to be summed 
by "recoding", meaning grouping together some bits of one of the operands, thereby 
speeding up the multiplication process [16].  
The architecture of the multiplier is divided into three groups of cells: the "recoding cells 
(r cells)", the "partial product cells (pp cells)" that calculate the partial products and the 
"adder cells" that sum the partial products. One of the two operands of the multiplier is 
recoded [16]. If the Booth multiplier has a 2-bit recoding then the recoded operand is divided 
into groups where each group has 2 bits. If X is the recoded operand, the bits in group j are 
X_{2j} and X_{2j+1}, where j varies from 0 to N_x/2 and N_x is the number of bits in the 
operand X [16]. These two bits and the most significant bit (MSB) of the previous group, 
X_{2j-1}, are fed to the recoding cells. The recoding cells produce signals which determine the 
functions that must be performed on the second operand in order to generate the partial products 
that will be calculated by the partial product cells. The partial products are then summed by 
the adder cells [16]. 
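The grouping of operand bits described above can be illustrated with a small Python sketch. It assumes the standard radix-4 (modified Booth) digit rule, in which group j produces the digit X_{2j-1} + X_{2j} - 2·X_{2j+1}; whether the recoding cells in [16] produce exactly these signals is an assumption, and the function name booth_radix4_digits is hypothetical.

```python
def booth_radix4_digits(x: int, nbits: int):
    """Recode an nbits-wide two's complement bit pattern into radix-4 Booth digits.

    Digit j is formed from bits X[2j+1], X[2j] and X[2j-1] (X[-1] is taken as 0)
    and lies in {-2, -1, 0, 1, 2}. nbits is assumed to be even.
    """
    def bit(k: int) -> int:
        return 0 if k < 0 else (x >> k) & 1

    return [bit(2 * j - 1) + bit(2 * j) - 2 * bit(2 * j + 1)
            for j in range(nbits // 2)]


def signed_value(x: int, nbits: int) -> int:
    """Interpret an nbits-wide bit pattern as a two's complement number."""
    return x - (1 << nbits) if x >> (nbits - 1) else x


if __name__ == "__main__":
    x = 0b10110110
    digits = booth_radix4_digits(x, 8)
    print(digits, sum(d * 4 ** j for j, d in enumerate(digits)), signed_value(x, 8))
```

An 8-bit operand yields four digits, so only four partial products are needed instead of eight, which is the reduction that recoding provides.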
Wallace-tree multipliers with "Booth encoding" speed up the multiplication process 
further [17]. The Booth encoding feature halves the number of partial products, and Wallace-tree 
addition, combined with an output CLA adder that sums the final stage partial products, results 
in the fastest addition [17]. The Wallace/Booth multiplier in [17] is divided into three parts: 
the Booth encoder for generating the partial products, the Wallace-tree unit that adds the 
partial products and generates a sum and carry vector, and a final stage CLA adder that adds the 
sum and carry vectors to generate the final result. The Wallace-tree unit consists of half adder 
and full adder units [17].   
2.5 Prior Work Done in Testing CLA Adders 
 For a 4-bit CLA adder implementation that uses an OR gate to produce the propagate 
signal, a minimum set of ten vectors was proposed in [19] to detect all single stuck-at faults. For 
larger ripple CLA adders, a set of eleven vectors was proposed in [19] to detect all single stuck-
at faults. For the ripple CLA adder that uses an XOR gate for calculating the propagate signal, a 
minimum set of twelve vectors was proposed in [19] to detect all single stuck-at faults. But these 
sets of vectors apply only to ripple CLA adder implementations [19].  
 Another test algorithm that tests any n-bit CLA adder implementation is proposed in [15]. 
The CLA adder in [15] is divided into three units: the top level structure of the n-bit CLA, which 
is referred to as the "MCLA" unit, the "MPGX" unit that calculates the propagate and generate 
signals (in this test algorithm, the propagate signal is calculated using an OR gate) and the 
"MCLG" unit that calculates all the carry signals [15]. The sum is calculated using the XOR 
operation. The faults in the MCLG unit are difficult to test and hence tests are generated for a set 
of faults that cover all single stuck-at faults on the input paths of the MCLG unit [15]. The known 
tests for the MCLG unit are traced via the MPGX unit to the primary input paths of the MCLA unit to 
obtain a test sequence to detect all the single stuck-at faults in the CLA. Table 2.4 shows the test 
sequence for a 2-bit CLA. This test sequence can be extended to a 4-bit CLA as shown in Table 
2.5 [15].  
The input patterns for an n-bit CLA can be generated using a twisted ring counter 
approach as shown in Figure 2.6. This TPG can be implemented using n XOR gates, n XNOR 
gates and an (n+1) bit shift register with an inverter to form a twisted ring counter [15]. 
Reference [15] claims 100% single stuck-at gate level fault coverage. 
Table 2.4 Test Sequence for a 2-bit CLA [15] 
Test # A1B1 A0B0 C0 
1 10 10 1 
2 10 00 1 
3 00 11 1 
4 01 01 0 
5 01 11 0 
6 11 00 0 
Table 2.5 Test Sequence for a 4-bit CLA [15] 
Test # A3B3 A2B2 A1B1 A0B0 C0 
1 10 10 10 10 1 
2 10 10 10 00 1 
3 10 10 00 11 1 
4 10 00 11 11 1 
5 00 11 11 11 1 
6 01 01 01 01 0 
7 01 01 01 11 0 
8 01 01 11 00 0 
9 01 11 00 00 0 
10 11 00 00 00 0 
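The regular structure of Tables 2.4 and 2.5 can be reproduced with a short behavioral sketch. It does not model the gate-level twisted ring counter TPG of [15]; it only generates the same (A, B, C0) vectors, assuming that the pattern visible in the two tables continues for any n. The function name cla_test_sequence is hypothetical.

```python
def cla_test_sequence(n: int):
    """Return the 2*(n+1) test vectors (A, B, C0) in the order of Tables 2.4 and 2.5."""
    mask = (1 << n) - 1
    first_half = []
    for k in range(n + 1):
        if k == 0:
            a, b = mask, 0                      # A all 1s, B all 0s
        else:
            a = mask & ~(1 << (k - 1))          # single 0 walking up through A
            b = (1 << (k - 1)) - 1              # 1s filling B below the walking 0
        first_half.append((a, b, 1))
    # The second half is the bitwise complement of the first half, including C0.
    second_half = [(~a & mask, ~b & mask, 0) for a, b, _ in first_half]
    return first_half + second_half


if __name__ == "__main__":
    for number, (a, b, c0) in enumerate(cla_test_sequence(4), start=1):
        print(f"{number:2d}  A={a:04b}  B={b:04b}  C0={c0}")
```

For n = 4 the output lists the ten vectors of Table 2.5 with the A and B bits separated rather than interleaved as AiBi pairs.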
 
 
Figure 2.6 Adder Test Algorithm using Twisted Ring Counter [15] 
[Figure: N+1-bit serial shift register with reset; signals SRegi and SRegi+1 feed the logic that 
produces the adder inputs Ai and Bi, and Ci goes to the adder carry-in] 
 
2.6 Prior Work Done in Testing Booth and Booth/Wallace Multipliers 
A multiplier test algorithm for Booth multipliers is proposed in [16] and claims high fault 
coverage of over 99%. The number of test vectors is 256 and is independent of the size of the 
multiplier. The BIST TPG can easily be implemented using an 8-bit counter [16]. This test 
algorithm claims to pseudo-exhaustively test all the multiplier cells described in Section 2.4. 
Figure 2.7 shows the BIST architecture used by the test algorithm [16]. The 8-bit counter 
generates the test patterns and applies all 256 of them to the inputs of the multiplier [16]. 
This algorithm will be referred to in this thesis as the 4×4 test algorithm. Here the four MSBs 
of the counter are applied to one input of the multiplier and the four LSBs of the counter are 
applied to the other input of the multiplier. Starting from the LSB of the multiplier operands, 
the two sets of counter bits are replicated and repeated for each group of four bits of the 
multiplier operands [16]. For an 8-bit multiplier the 4×4 test algorithm will apply the test 
patterns illustrated in Table 2.6, where A[7:0] and B[7:0] are the inputs of the two ports of the 
multiplier and C[7:0] indicates the outputs of the 8-bit counter. 
Table 2.6 Test Patterns for an 8-bit Multiplier Using 4×4 Test Algorithm 
Multiplier Inputs  A7 A6 A5 A4 A3 A2 A1 A0  B7 B6 B5 B4 B3 B2 B1 B0 
Counter Outputs    C7 C6 C5 C4 C7 C6 C5 C4  C3 C2 C1 C0 C3 C2 C1 C0 
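The bit replication of Table 2.6 can be expressed in a few lines of Python. This is a sketch of the pattern mapping only; the function names replicate and pattern_4x4 are hypothetical, and the operand width is a parameter so that the same mapping can be tried on wider multipliers.

```python
def replicate(field: int, field_width: int, operand_width: int) -> int:
    """Tile a field_width-bit field across operand_width bits, starting at the LSB."""
    out = 0
    for i in range(operand_width):
        out |= ((field >> (i % field_width)) & 1) << i
    return out


def pattern_4x4(count: int, operand_width: int = 8):
    """Map an 8-bit counter value to the (A, B) operands of the 4x4 test algorithm."""
    a = replicate((count >> 4) & 0xF, 4, operand_width)  # four MSBs C7..C4 to port A
    b = replicate(count & 0xF, 4, operand_width)         # four LSBs C3..C0 to port B
    return a, b


if __name__ == "__main__":
    a, b = pattern_4x4(0xA5)
    print(f"A={a:08b} B={b:08b}")
```

For an 18-bit operand the last group is incomplete, so only the two lowest bits of the field are used for bits 16 and 17.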
 
 Another multiplier test algorithm was proposed in [17]. Figure 2.8 shows the BIST 
architecture for the test algorithm [17]. This test algorithm targets Wallace-tree multipliers with 
Booth encoding. The CLA adder used to sum the final partial products in this algorithm is a 
multi-stage LCU CLA adder [17].  
Like the 4×4 test algorithm, the test algorithm in [17] does not modify the structure 
of the multiplier, and an 8-bit counter is used to generate the test patterns for any size 
multiplier. X and Y are the input operands, where the X operand has Booth encoding. In this 
algorithm, for the multiplier input port which has the Booth encoding, the five MSBs of the 
counter are replicated and repeatedly applied to each group of five bits starting from the LSB 
of the multiplier operand with Booth encoding [17]. For the other input of the multiplier, the 
remaining three LSBs of the counter are replicated and applied to each group of three bits 
starting from the LSB of the other multiplier operand. This algorithm is referred to in this 
thesis as the 5×3 test algorithm. In the proposed BIST approach, the output response is 
compacted by an accumulator and compared with the fault-free signature to detect faults [17].  
For an 8-bit multiplier the 5×3 test algorithm will apply the test patterns illustrated in 
Table 2.7, where A[7:0] and B[7:0] are the inputs of the two ports of the multiplier and port A 
has the Booth encoding. C[7:0] indicates the outputs of the 8-bit counter. 
 
Table 2.7 Test Patterns for an 8-bit Multiplier Using 5×3 Test Algorithm 
Multiplier Inputs  A7 A6 A5 A4 A3 A2 A1 A0  B7 B6 B5 B4 B3 B2 B1 B0 
Counter Outputs    C5 C4 C3 C7 C6 C5 C4 C3  C1 C0 C2 C1 C0 C2 C1 C0 
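The unequal split of Table 2.7 can be sketched in the same way. The function names replicate and split_pattern are hypothetical; msb_width selects how many of the counter MSBs go to port A, so msb_width = 5 gives the 5×3 mapping of Table 2.7.

```python
def replicate(field: int, field_width: int, operand_width: int) -> int:
    """Tile a field_width-bit field across operand_width bits, starting at the LSB."""
    out = 0
    for i in range(operand_width):
        out |= ((field >> (i % field_width)) & 1) << i
    return out


def split_pattern(count: int, msb_width: int, a_width: int = 8, b_width: int = 8):
    """Split an 8-bit counter value into an MSB field for port A and an LSB field for port B."""
    lsb_width = 8 - msb_width
    msb_field = (count >> lsb_width) & ((1 << msb_width) - 1)
    lsb_field = count & ((1 << lsb_width) - 1)
    return (replicate(msb_field, msb_width, a_width),
            replicate(lsb_field, lsb_width, b_width))


if __name__ == "__main__":
    # Show where each single counter bit lands on the A and B ports.
    for bit in range(8):
        a, b = split_pattern(1 << bit, msb_width=5)
        print(f"C{bit}: A={a:08b} B={b:08b}")
```

Printing the mapping one counter bit at a time reproduces the columns of Table 2.7, for example C3 appearing on A5 and A0.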
 
Figure 2.7 4×4 Multiplier Test Algorithm [16] 
[Figure: an 8-bit counter drives multiplexers that replicate its two 4-bit fields onto the Nx-bit 
X input operand and the Ny-bit Y input operand of the multiplier; an accumulator compacts the 
(Nx + Ny)-bit multiplier output into the compacted data] 
 
The authors in [20] mention that the DSPs in Virtex-4 devices can be tested by applying 
pseudo-random patterns, generated by linear feedback shift registers (LFSRs) to the input ports 
of the DSP slice, but the authors do not provide specific test algorithms or test sequences for 
testing the logic in the DSP cores. Reference [20] also does not mention the fault coverage 
obtained. Furthermore, applying an exhaustive set of pseudo-random patterns would require an 
84-bit LFSR and 2^84 - 1 clock cycles of test application time. 
Figure 2.8 5×3 Multiplier Test Algorithm [17] 
[Figure: an 8-bit counter drives multiplexers that replicate its 5-bit field onto the Nx-bit X 
input operand with Booth encoding and its 3-bit field onto the Ny-bit Y input operand of the 
multiplier; an accumulator compacts the (Nx + Ny)-bit multiplier output into the compacted data] 
 
A BIST approach for the 18×18-bit multipliers embedded in Virtex-II Pro FPGAs was 
proposed and implemented in [21]. This BIST approach was the first BIST approach 
implemented for multipliers in FPGAs. The 4×4 test algorithm proposed in [16] was used to test 
the multipliers embedded in Virtex-II Pro FPGAs. A 10-bit counter was used for the test pattern 
generator, where the eight LSB bits of the counter were used to apply the 4×4 test algorithm to 
the two 18-bit inputs of the multiplier and the two MSBs of the counter were used to test the 
clock enable and reset control inputs to the multiplier in registered modes of operation [21]. A 
minimum set of three BIST configurations was developed to test the multipliers in all modes of 
operation. The three BIST configurations include BIST for one "combinational mode" and two 
"registered modes" of the multiplier. The BIST configuration for the "combinational mode" is 
used only to test the logic in the multiplier without any registers [21]. The two BIST 
configurations for the registered modes are used to test the programmable active levels of the 
clock enable and reset control inputs of the registers and the active edge of the clock to the 
registers. The BIST configurations were developed as VHDL models and require a complete 
download of each of the three BIST configurations [21].  
The comparison based ORA shown in Figure 2.9 compares the outputs of the multiplier 
blocks under test (BUTs) and produces a pass/fail result for each BIST configuration [21]. This 
ORA design is easy to implement. It occupies two LUTs of a CLB slice because the ORAs are 
connected to form a scan chain through which the contents of each ORA are shifted out to obtain 
its pass/fail result [21]. 
 
Figure 2.9 ORA Design for Multiplier BIST in Virtex-II Pro FPGAs [21] 
[Figure: an ORA built from a LUT and a D flip-flop compares output k of BUTi with output k of 
BUTj to produce the Pass/Fail result, with Shift data and Shift mode inputs for the scan chain] 
 
2.7 Restatement of Thesis: 
 Although no prior work has been done on testing DSP cores in FPGAs, the adder test 
algorithm proposed in [15] and the multiplier test algorithm proposed in [17] can be modified for 
better fault coverage and can be applied to completely test the adder and the multiplier in DSP 
cores of Virtex-4 and Virtex-5 FPGAs.  
 The TPG for the adder test algorithm proposed in [15] can be easily implemented in the 
CLBs of Virtex-4 and Virtex-5 FPGAs. Although the number of test vectors increases with the 
size of the adder, the adder in DSP cores of Virtex-4 and Virtex-5 FPGAs can be completely 
tested with a reasonably small set of test vectors.  
 The TPG for the multiplier test algorithm proposed in [17] can also be easily 
implemented and applied to multipliers in DSP cores of Virtex-4 and Virtex-5 FPGAs. The TPG 
can be implemented in the CLBs of the FPGAs. The multiplier test algorithm uses a small, fixed 
set of 256 test vectors that can be applied to multipliers of any 
size. Besides the adder and the multiplier, the rest of the DSP logic must also be tested. This 
thesis seeks to develop a minimum set of BIST configurations to completely test the DSPs in 
Virtex-4 and Virtex-5 devices. 
Chapter 3 
BIST for DSP Cores in Virtex-4 FPGAs 
This chapter begins by proposing improvements to the previously proposed multiplier 
and adder test algorithms for higher fault coverage and describes the development of BIST for 
DSP cores in Virtex-4 FPGAs through the application of the improved multiplier and adder test 
algorithms to test the logic in these DSP cores. The BIST architecture along with the BIST 
configurations and test sequences for the DSP cores are discussed. The chapter also discusses the 
retrieval of BIST results and explains how the maximum clock frequency of the BIST 
configurations can be improved. The chapter concludes by summarizing the experimental BIST 
results and the fault coverage obtained on actual Virtex-4 FPGAs. 
3.1 BIST Approach for DSPs in Virtex-4 FPGAs 
 The DSP cores in Virtex-4 FPGAs mainly consist of the adder and the multiplier units. 
Besides the adder and the multiplier the DSP cores include multiplexers and flip-flops used as 
pipeline registers. Since the multiplexers and the flip-flops can be easily tested, the challenge lies 
in testing the adder and the multiplier units in the DSP cores. The data sheets for the DSP cores 
incorporated in Virtex-4 FPGAs do not describe the architecture of the adder and the multiplier 
explicitly. However, one of the Spartan-3 application notes mentions that the architecture of the 
multiplier is based on a modified Booth architecture [22]. From the data sheets [9] it is 
understood that sequential logic is not used since there is no specification of clock cycle latency. 
Of the various combinational logic multipliers such as array, Booth, modified 
Booth, Wallace-tree, and modified Booth/Wallace-tree multipliers, the modified Booth/Wallace-
tree multiplier seems to be the most likely option because of its higher performance.  
From the data sheets it is clear that the adder that is used to sum the final partial products 
of the multiplication is separated from the multiplier. Of the various combinational logic adders 
such as ripple carry, carry select, carry skip, carry save and carry look ahead (CLA) adders, the 
CLA adder seems to be the most likely option because of its higher performance [23] and also 
because CLA adders are typically used to sum the final partial products in modified 
Booth/Wallace-tree multipliers [17].  
3.1.1 Adder Test 
 The adder test algorithm described in [15] can be used to test the adder in the DSP cores. 
However, fault simulation for the adder test algorithm in [15] revealed that two test patterns that 
were required to achieve 100% fault coverage were missing. Modifying the BIST architecture in 
[15] by replacing the inverter with a D flip-flop and using the Qbar output to drive the input of 
the shift register as illustrated in Figure 3.1 produces the two missing patterns. The modified 
architecture includes an N+1 bit shift register, N XOR gates, N XNOR gates and a D flip-flop, 
where N is the number of bits in the adder. This BIST architecture generates a total of 2×(N+2) 
test vectors for completely testing the CLA adders in the DSP cores. Since the adder in the DSP 
cores is 48 bits wide, a 50-bit shift register (49-bit shift register plus the D flip-flop) is used to 
generate 100 test vectors that completely test the adder, as verified through fault simulation. The 
test patterns generated by the modified architecture are also illustrated in Figure 3.1 for a 4-bit 
adder, and the generated missing test patterns are denoted as "new". Table 3.1 gives a comparison 
of fault coverage achieved for the adder test algorithms described in Section 2.5 of Chapter 2. 
From Table 3.1 it is observed that the adder test vectors described in [19] effectively test only 
ripple CLA adders, whereas the adder test algorithm described in [15] effectively tests all 
implementations of the CLA adder but fails to give 100% fault coverage because of the missing 
vectors. "Modified BIST" indicates the modification made to the adder test algorithm that 
generates the missing test vectors described in this section. 
Table 3.1 Stuck-at Fault Simulation Results for 48-bit Adders 
                                                     Test Algorithm 
Adder Implementation  Gate Delays  Number of Faults  Vector Set [19]  BIST [15]  Modified BIST 
Ripple Carry Adder    96           1296              100%             99.9%      100% 
Ripple CLA            28           1392              100%             99.9%      100% 
Ripple LCU            12           1542              95.7%            99.9%      100% 
Multi-stage LCU       10           1506              95.9%            99.9%      100% 
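The modified test sequence described in this section can be sketched behaviorally as follows. The sketch does not model the shift register and D flip-flop; it only generates vectors in the order listed with Figure 3.1, assuming that the same pattern continues for larger N. The function name modified_adder_vectors is hypothetical.

```python
def modified_adder_vectors(n: int):
    """Return the 2*(n+2) vectors (A, B, Ci) of the modified adder test sequence."""
    mask = (1 << n) - 1
    first_half = [(mask, 0, 0)]                 # first "new" vector
    for k in range(n + 1):
        if k == 0:
            a, b = mask, 0
        else:
            a = mask & ~(1 << (k - 1))          # single 0 walking up through A
            b = (1 << (k - 1)) - 1              # 1s filling B below the walking 0
        first_half.append((a, b, 1))
    # The second half complements every bit; its first entry is the second "new" vector.
    second_half = [(~a & mask, ~b & mask, 1 - c) for a, b, c in first_half]
    return first_half + second_half


if __name__ == "__main__":
    for a, b, c in modified_adder_vectors(4):
        print(f"{a:04b}{b:04b}{c}")
    print(len(modified_adder_vectors(48)), "vectors for a 48-bit adder")
```

For N = 48 the sequence contains 2×(48+2) = 100 vectors, matching the count given above.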
The adder/subtractor equation P = Z ± (X + Y + Cin) [9] indicates that the adder in the DSP 
slice is a two-stage adder as shown in Figure 3.2. The C port and the accumulator register, P, 
provide the only 48-bit accesses to the adder. Hence two clock cycles are required to apply a 
single test vector to the adder. During the first clock cycle a part of the test vector (48 bits 
of the 97-bit test vector) is loaded into the accumulator register from the C port through the Y 
or the Z multiplexers while applying 0s to the other two multiplexer ports, the CIN and the 
SUBTRACT signals. During the second clock cycle, the 48-bit test vector in the accumulator 
register is applied to one of the ports of the adder through the X or the Z multiplexers. Also 
during the second clock cycle the remaining 48 bits of the test vector are applied to the other 
port of the adder through the Z or the Y multiplexers (based on the stage of the adder being 
tested) and the test vector bit for the CIN or SUBTRACT signal (based on the stage of the adder 
that is being tested) is applied. For cases during which the overall test vector applies a logic 
'1' to the SUBTRACT signal, the first part of the vector that is loaded into the accumulator 
register is inverted so that when the SUBTRACT signal inverts this part of the vector while it 
is being applied to the adder port during the second clock cycle, the correct set of vectors is 
still applied to the second stage adder. The two clock cycle test vector application also 
provides complete testing of the accumulator register. Each stage of the adder is tested 
independently and completely in a total of 200 clock cycles.  
 
 
Figure 3.1 Modified Adder Test Algorithm 
[Figure: N+1-bit serial shift register with reset; signals Qi and Qi+1 feed the logic that 
produces the adder inputs Ai and Bi, and Ci goes to the adder carry-in] 
Test patterns for a 4-bit adder (columns A3 A2 A1 A0 B3 B2 B1 B0 Ci): 
new -> 111100000 
       111100001 
       111000001 
       110100011 
       101100111 
       011101111 
new -> 000011111 
       000011110 
       000111110 
       001011100 
       010011000 
       100010000 
 
Figure 3.2 2-stage CLA adder 
[Figure: two 48-bit CLA adders in series, with 48-bit inputs from the X, Y and Z multiplexers 
and the CIN and Subtract signals] 
 
3.1.2 Multiplier Test 
 Fault simulation was performed on gate level models of various 8×8 bit multipliers. 
Based on the fault simulation results it is determined that applying both the 5×3 and the 3×5 test 
algorithms is the most effective way of testing the multiplier cores in Virtex-4 DSPs. Hence the 
multiplier is tested in two sessions of 256 clock cycles each. During the first session, the five 
MSBs of the 8-bit counter are applied to port A of the multiplier and the three LSBs of the 8-bit 
counter are applied to port B of the multiplier. During the second test session, the five MSBs are 
applied to port B of the multiplier and the three LSBs are applied to port A of the multiplier. Figure 
3.3 summarizes the multiplier BIST approach.  
 
3.2 BIST Architecture 
Figure 3.4 illustrates the DSP BIST architecture for any given Virtex-4 FPGA. Two 
TPGs drive alternate rows of DSP tiles that have the same configuration. The two TPGs generate 
identical test patterns to test all the DSP tiles. The control register that controls the TPG is four 
bits wide. Of the four bits the two LSBs (called MODE1 and MODE0) control the test algorithm 
generated by the TPG. The second MSB (called INVCS) controls the active levels of control 
signals such as the OPMODE bits and the active level of the carryin input to the adder unit. The 
MSB provides a global reset to the TPG. The control register is implemented in a CLB and the 
values for the control register are shifted in through the Boundary Scan interface, LSB first. 
The control register values for resetting the TPG and for the various test modes are 
summarized in Table 3.2. The "X" in Table 3.2 indicates a don't care bit. 
Table 3.2 Control Register Values for TPG control 
Control Register Values <3:0> 
RESET INVCS MODE1 MODE0 Operation 
         1           X           X             X Resets TPG 
         0           1            X             X Inverts active level of control signals  
         0           X            0             0 Sets the multiplier test algorithm 
         0           X            0             1 Sets the adder test algorithm 
         0           X            1             0 Sets the cascade test algorithm 
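The control register fields of Table 3.2 can be decoded as in the sketch below. The function names decode_control and shift_order are hypothetical, and the MODE1/MODE0 value 11, which is not listed in Table 3.2, is reported here as "unspecified".

```python
MODES = {0b00: "multiplier", 0b01: "adder", 0b10: "cascade"}


def decode_control(value: int) -> dict:
    """Decode the 4-bit control register <3:0> = RESET, INVCS, MODE1, MODE0."""
    if not 0 <= value <= 0xF:
        raise ValueError("control register is four bits wide")
    return {
        "reset": bool(value & 0b1000),   # MSB: global reset to the TPG
        "invcs": bool(value & 0b0100),   # inverts the active level of control signals
        "mode": MODES.get(value & 0b0011, "unspecified"),
    }


def shift_order(value: int) -> list:
    """Bits in the order they are shifted in: LSB (MODE0) first."""
    return [(value >> i) & 1 for i in range(4)]


if __name__ == "__main__":
    print(decode_control(0b0101), shift_order(0b0101))
```

For example, the value 0101 selects the adder test algorithm with inverted control signal levels and is shifted in as 1, 0, 1, 0.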
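Figure 3.3 Multiplier BIST approach 
[Figure: the MSB and LSB fields of an 8-bit counter are replicated onto the n-bit A port (5/3) 
and the n-bit B port (3/5) of the multiplier, which produces a 2n-bit output] 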
  Values to the control register can also be given through a system pins interface when the 
Boundary Scan interface is not used. For the system pins interface, the clock and the control 
inputs to the TPG, such as the TPG reset, the INVCS control and the MODE1 and MODE0 
control signals, are input pins to the device.  
Multiple TPGs are used so that faults in any of the TPGs cannot escape detection. Each 
TPG drives both the slices in a tile for the individual control of both slices in the tile during 
cascade modes of operation. The DSP slices are configured in cascade mode of operation in pairs 
instead of cascading all the DSP slices in a column, so that the maximum BIST clock frequency 
is not slowed down. This approach of cascading DSP slices in pairs also ensures that all the DSP 
slices do not fail the test due to the unconnected cascade inputs on the bottom-most DSP slice in 
a column of DSPs. Circular comparison can therefore still be used effectively to analyze the 
outputs of the DSPs, contrary to the authors' claim in [20]. The bottom slice in a DSP tile is denoted 
by s0 and the top slice in a DSP tile is denoted by s1. Each DSP slice is monitored by two sets of 
ORAs and compared with the outputs of two like DSP slices. Each set of ORAs monitors two 
similar DSP slices, implying a set of ORAs that monitor slice 0 in a DSP tile also monitor slice 0 
in the DSP tile below. The two bottom-most sets of ORAs (one each for slice 0 and slice 1) in a 
column of ORAs monitor the top-most and the bottom-most DSP tiles in the column of DSP 
tiles, forming two circular comparison chains where one chain monitors slice 0 in all the DSP 
tiles and the other chain monitors slice 1 in all the DSP tiles.  
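The circular comparison chain for one slice position can be sketched as a list of tile pairs. The sketch numbers the tiles from 0 at the bottom of the column, which is an assumption made for illustration, and the function names ora_sets and failing_sets are hypothetical.

```python
def ora_sets(num_tiles: int):
    """Tile pairs compared by each set of ORAs in one circular comparison chain.

    ORA set t compares a slice in tile t with the like slice in the tile below;
    the bottom-most set (t = 0) wraps around to the top-most tile.
    """
    return [(t, (t - 1) % num_tiles) for t in range(num_tiles)]


def failing_sets(num_tiles: int, faulty_tile: int):
    """Indices of the ORA sets that observe the slice in the faulty tile."""
    return [i for i, pair in enumerate(ora_sets(num_tiles)) if faulty_tile in pair]


if __name__ == "__main__":
    print(ora_sets(4))
    print(failing_sets(4, faulty_tile=2))
```

Every tile appears in exactly two pairs, so a detected fault in a single slice shows up in the two adjacent sets of ORAs, which helps identify the faulty slice from the ORA results.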
The set of ORAs that monitor slice 0 of the bottom-most DSP tile have clock enables so 
that these ORAs can be disabled at specific times during the cascade mode BIST sequence to 
avoid ORA failure indications due to unconnected cascade inputs on slice 0 of the bottom-most 
DSP tile. The architecture of the ORA is illustrated in Figure 3.5. Each ORA comprises a look-up 
table and a flip-flop. The ORAs are synthesized into CLBs where eight ORAs fit into a single 
CLB. The dedicated carry logic in the CLBs, as illustrated in Figure 3.5, can be used to create an 
iterative OR chain of ORAs [24]. The Test Data In (TDI) line is connected to the carry-in of the 
first CLB in the column of ORAs, and the carry-out of the last CLB in the last column of ORAs, 
which is also the response of the last ORA in the chip array, is connected to the Test Data 
Out (TDO) line. This provides a single bit pass/fail result for the entire test. The TDO line goes 
to logic '1' when any one ORA in the iterative OR chain detects a mismatch due to faults. This 
reduces the total test time for the fault-free tests since the ORA contents can be obtained via 
partial configuration memory read-back for only those tests that fail. Each DSP slice is observed 
by 48 ORAs in two rows and three columns of CLBs. Figure 3.6 illustrates the mapping of the 
individual output bits from a single DSP slice to the ORAs. Figure 3.6 helps in determining the 
faulty DSPs in a column of DSPs and the individual faulty DSP outputs when partial 
configuration memory read-back is done. All the DSPs are tested concurrently so the length of 
the test sequence is independent of the size of the chip array.  
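The single bit pass/fail result of the iterative OR chain can be illustrated with a small behavioral model. It assumes that each ORA latches a mismatch and holds it until the results are read, which is the usual behavior of a comparison based ORA but is not spelled out above; the class name ORA and the function name tdo are hypothetical.

```python
class ORA:
    """Comparison based ORA: a sticky fail flag set by any observed mismatch."""

    def __init__(self):
        self.fail = 0

    def clock(self, out_j: int, out_k: int, enable: int = 1) -> None:
        # With the clock enable low the ORA ignores the comparison result.
        if enable and out_j != out_k:
            self.fail = 1


def tdo(oras, tdi: int = 0) -> int:
    """Iterative OR of all fail flags, from the carry-in (TDI) to the carry-out (TDO)."""
    carry = tdi
    for ora in oras:
        carry |= ora.fail
    return carry


if __name__ == "__main__":
    chain = [ORA() for _ in range(48)]
    chain[17].clock(1, 0)      # one mismatch on output bit 17
    chain[17].clock(1, 1)      # later matches do not clear the flag
    print("TDO =", tdo(chain))
```

Only when TDO is '1' does the individual fail flag need to be located, which corresponds to the partial configuration memory read-back described above.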
 
Figure 3.4 DSP BIST Architecture 
[Figure: TPG 0 and TPG 1 drive alternate DSP tiles in a column, each tile containing slices 
DSP s0 and DSP s1; sets of ORAs alongside the tiles compare the outputs of like slices] 
 
Figure 3.7 illustrates the general architecture of the TPG. The TPG for DSP BIST 
comprises a 10-bit counter whose two MSBs are used for the individual control of the four 
256 clock cycle test groups during the 1024 clock cycle test sequence, a 50-bit shift register for 
the adder test, a finite state machine (FSM) for control of the OPMODE control signals and two 
9-bit linear feedback shift registers (LFSRs) for generating weighted pseudo random control 
signals. The eight least significant bits of the counter are used to apply test patterns to the 
multiplier. The TPG is modeled in VHDL in 266 lines of code and is synthesized into 44 CLBs. 
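The division of the 10-bit counter into a group field and a pattern field can be sketched as follows. This is an illustration in Python rather than the VHDL model of the TPG, and the function name tpg_fields is hypothetical.

```python
def tpg_fields(cycle: int):
    """Split the 10-bit TPG counter value for a given clock cycle.

    Returns (group, pattern): the two MSBs select one of the four 256 clock
    cycle test groups and the eight LSBs form the multiplier test pattern.
    """
    count = cycle % 1024        # the 10-bit counter wraps after 1024 clock cycles
    return count >> 8, count & 0xFF


if __name__ == "__main__":
    for cycle in (0, 255, 256, 1023):
        print(cycle, tpg_fields(cycle))
```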
 
Figure 3.5 ORA Architecture 
[Figure: a LUT compares Outi of DSPj with Outi of DSPk and feeds a flip-flop, with enable EN 
driven by ORACE, that holds the Pass/Fail result; the result controls the carry multiplexer 
between Carry in and Carry out] 
 
Figure 3.6 ORA map for a DSP tile 
[Figure: placement, across alternate rows of CLBs, of the ORAs in the CLB slices that compare 
DSP slice 0 outputs DS0(0) to DS0(47) and DSP slice 1 outputs DS1(0) to DS1(47)] 
  
3.3 BIST Configurations and Test Sequences 
 The DSP cores in Virtex-4 devices are tested using three independent BIST sequences 
which correspond to each of the three test modes of operation: the multiplier, adder and cascade 
modes of operation. Table 3.3 summarizes the test sequences. Each BIST sequence is 1024 clock 
cycles long and divided into four groups of 256 clock cycles. During each group of 256 clock 
cycles, specific I/O paths through the multiplier, adder and cascade modes of operation are 
tested. 
Figure 3.7 TPG Architecture 
[Figure: the TPG, made up of the counter, shift register, FSM and LFSR blocks, drives the A 
port, B port, C port, OpMode and Control inputs of each DSP slice (bus widths shown: 36, 48, 7 
and 17); the 48-bit P port of each slice goes to the ORAs] 
 
The 5×3 multiplier test algorithm is applied to the multiplier in slice 1 while the 3×5 
multiplier test algorithm is applied to the multiplier in slice 0 during the first group of the 
multiplier test sequence. During the second group of the multiplier test sequence the 5×3 
multiplier test algorithm is applied to the multiplier in slice 0 and the 3×5 multiplier test 
algorithm is applied to the multiplier in slice 1. The 5×3 multiplier test algorithm is applied by 
replicating the five MSBs of the vector generated by the 8-bit counter and applying the replicated 
bits to the 18-bit A port while the three LSBs of the vector generated by the 8-bit counter are 
replicated and applied to the 18-bit B port. During the 3×5 multiplier test algorithm the three 
LSBs of the vector generated by the 8-bit counter are replicated and applied to the 18-bit A port 
while the five MSBs of the vector generated by the 8-bit counter are replicated and applied to the 
18-bit B port. The application of different multiplier test algorithms to slice 0 and slice1 ensures 
that the A and B ports in both the slices receive different test patterns on every clock cycle so 
that the single stuck-at faults on the multiplexers that select between direct and cascade paths on 
port B of the DSP slice can be tested. During the third group of 256 clock cycles of the multiplier 
test sequence, the multiply and add function is tested. The multiplier is not tested in the fourth 
group of the multiplier test sequence. This group only tests for the A port concatenated with the 
B port (denoted as A:B in Table 3.3) bypass of the multiplier. 
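The replication of counter bits onto the multiplier ports can be illustrated with a short sketch. This is not the thesis TPG (which is written in VHDL); the function names are hypothetical, and the alignment of the replicated field (copies packed from bit 0 upward, with the top copy truncated at the port width) is an assumption, since the text does not specify it.

```python
def replicate(bits: int, nbits: int, width: int) -> int:
    """Repeat an nbits-wide field until it fills a width-bit port.

    Assumption: copies are packed from bit 0 upward and the last copy is
    truncated at the port width.
    """
    out = 0
    for pos in range(0, width, nbits):
        out |= bits << pos
    return out & ((1 << width) - 1)


def multiplier_vectors(count: int, algorithm: str,
                       a_width: int = 18, b_width: int = 18):
    """Return (A, B) port values for one 8-bit counter value."""
    msb5 = (count >> 3) & 0x1F  # five MSBs of the 8-bit counter
    lsb3 = count & 0x07         # three LSBs of the 8-bit counter
    if algorithm == "5x3":
        return replicate(msb5, 5, a_width), replicate(lsb3, 3, b_width)
    if algorithm == "3x5":
        return replicate(lsb3, 3, a_width), replicate(msb5, 5, b_width)
    raise ValueError("algorithm must be '5x3' or '3x5'")


if __name__ == "__main__":
    # In the first group, slice 1 gets 5x3 and slice 0 gets 3x5.
    for count in (0x00, 0xB5, 0xFF):
        a1, b1 = multiplier_vectors(count, "5x3")
        a0, b0 = multiplier_vectors(count, "3x5")
        print(f"count={count:08b} slice1 A={a1:018b} B={b1:018b} "
              f"slice0 A={a0:018b} B={b0:018b}")
```

Because the two algorithms swap which counter field drives which port, the A and B ports of the two slices differ on any cycle where the replicated 5-bit and 3-bit patterns differ, which is the property the cascade multiplexer test relies on.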
Each stage of the two-stage adder is tested separately during the first two groups of 256 
clock cycles in the adder test sequence. The first stage adder is tested during the first group of 
256 clock cycles in the adder test sequence and the second stage adder is tested during the 
second and third groups of 256 clock cycles in the adder test sequence. The P output of the DSP 
slice that is left shifted by 17 bits can be fed back to the adder through the Z multiplexer 
(denoted as Z(ShiftP) in Table 3.3). This path to the adder is tested during the fourth group of 
256 clock cycles in the adder test sequence.  
Table 3.3 BIST Sequences 
Test | Multiply | Adder | Cascade 
First 256 ccs | P = A×B | P = Z(C); P = X(P)+Y(C) | P1 = A:B+Z(PC); P0 = Z(C) 
Second 256 ccs | P = A×B | P = Y(C); P = Y(C)+Z(P) | P1 = A:B+Z(ShiftPC); P0 = Z(C) 
Third 256 ccs | P = A×B+C | P = Z(C); P = Y(C)+Z(P) | P1 = Z(C); P0 = A:B+Z(PC) 
Fourth 256 ccs | P = A:B+C | P = Y(C); P = Y(C)+Z(ShiftP) | P1 = Z(C); P0 = A:B+Z(ShiftPC) 
During the third and fourth groups of 256 clock cycles in the adder and the multiplier test 
sequences, weighted pseudorandom patterns generated by linear feedback shift registers (LFSRs) 
are applied to test the various clock enables, resets and carry-in sources in the DSP. Weighted 
pseudorandom patterns are used so that the pipeline registers of the DSP are reset less often, since 
frequent resets of the pipeline registers can cause the fault detection data to be lost before it 
reaches the output of the DSP. Figure 3.8 illustrates the architecture of one of the 9-bit LFSRs, 
LFSRA. The second LFSR, LFSRB, uses the reciprocal polynomial of LFSRA. Table 3.4 
summarizes the weighted pseudorandom patterns in terms of their LFSR sources.  
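The effect of weighting can be sketched in a few lines. The feedback taps below are an assumption (the thesis does not state the LFSRA polynomial in this section); a primitive degree-9 trinomial and its reciprocal, x^9+x^4+1 and x^9+x^5+1, are used purely for illustration. The sketch shows that a single LFSR bit is '1' about half the time, while the AND of three LFSR bits, as used for the reset signals in Table 3.4, is '1' only about one eighth of the time.

```python
WIDTH = 9  # both LFSRs are 9 bits wide


def lfsr_step(state: int, taps: tuple) -> int:
    """Advance a Fibonacci LFSR by one clock: shift left, XOR of taps enters bit 0."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> t) & 1
    return ((state << 1) | feedback) & ((1 << WIDTH) - 1)


def run_period(taps: tuple, seed: int = 1) -> list:
    """Collect the states visited until the LFSR returns to its seed."""
    states = []
    state = seed
    while True:
        states.append(state)
        state = lfsr_step(state, taps)
        if state == seed:
            return states


def bit(state: int, index: int) -> int:
    return (state >> index) & 1


# Assumed taps: a primitive trinomial for LFSRA and its reciprocal for LFSRB.
TAPS_A = (8, 4)
TAPS_B = (8, 3)

states_b = run_period(TAPS_B)

# Unweighted signal, e.g. CARRYIN = LFSRB<0>.
ones_single = sum(bit(s, 0) for s in states_b)

# Weighted signal, e.g. RSTA (slice 0) = LFSRB<7> and LFSRB<5> and LFSRB<3>.
ones_rsta = sum(bit(s, 7) & bit(s, 5) & bit(s, 3) for s in states_b)

if __name__ == "__main__":
    n = len(states_b)
    print(f"period = {n}")
    print(f"P(LFSRB<0> = 1)      = {ones_single / n:.3f}")
    print(f"P(RSTA slice 0 = 1)  = {ones_rsta / n:.3f}")
```

With a maximal-length 9-bit LFSR every nonzero state occurs once per period, so any three-bit AND asserts a reset in roughly one cycle out of eight, regardless of which three bits are chosen.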
During the cascade mode test sequence, the two DSP slices are independently controlled 
to test the cascade multiplexers and the cascade interconnect between adjacent DSPs. Slice 0 and 
slice 1 have different P equations; in Table 3.3, P0 indicates the slice 0 equation and P1 
indicates the slice 1 equation. During the first group of 256 clock cycles in the cascade test sequence, 
slice 1 receives the P output of slice 0 as its input (denoted by Z(PC) in Table 3.3) and in the 
second group slice 1 receives the shifted P output of slice 0 as its input (denoted by Z(ShiftPC) in 
Table 3.3). During the third group of 256 clock cycles in the cascade test sequence, slice 0 
receives the P output of the previous slice 1 as its input (denoted by Z(PC) in Table 3.3) and in 
the fourth group slice 0 receives the shifted P output of the previous slice 1 as its input (denoted 
by Z(ShiftPC) in Table 3.3). Because the cascade inputs of the bottom-most DSP slice are 
unconnected, ORA failures would occur for that slice in the third and fourth groups of the 
cascade test sequence. Therefore, the 
ORAs that monitor the DSPs at the bottom of the array are disabled by the TPG during the third 
and fourth groups of 256 clock cycles in the cascade test sequence. 
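The group structure of the cascade sequence and the ORA disabling rule can be summarized as a small behavioral sketch. The function names and the Boolean flag are hypothetical; the equations are transcribed from the Cascade column of Table 3.3, and the actual TPG implements this control in hardware rather than software.

```python
GROUP_CYCLES = 256
SEQUENCE_CYCLES = 1024

# Cascade-mode equations for each group of 256 clock cycles (from Table 3.3).
CASCADE_EQUATIONS = [
    {"P1": "A:B+Z(PC)", "P0": "Z(C)"},
    {"P1": "A:B+Z(ShiftPC)", "P0": "Z(C)"},
    {"P1": "Z(C)", "P0": "A:B+Z(PC)"},
    {"P1": "Z(C)", "P0": "A:B+Z(ShiftPC)"},
]


def group_of(cycle: int) -> int:
    """Map a clock cycle (0..1023) to its group index (0..3)."""
    if not 0 <= cycle < SEQUENCE_CYCLES:
        raise ValueError("cycle must be in 0..1023")
    return cycle // GROUP_CYCLES


def ora_enabled(cycle: int, monitors_bottom_dsp: bool) -> bool:
    """ORAs watching the bottom-most DSPs are disabled in groups 3 and 4.

    In those groups slice 0 takes its input from the cascade lines of the
    slice below, which are unconnected at the bottom of the array.
    """
    return not (monitors_bottom_dsp and group_of(cycle) >= 2)


if __name__ == "__main__":
    for cycle in (0, 300, 512, 1023):
        g = group_of(cycle)
        print(cycle, CASCADE_EQUATIONS[g], ora_enabled(cycle, True))
```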
 
  
Figure 3.8 Architecture of the 9-bit LFSRA (stages LFSRA<0> through LFSRA<8>)
Table 3.4 Weighted Pseudorandom Patterns 
DSP Signal | Pattern 
CEA (clock enable for Areg) | LFSRA<0> 
CEB (clock enable for Breg) | LFSRA<1> 
CEM (clock enable for Mreg) | LFSRA<2> 
CEP (clock enable for Preg) | LFSRA<3> 
CECARRYIN (clock enable when carryin used for rounding applications) | LFSRA<4> 
CECTRL (clock enable for CARRYINSEL, SUBTRACT and OPMODE registers) | LFSRB<5> 
CECINSUB (clock enable when carryin is defined by the user) | LFSRA<6> 
CARRYINSEL<1:0> (control register to select the carryin source) | LFSRA<8:7> 
CARRYIN (user defined carryin) | LFSRB<0> 
SUBTRACT (user defined subtract) | LFSRB<1> 
CEC (clock enable for Creg) | LFSRB<2> 
RSTA (reset for Areg) for slice 0 | LFSRB<7> and LFSRB<5> and LFSRB<3> 
RSTB (reset for Breg) for slice 0 | LFSRB<1> and LFSRB<3> and LFSRB<5> 
RSTM (reset for Mreg) for slice 0 | LFSRB<6> and LFSRB<2> and LFSRB<0> 
RSTP (reset for Preg) for slice 0 | LFSRB<4> and LFSRB<0> and LFSRB<6> 
RSTCARRYIN (reset for all sources of carryin) for slice 0 | LFSRB<5> and LFSRB<0> and LFSRB<7> 
RSTCTRL (reset for CARRYINSEL, SUBTRACT and OPMODE registers) for slice 0 | LFSRB<6> and LFSRB<7> and LFSRB<8> 
RSTA (reset for Areg) for slice 1 | LFSRB<6> and LFSRB<4> and LFSRB<2> 
RSTB (reset for Breg) for slice 1 | LFSRB<0> and LFSRB<2> and LFSRB<4> 
RSTM (reset for Mreg) for slice 1 | LFSRB<5> and LFSRB<1> and LFSRB<8> 
RSTP (reset for Preg) for slice 1 | LFSRB<3> and LFSRB<8> and LFSRB<5> 
RSTCARRYIN (reset for all sources of carryin) for slice 1 | LFSRB<4> and LFSRB<8> and LFSRB<6> 
RSTCTRL (reset for CARRYINSEL, SUBTRACT and OPMODE registers) for slice 1 | LFSRB<5> and LFSRB<6> and LFSRB<7> 
RSTC (reset for Creg) | LFSRB<0> and LFSRB<1> and LFSRB<2> 
The DSP cores are tested in five BIST configurations. During these five BIST 
configurations, the DSP configuration memory bits are tested in all functional modes. Table 3.5 
summarizes the BIST configurations developed during the initial stages of BIST development 
(modifications to these BIST configurations are explained in later sections). In Table 3.5, 
column 1 indicates the BIST configuration download number, column 2 indicates the number of 
pipeline registers in the I/O paths of the DSP slice, column 3 indicates the active level of the 
DSP slice control signals, column 4 indicates whether the B port of the DSP slice is in cascade or 
direct mode of operation and column 5 indicates the test sequence number that is applied for 
each of the BIST configurations. The multiplier is tested during the first, second and fourth test 
sequences, the adder is tested during the third and fifth test sequences and the cascade modes of 
operation are tested during the sixth and seventh test sequences. Instead of connecting all the 
DSP slices in cascade mode at the same time, alternate DSP slices are connected in cascade mode 
to avoid seeing failures due to the unconnected cascade input lines of the bottom-most DSP slice 
in the array. During the sixth test sequence, slice 0 is in cascade mode of operation and during 
the seventh test sequence, slice 1 is in cascade mode of operation. 
BIST configurations #2 and #3 are run twice since the TPG control inputs need to be 
changed to run the multiplier and the adder test sequences during the same BIST configuration. 
A total of seven test sequences are applied in five downloads to the FPGA, thereby reducing the 
number of downloads to the FPGA by two. The download time can be minimized using partial 
reconfiguration, through which only the configuration memory that contains DSP configuration 
memory bits is written instead of the whole configuration memory. This can be done by 
maintaining constant placement of the TPGs, the ORAs and the DSPs and by keeping the routing 
constant between them. Table 3.6 illustrates the improvement in download time for the largest 
devices from each of the three families of Virtex-4 FPGAs, FX140, SX55 and LX200 (thereby 
representing the longest download times), when partial reconfiguration is used and all the DSPs in 
the devices are tested concurrently. The download time with partial reconfiguration in column 2 
of Table 3.6 is the download time for all five configurations where the first download is 
a compressed download and the remaining downloads are partial reconfigurations. The download 
time without partial reconfiguration in Table 3.6 is the download time for all five 
configurations where all five downloads are compressed. The maximum clock frequency for 
download using the Boundary Scan interface is 50 MHz. 
Table 3.5 Initially Developed BIST Configurations 
BIST Config # | Pipeline Registers | Signals Active Level | B Input Source, Slice 0 | B Input Source, Slice 1 | Test Modes Applied, Mult | Add | Casc 
1 | All Regs=0 | High | Direct | Direct | #1 | - | - 
2 | All Regs=1 | High | Direct | Direct | #2 | #3 | - 
3 | A/Breg=2, Others=1 | Low | Direct | Direct | #4 | #5 | - 
4 | Preg=1, Others=0 | High | Direct | Cascade | - | - | #6 
5 | Preg=1, Others=0 | Low | Cascade | Direct | - | - | #7 
Table 3.6 Improvement in Download Time using Partial Reconfiguration 
Device | Download time with partial reconfiguration (sec) | Download time without partial reconfiguration (sec) 
FX140 | 0.30128 | 1.47814 
SX55 | 0.27915 | 1.32346 
LX200 | 0.22468 | 1.11734 
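As a quick check on the size of the improvement, the ratios implied by Table 3.6 can be computed directly. The sketch below only reuses the figures from the table; the dictionary and function names are illustrative.

```python
# (with partial reconfiguration, without partial reconfiguration), in seconds,
# copied from Table 3.6.
DOWNLOAD_TIMES_SEC = {
    "FX140": (0.30128, 1.47814),
    "SX55": (0.27915, 1.32346),
    "LX200": (0.22468, 1.11734),
}


def speedup(device: str) -> float:
    """Ratio of download time without partial reconfiguration to time with it."""
    with_partial, without_partial = DOWNLOAD_TIMES_SEC[device]
    return without_partial / with_partial


if __name__ == "__main__":
    for device in DOWNLOAD_TIMES_SEC:
        print(f"{device}: {speedup(device):.2f}x faster with partial reconfiguration")
```

For all three devices the total download time drops by a factor of a little under five.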
 
3.4 BIST Generation 
 The TPG model, written in VHDL, is synthesized for an FX12 device because a large area 
of the FX12 chip array is occupied by the PowerPC; the TPG is 
carefully constrained to an area close to the DSPs that does not interfere with the location of the 
PowerPC. The TPG location for all other devices is offset with respect to the location of the 
TPG in the FX12 device based on the number of rows and columns in individual devices. The 
synthesized TPG is in NCD format, which can be viewed in FPGA Editor, a Xilinx tool that gives a 
graphical representation of the device. The synthesized TPG in NCD format is then converted to 
XDL (Xilinx Design Language) format.  
Three programs written in C (developed as part of this thesis work) generate the BIST 
configurations for any size or family of Virtex-4 devices. The V4DSPBIST.exe program calls 
another program, TPGXDLEXT.exe. The latter program extracts the TPG from the synthesized 
XDL and writes it to an output file. The V4DSPBIST.exe program reads that output file and 
places and instantiates the two TPGs. It also instantiates and interconnects the remaining BIST 
architecture, the DSPs under test and the ORAs, and generates a DSP BIST template in XDL 
format. The DSP BIST template in XDL format is converted to NCD format for routing the 
BIST architecture. The routed DSP BIST template is then converted back to XDL format to be 
used by the modification program, V4DSPMOD.exe. This modification program, written in C, is 
used to modify the routed DSP BIST template to generate the five BIST configurations in XDL 
format. These BIST configuration files are then converted back to NCD format to generate the 
download configuration bit files.  
The NCD files of the DSP BIST template and the BIST configuration can be viewed in 
FPGA Editor. The routed DSP BIST templates for the FX12 and SX35 devices are illustrated in 
Figure 3.9a and Figure 3.9b, respectively. For DSP columns to the right of the center column, the 
ORAs are located in three consecutive columns of CLBs on the right side of the DSP and, for 
DSP columns to the left of the center column, the ORA columns are positioned to the left of the 
DSP. To avoid the PowerPC modules in FX family devices larger than FX40, the ORAs are 
located on the right side of DSPs that are left of the center column and on the left side of 
DSPs that are right of the center column.  
a) Routed FX12     b) Routed SX35 (labeled regions in each panel: TPG0, TPG1, DSPs, ORAs)
Figure 3.9 BIST Template as Seen in FPGA Editor 
3.5 Detection of Faulty DSPs and Fault Coverage 
To verify the fault detection capabilities of the DSP BIST, faults were injected into the 
configuration memory bits that control the DSPs in an FX12 device. Figure 3.10 illustrates the 
individual fault coverage achieved for each of the seven BIST sequences. BIST sequences #2 
and #3 share one download and represent BIST configuration #2 in Table 3.5. Similarly, BIST 
sequences #4 and #5 share one download and represent BIST configuration #3 in Table 3.5. 
This is because configuration downloads #2 and #3 in Table 3.5 are run twice, as explained in 
Section 3.3. BIST sequences #6 and #7 represent BIST configurations #4 and #5, respectively, in 
Table 3.5. A cumulative fault coverage of 97.4% is achieved. Of the 154 faults injected, four 
were not detected; these could be detected by adding more BIST configurations to achieve 100% 
fault coverage. However, the undetected faults are in non-functional modes of operation, 
making the additional BIST configurations impractical. 
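The cumulative coverage figure follows directly from the fault counts quoted above; a minimal check, with illustrative variable names:

```python
FAULTS_INJECTED = 154
FAULTS_UNDETECTED = 4

faults_detected = FAULTS_INJECTED - FAULTS_UNDETECTED
coverage_percent = 100.0 * faults_detected / FAULTS_INJECTED

if __name__ == "__main__":
    print(f"{faults_detected} of {FAULTS_INJECTED} faults detected: "
          f"{coverage_percent:.1f}% cumulative coverage")
```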
The DSP BIST configurations were able to detect faulty DSPs in some of the engineering 
sample parts of SX35 and LX60 devices. Of the five SX35 and nine LX60 engineering sample 
parts tested, DSP BIST detected up to five faulty DSPs in four SX35 engineering sample parts 
and one faulty DSP each in two LX60 engineering sample parts. The faulty DSP slices in the 
Virtex-4 SX35 and LX60 engineering sample parts are summarized in Table 3.7. The 
corresponding faulty DSP output bit positions observed by the ORAs are also shown in Table 
3.7. The Slice Description column gives the position of the faulty DSP slice as named by the BIST 
generation program. For example, DSP_r90c46 implies the DSP slice in the 45th DSP row and 
46th column of the chip array and DSP_r52c46 implies the DSP slice in the 26th DSP row and 
46th column of the chip array. The Test number column gives the test sequence number during 
which each of the faulty DSP slices was detected. The remaining columns give the failing DSP 
output bits, where P is the output port of the DSP slice. Each engineering sample part shown in 
Table 3.7 was tested three times for each of the seven test sequences. The failing DSP output bit 
positions differed during each of the three tests for every engineering sample part shown in Table 
3.7, as illustrated for the first SX35 device. 
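The slice naming convention can be decoded mechanically. The sketch below assumes, based only on the two examples given in the text (r90 is the 45th DSP row and r52 is the 26th), that the DSP row is the row index divided by two; the function name and the returned dictionary layout are hypothetical and are not part of the BIST generation program.

```python
import re


def parse_dsp_name(name: str) -> dict:
    """Split a name such as 'DSP_r90c46' into its row and column fields.

    Assumption: the DSP row number is the row index halved, as suggested by
    the examples DSP_r90c46 (45th DSP row) and DSP_r52c46 (26th DSP row).
    """
    match = re.fullmatch(r"DSP_r(\d+)c(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized DSP slice name: {name!r}")
    row_index = int(match.group(1))
    column = int(match.group(2))
    return {"row_index": row_index, "dsp_row": row_index // 2, "column": column}


if __name__ == "__main__":
    for name in ("DSP_r90c46", "DSP_r52c46", "DSP_r8c35"):
        print(name, parse_dsp_name(name))
```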
Table 3.7 Faulty DSP Slices in Virtex-4 SX35 and LX60 Engineering Sample Parts 
Device | Test number | Slice Description | Failing DSP output bit positions, 1st test | 2nd test | 3rd test 
SX35 part#1 | 1 | DSP_r90c46 | P0-P3 | P0-P10 | P0-P47 
SX35 part#1 | 1 | DSP_r52c46 | P0-P21 | P0-P47 | P0-P2 
SX35 part#1 | 1 | DSP_r56c19 | P0, P1 | P0-P20 | P0-P3 
SX35 part#1 | 1 | DSP_r8c35 | P0-P47 | P0-P47 | P0-P47 
SX35 part#1 | 1 | DSP_r80c8 | P0-P47 | P0-P47 | P0-P47 
SX35 part#1 | 2 | DSP_r90c46 | P0-P10 | P0-P1 | P0-P1 
SX35 part#1 | 2 | DSP_r52c46 | P0-P27 | P0-P27 | P0-P37 
SX35 part#1 | 2 | DSP_r56c19 | P16-P29, P32-P47 | P0-P11, P16-P31 | P0-P5, P16-P47 
SX35 part#1 | 2 | DSP_r8c35 | P0-P31, P42-P47 | P0-P37 | P0-P37 
SX35 part#1 | 2 | DSP_r80c8 | P0-P5, P16-P47 | P0-P37 | P0-P37 
SX35 part#1 | 3 | DSP_r90c46 | P0-P18 | P0-P47 | P0-P18 
SX35 part#1 | 3 | DSP_r52c46 | P0-P27 | P0 | P0-P29 
SX35 part#1 | 3 | DSP_r56c19 | P0-P7, P16-P35 | P16-P29, P32-P47 | P0-P3, P16-P47 
SX35 part#1 | 3 | DSP_r8c35 | P0-P35 | P0-P35 | P0-P35 
SX35 part#1 | 3 | DSP_r80c8 | P0-P35 | P0-P35 | P0-P3, P16-P47 

Device | Test number | Slice Description | Failing DSP output bit positions for all three iterations 
SX35 part#3 | 1 | DSP_r72c19 | P0-P19 
SX35 part#3 | 2 | DSP_r72c19 | P0-P27 
SX35 part#4 | 1 | DSP_r64c8 | P0-P47 
SX35 part#4 | 1 | DSP_r74c35 | P0-P3 
SX35 part#4 | 1 | DSP_r92c19 | P0-P2 
SX35 part#4 | 2 | DSP_r64c8 | P0-P5, P16-P47 
SX35 part#4 | 2 | DSP_r74c35 | P0, P1 
SX35 part#4 | 3 | DSP_r64c8 | P0-P3, P16-P47 
SX35 part#4 | 3 | DSP_r74c35 | P0-P16 
SX35 part#5 | 2 | DSP_r52c35 | P0 
SX35 part#5 | 3 | DSP_r52c35 | P0-P47 
LX60 part#6 | 1 | DSP_r46c15 | P0-P40 
LX60 part#6 | 2 | DSP_r46c15 | P0-P12 
LX60 part#6 | 3 | DSP_r46c15 | P0-P18 
LX60 part#8 | 2 | DSP_r90c15 | P0-P47 
LX60 part#8 | 3 | DSP_r90c15 | P0-P47 
 
3.6 BIST Timing Analysis 
 To determine the maximum clock frequency of DSP BIST, timing analysis was done 
using the Xilinx timing analysis tool, TRCE.exe, for all Virtex-4 FPGAs. Figure 3.10 illustrates 
the maximum BIST clock frequency (in MHz) for all five configurations for an SX35 device 
when the TPG is placed at the bottom of the array. From Figure 3.10 it is observed that the BIST 
clock frequency for configuration #1 is always low. Because configuration #1 has no pipeline 
registers, the timing tool cannot calculate an accurate BIST clock frequency for this 
configuration: it assumes the possibility of a dynamic cascade of all DSPs in the device 
even though the DSPs are not cascaded during this BIST configuration. Configuration #5 has the 
next slowest BIST clock frequency because the DSPs are clocked on the falling edge of the clock 
while the TPGs and ORAs are clocked on the rising edge of the clock in this configuration. The 
cascade routing between the DSP slices in configuration #5 also decreases the maximum BIST 
clock frequency. To improve the overall clock frequency of BIST, the DSPs in configuration #5 
are instead clocked on the rising edge of the clock, since configuration #3 already tests the DSPs 
when clocked on the falling edge of the clock.  
Figure 3.11 illustrates the BIST clock frequency (in MHz) for all five BIST 
configurations for an SX35 device when the DSPs are clocked on the falling edge of the clock 
only in configuration #3. Configuration #3 now has the slowest BIST clock frequency. From 
the timing analysis results for the slowest BIST configuration, configuration #3, on all Virtex-4 
devices, it is observed that the position of the TPG in the array has a significant impact 
on the BIST clock frequency. Figure 3.12 illustrates the maximum clock frequency (in MHz) for 
the slowest BIST configuration (#3) for all Virtex-4 FPGAs with respect to the TPG position at 
the bottom of the array or at the middle of the array as shown in Figure 3.8a. 
From Figure 3.12 it is seen that a higher BIST clock frequency is achieved when the TPG 
is placed at the middle of the array than when the TPG is placed at the bottom of 
the array. This is because the top-most and the bottom-most DSP slices are at an equal 
distance from the TPG when the TPG is placed at the middle of the array. This makes the routing 
distance between the top-most DSP slice and the TPG shorter than the routing 
distance when the TPG is placed at the bottom of the array. Therefore, the TPG is placed at the 
middle of the array for all devices except the FX12 and FX20 devices, which have a PowerPC 
module at the middle of the array. The maximum clock frequency for BIST is less than 50 MHz 
for some of the larger Virtex-4 FPGAs, such as the LX100, LX160, LX200 and FX140. Sub-array 
testing can be done for these devices, where each half of the array is tested separately. 
 
Figure 3.10 Maximum BIST Clock Frequency for an SX35 Device When DSPs in Configurations #3 and #5 are Clocked on Falling Edge of the Clock (y-axis: maximum BIST clock frequency, 0 to 100; x-axis: config1 through config5)
 
Figure 3.11 Maximum BIST Clock Frequency for an SX35 Device when DSPs in Configuration #3 are Clocked on Falling Edge of the Clock (y-axis: maximum BIST clock frequency, 0 to 100; x-axis: config1 through config5)
 Figure 3.13 illustrates the maximum clock frequency (in MHz) for the sub-arrays as a 
function of the TPG position in the array. In Figure 3.13, "Bottom BIST Bottom TPG" refers to 
BIST for the bottom half of the array when the TPG is placed at the bottom of the array, "Top 
BIST Bottom TPG" refers to BIST for the top half of the array when the TPG is placed at the 
bottom of the array, "Bottom BIST Middle TPG" refers to BIST for the bottom half of the array 
when the TPG is placed at the middle of the array, and "Top BIST Middle TPG" refers to BIST 
for the top half of the array when the TPG is placed at the middle of the array. From Figure 3.13 
it is seen that, when the TPG is placed at the middle of the array, the maximum BIST clock 
frequency for the top half of the array is higher than the maximum clock frequency for the bottom half 
of the array. This is because, when the TPG is located at the middle of the array, for the top half of 
the array the TPG routes across and then up to the DSPs above, whereas for the bottom half of 
the array the TPG routes down to the DSPs at the bottom of the array and then routes across and 
up to the DSPs above. 
 
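Figure 3.12 Maximum BIST Clock Frequency (y-axis: max clock frequency, 0 to 110 MHz; x-axis: Virtex-4 FPGA, lx15, lx25, lx40, lx60, lx80, lx100, lx160, lx200, fx12, fx20, fx40, fx60, fx100, fx140, sx25, sx35, sx55; series: Bottom, Middle)
 
Figure 3.13 Maximum Clock Frequency for Sub-Arrays (y-axis: max clock frequency, 0 to 60 MHz; x-axis: device, lx100, lx160, lx200, fx140; series: Bottom BIST Bottom TPG, Top BIST Bottom TPG, Bottom BIST Middle TPG, Top BIST Middle TPG)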
Figure 3.14 illustrates the routing paths for the top and bottom halves of the array when 
the TPG is placed at the middle of the array. The routing path is longer for the bottom half 
of the array than for the top half, which explains the slower clock frequency.  
Therefore, to make the clock frequency for the bottom half of the array as fast as that for the top 
half, the TPG is placed at the bottom when testing the bottom half of the array, as shown in 
Figure 3.15. 
                
a) Routing for the Top Half of the Array         b) Routing for the Bottom Half of the Array (both panels: V4 LX160, routed)
Figure 3.14 Routing Paths for the Sub-Arrays with TPG at the Middle of the Array 
 
Figure 3.15 TPG Position for the Bottom Sub-Array (V4 LX160, unrouted)
The BIST clock frequency can be further improved by inverting the clock on the CLB 
slices in which the TPGs and the ORAs are implemented for BIST configuration #3, which has an 
inverted clock on the DSP slices. Figure 3.16 illustrates the increase in BIST clock frequency 
(in MHz) for BIST configuration #3 when the clock is inverted on the TPGs and ORAs as well as 
on the DSP slices.  
 
Figure 3.16 Timing Analysis Based on Clock Edge for Configuration #3 (y-axis: max BIST clock frequency, 0 to 200 MHz; x-axis: Virtex-4 devices lx15 through sx55; series: config #3 same edge clock, config #3 opposite edge clock)
Figure 3.17 illustrates the maximum BIST clock frequency (in MHz) for BIST 
configurations #2 through #5 for DSP BIST when the TPGs, ORAs and DSPs have the same 
clock edge for all the configurations. BIST configuration #1 is not included in Figure 3.17 since 
timing analysis does not give an accurate result for configuration #1.  
 
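Figure 3.17 Timing Analysis for DSP BIST Configurations #2 through #5 (y-axis: max BIST clock frequency, 0 to 200 MHz; x-axis: Virtex-4 devices lx15 through sx55; series: config #2, config #3, config #4, config #5)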
Table 3.8 illustrates the increase in download time and test time caused by inverting the 
clock on the TPGs and ORAs for BIST configurations that have inverted clock on the DSP 
slices. This increase in download time happens because for BIST configuration #3 where the 
DSPs are clocked on the falling edge of the clock, the configuration memory of the TPGs and the 
ORAs has to be rewritten in order to match the clock edge of the TPGs and the ORAs with the 
falling clock edge of the DSPs, since the TPGs and ORAs are configured to clock on the rising 
edge of the clock in the previous BIST configuration. The order in which the BIST 
configurations are generated for the data presented in Table 3.8 is #1 through #5. 
From Table 3.8, it is observed that using same edge clock for BIST configurations that 
have falling edge clock increases the download and test time by a maximum of 7.9% when BIST 
configurations are downloaded in the following order: #1, #2, #3, #4 and #5. This increase is not 
significant when compared to the overall download and test time. 
Table 3.8 Configuration File Size and Test Time Increase for Same Edge Clock 
Device | # Bits for Config #1, full download | # Bits for Config #1, compress download | % | # Bits for Configs #2, #3, #4 & #5, opposite edge clock | # Bits for Configs #2, #3, #4 & #5, same edge clock | Increase in Download Time (sec), full download | Increase in Download Time (sec), compress download | Increase in Test Time (sec), full download | Increase in Test Time (sec), compress download 
lx15 | 4,765,568 | 2,091,808 | 43.8 | 62,720 | 183,808 | 1.02507 | 1.05620 | 1.02503 | 1.05598 
lx25 | 7,819,904 | 3,259,552 | 41.6 | 84,736 | 238,848 | 1.01949 | 1.04608 | 1.01947 | 1.04596 
lx40 | 12,259,712 | 4,747,072 | 38.7 | 106,752 | 293,888 | 1.01513 | 1.03855 | 1.01512 | 1.03848 
lx60 | 17,717,632 | 5,813,376 | 32.8 | 106,752 | 293,888 | 1.01049 | 1.03161 | 1.01049 | 1.03156 
lx80 | 23,291,008 | 7,512,768 | 32.2 | 128,768 | 348,928 | 1.00940 | 1.02881 | 1.00939 | 1.02877 
lx100 | 30,711,680 | 9,421,696 | 30.6 | 150,784 | 403,968 | 1.00820 | 1.02644 | 1.00820 | 1.02642 
lx160 | 40,347,008 | 10,342,528 | 25.6 | 150,784 | 403,968 | 1.00625 | 1.02412 | 1.00625 | 1.02410 
lx200 | 51,367,808 | 11,714,848 | 22.8 | 150,784 | 403,968 | 1.03290 | 1.02133 | 1.00491 | 1.02132 
fx12 | 4,765,568 | 1,924,288 | 40.3 | 62,720 | 183,808 | 1.02507 | 1.06093 | 1.02503 | 1.06068 
fx20 | 7,242,624 | 2,277,344 | 31.4 | 62,720 | 183,808 | 1.01657 | 1.05174 | 1.01655 | 1.05156 
fx40 | 14,936,192 | 4,548,160 | 30.4 | 84,736 | 238,848 | 1.01025 | 1.03326 | 1.01025 | 1.03320 
fx60 | 21,002,880 | 7,805,024 | 37.1 | 194,816 | 514,048 | 1.01505 | 1.03990 | 1.01505 | 1.03986 
fx100 | 33,065,408 | 11,609,824 | 35.1 | 238,848 | 624,128 | 1.01156 | 1.03251 | 1.01156 | 1.03249 
fx140 | 47,856,896 | 15,736,192 | 32.8 | 282,880 | 734,208 | 1.00937 | 1.02817 | 1.00937 | 1.02816 
sx25 | 9,147,648 | 4,948,448 | 54 | 194,816 | 514,048 | 1.03417 | 1.06206 | 1.03413 | 1.06196 
sx35 | 13,700,288 | 7,367,104 | 53.7 | 282,880 | 734,208 | 1.03227 | 1.05899 | 1.03225 | 1.05893 
sx55 | 22,745,216 | 13,322,656 | 58.5 | 723,200 | 1,835,008 | 1.04737 | 1.07915 | 1.04735 | 1.07910 
With the inversion of the clock on the TPGs and ORAs to match the clock edge on which the 
DSPs are clocked, the BIST clock frequency no longer depends on the edge of the clock used to 
clock the DSPs. Configuration #5 can therefore be changed to clock the DSPs on the falling edge of 
the clock, as this change might detect some of the undetected faults in Section 3.5. Table 3.9 
illustrates the increase in the bitstream file size caused by the inversion of the clock edge on the 
TPGs and the ORAs to match the clock edge of the DSPs for an SX55 device. From Table 3.9, it 
is observed that the file size also depends on the order in which the BIST configurations are 
generated. Table 3.10 illustrates the BIST configurations for Virtex-4 DSP BIST in the order in 
which they should be generated. 
Table 3.9 Download File Sizes (in Bits) for an SX55 Device 
Download # | DSPs in configurations #3 and #5 have falling edge clock: BIST # | Bits | Clock edge on TPGs and ORAs is matched with clock edge on DSPs: BIST # | Bits | Clock edge on TPGs and ORAs is matched with clock edge on DSPs: BIST # | Bits 
Download #1 (compressed) | 1 | 13294720 | 1 | 13294720 | 3 | 13294720 
Download #2 (partial reconfiguration) | 2 | 180800 | 2 | 180800 | 5 | 180800 
Download #3 (partial reconfiguration) | 3 | 180800 | 3 | 736704 | 1 | 736704 
Download #4 (partial reconfiguration) | 4 | 180800 | 4 | 736704 | 2 | 180800 
Download #5 (partial reconfiguration) | 5 | 180800 | 5 | 736704 | 4 | 180800 
Total | | 14017920 | | 15685632 | | 14573824 
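The dependence of the total file size on the download order can be reproduced with a simple cost model. The model is an inference from the figures in Table 3.9, not a description of the bitstream generation tools: a partial reconfiguration is assumed to cost 736704 bits whenever the TPG and ORA clock edge has to change relative to the previous download, and 180800 bits otherwise. Configuration numbers follow Table 3.5, with configurations #3 and #5 clocked on the falling edge. All names are illustrative.

```python
COMPRESSED_FIRST_DOWNLOAD_BITS = 13_294_720  # SX55, first (compressed) download
PARTIAL_BITS = 180_800                       # partial reconfiguration, DSP bits only
PARTIAL_WITH_EDGE_CHANGE_BITS = 736_704      # partial, TPG/ORA clock edge rewritten
FALLING_EDGE_CONFIGS = {3, 5}                # per Table 3.5 numbering


def total_bits(order, match_tpg_ora_edge=True):
    """Total bits downloaded for a sequence of BIST configurations.

    Assumed model: when the TPG/ORA clock edge tracks the DSP clock edge,
    every change of edge between consecutive downloads costs the larger
    partial reconfiguration.
    """
    total = COMPRESSED_FIRST_DOWNLOAD_BITS
    for previous, current in zip(order, order[1:]):
        edge_changed = (previous in FALLING_EDGE_CONFIGS) != (current in FALLING_EDGE_CONFIGS)
        if match_tpg_ora_edge and edge_changed:
            total += PARTIAL_WITH_EDGE_CHANGE_BITS
        else:
            total += PARTIAL_BITS
    return total


if __name__ == "__main__":
    print(total_bits([1, 2, 3, 4, 5], match_tpg_ora_edge=False))
    print(total_bits([1, 2, 3, 4, 5]))
    print(total_bits([3, 5, 1, 2, 4]))
```

Under this model, the order #3, #5, #1, #2, #4 needs only one change of clock edge instead of three, which accounts for the smaller total in the last column of the table.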
Table 3.10 BIST Configurations for Virtex-4 DSP BIST 
BIST Config # | Pipeline Registers | Signals Active Level | B Input Source, Slice 0 | B Input Source, Slice 1 | Test Modes Applied, Mult | Add | Casc | Clock edge of TPGs and ORAs 
1 | A/Breg=2, Others=1 | Low | Direct | Direct | #1 | #2 | - | Low 
2 | Preg=1, Others=0 | Low | Direct | Cascade | #3 | - | - | Low 
3 | All Regs = 0 | High | Direct | Direct | #4 | - | - | High 
4 | All Regs = 1 | High | Direct | Direct | #5 | #6 | - | High 
5 | Preg=1, Others=0 | High | Cascade | Direct | - | - | #7 | High 
3.7 Summary 
 A minimum set of BIST configurations was developed to test the DSP cores in Virtex-4 
FPGAs. Fault detection capabilities and fault diagnosis were verified by injecting faults into the 
configuration memory bits controlling the DSP cores in an FX12 device. DSP BIST was also 
able to detect faulty DSP cores in some of the SX35 and LX60 engineering sample parts. Fault 
coverage of 97.4% is achieved for the faults injected in the configuration memory of the DSP. 
The functional fault coverage as determined by fault simulations is much higher. Fault coverage 
for the faults injected in the configuration memory of the DSP can be improved to 100% by 
adding more BIST configurations if desired. Since the undetected faults are in non-functional 
modes of operation, the value of additional BIST configurations is questionable. Maximum BIST 
clock frequency was improved by changing the position of the TPG in the chip array. To further 
improve the BIST frequency on larger Virtex-4 devices, where the BIST frequency is less than 
50 MHz, sub-array testing is done. Sub-array testing also reduces the power dissipation caused 
by testing a large number of DSPs simultaneously, which can cause problems in the system. 
When the same clock edge is used on the TPGs, ORAs and DSPs for all configurations, the 
maximum BIST clock frequency is well over 50 MHz. However, sub-array testing may still be required 
for larger devices that have large numbers of DSPs to minimize power dissipation. 
 
Chapter 4 
BIST for DSP Cores in Virtex-5 FPGAs 
This chapter describes the implementation of BIST for DSPs in Virtex-5 FPGAs. The 
BIST architecture, along with the BIST configurations and test sequences for DSP cores in the 
FPGAs, are discussed. The chapter also discusses the retrieval of BIST results and the timing 
analysis of BIST for all Virtex-5 FPGAs. The chapter concludes by summarizing the 
experimental BIST results and the fault coverage achieved. 
4.1 BIST Approach for DSPs in Virtex-5 FPGAs 
 Since most of the features of DSPs in Virtex-5 FPGAs are similar to the features of DSPs 
in Virtex-4 FPGAs, as explained in Chapter 2, the test algorithms used to test the DSPs in 
Virtex-4 FPGAs can also be applied to test DSPs in Virtex-5 FPGAs. The additional features in 
DSPs of Virtex-5 FPGAs can be tested by making modifications to the TPG.  
4.1.1 Adder and Multiplier Tests 
The adder test algorithm described in Section 3.1.1 of Chapter 3 can be used to test the 
adder in Virtex-5 FPGAs as well. Although the adder in Virtex-5 FPGAs can be accessed 
through two 48-bit input ports and can be tested using a one clock cycle per vector approach, the 
two-clock cycle per vector approach described in Section 3.1.1 of Chapter 3 is used to be able to 
also test the accumulator register with the adder. The 5?3 and the 3?5 multiplier test algorithms 
described in Section 3.1.2 of Chapter 3 can be applied to test the multiplier in Virtex-5 FPGAs. 
Like the test sequences for DSPs in Virtex-4 FPGAs, the test sequences for DSPs in Virtex-5 
FPGAs are also 1024 clock cycles long and are divided into four groups of 256 clock cycles 
 54 
each. The adder and the multiplier test sequences are illustrated in Table 4.1.  During the first 
group of 256 clock cycles of the adder test sequence, the first stage of the adder is tested and, 
during the second group of 256 clock cycles, the second stage adder is tested. During the third 
and fourth groups of 256 clock cycles in the adder test sequence, other paths to the adder are 
tested. The 5×3 multiplier test algorithm is applied to the multiplier in slice 1 while the 3×5 
multiplier test algorithm is applied to the multiplier in slice 0 during the first group of 256 clock 
cycles in the multiplier test sequence. During the second group of 256 clock cycles in the 
multiplier test sequence, the 5×3 multiplier test algorithm is applied to the multiplier in slice 0 
and the 3×5 multiplier test algorithm is applied to the multiplier in slice 1. The 5×3 multiplier 
test algorithm replicates the five MSBs of the vector generated by the 8-bit 
counter and applies the replicated bits to the 30-bit A port, while the three LSBs of the 
same vector are replicated and applied to the 18-bit B port. In the 
3×5 multiplier test algorithm, the three LSBs of the vector generated by the 8-bit counter are 
replicated and applied to the 30-bit A port while the five MSBs of the same vector 
are replicated and applied to the 18-bit B port. Applying different 
multiplier test algorithms to slice 0 and slice 1 ensures that the A and B ports in the two slices 
receive different test patterns on every clock cycle, so that stuck-at faults on the multiplexers 
that select between the direct and cascade paths on the A and B ports of the DSP slice can be detected. 
During the third group of 256 clock cycles in the multiplier test sequence, the multiply-and-add 
function is tested, and during the fourth group the A port 
concatenated with the B port (A:B), which bypasses the multiplier, is tested. 
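The port assignments for the two multiplier test algorithms can be sketched in Python as follows. This is an illustrative model only, not the VHDL TPG: the function names are hypothetical, and the tiling order (copies stacked from the LSB upward, with the top copy truncated when the port width is not a multiple of the field width) is an assumption, since the text does not specify how the replicated bits are aligned.

```python
def replicate(value, width, out_width):
    """Tile a `width`-bit value from the LSB upward and keep the low `out_width` bits."""
    result = 0
    shift = 0
    while shift < out_width:
        result |= value << shift
        shift += width
    return result & ((1 << out_width) - 1)


def multiplier_vectors(count, mode):
    """Return (A, B) port values for one 8-bit counter value.

    mode "5x3": five MSBs -> 30-bit A port, three LSBs -> 18-bit B port.
    mode "3x5": three LSBs -> 30-bit A port, five MSBs -> 18-bit B port.
    """
    msb5 = (count >> 3) & 0x1F
    lsb3 = count & 0x7
    if mode == "5x3":
        return replicate(msb5, 5, 30), replicate(lsb3, 3, 18)
    if mode == "3x5":
        return replicate(lsb3, 3, 30), replicate(msb5, 5, 18)
    raise ValueError("mode must be '5x3' or '3x5'")
```

Calling `multiplier_vectors(count, "5x3")` for one slice and `multiplier_vectors(count, "3x5")` for the other with the same counter value gives the two slices different A and B patterns for most counter values, which is the property the cascade-select multiplexer test relies on.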
  
Table 4.1 Multiplier and Adder Test Sequences 
Test | First 256 ccs | Second 256 ccs | Third 256 ccs | Fourth 256 ccs 
Multiply | P = A×B | P = A×B | P = A×B+C | P = A:B+C 
Adder | P = Z(C); P = X(P)+Y(C) | P = Y(C); P = Y(C)+Z(P) | P = Z(C); P = Y(C)+Z(P) | P = Y(C); P = Y(C)+Z(ShiftP) 
4.1.2 Pattern Detector Test 
From the datasheet it is understood that the pattern detector mainly checks for the 
equality between the output of the adder/subtractor/logic unit and a pattern given by the user. 
This pattern can be given dynamically through the C port or statically through the configuration 
memory bits.  A mask field masks individual bits that are not considered during the comparison 
process [10]. Like the pattern, the mask can be given by the user dynamically through the C port 
or statically through the configuration memory bits. Multiplexers controlled by configuration 
memory bits select between dynamic or static patterns and masks. The pattern detector has two 
outputs, Patterndetect and Patternbdetect. The Patterndetect output goes to logic '1' if the output 
of the adder/subtractor/logic unit matches the user-defined pattern and the Patternbdetect output 
goes to logic '1' if the output of the adder/subtractor/logic unit matches the complement of the 
user-defined pattern. Although the datasheet does not explicitly define the architecture of the 
pattern detector, it states that the pattern detect logic performs a bitwise 
((P == pattern) || mask) and then ANDs the results to a single-bit result [10]. From this equation the 
architecture for the Patterndetect logic is reasoned to be as illustrated in Figure 4.1.  
Figure 4.1 illustrates a 4-bit architecture for the pattern detect logic where o[4:1] indicates 
the output of the adder/subtractor/logic unit, p[4:1] indicates the user-defined pattern and m[4:1] 
indicates the mask. The AND gate shown in Figure 4.1 can be completely tested by walking a 0 
through a field of 1s and applying the all-1s pattern. Table 4.2 illustrates the test vectors to test a 
4-bit pattern detect logic. 
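The datasheet equation quoted above can be written as a short behavioral model. The sketch below is only an illustration of the equation ((P == pattern) || mask) followed by an AND reduction; the function name and the `width` parameter are hypothetical and are introduced so that the 4-bit case of Figure 4.1 and the 48-bit case can share one definition.

```python
def pattern_detect(p, pattern, mask, width=48):
    """Return 1 when every unmasked bit of p equals the corresponding pattern bit."""
    full = (1 << width) - 1
    match = ~(p ^ pattern) & full      # bitwise XNOR: 1 where p and pattern agree
    masked = (match | mask) & full     # a mask bit of 1 forces that position to "match"
    return int(masked == full)         # AND reduction over all bit positions
```

For example, with `width=4`, `pattern_detect(0b1011, 0b1010, 0b0000, 4)` is 0 because bit 0 differs, while setting the mask to `0b0001` hides that bit and the result becomes 1.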
 
Table 4.2 Test Vectors for Testing the 4-bit Patterndetect Logic 
o1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
p1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 
o2 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 
p2 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 
o3 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 
p3 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 
o4 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 
p4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 
m[4:1] 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 
x1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 
x2 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 
x3 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 
x4 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 
y1 1 0/1 1 1 1 1 1 1 1 0/1 1 1 1 1 1 1 
y2 1 1 1 0/1 1 1 1 1 1 1 1 0/1 1 1 1 1 
y3 1 1 1 1 1 0/1 1 1 1 1 1 1 1 0/1 1 1 
y4 1 1 1 1 1 1 1 0/1 1 1 1 1 1 1 1 0/1 
Figure 4.1 Architecture for a 4-bit Patterndetect Logic [signals shown: inputs o1-o4, p1-p4 and m1-m4; internal nodes x1-x4 and y1-y4; output patterndetect] 
 From Table 4.2 it is observed that all four combinations of input patterns (00, 01, 10 and 
11) are applied to the XNOR gates. While these patterns are applied, the mask is set to 
patterns of alternating 0s and 1s that are applied statically through the configuration memory 
bits. This allows the application of all four combinations of patterns (00, 01, 10 and 11) on the 
inputs of the OR gates. The shaded portions in Table 4.2 illustrate all the vectors that are applied 
to the Patterndetect logic.  
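The o and p rows of Table 4.2 have the shape of a twisted-ring (Johnson) counter sequence over the interleaved bits o1, p1, ..., o4, p4: 1s are shifted in until the register is full, then 0s. The sketch below reproduces those rows and the x rows from that observation. It is a hedged illustration; the function names are hypothetical, and describing the sequence as a twisted-ring counter is an interpretation of the table rather than a statement about how the TPG is built.

```python
def johnson_sequence(n_bits=8):
    """Return the 2*n_bits states of a twisted-ring counter, starting from all 0s."""
    state = [0] * n_bits
    states = []
    for _ in range(2 * n_bits):
        states.append(list(state))
        # Shift toward the last bit and feed its complement back into the first bit.
        state = [1 - state[-1]] + state[:-1]
    return states


def xnor_rows(states):
    """x_i = XNOR(o_i, p_i) for the interleaved ordering o1, p1, o2, p2, ..."""
    rows = []
    for i in range(0, len(states[0]), 2):
        rows.append([1 - (s[i] ^ s[i + 1]) for s in states])
    return rows
```

Each (o_i, p_i) pair passes through 00, 10, 11 and 01 over the 16 vectors, so every XNOR gate sees all four input combinations and each x_i goes to 0 exactly twice, giving the walking 0 needed at the AND gate.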
The datasheet also mentions that the Patternbdetect logic performs the logic equation 
((P == ~pattern) || mask) [10]. Although duplicating the architecture shown in Figure 4.1 with 
XOR gates instead of XNOR gates would satisfy this equation, that implementation requires a large 
amount of logic. Hence, the architecture for the Patternbdetect logic is reasoned to be as illustrated in 
Figure 4.2. In Figure 4.2, o[4:1] indicates the output of the adder/subtractor/logic unit, p[4:1] 
indicates the user-defined pattern and m[4:1] indicates the mask. The NOR gate shown in Figure 
4.2 can be completely tested by walking a 1 through a field of 0s and applying the all-0s pattern. 
This is achieved by inverting any one of the input bits at the XNOR gates in Table 4.2. Table 4.3 
illustrates the test vectors for the Patternbdetect logic. 
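The equivalence between the datasheet equation and a shared-XNOR structure can be checked exhaustively for the 4-bit case. In the sketch below the gate arrangement (XNOR outputs ANDed with the inverted mask, then a NOR reduction) is an assumption inferred from the gate names in the text and the y rows of Table 4.3; the function names are hypothetical.

```python
def patternb_detect_equation(o, p, m, width=4):
    """Datasheet form: AND over all bits of ((P == ~pattern) || mask)."""
    full = (1 << width) - 1
    return int((((o ^ p) | m) & full) == full)


def patternb_detect_gates(o, p, m, width=4):
    """Assumed gate-level form: XNOR -> AND with inverted mask -> NOR."""
    y_bits = []
    for i in range(width):
        oi, pi, mi = (o >> i) & 1, (p >> i) & 1, (m >> i) & 1
        x = 1 - (oi ^ pi)            # XNOR gate, shared with the Patterndetect logic
        y_bits.append(x & (1 - mi))  # AND gate: a mask bit of 1 forces y to 0
    return int(not any(y_bits))      # NOR gate
```

By De Morgan's law the NOR of (XNOR AND NOT mask) equals the AND of (XOR OR mask), so the two functions agree for every input.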
 
Figure 4.2 Architecture for a 4-bit Patternbdetect Logic [signals shown: inputs o1-o4, p1-p4 and m1-m4; internal nodes x1-x4 and y1-y4; output patternbdetect] 
The set of test patterns illustrated in Table 4.3 applies all possible input combinations (00, 
01, 10 and 11) to the XNOR gates. The mask is set to patterns of alternating 0s and 1s 
applied statically through configuration memory bits. This ensures that the AND gates in Figure 
4.2 are completely tested. The shaded portions in Table 4.3 illustrate all the vectors that are 
applied to the Patternbdetect logic.  
Table 4.3 Test Vectors for Testing the 4-bit Patternbdetect Logic 
o1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 
p1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 
o2 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 
p2 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 
o3 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 
p3 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 
o4 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 
p4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 
m[4:1] 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 
x1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 
x2 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 
x3 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 
x4 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 
y1 0 1/0 0 0 0 0 0 0 0 1/0 0 0 0 0 0 0 
y2 0 0 0 1/0 0 0 0 0 0 0 0 1/0 0 0 0 0 
y3 0 0 0 0 0 1/0 0 0 0 0 0 0 0 1/0 0 0 
y4 0 0 0 0 0 0 0 1/0 0 0 0 0 0 0 0 1/0 
  Figure 4.3 illustrates the TPG architecture used to generate test vectors for the pattern 
detector. A 49-bit shift register and a 47-bit shift register are cascaded to form a 96-bit shift 
register. A transition to either logic '1' or logic '0' on the MSB of the 49-bit shift register 
enables the 47-bit shift register. The 49-bit shift register shifts first until there is a 
transition to either logic '1' or logic '0' on its MSB; from then onwards the 49-bit shift 
register is enabled only when the MSB of the 47-bit shift register undergoes a transition to 
logic '1' or logic '0'. The 'cs' bit in Figure 4.3 is set to '0' to generate the test vectors illustrated 
in Table 4.2 and is set to '1' to generate the test vectors illustrated in Table 4.3. The 49-bit shift 
register is the same shift register that generates test patterns for the adder test. The 47-bit shift 
register is disabled during the adder test. 
 
In addition to the static and dynamic masks, the pattern detector can select its mask from 
two other masks, selected by the user-defined attributes 'mode1' and 'mode2'. These masks, used for 
rounding applications, are determined by the C-port input and change as the C-port input changes 
[10]. The 'mode1' attribute selects the mask to be the complement of the C-port input left shifted 
by 1 and the 'mode2' attribute selects the mask to be the complement of the C-port input left 
shifted by 2. Multiplexers select the final mask from four options: a dynamic mask given through 
the C-port, a static mask given through configuration memory bits, and the two other masks 
explained above. The multiplexers that choose the final mask and the final pattern are modeled 
as illustrated in Figure 4.4 [10]. The SEL_MASK user attribute, also a configuration memory bit, 
selects between the dynamic mask given through the C-port and the static mask given through 
the configuration memory bits.  The mask selected by the SEL_MASK user attribute is fed into a 
3-input multiplexer that selects between the mask selected by the SEL_MASK attribute and the 
'mode1' and 'mode2' masks [10]. The user attribute SEL_ROUNDING_MASK, the select 
signal to this 3-input multiplexer, is a combination of two configuration memory bits, CB1 
and CB2, as illustrated in Figure 4.5.  
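The two-level mask selection can be modeled as below. This is a sketch under stated assumptions: the string values ("C", "MASK", "SEL_MASK", "MODE1", "MODE2") follow the labels used for these attributes in this chapter, the function name is hypothetical, and the order of operations for the rounding masks (complement first, then left shift, truncated to 48 bits) is an assumption about the phrase "complement of the C-port input left shifted".

```python
WIDTH = 48
FULL = (1 << WIDTH) - 1


def select_mask(c, static_mask, sel_mask, sel_rounding_mask):
    """Return the final 48-bit mask seen by the pattern detector."""
    # First level: SEL_MASK chooses the dynamic (C port) or the static mask.
    base = c if sel_mask == "C" else static_mask
    # Second level: SEL_ROUNDING_MASK chooses that mask or a rounding mask.
    if sel_rounding_mask == "SEL_MASK":
        return base & FULL
    if sel_rounding_mask == "MODE1":
        return ((~c & FULL) << 1) & FULL   # assumed: complement, then shift by 1
    if sel_rounding_mask == "MODE2":
        return ((~c & FULL) << 2) & FULL   # assumed: complement, then shift by 2
    raise ValueError("unknown SEL_ROUNDING_MASK value")
```

Testing all paths through these multiplexers requires each of the four sources to be selected at least once with data that distinguishes it from the others, which is what the alternating static masks in the BIST configurations provide.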
Figure 4.3 TPG for the Pattern Detector [labels shown: reset, cs, EN, Sreg2, Sreg2i; a 49-bit shift register and a 47-bit shift register producing a 96-bit output to the A, B and C ports] 
The auto-reset feature of the pattern detector resets the output P register of the DSP slice 
when the user attribute AUTORESET_PATTERN_DETECT, which is also a configuration memory bit, 
is set to TRUE. The reset occurs on the clock cycle after a match is 
detected between the pattern and the output of the DSP slice, or if a pattern was detected on the 
previous clock cycle but is now not detected [10]. When AUTORESET_PATTERN_DETECT is set to 
FALSE, the output P register is not reset when 
one of these conditions is met. The user attribute AUTORESET_PATTERN_DETECT_OPTINV, 
also a configuration memory bit, selects the condition: MATCH selects the case in which 
a pattern is detected, and NOT_MATCH selects the case in which 
a pattern was detected on the previous clock cycle but is 
now not detected. The architecture of the auto-reset logic is explicitly defined in the datasheet as 
illustrated in Figure 4.6 [10].  
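The auto-reset behavior described above reduces to a small combinational function. The sketch below is a reading of the text and of the signal names in Figure 4.6, not a copy of the datasheet logic; the function and argument names are hypothetical, and the external reset input is called `rstp` after the RSTP label in the figure.

```python
def auto_reset(pd, pd_past, autoreset, optinv, rstp=0):
    """Return 1 when the P register is to be reset on the next clock cycle.

    pd, pd_past: current and previous-cycle Patterndetect values (0 or 1).
    autoreset:   AUTORESET_PATTERN_DETECT (True or False).
    optinv:      AUTORESET_PATTERN_DETECT_OPTINV, "MATCH" or "NOT_MATCH".
    rstp:        external P-register reset, ORed with the auto-reset term.
    """
    if optinv == "MATCH":
        condition = pd                      # pattern detected now
    elif optinv == "NOT_MATCH":
        condition = pd_past & (1 - pd)      # detected last cycle, not detected now
    else:
        raise ValueError("optinv must be 'MATCH' or 'NOT_MATCH'")
    return int((autoreset and condition) or rstp)
```

The four combinations of the two attributes correspond to the APD and APDO columns exercised by the pattern detector BIST configurations.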
 
 
Figure 4.4 Multiplexer Architecture for Selecting the Pattern and the Mask [10] [signals shown: dynamic pattern through C port, static pattern through configuration, select pattern; dynamic mask through C port, static mask through configuration, SEL_MASK, Mode1, Mode2, SEL_ROUNDING_MASK; output: mask] 
Figure 4.5 Detailed Multiplexer Architecture for Selecting Mask [signals shown: dynamic mask through C port, static mask through configuration, Sel_Mask, CB1, CB2, mode1, mode2; output: mask] 
 
 The overflow and underflow logic associated with the pattern detector is used to check 
for overflow or underflow beyond any particular bit position between 0 and 46. The mask 
determines the threshold for overflow or underflow while the pattern is set to 000…00 <47:0>. 
The value beyond which the output of the DSP slice overflows is 2^N - 1, where N is the number of 
ones in the mask field, and the value beyond which the output of the DSP slice underflows is the 
two's complement form of -2^N. The architecture of the overflow and underflow logic is 
explicitly defined in the datasheet as illustrated in Figure 4.7 [10]. 
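The thresholds 2^N - 1 and -2^N follow from the all-zeros pattern and a mask whose N low-order bits are 1. The sketch below illustrates this with N = 4. It assumes that overflow is flagged when Patterndetect was 1 on the previous cycle and both Patterndetect and Patternbdetect are now 0, and that underflow is the symmetric case; that arrangement is inferred from the signal names in Figure 4.7, and the function names are hypothetical.

```python
WIDTH = 48
FULL = (1 << WIDTH) - 1


def detect(p, pattern, mask):
    """AND reduction of ((P == pattern) || mask) over 48 bits."""
    return int(((~(p ^ pattern) | mask) & FULL) == FULL)


def overflow_underflow(prev_p, p, n_ones):
    """Return (overflow, underflow) for a step from prev_p to p (signed integers)."""
    mask = (1 << n_ones) - 1                       # N low-order mask bits set
    prev_p, p = prev_p & FULL, p & FULL            # two's complement, 48 bits
    pd_past, pbd_past = detect(prev_p, 0, mask), detect(prev_p, FULL, mask)
    pd, pbd = detect(p, 0, mask), detect(p, FULL, mask)
    overflow = pd_past & (1 - pd) & (1 - pbd)
    underflow = pbd_past & (1 - pbd) & (1 - pd)
    return overflow, underflow
```

With N = 4, Patterndetect is 1 for values 0 to 15 and Patternbdetect is 1 for values -16 to -1, so a step from 15 to 16 raises overflow and a step from -16 to -17 raises underflow.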
Figure 4.6 Auto Reset Logic [10] [signals shown: patterndetect, patterndetectpast, a multiplexer with inputs 0 and 1 selected by AUTORESET_PATTERN_DETECT_OPTINV, AUTORESET_PATTERN_DETECT, and an OR with external RSTP] 
Figure 4.7 Overflow and Underflow Logic [10] [signals shown: overflow from patterndetect, patterndetectpast and patternbdetect; underflow from patternbdetect, patternbdetectpast and patterndetect] 
 The BIST configurations illustrated in Table 4.4 completely test the pattern detector and 
its associated multiplexer paths, auto-reset, overflow and underflow logic. To apply the correct 
test vectors to the pattern detector, the output of the adder/subtractor/logic unit is set to be the same 
as the concatenated A and B ports given through the X multiplexer while applying 0s to the Y 
and Z multiplexers. The pattern detector test sequence is 1024 clock cycles long and during the 
entire sequence OPMODE bits <1:0> select the A concatenated with B input through the X 
multiplexer port while OPMODE bits <3:2> apply 0s to the Y multiplexer port and OPMODE 
bits <6:4> apply 0s to the Z multiplexer port. Static or dynamic patterns and masks are set as per 
the BIST configuration. APDO and APD in Table 4.4 indicate the Auto-Reset Pattern Detect Optinv 
and Auto-Reset Pattern Detect user attributes, respectively. 
Table 4.4 BIST Configurations for the Pattern Detector 
PSTATIC<47:0> MSTATIC<47:0> SEL_PATTERN SEL_MASK CB1 CB2 APDO APD 
0101..01 1111..11 static dynamic 0 0 0 1 
1111..11 0101..01 dynamic static 1 0 1 1 
1010..10 1111..11 static dynamic 0 0 0 0 
1111..11 1010..10 dynamic static 1 0 1 0 
x x dynamic dynamic 0 1 0 1 
x x dynamic dynamic 1 1 0 1 
An 8-bit pattern detector with its auto-reset, overflow and underflow logic, along with the 
multiplexers that select the mask and the pattern, was modeled in ASL (Auburn Simulation 
Language) and simulated using AUSIM (Auburn University Simulator) to determine the effectiveness of the test 
algorithm and the BIST configurations. Of the 568 single stuck-at gate-level faults and the 2276 
bridging faults generated using AUSIM, the BIST configurations with the test vectors illustrated 
in Tables 4.2 and 4.3 detect all detectable single stuck-at and bridging faults. 
4.1.3 ALU Logic Mode Test 
 As described in Section 2.2 of Chapter 2, the DSPs in Virtex-5 FPGAs have an ALU 
logic mode of operation controlled by ALUMODE control signals and OPMODE bits <3:2>. 
The ALU logic unit in the DSPs performs bitwise logical XOR, XNOR, OR, AND, NAND, 
NOR and NOT operations on two 48-bit inputs as described in Table 2.3 of Chapter 2 [10]. 
When the logic mode is used in the DSP slices, the multiplier is not used and the 30-bit A and 
18-bit B input ports of the DSP slice are concatenated to form a 48-bit input to the logic unit. The 
other 48-bit input is provided by the C input port of the DSP slice [10]. The ALUMODE test 
sequence is 1024 clock cycles long and during the entire test sequence the OPMODE bits <1:0> 
select the A concatenated with B input through the X multiplexer and the OPMODE bits <6:4> 
select the C input through the Z multiplexer.  
 The ALU logic mode can be tested using the same 8-bit counter that is used to test the 
multiplier in the DSPs. During a 256 clock cycle period the ALUMODE is tested for all the 
logical operations described in Table 2.3 of Chapter 2. As described in Table 2.2 of Chapter 2, the 
ALUMODE values '0000' and '0011' correspond to the adder and subtractor functions that are 
tested during the adder test sequence; the remaining adder/subtractor functions in Table 2.2, which 
correspond to ALUMODE values '0001' and '0010', are tested during the ALUMODE test 
sequence as well. Counter<0> generates test patterns for the CARRYIN input of the 
adder/subtractor unit. Counter<1> generates test patterns for the A concatenated with B input of the 
adder/subtractor/logic unit and counter<2> generates test patterns for the C input of the 
adder/subtractor/logic unit. Counter bits <6:3> generate values for the ALUMODE control bits 
and counter<7> generates values for the control bit OPMODE<3>, while the control bit 
OPMODE<2> is held constant at logic '0'. During the 256 clock cycle period all possible 
combinations of test vectors (00, 01, 10 and 11) are applied to the two inputs of the logic unit. 
Bridging faults in the logic unit can be tested by increasing the width of the counter to 10 bits 
and using the additional bits to control every other bit of the two inputs to the logic unit. 
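The counter-bit assignment above can be expressed as a decode function, which also makes the coverage claim easy to check. This is a sketch only; the function name is hypothetical, and replicating counter<1> and counter<2> across all 48 bits of the A:B and C inputs is an assumption, since the text states only that each bit "generates test patterns" for its input.

```python
FULL48 = (1 << 48) - 1


def alu_test_signals(count):
    """Decode one 8-bit counter value into the ALU logic mode test signals."""
    ab_bit = (count >> 1) & 1
    c_bit = (count >> 2) & 1
    return {
        "carryin": count & 1,              # counter<0>
        "ab_bit": ab_bit,                  # counter<1>
        "c_bit": c_bit,                    # counter<2>
        "ab": FULL48 if ab_bit else 0,     # assumed: bit replicated across A:B
        "c": FULL48 if c_bit else 0,       # assumed: bit replicated across C
        "alumode": (count >> 3) & 0xF,     # counter<6:3>
        "opmode3": (count >> 7) & 1,       # counter<7>
        "opmode2": 0,                      # held at logic 0
    }
```

Iterating `count` over 0 to 255 visits every ALUMODE value with both settings of OPMODE<3>, and for each of those 32 control settings the pair (A:B, C) takes all four values 00, 01, 10 and 11.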
  
4.1.4 Cascade Mode Test 
 The DSP slices in Virtex-5 FPGAs can be cascaded to implement larger multipliers and 
adders used for extended MACC (Multiply and Accumulate) functions. Unlike the DSPs in 
Virtex-4 FPGAs where the B and P ports of a DSP slice can be cascaded, the A, B and P ports of 
a DSP slice can be cascaded in Virtex-5 FPGAs. The sign bit of the multiplier output can also be 
cascaded in Virtex-5 FPGAs. The choice between direct and cascade paths for the A and B ports 
is made by configuration memory bits [10]. The A and B ports can have up to two pipeline 
registers in their direct and cascade paths. The number of pipeline registers in these paths is 
determined by configuration memory bits. The user attributes A_INPUT and B_INPUT select 
between direct and cascade paths of the A and B ports. The user attributes AREG, ACASREG, 
BREG and BCASREG select the number of pipeline registers. The acceptable combinations of 
values for the user attributes that select the number of pipeline registers are summarized in Table 
4.5. The architecture of multiplexers that choose between direct and cascade paths and the 
number of pipeline registers in these paths is illustrated in Figure 4.8 [10]. 
Table 4.5 Values for A and B Pipeline Registers [10] 
Areg and Breg of DSP slice that cascades to the slice above | Areg and Breg of DSP slice that receives cascade inputs from slice below 
0 | 0 
1 | 1 
2 | 1, 2 
From Figure 4.8 it is seen that the A and B cascade paths in the DSP slice are tested when 
all paths through MUX 4 are tested. This is achieved by setting the values of user attributes 
AREG and BREG to '2' while setting the values of user attributes ACASREG and BCASREG to '2', 
and by setting the values of user attributes AREG and BREG to '1' while setting the values of user 
attributes ACASREG and BCASREG to '1'.  The DSP slices in Virtex-5 FPGAs are cascaded in 
pairs. Since slice 1 and slice 0 are individually configured to operate in cascade mode, a total of 
four BIST configurations are required to test the cascade mode of operation.  
 
4.1.5 SIMD Mode Test 
 The SIMD mode of operation in Virtex-5 FPGAs, as described in Section 2.2 of Chapter 
2, splits the 48-bit adder into two 24-bit adders or four 12-bit adders. The USE_SIMD user 
attribute selects between these three adder architectures [10]. Hence it can be understood that the 
48-bit adder in Virtex-5 FPGAs is constructed from four 12-bit CLA adders, where each 12-bit 
CLA adder is constructed from three basic CLA adder units described in Section 2.3 of Chapter 
2. In the two 24-bit adder architecture mode, configuration bits block the propagation of the 
carryout signal between the second and the third 12-bit adders [10]. In the four 12-bit adder 
architecture mode the carryout signals are not allowed to propagate between individual 12-bit 
adders and are blocked by configuration bits. In the one 48-bit adder architecture mode, the 
propagation of carryout signals between individual 12-bit adders is not blocked by configuration 
bits [10].  
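The effect of blocking the carry between 12-bit or 24-bit segments can be illustrated with a behavioral model of the three adder architectures. The sketch below is not the CLA structure itself; the function name is hypothetical and the mode strings simply mirror the three USE_SIMD settings described above.

```python
SEGMENTS = {"ONE48": 1, "TWO24": 2, "FOUR12": 4}


def simd_add(a, b, mode="ONE48"):
    """Add two 48-bit values with carries blocked at the SIMD segment boundaries."""
    n = SEGMENTS[mode]
    seg_width = 48 // n
    seg_mask = (1 << seg_width) - 1
    result = 0
    for i in range(n):
        shift = i * seg_width
        s = ((a >> shift) & seg_mask) + ((b >> shift) & seg_mask)
        result |= (s & seg_mask) << shift   # the carry out of each segment is dropped
    return result
```

Adding 1 to 0xFFF gives 0x1000 in the one-48 and two-24 modes, where the carry crosses the first 12-bit boundary, but 0 in the four-12 mode, where that carry is blocked. This is why exercising the one-48 and four-12 modes covers both states of every carry-blocking configuration bit.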
Figure 4.8 Multiplexer Architecture that Selects Between Direct and Cascade Paths of A and B Ports [10] [labels shown: inputs A/B and Acin/Bcin; MUX1, MUX2, MUX3 and MUX4; two pipeline registers, each with RST and CE; output to the multiplier] 
 By testing the SIMD mode of operation in the one-48 and four-12 modes, all 
configuration bits that select between the three adder architectures will have been tested. The 
SIMD 'one 48' mode is tested during configurations #2 and #3 as summarized in Table 4.8. 
Configurations #2 and #3 also test the functionality of the adder. The SIMD 'four 12' mode is 
tested during configuration #8, when the DSPs are in cascade mode of operation, as summarized 
in Table 4.8. Since the functionality of the adder is already tested in configurations #2 and #3, 
the faults associated with the SIMD 'four 12' mode can be tested while testing the cascade 
mode of operation without adding another configuration just for testing the adder in the SIMD 
'four 12' mode of operation. 
4.1.6 MACC Extend Mode Test 
Cascade signals CARRYCASCIN, CARRYCASCOUT, MULTSIGNIN and 
MULTSIGNOUT are internal to the DSP and are used to build larger adders and multipliers.  
The most significant bit (MSB) of the multiplier output functions as MULTSIGNOUT and is 
used in MACC extension applications to build a 96-bit MACC. The MULTSIGNOUT bit is 
cascaded to the DSP slice above through its corresponding MULTSIGNIN port. The carryout 
from the DSP slice below is also added along with the MULTSIGNOUT bit [10]. The test 
patterns used to test the multiplier are also applied to test the MACC extend feature. The MACC 
extend feature is tested during configuration #2 as summarized in Table 4.8.  
4.2 BIST Architecture  
 The BIST architecture used for DSPs in Virtex-5 FPGAs is the same as the BIST architecture 
used for DSPs in Virtex-4 FPGAs, as described in Section 3.2 of Chapter 3. Two TPGs are used 
to drive alternate rows of DSP tiles so that a faulty TPG can be detected. Both slices of a DSP 
tile are driven by the same TPG for individual control of slices during cascade mode of 
operation. The control register that controls the TPG in Virtex-5 FPGAs is five bits wide. Of the five 
bits, the three LSBs, called MODE2, MODE1 and MODE0, control the test mode of the TPG. 
The second MSB, called INVCS, controls the active levels of control signals such as the 
OPMODE bits, the ALUMODE bits and the carryin input to the adder unit. The 
MSB provides a global reset to the TPG. The control register is implemented in a CLB and the 
values for the control register are shifted in through the Boundary Scan interface, where the LSB 
is shifted in first. The control register values for resetting the TPG and for the various test modes 
are illustrated in Table 4.6. Each DSP slice is observed by two sets of ORAs. Similarly 
configured slices of the DSP are compared using column-based circular comparison.  The ORAs 
that observe the bottom-most DSP slice in a column of DSPs are clock enabled so as to mask the 
failure indications due to the unconnected cascade inputs on the bottom-most DSP slice in a 
column of DSPs.  
 The architecture of ORAs that observe DSP slices in Virtex-5 FPGAs is different from 
the architecture of ORAs that observe DSP slices in Virtex-4 FPGAs. Since the CLBs in Virtex-5 
FPGAs have 6-input look-up tables, each look-up table can compare two different output bits of 
two individual DSP slices, unlike the CLBs in Virtex-4 FPGAs, where each look-up table 
compares only one output bit from two individual DSP slices. If either of the two bits being 
compared mismatches, the ORA outputs a logic '0'. The ORA architecture for DSPs in Virtex-5 
FPGAs is illustrated in Figure 4.9 [24]. A DSP slice in a Virtex-5 FPGA has 56 outputs. Figure 
4.10 illustrates the inputs and outputs associated with a DSP slice. Since each CLB slice has four 
6-input look-up tables, seven CLB slices are needed to analyze all 56 outputs. The ORAs that 
analyze the DSPs under test are placed in two columns of CLBs beside the column of DSPs. The 
bottom slices, slice 0s, of the DSP tiles are compared by the first column of ORAs and the top 
slices, slice 1s, of the DSP tiles are compared by the second column of ORAs. The ORAs that 
analyze each DSP slice under test occupy five rows in a single column of CLBs. Since only 
seven CLB slices are required to analyze all the DSP outputs and each row has two CLB slices, 
the remaining three CLB slices are dummy ORAs and do not analyze a DSP under test but 
generate the carry logic used to construct the iterative OR chain. 
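The slice count above follows from simple arithmetic, and the comparison performed by each look-up table can be written as a one-line function. The sketch below restates both; the names are hypothetical, and the LUT function models only the comparison described in the text (output '0' on any mismatch), not the pass/fail latching or the carry chain.

```python
import math

DSP_OUTPUTS = 56       # outputs per Virtex-5 DSP slice
BITS_PER_LUT = 2       # output bits compared by one 6-input LUT
LUTS_PER_SLICE = 4     # 6-input LUTs per CLB slice
ROWS_PER_DSP_SLICE = 5 # CLB rows occupied by the ORAs of one DSP slice
SLICES_PER_ROW = 2     # CLB slices per row


def ora_slice_budget():
    """Return (CLB slices needed for comparison, CLB slices left as dummy ORAs)."""
    needed = math.ceil(DSP_OUTPUTS / (BITS_PER_LUT * LUTS_PER_SLICE))
    available = ROWS_PER_DSP_SLICE * SLICES_PER_ROW
    return needed, available - needed


def ora_lut(out_i_dsp_j, out_i_dsp_k, out_k_dsp_j, out_k_dsp_k):
    """Compare output bits i and k of DSP slices j and k; 0 means a mismatch."""
    return int(out_i_dsp_j == out_i_dsp_k and out_k_dsp_j == out_k_dsp_k)
```

`ora_slice_budget()` gives seven comparison slices and three dummy slices, matching the figures quoted in the text.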
Table 4.6 Control Register Values for TPG Control 
Control Register Values <4:0> 
RESET | INVCS | MODE2 | MODE1 | MODE0 | Operation 
1 | X | X | X | X | Resets TPG 
0 | 1 | X | X | X | Inverts active level of control signals 
0 | X | 0 | 0 | 0 | Sets the multiplier test mode 
0 | X | 0 | 0 | 1 | Sets the adder test mode 
0 | X | 0 | 1 | 0 | Sets the logic test mode 
0 | X | 0 | 1 | 1 | Sets the pattern detector test mode 
0 | X | 1 | 0 | 0 | Sets cascade test mode 
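Table 4.6 can be read as a decode of the 5-bit control register. The sketch below is illustrative; the function names and the dictionary layout are hypothetical, and only the five mode encodings listed in Table 4.6 are named (any other MODE value is reported as not listed).

```python
TEST_MODES = {
    0b000: "multiplier",
    0b001: "adder",
    0b010: "logic",
    0b011: "pattern detector",
    0b100: "cascade",
}


def decode_control(value):
    """Decode control register bits <4:0> = RESET, INVCS, MODE2, MODE1, MODE0."""
    return {
        "reset": bool((value >> 4) & 1),   # MSB: global reset of the TPG
        "invcs": bool((value >> 3) & 1),   # inverts active level of control signals
        "mode": TEST_MODES.get(value & 0b111, "not listed in Table 4.6"),
    }


def shift_in_order(value):
    """Bits in the order they are shifted in through Boundary Scan (LSB first)."""
    return [(value >> i) & 1 for i in range(5)]
```

For example, the value 01001 selects the adder test mode with the active levels of the control signals inverted.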
Figure 4.11 shows the orientation of the ORAs with respect to the DSP slices under test 
in Virtex-5 FPGAs. The dedicated carry logic in the CLBs of Virtex-5 FPGAs can also be used 
to create an iterative OR chain of ORAs where the Test Data In (TDI) line of the Boundary Scan 
module is connected to the carry-in of the first CLB in the column of ORAs and the carry-out of 
the last CLB in the last column of ORAs is connected to the Test Data Out (TDO) line of the 
Boundary Scan module. This provides a single bit pass/fail result for the entire test. The TDO 
line goes to logic '1' when any one ORA in the iterative OR chain fails. Individual ORA results 
can be read back to determine which ORAs have failed the test when the overall test fails. 
 
Figure 4.9 ORA Architecture [24] [labels shown: LUT inputs Outi DSPj, Outi DSPk, Outk DSPj and Outk DSPk; Pass/Fail; ORACE; a 0/1 multiplexer between Carry in and Carry out] 
 Figure 4.12 illustrates the general architecture of the TPG used to test DSPs in Virtex-5 
FPGAs. The TPG for DSP BIST comprises a 10-bit counter, whose two MSBs are used for 
the individual control of the four 256 clock cycle test groups during the 1024 clock cycle test 
sequence; a 49-bit shift register for the adder test; a finite state machine (FSM) for control of the 
OPMODE control signals; and two 10-bit linear feedback shift registers (LFSRs) for generating 
pseudorandom signals for testing the resets and clock enables on the pipeline registers as well as 
other control signals in the DSP slice that choose the source for the carryin input of the adder 
unit. The eight least significant bits of the counter are used to apply test patterns to the multiplier 
and the ALU logic mode. The 49-bit shift register, along with a 47-bit shift register, generates test 
vectors for the pattern detector. The TPG is written in VHDL (580 lines of code) and is 
synthesized into 99 CLB slices.  
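The division of the 10-bit counter into a group select and an 8-bit test pattern can be sketched as follows. The function name is hypothetical, and the list of multiplier functions is copied from the multiplier row of Table 4.1 purely to show how the two MSBs would index the four 256 clock cycle groups.

```python
MULTIPLY_GROUPS = ["P = A×B", "P = A×B", "P = A×B+C", "P = A:B+C"]


def split_counter(count):
    """Split a 10-bit count into (group index from the two MSBs, 8-bit pattern)."""
    count &= 0x3FF
    return count >> 8, count & 0xFF


def multiply_function(count):
    """Function applied in the multiplier test sequence at this clock cycle."""
    group, _ = split_counter(count)
    return MULTIPLY_GROUPS[group]
```

Counting from 0 to 1023 steps the 8-bit pattern through all 256 values once in each of the four groups.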
 
 
Figure 4.10 I/O of a Virtex-5 DSP Slice [inputs: A [0:30], B [0:18], C [0:48]; outputs: P [0:48] (DSP [0:48]), CARRYOUT [3:0] (DSP [49:52]), PATTERNDETECT (DSP [53]), PATTERNBDETECT (DSP [54]), OVERFLOW (DSP [55]), UNDERFLOW (DSP [56])] 
Figure 4.11 DSP ORA Orientation in Virtex-5 FPGAs [1st column of ORAs: DSP0(0:48), DSP0(49:56) and dummy ORAs, observing DSP0 (Slice 0); 2nd column of ORAs: DSP1(0:48), DSP1(49:56) and dummy ORAs, observing DSP1 (Slice 1)] 
 
4.3 BIST Configurations and Sequences 
The DSP cores are tested in eleven BIST configurations. During these eleven BIST 
configurations, the DSP configuration memory bits are tested in all configurable combinations. 
Table 4.7 summarizes the BIST configurations for the pattern detect logic explained in Section 
4.1.2. Table 4.8 illustrates all the BIST configurations that are downloaded to test the DSP cores. 
In Table 4.8, column 1 indicates the BIST configuration download number, column 2 indicates 
the number of pipeline registers in the I/O paths of the DSP slice and the values for various user 
attributes, column 3 indicates the active level of the DSP slice control signals, column 4 indicates 
whether the A or B port of the DSP slice is in cascade or direct mode of operation and column 5 
indicates the test modes that are being tested for each configuration. BIST configurations #2 and 
#3 are repeated five times to test the various modes of operation, as illustrated in Table 4.8. 
Hence the DSPs are tested in 11 BIST configurations and 20 BIST sequences. The multiplier and 
MACC extend tests run during configuration #3 do not functionally test the multiplier and the 
MACC extend feature, since the USE_MULT attribute is set to 'none' in this configuration to 
detect faults associated with this attribute. However, the multiplier and the MACC extend feature 
are functionally tested during configuration #2. 
Figure 4.12 TPG Architecture [blocks shown: TPG containing the counter, 96-bit shift register, FSM and LFSR; the TPG drives the A, B and C ports and the OpMode, ALUMode and Control inputs of DSP slice 0 and DSP slice 1; the outputs of each slice, including the P port, go to the ORAs; bus widths shown: 56, 48, 48, 7, 23 and 4] 
Table 4.7 BIST Configurations for Pattern Detect Logic 
User Attribute Values | BIST Configuration # 1 | 2 | 3 | 4 | 5 | 6 
Sel_Mask | C | C | Mask | Mask | C | C 
Sel_Rounding_Mask | Sel_Mask | Sel_Mask | Sel_Mask | Sel_Mask | Mode1 | Mode2 
Sel_Pattern | Pattern | Pattern | C | C | C | C 
Auto_Reset_Pattern_Detect | False | False | True | True | True | False 
Autoreset_Pattern_Detect_Optinv | Match | Match | Match | Not_Match | Match | Not_Match 
Mask <47:0> | FF…FF | FF…FF | 55…55 | AA…AA | 55…55 | 55…55 
Pattern <47:0> | 55…55 | AA…AA | FF…FF | FF…FF | FF…FF | FF…FF 
A total of 20 test sequences are applied in eleven downloads to the FPGA. The FPGAs 
are repeatedly reconfigured and tested until they have been tested in all modes of operation. 
BIST sequences for testing the DSPs in Virtex-5 devices are summarized in Table 4.9 below. 
The variables used in Table 4.9 are explained in Table 4.10. The download time can be 
minimized using partial reconfiguration, through which only the portions of the configuration 
memory that contain DSP configuration memory bits are written instead of writing the whole 
configuration memory. This can be done by maintaining constant placement of the TPGs, the 
ORAs and the DSPs and by keeping the routing constant between them.  
Table 4.8 BIST Configurations for Virtex-5 DSPs 
BIST # | Pipeline registers and attribute values | Signal active level | A & B input source: Slice0 | Slice1 | Test modes applied: Mult | Add | ALU | Pat Det | Casc | MACC Extend | SIMD/casc 
1 | all regs=0, use_mult=mult, use_SIMD=one48, use_patdet=no_patdet | H | D | D | yes | no | yes | no | no | no | no 
2 | all regs=1, use_mult=mult_s, use_SIMD=one48, use_patdet=patdet, patdet config #1 [T4.7] | H | D | D | yes | yes | yes | yes | no | yes | no 
3 | all regs=1, use_mult=none, use_SIMD=one48, use_patdet=patdet, patdet config #2 [T4.7] | L | D | D | no | yes | yes | yes | no | no | no 
4 | all regs=1, use_mult=mult_s, use_SIMD=one48, use_patdet=patdet, patdet config #3 [T4.7] | H | D | D | no | no | no | yes | no | no | no 
5 | all regs=1, use_mult=mult_s, use_SIMD=one48, use_patdet=patdet, patdet config #4 [T4.7] | H | D | D | no | no | no | yes | no | no | no 
6 | all regs=1, use_mult=mult_s, use_SIMD=one48, use_patdet=patdet, patdet config #5 [T4.7] | H | D | D | no | no | no | yes | no | no | no 
7 | all regs=1, use_mult=mult_s, use_SIMD=one48, use_patdet=patdet, patdet config #6 [T4.7] | H | D | D | no | no | no | yes | no | no | no 
8 | Areg=2, Breg=2, Acascreg=2, Bcascreg=2, Preg=1, all other reg=0, use_mult=none, use_SIMD=four12, use_patdet=no_patdet | H | D | C | no | no | no | no | yes | no | yes 
9 | Areg=2, Breg=2, Preg=1, Acascreg=1, Bcascreg=1, all other reg=0, use_mult=mult, use_SIMD=one48, use_patdet=no_patdet | H | D | C | no | no | no | no | yes | no | no 
10 | Areg=2, Breg=2, Preg=1, Acascreg=2, Bcascreg=2, all other reg=0, use_mult=mult, use_SIMD=one48, use_patdet=no_patdet | H | C | D | no | no | no | no | yes | no | no 
11 | Areg=2, Breg=2, Preg=1, Acascreg=1, Bcascreg=1, all other reg=0, use_mult=mult, use_SIMD=one48, use_patdet=no_patdet | H | C | D | no | no | no | no | yes | no | no 
Table 4.9 BIST Sequences for Virtex-5 DSP 
Test Mode | First 256 ccs | Second 256 ccs | Third 256 ccs | Fourth 256 ccs 
Multiply (000) | P=A×B (5×3) | P=A×B (3×5) | P=A×B+C (5×3) | P=A:B+C (3×5) 
Adder (001) | P=Z(C); P=X(P)+Y(C) | P=Y(C); P=X(P)+Z(C) | P=Z(C); P=Y(C)+Z(P) | P=Y(C); P=Y(C)+Z(ShiftP) 
ALU (010) | P=X(A:B) op Z(C) (op denotes a bitwise logic operation such as XOR, OR or AND) 
Pattern Detect (011) | P=X(A:B) == Z(C) ('==' indicates comparison) 
Cascade (100) | P1=A:B+Z(PC); P0=Z(C) | P1=A:B+Z(ShiftPC); P0=Z(C) | P1=Z(C); P0=A:B+Z(PC) | P1=Z(C); P0=A:B+Z(ShiftPC) 
MACC extend (101) | P0=Z(P)+Y(A×B)+X(A×B); P1=Z(P) (P=PC+MULTSIGNIN+CARRYCASCOUT) | P0=Z(P) (P=PC+MULTSIGNIN+CARRYCASCOUT); P1=Z(P)+Y(A×B)+X(A×B) 
SIMD/cascade (110) | P0=Z(C); P1=Z(PC)+X(A:B) | P1=Z(PC)+X(A:B); P0=Z(C) 
Table 4.10 Variables in Table 4.9
Variables | Explanation
A, B & C | Input ports to the DSP
P | Output port of the DSP
Z(C), Y(C) | C input fed to the adder through the Z and Y multiplexers, respectively
Z(P), X(P) | P output of the DSP fed back to the adder through the Z and X multiplexers, respectively
Z(ShiftP) | P output shifted by 17 bits and fed back through the Z multiplexer
PC | Cascaded P output from the DSP slice below
Z(PC) | Cascaded P output from the DSP slice below, fed through the Z multiplexer
Z(ShiftPC) | Cascaded P output from the DSP slice below, shifted by 17 bits and fed through the Z multiplexer
P1 | Output P corresponding to slice 1
P0 | Output P corresponding to slice 0
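
The adder-mode entries in Table 4.9 can be read as selections on the X, Y and Z multiplexers that feed a 48-bit adder. The following is a minimal behavioral sketch in Python, not the TPG used in this thesis and not a Xilinx model. The function names, the sample values, the 30-bit/18-bit split of A:B and the use of a sign-extending right shift for the 17-bit shift are assumptions made for illustration.

```python
MASK48 = (1 << 48) - 1  # P, C and the adder are treated as 48 bits wide


def to_signed48(value):
    """Interpret a 48-bit pattern as a two's-complement integer."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def shift17(value):
    """Z(ShiftP) or Z(ShiftPC): right shift by 17 bits (assumed sign-extending)."""
    return (to_signed48(value) >> 17) & MASK48


def concat_ab(a, b):
    """A:B, assuming a 30-bit A placed above an 18-bit B (48 bits total)."""
    return ((a & ((1 << 30) - 1)) << 18) | (b & ((1 << 18) - 1))


def adder(x=0, y=0, z=0):
    """P = X + Y + Z truncated to 48 bits; an unselected multiplexer supplies 0."""
    return (x + y + z) & MASK48


# Hypothetical values stepping through some adder-mode entries of Table 4.9.
c = 5
p = adder(z=c)                 # P = Z(C)
p = adder(x=p, y=c)            # P = X(P) + Y(C)
print(p)  # 10
p = 1 << 20                    # pretend P currently holds this value
p = adder(y=c, z=shift17(p))   # P = Y(C) + Z(ShiftP)
print(p)  # 13
```

The cascade entries follow the same pattern with PC in place of P, for example `adder(x=concat_ab(a, b), z=pc)` for P1=A:B+Z(PC).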
4.4 BIST Generation 
 The TPG, written in VHDL, is synthesized for an LX30T device and the location of the
TPG in the chip array is determined. The TPG location for all other devices is offset with respect
to the location of the TPG in the LX30T device based on the number of rows and columns in
the individual devices. The synthesized TPG in NCD format is converted to XDL format. Two
programs written in C generate the DSP BIST configurations for any size or family of Virtex-5
devices. The first program, V5DSPBIST.exe, extracts the TPG from the synthesized TPG file in XDL
format (using the same TPGXDLEXT.exe program developed for Virtex-4), instantiates the
TPGs, the ORAs and the DSPs under test, and generates a DSP BIST template in XDL format.
The DSP BIST template is converted to NCD format for routing the BIST architecture. The routed
DSP BIST template is converted back to XDL format to be used by the second program,
V5DSPMOD.exe, which modifies the routed DSP BIST template to generate the BIST
configurations in XDL format. These BIST configuration files are converted back to NCD format
to generate the download configuration bit files. The NCD files of the DSP BIST template and
the BIST configurations can be viewed in FPGA Editor.
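
The generation flow in this section alternates between the NCD and XDL formats. As a rough illustration only, the sketch below records the steps described above as data and checks that each step consumes the format produced by the step before it. The step descriptions paraphrase the text; the tuple layout, the function names and the label "BIT" for the download bit files are assumptions, and no Xilinx tool is invoked.

```python
# Each step: (description, input format, output format), paraphrased from the text.
FLOW = [
    ("synthesize the TPG written in VHDL", "VHDL", "NCD"),
    ("convert the synthesized TPG", "NCD", "XDL"),
    ("V5DSPBIST.exe: build the DSP BIST template", "XDL", "XDL"),
    ("convert the template for routing", "XDL", "NCD"),
    ("route the BIST architecture", "NCD", "NCD"),
    ("convert the routed template", "NCD", "XDL"),
    ("V5DSPMOD.exe: generate the BIST configurations", "XDL", "XDL"),
    ("convert the BIST configurations", "XDL", "NCD"),
    ("generate the download bit files", "NCD", "BIT"),
]


def check_chain(flow):
    """Return True if every step's input format matches the prior step's output."""
    return all(prev[2] == cur[1] for prev, cur in zip(flow, flow[1:]))


def count_conversions(flow):
    """Count the steps that only translate between NCD and XDL."""
    return sum(1 for _, src, dst in flow if {src, dst} == {"NCD", "XDL"})


print(check_chain(FLOW))        # True
print(count_conversions(FLOW))  # 4
```

Listing the flow this way makes the four format conversions explicit: two around template generation and routing, and two around the modification program.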
4.5 Timing Analysis of BIST 
 Timing analysis was performed on some of the Virtex-5 devices to determine the slowest
BIST configuration. The timing analysis shows that the position of the TPGs and the ORAs
affects the BIST clock frequency. Figure 4.13 illustrates the maximum BIST clock frequency for
all Virtex-5 devices, based on the position of the TPGs and the ORAs, when the full array of the
chip is tested. Figure 4.13 shows that the clock frequency is higher when the TPGs are placed at
the middle of the chip array than when they are placed at the bottom of the chip array. For all
devices except the LXT155, LXT330, SXT50 and SXT95 devices, the clock frequency is also
higher when the ORAs are placed to the right of the DSPs than when they are placed to the left of
the DSPs. The lower clock frequency when the ORAs are placed on the left of the DSPs is due to
fewer routing resources between the TPGs and the DSPs. Moving the ORAs to the right of the
DSPs reduces the routing congestion in all devices except the LXT155, LXT220, SXT50 and
SXT95 devices. Based on these results, the TPGs are placed at the middle of the array and the
ORAs are placed to the right of the DSPs. For the larger devices, such as the LXT110, LXT155,
LXT330, SXT50 and SXT95 devices, the BIST clock frequency is less than 50 MHz. For these
devices sub-array testing is done, where each half of the array is tested separately.
 
Figure 4.13 Clock Frequency Based on the Position of the TPGs and the ORAs
[Bar chart. X-axis: Device (LXT20, LXT30, LXT50, LXT85, LXT110, LXT155, LXT220, LXT330, SXT35, SXT50, SXT95, FXT30, FXT70, FXT100, FXT130, FXT200). Y-axis: Max BIST Clock Freq (MHz), 0 to 100. Series: TPGs bottom ORA right; TPGs middle ORA right; TPGs middle ORA left.]

Figure 4.14 shows the impact of the position of the TPGs on the BIST clock frequency for
the sub-array that is being tested. In Figure 4.14, bottom half and top half indicate the
bottom-half and top-half arrays of the chip. Figure 4.14 shows that moving the TPGs to the
bottom of the array while testing the bottom half of the array raises the BIST clock frequency to
over 50 MHz for the LXT110, LXT155, LXT220 and SXT50 devices. However, the BIST clock
frequency for at least one of the sub-arrays is still less than 50 MHz for the LXT330, SXT50 and
SXT95 devices. For these devices, each quarter of the array can be tested separately to improve
the BIST clock frequency. The TPGs are positioned at the middle of the array while testing all
quarters of the array. Figure 4.15 shows the BIST clock frequency for each quarter of the array
for the LXT330, SXT50 and SXT95 devices.
 
Figure 4.14 Clock Frequency for the Sub-Arrays Based on the Position of the TPGs
[Bar chart. X-axis: Device (LXT110, LXT155, LXT220, LXT330, SXT50, SXT95). Y-axis: Max BIST Clock Freq (MHz), 0 to 60. Series: TPG middle ORA right bottom half; TPG middle ORA right top half; TPG bottom ORA right bottom half.]

Figure 4.15 shows that the BIST clock frequency is higher for the quarter-arrays that
have the TPGs located close to the arrays. Figure 4.16 shows the position of the TPGs for each of
the quarter-arrays for an LXT330 device. For the LXT330 and the SXT50 devices, the clock
frequency is well above 50 MHz when the TPGs are located close to the DSPs under test. So for
these two devices, two sets of two TPGs can be used, where one set of TPGs is placed at the
middle of the array to test the middle two quarter-arrays and the other set of TPGs is placed at
the top of the array to test the top-most quarter-array or at the bottom of the array to test the
bottom-most quarter of the array.
 
Figure 4.15 Clock Frequency for Quarter-Arrays Based on the Position of the TPGs
[Bar chart. X-axis: Device (LXT330, SXT50, SXT95). Y-axis: Max BIST Clock Freq (MHz), 0 to 70. Series: tpg middle bottom-most quarter; tpg middle bottom half top quarter; tpg middle top half bottom quarter; tpg middle top-most quarter.]

Figure 4.17 shows the BIST clock frequencies for the SXT95 devices based on different
locations of the TPGs and the ORAs in the chip array for sub-arrays, quarter-arrays and
quadrants of the array. For example, in Figure 4.17, the phrase "tpg bottom ORA right" indicates
that the TPG is located at the bottom of the array and the ORAs are located on the right side of
the DSPs. From Figure 4.17 it is seen that in all the cases the frequency is below 50 MHz. So for
the SXT95 device each quadrant of the quarter-array can be tested separately and four sets of two
TPGs can be used, where two sets of TPGs will be placed at the middle of the array with one set
on the left of the chip array and one set on the right of the chip array. The remaining two sets will
be placed on either side of the chip at the bottom/top of the array based on the array that is being
tested.
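
The partitioning decisions in this section amount to choosing the coarsest division of the array for which every portion under test reaches the 50 MHz target. A minimal sketch of that selection rule follows. The function name, the data layout and the frequency values are hypothetical and are not read from Figures 4.13 to 4.17, and the sketch ignores the choice of TPG and ORA positions for each portion.

```python
TARGET_MHZ = 50.0

# Ordered from coarsest to finest, following the order used in the text.
GRANULARITIES = ["full array", "half-array", "quarter-array", "quadrant"]


def choose_granularity(max_freq_mhz, target=TARGET_MHZ):
    """Pick the coarsest granularity whose slowest portion meets the target.

    max_freq_mhz maps a granularity name to the list of maximum BIST clock
    frequencies (MHz) of its portions. Returns None if none qualifies.
    """
    for name in GRANULARITIES:
        freqs = max_freq_mhz.get(name)
        if freqs and min(freqs) >= target:
            return name
    return None


# Hypothetical numbers for an imaginary device, for illustration only.
example = {
    "full array": [38.0],
    "half-array": [55.0, 47.0],                 # one half is still below target
    "quarter-array": [62.0, 58.0, 57.0, 61.0],
}
print(choose_granularity(example))  # quarter-array
```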
 
[Figure panels: a) bottom-most quarter; b) bottom-half top quarter; c) top-half bottom quarter; d) top-most quarter]
Figure 4.16 Quarter Arrays for LXT330 device
 
Figure 4.17 Clock Frequencies for all Arrays of the SXT95 Device Based on Positions of the
TPGs and ORAs
[Bar chart titled "Timing analysis for Device sxt95". Y-axis: Max BIST Clock Freq (MHz), 0 to 60. X-axis categories: tpg bottom ORA right full chip; tpg middle ORA left full chip; tpg middle ORA right full chip; tpg middle ORA right bottom-half array; tpg middle ORA right top-half array; tpg bottom ORA right bottom-half array; tpg middle ORA right bottom-most quarter; tpg middle ORA right bottom-half top-quarter; tpg middle ORA right top-half bottom-quarter; tpg middle ORA right top-most quarter; tpg middle ORA right top-left quadrant; tpg middle ORA right top-right quadrant; tpg bottom ORA right bottom-left quadrant; tpg bottom ORA right bottom-right quadrant.]
4.6 Fault Inject Analysis and Fault Coverage 
Fault-inject analysis was done by injecting a total of 604 faults, of which 564 faults were
detected. Of the 40 undetected faults, 30 faults are associated with test circuitry that only the
manufacturer has access to. These faults are not a concern since they do not affect the
functioning of the DSP. Six of the remaining ten faults are associated with non-functional
configuration bits that are not mentioned in the data sheet and hence are not of concern. The
remaining four undetected faults are associated with the PATDET and NO_PATDET
configuration bits of the USE_PATTERN_DETECT attribute. These configuration bits select
between combinational paths of different speed, so these faults will be detected when tested at
speed. They are not detected here because fault-inject analysis uses the Boundary Scan
interface, which is slow. Figure 4.18 illustrates the individual and cumulative fault coverage for
the 20 test sequences. The bar graph in the figure illustrates the individual fault coverage and the
line graph illustrates the cumulative fault coverage. Excluding the 36 undetected faults that are
not of concern, a cumulative fault coverage of 99.3% (564 of 568 faults) is achieved.
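
The coverage figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are assumptions, and the per-sequence fault sets are made-up placeholders rather than data from Figure 4.18. It assumes that the reported 99.3% is computed after excluding the 36 undetected faults described above as not of concern, an inference from the numbers, since 564 of 604 would otherwise give 93.4%.

```python
def coverage_percent(detected, injected, excluded=0):
    """Fault coverage in percent, ignoring 'excluded' faults of no concern."""
    return 100.0 * detected / (injected - excluded)


def cumulative_detected(per_sequence):
    """Running count of distinct faults detected after each test sequence."""
    seen, counts = set(), []
    for faults in per_sequence:
        seen |= set(faults)
        counts.append(len(seen))
    return counts


# Numbers from the text: 604 injected, 564 detected, 30 + 6 not of concern.
print(round(coverage_percent(564, 604, excluded=36), 1))  # 99.3
print(round(coverage_percent(564, 604), 1))               # 93.4

# Made-up fault IDs for three sequences (not data from Figure 4.18).
print(cumulative_detected([{1, 2, 3}, {2, 3, 4}, {5}]))   # [3, 4, 5]
```

The cumulative count uses a set union because a fault detected by several test sequences must be counted only once, which is why the cumulative curve flattens as later sequences mostly re-detect known faults.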
 
Figure 4.18 Fault Inject Results for Virtex-5 DSP BIST
[Combined chart. X-axis: Test Sequence # (1 to 20). Left Y-axis: # Faults Detected, 0 to 600. Right Y-axis: Fault Coverage (%), 0 to 100. Series: Individual FC (bars); Cumulative FC (line).]
4.7 Summary 
 A minimum set of BIST configurations was developed to test the DSP cores in Virtex-5
FPGAs. Fault detection capabilities and fault diagnosis were verified by injecting faults into the
configuration memory bits controlling the DSP cores in an LXT30 device. Fault coverage of
99.3% is achieved for the faults injected in the configuration memory of the DSP. The functional
fault coverage, as determined by fault simulations, is much higher. Fault coverage for the faults
injected in the configuration memory of the DSP can be improved to 100% by adding more BIST
configurations if desired. Since these undetected faults are in non-functional modes of operation,
the value of additional BIST configurations is questionable. Timing analysis was done to
determine the maximum BIST clock frequency. Based on the timing analysis results, the positions
of the TPGs and the ORAs in the chip array were changed to improve the maximum BIST clock
frequency. For larger devices, sub-array testing (where only half the chip is tested at a time) and
quarter-array testing (where only one quarter of the chip is tested at a time) are done.
  
Chapter 5 
Summary and Conclusion 
 This chapter highlights and summarizes the work presented in the thesis. Section 5.1 
summarizes DSP BIST for Virtex-4 devices. Section 5.2 summarizes DSP BIST for Virtex-5 
devices. Section 5.3 describes the application of DSP BIST to DSPs in other FPGAs.
5.1 Summary of Virtex-4 DSP BIST 
 The DSP BIST for Virtex-4 FPGAs presented in this thesis was developed by writing 8-bit
and 48-bit ASL models of the architectures of the logic cores in the DSP, namely the multiplier
and the adder. Fault simulations were carried out using AUSIM to determine the correct set of
test patterns and configurations for completely testing the cores for single stuck-at and bridging
faults. Three test sequences were developed for the three test modes: multiplier, adder and
cascade modes of operation. Five BIST configurations were developed and the FPGA is
repeatedly reconfigured to run the three test sequences. Seven test sequences are run in five
downloads to the FPGA to completely test the FPGA in all its functional modes of operation.
Fault detection for the DSP BIST was evaluated by injecting faults in the configuration memory
bits of the DSP in a Virtex-4 FX12 device. Of the 154 faults injected, 150 faults were detected,
giving a fault coverage of 97.4%. The four undetected faults are in non-functional modes of the
DSP but can be detected by adding BIST configurations. Timing analysis was done on all
Virtex-4 devices to determine the maximum BIST clock frequency for each device. Based on this
analysis, the position of the TPGs for achieving a BIST clock frequency of at least 50 MHz was
determined. Sub-array testing in larger Virtex-4 devices minimizes the power dissipation caused
by concurrently testing a large number of DSPs in the device.
5.2 Summary of Virtex-5 DSP BIST 
 Since the architecture of DSPs in Virtex-5 FPGAs is similar to the architecture of DSPs
in Virtex-4 FPGAs, the test algorithms used to test DSPs in Virtex-4 FPGAs are also applied to
test the multiplier and the adder cores in DSPs of Virtex-5 FPGAs. Test patterns and BIST
configurations for the additional circuits in Virtex-5 FPGAs were developed by writing ASL
models of the circuits and performing fault simulation on these models in AUSIM. Five test
sequences were developed for the five test modes: multiplier, adder, logic, pattern detector and
cascade modes of operation. Eleven BIST configurations were developed and the FPGA is
repeatedly reconfigured to completely test the FPGA in all its functional modes of operation.
Twenty tests are run in eleven downloads to the FPGA. Fault coverage for the DSP BIST was
evaluated by injecting faults in the configuration memory bits of the DSP in a Virtex-5 LXT30
device. Timing analysis was done on all Virtex-5 devices to determine the maximum BIST clock
frequency for each device. Based on this analysis, the position of the TPGs and the ORAs for
achieving a BIST clock frequency of at least 50 MHz was determined. Sub-array, quadrant and
sub-quadrant testing in larger Virtex-5 devices minimizes the power dissipation caused by
concurrently testing a large number of DSPs in the device.
5.3 Application to Other FPGAs and Architectures 
 The DSP BIST for Virtex-4 and Virtex-5 FPGAs presented in this thesis can be extended
and applied to test DSPs in other FPGAs such as Spartan-3A, Spartan-6 and Virtex-6 FPGAs. The
architecture of DSPs in Spartan-3A and Spartan-6 FPGAs is similar to the architecture of DSPs
in Virtex-4 FPGAs [26] [27]. The DSPs in Spartan-3A and Spartan-6 FPGAs have a
pre-adder stage in addition to the circuits present in DSPs of Virtex-4 FPGAs, but the test
algorithms used to test the adder and the multiplier in Virtex-4 FPGAs can also be used to test
the adder and multiplier cores in Spartan-3A and Spartan-6 FPGAs. The architecture of DSPs in
Virtex-6 FPGAs is similar to the architecture of DSPs in Virtex-5 FPGAs [28], so the test
algorithms used to test the logic circuits in Virtex-5 DSPs can also be used to test the logic
circuits in Virtex-6 FPGAs.
DSP BIST for Virtex-4 devices is explained in [30]. Reference [31] gives a detailed explanation
of adder BIST and [32] gives a detailed explanation of multiplier BIST.
 
 
  
References 
[1]  L. T. Wang, C. Stroud and N. A. Touba, "System On Chip Test Architectures," Morgan
Kaufmann Publishers, 2007.
[2]  K. M. Thompson, "Intel and the Myths of Test," IEEE Design & Test of Computers, vol. 13,
pp. 79-81, 1996.
[3]  S. Hauck, "The roles of FPGAs in Reprogrammable Systems," Proc. of the IEEE, vol. 86,
pp. 615-638, 1998.
[4]  M. B. Tahoori and S. Mitra, "Test Compression for FPGAs," Proc. IEEE Intl. Test Conf.,
pp. 1-9, 2006.
[5]  E. Chmelar, "Minimizing the number of test configurations for FPGAs," IEEE/ACM Intl.
Conf. on Computer Aided Design, pp. 899-902, 2004.
[6]  Xilinx Inc., "Virtex-4 FPGA User Guide," UG070 v2.5, 2008.
[7]  Xilinx Inc., "Virtex-5 FPGA User Guide," UG190 v4.2, 2008.
[8]  W. K. Huang, F. J. Meyer and F. Lombardi, "Array-Based Testing of FPGAs: Architecture
and Complexity," Proc. IEEE Intl. Conf. on Innovative Systems in Silicon, pp. 249-258,
1996.
[9]  Xilinx Inc., "XtremeDSP for Virtex-4 FPGAs," User Guide UG073 (v2.7), Xilinx Inc.,
2008.
[10] Xilinx Inc., "Virtex-5 XtremeDSP Design Considerations," User Guide UG193 (v3.1),
Xilinx Inc., 2008.
[11] B. Dufort and G. H. Chapman, "Test Vehicle for a Wafer-Scale Field Programmable Gate
Array," Proc. IEEE Intl. Conf. on Wafer Scale Integration, pp. 33-42, 1995.
[12] A. Orailoglu, "Microarchitectural Synthesis for Rapid BIST Testing," IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 6, pp. 573-586,
1997.
[13] M. Abramovici, C. Stroud and M. Emmert, "Using Embedded FPGAs for SOC Yield
Improvement," Proc. ACM/IEEE Design Automation Conference, pp. 713-724, 2002.
[14] S. Adham and S. Gupta, "DP-BIST: A Built-In Self-Test for DSP Data Paths - A Low
Overhead and High Fault Coverage Technique," Proc. IEEE Asian Test Symp., pp. 205-212,
1996.
[15] H. Al-Asaad, J. Hayes, and B. Murray, "Scalable Test Generators for High-Speed
Datapath Circuits," J. Electronic Testing: Theory and Applications, vol. 12, pp. 111-125,
1998.
[16] D. Gizopoulos, A. Paschalis and Y. Zorian, "Effective Built-In Self-Test for Booth
Multipliers," IEEE Design & Test of Computers, vol. 15, no. 3, pp. 105-111, 1998.
[17] A. Paschalis, N. Kranitis, M. Psarakis, D. Gizopoulos and Y. Zorian, "An Effective BIST
Architecture for Fast Multiplier Cores," Proc. Design, Automation and Test in Europe
Conf., pp. 117-121, 1999.
[18] D. Bakalis, E. Kalligeros, D. Nikolos, H. Vergos and G. Alexiou, "Low Power BIST for
Wallace Tree-based Fast Multipliers," Proc. IEEE Int. Symp. on Quality of Electronic
Design, pp. 433-438, 2000.
[19] S. Kajihara and T. Sasao, "On the Adders with Minimum Tests," Proc. IEEE VLSI Test
Symp., pp. 10-15, 1997.
[20] A. Sarvi and J. Fan, "Automated BIST-Based Diagnostic Solution for SOPC," Proc. Int.
Conf. on Design & Test of Integrated Systems in Nanoscale Technology, pp. 263-267,
2006.
[21] C. Stroud and S. Garimella, "Built-In Self-Test and Diagnosis of Multiple Embedded Cores in
SOCs," Proc. Intl. Conf. on Embedded Systems and Applications, pp. 130-136, 2005.
[22] "Using Embedded Multipliers in Spartan-3 FPGAs," Application note XAPP467 (v1.1),
Xilinx, Inc., 2003.
[23] C. Nagendra, M-J Irwin and R-M Owens, "Area-Time-Power Tradeoffs in Parallel
Adders," IEEE Trans. on Circuits and Systems II, vol. 43, no. 10, pp. 689-702, 1996.
[24] B. F. Dutton and C. E. Stroud, "Built-In Self-Test of Configurable Logic Blocks in Virtex-5
FPGAs," Proc. IEEE Southeastern Symp. on System Theory, pp. 230-234, 2009.
[25] P. H. W. Leong, "Recent Trend in FPGA Architectures and Applications," Proc. IEEE
International Symp. on Electronic Design, Test and Applications, pp. 137-141, 2008.
[26] Xilinx Inc., "XtremeDSP DSP48A for Spartan-3A DSP FPGAs," User Guide UG431
(v1.3), Xilinx Inc., 2008.
[27] Xilinx Inc., "Spartan-6 Family Overview," DS160 v1.0, 2009.
[28] Xilinx Inc., "Virtex-6 Family Overview," DS150 v1.0, 2009.
[29] B. F. Dutton, M. Ali, J. Sunwoo and C. E. Stroud, "Embedded Processor Based Fault
Injection and SEU Emulation for FPGAs," Proc. Intl. Conf. on Embedded Systems and
Applications, pp. 183-189, 2009.
[30] M. D. Pulukuri and C. E. Stroud, "Built-In Self-Test of Digital Signal Processors in
Virtex-4 FPGAs," Proc. IEEE Southeastern Symp. on System Theory, pp. 34-38, 2009.
[31] M. D. Pulukuri and C. E. Stroud, "On Built-In Self-Test for Adders," J. Electronic
Testing: Theory and Applications, vol. 25, no. 6, pp. 343-346, DOI 10.1007/s10836-009-5114-6,
2009.
[32] M. D. Pulukuri, G. J. Starr and C. E. Stroud, "On Built-In Self-Test for Multipliers," Proc.
IEEE Southeast Regional Conf., pp. 25-28, 2010.