A Parallel Implementation of Fault Simulation on a Cluster of Workstations Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information. Kyunghwan Han Certificate of Approval: Fa Foster Dai Associate Professor Electrical and Computer Engineering Soo-Young Lee, Chair Professor Electrical and Computer Engineering Chwan-Hwa John Wu Professor Electrical and Computer Engineering Stephen L. McFarland Acting Dean Graduate School A Parallel Implementation of Fault Simulation on a Cluster of Workstations Kyunghwan Han A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science Auburn, Alabama December 15, 2006 A Parallel Implementation of Fault Simulation on a Cluster of Workstations Kyunghwan Han Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights. Signature of Author Date of Graduation iii Thesis Abstract A Parallel Implementation of Fault Simulation on a Cluster of Workstations Kyunghwan Han Master of Science, December 15, 2006 (B.S., Sungkyunkwan University, 2003) 81 Typed Pages Directed by Soo-Young Lee Parallel simulation on a cluster workstations is one method by which fault simu- lation time for large circuits can be reduced significantly. To get near-linear speedups from parallel processing, parallelization methods should result in an even computational load distribution among processors in a cluster workstations. Fault simulation can be parallelized by partitioning fault list, the test vector or both. In the thesis, parallel fault simulation algorithm called PAUSIM has been developed. This algorithm consists of logic simulation and two steps of fault simulation for sequential logic circuits. Com- pared to the other algorithms, PAUSIM-CY avoids redundant work by a judicious task decomposition. Also, it adopts a cyclic fault partitioning method based on the LOG partitioning and local redistribution, resulting in a well-balanced load distribution. The parallel implementations were done using the MPI library on a cluster of workstations. The results show a significant speed-up by PAUSIM-CY over other existing parallel algorithms. iv Style manual or journal used Journal of Approximation Theory (together with the style known as ?aums?). Bibliograpy follows van Leunen?s A Handbook for Scholars. Computer software used The document preparation package TEX (specifically LATEX) together with the departmental style-file aums.sty. v Table of Contents List of Figures viii List of Tables xi 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Review of Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Circuit Testing 5 2.1 Fault Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Test Pattern Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Sequential Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Word-parallel Fault Simulation . . . . . . . . . . . . . . . . . . . . 
11 2.3.2 Deductive Fault Simulation . . . . . . . . . . . . . . . . . . . . . . 12 2.3.3 Concurrent Fault Simulation . . . . . . . . . . . . . . . . . . . . . 13 2.3.4 Differential Fault Simulation . . . . . . . . . . . . . . . . . . . . . 15 2.3.5 PROOFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.6 HOPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.7 PARIS and PSF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Parallel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.1 Circuit Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.2 Algorithmic Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.3 Fault Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.4 Pattern Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.5 Fault and Pattern Parallelism . . . . . . . . . . . . . . . . . . . . . 26 2.5 AUSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5.1 Program Configuration . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5.2 Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3 Parallelization 34 3.1 PAUSIM-BL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.1 Task Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.2 Fault Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1.3 Test Vector Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 vi 3.2 PAUSIM-CY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.1 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.2 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4 Performance 47 4.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5 Conclusion 67 Bibliography 68 vii List of Figures 1.1 Typical fault simulator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 Definition of fan-out stem. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Definition of fault. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 An Example of a single stuck-at fault. . . . . . . . . . . . . . . . . . . . . 7 2.4 An example of an level order. . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5 An example of an output cone. . . . . . . . . . . . . . . . . . . . . . . . . 9 2.6 An example of parallel fault simulation. . . . . . . . . . . . . . . . . . . . 12 2.7 An example of deductive fault simulation. . . . . . . . . . . . . . . . . . . 13 2.8 Fault-lists in concurrent fault simulation. . . . . . . . . . . . . . . . . . . 14 2.9 An example of differential fault simulation . . . . . . . . . . . . . . . . . . 16 2.10 PROOFS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.11 An example of PROOFS fault simulation. . . . . . . . . . . . . . . . . . . 19 2.12 Test sequence partitioning in SPITFIRE-0 . . . . . . . . . . . . . . . . . . 25 2.13 Partitioning in SPITFIRE-1 . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.14 Partitioning in SPITFIRE-2 . . . . . . . . . . . . . . . . . . . . . . . . . . 
28 2.15 Sequential structure of AUSIM . . . . . . . . . . . . . . . . . . . . . . . . 30 2.16 Data Structure of AUSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.17 An example of gate evaluation of AUSIM. . . . . . . . . . . . . . . . . . . 32 3.1 Task decomposition in PAUSIM-BL . . . . . . . . . . . . . . . . . . . . . 35 viii 3.2 Fault partitioning in PAUSIM-BL for 4 processors where each fault group is distinguished by a different grey-scale . . . . . . . . . . . . . . . . . . . 37 3.3 Test vector set partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Flowchart of the master processor in PAUSIM-BL . . . . . . . . . . . . . 39 3.5 Flowchart of slave processors in PAUSIM-BL . . . . . . . . . . . . . . . . 40 3.6 Example of fault partitioning for s27 benchmark circuit for 5 processors . 42 3.7 Task decomposition in PAUSIM-CY1 . . . . . . . . . . . . . . . . . . . . . 43 3.8 Communication among 4 processors in logic simulation in PAUSIM-CY1. V: Test vector, O: Logic simulation result . . . . . . . . . . . . . . . . . . 45 3.9 Communication among 4 processors in the first step of fault simulation in PAUSIM-CY1. F: Fault, U: Undetected fault, D: Undetected fault . . . . 45 3.10 Communication among 4 processors in the second step of fault simulation in PAUSIM-CY1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1 Execution time for 4 processors - small circuit . . . . . . . . . . . . . . . . 50 4.2 Execution time for 4 processors - large circuit . . . . . . . . . . . . . . . . 50 4.3 Execution time for 8 processors - small circuit . . . . . . . . . . . . . . . . 51 4.4 Execution time for 8 processors - large circuit . . . . . . . . . . . . . . . . 51 4.5 Execution time for 16 processors - small circuit . . . . . . . . . . . . . . . 52 4.6 Execution time for 16 processors - large circuit . . . . . . . . . . . . . . . 52 4.7 Execution times for PAUSIM-SF . . . . . . . . . . . . . . . . . . . . . . . 54 4.8 Execution times for PAUSIM-BL . . . . . . . . . . . . . . . . . . . . . . . 54 4.9 Execution times for PAUSIM-CY0 . . . . . . . . . . . . . . . . . . . . . . 55 4.10 Execution times for PAUSIM-CY1 . . . . . . . . . . . . . . . . . . . . . . 55 4.11 Speedup for PAUSIM-SF . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 ix 4.12 Speedup for PAUSIM-BL . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.13 Speedup for PAUSIM-CY0 . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.14 Speedup for PAUSIM-CY1 . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.15 Efficiency for PAUSIM-SF . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.16 Efficiency for PAUSIM-BL . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.17 Efficiency for PAUSIM-CY0 . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.18 Efficiency for PAUSIM-CY1 . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.19 Workload distribution: s3271 benchmark circuit, PAUSIM-SF . . . . . . . 63 4.20 Workload distribution: s3271 benchmark circuit, PAUSIM-BL . . . . . . . 63 4.21 Workload distribution: s3271 benchmark circuit, PAUSIM-CY0 . . . . . . 64 4.22 Workload distribution: s3271 benchmark circuit, PAUSIM-CY1 . . . . . . 64 4.23 Workload distribution: s3271 benchmark circuit, PAUSIM-SF . . . . . . . 65 4.24 Workload distribution: s3271 benchmark circuit, PAUSIM-BL . . . . . . . 65 4.25 Workload distribution: s3271 benchmark circuit, PAUSIM-CY0 . . . . . . 66 4.26 Workload distribution: s3271 benchmark circuit, PAUSIM-CY1 . . . . . . 
66 x List of Tables 4.1 Fault coverage statistics using 1600 random vectors on a single processor . 47 4.2 Execution time (seconds) and speedups using 1600 random vectors on multiprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Mean and standard deviation execution times on eight processors for PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0 and PAUSIM-CY1. The unit for time is second. s1196, s1423 and s1512 . . . . . . . . . . . . . . . . . . 61 4.4 Mean and standard deviation execution times on eight processors for PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0 and PAUSIM-CY1. The unit for time is second. s3271 and s5378 . . . . . . . . . . . . . . . . . . . . . . 62 xi Chapter 1 Introduction Once a digital circuit is designed and fabricated, the circuit needs to be tested for the potential presence of physical defects or faults. The objective of a fault simulation algorithm is to find the fraction of total faults (also referred to as the fault coverage) that are detected by a given set of input vectors. Especially, fault simulation is essential to designing a high fault coverage Built-in Self Test (BIST) becoming popular for VLSI testing. 1.1 Background In the simplest form of testing, a fault is injected into a logic circuit by setting a line or a gate to a faulty value (1 or 0), and then the effects of the fault are simulated using zero-delay logic simulation. Most fault simulation algorithms are typically of O(n2) time complexity, where n is the number of lines in the circuit. Studies have shown that there is little hope of finding a linear time fault simulation algorithm [3]. Figure 1.1 shows a typical fault simulator [1]. The block C() is the fault-free circuit and blocks C(f1) through C(fn) are copies of the same circuit with faults f1 through fn permanently inserted. The good circuit (fault-free circuit) and the faulty circuits are simulated for each test vector. If the output responses of a faulty circuit differ from those of the good circuit, then the corresponding fault is detected, and the fault can be dropped from the fault list, speeding up simulation of subsequent test vectors. A fault simulator can be run in a stand-alone mode to grade a given set of test vectors, 1 Test vectors C ( ) C (f1) C (f2) C (fn) Comparator ? Comparator Comparator Figure 1.1: Typical fault simulator. or interfaced with a test vector generator to reduce the number of faults that must be explicitly targeted by the test vector generator. In a random vector environment, the fault simulator helps in evaluating the fault coverage of a set of random vectors. In either environment, fault simulation can consume a significant amount of time, especially in random vector testing in which millions of vectors may have to be simulated. While many methods have been suggested for efficient fault simulation to evaluate the fault coverage of an enormous amount of test patterns, parallel processing can be utilized to reduce the fault simulation time greatly. 2 1.2 Review of Previous Work To parallelize simulation, one may exploit circuit parallelism, algorithm parallelism, data parallelism, or a combination of them. In a circuit-parallel approach [19], a circuit is partitioned into several parts. Each part is assigned to a processor. When a processor needs the information on a circuit node or line that is not in its own part, it must communicate with the processor that has the information. Hence, a large amount of communication is required. 
Circuit parallelism has the advantage that each processor needs to store the circuit description and temporary structures only for a fraction of the circuit, hence requiring less memory.

In an algorithm-parallel approach [24][25], the simulation is carried out in a pipelined mode. That is, all gates at each logic level form a stage of a pipeline, a processor is assigned to each stage, and the different test vectors are then pipelined through the logic circuit. In the case of a sequential circuit, since test vectors have to be applied in sequence, it is not possible to exploit a pattern-parallel approach. Also, due to the presence of feedback paths through memory elements such as flip-flops, speedups may be severely limited.

In a data-parallel approach [17][26][29], the simulation data are partitioned into disjoint sets and each set is assigned to a processor. Each processor executes the entire algorithm and simulates the entire circuit. Fault parallelism is relatively simple to exploit. The fault list is partitioned among processors, and each processor performs fault simulation on the entire circuit for its own fault list with the complete set of test vectors. It is possible to obtain an almost linear speedup. The problem is that the fault activity of each partition for a given test vector depends on how the faults are partitioned and thus may not be uniform across all partitions. In pattern parallelism, the given input test vectors are decomposed into subsets. Each processor gets a copy of the entire circuit, the fault set, and a subset of the test vectors, and performs fault simulation with its subset of test vectors. For sequential circuits, the future behavior of the circuit depends on the past input vectors, which limits the applicability of this approach.

1.3 Problem Definition

Parallel computing is becoming an increasingly cost-effective and affordable means of providing high computing power and represents a challenge to costly conventional supercomputers. For example, a cluster of workstations (COW) can be easily configured as a high-performance computing platform. Therefore, it is worthwhile to investigate efficient ways to utilize a COW for time-consuming circuit testing.

1.4 Objective

The objective of this study is to develop parallel fault simulation algorithms that can be executed efficiently on a COW in order to maximize speedup. This is achieved by judiciously partitioning the fault and test vector spaces, minimizing redundant computation, and better balancing the load distribution among workstations (processors).

Chapter 2
Circuit Testing

In this chapter, the general issues of VLSI testing are briefly described, introducing the important concepts and terms.

2.1 Fault Modeling

Fan-out stem: A signal that branches to multiple places, each of which is called a fan-out branch. The source of those branches is called the stem.

Figure 2.1: Definition of fan-out stem.

Fault: A defect in an electronic system is the unintended difference between the implemented hardware and its intended design. A representation of a defect at the abstract function level is called a fault. That is, a defect and a fault are imperfections at the hardware and function levels, respectively.

Figure 2.2: Definition of fault.

The simple digital system in Figure 2.2 consists of an AND gate, two input terminals, a and b, and an output terminal c.
But, the connection between b and the gate is left unconnected and the second input of the gate is grounded. The functional output of this system, as implemented, is c = 0, instead of the correct intended output c = ab. For this system, the defect is a short to ground, and the fault is single b stuck at logic 0. Stuck-at Fault: The type of fault described above is modeled by assigning a fixed (0 or 1) value to a signal line in the circuit. A signal line is an input or an output of a logic gate or a flip-flop. The most popular forms are the single stuck-at faults, i.e., two faults per line, stuck-at-1 (s-a-1 or sa1) and stuck-at-0 (s-a-0 or sa0). Figure 2.3 illustrates a single stuck-at fault [1]. A stuck-at-1 fault as marked at the output of the OR gate means that the faulty signal remains 1 irrespective of the input state of the OR gate. It shows that the normal (faulty) value at the output is applied to the AND2 gate as 0 (1). When the input vector (1100) is applied as a test vector for the s-a-1 fault, it is easy to see that the normal and faulty outputs are different. The circuit 6 AND1 OR 1 stuck-a- 1 AND2 Test vector 1 1 0 0 0(1) 0(1) True Re sponse Faulty Res p onse Figure 2.3: An Example of a single stuck-at fault. in Figure 2.3 has seven lines, each of which is the potential site for a single stuck-at fault. Hence, the number of possible faults is 14. Level and Output Cone of Circuit: Each gate in a circuit can be assigned a level, which represents the maximum distance (in gates) from a primary input (PI) to the gate. In Figure 2.4, the levels of gates are shown in circles. A level i gate is one for which at least one of its inputs is from a level i-1 gate. For example, PIs G0, G1, and G2 have the distance of 0 and therefore are assigned a level of 0. Accordingly, the top two inputs of gate G5, the top input of gate G6, and the inputs to inverters G3 and G8 are labelled with a level of 0. When all of the inputs of a gate are labelled, the gate and its outputs are labelled with the maximum of its input levels plus 1. That is, the gates 7 G3 and G8 are given a level of 1. Then, G4 is labeled with a level of 2, G5 and G6 with a level of 3. Finally, G7 is assigned a level of 4. G0 G4 G3 G8 G6 G2 G7 G1 G5 1 1 2 3 3 4 Figure 2.4: An example of an level order. Each gate can be assigned a level by parsing the circuit once from the PI?s to the primary outputs (PO?s). A path is an alternating sequence of wires and gates. Signals are propagated from the inputs of a circuit to the outputs along one or more paths. A gate g is in the output cone OC(Oi) of a circuit PO Oi if there is a path from the output of g to the PO Oi. More formally, a gate g is in the output cone OC(Oi) of the PO Oi if its output is the PO Oi or at least one of the gates on the fan-out of g is in OC(Oi). Figure 2.5 shows the output cone of an output Oi [33]. The faults from the two gates, G1 and G2, are propagated on the same path to the output. The cone to which a gate in a circuit belongs is determined by a simple depth-first search from each PO. Both level 8 G2 G3 G1 Output Cone O i O j G4 Figure 2.5: An example of an output cone. and cone identifications can be carried out in time proportional to the number of gates in the circuit. 2.2 Test Pattern Environment Fault simulation algorithms may be used in two different environments. One is a deterministic test generation environment and the other is a random pattern test generation environment. 
In the former, a specific algorithm named Automatic Test-Pattern Generation (ATPG) is used to generate a test vector for every fault in a circuit. The ATPG selects a fault from the fault list and tries to generate a test vector for the fault. If the ATPG is successful in generating a test vector, fault simulation is carried out on the entire fault list, and the fault simulator finds the additional faults that are detectable by the same test vector. However, test generation is an NP-complete problem and can be computationally very expensive. In fact, for some complex circuits, the use of such algorithms is no longer feasible or practical.

In the latter, a fault simulator determines whether test vectors lead to the detection of a target fault and evaluates the fault coverage of a set of random test vectors. The random pattern environment uses a relatively inexpensive pseudo-random test pattern generator to generate test vectors. The fault simulator selects random test vectors, runs fault simulation, and determines which faults are detected. Since random pattern testing may require a very large number of vectors to be simulated, fault simulation can be very time-consuming. Thus, parallel processing can be employed to reduce the fault simulation time significantly.

2.3 Sequential Methods

The task of fault simulation is to determine, for each fault in a given list, whether the simulations of the faulty and fault-free circuits differ in any primary output. While a fault simulator can be built in a straightforward manner from any logic simulator, the resulting performance would be low due to significant duplicated computation. As a result, various techniques to reduce duplicated computation have been developed. These approaches may be classified by the manner in which they compute and store the good and faulty circuit states. The most popular approaches are the word-parallel, deductive, concurrent, differential, and PROOFS-style fault simulation algorithms. In this section, these and other single-processor simulation algorithms are reviewed.

2.3.1 Word-parallel Fault Simulation

Several algorithms have been proposed for sequential circuit fault simulation, most of which are targeted at single stuck-at faults in synchronous sequential circuits. Three-valued (0, 1, X) simulation is generally carried out, and no reset is assumed. Word-parallel simulation [4] utilizes bit-oriented logic operations to perform many gate evaluations simultaneously. If one word consists of 32 bits on a computer, 32 gate evaluations can be performed at a time, where one bit is used for the good circuit. Word-parallel simulation can be either fault-parallel or pattern-parallel. The former simulates the good circuit and 31 faulty circuits with one input vector at a time by assigning a bit to each fault case. The latter simulates 32 input patterns for one fault at a time by assigning a bit to each test vector case.

In word-parallel fault simulation [4], 31 faulty circuits are simulated in parallel with the good circuit. Faults are packed statically into fault groups, and all test vectors are applied to the circuit for a given fault group. Then, the process is repeated for each group of 31 faults. Fault detection is done by comparing the good and faulty circuit values at the primary outputs. Fault dropping is not possible in this algorithm; therefore, the bit space for a faulty circuit is wasted once the fault is detected.
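To make the bit-level mechanics concrete, the short C sketch below evaluates a small hypothetical two-gate circuit (not the circuit of Figure 2.6) for the fault-free machine and two injected stuck-at faults in a single pass over 32-bit words. The inject() helper, the signal names, and the circuit itself are illustrative assumptions, not code from any of the simulators discussed here.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t word;  /* bit 0: fault-free circuit, bits 1..31: faulty circuits */

    /* Force one bit of a signal word to a stuck-at value (fault injection). */
    static word inject(word v, int bit, int stuck_at_1)
    {
        return stuck_at_1 ? (v | (1u << bit)) : (v & ~(1u << bit));
    }

    int main(void)
    {
        /* Hypothetical circuit: g = (a AND b) OR (NOT c), simulated for the
         * input vector (a, b, c) = (1, 1, 1) in all 32 circuits at once.    */
        word a = ~0u, b = ~0u, c = ~0u;

        word n1 = a & b;                 /* AND gate                          */
        n1 = inject(n1, 1, 0);           /* fault 1: AND output stuck-at-0    */

        word n2 = ~c;                    /* NOT gate                          */
        n2 = inject(n2, 2, 1);           /* fault 2: NOT output stuck-at-1    */

        word g = n1 | n2;                /* OR gate (primary output)          */

        /* A fault is detected when its bit differs from the fault-free bit 0. */
        word good = (g & 1u) ? ~0u : 0u; /* replicate bit 0 across the word   */
        word detected = g ^ good;
        for (int i = 1; i < 32; i++)
            if (detected & (1u << i))
                printf("fault %d detected\n", i);
        return 0;
    }

Running this sketch reports only fault 1; the stuck-at-1 on the inverter output is masked at the OR gate because its other input is already 1, the same kind of undetected fault that the third bit in Figure 2.6 illustrates.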
Figure 2.6 shows a circuit that is being simulated for three faults, c stuck-at-0, f stuck-at-1 and g stuck-at-0, with a four-bit word. To simulate the fault-free and three faulty circuits in parallel, the signal on each line is expressed as one word. The state of each bit represents the signal value in the fault-free and faulty circuits (bit 0: fault-free circuit; bit 1: circuit with c s-a-0; bit 2: circuit with f s-a-1; bit 3: circuit with g s-a-0). When a vector (a, b) = (1, 1) is applied, the output of the circuit with c s-a-0 (the second bit) and g s-a-0 (the fourth bit) differs from that of the fault-free circuit (the first bit). Hence, those faults are detected. The other fault (the third bit), which produces the same output as the fault-free circuit, is not detected by this vector.

Figure 2.6: An example of parallel fault simulation.

2.3.2 Deductive Fault Simulation

In deductive fault simulation [5], an event-driven algorithm is used, and processing an event involves simulating the good circuit and propagating lists of active faults for a given test vector. Every node in the circuit may have a large list of active faulty circuits associated with it, and fault propagation is done using set operations on the lists of active faulty circuits at the inputs of a gate. However, since an event-driven algorithm is used, fault propagation is done only if one of the active fault lists at the inputs of a gate has changed since the previous time frame.

Figure 2.7: An example of deductive fault simulation.

An example of deductive fault simulation is shown in Figure 2.7 [1]. The vector (1, 1) is applied to the circuit. First, logic simulation is carried out to determine all signal values. Next, the s-a-0 and s-a-1 faults on all lines a through g are simulated. The lists of the primary inputs just contain the respective s-a-0 faults that are active at the inputs. Their fault lists are denoted as sets, La = [a0] and Lb = [b0]. Fault lists for fan-outs c and d are obtained by adding their locally active faults to the fault list Lb of the stem, giving Lc = [b0, c0] and Ld = [b0, d0]. The fault lists for e, f and g are obtained by fault propagation, e.g., Le = La U Lc U [e0] = [a0, b0, c0, e0] and Lg = (Le - Lf) U [g0] = [a0, c0, e0, g0]. When the fault propagation is completed, four faults a s-a-0, c s-a-0, e s-a-0, and g s-a-0 are detected by the input vector (1, 1).

2.3.3 Concurrent Fault Simulation

Concurrent fault simulation [6][7] is similar to deductive fault simulation, but fault lists are propagated by evaluating individual gates, and only active faulty circuits are simulated, which reduces the execution time. Timing information can easily be incorporated, and function-level modules can be handled. However, more memory is required to store the fault lists. Fault dropping is straightforward in both deductive and concurrent fault simulation. The performance of a concurrent fault simulator can be improved if it is restricted to synchronous sequential circuits [8]. Further improvements in performance have been achieved with a parallel concurrent approach [9].

Figure 2.8: Fault lists in concurrent fault simulation.
Figure 2.8 shows that all stuck-at faults are concurrently simulated for an input vector (1, 1) [1]. The subscript notation is used for faults; thus, fault b0 means the b s-a-0 fault. Faults are modeled on all gate inputs, the primary inputs a and b, and the primary output g. To each good gate, a set of bad gates, shown in grey shade with the corresponding fault names, is attached in a linked-list structure. Signal values at the input and output of each gate are written inside the gate. At the primary output g, any bad gate whose output differs from that of the good gate indicates fault detection. Thus, faults a0, c0, e0, and g0 are detected by the test vector (1, 1).

In the deductive simulator, the fault list is associated with a signal and contains only the faults that affect that signal. In the concurrent simulator, the fault lists are associated with gates, and faults that affect the inputs of a gate are included in its list. Fault lists in a concurrent simulator are, therefore, comparatively longer.

2.3.4 Differential Fault Simulation

The differential fault simulation algorithm was proposed for synchronous sequential circuits [10], where only differences between the current and previous faulty circuits are simulated. The memory requirement is low, since only a single copy of the node values and the differences between successive faulty circuits at the flip-flops are stored. Fault dropping is more difficult, however, since the simulation of each faulty circuit depends on the previous faulty circuit.

Figure 2.9: An example of differential fault simulation: (a) first vector (1, 1); (b) second vector (1, 0); (c) third vector (0, 1).

Figure 2.9 shows the simulation of two faults (1) and (2) [1]. The vector set contains three vectors, the first of which is simulated in Figure 2.9(a). The initial state of the flip-flop is assumed to be unknown and is denoted as X. After simulating the second vector (1, 0) in Figure 2.9(b), both faults are activated. The effects of fault (1) are denoted as D(1) and those of fault (2) as D(2). Only D(1) reaches the flip-flop input and is added to the next fault list. Note that no fault has been detected so far. Figure 2.9(c) shows the simulation of the third vector (0, 1). The current fault list is updated with D(1), which propagates to the primary output. Thus, fault (1) is now detected and can be dropped. Subsequent vectors will only simulate fault (2) until it is detected.

2.3.5 PROOFS

The PROOFS fault simulator combines the features of the word-parallel, concurrent, and differential fault simulation algorithms. For each test vector, the good circuit is first simulated, and then only the differences between the good and faulty circuits are simulated. Several faulty circuits are simulated together, with one bit of the computer word used for each faulty circuit, and faults are grouped dynamically as each test vector is simulated, in order to fully utilize all bits in the computer word. To limit the memory usage, faulty circuit values are stored at the flip-flops only. Faults are dropped from the fault list once they are detected, and faults that are identified as inactive in a given time frame are not simulated. The overall algorithm of PROOFS [2] is shown in Figure 2.10.
It consists of a main loop which reads in the next input vector, performs logic simulation of the good circuit, and then simulates the faulty circuits for each fault group. To simulate a fault group, the group-id is first incremented to identify the fault group. Next, the 32 faulty circuits to be included in the fault group are selected. The faults are then injected into the circuit, and the node values for the state-nodes from the previous input vector are inserted into the faulty line values. The faulty circuits in this fault group are simulated, and the state-node values are stored for the next vector.

Figure 2.10: PROOFS algorithm.

Figure 2.11: An example of PROOFS fault simulation.

Figures 2.11(a) and 2.11(b) show a simple example in which PROOFS fault simulation is performed for each fault group. An input vector (1, 1, 1) is applied, and a word length of four bits is assumed. The fault group a s-a-0, b s-a-0, c s-a-0, and n s-a-0 is simulated as shown in Figure 2.11(a). All the propagation lines of these faults have a group-id of 1. After the first group is done, the following group g s-a-1, k s-a-1, p s-a-0, and r s-a-0 is simulated as shown in Figure 2.11(b). The lines of the circuit affected by the second fault group are updated to a group-id of 2. As shown in Figure 2.11(b), three faults are detected in group 1 and four faults in group 2. These detected faults are dropped and are not simulated with the next test vector, to avoid redundant simulation.

2.3.6 HOPE

HOPE is a PROOFS-based fault simulator [11]. It screens out faults with short propagation paths through single fault propagation. A systematic method of identifying faults with short propagation paths reduces the number of faults simulated. A significant speedup was achieved by finding representative stem faults for the faults in fan-out-free regions [11]; only the representative stem faults are placed into the fault groups and simulated in word-parallel fashion for faults whose effects do not propagate to flip-flops in the previous time frame. Single fault propagation is used to determine whether a fault in a fan-out-free region is active at the stem. Additional improvements in performance were obtained by modifying the fault injection procedure, statically ordering faults by fan-out-free region, and dynamically ordering faults to place potentially detected faults in separate fault groups [12]. The resulting fault simulator, HOPE, is about twice as fast as PROOFS, which is partially due to the improvements in implementation.
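PROOFS and HOPE share the fault-grouping outer loop of Figure 2.10. The following C skeleton is a schematic rendering of that control flow, under the assumption that the per-step routines (logic simulation, fault selection, injection, and state-node handling) are provided elsewhere; here they are all stubs, so only the loop structure is meaningful.

    #include <stdbool.h>

    #define GROUP_SIZE 32                   /* one fault per bit of a machine word */

    /* Stubs standing in for the real simulator routines. */
    bool read_next_vector(void)             { return false; }  /* no more vectors  */
    void logic_simulation(void)             { }
    int  choose_next_fault_group(void)      { return 0; }      /* faults selected  */
    void inject_faults(void)                { }
    void add_faulty_state_node_events(void) { }
    void simulate_faulty_circuits(void)     { }
    void drop_detected_faults(void)         { }
    void store_faulty_state_nodes(void)     { }

    int main(void)
    {
        while (read_next_vector()) {            /* outer loop over test vectors    */
            logic_simulation();                  /* good-circuit simulation         */
            int n;
            while ((n = choose_next_fault_group()) > 0) {  /* up to 32 faults      */
                inject_faults();
                add_faulty_state_node_events();  /* restore per-fault flip-flop state */
                simulate_faulty_circuits();      /* word-parallel evaluation        */
                drop_detected_faults();
                store_faulty_state_nodes();      /* keep state for the next vector  */
            }
        }
        return 0;
    }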
20 2.3.7 PARIS and PSF The parallel-pattern single-fault propagation algorithm [13] for combinational cir- cuits has been extended to synchronous sequential circuits in the fault simulators PARIS (PARallel Iterative Simulator) [14] and PSF (Parallel Sequence Fault simulation) [15]. For each group of 32 test vectors, the good circuit is simulated, followed by simulation of a single fault for all 32 vectors. Several iterations may be required before circuit values stabilize, due to the sequential nature of the circuits. Heuristics are used to minimize the number of iterations performed. Minor differences between PARIS [14] and PSF [15] exist. PARIS packs 32 consecutive vectors in each 32-bit computer word. PSF divides the test sequence into 32 equal subsequences and packs the nth vector of each subse- quence in a single 32-bit word, where n ranges from one to the number of vectors in a subsequence. Both PARIS and PSF simulate one fault at a time, but the number of iterations required to stabilize the circuit state differs between them considerably. The various existing approaches to sequential fault simulation were reviewed. Fur- ther improvements to these sequential algorithms have been made in the parallel algo- rithms using a cluster of workstations. 2.4 Parallel Methods A number of approaches have been proposed for parallelization of fault simulation for both combinational and sequential circuits [18]. 21 2.4.1 Circuit Parallelism The main alternative to fault partitioning is circuit partitioning, in which the cir- cuit being simulated is partitioned among available processors. Circuit-partitioned fault simulation is effectively a variant of parallel logic simulation [19]. Fault simulation based on circuit partitioning has been reported by Mueller-Thuns et al. [20] and Nelson [21] for vector-synchronous implementations on message passing machines. Ghosh [22] presents an implementation based on asynchronous logic simula- tion techniques that, while novel, falls short of achieving high efficiency. In [23], Patil et al. present a circuit-partitioned approach applicable to shared memory machines, that incorporates techniques from parallel logic simulation. A circuit is partitioned among the processors. Since the circuit is evaluated level-by-level with barrier synchronization at each level, the gates at each level are evenly distributed among the processors to bal- ance the workloads. On the encore Multimax shared-memory multiprocessor system, an average speedup of 2.16 was obtained for 8 processors, and the speedup for the ISCAS89 circuit s5378 was 3.29. 2.4.2 Algorithmic Parallelism Algorithm partitioning was proposed for concurrent fault simulation in [24][25]. A pipelined algorithm was developed, and specific functions were assigned to each proces- sor. On a Sun Sparc 2 workstation with a MIPS (Million Instruction Per Second) rating of 28.5 million, an estimated speedup of 4 to 5 was reported for 14 processors, based on software emulation of a message-passing multi-computer [25]. The limitation of this approach is that it cannot take advantage of a larger number of processors. 22 2.4.3 Fault Parallelism Fault partitioning is a simpler method for parallelizing fault simulation than other methods. In static fault partitioning, we have T test vectors that need to be simulated against F faults. A fault list is divided among available processors. Each processor simulated all faults in its partition independently. A fault parallel implementation has a good potential to achieve high speedup. 
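As a minimal illustration of static fault partitioning, the C fragment below splits a fault list into contiguous, nearly equal ranges, one per processor. The function name, the range arithmetic, and the example fault count are hypothetical and stand in for whatever fault-list representation a particular simulator actually uses.

    #include <stdio.h>

    /* Divide num_faults faults into num_procs contiguous, nearly equal blocks. */
    static void block_partition(int num_faults, int num_procs)
    {
        int base = num_faults / num_procs;
        int rem  = num_faults % num_procs;   /* first 'rem' processors get one extra */
        int start = 0;

        for (int p = 0; p < num_procs; p++) {
            int size = base + (p < rem ? 1 : 0);
            printf("processor %d: faults %d..%d\n", p, start, start + size - 1);
            start += size;
        }
    }

    int main(void)
    {
        block_partition(14, 4);   /* e.g. the 14 single stuck-at faults of Figure 2.3 */
        return 0;
    }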
However, the fault activity of each partition for a particular input vector may not be uniform across all partitions, because the fault activity depends on how the faults are partitioned. Thus, static partitioning has generally been considered ineffective in achieving high speedup, especially for a large number of processors.

Several implementations based on dynamic fault partitioning have been attempted to even out the workloads of the processors at the expense of extra interprocess communication [17][26]. In [26], the ProperPROOFS fault simulator uses both static and dynamic partitioning: static partitioning is performed by dividing the fault list by the number of available processors at the start of processing, and asynchronous fault redistribution follows. When a processor completes simulation of its existing fault list, it sends a request to another processor, selected at random, for more faults to simulate. The processor receiving a request splits its fault list to share with the requesting processor, or forwards the request to another processor at random if it has an empty fault list. When all faults are simulated, each processor terminates simulation independently. In [17][26], speedups in the range of 2.4 to 3.8 were obtained for static fault partitioning over 8 processors for the larger ISCAS89 circuits having reasonably high fault coverage (s5378 and s35932) on an Intel iPSC/860 hypercube. Duba et al. [27] report a parallel implementation of CHIEFS for which a speedup between five and six was achieved on a network of 8 Sun 3/280 file server workstations connected by a 10 Mb/s Ethernet. Markas et al. [28] report a distributed algorithm for which speedups ranged from two to six on eight workstations in a heterogeneous cluster consisting of a Sun-3/160, a Sun-3/60, a cluster of VAX 2000, and a cluster of VAX II/GPX. The performance of dynamic fault partitioning was not much better than that of static fault partitioning, due to the overheads of dynamic load balancing.

There is another drawback to both the static and dynamic fault partitioning approaches: the shortest execution time is bounded by the time to perform the good-circuit logic simulation (or simply logic simulation) on a single processor. Each processor must simulate the good circuit as well as the faulty circuits in its partition, and logic simulation on more than one processor is obviously redundant. Alternatively, if a shared-memory multiprocessor is used, the good circuit may be simulated by one processor only, but the remaining processors will be idle during that simulation, at least for the first time frame.

One observation that can be made about the fault partitioning experiments is that a higher speedup is obtained for circuits having lower fault coverage [17][26]. The potential speedup drops as the number of faults remaining to be simulated drops; circuits with low fault coverage keep more faults active throughout the simulation, which sustains a higher speedup unless the hard-to-detect faults are mostly assigned to a single processor or a few processors. The logic simulation is not parallelized in the fault partitioning approach, and therefore speedup is limited. Parallelizing the logic simulation by partitioning the circuit has been suggested, but has not been successful due to the high level of communication required between processors.

2.4.4 Pattern Parallelism

For parallelizing the logic simulation, test vector partitioning approaches have been pursued. Test vector partitioning provides a more scalable implementation, since the logic simulation is also distributed over the processors.
In SPITFIRE-0 [29], the test vectors are partitioned across the processors. This algorithm is presented as a baseline for the various test vector partitioning approaches described later. As shown in Figure 2.12, the entire fault list is allocated on each processor. Thus, each processor targets the entire list of faults using a subset of the test vectors. Each processor proceeds independently and drops the faults that it can detect. This algorithm is somewhat inefficient in that many faults are easily testable and are detected by most, if not all, of the test vector subsets; simulating these faults on all processors is a waste of time.

Figure 2.12: Test sequence partitioning in SPITFIRE-0.

2.4.5 Fault and Pattern Parallelism

Several methods that combine fault and pattern parallelism have been developed. SPITFIRE-1 [29], a synchronous two-step algorithm, filters out the easy-to-detect faults in an initial step in which both the fault set and the test set are partitioned among the processors. In the first step, each processor targets a subset of the faults using a subset of the test vectors, as illustrated in Figure 2.13 [29]. A large fraction of the faults is detected in this initial step; the undetected fault lists from the first step are then combined, and in the second step only the remaining undetected faults have to be simulated by all processors, each using the test vectors in its own partition.

Figure 2.13: Partitioning in SPITFIRE-1.

Other synchronous algorithms, SPITFIRE-2 and SPITFIRE-3, which are extensions of the SPITFIRE-1 algorithm, were presented in [31]. SPITFIRE-2, a hybrid approach, attempted to reduce the partition size used in SPITFIRE-1. The fault and pattern partitioning for the two steps of fault simulation in SPITFIRE-2 is illustrated in Figure 2.14 [31]. As can be seen from the figure, processor i (Pi) uses Vec0 and Flti in the first step of fault simulation. Since all faults are targeted in the first step using the input vectors in Vec0, there is no need to re-simulate these vectors in the second step. In the second step, Pi uses the test vector set Veci+1 and any undetected faults left at the end of the first step. The advantage of this algorithm is that the number of vectors simulated in each step is reduced by 1/(P + 1) x 100 percent as compared to SPITFIRE-1. A small additional advantage is that the faulty circuit states available for the undetected faults in the set Udti can be used for simulation with the test set Vecj in the second step of fault simulation on Pj-1. However, it is possible that the second step of this simulator may not drop as many faults as SPITFIRE-1 does, since fewer test vectors are used in the first step.

SPITFIRE-3 is a multistep pipelined synchronous algorithm which helps in overcoming the drawbacks of a single- or two-step approach. The first step of fault simulation is identical to that in SPITFIRE-1. Synchronization points are introduced in the second step, at which processors exchange information on the detected faults. This may reduce the amount of work that a processor has to do subsequently, since each processor does not need to target the faults that have been detected by other processors.
However, the synchronization points introduce barriers which may slow down parallel execution, when the load is unbalanced among processors. 27 Vec0 Vec0 Vec0 Vec0 Flt0 P0 Flt1 P1 Flt2 P2 Flt3 P3 Vec1 Vec2 Vec3 Vec4 Udt0 P0 P1 P2 P3 Udt1 P0 P1 P2 P3 Udt2 P0 P1 P2 P3 Udt3 P0 P1 P2 P3 1 st Step for Fault Simulation 2 nd Step for Fault Simulation Figure 2.14: Partitioning in SPITFIRE-2 2.5 AUSIM AUSIM is a gate-level, sequential circuit fault simulation program developed at Auburn University. AUSIM runs on the UNIX platform and targets single stuck-at faults in synchronous sequential circuits represented in the ISCAS89 benchmark format. 2.5.1 Program Configuration The AUSIM consists of six sub-commands which are three pre-processing and three main commands. The command default checks the input file if the proper naming convention is used. The command proc indicates that all file names have been specified and processing is to begin. The processing of the files begins with syntax checks of the library file and the ASL file as well as a check for subskt that makes statements to initiate 28 flattening of the hierarchy. After hierarchical flattening is complete, the data structures are loaded and a number of audits and circuit checks are performed for items such as nets with multiple driving sources, nets with no driving sources, etc. The command audit records the audits results. After pre-processing, AUSIM can begin simulation with the three commands: simul8 for logic simulation, fltgen for fault generation, and fltsim for fault simulation. The command simul8 is needed to initiate the application of the vector file (cir- cuit name.vec) to the circuit loaded into the data structures, producing the simulation output result file (curcuit name.out). The command fltgen command generates gate-level stuck-at fault lists and writes the list to the file circuit name.flt. Normally, the fltgen command produces a collapsed gate-level stuck-at fault list. The command fltsim takes input files (circuit name.out and circuit name.flt), performs simulation and produces the detected fault list (circuit name.det), potentially detected fault list (circuit name.pdt) and undetected fault list (circuit name.udt). 2.5.2 Algorithm Analysis Figure 2.15 shows the algorithm structure of sequential AUSIM. All the bits in a computer word are utilized to simulate 32 faulty circuits at once. Logic simulation is performed before simulating faulty circuits. Fault simulation consists of a main loop which reads in the next fault group. The faults are injected into the circuit and each test vector is inserted into the faulty circuit. The faulty circuits are evaluated, and state-node values are stored for the next vector. When each test vector detects all faults of the same fault group, AUSIM moves onto the next fault group. 29 Clear Circuit Read Nex t Vec t o r L o gi c Simul a ti on Write Out file Clear Circuit R e ad Next F a ult gr oup Inject Faults Make Faulty state circuit Read Nex t Vec t o r F a ult Simul a ti on Yes No Check if all faults h a ve been detected m o re vectors ? Yes mo re fault s ? No END m o re vectors ? Yes Yes No No Figure 2.15: Sequential structure of AUSIM The data structure used in AUSIM is shown in Figure 2.16, where the shaded memory blocks represent a linked list. There are three kinds of memory blocks which are gate, input and net structures of a circuit. Each gate memory block consists of a type, name, and input and output pointers. 
Each input memory block is used to control the state of the nets, which represent the interconnections among gates, as the state changes due to each single fault propagation. The fields in the input structure, logval and umask, are used to store the state of every net in the circuit. The fields in the net structure, flt and saf, store the fault information of each net in the faulty circuit.

Figure 2.16: Data structure of AUSIM.

Each of logval, umask, flt and saf consists of two 32-bit words, where a pair of bits is used to store a different faulty machine's value. A three-valued logic (0, 1 and X) is used. Two bits are used to code the three values, one in logval and the other in umask: 0 is coded as (0, 0), 1 as (1, 1), and X as (0, 1).

Figure 2.17: An example of gate evaluation in AUSIM: (a) logic simulation of the good circuit; (b) net state after fault injection.

Figure 2.17 shows an example of gate evaluation for the test vector (1, 1, 1) on the fault group (a, b, c, d). A word length of four bits is assumed. The coded logic values 0 (0, 0), X (0, 1) and 1 (1, 1) are used for a sequential circuit, but this example does not use the X value because the circuit is combinational. Figure 2.17(a) illustrates the logic simulation of the good circuit. The steady-state value of each net in the circuit is kept in a single array in logval and umask. The faulty machine values are stored in another array, flt and saf. Figure 2.17(b) represents the net state of the circuit modified due to fault injection. After every fault in the same fault group is injected and the net state is updated with the faulty state, the gate evaluation is performed.

Chapter 3
Parallelization

In this chapter, a parallel fault simulator, PAUSIM (Parallel AUSIM), which has been developed based on AUSIM [32], is described. Three versions of PAUSIM have been implemented, i.e., PAUSIM-BL (BLock partitioning), PAUSIM-CY0 (CYclic partitioning) and PAUSIM-CY1.

3.1 PAUSIM-BL

3.1.1 Task Decomposition

PAUSIM-BL (Parallel AUSIM with BLock partitioning) adopts the test vector and fault set partitioning approach for parallel processing. It performs logic simulation and fault simulation separately. Compared to SPITFIRE-1, which consists of two steps of fault simulation [31], PAUSIM-BL's design focuses on eliminating redundant work in the logic and fault simulations. In SPITFIRE-1, logic simulation of the fault-free circuit is carried out in both steps of fault simulation, resulting in redundant computation.
In order to avoid such redundancy in PAUSIM-BL, the logic simulation results are saved so that they can be referred to during the fault simulation step.

In the second step of fault simulation in SPITFIRE-1, a fault may be detected by more than one processor, since the fault space is not partitioned. That is, it is possible that a processor simulates faults which have already been detected by other processors. PAUSIM-BL avoids this possibility by assigning a disjoint set of faults to each processor in the second step, as shown in Figure 3.1. This also makes it unnecessary to filter out multiply-detected faults when the faults detected by the processors are combined following the (second step of) fault simulation.

Figure 3.1: Task decomposition in PAUSIM-BL (P: processor, Vec: test vector set, Flt: fault set, Udt: undetected fault set).

Fault simulation time on each processor is shorter on average for PAUSIM-BL than for SPITFIRE-1. Let us define a unit simulation as testing the circuit for one fault with one test vector. The number of possible unit simulations in the fault simulations would be the same regardless of whether faults or test vectors are partitioned. However, the number of unit simulations actually carried out is smaller for PAUSIM-BL than for SPITFIRE-1. First, there is no redundant simulation in PAUSIM-BL. Second, since each processor in PAUSIM-BL is assigned fewer faults with more test vectors than in SPITFIRE-1, it is more likely in PAUSIM-BL than in SPITFIRE-1 that all of the assigned faults are detected even before all possible unit simulations are tried. A processor stops fault simulation when all the faults in its assigned fault group are detected.

3.1.2 Fault Partitioning

In PAUSIM-BL, the faults in a fault list are divided into n equal-size groups when there are n processors. In the fault list of a circuit, faults are arranged in the alphanumeric order of the propagation gate and net names. Therefore, the faults in a group tend to come from contiguous parts of the circuit, and all faults related to a gate are assigned to the same processor. Figure 3.2 shows the distribution of faults in an area of the s27 benchmark circuit. Faults from the inputs of a gate are propagated on the same path to the primary outputs. Hence, the workloads for simulating the faults at different inputs of a gate are similar. If a fault group contains more hard-to-detect faults than other groups, there can be a significant load imbalance among processors.

3.1.3 Test Vector Partitioning

For combinational circuits, a test vector set may be partitioned into mutually exclusive subsets, each of which is assigned to a processor. In a sequential circuit, however, the current state generally depends on the previous state, and each processor initiates its fault simulation starting from an unknown state at some outputs. This may cause the detection of some faults to be missed. In order to eliminate or reduce unknown outputs, the test vector set may be partitioned in an overlapped manner as shown in Figure 3.3 [29].
The test vectors in the overlapped portion are mainly used for updating the unknown output states to known states rather than for fault detection, i.e., they act as initializing vectors. The optimal number of initializing vectors depends on the circuit: too many initializing vectors waste computation, while too few may lead to a low fault coverage.

Figure 3.2: Fault partitioning in PAUSIM-BL for 4 processors, where each fault group is distinguished by a different grey scale.

Figure 3.3: Test vector set partitioning.

3.1.4 Procedure

The procedure of PAUSIM-BL, which consists of logic simulation and fault simulation, is described in the flowcharts in Figures 3.4 and 3.5. The master processor broadcasts the circuit information and the entire set of test vectors to all slave processors. All processors, including the master, carry out the logic simulation with their respective partitions of test vectors. The local logic simulation results are collected by the master, which then broadcasts the combined result to all slave processors. Then the fault simulation starts. The master processor partitions the list of faults into subsets and distributes them among the processors. The fault simulation is performed on all processors with the faults disjointly distributed. At the end of the fault simulation, the slave processors report their detected and undetected faults to the master processor.

Figure 3.4: Flowchart of the master processor in PAUSIM-BL.

Figure 3.5: Flowchart of the slave processors in PAUSIM-BL.

3.2 PAUSIM-CY

3.2.1 Load Balancing

One problem in PAUSIM-BL is the potential load imbalance among processors. It is due to the fact that all the faults on a gate are assigned to the same processor and the faults assigned to a processor tend to be clustered in space. The computational requirements for faults close to each other are similar, especially for those at the same gate. Therefore, the load distribution among processors can be significantly unbalanced in PAUSIM-BL.

In order to achieve a more uniform load distribution, PAUSIM-CY0 (Parallel AUSIM with CYclic partitioning) and PAUSIM-CY1 adopt the LOG (Level Output Gate) partitioning of faults [33]. In the LOG partitioning scheme, the faults on each gate and each circuit level are assigned to processors in a cyclic fashion. In this way, for example, the hard-to-detect faults on a gate are distributed over multiple processors rather than concentrated on a single processor.
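A minimal C sketch of the cyclic assignment follows: after the fault list has been ordered, fault i simply goes to processor i mod p, so faults on the same gate or level end up on different processors. The fault structure is illustrative, the sample entries are a few lines from the s27 fault list (cf. Figure 3.6 below), and the full LOG scheme cycles within each level and gate rather than over a single global index as done here.

    #include <stdio.h>

    #define NUM_PROCS 5

    /* Illustrative fault record: gate name, connected net, stuck-at value. */
    struct fault { const char *gate; const char *net; int stuck_at; };

    int main(void)
    {
        struct fault faults[] = {
            {"G10", "G14", 0}, {"G10", "G11", 0}, {"G11", "G5", 0},
            {"G11", "G9",  0}, {"G11", "G11", 1}, {"G12", "G1", 0},
        };
        int n = sizeof faults / sizeof faults[0];

        for (int i = 0; i < n; i++) {
            int proc = i % NUM_PROCS;          /* cyclic (round-robin) assignment */
            printf("P%d <- %s %s sa%d\n", proc,
                   faults[i].gate, faults[i].net, faults[i].stuck_at);
        }
        return 0;
    }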
Figure 3.6 shows the difference between the block partitioning used in PAUSIM-BL and the LOG partitioning used in PAUSIM-CY0 and PAUSIM-CY1. Each entry of the fault list consists of four attributes: the gate name, the net name connected to the gate, an input/output indicator, and the type of single stuck-at fault. Faults are arranged in the alphanumeric order of gate names. It can be seen that the faults in a PAUSIM-BL partition come mostly from a contiguous part of the circuit, while those in a PAUSIM-CY0/CY1 partition are scattered widely.

Figure 3.6: Example of fault partitioning (block versus cyclic) of the s27 benchmark circuit for 5 processors (PE0-PE4)

3.2.2 Procedures

PAUSIM-CY0 is identical to PAUSIM-BL except for the fault partitioning. Faults are distributed among processors in a cyclic manner in PAUSIM-CY in an effort to spread the hard-to-detect faults over as many different processors as possible. This cyclic partitioning of faults may still not balance the load completely. Hence, PAUSIM-CY1 employs a two-step fault simulation for more effective load balancing, as shown in Figure 3.7. In PAUSIM-CY1, after the first step of fault simulation, each slave processor reports its undetected faults to the master, which then redistributes the undetected faults such that the load is well balanced over the slave processors.

Figure 3.7: Task decomposition in PAUSIM-CY1
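The redistribution before the second step amounts to pooling the undetected faults on the master and dealing them back out in nearly equal shares. The even split below is an assumption; the thesis does not spell out the exact redistribution rule, and the fault identifiers and processor count are purely illustrative.

    #include <stdio.h>

    /* Hypothetical sketch of the load-rebalancing step in PAUSIM-CY1: the
     * undetected faults pooled by the master are dealt back out to the slave
     * processors in nearly equal shares.                                     */
    int main(void)
    {
        const int num_undetected = 10;                 /* pooled undetected faults */
        const int undetected[10] = { 3, 8, 12, 20, 31, 44, 57, 60, 71, 90 };
        const int num_slaves = 3;                      /* P1, P2, P3 */

        int base  = num_undetected / num_slaves;
        int extra = num_undetected % num_slaves;
        int idx = 0;
        for (int s = 0; s < num_slaves; s++) {
            int share = base + (s < extra ? 1 : 0);    /* nearly equal share */
            printf("P%d gets %d faults:", s + 1, share);
            for (int k = 0; k < share; k++)
                printf(" %d", undetected[idx++]);
            printf("\n");
        }
        return 0;
    }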
3.3 Implementation

The parallel simulation programs are written in C, using MPI library functions for communication. Figures 3.8 - 3.10 illustrate the logic simulation and the two steps of fault simulation for PAUSIM-CY1, where the cluster consists of one master processor, P0, and three slave processors, P1, P2, and P3.

Communications among the processors are required when (i) the master processor broadcasts the test vectors (Ti) to the slave processors for logic simulation, (ii) the master processor distributes the faults (Fi) to the slave processors in the first step of fault simulation, (iii) the slave processors report their detected faults (DAi), undetected faults (UAi), and logic simulation results (Oi) to the master processor at the end of the first step, (iv) the master processor redistributes the undetected faults (UBi) and broadcasts the collected results (O) to the slave processors at the beginning of the second step, and (v) the slave processors report their detected (DBi) and undetected (UCi) faults to the master processor at the end of the fault simulation. These communications are implemented with MPI_Send, MPI_Recv, and MPI_Bcast.

Figure 3.8: Communication among 4 processors in logic simulation in PAUSIM-CY1 (V: test vector, O: logic simulation result)

Figure 3.9: Communication among 4 processors in the first step of fault simulation in PAUSIM-CY1 (F: fault, U: undetected fault, D: detected fault)

Figure 3.10: Communication among 4 processors in the second step of fault simulation in PAUSIM-CY1
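The communication pattern can be summarized in a short MPI skeleton in C. This is a simplified sketch rather than the PAUSIM source: the buffer layouts, counts, tags, and the total fault count are hypothetical, and the test vectors and faults are reduced to integer arrays; only the standard calls MPI_Bcast, MPI_Send, and MPI_Recv named above are used.

    #include <mpi.h>

    #define NUM_VECTORS  1600      /* as in the experiments of Chapter 4 */
    #define TOTAL_FAULTS  500      /* hypothetical total fault count     */

    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        int vectors[NUM_VECTORS];  /* stand-in for the test vector set   */
        int faults[TOTAL_FAULTS];  /* stand-in for a fault partition     */
        int nfaults = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* (i) the master broadcasts the test vectors to every processor */
        if (rank == 0)
            for (int i = 0; i < NUM_VECTORS; i++) vectors[i] = i;
        MPI_Bcast(vectors, NUM_VECTORS, MPI_INT, 0, MPI_COMM_WORLD);

        /* (ii) the master sends each slave its fault partition (cyclic split);
         * the master keeps the faults with f % nprocs == 0 for itself.       */
        if (rank == 0) {
            for (int p = 1; p < nprocs; p++) {
                int buf[TOTAL_FAULTS], n = 0;
                for (int f = p; f < TOTAL_FAULTS; f += nprocs)
                    buf[n++] = f;
                MPI_Send(buf, n, MPI_INT, p, 0, MPI_COMM_WORLD);
            }
        } else {
            MPI_Status st;
            MPI_Recv(faults, TOTAL_FAULTS, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_INT, &nfaults);
            /* step-1 fault simulation over faults[0..nfaults-1] would run here */
        }

        /* (iii)-(v) the detected/undetected fault lists and the logic simulation
         * results travel back and forth with MPI_Send/MPI_Recv in the same style. */
        MPI_Finalize();
        return 0;
    }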
Chapter 4

Performance

4.1 Experimental Environment

The four parallel schemes PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0, and PAUSIM-CY1, described in the previous chapter, have been implemented. PAUSIM-SF refers to the parallel implementation which uses the task partitioning scheme of SPITFIRE-1 [31]. PAUSIM-BL and PAUSIM-CY are the proposed parallel testing schemes described in Chapter 3.

Table 4.1: Fault coverage statistics using 1600 random vectors on a single processor

Circuit  #Fault  #Gate  #FF  #PI  #PO  #Det.fault  Coverage
s1196     1250    388    18   14   14     1042      83.4%
s1423     1663    490    74   17    5      517      31.1%
s1512     1411    413    57   29   21       46       3.3%
s3271     3438   1035   116   26   14     3103      90.3%
s5378     4961   1004   179   35   49     2945      59.4%

Table 4.1 shows the characteristics of the benchmark circuits and the fault coverage on a single processor. Note that s1196 and s3271 have a high fault coverage, while s1423 and s1512 contain many hard-to-detect faults. The four schemes were implemented on a cluster of sixteen Sun Ultra 5 workstations interconnected by a 100-Mbps Ethernet. Results are provided for the five circuits s1196, s1423, s1512, s3271, and s5378, taken from the ISCAS89 benchmark suite. Logic and fault simulations were done with 1600 random test vectors. The overlap in the test vector partitioning was 25 test vectors.

Table 4.2: Execution time (seconds) and speedup using 1600 random vectors on 4, 8, 12, and 16 processors

                           Execution time                     Speedup
Circ.   Seq. time  Alg.      4       8      12      16       4     8    12    16
s1196     191.6    SF      131.4    58.6    52.9    50.2    1.5   3.3   3.6   3.8
                   BL      119.1    52.9    44.8    36.0    1.6   3.6   4.3   5.3
                   CY0      63.5    42.8    33.1    27.6    3.0   4.5   5.8   6.9
                   CY1      56.9    36.3    28.0    27.4    3.4   5.3   6.8   7.0
s1423     489.7    SF      180.5    92.3    92.0    78.1    2.7   5.3   5.3   3.8
                   BL      179.1    89.7    79.2    77.2    2.7   5.5   6.2   6.3
                   CY0     147.0    78.3    65.6    52.3    3.3   6.3   7.5   9.4
                   CY1     140.9    77.5    56.7    45.3    3.5   6.3   8.6  10.8
s1512     436.9    SF      135.3    87.6    71.7    61.7    3.2   5.0   6.1   7.1
                   BL      134.4    85.3    67.0    47.0    3.3   5.1   6.5   9.3
                   CY0     133.2    80.1    53.9    46.4    3.3   5.5   8.1   9.4
                   CY1     127.3    73.9    53.4    47.0    3.4   5.9   8.2   9.3
s3271    1077.1    SF     1056.1   696.1   383.3   293.3    1.1   1.5   2.8   3.7
                   BL      972.2   502.8   345.6   271.8    1.1   2.1   3.1   4.0
                   CY0     516.6   294.4   274.1   204.4    2.1   3.7   3.9   5.3
                   CY1     378.3   171.3   142.4   137.8    2.8   6.3   7.6   7.8
s5378    9898.4    SF     4701.3  2220.0  2089.3  1511.9    2.1   4.5   4.7   6.5
                   BL     4602.3  1951.2  1724.6  1333.2    2.1   5.1   5.7   7.4
                   CY0    3041.1  1471.6  1206.4   952.3    3.3   6.7   8.2  10.4
                   CY1    2486.9  1266.8   849.7   675.6    4.0   7.8  11.6  14.7

4.2 Results

In Table 4.2, the execution time and speedup achieved by PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0, and PAUSIM-CY1 on the cluster are provided for the five circuits of Table 4.1. The same fault coverage as in the sequential (uniprocessor) simulation was obtained, except for s1423 where one fewer fault was detected than in the sequential result. This minor difference between the sequential and parallel simulations is most probably due to the loss of state information at the beginning of the parallel simulation, caused by the test vector partitioning. When the test vector overlap was increased, there was no difference in fault coverage between the sequential and parallel simulations.
Figure 4.1: Execution time for 4 processors - small circuits (s1196, s1423, s1512)

Figure 4.2: Execution time for 4 processors - large circuits (s3271, s5378)

Figure 4.3: Execution time for 8 processors - small circuits (s1196, s1423, s1512)

Figure 4.4: Execution time for 8 processors - large circuits (s3271, s5378)

Figure 4.5: Execution time for 16 processors - small circuits (s1196, s1423, s1512)

Figure 4.6: Execution time for 16 processors - large circuits (s3271, s5378)

It can be seen that the proposed approach to task decomposition achieves a shorter execution time (or, equivalently, a higher speedup) than the existing approaches. As also shown in Figures 4.1 - 4.6, PAUSIM-BL performs substantially better than PAUSIM-SF, and PAUSIM-CY0 improves over PAUSIM-BL through the cyclic partitioning of faults instead of the block partitioning. As expected, PAUSIM-CY1 outperforms all the other algorithms significantly, mainly due to the two-step load balancing. It is also observed that the reduction in execution time is larger for a larger circuit; a larger circuit has more room for improvement by a better load balancing scheme.

In Figures 4.7 - 4.18, execution time, speedup, and efficiency are plotted as functions of the number of processors employed in the parallel simulation. As the number of processors increases, the execution time decreases monotonically in all cases. One thing to note is that the improvement gained by using more processors tends to saturate in some cases of PAUSIM-SF, as is clearly seen in the figures. This is mainly due to its ineffective task partitioning. In the proposed algorithms, such saturation is either not observed or much weaker than in PAUSIM-SF. It is also to be noted that the efficiency achieved by the proposed algorithms, especially PAUSIM-CY1, is very high in almost all cases. In addition to the effective parallelization in PAUSIM-CY1, the long simulation time (compared to the communication overhead) is another factor contributing to such high efficiency.
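For reference, the speedup and efficiency plotted in Figures 4.11 - 4.18 follow the usual definitions; the numerical example below is taken from Table 4.2 (s5378, PAUSIM-CY1, 16 processors).

\[
S_p \;=\; \frac{T_1}{T_p}, \qquad E_p \;=\; \frac{S_p}{p},
\]

where $T_1$ is the sequential execution time and $T_p$ the parallel execution time on $p$ processors. For example,

\[
S_{16} \;=\; \frac{9898.4}{675.6} \;\approx\; 14.7, \qquad E_{16} \;=\; \frac{14.7}{16} \;\approx\; 0.92 .
\]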
Figure 4.7: Execution times for PAUSIM-SF

Figure 4.8: Execution times for PAUSIM-BL

Figure 4.9: Execution times for PAUSIM-CY0

Figure 4.10: Execution times for PAUSIM-CY1

Figure 4.11: Speedup for PAUSIM-SF

Figure 4.12: Speedup for PAUSIM-BL

Figure 4.13: Speedup for PAUSIM-CY0

Figure 4.14: Speedup for PAUSIM-CY1

Figure 4.15: Efficiency for PAUSIM-SF

Figure 4.16: Efficiency for PAUSIM-BL
Figure 4.17: Efficiency for PAUSIM-CY0

Figure 4.18: Efficiency for PAUSIM-CY1

In Tables 4.3 and 4.4, the minimum, maximum, mean, standard deviation, and normalized standard deviation of the execution time over eight processors are provided in order to compare the four algorithms in terms of load balancing. A larger standard deviation of the execution time among processors indicates a larger load imbalance among them. The logic simulation is a small fraction of the entire simulation and is well balanced over the processors. What is to be noticed in the tables is that the execution time varies significantly from processor to processor in the case of PAUSIM-SF and PAUSIM-BL, leading to a relatively large standard deviation (and a large difference between the maximum and minimum execution times) in most cases. That is, the load is not well balanced in PAUSIM-SF and PAUSIM-BL. PAUSIM-CY1, however, greatly reduces the load imbalance and thereby achieves a significant performance improvement.

In Figures 4.19 - 4.26, the execution times, including communication times, on individual processors are plotted for a more detailed examination of the load distribution among processors. They confirm that PAUSIM-CY1 balances the load over the processors well, at the expense of extra communication, such that the overall parallel execution time is significantly reduced.

Table 4.3: Mean and standard deviation of execution times (seconds) on eight processors for PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0, and PAUSIM-CY1: s1196, s1423, and s1512

Circuit  Algorithm    Step   Min.T.  Max.T.  Mean T.  Stdv   Norm. Stdv
s1196    PAUSIM-SF    LSIM     0.8     1.1     0.9    0.10     0.12
                      FSIM    18.2    43.3    36.9    5.13     0.14
         PAUSIM-BL    LSIM     0.8     1.1     0.9    0.10     0.12
                      FSIM    15.1    36.1    32.5    7.07     0.22
         PAUSIM-CY0   LSIM     0.8     0.9     0.8    0.05     0.06
                      FSIM    18.8    27.1    22.6    3.74     0.17
         PAUSIM-CY1   LSIM     0.8     0.9     0.83   0.05     0.06
                      FSIM    16.1    17.4    16.6    0.51     0.03
s1423    PAUSIM-SF    LSIM     1.2     1.3     1.3    0.05     0.04
                      FSIM    71.1    86.3    77.0    5.17     0.07
         PAUSIM-BL    LSIM     1.2     1.3     1.3    0.05     0.04
                      FSIM    73.5    84.0    77.3    4.36     0.06
         PAUSIM-CY0   LSIM     1.2     1.3     1.3    0.05     0.04
                      FSIM    60.9    72.4    66.1    3.97     0.06
         PAUSIM-CY1   LSIM     1.2     1.3     1.3    0.05     0.04
                      FSIM    60.5    69.9    66.0    3.06     0.05
s1512    PAUSIM-SF    LSIM     1.3     1.3     1.3    0.00     0.00
                      FSIM    59.9    82.1    68.5    4.82     0.07
         PAUSIM-BL    LSIM     1.3     1.3     1.3    0.00     0.00
                      FSIM    63.0    77.9    66.5    4.89     0.07
         PAUSIM-CY0   LSIM     1.3     1.3     1.3    0.00     0.00
                      FSIM    64.0    76.7    71.0    3.65     0.05
         PAUSIM-CY1   LSIM     1.3     1.4     1.4    0.1      0.10
                      FSIM    65.7    68.0    66.3    1.25     0.02
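The normalized standard deviation in Tables 4.3 and 4.4 is, presumably, the standard deviation of the per-processor execution times divided by their mean (the coefficient of variation); the table entries are consistent with this reading, up to the choice of divisor in the standard deviation. With $T_i$ the execution time on processor $i$ and $p = 8$ processors,

\[
\overline{T} \;=\; \frac{1}{p}\sum_{i=1}^{p} T_i, \qquad
\sigma \;=\; \sqrt{\frac{1}{p}\sum_{i=1}^{p}\bigl(T_i-\overline{T}\bigr)^2}, \qquad
\sigma_{\mathrm{norm}} \;=\; \frac{\sigma}{\overline{T}} .
\]

For instance, for s1196 with PAUSIM-SF (FSIM step), $\sigma/\overline{T} = 5.13/36.9 \approx 0.14$, which matches the table entry.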
Table 4.4: Mean and standard deviation of execution times (seconds) on eight processors for PAUSIM-SF, PAUSIM-BL, PAUSIM-CY0, and PAUSIM-CY1: s3271 and s5378

Circuit  Algorithm    Step    Min.T.   Max.T.   Mean T.   Stdv    Norm. Stdv
s3271    PAUSIM-SF    LSIM      4.1      4.4      4.3      0.10      0.02
                      FSIM    345.3    684.8    477.7    110.67      0.23
         PAUSIM-BL    LSIM      4.1      4.4      4.3      0.10      0.02
                      FSIM    205.4    490.8    377.2    102.51      0.27
         PAUSIM-CY0   LSIM      4.1      4.4      4.3      0.10      0.02
                      FSIM     80.4    278.3    155.8     63.59      0.41
         PAUSIM-CY1   LSIM      4.1      4.4      4.3      0.09      0.02
                      FSIM    141.4    158.7    152.2      5.17      0.03
s5378    PAUSIM-SF    LSIM     11.3     11.9     11.7      0.19      0.02
                      FSIM   1711.4   2185.9   1923.3    132.92      0.07
         PAUSIM-BL    LSIM     11.4     11.9     11.7      0.16      0.01
                      FSIM   1586.9   1912.6   1827.0    122.00      0.07
         PAUSIM-CY0   LSIM     11.5     11.9     11.7      0.15      0.01
                      FSIM   1103.0   1438.6   1246.0    110.49      0.09
         PAUSIM-CY1   LSIM     11.3     11.9     11.7      0.19      0.02
                      FSIM   1197.9   1229.0   1219.8     13.36      0.01

Figure 4.19: Workload distribution: s3271 benchmark circuit, PAUSIM-SF (8 processors)

Figure 4.20: Workload distribution: s3271 benchmark circuit, PAUSIM-BL (8 processors)

Figure 4.21: Workload distribution: s3271 benchmark circuit, PAUSIM-CY0 (8 processors)

Figure 4.22: Workload distribution: s3271 benchmark circuit, PAUSIM-CY1 (8 processors)

Figure 4.23: Workload distribution: s3271 benchmark circuit, PAUSIM-SF (16 processors)
Figure 4.24: Workload distribution: s3271 benchmark circuit, PAUSIM-BL (16 processors)

Figure 4.25: Workload distribution: s3271 benchmark circuit, PAUSIM-CY0 (16 processors)

Figure 4.26: Workload distribution: s3271 benchmark circuit, PAUSIM-CY1 (16 processors)

Chapter 5

Conclusion

Digital circuit testing, including fault simulation, is computationally intensive and is therefore a good target application for parallel computing. In this thesis, efficient parallel fault simulation algorithms have been designed and implemented on a cluster of workstations. The proposed algorithms eliminate redundant simulation and reduce the actual number of unit simulations by partitioning the faults among processors while assigning the entire test vector set to all processors. Two of them, PAUSIM-CY0 and PAUSIM-CY1, achieve a better load distribution by adopting the cyclic partitioning of faults. In particular, PAUSIM-CY1 further improves the load distribution by allowing a load redistribution during fault simulation. The experimental results obtained on the 16-node cluster have demonstrated that the proposed parallel fault simulation algorithms can achieve a significantly shorter execution time than the existing algorithms.

Bibliography

[1] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing, Kluwer Academic Publishers.

[2] T. M. Niermann, W.-T. Cheng, and J. H. Patel, "PROOFS: A fast, memory-efficient sequential circuit fault simulator," IEEE Trans. Computer-Aided Design, pp. 198-207, February 1992.

[3] D. Harel and B. Krishnamurthy, "Is there hope for linear time fault simulation," Proc. Fault Tolerant Computing Symp., pp. 28-33, June 1987.

[4] S. Seshu, "On an improved diagnosis program," IEEE Trans. Electronic Computing, vol. EC-14, pp. 76-79, February 1965.

[5] D. B. Armstrong, "A deductive method for simulating faults in logic circuits," IEEE Transactions on Computers, vol. 21, pp. 462-471, May 1972.

[6] E. G. Ulrich and T. Baker, "The concurrent simulation of nearly identical digital networks," Proc. Tenth Design Automation Workshop, vol. 6, pp. 145-150, 1973.
[7] P. Goel, H. Lichaa, T. E. Rosser, T. J. Stroh, and E. B. Eichelberger, "LSSD fault simulation using conjunctive combinational and sequential methods," Proc. Int. Test Conf., pp. 371-376, 1980.

[8] K. Kim and K. K. Saluja, "On fault deletion problem in concurrent fault simulation for synchronous sequential circuits," Proc. VLSI Test Symp., pp. 125-130, 1992.

[9] D. G. Saab, "Parallel-concurrent fault simulation," IEEE Trans. VLSI Systems, vol. 1, no. 3, pp. 356-364, September 1993.

[10] W.-T. Cheng and M.-L. Yu, "Differential fault simulation - a fast method using minimal memory," Proc. Design Automation Conf., pp. 424-428, 1989.

[11] H. K. Lee and D. S. Ha, "HOPE: An efficient parallel fault simulator for synchronous sequential circuits," Proc. Design Automation Conf., pp. 336-340, 1992.

[12] H. K. Lee and D. S. Ha, "New methods of improving parallel fault simulation in synchronous sequential circuits," Proc. Int. Conf. Computer-Aided Design, pp. 10-17, 1993.

[13] J. A. Waicukauski, E. B. Eichelberger, D. O. Forlenza, E. Lindbloom, and T. McCarthy, "Fault simulation for structured VLSI," VLSI System Design, pp. 20-32, December 1985.

[14] N. Gouders and R. Kaibel, "PARIS: A parallel pattern fault simulator for synchronous sequential circuits," Proc. Int. Conf. Computer-Aided Design, pp. 542-545, 1991.

[15] C.-P. Kung and C.-S. Lin, "Parallel sequence fault simulation for synchronous sequential circuits," Proc. European Conference on Design Automation (EDAC-92), pp. 434-438, March 1992.

[16] R. Nair and D. S. Han, "Vision: An efficient parallel pattern fault simulator for synchronous sequential circuits," Proc. 13th IEEE VLSI Test Symposium (VTS-95), p. 221, 1995.

[17] M. B. Amin and B. Vinnakota, "ZAMBEZI: A parallel pattern parallel fault sequential circuit fault simulator," Proc. VLSI Test Symp., pp. 438-443, 1996.

[18] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design. Englewood Cliffs, NJ: PTR Prentice Hall, 1994.

[19] L. Soule and T. Blank, "Parallel logic simulation on general purpose machines," Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 166-171, June 1988.

[20] R. B. Mueller-Thuns, D. G. Saab, R. F. Damiano, and J. A. Abraham, "Portable parallel logic and fault simulation," Digest of Papers, International Conference on Computer-Aided Design, pp. 506-509, November 1989.

[21] J. F. Nelson, "Deductive fault simulation on hypercube multiprocessors," Proceedings of the 9th AT&T Conference on Electronic Testing, October 1987.

[22] S. Ghosh, "NODIFS: A novel, distributed circuit partitioning based algorithm for fault simulation of combinational and sequential digital designs on loosely coupled parallel processors," Tech. Rep., LEMS, Division of Engineering, Brown University, Providence, RI, 1991.

[23] S. Patil, P. Banerjee, and J. Patel, "Parallel test generation for sequential circuits on general purpose multiprocessors," Proceedings of the 28th ACM/IEEE Design Automation Conference, San Francisco, CA, June 1991.

[24] P. Agrawal, V. D. Agrawal, K. T. Cheng, and R. Tutundjian, "Fault simulation in a pipelined multiprocessor system," Proc. Int. Test Conf., pp. 727-734, 1989.

[25] S. Bose and P. Agrawal, "Concurrent fault simulation of logic gates and memory blocks on message passing multicomputers," Proc. Design Automation Conf., pp. 332-335, 1992.

[26] S. Parkes, P. Banerjee, and J. Patel, "A parallel algorithm for fault simulation based on PROOFS," Proc. Int. Conf. Computer Design, pp. 616-621, 1995.
[27] P. A. Duba, R. K. Roy, and J. A. Rogers, "Fault simulation in a distributed environment," Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 686-691, June 1988.

[28] T. Markas, M. Royals, and N. Kanopoulos, "On distributed fault simulation," IEEE Computer, vol. 7, pp. 40-52, January 1990.

[29] E. M. Rudnick and J. H. Patel, "Overcoming the serial logic simulation bottleneck in parallel fault simulation," Proc. 10th Int. Conf. VLSI Design, pp. 495-501, 1997.

[30] M. B. Amin and B. Vinnakota, "Zamlog: A parallel algorithm for fault simulation based on Zambezi," Proc. Int. Conf. Computer-Aided Design, pp. 509-512, 1996.

[31] D. Krishnaswamy, E. M. Rudnick, J. H. Patel, and P. Banerjee, "SPITFIRE: Scalable parallel algorithms for test set partitioned fault simulation," Proc. 15th IEEE VLSI Test Symposium, 1997.

[32] C. E. Stroud, "Using the Workstation Version of AUSIM - Version 2.1."

[33] M. B. Amin and B. Vinnakota, "Data parallel fault simulation," IEEE Trans. VLSI Systems, vol. 7, no. 2, June 1999.