Soft Error Rate Determination for Nanometer CMOS VLSI Circuits

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

Fan Wang

Certificate of Approval:
Vishwani D. Agrawal, Chair, James J. Danaher Professor, Electrical & Computer Engineering
Fa Foster Dai, Professor, Electrical & Computer Engineering
Victor P. Nelson, Professor, Electrical & Computer Engineering
Joe F. Pittman, Interim Dean, Graduate School

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
May 10, 2008

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Vita

Fan Wang, son of Taiguo Wang and Yuhua Lu, was born on February 2, 1983 in Yunxian, Hubei Province, P. R. China. In 1998, he entered Shiyan No.1 Middle School. He joined Wuhan University of Technology in 2001 and graduated with a Bachelor of Engineering degree in Electronic Information Engineering in 2005. In August of the same year he entered the Electrical & Computer Engineering Department at Auburn University, Alabama, for graduate study.

Thesis Abstract

Fan Wang
Master of Science, May 10, 2008 (B.S., Wuhan University of Technology, 2005)
108 Typed Pages
Directed by Vishwani D. Agrawal

Nanometer CMOS VLSI circuits are highly sensitive to soft errors due to environmental causes such as cosmic radiation and high-energy particles.
These errors are random and not related to permanent hardware faults. Their causes may be internal (e.g., interconnect coupling) or external (e.g., cosmic radiation). Nowadays, the term soft error, also known as single event upset (SEU), specifically denotes a radiation error caused in a microelectronic circuit when high-energy particles strike sensitive regions of the silicon devices. Soft error rate (SER) estimation analytically predicts the effects of cosmic radiation and high-energy particle strikes in integrated circuit chips by building SER models. An accurate analysis requires simulation using the circuit netlist, device characteristics, manufacturing process and technology parameters, and measurement data on environmental radiation. Experimental SER testing is expensive, and analytical approaches are therefore beneficial.

We model neutron-induced soft errors using two parameters, namely, occurrence rate and intensity. Our new SER estimation analysis propagates the occurrence rate, expressed as a probability, and the intensity, expressed as a probability density function of the single event transient (SET) pulse width, through the circuit. We consider the entire linear energy transfer (LET) range of the background radiation, which is available from measurement data specific to the environment and device material. Soft error rates are calculated for ISCAS85 benchmark circuits in the standard unit, failures in time (FIT, i.e., failures in $10^{9}$ hours). In comparison to the SER analysis results reported in the literature, our method considers several more relevant factors, including sensitive regions and circuit technology, which may influence the SER. Our simulation results for ISCAS85 benchmark circuits show a trend similar to other reported work. For example, our soft error rate results for C432 and C499 in a ground-level environment are $1.18 \times 10^{3}$ FIT and $1.41 \times 10^{3}$ FIT, respectively.
Although no measured data are available for logic circuits, SERs for 0.25 µm and 0.13 µm 1M-bit SRAMs have been reported in the range $10^{4}$ to $10^{5}$ FIT, and for a 0.25 µm 1G-bit SRAM around $4.2 \times 10^{3}$ FIT. We also discuss the factors that may cause a difference of several orders of magnitude between our results and those of certain other logic analysis methods. The CPU time of our analysis is acceptably low. For example, for the C1908 circuit with 880 gates, the analysis takes only 1.14 seconds. Because we propagate the error pulse width density information to the primary outputs of the logic circuit, SER reduction schemes such as time or space redundancy can be evaluated.

This thesis also proposes a possible soft error reduction technique by hardware redesign involving circuit board reorientation. The basic idea is that particles with LET smaller than the critical LET will not be able to cause an error if the angle of incidence is smaller than some critical angle. A proper orientation of hardware circuit boards may therefore reduce the soft error rate.

Acknowledgments

First, I would like to sincerely express my appreciation to my adviser, Dr. Vishwani D. Agrawal, for his constant support. Without his patient guidance and encouragement, this work would not have been possible. His technical advice made my master's studies a meaningful learning experience. I also want to thank my advisory committee members, Dr. Fa Foster Dai and Dr. Victor P. Nelson, for being on my thesis committee and for their invaluable advice on this research. Appreciation is expressed to all research colleagues at Auburn who have helped me in the course of my research work. I thank Gefu Xu, Yuanlin Lu, Nitin Yogi, Kalyana Kantipudi, Jins Alexander, Khushboo Sheth and Wei Jiang for all the helpful discussions throughout this research and for providing a refreshing working environment in the department.
Finally, and equally important, I acknowledge with gratitude and affection the encouragement and support given by my parents during my graduate study. I also thank all my family members and friends for their support and concern. Special thanks to my wife Jingyun Li, who has always been with me throughout the struggles and challenges of my graduate study at Auburn.

Style manual or journal used: LaTeX: A Document Preparation System by Leslie Lamport, together with the style known as "aums".

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style file aums.sty. The images and plots were generated using Microsoft® Office Visio 2007, SmartDraw 6 and Microsoft® Office Excel 2003.

Table of Contents

List of Figures
List of Tables
1 Introduction
1.1 Problem Statement
1.2 Contribution of Research
1.3 Thesis Organization
2 Background
2.1 What is Soft Error
2.2 A Historical Note on Soft Errors
2.3 Radiation Environment Overview
2.3.1 Radiation Types
2.3.2 Terrestrial Radiation Environment
2.4 How Soft Error Occurs in Silicon
2.4.1 Radiation Mechanisms in Semiconductors
2.4.2 Sensitive Regions in Silicon Devices
2.4.3 Single Event Transient (SET)
2.5 An Overview of Soft Error Mitigation Techniques
2.5.1 Prevention Techniques
2.5.2 Recovery Techniques
2.6 IBM eServer z990: A Case Study
2.7 Traditional SER Testing Methods
2.8 Collected SER Field Test Data
3 Previous Work
3.1 Figure of Merit Model for Geosynchronous SER
3.2 Computer-Based Programs
3.3 Analytical Models
4 Environment-Based Probabilistic Soft Error Model
4.1 Gate-Level SET Propagation
4.1.1 Pulse Widths Probability Density Propagation
4.1.2 Logic SEU Probability Propagation
4.2 Experimental Results
4.3 Conclusion
5 Results Comparison and Discussion
5.1 Experimental Results
5.2 Discussion of Results
5.3 Conclusion
6 Soft Error Considerations in Computer Web Servers
6.1 Soft Error Reduction in Industrial Servers
6.2 A Proposed Direction
6.3 Conclusion
7 Conclusion
Bibliography
Appendices
A Terms and Definitions
B Units and Conversion Factors

List of Figures

2.1 Sunspot numbers (y-axis) during solar cycles 19 through 23 recorded by the Solar Influences Data Center (SIDC) in Belgium [9].
2.2 Neutron flux versus altitude showing a peak at about 60,000 ft [139].
2.3 Neutron flux as a function of altitude and latitude [4].
2.4 Fission of $^{10}$B induced by the capture of a neutron (a common occurrence in SRAMs) [26].
2.5 Interaction of a high-energy neutron and a silicon integrated circuit [6].
2.6 Schematic representation of charge collection in a silicon junction immediately after (a) an ion strike, (b) prompt (drift) collection, (c) diffusion collection, and (d) the junction current induced as a function of time [29].
2.7 Schematic of the charge collection mechanism when an ionizing particle strikes an electronic junction [149].
2.8 A schematic view of how an SEE-induced current pulse translates into a voltage pulse in a CMOS inverter.
2.9 Error correction using duplication, (a) space redundancy structure, (b) time redundancy structure, and (c) C-element [127].
2.10 Typical test setup (hardware) for neutron-accelerated SER testing [86].
2.11 Traditional SER field test parameters.
2.12 Soft error rates as a function of IC process technology [7].
4.1 Probability of soft error for each collision of a 30MeV neutron as a function of the average critical charge for an SRAM chip (from SEMM program [172]).
4.2 Transforming statistical neutron energy spectrum to SET width statistics.
4.3 Proposed probabilistic neutron-induced soft error model for logic.
4.4 Comparison of proposed model and HSPICE simulation for CMOS inverter with 10fF load capacitance.
4.5 Pulse width density propagation through a CMOS inverter with 10fF load.
4.6 A generic gate with particle strike on node 1.
6.1 Three perpendicular orientations for exposing a transistor and particle angle of incidence [158].
List of Tables

1.1 Commodity flash memory reliability requirements (ITRS).
2.1 Measured failure rate on SRAM-based FPGA applications due to neutron effects in 130nm technology (Actel) [6].
2.2 Projected failure rate on SRAM-based FPGA applications due to neutron effects in 90nm technology (Actel) [6].
2.3 Mass, charge and radius of particles of interest in radiation effects [70].
2.4 Sample EDAC methods for memory or data devices [106].
2.5 Accelerated testing versus real-time testing [128].
2.6 Recently reported data on soft error rates.
4.1 Output 1 probability calculation for n-input Boolean gates.
4.2 SER results for ISCAS85 benchmark circuits.
4.3 SER results for inverter chains.
5.1 Experimental results for ISCAS85 benchmark circuits.
5.2 Comparison of our work with other SER estimation methods.
6.1 Typical server system reliability goals [34].

Chapter 1 Introduction

From the beginning of recorded history, man has believed in the influence of heavenly bodies on life on Earth. Machines, electronics included, are considered scientific objects whose fate is controlled by man. So, in spite of knowing the exact date and time of its manufacture, we do not draft a horoscope for a machine. Lately, however, we have started noticing certain behaviors in state-of-the-art electronic circuits whose causes are traced to external sources, including celestial bodies outside our Earth. The Single Event Upset (SEU) phenomenon, as this non-permanent (i.e., random or soft) error behavior in digital systems is termed, affects modern nanotechnology electronic devices. We believe SEU will assume greater importance in the future [113].
We begin this introduction with a definition:

"Single Event Upset (SEU): Radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs." (NASA Thesaurus [13])

Continuous downscaling of CMOS technologies has resulted in clock frequencies reaching the multi-GHz range, supply voltages decreasing below one volt, and load capacitances of circuit nodes dropping to femtofarads. Consequently, microelectronic systems are more vulnerable to noise sources in the working environment. Nanotechnology therefore makes meeting reliability requirements highly challenging. Well-known noise sources include power supply fluctuations, lightning and electrostatic discharge, interconnect coupling capacitance and inductance, and thermal radiation from the galaxy, radio-emitting stars and atmospheric gases. A recent study shows that hard failure mechanisms exhibit product failure rates on the order of 1 to 100 FIT [186] (failures in $10^{9}$ hours; see Appendix B for the definition of FIT). In contrast, the soft error rate of a low-voltage embedded SRAM can easily be 1000 FIT/Mbit [28].

Table 1.1: Commodity flash memory reliability requirements (ITRS).

Year                       2007                   2010                   2013                   2016
Density (megabit)          1024                   2048                   4096                   8192
Maximum data rate (MHz)    166                    200                    250                    300
MTTF (hours)               4020                   4654                   5388                   6237
FIT = $10^{9}$/MTTF        $2.487 \times 10^{5}$  $2.149 \times 10^{5}$  $1.856 \times 10^{5}$  $1.603 \times 10^{5}$

Electronics applications continue to demand higher reliability levels [79]. The 2002 International Technology Roadmap for Semiconductors (ITRS), in its difficult test challenges report (www.itrs.net/Links/2002Update/2002UpdateTest.pdf), gives the reliability requirements in mean time to failure (MTTF) for commodity flash memory as shown in Table 1.1.
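As a quick check on Table 1.1, the FIT row can be recomputed from the MTTF row using the relation FIT = $10^{9}$/MTTF given in the table. The Python sketch below does this; the names `fit_from_mttf` and `TABLE_1_1_MTTF` are introduced here purely for illustration and do not come from the ITRS report.

```python
def fit_from_mttf(mttf_hours: float) -> float:
    """Return failures in 10^9 device-hours (FIT) for a given MTTF in hours."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return 1e9 / mttf_hours


# MTTF targets in hours, keyed by year, copied from Table 1.1.
TABLE_1_1_MTTF = {2007: 4020, 2010: 4654, 2013: 5388, 2016: 6237}

# Recompute the FIT row of the table from the MTTF row.
fit_by_year = {year: fit_from_mttf(mttf) for year, mttf in TABLE_1_1_MTTF.items()}

for year, fit in fit_by_year.items():
    print(f"{year}: {fit:.3e} FIT")
```

The recomputed values agree with the FIT row of Table 1.1 to the four significant figures shown there.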
We notice that the maximum data rate and density are expected to increase, stressing the reliability requirements.

Obtaining sufficient information on microchip reliability before the chip is manufactured, especially with respect to soft errors due to alpha particles or cosmic rays, has become increasingly important for chip designers. Most integrated circuits are tested at particle accelerators for their susceptibility to single event effects (SEE). The soft error rate represents the circuit susceptibility, and it is typically estimated by accelerated testing. The purpose of accelerated life tests is not to expose defects, but to identify and quantify the failures and failure mechanisms that cause products to wear out before the end of their useful life [52]. Unfortunately, accelerated life testing is time consuming because multiple runs are normally needed to get a sufficient number of samples under test to fail for the data to be statistically meaningful. The test time typically varies from a few weeks to a few months. The SER results will not be available until almost a year after the first chips start coming out of the fab. This long delay is generally unacceptable. One alternative is the costly path of testing many more chips with a bigger test facility. Another is to test the chips in a more sensitive state that deviates from nominal conditions, e.g., at reduced voltage, where microchips are more sensitive to radiation. However, low-voltage testing has too many pitfalls to be used with confidence [199, 204].

A soft error is defined as a faulty signal state in a microelectronic circuit caused by charged particles striking sensitive regions in silicon devices [164]. Soft errors in memories (SRAM and DRAM) were extensively studied at the end of the twentieth century [62].
Because memories have a high density of components and integrate a large number of storage elements, they are more sensitive to soft errors than logic circuits. Soft error rates in logic and processors are increasing along with the trend of feature size downscaling [73, 105]. In addition, if other circuit noise sources such as interconnect coupling and ground bounce are considered as soft errors, the logic FIT rate is expected to increase faster, and the FIT rate in logic is eventually likely to become comparable to the FIT rate in memories [94].

The SER due to high-energy neutrons has been studied in SRAM cells, latches, and logic circuits for feature sizes from 600nm to 50nm. SER per chip for logic circuits is expected to increase by nine orders of magnitude from 1992 to 2011, becoming comparable to the failure rates in unprotected on-chip memories [169].

1.1 Problem Statement

It is very costly to determine the SER of a real chip by accelerated life testing. Experimental measurements of neutron flux at sea level over the past half century have shown a large variance. The neutron flux varies with the location of the test and with time-dependent solar activity [60, 141]. The buildings in which these experiments are located pose another difficulty because different mixtures of concrete have different shielding effects on cosmic rays. These difficulties can make the measured SER of an SRAM or DRAM vary by up to 100X, even when tested at the same location. Until 2002, there was no comprehensive model to reliably evaluate the soft error rate of a device [202]. An accurate prediction of SER needs SER simulation using actual chip circuit models, which include device, process, and technology parameters.

Current SER estimation methods are not well developed for logic circuits. Logic circuits, unlike memory devices, have specific masking effects on SETs (single event transients) that depend on the circuit properties.
These masking factors are electrical masking, logic masking and temporal masking [133]. Accurate estimation of logic circuit SER continues to be a major challenge as rapid advancement in nanotechnology keeps increasing the circuit sensitivity.

In our SER analysis approach, the inputs to the analysis are (1) circuit characteristics: circuit netlist, technology and node sensitive region data, and (2) background environment data: LET distribution and neutron flux. The output of our analysis is the neutron-induced logic circuit soft error rate in standard FIT (failures in time) units.

1.2 Contribution of Research

In this research, we model neutron-induced soft errors using two parameters, namely, occurrence rate and intensity. Our soft error rate estimation analysis propagates the occurrence rate, expressed as a probability, and the intensity, expressed as a probability density function of the single event transient (SET) pulse width, through the logic circuit. We develop an algorithm to compute the SER of a logic circuit based on this soft error model. We consider such issues as circuit technology and altitude, which may influence the SER results, and use a vector-less statistical approach. We consider the entire linear energy transfer (LET) spectrum of the terrestrial background, which is available from measurement data specific to the environment and device materials.

Soft error rates are calculated for ISCAS85 benchmark circuits in the standard unit, failures in time (FIT, i.e., failures in $10^{9}$ hours). In comparison to the SER analysis work reported by Rao et al. [156], our method considers many more relevant factors, including sensitive regions and circuit technology, which may influence logic SER. Our simulation results for ISCAS85 benchmark circuits differ from those reported by Rao et al. For example, our estimated soft error rates at ground level for C432 and C499 are $1.18 \times 10^{3}$ FIT and $1.41 \times 10^{3}$ FIT, while Rao et al. reported $1.73 \times 10^{-5}$ and $6.26 \times 10^{-5}$, respectively. We discuss the factors that could have caused a difference of several orders of magnitude between these results. Our CPU time is acceptably low. For example, for C1908 with 880 gates our analysis takes just 1.14 seconds on a Sun Fire 280R workstation.

With our novel soft error model, we are able to accurately model electrical masking factors in logic circuits. Also, the error pulse width density information at the primary outputs of the logic circuit allows evaluation of SER reduction schemes such as time redundancy and space redundancy.

An extensive discussion on soft error considerations for contemporary computer web servers is also presented. We propose a possible soft error rate reduction method that considers the cosmic ray striking angle to redesign the circuit board layout in server systems.

Four papers based on the work reported in this thesis have been authored: (1) a tutorial paper that covers broad topics on SEU was presented at the 21st IEEE International Conference on VLSI Design [186], (2) the new soft error model and logic SER estimation algorithm were presented at the 40th IEEE Southeastern Symposium on System Theory [187], (3) the logic SER estimated by our algorithm is compared with that reported in other related work, with a detailed discussion, in a paper presented at the 17th IEEE North Atlantic Test Workshop [185], and (4) a manuscript on SER in web servers with a proposal for its reduction is still unpublished.

1.3 Thesis Organization

This thesis is organized as follows. In Chapter 2, we provide the basic background on soft errors. Definitions of terms in this field, the mechanisms by which soft errors occur in silicon, and some widely used soft error mitigation techniques are discussed. In Chapter 3, previous soft error rate estimation strategies are broadly discussed. We attempt to include the essentials of the existing work related to soft error rate estimation in this chapter.
The traditional experimental SER testing methodology is also discussed. In Chapter 4, the novel soft error model is proposed and an algorithm to compute logic SER is developed. In Chapter 5, the new results are compared with those available in the literature. Our extended work, which proposes a possible soft error rate reduction method for computer web servers by considering the hardware orientation, is presented in Chapter 6. The thesis is concluded with insights on future work in Chapter 7.

Chapter 2 Background

2.1 What is Soft Error

An electronic circuit that bears no permanent hardware fault may display unexplained events resulting in spontaneous single-bit changes in the system, such that there is no way to repeat such failures. In the computer industry such a phenomenon is known as a "soft fail", to differentiate it from the "hard or permanent fail", which may be repairable [117, 203]. After observing a soft error, there is no implication that the system hardware is any less reliable than before, because the soft fail is completely random. These soft fails may be caused either by well-known electronic noise sources such as power supply fluctuations, lightning, and electrostatic discharge (ESD) [103], or by thermal radiation from the galaxy, such as radiation-emitting stars and atmospheric gases. A soft or non-permanent fault is a non-destructive fault and falls into one of two categories [180]:

1. Transient faults [38], caused by environmental conditions like temperature, humidity, pressure, voltage, power supply, vibrations, fluctuations, electromagnetic interference, ground loops, cosmic rays and alpha particles.
2. Intermittent faults, caused by non-environmental conditions like loose connections, aging components, critical timing, power supply noise, resistive or capacitive variations or couplings, and noise in the system.

With advances in design and manufacturing technology, non-environmental conditions may not affect sub-micron semiconductor reliability. However, the errors caused by
However, the errors caused by 8 cosmic rays and alpha particles remain a dominant factor causing errors in electronic systems. 2.2 A Historical Note on Soft Errors Soft errors have been studied by electrical, aerospace [33], nuclear and radiation engineers for almost half a century. In the period 1954 through 1957 failures in digital electronics were reported during the above-ground nuclear bomb tests. These were orig- inally treated as electronic anomalies in the monitoring equipment because they were random and their cause could not be traced to any hardware fault [199]. Perhaps the flrst paper concerning the role of cosmic rays on electronics was by Wallmark and Marcus [183]. As quoted in the recent literature [123], these authors predicted that cosmic rays would start upsetting microcircuits due to high energy particle strikes and radiation when feature sizes become small enough. Through 1970s and early 1980s, the efiects of radia- tion received attention and more researchers examined the physics of these phenomena. Also starting around 1950, theories of fault tolerant and self-repairing computing system have been developed due to the increased reliability requirements of critical applications like the space-mission [21, 22, 23, 24, 115, 181]. May and Woods of Intel Corporation [119, 120] determined that intermittent errors were caused by the alpha particles emitted by the radioactive decay of uranium and thorium present just in few parts-per-million levels in package materials. Their papers represent the flrst public account of radiation-induced upsets in electronic devices at sea level and these errors were referred to as \soft errors". The term soft error was used to difierentiate from the repeatable errors traceable to permanent hardware faults. 9 Table 2.1: Measured failure rate on SRAM-based FPGA applications due to neutron efiects in 130nm technology (Actel) [6]. 
Application Example                       Altitude (feet)   Neutron Flux (relative)   FPGAs/System   #upsets/1M-gate FPGA/day   MTBF (hours)   FIT (million)
(1) Ground-based Communication Network    5000              1                         512            $4.19 \times 10^{-4}$      112            8.92
(2) Civilian Avionics System              30,000            ~40                       4              $1.85 \times 10^{-2}$      324            3.09
(3) Military Avionics System              60,000            >160                      16             $8.33 \times 10^{-2}$      18             55.56

Table 2.2: Projected failure rate on SRAM-based FPGA applications due to neutron effects in 90nm technology (Actel) [6].

Application Example                       Altitude (feet)   Neutron Flux (relative)   FPGAs/System   MTBF (hours)   FIT (million)
(1) Ground-based Communication Network    5000              1                         512            58             17.24
(2) Civilian Avionics System              30,000            ~40                       4              162            6.17
(3) Military Avionics System              60,000            >160                      16             9              111.11

Guenzer and Wolicki [74] reported that the error-causing particles came not only from uranium and thorium; nuclear reactions also generated high-energy neutrons and protons, which could cause upsets in circuits. Following the title of their paper, "Single Event Upset of Dynamic RAMs by Neutrons and Protons", the term "SEU" has been in use ever since [74, 123]. In 1979, Ziegler and Lanford of IBM [203] predicted that cosmic rays could cause the same upset phenomenon in digital electronics (not only memories) even at sea level.

Recent soft error rate (SER) testing results for SRAM-based FPGAs from Actel [6] show a significant and growing risk of functional failures due to the corruption of configuration data, especially when the system has higher densities. Table 2.1 and Table 2.2 show measured failure rates for 130nm technology and projected failure rates for 90nm technology, respectively, for different applications without any error protection. The error rates are shown in units of MTBF (Mean Time Between Failures) and FIT (Failures in Time). The number of upsets per 1 million gates per day increases for cases (1) through (3) because of the altitude-dependent increase in neutron flux density.
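The MTBF and FIT columns of Table 2.1 appear to follow from the per-device upset rate and the number of FPGAs per system. The Python sketch below reproduces them under two assumptions that are ours rather than the source's: each FPGA is counted as one 1M-gate device, and every upset counts as a system failure. The names `system_mtbf_hours`, `fit_in_millions` and `TABLE_2_1` are hypothetical and used only for illustration.

```python
HOURS_PER_DAY = 24.0


def system_mtbf_hours(upsets_per_fpga_per_day: float, fpgas_per_system: int) -> float:
    """MTBF in hours, assuming every upset in any FPGA is a system failure."""
    system_upsets_per_day = upsets_per_fpga_per_day * fpgas_per_system
    return HOURS_PER_DAY / system_upsets_per_day


def fit_in_millions(mtbf_hours: float) -> float:
    """FIT (failures in 10^9 hours), expressed in millions as in Table 2.1."""
    return 1e9 / mtbf_hours / 1e6


# Case number -> (upsets per 1M-gate FPGA per day, FPGAs per system), from Table 2.1.
TABLE_2_1 = {1: (4.19e-4, 512), 2: (1.85e-2, 4), 3: (8.33e-2, 16)}

mtbf_by_case = {
    case: system_mtbf_hours(rate, count) for case, (rate, count) in TABLE_2_1.items()
}
fit_by_case = {case: fit_in_millions(mtbf) for case, mtbf in mtbf_by_case.items()}

for case in TABLE_2_1:
    print(f"case {case}: MTBF = {mtbf_by_case[case]:.1f} h, "
          f"FIT = {fit_by_case[case]:.2f} million")
```

Under these assumptions the computed MTBF values round to the 112, 324 and 18 hours listed in Table 2.1, and the FIT values match the table to within rounding of the MTBF.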
It is expected that neutron-induced soft errors will get worse by a factor of two as we move from 130nm to 90nm technology. Note that these tables ignore alpha particle effects, which are also expected to be significant for nanometer technologies and will further increase the system failure rate.

Radiation-induced soft errors have become one of the most important and challenging failure mechanisms in modern electronic devices. SER for commercial chips is controlled to within 100 to 1000 FIT. Compared to most hard failure mechanisms, which produce failure rates on the order of 1 to 100 FIT, the SER of a low-voltage embedded SRAM can easily be 1000 FIT/Mbit. Therefore, a four-phase approach to deal with soft errors is in progress [162]:

1. Methods to protect chips from soft errors (prevention).
2. Methods to detect soft errors (testing).
3. Methods to estimate the impact of soft errors (assessment).
4. Methods to recover from soft errors (recovery).

2.3 Radiation Environment Overview

2.3.1 Radiation Types

Radiation is kinetic energy in the form of high-speed particles and electromagnetic waves. In general, radiation can be classified as either ionizing or non-ionizing [89, 174, 175, 176].

1. Ionizing radiation has enough energy that, during interaction with an atom, it can remove tightly bound electrons from their orbits, causing the atom to become charged or ionized. Examples are gamma rays and neutrons.
2. Non-ionizing radiation does not have enough energy to remove tightly bound electrons from their orbits in atoms. Examples are microwaves and visible light.

Common types of radiation include alpha particles, beta radiation, gamma rays, and X-rays. Neutrons are also encountered in nuclear power plants and high-altitude flights, and are emitted from some industrial radioactive sources.
In some types of atoms, the nucleus is unstable and spontaneously decays into a more stable form, releasing energy as radiation. The major types of radiation are summarized as follows [89]:

Gamma rays and X-rays are short-wavelength photons, i.e., electromagnetic radiation. The two names come from their discoveries at different times. Gamma rays originate in nuclear interactions, while X-rays originate from electronic or charged-particle collisions. Their interaction mechanisms with matter are identical. The photons are lightly ionizing, highly penetrating, and leave no activity in the irradiated material. Gamma rays have comparatively higher penetrating power, and it takes a thick shield of lead or concrete to attenuate them significantly.

Alpha Particles are the nuclei of helium atoms, consisting of 2 protons and 2 neutrons. They carry a positive charge of 2e, where e is the magnitude of the charge on an electron, $e = 1.6 \times 10^{-19}$ coulomb. They normally have high energy, in the MeV range (see Appendix A). They interact strongly with matter and are heavily ionizing. They have low penetrating power and travel in straight lines. They are easily stopped, even by a sheet of paper. A typical alpha particle energy is 5 MeV, with a typical range of 50 mm in air and 23 µm in silicon.

Beta Particles have the same mass as an electron but may be either negatively or positively charged. Because they have small mass and charge, they can penetrate matter more easily than alpha particles but are easily deflected. Their velocity is high, normally approaching that of light. They produce weak ionization. Beta particles are stopped by a sheet of aluminum or plastic such as perspex.

Neutron has nearly the same mass as a proton but has no charge; thus it is difficult to deflect. The capture of a neutron can cause the emission of gamma rays.
Neutron rays (streams of neutrons) are classified according to their energy as thermal neutrons (energy < 1 eV) [60], intermediate neutrons (1 eV < energy < 100 keV), and fast neutrons (energy > 100 keV). Water is an effective shield for neutrons.

The proton is the nucleus of a hydrogen atom and carries a positive charge of 1 unit, i.e., +e. The proton has a mass roughly 1800 times that of an electron and is consequently more difficult to deflect. At energies in the MeV range, the proton has a typical range of several centimeters in air and tens of microns in aluminum.

The masses, charges and radii of the particles of interest for radiation effects are listed in Table 2.3, derived from experimental data [70].

Table 2.3: Mass, charge and radius of particles of interest in radiation effects [70].

Particle | Mass (kg) | Charge (C) | Radius (m)
Proton | 1.672 × 10^-27 | 1.602 × 10^-19 | 1.535 × 10^-18
Neutron | 1.674 × 10^-27 | 0 | 6.317 × 10^-18
Electron | 9.109 × 10^-31 | 1.602 × 10^-19 | 2.817 × 10^-15

The ionizing radiation effects in electronics, such as space vehicle electronics, can be separated into two types: total ionizing dose (TID) and single event effects (SEE) [106].

• Total Ionizing Dose (TID) causes long-term degradation of electronics through the cumulative energy deposited in a material. Effects include parametric failures, variations in device voltage, and functional failures. Significant sources of TID exposure in the space environment include trapped electrons, trapped protons, and solar flare protons.

• A Single Event Effect (SEE) occurs when a single particle strikes the material and deposits sufficient energy in the device to cause an upset. Here, SEE includes soft errors (SEU, SEFI) and hard errors (SEL, SEB, SEGR)¹.

Parametric and permanent functional failures are the principal failure modes associated with the TID environment.
Since TID is a cumulative effect, the total dose tolerances of devices are MTTF (mean time to failure, see Appendix B) numbers, where the time to failure is the amount of mission time until the device has accumulated enough dose to cause failure [106].

The progression of manufacturing processes to ever deeper sub-micron technologies is increasing the risk to system reliability. Because of neutron effects, the manufacturers of telecommunications and networking systems are developing qualification tests to identify components that are susceptible to soft errors.

The main sources of the radiation environment of interest to avionics and electronics are listed as follows [50]:

• Trapped belts: protons and electrons trapped in the Van Allen² belt.
• Heavy ions trapped in the magnetosphere.
• Cosmic ray protons and heavy ions.
• Protons and heavy ions from solar flares.

¹ For definitions of TID, SEU, SEE, SEL, SEFI and SEGR, see Appendix A.

2.3.2 Terrestrial Radiation Environment

When galactic cosmic rays traverse the Earth's atmosphere, they collide with atomic nuclei and create cascades of interactions and reaction products such as neutrons. Some of these neutrons reach the ground and become a source of single event upsets (SEU) in microelectronics. Neutrons produce SEU only when they collide with the nucleus of an atom in a device or its packaging, causing the nucleus to recoil and release densely ionizing nuclear fragments [72]. The probability of a neutron producing a nuclear recoil and fragments to which a particular device may be sensitive depends on the neutron's kinetic energy.

Of the cosmic ray particles impinging on the Earth's atmosphere, almost 90% are protons, about 9% are helium nuclei (alpha particles), and about 1% are electrons. They are influenced by the Earth's magnetic field and by other factors, such as collisions with atmospheric molecules.
The initial particles originating from outer space (also called "primaries") arrive at a flux of about 1600 particles per square meter per second, with a mean energy of about 7 GeV and an energy spectrum that falls off as $E^{-5/2}$. Particles with energies below about 1 GeV are deflected by the Earth's magnetic field and do not cause showers. The incident particles are protons, helium ions, and heavier ions [198, 200, 201, 203]. These heavy ions interact like individual nucleons. From measurement, Ziegler et al. [201] report the incident flux as 87% protons and 13% neutrons. Almost all of the primaries effectively disappear by an altitude of 20,000 m.

The secondary particles produced by interaction of the primaries with the gas atoms of the atmosphere include nucleons, electrons and photons. The secondaries are either stopped within the atmosphere, which prevents them from producing further cascades of particles, or spontaneously decay into other particles. Finally, the remnants of the cascade strike the Earth. The hit rates of different particle types, such as alpha particles or neutrons, are available from experimental results [72, 203]. It is, however, necessary to note that there are large variations in the documented measured fluxes. These may be due to effects attributed to magnetic latitude, solar cycles, time of day, season, and so on.

² The radiation belts are regions of high-energy particles, mainly protons and electrons, held captive by the magnetic influence of the Earth. They have two main sources. A small but very intense "inner belt" (some call it "The Van Allen Belt" because it was discovered in 1958 by James Van Allen of the University of Iowa) lies within 4000 miles or so of the Earth's surface. It consists mainly of high-energy protons (10 to 50 MeV) and is a by-product of the cosmic radiation, a thin drizzle of very fast protons and nuclei which apparently fill all our galaxy [13].
The natural radiation levels depend strongly on the activity of the sun. The average solar cycle is eleven years, with approximately four years of solar minimum and seven years of solar maximum, as shown in Figure 2.1 [9]. Neutrons, created by cosmic ray interactions with O2 and N2 in the air, reach a peak flux at around 60,000 feet. At 30,000 feet the neutron flux is about 1/3 of the peak value, and on the ground it is 1/400 of the peak value [140] (Figure 2.2). Solar flare protons, together with smaller quantities of electrons and alpha particles, are emitted by the sun periodically during solar storms. These high-energy particles can cause significant damage to spacecraft solar arrays [71] and produce SEU in electronics [90, 179].

Figure 2.1: Sunspot numbers (y-axis, monthly smoothed) versus time in years (x-axis, 1960 to 2000) during solar cycles 19 through 23, recorded by the Solar Influences Data Center (SIDC) in Belgium [9].

The particle hit rate $R_{PH}$ is given by the equation [200]:

\[
R_{PH} = \int_{E_{n,\min}}^{E_{n,\max}} F_n(E_n)\, dE_n \times A_t \tag{2.1}
\]

where $F_n(E_n)$ is the altitude- and location-dependent neutron flux [200], defined between neutron energies $E_{n,\min}$ and $E_{n,\max}$, and $A_t$ is the total silicon area of a logic circuit. Figure 2.3 [4] illustrates the neutron flux at a variety of altitudes and latitudes. Note that the flux density is more than three times higher in Denver than in New York: both cities are at approximately the same latitude, but Denver is located at a much higher altitude [6].

Figure 2.2: Neutron flux (1 to 10 MeV, in N/cm2-sec) versus altitude (thousands of feet), showing a peak at about 60,000 ft [139].
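As an illustration of Equation 2.1, the following minimal Python sketch integrates a differential neutron flux over an energy window and multiplies by the silicon area. The flux function, its value, the energy limits and the area are hypothetical placeholders chosen only to make the arithmetic easy to check; a real analysis would substitute the measured, location-dependent spectrum.

```python
def particle_hit_rate(flux, e_min, e_max, area_cm2, steps=1000):
    """Trapezoidal estimate of R_PH = (integral of F_n(E) dE) * A_t.

    flux     -- differential flux F_n(E) in neutrons/(cm^2 s MeV)
    e_min    -- lower energy limit E_n,min in MeV
    e_max    -- upper energy limit E_n,max in MeV
    area_cm2 -- total silicon area A_t in cm^2
    Returns the hit rate in neutrons per second.
    """
    if e_max <= e_min or steps < 1:
        raise ValueError("need e_max > e_min and steps >= 1")
    h = (e_max - e_min) / steps
    total = 0.5 * (flux(e_min) + flux(e_max))
    for i in range(1, steps):
        total += flux(e_min + i * h)
    return total * h * area_cm2


def flat_flux(energy_mev):
    # Hypothetical energy-independent spectrum, not measured data.
    return 1.0e-4


# Hypothetical circuit: 0.5 cm^2 of silicon, energies from 1 to 10 MeV.
rate = particle_hit_rate(flat_flux, 1.0, 10.0, area_cm2=0.5)
print(f"R_PH = {rate:.3e} hits/s")
```

With a flat spectrum the integral reduces to flux × (E_max − E_min), so the sketch can be checked by hand before a tabulated spectrum is plugged in.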
In the terrestrial environment, another significant source of ionization in packaged devices is alpha particles emitted by radioactive impurities in the package materials. This radiation mechanism is discussed in the next section.

Figure 2.3: Neutron flux as a function of altitude and latitude [4].

2.4 How Soft Error Occurs in Silicon

This section discusses the soft errors caused by radiation and particle strikes.

2.4.1 Radiation Mechanisms in Semiconductors

Three principal radiation sources cause soft errors in advanced semiconductor devices [30]:

1. Alpha particles are emitted when the nucleus of an unstable isotope decays to a lower energy state. The particles carry kinetic energy in the range of 4 to 9 MeV. There are many radioactive isotopes; however, uranium and thorium have the highest activity among naturally occurring materials. In the terrestrial environment, major sources of alpha particles are radioactive impurities such as lead-based isotopes in the solder bumps of flip-chip technology, gold used for bonding wires and lid plating, aluminum in ceramic packages, lead-frame alloys, and interconnect metalization [50].

2. High-energy (> 1 MeV) neutrons from cosmic radiation can induce soft errors in semiconductor devices via secondary ions produced by the neutron reaction with silicon nuclei. Cosmic rays of galactic origin react with the Earth's atmosphere to produce complex cascades of secondary particles. Less than 1% of the primary flux reaches ground level, and the predominant particles include muons, neutrons, protons, and pions. Because pions and muons are short-lived, and protons and electrons are attenuated by Coulombic interaction with the atmosphere, neutrons are the most likely cosmic radiation source to cause SEU in deep-submicron semiconductors at terrestrial altitudes. The neutron flux depends on the altitude above sea level; the flux density increases with altitude.

3. The third significant source of ionizing particles in electronic devices is the secondary radiation produced by the interaction of cosmic ray neutrons and boron [31]. This radiation is induced by low-energy cosmic neutrons interacting with the isotope boron-10, or $^{10}$B. Boron is extensively used as a p-type dopant in silicon and is also used specifically in the formation of the BPSG (borophosphosilicate glass) dielectric layer [31]. Boron has two isotopes, $^{10}$B and $^{11}$B, of which $^{10}$B is unstable when exposed to neutrons. The reaction scheme is shown in Figure 2.4 [26]. In the $^{10}$B(n, α)$^{7}$Li reaction, the lithium nucleus is emitted with a kinetic energy of 0.84 MeV 94% of the time and 1.014 MeV 6% of the time. The gamma photon has an energy of 478 keV, while the alpha particle is emitted with an energy of 1.47 MeV [26]. This mechanism has recently been found to be the dominant source of soft errors in 0.25 µm and 0.18 µm SRAMs fabricated with BPSG. Modern microprocessors use highly purified package materials, so this radiation mechanism is greatly reduced, leaving high-energy cosmic rays as the major cause of soft errors.

Figure 2.4: Fission of $^{10}$B induced by the capture of a neutron (a common occurrence in SRAMs) [26].

The SEU due to activation of $^{10}$B can be mitigated by removing BPSG material from the process flow. For future deep-submicron DRAM generations, a greater suppression of soft error rate is expected for devices made with silicon-on-insulator (SOI) technologies [132].

2.4.2 Sensitive Regions in Silicon Devices

A single event transient (SET) is caused by the generation of charge due to a single particle (proton or heavy ion) passing through a sensitive node in the circuit [157]. SET in linear devices differs significantly from other types of single event effects (SEE), such as SEU in a memory. Each SET has its own characteristics, such as polarity, waveform, amplitude, and duration. These characteristics depend on the particle impact location, particle energy, device technology, device supply voltage, and output load.
In CMOS circuits, the "off" transistors struck by a heavy ion in the junction area are most sensitive to SEU by particles with LET (linear energy transfer; see Appendix) of around 20 MeV-cm2/mg. When these particles hit the silicon bulk, minority carriers are created and, if they are collected by the source/drain diffusion regions, the voltage of the signal node changes [144].

A particle can induce SEU when it strikes the channel region of an off nMOS transistor or the drain region of an off pMOS transistor. The ionization induces a current pulse in a p-n junction. Conceptually, when the charge injected by the current pulse at a sensitive node exceeds a critical charge (Qcrit), a SET is generated at the affected junction. Figure 2.5 [6] shows the interaction of a high-energy neutron with a silicon integrated circuit.

Figure 2.5: Interaction of a high energy neutron and a silicon integrated circuit [6].

2.4.3 Single Event Transient (SET)

In Figure 2.6, a SET is produced after a high-energy ionizing particle strikes a silicon device near a sensitive node [29]. Along its path, the particle produces a dense radial distribution of electron-hole pairs, as illustrated in Figure 2.6(a). If the resultant ionization track traverses the depletion region, carriers are rapidly collected by the electric field, thus compensating the charge stored in the junction. Outside the depletion region, the non-equilibrium charge distribution induces a temporary funnel-shaped potential distortion along the trajectory of the event, further enhancing charge collection by drift (Figure 2.6(b)). A "prompt" collection phase typically follows for several tens of picoseconds.

Figure 2.6: Schematic representation of charge collection in a silicon junction immediately after (a) an ion strike, (b) prompt (drift) collection, (c) diffusion collection, and (d) the junction current induced as a function of time [29].
As the funnel collapses, diffusion dominates the collection process (Figure 2.6(c)) until all excess carriers have been collected, have recombined, or have diffused away from the junction area (on the order of nanoseconds). The transient charge collected from the radiation event produces a current pulse at the junction, as illustrated in Figure 2.6(d) [29].

Figure 2.7 [149] shows the mechanism of the current pulse generation. The current transient typically lasts for 200 picoseconds, with the bulk of the charge collection occurring within 2 to 3 microns of the junction region for modern submicron CMOS technologies. The time constant depends strongly on the type of particle, its initial energy, and the properties of the specific device technology [29]. If enough charge is collected by a node, its logic state may change. The collected charge (Qcoll) is a function of the ionizing particle's energy and trajectory, the silicon substrate structure and doping, and the local electric field [29].

Figure 2.7: Schematic of the charge collection mechanism (funneling, collection by drift, collection by diffusion, and recombination along the ion path) when an ionizing particle strikes an electronic junction [149].

A commonly used approximate analytical model of the induced transient current waveform for ion-track charge collection has a double-exponential form [122], with a rapid rise time and a gradual fall time:

\[
\begin{cases}
I(t) = \dfrac{Q_{coll}}{\tau_\alpha - \tau_\beta}\left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right) & \text{(a)} \\[2ex]
Q_{coll} = 10.8 \times L \times LET & \text{(b)}
\end{cases}
\tag{2.2}
\]

where $Q_{coll}$ is the collected charge (in femtocoulombs) in the sensitive region, $\tau_\alpha$ is a process-dependent collection time constant of the junction, and $\tau_\beta$ is the ion-track establishment time constant, which is relatively independent of the technology. Typical values are approximately $1.64 \times 10^{-10}$ s for $\tau_\alpha$ and $5 \times 10^{-11}$ s for $\tau_\beta$ [43]. In bulk silicon, a typical charge collection depth ($L$, in microns) is 2 µm, and for every 1 MeV-cm2/mg of linear energy transfer (LET) an ionizing particle deposits about 10.8 fC of charge along each micron of its track.

Linear energy transfer (LET) is a measure of the energy transferred to a material as an ionizing particle travels through it. For electronic devices, the unit of LET is MeV-cm2/mg of material. It is obtained by dividing the energy lost by the particle to the material per unit path length (MeV/cm) by the density of the material (mg/cm3).

The induced transient voltage pulse may propagate through several levels of logic gates. Because a particle can induce an SEU when it strikes either the channel region of an off nMOS transistor or the drain region of an off pMOS transistor, we consider a strike at an off pMOS drain area as an illustrative example. The critical charge depends on the total charge collected at the sensitive node as well as on the temporal shape of the current pulse and the device supply voltage. A parameter called the "switching time ($t_{th}$)" or "feedback time" is defined as the interval that starts when the particle strikes and continues until the affected node voltage exceeds the threshold voltage. At that time, the charge on the output capacitor of the gate containing the transistor equals Qcrit. Qcrit can be calculated by integrating the current that flows at the sensitive node after the strike [57]. The condition for the SEE to propagate is that the output node voltage satisfies Equation 2.3:

\[
V \geq \frac{Q_{crit}}{C} = \frac{1}{C}\int_{0}^{t_{th}} I_{induced}(t)\, dt \tag{2.3}
\]

The width of the voltage pulse depends on the value of the capacitance and the RC time constant of the discharging path. For example, in AMI12 technology, when the output load capacitance is 100 fF and the cumulative collected charge is 0.65 pC, the simple Q/C estimate of the amplitude of the voltage pulse is

\[
\frac{0.65\ \text{pC}}{100\ \text{fF}} = \frac{0.65 \times 10^{-12}\ \text{C}}{100 \times 10^{-15}\ \text{F}} = 6.5\ \text{V}
\]

(this estimate ignores clamping by the supply rails). We observe that, for the same charge collected in the sensitive area, a smaller load capacitance gives a larger amplitude of the SEE-induced voltage pulse. The discharge process can be modeled by a simple RC circuit, in which the voltage as a function of time is $v(t) = v(0)\, e^{-t/RC}$. Clearly, the smaller the RC value, the faster the discharge. A schematic view of how the SEE-induced current pulse translates into an SEE-induced voltage pulse is given in Figure 2.8. With technology scaling, multiple transient faults may become an issue for next-generation ICs [161].

Figure 2.8: A schematic view of how an SEE-induced current pulse translates into a voltage pulse in a CMOS inverter.

2.5 An Overview of Soft Error Mitigation Techniques

Soft error tolerant design techniques can be classified into two types: prevention and recovery. Prevention methods protect microchips from soft errors [186] and are applied during chip design and development. Recovery methods include on-line mechanisms for recovering from soft errors in order to meet the chip robustness requirement. These include fault-tolerant computing, error correcting code (ECC) and parity, on-line testing [66, 97, 99, 101, 137, 138], and redundancy [151, 163]. One should note that soft errors are not the only reason why computer systems need to resort to a recovery procedure.
Random errors due to noise, unreliable components, and coupling effects may also require recovery mechanisms [162]. The need for a recovery mechanism stems from the fact that prevention techniques may not be enough for contemporary microchips, because the supply voltage keeps falling, the feature size keeps shrinking, and the clock frequency keeps increasing. Also, the cost of prevention techniques for a fault-tolerant design may be too high. To represent the broad area of error-tolerant computing, we give here a few examples of techniques used for soft error mitigation. In addition, a built-in soft error resilience (BISER) technique for correcting radiation-induced soft errors in latches and flip-flops has been reported [192]. In that work, the error-correcting latch and flip-flop designs are power efficient, can correct both flip-flop errors and combinational logic errors, and reuse the on-chip scan design-for-testability hardware for cell-level error recovery.

2.5.1 Prevention Techniques

Purify the Fabrication Material

A significant reduction in the soft error rate of microelectronics can be achieved by eliminating or reducing the sources of radiation. To reduce the alpha particle emission in packaged ICs, high-purity materials and processes are employed. Uranium and thorium impurities have been reduced below one hundred parts per trillion for high reliability. Going from conventional IC packaging to ultra-low-alpha packaging materials reduces the alpha emission from 5 to 10 particles/cm2-hr to less than 0.001 particles/cm2-hr. To reduce the SER induced by $^{10}$B activation by low-energy neutrons, BPSG is replaced by other insulators that do not contain boron. In addition, any processes using boron precursors are carefully checked for $^{10}$B content before they are introduced into the manufacturing process [29].
When these measures are employed, the SER of the IC is reduced dramatically, but the high-energy cosmic neutrons that cause the remaining SER cannot be easily shielded.

Radiation Hardened Process Technologies

SER performance can be greatly improved by adapting a process technology either to reduce the collected charge (Qcoll) or to increase the critical charge (Qcrit) [197]. One approach is to use additional well isolation (a triple-well or guard-ring structure) to reduce the amount of charge collected by creating potential barriers, which can limit the efficiency of the funneling effect and reduce the likelihood of parasitic bipolar collection paths [40]. Another approach replaces bulk silicon well isolation with silicon-on-insulator (SOI) substrate material. Direct charge collection is significantly reduced in SOI devices because the active device volume is greatly reduced (owing to the thin silicon device layer on the oxide layer) [132]. Recent work shows a 10X reduction in SER over conventional bulk devices when a fully depleted SOI substrate is used. Unfortunately, SOI substrates are more expensive than conventional bulk substrates, and phenomena such as parasitic bipolar action limit further reduction of SER [29, 76, 132]. Circuit-level solutions, such as the addition of cross-coupled resistors and capacitors to decrease the bit-line float time, are also employed [172].

2.5.2 Recovery Techniques

Fault-tolerant computing methods have been reported in the literature for quite some time [181] but have seen renewed interest due to the SEU phenomenon. On-line testing techniques are frequently used as recovery solutions for soft error mitigation. Specific techniques include self-checking design [136], concurrent error detection for finite state machines (FSM) by signature monitoring [46, 48], error detection and correction (EDAC) codes [75], and redundancy [21].
Redundancy

The basic idea of redundancy in design is to gain higher system reliability by sacrificing minimality in time, space, or both. The classic triple modular redundancy (TMR) [21, 42, 47, 69, 110, 115, 168, 182] with a majority voter continues to be widely used. Mitra et al. [127] combine a self-checking design with time redundancy based on the C-element gate, comparing two samples of the output signal from a combinational circuit at times $t_0$ and $t_0 + d$, where $t_0$ is the clock sampling time and $d$ is a finite amount of delay. The C-element is able to eliminate glitches at combinational outputs. Their error correction structure is illustrated in Figure 2.9 [127]. In this design, if an error pulse of width smaller than $d$ occurs in the combinational logic in Figure 2.9(b), the pulse will produce different values at the clocking edges $t_0$ and $t_0 + d$. Because the output of the C-element retains the correct value, the error is corrected. Space redundancy and time redundancy are often combined to meet high fault-tolerance requirements with reduced hardware overhead, for example by using duplication and comparison instead of TMR.

Error-Correcting Code and Parity

Memories have a significant role in modern systems. Because of the very high density of storage cells, a large memory is more sensitive to ionizing particles than logic is. A simple solution for protecting a memory is to add parity bits to each memory word.
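Two of the building blocks just mentioned, the TMR majority voter and a single parity bit per memory word, can be sketched in a few lines of Python. This is a behavioral illustration only, not a circuit from the cited work; the function names and the 8-bit example word are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant copies (behavior of a TMR voter)."""
    return (a & b) | (a & c) | (b & c)


def even_parity_bit(word):
    """Parity bit that makes the total number of 1s (word plus bit) even."""
    return bin(word).count("1") % 2


def parity_error(word, stored_parity):
    """True if the word read back no longer matches its stored parity bit."""
    return even_parity_bit(word) != stored_parity


data = 0b10110010                  # hypothetical 8-bit memory word
stored = even_parity_bit(data)     # computed at write time

single_upset = data ^ 0b00001000   # one bit flipped by a particle strike
double_upset = data ^ 0b00001100   # two bits flipped

print(parity_error(single_upset, stored))   # single-bit upset is detected
print(parity_error(double_upset, stored))   # double-bit upset escapes one parity bit
print(majority_vote(data, single_upset, data) == data)  # voter masks one bad copy
```

A single parity bit detects any odd number of flipped bits but cannot locate them, which is why codes with more check bits are needed for correction.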
Figure 2.9: Error correction using duplication, (a) space redundancy structure, (b) time redundancy structure, and (c) C-element [127]. In the truth table of panel (c), C_OUT is 1 when A = B = 0, C_OUT is 0 when A = B = 1, and the previous value is retained when A and B differ.

During the write operation, a parity generator computes parity bits for the data to be written. The parity bits are written into memory along with the data. If a particle strike alters the state of a single bit of a memory word, which now includes the parity bits, the error can be discovered by checking the parity code during the read operation. Depending on the number of parity bits used, this scheme can detect errors and also correct them. Such schemes are often combined with system-level approaches for error recovery [136]. In most situations, however, error recovery in a memory is more complex, so protecting the memory with codes such as error correcting code (ECC) is preferable. Table 2.4 [106] summarizes sample error detection and correction (EDAC) methods for memory, data and systems.

Table 2.4: Sample EDAC methods for memory or data devices [106].

EDAC Method | EDAC Capability
Parity | Single bit error detect
Hamming Code | Single bit error correct, double bit detect
RS Code | Correct consecutive and multiple bytes in error
Convolutional Encoding | Corrects isolated burst noise in a communication stream
Overlying Protocol | Specific to each system implementation

2.6 IBM eServer z990: A Case Study

The IBM eServer z990 system is designed to detect and recover from both soft and permanent errors [121].
System z990 contains up to four pluggable nodes connected through a planar board in a daisy-chain interconnect structure. Each node contains up to 64 GB of physical memory and a 32 MB L2 cache, for a system capacity of 256 GB of memory and 128 MB of L2 cache.

In the IBM z990 system, microarchitecture-level SEU mitigation features include extensive use of ECC and parity with retry on data and controls; full SRAM ECC and parity protection; operational retries; microprocessor mirroring, checkpointing and rollback; and some hardware derating techniques. These approaches may be useful for future mainframe, general-purpose, and application-specific computing systems.

2.7 Traditional SER Testing Methods

Soft-error testing seeks to reproduce and then accelerate the die's real-life environment [93, 118]. Typically, a neutron beam accelerator is used to conduct this testing. Because each neutron beam has a specific and complex set of neutron properties, the beams must be carefully qualified so that the resulting data can be correlated with real-time results. Beam qualification includes factors such as energy, spectrum, fluence, and tail-effect correction [39]. A schematic overview of the accelerated test setup is shown in Figure 2.10 [86]. The result of this accelerated test is a soft error rate. A general test plan for alpha or neutron accelerated SER testing contains multiple runs covering the following specifications [86, 92]:

• Supply voltage (VDD)
• Input patterns (all 1s, all 0s, or checkerboard)
• Operational frequency (static or dynamic)
• Temperature

The procedures and requirements for terrestrial SER testing of ICs should follow the semiconductor industry's standard accelerated testing methods. The JEDEC (Joint Electron Device Engineering Council) standards include JESD89, JESD89-A [4, 5, 10] and JESD89-2. In JESD89 [4], the standard specifications cover soft errors due to alpha particles and atmospheric neutrons.
The standards also define the requirements and procedures for terrestrial SER testing of integrated circuits, and a standardized methodology for reporting the test results. For example, these standards specify that the SER data obtained from accelerated alpha SER tests should be extrapolated to an alpha flux of 0.001 particles/hr-cm2, and the accelerated neutron SER (ASER) test results to the typical neutron flux observed at New York City. For that location, the reported data show that the neutron flux is $3.9 \times 10^{-3}$ N/cm2-s for the energy range from 10 to 10000 MeV, and $4.0 \times 10^{-3}$ N/cm2-s for the energy range from 1 to 10 MeV [86, 4]. The procedures apply primarily to memory devices such as DRAMs and SRAMs, and with some adjustments they may be used for logic devices [4].

Figure 2.10: Typical test setup (hardware) for neutron-accelerated SER testing [86]. The control PC is in the user room (radiation free); the power supply, test and control board, heating control, and the DUT board with its heater are in the test room (stray radiation), with the DUT board placed in the neutron beam.

Real-time testing offers another means of soft-error rate detection. However, given that neither single-event upsets nor soft-error-induced latch-ups occur frequently, testers employ environmental acceleration, such as testing at high altitudes, where the neutron flux is stronger while the spectrum remains similar to that at ground level. For example, the test facility at the Jungfraujoch Lab in Switzerland, located at 11,000 feet, can accelerate sea-level test times by a factor of 11. In testing conducted at this lab, iRoC Technologies obtained a statistically significant number of soft errors on several devices over a period of 4 to 6 months, and the tests for soft-error rates covered several different phenomena, including multibit upsets [128]. Table 2.5 [128] compares accelerated testing with real-time testing. The test results show that the average FIT per megabyte decreases slightly at each process node. From 130 nm down to 90 nm, the FIT per megabyte in memory begins to stabilize. Silicon test results show that the average soft-error rate hovers around 1,000 FIT per megabit (neutron and alpha).

Table 2.5: Accelerated testing versus real-time testing [128].

Test Type | Logistics | Time | Accuracy | Devices Under Test
Accelerated | Complex: requires access to qualified beams; expert team required | Average: 2 to 3 months | Good | Memories, SoC, FPGA, system level
Real-Time | Reasonable | Average: 4 to 6 months | Excellent | All types

A field test measures the soft error rate of chips due to natural background radiation [142]. This type of testing is frequently used to evaluate chip SER, using a tester that contains hundreds of chips and evaluating their fail rate at nominal conditions. Field testing is very expensive and may take up to a year to obtain reliable results, but it is used to validate models and accelerated test results [150].

The traditional SER test needs the parameters shown in Figure 2.11, in which the cross section (see Appendix A) is the interaction probability used when computing the interactions of the particles of interest with the pertinent materials. The fail cross section specifies the sensitivity of a circuit [78]. For a memory, it is determined by loading the memory with a known bit pattern and measuring the number of flipped bits when the device is exposed to a beam of neutrons or charged particles. For particle energy $E$, the bit fail cross section, $\sigma(E)$, is the measured number of bit flips (or fails) per bit per unit beam fluence (particles per unit area) [72]:

\[
\sigma(E) = \frac{\text{fails}}{\text{bits} \times \text{fluence}} \tag{2.4}
\]

Figure 2.11: Traditional SER field test parameters: particle type, particle energy (MeV), particle flux (cm^-2 s^-1), device under test, SEU cross section (cm2), and SER (per bit-second).

The soft-error rate (SER) is then determined by integrating the product of the bit cross section and the differential flux over the energy range in which the circuit is susceptible to failure:

\[
SER = \int \sigma(E)\, \frac{d\varphi}{dE}\, dE \tag{2.5}
\]

where $d\varphi/dE$ is the differential flux, $\varphi$ being the beam fluence per unit time.

At different altitudes, different particles play major roles. The particle flux and energy spectra also differ. SER can also be obtained by computer simulation. In a typical SER simulator, the radiation environment can be described by either an alpha or a neutron energy spectrum. Neutrons do not possess electrical charge, so the only way they can cause an SEU is by a nuclear reaction with nuclei of Si, B, or some other element. The probability of a neutron producing a nuclear recoil or fragment to which a particular device is sensitive depends on the neutron's energy. Also, the critical charge (see Appendix A) [68, 83, 102] for the circuit node should be measurable.

The sensitivity of an integrated circuit to an upset also depends on the process technology. As semiconductor processes advance to smaller feature sizes, the amount of charge required to cause an upset decreases. The relationship between the process technology and the upset rate is illustrated in Figure 2.12 [7]. Note that this chart includes alpha particle effects as well as neutron effects.

Figure 2.12: Soft error rates as a function of IC process technology [7].

2.8 Collected SER Field Test Data

In Table 2.6³, recent SER test results are collected from the published literature [11] and some other relevant sources. It can be concluded that 1000 to 5000 FIT per Mbit is a reasonable error rate for modern memory devices.
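The quantities in Equations 2.4 and 2.5, and the unit conversion between upsets per bit-hour and FIT/Mbit used when comparing reported data, can be sketched as follows. This is a minimal sketch under stated assumptions: the cross section is taken as energy independent, so that Equation 2.5 reduces to a product of cross section and integrated flux; one Mbit is taken as 10^6 bits; and the fail count, bit count and fluence are hypothetical beam-test numbers. Only the New York City flux of 3.9 × 10^-3 N/cm2-s comes from the text.

```python
HOURS_PER_FIT = 1.0e9      # FIT counts failures per 10^9 device-hours
BITS_PER_MBIT = 1.0e6      # assumption: 1 Mbit = 10^6 bits
SECONDS_PER_HOUR = 3600.0


def bit_cross_section(fails, bits, fluence_per_cm2):
    """Equation 2.4: fails per bit per unit beam fluence, in cm^2/bit."""
    return fails / (bits * fluence_per_cm2)


def ser_per_bit_hour(sigma_cm2, flux_per_cm2_s):
    """Equation 2.5 under the simplifying assumption of a constant sigma:
    SER = sigma * integrated flux, converted from per-second to per-hour."""
    return sigma_cm2 * flux_per_cm2_s * SECONDS_PER_HOUR


def per_bit_hour_to_fit_per_mbit(rate_per_bit_hour):
    """Convert upsets/bit-hour into FIT/Mbit."""
    return rate_per_bit_hour * BITS_PER_MBIT * HOURS_PER_FIT


# Hypothetical beam test: 50 fails in a 4 Mbit array after 1e10 neutrons/cm^2.
sigma = bit_cross_section(fails=50, bits=4.0e6, fluence_per_cm2=1.0e10)

# Ground-level flux quoted in the text for 10 to 10000 MeV at New York City.
rate = ser_per_bit_hour(sigma, flux_per_cm2_s=3.9e-3)
fit_per_mbit = per_bit_hour_to_fit_per_mbit(rate)

print(f"sigma = {sigma:.3e} cm^2/bit")
print(f"SER   = {rate:.3e} upsets/bit-hour = {fit_per_mbit:.2f} FIT/Mbit")
```

With the same conversion, 1E-12 upsets/bit-hour corresponds to 1,000 FIT/Mbit, which is the relationship between the per-bit-hour and FIT/Mbit figures usually quoted side by side.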
Table 2.6: Recently reported data on soft error rates.

Type of Memory | Reported SER | Error per bit-hour | FIT/Mbit | Source
Goal for new Cypress products | 200 FIT | ? | ? | [3]
SRAM (quoted by vendors) | 200 to 2,000 FIT | ? | ? | [2]
"typical" | 1,000 FIT | ? | ? | [3]
DRAM at full speed | Few hundred to few thousand FIT | ? | ? | [8]
SRAM at 0.25 micron and below | 10,000 to 100,000 FIT | ? | ? | [8]
Commercial CMOS memory | 1E-5 to 1E-7 per bit-day | 4E-7 to 4.2E-9 | 4 million to 400 million | [3, 170]
"some" 0.13-micron technologies | 10,000 or 100,000 FIT/Mbit | 1E-11 to 1E-10 | 10,000 to 100,000 | [3]
1 Gbit memory in 0.25 µm | One error per week | 6E-12 | 6,000 | [88]
4M SRAM | <1E-10 upset/bit-day | <4.2E-12 | 4,200 | [140]
1 Gbit of DRAM (Nite Hawk) | 2.3E-12 upset/bit-hour | 2.3E-12 | 2,300 | [1]
SRAM and DRAM | 1 to 2E-12 upset/bit-hour | 1 to 2E-12 | 1,000 to 2,000 | [1]
~8.2 Gbits of SRAM (CRAY YMP-8) | 1.3E-12 upset/bit-hour | 1.3E-12 | 1,300 | [1]
SRAM | 1,000 FIT/Mbit | 1E-12 | 1,000 | [77]
256 MBytes | One error per month | 7E-13 | 700 | [44, 8]
160 Gbits of DRAM (Fermilab) | 2.5 errors per day | 7E-13 | 700 | [1]
32 Gbits of DRAM (CRAY YMP-8) | 6E-13 upset/bit-hour | 6E-13 | 600 | [1]
MoSys 1T-SRAM (no ECC) | 500 FIT/Mbit | 5E-13 | 500 | [98]
Micron estimate, 256 MBytes | 2 to 4 errors per day | 1.2 to 2.4E-13 | 120 to 240 | [8]
"ultra-low" failure rate | 50 to 100 FIT/Mbit | 5E-14 to 1E-13 | 50 to 100 | [45]

Footnote 3: The question marks in this table mean that the relevant data are not available.

Chapter 3
Previous Work

Soft error rate estimation predicts the soft-error rates (SER) due to cosmic and high-energy particle radiation in integrated circuit chips by building accurate SER models [114]. The single event upset phenomenon is a complex process. When neutrons strike silicon, any of more than 100 different nuclear reactions can be generated [128]. Accurate measurement of the neutron flux [72] and its energy distribution are the first considerations in estimating neutron-induced SER. The existing basic concepts and methodologies to estimate cosmic ray induced SER for a given circuit are summarized in this chapter.
3.1 Figure of Merit Model for Geosynchronous SER

As discussed in the previous chapter, the error rate $E_r$ is proportional to the integral of the product of the appropriate incident particle flux and the SEU cross section. For spacecraft in geosynchronous orbits, the appropriate flux $\varphi$ is due to galactic cosmic ray ions. These spacecraft orbit at the outer fringes of the Van Allen Belt, so their proton-SEU interactions are assumed to be negligible [123]. The Figure of Merit (FOM) model expressions are obtained by using the chord distribution function for the cross section. Suppose $\varphi(L)$ is the integral SEU flux expressed as number of particles cm$^{-2}$ day$^{-1}$; then the error rate is:

$$E_r \propto \int_{LET_{min}}^{LET_{max}} \varphi(L)\,d\sigma_u(L) = \int_{LET_{min}}^{LET_{max}} \varphi(L)\,\frac{d\sigma_u(L)}{dL}\,dL \quad \text{errors per day} \qquad (3.1)$$

where $E_r$ is the error rate and $LET_{min}$ and $LET_{max}$ are the minimum and maximum linear energy transfer values for the given environment.

The figure of merit (FOM) method provides a single number to roughly estimate the device SEU rate in almost all orbits. After a complex mathematical derivation (for details, see [123]), the figure of merit formula for estimating geosynchronous SEU is given as [25, 148]:

$$E_r = 5\times10^{-10}\,\frac{\sigma_{sat}}{L_c^2} = 5\times10^{-10}\,\frac{abc^2}{Q_{crit}^2} \qquad (3.2)$$

where $\sigma_{sat}$ is the saturation device cross section and $a$, $b$ and $c$ are the dimensions of the approximated parallelepiped sensitive region. $Q_{crit}$ is the critical charge (see Appendix A for its definition), and $L_c = Q_{crit}/c$, where the sensitive region is modeled as a flat-plate capacitor of thickness $c$. The SEU error rate $E_r$ is expressed per sensitive region per unit time, where $a$, $b$, and $c$ are given in microns and $Q_{crit}$ in picocoulombs. Additional numerical expressions for proton-based, neutron-based and alpha-particle-induced SEU error rates can be found in [123].
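Equation (3.2) is simple enough to evaluate directly. The following is a minimal sketch, not the author's tool: the function names are hypothetical, and the numbers are the 1K SRAM values used in the worked example of this section (sensitive region of 3 × 10 × 10 microns, 0.05 pF, 5.5 V and 2.5 V levels, half of the 1,024 cells susceptible).

```python
def critical_charge_pc(capacitance_pf, v_high, v_low):
    """Q_crit ~ C * (V_OH - V_OL); pF times V gives pC."""
    return capacitance_pf * (v_high - v_low)


def fom_error_rate(a_um, b_um, c_um, q_crit_pc):
    """Figure of merit estimate, Equation (3.2):
    E_r = 5e-10 * a*b*c^2 / Q_crit^2, with a, b, c in microns and
    Q_crit in pC, giving errors per sensitive region per day."""
    return 5e-10 * a_um * b_um * c_um ** 2 / q_crit_pc ** 2


if __name__ == "__main__":
    q_crit = critical_charge_pc(0.05, 5.5, 2.5)
    per_cell = fom_error_rate(10.0, 10.0, 3.0, q_crit)
    # Assumption taken from the example: half of the 1,024 cells
    # are SEU susceptible at any given time.
    per_device = 0.5 * 1024 * per_cell
    print(q_crit, per_cell, per_device)
```

With these inputs the sketch gives a critical charge of about 0.15 pC, about 2e-5 errors per cell per day, and about 1.024e-2 errors per device per day.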
Example: SER Estimation from Figure of Merit Method [123]

For a particular 1K SRAM, experimental data show that the individual SEU sensitive region has dimensions of approximately 3 × 10 × 10 µm³, and it is modeled as a 0.05 pF flat-plate capacitor. From the manufacturer's specification sheet for this SRAM, the cell $V_{OH} = 5.5$ V and $V_{OL} = 2.5$ V. Calculate the SEU error rate per cell-day for this SRAM using the figure of merit equation.

Solution: The cell critical charge is $Q_{crit} \approx C \times \Delta V = 0.05\times10^{-12} \times (5.5 - 2.5) = 0.15$ pC. From the figure of merit we get

$$E_r = 5\times10^{-10} \times \frac{abc^2}{Q_{crit}^2} = 5\times10^{-10} \times \frac{(10)\times(10)\times(3)^2}{0.15^2} = 2\times10^{-5} \text{ errors per cell per day.}$$

For this SRAM, suppose that on average half of the cells are biased so as to be SEU susceptible at a given time; then the device error rate is $\frac{1}{2} \times (1.024\times10^{3}) \times (2\times10^{-5}) = 1.024\times10^{-2}$ errors per device per day.

Due to the lack of accuracy in the FOM formula and the fact that the sensitive region data for electronic devices are hard to obtain, computer programs have been developed for higher orders of accuracy.

3.2 Computer-Based Programs

Some commonly used computer models [172, 173] for calculating soft error rate are SEMM, CHIME and CREME96.

SEMM is the Soft-Error Monte Carlo Modeling program developed by IBM. It calculates the soft-error rate of semiconductor chips due to ionizing radiation, primarily for determining whether chip designs meet SER specification requirements. The inputs are detailed circuit layout and process information and circuit $Q_{crit}$ values [172].

CHIME (CRRES/SPACERAD Heavy Ion Model of the Environment) is a significant tool mainly supported by U.S. Air Force and Office of Naval Research grants. CHIME was developed to calculate single event effects due to interplanetary heavy ions; it includes a set of models that convert energy deposit (LET) spectra into the resulting single event upset rates. This model incorporates the most accurate and up-to-date database currently available for galactic cosmic rays over the past two solar cycles.
Also, it provides a predictive model for these fluxes through the next two solar activity minima, to the year 2010 [49].

CREME96 (Cosmic Ray Effects on Micro Electronics, 1996 version) is a widely used design tool in the aerospace industry. It was developed by the Naval Research Laboratory [178]. Its main purposes include:
• Creating numerical models of the ionizing radiation environment in near-Earth orbits;
• Evaluating the radiation effects on electronic systems in spacecraft and in high-altitude aircraft;
• Estimating the high LET radiation environment within manned spacecraft.

The CREME96 program takes advantage of the exact multi-term expression for the chord distribution by numerically integrating without proceeding to the FOM Equation 3.2. Also, this program contains up to a dozen environment options [123].

3.3 Analytical Models

A circuit level SER analysis can be performed at many levels, from the numerical device simulation level to the abstractions of architectural derating [85, 112, 131, 189]. Accurate estimation of nominal SER is important for circuit level SER estimation, and this step lies in the middle of a variety of levels of SER simulation, ranging from device characterization to system level analysis [184]. The SER of a design is defined in terms of the nominal soft-error rates of individual elements, such as SRAMs, sequential elements such as flip-flops and latches, and combinational logic, that depend on the circuit design and the architecture [128]:

$$SER_{design} = \sum_i SER^{nominal}_i \times \text{Prob}(\text{error in } i\text{th circuit produces a system error}) \qquad (3.3)$$

where $SER^{nominal}_i$ refers to the soft error rate of the $i$th element in the circuit when all inputs and outputs of the element are constant. For example, the element can be a node or an SRAM cell in the circuit; $SER^{nominal}_i$ is independent of the input vectors that activate this element [128]. Generally, the $SER^{nominal}_i$ value can be obtained by the soft error testing method illustrated in Section 2.7.
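Equation (3.3) is a weighted sum, which the following minimal sketch makes explicit. The element list and all numeric values are hypothetical placeholders chosen only to illustrate the arithmetic; they are not measurements from this work.

```python
def design_ser(elements):
    """Equation (3.3): sum over elements of
    nominal SER times the probability that an error in that element
    becomes a system error.

    elements: iterable of (nominal_ser_fit, p_system_error) pairs.
    """
    total = 0.0
    for nominal_ser_fit, p_system_error in elements:
        if not 0.0 <= p_system_error <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        total += nominal_ser_fit * p_system_error
    return total


if __name__ == "__main__":
    # Hypothetical elements: (nominal SER in FIT, propagation probability)
    elements = [
        (1000.0, 0.10),  # e.g. an SRAM block
        (200.0, 0.50),   # e.g. a group of flip-flops
        (50.0, 0.02),    # e.g. a block of combinational logic
    ]
    print(design_ser(elements))
```

The probability factor is where the derating terms defined next (timing, logic and electrical) would enter; this sketch simply takes it as a given number per element.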
We define the following terms [133, 134]:

Nominal SER is the probability of an SEU occurring at a specific node. It depends on circuit technology, transistor sizing, node capacitance, VDD value, and temperature. It is independent of the state of input vectors that drive the node.

Timing Derating is the fraction of time in which the circuit is susceptible to an SEU that propagates the SEE-induced pulse through the flip-flops. The susceptibility of the timing window will be discussed later in this section. The timing derating increases with increasing clock frequency. The timing vulnerability factors of sequential elements have been examined in [165].

Logic Derating (LD) is a measure of how the device logically reacts to a particle strike. LD depends on the design architecture and on the input stimulus causing an activated path for error propagation to primary outputs without logical masking of the error pulse.

Electrical Derating is the electrical property of a device that degrades the error pulses passing through it. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude or width to be latched [133]. Electrical masking plays an important role in soft error rate estimation for combinational logic. In experimental results on a small circuit having a logic depth of five gates, ignoring electrical masking effects is known to cause an overestimation of the SER by 138% [67].

The main factors that influence the scaling of nominal FIT are as follows [134]:

1. Diffusion area: The probability of SEU on a specific node is roughly proportional to the area of the diffusion region for that node, since the charge separation occurs near the diffusion areas.

2. Charge scaling: Keeping up with the technology trend, the capacitance per node and the supply voltage are decreasing; hence less charge is needed for the state of the node to flip.
Charge scaling dominated the SER trend in older processes, making the SER sensitivity increase every generation. However, in recent deep submicron technologies, many circuits such as memory cells are flux limited or saturated. In such cases, process scaling reduces the diffusion area but does not increase the circuit sensitivity.

3. Voltage scaling: Voltage scaling has historically contributed to a trend of increasing soft error sensitivity with process evolution. But in recent process generations voltage scaling has lagged process dimension scaling, contributing to a decline in FIT per bit for 90 nm and 65 nm memory technologies [134].

4. Process advances: Advances such as SOI or similar partially or fully depleted layers significantly reduce the charge collection volume and efficiency, leading to reduced sensitivity to soft errors. SER sensitivity is also impacted by details of doping profiles and doses. IBM reported a five times improvement in FIT rate for partially depleted SOI for SRAM cells at 90 nm [76]. No data has been reported for latches.

5. Flux of alpha particles: The flux of alpha particles strongly depends on the amount of radioactive residues and the placement of metal layers in the package. Cleaner materials and more metal layers tend to reduce the flux, so alpha particles become less of an issue in modern processes. However, alpha particles do impact nodes with very small charge, so the sensitivity to alpha particles increases every generation except for flux-limited circuits.

In a real case, the combinational SEU error rate estimation is complex because it is related to the gate types and the paths the SEU propagates through. A comparative study has been presented between the $Q_{crit}$ method and the simulation method for estimating the circuit level SER [112]. In that work it is shown that for small circuits with uniformly distributed output values (e.g., flip-flop, binary counter), both methods provide similar estimates for SER.
SER analysis for logic circuits poses a challenge for electronic reliability analysis. Unlike in memories, soft errors occurring inside a logic circuit may be filtered out by the circuit itself and thus may not affect the circuit performance, as discussed in previous sections.

Analytical methods are widely used to model soft errors probabilistically. Asadi et al. [16, 17, 18] present a soft error rate estimation technique based on error probability propagation. Rejimon and Bhanja [159, 160] give a single event fault model based on probabilistic Bayesian networks, which captures spatial dependencies. Hayes et al. [80] present a framework for modeling transient-error tolerance in logic circuits. However, these approaches do not take electrical masking into account, and characteristics of transient pulses such as pulse width are ignored.

An improvement was provided by Zhao et al. [194, 195]. They proposed a constraint-aware robustness insertion methodology that protects the sequential elements in digital circuits to suppress various noise effects. Their noise probability density function represents the distribution of noise that has survived circuit masking effects at internal nodes to reach the flip-flops, as determined by a probability matrix mapping. However, in that work the authors did not include environmental factors such as the error rate. Besides, their propagation method requires tabulating all pulse width and height data for each logic gate. It would thus take an enormous amount of memory for large logic circuits.

A closed-form model for simulation and analysis of voltage transients caused by SEU in logic circuits provides an accuracy within 5% of the result obtained from SPICE, with over 100X improvement in computational speed [129]. Ramanarayanan et al. [155] analyze soft error rate in flip-flops and scannable latches. Hazucha et al. [81, 84] proposed an empirical model for estimation of SER induced by neutrons.
The dynamic behavior of a circuit with massive critical paths in the presence of an SET has been studied, and a novel flip-flop architecture to mitigate the effects of such SETs in combinational circuits has been proposed [91]. Logic circuit SER estimation systems include SEAT-LA [153], SERA [193] and that described by Rao et al. [156].

A recent paper [124] proposes an approach using symbolic analysis based on binary decision diagrams (BDD), algebraic decision diagrams and a probabilistic model for sequential SER analysis. Rewriting, extensively used for optimizing the area and power consumption, has been found to also reduce the soft error rate [14].

The production and propagation of single-event transients in scaled CMOS digital logic circuits have been widely examined in [51, 61, 63]. In [63], three-dimensional mixed-level simulation is used to study both bulk CMOS and silicon-on-insulator (SOI) technologies for scaling trends to the 100 nm technology node.

The impact of variations on soft error vulnerability for nanometer VLSI circuits is studied in [152, 154]. These include variations in device parameters caused by static process variations, dynamic variations in power supply and temperature, and slow degradation of individual devices due to phenomena like hot carrier injection (HCI) and negative bias temperature instability (NBTI). The increasing variability affects not only the behavior of contemporary ICs but also their vulnerability to transient error phenomena, especially radiation induced soft errors. The device threshold voltage can also play a significant role in soft error rate estimation [56].

The algorithmic techniques of formal verification, used for design debugging [166], can also be used to estimate vulnerability to reliability problems and to reduce overheads of circuit mechanisms for soft error resilience. One technique for synthesizing multilevel circuits with concurrent error detection is presented in [177].
Timing redundancy and space redundancy based soft-error tolerance techniques for nanometer technologies have been presented [15, 126, 127, 128, 135]. Timing redundancy based scan flip-flops are reused to reduce the SER of combinational logic, thus approaching the goal of minimizing the area overhead of radiation hardening [64].

FPGAs have become prevalent in critical applications where transient faults can seriously affect the system operation. The fault tolerance techniques for transient and permanent faults in SRAM-based FPGAs have been presented in [55, 96]. A good summary of fault tolerant techniques for FPGAs can be found in [100]. Besides FPGA systems, radiation hardened micro-controller techniques are presented in [54, 108, 143, 188].

Soft-error/noise tolerant techniques are necessary for maintaining the signal-to-noise ratio (SNR) in critical DSP applications. The checksum-based probabilistic error correction method uses the value indicated by the checksum variable to probabilistically correct the error and achieves up to 5 dB improvement in SNR [19, 20]. System level self-checking and self-diagnosing techniques are proposed in [191] for 32-bit microprocessors and multipliers.

A cost effective radiation hardening technique, which exploits the hardening of gates that have the lowest logical masking probability to achieve tradeoffs between overhead and soft error failure rate reduction, is presented in [196, 197]. More hardening techniques can be found in [53, 116]. Gate sizing may be another possible approach to increase the transient error tolerance, as illustrated in [59].

An approach to minimize the impact of soft errors in domino logic by using complementary pass transistors and an additional weak keeper to selectively isolate the logic gates struck by cosmic rays is studied in [104]. This error suppression approach comes with no extra power consumption and with modest area (2.6%) and delay (13.6%) overheads.
A cost effective approach to design logic circuits with concurrent error detection by exploring the asymmetric soft error susceptibility of nodes has been described [130]. Combinational logic error analysis and protection schemes are studied in [138].

Inspired by the principles of immunology, a hardware immune system has been demonstrated. This hardware immune system runs in real time and continuously monitors a finite state machine (FSM) architecture for errors [36, 37].

The impact of technology scaling on soft error rates can be found in [27, 169]. Effects of CMOS technology scaling and the soft error rates caused by atmospheric neutrons have been investigated [82].

Chapter 4
Environment-Based Probabilistic Soft Error Model

This chapter is an original contribution of the present research. Unlike in memories, in a logic circuit a single event effect (SEE) exists as a single event transient (SET) pulse. An SET has unique characteristics like polarity, waveform, amplitude and duration, and these characteristics depend on particle impact location, particle energy, device technology, device supply voltage and output load. A single event upset (SEU) does not occur unless the SET survives the circuit masking effects and is captured by a clock edge into a sequential element. The SET can be eliminated by electrical masking, logic masking and temporal masking [128, 133].

Environmental neutrons, the principal cause of these transients, come from cascaded interactions when galactic cosmic rays traverse Earth's atmosphere. These neutrons reach the ground with finite probabilities. The neutron flux is usually given in units of N/cm²-s, where N is the number of neutron particles. The intensity of cosmic-ray induced neutron flux in the atmosphere varies with altitude, geomagnetic field, and solar magnetic activity. The flux data are available from observations accumulated over decades [123, 199]. One often cites the JEDEC standard [4].
Each neutron has a unique energy when it arrives at the ground. The particle itself does not induce an error; it is its interaction with electronic materials that causes the error. The neutron energy is one of the key properties here; we neglect the effects of the angle of incidence of the particle strike. Not every particle hits the sensitive silicon area to induce an error. An SEU occurs with a certain probability for each high-energy particle hit. Such a probability can be obtained from existing computer programs, for example, IBM's SEMM. Figure 4.1 [172] shows the result when a CMOS SRAM chip was simulated for 30 MeV neutron hits. The probability of SEU is a function of the particle energy and the critical charge. In the circuit design process, once a circuit is laid out, the critical charge for each cell is defined. Although we did not use the SEMM program in our experiment on logic circuits, we mention it to illustrate how the error probability can be derived.

[Figure 4.1: Probability of soft error for each collision of a 30 MeV neutron as a function of the average critical charge for an SRAM chip (from SEMM program [172]). x-axis: average critical charge (fC); y-axis: SER probability per hit.]

To consider all energy components in our proposed soft error model, we average the error probability over different energies and assign each circuit node a unique error probability value. The particle energy distribution at specific locations for specific technology nodes can be obtained from experimental results. For example, cosmic particle strikes were simulated using a heavy ion beam at the Twin Tandem Van de Graaff accelerator at Brookhaven National Laboratory, and the results suggest that in the natural environment of space the probability distribution of high-energy particles falls rapidly with increasing LET. For both 0.5 µm and 0.35 µm
CMOS technology processes at the ground level, the largest population has a linear energy transfer (LET) of 20 MeV-cm²/mg or less, and particles with LET greater than 30 MeV-cm²/mg are exceedingly rare [78]. The LET of a striking particle multiplied by a characteristic length of the material gives the charge accumulated due to the strike. These results are used in our experiments in Section 4.2.

In addition, from the statistical energy distribution we are able to model the statistical SET widths in a logic circuit by applying the LET values to the commonly used double-exponential transient current model [122]:

$$I(t) = \frac{Q_{coll}}{\tau_\alpha - \tau_\beta}\left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right) \quad (a)$$
$$Q_{coll} = 10.8 \times L \times LET \quad (b) \qquad (4.1)$$

where $Q_{coll}$ is the collected charge in the sensitive region, $\tau_\alpha$ is the collection time constant, which is a process-dependent property of the junction, and $\tau_\beta$ is the ion-track establishment time constant, which is relatively independent of the technology. In bulk silicon, a typical charge collection depth ($L$) is 2 µm for every 1 MeV-cm²/mg, and an ionizing particle deposits about 10.8 fC of charge along each micron of its track. Typical values are approximately $1.64\times10^{-10}$ s for $\tau_\alpha$ and $5\times10^{-11}$ s for $\tau_\beta$ [43, 194].

[Figure 4.2: Transforming statistical neutron energy spectrum to SET width statistics. Blocks: LET distribution; double-exponential current model; statistical induced current; circuit node capacitance (charging/discharging); statistical pulse width density.]

From Equation (4.1), the transient current pulse created by a particle strike for each given LET can be calculated. By charging and discharging the circuit node capacitance, the single event transient current pulse is converted into a transient voltage pulse, as shown in Figure 4.2. Following the preceding discussion, Figure 4.3 gives a neutron-induced soft error model for logic circuits.
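To make Equation (4.1) concrete, here is a minimal sketch of the double-exponential current model. It uses the time constants and the 10.8 fC per micron figure quoted above; the function names, the choice of femtocoulombs and seconds as units, and the example LET and depth values are assumptions made for illustration only.

```python
import math

TAU_ALPHA = 1.64e-10  # s, collection time constant (typical value from the text)
TAU_BETA = 5e-11      # s, ion-track establishment time constant
FC_PER_MICRON = 10.8  # fC deposited per micron of track, per unit LET


def collected_charge_fc(let, depth_um):
    """Equation (4.1b): Q_coll = 10.8 * L * LET, in fC.
    let is in MeV-cm^2/mg and depth_um is the collection depth in microns."""
    return FC_PER_MICRON * depth_um * let


def set_current(t, q_coll, tau_a=TAU_ALPHA, tau_b=TAU_BETA):
    """Equation (4.1a): transient current at time t (s), in fC/s
    when q_coll is given in fC."""
    return q_coll / (tau_a - tau_b) * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def peak_time(tau_a=TAU_ALPHA, tau_b=TAU_BETA):
    """Time at which I(t) is maximal, from dI/dt = 0."""
    return tau_a * tau_b / (tau_a - tau_b) * math.log(tau_a / tau_b)


if __name__ == "__main__":
    q = collected_charge_fc(let=20.0, depth_um=2.0)  # illustrative inputs
    t_peak = peak_time()
    print(q, t_peak, set_current(t_peak, q))
```

A useful sanity check on this form is that the current integrates to $Q_{coll}$ over all time, since $\int_0^\infty (e^{-t/\tau_\alpha} - e^{-t/\tau_\beta})\,dt = \tau_\alpha - \tau_\beta$. Converting the current pulse into a voltage pulse width additionally requires the node capacitance and the restoring current of the driving gate, which this sketch does not model.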
Because the probability per hit is related to the neutron flux, which is location dependent, we can easily get the circuit SER in units of FIT for different locations if the corresponding neutron flux data are available.

In summary, this probabilistic soft error model is based on two considerations: (1) the occurrence of SEUs, represented as soft error frequencies, and (2) once an SEU occurs, it exists in the logic circuit as SETs with different pulse width densities, represented as probability density functions. Note that the pulse width is measured not between the half peak-to-peak points of the pulse, but between the points at which the pulse crosses half of the power supply voltage of the logic circuit.

[Figure 4.3: Proposed probabilistic neutron induced soft error model for logic. Blocks: neutron energy (LET) spectrum; SEU probability per neutron hit for given circuit node; soft error frequency; SET widths density; proposed soft error model.]

4.1 Gate-Level SET Propagation

Having discussed the modeling of soft errors by two factors (occurrence rate and density), we will now discuss the propagation of errors through a logic gate.

4.1.1 Pulse Widths Probability Density Propagation

Assume that the input SET width is a random variable $X$ with probability density function $f_X(x)$, and that the output SET width is a random variable $Y$ with probability density function $f_Y(y)$. Suppose the function $g$ expresses the relationship between variable $X$ and variable $Y$: $Y = g(X)$. Given the probability density function of the input pulse width $X$ and the propagation function $g(X)$, we need to find the probability density function of the output pulse width $Y$. In the following derivation, we use the theory of random functions [146].
The pulse width propagation function $g$ for each individual gate is obtained as follows:

$X$ and $Y$ are random variables
$X$: input pulse width, $Y$: output pulse width
$f_X(x)$: probability density function of $X$
$f_Y(y)$: probability density function of $Y$
Given function $g$: $Y = g(X)$, and more specifically, $g$: $Y = g\{X;\ p: W/L;\ n: W/L;\ C_{load};\ \text{technology}\}$

Assume $g$ is differentiable and an increasing function, so $g'$ and $g^{-1}$ exist. Then,

$$\int_x^{x+\Delta x} f_X(s)\,ds = \int_y^{y+\Delta y} f_Y(t)\,dt \implies f_X(x)\,\Delta x = f_Y(y)\,\Delta y$$

i.e.,

$$f_Y(y) = \lim_{\Delta x \to 0} \frac{f_X(x)\,\Delta x}{\Delta y} = \lim_{\Delta x \to 0} \frac{f_X(x)}{\Delta y/\Delta x} = \frac{f_X(x)}{g'(x)}$$

$$\implies f_Y(y) = f_X(x)/g'(x)$$

The pulse width propagation depends on the load capacitance, and the induced soft error pulse at the input of the gate will propagate only if the affected node is on a sensitized path of the circuit. Load capacitances are generally determined from the layout. Since we did not have the physical layouts of benchmark circuits, we used a wire-load capacitance model [171, 190]. Wire-load models estimate the capacitance of a net from its pin count and the technology data. In its simplest form, the load capacitance of a gate can be estimated as the technology-dependent nominal gate delay multiplied by (1 + number of fanouts). Our analysis, however, is not limited to using wire-load models, and more accurate capacitance data, if available, can be readily used.

First consider a CMOS inverter as an example. Suppose we have a positive glitch (0 to 1 and 1 to 0 transitions separated by a glitch-width interval) at the input. We evaluate the output and, as expected, there will be a negative glitch there. The output width will, however, vary depending on the load capacitance and the technology-dependent transistor characteristics, which provide inertial delay to the inverter.

For a general multiple input logic gate, a glitch at an input may propagate to the output only if the affected node is sensitized to the gate output.
For example, for a NAND gate with a glitch of a certain width on one of its inputs, if any other input is at logic 0 then, no matter how wide the input glitch is, it will not get through the gate because there is no sensitized path. Even when all other inputs are at 1, the input glitch must be wide enough to overcome the inertia of the gate and propagate to its output. Moreover, unless the glitch can propagate through all gates on a path to a primary output, it will not affect the correct operation of the circuit.

We should remember that in our analysis, single event transient pulses are randomly induced at gates. The probability of a pulse being induced at a gate output depends on the probability of a neutron strike at sensitive regions in that gate. The width of the pulse is then a random variable whose probability density is determined from the LET distribution of the striking neutron, technology-dependent gate characteristics and the output node capacitance. Next, given that a pulse is induced, its propagation to the next gate toward the primary output will depend on signal values. Thus, signal probabilities will determine the probability of pulse propagation. In addition, the transfer functions of gates (denoted as $g()$) will determine the probability density function for the propagated pulse width.

From HSPICE simulation we find that the function $g$ is a nonlinear transmission function. However, a piecewise-linear "3-interval" propagation model can give a good approximation. Given a sensitized path of a generic gate, depending on the input pulse width ($D_{in}$) and the gate input-output delay, there are three intervals of possible input glitch durations that can be identified [32, 144]. Thus, for a generic logic gate, the pulse width propagation model is:

1. Propagation with no attenuation, if $D_{in} \ge 2\tau_p$.
2. Propagation with attenuation, if $\tau_p < D_{in} < 2\tau_p$.
3. Non-propagation, if $D_{in} \le \tau_p$.

where
• $D_{in}$: input pulse width.
Also represented by random variable $X$
• $D_{out}$: output pulse width (to be determined). Also represented by random variable $Y$
• $\tau_p$: gate input to output delay

We validate this propagation model by simulating a CMOS inverter using HSPICE. The results are shown in Figure 4.4. This CMOS inverter is in TSMC035 technology with nMOS W/L ratio = 0.6 µm/0.24 µm and pMOS W/L ratio = 1.08 µm/0.24 µm. At the gate output, the rising delay was 41.5 ps and the falling delay was 30.8 ps for a load capacitance of 10 fF. We use an average gate delay of $\tau_p = 36.0$ ps in the proposed propagation model. The mathematical expression is given in Equation (4.2).

[Figure 4.4: Comparison of proposed model and HSPICE simulation for CMOS inverter with 10 fF load capacitance. x-axis: input pulse width (ps); y-axis: output pulse width (ps); curves: negative input pulse, positive input pulse, proposed model.]

In Figure 4.4, the x-axis is the input pulse width and the y-axis is the output pulse width. We observe that when the input pulse width is greater than 72 ps, i.e., $2\tau_p$, the output pulse width can be either greater or smaller than the input pulse width, depending on the input pulse type. These differences are caused by different rising and falling delays. Thus, the proposed model is a good approximation to the HSPICE simulation.

$$D_{out} = \begin{cases} 0 & \text{if } D_{in} \le 36.0\ \text{ps} \\ (D_{in} - 36.0) \times \dfrac{72.0}{36.0} & \text{if } 36.0\ \text{ps} < D_{in} < 72.0\ \text{ps} \\ D_{in} & \text{if } D_{in} \ge 72.0\ \text{ps} \end{cases} \qquad (4.2)$$

For this CMOS inverter with an output load capacitance of 10 fF, an illustration of the monotonic mapping of the probability density $f_Y(y)$ is given in Figure 4.5.
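The 3-interval model of Equation (4.2) can be sketched in a few lines. This is an illustrative sketch rather than the simulator used in this work: the function names are hypothetical, and writing the attenuated branch as $(D_{in} - \tau_p) \times 2\tau_p/\tau_p$ for a general $\tau_p$ is an assumption that generalizes the 36.0 ps and 72.0 ps constants of the inverter example.

```python
def propagate_width(d_in, tau_p):
    """Piecewise-linear 3-interval pulse width propagation.
    d_in and tau_p must use the same time unit (ps in the text)."""
    if d_in <= tau_p:
        return 0.0                               # non-propagation
    if d_in < 2.0 * tau_p:
        return (d_in - tau_p) * 2.0              # propagation with attenuation
    return float(d_in)                           # propagation with no attenuation


def classify(d_in, tau_p):
    """Name the interval an input pulse width falls into."""
    if d_in <= tau_p:
        return "filtered"
    if d_in < 2.0 * tau_p:
        return "attenuated"
    return "passed"


if __name__ == "__main__":
    TAU_P = 36.0  # ps, average inverter delay quoted for the 10 fF load
    for width in (20.0, 36.0, 50.0, 72.0, 200.0):
        print(width, propagate_width(width, TAU_P), classify(width, TAU_P))
```

The attenuated branch meets the other two at their boundaries (0 at $\tau_p$ and $2\tau_p$ at $2\tau_p$), so the mapping is continuous and non-decreasing, which is the property the density transformation $f_Y(y) = f_X(x)/g'(x)$ relies on within the attenuated and passed intervals.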
[Figure 4.5: Pulse width density propagation through a CMOS inverter with 10 fF load. Panels: input width density f(X) versus input width (ps); function g: y = g(x), output width Y (ps) versus input width X (ps); output width density f(Y). Regions: 1: filtered, 2: attenuated, 3: passed. Annotated EMR ≈ 0.96.]

The characteristics of the three regions in this figure are: an input pulse width in region 1, 2 or 3 will be filtered, attenuated, or passed without attenuation, respectively. A pulse being filtered actually assumes the shape of a delta function. Similarly, we simulated all gates by HSPICE to extract the gate delays and build the propagation model $g$. Agreement similar to that in Figure 4.4 was observed for all other logic gates.

4.1.2 Logic SEU Probability Propagation

Because all pulse widths must be greater than or equal to 0, we have

$$\int_0^\infty f_Y(y)\,dy = \int_0^\infty f_X(x)\,dx = 1 \qquad (4.3)$$

Table 4.1: Output 1 probability calculation for n-input Boolean gates.

Gate | Probability (output = 1)
AND | $P_1(out) = \prod_{i=1}^{n} [P_1(in(i))]$
NAND | $P_1(out) = 1 - \prod_{i=1}^{n} [P_1(in(i))]$
OR | $P_1(out) = 1 - \prod_{i=1}^{n} [1 - P_1(in(i))]$
NOR | $P_1(out) = \prod_{i=1}^{n} [1 - P_1(in(i))]$

In the conversion from $f_X(x)$ to $f_Y(y)$, there is a fraction of pulses that is filtered out or attenuated due to electrical masking (i.e., suppression by gate inertia). We define the electrical masking ratio (EMR) as the fraction of pulses that survives propagation, as given in Equation (4.4):

$$EMR = \frac{\int_{y>0} f_Y(y)\,dy}{\int_{x>0} f_X(x)\,dx} \qquad (4.4)$$

We assume that all signal probabilities are known. They can be obtained in several ways. If a set of input vectors is given, then a zero-delay logic simulation [41] can easily determine all signal probabilities. Alternatively, signal probabilities can be determined from a static analysis of the primary input probabilities [167].
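The static calculation with the rules of Table 4.1 can be sketched as a single forward pass over a netlist in topological order. The netlist format, the function names and the three-gate example circuit below are hypothetical; the sketch also assumes statistically independent gate inputs, so correlations from reconvergent fanout are ignored.

```python
from math import prod


def p1_out(gate, p1_inputs):
    """Probability that the gate output is 1, per Table 4.1,
    assuming independent inputs."""
    if gate == "AND":
        return prod(p1_inputs)
    if gate == "NAND":
        return 1.0 - prod(p1_inputs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in p1_inputs)
    if gate == "NOR":
        return prod(1.0 - p for p in p1_inputs)
    raise ValueError("unsupported gate type: " + gate)


def forward_pass(netlist, p1_primary_inputs):
    """netlist: list of (output_name, gate_type, [input_names]) in
    topological order. Returns a dict of 1-probabilities for all signals."""
    p1 = dict(p1_primary_inputs)
    for out_name, gate, in_names in netlist:
        p1[out_name] = p1_out(gate, [p1[name] for name in in_names])
    return p1


if __name__ == "__main__":
    # Hypothetical three-gate circuit with equiprobable primary inputs
    netlist = [
        ("n1", "NAND", ["a", "b"]),
        ("n2", "NOR", ["b", "c"]),
        ("y", "AND", ["n1", "n2"]),
    ]
    print(forward_pass(netlist, {"a": 0.5, "b": 0.5, "c": 0.5}))
```

In this example signal b feeds both n1 and n2, so the value computed for y is only an approximation; this is the kind of correlation that the independence assumption neglects.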
When no vectors are given, one often assumes equiprobable 0s and 1s at primary inputs. Because exact signal probability calculation has high complexity, a simple approximation that ignores correlations between signals at gate inputs is often used. In that case, the logic 1 probability calculation rules for n-input logic gates are given in Table 4.1. Here, $P_1(in)$ and $P_1(out)$ denote the 1-probabilities of the input and output signals of the gate. Logic 0 probabilities are obtained simply by complementing logic 1 probabilities. Our SER analysis works with signal probabilities irrespective of how those probabilities were obtained. For the benchmark circuit results we report, we assumed no given vectors and equiprobable inputs. This corresponds to random input vectors. Signal probabilities were calculated in a single input-to-output pass using the formulas of Table 4.1.

[Figure 4.6: A generic gate with particle strike on node 1. Generic logic gate j with inputs 1, 2, 3, ..., i and output o. Input 1 carries distribution f(X) and frequency Perror(1); output o carries distribution f(Y) and frequency Perror(o).]

If an SEU occurs on input 1 of logic gate j in Figure 4.6, then the output soft error probability is calculated by Equation (4.5):

$$
P_{SEU}(o) = P_{SEU}(1) \times \underbrace{EMR_j}_{\text{electrical masking}} \times \underbrace{\prod_{k=2}^{i} P_{\text{non-controlling}}(k)}_{\text{logic masking}}
\tag{4.5}
$$

where the product runs over the remaining inputs 2, ..., i of gate j. Here again, we have assumed that all inputs of the gate are statistically independent. This is an approximation that can be improved [95]. However, we believe that the uncorrelated signal assumption gives reasonable accuracy at low computational complexity.

4.2 Experimental Results

We analyzed ISCAS85 benchmark circuits and inverter chains of varying lengths with a simulator developed in the C programming language. For simplicity, we assume that all the circuits are working at the ground level and the probability of SEU per particle hit is $10^{-4}$. For ground level we use the neutron energy statistics discussed in previous chapters.
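The probability rules of Table 4.1 and the propagation rule of Equation (4.5) can be sketched as follows. This is a minimal illustration under the same independence assumption. The function names and gate-type strings are hypothetical, and the code assumes the usual definition of the non-controlling value (1 for AND/NAND, 0 for OR/NOR).

```python
from math import prod
from typing import Sequence


def output_one_probability(gate: str, p1_inputs: Sequence[float]) -> float:
    """Probability that the gate output is 1 (Table 4.1), inputs independent."""
    all_ones = prod(p1_inputs)                     # every input is 1
    all_zeros = prod(1.0 - p for p in p1_inputs)   # every input is 0
    table = {
        "AND": all_ones,
        "NAND": 1.0 - all_ones,
        "OR": 1.0 - all_zeros,
        "NOR": all_zeros,
    }
    return table[gate]


def output_seu_probability(gate: str, p_seu_in: float, emr: float,
                           p1_other_inputs: Sequence[float]) -> float:
    """Equation (4.5): SEU probability at the output for an SEU on one input.

    The other inputs must all hold the non-controlling value for the error
    to propagate (assumed: 1 for AND/NAND, 0 for OR/NOR).
    """
    if gate in ("AND", "NAND"):
        logic_masking = prod(p1_other_inputs)
    elif gate in ("OR", "NOR"):
        logic_masking = prod(1.0 - p for p in p1_other_inputs)
    else:
        raise ValueError(f"unsupported gate type: {gate}")
    return p_seu_in * emr * logic_masking


if __name__ == "__main__":
    # 3-input NAND, SEU on input 1, equiprobable values on inputs 2 and 3;
    # the EMR value 0.9 is an arbitrary illustrative number.
    print(output_one_probability("NAND", [0.5, 0.5, 0.5]))
    print(output_seu_probability("NAND", 1e-4, 0.9, [0.5, 0.5]))
```

A single topological pass that applies `output_one_probability` gate by gate corresponds to the single input-to-output pass described above.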
We assume the SET width density per circuit node follows the normal distribution with mean μ = 150 ps and standard deviation σ = 50 ps. These assumptions are justified for relatively small values of particle flux and small chip area. From [200], the total neutron flux at sea level is 56.5 m⁻² s⁻¹. For a CMOS circuit in TSMC035 technology, we assume the sensitive region area is 10 μm² for each circuit node. For a circuit with n primary outputs and m gates, the SER per gate per output is

$$
\frac{1}{n}\sum_{i=0}^{n}\left(\frac{1}{m}\sum_{j=0}^{m} SER_{i\ \text{caused by}\ j}\right)
$$

Table 4.2: SER results for ISCAS85 benchmark circuits.

Circuit   # PIs   # POs   # Gates   CPU (s)   FIT/gate/output
c17           5       2         6      0.01            0.3679
c432         36       7       160      0.04            1.0563
c499         41      32       202      0.14            0.2188
c880         60      26       383      0.08            0.3882
c1908        33      25       880      1.14            0.7427
c2670       233     140      1193      0.77            0.2882
c5315       178     123      2307      2.78            0.5572
c7552       207     108      3512     10.82            0.6652

Table 4.3: SER results for inverter chains.

Circuit   # PIs   # POs   # Gates   CPU (s)   FIT/gate
inv2          1       1         2      0.00     0.2819
inv5          1       1         5      0.00     0.5388
inv10         1       1        10      0.00     0.9654
inv20         1       1        20      0.00     1.8185
inv50         1       1        50      0.00     4.3780
inv100        1       1       100      0.04     8.6473

From Table 4.3 we see that the SER increases almost linearly with increasing inverter chain length. That is because an inverter chain has no logic masking, and under the current environmental condition a portion of the SEUs will always survive through the inverters no matter how long the chain is. In Table 4.2 for logic circuits, however, the SER does not increase with the number of gates. The logic masking in these circuits seems to increase with the number of gates. Field test data for logic circuits is largely unavailable, but actual neutron experiments on a test chip would help to validate our analysis in the future. The CPU times for these results are for a Sun Fire 280R workstation.

4.3 Conclusion

In this chapter, we presented a novel soft error model for combinational circuits based on two parameters, occurrence rate and single event transient pulse width density.
We developed an algorithm to propagate these parameters through logic and calculate the soft error rate in FIT units. In the next chapter, we will discuss the relevancy of our approach and compare our method with other logic soft error rate estimation methods.

Chapter 5
Results Comparison and Discussion

In this chapter, we compare our experimental results with the relevant published work and discuss various key factors that may influence logic SER. Some of those factors have not been considered in the existing logic SER estimation work.

5.1 Experimental Results

For the detailed algorithms that propagate soft errors through elementary logic gates and calculate the SER for a circuit, the reader should refer to the previous chapter. Here, we compare our experimental results with previous publications [17, 153, 156, 159, 193]. We simulated ISCAS85 benchmark circuits with a simulator developed in the C programming language. For simplicity, we assume that all circuits are working at the ground level and the probability of SEU per particle hit is $10^{-4}$. Here we neglect the polarity of SETs and the temporal masking factor. For ground level we use the neutron energy statistics, assuming the SET width density per circuit node follows a normal distribution with mean 150 and standard deviation 50. These assumptions are justified for relatively small values of particle flux and small chip area. From [200], the total neutron flux at sea level is 56.5 m⁻² s⁻¹. For a CMOS circuit in TSMC035 technology, we assume the sensitive region to be 10 μm² for each circuit node. For a circuit with n primary outputs and m nodes, the SER is $\sum_{i=0}^{n}\left(\sum_{j=0}^{m} SER_{i\ \text{caused by}\ j}\right)$, which is different from [187], which uses the formula $\frac{1}{n}\sum_{i=0}^{n}\left(\frac{1}{m}\sum_{j=0}^{m} SER_{i\ \text{caused by}\ j}\right)$. Rao et al. [156] and Rajaraman et al. [153] calculated total logic SER. In Table 5.1, we compare our SER results for selected benchmark circuits with available results; not all benchmark circuit SER results have been published.
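The summed and the averaged SER formulas differ by the factor n × m. As a consistency check, the sketch below multiplies the FIT/gate/output values of Table 4.2 by the numbers of primary outputs and gates and compares the products with the total FIT values listed for our approach in Table 5.1. The dictionary layout and the function name are illustrative; the numbers are copied from the two tables.

```python
# (primary outputs, gates, FIT per gate per output), copied from Table 4.2
TABLE_4_2 = {
    "c432": (7, 160, 1.0563),
    "c499": (32, 202, 0.2188),
    "c880": (26, 383, 0.3882),
    "c1908": (25, 880, 0.7427),
}

# Total FIT for our approach, copied from Table 5.1
TABLE_5_1_FIT = {"c432": 1.18e3, "c499": 1.41e3, "c880": 3.86e3, "c1908": 1.63e4}


def total_fit(num_outputs: int, num_gates: int, fit_per_gate_per_output: float) -> float:
    """Undo the 1/n and 1/m averaging to recover the summed SER in FIT."""
    return num_outputs * num_gates * fit_per_gate_per_output


if __name__ == "__main__":
    for name, (n_po, n_gates, fit_avg) in TABLE_4_2.items():
        total = total_fit(n_po, n_gates, fit_avg)
        print(f"{name}: computed {total:.3e} FIT, Table 5.1 lists {TABLE_5_1_FIT[name]:.2e} FIT")
```

For the four circuits listed, the products agree with the tabulated totals to within rounding, which suggests the two tables report the same underlying results in different normalizations.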
Our results differ greatly from those of Rao et al. [156].

Table 5.1: Experimental results for ISCAS85 benchmark circuits.

                                 Our approach          Rao et al. [156]      Rajaraman et al. [153]
Circuit   # PI   # PO   # Gate   CPU (s)   FIT         CPU (s)   FIT         CPU (min)   Error prob.
c432        36      7      160      0.04   1.18×10^3     <0.01   1.75×10^-5        108        0.0725
c499        41     32      202      0.14   1.41×10^3      0.01   6.26×10^-5        216        0.0041
c880        60     26      383      0.08   3.86×10^3      0.01   6.07×10^-5        102        0.0188
c1908       33     25      880      1.14   1.63×10^4      0.01   7.50×10^-5       1073        0.0011

The order-of-magnitude differences between the results in Table 5.1 need investigation. The published data for SRAMs (see previous chapters) shows SER around 1000 FIT for both analysis and measurement. That is in the same range as our analysis of benchmark circuits. Field test data for logic circuits is largely unavailable, and actual neutron experiments on a test chip in the future will help validate our analysis. The CPU times for our results are for a Sun Fire 280R workstation. The results in [156] were for a Pentium 4 2.4 GHz machine and those in [153] were for a Sun Fire v210 machine. The run times for our approach are comparable to those of [156] and better than those of [153].

5.2 Discussion of Results

In Table 5.2, various methods of analysis are compared. Many factors are listed that influence the calculation of logic SER. However, each of the existing approaches includes only a few of them. We make the following observations:

1. The physics of the SEU phenomenon is involved. For example, funneling and the angle of incidence are not considered in our analysis. We take the energy of neutrons to be the main source that induces the SEU. However, in real cases, it is the physics of interaction between neutrons and silicon that produces the SEU. Simpler modeling and assumptions may influence the SER estimation accuracy.

2. The sensitive region of a transistor is defined as the channel region of an off nMOS transistor or the drain region of an off pMOS transistor.
For a CMOS circuit, the "on" or "off" status of transistors is determined from inputs. In our approach, we statically assume that each circuit node's sensitive region is 10 μm². This may bias the SER results. Also, although we have considered the sensitive area of the circuit node, whether the strike is on a pMOS or an nMOS transistor influences the polarity of the SET. So, the dynamic state of the circuit may further affect the SER.

Table 5.2: Comparison of our work with other SER estimation methods (factors considered).

Authors and reference    LET        Re-conv.   Sensitive   SEU     Vectors   Location   Circuit   SET
                         spectrum   fanout     regions     prob.   applied   altitude   tech.     degradation
Our work                 yes        no         yes         yes     no        yes        yes       yes
Rao et al. [156]         yes        no         no          no      yes       yes        yes       yes
Rajaraman et al. [153]   no         no         no          no      yes       no         no        yes
Asadi-Tahoori [17]       no         no         no          yes     no        no         no        no
Zhang-Shanbhag [193]     yes        no         yes         yes     yes       yes        yes       no
Rejimon-Bhanja [159]     no         no         no          yes     yes       no         no        no

3. Compared to the earth's surface, the size of the sensitive region of a single transistor or a circuit board is trivially small and is getting smaller with the technology trend. We obtain the probability of a particle strike at a sensitive node simply by converting the flux at the surface of the earth from strikes/m²-s to strikes/μm²-s. Theoretically, this seems correct because 1 m² equals 10^12 μm². In real cases, most probably there will be no strike on the sensitive regions, but such low probability events cannot be neglected. Once an SEU occurs, the circuit SER may easily be several orders of magnitude higher compared to the case of no strike at all.

4. For logic circuits, fan-out details should be considered. In our experiment we only considered the worst case error rate for re-convergent fan-outs. For example, if a re-convergent fanout has two paths, and one passes through more gates than the other, our program only takes the path that has fewer gates because it is likely to give the worst SER.
Timing and logic simulation may be needed for better accuracy [58]. In a real circuit, two situations can arise:

• When an SET goes through a large fan-out node, the large load capacitance can eliminate the SET through node inertia.
• If the SET is not canceled by the fan-out node, it goes through multiple fan-out paths. If all paths have equal length, the SET might cancel itself at the merging point, depending on path inversions. However, if the paths have different lengths, one SET on the affected node can cause several propagating SETs that further increase the SER of the circuit. The path delays may also influence logic SER.

5. It is highly recommended to have more field tests for logic circuits. We expect that the SER results from field tests for the same circuit, even in the same working environment, may be widely different at different times. Still, with field test data, the logic circuit SER results can be validated. A comparison with measurement may be the only way to determine which factors can really be neglected and which assumptions and approximations are justified.

6. We compared [153] and [156] for their HSPICE simulation results. In [156], the SER for the C432 circuit was reported as a FIT rate of 2.42×10^-5, while for the same circuit, the HSPICE simulation result in [153] was reported as a probability. For 5,000 iterations it takes 108 minutes and the SER is computed as 0.0725, which equals a FIT rate of 2×10^11. So, the two studies differ by a factor of 10^16. We conclude that, without a proper understanding of SEU phenomena, any results can at best be misleading.

7. None of these SER estimation approaches considered process variation effects on SER, which may also be a factor in the vulnerability to transient errors. It is reported that intra-die process variation of threshold voltage may result in SER variation of 41% in a small circuit [154].

5.3 Conclusion

In real cases, with actual signal values, some paths may not be activated.
Temporal masking by clock sampling would further increase the masking. From our discussion, the logic SER may be highly sensitive to factors like sensitive region calibration, process variation and circuit characterization, making soft error estimation for logic circuits a complex problem. In the next chapter, we extensively study soft error effects on modern computer web server systems.

Chapter 6
Soft Error Considerations in Computer Web Servers

Generally speaking, a computer that is used as a web server by an Internet Service Provider (ISP) with basic mailing and customer site hosting services should have at least the following characteristics (Online source: http://theos.in/windows-server/computer-server-configuration-for-an-isp):

• Dual Intel XEON or AMD Dual core processor.
• RAID 5, 1/2 TB disk.
• 4 GB Error-Correcting Code (ECC) RAM.
• Linux or FreeBSD UNIX operating system.

Services such as a Firewall, Web, FTP, RADIUS (an AAA, i.e., authentication, authorization, and accounting, protocol for controlling access to network resources) server, gaming, etc., may be supported, so the actual requirements for an ISP server will be more stringent than what is mentioned above.

SEU induced errors can be categorized into four types for an electronic system depending on how the system responds to the error [134]: 1) masked error, i.e., the error is tolerated by the system; 2) correctable error, i.e., the error is detected and successfully corrected; 3) detected uncorrectable error (DUE); and 4) silent data corruption (SDC), i.e., a non-detectable error that corrupts the system. A typical server system data corruption target is around 1000 years MTBF, as shown in Table 6.1 [34], but it is very hard to achieve this goal in a cost-effective way.

6.1 Soft Error Reduction in Industrial Servers

Server system designs from different manufacturers vary in their architectures.
However, traditional error checking mechanisms including error correction codes (ECC) or error detection and correction (EDAC) codes, parity, and redundancy, e.g., triple modular redundancy (TMR), are extensively used for reliability, availability and serviceability (RAS) [35, 65].

Table 6.1: Typical server system reliability goals [34].

Error Type                      System MTBF Goal
SDC (Silent Data Corruption)    1000 years (114 FIT)
DUE for system crash            25 years
DUE for application crash       10 years

We consider an industrial product, the HP Integrity NonStop NS16000 server, as an example to show how the design for reliability concept is embodied in each component within the server. This server has the capability to grow with linear scalability from 2 to 4,080 processors and up to 65 TB of main memory. With an announced seven 9s (99.99999%) level of availability [12], the server's fault tolerant hardware techniques are summarized below:

• ServerNet: High bandwidth and low latency optical fabric connection with error detection and isolation capability. The routers are self-diagnosing and data are subjected to a 32-bit cyclic redundancy code (CRC).
• Parity checks and error-correcting code (ECC) are used in the main memory and cache, and parity checks are employed in buses and caches on the logic board.
• For the disk subsystem, parity checks and end-to-end checksum are employed.

We may point out that these techniques combat both hard errors and soft errors. Next, we focus on the error tolerance of storage devices in servers because of their extensive usage and high sensitivity to soft errors.

Memory

The soft errors caused by alpha particles or cosmic rays are not generally repeatable because they are caused by erroneous charge storage rather than by permanent hardware faults. So, soft errors in a memory, if detected, may be corrected by rewriting the erroneous memory cells with correct data.
Otherwise, a failure to correct a soft error in memory may potentially cause a serious system crash. Memory errors are categorized as either single-bit or multi-bit errors. A single-bit error can be detected and corrected by standard error correction codes (ECC). However, in the case of multi-bit errors, when more than one bit is affected simultaneously by cosmic rays, standard ECC may not be sufficient. ECC may be able to detect multi-bit errors, but would have limited ability to correct them in many instances. In some instances, ECC may not even be sufficient to detect the errors.

RAID

The needs of modern computer server systems for extensive data storage require large capacity mass data storage devices. A web server computer system provides text, graphics, sound and video on demand, and typically such a server system requires access to databases stored on rotating rigid magnetic disk drives. The volume of data required for a web server may need a considerable number of disk drives. This increases the probability that any single device might fail. For maintaining data on multiple disk drives in a redundant form, a RAID (Redundant Arrays of Inexpensive Disks [87]) disk drive is commonly used. RAID has the capability to reconstruct the data stored on any single disk drive, if that disk fails, from the data stored on the other disk drives [145]. The disk drive has on-board soft error recovery procedures for recovering data if a soft error occurs. Different data types stored on a disk drive have different characteristics. For alphanumeric data, every bit is potentially of critical importance, while for multimedia data like audio and video, the corruption of a single bit or even of several bits is likely to be acceptable because the consequences may not be severe. So each disk can independently allocate different data redundancy strategies to data of different types.
For example, if a disk drive is selected to store multimedia data, the soft error recovery can be disabled [145].

6.2 A Proposed Direction

It is inevitable that technology scaling and emerging materials will lead to more transient and soft failures of signals, logic values, devices and interconnects in electronic server systems. Given that the RAS requirements for high performance computer servers are stringent, the existing techniques may not be sufficient for the server requirements. Besides, the cost of high reliability may become too high to be acceptable. For example, extensive use of triple modular redundancy (TMR) can lead to excessive cost. Certain techniques, specifically those targeting soft errors, deserve more research to make the next generation servers practical.

For an ECC protected memory, because the testing of the data in a memory sector occurs only when a "read" command is issued for that sector, seldom-accessed sectors may remain untested. Harmless single-bit errors may accumulate over time and result in uncorrectable multi-bit errors. Once a "read" request is finally issued to a seldom-accessed sector, previously correctable errors may have evolved into uncorrectable multiple errors, thereby causing data corruption or system failure [111]. Recovery of memory from multi-bit errors will require more complex means. However, longer error correction codes may be too complex to implement and alternative approaches would be needed [147].

The evolution of the technology brings a new dimension of soft error effects in logic circuits. The SEU-induced transient pulse duration may span more than one clock cycle of operation, and new fault tolerance solutions working at the system level must be devised [109]. For a microprocessor, a long duration fault will cause errors in two adjacent bits at the circuit outputs, thus posing a catastrophic threat. Hard errors are distinguishable from soft errors through the "error log" reports because hard errors are repeatable.
More field test results are needed for web servers. It is necessary to find out the exact causes of the so-called "soft errors" detected and recorded in the server "error log", although this is a tough task. To distinguish the soft errors caused by non-environmental factors from those caused by cosmic rays, experiments that test identical servers at different altitudes, such as at ground level and at 6,000 feet, are necessary to find out how severe the cosmic ray effects on the system are. Then, SEU-specific protection techniques for complex server systems can be devised. However, such results are largely unavailable at this time.

In [107], preliminary measurements were carried out on the ask.com search engine and a set of office desktop computers. These results suggest that the memory SER in real production systems is much lower than that reported by previous studies. The reason cited for the low SER is that the memory DIMMs (dual in-line memory modules) in the system were plugged in perpendicular to the horizontal plane, and the main source of soft errors, cosmic rays, comes straight from above. This result provides a possible SER reduction method based on the hardware layout. The rack-mounted server is the most popular layout style for contemporary server systems. A rack is a metal frame that contains bays designed to hold parts of the server computer. The vertical rack spaces between stacks are defined as rack units ("U-space"); a "U" is equivalent to 1.75 inches. These rack-mounted server systems are ubiquitous. An SEU reliability-oriented hardware layout for server systems is discussed next and may be an SER reduction technique of the future.

First, we will evaluate the sea level cosmic ray characteristics. At sea level, the particle flux contains 94% neutrons, 4% pions, and 2% protons. There are variations in neutron flux with latitude, altitude, diurnal time, earth's sidereal position, and solar cycle.
The earth's magnetic field provides a shield against charged particles everywhere except for particles entering vertically at the poles. As galactic cosmic ray particles near the earth, the magnetic field interacts with the particle's charge and bends the particle's trajectory [201]. Therefore, at sea level, the particles potentially causing SEU strike the electronics at varying angles to the horizontal plane. There is a minimum required distance that a particle with a given LET must travel before sufficient energy is transferred to cause an SEU. So the particle's angle of incidence on the device is important. This phenomenon is similar to the refraction of a beam when traversing from one material to another. As the incidence angle deviates from the normal, the path length traversed by the radiation increases. The angle of incidence at which upsets occur for a given particle LET is known as the critical angle [88]:

$$
\cos(\theta_c) = \frac{LET}{LET_c}
\tag{6.1}
$$

where $LET_c$ is the critical LET and $LET_c < LET_{th}$. The particles that produce upset are between incidence angles $\theta_c$ and $\pi/2$. Thus, two potential cases exist:

1. $LET > LET_c$: all striking incident angles will produce upset.
2. $LET < LET_c$: there is a critical angle, $\theta_c$, above which upsets occur.

[Figure 6.1: Three perpendicular orientations for exposing a transistor and particle angle of incidence [158]. Labels: gate, drain, source; exposure directions 1, 2, 3; layer thickness t and path length t/cos(θ).]

Figure 6.1 is a schematic view of an MOS transistor. It shows three mutually perpendicular directions of exposure to cosmic rays. The direction labeled 1 is traditionally considered for SEU testing at normal incidence. Directions 2 and 3 represent exposures at grazing incidence. For proton induced SEU, the path lengths through the sensitive volume tend to be longer when protons are incident parallel to the longest dimension of the sensitive volume [158]. Particles incident at an angle θ have a path that is 1/cos(θ) times longer than the path at normal incidence.
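The two cases of Equation (6.1) and the 1/cos(θ) path-length relation can be sketched as below. This is a minimal illustration that assumes the angle is measured from the surface normal and that LET and the critical LET are given in the same (arbitrary) units. The function names are hypothetical.

```python
import math


def critical_angle_rad(let: float, let_c: float) -> float:
    """Critical angle of Equation (6.1), measured from the surface normal.

    Returns 0.0 when LET >= LET_c, i.e., every angle of incidence can upset.
    """
    if let <= 0.0 or let_c <= 0.0:
        raise ValueError("LET values must be positive")
    if let >= let_c:
        return 0.0
    return math.acos(let / let_c)


def path_length(thickness: float, theta_rad: float) -> float:
    """Path through a layer of the given thickness at incidence angle theta."""
    return thickness / math.cos(theta_rad)


def causes_upset(let: float, let_c: float, theta_rad: float) -> bool:
    """True if a strike at angle theta is at or beyond the critical angle."""
    return theta_rad >= critical_angle_rad(let, let_c)


if __name__ == "__main__":
    theta_c = critical_angle_rad(let=0.5, let_c=1.0)
    print(f"critical angle = {math.degrees(theta_c):.1f} degrees")
    print(f"path length at critical angle = {path_length(1.0, theta_c):.2f} t")
```

With LET equal to half of the critical LET, only strikes tilted at least 60 degrees away from the normal can upset the device, which is the geometric basis for reducing SER through board orientation.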
Particles incident at larger angles therefore traverse more material and are likely to produce more ionization charge.

6.3 Conclusion

We conclude that with a proper understanding of the local ground level particle orientation and energy distribution, if the circuit boards of the server system are appropriately oriented, then the SEU caused by particles with LET smaller than the critical LET would be greatly reduced. However, re-arranging and placing the circuit boards may not totally eliminate SEU.

Chapter 7
Conclusion

With the continuous downscaling of CMOS technologies, reliability is bound to become a major bottleneck for the next generation systems. To meet the system reliability requirements, it is necessary for both circuit designers and test engineers to acquire basic knowledge of the soft errors causing those reliability problems.

In this thesis, we first present a tutorial study of the single event upset phenomenon that is one of the root causes of soft errors. The basic physics of single event upset and the basic radiation types that cause it are presented. We summarize the concepts of the basic radiation mechanisms, i.e., the error producing interactions, in silicon. An overview of the natural terrestrial environment is presented as necessary information to help build an accurate soft error analysis model. Also, soft error mitigation techniques like time and space redundancy, cell hardening and EDAC are illustrated. An industrial design example, the IBM eServer z990 system, shows how the industry is dealing with soft errors these days.

In the second half of this thesis, we present a novel environment-dependent soft error model for logic circuits, based on two parameters: error occurrence rate and soft error transient pulse width density. An error propagation scheme through logic gates is developed that takes electrical masking into account. The SEU pulse width information at the primary outputs can help analyze the effectiveness of time and space redundancy schemes. Our analysis requires signal probabilities.
For any given set of input vectors or signal statistics, these can be obtained either from logic simulation or from static analysis of the circuit topology. For simplicity, we ignore correlations among reconverging signals. If those correlations were considered, some paths may not be activated. Similarly, temporal masking by clock sampling may further increase the masking. Future extensions may incorporate this analysis in logic simulators. That will take signal correlations and temporal effects into consideration.

The SEU transient has two parameters, the probability of error pulse generation and the random width of the pulse if one is generated. The pulse width is represented by a probability density function. Logic gates are modeled with inertial delays. The probability of transmitting an SEU pulse through a gate is determined from the probabilities of other input signals. If transmitted, the SEU pulse width at the output of the gate is a function of the input pulse width and the inertial delay, and is determined from the theory of random functions. A single pass of the circuit determines the SEU statistics of all nodes in the circuit. This analysis is, however, static and is applicable to combinational circuits. It should be extended to sequential circuits in the future.

According to our results, the SER may be highly sensitive to factors like the accuracy of finding sensitive regions in silicon, process variation and circuit technology. Comprehensive studies on them should provide better insights in the future. We suggest that circuit topologies will also have a significant effect on SERs. For example, a narrow circuit like an inverter chain and a wide circuit like a ripple-carry adder will have quite different soft error rates. Effective soft error control requires new cost effective techniques for soft error protection because classical fault-tolerance techniques are very expensive [125].
In addition, we have presented some essential features of soft error considerations for modern web servers and summarized the commonly used industrial fault tolerant techniques. We have proposed a possible SER reduction technique for conventional hardware server systems by considering the angle of incidence when particles strike the circuit. Software based fault tolerance and techniques for network communications, not discussed here, are also important for web server reliability. They require future studies.

Bibliography

[1] "DRAM Soft Error Rate Calculations." Micron Technology. http://download.micron.com/pdf/technotes/DT28.pdf.
[2] "Increasing Network Availability." Cisco Systems. http://www.cisco.com/warp/public/779/largeent/learn/technologies/ina/IncreasingNetworkAvailabilityWhitePaper.pdf.
[3] "Soft Errors a Problem as SRAM Geometries Shrink." Electronics Supply & Manufacturing, 28 Jan 2002. http://www.ebnews.com/story/OEG20020128S0079.
[4] "JEDEC Standard: Measurements and Reporting of Alpha Particles and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices," Technical Report JESD89, Aug. 2001.
[5] "JEDEC Standard: Measurements and Reporting of Alpha Particles and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices," Technical Report JESD89A, 2001. Revision of JESD89.
[6] "Effects of Neutrons on Programmable Logic: White Paper," Technical report, Actel Corporation, Dec. 2002.
[7] "Gate Arrays Wane While Standard Cells Soar: ASIC Market Evolution Continues," Technical report, Semico Research Corporation, Nov. 2002. BusinessWire.
[8] "The Ideal SoC Memory: 1T-SRAM," 2002. http://www.mosys.com/news/idsoc.pdf.
[9] Solar Influences Data Analysis Center (SIDC), 2004. http://sidc.oma.be/index.php3.
[10] "JEDEC Standard: Test Method for Alpha Source Accelerated Soft Error Rate," Technical Report JESD89-2, 2004. Addendum No. 2 to JESD89.
[11] "Soft Errors in Electronic Memory - A White Paper," Technical report, Tezzaron Semiconductor, 2004.
[12] "HP Integrity Nonstop Servers: Ordering and Configuration Guide." HP Data Sheet, 2005. www.hp.com/go/integritynonstop.
[13] "NASA Thesaurus and Information." NASA, 2007. http://www.sti.nasa.gov/thesfrm1.htm.
[14] S. Almukhaizim, Y. Makris, Y. S. Yang, and A. Veneris, "Seamless Integration of SER in Rewiring-Based Design Space Exploration," in Proc. International Test Conference, 2006.
[15] L. Anghel, D. Alexandrescu, and M. Nicolaidis, "Evaluation of A Soft Error Tolerance Technique Based on Time and/or Space Redundancy," in Proc. 13th Symposium on Integrated Circuits and Systems Design (SBCCI'00), 2000, pp. 237–242.
[16] G. Asadi and M. B. Tahoori, "An Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs," in Proc. Military and Aerospace Applications of Programmable Logic Devices (MAPLD), Sept. 2004.
[17] G. Asadi and M. B. Tahoori, "An Accurate SER Estimation Method Based on Propagation Probability," in Proc. Design Automation and Test in Europe Conf, 2005, pp. 306–307.
[18] G. Asadi and M. B. Tahoori, "An Analytical Approach for Soft Error Rate Estimation in Digital Circuits," in Proc. IEEE International Symposium on Circuits and Systems, 2005, pp. 2991–2994.
[19] M. Ashouei, S. Bhattacharya, and A. Chatterjee, "Improving SNR for DSM Linear Systems Using Probabilistic Error Correction and State Restoration: A Comparative Study," in Proc. 11th European Test Symp., 2006, pp. 35–42.
[20] M. Ashouei, S. Bhattacharya, and A. Chatterjee, "Probabilistic Compensation for Digital Filters Using Pervasive Noise-Induced Operator Errors," in Proc. 25th IEEE VLSI Test Symp., 2007, pp. 125–130.
[21] A. Avizienis, "Fault-Tolerant Computing: An Overview," IEEE Trans. Computers, vol. 4, no. 1, pp. 5–8, 1971.
[22] A. Avizienis, "Toward Systematic Design of Fault-Tolerant Systems," Computer, vol. 30, no. 4, pp. 51–58, Apr. 1997.
[23] A.
Avizienis, "The Hundred Year Spacecraft," in Proc. First NASA/DoD Workshop on Evolvable Hardware, 1999, pp. 233–239.
[24] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, "The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design," IEEE Trans. Computers, vol. C-20, no. 11, pp. 1312–1321, Nov. 1971.
[25] J. Barak, R. A. Reed, K. A. LaBel, N. Center, and M. D. Greenbelt, "On the Figure of Merit Model for SEU Rate Calculations," IEEE Trans. Nuclear Science, vol. 46, no. 6, pp. 1504–1510, 1999.
[26] R. Baumann, "Soft Errors in Advanced Semiconductor Devices-Part 1: The Three Radiation Sources," IEEE Trans. Device and Materials Reliability, vol. 1, no. 1, pp. 17–22, 2001.
[27] R. Baumann, "The Impact of Technology Scaling on Soft Error Rate Performance and Limits to the Efficacy of Error Correction," in Proc. International Electron Devices Meeting, IEDM'02, 2002, pp. 329–332.
[28] R. Baumann, "Technology Scaling Trends and Accelerated Testing for Soft Errors in Commercial Silicon Devices," in Proc. 9th On-Line Testing Symposium, 2003, p. 4.
[29] R. Baumann, "Soft Errors in Commercial Integrated Circuits," International Jour. High Speed Electronics and Systems, vol. 14, no. 2, pp. 299–309, 2004.
[30] R. Baumann, "Soft Errors in Advanced Computer Systems," IEEE Design & Test of Computers, vol. 22, no. 3, pp. 258–266, 2005.
[31] R. Baumann, T. Hossain, S. Murata, and H. Kitagawa, "Boron Compounds as a Dominant Source of Alpha Particles in Semiconductor Devices," in Proc. 33rd Annual Reliability Physics Symposium, 1995, pp. 297–302.
[32] M. J. Bellido-Diaz, J. Juan-Chico, A. J. Acosta, M. Valencia, and J. L. Huertas, "Logical modeling of delay degradation effect in static CMOS gates," IEE Proc. Circuits, Devices and Systems.
[33] D. Binder, E. C. Smith, and A. B. Holman, "Satellite Anomalies From Galactic Cosmic Rays," IEEE Trans. Nuclear Science, vol. 22, pp.
2675–2680, Dec. 1975.
[34] D. C. Bossen, "CMOS Soft Errors and Server Design," in IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, April 7, 2002, pp. 121 07.1–121 07.6.
[35] D. C. Bossen, A. Kitamorn, K. F. Reick, and M. S. Floyd, "Fault-Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology," IEEE Trans. Device and Materials Reliability, vol. 46, no. 1, pp. 77–86, 2002.
[36] D. Bradley and A. Tyrrell, "A hardware immune system for benchmark state machine error detection," in Proc. Congress on Evolutionary Computation, CEC'02, volume 1, 2002, pp. 813–818.
[37] D. W. Bradley and A. M. Tyrrell, "Hardware Fault Tolerance: An Immunological Solution," in Proc. IEEE International Conf. Systems, Man, and Cybernetics, volume 1, 2000, pp. 107–112.
[38] M. A. Breuer, "Testing for Intermittent Faults in Digital Circuits," IEEE Trans. Computers, vol. C-22, no. 3, pp. 241–246, 1973.
[39] S. Buchner, D. McMorrow, J. Melinger, and A. B. Campbell, "Laboratory Tests for Single-Event Effects," IEEE Trans. Nuclear Science, vol. 43, no. 2, pp. 678–686, 1996.
[40] D. Burnett, C. Lage, and A. Bormann, "Soft-Error-Rate Improvement in Advanced BiCMOS SRAMs," in Proc. 31st Annual IEEE Reliability Physics Symp., Mar. 1993, pp. 156–160.
[41] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Memory & Mixed-Signal VLSI Circuits. Boston: Springer, 2000.
[42] C. Carmichael, "Triple Module Redundancy Design Techniques for Virtex Series FPGA," Xilinx Application Notes, vol. 197, 2001.
[43] V. Carreno, G. Choi, and R. K. Iyer, "Analog-digital simulation of transient-induced logic errors and upset susceptibility of an advanced control system," in NASA Technical Memo 4241, 1990.
[44] A. Cataldo, "IBM Moves to Protect DRAM from Cosmic Invaders." EE Times, 10 June 1998. http://www.eetimes.com/news/98/1012news/ibm.html.
[45] A. Cataldo, "MoSys, iRoC Target IC Error Protection." EE Times, 6 Feb. 2002.
http://www.eetimes.com/story/OEG20020206S0026.
[46] S. T. Chakradhar, S. Kanjilal, and V. D. Agrawal, "Finite State Machine Synthesis with Fault Tolerant Test Function," Jour. Electronic Testing: Theory and Applications, vol. 4, no. 1, pp. 57–69, 1993.
[47] P. K. Chande, A. K. Ramani, and P. C. Sharma, "Modular TMR Multiprocessor System," IEEE Trans. Industrial Electronics, vol. 36, no. 1, pp. 34–41, 1989.
[48] Z. Chaohuang, N. Saxena, and E. J. McCluskey, "Finite State Machine Synthesis with Concurrent Error Detection," in Proc. International Test Conf., 1999, pp. 672–679.
[49] D. L. Chenette, J. Chen, E. Clayton, T. G. Guzik, J. P. Wefel, M. Garcia-Munoz, C. Lopate, K. R. Pyle, K. P. Ray, E. G. Mullen, and D. A. Hardy, "The CRRES/SPACERAD Heavy Ion Model of the Environment (CHIME) for Cosmic Ray and Solar Particle Effects on Electronic and Biological Systems in Space," IEEE Trans. Nuclear Science, vol. 41, no. 6, pp. 2332–2339, 1994.
[50] C. L. Claeys and E. Simoen, Radiation Effects in Advanced Semiconductor Materials and Devices. Springer, 2002.
[51] N. Cohen, T. S. Sriram, N. Leland, D. Moyer, S. Butler, and R. Flatley, "Soft Error Considerations for Deep-Submicron CMOS Circuit Applications," IEEE Trans. Nuclear Science, vol. 44, no. 6, pp. 315–318, 1999.
[52] C. Croarkin, P. Tobias, and C. Zey, Engineering Statistics Handbook. NIST and SEMATECH, USA, 2001.
[53] W. R. Dawes, Radiation Effects Hardening Techniques. IEEE NSREC Short Course, Monterey, CA, 1985.
[54] F. G. de Lima, E. Cota, L. Carro, M. Lubaszewski, R. Reis, R. Velazco, and S. Rezgui, "Designing a Radiation Hardened 8051-Like Micro-Controller," in Proc. 13th Symposium on Integrated Circuits and Systems Design, 2000, pp. 255–260.
[55] F. G. de Lima, G. Neuberger, R. F. Hentschke, L. Carro, and R. Reis, "Designing Fault-Tolerant Techniques for SRAM-Based FPGAs," IEEE Design & Test Computers, vol. 21, no. 6, pp. 552–562, 2004.
[56] V. Degalahal, R. Ramanarayanan, N. Vijaykrishnan, Y.
Xie, and M. J. Irwin, "The Effect of Threshold Voltages on the Soft Error Rate [Memory and Logic Circuits]," in Proc. 5th International Symposium on Quality Electronic Design, 2004, pp. 503–508.
[57] C. Detcheverry, C. Dachs, E. Lorfevre, C. Sudre, G. Bruguier, J. M. Palau, J. Gasiot, and R. Ecoffet, "SEU Critical Charge and Sensitive Area in a Submicron CMOS Technology," IEEE Trans. Nuclear Science, vol. 44, no. 6, pp. 2266–2273, 1997.
[58] A. Dharchoudhury, S. M. Kang, H. Cha, and J. H. Patel, "Fast Timing Simulation of Transient Faults in Digital Circuits," in Proc. IEEE/ACM International Conference on Computer-Aided Design, 1994, pp. 719–722.
[59] Y. S. Dhillon, A. U. Diril, A. Chatterjee, and A. D. Singh, "Sizing CMOS Circuits for Increased Transient Error Tolerance," in Proc. 10th IEEE International On-Line Testing Symp., 2004, pp. 11–16.
[60] J. D. Dirk, M. E. Nelson, J. F. Ziegler, A. Thompson, and T. H. Zabel, "Terrestrial Thermal Neutrons," IEEE Trans. Nuclear Science, vol. 50, no. 6, pp. 2060–2064, 2003.
[61] P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Trans. Nuclear Science, vol. 50, no. 3, pp. 583–602, 2003. 0018-9499.
[62] P. E. Dodd, F. W. Sexton, G. L. Hash, M. R. Shaneyfelt, B. L. Draper, A. J. Farino, and R. S. Flores, "Impact of Technology Trends on SEU in CMOS SRAMs," IEEE Trans. Nuclear Science, vol. 43, no. 6, pp. 2797–2804, Dec. 1996.
[63] P. E. Dodd, M. R. Shaneyfelt, J. A. Felix, and J. R. Schwank, "Production and Propagation of Single-Event Transients in High-Speed Digital Logic ICs," IEEE Trans. Nuclear Science, vol. 51, no. 6, Part 1, pp. 3278–3284, 2004.
[64] P. Elakkumanan, K. Prasad, and R. Sridhar, "Time Redundancy Based Scan Flip-Flop Reuse To Reduce SER Of Combinational Logic," in Proc. 7th International Symposium on Quality Electronic Design, 2006, pp. 617–624.
[65] M. L. Fair, C. R. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N.
Modi, F. Freier, W. Fischer, and N. E. Weber, "Reliability, Availability, and Serviceability (RAS) of the IBM eServer z 990," IBM Jour. Res. & Dev., vol. 48, no. 3, pp. 519–534, 2004.
[66] M. Favalli and C. Metra, "Online testing approach for very deep-submicron ICs," IEEE Design & Test of Computers, vol. 19, no. 2, pp. 16–23, 2002.
[67] W. Feng, X. Yuan, R. Rajaraman, and B. Vaidyanathan, "Soft Error Rate Analysis for Combinational Logic Using An Accurate Electrical Masking Model," in Proc. 20th International Conf. VLSI Design, 2007, pp. 165–170.
[68] L. B. Freeman, "Critical Charge Calculations for a Bipolar SRAM Array," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 119–129, 1996.
[69] A. D. Friedman, "Fault Detection in Redundant Circuits," IEEE Trans. Electronic Computers, vol. EC-16, no. 1, pp. 99–100, 1967.
[70] T. K. Gaisser, Cosmic Rays and Particle Physics. Cambridge University Press, 1990.
[71] L. J. Goldhammer, "Recent Solar Flare Activity and Its Effect on In Orbit Solar Array," in Proc. IEEE Photovoltaic Specialists Conf., volume 2, (Kissimmee, Florida), 1990, pp. 1241–1248.
[72] M. S. Gordon, P. Goldhagen, K. P. Rodbell, T. H. Zabel, H. H. K. Tang, J. M. Clem, and P. Bailey, "Measurement of the Flux and Energy Spectrum of Cosmic-Ray Induced Neutrons on the Ground," IEEE Trans. Nuclear Science, vol. 51, no. 6, pp. 3427–3434, 2004.
[73] G. Groeseneken, R. Degraeve, B. Kaczer, and P. Roussel, "Recent Trends in Reliability Assessment of Advanced CMOS Technologies," in Proc. International Conf. Microelectronic Test Structures, 2005, pp. 81–88.
[74] C. S. Guenzer, E. A. Wolicki, and R. G. Allas, "Single Event Upset of Dynamic RAMs by Neutrons and Protons," IEEE Trans. Nuclear Science, vol. 26, pp. 5048–5052, Dec. 1979.
[75] C. N. Hadjicostis and G. C. Verghese, "Coding Approaches to Fault Tolerance in Linear Dynamic Systems," IEEE Trans. Information Theory, vol. 51, no. 1, pp. 210–228, 2005.
[76] S. Hareland, J. Maiz, M. Alavi, K. Mistry, and S.
Walsta, "Impact of CMOS Process Scaling and SOI on the Soft Error Rates of Logic Processes," 2001, pp. 73–74.
[77] G. Harling, "Embedded DRAM Has a Home in the Network Processing World." Integrated System Design, 3 August 2001. http://www.eedesign.com/isd/OEG20010803S0026.
[78] K. J. Hass and J. W. Ambles, "Single Event Transients in Deep Submicron CMOS," in Proc. 42nd IEEE Midwest Symp. on Circuits and Systems, volume 1, 1999, pp. 122–125.
[79] C. Hawkins, K. Baker, K. M. Butler, J. Fiquera, M. Nicolaidis, V. B. Rao, R. Roy, and T. Welsher, "IC Reliability and Test: What Will Deep Submicron Bring?," IEEE Design & Test of Computers, vol. 16, no. 2, pp. 84–91, 1999.
[80] J. P. Hayes, I. Polian, and B. Becker, "An Analysis Framework for Transient-Error Tolerance," in Proc. 25th IEEE VLSI Test Symp., 2007, pp. 249–255.
[81] P. Hazucha, T. Karnik, S. Walstra, B. A. Bloechel, J. W. Tschanz, J. Maiz, K. Soumyanath, G. E. Dermer, S. Narendra, and V. De, "Measurements and Analysis of SER-Tolerant Latch in a 90-nm Dual-VT CMOS process," IEEE Jour. Solid-State Circuits, vol. 39, no. 9, pp. 1536–1543, 2004.
[82] P. Hazucha and C. Svensson, "Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Trans. Nuclear Science, vol. 47, no. 6, pp. 2586–2594, 2000.
[83] P. Hazucha and C. Svensson, "Optimized Test Circuits for SER Characterization of a Manufacturing Process," IEEE Jour. Solid-State Circuits, vol. 35, no. 2, pp. 142–148, 2000.
[84] P. Hazucha, C. Svensson, and S. A. Wender, "Cosmic-Ray Soft Error Rate Characterization of A Standard 0.6-µm CMOS Process," IEEE Jour. Solid-State Circuits, vol. 35, no. 10, pp. 1422–1429, 2000.
[85] W. F. Heidergott, "System Level Single Event Upset Mitigation Strategies," International Journal of High Speed Electronics and Systems, vol. 14, no. 2, pp. 341–352, 2004.
[86] T. Heijmen and A. Nieuwland, "Soft-Error-Rate Testing of Deep-Submicron Integrated Circuits," in Proc.
Eleventh IEEE European Test Symposium, ETS'06, 2006, pp. 247–252.
[87] J. L. Hennessy and D. A. Patterson, Computer Organization and Design: the Hardware/Software Interface. San Francisco, California: Morgan Kaufmann, 1997.
[88] K. E. Holbert, "Single Event Upsets." Arizona State University. http://www.eas.asu.edu/~holbert/eee460/see.html.
[89] A. Holmes-Siedle and L. Adams, Handbook of Radiation Effects. Oxford University Press, Nov. 1993. Review by M. V. Davis in Radiation Protection Journal, vol. 67, no. 5.
[90] A. G. Holmes-Siedle, A. K. Ward, R. Bull, N. Blower, and L. Adams, "The Meteosat-3 Dosimeter Experiment: Observation of Radiation Surges During Solar Flares in Geostationary Orbit," in Proc. ESA Space Environment Analysis Workshop, ESTEC, 1990. ESTEC Report No. WPP-23.
[91] M. Hosseinabady, P. Lotfi-Kamran, G. Di Natale, S. Di Carlo, A. Benso, and P. Prinetto, "Single-Event Upset Analysis and Protection in High Speed Circuits," in Proc. 11th IEEE European Test Symp., 2006, pp. 29–34.
[92] C. Hsieh, P. Murley, and R. O'Brien, "Dynamics of Charge Collection from Alpha-Particle Tracks in Integrated Circuits," IEEE IRPS, p. 38, 1981.
[93] S. H. Hwang and G. Choi, "Soft-Error Testing of COTS DRAM Components," in Proc. Autotestcon, 1999, pp. 821–827.
[94] B. Ingols and A. Rambaud, "iRoC Releases Robust SPARC Test Report," 2002. http://www.us.design-reuse.com/news/news65.html.
[95] S. K. Jain and V. D. Agrawal, "Statistical Fault Analysis," IEEE Design & Test of Computers, vol. 2, no. 1, pp. 38–44, 1985.
[96] R. Jasinski, "Fault-Tolerance Techniques for SRAM-Based FPGAs," The Computer Journal, vol. 50, no. 2, p. 248, 2007.
[97] N. K. Jha and S. J. Wang, "Design and Synthesis of Self-Checking VLSI Circuits," IEEE Trans. CAD, vol. 12, no. 6, pp. 878–887, 1993.
[98] A. Johnston, "Scaling and technology issues for soft error rates," in Proc. 4th Annual Research Conference on Reliability, (Stanford University), 2000.
[99] R. Karri and M.
Nicolaidis, "Online VLSI Testing," IEEE Design & Test of Computers, vol. 15, no. 4, pp. 12–16, 1998.
[100] F. L. Kastensmidt, L. Carro, and R. Reis, Fault-Tolerance Techniques for SRAM-Based FPGAs, volume 32 of Frontiers in Electronic Testing. Springer, 2006.
[101] S. M. Kia and S. Parameswaran, "Designs for Self Checking Flip-Flops," in Proc. IEEE Computers and Digital Techniques, volume 145, 1998, pp. 81–88.
[102] W. A. Kolasinski, J. B. Blake, J. K. Anthony, W. E. Price, and E. C. Smith, "Simulation of Cosmic Ray Induced Soft Errors and Latchup in Integrated Circuit Computer Memories," IEEE Trans. Nuclear Science, vol. NS-26, p. 5087, 1979.
[103] J. M. Kolyer and D. E. Watson, ESD from A to Z: Electrical Discharge. New York: Van Nostrand Reinhold, 1990.
[104] J. Kumar, J. Kumar, and M. B. Tahoori, "A low power soft error suppression technique for dynamic logic," in M. B. Tahoori, editor, Proc. 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT 2005, 2005, pp. 454–462.
[105] K. A. LaBel, C. E. Marshall, P. W. Marshall, C. J. Johnston, A. H. Reed, R. A. Barth, J. L. Seidleck, C. M. Kayali, and S. A. O. Bryan, "A Roadmap for NASA's Radiation Effects Research in Emerging Microelectronics and Photonics," in Proc. IEEE Aerospace Conference, volume 5, 2000.
[106] K. L. LaBel, P. W. Marshall, J. L. Barth, E. Stassinopoulos, C. Seidleck, and C. Dale, "Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment," in Proc. IEEE Aerospace Applications, (New York), 1996, pp. 375–390.
[107] X. Li, K. Shen, M. C. Huang, and L. Chu, "A Memory Soft Error Measurement on Production System," in USENIX Annual Technical Conference, 2007, p. 6.
[108] F. Lima, S. Rezgui, E. Cota, L. Carro, M. Lubaszewski, R. Velazco, and R. Reis, "Designing and Testing a Radiation Hardened 8051-Like Micro-Controller," in Proc. MAPLD Conf., 2000.
[109] C. A. Lisboa, M. I. Erigson, and L.
Carro, "System Level Approaches for Mitigation of Long Duration Transient Faults in Future Technologies," Technology (nm), vol. 180, no. 130, p. 90, 2007.
[110] R. E. Lyons and W. Vanderkulk, "The Use of Triple-Modular Redundancy to Improve Computer Reliability," IBM Jour. Res. & Dev., vol. 6, no. 2, pp. 200–209, 1962.
[111] J. M. Maclaren and T. Majni, "Hard/Soft Error Detection," U. S. Patent 6,711,703, Mar. 23, 2004.
[112] A. Maheshwari, I. Koren, and W. Burleson, "Accurate Estimation of Soft Error Rate (SER) in VLSI Circuits," in Proc. 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT 2004, 2004, pp. 377–385.
[113] J. Maiz and N. Seifert, "Introduction to the Special Issue on Soft Errors and Data Integrity in Terrestrial Computer Systems," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 303–304, Sept. 2005.
[114] W. Maly, "Realistic Fault Modeling for VLSI Testing," in Proceedings of the 24th ACM/IEEE Design Automation Conference, 1987, pp. 173–180.
[115] F. P. Mathur and A. Avizienis, "Reliability Analysis and Architecture of a Hybrid-Redundant Digital System: Generalized Triple Modular Redundancy with Self-Repair," in AFIPS Conference Proceedings, Spring 1970 Joint Computer Conference, volume 36, 1970, pp. 375–83.
[116] D. G. Mavis and P. H. Eaton, "Soft Error Rate Mitigation Techniques for Modern Microcircuits," in Proc. 40th Annual Reliability Physics Symposium, 2002, pp. 216–225.
[117] T. C. May, "Soft Errors in VLSI: Present and Future," IEEE Trans. Components, Hybrids, and Manufacturing Technology, vol. 2, no. 4, pp. 377–387, 1979.
[118] T. C. May, D. L. Crook, R. A. Gralian, D. W. Reininger, and R. C. Smith, "Soft Error Testing," in Proc. International Test Conference, 1980, pp. 137–150.
[119] T. C. May and M. H. Woods, "A New Physical Mechanism for Soft Errors in Dynamic Memories," in Proc. 16th Annual Reliability Physics Symp., 1978, pp. 33–40.
[120] T. C. May and M. H.
Woods, "Alpha-particle-induced soft errors in dynamic memories," IEEE Trans. Electron Devices, vol. 26, no. 1, pp. 2–9, 1979.
[121] P. J. Meaney, S. B. Swaney, P. N. Sanda, and L. Spainhower, "IBM z990 Soft Error Detection and Recovery," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 419–427, 2005.
[122] G. C. Messenger, "Collection of Charge on Junction Nodes from Ion Tracks," IEEE Trans. Nuclear Science, vol. 29, no. 6, pp. 2024–2031, 1982.
[123] G. C. Messenger and M. Ash, Single Event Phenomena. Chapman & Hall, 1997.
[124] N. Miskov-Zivanov and D. Marculescu, "Circuit Reliability Analysis Using Symbolic Techniques," IEEE Trans. CAD, vol. 25, no. 12, pp. 2638–2649, Dec. 2006.
[125] S. Mitra, T. Karnik, N. Seifert, and M. Zhang, "Logic Soft Errors in Sub-65nm Technologies Design and CAD Challenges," in Proc. 42nd Design Automation Conf., 2005, pp. 2–4.
[126] S. Mitra, Z. Ming, N. Seifert, T. M. Mak, and K. Kee Sup, "Soft Error Resilient System Design through Error Correction," in Proc. 2006 IFIP International Conference on Very Large Scale Integration, 2006, pp. 332–337.
[127] S. Mitra, Z. Ming, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, "Combinational Logic Soft Error Correction," in Proc. International Test Conference, 2006, pp. 1–9.
[128] S. S. Mitra, N. Kee, and S. Kim, "Robust System Design with Built-In Soft-Error Resilience," IEEE Design & Test Computers, vol. 38, no. 2, pp. 43–52, 2005.
[129] K. Mohanram, "Closed-form simulation and robustness models for SEU-tolerant design," in Proc. 23rd IEEE VLSI Test Symposium, 2005, pp. 327–333.
[130] K. Mohanram and N. A. Touba, "Cost-Effective Approach for Reducing Soft Error Failure Rate in Logic Circuits," in Proc. International Test Conference, 2003, pp. 893–901.
[131] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The Soft Error Problem: An Architectural Perspective," in Proc. of the International Symposium on High-Performance Computer Architecture, 2005.
[132] O.
Musseau, "Single-Event Effect in SOI Technologies and Devices," IEEE Trans. Nuclear Science, vol. 43, no. 2, pp. 603–613, 1996.
[133] H. T. Nguyen and Y. Yagil, "A Systematic Approach to SER Estimation and Solutions," in Proc. 41st Annual IEEE International Reliability Physics Symposium, 2003, pp. 60–70.
[134] H. T. Nguyen, Y. Yagil, N. Seifert, and M. Reitsma, "Chip-Level Soft Error Estimation Method," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 365–381, 2005.
[135] M. Nicolaidis, "Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies," in Proc. 17th IEEE VLSI Test Symposium, 1999, pp. 86–94.
[136] M. Nicolaidis, "Design for Soft Error Mitigation," IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, pp. 405–418, 2005.
[137] M. Nicolaidis and Y. Zorian, "On-Line Testing for VLSI – Compendium of Approaches," Journal of Electronic Testing: Theory and Applications, Special issue on On-line testing, vol. 12, no. 1–2, pp. 7–20, 1998.
[138] A. K. Nieuwland, S. Jasarevic, and G. Jerin, "Combinational Logic Soft Error Analysis and Protection," in Proceedings of the 12th IEEE International Symposium on On-Line Testing (IOLTS), 2006, pp. 99–104.
[139] E. Normand, "Single-Event Effects in Avionics," IEEE Trans. Nuclear Science, vol. 43, no. 2, pp. 461–474, 1996.
[140] E. Normand, "Single Event Upset at Ground Level," IEEE Trans. Nuclear Science, vol. 43, no. 6, pp. 2742–2750, 1996.
[141] E. Normand and T. J. Baker, "Altitude and Latitude Variations in Avionics SEU and Atmospheric Neutron Flux," IEEE Trans. Nuclear Science, vol. 40, no. 6, pp. 1484–1490, 1993.
[142] T. J. O'Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, "Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 41–50, 1996.
[143] P. Oikonomakos and M.
Zwolinski, "Foundation of Combined Datapath and Controller Self-Checking Design," in Proc. 9th IEEE On-Line Testing Symp., 2003, pp. 30–34.
[144] M. Omana, G. Papasso, D. Rossi, and C. Metra, "A Model for Transient Fault Propagation in Combinatorial Logic," in Proc. 9th IEEE On-Line Testing Symp., 2003, pp. 111–115.
[145] H. H. Ottesen and G. J. Smith, "Method and Apparatus for Limiting Soft Error Recovery in A Disk Drive Data Storage Device," U. S. Patent 6,631,493 B2, Oct. 7, 2003.
[146] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1965.
[147] D. A. Patterson, G. Gibson, and H. K. Randy, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Readings in Database Systems, 2005.
[148] E. L. Petersen, "The SEU figure of merit and proton upset rate calculations," Nuclear Science, IEEE Transactions on, vol. 45, no. 6, pp. 2550–2562, 1998.
[149] J. C. Pickel, Single Event Upset Mechanisms and Predictions. IEEE NSREC Short Course, Gatlinburg, IEEE, New York, 1983.
[150] J. C. Pickel, "Single-Event Effects Rate Prediction," IEEE Trans. Nuclear Science, vol. 43, no. 2, pp. 483–495, 1996.
[151] S. J. Piestrak, "Self-Checking Design in Eastern Europe," IEEE Design & Test of Computers, vol. 13, no. 1, pp. 16–25, 1996.
[152] D. Qian, L. Rong, and X. Yuan, "Impact of Process Variation on Soft Error Vulnerability for Nanometer VLSI Circuits," in Proc. 6th International Conference on ASIC, volume 2, 2005, pp. 1117–1121.
[153] R. Rajaraman, J. S. Kim, N. Vijaykrishnan, Y. Xie, and M. J. Irwin, "SEAT-LA: A Soft Error Analysis Tool for Combinational Logic," in Proc. 19th International Conference on VLSI Design, 2006, pp. 499–502.
[154] K. Ramakrishnan, R. Rajaraman, S. Suresh, N. Vijaykrishnan, Y. Xie, and M. J. Irwin, "Variation Impact on SER of Combinational Circuits," in Proc. 8th International Symposium on Quality Electronic Design, 2007, pp. 911–916.
[155] R. Ramanarayanan, V. Degalahal, N. Vijaykrishnan, M. J. Irwin, and D.
Duarte, "Analysis of Soft Error Rate in Flip-Flops and Scannable Latches," in Proc. IEEE International SOC (Systems-on-Chip) Conference, 2003, pp. 231–234.
[156] R. R. Rao, K. Chopra, D. Blaauw, and D. Sylvester, "An Efficient Static Algorithm for Computing the Soft Error Rates of Combinational Circuits," in Proc. Design, Automation and Test in Europe Conf., 2006, pp. 164–169.
[157] B. G. Rax, A. H. Johnston, and C. I. Lee, "Proton Damage Effects in Linear Integrated Circuits," IEEE Trans. Nuclear Science, vol. 45, no. 6, pp. 2632–2637, 1998.
[158] R. A. Reed, R. A. Reed, P. J. McNulty, and W. G. Abdel-Kader, "Implications of angle of incidence in seu testing of modern circuits," IEEE Trans. Nuclear Science, vol. 41, no. 6, pp. 2049–2054, 1994. 0018-9499.
[159] T. Rejimon and S. Bhanja, "An Accurate Probabilistic Model for Error Detection," in Proc. 18th International Conference on VLSI Design, 2005, pp. 717–722.
[160] T. Rejimon and S. Bhanja, "Probabilistic Error Model for Unreliable Nano-Logic Gates," in Proc. Sixth IEEE Conference on Nanotechnology, volume 1, 2006, pp. 47–50.
[161] D. Rossi, M. Omana, F. Toma, and C. Metra, "Multiple Transient Faults in Logic: An Issue for Next Generation ICs?," in Proc. 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2005, pp. 352–360.
[162] K. Roy, S. Kundu, R. Galivanche, V. Narayanan, R. Raina, and P. N. Sanda, "Is the Concern for Soft-Error Overblown?," in Proc. International Test Conf. (Panel Discussion), 2005.
[163] Y. Savaria, J. F. Hayes, N. C. Rumin, and V. K. Agarwal, "Theory for the Design of Soft-Error-Tolerant VLSI Circuits," IEEE Jour. Selected Areas in Communications, Jan. 1986.
[164] R. D. Schrimpf and D. M. Fleetwood, editors, Radiation Effects and Soft Errors in Integrated Circuits and Electronic Devices, volume 34 of Selected Topics in Electronics and Systems. World Scientific, 2004.
[165] N. Seifert and N. Tam, "Timing Vulnerability Factors of Sequentials," IEEE Trans.
Device and Materials Reliability, vol. 4, no. 3, pp. 516–522, 2004.
[166] S. A. Seshia, L. Wenchao, and S. Mitra, "Verification-Guided Soft Error Resilience," in Proc. Design, Automation and Test in Europe Conf., 2007, pp. 1–6.
[167] S. C. Seth and V. D. Agrawal, "A New Model for Computation of Probabilistic Testability in Combinational Circuits," Integration, the VLSI Jour., vol. 7, no. 1, pp. 49–75, 1989.
[168] K. G. Shin and K. Hagbae, "A Time Redundancy Approach to TMR Failures Using Fault-State Likelihoods," IEEE Trans. Computers, vol. 43, no. 10, pp. 1151–1162, 1994.
[169] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," in Proc. International Conference on Dependable Systems and Networks, 2002, pp. 389–398.
[170] M. Smith, "The Ideal SoC Memory: 1T-SRAM." Computer & Information Science Dept., Linkopings Univ., Sweden, Apr. 1998. http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html.
[171] M. J. S. Smith, Application-Specific Integrated Circuits. Reading, Massachusetts: Addison-Wesley, 1997.
[172] G. R. Srinivasan, "Modelling the Cosmic Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview," Microelectronics Reliability, vol. 37, no. 4, pp. 691–691, 1997.
[173] G. R. Srinivasan, P. C. Murley, and H. K. Tang, "Accurate, Predictive Modeling of Soft Error Rate Due to Cosmic Rays and Chip Alpha Radiation," in Proc. 32nd Annual IEEE International Reliability Physics Symposium, 1994, pp. 12–16.
[174] J. R. Srour, C. J. Marshall, and P. W. Marshall, "Review of Displacement Damage Effects in Silicon Devices," IEEE Trans. Nuclear Science, vol. 50, pp. 653–670, 2003.
[175] A. K. Sutton, Displacement Damage and Ionization Effects in Advanced Silicon-Germanium Heterojunction Bipolar Transistors. PhD thesis, Georgia Institute of Technology, 2005.
[176] H. H. K.
Tang, "Nuclear Physics of Cosmic Ray Interaction with Semiconductor Materials: Particle-Induced Soft Errors from a Physicist's Perspective," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 91–108, 1996.
[177] N. A. Touba and E. J. McCluskey, "Logic Synthesis of Multilevel Circuits with Concurrent Error Detection," IEEE Trans. CAD, vol. 16, no. 7, pp. 783–789, 1997.
[178] A. J. Tylka, J. H. Adams Jr, P. R. Boberg, B. Brownstein, W. F. Dietrich, E. O. Flueckiger, E. L. Petersen, M. A. Shea, D. F. Smart, and E. C. Smith, "CREME96: A Revision of the Cosmic Ray Effects on Micro-Electronics Code," IEEE Trans. Nuclear Science, vol. 44, no. 6, pp. 2150–2160, Dec. 1997.
[179] A. J. Tylka, W. F. Dietrich, P. R. Boberg, E. C. Smith, and J. H. Adams Jr, "Single Event Upsets Caused by Solar Energetic Heavy Ions," IEEE Trans. Nuclear Science, vol. 43, no. 6 Part 1, pp. 2758–2766, 1996.
[180] A. J. van de Goor, Testing Semiconductor Memories: Theory and Practice. Wiley, 1991.
[181] J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components (1959)," in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963, pp. 329–378.
[182] J. F. Wakerly, "Microcomputer Reliability Improvement Using Triple-Modular Redundancy," Proc. IEEE, vol. 64, no. 6, pp. 889–895, 1976.
[183] J. T. Wallmark and S. M. Marcus, "Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices," Proc. IRE, vol. 50, pp. 286–298, Mar. 1962.
[184] S. V. Walstra and D. Changhong, "Circuit-Level Modeling of Soft Errors in Integrated Circuits," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 358–364, 2005.
[185] F. Wang and V. D. Agrawal, "Probabilistic Soft Error Rate Determination from Statistical SEU Parameters," in Proc. 17th IEEE North Atlantic Test Workshop, May 2008.
[186] F. Wang and V. D.
Agrawal, "Single Event Upset: An Embedded Tutorial," in Proc. 21st International Conference on VLSI Design, Jan. 2008, pp. 429–434.
[187] F. Wang and V. D. Agrawal, "Soft Error Rate Determination for Nanometer CMOS VLSI Circuits," in Proc. 40th Southeastern Symposium on System Theory, Mar. 2008, pp. 324–328.
[188] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of A High-Performance Microprocessor," in Proc. 31st Annual International Symposium on Computer Architecture, 2004, pp. 264–275.
[189] P. S. Winokur, G. K. Lum, M. R. Shaneyfelt, F. W. Sexton, G. L. Hash, and L. Scott, "Use of COTS Microelectronics in Radiation Environments," IEEE Trans. Nuclear Science, vol. 46, no. 6, pp. 1494–1503, 1999.
[190] G. K. Yeap, Practical Low Power Digital VLSI Design. Boston: Springer, 1998.
[191] M. Yilmaz, D. R. Hower, S. Ozev, and D. J. Sorin, "Self-Checking and Self-Diagnosing 32-bit Microprocessor Multiplier," in Proc. International Test Conference, 2006. Paper 15.1.
[192] M. Zhang, S. Mitra, T. M. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S. Kim, N. R. Shanbhag, and S. J. Patel, "Sequential Element Design With Built-In Soft Error Resilience," IEEE Trans. VLSI Systems, vol. 14, no. 12, pp. 1368–1378, 2006. 1063-8210.
[193] M. Zhang and N. R. Shanbhag, "A Soft Error Rate Analysis (SERA) Methodology," in Proc. IEEE/ACM International Conference on Computer Aided Design, 2004, pp. 111–118.
[194] C. Zhao and S. Dey, "Evaluating and Improving Transient Error Tolerance of CMOS Digital VLSI Circuits," in Proc. International Test Conference, 2006.
[195] C. Zhao, Y. Zhao, and S. Dey, "Constraint-Aware Robustness Insertion for Optimal Noise-Tolerance Enhancement in VLSI Circuits," in Proc. 42nd Design Automation Conference, 2005, pp. 190–195.
[196] Q. Zhou and K. Mohanram, "Cost-Effective Radiation Hardening Technique for Combinational Logic," in Proc. IEEE/ACM International Conf. Computer Aided Design, 2004, pp. 100–106.
[197] Q. Zhou and K. Mohanram, "Gate Sizing to Radiation Harden Combinational Logic," IEEE Trans. on CAD, vol. 25, no. 1, pp. 155–166, 2006.
[198] J. Ziegler and W. Lanford, "The Effect of Sea Level Cosmic Rays on Electronic Devices," in Proc. IEEE Solid-State Circuits Conf., 1980.
[199] J. F. Ziegler, "IBM Experience in Soft Fails in Computer Electronics, 1978-1994," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 3–18, 1996.
[200] J. F. Ziegler, "Terrestrial Cosmic Rays," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 19–39, 1996.
[201] J. F. Ziegler, "Terrestrial Cosmic Ray Intensities," IBM Jour. Res. & Dev., vol. 42, no. 1, 1998.
[202] J. F. Ziegler, "Trends in SER of DRAM Memory Chips," Technical report, 2002. http://www.srim.org/SER/SERTrends.htm.
[203] J. F. Ziegler and W. A. Lanford, "Effect of Cosmic Rays on Computer Memories," Science, vol. 206, no. 4420, pp. 776–788, Nov. 1979.
[204] J. F. Ziegler and H. P. Muhlfeld, "Accelerated Testing for Cosmic Soft-Error Rate," IBM Jour. Res. & Dev., vol. 40, no. 1, pp. 51–63, 1996.

Appendices

Appendix A
Terms and Definitions

These miscellaneous definitions and terms are collected from the JEDEC standards [4, 5] and from relevant papers cited in the bibliography.

AAA: Authentication, authorization, and accounting; a protocol for controlling access to network resources.
BPSG: Borophosphosilicate glass. BPSG is a type of silicate glass that includes additives containing boron and phosphorus. Silicate glasses such as PSG and BPSG are commonly used in semiconductor device fabrication for intermetal layers, i.e., for insulating layers deposited between successive metal or conducting layers.
Collected Charge: The charge collected by a particular device node during the passage of a particle. The collected charge depends on the geometry and doping of the node, on particle properties such as mass, energy, and trajectory, and on the density and type of the material being penetrated by the incident radiation.
Cross Section ($\sigma$): The device SEE response to ionizing radiation. The cross section is normally expressed in $\mathrm{cm}^2/\mathrm{device}$ or $\mathrm{cm}^2/\mathrm{bit}$.
Critical Charge ($Q_{crit}$): The minimum amount of charge that, when collected at any sensitive node, will cause the node to change state. The critical charge is usually generated by incident radiation, and its value depends on the effective linear energy transfer, which is usually a function of the angle of incidence of the particle radiation.
Differential Flux: The time rate of fluence per unit energy, i.e., the rate at which radiation (particle fluence) is incident on a surface, per unit area and per unit energy. The differential flux is usually expressed as the number ($N$) of particles per unit area per unit energy per unit time, e.g., $N/(\mathrm{cm}^2 \cdot \mathrm{MeV} \cdot \mathrm{hr})$. The term differential flux in the JEDEC standard is synonymous with spectral flux density used in other publications.
ECC: Error correction code, sometimes called error detection and correction (EDAC).
Fluence: The total amount of particle radiant energy incident on a surface in a given period of time, divided by the area of the surface. Fluence is usually expressed as the number ($N$) of particles per unit area, e.g., $N/\mathrm{cm}^2$.
Flux Density: The time rate of flow of particle energy emitted from or incident on a surface, divided by the area of that surface. The flux density is usually expressed as the number ($N$) of particles per square centimeter per second ($N/(\mathrm{cm}^2 \cdot \mathrm{s})$) or per square centimeter per hour ($N/(\mathrm{cm}^2 \cdot \mathrm{h})$).
Hard Error: An irreversible change in operation that is typically associated with permanent damage to one or more elements of a device or circuit.
LET: Linear Energy Transfer. LET is a measure of the energy transferred to the device per unit length as an ionizing particle travels through a material. The commonly used unit is $\mathrm{MeV} \cdot \mathrm{cm}^2/\mathrm{mg}$ of material (Si for MOS devices).
LETth: LET threshold ($\mathrm{LET}_{th}$), the minimum LET needed to cause an effect at a given particle fluence.
MEU: Multiple Event Upsets.
MBU: A multiple-bit upset in which two or more error bits occur in the same word. An MBU in memory cannot be corrected by a simple single-bit ECC.
Radiation: Energy emitted in the form of electromagnetic waves or moving nuclear particles. In the present research, the primary concern is ionizing radiation, which includes protons, electrons, alpha particles and nuclear reaction products.
RAID: Redundant Arrays of Inexpensive Disks. RAID is a technology that supports the integrated use of two or more hard drives in various configurations to achieve greater performance, reliability through redundancy, and larger disk volume sizes through aggregation.
SEB: Single Event Burnout. Burnout damage to a power transistor or other high-voltage device caused by a single energetic particle. SEB includes burnout of n-channel power MOSFETs; it can be triggered in a power MOSFET biased in the OFF state when a heavy ion passing through deposits enough charge to turn it on. Both SEL and SEB susceptibilities decrease at higher temperature.
SEE: Single Event Effect. Any measurable or observable change in state or performance of a microelectronic device, component, subsystem or system resulting from a single energetic particle strike. SEEs include SEU, SEL, SEB and SEFI.
SEFI: Single Event Functional Interrupt. A functional interrupt caused by an energetic particle; in more complex parts the malfunction sometimes appears as a lockup, hard error, etc.
SEL: Single Event Latchup. SEL is defined as a condition that causes loss of device functionality due to a single-event-induced current. SEL results in a high operating current, which may drag down the node voltage or damage the power supply. Latchup is caused by heavy ions as well as protons striking the sensitive area of semiconductor devices. SEL can be cleared by a power off-on reset.
Sensitive Volume: A region, or multiple regions, affected by SEE-inducing radiation. The sensitive volume is determined by the angle of the incident radiation, the mass and energy of the incident particles, and the density and type of the material in the volume being penetrated by the incident radiation. The geometry of the sensitive volume of a device is not easy to determine, but some information can be gained from test cross-section data.
SET: Single Event Transient. A current or voltage transient pulse caused by an SEE.
SEU: Single Event Upset. Radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs.
SEGR: Single Event Gate Rupture. SEGR is the destructive burnout of a gate insulator in a power MOSFET.
Soft error, static: A soft error in a memory that cannot be corrected by repeated reading but can be corrected by rewriting without the removal of power.
Soft error, transient: A soft error that can be corrected by repeated reading, without rewriting and without the removal of power.
SER: Soft error rate.
SOI: Silicon on insulator.
TID: Total ionizing dose.

Appendix B Units and Conversion Factors

MTTF: Mean Time to Failure.
MTTR: Mean Time to Repair.
MTBF: Mean Time Between Failures. MTBF = MTTF + MTTR. Availability is defined as MTTF/MTBF.
FIT: Failure in Time; the number of failures per $10^9$ device hours. An MTTF of 1 year corresponds to $10^9/(24 \times 365)$ FIT = 114,155 FIT.
Gray (Gy): 1 gray = 1 joule per kilogram.
rad: A unit of radiation dose. 1 rad = 0.01 gray (Gy) = 0.01 joule of energy absorbed per kilogram of matter.
Hadron: A particle that takes part in the strong interaction, also called the strong nuclear force.
Energy Units:
1. Electron volt (eV). One eV is the energy gained by an electron when accelerated through a potential difference of 1 volt. Radiation energy is usually given in MeV ($10^6$ eV) or keV ($10^3$ eV).
2. Joule (J). 1 eV = $1.6 \times 10^{-19}$ J, 1 MeV = $1.6 \times 10^{-13}$ J.
3. erg. 1 erg = $10^{-7}$ joule.
Electronic charge: $e = 1.602 \times 10^{-19}$ coulomb.
$\mu\mathrm{A\,cm^{-2}}$: $1\ \mu\mathrm{A\,cm^{-2}} = 6.241 \times 10^{12}$ electrons $\mathrm{cm^{-2}\,s^{-1}}$.
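The conversion factors in this appendix can be checked numerically. The following is a minimal Python sketch and not code from the thesis: all function and constant names are hypothetical, the FIT/MTTF relation assumes a constant failure rate (so that FIT = $10^9$ / MTTF in hours, as in the 1-year example), and a 365-day year is used to match the appendix.

```python
# Illustrative helpers for the Appendix B conversions.
# Names are hypothetical; they are not taken from the thesis.

HOURS_PER_YEAR = 24 * 365        # 365-day year, as used in the appendix
FIT_DEVICE_HOURS = 1e9           # FIT counts failures per 10^9 device hours
ELECTRON_CHARGE_C = 1.602e-19    # coulomb, value quoted in the appendix


def mttf_hours_to_fit(mttf_hours: float) -> float:
    """FIT for a given MTTF in hours (assumes a constant failure rate)."""
    return FIT_DEVICE_HOURS / mttf_hours


def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years for a given FIT value (assumes a constant failure rate)."""
    return FIT_DEVICE_HOURS / fit / HOURS_PER_YEAR


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR (same time unit)."""
    return mttf / (mttf + mttr)


def mev_to_joule(energy_mev: float) -> float:
    """Convert an energy from MeV to joules (1 eV = e joules)."""
    return energy_mev * 1e6 * ELECTRON_CHARGE_C


def microamp_per_cm2_to_electron_flux(current_density_ua: float) -> float:
    """Convert a current density in uA/cm^2 to electrons per cm^2 per second."""
    return current_density_ua * 1e-6 / ELECTRON_CHARGE_C


if __name__ == "__main__":
    print(f"1-year MTTF   = {mttf_hours_to_fit(HOURS_PER_YEAR):.0f} FIT")
    print(f"1 MeV         = {mev_to_joule(1.0):.3e} J")
    print(f"1 uA/cm^2     = {microamp_per_cm2_to_electron_flux(1.0):.3e} electrons/cm^2/s")
    print(f"availability  = {availability(99.0, 1.0):.2f}")
```

As a usage example, `fit_to_mttf_years(1.18e3)` converts the ground-level figure of $1.18 \times 10^3$ FIT quoted for C432 in the abstract into an MTTF of roughly 97 years under the same constant-failure-rate assumption. The electron-flux result differs from the tabulated $6.241 \times 10^{12}$ in the fourth significant digit because the sketch uses the rounded charge value $1.602 \times 10^{-19}$ C.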