A Methodology for Increasing the Dependability of Open Source Software Components

by

Patrick Pape

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 5, 2013

Keywords: fault injection, data-flow analysis, dependability, open source

Copyright 2013 by Patrick Pape

Approved by

John A. Hamilton Jr., Chair, Alumni Professor of Computer Science and Software Engineering
David Umphress, Associate Professor of Computer Science and Software Engineering
Xiao Qin, Associate Professor of Computer Science and Software Engineering

Abstract

With the increasing use of open source software components in real-world systems, there has arisen a need for increasing the dependability of these components. The dependability of these components can be increased like that of any other software component: by adding error detection and error recovery mechanisms to the code to deal with unforeseen errors and faults occurring in the system that will be running the software. The major issue is not the implementation of these error handling mechanisms, but the speed with which they can be applied and the choice of where to apply them. The reason for considering an open source solution to a problem is most likely to save time and money by using a solution that may already do what needs to be done. The goal is to incorporate these open source components into a system after verifying and, if need be, increasing the dependability of the component. But open source components are already in a maintenance stage of development, where the code is either refined or new functionality is possibly added. Thus, standard techniques that rely on use during development will take too long to implement. This thesis proposes a novel way of determining the highest priority modules in the system and the most important locations for placing error handling mechanisms in the functions within those modules.

Acknowledgments

I would like to thank my committee members for their patience and guidance throughout the process of completing this thesis. Particular thanks goes out to Dr. Hamilton, who was instrumental in helping me to finalize my topic for this thesis and also in broadening my view as to the possible implications of the research and potential areas for future work. Also, my thanks go out to my colleagues in the Auburn Cyber Research Center, who let me bounce my ideas off of them and gave valuable feedback and suggestions for potential research leads.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Overview
  1.2 Literature Review
    1.2.1 OSSD and CotSD
    1.2.2 Software Dependability
    1.2.3 Fault Injection
    1.2.4 Fault Detection and Correction
    1.2.5 Data-Flow Analysis
    1.2.6 Wrapper-Based Solutions
  1.3 Problem Statement
2 Fault Injection Framework
  2.1 Fault Model
  2.2 Implementation
    2.2.1 Injections
    2.2.2 Probes
    2.2.3 Environment Simulator
3 Methodology
  3.1 Define Critical Path
    3.1.1 Instrumentation Stage One
    3.1.2 Data Flow Analysis
    3.1.3 Determine Priority Functions
  3.2 Identify Locations for Error Handling
    3.2.1 Determine Important Variables
    3.2.2 Instrumentation Stage Two
    3.2.3 Calculate Relative Importance
  3.3 Wrapper-Based Error Handling
    3.3.1 Write-Wrapper
    3.3.2 Read-Wrapper
    3.3.3 Results
4 Experiment
  4.1 Design
  4.2 Phase 1
  4.3 Phase 2
5 Results
6 Conclusion and Future Work
Bibliography

List of Figures

2.1 Prototype for manual integer injection
2.2 Integer injection and probe example
2.3 Function for random bit-flip fault in an integer
2.4 Function for random bit-flip fault in a float
2.5 Prototype for variable probe
2.6 Integer variable probe function
3.1 Overview of Methodology
3.2 Stage one of methodology
3.3 Stage two of methodology
3.4 Stage three of methodology
4.1 Caller side for first critical path
4.2 Called side for first critical path
4.3 Call graph for second critical path
4.4 Call graph for alternate second critical path

List of Tables

3.1 Results for wrapper-based error mechanism testing [1]
3.2 Peak increase % in run-time and memory usage [1]
4.1 Priority of Modules
4.2 High Priority Module - Functions
4.3 Fault injection results for variables for critical path one
4.4 Fault injection results for variables for critical path two
5.1 Number of failed tests found using proposed methodology
5.2 Number of failed tests found using just the local importance
5.3 % difference between the number of failed tests found
5.4 Proposed Methodology vs. Local Cost-Benefit Analysis

List of Abbreviations

CotS    Consumer-Off-the-Shelf
CotSD   Consumer-Off-the-Shelf Development
OSS     Open Source Software
OSSD    Open Source Software Development

Chapter 1
Introduction

1.1 Overview

The use of open source software is becoming more prevalent in real-world projects and software development [34]. There are obvious benefits to using open source software that is readily available, including great savings in both time and money. Many organizations have turned to using open source components in their projects, either to replace more expensive consumer off-the-shelf (CotS) products or to avoid in-house development. But open source software is still limited in its applications in many real-world projects that demand a certain level of security or dependability in their software before it can be used. The issue with open source software is that there is usually no guarantee of the dependability of the components, and there is often just a single developer for the system. It is reasonable to assume that an organization that wishes to use an open source software (OSS) component would do some stress testing or other verification to determine if the component is worth using. But the problem with post-development testing of a software system is the large time and cost investment, especially when the component was developed by an outside source and no one necessarily has experience using it. Testing early and using certain standards during development can be done to avoid this. There is research available for defining key locations for error detection and recovery mechanisms and for ways to ensure software dependability, as discussed in section 1.2.2 and section 1.2.4. The issue with this research is that it is intended to be used early in the development phase and not on the final product.

Consider this situation: a project developing a UAV to reconnoiter a hostile environment finds an open source solution for the ground station software for controlling and receiving data from the UAV. The open source software was completed by a single developer, and the developers of the UAV project determine that the OSS does not meet the criteria for dependability when dealing with mission-critical components.
This means that the project will need to look for another alternative, such as acquiring CotS software or developing the ground station software in house, costing time and money. The methodology proposed in this thesis is for increasing the dependability of OSS components so that they can be used in projects that have higher standards for dependability. A major hurdle to overcome is creating a methodology that is lightweight enough that the benefits of using OSS components remain while the dependability of the software is increased. This methodology utilizes fault injection testing, data-flow analysis and lightweight error detection and recovery mechanisms for increasing software dependability.

A fault injection framework was written in order to obtain the values of various metrics throughout the system. The framework was modeled in part after PROPANE, a propagation analysis environment that utilizes fault injection. PROPANE allows the user to define and instrument fault injection and sampling locations throughout the system and then outputs logs of the tests. These logs have applicable metrics for measuring the dependability of the system, such as failure rates, coverage and latency of the error detection and recovery mechanisms. PROPANE was a project completed by the DEEDS (Dependable Embedded Systems and Software) research group [3]. The tool is outdated, having not been updated since 2006. In order to leave room for future work and for expanding the capabilities of the methodology, a new framework was written.

For the proposed methodology, the chosen use cases should be indicative of the real-world intended use of the system. Because the testing is to be done late stage, there most likely is not time to fully test every capability of the system. The target source code is instrumented in order to determine the frequency with which modules, and the functions within those modules, are called during the intended use cases. Once the call frequency for the modules is determined, modules with outlying frequencies, such as zero calls or a small fraction of calls compared to other modules, are disqualified from being priority modules. The keys to determining the highest priority modules are: the probability that a fault occurring in memory belongs to space related to a particular function, the probability that the now corrupted code will run during the current execution of the program, and the number of functions to which a corrupted variable could pass its value. In this thesis, it is assumed that the chance for a single bit fault to affect a function is equal for all functions. The important factors for determining priority are therefore the probability that the corrupted code will be executed and the propagation. The probability of the code being executed is determined by the fraction of effective lines of code for that function and the fraction of functions that the function either sends data to or obtains data from. With this calculation completed for each function, the priority modules can be determined by the sum of the priorities of the functions within them. A critical path is defined as the data-flow paths from the highest priority function in a high priority module. All the variables in the modules are considered for error handling, and any variables in the connecting functions that have an effect on one of the variables in the module are also considered.
Once these variables for a critical path are found, a golden run, a single perfect execution of the test case, is done to simulate a perfect instance of the system, and the results are compared to faulty runs. Once the results of the testing are collected, the importance of each variable is calculated. Only the variables with the highest importance with respect to other variables in the critical path are considered as locations for placing the error handling mechanisms. If error handling were added to the entire system, then the overhead of the system would be too high and the time needed to implement these mechanisms would defeat the purpose of using OSS components. Lastly, lightweight error handling mechanisms are inserted into the code to increase its dependability. The goal for the mechanisms is to be unobtrusive and create as little overhead as possible. The error handling mechanisms to be used have already been proven to be both lightweight and effective at reducing the failure rate of variables, while at the same time maintaining a relatively low latency increase and memory usage overhead. The novelty of this approach is in the speed and accuracy with which these priority locations for placing error handling code are identified, not in the error handling itself.

Ultimately, the goal of this research is to fill the need for a means of increasing the dependability of late-stage and post-development OSS components so that they can be used in projects that demand higher dependability software. This research is important because there is a vast amount of OSS that could be used, but is overlooked because it cannot be trusted. The methodology is in an initial stage and will ideally be refined and expanded to include other major factors, such as security validation.

1.2 Literature Review

For this thesis I reviewed research in several areas, including the use of open source software development (OSSD) in real-world projects versus the use of consumer off-the-shelf software development (CotSD); software dependability measures and testing; fault injection techniques and methodologies; fault detection and recovery mechanisms; data-flow analysis; and wrapper-based software implementations used for increasing software dependability. In this section of the thesis, I present the references used along with a brief explanation of what they are about and how they are useful to the proposed research.

1.2.1 OSSD and CotSD

A comparison of the benefits of OSSD and CotSD in [27] states that the benefits of using OSSD over CotSD for a project are parallel and repeated development techniques, free user participants, huge development communities, and effective user testing. The paper illustrates the key differences in using the two types of software development. The portion that contributes most to this research has to do with the limitations of OSSD. These key problems include decreased communication between developers, informal management policies, the possibility of few contributors, and economic concerns. From a business point of view, developers are not being paid for their OSSD and it is possible that they will not be motivated to continue updating the software. Also, there is a higher short-term cost with integrating OSS within a system, and a lack of defined support means more time spent troubleshooting and that requests for immediate corrections may go unanswered.
An article by Nagy, Yassin, and Bhattacherjee [28] states that the benefits of using OSSD are clear: cost savings, vendor independence, and open standards. But the limitations of OSSD are of greater importance to this research. The paper describes five barriers to the integration of open source software in a real-world business environment and possible remedies. The most applicable of these barriers is the perception of technological immaturity. It is a common problem that open source solutions are often thought to be of lesser quality due to their zero cost and the lack of formal direction during development. This issue can be remedied with the introduction of a methodology to ensure an OSS solution with a high level of dependability. Also, the paper gave some examples of real-world businesses switching to open source software and the savings that resulted. For example, Amazon cut technology costs from 71 million to 54 million by switching to open source applications. Many companies, such as Sabre Holdings, saved tens of millions by switching to MySQL for their databases.

A paper by Kropp, Koopman, and Siewiorek [25] discusses the creation of a set of black-box processes to certify the suitability of CotS components. There is a drawback to these black-box processes because such testing can fail to exercise significant portions of the code. The methodology requires the buyers of the CotS software to purchase oracles in order to test the code against what they want it to do, because the buyer cannot be sure of how rigorous the testing actually was. Kropp, Koopman, and Siewiorek [25] wrote a paper about using automated testing in order to assess the component robustness of CotS solutions, because even though there are savings in cost and time, the CotS component may not have been designed for robustness. Here, the robustness of the component is defined as the degree to which the component functions correctly in the presence of exceptional inputs or stressful environmental conditions [25]. Fault injection testing is used in the testing phase to attempt to cause failures in the system under different circumstances.

The article by Ghosh and Voas [26] about software inoculation discusses the idea of using fault injection testing to identify vulnerable parts of a software system and introducing mechanisms to increase the resistance to these faults. Two solutions are offered after the testing: sending results to the developer to fix the problem at the source, or using a software wrapper to harden its defenses. Getting the vendor to make the necessary changes can be difficult, so the ideal solution is to use the software wrapper. In this case, the wrapper is useful because of the lack of access to the source code. Whenever the software is run, so is the wrapper code, which is used to catch operating system exceptions and return a specified error code.

1.2.2 Software Dependability

A lightweight code analysis is suggested to evaluate dependability cases in [14]. The approach is to seek a balance between analysis and design. This is done in order to try to reduce the critical path for fulfilling critical properties, so that a simpler analysis tool can be used to establish them. These critical properties are tested using dependability cases, which present explicit arguments that a system satisfies a critical property.
The implementation of lightweight coding techniques through flow- and context-sensitive static analysis, and the use of a single pathway on the control-flow mapping to reduce overhead, are derived from the research found in this section.

The importance of variables in dependable software is the major focus in [13]. This paper is key to the research proposed in this thesis. Here, Leeke and Jhumka propose a novel metric for measuring dependable software systems. Detecting and correcting erroneous states can be very difficult. The key to this research is finding a way to determine whether the correct values are held by variables determined to be important. The importance of variables is measured through a combination of other metrics, including failure rate, spatial impact, and temporal impact. Spatial and temporal impact are metrics that measure how many adjacent modules are affected when a variable in a component is corrupted and how long the state of the program remains affected by the corrupted variable, respectively. Importance is then used to identify the most important variables, where error detection and recovery mechanisms can be focused.

The location and generation of detectors are considered in [8] and [9]. Leeke and Jhumka explore the early identification of detector locations in order to cut down on the costs of late lifecycle assessment of dependability. Insight into the potential for errors during development can help to reduce costs, as opposed to finding such errors late in development. Spatial impact is used to define locations of detectors because it measures areas of high error propagation and module coupling. Fault injection testing was done to verify the connection between the late lifecycle metric of spatial impact and the early detection metric of module coupling.

Leeke, Jhumka, Arif, and Anand propose a methodology for generating efficient error detection mechanisms in [9] that utilizes fault injection testing and data mining. First, fault injection testing is done on the target systems with PROPANE and the results are put through pre-processing algorithms to counter the inherent weaknesses of analyzing trends in fault injection data. After this, a symbolic pattern-learning algorithm is used to look for first-level predicates, meaning values that have an effect on the flow of the program. The algorithm uses a decision tree to create a baseline for the predicates, which is later refined in order to create an efficient predicate for error detection and recovery mechanisms. This means that for each boolean value or if-type statement, there is a node in the tree indicating whether the predicate should be true or false in a correct run. If the path on the decision tree for a run differs from the path for a correct run, then an error has occurred. The refinement is done variously through over- and under-sampling of the data in the pre-processing phase, or through other parameters.

1.2.3 Fault Injection

The next section of literature is a review of fault injection techniques and tools from past and current research, included in order to provide insight into the design of the testing phases of the methodology proposed in this thesis. The first of these is [3] by Hiller, Jhumka, and Suri for the propagation analysis environment PROPANE. It is an environment that helps in the process of designing efficient error detection mechanisms by providing analysis of the effects of faults in a system. PROPANE supports both faults, by mutation of source code, and data errors, by manipulating variables and memory contents.
The environment logs the results of these faults and errors and places them in a file that can be used to see the results of the test represented in various metrics. It is a portable and lightweight solution for dependability testing in a wide array of systems. But the environment has not been updated in over eight years, so a different framework, based partially on PROPANE's functionality, was utilized for this methodology.

Chen, Tsai and Iyer [4] present hardware and software fault injection environments and injection and implementation methods. Compile-time injection, where faults are injected into the source code before the program image is loaded and executed in order to emulate various error types, is compared to run-time injection. Run-time injection is detailed, including several examples. Of particular interest is code insertion, such as the probes used by PROPANE, because this is the method for fault injection that is used in the proposed methodology. Lastly, the limitations of software fault injection testing compared to hardware testing are examined, along with tools for completing both. The limitations indicate that hardware testing costs more and has a higher risk of damage, whereas software testing is more controllable and is easily repeatable.

A paper on enhancing fault injection testbenches by Sosnowski, Gawkowski, Zygulski and Tymoczko [5] deals with the problems of classical fault injection tools with respect to improving experiment effectiveness and result analysis capabilities. The results of this research aided in preparing the test phases to be used in the proposed methodology and the resulting analysis. The research is demonstrated through distribution of fault injection processes within a computer network, utilizing data mining methods to collect simulation results. This resulted in increased speed in the testing phase, but was not without limitations. The golden run of the software could not be distributed across computers in the network, but the research provides insight into finding relations and critical dependability points. This means that the process for fault injection across a network, in this paper, is not applicable to the proposed methodology, but the steps for finding relations and important locations in the code are.

Moraes, Duraes, Barbosa, Martins, and Madeira [6] use software fault injection for experimental risk assessment and comparison. This research focuses on estimating the risk of using outside software, such as CotS, in component-based software development. The paper presents an approach to assessing the risk of using a software component, either CotS or not. This is done through fault injection of realistic faults and their impact on possible component failures, and by using software complexity metrics to estimate the possibility of residual defects. The paper presents findings on determining cost of failure measurements through fault injection and frequency of faults. The findings presented were then used in deciding how best to calculate the probability of execution for faults.

A 2011 article by Natella, Cotroneo, Duraes and Madeira [7] and a dissertation by Natella [22] present an experimental study on the representativeness of injected faults with respect to residual software faults, meaning faults that are actually experienced in the field. The study shows that the representativeness of the faults is affected by the location of the fault in the system, which results in different distributions of faults across the system.
An approach is proposed to refine the fault load by removing faults that are not representative of the residual software faults. This is done to ensure that the results of the testing are meaningful and to reduce the number of faults needed to test a system. The papers present several approaches to software fault injection and tools that utilize software insertion fault injection methods. This is very important to the methodology proposed in this thesis, because the information in these papers was used to influence how to calculate the probabilities that the faults would occur in a function. This value is then used to calculate the primary metric for ranking variables.

A 2010 paper by Cinque, Cotroneo, Natella, and Pecchia [10] on assessing and improving the effectiveness of logs when analyzing software faults presents an approach to assess the effectiveness of using logs to keep track of software faults in the field, using fault injection. The use of logs is crucial to the fault injection testing method in order to collect data about the results of the testing, and this research provides a means to improve the effectiveness of the logs. A fault injection framework is proposed utilizing G-SWFIT, which is used in several of the previous papers, to define a set of fault operators and generate the logs. G-SWFIT injects faults by changing the binary code corresponding to programming mistakes in high-level source code, and is primarily used when testing CotS software. The logs are then collected and refined in order to increase their efficiency. Of interest here are the results on the number of injections and the failures that occurred from those injections that were actually logged. The data shows that before refining the logs, the proportion of faults that were logged, as opposed to unlogged, was only around 20% to 30%. This shows the limitations of using logs to gather data on software faults and motivates their approach to increasing the percent coverage of the logs. This data influenced the method used to collect the information from the fault injection test runs in this paper.

An older article by Clark and Pradhan [11] on using fault injection as a method for validating the dependability of a computer system provides details on various fault models and injection methods, as well as measures of software dependability. The research also includes information on several prevalent fault injection tools, such as MESSALINE and FIAT, which helps in looking for which fault injection tool would best suit the needs of the proposed methodology. This paper helped lead to developing a framework for the methodology, as opposed to just using MESSALINE or FIAT, which did not fill the requirements set for the methodology.

Madeira, Costa and Vieira present a paper on using software fault injection to emulate software faults [12]. The experimental results were found by comparing real software faults found in various programs with the faults injected using the Xception tool, a SWIFI tool. The results show what kinds of faults are possible to find and which types of faults are unlikely to be detected using this method of fault injection testing. An approach using software metrics to guide the injection process is presented for when field data is not available, which is the case when the goal is to determine whether to integrate open source software with a system. The concepts presented in this paper were used to see what types of faults would be practical to test for using software-implemented fault injection.
A 2011 paper on fault injection methodology and tools by Song, Qin, Pan, and Deng [17] describes the concepts and principles of fault injection. The paper includes a detailed review of techniques and tools for hardware fault injection, simulation fault injection, and software fault injection. Of particular interest are the data on compile-time and run-time software fault injection and some useful tools, such as Ferrari, Doctor and Xception. Also, a comparison of the various tools available for each type of fault injection is presented, which aids the proposed research by helping to identify the type of fault injection tool that may need to be used. When selecting which fault injection framework to model the fault injection step of the methodology after, the data in this research helped to narrow down the choices.

Two more papers about fault injection methodologies and tools, [18] and [19], present some applications of dependability validation tools. Confidence in the error handling mechanisms in a software component is analyzed through fault injection testing with MESSALINE in [18] and FIAT in [19]. Each paper provides insight into the operation and possible usefulness of its respective tool in the proposed research for the thesis. Of concern here is not only the individual fault, but its overall effect on the target system. Other fault injection frameworks were considered for use in the proposed methodology and were used to guide the development of the fault injection framework designed for this thesis.

A 2011 paper by Benso and Carlo on the art of fault injection [20] provides some important points to consider when conducting any type of fault injection testing. The paper highlights good and bad practices when setting up a fault injection experiment and common errors in methodologies that affect the coherency and meaningfulness of the results of the experiments. The paper reads as a sort of guide for creating and setting up fault injection experiments. The major points include the structure of the environment, the choice of fault model, fault locations, and generation of the actual target fault list. Of key importance is using this paper to ensure that the mistakes of past research are not repeated in the proposed methodology and that the results will be meaningful. The fault injection framework designed for this methodology used this paper as a guide for determining the specifics of injection location and choosing a fault model.

A paper by Jiang, Munawar, Reidemeister and Ward [21] presents a method for automatic fault detection and diagnosis using information-theoretic monitoring. The methodology involves combining two algorithms. The first is RatioScore, which is based on Jaccard coefficients, and the second is SigScore, which uses knowledge of the component dependencies. The approach monitors the state of the system and flags anomalous states, which are then determined to be faults or not. The approach is limited by the availability of component dependency information and the dynamic changing of dependencies. The algorithms presented in this paper were considered for finding the dependencies of the relevant variables to be tested in the proposed methodology.

Of concern when utilizing fault injection testing is how the results can be used to identify the impact of the faults. This is often done utilizing golden runs of the system, meaning a perfect run of the system with no errors.
A 2009 paper by Leeke and Jhumka [23] evaluates the use of these reference run models in fault injection analysis. Of particular interest in this paper is the finding that certain systems are not properly represented in testing with the use of these reference runs. The paper defines various types of oracles that are used to determine the impact of fault injection, such as specifications, error detection mechanisms or golden runs, and shows that of these, golden runs are not effective when used on a system with a main control loop with an irregular period. The paper goes on to suggest a model for refining the golden run oracle for use with these types of systems. The data from this paper was useful for selecting which open source software to test the methodology on, because a golden run oracle is used to look for failures during fault injection.

1.2.4 Fault Detection and Correction

There are several approaches to handling faults that may appear in software; the following papers detail techniques and approaches for handling the detection of these faults.

A paper by Arora and Kulkarni [2] on error detectors and correctors details a theory for detectors and correctors that characterizes their role in achieving various types of fault-tolerance. The paper presents a large number of definitions and background on faults and the mechanisms to handle them. Also included in the research are applications of different mechanisms and what types of fault-tolerance are best fulfilled with which types of mechanisms. Examples are given for the composition of the mechanisms and how they can be used in software.

A 2009 paper on using software invariants for dynamic detection of transient errors [15] by Lisboa, Grando, Moreira and Carro discusses the possibility of using software invariants as a low-cost means to detect soft errors after the execution of an algorithm, by looking for changes in the invariants for the algorithm. Usually the use of software invariants is confined to the development phase, to check for correctness, because of the huge memory footprint and overhead of late lifecycle use. The paper gives insight into a lightweight approach to hardening an algorithm without changing the algorithm itself; the overhead and computation cost are much smaller than duplicating the algorithm. Fault injection is used to test the algorithms and the proposed hardening methods. A software invariant is a program property that must be preserved when the code is modified, such as preconditions, postconditions or loop invariants. The focus of the methodology is the detection of single event transient faults during the execution of the software. This paper was considered as a potential solution for detecting errors that occur in the system after fault injection.

Another 2009 article, by Hamill and Goseva-Popstojanova [16], on common trends in software fault and failure data examines fault and failure data from two real-world case studies, giving valuable data about the origins of errors in a system. In particular, the article focuses on the localization of faults leading to individual software failures and the distribution of different types of software faults. The paper reveals requirement faults, coding faults and data problems as the primary types of software faults. Also, it is shown that individual software failures are likely caused by several faults spread throughout the software.
The article explores whether or not certain software faults are more common than others, and how the localization of faults changes depending on the individual software failure. Most importantly, the research reveals that the trends are likely intrinsic characteristics of software faults and failures rather than being specific to a project. This means that it should be possible to create a methodology that is capable of adapting to any type of project it is used on.

1.2.5 Data-Flow Analysis

Tan and Xiong [29] published a 2011 paper on data-flow error recovery using checkpointing and instruction-level fault tolerance. This paper is particularly useful because it uses data-flow analysis for error recovery while also considering the fault tolerance of the software. The paper proposes an approach to data-flow error recovery using checkpointing. The software is examined at the instruction level and split into protected code and unprotected code. In the protected code, data values are replicated, and at certain locations, such as branch instructions, the values are compared; if there is a discrepancy, then an error has occurred and a previous state of the software is loaded. This method has several similarities to the error handling mechanisms in [1].

An article by Taylor and Osterweil [30] discusses using static data-flow analysis to detect anomalies in concurrent software. The article presents algorithms for using data-flow analysis for detection of variable usage errors in single- and concurrent-process software. The focus of the flow analysis is on variables that are shared between the multiple processes in a software system. The fundamental use of data-flow analysis in this research served as good background information for deciding how best to approach the data-flow analysis found in this thesis.

Bergeretti and Carre present a paper on information-flow and data-flow analysis specifically for while-programs [31]. In this paper, information flow is expanded from its common use of ensuring that program variables are not violating security requirements. Information-flow relations are used to identify program statements that cause information to be transmitted between certain input, internal, or output values. This is useful in testing and updating programs and can extend the types of errors that can be detected automatically during static analysis. The concepts presented in this paper for handling information flow with while statements influenced the method used to determine the relevant variables in the critical path.

1.2.6 Wrapper-Based Solutions

The following two papers implement wrapper-based solutions for error detection in complex software systems. The first paper details a framework for detecting these errors [32], where the goal is to cope with software failures based on data obtained at the application and OS levels. The framework is implemented as a wrapper around the complex system in order to detect errors. The reasoning behind the framework is that there are complex interdependencies between components, and in turn these lead to subtle faults that are caused by complex triggering patterns which can escape testing. So, rather than tracking down each of these complex patterns, the entire system is observed to look for errors in both application and OS activity.

The second paper [1], by Leeke and Jhumka, proposes an automated wrapper-based approach for designing dependable software.
Two major activities for developing dependable software are detailed: the design of the dependability mechanisms and their location. They use the importance metric defined in an earlier paper to identify variables to be replicated. These replicated variables are used in standard, efficient dependability mechanisms, which are deployed wherever the variable is located in the software. Whenever the variable is written to, copies are made based on the level of importance of the variable. When the variable is read, the variable values are compared and the majority voting rule is used to select which of the values will be used. This means that even if one of the stored instances of the variable is corrupted, the other two instances will have the correct value and the correct value will be returned. These algorithms were used in order to create a lightweight implementation as an alternative to the high overhead of code replication and N-version programming. The testing is done strictly through fault injection, including the determination of important variables.

1.3 Problem Statement

OSS components have been used to save organizations money that can be allocated elsewhere, in addition to the time that does not need to be spent on developing the software. The use of OSS for certain components of a system could be even higher, allowing for an even more widespread use of OSS components. The issue with the widespread use of OSS components is that most real-world projects have certain standards and software dependability requirements that need to be met before the open source solutions can be used. Most of the current research in software dependability for systems where the source code is available is directed at the development process. These techniques and tools are meant to be used early in development and followed closely throughout. This is due to the fact that the later these error detection and recovery mechanisms need to be added, the more time it takes to develop and troubleshoot the changes to the code. In the case of open source software, like that of CotS software, the component being used is past the development phase and is in its released form. The benefits of saved time and money start to seem less appealing if a full suite of testing and possibly development will need to be done in order to bring the component up to a level of dependability that meets requirements. The increase in overhead from adding these error handling mechanisms could render the OSS component ineligible for use in the system regardless of the time spent testing and developing code.

There are many cases where an OSS solution could be implemented over a CotS solution. For example, there could be OSS systems available that fulfill the required performance measures and no CotS systems that do. Also, with many CotS systems there are additional costs, such as purchasing a license to use the software. Developing code in house or contracting an organization to develop software for a particular function is also a significant investment from a project's budget when compared to free OSS. With CotS software or contracted development, there is a certain level of assurance that the software will be developed and tested according to the desired requirements of the customer. An issue with OSS systems is that in many cases, there is no guarantee of any standard of development or formal testing.
Often, development is done by a small group or even just a single person, testing is done only loosely to ensure that the product will run, and the burden of testing and troubleshooting falls to the community. In order to solve this problem, this thesis proposes a methodology for increasing the dependability of OSS components. The focus of the research is to implement a methodology for identifying critical artifacts and locations in open source systems and adding lightweight error handling mechanisms to detect and recover from errors. There has been significant research done in the realm of fault injection, including the definition of metrics used to identify important variables and of injection and sampling methods. The major issue with the importance metric defined by [13] is that it is irrelevant when considering more than one module. A new, modified importance metric is proposed that takes into account the fault chance of each function in the system to create a more relevant metric, measured as the relative importance of a set of variables given a specific set of use cases. This research is confined to specific subsets of system execution. Errors are measured according to these pre-defined runs, as is the case with all fault injection research.

Currently, the greatest measure of the importance of a variable is to use several metrics, including failure rate and spatial and temporal impact, described above. This methodology for determining the most important variables and their propagation is done completely through fault injection, which does not necessarily reflect real-world situations that may occur while the software is in use. In an attempt both to further verify this method of using a variable's importance to direct the use of error handling mechanisms and to explore the possibility of increasing the benefits, both fault injection and data-flow analysis are used. To limit the negative effects innate to using fault injection testing, data-flow analysis is done on modules to identify the modules with the highest priority. Then these modules are tested to highlight the variables and functions with the highest failure rates. Research in the use of data-flow analysis for dependability testing has grown less popular, but by combining data-flow analysis with the results from the fault injection testing, the propagation of the critical errors can be determined and then the important locations and artifacts for implementing error handling mechanisms can be identified. The data-flow analysis helps to keep the error handling code added to the OSS system in only the most critical locations. This should allow for a rapid testing phase that increases the software dependability without creating too much overhead.

The issue with adding error handling mechanisms this late in the life cycle of the software is that they impose a harsh overhead if they are too obtrusive. This thesis proposes using an existing wrapper-based solution, similar to [1], a lightweight error handling mechanism that limits this overhead while still increasing the dependability by a significant amount. The majority of the research on late-stage dependability and reliability techniques is aimed at closed-source CotS components, whereas the research aimed at systems where the source code is available is intended to be used much earlier in the life cycle.
Thus, this methodology creates a means of speeding up the process of locating high-priority locations for placing error handling mechanisms to increase the dependability of OSS, in order to allow for the benefits of OSS to be realized more often in systems that have strict timelines and higher standards for software dependability and reliability. Furthermore, this methodology forms a foundation for a larger methodology that can include increasing desired factors other than dependability, such as security.

Chapter 2
Fault Injection Framework

This chapter details the implementation of the fault injection framework that was used in order to test the methodology on an open source software system.

2.1 Fault Model

The first thing to consider when doing fault injection testing is the fault model that will be used. Depending on the type of fault that is going to be considered, the type of injection will change. The fault model used has two major parts: a local model and a global model. The local model states the types of faults that could occur in the system, and the global model constrains the occurrence of the faults from the local model so that dependability can be measured [33]. In this methodology, the faults considered were those that could occur at any time or at any place during the execution of the system. The fault injection framework simulates the occurrence of transient hardware faults. The local model is a transient data value failure, which means that a variable has a corrupted value and could become corrupted again. The global model in this case is the assumption that any variable in the system could be affected by this data failure. This is the same model used in [1].

2.2 Implementation

The implementation of the fault injection framework was split into three different parts. The first and most crucial part of the framework was the injection functionality that corrupts the value in a variable. The probe functionality is used to output the value of a variable at various locations throughout the testing process. The environment simulator is used to control the test and manages the output of the tests for comparison to the golden run. Note that all code for the framework was written in C.

2.2.1 Injections

The injection functionality is incorporated into the system first by adding the injectors.h header file to all files to be tested and by placing the injectors.o object file in the compile path for the system. Once this is done, it is possible to inject errors directly into variables at any location in the code. There were two different types of injection functions written for the framework. The first takes a specific value and inserts it into the variable in order to test boundary conditions or specific faults, such as NULL or zero. The manual injection function takes in four inputs. The iid variable is the string that identifies the injection point. This value is compared to a file containing all the probes and injections that will be used in the current experiment. If the string is located in the file, then the injection is completed and the specified injection value, int inject_val, is returned. If not, then the original integer value, int var, will be returned. The variable p specifies the probability that the injection will occur. This is important when considering several modules in the same system, as discussed in chapter three.
Figure 2.2 shows a simple example that calls the injection function, specifying that the value 100 be inserted into the variable x; the value of x is then stored in the variable log produced by the framework after each experiment. The second function, used in the testing in this experiment, injects a random error into the variable. This is done by flipping a random bit in the variable and then returning the variable with the flipped bit. The prototype and the way that the function is called are very similar to the manual injection function, but the injection value is instead randomly defined from within the function. Figure 2.3 shows the function to randomly flip a bit in an integer. Note that the file pointers fp1 and fp2 are defined globally for all functions in the injector file.

int injector_i(char *iid, int var, int inject_val, float p);

Figure 2.1: Prototype for manual integer injection.

int fmul(int x, int y) {
    x = injector_i("mulInject1", x, 100, 1);
    int result = x * y;
    probe_i("mulProbe1", x);
    return result;
}

Figure 2.2: Integer injection and probe example.

Figure 2.4 shows the more difficult task of flipping a bit in a floating point variable or pointer. The code is mostly the same between the functions, but the added difficulty of flipping a bit for a variable of floating point or pointer format is handled by casting the incoming variable's address to an unsigned long* and dereferencing it. The bitwise operation is completed on the unsigned long, which should be the longest variable type available on the system so that it can be used to represent any other variable, including a long double. When the single random bit-flip is complete, the variable is cast back to the original variable pointer and then dereferenced. This method works with pointers of various types as well, affecting the address to which the variable points.

Instrumentation of a system involves a single call to an initialization function for the injectors, which includes a seed function call for rand, and then the injector functions themselves wherever the user wishes to place a fault in the system.

2.2.2 Probes

The probes currently serve the purpose of allowing the tester to see if the variable injection was completed and what the value of a particular variable is at a given location in the system. There is only one set of probe functions, where each function handles a different type of variable probe.

//Fault injection function to force an int variable fault with a given prob and random bit-flip
int injector_rand_i(char *iid, int var, float p){
    double p2 = (double)rand()/(double)RAND_MAX;
    unsigned int bitCount = 8 * sizeof(int);
    int fault_loc = rand() % bitCount; // random bit position in [0, bitCount-1]
    int inject_val = var ^ (1 << fault_loc);
    if(p2 <= p) {
        fp1 = fopen("exp_list.txt", "r");
        fp2 = fopen("var_log.txt", "a");
        char tmp[256] = {0x0};
        while(fp1 != NULL && fgets(tmp, sizeof(tmp), fp1) != NULL) {
            if (strstr(tmp,iid)) {
                fprintf(fp2,"%s: original = %d injection = %d\n",iid,var,inject_val);
                return inject_val;
            }
        }
        fclose(fp1);
        fclose(fp2);
        return var;
    } else {
        return var;
    }
}

Figure 2.3: Function for random bit-flip fault in an integer.
//Fault injection function to force a float variable fault with given probability and random value
float injector_rand_f(char *iid, float var, float p){
    double p2 = (double)rand()/(double)RAND_MAX;
    unsigned int bitCount = 8 * sizeof(float);
    int fault_loc = rand() % bitCount; // random bit position in [0, bitCount-1]
    unsigned long temp = *(unsigned long*)&var;
    temp = temp ^ (1UL << fault_loc);
    float inject_val = *(float*)&temp;
    if(p2 <= p) {
        fp1 = fopen("exp_list.txt", "r");
        fp2 = fopen("var_log.txt", "a");
        char tmp[256] = {0x0};
        while(fp1 != NULL && fgets(tmp, sizeof(tmp), fp1) != NULL) {
            if (strstr(tmp,iid)) {
                fprintf(fp2,"%s: original = %f injection = %f\n",iid,var,inject_val);
                return inject_val;
            }
        }
        fclose(fp1);
        fclose(fp2);
        return var;
    } else {
        return var;
    }
}

Figure 2.4: Function for random bit-flip fault in a float.

void probe_i(char *pid, int var_probe);

Figure 2.5: Prototype for variable probe.

//Probe function to read integer variable
void probe_i(char *pid, int var_probe){
    fp1 = fopen("exp_list.txt", "r");
    fp2 = fopen("var_log.txt", "a");
    char tmp[256] = {0x0};
    while(fp1 != NULL && fgets(tmp, sizeof(tmp), fp1) != NULL) {
        if (strstr(tmp,pid)) {
            fprintf(fp2,"%s => %d\n",pid,var_probe);
        }
    }
    fclose(fp1);
    fclose(fp2);
}

Figure 2.6: Integer variable probe function.

Figure 2.5 shows the prototype for a probe function, and Figure 2.2 already showed an example of the probe function's use in a simple function. The probe functionality also has to check the experiment list to see if that probe is to be used during the current experiment. This allows all of the injections and probes to be placed into the instrumented system at one time; then, by modifying the experiment list document, the framework knows which of the injection and probe sites to use for that particular experiment. Figure 2.6 shows the function for storing the value of an integer variable.

2.2.3 Environment Simulator

The environment simulator is the program that is actually run when testing the instrumented system. The program takes in the experiment number as a command line input, then runs the intended use cases a set number of times and collects the results of these experiments in a file in a results directory, with experiment number labels for each file. This is done in order to streamline the process of testing the instrumented system, but also to simulate the intended use of the system. Any inputs the instrumented code may need are controlled by the environment simulator, and any outputs are dealt with accordingly. This way, the testing of the instrumented component can be done in a way that closely represents how it will fit into the entire system.

The environment simulator is written to replicate the intended usage of the open source component; consequently, it will differ for each system. The way that the environment simulator is written to test the open source component in the experiment section of this thesis is not necessarily the same way the simulator will be written when testing a component for another system. The general method is the same, but the specifics of implementation will be different. It is important to note that this thesis proposes a methodology for accomplishing a task and not a set-in-stone way of completing the tasks. The "how" of the problem is hypothesized and validated, but the "what", as in what to use to accomplish the goal, is left up to those who are implementing the methodology.
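To make the role of the environment simulator concrete, the sketch below shows one possible shape for such a driver. It is illustrative only and not the simulator used in the experiment; the names init_injectors and run_use_case, the number of runs, and the results directory layout are assumptions made for the example.

#include <stdio.h>
#include <stdlib.h>

#define NUM_RUNS 100 // how many times the intended use cases are exercised

// Hypothetical names: the framework's injector initialization (which seeds rand,
// per Section 2.2.1) and the entry point into the instrumented component.
extern void init_injectors(void);
extern void run_use_case(FILE *results);

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <experiment-number>\n", argv[0]);
        return 1;
    }
    int experiment = atoi(argv[1]);
    init_injectors();

    // One results file per experiment, labeled with the experiment number.
    char path[256];
    snprintf(path, sizeof(path), "results/exp_%d.txt", experiment);
    FILE *results = fopen(path, "w");
    if (results == NULL) {
        perror("fopen");
        return 1;
    }

    // Drive the instrumented component through its intended use cases and
    // collect the outputs for later comparison against the golden run.
    for (int run = 0; run < NUM_RUNS; run++) {
        fprintf(results, "--- run %d ---\n", run);
        run_use_case(results);
    }

    fclose(results);
    return 0;
}

In this sketch the experiment list file (exp_list.txt) still controls which injection and probe sites are active, so the same instrumented binary can be reused across experiments by changing that list and the experiment number.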
Chapter 3
Methodology

3.1 Define Critical Path

The first step of the methodology, as shown in Figure 3.2, is to define the critical path for applying error-handling mechanisms. A critical path in this context is a high-priority module and all the functions inside of that module, plus the functions in any other module that exchange data with the high-priority module. Determining a critical path is done by finding the highest-priority modules and the functions that deal with those modules. Priority is defined as the potential impact an error occurring in a module will have on the system.

3.1.1 Instrumentation Stage One

In order to determine the highest-priority module, the target software is instrumented to track the number of calls made to each module, and to each function within that module, in one instance of the intended use cases for the system. Finding the frequency with which modules are called in each execution of the target software allows modules that are not directly related to the proper execution of the system to be eliminated from consideration.

Figure 3.1: Overview of methodology (1. Target Software, 2. Define Critical Path, 3. Identify Locations for Error Handling, 4. Wrap Important Variables, 5. Wrapped Software).

Figure 3.2: Stage one of methodology (target software; call instrumentation, code analysis, data-flow; call frequency, worst-case LoC, call depth; critical path).

This narrows down the potential number of modules for testing. Because the focus is on both speeding up the process and accurately identifying important locations for error handling, it is important that no unnecessary tests are completed. Modules with zero calls, or with only a fraction of the calls made to other modules, will not be considered. The call frequency is also found for each function in the remaining modules. This value will be used in a later section to calculate the priority of the function in relation to other functions in the system.

3.1.2 Data Flow Analysis

Once the unrelated modules of the system have been ruled out, it becomes necessary to further narrow down the possible high-priority locations for error handling. The next step is to determine the priority of each function in the remaining modules. The first value to be determined is the depth of the call and caller graphs for each of the functions. This value will represent the propagation factor of the function in relation to the rest of the functions in the system. For this step, Doxygen was used to generate dependency graphs for each module and call and caller graphs for each function. Doxygen also lists any structs, variables, defines, and classes found in each module. Doxygen is most useful when the target source code has been instrumented with special comment blocks that alert Doxygen that certain information should be included; even without the extra comment blocks, it is able to generate the necessary graphs. Doxygen was used through its Linux command-line interface. The setup involves generating a configuration file that has a list of options that the user can either set to "yes" or "no", as well as other options such as giving a specific file path or file names. Doxygen was set up to ignore private internal functions of a module, such as small functions that perform single actions. This is because if a function is an internal function, it is only called by functions within the same module; it has the same propagation to functions in other modules as the function that called it.
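An illustrative excerpt of such a configuration file is shown below. The option names are standard Doxygen options, but the values and the source path are assumptions for this sketch; the thesis does not record the exact configuration used.

# Illustrative Doxyfile excerpt; values and the source path are assumptions,
# not the configuration actually used in this work.
# (The file is first generated with: doxygen -g Doxyfile)

# Hypothetical path to the target component's source tree
INPUT            = ./src
RECURSIVE        = YES
# Document functions even without special comment blocks
EXTRACT_ALL      = YES
# Skip file-local (internal) helper functions
EXTRACT_STATIC   = NO
# Graphviz dot is required for graph generation
HAVE_DOT         = YES
# Generate per-function call and caller graphs
CALL_GRAPH       = YES
CALLER_GRAPH     = YES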
The Doxygen command is then run again with the configuration file for the target software as input. This data is important because the call and caller graphs for each function were used to determine its propagation factor in relation to the other potentially high-priority functions. The call depth of an individual function was set to the length of the data path through its callers and called functions. The propagation factor was found by equation 3.1, where p is the propagation factor, F_{cd} is the call depth of the function, and Max_{cd} is the maximum call depth, i.e., the highest call depth among all the functions being considered.

p = F_{cd} / Max_{cd}    (3.1)

3.1.3 Determine Priority Functions

When finding the highest priority functions, it was determined that there were three key factors to consider: the probability that a single bit-flip fault occurs in memory related to a function, the probability that the corrupted code in that function is then run during the current execution of the program, and the propagation factor. The propagation factor was found as described in the previous section. For the purposes of testing this methodology and proving the concept, the probability that a fault occurs in a function is assumed to be equal for all functions. In reality, this is most likely not the case. It should be possible to determine the probability that a single bit-flip fault occurs in the memory space related to a function based on the amount of memory that the function has reserved and obtains dynamically during run-time. This amount of memory would then be divided by the total memory used by the program to determine what percentage of the memory is related to that function. This approach still assumes that the fault has an equal chance of occurring in any bit in memory and does not consider which sections of memory would be more prone to faults than others.

With the probability that a fault occurs in a function equal for all functions, the remaining factors used to set a priority for each function are the propagation factor and the probability that the corrupted code in a function will be run during the same execution of the program in which the fault occurs. For this thesis, it was determined that the probability that corrupted code in a function would be executed in the instance of the program into which it was injected would be based on the effective lines of code for that function and the frequency with which that function was called given the intended use cases. The effective lines of code is found by equation 3.2, where ELoC_f is the effective lines of code. The worst-case lines of code for a function, LoC_{wc}, was determined through code analysis as the maximum number of lines of code that could be executed in a single instance of that function, and the function call frequency, F_{cf}, was determined through instrumentation.

ELoC_f = LoC_{wc} \times F_{cf}    (3.2)

After calculating the effective lines of code for each potentially high-priority function, the probability that corrupted code will be executed in that function at a given point during run-time is found by equation 3.3, where ELoC_f is the effective lines of code for the function in question and TELoC is the total effective lines of code across all functions.

P_e = ELoC_f / TELoC    (3.3)

At this point, the probability of corruption, the probability of execution of corrupted code, and the propagation factor have been determined for each function.
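To make these per-function calculations concrete, a small sketch is given below. The func_info struct and the helper function are assumptions made for this illustration and are not part of the thesis's tooling.

/* Illustrative computation of equations 3.1-3.3; the struct layout and
   function below are assumptions for this sketch. */
#include <stddef.h>

struct func_info {
    const char   *name;
    unsigned long call_freq;        /* F_cf, from stage-one instrumentation */
    unsigned long worst_case_loc;   /* LoC_wc, from manual code analysis */
    unsigned long call_depth;       /* F_cd, from the Doxygen call/caller graphs */
    double        eloc;             /* ELoC_f = LoC_wc * F_cf   (eq. 3.2) */
    double        prop_factor;      /* p = F_cd / Max_cd        (eq. 3.1) */
    double        p_exec;           /* P_e = ELoC_f / TELoC     (eq. 3.3) */
};

void compute_function_metrics(struct func_info *f, size_t n) {
    unsigned long max_cd = 0;
    double total_eloc = 0.0;
    for (size_t i = 0; i < n; i++) {
        f[i].eloc = (double)f[i].worst_case_loc * (double)f[i].call_freq;
        total_eloc += f[i].eloc;
        if (f[i].call_depth > max_cd)
            max_cd = f[i].call_depth;
    }
    for (size_t i = 0; i < n; i++) {
        f[i].prop_factor = (max_cd > 0) ? (double)f[i].call_depth / (double)max_cd : 0.0;
        f[i].p_exec      = (total_eloc > 0.0) ? f[i].eloc / total_eloc : 0.0;
    }
}

For example, dividing the ELOC and frequency columns of Table 4.2 implies a worst-case length of 24 lines for getbits, and 102951744 / 490289457 ≈ 0.2100 reproduces its tabulated P_e.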
These three values are used as shown in equation 3.4 to determine the priority of each function, where P_e is the probability that the corrupted code is executed, P_c is the probability of corruption for the function, and p is the propagation factor. Again, it is assumed that the probability of corruption P_c is equal for all functions, and as such it is set to 1 for all calculations. When considering the potential a fault has to affect a function, it is important to consider both the likelihood that the fault occurs in memory reserved for that function and the likelihood that the corrupted code will even be executed. A fault that occurs in memory and corrupts code in a function that will not be executed again during the remaining run-time of the system will have no effect on the outcome of the current run. Priority is determined based on the functions that have the greatest chance of having code corruption that is then executed during run-time, together with the potential propagation of that fault. The propagation is important because even if the fault does not cause failure immediately in the function in which it occurred, further calculations and uses of variables that carry the result of the corrupted code could still lead to failure.

Priority = P_c \times P_e \times p    (3.4)

This priority calculation is then used to determine the priority modules in the system. The priority of a module is simply the sum of the priority calculations for all functions within that module. With the highest priority modules determined, the critical paths can be found. The critical paths are ranked according to the module priorities found for the intended use cases. A critical path is the full call graph for the highest priority function in the high priority module. The first critical path starts from the highest priority function in the system for the intended use cases and follows the data flow through the functions it calls and the functions that call it. The second critical path is the same, but for the highest priority function in the second highest priority module, and so on. Testing indicates that determining priority in this way accurately leads to finding the variables that create the highest number of relative failed tests.

Figure 3.3: Stage two of methodology (relevant variables; code analysis, fault injection; relative failure rate, spatial impact, temporal impact; relative importance; ranked variables).

3.2 Identify Locations for Error Handling

Now that the critical paths have been determined, it is necessary to identify the priority locations for placing error handling mechanisms. These locations are the instances of reads and writes of the important variables in the critical path. In order to determine these important variables, a variation of the importance metric in [13] is used to calculate the relative importance of all the relevant variables found in the critical path. This section details the process for ranking the relevant variables, as shown in Figure 3.3.

3.2.1 Determine Important Variables

A technique called dynamic program slicing is used to identify the important variables in the critical path. This is a method of analyzing code that finds all the statements that affect the value of a variable at a given point during a particular execution of the program. Given the intended use cases determined for testing purposes, slicing starts from the highest priority function; all the statements, and the variables within those statements, that have a potential effect on the important variables in the critical path are determined.
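As a toy illustration of what a slice contains (this fragment is invented for exposition and does not come from the target software), a dynamic slice on the variable out at the return statement below includes the statements defining scale and out, but not the statement defining unused, because unused never affects out.

/* Invented example: statements belonging to the dynamic slice on 'out'. */
int example(int x, int y) {
    int scale  = x * 2;        /* in the slice: scale feeds the value of out */
    int unused = y + 7;        /* not in the slice: never affects out */
    int out    = scale + x;    /* in the slice: defines out */
    return out;                /* slicing criterion: the value of out here */
}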
First, all the variables in the highest priority function for the critical path are added to the list of relevant variables, because it is unknown which variables in the function will be the most important; it is known from the priority calculations of the previous section, however, that the most important variables are in this function. The next step is for each function called by the highest priority function in the critical path to be analyzed with dynamic program slicing. In the called functions, the variables that appear in the function call itself are the starting point for the dynamic program slicing. In each called function, any variables that are affected by the variables passed from the original function are added to the list of relevant variables. Then any variables that affect these variables are added, and so on, until there are no more variables to be added from the call-graph side of the critical path.

For the caller side of the critical path, the same method is used to identify relevant variables. The variables passed from the function that calls the highest priority function are added to the list of relevant variables. Then any variables in program statements that affect these values are added, then the variables that affect those, and so on. This is done until all variables in the chain of function calls from the high priority function in the critical path to the last function are found. These variables are all added to the list of relevant variables in the critical path and are subjected to fault injection testing.

3.2.2 Instrumentation Stage Two

The second stage of instrumentation of the target source code is for completing the fault injection testing of the relevant variables in the critical path. The goal of the fault injection testing is to determine three key values for each relevant variable in the critical path. These values are the relative failure rate, the spatial impact, and the temporal impact. The spatial and temporal impact metrics are defined in [13]. Given a software system with functionality distributed logically over a set of distinct components, the spatial impact of a variable v of component C in a run r, denoted \sigma^r_{v,C}, is the number of components that are corrupted in r when the variable v is corrupted. The overall spatial impact is the worst case over all runs, as shown in equation 3.5.

\sigma_{v,C} = \max\{\sigma^r_{v,C}\}, \forall r    (3.5)

Spatial impact captures the number of modules affected when a variable v in component C is corrupted. The higher the value, the harder it will be for the system to recover from the corruption. For the purposes of this thesis, only the worst-case value of spatial impact is considered for each variable. Given the same software system, the temporal impact of a variable v of component C in a run r, as defined by [13] and denoted \tau^r_{v,C}, is the number of time units during which at least one component of the software system remains corrupted in r. The overall temporal impact is again the worst case over all runs, as shown in equation 3.6.

\tau_{v,C} = \max\{\tau^r_{v,C}\}, \forall r    (3.6)

After determining the spatial and temporal impact for each relevant variable in the critical path, it is important to determine the relative failure rate of each of the variables. The relative failure rate is the failure rate determined through fault injection when the faults are injected into each function according to the probability of execution found in equation 3.3, given a specific set of use cases for the target software.
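A small sketch of how the worst-case impacts of equations 3.5 and 3.6 could be tallied over a set of runs follows; the run_record layout is an assumption made for this illustration, since the thesis does not prescribe a bookkeeping format for the experiment logs.

/* Illustrative worst-case aggregation for equations 3.5 and 3.6. */
#include <stddef.h>

struct run_record {
    unsigned corrupted_components;   /* sigma^r: components corrupted in run r */
    unsigned corrupted_time_units;   /* tau^r: time units with at least one corrupted component */
};

void worst_case_impact(const struct run_record *runs, size_t n_runs,
                       unsigned *sigma, unsigned *tau) {
    *sigma = 0;
    *tau = 0;
    for (size_t r = 0; r < n_runs; r++) {
        if (runs[r].corrupted_components > *sigma) *sigma = runs[r].corrupted_components;
        if (runs[r].corrupted_time_units > *tau)   *tau   = runs[r].corrupted_time_units;
    }
}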
The reason this methodology does not use just the local failure rate is that an importance calculated from only the failure rate of a variable and the local spatial and temporal impacts is limited to the module in which they are calculated. This thesis seeks to modify the importance metric so that its value is relative across multiple modules.

3.2.3 Calculate Relative Importance

The general-form equation for calculating the local importance of a variable is defined in equation 3.7, where G, K, and L are arbitrary functions that determine the importance, spatial impact, and temporal impact, respectively. The value incorporates the failure rate and the spatial and temporal impact of a variable. This is done because a variable v in a component C with a high spatial impact is likely to pass corruption on to other variables, and a high temporal impact indicates that recovery efforts will be less effective.

I_{v,C} = G[K(\sigma_{v,C}), L(\tau_{v,C})]    (3.7)

The final equation for the local importance of a variable is defined in equation 3.8, where f indicates the failure rate and n and m indicate the focus of the testing. To be more specific, the greater n is, the more the metric focuses on the failure rate of the variables; the greater m is, the more it focuses on the spread and duration of the corruption. So an m of zero and an n greater than zero would measure solely the failure rate of the variable; conversely, an n of zero and an m greater than zero would measure only the impact of the failure.

I_{v,C} = [1 / (1 - f)^n] \cdot [\sigma_{v,C} / \sigma_{max} + \tau_{v,C} / \tau_{max}]^m    (3.8)

The impact of the variables measured in this way is inherently reliant on other components or modules in the system, because the spatial and temporal impacts are determined based on their effect on these modules. So, in order to make these metrics measure relative importance across multiple modules, \sigma_{max} and \tau_{max} are determined from the maximum values over all variables in the critical path, and not just from within a single function.

Figure 3.4: Stage three of methodology (wrapper functions; wrapped software).

The other issue of relevance for the importance metric is the failure rate. Finding the failure rate in the usual way assumes that there is an equal chance of the fault occurring across all functions. A fault that is introduced into a system but whose corrupted code is never executed has no effect on the outcome of the current run of the program. The failure rate is therefore measured with the probability of execution in mind. If the fault is injected at the start of the function to maximize potential damage, the resulting failure rate is a worst case; the actual likelihood of failure is found only when the fault is injected according to P_e. Thus, the relative failure rate is the failure rate of the system when a fault is injected into the function with a probability equal to P_e. These changes yield the relative importance metric defined in equation 3.9, where \sigma_{max_r} is the maximum spatial impact across the critical path, \tau_{max_r} is the maximum temporal impact across the critical path, and f_r is the relative failure rate.

RI_{v,C} = [1 / (1 - f_r)^n] \cdot [\sigma_{v,C} / \sigma_{max_r} + \tau_{v,C} / \tau_{max_r}]^m    (3.9)

Once the relative importance RI_{v,C} has been calculated for all variables in a critical path, the variables are ranked according to this value, and these ranked variables move on to the next stage of the methodology. This process can be repeated for multiple critical paths, where the relative importance is calculated across all paths.
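A short sketch of how equation 3.9 could be evaluated for a list of variables is given below; the var_result struct and the treatment of n and m as parameters are assumptions made for this illustration.

/* Illustrative computation of the relative importance metric (eq. 3.9). */
#include <stddef.h>
#include <math.h>

struct var_result {
    const char *name;
    double sigma;       /* worst-case spatial impact of the variable */
    double tau;         /* worst-case temporal impact of the variable */
    double rel_fail;    /* relative failure rate f_r from Pe-based injection */
    double rel_imp;     /* output: RI_{v,C} */
};

void compute_relative_importance(struct var_result *v, size_t n_vars,
                                 double n, double m) {
    double sigma_max = 0.0, tau_max = 0.0;
    for (size_t i = 0; i < n_vars; i++) {          /* maxima over the whole critical path */
        if (v[i].sigma > sigma_max) sigma_max = v[i].sigma;
        if (v[i].tau   > tau_max)   tau_max   = v[i].tau;
    }
    for (size_t i = 0; i < n_vars; i++) {
        double impact = (sigma_max > 0.0 ? v[i].sigma / sigma_max : 0.0) +
                        (tau_max   > 0.0 ? v[i].tau   / tau_max   : 0.0);
        v[i].rel_imp = (1.0 / pow(1.0 - v[i].rel_fail, n)) * pow(impact, m);
    }
}

As a sanity check only (the thesis does not state the n and m values it used), evaluating this with m = 1, \sigma_{max_r} = 4, and \tau_{max_r} = 25 reproduces the curframe entry of Tables 4.3 and 4.4: with f_r = 0, (2/4 + 25/25) = 1.5.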
3.3 Wrapper-Based Error Handling

In this last phase of the proposed methodology, wrapper-based error handling mechanisms are put in place around instances of the highest importance variables in the critical path, as indicated in Figure 3.4. The novelty of this research lies in identifying important variables for error handling mechanisms and in the speed with which the identification occurs, not in the actual error handling mechanisms. The mechanisms discussed in this section are from the solution in [1], where lightweight and effective error handling mechanisms are tested using a testing scenario similar to the one in this thesis. The general idea behind the use of wrapper-based error handling mechanisms with the following algorithms is to use lightweight and efficient mechanisms to reduce the failure rate. The lightweight portion of the idea comes from the wrapper-based implementation around the instances of the important variable in a module. This is preferable to an N-version programming solution or a wrapper around an entire module to secure the correct outcome. The algorithm chosen should be an efficient and well-tested algorithm that is simple to implement and will not introduce a large amount of overhead due to complicated calculations or logic. The user of the methodology must decide upon two threshold values to identify which of the ranked variables should be wrapped.

3.3.1 Write-Wrapper

The first algorithm defined in [1] is the write-wrapper, which is called when an important variable is written to. When a variable v is assigned a value f(...), where f is some function from the unwrapped module, the ranking of the variable is checked. This is where the selected threshold values come into play. The first threshold, t, determines which variables will be triplicated, and the second, d, determines which will be duplicated. If, for example, the first threshold is the top ten percent out of one hundred variables, then the variables ranked one through ten will have two shadow variables created upon being written to. If the second threshold is fifteen percent, then ranks eleven through fifteen will have a single shadow variable created upon being written to. The remaining variables will not have a wrapper.

Algorithm 1 Write-Wrapper: writing a variable v
  v := f(...)
  if (rank(v) <= t) then
    create(v'); create(v'')
    v, v', v'' := f(...)
  else if (rank(v) <= d) then
    create(v')
    v, v' := f(...)
  end if

3.3.2 Read-Wrapper

The second algorithm defined in [1] is the read-wrapper. The read-wrapper is called whenever an important variable is read. So, when a variable y is assigned a value that includes the function g(v, ...) in the unwrapped module, where g is some function of v to be read, the rank of variable v is compared to the threshold values defined earlier. Considering the same example as before, variables ranked one through ten will have a majority vote run over the variable v and its two shadow variables, with the majority value being returned. For variables ranked eleven through fifteen, either the variable v or its single shadow variable is chosen at random. The algorithms presented in this section are taken from [1], as the focus of this research is on quickly determining important locations for error handling mechanisms based on the relative importance of variables in the critical path, and not on defining a new error handling mechanism.

Algorithm 2 Read-Wrapper: reading a variable v
  y := g(v, ...)
  if (rank(v) <= t) then
    y := g(majority(v, v', v''), ...)
  else if (rank(v) <= d) then
    y := g(random(v, v'), ...)
  end if
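To make the wrapper idea concrete in C, a minimal sketch for a single wrapped integer variable is given below. The names rank, T_TRIPLICATE, D_DUPLICATE, and the shadow-variable storage are assumptions for this example; it simplifies the mechanisms of [1] considerably and is not the thesis's implementation.

/* Minimal illustrative sketch of write/read wrappers for one important int variable. */
#include <stdlib.h>

#define T_TRIPLICATE 10   /* example threshold t: ranks 1..10 get two shadow copies */
#define D_DUPLICATE  15   /* example threshold d: ranks 11..15 get one shadow copy */

static int v, v_shadow1, v_shadow2;   /* the wrapped variable and its shadow copies */
static int v_rank = 3;                /* example rank taken from the RI ranking */

/* write-wrapper: mirror the new value into the shadow copies */
void write_v(int new_val) {
    v = new_val;
    if (v_rank <= T_TRIPLICATE)      { v_shadow1 = new_val; v_shadow2 = new_val; }
    else if (v_rank <= D_DUPLICATE)  { v_shadow1 = new_val; }
}

/* read-wrapper: majority vote over three copies, or a random pick of two */
int read_v(void) {
    if (v_rank <= T_TRIPLICATE) {
        if (v == v_shadow1 || v == v_shadow2) return v;   /* v agrees with at least one shadow */
        return v_shadow1;   /* v disagrees with both; if the shadows agree, this is the majority */
    } else if (v_rank <= D_DUPLICATE) {
        return (rand() % 2) ? v : v_shadow1;              /* random choice between the two copies */
    }
    return v;               /* unwrapped variables are read directly */
}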
3.3.3 Results

This section presents the results of the testing in [1], which indicate that the error handling mechanisms discussed there are sufficient for the purposes of this methodology and do not need to be upgraded at this time. Both tables are taken from [1], where 7Z# are the modules selected for testing from the 7Zip application, FG# are the modules selected for testing from the Flight Gear application, and MG# are the modules selected for testing from the Mp3gain application.

Table 3.1: Results for wrapper-based error mechanism testing [1]
Module   Unwrapped Failure Rate   Wrapped Failure Rate
7Z1      0.002407940              0.000017397
7Z2      0.007082023              0.000141946
7Z3      0.000856604              0.000030189
FG1      0.004582688              0.000045475
FG2      0.002481621              0.000002047
FG3      0.001471873              0.000135395
MG1      0.004983750              0.000012083
MG2      0.007888044              0.000013426
MG3      0.002780792              0.000006076

Table 3.2: Peak increase (%) in run-time and memory usage [1]
Module   Execution Time   Memory Usage
7Z1      26.048%          07.55%
7Z2      31.470%          18.16%
7Z3      20.359%          00.94%
FG1      30.660%          20.63%
FG2      35.829%          03.32%
FG3      23.529%          02.03%
MG1      25.983%          05.22%
MG2      29.090%          04.93%
MG3      23.174%          00.58%

Chapter 4
Experiment

This chapter details the process of utilizing the proposed methodology in order to find the important error handling locations for an open source software component. The open source software used for this experiment is called Mp3gain. It is a tool that allows the user to normalize volume settings by adjusting the gain of individual mp3 tracks and of mp3 tracks across an album. It is a highly modular system and was designed and implemented mostly by a single developer, aside from a few fixes and language translations.

4.1 Design

The setup of the experiment was to test Mp3gain with several potential real-world use cases. The use cases selected were the options to scan a track or album for the maximum gain adjustment, to normalize volume across all tracks in an album, and to undo all known changes made to the files by Mp3gain. These were selected as the three test cases to represent the functionality that the hypothetical system looking to incorporate Mp3gain would want to implement. The target source code was instrumented by first adding the header files to each module and then adding the object files from the framework to the path in Mp3gain's makefile. Mp3gain was then re-compiled with the instrumentation attached. The first stage of instrumentation was done by having the program output both the module and the specific function within that module to a file. The output was done once for each use case of Mp3gain. The three resulting files were parsed and the frequency of function calls was obtained. Next, a Doxygen configuration file was created in the Mp3gain directory and set up with the options discussed in chapter 3. The returned results were then used to calculate the call depth for functions and to aid in determining the critical path once a priority module was determined. The worst-case lines of code and the dynamic program slicing were done by manual code analysis.

The fault injection testing was completed for each relevant variable found in the critical path. The input to Mp3gain for the three use case tests was an album of 25 tracks. Faults were injected 25 times throughout a single execution of Mp3gain at 25 equally spaced points, meaning that there was one injection per track analysis.
Each fault injection test for a variable was completed 100 times, so each use case, given 25 input tracks, was run 100 times. Two types of runs were completed for each variable. The first run was used to obtain the spatial, temporal, and failure rate metrics. This was done to obtain the worst-case spatial and temporal metrics and to determine the local importance based on the local failure rate, for validation purposes. The second set of tests for a variable was done using the P_e found earlier to create a number of fault injections proportional to the number of faults actually expected to occur in that function. Using the relative failure rate and the spatial and temporal impact metrics, the local and relative importance metrics were calculated for each variable. This process was repeated for the second critical path to show that the results of the second set of tests should yield variables with a lesser chance of causing system failure. The number of tests was tracked in order to compare the total number of tests needed to find the important error handling locations using the proposed method against the number of tests needed to complete a cost-benefit analysis of placing error handling in every module in the system. Lastly, the proposed methodology was further validated by comparing the total number of relevant failed test cases found using the proposed methodology versus using local importance.

Note that the wrapper-based error handling mechanisms were not tested in this case. This is because there is no change to the error handling mechanisms suggested for use here compared to those in [1]. The mechanisms proposed there already met the requirements of being lightweight and efficient, so there was no need to develop new mechanisms. Instead, the focus is on determining important locations for these error handling mechanisms.

4.2 Phase 1

The first phase of testing the methodology was to find the priority modules in the system. Table 4.1 shows the priority of the modules and Table 4.2 shows the ranking, and the values needed to calculate it, for the functions in the highest priority modules. Note that in the case of Table 4.2, the call frequencies of test 1 and test 2 were omitted because there were few or no entries. The majority of the work came from the use case involving volume normalization across the album, not the max gain analysis or undoing the changes. This was due to the tag that Mp3gain updates on each file whenever it makes a change.

Table 4.1: Priority of Modules
Module          T1 Freq   T2 Freq   T3 Freq     Sum Priority   Rank   # of Variables
mp3gain         1         462862    483097      3841.5         7      168
apetag          109       151       121         521.7          8      58
gain_analysis   0         0         46572341    202183.975     6      53
id3tag          0         0         0           0              9      87
replaygaindll   0         0         0           0              9      19
rg_error        0         0         0           0              9      6
common          0         0         4350007     9207297.75     2      24
decode_i386     0         0         1448424     55257375.6     1      18
dct64_i386      0         0         2896848     2027793.6      5      11
interface       4         0         241416      2908720.125    3      37
layer1          0         0         0           0              9      24
layer2          0         0         0           0              9      54
layer3          2         0         123710987   2684186.725    4      65
tabinit         0         0         0           0              9      16

The knowledge obtained from these tables is where to start the analysis for the critical path. In this thesis, testing was done for the first and second critical paths so that the results could be analyzed to ensure that the first critical path did indeed lead to more failed test cases than the second. The last step of this phase was determining the critical path.
Figure 4.1 through Figure 4.3 indicate the critical path for the highest priority module in the path, as denoted by a grey box.

Table 4.2: High Priority Module - Functions
Module        Function                 T3 Freq    Call Depth   Rank   ELOC        Pe
mp3gain       main                     1          40           15     3693        0.0000075323
common        head_check               20117      4            11     905265      0.0018463889
common        decode_header            20117      8            7      1830647     0.0037338086
common        print_header             0          1            31     0           0.0000000000
common        print_header_compact     0          1            31     0           0.0000000000
common        getbits                  4289656    3            2      102951744   0.2099815579
common        getbits_fast             2209014    1            6      39762252    0.0810995452
common        set_pointer              20117      5            13     281638      0.0005744321
decode_i386   synth_1to1_mono          0          7            31     0           0.0000000000
decode_i386   synth_1to1               1448424    7            1      315756432   0.6440204404
dct64_i386    dct64                    2896848    7            4      11587392    0.0236337776
interface     InitMP3                  2          5            24     128         0.0000002611
interface     ExitMP3                  2          3            27     44          0.0000000897
interface     remove_buf               20117      4            14     281638      0.0005744321
interface     copy_mp                  60351      4            10     1267371     0.0025849444
interface     GetVbrTag                2          4            22     190         0.0000003875
interface     check_vbr_header         2          4            25     72          0.0000001469
interface     sync_buffer              20117      4            9      1448424     0.0029542222
interface     decodeMP3                20119      21           3      4969393     0.0101356310
layer3        init_layer3              2          4            19     760         0.0000015501
layer3        do_layer3_sideinfo       20117      3            12     865031      0.0017643271
layer3        do_layer3                40234      8            5      5672994     0.0115707036
layer3        III_get_side_info_2      20117      2            13     1750179     0.003569685
layer3        III_dequantize_samples   80468      2            5      29692692    0.0605615552
MAX CD = 40                                                           SUM ELOC = 490289457

Figure 4.1: Caller side for first critical path (graph nodes: synth_1to1, synth_1to1_mono, do_layer3, decode_header, decodeMP3, main).

Figure 4.2: Called side for first critical path (graph nodes: synth_1to1, dct64).

Figure 4.4 indicates the caller graph for the function getbits_fast, which was included to validate that there were no important variables in that path, and only in the path defined by the highest priority function, getbits.

4.3 Phase 2

The second phase of the experiment tested the variables acquired through dynamic program slicing for spatial and temporal impact, failure rate, and relative failure rate. Note that in the data presented here, temporal impact was limited to the value of either 1 or 25.

Figure 4.3: Call graph for second critical path (graph nodes: getbits, decodeMP3, main).

Figure 4.4: Call graph for alternate second critical path (graph nodes: getbits_fast, III_get_side_info_1, III_get_side_info_2, III_get_scale_factors_1, III_get_scale_factors_2, do_layer3_sideinfo, decodeMP3, main, do_layer3, decode_header).

The failures either caused a failure in the single track into which they were injected or caused the entire album normalization process to fail, such as with a segmentation fault. Both of the critical paths were subjected to the fault injection testing, and the results for the top 15% of the variables are included in Table 4.3 and Table 4.4. Critical path one has 47 variables in total, and critical path two, including the additional branch from getbits_fast that would not usually be considered, has 59 variables. The additional variables from the getbits_fast path are included to show that none qualify as an important variable. The only variables from critical path two that are near the 15% threshold of rank 16 are the rank 17 bsbufold and rank 7 wordpointer variables. The other variables are either out of range or already located in critical path one. The variables analyzed in critical path one are only those which have the potential to cause an error based on their connection to the highest priority module in the system.
It is understandable that there should be some variables located in critical path two, or even three, that have importance rankings not too far outside the thresholds. But it is clear that the majority of high importance variables found through the fault injection are located in the first critical path. These highest ranking variables are the key to placing error handling mechanisms. The important locations for placing error handling mechanisms are the code statements that either write to or read from one of the variables ranked within the defined thresholds. Also, by expanding the range of critical path two and incorporating the variables from other functions along the do_layer3 path, more variables were found for fault injection testing. None of these variables were within the importance ranking threshold; there was one rank 30 variable and the rest were between rank 40 and 80. Selecting the variables from the program slice of the highest priority function, getbits, was more effective at finding the important variables than selecting the variables from the program slice of the second highest priority function, getbits_fast.

Table 4.3: Fault injection results for variables in critical path one
Function     Variable     σ_{v,C}   τ_{v,C}   Failure Rate     Relative Failure Rate   Relative Importance
synth_1to1   bandPtr      3         25        0.0138666667     0.0088                  1.7816802003
synth_1to1   pnt          3         25        0.0088           0.0056                  1.769765878
do_layer3    pcm_point    3         25        0.0284           0.0002666667            1.7511506951
synth_1to1   buf          2         25        0.0321333333     0.0206666667            1.5640653421
synth_1to1   maxAmpOnly   2         25        0.0081333333     0.0052                  1.5158384332
main         curframe     2         25        0.0188           0                       1.5
synth_1to1   window       1         25        0.0734666667     0.0472                  1.3772426984

Table 4.4: Fault injection results for variables in critical path two
Function   Variable         σ_{v,C}   τ_{v,C}   Failure Rate     Relative Failure Rate   Relative Importance
main       curframe         2         25        0.0188           0                       1.5
common     wordpointer      1         25        0.188            0.0548                  1.3992370494
common     bsbufold         1         25        0.1876           0                       1.2502694522
main       wrdpntr          1         25        0.0033333333     0                       1.25
main       nprocsamp        4         1         0.0017333333     0                       1.04
common     rval             2         1         0.2857333333     0.0830666667            0.6424178797
common     number_of_bits   2         1         0.1666666667     0.0484                  0.5964698746
layer3     h->linbits       2         1         0.0217333333     0.0012                  0.5414243122
main       bytesinframe     2         1         0.0065333333     0                       0.54

Chapter 5
Results

To validate the results obtained in chapter 4, testing was done to determine the number of relative failed tests using the proposed methodology versus using just local importance for cost-benefit analysis. In this context, a failed test was determined using the standard fault injection method, and a relative failed test was determined using fault injection based on the probability of execution. Table 5.1 and Table 5.2 show the number of failed tests and relative failed tests found by the proposed method and by just using local importance, respectively. Table 5.3 shows the percent difference between the proposed method and a cost-benefit analysis of placing error handling mechanisms using only local importance.
Table 5.1: Number of failed tests found using the proposed methodology
Critical Path #   Failed Tests   Relative Failed Tests
CP1               2135           1145
CP2               6601           1406
CP1/2             5398           1724

Table 5.2: Number of failed tests found using just the local importance
Critical Path #   Failed Tests   Relative Failed Tests
CP1               3647           110
CP2               6551           1110
CP1/2             9100           1155

Table 5.3: % difference between the number of failed tests found
Critical Path #   % diff FR        % diff MFR
CP1               -70.8196721311   90.3930131004
CP2               0.7574609908     21.0526315789
CP1/2             -68.5809559096   33.0046403712

Table 5.4: Proposed Methodology vs. Local Cost-Benefit Analysis
Critical Path (Mp3gain)   # of tests PM   # of tests LCBA   % diff
CP1                       47              640               92.65625
CP1/2                     97              640               84.84375

Just using local importance as a metric for assigning locations for wrapper functions is unreliable; it is influenced primarily by the failure rate and not by other factors, such as the spatial and temporal impact across multiple modules, or the likelihood that the code corruption in that function is meaningful. The proposed methodology, by contrast, does appear to generate satisfactory results. The relative importance metric accounts for the highest possible spatial impact and temporal impact, as well as the likelihood that the corrupted code will be run in the function where the testing is taking place. By injecting faults with a probability equal to the chance that the function will still be run at the time of injection, the number of failures found by the proposed method ranges from 21% to 90% greater than the number of failures found just by local importance. This is meaningful because, when the failure rate is calculated with the fault known to be placed into a particular location for testing purposes, the resulting failure rate loses its relevance to the system as a whole. One must also consider the chance that the fault will even occur in the memory space belonging to that function and the chance that, when the fault occurs, the resulting code corruption will even be executed. Also, Table 5.4 clearly shows a significantly reduced number of tests required to locate these important error handling locations. Given these results, the methodology appears to accurately locate variables with a high failure rate relative to the intended use cases for the system, as well as to drastically reduce the number of tests required to identify important error handling locations in the target software. There is still room for improvement in the design and implementation of the methodology, discussed in chapter 6, but the results indicate a successful completion of the research goal.

Chapter 6
Conclusion and Future Work

The primary benefit this research provides is a methodology for quickly finding important locations for placing lightweight error handling mechanisms. These mechanisms then increase the dependability of the OSS components into which they are placed, allowing systems that have higher standards for dependability and stricter timelines to consider using OSS components. This is an important benefit that will serve as a step towards increasing the use of OSS in more real-world applications and mission-critical systems. OSS has many clear benefits, the most applicable here being the savings in both time and money, which are not being fully utilized in the current market.
The proposed methodology also provides data from a hybrid testing process that combines a fault injection environment with data-flow analysis to identify critical modules and artifacts in the software, mapping out their propagation throughout the software to establish a critical path. This path can then be used to illustrate the most critical variables of the software and how they communicate with each other. These critical variables are determined through a measurement of their importance, driven primarily by relative failure rate and the flow of data between and within modules. This serves to show the benefits of combining the two testing processes for greater effect, and also to further verify the concept of determining the importance of artifacts in the software based on their failure rates and propagation.

The shortcomings of current methodologies are that they are intended for use in earlier stages of the design process and do not allow for bias in the modular structure of the system. That is, all modules and signals are treated with the same level of importance in the greater system view. The proposed methodology attempts to incorporate an understanding of system structure, dependability properties, and insight into the operation of the system as a whole to help determine placement of the error handling mechanisms. For instance, in the case that there is a hub module from which the majority of modules receive input, or certain signals that have a greater effect on the overall system, the proposed methodology will recognize these through code analysis and data-flow analysis. Instead of treating each module as an isolated unit, the proposed methodology also considers the flow between modules before instrumenting the additional error handling mechanisms into the source code.

Lastly, this research is directed at the level of dependability of the OSS systems being tested, but it would be possible to focus the research instead on the level of security or other possible factors. The methodology proposed in this thesis will form a foundation for a more refined methodology that could be molded to fit the needs of the system it is to be integrated into, or possibly a combination of dependability and security. There is still refinement to be done with the current goal of quickly and accurately defining high importance locations for placing error handling mechanisms. Considerations should be made as to the probability that a fault occurs in memory related to a function and how that would change the priority of the modules. Further automation of the designed fault injection framework and instrumentation process should be pursued if the methodology is to see any widespread use on large components or systems.

Bibliography

[1] Leeke, M., Jhumka, A. An Automated Wrapper-based Approach to the Design of Dependable Software. in DEPEND 2011: The Fourth International Conference on Dependability. (Côte d'Azur, France 2011) 43-50.

[2] Arora, A., Kulkarni, S. Detectors and Correctors: A Theory of Fault-Tolerance Components. in 18th International Conference on Distributed Computing Systems. (May 1998) 436-443.

[3] Hiller, M., Jhumka, A., Suri, N. PROPANE: An Environment for Examining the Propagation of Errors in Software. in ACM SIGSOFT International Symposium on Software Testing and Analysis. (New York, USA 2002) 81-85.

[4] Hsueh, M., Tsai, T., Iyer, R. Fault Injection Techniques and Tools. ed. Computer, 30 (4). 75-82. April 1997.

[5] Sosnowski, J., Gawkowski, P., Zygulski, P., Tymoczko, A. Enhancing Fault Injection Testbench. in International Conference on Dependability of Computer Systems. (May 2006) 76-83.
[6] Moraes, R., Duraes, J., Barbosa, R., Martins, E., Madeira, H. Experimental Risk Assessment and Comparison Using Software Fault Injection. in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. (June 2007) 512-521.

[7] Natella, R., Cotroneo, D., Duraes, J., Madeira, H. On Fault Representativeness of Software Fault Injection. ed. IEEE Transactions on Software Engineering, 37 (6). 1-18. 2011.

[8] Leeke, M., Jhumka, A. The Early Identification of Detector Locations in Dependable Software. in 22nd IEEE International Symposium on Software Reliability Engineering. (Hiroshima, Japan 2011) 40-49.

[9] Leeke, M., Arif, S., Jhumka, A., Anand, S. A Methodology for the Generation of Efficient Error Detection Mechanisms. in 41st IEEE/IFIP International Conference on Dependable Systems and Networks. (Hong Kong, China 2011) 25-36.

[10] Cinque, M., Cotroneo, D., Natella, R., Pecchia, A. Assessing and Improving the Effectiveness of Logs for the Analysis of Software Faults. in IEEE/IFIP International Conference on Dependable Systems and Networks. (Chicago, USA 2010) 457-466.

[11] Clark, J., Pradhan, D. Fault Injection: A Method for Validating Computer-System Dependability. ed. Computer, 28 (6). 47-55. June 1995.

[12] Madeira, H., Costa, D., Vieira, M. On the Emulation of Software Faults by Software Fault Injection. in International Conference on Dependable Systems and Networks. (New York, USA 2000) 417-426.

[13] Leeke, M., Jhumka, A. Towards Understanding the Importance of Variables in Dependable Software. in Dependable Computing Conference (EDCC). (Valencia, Spain 2010) 85-94.

[14] Near, J., Milicevic, A., Kang, E., Jackson, D. A Lightweight Code Analysis and Its Role in Evaluation of a Dependability Case. in 33rd International Conference on Software Engineering. (Honolulu, USA 2011) 31-40.

[15] Lisboa, C., Grando, C., Moreira, A., Carro, L. Using Software Invariants for Dynamic Detection of Transient Errors. in 10th Latin American Test Workshop. (Rio de Janeiro, Brazil 2009) 1-6.

[16] Hamill, M., Goseva-Popstojanova, K. Common Trends in Software Fault and Failure Data. ed. IEEE Transactions on Software Engineering, 34 (4). 484-496. August 2009.

[17] Song, N., Qin, J., Pan, X., Deng, Y. Fault Injection Methodology and Tools. in International Conference on Electronics and Optoelectronics (ICEOE). (Dalian, China 2011) V1-47 - V1-50.

[18] Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J., Laprie, J., Martins, E., Powell, D. Fault Injection for Dependability Validation: A Methodology and Some Applications. ed. IEEE Transactions on Software Engineering, 16 (2). 166-182. February 1990.

[19] Barton, J., Czeck, E., Segall, Z., Siewiorek, D. Fault Injection Experiments Using FIAT. ed. IEEE Transactions on Computers, 39 (4). 575-582. April 1990.

[20] Benso, A., Carlo, S. The Art of Fault Injection. ed. Journal of Control Engineering and Applied Informatics, 13 (4). 9-18. August 2011.

[21] Jiang, M., Munawar, M., Reidemeister, T., Ward, P. Automatic Fault Detection and Diagnosis in Complex Software Systems by Information-Theoretic Monitoring. in IEEE/IFIP International Conference on Dependable Systems and Networks. (Estoril, Portugal 2009) 285-294.

[22] Natella, R. Achieving Representative Faultloads in Software Fault Injection. Università degli Studi di Napoli Federico II, November 2011.

[23] Leeke, M., Jhumka, A. Evaluating the Use of Reference Run Models in Fault Injection Analysis. in IEEE Pacific Rim International Symposium on Dependable Computing. (Shanghai, China 2009) 121-124.
[24] Voas, J. Certifying Off-the-Shelf Software Components. ed. Computer, 31 (6). 53-59. June 1998.

[25] Kropp, N., Koopman, P., Siewiorek, D. Automated Robustness Testing of Off-the-Shelf Software Components. in 28th Annual International Symposium on Fault-Tolerant Computing. (Munich, Germany 1998) 230-239.

[26] Voas, J., Ghosh, A. Inoculating Software for Survivability. ed. Communications of the ACM, 42 (7). July 1999.

[27] Khanjani, A., Sulaiman, R. The Aspects of Choosing Open Source versus Closed Source. in IEEE Symposium on Computers and Informatics. (Kuala Lumpur, Malaysia 2011) 646-649.

[28] Nagy, D., Yassin, A., Bhattacherjee, A. Organizational Adoption of Open Source Software: Barriers and Remedies. ed. Communications of the ACM, 53 (3). 148-151. March 2010.

[29] Xiong, L., Tan, Q. Data Flow Error Recovery with Checkpointing and Instruction-level Fault Tolerance. in 12th International Conference on Parallel and Distributed Computing, Applications and Technologies. (Gwangju, Korea 2011) 79-85.

[30] Taylor, R., Osterweil, L. Anomaly Detection in Concurrent Software by Static Data Flow Analysis. ed. IEEE Transactions on Software Engineering, 6 (3). 265-278. May 1980.

[31] Bergeretti, J., Carre, B. Information-Flow and Data-Flow Analysis of while-Programs. ed. ACM Transactions on Programming Languages and Systems, 7 (1). 37-61. January 1985.

[32] Bovenzi, A., Cotroneo, D., Pietrantuono, R., Carrozza, G. Error Detection Framework for Complex Software Systems. in 13th European Workshop on Dependable Computing. (Pisa, Italy 2011) 61-66.

[33] Völzer, H. Verifying Fault Tolerance of Distributed Algorithms Formally - An Example. in 1st International Conference on the Application of Concurrency to System Design. (March 1998) 187-197.

[34] Capra, E., Francalanci, C., Merlo, F., Rossi-Lamastra, C. Firms' Involvement in Open Source Projects: A Trade-off between Software Structural Quality and Popularity. ed. Journal of Systems and Software, 84 (1). 144-161. January 2011.