Static Program Analysis In Presence Of Multiple Configurations
by
Farnaz Behrang
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
August 2, 2014
Keywords: Refactoring, Preprocessor, Static Analysis
Copyright 2014 by Farnaz Behrang
Approved by
Munawar Hafiz, Chair, Assistant Professor of Computer Science
Jeffrey L. Overbey, Assistant Professor of Computer Science
David Umphress, Professor of Computer Science
Abstract
C programs make heavy use of the C preprocessor, which makes them highly config-
urable. C programming IDEs ignore multiple configurations of C preprocessor because of
its complexity. However, program analyses and transformations have to consider all possible
macro configurations; otherwise, they will be incorrect. This problem is exacerbated in the
context of program transformations: it is impossible to implement any non-trivial program
transformation correctly without supporting sophisticated static analysis.
We modified OpenRefactory/C, a framework for building correct and complex program
transformations for C, to support configuration-aware static analysis and program trans-
formations. We modified the program representation to include conditional directives and
extended the static program analysis to be aware of multiple configurations. For evaluation,
we used the Extract Function refactoring implemented in OpenRefactory/C.
ii
Acknowledgments
I would like to thank Dr. Munawar Hafiz and Dr. Jeffrey Overbey for their guidance
and support throughout this work.
iii
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Configuration-aware Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Impact of Multiple Configurations on Refactoring and Static Analysis . . . . 2
1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 OpenRefactory/C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Preprocessor Modifications to support Multiple Configurations . . . . . . . . . . 8
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Handling Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Handling Conditional Directives . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.1 Calculating guarding conditions . . . . . . . . . . . . . . . . . . . . . 14
3.3.2 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Program Representation that Supports Multiple Configuration Preprocessor . . 17
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Abstract Syntax Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Multiple configuration aware static analysis . . . . . . . . . . . . . . . . . . . . 22
iv
5.1 Control Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.2 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.3 Control Entry and Control Exit . . . . . . . . . . . . . . . . . . . . . 25
5.1.4 Calculating Successors and Predecessors . . . . . . . . . . . . . . . . 25
5.2 Data Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Preprocessor Dependence Preservation Analysis . . . . . . . . . . . . . . . . 28
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.1 General Testing Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 A Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.3.1 Designing Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.3.2 Extract Function Refactoring . . . . . . . . . . . . . . . . . . . . . . 34
6.3.3 Evaluating Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
v
List of Figures
1.1 An example to apply Extract Function refactoring . . . . . . . . . . . . . . . . 3
1.2 Code segment from Zlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1 An example of a macro with multiple definitions . . . . . . . . . . . . . . . . . . 9
3.2 Use of a macro with multiple definition . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Another example of a macro with multiple definitions . . . . . . . . . . . . . . . 10
3.4 An example of incompatible conditions . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 An example of conditions causing infinite loop . . . . . . . . . . . . . . . . . . . 14
3.6 A segment of code extracted from Zlib . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 An example of breaking the complete C constructs . . . . . . . . . . . . . . . . 18
4.2 Examples extracted from GNU Coreutils 8.21 . . . . . . . . . . . . . . . . . . . 18
4.3 Grammar to include C preprocessor directives . . . . . . . . . . . . . . . . . . . 19
4.4 An example extracted from gmp-4.3.2 . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Partial AST generated for the example in figure 4.4 . . . . . . . . . . . . . . . 20
5.1 A code snippet extracted from GNU Coreutils 8.21 . . . . . . . . . . . . . . . . 24
5.2 CFG for figure 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
vi
5.3 Strategy to calculate non jump exits of directive condition node . . . . . . . . . 25
5.4 Strategy to calculate predecessors of directive condition node . . . . . . . . . . . 26
5.5 Strategy to calculate successors of directive condition node . . . . . . . . . . . . 26
5.6 Strategy to calculate successors for every item of a conditional directive block
item list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.7 Strategy to calculate predecessors for every item of a conditional directive block
item list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.8 A code snippet extracted from gmp-4.3.2 . . . . . . . . . . . . . . . . . . . . . . 28
5.9 An example of moving macro to undefined region . . . . . . . . . . . . . . . . . 29
6.1 An example extracted from chmod.c . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Expansion of macro S_IRWXU . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.3 Test coverage results for MoveLocalVariable micro refactoring using EclEmma . 37
6.4 Part of Add Brace micro refactoring code . . . . . . . . . . . . . . . . . . . . . 38
vii
List of Tables
3.1 Guarding conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Nodes of AST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Reaching definition information for code in figure 5.8 . . . . . . . . . . . . . . . 28
6.1 Failure and cluster statistics for C projects . . . . . . . . . . . . . . . . . . . . . 32
6.2 Test coverage results for Extract Function micro refactorings using EclEmma . . 37
viii
Chapter 1
Introduction
Refactorings are known to improve the design and structure of the code, making it more
maintainable and easier to understand and modify. Refactoring tools are included in modern
IDEs such as Eclipse and Visual Studio to change the source code in a way that preserves
behavior.
Extensive work has been done to provide refactoring tools in IDEs that support object-
oriented programming languages like Java and C#. However, in spite of its popularity, the
same is not true for C. Existing refactoring tools for C only support primitive refactorings
without any sophisticated analysis. Gligoric et al. [14] applied Eclipse?s refactorings for Java
and C to a number of open source codes; 1.4% of the Java test cases failed, while 7.5% of
the C tests failed.
There are many factors that make C difficult to refactor correctly. One of the main
factors is heavy use of C preprocessor [7]. While using preprocessor directives makes the C
code more flexible, its lack of structure adds complexity. Since preprocessor directives do
not conform to the C grammar, code must be preprocessed first before it can be parsed. It is
impossible to apply refactorings on preprocessed version of C code since the un-preprocessed
version cannot be recovered later. Most IDEs including Eclipse CDT provide program analy-
sis and refactoring support based on a single preprocessor configuration, which only considers
one branch of conditional directives. Supporting multiple preprocessor configurations will
enable the refactoring tool to analyze and transform the un-preprocessed source code con-
sidering all possible configurations.
1
It is impossible to implement refactorings correctly without any sophisticated static
analysis. Preprocessor directives and multiple configurations makes the analysis more com-
plicated since all possible macro configurations need to be considered.
1.1 Configuration-aware Analysis
Configuration-aware analyses is a class of analyses that analyze code with respect to
all possible macro configurations. This analyses recently has received attention among re-
searchers in different contexts including type checking, data-flow analysis, and model check-
ing. The idea behind all of them is to analyze the similar code among variants only once to
make the analysis more efficient and applicable to large-scale systems. However, the details
are different.
The most similar work to ours is due to Liebig et al. [25]. They implemented variability-
aware type checking and liveness analysis to compare the performance of these variability-
aware analysis with the performance of corresponding sampling heuristics (single configura-
tion, pair-wise, and code coverage). They found that the performance of variability-aware
analysis scales to large scale systems such as the Linux kernel. However, they did not apply
their analysis in any application areas other than empirical studies. Refactoring tools are
practical application of configuration-aware analysis which present challenges specific to this
context. WeusedOpenRefactory/Casapracticalinfrastructuretoimplementconfiguration-
aware analysis which is a real necessity to perform certain refactorings correctly.
1.2 Impact of Multiple Configurations on Refactoring and Static Analysis
The existing refactoring tools in C ignore multiple preprocessor configurations since it
adds more complexity to build correct and complex refactoring tools. Instead, they con-
sider single configuration which only provides partial information. Therefore, the resulted
refactoring and static analysis is inaccurate.
2
Figure 1.1 shows a sample code. If we try to extract "x = 0;" using an extract function
refactoring that only supports a single configuration, the result will be the program in fig-
ure 1.2 on the left. The conditional directive has been evaluated, and since macro "M" is not
defined, the #if branch is ignored. However, supporting multiple preprocessor configurations
will result in the code on the right, which considers both branches of #if directive.
int main(){
#ifdef M
int x;
#else
unsigned int x;
#endif
x = 0;
printf("x is %d", x);
}
(a) before refactoring
unsigned int foo(unsigned int x) {
x = 0;
return x;
}
(b) single configuration
#ifdef M
int foo(int x) {
x = 0;
return x;
}
#else
unsigned int foo(unsigned int x) {
x = 0;
return x;
}
#endif
(c) multiple configuration
Figure 1.1: An example to apply Extract Function refactoring
Reaching definition analysis and definition-use chains are used to identify which param-
eters to pass to a method and which value to return from an extracted method.
Consider the code segment in figure 1.2, extracted from the Zlib data compression
library, v1.2.5, trees.c, lines 970 to 991. Variable "opt_lenb" is defined on line 970 and used
on lines 976 and 990. We want to extract line 970 using extract function refactoring. If
only #ifdef branches are taken, ignoring #else branches, there would not be any uses of
"opt_lenb" afterextraction. Hence, theextractedfunctionwon?treturnanyvalue. However,
considering all the branches there would be uses of "opt_lenb" which requires the extracted
method to return this variable.
Since libraries might be used under different system configurations, no prediction can
be made about taking a particular branch by conditional directives. Therefore, the pro-
gram analysis and refactorings must consider all feasible possibilities, otherwise they will be
incorrect.
3
970 opt_lenb = static_lenb = stored_len + 5;
...
973 #ifdef FORCE_STORED
974 if (buf != (char*)0) { /* force stored block */
975 #else
976 if (stored_len+4 <= opt_lenb && buf != (char*)0) {
977 /* 4: two words for the lengths */
978 #endif
987 #ifdef FORCE_STATIC
988 } else if (static_lenb >= 0) { /* force static trees */
989 #else
990 } else if (s->strategy == Z_FIXED || static_lenb == opt_lenb) {
991 #endif
Figure 1.2: Code segment from Zlib
1.3 Thesis Statement
Correct program transformations on C programs depend on correctly accounting for
multiple configurations. This involves creating a program representation that supports mul-
tiple configurations, augmenting program analyses, and introducing new program analyses
to include preprocessor directives. The program transformations should be tested for cor-
rectness on real programs in addition to synthetic test cases.
1.4 Thesis Outline
The outline of this thesis is:
 Chapter 2 describes the frame work we worked on. OpenRefactory/C is an infras-
tructure for building correct and complex refactorings and other source-level program
transformations for C.
 Chapter 3 discusses handling of macros and conditional directives in presence of mul-
tiple configuration
 Chapter 4 discusses the program representation that allows multiple configuration sup-
port
4
 Chapter 5 discusses the multiple configuration, preprocessor-aware static analysis
 Chapter 6 discusses the evaluation of multiple configuration support using Extract
Function Refactoring
 Chapter 7 discusses the related work
 Chapter 8 discusses the future work and conclusion
5
Chapter 2
OpenRefactory/C
OpenRefactory/C [17] is a C-specific infrastructure for building refactorings and sim-
ilar source-level transformations, while correctly handling all of the complexities of C. It
is written in Java and supports different front ends including Eclipse plug-in, Vim plug-in,
command line, and web demo. Twelve program transformations including both refactorings
and behavior enhancing transformations [16] has also been implemented in this framework.
OpenRefactory/C supports different static analysis:
 Name binding Analysis: Name binding information is stored in the abstract syntax
tree (AST). Any AST node representing a name reference can be queried to return the
corresponding declaration node.
 Type Analysis: Type information is computed as a query to the AST using name
binding information.
 Control flow Analysis : Control flow information is computed dynamically querying
AST nodes using improved version of Morgenthaler?s [27] algorithm.
 Alias Analysis : OpenRefactory/C supports an inclusion-based (Andersen-style) alias
analysis for local variables.
 Data flow Analysis : OpenRefactory/C stores intraprocedural reaching definitions and
definition-use relationships in the AST.
 Dependence Analysis : OpenRefactory/C supports scalar data dependence analysis
for local variables, as well as computation of control dependences. Control and data
6
dependences check behavior preservation of transformations that move code within a
procedure.
OpenRefactory/C?sCparserandrewritableASTaregeneratedusingLudwig[29], which
was also used to generate the syntactic manipulation infrastructure in Photran. OpenRefac-
tory/C uses mutable ASTs as its primary program representation. Source code manipulation
is performed by modifying the AST and later traversing the tree to output the revised source
code (or create a patch file). All AST nodes implement a common interface that includes
methods to perform tree traversals using the Visitor pattern, find nodes by type, determine
preprocessor constructs affixed to nodes, and of course, manipulate the source code associ-
ated with a node. Like the majority of refactoring tools, OpenRefactory/C uses AST for
both analysis of the code and transformation.
One of the main research goals for OpenRefactory/C is to support analysis and trans-
formation of code containing C preprocessor directives, with correct results under all feasible
preprocessor configurations. These changes require intricate modifications to every part of
the infrastructure: the preprocessor, the lexical analyzer, the parser, the AST, every static
analysis, and every transformation.
The following chapters of this thesis will discuss about the modifications that we made
to every part in order to support configuration-aware static analysis (control flow and data
flow analysis) and configuration-aware transformations.
7
Chapter 3
Preprocessor Modifications to support Multiple Configurations
3.1 Introduction
There are two ways to construct an AST from source code containing preprocessor
directives. One way would be to preprocess the code, parse it and construct the AST. Using
this approach, branches of conditional directives except one will be lost. Since we want
to consider all feasible preprocessor configurations, we need to pseudo-preprocess [9] the
code. When a pseudo-preprocessor is used, the structure of AST is similar to the code after
preprocessing has been applied while it still contains all feasible preprocessor configurations
of the code.
OpenRefactory/C follows the pseudo-preprocessing model described by Garrido [9]. It
includes a lexical analyzer and a parser for generating ASTs. There is a customized version
of C preprocessor that feeds the lexical analyzer.
In this chapter we discuss about the modifications made in OpenRefactory/C?s pre-
processor to enable multiple configuration support. Since during pseudo-preprocessing all
the feasible branches are considered simultaneously, macros may have multiple expansions.
In the rest of the chapter, we will discuss about determining feasible configurations and
calculating the guarding condition for every point of the source code.
3.2 Handling Macros
Considering all feasible preprocessor configurations, macros may have more than one
definition under different branches of conditional directives. Figure 3.1 shows an example
8
of a macro with different definitions in each branch extracted from /usr/include/i386-linux-
gnu/bits/stdio.h.
#ifndef __extern_inline
# define __STDIO_INLINE inline
#else
# define __STDIO_INLINE __extern_inline
#endif
Figure 3.1: An example of a macro with multiple definitions
To make multiple definitions of a macro name distinguishable, we have to keep track
of guarding conditions associated with each macro definition. Later, when we expand a
macro call, every possible expansion is guarded by some guarding conditions. The guarding
condition for the first definition of macro __STDIO_INLINE is !defined (__extern_inline)
and for the second definition is defined(__extern_inline). Figure 3.2 shows a use of this
macro before and after expansion, in the same file. For this simple example, only keeping
track of the immediate level of guarding conditions is sufficient. However, this is not always
true.
__STDIO_INLINE int
fgetc_unlocked (FILE *__fp)
{
return _IO_getc_unlocked (__fp);
}
(a) Before expansion
#ifndef __extern_inline
inline
#else
__extern_inline
#endif int
fgetc_unlocked (FILE *__fp)
{
return _IO_getc_unlocked (__fp);
}
(b) After expansion
Figure 3.2: Use of a macro with multiple definition
During preprocessing when we encounter a macro, we are not aware if there will be more
macros with the same name. More importantly, we are not aware of the sufficient level of
conditions we have to consider to make multiple definitions of that macro distinguishable.
Hence, we have to keep track of all the conditions. However, later when we do the expansion,
we will determine the sufficient level of guarding conditions we need for each macro to make
it distinguishable from other macros with the same name.
9
Figure3.3showsasegmentofcodeextractedfrom/usr/include/i386-linux-gnu/sys/cdefs.h,
lines 380 to 410. Function like macro __REDIRECT_LDBL is defined on lines 393 and 406. For
this example, if we just keep track of the enclosing guarding conditions, it would be lines 382
and 405 which is similar for both definitions. In this case, we have to consider another level
of guarding conditions as well to make two definitions distinguishable during expansion.
380 #if defined __LONG_DOUBLE_MATH_OPTIONAL && defined __NO_LONG_DOUBLE_MATH
381 # define __LDBL_COMPAT 1
382 # ifdef __REDIRECT
...
393 # define __REDIRECT_LDBL(name, proto, alias) \
394 __LDBL_REDIR1 (name, proto, __nldbl_##alias)
...
397 # endif
398 #endif
399 #if !defined __LDBL_COMPAT || !defined __REDIRECT
...
405 # ifdef __REDIRECT
406 # define __REDIRECT_LDBL(name, proto, alias) __REDIRECT (name, proto, alias)
...
409 # endif
410 #endif
Figure 3.3: Another example of a macro with multiple definitions
Thus, we wrote an algorithm to determine the sufficient level of guarding conditions
for every macro that we expand with more than one definition. Generally, two cases may
happen:
 The enclosing guarding condition for the macro definition is either #elif or #else
directive. In this case, we have to find the nearest if directive. (steps 1-2)
 The enclosing guarding condition for the macro definition is either #if or #ifdef or
#ifndef directive:
? The enclosing guarding condition is not the same for both definitions. We keep
the condition, but still find the nearest if directive for both definitions since the
#if directive of one of the definitions might be in the #elif or #else branches
10
of the other definitions. If they are the same, then keep these conditions as well.
(steps 3-16)
? The enclosing guarding condition is the same for both definitions. We keep the
conditions until the conditions are not the same any more (the example showed
in figure 3.3 ). If we have reached an #elif or #else directive, we find the nearest
#if directive. (steps 17-24)
In the following algorithm, the main data structure used is stack since we need the
immediate level of guarding conditions first. C1 is the stack which contains the guarding
conditions associated with the first macro and C2 contains the guarding conditions of the
second macro. Stack C is the final conditions that should be considered for the first macro.
This algorithm is just showed to find the conditions for two macro definitions with the same
name. However, it can easily be used to determine conditions if multiple definitions exist.
1: if C1.top is elif directive OR C1.top is else directive then
2: C = FindNearestIfDirective(C1, C)
3: else
4: if pop(C1).top != pop(C2).top then
5: push(C, pop(C1))
6: if C1 is not empty then
7: temp = FindNearestIfDirective(C1, temp)
8: if C2.top is elif directive OR C2.top is else directive then
9: while C2 is not empty AND C2.top is not if directive do
10: pop(C2)
11: end while
12: end if
13: if temp is not empty AND temp.top == C2.top then
14: push(C, temp)
15: end if
11
16: end if
17: else
18: while C1 is not empty AND C2 is not empty AND C1.top == C2.top do
19: push(C, pop(C1))
20: C2.pop
21: end while
22: C = FindNearestIfDirective(C1, C)
23: end if
24: end if
25:
26: function FindNearestIfDirective(C1, C)
27: while C1 is not empty AND C1.top is not if directive do
28: push(C, pop(C1))
29: end while
30: push(C, pop(C1))
31: return C
32: end function
3.3 Handling Conditional Directives
Garrido [9] claims in her thesis that "A refactoring in the presence of CPP conditionals
is correct if and only if it is correct for all possible system configurations". In other words,
each possible configuration should be analyzed and refactored together. One approach is
to analyze and refactor each configuration separately, but this approach does not scale to
large scale systems in the presence of exponential configurations like the Linux Kernel which
provides more than 10,000 configurable features, which can be combined almost arbitrarly
[34]. The approach we are using is to process all configurations simultaneously [12]. It only
12
constructs a single AST by analyzing the similar code between configurations only once.
Liebig et al. [25] showed this approach works even in the case of the Linux kernel.
In order to consider all feasible configurations of a program, we consider every branch
to be true without evaluating the conditions. Except these cases that we consider them as
infeasible:
 Conditions that are always false such as "0" or "!1".
351 #ifndef _LIBC
352 #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_))
353 #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_))
354 #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_))
355 #else
356 extern _IO_FILE *_IO_stdin attribute_hidden;
357 extern _IO_FILE *_IO_stdout attribute_hidden;
358 extern _IO_FILE *_IO_stderr attribute_hidden;
359 #endif
Figure 3.4: An example of incompatible conditions
 Conditions that do not comply with our requirements such as "defined __cplusplus"
or"defined __GNUG__"whichequalsto"defined __GNUC__ && defined __cplusplus".
 There are some conditions that cannot be true at the same time. Garrido [9] called
these conditions incompatible. Consider figure 3.4 as an example of incompatible
conditions, extracted from /usr/include/libio.h, lines 351 to 359. If you first take
the #ifndef branch and then the #else branch, _IO_stdin in the #else branch is
considered as a macro and it is expanded which it unexpected.
 There are some conditions that considering them to be always true creates an infinite
loop. They mostly happen in the case of including header files. Figure 3.5 shows seg-
mentofcodesfrom/usr/include/features.hand/usr/include/i386-linux-gnu/sys/cdefs.h.
As you can see, features.h includes cdefs.h in the case macro _SYS_CDEFS_H is not de-
fined and cdefs.h also includes features.h. If we consider the guarding conditions of
these two include files always true, there will be an infinite loop. Therefore, although
13
we do not evaluate the conditions, we need to keep track of them to determine if some
conditions should not be taken.
Whenwereachaconditionaldirectiveduringpreprocessing, weareawareofthemacros
that has been defined so far. Considering the defined macros at that point and the
directive condition, we create a logical expression and then determine if the expression
could always be false. If so, we won?t take the branch anymore. For the example in
figure 3.4, the first time that we include features.h, macro _SYS_CDEFS_H has not been
defined yet, so we include cdefs.h but next time since we had this macro defined in
cdefs.h, we won?t take the branch anymore and skip it.
#ifndef __ASSEMBLER__
# ifndef _SYS_CDEFS_H
# include <sys/cdefs.h>
# endif
(a) features.h
#define _SYS_CDEFS_H 1
/* We are almost always included from features.h. */
#ifndef _FEATURES_H
# include <features.h>
#endif
(b) cdefs.h
Figure 3.5: An example of conditions causing infinite loop
3.3.1 Calculating guarding conditions
Every point in the source code has a guarding condition. These guarding conditions are
used in the static analysis where we need to annotate the control-flow graph edges. In this
chapter, we only talk about calculating the conditions and discuss it in more details later in
chapter 5.
Each part of a source code has a set of conditional directives. The guarding condition
of that part is the conjunction of conditions that is calculated for each conditional directive.
The condition for a conditional directive is calculated as follows:
 When conditional directive is #if constant-expression, the condition is the constant-
expression.
14
 When conditional directive is either #ifdef identifier or #ifndef identifier, the condi-
tions are defined(identifier) and !defined(identifier) respectively.
 When conditional directive is #elif constant-expression, the condition is the con-
junction of constant-expression and negate of all previous constant-expressions and
identifiers.
 Whenconditionaldirectiveis #else, theconditionisthenegateofallpreviousconstant-
expressions and identifiers.
We keep track of the conditions as an attribute for AST nodes, so that we could easily
retrieve them later. Since we probably won?t need to be aware of conditions for all the AST
nodes, we decided to store them just for identifiers in order to save time and space. However,
in the case that we need them for all the nodes later it could be easily set for them as well.
3.3.2 Running Example
Figure 3.6 shows a segment of code extracted from Zlib-1.2.5, crc32.c, lines 38 to 50.
Table 3.1 shows the guarding condition for some lines of the program.
38: # if (UINT_MAX == 0xffffffffUL)
39: typedef unsigned int u4;
40: # else
41: # if (ULONG_MAX == 0xffffffffUL)
42: typedef unsigned long u4;
43: # else
44: # if (USHRT_MAX == 0xffffffffUL)
45: typedef unsigned short u4;
46: # else
47: # undef BYFOUR /* can?t find a four-byte integer type! */
48: # endif
49: # endif
50: # endif
Figure 3.6: A segment of code extracted from Zlib
15
Line# Guarding Condition
39 (UINT_MAX == 0xffffffffUL)
42 !(UINT_MAX == 0xffffffffUL)&&(ULONG_MAX == 0xffffffffUL)
45 !(UINT_MAX == 0xffffffffUL)&&!(ULONG_MAX == 0xffffffffUL)&&(USHRT_MAX == 0xffffffffUL)
47 !(UINT_MAX == 0xffffffffUL)&&!(ULONG_MAX == 0xffffffffUL)&&!(USHRT_MAX == 0xffffffffUL)
Table 3.1: Guarding conditions
3.4 Conclusions
This chapter discussed about modifications made to OpenRefactory/C?s preprocessor
in presence of multiple configurations. We talked about handling macros with multiple defi-
nitions, determining feasible conditional directives and calculating the guarding conditions.
Now the result is a token stream that should be parsed to generate the AST. The modifica-
tions of the parser is currently under development and it is beyond the scope of this thesis.
In the next chapter we will describe the modifications we made to generate AST.
16
Chapter 4
Program Representation that Supports Multiple Configuration Preprocessor
As mentioned in chapter 2, OpenRefactory/C uses rewritable ASTs to represent the
source code. ASTs are constructed during parsing where nodes are created to represent the
grammar productions. Before parsing a C program, the program has to be preprocessed
first by the C preprocessor. Then, the resulting token stream is parsed. However, the
grammar of the C parser does not include preprocessor constructs such as macros, #include
directives, and conditional directives (like #if). Thus, we must extend the C grammar to
include the preprocessor constructs. This way, the generated AST has some additional nodes
that represent the information of preprocessor directives. In this chapter, we talk about the
modifications we made to the grammar to generate an AST that reflects C preprocessor
information.
4.1 Introduction
The preprocessor constructs are not necessarily enclosed by statements, they might ap-
pear anywhere in the code while breaking the complete C constructs. Liebig et al. [24]
gives a description that "Disciplined annotations are annotations on certain syntactic code
structures, such as entire functions and statements, whereas we call annotations of indi-
vidual tokens or brackets that do not align with underlying code structure undisciplined
annotations". Figure 4.1 shows an example of a code structure that has undisciplined an-
notations. The code segment is extracted from GNU Coreutils version 8.21, printf.c. The
syntax of labeled-statement in C is case constant-expression : statement but in this example
the constant-expression is enclosed by an #if directive which breaks a complete C construct.
17
switch (*f)
{
#if (__GLIBC__ == 2 && 2 <= __GLIBC_MINOR__) || 3 <= __GLIBC__
case ?I?:
#endif
...
Figure 4.1: An example of breaking the complete C constructs
Making the grammar flexible enough to be able to parse all these cases will result in a
large and ambiguous grammar. Garrido [9] describes a technique that is called completing
the conditional to handle the incomplete C constructs. First, she tokenizes the input and
recognizes incomplete preprocessor conditionals. Then, she moves them forward or backward
as necessary to complete the conditionals. Although the process is reasonable, it is heuristic-
based and it may become complicated in the presence of nested conditionals.
The OpenRefactory/C?s approach is to extend the C grammar, based on Section 6.10 of
the ISO C99 standard [18], in order to include productions of preprocessor directives in the
grammar. However, the grammar only allows preprocessor directives in between complete C
constructs including statement, declaration, and function definition.
Figure 4.2 shows two examples extracted from GNU Coreutils 8.21, stty.c. In figure 4.2
(a), the statement is surrounded by a conditional directive while on figure 4.2 (b), the con-
ditional directive appears in between the statement. In the case that a conditional directive
appears at an unexpected location, a new parsing algorithm [13] that is currently under
development in OpenRefactory/C will be able to handle it.
#if VEOF == VMIN
if ((mode->c_lflag & ICANON) == 0
&& (STREQ (control_info[i].name, "eof")
|| STREQ (control_info[i].name, "eol")))
continue;
#endif
(a)
#if VEOF == VMIN
if ((mode->c_lflag & ICANON) == 0)
#endif
wrapf ("min = \%lu; time = \%lu;",
(unsigned long int) mode->c_cc[VMIN],
(unsigned long int) mode->c_cc[VTIME]);
(b)
Figure 4.2: Examples extracted from GNU Coreutils 8.21
18
4.2 Abstract Syntax Tree
OpenRefactory/C?s internal program representation is a rewritable AST generated by
Ludwig [29]. Figure 4.3 shows the simplified version of the extended grammar productions
to include C preprocessor directives in the AST. The grammar follows the Ludwig?s gram-
mar notation, a variation of Backus-Naur Form (BNF), where nonterminals are denoted by
names in <angle-brackets>, while terminals are denoted by names in SMALL-CAPS. Dif-
ferent alternatives are separated by ?|? character and ??? indicates the optional nonterminals.
Directives such as #define are omitted from the tree since they do not necessarily fit into
the syntactic structure.
hdirective-block-item-listi::= hdirective-block-itemi|
hdirective-block-item-listihdirective-block-itemi
hdirective-block-itemi::= hdeclarationi
|hstatementi
|hfunction-definitioni
|hdirectivei
hmacro-condition-identifieri::= IDENTIFIER
hif-directivei::= #ifdefhmacro-condition-identifierihdirective-block-item-listi?
| #ifndefhmacro-condition-identifierihdirective-block-item-listi?
| #ifhconstant-expressioniNEWLINEhdirective-block-item-listi?
helse-directivei::= #elsehdirective-block-item-listi?
| #elifhconstant-expressioniNEWLINEhdirective-block-item-listi? helse-directivei?
hdirectivei::= hif-directiveihelse-directivei? #endif
Figure 4.3: Grammar to include C preprocessor directives
4.2.1 Running Example
Figure 4.4 shows an example extracted from gmp-4.3.2, gen.c, lines 350 to 356. We used
the above mentioned grammar to generate the AST in OpenRefactory/C. Figure 4.5 shows
19
the AST for the example in figure 4.4. We only show the part of AST that is related to
preprocessor directives. Every node represents part of the code. Since some nodes were big,
we only represent the node type and the method that is used to get that node, in the AST.
The rest of the information is listed in table 4.1.
350: #if HAVE_STRTOUL
351: n = strtoul (argv[0], (char **) NULL, 10);
352: #elif HAVE_STRTOL
353: n = (unsigned long int) strtol (argv[0], (char **) NULL, 10);
354: #else
355: n = (unsigned long int) atoi (argv[0]);
356: #endif
Figure 4.4: An example extracted from gmp-4.3.2
1:CConditionalDIrective
2:CIfDirective
3:Identifier 4:ASTListNode
5:CConditionalDIrective
6:CIdentifier
Expression
7:ASTList
Node
8:CElse
Directive
9:ASTList
Node
getIfDirective()
getIdentifier()
getDirective
BlockItemList()
getElseDirective()
getConstantExpression()
get
Directive
BlockItemList() getElseDirective()
getDirective
BlockItemList()
Figure 4.5: Partial AST generated for the example in figure 4.4
20
Node# Node Type Node
1 CConditionalDirective #if HAVE_STRTOUL
n = strtoul (argv[0], (char **) NULL, 10);
#elif HAVE_STRTOL
n = (unsigned long int) strtol (argv[0], (char **) NULL, 10);
#else
n = (unsigned long int) atoi (argv[0]);
#endif
2 CIfDirective #if HAVE_STRTOUL
n = strtoul (argv[0], (char **) NULL, 10);
3 Identifier HAVE_STRTOUL
4 ASTListNode n = strtoul (argv[0], (char **) NULL, 10);
5 CConditionalDirective #elif HAVE_STRTOL
n = (unsigned long int) strtol (argv[0], (char **) NULL, 10);
#else
n = (unsigned long int) atoi (argv[0]);
#endif
6 CIdentifierExpression HAVE_STRTOL
7 ASTListNode n = (unsigned long int) strtol (argv[0], (char **) NULL, 10);
8 CElseDirective #else
n = (unsigned long int) atoi (argv[0]);
9 ASTListNode n = (unsigned long int) atoi (argv[0]);
Table 4.1: Nodes of AST
4.3 Conclusion
This chapter described the modifications we made in AST in order to include the C
preprocessor directives. We made these changes in AST since we want to be aware of
multiple configuration information during static analysis and refactorings. Now that we
modified AST with additional preprocessor nodes, in the next chapter we could look into
extending the current static analysis in OpenRefactory/C to be preprocessor aware.
21
Chapter 5
Multiple configuration aware static analysis
In the previous chapters, we discussed about integration of multiple configuration sup-
port to OpenRefactory/C. We changed the syntactic structure and also the AST. Corre-
spondingly, the program analysis methods need to be modified. In chapter 1, we discussed
about the impact of multiple configuration on static analysis and refactorings. In this chap-
ter, we will talk about supporting static analysis for multiple configurations.
5.1 Control Flow Analysis
It is possible to derive control flow information directly from a program?s AST. Control
flow information abstracts the actual behavior of the program which is explicitly represented
by control flow graph (CFG). CFG includes all possible execution paths of a program and
it is also used for most data flow analysis. Each node of a program has a set of control
predecessorsfromwhichcontrolmayflowandasetofcontrolsuccessorstowhichcontrolmay
flow [27]. To create a CFG for a program, we need to compute successors and predecessors
of every node of the program. In this section, we discuss about computing successors and
predecessors in presence of preprocessor directives and multiple configurations.
5.1.1 Introduction
In chapter 4 we modified the AST to represent preprocessor information by adding addi-
tional nodes. Now, we need to determine how to represent each additional node in the CFG
and how to calculate the successors and predecessors. Two additional nodes that we are con-
sideringwhenconstructingCFGisCConditionalDirectiveandCMacroConditionIdentifier.
22
CConditionalDirective noderepresentseveryconditionaldirectiveintheAST.Itmight
have two children, CIfDirective and CElseDirective. Note that as shown in the grammar
in chapter 4, #elif directive is treated as else if structure in regular conditional statement
to make the analysis easier.
CMacroConditionIdentifier as shown in the grammar, is an identifier that is the con-
dition in #ifdef or #ifndef directives. Although identifier itself is not among flow nodes, we
created CMacroConditionIdentifier to be able to treat #ifdef and #ifndef directives the
same as #if and #elif directives. This way the analysis are simpler and more consistent.
From this point, when we use #if directive it represents #ifdef and #ifndef directives as
well unless otherwise mentioned.
Generally, we deal with conditional directives like regular conditional statements. In
other words, when constructing a CFG, we express configurations with conditional branches.
As we mentioned in chapter 3, we store the presence conditions of every AST node as an
attribute. Thus, we can annotate the branch edges of CFG as either true or false. How-
ever, there is one difference when dealing with conditional directives. In regular conditional
statements, a single item which is a single AST node comes after if(expression), while in a
conditional directive a block of items comes after #if constant-expression which contains
multiple AST nodes. In the next section we look into an example of constructing a CFG for
a program that contains preprocessor directives.
5.1.2 Running Example
Figure 5.1 shows a code snippet extracted from GNU Coreutils 8.21, timeout.c, lines
349 to 363 and figure 5.2 shows the corresponding CFG where nodes are line numbers, T
stands for true and F stands for false.
As shown in the CFG, #if directive on line 349 might have two successors. One is on
line 350 when we take the true branch, and the other one is #elif directive on line 353. On
23
line 353, there is another conditional branch. If we take the true branch, its successor is on
line 356, if not, its successor is on line 360 in the #else directive.
349: #if HAVE_PRCTL && defined PR_SET_DUMPABLE
350: if (prctl (PR_SET_DUMPABLE, 0) == 0)
351: return true;
352:
353: #elif HAVE_SETRLIMIT && defined RLIMIT_CORE
354: /* Note this doesn?t disable processing by a filter in
355: /proc/sys/kernel/core_pattern on Linux. */
356: if (setrlimit (RLIMIT_CORE, &(struct rlimit) {0,0}) == 0)
357: return true;
358:
359: #else
360: return false;
361: #endif
362:
363: error (0, errno, _("warning: disabling core dumps failed"));
Figure 5.1: A code snippet extracted from GNU Coreutils 8.21
349
350
351
353
356
357
360
363
T
F
T
F
Figure 5.2: CFG for figure 5.1
Inthefollowingsectionsweintroducesomeconceptsthatweareusingduringcalculation
of successors and predecessors.
24
5.1.3 Control Entry and Control Exit
In calculating the successors and predecessors of a flow node, we use the concepts of
control entry and control exit [27]. Control enters the CFG through control entry and exits
through control exit. For instance, control entry of a selection statement like if(E) S is E.
Control exits might be more than one, true and false exits. Using logical operators such as
|| and && in control expressions make multiple exits to happen. For instance, for expression
A && B, there is a true exit of B and false exits of A and B. Control exits could also be
either jump or nonjump exit. Jump exits include return, break, continue, and goto exits
while nonjump exits are only regular control exits like E in if(E) S which is also a false exit.
For conditional directives, we consider control entry to be the control entry of constant
expression in #if or #elif directives, or CMacroConditionIdentifier in #ifdef or #ifndef
directives. For calculating control exits, first we use the strategy in figure 5.3 to calculate
nonjump exits, then we calculate jump exits by just finding return, break, continue, and goto
exits. Now that we have information of control entry and exit in CFG, in the next section
we look into strategies of computing successors and predecessors in presence of preprocessor
directives.
1: - Get the corresponding directive block item list
2: - Add control exits of last block item to list of exits
3: if directive is #if then
4: - Add exits of constant expression to list of exits
5: else
6: - Add CMacroConditionIdentifier to list of exits
7: end if
8: - repeat the same procedure for #elif and #else directives
9: - Return list of exits
Figure 5.3: Strategy to calculate non jump exits of directive condition node
5.1.4 Calculating Successors and Predecessors
In order to calculate successors and predecessors in presence of preprocessor directives,
we add two methods to their current implementation in OpenRefactory/C. One is to find
25
successors/predecessors of directive condition nodes and the other one is calculating succes-
sors/predecessors for every item that belongs to a conditional directive block item list.
Figure 5.4 and 5.5 show the strategy used to find successors of directive condition node
that is either constantExpression or MacroConditionIdentifier in #if or #elif directives.
1: if the condition is for #if directive then
2: - Return the predecessor of #if directive as part of another block item list
3: else
4: - find the the previous #if or #elif branch
5: - Return false exits of that branch
6: end if
Figure 5.4: Strategy to calculate predecessors of directive condition node
1: if the directive block item list is not empty then
2: - Add control entry flow node of first item in the list to successors
3: end if
4: if the directive has either #elif or #else branches then
5: if the directive has #elif branch then
6: - Add control entry flow node of #elif constant expression to successors
7: else
8: if the #else directive block item list is not empty then
9: - Add control entry flow node of first item in the list to successors
10: else
11: - Find the nearest #if directive
12: - Add successors of #if directive to the list of successors
13: end if
14: end if
15: else
16: - Find the nearest #if directive
17: - Add successors of #if directive to the list of successors
18: end if
19: - Return list of successors
Figure 5.5: Strategy to calculate successors of directive condition node
Figure 5.6 and 5.7 show the strategy used to compute successors and predecessors for
every item that belongs to a conditional directive block item list (conditional directive could
be either #if, #elif, or #else directives).
26
1: - Get the corresponding directive block item list
2: if statement is not the last item in the list then
3: - Return control entry flow node of the next item in the list
4: else
5: - Find the nearest #if directive
6: - Return the successor of condition node
7: end if
Figure 5.6: Strategy to calculate successors for every item of a conditional directive block
item list
1: - Get the corresponding directive block item list
2: if statement is not the first item in the list then
3: - Return none jump control exit flow nodes of the previous item in the list
4: else
5: if statement is in #if part of block item list then
6: - Return true exits of #if directive condition
7: else
8: - Return false exits of #if directive condition
9: end if
10: end if
Figure 5.7: Strategy to calculate predecessors for every item of a conditional directive block
item list
5.2 Data Flow Analysis
As we mentioned in chapter 2, OpenRefactory/C stores intraprocedural reaching defini-
tionsanddefinition-userelationshipsintheAST.Theseanalysisareclassicdataflowanalysis
that are used to compute the definitions and uses for every point of a program. Since succes-
sor and predecessor information is needed to calculate data flow information, CFG is used for
data flow analysis. For instance, reaching definition uses two equations, IN and OUT, to de-
termine if a definition reaches a specific point of a program. Let B be a point of program and
P its predecessors , IN(B) = S
p2P
OUT(p) and OUT(B) = Gen(B)S(IN(B) Kill(B)).
Gen(B) are definitions within B that reach the end of B and Kill(B) are definitions that
never reach the end of B due to redefinitions of variables in B. As we changed the CFG
to be configuration aware, now the result of data flow analysis is also affected by those
modifications.
27
5.2.1 Running Example
Figure 5.8 shows a segment of code extracted from gmp-4.3.2, memory.c, lines 39 to 44.
Table 5.1 shows reaching definition information of the code. The format of entry in the table
is (variable name, line number, condition). As it is shown in the table, the definitions of
req_size and size is only reached if DEBUG is selected.
39: void *ret;
40: #ifdef DEBUG
41: size_t req_size = size;
42: size += 2 * BYTES_PER_MP_LIMB;
43: #endif
44: ret = malloc (size);
Figure 5.8: A code snippet extracted from gmp-4.3.2
Line# Gen Kill In Out
39 (ret,39,) (ret,39,)
40 (ret,39,) (ret,39,)
41 (req_size,41,DEBUG) (ret,39,) (ret,39,)
(req_size,41,DEBUG)
42 (size,42,DEBUG) (ret,39,)
(req_size,41,DEBUG)
(ret,39,)
(req_size,41,DEBUG)
(size,42,DEBUG)
44 (ret,44,) (ret,39,) (ret,39,)
(req_size,41,DEBUG)
(size,42,DEBUG)
(ret,44,)
(req_size,41,DEBUG)
(size,42,DEBUG)
Table 5.1: Reaching definition information for code in figure 5.8
5.3 Preprocessor Dependence Preservation Analysis
The baseline analyses mentioned in the previous sections are not sufficient to ensure
the correctness of certain kinds of refactorings?notably, those that reorder, remove, or copy
code within a file. Of the five most commonly used refactorings [28], three of them?Extract
Method, Extract Local, and Move?perform code movement and, therefore, must respect
preprocessor dependences. Figure 5.9 shows an example extracted from GNU Coreutils,
who.c. Applying Extract Function refactoring on line 352 will move it above line 341 where
28
the macro DEV_DIR_WITH_TRAILING_SLASH is undefined. Compiling the code after refactoring
will result in a compiler error.
341: #define DEV_DIR_WITH_TRAILING_SLASH "/dev/"
342: #define DEV_DIR_LEN (sizeof (DEV_DIR_WITH_TRAILING_SLASH) - 1)
...
351: if ( ! IS_ABSOLUTE_FILE_NAME (utmp_ent->ut_line))
352: p = stpcpy (p, DEV_DIR_WITH_TRAILING_SLASH);
Figure 5.9: An example of moving macro to undefined region
Another analysis, preprocessor dependence preservation analysis, has also been imple-
mented and integrated with refactorings in OpenRefactory/C to ensure macro correctness
in addition to baseline correctness. It uses a program representation called preprocessor
dependence graph, or PPDG, that models the relationships between macro definitions and
invocations, as well as the relationships between conditional compilation directives and the
code affected by them. A refactoring tool can simply build a PPDG prior to refactoring,
then build another PPDG after refactoring, and compare the two. The differences between
the two can be used to determine whether the transformation will cause any changes in
preprocessor behavior.
We integrated the preprocessor dependence preservation analysis with three refactor-
ings, Move External Declaration, Move Statement, and Extract Function implemented in
OpenRefactory/C. We automatically applied these refactorings on a large number of targets
in GNU Coreutils version 8.21 and identified that on average, 10% of the refactoring test
cases were blocked by the preprocessor dependence preservation analysis and errors would
be introduced if the preprocessor dependence were not considered.
Therefore, another analyses in addition to baseline analyses might also be needed to en-
sure that refactorings?particularly those that move, delete, or copy code?operate correctly
on code containing lexical macros and conditional compilation.
29
5.4 Conclusion
In this chapter, we looked into modifications that we made to control flow and data flow
analyses in order to include preprocessor information. We also introduced another type of
analysis that has been implemented in OpenRefactory/C to ensure macro correctness. In the
next chapter, we evaluate our multiple configuration aware analysis using Extract Function
refactoring.
30
Chapter 6
Evaluation
To evaluate our multiple configuration aware analysis, we used Extract Function refac-
toring implemented in OpenRefactory/C. This refactoring moves a sequence of statements
into a new function, replacing the original statements with a call to that function. We se-
lected this refactoring since it is particularly important to refactoring tools. Martin Fowler
[8] identified this refactoring as a tipping point for refactoring tools, claiming that if a refac-
toring tool could implement Extract Function refactoring well, it could implement many
other refactorings.
6.1 General Testing Approach
Thegeneraltestingapproach [14]thatweuseinOpenRefactory/Cconsistsofthreemain
steps: (1) finding all the program elements where a given refactoring can be applied, running
the refactoring on each of the elements, and recording the failures with associated message;
(2) splitting the failures into clusters based on the recorded messages; and (3) manually
analyzing the clusters to identify bugs. The first step is entirely automated; step 2 is semi-
automated and step 3 is manual.
In order to collect failures, three inputs are needed: the refactoring under test, a C
project containing the program on which the refactoring will be applied, and a threshold that
determines the maximum number of times to apply the refactoring on the project. Based on
the inputs, a set of refactoring tasks that a given refactoring can be applied are calculated.
Then, the properties for each refactoring task are configured such as providing a new name
for a new function. Before applying the refactoring, it is checked whether proceeding with
the refactoring could change the program behavior. If not, the refactoring is performed. The
31
result after applying the refactoring is a refactored project which is compiled. If there are
any compiler errors, the failures with all the error messages are recorded.
The second step clusters the failures where the goal is to reduce the number of failures
that should be inspected to detect bugs. Abstract messages that are regular expressions
are computed from the concrete messages that were recorded with the failures. After the
messages are abstracted, the clustering splits the failures into groups that have the same
refactoring name and the same abstract messages.
In the third step, the clusters are manually inspected to find bugs.
We used this approach to test the C refactorings in Eclipse CDT on three real software
projects, GMP [15], libpng [22], and zlib [35]. Table 6.1 shows the results where we found
53 bugs and reported them to Eclipse Bugzilla.
Compiler Error Exception
Refactoring #Refact.
Tasks
#
Failures
#
Failures
#
Clusters
Min
Cluster
Max
Cluster
#
Bugs
#
Failures
#
Clusters
Min
Cluster
Max
Cluster
#
Bugs
Rename 10652 598 252 3 1 173 2 346 2 3 343 0
Extract Method 23330 1782 1453 34 1 435 21 329 8 1 56 3
Extract Local 12026 1323 754 11 3 412 7 569 4 6 263 3
Extract Constant 23796 860 142 3 1 84 2 718 3 125 426 2
Toggle Function 2068 863 9 2 1 8 1 854 5 23 409 2P
71872 5426 2610 53 33 2816 22 10
7.5% 7.5%
Table 6.1: Failure and cluster statistics for C projects
6.2 A Problem
As mentioned in chapter 2, when we consider all possible macro configurations, it is
possible that conditional directives appear in the middle of a complete C constructs. The
current parser in OpenRefactory/C is not able to handle those cases. At the first place,
32
we planned to manually change every C file in GNU Coreutils which is our testing target.
However, for some C constructs the number of macros appearing in one construct is so high
that is actually impossible to manually change them for many of the C files. Figure 6.1 shows
one of the examples in chmod.c where the definition for macro CHMOD_MODE_BITS consists of
multiple macros in which some of them have multiple expansions and some others are again
expanded to macros with multiple expansions. For instance, figure 6.2 shows the definition
of S_IRWXU where it consists of three other macros. Subfigures (b), (c), and (d) show multiple
definitions of every one of those macros under different guarding conditions. Again, for every
macro one of the definitions is a macro which may be expanded to one or more macros.
return ((old_mode ^ new_mode) & CHMOD_MODE_BITS) != 0;
(a) cp-hash.c
#define CHMOD_MODE_BITS \
(S_ISUID | S_ISGID | S_ISVTX
| S_IRWXU | S_IRWXG | S_IRWXO)
(b) CHMOD_MODE_BITS
definitions
Figure 6.1: An example extracted from chmod.c
#if !S_IRWXU
# define S_IRWXU \
(S_IRUSR | S_IWUSR
| S_IXUSR)
#endif
(a) S_IRWXU def-
inition
#if !S_IRUSR && S_IREAD
# define S_IRUSR S_IREAD
#endif
#if !S_IRUSR
# define S_IRUSR 00400
#endif
(b) S_IRUSR definition
#if !S_IWUSR && S_IWRITE
# define S_IWUSR S_IWRITE
#endif
#if !S_IWUSR
# define S_IWUSR 00200
#endif
(c) S_IWUSR definition
#if !S_IXUSR && S_IEXEC
# define S_IXUSR S_IEXEC
#endif
#if !S_IXUSR
# define S_IXUSR 00100
#endif
(d) S_IXUSR definition
Figure 6.2: Expansion of macro S_IRWXU
Since it seems very difficult to manually change 108 C files of our testing target, GNU
Coreutils, we decided to write unit tests in Java using JUnit tests. In the rest of this chapter,
we discuss about how we designed and evaluated our test cases.
33
6.3 Test cases
6.3.1 Designing Test cases
Extract Function refactoring implementation in OpenRefactory/C composed of ten mi-
cro refactorings where some of them are directly using either control flow or data flow analy-
sis. Intheremainingofthischapterwelookintoeachofthemicrorefactoringsanddetermine
how each step has been implemented, which analysis it is used and how they are affected by
the changes we made. We designed our test cases for each micro refactoring based on these
changes.
6.3.2 Extract Function Refactoring
Extract Function refactoring composes of ten micro refactoring as follows:
 Replace Statement Sequence with Compound Statement (AddBrace) : This micro refac-
toring puts curly braces around the statements to be extracted. However, this is not
always legal. We have to validate the statement selection. One of the criteria that
we consider is that the selection must be a single entry, single exit region in the con-
trol flow graph. Therefore, this micro refactoring should be tested specifically for the
changes we made on the control flow graph.
 Move Local Variable Declaration into Compound Statement (MoveLocal) : This step
findseverylocalvariableusedinthecompoundstatementandchecksifthevariablehas
a declaration out of the compound statement while it is only used inside the compound
statement. If this is true for any variable, it moves the variable declaration inside the
compound statement. To determine the uses of each definition, this micro refactoring
uses definition-use chains from data flow analysis. Thus, we have to make sure that we
consider all the uses under all the configurations when calculating every variable uses.
34
 Move Initial Assignment into Declaration (MoveInitial) : For every declaration that
is moved inside the compound statement in the previous step, this micro refactoring
determines if there is any initial assignment that could be moved to the declaration.
To ensure whether this movement is legal or not, we use dependence analysis where it
uses the control flow and data flow information.
 Introduce Pointer (IntroPointer) : For each local variable used in the compound state-
ment, if it is aliased or there is any uses of it out of the compound statement, we
introduce a pointer to that variable and replace every uses of that variable inside the
compound statement with the introduced pointer. This steps again uses the informa-
tion from definition-use chains in order to determine if any variable is used out of the
selected area.
 Remove Superfluous Parentheses : This step just removes the superfluous parentheses
introduced in the previous step and it is not affected.
 Add Empty Function : This micro refactoring creates an empty function with appro-
priate arguments. This step does not use any information from either control or data
flow analysis. However, when we determine the appropriate arguments we have to
ensure that we consider all the feasible configurations.
 Move Compound Statement into Function Definition (MoveCompound) : This micro
refactoring moves the statements inside the compound statement into the empty func-
tion added in the previous step. This step uses preprocessor dependence preservation
analysis mentioned end of the previous chapter. We previously tested this analysis on
real C code.
 Replace Pointer Parameter with Function Return (ReplacePointer) : If we only intro-
duced one pointer-typed parameter, which is not a pointer to a struct, union, or an
array and it is always used indirectly (as either pointer or an address, not just the
35
variable itself) we replace the pointer with a function return. This step also has not
been affected by the changes we made.
 Rename : If in the previous step we replaced any pointer parameter with function re-
turn, we rename the variable name to its original name, before introducing the pointer.
This refactoring, in this context, is not affected by the changes.
 Remove Unused Parameter (RemoveParameter) : If any function argument is domi-
nated by an assignment inside the function body, that parameter is no longer needed.
This is also determined by using dependence analysis.
We wrote on average 10 new test cases (excluding their previous tests) for every micro
refactoring that our changes have any impact on it. Then, we ran the Extract Function
refactoring for every test case as well to make sure we are not breaking anything.
6.3.3 Evaluating Test cases
In order to evaluate our test cases, we used EclEmma [6], version 2.3.1, a free Java code
coverage tool. We ran this tool for every micro refactoring that is affected. EclEmma uses a
set of different counters including instructions, branches, lines, methods, and complexity to
calculate coverage metrics.
 Instructions : Itprovidesinformationabouttheamountofcodethathasbeenexecuted.
 Branches : It counts the total number of if and switch statements in a method and
determines the number of executed or missed branches.
 Lines : For all class files that have been compiled with debug information, coverage
information for individual lines can be calculated.
 Methods : Each non-abstract method contains at least one instruction. A method is
considered as executed when at least one instruction has been executed.
36
 Cyclomatic complexity : It is the minimum number of paths that can, in (linear)
combination, generate all possible paths through a method. Thus the complexity value
can serve as an indication for the number of unit test cases to fully cover a certain piece
of software. It also calculates cyclomatic complexity [26] for each non-abstract method
and summarizes complexity for classes, packages and groups.
Figure6.3showsthegeneratedresultforMoveLocalVariablerefactoringusingEclEmma.
Figure 6.3: Test coverage results for MoveLocalVariable micro refactoring using EclEmma
We generated the results for all other micro refactorings. The numbers are shown as
percentages. Table 6.2 shows the results where the average instruction coverage for six micro
refactoringsis85.95%,thebranch,line,methods,andcomplexitycoveragesare77.8%,85.2%,
88.6%, and 47.8% respectively.
Micro
Refactoring
Name
Instruction Branch Line Methods Complexity
AddBrace 87.2 76.1 78.4 92.3 54.7
MoveLocal 89.4 73 92.3 92.3 53.9
MoveInitial 78.5 79.3 72.3 94.1 34.1
IntroPointer 78.8 73.2 78.2 85 46.5
MoveCompound 88.4 77 94.8 93.1 46
RemoveParameter 93.4 70.6 95.2 75 52
Average 85.95 77.8 85.2 88.6 47.8
Table 6.2: Test coverage results for Extract Function micro refactorings using EclEmma
37
6.3.4 Discussion
As mentioned earlier Extract Function refactoring is implemented by integrating mul-
tiple micro refactorings. The integration requires using encapsulation in Java where several
getters and setters methods are needed so that micro refactorings can communicate with
each other. These methods are not tested in individual micro refactorings, they are tested
as part of Extract Function itself. Percentages are affected by these methods and become
lower.
Cyclomatic complexity is highly dependent on number of branches and decision points.
In the case that two different branches are slightly different, we did not write separate test
cases for each branch. Therefore, the cyclomatic complexity percentage is lower. Figure 6.4
shows such a code from Add Brace micro refactoring.
if (parent instanceof CIfStatement) {
statement = (ICStatement)firstBlockItem;
} else if (parent instanceof CSwitchStatement) {
CSwitchStatement switchStatement = firstBlockItem.findNearestAncestor(CSwitchStatement.class);
statement = switchStatement.getStatement();
} else if (parent instanceof CWhileStatement) {
CWhileStatement whileStatement = firstBlockItem.findNearestAncestor(CWhileStatement.class);
statement = whileStatement.getStatement();
} else if (parent instanceof CDoStatement) {
CDoStatement doStatement = firstBlockItem.findNearestAncestor(CDoStatement.class);
statement = doStatement.getStatement();
} else if (parent instanceof CForStatement) {
CForStatement forStatement = firstBlockItem.findNearestAncestor(CForStatement.class);
statement = forStatement.getStatement();
} else if (parent instanceof CCaseStatement) {
CCaseStatement caseStatement = firstBlockItem.findNearestAncestor(CCaseStatement.class);
statement = caseStatement.getStatement();
} else if (parent instanceof CLabeledStatement) {
CLabeledStatement labelStatement = firstBlockItem.findNearestAncestor(CLabeledStatement.class);
statement = labelStatement.getStatement();
} else if (parent instanceof CDefaultStatement) {
CDefaultStatement defaultStatement = firstBlockItem.findNearestAncestor(CDefaultStatement.class);
statement = defaultStatement.getStatement();
}
Figure 6.4: Part of Add Brace micro refactoring code
In order to increase coverage percentage we could write more test cases to cover all
the branches and statements. However, functional testing is mostly used to test refactorings
rather than just unit testing. The part of code mentioned in the above example is assumed to
38
return correct result when writing test cases. Therefore, increasing the coverage percentage
may not have a substantial impact on refactoring correctness.
39
Chapter 7
Related Work
The importance of preprocessor concerns in program analyses (and transformations)
is confirmed in an empirical study by Ernst et al. [7]. Several authors have attempted to
understand how commonly preprocessors are used in C code. These studies rely on heuristics
to approximate the actual usage [1, 7, 10, 23, 30]. The problems of using #ifdef directives
have also been noted [33].
OneoftheearliestpapersonhandlingtheCpreprocessorinasoftwaremaintenancetools
is due to Platoff et al. [31]. Badros and Notkin [3] describe an infrastructure for handling the
C preprocessor in static analysis tools, which invokes user-defined callbacks when interesting
preprocessing events occur. Their infrastructure is based on GCC?s preprocessor, so it only
handlesasingleconfiguration,althoughtheyprovideaccesstounpreprocessed #ifdef regions
as unstructured text, and they provide access to the parser stack in order to aid analysis of
this text.
The first comprehensive work on handling the C preprocessor in a refactoring tool is due
to Garrido [9, 10, 11]; this was discussed previously. Notably, Garrido?s work identifies rules
for refactoring code containing macros and file inclusions, it recognizes the need to guard
symbol table entries with the preprocessor configurations under which they are valid, and
it addresses the difficulties of refactoring codes where conditional compilation directives do
not enclose complete syntactic constructs.
In recent years, researchers have proposed and evaluated variability-aware analysis in
different domains where they do not rely on heuristics and do not impose any restrictions
on the location of preprocessor constructs.
40
Product-line-aware type checkers [21, 2, 20, 19] are able to determine if all potential vari-
ants of a product line that is written in C and implements variability with #ifdef directives
of the C preprocessor are well-typed.
Brabrandetal.[5]demonstratedthebenefitsofusingfeature-sensitivedataflowanalysis
against the brute force approach that generates all the variants of a software product line.
They proposed several algorithms to adapt an existing standard intraprocedural data flow
analysis into a feature-sensitive data flow analysis by extending the representation of control
flow graphs with an inclusion condition associated with each node, describing under which
configurations the node is present in the graph. However, their approach is limited as it gets
Java programs as the input and by using a tool annotate them with conditional directives
like #if. Another limitation of this approach is that they make the conditional directive
annotation disciplined [24] which is not true for real C programs.
Bodden et al. [4] proposed an approach to automatically converts traditional static
analysistosoftwareproductlineanalysis. However, theirapproachisonlylimitedtosoftware
product lines that are expressed with CIDE, an extension of the Eclipse IDE and they are
not currently able to handle C preprocessor features (#ifdef constructs). Moreover, their
tool is limited to analyses expressed in the IFDS [32] framework.
As mentioned earlier, the most similar work to ours is by Liebig et al. [25]. They imple-
mented variability-aware type checking and liveness analysis to compare the performance of
these variability-aware analysis with the performance of corresponding sampling heuristics
(single configuration, pair-wise, and code coverage). They found that the performance of
variability-aware analysis scales to large scale systems such as the Linux kernel. However,
theydidnotapplytheiranalysisinanyapplicationareasotherthanempiricalstudies. Refac-
toring tools are practical application of configuration-aware analysis which present challenges
specific to this context. We used OpenRefactory/C as a practical infrastructure to imple-
ment configuration-aware analysis which is a real necessity to perform certain refactorings
correctly.
41
Chapter 8
Conclusions and Future Works
In this thesis, we discussed handling multiple configurations in C programs in order
to build more accurate program transformations for refactoring. The C language libraries
are highly configurable and use C preprocessor directives heavily. Program analysis and
transformations have to consider all possible macro configurations; otherwise they will be
incorrect.
We modified the C preprocessor in OpenRefactory/C to handle macros with multiple
definitionsandconditionaldirectivesinpresenceofmultipleconfigurations. Wealsoincluded
the preprocessor directives in our program representation so that we could extend the static
analysis to account for multiple configurations. We modified the CFG to express configura-
tions with conditional branches and annotating the branch edges of CFG as either true or
false.
For evaluation, we modified Extract Function refactoring and tested it against our
changes using JUnit tests. In future, we will test our code on real C programs including
GNU Coreutils. Moreover, a new set of refactorings that needs preprocessor directive sup-
portcouldbedevelopedinOpenRefactory/C.Forinstance, aRenameMacrorefactoringthat
renames a macro or an Extract Macro refactoring that extracts parts of code to a macro.
More information will be available on https://sites.google.com/site/masterthesisresults/.
42
Bibliography
[1] B. Adams, W. De Meuter, H. Tromp, and A. Hassan. Can we refactor conditional
compilation into aspects? In AOSD ?09, pages 243?254, New York, NY, USA, 2009.
ACM.
[2] S. Apel, C. K?stner, A. Grolinger, and C. Lengauer. Type safety for feature-oriented
product lines. Automated Software Engg., 17(3):251?300, Sept. 2010.
[3] G. Badros and D. Notkin. A framework for preprocessor-aware c source code analyses.
Softw. Pract. Exper., 30(8):907?924, July 2000.
[4] E. Bodden, T. Tol?do, M. Ribeiro, C. Brabrand, P. Borba, and M. Mezini. Spllift:
Statically analyzing software product lines in minutes instead of years. In Proceedings
of the 34th ACM SIGPLAN Conference on Programming Language Design and Imple-
mentation, PLDI ?13, pages 355?364, New York, NY, USA, 2013. ACM.
[5] C.Brabrand, M.Ribeiro, T.Tol?do, andP.Borba. Intraproceduraldataflowanalysisfor
software product lines. In Proceedings of the 11th Annual International Conference on
Aspect-oriented Software Development, AOSD ?12, pages 13?24, New York, NY, USA,
2012. ACM.
[6] EclEmma home page. http://www.eclemma.org/.
[7] M. Ernst, G. Badros, and D. Notkin. An empirical analysis of C preprocessor use. IEEE
Trans. Softw. Eng., 28(12):1146?1170, Dec. 2002.
[8] M. Fowler. Refactoring: Improving The Design of Existing Code. Addison-Wesley, Jun
1999.
[9] A. Garrido. Program refactoring in the presence of preprocessor directives. PhD thesis,
University of Illinois at Urbana-Champaign, Champaign, IL, USA, 2005.
[10] A. Garrido and R. Johnson. Analyzing multiple configurations of a C program. In ICSM
?05, pages 379?388, 2005.
[11] A. Garrido and R. Johnson. Embracing the C preprocessor during refactoring. J. Softw.
Evol. and Proc., June. 2013.
[12] A. Garrido and R. E. Johnson. Refactoring c with conditional compilation. In ASE,
pages 323?326. IEEE Computer Society, 2003.
43
[13] P. Gazzillo and R. Grimm. SuperC: Parsing all of C by taming the preprocessor. PLDI
?12, pages 323?334, New York, NY, 2012. ACM.
[14] M. Gligoric, F. Behrang, Y. Li, J. Overbey, M. Hafiz, and D. Marinov. Systematic
testing of refactoring engines on real software projects. In ECOOP 2013, volume 7920
of Lecture Notes in Computer Science, pages 629?653. Springer Berlin Heidelberg, 2013.
[15] GMP home page. http://gmplib.org.
[16] M. Hafiz. Security On Demand. PhD thesis, University of Illinois Urbana-Champaign,
2010.
[17] M. Hafiz, J. Overbey, F. Behrang, and J. Hall. OpenRefactory/C: An infrastructure for
building correct and complex C transformations. In WRT ?13, 2013.
[18] International Organization for Standardization. ISO/IEC 9899:1999: Programming
Languages ? C. Dec 1999.
[19] C. K?stner, S. Apel, T. Th?m, and G. Saake. Type checking annotation-based product
lines. ACM Trans. Softw. Eng. Methodol., 21(3):14:1?14:39, July 2012.
[20] C. K?stner, K. Ostermann, and S. Erdweg. A variability-aware module system. In
OOPSLA ?12, pages 773?792, New York, NY, 2012. ACM.
[21] A. Kenner, C. K?stner, S. Haase, and T. Leich. Typechef: Toward type checking #ifdef
variability in c. In Proceedings of the 2Nd International Workshop on Feature-Oriented
Software Development, FOSD ?10, pages 25?32, New York, NY, USA, 2010. ACM.
[22] libpng home page. http://www.libpng.org/pub/png/libpng.html.
[23] J. Liebig, S. Apel, C. Lengauer, C. K?stner, and M. Schulze. An analysis of the variabil-
ity in forty preprocessor-based software product lines. ICSE ?10, pages 105?114, New
York, NY, USA, 2010. ACM.
[24] J. Liebig, C. K??stner, and S. Apel. Analyzing the discipline of preprocessor annota-
tions in 30 million lines of c code. AOSD ?11, 2011.
[25] J. Liebig, A. von Rhein, C. K?stner, S. Apel, J. D?rre, and C. Lengauer. Scalable anal-
ysis of variable software. In Proceedings of the 2013 9th Joint Meeting on Foundations
of Software Engineering, ESEC/FSE 2013, pages 81?91, New York, NY, USA, 2013.
ACM.
[26] T. McCabe. A complexity measure. Software Engineering, IEEE Transactions on,
SE-2(4):308?320, Dec 1976.
[27] J. D. Morgenthaler. Static analysis for a software transformation tool. Technical report,
1997.
[28] E. Murphy-Hill, C. Parnin, and A. P. Black. How we refactor, and how we know it. In
ICSE ?09, 2009.
44
[29] J. Overbey and R. Johnson. Generating rewritable abstract syntax trees. In Software
Language Engineering: First International Conference, SLE 2008. Revised Selected Pa-
pers. Springer-Verlag, 2009.
[30] Y. Padioleau. Parsing C/C++ code without pre-processing. In ETAPS ?09, CC ?09,
pages 109?125, Berlin, Heidelberg, 2009. Springer-Verlag.
[31] M. Platoff, M. Wagner, and J. Camaratta. An integrated program representation and
toolkit for the maintenance of c programs. In ICSM ?91. IEEE, 1991.
[32] T. Reps, S. Horwitz, and M. Sagiv. Precise interprocedural dataflow analysis via graph
reachability. In Proceedings of the 22Nd ACM SIGPLAN-SIGACT Symposium on Prin-
ciples of Programming Languages, POPL ?95, pages 49?61, New York, NY, USA, 1995.
ACM.
[33] H. Spencer and G. Collyer. #ifdef considered harmful, or portability experience with C
news. In Proceedings of the Usenix Summer 1992 Technical Conference, pages 185?198,
Berkeley, CA, USA, June 1992. Usenix Association.
[34] R. Tartler, D. Lohmann, J. Sincero, and W. Schr?der-Preikschat. Feature consistency
in compile-time-configurable system software: Facing the linux 10,000 feature problem.
In Proceedings of the Sixth Conference on Computer Systems, EuroSys ?11, pages 47?60,
New York, NY, USA, 2011. ACM.
[35] zlib home page. http://www.zlib.net.
45