Generic Reverse Engineering Architecture with Compiler and Compression Classification Components Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classi ed information. Stephen A. Torri Certi cate of Approval: Richard Chapman Associate Professor Computer Science and Software Engineering John A. Hamilton, Jr. Associate Professor Computer Science and Software Engineering David Umphress Associate Professor Computer Science and Software Engineering David Rouse Professor Fisheries & Allied Aquacultures George T. Flowers Dean Graduate School Generic Reverse Engineering Architecture with Compiler and Compression Classification Components Stephen A. Torri A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Ful llment of the Requirements for the Degree of Doctor of Philosophy Auburn, Alabama May 9, 2009 Generic Reverse Engineering Architecture with Compiler and Compression Classification Components Stephen A. Torri Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights. Signature of Author Date of Graduation iii Dissertation Abstract Generic Reverse Engineering Architecture with Compiler and Compression Classification Components Stephen A. Torri Doctor of Philosophy, May 9, 2009 (M.S., in Computer Science, Washington University in Saint Louis, 2004) (B.S., in Accounting and Finance and Computer Science, Lancaster University (UK), 2001) 195 Typed Pages Directed by John A. Hamilton, Jr. As more and more applications, libraries, and other types of programs are being executed in untrusted environments they will be targets of attackers. These applications are exposed to malicious programs attempting to exploit some publicly known or newly discovered vulnerability in order to produce an unwanted action. These malicious and/or suspect programs can be installed on a system without the knowledge of the user. In these circumstances reverse engineering would be able to discover the functionality of the programs without actually executing them. This is important because it is necessary to know as much about a program before executing it in a controlled environment. Since each binary application, e.g. malicious programs for Intel X86 or Java, was produced by a compiler it would be helpful to customize the reverse engineering process by detecting which compiler was used. This research will be experimenting with methods that help detect the compiler used to create an executable program regardless of the programming language used. The method discovered for compiler detection will be added to a generic reverse engineering architecture that will utilize this information to alter the run-time behavior of the generic reverse engineering architecture. iv Acknowledgments Every great adventure has many stages and a cast of di erent people that contributed to it. The following people I wish to thank for their part in this journey. To my Lord and Savior Jesus Christ because of whom none of this would be possible. It is because of and through Him I was placed here at this place in time to perform the task set before me by His power with the skills endowed to me by Him. To my parents Joseph and Marilynn Torri without whom this journey would not have been possible. 
It is their unfailing belief in me and support that enabled me to complete my journey. In addition I thank my brothers, Brian and Christopher, and sister, Jennifer, for taking the time to encourage me and help me keep a right perspective on the task at hand. To my adviser Dr John A. Hamilton, Jr. and the IA-Lab research group. I want to thank for their support, discussions and guidance on this project. Also I wish to thank the members of my committee: Dr Richard Chapman, Dr David Umphress and Dr Rouse for their contributions and their helpful discussions. To Dr Morgan Deters for giving me the dissertation by Dr Cristina Cifuentes at Washing- ton University in Saint Louis which inspired me to look at generic reverse engineering. Your encouragement and support helped me to see the potential in this area of research. To Winard Britt for his support of the use of Arti cial Intelligence in classi cation. Your comments on this dissertation, research and encouragement during this process were invaluable. To Dr Christopher Gill who believed in my potential as a researcher and gave me a plat- form from which I could grow. I greatly value our discussions we had and appreciate all your assistance. To the mentors and close friends whom I have met over the years who helped me to learn more about who I am, how I was created and to know what abilities I have been given. v Style manual or journal used Journal of Approximation Theory (together with the style known as \aums"). Bibliograpy follows van Leunen?s A Handbook for Scholars. Computer software used The document preparation package TEX (speci cally LATEX) together with the departmental style- le aums.sty. vi Table of Contents List of Figures xi 1 Introduction 1 1.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivation of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Detecting properties in executable programs . . . . . . . . . . . . 4 1.2.2 Generic reverse engineering architecture . . . . . . . . . . . . . . 7 1.2.3 Modi cation of the run-time behavior of generic reverse engineer- ing architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Background and Literature Review 9 2.1 Reverse Engineering De nitions . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 Windows Portable Executable (PE) . . . . . . . . . . . . . . . . 10 2.2.2 Linux ELF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.3 Java Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 Compressed Executables . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 Survey of Decompiling Architectures . . . . . . . . . . . . . . . . . . . . 29 2.4.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.2 Java Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.3 Static Slicing on Binary Executables . . . . . . . . . . . . . . . . 41 2.4.4 Static Analysis on Executables without Decompiling . . . . . . . 42 2.5 Machine Learning Related Work . . . . . . . . . . . . . . . . . . . . . . 44 2.6 Survey of Compiler Detection . . . . . . . . . . . . . . . . . . . . . . . . 45 2.7 Generic Decompilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
46 3 Analysis And Classification 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 The General Regression Neural Network . . . . . . . . . . . . . . . . . . 47 3.2.1 Learnability in the Training Instances . . . . . . . . . . . . . . . 49 3.2.2 Training Set Instances . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2.4 Network Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3 The Evolutionary Hill-Climber Algorithm . . . . . . . . . . . . . . . . . 51 3.3.1 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4 Logarithmic Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.1 Steady-State Genetic Algorithm . . . . . . . . . . . . . . . . . . 54 vii 3.5 Statistical Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.1 Fitness Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4 Compressed Executables 58 4.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.1.1 The Training, Validation, and Test Sets . . . . . . . . . . . . . . 59 4.1.2 Optimizer Stopping Conditions . . . . . . . . . . . . . . . . . . . 61 4.1.3 Identifying Compression . . . . . . . . . . . . . . . . . . . . . . . 61 4.1.4 Classifying Type of Compression or Lack Thereof . . . . . . . . . 62 4.1.5 Classifying Type of Compression . . . . . . . . . . . . . . . . . . 63 4.1.6 Impact of Compression . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Data source and compressors . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.1 Cexe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.2 Ultimate Packer for eXecutables . . . . . . . . . . . . . . . . . . 65 4.2.3 PEtite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3 Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5 A Compiler Classification Framework for Use in Reverse Engineer- ing 69 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.1.1 How can machine learning be used for classifying compiler type? 69 5.2 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.1 Java Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.2 ELF Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.3 Windows PE Input Data . . . . . . . . . . . . . . . . . . . . . . 73 5.2.4 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.3.1 The Training, Validation, and Test Sets . . . . . . . . . . . . . . 75 5.3.2 Optimizer Stopping Conditions . . . . . . . . . . . . . . . . . . . 78 5.3.3 Experiment I: Java Compiler Classi cation with a GRNN Opti- mized with an Evolutionary Hill-Climber Algorithm . . . . . . . 78 5.3.4 Experiment II: Java Compiler Classi cation with a GRNN Opti- mized with a Steady-State Genetic Algorithm . . . . . . . . . . . 79 5.3.5 Experiment III: ELF File Compiler Classi cation with a GRNN Optimized with a Steady-State Genetic Algorithm . . . . . . . . 
81 5.3.6 Experiment IV: Windows PE File Compiler Classi cation with a GRNN Optimized with a Steady-State Genetic Algorithm . . . . 82 5.4 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6 Classifier Architecture 85 6.1 O ine File Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.1.1 Compiler Dump Program . . . . . . . . . . . . . . . . . . . . . . 85 6.2 Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 viii 6.2.1 Linux ELF XML . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.2.2 Java Class XML . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.2.3 Windows PE XML . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.3 GRNN Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.3.1 Optimizer Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.3.2 Input parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.3.3 Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.3.4 Candidate Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.3.5 GRNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7 API 111 7.1 Execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.1.1 Steps of Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 112 8 Configuration 114 8.1 Con gurator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 8.2 Master Formula File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 8.3 Formula File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 9 Components 120 9.1 Design Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 9.1.1 Two Mode Operation . . . . . . . . . . . . . . . . . . . . . . . . 120 9.1.2 Single Primitive Operation . . . . . . . . . . . . . . . . . . . . . 121 9.1.3 Loosely Coupled . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 9.1.4 Hidden data sources . . . . . . . . . . . . . . . . . . . . . . . . . 121 9.2 Component Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 122 9.3 Component Actor Interface . . . . . . . . . . . . . . . . . . . . . . . . . 124 9.4 Input Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 9.4.1 Architecture Type Detector . . . . . . . . . . . . . . . . . . . . . 125 9.4.2 Code Section Detector . . . . . . . . . . . . . . . . . . . . . . . . 126 9.4.3 Data Section Detector . . . . . . . . . . . . . . . . . . . . . . . . 127 9.4.4 Entry Point Detector . . . . . . . . . . . . . . . . . . . . . . . . . 128 9.4.5 File Header Printer . . . . . . . . . . . . . . . . . . . . . . . . . . 129 9.4.6 File Type Detector . . . . . . . . . . . . . . . . . . . . . . . . . . 130 9.4.7 Memory Map Producer . . . . . . . . . . . . . . . . . . . . . . . 130 9.4.8 Null . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 9.4.9 Tevis Unknown Region Checker . . . . . . . . . . . . . . . . . . . 132 9.4.10 Tevis Zero Filled Checker . . . . . . . . . . . . . . . . . . . . . . 132 9.4.11 Compiler Classi er . . . . . . . . . . . . . . . . . . . . . . . . . . 133 10 Component Graph 135 10.1 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 10.2 ID Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 10.3 Component Map . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 135 10.4 Visitor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 11 Data Source 137 ix 12 Data Structures 141 12.1 Filename . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 12.2 Memory Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 12.3 Control Flow Graph (CFG) Sequence . . . . . . . . . . . . . . . . . . . 142 13 Meta Information 144 13.1 Importance of meta information . . . . . . . . . . . . . . . . . . . . . . . 144 13.2 How meta information is exchanged . . . . . . . . . . . . . . . . . . . . 146 13.3 Meta Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 13.3.1 Arch Type Meta . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 13.3.2 Code Section Meta . . . . . . . . . . . . . . . . . . . . . . . . . . 149 13.3.3 Code Section 64 Meta . . . . . . . . . . . . . . . . . . . . . . . . 150 13.3.4 Compiler Type Meta . . . . . . . . . . . . . . . . . . . . . . . . . 150 13.3.5 Data Section Meta . . . . . . . . . . . . . . . . . . . . . . . . . . 151 13.3.6 Data Section 64 Meta . . . . . . . . . . . . . . . . . . . . . . . . 152 13.3.7 Entry Point Meta . . . . . . . . . . . . . . . . . . . . . . . . . . 152 13.3.8 Entry Point 64 Meta . . . . . . . . . . . . . . . . . . . . . . . . . 153 13.3.9 File Type Meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 13.3.10 Tevis Unknown Region Meta . . . . . . . . . . . . . . . . . . . . 154 13.3.11 Tevis Zero Filled Meta . . . . . . . . . . . . . . . . . . . . . . . . 155 14 Scenarios 156 14.1 Non-expert and Expert user . . . . . . . . . . . . . . . . . . . . . . . . . 156 14.2 Compressed Executables and Compiler Classi cation . . . . . . . . . . . 156 14.3 Behind the scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 15 Conclusion and Future Work 169 15.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 15.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 15.2.1 Compiler Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 169 15.2.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 15.2.3 Control Flow Graph Generation . . . . . . . . . . . . . . . . . . 170 15.2.4 Intermediate Forms . . . . . . . . . . . . . . . . . . . . . . . . . 172 15.2.5 Con gurator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 15.2.6 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 15.2.7 Compressed Executables . . . . . . . . . . . . . . . . . . . . . . . 175 15.2.8 Compiler Classi cation . . . . . . . . . . . . . . . . . . . . . . . 175 15.2.9 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 15.3 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 Bibliography 177 x List of Figures 2.1 PE File Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 A GRNN Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.1 Experiment I Training Instance . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Experiment II Training Instance . . . . . . . . . . . . . . . . . . . . . . 60 6.1 Compiler Dump Algorithm class diagram . . . . . . . . . . . . . . . . . 87 6.2 ELF Compiler Dump Algorithm class diagram . . . . . . . . . . . . . . 88 6.3 ELF Compiler Dump Algorithm sequence diagram for reading a 32-bit Linux ELF le . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 88 6.4 Windows Compiler Dump Algorithm class diagram . . . . . . . . . . . 89 6.5 Windows Compiler Dump Algorithm sequence diagram for reading a Windows PE+ le . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.6 Java Compiler Dump Algorithm class diagram . . . . . . . . . . . . . . 90 6.7 Java Compiler Dump Algorithm sequence diagram for reading a Java class le . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.8 Optimizer Algorithm class diagram . . . . . . . . . . . . . . . . . . . . 100 6.9 ELF Optimizer Algorithm class diagram . . . . . . . . . . . . . . . . . 101 6.10 ELF Optimizer Algorithm sequence diagram . . . . . . . . . . . . . . . 101 6.11 Java Optimizer Algorithm class diagram . . . . . . . . . . . . . . . . . 102 6.12 Java Optimizer Algorithm sequence diagram . . . . . . . . . . . . . . . 102 6.13 Windows Optimizer Algorithm class diagram . . . . . . . . . . . . . . . 103 6.14 Windows Optimizer Algorithm sequence diagram . . . . . . . . . . . . 103 6.15 Optimizer class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 xi 6.16 Optimizer sequence diagram . . . . . . . . . . . . . . . . . . . . . . . . 109 7.1 Libreverse API ?execute? function . . . . . . . . . . . . . . . . . . . . . 111 8.1 Con gurator Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . 114 8.2 Con gurator get Graph sequence diagram . . . . . . . . . . . . . . . . . 115 8.3 High-level view of set of Component Graphs . . . . . . . . . . . . . . . 116 9.1 Component Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 123 9.2 Component execution with its state set to SOURCE MODE . . . . . . 125 9.3 Component execution with its state set to WORKING MODE . . . . . 125 11.1 Data Source Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 138 11.2 Data Transfer Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . 139 11.3 Memory Data Transfer Sequence Diagram . . . . . . . . . . . . . . . . 140 12.1 Data Container Class Diagram . . . . . . . . . . . . . . . . . . . . . . . 143 13.1 Three sets of unique components for handling compiler classi cation . . 145 13.2 Generic components for handling compiler classi cation . . . . . . . . . 145 13.3 Reading meta information from a Data Source . . . . . . . . . . . . . . 146 13.4 Meta Object class diagram . . . . . . . . . . . . . . . . . . . . . . . . . 147 13.5 Meta Item class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 148 13.6 Arch Type Meta class diagram . . . . . . . . . . . . . . . . . . . . . . . 149 13.7 Code Section Meta class diagram . . . . . . . . . . . . . . . . . . . . . 149 13.8 Code Section 64 Meta class diagram . . . . . . . . . . . . . . . . . . . . 150 13.9 Compiler Type Meta class diagram . . . . . . . . . . . . . . . . . . . . 151 13.10 Data Section Meta class diagram . . . . . . . . . . . . . . . . . . . . . . 151 13.11 Data Section Meta class diagram . . . . . . . . . . . . . . . . . . . . . . 152 13.12 Entry Point Meta class diagram . . . . . . . . . . . . . . . . . . . . . . 153 13.13 Entry Point Meta class diagram . . . . . . . . . . . . . . . . . . . . . . 153 xii 13.14 File Type Meta class diagram . . . . . . . . . . . . . . . . . . . . . . . 154 13.15 Tevis Unknown Region Meta class diagram . . . . . . . . . . . . . . . . 155 13.16 Tevis Zero Filled Meta class diagram . . . . . . . . . . . . . . . . . . . 155 14.1 Graphical view of binary RTL.xml . . . . . . . . . . . . . . . . . . . . . 
162 14.2 Graphical view of binary RTL2.xml . . . . . . . . . . . . . . . . . . . . 162 14.3 Graphical view of decompiling analysis.xml . . . . . . . . . . . . . . . . 165 14.4 Graphical view of cpp writer.xml . . . . . . . . . . . . . . . . . . . . . . 165 14.5 Graphical view of uml writer.xml . . . . . . . . . . . . . . . . . . . . . 166 14.6 Graphical view of Non-Expert User Component Graphs . . . . . . . . . 167 14.7 Graphical view of Expert User Component Graphs . . . . . . . . . . . . 168 xiii Chapter 1 Introduction 1.1 Statement of the Problem At present there is a limited set of tools available to an organization to conduct reverse engineering work and research. Most of these tools are focused on a speci c problem space with a limited set of inputs. The problem with these approaches is that they are too limited and provide no ability to reuse the solution for other problems. There is often large overlap in functionality in these approaches that cannot be reused since the functionality is so tightly coupled to the particular problem. One purpose of this work is to contribute to the reverse engineering of software by providing a means for identifying the type of compiler used to compile a Java Class, Linux ELF or Windows PE le. Since each compiler performs its operations di erently it is important to know which one produced a binary application. This information would be valuable in the reverse engineering e ort. There is no method known at present for detecting the compiler used to create a binary computer application. This research shows that it is possible to detect the compiler used to create a binary computer applica- tion, regardless of the programming language used, through information gathered from the binary le headers by classifying against pro les from a list of available compilers. The research focuses on information gathered from the binary le headers and not the actual instructions executed by the system processor. This research investigated the le properties from two di erent binary le types (Windows PE and Linux ELF) and an intermediate le type (Java Class) to determine the initial set of values believed to be useful to the classi cation of the compiler used. Once the le properties to be used were determined, an automated method was used to condense this information into training sets. A training set was produced for each set of 1 les created by the compilers. For example, there are three di erent compilers used to produce Windows PE les. Each set of Windows PE les has a training set containing the desired le properties. All of the training sets were given to an optimization method to determine the properties that are actually useful to the classi cation of the compiler used. This method was repeated for the Linux ELF and Java Class binary le types. Two di erent optimization methods were used to optimize the neural network pa- rameters, similar to the ideas suggested in [22] for the optimization of weights in a Feed-Forward Neural Network using a Particle Swarm Optimizer. Experiments were developed to test the e ectiveness of the system on classifying unseen executable le instances, recording metrics of success rate and average distance to desired output. The output of the optimization method was a nal training set with each entry only containing the actual le properties useful for classi cation. In addition, the user of the optimization method receives a list of le properties used, success rate and neural network parameters. 
The nal training set and the model parameters are incorporated into a reverse engineering architecture and demonstrated to show how the knowledge of the compiler used can alter a reverse engineering architecture?s behavior. The second contribution of this research provides a general reverse engineering ar- chitecture. The generic reverse engineering architecture contains the following elements: 1. Memory Map data structure to represent a load segment of memory. 2. File readers for the Windows PE, Linux ELF and Java class binary le formats. 3. Meta information data structure for recording discovered properties of the target binary executable and utilizing this information to alter run-time behavior. 4. Generic processing components for performing reverse engineering analysis. 5. Mechanism for exible arrangement of discovered generic processing components. 6. XML based scheme for representing combinations of processing components. 2 Libreverse is a generic reverse engineering architecture created by the author that implements these elements. The approach taken in this work was to utilize the Libre- verse architecture to analyze three of the most popular sets of binary le types in use today. 1.2 Motivation of the Research This dissertation covers the areas of reverse engineering, software engineering and arti cial intelligence. Reverse engineering describes a wide range of topics and is classi- ed into two areas: software source code and executable binaries. Software source code reverse engineering focuses on the understanding of large amounts of programming in- structions. The use of reverse engineering in this area helps convert the input source code into a format (e.g. Uni ed Modeling Language - UML) that aids in the understanding of the software. In reverse engineering of executable binaries, the focus is on the conversion of a target binary le into a higher level language representation (e.g. C++ or Java) in order to perform actions such as source code recovery, malicious code detection and classi cation. More and more applications, libraries, and other types of programs are being exe- cuted in untrusted environments. These applications are exposed to malicious programs attempting to exploit some publicly known or newly discovered vulnerability in order to produce an unwanted action. These malicious and/or suspect programs can be installed on a system without the knowledge of the user. In order to understand the functionality of the programs without actually executing them, reverse engineering can be employed to investigate the behavior of these malicious programs. Companies wishing to obtain a higher-level of assurance on their systems or to recover lost programming source code will utilize reverse engineering. The key motivations of this research were to experiment with methods that: 3 help detect properties in executable programs. The rst property was the compiler used to produce Java Class, Windows PE and Linux ELF executable. The second property was the compression method used for Windows PE executable. produce a generic reverse engineering architecture utilize these detected properties to alter the run-time behavior of the generic reverse engineering architecture to eliminate the need for customized components. 1.2.1 Detecting properties in executable programs In order to more e ectively reverse engineer a program, an engineer would need to know as much information as possible about a suspect program. 
This information allows the engineers to customize the reverse engineering process to the suspect program, and thereby ensuring a higher degree of success than a generic tool can provide. The Libreverse architecture is a run-time con gured reverse engineering library which is designed to meet the goal of allowing the user to reverse engineer a binary program without having any prior knowledge about it. The value of adding compiler and compressed executable classi cation to Libreverse is that it allows the architecture to make better decisions rather than assuming one process ts all programs. 1.2.1.1 Compressed Executables Computer systems and software have a broad impact on the lives of millions every- day. Computers have long since stopped existing as mere word-processors and gaming engines; they help run our hospitals and defense systems, as well as provide support for a signi cant portion of all commerce in the world. As more and more critical systems become dependent on reliable computing, the dangers of system disruptions and failures increase [1]. As the proliferation of malicious software expands, the ability to classify applica- tions as safe or risky becomes of greater interest to both the computer security and computer forensics communities. \Risky" applications can arise from either deliberate 4 attempts to do ill-intent (as in viruses, worms, and the like) or merely poor software development (poor error-checking, use of risky functions, etc). Taking into account only software created with malicious intent, the number of known viruses today has skyrocketed into the tens of thousands. The number of identi ed malicious programs is surely only a subset of all viruses in the wild [56]. Viruses are also increasing in sophis- tication, including methods of encryption, compression, and mutation in order to hide their malicious intent. Utilizing these methodologies makes identifying malware with traditional signature-based techniques more di cult, since self-modi cation can change the signature of the application. Research has shown that virus variants altered by mu- tation, encryption, and obfuscation prove more di cult to identify by many commercial malware detectors [12]. This work seeks to contribute to reverse engineering research for identifying ma- licious applications through static analysis. Speci cally, the classi cation program at- tempts to identify if an application has capabilities for self-modi cation (typically en- cryption and compression), which can, in turn, make reverse-engineering the application easier. Further, if the capability for compression exists, the system attempts to classify, based on a number of the properties of the executable, which type of compression was used. This approach can be advantageous in practice, since it does not require actually executing potentially dangerous software in order to analyze it. Further, the approach does not require the source code of the application to be analyzed, since source code is not readily available for analysis in a great deal of software (even when considering fairly optimistic predictions made by open source enthusiasts [68]). In order to perform the analysis, rst, information is extracted from the Windows Portable Executable (PE) headers [39] from many common Windows executable ap- plications. Then, a General Regression Neural Network (GRNN) [53] learns from the training data and attempts to estimate the likelihood of compression of a previously un- seen application. 
Once a training set consisting of le summaries is constructed along with the general regression neural network, two di erent optimization methods are used to optimize the neural network parameters, similar to the ideas suggested in [22] for 5 the optimization of weights in a Feed-Forward Neural Network using a Particle Swarm Optimizer. Experiments are designed to test the e ectiveness of the system on classify- ing unseen application instances, recording metrics of success rate and average distance to desired output. The structure of the training set and the network parameters are adjusted for the more di cult problem of identifying which of three popular modern compression utilities was used to compress a given application. 1.2.1.2 Is there a situation beyond reverse engineering where compiler and compressed executable detection would be useful? Consider the situation where a company wants to install a particular application on a server that will be used for secure transfer and manipulation of private customer health records. The company wants to be assured that the application risks are known and their impact is minimized. In order to do this, the company would need a tool that could identify which compiler was used to produce the binary computer application. This information would enable the company to investigate the compiler for any defects known to be security vulnerabilities. They may want to know if the compiler uses any standard functions known to be insecure. These functions could be used in the application either directly linked into the application or through the use of a shared library. If the company has this information it would take appropriate action to mitigate the risks. A second situation where compiler classi cation is important is when attempting to reverse engineer binary programs. In compilation, certain semantic information is lost transforming the programming instructions from a certain language into an intermedi- ate language like Single Static Assignment (SSA) [17] or Register Transfer Language (RTL) [49]. Various compilers use one or more of the intermediate languages when compiling applications. So in order to decompile an executable a researcher would need to know which one of the compilers was used to create the program. This information would be used to customize their approach. For example, GNU Compiler Collection (GCC) was originally designed to use RTL as its intermediate language. This was true 6 for compilers created before version 4.0. After 4.0 GCC used SSA for representing the instructions during the optimization phase before being translated into RTL for the nal conversion to binary instructions. Therefore if researchers found out that the compiler used was GCC 3.3 they would know that they would have to customize the process to recover the semantic information lost from conversion to RTL. Otherwise if the com- piler were GCC 4.2 or later then they would know they would have to deal with the translation to SSA and then to RTL. Another situation is where a company is attempting to recover the source code for a legacy application. The company is unsure if the application was protected with any a compression utility. In order to be assured that the instructions contained in the legacy application are not the decompression instructions the company would rst use a tool to detect which compression utility, if any, was applied to it. If no compression utility was detected then the company could be assured the instructions are from the legacy application. 
Otherwise, the company would have to use another tool to decompress the legacy application in order to recover the original binary instructions. 1.2.1.3 How can machine learning be used for classifying? Given the evidence of the past successes in the area of arti cial intelligence (AI) methods being applied to computer defense [7], this work seeks to take advantage of machine learning algorithms (a subset of AI). Neural networks have been applied to a large array of classi cation and regression problems. They also have been shown to be capable of generalizing from a small, representative training set [28]. For example, in [58], a neural network was implemented to classify and prevent boot-disk viruses. Neural networks have also proven useful for SPAM detection and ltering [13], for analyzing user behavior for anomalies [57], and in intrusion detection [31]. 1.2.2 Generic reverse engineering architecture While custom solutions to speci c problems typically perform very well, they fail to be useful beyond the scope of the original problem. Generic reverse engineering systems 7 are necessary for reverse engineering source code les [30][9] and binary applications [14] to aid in software quality assurance, code review and understanding of complex software systems. This work seeks to construct a generic reverse engineering architecture in a way that allows for rearrangement of components in a exible manner. This means components of the generic reverse engineering architecture must be designed to allow for information to be passed without knowledge of the producer or consumer. Also each component must be designed to perform a fundamental operation that cannot be reduced any further. 1.2.3 Modi cation of the run-time behavior of generic reverse engineering architectures Given the problems with speci c reverse engineering solutions, this work seeks to construct a mechanism which will allow the transfer of secondary information, called meta information, between components. The meta information will be used be each component to alter its behavior. For example meta information will allow a component to perform algorithm X when the X meta information is present otherwise a default algorithm is utilized. 1.3 Summary There are a limited set of tools available to an organization to conduct reverse engineering work and research. What is needed is a generic reverse engineering tool, used in research and production, that can use detected properties from the binary applications to make produce better results than assuming one process ts all programs. 8 Chapter 2 Background and Literature Review The purpose of this chapter is to provide background information and literature review on the di erent uses of reverse engineering. Some of the techniques used in compiling programming languages to machine instructions use the same data structures as the reverse engineering process. Therefore, it is necessary to review techniques used in both. 2.1 Reverse Engineering De nitions Compiler: A computer application that transforms instructions written in a programming language into instructions for a target platform architecture. Decompiler: A binary computer application that performs the decompiling of a target binary computer application. Decompiling: The act of transforming instructions for a target platform archi- tecture into the high-level representation of a programming language. 
Disassembler: A binary computer application that performs the transforming of executable instructions for a target binary computer application into an assem- bly language. The assembly language used depends on the CPU architecture of the binary computer applications. (E.g. x86 binary disassembler produces x86 assembly language) DOS: Disk operating system for personal computers. One example is MS-DOS. A 16-bit operating system released by Microsoft. Executable Binary: A computer application stored in a binary format of an operating system used by a user for a particular task. 9 Reverse engineering: A standard de nition for reverse engineering was given by Chikofsky and Cross [11] in 1990: Reverse engineering is the process of analyzing a subject [software] sys- tem to: { identify the systems components and their interrelationships and { create representations of the system in another form or at a higher level of abstraction Self-modifying Programs: An executable binary that hides the actual instruc- tions that will be executed through encryption and/or compression. XML: Extensible Markup Language (XML) is a simple, very exible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere [65]. 2.2 Background In order to fully understand how the compressed executable and compiler classi ca- tion research use the information in binary le format headers it is necessary to explain the di erent le formats. 2.2.1 Windows Portable Executable (PE) The PE format is the main executable format for binary computer applications for the Windows operating system. It maintains backwards compatibility with MS-DOS for software applications. For the sake of brevity only the DOS, PE, optional and section headers are described. As can be seen in Figure 2.1 and Table 2.1, a PE le requires overhead to allow it to be portable amongst the di erent versions of Windows. The rst section is always a DOS header, often referred to as the DOS stub, which is a small application that runs 10 Figure 2.1: PE File Header 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 DOS 2.0 Header (Includes base of the PE image header) COFF File Header Section Headers Table 2.1: Typical Portable EXE File Layout under MS-DOS and prints the message: \This program cannot be run in DOS mode." Normally, having such a thing at the beginning of an executable would likely cause the application to fail to run. However, as stated in the o cial documentation of the PE format, the le o set to the PE signature is placed at location 0x3c within the stub, which allows the image to be run by Windows even with the stub present. [39] 2.2.1.1 DOS Header The DOS header, in most cases, is only included in an executable program to provide the MS-DOS signature and the location of the PE header. The signature of a DOS header is MZ which is used by the Windows loader to know it is at least dealing 11 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 Machine NumberOfSections TimeDateStamp PointerToSymbolTable NumberOfSymbols SizeOfOptionalHeader Characteristics Table 2.2: Fields of the PE COFF File Header with a DOS program. In most versions of Microsoft Windows (e.g. Vista, XP, 2000 Professional) today the loader skips to the location in the binary le where the PE header begins. 
If the program was executed in DOS mode, a string is printed saying that the application cannot be executed in this mode. 2.2.1.2 PE Header 2.2.1.2.1 Common Object File Format (COFF) File Header : The PE COFF File Header, as shown in Table 2.2 is composed of a signature and the following elds. The signature is used to identify the kind of Windows le. In previous versions of Windows 3.0 and 3.1 operating systems the signature was marked by the four-bytes \NE00", but the most recent version of Windows (e.g. Vista, XP, NT), the signature is marked by the string \PE00". The following list are COFF header properties [39] that could be used during the classi cation process: Machine: The number that identi es the type of target machine (e.g. x86, PowerPC) NumberOfSections: The number of sections. This indicates the size of the section table, which immediately follows the headers. TimeDateStamp: The low 32 bits of the number of seconds since 00:00 January 1, 1970 (a C run-time time t value), that indicates when the le was created. PointerToSymbolTable: The le o set of the COFF symbol table, or zero if no COFF symbol table is present. [39] 12 NumberOfSymbols: The number of entries in the symbol table. This data can be used to locate the string table, which immediately follows the symbol table. SizeOfOptionalHeader: The size of the optional header, which is required for executable les but not for object les. Characteristics: The ags that indicate the attributes of the le. The elds of interest are the number of sections and the characteristics. There should be at least two sections in a le to denote the code and data sections of a binary computer application. Not all compilers follow the same arrangement of sections. The number of sections lets the security researcher know at least how many sections are contained in PE le. The characteristics eld is there to let the Windows operating system know if it is reading in an EXE versus a DLL. 2.2.1.2.2 Optional Header : There are many elds in the optional header, as seen in Figure 2.3, that give information on the kind of binary program contained in a Windows PE le. The rst indication is the Magic eld. This tells the Windows binary loader that it is dealing with a 32-bit version (PE) or a 64-bit version (PE+) Windows program. This information cannot be falsi ed in a system because there is a slight di erence in how certain elds would be interpreted by the Windows binary loader. For example the BaseOfData eld is present in the Windows PE (32-bit version) but not in the Windows PE+ (64-bit version). Falsifying the Magic eld will cause the Windows loader to improperly read the binary computer application from the system disk possibly crashing the Windows loader. There are only two accepted hex values stated in the PE speci cation [39] of ?0x10b? for Windows PE and ?0x20b? for Windows PE+. The following are Optional Header properties [39] that could be used during the classi cation process: Magic: The unsigned integer that identi es the state of the image le. (0x10B = PE), (0x107 = ROM image), (0x20B = PE32+) 13 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 Magic MajorLinkerVersion MinorLinkerVersion SizeOfCode SizeOfInitializedData SizeOfUninitializedData AddressOfEntryPoint BaseOfCode BaseOfData Table 2.3: Fields of the PE Optional Header MajorLinkerVersion: The linker major version number. MinorLinkerVersion: The linker minor version number. 
SizeOfCode: The size of the code (text) section, or the sum of all code sections if there are multiple sections. SizeOfInitializedData: The size of the initialized data section, or the sum of all such sections if there are multiple data sections. SizeOfUninitializedData: The size of the uninitialized data section (marked by the section name BSS), or the sum of all such sections if there are multiple BSS sections. AddressOfEntryPoint: The address of the entry point relative to the image base when the executable le is loaded into memory. For program images, this is the starting address. For device drivers, this is the address of the initialization function. An entry point is optional for DLLs. When no entry point is present, this eld must be zero. The only remaining elds in the optional header that are of interest are the table pointer as seen in Table 2.4 and described in the list below [39]. The table pointers give the relative virtual address in where each table is located in memory and its size. These tables are used by the Windows binary loader to prepare the binary program for execution. There is one table in particular of interest called the Import Table. This 14 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 Export Table Import Table Resource Table Exception Table Certi cate Table Base Relocation Table Debug Architecture Export Table Global Pointer TLS Table Load Con g Table Bound Import IAT Delay Import Descriptor CLR Runtime Header Reserved Table 2.4: PE Optional Header Directories table tells what DLLs are needed to run the binary program plus what functions in them are accessed. It is important to note that while these tables are listed in the optional header there does not need to be a matching section header for each one. What can happen is that certain tables, like the import table, can be packed into another section. So the only way to be sure of the presence of the Import Table in a Windows PE le is to rst use the Import Table pointer and size if they are non-zero values. If either is a zero value then a search of the section headers is required to nd the Import Table. The tables listed in the Optional Header may provide details useful to classifying a certain property but certain tools may desire to hide them from analysis. Export Table: The export table address and size. This table describes the exported symbols that can be used through dynamic linking. Import Table: The import table address and size. This table describes the imported symbols from other dynamically linked libraries. 15 Resource Table: The resource table address and size. The resoure table is a tree base data structure describing information used by the application like icons and version information. Exception Table: The exception table address and size. The exception table contains a list of function table entries used for exception handling. Certi cate Table: The attribute certi cate table address and size. The cer- ti cate table contains digital certi cates, like Authenticode signatures [40], which are used by a veri cation policy, e.g. Win32 WinVerifyTrust function [40], to determine the origin and integrity of an application. Base Relocation Table: The base relocation table address and size. The base relocation table is used to adjust base relocations if the operating system loader cannot load the image at the desired location stated in the PE header. Debug: The debug data starting address and size. The debug data contains all the compiler generated debug information. 
Architecture: Reserved for future use, must be 0. Global Pointer: The RVA of the value to be stored in the global pointer register. The size member of this structure must be set to zero. According to Pietrek [47] the global pointer is \a preassigned value for accessing data within a load module". Further Pietrek says that \On the IA64, each instruction is 41 bits in length. As such, it?s impossible for a 41-bit IA64 instruction to contain a complete 64-bit address. Instead, all memory accesses must be done by using a 64-bit pointer value loaded into a general-purpose register. When working with data de ned within a load module, the global pointer makes it possible to put the target address that you want into a register using a single instruction." TLS Table: The thread local storage (TLS) table address and size. The TLS table \provides direct PE and COFF support for static thread local storage (TLS). 16 TLS is a special storage class that Windows supports in which a data object is not an automatic (stack) variable, yet is local to each individual thread that runs the code. Thus, each thread can maintain a di erent value for a variable declared by using TLS." [39] Load Con g Table: The load con guration table address and size. The load con guration table was used to describe information too big to be described in the le header but it is presently used by Windows XP and later version to contain structured exception handlers. Bound Import: The bound import table address and size. The bound import ta- ble contains a list of precomputed real addresses for each of the imported functions used from DLLs [37]. Delay Import Address Table (IAT): The import address table (IAT) address and size. The IAT is used to support a uniform mechanism for applications to delay loading of a DLL until the rst call into that DLL. Delay Import Descriptor: The delay import descriptor address and size. Al- lows access to the delay IAT to update the entry point function pointers that reside in the data section of the image and initially refer to the delay load thunks. A thunk is a mechanism which allows a 16-bit application to access 32-bit DLLs and vice versa. [38]. Common Language Runtime (CLR) Runtime Header: The CLR run time header address and size. The CLR is used to indicate that a object le contains managed code. Reserved: must be zero [39] 2.2.1.2.3 Section Header : Each section header, as shown in Table 2.5, has a name that is supposed to be descriptive of the contents of the section. The section header information was derived from [39]. While there are prede ned traditional names 17 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 Name VirtualSize VirtualAddress SizeOfRawData PointerToRawData PointerToRelocations PointerToLineNumbers NumberOf NumberOf Characteristics Relocations LineNumbers Table 2.5: PE Section Header for some sections there is no guidance as to what happens in the Windows binary loader when it deals with the names that deviate from the norm. In some suspected malicious les some utilities will create sections with names speci c to the utility. For example the PEtite program sometimes makes a section header which contains the decrypter titled as \.petite". Now this is a good indication that the binary program has been compressed with the PEtite program, but this has to be con rmed by other means. It is necessary to not assume anything about an application but to con rm each property. 
Just labeling a section as \.petite" does not mean the PEtite program was used to compressed the application. The following list of section header information [39] can be used for classi cation: Name: An 8-byte, null-padded UTF-8 encoded string. If the string is exactly 8 characters long, there is no terminating null. For longer names, this eld contains a slash (/) that is followed by an ASCII representation of a decimal number that is an o set into the string table. Executable images do not use a string table and do not support section names longer than 8 characters. Long names in object les are truncated if they are written to an executable le. VirtualSize: The total size of the section when loaded into memory. If this value is greater than SizeOfRawData, the section is zero-padded. This eld is valid only for executable images and should be set to zero for object les. VirtualAddress: For executable images, the address of the rst byte of the section relative to the image base when the section is loaded into memory. For 18 object les, this eld is the address of the rst byte before relocation is applied; for simplicity, compilers should set this to zero. Otherwise, it is an arbitrary value that is subtracted from o sets during relocation. SizeOfRawData: The size of the section (for object les) or the size of the initialized data on disk (for image les). For executable images, this must be a multiple of FileAlignment from the optional header. If this is less than VirtualSize, the remainder of the section is zero- lled. Because the SizeOfRawData eld is rounded but the VirtualSize eld is not, it is possible for SizeOfRawData to be greater than VirtualSize as well. When a section contains only uninitialized data, this eld should be zero. PointerToRawData: The le pointer to the rst page of the section within the COFF le. For executable images, this must be a multiple of FileAlignment from the optional header. For object les, the value should be aligned on a 4byte boundary for best performance. When a section contains only uninitialized data, this eld should be zero. PointerToRelocations: The le pointer to the beginning of relocation entries for the section. This is set to zero for executable images or if there are no relocations. PointerToLinenumbers: The le pointer to the beginning of line-number entries for the section. This is set to zero if there are no COFF line numbers. This value should be zero for an image because COFF debugging information is deprecated. NumberOfRelocations: The number of relocation entries for the section. This is set to zero for executable images. NumberOfLinenumbers: The number of line-number entries for the section. This value should be zero for an image because COFF debugging information is deprecated. Characteristics: The ags that describe the characteristics of the section. 19 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 e ident e type e machine e machine e version e entry e pho e sho e ags e ehsize e phentsize e phnum e shentsize e shnum e shstrndx Table 2.6: ELF Header 2.2.2 Linux ELF The Linux kernel has had three variations for its binary le format over the years. The Executable and Linking Format (ELF) [62] format is the present generation binary le format used for all Linux binary programs For the sake of brevity only the ELF header, program header table and the section headers are covered. 2.2.2.1 ELF Header The ELF header, as shown in Table 2.6, starts with a unique identi er. 
The unique identi er is read by the Linux binary loader to con rm it is executing an ELF program. It will also read properties from the ELF header like which processor it was intended to run and where to start reading the program header and the section headers. These properties allow the binary loader to con rm the ELF program is capable of being executed on the system (e.g. x86 ELF program executed on a x86 linux host) and how to prepare the ELF program?s image in system memory. e ident: The initial bytes mark the le as an object le and provide machine- independent data with which to decode and interpret the le?s contents. e type: This identi es the object le type. e machine: This value speci es the required architecture for an individual le. 20 e version: This identi es the object le version. e entry: This gives the virtual address to which the system rst transfers control, thus starting the process. e pho : This holds the program header table?s le o set in bytes. e sho : This holds the section header table?s le o set in bytes. e ags: This holds processor-speci c ags associated with the le. e ehsize: This holds the ELF header?s size in bytes. e phentsize: This holds the size in bytes of one entry in the le?s program header table. All entries are the same size. e phnum: This holds the number of entries in the program header table. Thus, the product of e phentsize and e phnum gives the table?s size in bytes. e shentsize: This holds a section header?s size in bytes. A section header is one entry in the section header table; all entries are the same size. e shnum: This holds the number of entries in the section header table. Thus, the product of e shentsize and e shnum gives the section header table?s size in bytes. e shstrndx: This holds the section header table index of the entry associated with the section name string table. 2.2.2.2 ELF Program Header The ELF program header, as shown in Table 2.7 tells the Linux binary loader where to store the program in memory along with setting any speci c permission(s) for that segment of memory. p type: This tells what kind of segment this array element describes or how to interpret the array element?s information. 21 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 p type p o set p vaddr p paddr p lesz p memsz p ags p align Table 2.7: ELF Program Header p o set: This gives the o set from the beginning of the le at which the rst byte of the segment resides. p vaddr: This gives the virtual address at which the rst byte of the segment resides in memory. p paddr: On systems for which physical addressing is relevant, this is reserved for the segment?s physical address. p lesz: This gives the number of bytes in the le image of the segment; it may be zero. p memsz: This gives the number of bytes in the memory image of the segment; it may be zero. p ags: This gives ags relevant to the segment. p align: Loadable process segments must have congruent values for p vaddr and p o set modulo the page size. This member gives the value to which the segments are aligned in memory and in the le. 2.2.2.3 ELF Section Header The layout of the ELF binary format is similar to the Windows PE format but there is an important di erence. In the Windows PE format, we mentioned that a 22 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 sh name sh type sh ags sh addr sh o set sh size sh link sh info Table 2.8: ELF Section Header section can be placed inside another section. This was allowed by the Table pointers in the Optional Header. 
With regard to ELF, there is no such list of tables pointing to places in the binary file for sections that must be loaded into memory. Any segment of information that must be handled by the Linux binary loader in a particular way needs to have its own section header. The ELF section header is shown in Table 2.8.

Table 2.8: ELF Section Header (fields in order): sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size, sh_link, sh_info

In the Windows PE format, the import table can be hidden by a utility preparing a self-modifying program inside another section. This type of subterfuge is not possible in the ELF format. Any shared libraries that need to be loaded must be described in the ELF section header called ".dynamic".

sh_name: This specifies the name of the section. Its value is an index into the section header string table section, giving the location of a null-terminated string.
sh_type: This categorizes the section's contents and semantics.
sh_flags: Each section supports 1-bit flags that describe miscellaneous attributes.
sh_addr: If the section will appear in the memory image of a process, this gives the address at which the section's first byte should reside. Otherwise, this field contains 0.
sh_offset: This value gives the byte offset from the beginning of the file to the first byte in the section.
sh_size: This gives the section's size in bytes. Unless the section type is SHT_NOBITS, the section occupies sh_size bytes in the file. A section of type SHT_NOBITS may have a non-zero size, but it occupies no space in the file.
sh_link: This holds a section header table index link, whose interpretation depends on the section type.
sh_info: This holds extra information, whose interpretation depends on the section type.

Table 2.9: Java Class File format (items in order): magic, minor version, major version, constant pool count, constant pool items #1...#N, access flags, this class, super class, interfaces count, interface #1...#N, field count, field info #1...#N, methods count, method info #1...#N, attributes count, attribute info #1...#N

2.2.3 Java Class

A Java Class file, as shown in Table 2.9, is different from the other two formats, the Linux ELF and Windows PE formats, in that every piece of information is contained in one structure. There is no concept of sections, segments, or any data structure other than the Class file data structure.

2.2.3.1 Class Header

magic: The magic item supplies the magic number identifying the class file format; it has the value 0xCAFEBABE.
minor version: The minor version of the class file.
major version: The major version of the class file.
constant pool count: The value is equal to the number of entries in the constant pool table plus one.
constant pool: This is a table of structures representing various string constants, class and interface names, field names, and other constants that are referred to within the Class file structure and its substructures.
access flags: The value is a mask of flags used to denote access permissions to and properties of this class or interface.
this class: The value of this item must be a valid index into the constant pool table. The object at this index must be a Constant Class Info type.
super class: For a class, the value of the super class item either must be zero or must be a valid index into the constant pool table. For an interface, the value of the super class item must always be a valid index into the constant pool table.
interfaces count: The number of interfaces in the class.
interfaces: Each value in the interfaces array must be a valid index into the constant pool table.
Each object in the constant pool referenced this way must be of the Constant Class Info type.
field count: The number of fields in the class.
methods count: The number of methods in the class.
attributes count: The number of attributes in the class.

The "major version" and "minor version" are used by the Java Virtual Machine (JVM) to determine whether it is capable of running the Java byte code. The Java Class file might have a higher version number than the JVM is capable of executing properly. The JVM starts by reading the constant pool, similar to a table of contents, for validating, initializing and preparing the class file for execution.

2.2.3.1.1 Constant Pool: The constant pool is a list of all the classes, interfaces, fields, methods and data that can be used in the Java Class file. Each item in the constant pool is a variable-length constant pool Info object. Each object has an 8-bit tag describing the kind of constant pool item it is, along with a series of bytes giving the information for that object. The Java Virtual Machine (JVM) reads all this information into a run-time constant pool for use later.

It is interesting to note that the Java Class file explicitly separates the instructions of a program from its data. Each method info structure contains a list of instructions stored in the Code Attribute. This attribute is only allowed to contain valid 8-bit Java byte code instructions. No data is allowed to be stored in this attribute. A method can reference constant values found in the constant pool for use in the function. Any data used by the method will be found either in the constant pool, in the stack frame or on the heap. The validation process of the JVM that validates all the data structures in the Java Class file follows a strict set of rules. The discussion of these rules is beyond the scope of this dissertation.

2.3 Compressed Executables

Malicious code present in an executable has the capacity to oppose the security aims of confidentiality (by revealing information against the wishes of the owner), integrity (by disrupting or altering data), and availability (by rendering systems unusable) [4]. Therefore, it is both natural and imperative to seek to root out and defend against such attacks. There are a variety of methods that have been proposed and applied (with varying levels of success) in order to improve the classification of and defense against malware. The general components of defense systems, as presented in [43], include:

1. Data collectors/information sources, which provide information on normal and abnormal entities.
2. Analysis components, which look at new instances and attempt to determine whether or not they are a threat.
3. Response components, which take action against would-be threats.
4. Testing components, which attempt to expose vulnerabilities in the system.

Data collectors typically provide information to the analysis components, which in turn provide detection capabilities to the response components. The response components then take appropriate action against a given threat. Finally, the testing components identify weaknesses in the system and provide information to the data collector and analysis components.

One focus of this work is to improve analysis capabilities by providing information about executable files, even though little information concerning those files may be known. While techniques exist for disassembly and reverse compilation of executables, different tactics must be taken for self-modifying files.
If a reverse engineering framework has the capacity to know exactly the type of file with which it is dealing, then the process of reverse compilation becomes simpler. The framework can take action using the discovered knowledge of the file to recover the hidden instructions. Otherwise the framework has to take a general approach which may or may not work.

Reverse engineering a file supports digital forensic investigation since it allows an analyst to determine the intention of a malicious file without actually executing the file. Additionally, knowing that an executable has been compressed can alert an investigator to suspicious activity on a machine under investigation. Normally most executables are stored in an uncompressed format on a system since space is not an issue. Embedded system designers may choose to compress their applications since the target device often has limited storage space. Therefore, when an investigator finds a compressed executable on a system that does not normally need one, it should alert them to a potential problem.

Static analysis of files typically focuses on determining program properties and behavior by examining software without running it, while dynamic analysis executes a program repeatedly under known conditions and then observes the results [24]. There are advantages to both approaches. In static analysis, program execution must be abstracted and modeled, so the results may be less specific in order to preserve correctness. Put simply, it may be difficult to determine exactly what a program does in all situations without actually running it. There are other complications, as illustrated in [66]:

Tractability: increasing program size dramatically increases the time required for static analysis.
Decidability: it is impossible to know how many times a loop will be executed at run time; it is equally impossible to know how many times a function may be called recursively.
Pointer Aliasing: in languages that are not type-safe (C, for example) it is possible to have many different variables accessing some buffer in memory, and it is infeasible to gain an accurate listing of such variables.

In dynamic analysis, while it is typically easier to get a clear picture of what the program does given specific test cases, it is generally unreasonable to expect to be able to exhaustively test all possible executions. Worse still, in the domain of information security, it is highly undesirable to execute potentially malicious software many times in order to examine its behavior. Thus, we will largely confine our discussion of related work to static analysis and reverse engineering schemes.

2.4 Survey of Decompiling Architectures

This section covers the different kinds of decompiling architectures. There are two types of decompiling architectures covered in reverse engineering research: Section 2.4.1 covers decompiling architectures for x86 binaries and Section 2.4.2 covers those for Java Class files.

2.4.1 x86

Sung and Mukkamala [55] demonstrate the complexity of reverse engineering code in general. It is highly desirable from a security standpoint to be able to have access to the source code of a potentially dangerous executable. Some decompiling architectures focus on binaries that are compiled for the Intel x86 architecture. The research listed below investigated methods to reverse engineer x86 binary programs into a form of high-level output (e.g. the C programming language).
2.4.1.1 Cifuentes

Summary: Cifuentes [14] illustrates one example of an x86 decompiling architecture that transforms x86 executable code into high-level C language code. The stepwise process first converts the application to assembly. Secondly it converts the assembly output into RTL form in a control flow graph. Finally it converts the RTL in the control flow graph into a high-level form of RTL. The output is then converted into the C programming language.

Cifuentes's work [14] demonstrates the complexity of reverse engineering code in general. It is highly desirable from a security standpoint to be able to have access to the source code of a potentially dangerous executable. However, while creating executables from source code is easily done with modern software, the reverse is not at all simple. Cifuentes illustrates a working process to move from executable code to high-level language code. Many of the steps are simple in theory but difficult in practice due to a variety of issues. Converting from binary to assembly can be difficult, since it is difficult to distinguish between data and instructions in the binary format. System-specific strategies make this possible but still difficult. Compression and encryption techniques make this effort even more difficult, especially if the compression or encryption method used on the file is not known (thus the motivation for this research). A further example of the complexity of this process comes in converting from assembly to high-level code. This is often made difficult by the fact that not all assembly instructions have an equivalent high-level representation. Further, differing choices of high-level language will lead to differing final representations. Even given these difficulties, Cifuentes manages to successfully use a variety of new and existing tools to effectively reverse engineer a number of programs of varying complexity. However, as is expected, there are issues with scalability, and such efforts have the potential to be somewhat fragile with respect to which executables can be reverse engineered.

Key Contributions: A three stage process for automatically converting DOS x86 executables into the C programming language:
The first stage is the "Input" stage, where the binary program is converted into a control flow graph containing instructions in RTL format.
The second stage is the "Analysis" stage, where the control and data flow analyses transform the control flow graph.
The final stage is the "Output" stage, where the contents of the control flow graph are converted into a file containing C programming language instructions.

Weaknesses: Due to a variety of issues, some of the above steps are simple in theory but difficult in practice. Converting from binary to assembly can be difficult since it is difficult to distinguish between data and instructions in the binary format. System-specific strategies make this possible but still difficult. Compression and encryption techniques make this effort even more difficult, especially if the compression or encryption method used on the file is not known. A further example of the complexity of this process comes in converting from assembly to high-level code. This is often made difficult by the fact that not all assembly instructions have an equivalent high-level representation. Further, differing choices of high-level language will lead to differing final representations. Even given these difficulties, Cifuentes manages to successfully use a variety of new and existing tools to effectively reverse engineer a number of programs of varying complexity.
However, as is expected, there are issues with scalability, and such efforts have the potential to be somewhat fragile with respect to which executable programs can be reverse engineered.

2.4.1.2 Boomerang

Summary: Boomerang [35] is a program which attempts to extend the dissertation work of Cifuentes. It is designed to work with either Intel Pentium or Sparc binary executable programs.

Key Contributions: Boomerang follows the same structure as Cifuentes's work:
It first converts an x86 (Linux/Windows), PowerPC (Linux/Mac), or Sparc (Solaris) executable to assembly using a disassembler.
Secondly, it translates the assembly into an intermediary language for the purpose of reducing the bulk of instructions by analyzing the semantics of the operations.
Next, it converts the intermediary language into arbitrary high-level code using a decompiler.
Finally, it post-processes the arbitrary high-level code into the C language for analysis and use.
This program was utilized to recover the core algorithm from a third party's binary application, along with almost all original class names and the complete class hierarchy [21].

Weaknesses: It suffers from the same scalability issues as Cifuentes's work. It cannot be altered to add new functionality to apply this solution to other problems without serious modification. It does not support compressed or encrypted executables.

Andromeda Decompiler

Summary: The goal of the Andromeda Decompiler [51] is to create a universal interactive decompiling environment. The idea is to take in a user's binary executable and produce output in a high-level language of their choosing. The program runs on the Windows platform and supports 32-bit Intel x86-compatible programs and C/C++ output.

Key Contributions: The two main features of the Andromeda Decompiler are:
Research and investigation of binary modules at the level of source code.
Partial or full restoration up to recompilable forms.

Weaknesses: All the features of the system were gleaned from the website for the project. There is no source code available for the project, so the internal workings of the decompiler could not be investigated. At the time of this dissertation, no work had been done on this project in four years, given that the last release was in 2005.

Static Detection of Malicious Code in Executable Programs

Summary: The static analyzer developed by Bergeron, Debbabi, Desharnais, Erhioui, Lavoie and Tawbi [3] converts a Windows PE executable file into an intermediate representation, creates a control flow graph from this information and analyzes the graph against a policy to determine if malicious code is present.

Key Contributions: The authors provide:
A three step process for converting a Windows PE into a control flow graph for analysis using the IDA32 disassembler.
- The first step processes a syntax tree representing the control structure of the program. Each basic block in the control structure contains assembly instructions produced by the IDA32 disassembler.
- The second step performs the control flow and data flow analysis on the control flow graph.
- The final step of the process is to apply a security policy to the control flow graph to detect malicious code.

Weaknesses: The solution is exclusively catered to x86 executable binaries in the Windows PE format. Its design is very similar to Cifuentes [14], altering only the method of producing the x86 assembly instructions. Whether or not the analyzer can be expanded to Linux and Java is not clear.
Analysis of Binary Programs

Summary: Kapil and Kumar [32] produced a decompiling system composed of a disassembler library, libdisasm, and a translation phase using the Stanford University Intermediate Format (SUIF). The system converts a binary program into an Intermediate Representation (IR).

Key Contributions: The SUIF is a:
Modular and extensible system that can describe a wide set of CPU instructions.
The system utilizes these two tools in a two pass structure that produces the assembly output of the binary program, which is processed as a control flow graph. The output of the program is in SUIF format for later conversion into a higher level language.

Weaknesses: The authors state that the system cannot handle input programs that contain pointers, recognize user defined structures, recognize a defined return type for functions, handle arguments to functions passed in memory, identify loops that contain break or continue statements, handle control flags changed by all statements, or recognize if-then-else block structures. It assumes only control statements alter control flags.

Designing an object-oriented decompiler

Summary: Desquirr [23] is an IDA Pro plugin that performs a series of transformations and analyses to produce high level output.

Key Contributions: A conversion method for translating assembly instructions into code in a high-level language.
It first translates 16- and 32-bit Intel 386 machine code produced by the IDA Pro disassembler into Register Transfer Language (RTL) form.
Secondly, it performs data and control flow analysis before finally producing source code written in the C programming language.

Weaknesses: Desquirr's author cites Cifuentes's work as a motivator for creating his solution but limits his work to using IDA Pro to reverse engineer x86 programs. Since IDA Pro provides a set of binary file formats and disassembly of the binary input, Desquirr utilizes this through the IDA plugin interface. Instead of starting the reverse engineering from the first byte of the code section, as done by the Linux program objdump, the program starts reading from the entry point of the target application. Desquirr uses the same analysis operations that Cifuentes uses in her tool. These tools separate code into basic blocks, perform live register analysis, find Def-Use chains, perform register copy propagation, find functions and finally produce assembly instructions broken up into individual functions. This solution does not support analysis of other architectures (e.g. PowerPC) or other file formats (e.g. Linux ELF).

Cobra: Fine-grained Malware Analysis using Stealth Localized-executions

Summary: Cobra is a dynamic fine-grained malicious code analysis framework that executes in kernel mode. The goal of the project is to support multi-threading, self-modifying / self-checking code and any form of code obfuscation in both user and kernel mode on commodity operating systems [64].

Key Contributions:
A method for breaking up code execution into blocks for analysis.
Support of self-modifying code through the use of xfer-stubs, which allow Cobra to always have control at the end of a block's execution.
Selective execution of code based on user defined points.
A call back feature for applications built on its API.

Weaknesses: While the method of analysis for malicious programs is an excellent solution, it is not scalable. The problem lies in the possible combinations of architectures and operating systems.
The Microsoft Windows family of operating systems mainly targets the x86 architecture for home and business computers. The Windows operating systems also run on embedded products. In order to support the analysis of malicious code in all possible combinations, Cobra would have to be ported to each of the possible operating systems. It also relies on the host operating system to prepare the malicious program for execution, and it actually executes each of the instructions. While dynamic analysis of malicious programs is more powerful than static analysis, there is an inherent risk in actually executing the instructions on the host operating system, even in a VM environment. The risk of executing malicious programs is that it is not known which, if any, of the host and network services will be used.

Detecting Malicious Code by Model Checking

Summary: Model checking can be used to provide a flexible way to detect malicious code patterns in executables [33].

Key Contributions: The authors contribute:
Computation Tree Predicate Logic (CTPL), which extends Computation Tree Logic (CTL) [16]. It is as expressive as CTL but allows a precise way to represent malicious code patterns.

Weaknesses: CTPL provides a way to represent malicious code precisely, describing what a program should not be doing. To fully exercise this method, all possible "malicious" patterns would need to be determined. In addition, the absence of false positives or negatives would have to be assured. The authors do not provide any tests where normal programs, not containing malicious code, were processed by the system. There is no proof given that the system will not accidentally classify something as malicious code, nor incorrectly classify a malicious program as virus free. It is uncertain that this solution would be capable of being used for Java binaries, which use different types of assembly operations.

2.4.2 Java Class

Java program files are platform independent files compiled to Java opcodes to run on a Java Virtual Machine (JVM). Decompiling programs convert the Java opcodes either to Java assembly or back into Java source files. The following sections describe various solutions for decompiling Java Class files into source files.

Sun Disassembler (Javap)

Summary: The Javap program is a part of the Sun Microsystems JDK that will disassemble the input class file to Java assembly. Since it is packaged within the JDK, it is safe to assume that the expected input is one that was compiled with the JDK compiler (javac).

Key Contributions: The Javap program contributes:
A Java disassembler which converts a Java class file into a Java source file containing Java assembly instructions.

Weaknesses: Javap does not produce high-level Java source code, which can rapidly aid understanding of a suspect Java Class file. Javap produces an output file containing Java assembly instructions from a given Java Class file. Assembly instructions are one step up from the 1s and 0s of a CPU or virtual machine instruction. It is not very easy to identify high level programming constructs, for example a complex if statement, when looking at the assembly instructions.

Dava decompiler

Summary: The Dava [41] decompiler project is a multiple stage decompiling process. It first constructs a typed list of all the statements found in the Java Class file. Secondly, it constructs a control flow graph from the output of the first stage.
Finally, it performs analysis on the input to produce the high level output.

Key Contributions: The authors provide:
A data structure called the Structure Encapsulation Tree (SET) that contains the Java structures as the file is reverse engineered.
Six stages of transformation:
1. Synchronized Blocks, which contain the regions of instructions of a synchronized() block.
2. Loop Detection (e.g. while)
3. Remaining Conditionals identification (e.g. if)
4. Exceptions
5. Statement Sequences
6. Continues, Breaks, and Arbitrary Labeled Breaks

Weaknesses: The authors dive straight into the decompiling. They do not cover which compiler was used. The basic problem with their work is that it is not obvious how these same techniques would apply to other executable binaries compiled for different architectures (e.g. an x86 program in the Windows PE format).

In the course of explaining their project, the authors talk about what has previously been done in the area of reverse engineering architectures. They discuss Cifuentes's work as an example of restructuring an input binary to source code based on gotos. Cifuentes was targeting C as an output programming language, hence the use of goto statements to handle loops in the program. The Dava authors mention that Cifuentes's approach would require modification to handle Java. This is true because C binary executables and Java Class files are vastly different. Cifuentes's input phase could not handle the Java Class file. It would require a new front-end to produce a new control flow graph for the rest of her architecture. The same argument can be applied to Dava because it cannot handle C binary executable programs. Each group of researchers has produced a customized solution that cannot be easily altered. So the Dava authors have little ground to argue that they are doing more. Both groups scoped their problem in order to make a tractable solution.

The fast JAva Decompiler (JAD)

Summary: JAD [34] is a Java decompiler written in C++ that is used to produce source code or Java assembly output from a valid Java Class file.

Key Contributions:
Enhanced readability of the generated source code.
Ability to comment Java source code with JVM byte codes, useful for verification and educational purposes.
Full support for inner and anonymous classes.
Fast decompilation and simple setup.
Automatic conversion of identifiers garbled by Java obfuscators into valid ones.

Weaknesses: While JAD can decompile Java Class files into Java source code, it has a list of limitations. It cannot handle zip or jar files as input to the system nor make use of the Java Class hierarchy. It has trouble with nested classes that contain constructors with extra arguments added by the Java compiler. Also, it makes no mention of determining the compiler used to create the input class file. JAD ignores the line number table and source file attributes in the class file. Finally, JAD has trouble with in-lined functions.

DJ Java Decompiler

Summary: The DJ Java Decompiler [45] is a Java decompiler and disassembler. It will recreate a file containing equivalent source code from a class file. It is a front-end application that utilizes the JAD decompiler as its decompiling engine. It makes no mention of determining the compiler used to create the input class file.

Key Contributions: The DJ Java Decompiler is a:
Stand-alone Windows program able to decompile Java Class files without requiring the Java JDK to be installed.
It also provides an editor to allow changing the generated Java code.
Weaknesses: The design is only able to handle Java Class files and not any other executable binaries.

Reverse Engineering Compiler (REC)

Summary: REC [10] is a graphical decompiler that can handle Intel x86, 68k, PowerPC and MIPS R3000 programs for ELF, COFF, PE, AOUT, Playstation PS-X and raw binary data files.

Key Contributions: REC is a:
Graphical decompiler that can handle Intel x86, 68k, PowerPC and MIPS R3000 programs for ELF, COFF, PE, AOUT, Playstation PS-X and raw binary data files.

Weaknesses: Since the source files for REC are not publicly available, there is no way to understand the internal behavior of REC. REC is designed to best handle programs that were originally written in the C programming language and compiled with the debug options enabled. If a target executable binary does not have debugging information present, it cannot produce good output.

Decompiler for TurboC (DisC)

Summary: DisC [36] is a decompiler specialized for the TurboC compiler.

Key Contributions:
It only handles MS-DOS programs originally written in the C programming language and compiled with the TurboC compiler. DisC is a prime example of a decompiler written specifically for a certain compiler. DisC will produce TurboC output that only requires a few modifications by the user to compile the resulting source code.

Weaknesses: It does not recognize floating point code and strings. It is not 100% automated and is limited to TurboC. Most decompilers focus on converting instructions for a particular processor into assembly language. They are not typically customized to decompile an application produced by a particular compiler.

Java Optimize and Decompile Environment (JODE)

Summary: JODE [29] is an optimizer for Java Class files that removes debug information, renames objects like classes or variables, and removes dead code.

Key Contributions:
It can be used to decompile Java Class files.
It can optimize Java Class files.

Weaknesses: It is limited to Java Class files that do not contain Java generics.

Mocha

Summary: Mocha [63] is a Java Class decompiler originally written by Hanpeter van Vliet.

Key Contributions: Mocha contributes a:
Mechanism to decompile Java Class files into Java source code.

Weaknesses: It is limited to Java Class files that do not contain Java generics. The solution was created before Java generics were a part of the Java programming language. The source code for this project cannot be obtained due to the author's death in 1996. When JAD was applied to Mocha's Java Class files, the resulting code was determined to be obfuscated. Any further analysis into how Mocha works would require a lengthy reverse engineering process.

2.4.3 Static Slicing on Binary Executables

Summary: In [4], the authors propose a process for static slicing of binary executables in order to detect malicious code, very similar to the approach offered in [14]. The difference is that in [14] the author was seeking the source code as the end product, whereas in [4] the authors are interested in the identification of security vulnerabilities in software. However, this approach is intended for use on commercial off-the-shelf software, where the source code is unlikely to be available directly for use in analysis.

Key Contributions: Their methodology is as follows:
1. Using disassembler software to reduce binary code into an assembly equivalent.
2. Transforming the assembly code into a corresponding high-level representation, which maintains the semantics of the code while improving the quality of the analysis.
3. Using the high-level representation, they attempt to identify potentially malicious portions of the code. This serves to reduce the amount of code which must be further analyzed (thus reducing the complexity of the overall file analysis).
4. Examining the potentially malicious slices identified before.

Weaknesses: While this approach does provide a methodology for dealing with software when the source code is not available, it does have certain weaknesses. First, it is highly dependent on the availability and capability of tools, such as disassemblers and assembly-to-source translators, to be able to successfully perform analysis. In practice, these tools may not always be available, often have difficulties with scalability, and tend to be very system-specific. Further, these strategies may not be effective at all against purposefully altered executables (such as those that have been compressed, obfuscated, or encrypted), since they depend on disassembly.

2.4.4 Static Analysis on Executables without Decompiling

Analyzing an executable file directly without decompilation improves the robustness of the solution and can provide greater insight than static analysis of the source code.

2.4.4.1 Related Work on Windows Portable Executable Files

Several of the other approaches surveyed depend either on the availability of source code directly or on the ability to reverse-compile such source code. However, as shown by [14], this is a non-trivial effort. Further, in a security environment, this process may not be at all practical for file analysis. To this end, the authors in [59, 60] attempt to extract useful security information directly from Windows PE executables without decompilation or execution. The motivation behind this work is that even when source code is obtainable, it may not always reveal the vulnerabilities in the executable. For example, malicious behaviors may be inserted into the executable post-compilation (many viruses take this approach). Tevis automatically extracts information from the PE file to produce a file synopsis. Then, anomaly detection is done by a series of discovered rules:

Checking for mismatches in table sizes as reported by the optional headers versus the actual size in memory.
Checking for large blocks of memory unaccounted for in the header (which may potentially be used for holding data in a self-modifying file).
Checking for sections in the Windows Portable Executable section header which are both writable and executable (also potentially used for self-modification).
Analyzing the symbol table for strings of potentially dangerous functions (for example, notoriously unsafe C functions like gets()).
Identifying purposefully removed information in symbol tables and import tables, which may indicate an attempt to hide inappropriately placed data.

Tevis tested a set of thousands of Windows Portable Executable (PE) files (discussed at greater length in Section 2.2.1) for anomalies and found no shortage of vulnerabilities in a large array of even Microsoft system files. His work is particularly of interest since he demonstrates that a large amount of information can be extracted and interpreted from the PE header itself. This shows that patterns and anomalies can be observed directly from the executable. The downside of this research is that the work does not naturally translate well to other executable types or other operating systems. All trends and rules would have to be rediscovered for any other types.
Another problem, in terms of deployment, is that while the system attempts to identify anomalies, it does not attempt to draw many conclusions about those anomalies. Without further analysis, these anomalies would produce a high false-positive rate. They would constantly alert a user to potential problems with executables, whether or not an immediate security concern actually existed.

2.4.4.2 Related Work on Self-Modifying Virus Detection

Static analysis has been proven to work in the detection of self-encrypting and polymorphic viruses by the work presented in [42]. A self-modifying virus has its payload encrypted and carries code for decryption once it is in memory. This means that most virus detection utilities will be ineffective, as they cannot perform code checking in an attempt to determine the intent of the program. Polymorphic viruses are enhanced self-modifying viruses that were designed to avoid the fixed patterns often used by virus scanners/detectors. The authors of this work created a code simulation environment in order to analyze the actions performed by a number of programs. The authors used a simulator that was allowed to run virus code in a controlled environment in order to detect external API function calls. The authors also allowed for execution of these API function calls by OS emulation that uses stub functions instead of the real library functions. The results presented in [42] show that it is possible to use static analysis in order to find viruses that are self-encrypting and/or polymorphic. In a testing set where a majority of the instances (80%) had never been seen by the tool, and using only a handful of the policies that were defined, the system was able to detect malicious code with over 80% accuracy, and as high as 95% accuracy with four of the policies defined.

2.5 Machine Learning Related Work

In [58], the authors describe the use of a Neural Network to analyze the boot sector of a PC to determine if it was infected by a virus. In this application, their representation of a training instance was novel, since simply including all 512 bytes of the boot sector would be infeasible due to Bellman's Curse of Dimensionality [2]. Instead, they interpreted the boot sector as a series of approximately a dozen features (properties of the boot sector) and classified based on this set. Their primary focus was the minimization of false positives, since they were targeting a commercial audience. They used a training set of 200 known boot-sector viruses and roughly 100 executable viruses. The neural network used was trained with an iterative back propagation algorithm. This system resulted in a very low false-positive rate (less than 1%) and a modest false-negative rate (around 15%).

In [50], the authors attempt to identify malicious executables using a strategy of feature extraction from a large test set of binary executables. Training instance features include Dynamically Linked Library (DLL) file names used, DLL function calls made, numbers of function calls made to specific DLLs, and byte sequences appearing in the files. These test sets were used to train both Naïve Bayesian Classifiers and Multi-Naïve Bayesian Classifiers (probabilistic instance-based learners), which ultimately outperformed signature based malware-detection programs dramatically in terms of detection rates (although they demonstrate slight false-positive rates).
In [55], the authors attempt to reduce the number of elements in training instances (to be used for a Neural Network classifier) by assessing their importance. If smaller training instances can be produced without sacrificing accuracy, then this practice can improve the computational speed of the classifier when being used for classification (of course, this will improve the speed of training generally as well). Better still, by removing useless data from the learner, it can actually improve accuracy. In order to rank features, they use a feature selection scheme designed to maximize classification ability. This research is significant since it examines an effective mechanism for reducing the size of training instances (they generally saw as large as 20% reductions in the number of features without loss of classification accuracy).

2.6 Survey of Compiler Detection

A thorough search of unclassified sources and published literature was conducted, as demonstrated in this chapter. Queries were sent to the open source community and Richard Stallman [54] on the belief that someone had conducted this work. The answers received back showed this was not the case.

2.7 Generic Decompilation

Construction of a general decompiling / reverse engineering architecture that can be applied across a wide set of inputs and various problems has not been done. Similarly, detection of which compiler was used to create a binary executable has not been done. Clearly the previous work in this area shows that reverse engineering is possible to some degree. This work will show how previous reverse engineering solutions can be generalized.

2.8 Summary

Past research into reverse engineering has shown how specific solutions can solve certain reverse engineering problems. The solutions were catered to reverse engineering files written in a specific file format. Each file format provides a lot of information that can be utilized in a generic reverse engineering architecture to verify certain properties about executable files. Machine learning can take advantage of this information to classify and verify properties found in executable files.

Chapter 3
Analysis And Classification

3.1 Introduction

In this chapter, we describe the machine learning technique utilized, the inner workings of a General Regression Neural Network, and the optimization techniques used to optimize the parameters for the GRNN.

3.2 The General Regression Neural Network

Neural networks, in general, attempt to emulate the biological neural networks in the brain, which depend on massive parallelism and inter-connectivity in order to process information quickly. A GRNN, shown in Figure 3.1, is a one-pass learning algorithm which attempts to approximate a continuous variable that is dependent on many representative vector training instances. This is more beneficial than other neural network strategies, which require iterative learning that uses many iterations in order to produce workable solutions. In practice, any analysis system deployed to provide computer security should be created with the awareness that computational resources are not infinite. Typical computer users will not sacrifice much convenience for security. As described by Specht [53], elements of training instances x1...xn are passed into the network and collected into pattern units (where each unit represents a single training instance).
The output from the pattern units is collected in the summation units (as described below), and then the output of the summation units is ultimately combined to provide an estimate Ŷ corresponding to an unseen input vector X.

Figure 3.1: A GRNN Block Diagram

Each training instance in the GRNN consists of a vector X and a desired output Y. The estimate Ŷ for an unseen instance X is calculated as shown in Formula 3.1:

\hat{Y}(X) = \frac{\sum_{i=1}^{n} Y_i \exp(-C_i / \sigma)}{\sum_{i=1}^{n} \exp(-C_i / \sigma)}    (3.1)

where n is the number of all training instances, Y_i is the desired output for a given training instance vector, σ is a constant parameter of the GRNN, and the value of C_i is the distance between X and a given instance vector, as shown in Formula 3.2:

C_i = \sum_{j=1}^{p} |X_j - X_{ij}|    (3.2)

where X_j represents a single element of the training instance vector and X_{ij} represents the corresponding element in the instance to be classified.

The effectiveness of a Neural Network depends on [28]:
1. The presence of learnable characteristics in the training instances
2. The number of training set instances
3. The representation of the training set
4. The complexity of the problem at hand
5. System parameters of the Neural Network
Our classifier attempts to carefully address each of the above factors.
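To make Formulas 3.1 and 3.2 concrete, the following is a minimal sketch (not the implementation used in this research; NumPy-based, with illustrative names) of the GRNN estimate for a single unseen instance, using the sum of absolute differences as the distance of Formula 3.2.

import numpy as np

def grnn_estimate(x, train_X, train_Y, sigma):
    """GRNN estimate of Formula 3.1.

    train_X: (n, p) array of training instance vectors
    train_Y: (n,) array of desired outputs
    x:       (p,) unseen instance to classify
    """
    C = np.abs(train_X - x).sum(axis=1)         # C_i = sum_j |X_j - X_ij|  (Formula 3.2)
    w = np.exp(-C / sigma)                      # contribution of each pattern unit
    return float(np.dot(w, train_Y) / w.sum())  # weighted average gives the estimate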
3.2.1 Learnability in the Training Instances

One of the significant risks in the initial phase of this research was that the information available from the Java Class, Linux ELF and Windows PE file headers would not be sufficient to provide the learning algorithm with enough knowledge to effectively classify the target (e.g. compiler or compression method). The algorithm used depends on discovering relations between the various fields in the file headers. If there is not enough information in the file headers for a learner to form these relations, then this method would fail. Even the best learner cannot extract information if there is not a correlation between the input vectors and the desired outputs.

In regard to compressed file classification, there was reason to believe that there is a correlation between the physical and virtual sizes of files in the executable (to distinguish between compressed and uncompressed) and that this information could be effectively extracted. Perhaps of greater interest was the hope that each compression algorithm had a sufficiently characteristic impact on the PE header to be revealed to the neural network. In a similar light, there was reason to believe that each compiler leaves unique characteristics in the file headers that can be revealed to the neural network. Each compiler has a unique way of writing a binary computer application file for a particular file format given the same input source code file. This is because groups of programmers have different methods for solving the same problem. Given the same plans, people often interpret them differently. Therefore, it was likely that there would be unique characteristics found in the file headers that could be used to classify the compiler.

3.2.2 Training Set Instances

Under ideal conditions, the training set should be relatively large and perfectly representative of all possible instances that the network could be asked to classify. In practice, this is not always possible. We sought to choose a wide variety of executable files for the two problem areas. In the compressed executable classification we sought both compressed and uncompressed versions of common applications on a Windows XP system for the compression detection. In the compiler classification we sought to choose a wide variety of Java Class files (produced from Java source files) and Windows PE and Linux ELF executable files (produced from C++ source files).

3.2.3 Complexity

In general, the complexity of the classification problem is not under the control of the system designer. In the first problem, the system attempts to differentiate between uncompressed files and compressed files that were compressed using one of three common utilities. Thus, the system must attempt to classify a given executable into one of four equivalence classes. In the second problem, the system attempts to differentiate between four Java compilers if given a Java class file, three C++ compilers that produced a Windows PE file and three C++ compilers that produced a Linux ELF file.

3.2.4 Network Parameters

The General Regression Neural Network is dependent on the value chosen for σ. σ is a system parameter of the GRNN which impacts performance in terms of success rate. Larger values of σ essentially mean that more nearby training instances (in the sense of distance) will contribute to the resultant output. Essentially, large σ values act as a smoothing function, which can be desirable when the available training data is either sparse or contains outliers. When σ is small, it reduces the contribution of all but the nearest instances to virtually zero; therefore, only the closest instances will provide any contribution at all. When the training set is highly representative of the classification function being considered, a small σ would be highly desirable. For all cases other than the above, some intermediate value of σ would be appropriate. No prior information regarding the appropriate value of σ was known in this case, so optimization methods were utilized to find a good setting. These approaches are described in Section 3.3.

3.3 The Evolutionary Hill-Climber Algorithm

Although there are many techniques for parameter optimization, and although only one parameter was being optimized (σ), a simple Evolutionary Hill-Climber (EHC) was utilized. For comparison, a logarithmic search approach, which searches a larger portion of the search space, was also implemented and compared. The rationale behind using the EHC was the recognition that this framework might need to be extended to deal with a more complex search. For example, for other similar problems, it might be interesting to determine not only network parameters, but also which attributes are useful for the classification problem itself (such as exactly which fields in the PE header should be used). In this case, a simple logarithmic search may not be able to effectively find the best solutions (particularly as the number of possible attributes becomes quite large).

3.3.1 Algorithm Description

Traditional deterministic hill-climbing methodologies tend to deal inadequately with complex search spaces and instead get stuck at local minima [27]. Stochastic methods of search often prove preferable in instances where there exists no guarantee of smoothness or of singular maxima in the function to be optimized [48]. In optimizing σ for the GRNN, there are no such guarantees. The Evolutionary Hill-Climber (EHC), based loosely on the Random-Mutation Hill-Climber described in [26], first randomly creates a single candidate solution consisting of a chromosome with a number of values corresponding to the parameters being optimized (in this case, only the σ value).
Initial values are chosen from within the allowable range dictated by the problem type (in this case, the value of σ was chosen from the range (0, 1.0)). This candidate solution is then evaluated by a fitness function, and the fitness is assigned to that individual. This is shown in Algorithm 1.

Algorithm 1 Evolutionary Hill-Climber
1: procedure EHC
2:   t = 0
3:   Initialize candidate solution i
4:   Evaluate i
5:   while GRNN evaluations remain do
6:     choose a step location
7:     evaluate step location
8:     if step yields better fitness
9:       replace i with new individual
10:  end while
11: end procedure

On each iteration, a single uniform mutation (taken from the range [0,1]) is added to the parameter value, multiplied by the mutation amount and the starting value of the gene itself. The value of 0.25 for the mutation amount was chosen from the set [0.03, 0.05, 0.1, 0.15, 0.25, 0.5] by experimentation. This new candidate solution is then evaluated. If the new candidate solution has better fitness than the previous one, the previous one is replaced. However, if the new candidate solution has worse fitness than the previous one, then the candidate is rejected. The overhead associated with the EHC is quite low on each iteration, requiring only a handful of elementary operations. This is worth consideration in order to reduce off-line training time.

3.4 Logarithmic Search

In order to present a deterministic approach to determining σ, we used a very simple logarithmic search algorithm (LOG). For a given search space (in this case, searching between a lower bound of 0 and an upper bound of 1.0), the logarithmic search works as shown in Algorithm 2:

Algorithm 2 Logarithmic Search
1: procedure LOG
2:   range = upper bound - lower bound
3:   while GRNN evaluations remain do
4:     step size = range / number of steps
5:     for k = 0 ... number of steps do
6:       σ = step size * k + lower bound
7:       determine fitness of σ
8:       store σ of best fitness
9:       upper bound = best σ + step size
10:      lower bound = best σ - step size
11:      guarantee that new bounds are valid
12:      range = upper bound - lower bound
13:    end for
14:  end while
15: end procedure

The value of "steps" was chosen to be 200 in order to give sufficient granularity to find a (tiny) σ value. In plain English, the search divides a range into a number of discrete values and then tests all those discrete increments. Once the algorithm finds the best value in a range, it takes a smaller area around that value, subdivides it, and then performs the search again. This search continues until it either runs out of time or finds a performance within a given tolerance. It is important to note that this search is perfectly adequate for a small search space, but it would not be scalable if the dimension of the search were increased to include more variables (which is part of the future goals of the system).
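As an illustration only (not the code used in this research), the following sketch implements the narrowing search of Algorithm 2; the fixed number of rounds and the function names are assumptions, since the text stops the search when GRNN evaluations are exhausted or a tolerance is met.

def log_search(fitness, lower=0.0, upper=1.0, steps=200, max_rounds=5):
    """Logarithmic search for sigma; fitness(sigma) returns a value where higher is better."""
    best_sigma, best_fit = lower, None
    for _ in range(max_rounds):                     # each round narrows the bounds
        step_size = (upper - lower) / steps
        for k in range(steps + 1):
            sigma = lower + k * step_size           # test each discrete increment
            f = fitness(sigma)
            if best_fit is None or f > best_fit:
                best_sigma, best_fit = sigma, f
        lower = max(0.0, best_sigma - step_size)    # guarantee the new bounds are valid
        upper = min(1.0, best_sigma + step_size)
    return best_sigma, best_fit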
3.4.1 Steady-State Genetic Algorithm

In the initialization phase the SSGA [18] first generates a population of ten candidate solutions, each consisting of a binary listing of attributes (either "0" for not included in the classification or "1" for included in the classification) and a floating point value of sigma. Initial values of each candidate solution are determined by generating a uniformly distributed random value. In the second step these candidate solutions are each evaluated using the success rate of the GRNN on the validation set as fitness. In the iteration phase the SSGA performs 1000 iterations of the next three steps. First, two binary tournament selections are used to select a first and second parent. These two parents are then used to create a child via uniform crossover. Second, there is a 10% probability of randomly flipping a single attribute bit in the child candidate solution in order to mutate the configuration. Finally, there is a random probability that a Uniform Mutation operator is applied to the value of sigma, calculated as shown in Algorithm 3. The child always replaces the worst-fit individual in the population. This process is repeated until the maximum number of allowed function evaluations has expired.

Algorithm 3 Mutate Sigma
1: procedure Mutate Sigma
2:   mutation amount ← 0.125
3:   current σ ← current σ + rand(0,1) × mutation amount × current σ
4: end procedure

3.5 Statistical Algorithm

In order to be complete, we provide a statistical search algorithm to show that a statistical method is not an optimal solution. The input for the statistical search algorithm is a training set of files described as F = {F0, F1, ..., FN}. Each file has a list of attributes to consider, defined as A = {A0, A1, ..., AM}. A0 is a special attribute: it is the unique identifier describing the object being classified (e.g. compiler id), called the target id. The statistical search algorithm works as follows.

First, determine the statistical information for the training set:

1. Calculate the sum and count of each attribute for each target file. The sum is a map with the key equal to the target id, t, and the map value equal to the list of attribute totals. The count is a map with the key equal to the target id and the map value equal to the list of attribute counts. The target id is always the first attribute, a0, associated with each file, f(a0). This is shown in Algorithm 4.

Algorithm 4 Statistical Algorithm - Attribute Total and Count
1: procedure Calculate Column Totals(F)
2:   for each file f ∈ F do
3:     t ← f(a0)
4:     for a ← 1, n do    ▷ For each attribute a in the file, excluding the target id
5:       sum(t, a) ← sum(t, a) + f(a)
6:       count(t, a) ← count(t, a) + 1
7:     end for
8:   end for
9: end procedure

2. Calculate the average value, avg, of each attribute for each target. The avg is a map with the key equal to the target id and the map value equal to the list of average attribute values. The first attribute, A0, is skipped since this is the unique id for the type of object being classified. This is shown in Algorithm 5.

Algorithm 5 Statistical Algorithm - Attribute Average
1: procedure Calculate Column Averages(F)
2:   for t ← 0, size(sum) do    ▷ size(sum) gives the number of target ids
3:     for a ← 1, n do    ▷ For each attribute minus the first
4:       avg(t, a) ← avg(t, a) + sum(t, a) / count(t, a)
5:     end for
6:   end for
7: end procedure

3. Calculate the variance value, variance, of each attribute for each target. The variance is a map with the key equal to the target id, t, and the map value initially equal to the sum of the variance for each attribute value. Later the map value is equal to the variance of each attribute value for each target. This is shown in Algorithm 6.

Algorithm 6 Statistical Algorithm - Attribute Variance
1: procedure Calculate Column Variances(F)
2:   for each file f ∈ F do
3:     for each attribute a ∈ f do
4:       t ← first attribute in f
5:       variance(t, a) ← variance(t, a) + (f(a) - avg(t, a))^2
6:     end for
7:   end for
8: end procedure

4. Calculate the standard deviation of each attribute for each target. This is shown in Algorithm 7.

Algorithm 7 Statistical Algorithm - Attribute Standard Deviation
1: procedure Calculate Column Standard Deviations
2:   for t ← 0, size(sum) do    ▷ size(sum) gives the number of target ids
3:     for a ← 1, n do    ▷ For each attribute minus the first
4:       stdev(t, a) += sqrt(variance(t, a))
5:     end for
6:   end for
7: end procedure

5. Calculate the lower bound and upper bound of each attribute for each target.

3.5.1 Fitness Function

In order to evaluate the fitness of a candidate solution, we ran the GRNN with the indicated value of σ. Our first fitness criterion was the success rate (how often the GRNN correctly classified the file as self-modifying or not). In order to break ties in cases where the success rates for two candidate solutions were equal, the average distance from the desired output to the resultant output over all the test cases was used.

3.6 Summary

In this chapter, we described the machine learning technique, the inner workings of a General Regression Neural Network, and the optimization techniques utilized in this research to optimize the parameters for the GRNN.

Chapter 4
Compressed Executables

In submission to the IEEE Transactions on Information Forensics and Security.

In order to effectively defend computers from malicious software, analysts need as much information as possible about the files that will be executed on those machines. In particular, analysts would like to have the capacity to make predictions about the danger posed by a file, but they would prefer to do so without actually executing the file (and therefore potentially doing harm to the system that the file is being run on). Thus, static file analysis is generally preferable from the perspective of the analyst. Our approach is to apply machine learning techniques to classify files with respect to their status of compression based on their Windows Portable Executable (Windows PE) header information. Specifically, we use General Regression Neural Networks upon a representative sample of training data to identify files with potentially self-modifying behavior. In the course of this work, we perform optimization of critical algorithm parameters. In order to accomplish these tasks, we make use of Evolutionary Hill-Climbing and a Logarithmic Search for parameter optimization. The system is tested on a previously unseen set of files to assess the ability of the system to generalize. Ultimately, this algorithm could be used to aid anomaly detection mechanisms and forensic analysis by providing more information from the static analysis of executable files.

The remainder of the chapter is organized as follows: In Section 4.1, the specific experiments are described along with their results and discussion. Section 4.3 describes in detail the collection of data, the algorithms implemented for classification, and many of the important issues regarding these implementations. Section 4.4 addresses some general weaknesses of the approach. Section 15.2.7 describes continuing work on the research, as well as future work for the Information Security community at large. Finally, a conclusion is given in Section 4.5.

4.1 Experiments

In this section, the experiments are described and the results are given, along with discussion of the implications of those results.

4.1.1 The Training, Validation, and Test Sets

A training instance includes three (3) fields of interest (Total Size, Sum of Section Sizes, and Total Virtual Size) plus the desired classification value (see Figure 4.1). In Experiment I, the desired output was either a 0 (indicating uncompressed) or a 1 (indicating any type of compression).
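As an illustration of how such fields might be derived from a PE header, a short sketch using the third-party pefile library is shown below. The exact feature definitions used in this research are assumed here: Total Size is taken as the on-disk file size, Sum of Section Sizes as the sum of each section's SizeOfRawData, and Total Virtual Size as the sum of each section's VirtualSize.

import os
import pefile  # third-party library: pip install pefile

def training_instance_fields(path):
    """Collect three header-derived fields for one Windows PE file (assumed definitions)."""
    pe = pefile.PE(path, fast_load=True)
    total_size = os.path.getsize(path)                                 # bytes on disk
    sum_section_sizes = sum(s.SizeOfRawData for s in pe.sections)      # initialized data on disk
    total_virtual_size = sum(s.Misc_VirtualSize for s in pe.sections)  # size when loaded
    return total_size, sum_section_sizes, total_virtual_size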
For Experiment II, the desired output varies from the set [0, 1, 2, 3], where 0 indicates an uncompressed file, 1 represents a file produced by Cexe, 2 represents a file compressed by PEtite, and 3 represents a file compressed by UPX (see Figure 4.2).

Figure 4.2: Experiment II Training Instance

For Experiment III, the data was basically the same as Experiment II, but there are no uncompressed files (in other words, the only goal was to distinguish between the type of compression used). In this case, the desired outputs vary from the set [0, 1, 2], where 0 indicates UPX, 1 represents Cexe, and 2 represents a PEtite file.

For all the above experiments, the training set consisted of roughly 600 file instances. The validation and test sets were collected with each containing approximately 70 random distinct unseen file instances. The only change in the sets between the first two experiments is the adjustment in the values for desired outputs (as described above). In the third experiment, however, the uncompressed files were removed from all sets, reducing their sizes by about 25%.

In each experiment, the training set was used as the training data for the GRNN. The optimization method was used on the validation set in order to optimize the GRNN. Then, the trained network was tested on the test set in order to provide our results. The rationale behind the three separate sets was to eliminate training bias resulting from the optimization of the GRNN.

Optimizer | Iterations to Best | Best Success Rate | Best σ | Mean Square Error
LOG | 211 | 0.9143 | 4.14E-4 | 0.105
EHC | 15.57 | 0.9143 | 4.14E-4 | 0.105

Table 4.1: Identifying Compression: GRNN Parameters and Results

4.1.2 Optimizer Stopping Conditions

The optimizers in all the experiments were given 1000 function evaluations (evaluations of the GRNN, which is the most expensive portion of the optimization, regardless of method).

4.1.3 Identifying Compression

The goal of this experiment was to classify an unseen file as either compressed or uncompressed (no attempt at this stage is made to determine what type of compression was used). First, we used the LOG optimization method (see Section 2) of determining the appropriate value of σ using the range [0, 1.0]. The LOG approach (see Table 4.1) identifies a value of σ with a 0.9143 Success Rate (very high classification on unseen instances). This value of σ also yields a relatively low Average Distance of 0.105. While this approach is clearly deterministic, it carries a small penalty in terms of offline training time. The success of the search is bounded by the granularity of the step size (which may need to be quite small if the range of σ cannot be narrowed down somehow beforehand). In practice, as the number of training instances grows larger (which is desirable for greater ability to generalize over a wider variety of files), the cost to evaluate a GRNN will increase. In this case, evaluating the GRNN many times would be undesirable if training the network is done often in an actual system. To this end, we utilized the EHC to discover σ, as described in Section 3.3. It is worth noting that our goal is to produce a method for the quick, efficient, and effective discovery of σ, not merely for this specific problem, but rather for a general methodology that can be used for many file classification problems.
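Before turning to the EHC results, the general shape of such a hill-climb over σ can be sketched as follows. The sketch reuses the Mutate Sigma step of Algorithm 3 and a lexicographic fitness (success rate first, smaller average distance as the tie-breaker, per Section 3.5.1); the acceptance rule, the helper signature, and the random-number setup are assumptions for illustration, and the actual EHC of Section 3.3 may differ in its details.

#include <functional>
#include <random>
#include <utility>

// Schematic evolutionary hill-climb over sigma. 'fitness' is assumed to return
// (success rate, -average distance) so that std::pair's lexicographic ordering
// realizes the tie-breaking rule of Section 3.5.1.
double hill_climb_sigma(double initial_sigma, int max_evaluations,
                        const std::function<std::pair<double, double>(double)>& fitness) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    const double mutation_amount = 0.125;             // step scale, as in Algorithm 3

    double best_sigma = initial_sigma;
    std::pair<double, double> best = fitness(best_sigma);
    for (int i = 0; i < max_evaluations; ++i) {
        // Mutate Sigma (Algorithm 3): perturb the current value by a random fraction.
        double candidate = best_sigma + unit(rng) * mutation_amount * best_sigma;
        std::pair<double, double> score = fitness(candidate);
        if (score > best) {                            // keep only improving values of sigma
            best = score;
            best_sigma = candidate;
        }
    }
    return best_sigma;
}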
The EHC was run 30 times (for statistical significance) with 1000 maximum iterations on each run in order to discover a sigma value. The average solution (which always discovered the best known σ) is also provided in Table 4.1. In Table 4.1, the value for "Iterations to Best" indicates the average number of iterations the EHC took to find its best value for σ (lower is better). The value "Best Success Rate" indicates the average percentage of correct identifications made on the testing set by the GRNN. The value "Best σ" indicates the mean of all the best-performing σ values. It should be noted that on every run, the EHC discovered the best value of σ. Finally, the "Mean Square Error" measure indicates the average distance between the desired output and the actual output given by the GRNN. These results show that the EHC, despite being a non-deterministic algorithm, still consistently identifies the best-performing values of σ. Further, it does so in a very small number of iterations, even fewer than the deterministic LOG optimizer.

The results for the first experiment were quite promising. Both the LOG and EHC methods of discovering σ ultimately resulted in very high Success Rates (with bests giving better than 90% classification). It should be noted, however, that the EHC method found the good σ values while consuming less computational time. In general, the values of σ discovered were quite small, which seems to indicate that the training set is fairly representative of the classification space.

4.1.4 Classifying Type of Compression or Lack Thereof

Since Experiment I gave strong results, the system was expanded to deal with the more difficult problem of identifying the compression utility used to compress a file (or to identify if the file was compressed at all). Again, first the LOG method was applied, as shown in Table 4.2. The value headings for all tables in this section have the same meaning as in Experiment I. The Best Success Rate discovered was 0.8857 (slightly lower than in the previous experiment). The σ values remain tiny, but the Best Average Distance roughly doubles from Experiment I. This is expected, however, given that there are a larger number of categories for classification.

Optimizer | Iterations to Best | Best Success Rate | Best σ | Mean Square Error
LOG | 452 | 0.8857 | 2.42E-5 | 0.214
EHC | 58.8 | 0.8857 | 2.5E-5 | 0.214

Table 4.2: Classifying Type of Compression or Lack Thereof: GRNN Parameters and Results

Optimizer | Iterations to Best | Best Success Rate | Best σ | Mean Square Error
LOG | 452 | 0.9149 | 2.42E-5 | 0.107
EHC | 36.57 | 0.9149 | 2.42E-5 | 0.107

Table 4.3: Classifying Type of Compression: GRNN Parameters and Results

Again, the EHC was also applied to the search problem. The results in Table 4.2 show the average performance on the search problem. The best σ is discovered in an average of 58.8 EHC iterations, and the Average Success Rate is the same as the results discovered through the LOG optimization approach. In general, the results from Experiment II are quite promising. Even when dealing with a more complex classification problem, the GRNN with the same training data still managed to effectively classify the type of compression with success rates of nearly 0.89.

4.1.5 Classifying Type of Compression

In the third set of experiments, the system was tasked with determining the type of compression in the absence of uncompressed files, to see if having a priori information regarding the files could improve accuracy. It is conceivable that files to be tested are known to be compressed already (eliminating the possibility of the network confusing files for uncompressed), which could increase the accuracy of the classifications.
Es- sentially, this experiment deals with detecting the di erences between the properties in les known to be compressed with di ering algorithms. Again, both the LOG and EHC optimization methods were applied, as shown in Table 4.3. The Best Success Rate discovered was 0.9149 (better than in the previous experiment) for both optimizers. In fact, they perform roughly as well as the classi ers in Experiment I. In general, both algorithms discover the best , but the EHC does so faster. 63 4.1.6 Impact of Compression While certain attributes, such as the capacity for self-modi cation, compression, and encryption may make an executable suspicious, all of these capabilities have legit- imate purposes as well [69]. For example, completely reasonable uses of compression include: Software on embedded systems NASA Mars Exploration Rovers Spirit and Opportunity used on-board loss-less data compressors to deter reverse engineering to obfuscate contents of executables programs designed for portability on resource-poor media games with large quantities of data large applications available for download backups are sometimes saved as compressed executables to save disk space The lesson taken from the various uses of compression demonstrates that identifying a le as compressed is only one part of an analysis. However, knowing the compression type of a le makes the analysis of said le considerably easier (or sometimes makes an infeasible problem feasible). It should be noted that a le should typically only be agged as \risky" when it exhibits multiple symptoms. Noting the above, le compression and encryption have a negative impact on the capability of many schemes to analyze executable les. Many methods for detecting malicious programs, especially for virus and malware, depend on speci c \signatures" composed of a speci c group of bytes to match les. Such signatures are very speci c in nature (to avoid false positives), but this makes them highly ine ective against even small alterations in the executable. Signature-based schemes have reduced capacity to 64 identify a le if that le?s corresponding signature is compressed or encrypted. Worse still, in order to address such a problem, the signature-detecting scheme would need a corresponding signature for every possible encryption and compression methodology that might produce di erent output! Clearly, given the vast number of encryption and compression algorithms, this is simply not feasible (the signature database would swell to epic proportions). 4.2 Data source and compressors To get a representative sampling of le types and sizes we used a large number of les, including wide variety of executables, from self-written executables to common applications. We then used three di erent compression utilities, Cexe, Ultimate Packer for eXecutables (UPX) [46], and PEtite [19] to compress these les. There were a few instances where one of the compression utilities failed to compress the original executable; however, the information from the remaining utilities? compression of the same le was still included. 4.2.1 Cexe Cexe is a Win32 compressor that tries multiple compressors and chooses the one that gives the smallest size. The version that we used contained compressors for Lempel- Ziv-Welch (lzw), published in 1984 [67], and zlib, which was written by Jean-loup Gailly. Often zlib was used; however, there were some of the smaller les that Cexe used lzw to compress. 4.2.2 Ultimate Packer for eXecutables UPX, written by Markus F.X.J. 
Oberhumer, L aszl o Moln ar and John F. Reiser, is a packer for executables of several di erent formats. It uses the Not Really Van- ished (NRV) compression library, a generic data compression library written by ober- humer.com, a small company in Linz, Austria. 65 4.2.3 PEtite PEtite is another executable packer which compresses an executable while still allowing the executable to run exactly as the original uncompressed version after loading. While we could not get information on exactly which compression is used by PEtite, after viewing the compression ratios, the method of the compression that is utilized is di erent from both Cexe and UPX. 4.3 Data Extraction We created a TrainerDump program which operates on a similar basis as the pop- ular Pedump utility to extract PE le information. It uses the Libreverse library for opening the Windows PE32 executables in memory and then parsing its headers. The MS-DOS header is read to determine if the program under observation is in fact a win- dows executable. The le extension does not guarantee any particular property about an executable le. The MS-DOS header also provides the location of the Windows PE header which tells us where to look for the section headers. In the Windows PE section header, we can nd two of the properties that are required for the trainer: virtual mem- ory size of the section when loaded and its raw size in the binary le on disk. For each section header found on the disk, a total is produced for both along with the actual size of the le. The actual size of the le is an obvious fact about a binary program. Typically, the binary program takes as much space as required. There is little reason to pad the le so the size is larger on the system. This is contradictory to what a malicious program designer wants to achieve. They often want their application to be as small as possible to avoid suspicion. The data that can potentially be altered are the virtual memory size and the raw data size. The virtual memory size is what is required to store the data or instructions in memory. If the value is smaller than the size of raw data on disk, then the loader will have a memory error when it attempts to put data past the end of the space. If the memory request is equal to or larger than the size of raw data, the loader 66 can perform the loading without errors. The virtual memory size must be at least as large as the raw data size. Most virus writers can alter this number. It is most often altered to ensure that the su cient space is available for decompressing/decrypting a real binary program. Finally, the raw data size cannot be less than the entire program. If a malicious program has a stated raw data size, it is safe to conclude that the size is at least equal to or greater than the actual instructions. Just because the bytes in a section are marked as executable, does not mean that they will actually be used. So the raw data size can be larger than is required, but this is a negative feature if we are trying to minimize our le size. TrainerDump will be reading the target le so the access time of the target le will change. It is important to also note that none of the bytes of the malicious executable are altered or executed. This is necessary when considering that forensic tools should not modify a suspect le during analysis. Therefore, the PE les should have read-only set on them just as a safety back up. TrainerDump program extracts the necessary facts for a large number of les into a space delimited le for use by the GRNN. 
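The extraction just described can be illustrated with a short routine that walks the PE headers and totals the section sizes. The header offsets used below (the e_lfanew field at 0x3C of the MS-DOS header, the section count at offset 6 of the COFF header, the optional header size at offset 20, and the VirtualSize and SizeOfRawData fields at offsets 8 and 16 of each 40-byte section header) follow the published PE format; the routine itself is only a sketch with minimal validation and is not the TrainerDump implementation.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct PeSizeFacts {
    std::uint64_t file_size = 0;         // actual size of the file on disk
    std::uint64_t sum_raw_sizes = 0;     // total SizeOfRawData over all section headers
    std::uint64_t sum_virtual_sizes = 0; // total VirtualSize over all section headers
};

// Read the three size attributes used for training from a Windows PE file.
PeSizeFacts extract_pe_size_facts(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());

    auto u16 = [&](std::size_t off) -> std::uint32_t {
        return data[off] | (std::uint32_t(data[off + 1]) << 8);
    };
    auto u32 = [&](std::size_t off) -> std::uint32_t {
        std::uint32_t v = 0;
        for (int i = 3; i >= 0; --i) v = (v << 8) | data[off + i];
        return v;
    };

    if (data.size() < 0x40 || data[0] != 'M' || data[1] != 'Z')
        throw std::runtime_error("not an MS-DOS/PE executable");

    PeSizeFacts facts;
    facts.file_size = data.size();
    std::uint32_t pe_offset = u32(0x3C);             // e_lfanew: start of the PE header
    std::uint32_t num_sections = u16(pe_offset + 6); // COFF NumberOfSections
    std::uint32_t opt_size = u16(pe_offset + 20);    // SizeOfOptionalHeader
    std::size_t section_table = pe_offset + 24 + opt_size;

    for (std::uint32_t i = 0; i < num_sections; ++i) {
        std::size_t sh = section_table + i * 40;     // each section header is 40 bytes
        facts.sum_virtual_sizes += u32(sh + 8);      // VirtualSize
        facts.sum_raw_sizes += u32(sh + 16);         // SizeOfRawData
    }
    return facts;
}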
4.4 Weaknesses Our approach carries some practical weaknesses that should be noted in the name of good scholarship. First, while our GRNN proved to be an e ective classi er, it is dependent on the quality of the training set. If such a training set could not be engineered (perhaps the information used for the training instances is not available for some type of executables), then this strategy would not be useful. Second, we rely on the information provided in the PE Header to perform the classi cation. If this header could be distorted or misrepresented by a malicious entity, then it could easily fool our system by presenting the le?s pre-compressed header information. Third, training instances must be gathered for every le type that may be subject to analysis. This returns to a fundamental problem regarding static analysis of executables in that they 67 tend to be system speci c. This may not be a signi cant problem, since most reverse- engineering e orts are already very system-speci c (meaning that this restriction is no greater than those that already exist). 4.5 Summary In this chapter, we motivated the need for e ective malware analysis upon exe- cutable les when neither the source code is readily available nor is there any desire to execute the le to examine its behavior. Further, we implemented logarithmic search and evolutionary search tactics in order to optimize the system parameter ( ) of the GRNN. The fairly promising results of the system were discussed and, nally, we pro- vided a number of opportunities for the expansion of this work. 68 Chapter 5 A Compiler Classification Framework for Use in Reverse Engineering 5.1 Introduction In submission to IEEE Symposium Series on Computational Intelligence in Cyber Security (CICS 2009) A software framework is presented in this chapter for extracting useful information from Java Class, Windows PE and Linux ELF les and analyzing that information to classify future les. A General Regression Neural Network is implemented and optimized using both deterministic and stochastic search algorithms. In experimental results, the system can classify compiler type on an unseen le with a more than a 98% degree of accuracy. 5.1.1 How can machine learning be used for classifying compiler type? The problem of determining if a binary computer application was compressed or not is quite similar to the problem of classifying the compiler used to create a binary computer application. Both require certain attributes from the header of the binary computer application in order to make a determination. In order to perform the analysis, rst information is extracted from the les. Instead of selecting attributes from the les to determine compression, new attributes would have to be selected. Once the training data is collected, a General Regression Neural Network (GRNN) [53] learns from the training data and attempts to estimate the type of compiler used to create a le previous unseen by the GRNN. Once a training set consisting of the le summaries is constructed along with the GRNN, two di erent optimization methods are used to optimize the neural network parameters. This is similar to the ideas suggested in [22] for the optimization of weights 69 in a Feed-Forward Neural Network using a Particle Swarm Optimizer. Experiments were developed to test the e ectiveness of the system on classifying unseen le instances and recording metrics of success rate and average distance to desired output. 
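For reference, the estimate produced by a GRNN takes the standard form given by Specht: a kernel-weighted average of the known training targets, where σ controls the width of the kernel. The routine below is a generic sketch of that calculation, not the Libreverse implementation; the use of a squared Euclidean distance over (assumed already normalized) attribute vectors is an assumption of the sketch.

#include <cmath>
#include <cstddef>
#include <vector>

// Generic GRNN estimate (after Specht): each training example votes for its
// target value with a weight that decays with distance from the query point.
double grnn_estimate(const std::vector<std::vector<double>>& train_inputs,
                     const std::vector<double>& train_targets,
                     const std::vector<double>& query, double sigma) {
    double numerator = 0.0;
    double denominator = 0.0;
    for (std::size_t i = 0; i < train_inputs.size(); ++i) {
        double d2 = 0.0; // squared Euclidean distance to training example i
        for (std::size_t j = 0; j < query.size(); ++j) {
            double diff = query[j] - train_inputs[i][j];
            d2 += diff * diff;
        }
        double w = std::exp(-d2 / (2.0 * sigma * sigma));
        numerator += w * train_targets[i]; // weight the known target value
        denominator += w;
    }
    return denominator > 0.0 ? numerator / denominator : 0.0;
}

The continuous output is then compared against the compiler ids; a small σ makes the estimate dominated by the nearest training examples, which is consistent with the small σ values reported in the experiments below.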
The structure of the training set and the network parameters were adjusted for the problem of classifying which of three families of popular modern Java Compilers, three families of popular modern C++ compilers for Linux or three families of popular modern C++ compilers for Windows were used to compile a given set of test les. Java JDK v5.0 JDK v6.0 IBM Jikes ECJ Windows Cygwin GCC 4.1 Visual Studio 2003 Visual Studio 2005 Linux GCC 4.1 Portland Group C++ 7.1 Intel C++ 10.1 Table 5.1: List of compilers 5.2 Software Architecture In this section, we describe the software architecture of the compiler classi er and the XML format of the input. 5.2.1 Java Input Data The input information used by the compiler classi er was obtained from a simple program which utilizes the Libreverse library to read the input Java Class les. These input training sets of Java Class les were produced by compiling a set of test Java les with the Eclipse, Jikes, Sun Microsystem 1.5 and 1.6 Java compilers. The Libreverse library is a reverse-engineering architecture developed to automatically produce the de- sired high-level output from a given input based on prede ned con gurations of analysis components. 70 The input data was formatted in XML using the schema shown in Listing 5.1 in order to produce a easy parse input format to the classi er. Listing 5.1: XML schema for input Java data The description of each of the elements are: compiler id: Unique id associated with the compiler that produced the le. size: File size version: This value is used to indicate what release level of Java this binary requires the JVM to support. 71 this class index: The value of the this class item must be a valid index into the constant pool table. The constant pool entry at that index must be a Constant Class Info structure representing the class or interface de ned by this class le [8]. super class index: This is similar to the this class de ned above but the only di erence is that this index refers to a parent class or the default Object class. constant pool count: Number of constant pool entries. constant pool info (tag): Unique id for the constant pool entry. constant pool info (count): Number of times the unique id was seen. constant pool info (ratio): Ratio of the number times to the number of con- stant pool entries. eld count: Number of eld info structures in the class le. method count: Number of method info structures in the class le. 5.2.2 ELF Input Data The input information used by the compiler classi er was obtained from a simple program which utilizes the Libreverse library to read the input ELF les. These input training sets of ELF les were produced by compiling a set of test C++ les with the GNU GCC 4.1, Portland Group C++ 7.1, and Intel C++ 10.1 compilers. The Libreverse library was used to automatically produce the desired high-level output from a given input based on prede ned con gurations of analysis components. The input data was formatted in XML using the schema shown in Listing 5.2 in order to produce a easy parse input format to the classi er. The description of each of the elements are: compiler id: Unique id associated with the compiler that produced the le. size: File size 72 Listing 5.2: XML schema for input ELF data entry point address: Virtual address where the system rst transfer control. section header start: Where the Section header table is located in the le. program header count: Number of entries in the program header table. section header count: Number of entries in the section header table. 
section header string table index: Section header table index associated with the string table. text section size: Size in bytes of the executable code section in memory. 5.2.3 Windows PE Input Data The input information used by the compiler classi er was obtained from a simple program which utilizes the Libreverse library to read the input Windows PE les. These input training sets of Windows PE les were produced by compiling a set of test C++ 73 les with the GNU GCC 4.1 (Cygwin), Visual Studio 2003 and 2005 C++ compilers. The Libreverse library was used to automatically produce the desired high-level output from a given input based on prede ned con gurations of analysis components. The input data was formatted in XML using the schema shown in Listing 5.3 in order to produce a easy parse input format to the classi er. Listing 5.3: XML schema for input PE data The description of each of the elements are: target id: Unique id associated with the compiler that produced the le. size: File size exe header address: File o set where the Windows PE header starts in the le header. co section header count: Number of section headers in the le header. pe opt code size: Size of the code section. 74 pe opt base of data: Base address of the data section. pe opt entry point: Address of the rst instruction to be executed. pe opt image size: Size of the loaded image in memory. 5.2.4 Structure There are two basic sections to the compiler classi er: Input and Analysis. The Input section is responsible for parsing the XML les, preparing the data structures, separating data into training, test and veri cation sets. The input data is retrieved from the input XML les by the IO class. It is respon- sible for calling the Training Data Parser for each input le and then processing it with the Selection Policy to get the nal Training Set. A Training Set is used to contain the three types of data (training, test and veri cation). The main function of the classi er optimizer takes the data retrieved from the IO class and passes it to the Optimizer. The Optimizer is the part that coordinates the processing of the input data. Processing continues until the number of iterations is achieved or time runs out. At that point the success value, nal sigma value, and attributes used are output from the program. 5.3 Experiments In this section, the three major experiments are described and the results of said experiments are given, along with discussion. 5.3.1 The Training, Validation, and Test Sets A training or test or validation instance potentially includes many elds of interest. In Experiment I (Java Compiler Classi cation with the GRNN Optimized with an Evolutionary Hill-Climber Algorithm), a simple optimization was utilized to try to determine appropriate attributes and system parameters for the GRNN. There were a total of 29 Java attributes available plus the continuous variable to be optimized. The 75 Compiler ID ECJ 0 Jikes 1 JDK v5.0 2 JDK v6.0 3 Table 5.2: Java classi cation values Compiler ID GCC 4.1 0 Portland Group C++ 7.1 1 Intel C++ 10.1 2 Table 5.3: ELF classi cation values goal was to create a GRNN that would be most accurate at classifying the compiler type to a value in Table 5.2 for unseen Java le instances. Additionally, potentially reducing the number of attributes used by the GRNN was motivated by: Given that not all of the attributes are strongly correlated with compiler type, some of this information may actually detract from the GRNN?s ability to classify correctly. 
Reducing the size of the instances greatly improves the speed of the classi er at run time A consequence of reducing run time speed of the classi er is that the time required to perform optimization is dramatically reduced. For Experiment II (Java Compiler Classi cation with the GRNN Optimized with a Steady-State Genetic Algorithm), the experimental goals were unchanged, but a more sophisticated optimization routine (a genetic algorithm) was used to optimize the at- tributes and system parameters. For Experiment III (ELF File Classi cation with the GRNN Optimized with a Steady-State Genetic Algorithm), the experimental setup was essentially the same as in Experiment II, with the exception that ELF le data was being used to classify an unseen le to a value in Table 5.3. 76 Compiler ID Cygwin GCC 4.1 0 Visual Studio 2003 1 Visual Studio 2005 2 Table 5.4: PE classi cation values For Experiment IV (Windows PE File Classi cation with the GRNN Optimized with a Steady-State Genetic Algorithm), the experimental setup was essentially the same as in Experiment II and III, with the exception that Windows PE le data was being used to classify an unseen le to a value in Table 5.4. It should be noted that using a deterministic optimization approach in order to determine the attributes in this case would be more time-consuming since the search space grows by a factor of 2n, where n is the number of attributes (in this case 29 for Java and 10 for the ELF les). For all of the Java experiments, the training set consisted of roughly 334 class le instances, collected as mentioned in Section 5.2.1. The validation and test sets were collected in the same manner each containing approximately 50 - 70 random distinct unseen class le instances. For all of the Linux ELF Experiments, the training set consisted of 567 le instances with validation and test sets containing about 114 random distinct unseen le instances. For all of the Windows PE Experiments, the training set consisted of 563 le instances with validation and test sets containing about 112 random distinct unseen le instances. In each experiment, the training set was used as the training data for the GRNN. The optimization method was used on the validation set in order to optimize the GRNN. Then, the trained network was tested on the unseen test set in order to provide our nal results. The rationale behind the three separate sets was to eliminate training bias resulting from the optimization of the GRNN. In other words, the optimization process does not see the test set instances at any time. Only after the completion of training is the test set used. 77 Success Rate Mean Square Error min. 0.00181 54.5% 0.050 max. 0.00242 95.1% 0.964 stdev. 9.41E-05 9.9% 0.199 avg. 0.00200 78.3% 0.285 Table 5.5: Experiment I: Identifying Java Compiler Type with GRNN optimized by the EHC. Shows the results of optimizing a GRNN using an Evolutionary Hill-Climber Algorithm. The EHC was allowed a maximum of 10000 iterations, had a 10% chance of modifying each iteration, and always performed a single bit- ip mutation on a random attribute. 5.3.2 Optimizer Stopping Conditions All the optimizers in all the experiments were given a xed number of function evaluations (evaluations of the GRNN, which is the most expensive portion of the op- timization, regardless of method). This cap on optimization time is described in the experiment. 
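To make the size of this search space concrete, each candidate solution can be pictured as one selection bit per attribute (include or exclude it from the GRNN input) together with the continuous σ value, so n attributes give 2^n possible attribute subsets. The encoding below is purely illustrative; the type and member names are assumptions and are not taken from the Libreverse Candidate Solution class described in Chapter 6.

#include <bitset>
#include <cstddef>

// Illustrative encoding of one candidate solution for the Java case:
// 29 selectable attributes plus the continuous GRNN smoothing parameter.
constexpr std::size_t kJavaAttributeCount = 29;

struct CandidateSolutionSketch {
    std::bitset<kJavaAttributeCount> use_attribute; // bit i: include attribute i
    double sigma = 0.002;                           // GRNN smoothing parameter
    double success_rate = 0.0;                      // filled in by evaluation
    double mean_square_error = 0.0;                 // filled in by evaluation
};

// Single-bit mutation of the kind used by the EHC: flip one chosen attribute bit.
inline void flip_attribute(CandidateSolutionSketch& c, std::size_t index) {
    c.use_attribute.flip(index % kJavaAttributeCount);
}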
5.3.3 Experiment I: Java Compiler Classi cation with a GRNN Optimized with an Evolutionary Hill-Climber Algorithm For this experiment set, the Evolutionary Hill-Climber was used as the optimization method (with settings as described in Section 3.3) to distinguish between the four Java compiler types. 5.3.3.1 Distinguishing Four Compiler Types The goal of this experiment was to classify the unseen class les (the test set) into the appropriate compiler type, using all four possible types. It is worth noting that our goal is to produce a method for the quick, e cient, and e ective discovery of not merely for this speci c problem compiler classi cation problem, but for a general methodology that can be used for many le classi cation problems. The EHC was run 30 times (for statistical signi cance) maximum of 10000 iterations on each run in order to discover a sigma value and a good combination of attributes. 78 The experimental results are provided in Table 5.5. In the table, the value indicated by \ " represents best \ " value discovered by the EHC. The \Succeses Rate" indicates the number of correct classi cations by the con guration on unseen test data. A higher success rate is better. Finally, the \Mean Square Error" as de ned as average squared di erence between the predicted value and actual value. In other words, we favor classi ers that not only correctly identi er the compiler type, but also that give a continuous value that is as close as possible to the actual value. A lower Mean Square Error is better. In general, it should be noted that the performance on this problem was weaker than could be used in practice. While the EHC nds a solution quickly (on average, in only a few iterations), this is likely due to premature convergence to a local minima. The average values of are relatively small (hovering around 2E-03) with only a small degree of variability. The average success rate discovered is about 78%, with around 10% standard deviation. This much error and variability indicates that this particular strategy would be undesirable for everyday application. Not surprisingly, the Mean Square Error is also fairly high, with an average of 0.285. While these results are somewhat promising in that the optimized classi er clearly does much better than random guessing (which would be expected to perform at about 25% accuracy), it still does not achieve a low enough error rate to be used in a practical setting. The weakness in optimization seems to be the inability of the EHC to break out of local minima, which means that it never nds the best con guration of attributes. In order to x this problem, a more powerful search mechanism, the SSGA, was used. 5.3.4 Experiment II: Java Compiler Classi cation with a GRNN Optimized with a Steady-State Genetic Algorithm Since Experiment I gave weaker than desired results, a more powerful optimization algorithm was used. The primary hope was that this optimization would be better able to remove data that had a weak correlation to the compiler type and thus improve the relatively weak success rates obtained so far. This seems reasonable because the 79 Success Rate Mean Square Error max. 0.00233 100.00% 0.159 min. 0.00177 90.24% 0.00190 avg. 0.00201 96.04% 0.0599 stdev. 0.00011 2.85% 0.0457 Table 5.6: Experiment II: Identifying Java Compiler Type with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. 
The SSGA was allowed a maximum of 100 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10. Success Rate Mean Square Error max. 0.00246 100.00% 0.152 min. 0.00172 94.51% 4.64E-12 avg. 0.00199 98.53% 0.0323 stdev. 0.00017 1.48% 0.0338 Table 5.7: Experiment II: Identifying Java Compiler Type with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. The SSGA was allowed a maximum of 1000 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10. crossover operation (as described in Section 3.4.1) has the potential to change many attributes in the con guration compared to the one attribute that could be changed by random bit- ip mutation in the EHC. The same training, validation, and test data from the Java les used in Experiment I was used in Experiment II. The settings for the SSGA itself are as described in Section 3.4.1. The results from this experiment allowing only 100 function evaluations (instead of the 10000 which were allowed for the EHC) are shown in Table 5.6. The results for the same experiment allowing 1000 function evaluations are shown in Table 5.7. The meanings of the column headings is unchanged from Experiment I. While the number of required iterations needed to nd the best solution increases in both cases, this seems to be attributed to the fact that those iterations are consistently nding better con gurations as opposed to getting stuck in local minima. Even when the optimizer is limited to 100 iterations, the average success rate is still notably better at 96.04%, albeit with a higher standard deviation of 2.85%. However, in the 1000 iterations case, the success rate achieved rises to 98.53% with a much lower standard 80 deviation of 1.48%. This indicates a level of accuracy useful for practical application. In both the 100 iteration and the 1000 iteration case, the average Mean Square Error drops, garnering 0.0457 and 0.0338, respectively. The average discovered value of remains around 2.01E-03, which seems to indicate that this is a good value of for this problem. It should be noted that the dramatically improved accuracy makes this approach plausible for practical application. Although not visible from the above charts, the vast majority of the classi cation error stems from the GRNN?s inability to distinguish well between the two versions of the JDK. Again, this is due to the fact that the attributes chosen change very little between these two versions. 5.3.5 Experiment III: ELF File Compiler Classi cation with a GRNN Op- timized with a Steady-State Genetic Algorithm Given the promising results of the approach in Experiment II, a similar strategy was pursued for the Linux ELF les. Di erent attributes were available for selection (as described in Section 5.2.2), but the same SSGA and GRNN combination was used. Since the EHC results were not very promising in Experiment I, only the SSGA was used for Experiment III. Again, the optimizer was rst allowed 100, then 1000 iterations in order to discover an optimal con guration. The results from this experiment are reported in Table 5.8 and Table 5.9. A run of the optimizer, described in Section 6.3.3, can have a iteration value of 0. This is due to the fact that the best solution found will have already existed before any children were considered. 
The results for the ELF files were quite good, showing again results of practical value. Even when given only 100 iterations, a configuration with a 99.51% success rate is discovered. Further, not much additional value is gained from further training (only a small improvement in terms of Mean Square Error in the 1000 iteration case). Even in the worst case run, a configuration with a 97.21% success rate is discovered. In the best case, the success rate is 99.51%.

       | σ | Success Rate | Mean Square Error
max.   | 0.00224 | 100.00% | 0.0561
min.   | 0.00176 | 97.21% | 5.53E-011
avg.   | 0.00199 | 99.51% | 0.0128
stdev. | 0.0001 | 0.69% | 0.0179

Table 5.8: Experiment III: Identifying C++ Compiler Type for Linux ELF with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. The SSGA was allowed a maximum of 100 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10.

       | σ | Success Rate | Mean Square Error
max.   | 0.00222 | 100.0% | 0.04
min.   | 0.00179 | 98.60% | 0.00
avg.   | 0.00198 | 99.88% | 0.00191
stdev. | 7.84E-05 | 0.32% | 0.00740

Table 5.9: Experiment III: Identifying C++ Compiler Type for Linux ELF with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. The SSGA was allowed a maximum of 1000 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10.

Ultimately, this experiment shows that the approach has promise in terms of being able to produce good results for a wide variety of file types on differing platforms.

5.3.6 Experiment IV: Windows PE File Compiler Classification with a GRNN Optimized with a Steady-State Genetic Algorithm

Given the strong results of the approach in Experiments II and III, a similar strategy was pursued for the Windows PE files. Different attributes were available for selection (as described in Section 5.2.3), but the same SSGA and GRNN combination was used. Since the EHC results were not very promising in Experiment I, only the SSGA was used for Experiment IV. Again, the optimizer was first allowed 100, then 1000 iterations in order to discover an optimal configuration. The results from this experiment are reported in Table 5.10 and Table 5.11.

       | σ | Success Rate | Mean Square Error
max.   | 0.002 | 100.00% | 0.0356
min.   | 0.00192 | 99.11% | 0.00
avg.   | 0.002 | 99.77% | 0.00820
stdev. | 1.45E-05 | 0.37% | 0.0145

Table 5.10: Experiment IV: Identifying C++ Compiler Type for Windows PE with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. The SSGA was allowed a maximum of 100 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10.

       | σ | Success Rate | Mean Square Error
max.   | 0.00222 | 100.0% | 0.0400
min.   | 0.00179 | 98.60% | 0.00
avg.   | 0.00198 | 99.88% | 0.00191
stdev. | 7.84E-05 | 0.32% | 0.00740

Table 5.11: Experiment IV: Identifying C++ Compiler Type for Windows PE with GRNN optimized by the SSGA. Shows the results of optimizing a GRNN using a Steady-State Genetic Algorithm. The SSGA was allowed a maximum of 1000 iterations, the population size was set to be 10, the mutation amount was set to be 0.125, and the mutation rate was set to be 0.10.

5.4 Weaknesses

With respect to the methodology, it is important to realize that machine learning is dependent on the quality of the training data.
If the training data is unrelated to the compiler type or if there is insu cient training data available, then the machine learner will perform poorly (regardless of optimization). It should also be noted that the o ine training becomes time-consuming when the appropriate attributes to use as input are unknown (which is the expected situation). Further, optimization will have to be repeated for every new type of classi cation problem (although this process can be largely automated as has been demonstrated above). Currently, the training data is extracted from the les, but, if certain elds could somehow be altered signi cantly, then the classi er would have diminished value. This might prove to be a di cult task, but, if a le was being designed to thwart reverse engineering e orts, then it is not outside of the realm of possibility. 83 5.5 Summary In this work, we provide a general motivation for reverse engineering of les, as well as the value added by identifying attributes about those les from the Java Class, Linux ELF and Windows PE les themselves. In doing so, we motivate the need for the ability to identify the compiler used to produce a Java Class, Linux ELF or Windows PE le. Moreover, this identi cation should take place with only knowledge obtained from the target le itself. We present a software system capable of extracting information from Java Class, Linux ELF and Windows PE les and analyzing those les appropriately and accurately. This analysis makes use of a General Regression Neural Network that is optimized using a pair of optimization techniques. Experiments were performed on a large test set of Java Class, Linux ELF and Windows PE les to validate the approach and classi cation error was shown to be less than 2%. Ultimately, the contribution of this work is to contribute to reverse engineering by providing a means to identify the type of compiler used to compile a Java Class, Linux ELF and Windows PE le using only information obtained from the target le itself. 84 Chapter 6 Classifier Architecture The compiler classi er is a three stage process. The rst stage, o ine le processing, is responsible for gathering the desired le properties from the input executable produced by a compiler for a certain le type (e.g. Linux ELF). These values are stored into a XML le associated with the compiler for use by the optimizer. The second stage, o ine optimization, is responsible for taking the input XML les for a set of compilers and processing them through a GRNN. The output of the second stage is a sigma value, a ltered XML le containing the actual le properties used and a list of attributes used for classifying a executable le for that le type (e.g. Linux ELF). The nal stage, on-line compiler classi cation, is responsible for classifying the compiler used to create a unknown le by processing the saved XML data and the unknown le properties through a GRNN. The on-line processing will save the detected compiler id as Meta information for use by other child Components. 6.1 O ine File Processing The most time intensive part of classi cation is the optimization phase. It is during this part of the classi cation that the actual values used for classifying new les are determined. The o ine le processing is started rst by collecting the values using the compiler dump program, described in Section 6.1.1 and then Section 6.3.3. 
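Viewed end to end, the three stages can be summarized by three function signatures, shown below as declarations only. This is an orienting sketch: the names, the string-based parameter types, and the return values are assumptions, not the Libreverse interfaces described in the remainder of this chapter.

#include <string>
#include <vector>

// Stage 1 (offline file processing): dump the chosen file properties for one
// compiler's set of test programs into an XML file; returns the XML file path.
std::string dump_file_properties(const std::string& compiler_name, int compiler_id,
                                 const std::string& source_type,
                                 const std::string& directory);

// Stage 2 (offline optimization): run the GRNN optimizer over the dumped XML
// files, producing the sigma value, the filtered training data, and the list of
// attributes to use; returns the path of the condensed XML output.
std::string optimize_grnn(const std::vector<std::string>& training_xml_files,
                          const std::string& source_type);

// Stage 3 (on-line classification): classify an unknown file against the saved
// optimizer output and return the detected compiler id for use by later components.
int classify_compiler(const std::string& unknown_file,
                      const std::string& optimized_training_xml);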
6.1.1 Compiler Dump Program The compiler dump program was created to collect the attributes that are believed by an human expert to be necessary to classify the compiler. It is important to under- stand that the actual attributes necessary to classify the compiler will not be known 85 until after the optimizer program is executed on the collected data. Each of the at- tributes collected are believed to be either di cult to falsify or counterproductive if they are altered. Attributes that are \informative" in nature, for example the date the target program was created, can either be ignored or eliminated without a ecting the program?s execution. The compiler dump program requires four pieces of information from the user via command line ags: compiler name: String name to be associated with the compiler that produced the set of test programs. compiler id: Unique integer identi er to be associated with the compiler that produced the set of test programs. source type: String name describing the type of test programs (e.g. Java class). directory: Location where the test programs can be found. The compiler dump program uses these command line ags to locate the appropri- ate algorithm to use, collect the test programs from the given directory, process each executable program, and store the collected information in a XML le. Section 6.1.1.1 will describe each of the kind of algorithms available. Section 6.2 describes the XML schema associated with the type of test programs. Figure 6.1 shows the structure of the compiler dump program. 6.1.1.1 Compiler Dump Algorithm Each algorithm that is used to gather suspect le attributes must follow the inter- face laid out in the Compiler Dump Algorithm class. The use of an abstract base class for the Compiler Dump Algorithm was necessary since it is important to hide the kind of algorithm being used from the trainer program. It is unnecessary for the compiler dump program to know which algorithm it is interacting with when processing a set of les. This property allows the design to support a wide variety of le types and allows scalability to future design changes. 86 Figure 6.1: Compiler Dump Algorithm class diagram The correct Compiler Dump Algorithm is found through the Compiler Dump Algo- rithm Factory?s get Algorithm function. The Compiler Dump Program calls this func- tion, obtains the correct algorithm, and then calls the algorithm?s functions to perform its work. 6.1.1.1.1 ELF Compiler Dump Algorithm The ELF Compiler Dump Algo- rithm, as shown in Figure 6.2, is used when processing a Linux ELF executable. Since the advent of 64-bit computing the ELF format was extended to support 64-bit executa- bles. Therefore the ELF Compiler Dump Algorithm must rst determine which type of le is being read, load the le via the correct le reader, and then read the necessary attributes. This is shown in Figure 6.3. 87 Figure 6.2: ELF Compiler Dump Algorithm class diagram Figure 6.3: ELF Compiler Dump Algorithm sequence diagram for reading a 32-bit Linux ELF le 6.1.1.1.2 Windows Compiler Dump Algorithm The Windows Compiler Dump Algorithm, as shown in Figure 6.4, is used when processing a Windows PE or PE+ executable. When Microsoft released its 64-bit version of Windows XP it extended the Windows PE format, called PE+, to support 64-bit executables. Therefore the Windows Trainer Dump Algorithm must rst determine which type of le is being read, 88 load the le via the correct le reader, and then read the necessary attributes. This is shown in Figure 6.5. 
Figure 6.4: Windows Compiler Dump Algorithm class diagram Figure 6.5: Windows Compiler Dump Algorithm sequence diagram for reading a Win- dows PE+ le 6.1.1.1.3 Java Compiler Dump Algorithm The Java Compiler Dump Algo- rithm, as shown in Figure 6.6, is used when processing a Java class le. Unlike the Linux ELF and Windows PE le format, there is no need to determine the kind of le that is being read. All the Java Compiler Dump Algorithm must do is load the le via 89 the Java le reader and then read the necessary attributes. This is shown in Figure 6.7. Another unique feature of this algorithm is the use of statistics about attributes in the le. When investigating the Java class le format it was noticed that certain kinds of objects in the Constant Pool table appeared more often for one compiler versus another. Therefore additional programming code was implemented to keep track of how many of each kind of Constant Pool table entry occurred. These operations are performed by the collect Constant Pool Stats and print Constant Pool Stats functions. Figure 6.6: Java Compiler Dump Algorithm class diagram 90 Figure 6.7: Java Compiler Dump Algorithm sequence diagram for reading a Java class le 6.2 Input Data While a comma-delineated le makes it easy to switch between programs like Win- dows Excel and a text editor it does little in explaining what the information is and any expected elds. XML was chosen to represent the attributes collected from the input les since that it was designed [65] for: 1. XML shall be straightforwardly usable over the Internet. 2. XML shall support a wide variety of applications. 3. XML shall be compatible with SGML. 4. It shall be easy to write programs which process XML documents. 5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero. 6. XML documents should be human-legible and reasonably clear. 7. The XML design should be prepared quickly. 8. The design of XML shall be formal and concise. 9. XML documents shall be easy to create. 91 10. Terseness in XML markup is of minimal importance. It was important that the XML data should be easy to process, human readable, concise and easy to create. Since human developers are involved any ambiguities must be removed. XML has the option that requires les to adhere to a particular format. In this present design a schema is provided but not validated due to the limitation in the XML library used to parse the les. This is an important point in the future development of the compiler dump program. 6.2.1 Linux ELF XML The input information used by the compiler classi er was obtained from a Compiler Dump program utilizes the Libreverse library, a reverse engineering library also written by the authors, to read the input ELF les. The input data was formatted in XML [6] in order to produce a easy parse input format to the classi er. The input data has the following XML schema [25] [61] [5] as shown in Listing 6.1. 92 Listing 6.1: XML schema for input ELF XML data The description of each of the elements are: compiler id: Unique id associated with the compiler that produce the le. 93 size: File size entry point address: The address where program control is transferred. section header start: The le o set to the section header table. program header count: The number of program headers in the le. section header count: The number of section headers in the le. section header string table index: The index into the section header table associated with the string table. 
text section size: Size in bytes of the executable code section in memory. 6.2.2 Java Class XML The input information used by the compiler classi er was obtained from a Compiler Dump program utilizes the Libreverse library, a reverse engineering library also written by the authors, to read the input Java Class les. The input data was formatted in XML [6] in order to produce a easy parse input format to the classi er. The input data has the following XML schema [25] [61] [5] as shown in Listing 6.2. 94 Listing 6.2: XML schema for Java XML data 95 The attributes that are collected are those that are independent of the java byte codes produced. The description of each of the elements are: version: This value is used to indicate what release level of Java this binary requires the JVM to support. this class index: The value of the this class item must be a valid index into the constant pool table. The constant pool entry at that index must be a Constant Class Info structure representing the class or interface de ned by this class le.[8] super class index: This is similar to the this class de ned above but the only di erence is that this index refers to a parent class or the default Object class. constant pool count: Number of constant pool entries. constant pool info (tag): Unique id for the constant pool entry. constant pool info (count): Number of times the unique id was seen. constant pool info (ratio): Ratio of the number times to the number of con- stant pool entries. eld count: Number of eld info structures in the class le. method count: Number of method info structures in the class le. 6.2.3 Windows PE XML The input information used by the compiler classi er was obtained from a Compiler Dump program utilizes the Libreverse library, a reverse engineering library also written by the authors, to read the input Windows PE les. The input data was formatted in XML [6] in order to produce a easy parse input format to the classi er. The input data has the following XML schema [25] [61] [5] as shown in Listing 6.3. The description of the elements are: compiler id: Unique id associated with the compiler that produce the le. 96 size: File size exe header address: The address in the binary le where the PE header is located. co section header count: The number of section headers in the le. pe opt code size: The size in bytes of the section that contains the executable instructions. pe opt base of data: The memory address where the data for the PE le starts. pe opt entry point: The memory address where the rst instruction to be ex- ecuted is located. pe opt image size: Size in bytes of the PE le in memory. 97 Listing 6.3: XML schema for input PE XML data 98 6.3 GRNN Optimizer The grnn optimizer program was created to handle the o ine classi cation of the compiler classi cation to determine the actual le parameters used and the optimal sigma value. The grnn optimizer has the follow command line ags: directory: Where the training data XML les are stored. source type: String name describing the type of test programs (e.g. Java class). output: Where the nal training data le is stored. The grnn optimizer program uses these command line ags to locate the appropriate algorithm to use, collect the training data XML les from the given directory, train and validate then nally store the condensed information in a XML le. Section 6.3.1 will describe each of the kind of algorithms available. Section 6.2 describes the XML schema associated with the training data. 
6.3.1 Optimizer Algorithm Each algorithm that is used to parse the XML training data must follow the in- terface laid out in Figure 6.8. The use of an abstract base class for the Optimizer Algorithm was necessary since it was determined to hide the kind of algorithm being used from the grnn optimizer program. It is unnecessary for the grnn optimizer pro- gram to know which algorithm it is interacting with when processing an XML le. This allows the design to support a wide variety of training data and allows scalability to future design changes. 99 Figure 6.8: Optimizer Algorithm class diagram The correct Optimizer Algorithm is found through the Optimizer Algorithm Fac- tory?s get Algorithm function. The grnn optimizer program calls this function, obtains the correct algorithm, and then calls the algorithm?s functions to perform its work. Each optimization algorithm was designed to accept the type of classi er to use as an template parameter. This design feature allows for the future expansion to additional classi er algorithms without duplication of functionality. 6.3.1.0.4 ELF Optimizer Algorithm The ELF Optimizer Algorithm, as shown in Figure 6.9, is used when processing a Linux ELF executable. The ELF Optimizer Algorithm parses in the input XML le name into a ELF Training Data object, runs the appropriate classi er and outputs the nal XML data. This interaction is shown in Figure 6.10. 100 Figure 6.9: ELF Optimizer Algorithm class diagram Figure 6.10: ELF Optimizer Algorithm sequence diagram 6.3.1.0.5 Java Optimizer Algorithm The Java Optimizer Algorithm, as shown in Figure 6.11, is used when processing a Java Class le. The Java Optimizer Algorithm 101 parses in the input XML le name into a Java Training Data object, runs the appropriate classi er and outputs the nal XML data. This interaction is shown in Figure 6.12. Figure 6.11: Java Optimizer Algorithm class diagram Figure 6.12: Java Optimizer Algorithm sequence diagram 6.3.1.0.6 Windows Optimizer Algorithm The Windows Optimizer Algorithm, as shown in Figure 6.13, is used when processing a Windows PE executable. The 102 Windows Optimizer Algorithm parses in the input XML le name into a Windows Training Data object, runs the appropriate classi er and outputs the nal XML data. This interaction is shown in Figure 6.14. Figure 6.13: Windows Optimizer Algorithm class diagram Figure 6.14: Windows Optimizer Algorithm sequence diagram 103 6.3.2 Input parser Each of the Optimization algorithms described in Section 6.3.1.0.4, 6.3.1.0.5 and 6.3.1.0.6 rely on the IO class to parse the stored XML data. The XML data is the result of running the grnn optimizer program on the collected training data. The parsing of the stored XML data is covered in Section 6.3.2.1 and the data structure returned from the parser is described in Section 6.3.2.3. 6.3.2.1 IO The IO class is use during the on-line and o ine phases of compiler classi cation. The rst ?get Data? function that takes a list of input XML data les is used during the o ine phase. The pseudo code of this function is covered in Algorithm 8. Algorithm 8 IO - get Data (o ine phase) 1: procedure get Data(F) 2: Create an instance of the correct parser 3: for f2F do 4: if f does NOT exist then 5: Throw an exception 6: end if 7: Parse the Training Set from f using the parser 8: end for 9: Normalize the Training Set 10: Separate the Training Set into 60% training, 20% test and 20% validation data 11: Save the maximum value for each column of the Training Set. 12: return sorted Training Set. 
13: end procedure The algorithm takes in a Training Set containing N di erent entries and prepares them for use in the optimization operation. The normalization of data is required when using data with a neural network [20] because normalization enables comparison between the various le attributes. The normalized data is separate into the values use to learning from (training data), check the resulting learner (test data) and verify its correct (validation data). All the information is now ready for the optimizer and GRNN to learn if there is any means to distinguish the various classes of input from each other. 104 The second ?get Data? function is used for reading the stored normalized training data that was the result of the grnn optimizer. This information is utilized by the Com- piler Classi er Component described in Section 9.4.11. There is no need to normalize and separate the data again like done in the rst ?get Data? function. The data is con- verted from its XML format into the appropriate Training Set object for use by the Compiler Classi er Component. Algorithm 9 describes the operations of this function. Algorithm 9 IO - get Data (on-line phase) 1: procedure get Data(f) 2: if f does NOT exist then 3: Throw an exception 4: end if 5: Create an instance of the correct parser 6: Parse the training set from f using the parser 7: Create new Training Set object 8: Add input data to the Training Set object 9: Add the maximum values to the Training Set object 10: return sorted Training Set object. 11: end procedure 6.3.2.2 Training Set The training set is a data container for storing the training, test and veri cation information in training data objects. Each Training Set is specialized by the kind of training data objects stored in it through template specialization. Various methods are available to allow iteration through the various training data object?s values during the optimization and classi cation operations. 6.3.2.3 Training Data The training data is a data container that stores the value for one kind of training information (e.g. test data for Linux ELF) specialized by a template parameter. It is a standard template library map that is designed to use the attribute id as the key to set or retrieve information. Each optimization algorithm for a particular le type, e.g. ELF Optimizer Algorithm, uses prede ned attributes in these specialized training data 105 classes to access values stored in the training data. The following sections describe each of the specialized training data classes. Each of the specialized Training Data classes pro- vide prede ned constants, \ATTRIBUTE COUNT" and \CLASSIFIER TARGET"?, that are used by the Optimizer and GRNN classes during the optimization process. These values allow for the Optimizer to know how many attributes it is consider and which target the data is assigned to without knowing anything about the data. By design the \CLASSIFIER TARGET" is always the rst attribute in each training data object. Therefore \CLASSIFIER TARGET" is set to 0. Each class also has an enu- meration which lists the unique id for each attribute. The names and unique ids are used are the result of the expert determining the le attributes that could potentially be used for classi cation. How many le attributes used in each training data object is set in the \ATTRIBUTE COUNT". 6.3.2.3.1 ELF Training Data The ELF Training Data class is used to specialize the training data class for storing values retrieved from the input XML data. 
6.3.2.3.1 ELF Training Data

The ELF Training Data class specializes the training data class for storing values retrieved from the input XML data. It has its "ATTRIBUTE_COUNT" set to 8 and lists the values of its Attributes enumeration in Table 6.1.

Name                                          ID
ATTRIBUTE_TARGET_ID                           0
ATTRIBUTE_FILESIZE                            1
ATTRIBUTE_ENTRY_POINT_ADDRESS                 2
ATTRIBUTE_SECTION_HEADERS_START               3
ATTRIBUTE_PROGRAM_HEADER_COUNT                4
ATTRIBUTE_SECTION_HEADER_COUNT                5
ATTRIBUTE_SECTION_HEADER_STRING_TABLE_INDEX   6
ATTRIBUTE_TEXT_SECTION_SIZE                   7

Table 6.1: ELF Training Data - Attributes enumeration

6.3.2.3.2 Java Training Data

The Java Training Data class specializes the training data class for storing values retrieved from the input XML data. It has its "ATTRIBUTE_COUNT" set to 32 and lists the values of its Attributes enumeration in Table 6.2.

Name                                          ID
ATTRIBUTE_TARGET_ID                           0
ATTRIBUTE_FILESIZE                            1
ATTRIBUTE_THIS_INDEX                          2
ATTRIBUTE_SUPER_INDEX                         3
ATTRIBUTE_VERSION                             4
ATTRIBUTE_CONSTANT_POOL_COUNT                 5
ATTRIBUTE_CONSTANT_UTF8_COUNT                 6
ATTRIBUTE_CONSTANT_UTF8_RATIO                 7
ATTRIBUTE_CONSTANT_RESERVED_COUNT             8
ATTRIBUTE_CONSTANT_RESERVED_RATIO             9
ATTRIBUTE_CONSTANT_INTEGER_COUNT              10
ATTRIBUTE_CONSTANT_INTEGER_RATIO              11
ATTRIBUTE_CONSTANT_FLOAT_COUNT                12
ATTRIBUTE_CONSTANT_FLOAT_RATIO                13
ATTRIBUTE_CONSTANT_LONG_COUNT                 14
ATTRIBUTE_CONSTANT_LONG_RATIO                 15
ATTRIBUTE_CONSTANT_DOUBLE_COUNT               16
ATTRIBUTE_CONSTANT_DOUBLE_RATIO               17
ATTRIBUTE_CONSTANT_CLASS_COUNT                18
ATTRIBUTE_CONSTANT_CLASS_RATIO                19
ATTRIBUTE_CONSTANT_STRING_COUNT               20
ATTRIBUTE_CONSTANT_STRING_RATIO               21
ATTRIBUTE_CONSTANT_FIELDREF_COUNT             22
ATTRIBUTE_CONSTANT_FIELDREF_RATIO             23
ATTRIBUTE_CONSTANT_METHODREF_COUNT            24
ATTRIBUTE_CONSTANT_METHODREF_RATIO            25
ATTRIBUTE_CONSTANT_INTERFACE_METHODREF_COUNT  26
ATTRIBUTE_CONSTANT_INTERFACE_METHODREF_RATIO  27
ATTRIBUTE_CONSTANT_NAME_AND_TYPE_COUNT        28
ATTRIBUTE_CONSTANT_NAME_AND_TYPE_RATIO        29
ATTRIBUTE_FIELD_COUNT                         30
ATTRIBUTE_METHOD_COUNT                        31

Table 6.2: Java Training Data - Attributes enumeration

6.3.2.3.3 Windows Training Data

The Windows Training Data class specializes the training data class for storing values retrieved from the input XML data. It has its "ATTRIBUTE_COUNT" set to 8 and lists the values of its Attributes enumeration in Table 6.3.

Name                                          ID
ATTRIBUTE_TARGET_ID                           0
ATTRIBUTE_FILESIZE                            1
ATTRIBUTE_EXE_HEADER_ADDRESS                  2
ATTRIBUTE_COFF_SECTION_COUNT                  3
ATTRIBUTE_PE_OPT_CODE_SIZE                    4
ATTRIBUTE_PE_OPT_BASE_OF_DATA                 5
ATTRIBUTE_PE_OPT_ENTRY_POINT                  6
ATTRIBUTE_PE_OPT_IMAGE_SIZE                   7

Table 6.3: Windows Training Data - Attributes enumeration

Figure 6.15: Optimizer class diagram

6.3.2.4 Training Data Parser

Each input training data file is stored with a different XML schema, as shown in Section 6.2.2 for Java Class, Section 6.2.1 for Linux ELF and Section 6.2.3 for Windows PE. An XML parser is used to convert the data from XML format into a Training Set object.

6.3.3 Optimizer

The Optimizer object, shown in Figure 6.15, is responsible for discovering the best possible solution for a given set of input data, sigma and mutation values. It can be specialized with a classifier and a data type, allowing both the classifier algorithm and the input data type to be replaced.
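A declaration-level sketch of that parameterization is shown below. The stub types only mark where the real Training Set, Candidate Solution and GRNN classes plug in, and the member list of Candidate_Solution reflects my assumption about what a solution needs to carry (attribute switches, a sigma value and a fitness score), not the actual definition.

#include <vector>

// Stand-ins marking where the real Libreverse classes plug in.
template <typename Data> struct Training_Set { std::vector<Data> entries; };
struct GRNN {};   // the classifier of Section 3.2

// Assumed contents of a candidate solution in the gene pool.
template <typename Classifier, typename Data>
struct Candidate_Solution
{
    std::vector<bool> attributes;   // which file attributes are switched on
    double            sigma;        // smoothing value handed to the classifier
    double            fitness;      // classification accuracy on the test data
};

template <typename Classifier, typename Data>
class Optimizer
{
public:
    Optimizer(Training_Set<Data> const& input, double sigma, double mutation)
        : m_input(input), m_sigma(sigma), m_mutation(mutation) {}

    void process();   // the evolutionary search of Algorithm 10 (definition omitted)

    Candidate_Solution<Classifier, Data> best() const { return m_best; }

private:
    Training_Set<Data> m_input;
    double m_sigma;
    double m_mutation;
    Candidate_Solution<Classifier, Data> m_best;
};

// Swapping the classifier or the file format is only a matter of instantiation:
//   Optimizer<GRNN, Elf_Training_Data>     elf_optimizer(elf_data, 0.5, 0.1);
//   Optimizer<GRNN, Windows_Training_Data> pe_optimizer(pe_data, 0.5, 0.1);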
The function of the Optimizer object that the optimizer algorithms invoke is 'process', described in Algorithm 10 and shown in Figure 6.16.

Algorithm 10 Optimizer - process function pseudo code
1: procedure process
2: Populate the gene pool with 10 Candidate Solution objects
3: for i = 0 to iteration limit do
4: Evaluate each Candidate Solution in the gene pool
5: Get the first randomly selected parent from the gene pool
6: Get the second randomly selected parent from the gene pool
7: Breed the two parents to produce a child Candidate Solution
8: Mutate the child Candidate Solution attributes
9: Calculate a mutation chance between 1 and 10
10: if the mutation chance is 1 then
11: Mutate the Sigma value
12: end if
13: Evaluate the child Candidate Solution
14: Set the iteration count in the child Candidate Solution
15: Replace the worst Candidate Solution in the gene pool with the child
16: end for
17: Save the best Candidate Solution in the gene pool
18: end procedure

Figure 6.16: Optimizer sequence diagram

6.3.4 Candidate Solution

The Candidate Solution, shown in Figure 6.15, is a data structure used to keep track of the possible solutions held in the Optimizer gene pool. In the present implementation of Libreverse the Candidate Solution is instantiated with the GRNN classifier and the data structure appropriate to the optimizer algorithm.

6.3.5 GRNN

The GRNN, shown in Figure 6.15, implements the principles described in Section 3.2 to perform the classification during the off-line and on-line phases.

Chapter 7
API

Applications built against Libreverse work directly through the API interface. The interface controls the operation of the Libreverse library.

7.1 Execute

This function, shown in Figure 7.1, coordinates all the specific operations involved in reverse engineering the target file. The idea behind this function is that most users of Libreverse will not have detailed knowledge of what needs to be done, so the function automates all the operations for the user.

Figure 7.1: Libreverse API 'execute' function

Parameter - Description
target_file - File to be processed by Libreverse.
input_type - File type of the input (e.g. binary).
output_type - Desired output (e.g. C++).
trace_level - Tracing level for recording debugging information.
trace_mask - Bitwise OR value that represents where debugging information is recorded.

Table 7.1: Parameters of the Execute function

7.1.1 Steps of Processing

The Reverse class handles the actual processing of the target file given the input and output type. The execute function takes the parameters shown in Table 7.1 and performs the steps shown in Algorithm 11.

Algorithm 11 Reverse - execute
1: procedure execute(target, type_input, type_output, debug_level, debug_mask)
2: Determine whether the target file exists
3: Check that the desired input and output types are valid
4: Prepare the initial source component
5: Prepare the initial data source that contains the target file name
6: Get the Component Graph associated with the input and output type
7: Execute the input section
8: Execute the analysis section
9: Execute the output section
10: end procedure
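From the calling application's point of view this entire sequence sits behind a single API call. The sketch below shows what such a call site might look like; the namespace, enumerator names, default arguments and the stub body of execute are assumptions that mirror Table 7.1, not the actual Libreverse header.

#include <iostream>
#include <stdexcept>

namespace libreverse
{
    enum Input_Type  { BINARY };
    enum Output_Type { CPLUSPLUS, UML };

    // Stub standing in for the real library entry point; linking against
    // Libreverse would supply the actual implementation behind this call.
    void execute(char const* target_file,
                 Input_Type  /* input_type */,
                 Output_Type /* output_type */,
                 int trace_level = 0,
                 int trace_mask  = 0)
    {
        std::cout << "reverse engineering " << target_file
                  << " (trace level " << trace_level
                  << ", trace mask " << trace_mask << ")\n";
    }
}

int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: analyze <target file>\n";
        return 1;
    }

    try
    {
        // One call drives the whole input / analysis / output pipeline.
        libreverse::execute(argv[1], libreverse::BINARY, libreverse::CPLUSPLUS);
    }
    catch (std::exception const& e)
    {
        std::cerr << "Libreverse failed: " << e.what() << '\n';
        return 1;
    }
    return 0;
}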
The first two steps ensure that everything provided by the application calling 'execute' is correct and available before processing of the target file begins. In the third and fourth steps, the input Data Source, described later in Chapter 11, is prepared with the given target file name for use with the source component. A Null Component, described later in Section 9.4.8, acts as the source component providing information to the first component in the Component Graph, described in Chapter 10.

In the fifth step, the Component Graph to be used is obtained from the Configurator, described in Section 8.1, using the input and output type. The Component Graphs are predefined arrangements of components that represent the best practices of reverse engineering discovered by the research community. How the graph is obtained is defined in Chapter 8 and how it is constructed in Chapter 10.

The sixth through eighth steps process the target file using the Component Graph. This operation is handled by the Reverse_Impl class. In order to hide the internal operations of the processing stages, the Reverse_Impl class offers three functions. The first, execute_Input_Section, processes the Input section of the Component Graph. The second, execute_Analysis_Section, processes the Analysis section of the Component Graph. Finally, execute_Output_Section processes the Output section of the Component Graph. Information is passed between the sections by the Execute function. Each of the processing functions in the Reverse_Impl class follows the same algorithm, shown in Algorithm 12.

Algorithm 12 Reverse_Impl - execute
1: procedure execute
2: Obtain the Component Graph for this part (e.g. input)
3: Print the Component Graph in graphviz dot file format for use in documentation
4: Initialize the Component Graph with the source component
5: Process each component in the Component Graph
6: Retrieve the results from the last component
7: end procedure

The idea of splitting up the steps of reverse engineering was inspired by the way Cifuentes [14] and the Boomerang [35] project performed the major steps of reverse engineering. Various uses of reverse engineering share similar input (e.g. a Windows PE executable) but differ in their analysis and output operations.

Chapter 8
Configuration

Libreverse is dynamically configured at run time based on the kind of input and output the caller requests through the API execute function. The controller of Libreverse's configuration is the Configurator class covered in Section 8.1. It provides a variety of information to various parts of Libreverse. Figure 8.1 shows the class diagram of the Configurator.

Figure 8.1: Configurator Class Diagram

8.1 Configurator

The Configurator is a singleton object that provides the Component Graphs for processing a target binary file and sets up the Data Source factory for creating the Data Source objects used to pass information. The reason for making the Configurator a singleton is that there should be only one source of this information, so the Singleton pattern was employed.

The first responsibility of the Configurator is to create the three Component Graphs composed of Components prearranged by the associated Formula File. The input and output types given to the get_Graph function retrieve the appropriate master formula file. Each master formula file contains a text string naming the formula file for each of the sections (Input, Analysis and Output). This is seen in Figure 8.2. The pseudo code for get_Graph is shown in Algorithm 13.

Figure 8.2: Configurator get_Graph sequence diagram

The second responsibility of the Configurator is to set up the Data Source Factory through which Data Source objects are created for transferring information from one component to the next. At this time only the memory-passing Data Source is produced to transfer information between components. In the future additional Data Sources will be supported, as shown in Section 15.2.6.
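A minimal sketch of the Configurator's shape and its two responsibilities follows. The method names track the description above, but the signatures, the stub return types and the trivial bodies are assumptions; the real get_Graph runs Algorithm 13, which is given in the next section.

// Stubs for the objects the Configurator hands out.
struct Component_Graph_Set {};   // the Input, Analysis and Output graphs
struct Data_Source_Factory {};
enum Input_Type  { BINARY };
enum Output_Type { CPLUSPLUS, UML };

class Configurator
{
public:
    // Single point of access, as the Singleton pattern requires.
    static Configurator& instance()
    {
        static Configurator self;
        return self;
    }

    // In the real class this runs Algorithm 13: look up the master formula
    // entry for (input, output) and parse the three formula files.
    Component_Graph_Set get_graph(Input_Type /* input */, Output_Type /* output */)
    {
        return Component_Graph_Set();
    }

    // Hands out the factory used to create Data Source objects.
    Data_Source_Factory& data_source_factory()
    {
        static Data_Source_Factory factory;
        return factory;
    }

private:
    Configurator() {}                            // construction only via instance()
    Configurator(Configurator const&);           // copying disabled
    Configurator& operator=(Configurator const&);
};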
8.2 Master Formula File

The purpose of the master formula file is to name the three formula files that compose the best-practice solution for a specific problem. An entry in the master formula file is composed of an input and an output value that together form the key pair for the solution. The input formula, analysis formula and output formula entries store the string names of the formula files for the Input, Analysis and Output Component Graphs respectively. A high-level view of how the formula files work together is seen in Figure 8.3. Listing 8.1 gives the XML schema for the master formula file.

Algorithm 13 Configurator - get_Graph
1: procedure get_Graph(IT, OT)
2: if the Master Formula Map is not set then
3: Get the formula directory from the Configuration Data
4: Read the Master Formula Map
5: end if
6: Search for the correct Formula Map in the Master Formula Map using IT and OT
7: if the Formula Map is not found then
8: Throw an exception
9: end if
10: Parse the input Component Graph with a Formula Parser
11: Parse the analysis Component Graph with a Formula Parser
12: Parse the output Component Graph with a Formula Parser
13: return the resulting Component Graphs
14: end procedure

Figure 8.3: High-level view of a set of Component Graphs

Listing 8.1: XML schema for the master formula file

8.3 Formula File

The Formula File represents the best practice for a particular part, e.g. the input phase, of the reverse engineering process. The file describes the arrangement of components in a Component Graph, stored in XML format using the schema shown in Listing 8.2. Each formula file has a series of node elements that make up the graph. Each node element has a list of predecessors providing it with data and meta information, plus a name element giving the node's name and unique id. Each node must follow these rules:

Child node: If a node has predecessors, then those predecessor nodes must already exist in the Component Graph; no parent node may reference an unknown child. This rule prevents loops from occurring in the resulting Component Graph. A parent node will always have a lower unique id than any of its children.

No duplicates: Each node must provide a unique id attribute within the Component Graph being constructed.

Listing 8.2: XML schema for the formula file

Chapter 9
Components

9.1 Design Principles

A component is defined as an object with two modes of operation (source and working) that is loosely coupled from other components and performs a single primitive operation. Previous reverse engineering solutions [35] [14] [41] used functions in the programming language to provide the primitive operations. The problem with those approaches is that they do not allow the user to rearrange the primitive operations for new problems without altering the program source code. The design principles are listed here and described in further detail below.

Two-mode operation (source vs. working)
Single primitive operation
Loosely coupled
Type of the source of input data and destination of output data hidden

9.1.1 Two Mode Operation

When a component is in the working mode it is designed to operate as part of a component graph, defined later in Chapter 10, to perform its specific action. Each child component takes its input data from its parent components through the data source, defined later in Chapter 11. The input data is retrieved from a list of data sources contained in the graph visitor, defined later in Chapter 10.
If all the necessary sources of input data have been provided to a component in the working mode, it performs its specific operation. The results are placed into a data source and entered into the list of data sources for use by other components.

When a component is the head node of the component graph it must have its state set to source mode. When a component is in the source mode its normal operation is not performed; the role of a component in this state is to provide the data contained in its data source to child components in the component graph. For example, the file name of a target binary executable to be analyzed can be provided by a Null component set in the source mode. How the graph visitor interacts with the components is described in Chapter 10.

9.1.2 Single Primitive Operation

Any complex reverse engineering technique is composed of many primitive operations, and these primitive operations are often reused across many different techniques. To capture this principle, each of the components described in this chapter is designed to perform only a single primitive operation. Combined with the loosely coupled design and the data sources, a component can be reused multiple times in the same component graph without any knowledge of the components before or after it.

9.1.3 Loosely Coupled

Each component is designed to accept a visitor interacting with it and a set of data sources. No hard links are allowed between components, since how a generic reverse engineering architecture will be used cannot be known beforehand. Each component must therefore be designed so that it can be moved to various places in a component graph, giving it nearly unlimited flexibility to handle a vast range of unknown problems.

9.1.4 Hidden data sources

Since a design goal was to maintain loose coupling between components, there had to be a means by which data could be passed. Each data source contains all the information provided by a parent component without the child component needing to know exactly who its parent is. This is important because hard-wiring components together severely limits the range of problems the architecture can solve. Each data source not only hides the previous component from the next but also has the ability to transmit the information through different means. This is further described in Chapter 11.

9.2 Component Class Diagram

The component class is a purely virtual class representing the interface, shown in Table 9.1, that hides the internal workings from other parts of Libreverse. The component base class implements the features of the component interface and is the parent class that developers must inherit from when writing a new component for Libreverse. The component base class also implements the component actor interface described in Section 9.3, and it has a member variable for a Component State object which implements the two-mode operation. The overall structure of a component is shown in Figure 9.1.

Figure 9.1: Component Class Diagram

Function - Description
Add Input Source - Used during the creation of the Component Graph. Each parent component providing data to a child component must give its id to the child. The child uses this id to check whether all of its parent components have completed their operation before running its analysis.
Received Input Source Data - Used during the actual processing of the component graph. The parent component, identified by id, has completed its analysis and recorded its results into the data map in the graph visitor; the child component marks that it has received input from that parent component.
Get Name - Used to obtain the name of the component when printing information for debugging or producing a graphviz file of the component graph.
Run - Used by the graph visitor when it visits the node that contains a component. This starts the primitive operation performed by the component.
Results - Returns the data source object that contains the output data. This is called by the graph visitor to store the results and notify the child components of the completion of the parent component's operation.
Set State - Sets the mode of the component. It is mainly used to set a component to SOURCE_MODE when the Reverse_Impl class designates the component acting as the source of information for a Component Graph.
Get Id - Returns the unique id of the component.
Get Source List Begin - The input token list contains the list of parent components that supply the child component with data. If the data is ready the Boolean value is true, otherwise it is false.
Get Source List End - The input token list contains the list of parent components that supply the child component with data. If the data is ready the Boolean value is true, otherwise it is false.

Table 9.1: Component Interface

9.3 Component Actor Interface

The component actor interface is the mechanism by which the two-mode operation of the component object is implemented. The component base class inherits from the component actor to implement the 'process' function, which is called by the appropriate component state class (Component Source State or Component Working State) depending on which is active in the component at that time. If the component is in SOURCE_MODE then the 'run' function does nothing, since the component base class 'process' function is never called; this is shown in Figure 9.2. Otherwise, if the component is in WORKING_MODE, the 'run' function calls the 'process' function on the component base so that the primitive operation is performed; this is shown in Figure 9.3. The component base class calls the 'process_impl' function in the child class to execute the primitive operation.

Figure 9.2: Component execution with its state set to SOURCE_MODE
Figure 9.3: Component execution with its state set to WORKING_MODE

9.4 Input Components

This section lists and describes the components provided as part of the Libreverse architecture for use in the "input" formula files.

9.4.1 Architecture Type Detector

The Architecture Type Detector Component determines the CPU architecture supported by the binary executable. For most architectures there is some form of identifier that the operating system loader uses when determining whether it can execute a binary executable. There is little incentive for an attacker to alter this id in the binary executable, since they want the target program to execute. Algorithm 14 shows the pseudo code for this component.

Algorithm 14 Architecture Type Detector - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Arch Type Detector Algorithm object from the Arch Type Detector Algorithm Factory based on the file type
10: Execute the algorithm
11: Store the returned Arch Type Meta object in the output data source
12: end procedure
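The file-type keyed factory lookup in step 9 is the recurring mechanism used by this and the following detector components. The sketch below is a self-contained illustration of that lookup; the class names follow Algorithm 14, but the string keys, the return strings and the factory's internals are illustrative assumptions, not the actual Libreverse implementation.

#include <memory>
#include <stdexcept>
#include <string>

struct Arch_Type_Detector_Algorithm
{
    virtual ~Arch_Type_Detector_Algorithm() {}
    // Returns a description of the CPU architecture found in the target file.
    virtual std::string run(std::string const& target_file) = 0;
};

struct Elf_Arch_Type_Detector : Arch_Type_Detector_Algorithm
{
    std::string run(std::string const&) { return "ELF: read e_machine from the ELF header"; }
};

struct Pe_Arch_Type_Detector : Arch_Type_Detector_Algorithm
{
    std::string run(std::string const&) { return "PE: read Machine from the COFF header"; }
};

class Arch_Type_Detector_Algorithm_Factory
{
public:
    // The file type string comes from the File Type Detector's meta information.
    std::shared_ptr<Arch_Type_Detector_Algorithm> get_algorithm(std::string const& file_type)
    {
        if (file_type == "ELF")        return std::make_shared<Elf_Arch_Type_Detector>();
        if (file_type == "Windows PE") return std::make_shared<Pe_Arch_Type_Detector>();
        throw std::runtime_error("Unsupported file type: " + file_type);
    }
};

Because the component only ever sees the abstract algorithm interface, adding support for a new file format means adding one algorithm class and one factory entry, without touching the component itself.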
9.4.2 Code Section Detector

The Code Section Detector determines the beginning address in memory of the executable code of a binary executable and the size of the memory used. The Code Section Detector uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to read the necessary fields. This information is appended to the output data source as meta information. Algorithm 15 shows the pseudo code for this component.

Algorithm 15 Code Section Detector - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Code Section Detector Algorithm object from the Code Section Detector Algorithm Factory based on the file type
10: Execute the algorithm
11: Store the returned meta object in the output data source
12: end procedure

Each of the Code Section Detector algorithms uses its associated File Reader to obtain the code section information. Windows PE, Linux ELF and Mac OS X Mach-O all allocate memory into which the operating system loader places the executable instructions. Java class files instead have Code Attribute sections that are part of each method, and each class file is loaded by the JVM as it is required. No memory is allocated in a way comparable to the other formats, so there is no meaningful value that can be returned for the start of the code section or for its size.

9.4.3 Data Section Detector

The Data Section Detector determines the beginning address in memory of the program data in the binary executable and the size of the memory used. The Data Section Detector uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to read the necessary fields. This information is appended to the output data source as meta information. Algorithm 16 shows the pseudo code for this component.

Algorithm 16 Data Section Detector - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Data Section Detector Algorithm object from the Data Section Detector Algorithm Factory based on the file type
10: Execute the algorithm
11: Store the returned meta object in the output data source
12: end procedure

Each of the Data Section Detector algorithms uses its associated File Reader to obtain the data section information. Windows PE, Linux ELF and Mac OS X Mach-O all allocate memory into which the operating system loader places the application data. Java class files have Code Attribute sections that are part of each method, and each class file is loaded by the JVM as it is required. No memory is allocated in a way comparable to the other formats, so there is no meaningful value that can be returned for the start of the data section or for its size.
9.4.4 Entry Point Detector

The Entry Point Detector determines the address in memory where the first executable instruction of the binary executable is located. The Entry Point Detector uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to read the necessary fields. This information is appended to the output data source as meta information. Algorithm 17 shows the pseudo code for this component.

Algorithm 17 Entry Point Detector - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Entry Point Detector Algorithm object from the Entry Point Detector Algorithm Factory based on the file type
10: Execute the algorithm using the target file name
11: Store the returned meta object in the output data source
12: end procedure

Each of the Entry Point Detector algorithms uses its associated file reader to obtain the entry point information. Windows PE, Linux ELF and Mac OS X Mach-O all have a field in their header marking the address in memory of the first instruction to be executed. Java class files have Code Attribute sections that are part of each method. A Java class file without the standard main function is similar to a DLL under Windows or a shared library on Linux: there are many entry points, but the mechanism is very different. In Linux or Windows a file can have one (EXE/DLL) or many (DLL) entry points, with their addresses stored in the file; Java class files instead have a Code Attribute section for each method.

9.4.5 File Header Printer

The File Header Printer obtains a text string representing all the information that can be read from the target binary executable's headers. The File Header Printer uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to read the necessary fields. This information is displayed on the standard output of the command line. Algorithm 18 shows the pseudo code for this component.

Algorithm 18 File Header Printer - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the correct file reader from the Reader Factory
9: Print the file header information retrieved from the file reader to the console
10: Copy the input meta information to the output data source
11: end procedure

9.4.6 File Type Detector

The File Type Detector determines the file type associated with a binary executable. The File Type Detector uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to read the necessary fields. This information is appended to the output data source as meta information. Algorithm 19 shows the pseudo code for this component.

Algorithm 19 File Type Detector - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the correct file reader from the Reader Factory
9: Store the target file name in the output data source
10: Store the file type information retrieved from the file reader in the output data source
11: end procedure

9.4.7 Memory Map Producer

The Memory Map Producer creates a Memory Map object representing the memory image of the target binary executable as it would look after the operating system loader has finished loading the binary. The Memory Map Producer uses the input file name obtained from the data source and the appropriate file reader produced by the File Reader Factory to obtain the Memory Map. The Memory Map object is placed inside a data source and passed on to the next component. Algorithm 20 shows the pseudo code for this component.

Algorithm 20 Memory Map Producer - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the correct file reader from the Reader Factory
9: Store the target file name in the output data source
10: Store the Memory Map object retrieved from the file reader in the output data source
11: end procedure

9.4.8 Null

The Null Component is a useful component for condensing meta information from various sources. It is primarily used within Libreverse as a source of information for a component graph and in places in a formula file where no action is required. Algorithm 21 shows the pseudo code for this component.

Algorithm 21 Null Component - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: Copy the input data to the output data in the data source
4: end procedure

9.4.9 Tevis Unknown Region Checker

The Tevis Unknown Region Checker implements the security check from Tevis's dissertation [59] that determines whether there are any regions in memory not explicitly covered by the Windows PE file header. The original work was expanded to the Linux ELF file format. The Java Class format was excluded due to the design of the file format: there is no possibility of a Java Class file having a region in memory that is not captured somewhere in the file format. Algorithm 22 shows the pseudo code for this component.

Algorithm 22 Tevis Unknown Region Checker - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Tevis Unknown Region Checker Algorithm object from the Tevis Unknown Region Checker Algorithm Factory based on the file type
10: Execute the algorithm using the target file name
11: Store the returned meta object in the output data source
12: end procedure

9.4.10 Tevis Zero Filled Checker

The Tevis Zero Filled Checker implements the security check from Tevis's dissertation [59] that determines whether there are any regions in memory containing more than 50 consecutive zero bytes. The idea is that a region with that many zero bytes could be used to store a malicious program. The original work was expanded to the Linux ELF file format. The Java Class format was excluded due to the design of the file format: there is no possibility of a Java Class file having a region in memory that is not captured somewhere in the file format. Algorithm 23 shows the pseudo code for this component.

Algorithm 23 Tevis Zero Filled Checker - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the file type from the meta object found in the component data object
9: Get the correct Tevis Zero Filled Checker Algorithm object from the Tevis Zero Filled Checker Algorithm Factory based on the file type
10: Execute the algorithm using the target file name
11: Store the returned meta object in the output data source
12: end procedure
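The core of the zero-filled check is a simple scan for long runs of zero bytes in the memory image. The following self-contained sketch illustrates that scan; the Memory Map and meta object plumbing around it, and the function and parameter names, are mine rather than the actual Libreverse code.

#include <cstddef>
#include <utility>
#include <vector>

// Each result is (offset into the image, length of the zero-filled run).
std::vector<std::pair<std::size_t, std::size_t> >
find_zero_filled_regions(std::vector<unsigned char> const& image,
                         std::size_t threshold = 50)
{
    std::vector<std::pair<std::size_t, std::size_t> > regions;
    std::size_t run_start  = 0;
    std::size_t run_length = 0;

    for (std::size_t i = 0; i < image.size(); ++i)
    {
        if (image[i] == 0)
        {
            if (run_length == 0) run_start = i;   // a new run begins here
            ++run_length;
        }
        else
        {
            if (run_length > threshold)           // "more than 50 consecutive zero bytes"
                regions.push_back(std::make_pair(run_start, run_length));
            run_length = 0;
        }
    }
    if (run_length > threshold)                   // run that extends to the end of the image
        regions.push_back(std::make_pair(run_start, run_length));

    return regions;
}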
9.4.11 Compiler Classifier

The Compiler Classifier is used to determine the compiler that produced the target binary executable. The component uses the input file name obtained from the data source and the compiler classifier described in Chapter 5 to produce the output meta object. The meta object is placed inside a Data Source as meta information and passed on to the next component. Algorithm 24 shows the pseudo code for this component.

Algorithm 24 Compiler Classifier - process_impl
1: procedure process_impl
2: Get the component data object from the input data source
3: if the target file name exists then
4: Get the target file name from the component data object
5: else
6: Throw an exception
7: end if
8: Get the directory where the GRNN data is stored
9: if the GRNN data file does not exist then
10: Throw an exception
11: end if
12: Get the file type from the meta object found in the component data object
13: Get the correct Classifier Algorithm object from the Compiler Classifier Algorithm Factory based on the file type
14: Execute the algorithm using the GRNN data file
15: Store the returned meta object in the output data source
16: end procedure

Chapter 10
Component Graph

The Component Graph is a directed graph of Component objects that represents the reverse engineering work to be done on the target file.

10.1 Graph

The Boost Graph Library was used to construct this data structure. Boost provides an adjacency list [52], a two-dimensional graph structure with an entry for each vertex. Each vertex in the adjacency list contains a list of references to the vertices it is linked to. In the case of Libreverse the adjacency list is specialized as a directed graph, so the links between vertices go from the parent vertex to its children only. Each vertex in the directed graph holds a pointer to a Component object.

10.2 ID Map

Since the parent of a vertex has a lower unique identifier than its children, the graph of Component objects can be sorted in ascending order. The ID Map is a Standard Template Library (STL) map that uses the unique identifier of a Component object as the key and, as the value, a handle to the vertex where that Component object resides in the directed graph. This data structure allows Libreverse to quickly access a particular component in the directed graph using only its unique identifier.

10.3 Component Map

The Component Map is a Boost property map used to store the Component object at a particular vertex. The Boost documentation [52] states that:

The Boost Property Map concepts define a general purpose interface for mapping key objects to corresponding value objects, thereby hiding the details of how the mapping is implemented from algorithms. The implementation of types fulfilling the property map interface is up to the client of the algorithm to provide. The property map requirements are purposefully vague on the type of the key and value objects to allow for the utmost genericity in the function templates of the generic library.
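The sketch below shows how the pieces of Sections 10.1 to 10.3 might be expressed with the Boost Graph Library: a directed adjacency list whose vertices carry a component pointer, plus an STL map from unique id to vertex handle. The Component struct and the typedef names are stand-ins; only the Boost types are real.

#include <map>
#include <memory>
#include <boost/graph/adjacency_list.hpp>

struct Component
{
    explicit Component(int id) : unique_id(id) {}
    int unique_id;
    // run(), results(), etc. live on the real component interface
};
typedef std::shared_ptr<Component> Component_Ptr;

// Directed graph whose vertices each carry a pointer to a Component object
// (the Component Map of Section 10.3 as a bundled vertex property).
typedef boost::adjacency_list<boost::vecS,       // out-edge storage
                              boost::vecS,       // vertex storage
                              boost::directedS,  // parent -> child edges only
                              Component_Ptr>     // bundled vertex property
        Component_Graph_t;

typedef boost::graph_traits<Component_Graph_t>::vertex_descriptor Vertex_t;

// The ID Map of Section 10.2: component unique id -> vertex handle.
typedef std::map<int, Vertex_t> Id_Map_t;

// Building a two-node graph in which the parent feeds its results to the child.
void build_example_graph()
{
    Component_Graph_t graph;
    Id_Map_t          id_map;

    Vertex_t parent = boost::add_vertex(std::make_shared<Component>(1), graph);
    Vertex_t child  = boost::add_vertex(std::make_shared<Component>(2), graph);
    id_map[1] = parent;
    id_map[2] = child;

    boost::add_edge(parent, child, graph);   // data flows from parent to child
}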
10.4 Visitor

The Graph Visitor is where the graph processing is controlled. The visitor is a breadth-first search visitor that takes the directed graph, sorts the vertices in ascending order by their unique ids, calls run() on each vertex in turn and passes the results on to the child vertices. A breadth-first search visitor is the best solution because all the parent vertices of a target vertex must have completed their work before the child component can execute; a depth-first search would require some mechanism to revisit a child component if it had more than one parent. Algorithm 25 shows the pseudo code for the Graph Visitor.

Algorithm 25 Graph Visitor - visit
1: procedure visit
2: Get a handle to the graph object from the Component Graph
3: Topologically sort the vertices
4: for each vertex in the Component Graph do
5: Call run on the Component object stored at this vertex
6: Get the number of children of the vertex
7: Save the result from the Component object for each child in the data map
8: for each child of the vertex do
9: Update the child Component object to indicate that the parent Component object has provided data
10: end for
11: end for
12: end procedure

Chapter 11
Data Source

Components in the component graph need to pass information from one to another in a way that limits the coupling between them. Past solutions hard-wired the various parts of the reverse engineering process together; their designs were focused on solving one problem.

There are three typical methods for passing information from one part of an application to another. Most applications pass information within the system memory allocated to the application. A second method passes information through specially formatted files. Finally, information can be stored in a database. In order to be flexible to the needs of various users, the data sources were designed to allow different transfer mechanisms within the Libreverse architecture. At the present time only the memory data source has been implemented. Figure 11.1 shows the class diagram for the data source.

Figure 11.1: Data Source Class Diagram

Each type of data source has a data transfer base object that contains the information to be passed. The data source is explicitly instantiated with the data transfer class describing the transfer method. At present only the memory data transfer class is fully implemented in Libreverse. Each class that inherits from the data transfer base class has a data source config base object containing the information necessary to access the target data. This data source config base object is passed to the data transfer class when it is constructed. Each call to get() returns the stored information and put() stores information within the data source. An example is shown in Figure 11.2 for the memory data transfer.

Figure 11.2: Data Transfer Class Diagram

At present the Libreverse architecture supports passing information within system memory allocated to the application. This design decision was made to simplify the implementation of the prototype. The memory data source config object contains the unique identifier of the Component that provides the information. In the memory data transfer object, the put() call stores the passed-in data object in system memory and the get() call returns the stored data. The sequence of these operations is shown in Figure 11.3. At present the memory data source config object does not do anything, but it is in place in case additional information is ever required to access the memory (e.g. an encryption key).

Figure 11.3: Memory Data Transfer Sequence Diagram
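A minimal sketch of the memory data transfer follows: put() keeps the data object in application memory and get() hands it back, while the config object stays empty until some future transfer mechanism needs access details. The type names and members are illustrative stand-ins, not the actual Libreverse classes.

#include <memory>

struct Data_Object { /* filename, Memory Map, meta information, ... */ };

// Placeholder for access details (e.g. an encryption key) that a future
// transfer mechanism might require; the memory transfer needs none.
struct Memory_Data_Source_Config {};

class Memory_Data_Transfer
{
public:
    explicit Memory_Data_Transfer(Memory_Data_Source_Config const& config)
        : m_config(config) {}

    // Store the information produced by the parent component.
    void put(std::shared_ptr<Data_Object> const& data) { m_data = data; }

    // Hand the stored information to the child component.
    std::shared_ptr<Data_Object> get() const { return m_data; }

private:
    Memory_Data_Source_Config    m_config;
    std::shared_ptr<Data_Object> m_data;
};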
The key design principle is the separation of what is being stored (the data object), how it is being transferred (the data transfer base) and how the transfer is configured (the data source config base); this separation allows great flexibility in future designs. More on the future design of the data source is covered in Section 15.2.6.

Chapter 12
Data Structures

There are three different types of information commonly passed between components.

12.1 Filename

Some components require the filename of the target file so that they can obtain a handle to the appropriate Reader. These components use the Reader to read values from the header associated with the target file or to gain access to the Memory Map of the target file image. The filename data structure is a string that represents the full path to the target file. This is shown in Figure 12.1.

12.2 Memory Map

A memory map object represents an image of a section of memory. It is composed of an index of where the read-write head is in the image, an index of where the read-write head was last located, the base address of the image and a vector of bytes. When the memory map is created it is initialized to the desired size and base address. A component can seek to particular addresses or indexes within the memory map to access data. For example, a memory map allows access within the range 0 to N, where N is the last byte (e.g. memory_map.index_Seek(10) followed by a read); alternatively, a component can go to a particular address and read information (e.g. memory_map.address_Seek(0x80048f8f) followed by a read). There is no list of addresses associated with each index in the memory map; the address associated with each byte is used to calculate the exact index where the read needs to start. For example, if a memory map has a size of 10 and a base address of 0x5000, an address_Seek to 0x5005 resolves to index 0x5005 - 0x5000 = 5. Calculating the index of a byte of memory in this way means the memory map does not have to hold a mapping from addresses to indexes. This is shown in Figure 12.1.

12.3 Control Flow Graph (CFG) Sequence

A control flow graph sequence is a collection of control flow graphs. An executable program has one entry point where the operating system begins to execute instructions, so the CFG Sequence associated with such a program contains only one CFG. A shared library, by contrast, has one or more entry points depending on which functions are publicly available, so its CFG Sequence contains one or more CFGs, each representing one function in the shared library. Each CFG is similar to the component graph: each node in the CFG has a unique id and an associated basic block object, and each basic block contains the list of instructions belonging to that part of the program code up to a jump instruction. This is shown in Figure 12.1.

Figure 12.1: Data Container Class Diagram
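As an illustration of the address-to-index arithmetic described in Section 12.2, the following self-contained sketch shows a memory map that stores only a base address and a byte vector; an address seek is simply the address minus the base address. The class and member names are illustrative, not the actual Libreverse declarations.

#include <cstddef>
#include <stdexcept>
#include <vector>

typedef unsigned int Address_t;   // 32-bit image addresses in this sketch

class Memory_Map
{
public:
    Memory_Map(std::size_t size, Address_t base_address)
        : m_data(size, 0), m_base(base_address), m_index(0), m_previous_index(0) {}

    void index_seek(std::size_t index)
    {
        if (index >= m_data.size()) throw std::out_of_range("index past end of image");
        m_previous_index = m_index;
        m_index = index;
    }

    // A map of size 10 based at 0x5000: address_seek(0x5005) lands on index 5.
    void address_seek(Address_t address) { index_seek(address - m_base); }

    unsigned char read() const { return m_data[m_index]; }

private:
    std::vector<unsigned char> m_data;
    Address_t   m_base;             // base address of the image
    std::size_t m_index;            // current read-write position
    std::size_t m_previous_index;   // last position, as described in Section 12.2
};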
Chapter 13
Meta Information

13.1 Importance of meta information

The properties obtained from the various components describing the target binary are called meta information. Previous approaches to reverse engineering limited their scope to a particular type of target binary. Meta information significantly reduces or eliminates the need for a special application to reverse engineer each kind of target binary. When constructing a generic reverse engineering architecture it is necessary to design the system so as to minimize the number of custom components required for a particular analysis.

Consider the situation where Libreverse is used to detect which compiler was used for a target binary. If there were no means of providing extra information about a target binary, then there would need to be a unique solution for each type of target binary. Each unique solution would have two steps: detection of the target file type and detection of the compiler used. For example, for Windows PE there would be a component to verify that the target binary was indeed a Windows PE file, and another component to perform compiler detection for compilers that produce Windows PE files. If there were only a few kinds of compilers and target binary file formats, having unique solutions would not be much of an issue. Unfortunately, since new file formats appear every year as operating systems are created or extended, this solution does not scale well. Figure 13.1 shows how, with just three target binary file formats, the problems with the unique-component approach become apparent.

Figure 13.1: Three sets of unique components for handling compiler classification

It is important to note that overlapping steps are performed in each unique component. Each file detector requires a file reader that enables it to gather information from the file; the file detector produces an answer indicating whether it was successful, and this data is passed on to the next component for analysis. In addition, each compiler classifier also requires a file reader to gather information, just like the file detector, plus the specific actions needed to classify the target binary's compiler. This solution can be simplified from three different sets of components down to one. The first component is a file type detector that uses the data (the target binary) to determine its file type; it passes this newly discovered information as meta information to the next component. The second component reads the meta information, determines the algorithm it needs to use, executes the algorithm and passes on the results. This can be seen in Figure 13.2.

Figure 13.2: Generic components for handling compiler classification

13.2 How meta information is exchanged

Meta information is passed along in a meta object within a Data Source, providing components with hints about the kind of binary file being analyzed. These hints allow the designer of a component to alter its behavior for specific actions. It is important to realize that different reverse engineering applications perform similar actions; the only difference between them is the type of binary being analyzed. So rather than having N different kinds of reverse engineering tools, one solution can dynamically change its behavior at run time.

Figure 13.3: Reading meta information from a Data Source

Figure 13.3 shows the steps in obtaining a meta entry. Each Data Source contains one meta object storing all the meta information received so far in the reverse engineering process. A component condenses the meta information obtained from each received Data Source into one meta object before beginning its work. The first step is to call 'get()'
from each of the Data Sources to get a handle on the underlying Data Object. How the Data Object is stored is described in more detail in Chapter 11. Once a component has a handle to the Data Object it can access the meta object, shown in Figure 13.4, for the meta information stored in it. Each entry in a meta object is a key with a matching Meta Item object. The key is used to access the right Meta Item if it is present in the meta object. The Meta Item can represent a variety of different kinds of meta information, as described in Section 13.3. After obtaining the desired meta information the component can use it to customize its behavior.

Figure 13.4: Meta Object class diagram

By decomposing complex operations into a set of components that use meta information to select the correct algorithm for a specific situation, the need for numerous custom components for a particular analysis is eliminated. Libreverse implements this design criterion, enabling the architecture to be easily expanded with new kinds of meta information and components. This is one of its great strengths compared to the specific solutions mentioned in Chapter 2. Each of the specific solutions has common actions requiring specific information to perform its purpose, and these common actions can be decomposed into simple steps that use meta information to alter their behavior. For example, Cifuentes [14] decompiled MS-DOS x86 programs into source files written in the C programming language. Each of the steps performed could be made into simple components that use and provide meta information to alter Libreverse's run-time behavior. It would then be straightforward to extend the work to handle Java class files as input and output them as source files written in the C programming language.

13.3 Meta Items

The meta item class and the child classes which inherit from it were modeled after the way the Java class file format stores attribute information. In the Java class file format a generic base class is used to hide the kind of attributes listed in the file header. If the JVM understands an attribute, identified by a unique tag, the attribute is utilized; otherwise it is silently ignored. The important principle inspired by the Java class format is to hide the kind of meta item objects being passed. A meta object has no knowledge of the kind of meta items it contains; each component in Libreverse uses the appropriate meta item tag to retrieve the meta item from the meta object. This section describes each of the meta item types shown in Figure 13.5.

Figure 13.5: Meta Item class diagram

13.3.1 Arch Type Meta

The Arch Type Meta, shown in Figure 13.6, is produced by the Architecture Type Detector, described in Section 9.4.1, to capture the CPU architecture supported by the target file. The 'm_tag' is the unique identifier, 'm_bit_length' captures whether the architecture is 32- or 64-bit, and 'm_type' gives the supported architecture.

Figure 13.6: Arch Type Meta class diagram

13.3.2 Code Section Meta

The Code Section Meta, shown in Figure 13.7, is produced by the Code Section Detector for 32-bit files, described in Section 9.4.2, to record the beginning address of the code section in a 32-bit target file. The 'm_tag' is the unique identifier, 'm_offset' is the offset of the code section from the beginning of the memory for the target file, and 'm_size' is the length of the section.
Figure 13.7: Code Section Meta class diagram

13.3.3 Code Section 64 Meta

The Code Section 64 Meta, shown in Figure 13.8, is produced by the Code Section Detector for 64-bit files, described in Section 9.4.2, to record the beginning address of the code section in a 64-bit target file. The 'm_tag' is the unique identifier, 'm_offset' is the offset of the code section from the beginning of the memory for the target file, and 'm_size' is the length of the section.

Figure 13.8: Code Section 64 Meta class diagram

13.3.4 Compiler Type Meta

The Compiler Type Meta, shown in Figure 13.9, is produced by the Compiler Classifier for 32-bit files, described in Section 9.4.11, to record the compiler used to create the target file. The 'm_tag' is the unique identifier and 'm_compiler_id' is the unique id describing the detected compiler.

Figure 13.9: Compiler Type Meta class diagram

13.3.5 Data Section Meta

The Data Section Meta, shown in Figure 13.10, is produced by the Data Section Detector for 32-bit files, described in Section 9.4.3, to record the beginning address of the data section in a 32-bit target file. The 'm_tag' is the unique identifier, 'm_offset' is the offset of the data section from the beginning of the memory for the target file, and 'm_size' is the length of the section.

Figure 13.10: Data Section Meta class diagram

13.3.6 Data Section 64 Meta

The Data Section 64 Meta, shown in Figure 13.11, is produced by the Data Section Detector for 64-bit files, described in Section 9.4.3, to record the beginning address of the data section in a 64-bit target file. The 'm_tag' is the unique identifier, 'm_offset' is the offset of the data section from the beginning of the memory for the target file, and 'm_size' is the length of the section.

Figure 13.11: Data Section Meta class diagram

13.3.7 Entry Point Meta

The Entry Point Meta, shown in Figure 13.12, is produced by the Entry Point Detector for 32-bit files, described in Section 9.4.4, to record the list of addresses describing the entry points in a 32-bit target file. The 'm_tag' is the unique identifier and 'm_entry_points' is the list of entry points.

Figure 13.12: Entry Point Meta class diagram

13.3.8 Entry Point 64 Meta

The Entry Point 64 Meta, shown in Figure 13.13, is produced by the Entry Point Detector for 64-bit files, described in Section 9.4.4, to record the entry point addresses in a 64-bit target file. The 'm_tag' is the unique identifier and 'm_entry_points' is the list of entry points.

Figure 13.13: Entry Point Meta class diagram

13.3.9 File Type Meta

The File Type Meta, shown in Figure 13.14, is produced by the File Type Detector, described in Section 9.4.6, to record the file type of the target file. The 'm_tag' is the unique identifier, 'm_bit_length' is the bit length of the target file, and 'm_type' is the description of the file type.

Figure 13.14: File Type Meta class diagram

13.3.10 Tevis Unknown Region Meta

The Tevis Unknown Region Meta, shown in Figure 13.15, is produced by the Tevis Unknown Region Checker, described in Section 9.4.9, to record the regions in memory of a target file that are not described in the file headers. The 'm_tag' is the unique identifier and 'm_unknowns' is the list of unknown regions found in the file.

Figure 13.15: Tevis Unknown Region Meta class diagram

13.3.11 Tevis Zero Filled Meta

The Tevis Zero Filled Meta, shown in Figure 13.16, is produced by the Tevis Zero Filled Checker, described in Section 9.4.10, to record the areas in memory of the target file that have more than 50 consecutive zero bytes.
The 'm_tag' is the unique identifier and 'm_zero_filled' is the list of regions found in the file that have more than 50 consecutive zeros.

Figure 13.16: Tevis Zero Filled Meta class diagram

Chapter 14
Scenarios

In order to best understand the impact of this research, two scenarios are used to provide two different points of view. The problem with merely presenting the research is that it does not connect the reader with the vision behind it; it fails to provide the extra information that lives inside the mind of the researcher.

14.1 Non-expert and Expert user

The non-expert user of Libreverse is a 15-year veteran of a federal agency. His job in this division is to take suspicious programs off suspects' computers and attempt to classify the programs for other investigators. Due to a rash of new white-collar crimes the department is looking for an automated method to assist the non-expert user. The department wants a tool that can report which compiler was used and whether any compression utility was applied to the file.

The expert user of Libreverse is a 20-year computer security researcher at a defense contracting company who is working with the federal agency on an assignment. The expert user has been using Libreverse for years and now wants to add a new compiler classification. The expert user is known as an expert in the field of compression and compiler classification using artificial intelligence (AI). The expert user will configure Libreverse so that the non-expert user can perform their task without requiring extensive education.

14.2 Compressed Executables and Compiler Classification

In this problem the novice user utilizes the application supplied by the expert user to process each suspect file. The novice user makes no assumptions about the suspect files found on a suspect's computer; they rely on Libreverse to process each one. From their point of view they use a program supplied by the expert user to process a set of files. This application interacts with Libreverse to perform the necessary analysis for each file. Once the novice user starts the application their interaction with Libreverse is done; all they have to do is wait for the result.

In the arrangement of components there is a component that classifies the compiler used to create the suspect file. The compiler classification component uses the original file name to determine the compiler used. The result of this component is new meta information stating the id of the compiler that created the suspect file. This meta information can be utilized by other components to alter their behavior. In the novice user's case the output information is used by his application to move the suspect file into a preset directory. In a similar way the compressed executable detector provides meta information on whether any sort of compression was applied to the file; this information is used to place the suspect file into a preset directory for further analysis.

During a recent investigation into software piracy the expert user discovered a binary executable with a new type of compression. Further investigation of the suspect computer uncovered the new compression utility. The expert user will take the new compression program and compress a predefined set of test programs, after which they will collect the necessary attribute input data into a file for classification and rerun the optimizer.
The resulting attributes used to classify between the various compression methods will be used to update the compressed executable detector component.

In addition, the expert user wants to expand Libreverse's ability to detect a new compiler he has found on a suspect's computer running Windows Vista. Libreverse's compiler classification has no knowledge of this new compiler. At this point the expert user has two choices: create a whole new program that can classify this new compiler along with those Libreverse already knows, or take the same steps to classify this new compiler as were taken for the previous compilers that produce Windows PE files. The former method seems to be the route a lot of reverse engineering projects take: a specific program is created for a specific problem, and the downside of this approach is that the program cannot be reused.

Since the expert user values their time, they know that all they have to do is compile a set of test programs with the new compiler. The expert user updates the trainer_dump application from Libreverse to read the attributes used for classification from the new set of test programs. The extracted information is placed in the same directory as the other input data collected for compilers producing Windows PE executables. After executing trainer_dump on the new files they update the grnn_optimizer to take advantage of this new data. At this point the expert user can run the grnn_optimizer to discover which attributes are necessary to distinguish the previously known Windows compilers from the newly discovered compiler. The expert user takes this information and updates the compiler classification component.

14.3 Behind the scenes

The novice user's analysis program communicates with Libreverse via the execute function in the API. The analysis program passes the string for the path to the target file along with the suspected input type (BINARY) and the desired output type (CPLUSPLUS). It does not use any of the tracing parameters, since those are mainly used for debugging Libreverse. The expert user uses the same analysis program as the novice user; the only differences are that the expert user requires different output (e.g. UML) and uses the tracing parameters so he can follow Libreverse as it exercises his new compiler classification additions.

For each user the Configurator takes the given input and output types and looks up the appropriate formula list in the master formula map. The master formula map is created from the contents of the master formula file stored on the system. The master formula file used for this case study is shown in Listing 14.1.
The input formula, binary_RTL.xml, used by the non-expert user is shown in XML format in Listing 14.2 and graphically in Figure 14.1. The input formula used by the expert user is shown in Listing 14.3 and graphically in Figure 14.2.

Listing 14.2: Case study input formula file for non-expert user (binary_RTL.xml)
  file_type_detector
  memory_map_producer
  compiler_classifier
  arch_detector
  control_flow_producer
  rtl_converter

Listing 14.3: Case study input formula file for expert user (binary_RTL_2.xml)
  file_type_detector
  memory_map_producer
  compiler_classifier
  arch_detector
  compressed_executable_unpacker
  control_flow_producer
  rtl_converter

Figure 14.1: Graphical view of binary_RTL.xml
Figure 14.2: Graphical view of binary_RTL_2.xml

The configuration system obtains the correct formula list from the master map using the input keys for the type of input and desired output (e.g. BINARY, CPLUSPLUS). It parses the binary_RTL.xml file for the non-expert user to return the component graph of the components listed in the XML file, whereas for the expert user the configurator grabs the formula list from the master map using the keys BINARY, UML and parses the binary_RTL_2.xml file, shown in Figure 14.2, to return his component graph.

Each will utilize the same formula file, seen in Listing 14.4 and Figure 14.3, for the analysis phase of the reverse engineering process. The input to the analysis formula map will be the results of the input formula map held by its last component. The first component to receive the data is the dead register elimination component. Its results are passed on to the dead code elimination component, and so on. The end result is a CFG containing all the information in a higher-level format more closely resembling source code.

The output formula for the non-expert user is described in Listing 14.5 and shown in Figure 14.4. The output formula for the expert user is described in Listing 14.6 and shown in Figure 14.5. The output from the analysis formula map will be the input to the output formula map. The purpose of this stage is to convert the contents of the CFG into the meaningful format requested by the user. In the case of the non-expert user this output will be source files written in the C++ programming language. For the expert user the output will be files describing the CFG contents using UML.

Listing 14.4: Case study analysis formula file for the expert user (decompiling_analysis.xml)
  dead_register_elimination
  dead_code_elimination
  condition_code_propagation
  register_argument_identification
  function_return_register
  register_copy_propagation
  data_type_propagation

Figure 14.3: Graphical view of decompiling_analysis.xml

Listing 14.5: Case study output formula file for non-expert user (cplusplus_output.xml)
  cpp_writer

Figure 14.4: Graphical view of cpp_writer.xml

Listing 14.6: Case study output formula file for expert user (uml_output.xml)
  uml_writer

Figure 14.5: Graphical view of uml_writer.xml
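The listings above give only the ordered component names; the surrounding markup, and any identifiers or links needed to express the component graph described in Chapter 9, belong to Libreverse's own formula schema. Purely as an illustration, with assumed element names (formula, components, component, name) and with the graph reduced to a simple ordered list, the expert user's input formula of Listing 14.3 might be written along these lines:

  <formula>
    <components>
      <component name="file_type_detector"/>
      <component name="memory_map_producer"/>
      <component name="compiler_classifier"/>
      <component name="arch_detector"/>
      <!-- extra component inserted by the expert user for testing -->
      <component name="compressed_executable_unpacker"/>
      <component name="control_flow_producer"/>
      <component name="rtl_converter"/>
    </components>
  </formula>

Under this sketch the non-expert user's binary_RTL.xml would be identical except that it omits the compressed_executable_unpacker entry, which is exactly the kind of small edit the expert user makes when trying out a new component.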
An overview of how all the formulas fit together is shown in Figures 14.6 and 14.7.

Figure 14.6: Graphical view of Non-Expert User Component Graphs
Figure 14.7: Graphical view of Expert User Component Graphs

Chapter 15
Conclusion and Future Work

15.1 Conclusion

In this research we have shown a generic reverse engineering system that is flexible, scalable and able to alter its behavior based on information detected during the reverse engineering process. In addition this research has shown that it is possible to detect the compiler used to create a binary computer application, regardless of the programming language used, through information gathered from the binary file format by classifying against profiles from a list of available compilers. This work showed that it is possible to classify the compiler used to create an application with 99.88% accuracy for Linux ELF, 99.88% accuracy for Windows PE and 98.53% accuracy for Java Class.

15.2 Future Work

Although we have accomplished a lot during this research, the following features are possible for future expansion of Libreverse.

15.2.1 Compiler Detection

The compiler detection needs to be expanded to investigate the idea that it is possible to identify the compiler used by looking at a stream of instructions. This method of compiler detection would be very helpful in the situation where the executable was compressed and/or encrypted after it was compiled. Information in the header might make it difficult or impossible to detect the compiler if key information is altered too much.

15.2.2 Components

In order to show that a generic architecture, like Libreverse, is equivalent to a specific solution it must perform all the actions of the specific solution. New components will need to be written to perform the actions done in the work by Cifuentes [14], Tevis [59], Mycroft [44] and others.

Cifuentes used Register Transfer Lists (RTL) as the intermediate form to represent the code being reverse engineered. Future work will have to first reproduce the original work using RTL but also investigate whether the approach of each component described in Table 15.1 works for other intermediate forms. Table 15.1 describes further components that will be researched and incorporated into the Libreverse reverse engineering architecture.

Tevis conducted his research using the Windows PE format. The Linux ELF and Mac OS X Mach-O file formats are loaded into memory by the operating system like the Windows PE file format. The same methods performed in Tevis [59] should be investigated to see if the same potential security vulnerabilities exist in these other formats. Table 15.2 describes further components that will be researched and incorporated into the Libreverse reverse engineering architecture.

Mycroft [44] investigated a static analysis method of reconstructing the types of registers in target code represented in SSA form. Cifuentes [15] reports that "In von Neumann machines, code and data are represented in the same way, hence making it impossible to distinguish code from data statically". Therefore it will be necessary to investigate dynamic means to reconstruct the types of registers in a program. Table 15.3 describes further components that will be researched and incorporated into the Libreverse reverse engineering architecture.
Table 15.1: Future optimization components from Cifuentes
  Dead Register Elimination: Identifies and removes dead registers in the basic blocks of a control flow graph.
  Dead Condition Code Elimination: Identifies and removes condition codes from a program if they are never used.
  Condition Code Propagation: Each machine instruction sets flags in the CPU when it is executed. Therefore where the flags are used and defined must be identified for each machine instruction and used as part of the reverse engineering process.
  Register Arguments: Identify registers that contain arguments passed to each subroutine.
  Function Return Registers: Identify the register used to return values from a subroutine.
  Register Copy Propagation: Identify temporary registers that are used to hold the contents of other registers. Replace the temporary register name with the original register name.
  Actual Function Parameters: Parameters to a function call are either placed into registers or put onto the stack. This component will identify the parameters and place them in the correct order for the function call.
  Data Type Propagation Across Procedure Calls: When the type of a register is identified it is necessary to propagate that type to every place the register is used.
  Register Variable Elimination: Replace all remaining registers output from the Register Copy Propagation component with variable names.

Table 15.2: Future optimization components from Tevis
  Detecting Anomalies in the Target File: Locate anomalies in the file (e.g. the size of a table stated in a header differing from the actual size in the file). These may indicate the target file was tampered with.
  Detect Writable and Executable Sections: Locate sections in memory that are both writable and executable.
  Detecting Vulnerable Library Functions: Identify functions that are known to be vulnerable to attack.

Table 15.3: Future optimization components from Mycroft
  Type Reconstruction: Dynamically reconstruct types for registers contained in the instructions of the basic blocks of a control flow graph.

15.2.3 Control Flow Graph Generation

A stream of assembly code output from a disassembler, or processed assembly output from an application like IDA Pro, is still difficult to read. Libreverse needs to be extended to contain algorithms that can take the input instructions and produce a control flow graph that can aid in the understanding of a target file. These algorithms need to be able to identify all branch types supported by the input instructions and produce the necessary control flow graph for future analysis. In addition, rather than using a two-step process of converting input instructions to assembly and then to an intermediate form, the conversion process should be able to convert from input instructions directly into the intermediate form. The type of intermediate form should be an option that the researcher can select as part of the formula file.

15.2.4 Intermediate Forms

Various researchers use different intermediate forms to perform their analysis (e.g. RTL or SSA). Libreverse needs to be able to represent instructions in assembly language, Register Transfer Language (RTL) and Static Single Assignment (SSA) form in order to give the researcher a wide choice of methods to represent the target instructions for analysis. By supporting a number of different intermediate forms, further research on the components described in Section 15.2.2 will be required to see if a component's effectiveness is dependent upon the format of its data (e.g. RTL vs SSA).
15.2.5 Configurator

The Configurator part of the infrastructure of Libreverse needs to be expanded to allow for a method to select the data source to use, along with new kinds of data sources to transfer information between components.

The Configuration Data class provides the data structure for storing the configuration information. It allows the setting and retrieval of the transfer configuration type (e.g. Memory), the directory where the formula files are stored, and the DLLs. This data structure allows for additional information to be added while keeping the Configurator class free of the accessor methods.

The Configurator will be changed to get the type of data source it should use from an XML file conforming to the schema in Listing 15.1. The limitation is that only one transfer type is enforced by the schema, which allows only one transfer type element in the XML file. It can either be a file prefix used in the file name for the output of each Component, a database to which data is stored, or in memory. An illustrative sketch of such a schema is given below, following Section 15.2.8.

Listing 15.1: XML schema for configuration data

15.2.6 Data Source

Two new data sources will be created that allow for the storing of data in an XML file or through a database. The File data source will store the contents in an XML format using a unique file name. The administrator of Libreverse will use the Configurator's XML file to define a file prefix to be used. Each file name will be composed of the prefix plus the unique id of the component. A new XML parser will have to be created to read in the file when it is requested. The Database data source will store the contents in a database that was defined by the administrator of Libreverse in the Configurator's XML. A database schema will have to be developed for storing Libreverse's data containers in the database. In addition an object responsible for creating the data source from the database information, or writing to it, will also have to be constructed.

15.2.7 Compressed Executables

There are a number of opportunities to improve upon compressed executable classification to make it more palatable for general use in the field of Computer Forensics:

- The classifier should be able to distinguish between a larger number of compression methodologies.
- These strategies could easily be applied to other operating systems (besides Windows) and file formats, provided that similar representative information exists for those formats as well.
- A separate classifier could possibly be created to further distinguish between types of encryption based on system properties, as demonstrated possible for compression type in Experiment II.
- The methodologies demonstrated here could be implemented in a file analysis framework, for the purpose of computer defense.
- Other classifiers (Radial Basis Function Networks, Feed-Forward Neural Networks, Naïve Bayesian Classifiers) could be implemented and compared with the GRNN.
- Evolutionary Computation could be utilized to create a "White Hat Attacker" which attempts to evolve malicious files which have the signature appearance of non-self-modifying files. The GRNN could then attempt to learn from these "tricky" instances and adapt appropriately by modifying its training data.

15.2.8 Compiler Classification

- Expand the work to distinguish between versions of compilers.
- Expand the work to additional file types (e.g. Mac OS X Mach-O and .NET files).
- Explore alternative machine learning methods, particularly those with less overhead.
- Explore identification of compilers using a stream of instructions.
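Returning to the Configurator changes of Section 15.2.5: the real schema is the one given in Listing 15.1, so the following is only a minimal sketch of what a configuration schema of that shape could look like. Every element name here (configuration, memory, file_prefix, database, formula_directory, dll_directory) is an assumption for illustration rather than Libreverse's actual vocabulary; the one structural point the sketch tries to capture is that a choice construct permits exactly one transfer type element per configuration file.

  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="configuration">
      <xs:complexType>
        <xs:sequence>
          <!-- Exactly one transfer type: in memory, file prefix, or database -->
          <xs:choice>
            <xs:element name="memory"/>
            <xs:element name="file_prefix" type="xs:string"/>
            <xs:element name="database" type="xs:string"/>
          </xs:choice>
          <!-- Directory holding the formula files -->
          <xs:element name="formula_directory" type="xs:string"/>
          <!-- Directory holding the component DLLs -->
          <xs:element name="dll_directory" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

Whatever names the real schema uses, it is the choice construct that enforces the one-transfer-type limitation described above, and the chosen element is what tells the Configurator which data source of Section 15.2.6 to construct.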
15.2.9 XML

At present the XML library used for parsing the XML files used by Libreverse does not validate them against the supplied schema. It is important that the XML files be correct before using them in a production environment. Therefore a method for validating the XML files before they are used needs to be found.

15.3 Final Thoughts

So long as there are people attempting to circumvent security controls through malicious software there will be a need to have reverse engineering tools available. These tools will need to be easily adapted to handle the volume of malicious software. They will need to be easily used by the security professional without requiring extensive education in the techniques of reverse engineering. Libreverse provides this necessary flexibility and usability to the non-expert user for processing malicious software as well as a research platform for the expert user. The future of reverse engineering lies in the hands of the dedicated security professionals and researchers protecting our information infrastructures by understanding the software that attacks them and identifying vulnerabilities in the infrastructure software. Reverse engineering will be one line of defense in this constantly shifting battlefield.

Bibliography

[1] J. Allen, A. Christie, W. Fithen, J. McHugh, J. Pickel, and E. Stoner. State of the Practice of Intrusion Detection Technologies. Technical report, 2000. http://www.sei.cmu.edu/publications/documents/99.reports/99tr028/99tr028abstract.html. Last accessed on 9/23/2008.
[2] R. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[3] J. Bergeron, M. Debbabi, J. Desharnais, M. Erhioui, Y. Lavoie, and N. Tawbi. Static detection of malicious code in executable programs. In Proceedings of the International Symposium on Requirements Engineering for Information Security, 2001.
[4] J. Bergeron, M. Debbabi, M. M. Erhioui, and B. Ktari. Static Analysis of Binary Code to Isolate Malicious Behaviors. In IEEE 8th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pages 184–189, 1999.
[5] P. Biron and A. Malhotra. XML Schema Part 2: Datatypes Second Edition. Technical Report REC-xmlschema-2-20041028, World Wide Web Consortium (W3C), 2004.
[6] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, and J. Cowan. Extensible Markup Language (XML) 1.1 - W3C Recommendation 16 August 2006. Technical Report REC-xml11-20060816, World Wide Web Consortium (W3C), 2004.
[7] W. Britt, S. Gopalaswamy, J. Hamilton, G. Dozier, and K. Chang. Computer Defense Using Artificial Intelligence. In SpringSim '07: Symposium on Simulation Software Security, pages 378–386, 2007.
[8] A. Buckley. JSR 202: Java Class File Specification Update, 2006. http://jcp.org/en/jsr/detail?id=202. Last accessed on 9/23/2008.
[9] G. Canfora, A. Cimitile, U. De Carlini, and A. De Lucia. An extensible system for source code analysis. IEEE Transactions on Software Engineering, 24:721–740, 1998.
[10] G. Caprino. REC: Reverse engineering compiler, 2007. http://www.backerstreet.com/rec/rec.htm. Last accessed on 9/23/2008.
[11] E. Chikofsky and J. Cross II. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13–17, 1990.
[12] M. Christodorescu and S. Jha. Testing malware detectors. In International Symposium on Software Testing and Analysis (ISSTA), pages 34–44, 2004.
[13] Z. Chuan, L. Xianliang, H. Mengshu, and Z. Xu.
A LVQ-based neural network anti-spam email approach. SIGOPS Operating Systems Review, 39(1):34–39, 2005.
[14] C. Cifuentes. An environment for the reverse engineering of executable programs. In APSEC, pages 410–419, 1995.
[15] C. Cifuentes, T. Waddington, and M. Van Emmerik. Computer security analysis through decompilation and high-level debugging. In Proceedings of the Working Conference on Reverse Engineering (WCRE), pages 375–380. IEEE Press, 2001.
[16] E. M. Clarke and E. A. Emerson. Design and synthesis of synchronization skeletons using branching time temporal logic. In Logic of Programs, Proceedings of Workshop, volume 131 of Lecture Notes in Computer Science, pages 52–71. Springer, 1981.
[17] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method of computing static single assignment form. In POPL '89: Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 25–35, New York, NY, USA, 1989. ACM.
[18] L. Davis. Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, 1991.
[19] Un4seen Developments. Petite Win32 executable compressor, 2005. Published online at http://www.un4seen.com/petite. Last accessed on 6/20/2008.
[20] S. K. Dogra. Normalization, 2008. Published online at http://www.qsarworld.com/qsar-statistics-normalization.php. Last accessed on 8/14/2008.
[21] M. Van Emmerik and T. Waddington. Using a decompiler for real-world source recovery. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE '04), pages 27–36. IEEE Computer Society, 2004.
[22] A. Engelbrecht. Computational Intelligence. John Wiley & Sons, Ltd, West Sussex, England, 2002.
[23] D. Eriksson. Designing an object-oriented decompiler - decompilation support for Interactive Disassembler Pro. Master's thesis, Blekinge Institute of Technology, June 2002.
[24] M. Ernst. Static and dynamic analysis: Synergy and duality. In Workshop on Dynamic Analysis 2003: ICSE Workshop on Dynamic Analysis, pages 24–27, Portland, OR, 2003.
[25] D. Fallside and P. Walmsley. XML Schema Part 0: Primer Second Edition. Technical Report REC-xmlschema-0-20041028, World Wide Web Consortium (W3C), 2004.
[26] S. Forrest and M. Mitchell. Relative building-block fitness and the building-block hypothesis. In L. Darrell Whitley, editor, Foundations of Genetic Algorithms 2, pages 109–126. Morgan Kaufmann, San Mateo, CA, 1993.
[27] D. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1989.
[28] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, New Jersey, 2nd edition, 1999.
[29] J. Hoenicke. Java Optimize and Decompile Environment (JODE), 2006. Available at http://jode.sourceforge.net. Last accessed on 9/23/2008.
[30] S. Jarzabek and T. Keam. Design of a generic reverse engineering assistant tool. In Proceedings of the 2nd Working Conference on Reverse Engineering, pages 61–70. IEEE, July 1995.
[31] W. Jing-xin, W. Zhi-ying, and D. Kui. A network intrusion detection system based on the artificial neural networks. In InfoSecu '04: Proceedings of the 3rd International Conference on Information Security, pages 166–170, New York, NY, USA, 2004. ACM Press.
[32] K. Kapil and V. Kumar. Analysis of binary programs. http://www.cse.iitk.ac.in/report-repository/2005/Y1189_Y1392_BTP_Report.ps.
Last accessed on 9/23/2008, April 2005.
[33] J. Kinder, S. Katzenbeisser, C. Schallhart, and H. Veith. Detecting malicious code by model checking. In Conference on Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA '05), volume 3548 of Lecture Notes in Computer Science, pages 174–187. Springer, July 2005.
[34] P. Kouznetsov. JAD - the fast JAva Decompiler, 2001. http://www.kpdus.com/jad.html. Last accessed on 9/23/2008.
[35] G. Krol. Boomerang, 2006. http://boomerang.sourceforge.net/index.php. Last accessed on 9/23/2008.
[36] S. Kumar. DisC: Decompiler for TurboC, 2003. Available at http://www.debugmode.com/dcompile/disc.htm. Last accessed on 9/23/2008.
[37] Iczelion. Tutorial 6: Import table, 2008. Published online at http://win32assembly.online.fr/pe-tut6.html. Last accessed on 9/24/2008.
[38] Microsoft. PRB: Types of thunking available on Win32 platforms, 2005. Published online at http://support.microsoft.com/kb/125710. Last accessed on 9/24/2008.
[39] Microsoft. Visual Studio, Microsoft Portable Executable and Common Object File Format (Rev. 8.0), 2006. http://www.microsoft.com/whdc/system/platform/firmware/PECOFF.mspx. Last accessed on 9/23/2008.
[40] Microsoft. Windows Authenticode portable executable signature format, 2008. Published online at http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Authenticode_PE.docx. Last accessed on 9/24/2008.
[41] J. Miecznikowski and L. Hendren. Decompiling Java using staged encapsulation. In WCRE '01: Proceedings of the Eighth Working Conference on Reverse Engineering, page 368, Washington, DC, USA, 2001. IEEE Computer Society.
[42] A. Mori, T. Izumida, T. Sawada, and T. Inoue. A tool for analyzing and detecting malicious mobile code. In ICSE, pages 831–834, 2006.
[43] S. Mukkamala, A. Sung, and A. Abraham. Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools, 2005. http://citeseer.ist.psu.edu/735221.html. Last accessed on 9/23/2008.
[44] A. Mycroft. Type-based decompilation (or program reconstruction via type reconstruction). In 8th European Symposium on Programming, ESOP '99, pages 208–224, 1999.
[45] A. Neshkov. DJ Java Decompiler, 2007. http://members.fortunecity.com/neshkov/dj.html. Last accessed on 9/23/2008.
[46] M. F. Oberhumer and L. Molnár. The Ultimate Packer for eXecutables (UPX), 2005. Published online at http://upx.sourceforge.net. Last accessed on 6/20/2008.
[47] M. Pietrek. Programming for 64-bit Windows, 2000. Published online at http://msdn.microsoft.com/en-us/magazine/bb985017.aspx. Last accessed on 9/24/2008.
[48] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.
[49] H. Schorr. Computer-Aided Digital System Design and Analysis Using a Register Transfer Language. IEEE Transactions on Electronic Computers, pages 730–737, 1964.
[50] M. Schultz, E. Eskin, E. Zadok, and S. Stolfo. Data Mining Methods for Detection of New Malicious Executables. In IEEE Symposium on Security and Privacy, pages 38–49, 2001.
[51] A. Shulga. Andromeda Decompiler, 2005.
[52] J. Siek, L. Lee, and A. Lumsdaine. The Boost Graph Library: User Guide and Reference Manual (C++ In-Depth Series). Addison-Wesley Professional, 2001. pp. 213, 277–278.
[53] D. Specht. A General Regression Neural Network. IEEE Transactions on Neural Networks, 2:568–576, 1991.
[54] R.
Stallman. Personal communication, January 24, 2008 email.
[55] A. Sung and S. Mukkamala. Identifying important features for intrusion detection using support vector machines and neural networks. In Symposium on Applications and the Internet (SAINT), pages 209–217, 2003.
[56] Symantec. Symantec Security Response - Definitions Added [Online], 2006. http://www.symantec.com/business/security_response/index.jsp. Last accessed on 9/23/2008.
[57] K. Tan. The application of neural networks to UNIX computer security. In Proc. Int. Conf. Neural Networks, ICNN, pages 476–481. IEEE Computer Society, 1995.
[58] G. Tesauro, J. Kephart, and G. Sorkin. Neural Networks for Computer Virus Recognition. In IEEE Expert, volume 11, pages 5–6, 1996.
[59] J. Tevis. Automatic Detection of Software Vulnerabilities in Executables. PhD thesis, Auburn University, 2005.
[60] J. Tevis and J. Hamilton Jr. Static analysis of anomalies and security vulnerabilities in executable files. In ACM-SE 44: Proceedings of the 44th Annual Southeast Regional Conference, pages 560–565, New York, NY, USA, 2006. ACM Press.
[61] H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures Second Edition. Technical Report REC-xmlschema-1-20041028, World Wide Web Consortium (W3C), 2004.
[62] Unknown. Executable and Linkable Format. http://www.x86.org/ftp/manuals/tools/elf.pdf. Last accessed on 9/24/2008.
[63] H. van Vliet. Mocha - the Java decompiler, 1996. Available at http://www.brouhaha.com/~eric/software/mocha. Last accessed on 9/23/2008.
[64] A. Vasudevan and R. Yerraballi. Cobra: Fine-grained malware analysis using stealth localized-executions. In SP '06: Proceedings of the 2006 IEEE Symposium on Security and Privacy, pages 264–279, Washington, DC, USA, 2006. IEEE Computer Society.
[65] W3C. Extensible Markup Language (XML), 2008. Available at http://www.w3.org/XML/. Last accessed on 9/23/2008.
[66] M. Weber, V. Shah, and C. Ren. A Case Study in Detecting Software Security Vulnerabilities Using Constraint Optimization. In SCAM, pages 3–13, 2001.
[67] T. Welch. A Technique for High-Performance Data Compression. IEEE Computer, 17(6):8–19, 1984.
[68] D. Wheeler. Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)?, 2007. http://www.dwheeler.com/oss_fs_why.html. Last accessed on 9/23/2008.
[69] T. Yetiser. Polymorphic Viruses, Implementation, Detection, and Protection, 1993. http://vx.netlux.org/lib/ayt01.html. Last accessed on 9/23/2008.