Secdoop: A Confidentiality Service on Hadoop Clusters

by

James Holland Majors

A thesis submitted to the Graduate Faculty of Auburn University
in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 9, 2011

Keywords: Hadoop, Cryptography, MapReduce, Distributed Computing

Copyright 2011 by James Holland Majors

Approved by
Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
Wei-Shinn Ku, Assistant Professor of Computer Science and Software Engineering
Hari Narayanan, Professor of Computer Science and Software Engineering

Abstract

The MapReduce model has proven to be an effective way to demonstrate academic research while using distributed file systems. The MapReduce programming model was introduced by Google in 2004 [1]. MapReduce has proven to be a good solution for large data sets requiring intensive processing. Hadoop, an open-source Java implementation of MapReduce, was created by Yahoo in 2007. Industries that deal with sensitive data at large scales are hesitant to embrace a processing solution that distributes their sensitive data. Cryptography is often used to protect sensitive data, but it is computing intensive, often making it undesirable as a solution. Utilizing cryptography while distributing the processing over a trusted cluster improves the overall runtime, making it an excellent solution for large data sets that are sensitive in nature [2][4]. In this paper, we describe two applications that distribute the cryptographic process over a trusted cluster. The first application handles encryption of an input file that will be placed inside the Distributed File System (DFS). The second application handles decryption of an input file that is located on the DFS. These two applications demonstrate the effect of utilizing cryptography while distributing processing over a Hadoop cluster. Our application features the use of cryptography in parallel on a cluster of commodity machines. Our experimental results show the performance increase when dealing with large sets of data.

Acknowledgments

A special thanks to Dr. Xiao Qin for constant collaboration on this research. Thanks to Shu Yin for assistance with the figures created from our collected data. Thanks to Zhiyang Ding for assistance with troubleshooting the RecordReader problems. Thanks to Xiajun Ruan for assistance with troubleshooting cryptography implementations. A special thanks to Gregory Poirier for assistance with troubleshooting, data collection, and the automation of testing. Thanks to Riley Spahn for assistance with validation.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1  Introduction
2  Related Work
3  Motivation And Background
   3.1  Hadoop
   3.2  Secdoop Behavior
4  Research Overview
   4.1  Project Description
   4.2  Assumptions
   4.3  Design Issues
        4.3.1  Secdoop Design
        4.3.2  Generation of Plaintext for Testing
        4.3.3  Testing and Validation
   4.4  Implementation Details
        4.4.1  Hadoop Application Behavior
        4.4.2  Secdoop Application Details
5  Experiments
   5.1  Experimental Design
   5.2  Experimental Setup
   5.3  Experimental Results
6  Conclusion
7  Future Work
Bibliography
Appendices
A  Secdoop Subversion Information
B  Example Mapper Class
C  Configuration File Example

List of Figures

3.1  Demonstration of How Secdoop Interacts with Hadoop and HDFS
5.1  Single Machine Encrypt Test Runtimes with All Algorithms
5.2  Ten Node Cluster Encrypt Test Runtimes with All Algorithms
5.3  AES Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine
5.4  DES Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine
5.5  DESede Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine

List of Tables

5.1  Average Encrypt Runtimes on a Single Machine in Seconds
5.2  Average Encrypt Runtimes on a Ten Node Cluster in Seconds
5.3  AES Encrypt Runtime Comparisons between the Maximum Number of Map Tasks
5.4  DES Encrypt Runtime Comparisons between the Maximum Number of Map Tasks
5.5  DESede Encrypt Runtime Comparisons between the Maximum Number of Map Tasks

List of Abbreviations

DFS     Distributed File System
JFS     Java File System
NFS     Network File System
MPI     Message Passing Interface
PVFS    Parallel Virtual File System
JCE     Java Cryptographic Extension
HDFS    Hadoop Distributed File System
AES     Advanced Encryption Standard
DES     Data Encryption Standard
DESede  Triple Data Encryption Standard

Chapter 1
Introduction

MapReduce is a programming model, proposed by Google, based on the Lisp commands "Map" and "Reduce". Implementations of this model are commonly used for processing large data sets [1].
As the size of available storage devices increases in conjunction with the decrease in prices for the devices, it is more cost effective for industries to store larger sets of data. Also, as the complexity of data systems increases, so does the size of data stored by industrial applications. These industries are challenged with the need to efficiently analyze the terabytes or even petabytes of information produced by these applications. There is a need for a solution that allows larger amounts of data to be analyzed and the computation to be distributed among multiple machines to increase processing efficiency. With terabyte hard drives becoming more and more cost effective, businesses can afford to host their own clusters as more open-source products arrive on the scene to assist programmers in their customization efforts. Hadoop is an open-source Java implementation of the MapReduce programming model that can be set up on a wide variety of hardware platforms [2][3][4].

There has been a great deal of research that focuses on the performance of Hadoop. In May of 2009, it was announced that a team at Yahoo! sorted one terabyte of information in 62 seconds [5]. Hadoop has been a proven tool that is used by many of the major companies in the industry, such as Yahoo!, Amazon, and Facebook, along with many major universities. The open-source availability makes this a common platform for developers to customize applications to solve unique challenges.

The inheritance provided in Java allows developers to override the Map and Reduce interfaces to tailor the application to their customized needs. The developer can also customize input formats, output formats, and record readers to handle the input splits in an optimal fashion. Application customization is facilitated through the use of Hadoop and Java libraries. Hadoop allows developers an easy way to write parallel programs with minimal code writing. Hadoop configurations are automatically set up to manage the communications between different nodes in the cluster as well as data replication and data storage management.

Once the problem of distribution is addressed through the use of MapReduce, the issue of security arises and is a major consideration in the viability of using distribution as a solution for a company. Whenever there are distributed communications, there is always an issue of privacy that must be considered. Security solutions are very process intensive, but the speed of process distribution opens the door for multiple solutions which have not been feasible in the past. Companies can reduce their security risks through the use of in-house trusted clusters. By having an in-house trusted cluster [6][7][8], the company knows where its data resides in the cloud. Companies can set up firewalls or security policies for their trusted clusters in order to implement whatever security level is warranted for their data. These companies may choose options such as totally blocking user access from outside their network, or limiting access by use of authentication incorporating key management and other security policies.

Chapter 2
Related Work

There have been a multitude of cryptographic file systems that address data confidentiality. The Java File System (JFS) [9] is a distributed solution that works on heterogeneous operating systems.
The JFS is set up as a client/server architecture and is a reasonable solution when the Network File System (NFS) [10] is not feasible; however, this solution does not allow for parallel computing over a cluster.

The Message Passing Interface (MPI) is a useful tool when dealing with parallel computing [11][12][13]. MPISec is an implementation of cryptographic services using MPI [14]. MPISec uses the Parallel Virtual File System (PVFS) and does not employ any type of data replication. The lack of data replication becomes a bottleneck when different nodes in the cluster process segments of the file that aren't located locally. Hadoop replicates data over multiple nodes, thus eliminating the bottleneck, which makes it a good solution. MPI is challenged with managing which processes are allowed access to different parts of the file. The Hadoop namenode, also known as the master, handles the details of access control between map tasks, and Hadoop application developers can rely on the default configuration of the cluster without any extra code added. This reduces complexity for application developers and allows easy implementation of parallel programs.

Biodoop was developed as a framework built on Hadoop because bioinformatics is computationally intense and deals with large data sets [15]. The results from Biodoop's experiments show that Hadoop is a versatile framework for problems with light computation and large data sets, as well as heavy computation with small data sets. This is similar to security in that there are many aspects to security and, to be specific, cryptography is very computationally heavy.

SecureMR is a security application that has been implemented using Hadoop [16]. SecureMR focuses on the integrity of the services running MapReduce. The application analyzes the behavior of worker nodes to determine the integrity of the services running. Our solution focuses on data confidentiality rather than service integrity.

Having a cryptographic distributed file system is a sound idea, but it does not fit well with the MapReduce programming model [9]. If cryptography were implemented at the file system level, the overhead would outweigh the benefits of the MapReduce model. This is the main reason that our solution runs at the application level. Our solution is built as a framework that begins as a cryptographic solution for the MapReduce model, and we will continue to build this application into a security suite.

Chapter 3
Motivation And Background

The driving force of this experiment was to increase the performance of data processing associated with large datasets by using a distributed system, while utilizing cryptography to meet the security needs. Major factors considered were the cost of the equipment, the open-source availability, and the ease of programming using Hadoop.

Using this model, the developer creates custom Map and Reduce functions. When the custom application is run, the master node locates the data on the distributed file system and splits the file into small, manageable pieces. Each piece is sent to a worker node and then processed using the Map function. The Map function produces key/value pairs from the pieces, which are then passed back to an intermediate file. The Reduce function receives these intermediate values and combines values that are similar according to the keys. Once the values are combined, the Reduce function returns the result back to the system [1].
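To make the two roles concrete, the sketch below shows what a mapper and a reducer look like in the older org.apache.hadoop.mapred API, the same API used by the example in Appendix B. The class names and the word-count-style logic are illustrative assumptions only and are not part of Secdoop.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: called once per record (byte offset and one line of text) and
// emits intermediate key/value pairs.
public class ExampleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        // Emit the line itself as the key with a count of one.
        output.collect(line, new LongWritable(1));
    }
}

// Reducer: receives every value that shares a key and combines them.
class ExampleReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}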
The complexity of developing parallel applications is reduced because only two main functions are needed in a MapReduce application. MapReduce has the ability to run on commodity systems, which can lower the price of a cluster and makes an in-house cluster a reasonable option for companies. The scalability provided by the MapReduce programming model allows the model to be expanded to include more nodes as more processing is needed. This expansion may result in the need to have multiple geographic sites available, and the communications between these sites can become a security weak point in the overall system. This vulnerability causes several fields of industry, including but not limited to financial, medical, and military, to lose interest in this solution, as sensitive data should not be transmitted in plaintext over unsecured connections. Another option is to use cryptography to protect the sensitive data, but the deterrent lies in the time-consuming, intensive processing that is required to use encryption and decryption. One possibility is to use an in-house trusted Hadoop cluster to prepare the data for transmission. Java provides a security library called the Java Cryptographic Extension (JCE) which allows programmers to specify what type of encryption standard and key length is desired using cipher streams. Hadoop can easily employ cryptographic solutions in custom parallel applications using the JCE provided by Java. This allows a low-cost, efficient, and customizable solution to this security vulnerability.

3.1 Hadoop

Hadoop consists of two major parts, the MapReduce model and the Hadoop Distributed File System (HDFS). The HDFS is managed by a "NameNode" that is the master of the cluster. The NameNode manages the file locations and the data replication, and keeps track of storage nodes [2]. The MapReduce portion of Hadoop is managed by the "JobTracker". The JobTracker is responsible for the task scheduling and job monitoring during the MapReduce process. Each node in the cluster contributes both to storage and to computation. The computation portion is handled by the JobTracker, and the storage portion is maintained by the NameNode [2][4].

The NameNode also ensures that each file fragment is replicated across multiple nodes to increase the reliability and availability of data. The replication of data also provides fault tolerance during runtime in case a node is busy or malfunctions. When a job is scheduled for a file that is replicated over multiple nodes, the nodes can begin work immediately on separate parts of the file without consuming the network with retransmissions of file fragments; this is a direct result of data replication.

3.2 Secdoop Behavior

[Figure 3.1: Demonstration of How Secdoop Interacts with Hadoop and HDFS]

Figure 3.1 shows how Secdoop interacts with the NameNode and the TaskTracker of Hadoop. The red objects represent the Hadoop master nodes, and the cluster of worker nodes is represented on the right side of the figure. Secdoop starts by being inserted into the job queue in the JobTracker. The NameNode then transfers the data splits back to Secdoop. Secdoop then calls the mapper class to process the data from the NameNode. When the Mapper completes, the intermediate data is returned and transferred back to the NameNode for storage.
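As a concrete illustration of this flow, the following is a minimal sketch of how a driver might configure and submit such a job using the older org.apache.hadoop.mapred API. It assumes the EncryptMapper class listed in Appendix B; the class name ExampleEncryptDriver, the job name, and the argument handling are illustrative assumptions, while the map-only, sequence-file output choices mirror the design described later in Section 4.4.2 rather than Secdoop's actual driver code.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

// Illustrative driver: builds a map-only job and submits it to the
// JobTracker, which schedules map tasks against the splits reported
// by the NameNode.
public class ExampleEncryptDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ExampleEncryptDriver.class);
        conf.setJobName("secdoop-encrypt-example");

        conf.setInputFormat(TextInputFormat.class);            // plaintext in
        conf.setOutputFormat(SequenceFileOutputFormat.class);  // sequence file out

        conf.setMapperClass(EncryptMapper.class);  // see Appendix B
        conf.setNumReduceTasks(0);                 // map-only: no reducer needed

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(BytesWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);  // blocks until the job completes
    }
}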
Chapter 4
Research Overview

4.1 Project Description

This research project investigates the use of cryptography in MapReduce with process distribution within a trusted cluster. The run time of three encryption standards will act as a benchmark to demonstrate the performance gain from process distribution. In this project, we have chosen to work with the Advanced Encryption Standard (AES), the Data Encryption Standard (DES), and the Triple Data Encryption Standard (DESede). AES employs a variety of bit substitutions, row shifting, column mixing, and different keys for each round of encryption [17]. DES, which is slower than AES, derives 48-bit round subkeys from a 56-bit key and iterates through 16 rounds of computation including bitwise XOR operations. DESede is the same as using the DES algorithm three times with a different key each time [18]. Each of these algorithms is process intensive and is not a good solution for large amounts of data without the distribution of processing. Our experiment will demonstrate the timings for each of the algorithms using the Java Cryptographic Extension library. We have developed two applications to demonstrate the cryptographic process: encrypt and decrypt.

4.2 Assumptions

In our experiments we assume that the application is running on a trusted cluster that is not concerned with data tampering or man-in-the-middle attacks against its communications. For testing purposes we leave key management for future work. The focus of this experiment is to demonstrate the effects of the distribution of processing on the runtime of cryptographic solutions when compared to the processing of a single machine.

4.3 Design Issues

4.3.1 Secdoop Design

Secdoop runs at the application level in order to not interfere with the MapReduce model in Hadoop. If our solution were to run at the file system level, the overhead would defeat the performance increase from the MapReduce model. Since our solution does not protect at the file system level, the plaintext files are needed by the HDFS. The problem with having the plaintext on the HDFS is that if the cluster is compromised, then the encryption is of no use. This is why we suggest our solution be run on a trusted in-house cluster that can prepare sensitive data for further distribution. Another issue with file system level cryptography is the inheritance in the classes used by Hadoop. Many of the subclasses override super methods and constructors, which would require a major modification to the entire system to ensure that the cryptographic services were incorporated properly. The application level allows for major performance boosts with minimal performance overhead.

4.3.2 Generation of Plaintext for Testing

We have created a testing harness that runs according to the configuration files that are passed to it. The harness also includes a program that generates the needed files for a given experiment. Below is how to generate the files using our distribution of Secdoop.

Generate your input sets and configuration files by calling the generator.py script in the secdoop directory:

./generator.py

If you would like, take a look at generator.py's usage by running:

./generator.py -h

Next, format the Hadoop namenode by running the following:

hdfs namenode -format

After running generator.py and formatting your namenode, you run the stage.sh script. This will put all of your generated input files into HDFS. This will create the necessary XML and text files for testing. After they run, the inputs and configs directories will be populated.
(See Appendix A for more details.)

4.3.3 Testing and Validation

Unit tests were written to ensure that the ciphers were implemented correctly and returned the expected results. In order to validate that Secdoop was performing the cryptography as expected, we used the Gutenberg example file that is often used for the "Wordcount" example for Hadoop. The input file for encryption was the Gutenberg ebook of Ulysses by James Joyce. The resulting file from the encryption application was then sent as the input to the decryption application. The output of the decryption was then compared to the original input. We used this method for the single machine applications as well as the Hadoop applications that are tested during our experiments. After the validation tests returned the expected results, we generated test files to be used for benchmark testing. Configurations were also generated in order to help with the automation of testing. Configuration files included the file size to be tested, the algorithm to be tested, and the location of the input files. The original input, encrypted data, and decrypted data reside on the cluster after the experiments have completed. All of the data is stored in part files while on the HDFS, but a simple script can restore these back to normal if the solution calls for a piece of intermediate data.

4.4 Implementation Details

4.4.1 Hadoop Application Behavior

When a Hadoop application starts up, the NameNode, or master node, locates which nodes already have the data replicated in the worker nodes' local storage. The input files are split up into many parts called "input splits" and then passed individually to the application's mapper class. There are many records in each input split, and by default, each record consists of one line from the original input file. The mapper then sends one record at a time to a map function that is custom to the application. After initial unit tests were run on our proof-of-concept mapper class, we found that in the "LineRecordReader" class, the byte buffer that stores the records for each input split was not re-created for each new record. This did not clear the buffer and left unexpected data in the buffer if the next record was shorter than the previous record. Hadoop does not create a new byte buffer for each record because of the overhead that it would cause. We solved this problem by using "System.arraycopy()" to create a new byte buffer for each record in our mapper class and strip the excess data from previous records, as sketched below.
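The following is a minimal sketch of that record-copying workaround. The helper class and method names (RecordBytes, copyOf) are introduced here only for illustration, since the thesis performs the copy inline in its mapper; the Text accessors, however, are the standard Hadoop ones: getBytes() exposes the reused backing array and getLength() gives the number of valid bytes.

import org.apache.hadoop.io.Text;

public final class RecordBytes {
    // Copy only the valid bytes of a (possibly reused) Text record into a
    // fresh array, so that stale bytes left over from a longer previous
    // record are not carried into the cipher.
    public static byte[] copyOf(Text value) {
        byte[] record = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, record, 0, value.getLength());
        return record;
    }
}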
4.4.2 Secdoop Application Details

In our experiments we employ password-based cryptography in order to ensure that the keys will be the same for every mapper class without using key management. By using password-based cryptography, we are able to use similar variations of the passphrase for each of the three algorithms. In order to limit the differences between the three algorithms, we also used Cipher Block Chaining (CBC) mode for all three standards.

The first application consists of a driver and a mapper to implement encryption. The driver program sets up the configuration needed to run the mapper. The mapper class runs a static block of code that sets up a cipher stream for each input split. The map function is called for each record in the input split and encrypts it using one of the three algorithms specified above. The input key/value pairs are of type LongWritable for the key and of type Text for the values. The output key/value pairs are of type LongWritable for the key and BytesWritable for the values. We use the Hadoop API classes LongWritable, Text, and BytesWritable. The map function encrypts the input values and associates the output values with a key that keeps the sequence of the lines in the original file. The encryption application does not need a reducer function because we do not plan on combining any information. The output is an intermediate file that consists of the key/value pairs. The output file format is a sequence file, which can also be found in the Hadoop API, and this stores the data in an intermediate file on the HDFS.

The second application consists of a driver and a mapper for the decryption process. The driver acts the same as the previous one, setting up the configuration and file types to be dealt with. The mapper runs a static block of code to set up the cryptographic keys and ciphers for decryption. The input file is the intermediate sequence file that results from the first application. The input key/value pairs for the decrypt mapper are of type LongWritable for the keys, and the values are of type BytesWritable. The output key/value pairs are of type LongWritable for the keys and Text for the values. The map function goes through each line and decrypts it using the ciphers set up in the static block. The values are still associated with a key that keeps the order of the original file, and thus a reducer function is not needed. The results are plaintext that is the same as the original input file.

Chapter 5
Experiments

5.1 Experimental Design

This experiment is designed to demonstrate the difference in runtime between a single machine and a Hadoop cluster environment consisting of ten identical nodes. For the AES, DES, and DESede algorithms, we used the cipher-block chaining mode in order to keep a consistent mode for each algorithm and to limit the number of variables in the experiment. We used password-based encryption in order to guarantee that the key would be the same during every trial without the need for key management. Unit tests were written for each program used in the tests in order to verify the integrity of the results.

For the single machine trials, three algorithms and eight different file sizes were used. For each of the three algorithms a program was created to encrypt the data and another to decrypt the data, totaling six programs. The trials consist of running each program for every file size.

For the Hadoop cluster environment trials, three algorithms and the same eight file sizes were used. Eight total programs were created for the cluster environment, consisting of two drivers and six mappers. Two drivers were created, one to configure Hadoop for encryption and the other to configure Hadoop for decryption. Each of the three algorithms required a mapper for encryption and another for decryption, totaling six mappers. There is a configuration file which is passed in at the command line and specifies which algorithm's mapper to use. When the file is encrypted, we use the key from the key/value pair to keep the sequence of the file intact, thus eliminating the need for a reducer. For a trial, each mapper runs for every file size.
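Since both the Secdoop mappers and the single-machine baseline programs rely on password-based keys and CBC mode, the following sketch shows one way such a cipher could be constructed with the JCE. The thesis does not state which key-derivation scheme, salt, iteration count, or key length Secdoop uses, so the PBKDF2 derivation, the 128-bit key, and the class name PasswordCipher are illustrative assumptions; the DES and DESede variants would differ only in the key type and transformation string (for example, "DES/CBC/PKCS5Padding").

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Derive an AES key from a shared passphrase and build a CBC-mode cipher.
// Because every mapper derives the key from the same passphrase and salt,
// no key distribution is required.
public final class PasswordCipher {
    public static Cipher newAesEncryptCipher(char[] passphrase, byte[] salt,
                                             byte[] iv) throws Exception {
        // Assumed parameters: 1024 iterations, 128-bit key; iv must be 16 bytes.
        SecretKeyFactory factory =
                SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        PBEKeySpec spec = new PBEKeySpec(passphrase, salt, 1024, 128);
        SecretKey derived = factory.generateSecret(spec);
        SecretKey aesKey = new SecretKeySpec(derived.getEncoded(), "AES");

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(iv));
        return cipher;
    }
}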
5.2 Experimental Setup

Single Machine Configuration: an HP ProLiant ML110 G6 computer with a 2.4 GHz, 4-core Intel Xeon processor, 2GB of RAM, and a 160GB HDD, running Ubuntu 10.04 (32-bit version). There is no cluster or network overhead. The machine functions as a local, stand-alone unit and is considered to be a commodity machine due to its common hardware and affordable price.

Cluster Configuration: ten machines with hardware identical to the single machine, connected by a Gigabit switch, comprise the ten node cluster used for the experiments.

File Sizes: 1MB, 64MB, 512MB, 1GB, 2GB, 4GB, 8GB, 32GB.

Data Collected: all data collected for use in the figures and graphs are an average of three experimental trials and are expressed in seconds. Due to the minute differences between the encryption and decryption timings, only the encryption timings are presented in the figures and graphs.

5.3 Experimental Results

Figure 5.1 shows the average runtime of all three cryptographic algorithms running on a single machine. The runtime increases rapidly as the size of the file increases. AES has by far the fastest runtimes when compared to the other two algorithms. DESede unexpectedly outperformed DES on the 1MB file, but this was due to the limited number of trials run. The behaviors of the three algorithms are clearly demonstrated even with a limited number of trials, and as the file sizes get larger, the difference becomes more and more apparent. Table 5.1 lists the data that was used for Figure 5.1. The choice of algorithm becomes a major performance decision as file sizes increase. For a 32GB file, AES completes encryption in almost one third of the time that DESede takes.

The three algorithms are much more competitive when running in the parallel environment on the cluster. Figure 5.2 compares the three algorithms on a ten node cluster and shows that the parallel environment enhances the runtime significantly. The cluster overhead becomes a large performance hit when the file sizes are below 512MB. The three algorithms maintain the performance behaviors demonstrated on the single machine, but the difference in performance is relatively minimal for all file sizes.

The figures show that the cluster is an ideal solution for large files. Figure 5.3 compares the performance of AES on the single machine and the ten node cluster. The single machine outperforms the cluster for the 1MB file and the 64MB file. The cluster significantly outperforms the single machine for any file larger than 512MB. The performance gain from the cluster increases as the file sizes increase. Table 5.2 lists the data points that were gathered from the cluster testing. When comparing Table 5.1 and Table 5.2, the cluster completes the AES encryption of the 8GB file 3.69 times faster than the single machine, and the 32GB AES encryption completes 4.34 times faster than on the single machine.

Figure 5.4 compares the DES algorithm on the single machine and the ten node cluster. The performance gains from the cluster are greater than the AES performance gains on the cluster. Similar to AES, the performance gain is not recognized until the 512MB file size is reached. The cluster completes the DES encryption of the 8GB file 6.35 times faster than the single machine, and the 32GB file completes 7.07 times faster than on the single machine.

On the single machine, DESede was by far the slowest of the three algorithms. Hadoop enables DESede to be a competitor against the other algorithms.
Figure 5.5 shows the dramatic performance gain on large file sizes when using the cluster compared to the single machine. DESede completes encryption of the 8GB file 8.76 times faster than on the single machine, and the 32GB file completes 9.60 times faster than on the single machine.

We noticed during testing that only half of the CPU was being utilized, so we investigated whether performance could be tuned by allowing more map tasks to run simultaneously on the worker nodes. Tests were set up to compare the three algorithms with the maximum number of map tasks set to 2, 3, and 4. These are simple configurations that can be added at the command line or in a configuration file. Table 5.3 demonstrates the effect of the number of map tasks on the AES algorithm. The results are inconclusive and do not show much of any pattern except for the 32GB file. Table 5.4 shows the effect of the number of map tasks on the DES algorithm. The data collected for DES is also inconclusive, and the only pattern is for the 32GB file. Similarly, Table 5.5 shows the effect of the number of map tasks on the DESede algorithm. The gain for the 32GB file is encouraging, and future tests should be done to see the effect on files larger than 32GB.

[Figure 5.1: Single Machine Encrypt Test Runtimes with All Algorithms (run time in seconds, log scale)]

[Figure 5.2: Ten Node Cluster Encrypt Test Runtimes with All Algorithms (run time in seconds, log scale)]

[Figure 5.3: AES Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine (run time in seconds, log scale)]

[Figure 5.4: DES Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine (run time in seconds, log scale)]

[Figure 5.5: DESede Encrypt Test Runtime Comparisons of the Cluster vs. Single Machine (run time in seconds, log scale)]
Table 5.1: Average Encrypt Runtimes on a Single Machine in Seconds

File Size   AES       DES       DESede
1MB         0.98      4.67      2.06
64MB        3.17      10.07     15.73
512MB       32.52     71.29     119.8
1GB         58.7      127.94    236.96
2GB         121.71    270.56    466.77
4GB         279.52    551.75    937.85
8GB         583.29    1115.04   1868.93
32GB        2566.9    4654.84   7637.29

Table 5.2: Average Encrypt Runtimes on a Ten Node Cluster in Seconds

File Size   AES       DES       DESede
1MB         18.13     14.75     14.92
64MB        18.71     17.31     20.95
512MB       24.04     25.31     30.41
1GB         35.04     37.04     36.74
2GB         53.74     58.08     69.01
4GB         88.23     103.77    109.8
8GB         158.27    175.59    213.37
32GB        591.18    657.99    795.52

Table 5.3: AES Encrypt Runtime Comparisons between the Maximum Number of Map Tasks

File Size   2 Maps    3 Maps    4 Maps
1MB         18.13     16.3      15.3
64MB        18.71     17.31     17.46
512MB       24.04     25.72     23.98
1GB         35.04     34.32     35.2
2GB         53.74     51.55     52.1
4GB         88.23     79.65     93.75
8GB         158.27    144.19    149.54
32GB        591.18    564.98    534.24

Table 5.4: DES Encrypt Runtime Comparisons between the Maximum Number of Map Tasks

File Size   2 Maps    3 Maps    4 Maps
1MB         14.75     14.71     14.68
64MB        17.31     17.97     17.27
512MB       25.31     26.97     25.39
1GB         37.04     32.95     32.75
2GB         58.08     45.91     54.49
4GB         103.77    92.25     86.78
8GB         175.59    175.74    154.6
32GB        657.99    597.5     583.72

Table 5.5: DESede Encrypt Runtime Comparisons between the Maximum Number of Map Tasks

File Size   2 Maps    3 Maps    4 Maps
1MB         14.92     14.76     14.86
64MB        20.95     20.28     20.28
512MB       30.41     30.76     29.85
1GB         36.74     38.38     40.65
2GB         69.01     60.55     60.44
4GB         109.8     102.2     98.75
8GB         213.37    194.59    177.97
32GB        795.52    707.14    679.53

Chapter 6
Conclusion

The results from the experiments show a dramatic improvement in processing speed as the files increase in size. The AES algorithm outperformed the DES and DESede algorithms as expected, but the results on the cluster were much closer than the results on the single machine. The single machine outperformed the cluster when the file sizes were less than 512MB, but there was a substantial gain when using the cluster as the file sizes increased above 512MB. This solution is optimal for companies that have large amounts of sensitive data that need to be distributed. It can be implemented by companies that own a commodity cluster at a low cost and need to prepare large sets of data for further distribution.

Our model, Secdoop, is extremely useful when industries are concerned about the security of the cloud. If the data is already encrypted by our solution, then industries can store data and not worry about where the data resides in the cloud. Secdoop provides an option to process data securely over an in-house cluster that is scalable and cost effective.

There is room for performance tuning by manipulating the configuration of Hadoop on a cluster. We are interested in the comparison of runtimes when file sizes are larger than 32GB and the maximum number of map tasks is increased to be closer to the number of cores per CPU in each node. The results that we gathered using two, three, and four maximum map tasks were not enough to establish any patterns. We also plan on comparing the effect of the number of nodes in a cluster versus the performance gain over a single machine.

Chapter 7
Future Work

This project has been documented in order to serve as an example to future students in the field of computer science. We have created project descriptions and lectures that will aid in the education process.
The project descriptions and lesson plans cover topics such as developing tests for applications, developing applications in a distributed environment, and how to implement security measures in Java. This project is the framework for a security suite that has started with cryptography and will move on to topics such as key management, authorization, and authentication. Another possibility is to create a library of configurations that holds the settings for the different algorithms used by this application. With a library of preconfigured settings, companies could download this suite and run it on their own secure in-house cluster without the need for customization. If a company desired a customized configuration, then it could use our library of solutions as an example.

There is also an ethical negative side to the use of MapReduce if the mappers are written to attack a set of data. This leads to the need for stronger encryption or larger keys. If distribution of processing can break keys faster than before, then all of the data that is currently thought of as "safe" will need stronger protection. This is alarming as clusters become more and more affordable.

Bibliography

[1] Dean, J., Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1 (January 2008), pp. 107-113.

[2] Shvachko, K., Kuang, H., Radia, S., Chansler, R., "The Hadoop Distributed File System," Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, May 3-7, 2010.

[3] Krishna, M., Kannan, B., Ramani, A., Sathish, S., "Implementation and Performance Evaluation of a Hybrid Distributed System for Storing and Processing Images from the Web," 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 762-767, 2010.

[4] Jia, B., Wlodarczyk, T., Rong, C., "Performance Considerations of Data Acquisition in Hadoop System," 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 545-549, 2010.

[5] White, T., Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[6] Park, S., Lee, J., Chung, T., "Cluster-Based Trust Model against Attacks in Ad-Hoc Networks," Third International Conference on Convergence and Hybrid Information Technology (ICCIT '08), vol. 1, pp. 526-532, 11-13 Nov. 2008.

[7] Mishra, S., Kushwaha, D.S., Misra, A.K., "A Cooperative Trust Management Framework for Load Balancing in Cluster Based Distributed Systems," 2010 International Conference on Recent Trends in Information, Telecommunication and Computing (ITC), pp. 121-125, 12-13 March 2010.

[8] Li, X., Jing, Z., "A Trust Cluster Based Key Management Protocol for Ad hoc Networks," 2007 IEEE International Workshop on Anti-counterfeiting, Security, Identification, pp. 371-376, 16-18 April 2007.

[9] O'Connell, M., Nixon, M., "JFS: A Secure Distributed File System for Network Computers," Proceedings of the 25th EUROMICRO Conference, vol. 2, pp. 450-457, 1999.

[10] Sandberg, R., "The Sun Network File System: Design, Implementation and Experience," Tech. Report, Mountain View, CA: Sun Microsystems, 1987.

[11] LeBlanc, T., Subhlok, J., Gabriel, E., "A High-Level Interpreted MPI Library for Parallel Computing in Volunteer Environments,"
Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10), IEEE Computer Society, Washington, DC, USA, pp. 673-678.

[12] Aleksander, M., Litawa, G., Karpinskyi, V., "Distributed computing system which solve an elliptic curve discrete logarithm problem," 10th International Conference on CAD Systems in Microelectronics (CADSM 2009), pp. 378-380, 24-28 Feb. 2009.

[13] Gomes, A.M., Kakugawa, F.R., de Paula Bianchini, C., Massetto, F.I., "A thread-safe communication mechanism for message-passing interface based on MPI Standard," 2009 Joint Conferences on Pervasive Computing (JCPC), pp. 173-178, 3-5 Dec. 2009.

[14] Prabhakar, R., Patrick, C., Kandemir, M., "MPISec I/O: Providing Data Confidentiality in MPI-I/O," 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), pp. 388-395, 18-21 May 2009.

[15] Leo, S., Santoni, F., Zanetti, G., "Biodoop: Bioinformatics on Hadoop," 2009 International Conference on Parallel Processing Workshops (ICPPW), pp. 415-422, 2009.

[16] Wei, W., Du, J., Yu, T., Gu, X., "SecureMR: A Service Integrity Assurance Framework for MapReduce," Annual Computer Security Applications Conference (ACSAC '09), pp. 73-82, 7-11 Dec. 2009.

[17] National Institute of Standards and Technology, Federal Information Processing Standards Publication 197, "Advanced Encryption Standard," November 26, 2001.

[18] National Bureau of Standards, Federal Information Processing Standards Publication 46-3, "Data Encryption Standard," US Dept. of Commerce, Jan. 1977.

[19] Hou, F., Gu, D., Xiao, N., Tang, Y., "Secure Remote Storage through Authenticated Encryption," Proceedings of the 2008 International Conference on Networking, Architecture, and Storage (NAS '08), IEEE Computer Society, Washington, DC, USA, pp. 3-9.

[20] Gu, Y., Grossman, R., "Exploring Data Parallelism and Locality in Wide Area Networks," Workshop on Many-task Computing on Grids and Supercomputers (MTAGS), co-located with SC08, Austin, TX, Nov. 2008.

Appendices

Appendix A
Secdoop Subversion Information

We are using Subversion for version control. To check out the repository, it is suggested that you add the following environment variable to your .bashrc:

export SECDOOP_SVN=http://<username>@svn.binary-snobbery.com/secdoop

Where <username> is your Subversion username.

NOTE: This may change in the future to allow for anonymous SVN access.

To check out the Subversion repository, in your HOME directory run:

svn co $SECDOOP_SVN/trunk trunk

NOTE: Checking out in another location will require you to edit the following environment variables to point to the correct location.

You must add the following lines to your .bashrc file:

export HADOOP_DIR=$HOME/trunk/hadoop
export HADOOP_COMMON_HOME=$HADOOP_DIR
PATH=$PATH:$HADOOP_DIR/bin

HADOOP_COMMON_HOME is used by the Hadoop scripts. HADOOP_DIR is used by the Secdoop scripts.

To build Secdoop, you can simply run "ant dist-all" in the trunk/secdoop folder. This will compile all of the necessary classes needed to test both Secdoop and the standalone JCE tests.

In order to test anything against the Java Cryptographic Extensions (JCE), you need to install the JAR files from the jce_policy-6.zip file in trunk/. Unzip the file and copy it into your JRE's lib/security directory.
For example, on Ubuntu 10.* you can do the following:

unzip jce_policy-6.zip
sudo cp jce/*.jar /usr/lib/jvm/java-6-sun/jre/lib/security

First things first, actually configure Hadoop. You'll find some example configuration files in hadoop/conf. Copy those files without "example" in the name, and you'll have the files you need for configuration. Change the values to whatever is appropriate for your cluster.

The testing harness is configured through a tests configuration file, by default named "tests.conf". In tests.conf you can specify the algorithms you wish to test and the size of the input files. Currently supported sizes and algorithms are listed in tests-example.conf.

Then, you need to generate your input sets and configuration files by calling the generator.py script in the secdoop directory:

./generator.py

If you would like, take a look at generator.py's usage by running:

./generator.py -h

Next, format the Hadoop namenode by running the following:

hdfs namenode -format

After running generator.py and formatting your namenode, you run the stage.sh script. This will put all of your generated input files into HDFS. This will create the necessary XML and text files for testing. After they run, the inputs and configs directories will be populated.

Next, start up Hadoop:

hadoop-startup.sh

NOTE: In our testing environment, we have a shared home directory via NFS, so each of the Hadoop installations is actually shared across the cluster. This can be problematic, because the Hadoop log directory is on NFS. The overhead from NFS writes was actually enough to notice a roughly 10% performance decrease. Also, the userlogs directory had to be symbolically linked to a common local (non-NFS) directory. If your cluster doesn't share a Hadoop installation, this shouldn't be an issue.

Then you're ready to run tests. From the secdoop directory:

./test.sh

For information about test.sh command-line options, simply run:

./test.sh -h

After testing, shut down the Hadoop cluster:

hadoop-shutdown.sh

If at any point you need to refresh your Hadoop cluster (i.e., delete all of your temporary MapReduce data, remove all of your HDFS local file stores on cluster nodes, and re-format your namenode), you can use hadoop-clear.sh found in hadoop/bin. This script should be in your PATH if you made the changes to your .bashrc, mentioned above, correctly.

Appendix B
Example Mapper Class

/***************************************************************************/
/* EncryptMapper.java by James Majors                                      */
/* This is the example Mapper function for the AES Encryption application. */
/* The input will be split up and equally distributed across the available */
/* nodes in the cluster. The encryption process will be done for each      */
/* input split and paired with a key value that will help maintain the     */
/* sequence of the file. The output will be sent to an intermediate file   */
/* that will contain the key/value pairs.                                  */
/***************************************************************************/

import java.io.IOException;

import javax.crypto.Cipher;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class EncryptMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, BytesWritable> {

    private static Cipher encryptCipher;

    static {
        // GENERATE KEY FOR THE GIVEN ALGORITHM
        // INITIALIZE CIPHER
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, BytesWritable> output,
                    Reporter reporter) throws IOException {
        // Replace the value from the input split using System.arraycopy()
        // Encrypt the input value
        // Output the result to the intermediate file via output.collect()
    }
}

Appendix C
Configuration File Example

This is an example of the configuration for the AES 1MB test:

inputFile      inputs/1MB.txt
encryptedFile  AES-1MB-enc
decryptedFile  AES-1MB-dec
algorithm      AES