Collaborative-Adversarial Pair (CAP) Programming

by

Rajendran Swamidurai

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 18, 2009

Keywords: collaborative-adversarial pair programming, CAP, pair programming, PP, collaborative programming, agile development, test driven development, empirical software engineering

Copyright 2009 by Rajendran Swamidurai

Approved by

David A. Umphress, Associate Professor of Computer Science and Software Engineering
James Cross, Professor of Computer Science and Software Engineering
Dean Hendrix, Associate Professor of Computer Science and Software Engineering

Abstract

The advocates of pair programming claim that it has a number of benefits over traditional individual programming, including faster software development, higher quality code, reduced overall software development cost, increased productivity, better knowledge transfer, increased job satisfaction, and increased confidence in the resulting product, at the cost of only slightly increased personnel hours. While the concept of pair programming is attractive, it has some detractions. First, it requires that the two developers be at the same place at the same time. Second, it requires an enlightened management that believes that letting two people work on the same task will result in better software than if they worked separately. Third, the empirical evidence of the benefits of pair programming is mixed. Anecdotal and empirical evidence suggests that pair programming is better suited for job training than for real software development: pair programming is more effective than traditional single-person development when both members of the pair are novices to the task at hand, but novice-expert and expert-expert pairs have not been demonstrated to be effective.
This research proposes a new variant of pair programming called Collaborative-Adversarial Pair (CAP) programming. Its objective is to exploit the advantages of pair programming while downplaying the disadvantages. Unlike traditional pairs, where two people work together in all phases of software development, CAPs start by designing together, then split into independent test construction and code implementation roles, and finally join again for testing.

Two empirical experiments were conducted during the Fall 2008 and Spring 2009 semesters to validate CAP against traditional pair programming and individual programming. Forty-two volunteer students, undergraduate seniors and graduate students from Auburn University's Software Process class, participated in the studies. The subjects used Eclipse and JUnit to perform three programming tasks of different degrees of complexity. The subjects were randomly divided into three experimental groups in the ratio 1:2:2: an individual (solo) programming group, a pair programming (PP) group, and a collaborative-adversarial pair (CAP) programming group. The results of these experiments favor the CAP development methodology and do not support the claim that pair programming in general reduces overall software development time or increases program quality or correctness.

To my wife Uma and my guru Dr. David Ashley Umphress

Acknowledgments

I consider completing this dissertation to be the greatest accomplishment of my life thus far. It is the result of the sacrifices and encouragement of many individuals. Although it would not be possible for me to list them all, I would like to mention a handful without whom this accomplishment would have remained a dream. It is with a deep sense of gratitude that I acknowledge my indebtedness to my Ph.D. committee members; in particular, my advisor Dr. David A. Umphress.
He has been a wise and dependable mentor and an exemplary role model in helping me achieve my professional goals. Dr. Umphress has always given me invaluable guidance, support, and enthusiastic encouragement. Heartfelt thanks are also extended to the other committee members, Dr. James Cross and Dr. Dean Hendrix, for their suggestions and guidance, which have greatly improved the quality of my work. Special thanks go to all forty-two students (Fall 2008 and Spring 2009 Software Process classes) who participated in the control experiments. I would also like to thank all the professors and teachers who have taught me, from kindergarten to this date, and those under whom I have worked as a Teaching Assistant at Auburn. The inspiration I have drawn from my long list of friends, from my childhood to this date, deserves a special acknowledgement. From the bottom of my heart, I want to thank my parents, my in-laws, and my extended family for their love and support. Lastly, I would also like to thank my wife, Mrs. Uma Rajendran, my son, Soorya Gokulan, and my daughter, Sneha, for their love, support, and unstinting faith.

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
List of Abbreviations
Chapter 1: Introduction
Chapter 2: Literature Review
  2.1. Pair Programming
  2.2. Pair Programming Experiments
  2.3. The Pairing Activity
  2.4. The Effect of Pair Programming on Software Development Phases
Chapter 3: Research Description
  3.1. The CAP Process
Chapter 4: Applied Results and Research Validation
  4.1. Subjects
  4.2. Experimental Tasks
  4.3. Hypotheses
  4.4. Cost
  4.5. Program Correctness
  4.6. Experiment Procedure
  4.7. Results
  4.8. Observations
Chapter 5: Conclusions and Future Work
  5.1. Conclusions
  5.2. Future Work
References
Appendix A
Appendix B

List of Tables

Table 2.1: Summary of Pair Programming Experiments
Table 2.2: Summary of Pair Programming Experiments Results
Table 2.3: When to Pair Program
Table 2.4: Effects of Software Processes on PP
Table 2.5: Effects of Programming Languages on PP
Table 2.6: Effects of Software Development Methods on PP
Table 2.7: Summary of Pair Design Experiments
Table 4.1: Total Software Development Time
Table 4.2: Coding Time
Table 4.3: The number of test cases passed
Table 4.4: Total Software Development Time
Table 4.5: Coding Time
Table 4.6: Total Software Development Time
Table 4.7: Coding Time
Table 4.8: Total Software Development Time
Table 4.9: Coding Time
Table 4.10: Summary of Control Experiments and their Results

List of Figures

Figure 2.1: Pair Programming Time Line
Figure 2.2: The DaimlerChrysler C3 work area
Figure 2.3: Pair Programming Workplace Layout
Figure 2.4: RoleModel Software Workstation Layout
Figure 2.5: Conventional Environment
Figure 2.6: Rearranged Environment for Better Role Switching
Figure 2.7: "Circle table" for pair programming
Figure 3.1: CAP Development Activity
Figure 3.2: CAP Development Cycle
Figure 3.3: Collaborative-Adversarial Pairs (CAP)
Figure 3.4: Build Code / Unit Implementation in CAP
Figure 3.5: A Class-Responsibility-Collaborator (CRC) index card
Figure 3.6: Unit Test Environment
Figure 4.1: Experimental Setup
Figure 4.2: Experimental Procedure
Figure 4.3: Q-Q Plot of Residuals (Dynamic Pairs Total Software Development Time)
Figure 4.4: Q-Q Plot of Residuals (Dynamic Pairs Coding Time)
Figure 4.5: Test for Normality (Dynamic Pairs Total Software Development Time)
Figure 4.6: Test for Normality (Dynamic Pairs Coding Time)
Figure 4.7: Box plot (Dynamic Pairs Total Software Development Time)
Figure 4.8: Box plot (Dynamic Pairs Coding Time)
Figure 4.9: Average Total Software Development Time (Dynamic Pairs)
Figure 4.10: Total Software Development Time (Dynamic Pairs)
Figure 4.11: t-Test Results (Dynamic Pairs Total Software Development Time)
Figure 4.12: Average Coding Time (Dynamic Pairs)
Figure 4.13: Box plot (Dynamic Pairs Coding Time)
Figure 4.14: t-Test Results (Dynamic Pairs Coding Time)
Figure 4.15: The number of test cases passed (Dynamic Pairs)
Figure 4.16: Q-Q Plot of Residuals (Static Pairs Total Software Development Time)
Figure 4.17: Q-Q Plot of Residuals (Static Pairs Coding Time)
Figure 4.18: Test for Normality (Static Pairs Total Software Development Time)
Figure 4.19: Test for Normality (Static Pairs Coding Time)
Figure 4.20: Box plot (Static Pairs Total Software Development Time)
Figure 4.21: Box plot (Static Pairs Coding Time)
Figure 4.22: Average Total Software Development Time (Static Pairs)
Figure 4.23: Total Software Development Time (Static Pairs)
Figure 4.24: t-Test Results (Static Pairs Total Software Development Time)
Figure 4.25: Average Coding Time (Static Pairs)
Figure 4.26: Box plot (Static Pairs Coding Time)
Figure 4.27: Wilcoxon Mann-Whitney U Test Results (Static Pairs Coding Time)
Figure 4.28: Q-Q Plot of Residuals (Combined CAP Vs PP Total Software Development Time)
Figure 4.29: Q-Q Plot of Residuals (Combined CAP Vs PP Coding Time)
Figure 4.30: Test for Normality (Combined CAP Vs PP Total Software Development Time)
Figure 4.31: Test for Normality (Combined CAP Vs PP Coding Time)
Figure 4.32: Box plot (Combined CAP Vs PP Total Software Development Time)
Figure 4.33: Box plot (Combined CAP Vs PP Coding Time)
Figure 4.34: Average Total Software Development Time (Combined CAP Vs PP)
Figure 4.35: Box Plot (Combined CAP Vs PP Total Software Development Time)
Figure 4.36: t-Test Results (Combined CAP Vs PP Total Software Development Time)
Figure 4.37: Average Coding Time (Combined CAP Vs PP)
Figure 4.38: Box plot (Combined CAP Vs PP Coding Time)
Figure 4.39: Wilcoxon Mann-Whitney U Test Result (Combined CAP Vs PP Coding Time)
Figure 4.40: Q-Q Plot of Residuals (CAP Vs IP Total Software Development Time)
Figure 4.41: Q-Q Plot of Residuals (CAP Vs IP Coding Time)
Figure 4.42: Test for Normality (CAP Vs IP Total Software Development Time)
Figure 4.43: Test for Normality (CAP Vs IP Coding Time)
Figure 4.44: Box plot (CAP Vs IP Total Software Development Time)
Figure 4.45: Box plot (CAP Vs IP Coding Time)
Figure 4.46: Average Total Software Development Time (CAP Vs IP)
Figure 4.47: Total Software Development Time (CAP Vs IP)
Figure 4.48: t-Test Results (CAP Vs IP Total Software Development Time)
Figure 4.49: Average Coding Time (CAP Vs IP)
Figure 4.50: Box plot (CAP Vs IP Coding Time)
Figure 4.51: t-Test Results (CAP Vs IP Coding Time)
Figure 4.52: Average Software Development Time for Static PP and Dynamic PP
Figure 4.53: Average Software Development Time for Static CAP and Dynamic CAP

List of Abbreviations

ANOVA  Analysis of Variance
BF     Brown and Forsythe's variation of Levene's test
C3     Chrysler Comprehensive Compensation
CAP    Collaborative-Adversarial Pair Programming
CRC    Class Responsibility Collaborator
CSP    Collaborative Software Process
GLM    General Linear Models
GUI    Graphical User Interface
IDE    Integrated Development Environment
IP     Individual Programming
J2EE   Java 2 Platform, Enterprise Edition
JDK    Java Development Kit
LOC    Lines of Code
OO     Object Oriented
PP     Pair Programming
PSP    Personal Software Process
Q-Q    Quantile-Quantile
SAS    Statistical Analysis Software
TDD    Test Driven Development
UML    Unified Modeling Language
XP     Extreme Programming

1. INTRODUCTION

One of the most popular, emerging, and controversial topics in software engineering in recent years is pair programming. Pair programming (PP) is a way of inspecting code as it is being written. Its premise, two people at one computer, is that two people working together on the same task will likely produce better code than one person working individually.
In pair programming, one person acts as the "driver" and the other acts as the "navigator." The driver is responsible for typing the code; the navigator is responsible for reviewing it. In a sense, the driver addresses the operational issues of implementation while the navigator keeps in mind the strategic direction the code must take.

Though the history of pair programming stretches back to punched cards, it gained prominence in the early 1990s. It became popular after the 1999 publication of Extreme Programming Explained by Kent Beck, where it was noted as one of the 12 key practices promoted by Extreme Programming (XP) [Beck 2000]. In recent years, industry and academia have turned their attention and interest toward pair programming [Arisholm et al. 2007, Canfora et al. 2006], and it has been widely accepted as an alternative to traditional individual programming [Muller 2005].

The advocates of pair programming claim that it has many benefits over traditional individual programming, including faster software development, higher quality code, reduced overall software development cost, increased productivity, better knowledge transfer, increased job satisfaction, and increased confidence in their work, at the cost of only slightly increased personnel hours [Arisholm et al. 2007]. While the concept of pair programming is attractive, it has some detractions. First, it requires that the two developers be at the same place at the same time. This is frequently not realistic in busy organizations, where developers may be matrixed concurrently to a number of projects. Second, it requires an enlightened management that believes that letting two people work on the same task will result in better software than if they worked separately. This is a significant obstacle, since software products are measured more by tangible properties, such as the number of features implemented, than by intangible properties, such as the quality of the code.
Third, the empirical evidence of the benefits of pair programming is mixed: the works of Judith Wilson et al. [Wilson et al. 1993], John Nosek [Nosek 1998], Laurie Williams [Williams et al. 2000], Charlie McDowell et al. [McDowell et al. 2002], and Xu and Rajlich [Xu et al. 2006] support the claimed benefits of pair programming, while experiments by Nawrocki and Wojciechowski [Nawrocki et al. 2001], Jari Vanhanen and Casper Lassenius [Vanhanen et al. 2005], Erik Arisholm et al. [Arisholm et al. 2007], Matevz Rostaher and Marjan Hericko [Rostaher et al. 2002], and Hanna Hulkko and Pekka Abrahamsson [Hulkko et al. 2005] found no statistically significant difference between pair programming and solo programming. Don Wells and Trish Buckley [Wells et al. 2001], Kim Lui and Keith Chan [Lui et al. 2006], and Erik Arisholm et al. [Arisholm et al. 2007] show that pair programming is more effective than traditional single-person development when both members of the pair are novices to the task at hand; novice-expert and expert-expert pairs have not been demonstrated to be effective. According to Karl Boutin [Boutin 2000], many developers are forced to abandon pair programming due to a lack of resources (e.g., small team size). He also observed that abandoning pair programming in the middle of a project hindered the integration of new modules into the existing project.

This research proposes a new variant of pair programming called Collaborative-Adversarial Pair (CAP) programming. Its objective is to exploit the advantages of pair programming while downplaying the disadvantages. Unlike traditional pairs, where two people work together in all phases of software development, CAPs start by designing together, then split into independent test construction and code implementation roles, and finally join again for testing.

2. LITERATURE REVIEW

2.1.
Pair Programming

Pair programming is a programming technique in which two people write all production code at a single machine, using one keyboard and one mouse. The members of each pair are assigned two different roles. The partner with the keyboard and mouse, known as the driver [1], types and thinks about the best way to implement the method at hand; the other partner, known as the navigator or observer, watches and reviews the code being typed, looking for errors and thinking strategically about the feasibility of the overall approach, additional test cases to be addressed, and ways to simplify the whole system in order to overcome the current problem [Beck 2000].

[1] Kent Beck gave no specific names to the two partners in Extreme Programming Explained. The names driver and navigator were first used by Laurie Williams in her article "Integrating pair programming into a software development process" [Williams 2001].

The following are some of the key points highlighted in the pair programming literature:

- Pairing is dynamic; people pair with different partners in the morning and evening sessions, and a programmer can pair with anyone in the development team [Beck 2000].
- Along with writing the code for test cases, the pairs also evolve the system's design. Pairs add value to almost all stages of system development, including analysis, implementation, and testing [Beck 2000].
- The driver and observer are full partners, and they exchange their roles quite often [Martin 2003, Wake 2002].
- The pair programming activity provides a means for real-time problem solving and real-time quality assurance [Pressman 2005].
- Pair programming is a social skill, not a technical skill. It has to be practiced with people who already know how to do it [Wells 2001].
- Pair programming is not an activity in which one person programs and the other simply watches.
Moreover, pair programming is not a tutoring activity in which an experienced partner teaches an inexperienced one. It is a conversation between two people trying to understand the problem together while simultaneously carrying out an activity (analysis, design, implementation, or test) [Beck 2000].

Even though the terms collaborative programming (CP) and pair programming (PP) are used interchangeably in the literature, they are not the same. There are two fundamental differences between them. First, there is no working protocol specified for collaborative programming, whereas pair programming has a well-defined working protocol that prescribes continuously overlapping reviews and the creation of artifacts. Second, a pair programming team is strictly restricted to two people, while there is no such restriction for a collaborative programming team; it may contain two or more people [Canfora et al. 2007].

2.1.1. Pair Programming History

The history of pair programming dates back to punched cards in the early 1940s, when von Neumann worked with IBM. But pair programming became popular only after Kent Beck published Extreme Programming Explained in 1999. The timeline of pair programming is discussed below.

Dave W. Smith, an agile software project leader and coach, while discussing the history of Extreme Programming (XP), wrote, "Jerry Weinberg told me that John Von Neumann's team at IBM used pair programming in much the same form that XP employs it now" [Perl 2004]. In the 1950s, Fred Brooks, author of The Mythical Man-Month, tried pair programming with fellow graduate student Bill Wright while he himself was a graduate student [Williams et al. 2003]. E. W. Dijkstra recalled his 1969 pair programming experience, which led to "Notes on Structured Programming" (EWD249), in the article EWD1308-5 [2]. Dick Gabriel reported his pair programming experience as "Pair programming was a common practice at the M.I.T. Artificial Intelligence Laboratory when I was there in 1972-73"
and in 1984 his team used pair programming in the Common Lisp project [Williams et al. 2003]. In 1991, Flor observed and recorded exchanges between two collaborative programmers [Flor 1991]. In 1993, Judith D. Wilson, Nathan Hoskin, and John T. Nosek [Wilson et al. 1993] of Temple University conducted a collaborative programming experiment with students. Two books published in 1995 discussed pair programming: Larry Constantine, in Constantine on Peopleware, discussed pair programming conducted at Whitesmiths Ltd., and Jim Coplien, in Pattern Languages of Program Design, claimed that pair developers can produce more than the sum of the two individual developers [McDowell et al. 2002].

[2] The article EWD1308-5 was written in 2001; EWD249 was published in 1969.

In 1996, while working on the Chrysler Comprehensive Compensation system (commonly referred to as "C3"), Kent Beck and Ron Jeffries's team adopted a new way of working, now known as Extreme Programming (XP), which employed pair programming as one of its core principles [Anderson et al. 1998]. Randall W. Jensen, of the Software Technology Support Center at Hill Air Force Base, reported his 1996 pair programming experience as follows: "The undergraduate experience led me to propose an experiment in the application of what we called two-person programming teams. The term pair programming had not been coined at that time" [Jensen 2003] [3].

[3] The paper was actually published only in 2003.

In 1998, John T. Nosek of Temple University, Philadelphia, conducted a collaborative programming (similar to pair programming) experiment [Nosek 1998]. In 1999, Kent Beck published Extreme Programming Explained; pair programming is one of the 12 core practices introduced in Extreme Programming [Beck 2000], familiarly known as XP.

Figure 2.1: Pair Programming Time Line

2.1.2.
Benefits of Pair Programming

The proponents of pair programming claim that pair programming provides the following benefits over traditional individual software development:

- Increases software quality
- Increases productivity
- Increases design quality
- Increases program correctness
- Provides constant design and code review
- Reduces overall software development time and cost
- Helps in team building, knowledge transfer, and learning
- Enhances job satisfaction and confidence
- Helps in solving complex problems
- Reduces the effort needed to develop a piece of code
- Reduces the risk of project failure
- Reduces staffing risks

2.1.3. Drawbacks of Pair Programming

While the literature lists several benefits of pair programming, detractors assert that pair programming has the following drawbacks:

- Doubles the number of developers required and the development cost
- Increases software development time
- Its quality improvement is also in question
- Not suitable for very large projects
- Suitable only for novice-novice pairs
- It is very intense
- It is good for job training, not for professional software development
- Brings out personality conflicts and clashes between developers
- Coding styles, ego, or intimidation can slow developers down
- Programming is a solitary activity
- Experienced programmers may refuse to share

2.2. Pair Programming Experiments

This section discusses 12 of the 35 published collaborative and pair programming experiments and case studies: those in which (1) a comparison was made between pair programming and individual programming, and (2) one or more software metrics were evaluated, namely program development time/cost, productivity (LOC/hr), program correctness (program readability and functionality), or job satisfaction.
The remaining 23 experiments and case studies, which did not include a pairs-versus-individuals comparison, software metrics evaluation, and/or the coding phase of the software development process, are excluded from this section. For more information, see Appendix A, which lists all the pair programming experiments and case studies published so far and the reason each excluded experiment or case study was omitted from the analysis.

2.2.1. Judith Wilson et al. Experiment [Wilson et al. 1993]

In 1993, Judith D. Wilson, Nathan Hoskin, and John T. Nosek of Temple University conducted a collaborative programming experiment with 34 upper-division undergraduate students from two sections of a database course. The 14 students from the first section acted as the control group (individuals), and the 20 students in the second section were randomly grouped into 10 experimental (pair) groups. The task was to solve a "traffic light signal problem" in 60 minutes using Pascal, C, dBase III, or pseudocode. The purpose of the study was to investigate: (1) the readability and functionality of the solutions, (2) confidence in and enjoyment of the work, and (3) which group's students earned higher grades. The results of the experiment were: (1) pairs produced slightly more readable and functional code, (2) pairs expressed more confidence and enjoyment, and (3) ability had little effect on pair performance, i.e., high grades were significantly associated with individual performance but not with pair performance. The experiment indicates that collaboration helps novice programmers, helps solve informal problems, and helps students master the analytical skills required to analyze and model problems.

2.2.2. The Nosek Experiment [Nosek 1998]

John T. Nosek of Temple University, Philadelphia, conducted a collaborative programming experiment in 1998 using 15 full-time system programmers. The subjects were divided into 5 control groups (individuals) and 5 experimental groups (pairs) on a truly random basis.
The task was to write a database consistency-check script in the C programming language in 45 minutes on an X Window System. The aim of the experiment was to find: (1) readability and functionality of the solution, (2) average problem-solving time, (3) confidence in and enjoyment of the work, and (4) how experienced programmers perform compared to less experienced programmers. The results of the experiment were: (1) pairs' programs were more readable and functional, (2) pairs took more time on average, (3) pairs expressed more confidence in and enjoyment of their job, and (4) experienced programmers performed better than inexperienced ones. The experiment indicates that collaboration improves the problem-solving process and improves programmer performance.

2.2.3. Laurie Williams's Experiment [Williams et al. 2000]

Laurie Williams from the University of Utah conducted a pair programming experiment in 1999 with 41 advanced undergraduate students in a software engineering course. The subjects were divided into 13 control groups (individuals) and 14 experimental groups (pairs). The individuals used Humphrey's Personal Software Process (PSP) and the pairs used Williams's Collaborative Software Process (CSP) to complete their tasks. The subjects were not selected randomly; instead, they were picked from among the 35 who initially indicated a preference for working collaboratively. The students were asked to code four class projects over 6 weeks' time, which was part of their course curriculum. The first project was used as a pair-jelling (initial adjustment) experiment. The aim of the study was to find: (1) the number of test cases passed, (2) average problem-solving time, (3) the number of defects in the programs, and (4) job satisfaction.
The results of the experiment were: (1) pairs' programs passed more test cases than individuals', (2) pairs spent 15% more time on average to solve a problem, (3) pairs' code had 15% fewer defects than individuals', and (4) pairs expressed more job satisfaction. [Notes: The program sizes and programming language used were not mentioned in the paper. Tuckman's model (see Appendix B for more detail) is known as pair jelling in the pair programming literature [Lui et al. 2006].]

2.2.4. Nawrocki and Wojciechowski Experiment [Nawrocki et al. 2001]

Jerzy Nawrocki and Adam Wojciechowski from the Poznan University of Technology conducted a pair programming experiment in the 1999/2000 winter semester using 21 students. The 21 subjects were randomly divided into three groups of 6, 5, and 5 in such a way that the average GPA of each group was the same. The first group used Watts Humphrey's Personal Software Process (PSP); the second and third groups used Extreme Programming (XP) as their development process. The individual group that used XP was called XP1, and the pairs group that used XP was called XP2. The students were asked to solve four C/C++ programs ranging between 150 and 400 LOC. The aim of the study was to compare Extreme Programming (XP) with Watts Humphrey's Personal Software Process (PSP). The results of the experiment were: (1) there was no difference in time between the XP1 and XP2 groups, (2) pair programming was more predictable than the other two approaches, (3) XP1 was the most efficient programming technology, and (4) there was no difference between PSP and XP2. The experiment indicates that experimentation and test-oriented thinking reduce development time, pair programming with Extreme Programming (XP) was not efficient, XP1 was more efficient than PSP, pair programming was more predictable than individual programming, and rework for XP2 was slightly smaller compared with the other two approaches.

2.2.5. Charlie McDowell et al.
Experiment [McDowell et al. 2002]

In 2000/01, Charlie McDowell, Linda Werner, Heather Bullock, and Julian Fernald from the University of California, Santa Cruz studied the effects of pair programming in an introductory programming course with approximately 600 students. A total of 172 students from the fall 2000 section were divided into 86 pairs (experimental group), and 141 students from the spring 2001 section were used as the control group (individuals). The students were asked to complete 5 programming assignments (assignment sizes and programming languages are not mentioned). The aim of the study was to find the effects of PP on performance in the course. The results of the experiment were: (1) pair programming improves program quality in terms of functionality and program readability, and (2) pair programming did not help the students learn their course material and independently apply their knowledge to new programs.

2.2.6. Rostaher and Hericko Experiment [Rostaher et al. 2002]

In 2002, Matevz Rostaher and Marjan Hericko from Slovenia conducted a pair programming experiment using 16 professional programmers. The 16 subjects were divided into 4 control groups (individuals) and 6 experimental groups (pairs) based upon their programming experience. The programmers were asked to develop a simple insurance contract administration system using six small stories in Smalltalk and its integrated development environment (IDE). The purpose of the experiment was to measure the percentage of time the programmers spent on each activity, based on their experience level. The results of the experiment were: (1) there was no difference in the average time spent by individuals and pairs, and (2) the experiment results did not favor pair programming. The experiment indicates that acceptance tests must be written before development, and that refactoring caused more problems for programmers than tests did.

2.2.7. Muller Experiments [Muller 2005]

Matthias M.
Muller, University of Karlsruhe, Germany, conducted two experiments to compare pair programming with peer review. The first experiment was conducted in 2002; in 2003 the same experiment was repeated with 38 computer science students. The 38 subjects were divided into 23 control groups (individuals), called review groups, and 19 experimental groups (pairs). In the review group, an individual programmer developed the program, compiled it, had it reviewed by an unknown reviewer, and then conducted the testing. In the pair programming group, all development activities were carried out by two programmers sitting in front of the same computer. The students were asked to solve polynomial and shuffle-puzzle problems using Java on both occasions. The purpose of the study was to find the cost of the pair programming and peer review methods. The results of the experiment were: (1) there was no difference in program correctness, and (2) for a similar level of correctness there was no difference in development cost. The experiment indicates that pair and individual programmers can be interchanged in terms of cost.

2.2.8. Vanhanen and Lassenius Experiment [Vanhanen et al. 2005]

In 2004, Jari Vanhanen and Casper Lassenius, Helsinki University of Technology, Finland, conducted a pair programming experiment using 10 computer science students. The 10 subjects were randomly divided into 2 control groups (individuals) and 3 experimental groups (pairs). For a given requirement specification, each team was asked to develop a distributed, multiplayer casino system within 400 hours using J2EE technologies. The purpose of the experiment was to investigate pair programming effects, namely productivity, defects, design quality, knowledge transfer, and enjoyment of work, at the development team level.
The results of the experiment were: (1) the productivity of pairs was 29% less than that of individuals, (2) pairs' code contained 8% fewer defects, but after delivery pairs had more defects, (3) pairs' programs were less functional than individuals' programs, (4) pairs' design quality was slightly better than individuals', (5) knowledge transfer among pairs was better, and (6) pairs expressed less job satisfaction. [Note: In the middle of the project, one pair abandoned pair programming without notice because they considered it inefficient.] The experiment indicates that pair programming did not help in solving complex tasks; pair programming helped programmers in finding and fixing errors; and the fewer defects in pairs' programs and the better knowledge transfer among pairs indicate that pair programming may decrease further development costs of the system.

2.2.9. Hulkko and Abrahamsson Experiments [Hulkko et al. 2005]

Hanna Hulkko and Pekka Abrahamsson from Finland conducted two case studies on pair programming in 2004. In the first case study, master's students were the subjects; in the second case study, master's students as well as research scientists were the subjects. There were 4 to 6 teams in each control group (individuals) and in each experimental group (pairs), and they were asked to develop four different projects ranging in size from 3700 to 7700 LOC using the Mobile-D development process. The first project was developing an Internet application using Java and JSP, and the remaining three were mobile application development using Mobile Java and Symbian C++. The purpose of the study was to find the impact of pair programming on product quality.
The results of the experiment were: (1) there was no difference in productivity between pairs and individuals, (2) pair programming is more suitable for learning and complex tasks, (3) the code produced by pair programming had lower adherence to the coding standard, (4) the readability of the programs was better in pairs' code, and (5) there was no difference in program correctness between pairs and individuals. [Note: Mobile-D is an agile development approach developed by Pekka Abrahamsson et al. [Abrahamsson et al. 2004]. In this approach, development practices are based on Extreme Programming, method scalability is based on the Crystal methodologies, and life-cycle coverage is based on the Rational Unified Process.] The experiment indicates that pair programming did not provide the benefits claimed in the pair programming literature, and that the productivity of pair programming was not consistently high.

2.2.10. Muller Experiment [Muller 2006]

Matthias M. Muller, University of Karlsruhe, Germany, conducted a pair programming experiment using 18 computer science students. The 18 subjects were randomly divided into 8 control groups (individuals) and 5 experimental groups (pairs). Due to the difficulty of the programming task, two individuals did not complete coding, so the modified control group contained only 6 individuals. The students were asked to design, code, and test an elevator control system using the Java programming language. Both the control and the experimental groups were initially paired for the design phase. Once the design was completed with a partner, the control group students were asked to code and test independently. The primary purpose of the study was to find the impact of the pair design phase on pair programming and solo programming. The results of the experiment were: (1) there was no difference in program correctness, and (2) for a similar level of correctness there was no difference in development cost.
The experiment indicates: (1) there is no difference in development cost between pair and individual programming if a similar level of program correctness is needed, and (2) since the probability of building the wrong solution is much lower for pairs, the pair programming process can be replaced by a pair design phase followed by a solo implementation phase.

2.2.11. Xu and Rajlich Experiment [Xu et al. 2006]

Shaochun Xu from Algoma University College, Laurentian University, and Vaclav Rajlich from Wayne State University conducted a pair programming case study using 12 students. The control group was formed using 4 undergraduate computer science students from Algoma University College, and the experimental group was formed using 8 undergraduate computer science students from Wayne State University. In February 2005, two pairs completed their work, and the other two pairs completed their work in June 2005. All four individuals completed their work in February 2006. The participants were asked to develop an application that computes bowling scores. The pairs were asked to develop the program using the Eclipse Java IDE along with JUnit. There were no such restrictions for the individuals, so two of the four individuals used Eclipse and the remaining two used TextPad with the JDK. The pairs were asked to use Extreme Programming (XP) and Test Driven Development (TDD), whereas the individuals were asked to use the traditional waterfall process. The primary purpose of the study was to investigate the effect of Extreme Programming and Test Driven Development on game development. The results of the experiment were: (1) the productivity of pairs was very high compared with individuals, (2) pairs' programs had better design than individuals', (3) pairs wrote better quality code than individuals, and (4) pairs' programs passed more test cases than individuals'. The experiment indicates that game developers can benefit from an XP-like approach, which includes pair programming.

2.2.12.
Erik Arisholm et al. Experiment [Arisholm et al. 2007]

Erik Arisholm, Hans Gallis, Tore Dyba, and Dag I. K. Sjoberg conducted a pair programming experiment using 295 professional programmers from Norway, Sweden, and the UK. This was a two-phase experiment: the first phase, the individual programming phase, was conducted in 2001 using 99 programmers, and the second phase, the pair programming phase, was conducted in 2004 and 2005 using 196 programmers (98 pairs). The programmers were grouped into three categories, namely junior, intermediate, and senior, based on an assessment of their Java programming experience by their project managers. The programmers were asked to add 4 new features to an existing coffee machine application using professional Java tools. The primary purpose of the study was to evaluate pair programming with respect to system complexity and programmer expertise. The results of the experiment were: (1) there was no difference in development time between pairs and individuals, (2) there was no difference in program correctness between pair and individual programs, and (3) pairs required more effort than individuals to add new features. The experiment indicates that the effect of pair programming on duration, effort, and correctness depends on system complexity and not on programmer expertise. The juniors were the beneficiaries of pair programming; there was no benefit for intermediates and seniors.

2.2.13. Summary of PP Experiments

Twelve pair programming experiments have been discussed in sections 2.2.1 through 2.2.12. A synopsis of these experiments, highlighting the name and year of the experiment, number of participants, software process used, number of problems solved, programming language used, duration of the experiment, lines of code, development methodology used, phases paired, and the experimental problem solved, is shown in Table 2.1.
Study | Year | Subjects (Ind+Pair) | Process (Ind/Pair) | #Exp | Language | Duration | LOC | Dev. Method | Pairing | Phases Paired | Problem
Wilson et al. [Wilson et al. 1993] | 1993 | Students (14+10), randomly selected | NA/NA | 1 | Pascal, C, dBase III, pseudo code | 60 min | NA | SD | SP | X | Traffic signal problem
John Nosek [Nosek 1998] | 1998 | Professionals (5+5), randomly selected | NA/NA | 1 | C | 45 min | NA | SD | SP | X | Database consistency-check script
Williams et al. [Williams et al. 2000] | 1999 | Students (13+14), not randomly selected | PSP/CSP | M | NA | 6 weeks | NA | SD | SP | X X | 4 homework assignments
Nawrocki and Wojciechowski [Nawrocki et al. 2001] | 1999/2000 | Students (5+5), randomly selected | XP/XP | M | C/C++ | NA | 150-400 | TDD | SP | X | 4 programs
McDowell et al. [McDowell et al. 2002] | 2000/2001 | Students (141+86) | NA/NA | M | NA | Semester | NA | SD | SP | X | 5 assignments
Rostaher et al. [Rostaher et al. 2002] | 2002 | Professionals (4+6) | XP/XP | 1 | Smalltalk | One day | NA | TDD | SP | X X | Six stories
Matthias Müller [Muller 2005] | 2002/2003 | Students (23+19) | XP/XP | M | Java | NA | NA | TDD | SP | X | Polynomial & shuffle puzzle
Vanhanen and Lassenius [Vanhanen et al. 2005] | 2004 | Students (2+2), randomly selected | NA/NA | 1 | J2EE | 400 hr | NA | TDD | SP | X X X | Casino system
Hulkko and Abrahamsson [Hulkko et al. 2005] | 2004 | Students & research scientists (4 to 6 + 4 to 6) | Mobile-D/Mobile-D | M | Java & JSP, Mobile Java, Symbian C++ | NA | 3700-7700 | TDD | NA | X | One Internet application, 3 mobile applications
Matthias Müller [Muller 2006] | 2004 | Students (6+5) | XP/XP | 1 | Java | NA | NA | TDD | SP | X X X | Elevator system
Xu and Rajlich [Xu et al. 2006] | 2005, 2006 | Students (4+4) | Waterfall/XP | 1 | Eclipse, JDK | NA | NA | SD/TDD | SP | X | Bowling game
Arisholm et al. [Arisholm et al. 2007] | 2001, 2004/2005 | Professionals (99+98) | NA/NA | 1 | Java tools | 8 hr | NA | NA | SP | X | Coffee machine

NA = not available; M = multiple; XP = Extreme Programming; PSP = Personal Software Process; CSP = Collaborative Software Process; SP = static pairing; DP = dynamic pairing; TDD = test-driven development; SD = standard development; D = design, C = code, T = test (each X marks a phase performed in pairs)

Table 2.1: Summary of Pair Programming Experiments

Programming efficiency, or productivity, is the measure of lines of code (LOC) produced per hour per programmer. Nawrocki and Wojciechowski [Nawrocki et al. 2001], Vanhanen and Lassenius [Vanhanen et al. 2005], and Hulkko and Abrahamsson [Hulkko et al. 2005] show that the productivity of the pair programmers was no higher than that of the individual programmers; the only exception to this is the Xu and Rajlich [Xu et al. 2006] experiment. John Nosek [Nosek 1998], Williams et al. [Williams et al. 2000], Nawrocki and Wojciechowski [Nawrocki et al. 2001], Rostaher et al. [Rostaher et al. 2002], Matthias Müller [Muller 2005], Xu and Rajlich [Xu et al. 2006], and Arisholm et al. [Arisholm et al. 2007] show that the time taken by the pair programmers to complete a task was more than the time taken by the individual programmers. Moreover, Nawrocki and Wojciechowski [Nawrocki et al. 2001] and Rostaher et al. [Rostaher et al. 2002] show that pairs took almost double the time of individual programmers. Defect density is measured in terms of the number of test cases passed [Williams et al. 2000, Xu et al. 2006] and/or relative defect density (defects/KLOC) [Williams et al. 2000, Hulkko et al. 2005]. Williams et al. [Williams et al. 2000] and Xu and Rajlich [Xu et al. 2006] show that the number of test cases passed by pairs' programs was higher than for individual programmers. Matthias Müller [Muller 2005] shows that programs written by pair groups and review groups have a similar level of correctness. Arisholm et al. [Arisholm et al. 2007] report that the pairs did not produce more correct programs than individuals. Vanhanen and Lassenius [Vanhanen et al. 2005] report that after coding and unit testing the programs written by pairs had fewer defects, whereas after system testing and bug fixing the programs written by pairs had more defects than those of individuals. Williams et al. [Williams et al.
2000] report that pairs' programs had a lower defect density, but Hulkko and Abrahamsson [Hulkko et al. 2005] show that pairs produced code with a higher defect density. Wilson et al. [Wilson et al. 1993] and John Nosek [Nosek 1998] measure code quality in terms of its functionality (the number of software components contained in the program) and readability (the number of comments the program contains), whereas Xu and Rajlich [Xu et al. 2006] measured code quality in terms of its elegance and readability. Xu and Rajlich [Xu et al. 2006] show that the programs written by pairs were more readable and elegant, but Wilson et al. [Wilson et al. 1993] and John Nosek [Nosek 1998] show that statistically there was no significant difference in readability between the individual and pair programmers' code. With respect to functionality, the John Nosek [Nosek 1998] experiment shows that pair programs were more functional, whereas in the Wilson et al. [Wilson et al. 1993] experiment the individual programmers' programs were more functional than the pairs'. Based on post-experiment surveys, the experimenters assessed the programmers' job satisfaction and confidence in their work. John Nosek [Nosek 1998], Williams et al. [Williams et al. 2000], Vanhanen and Lassenius [Vanhanen et al. 2005], Xu and Rajlich [Xu et al. 2006], and Wilson et al. [Wilson et al. 1993] show that pairs expressed satisfaction with pair programming. Wilson et al. [Wilson et al. 1993], John Nosek [Nosek 1998], and Williams et al. [Williams et al. 2000] show that pairs expressed confidence in their work when using pair programming. The results of the above-mentioned experiments with respect to the efficacy of pair programming are shown in Table 2.2.

Study | Statistical Test | Productivity | Time/Cost | Correctness | Code Quality | Satisfaction | Confidence
Wilson et al. [Wilson et al. 1993] | t-test | | | No | No | | Yes
John Nosek [Nosek 1998] | t-test | | No | | Yes | Yes | Yes
Williams et al. [Williams et al. 2000] | No statistical test† | | No | Yes | | Yes | Yes
Nawrocki and Wojciechowski [Nawrocki et al. 2001] | No statistical test | No | No | | | |
Rostaher et al. [Rostaher et al. 2002] | t-test | | No | | | |
Matthias Müller†† [Muller 2005] | Mann-Whitney test | | No | No | | |
Vanhanen and Lassenius [Vanhanen et al. 2005] | No statistical test | No | | No | | Yes |
Hulkko and Abrahamsson [Hulkko et al. 2005] | No statistical test | No | | No | | |
Xu and Rajlich* [Xu et al. 2006] | No statistical test | Yes | No | Yes | Yes | Yes |
Arisholm et al. [Arisholm et al. 2007] | ANCOVA | No | No | Yes | | |

Yes = supports PP claims (i.e., PP is more beneficial than individual programming); No = does not support PP claims (i.e., PP is not more beneficial than individual programming)
† The authors claim that they used an independent-sample t-test, but the results were neither published nor used in the paper
†† Pair programming vs. review (solo coding phase followed by two-person inspection) experiment
* Experiment to validate Extreme Programming (XP) against the waterfall method in game development

Table 2.2: Summary of Pair Programming Experiment Results

2.3. The Pairing Activity

While much of the literature explains what pair programming is, it fails to answer some key questions:

• When should one pair program?
• How should pairs be formed?
• How frequently do partners have to switch roles?
• When should partners be exchanged?
• What should the working environment look like?
• Who owns the task at hand: the pair or a person?
• Who owns the code?
• Do Extreme Programming and pair programming deny specialists?
• What is the role of programming languages and tools in pair programming?

2.3.1. When to Pair Program?

John Nosek [Nosek 1998] suggests that pair programming might be preferred over individual programming in situations such as (1) speeding up development, when the organization wants to bring its product to market earlier to gain an edge over its competitors, and (2) improving software quality, to produce a high-quality product that has a very high profit margin.
Thus pair programming is preferred when the organization needs to develop high-quality products in a short time. Matthias Müller [Muller 2005] suggests that pair programming is a viable option for developing software with fewer failures. Judith Wilson et al. [Wilson et al. 1993], Don Wells and Trish Buckley [Wells et al. 2001], Kim Lui and Keith Chan [Lui et al. 2006], and Erik Arisholm et al. [Arisholm et al. 2007] observe that novice programmers benefit from pair programming. Don Wells and Trish Buckley [Wells et al. 2001] observe that novice-novice pairs work better than expert-novice pairs, because the novices do not feel intimidated and demoralized; moreover, the novices learn from each other while solving the problem. Don Wells and Trish Buckley [Wells et al. 2001] also suggest that people with equal experience should pair in order to achieve significant productivity and morale. Studies by Jari Vanhanen and Casper Lassenius [Vanhanen et al. 2005] and Hanna Hulkko and Pekka Abrahamsson [Hulkko et al. 2005] show that pair programming helps in transferring knowledge about the system among the team members; that is, it enhances training. Studies by Hanna Hulkko and Pekka Abrahamsson [Hulkko et al. 2005], Erik Arisholm et al. [Arisholm et al. 2007], Benedicenti and Paranjape [Benedicenti et al. 2001], Becker-Pechau et al. [Pechau et al. 2003], and Gittins et al. [Gittins et al. 2001] show that pair programming is useful with complex tasks. Moreover, Erik Arisholm et al. [Arisholm et al. 2007] suggest that pair programming is effective when assigning complex maintenance tasks to junior programmers. Jari Vanhanen and Casper Lassenius [Vanhanen et al. 2005], on the other hand, show that pair programming does not help in solving complex tasks. Xu and Rajlich quote Kent Beck [Beck 2000] as stating "that pair programming (or XP) is not suitable for very large projects" [Xu et al. 2006]. Ambu and Gianneschi [Ambu et al.
2003] suggest that pair programming is not suitable with tight deadlines. Pair programming is not possible if the development team size is small [Boutin 2000]. Karl Boutin [Boutin 2000] reported that in his research and development lab the developers were forced to abandon pair programming due to a lack of resources (i.e., due to small team size). At the same time, Kent Beck [Beck 2000] suggests that XP is not possible when the development team size is more than 10. Table 2.3 summarizes the points discussed in this section.

When to pair program: need to speed up development; to improve software quality; programs requiring fewer failures; when the programmers are novices; to solve complex tasks; for job training; programmers of equal experience
When not to pair program: large projects; tight deadlines; very small team sizes and team sizes greater than 10

Table 2.3: When to Pair Program

2.3.2. Forming Pairs

According to Don Wells and Trish Buckley [Wells et al. 2001], people with equal experience should pair in order to achieve significant productivity and morale. They also suggest that an experienced-novice pair will not set up a proper pair relationship; instead it will set up only a teacher-student relationship, possibly creating a morale problem for the novice programmer. If an experienced-novice pair is tied up in a long session of pair programming, both will become uninterested, exhausted, and demoralized. They also suggest that novice programmers should be paired with other novice programmers so that both will learn from each other. Once novice programmers begin to gain confidence, they can be paired with an experienced partner.

2.3.3. Role Switching

Role switching is the process of the driver and the navigator exchanging their roles.
Kent Beck [Beck 2000] does not directly say anything about switching roles in the pair programming definition, but implied as much with "Set up your desks so two people can sit side by side and shift the keyboard back and forth without having to move their chairs" when describing the development activity. Matevz Rostaher and Marjan Hericko [Rostaher et al. 2002] suggest that a role-switching rhythm (a high frequency of role switching, more than 20 times per day, and short phases of uninterrupted activity, 5 minutes on average) is essential for test-first pair programming. According to William Wake [Wake 2002], role switching can be done every couple of minutes or a few times an hour. Robert Martin suggests that whenever the driver gets tired or stuck, the navigator should take over the driver's job; this normally happens several times an hour. Matevz Rostaher and Marjan Hericko [Rostaher et al. 2002] observed that role switching occurred 21 times per day on average for all programmers and 42 times per day on average for experienced programmers. They also observed that uninterrupted activity lasted 5 minutes on average for all programmers and 3 minutes for experienced programmers. Lippert et al. [Lippert et al. 2001] observed that the physical working environment (seating arrangement) plays a crucial part in role switching: a conventional seating arrangement hinders frequent role switching, and once the seating is rearranged, pairs switch their roles more frequently (the seating arrangement is discussed in more detail in section 2.3.5).

2.3.4. Partner Exchange

The main idea behind rotating developers among different pairs is to spread the system knowledge to every member of the development team. Kent Beck [Beck 2000] says "Pairing is dynamic," meaning that people have to pair with different people in the morning and evening sessions, and a programmer can pair with anyone in the development team.
William Wake [Wake 2002] suggests that developers have to exchange their partners every day, and some developers will exchange their partners more often depending upon the situation. Robert Martin [Martin 2003] suggests that every member of the development team should try all the activities of the current iteration and should partner with every member of the team. He also suggests that every programmer has to work in at least two different pairs.

2.3.5. Workplace Layout

To emphasize the importance of the workplace layout for pair programming's success in the DaimlerChrysler C3 project, Kent Beck [Beck 2000] writes, "I was brought in because of my knowledge of Smalltalk and objects, and the most valuable suggestion I had was that they should rearrange the furniture". According to Kent Beck [Beck 2000], a reasonable workplace is important for any project's success. Kent Beck [Beck 2000] and Lippert et al. [Lippert et al. 2001] suggest that the physical environment (i.e., the desk and seating arrangement) plays a critical role in pair programming. This was confirmed by the result of the survey conducted by Laurie Williams and Robert Kessler [Williams et al. 2000b], in which 96% of the programmers agreed that a proper workplace layout was critical to their pair programming success. Lippert et al. [Lippert et al. 2001] also observed that the conventional seating arrangement hindered frequent role switching, and once the seating was rearranged, the pairs switched their roles more frequently. For the success of pair programming, developers need to communicate with their partners and with other members of the team as well [Beck 2000, Williams et al. 2003]. The pair programming layout must be arranged in such a way that it allows inter-pair and intra-pair communication. Kent Beck [Beck 2000] defines the working environment for pair programming as follows: "Common office layouts don't work well for XP.
Putting your computer in a corner, for example, doesn't work, because it is impossible for two people to sit side-by-side and program. Ordinary cubicle wall heights don't work well; walls between cubicles should be half-height or eliminated entirely. At the same time, the team should be separated from other teams". "One big room with little cubbies around the outside and powerful machines on tables in the middle is about the best environment I know". The DaimlerChrysler C3 work area [Beck 2000] is shown in Figure 2.2. Six computers were placed on two large tables, and pairs were allowed to sit at any available machine.

Figure 2.2: The DaimlerChrysler C3 work area [Beck 2000]

According to Laurie Williams and Robert Kessler [Williams et al. 2000b, Williams 2003], pair programmers should be able to slide the keyboard and mouse back and forth without moving their chairs. Two pair programming layouts are shown in Figure 2.3; this layout [Wiki] was contributed by Beck and Cunningham [Williams et al. 2000b]. Laurie Williams and Robert Kessler [Williams et al. 2000b] preferred the layout on the right over the layout on the left.

Figure 2.3: Pair Programming Workplace Layout [Wiki]

To facilitate inter-pair and intra-pair communication, RoleModel Software, Holly Springs, NC, developed a workstation layout in which 6 tables are arranged as shown in Figure 2.4 [Williams et al. 2003].

Figure 2.4: RoleModel Software Workstation Layout [Williams et al. 2003]

When Lippert et al. [Lippert et al. 2001] started developing their JWAM framework using Extreme Programming (XP), they began programming using the conventional working layout consisting of desks with fixed cabinets at their sides, as shown in Figure 2.5. Although this layout permitted them to do pair programming, they found that role switching was not easy.
Once they realized that, due to this physical environment, role switching occurred only a few times per day, they rearranged the furniture as shown in figure 2.6, which in turn increased their role-switching activity. From their experience, however, they suggest that the "circle table" layout shown in figure 2.7 would be a better choice for pair programming. Lippert et al. [Lippert et al. 2001] have not provided reasoning for this proposed layout, and the physical layout has not been tested.

Figure 2.5: Conventional Environment [Lippert et al. 2001]
Figure 2.6: Rearranged Environment for Better Role Switching [Lippert et al. 2001]
Figure 2.7: "Circle table" for pair programming [Lippert et al. 2001]

2.3.6. Task Responsibility

In pair programming, two programmers write code for a user story. Pairing is a dynamic activity, in which a developer may need to pair with more than one developer to finish the task at hand. This raises the question, "Who is responsible for the task at hand?" If a task needs special technologies, such as GUI or database work, who is responsible for carrying it out? According to William Wake [Wake 2002], a single developer owns the task at hand. The developer responsible for the task may partner with one person for one aspect of the task and with someone else for another aspect. Robert Martin [Martin 2003] clearly indicates that no programmer is responsible for, or has authority over, any technology; everybody has to work in all technologies.

2.3.7. Code Ownership

Since the code for a task is written by many developers on the development team, no individual developer has ownership rights. The entire team owns the code, i.e., collective code ownership [Beck 2000, Wake 2002].

2.3.8. Does XP/PP Deny Specialists?

Robert Martin [Martin 2003] states, "This doesn't mean that XP denies specialists.
If your specialty is GUI, you are most likely to work on GUI tasks, but you will also be asked to pair on middleware and database tasks. If you decide to learn a second specialty, you can sign up for tasks and work with specialists who will teach it to you. You are not confined to your specialty."

2.3.9. Role of Programming Languages and Tools in PP

Jerzy Nawrocki and Adam Wojciechowski [Nawrocki et al. 2001] suggest that pair programming as described by Extreme Programming is less efficient than reported by earlier researchers. From Table 2.4 it is apparent that pair programming experiments conducted using Extreme Programming (XP) do not support the claims of pair programming. This confirms Jerzy Nawrocki's and Adam Wojciechowski's [Nawrocki et al. 2001] claim that XP tailored for single-person use produces better results than XP used with pair programming. Looking closer at the results of the pair programming experiments listed in Table 2.4, it is clear that pairs do not outperform individual programmers when the same working environment or software process is provided to all programmers. Moreover, XP with modern object-oriented programming languages such as Smalltalk and Java seems to be less effective for pair programming. This may be due to the modern compilers and/or development environments and tools available to programmers; e.g., the navigator role may be effectively replaced, or even surpassed, by modern compilers and IDEs. Table 2.5 likewise suggests that the advantage of having a navigator (an extra pair of eyes, or an extra brain) for continuous code review in pair programming has been diminished by the arrival of modern programming languages and professional development tools. From Table 2.6, we can observe that pair programming implemented with Test Driven Development (TDD), as prescribed by XP, does not outperform individual programming.
This may be due to the TDD used in XP, which has developers define the exact functionality of a method before its actual implementation. This means that every developer knows in advance exactly what he/she is going to implement; in this way, every developer is capable of implementing the module by himself, without the help of a partner.

Study                                               Process (Ind.)  Process (Pair)  Language                              Result
Williams et al. [Williams et al. 2000]              PSP             CSP             C++                                   Supports PP claims
Xu and Rajlich [Xu et al. 2006]                     Waterfall       XP              Eclipse, JDK                          Supports PP claims
Hulkko and Abrahamson [Hulkko et al. 2005]          Mobile D        Mobile D        Java & JSP, Mobile Java, Symbian C++  Does not support PP claims
Nawrocki and Wojciechowski [Nawrocki et al. 2001]   XP              XP              C/C++                                 Does not support PP claims
Rostaher et al. [Rostaher et al. 2002]              XP              XP              Smalltalk                             Does not support PP claims
Matthias Müller [Muller 2005]                       XP              XP              Java                                  Does not support PP claims

Table 2.4: Effects of Software Processes on PP

Language                 Study                                               Result
Pascal, C/C++            Wilson et al. [Wilson et al. 1993]                  Supports PP claims
                         John Nosek [Nosek 1998]                             Supports PP claims
                         Williams et al. [Williams et al. 2000]              Supports PP claims
                         Nawrocki and Wojciechowski [Nawrocki et al. 2001]   Does not support PP claims
Smalltalk                Rostaher et al. [Rostaher et al. 2002]              Does not support PP claims
Java                     Matthias Müller [Muller 2005]                       Does not support PP claims
                         Xu and Rajlich† [Xu et al. 2006]                    Supports PP claims
                         Hulkko and Abrahamson [Hulkko et al. 2005]          Does not support PP claims
Professional Java Tools  Vanhanen and Lassenius [Vanhanen et al. 2005]       Does not support PP claims
                         Arisholm et al. [Arisholm et al. 2007]              Does not support PP claims

† The main aim of this experiment was to evaluate Extreme Programming (XP) against the waterfall model in game development; it was not a pair programming versus individual programming experiment.

Table 2.5: Effects of Programming Languages on PP

Development Method       Study                                               Process (Ind.)  Process (Pair)  Result
Standard Development     Wilson et al. [Wilson et al. 1993]                  NA              NA              Supports PP claims
                         John Nosek [Nosek 1998]                             NA              NA              Supports PP claims
                         Williams et al. [Williams et al. 2000]              PSP             CSP             Supports PP claims
                         Vanhanen and Lassenius [Vanhanen et al. 2005]       NA              NA              Does not support PP claims
Test Driven Development  Rostaher et al. [Rostaher et al. 2002]              XP              XP              Does not support PP claims
                         Matthias Müller [Muller 2005]                       XP              XP              Does not support PP claims
                         Hulkko and Abrahamson [Hulkko et al. 2005]          Mobile D        Mobile D        Does not support PP claims
                         Nawrocki and Wojciechowski [Nawrocki et al. 2001]   XP              XP              Does not support PP claims

Table 2.6: Effects of Software Development Methods on PP

2.4. The Effect of Pair Programming on Software Development Phases

One of the basic requirements of pair programming is that all production code must be programmed by pairs, which in turn doubles the number of developers required to complete a project and nearly doubles the development cost. On its face this is a waste of resources, though the proponents of pair programming claim that "pair programming increases initial development time but saves time in the long run because there are fewer defects" [Cockburn et al. 2000]. Up to now there is no empirical evidence for this claim. Because the skills required to carry out the various phases of the software process differ, there is no guarantee that pair programming will produce the same results in all phases. The results of the Hanna Hulkko and Pekka Abrahamson [Hulkko et al. 2005] case studies suggest that pair programming was most useful at the beginning of the project, that the pair programming effort steadily decreased in subsequent iterations, and that it increased again in the final iteration (defect correction after system test).
The main aim of this section is to explore whether pairing up developers is required in all phases of software development, or whether there is an alternative way to minimize the pair-up time between developers, in order to maximize resource utilization and reduce development cost.

2.4.1. Pair Design

Due to the asymmetrical nature of the design and code phases, we cannot expect all the benefits of pair-coding to apply to pair-design as well [Canfora et al. Sep 06]. Various studies highlight the benefits of pair-design. According to Laurie Williams et al. [Williams et al. 2000], pair-analysis and pair-design are more critical than pair-implementation, and pair-analysis and pair-design are critical for pair success. They also state, "It is doubtless true that two brains are better than one when performing analysis and design." Emilio Bellini et al. [Bellini et al. 2005] found that pair-design was more predictable than individual design and helped the developers understand the system while developing it. This learned knowledge about the system can help developers complete the project with less rework. The pair-design experiment conducted by Gerardo Canfora et al. [Canfora et al. Sep 06] suggests that pair-design also produces the anticipated benefits of pair-coding. Their experimental results show that pairs produced better designs in less time than individuals. Moreover, with respect to effort and quality, pair design was more predictable than individual design (i.e., the standard deviation of the pair metrics was smaller than that of the solos). They also suggest that industry can use pair design in critical situations, as well as in situations with short deadlines, lack of resources, or lack of skilled personnel. A second pair-design experiment by Gerardo Canfora et al. [Canfora et al. Dec 06] suggests that pair design slows down the task but improves quality.
They also found that the quality of pair design was more predictable than individual design quality (i.e., the standard deviation obtained by pairs was smaller than that of solos). Matthias M. Muller [Muller 2006] conducted a pair programming experiment using 18 computer science students, randomly divided into 8 control groups (individuals) and 5 experimental groups (pairs). The students were asked to design, code, and test an elevator control system using Java. Both control and experimental groups were initially paired for the design phase. Once the design was completed with the partner, the control group students were asked to code and test independently. The results show that the costly pair programming process (design, code, and test) can be replaced by a less expensive process of a pair-design phase followed by individual code and test phases. On the other hand, Hiyam Al-Kilidar et al. [Al-Kilidar et al. 2005] found the effects of pair work on design quality to be mixed. In the first module, pairs produced better quality designs than solos. In the second module, the pairs and solos interchanged their roles: solos became pairs and pairs became solos. There was no significant difference in design quality between pairs and solos. Pairs produced slightly better designs than individuals in Jari Vanhanen's and Casper Lassenius's [Vanhanen et al. 2005] experiment. In Xu's and Rajlich's experiment [Xu et al. 2006], pairs developed better designs than individuals. The pair-design experiments are summarized in Table 2.7.

Study                                              Result
Emilio Bellini et al.* [Bellini et al. 2005]       Pair design was more predictable than individual design; knowledge transfer about the system was higher among pairs than solos
Hiyam Al-Kilidar et al.* [Al-Kilidar et al. 2005]  Mixed results on design quality
Vanhanen and Lassenius† [Vanhanen et al. 2005]     Pairs produced slightly better designs than individuals
Gerardo Canfora et al.* [Canfora et al. Sep 06]    Pair design better than individual design; pairs took less time than individuals; pair design more predictable than individual design
Gerardo Canfora et al.* [Canfora et al. Dec 06]    Pair design better than individual design; pairs took more time than individuals; pair design more predictable than individual design
Matthias Muller† [Muller 2006]                     Pair programming can be replaced by pair design followed by individual code and test
Xu and Rajlich† [Xu et al. 2006]                   Pair programs had better designs than individual programs

* These experiments had only a design phase; there were no coding and testing phases.
† These were pair programming experiments that included a design phase.

Table 2.7: Summary of Pair Design Experiments

From the work to date, we can conclude the following:
- Pair design improves design quality.
- Pair design is more predictable than individual design in terms of effort and quality.
- Results on development time for pair design versus individual design are mixed.
- Pair programming can be replaced with a pair-design phase followed by individual code and test phases in order to reduce cost.

2.4.2. Pair Coding

Pair-coding in Extreme Programming is essentially pair programming itself. Laurie Williams and Robert Kessler [Williams et al. 2000] claim that pair-analysis and pair-design are more critical than pair-implementation. They also report that for simple and routine work, pairs split the work and do it individually more effectively than when they work as pairs. In addition, programmers report that for detail-oriented tasks, such as GUI drawing, the partners in a pair do not help much. Many researchers, including Williams et al. [Williams et al. 2000], Muller and Tichy [Muller et al. 2001], Lui and Chen [Lui et al. 2003], Hulkko and Abrahamsson [Hulkko et al. 2005], and Erik Arisholm et al. [Arisholm et al.
2007] report that pair programming is useful only for complex tasks, not for simple and routine tasks. With respect to program quality (in terms of functionality and readability), pair programming experiments show mixed results. Wilson et al. [Wilson et al. 1993], John Nosek [Nosek 1998], McDowell et al. [McDowell et al. 2002], and Xu and Rajlich [Xu et al. 2006] show that pairs produced better quality code than individuals, whereas Vanhanen and Lassenius [Vanhanen et al. 2005] and Hulkko and Abrahamson [Hulkko et al. 2005] show that individuals produced better quality code than pairs. Regarding program correctness (i.e., the number of test cases passed), pair programming experiments again registered mixed results. Williams et al. [Williams et al. 2000] and Xu and Rajlich [Xu et al. 2006] show that pairs' programs passed more test cases, whereas Matthias Müller [Muller 2005], Hulkko and Abrahamson [Hulkko et al. 2005], Matthias Müller [Muller 2006], and Arisholm et al. [Arisholm et al. 2007] show no difference in program correctness between pair and individual programs. Almost all experiments show that pairs spend more time than individuals, indicating that pair-coding is a rather slow and expensive technology. Regarding pair-coding, we can conclude:
- The pair-coding phase is not as important as the pair-design phase.
- Pair coding is slow and expensive.
- Pair coding is useful only for complex tasks, not for simple and/or routine tasks.
- Empirical evidence is mixed regarding program quality.
- Empirical evidence is mixed regarding program correctness.

2.4.3. Pair Testing

Laurie Williams et al. [Williams et al. 2000] claim that pair-testing is the least critical phase in the pair programming process and that pairs can split up to run test cases on two computers as long as defects are identified.
Hulkko and Abrahamson [Hulkko et al. 2005] show that the relative amount of effort spent on the defect correction phase of the project (performed after system test) is very high. Jari Vanhanen and Casper Lassenius [Vanhanen et al. 2005] observed that pairs write code with fewer defects but are less careful in system testing. They suggest that unless pairs do careful system testing, the benefit (fewer defects) they obtain in the coding phase of pair programming will be lost. In their study, pairs delivered systems with more defects than individual programmers, because the individuals found and removed more defects before delivery than the pairs did.

2.5. Alternatives to Traditional Pair Programming [Confer 2009]

Collaborative-Adversarial Pair (CAP) programming is a variant of the pair programming concept advocated by many agile techniques. CAP was developed at Auburn University several years ago as part of a commercial cell-phone software project. In 2003, Dr. David Umphress was asked by Rocket Mobile, Inc., a west-coast firm that specializes in cell phone software development, to reverse engineer one of their BREW products and rewrite it in JME. The effort was directed by Dr. David Umphress, and the team consisted of two doctoral students, Brad Dennis and William "Amos" Confer, who each had six or seven years of industrial software development experience. The team purposely adopted an XP-like process because they believed it gave them the greatest visibility into the project and because it allowed them to deliver the product to the customer in increments for reliability testing. The team quickly determined that pair programming was not working. Both developers were highly independent, and each felt he knew best how to build the code. Moreover, they worked different parts of the day: one developer was a morning person and the other was a night person. They overlapped two hours a day, at best.
Over the first month of the project, the team evolved the idea of the collaborative-adversarial pair as the most realistic way to produce reliable software. After the initial development, Amos and Dr. Chapman used it in the senior capstone design course that is part of the Bachelor of Software Engineering and Bachelor of Wireless Engineering programs. The Collaborative-Adversarial Pair (CAP) programming process employed a synchronize-and-stabilize approach to development.

3. RESEARCH DESCRIPTION

The primary purpose of this research is to create and formally define a stable and reliable agile software development methodology called Collaborative-Adversarial Pair (CAP) programming. We see CAP as an alternative to traditional pair programming in situations where pair programming is not beneficial or is not possible to practice. The primary objectives of this research are:
- To identify the pair programming process, as well as the effectiveness, advantages, and disadvantages of pairs.
- To define the Collaborative-Adversarial Pair (CAP) process, whose objective is to exploit the advantages of pair programming while at the same time downplaying its disadvantages.
- To evaluate Collaborative-Adversarial Pair (CAP) programming against pair programming and traditional individual programming in terms of productivity, correctness, and job satisfaction.

3.1. The CAP Process [Umphress 2008]

The Collaborative-Adversarial Pair (CAP) programming process employs a synchronize-and-stabilize approach to development. As shown in Figure 3.1, features are grouped into prioritized feature sets; the sets are then built in a series of software cycles, one set per cycle.

Figure 3.1: CAP Development Activity

The CAP development cycle is shown in Figure 3.2. Each cycle starts with the entire project team reviewing the features to be built. It is here that customer requirements are translated into product requirements by converting user stories into "developer stories,"
which are essentially manageable units of work that map to user stories. Progress is tracked by two measures: the ratio of the number of user stories built to the total number of user stories, and the ratio of developer stories completed to the total number of developer stories to be built in the cycle. The first measure expresses progress to the customer; the second tracks internal progress. After the feature review, the team moves into collaborative-adversarial mode (see Figure 3.3). The developers work together collaboratively to identify how to architect and design the features. They use this time to clarify requirements and discuss strategy. They then walk through their design with the overall project leader. After the design is approved, they move into adversarial roles. One developer is assigned the responsibility of implementing the design, and the other developer is given the task of writing black-box test cases for the various components. The goal of the implementer is to build unbreakable code; the goal of the tester is to break the code. Note that the implementer is still responsible for writing unit-level white-box tests as part of his development efforts (see Figure 3.4). Once both developers have completed their tasks, they run the code against the tests. Upon discovering problems, the pair resumes their adversarial positions: the tester verifies that the test cases are valid, and the implementer repairs the code and adds a corresponding regression unit test. In some cases the test cases themselves are not valid, and they are fixed by the tester. At the conclusion of the test phase, the team moves to a post-mortem step. Here, the team (including the project manager) reviews the source code and the test cases.
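The two progress measures described above can be sketched as a short computation. This is a minimal illustration only; the class and method names are hypothetical and not part of the CAP definition:

```java
// Minimal sketch of the two CAP progress measures.
// Class and method names are hypothetical, not part of the CAP definition.
public class CapProgress {

    // Customer-facing progress: user stories built / total user stories.
    public static double customerProgress(int userStoriesBuilt, int totalUserStories) {
        return (double) userStoriesBuilt / totalUserStories;
    }

    // Internal progress: developer stories completed / developer stories planned for the cycle.
    public static double internalProgress(int devStoriesCompleted, int devStoriesPlanned) {
        return (double) devStoriesCompleted / devStoriesPlanned;
    }

    public static void main(String[] args) {
        // A cycle with 3 of 12 user stories built and 7 of 10 developer stories completed:
        System.out.println(customerProgress(3, 12));  // 0.25
        System.out.println(internalProgress(7, 10));  // 0.7
    }
}
```

Keeping the two ratios separate reflects the distinction drawn above: the first is reported to the customer, while the second is used only to track the team's internal cycle progress.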
The purpose of the review is to 1) ensure that the test cases are comprehensive and 2) identify portions of the code that are candidates for refactoring; the purpose is not to find bugs, so the team does not walk through the code at a statement-by-statement level. Statement-level walkthroughs have been found to be so tedious that participants quickly become numb to any problems. It is assumed that the majority of defects are caught in the black-box functional tests or in the white-box unit tests. Any gaps in the test cases are captured as additional developer stories; refactoring tasks are captured likewise. These developer stories receive a high enough priority that they are among the first tasks completed in the subsequent software development cycle. A new development cycle then begins following the post-mortem step.

Figure 3.2: CAP Development Cycle
Figure 3.3: Collaborative-Adversarial Pairs (CAP)
Figure 3.4: Build Code / Unit Implementation in CAP

3.1.1. Design

CAP uses Class-Responsibility-Collaborator (CRC) cards to design the software. CRC cards, a brainstorming tool widely used in the design of object-oriented software, were introduced by Ward Cunningham [Beck et al. 1989]. CRC cards are usually created from 4" x 6" index cards and are used to determine which classes are needed and how they will interact. A CRC card contains the following information:
1. The class name.
2. Its superclass.
3. The responsibilities of the class.
4. The names of the other classes with which the class will collaborate to fulfill its responsibilities.
Figure 3.5 illustrates a template CRC card.

Class Name:
Super Class Name:
Responsibilities | Collaborators

Figure 3.5: A Class-Responsibility-Collaborator (CRC) index card

3.1.2. Black Box Test Cases

In functional testing (or behavioral testing), every program is considered to be a function that maps values from its input domain to values in its output range.
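This input/output view can be made concrete with a small sketch. The function and specification below are hypothetical examples, not part of CAP itself; the point is that each test case is derived from the stated mapping of inputs to outputs, never from the function's source code:

```java
// Black-box (functional) testing sketch: test cases come only from the specification.
// Hypothetical spec: roundUpToMultipleOfTen(n) returns the smallest multiple of 10
// that is greater than or equal to a non-negative n.
public class BlackBoxSketch {

    // Some implementation satisfying the spec; its internals are irrelevant to the tests.
    static int roundUpToMultipleOfTen(int n) {
        return ((n + 9) / 10) * 10;
    }

    public static void main(String[] args) {
        // Each case is chosen from the input domain / output range alone:
        System.out.println(roundUpToMultipleOfTen(41));  // typical value -> 50
        System.out.println(roundUpToMultipleOfTen(50));  // boundary: already a multiple -> 50
        System.out.println(roundUpToMultipleOfTen(0));   // boundary: zero -> 0
    }
}
```

Any other implementation satisfying the same specification would pass the same three cases, which is exactly the independence from implementation that functional testing provides.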
Functional testing is also called black box testing, because the testing does not depend on the content or implementation of the function. Black box testing is based entirely on the external specifications (i.e., the inputs and outputs) of the function and is usually data driven. With functional testing, test cases are developed only from external descriptions of the software, including specifications, requirements, and design. Functional test cases have two distinct advantages:
1. They are independent of the software implementation. Implementation changes do not affect the test cases, and vice versa.
2. They can be developed in parallel with the implementation, which in turn reduces the overall project development interval.
Functional test cases may suffer from two drawbacks:
1. There may be redundancy among the developed test cases.
2. Portions of the software may go untested.

3.1.3. Unit Implementation

Implementation refers to programming and is intended to satisfy the requirements in the manner specified by the detailed design. A unit (or software component, or module) is the smallest part of the implementation that will be separately maintained. Normally a unit or software component is a set of collaborating classes; in some cases, a component may contain a single class. The unit implementation procedure in CAP, which follows the Test-Driven Development (TDD) approach, is given below:
1. Write a unit test.
2. Compile the test.
- It should fail to compile, because the code that the test calls has not been implemented.
3. Implement the methods / write the code.
- Refactor first if necessary.
- Do not compile yet.
- Follow the coding standard.
- Code in a manner that is easiest to verify.
4. Self-inspect the code.
- Do not compile/execute yet.
- Be convinced that the code does the required job (the compiler will never do this, because it merely checks the syntax).
- Fill out the code inspection checklist.
-
Record the time and defect logs.
5. Compile the code.
- Repair syntax defects.
- Record the time and defect logs.
6. Run the test and see it pass.
7. Refactor for clarity and to remove duplication.
8. Repeat from the top.

3.1.3.1. Unit Test

Unit testing is used to verify a software component or module of the software design. Because a component is not a stand-alone program, driver and/or stub software must be developed for each unit test. The unit test environment is shown in figure 3.6. A driver is a main program that accepts test case data, passes the data to the component being tested, and prints the relevant results. A stub is a dummy subprogram that replaces modules subordinate to (called by) the component being tested. It uses the subordinate module's interface, may do minimal data manipulation, provides verification of entry, and returns control to the module undergoing testing. To simplify unit testing, the designed component must be highly cohesive: when a component addresses only one function, the number of test cases is reduced, and errors can be more easily predicted and uncovered.

Figure 3.6: Unit Test Environment

3.1.4. Testing in CAP vs. PP

The pair programming methodology uses the white box testing strategy, which has the following drawbacks:
1. Since white box test cases are developed from program source code, there is no way to recognize whether all the specified behaviors have been implemented.
2. It is very difficult to employ white box testing on purchased or contracted software, because its internal structure is unknown.
On the other hand, black box techniques alone are not sufficient to identify all the test cases; indeed, both white box and black box approaches are needed. By combining the black box and white box testing techniques, we get the following benefits:
1. The redundancy and gap problems of black box testing can be recognized and resolved.
2.
White box testing aids in identifying behaviors that are not in the specification (such as a virus), which black box functional testing will never reveal.
The CAP testing procedure judiciously combines functional (black box) and structural (white box) testing to provide the confidence of functional testing and the measurement of structural testing.

3.1.5. Refactoring

Refactoring is the process of changing software's internal structure, in order to improve design and readability and reduce bugs, without changing its observable behavior. Martin Fowler [Fowler 1999] suggests that refactoring should be done in three situations: when adding new function to the software, when fixing a bug, and when reviewing the code (i.e., whenever a new idea arises during code review, or when the code is identified as being too complex). The first two cases are covered by the refactoring step of the unit implementation procedure. Since CAP incorporates a code review session after integration and test, an additional refactoring phase is necessary. Refactoring also helps developers review someone else's code and helps the code review process produce more concrete results [Fowler 1999].

4. APPLIED RESULTS AND RESEARCH VALIDATION

Two empirical experiments were conducted during fall 2008 and spring 2009 to validate CAP against traditional pair programming and individual programming. The subjects used Eclipse and JUnit to perform three programming tasks with different degrees of complexity.

4.1. Subjects

Forty-two (42) volunteer students from the Software Process class, a combined class of undergraduate seniors and graduate students, participated in the study. All participants had already taken software modeling and design (using UML) and computer programming courses such as C, C++, and Java.
Of fourteen students, eleven had one to five years of industrial programming experience, two had less than one year or no programming experience, and one had more than five years of programming experience. Four students had prior pair programming experience.

4.2. Experimental Tasks

The subjects were asked to solve the following three programming problems in Java (Test Driven Development using Eclipse):

Problem 1: Write a program which reads a text file and displays the name of the file, the total number of occurrences of a user-input string, the total number of non-blank lines in the file, and the number of lines of code counted according to the LOC Counting Standard used in PSP, the Personal Software Process [Humphrey 2005]. You may assume that the source code adheres to the LOC Coding Standard; this assignment should not determine whether the coding standard has been followed. The program should be capable of sequentially processing multiple files by repeatedly prompting the user for file names until the user enters a file name of "stop". The program should issue the message "I/O error" if the file is not found or if any other I/O error occurs.

Problem 2: Write a program to list information (name, number of methods, type, and LOC) about each proxy in a source file. The program should also produce an LOC count of the entire source file. Your program should accept as input the name of a file that contains source code. You are to read the file and count the number of lines of code according to our LOC Counting Standard. You may assume that the source code adheres to the LOC Coding Standard; this assignment should not determine whether the coding standard has been followed. The exact format of the application-user interaction is up to you.
- A "proxy" is defined as a recognizable software component. Classes are typical proxies in object-oriented systems; subprograms are typical proxies in traditional functionally-decomposed systems.
-
If you are using a functionally-decomposed (meaning, non-OO) approach, the number of methods for each proxy will be "1". If you are using an OO approach, the number of methods will be a count of the methods associated with an object.

Problem 3: Write a program to calculate the planned number of lines of code given the estimated lines of code (using PSP's PROBE Estimation Script). Your program should accept as input the name of a file. Each line of the file contains four pieces of information separated by a space: the name of a project and its estimated LOC (LOCe), planned LOC (LOCp), and actual LOC (LOCa). Read this file and echo the data to the output device. Accept as input from the keyboard a number which represents the estimated size (E) of a new project. Output the calculations of each decision and the resulting planned size (P), as well as the PROBE decision designation (A, B, or C) used to calculate P. For each decision, indicate why it is or is not valid. The exact format of the application-user interaction is up to you.
- Your software should gracefully handle error conditions, such as non-existent files and invalid input values.
- Round P up to the nearest multiple of 10.

4.3. Hypotheses

H0 1 (Time/Cost, Overall): The overall software development cost of CAP is, on average, equal to or higher than that of PP.
Ha 1 (Time/Cost, Overall): The overall software development cost of CAP is, on average, less than that of PP.
H0 2 (Time/Cost, Overall): The overall software development cost of CAP is, on average, equal to or higher than that of individual programming.
Ha 2 (Time/Cost, Overall): The overall software development cost of CAP is, on average, less than that of individual programming.
H0 3 (Time/Cost, Coding): The cost of the CAP coding phase is, on average, equal to or higher than the cost of the PP coding phase.
Ha 3 (Time/Cost, Coding): The cost of the CAP coding phase is, on average, less than the cost of the PP coding phase.
H0 4 (Time/Cost, Coding): The cost of the CAP coding phase is equal to or higher than the cost of the individual programming coding phase on average.
Ha 4 (Time/Cost, Coding): The cost of the CAP coding phase is less than the cost of the individual programming coding phase on average.
H0 5 (Correctness): The number of acceptance tests failed in CAP is equal to or higher than the number of acceptance tests failed in PP on average.
Ha 5 (Correctness): The number of acceptance tests failed in CAP is less than the number of acceptance tests failed in PP on average.

4.4. Cost

To study the cost of overall software development, we compared the total development time, measured in minutes, across all the phases. Both pair programming (PP) and individual programming (IP) consisted of design, coding, and test phases; CAP added a test case development phase to the PP phases. The IP, PP, and CAP total software development costs were calculated with the following formulas (a term is doubled wherever both partners work together):

Cost_IP  = Time_Design + Time_Coding + Time_Test
Cost_PP  = 2 * (Time_Design + Time_Coding + Time_Test)
Cost_CAP = 2 * (Time_Design + Time_Test) + Time_Coding + Time_TestCaseDevelopment

To study the cost of the coding phase, we compared the coding time, measured in minutes. The IP, PP, and CAP coding phase costs were calculated with the following formulas:

CodingCost_IP  = Time_Coding
CodingCost_PP  = 2 * (Time_Coding)
CodingCost_CAP = Time_Coding

4.5. Program Correctness

To study program correctness, we compared the number of post-development test cases (black-box test cases developed from the specifications) passed by the programs developed by the IP, PP, and CAP groups.

4.6. Experiment Procedure

1. Consent Process: At the beginning of the course, in both fall 2008 and spring 2009, the informed consent form for the project, approved by the Auburn University Institutional Review Board (IRB), was handed out, and students were given the chance to volunteer to participate.
The researcher provided information to students about the project, handed out consent forms, answered any questions raised by the students, and requested that the forms be returned the following class, so students had at least one intervening day to review all aspects of consent. The researcher returned the following class, answered any remaining questions, and collected the consent forms.

2. Pre-Test: In the pre-test, all the subjects were asked to solve two programming problems individually in order to measure their programming skills.

3. Pre-Experiment Survey: Each subject was asked to complete a survey questionnaire which collected demographic information such as age, class level (senior/graduate), programming languages known, experience level, and pair programming experience.

4. Assigning the Subjects to Experimental Groups: Based on the pre-test's results and the survey, the subjects were divided into groups of five. The subjects were randomly selected from each group and assigned to the three experimental groups: the individual programming (IP) group, the pair programming (PP) group, and the collaborative-adversarial pair (CAP) programming group.

5. Workshop: Before the actual control experiments started, a workshop was held for all the subjects. First, a lecture explained the concepts of collaborative-adversarial pair programming, pair programming, unit testing, and acceptance testing. Then a pair programming practice session (known as a pair-jelling exercise) was conducted, which enabled the programmers to become familiar with pair programming practices.

6. Control Experiments:
a. Control Experiment-1 (Dynamic Pairs): Three programming exercises were given to each experimental group. The subjects in both the PP group and the CAP group were randomly paired up with a partner in their own group to do the first problem.
After the first problem, the pairs rotated within their own group (i.e., a PP pair interchanged partners with another PP pair, and a CAP pair interchanged partners with another CAP pair). The new, rotated pairs completed the second problem. The group's pairs rotated once again to do the third problem.

b. Control Experiment-2 (Static Pairs): Three programming exercises were given to each experimental group. The subjects in both the PP group and the CAP group were randomly paired up with a fixed partner to do all three exercises. The subjects in the IP group were asked to complete all three exercises alone.

Figure 4.1 summarizes the experimental procedure.

Figure 4.1: Experimental Procedure

The design of the experiments is shown in Figure 4.2.

Figure 4.2: Experimental Setup

4.7. RESULTS

4.7.1. Statistical Test Selection

Statistical tests are of two types: parametric and non-parametric. Each parametric test depends on several assumptions, such as that the data must follow the normal distribution, that the sample size should be within a specified range, and that there should not be any outliers in the data. When its assumptions are met, a parametric test is more powerful than its corresponding non-parametric test. Non-parametric methods do not depend on the normality assumption, work quite well for small samples, and are robust to outliers. Student's t-test is suitable for smaller sample sizes (e.g., < 30). The normal-curve z test is more suitable for larger samples (e.g., >= 30). For polytomous independent variables (i.e., if the samples are subdivided into many distinct subordinate parts), analysis of variance (ANOVA) tests are more suitable. Therefore, before we could decide which statistical tests were most suitable to validate CAP, we needed to analyze whether the data satisfied the normality and no-outlier properties. We used a Q-Q plot of residuals and SAS's GLM procedure to test for normality.
The Q-Q plot is a plot of the residuals in sorted order (Y-axis) against the values those residuals would have if their distribution were normal, i.e., the expected normal scores (Z-scores, known as quantiles) on the X-axis. (The residual of an observation is the difference between the observation and its observed group mean.) The reference line shows the ideal normal distribution with the mean and standard deviation of the sample. If the points roughly follow the line, then the sample follows a normal distribution. SAS's GLM procedure uses the method of least squares to fit general linear models. The GLM procedure with the BF option (Brown and Forsythe's variation of Levene's test) allows us to test the normality of the sample. We used a box plot to identify outliers, i.e., data points which are numerically distant from the rest of the data. In a box plot, outliers are indicated by circles.

4.7.2. Empirical Experiment-1 (Dynamic Pairs, Fall 2008) Test Results

4.7.2.1. Test for Normality

Figures 4.3 and 4.4 show the Q-Q plots of residuals for the total software development time and the coding time, respectively. The points on both Q-Q plots lie nearly on the straight line, which indicates that both the total software development time and the coding time data follow a normal distribution.

Figure 4.3: Q-Q Plot of Residuals (Dynamic Pairs Total Software Development Time)

Figure 4.4: Q-Q Plot of Residuals (Dynamic Pairs Coding Time)

Figures 4.5 and 4.6 show the results of SAS's GLM procedure with the BF option for total software development time and coding time, respectively. In both figures, the p-values of all tests are insignificant at the 5% significance level (p > 0.05), which indicates that statistically there is no significant evidence to reject normality; i.e., both the overall software development time and the coding time data follow a normal distribution.
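As an illustration only (the study itself used SAS), the same style of normality check can be sketched with SciPy: compute the one-way-model residuals (each observation minus its group mean) and apply the Shapiro-Wilk test. The sketch below assumes SciPy is available and uses the dynamic-pairs total development times listed in Table 4.1.

```python
# Illustrative normality check, mirroring the SAS analysis in outline:
# Shapiro-Wilk test on residuals (observation minus its group mean).
# Times are the dynamic-pairs total development minutes from Table 4.1.
from scipy import stats

cap = [180, 275, 120, 148, 189, 273, 171, 160, 204]
pp = [250, 488, 272, 342, 346, 256, 264, 504, 140]

# Residuals of the one-way model: observation minus its group mean.
residuals = [x - sum(cap) / len(cap) for x in cap] + \
            [x - sum(pp) / len(pp) for x in pp]

w, p = stats.shapiro(residuals)
# A p-value above 0.05 gives no evidence against normality.
print(f"Shapiro-Wilk W = {w:.4f}, p = {p:.4f}")
```

With these data the p-value is well above 0.05, consistent with the Shapiro-Wilk result reported in Figure 4.5.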
Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.935497    Pr < W      0.2423
Kolmogorov-Smirnov   D     0.154598    Pr > D     >0.1500
Cramer-von Mises     W-Sq  0.08843     Pr > W-Sq   0.1507
Anderson-Darling     A-Sq  0.548835    Pr > A-Sq   0.1396

Figure 4.5: Test for Normality (Dynamic Pairs Total Software Development Time)

Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.919181    Pr < W      0.1250
Kolmogorov-Smirnov   D     0.189357    Pr > D      0.0866
Cramer-von Mises     W-Sq  0.088422    Pr > W-Sq   0.1507
Anderson-Darling     A-Sq  0.545294    Pr > A-Sq   0.1423

Figure 4.6: Test for Normality (Dynamic Pairs Coding Time)

4.7.2.2. Outliers

The box plots for the total software development time and the coding time are given in Figures 4.7 and 4.8, respectively. There are no circles in either figure, which indicates that there are no outliers in the overall software development time or coding time of either the PP group or the CAP group.

Figure 4.7: Box plot (Dynamic Pairs Total Software Development Time)
Figure 4.8: Box plot (Dynamic Pairs Coding Time)

4.7.2.3. Statistical Test Determination for Experiment-1

The sample size was 18 (9 experiments completed by the PP group plus 9 experiments completed by the CAP group). Since the sample size was small, we used Student's t-tests to compare the CAP groups' means with the PP groups' means. The t-test depends on several assumptions:
- If the sample size is less than 15, then the data should be strictly normal.
- If the sample size is between 15 and 40, then the data may depart somewhat from normality, but should not contain outliers.
- If the sample size is more than 40, then the data may be markedly skewed.
Our sample size was 18, both total development time and coding time followed a normal distribution, and there were no outliers.
Consequently, Student's t-test was identified as suitable for comparing the CAP total software development time means with the PP total software development time means, and the CAP coding time means with the PP coding time means.

4.7.2.4. Total Software Development Time (Hypothesis 1)

The total software development times for the PP groups and the CAP groups are shown in Table 4.1. The PP groups took 285 minutes on average for Problem 1, 446 minutes on average for Problem 2, and 223 minutes on average for Problem 3; whereas the CAP groups took only 166 minutes (42% less than the PP groups) on average for Problem 1, 208 minutes (53% less) on average for Problem 2, and 199 minutes (11% less) on average for Problem 3. The average time taken to solve all three problems was 954 minutes for the PP groups and 573 minutes (40% less) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1    180       275       120
CAP-G2    148       189       273
CAP-G3    171       160       204
Average   166       208       199
PP-G1     250       488       272
PP-G2     342       346       256
PP-G3     264       504       140
Average   285       446       223

Table 4.1: Total Software Development Time (Dynamic Pairs)

Figure 4.9 shows the average time taken by the PP groups and the CAP groups for total software development on the three problems.

Figure 4.9: Average Total Software Development Time (Dynamic Pairs)

The box plot in Figure 4.10 shows the total time taken by all 18 pairs (3x3 programs completed by the PP groups and 3x3 programs completed by the CAP groups). Each box contains 50% of the data points; the whisker between the lower border and the box covers 25% of the data points, and the whisker between the box and the upper border covers the remaining 25%. The plus mark in the box indicates the mean value, and the horizontal line in the middle of the box indicates the median. The plot indicates that all nine CAP programs took less time than the mean value of the PP programs.
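As an independent check (the original analysis was done in SAS), the unequal-variance comparison reported in Figure 4.11 can be reproduced from the Table 4.1 data; a minimal sketch, assuming SciPy is available (`scipy.stats.ttest_ind` with `equal_var=False` performs the Welch/Satterthwaite test):

```python
# Unequal-variance (Welch) two-sample t-test on the dynamic-pairs
# total development times from Table 4.1, mirroring the Satterthwaite
# row of the SAS output (t = -2.96, two-sided p = 0.0129).
from scipy import stats

cap = [180, 275, 120, 148, 189, 273, 171, 160, 204]  # CAP totals, minutes
pp = [250, 488, 272, 342, 346, 256, 264, 504, 140]   # PP totals, minutes

t, p = stats.ttest_ind(cap, pp, equal_var=False)
print(f"t = {t:.2f}, two-sided p = {p:.4f}")  # t = -2.96, p = 0.0129
```

The negative t statistic reflects that the CAP mean (191 minutes) is below the PP mean (318 minutes).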
Figure 4.10: Total Software Development Time (Dynamic Pairs)

The Student's t-test results are shown in Figure 4.11. The p-value of the equality-of-variances test is significant at the 5% level (p < 0.05), which indicates that the data have unequal variances, so we take the unequal-variance t-test result, which is p = 0.0129 (two-sided). Since p < 0.05, there is insufficient support for the hypothesis H0 1 that the overall software development cost or time of CAP is equal to or higher than that of PP on average.

The TTEST Procedure

Statistics
                          Lower CL          Upper CL  Lower CL           Upper CL
Variable  indicator   N   Mean      Mean    Mean      Std Dev   Std Dev  Std Dev  Std Err
ttime     CAP         9   150.51    191.11  231.72    35.682    52.826   101.2    17.609
ttime     PP          9   227.85    318     408.15    79.219    117.28   224.68   39.094
ttime     Diff (1-2)      -217.8    -126.9  -35.99    67.741    90.955   138.43   42.877

T-Tests
Variable  Method         Variances  DF    t Value  Pr > |t|
ttime     Pooled         Equal      16    -2.96    0.0092
ttime     Satterthwaite  Unequal    11.1  -2.96    0.0129

Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
ttime     Folded F  8       8       4.93     0.0368

Figure 4.11: t-Test Results (Dynamic Pairs Total Software Development Time)

Decision: Reject H0 1 in favor of Ha 1, since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the overall software development cost or time of CAP is less than that of PP on average.

4.7.2.5. Coding Time (Hypothesis 3)

The coding times for the PP groups and the CAP groups are shown in Table 4.2. The PP groups took 192 minutes on average for Problem 1, 371 minutes on average for Problem 2, and 170 minutes on average for Problem 3; whereas the CAP groups took only 65 minutes (66% less than the PP groups) on average for Problem 1, 52 minutes (86% less) on average for Problem 2, and 79 minutes (54% less) on average for Problem 3.
The average time taken to solve all three problems was 733 minutes for the PP groups and 196 minutes (73% less) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1    38        55        51
CAP-G2    91        61        98
CAP-G3    65        40        89
Average   65        52        79
PP-G1     92        272       194
PP-G2     320       346       196
PP-G3     164       494       120
Average   192       371       170

Table 4.2: Coding Time (Dynamic Pairs)

Figure 4.12 shows the average time taken by the PP groups and the CAP groups for the coding phase on the three problems.

Figure 4.12: Average Coding Time (Dynamic Pairs)

The box plot in Figure 4.13 shows the coding time taken by all 18 pairs (3x3 programs completed by the PP groups and 3x3 programs completed by the CAP groups). The plot indicates that all nine CAP programs took less time than 75% of the PP programs.

Figure 4.13: Box plot (Dynamic Pairs Coding Time)

The Student's t-test results are shown in Figure 4.14. The p-value of the equality-of-variances test is significant at the 5% level (p < 0.05), which indicates that the data have unequal variances, so we take the unequal-variance t-test result, which is p = 0.0028 (two-sided). Since p < 0.05, there is insufficient support for the hypothesis H0 3 that the cost of the CAP coding phase is equal to or higher than that of the PP coding phase on average.
The TTEST Procedure

Statistics
                          Lower CL          Upper CL  Lower CL           Upper CL
Variable  indicator   N   Mean      Mean    Mean      Std Dev   Std Dev  Std Dev  Std Err
ctime     CAP         9   48.133    65.333  82.534    15.115    22.377   42.87    7.4591
ctime     PP          9   146.56    244.22  341.89    85.821    127.06   243.41   42.352
ctime     Diff (1-2)      -270.1    -178.9  -87.72    67.942    91.226   138.84   43.004

T-Tests
Variable  Method         Variances  DF   t Value  Pr > |t|
ctime     Pooled         Equal      16   -4.16    0.0007
ctime     Satterthwaite  Unequal    8.5  -4.16    0.0028

Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
ctime     Folded F  8       8       32.24    <.0001

Figure 4.14: t-Test Results (Dynamic Pairs Coding Time)

Decision: Reject H0 3 in favor of Ha 3, since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the cost of the CAP coding phase is less than the cost of the PP coding phase on average.

4.7.2.6. Program Correctness (Hypothesis 5)

The numbers of post-development test cases passed by the PP group programs and the CAP group programs are shown in Table 4.3 and Figure 4.15. The acceptance tests were conducted by a disinterested party; specifically, a graduate teaching assistant for the introductory Java course was recruited to do this. The tester was not involved in any other way with the experiment. The total numbers of test cases passed by the PP groups were 13, 17, and 29 for Problem 1, Problem 2, and Problem 3, respectively, whereas the total numbers passed by the CAP groups were 16, 20, and 30.

Group   Problem1  Problem2  Problem3
PP1     5/6       6/8       10/10
PP2     4/6       8/8       9/10
PP3     4/6       3/8       10/10
Total   13/18     17/24     29/30
CAP1    5/6       8/8       10/10
CAP2    5/6       8/8       10/10
CAP3    6/6       4/8       10/10
Total   16/18     20/24     30/30

Table 4.3: The number of test cases passed (Dynamic Pairs)

Figure 4.15: The number of test cases passed (Dynamic Pairs)

Table 4.3 indicates that the number of acceptance tests failed in CAP is less than the number of acceptance tests failed in PP.
Therefore, there is insufficient support for the hypothesis H0 5.

Decision: Reject H0 5 in favor of Ha 5. We have sufficient evidence to conclude that the number of acceptance tests failed in CAP is less than the number of acceptance tests failed in PP.

4.7.3. Empirical Experiment-2 (Static Pairs, Spring 2009) Test Results

4.7.3.1. Test for Normality

Figures 4.16 and 4.17 show the Q-Q plots of residuals for the total software development time and the coding time, respectively. The points in Figure 4.16 lie nearly on the straight line, whereas the points in Figure 4.17 do not follow the straight line, which indicates that the total software development time data follow a normal distribution whereas the coding time data do not.

Figure 4.16: Q-Q Plot of Residuals (Static Pairs Total Software Development Time)

Figure 4.17: Q-Q Plot of Residuals (Static Pairs Coding Time)

Figures 4.18 and 4.19 show the results of SAS's GLM procedure with the BF option for total software development time and coding time, respectively. In Figure 4.18 the p-values of all tests (except the Shapiro-Wilk test) are insignificant at the 5% significance level (p > 0.05), which indicates that statistically there is no significant evidence to reject normality; i.e., the overall software development time data follow a normal distribution. In Figure 4.19 the p-values of all tests are significant at the 5% level (p < 0.05), which indicates that statistically there is significant evidence to reject normality; i.e., the coding time data do not follow a normal distribution.
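Because the coding-time data fail the normality checks, the comparison that follows falls back on a rank-based test. As an illustration (again outside the original SAS tooling), the static-pairs coding times from Table 4.5 can be compared with SciPy's Mann-Whitney U test, which is exact for small samples without ties:

```python
# Exact Mann-Whitney U test on the static-pairs coding times from
# Table 4.5. Every CAP time is below every PP time, so U = 0 and the
# exact two-sided p-value is 2 / C(18, 9), matching Figure 4.27.
from scipy import stats

cap = [18, 113, 124, 132, 77, 121, 94, 161, 180]    # CAP coding minutes
pp = [308, 242, 218, 200, 380, 420, 804, 336, 280]  # PP coding minutes

res = stats.mannwhitneyu(cap, pp, alternative="two-sided", method="exact")
print(f"U = {res.statistic}, two-sided p = {res.pvalue:.3e}")  # U = 0.0, p = 4.114e-05
```

The complete separation of the two samples (largest CAP time 180 minutes versus smallest PP time 200 minutes) is what drives the U statistic to zero.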
Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.881142    Pr < W      0.0273
Kolmogorov-Smirnov   D     0.161488    Pr > D     >0.1500
Cramer-von Mises     W-Sq  0.084384    Pr > W-Sq   0.1751
Anderson-Darling     A-Sq  0.618083    Pr > A-Sq   0.0929

Figure 4.18: Test for Normality (Static Pairs Total Software Development Time)

Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.749179    Pr < W      0.0003
Kolmogorov-Smirnov   D     0.248771    Pr > D     <0.0100
Cramer-von Mises     W-Sq  0.196178    Pr > W-Sq  <0.0050
Anderson-Darling     A-Sq  1.297565    Pr > A-Sq  <0.0050

Figure 4.19: Test for Normality (Static Pairs Coding Time)

4.7.3.2. Outliers

The box plots for the total software development time and the coding time are given in Figures 4.20 and 4.21, respectively. There are no circles in either figure, which indicates that there are no outliers in the overall software development time or coding time of either the PP group or the CAP group.

Figure 4.20: Box plot (Static Pairs Total Software Development Time)
Figure 4.21: Box plot (Static Pairs Coding Time)

4.7.3.3. Statistical Test Determination for Experiment-2

The sample size was 18 (9 experiments completed by the PP groups plus 9 experiments completed by the CAP groups). Since the sample size was small, we used t-tests to compare the CAP groups' means with the PP groups' means. Our sample size was 18, the total development time followed a normal distribution, and there were no outliers. Consequently, Student's t-test was used to compare the CAP total software development time means with the PP total software development time means. Since the coding time data were not normally distributed, the Wilcoxon Mann-Whitney U test was used to compare the CAP coding times with the PP coding times.

4.7.3.4. Total Software Development Time (Hypothesis 1)

The total software development times for the PP groups and the CAP groups are shown in Table 4.4. The PP groups took 603 minutes on average for Problem 1, 484 minutes on average for Problem 2, and 377 minutes on average for Problem 3; whereas the CAP groups took only 197 minutes (67% less than the PP groups) on average for Problem 1, 192 minutes (60% less) on average for Problem 2, and 236 minutes (37% less) on average for Problem 3. The average time taken to solve all three problems was 1464 minutes for the PP groups and 625 minutes (57% less) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1    159       200       311
CAP-G2    210       122       156
CAP-G3    222       254       240
Average   197       192       236
PP-G1     592       544       312
PP-G2     350       480       510
PP-G3     866       428       310
Average   603       484       377

Table 4.4: Total Software Development Time (Static Pairs)

Figure 4.22 shows the average time taken by the PP groups and the CAP groups for total software development on the three problems.

Figure 4.22: Average Total Software Development Time (Static Pairs)

The box plot in Figure 4.23 shows the total time taken by all 18 pairs (3x3 programs completed by the PP group and 3x3 programs completed by the CAP group). The plot indicates that the nine CAP programs took less time than nearly every PP program.

Figure 4.23: Total Software Development Time (Static Pairs)

The Student's t-test results are shown in Figure 4.24. The p-value of the equality-of-variances test is significant at the 5% level (p < 0.05), which indicates that the data have unequal variances, so we take the unequal-variance t-test result, which is p = 0.0011 (two-sided).
Since p < 0.05, there is insufficient support for the hypothesis H0 1 that the overall software development cost or time of CAP is equal to or higher than that of PP on average.

The TTEST Procedure

Statistics
                          Lower CL          Upper CL  Lower CL           Upper CL
Variable  indicator   N   Mean      Mean    Mean      Std Dev   Std Dev  Std Dev  Std Err
ttime     CAP         9   163.97    208.22  252.47    38.885    57.569   110.29   19.19
ttime     PP          9   354.12    488     621.88    117.65    174.17   333.67   58.057
ttime     Diff (1-2)      -409.4    -279.8  -150.2    96.605    129.71   197.41   61.147

T-Tests
Variable  Method         Variances  DF    t Value  Pr > |t|
ttime     Pooled         Equal      16    -4.58    0.0003
ttime     Satterthwaite  Unequal    9.73  -4.58    0.0011

Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
ttime     Folded F  8       8       9.15     0.0052

Figure 4.24: t-Test Results (Static Pairs Total Software Development Time)

Decision: Reject H0 1 in favor of Ha 1, since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the overall software development cost or time of CAP is less than that of PP on average.

4.7.3.5. Coding Time (Hypothesis 3)

The coding times for the PP groups and the CAP groups are shown in Table 4.5. The PP groups took 437 minutes on average for Problem 1, 319 minutes on average for Problem 2, and 306 minutes on average for Problem 3; whereas the CAP groups took only 81 minutes (81% less than the PP groups) on average for Problem 1, 117 minutes (63% less) on average for Problem 2, and 142 minutes (54% less) on average for Problem 3. The average time taken to solve all three problems was 1062 minutes for the PP groups and 340 minutes (68% less) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1    18        113       124
CAP-G2    132       77        121
CAP-G3    94        161       180
Average   81        117       142
PP-G1     308       242       218
PP-G2     200       380       420
PP-G3     804       336       280
Average   437       319       306

Table 4.5: Coding Time (Static Pairs)

Figure 4.25 shows the average time taken by the PP groups and the CAP groups for the coding phase on the three problems.
Figure 4.25: Average Coding Time (Static Pairs)

The box plot in Figure 4.26 shows the coding time taken by all 18 pairs (3x3 programs completed by the PP group and 3x3 programs completed by the CAP group). The plot indicates that all nine CAP programs took less time than the fastest PP program.

Figure 4.26: Box plot (Static Pairs Coding Time)

The Wilcoxon Mann-Whitney U test results are shown in Figure 4.27. The two-sided p-value is 0.0026 (t approximation). Since p < 0.05, there is insufficient support for the hypothesis H0 3 that the cost of the CAP coding phase is equal to or higher than that of the PP coding phase on average.

Wilcoxon Two-Sample Test
Statistic (S)                     45.0000
Normal Approximation
  Z                               -3.5321
  One-Sided Pr < Z                 0.0002
  Two-Sided Pr > |Z|               0.0004
t Approximation
  One-Sided Pr < Z                 0.0013
  Two-Sided Pr > |Z|               0.0026
Exact Test
  One-Sided Pr <= S              2.057E-05
  Two-Sided Pr >= |S - Mean|     4.114E-05
Z includes a continuity correction of 0.5.

Figure 4.27: Wilcoxon Mann-Whitney U test Results (Static Pairs Coding Time)

Decision: Reject H0 3 in favor of Ha 3, since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the cost of the CAP coding phase is less than the cost of the PP coding phase on average.

4.7.4. Combined Test Results (CAP vs PP)

4.7.4.1. Test for Normality

Figures 4.28 and 4.29 show the Q-Q plots of residuals for the total software development time and the coding time, respectively. The points in Figure 4.28 lie nearly on the straight line, whereas the points in Figure 4.29 do not follow the straight line, which indicates that the total software development time data follow a normal distribution whereas the coding time data do not.
Figure 4.28: Q-Q Plot of Residuals (Combined CAP vs PP Total Software Development Time)

Figure 4.29: Q-Q Plot of Residuals (Combined CAP vs PP Coding Time)

Figures 4.30 and 4.31 show the results of SAS's GLM procedure with the BF option for total software development time and coding time, respectively. In Figure 4.30 the p-values of all tests (except the Shapiro-Wilk test) are insignificant at the 5% significance level (p > 0.05), which indicates that statistically there is no significant evidence to reject normality; i.e., the overall software development time data follow a normal distribution. In Figure 4.31 the p-values of all tests are significant at the 5% level (p < 0.05), which indicates that statistically there is significant evidence to reject normality; i.e., the coding time data do not follow a normal distribution.

Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.910577    Pr < W      0.0067
Kolmogorov-Smirnov   D     0.100131    Pr > D     >0.1500
Cramer-von Mises     W-Sq  0.085478    Pr > W-Sq   0.1755
Anderson-Darling     A-Sq  0.657534    Pr > A-Sq   0.0829

Figure 4.30: Test for Normality (Combined CAP vs PP Total Software Development Time)

Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.821607    Pr < W     <0.0001
Kolmogorov-Smirnov   D     0.179058    Pr > D     <0.0100
Cramer-von Mises     W-Sq  0.230129    Pr > W-Sq  <0.0050
Anderson-Darling     A-Sq  1.443171    Pr > A-Sq  <0.0050

Figure 4.31: Test for Normality (Combined CAP vs PP Coding Time)

4.7.4.2. Outliers

The box plots for the total software development time and the coding time are given in Figures 4.32 and 4.33, respectively. There are no circles in either figure, which indicates that there are no outliers in the overall software development time or coding time of either the PP group or the CAP group.
Figure 4.32: Box plot (Combined CAP vs PP Total Software Development Time)
Figure 4.33: Box plot (Combined CAP vs PP Coding Time)

4.7.4.3. Statistical Test Determination for the Combined CAP vs PP Data

The sample size was 36 (18 experiments completed by the PP groups plus 18 experiments completed by the CAP groups). Since the sample size was still fairly small, we used t-tests to compare the CAP groups' means with the PP groups' means. Our sample size was 36, the total development time followed a normal distribution, and there were no outliers. Consequently, Student's t-test was used to compare the CAP total software development time means with the PP total software development time means. Since the coding time data were not normally distributed, the Wilcoxon Mann-Whitney U test was used to compare the CAP coding times with the PP coding times.

4.7.4.4. Total Software Development Time (Hypothesis 1)

The total software development times for the PP groups and the CAP groups are shown in Table 4.6. The PP groups took 444 minutes on average for Problem 1, 465 minutes on average for Problem 2, and 300 minutes on average for Problem 3; whereas the CAP groups took only 182 minutes (59% less than the PP groups) on average for Problem 1, 200 minutes (57% less) on average for Problem 2, and 218 minutes (27% less) on average for Problem 3. The average time taken to solve all three problems was 1209 minutes for the PP groups and 600 minutes (50% less) for the CAP groups.
Method    Problem1  Problem2  Problem3
CAP-G1    180       275       120
CAP-G2    148       189       273
CAP-G3    171       160       204
CAP-G4    159       200       311
CAP-G5    210       122       156
CAP-G6    222       254       240
Average   182       200       218
PP-G1     250       488       272
PP-G2     342       346       256
PP-G3     264       504       140
PP-G4     592       544       312
PP-G5     350       480       510
PP-G6     866       428       310
Average   444       465       300

Table 4.6: Total Software Development Time (Combined CAP vs PP)

Figure 4.34 shows the average time taken by the PP groups and the CAP groups for total software development on the three problems.

Figure 4.34: Average Total Software Development Time (Combined CAP vs PP)

The box plot in Figure 4.35 shows the total time taken by all 36 pairs (6x3 programs completed by the PP groups and 6x3 programs completed by the CAP groups). The plot indicates that all eighteen CAP programs took less time than the mean value of the PP programs.

Figure 4.35: Box Plot (Combined CAP vs PP Total Software Development Time)

The Student's t-test results are shown in Figure 4.36. The p-value of the equality-of-variances test is significant at the 5% level (p < 0.05), which indicates that the data have unequal variances, so we take the unequal-variance t-test result, which is p < 0.0001 (two-sided). Since p < 0.05, there is insufficient support for the hypothesis H0 1 that the overall software development cost or time of CAP is equal to or higher than that of PP on average.
The TTEST Procedure

Statistics
                          Lower CL          Upper CL  Lower CL           Upper CL
Variable  indicator   N   Mean      Mean    Mean      Std Dev   Std Dev  Std Dev  Std Err
ttime     CAP         18  172.66    199.67  226.68    40.759    54.317   81.429   12.803
ttime     PP          18  319.2     403     486.8     126.45    168.52   252.63   39.72
ttime     Diff (1-2)      -288.1    -203.3  -118.5    101.27    125.2    164.03   41.733

T-Tests
Variable  Method         Variances  DF    t Value  Pr > |t|
ttime     Pooled         Equal      34    -4.87    <.0001
ttime     Satterthwaite  Unequal    20.5  -4.87    <.0001

Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
ttime     Folded F  17      17      9.63     <.0001

Figure 4.36: t-Test Results (Combined CAP vs PP Total Software Development Time)

Decision: Reject H0 1 in favor of Ha 1, since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the overall software development cost or time of CAP is less than that of PP on average.

4.7.4.5. Coding Time (Hypothesis 3)

The coding times for the PP groups and the CAP groups are shown in Table 4.7. The PP groups took 315 minutes on average for Problem 1, 340 minutes on average for Problem 2, and 238 minutes on average for Problem 3; whereas the CAP groups took only 73 minutes (77% less than the PP groups) on average for Problem 1, 85 minutes (75% less) on average for Problem 2, and 111 minutes (53% less) on average for Problem 3. The average time taken to solve all three problems was 893 minutes for the PP groups and 269 minutes (70% less) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1    38        55        51
CAP-G2    91        61        98
CAP-G3    65        40        89
CAP-G4    18        113       124
CAP-G5    132       77        121
CAP-G6    94        161       180
Average   73        85        111
PP-G1     92        272       194
PP-G2     320       346       196
PP-G3     164       494       120
PP-G4     308       242       218
PP-G5     200       380       420
PP-G6     804       336       280
Average   315       340       238

Table 4.7: Coding Time (Combined CAP vs PP)

Figure 4.37 shows the average time taken by the PP groups and the CAP groups for the coding phase on the three problems.
Figure 4.37: Average Coding Time (Combined CAP Vs PP)

The box plot in Figure 4.38 shows the coding time taken by all 36 pairs (6x3 programs completed by the PP groups and 6x3 programs completed by the CAP groups). The plot indicates that all nine CAP programs took less time than 75% of the PP programs.

Figure 4.38: Box Plot (Combined CAP Vs PP Coding Time)

The Wilcoxon Mann-Whitney U test results are shown in Figure 4.39. The p-value is < 0.0001 (two-sided). Since p < 0.05, there is insufficient support for the hypothesis H03 that the cost of the CAP coding phase is equal to or higher than that of the PP coding phase on average.

Wilcoxon Two-Sample Test
Statistic (S)                     185.0000

Normal Approximation
Z                                  -4.6667
One-Sided Pr < Z                    <.0001
Two-Sided Pr > |Z|                  <.0001

t Approximation
One-Sided Pr < Z                    <.0001
Two-Sided Pr > |Z|                  <.0001

Exact Test
One-Sided Pr <= S                5.598E-08
Two-Sided Pr >= |S - Mean|       1.120E-07

Z includes a continuity correction of 0.5.

Figure 4.39: Wilcoxon Mann-Whitney U Test Result (Combined CAP Vs PP Coding Time)

Decision: Reject H03 in favor of Ha3 since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the cost of the CAP coding phase is less than the cost of the PP coding phase on average.

4.7.5. CAP Vs IP Test Results

4.7.5.1. Test for Normality

Figures 4.40 and 4.41 show the Q-Q plots of residuals for the total software development time and the coding time, respectively. The points on the Q-Q plots of residuals lie nearly on a straight line, which indicates that both the total software development time data and the coding time data follow a normal distribution.
Figure 4.40: Q-Q Plot of Residuals (CAP Vs IP Total Software Development Time)

Figure 4.41: Q-Q Plot of Residuals (CAP Vs IP Coding Time)

Figures 4.42 and 4.43 show the results of SAS's GLM procedure with the BF option for the total software development time and the coding time, respectively. In both Figures 4.42 and 4.43, the p-values of all tests are not significant at the 5% significance level (p > 0.05), which indicates that there is no statistically significant evidence to reject normality; i.e., both the overall software development time data and the coding time data follow a normal distribution.

Tests for Normality
Test                    ---Statistic---     ------p Value------
Shapiro-Wilk            W       0.98787     Pr < W       0.9667
Kolmogorov-Smirnov      D      0.068654     Pr > D      >0.1500
Cramer-von Mises        W-Sq   0.021278     Pr > W-Sq   >0.2500
Anderson-Darling        A-Sq   0.176827     Pr > A-Sq   >0.2500

Figure 4.42: Test for Normality (CAP Vs IP Total Software Development Time)

Tests for Normality
Test                    ---Statistic---     ------p Value------
Shapiro-Wilk            W      0.980243     Pr < W       0.7936
Kolmogorov-Smirnov      D      0.075714     Pr > D      >0.1500
Cramer-von Mises        W-Sq   0.036181     Pr > W-Sq   >0.2500
Anderson-Darling        A-Sq    0.24686     Pr > A-Sq   >0.2500

Figure 4.43: Test for Normality (CAP Vs IP Coding Time)

4.7.5.2. Outliers

The box plots for the total software development time and the coding time are given in Figures 4.44 and 4.45, respectively. There are no circles in Figures 4.44 and 4.45, which indicates that there are no outliers in either the IP or the CAP overall software development time and coding time data.

Figure 4.44: Box Plot (CAP Vs IP Total Software Development Time)

Figure 4.45: Box Plot (CAP Vs IP Coding Time)

4.7.5.3.
Statistical Test Determination for the CAP Vs IP Data

The sample size was 33 (15 experiments completed by the IP group plus 18 experiments completed by the CAP groups). Since the sample size was small, both the total development time and the coding time followed a normal distribution, and there were no outliers, Student's t-Test was identified as suitable for comparing both the CAP total software development time mean with the IP total software development time mean, and the CAP coding time mean with the IP coding time mean.

4.7.5.4. Total Software Development Time (Hypothesis 2)

The total software development times for the IP groups and the CAP groups are shown in Table 4.8. The IP groups took, on average, 233 minutes for Problem1, 280 minutes for Problem2, and 207 minutes for Problem3; whereas the CAP groups took 182 minutes (22% less than the IP groups) on average for Problem1, 200 minutes (29% less) for Problem2, and 218 minutes (5% more) for Problem3. The average time taken to solve all three problems was 720 minutes for the IP groups and 600 minutes (17% less than the IP groups) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1         180       275       120
CAP-G2         148       189       273
CAP-G3         171       160       204
CAP-G4         159       200       311
CAP-G5         210       122       156
CAP-G6         222       254       240
Average        182       200       218
IP-G1          318       227       150
IP-G2          184       417       345
IP-G3          152       290        59
IP-G4          270       145       195
IP-G5          242       320       285
Average        233       280       207

Table 4.8: Total Software Development Time (CAP Vs IP)

Figure 4.46 shows the average time taken by the IP groups and the CAP groups for total software development on the three given problems.
Figure 4.46: Average Total Software Development Time (CAP Vs IP)

The box plot in Figure 4.47 shows the total time taken for all 33 programs (5x3 programs completed by the IP groups and 6x3 programs completed by the CAP groups).

Figure 4.47: Total Software Development Time (CAP Vs IP)

The Student's t-Test results are shown in Figure 4.48. The p-value of the equality-of-variances test is not significant at the 5% significance level (p > 0.05), which indicates that the data have equal variances, so we take the equal-variance t-Test result, which is p = 0.1532 (two-sided). Since p > 0.05, there is sufficient support for the hypothesis H02 that the overall software development cost or time of CAP is equal to or higher than that of IP on average.

Decision: Do not reject H02 since p-value > α (α = 0.05). Thus we do not have sufficient statistical evidence to conclude that the overall software development cost or time of CAP is less than that of IP on average.

The TTEST Procedure

Statistics
                             Lower CL            Upper CL   Lower CL             Upper CL
Variable  indicator     N        Mean     Mean       Mean    Std Dev   Std Dev    Std Dev   Std Err
ttime     CAP          17      170.63   199.41     228.19     41.691    55.978     85.194    13.577
ttime     IP           16      189.15   237.69     286.22     67.282    91.081     140.97     22.77
ttime     Diff (1-2)           -91.59   -38.28     15.034     60.162    75.043     99.768    26.139

T-Tests
Variable  Method          Variances      DF   t Value   Pr > |t|
ttime     Pooled          Equal          31     -1.46     0.1532
ttime     Satterthwaite   Unequal      24.6     -1.44     0.1614

Equality of Variances
Variable  Method      Num DF   Den DF   F Value   Pr > F
ttime     Folded F        15       16      2.65   0.0622

Figure 4.48: t-Test Results (CAP Vs IP Total Software Development Time)

4.7.5.5. Coding Time (Hypothesis 4)

The coding times for the IP groups and the CAP groups are shown in Table 4.9.
The IP groups took, on average, 124 minutes for Problem1, 183 minutes for Problem2, and 137 minutes for Problem3; whereas the CAP groups took only 73 minutes (41% less than the IP groups) on average for Problem1, 85 minutes (54% less) for Problem2, and 111 minutes (19% less) for Problem3. The total average coding time over all three problems was 444 minutes for the IP groups and 269 minutes (39% less than the IP groups) for the CAP groups.

Method    Problem1  Problem2  Problem3
CAP-G1          38        55        51
CAP-G2          91        61        98
CAP-G3          65        40        89
CAP-G4          18       113       124
CAP-G5         132        77       121
CAP-G6          94       161       180
Average         73        85       111
IP-G1          112       116        85
IP-G2           26       165       235
IP-G3          147       262        45
IP-G4          135       110       140
IP-G5          202       260       180
Average        124       183       137

Table 4.9: Coding Time (CAP Vs IP)

Figure 4.49 shows the average time taken by the IP groups and the CAP groups for the coding phase of software development on the three given problems.

Figure 4.49: Average Coding Time (CAP Vs IP)

The box plot in Figure 4.50 shows the coding time taken for all 33 programs (5x3 programs completed by the IP groups and 6x3 programs completed by the CAP groups). The plot indicates that all nine CAP programs took less time than the slowest 25% of the IP programs.

Figure 4.50: Box Plot (CAP Vs IP Coding Time)

The Student's t-Test results are shown in Figure 4.51. The p-value of the equality-of-variances test is not significant at the 5% significance level (p > 0.05), which indicates that the data have equal variances, so we take the equal-variance t-Test result, which is p = 0.0113 (two-sided). Since p < 0.05, there is insufficient support for the hypothesis H04 that the cost of the CAP coding phase is equal to or higher than that of the IP coding phase on average.
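As a cross-check, the pooled (equal-variance) t statistic for the coding-time comparison can be recomputed from Table 4.9. Note that Figure 4.51 reports N = 17 and N = 16 while Table 4.9 lists 18 CAP and 15 IP values, so this standard-library sketch (an illustration, not the original SAS run) reproduces the direction and significance of the result rather than the exact printed statistics.

```python
from math import sqrt
from statistics import mean, stdev

# Coding times (minutes) from Table 4.9.
cap = [38, 55, 51, 91, 61, 98, 65, 40, 89,
       18, 113, 124, 132, 77, 121, 94, 161, 180]
ip = [112, 116, 85, 26, 165, 235, 147, 262, 45,
      135, 110, 140, 202, 260, 180]

def pooled_t(a, b):
    """Student's two-sample t statistic under the equal-variance assumption."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

t = pooled_t(cap, ip)
df = len(cap) + len(ip) - 2  # 31 degrees of freedom
# |t| exceeds the two-sided 5% critical value (about 2.04 at 31 df),
# so H04 is rejected on this data as well, consistent with Figure 4.51.
print(round(t, 2), df)
```

The pooled form is appropriate here because the folded-F test in Figure 4.51 does not reject equal variances.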
The TTEST Procedure

Statistics
                             Lower CL            Upper CL   Lower CL             Upper CL
Variable  indicator     N        Mean     Mean       Mean    Std Dev   Std Dev    Std Dev   Std Err
ctime     CAP          17      66.218   89.353     112.49     33.511    44.996     68.48     10.913
ctime     IP           16      106.87   144.31     181.76     51.906    70.267    108.75     17.567
ctime     Diff (1-2)           -96.59   -54.96     -13.33      46.98    58.601    77.908     20.412

T-Tests
Variable  Method          Variances      DF   t Value   Pr > |t|
ctime     Pooled          Equal          31     -2.69     0.0113
ctime     Satterthwaite   Unequal      25.3     -2.66     0.0135

Equality of Variances
Variable  Method      Num DF   Den DF   F Value   Pr > F
ctime     Folded F        15       16      2.44   0.0868

Figure 4.51: t-Test Results (CAP Vs IP Coding Time)

Decision: Reject H04 in favor of Ha4 since p-value < α (α = 0.05). Thus we have sufficient statistical evidence to conclude that the cost of the CAP coding phase is less than the cost of the IP coding phase on average.

4.7.6. Results Summary

To test the first four hypotheses, i.e., to compare the average CAP total software development time and coding time with those of the other techniques, either Student's t-test or the Mann-Whitney U test was used: if the data followed a normal distribution and there were no outliers, Student's t-test was used; otherwise, the Mann-Whitney U test was used. To test the fifth hypothesis, i.e., to compare the program correctness of the CAP groups with that of the PP groups, we simply compared the number of post-development test cases passed by the programs developed by each group.

4.7.6.1. Total Software Development Time

H01 (The overall software development cost of CAP is equal to or higher than that of PP on average): For the dynamic pairs (i.e., the control experiment conducted in Fall 2008), the static pairs (i.e., the control experiment conducted in Spring 2009), and the combined data, hypothesis 1 was not supported, with p=0.0129, p=0.0011, and p<0.0001 respectively.
Thus we have sufficient statistical evidence to accept the alternative hypothesis that the overall software development cost or time of CAP is less than that of PP on average. The average time taken to solve all three problems was 954 minutes for the Dynamic Pairs PP groups and 573 minutes (40% less than PP) for the Dynamic Pairs CAP groups. The average number of acceptance tests passed by the Dynamic Pairs PP groups' programs was 59/72 (82%), whereas the average number passed by the Dynamic Pairs CAP groups' programs was 66/72 (92%). Moreover, all nine Dynamic Pairs CAP programs took less time than the mean value of the Dynamic Pairs PP programs. The average time taken to solve all three problems was 1464 minutes for the Static Pairs PP groups and 625 minutes (57% less than PP) for the Static Pairs CAP groups. Moreover, all nine Static Pairs CAP programs took less time than the least value of the Static Pairs PP program groups.

H02 (The overall software development cost of CAP is equal to or higher than that of individual programming on average): The hypothesis was supported, with p=0.1532. Thus we do not have sufficient statistical evidence to conclude that the overall software development cost or time of CAP is less than that of IP on average. The average time taken to solve all three problems was 720 minutes for the IP groups and 600 minutes (17% less than IP) for the CAP groups.

4.7.6.2. Coding Time

H03 (The cost of the CAP coding phase is equal to or higher than the cost of the PP coding phase on average): For the dynamic pairs (i.e., the control experiment conducted in Fall 2008), the static pairs (i.e., the control experiment conducted in Spring 2009), and the combined data, hypothesis 3 was not supported, with p=0.0028, p=0.0026, and p<0.0001 respectively. Thus we have sufficient statistical evidence to accept the alternative hypothesis that the coding-phase cost or time of CAP is less than that of PP on average.
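The combined-data Mann-Whitney result for H03 (Figure 4.39) can be re-derived by ranking the 36 coding times in Table 4.7. This Python standard-library sketch is an illustration, not the original SAS run; it relies on the fact that this particular data set contains no tied values, so no average ranks are needed.

```python
from math import sqrt

# Coding times (minutes) from Table 4.7.
cap = [38, 55, 51, 91, 61, 98, 65, 40, 89,
       18, 113, 124, 132, 77, 121, 94, 161, 180]
pp = [92, 272, 194, 320, 346, 196, 164, 494, 120,
      308, 242, 218, 200, 380, 420, 804, 336, 280]

# Rank the pooled sample (ranks start at 1; all 36 values are distinct,
# so no tie handling is required) and sum the CAP ranks.
pooled = sorted(cap + pp)
rank = {v: i + 1 for i, v in enumerate(pooled)}
s = sum(rank[v] for v in cap)        # Wilcoxon rank-sum statistic

n1, n2 = len(cap), len(pp)
mean_s = n1 * (n1 + n2 + 1) / 2      # expected rank sum under H0
sd_s = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (s - mean_s + 0.5) / sd_s        # continuity correction of 0.5

print(s, round(z, 4))  # 185 -4.6667
```

Both values match the SAS output: S = 185 and the continuity-corrected normal approximation Z = -4.6667.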
The average coding time for all three problems was 733 minutes for the Dynamic Pairs PP groups and 196 minutes (73% less than PP) for the Dynamic Pairs CAP groups. Moreover, all nine Dynamic Pairs CAP programs' coding times were less than those of 75% of the Dynamic Pairs PP programs. The average coding time for all three problems was 1062 minutes for the Static Pairs PP groups and 340 minutes (68% less than PP) for the Static Pairs CAP groups. Moreover, all nine Static Pairs CAP programs' coding times were less than the least coding time of the Static Pairs PP programs.

H04 (The cost of the CAP coding phase is equal to or higher than the cost of the individual programming coding phase on average): The hypothesis was not supported, with p=0.0113. Thus we have sufficient statistical evidence to accept the alternative hypothesis that the coding-phase cost or time of CAP is less than that of IP on average. The average coding time for all three problems was 444 minutes for the IP groups and 269 minutes (39% less than IP) for the CAP groups.

4.7.6.3. Program Correctness

H05 (The number of acceptance tests failed in CAP is equal to or higher than the number of acceptance tests failed in PP on average): The number of acceptance tests failed in CAP was less than the number failed in PP. Therefore, there is insufficient support for the hypothesis, and we accept the alternative hypothesis that the number of acceptance tests failed in CAP is less than the number failed in PP on average.

A summary of the four control experiments and their results is given in Table 4.10.
Control Experiment          Null Hypothesis           Sample Size  Data Properties                            Statistical Test     Result    Reject?
Control Experiment-1        H01 (Overall Time/Cost)   18           Normal, Unequal Variance, No Outliers      Student t-Test       p=0.0129  Yes
(CAP Vs PP, Dynamic         H03 (Coding Time/Cost)                 Normal, Unequal Variance, No Outliers      Student t-Test       p=0.0028  Yes
Pairs, Fall 2008)           H05 (Correctness)                      Not Applicable                             None                 CAP failed fewer acceptance test cases than PP  Yes
Control Experiment-2        H01 (Overall Time/Cost)   18           Normal, Unequal Variance, No Outliers      Student t-Test       p=0.0011  Yes
(CAP Vs PP, Static          H03 (Coding Time/Cost)                 Not Normal, Unequal Variance, No Outliers  Mann-Whitney U Test  p=0.0026  Yes
Pairs, Spring 2009)
Combined CAP Vs PP          H01 (Overall Time/Cost)   36           Normal, Unequal Variance, No Outliers      Student t-Test       p<0.0001  Yes
                            H03 (Coding Time/Cost)                 Not Normal, Unequal Variance, No Outliers  Mann-Whitney U Test  p<0.0001  Yes
CAP Vs IP                   H02 (Overall Time/Cost)   33           Normal, Equal Variance, No Outliers        Student t-Test       p=0.1532  No
                            H04 (Coding Time/Cost)                 Normal, Equal Variance, No Outliers        Student t-Test       p=0.0113  Yes

Table 4.10: Summary of Control Experiments and their Results

4.8. Observations

We implemented two different pairing strategies during the control experiments. In Fall 2008 we adopted the dynamic pairing technique, and in Spring 2009 we adopted the static pairing technique (see Section 4.6 for more detail about dynamic and static pairing). During this one-year period, the subjects completed 105 problems. Here are some interesting observations we made during this period:

1) Existing empirical evidence [Williams et al. 2000] shows that the overall software development time or cost of pair programmers is highest at the beginning of a project, due to pair jelling, and decreases considerably as the project progresses. The dynamic pairs'
pair programming experiment's empirical evidence shows that no regularity in productivity rates, and no decrease in development time, could be detected between projects; whereas for the static pairs we observed an improvement in productivity, i.e., a decrease in development time (see Figure 4.52), due to the pair-jelling effect as the project progressed.

Figure 4.52: Average Software Development Time for Static PP and Dynamic PP

2) Static PP helps programmers solve routine or similar kinds of problems (Problem1 and Problem2 in our case) faster than dynamic PP, as shown in Figure 4.52. But dynamic pairing (both dynamic PP and dynamic CAP) helps programmers solve a new kind of problem (Problem3 in our case) faster than its static counterpart, as can be observed from Figures 4.52 and 4.53.

Figure 4.53: Average Software Development Time for Static CAP and Dynamic CAP

3) The productivity of the dynamic PP groups was better than that of the static PP groups. The average time taken to solve all three problems was 954 minutes for the dynamic PP groups, whereas the static PP groups took 1399 minutes (the dynamic PP groups were 32% faster). At the same time, we did not observe any difference in productivity between the static CAP groups and the dynamic CAP groups; the average time taken to solve all three problems was 573 minutes for the dynamic CAP groups and 578 minutes for the static CAP groups.

4) One of the major benefits of collaborative programming is pair pressure [Williams et al. 2000]. During the entire control experiment period we observed the existence of pair pressure among both the CAP programmers and the pair programmers. When they met, both partners worked intensively and were motivated to complete their assigned task within the specified time period.
This motivation was lacking among the individual programmers; some individual programmers even withdrew in the middle of the experiment. At the same time, we did not observe any gain in productivity and/or quality by the pair programmers due to pair pressure, as indicated by [Williams et al. 2000].

5) We observed that the pairs in CAP discussed more at design time and created more concrete designs than their PP counterparts. The pairs in CAP also know that after the design phase they will play an adversarial role in the implementation stage (the goal of the implementer is to build working software, whereas the goal of the tester is to break the software). We believe this forces them to discuss more in the design stage before moving to the implementation stage. Since the PP developers know that they will have a partner throughout all the development phases, we feel that the confidence of having a partner in the development stage turns into overconfidence, and they do not discuss much in the design stage. Furthermore, this overconfidence leads to a design that is not concrete, which in turn causes them to change their design more often in the coding phase and to spend 50% more time than their CAP counterparts.

5. CONCLUSIONS AND FUTURE WORK

5.1. Conclusions

In this research we have proposed a new, stable and reliable agile software development methodology called Collaborative-Adversarial Pair (CAP) programming. We see CAP as an alternative to traditional pair programming in situations where pair programming is not beneficial or is not possible to practice. CAP was evaluated against traditional pair programming and individual programming in terms of productivity and program correctness. The empirical evidence shows that traditional pair programming is an expensive technology and does not necessarily produce programs of better quality, as claimed by the pair programming advocates.
The empirical evidence shows that better-quality programs can be produced in 40% less time using the dynamic pairs CAP technique than using the dynamic pair programming technique; that better- or equal-quality programs can be produced in 57% less time using the static pairs CAP technique than using the static pair programming technique; and that, overall, better- or equal-quality programs can be produced at much lower cost (50% less overall software development time than traditional PP) using the CAP technique. The empirical evidence also shows that CAP is a cheaper technology than individual programming; using CAP we can produce programs of equal or better quality with a 17% reduction in overall software development cost on average.

The empirical evidence shows that better- or equal-quality code can be produced in 73% less time using the dynamic pairs CAP technique than using the dynamic pair programming technique; that better- or equal-quality code can be produced in 68% less time using the static pairs CAP technique than using the static pair programming technique; and that, overall, better- or equal-quality code can be produced at much lower cost (70% less than traditional PP) using the CAP technique. The empirical evidence also shows that CAP is a cheaper technology than individual programming; using CAP we can produce code of equal or better quality with a 39% reduction in coding cost on average.

It is expected that CAP will retain the advantages of pair programming while at the same time downplaying the disadvantages. In CAP, units are implemented by single developers (whereas in pair programming two developers develop a unit), and functional test cases can be developed in parallel with unit implementation. This, in turn, reduces the overall project development interval.
The CAP testing procedure judiciously combines functional (black-box) and structural (white-box) testing, which gives the software the confidence of functional testing and the measurement of structural testing. CAP allows us to confidently test purchased or contracted software modules and add them to the existing software. Finally, the functional test cases in CAP allow us to change the implementation without changing the test cases, and vice versa.

5.2. Future Work

- External validity, the ability of experimental results to apply to the world outside the research environment over variations in persons, settings, treatments, and outcomes, is very important for any research study. We carefully planned our CAP validation to meet these external validity requirements. Though the software development environment we provided closely matches an industrial software development environment, the experimental system and tasks in this experiment were clearly small compared with industrial software systems and tasks. Therefore, we cannot rule out the possibility that the observed results would have been different had the system and tasks been larger. Hence, validation of the results with professional programmers in an industrial setting would be beneficial.

- We aim to design, build, and test a stable and reliable new agile software development methodology called Team Collaborative-Adversarial Pair (TCAP) programming, suitable for software development teams. To achieve this goal, we will employ the CAP process as the basic building block in designing and building TCAP.

- Currently we have integrated and validated the CAP methodology within the Extreme Programming process. In the future, we plan to integrate CAP into other agile development methodologies as well.

- We also plan to develop a tool set to support the CAP methodology.

REFERENCES

[Abrahamsson et al.
2004] Pekka Abrahamsson, Antti Hanhineva, Hanna Hulkko, Tuomas Ihme, Juho Jäälinoja, Mikko Korkala, Juha Koskela, Pekka Kyllönen, and Outi Salo. Mobile-D: An Agile Approach for Mobile Application Development, OOPSLA'04, Oct. 24-28, 2004, Vancouver, British Columbia, Canada. ACM 1-58113-833-4/04/0010.

[Al-Kilidar et al. 2005] Al-Kilidar, H., Parkin, P., Aurum, A., Jeffery, R. Evaluation of effects of pair work on quality of designs. In: Proceedings of the 2005 Australian Software Engineering Conference (ASWEC 2005), Brisbane, Australia. IEEE CS Press, pp. 78-87.

[Anderson et al. 1998] A. Anderson, R. Beattie, et al., Chrysler Goes to "Extreme", http://www.xprogramming.com/publications/dc9810cs.pdf

[Arisholm et al. 2007] Erik Arisholm, Hans Gallis, Tore Dyba, and Dag I.K. Sjøberg, Evaluating Pair Programming with Respect to System Complexity and Programmer Expertise, IEEE Transactions on Software Engineering, Vol. 33, No. 2, Feb 2007.

[Astels 2003] David Astels, Test-Driven Development: A Practical Guide, Prentice Hall, 2003.

[Bagley et al. 2007] Carole A. Bagley and C. Candace Chou, Collaboration and the Importance for Novices in Learning Java Computer Programming, ITiCSE'07, June 23-27, 2007, Dundee, Scotland, United Kingdom.

[Beck et al. 1989] Kent Beck and Ward Cunningham, A Laboratory For Teaching Object-Oriented Thinking, OOPSLA'89 Conference Proceedings, October 1-6, 1989, New Orleans, Louisiana.

[Beck 2000] Kent Beck, Extreme Programming Explained: Embrace Change, Addison-Wesley, 2000, ISBN 0201616416.

[Beck 2003] Kent Beck, Test-Driven Development: By Example, Addison-Wesley, 2003.

[Boutin 2000] Karl Boutin. Introducing Extreme Programming in a Research and Development Laboratory. Extreme Programming Examined, Addison-Wesley, Chapter 25, pages 433-448.

[Canfora et al. 2006] Canfora, G., Cimitile, A., Visaggio, C.A., Garcia, F., Piattini, M., Performances of pair designing on software evolution: a controlled experiment.
In: Proceedings of the 10th European Conference on Software Maintenance and Reengineering, CSMR 2006, 22-24 March, Bari, Italy, pp. 197-205.

[Canfora et al. 2007] Gerardo Canfora, Aniello Cimitile, Felix Garcia, Mario Piattini, and Corrado Aaron Visaggio, Evaluating performances of pair designing in industry, The Journal of Systems and Software 80 (2007) 1317-1327.

[Cockburn et al. 2000] Cockburn, Alistair and Williams, Laurie (2000), "The Costs and Benefits of Pair Programming", Proceedings of the First International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2000).

[Confer 2009] Personal e-mail communication.

[Flor 1991] Flor, N., and Hutchins, E. Analyzing distributed cognition in software teams: A case study of team programming during perfective software maintenance. Proceedings of the Fourth Annual Workshop on Empirical Studies of Programmers (pp. 36-59), 1991, Norwood, NJ: Ablex Publishing.

[Hulkko et al. 2005] Hanna Hulkko and Pekka Abrahamsson, A Multiple Case Study on the Impact of Pair Programming on Product Quality, ICSE'05, May 15-21, 2005, St. Louis, Missouri, USA.

[Jensen 2003] http://www.stsc.hill.af.mil/crosstalk/2003/03/jensen.html

[Lippert et al. 2001] Martin Lippert, Stefan Rooks, Henning Wolf, and Heinz Zullighoven. JWAM and XP: Using XP for Framework Development. Extreme Programming Examined, Addison-Wesley, Chapter 7, pages 103-117.

[Lui et al. 2003] Lui, K., Chan, K., 2003. When does a pair outperform two individuals? In: Proceedings of XP 2003, LNCS. Springer-Verlag, pp. 225-233.

[Lui et al. 2006] Kim Man Lui and Keith C.C. Chan, Pair programming productivity: Novice-novice vs. expert-expert, Int. J. Human-Computer Studies 64 (2006) 915-925.

[Martin 2003] Robert C. Martin, Agile Software Development: Principles, Patterns, and Practices, Prentice Hall, 2003.

[McDowell et al. 2002] McDowell, C., Werner, L., Bullock, H., Fernald, J., 2002.
The effects of pair-programming on performance in an introductory programming course. In: Proceedings of the 33rd SIGCSE Technical Symposium on Computer Science Education. ACM, Cincinnati, KY, USA, pp. 38-42.

[Mendes et al. 2005] Emilia Mendes, Lubna Basil Al-Fakhri, and Andrew Luxton-Reilly, Investigating Pair-Programming in a 2nd-year Software Development and Design Computer Science Course, ITiCSE'05, June 27-29, 2005, Monte de Caparica, Portugal.

[Muller 2004] Matthias M. Müller, Are Reviews an Alternative to Pair Programming? Empirical Software Engineering, 9, 335-351, 2004.

[Muller 2005] Matthias M. Müller, Two controlled experiments concerning the comparison of pair programming to peer review, The Journal of Systems and Software 78 (2005) 166-179.

[Nawrocki et al. 2001] Nawrocki, J. and Wojciechowski, A., 2001. Experimental evaluation of pair programming. In: Proceedings of the European Software Control and Metrics Conference (ESCOM 2001). ESCOM Press, 2001, pp. 269-276.

[Nosek 1998] John T. Nosek, The Case for Collaborative Programming, Communications of the ACM, March 1998, Vol. 41, No. 3.

[perl 2004] http://use.perl.org/~inkdroid/journal/17066

[Pressman 2005] Roger S. Pressman. Software Engineering: A Practitioner's Approach, sixth edition, McGraw Hill, 2005.

[Umphress 2008] Umphress, David. Personal communication.

[Vanhanen et al. 2005] Jari Vanhanen and Casper Lassenius, Effects of Pair Programming at the Development Team Level: An Experiment, IEEE, 2005.

[Wake 2002] William C. Wake, Extreme Programming Explored, Addison-Wesley, 2002.

[Wells et al. 2001] Don Wells and Trish Buckley. The VCAPS Project: An Example of Transitioning to XP. Extreme Programming Examined, Addison-Wesley, Chapter 23, pages 399-421.

[Wiki] http://c2.com/cgi/wiki?PairProgrammingFacilities

[Williams 2001] Laurie Williams, Integrating pair programming into a software development process, Software Engineering Education and Training, 2001. Proceedings.
14th Conference on, 2001, pp. 27-36.

[Williams et al. 2000] Laurie Williams, Robert R. Kessler, Ward Cunningham, Ron Jeffries, Strengthening the Case for Pair Programming, IEEE Software, July/August 2000.

[Williams et al. 2000b] Laurie A. Williams and Robert R. Kessler, All I really need to know about pair programming I learned in kindergarten, Communications of the ACM, Volume 43, Issue 5 (May 2000), pages 108-114.

[Williams et al. 2003] Laurie Williams and Robert Kessler, Pair Programming Illuminated. Addison-Wesley, 2003, ISBN 0-201-74576-3.

[Wilson et al. 1993] Wilson, J., Hoskin, N., Nosek, J., 1993. The benefits of collaboration for student programmers. In: Proceedings of the 24th SIGCSE Technical Symposium on Computer Science Education, pp. 160-164.

[XP 1999] http://www.extremeprogramming.org/rules/pair.html

[Xu et al. 2006] Shaochun Xu, Vaclav Rajlich, Empirical Validation of Test-Driven Pair Programming in Game Development, Proceedings of the 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR'06).

Appendix-A: Pair Programming Experiments Analyzed

S. No  Study                                               Year        Selected?  Comments
1      Wilson et al. [Wilson et al., 1993]                 1993        Y
2      Nosek [Nosek, 1998]                                 1998        Y
3      Williams et al. [Williams et al., 2000]             1999        Y
4      Nawrocki and Wojciechowski [Nawrocki et al., 2001]  1999/2000   Y
5      McDowell et al. [McDowell et al., 2002]             2000/2001   Y
6      Baheti et al.                                       2002        N          Distributed PP experiment
7      Rostaher et al.                                     2002        Y
8      Heiberg et al.                                      2003        N          Not a PP vs. Solo experiment; it is a PP vs. two-person-team experiment
9      Canfora et al.                                      2007        N          Design phase only
10     Müller [Muller, 2005]                               2002/2003   Y
11     Vanhanen and Lassenius [Vanhanen et al., 2005]      2004        Y
12     Madeyski                                            2006        N          Design phase only
13     Müller [Muller 2006]                                2005        Y
14     Monvorath et al.                                    2004, 2005  N          Compares PP vs. inspection techniques practiced only in Thailand
15     Xu and Rajlich [Xu et al., 2006]                    2005, 2006  Y
16     Canfora et al.                                      2005        N          Each subject performed both PP and solo programming alternately
17     Arisholm et al. [Arisholm et al., 2007]             2001, 2004/2005  Y
18     Hulkko and Abrahamsson [Hulkko et al., 2005]        2004        Y
19     Lui and Chan [Lui et al. 2006]                      2005        N          Repeat experiment; compares Novice-Novice pairs against Expert-Expert pairs
20     Jensen                                              1996        N          Not a PP vs. Solo experiment; pairs only
21     Mendes et al.                                       2005        N          PP used as a teaching tool
22     Carver et al.                                       2007        N          PP used as a teaching tool
23     Carole and Chou                                     2007        N          PP used as a teaching tool
24     Cliburn                                             2003        N          PP used as a teaching tool
25     Phongpaibul and Boehm                               2006        N          Comparison of pair development and software inspection in Thailand
26     McDowell et al.                                     2003        N          PP used as a teaching tool
27     McDowell et al.                                     2003        N          PP used as a teaching tool
28     Cubranic and Storey                                 2005        N          Pairs of first-year CS students used to evaluate a prototype
29     Hanks et al.                                        2004        N          PP used as a teaching tool
30     Gehringer                                           2003        N          PP used as a teaching tool
31     Nagappan et al.                                     2003        N          PP used as a teaching tool
32     Succi et al.                                        2001        N          Only job satisfaction analysis
33     Bellini et al.                                      2005        N          Design phase only
34     Al-Kilidar et al.                                   2005        N          Design phase only
35     Canfora et al.                                      2006        N          Design phase only

Appendix-B: IRB Documents