Exploring the Impact of Socio-technical Communication Styles on the Robustness and Innovation Potential of Global Participatory Science

by

Özgür Özmen

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 5, 2013

Keywords: Complex Adaptive Systems, Agent Based Simulation, Self-Organization, Communication, Collective Action, Innovation, Diversity, Robustness, Resilience, Social Network, Knowledge Creation

Copyright 2013 by Özgür Özmen

Approved by:

Jeffrey Smith, Professor of Industrial and Systems Engineering
Levent Yilmaz, Associate Professor of Computer Science
Alice Smith, Professor of Industrial and Systems Engineering

Abstract

Emerging cyber-infrastructure tools are enabling scientists to transparently co-develop, share, and communicate in real-time diverse forms of knowledge artifacts. In this research, these collaborative environments are modeled as complex adaptive systems using collective action theory as a basis. Communication preferences of scientists are posited as an important factor affecting innovation capacity and resilience of social and knowledge network structures. Using agent-based modeling, a complex adaptive social communication network model is developed. By examining the Open Biomedical Ontologies (OBO) Foundry data and drawing conclusions from observing the Open Source Software communities, a conceptually grounded model mimicking the dynamics in what is called Global Participatory Science (GPS) is presented. Social network metrics and knowledge production patterns are used as proxy metrics to infer innovation potential of emergent knowledge and collaboration networks. Robust communication strategies with regard to innovation potential are questioned by exploring different parameter and mechanism configurations.
The objective is to present the underlying dynamics of GPS in the form of a computational model that enables analysis of the impacts of various communication preferences of scientists on the innovation potential of the collaboration network. The ultimate goal is to further our understanding of the dynamics in GPS and to facilitate the development of informed policies that foster innovation capacity.

Acknowledgments

I would like to express my gratitude to the advisory committee of this dissertation. Without their support, suggestions, criticisms, and credence, this work would not have been possible. Special thanks to Dr. Levent Yilmaz for introducing me to Complex Adaptive Systems, for countless creative discussions, for his encouragement, and for his never-ending confidence in me. Special thanks also to Dr. Jeffrey Smith for mind-stretching arguments, his expertise, and his guidance. Thanks to Dr. Alice Smith for recruiting me to the Auburn family and for her valuable comments on my dissertation. Thanks for their patience at the times I procrastinated. I wish I could be a better student.

Additionally, thanks to the National Science Foundation (NSF) for the partial support as authorized by the contract number NSF-SBE-0830261. The research is also partially funded by the Industrial and Systems Engineering Department assistantships/fellowships at Auburn University.

I would like to dedicate this hard work to my family. To my favorite people who love regardless...

Table of Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables
1 INTRODUCTION
2 LITERATURE REVIEW
  2.1 General Concepts on Science
    2.1.1 Science and Research
    2.1.2 Open Science Paradigm
    2.1.3 Governance of Open Scientific Communities
  2.2 GPS from Complex Adaptive Systems Perspective
  2.3 Understanding CAS by Agent Based Modeling
    2.3.1 Examples of Agent Based Modeling
    2.3.2 Simulation Models of Science
  2.4 Communication and Collective Action in GPS
  2.5 Understanding GPS as an Innovation and Collaboration Network
    2.5.1 Social Network Metrics and Innovation Potential
  2.6 Diversity and Innovation Potential
  2.7 Robustness and Resilience in Socio-technical Systems
3 RESEARCH PROBLEMS AND METHODOLOGY
  3.1 Stakeholders of the Research
  3.2 Research Problems
  3.3 Methodology Chart
    3.3.1 Phase 1 - Conceptual Model Theory Base
    3.3.2 Phase 2 - Social Communication Model and Innovation
    3.3.3 Phase 3 - Robustness Analysis
4 BASE-MODEL COMPONENTS AND VALIDATION
  4.1 Base-Model Mechanisms
    4.1.1 Artifact Selection
    4.1.2 Collective Action Mechanism
    4.1.3 Learning and Influencing Processes
    4.1.4 Information Foraging Mechanisms
    4.1.5 Population Dynamics
    4.1.6 SEIR Metaphor
    4.1.7 Conceptual Model Validation
  4.2 Computational Model and Repast Implementation
  4.3 Model Verification
  4.4 Model Validation
    4.4.1 OBO data and Over-fitting Problem
  4.5 Initial Conditions and Terminating State Decision
    4.5.1 Initial Conditions
    4.5.2 Bonferroni Analysis
  4.6 Activity Time Series
  4.7 Power-Law Distributions
  4.8 Collaboration Network Phases
5 SOCIO-COMMUNICATION MODEL AND EXPLORATORY ANALYSIS
  5.1 Communication Preferences
    5.1.1 Random Connection
    5.1.2 Human Capital
    5.1.3 Social Capital
    5.1.4 Homophily Theory
    5.1.5 Social Exchange Theory
  5.2 Learning Process
  5.3 Social Network Metrics
  5.4 Diversity and Interdisciplinarity
  5.5 Sensitivity Analysis
    5.5.1 Response Surface Analysis
    5.5.2 Communication Preferences
6 ROBUSTNESS IN GLOBAL PARTICIPATORY SCIENCE
  6.1 Exploratory Modeling
    6.1.1 The Use of Genetic Algorithms
    6.1.2 Metamorphic Testing
  6.2 Design Decisions Relative to the GA
    6.2.1 Encoding and Decoding of the Parameter Space
    6.2.2 Activity Flow of the GA Module
    6.2.3 Metamorphic Relations
    6.2.4 Fitness Function
    6.2.5 Selection
    6.2.6 Crossover and Mutation
    6.2.7 Culling
  6.3 Analysis
    6.3.1 Results
  6.4 Limitations and Conclusions
7 CONCLUSIONS AND FUTURE RESEARCH
Bibliography
Appendix A
  A.1 Termination State Analysis
  A.2 Response Surface Analysis
  A.3 Core/Periphery Calculation Method

List of Illustrations

3.1 Network Visualizations of Micro-level, Meso-level, and Macro-level Science Studies
3.2 The Chart of Objectives and Methodology
3.3 Exploratory Software Coupling GA and Metamorphic Relations
4.1 Moving and Artifact Selection Processes in the Model
4.2 Shape of the Function that Updates Tension
4.3 Information Foraging Behavior
4.4 Population Dynamics in the Environment
4.5 SEIR Model
4.6 Activity Flow Diagram of a Scientist
4.7 RePast API
4.8 Life Cycle of a Simulation Study
4.9 Sample Core-Periphery Structures at time 400
4.10 Number of Active Artifacts and Active Scientists Over Time - OBO
4.11 Number of Active Artifacts and Active Scientists Over Time - Simulated
4.12 Log-Log Plot of Degree Distribution and Contribution Distributions of OBO
4.13 Log-Log plot of Degree Distribution of Scientists - Simulated
4.14 Log-Log Plot of Contribution Distribution of Artifacts - Simulated
4.15 Log-Log Plot of Contribution Distribution of Scientists - Simulated
4.16 Emergent Network Patterns Over Time - Theoretical Model
4.17 Emergent Network Patterns Over Time - OBO Data
4.18 Emergent Network Patterns Over Time - Simulated
5.1 Activity Flow Diagram of Random Connection Mechanism
5.2 Activity Flow Diagram of Communication Mechanisms
5.3 Network Visualizations at Terminating State
5.4 Density
5.5 Degree Centrality
5.6 Clustering Coefficient
5.7 Average Path Length
5.8 Core/Periphery Ratio
5.9 Diversity Among Scientists
5.10 Diversity Among Artifacts
5.11 Diversity Among Links
5.12 Diversity Among Nodes
5.13 Expertise Distribution
5.14 Maturity Distribution
5.15 Disparity Distribution Among Nodes
5.16 Disparity Distribution Among Links
5.17 Active Scientists
5.18 Active Artifacts
5.19 Small-world Phenomenon
5.20 Node Diversity vs Network Coherence
6.1 Encoding and Decoding of the Parameters
6.2 Activity Flow Diagram of the Genetic Algorithm
6.3 Average Fitness Values over Generations
6.4 Minimum and Average Fitness Values for each Theory
6.5 Integer Parameters of the Fittest Scenario over Generations
6.6 Floating Number Parameters of the Fittest Scenarios over Generations
6.7 Communication Mechanism of the Fittest Scenarios over Generations
6.8 95% Confidence Intervals for Different Communication Mechanisms
A.1 Density Over Time - OBO Scenarios
A.2 Degree Centrality Over Time - OBO Scenarios
A.3 Clustering Coefficient Over Time - OBO Scenarios
A.4 Average Path Length Over Time - OBO Scenarios
A.5 Diversity (Scientist Population) Over Time - OBO Scenarios
A.6 Diversity (in the network) Over Time - OBO Scenarios
A.7 Density Over Time - Random Connection
A.8 Degree Centrality Over Time - Random Connection
A.9 Clustering Coefficient Over Time - Random Connection
A.10 Average Path Length Over Time - Random Connection
A.11 Diversity (Scientist Population) Over Time - Random Connection
A.12 Diversity (in the network) Over Time - Random Connection
A.13 Core/Periphery Activity Diagram

List of Tables

2.1 Social Communication Theories
3.1 Traditional Science vs. Global Participatory Science
4.1 Conceptual Mechanisms and Assumptions
4.2 Initial Settings of the Model
4.3 Maximum n Values Found
5.1 Summary of the Sensitivity Analysis
6.1 The Bit Values in Initial Genome
6.2 Decoded Values of the Identified Portfolio of Scenarios
6.3 Fitness Values - The Most Robust Parameter Scenario
6.4 Analysis of Variance for 20 batches of 10 replications
6.5 Mean of Various Metrics - The Most Robust Scenario
6.6 Standard Deviation of Various Metrics - The Most Robust Scenario
A.1 Parameter Values for RSM
A.2 Summary of Response Surface Analysis - 1
A.3 Summary of Response Surface Analysis - 2
Chapter 1

INTRODUCTION

Science is becoming increasingly global and participatory due to online collaboration opportunities such as e-mailing, web-based social networking, and open-access collaboration platforms. Hence, scientists interact not only locally but also globally by constructing self-organizing collaboration networks. Wagner (2008) states the following regarding the emergence of these fluid networks:

"They constitute an invisible college of researchers who collaborate not because they are told to but because they want to, who work together not because they share a laboratory or even a discipline but because they can offer each other complementary insight, knowledge, or skills."

One of the most significant problems in organizational scholarship is to discern how social collectives govern, organize, and coordinate the actions of individuals to achieve collective outcomes (O'Mahony and Ferraro, 2007). The first phase of this research explores micro-level (inter-scientist) socio-technical processes and mechanisms that explain emergent behaviors observed in scientific communities that collaborate over the cyber-infrastructure. Scientific knowledge creation in such communities is called Global Participatory Science (GPS) (Zou and Yilmaz, 2011). First, based on the views advocated by Wagner (2008) and Monge and Contractor (2003), the structure and behavior of GPS are interpreted as those of a complex adaptive system (CAS). Second, recent ethnographic studies (Nielsen, 2010; Ostrom and Hess, 2007), which suggest that GPS is a collective action undertaken by autonomous self-organizing scientists, are leveraged. The first question of interest is "Which interaction mechanisms in the literature explain the operational behavior of GPS and its underlying socio-technical processes?" The focus of this research then turns to "How can we specify and implement these mechanisms in the form of a computational model to gain empirical insight and perform exploratory analysis?"
Wagner (2008) demonstrates that science is complex because researchers interact in both competitive and cooperative ways, with no imposed blueprint. Furthermore, she states that it is adaptive because scientists respond to environmental changes such as funding preferences and new discoveries. In this work, information foraging, preferential attachment, and population dynamics are conceptualized as the underlying self-organization mechanisms of knowledge creation in GPS.

There are simulation studies that explore knowledge creation processes in science (Shrager and Langley, 1990; Cowan and Jonard, 2004; Gilbert, 1997). However, these models do not take social interactions into account. Using collective action theory, which includes models of self-interest (based on knowledge gain), exposure (based on social influence), cognitive burden (based on the expertise of scientists), and tension (based on the complexity of the projects) in scientific knowledge generation, a theoretically grounded conceptual model of scientist behavior is developed.

An understanding of CAS is more likely to arise with the help of computer-based models (Holland, 1996). Agent Based Modeling (ABM) provides the opportunity to directly identify individual entities along with their relationships and capabilities. Hence, simulation of these mechanisms using the ABM worldview is the method adopted in this study.

ABM involves boundedly rational human agents. Therefore, validation of these models is a challenging task, and the assumptions of the models should be explicitly explained. At the same time, computational laboratories are not supposed to be, and indeed should not be, exact replications of reality (Burton, 2003). For validation purposes, the presence of scale-free networks, adaptive/renewal activity cycles, and network formation phases (e.g., core/periphery) is investigated. These are known characteristics of collaboration networks and GPS.
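The link between preferential attachment and the scale-free networks used for validation can be illustrated with a minimal sketch. It is not the dissertation's model: the function names, the one-link-per-newcomer rule, the network size, and the seed are assumptions chosen only to show how degree-proportional linking produces a heavy-tailed degree distribution.

```python
import random
from collections import Counter

def grow_network(n_nodes, seed=0):
    """Grow a network one node at a time (hypothetical toy model).

    Each newcomer links to one existing node chosen with probability
    proportional to that node's current degree.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single connected pair
    endpoints = [0, 1]    # each node appears once per incident edge
    for new in range(2, n_nodes):
        # Sampling uniformly from the endpoint list is a
        # degree-proportional choice over the existing nodes.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

def degree_counts(edges):
    """Return a Counter mapping node -> degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

edges = grow_network(2000)
deg = degree_counts(edges)
histogram = Counter(deg.values())   # degree -> number of nodes with it

# Printing the histogram shows many low-degree nodes and a few hubs.
for k in sorted(histogram)[:5]:
    print(k, histogram[k])
```

Plotting `histogram` on log-log axes is one common way to inspect such a distribution visually, although a straight-looking line alone is not a formal test of a power law.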
Communication among agents in a CAS has an intense effect on system-level behavior (Shlesinger, 2007). Wagner (2008) indicates that if we can discern identifiable patterns and mechanisms of communication among scientists, then that understanding can help determine how the scientific endeavor operates and how policymakers can effectively influence its evolution and growth. At an NSF workshop in 2006 [1], participants called for new theoretical models that foster the understanding of innovation through computational and cognitive models of creativity (Yilmaz, 2008b). The second phase of this research focuses on implementing computational mechanisms of selected social communication theories: (1) homophily theory, (2) social capital theory, (3) human capital theory, and (4) social exchange theory. The goal is then to evaluate the evolution of the network. The question to address is: "Which social communication mechanisms among scientists are more effective in fostering innovation potential?"

Generative mechanisms of social capital, human capital, homophily, and social exchange theories, which are social communication theories applicable to the problem domain, are implemented. There are studies that discuss diversity (Dhanaraj and Parkhe, 2006; Powell et al., 1996) and network connectivity (Pyka, 2009; Burt, 1995) as potential indicators of innovativeness. Diversity of the emergent knowledge and collaboration network structures is measured and used as an indicator of interdisciplinarity. The core/periphery ratio, small-world phenomena (based on clustering coefficient and average path length), degree centrality, and density of emergent collaboration networks are also analyzed to assess innovation potential.
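As a rough illustration of the network metrics named above, the following sketch computes density, degree centrality, average clustering coefficient, and average path length for a tiny undirected graph. The graph, the node labels, and the function names are hypothetical, and the standard textbook definitions are assumed; the dissertation may compute these quantities differently.

```python
from collections import deque
from itertools import combinations

def density(adj):
    """Fraction of possible undirected links that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def average_clustering(adj):
    """Mean local clustering; nodes with fewer than 2 neighbors count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical collaboration graph: a triangle A-B-C with D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(density(adj), average_clustering(adj), average_path_length(adj))
```

A network is usually called small-world when its clustering is much higher than that of a comparable random graph while its average path length stays about as short; that comparison step is left out of this sketch.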
Additionally, the utility of further activity metrics (the number of active members, the distribution of expertise) as proxy metrics of innovation potential is evaluated.

[1] NSF/SRS Workshop on Advancing Measures of Innovations: Knowledge Flows, Business Metrics, and Measurement Strategies, 2006

The OECD and European Commission encourage research on innovation metrics, unanticipated consequences, and unrealized opportunities emerging from complex adaptive environments. Scientific networks cannot be controlled, but policymakers can guide them by influencing collaboration environments. Given the uncertainty and the lack of knowledge about the environment in science-based innovation systems such as GPS, designing robust systems is important, as opposed to searching for an optimal system design. The third phase of the research focuses on the question: "How can different parameter and mechanism configurations be explored to seek and identify more robust communication strategies in terms of the variance of innovation potential metrics?" To address this question, an exploratory modeling tool is developed that couples a genetic algorithm with a metamorphic relations module to create models with distinct scenarios. Hence, decision-makers (policy-makers in this research) can explore different parameter and mechanism set-ups. Mean absolute errors (MAE) of different metrics are used to measure the robustness performance of each point in the scenario space. A feedback mechanism based on robustness performance and metamorphic relations is constructed to facilitate further generation and exploration of the scenario space. The more robust regions of the scenario space are investigated, while the average robustness behavior of each communication mechanism under various conditions (over generations) is monitored. Additionally, the levels at which parameter values converge in more robust landscapes are evaluated.
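To make the idea of an MAE-based robustness measure concrete, here is a minimal sketch. It rests on assumptions not stated in the text: the error of each replication is taken relative to the mean of that metric across replications, the per-metric errors are averaged without normalization, and a lower score is read as more robust. The metric names and values are invented for illustration.

```python
from statistics import mean

def mean_absolute_error(values, reference):
    """Average absolute distance of the values from a reference point."""
    return mean(abs(v - reference) for v in values)

def robustness_score(replications):
    """Score one scenario (hypothetical definition, lower = more robust).

    replications: dict mapping a metric name to the list of values
    observed over repeated simulation runs of the same scenario.
    """
    per_metric = []
    for values in replications.values():
        center = mean(values)   # assumed reference: the replication mean
        per_metric.append(mean_absolute_error(values, center))
    return mean(per_metric)

# Invented outputs of two scenarios over three replications each.
scenario_a = {"density": [0.10, 0.12, 0.11], "clustering": [0.40, 0.42, 0.38]}
scenario_b = {"density": [0.05, 0.20, 0.11], "clustering": [0.10, 0.60, 0.35]}

score_a = robustness_score(scenario_a)
score_b = robustness_score(scenario_b)
print(score_a < score_b)   # scenario_a varies less across replications
```

In a genetic algorithm, a score of this kind could serve as the fitness value to minimize. Metrics on very different scales would probably need normalization before averaging, which this sketch omits.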
The tradeoffs between robustness and innovation potential are delineated by social network metrics.

The National Academy of Engineering report indicates that leadership in innovation is essential to U.S. prosperity and security [2]. By developing agent-based models and conducting the analysis described above, the goal is to further our understanding of innovation in GPS and to support policy-makers in nurturing open scientific environments.

[2] http://www.nae.edu - As of 4.07.2013

In the following chapter, a summary of the literature relevant to the topic is provided. Chapter 3 outlines the methodology adopted and introduces the research problems in detail. In Chapter 4, the base-model design is discussed along with the verification and validation efforts conducted, and Chapter 5 covers the socio-communication model development and sensitivity analysis. Chapter 6 introduces the exploratory modeling algorithm designed for policy-makers to explore different mechanism and parameter spaces for the discovery of robust strategies. In Chapter 7, the concluding remarks and future work of the research are summarized.

Chapter 2

LITERATURE REVIEW

This chapter provides background for the research conducted. It starts with general concepts on science and research, followed by the definitions that help us understand the environment of inquiry. Subsequently, the governance mechanisms that exist in GPS are summarized, supported by ethnographic studies on Open Source Software (OSS) communities. Thereafter, GPS is examined from a CAS perspective, and how to study CAS is discussed. The domain knowledge is concluded with an introduction to ABM and simulation studies relevant to this research. Then the theoretical basis of the computational models implemented in this research is delineated. Subsequently, the literature on collaboration networks such as GPS is reviewed from the innovation and social network perspectives.
In the last section, several concepts about the robustness and resilience of socio-technical systems are introduced.

2.1 General Concepts on Science

There is no precise, widely accepted definition of science. For example, Feynman (1998) offers a range of definitions for science, from a special method of finding things out, to the knowledge arising from the things found out, to the new things brought by the things found out or the doing of new things. In another study, science is described as a mode of inquiry (Epstein, 2008). Apart from these different interpretations of science, this research is interested in the creative process and the environmental changes that occur during the creation of knowledge in scientific environments. Therefore, an introduction to traditional science and research activities is given in the following section.

2.1.1 Science and Research

In his influential work, Kuhn (1996) discusses how scientific fields develop as a result of cyclic revolutionary efforts. In normal science, researchers build on past scientific accomplishments that are accepted by the community they work in, and they find themselves practicing in line with the foundations of that particular community, within the same paradigm. A paradigm denotes the avenues of questioning and the style of conducting research. It emerges through time based on the previous successes of the paradigm fellows' scientific efforts. No paradigm can explain a phenomenon exactly, but a widely accepted paradigm survives, forming disciplines and professions. Scientists learn to practice through disciplinary laws, concepts, and their implementations. Since a paradigm cannot explain all the facets of a phenomenon, new phenomena and anomalies occur. These anomalies can result in two outcomes: novelty of fact or novelty of theory. Consequently, these novelties can cause refinement of the paradigm or its failure.
Failure means a crisis that ends either in a revolution (a new paradigm) or in an exception left for prospective researchers to handle.

Adaptive cycles are a good way to explain this progress in science. Walker et al. (2004) describe four states of an adaptive cycle: "growth and exploitation phase, conservative phase, chaotic collapse, and innovation phase." Growth is cumulative, but only up to a certain extent. After a paradigm reaches the conservative phase, a reorganization or a paradigm shift is needed because the paradigm can no longer explain the phenomenon. Chaotic collapse brings new opportunities that compete against each other, and then the innovation phase starts a new cycle with the fittest novel strategy in the environment.

On the other hand, research is defined as puzzle-solving (Kuhn, 1996; Arndt, 1985). The aim of traditional research is to explain what is known in advance to be there. Although such research contributes to the existing paradigm, the answers it provides are questionable. Research is cumulative while scientific revolutions are non-cumulative, and science improves by paradigm shifts as a result of an evolutionary process in which the fittest survive.

While contradictory definitions exist about the difference between research and science, Latour (1998) offers a noteworthy observation to explain the scientific endeavor emerging today:

"Science is certainty; research is uncertainty. Science is supposed to be cold, straight, and detached; research is warm, involving, and risky. Science puts an end to the vagaries of human disputes; research creates controversies. Science produces objectivity by escaping as much as possible from the shackles of ideology, passions, and emotions; research feeds on all of those to render objects of inquiry familiar."

Scientists form hierarchically structured teams or groups and synchronously conduct traditional science activities in a similar environment.
Traditional science has a high entry threshold and a static structure in terms of participant turnover, and it yields finished products, usually in the form of publications. The complex traits of the new scientific activities differ from those of traditional scientific activities with regard to communication styles, the proximity and mobility of the actors, organizational hierarchies, and the products of the collaboration.

Gibbons et al. (1997) call the traditional way of doing science Mode-1 knowledge production. It takes place in a disciplinary framework and is hierarchical and institutionalized, whereas Mode-2 science, as a new process of knowledge production, is transdisciplinary. Mode-2 science includes many diverse scientists or participants in the process. It has deeper quality control because the product is socially accountable. It is formed around an application area or an artifact (any form of information, such as a document, code, or vocabulary), with contributors from a diverse range of disciplines, and the product is transdisciplinary, which means it cannot be reduced to the contributing disciplines. Mode-2 threatens the existence of Mode-1 knowledge production, but Mode-1 and Mode-2 coexist.

2.1.2 Open Science Paradigm

Open Science is defined as a mode of knowledge production in which the disclosed knowledge of earlier participants serves as an input for future researchers (Mukherjee and Stern, 2009). David (1998) defines the force of the universalist pattern of open science as providing entry into scientific artifacts and open discussion by all participants, while promoting "openness" in regard to new findings. Carayol and Dalle (2007) explain the open-science phenomenon as the significant freedom of scientists to choose whatever they want to do and however they want to do it. It is a Mode-2 way of knowledge production and is un-institutionalized.
With the increasing use of the Internet and collaboration tools, the terms e-science (The FANTOM Consortium, 2005) and service-oriented science (Booth et al., 2004) have been coined to refer to scientific research enabled by networks of loosely connected communicating services. Cofundos (Auer and Braun-Thürmann, 2011), a stakeholder-driven research platform, the openscience project [1], creativecommons [2], and innocentive [3] are some of the web platforms fostering innovation and creating governance mechanisms in open science. The Open Science Summit [4] is another initiative that brings together open science efforts to discuss the future of this emerging way of doing science. Open Science GRID (OSG), Enabling Grids for eScience in Europe (EGEE), the Biomedical Informatics Research Network (BIRN), the Network for Earthquake Engineering Simulation (NEES) of the National Science Foundation, and Open Biomedical Ontologies (OBO) on Sourceforge are some examples of open science initiatives (Foster, 2005).

[1] www.openscience.org - As of 4.07.2013
[2] http://creativecommons.org/science - As of 4.07.2013
[3] http://www.innocentive.com/ - As of 4.07.2013
[4] http://opensciencesummit.com/ - As of 4.07.2013

Merton (1979) described four basic elements of a community: "universalism (a shared interpretation), communism (information sharing), disinterestedness (having objective scientific inquiry), and organized skepticism (proof and review process)." Open Source Software Development (OSSD) communities, which have the aforementioned features, also produce science in the form of software. OSSD governance frameworks are therefore beneficial for explaining open science. The GNU project [5], the Apache Software Foundation [6], and the Linux operating system [7] are successful OSSD communities that are still active.

Nowotny et al. (2001) describe the main features of the Mode-2 science environment (the agora), in which science and the public meet. The agora has diverse participants from various disciplines with different interests and expertise.
Participants produce not only reliable but also more socially robust knowledge because of the continuous review mechanisms in the agora. They are self-authorizing, and the expertise is socially distributed. The agora is complex and full of uncertainty, which fosters further innovations co-evolving with society.

[5] http://www.gnu.org/ - As of 4.07.2013
[6] http://www.apache.org/ - As of 4.07.2013
[7] http://www.linux.com/ - As of 4.07.2013

On Sourceforge [8], the OBO Foundry [9] can be defined as an open source science development platform with diverse scientists who are setting principles for interoperable ontology creation and forming the shared terminology in the biomedical domain. There is no top-down leadership in the communities, and the platform consists of different communities focused on different subjects. These communities are divided into domains, which are basically smaller communities. Artifacts and emailing are the main collaboration tools. Artifacts can be perceived as any form of knowledge (e.g. document, code, bug report) and are created and elaborated by the community members, evolving through consensus.

[8] http://sourceforge.net/ - As of 4.07.2013
[9] http://www.obofoundry.org/ - As of 4.07.2013

The actors of open science are academics, not only software developers as in the OSS communities. The intellectual product is not only software but can also be papers or datasets in digital form (as artifacts). Ostrom and Hess (2007) list the differences between OSSD and traditional scientific practices as sharing not only the research product (e.g. software, paper) but also the research process, which differs from traditional journal publishing. Others outside the organizational borders can also participate in different scientific research projects. Open commons do not hold full copyright, and they speed up the publication of ideas and innovations compared to the standard peer-reviewed journal process.

2.1.3 Governance of Open Scientific Communities

One of the most significant problems in organizational scholarship concerns how social collectives govern, organize, and coordinate the actions of individuals to achieve collective outcomes (O'Mahony and Ferraro, 2007). Jensen and Scacchi (2010) aim to develop an understanding of how to characterize the ways and means for affecting governance within and across OSS projects, as well as the participants and technologies that enable these projects and the larger communities of practice in which they operate and interact.

An important concern that arises from the Open Science concept is what kind of incentives should be created to encourage scientists and academics to participate. Even though journals have not paid authors for articles since the beginning of the scientific era, there are intangible rewards: authors want their work to be known, read, built upon, used, and cited (Ostrom and Hess, 2007). Ostrom and Hess (2007) also state that authors can write journal articles without considering which idea is the best seller or what would be noticed by the widest audience. Academic freedom and gaining reputation can thus be seen as some of the incentives in open science, but more incentives are needed for better participation. Ostrom and Hess (2007) draw attention to: (1) how to license digital content that is not computer software, (2) how to work within the existing norms and incentive structures faced by most scientists and academics in their workplace today, (3) how to govern such a collaboration, and (4) how to finance such an endeavor.

In addition to the previously described governance studies for OSS, Preece and Shneiderman (2009) offer the reader-to-leader framework for technology-mediated social participation. They assert that all users first become aware of the participatory environment and become readers; some then become contributors, then collaborators, and later on possibly leaders.
They claim that (1) people read because they benefit from it, (2) recognition is an important driving force of contributing, (3) interest triggers a contribution, which may turn into collaboration, (4) trust plays a role, and (5) altruism is identified as a major motivator for encouraging contribution and collaboration.

Lattemann and Stieglitz (2005) examine central structures and coordination patterns in open source communities. They see intrinsic motivation, group identification processes, learning, and career concerns as the key drivers of successful cooperation among the participants. In their study, it is assumed that all member groups are partially intrinsically and partially extrinsically motivated in every stage of the life cycle (introduction, growth, maturity, and decline or revival) of the community. This can be perceived as positive and negative feedback mechanisms. They point out that conventional control mechanisms are not usable in systems based on volunteer work because there is no possibility to penalize or reward members financially. They argue that the adequacy of governance tools is related to motivation, and that motivation is related to the different member groups and life cycle stages of a community.

Jensen and Scacchi (2010) illustrate an alternative perspective, offering a multi-level analysis and explanation for the governance of OSSD as quoted below; this research is interested in the micro-level analysis:

In particular, Open Source Software Development (OSSD) projects can be examined through a micro-level analysis of (a) the actions, beliefs, and motivations of individual OSSD project participants, and (b) the social or technical resources that are mobilized and configured to support, subsidize, and sustain OSSD work and outcomes.
Similarly, OSSD projects can be examined through meso-level analysis of (c) patterns of cooperation, coordination, control, leadership, role migration, and conflict mitigation, and (d) project alliances and inter-project socio-technical networking. Last, OSSD projects can also be examined through macro-level analysis of (e) multi-project OSS ecosystems, and (f) OSSD as a social movement and emerging global culture.

Ostrom and Hess (2007) remind us that the analysis of different types of motivations falls into three groups: "technological, sociopolitical, and economic." They present the main technological reason for someone to participate as the need for software that is unavailable or too expensive, and the main sociopolitical motivations as the belief in the social or political movement and the desire to participate in a broader community with shared interests, as well as the passion to contribute. In their study, the economic motivations in OSS communities are building human capital through learning by reading the existing software code and the peer-review process, and signaling one's ability as an expert (earning reputation).

Understanding governance and the relationships between the actors in these communities is essential to foster the conducive behaviors that steer these communities towards a desired goal. Scientific knowledge creation in such communities is termed Global Participatory Science (GPS) (Zou and Yilmaz, 2011). The tools used to explain this phenomenon are described in the following sections.

2.2 GPS from Complex Adaptive Systems Perspective

It has been demonstrated that science is complex because researchers around the world interact in both competitive and cooperative ways, with no imposed blueprint, and it is adaptive because both the science and participating scientists respond to environmental changes such as funding preferences or new discoveries (Wagner, 2008).
Participants of GPS learn from the knowledge repository or through communication with each other. The knowledge network and the social network evolve over time, influencing each other and forming macro-level structures.

Complex Adaptive Systems (CAS) can be described as a framework for understanding the world around us. Complexity itself can be perceived in two different ways: qualitatively and quantitatively. Standish (2008) asserts that, qualitatively, complexity is related to the ability to understand a system or object, while quantitatively, complexity is used to describe something as being more complicated than something else. Yam (2005) describes CAS as: "A new approach to science, which studies how relationships between parts give rise to the collective behaviors of a system and how the system interacts and forms relationships with its environment."

CAS are formed of elements that have a wide range in both form and capability (Holland, 1996). Shlesinger (2007) describes CAS as "composed of interacting thoughtful (but perhaps not brilliant) agents." The phrase "not brilliant" raises concerns about bounded rationality. Axtell and Epstein (2006) discuss empirical data which demonstrate that not all individuals need to be rational to produce efficiency in the macro-level outcomes of a system in CAS. Given that individual rationality is bounded, they explore how much rationality should exist in a system to generate macro-level patterns. In CAS, big changes can generate small outcomes, while small perturbations can cause big emergent behaviors. Yam (2005) defines emergence as the interdependence between details and the larger view of a system.

In addition to bounded rationality, Monge and Contractor (2003) describe the main elements of complex systems in terms of the network of agents, their attributes or traits, the rules of interaction, and the structures that emerge from these micro-level interactions.
The authors list typical classes of agent traits as location, capabilities, and memory. Communication among the agents, as a main interaction mechanism, has an intense effect on the system-level behavior of a complex adaptive system (Shlesinger, 2007). Yang and Shan (2008) describe agents as using a belief-desire-intention framework to guide their behavior.

In another seminal work, Hidden Order, Holland (1996) describes seven basics (four properties and three mechanisms) that are common in CAS. These basic principles, which are also adopted by this research, are listed below:

Aggregation (a property): In one sense, it means placing similar things in the same class. In another sense, it is the emergence of macro-level behaviors caused by aggregate micro-level relationships.
Tagging (a mechanism): It results in selective interaction between the agents.
Nonlinearity (a property): There are nonlinear interactions among the components of the system, which means that a change in one component nonlinearly affects the state of another component.
Flows (a property): There is a flow of components and information through the system.
Diversity (a property): There are heterogeneous agents and components in the system.
Internal Models (a mechanism): It refers to anticipation among the agents.
Building Blocks (a mechanism): There is a repetition of novel situations or structures that emerge through building blocks (reusable categorical parts).

Van Aardt (2004) views the OSS development communities as complex adaptive systems. Muffatto and Faldani (2003) represent the OSS community actors with their interaction flows and depict them in terms of three fundamental processes that Axelrod and Cohen (2001) identify in complex adaptive systems: variation, interaction, and selection. Scacchi and Jensen (2008) state that recent empirical studies of OSS projects reveal that OSS developers often self-organize into organizational forms characterized as evolving socio-technical interaction networks (STINs).
They are self-organizing because they are usually formed and led without extrinsic leadership.

2.3 Understanding CAS by Agent Based Modeling

Interdisciplinarity and computer-based thought experiments are common features of CAS studies (Holland, 1996). As the purposes of our models diversify and the types of inputs broaden, the resulting models become more interdisciplinary. Standard models of others and the recombination of mathematical algorithms enrich the process of modeling CAS. Shlesinger (2007) defines models as maps for developing scientific understanding of the system of interest, involving a combination of theory, practice, and art. The main purpose of CAS studies is to understand the underlying relationships between parts, whereas people mostly think that the problem is in the parts (Yam, 2005).

The understanding of CAS is more likely to arise with the help of computer-based models (Holland, 1996). Axelrod (1997a) states that applications of simulation in social science are so diverse that simulation has no natural home as a field. There are two methods to explore CAS: agent based modeling (ABM), a bottom-up approach, and the Method of Systems Potential (MSP), a top-down approach that derives CAS properties analytically (Yang and Shan, 2008). ABM gives us the opportunity to directly identify the entities along with their relationships and capabilities. ABM captures emergent phenomena because it takes a holistic approach that perceives a system as more than the sum of its constituent parts. The system-level behavior cannot be explained by the properties of the units in the system. Since ABM is mostly used with behavioral entities, it provides an opportunity to model more realistically. Holland (1996) states that abstractions of reality, which are agent based models in this research, are metaphorical representations and are actually "the world as it might be, not the world as it is."
In computer based simulations, accuracy is expected, but the model is not the exact reality. The following section summarizes some examples of the use of ABM that are related to this research.

2.3.1 Examples of Agent Based Modeling

The use of agent based modeling as an explanatory tool is widespread among disciplines and has gained popularity in recent decades. Epstein (2006) asks the question "Can you grow it?" instead of "Can you explain it?" for observed social phenomena. He gives examples of agent-based models generating social interactions and emergent behaviors in socio-cultural contexts. He calls agent-based computational models scientific instruments.

There are many inspiring implementations of agent based simulation models that explain different systems and create understanding for different contexts. Carlson and Doyle (2002) discuss highly optimized tolerance in a statistical physics environment. Their forest-fire model aims to show the connection between micro-level mechanisms and macro-level failures by building self-organized criticality and robustness barriers. In another study, abstract computational forest-fire models are developed to gain an understanding of the mechanisms that underlie different ecosystems, and ultimately, different fire management strategies are evaluated (Moritz et al., 2005).

This research focuses on socio-technical processes and the use of ABM in socio-technical environments. As a good example, Axelrod's model of disseminating culture (1997b) examines the relationship between local convergence and global polarization, and builds social influence mechanisms of individual and group differences in terms of traits and features. In another study, Axelrod (2006) questions how states form and dissolve from small political actors, providing new thinking on real-world policy making in order to maintain more sustainable political structures.
In their remarkable work on the effect of individual rationality on macro-level behavior, Axtell and Epstein (2006) point out that imitative behavior and social interactions are not well considered in economic models. Therefore, they develop an agent-based model for the timing of retirement that incorporates social interactions and a social imitation process to investigate how the desired (optimal) behavior is converged upon even with a relatively small number of rational agents.

Epstein et al. (2006) also examine an epidemic case of smallpox in a county-level context with different scenarios (vaccination, population, etc.) and feed the simulation with real data to question when and how much vaccination is needed to stop the epidemic. In another study, Epstein (2006) grows an artificial hierarchical management mechanism for a company to measure the adaptability of employee hierarchies. He then creates an objective function representing the trade-off between the maintenance costs of the managerial layers and the costs of missed marketing opportunities. Finally, he calculates the optimum management hierarchy for different initial set-ups.

Cui et al. (2009) create a hypothetical open source software development environment presenting a stigmergy approach and investigate whether they can validate the network structures and collaboration frequency among scientists that emerge from micro-level stigmergy preferences. Yilmaz (2009) develops an understanding of the coordination of OSS communities, focusing on governance and conflict management strategies, and measures performance in terms of collective creativity. Dron and Anderson (2009) perceive technology not only as a tool but also as a component (processes and rules). They claim that people should design systems considering the significant components, and that because the behavior is unlikely to be predictable, people need to make the systems adaptable.
McCormack (2007) creates an artificial ecosystem in which he builds a metaphor between adaptive individual agents and colors, exploring novel discovery processes.

2.3.2 Simulation Models of Science

Different scholars use simulation to study scientific domains. For instance, Gilbert (1997) introduces a model to determine whether it is possible to reproduce observed regularities in science using a small number of simple assumptions. His model generates knowledge structures consistent with observed Zipf distributions involving scientific articles and their authorship, but it does not consider social processes as a mechanism. Naveh and Sun (2006) build their analysis on top of Gilbert's model and explore how different cognitive settings may affect the aggregate number of scientific articles produced. They argue that using cognitively realistic models in simulation may lead to novel insights in academia, but they consider only implicit and explicit learning. In the context of collective knowledge creation and diffusion, Cowan and Jonard (2004) simulate the knowledge exchange process to examine the relationship between network performance and network architecture. Shrager and Langley (1990) perceive science as problem solving, including machine learning techniques. However, in these studies, social interactions are not taken into account. Socio-technical modeling of science and the representation of knowledge generation as a social phenomenon draw significant attention among researchers. Similarly, in this research, the focus is on micro-level (inter-scientist) behaviors and on developing a plausible socio-technical model of GPS.

2.4 Communication and Collective Action in GPS

Communication preferences and opportunities are important interaction mechanisms embedded in the process of knowledge generation in science.
Wagner (2008) states that if we can discern identifiable patterns and mechanisms of communication among scientists, then that can lead us to understand how this scientific endeavor works and how policymakers can influence its evolution and growth. Science can thus also be perceived as a complex communication network consisting of individual scientists who communicate, form partnerships, create opportunities, share their findings, and adapt to new constraints in their environment.

Little is known about the role of the grounding theoretical mechanisms of communication in GPS. Table 2.1 categorizes the social communication theories that are widely known in the Psychology domain. Olson (1974) argues in his seminal work The Logic of Collective Action that "unless the number of individuals in a group is quite small, or unless there is coercion or some other special device to make individuals act in their common interest, rational, self-interested individuals will not act to achieve their common or group interest." It is essential to understand Olson's collective action dynamics in order to explain the merits and nature of GPS. A group of people may all benefit greatly from a collective action, yet be unable to act together to achieve it. [10] Ostrom and Hess (2007) also state that the challenge in FOSS commons is how to achieve collective action to create and maintain a commons or public good. Collective action is focused mainly on mutual interests and the possibility of benefits from coordinated action (Monge and Contractor, 2003). There is also a social dilemma introduced by Hardin (1982), who asserts that mutual interest and individual interest conflict, resulting in the dissolution of the collective action. The dilemma between mutual interest and self-interest is essential.

[10] http://michaelnielsen.org/blog/the-logic-of-collective-action/ - As of 4.07.2013

Table 2.1: Social Communication Theories (Monge and Contractor, 2003)

Theories                                    Sub-Theories
Theories of Self-interest                   Social Capital; Structural Holes; Transaction Costs
Mutual Self-Interest & Collective Action    Public Good Theory; Critical Mass Theory
Cognitive Theories                          Semantic or Knowledge Networks; Cognitive Social Structures; Cognitive Consistency; Balance Theory
Contagion Theories                          Social Information Processing; Social Learning Theory; Institutional Theory; Structural Theory of Action
Exchange and Dependency                     Social Exchange Theory; Resource Dependency; Network Exchange
Homophily & Proximity                       Social Comparison Theory; Social Identity; Physical Proximity; Electronic Proximity
Theories of Network Evolution               Organizational Ecology; NK(C)

One of the terms employed by Bonacich (1990) is the communication dilemma, which stresses the conflict between individual communication preferences and the organizational needs of communication. Self-interest theories explain some of these communication preferences of scientists. They postulate that people make what they believe to be rational choices in order to acquire personal benefits. These personal benefits can be human capital, social capital, or reputation. In this research, mechanisms for self-interest, exposure, preferential attachment, and communication theories are developed, with collective action as an underlying mechanism.

2.5 Understanding GPS as an Innovation and Collaboration Network

Kuhn (1996) approaches science from a paradigmatic point of view and perceives it as a collective innovation. It is collective because it builds upon the past achievements of others, and innovation is essential because science undergoes cyclical revolutions that occur periodically, resulting in the formation of new paradigms and the iterative adoption of inventions.
Even though the definition of innovation depends on the system of inquiry, with a more abstract approach it can be expressed as a critical event that destabilizes the state of the system and leads to a new self-organizing state (Pyka, 2009). In a systemic sense in science, the emergence of new knowledge structures, new channels of communication, and new network topology can be described as innovation.

It is known that the outputs of an innovation system are mostly measured as the number of publications or patents and the inputs as the resources allocated; however, the process that transforms inputs into outputs is a black box (Milbergs and Vonortas, 2006). The next generation innovation metrics are more focused on emergence. In GPS, plausible underlying socio-technical mechanisms that lead to the emergence of desirable macro-level behaviors can be described as the processes that Milbergs and Vonortas (2006) address. Emerging social-network structures and emerging diversity in the topology can be perceived as innovation indicators.

It has been demonstrated that user innovation communities are self-organizing complex adaptive systems (Yilmaz, 2008a). However, not all complex systems are self-organizing (Monge and Contractor, 2003). A system is self-organizing when the network is self-generative (e.g., spawning agents), there is mutual causality between parameters, it imports energy into the system (e.g., creating new artifacts and opportunities), and it is not in an equilibrium state.

Saviotti (2009) perceives the scientific product of an economic system as a knowledge network and introduces network interactions between the knowledge bases of firms. He synthesizes network science, complex systems, innovation, and knowledge network approaches in his model and analyzes network connectivity to discuss innovation. Diaz-Guilera et al.
(2009) model the propagation of innovations by analyzing the spread of a stimulus across a network in terms of connectivity, and they use complex interaction mechanisms such as punctuated equilibrium and self-organized criticality under the assumption that the cost of connectivity is stable. Similar to Saviotti (2009) and in addition to network model preferences, the underlying assumptions of the social network models created by Gilbert (2006) treat the maintenance of the network as costless. Thus, Lynne and Gilbert (2009) propose limiting the size of personal networks because of the costs of keeping the network alive. Monge and Contractor (2003) list the four characteristics that are critical to the creation of a public good: interests, resources, benefits, and costs. Udehn (1993) states that self-interest alone is inadequate and must be replaced by an assumption of mixed motivations. "What is the mix of these motivations?" is a question awaiting further exploration.

Social network analysis, as one of the recently developed fields, has attracted increasing attention among scientists. Social network analysis constructs networks from social relations and their functions in society (Wasserman, 1994b). Pyka (2009) presents some empirical results on the trends in innovation networks: "(i) The emergence of novelty tends to create new but poorly connected nodes, thus temporarily reducing the connectivity of the system. (ii) The subsequent diffusion of the innovations establishes new links and raises again the connectivity of the system. (iii) As a result of (i) and (ii), the connectivity of the system is likely to fluctuate around a given value." Dhanaraj and Parkhe (2006) present hub firms in an innovation network that manage knowledge mobility, innovation appropriability, and network stability. They regard the network and the members of the network as coupled and dependent on one another.

All intelligible ideas, information, and data that can be delivered or gathered in a format can be referred to as knowledge (Ostrom and Hess, 2007). Innovation results from the recombination of knowledge held by the collaborators, and the extent to which agents' knowledge complements each other's is an issue of cognitive integrity (Cowan and Jonard, 2004). The introduction of new ideas through weak ties can foster innovation and development of the system (Wagner, 2008). In addition to the artifacts, GPS has interactive communication outputs (Monge and Contractor, 2003). In other words, connectivity of the members (the network itself) and communality can be identified as the goods of the collective action.

Lynne and Gilbert (2009) suggest four different types of network models: regular lattice, small-world, scale-free, and random. Watts (1999) describes four characteristics of a small-world phenomenon. He argues that a small-world network consists of a large number of actors, each connected to a relatively small number of other actors. There are no central actors, and the network is sparse. Relationships among actors overlap; that is, friends of friends are more likely to be friends too. Scale-free networks have a degree distribution that follows a power law. Albert and Barabási (2002) state that "Most real networks, however, exhibit preferential attachment, such that the likelihood of connecting to a node depends on the degree of the node." The preferential attachment mechanism creates a power-law distribution, in which those with a high level of resources attract more resources. Lynne and Gilbert (2009) argue that social networks are not random since people connect with others who are similar to themselves. Scale-free networks are not realistic either, because people do not rely only on preferential attachment, in which people connect to those who already have many links: people do not necessarily know who has the largest number of connections.
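
The preferential attachment mechanism quoted above can be illustrated with a minimal Python sketch. It is not the model developed in this research: the function names grow_preferential_network and degree_counts, the parameter links_per_node, and the fully connected starting core are assumptions made purely for illustration.

```python
import random
from collections import Counter


def grow_preferential_network(n_nodes, links_per_node=2, seed=0):
    """Grow an undirected network in which newcomers favor high-degree nodes.

    Hypothetical illustration: names, parameters, and the starting core
    are assumptions, not taken from the dissertation's model.
    """
    rng = random.Random(seed)
    # Start from a small fully connected core of links_per_node + 1 nodes.
    core = links_per_node + 1
    edges = {(i, j) for i in range(core) for j in range(i + 1, core)}
    # Each node appears in this list once per tie it holds, so a uniform
    # draw from the list picks a node with probability proportional to degree.
    endpoints = [node for edge in sorted(edges) for node in edge]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.add((target, new))
            endpoints.extend((target, new))
    return edges


def degree_counts(edges):
    """Return a Counter mapping each node to its number of ties."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


if __name__ == "__main__":
    network = grow_preferential_network(500, links_per_node=2, seed=42)
    degrees = degree_counts(network)
    # A few early nodes typically accumulate many more ties than the rest.
    print(degrees.most_common(5))
```

In such a sketch every newcomer needs to see the full degree list, which is exactly the global knowledge that the criticism above says real actors lack.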
Newman and Watts (2006) postulate: "the small-world model is not in general expected to be a very good model of real networks," because small-world models do not produce nodes with high degrees of connectivity. Hence, Lynne and Gilbert (2009) conclude that social network models need to fall somewhere in between scale-free and small-world, which is a new challenge in modeling.

2.5.1 Social Network Metrics and Innovation Potential

De Nooy et al. (2005) explain that "the main goal of social network analysis is detecting and interpreting patterns of social ties among actors." Social networks represent the complexity of human interactions (Wasserman, 1994b), and their topologies are represented by sets of people or social actors and the set of peer-to-peer relationships among them. Social distance mathematically presents the degree of closeness and acceptance that these actors or groups of actors feel towards each other (Boguna et al., 2004). Boguna et al. (2004) also discuss three specific issues in social networks: "transitivity of the relationships between peers (clustering), correlations between the number of acquaintances (vertex degree) of peers, and the presence of a community structure with patterns."

The degree centrality of each actor is measured in order to capture degree distributions. The degree centrality of an actor is calculated as the proportion of possible ties that exist for that particular actor. Another metric that can be considered is ego density, a term coined by Burt (1982). The term refers to the proportion of existing ties that include the actor as a peer. It is a useful metric to assess which nodes or actors are more likely to spread knowledge and innovation (Wasserman, 1994b). Additionally, density is another metric, which is the average standardized degree in the whole network. Higher density suggests higher connectivity and group cohesion (Blau, 1977).
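
As a minimal sketch of the two definitions above, the following Python functions compute the degree centrality of each actor as its share of the n - 1 possible ties, and density as the average of those standardized degrees. The function names and the edge-list representation of an undirected network are assumptions made for illustration; they are not taken from the simulation model of this research.

```python
def degree_centrality(nodes, edges):
    """Share of possible ties (n - 1) that each actor actually holds.

    Hypothetical helper: assumes an undirected network given as a list
    of (a, b) pairs with no duplicate ties and at least two nodes.
    """
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {v: degree[v] / (n - 1) for v in nodes}


def density(nodes, edges):
    """Average standardized degree over all actors in the network."""
    centrality = degree_centrality(nodes, edges)
    return sum(centrality.values()) / len(centrality)


if __name__ == "__main__":
    # A five-actor star: actor 0 is tied to everyone else.
    star_nodes = [0, 1, 2, 3, 4]
    star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
    print(degree_centrality(star_nodes, star_edges))
    print(density(star_nodes, star_edges))
```

For the five-actor star, the hub holds all four of its possible ties (centrality 1.0), each peripheral actor holds one of four (0.25), and the density is 4 existing ties out of 10 possible, or 0.4.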
The variability of individual indices can be quantified, so the degree centrality of a network is calculated as a measure of variability among the degrees of actors.

Regarding innovation potential, Yilmaz (2008a) argues that higher density networks have better mobility of knowledge, which is desirable for innovation; however, higher density also diminishes the positive effects of diversity on innovation by creating shared norms and skills. Therefore, it is essential to measure density along with the diversity of a network. Yilmaz (2008a) identifies high centrality and low density networks as another indicator of innovation potential that leads to more structural holes and transformation of knowledge. Both Yilmaz (2008a) and Burt (1995) discuss the importance of high centrality and fewer structural holes as a competing preference.

In order to support the given hypothesis, more metrics need to be explored. Distance metrics between and among groups, such as Euclidean distance, Manhattan distance, Mahalanobis distance, and Hamming distance, are some of the most important common metrics in use. [11] Primarily, geodesic distance as a form of Euclidean distance is used in social network metric calculations, whereas the other distance metrics are outside the scope of the social network context. However, Hamming distance is useful when analyzing diversity among scientists.

Wasserman (1994b) suggests that closeness centrality and betweenness centrality are indicative of cohesion within the network. The closeness centrality of an actor states how close the actor is to all other actors. There are different approaches for measuring group closeness (Freeman, 1979; Bolland, 1988). The measure is between 0 and 1, and lower values indicate better dissemination of information. The betweenness centrality of an actor is the number of shortest paths that pass through the node divided by the number of all shortest paths within the network.
This metric is useful for determining the nodes where the network can fall apart. When the normalized metric for the group is calculated, higher values indicate that it is easier to destroy the connectivity of the network because connectivity is highly dependent on a few actors. Additionally, eigenvector centrality and information centrality metrics capture who is connected to the most popular (central) nodes and who is connected to the best information paths (in the case of valued links), respectively (Wasserman, 1994a). These metrics can be used in resilience analysis, but they are computationally costly to calculate.

11 http://www.statsoft.com/textbook/cluster-analysis/ - As of 4.07.2013

The clustering coefficient is the most important metric of interest for capturing clustering tendencies within the network. It is the density of a node in its neighborhood, calculated from the average of coefficients for all actors. It is indicative of the presence of different communities or groups within the network (Schank and Wagner, 2005). Higher values might indicate sparsely clustered groups or a high connectivity in the whole network as a structure. Therefore, mathematically, the average of all shortest paths between nodes in the network can help to distinguish which structure is present. Also, the longest of the shortest paths in the network can be useful for indicating the diameter of the network (as higher values reflect more sparseness).12 As a consequence, calculating the average path length along with the clustering coefficient would allow us to distinguish the high centrality-fewer structural holes hypothesis.

Having discussed how to measure clustering and degree distributions, the aforementioned third issue in social networks is the fractal-like network structures. One structure to measure is the small-world phenomenon, which indicates a higher clustering coefficient and a relatively short average path length.
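The two quantities that characterize the small-world phenomenon, the average clustering coefficient and the average path length, can be sketched in Python as follows. The sketch assumes an undirected, unweighted network stored as adjacency sets and computes geodesic distances with breadth-first search; the function names and the example network are hypothetical, and the ratio is the simple ratio of the two quantities discussed in this section rather than a validated small-world index.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected adjacency sets (assumed representation)


def local_clustering(graph: Graph, node: str) -> float:
    """Density of ties among the neighbors of one node."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph: Graph) -> float:
    """Clustering coefficient of the network: mean of the local coefficients."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


def average_path_length(graph: Graph) -> float:
    """Mean geodesic distance over all reachable ordered pairs of nodes."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def small_world_ratio(graph: Graph) -> float:
    """Ratio of average clustering to average path length."""
    return average_clustering(graph) / average_path_length(graph)


# Hypothetical four-actor network (illustrative data only).
example: Graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

For the four-actor example, the average clustering coefficient is 7/12 and the average path length is 4/3. A high clustering coefficient combined with a short average path length is the signature of small-world networks.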
These networks are clustered, but there are also many bridges and structural holes between clusters. The small-world phenomenon can be measured by the ratio between the clustering coefficient and the average path length. Greater values indicate a better small-world structure (Uzzi and Spiro, 2005). Additionally, degree centrality and density metrics can be indicative of scale-free structures, which have a few highly central actors as opposed to the majority of the actors, which have small degree centrality.

12 http://www.slideshare.net/gcheliotis/social-network-analysis-3273045 - As of 4.07.2013

Core/periphery ratio is a relatively new metric, and there is no consensus on how to calculate it. Core/periphery ratio is calculated by simply dividing the number of core members by the number of periphery members. It is a measure of innovation, and a larger periphery is a better indicator of diffusion of innovations (Krebs and Holley, 2002). A well-known technique to identify core and peripheral nodes is the recursive method that removes the nodes whose degree is smaller than a predetermined number until there is no node remaining to remove (Boyd et al., 2006). The remaining nodes are then counted as core nodes, while the removed ones are counted as peripheral nodes.

2.6 Diversity and Innovation Potential

Uzzi and Spiro (2005) analyze the small-world phenomenon through innovation in Broadway musicals. They indicate that the quality of a show's performances increases with small-world structure up to a certain extent, after which there is a diminishing effect on performance. Diversity needs to be spurred within the network. Badis et al. (2009) also state that if we observe the companies in a market and the ecosystems in nature, we can see a diminishing return of similarity. At some point, having more similar things diminishes the rate of benefits.
There are also studies that discuss diversity in the population (Dhanaraj and Parkhe, 2006; Powell et al., 1996) and network connectivity as an indicator of innovativeness of the network (Pyka, 2009; Burt, 1995). Interdisciplinarity as a form of diversity is desirable in GPS, and emergent knowledge and collaboration network structures can be used as proxy metrics of innovation potential. In this research, both interdisciplinarity and the connectivity in the network reveal patterns that allow us to discuss innovativeness based on the foundations described in the previous section. A detailed summary of the diversity metrics measured in this research is given in Chapter 5.

2.7 Robustness and Resilience in Socio-technical Systems

It is worth mentioning that robustness has different definitions in different problem domains. In ecology, robustness refers to preservation of diversity in a population, while in medicine, it refers to healing and compensation. In cell biology, robustness refers to how consistent the cell fate decisions are (Krakauer, 2006). Flack et al. (2005) focus on a pigtailed macaque society by removing leaders and observing how the society reacts to this perturbation in terms of conflict management. Robustness in that setting is highly related to self-organization and to the levels of interaction between the individuals in the system. Pavard et al. (2006) define a robust system as one that adapts its behavior to unexpected outcomes and perturbations in the environment. Robustness refers specifically to the ability of a system to operate in a desired way when that particular system faces a wide range of operational conditions (Sheard and Mostashari, 2008).

Resilience and robustness can have similar definitions in different domains. Resilience is more related to how long it takes for a system to regain a desired output after a perturbation.
Smith and Stirling (2008) describe resilience as "the dynamic persistence of a regime under episodic shocks" and robustness as "system maintenance under cumulative stress." There is a need to acknowledge the uncertainty and the lack of knowledge about GPS. Robustness analysis is therefore considered an essential method for testing, since the simulation models are likely to give variable outputs and because there is a need to capture the robustness of a mechanism through distinct scenarios and parameter set-ups. Robustness is valuable to explore as opposed to only searching for an optimal behavior in terms of a fitness function in an unchanging environment.

In the following chapter, the stakeholders of this research are introduced, and the methodology and research questions are briefly summarized.

Chapter 3

RESEARCH PROBLEMS AND METHODOLOGY

In this chapter, the stakeholders in the research and the environment of interest are introduced. Then, the significance of the research problems and the contributions of the research are described.

3.1 Stakeholders of the Research

The Science of Science Policy (SoSP) is "an emerging interdisciplinary field aiming to provide scientifically rigorous basis from which policy makers can assess the impacts of scientific enterprise, improve the understanding of its dynamics and assess the likely outcomes."1 There are three themes of SoSP:

- Understanding Science and Innovation
- Investing in Science and Innovation
- Using the Science of Science Policy to address National Priorities

Innovation is at the core of SoSP themes because the National Academy of Engineering report states that it is critical for US prosperity2 and is a desirable outcome of all collaboratories. A collaboratory is described as "a computer-supported system that allows scientists to work with each other, facilities, and databases without regard to geographical location" (Finholt and Olson, 1997); collaboratories are observed in GPS.
From the perspective of SoSP, scientific exercises can be conducted in traditional science environments or GPS environments. Considering its knowledge creation process, science is a collective action taken by diverse, autonomous individuals. Additionally, in GPS, scientists are self-organizing all over the globe, collaborating on the same projects regardless of their physical proximity, learning from each other, and doing research without an imposed blueprint. The governance mechanisms of individuals not only affect the individual gains from scientific activities but also affect emerging macro-level patterns. With the increasing importance of research on scientific enterprise, SoSP created a roadmap for guidance of research on these scientific communities. SoSP asks three fundamental questions relevant to this research3:

- What are the behavioral foundations of science and innovation?
- How and why do communities of science and innovation form and evolve?
- Is it possible to predict discovery?

Traditional science and GPS activities differ in several aspects of scientific environments. Table 3.1 summarizes the comparison of the features by which both scientific enterprises can be described.

1 http://scienceofsciencepolicy.net - As of 4.07.2013
2 http://www.nae.edu - As of 4.07.2013
3 The Science of Science Policy: A Federal Research Roadmap, 2008

Table 3.1: Traditional Science vs. Global Participatory Science

Criteria           | Additional Criteria          | Traditional Science Teams | Open Science Community
-------------------+------------------------------+---------------------------+----------------------------------------------
Distribution       | Space                        | Co-located                | Distributed
                   | Time                         | Synchronous               | Asynchronous + synchronous
                   | Communication                | Informal                  | Formal (structured electronic communication)
Organizational     | Structure                    | Hierarchical              | Networked
                   | Style                        | Team/Formal Group         | Community/Market
Openness (Product) | Access is                    | Push-driven               | Pull-driven
                   | Granularity of transparency  | Complete product          | Incomplete product
                   | Integration of contributions | Pre-production decisions  | Pre and post-production review and use
Process            | Decision-making              | Closed                    | Open/transparent
                   | Authority                    | Centralized/hierarchical  | Decentralized
Mobility           | Entry Threshold              | High                      | Low
                   | Turnover rate                | Low                       | High

3.2 Research Problems

There is uncertainty regarding the theoretical foundations of science communities. The SoSP roadmap also states that "theoretical and computational models of science and innovation must be developed!" In light of the needs stated by SoSP, the initial research questions to be explored in this research are:

- Which interaction mechanisms in the literature explain the operational behavior of GPS and its underlying socio-technical processes?
- How can we specify and implement these mechanisms in the form of a computational model to gain empirical insight and perform exploratory analysis?

There are three levels from which science can be studied: micro-level (inter-scientist interactions), meso-level (interactions among communities), and macro-level (communities-of-communities structures, that is, the ecosystem level). Figure 3.1 is introduced for better interpretation of the network topologies. In meso-level analysis, nodes can be perceived as communities, while in micro-level analysis, the nodes can be scientists. In macro-level analysis, nodes can be domains that include many different communities.4 The links between the nodes can be interpreted as any kind of relationship (e.g.,
collaboration, social, funding), and this interpretation is based on the intention and the purpose of the model developer.

4 http://www.cliquecluster.org/content/research-program - As of 4.07.2013

Figure 3.1: Network Visualizations of Micro-level, Meso-level, and Macro-level Science Studies: (a) Meso-Level, (b) Micro-Level, (c) Macro-Level

The CAS characteristics observed in GPS provide the opportunity to study underlying micro-level (inter-scientist) behaviors in order to develop plausible explanations for this phenomenon. Collective action theory is selected as an underlying theoretical base for addressing the operational behavior of GPS. More details about model development and the implementation of collective action theory are described in the following chapters.

Following the development of the base-model, the next step is to explore the emerging network structures in order to understand them. Policymakers can then benefit from their unanticipated opportunities and eventually manage the evolution of GPS networks. Communication among agents, which is present within collaboration networks, has an intense effect on the system-level behavior. Theoretically grounded explanations of communication behaviors among scientists provide an opportunity to explore different communication preferences and their effects on social network structures. The research question of interest in this research is:

- Which social communication mechanisms among scientists are more effective in fostering innovation potential?

Ultimately, considering the uncertainty of the mechanisms and the bounded rationality that exists in the environment (as global information does not exist among agents), variable outcomes are likely to occur. Robustness can be identified as the level of variability the system exhibits under various environmental and intrinsic conditions. Less variable outcomes are indicative of more robust landscapes.
Since robust system design is more important than finding an optimal behavior of a single scenario, the research question in this study is:

- How can different parameter and mechanism configurations be explored to seek and identify more robust communication strategies in terms of the variance observed in innovation potential metrics?

3.3 Methodology Chart

In this research, a bottom-up approach is adopted that has top-down guidance as articulated by the objectives of the study. The methodology is described in Figure 3.2.

Figure 3.2: The Chart of Objectives and Methodology

The base-model is grounded on theories and operating principles derived from observations on the system of interest. First, the base-model is built with CAS principles in mind. In the model, the interpretation of collective action theory and social interactions complements CAS principles (i.e., tagging, information flows, diversity, non-linear interactions). Along with the theory base, the information foraging mechanism, which is inspired by food foraging in nature (Pirolli and Card, 1999), is built, and the preferential attachment mechanism is designed as an essential interaction process. The SEIR metaphor and population dynamics are introduced, which concludes the conceptual model development. Conceptual model development is followed by its implementation as a computer simulation.

The second phase of the study builds on the base-model developed in phase one. Different communication preferences of the scientists are implemented. Thus, different communication mechanisms and their effects on innovation potential are sought.

The third phase consists of the discovery of more robust configurations by a tool developed to help decision-makers explore different mechanism and parameter set-ups.
3.3.1 Phase 1 - Conceptual Model Theory Base

Scientists join or leave a problem domain on the basis of problems to be explored and projects to be accomplished, and their position in the scientific enterprise depends upon their knowledge, levels of interest, popularity, personal learning objectives, resources, and commitments (Hollingshead et al., 2002).

In this research, Olson's collective action theory is identified as the socio-cognitive interaction mechanism in GPS. It basically asserts that when the benefits an individual gains are greater than the costs he or she is burdened with, that individual will join the collective action. GPS is perceived as a collective action because artifacts, as products of the collaboration, are public goods that are owned by the community and have features such as jointness of supply and impossibility of exclusion: the knowledge produced is open, all may benefit from it, and the benefits gained by one do not diminish the benefits that can be gained by others.

In GPS, scientists are always in close proximity to one another so long as they collaborate on the Internet. Scientists use web services and platforms to collaborate and socialize. Although traditional science is institutionalized and has certain incentive mechanisms for scientists such as tenure, journal publishing, and funding, these mechanisms do not exist in open science environments. Scientists participate because of an altruistic belief in the action: they believe that collective innovation is necessary in the interest area of that action, they desire to gain greater knowledge, and they want to broadcast their skills and expertise to fellow scientists by weaving a social network. Basically, scientists follow their self-interests on the theme. But self-interested people are more likely to acquire what they want without paying a great price.
Exploitation causes an inevitable free-riding problem, but it does not destroy the value of the work done in GPS. So, interested and self-motivated scientists keep contributing to the collective. Participation in GPS is not compulsory, but the social pressures existing within the open science community leave scientists exposed to the groupthink present in collective behaviors. This phenomenon creates exposure to the mutual-interest. The dialectic interaction between mutual-interest and self-interest is essential. While mutual-interest in an action drives an individual to participate, self-interest might cause avoidance of participation, or vice versa.

Throughout the development process of the base-model, verification and validation studies are also conducted. At each level of conceptual model development, the mechanisms are conceptually grounded on sound theories, the work of others, and empirical findings. A detailed summary of the verification and validation efforts is given in Chapter 4.

3.3.2 Phase 2 - Social Communication Model and Innovation

In this phase, social communication theories that are relevant to the GPS environment are selected. Generative mechanisms for the selected theories are interpreted, and recommendations for their implementation are given. Then, sensitivity analysis is conducted to measure innovation potential for different simulation set-ups. The selected communication theories are listed below:

- Human Capital mechanism states that scientists have broadcasted information about the expertise of others and try to connect with other scientists who have higher expertise than themselves.
- Social Capital mechanism states that scientists will attach themselves to scientists with high or terminal degrees in the social network.
- Homophily mechanism states that scientists can perceive the interest information of others and will try to connect to scientists who are familiar to them.
- Social Exchange mechanism states that scientists have information about what other scientists know and their degree of expertise. A scientist will then connect with scientists who are experts in an area with which he or she is not familiar in order to strengthen his or her own expertise.
- Random mechanism only allows scientists to connect to other randomly selected scientists.
- Mixed communication is a scenario that randomly assigns one of the aforementioned five theories to each scientist. Each scientist then behaves according to that particular theory. The probabilities assigned to each theory are also parameterized, so that further analysis on population dynamics can be conducted.

Following the implementation, sensitivity runs are conducted. Diversity, interdisciplinarity, and social network metrics are measured to be able to distinguish more innovative communication behaviors. The results are presented to policy-makers so that they can promote desirable behaviors in open science environments.

3.3.3 Phase 3 - Robustness Analysis

In this phase, an algorithm is created that consists of a search algorithm that explores different parameter values and a module that creates plausible simulation scenarios. These two parts are integrated via a feedback mechanism. The results of the search algorithm are measured by an objective function that indicates the variability of selected innovation potential metrics. Less variable results are assumed to be more robust. The policy-maker (decision-maker) can generate new plausible scenarios by observing the fittest parameter configurations. The ultimate goal is to find a strategy that behaves more robustly than the others regarding the variability of the performance metrics and to measure the robustness of different communication mechanisms under various conditions.
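The idea of scoring robustness by the variability of an innovation potential metric can be sketched as follows. This is a minimal Python illustration under stated assumptions: the population variance is used as the variability measure, the function names are hypothetical, and the metric values are invented for illustration rather than taken from simulation results. The search algorithm and the scenario generation module are not shown.

```python
from statistics import pvariance
from typing import Dict, List


def robustness_score(metric_values: List[float]) -> float:
    """Variance of one metric across scenarios; lower is treated as more robust."""
    return pvariance(metric_values)


def most_robust(results: Dict[str, List[float]]) -> str:
    """Name of the mechanism whose metric varies least across scenarios."""
    return min(results, key=lambda name: robustness_score(results[name]))


# Hypothetical metric values (e.g., network density) for three mechanisms,
# each observed under four scenarios. Illustrative numbers only.
results = {
    "homophily": [0.40, 0.42, 0.41, 0.39],
    "social_capital": [0.70, 0.20, 0.55, 0.35],
    "random": [0.30, 0.33, 0.27, 0.30],
}

if __name__ == "__main__":
    for name, values in results.items():
        print(name, robustness_score(values))
    print("least variable:", most_robust(results))
```

In this illustration the mechanism labeled social_capital reaches the highest single value but also varies the most, so a variance-based objective would rank it below the steadier alternatives.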
In an environment that has a high level of uncertainty, the more robust strategies should be considered for implementation as opposed to optimal but not robust strategies, because the most innovative strategy might create an unsustainable, highly fragile environment. In Chapter 6, communication between the components of the exploratory software is described in detail. Figure 3.3 illustrates the high-level structure of the exploratory software.

In the next chapter, the base-model and its components are introduced in detail along with the validation and verification studies conducted.

Figure 3.3: Exploratory Software Coupling GA and Metamorphic Relations

Chapter 4

BASE-MODEL COMPONENTS AND VALIDATION

There is no consensus over what the term conceptual modeling means. Robinson et al. (2010) identify some key factors of conceptual modeling in their book. They claim that it starts from the problem situation and moves through questions about modeling such as "What do we require to model?", "What do we model?", and "How do we model?" The questions are iterative, and there is continuous feedback with revisions. The conceptual model is simplified. It is not the code or software model, and it considers the client-side perspective as much as the modeler's. The precise definition given by Robinson, Brooks, Kotiadis, and Van Der Zee is:

    The conceptual model is a non-software-specific description of the computer simulation model (that will be, is or has been developed), describing the objectives, inputs, outputs, content, assumptions, and simplifications of the model.

The following sections give information on the conceptual model of the base-model including the assumptions, grounding theories, and interaction mechanisms. A detailed conceptual model description is followed by verification and validation efforts and an analysis to determine the initial conditions of simulation experiments.
4.1 Base-Model Mechanisms

In GPS, scientists participate in artifacts or create new ones without a central authority and meso-level (community-level) governance. In the formulation of the base-model, there is no enculturation or entrance threshold for a scientist who is willing to become active and contribute to an artifact. Let us imagine a web tool in which motivated scientists, who believe in the collective action, can browse the list of open artifacts, select one of them, contribute to it, and thus learn from the artifact. In the following sections, the micro-level interaction mechanisms of the base-model are introduced.

4.1.1 Artifact Selection

As mentioned in the previous sections, scientists browse the web tool, which is metaphorically a grid in the model. However, not all scientists are equal in terms of time spent browsing the online tool. Some scientists browse more titles, while some browse fewer. That means the environment is heterogeneous regarding the width of scopes among scientists. Each scientist has a scope that is bounded (they do not have perfect information about the whole environment), and scientists can only operate within that scope while searching for an artifact. The selection process is based on the calculation of three dimensions:

Popularity: Scientists might select an artifact according to the artifact popularity; the more elaborated the artifact is, the more likely it is to be selected (0 >< >>: (1 1 Min j ) Sj;t if Sj;t initial-maturity+ (Min j + 1 Min j ) (Sj;t ) otherwise (4.6)

For better illustration, Figure 4.2 describes the shape of the introduced function that updates Tension. As a note, in Figure 4.2, the completion maturity of the artifact is 0.8, the initial maturity is 0.3, and the minimum tension is set to 0.2.

Figure 4.2: Shape of the Function that Updates Tension

Some scientists believe in the necessity of scientific collaboration within GPS more than others.
Altruism, an independent variable, explains belief in collective action. A scientist's decision to become active is based on Olson's statement in the case of shared costs, which states, "if the benefit is more than the costs of an action, people will participate" (Olson, 1974). There is an analogy between benefit and the product of self-interest and exposure, and between cost and the product of the tension in an artifact and the cognitive burden of a scientist. The condition to become active is given below:

$$B_{k,j} \times \tau_{j,t} \le F_{k,j} \times X_{k,t} \times \mathrm{Altruism} \qquad (4.7)$$

where Altruism is a value that is fixed throughout the simulation and is different for each scientist.

4.1.3 Learning and Influencing Processes

Scientists interact via the flow of information through the artifacts. When a scientist decides to become active following the cost/benefit analysis described above, he or she contributes to the repository of the artifact in productive ways, for instance, by commenting, posting a solution, or writing code. The monotonic transfer mechanism described in Page (2010) regarding information flows is interpreted for GPS and implemented. If the expertise of the scientist is greater than the complexity of the artifact, then a contribution results in:

$$C_j[i] = C_j[i] + (1 - C_j[i]) \times E_k[i] \times \omega_j \qquad (4.8)$$

where i is a randomly selected area, j is the contributed artifact, k is the contributing scientist, and $\omega_j$ is the elasticity of the artifact. A greater elasticity indicates that it is easier to elaborate on that particular artifact. This transfer mechanism also implies faster growth in complexity if the expertise of scientist k is higher than the complexity of artifact j. The transfer mechanism indicates slower growth when complexity approaches its maximum value of 1. The transfer mechanism for influencing the expertise level of the contributor, or learning process, is articulated below.
This mechanism indicates slower growth than the influencing process in order to make it harder to gain expertise over time:

$$E_k[i] = E_k[i] + (1 - E_k[i]) \times (C_j[i] - E_k[i]) \times B_{k,j} \qquad (4.9)$$

where i is a randomly selected area, j is the contributed artifact, and k is the contributing scientist. The higher the challenge or cognitive burden is, the higher the learning of the contributor. Learning is only justified when the complexity of the artifact is greater than the expertise of the scientist. Both transfer mechanisms assume a monotonic increase in the expertise levels of scientists and the complexity levels of artifacts over time. These assumptions are also based on our observations and logic derived from OSSD communities. Expertise is, logically, something that is non-decreasing within the same context and without relative evaluation. Additionally, complexity is observed to increase along with the number of contributions in OSS environments, which makes it harder for fellow scientists to follow and understand the artifacts.

Other than through transfer mechanisms, a contribution may cause a change in the theme of an artifact, or it may cause a change in a scientist's interests. The higher the expertise of a scientist, the more likely there is to be a change in an artifact's theme. The higher the maturity of an artifact, the less likely it will be to change the interest level of a scientist. In both cases, a random area on both the interest and theme lists is equalized to demonstrate the influence of the contribution. Additionally, there is a mutation mechanism on an interest area of a scientist with a certain probability (e.g., 0.01) at each time tick, since a person's interests are subject to change through time.

4.1.4 Information Foraging Mechanisms

Metaphorically, scientists can be viewed as predators.
Predators are expected to abandon their current territory (e.g., domain) when the local capture rate (e.g., success of problem solving) is lower than the estimated capture rate in the overall environment (Bernstein et al., 1988). Information foraging theory, developed by Pirolli and Card (1999), assumes that people, if they have an opportunity, will adjust their strategies or the topology of their environment to maximize their rate of information gain. In this study, scientists join or abandon artifacts based on perceived cues about their performance in attaining the desired outcome.

Every scientist has a different instrumentality, meaning that they have different levels of expectations for the amount of time they should spend on their research until they have a successful contribution. Each scientist has a different initial expectation, which is called timeToContribution and shown as $TC_{k,t}$ for scientist k at time t. Each scientist k has a memory-factor $\mu_k$; a lower memory-factor encourages conservative behavior, which guides the scientist to maintain his or her previous expectations. Scientists modify their expectations as follows after every successful contribution:

$$TC_{k,t} = (\mu_k \times TP_{k,t}) + [(1 - \mu_k) \times TC_{k,t-1}] \qquad (4.10)$$

where $TP_{k,t}$ is the number of time ticks passed without having a successful contribution for scientist k at time t. If the amount of time passed is more than the modified expectation, then the scientist forages. In foraging, the scope is expanded (e.g., doubled) and the scientist moves to a different area in the expanded scope on the grid.

Figure 4.3: Information Foraging Behavior

In food foraging, Charnov (1976) states that a forager should leave a territory if the rate of gain (in terms of energy) within the territory that the forager resides in drops below the rate of gain that can be achieved by traveling to a different territory. In Charnov's Marginal Value Theorem, the gain starts after a certain time t, where t is the amount of time the forager spends traveling to a new territory. By analogy, in GPS, traveling to another territory is almost instantaneous. Therefore, the tradeoff between time spent in traveling and the expected rate of gain is not valid.

In the base-model, the basic foraging mechanism described above is used. A second foraging mechanism is also developed: optimal foraging, a term inspired by Pirolli (2007) and Charnov (1976). Optimal foraging checks the rate of return in terms of expertise a scientist gains from the environment. During a time window, if the rate of return drops consecutively below the maximum rate of return achieved, then the scientist forages. In both strategies, every scientist has a different expectation regarding the amount of time that will pass until the success criterion is achieved.

4.1.5 Population Dynamics

Yilmaz (2008a) states that innovation communities are self-organizing complex adaptive systems; however, not all complex systems are self-organizing (Monge and Contractor, 2003). A system is self-organizing when the network is self-generative (e.g., new arrivals), there is mutual causality between parameters, energy is imported into the system (e.g., creating new artifacts and opportunities), and the system is not in an equilibrium state. The simulation environment in this research is not a closed system. Like web platforms in real life, the model has new user arrivals. In order to maintain simplicity, there is no recruiting process for scientists. New scientists, who start to browse the system, are created with a certain arrival rate. At each time tick, with a certain probability (e.g., 0.2), a new arrival enters the system, either creating a new artifact (with a probability of 0.05) or just browsing the environment. Figure 4.4 illustrates the new scientists arriving at the system.
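The update rules of Sections 4.1.3 to 4.1.5 can be collected into one minimal Python sketch. It assumes that Equations 4.8 to 4.10 are products of the listed terms, that a contribution raises complexity only when expertise exceeds complexity, and that learning occurs only when complexity exceeds expertise, as the text states. All function and parameter names are hypothetical, and the default probabilities are the example values quoted above rather than calibrated settings.

```python
import random
from typing import List, Optional


def contribute(complexity: List[float], expertise: List[float],
               area: int, elasticity: float) -> None:
    """Equation 4.8: artifact complexity grows with contributor expertise."""
    c, e = complexity[area], expertise[area]
    if e > c:  # contribution applies when expertise exceeds complexity
        complexity[area] = c + (1.0 - c) * e * elasticity


def learn(expertise: List[float], complexity: List[float],
          area: int, burden: float) -> None:
    """Equation 4.9: expertise grows only if the artifact is more complex."""
    e, c = expertise[area], complexity[area]
    if c > e:
        expertise[area] = e + (1.0 - e) * (c - e) * burden


def update_expectation(memory_factor: float, ticks_passed: float,
                       previous_expectation: float) -> float:
    """Equation 4.10: blend of time passed and the previous expectation."""
    return (memory_factor * ticks_passed
            + (1.0 - memory_factor) * previous_expectation)


def should_forage(ticks_passed: float, expectation: float) -> bool:
    """A scientist forages once the waiting time exceeds the expectation."""
    return ticks_passed > expectation


def arrival(rng: random.Random, p_arrival: float = 0.2,
            p_create: float = 0.05) -> Optional[str]:
    """One time tick of the arrival process: None, 'create', or 'browse'."""
    if rng.random() >= p_arrival:
        return None
    return "create" if rng.random() < p_create else "browse"


if __name__ == "__main__":
    complexity, expertise = [0.2], [0.5]
    contribute(complexity, expertise, area=0, elasticity=0.5)
    print(complexity)  # complexity of area 0 after one contribution
    print(update_expectation(0.25, ticks_passed=8, previous_expectation=4))
```

With a memory-factor of 0.25, eight ticks without success, and a previous expectation of four ticks, the updated expectation is five ticks, so a low memory-factor keeps the expectation close to its previous value.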
Figure 4.4: Population Dynamics in the Environment

4.1.6 SEIR Metaphor

The SEIR model is a widely known epidemiology model (Newman, 2010). It stands for four states of an individual's transition:

- Susceptible (S) state describes the initial state. All individuals are susceptible to the collective action or collaboration in the community.
- Exposed (E) state represents the interaction with an activity, virus, or idea.
- Infected (I) state describes the influence on an individual by an activity, virus, idea, or some sort of knowledge.
- Recovered (R) state is an inactive state. Scientists become inactive and leave the environment in the Recovered state.

The metaphor built from the SEIR model is described through state machine formalism in Figure 4.5. A state machine performs actions when a certain event occurs.1 The state machine illustrated in Figure 4.5 is idle except at the times when events are realized.

1 http://www.agilemodeling.com/artifacts/stateMachineDiagram.htm - As of 4.07.2013

Figure 4.5: SEIR Model

The actions that cause the transitions between the states are described below:

- E0/start(): The initial population and the majority of the new arrivals (the ones arriving without creating new artifacts) are initialized.
- E1/select(): The initial population and new arrivals start the simulation in the Susceptible state. If they find an artifact in their scope, they switch to Exposed and move to the location of the artifact.
- E2/browse(): After evaluation of the selected artifact, scientists might decide not to become active. In that case, they browse their scopes searching for further opportunities, changing the selected artifact or residing on the same one.
- E3/contribute(): If a scientist decides to become active after the collective action mechanism is evaluated, then he or she transitions to the Infected state.
E4/inactive(): At the next time tick after a successful contribution, there is a 20% chance that a scientist changes his or her artifact preference, switching to one of the artifacts contributed to in the past. Scientists keep a memory list of the artifacts they have contributed to and probabilistically select one of them. The more recently an artifact was contributed to, the more likely it is to be selected again.
E5/return(): An Infected scientist evaluates the collective action mechanism at each time tick, decides whether or not to become active in the following time tick, and transitions to Exposed.
E6/create(): If a scientist cannot become active for a certain amount of time and has foraged for a while, then he or she can create a new artifact (5% chance) that is related to his or her interest areas and become active on it.
E7/leave(): Active scientists might not be able to find an artifact in their memory list to work on (e.g., all past artifacts might be closed), so they leave the contributed artifact and transition to the Susceptible state.
E8/forage(): If a scientist cannot become active within the time it takes to have his or her expectations fulfilled, then the scientist forages, expanding the scope by a factor (initially 2).
E9/recover(): If the expertise of a scientist is over a certain threshold (e.g., 0.5, 0.7), he or she cannot make a successful contribution, and he or she keeps foraging in the environment for a certain amount of time (e.g., 3), then with a certain probability (e.g., 0.2) the scientist becomes Recovered and leaves the environment.
E10/arrive(): A proportion of the new arrivals (5% chance) enter the system by creating a new artifact and start to work on that particular artifact in the Infected state.
E11/depart(): Recovered scientists leave the environment. In the base-model, the nodes and ties of recovered scientists are retained, whereas in Phase-2, scientists dissolve their social ties and disappear.
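As a reading aid, the transitions that the list above states explicitly can be written as a small lookup table. This is only a sketch under stated assumptions: Figure 4.5 is the authoritative description, the names State, TRANSITIONS, and step are hypothetical, and the source state of E9/recover() is assumed here to be either non-active state because the text does not name it. Events whose source or target state is not spelled out in the text (E0, E2, E4, E6, E8, E10, E11) are deliberately left out.

```python
from enum import Enum


class State(Enum):
    SUSCEPTIBLE = "S"
    EXPOSED = "E"
    INFECTED = "I"
    RECOVERED = "R"


# (current state, event name) -> next state.
TRANSITIONS = {
    (State.SUSCEPTIBLE, "select"): State.EXPOSED,    # E1
    (State.EXPOSED, "contribute"): State.INFECTED,   # E3
    (State.INFECTED, "return"): State.EXPOSED,       # E5
    (State.INFECTED, "leave"): State.SUSCEPTIBLE,    # E7
    # Assumption: E9 may fire from either non-active state.
    (State.SUSCEPTIBLE, "recover"): State.RECOVERED,
    (State.EXPOSED, "recover"): State.RECOVERED,
}


def step(state, event):
    """Return the next state; unlisted (state, event) pairs keep the state."""
    return TRANSITIONS.get((state, event), state)
```

For example, applying "select", "contribute", and "return" in turn to a Susceptible scientist moves it through Exposed and Infected and back to Exposed, while any event applied to a Recovered scientist leaves it Recovered.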
The activity flow chart of a scientist in Figure 4.6 represents the flow of the mechanisms as a whole. In flow charts, processes are associated with the vertices; when the flow reaches a node, the activities of that node are executed.

Figure 4.6: Activity Flow Diagram of a Scientist

4.1.7 Conceptual Model Validation

Sargent (2005) defines conceptual model validity as the following:

(1) The theories and assumptions underlying the conceptual model are correct, and (2) the model representation of the problem entity and the model's structure, logic, and mathematical and causal relationships are reasonable for the intended purpose of the model.

If the model outputs are tested against real world data, this is known as black-box validity. Typically in socio-technical environments, real world data is a luxury. In order to increase the white-box validity, or credibility, of a model, certain questions should be answered, such as "How well grounded is the theoretical base of the model?" and "How realistic are the inputs?"

Since the conceptual model is the abstraction in the modeler's mind, model representation is critical for ensuring better communication between clients, stakeholders, and the modeler. Balci (1990) describes the communicative model, or representation of the conceptual model, and identifies six forms: "(1) structured, computer assisted graphs, (2) flowcharts, (3) structured English and pseudocode, (4) entity-cycle (or activity cycle) diagrams, (5) condition specification, and (6) other diagraming techniques." A good representation of the conceptual model can provide an easier assessment process and a better understanding of the model, as well as increased credibility. To delineate the base-model components, this research provides flowcharts, state-charts, visual snapshots, mathematical formulas, and structured English. Additionally, pseudocode for some important mechanisms in the implementation is provided in Appendix A.
Table 4.1 summarizes the grounding theory base and the assumptions in the base-model.

Table 4.1: Conceptual Mechanisms and Assumptions
Mechanism/Assumption | Grounding Theory/Study
Artifact Selection Mechanism | Preferential Attachment Process (Barabasi, 2002)
Bounded Rationality Assumption | Axtell and Epstein's model (2006)
Roulette Wheel Algorithm | Inspired by Predictably Irrational (Ariely, 2008)
Information Foraging Mechanisms | Metaphors in (Pirolli, 2007) and Marginal Value Theorem from Charnov (1976)
Collective Action assumption | Olson's Collective Action Theory (1974)
Exposure to Mutual-interest | Axtell and Epstein's model (2006)
Tension within the Projects | Evolution of Project Life-cycles in OSSD Communities (Wynn, 2003)
Population Dynamics | Self-Organization Principles (Camazine, 2003; Monge and Contractor, 2003)
Learning and Influence Mechanisms | Information Flows Process in Page (2010)
CAS Principles (e.g. Tagging, Information Flows) | Holland's Hidden Order Book (1996), Assumptions in Yilmaz (2008a) and Wagner (2008)

4.2 Computational Model and Repast Implementation

Repast (Recursive Porous Agent Simulation Toolkit) is the toolkit used to create the simulation environment. It is an open source simulation tool that allows for the development of multi-agent simulation models in Java. Although it offers a powerful framework that supports developers while building the context, it takes time to learn: the demo models and the documentation need more detailed explanations. Even though it is possible to encounter maintenance problems related to Repast, it has a highly active community that answers submitted inquiries and helps users fix software bugs.

In this research, the main interaction context is a grid. Both scientists and artifacts are assigned to a cell. Cells are multi-occupancy, which means a cell can hold more than one agent. Figure 4.7 represents a snapshot of the RePast API.
The top-right quadrant is the grid environment, and the bottom-right quadrant is the 3D social network representation. In the far-left column, users can schedule the length of the simulations and change their run speed. In the parameters column, users can explore the mechanism and parameter space by entering the values representing a desired scenario. On the grid, blue nodes are artifacts and red nodes are scientists.

Figure 4.7: RePast API

4.3 Model Verification

Verification indicates how correctly the implemented model represents the conceptual model. As the conceptual model becomes more complex, the complexity of the computational model increases significantly. Yilmaz (2006) incorporates model verification and validation into the life cycle of a simulation study.

Figure 4.8: Life Cycle of a Simulation Study (Yilmaz, 2006)

The end result of verification is technically not a verified model, but rather a model that has passed all the verification tests.[2] The verification tests conducted in this research are listed below:

- Eclipse[3] is used as an editor and a development environment. Debugging is used to detect anomalies in each implemented module.[4] Step by step, each piece of code that updates variables is inspected in the debugger through reasonableness analysis.
- Unit-test principles[5] ensure that every algorithm and method is checked individually under extreme conditions (e.g., the behavior of certain outputs when there is only one scientist in the network) before implementation (Extreme Programming).
- Flow diagrams are used to verify the code.
- Performance metrics are manually calculated for small populations (typically a network of 5 scientists), and the outputs of the simulations are compared with the manual calculations.

[2] http://jtac.uchicago.edu/conferences/05/resources/V&V_macal_pres.pdf - As of 4.07.2013
[3] http://www.eclipse.org/ - As of 4.07.2013
[4] http://www.ibm.com/developerworks/web/library/wa-debug/index.html - As of 4.07.2013
- Input parameters are printed throughout the simulation to check for any inconsistencies that might have been introduced.
- The code is self-documented. Every piece of code that is assumed to be important has comments attached to it.
- Data interchange files (gdf) are generated for further verification, in a format that records the node and edge data created by the simulations. Then, the network visualization tool Scibrowser, written in Python and developed in the Auburn University Simulation and Systems Engineering Lab, parses the gdf files and calculates the network metrics. The calculations of the implemented Java code are compared with the Scibrowser outputs.
- OBO Foundry collaboration data is parsed in the Auburn University Simulation and Systems Engineering Lab. This data is explored to determine the parameter and output ranges that can be encountered in GPS.

[5] http://geosoft.no/development/unittesting.html - As of 4.07.2013

4.4 Model Validation

Simulation, as a simple definition, generates a model of a system with suitable inputs and observes the outputs (Bratley et al., 1987). Simulation models can be described as abstractions of real-world systems or proposed real-world systems, so they cannot be expected to represent every feature of a real system (Robinson et al., 2010). From this definition, the question to be asked is: "Do we have a consensus between the model we build and what we intend to do?" The problem formulation and definition strongly affect the credibility and qualitative performance of a model. Sargent (2005) defines validation as:

Model validation is usually defined to mean substantiation that a computerized model within its domain of applicability possesses a satisfactory range of accuracy consistent with the intended application of the model.
The approach of Silverman and Bharathy (2011) to validity assessment considers the life cycle of the entire simulation study and assesses validity along the following four dimensions: "(1) methodological validity, (2) internal validity, (3) external validity, and (4) qualitative, causal and narrative validity." In methodological validity, the authors consider the adequacy of the modeling process and the software process while obtaining inputs. Internal validity refers to the theoretical base of the behaviors in the model, and external validity examines how reasonable the output data is. Qualitative analysis consists of cross-validation techniques such as face validation, comparison of graphics, and visual analogies.

Regarding methodological validity, an ethnographic analysis is conducted in this research through observation of the environment of interest. OBO data is analyzed to understand the possible outcomes and parameter ranges while determining the initial conditions of the simulation runs. The simulation modeling process is followed in order to avoid initialization bias, to support terminating state decisions, and to establish the number of replications. In the previous sections, the theoretical basis of the conceptual model was described to address internal validity. External validity is checked by observing the variability of outputs for different scenarios. Further, a single scientist is followed using two methods, debugging and visual tracking on the Repast API, to compare behavior against expected regularities. Qualitative analysis is performed in the next sections, presenting different macro-level emergent patterns for validation purposes.

According to Yilmaz (2006), there are two classifications of validation studies: traditional and holistic/pragmatic approaches. The traditional approach sees the model as either valid or invalid with regard to its application area.
The pragmatic/holistic approach does not value the definite correctness or incorrectness of the model. The traditional view supports a division of the model into its parts (reductionist) in order to examine whether the parts are representative of the real system. The traditional view ensures that the predictive capability of the model is relevant to the real system. In complex systems, however, the holistic approach is more dominant, meaning that the system is more than the sum of its parts. The ability of a simulation model to generate an anticipated emergent behavior or to mimic the data does not necessarily mean that it is a good representation of reality (Yilmaz, 2006). Silverman and Bharathy (2011) claim that models are frequently evaluated by their capability to estimate an observed phenomenon over a specified range, which means that each model has a fitness for use.

To what extent the model should be perceived as credible is another question of concern. Sargent (2005) postulates that determining the absolute validity of a model with regard to the domain and purpose of the study is usually too costly in terms of time and resources. While more effort might be better, a reasonable amount of effort on validation can be satisfactory.

Additionally, validation studies of socio-technical system simulations can be problematic. The lack of data and the high level of abstraction in the assumptions about individual behaviors make it difficult to assess validity. Klügl (2008) lists the primary obstacles to the validation of agent-based models, including transient dynamics in the model, non-linearity, the amount of effort, and the availability of data. Data can be used to train the model and calibrate it before conducting sensitivity analysis. Tuning the model parameters using a meta-heuristic can increase the credibility of the model by demonstrating its ability to mimic real world cases.
However, if the model aims for generalization ability and is intended for exploratory analysis, then creating what-if scenarios and tuning the model would cause an over-fitting problem.

4.4.1 OBO data and Over-fitting Problem

In statistics, over-fitting is a violation of parsimony, that is, the inclusion of more terms, variables, and/or procedures than necessary in the model (Hawkins, 2004). Experimenters explore the relationships between the measures. Complicated models are not easy to interpret and tend to over-fit. The realism achieved by mimicking the real world data and components in great detail may make the model inappropriate and may impair its ability to answer the questions of interest (Laine, 2006). Grunwald (2005) points out the dangers of over-fitting:

If you over-fit, you think you know more than you really know. If you under-fit, you do not know much but you know you do not know much. In this sense, under-fitting is relatively harmless, but over-fitting is dangerous.

In general, over-fitting happens when a model learns to describe noise in addition to the real dependencies between input and output.[6] Cawley and Talbot (2010) suggest separating the model testing and model fitting processes from one another while training the model with a wider range of available data. Reunanen (2003) also suggests dividing the available data into training and test sets. This suggestion, however, assumes that the data is identically distributed. In the case of the social network metrics in OBO, the time series data has trends and high variability, especially during the early stages of the network. Distinguishing different network phases in the data and dividing each phase into subsets could be a solution to the problem of generating identically distributed data. However, the sample data sets of OBO are not large enough for this practice.

[6] https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Overfitting#toc1 - As of 4.07.2013
Cawley and Talbot (2010) state that it is possible to overcome the over-fitting problem by regularization, early stopping, or ensemble algorithms. Early stopping first separates the data into training and test subsets, then trains the model with the training set, pausing at times to test the model with the test data. Training is stopped when the test results start to deteriorate, the point at which validation performance is optimal. Regularization penalizes the complexity of the model in the fitness function to avoid fitting the noise. An ensemble is defined as a collection of models whose predictions are combined by weighted averaging or voting (Caruana et al., 2004).

In OBO, scientists form communities and domains related to different areas of the health sciences while collaborating on the ontology data to standardize the shared terminology. It is a Sourceforge-style science development activity. In the OBO data, the assumption is that if two scientists collaborated on the same artifact in the same month, then they are connected. OBO log data (2000 to 2009) is parsed from Sourceforge, and the social network data is generated from it.

Over-fitting is a particular concern when the sample is small, which is the case for the OBO data. There is not a significant amount of data, and additional data is not available. The validation method conducted in this research implements a genetic algorithm that evolves the model parameters, thereby minimizing the absolute difference between the simulation outputs and the OBO data. Since there are few communities with a significant amount of data, statistically estimating the distribution of certain metrics became unrealistic. Additionally, fitting the simulation parameters to the data of a single community would destroy the generalization ability. Hence, OBO data is only used for model calibration (oversight of the level of activity among artifacts and scientists) and validation of emergent patterns.
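The rule used above to derive the OBO social network (two scientists are connected if they contributed to the same artifact in the same month) can be sketched as follows. The record layout (scientist, artifact, month) and the function name are assumptions for illustration; the actual parsing of the Sourceforge logs is not described in this chapter.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_edges(records):
    """Build undirected ties from contribution records.

    records: iterable of (scientist, artifact, month) tuples
             (hypothetical layout, e.g. ("s1", "artifact_1", "2005-03")).
    Returns a set of frozensets, each holding the two connected scientists.
    """
    # Group contributors by artifact and month.
    groups = defaultdict(set)
    for scientist, artifact, month in records:
        groups[(artifact, month)].add(scientist)

    # Every pair of scientists within the same group is connected.
    edges = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            edges.add(frozenset((a, b)))
    return edges
```

Using frozensets keeps the ties undirected and removes duplicates when the same pair collaborates in several months; a weighted variant would count the repetitions instead.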
Three macro-level patterns (CAS specific and domain specific) are sought in the simulation outputs and the OBO data for validation purposes in the following sections.

4.5 Initial Conditions and Terminating State Decision

Naturally, in complex adaptive models, the outputs are highly likely to have cyclic structures. In theory, these systems never stop running and usually do not reach equilibrium, because there are phase transitions in the long run. The analysis of steady-state simulations is more difficult than that of terminating simulations. In this research, the sensitivity analysis was initially intended to be conducted by measuring point estimators at the terminating state. However, high variability among the point estimators is observed at the terminating state. So, instead of observing point estimators, the performance metrics are averaged over the last 100 time ticks before the terminating state. In order to justify the terminating state decision, preliminary runs covering a variety of scenarios are conducted. The actions taken and the conclusions derived from the output behaviors are as follows:

- As a result of the computational complexity, it became impossible to run simulations for thousands of time ticks and to output the time series data for each performance metric. Therefore, the simulation run-length is set to 1250 time ticks for each scenario (30 replications for each), which is observed to be sufficient to discern patterns of different output metrics in the long run. The main goal is to measure the trends of the time-series data to identify different stages of network evolution.
- A warm-up period for the simulation runs is not needed, which eliminates the concern of initialization bias.
- Preliminary analysis is conducted for random scenarios, and time series for different performance metrics are plotted. The plots are inspected visually for qualitative analysis purposes.
- No terminating state works perfectly for each scenario and each metric, owing to the stochastic nature of this study. While some scenarios converge to the core/periphery stage after 100 time ticks, other scenarios may need 400 time ticks to converge. Therefore, a terminating state is chosen that is sufficient to observe the core/periphery stage under various scenarios.
- In OBO scenarios, before 500 time ticks, variances decrease and the metrics seem to fluctuate around the same values.
- In random connection scenarios, it is qualitatively discernible that the metrics start to fluctuate around the same values before 500 time ticks, but at later time ticks it is possible to observe trends in the data due to the dissolution of members from the social network.

In this study, the goal is to analyze the outputs at the core/periphery stage and to evaluate the network at that stage. Furthermore, the justification for the terminating state can be supported by the following network snapshots. Figure 4.9 illustrates network snapshots at 400 time ticks for four scenarios extracted from the preliminary runs. These snapshots represent distinct scenarios with large differences between parameter values. The core/periphery stage, which has less variable network metric values, can be observed in the snapshots. These networks are knitted closely in the core and are more resilient. Hence, the terminating state of the model is set to 500 time ticks, which can be perceived metaphorically as a 10-year collaboration period (by comparing the output to OBO). The analyses are conducted over the last 100 time ticks (from 400 to 500 time ticks). Appendix A has snapshots of different performance metrics for different scenarios in order to illustrate the behavior of the time series data.

Figure 4.9: Sample Core-Periphery Structures at time 400. (a) Core/Periphery-Scenario1 (b) Core/Periphery-Scenario2 (c) Core/Periphery-Scenario3 (d) Core/Periphery-Scenario4
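The windowed estimator described above (averaging a metric over the last 100 time ticks before the terminating state at tick 500) can be sketched in a few lines. The function name and the convention that series[0] holds the value at tick 1 are assumptions for illustration only.

```python
from statistics import mean


def terminal_window_mean(series, end_tick=500, window=100):
    """Average a per-tick metric over the last `window` ticks up to `end_tick`.

    Assumes series[0] is the metric value at tick 1, so the default
    arguments average ticks 401 through 500.
    """
    if len(series) < end_tick:
        raise ValueError("series is shorter than end_tick")
    return mean(series[end_tick - window:end_tick])
```

Averaging over a window rather than reading a single tick reduces the influence of the tick-to-tick fluctuations noted above, at the cost of smoothing over any trend inside the window.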
4.5.1 Initial Conditions

The initial parameter values, selected after the preliminary results were observed, are listed in Table 4.2.

Table 4.2: Initial Settings of the Model
Parameter Name | Initial Value | Purpose
Weight of Familiarity | 0.34 | Weight assigned to Familiarity in the preferential attachment mechanism
Weight of Imitation | 0.33 | Weight assigned to Imitation in the preferential attachment mechanism
Weight of Popularity | 0.33 | Weight assigned to Popularity in the preferential attachment mechanism
Arrival rate | 0.2 | Probability that a new scientist arrives in the system at each time tick
End of simulation | 500 | Time tick at which the simulation stops
Initial number of artifacts | 10 | Initial number of artifacts created on the context
Initial number of scientists | 25 | Initial number of scientists created on the context
Probability to leave | 0.2 | Turnover rate for a scientist to leave an artifact
Maximum Altruism | 0.5 | Maximum level of Altruism a scientist can take
Maximum Scope | 5 | Maximum number of cells that a scientist can browse
Minimum Scope | 1 | Minimum number of cells that a scientist can browse
Maximum Time Expectation | 10 | Maximum number of time ticks until a reward/contribution
Minimum Time Expectation | 5 | Minimum number of time ticks until a reward/contribution
Minimum Tension | (0, 0.5] | Range of values that the lower bound of Tension takes
Minimum Cognitive Burden | (0, 0.5] | Range of values that the lower bound of Cognitive Burden takes
Elasticity | (0, 1] | A value closer to one indicates that the complexity of an artifact is easier to stretch
Completion Threshold | (Initial Maturity, 1] | Higher values mean a relatively longer life-cycle for an artifact
Memory Factor | (0, 1] | Weight given to the previous estimation as opposed to new experience in foraging behavior
Core Threshold | 5 | Number of connections a scientist should have to be core in Core/Periphery calculations
Theme Length | 10 | Number of bits in the Interest, Theme, Complexity, and Expertise arrays
Forage Extension | 2 | Multiplier that expands the scope in foraging mechanisms
Recover Rate | 0.2 | Probability of entering the Recovered state if the conditions are met
World Width | 50 | Number of cells the grid has horizontally
World Height | 50 | Number of cells the grid has vertically
Foraging Mechanism | 2 | Basic (2) or Optimal Foraging (1) mechanism
Migration Threshold | 3 | Number of migrations to be experienced before artifact creation and leaving
Artifact Creation Rate | 0.05 | Probability that a new arrival, or a scientist who has passed the migration threshold, creates a new artifact
Mutation Rate | 0.01 | Probability that a scientist will change his/her interest at each time tick

4.5.2 Bonferroni Analysis

After the length of the runs has been established to avoid initialization bias and to set the conditions for sensitivity analysis, the next step is to decide on the number of replications (n) that will be used. The output metrics are not normally distributed in the models of this research, and normality cannot be assumed. The lack of normality is inconsequential because, if n replications are conducted for each scenario and repeated r times with different random number seeds, then the mean of each replication batch is expected to be normally distributed. This is a result of the Central Limit Theorem.[7]

Sampling variability is a primary concern in the choice of n. In this research, Bonferroni Analysis is conducted to determine the number of replications because multiple performance measures are observed. The so-called problem of multiple comparisons should be mitigated. There may be a number sufficient for estimating an output metric with a given confidence interval.
But for different scenarios and different metrics, the number of replications could vary. Regardless, the Bonferroni inequality states that "all intervals should contain their performance measure simultaneously." Relative to this concern, the Bonferroni inequality addresses an overall probability of at least $1 - \alpha$ that the confidence intervals of all $k$ metrics contain their own expected performance measures. If the confidence level of the interval for metric $s$ is $1 - \alpha_s$, then the Bonferroni inequality states:

$$P(\text{All intervals contain their respective performance measure}) \geq 1 - \sum_{s=1}^{k} \alpha_s \qquad (4.11)$$

Forty-four scenarios (8 OBO and 36 Random) and five performance metrics are used for the Bonferroni analysis. $\alpha_s$ is set to 0.02 for each metric. t-statistics are used to determine the minimum n that would ensure that all metrics fall within their respective confidence intervals simultaneously, with an overall confidence of $1 - 0.10$. The initial number n is set to 30 replications. The target half-width is set to 10% of the mean. So, the educated guess of n is found as:

$$n = \frac{t^{2}_{n-1,\,1-\alpha_s/2}\; s^{2}}{h^{2}} \qquad (4.12)$$

where $t_{n-1,\,1-\alpha_s/2}$ is the t statistic, $s^2$ is the variance, and $h^2$ is the square of the half-width. The analysis is done for each scenario by recording the mean and standard deviation at 500 time ticks. Table 4.3 below presents the maximum n found among the forty-four scenarios for each metric. Density is excluded from the analysis because of its high coefficient of variation. Additionally, scenarios with low arrival rates (e.g., 0.1) and small initial populations are excluded from the analysis due to the significant effects of node removal on the variability of social network metrics in smaller (population) social networks. The reason for the exclusions was to avoid the impact of high variability on the n determined by Bonferroni Analysis.

[7] http://www.math.csusb.edu/faculty/stanton/probstat/clt.html - As of 4.07.2013
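A small numerical sketch of Equations 4.11 and 4.12 is given here as an illustration. The function names are hypothetical, and the t value is passed in by hand rather than looked up. The inputs in the usage note are the rounded Average Path Length entries and the t value printed with Table 4.3.

```python
def bonferroni_lower_bound(alphas):
    """Lower bound on the joint coverage probability (Equation 4.11)."""
    return 1.0 - sum(alphas)


def required_replications(t_value, std_dev, mean, half_width_fraction=0.10):
    """Educated guess of n from Equation 4.12.

    The half-width h is taken as a fraction of the mean (10% in the text).
    """
    h = half_width_fraction * mean
    return (t_value ** 2) * (std_dev ** 2) / (h ** 2)
```

With five metrics at an individual level of 0.02, bonferroni_lower_bound([0.02] * 5) gives an overall confidence of 0.90. For Average Path Length, required_replications(2.462, 0.35, 2.75) evaluates to about 9.8, close to the 9.91 reported in Table 4.3; the small gap presumably comes from the rounding of the tabulated mean and standard deviation.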
Also, it is observed that, under the initial conditions, these exclusions do not have an impact on the n that is determined.

Table 4.3: Maximum n Values Found
Statistic | Avg. Path Length | DC | CC | DiversityS | DiversityN
Mean | 2.75 | 0.27 | 0.28 | 0.26 | 0.47
Standard Deviation | 0.35 | 0.08 | 0.10 | 0.08 | 0.01
Half-width | 0.275 | 0.027 | 0.028 | 0.026 | 0.047
Maximum n | 9.91 | 48.60 | 78.14 | 62.18 | 0.62
($t_{29,\,1-0.01} = 2.462$)

A conservative approach is taken: the number of replications is set to 100, which is assumed to be sufficient to avoid sampling variability. With regard to computational complexity and time constraints, conducting 100 replications provides acceptable performance. As a note, in the Bonferroni analysis, metrics such as diversity among artifacts and diversity among links are excluded because they have patterns similar to the other diversity metrics that are used in the analysis. It is also revealed that scenarios with small populations and low arrival rates should either be analyzed with large numbers of replications (around 500) or be excluded from the sensitivity analysis in order to capture statistical significance.

4.6 Activity Time Series

Fluctuating time series are a familiar representation of dynamical systems (Kendall, 2001). Complex adaptive systems, which include non-linear dynamics, exhibit attractors. An attractor can be described as a subset of the states that the system passes through when starting from typical initial conditions.[8] It is common to observe more than one attractor in complex adaptive systems. Chaotic attractors seem random, but they actually constitute a complex order that does not repeat, which indicates that a dynamical system behaves within certain ranges of possible behaviors (Goldstein, 2011). Kendall (2001) indicates that nonlinear population dynamics exhibit chaotic attractors. Hence, the activity diagrams of the simulation runs and of the OBO data for a randomly chosen community are plotted.
In Figure 4.10, activity diagrams of a single community in OBO are illustrated. Figure 4.11 presents the simulation results for a single run to illustrate similar fluctuating structures. This pattern is observed in each run of the simulation outputs. In both the simulation and the OBO data, chaotic attractors are observed as a hallmark of complex adaptive systems. These findings suggest that the CAS assumption is legitimate in the model and that the real world data (OBO) also exhibits the same CAS patterns.

[8] http://www.scholarpedia.org/article/Basin_of_attraction - As of 4.07.2013

Figure 4.10: Number of Active Artifacts and Active Scientists Over Time - OBO. (a) Active artifacts (b) Active scientists

Figure 4.11: Number of Active Artifacts and Active Scientists Over Time - Simulated. (a) Active artifacts (b) Active scientists

4.7 Power-Law Distributions

Another phenomenon this research explores is scale-free network structures, which give rise to power law distributions (Blank and Solomon, 2000). The network topology is expected to have a small number of highly central users with a substantial number of links to others, while most of the network members have a small number of links. The contribution data is suspected to have the same behavior, which means that a small number of scientists have a high number of contributions while most other scientists have smaller numbers of contributions. The observation of power law distributions is also peculiar to CAS. Figure 4.12 shows Log-Log plots of the degree and contribution distributions of the cumulative OBO data.

Figure 4.12: Log-Log Plot of Degree Distribution and Contribution Distributions of OBO. (a) Degree distribution Log-Log plot (b) Scientist contribution Log-Log plot (c) Scientist contribution Log-Log plot

Power law distributions indicate that the magnitude of a phenomenon is inversely proportional to the frequency of that particular phenomenon. Goodness of fit can be analyzed by linear regression in log-log space.
Because of multiple observations of the same value, there is noise in the tail. There are two ways to create bins of data. The first way is to have equal width for each bin, and the second way is to normalize the widths of the bins, for example with logarithmic bins. If there is not a good fit, the data can be fitted from a minimum value or up to a maximum value, since power law distributions are sometimes mixed with other distributions.

The Log-Log diagrams of the contribution distribution among scientists, the degree distribution, and the contribution distribution of artifacts are presented below. In order to generate more data and better illustrations, the simulations with the initial parameters are run 200 times, after which the data is accumulated. Each graph contains bins of equal width.

Figure 4.13: Log-Log plot of Degree Distribution of Scientists - Simulated. (a) Degree distribution Log-Log plot (b) Degree distribution Log-Log plot - After cutting off the tail

It is not possible to confidently derive a power law distribution from Figure 4.13 part (a). The main mechanism that causes a power law distribution in the base-model is preferential attachment. A reason for the exponential decrease in the tail is the lack of highly central members in the community. That is why outliers are excluded after the bin that falls between 55 and 60 connections in Figure 4.13 part (b), so that a good fit is observed, which indicates that the power law distribution holds up to certain degree values.

The contribution distribution for artifacts is represented in Figure 4.14. In Figure 4.14 part (a), all data, including the noisy tail, is included. Figure 4.14 part (b) shows the data cut off at a certain point, at which the continuity of the histogram data (bins) starts to break.
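The fitting procedure described above (equal-width binning, a linear regression in log-log space, and an optional cut-off of the tail) can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the analysis code used for the figures; it uses ordinary least squares on base-10 logarithms and reports the coefficient of determination as a rough indicator of fit.

```python
import math


def equal_width_histogram(values, bin_width):
    """Count observations per equal-width bin; returns {bin_center: count}."""
    counts = {}
    for v in values:
        idx = int(v // bin_width)
        center = (idx + 0.5) * bin_width
        counts[center] = counts.get(center, 0) + 1
    return counts


def loglog_fit(x, y, x_max=None):
    """Least-squares line through (log10 x, log10 y).

    Points with non-positive coordinates are skipped, and points with
    x above x_max (if given) are dropped to cut off a noisy tail.
    Returns (slope, intercept, r_squared).
    """
    pts = [
        (math.log10(a), math.log10(b))
        for a, b in zip(x, y)
        if a > 0 and b > 0 and (x_max is None or a <= x_max)
    ]
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    syy = sum((p[1] - my) ** 2 for p in pts)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared
```

A slope close to a negative constant with a high r_squared over the retained bins is consistent with a power law over that range; passing x_max mirrors the tail cut-off applied in part (b) of the figures.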
The contribution distribution of artifacts is highly indicative of a power law distribution, which is expected because of the artifact selection mechanisms implemented. In Figure 4.15 there is also a good fit for the contribution distribution of scientists. The power law results support the CAS assumption and the validity of the CAS principles implemented in this research.

Figure 4.14: Log-Log Plot of Contribution Distribution of Artifacts - Simulated. (a) Contribution distribution Log-Log plot of artifacts (b) Contribution distribution Log-Log plot of artifacts - After cutting off the tail

Figure 4.15: Log-Log Plot of Contribution Distribution of Scientists - Simulated. (a) Contribution distribution Log-Log plot of scientists

4.8 Collaboration Network Phases

The creation of a fractal-like structure in the course of evolution is another emergent property of CAS (Yang and Shan, 2008). A collaboration network evolves through four stages. The network topology is observed and animated through time to discern, in order, scattered, one-hub, multi-hub, and core/periphery structures (Krebs and Holley, 2002). The four main stages that collaboration networks pass through are:

• Scattered Clusters: The stage where the community starts with emergent clusters isolated from each other.
• Single Hub-and-Spoke: The stage where a hub or single actor begins to connect different clusters.
• Multi-Hub, Small-World Network: In this stage, other hubs start to emerge that are connected by weak ties.
• Core/Periphery: This stage emerges after a long period of weaving by the hubs. It is stable and easy to maintain.

The four phases of collaboration networks described in the theoretical model (Figure 4.16) are discernible in the OBO data (Figure 4.17). The snapshots in Figure 4.18 are taken from a single simulation run to illustrate the generated collaboration networks.
In contrast to the simulation data, OBO has star-like structures, which indicate connections between core members and a significant number of inactive users who have only one connection to that particular star. In OBO, some core members close or conclude artifacts that are inactive; as a result, they form a connection with creators who never elaborate on or create any other artifacts. The phase transitions can also be detected in the simulation data, and the snapshots may be used for face-validity purposes.

Figure 4.16: Emergent Network Patterns Over Time - Theoretical Model. (a) Scattered (b) One-hub (c) Multi-Hub (d) Core/Periphery. Krebs and Holley (2002)

The snapshots reveal that the simulated networks exhibit the theoretical stages through which a collaboration network evolves. This finding supports the validation of the model. The same patterns are also supported by the OBO data observations.

In Chapter 5, the socio-communication model is introduced along with the definitions of the output metrics. Subsequently, sensitivity analyses are conducted for various scenarios and innovation potential is discussed.

Figure 4.17: Emergent Network Patterns Over Time - OBO Data. (a) Scattered (b) One-hub (c) Multi-Hub (d) Core/Periphery

Figure 4.18: Emergent Network Patterns Over Time - Simulated. (a) Scattered (b) One-hub (c) Multi-Hub (d) Core/Periphery

Chapter 5

SOCIO-COMMUNICATION MODEL AND EXPLORATORY ANALYSIS

The base-model assumes that two scientists are connected if they contribute to an artifact in the same time period. This assumption about how scientists are connected is changed before the different communication theories are introduced to the base-model. In the socio-communication model, scientists attempt to attach themselves to other scientists based on their communication preferences.
In the following sections, the assumptions shared among the different communication mechanisms and the implementations of particular communication theories in GPS are discussed.

5.1 Communication Preferences

Scientists are familiar with the people who study the same artifacts that they themselves study. A scientist may only consider others who elaborated, at the same time tick, on the artifact on which the scientist is active. In addition, at each time tick, an active scientist can attempt a connection only once. This is a matter of the resources scientists have. As an underlying assumption, the scientists are homogeneous and have the same amount of resources. If the connection/interaction request is accepted and the target scientist is already a member of his or her network, then the weight of the tie between them is increased incrementally.

Reciprocity is an important concern when forming a tie with another scientist. Even though scientists are willing to connect to those with higher expertise or higher degree, these scientists might not want to connect in response. In order to address the reciprocity issue, the tendency and motivations of the other scientist, who is selected to be connected, should be taken into account. If forming a tie is a decision of both parties, then a successful connection should also depend on how the source scientist is perceived by the selected target scientist. The mechanisms for each selected communication theory are described in the following sections.

5.1.1 Random Connection

In this mechanism, a scientist selects another scientist at random to create a connection. The flow diagram in Figure 5.1 describes the random process.

Figure 5.1: Activity Flow Diagram of Random Connection Mechanism

5.1.2 Human Capital

Human capital is defined in terms of the attributes and characteristics that one has, such as reputation and knowledge.
The theory of human capital explains that people who have greater numbers of attributes and features gain more advantages in the network (Becker, 1978). In this research, the human capital mechanism states that the expertise information of scientists is broadcast to others and that scientists attempt to connect to other scientists with higher expertise. The general activity flow diagram of the implemented communication mechanisms is shown in Figure 5.2. In the human capital mechanism, the evaluation is based on the expertise levels of the fellow scientists who study the same artifact. The greater the expertise of scientists, the more likely they are to be selected by others.

Figure 5.2: Activity Flow Diagram of Communication Mechanisms

The decision criterion $P_j$ for scientist $i$ to select scientist $j$ is the average expertise level of scientist $j$:

$$P_j = \frac{\sum_{n=1}^{M} E_j[n]}{M} \tag{5.1}$$

where $M$ is the total number of expertise areas and $E_j[n]$ is the expertise level of scientist $j$ in area $n$. In the formula below, $p_{ij}$ is the probability that $i$ selects $j$ from the candidate scientists:

$$p_{ij} = \frac{P_j}{\sum_{n=1}^{N} P_n} \tag{5.2}$$

where $N$ is the total number of scientists who actively study the same artifact. A roulette wheel algorithm is used to determine which scientist is selected as $j$ by scientist $i$. Regarding reciprocity, the probability that the connection will be successful is given by Equation 5.3:

$$p_{ji} = \frac{P_i}{\sum_{n=1}^{N} P_n} \tag{5.3}$$

5.1.3 Social Capital

Social capital is the sum of the resources that the virtual or actual ties of a person hold in a network. An example of social capital is the theory of structural holes (Burt, 1995). The theory asserts that people invest in social opportunities from which they expect to profit. Structural holes are the non-connected actors in the network that create opportunities to be filled by others. When the non-connected actors form ties, better information sharing can be generated.
In this work, the number of actual ties is interpreted as the social capital of a scientist. The social capital mechanism states that scientists possess others' degree information and that they try to forge connections with other scientists who have higher degrees than themselves. Bounded rationality is a concern, because scientists do not necessarily know the online connections of others. Scientists cannot perceive perfect information about the condition of the whole network (such as who is connected to whom and whether they are in different clusters). It is observed, however, that cyber-infrastructures typically broadcast the number of connections each user possesses, so that information is readily available. The more connections a scientist has online, the more likely they are to be selected by other scientists to form a link.

In addition to degree information, closeness centrality and betweenness centrality are candidate attributes for use in social capital mechanisms. However, scientists have limited information, and data items such as the number of central actors between clusters and the proximity of a scientist to all clusters of the network cannot be processed. Another reason for selecting degree information as the attribute is the computational ease of its calculation. Both betweenness and closeness require computationally costly algorithms; since these calculations would be performed at each time tick for each scientist, the run-time of the simulations would increase severely.

For the social capital mechanism, the evaluation process in Figure 5.2 is based on the degree information of the fellow scientists. The selection process is conducted based on $p_{ij}$:

$$p_{ij} = \frac{DC_j}{\sum_{n=1}^{N} DC_n} \tag{5.4}$$

where $N$ is the total number of scientists in the same artifact and $DC_j$ is the degree centrality of scientist $j$.
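The selection probability defined above, together with the roulette wheel algorithm mentioned for the human capital mechanism, can be sketched as follows. This is a minimal illustration rather than the model's implementation: the function names, the scientist labels, the uniform fallback when all scores are zero, and the choice to normalize over the candidate set passed in are assumptions. The same routine applies to the human capital mechanism if average expertise levels are supplied as scores instead of degree centralities.

```python
import random

def selection_probabilities(scores):
    """Normalize non-negative scores into selection probabilities.

    scores: {candidate: score}, e.g. the degree centrality of each candidate.
    """
    total = sum(scores.values())
    if total == 0:
        # Assumption: choose uniformly when no candidate has any ties yet.
        return {k: 1.0 / len(scores) for k in scores}
    return {k: s / total for k, s in scores.items()}

def roulette_wheel(probabilities, rng):
    """Pick one candidate with probability equal to its share of the wheel."""
    r = rng.random()
    cumulative = 0.0
    for candidate, p in probabilities.items():
        cumulative += p
        if r < cumulative:
            return candidate
    return candidate  # guards against floating-point round-off

# Hypothetical candidates active on the same artifact as scientist i.
degree_centrality = {"s1": 0.10, "s2": 0.30, "s3": 0.10}
probs = selection_probabilities(degree_centrality)
rng = random.Random(42)  # fixed seed so the example is repeatable
chosen = roulette_wheel(probs, rng)
print(probs, chosen)
```

Passing the random number generator explicitly keeps a run reproducible, which is convenient when many replications of the same scenario are compared.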
The reciprocal response from scientist $j$ to scientist $i$ is based on the following probability:

$$p_{ji} = \frac{DC_i}{\sum_{n=1}^{N} DC_n} \tag{5.5}$$

5.1.4 Homophily Theory

Homophily theory states that there is a stronger tendency to form social ties with others who are regarded as similar to oneself than with someone perceived as being different [1]. People are more likely to communicate with others who have similar attributes. Shared attributes may include one's interests or any other variable related to human capital. In effect, people are likely to communicate with people similar to themselves because no effort is required to build mutual understanding [2].

The implemented homophily mechanism states that scientists have information about others' interests and that they forge connections with those who share the same interests as themselves. In Figure 5.2 the evaluation is based on the interest levels of the fellow scientists:

$$P_j = \frac{\sum_{n=1}^{M} \min(I_i[n], I_j[n])}{N} \tag{5.6}$$

where $M$ is the total number of interest areas. Below, $p_{ij}$ is the probability that $i$ selects $j$ from the candidate scientists:

$$p_{ij} = \frac{P_j}{\sum_{n=1}^{N} P_n} \tag{5.7}$$

where $N$ is the total number of scientists who are active on the same artifact. With regard to reciprocity, the probability that the connection will be successful is the same:

$$p_{ji} = p_{ij} \tag{5.8}$$

[1] http://faculty.ucr.edu/~hanneman/soc157/18_Homophily.html - As of 4.07.2013
[2] http://jcmc.indiana.edu/vol11/issue4/yuan.html - As of 4.07.2013

5.1.5 Social Exchange Theory

Cognitive consistency theory focuses on individuals' perceptions of their network. People aspire to balance the attitudes in their network. Social exchange theory is the reverse of cognitive consistency: when network members are unbalanced, the theory encourages a person to exchange information and resources. The theory also conflicts with self-interest theories, because self-interest theories focus only on maximizing individual value (Monge and Contractor, 2003).
The social exchange mechanism states that a scientist has information about what others know and about the level of their expertise. Scientists attempt to connect with other experts in order to balance their own expertise level. In Figure 5.2, the evaluation process is based on the expertise gap between the fellow scientists:

$$P_j = \frac{\sum_{n=1}^{M} \max(E_j[n] - E_i[n],\, 0)}{N} \tag{5.9}$$

where $M$ is the total number of interest areas. Below, $p_{ij}$ is the probability that $i$ selects $j$ from the candidate scientists:

$$p_{ij} = \frac{P_j}{\sum_{n=1}^{N} P_n} \tag{5.10}$$

where $N$ is the total number of scientists in the same artifact to whom scientist $i$ is not connected. Regarding reciprocity, the probability that the connection will be successful is the same:

$$p_{ji} = p_{ij} \tag{5.11}$$

5.2 Learning Process

To reiterate, information flows are represented between artifacts and scientists in the base-model. In order to measure the effects of different communication preferences on diversity within the network, the communication model must assume that scientists can learn from one another through the revision of their interest and expertise levels. The influence process introduced in the base-model also describes a learning process. The following formula represents the transfer mechanism for the learning of a scientist. When a scientist is connected to a new scientist, their expertise level is revised, with a certain probability (0.80), according to the formula below:

$$E_k[i] = E_k[i] + (1 - E_k[i])\,(E_j[i] - E_k[i]) \tag{5.12}$$

where $i$ is a randomly selected area, $j$ is the scientist who is connected to, and $k$ is the scientist requesting the communication. When a connection is not realized, a randomly selected scientist will learn from a randomly selected scientist in their network. Learning only takes place when the expertise of scientist $j$ is greater than the expertise of scientist $k$.

The interests of a scientist are also influenced by other scientists. The higher the expertise gap, the more likely they are to be influenced.
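The social exchange criterion (Equation 5.9) and the learning update (Equation 5.12) can be sketched in a few lines. The sketch assumes that Equation 5.12 reads E_k[i] = E_k[i] + (1 - E_k[i]) (E_j[i] - E_k[i]), with the operators that are implied by the surrounding description. The function names and the example expertise arrays are hypothetical. The constant denominator of Equation 5.9 is omitted because it cancels when the scores are normalized in Equation 5.10.

```python
import random

def social_exchange_score(expertise_i, expertise_j):
    """Sum of the positive expertise gaps of j over i across all areas."""
    return sum(max(ej - ei, 0.0) for ei, ej in zip(expertise_i, expertise_j))

def learned_level(expertise_k, expertise_j, area):
    """New expertise of k in one area after learning from j.

    No change occurs unless j knows more than k in that area.
    """
    ek, ej = expertise_k[area], expertise_j[area]
    if ej <= ek:
        return ek
    return ek + (1.0 - ek) * (ej - ek)

def maybe_learn(expertise_k, expertise_j, rng, p_learn=0.80):
    """Apply the update in place to a randomly selected area with probability p_learn."""
    if rng.random() < p_learn:
        area = rng.randrange(len(expertise_k))
        expertise_k[area] = learned_level(expertise_k, expertise_j, area)

# Hypothetical scientists with two expertise areas each.
e_i = [0.2, 0.8]
e_j = [0.6, 0.5]
print(social_exchange_score(e_i, e_j))  # only the first area contributes
print(learned_level([0.4], [0.9], 0))   # 0.4 + 0.6 * 0.5
```

Under this reading, the revised level never exceeds the expertise of the scientist being learned from, because the gap is scaled by a factor of at most one.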
This mechanism only updates a randomly selected area $i$ of the interests of scientist $k$:

$$I_k[i] = I_j[i] \tag{5.13}$$

5.3 Social Network Metrics

As mentioned in the literature summary, innovation potential is likely to be captured by measuring various social network metrics. The relevant social network metrics that are identified as important in this study are listed below with a summary of their definitions:

• Network Density: The proportion of possible ties that exist in the network (Rowley, 1997). Density is $\frac{2|E|}{N(N-1)}$, where $|E|$ is the total number of edges in the network and $N$ is the total number of nodes. High density is an indicator of mobility, which means increased connectivity and transfer of ideas within the network.
• Diversity in the network: The number of different people connected in the network. Diversity can be measured through different skills, expertise, resources, and reputation.
• Diversity in the population: The perception of differences in others within the population over time. Difference can also be in terms of skills, expertise, resources, and reputation. How the diversity metrics are measured is defined in the following sections.
• Degree: An actor's total number of connections.
• Degree centrality of an actor: According to Wasserman (1994b), people with the greatest number of ties are the most central actors in a network. It is the proportion of possible ties that exist for an actor. In this research, the degree centrality of scientist $i$ is $DC_i = \frac{Degree_i}{N-1}$, where $N$ is the total number of nodes in the network.
• Degree Centrality of a Network: The range of variability among the degrees of the actors. The degree centrality of a network is $DC_{Network} = \frac{\sum_{i=1}^{N} (DC_{max} - DC_i)}{N-2}$, where $N$ is the total number of nodes and $DC_{max}$ is the maximum degree centrality a scientist has in the network.
• Clustering Coefficient: The number of edges in a neighborhood divided by the maximum possible number of edges that could exist in that neighborhood.
The coefficient provides information about how actors in a network tend to cluster together. For each scientist $i$, a neighborhood is defined. The proportion of possible ties that exist between neighbor nodes is measured, assuming that the neighborhood is a network itself. The clustering coefficient of the whole network is the average of the clustering coefficients of individual scientists.
• Average Path Length: The average number of steps along the shortest paths for all possible pairs of network actors (Albert and Barabási, 2002). It can also be stated as the average of the shortest paths from every scientist $i$ to every scientist $j$. Lower values indicate higher cliquishness and fewer structural holes.
• Core/Periphery Ratio: The ratio of the number of core actors to the number of periphery actors. Boyd et al. (2006) state that "individuals in a group belong to either the core, which has a high density of ties, or to the periphery, which has a low density of ties." The core/periphery ratio calculation removes less central nodes recursively until only central nodes remain in the network. The remaining nodes are counted as core nodes and the removed nodes are counted as peripheral members (Borgatti and Everett, 2000).

In this study, the high centrality and fewer structural holes argument is adopted to measure innovativeness in a network (Burt, 1995). This hypothesis promotes a moderate average path length, higher degree centrality, and moderate clustering coefficients in the network, while it may decrease the density, which fosters mobility and diffusion of ideas. Since there is high turnover of scientists within the environment, the high centrality with fewer structural holes argument is critical even though high density networks are known to be innovative. However, density is still presented to decision-makers for the comparison of different scenarios.
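Several of the metrics listed above can be computed directly from an adjacency structure. The following sketch uses only the Python standard library and follows the definitions given in this section. The graph representation (a dictionary mapping each node to the set of its neighbors), the function names, the treatment of nodes with fewer than two neighbors as having a clustering coefficient of zero, and the restriction of path lengths to reachable pairs are assumptions of the example, not statements about the simulation code.

```python
from collections import deque

def density(adj):
    """2|E| / (N (N - 1)) for an undirected graph."""
    n = len(adj)
    edges = sum(len(nb) for nb in adj.values()) / 2
    return 2 * edges / (n * (n - 1))

def degree_centrality(adj):
    """Degree_i / (N - 1) for every node i."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}

def network_degree_centrality(adj):
    """Sum of (DC_max - DC_i) over all nodes, divided by N - 2."""
    dc = degree_centrality(adj)
    dc_max = max(dc.values())
    return sum(dc_max - c for c in dc.values()) / (len(adj) - 2)

def clustering_coefficient(adj):
    """Average over nodes of the share of possible ties among neighbors."""
    total = 0.0
    for v, nb in adj.items():
        k = len(nb)
        if k < 2:
            continue  # assumption: contributes 0 to the average
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Hypothetical one-hub network: node 0 is tied to four peripheral nodes.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(density(star), network_degree_centrality(star),
      clustering_coefficient(star), average_path_length(star))
```

The one-hub example is the extreme case of the network degree centrality measure: a single actor holds every tie, so the variability among degrees is as large as it can be while the clustering coefficient is zero.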
With regard to knowledge generation, the total maturity of the artifacts, the distribution of expertise levels of scientists, and the activeness of the population are also discussed as collective creativity metrics.

5.4 Diversity and Interdisciplinarity

In this research, a variation among a type strategy is adopted (Page, 2010) in order to measure diversity in the environment. Diversity can be defined as a combination of three properties:

• Variety: It answers "how many types of things are there?" The number of different scientists, themes, or interests can be perceived as variety, depending on the metric. All else being equal, the greater the variety, the greater the diversity.
• Balance: It answers "how much of each type of thing is there?" The more even the balance, the greater the diversity. For the balance formulation, Shannon entropy (Stirling, 2007) is used, in which $p_i$ indicates the proportion of members of a given type in the total population:

$$-\sum_{i=1}^{N} p_i \ln p_i \tag{5.14}$$

• Disparity: It answers "how different from each other are the types of things?" Greater disparity is beneficial for diversity. The themes of artifacts or the interests of scientists are used as attributes to calculate disparity in the population.

Although diversity is advocated as a unique quality in policy-making, the interpretation of diversity is context dependent and relative to the intention of the decision-makers. The challenge is how to accommodate this threefold understanding and how to aggregate the three properties in one metric. For that purpose, Stirling (2007) introduces an effective heuristic that has been used in different studies (Benhamou and Peltier, 2010; Rafols and Meyer, 2010):

$$D = \sum_{ij\,(i \neq j)} (d_{ij})^{\alpha} \, (p_i \, p_j)^{\beta} \tag{5.15}$$

where $D$ is the diversity in the network or population and $d_{ij}$ is the disparity between type $i$ and type $j$, with $i \neq j$. $p_i$ and $p_j$ are the proportions of type $i$ and type $j$, and $\alpha$ and $\beta$ are the relative weightings that are assigned to each component in the formula.

There are four diversity metrics calculated in this study.
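The balance measure and Stirling's heuristic can be illustrated with a short sketch. The names are hypothetical, and three choices are assumptions of the example rather than statements about this study: disparity is taken as the share of differing bits between two binary arrays, the sum runs over ordered pairs with i different from j, and both weightings default to one.

```python
import math

def shannon_entropy(proportions):
    """Balance: -sum(p_i * ln p_i), skipping empty types."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def hamming_dissimilarity(a, b):
    """Share of positions at which two equal-length binary arrays differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def stirling_diversity(proportions, disparity, alpha=1.0, beta=1.0):
    """Sum over ordered pairs i != j of d_ij^alpha * (p_i * p_j)^beta."""
    n = len(proportions)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                weight = (proportions[i] * proportions[j]) ** beta
                total += (disparity[i][j] ** alpha) * weight
    return total

# Hypothetical population with three types described by binary interest arrays.
types = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 0, 0]]
shares = [0.5, 0.25, 0.25]
d = [[hamming_dissimilarity(a, b) for b in types] for a in types]
print(shannon_entropy(shares))
print(stirling_diversity(shares, d))
```

Setting alpha to zero makes the measure ignore disparity and respond only to variety and balance, while setting beta to zero makes it respond only to how different the types are, which is how the weightings let a decision-maker emphasize one property over another.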
For diversity in the populations of scientists and artifacts, $d_{ij}$ is defined as the dissimilarity between interest arrays and theme arrays, respectively. When measuring link diversity in the network, every node is treated as a type, where $p_i$ and $p_j$ are the proportions of ties in the network that scientist $i$ and scientist $j$ have, respectively, and $d_{ij}$ is the dissimilarity between the individual networks of scientist $i$ and scientist $j$. In order to calculate the node diversity of the network, $d_{ij}$ is calculated as the dissimilarity between scientists $i$ and $j$ based on their interest arrays, as in the diversity among the population of scientists.

Additionally, interdisciplinarity is known to be a desired output of scientific activities (Rafols and Meyer, 2010). The authors define interdisciplinarity by two aspects: diversity and coherence, where coherence is the extent to which two things are related. Coherence can be defined in terms of the network of relations; however, which types of relations are sought is the question of interest. Thagard's explanatory coherence (1989) is another method to capture different types of coherence metrics within the environment. In this research, the diversity metric created to measure diversity among the nodes, the density of the network, and the average path length are observed in order to discuss the interdisciplinarity that exists in the network.

5.5 Sensitivity Analysis

Before the sensitivity analyses are conducted, the response surface of the model is explored. Screening experiments are run with the parameters that are identified as important factors and that are expected to affect the outputs. The goal of applying response surface methodology (RSM) is to observe the effects of various parameter values on various performance metrics and to identify parameters that can be used in the sensitivity analysis. The identified parameters and their respective levels are listed in Appendix A.
5.5.1 Response Surface Analysis

RSM includes the use of statistical and mathematical techniques to develop, improve, and eventually optimize processes (Carley et al., 2004). Regarding simulation outputs, a performance metric can be called a response. Independent variables can be model inputs, which are environmental or mechanistic parameters in the model. In practice, applying response surface analysis means creating an approximation model of the response space. Hence, in this research, a first-order multiple linear regression model is developed, which is also called a main effects model.

After the simulation runs, the parameter values and the corresponding performance metrics for each scenario are imported into the IBM-SPSS tool. The all-possible-regressions method is not used, since the number of equations to be examined increases exponentially as the number of candidate variables increases. Therefore, the backward elimination method is adopted, which is known to be a good variable selection procedure when the effects of all candidate variables on the performance metrics are to be observed. Backward elimination starts with all variables in the model, and then an F-statistic is calculated for each variable as if it were the last variable to enter the model. If the p-value is greater than the desired level, that variable is removed. This procedure continues until there is no variable left to remove. The tables of IBM-SPSS results are presented in Appendix A. By examining the RSM results, the following conclusions are outlined, which can also be used to support the internal validity of the model.

• Maximum altruism level is highly effective on all metrics.
• A higher minimum cognitive burden associated with the environment (perceived as the difficulty of the problem domain) gives rise to lower activity, which in turn leads to lower density in the network.
• Minimum tension that exists in the environment is related to the elasticity of the problem domain or the transparency of the community.
It is an altering factor on the outputs. A higher lower bound for the tension has effects similar to those of the minimum cognitive burden.
• Maximum altruism, minimum cognitive burden, and minimum tension are expected to be effective on the performance metrics, because they directly affect the collective action mechanism and the activeness of the population.
• Communication preferences of the scientists are observed to have an important effect on the output. Since the purpose of phase two is to measure the effects of different communication preferences on innovation potential, communication type is coupled with various variables to observe their combined effect on the simulation outcome. The results are presented in Appendix A for the design of further sensitivity analyses in the future. Standardized β values (coefficients of variables) are observed to avoid the effect of scale. The p-values are also observed, and the p-values with less statistical significance are printed in bold characters in Appendix A.
• Forage extension and minimum time expectation are not as effective on the outputs as expected. Apparently, they only stretch or compress the timeline of the activity time series. The metrics converge to similar results at the terminating state.
• All parameter values that are associated with population dynamics (arrival rate, probability to recover, probability to leave, and expertise level to recover) are effective on the performance metrics. The reason is considered to be the mechanism that dissolves inactive scientists from the network.
• Migration threshold, an environmental parameter related to the patience of community members, is also effective on the outputs.

In light of these observations, various scenarios can be created by varying the effective parameters. In this study, in line with the intention of this research, the simulation experiments to discern more innovative communication preferences are conducted under the base-model parameter set-up.
Sensitivity analyses are summarized in the following sections.

5.5.2 Communication Preferences

First, simulation runs are conducted at the initial conditions for each communication mechanism. Figure 5.3 illustrates the emerging network topologies for the different communication mechanisms at time tick 500. Darker colors indicate higher levels of expertise.

Figure 5.3: Network Visualizations at Terminating State. (a) Mixed Theories (Mi) (b) Random Connection (Ra) (c) Human Capital (HC) (d) Social Capital (SC) (e) Homophily (Ho) (f) Social Exchange (SE)

Figure 5.4 shows error bars of density with 95% confidence intervals for the different communication mechanisms. In particular, the human capital and social capital mechanisms promote connections between central and peripheral nodes, since more central or more expert scientists are more likely to be selected to form ties. This process therefore results in more connections among different clusters in the network.

Figure 5.4: Density. Mi: Mixed Communication, Ra: Random Communication, HC: Human Capital, SC: Social Capital, Ho: Homophily, SE: Social Exchange

Figure 5.5 shows error bars of degree centrality with 95% confidence intervals for the different communication mechanisms. The random, human capital, and social capital mechanisms have significantly higher degree centrality in the network, which means that there are highly central scientists and that the variance of degree levels among scientists is higher.

Figure 5.5: Degree Centrality

Figure 5.6 shows error bars of the clustering coefficient with 95% confidence intervals for the different communication mechanisms. The random, human capital, and social capital mechanisms result in higher clustering coefficients. The clustering coefficient is also used in the small-world calculations in the following analysis.

Figure 5.6: Clustering Coefficient

Figure 5.7 shows error bars of average path length with 95% confidence intervals for the different communication mechanisms.
The social exchange and mixed communication mechanisms promote higher average path length and more dissolved network structures.

Figure 5.7: Average Path Length

Figure 5.8 shows error bars of the core/periphery ratio with 95% confidence intervals for the different communication mechanisms. The mixed communication and social exchange mechanisms result in better core/periphery structures, generating a greater number of periphery members to diffuse innovation.

Figure 5.8: Core/Periphery Ratio

Figure 5.9 and Figure 5.10 show error bars of diversity among scientists and diversity among artifacts, respectively, with 95% confidence intervals. In neither graph can it be argued that different mechanisms cause significantly different levels of diversity. The underlying reasons are the high variety and the convergence of disparity among the population to 0.50, which is thought to be caused by the binary interest and theme arrays.

Figure 5.9: Diversity Among Scientists

Figure 5.10: Diversity Among Artifacts

Figure 5.11 shows error bars of diversity among links with 95% confidence intervals. The social capital mechanism results in more diverse social networks based on links.

Figure 5.11: Diversity Among Links

Figure 5.12 shows error bars of diversity among nodes with 95% confidence intervals. The random, human capital, and social capital mechanisms produce more diverse social networks in terms of how dissimilar the connected scientists are based on their interests.

Figure 5.12: Diversity Among Nodes

Figure 5.13 shows the expertise distribution among the scientist population. In the figure, there is a peak at the bin that represents expertise levels between 0.80 and 0.90. It is caused by the expertiseToRecover value. After an expertise level of 0.80, if scientists cannot become active, they recover and leave the environment. As a result, a larger number of scientists fall in the bin between 0.80 and 0.85.
The last peak in the graph represents the scientists who are highly active and central in the network. Even though it is hard to distinguish, the differences between the bars of different colors indicate that highly expert scientists are more likely to occur in the human capital and social capital mechanisms.

Figure 5.13: Expertise Distribution. X-axis is the bin number. Each bin has a width of 0.05. Y-axis is the proportion of the population whose expertise levels fall in the respective bin.

Figure 5.14 shows the maturity distribution among the artifact population. The human capital and social capital mechanisms create artifacts that have higher levels of maturity, which is a result of the information flows between scientists and artifacts.

Figure 5.14: Maturity Distribution

To discuss how disparity emerges for the different communication theories, Figure 5.15 and Figure 5.16 show the disparity distributions of nodes and links, respectively. The homophily mechanism generates cliques and therefore creates more disparate individual networks. Disparity among the nodes is based on how dissimilar the connected nodes are. Interestingly, it indicates a binomial distribution. Since each bit of the binary arrays can be perceived as a Bernoulli trial, the number of matching or dissimilar bits between two binary arrays follows a binomial distribution in the long run, which is quite similar to a normal distribution.

Figure 5.17 and Figure 5.18 show the average number of active scientists and artifacts. The mixed communication and social capital mechanisms result in a relatively higher level of activity than the random connection and homophily mechanisms. It is not possible to distinguish the human capital and social exchange mechanisms from the others.

Figure 5.19 illustrates the small-world phenomenon.
The calculation method, which divides the clustering coefficient by the average path length, is adopted from Uzzi and Spiro (2005). Uzzi and Spiro (2005) state that the small-world phenomenon is also indicative of creativity, which spurs innovation up to a certain extent. The random, human capital, and social capital mechanisms have higher clustering and smaller average path length, which means that the network is dense and closely knit, allowing diffusion of ideas and collective productivity.

Figure 5.15: Disparity Distribution Among Nodes. X-axis is the bin number. Each bin has a width of 0.1. Y-axis is the proportion of the pairs of nodes whose disparity levels fall in the respective bin.

Figure 5.16: Disparity Distribution Among Links

Figure 5.17: Active Scientists

Figure 5.18: Active Artifacts

Figure 5.19: Small-world Phenomenon

Table 5.1 summarizes the results for the different communication preferences. For better illustration, the relative values of each metric are indicated in three levels: low, medium, and high. Bold characters are used to mark the values that promote innovation potential.

Interdisciplinarity is illustrated in Figure 5.20. Interdisciplinarity is also a desirable feature that fosters innovation. The heuristic is adopted from the study by Rafols and Meyer (2010).

In conclusion, social capital theory supports all indicators of innovativeness more than the other candidate theories.
Social capital theory promotes connections among central scientists and periphery scientists. Apparently, highly central members are also more likely to be more active and accumulate more expertise. A larger number of expert scientists in the population increases the complexity of the artifacts and leads to more mature artifacts in the environment, which means more knowledge creation. Highly central scientists also broadcast knowledge to periphery members and spur diversity among the network members. The high centrality, fewer structural holes hypothesis is supported more by social capital theory.

Table 5.1: Summary of the Sensitivity Analysis
(Mi = mixed, Ra = random, HC = human capital, SC = social capital, Ho = homophily, SE = social exchange)

Criteria                 Mi    Ra    HC    SC    Ho    SE
Density                  Low   High  High  High  Med   Low
Degree Centrality        Low   High  High  High  Med   Low
Clustering Coefficient   Low   High  High  High  Med   Low
Avg Path                 High  Low   Low   Low   Med   High
Core/Periphery           Low   High  Med   Med   Med   Low
DiversityL               Low   Med   Med   High  Med   Low
DiversityN               Low   High  High  High  Med   Low
Activity                 High  Low   Med   High  Low   Med
Expertise                Med   Med   High  High  Med   Med
Maturity                 Med   Med   High  High  Med   Med
Small World              Low   High  High  High  Med   Low

Figure 5.20: Node Diversity vs Network Coherence

Additionally, core/periphery structures are desirable to promote diffusion of innovative ideas. Mixed connections and social exchange favor a balance of popularity among the scientists, which also results in lower degree centrality. Therefore, they have a small number of core members while the number of periphery members is high, and these members usually have a moderate level of expertise.

The implemented simulation model can be used to conduct further sensitivity analysis under other environmental conditions, and the different communication preferences can be tested along with different parameter set-ups (more altruistic, more difficult, more elasticity, etc.). In the following chapter, a search algorithm (genetic algorithm) is introduced and more robust communication landscapes are explored in the scenario space.
The goal is to capture more robust parameter set-ups with an evolutionary algorithm that enables intelligent search and avoids an exhaustive parameter sweep.

Chapter 6
ROBUSTNESS IN GLOBAL PARTICIPATORY SCIENCE

Developing innovation coordination mechanisms that are robust and resilient under environmental uncertainty is critical for sustained innovation. For that reason, robustness is valuable to explore, as opposed to only searching for an optimal behavior in terms of a fitness function in an unchanging environment. The motivation to explore the scenario space is to gain deeper insight and generalize from the observed behavior. Exploration vs. exploitation is a well-known trade-off (March, 1991). An important issue is to decide when to stop exploring and when to start exploiting the parameter space to discover robust system configurations. The aim in this chapter is to exploit the possible scenario space to discover robust landscapes by means of an evolutionary search algorithm. To measure robustness, the fitness function is defined in terms of the degree of variability among innovation potential metrics.

6.1 Exploratory Modeling

There is insufficient knowledge and a high level of uncertainty about the modeled target system (Global Participatory Science). The estimates of initial and boundary conditions or the nonlinearities in the models can cause even small levels of initial uncertainty to generate remarkable levels of uncertainty in the results. The critical question to be addressed is: "What is the appropriate method for using the model considering its limitations?" In this chapter, an exploratory modeling approach is adopted, which is defined as a series of computational experiments to explore the implications of mechanisms (e.g., communication mechanisms) and parameter changes (Bankes, 1993).

Exploratory modeling can involve a search for key configurations of the system (Bankes, 1992).
Initially, boundaries of the plausible scenario space (possible parameter ranges) and an ensemble of plausible scenarios are generated. The process of selecting which ensemble of the plausible scenarios to run depends on the question of interest. Through the search, an output metric is observed. In this research, the output metric is defined as an indicator of robustness (related to the purpose of the study). Moreover, a search strategy is needed in exploratory modeling. In this work, a genetic algorithm (GA) is implemented to evolve the regions of parameter values where more robust (less variable) social network structures are observed.

The ability to discern patterns from output metrics depends on being able to define a topology (set of model configurations) in the ensemble such that similar parameter ranges have similar outcomes (Bankes, 1993). In this research, the search is guided by a heuristic (specifically a GA) and cannot guarantee an absolute optimal scenario. Thus, the evolution of the ensemble of scenarios at different generations is recorded. Furthermore, the ensemble is interactively tested against different metamorphic relations to bound the plausible scenario space (by interventions). This search can be called human-mediated interactive robustness analysis. The ensemble of scenarios is revised over time, resulting in an evolving scenario space. In the following section, the use of genetic algorithms in exploring the plausible scenario space is discussed.

6.1.1 The Use of Genetic Algorithms

In genetic algorithms, through randomness associated with selection and crossover, alternatives with desired outputs mate to generate new ensemble members. A scenario with a better output metric (with respect to a fitness function) has a higher probability of being selected. Over iterated generations, an increasingly desired behavior is expected to accumulate (Atmar, 1994).

The choice of a GA is motivated by simulation optimization studies.
Simulation optimization can be defined as the process of finding the most effective and optimal input values among all possibilities without explicitly evaluating each possibility (Carson, 1997). However, there exist other techniques that are used in simulation optimization: "Gradient-based search, response surface method, stochastic approximation, and Ranking and Selection etc." (Carson, 1997; April et al., 2003). Most available tools use evolutionary algorithms to optimize the inputs of a given model (Fu and Glover, 2005). Artificial neural network (ANN) simulation is also a well-published method used for training a model with a given set of data (Sexton et al., 1999). Fu and Glover (2005) state that there is no clear answer for why the use of evolutionary algorithms is dominant, but they point out the benefits, such as the ability to explore the entire state space and robust properties in practice.

Genetic algorithms are used in various simulation studies as parameter optimization tools. For example, in GENOSIM, the authors manipulate the values of control parameters in a traffic micro-simulation and globally search for an optimal set of values that minimizes the gap between real field data and the simulation output data (Ma and Abdulhai, 2002). Similarly, Zou (2012) explores the parameter space with a genetic algorithm that minimizes the discrepancy between real-world community data (also gathered from OBO) and various social network metrics. In a dynamical system like a social network, Zou (2012) builds the fitness function based on the point estimators at the termination state of the simulation. This adds credibility to the models by showing that the model is capable of creating snapshots of some real-world networks at certain states, but it cannot assert that the simulation arrives at the observed states following the same transient events.
Bäck and Schwefel (1993) compare different evolutionary algorithms and promote the use of GA by stressing its ability to assign a nonzero selection probability to each individual (called preservative selection or proportional selection). Additionally, Paul and Chanev (1998) address the appropriateness of GA for simulation optimization. Unlike this research, Paul and Chanev (1998) simulate an existing steady-state Steelworks model. For each scenario, they run only a single replication for a long period of time, which they believe is long enough to gather sufficient statistics at the steady state of the system.

Nazzal et al. (2011) state that when the goal is to optimize a stochastic system with high variability, the performance of GA can be inadequate. The authors stress that it is important to consider variance when evaluating the alternative scenarios produced by GA over generations. Hence, Nazzal et al. (2011) propose a methodology that incorporates an indifference-zone (IZ) ranking and selection procedure under common random numbers (CRN). The methodology also aims to reduce the required number of replications for each scenario. In contrast, Pierreval and Tautou (1997) note that when a simulation model is considered, as a common practice in the GA selection operation, scenarios are compared based on the differences between mean values of an output metric (or fitness value). If the proportional selection operation is used, the comparison tests (e.g., sensitivity analysis) between scenarios are less important and the scenarios can evolve based on the mean values of the output metric (Pierreval and Tautou, 1997).

In general, genetic algorithms are used to determine discrete input values (Liepins and Hilliard, 1989). Genetic algorithms mimic the evolutionary process of biological systems to create new generations, guiding the search towards an optimal solution (Swisher et al., 2000).
Genetic algorithms cannot guarantee optimality, but the intention is to improve the ensemble and, if possible, derive robust scenarios under environmental uncertainty. As discussed in Chapter 4, for the sake of generalization, the aim of this research is not to find the optimal parameter set that mimics the real-world data, but rather to explore the scenario space to measure the robustness of different communication theories under different conditions and, if possible, to discover diverse scenarios that are more robust than the others. In order to limit the search space or understand the response surface of the model, metamorphic testing is introduced as a candidate process. In the following section, metamorphic testing is delineated and characterized.

6.1.2 Metamorphic Testing

The metamorphic testing approach was introduced by Chen et al. (1998) to address the problem of testing programs with no oracle. Metamorphic testing is a testing method that is based on the expected properties of the application or model. The properties, called metamorphic relations (MRs), are functions that define the relationships between program, model, or function input and the expected changes in output. Specifically, those relationships can provide means to define V&V test cases.

Suppose a case in which x is a test input and produces output f(x). The metamorphic properties of the function f can be used to develop a transformation function, which when applied to the test input produces x′. This then enables prediction of the expected output f(x′) based on the known f(x). If the outcome f(x′) is consistent with the expectation, it is not necessarily correct. However, violation of the metamorphic property indicates that one (or both) of the outputs, f(x) or f(x′), is wrong. So, though it may not be possible to know with a single test whether an output is correct, it can be determined whether an output is incorrect.
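The mechanics of checking a metamorphic relation can be sketched in a few lines. The sketch below is an illustration only and is not taken from this work: the sine function with the relation sin(x) = sin(π − x) is chosen here as a familiar stand-in for a program without an oracle, and the names `check_metamorphic_relation`, `sine_mr_holds`, and `faulty_sine` are hypothetical.

```python
import math


def check_metamorphic_relation(f, x, transform, predict, tol=1e-9):
    """Return True if f(x') matches the output predicted from the known f(x)."""
    expected = predict(f(x))    # expected f(x'), derived from f(x)
    actual = f(transform(x))    # observed f(x')
    return math.isclose(actual, expected, rel_tol=tol, abs_tol=tol)


def sine_mr_holds(f, x):
    """MR assumed for illustration: x' = pi - x leaves the sine unchanged."""
    return check_metamorphic_relation(
        f, x, transform=lambda v: math.pi - v, predict=lambda y: y
    )


def faulty_sine(x):
    """Hypothetical faulty implementation: a truncated series that is only
    reasonable near zero, so it breaks the relation for larger inputs."""
    return x - x**3 / 6


print(sine_mr_holds(math.sin, 0.7))     # relation holds for the library sine
print(sine_mr_holds(faulty_sine, 0.7))  # relation is violated by the faulty one
```

A passing check does not prove that `f` is correct; it only fails to expose a fault, which is the asymmetry described above.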
Metamorphic testing has been used at the function (Guderlei and Mayer, 2007), application (Xie et al., 2011), and simulation (Ding et al., 2011; Pullum and Ozmen, 2012) levels.

As an example to explain metamorphic relations, consider a function that calculates the standard deviation of a set of numbers. For some transformations of the input set, we expect no change in the result, e.g., if the order of the members of the input set is permuted or if each member of the input set is multiplied by -1. Other transformations of the input set will predictably alter the output, e.g., multiplying each member of the input set by 2 will result in a standard deviation twice that of the original input set.

To date, there has been only a single published effort to apply metamorphic testing to agent-based models (Murphy et al., 2011). Murphy et al. (2011) investigate the use of metamorphic testing on a simulation tool of a hospital in the healthcare domain. ABMs are often used as an exploratory tool for the discovery of unknown regularities. Prior studies on metamorphic relations test whether the implementation is correct, while in this work, metamorphic relations are intended to be used to identify the boundaries of the model response surface. Instead of classifying the model as wrong when it behaves differently from the expectation, the corresponding response area can be flagged to be eliminated from the analysis. Metamorphic relations can be updated iteratively to make sure that the analysis is conducted on the scenario space of interest. In this research, metamorphic testing is only used to test expected behavior and derive new expected behaviors for future use.

6.2 Design Decisions Relative to the GA

In this section, the components of the genetic algorithm module and the implementation of the algorithm are explained.
It is worth mentioning that the main intention is not to create a novel meta-heuristic, but to synthesize the existing methods for better search within a computationally feasible time frame. In genetic algorithms, variation is achieved via various operators (e.g., combination, mutation, crossover) and the selective pressure is based on the fitness function. The relationships between the context of this study and biological systems are listed below:

- A scientific community of scientists and its traits (phenotype) can be interpreted as a member of the population.
- The population consists of different scenarios/communities. In the algorithm, different communication theories are metaphorically perceived as different species of the population.
- Each community has a scalar measure that identifies its fitness, which depends on the purpose of the study. In this research, fitness is defined in terms of the aggregation of variabilities that are observed in different metrics.
- The gene is the genotype. It is a vector that retains the parameter values for each community.

Regarding the simulation optimization studies, a widely observed approach in the literature is to run each scenario for a sufficient number of replications (using CRN) and to compare mean values of the outputs while conducting proportional selection. When the desired termination condition is met, the GA provides a single best scenario (Tompkins and Azadivar, 1995; Bäck and Schwefel, 1993; Faccenda and Tenga, 1992). However, GAs are random algorithms, and if they are asked to solve exactly the same problem twice, they are likely to come up with two different scenarios (if the exact optimum is not found).[1] If the goal is to generate a set of scenarios (a diverse portfolio of solutions), one method is to re-initiate the GA multiple times. Likewise, Zeigler et al. (1997) provide a high-performance environment for modeling large-scale systems at high resolution to enable parallel GA runs.
Parallel GA modules are initiated to identify possible parameter configurations using CRN for each GA module. Corresponding to the goal of exploratory modeling, these parameter configurations can serve as a basis for drawing general conclusions about the system of interest. Zeigler et al. (1997) conclude that parameter estimations of simulation-based studies for large-scale models must await new-generation computers. Even though it has been more than a decade since this paper was published, today's desktop computers have only reached Gflops of computational power, on average. Along with the advancements in hardware, simulation studies have become more popular, and higher-resolution, more complex simulation models are being developed that still need High Performance Computing (HPC).

[1] http://www.burns-stat.com/documents/tutorials/an-introduction-to-genetic-algorithms/ (as of 4.07.2013)

Similarly, in the analysis of exploratory modeling, the challenge is the problem of deciding on the limited number of experiments that can be run practically (consuming a reasonable amount of computational resources) to best inform the question of interest. The sampling strategy (the number of replications) involves human judgment. Bankes (1993) states that:

"Consequently, the result of an exploratory analysis will typically not be a mathematically rigorous answer, but rather an imperfect image of the complete ensemble that improves gradually as more cases are run. Given a fixed analytic budget (in dollars, people, or time), the analysis must provide the most useful results possible based on what is known about problem on hand."

Given the complexity of the socio-communication model developed in this research, an exhaustive parameter sweep across all plausible scenarios is not possible. The sampling strategy (30 replications) is intended to provide reasonable support for converging to robust configurations from a limited number of computational experiments.
The search aims to maintain diversity in the population. Diverse configurations of parameters are explored, and the intent is to give potentially robust scenarios a likelihood of being chosen (in proportional selection). In order to add diversity to the ensemble and avoid premature convergence, randomness is perceived as a non-parametric environmental condition that cannot be controlled. At each generation, the set of random numbers is changed. This approach is motivated by two studies: (1) the search method suggested in (Dibble, 2006) and (2) the explicit separation of environmental conditions and model parameters (Mitchell and Yilmaz, 2009). Dibble (2006) explores the parameter space by running a small number of replications (2-3) per scenario, then suggests a search across the sets of random number seeds to test worst-case combinations of stochastic events. Mitchell and Yilmaz (2009) also run each scenario for a small number of replications and observe the adaptation of the converged scenario space against non-parameterized environmental conditions modeled by a separate simulator that emulates the environment. Random or purposeful changes in the environment are fed back into the GA through explicitly separated environmental parameters. This approach assures that the competing GA solutions are in sync with the evolving environmental conditions under which they are competing. However, considering the complexity of the model, the design of the robustness metric, and the implementation of the GA, this approach is likely to generate variable results rather than convergence to a single best scenario. Therefore, narrowing the focus to determining the best scenarios is postponed to future analysis.
A similar technique is delineated by Dibble (2006):

"Using supervisory genetic algorithms to discover highly effective treatments or to search for exceptional or surprising simulation outcomes has the potential to profoundly enhance our ability to make the most effective use of limited computational and analytical resources. It permits us to discover and test incisive empirical insights, effective normative designs or interventions, and surprising heuristic insights. Once such treatments or outcomes have been identified by the genetic algorithm, subsequent ordinary batches of simulations can be carefully targeted in order to evaluate the accuracy, uncertainty, risk, and inference power of results obtained from any well-specified agent-based simulation model."

When the decision maker (analyst) decides that the GA exploration is complete, the task is to select a portfolio of scenarios that can be a basis for policy-making decisions. The key parameter configurations can be identified by observing the fittest scenarios over generations. In this research, the portfolio is selected from the fittest scenarios to which the GA converges. The decision is qualitative and aims to identify different level sets for parameters. In like manner, Bankes (2002) states that a portfolio of models or level sets for parameter configurations that behave in a reasonable range provides more information than does a single optimal configuration.

Dibble (2006) indicates that the greater economy in searching for key parameter values can release computational resources that can be dedicated to simulating each candidate key configuration for a sufficient number of replications to test whether there are statistically significant differences among them. Also, existing techniques of decision support may need static recommendations, such as providing a single scenario as an answer (Bankes et al., 2002).
Therefore, in this research, for illustration purposes, further batches of runs are conducted to evaluate the identified portfolio (of scenarios) in terms of uncertainty in the robustness results. In the following sections, components of the GA are described in detail.

6.2.1 Encoding and Decoding of the Parameter Space

Encoding refers to the mechanism of mapping the parameters to genes, so that the evolution of the gene in the parameter space can be realized. Genes can be represented as binary strings or real numbers. Allowing continuous values in bits results in an exhaustive search, so it is essential to decide on what values the bits can take and how many bits are needed to represent the plausible scenario space. In this research, if the parameter value is a floating-point number, the precision is set up as increments so that a fixed number of values is feasible. Then combinations of binary bits are used to represent those parameter values.

Decoding is the reverse process of encoding. Decoding partitions the gene into its parts so that the corresponding parameter values can be used in simulation runs. Therefore, the fitness function value can be calculated for every set of replications for different genes/scenarios. Figure 6.1 illustrates two kinds of parameter values and how they are represented in a gene.

Figure 6.1: Encoding and Decoding of the Parameters.

The initial search is conducted on the scenario space that is bounded by the experience of the modeler of this research. The gene consists of 22 bits that describe the parameter set-up. The parameter values are identified in two classes: (1) integers and (2) floats. Table 6.1 lists the bits and value ranges to interpret in the decoding process.
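The decoding step can be sketched as follows, with the bit positions and value lists taken from Table 6.1. This is a minimal illustration and not the implementation used in this research: the field names, the `decode` helper, the representation of the gene as a Python list, and the most-significant-bit-first reading of each binary group are all assumptions.

```python
# Positions 1-5 hold integer values directly (Table 6.1).
INTEGER_FIELDS = [
    "communication_preference",  # [0, 5]
    "max_time_expectation",      # [2, 10]
    "foraging_mechanism",        # [1, 2]
    "migration_threshold",       # [2, 5]
    "max_scope",                 # [1, 10]
]

# Positions 6-22 are binary bits; each group indexes a list of values.
# (field name, number of bits, values in the order of the binary codes)
BINARY_FIELDS = [
    ("max_altruism", 2, [0.1, 0.3, 0.5, 0.9]),
    ("min_tension", 3, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]),
    ("min_burden", 3, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]),
    ("mutation_rate", 1, [0.01, 0.05]),
    ("prob_recover", 2, [0.1, 0.2, 0.3, 0.4]),
    ("prob_leave", 2, [0.1, 0.2, 0.3, 0.4]),
    ("artifact_creation_prob", 2, [0.05, 0.1, 0.2, 0.3]),
    ("arrival_rate", 2, [0.05, 0.1, 0.2, 0.3]),
]

GENE_LENGTH = len(INTEGER_FIELDS) + sum(width for _, width, _ in BINARY_FIELDS)


def decode(gene):
    """Map a 22-position gene to a dictionary of parameter values."""
    if len(gene) != GENE_LENGTH:
        raise ValueError(f"expected {GENE_LENGTH} positions, got {len(gene)}")
    params = dict(zip(INTEGER_FIELDS, gene[: len(INTEGER_FIELDS)]))
    position = len(INTEGER_FIELDS)
    for name, width, values in BINARY_FIELDS:
        bits = gene[position : position + width]
        # Assumption: the first bit of a group is the most significant one.
        index = int("".join(str(bit) for bit in bits), 2)
        params[name] = values[index]
        position += width
    return params


example_gene = [3, 5, 2, 4, 7, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
print(decode(example_gene))
```

Because every binary group has exactly as many codes as it has listed values, any combination of bits decodes to a valid scenario.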
Table 6.1: The Bit Values in Initial Genome

Bits    Parameter                      Code                Values/Range
1       Communication Preference       Integer             [0,5]
2       Maximum Time Expectation       Integer             [2,10]
3       Foraging Mechanism             Integer             [1,2]
4       Migration Threshold            Integer             [2,5]
5       Maximum Scope                  Integer             [1,10]
6-7     Maximum Altruism               {00,01,10,11}       {0.1, 0.3, 0.5, 0.9}
8-10    Minimum Tension                {000,001,...,111}   {0.1,0.2,...,0.8}
11-13   Minimum Burden                 {000,001,...,111}   {0.1,0.2,...,0.8}
14      Mutation Rate                  {0,1}               {0.01,0.05}
15-16   Probability to Recover         {00,01,10,11}       {0.1,0.2,0.3,0.4}
17-18   Probability to Leave           {00,01,10,11}       {0.1,0.2,0.3,0.4}
19-20   Artifact Creation Probability  {00,01,10,11}       {0.05,0.1,0.2,0.3}
21-22   Arrival Rate                   {00,01,10,11}       {0.05,0.1,0.2,0.3}

6.2.2 Activity Flow of the GA Module

Figure 6.2 represents the activity-flow specification of the genetic algorithm module. The population is randomly initialized. For the bits that have an integer value, the value is drawn uniformly from the list of possible values. If the bits are represented as binary numbers, the values are uniformly assigned to each bit. As a note, all possible combinations of the bits represent a number, so all permutations are valid scenarios.

6.2.3 Metamorphic Relations

Based on the analyses that are conducted in previous chapters and the observations on the behavior of the model, initial metamorphic relations are identified. By identifying the metamorphic relations, the aim is to understand the response surface of the model. Since the model is a dynamical system, bounding the search space in a way consistent with the goal of the research is critical. Even though individual values for each parameter are valid, the combined effect of different parameters can steer the search toward undesired regions. In particular, the parameters that are incorporated in the collective action formula can have this impact. Below, some initial metamorphic relations are listed to test the effectiveness of the method.
- Initially, cognitive burden, tension, and altruism are uniformly distributed between 0 and minimum or maximum values. Tension starts at 1 and gradually decreases with new contributions. Considering the collective action formula, in order to have an active initial population, an MR about the expected values of selected parameters is identified as follows:

\[ \frac{\mathit{minCognitiveBurden}}{2} < \frac{\mathit{maxAltruism}}{2} + 0.25 \qquad (6.1) \]

  where the values are divided by 2 to find the expected value, and 0.25 is the expected value of the initial benefits. When this relation is violated, the outputs are expected to be highly variable, because the cost is higher than the initial benefits. Therefore, it is harder for a scientist to become active, resulting in fizzling activity.

- If the arrival rate and the artifact creation rate are low while the probability to recover is high, the scenario might lead to a low level of activity and a small population size, resulting in fizzling activity. In those scenarios, the activity level is highly dependent on the initial conditions.

Figure 6.2: Activity Flow Diagram of the Genetic Algorithm

These metamorphic relations are initially identified for test purposes, and the evolution of the ensemble is monitored against those conditions. In the initial generations, rather than bounding the plausible scenario space by excluding the regions that violate the MRs, the results are recorded. The results are used to verify whether violations of the identified MRs are observed and under what conditions they are observed, so that, in future replications, refined MRs can be derived from the outputs and, if desired, the scenario space can be bounded.

6.2.4 Fitness Function

In genetic algorithms, as in biology, there is a selection process by which the fittest parameter value configurations are retained in the population. The fitness is quantitatively represented as a function that fuses five output values.
As aforementioned, identifying possible robust landscapes and determining how the communication theories behave under a diverse range of scenarios are important. The output values that are under consideration are related to the innovation potential metrics discussed in Chapter 5. The Core/Periphery ratio is excluded from the analysis due to its high level of variation across the majority of the scenario space. Including Core/Periphery in the calculation of the fitness function would cause bias due to the number of connections that is assigned to identify core members in the method of calculation. Further information about the activity flow of the calculation method for the Core/Periphery ratio can be found in Appendix A. The metrics that are used in the fitness function are:

- Density
- Degree Centrality
- Clustering Coefficient
- Average Path Length
- Diversity Among Nodes

The functional robustness definition by Krakauer (2006) is interpreted to define the robustness metric. This kind of robustness can be achieved by invariance of the output metrics. Thus, the fitness function is described as the minimization of aggregated variance measures (for each relative social network metric). The implemented fitness function has two goals. The first goal is to minimize the variability among various metrics to discover a more robust scenario space. The second goal is to measure average variability when comparing different communication preferences. The fitness function to minimize is defined by Equation 6.2 below:

\[ f_j = \frac{\sum_{i=1}^{N} \frac{MAE_i}{Mean_i}}{N} \qquad (6.2) \]

where j is a gene (i.e., scenario), N is the number of output metrics under consideration, and MAE_i is the mean absolute error for the ith element of the output vector. In the analysis, MAE is used because the deviation among the data points for each time tick in the time-series data is measured, and the aim is to diminish the effects of outliers on the fitness function.
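Equation 6.2 can be illustrated with a short sketch. It is not the implementation used in this research: the function names and the dictionary input are hypothetical, and MAE is taken here to be the mean absolute deviation of the pooled observations (all replications, last time ticks) from their own mean, which is one reading of the text rather than a confirmed definition.

```python
from statistics import fmean


def normalized_mae(samples):
    """MAE_i / Mean_i for one metric.

    Assumption: MAE is the mean absolute deviation around the sample mean.
    The mean must be nonzero, otherwise the ratio is undefined.
    """
    mean = fmean(samples)
    mae = fmean([abs(value - mean) for value in samples])
    return mae / mean


def fitness(metric_samples):
    """Equation 6.2: average of MAE_i / Mean_i over the N output metrics.

    metric_samples maps a metric name to its pooled observations.
    Lower values indicate less variable (more robust) scenarios.
    """
    ratios = [normalized_mae(samples) for samples in metric_samples.values()]
    return sum(ratios) / len(ratios)


# Hypothetical observations for two of the five metrics.
observations = {
    "density": [0.5, 0.5, 0.5, 0.5],     # no variability
    "avg_path_length": [2.0, 4.0],       # deviates by 1 around a mean of 3
}
print(fitness(observations))
```

Dividing each MAE by the mean of its metric keeps a metric with large raw values, such as average path length, from dominating metrics that are bounded between 0 and 1.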
Different output metrics may have MAE values at different scales, so MAE is divided by the mean value of each metric to normalize. As a note, MAE and mean values are calculated based on the last N time ticks of the time-series data for each metric. The formula used in calculating the mean of density is represented in Equation 6.3.

\[ \text{Mean of Density} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{R} Density_{i,j}}{R \cdot N} \qquad (6.3) \]

where R is the total number of replications per scenario and N is the number of time ticks that are considered at the end of each time series. The same logic is used to calculate the means and MAEs for each metric.

6.2.5 Selection

The selection operator is built to select the parent scenarios, which are used to reproduce the offspring population that replaces the actual population in the following generation. The selection operator is applied after each scenario (the population) is run for 30 replications. After the runs are completed, the mean fitness function value for each scenario is calculated. Subsequently, the probability for each scenario to be selected as a mate is determined by the following equations.

\[ p_j = \frac{1/f_j}{\sum_{i=1}^{N} 1/f_i} \qquad (6.4) \]

\[ P_j = \sum_{i=1}^{j} p_i \qquad (6.5) \]

where p_j is the probability for gene j to be selected, P_j is the cumulative probability that is used in the roulette wheel algorithm, and f_j is the fitness of gene j. N is the total number of genes, i.e., the fittest members of each communication theory added to the whole population.

The fittest population member of each communication theory is included in the selection pool twice, since parameter set-ups might behave differently under different communication mechanisms. In order to keep track of the fittest population members for each communication theory and to avoid dominance of the parameter set-ups that behave better under certain theories, selection is done among the fittest genes for each theory and the existing population.
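Equations 6.4 and 6.5 and the roulette wheel draw can be sketched as below. The function names are hypothetical and the sketch assumes strictly positive fitness values; since the fitness is minimized, the inverse of the fitness is used as the selection weight, as in Equation 6.4.

```python
import random
from itertools import accumulate


def selection_probabilities(fitness_values):
    """Equation 6.4: p_j proportional to 1 / f_j (lower fitness is better)."""
    inverse = [1.0 / f for f in fitness_values]
    total = sum(inverse)
    return [weight / total for weight in inverse]


def roulette_select(fitness_values, rng):
    """Draw one gene index using the cumulative probabilities of Equation 6.5."""
    cumulative = list(accumulate(selection_probabilities(fitness_values)))
    r = rng.random()  # uniform draw in [0, 1)
    for j, upper in enumerate(cumulative):
        if r <= upper:
            return j
    return len(cumulative) - 1  # guard against floating-point round-off


rng = random.Random(42)  # seeded only to make the illustration repeatable
pool = [0.10, 0.20, 0.40]  # hypothetical fitness values of three genes
print(selection_probabilities(pool))
print(roulette_select(pool, rng))
```

Every gene keeps a nonzero chance of being drawn, which is the property of proportional selection stressed earlier in this chapter.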
This mechanism adds scaling to the proportional selection algorithm; however, as a disadvantage, it may alter genetic drift, evolving the population members to be identical. That is why two crossover operators are introduced in the next section. The selection of gene j is based on the following condition:

\[ P_j \geq R > P_{j-1} \qquad (6.6) \]

where R is a random number drawn between 0 and 1. If the condition is satisfied, then gene j is selected to mate. This process is repeated until the operator finds the second mate, which must be different from the first mate. As described, the proportional selection algorithm is adopted among other techniques such as "Rank Based Selection," "Tournament Selection," and "Truncation Selection" (Dréo et al., 2005).

6.2.6 Crossover and Mutation

The reproduction of genes is realized by the crossover and mutation operators. A total of N − 1 new offspring are reproduced after N − 1 couples are determined by tournament selection. Within each couple, the gene with the better fitness value is determined and the offspring is created with a genome identical to that of the selected gene. Then one-point crossover is applied twice. The first application selects a random bit among the binary bits and sets that bit to the mate's value. The second sets a randomly selected bit among the integer bits (bits 2, 3, 4, 5) to the average of both parties' bit values. The process is stochastic, so crossover of the same distinct parents can generate different offspring. This process is repeated until N − 1 offspring are reproduced.

If a randomly generated number is less than or equal to a certain probability, e.g., 1%, then a randomly selected binary bit is flipped. Mutation is a unary operator and, like the crossover operators, is realized on a single bit. The first bit is not changed by the crossover and mutation operators, because it represents the communication preference and is intended to stay the same across all generations.
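The reproduction step described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this research: the function name `make_offspring` is hypothetical, genes are assumed to be 22-element lists laid out as in Table 6.1, and a fractional average of two integer bits is rounded down, a detail the text does not specify.

```python
import random

# 0-based positions; position 0 (communication preference) is never changed.
INTEGER_POSITIONS = [1, 2, 3, 4]        # bits 2-5
BINARY_POSITIONS = list(range(5, 22))   # bits 6-22


def make_offspring(parent_a, fit_a, parent_b, fit_b, rng, mutation_prob=0.01):
    """Create one offspring from a couple; lower fitness is better."""
    if fit_a <= fit_b:
        better, mate = parent_a, parent_b
    else:
        better, mate = parent_b, parent_a
    child = list(better)  # start as a copy of the fitter parent

    # Crossover 1: copy one randomly chosen binary bit from the mate.
    b = rng.choice(BINARY_POSITIONS)
    child[b] = mate[b]

    # Crossover 2: set one randomly chosen integer bit to the parents' average.
    i = rng.choice(INTEGER_POSITIONS)
    child[i] = (better[i] + mate[i]) // 2  # assumption: round down

    # Mutation: with a small probability, flip one random binary bit.
    if rng.random() <= mutation_prob:
        m = rng.choice(BINARY_POSITIONS)
        child[m] = 1 - child[m]
    return child


rng = random.Random(7)  # seeded only to make the illustration repeatable
parent_a = [3, 5, 2, 4, 7] + [0] * 17
parent_b = [1, 9, 1, 2, 3] + [1] * 17
print(make_offspring(parent_a, 0.12, parent_b, 0.30, rng))
```

Because the average of two in-range integers stays in range and binary bits only ever take the values 0 and 1, every offspring produced this way is still a valid scenario.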
As aforementioned, metaphorically, communication theories can be perceived as different species, while the other evolving parameters are traits that are passed between species.

6.2.7 Culling

As the last step, all population members except the fittest member in terms of the fitness function value are replaced by the offspring. This is generational replacement with an elitist approach. Elitism consists of preserving at least one of the individuals with the best fitness from one generation to another (Dréo et al., 2005). The intention in this research is to avoid drifting away from the optimum/sub-optimum areas easily by giving the best gene more chances for reproduction. Following the culling, the next generation runs are started. Steady-state replacement is not adopted because, due to the computational complexity and in order to promote diversity, the GA needs to disturb the scenario space excessively rather than evolve gradually.

6.3 Analysis

During the initial analysis, the ensemble consists of 30 scenarios/population members (5 scenarios for each communication mechanism). Before the first intervention, the ensemble is evolved for ten generations. Then metamorphic relations and the fittest scenarios are evaluated. It is observed that maximum altruism dominated the other variables, and genes converged to similar scenarios, which always have the highest value of maximum altruism (0.9).

In the early generations, some scenarios are observed to violate the initial MRs, but they result in highly variable landscapes, so they disappeared from the ensemble in later generations. Also, in some scenarios, it is identified that the collective action formula might be redundant for the majority of scientists. In those scenarios, a high level of activity is expected to be observed. So, another metamorphic relation is added to the list, which is represented below:

\[ \left( \frac{1 + \mathit{MinTension}}{2} \right)^{2} > \frac{\mathit{maxAltruism}}{2} \qquad (6.7) \]

This MR is included to mitigate the dominant effect of maximum altruism on the activeness.
Additionally, the maximum altruism value is set to 0.5 to avoid the region in which the second metamorphic relation is violated. Moreover, it is observed that both the foraging mechanism and the mutation rate stabilized (Foraging = 2 and Mutation = 0.01). Those parameters that do not have an impact on the fitness function are eliminated. As a last step, two more parameters (foraging extension and expertise to recover) are introduced on the bits that are idle after the elimination. Then the fittest scenario is kept, the kick the ball principle is applied by varying the parameter values of the other population members, and new generations are run. In the following sections, the results are delineated.

6.3.1 Results

In Figure 6.3, the evolution of the average fitness value for each communication mechanism, along with the mean fitness of the whole population, is represented. It is observed that the GA module improves the results over generations, which verifies the implemented algorithm. Note that peaks are observed at generation 11 and generation 16 as a result of the implemented kick the ball principle.

As a first observable, mean robustness for each communication mechanism is presented. This representation is inspired by the use of ensembles in the machine learning domain (Dietterich, 2000). Just as the average behavior of model ensembles is compared, the average behavior of the sub-populations (one for each communication mechanism) is compared. The goal is to measure how each communication mechanism behaves under various scenarios, as illustrated in Figure 6.4.

Figure 6.3: Average Fitness Values over Generations

The minimum fitness value that is observed over generations belongs to a scenario with social capital theory. However, considering the average performance, human capital theory behaves more robustly than the others. Given the 95% confidence intervals presented in Figure 6.4 and the small sample size, it is not possible to prefer one theory over another in terms of robustness.
Further runs are recommended. Additionally, accepting that the ensemble improves in terms of robustness over generations, examining the distinct parameter values can lead to conclusions on the converged scenario space and the impacts of variability. It is a scenario space that the GA module discovers. The ensemble could converge to a single scenario or to diverse scenarios that provide a basis for further exploration. In the analysis, the fittest scenarios converged to an identical parameter set-up with different communication mechanisms. Figure 6.5 represents the evolution of the fittest member in the ensemble for each integer parameter.

Figure 6.4: Minimum and Average Fitness Values for each Theory
(a) Average Fitness Value for each Theory
(b) Minimum Fitness Value for each Theory
Mi: Mixed Communication, Ra: Random Communication, HC: Human Capital, SC: Social Capital, Ho: Homophily, SE: Social Exchange

When the maximum scope is increased, the scientists have more global information, which gives them the ability to identify the artifacts to study more effectively. Interestingly, the scope does not evolve to the maximum value and stabilizes around 7 cells. This observation can be indicative of diminishing returns of the scope in relation to the robustness of the system.

Migration threshold is the number of times that a scientist forages before considering leaving the environment to recover. The greater the threshold, the less likely the scientists are to depart from the network. This is the reason why a higher level of migration threshold creates more robust networks. However, further exploration is needed to determine whether there is a diminishing return, as observed for the scope. The same comments can be made for the forage extension, which defines how much broader a scope scientists perceive when foraging.

Figure 6.5: Integer Parameters of the Fittest Scenario over Generations
Figure 6.6 represents the evolution trends among the parameter values that are represented in the genome as floating-point numbers.

Figure 6.6: Floating Number Parameters of the Fittest Scenarios over Generations

Expertise to recover sets the expertise level above which scientists consider departing from the environment. The greater the value is, the less likely the scientists are to leave the environment. Earlier in the analysis, it was expected that greater values would create more robust environments. However, experiments indicate otherwise. Letting scientists dissolve sooner does not allow them to build up network centrality. That is, when such scientists dissolve from the network, the perturbation to the social network is smaller, resulting in more robust environments. Hence the parameter converged to a value of 0.7.

The maximum altruism value stabilized at 0.9 for the first 10 generations. After the value was limited to 0.5, it stabilized around 0.5. Higher values of altruism create more active and crowded social networks, which are more robust than smaller social networks. Interestingly, when altruism values are higher (the first 10 generations show this more clearly), minimum tension and minimum cognitive burden move in the same direction. This observation suggests that a moderate level of difference between the parameter values that are dialectic forces in the collective action formula can lead to more robust landscapes.

Arrival rate converged to the value of 0.2. An increased arrival rate results in more crowded social networks, which can be more robust against perturbations. However, it stabilized at values lower than the upper limit. Probability to recover (related to the turnover rate) decreases as expected, because lower values decrease the probability of dissolution from the network. Probability to leave is related to the mobility of the members.
A lower level of mobility can cause scientists to stick with the project they reside on, missing other opportunities, while higher levels can cause distraction, leading scientists to abandon promising opportunities before these attract more attention. Thus, scientists may end up wandering in the environment most of the time and connecting to fewer people. This causes the network to be smaller and therefore less robust to perturbations.

Figure 6.7 represents the communication mechanism of the fittest scenarios over time. It is observed that under the same parameter configurations, different communication mechanisms outperform the others at different generations, which can be caused by environmental uncertainty (different random number seeds) or by the communication mechanisms themselves.

Figure 6.7: Communication Mechanism of the Fittest Scenarios over Generations
Mi: Mixed Communication, Ra: Random Communication, HC: Human Capital, SC: Social Capital, Ho: Homophily, SE: Social Exchange

Table 6.2: Decoded Values of the Identified Portfolio of Scenarios

Parameter                        Value
Communication Preference         {0,1,2,3,4,5}
Maximum Time Expectation         4
Mutation Rate                    0.01
Foraging Mechanism               Basic
Expertise to Recover             0.7
Migration Threshold              4
Maximum Scope                    7
Maximum Altruism                 0.5
Minimum Tension                  0.7
Minimum Burden                   0.2
Forage Extension                 3
Probability to Recover           0.1
Probability to Leave             0.3
Artifact Creation Probability    0.3
Arrival Rate                     0.2

The portfolio of scenarios is represented in Table 6.2. The parameters that are not presented in the table are the same as in the base-line model. The portfolio is determined by observing the scenarios that output the fittest results in the last generations of the search. The decision of selecting the scenarios that are added to the portfolio is qualitative. Depending on the purpose of the study, the decision can be made by just observing the fittest scenarios over time or by selecting the scenarios that give results within a range of values.
In this study, further batches of runs are conducted to provide an example of the use of exploratory modeling. In order to narrow the focus while enabling the best use of the computational resources, only the fittest scenarios of the last generations are included in the portfolio.

In conclusion, the communication theory parameter flipped a couple of times in the portfolio. When the overall robustness of the population is observed, none of the theories is significantly better than the others. The analysis needs more replications and further exploration. Therefore, in order to test the communication mechanisms at the portfolio scenarios and to measure innovation potential, 200 runs are conducted for each theory. The aim is to assess the fitness values of different communication mechanisms and examine whether there is a tradeoff between robustness and innovation potential. Table 6.3 presents the fitness values for each communication mechanism.

Table 6.3: Fitness Values - The Most Robust Parameter Scenario

Theory                  Fitness
Mixed Communications    0.072
Random Communication    0.062
Human Capital           0.060
Social Capital          0.057
Homophily               0.067
Social Exchange         0.076

For further analysis, confidence intervals are needed. However, the fitness function is based on the time-series data of all replications, and there is no fitness value for a single replication. So, these 200 runs are divided into batches of 10 replications (20 batches), and the analysis of variance is conducted among these batches. Table 6.4 summarizes the number of batches that is estimated to be needed to conduct the analysis of variance with the desired half-width. Figure 6.8 represents the 95% confidence intervals for each mechanism. Note that the interpretation of these confidence intervals is based on the number of replications in each batch and the assigned half-width. As a result, it is not possible to prefer one theory over another.
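One common way to obtain such a batch-count estimate is the textbook sample-size rule n ≈ (t · s / h)², where s is the standard deviation across batch means, t is the critical value of Student's t distribution, and h is the desired half-width. The sketch below assumes this rule and a half-width defined as a fraction of the mean; the exact formula used in this study is not stated here, and the function name and arguments are hypothetical.

```python
def required_batches(mean, std_dev, t_critical, relative_half_width=0.10):
    """Estimate the number of batches needed for a confidence interval
    whose half-width is relative_half_width * mean.

    Uses the textbook rule n = (t * s / h) ** 2 (an assumption; the
    source does not spell out its formula).
    """
    half_width = relative_half_width * mean
    return (t_critical * std_dev / half_width) ** 2
```

With illustrative inputs of mean = 0.060, standard deviation = 0.010, and t = 2.093 (the 97.5% critical value with 19 degrees of freedom), `required_batches(0.060, 0.010, 2.093)` evaluates to roughly 12.2 batches. Halving the relative half-width quadruples the estimate, which is why a tighter interval requires many more batches.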
Further tests can be conducted with more replications in each batch or a smaller half-width (which will require more batches) to be able to distinguish the performance of the communication mechanisms. It is interpreted from the analysis that the fitness function design (which accounts for the MAE of time-series data) also contributes to the variable and similar outputs of some communication mechanisms.

Table 6.4: Analysis of Variance for 20 batches of 10 replications

Theory    Mean     Standard Deviation    n         95% CI
Mi        0.072    0.012                 11.517    0.011
Ra        0.062    0.011                 12.771    0.010
HC        0.060    0.010                 12.104    0.010
SC        0.056    0.006                 4.794     0.006
Ho        0.066    0.009                 9.141     0.009
SE        0.069    0.009                 7.687     0.009

($t_{19,\,1-0.025} = 2.093$, half-width = 10% of the mean)

Table 6.5: Mean of Various Metrics - The Most Robust Scenario

Theory    Density    DC       CC       AvgPath    CP       DiversityN    DiversityL    SW
Mi        0.192      0.255    0.397    2.434      1.493    0.492         0.286         0.163
Ra        0.322      0.33     0.322    1.805      4.600    0.607         0.371         0.178
HC        0.287      0.319    0.287    1.884      3.664    0.559         0.399         0.152
SC        0.281      0.325    0.281    1.896      3.370    0.571         0.390         0.148
Ho        0.254      0.318    0.254    1.933      3.710    0.519         0.422         0.131
SE        0.199      0.257    0.199    2.400      1.585    0.495         0.285         0.083

Figure 6.8: 95% Confidence Intervals for Different Communication Mechanisms
Mi: Mixed Communication, Ra: Random Communication, HC: Human Capital, SC: Social Capital, Ho: Homophily, SE: Social Exchange

Table 6.6: Standard Deviation of Various Metrics - The Most Robust Scenario

Theory    Density    DC       CC       AvgPath    CP       DiversityN    DiversityL    SW
Mi        0.030      0.021    0.040    0.146      0.400    0.064         0.046         0.274
Ra        0.051      0.021    0.056    0.087      1.594    0.044         0.032         0.644
HC        0.038      0.017    0.048    0.075      0.792    0.042         0.023         0.640
SC        0.035      0.021    0.038    0.070      0.796    0.042         0.022         0.543
Ho        0.040      0.018    0.049    0.082      1.194    0.045         0.019         0.598
SE        0.026      0.020    0.037    0.161      0.408    0.061         0.051         0.230

A comparison of the results of the most robust landscape (Table 6.5 and Table 6.6) with the analysis of Chapter 5 shows that more robust networks generate more connections.
While degree centrality is observed at similar levels, the clustering coefficient is observed at higher levels than in the base-line scenario, suggesting that the networks have densely connected cliques or that the network behaves like a single densely connected clique. Since the average path length is lower and the core/periphery ratio is high, it can be concluded that the networks consist of a single highly dense cluster, which is a small world in itself. Small-worldliness, lower average path length, and higher density are indicative of greater innovation potential. However, the increased core/periphery ratios indicate that there are not many periphery members. Diversity among links suffers from highly dense clusters. Additionally, diversity among nodes is spurred, creating connections between scientists from different interest levels.

6.4 Limitations and Conclusions

The exploration software in this research provides a process to navigate effectively through the plausible scenario space and identify key configurations that can lead to lines of reasoning in terms of robustness and innovation. If more insight is desired, further sensitivity analysis can provide information on whether these scenarios are significantly different or similar. Additionally, the identified portfolio of scenarios is constructed as a basis for further exploration with the genetic algorithm by bounding the plausible scenario space, or for high-resolution exploration by re-determining the possible parameter ranges to search. The developed human-mediated exploration software is realized to provide a search process and a portfolio of robust scenarios rather than a product (an optimal answer).

An important limitation of this research is the amount of computational resources that could be dedicated to exploration. Thirty replications for each scenario are dedicated to providing a sufficient level of convergence.
Since the research does not aim at optimization, strong convergence can also be problematic. Aside from the discussed approach of re-initiating the GA multiple times (Zeigler et al., 1997), an alternative implementation would be to run each scenario for 2 or 3 replications and dedicate the remaining computational resources to further exploration (Dibble, 2006). However, standard GAs (as implemented in this research) have strong convergence tendencies (Burke et al., 2002). So, in order to determine whether this alternative method would provide more diversity as expected, the GA operators should be varied and the performance should be evaluated.

In this chapter, four particular operations are implemented to promote diversity in the population: (1) the set of random number seeds is changed at each generation, (2) through the interventions, the kick the ball principle is introduced (randomly varying the population), (3) during interventions, metamorphic relations are tested to bound the plausible scenario space, and (4) the population is divided into sub-populations of communication mechanisms (which are preserved over time). Alternative approaches to maintaining diversity in the population include crowding models, assortative mating, dividing the population into sub-populations, and fitness sharing (Smith et al., 1993; Burke et al., 2002).

Another limitation of the study is related to the implemented GA design. In particular, experimentation with alternative GA designs can be performed to understand the features that can make the exploration software faster and more effective. The results can be compared to determine the effectiveness of the operations that are applied to promote diversity in this research.

To conclude, in the most robust landscape, a high level of activity and knowledge creation generates highly dense, clustered small-world networks that increase the robustness, while the diversity and core-periphery structures are weakened.
Further runs with more replications are suggested to compare the communication mechanisms and their robustness performance. If desired, different fitness functions can be tested, based for example on the standard deviation of the average behavior among replications or on the coefficient of variation rather than the MAE. Also, the portfolio can be extended for further sensitivity analysis by including the scenarios (e.g., the fittest scenario of generation 14) that perform close enough to the scenarios in the portfolio. In the following chapter, the research is summarized along with suggestions for future research.

Chapter 7

CONCLUSIONS AND FUTURE RESEARCH

The Science of Science and Innovation Policy program of the National Science Foundation (NSF) is particularly interested in the development of computational models that explore different aspects of knowledge creation. The aim is to identify the theories and the mechanisms that mimic the dynamics of knowledge creation and then to conduct exploratory analysis under various conditions to develop a better understanding of innovation and creativity. Policymakers might eventually aim to develop open scientific environments and maintain cyber-infrastructures that provide a landscape for scientific collaboration.

In his well-known work, Nielsen (2011) coins the term designed serendipity. He anticipates that the world is approaching the second open science revolution (the first one was the publication system around 300 years ago), which will alter the way people publish and the publication system itself by creating new norms (a new agora) (Gibbons et al., 1997) and tools that people conduct research on. IBM¹ and the MIT Collective Intelligence Lab² have already been working on online collaboration tools that will foster innovation and creativity. Nielsen (2011) states that human society is in an era of figuring out how to design these tools in a way that promotes serendipity coming out of the system. That is why he calls the phenomenon designed serendipity.
The results presented in this dissertation provide insight about which components have impacts on collaboration in Global Participatory Science (GPS) and which communication behaviors can be promoted to improve innovation potential and capacity. In the long run, these findings can be used for orchestrating the collaboration in GPS and designing online cyber-infrastructures.

¹ http://www.research.ibm.com/labs/watson/index.shtml - As of 4.07.2013
² http://cci.mit.edu - As of 4.07.2013

In this study, the self-organizing complex adaptive system (CAS) viewpoint is adopted. The CAS domain provides a trans-disciplinary research framework from which a wide variety of disciplines can benefit. Initially, self-organization principles and theoretical foundations that can explain the behaviors of the scientists in GPS are examined. It is observed that collective action theory is a plausible theory on which to ground the models of collaborative environments. Collective action dynamics are modeled in terms of self and mutual interest variables based on the findings of Olson (1974). The theory was used to explain different phenomena in the late '80s and early '90s, including in computational models; however, the theory has vanished in the last decades (Oliver and Marwell, 2001). Recently, the theory has been brought up to explain the open science phenomenon.³ To the best of our knowledge, to date, there is no computational model that incorporates collective action theory to explain scientific knowledge generation.

Self-organization mechanisms that explain the dynamics in GPS are implemented over the collective action model. Unlike prior studies on OSSD communities that use the stigmergy mechanism (Cui et al., 2009), the preferential attachment mechanism is used to represent the artifact/project selection process. Information foraging theory is interpreted along with the domain knowledge, and it is implemented as a mechanism that explains the migration of scientists among the projects.
Learning and influence mechanisms are implemented by monotonically increasing functions, representing the cumulative knowledge creation. Since GPS is an open system with arrivals and departures of scientists, population dynamics are also introduced. All these mechanisms are implemented with positive and negative feedback dynamics based on the local information that is available to scientists. The bounded rationality assumption is an essential principle perceived in all human decision processes. The mechanisms that are related to those decisions incorporate roulette wheel algorithms to represent proportional selection rather than optimal selection. The main principle is: "People satisfice, rather than optimize." All mechanisms that are implemented in this research are grounded on theory and empirical findings.

³ http://michaelnielsen.org/blog/the-logic-of-collective-action/ - As of 4.07.2013

One of the limitations at the first phase of this research is the use of the V-shaped function that approximates tension over time. A U-shaped function can be introduced to the model in future studies. Different learning techniques can be employed to build up the expertise of scientists and the complexity of the artifacts. Additionally, agents can be given different levels of information for effectively finding artifact opportunities to contribute to. Different information types can be introduced to the preferential attachment mechanism based on the objective of the study.

Following the representation of the intrinsic motivation dynamics of scientists, the conceptualization of the base-model is concluded. The base-model development was an important stage of this research, since it formally defines the parameters and effectively formulates the known dynamics and spatio-temporal characteristics of GPS. Verification and validation studies are conducted during and after the implementation with performance tests as described in Chapter 4.
Additionally, OBO data is used to conduct tests for the validation of emergent patterns. In the future, if additional social network data become available, the model can be calibrated to represent specific communities for further validation.

The second milestone was the identification of more innovative communication traits among scientists. Communication preferences are stated as important factors that have an impact on the evolution of GPS and social networks. To conduct sensitivity analysis, the candidate communication theories that are related to the problem domain are identified. Then those theories are implemented over the base-model. In Chapter 5, the interpretation of each theory and its implementation are described. Under the base-model conditions on which the validation studies are conducted, the innovation potential of the identified communication traits is explored. Innovation potential is represented in terms of different hypotheses: (1) High centrality-fewer structural holes, (2) Density, (3) Diversity, and (4) Interdisciplinarity. Most studies calculate diversity in terms of individual disparities in the population, while in this research, another diversity metric that identifies discrepancy among the atomic social networks of individuals is introduced. It was effective in identifying disparity among the nodes in the social network, so it is promising for use in different studies.

All metrics are presented to policymakers in a summary table (a matrix of the levels to which different metrics converge), so that the behavioral patterns of different theories can be distinguished. Moreover, if it is desirable to conduct further runs, the outputs can be interpreted by the policymakers to aid their multi-criteria decision making process. While a theory might seem more innovative under one set of conditions, another theory can outperform it under a different set of conditions.
Interestingly, social capital theory dominated the other theories during the analysis in Chapter 5. Social capital theory supports innovativeness relatively more than the other candidate theories. In this research, it is revealed that if the information about the social degrees of scientists is broadcast on online tools and if link formation between highly central members and periphery members is fostered, then networks generate more innovation potential. Specifically, the social capital theory promotes connections among central scientists and periphery scientists, enabling innovation diffusion among the scientists. Highly central members are also more likely to have more contributions and gain more expertise. As the expertise levels of the scientists increase, the complexity of the artifacts also benefits. This leads to increased maturity in the environment, resulting in sustained knowledge creation. The high centrality-fewer structural holes hypothesis is supported more by the social capital theory, because the theory promotes highly central members, who are connected to periphery members from different clusters. In terms of link diversity, social capital gives the opportunity to form connections between different cliques, causing diverse atomic individual networks. This mechanism also decreases the observed average path length in the network.

Core/periphery structures promote the diffusion of innovative ideas. It is observed that mixed communications and social exchange favor the balance among the scientists. They generate a small number of core members, while the number of periphery members is large. Regarding the balance of expertise, they usually result in a moderate level of expertise in the population.

This research provides the base-model as a computational laboratory that is developed with object-oriented programming (Java) principles using the Repast framework, so that it is extensible and portable across platforms.
Repast is selected as the development framework because it is porous; that is, it provides a framework to model agents that are not necessarily atomic but can be distinct by design. Repast is also called recursive and hence allows designing nested combinations of agents and spaces. Therefore, the model can be extended by adding higher-level contexts or agents, including communities as different layers, and these different layers can still communicate.

Robustness, as a systemic characteristic, is an important objective for policymakers. The scenario space needs to be explored by testing different communication preferences under different parameter set-ups. For that reason, the exploratory modeling method is adopted and a GA module is implemented to conduct an intelligent search of the scenario space while improving the scenarios in terms of the robustness metric. Robustness is defined by aggregating the mean absolute errors of different output metrics. The less variable the outputs are, the more robust the scenarios are.

Furthermore, there was a need to build a feedback mechanism between generations to identify the domain-specific or model-specific regions of the scenario space that behave in a highly variable way or give trivial results. Metamorphic relations (MRs) are introduced as this feedback mechanism. They are addressed as a promising method for identifying the expected behaviors of the model under different conditions as well as for bounding the search space. In this research, the feedback mechanism between the MRs and the GA module is realized manually; however, there are studies that implement automatic creation of metamorphic tests (Gotlieb and Botella, 2003). Future research includes the evaluation of metamorphic relations to automatically generate further relations that can be used. Experimental results indicate that the implemented GA module that is coupled with MRs can be applied in different domains for different problem sets.
The search for robust landscapes in this research is limited due to the computational complexity and the lack of computational resources. In order to draw more significant results, the number of replications can be re-determined. The simulations are run on an Apple laptop with 8 GB of memory, a 2.3 GHz i5 CPU, and a 5400 rpm hard drive. Even though the kick the ball principle is applied, the runs converged to the same region after each randomization. Different techniques to promote diversity can be used in the GA design as future work. Also, more runs can be conducted after the boundaries of the parameter values are relaxed and continuous values are allowed in the parameter space. In this way, diminishing returns of various parameters can be identified. For the objective of this research, providing the exploration software and exploring the general behaviors of the system that is bounded by the MRs and expert opinion (the opinion of the author) is considered sufficient.

The further experiments could not reveal any communication theory that globally outperforms the others in terms of robustness. In general, the parameter values that generate more arrivals and cause more populated social networks (by mitigating the effect of dissolution) are observed to create more robust landscapes, since such networks are more resilient against perturbations caused by the removal of central members. Interestingly, a tendency to balance certain values is observed. For example, the expertise threshold after which scientists might leave the environment is set at a medium level. Experimental observations suggest that more expert scientists are likely to become central in the network over time and that their removal is likely to perturb the network more. So, medium-level values result in more robust landscapes. Additionally, the gap between the maximum altruism value and the combined effect of the minimum cognitive burden and minimum tension values tends toward medium levels.
A tradeoff is addressed between robustness and innovation potential. It is stated that more robust scenarios have dense and more clustered networks with highly central members, promoting trust and the sharing of ideas. However, the diversity among links and the core/periphery structures that are related to structural holes are hindered.

The findings of this research can assist policymakers during the design phase of online collaboration tools. The behaviors observed under each theoretical mechanism can be interpreted, and related information can be broadcast transparently in terms of the network structure. For example, the social capital theory, which is found to promote higher levels of innovation potential, states that scientists are more likely to connect to those who are connected to more people. Consequently, scientists aim to increase the social resources they have. In order to support social capital theory in the tool design, the degree information of scientists and their social reachability can be presented publicly to guide the scientists. Besides, policymakers can develop mechanisms (such as a reputation index) to incentivize connections between central members and periphery members. Central members would then be inclined to connect to less central scientists.

Moreover, the evolution of the social network can be observed over time. If the network evolves to undesired stages, other communication styles can be promoted. For example, if the network transitions into a state that has a great number of central members and highly specialized cliques, then mixed communications or social exchange theory can be supported. Scientists from different backgrounds and expertise levels can be injected into different projects to balance the network. This would promote core/periphery structures, which are known to promote innovation potential and capacity. Interdisciplinarity is an aspired characteristic that is highly promoted in the modern scientific culture.
The network coherence matrix can be used by policymakers to identify what properties they want and by which theories they can guide the system toward that particular property. While specialized interdisciplinary networks are more innovative, more balanced networks with potential interdisciplinary integration can also be desired by policymakers to enable effective management. Such networks can be generated and maintained by promoting homophily theory (connection between scientists from similar domains and interests).

The mixed communications mechanism assigns a communication preference to a scientist with equal probability. As a future avenue of research, the extent to which each communication style should be promoted in the environment can be investigated. Subsequently, the granularity of individual information that enables each communication preference, and the way that information is shared, can be diversified in the model. Further, incentive mechanisms to encourage different communication preferences can be designed to promote innovation potential and capacity. Besides the communication mechanisms implemented in this research, other mechanisms can also be modeled based on different communication theories (e.g., Cognitive Consistency, Balance, and Network Exchange Theory).
Appendix A

A.1 Termination State Analysis

The snapshots of time-series plots for different performance metrics are illustrated below. The means and standard deviations of each metric are observed to support the terminating state decision. Detailed information about the calculation of the diversity metrics is provided in Chapter 5. The first set of snapshots is taken for four different scenarios, which are assumed to represent the different patterns encountered in the analysis (only one representative of each group of qualitatively similar patterns is plotted for readability). During the runs, scientists are socially connected when they contribute to the same artifact at the same time-tick (OBO assumption). Also, recovered scientists are not removed from the social network.

Figure A.1: Density Over Time - OBO Scenarios. (a) Mean of Density; (b) Standard Deviation of Density. Series-1 is Opt. Foraging/Low initial population, Series-2 is Opt. Foraging/Moderate initial population, Series-3 is Basic Foraging/Low initial population, and Series-4 is Basic Foraging/High initial population.

Figure A.2: Degree Centrality Over Time - OBO Scenarios. (a) Mean of DC; (b) Standard Deviation of DC.

Figure A.3: Clustering Coefficient Over Time - OBO Scenarios. (a) Mean of CC; (b) Standard Deviation of CC.

Figure A.4: Average Path Length Over Time - OBO Scenarios. (a) Mean of Average Path Length; (b) Standard Deviation of Average Path Length.

The scenarios in which scientists are connected randomly are illustrated in the following plots, and their mechanisms are described in Chapter 5. In this case, recovered scientists are removed from the social network. That is why, in some scenarios, the network starts to dissolve after 500 time-ticks (as central members leave the environment). Compared to the OBO scenarios, relatively high variability is observed in the social network metrics. Four of the 40 scenarios, which present different patterns, are selected.
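The per-tick means and standard deviations used for the termination state decision can be computed across replications as in the following minimal Python sketch. The stability check (`first_stable_tick`, with its `window` and `tolerance` parameters) is a hypothetical helper added for illustration; in this appendix the decision is supported by inspecting the plots, and the example series are made-up numbers, not simulation output.

```python
from statistics import mean, stdev


def summarize_over_time(replications):
    """Return [(mean, sample standard deviation)] for each time-tick.

    `replications` is a list of equal-length metric series (for example,
    network density per tick), one series per replication; at least two
    replications are needed for the sample standard deviation.
    """
    return [(mean(tick), stdev(tick)) for tick in zip(*replications)]


def first_stable_tick(summary, window, tolerance):
    """First tick from which the mean varies by at most `tolerance`
    over `window` consecutive ticks, or None if that never happens."""
    means = [m for m, _ in summary]
    for t in range(len(means) - window + 1):
        segment = means[t:t + window]
        if max(segment) - min(segment) <= tolerance:
            return t
    return None


# Made-up density series for two replications (illustration only).
density_runs = [
    [0.1, 0.3, 0.5, 0.5, 0.5],
    [0.3, 0.5, 0.5, 0.5, 0.5],
]
summary = summarize_over_time(density_runs)
print(summary)
print(first_stable_tick(summary, window=3, tolerance=0.01))
```

A flat mean together with a small standard deviation across replications is the kind of pattern that the mean and standard deviation panels of the figures are meant to reveal.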
Figure A.5: Diversity (Scientist Population) Over Time - OBO Scenarios. (a) Mean of Diversity (Scientist Population); (b) Standard Deviation of Diversity (Scientist Population).

Figure A.6: Diversity (in the network) Over Time - OBO Scenarios. (a) Mean of Diversity (in the network); (b) Standard Deviation of Diversity (in the network).

Figure A.7: Density Over Time - Random Connection. (a) Mean of Density; (b) Standard Deviation of Density. Series-1 is Low Arrival Rate/High Turnover rate, Series-2 is Low Arrival Rate/Low Turnover rate, Series-3 is High Arrival Rate/High Turnover rate, and Series-4 is High Arrival Rate/Low Turnover rate.

Figure A.8: Degree Centrality Over Time - Random Connection. (a) Mean of DC; (b) Standard Deviation of DC.

Figure A.9: Clustering Coefficient Over Time - Random Connection. (a) Mean of CC; (b) Standard Deviation of CC.

Figure A.10: Average Path Length Over Time - Random Connection. (a) Mean of Average Path Length; (b) Standard Deviation of Average Path Length.

Figure A.11: Diversity (Scientist Population) Over Time - Random Connection. (a) Mean of Diversity (Scientist Population); (b) Standard Deviation of Diversity (Scientist Population).

Figure A.12: Diversity (in the network) Over Time - Random Connection. (a) Mean of Diversity (Network); (b) Standard Deviation of Diversity (in the network).

A.2 Response Surface Analysis

Table A.1 represents the variables identified for response surface analysis and their respective values.

Table A.1: Parameter Values for RSM

Parameter                       Scenarios
Communication Type (Theory)     [1,...,5]
Expertise Level to Recover      [0.7,0.9]
Probability to Leave            [0.1,0.2]
Minimum Time Expectation        [1,5]
Arrival Rate                    [0.1,0.2]
Migration Threshold             [3,5]
Recover Rate                    [0.1,0.2,0.5]
Forage Extension                [2,3,5]
Minimum Tension                 [0.1,0.5,1]
Minimum Burden                  [0.1,0.5,1]
Maximum Altruism                [0.1,0.5,1]

The following Table A.2 and Table A.3 represent response surface analysis results.
Table A.2: Summary of Response Surface Analysis - 1

Inputs                    CC                       DC                       Avg Path                 CP                       Density
                          Coef.        t      p    Coef.        t      p    Coef.        t      p    Coef.        t      p    Coef.        t      p
(Constant)               -0.076   -4.572  0.001    0.446   23.292  0.001      0.5   26.711  0.001    1.675   14.087  0.001    0.471  660.381      0
maxAltruism               1.004    43.01  0.001    0.934   24.698  0.001    0.921   22.558  0.001   -0.666  -10.153  0.001   -0.811  -54.058  0.001
MinBurden × ComType       0.051    1.852  0.065    0.107    2.451  0.015    0.155    3.331  0.001   -0.234   -3.116  0.002        0        0      0
migrationThreshold        0.169   15.616  0.001        0        0      0    0.124    4.721  0.001    -0.11   -3.742  0.001   -0.091   -9.421  0.001
Recover × ComType         0.045    1.543  0.124        0        0      0        0        0      0        0        0      0        0        0      0
expertiseToRecover        0.111   10.446  0.001   -0.049   -4.071  0.005    -0.29  -15.706  0.001    0.164    5.663  0.001    0.041    4.228  0.001
minTension               -0.167  -10.243  0.001   -0.187   -4.385  0.001   -0.153   -5.381  0.001    0.317    7.112  0.001    0.142   15.407  0.001
minBurden                -0.387  -14.351  0.001   -0.517  -12.119  0.001   -0.433   -9.332  0.001    0.421    5.679  0.001    0.377   40.984  0.001
Altruism × ComType       -0.233   -8.532  0.001   -0.254   -5.863  0.001   -0.311   -6.693  0.001    0.359    4.782  0.001    -0.03   -1.967   0.05
arrivalRate               0.162   11.456  0.001   -0.088   -4.446  0.001   -0.253  -13.256  0.001     0.09      2.6   0.01    0.259    24.37  0.001
recoverRate              -0.203   -8.345  0.001   -0.054   -3.244  0.002   -0.113   -6.216  0.001    0.113    3.969  0.001    0.086    9.223  0.001
ArrivalRate × ComType    -0.175   -6.405  0.001        0        0      0        0        0      0    0.128    2.105  0.036        0        0      0
MinBurden × MinTension    0.088    4.281  0.001      0.1    3.075  0.003    0.081    2.236  0.026   -0.653  -11.569  0.001        0        0      0
probToLeave              -0.034   -4.014  0.006   -0.083   -4.162  0.001        0        0      0        0        0      0    0.024    2.233  0.026
MinTension × ComType          0        0      0    0.048    1.094  0.275        0        0      0        0        0      0        0        0      0
ForageExt × ComType           0        0      0    0.069    1.078  0.282        0        0      0        0        0      0        0        0      0
Foraging × ComType            0        0      0   -0.311   -4.603  0.001        0        0      0        0        0      0        0        0      0
forageExtension               0        0      0   -0.097   -2.462  0.015        0        0      0        0        0      0        0        0      0
minTimeExpectation            0        0      0    -0.05   -2.498  0.013        0        0      0        0        0      0        0        0      0
Migration × ComType           0        0      0    0.073    2.204  0.028   -0.081   -3.059   0.04        0        0      0        0        0      0

Table A.3: Summary of Response Surface Analysis - 2

Inputs                    DiversityS               DiversityA               DiversityN               DiversityL
                          Coef.        t      p    Coef.        t      p    Coef.        t      p    Coef.        t      p
(Constant)                0.492  572.778      0    0.471  660.381      0    0.673   27.172  0.001    0.282   14.211  0.001
maxAltruism              -0.745  -43.713  0.001   -0.811  -54.058  0.001    0.755   18.794  0.001    1.042   31.421  0.001
MinBurden × ComType           0        0      0        0        0      0        0        0      0    0.102    2.662  0.008
migrationThreshold       -0.058   -5.297  0.001   -0.091   -9.421  0.001        0        0      0    0.134    9.154  0.001
Recover × ComType             0        0      0        0        0      0        0        0      0        0        0      0
expertiseToRecover       -0.274  -25.525  0.001    0.041    4.228  0.001   -0.063   -3.009  0.003        0        0      0
minTension                0.145   13.818  0.001    0.142   15.407  0.001   -0.155   -4.866  0.001   -0.164   -7.314  0.001
minBurden                 0.429   41.026  0.001    0.377   40.984  0.001   -0.485  -15.279  0.001   -0.487  -13.005  0.001
Altruism × ComType       -0.051   -2.964  0.004    -0.03   -1.967   0.05   -0.174   -3.934  0.001   -0.293   -7.702  0.001
arrivalRate               0.191   17.279  0.001    0.259    24.37  0.001   -0.193   -7.972  0.001        0        0      0
recoverRate               0.095    8.975  0.001    0.086    9.223  0.001        0        0      0   -0.157  -11.002  0.001
ArrivalRate × ComType         0        0      0        0        0      0        0        0      0        0        0      0
MinBurden × MinTension        0        0      0        0        0      0     0.13    3.229  0.002    0.157    5.534  0.001
probToLeave                   0        0      0    0.024    2.233  0.026   -0.088    -3.61  0.001   -0.061   -4.125  0.001
MinTension × ComType          0        0      0        0        0      0        0        0      0        0        0      0
ForageExt × ComType           0        0      0        0        0      0   -0.085   -3.006  0.003        0        0      0
Foraging × ComType            0        0      0        0        0      0        0        0      0    -0.09   -3.004  0.003
forageExtension               0        0      0        0        0      0        0        0      0        0        0      0
minTimeExpectation            0        0      0        0        0      0   -0.088   -3.617  0.001        0        0      0
Migration × ComType           0        0      0        0        0      0        0        0      0        0        0      0

A.3 Core/Periphery Calculation Method

In Figure A.13, the activity diagram of the calculation method of the core/periphery metric is represented.

Figure A.13: Core/Periphery Activity Diagram