SciBrowser: Exploration and Analysis of the Complexity, Structure, and
Activity Dynamics of Open Source Science Communities
by
Damodar P. Shenvi Wagle
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
May 9, 2011
Keywords: complexity, social network analysis, visualization, software engineering
Copyright 2011 by Damodar P. Shenvi Wagle
Approved by
Levent Yilmaz, Associate Professor of Computer Science and Software Engineering
David Umphress, Associate Professor of Computer Science and Software Engineering
Dean Hendrix, Associate Professor of Computer Science and Software Engineering
Abstract
Open Biomedical Ontology (OBO) is a socio-technical community composed of
individuals who are dispersed geographically but function as a coherent unit through the
use of cyber-infrastructure. This study explores the dynamics of open source science in such
virtual socio-technical networks. Innovation within a socio-technical network can be defined as
the approach to work that leads to the generation of novel and useful ideas and processes.
Among the factors that influence innovation are structural properties such as centrality,
density, clustering coefficient, and average path length of socio-technical networks, as well as
effectiveness in collaboration. Hence, we explore virtual scientific communities from three
main perspectives: network, collaboration, and activity. Structural network metrics measure
the resilience of socio-technical networks. Collaboration analysis aims to discover interaction
patterns among participants and between knowledge domains. Activity analysis facilitates
discerning artifact submission and community growth patterns over time. To expedite
analysis, a computational ethnography tool, called SciBrowser, is introduced. Using SciBrowser,
we observe power law degree distributions, which indicate the presence of scale-free network
configurations. Such configurations provide an explanation for the resilience of research
communities in cyberspace. A new metric, called activity strength, suggests that major
contributors of the project are weak collaborators. As a result, their strong contribution
factor is nullified by their weak collaboration intensity. Activity patterns of the observed
projects suggest the presence of an adaptive renewal cycle, which is the epitome of behavior
in innovation ecosystems.
Acknowledgments
I would like to thank Dr. Levent Yilmaz, Dr. David Umphress and Dr. Dean Hendrix
for their direction, assistance, and guidance. In particular, Dr. Yilmaz's recommendations
and suggestions have been invaluable for the project and for software improvement.
Special thanks should be given to my student colleagues and lab-mates Michael Arnold,
Guangyu Zao and Ozgur Ozmen who helped me in more ways than one. My thesis would
be incomplete without the visualizations and data provided by Michael.
Table of Contents
Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Open Source Communities
2.2 Complex Adaptive System
2.3 Open Biomedical Ontology (OBO) as a Complex System
2.4 Visualizing Science
2.5 Social Network Metrics and Interpretation
2.5.1 Centrality
2.5.2 Density
2.5.3 Clustering Coefficient
2.5.4 Average Path Length
2.5.5 Small World Phenomenon
3 The Organizational Framework of the SciBrowser System
3.1 Schema Conversion Subsystem
3.2 SciBrowser Subsystem
4 Schema Conversion Subsystem
4.1 Open Biomedical Ontologies Schema, SourceForge Research Data Archives
4.1.1 Original Schema
4.1.2 Extension
5 Implementation of SciBrowser Tool
5.1 Requirement Analysis
5.1.1 Purpose
5.2 Design
5.2.1 Helper Tables
5.2.2 Metrics
5.2.3 Metric Selection
5.2.4 GUI
5.3 Verification/Testing
5.3.1 Test Database
5.3.2 Metric Verification Module
5.4 Validation
5.4.1 Structural Analysis
5.4.2 Collaboration Analysis
5.4.3 Activity Analysis
6 Social Network Analysis Using SciBrowser
6.1 Structural Analysis
6.1.1 Centrality
6.1.2 Small World Phenomenon
6.1.3 Degree Distribution
6.1.4 Preferential Attachment
6.2 Collaboration Analysis
6.2.1 Activity Strength
6.2.2 Collaboration Map
6.3 Activity Analysis
6.3.1 Contribution Distribution
6.3.2 Active User Distribution
6.3.3 Activity Outburst Frequency Distribution
7 Conclusion
Bibliography
List of Figures
3.1 Schema Conversion Subsystem
3.2 SciBrowser Subsystem
4.1 Open Biomedical Ontology Schema for Artifact Data
4.2 Extended Schema for Artifact Data
5.1 Comprehensive Class Diagram
5.2 Structural Analysis Module
5.3 Collaboration Analysis Module
5.4 Activity Analysis Module
5.5 Metric Factory
5.6 Model-View-Controller
5.7 GUI Snapshot
5.8 Structural Analysis Tab
5.9 Collaboration Analysis Tab
5.10 Activity Analysis Tab
5.11 GUI Class Diagram
5.12 Testing Framework
5.13 Test Network
6.1 User Artifact User Network
6.2 User User Network
6.3 Artifact Artifact Network
6.4 User-Domain Network
6.5 Domain Domain Network
6.6 Centrality and Density Distributions
6.7 Clustering Coefficient and Average Path Length Monthly Distribution
6.8 Degree Distribution of User-User network for Gene Ontology (Group Id-36855)
6.9 Degree Distribution of User-User network for OBO (All Groups Included)
6.10 Artifact Degree Distribution of User-Artifact-User network for OBO (Comprehensive)
6.11 Preferential Attachment Graph for Artifact
6.12 Preferential Attachment Graph for Users
6.13 Activity Strength Log Plots for Gene Ontology (36855)
6.14 User Collaboration Maps for various projects under OBO
6.15 Typical Contribution Activity Patterns across projects OBI (177891), Open Biomedical Ontologies (76834) and ChEBI (125463)
6.16 Alternate Contribution Activity Patterns across projects Gene Ontology (36855) and Sequence Ontology (72703)
6.17 Comparisons between Contribution and Active User Distribution Patterns for Sequence Ontology (72703)
6.18 Comparisons between Contribution and Active User Distribution Patterns for Gene Ontology (36855)
6.19 Activity Plots for Systems Biology Ontology (174625)
6.20 Activity Outburst Frequency Distribution For Gene Ontology and ChEBI
6.21 Activity Outburst Frequency Distribution For Open Biomedical Ontologies and Sequence Ontology
List of Tables
5.1 Structural Analysis Metrics
5.2 Collaboration Analysis Metrics
5.3 Activity Analysis Metrics
5.4 artifact
5.5 users
5.6 groups
5.7 user group
5.8 artifact group list
5.9 artifact message
5.10 Test Cases for Degree Distribution
5.11 Test Cases for Activity Strength Distribution
5.12 Test Cases For Contribution Activity Distribution
Chapter 1
Introduction
Open source research has turned into a business model because of its popularity
and the open innovation revolution it brought to the software industry 15 years ago. Despite
differences in research dynamics between the software and biomedical industries, this model
has entered biomedical research [29]. Virtual collaboratories are enabling a new mode of
collaboration in which scientists distributed over the globe share and co-develop knowledge
over the cyber-infrastructure. The practice of science is becoming open and global as
access to knowledge, as well as its production, becomes increasingly transparent.
Service-oriented science [15] and e-Science [8] initiatives create scientific communities where
shared domain knowledge is no longer exclusively documented in scientific literature or patents.
Rather, it is documented in software, simulations, and databases that represent an evolving
collective knowledge base that is governed and maintained by community members. Just
as in open source software communities, a "SourceForge for science" style of scientific
production and collaboration provides the requisite infrastructure, which encompasses
community membership services, catalogs, storage services, and workflow orchestration services.
As reported in [20], [13] and [38], open source communities promise a great deal of
discovery and learning. Many researchers have examined open source science communities
in the past. For example, Madey et al. [18] performed a topological analysis of an entire
development community on SourceForge.net [14], wherein they classified the members of
the community into 4-5 groups based on their activity and then performed social network
analysis on the networks. They primarily study the project-developer network. Based on
this network, they derive project-project and developer-developer networks. In our study,
we investigate the individual projects and study the collaboration network of the members
at the artifact level, which makes our study more detailed. Moreover, we do not classify
the members based on their roles in the database. The roles recorded in the database
can be deceiving for the purpose of our study because they do not correlate with the
activities of the respective members. Recent studies such as the human flesh search engine [13]
and the network analysis of scientific workflows [38] also use social network analysis to explore
their communities of practice and answer some of the vital questions about their research.
For instance, in [38] the authors aim to answer the question "What is the current usage pattern
of services in scientific workflows, and how can this knowledge be extracted to facilitate
reuse?"
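The derivation of a developer-developer network from a project-developer network, as attributed to Madey et al. above, can be illustrated with a one-mode projection of a bipartite graph. The following is only a minimal sketch under that assumption; the function name project_developers and the sample memberships are hypothetical and are not taken from [18] or from SciBrowser.

```python
from collections import defaultdict
from itertools import combinations


def project_developers(memberships):
    """Project (developer, project) pairs onto a developer-developer network.

    Returns a dict mapping a sorted developer pair to the number of
    projects the two developers share (used here as the edge weight).
    """
    members = defaultdict(set)
    for developer, project in memberships:
        members[project].add(developer)

    weights = defaultdict(int)
    for developers in members.values():
        # Every pair of developers on the same project gets one shared link.
        for a, b in combinations(sorted(developers), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical membership data, for illustration only.
memberships = [
    ("ann", "p1"), ("bob", "p1"), ("cy", "p1"),
    ("ann", "p2"), ("bob", "p2"),
    ("dee", "p3"),
]
edges = project_developers(memberships)
```

Swapping the roles of developer and project in the same function would yield the project-project network. The artifact-level networks studied in this thesis are finer grained than this project-level projection.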
What differentiates our study is its three-pronged approach to the analysis of the OBO
community. Our aim is to study the network not only from its structural point of view (which
is social network analysis) but also from the perspectives of collaboration and activity. From
the structural perspective, we visualize different types of graphs: user-user, user-artifact-user,
user-domain, and domain-domain. For these graphs we study social network metrics: different
centralities, density, clustering coefficient, and average path length. Centrality and
density metrics assist in detecting whether a core-periphery structure exists in the network.
Clustering coefficient and average path length indicate whether the community has a small
world nature. Both of these properties, if present in the network, can foster innovation in
the community. In structural analysis we also look for the presence of a power law in the degree
distribution of the users, artifacts, and domains for the different types of graphs mentioned
above. Apart from degree distribution, we concentrate on the phenomenon of preferential
attachment. To examine this phenomenon, we plot the rate of change of the degrees of actors
over time.
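To make the structural metrics named above concrete, the sketch below computes density, average clustering coefficient, and average path length for a small undirected graph. It uses common textbook definitions, which may differ in detail from the formulations given later in Section 2.5; the function names and the four-node example graph are assumptions made for illustration.

```python
from collections import deque


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        nb = list(neighbors)
        # Count the edges that exist among the neighbors of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nb[j] in adj[nb[i]]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical user-user graph: a triangle a-b-c with d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
d = density(adj)
cc = average_clustering(adj)
apl = average_path_length(adj)
```

A high clustering coefficient combined with a short average path length is the usual signature of a small world network.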
Collaboration and activity analysis give a new dimension to the study. Collaboration
analysis takes place at the user level, for which we developed a novel set of metrics (to
be discussed in Chapter 6) to identify the influence and innovation potential of users based
on their activity in the project. We also visualize the collaboration between users using
color coded maps and highlight the areas where collaboration is significantly high. Temporal
analysis of activity is conducted at both the group level and the domain level within a group,
thereby relating the open source science project life cycle to the organizational life cycle.
Two types of activity are identified for the purpose of our study: artifact contribution and
number of active users. We examine the implication of one activity for the other. Activity
outbursts (high activity points) are also plotted to see how frequently activity crosses a
threshold. The threshold is determined by the user by setting relevant parameters.
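The idea of an activity outburst as a threshold crossing can be sketched as follows. The threshold rule used here (mean plus k standard deviations, with k set by the user) is purely an assumption for illustration; the source states only that the user sets the relevant parameters. The function name outburst_periods and the monthly counts are hypothetical.

```python
from statistics import mean, pstdev


def outburst_periods(counts, k=1.0):
    """Return indices of periods whose activity exceeds the threshold.

    Assumed rule: threshold = mean + k * population standard deviation,
    where k is a user-chosen parameter.
    """
    threshold = mean(counts) + k * pstdev(counts)
    periods = [i for i, c in enumerate(counts) if c > threshold]
    return periods, threshold


# Hypothetical monthly artifact submission counts.
monthly_counts = [2, 3, 2, 10, 3, 2, 12, 2]
periods, threshold = outburst_periods(monthly_counts, k=1.0)
```

Counting how often outbursts of a given size occur over a project's history would then give an outburst frequency distribution.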
We introduce a tool, called SciBrowser, for comprehensive analysis of open source science
communities. Among the significant findings is a power law degree distribution for the
User-User graph, indicating the resilient nature of the community. We also observe a power
law distribution for the User-Artifact-User graph, which indicates that only a few artifacts
greatly influence responses from users. Most of the projects have their clustering coefficient
value stabilizing at around 0.5 and their average path length value stabilizing at around 2,
which is an indication of a small world network. The distribution of the innovation metric
"Activity Strength" on a log scale indicates that major contributors of the project are weak
collaborators. Thus, their strong contribution factor is nullified by their weak collaboration
intensity. As a result, we fail to get a power law distribution when we consider both
contribution and collaboration as part of the "Activity Strength" metric. Activity patterns of the
projects closely resemble the virtual organization life cycle. Thus, there is a possibility that
the open source science projects possess specific organizational characteristics like division
of labor, leadership, level of commitment, and coordination/control.
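One rough way to check a degree distribution for power law behavior is to fit a straight line to the degree histogram on log-log axes; an approximately linear plot with negative slope is consistent with a power law. The sketch below uses ordinary least squares, which is a crude diagnostic rather than a rigorous statistical test, and it is not claimed to be the procedure used in SciBrowser. The function name and the synthetic degree sequence are assumptions.

```python
import math
from collections import Counter


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree)."""
    points = [
        (math.log(k), math.log(c))
        for k, c in histogram.items()
        if k > 0 and c > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic degree sequence whose counts follow 100 * k**-2 exactly.
degrees = [1] * 100 + [2] * 25 + [5] * 4 + [10] * 1
histogram = Counter(degrees)
slope = loglog_slope(histogram)  # estimates the negative of the exponent
```

On real data the tail of the histogram is noisy, so the fitted slope should be read as indicative only.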
This thesis is structured as follows. Chapter 2 highlights the properties of complex
adaptive systems and open source communities that prevail in OBO. It also explains relevant
social network metrics that are related to innovation and creativity. Chapter 3 presents
the conceptual framework of the SciBrowser tool in terms of the major building blocks that
constitute the tool. Chapter 4 introduces the SourceForge database schema and its extension
to facilitate calculation of the metrics required for the project. Chapter 5 outlines the
implementation of the SciBrowser tool from a software engineering perspective and discusses the
software process in terms of the different stages of the software development life cycle. Chapter
6 elaborates on data analysis using SciBrowser and describes the different ways in which we
analyze the community. We relate our observations to the innovation capacity of the group
as well as the individual. In Chapter 7, we conclude by summarizing our findings and pointing
out potential avenues of future research.
Chapter 2
Background
This chapter presents the theory behind open source communities and complex adaptive
systems. It explains the characteristics of Open Biomedical Ontology (OBO) that make
it a complex system. It then discusses important aspects of visualizing science.
Finally, various metrics used in the field of social network analysis are formulated and
their relevance to our study is explained.
2.1 Open Source Communities
The key to understanding an organization is to understand its governance, because it
not only offers an insight into an organization's conception of control, but also indicates how
such communities can be sustained over time [30]. Corporate organizations use bureaucratic
bases of authority, while other organizations such as socio-technical communities of practice
use shared bases of authority. It is important to know how collaboration between individuals
in social communities accomplishes important outcomes such as knowledge sharing, a
determining factor in innovation.
One of the key aspects of open source communities is learning through knowledge sharing.
Members voluntarily collaborate and contribute towards community formation and
growth in the form of artifacts for either public or private benefit. Typically, members are
geographically dispersed and rely on modern communication technologies such as the Internet
as the means of communication and coordination. According to [30], a meritocratic
governance system must be introduced in open source communities in order to attract
high quality contributions from the members. In return, the contributors can be rewarded
with greater status, responsibility, or opportunity. Thus, communities tend to satisfy the
contributor's need for recognition. There are four stages of governance that an open source
community goes through [30]. These stages are explained with reference to the study of the
Debian Linux community:
• De facto Governance
• Designing Governance
• Implementing Governance
• Stabilizing Governance
Leadership of such communities can be decided based on either Technical Contribution [30]
or Organizational Building and Leadership [30]. According to the technical contribution
approach, the greater the amount of technical contribution of a member, the higher the
probability that the member will become a leader. According to the organizational building and
leadership approach, the more a community member participates in online discussions, the
higher the probability that the member will become a leader.
2.2 Complex Adaptive System
Complex systems research is an interdisciplinary field that seeks to explain how large
numbers of relatively simple entities organize themselves, without the benefit of any central
controller, into a collective whole that creates patterns, uses information, and in some cases
evolves and learns (in which case it is called a Complex Adaptive System) [27]. For instance,
ant colonies [27] are an example of complex adaptive systems. An ant colony consists of
hundreds to millions of ants, with each ant being a simple creature performing simple tasks like
foraging for food, responding to the chemical signals of other ants, and fighting intruders. But
as a group, they create complex structures like bridges (out of their own bodies) from one nest
site to another via tree branches separated by great distances.
It can be seen from the examples in [27] that complex systems consist of many elements
connected together. The parts of a complex system may themselves be complex
systems, but the individual elements need not have a complex nature.
They can be simple parts that adhere to simple rule sets. If a system consists of simple parts
whose collective behavior is complex, then the resulting phenomenon is called emergence
[4]. Emergent patterns are not caused by single elements/agents working in isolation;
they emerge from the interactions that take place between the agents based on the simple
rules on which each agent operates. In order to understand complex systems it is important
to know their properties. Each complex system is sufficiently different from others, but
at an abstract level they have some commonalities. The following are some of the common
characteristics and mechanisms of complex systems:
• Aggregation [16] - Aggregation has two interpretations: the first is the way we model
a system, and the second is the behavior of the complex system. According to the first
interpretation, we categorize the system into similar objects, and each category becomes
a class whose members are treated equivalently. The second interpretation is concerned with
the emergent behavior that results from the aggregate interactions of agents. For example, an
ant in an ant colony is a relatively simple agent, but the ant aggregate is highly adaptive
and complex. Aggregates so formed can act as agents at higher levels, called meta-agents.
Meta-agents can also aggregate to yield meta-meta-agents, which leads to the
hierarchical structure commonly found in complex adaptive systems. Thus the second
interpretation of Aggregation is a typical characteristic of complex adaptive systems.
• Tagging [16] - Tagging is the mechanism by which aggregates are formed in a system.
Similar agents are identified by tags, which enable the members to filter their
interactions so that they can choose, from the pool of agents, the agents with which they need
to interact. Thus, tagging leads to the aggregation of agents into meta-agents and
organizations, which is so common in complex adaptive systems.
• Self Organization and Dynamicity - It has been argued that complex activities
are inevitably self-organizing [28]; that is, they cannot be fully externally or hierarchically
controlled. A system can be considered a self-organizing complex system if its
components dynamically interact to achieve a global goal or function [40], page 40. In
decentralized self-organizing systems there is no central authority that imposes the
function. Rather, the function is imposed through autonomous interactions that produce
the feedback that regulates the system. The internal structure of a complex system
does not remain the same; it changes dynamically depending on the interactions that
take place between the actors of the system.
• Non-Linearity [16] - Generally, linearity means that we can get the whole by adding
up the parts. A linear function consists of a weighted sum of its parts, as in the
function 3x + 5y + z. But in complex adaptive systems, the whole is more than just
the sum of its parts. Such systems demonstrate non-linear properties such as the power
law.
• Flows [16] - Flows can be thought of as information transfer over a network of
nodes and connectors. Agents form the nodes and their interactions form the connectors
of the network. Flows cause two effects in the network: the multiplier
effect and the recycling effect. The multiplier effect occurs when a resource is added at
any node in the network. The resource is passed from node to node throughout the
network. The recycling effect occurs when the resource is reused across the network.
Recycling leads to an increase in the amount of the resource as well as its quality.
• Unpredictability and Uncertainty [40], page 40 - The presence of this characteristic
in the environment is another common feature of complex systems. As the systems are
self-organizing, they use two mechanisms to cope with the uncertainty and volatility
in the environment: adaptation and anticipation. In adaptation, the system uses learning
techniques such as genetic algorithms or evolutionary computing to adapt or update
its behavior in response to changes in the environment. In anticipation, the system uses its
current state as well as its current image of its future states to determine what the next state
of the system is, and updates its behavior accordingly.
• Diversity/Disparity [16] - The diversity of a complex system is reflected by the
heterogeneity of the agents that are part of the system. Each agent may follow the
same set of simple rules, but some properties differ from agent to agent.
For example, in the standing ovation problem [26] each agent can have different personal traits
and can behave according to them. The presence of such diverse agents causes the system
to undergo a cascade of adaptations when a certain type of agent is removed from the
system.
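The contrast drawn in the Non-Linearity item above can be checked numerically. A linear function such as 3x + 5y + z is additive: evaluating it on the sum of two inputs gives the sum of the two evaluations. A power law function is not. The following is a minimal sketch; the power law exponent and the sample inputs are arbitrary choices made for illustration.

```python
import math


def linear(x, y, z):
    """The weighted sum used as the linear example in the text."""
    return 3 * x + 5 * y + z


def power_law(x, alpha=2.0):
    """A power law x**(-alpha); the exponent is an arbitrary example value."""
    return x ** -alpha


a = (1, 2, 3)
b = (4, 5, 6)
summed = tuple(p + q for p, q in zip(a, b))

# Additivity holds for the linear function ...
linear_is_additive = linear(*summed) == linear(*a) + linear(*b)

# ... but fails for the power law.
power_law_is_additive = math.isclose(
    power_law(1.0 + 3.0), power_law(1.0) + power_law(3.0)
)
```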
2.3 Open Biomedical Ontology (OBO) as a Complex System
OBO Foundry [6] is a collaborative experiment to establish a common set of
principles for ontology development and a standardized data acquisition system. The aim of
OBO Foundry is to support the community members who are developing and publishing
ontologies in the biomedical field. The goal is to apply the scientific method to ontology
development, so that the data gathered through biomedical research can form a single,
consistent, cumulatively expanding, and algorithmically tractable whole. OBO Foundry is
open, and its contributors are researchers who work together on a continuously evolving
set of design principles that can foster interoperability of ontologies. More than 60
ontologies share an interest in the goal of OBO Foundry. OBO Foundry is a consortium
that comprises multiple groups that focus on different domains. The groups used in
this study are: Open Biomedical Investigations (OBI), Gene Ontology (GO), Open Biomedical
Ontology (OBO), Chemical Entities of Biological Interest (ChEBI), Disease Ontology, Sequence
Ontology and Systems Biology Ontology. The following are the two types of data that are taken
into account for the purpose of this study.
• Trackers [25] facilitate the submission of artifacts that characterize open problems and
feature requests. Each artifact tends to generate a solution to the open problem in the
form of comments or suggestions posted by other members of the community. Artifacts
not only facilitate social interactions, but also contribute to the knowledge
base. Mostly, artifacts are submitted by the active members of the community who
are engaged in knowledge creation.
• Patches [25] are the revisions submitted to the knowledge base by the core members
of the community. The knowledge base evolves over time and sometimes branches out in
new directions as it is explored. An exploratory branch may mature and merge with the
main stable branch, or it may terminate and become a discontinued development
branch.
Just like the World Wide Web [27], OBO can be thought of as a self-organizing socio-technical
system with little or no central control. OBO comprises individual users who are
geographically dispersed and perform simple tasks like submitting artifacts and elaborating
or commenting on the artifacts submitted by other members. The only means of direct
interaction between the members is email. The elaborations provide an indirect
means of interaction. Through these simple actions OBO emerges as a complex system in
terms of its dynamic structure, growth over time, patterns of artifact submissions, user
collaborations, and information flows across domains. The following are the properties of
complex adaptive systems that appear to be applicable to open source science communities:
• Aggregation - Aggregation in an open source science community symbolizes the
collaboration between the agents. Members of the community are the agents who collaborate
over the artifact contributions that they make. The behavior of a single agent is simple
in terms of the contribution he or she makes, but emergent behavior becomes evident
when the users collaborate. Emergent behavior such as power laws and scale-free networks
can be seen in such communities.
10
 Tagging - Members in open science communities are tagged or designated based on
the technical contributions that they make [30]. Due to tagging the community gets
segmented into di erent groups of users, and apparently members of the lower rank try
to associate themselves with the members of the higher rank leading to a phenomenon
called Preferential Attachment [1] which can be the cause of the power law and the
scale free network.
• Self Organization and Dynamicity - In open science communities, central control
does not exist; participation of members of the community is entirely voluntary. One
effect of self-organization is that the structure of the network changes dynamically
with the addition of new members and resources over time. Users who contribute
heavily become central to the community, while those who contribute little
remain on the periphery.
• Non-Linearity - The power law is an example of non-linear behavior.
• Flows - Knowledge mobility signifies the transfer of information in the network. When
an artifact is submitted by a member of the community, the knowledge is passed
on from node to node throughout the network by collaborating users (the multiplier
effect). At the same time, as members contribute comments on the artifact, the artifact
is refined (the recycling effect).
• Diversity/Disparity - Diversity is not necessarily a trait of the members of the
community; it can be a trait of the knowledge that the members inject into the network.
When injected knowledge is new, it draws responses from other members of the community,
which further enhance the knowledge. After a while, however, the knowledge saturates and
no more responses come from the members of the community. It is at this point that we
need diversity in knowledge contributions so that the responses keep coming.
Homogeneous knowledge in the network quenches the growth of the community,
whereas the presence of novel ideas gives a new direction to the growth.
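The link between preferential attachment and a skewed degree distribution, mentioned under Tagging above, can be illustrated with a small simulation. The following is a minimal sketch and not part of SciBrowser: the function name `grow_network`, the rule that each newcomer creates exactly one link, and the network size are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def grow_network(num_nodes, seed=0):
    """Grow an undirected network one node at a time.

    Each newcomer links to one existing node chosen with probability
    proportional to that node's current degree (preferential attachment).
    Returns a dict mapping node id -> degree.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}        # start from a single edge 0-1
    endpoints = [0, 1]           # every edge contributes both of its endpoints
    for new_node in range(2, num_nodes):
        # Picking uniformly from the endpoint list is degree-proportional,
        # because a node appears in the list once per incident edge.
        target = rng.choice(endpoints)
        degree[target] += 1
        degree[new_node] = 1
        endpoints.extend([target, new_node])
    return degree


degree = grow_network(2000)
histogram = Counter(degree.values())   # degree -> number of nodes with it
for k in sorted(histogram)[:5]:
    print(k, histogram[k])
```

In a run of this kind most nodes keep a single link while a few early nodes accumulate many, which is the qualitative signature of the power law discussed above.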
2.4 Visualizing Science
Visualizing science [34] has, over time, emerged as a useful tool in decision
making and analysis. Researchers in this field are producing representations that are
catching the attention of program officers and policymakers. At the same time, it is
important to have a statistical basis for the visualizations so that we can differentiate
between noise and a real change. Any visualization rests on the following key elements:
• Data - Having data readily available to researchers is the most essential requirement
in visualization. The University of Notre Dame maintains the SourceForge Data
Archive, which is available for research. For any open source project on
SourceForge, data is updated monthly. Data can be easily downloaded as a CSV
(comma-separated values) file by querying the database through the query portal. The
queries issued against the database are SQL queries. Having such a framework and roadmap
makes data collection and management easier.
• Model - Visualization should be based on statistical models built on the data.
This is the difference between data analysis and visualization. Data analysis is
the building of statistical models that make sense of the data, whereas
visualization is the way the analysis is represented so that the end user can
understand it easily. For example, the power law [1] is a statistical
model/metric that is visualized using a line plot or a bar plot, because mere numbers
would make it difficult for the user to interpret the power law.
• Validation - The underlying statistical models need to be validated so that they convey
the information the user intends to see. Validating these models becomes a challenge,
however, when we are exploring. Using visualization as an exploration tool
makes validation difficult, because one does not know what to expect ahead of time
[34]. The following section on social network metrics says more about the metrics that
are already defined for any social network.
• User Interaction - To understand what the right kind of visualization is, we
first need to understand user needs. Taking user needs into account while designing
visualization tools makes it easier to design useful visualizations. The users in our case
are technical people working on simulation models, so the visualizations were designed
from their perspective. They needed to see the data in the form of graphs and plots so
that they could tweak their simulation models based on the actual data that they see.
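The Model item above treats the power law as the statistical model behind a degree plot. As a minimal sketch of how such a model might be estimated before plotting, the code below fits a straight line to a degree histogram on log-log axes. The function name `fit_power_law` and the synthetic histogram are assumptions made for illustration; the thesis does not state which fitting procedure SciBrowser uses.

```python
import math


def fit_power_law(degree_counts):
    """Least-squares fit of log(count) = log(c) - alpha * log(degree).

    degree_counts maps a degree k >= 1 to the number of nodes with that
    degree. Returns (alpha, c) for the model count ~ c * k ** (-alpha).
    """
    xs = [math.log(k) for k in degree_counts]
    ys = [math.log(n) for n in degree_counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free histogram that follows count = 1000 * k ** -2.
example = {k: 1000.0 * k ** -2 for k in range(1, 21)}
alpha, c = fit_power_law(example)
print(round(alpha, 3), round(c, 1))
```

A straight-line fit on log-log axes is the simplest option and matches the line plot described above; on noisy empirical data, maximum-likelihood estimators are generally considered more reliable.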
2.5 Social Network Metrics and Interpretation
The most natural way to represent a social network visually is as a network of connected
nodes, where the nodes are the actors of the network. In this section we
discuss the metrics explained in [36] that are useful in interpreting a social network
with respect to creativity. These metrics have different interpretations for different
types of graphs.
2.5.1 Centrality
Centrality indicates the prestige and influence associated with an actor. Creativity
and centrality are related to each other according to the following propositions.

    "In phase 1 a positive self reinforcing spiral exists between centrality and
    creativity such that an increase in one leads to an increase in the other, until
    centrality becomes constraining. In phase 2 the spiral becomes self-correcting such
    that an increase in centrality no longer leads to an increase in creativity." [31]

    "As an individual becomes more central, his or her creativity should continue
    to increase at a decreasing rate, up to a point. Beyond this point, increases in
    centrality may constrain creativity." [31]

There are three types of centrality:
• Degree Centrality: At the individual level, this metric counts the number of
nodes connected to a given node. One can view this as a measure of activity, in the
sense that a highly active actor will have links to most of the other actors [36]. Degree
centrality for an actor is given by the formula:

\[ C_D(n_i) = \frac{d(n_i)}{g - 1} \tag{2.1} \]

where $d(n_i)$ is the degree of the actor $n_i$, $g$ is the group size, and $(g - 1)$ is
therefore the largest degree an actor can have [36].
For a group, the degree centrality is found by subtracting each individual actor degree
centrality from the maximum degree centrality value and summing the differences; the sum
is then divided by the maximum attainable value of the numerator. The general formula for
the group degree centrality is:

\[ C_D = \frac{\sum_{i=1}^{g} \left[ C_D(n^*) - C_D(n_i) \right]}{g - 2} \tag{2.2} \]

where $C_D(n^*)$ is the maximum individual degree centrality in the group and $(g - 2)$ is
the maximum attainable value of the numerator (star network) [39].
• Closeness Centrality: For an individual actor, closeness centrality determines how
close that actor is to all other actors. It reflects the distance between the actor and
all other actors in the network. The metric is computed from the average distance between
an actor and the other members of the network [31]. It takes into account not only direct
links but also the indirect links an actor needs in order to communicate with all other
actors. Closeness centrality helps in knowledge mobility [9]. The actor closeness
centrality is defined in [36] by the formula:

\[ C_C(n_i) = \frac{g - 1}{\sum_{j=1}^{g} d(n_i, n_j)} \tag{2.3} \]

where $d(n_i, n_j)$ is the shortest distance between nodes $i$ and $j$. Closeness centrality
for node $i$ is the inverse of the sum of the distances from node $i$ to all other nodes in
the graph, so the greater the distance, the lower the closeness, and vice versa. The
minimum value of the sum of distances from any node to all other nodes in a graph with $g$
nodes is $(g - 1)$, so the measure is normalized by $(g - 1)$ in the equation above. The
value of closeness centrality thus varies between 0 and 1.
Closeness centrality for the group is found in a similar manner to degree
centrality. First subtract each individual actor closeness centrality from the maximum
closeness centrality value and add up the differences; the sum is then divided by the
maximum attainable value of the numerator. The following equation gives the formula:

\[ C_C = \frac{\sum_{i=1}^{g} \left[ C_C(n^*) - C_C(n_i) \right]}{[(g - 2)(g - 1)] / (2g - 3)} \tag{2.4} \]

where $C_C(n^*)$ is the maximum individual closeness centrality in the group and
$[(g - 2)(g - 1)] / (2g - 3)$ is the maximum attainable value of the numerator (star
network) [36].
• Betweenness Centrality: This metric highlights the actors that act as mediators
between two actors or groups of actors. Such actors serve as the communication channel
between the two groups or actors, and hence have a high betweenness centrality value.
Betweenness is a measure of how good an actor is at routing information. The following
quote explains betweenness centrality more fully.

    Suppose that in order for [actor] i to contact [actor] j, [actor] k must be
    used as an intermediate station. [Actor] k in such a network has a certain
    "responsibility" to [actors] i and j. If we count all the minimum paths which
    pass through [actor] k, then we have a measure of the "stress" which [actor]
    k must undergo during the activity of the network [36] (page 189)

To calculate the betweenness centrality of a node, we count the shortest paths
that pass through the node, as a fraction of the total number of shortest paths in the
graph. Actor betweenness centrality is defined in [36] by the formula given below:

\[ C_B(n_i) = \frac{\sum_{j<k} g_{jk}(n_i) / g_{jk}}{[(g - 1)(g - 2)] / 2} \tag{2.5} \]

where $g_{jk}(n_i)$ is the number of shortest paths between nodes $j$ and $k$ that pass
through node $n_i$, and $g_{jk}$ is the total number of shortest paths between $j$ and $k$.
The numerator is normalized by the denominator $[(g - 1)(g - 2)] / 2$, which is the largest
value the numerator can take in an undirected graph with $g$ nodes (one shortest path
through $n_i$ for every pair of other nodes). Since we deal with undirected graphs, the
above formula is appropriate for our case.
Group betweenness centrality is calculated in a similar fashion to the degree and
closeness centralities. According to [36], the following equation defines group
betweenness centrality:

\[ C_B = \frac{\sum_{i=1}^{g} \left[ C_B(n^*) - C_B(n_i) \right]}{(g - 1)^2 (g - 2)} \tag{2.6} \]

where $C_B(n^*)$ is the maximum individual betweenness centrality in the group and
$(g - 1)^2 (g - 2)$ is the maximum attainable value of the numerator (star network).
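The actor-level formulas (2.1), (2.3), and (2.5) and the group-level formulas (2.2) and (2.4) can be checked on a small star network, the extreme case the text uses for normalization. The code below is a minimal sketch written for illustration and is not the SciBrowser implementation. All function names are assumptions, the graph is assumed to be connected and undirected, and betweenness is accumulated with Brandes' dependency recursion rather than by enumerating every path.

```python
from collections import deque


def bfs(graph, source):
    """Breadth-first search on an unweighted graph {node: set(neighbours)}.

    Returns hop distances from source, the number of shortest paths from
    source to each node, and the nodes in the order they were visited.
    """
    dist, paths, order = {source: 0}, {source: 1}, []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                paths[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                paths[w] += paths[v]
    return dist, paths, order


def degree_centrality(graph):
    g = len(graph)
    return {n: len(graph[n]) / (g - 1) for n in graph}          # Eq. (2.1)


def closeness_centrality(graph):
    g = len(graph)                                               # connected graph assumed
    return {n: (g - 1) / sum(bfs(graph, n)[0].values()) for n in graph}  # Eq. (2.3)


def betweenness_centrality(graph):
    """Eq. (2.5), accumulated with Brandes' dependency recursion."""
    g = len(graph)
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        dist, paths, order = bfs(graph, s)
        dependency = dict.fromkeys(order, 0.0)
        for w in reversed(order):                # farthest nodes first
            for v in graph[w]:
                if dist[v] == dist[w] - 1:       # v precedes w on a shortest path
                    dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != s:
                score[w] += dependency[w]
    pair_count = (g - 1) * (g - 2) / 2
    # Every unordered pair (j, k) was seen from both ends, hence the factor 2.
    return {n: score[n] / (2 * pair_count) for n in graph}


def group_index(actor_values, max_numerator):
    """Shared form of Eqs. (2.2) and (2.4): summed gaps to the top actor."""
    top = max(actor_values.values())
    return sum(top - value for value in actor_values.values()) / max_numerator


# Star network: actor 0 is linked to actors 1-4, which have no other links.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
g = len(star)
degree = degree_centrality(star)
closeness = closeness_centrality(star)
betweenness = betweenness_centrality(star)
group_degree = group_index(degree, g - 2)
group_closeness = group_index(closeness, (g - 2) * (g - 1) / (2 * g - 3))
print(degree[0], closeness[1], betweenness[0], group_degree, group_closeness)
```

On the star, the central actor reaches the maximum value on all three actor-level measures, and the group-level degree and closeness indices both reach their maximum, which is why the star network supplies the denominators in Eqs. (2.2) and (2.4).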
2.5.2 Density
Density is a measure of the degree of completeness and cohesiveness of the graph [36].
Any graph has a certain maximum number of edges it can contain. Density measures the
number of edges the graph actually has as a fraction of the total number of edges possible.
For an undirected graph the maximum number of edges possible is $n(n - 1)/2$, and if we
denote the number of edges of the graph as $|E|$, then density is defined as:

\[ D = \frac{|E|}{n(n - 1)/2} \tag{2.7} \]
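Equation (2.7) translates directly into code. The sketch below assumes an undirected simple graph stored as a dictionary of neighbour sets; the function name `density` and the four-node example are illustrative and do not come from SciBrowser.

```python
def density(graph):
    """Density of an undirected simple graph given as {node: set(neighbours)}."""
    n = len(graph)
    if n < 2:
        return 0.0   # a single node has no possible edges
    # Each undirected edge appears in two adjacency sets.
    edge_count = sum(len(neighbours) for neighbours in graph.values()) / 2
    return edge_count / (n * (n - 1) / 2)


# A triangle a-b-c with one extra node d attached to c: 4 of 6 possible edges.
triangle_with_tail = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(density(triangle_with_tail))
```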
2.5.3 Clustering Coefficient
The Clustering Coefficient (CC) measures the average fraction of an actor's collaborators
who are also collaborators with one another [35]. A clique is defined as a maximal complete
subgraph of three or more nodes [36]. In the context of a social network, the clustering
coefficient represents how cohesive the group is; a high clustering coefficient represents a
tight circle, or subgroup. Similar to how circles of friends form in social settings, circles of
preferred collaboration can form in socio-technical networks. When the change in this metric
is shown over time, it can show how well a node is integrating with the group.
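The definition above, the average fraction of an actor's collaborators who also collaborate with one another, can be sketched as follows. The function names are illustrative, and treating actors with fewer than two collaborators as contributing zero is an assumption; the thesis does not say how such actors are handled.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0   # assumption: no pairs of neighbours means a value of 0
    linked_pairs = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return linked_pairs / (k * (k - 1) / 2)


def average_clustering(graph):
    """Clustering coefficient of the network: mean of the per-node values."""
    return sum(local_clustering(graph, node) for node in graph) / len(graph)


# A triangle a-b-c with one extra node d attached to c.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_clustering(example, "c"), average_clustering(example))
```

Here actors a and b sit in a closed triangle and score 1, while c scores 1/3 because only one of its three pairs of collaborators is linked.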
2.5.4 Average Path Length
Average path length is the number of edges in the shortest path between two vertices,
averaged over all pairs of vertices [37]. It is a measure of the efficiency with which
information is transferred over the network from one node to another; the smaller the
average path length, the higher the efficiency. A small average path length gives rise to
the phenomenon of small world networks [37]. The issue of small world networks is of great
importance for network studies, as this property directly affects such crucial fields as
information processing in different communication systems, disease or rumor transmission,
and network design and optimization [2]. In relation to OBO, a smaller average path length
fosters knowledge transfer, which can be a vital factor for innovation. The metric can be
defined as follows. Consider an unweighted graph $G$ with the set of vertices $V$.
Let $d(v_1, v_2)$, where $v_1, v_2 \in V$, denote the shortest distance between $v_1$ and
$v_2$. Assume that $d(v_1, v_2) = 0$ if $v_1 = v_2$ or $v_2$ cannot be reached from $v_1$.
Then the average path length is:

\[ APL = \frac{\sum_{i,j}^{n} d(v_i, v_j)}{n(n - 1)} \tag{2.8} \]

where $n$ is the number of vertices in $G$.
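Equation (2.8) can be evaluated with one breadth-first search per vertex. The following is a minimal sketch under the same conventions as the definition above (unweighted graph, unreachable pairs contribute zero); the function names and the four-node path are assumptions chosen for illustration.

```python
from collections import deque


def shortest_distances(graph, source):
    """Hop counts from source to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def average_path_length(graph):
    """Eq. (2.8): summed shortest distances over ordered pairs, over n(n - 1)."""
    n = len(graph)
    total = 0
    for v in graph:
        # Unreachable nodes are absent from the result and so contribute 0,
        # matching the convention d(v1, v2) = 0 stated in the text.
        total += sum(shortest_distances(graph, v).values())
    return total / (n * (n - 1))


# A simple path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(average_path_length(path))
```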
2.5.5 Small World Phenomenon
According to [35], the Clustering Coefficient (CC) and the Average Path Length (APL)
together define the small world network. To determine whether a given network is a small
world network, Watts' model compares the CC and APL of the actual network with those of a
randomly generated graph of the same size. A random graph has a low CC and a low APL. The
Small World Quotient (Q) is the measure of the small world property of the network and is
defined as:

\[ Q = \frac{CC_{\text{ratio}}}{APL_{\text{ratio}}} \tag{2.9} \]

where $CC_{\text{ratio}}$ is defined as:

\[ CC_{\text{ratio}} = \frac{CC_{\text{actual network}}}{CC_{\text{random network}}} \tag{2.10} \]

and $APL_{\text{ratio}}$ is defined as:

\[ APL_{\text{ratio}} = \frac{APL_{\text{actual network}}}{APL_{\text{random network}}} \tag{2.11} \]
The closer the APL ratio is to 1.0 and the more the CC ratio exceeds 1.0, the higher the
small world quotient. In a bipartite (affiliation) network, members on the same team form a
fully linked clique. Clustering includes both the within-team clustering and the
between-team clustering. If the CC ratio is approximately 1.0, then the clustering in the
actual network is mainly the result of within-team clustering. As the CC ratio goes beyond
1.0, there is an increase in the between-team links. Moreover, the cross-team links are
mostly repeated ties, which means that members who have collaborated before prefer to
collaborate across teams with the same people as before. Thus, in bipartite networks the
small world influences behavior through two mechanisms:
    "Structurally, the more a network becomes small worldly (formally, the more
    the small world quotient exceeds 1.0), the more links between clusters increase
    in frequency, which potentially enables the creative material within teams to be
    distributed throughout the global network." [35]

    "Relationally, the more a network becomes small worldly, the more links between
    clusters are made up of repeated ties and third-party ties, which potentially
    increases the level of cohesion in the global network." [35]
Thus, as the small world quotient increases, the level of connectivity between the
different teams within the network increases through cohesive relations among the members
of these teams. This can be considered the reason for their successful collaboration and
creativity.
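Equations (2.9) to (2.11) combine four numbers into the small world quotient. The sketch below shows only that arithmetic; the function name and the input values are made up for illustration and are not measurements from OBO.

```python
def small_world_quotient(cc_actual, apl_actual, cc_random, apl_random):
    """Eq. (2.9): ratio of the CC ratio (2.10) to the APL ratio (2.11)."""
    cc_ratio = cc_actual / cc_random      # Eq. (2.10)
    apl_ratio = apl_actual / apl_random   # Eq. (2.11)
    return cc_ratio / apl_ratio


# Hypothetical values: far more clustered than the random graph of the same
# size, with an average path length that is only slightly longer.
q = small_world_quotient(cc_actual=0.6, apl_actual=3.3,
                         cc_random=0.05, apl_random=3.0)
print(round(q, 2))
```

In practice the two random-network inputs would come from a randomly generated graph of the same size as the actual network, as described in the text.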
Chapter 3
The Organizational Framework of the SciBrowser System
This chapter gives an overview of the application in terms of its important components
and how they communicate with each other. In addition, the language-specific libraries and
frameworks used in the tool are discussed. Broadly speaking, the tool consists of two
parts: the Schema Conversion Subsystem (Chapter 4) and the
SciBrowser Subsystem (Chapter 5).
3.1 Schema Conversion Subsystem
Figure 3.1 below shows the block diagram of the Schema Conversion Subsystem.
SourceForge.net [14] uses relational databases to store project management activity and
statistics. There are over 100 relations (tables) in the data dumps provided to Notre Dame
[33]. SourceForge.net cleanses the data of personal information and strips out all OSTG
(Open Source Technology Group) specific and site-functionality-specific information. On a
monthly basis, a complete dump of the databases (minus the data dropped for privacy and
security reasons) is shared with Notre Dame. The Notre Dame researchers have built a data
warehouse comprising these monthly dumps, with each dump stored in a separate schema.
Thus, each monthly dump is a snapshot of the status of all the SourceForge.net projects at
that point in time. To help researchers determine what data is available, an ER diagram
and the definitions of the tables and views in the data warehouse are provided. Academic
and scholarly researchers are given access through a query portal from which they can
extract the data. We query and extract the project-specific data using the query portal
and load it into a local MySQL database on the server in our lab. Our study requires us to
aggregate the data, so that we have precalculated results from which the metrics needed
for our study can be calculated. Thus, the Schema Converter contains a set of Python
programs that convert the original schema to a new one. The new schema is
called the SciBrowser schema, and it contains the aggregate tables.
Figure 3.1: Schema Conversion Subsystem
3.2 SciBrowser Subsystem
The SciBrowser subsystem is a GUI tool written entirely in Python and is used to
analyze the projects on SourceForge.net [14]. The tool offers three types of analysis, as
shown in Figure 3.2: Structural Analysis, Collaboration Analysis, and Activity Analysis
(explained in detail in Chapter 5). The database behind the tool is the SciBrowser schema
that we obtain by running the Schema Converter code against the SourceForge schema, as
mentioned in the previous section.
Figure 3.2: SciBrowser Subsystem
As shown in Figure 3.2, the tool uses several Python libraries; a brief
description of each library is given below:
• matplotlib [19]
matplotlib is a library for making 2D plots of arrays in Python. Although it has
its origins in emulating the MATLAB [17] graphics commands, it is independent of
MATLAB and can be used in a Pythonic, object-oriented way. Although matplotlib is
written primarily in pure Python, it makes heavy use of NumPy [7] and other extension
code to provide good performance even for large arrays. Simple plots can be created
with matplotlib in just a few commands.
• NetworkX [3]
NetworkX is a Python-based package for the creation, manipulation, and study of
the structure, dynamics, and function of complex networks such as social, biological, and
infrastructure networks. It provides a standard API and/or graph implementation that
is suitable for many applications such as social network analysis. The structure of the
graph or network is encoded in the edges (connections, links, ties, arcs, bonds) between
the nodes (vertices, sites, actors). Various types of graphs (directed, undirected) can
be drawn using NetworkX; moreover, weights can be assigned to nodes and edges if
needed.
• wxPython [32]
wxPython is a GUI toolkit for the Python programming language. It allows Python
programmers to create programs with a robust, highly functional graphical user
interface, simply and easily. It is implemented as a Python extension module (native code)
that wraps the popular wxWidgets, a cross-platform GUI library written in C++.
Like Python and wxWidgets, wxPython is open source, which means that it is free for
anyone to use and the source code is available for anyone to look at and modify. Anyone
can contribute fixes or enhancements to the project. wxPython is a cross-platform
toolkit, which makes it possible for the same program to run on multiple platforms
without modification. Since the language is Python, wxPython programs are simple,
easy to write, and easy to understand.
• wxmpl [24]
Embedding matplotlib in wxPython applications is straightforward, but the default
plotting widget lacks the capabilities necessary for interactive use. WxMpl
(wxPython+matplotlib) is a library of components that provide these missing features
in the form of a better matplotlib FigureCanvas.
Chapter 4
Schema Conversion Subsystem
This chapter gives a detailed description of the SourceForge schema that we harness to
derive a new schema, called the SciBrowser schema. Detailed descriptions of the tables
used in both of these schemas are provided in the sections below.
4.1 Open Biomedical Ontologies Schema, SourceForge Research Data Archives
SourceForge.net [14] stores all its project-related data in relational databases. This
data is made available to researchers through a query portal provided by the University
of Notre Dame [33]. The schema of interest to us is the Artifact Data schema in the
SourceForge database.
4.1.1 Original Schema
Figure 4.1 shows the original database schema provided by SourceForge for Artifact
Data. The tables of interest to our project are described below.
Figure 4.1: Open Biomedical Ontology Schema for Artifact Data
• groups
Various communities, such as Gene Ontology (GO), form part of the Open Biomedical
Ontology (OBO) project on SourceForge. These communities are designated as
groups in the "groups" table of the SourceForge database schema. Each group has an
identification number that uniquely identifies the group.
• artifact_group_list
A group can be generic and might have various specific areas within it. These areas
of specialization are called domains. This table keeps the association between a
group and its domains. Each record in the table represents a domain and is uniquely
identified by an identification number. A group can have multiple domains, whereas a
domain is always associated with a single group.
• users
This table stores the list of all the members of the communities. The members submit
artifacts, comment on artifacts, and collaborate. The table lists all user information,
including names and contact information.
• artifact
The "artifact" table holds the artifacts submitted by the members of the community.
Artifact is a generic term for the numerous types of reports that are attached to a
project. SourceForge automatically defines a number of default artifacts such as bug
reports, feature requests, support requests, and patches; however, projects can also define
their own types of artifacts. Each artifact record has two users associated with
it: one is the person who created or submitted the artifact record, and the other
is the person assigned to the artifact record. When a user contributes an artifact,
the contribution is made to a domain, which is designated by "group_artifact_id".
The same artifact cannot be associated with more than one domain.
• artifact_message
When a member submits an artifact, other users are free to comment or elaborate on
it. Each entry in the "artifact_message" table represents a single elaboration. Details
such as who submitted the elaboration, when it was submitted, which artifact it was
submitted for, and the body of the message are stored in the table. There can be
multiple elaborations on each artifact.
• user_group
When a user registers with a community, he or she becomes affiliated with that group.
The "user_group" table stores this affiliation information. Each user in a group has a
member role that decides the rank of the user within that group. A user can be a
core-developer, co-developer, admin, etc. In addition, a user can be a member of multiple
groups, and a different member role can be assigned to the user for each group.
4.1.2 Extension
With the original schema alone, the calculation of some of the complex metrics becomes
computationally expensive, so we decided to include some additional tables that extend the
original schema and facilitate the calculation of these metrics. The new schema contains
precalculated results stored in tables. This helps reduce the number of input-output calls
made to the database during calculation of the metrics. The extended schema is shown in
Figure 4.2.
Figure 4.2: Extended Schema for Artifact Data
• collaboration
The "collaboration" table stores the data about user-user collaboration that is used to
calculate the "Collaboration Intensity" factor of the "Activity Strength" metric (explained
in Chapter 6). The table stores edges between users for a given group, domain, year,
month, and artifact, which makes it easy to slice and dice the data along
these dimensions and thus provides flexibility.
• contribution
The "contribution" table stores the details of the contributions of individual users
in terms of artifacts and elaborations, and thus helps in calculating the "Contribution"
factor of the "Activity Strength" metric (explained in Chapter 6). The table stores the
number of artifacts and elaborations submitted by each user for a given group, domain,
year, and month, which makes it easy to slice and dice the data along
these dimensions and thus provides flexibility.
• user_group_aff
According to SourceForge, a user is a member of the group with which he or she is
registered. This relationship is misleading, because being registered with one or
more groups does not mean that a user contributes only to those groups. It is
up to the user's discretion to submit an artifact to the domain of his or her choice. If we
relied on this relationship, we would neglect the contributions made by
users to groups with which they are not officially registered. This prompted us to
define a new table, "user_group_aff", which is not available in the original artifact schema.
The table gives the affiliation between users and groups based on a newly defined
relationship that differs from the relationship between user and group at the database
level. Now, a user becomes a member of a group only if he or she contributes an artifact to
the group.
• domain_domain_collaboration
To construct the domain-domain graph, the "domain_domain_collaboration" table
is built so that edges between different domains can be retrieved efficiently from
the database. Moreover, this table also helps in the calculation of certain structural
metrics such as "Preferential Attachment" and "Degree Distribution" (to be discussed in
later sections).
• artifact_data
This table is a combination (equi-join) of the "artifact" and "artifact_message" tables from
the original schema, with the only exception being the date field. The dates in the original
schema tables are stored in UNIX epoch time, but we need the dates in
the form of month and year. The dates are therefore converted
to month and year and stored in this table. To differentiate the users who
submit an artifact from the users who elaborate on it, the table has a column called flag.
A flag value of 'S' indicates that the user submitted the artifact, whereas
'E' indicates that the user elaborated on the artifact.
As Figure 4.2 shows, the extended schema forms a star in which the tables
"collaboration", "contribution", "domain_domain_collaboration", and "artifact_data" are the
FACTS, surrounded by the DIMENSIONS "groups", "artifact", "artifact_group_list",
and "users", which are tables from the original schema.
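The conversion described above, joining artifacts with their elaborations, turning UNIX epoch dates into year and month, flagging each row as 'S' or 'E', and then aggregating per user, can be sketched as follows. This is an illustration under stated assumptions rather than the Schema Converter code. An in-memory SQLite database stands in for the local MySQL server, all column names (for example `submitted_by`, `open_date`, `adddate`, `domain_id`) are hypothetical, and the sample rows are invented.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")   # stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE artifact (artifact_id INTEGER, group_artifact_id INTEGER,
                           submitted_by INTEGER, open_date INTEGER);
    CREATE TABLE artifact_message (id INTEGER, artifact_id INTEGER,
                                   submitted_by INTEGER, adddate INTEGER);
    CREATE TABLE artifact_data (artifact_id INTEGER, domain_id INTEGER,
                                user_id INTEGER, year INTEGER, month INTEGER,
                                flag TEXT);
""")


def epoch(year, month, day):
    """UNIX epoch seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def year_month(epoch_seconds):
    """Convert UNIX epoch seconds to a (year, month) pair in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.year, moment.month


# Invented sample rows: two artifacts in domain 10 and three elaborations.
conn.executemany("INSERT INTO artifact VALUES (?, ?, ?, ?)",
                 [(1, 10, 100, epoch(2009, 3, 15)),
                  (2, 10, 101, epoch(2009, 4, 2))])
conn.executemany("INSERT INTO artifact_message VALUES (?, ?, ?, ?)",
                 [(1, 1, 101, epoch(2009, 3, 20)),
                  (2, 1, 102, epoch(2009, 4, 1)),
                  (3, 2, 100, epoch(2009, 4, 5))])

rows = []
# Submitters come straight from the artifact table and are flagged 'S'.
for artifact_id, domain_id, user_id, when in conn.execute(
        "SELECT artifact_id, group_artifact_id, submitted_by, open_date "
        "FROM artifact"):
    rows.append((artifact_id, domain_id, user_id, *year_month(when), "S"))
# Elaborators come from the equi-join with artifact_message, flagged 'E'.
for artifact_id, domain_id, user_id, when in conn.execute(
        "SELECT a.artifact_id, a.group_artifact_id, m.submitted_by, m.adddate "
        "FROM artifact a JOIN artifact_message m "
        "ON a.artifact_id = m.artifact_id"):
    rows.append((artifact_id, domain_id, user_id, *year_month(when), "E"))
conn.executemany("INSERT INTO artifact_data VALUES (?, ?, ?, ?, ?, ?)", rows)

# Precalculated per-user counts, in the spirit of the "contribution" table.
contribution = conn.execute("""
    SELECT user_id, domain_id, year, month,
           SUM(flag = 'S') AS artifacts, SUM(flag = 'E') AS elaborations
    FROM artifact_data
    GROUP BY user_id, domain_id, year, month
    ORDER BY user_id, year, month
""").fetchall()
for row in contribution:
    print(row)
```

Storing the year and month as separate columns is what makes the later grouping by time frame a plain `GROUP BY`, which is the flexibility the fact tables are meant to provide.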
Chapter 5
Implementation of SciBrowser Tool
This chapter gives the details of the implementation of the SciBrowser tool. It explains
the software engineering approach that was followed, in terms of the stages of the software
process. The chapter is divided into four sections: Requirement Analysis, Design,
Verification, and Validation. Each of these sections corresponds to a specific stage of the
software process that we follow.
5.1 Requirement Analysis
5.1.1 Purpose
The primary purpose of the project is to analyze the Open Biomedical Ontology (OBO)
network from the perspective of Complex Adaptive Systems (CAS). The analysis includes
demonstrating the properties of a CAS, such as the power law, scale-free network structure,
and preferential attachment. Our objective is to search for these properties and find out
whether or not they exist. These properties can be observed in the different types of
graphs mentioned in [25], where the interpretation of the properties varies accordingly.
The properties targeted are mainly the structural attributes of a social network. Apart
from the structural aspects, the tool also focuses on the behavioral properties of the
social network, which include the collaboration as well as the activity taking place in
the network. It visualizes and quantifies the collaborations between members of the
network, and also displays the activity distributions over time.
The requirements are listed below:
• Visualize the Degree Distribution at the Group/Domain level for a given time
frame.
• Visualize the Preferential Attachment phenomenon at the Group/Domain/User/Artifact
level for a given time frame.
• Visualize the Activity Strength Distribution of users at the Group/Domain level
for a given time frame.
• Visualize the Artifact Contribution Distribution of users at the Group/Domain level
for a given time frame.
• Visualize the Collaboration Intensity Distribution of users at the Group/Domain
level for a given time frame.
• Visualize the Activity Distribution, which includes the Active User distribution
and the Artifact Contribution distribution, at the Group/Domain level for
a given time frame.
• Visualize both the cumulative and the non-cumulative forms of activity.
• Aggregate the activity and preferential attachment plots by month within
a given time frame.
• Compare the domains in terms of their activity.
• Make the plots displayed in the image panel interactive, in the sense that the user
should be able to zoom in and out in order to see the fine details of the plot.
Options for each metric are listed in Tables 5.1, 5.2, and 5.3.
All the metrics under discussion are calculated for the selected group and domain during
a certain time frame, and in some cases for certain artifacts or users as well. All these
parameters are required as input for the calculation of the metrics. The user should be
able to select the parameters required to calculate the metrics from the drop-down choices
provided on the panel and see the result.

Type of Metric | Metric | User Options
User User | Preferential Attachment | Type of Plot, Group, Domain, From Year, From Month, To Year, To Month, User, Time Interval
User User | Degree Distribution | Type of Plot, Group, Domain, Year, Month
User Artifact User | Preferential Attachment | Type of Plot, Group, Domain, From Year, From Month, To Year, To Month, Artifact, Time Interval
User Artifact User | Artifact Degree Distribution | Type of Plot, Group, Domain, Year, Month
User Artifact User | User Degree Distribution | Type of Plot, Group, Domain, Year, Month
User Domain | Preferential Attachment | Type of Plot, Group, From Year, From Month, To Year, To Month, Domain, Time Interval
User Domain | Domain Degree Distribution | Type of Plot, Group, Year, Month
User Domain | User Degree Distribution | Type of Plot, Group, Year, Month
Domain Domain | Preferential Attachment | Type of Plot, Group, From Year, From Month, To Year, To Month, Domain, Time Interval
Domain Domain | Degree Distribution | Type of Plot, Group, Year, Month
Table 5.1: Structural Analysis Metrics

Metric | User Options
Artifact Contribution Distribution | Type of Plot, Group, Domain, Year, Month
Activity Strength Distribution | Type of Plot, Group, Domain, Year, Month
User Collaboration Map | Type of Plot, Group, Domain, Year, Month
Collaboration Intensity Distribution | Type of Plot, Group, Domain, Year, Month
Table 5.2: Collaboration Analysis Metrics

Type of Metric | Metric | User Options
Cumulative/Non-Cumulative | Active User Distribution | Type of Plot, Group, Domain, From Year, From Month, To Year, To Month, Time Interval
Cumulative/Non-Cumulative | Contribution Distribution | Type of Plot, Group, Domain, From Year, From Month, To Year, To Month, Time Interval
Cumulative/Non-Cumulative | Domain Active User Comparison | Type of Plot, Group, Domain List, From Year, From Month, To Year, To Month, Artifact, Time Interval
Cumulative/Non-Cumulative | Domain Contribution Comparison | Type of Plot, Group, Domain List, From Year, From Month, To Year, To Month, Artifact, Time Interval
Table 5.3: Activity Analysis Metrics
5.2 Design
Figure 5.1 shows the comprehensive class diagram of the SciBrowser application. Details
such as method signatures and variable names have been omitted from the class diagram
because of space constraints. To start with, we focus on the packages that make up this
class diagram; they are listed below.
• Data:
The Data package is composed of the classes that hold the data to be displayed on the user
interface as options on the control panel (ControlPanelOptions), and
the metadata used to calculate the metrics. The class ControlPanelOptions
holds the data to be displayed on the control panel for all three pages. The
class Data holds the different types of method-binding data. For example, each metric
has a specific method used for its calculation; thus, there is a binding between the metric
name and its method name in the form of a dictionary, a data structure in Python.
The class LoadParameterData is used to load the parameter values into the control
panel directly from the database. The options for the different groups or domains used
in our analysis are loaded using this class. Finally, the Parameter class holds the
parameters selected by the user, which are then used in the metric calculation.
• Metric Interfaces:
The MetricInterfaces package holds the common interfaces used to calculate all three
types of metrics (discussed in detail in the following sections). IMetrics is the
interface for metrics, IToolkit is the interface for all the toolkits used in the
calculation of metrics, and IQueryConstructor is the interface for all the query
constructors used to generate queries. Although Python does not have a dedicated concept
of interfaces, we still enforce this concept by creating classes that contain
unimplemented methods. If a class defines only method signatures, with the "pass" keyword
as the body of each method, we call it an "Interface".
• Connection:
The Connection package has two classes: DatabaseConfig and DBConnect. The class
DatabaseConfig holds all the parameters, such as host, port number, database name,
username, and password, that are required to connect to a database. All these parameters
are static (class) variables. The class is not a singleton, i.e., we can create multiple
instances of it. Because the parameters are shared at class level, we can get or set any
of them by creating an instance of the class and calling its getter and setter methods.
The class DBConnect returns the connection object used to query the database during
metric calculation. Again, the connection object is a static variable, which provides a
single point of reference for the connection.
• Threading:
The class Thread in the threading module is used to generate a background thread for
metric calculation or for connecting to the database. This makes it possible to display
a progress bar to the user while the calculation or connection process runs in the
background.
Figure 5.1: Comprehensive Class Diagram
• GUI:
A detailed view of the GUI class diagram can be seen in figure 5.11. Figure 5.1 shows
only the important parts of the GUI module. The GUI package consists of a number of
sub-packages. The package GUI.MVC is dedicated to the implementation of the Model View
Controller [21] design pattern. Both ImagePanel and AnalysisTreebookPanel (which is part
of the control panel) act as views that call the Controller instance. The Controller
instantiates the Parameter class and stores all the parameter values in it. The instance
of the class Parameter is passed to the Model for metric calculation. The Model calls an
instance of the class CalculationThread to calculate the metric, and then updates the
ImagePanel. ImagePanel registers with the Model when the application starts; thus, the
Model knows which view it needs to update. The GUI.BackgroundProcess package works in
association with the threading module to generate background threads for metric
calculation and connection to the database.
• Factory:
The Factory package comes into play when the class Model (in the GUI.MVC package)
needs to calculate a metric. The Factory returns the right kind of object for the metric
to be calculated. The Factory package is explained in detail in the Metric Selection
section below.
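Two of the conventions described for the Data and MetricInterfaces packages, an "Interface" written as a class whose methods contain only pass and a dictionary that binds a metric name to its method name, can be sketched as follows. This is a minimal illustration and not the SciBrowser source: ExampleMetrics, degree_distribution, METRIC_METHODS, and calculate are hypothetical names, and the parameters are passed as a plain dictionary rather than a Parameter instance.

```python
class IMetrics:
    """Interface-style class: only the signature is given, the body is `pass`."""

    def degree_distribution(self, parameters):
        pass


class ExampleMetrics(IMetrics):
    """Hypothetical concrete class that implements the interface method."""

    def degree_distribution(self, parameters):
        # Toy computation: count how many edges touch each user.
        degrees = {}
        for first, second in parameters["edges"]:
            degrees[first] = degrees.get(first, 0) + 1
            degrees[second] = degrees.get(second, 0) + 1
        return degrees


# Binding between a metric's display name and the name of its method
# (assumed layout of the dictionary; the real keys are not given in the text).
METRIC_METHODS = {"Degree Distribution": "degree_distribution"}


def calculate(metric_name, metrics, parameters):
    """Look up the bound method by name and call it."""
    method = getattr(metrics, METRIC_METHODS[metric_name])
    return method(parameters)


result = calculate(
    "Degree Distribution",
    ExampleMetrics(),
    {"edges": [("user1", "user2"), ("user1", "user3")]},
)
```

Because the lookup goes through the dictionary, adding a metric in this sketch only requires a new method and a new dictionary entry; the dispatching function stays unchanged.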
5.2.1 Helper Tables
The database schema provided by SourceForge is, on its own, insufficient for calculating
the metrics efficiently. To solve this problem, we designed a new schema that stores the
data in a format suited to the metrics. The new schema speeds up the calculation of
metrics by reducing the number of input-output calls made to the database, and it forms
the basis for the calculation of all the metrics we need. Tables in the new schema are
designated Helper Tables, as they greatly assist in the efficient calculation of metrics.
The conversion from the old schema to the new schema is done through a set of Python
programs.
5.2.2 Metrics
Metrics are classified into three types: Structural Analysis, Collaboration Analysis, and
Activity Analysis. Although these are significantly different modules, they share the
same dependency structure, which we call the three-level dependency structure. The module
at the top of the structure is the high-level module, named the Metric module. The Metric
module is represented by an interface that defines the methods that calculate metrics.
The actual implementation of each metric method is provided by the classes that implement
the interface. In addition, a concrete class can define its own methods to calculate
metrics specific to it.
The module at the second level is called the Toolkit module and is represented by an
interface that defines the common methods needed by the Metric module. The Metric module
therefore depends on the Toolkit module. Any concrete class that implements the Toolkit
interface can also define its own methods.
The module at the third and bottommost level is called the Query Constructor module and
is also represented by an interface, which defines the common methods required by the
Toolkit module. The Toolkit module communicates with the database to acquire data, and
the Query Constructor module assists it by providing the SQL queries that need to be
executed against the database to fetch the required data.
At each level there are interfaces, and higher-level concrete classes depend on the
interfaces defined at the level below them. For instance, concrete classes in the Metric
module implement the higher-level Metric interface and also depend on the Toolkit
interface defined in the level below them. The relationship between the Toolkit and Query
Constructor modules is similar. This implementation is inspired by the Dependency
Inversion Principle [22] (one of the five SOLID principles). The principle states that
details should depend on abstractions, but abstractions should not depend on details. In
our case, the concrete classes in the higher-level module depend on the abstractions in
the lower-level module. As a result, every level has to be represented by an interface.
This approach has the following advantages:
• Lower-level modules can be modified without modifying the higher-level modules,
because the higher-level modules depend on abstractions rather than on concrete
implementations. As long as the interface requirements are met, there is no need to
change the higher-level modules that depend on the interface.
• The software can easily be extended with new code without modifying the existing
code. According to the Open Closed Principle (OCP) [23] (another of the SOLID
principles), software should be built in such a way that its entities (classes,
modules, functions, etc.) are open for extension but closed for modification. This can
be achieved via the Dependency Inversion Principle. Suppose that in the future we need
to extend the software with a new type of metric; a new method can then be added to the
existing class that implements the Metric interface, and corresponding Toolkit and
Query Constructor methods can be added to the classes in their respective modules if
required. Thus, existing code does not have to be tampered with. In the absence of this
architecture, changes would have to be made in the higher-level class every time it
came to depend on a new lower-level class.
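The three-level dependency structure can be illustrated with the following sketch, in which each level talks only to the interface of the level below it. All class and method names here are hypothetical, the artifact table holds toy rows, and an in-memory SQLite database stands in for the project database; the sketch shows the shape of the dependencies, not the actual SciBrowser classes.

```python
import sqlite3


class IQueryConstructor:
    """Third level: supplies SQL text."""

    def submissions_per_user_query(self):
        pass


class IToolkit:
    """Second level: fetches data using queries from the third level."""

    def submissions_per_user(self):
        pass


class IMetrics:
    """First level: calculates metrics using data from the second level."""

    def contribution_share(self):
        pass


class ExampleQueryConstructor(IQueryConstructor):
    def submissions_per_user_query(self):
        return "SELECT submitted_by, COUNT(*) FROM artifact GROUP BY submitted_by"


class ExampleToolkit(IToolkit):
    def __init__(self, connection, query_constructor):
        self.connection = connection
        self.query_constructor = query_constructor

    def submissions_per_user(self):
        query = self.query_constructor.submissions_per_user_query()
        return dict(self.connection.execute(query).fetchall())


class ExampleMetrics(IMetrics):
    def __init__(self, toolkit):
        # Only the IToolkit methods are used here, never toolkit internals.
        self.toolkit = toolkit

    def contribution_share(self):
        counts = self.toolkit.submissions_per_user()
        total = sum(counts.values())
        return {user: count / total for user, count in counts.items()}


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE artifact (artifact_id INTEGER, submitted_by INTEGER)")
connection.executemany(
    "INSERT INTO artifact VALUES (?, ?)", [(1, 2), (2, 1), (3, 4), (4, 2)]
)
toolkit = ExampleToolkit(connection, ExampleQueryConstructor())
share = ExampleMetrics(toolkit).contribution_share()
```

Swapping in a different query constructor or toolkit that honors the same method names would leave ExampleMetrics untouched, which is the flexibility the first advantage describes.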
Each category of metrics is discussed below in detail, along with its class diagram.
Structural Analysis
The structural analysis module is shown in figure 5.2. As there are four different types
of graphs, a metric can have four different interpretations. For instance, the method
popularityGrowthRate(), which shows preferential attachment, has a different
interpretation for each of the four graph types. Thus, the Metric module has an interface
IGraphMetrics, and each class that implements the interface implements the metric in its
own manner. In addition, the concrete classes define their own metrics.
In the Toolkit module (second level), the class GraphToolkit implements the interface
IGraphToolkit, which has the signatures of the common methods used by all the classes in
the Metric module. Additional methods needed by the UserArtifactUserGraphToolkit and
UserDomainGraphToolkit classes are defined in the corresponding classes. These classes
inherit from the GraphToolkit class.
The third level is the Query Constructor level. It takes care of all the SQL query
generation required by the Toolkit level: the Toolkit level communicates with the
database to query for data, and the queries it needs are provided by the Query
Constructor level. The interface IGraphQueryConstructor declares the methods that have
different implementations for different types of graphs. Four classes (one for each type
of graph) implement this interface. Methods specific to a certain type of graph are
defined in the corresponding class.
Collaboration Analysis
The collaboration analysis module is shown in figure 5.3. These metrics do not depend on
the type of graph. In fact, there is currently a single implementation of the
collaboration metrics, as can be seen from the Metric module in figure 5.3. However, we
remain open to different interpretations and implementations of these metrics, and we
have therefore created an interface, ICollaborationMetrics, that holds their signatures.
Figure 5.2: Structural Analysis Module
The Toolkit class in the Toolkit module implements the ICollaborationMetricsToolkit
interface and inherits from the CollaborationMetricsToolkit class. The reason for this
design is that certain methods, such as getting the user list or domain list from the
database, are standard methods with just a single implementation. They are used
throughout the project and are therefore defined in a concrete class on which the
higher-level classes depend. Although this goes against the Dependency Inversion
Principle, it makes sense, because ultimately we pass an instance of the Toolkit class as
an argument to the methods in the higher-level class. This instance is of both types,
ICollaborationMetricsToolkit and CollaborationMetricsToolkit, and hence it has access to
both the implemented methods and the base class methods. If we define a new class that
implements ICollaborationMetricsToolkit, we do not have to redefine the standard methods
of CollaborationMetricsToolkit in the new class; the new class obtains them by inheriting
from CollaborationMetricsToolkit. The same applies to the Query Constructor module.
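The arrangement of an interface plus a concrete base class can be sketched with Python's multiple inheritance. The three class names come from the text; the method names (get_user_list, get_domain_list, domains_of_user) and the in-memory rows that stand in for database results are assumptions made for illustration.

```python
class ICollaborationMetricsToolkit:
    """Interface-style class: the method body is intentionally empty."""

    def domains_of_user(self, user):
        pass


class CollaborationMetricsToolkit:
    """Concrete base class with the standard, single-implementation helpers."""

    def __init__(self, rows):
        # rows: (user, domain) pairs standing in for database query results
        self.rows = rows

    def get_user_list(self):
        return sorted({user for user, _ in self.rows})

    def get_domain_list(self):
        return sorted({domain for _, domain in self.rows})


class Toolkit(CollaborationMetricsToolkit, ICollaborationMetricsToolkit):
    """Inherits the standard helpers and implements the interface method."""

    def domains_of_user(self, user):
        return sorted({domain for u, domain in self.rows if u == user})


toolkit = Toolkit([("user1", 1), ("user2", 1), ("user1", 2)])
users = toolkit.get_user_list()
domains = toolkit.domains_of_user("user1")
```

An instance of Toolkit passes an isinstance check against both parents, so a higher-level class that receives it can call the interface method and the inherited helpers alike.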
Figure 5.3: Collaboration Analysis Module
Activity Analysis
In this project we target two types of activity: Cumulative and Non-Cumulative. As a
result, we have two implementations of each metric method defined in the interface
IActivityMetrics. In the Toolkit module there is a single implementation of the
IActivityMetricsToolkit interface, and in the Query Constructor module there are two
query builder classes, as shown in figure 5.4.
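The idea of two implementations behind one metric interface can be sketched as below. The sketch assumes that "Cumulative" means a running total over the time intervals and "Non-Cumulative" means the count within each interval; the concrete class names and the method name contribution_distribution are hypothetical.

```python
from itertools import accumulate


class IActivityMetrics:
    """Interface-style class for activity metrics."""

    def contribution_distribution(self, counts_per_interval):
        pass


class NonCumulativeActivityMetrics(IActivityMetrics):
    def contribution_distribution(self, counts_per_interval):
        # Report the count observed within each interval.
        return list(counts_per_interval)


class CumulativeActivityMetrics(IActivityMetrics):
    def contribution_distribution(self, counts_per_interval):
        # Report the running total up to and including each interval.
        return list(accumulate(counts_per_interval))


counts = [1, 0, 0, 2]
non_cumulative = NonCumulativeActivityMetrics().contribution_distribution(counts)
cumulative = CumulativeActivityMetrics().contribution_distribution(counts)
```

Both classes accept the same input, so the caller can switch between the two views without changing how the data is fetched.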
Figure 5.4: Activity Analysis Module
5.2.3 Metric Selection
The sections above are dedicated to the metric calculation that happens in the backend.
This section concentrates on how the intended metric is chosen for calculation based on
the user input. The user selects a set of parameters on the control panel that decide
which metric should be drawn on the screen. As seen in the sections above, there are
three categories of metrics. To calculate a metric of any category, we need to select and
instantiate three classes because of the three-level dependency structure. For instance,
to calculate the Degree Distribution for the User-Artifact-User graph, the classes that
need to be instantiated are UserArtifactUserMetrics (from the Metric module),
UserArtifactUserGraphToolkit (from the Toolkit module), and
UserArtifactUserGraphQueryConstructor (from the Query Constructor module).
Metric selection can therefore be achieved by applying the Factory Method pattern [12],
as shown in figure 5.5. Three factories are built, one for each type of metric:
structural, collaboration, and activity. Each factory class has three methods, with the
same signatures across factories, for producing an instance of a concrete class in the
Metric module, the Toolkit module, and the Query Constructor module for the given metric.
Once the three factories share an abstract class or interface, this structure closely
resembles the Abstract Factory pattern [12]. Thus, we have Abstract Metric Factory as the
interface for the three factories. This also requires a common interface for each type of
metric, toolkit, and query constructor class. To select the appropriate factory class
based on the category of metric, we have a wrapper class called Factory Selector around
the three factory classes.
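The factory arrangement can be sketched as follows. Only the roles (an abstract metric factory, three concrete factories, and a factory selector) come from the text; every class and method name in the sketch is hypothetical, and the product classes are empty placeholders.

```python
class AbstractMetricFactory:
    """Interface-style class shared by the three factories."""

    def create_metrics(self):
        pass

    def create_toolkit(self):
        pass

    def create_query_constructor(self):
        pass


# Placeholder product classes; the real ones would hold the actual logic.
class StructuralMetrics: pass
class StructuralToolkit: pass
class StructuralQueryConstructor: pass
class CollaborationMetrics: pass
class CollaborationToolkit: pass
class CollaborationQueryConstructor: pass
class ActivityMetrics: pass
class ActivityToolkit: pass
class ActivityQueryConstructor: pass


class StructuralFactory(AbstractMetricFactory):
    def create_metrics(self):
        return StructuralMetrics()

    def create_toolkit(self):
        return StructuralToolkit()

    def create_query_constructor(self):
        return StructuralQueryConstructor()


class CollaborationFactory(AbstractMetricFactory):
    def create_metrics(self):
        return CollaborationMetrics()

    def create_toolkit(self):
        return CollaborationToolkit()

    def create_query_constructor(self):
        return CollaborationQueryConstructor()


class ActivityFactory(AbstractMetricFactory):
    def create_metrics(self):
        return ActivityMetrics()

    def create_toolkit(self):
        return ActivityToolkit()

    def create_query_constructor(self):
        return ActivityQueryConstructor()


class FactorySelector:
    """Wrapper that picks the factory for a metric category."""

    factories = {
        "structural": StructuralFactory,
        "collaboration": CollaborationFactory,
        "activity": ActivityFactory,
    }

    def select(self, category):
        if category not in self.factories:
            raise ValueError("Unknown metric category: %s" % category)
        return self.factories[category]()


factory = FactorySelector().select("activity")
metrics = factory.create_metrics()
toolkit = factory.create_toolkit()
query_constructor = factory.create_query_constructor()
```

The caller asks the selector for a factory by category and then requests the three objects through the shared method names, without naming any concrete class.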
5.2.4 GUI
When designing a graphical user interface (GUI), we should use a solid design pattern or
model so that the GUI is robust and smooth in all its transitions. If a GUI contains
several views or screens, or complex controls, it is unwise to create it on the fly
without any prior design or underlying design model.
Figure 5.5: Metric Factory
MVC (Model View Controller) [21] is one such design pattern, used to model complex user
interfaces. The MVC metaphor imposes a separation of behavior between the actual model of
the application domain, the views used for displaying the state of the model, and the
editing or control of the model and views. Figure 5.6 shows the interaction between the
different modules of MVC.
Figure 5.6: Model-View-Controller
The Model manages the information and domain logic, which in our case is the calculation
of metrics. Our application uses a backend database that stores all the raw data, and the
calculation of metrics is built upon the data queried from the database; the data access
layer is therefore assumed to be encapsulated by the Model. When the Model changes its
state, it notifies all the views about the state change. The View is the visual
representation of the Model and comprises the screens and widgets used within the
application. The view is shown in figure 5.7. We have just one view (the image panel) to
be updated. The Controller responds to user inputs such as button clicks, data entry, or
menu selection. It acts as the link between the user and the application. Once a request
is received by the Controller, it instructs the Model to perform certain actions by
making calls on Model objects.
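The interaction between the three MVC roles can be sketched without any GUI toolkit. The class names Model, Controller, ImagePanel, and Parameter follow the text; the method names (register, calculate, update, on_draw) and the string that stands in for a computed plot are assumptions, and the background thread used by the application is left out for brevity.

```python
class Parameter:
    """Holds the options selected by the user."""

    def __init__(self, **values):
        self.values = values


class Model:
    def __init__(self):
        self.views = []
        self.result = None

    def register(self, view):
        # Views register once so the model knows whom to notify.
        self.views.append(view)

    def calculate(self, parameter):
        # Stand-in for the real metric calculation.
        self.result = "plot for %s" % parameter.values["metric"]
        for view in self.views:
            view.update(self)


class ImagePanel:
    """The single view: redraws itself from the model's state."""

    def __init__(self):
        self.shown = None

    def update(self, model):
        self.shown = model.result


class Controller:
    def __init__(self, model):
        self.model = model

    def on_draw(self, **options):
        # Store the user's choices, then ask the model to do the work.
        self.model.calculate(Parameter(**options))


model = Model()
panel = ImagePanel()
model.register(panel)
Controller(model).on_draw(metric="Degree Distribution", group=1)
```

The view never computes anything and the controller never draws anything; each role can be changed without touching the other two as long as the method names stay the same.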
The GUI of the tool is structured as shown in figure 5.7. The GUI window is divided into
two segments called the image panel and the control panel. The image panel displays the
graphs and plots, whereas the control panel displays the options to be selected by the
user to obtain the desired plot. The control panel is a notebook with three pages:
Structural Analysis, Collaboration Analysis, and Activity Analysis. Figures 5.8, 5.9, and
5.10 show sample graphs for the three different types of analysis. Each page corresponds
to a specific type of analysis described in the "Metrics" subsection under "Design". A
tree structure of the metrics available under the given analysis is displayed on the left
side, whereas the options associated with each metric are displayed on the right side.
When the user selects the appropriate set of options and clicks the "Draw" button, the
request goes to the controller, which in turn sets the parameters in the Parameter class
and instructs the model to calculate the desired metric. The model runs the calculation
in a background thread and shows a progress bar to the user indicating that the
calculation is taking place. When the calculation is done, the model notifies the image
panel, and the image panel updates itself with the new data from the model. The class
diagram for the GUI package (figure 5.1) is shown in figure 5.11.
Figure 5.7: GUI Snapshot
Figure 5.8: Structural Analysis Tab
Figure 5.9: Collaboration Analysis Tab
Figure 5.10: Activity Analysis Tab
Figure 5.11: GUI Class Diagram
artifact_id  group_artifact_id  submitted_by  open_date
3            2                  4             1107306000
4            1                  2             1109725200
2            3                  1             1136163600
1            1                  2             1159750800
5            3                  3             1178067600
6            4                  1             1183338000
Table 5.4: artifact
5.3 Verification/Testing
The calculated metrics had to be tested for correctness before they were transformed
into plots. Figure 5.12 shows the components involved in testing. Each of these
components is explained below.
5.3.1 Test Database
Figure 5.12: Testing Framework
user_id  user_name
1        user1
2        user2
3        user3
4        user4
5        user5
6        user6
7        user7
Table 5.5: users
group_id  unix_group_name
1         group1
2         group2
Table 5.6: groups
group_id  user_id  member_role  admin_flags
1         1        0            Y
1         2        100          N
1         4        101          N
1         5        100          N
2         5        0            Y
2         6        101          N
2         7        100          N
2         1        100          N
Table 5.7: user_group
group_artifact_id  group_id
1                  1
2                  2
3                  2
4                  1
Table 5.8: artifact_group_list
(a) User User Network (b) User Artifact User Network
Figure 5.13: Test Network
A test database, "obo_test", is created to test the metrics. The schema of the test
database is a replica of the original OBO database and holds test data loaded using SQL
scripts. The test data is generated from the hypothetical network shown in figure 5.13.
The hypothetical network is constructed to cover the boundary cases as well as
intermediate test cases. The network is first sliced into time frames in such a way that
the time frames cover all the test cases. The test data is shown in tables 5.4 through
5.9.
5.3.2 Metric Verification Module
The Metric Verification Module consists of a Test Module for each category of metrics;
there are three Test Modules, corresponding to the three types of metrics described in
the design section. Each Test Module consists of unit test cases written in Python and
run with PyUnit, Python's unit testing framework. Each unit test restricts the network to
a certain time frame using the given parameters. Test cases and their results for some of
the metrics are shown in tables 5.10, 5.11, and 5.12.
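The first three test cases of table 5.10 can be written as PyUnit tests along the following lines. This is a sketch under stated assumptions, not the thesis test code: degree_distribution is a hypothetical stand-in that works on in-memory copies of the artifact and artifact message test tables instead of the test database. It assumes that the domain parameter selects group_artifact_id, that the graph is the User-User graph accumulated up to the end of the selected month (in UTC), and it ignores the group parameter.

```python
import unittest
from datetime import datetime, timezone
from itertools import combinations

# (artifact_id, group_artifact_id, submitted_by, open_date), from Table 5.4
ARTIFACTS = [(3, 2, 4, 1107306000), (4, 1, 2, 1109725200), (2, 3, 1, 1136163600),
             (1, 1, 2, 1159750800), (5, 3, 3, 1178067600), (6, 4, 1, 1183338000)]
# (artifact_id, submitted_by, adddate), from Table 5.9
MESSAGES = [(1, 1, 1160010000), (1, 3, 1165280400), (5, 1, 1178326800),
            (5, 5, 1178499600), (5, 4, 1181005200), (5, 6, 1188954000),
            (5, 3, 1188954001), (6, 5, 1186275600), (6, 6, 1189126800),
            (6, 7, 1199494800), (6, 5, 1186275601)]


def month_end(year, month):
    """Unix timestamp (UTC) of the first instant after the given month."""
    if month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return datetime(year, month, 1, tzinfo=timezone.utc).timestamp()


def degree_distribution(domain, year, month):
    """Degree of each user in the assumed cumulative User-User graph."""
    cutoff = month_end(year, month)
    participants = {}  # artifact_id -> users who touched it before the cutoff
    for artifact_id, group_artifact_id, user, opened in ARTIFACTS:
        if group_artifact_id == domain and opened < cutoff:
            participants[artifact_id] = {user}
    for artifact_id, user, added in MESSAGES:
        if artifact_id in participants and added < cutoff:
            participants[artifact_id].add(user)
    if not participants:
        raise ValueError("No Data Available for selected set of Parameters.")
    neighbours = {u: set() for users in participants.values() for u in users}
    for users in participants.values():
        for first, second in combinations(sorted(users), 2):
            neighbours[first].add(second)
            neighbours[second].add(first)
    return {"user%d" % u: len(n) for u, n in neighbours.items()}


class DegreeDistributionTest(unittest.TestCase):
    def test_no_user_exists(self):
        with self.assertRaises(ValueError):
            degree_distribution(domain=1, year=2005, month=1)

    def test_user_without_degree(self):
        self.assertEqual(degree_distribution(1, 2005, 3), {"user2": 0})

    def test_user_with_degree(self):
        self.assertEqual(degree_distribution(1, 2006, 10), {"user1": 1, "user2": 1})
```

Saved as a module, the tests can be run with `python -m unittest`. The full-graph rows of table 5.10 depend on group handling that this sketch does not model, so they are not reproduced here.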
id  artifact_id  submitted_by  adddate
1   1            1             1160010000
2   1            3             1165280400
3   5            1             1178326800
4   5            5             1178499600
5   5            4             1181005200
6   5            6             1188954000
7   5            3             1188954001
8   6            5             1186275600
9   6            6             1189126800
10  6            7             1199494800
11  6            5             1186275601
Table 5.9: artifact_message
Test Case | Description | Result
group = 1, domain = 1, year = 2005, month = 1 | Boundary condition, no user exists | { }, raise exception "No Data Available for selected set of Parameters."
group = 1, domain = 1, year = 2005, month = 3 | Boundary condition, user exists but does not have a degree | {"user2": 0}
group = 1, domain = 1, year = 2006, month = 10 | General test case where user exists and has a degree | {"user2": 1, "user1": 1}
group = 2, domain = 0, year = 2008, month = 1 | Full graph condition for group id = 2 | {"user1": 4, "user5": 4, "user3": 4, "user4": 4, "user6": 4}
group = 0, domain = 0, year = 2008, month = 1 | Full graph condition for all groups, i.e., cumulative | {"user1": 6, "user5": 5, "user3": 5, "user4": 4, "user2": 2, "user6": 5, "user7": 3}
Table 5.10: Test Cases for Degree Distribution
Test Case | Description | Result
group=1, domain=1, year=2005, month=2 | Boundary condition, no user exists | { }, raise exception "No Data Available for selected set of Parameters."
group=1, domain=1, year=2005, month=3 | Single user condition | {"user2": 0.5}
group=1, domain=4, year=2008, month=1 | Full graph for group id = 1 | {"user1": 0.833, "user2": 0, "user3": 0, "user5": 0.5, "user6": 0.333, "user7": 0.333}
group=0, domain=0, year=2008, month=1 | Full graph condition for all groups, i.e., cumulative | {"user1": 0.9117, "user2": 0.6176, "user3": 0.75, "user4": 0.5441, "user5": 0.4705, "user6": 0.3823, "user7": 0.23529}
Table 5.11: Test Cases for Activity Strength Distribution
Test Case | Description | Result
group=1, domain=1, fromYear=2005, fromMonth=3, toYear=2005, toMonth=7, timeInterval=1 | - | {1: 1, 2: 0, 3: 0, 4: 0, 5: 0}
group=1, domain=1, fromYear=2006, fromMonth=9, toYear=2006, toMonth=12, timeInterval=1 | - | {1: 0, 2: 1, 3: 0, 4: 0}
group=1, domain=1, fromYear=2005, fromMonth=9, toYear=2005, toMonth=8, timeInterval=1 | Start Time comes after End Time | Raise exception "Start Time Cannot be less than End Time"
group=1, domain=1, fromYear=2006, fromMonth=9, toYear=2006, toMonth=10, timeInterval=2 | Time Interval is greater than the difference between Start and End time | Raise exception "No Data Available for selected set of Parameters."
group=0, domain=0, fromYear=2006, fromMonth=8, toYear=2007, toMonth=7, timeInterval=3 | Time Interval greater than 1; the data is aggregated for every 3 months | {1: 1, 2: 0, 3: 0, 4: 2}
Table 5.12: Test Cases For Contribution Activity Distribution
5.4 Validation
In this section we demonstrate the validity of SciBrowser by showing that it satisfies
all the requirements specified in the Requirement Analysis section.
5.4.1 Structural Analysis
Four different types of graphs [25] are considered in this study. The structural analysis
page broadly consists of two metrics: degree distribution and preferential attachment. As
these metrics have different interpretations for different types of graphs, they are
grouped according to graph type. The degree distribution is a plot of node degree versus
the number of members having that degree. Besides line and bar plots, this plot can be
viewed on a logarithmic scale, which makes a power law, if one exists, easier to
recognize. Preferential attachment captures the "rich get richer" phenomenon. It is
represented by a plot of the rate of change of degree with respect to time; thus, a
constant increase in the rate of change of the degree indicates the presence of
preferential attachment.
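As a small illustration of why the logarithmic view is useful, the sketch below builds a degree distribution from a list of node degrees and fits a straight line to its log-log points. The function names and the toy degree list are hypothetical, and a least-squares fit on log-log points is only a rough visual check, not a rigorous test for a power law.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each degree value to the number of nodes that have it."""
    return dict(Counter(degrees))


def log_log_slope(distribution):
    """Least-squares slope of log(count) against log(degree)."""
    points = [(math.log(k), math.log(n)) for k, n in distribution.items() if k > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Toy data: the number of nodes with degree k is 64 / k**2.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
distribution = degree_distribution(degrees)
slope = log_log_slope(distribution)
```

For this toy data the log-log points lie on a line of slope -2, so the fitted slope recovers the exponent of the underlying power law.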
5.4.2 Collaboration Analysis
The activity strength metric is introduced in this analysis. This metric considers both
collaboration and contribution in its calculation; thus, it reflects the innovation
potential of the users/members of the community. The metric is plotted over time to
determine whether it follows a distribution similar to a power law. In addition, a
User-User collaboration map shows the interactions between users. Areas with many
interactions are highlighted with lighter shades, whereas areas with few interactions are
assigned darker shades.
5.4.3 Activity Analysis
On the Activity Analysis page, we have metrics that display the monthly distributions of
the number of artifacts submitted and the number of active users in the network. We also
compare domains by activity using the "Domain Active User Comparison" and "Domain
Contribution Comparison" plots. To depict how frequently activity reaches or crosses a
threshold level, a distribution called the "Waiting Time Distribution" is used, where the
threshold can be set by the user.
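One way to read the waiting time distribution is as the distribution of gaps between successive months in which activity reaches the threshold. The sketch below follows that reading, which is an assumption, since the exact definition is not spelled out here; the function name and the activity series are hypothetical.

```python
from collections import Counter


def waiting_times(monthly_activity, threshold):
    """Gaps, in months, between consecutive months at or above the threshold."""
    hits = [i for i, value in enumerate(monthly_activity) if value >= threshold]
    return [later - earlier for earlier, later in zip(hits, hits[1:])]


# Toy series of monthly activity counts.
activity = [5, 1, 0, 7, 6, 2, 2, 2, 9]
gaps = waiting_times(activity, threshold=5)
distribution = Counter(gaps)  # waiting time -> how often it occurs
```

Raising the threshold in this sketch leaves fewer qualifying months and therefore longer gaps, which is the effect a user-adjustable threshold would let one explore.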
Chapter 6
Social Network Analysis Using SciBrowser
As mentioned in the last chapter, SciBrowser analyzes the OBO community from three
perspectives: structural, collaboration, and activity. Structural analysis focuses on
social network analysis metrics such as degree, centrality, density, clustering
coefficient, and average path length. Collaboration analysis focuses on collaboration
between users and the innovation potential of individual users, using novel metrics such
as "Activity Strength". Activity analysis concentrates on visualizing temporal
distributions of activity to see how frequently the activity goes beyond a threshold that
marks a point of high activity. This chapter presents the results of the analysis
performed by the tool, covering each type of analysis along with its results and their
interpretation.
6.1 Structural Analysis
Structural analysis is based on the different types of graphs that we introduced in [25].
The following is a brief description of each type of graph used in our study and its
implications for social network metrics. In each type of graph, node size is directly
proportional to the degree of the node, and the strength of the connection between two
nodes is indicated by the thickness of the line connecting them. The definition of
connection strength varies according to the type of graph.
• User-Artifact-User Graph:
In this network graph, the nodes are users and artifacts. The network depicts the
contributions made by users towards their group in the form of artifact submissions and
elaborations. An artifact is created by exactly one user, while multiple users can
elaborate on it; thus, a given artifact has at least one edge connecting it to the user
who submitted it, and it can have further edges arising from user elaborations. This
graph shows the responses from other members of the network, and from these responses we
can deduce how influential a given artifact will become. Figure 6.1 depicts the
User-Artifact-User graph for the OBO hub group (group id: 125463 (ChEBI)). Nodes labeled
with numbers indicate artifacts, whereas the remaining nodes, labeled with alphanumeric
names, represent engineers and scientists (i.e., community members). The greater the
number of elaborations a user makes on an artifact, the stronger the connection between
the user and the artifact.
Figure 6.1: User Artifact User Network
• User-User Graph: The User-User graph is derived from the User-Artifact-User graph.
We assume a transitive relationship between artifacts and users; if both member A and
member B are connected to artifact 1, we posit that member A is connected to member B.
This allows us to simplify the graph and show only how users change over time, and it
prevents information overload due to the artifacts. This graph can be termed a
collaboration graph, as it displays collaboration between users. The greater the number
of times users collaborate (indicated by the number of comments shared between them),
the greater the strength of the connection between them. Figure 6.2 depicts the
User-User graph for the OBO hub group (group id: 125463 (ChEBI)).
Figure 6.2: User User Network
• Artifact-Artifact Graph: The Artifact-Artifact graph is also derived from the
User-Artifact-User graph, by connecting together the artifacts that are linked to the
same member. The greater the number of members linked to a pair of artifacts, the
stronger the link between the two artifacts. This graph depicts the information flow at
the artifact level. Figure 6.3 shows the Artifact-Artifact graph for the OBO hub group
(group id: 125463 (ChEBI)).
Figure 6.3: Artifact Artifact Network
• User-Domain Graph: A group has one topic of focus, but this topic can be broken down
further into subfields, in a manner similar to a phylogenetic tree. These communities
are laid out in a similar fashion: any particular group is composed of one or more
subject areas, or domains. The subject areas are the focus of the User-Domain graph.
Figure 6.4 depicts the User-Domain graph for the OBO hub group (group id: 36855 (Gene
Ontology)). Each group in OBO has subgroups (i.e., domains) that focus on specific
subject areas, and each artifact is submitted for a specific domain. By abstracting the
artifacts of the User-Artifact-User graph onto the domains to which they belong, we
derive a low-resolution, abstract network representation that denotes the distribution
of members over subject areas. Nodes labeled with numbers indicate domains, whereas the
remaining nodes, labeled with alphanumeric names, represent engineers and scientists
(i.e., community members).
Figure 6.4: User-Domain Network
• Domain-Domain Graph: The Domain-Domain graph is derived from the User-Domain graph.
We assume a transitive relationship between users and domains, similar to the one
assumed for the User-Artifact-User graph above; if member A is connected to both domain
1 and domain 2, we posit that domain 1 is connected to domain 2. Furthermore, the
greater the number of members common to a pair of domains, the stronger the
relationship between the domains. Members of the groups act as the medium of knowledge
transfer between domains (in the Domain-Domain graph) or artifacts (in the
Artifact-Artifact graph). This rationale gives us an idea of how well knowledge
transfer takes place between domains, a concept which fosters innovation. Figure 6.5
depicts the Domain-Domain graph for the OBO hub group (group id: 36855 (Gene
Ontology)).
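The derivations described in the list above are projections of a two-mode graph onto one of its node types. The sketch below shows the idea on toy (user, artifact) edges; the function name project and its keep argument are hypothetical. For simplicity, the weight of a derived edge is the number of shared nodes on the other side, which matches the description of the Artifact-Artifact and Domain-Domain graphs but only approximates the comment-based strength used for the User-User graph.

```python
from collections import Counter
from itertools import combinations

# Toy User-Artifact-User edges as (user, artifact) pairs.
user_artifact_edges = [("A", 1), ("B", 1), ("C", 1), ("A", 2), ("B", 2), ("D", 3)]


def project(edges, keep="left"):
    """Connect nodes on the kept side that share a node on the other side.

    Returns a Counter that maps each sorted node pair to the number of
    nodes the pair has in common on the other side.
    """
    groups = {}
    for left, right in edges:
        key, member = (right, left) if keep == "left" else (left, right)
        groups.setdefault(key, set()).add(member)
    weights = Counter()
    for members in groups.values():
        for pair in combinations(sorted(members), 2):
            weights[pair] += 1
    return weights


user_user = project(user_artifact_edges, keep="left")
artifact_artifact = project(user_artifact_edges, keep="right")
```

Feeding (user, domain) pairs to the same function with keep="right" would give a Domain-Domain projection under the same simplification.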
It has been observed that, despite their diversity, most real web-like systems share
three prominent structural features: small average path length (APL), high clustering,
and scale-free (SF) degree distribution [2][37]. Although the SciBrowser tool is used to
visualize degree distribution and preferential attachment, we did not limit our study to
these metrics; rather, we carried out a comprehensive social network analysis of the OBO
community.
Figure 6.5: Domain Domain Network
6.1.1 Centrality
For the User-Artifact-User graph, an artifact with high degree centrality has ties to
most of the users, which means that a large number of users contribute to the artifact.
For the User-User network, a highly central user collaborates with most of the users and
hence reflects high collaboration intensity. For the User-Domain and Domain-Domain
networks, a central domain indicates that many users contribute to the domain, which
makes it an active and important domain. High closeness centrality in a
User-Artifact-User graph indicates that the artifact is easily accessible to most of the
users. In the User-User graph, a user with high closeness centrality can reach all other
users easily, which facilitates communication. In the Domain-Domain network, high
closeness centrality indicates that knowledge diffusion from one domain to another will
be smooth.
Figure 6.6 shows the monthly distributions of the different types of centrality and of
density. Some projects, such as Open Biomedical Ontologies (76834), Disease Ontology
(79168), and Systems Biology Ontology (174625), have a closeness centrality of 0 because
their graphs are disconnected. Sequence Ontology (72703) has all centrality values in the
range 0.85 - 0.95, which is very high. This network has a structure close to a star
network, which indicates that it has a small core and a large periphery. The presence of
core members keeps the community active, as they contribute heavily to it. The presence
of peripheral members helps maintain a constant flow of novel ideas into the community,
as the peripheral members are the links between the community and the outside world.
6.1.2 Small World Phenomenon
Clustering Coefficients (CC) and Average Path Lengths (APL) characterize the small world
phenomenon for networks [35]. Month-wise distributions of CC and APL for the User-User
graphs of various projects are shown in figure 6.7. The CC of a user indicates how close the
subgraph formed by that user's neighbors is to being complete. If the neighbors of a user are
fully connected to each other, the CC for the user is 1, whereas if no two of the user's
neighbors are connected, the CC for that user is 0. Thus, if the CC is at or very close to 1,
most or all of the neighbors of the user are connected to each other, which creates uniformity
in the knowledge level of the individual users in the clique. Simply put, if everyone in the
group has the same knowledge as every other individual, then there is no diversity, one of the
important factors in fostering innovation and creativity. Conversely, a CC of 0 means that
there is no communication between the users that are connected to the user in question. This
is not advisable, as knowledge mobility is suppressed. Thus, it is preferable to have an
intermediate value of CC, strictly between 0 and 1, as it indicates the presence of highly
connected subgroups (within the project) which are loosely connected to each other. As seen
in figure 6.7, most of the groups have CC values between 0 and 1, which indicates the
existence, or at least the possibility, of creativity in the groups. APL is defined as the
average number of steps along the shortest path from any node to any other node in the graph.
The smaller the APL, the faster information diffuses through the graph, which favors the
small world phenomenon.
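As an illustration of the two definitions above, the following is a minimal sketch in plain Python. The function names and the four-user graph are hypothetical and are not taken from SciBrowser; reporting a CC of 0 for users with fewer than two neighbors is a convention assumed here.

```python
from collections import deque
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: too few neighbors is reported as 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_path_length(adj):
    """Mean shortest-path length (in edges) over all ordered pairs of distinct,
    mutually reachable nodes, found by breadth-first search from every node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in adj[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical User-User graph: a fully connected triangle plus one pendant user.
graph = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u3"},
}
cc = {user: local_clustering(graph, user) for user in graph}
apl = average_path_length(graph)
```

In this toy graph, u1 and u2 have a CC of 1 (their neighbors are connected to each other), u3 has an intermediate CC of 1/3, and the pendant user u4 has a CC of 0.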
(a) Sequence Ontology (72703) (b) Open Biomedical Ontologies (76834)
(c) Disease Ontology (79168) (d) ChEBI (125463)
(e) Systems Biology Ontology (174625) (f) OBI (177891)
Figure 6.6: Centrality and Density Distributions
It is worth noting that CC fluctuates initially for most of the groups, which indicates that
the group is constantly restructuring itself as new users join and innovate.
However, some of these groups (Open Biomedical Ontologies, OBI, Disease Ontology) stabilize
their CC values without much fluctuation, which suggests that there is no more restructuring
in the project and hence little further innovation. Other groups (Sequence Ontology, ChEBI,
Systems Biology Ontology) show slight variation in CC, but not enough to indicate innovation
and creativity.
(a) Sequence Ontology (72703) (b) Open Biomedical Ontologies (76834)
(c) Disease Ontology (79168) (d) ChEBI (125463)
(e) Systems Biology Ontology (174625) (f) OBI (177891)
Figure 6.7: Clustering Coefficient and Average Path Length Monthly Distribution
6.1.3 Degree Distribution
Figures 6.8 and 6.9 show the degree distributions for the User-User network of Gene
Ontology and the entire OBO community, respectively. The line plot in both cases suggests
a power law, as shown by the log plot next to it. A power law has different
interpretations in different types of graphs. In the User-User network, a power law indicates
that only a few users have a high degree, whereas most of the users have a low degree. As can be
seen from figures 6.8 and 6.9, the distributions progressively broaden over time, developing
heavy tails. This implies that the distribution has high variance, i.e., if we randomly pick
a user, that user is likely to have a degree that is far from the average. In the
User-Artifact-User network, we focus on the degree distributions of the artifacts, and we again
get a power law, as shown in figure 6.10. This indicates that only a few artifacts attract a large
number of users, while the majority of the artifacts have little impact on the users of the
network. A power law demonstrates the scale-free property of the network, which makes the
network robust and resilient: if we randomly remove nodes from the network, it is unlikely to
fail. This is one of the reasons why self-organized communities like the WWW (World Wide Web)
flourish even though the members of the community join and leave voluntarily. We exclude actors
with zero degree by classifying them as outliers, as log(0) is not defined.
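The log plot check described above can be sketched as follows. This is only an illustration under stated assumptions: the degree sequence is made up, the helper names are hypothetical, and an ordinary least-squares fit on the log-log points is a rough way to estimate a power-law exponent rather than the procedure used in SciBrowser.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each non-zero degree k to the fraction of actors having that degree."""
    nonzero = [k for k in degrees if k > 0]  # log(0) is undefined, so drop zeros
    counts = Counter(nonzero)
    return {k: c / len(nonzero) for k, c in sorted(counts.items())}


def loglog_slope(distribution):
    """Least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(p) for p in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical degree sequence: many low-degree users, few high-degree ones,
# plus two zero-degree actors that are excluded as outliers.
degrees = [0, 0] + [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
dist = degree_distribution(degrees)
slope = loglog_slope(dist)
```

For this constructed sequence the points fall on a straight line in log-log space with a slope of about -2; real data would scatter around such a line, especially in the tail.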
(a) Line Plot (b) Log Plot
Figure 6.8: Degree Distribution of User-User network for Gene Ontology (Group Id-36855)
(a) Line Plot (b) Log Plot
Figure 6.9: Degree Distribution of User-User network for OBO (All Groups Included)
Figure 6.10: Artifact Degree Distribution of User-Artifact-User network for OBO (Comprehensive)
6.1.4 Preferential Attachment
According to [5], many large networks exhibit a scale-free power law distribution for two
reasons: (i) networks expand continuously through the addition of new nodes,
and (ii) new nodes attach preferentially to nodes that are already well connected. This implies
that well-connected nodes attract new nodes and continue to grow until a
certain limit [31]. In order to visualize preferential attachment, we plot the change in degree
of an actor over a period of time. The actor can be a user (in the User-User network), an
artifact (in the User-Artifact-User network) or a domain (in the Domain-Domain network).
Ideally, for an actor that displays preferential attachment, we expect the visualization to
show a linear rise in the rate of increase in degree, indicating that the actor is becoming more
and more connected; later the change falls and approaches zero, which means that the
actor's degree saturates thereafter. In the User-Artifact-User network, this closely
resembles the life cycle of an artifact, as shown in figure 6.11(a). Initially, when the artifact
is new, it draws responses from the users; later the responses diminish as the
novelty of the artifact wears off. There is an exception to this pattern, which is shown in
figure 6.11(b). Here the artifact receives responses just after its submission, which leads to
a degree change of 100%, and over the next few months the artifact becomes dormant
as no users contribute to it; after that, however, there is a sudden increase in the
degree of the artifact. We tracked this artifact at the database level and found
that the increase in degree was caused by a particular contribution made
by a user, which triggered a series of responses from existing as well as new users.
It is possible that this contribution was a novel one or an important addition to the
existing artifact. Thus, the contribution could have been either a radical innovation or an
incremental one.
(a) Artifact - 1167822 (b) Artifact - 994121
Figure 6.11: Preferential Attachment Graph for Artifact
In the case of users, preferential attachment depicts the journey of the user from the
periphery to the core of the network. Figure 6.12 shows the degree change plots for some of
the users of the Gene Ontology project. It is evident from these plots that they all follow
the same pattern in terms of degree change. Initially, when users join the community, they
are highly active, as indicated by the sharp rise in the rate of increase in degree. Over
time, however, their degree increases at a lower rate and eventually stays constant.
Based on the plots, we can say that as a user becomes more and more central, the rate of
increase in degree declines. Thus, for a user, preferential attachment exists only
for a certain time after joining.
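A degree change series of the kind plotted in figures 6.11 and 6.12 could be derived from time-stamped edges roughly as follows. This is a sketch under assumptions: the `(period, a, b)` event format, the function names, and the use of absolute per-period increments (rather than the percentage scale of the figures) are all choices made for the example, not the SciBrowser implementation.

```python
def cumulative_degree(edge_events, actor, periods):
    """Degree of `actor` at the end of each period, given (period, a, b) events."""
    by_period = {}
    for period, a, b in edge_events:
        by_period.setdefault(period, []).append((a, b))
    neighbors = set()
    series = []
    for period in periods:
        for a, b in by_period.get(period, []):
            if a == actor:
                neighbors.add(b)
            elif b == actor:
                neighbors.add(a)
        series.append(len(neighbors))  # repeated contacts do not raise the degree
    return series


def degree_change(series):
    """Per-period increase in degree, with the first period measured from zero."""
    previous = 0
    changes = []
    for value in series:
        changes.append(value - previous)
        previous = value
    return changes


# Hypothetical events for user "u" over four monthly periods.
events = [(1, "u", "a"), (1, "b", "u"), (2, "u", "c"), (2, "c", "u"), (3, "x", "y")]
series = cumulative_degree(events, "u", [1, 2, 3, 4])
changes = degree_change(series)
```

Here the degree of "u" grows quickly at first and then stays constant, so the change series drops to zero, which is the saturation pattern described above.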
(a) User - gomidori (b) User - jl242
(c) User - val wood (d) User - ramab
Figure 6.12: Preferential Attachment Graph for Users
6.2 Collaboration Analysis
Collaboration analysis focuses on the collaboration among users. When an artifact
is submitted, the users discuss the artifact by commenting on it. The influence of the
artifact becomes evident from the number of comments it receives in response. This
analysis, however, is targeted towards determining the influence of the user and not the
artifact. Another objective is to visualize the collaboration between users. Activity strength
helps identify the productivity of the user by considering both the contribution and
collaboration aspects associated with the user, while the collaboration map helps visualize the
collaboration between users at any given point in time.
6.2.1 Activity Strength
This metric has been drawn from the study conducted on the impact of co-authorship
teams [20], where it was used to identify the productivity and influence of the authors.
In OBO the users submit the artifacts, and these artifacts are elaborated by other users
in the form of comments. Thus, the artifacts become the means by which the users can
effectively collaborate. In order to calculate this metric for a user, we need both the number
of artifacts submitted (A) and the collaboration intensity (CI) for the user.
\[ CI(i) = \sum_{j} w_{ij} \tag{6.1} \]
Equation 6.1 represents the Collaboration Intensity, and it takes into account all the users j
connected to user i. The connection between users i and j has a weight associated with it that
is represented by $w_{ij}$, where
\[ w_{ij} = \sum_{a} \frac{N_c}{N_a} \]
sums the weight between users i and j over all artifacts a. $N_c$ represents the number of
collaborations that take place between users i and j over artifact a, and $N_a$ is the total
number of artifacts over which users i and j collaborate.
\[ S_a(i) = W_a \, A_i + W_{ci} \, CI_i \tag{6.2} \]
Equation 6.2 represents the activity strength, with $W_a$ representing the weight for artifact
submission (A) and $W_{ci}$ representing the weight for collaboration intensity (CI), such that
$W_a + W_{ci} = 1$. If $W_a > W_{ci}$, the Activity Strength gives higher weight
to artifact submissions than to collaboration intensity, and vice versa. Furthermore, $A_i$ and
$CI_i$ are normalized by dividing them by the maximum values of A and CI respectively, such that
$0 \le A_i, CI_i \le 1$. This also ensures that $0 \le S_a(i) \le 1$.
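Equations 6.1 and 6.2 can be traced on a small example. The sketch below assumes a hypothetical input layout (a dictionary from an unordered user pair to the number of collaborations per shared artifact) and hypothetical function names; it is meant to show the arithmetic, not the way SciBrowser stores or queries the SourceForge data.

```python
def collaboration_intensity(collabs):
    """CI(i) = sum over j of w_ij, with w_ij = (sum over a of N_c) / N_a.

    `collabs` maps a user pair to {artifact: N_c}; each inner dict is assumed
    to be non-empty, so N_a (its length) is at least 1.
    """
    ci = {}
    for (i, j), per_artifact in collabs.items():
        n_a = len(per_artifact)
        w_ij = sum(per_artifact.values()) / n_a
        ci[i] = ci.get(i, 0.0) + w_ij
        ci[j] = ci.get(j, 0.0) + w_ij
    return ci


def activity_strength(artifacts, ci, w_a=0.5):
    """S_a(i) = W_a * A_i + W_ci * CI_i on max-normalized A and CI."""
    w_ci = 1.0 - w_a  # the two weights must sum to 1
    max_a = max(artifacts.values(), default=0) or 1
    max_ci = max(ci.values(), default=0.0) or 1.0
    users = set(artifacts) | set(ci)
    return {
        u: w_a * artifacts.get(u, 0) / max_a + w_ci * ci.get(u, 0.0) / max_ci
        for u in users
    }


# Hypothetical data: collaborations per artifact and artifacts submitted per user.
collabs = {("ann", "bob"): {"art1": 4, "art2": 2}, ("ann", "cat"): {"art1": 1}}
artifacts = {"ann": 1, "bob": 4, "cat": 0}
ci = collaboration_intensity(collabs)
strength = activity_strength(artifacts, ci, w_a=0.5)
submission_only = activity_strength(artifacts, ci, w_a=1.0)
```

With equal weights, "bob" scores highest because he is both the top submitter and a strong collaborator, while setting `w_a=1.0` reduces the metric to the normalized submission count alone.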
Activity Strength is used to assess the productivity, influence and innovation potential
of the user. We expect to get a scale-free distribution for this metric, just like the degree
distribution. This would mean that there exist a few influential users around which the
community is built, and that all the other users connect to these influential users. Figure 6.13
shows the activity strength plots that we get from the SourceForge data. If we set $W_{ci}$,
i.e. the weight for collaboration intensity (CI), to 0 in equation 6.2, thereby completely
ignoring the collaboration factor, we get the plot shown in figure 6.13(a). If the artifact
submission factor is completely ignored in equation 6.2 by setting $W_a$ to 0, we get the plot
shown in figure 6.13(b). With equal weights given to both factors, we get the plot shown
in figure 6.13(c). As we can see, plot 6.13(a) displays a power law, whereas plots 6.13(b)
and 6.13(c) are scattered and do not indicate any specific pattern. The conclusion we can
derive from these plots is that, as far as artifact submission is concerned, there are a few core
users who make most of the contributions to project 36855 (Gene Ontology). However, these users
are not major collaborators, which is why we fail to get a power law when we
consider both contribution and collaboration. Most of the members of the community are
collaborators, and their collaboration intensities are similar, which is why
we do not get a power law when we consider just the collaborations.
(a) Wci = 0 (b) Wa = 0
(c) Wa = 0.5, Wci = 0.5
Figure 6.13: Activity Strength Log Plots for Gene Ontology (36855)
6.2.2 Collaboration Map
A collaboration map is used to visualize the collaboration patterns in a group or domain,
using a Python matplotlib color map. The map is laid out as a two-dimensional matrix with the
users on both the X and Y axes. Each cell of the matrix represents the collaboration
between the user on the X axis and the corresponding user on the Y axis. Diagonal cells are
ignored, as they represent the same user on both axes. The map is symmetric,
with the diagonal acting as the axis of symmetry. Figure 6.14 shows the collaboration patterns
between the users of different groups in OBO. The color scheme used in the matrix is shown in
a color bar adjacent to the color map. The darker the cell, the lower the collaboration
intensity between the users associated with it. As the color approaches yellow or
white, the collaboration between the associated users on the X and Y
axes increases. The lower half of the color map is colored black to indicate that the graph we
are using is undirected and therefore symmetric across the diagonal. Thus, the collaboration
between User-X and User-Y is the same as the collaboration between User-Y and User-X. In
the case of directed graphs, the entire color map can be used. Collaboration between two users
is calculated using the formula given in equation 6.3 below.
\[ C_{\text{User-X},\,\text{User-Y}} = \frac{N_c}{N_a} \tag{6.3} \]
In equation 6.3, $N_c$ is the number of collaborations that took place between User-X and
User-Y, while $N_a$ is the number of artifacts over which those collaborations took place.
The collaboration values are normalized by dividing by the maximum value. Thus, we get
collaboration values between 0 and 1.
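The matrix behind such a map can be assembled as in the following sketch. The input format (a user pair mapped to its `(N_c, N_a)` counts) and the function name are assumptions made for illustration; only the upper triangle is filled, mirroring the blacked-out lower half described above.

```python
def collaboration_matrix(users, pair_stats):
    """Upper-triangular matrix of C = N_c / N_a, normalized by the largest value.

    `pair_stats` maps an unordered user pair to a (N_c, N_a) tuple.
    """
    index = {user: k for k, user in enumerate(users)}
    size = len(users)
    matrix = [[0.0] * size for _ in range(size)]
    for (x, y), (n_c, n_a) in pair_stats.items():
        if x == y or n_a == 0:
            continue  # diagonal cells and pairs without shared artifacts stay 0
        row, col = sorted((index[x], index[y]))
        matrix[row][col] = n_c / n_a
    peak = max(max(r) for r in matrix) if size else 0.0
    if peak > 0:
        matrix = [[v / peak for v in r] for r in matrix]
    return matrix


# Hypothetical pair statistics for three users.
users = ["ann", "bob", "cat"]
pair_stats = {("ann", "bob"): (6, 2), ("cat", "ann"): (1, 1), ("bob", "cat"): (3, 2)}
cmap_values = collaboration_matrix(users, pair_stats)
```

The resulting nested list could then be handed to a plotting routine such as matplotlib's `imshow` with a dark-to-bright colormap (for example `"hot"`) to obtain a picture similar in spirit to figure 6.14; the exact colormap used in the thesis is not stated here.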
(a) Gene Ontology (36855) (b) Sequence Ontology (72703)
(c) Open Biomedical Ontologies (76834) (d) ChEBI (125463)
(e) Systems Biology Ontology (174625) (f) OBI (177891)
Figure 6.14: User Collaboration Maps for various projects under OBO
6.3 Activity Analysis
The activity patterns are viewed against time so that they reflect the different stages the
community has gone through. Activity not only shows the project life cycle stages but also
helps explain the innovation taking place in the project. We define two types of activities:
Artifact Submission and Active User Distribution. These activities can be viewed using SciBrowser
for a certain group (project) or domain from a certain start time (year/month) to a certain
end time (year/month). The smallest unit of time is one month, i.e., by default the results
are shown as monthly distributions. The time can also be aggregated to see the total
activity for a longer period; e.g., monthly activity can be aggregated to view it every n
months, where n can be any value in the set {2, 3, 6, 9, 12}. For the plots in figures 6.15,
6.16, 6.17 and 6.18, n is set to 9.
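The n-month aggregation can be pictured with a short sketch. The function name and the monthly counts are hypothetical, and keeping a shorter trailing window as a partial total is an assumption; the text does not say how SciBrowser treats an incomplete final window.

```python
ALLOWED_WINDOWS = (1, 2, 3, 6, 9, 12)  # 1 is the default monthly view


def aggregate_activity(monthly_counts, n=1):
    """Sum consecutive monthly activity counts into windows of n months.

    A trailing window shorter than n months is kept as a partial total.
    """
    if n not in ALLOWED_WINDOWS:
        raise ValueError(f"n must be one of {ALLOWED_WINDOWS}")
    return [sum(monthly_counts[i:i + n]) for i in range(0, len(monthly_counts), n)]


# Hypothetical artifact submissions for 13 consecutive months.
monthly = [3, 5, 2, 0, 7, 1, 4, 4, 6, 2, 1, 0, 9]
quarterly = aggregate_activity(monthly, n=3)
half_yearly = aggregate_activity(monthly, n=6)
```

Larger windows smooth the curve at the cost of temporal resolution, which is the trade-off behind choosing n = 9 for the figures.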
The activity metrics indicate the stage of community growth at any given point in
time. A typical life cycle of an organization is based on sales or profits (the dependent
variable) over time. According to [10], for an open source software project it is the number
of downloads by users that determines the life cycle of the project. Since OBO is an open
science project, we plot the number of contributions and the number of active members over time
in order to see how the project fits into the organizational life cycle model.
6.3.1 Contribution Distribution
"Contribution Distribution" is the plot of the number of artifacts submitted over a period
of time. This metric can be visualized using a line or bar plot, with time on the x axis
and the number of submissions on the y axis. Figure 6.15 shows the contribution patterns
for different communities under the OBO Foundry. Each point in the graph represents the
aggregate contribution over 9 months. This is done in order to achieve a smoother curve and
reduce noise. The figure shows all the stages of a typical project life cycle. Figure 6.15(a)
shows the "Introduction and Growth" phase of the community. Figure 6.15(b) shows a
community in its "Maturity" phase, and figure 6.15(c) shows the "Decline" phase
of the respective community. Not all projects necessarily follow this cycle;
it is simply the most commonly observed life cycle of a project.
(a) Introduction & Growth (b) Maturity
(c) Decline
Figure 6.15: Typical Contribution Activity Patterns across projects OBI (177891), Open
Biomedical Ontologies (76834) and ChEBI (125463)
Figure 6.16 shows the alternate life cycles that are followed by some projects. It is seen from
figure 6.16(a) that the community starts off with steady growth and appears to
mature after that, but then there is a sudden rise in the contributions coming from the
members of the community. Later, the number of contributions starts dropping. According
to the project life cycle model [10], a community can either decline or revive after
it reaches maturity. In this case we witness a revival, which can be due to an important
breakthrough in an existing domain of the project or to the introduction of a new
domain. It can also simply be due to a new discovery by a group of motivated researchers.
Revival can make a project enter the growth phase again; this trend is evident in figure
6.16(b), where the community starts reviving after it had started to decline. Thus, revival
brings with it innovation, which tends to put the project back on track.
(a) Revival After Maturity (b) Revival After Decline
Figure 6.16: Alternate Contribution Activity Patterns across projects Gene Ontology (36855)
and Sequence Ontology (72703)
6.3.2 Active User Distribution
We define active users as the users that contribute to the community. "Active User
Distribution" is the distribution of the active users over a period of time. The plot does
not necessarily indicate the state of the project, but it gives us an idea of the overall active
population in a project at any given point in time. If we compare the plot of the distribution
of active users with the plot of the distribution of artifacts contributed, we get an idea of
how changes in the number of active users relate to changes in artifact
submission. Figures 6.17 and 6.18 compare these two plots for two projects. As can be
seen from figure 6.17(a), an initial increase in the number of active users coincides with an
increase in artifact submission (from x=1 to x=5 in figure 6.17(b)). This suggests that
an increase in the active population leads to an influx of new ideas into the project and hence
to innovation. Further, as the growth in the number of active users stagnates (from x=5 to
x=7), artifact submission falls, and it revives as the number of active users grows again
(from x=7 to x=8).
(a) Active User Distribution (b) Contribution Distribution
Figure 6.17: Comparisons between Contribution and Active User Distribution Patterns for
Sequence Ontology (72703)
This pattern does not always hold; figure 6.18 shows an exception.
Initially, as the number of active users increases (from x=1 to x=4),
an increase in the number of artifacts can be seen in figure 6.18(b). However, the growth in
the number of artifacts stagnates (from x=4 to x=7) while the number of active users still
increases, indicating that although the number of active users increases, their contribution
is not enough for innovation. Later, as the number of users reaches saturation (in figure
6.18(a), from x=7 to x=11), there is a sharp increase in the number of contributions. This
suggests that some significant contribution during this time led to high
activity in the project even though the number of active users did not increase at all,
which again is a sign of innovation. Thus, in some projects only a few users
are active contributors while the others are dormant or inactive, yet there is significant
activity going on in the project. It also suggests that it is the quality of an artifact, and
not necessarily the size of the active population, that determines what activity follows.
(a) Active User Distribution (b) Contribution Distribution
Figure 6.18: Comparisons between Contribution and Active User Distribution Patterns for
Gene Ontology (36855)
6.3.3 Activity Outburst Frequency Distribution
The activity of any project can be one of the above two types: the number of contributions
over time and the number of active users over time. It can be seen from figure 6.19
that the activity of a project does not remain constant or ever-increasing over time but goes
through ups and downs. An outburst in an activity is defined as a level of activity that crosses
a certain threshold. This threshold is defined by equation 6.4 below.
\[ \text{ActivityOutburstThreshold} = \text{AverageActivity} \times (1 + \delta) \tag{6.4} \]
where $\delta$ is a user-defined variable. The value of $\delta$ has to be chosen carefully.
Choosing too small a value for $\delta$ can greatly increase the number of outbursts, whereas
choosing too large a value can greatly reduce the number of outbursts. It is therefore advisable
to choose the value of $\delta$ based on the average value of the activity.
(a) Distribution of Number of Artifacts Submitted (b) Distribution of Number of Active Users
Figure 6.19: Activity Plots for Systems Biology Ontology (174625)
The "Activity Outburst Frequency Distribution"
for any activity can be defined as the frequency of occurrence of the outbursts in the activity.
It is a plot in which the x-axis indicates the outburst number and the y-axis indicates the delay
before the occurrence of the corresponding outburst. This metric has been derived
from the agent-based civil violence model [11] created by Joshua M. Epstein. Figures 6.20
and 6.21 show the "Activity Outburst Frequency Distribution" on the left and its corresponding
histogram on the right, for different projects under OBO. The histogram groups the outbursts
according to the delay preceding them.
(a) Outburst Frequency Distribution for Gene Ontology (b) Outburst Frequency Histogram for Gene Ontology
(c) Outburst Frequency Distribution for ChEBI (d) Outburst Frequency Histogram for ChEBI
Figure 6.20: Activity Outburst Frequency Distribution For Gene Ontology and ChEBI
The plots also show the average value of
the activity and the value of $\delta$ chosen. The value of $\delta$ is selected based on the
average value of the activity and on the number of outbursts obtained. A specific
pattern emerges from these plots: every activity outburst
that occurs after a significant delay is followed by a series of quick outbursts. These outbursts
might then be followed by another outburst that occurs after a significant delay. This suggests
that a high activity triggers a series of further high activities that are probably related to
the one that followed the significant delay. Thus, in most of the histograms it can be seen that
the majority of the outbursts occur frequently, i.e., they have a short delay period.
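The outburst threshold of equation 6.4 and the delays between outbursts can be sketched as below. Several things here are assumptions made for illustration: the name `delta` for the user-defined variable, the function names, the activity series, and in particular the definition of delay as the number of periods since the previous outburst (or since the start of the series for the first one).

```python
from collections import Counter


def outburst_threshold(activity, delta):
    """Equation 6.4: average activity scaled by (1 + delta)."""
    return (sum(activity) / len(activity)) * (1 + delta)


def outburst_delays(activity, delta):
    """Periods elapsed before each outburst, counted from the previous outburst
    (or from the start of the series for the first one)."""
    threshold = outburst_threshold(activity, delta)
    delays = []
    last = -1  # index of the previous outburst; -1 means "before the series"
    for t, value in enumerate(activity):
        if value > threshold:
            delays.append(t - last)
            last = t
    return delays


# Hypothetical activity series (e.g. artifacts submitted per period).
activity = [2, 1, 0, 9, 8, 1, 0, 0, 1, 10, 9, 3]
threshold = outburst_threshold(activity, delta=0.5)
delays = outburst_delays(activity, delta=0.5)
histogram = Counter(delays)  # number of outbursts per delay length
```

In this made-up series, each outburst that arrives after a long delay is followed immediately by another one, so short delays dominate the histogram, which matches the pattern described above. A smaller `delta` lowers the threshold and yields more outbursts.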
(a) Outburst Frequency Distribution for Open Biomedical Ontologies (b) Outburst Frequency Histogram for Open Biomedical Ontologies
(c) Outburst Frequency Distribution for Sequence Ontology (d) Outburst Frequency Histogram for Sequence Ontology
Figure 6.21: Activity Outburst Frequency Distribution For Open Biomedical Ontologies and
Sequence Ontology
Chapter 7
Conclusion
In this thesis, we introduced SciBrowser, a computational ethnography tool
for exploring open source science communities that reside in SourceForge. We demonstrated the
applicability of SciBrowser to the analysis of Open Biomedical Ontology (OBO), which
is an open source science network in the field of biomedical science. To demonstrate the
utility of SciBrowser and apply it to the analysis of open source science networks, we presented
a three-dimensional analysis approach: structural, collaboration and activity analysis.
Under structural analysis, we examine traditional social network metrics such as centrality
and density. We observe high values of centrality for the Sequence Ontology (72703)
project, indicating that it has a structural topology resembling a star network. This suggests
the possible presence of a core-periphery pattern. Clustering Coefficient (CC)
and Average Path Length (APL) measures are also plotted over time in order to determine
the presence of the small world property in different projects. Values of CC for most of the
projects are around 0.5. This observation suggests that there exist highly connected components
which are loosely coupled with each other. For most of the networks, the value of
average path length is around 2, which is small compared to a random network. This facilitates
efficient knowledge transfer and innovation diffusion from one part of the network to the
other. Furthermore, most of the projects that we examined eventually stop substantially
restructuring themselves, as indicated by stabilized values of CC. This lack of restructuring
suggests that the communities may experience challenges in innovation.
The SciBrowser tool also plots degree distributions and visualizes preferential attachment.
A power law degree distribution is observed for the User-User graph of the Gene Ontology
(group id: 36855) domain, indicating the resilient nature of the community.
Novel metrics such as "Activity Strength" are introduced for collaboration analysis.
These metrics are used to measure the productivity and innovation potential of users. Confining
innovation to artifact submission generates a power law, indicating the presence of
core members who submit most of the artifacts, while peripheral members comment on
them. When the artifact submission and collaboration factors are combined as a proxy metric
for innovation, the power law is disturbed and ceases to exist. This may indicate that
the major contributors of the project are weak collaborators and that their strong contribution
factor is nullified by their weak collaboration intensity. We also visualize the collaboration
among users of the community using a color-coded map. The visualization indicates that only a few
user pairs collaborate effectively, while most of the other user pairs have a mediocre amount
of collaboration between them.
Under activity analysis, the tool plots two types of activities, artifact submissions and
active user distributions, over a period of time. It is observed that the open source science
communities examined in this study closely follow the organizational project life cycle. Thus,
it is possible that open source science projects possess specific organizational characteristics
such as division of labor, leadership, level of commitment, and coordination/control. In
addition, we discovered certain unconventional activity patterns, in which the project
picks up pace after the decline phase. Knowing which stage the project is in can provide
useful insight to the administrators of a project, so that they can take decisions at
the proper time to revive the project. Active user distribution is also discussed in association
with the artifact submission pattern. It is observed that an increase in the number of users in
the project tends to lead to an increase in artifact submission. The rationale behind this may be
the innovation which the new users bring into the project. We observe that an activity
outburst occurring after a significant delay is usually followed by one or more frequently
occurring outbursts. This implies that the first outburst (occurring after a significant delay)
might trigger one or more outbursts that follow it. Alternatively, it might imply that initially,
when the community is growing, there are fewer outbursts, but once the community has found
its direction and conflicts are resolved, the outbursts occur more frequently.
The SciBrowser tool is primarily used by the simulation team in the Simulation and Modeling
lab at Auburn University to validate their agent-based simulation models. In a
broader sense, however, the tool is targeted towards researchers who explore open source
communities in SourceForge. Although our study pertains to a specific community on SourceForge,
the application of SciBrowser is not limited to the Open Biomedical Ontology (OBO) community.
The tool is versatile, as it can also be applied to open
source software projects. At an abstract level, the structure of SourceForge communities is
similar, and the database schema used by all the projects is the same. These similarities
make SciBrowser well suited for those who are interested in analyzing the collaboration between
the users of a community and tracking the activity taking place within a project that resides
in SourceForge.
Our future plans for SciBrowser involve re-engineering the tool into a
comprehensive analysis tool, including options for network visualization, metric observation,
and plot generation. Currently, we can visualize metrics and generate plots, but network
visualization in the form of a graph with nodes and edges is lacking. Such a feature would
give researchers the ability to observe the structural growth of a community and help develop
hypotheses about its dynamics. Further work also includes integrating data mining
features, allowing the development of social-network-specific mining algorithms. Data mining
techniques such as association rule mining can be used to establish associations
between changes in the structural metrics and temporal activity patterns. The current
version of the tool lacks a feature that would explicitly link structural attributes,
such as change in degree, to the innovation metrics at the user level. The tool accounts for
activity at the group level and domain level, but not yet at the user level.
Bibliography
[1] F. Colaiori L. S. Buriol D. Donato S. Leonardi A. Capocci, V. D. P. Servedio and
G. Caldarelli. Preferential attachment in the growth of social networks: The internet
encyclopedia wikipedia. September 2006.
[2] Piotr Fronczak Agata Fronczak and Janusz A. Hołyst. Average path length in random
networks. November 2004.
[3] Pieter Swart Aric Hagberg, Dan Schult. Networkx, 2010.
[4] Yaneer Bar-Yam. Dynamics of Complex Systems. Westview Press, 1997.
[5] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks.
Science, 286:509–512, October 1999.
[6] Open Biological and Biomedical Ontology Foundry. Obo:foundry, July 2010.
[7] Scipy Community. Numpy, 2008.
[8] Paul A. David and Michael J. Spence. Towards institutional infrastructures for e-science:
The scope of the challenge. Oxford Internet Institute, Research Report No. 2, September
2003.
[9] Charles Dhanaraj and Arvind Parkhe. Orchestrating innovation networks. Academy of
Management Review, 31(3):659–669, July 2006.
[10] Donald E. Wynn, Jr. Organizational structure of open source projects: A life cycle
approach. Proceedings of 7th Annual Conference of the Southern Association for
Information Systems, pages 285–290, 2003.
[11] Joshua M. Epstein. Modeling civil violence: An agent-based computational approach.
Proceedings of the National Academy of Sciences of the United States of America,
99:7243–7250, May 2002.
[12] Ralph Johnson John Vlissides Erich Gamma, Richard Helm.
Design Patterns. Elements of Reusable Object-Oriented Software.
[13] James A. Hendler Qingpeng Zhang Zhuo Feng Yanqing Gao Hui Wang Fei-Yue Wang,
Daniel Zeng and Guanpi Lai. A study of the human flesh search engine: Crowd-powered
expansion of online knowledge.
[14] Source Forge. Sourceforge.net research data, 2010.
[15] Ian Foster. Service-oriented science. Science, 308(5723):814–817, May 2005.
[16] John H. Holland. Hidden Order: How Adaptation Builds Complexity. Addison-Wesley
Publishing Company Inc., 1995.
[17] The MathWorks Inc. Matlab, 2010.
[18] Scott Christley Gregory Madey Jin Xu, Yongqin Gao. A topological analysis of the
open source software development community. 2005.
[19] Michael Droettboom John Hunter, Darren Dale. Matplotlib, 2008.
[20] Weimao Ke Katy Börner, Luca Dall'Asta and Alessandro Vespignani. Studying the
emerging global brain: Analyzing and visualizing the impact of co-authorship teams.
Wiley Periodicals - Complexity, 10(4):57–67, 2005.
[21] Glenn E. Krasner and Stephen T. Pope. A description of the model-view-controller user
interface paradigm in the smalltalk-80 system. 1988.
[22] Robert C. Martin. The dependency inversion principle. May 1996.
[23] Robert C. Martin. The open-closed principle. January 1996.
[24] Ken McIvor. wxmpl, 2009.
[25] Damodar Shenviwagle Michael Arnold and Levent Yilmaz. Scibrowser: A computational
ethnography tool to explore open source science communities. March 2010.
[26] John H. Miller and Scott E. Page. The standing ovation problem. April 2004.
[27] Melanie Mitchell. Complexity: A Guided Tour. Oxford Univ Press, 2009.
[28] Susan A. Mohrman and Caroline S. Wagner. The dynamics of knowledge creation: Phase
one assessment of the role and contribution of the Department of Energy's nanoscale
science research centers. November 2008.
[29] Bernard Munos. Can open-source r&d reinvigorate drug research? Nature Reviews
Drug Discovery, August 2006.
[30] Siobhan Omahony and Fabrizio Ferraro. The emergence of governance in an open source
community. April 2007.
[31] Jill E. Perry-Smith and Christina E. Shalley. The social side of creativity: A static and
dynamic social network perspective. Academy of Management Review, 28(1):89–106,
January 2003.
[32] Noel Rappin and Robin Dunn. wxPython in Action. Manning Publication Co., 2006.
[33] SRDA. Sourceforge.net research data, September 2008.
[34] Georgia Tech Susan Cozzens and NSF Julia Lane. A deeper look at the visualization of
scientific discovery in the federal context. September 2008.
[35] Brian Uzzi and Jarrett Spiro. Collaboration and creativity: The small world problem.
American Journal of Sociology, 111(2):447–504, September 2005.
[36] S. Wasserman and K. Faust. Social network analysis: Methods and applications.
Cambridge University Press, 1994.
[37] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature,
393:440–442, June 1998.
[38] Jia Zhang Wei Tan and Ian Foster. Network analysis of scientific workflows: A gateway
to reuse.
[39] Wikipedia. Centrality, April 2010.
[40] Levent Yilmaz and Tuncer Oren. Agent-Directed Simulation And Systems Engineering.
Wiley-VCH, 2009.