DEMOGRAPHICS OF ADWARE AND SPYWARE
Except where reference is made to the work of others, the work described in this thesis is
my own or was done in collaboration with my advisory committee. This thesis does not
include proprietary or classified information.
_______________________________________________
Kavita Sanyasi Arumugam
Certificate of Approval:
_____________________________
Dean Hendrix
Associate Professor
Computer Science and Software Engineering
_____________________________
David A Umphress, Chair
Associate Professor
Computer Science and Software Engineering
_____________________________
Cheryl Seals
Assistant Professor
Computer Science and Software Engineering
_____________________________
George T. Flowers
Interim Dean
Graduate School
DEMOGRAPHICS OF ADWARE AND SPYWARE
Kavita Arumugam
A Thesis
Submitted To
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Master of Science
Auburn, Alabama
December 17, 2007
DEMOGRAPHICS OF ADWARE AND SPYWARE
Kavita Arumugam
Permission is granted to Auburn University to make copies of this thesis at its discretion,
upon the request of individuals or institutions and at their expense. The author reserves
all publication rights.
____________________________
Signature of Author
____________________________
Date of Graduation
THESIS ABSTRACT
DEMOGRAPHICS OF ADWARE AND SPYWARE
Kavita Arumugam
Master of Science, December 17, 2007
(B.E., Sir M Visveswaraya Institute of Technology, 2004)
60 Typed Pages
Directed by David Umphress
The World Wide Web is the most popular use of the Internet. Information can be
accessed from this network of web pages. Unknown to users, web pages can access their
personal information and, sometimes, also provide information that the user has not asked
for. Various kinds of software are used in web pages, including Java, Perl scripts, and
XML. Additional software, such as adware and spyware, is also in use. This software can
be useful or threatening. Our interest lies in knowing which of the software in use
presents a threat on the web.
A user who is worried about his online privacy may be more interested in
knowing what types of technology are being used on the web. Adware and spyware are
types of software that are present in web pages unknown to the user. If they are present
when a web page is accessed, they make use of the user's Internet connection to track and
send data and statistics via a server installed on the user's computer or the user's client.
Most of the legitimate adware and spyware companies disclose in their
privacy statement the nature of data that is collected and transmitted; but there is no way
a user can control the kind of data that is being sent.
A study was conducted to determine the percentage of web pages that contained
certain types of adware and spyware. It was found that 16% of web pages contained web
bugs while 1% of web pages contained ActiveX objects.
ACKNOWLEDGMENTS
I wish to express my deepest gratitude to Dr. David Umphress for his motivation
and guidance throughout the research work. I wish to thank Dr. Dean Hendrix and Dr.
Cheryl Seals for their time and unwavering support. My gratitude goes out to my parents,
my sister, and my friends who have supported me and helped me in all the phases of my
life at Auburn. I wish to thank Santosh for helping me during my stay in Auburn. A big
thanks to all my teachers who have taught me to stay focused on my goals and the
Almighty God for showing me the way.
Style manual used: ACM Computing Surveys.
Computer software used: Microsoft Word, Microsoft Excel, Microsoft Access, Java.
TABLE OF CONTENTS
LIST OF FIGURES
1 INTRODUCTION
1.1 Background
1.2 Problem Statement
2 RELATED WORK
2.1 Background Information
2.1.1 Web Bug
2.1.2 Adware Networks
2.1.3 Backdoor Santas
2.1.4 Trojan Horse
2.1.5 Browser Hijackers
2.1.6 Dialers
2.2 Related Work
3 APPLICATION DEVELOPMENT
3.1 System Architecture
3.1.1 Web crawler
3.1.2 JDBM
3.2 Architecture of the program
3.3 Understanding the program
3.3.1 Class IDemo
3.3.2 Class crawlthread
3.3.3 Class identifyToken
3.3.4 Class connectionThread
3.3.5 Class urlTokenServer
3.3.6 Class connectionTimer
3.3.7 Class theCount
3.3.8 Class cookie
3.3.9 Class GUI
3.3.10 Class theDatabase
3.3.11 Class theDomain
3.3.12 Class theDepthCheck
3.3.13 Class theQueue
3.3.14 Class reporter
3.4 Designing the software
3.4.1 Web bug detection
3.4.2 ActiveX objects detection
4 VALIDATION AND RESULTS
4.1 Validation of web bugs
4.2 Validation of ActiveX objects
4.3 Results
5 CONCLUSIONS AND FUTURE WORK
BIBLIOGRAPHY
LIST OF FIGURES
Figure 3.1: Web crawler architecture
Figure 3.2: A B+ tree
Figure 3.3: System Architecture of Internet Demographics
Figure 3.4: Internet Demographics GUI
Figure 3.5: Code snippet to detect web bugs
Figure 3.6: Code to detect ActiveX objects
Figure 4.1: Screenshot of the results from www.af.mil
Figure 4.2: Screenshot of the results from www.cnn.com
Figure 4.3: Screenshot of the results from www.indiatimes.com
Figure 4.4: Screenshot of the results from www.navy.mil
Figure 4.5: Screenshot of the results from http://grail.sourceforge.net
1 INTRODUCTION
1.1 Background
In today's world, one can shop online for clothes, tickets, etc., schedule payments of
bills, and transfer money between different banks. Sensitive information such as credit
card details, bank routing and account numbers, and social security numbers is used to
complete these transactions. Most websites that carry out these transactions are secure.
But sometimes, websites are monitored by certain programs and the user's online activity
is communicated to a third party. These programs are called adware and spyware. They are
also known as malware or trackware. They can be present in web pages
as a hidden addition to a legitimate program that is being downloaded from a website, or
can be directly installed on a computer.
Spyware and adware may be written so as not to reveal their existence on web pages.
They take advantage of users' consent to install some piece of legitimate software. For
example, Audio Galaxy is a company that makes Napster-style file sharing software.
When users download Audio Galaxy, they also download VX2's spyware program.
Audio Galaxy is a legitimate piece of software. VX2 is not. This spyware program keeps
track of the websites a user visits and passes on user information to the company's
servers [Benner 2002].
Adware is software that is usually free. When it is executed, it displays
advertisements from the Internet. The advertisements can be viewed through pop-up
windows or through search bars that appear at the bottom of the screen. The justification
for adware is that it helps recover programming development costs through ad revenue.
For example, Google's Blogspot service contains JavaScript that tries to convince users
to install software. Pop-up boxes appear to come from a website, iWebTunes.com. This
website gives bloggers music to add to their blogs or other websites. When a user views
such a blog, iWebTunes attempts to install extra programs on the user's machine. These
programs pay iWebTunes a commission for every installation made [Edelman 2005].
Software components used for tracking and reporting user information are included in
most adware. These components collect web browsing history, online purchasing
behavior, and an inventory of the hardware and software of the computer they run on.
After collecting information from the user, companies target users with specific
advertisements based on browsing history.
Spyware is software that is installed on the computer without the user's knowledge. It
exists as an independent executable program, unlike adware. It is designed to track the
surfing habits of a user and to collect personal information such as credit card numbers,
the games the user plays, the software being used, and keystrokes, all without the user's
permission [Zhang 2005], [Anonymous 2004], [Daniels 2004], [Doyle 2003] and [Taylor
2002]. The software sends this information back to its origin (the creator's servers), where
it is collected [Urbach et al. 2004] and possibly used for identity theft operations including
password harvesting and credit card number theft [Radcliff 2004]. Sometimes, the
software hides the information on the user's hard drive for later retrieval [Doyle 2003].
Surveillance spyware is used by law enforcers and industrial spies, or by corporations
to keep a check on their employees. This software monitors and records keystrokes or
web activity, or occasionally captures an image of the monitor screen. That information is
then either e-mailed to the spying party or hidden on the hard drive for later retrieval
[Ferrer and Mead 2003].
Spyware is often concealed within another application. Some web pages contain code that
downloads and installs spyware, usually through exploits. Spyware is installed without
the user's consent, as a drive-by download, or as the result of clicking some option in a
pop-up window [Thompson 2005]. Drive-by downloads are applications that install
themselves on computers without the user's knowledge during visits to websites. Drive-by
downloads range from programs capable of remote monitoring and reporting to actual
Trojans with remote administration capabilities [Schwartz et al. 2004].
When someone installs adware and spyware components, they are often assigned a
unique identifier by the software. Thus, the program can track the user and target
advertisements catered to the user's choices. For example, if a user is looking to buy a new
house, the software provides advertisements for housing and home loans, as well as
additional pop-up ads to match his needs. Some users like to view advertisements
targeted to their needs; however, these users are not aware that the information collected
by these marketers is then sold to third parties. The information can include the user's
name, address, email address and any other information the marketers may have gathered
from the user's personal profile.
1.2 Problem Statement
Adware and spyware are not illegal; however, a user who is worried about his
privacy may want to know what private information is being divulged without his
consent. The issue of concern is the use of the user's Internet connection to
track and send data and statistics via a server installed on the user's computer or the
user's client. Most of the legitimate adware and spyware companies disclose in their
privacy statement the nature of data that is collected and transmitted. But there is no way
a user can control the kind of data that is being sent.
Spyware wastes bandwidth, interferes with the programs running on the machine
and uses disk space. Some spyware is also known to cause crashes and stability problems
on users' computers [Digital Insight Security Bulletin 2005]. Other spyware poses a
serious security risk by opening a backdoor on the system, which makes it possible to
secretly install additional software.
The web is very accessible. As long as users do not face a problem in
accessing the web, they are not much concerned about the kind of software being used on
the web. Most users are unaware of the adware or spyware embedded in web pages. Users
hence have no statistics on how many web pages contain this kind of software. The goal of
this research is to identify what percentage of web pages contain adware or spyware.
The main work is identifying the various kinds of adware or spyware embedded
in web pages and how they can be detected in web pages. A web crawler with the ability
to crawl through a representative number of web pages is used to search for adware or
spyware embedded in them.
This work will benefit people who use the Internet. Most people are unaware of
the spyware and adware that gets downloaded to their machine when they are using the
web [Zhang 2005]. This work will give people an idea of what is present on the web.
2 RELATED WORK
2.1 Background Information
Millions of users have been impacted by the World Wide Web. The WWW is so
popular that various kinds of software are integrated into it. This software usually gives
information to a user. Sometimes, it also takes information from the user. Users most
often give their information willingly, but some software does not ask the user for any
information. It takes the information without the user's knowledge. This information
is usually bought by a third party who uses it to his advantage.
Malware is short for malicious software. Malware requires special conditions if it
is to execute and produce the intended results. The code most software vendors produce
nowadays is not carefully designed or tested. The result is software with many
vulnerabilities, which gives perpetrators a chance to execute malware that exploits
them [Skoudis and Zeltser, 2003]. In some cases, malware gets installed on a
computer without the user's knowledge. It can change browser settings as well as system
settings, with potentially harmful effects on the computer.
2.1.1 Web Bug
A Web Bug, also called a Web Beacon, is usually invisible to the user. It is a
graphic image of size 1-by-1 pixel found on a web page or in an email message [Doyle
2003]. Web bugs are represented by HTML IMG tags. The web bug might allow a third
party to send pop-up ads or just collect demographics. Usually, the web bug is loaded
from a different web server than the remainder of the page. This is how a web bug can be
differentiated from a normal 1-by-1 pixel image [Smith 2003]. A web bug reveals
the IP address of the machine that opened the page containing it. It also sends
information about the type of the browser that opened it, the time it was viewed and the
URL of the web page that had the bug.
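The distinguishing rule described above (a 1-by-1 pixel IMG tag served from a different host than the page) can be sketched in Java as follows. This is only an illustrative sketch, not the detector developed in this thesis: the class name WebBugSketch, the regular-expression parsing, and the example.com URLs are all assumptions, and a real page would need a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not the thesis's own code): flags 1-by-1 IMG tags
// whose image is served from a different host than the page itself.
final class WebBugSketch {
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the value of one attribute (quoted or unquoted), or null if absent.
    static String attribute(String tag, String name) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(name)
                        + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tag);
        if (!m.find()) {
            return null;
        }
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    // Returns the absolute URLs of all suspected web bugs found in the HTML.
    static List<String> findWebBugs(String pageUrl, String html) {
        URI page = URI.create(pageUrl);
        List<String> bugs = new ArrayList<>();
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            String src = attribute(tag, "src");
            if (src == null) {
                continue;
            }
            boolean onePixel = "1".equals(attribute(tag, "width"))
                    && "1".equals(attribute(tag, "height"));
            URI image;
            try {
                image = page.resolve(src.trim()); // handles relative src values
            } catch (IllegalArgumentException e) {
                continue; // skip malformed src values
            }
            String imageHost = image.getHost();
            boolean otherServer = imageHost != null
                    && !imageHost.equalsIgnoreCase(page.getHost());
            if (onePixel && otherServer) {
                bugs.add(image.toString());
            }
        }
        return bugs;
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://ads.example.net/p.gif?id=42\" width=\"1\" height=\"1\">"
                + "<img src=\"/images/spacer.gif\" width=1 height=1>";
        System.out.println(findWebBugs("http://www.example.com/index.html", html));
    }
}
```

In this sketch a 1-by-1 spacer image served by the page's own host is not reported, which mirrors the differentiation rule attributed to [Smith 2003].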
If the web bug is embedded in an email, the image is requested when the user
reads the email for the first time. It can also be requested every time the user opens the
email again.
When a web page with a bug is downloaded, the server where the page resides
stores the IP address of the computer requesting the page. This information can be
retrieved from the server log files. When files are transferred using the Hypertext
Transfer Protocol, web bugs send the server their URL, and the URL of the page
containing them. The URL of the page containing the bug allows the server to determine
which particular Web page the user has accessed. The URL of the bug can be appended
with an arbitrary string in various ways while still identifying the same object. This extra
information can be used to better identify the conditions under which the bug has been
loaded. This information can be added while sending the page or by JavaScript after the
download.
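To illustrate how an arbitrary string appended to a bug's URL can carry extra information, the following Java sketch splits the query part of a bug URL into name/value pairs, as a receiving server might. The class name BugUrlSketch, the parameter names, and the tracker.example.net URL are hypothetical and chosen only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reads the extra information appended to a web bug URL.
final class BugUrlSketch {
    // Returns the query parameters of the bug URL in their original order.
    static Map<String, String> extraInfo(String bugUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(bugUrl).getRawQuery();
        if (query == null || query.isEmpty()) {
            return params; // plain image URL with nothing appended
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    // Percent-decodes one component; UTF-8 is always supported by the JVM.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                extraInfo("http://tracker.example.net/pixel.gif?page=home&campaign=spring%20sale"));
    }
}
```

The same image file is identified in every case; only the appended parameters differ, which is what lets the server tell apart the conditions under which the bug was loaded.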
The following is an example of a web bug found on Quicken's home page,
www.quicken.com:
This bug provides to DoubleClick, an Internet advertising company, information about
the number of visitors [Smith 2003].
2.1.2 Adware Networks
When companies want online publicity, they approach software developers or
web sites and pay them to allow their advertisements to be displayed when people use
their software. The advertising systems that deliver these ads are called adware networks
[BLEEPING COMPUTER]. These ads are generally in the form of pop-ups. The problem with
these networks is that they place cookies on the computer each time the user opens an ad
served by the particular network. This allows the advertising network to track the user's
movements across the Internet by reading the information contained in the cookies every
time a user connects to a site. Networks that employ this method include DoubleClick
[www.doubleclick.com], Value Click [www.valueclick.com], Gain
[www.gainpublishing.com], and Radiate [www.cexx.org/aureate.htm].
A very good example of an adware network is 180solutions.com, one of the
world's largest adware networks. Programs from 180solutions monitor users' activities,
transmit information about the web sites that the user visits to the company's server, and
show targeted pop-up ads that cover the targeted web sites. 180solutions programs also
overwrite affiliate commissions so that 180solutions receives payments from merchants
when users make online purchases.
Benjamin Edelman, an assistant professor at Harvard Business School, conducted an
extensive study and published the results on his web site. In his study, he showed
that when he browsed to www.delta.com, 180solutions' software
showed an ad for Hawaiian Airlines that covered almost the entire delta.com web page
that opened on his machine. He further found that the advertisers who
sponsored 180solutions paid as little as $0.015 per display of their ads
[Edelman 2004].
2.1.3 Backdoor Santas
Users download programs from the Internet. On the surface, these programs appear
valid. However, they collect statistics on computer usage, browsing history, and the
hardware a computer uses, and transmit the information back to servers. These programs
are called Backdoor Santas. A Backdoor Santa is a stand-alone program that gathers user
information. It bypasses normal security controls to give the attacker access to useful and
potentially valuable data; hence the name Backdoor Santa [Sipior et al. 2005].
A good example of a Backdoor Santa is a novelty cursor representing a seasonal
icon or the likeness of Dilbert or a Peanuts character. When a program to make the
customized cursor is downloaded, a Globally Unique Identifier (GUID) is issued. This
GUID helps the provider's servers to record, without the user's permission, logs of cursor
impressions and cursor themes, the identity of referrers, Internet Protocol (IP) addresses,
and system information. The providers sell this data to clients to inform them how
many users have customized cursors obtained from certain websites [Tipton and Krause
2006], [Sipior et al. 2005] and [Smith 1999].
Programs of this type are distributed by Comet Cursor [www.cometcursor.com],
Alexa [www.alexa.com], Hotbar [www.Hotbar.com], and Cuteftp [www.cuteftp.com].
2.1.4 Trojan Horse
The Trojan horse is named after the Trojan horse tactic in Greek history:
something unknown and unexpected is delivered to the user in the form of a package,
which the user normally accepts [Pastore 2002]. These types of software are often
popular programs and are usually free downloads. Trojan horses install
programs that can be contacted by remote machines, which then take control of the
user's machine. Email attachments are a popular mechanism for delivering Trojan horses
[Mikusch 2003].
A Trojan horse program masquerading as an advertising application was included
with versions of the programs BearShare [www.bearshare.com], LimeWire
[www.limewire.com], Kazaa [www.kazaa.com] and Grokster [www.grokster.com]. The
Trojan, called "W32.Dlder.Trojan", is found within an application called
"ClickTillUWin" which promises users a chance to win prizes [Borland 2002]. The
Trojan file Dlder is installed when users set up the file sharing applications. After
installation, this Trojan downloads an additional file called explorer.exe from the website
2001-007.com and installs it in the system folder. It then creates a startup key for the
explorer.exe file. The next time the system is restarted, the Trojan connects to
the 2001-007.com website. It keeps track of the user's web activity and reports this to a
web server.
In some situations, if a user chooses not to install the program containing the
Trojan horse, he/she will not be able to use the main program. Examples of such programs
are KaZaA Media Desktop [www.kazaa.com], Grokster [www.grokster.com], and
Morpheus [morpheus.com].
2.1.5 Browser Hijackers
Browser hijackers change the default web page setting of the user's browser
without permission. The software changes the default homepage to another homepage no
matter how many times a user changes it back. It does this by making changes to the system
registry [Mikusch 2003]. Sometimes Internet shortcuts are added to the Favorites
folder of the web browser without the user's permission. The purpose is to force
the user to visit a web site of the hijacker's choice so that the hijacker can inflate the
web site's traffic for higher advertising revenues [Healan 2005].
In 2005, AOL was labeled a browser hijacker. AOL placed its web site
free.aol.com in Internet Explorer's trusted sites security zone, thus bypassing the most
frequently used security settings. This occurred only after a user installed the AOL
software. After that, AOL downloaded ActiveX components to the computer without the
user's consent. These components led to pornographic web sites [Healan 2005].
One of the most infamous hijackers known to date is CoolWebSearch. It registers
Winres.dll under the Windows directory and then changes the Start page to about:blank.
It downloads and installs other search programs such as 2020search and isearch, which pay
CoolWebSearch a fee in exchange for a visitor's use of their search program
[Spywareguide 2007].
2.1.6 Dialers
Dialer is a colloquial term for dialing software. This software gets installed on a
user's computer and has the ability to make phone calls from the computer if a modem is
connected to it. These programs connect to other computers through the phone line
without the user's permission. Most of the numbers dialed are not toll-free;
consequently, the user gets charged for the amount of time the computer is connected
[Shukla and Nah 2005], [Pastore 2002].
A good example is the Fairtale Dialer. A computer user downloads a software
package. Among the programs is a dialer application that was not mentioned in any of the
licenses or advertisements associated with the package. The dialer application is not an
integral part of the software package. When the user opens the Web browser after
installation of the software, the dialer opens in a hidden window, turns off the sound of
the user's computer, and calls a phone number without the user's permission [Internet
Security Services]. Once installed, it makes long distance calls or calls to 900 and
976 phone numbers without asking for the user's approval. These calls are usually to adult
pay-per-minute phone services. Thus, the user pays for the call even though he did not
make it.
2.2 Related Work
Alexander Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy from
the University of Washington conducted a crawler-based study of spyware on the web
[Moshchuk et al. 2006]. In this experiment, the group studied and confirmed the
existence of spyware. The objective of this study was to quantify the amount of spyware
in executable web content and the number of web pages that contained embedded
drive-by download attacks. Such programs get downloaded by exploiting a bug in the web
browser or the operating system.
To conduct the study, they used the Heritrix public domain web crawler. The data
captured was analyzed using a virtual machine (VM) and Ad-Aware, an anti-spyware
product from Lavasoft. Once installed, this product reduces the risks of pop-up ads,
browser hijacks and theft of any information from that machine. It uses a technology that
can detect known and unknown variants of malware. The web crawler crawled sites from
eight different categories: adult entertainment, gaming, music, celebrity, screensaver,
children zones, online news, and pirate sites. For each of these categories, web sites were
selected using Google directory and key-word searches specific to that particular
category.
Their study was divided into two parts. In the first part, almost 20 million URLs
were crawled in search of executable content. They reached the conclusion that adware
constituted the largest portion of spyware. The first part also showed that gaming sites and
sites that allowed wallpaper downloads had more spyware in them than the other
categories. The second part of their study involved three crawls of 45,000 URLs in the
eight web categories. In this part of the study, they examined drive-by download attacks
over a period of 5 months.
This study has a few limitations. The results were based on samplings of web
pages from Google-selected domains and URLs in the eight categories. Using an
anti-spyware tool like Ad-Aware allows the detection of only what that tool considers a
threat. Overall, the study was able to quantify the nature of the spyware threat and thus
confirmed the existence of spyware on the web.
In another work carried out at the University of Washington, the authors used
passive network monitoring to measure the extent to which four specific adware
programs had affected computers in the university [Saroiu et al. 2004]. This work used
the honeypot technique. A honeypot is a closely monitored network that can provide
early warning about new attacks and exploitation trends, and allow detailed examination
of adversaries during and after the attack. Since physical honeypots are time consuming
and expensive to set up, Niels Provos developed a framework for virtual honeypots
called Honeyd, which simulates computer systems at the network level [Provos 2004].
The Strider HoneyMonkey project developed by the Microsoft Research team is
also inspired by honeypot techniques. The system the researchers built is called the
HoneyMonkey Exploit Detection System. It consists of a pipeline of monkey
programs running possibly vulnerable browsers on virtual machines [Wang et al. 2006].
A monkey program is a program that drives a browser in a way that mimics a human
user's operation. These virtual machines have different patch levels, and they patrol the
web to seek and classify web sites that exploit browser vulnerabilities. The HoneyMonkey
project report focuses more on the construction and design of the tool.
Quite a few commercial anti-spyware companies have used web crawlers to find
new spyware on the web. Webroot's Phileas system uses a cluster of computers to scan
web content for known threats and for patterns that suggest new browser threats [Webroot
Software Inc, 2007]. Sunbelt Software has also built a web crawler, SPECTRE, that
identifies new spyware outbreaks [Sunbelt Software 2007].
3 APPLICATION DEVELOPMENT
3.1 System Architecture
When users surf the Internet, they have little knowledge of how much malware is
there. A study of the amount of malware on the Internet will show users what
technology they are encountering in their daily lives. To study the web, a stable, powerful,
and accurate web crawler is needed. Once the crawler produces results, they need to
be stored for later retrieval. JDBM is a package that is used for this purpose. It stores data
using a hash table and a B+ tree data structure. In order to study the data collected, the
results need to be imported into a database such as MS Access. A statistical analysis of the
data will help people determine which malware is found most often in web
pages.
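As a rough illustration of the store-then-analyze pipeline described above, the Java sketch below keeps one result record per crawled URL in a sorted in-memory map and computes the percentage of pages that contain each kind of object. It is a sketch under stated assumptions only: java.util.TreeMap stands in for a persistent B+ tree index and is not the JDBM API, and the class name CrawlResultsSketch and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the persistent result store.
// A TreeMap keeps keys sorted, as a B+ tree index would, but it is NOT JDBM.
final class CrawlResultsSketch {
    // What the crawler found on one page.
    static final class PageResult {
        final boolean hasWebBug;
        final boolean hasActiveX;

        PageResult(boolean hasWebBug, boolean hasActiveX) {
            this.hasWebBug = hasWebBug;
            this.hasActiveX = hasActiveX;
        }
    }

    private final Map<String, PageResult> byUrl = new TreeMap<>();

    // Stores (or overwrites) the result for one URL, so each page counts once.
    void record(String url, boolean hasWebBug, boolean hasActiveX) {
        byUrl.put(url, new PageResult(hasWebBug, hasActiveX));
    }

    int pageCount() {
        return byUrl.size();
    }

    double percentWithWebBug() {
        int hits = 0;
        for (PageResult r : byUrl.values()) {
            if (r.hasWebBug) {
                hits++;
            }
        }
        return percent(hits);
    }

    double percentWithActiveX() {
        int hits = 0;
        for (PageResult r : byUrl.values()) {
            if (r.hasActiveX) {
                hits++;
            }
        }
        return percent(hits);
    }

    // Avoids division by zero when nothing has been crawled yet.
    private double percent(int hits) {
        return byUrl.isEmpty() ? 0.0 : 100.0 * hits / byUrl.size();
    }

    public static void main(String[] args) {
        CrawlResultsSketch results = new CrawlResultsSketch();
        results.record("http://www.example.com/", true, false);
        results.record("http://www.example.org/", false, false);
        System.out.println(results.percentWithWebBug() + "% of pages contain a web bug");
    }
}
```

Keying the store by URL means that a page reached twice during a crawl is counted only once in the percentages.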
3.1.1 Web crawler
In order to detect adware and spyware on the web, the program needs a
web crawler that can search the web thoroughly. A web crawler is a program that
browses the World Wide Web in a systematic fashion. A crawler resides on a single
machine. Web crawlers start by parsing a specified web page, noting any hypertext links
on that page that point to other web pages. They then parse those pages for new links, and
so on, recursively. Figure 3.1 shows a generic web crawler architecture.
Figure 3.1: Web crawler architecture (the crawler starts from a seed URL, checks for
particular tags, reads content, finds links, and follows links)
Search engines such as Google use web crawlers [Hawking 2006]. These
engines operate multiple data centers that are distributed throughout the world, which
provides fault tolerance. Fault tolerance is important because, should one agent crash, the
other agents will decide which of them should fetch a given host. Within a data center, PCs are
clustered according to the services they provide. Clusters are dedicated to specialized
functions, such as crawling, query processing, and result caching. Currently, the amount
of web data that search engines crawl and index is on the order of 400 terabytes [Hawking
2006].
Certain terms used frequently while discussing web crawlers need to be defined.
A URL or Uniform Resource Locator is a web page address. The term crawling refers to
traversing the web by recursively following links from a seed. A seed is a URL provided
at the start of the crawl.
The simplest web crawling algorithm uses a queue of URLs yet to be visited and a
fast mechanism for determining whether it has already seen a URL. The crawler initializes the
queue with one or more seed URLs. Crawling proceeds by making an HTTP request to
fetch the page at the first URL in the queue. When the crawler fetches the page, it scans
the contents for links to other URLs and adds each previously unseen URL to the queue.
The crawler saves the page content for indexing. Crawling continues until the queue is
empty.
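The queue-based algorithm above can be illustrated with a minimal sketch. It is not the IDemo implementation: the class name SimpleCrawler, the method crawl, and the in-memory map that stands in for the web (so that no real HTTP requests are made) are assumptions introduced only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class SimpleCrawler {
    // Crawls breadth-first from the seeds and returns the URLs in visit order.
    // The map "web" (URL -> links found on that page) is a hypothetical stand-in
    // for "send an HTTP request and scan the page for links".
    static List<String> crawl(List<String> seeds, Map<String, List<String>> web) {
        Queue<String> queue = new ArrayDeque<>(); // URLs yet to be visited
        Set<String> seen = new HashSet<>();       // fast "already seen?" check
        List<String> visited = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.remove();          // first URL in the queue
            visited.add(url);                     // a real crawler saves the page here
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {             // enqueue only unseen URLs
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://a.example/", "http://c.example/"),
            "http://c.example/", List.of("http://d.example/"));
        System.out.println(crawl(List.of("http://a.example/"), web));
    }
}
```

The set of seen URLs is what keeps the crawl from looping when pages link back to each other, as the pages a.example and b.example do in this example.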
Web-crawler software does not actually move around to different
computers on the Internet, as viruses or intelligent agents do. As noted above, a crawler
resides on a single machine. It simply sends HTTP requests for documents to other
machines on the Internet, just as a web browser does when the user clicks on links. All
the crawler really does is automate the process of following links.
3.1.2 JDBM
JDBM is a transactional persistence engine, used here by the crawler. It stores
objects and binary large objects (BLOBs), and all updates are done in a transaction-safe
manner. It provides data structures such as the B+ tree to support persistence of large
object collections [Groot 2000].
A B+ tree is a specialized tree designed to branch out in a number of directions
such that the height of the tree is relatively small. A B+ tree of order M is an M-ary tree
that has the data items stored in its leaves and whose root is either a leaf or has between 2
and M children. There are two types of nodes. The internal nodes contain key values and
node pointers. The leaf nodes contain key and record pointer pairs. Each internal node is
designed to fit into one I/O block of data. Hence, an internal node can hold many keys.
Each internal node except the root has between ⌈M/2⌉ and M children, and therefore
between ⌈M/2⌉-1 and M-1 keys. Figure 3.2 sketches a B+ tree with keys stored in the
nodes.
In a B+ tree, data records are only stored in the leaves. If a target key is less than a
key in an internal node, then the pointer just to its left is followed. If a target key is
greater than or equal to the key in the internal node, then the pointer just to its right is
followed. The leaves are also linked together so that all of the keys in the B+ tree can be
traversed in ascending order, just by going through all of the nodes in this linked list
along the bottom level of the tree [Carlson 2007].
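The search rule and the linked leaves described above can be sketched as follows. This is an illustrative in-memory structure, not JDBM's BTree: the class BPlusTreeSketch, its Node type, and the hand-built sample tree are assumptions, and insertion and node splitting are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

class BPlusTreeSketch {
    static class Node {
        final List<Integer> keys = new ArrayList<>();
        final List<Node> children = new ArrayList<>();  // used by internal nodes
        final List<String> records = new ArrayList<>(); // used by leaves
        Node next;                                      // leaves only: right sibling

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    // Builds a leaf whose record for key k is the string "record-k".
    static Node leaf(int... keys) {
        Node n = new Node();
        for (int k : keys) {
            n.keys.add(k);
            n.records.add("record-" + k);
        }
        return n;
    }

    // Descends from the root: go left of the first key greater than the target,
    // i.e. to the right of every key that is less than or equal to it.
    static String search(Node root, int target) {
        Node node = root;
        while (!node.isLeaf()) {
            int i = 0;
            while (i < node.keys.size() && target >= node.keys.get(i)) {
                i++;
            }
            node = node.children.get(i);
        }
        int pos = node.keys.indexOf(target);
        return pos < 0 ? null : node.records.get(pos);
    }

    // Walks the leaf-level linked list to list all keys in ascending order.
    static List<Integer> keysInOrder(Node root) {
        Node node = root;
        while (!node.isLeaf()) {
            node = node.children.get(0);
        }
        List<Integer> out = new ArrayList<>();
        for (; node != null; node = node.next) {
            out.addAll(node.keys);
        }
        return out;
    }

    // Hand-built two-level tree: root keys [20, 40] over three linked leaves.
    static Node sample() {
        Node l1 = leaf(5, 10), l2 = leaf(20, 30), l3 = leaf(40, 50);
        l1.next = l2;
        l2.next = l3;
        Node root = new Node();
        root.keys.add(20);
        root.keys.add(40);
        root.children.add(l1);
        root.children.add(l2);
        root.children.add(l3);
        return root;
    }
}
```

In the sample tree, a search for key 20 follows the pointer to the right of the root key 20 and finds the record in the middle leaf, while a search for 25 reaches the same leaf and returns null.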
Figure 3.2: A B+ tree
3.2 Architecture of the program
Figure 3.3 shows the system architecture of the IDemo application. The
application is composed of seven core components that interact with each other in
order to crawl the web and record data. The GUI component is
responsible for all user-interface operations of the application. Apart
from displaying the results and allowing the user to configure run-time
parameters such as the number of threads and the total number of links, the GUI also
receives the seed URL for the IDemo component from the user. On
receiving the seed URL, the IDemo component, which controls the entire application,
sets up the other components: the queue component (theQueue) to store the URLs
found during the crawl, the depth check component (theDepthCheck) to store the depth of
the parsed URLs for each domain crawled, and the database component (theDatabase) to
store the parsed URLs as well as their statistics. IDemo then spawns the specified number
of connection threads (CThread) to crawl the web starting from the seed URL received.
The connectionThreads then crawl the web and record its demographics in the database.
For each valid URL retrieved from the queue, the connectionThread checks whether the
depth check fails for that domain. If it does not, the connectionThread parses the URL for known
technologies and records the results in the database. URLs discovered during such parses
are inserted into the queue for later retrieval. The connectionThread continues this
fetch-check-parse-record cycle until either it has recorded the specified number of URLs in
the database or the queue component has run out of URLs. In the latter case, the user is
prompted to enter a seed URL again. In the former case, the crawl has run to
completion.
Upon completion, the user can generate a report of the IDemo run by invoking the
reporter component. The reporter reads the database populated by the connectionThreads
and outputs a text file containing technology statistics recorded for every URL in the
database. This file can then be used by the user to analyze the demographics of the
Internet.
Figure 3.3: System Architecture of Internet Demographics
3.3 Understanding the program
The program contains several classes. Understanding these classes is essential
to understanding how the program works.
3.3.1 Class IDemo
This is the root class. It is the main controller of the entire application. It is
responsible for creating the queue, the database and the depth checker. When the queue is empty, IDemo
creates the GUI object so that the user can enter the seed URL. When the "Start" button is
clicked on the GUI, IDemo creates and starts the specified number of crawl threads. If the
queue becomes empty at any point, it prompts the user to enter a seed URL again.
When the program ends, it closes all the data structures created.
3.3.2 Class crawlthread
This class is the worker bee of this application. It makes use of two variables: one to
name the thread and the other to keep track of the current thread number. When started,
the thread does the following:
• Retrieves a new URL from IDemo's queue.
• Converts the URL to a proper URL by removing hex symbols from the URL and
concatenating a "/" to URLs with no path.
• Makes sure that the user-specified depth limit is not exceeded.
• Creates a new connection to the URL obtained through the previous steps.
If the established connection is good, the following steps occur:
• Makes sure that the URL has MIME type text/html.
• Makes sure that the URL is not already in the database.
• Makes sure that the URL does not have a null domain.
• Makes sure that the user-specified depth limit is not exceeded.
• Creates a token server and passes the token stream of the above URL to it.
For every token retrieved, it does the following:
• Tries to identify the token.
• Checks if the token is a link whose technology can be identified.
• Makes sure that the link is a valid URL.
• Makes sure that the depth or the max limits are not exceeded.
• Finally, makes sure that the link is not already in the database before inserting it
into the queue.
If the established connection was good, it adds the URL entry to the depth tree and
adds the URL to the database. Otherwise, it discards the entry. The statistics on the GUI are
updated. It also checks whether the thread needs to be killed.
3.3.3 Class identifyToken
This class is used to parse all the tokens returned from the token server. It is also used
to maintain statistics on the various technologies found in a given URL. The constructor
identifyToken does the following:
• Checks to see if the no-follow flag is set on the token.
• Checks to see if the token returned indicates an applet.
• Checks to see if the token returned indicates a form.
• If the token begins with "base":
o Extracts the base URL enclosed along with "href".
o Stores the base URL if it starts with "http".
• If the token begins with "script":
o Checks to see if the script present is javascript or vbscript.
• If the token begins with "object":
o Checks to see if the application type is "activex".
• If the token begins with "img":
o Checks to see if the dimensions are 1x1 in order to find "web bugs".
• Invokes method lookforlinks().
The method lookforlinks does the following:
• Extracts the link from tokens with tags like a, area, link, frame and iframe.
• Ignores links to extensions such as .jpg, .mpg, .gz, etc.
• Resolves relative links.
• Makes sure that the links are of type "http" only.
• Invokes countTechnologies() for the candidate link in order to update stats and
examines what is returned.
• If countTechnologies() was able to determine the link technology, the isLink flag
is set to true.
For a given candidate link, the method countTechnologies checks to see if the link
has
• ActiveX
• Web bugs
If the technology on the given link has been found, it returns NULL. Otherwise, it
returns a crawl link.
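The link-filtering steps of lookforlinks (ignore certain extensions, resolve relative links, keep only "http" links) can be sketched with the standard java.net.URI class. The class LinkFilterSketch, the method candidateLink, and the three-entry skip list are hypothetical; the thesis names .jpg, .mpg and .gz only as examples, and the actual IDemo code may differ.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class LinkFilterSketch {
    // Hypothetical skip list built from the extensions the text gives as examples.
    static final Set<String> SKIPPED_EXTENSIONS = Set.of(".jpg", ".mpg", ".gz");

    // Returns the absolute form of href if it is a crawlable http link,
    // or an empty Optional if the link should be ignored.
    static Optional<String> candidateLink(String baseUrl, String href) {
        URI resolved;
        try {
            // Resolves relative links such as "news.html" against the page URL.
            resolved = URI.create(baseUrl).resolve(href.trim());
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // malformed link
        }
        if (!"http".equalsIgnoreCase(resolved.getScheme())) {
            return Optional.empty(); // only links of type "http" are followed
        }
        String path = resolved.getPath() == null
                ? ""
                : resolved.getPath().toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED_EXTENSIONS) {
            if (path.endsWith(ext)) {
                return Optional.empty(); // link to a non-HTML resource
            }
        }
        return Optional.of(resolved.toString());
    }
}
```

For example, with the page URL http://example.com/a/index.html, the relative link news.html resolves to http://example.com/a/news.html, while photo.jpg and a mailto: link are both rejected.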
3.3.4 Class connectionThread
This class is responsible for establishing an HTTP connection to the given URL. When
started, this thread does the following:
• Opens an HTTP connection to the URL, taking redirections into account.
• If there is already a set of cookies for the URL, it retrieves them and sets them in
the HTTP connection request property.
• Establishes the connection to the URL.
• Handles cookies.
• Opens a buffered reader to the URL and sets up a tokenizer to tokenize the stream
based on angle brackets.
3.3.5 Class urlTokenServer
This class returns HTML tokens from the input stream. The method returnToken does
the following:
• Returns NULL only if the end of stream is encountered.
• Ignores comments if encountered.
• Ignores new lines while returning a token.
• Returns tokens enclosed between angle brackets one at a time.
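The behaviour listed above can be approximated with a small tokenizer over a java.io.Reader. This is a sketch under assumptions, not the urlTokenServer source: the class name TagTokenizerSketch, the helper tokens, and the exact comment handling (skip everything from "<!--" to "-->") are illustrative choices.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TagTokenizerSketch {
    private final Reader in;

    TagTokenizerSketch(Reader in) {
        this.in = in;
    }

    // Returns the text between the next '<' and its closing '>' with line
    // breaks removed, skipping comments; returns null at end of stream.
    String returnToken() throws IOException {
        while (true) {
            int c;
            do {
                c = in.read();                    // skip text outside tags
            } while (c != -1 && c != '<');
            if (c == -1) {
                return null;
            }
            StringBuilder token = new StringBuilder();
            boolean comment = false;
            while ((c = in.read()) != -1) {
                if (c == '\n' || c == '\r') {
                    continue;                     // ignore new lines
                }
                boolean endsWithDashes = token.length() >= 2
                        && token.charAt(token.length() - 1) == '-'
                        && token.charAt(token.length() - 2) == '-';
                if (c == '>' && (!comment || endsWithDashes)) {
                    break;                        // end of tag, or "-->" of a comment
                }
                token.append((char) c);
                if (token.length() == 3 && token.toString().equals("!--")) {
                    comment = true;               // a '>' inside a comment is not a tag end
                }
            }
            if (c == -1) {
                return null;                      // stream ended inside a tag
            }
            if (!comment) {
                return token.toString();
            }
            // The token was a comment: loop and look for the next tag.
        }
    }

    // Convenience helper: collects all tokens of an HTML string.
    static List<String> tokens(String html) {
        TagTokenizerSketch t = new TagTokenizerSketch(new StringReader(html));
        List<String> out = new ArrayList<>();
        try {
            for (String tok = t.returnToken(); tok != null; tok = t.returnToken()) {
                out.add(tok);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Each returned token is the content of one tag, such as img src="x.gif" width=1, which a class like identifyToken can then inspect by its leading tag name.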
3.3.6 Class connectionTimer
This class sets up a timer for a specified number of seconds. When the timer starts, it
sets a flag called "loop" to indicate that the timer is running. When the timer has expired, the
flag is turned off.
3.3.7 Class theCount
This class maintains a variable called "countOfPages" to keep track of the page
count. The functionality of this class could in fact be accomplished with a
simple static variable. The class contains only basic operations: set, get and
increment of the "countOfPages" variable.
3.3.8 Class cookie
This class is used to store session-related information for a given connection. The
class consists entirely of set() and get() methods for values such as name,
information, domain, path, secure and expiry.
3.3.9 Class GUI
This class is responsible for handling all the UI related operations of the
application. Figure 3.4 illustrates the GUI in operation:
Figure 3.4: Internet Demographics GUI
The UI allows the user to start and stop the crawl. It lets the user configure the number
of threads to run, the timeout value for each connection, the total number of links
to be parsed and the maximum depth of the tree. If the queue is empty, the user is prompted
to enter a seed URL. The value entered is then inserted into the queue.
The progress of the whole operation is displayed in the progress bar at the bottom
of the screen. The number of links in the database, in the queue and in the depth
checker, and the total links examined, are all displayed in the "Statistics" frame.
The class also handles typical window operations such as minimize, maximize and
close. The window close operation also triggers IDemo's data structure cleanup.
3.3.10 Class theDatabase
This is the main data structure where all the parsed URLs and their associated
technologies are stored. It uses the JDBM RecordManager and BTree classes to
implement a database. First, it creates a record manager for the indicated file ("dbasefile").
Then it checks whether a persistent BTree for that record manager already
exists. If so, that tree is loaded from disk. If a persistent BTree does not
exist, a new one is created and a reference to the tree is stored in the record manager
using setNamedObject().
Insertion is done using dinsert(), which does the following:
• Creates a database entry for the given URL and technology stats array.
• Stores the entry under the URL in the tree, making sure that
duplicate entries are not allowed.
• Commits the transaction for the record manager.
The method dsize() returns the current size of the tree. The dsearch() method is used
to search for an entry in the database. If found, the technology stats array associated with
the URL is returned.
3.3.11 Class theDomain
The methods in this class are used to determine the domain name of the given URL.
By default, all unknown domains are set to "???" and all numeric domains are set
to "999". The function parseTLD() does the following:
• Ignores the initial string of "http://".
• Gets the name of the domain from the substring found before the first "/" and after
the first ".".
• Checks to see if the domain is numeric or not.
• Returns the domain string if a valid domain is found, NUMERIC_DOMAIN if the domain is
numeric and UNKNOWN_DOMAIN if the domain is unknown.
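A literal reading of the parseTLD() steps can be sketched as below. The class name DomainSketch and the numeric test (digits and dots only) are assumptions; the two constants take the default values stated in the text. Note that this literal reading keeps everything after the first "." of the host, so a sketch that took the text after the last "." instead would return only the top-level domain.

```java
class DomainSketch {
    static final String UNKNOWN_DOMAIN = "???";
    static final String NUMERIC_DOMAIN = "999";

    static String parseTLD(String url) {
        // Ignore the initial "http://".
        String rest = url.startsWith("http://") ? url.substring("http://".length()) : url;

        // Keep what comes before the first "/" (the host part).
        int slash = rest.indexOf('/');
        String host = slash >= 0 ? rest.substring(0, slash) : rest;

        // Keep what comes after the first "." of the host.
        int dot = host.indexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return UNKNOWN_DOMAIN;          // no dot, or nothing after it
        }
        String domain = host.substring(dot + 1);

        // Assumption: a domain made only of digits and dots counts as numeric.
        if (domain.matches("[0-9.]+")) {
            return NUMERIC_DOMAIN;
        }
        return domain;
    }
}
```

Under these assumptions, http://www.auburn.edu/index.html yields auburn.edu, an IP-address URL such as http://192.168.0.1/x yields "999", and http://localhost/ yields "???".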
3.3.12 Class theDepthCheck
This class is used to control the number of URLs that can be retrieved from a
domain. It defines the upper limit on the URLs that can be retrieved from a given domain.
It uses the JDBM RecordManager and BTree classes to implement a database. First, it
creates a record manager for the indicated file ("depthfile"). Then, it checks whether a
persistent BTree for that record manager already exists. If so, that tree is
loaded from disk. If a persistent BTree does not exist, a new one is created and a
reference to the tree is stored in the record manager using setNamedObject().
The depthInsert() method is used to insert an entry into the database. It checks to
see if an entry for the URL's domain already exists. If there is no existing entry, then a new entry is
created and the count is set to 1. Otherwise, the count is retrieved, incremented by one
and stored again, overwriting the previous entry. The transaction for the record manager is
then committed.
The method depthSize() returns the current size of the tree. The depthSearch()
method is used to search for an entry in the database. If found, it returns the number of
instances of a particular domain that are found in the tree. The findDomain() method is
used to extract the host out of the given URL.
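The counting logic of theDepthCheck can be illustrated with an in-memory stand-in. The sketch below replaces the persistent JDBM tree with a HashMap, so it has no transactions or commits; the class name DepthCheckSketch, the limitReached helper, and the constructor parameter are assumptions, while the method names depthInsert, depthSearch, depthSize and findDomain follow the text.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class DepthCheckSketch {
    // In-memory stand-in for the persistent tree: host -> number of URLs recorded.
    private final Map<String, Integer> countByHost = new HashMap<>();
    private final int maxPerDomain;

    DepthCheckSketch(int maxPerDomain) {
        this.maxPerDomain = maxPerDomain;
    }

    // Extracts the host out of the given URL (empty string if there is none).
    static String findDomain(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // Creates the entry with count 1, or increments the existing count.
    int depthInsert(String url) {
        return countByHost.merge(findDomain(url), 1, Integer::sum);
    }

    // Number of URLs recorded so far for the URL's host (0 if none).
    int depthSearch(String url) {
        return countByHost.getOrDefault(findDomain(url), 0);
    }

    // Number of distinct hosts recorded.
    int depthSize() {
        return countByHost.size();
    }

    // Hypothetical helper: true once the per-domain upper limit has been reached.
    boolean limitReached(String url) {
        return depthSearch(url) >= maxPerDomain;
    }
}
```

With a limit of 2, inserting two URLs from the same host makes limitReached return true for any further URL on that host, while URLs from other hosts are still accepted.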
3.3.13 Class theQueue
Technically, this is not a queue, as insertions and deletions are not done in any
particular order. This class uses the JDBM RecordManager and BTree classes to
implement a database. First, it creates a record manager for the indicated file ("queuefile").
Then it checks whether a persistent BTree for that record manager already
exists. If so, that tree is loaded from disk. If a persistent BTree does not
exist, a new one is created and a reference to the tree is stored in the record manager
using setNamedObject().
The qinsert() method is used to insert a URL into the queue. It inserts a URL into
the database without duplicates. It also performs a mass commit once
100 entries have been inserted. This is done to improve performance.
The method qsize() returns the current size of the tree. The qsearch() method is
used to search for an entry in the database. If the entry is found, it returns true; otherwise, it returns false.
The qretrieve() method is used to retrieve an entry from the queue. It retrieves an
entry from the tree, converts it into a URL, removes the entry from the
tree and returns the URL if these operations succeed. Otherwise, it returns
NULL.
3.3.14 Class reporter
This class is used to convert the statistics collected in the "dbasefile" database into a
human-readable format. It uses the JDBM RecordManager and BTree classes to
access the database. First, it creates a record manager for the indicated file ("dbasefile").
Then, it checks whether a persistent BTree for that record manager already
exists. If so, that tree is loaded from disk. If the persistent BTree is not
found, it terminates, reporting an error. It prompts the user to enter the path and filename
where the output is to be stored. For each node found in the tree, the URL and its domain,
along with the technology statistics, are written to the file.
3.4 Designing the software
For the purpose of this thesis, web bugs and embedded objects are the two
technologies being detected.
3.4.1 Web bug detection
Typically, web bugs are images of size 1 by 1 pixel that are loaded from a
different server than the remainder of the web page. However, during the study of web
bugs, certain images whose sizes were not 1 by 1 pixel were also found to be loaded from a
different server than the domain server. These images were found to behave like a typical
web bug. Also, some of the web bugs, both 1 by 1 pixel and M by N pixels, were
found to contain query strings. URL query strings can be used to pass information to the
server in order to perform functions such as displaying different data,
entering a different mode and changing the display format, among many other things. When a
query string is present in the SRC field of an image tag whose display size is set either to
1 by 1 or to M by N pixels, the image can reasonably be counted as a web bug.
So, there are three cases of web bug detection that can be used to find most
web bugs.
Case 1: A 1 by 1 pixel image with no query string
Since a web bug is always an image, detecting one requires examining
the IMG tag. The web bug has specific dimensions, that is, height = 1 and width = 1.
Figure 3.5 shows the code snippet used to detect these web bugs. The code
relevant to this case is indicated by the case number. The variable lowercasetoken holds a
token, i.e., a string value from the HTML code of a web page. If the image has the
specified height and width of one pixel, a successful match is obtained. The corresponding
element of techArray is incremented; techArray stores the image details, such as the web page address,
the type of web page, etc.
Case 2: A 1 x 1 image with a query string
A 1 x 1 image that has a query string in the SRC presents a stronger case for a web bug.
The query string indicates that some kind of information exchange is taking place. In
Figure 3.5, the code under case 2 is relevant to this type of web bug.
Case 3: An M x N image with a query string
An M x N image that is loaded from a different server than the web page itself
would not necessarily be considered a web bug. But if the URL has a query string, then it could
be a web bug. These web bugs are also counted. The code for this
case is shown in Figure 3.5 under case 3.
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
// Patterns for height/width attributes equal to 1: quoted (heightq, widthq)
// and unquoted (height, width). Note that with find(), the unquoted patterns
// also match values that merely begin with 1, such as height=10.
Pattern heightq = Pattern.compile("height\\s*=\\s*\"{1}1\"{1}\\s*");
Pattern widthq = Pattern.compile("width\\s*=\\s*\"{1}1\"{1}\\s*");
Pattern height = Pattern.compile("height\\s*=\\s*1{1}\\s*");
Pattern width = Pattern.compile("width\\s*=\\s*1{1}\\s*");
Matcher hmatchq = heightq.matcher(parameters);
Matcher wmatchq = widthq.matcher(parameters);
Matcher hmatch = height.matcher(parameters);
Matcher wmatch = width.matcher(parameters);
if ((hmatch.find(0) && wmatch.find(0)) || (hmatchq.find(0) && wmatchq.find(0))) {
    if (lowercasetoken.indexOf("?") == -1) {
        // CASE 1: 1x1 image whose link does not contain a query string
        techArray[11]++;
        System.out.println("Caught a web bug with 1x1!!!\n");
        return;
    } else {
        // CASE 2: 1x1 image whose link contains a query string
        techArray[12]++;
        System.out.println("Caught a web bug with 1x1 and query !!!\n");
        return;
    }
} else {
    // CASE 3: MxN image whose link contains a query string
    if (lowercasetoken.indexOf("?") != -1) {
        techArray[13]++;
        System.out.println("Caught a web bug with MxN and query !!!\n");
        return;
    }
}
Figure 3.5: Code snippet to detect web bugs
The results from these three cases will help in analyzing what percentage of web
pages contains each category of web bug.
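The three cases can also be expressed as a small self-contained classifier, which makes the decision logic easier to try out on sample tags. This is a sketch, not the IDemo code: the class WebBugSketch, the enum Kind and the method classify are hypothetical names, the patterns are a variant that rejects sizes such as 10 that merely begin with 1, and, like the snippet in Figure 3.5, the sketch does not check whether the image is loaded from a different server.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class WebBugSketch {
    enum Kind { ONE_BY_ONE, ONE_BY_ONE_WITH_QUERY, M_BY_N_WITH_QUERY, NONE }

    // Accept height=1, height="1" or height='1', but not height=10.
    private static final Pattern HEIGHT_1 =
            Pattern.compile("\\bheight\\s*=\\s*[\"']?1(?!\\d)");
    private static final Pattern WIDTH_1 =
            Pattern.compile("\\bwidth\\s*=\\s*[\"']?1(?!\\d)");

    // Classifies the text of an IMG tag into one of the three cases, or NONE.
    static Kind classify(String imgTag) {
        String tag = imgTag.toLowerCase(Locale.ROOT);
        boolean onePixel = HEIGHT_1.matcher(tag).find() && WIDTH_1.matcher(tag).find();
        boolean hasQuery = tag.indexOf('?') != -1;
        if (onePixel) {
            // Case 2 if a query string is present, otherwise case 1.
            return hasQuery ? Kind.ONE_BY_ONE_WITH_QUERY : Kind.ONE_BY_ONE;
        }
        // Case 3: any other size, counted only when a query string is present.
        return hasQuery ? Kind.M_BY_N_WITH_QUERY : Kind.NONE;
    }
}
```

For instance, a tag with width=1 height=1 and no "?" falls under case 1, the same tag with src ending in ?id=42 falls under case 2, and a 120 by 60 image with a query string falls under case 3.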
3.4.2 ActiveX objects detection
ActiveX objects are inserted in web pages using the