INTERNET DATA ACQUISITION, SEARCH AND PROCESSING

by Sandeep Neeli

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
Dec 18, 2009

Keywords: Data Mining, Citations, Data Processing

Copyright 2009 by Sandeep Neeli

Approved by:
Bogdan Wilamowski, Chair, Professor of Electrical and Computer Engineering
Thaddeus Roppel, Associate Professor of Electrical and Computer Engineering
John Hung, Professor of Electrical and Computer Engineering

Abstract

Internet data acquisition is the process of extracting essential data from any web server. Semi-structured data present in the form of HTML web pages must be extracted and converted into structured data before being presented to users. This thesis presents four tools that perform data acquisition, data search and data processing: GradeWatch, Ethernet Robot, the Online Search Tool and the Citations Explorer Tool. GradeWatch is a tool for faculty of an academic institution to post grades online and for students to check them. Ethernet Robot extracts paper details for the IEEE Transactions on Industrial Electronics from the IEEE web server using the Perl scripting language and processes the data using regular expressions. Using the paper database created by Ethernet Robot, an Online Search Tool was developed which can perform a search up to a depth of three keywords and present the results on a separate web page, from which users can also download the papers through the links provided. The Citations Explorer is a program that returns the most cited papers of the IEEE Transactions on Industrial Electronics for a particular year; it uses Google Scholar for the search and Perl regular expressions to process the data. The design of all these tools involves fetching, filtering, processing and presentation of the required data. The resulting HTML files containing the required data are displayed for the perusal of users. Future enhancements to Ethernet Robot include optimization to improve performance and customization for use as a sophisticated client-specific search agent.

Acknowledgments

I am heartily thankful to my supervisor, Dr. Wilamowski, whose encouragement, guidance and support from the initial to the final level enabled me to develop an understanding of the subject. I would like to thank my committee members, Dr. Hung and Dr. Roppel, for accepting my request to be on my thesis committee. I would like to thank my family members for encouraging me to pursue this degree and Arthi Kothandaraman for providing me all the support. Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of this thesis.

Table of Contents

Abstract
Acknowledgments
List of Illustrations
1 Introduction
1.1 Phases of Automatic Web Data Extraction
1.2 Categories of Data used in Web Data Extraction
1.3 Current Trend
1.4 Pros and Cons
1.4.1 Pros
1.4.2 Cons
1.4.3 Engineering Constraints
2 GradeWatch
2.1 Overview
2.2 GradeWatch System Design
2.3 GradeWatch User Interface
2.4 Viewing Results
3 Data Processing from IEEE Xplore - Ethernet Robot
3.1 Overview
3.1.1 An Example
3.2 Data Collection
3.3 Data Filtering
3.4 Data Processing
3.5 Data Presentation
4 The Online Search Tool
5 Citations Explorer Tool
5.1 Overview
5.2 Data Acquisition
5.3 Data Filtering
5.4 Data Processing
5.5 Data Presentation
6 Conclusions
6.1 Future Work
Bibliography
Appendices
A Perl Source Code
A.1 Ethernet Robot - IEEE Web Data Extraction

List of Illustrations

1.1 Structure of various documents [1]
2.1 Data Flow in GradeWatch
2.2 User Interface
2.3 Course progress report of a student - user x
3.1 IEEE Xplore webpage depicting various Data Fields
3.2 Flowchart depicting the four stages of Ethernet Robot
3.3 Output of wget implementation in Perl: Example.htm
3.4 Execution of the Ethernet Robot Perl code
3.5 Resultant web page
3.6 Data Presentation
4.1 Search Interface to download required papers
4.2 Web page displaying the search results
5.1 Sample page generated by Google Scholar
5.2 Final web page for most cited papers for the year 2007

Chapter 1
Introduction

The meteoric rise of the World Wide Web as the knowledge powerhouse of the 21st century has led to a tremendous growth in the information available to the masses. This, in turn, means that useful information is all the more time-consuming to narrow down or locate in the huge mass of available data. In other words, with an ever-growing knowledge base, there is a pressing need to extract useful information efficiently and quickly. Acquisition of structured data from a pile of unstructured documents is called Data Extraction (DE). Web data extraction is the process of extracting information or data from the World Wide Web (WWW) and manipulating it according to user constraints. A brief overview of web data extraction is given below, and an example model of web data extraction based on these features is presented in the coming sections.

1.1 Phases of Automatic Web Data Extraction

The Web data extraction process can be divided into four distinct phases [20, 21, 26]:

1. Collecting Web data - Includes past activities as recorded in Web server logs and/or via cookies or session tracking modules. In some cases, Web content, structure, and application data can be added as additional sources of data.

2. Preprocessing Web data - Data is frequently pre-processed to put it into a format that is compatible with the analysis technique to be used in the next step. Preprocessing may include cleaning the data of abnormalities, filtering out irrelevant information according to the goal of analysis, and completing the missing links (due to caching) in incomplete clickthrough paths. Most importantly, unique sessions need to be identified from the different requests, based on a heuristic such as requests originating from an identical IP address within a given time period.

3. Analyzing Web data - Also known as Web Usage Mining [22, 23, 24], this step applies machine learning or data mining techniques to discover interesting usage patterns and statistical correlations between web pages and user groups. This step frequently results in automatic user profiling, and is typically applied offline so that it does not add a burden on the web server.

4. Decision making/Final Recommendation Phase - The last phase in web data extraction makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last web page requested by the user. This can be accomplished using a variety of Web technology options such as CGI programming.

1.2 Categories of Data used in Web Data Extraction

The Web data mining process depends on one or more of the following data sources [25, 26]:

1. Content Data - Text, images, etc., in HTML pages, as well as information in databases.

2. Structure Data - Hyperlinks connecting the pages to one another.

3. Usage Data - Records of the visits to each web page on a website, including time of visit, IP address, etc. This data is typically recorded in Web server logs, but it can also be collected using cookies or other session tracking tools.
4. User Profile - Information about the user, including demographic attributes (age, population, etc.) and preferences that are gathered either explicitly or implicitly.

The input file of a Data Extraction (DE) task may be structured, semi-structured, or free-text. As shown in Fig. 1.1, the definition of these terms varies across research domains. Free texts, e.g., news articles, that are written in natural languages are considered unstructured [28]; postings on newsgroups (e.g., apartment rentals), medical records and equipment maintenance logs are semi-structured; while HTML pages are structured. According to database researchers [29], the information stored in computer databases is known as structured data; XML documents have the schema information mixed with the data values and are hence semi-structured data. Web pages in HTML are unstructured because there is very limited indication of the type of data. XML documents are considered structured since there is an XML schema available to describe the data. Free texts are unstructured since they require substantial natural language processing. The huge quantity of HTML pages on the Web is semi-structured [30], since the embedded data are often exchanged through HTML tags. One source of these large semi-structured documents is the deep Web, which includes dynamic Web pages that are generated from structured databases with some templates or layouts. For example, the set of book pages from eBay has the same layout for the authors, title, price, comments, etc. A page class is a set of Web pages that are formed from the same database with the same template. Semi-structured HTML pages can also be generated by hand. For example, the publication lists on various researchers' homepages all have a title and source for each single paper, though they are produced by different people.

1.3 Current Trend

Current tools that enable data extraction or data mining are both expensive to maintain and complex to design and use, due to several potholes such as differences in data formats, varying attributes and typographical errors in input documents [1]. An Extractor or Wrapper is one such tool, which can perform the data extraction and processing jobs. Wrappers are special program routines that automatically extract data from Internet websites and convert the information into a structured format. Wrappers have three main functions (a minimal sketch of this cycle appears below):

1. Download HTML pages from a website.
2. Search, recognize and extract specified data.
3. Save this data in a suitably structured format to enable further manipulation [2].

Figure 1.1: Structure of various documents [1].

The data can then be imported into other applications for additional processing. Wrapper induction based on inductive machine learning is the leading technique available nowadays. The user is asked to label or mark the target items in a set of training pages or a list of data records in one page, and the system learns extraction rules from these training pages. Inductive learning poses a major problem: the initial set of labeled training pages may not fully represent the templates of all other pages, and the learned rules perform poorly on pages that follow templates not covered by the labeled pages. This problem can be solved by labeling more pages, because more pages cover more templates. However, manual labeling requires a large supply of labor, is time-consuming, and still gives unsatisfactory coverage of all possible templates.
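As a concrete, if simplified, illustration of the wrapper concept, the following minimal Perl sketch runs the download-extract-save cycle end to end; the URL, the record pattern and the output file name are placeholder assumptions, not part of any system cited here.

#!/usr/bin/perl
# Minimal wrapper sketch: download a page, extract records, save them.
# The URL, the <td>-based record pattern, and the output file name are
# illustrative assumptions only.
use strict;
use warnings;
use LWP::Simple qw(get);

my $url  = "http://www.example.com/listing.html";    # placeholder source page
my $page = get($url) or die "download failed: $url";

# Extract: treat every table cell as one record (pattern is illustrative).
my @records = $page =~ m{<td[^>]*>\s*(.*?)\s*</td>}gis;

# Save: write the structured result, one record per line.
open my $out, ">", "records.txt" or die "cannot open records.txt: $!";
print $out "$_\n" for @records;
close $out;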
There are two main approaches to wrapper generation. The first and currently chief approach is wrapper induction; the second is automatic extraction. As discussed above, wrapper learning works as follows: the user first manually labels a set of training pages or data records in a list; a learning system then generates rules from the training pages; target items can then be extracted from new pages using these rules. Sample wrapper induction systems include WIEN [9], Stalker [10, 11, 12], BWI [13] and WL2 [14]. An analytical survey on wrapper learning [15] gives a family of PAC-learnable wrapper classes and their induction algorithms and complexities. WIEN [9] and SoftMealy [16] are earlier wrapper learning systems, which were later improved by Stalker [11, 10, 17, 12]. Stalker learns rules for each item and uses a more detailed description of rules. It treats the items separately instead of ordering them. Though this method is more flexible, it makes learning harder for complex pages since the local information is not fully utilized. Recent developments on Stalker add different active learning facilities to the system, which reduces the number of pages a user needs to label. Active learning allows the system to select the most useful pages for the user to label and hence reduces manual effort [18]. Other tools typically used are RoadRunner [3], WebOQL [4], Automated Data Extraction by Pattern Discovery [5], etc.

Every day there is an exponential increase in the amount of information that enters the Internet. Though this increases the possibility of finding a particular object, it also means a proportionate increase in search time. Tools for data extraction should therefore be developed with a view to reducing search time while keeping up with the growth of the Internet. In an attempt to serve this need, we present a new method of data extraction, called Ethernet Robot, which makes use of the Perl scripting language and the free non-interactive download utility wget.exe.

Notable features of the Perl language, which form the core of our Ethernet Robot, are discussed below. Perl is a prominent web programming language because of its text processing features and continuing improvements in usability, features and execution speed. Handling HTML forms is made simple by the CGI.pm module, a part of Perl's standard distribution. Perl can handle encrypted Web data, including e-commerce transactions, and can be embedded into web servers to speed up processing by as much as 2000%: mod_perl allows the Apache web server to embed a Perl interpreter [6]. Perl also has a powerful regular expression engine built directly into its syntax. A regular expression, or regex, is a syntax that eases operations involving complex string comparisons, selections, replacements and, hence, parsing. Regexes are used by many text editors, utilities and programming languages to search and manipulate text based on patterns. Regular expressions are used extensively in our method to keep the code compact and powerful. The combination of Perl, regular expressions and wget makes Ethernet Robot an efficient solution for accelerated data downloading and extraction.
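As a small, self-contained illustration of this engine (the sample string and the pattern below are ours, not taken from Ethernet Robot), the following snippet uses capture groups to pull the volume, issue and page range out of a citation-style string:

use strict;
use warnings;

my $line = "IEEE Trans. on Industrial Electronics, vol. 54, no. 3, pp. 1417-1425";
# Capture groups: $1 = volume, $2 = issue, $3 = page range.
if ($line =~ m/vol\.\s*(\d+),\s*no\.\s*(\d+),\s*pp\.\s*([\d-]+)/) {
    print "volume $1, issue $2, pages $3\n";   # prints: volume 54, issue 3, pages 1417-1425
}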
Ethernet Robot and its functionality are described in chapter 3. A complete description of GradeWatch, an online grade posting system, is given in chapter 2. An online search engine based on the data extracted by Ethernet Robot is discussed in chapter 4. Finally, the online citations explorer tool, its operation and results are discussed in chapter 5, followed by conclusions and future work in chapter 6.

1.4 Pros and Cons

1.4.1 Pros

Web data extraction has many advantages for corporations and government agencies, and hence they are its main users. This technology has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies use this technology to analyze threats and fight terrorism. Society can benefit from the predictive potential of this technology, for instance in identifying criminal activities. Companies can establish better customer relationships by giving customers exactly what they need, can understand customer needs better and react to them faster, and can improve profitability through target pricing based on the profiles created.

1.4.2 Cons

Web data extraction technology can cause concern when used on data of a personal nature. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests.

1.4.3 Engineering Constraints

Several websites do not allow robots to crawl through their pages and grab information, since large-scale crawling reduces the performance of their systems. One of the tools described in this thesis is Ethernet Robot, which was used to extract data from the IEEE server systematically. IEEE has issued a "No Robots Policy" which states that downloading a database, or any portion of a publication's issue or volume, in a systematic fashion is strictly prohibited, and that the use of robots or intelligent agents on their site is a violation of the subscription license agreement. Creative solutions have to be developed to work within these restrictions.

Chapter 2
GradeWatch

2.1 Overview

GradeWatch is a web-based database that allows students to check their progress in the courses they take using a web browser. Perl is used to design the interface, which is connected to the Internet by a web server [36]. A student needs to be kept up to date with the feedback earned on homework and projects, as this supports the student's learning. Typically, different types of work and projects carry different weights toward the final grade, so it is difficult to keep the student informed of his or her current standing. One simple approach is to use spreadsheets, but this costs effort and time for both the instructor and the student. This motivates a database that is easy to use by both instructors and students. Though every university has numerous database systems, they are usually restricted to a limited group of authorized people or staff, and a student generally cannot access them online at any time. Therefore, keeping in mind all the students who take the course, and the need for a multi-browser compatible webpage, the GradeWatch database was created [31]. The database can be organized however the instructor chooses, and the system is easy to use for both the instructor and the student.
2.2 GradeWatch System Design

In the development of the software, the division of work between the client machine and the server machine had to be resolved. When requested by the client machine, applets are dispatched through the network and executed entirely on the client machine; such an applet would have to query a database server located at the same machine where the web site is hosted [35]. Secure HTTP can be used if data security is a priority, which avoids the need to encrypt data separately. The server executes instructions based on the given information and sends the results back to the local machine that made the request [32][33]. Fig. 2.1 shows the program component division and data flow in the application. The user interface is programmed in HTML enhanced with JavaScript. The data flow on the server is handled by a Perl script. The script accesses the databases, verifies access authorization, and generates a report containing the student's grades to date, which is sent back to the client machine as a web page [34]. A student can either use the webpage or call a CGI script to access the database.

Figure 2.1: Data Flow in GradeWatch.

The CGI script dynamically generates an HTML page listing the courses with available databases. To retrieve grades, a student must supply the course number, his or her family name, and the student ID number as a password. The details are matched against the selected course; if authorization fails, the user is denied access. Other features of GradeWatch include:

1. Multiple access to the databases for the instructors.
2. E-mail notification to each student whenever there is an update in the spreadsheet/database.
3. Sending grades individually to any specific student by e-mail.
4. Automatic setup of a mailing list, which can easily be passed to any e-mail client program.

2.3 GradeWatch User Interface

To access the database, a person needs to provide the course number, family name, and student ID as a password. The database front end is a web page with a CGI form, as shown in Fig. 2.2. The CGI form takes the input provided by the user and transfers the data to the server, where it is received by the CGI script, written in Perl. The program first verifies the course against the list of courses and the instructor's database. It then processes the database for the corresponding course and retrieves the information required to produce a progress report. The report is produced as an HTML page that is sent back to the web server, which redirects it to the client - a web browser. The exact data shown depends on the layout of the grade database file for the specific course; a condensed sketch of this handling is given after Fig. 2.2. If an instructor maintains the database, more functionality is available: an instructor can access the grades of any student by entering the student's family name and the instructor's password, or can access all grades at the same time by providing the instructor's own name in the name field.

Figure 2.2: User Interface.
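The sketch below follows the flow just described, but the field names, the grade-file layout and the password check are illustrative assumptions, not the actual GradeWatch source.

#!/usr/bin/perl
# Sketch of the GradeWatch CGI flow: read the form fields, authorize the
# request, and emit a grade report as HTML. Names and layout are assumed.
use strict;
use warnings;
use CGI qw(param header);

my $course = param('course');        # course number from the CGI form
my $name   = param('family_name');   # student's family name
my $id     = param('student_id');    # student ID used as the password

print header(-type => 'text/html');

# Authorize: look the student up in the course's grade file
# (assumed layout: one "family_name:id:grade:grade:..." line per student).
open my $db, "<", "grades/$course.txt" or die "unknown course: $course";
my ($row) = grep { /^\Q$name\E:\Q$id\E:/ } <$db>;
close $db;

if ($row) {
    chomp $row;
    my (undef, undef, @grades) = split /:/, $row;
    print "<h2>Progress report for $name ($course)</h2>\n<ul>\n";
    print "<li>$_</li>\n" for @grades;
    print "</ul>\n";
} else {
    print "<p>Access denied: name or ID not found for course $course.</p>\n";
}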
2.4 Viewing Results

In the sample database, used both as an illustration and as a demo on the Internet, the instructor's name and password are both set to admin. After examining the report web page, the instructor has the option to send a grade report by e-mail to all students, to a particular student, or to a selected group. Optionally, a few lines of additional memo may be appended. A sample instructor's report for all students is shown in Fig. 2.3.

Figure 2.3: Course progress report of a student - user x.

Chapter 3
Data Processing from IEEE Xplore - Ethernet Robot

3.1 Overview

This section presents the implemented model of data extraction, which can draw only the necessary data from any web server on the Internet; it can further be developed into a powerful search engine/portal. Typically, a Data Extraction (DE) task is well defined by its input and its extraction target. The inputs are usually unstructured documents, such as the semi-structured documents present on the Web (tables or itemized lists) or free text written in natural language [7]. Our model of data extraction, Ethernet Robot, can be used to download and extract any kind of information present on the Internet according to the user's requirements.

3.1.1 An Example

We consider the example of extracting titles, together with the authors, pages and abstract URLs corresponding to the titles, from the IEEE Transactions on Industrial Electronics located on IEEE Xplore. The main aim of this example is to allow Associate Editors to search for reviewers, and Authors to search for paper references, in the corresponding IEEE Transactions. The Transactions has papers listed by year of publication, and each year has 6 issues. A screenshot of the Transactions is shown in Fig. 3.1, in which the boxes indicate the required data to be extracted and the inessential data, or junk, to be filtered out of each issue. Let us now see how we automatically download and extract the desired data from these pages.

Figure 3.1: IEEE Xplore webpage depicting various Data Fields.

Every Transactions on IEEE Xplore has a certain punumber; the Transactions on Industrial Electronics has punumber = 41. The generalized URL of an issue Z of year/volume Y is given by IEEE as:

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=Y&isno=Z

To download and extract the titles from volume 54, issue 3, the URL is:

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=54&isno=3

Each issue may span several pages 0, 1, 2, ..., each page being addressed by:

http://ieeexplore.ieee.org/servlet/opac?punumber=X&isvol=Y&isno=Z&page=P&ResultStart=Q

where page=P denotes the page number P and ResultStart=Q denotes the starting title number Q. For example, the URL

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=54&isno=3&page=1&ResultStart=25

is the link to the titles starting from number 26. Page P=0 of any issue contains the links/URLs of the remaining pages, as shown in Fig. 3.1, so the other pages can be fetched using the wget function and concatenated to page P=0 to form a single page containing all the paper listings. The following statement performs the concatenation:

$p = $p0.$p1.$p2;

where $p0, $p1 and $p2 are the pages into which the paper listings are divided, and $p denotes the webpage containing all the paper listings of an issue. The behavior of the tool is thus defined by the variables volume number Y, issue number Z and page number P. To download all the pages for, say, the years 2000 to 2006, loop conditions over these variables have to be included at the beginning of the main Perl code, as sketched below.
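A sketch of these loop conditions, modeled on the loops in the Appendix A listing, is shown below. The year mapping follows that listing's formula year = 1988 + ($Y - 35), so volumes 47 through 53 cover the years 2000 through 2006:

# Loop over the years 2000-2006: volume $Y corresponds to year 1988+($Y-35).
# Volumes 35 and 36 (1988-89) had only 4 issues; later volumes have 6.
for (my $Y = 47; $Y <= 53; $Y++) {
    my $issues = ($Y == 35 || $Y == 36) ? 4 : 6;
    for (my $X = 1; $X <= $issues; $X++) {
        my $addr = "http://ieeexplore.ieee.org/servlet/opac?"
                 . "punumber=41&isvol=$Y&isno=$X";
        my $issuepage = get_page($addr);   # get_page is defined in Appendix A
        # ... filtering and processing of $issuepage follow ...
    }
}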
This example model of data extraction (Ethernet Robot) extracts all the titles and corresponding data from the IEEE Transactions on Industrial Electronics. In order to extract all the data, the system needs to traverse all the paper list pages in the archive and then extract the titles and data from each paper list page. The code is devised to elicit the titles, authors, page numbers, abstracts and abstract links from IEEE Xplore. Next, Ethernet Robot goes to the webpage pointed to by the URL in each record and fetches the abstract. On completion of data acquisition, the raw data is printed into a new HTML file and published as a webpage.

Ethernet Robot carries out four stages: data collection, data filtering, data processing and data presentation on the web. A schematic representation of the sequence of steps is shown in Fig. 3.2. Of these, the data collection and filtering steps are relatively simple, whereas the data processing and presentation steps require more involved procedures. These steps are explained in greater detail in the following sections.

Figure 3.2: Flowchart depicting the four stages of Ethernet Robot.

3.2 Data Collection

As mentioned in the previous section, the desired data to be fetched are specific volumes of the IEEE Transactions on Industrial Electronics, so the starting point for this procedure is the Transactions webpage. We invoke the function get_page with the volume number as parameter. get_page grabs the web pages corresponding to the volume number and returns one page per issue for each issue of that year/volume. The content of each issue is represented by a single variable, $page. Every year/volume has 6 issues, each of which is represented by a single element in an array of 6 variables. The following invariant holds at any point during the operation of the code:

$p[$i] = $page for $i <= 6

where $p is the array of issue contents, $i is the issue number or iteration, and $page is the content of each issue.

3.3 Data Filtering

Each webpage held in the variable $p[$i] consists of various JavaScript fragments, HTML tags, tables and other miscellaneous information, irrelevant to our purposes, appended to the data we wish to extract. Hence, the content of the page needs to be filtered. The condition sketched below performs the proposed filtering operation, where the new variable $entry holds the required content between the opening and closing tags of each record's HTML table.
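A reconstruction of this filtering loop is shown below; it follows the listing in Appendix A, but the tag attributes inside the pattern are our reconstruction, since the published listing does not preserve them completely:

# Filtering: each paper record on the issue page sits in its own small HTML
# table; capture the cell content of every such table into $entry.
# (Tag patterns reconstructed.)
while ($issuepage =~ m{<table[^>]*>\s*<tr[^>]*>\s*<td[^>]*>\s*(.*?)\s*</td>\s*</tr>\s*</table>}gis) {
    my $entry = $1;    # one paper: title, authors, pages, links
    # ... per-record processing (section 3.4) follows here ...
}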
3.4 Data Processing

We now have the desired data in the variable $issuepage. All we need is to extract the items in an orderly fashion in accordance with the IEEE format of representation. A regular expression match (sketched at the end of this section and given in full in Appendix A) divides the captured groups $1 through $6 into titles, authors, abstract links and pdf file links, and assigns the data to the following variables:

$title = $1 - titles
$authors = $3 - authors
$labs = $4 - abstract links
$lpdf = $5 - pdf file links

We use wget.exe to obtain the abstracts from the abstract links. GNU Wget is a free utility for non-interactive download of files from the Web. It supports the http, https and ftp protocols, as well as retrieval through http proxies. Wget is non-interactive, which means that it can work in the background while the user is not logged on. This allows the user to start a retrieval and disconnect from the system, letting wget finish the work. In contrast, most Web browsers require the user's constant presence, which can be a great hindrance when transferring a lot of data [8].

The use of wget in Perl to fetch the webpage at www.ieee.org is shown below:

use strict;
my $addr = "http://www.ieee.org/portal/site";
system("wget.exe", "-q", "-O", "example.htm", $addr);

The output of the above implementation is displayed in Fig. 3.3.

Figure 3.3: Output of wget implementation in Perl: Example.htm.

The essence of Ethernet Robot lies in the wget.exe call, which downloads the desired data (abstracts and pdf files) from the web pages. wget.exe is responsible for the robotic behavior of our code, as it performs the automatic extraction of abstracts and pdf files. The subfunction get_page uses wget.exe to extract all the data from a webpage:

sub get_page {
    my ($addr) = @_;
    $addr =~ s/&amp;/&/g;
    my $fname = "54_1.htm";
    system("wget.exe", "-q", "-O", $fname,
           "--referer=http://tie.ieee-ies.org/tie/", $addr);
    my $page = get_file($fname);
    unlink($fname);
    return($page);
}

The wget parameter "-O $fname" specifies that the content of the webpage indicated by $addr will be written to the file $fname, i.e., 54_1.htm. Hence, at the end of the execution of the code, we have all the required data corresponding to the issues in 6 separate files per year/volume. The execution of the Perl code is shown in Fig. 3.4.

Figure 3.4: Execution of the Ethernet Robot Perl code.
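Putting the steps of this section together, the record-splitting match has roughly the following shape. The capture-group layout ($1 title, $3 authors, $4 abstract link, $5 pdf link) follows the assignments above, while the markup details inside the pattern are assumptions, since the full pattern is not reproduced in this chapter:

# Processing sketch: split one filtered record into its fields, then fetch
# the abstract page for the paper. Markup in the pattern is assumed.
if ($entry =~ m{<strong>(.*?)</strong>\s*      # $1: title
                ((.*?)<br>)?\s*                # $3: authors (optional)
                .*?href="([^"]+)"              # $4: abstract link
                .*?href="([^"]+)"              # $5: pdf link
               }isx) {
    my ($title, $authors, $labs, $lpdf) = ($1, $3, $4, $5);
    defined $authors or $authors = "";
    my $pageabstract = get_page($labs);        # wget-based fetch, as above
}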
3.5 Data Presentation

Processing the data obtained in the previous section results in a web page which lists all the papers with their titles, authors, abstract links and pdf links, together with the complete abstracts. The resulting web page is shown in Fig. 3.5.

Figure 3.5: Resultant web page.

The obtained files are then released onto the World Wide Web by presenting them as links on a website. The files can be accessed by authors, editors and reviewers for academic and professional use. A sample of the files shown on the web is given in Fig. 3.6. The data present in the issues can be further processed using Perl to group all the issues of a year/volume into one whole file. A sample website containing all the links for the desired data can be accessed at:

http://tie.ieee-ies.org/tie/abs/index.htm

Figure 3.6: Data Presentation.

Chapter 4
The Online Search Tool

After successfully extracting the data and storing it in our database, a search interface was developed which can display all the titles/papers for a set of up to three keywords taken from the title, the authors, or any of the words in the abstract. This tool is an addition to the Ethernet Robot, written in Perl/CGI, which allows the user to search through the entire database extracted earlier and display the search results in a separate webpage. The search can be further refined, or made selective, by choosing the appropriate radio buttons for the corresponding years. Fig. 4.1 shows the search interface created for the above example.

Figure 4.1: Search Interface to download required papers.

This search tool is included in the website hosted by our server through a form in the HTML page whose ACTION attribute tells the server to execute the search.cgi program when the Submit button is pressed; a sketch of the form and its handler follows below. The program search.cgi performs the data processing job again, fetching the files from the database on the server. param, a built-in function of the CGI module, is used in this program to acquire the user's data from the HTML form. For example, if the user enters "neural", "networks" and "motors" in the three query fields (keyword I, keyword II and keyword III respectively), all the titles and abstracts containing these keywords will be displayed in a separate webpage. Fig. 4.2 shows the resulting webpage after performing the search operation. The user can right-click the full text link to download and save the entire paper in pdf format; these links are served by our server, where the entire extracted database is present.
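A minimal version of such a form, and of the way search.cgi reads it with param, is sketched below. The field names (keyword1 through keyword3, year) and the flat-file database layout are illustrative assumptions:

#!/usr/bin/perl
# Sketch of the search front end: the form posts to search.cgi, which reads
# the query terms with param() and prints every matching database record.
use strict;
use warnings;
use CGI qw(param header);

print header(-type => 'text/html');
print <<'FORM';
<form action="search.cgi" method="post">
  Keyword I:   <input type="text" name="keyword1">
  Keyword II:  <input type="text" name="keyword2">
  Keyword III: <input type="text" name="keyword3">
  Year: <input type="radio" name="year" value="2006"> 2006
        <input type="radio" name="year" value="2007"> 2007
  <input type="submit" value="Submit">
</form>
FORM

# Handler side: keep only the records (one paper per line, title plus
# abstract) that contain every supplied keyword, case-insensitively.
my @terms = grep { defined && length } map { param("keyword$_") } 1 .. 3;
open my $db, "<", "papers/" . param('year') . ".txt" or die "no database: $!";
while (my $record = <$db>) {
    print "<p>$record</p>\n" if @terms == grep { $record =~ /\Q$_\E/i } @terms;
}
close $db;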
For the further development of this technique, we can create an online tool using the same core concept together with Google search which, when queried for an author or a title, gives details such as the total number of papers, total number of citations, average number of citations per paper, average number of citations per author, average number of papers per author, average number of citations per year, and the age-weighted citation rate. Such a tool could also be used to find the most cited paper and the most downloaded paper.

Figure 4.2: Web page displaying the search results.

Chapter 5
Citations Explorer Tool

5.1 Overview

The Citations Explorer Tool is a program that retrieves and analyzes academic citations. The tool, written in Perl, uses Google Scholar to obtain raw citations, then analyzes these and presents the following statistics:

Total number of papers
Total number of citations
Average number of citations per paper
Average number of citations per year

The design includes Perl code to search for the most frequently cited IEEE TIE (Transactions on Industrial Electronics) papers in a specific year and to display the obtained paper details on a separate webpage. The tool works similarly to the Ethernet Robot discussed in chapter 3, with the same four stages: data acquisition, data filtering, data processing and data presentation. Data acquisition is done through the Google Search API for Google Scholar. All the paper listings and their corresponding citation results are returned by the Search API in one web page. The data on this resultant web page are filtered and processed using regular expressions to present the data on a new web page.

5.2 Data Acquisition

Data acquisition is the core of this tool and is done by the Google AJAX Search API for Google Scholar. The Google AJAX Search API is a JavaScript library that allows us to embed Google Search in our web pages and other web applications. It provides simple web objects that perform inline search over the Google Scholar service. If a web page is designed to help users create content (e.g. citation analysis, message boards, blogs, etc.), the API is designed to support these activities by allowing them to copy search results directly into their messages. The search query in our case is the year we want the citation results for, followed by "Transactions on Industrial Electronics"; the tool asks for the search query and the year when executed.

Figure 5.1: Sample page generated by Google Scholar.

The resulting large collection of information is returned in a web page, part of which is shown in Fig. 5.1. The data in this webpage is semi-structured, since the embedded data are rendered regularly via HTML tags [30]. This semi-structured data contains the required information, such as the number of citations, titles, author names, volume number, issue number, month and year of publication, and page numbers.

5.3 Data Filtering

The purpose of data filtering is to assist the user in finding the one necessary piece of information; one of its advantages is an increase in processing speed. Most data may seem rather random, but one can usually find patterns, and based on these patterns we can build structures that let a user answer a few simple questions. With this information we can narrow down the data to a small result set, which the user can then quickly scan to find the one piece they need. Filtering permits varying views of the data stored in a dataset to be presented without actually affecting that data; the filter property is a string that defines the filter criteria. The webpage discussed above consists of various JavaScript fragments, HTML tags, tables and other miscellaneous information, irrelevant to our purposes, appended to the data we wish to extract. Hence, the content of the page needs to be filtered. The HTML tags and tables are removed with a regular expression given in the code (Appendix A), where $resultpage is the unfiltered page containing the required data and $filtereddata contains only the information required; the regular expression selects only the information present between the HTML tables of the web page and places it in the variable $filtereddata.

5.4 Data Processing

In data processing, the data is run through the algorithms; characteristics and variables are identified and categorized, thereby transforming the data into broader, more meaningful pieces of information. We now have the desired data in the variable $filtereddata. All we need is to extract the items in an orderly fashion in accordance with the IEEE format of representation. A regular expression match (Appendix A) divides the captured groups $1 through $6 into titles, authors, abstract links and pdf file links, and assigns the data to the following variables:

$title = $1 - titles
$authors = $3 - authors
$labs = $5 - abstract links
$lpdf = $6 - pdf file links

The number of citations for each paper is extracted using a regular expression matching "Cited by", as sketched below.
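The citation-count match can be sketched as follows; only the "Cited by" anchor text is fixed by the description above, while the markup of the Google Scholar result page around it is an assumption:

# Extract citation counts: Google Scholar marks each result with a
# "Cited by N" link; pair each result title with its count.
my %cited;   # title => citation count
while ($filtereddata =~ m{<h3[^>]*>.*?>([^<]+)</a>.*?Cited\s+by\s+(\d+)}gis) {
    $cited{$1} = $2;
}
# Rank papers by citation count, most cited first.
for my $t (sort { $cited{$b} <=> $cited{$a} } keys %cited) {
    print "$cited{$t}\t$t\n";
}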
5.5 Data Presentation

The final outcome of data processing is a web page containing the contents of the variables $title, $authors, $labs and $lpdf grouped together, representing the title, authors, abstract links and full text links respectively. Page numbers, volume number, issue number and the number of citations for each paper are displayed accordingly. The webpage containing the most cited papers in the Transactions on Industrial Electronics (TIE) for the year 2007, with some added HTML markup, is shown in Fig. 5.2.

Figure 5.2: Final web page for most cited papers for the year 2007.

Chapter 6
Conclusions

The tool Ethernet Robot delivers optimum performance, making it a unique tool for web data extraction. The tool stays within the constraints of the IEEE "No Robots Policy": the information is extracted and presented on a different website only for Associate Editors to search for reviewers and for Authors to search for paper references in the corresponding IEEE Transactions. GradeWatch is an efficient and platform-independent grade posting system. The Online Search Tool provides accurate results for a given set of search keywords. The Citations Explorer Tool, assisted by Google Scholar, presents an exhaustive list of papers and their citation counts. The use of the Perl scripting language, regular expressions and wget.exe for all the tools makes them accurate and advantageous. All the tools can be customized according to the data, and the format of the data, that a user desires.

6.1 Future Work

Improvements can be made at the data extraction, processing and presentation levels. Extraction of data from web pages can be improved by adding more pattern-matching regular expressions. Intelligent data processing, such as automatically generating a title or statement from a set of given keywords using genetic algorithms, is a possible and worthwhile improvement to Ethernet Robot. Data flow between the server and client can be simplified by using database software such as SQL or MySQL and the scripting language PHP to access the SQL databases.

Bibliography

[1] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis and Khaled F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, October 2006.

[2] Wrapper Definition. http://www.knowlesys.com/articles/web-data-extraction/wrapper_definition.htm

[3] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proc. 26th Int'l Conf. Very Large Database Systems (VLDB), pp. 109-118, 2001.

[4] G.O. Arocena and A.O. Mendelzon, "WebOQL: Restructuring Documents, Databases, and Webs," Proc. 14th IEEE Int'l Conf. Data Eng. (ICDE), pp. 24-33, 1998.

[5] C.-H. Chang, C.-N. Hsu, and S.-C. Lui, "Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery," Decision Support Systems J., vol. 35, no. 1, pp. 129-147, 2003.

[6] About Perl. http://www.perl.org/about.html

[7] I-Chen Wu, Jui-Yuan Su, and Loon-Been Chen, "A Web Data Extraction Description Language and Its Implementation," Proceedings of the 29th Annual International Computer Software and Applications Conference (COMPSAC'05).

[8] GNU Wget 1.11.4 Manual. http://www.gnu.org/software/wget/manual/wget.html

[9] Kushmerick, N.: Wrapper induction for information extraction. PhD thesis (1997). Chairperson: Daniel S. Weld.

[10] Muslea, I., Minton, S., Knoblock, C.: A hierarchical approach to wrapper induction. In: AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents. (1999) 190-197.

[11] Muslea, I., Minton, S., Knoblock, C.: Active learning with strong and weak views: A case study on wrapper induction. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003). (2003)

[12] Muslea, I., Minton, S., Knoblock, C.: Adaptive view validation: A first step towards automatic view detection. In: Proceedings of ICML 2002. (2002) 443-450.

[13] Freitag, D., Kushmerick, N.: Boosted wrapper induction.
In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. (2000) 577-583.

[14] Cohen, W., Hurst, M., Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. In: The Eleventh International World Wide Web Conference, WWW-2002. (2002)

[15] Kushmerick, N.: Wrapper induction: efficiency and expressiveness. Artif. Intell. (2000) 15-68.

[16] Hsu, C.N., Dung, M.T.: Generating finite-state transducers for semi-structured data extraction from the web. Information Systems 23 (1998) 521-538.

[17] Knoblock, C.A., Lerman, K., Minton, S., Muslea, I.: Accurately and reliably extracting data from the web: a machine learning approach. (2003) 275-287.

[18] Yanhong Zhai and Bing Liu, "Structured Data Extraction from the Web based on Partial Tree Alignment," accepted for publication in IEEE Transactions on Knowledge and Data Engineering, 2006.

[19] IEEE Xplore OPAC Linking. http://ieeexplore.ieee.org/xpl/opac.jsp

[20] Schafer, J.B., Konstan, J., and Riedl, J. (1999). Recommender Systems in E-Commerce. In Proc. ACM Conf. E-Commerce, 158-166.

[21] Mobasher, B., Cooley, R., and Srivastava, J. (2000). Automatic personalization based on web usage mining. Communications of the ACM, 43(8), 142-151.

[22] Spiliopoulou, M. and Faulstich, L. C. (1999). WUM: A Web Utilization Miner. In Proc. of EDBT Workshop WebDB98, Valencia, Spain.

[23] Nasraoui, O., Krishnapuram, R., and Joshi, A. (1999). Mining Web Access Logs Using a Relational Clustering Algorithm Based on a Robust Estimator. 8th International World Wide Web Conference, Toronto, 40-41.

[24] Srivastava, J., Cooley, R., Deshpande, M., and Tan, P-N. (2000). Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 12-23.

[25] Eirinaki, M., Vazirgiannis, M. (2003). Web mining for web personalization. ACM Transactions on Internet Technology (TOIT), 3(1), 1-27.

[26] Malinowski, A. and B.M. Wilamowski, "Compiling Computer Programs Through Internet," ITHET-2000 - International Conference on Information Technology Based Higher Education and Training, Istanbul, Turkey, July 3-5, 2000, pp. 343-348.

[27] Malinowski, A. and B. M. Wilamowski, "Internet Accessible Compilers in Software and Computer Engineering," International Conference on Simulation and Multimedia in Engineering Education (ICSEE'99), San Francisco, CA, January 14-15, 1999, pp. 221-224.

[28] S. Soderland, "Learning to Extract Text-Based Information from the World Wide Web," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 251-254, 1997.

[29] R. Elmasri and S.B. Navathe, Fundamentals of Database Systems, fourth ed., Addison Wesley, 2003.

[30] C. N. Hsu and M. Dung, "Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web," J. Information Systems, vol. 23, no. 8, pp. 521-538, 1998.

[31] Sweet, W. and Geppert, L., "http:// It has changed everything, especially our engineering thinking," IEEE Spectrum, January 1997, pp. 23-37.

[32] Gundavaram, S., CGI Programming on the World Wide Web, O'Reilly & Associates, Inc., 1996.

[33] Jamsa, K., Lalani, S., Weakley, S., Web Programming, Jamsa Press, Las Vegas, NV, 1996.

[34] Wall, L., Christiansen, T., Schwartz, R.L., Programming Perl, O'Reilly & Associates, Inc., 1996.

[35] Camposano, R., Deering, S., DeMicheli, G., Markov, L., Mastellone, M., Newton, A.R., Rabaey, J., Rowson, J., "What's ahead for Design on the Web," IEEE Spectrum, September 1998, pp. 53-63.
[36] Wilamowski, Bogdan and Malinowski, Aleksander, "GradeWatch - the Software Package Displaying Students' Grades on Web Pages," Proceedings ASEE Annual Conference 2000.

Appendices

Appendix A
Perl Source Code

The Perl programs implementing the tools presented in this thesis are given here.

A.1 Ethernet Robot - IEEE Web Data Extraction

The source code for extracting paper details from the IEEE Transactions on Industrial Electronics for a selected volume is presented here.

#----------------------- Ethernet Robot (erobo.pl) - Perl Source Code -----------------------
# NOTE: the HTML markup inside the regular expressions, heredocs and print
# statements below was partially lost when the thesis was converted to text;
# those fragments are reconstructed as minimal equivalents and flagged below.

for (my $Y = 53; $Y <= 53; $Y++)          # access all issues of volume 53
{
    my $t = 6;
    if ($Y == 35 || $Y == 36) { $t = 4; } # volumes 35-36 had only 4 issues
    for (my $X = 1; $X <= $t; $X++)       # issues 1 to 6
    {
        my $file = "C:/erobo/g/$Y-$X.htm";
        open (HTM, ">", $file) or Error('open', 'file');
        my $ad = "http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=$Y&isno=$X";
        my $issuepage = get_page($ad);    # store each issue webpage in a variable

        open(KK, ">", '22.htm');
        print KK $issuepage;
        close KK;
        open(KK, "<", '22.htm');
        my @rc = <KK>;                    # reread the issue page line by line
        close KK;

        my @L = ("", "", "", "");
        my $k = 0;
        foreach my $l (@rc)
        {
            # Collect the links to the remaining listing pages of this issue.
            if (($l =~ m/ResultStart/i) && (($l =~ m/page=1/i) || ($l =~ m/page=2/i)))
            {
                my $c = "HREF=";
                my $d = "class";
                my $a = index($l, $d);
                my $b = index($l, $c);
                $L[$k] = substr($l, $b + 5, $a - $b - 33);  # slice out the URL
                                                            # (bounds reconstructed)
                $k++;
            }
        }
        my $x = "";
        my $y = "";
        if ($L[0] eq $L[1])               # string comparison ('==' in the original)
        {
            $x = "http://ieeexplore.ieee.org" . $L[0];
            $y = "";
        }
        else
        {
            $x = "http://ieeexplore.ieee.org" . $L[0];
            $y = "http://ieeexplore.ieee.org" . $L[1];
            $y = get_page($y);
        }
        $x = get_page($x);
        $issuepage .= $x;                 # concatenate all pages of the issue
        $issuepage .= $y;

        my $month = fun_month($X);
        my $year  = 1988 + ($Y - 35);
        open (OUT, ">", 'out.txt') or Error('open', 'file');
        my $vol = $Y;
        my $no  = $X;
        my $date = "$month $year";
        my $filename = 'ies' . $vol . '_' . $no . '.html';

        # The original heredoc emitted the full HTML page header; its markup
        # was lost, so only a minimal reconstruction is given here.
        print HTM << "Header";
<html><head><title>IEEE Transactions on Industrial Electronics</title></head>
<body>
<h1>IEEE Transactions on Industrial Electronics</h1>
Header
        print HTM "\nVolume $vol, Number $no, $date<br>\n";
        print HTM "Access to the journal on IEEE XPLORE<br>\n";
        print HTM "IE Transactions Home Page<br>\n";

        my $count = 0;
        # Filter out useful information: every paper record sits in its own
        # small table (opening tags reconstructed).
        while ( $issuepage =~ m/<table[^>]*>\s*<tr[^>]*>\s*<td[^>]*>\s*(.*?)\s*<\/td>\s*<\/tr>\s*<\/table>/gis )
        {
            my $entry = $1;
            # Process the required values into variables for later printing.
            # The tail of this pattern, which captured the abstract, PDF and
            # cart links as $5-$7, was lost; a minimal href-based
            # reconstruction is used.
            if ($entry =~ m/<strong>(.*?)<\/strong>\s*<br>\s*((.*?)<br>)?\s*Page\(s\):&nbsp;\s*(\w+-\s*\w+).*?href="([^"]*)".*?href="([^"]*)".*?href="([^"]*)"/is)
            {
                $count++;
                my $title   = $1;
                my $authors = $3;
                defined($authors) or $authors = "";
                my $pages = $4;
                my $labs  = $5;
                my $lpdf  = $6;
                my $lcrt  = $7;
                $title   =~ s/\s+/ /g;
                $authors =~ s/\s+/ /g;
                ($labs =~ m/^http:\/\//) or $labs = "http://ieeexplore.ieee.org/" . $labs;
                ($lpdf =~ m/^http:\/\//) or $lpdf = "http://ieeexplore.ieee.org/" . $lpdf;
                ($lcrt =~ m/^http:\/\//) or $lcrt = "http://ieeexplore.ieee.org/" . $lcrt;

                # Invert the name order in the author list.
                my ($i, $name, $autors);
                my @auth = split( /;/ , $authors);
                $autors = '';
                foreach (@auth)
                {
                    (my $lasn, my $first) = split( /,/ );
                    $name   = $first . ' ' . $lasn . ',';
                    $autors = $autors . $name;
                    $autors =~ s/  / /g;
                }
                print OUT "\n$autors";

                my $abstract = "";
                my $keywords = "";
                my $pageabstract = get_page($labs);
                if (not defined($pageabstract))
                {
                    $abstract = "ERROR: Cannot access the abstract page";
                }
                elsif ($pageabstract =~ m/<td[^>]*>\s*<span[^>]*>\s*Abstract\s*<\/span>\s*<br>\s*(.*?)\s*<\/td>\s*<\/tr>/is)
                {   # abstract block (tags reconstructed)
                    $abstract = $1;
                    $abstract =~ s/\s+/ /g;
                }
                elsif ($pageabstract =~ m/Your request cannot be processed at this time\./is)
                {
                    $abstract = "ERROR: Download blocked by IEEE";
                }
                else
                {
                    $abstract = "ERROR: cannot run wget.exe or abstract page format unrecognized.";
                }
                if ($pageabstract =~ m/Author\s+Keywords\s*<\/span>\s*<br>\s*(.*?)\s*<\/td>\s*<\/tr>/is)
                {   # author keywords block (tags reconstructed)
                    $keywords = $1;
                    $keywords =~ s/<a[^>]*>\s*(.*?)\s*<\/a>/$1;/g;
                    $keywords =~ s/\s+/ /g;
                }
                else
                {
                    $keywords = "Not listed";
                }

                print "\n\n\n$count. Title: $title\nAuthors: $authors\nPages: $pages\nMore: $labs\nPDF: $lpdf\n\nAbstract: $abstract\n\nKeywords: $keywords\n";

                print HTM "\n<p>$Y.$X.$count.&nbsp;&nbsp; $autors \"$title,\" Trans. on Industrial Electronics, vol. $vol, no. $no, pp. $pages, $date.&nbsp; <a href=\"$labs\">Abstract Link</a>&nbsp; <a href=\"$lpdf\">Full Text</a><br>\n";
                if ($abstract =~ /ERROR:/) { print HTM "</p>"; }
                else { print HTM "\nAbstract: $abstract</p>"; }
            }
        }
        close OUT;

        open (DAT, $file);
        my @rec = <DAT>;
        close(DAT);
        # (A per-line post-processing pass over the generated file appeared
        # here in the original listing; its pattern did not survive
        # conversion and is not reconstructed.)
        open(DATA, ">", $file);
        print DATA @rec;
        close(DATA);
    }
}

print HTM << "Bottom";
</body></html>
Bottom
close HTM;

sub get_stdin
{
    my $page = "";
    while (1)
    {
        my $line = <STDIN>;
        $line or last;
        $page = $page . $line;
    }
    return($page);
}

sub get_file
{
    my ($fname) = @_;
    open (TMP, "<", $fname) or return(undef);
    my $page = "";
    while (1)
    {
        my $line = <TMP>;
        $line or last;
        $page = $page . $line;
    }
    close(TMP);
    return($page);
}

sub fun_month
{
    my ($i) = $_[0];
    my $m = "";
    if    ($i == 1) { $m = "Feb.";  }
    elsif ($i == 2) { $m = "April"; }
    elsif ($i == 3) { $m = "June";  }
    elsif ($i == 4) { $m = "Aug.";  }
    elsif ($i == 5) { $m = "Oct.";  }
    else            { $m = "Dec.";  }
    return($m);
}

sub get_page    # download a full web page into a variable
{
    my ($addr) = @_;
    $addr =~ s/&amp;/&/g;
    my $fname = "xxx.htm";
    system("wget.exe", "-q", "-O", $fname, "--referer=http://tie.ieee-ies.org/tie/", $addr);
    my $page = get_file($fname);
    unlink($fname);
    return($page);
}