INTERNET DATA ACQUISITION, SEARCH AND PROCESSING

by Sandeep Neeli

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
December 18, 2009

Keywords: Data Mining, Citations, Data Processing

Copyright 2009 by Sandeep Neeli

Approved by:
Bogdan Wilamowski, Chair, Professor of Electrical and Computer Engineering
Thaddeus Roppel, Associate Professor of Electrical and Computer Engineering
John Hung, Professor of Electrical and Computer Engineering

Abstract

Internet data acquisition is the process of extracting essential data from web servers. Semi-structured data in the form of HTML web pages must be extracted and converted into structured data before being presented to users. This thesis presents four tools that perform data acquisition, data search, and data processing: GradeWatch, Ethernet Robot, the Online Search Tool, and the Citations Explorer Tool. GradeWatch is a tool that lets students of an academic institution check their grades online and lets faculty post them. Ethernet Robot extracts paper details of the IEEE Transactions on Industrial Electronics from the IEEE web server using the Perl scripting language and processes the data with regular expressions. Using the paper database created by Ethernet Robot, an Online Search Tool was developed that can search up to a depth of three keywords and present the results on a separate web page, from which users can also download the papers by clicking the corresponding links. The Citations Explorer is a program that returns the most cited papers of the IEEE Transactions on Industrial Electronics for a particular year; it uses Google Scholar to perform the search and Perl regular expressions to process the data. The design of all these tools involves fetching, filtering, processing, and presentation of the required data. The resultant HTML files containing the required data are displayed for the perusal of users. Future enhancements to our Ethernet Robot include optimization to improve performance and customization for use as a sophisticated client-specific search agent.

Acknowledgments

I am heartily thankful to my supervisor, Dr. Wilamowski, whose encouragement, guidance, and support from the initial to the final level enabled me to develop an understanding of the subject. I would like to thank my committee members, Dr. Hung and Dr. Roppel, for accepting my request to be on my thesis committee. I would like to thank my family members for encouraging me to pursue this degree, and Arthi Kothandaraman for providing me all the support. Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of this thesis.

Table of Contents

Abstract
Acknowledgments
List of Illustrations
1 Introduction
1.1 Phases of Automatic Web Data Extraction
1.2 Categories of Data used in Web Data Extraction
1.3 Current Trend
1.4 Pros and Cons
1.4.1 Pros
1.4.2 Cons
1.4.3 Engineering Constraints
2 GradeWatch
2.1 Overview
2.2 GradeWatch System Design
2.3 GradeWatch User Interface
2.4 Viewing Results
3 Data processing from IEEE Xplore - Ethernet Robot
3.1 Overview
3.1.1 An Example
3.2 Data Collection
3.3 Data Filtering
3.4 Data Processing
3.5 Data Presentation
4 The online search tool
5 Citations Explorer Tool
5.1 Overview
5.2 Data Acquisition
5.3 Data Filtering
5.4 Data Processing
5.5 Data Presentation
6 Conclusions
6.1 Future Work
Bibliography
Appendices
A Perl Source Code
A.1 Ethernet Robot - IEEE Web Data Extraction

List of Illustrations

1.1 Structure of various documents [1].
2.1 Data Flow in GradeWatch.
2.2 User Interface.
2.3 Course progress report of a student - user x.
3.1 IEEE Xplore webpage depicting various Data Fields.
3.2 Flowchart depicting the four stages of Ethernet Robot.
3.3 Output of wget implementation in Perl: Example.htm.
3.4 Execution of the Ethernet Robot Perl code.
3.5 Resultant web page.
3.6 Data Presentation.
4.1 Search Interface to download required papers.
4.2 Web page displaying the search results.
5.1 Sample page generated by Google Scholar.
5.2 Final web page for most cited papers for the year 2007.

Chapter 1
Introduction

The meteoric rise of the World Wide Web as the knowledge powerhouse of the 21st century has led to a tremendous growth in the information available to the masses. This, in turn, means that useful information is increasingly time-consuming to locate within the huge mass of available data. In other words, as the knowledge base grows, there is a pressing need to extract useful information efficiently, in a shorter amount of time. The acquisition of structured data from a pile of unstructured documents is called Data Extraction (DE). Web data extraction is the process of extracting information from the World Wide Web (WWW) and manipulating it according to user constraints. A brief overview of web data extraction is given below, and an example model of web data extraction based on these features is presented in the following sections.

1.1 Phases of Automatic Web Data Extraction

The Web data extraction process can be divided into four distinct phases [20, 21, 26]:

1. Collecting Web data - Includes past activities as recorded in Web server logs and/or via cookies or session tracking modules. In some cases, Web content, structure, and application data can be added as additional sources of data.
2. Preprocessing Web data - Data is frequently pre-processed to put it into a format compatible with the analysis technique to be used in the next step. Preprocessing may include cleaning the data of abnormalities, filtering out irrelevant information according to the goal of the analysis, and completing missing links (due to caching) in incomplete clickthrough paths. Most importantly, unique sessions need to be identified among the different requests, based on a heuristic such as requests originating from an identical IP address within a given time period.
3. Analyzing Web data - Also known as Web Usage Mining [22, 23, 24], this step applies machine learning or data mining techniques to discover interesting usage patterns and statistical correlations between web pages and user groups. This step frequently results in automatic user profiling, and is typically applied offline so that it does not add a burden to the web server.
4. Decision making/Final Recommendation Phase - The last phase in web data extraction makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last web page requested by the user. This can be accomplished using a variety of Web technology options, such as CGI programming.

1.2 Categories of Data used in Web Data Extraction

The Web data mining process depends on one or more of the following data sources [25, 26]:

1. Content Data - Text, images, etc., in HTML pages, as well as information in databases.
2. Structure Data - Hyperlinks connecting the pages to one another.
3. Usage Data - Records of the visits to each web page on a website, including time of visit, IP address, etc. This data is typically recorded in Web server logs, but it can also be collected using cookies or other session tracking tools.
4. User Profile - Information about the user, including demographic attributes (age, population, etc.) and preferences that are gathered either explicitly or implicitly.

The input file of a Data Extraction (DE) task may be structured, semi-structured, or free text. As shown in Fig. 1.1, the definition of these terms varies across research domains. On one view [28], free texts written in natural language, e.g., news articles, are considered unstructured; postings on newsgroups (e.g., apartment rentals), medical records, and equipment maintenance logs are semi-structured; and HTML pages are structured. According to database researchers [29], the information stored in computer databases is known as structured data, while XML documents, which mix schema information with the data values, are semi-structured; on this view, Web pages in HTML are unstructured because they give very limited indication of the type of the data. Others consider XML documents structured, since an XML schema is available to describe the data, and free texts unstructured, since they require substantial natural language processing. The huge quantity of HTML pages on the Web is best regarded as semi-structured [30], since the embedded data are often presented through HTML tags. One source of these large semi-structured documents is the deep Web, which includes dynamic Web pages generated from structured databases using templates or layouts. For example, the book pages on eBay all share the same layout for the authors, title, price, comments, etc. A page class is the set of Web pages generated from the same database with the same template. Semi-structured HTML pages can also be generated by hand; for example, the publication lists on various researchers' homepages all give a title and source for each paper, though they are produced by different people.

1.3 Current Trend

Current tools that enable data extraction or data mining are both expensive to maintain and complex to design and use, owing to pitfalls such as differences in data formats, varying attributes, and typographical errors in input documents [1]. An extractor, or wrapper, is one such tool, which can perform the data extraction and processing jobs. Wrappers are special program routines that automatically extract data from Internet websites and convert the information into a structured format. Wrappers have three main functions (a minimal sketch is given below): download HTML pages from a website; search, recognize, and extract specified data; and save this data in a suitably structured format to enable further manipulation [2].

Figure 1.1: Structure of various documents [1].

The data can then be imported into other applications for additional processing. Wrapper induction, based on inductive machine learning, is the leading technique available nowadays. The user is asked to label or mark the target items in a set of training pages, or in a list of data records on a single page, and from these training pages the system learns extraction rules. Inductive learning poses a major problem: the initial set of labeled training pages may not fully represent the templates of all the other pages, so the learned rules perform poorly on pages whose templates are not covered by the labeled pages. This problem can be reduced by labeling more pages, because more pages cover more templates; however, manual labeling requires a large supply of labor, is time-consuming, and still gives unsatisfactory coverage of all possible templates.
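To make the three wrapper functions concrete, the following minimal sketch, written in Perl like the other tools in this thesis, downloads one page, extracts hyperlink data with a regular expression, and saves the result in a structured form. The URL, the pattern, and the output layout are hypothetical illustrations only, not part of any system described in this thesis.

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Function 1: download an HTML page (hypothetical URL).
    my $html = get('http://www.example.com/listing.html')
        or die "Download failed\n";

    # Function 2: search, recognize, and extract specified data --
    # here, the target URL and anchor text of every hyperlink.
    my @records;
    while ($html =~ m{<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>}gis) {
        push @records, { url => $1, text => $2 };
    }

    # Function 3: save the data in a structured (tab-separated) format
    # that other applications can import for further processing.
    open my $out, '>', 'records.tsv' or die $!;
    print {$out} "$_->{text}\t$_->{url}\n" for @records;
    close $out;

A real wrapper differs from this sketch mainly in its second step: the extraction rules must match the template of the target pages, which is exactly what wrapper induction tries to learn automatically.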
There are two main approaches to wrapper generation. The first, and currently the chief, approach is wrapper induction; the second is automatic extraction. As discussed above, wrapper learning works as follows: the user first manually labels a set of training pages or a list of data records; a learning system then generates rules from the training pages; and target items are extracted from new pages by applying these rules. Sample wrapper induction systems include WIEN [9], Stalker [10, 11, 12], BWI [13], and WL2 [14]. An analytical survey on wrapper learning [15] gives a family of PAC-learnable wrapper classes together with their induction algorithms and complexities. WIEN [9] and SoftMealy [16] are earlier wrapper learning systems, which were later improved upon by Stalker [11, 10, 17, 12]. Stalker learns rules for each item and uses a more detailed description of the rules; it treats the items separately instead of ordering them. Though this method is more flexible, it makes learning harder for complex pages because the local information is not fully utilized. Recent developments on Stalker add different active learning facilities to the system, which reduces the number of pages a user needs to label: active learning allows the system to select the most useful pages for the user to label, and hence reduces manual effort [18]. Other tools typically used are RoadRunner [3], WebOQL [4], Automated Data Extraction by Pattern Discovery [5], etc.

Every day there is an exponential increase in the amount of information that seeps into the Internet. Though this increases the possibility of finding a particular object, it also means a proportionate increase in search time. Tools for data extraction should therefore be developed with a view to reducing search time while keeping up with the growth of the Internet. In an attempt to serve this need, we present a new method of data extraction, called Ethernet Robot, which makes use of the Perl scripting language and the free non-interactive download utility wget.exe.

Notable features of the Perl language, which forms the core of our Ethernet Robot, are discussed below. Perl is the most prominent web programming language available because of its text-processing features and continuing improvements in usability, features, and execution speed. Handling HTML forms is made simple by the CGI.pm module, a part of Perl's standard distribution. Perl can handle encrypted Web data, including e-commerce transactions, and can be embedded into web servers to speed up processing by as much as 2000%; the mod_perl module allows the Apache web server to embed a Perl interpreter [6]. Perl also has a powerful regular expression engine built directly into its syntax. A regular expression, or regex, is a syntax that eases operations involving complex string comparisons, selections, replacements, and hence parsing. Regexes are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Regular expressions are used extensively in our method to reduce the complexity of the code and to make it compact and powerful.
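As a brief illustration of this syntax, the sketch below pulls a paper title out of an HTML fragment with a single pattern; both the fragment and the class name are invented for the example and do not reflect the actual IEEE Xplore markup.

    use strict;
    use warnings;

    # Invented HTML fragment standing in for part of a downloaded page.
    my $html = '<span class="title">Sensorless Control of Induction Motors</span>';

    # The parentheses form a capture group: whatever they match is stored
    # in $1.  The i modifier makes the match case-insensitive, and .*? is
    # a non-greedy quantifier that stops at the first closing tag.
    if ($html =~ m{<span class="title">(.*?)</span>}i) {
        print "Title: $1\n";
    }

The extraction patterns used in the following chapters are longer, but they rely on the same ingredients: capture groups, modifiers, and non-greedy quantifiers applied to raw page source.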
The combination of Perl, regular expressions, and wget makes Ethernet Robot an efficient solution for accelerated data downloading and extraction. Ethernet Robot and its functionality are described in chapter 3. A complete description of GradeWatch, an online grade posting system, is given in chapter 2. An online search engine based on the data extracted by Ethernet Robot is discussed in chapter 4. Finally, the online citations explorer tool, its operation, and its results are discussed in chapter 5, and conclusions and future work are given in chapter 6.

1.4 Pros and Cons

1.4.1 Pros

Web data extraction has many advantages for corporations and government agencies, which are its main users. The technology has enabled e-commerce companies to do personalized marketing, which eventually results in higher trade volumes. Government agencies use the technology to analyze threats and to fight terrorism. Society can benefit from its predictive potential, for instance through the identification of criminal activities. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer needs better and react to those needs faster. Companies can also improve profitability through targeted pricing based on the profiles they create.

1.4.2 Cons

Web data extraction technology can cause concern when it is used on data of a personal nature. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without the individual's knowledge or consent. Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests.

1.4.3 Engineering Constraints

Many websites do not allow robots to crawl through their pages and grab information, because such crawling reduces the performance of their systems. One of the tools described in this thesis is Ethernet Robot, which was used to extract data from the IEEE server systematically. IEEE has issued a "No Robots Policy", which states that downloading a database, or any portion of a publication's issue or volume, in a systematic fashion is strictly prohibited, and that the use of robots or intelligent agents on its site is a violation of the subscription license agreement. Creative solutions have to be developed to avoid such violations.

Chapter 2
GradeWatch

2.1 Overview

GradeWatch is a web-based database that allows students to check their progress in the courses they take using a web browser. The interface is written in Perl and is connected to the Internet through a web server [36]. Students need up-to-date feedback on their homework and projects, as such feedback supports their learning. Typically, different types of work and projects contribute different weights to the final grade, so it is difficult to keep students notified of their current standing. One simple approach is to use spreadsheets, but this consumes the time and effort of both the instructor and the student. There is therefore a need for a database that is easy to use by both instructors and students. Although every university has numerous database systems, access to them is restricted to a limited group of authorized staff, and a student generally cannot access such a database online at any time. Therefore, keeping in mind all the students who take a course, and aiming at a multi-browser compatible webpage, the GradeWatch database was created [31]. The database can be organized however the instructor chooses, and the system is easy to use for both the instructor and the student.
2.2 GradeWatch System Design

In the development stages of the software, the distribution of the software between the client machine and the server machine had to be resolved. When requested by the client machine, applets are dispatched through the network, and execution is performed entirely on the client machine; such an applet would then have to query a database server located on the same machine where the web site is hosted [35]. Secure HTTP can be used if data security is a priority, which avoids the need to encrypt the data separately. The server executes instructions based on the information it is given and sends the results back to the local machine that made the request [32][33]. Fig. 2.1 shows the division into program components and the data flow in the application. The user interface is programmed in HTML enhanced with JavaScript. The data flow on the server is handled by a Perl script, which accesses the databases, verifies access authorization, and generates a report containing the student's grades to date; the report is sent back to the client machine as a web page [34]. A student can either use a webpage or call a CGI script to access the database.

Figure 2.1: Data Flow in GradeWatch.

The CGI script dynamically generates an HTML page that lists the courses with available databases. To retrieve records from the database, a student must supply the course number, his or her family name, and the student ID number as a password. The details are matched against the selected course; if authorization fails, the user is denied access. Other features of GradeWatch include:

1. Multiple access to the databases for the instructors.
2. E-mail notification to each student whenever the spreadsheet/database is updated.
3. Sending grades individually to any specific student by e-mail.
4. Automatic set-up of a mailing list, which can easily be passed to any e-mail client program.

2.3 GradeWatch User Interface

To access the database, a person needs to provide the course number, family name, and student ID as a password. The database front end is a web page with a CGI form, as shown in Fig. 2.2. The CGI form takes the input provided by the user and transfers the data to the server, where the CGI script, written in Perl, receives the data as input. First, the program verifies that the course appears in the instructor's list of courses and databases. Then the program processes the database for the corresponding course and retrieves the information required to produce a progress report. The report is produced as an HTML page, sent back to the web server, and redirected to the client - a web browser. The exact data shown depends on the layout of the grade database file for the specific course. If an instructor wishes to inspect the database, more functionality is available: an instructor can access the grades of any student by entering the student's family name and the instructor's password, or can access all grades at once by entering the instructor's own name in the name field.

Figure 2.2: User Interface.

2.4 Viewing Results

In the sample database, used both as an illustration and as a demo on the Internet, the instructor's name and password are both set to admin. After examining the report web page, the instructor has the option of sending a grade report by e-mail to all students, to a particular student, or to a selected group. Optionally, a few lines of additional memo may be appended.
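The lookup flow described in sections 2.2-2.4 can be summarized by the minimal Perl/CGI sketch below. The form field names, the colon-separated grade file, and its location are hypothetical stand-ins chosen for the example; the actual GradeWatch script adds the instructor views and e-mail features described above.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI qw(param header);

    # Hypothetical form fields matching the front end of Fig. 2.2.
    my $course = param('course');
    my $name   = param('family_name');
    my $id     = param('student_id');   # the student ID doubles as the password

    print header('text/html'), "<html><body>\n";

    # Hypothetical grade database: one colon-separated line per student.
    open my $db, '<', "databases/$course.txt"
        or do { print "Unknown course.</body></html>\n"; exit };

    my $found = 0;
    while (my $line = <$db>) {
        chomp $line;
        my ($family, $sid, @grades) = split /:/, $line;
        # Verify authorization: family name and student ID must both match.
        if (lc $family eq lc $name and $sid eq $id) {
            print "<h2>Progress report for $family, course $course</h2>\n";
            print "<p>Grades: @grades</p>\n";
            $found = 1;
            last;
        }
    }
    close $db;
    print "<p>Access denied.</p>\n" unless $found;
    print "</body></html>\n";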
A sample instructor's report for all students is shown in Fig. 2.3.

Figure 2.3: Course progress report of a student - user x.

Chapter 3
Data processing from IEEE Xplore - Ethernet Robot

3.1 Overview

This chapter presents the implemented model of data extraction, which can draw only the necessary data from any web server on the Internet; it can be developed further into a powerful search engine or portal. Typically, a Data Extraction (DE) task is characterized by its input and its extraction target. The input is usually either a semi-structured document present on the Web, such as a table or an itemized list, or free text written in natural language [7]. Our model of data extraction, Ethernet Robot, can be used to download and extract any kind of information present on the Internet according to the user's requirements.

3.1.1 An Example

We consider the example of extracting titles, together with the authors, pages, and abstract URLs corresponding to those titles, from the IEEE Transactions on Industrial Electronics hosted on IEEE Xplore. The main aim of this example is to allow Associate Editors to search for reviewers, and authors to search for paper references, in the corresponding IEEE Transactions. The Transactions lists papers according to the year of publication, and each year has 6 issues. A screenshot of the Transactions is shown in Fig. 3.1; the boxes indicate the required data to be extracted and the inessential data, or junk, to be filtered out of each issue.

Figure 3.1: IEEE Xplore webpage depicting various Data Fields.

Let us now see how the desired data are automatically downloaded and extracted from these pages. Every Transactions on IEEE Xplore has a certain punumber; the Transactions on Industrial Electronics has punumber = 41. The generalized URL of issue Z of year/volume Y is given by IEEE as:

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=Y&isno=Z

To download and extract the titles from volume 54, issue 3, the URL is:

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=54&isno=3

Each issue may in turn have several pages 0, 1, 2, ..., each page being addressed by:

http://ieeexplore.ieee.org/servlet/opac?punumber=X&isvol=Y&isno=Z&page=P&ResultStart=Q

where page=P denotes the page number P and ResultStart=Q denotes the number Q of the starting title. The URL

http://ieeexplore.ieee.org/servlet/opac?punumber=41&isvol=54&isno=3&page=1&ResultStart=25

is the link to the titles starting from number 26. Page P=0 of any issue contains the links/URLs of the remaining pages, as shown in Fig. 3.1, so the other pages can be fetched using the wget function and concatenated with page P=0 to form a single page containing all the paper listings. The following statement performs the concatenation:

$p = $p0 . $p1 . $p2;

where $p0, $p1, and $p2 are the pages into which the paper listings are divided, and $p denotes the webpage containing all the paper listings of an issue. The behavior of the tool is thus defined by the volume number Y, the issue number Z, and the page number P. To download all the pages from, say, the years 2000 to 2006, the corresponding volume and issue ranges have to be set at the beginning of the main Perl code; a sketch of such a loop is given below. This example model of data extraction (Ethernet Robot) extracts all the titles and corresponding data from the IEEE Transactions on Industrial Electronics.
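The following sketch shows how such issue URLs can be generated and fetched. It is illustrative only: the loop bounds are arbitrary examples, the wget binary is assumed to be on the system path, and the complete Ethernet Robot source is listed in Appendix A.

    use strict;
    use warnings;

    my $punumber = 41;    # IEEE Transactions on Industrial Electronics

    # Illustrative ranges: one volume and its six issues.
    for my $vol (54) {
        for my $issue (1 .. 6) {
            my $url = "http://ieeexplore.ieee.org/servlet/opac?"
                    . "punumber=$punumber&isvol=$vol&isno=$issue";
            # Fetch page P=0 of the issue into a local file with wget.
            system('wget', '-q', '-O', "vol${vol}_iss${issue}.htm", $url) == 0
                or warn "Fetch failed for $url\n";
        }
    }

A fuller version would also request the additional page=P&ResultStart=Q pages for issues whose listings span several pages, and concatenate them as shown above.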
In order to extract all the data, the system needs to traverse all the paper list pages in the archive and then extract the titles and data from each paper list page. The code is devised to elicit the titles, authors, page numbers, abstracts, and abstract links from IEEE Xplore. Next, the Ethernet Robot visits the webpage pointed to by the URL in each record and fetches the abstract. On completion of data acquisition, the raw data is printed to a new HTML file and published as a webpage. The Ethernet Robot carries out four stages: data collection, data filtering, data processing, and data presentation on the web. A schematic representation of the sequence of steps is shown in Fig. 3.2. Of these, the data collection and filtering steps are relatively simple, whereas the data processing and presentation steps require more involved procedures. These steps are explained in greater detail in the following sections.

Figure 3.2: Flowchart depicting the four stages of Ethernet Robot.

3.2 Data Collection

As mentioned in the previous section, the desired data to be fetched are specific volumes of the IEEE Transactions on Industrial Electronics, so the starting point for this procedure is the Transactions webpage. We invoke the function get_page with the volume number as its parameter. get_page grabs the web pages corresponding to the volume number and returns one page per issue for each issue of that year/volume. The content of each issue is represented by a single variable, $page; every year/volume has 6 issues, each of which is represented by a single element in an array of 6 variables. The following invariant holds true at any point during the operation of the code:

$p[$i] = $page for $i <= 6

where $p is the array of issue contents, $i is the issue number (iteration), and $page is the content of each issue.

3.3 Data Filtering

Each webpage, held in the variable $p[$i], contains various pieces of JavaScript, HTML tags, tables, and other miscellaneous information that are irrelevant to our purposes, appended to the data we wish to extract. Hence, the content of the page needs to be filtered. The following condition in the code performs the proposed filtering operation, where the new variable $entry holds the required content between the