USING GENETIC PROGRAMMING TO QUANTIFY THE EFFECTIVENESS OF SIMILAR USER CLUSTER HISTORY AS A PERSONALIZED SEARCH METRIC

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee.

Brian David Eoff

Certificate of Approval:

Juan E. Gilbert
Assistant Professor
Department of Computer Science and Software Engineering

John A. Hamilton Jr.
Associate Professor
Department of Computer Science and Software Engineering

W. Homer Carlisle
Associate Professor
Department of Computer Science and Software Engineering

Stephen L. McFarland
Acting Dean, Graduate School

USING GENETIC PROGRAMMING TO QUANTIFY THE EFFECTIVENESS OF SIMILAR USER CLUSTER HISTORY AS A PERSONALIZED SEARCH METRIC

Brian David Eoff

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
December 16, 2005

USING GENETIC PROGRAMMING TO QUANTIFY THE EFFECTIVENESS OF SIMILAR USER CLUSTER HISTORY AS A PERSONALIZED SEARCH METRIC

Brian David Eoff

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author                    Date

Copy sent to:

Name                    Date

THESIS ABSTRACT

USING GENETIC PROGRAMMING TO QUANTIFY THE EFFECTIVENESS OF SIMILAR USER CLUSTER HISTORY AS A PERSONALIZED SEARCH METRIC

Brian David Eoff

Master of Science, December 16, 2005
(B.E., Auburn University, 2003)

100 Typed Pages

Directed by John A. Hamilton Jr.

Online search is the service that drives the Internet. One need only look at the success of a company such as Google, an idea from a 1998 graduate research paper that by 2005 had not only become a wildly successful company but whose very name had become synonymous with the verb "search," to realize how important search is.

Many IR researchers have suggested that the next great step in search is to make the process more personal. Search results should be tailored to the individual user. Early attempts at personalization, such as relevance feedback, never gained popularity with users because they required further interaction. The end goal is personalization without the user having to contribute more of their attention.

I propose that personalization can be accomplished by observing a user's document selections. A page's overall popularity is important, but more important are the pages that users similar to the primary user find popular. I also propose that history should not be based solely on a listing of prior documents a user has found relevant, but on the clusters of documents a user has found relevant. Clusters allow pockets of information to be observed, and thus a fuller understanding of the user can be determined.

How then do I determine whether this new metric is usable, short of implementing a search engine that uses the metric, putting it online, and hoping users flock to it? Genetic programming was created to solve such problems. Genetic programming can be used to determine whether a newly proposed information retrieval metric (collaborative filtering based on cluster history) is effective. By giving a genetic programming framework a training set containing documents, queries, and relevance judgments, an optimal ranking function can be found. The genetic programming framework can incorporate the newly proposed metrics along with traditional search metrics such as term frequency and document length. If the proposed metrics survive the evolution process, they can be judged effective in returning relevant documents for a user's query.
ACKNOWLEDGMENTS

For Mom, Dad, Sarah, and the rest of my people. They know who they are.

Style manual or journal used: IEEE Standards Style Manual (together with the style known as "aums").

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.

TABLE OF CONTENTS

LIST OF FIGURES  ix

1 INTRODUCTION  1

2 BACKGROUND INFORMATION  4
  2.1 What is Information Retrieval?  4
  2.2 History of Information Retrieval  4
  2.3 Charles Darwin, Programmer  7
  2.4 Information Retrieval is Difficult  10

3 LITERATURE REVIEW  13
  3.1 Relevance  13
  3.2 Document Clustering  16
  3.3 Relevance Feedback  20
  3.4 Term-Weighting and Ranking Functions  22
  3.5 Genetic Programming and Ranking Functions  24
  3.6 Personalized Search  27
  3.7 Conclusion  30

4 RESEARCH DESCRIPTION  31

5 EXPERIMENT AND RESULTS  37

6 CONCLUSION  49

7 FUTURE WORK  53

BIBLIOGRAPHY  54

APPENDICES  58
  A GENETIC PROGRAMMING FRAMEWORK EXPERIMENT RUN  59
  B GENETIC PROGRAMMING FRAMEWORK SOURCE CODE  67
  C CLUSTERING SOURCE CODE  85

LIST OF FIGURES

2.1 A Genetic Programming Tree [1]  9
3.1 Relevance and the User [2]  15
3.2 Relevance is critical in determining ranking function performance [3]  17
3.3 Jain's Taxonomy of Clustering Algorithms [4]  19
3.4 Ranking Functions [5]  24
3.5 Best Ranking Function Discovered by Fan et al.'s ARRANGER [6]  26
4.1 Cluster History  32
4.2 Sharing of Clusters Among Users  35
5.1 Best Tree  45
5.2 Fitness of Trees over GPIR Run  46
5.3 Graph of CR99 Performance  47
5.4 Chart of Performance  48

CHAPTER 1

INTRODUCTION

Google, the most popular online search engine with a 15.3% share of visits, receives 250 million search requests per day and has indexed 8.1 billion web pages [7][8][9]. Yahoo [10], Google's closest competitor, receives a mere 10% of all search requests. In seven years Google has gone from a graduate research paper to a billion-dollar company, primarily due to PageRank, a single search metric that uses incoming page links to determine a page's popularity [11]. PageRank was the most significant advantage Google had over its competition. Google was five years late to the start of online search engines, yet it was able to make up the distance due to its search algorithm metric.

Information overload has been predicted since the 1940s [12]. The number of new documents created for the web is growing at a substantial rate.
If the internet is going to continue being a useful source of information, online search engines not only have to keep pace with the bulk of new documents, but must also improve their performance in returning relevant documents to a user's query. Without search engines the internet becomes an unnavigable mess.

Current search engines give little or no consideration to a user's past queries. Past user histories should be used in determining the relevance of a document to a user's query. If a user searches for the term "Java," a search engine should be able to determine, based on their past queries and page choices, whether they are interested in the programming language or in coffee. Google's PageRank algorithm pushes popular documents higher in the order of documents returned for a user's query. Instead of using what the whole online community considers to be a popular page, a search engine might perform better if it returned pages that are popular with other users that have histories similar to that of the user. Then small pockets of interest could be developed.

Search algorithm designers are unsure of which metrics to use and of the way in which to combine and balance different metrics. There are a variety of metrics dealing with the link structure of web pages, term weighting, and popularity. The goal of these designers is to create an algorithm that gives them optimal performance, since error will always exist in these algorithms. There is no perfect search algorithm. The human creation of the queries, and the difference between perceived and actual relevance, will always be an issue.

Genetic programming can be used to create an optimal solution to the search algorithm problem. For the genetic programming framework to function efficiently, large amounts of data are necessary; these can be used as training sets to learn the correct documents for a query. The data would be available to any large search engine: documents, user queries, and, depending on what documents the users selected after their query, relevancy judgements. A small sample of this information could be fed into a genetic programming framework with the hope of producing an algorithm that would return relevant results. A designer could also remove metrics and determine which were the most useful. They could also test new metrics quickly without having to inflict a possibly poor idea on real users.

The research conducted through this thesis attempts to establish that a genetic programming framework can be used to determine the usability of various search metrics, and also to demonstrate that the history of similar users is a successful metric in determining the relevancy of documents to a user's search.

This thesis will examine the various techniques used in information retrieval. Chapter Two will give the reader background information on the field of information retrieval as it applies to online search engines. The goal is to show the advancement through the history and demonstrate how personalized search is the next step in information retrieval. Chapter Three will be a complete literature survey of the information retrieval and genetic programming canon. Chapter Four will discuss the implementation and reasoning behind the genetic programming framework and the communal personalized search metric. Chapter Five will describe the experiments conducted to demonstrate that communal personalized searching is able to return more relevant results to the user's queries. Chapter Six will conclude the thesis with an overview of how genetic programming and personalized search fit into online information retrieval and a reflection on the findings.
CHAPTER 2

BACKGROUND INFORMATION

"Although information retrieval has lately become quite a fad, I intend in this paper to stand back and take an unhurried look at what is going on, and try to predict where this field must go and what it must do in the future." - Calvin N. Mooers (1959) [13]

2.1 What is Information Retrieval?

Information retrieval (IR) was a term coined by Calvin Mooers in 1950. The goal of an IR system is to organize data in such a way that a user can quickly gain access to the knowledge they desire. Van Rijsbergen, a prominent IR researcher, stated that the problem inherent in information retrieval is that "we have vast amounts of information to which accurate and speedy access is becoming ever more difficult" [14]. A traditional library card catalog is an example of an information retrieval system. It is an attempt to condense a collection in such a way that a user can better access what they are interested in without having to look through all documents.

2.2 History of Information Retrieval

The problem of searching raw data for information has been around for over one hundred years. In 1897 a concordance, or index, of every meaningful word in the Bible was published [15]. While that might seem a trivial task today (simply input a copy, parse it, and build a count of the words), it was a lifelong task in 1897. These concordances were early examples of organizing information in such a way that a user could quickly access what they were interested in. The indexes created by these early concordance makers can be viewed as the ancestors of the inverted index structure used in most search engines. An inverted index is a hash data structure where the hash key is a word, and inside the hash is a linked list with two fields; the two fields contain the number of occurrences and the document location. After the computing revolution of the 1960s, concordances were no longer created by hand, and the process was significantly quickened. The last handmade concordance was a collection of Byron's works; it took a mere twenty-five years [15]. With computer assistance the task would take minutes.

Some of the earliest discussion of digital search came from Vannevar Bush, one-time director of the Office of Scientific Research and Development. Bush invented the concept for what he would later call "Memex" in the 1930s. He described Memex as "a device in which an individual stores all his books, records and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility" [12]. Bush also described an idea of documents that were interconnected with each other through "trails"; this concept is considered to be the inspiration for hypertext. Bush created the Memex concept (though he never actually constructed it) to counteract what he viewed as information overload. Researchers at Microsoft are currently attempting to construct a Memex-like system, MyLifeBits [16].

In the 1960s Cornell professor Gerard Salton began focusing on IR research. Salton led the group that developed SMART, jokingly known as Salton's Magical Retriever of Text, but commonly known as the System for the Manipulation and Retrieval of Text. Salton contributed to the discovery of a variety of IR techniques: the vector space model, term weighting, relevance feedback, clustering, extended Boolean retrieval, the term discrimination value, dictionary construction, term dependency, text understanding and structuring, passage retrieval, and automatic text processing using SMART. The vector space model was particularly groundbreaking.
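Before continuing with the vector space model, the inverted index mentioned earlier in this section can be made concrete with a short sketch. This is only an illustration of the idea; the subroutine name and the hash-of-hashes layout are assumptions made for the example, not the structure of any particular engine, and a real index would also record term positions and live on disk rather than in memory.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A tiny in-memory inverted index: term => { document ID => occurrence count }.
    my %index;

    sub add_document {
        my ($doc_id, $text) = @_;
        # Lowercase and split on non-word characters; stemming and stopword
        # removal (discussed later in this thesis) are omitted for brevity.
        for my $term (grep { length } split /\W+/, lc $text) {
            $index{$term}{$doc_id}++;
        }
    }

    add_document('doc1', 'Search engines index documents');
    add_document('doc2', 'An inverted index maps terms to documents');

    # Retrieve the posting list for a single term.
    my $postings = $index{'index'} || {};
    print "$_ appears $postings->{$_} time(s)\n" for sort keys %$postings;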
In a vector space system both the query and the document are treated as vectors, and their relevance to each other is determined by the distance between them.

The first internet search engine was Archie, created in 1990 by Alan Emtage [17]. This was before Tim Berners-Lee's creation of HTTP; all documents on the internet were stored on FTP servers. Archie's crawler probed the various FTP servers for listings of their files and indexed each of those files.

Wandex was the first WWW (World Wide Web) search engine; it was created by Matthew Gray in 1993. Earlier in the year Gray had created the "World Wide Web Crawler," or Wanderer, an early crawling robot. The Wanderer caused a small amount of controversy: the release of the bot caused a notable loss in network performance, because the Wanderer mistakenly accessed the same pages repeatedly, often a hundred times in a single hour [17].

In 1992 the Department of Defense, along with NIST, co-sponsored the Text REtrieval Conference (TREC) [18]. The aim was to promote research in the information retrieval community by supplying the infrastructure that was needed for such a huge evaluation of text retrieval methods. The TREC conference contains a variety of tracks focusing on question-answering systems, multimedia search, and search involving structured data, and more areas are added on a yearly basis. At the 2003 TREC conference ninety-three groups from twenty-two countries participated [18]. Also of note is that in the first six years of the TREC conference the effectiveness of the retrieval systems presented doubled [18].

During the early nineties bots were only recording the titles of web pages, the first hundred or so words, and their locations. In 1994 Brian Pinkerton created a crawler that recorded all of the text of each document [17]. This was the first time full-text search was available on the WWW. Pinkerton entitled his system WebCrawler. At the time it only contained the documents from six thousand servers. WebCrawler was eventually sold in 1997 to America Online (AOL).

Yahoo was the first online search engine to gain widespread popularity. Yahoo was not exactly a search engine in the traditional sense; it was a hierarchically arranged catalog that was not fed by a crawler or bot but was populated by Yahoo's editors.

In the past, small innovations in the search community, if implemented well, have led to substantial returns in terms of popularity. Each of the discussed projects had its moment, and their abilities attracted users. There seems to be a natural progression: there are no huge innovations, just bit-by-bit improvements. Users will not move on to a new search engine unless that engine offers a significant improvement over what they are currently using. It is not enough for a company to get on even footing with its competitor; it must surpass it. Conveniently, if a new company does surpass the competition, then users tend to flock to its product.

The history of information retrieval section of this thesis is not a complete history of internet search, but the reader can quickly see the progression of the technology. The University of Nevada's Veronica, Thinking Machines Corporation's WAIS, UCSTRI, Netfind and Lycos have been left out [3].

2.3 Charles Darwin, Programmer

"Computer programs that evolve in ways that resemble natural selection can solve complex problems even their creators do not fully understand." - John Koza [1]

Since the early nineties a sub-discipline of artificial intelligence known as evolutionary computing has gained popularity. The idea that evolution can be applied to creating solutions was first proposed by John Holland.
Evolutionary computing uses the theories of evolution, survival of the fittest, selective breeding and mutation to answer difficult questions. Instead of simply attempting to brute-force a solution, evolutionary computing allows an intelligent, selective search of the solution space of a problem [1][19]. Given sufficient computational resources, a genetic programming solution can yield results that compete with those of the best domain experts [20].

Genetic programming was developed principally by John Koza. The individual is no longer a coded representation of the problem; the individual is a computer program. The goal of a genetic programming system is to discover a program that produces some desired output for a particular set of inputs [1]. These programs are represented as a tree structure, as shown in Figure 2.1. This structure allows the various programs to be easily manipulated from generation to generation. The trees consist of functions and terminals. The functions can be programming operations: arithmetic, mathematical, logical or domain specific. The terminals can be numerical constants or variables.

The initial population is created pseudo-randomly. Koza described this step as a "primordial ooze of thousands of randomly created programs" [20]. The population is then ranked according to the fitness of each individual using the fitness function. The goal of the fitness function is to determine a score for each individual that reflects how well its possible solution performs. The fitness function allows the programs to be ranked, thus determining who is the "fittest" and will survive. The designer of the system must choose which individuals get to breed, which individuals simply enter the next generation, and which individuals are killed off. Randomness has a role to play in this: it is not wise to remove all of the weakest individuals, because the population might then converge prematurely on a local optimum. There are various selection techniques; they include elitism, fitness-proportionate, roulette-wheel, scaling, tournament, rank, generational, steady-state, and hierarchical selection. Once the individuals have been selected they can either be mutated, where they are simply altered, or crossed over with another individual. Crossover is metaphoric reproduction, and like reproduction, two individuals can produce multiple children; for example, the best two individuals might be crossed over four different ways to produce four children.

Figure 2.1: A Genetic Programming Tree [1]

Genetic programming has been used to solve a variety of problems in fields ranging from electrical circuit design to biochemistry and microbiology. Genetic programming has had great success with solving problems that have large solution spaces and require an optimal solution.

2.4 Information Retrieval is Difficult

"The simple reason: even humans are poor at deciding what information is relevant to a particular question. Trying to get a computer to figure it out is nearly impossible." [21]

Information retrieval (IR) is an unsolved problem in multiple disciplines of study, and it is not unsolved because of lack of interest. The scientist who comes up with a correct system, one that always returns the document that the user needs and does it in a reasonable amount of time, will surely enjoy great personal wealth and numerous accolades. Due to the high profitability of this research, search engine algorithms are kept secret, and thus advances become less likely [22].
The vast majority of the techniques used by search engines were discovered in the 1970s by IR researchers such as Salton. The seventies were a time when researchers were not as concerned with the profitability of their ideas, and they openly published their findings.

Patterson best describes the inherent difficulty in building search engines: "At serve time, you have to get the results out of the index, sort them as per their relevancy to the query and stick them in a pretty Web page and return them. If it sounds easy, then you haven't written a search engine" [23]. The need for disk space and processing power is staggering, and those needs are slightly outweighed by the need for large amounts of bandwidth. There is often no way of knowing if what you are doing will fully work; there is no testing suite for determining how well a search engine operates. Scalability becomes an issue with large online search engines. The search engine is a real-time system; it needs to respond quickly to a user's request.

Another issue is that web site owners want their sites featured highly by search engines, and they will often attempt to abuse the search engine ranking functions to achieve higher status. For example, if word count highly affects the ranking of a page, a web site creator might repeatedly insert a word they want to be associated with their site in hidden text. A normal user will not be able to see the word, but a crawler will. A site owner might pay a "link farm" service to boost its search placement in query results [22]. A link farm has the ability to create thousands of pages with links directed towards a single web site. This gives the illusion of popularity, and can cause metrics such as PageRank to be inaccurate. These techniques are known as "spamming" search engines. An entire industry of search engine optimizers (SEO) has been created, willing to sell their services of boosting a web site's search ranking [24]. If a search engine does not protect itself from these tricks its search results will become useless, and it will lose users. Techniques have come out to beat search engine spam, but oftentimes those techniques are also quickly beaten. It is a constant arms race, and it often requires human involvement in the page rankings.

CHAPTER 3

LITERATURE REVIEW

Information retrieval is a popular research subject. The goal of most information retrieval research is simply to give users the documents most relevant to their queries. In the search for a solution to that problem researchers have studied various ways of sorting documents, indexing documents, compressing various parts of the document set, and applying reasoning to the query. The goal of this literature review was to find information on how best to return the most relevant documents to a user's query.

The papers that influenced and shaped the work conducted for this thesis are described in the following literature survey. The first section is an overview of literature dealing with the concept of relevance. Relevance is a difficult thing to comprehend due to its abstract nature. The second section deals with research into the clustering of similar documents, followed by relevance feedback. Research into term-weighting strategies, various ranking functions, personalized search, and the use of genetic programming to tune search algorithms will also be discussed.

3.1 Relevance

Relevance is a central idea in IR research. Van Rijsbergen went so far as to claim relevance is the notion at "the centre of information retrieval" [14]. Despite its significance, relevancy is a hard concept to quantify.
Relevance has been declared to be the "most fundamental and much debated concern for information science." Despite the importance of relevancy there is little agreement about its exact nature; moreover, a proper way to evaluate systems making relevancy judgements has yet to be determined. The debate over relevance gets even more convoluted: a document can be perceived as being relevant to the system based on an entered query, but not relevant to the user. Prior research has shown that thirty-eight variables affect the relevancy that a human judges a document on, including the style of the document, visual layout, and difficulty of language. It is incredibly demanding for a search engine to consider these variables when making relevancy judgements, which is what makes relevancy difficult; it goes far beyond the mere content of the document.

The most recent work on the concept of relevancy as applied to IR has been done by Stefano Mizzaro. Mizzaro discusses the confusion about relevance among researchers, stating "a great deal of such problems are caused by the existence of many relevances, not just one, and by an inconsistently used terminology: the terms 'relevance', 'topicality', 'utility', 'usefulness', 'system relevance', 'user relevance', and others are given different meanings by different authors: sometimes they are used as synonyms (two or more terms for the same concept) sometimes in an ambiguous manner (the same term for two or more concepts)" [2].

In his paper [2] Mizzaro describes four dimensions of relevancy. The first dimension is "information resources." In this dimension exists a group of three entities: document, surrogate, and information. Mizzaro defines document as "the physical entity that the user of an IR system will obtain after his seeking of information." Surrogate means "a representation of the document," of which Mizzaro gives examples such as a keyword list, an abstract, or bibliographic information. The last member of the group is information, which Mizzaro acknowledges is not a physical concept, but the "entity that the user receives/creates when reading a document."

The second dimension is the representation of the user's problem. In this dimension Mizzaro makes a distinction between the Real Information Need (RIN) and the Perceived Information Need (PIN) [2]. The distinction must be noted because what the user wants, what they truly want, might not be the same as what they request. This user issue is often referred to as the Anomalous State of Knowledge (ASK). Another issue is the vocabulary problem, which is a mismatch between the terms used in the document and the terms used in the request. The user mentally creates a request for the system, but the user must enter this request as a query, which might require a Boolean-like language. The RIN, PIN, request and query are the parts of the second dimension. These four items can be viewed as states: a user must go from a RIN to a PIN to a request to a query, and at many points the process can be flawed.

Figure 3.1: Relevance and the User [2]

The third dimension is time, which is not discussed much in IR research. A document might not be relevant to a user's query at a certain time, but may be considered so in the future. The change is often brought about by a user having a greater understanding of the material after a period of time; a document that was not relevant to the user when he or she was less knowledgeable could become relevant later, once a base of knowledge has been gained.

The fourth dimension is "components" [2]. There are three components in the fourth dimension: topic, task and context. Topic is the subject area that the user is interested in. Task is the activity that the user will use the information they receive for, e.g., writing a term paper or studying for an exam.
The third component is context. Context includes everything that affects the search and the decision of relevance that is not covered by topic or task [2].

Mizzaro has, in essence, created a framework for the discussion of relevance. In his articles he also proposes graphing relevance in a four-dimensional space to fully understand it. This seems to be an unnecessary attempt to further quantify relevance into a measurable form, but Mizzaro's contribution to the relevance discussion cannot be overlooked. Relevance as a measurement must take into consideration numerous factors, and relevance is not a permanent statistic between a document and a query. Relevance must be considered in context with the user. Mizzaro also notes that relevance can change over time, something search designers rarely give consideration to. The importance of this is that a search engine's performance is judged on its ability to return a maximum number of relevant documents and a minimum number of non-relevant documents for a user's query. To create a well designed search ranking function one must fully understand the meaning of relevance. Just as important as returning relevant documents and not returning non-relevant documents is minimizing the number of relevant documents that are never retrieved.

3.2 Document Clustering

Clustering is the grouping of similar objects. It is a research subject in a variety of scientific disciplines, due to the need to make sense of large collections of data. Many terms are synonymous with clustering, such as unsupervised learning, numerical taxonomy, vector quantization and learning by observation [4]. IR researcher van Rijsbergen proposed a clustering hypothesis which stated, "closely associated documents tend to be relevant to the same requests" [14]. Clustering was originally applied to IR research as a means for improving efficiency. Instead of determining the relevance of a query to each document in the data set, the query was compared to the centroid of the cluster, and if it was deemed relevant, all documents in said cluster were also deemed relevant. This drastically reduced the number of comparisons that needed to be made. Unfortunately, most studies determined that retrieving the contents of the clusters whose centroids most closely matched the query did not perform as well as retrieving the top ranked documents from the entire collection. Other researchers have studied the performance improvements gained by using clusters in the search process, but their results showed only occasional gains [25].

Figure 3.2: Relevance is critical in determining ranking function performance [3]

In current IR research, clustering usually falls into two categories: document clustering or search result clustering. Document clustering is done prior to a user ever being involved in the system. All documents are clustered based on their similarity, and as new documents are discovered they are added to a cluster. For a large document set the initial clustering can be very computationally expensive. Search result clustering tries to counteract that cost. Clustering only occurs once search results are returned, and the clustering only applies to those results. The clustering is based on the small snippets that search engines return with their result listings. The end results are grouped together to allow a user to browse categories made on the fly. No record of these clusters is kept.

To accomplish document clustering, numerous algorithms have been created. Clustering algorithms fall into two categories: hierarchical or partitional [4].
Hierarchical algorithms begin with an initial clustering, and then merge or split the clusters until a measure of similarity has been met. In the bottom-up approach, the first step of a hierarchical clustering algorithm is to place each document into its own cluster. Next, the closest clusters are combined. This is repeated until the appropriate number of clusters has been found. In the top-down approach, all documents are placed into a single cluster. The two documents in the cluster that are farthest apart become the centroids of two new clusters, and all the documents that were in the original cluster become a member of whichever of the two clusters they are closer to. Whichever cluster is the largest gets split in the next cycle.

The most common example of a partitional clustering algorithm is k-means. In k-means a number of clusters is chosen, and each document is randomly assigned to a cluster. Next, the centroid of each cluster is calculated, and each document is assigned to the cluster whose centroid is nearest. This last step is repeated until the clustering converges and documents no longer switch clusters. Due to its random nature, different runs of the k-means algorithm will produce different results on the exact same document set. Figure 3.3 presents a tree of clustering algorithms; all of the algorithms are either partitional or hierarchical.

Figure 3.3: Jain's Taxonomy of Clustering Algorithms [4]

The question of which clustering algorithm is the best performer is often debated. The speed at which the clustering algorithm runs and the quality of the clusters must both be considered. The partitional algorithm k-means is faster than hierarchical algorithms [26]; it can compute in O(K*N). The hierarchical algorithms produce higher quality clusters. When a document set is extremely large it is wiser to use the k-means algorithm for the sake of efficiency. If the dataset is small, or if the clusters must be of a high quality, hierarchical algorithms make more sense [26]. In their article, "Evaluation of Clustering Algorithms for Document Datasets," Zhao and Karypis observed that partitional clustering algorithms outperform agglomerative ones when dealing with document datasets [27]. Quality in terms of clustering is a subjective measurement of whether the clusters contain similar documents. It should also be pointed out that research has shown that only the fifty to a hundred most frequent terms in a document are needed to cluster documents [28]. The same research pointed out that clustering based on all terms actually degraded system performance.

Clustering could be used to return documents that do not contain any of the words in the query. If a document is deemed very relevant to a query, then all the documents in that document's cluster would also be deemed to have a relevance to the query. No longer would it be a simple keyword search. The beauty of this type of mechanism is that it could potentially alleviate certain issues with the ambiguity of language. For example, if a user searched for "Search Engine Research," the system might find the cluster that contains documents on "Search Engine Research" and notice that the cluster also contains documents on "IR Research." It might then return documents that do not contain words from the original query, but are similar to documents that were relevant to the query. This technique could alleviate the problem of a user not fully knowing the terminology of a field.

A few online search engines are utilizing clustering in the search process. The online search engines Vivisimo and Clusty use clustering extensively.
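To make the k-means procedure described above concrete, a minimal sketch over documents represented as term-frequency hashes is given below. The seeding strategy, the squared Euclidean distance measure, and the fixed iteration count are illustrative assumptions; this is not the clustering program used later in this thesis (Appendix C).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Squared Euclidean distance between two term-frequency hashes.
    sub distance {
        my ($x, $y) = @_;
        my %terms = map { $_ => 1 } (keys %$x, keys %$y);
        my $d = 0;
        for my $t (keys %terms) {
            my $diff = ($x->{$t} // 0) - ($y->{$t} // 0);
            $d += $diff * $diff;
        }
        return $d;
    }

    sub kmeans {
        my ($docs, $k, $iterations) = @_;
        my @ids = sort keys %$docs;
        # Seed centroids with the first k documents (a real run would seed randomly).
        my @centroids = map { +{ %{ $docs->{$_} } } } @ids[0 .. $k - 1];
        my %assignment;
        for (1 .. $iterations) {
            # Assignment step: attach each document to its nearest centroid.
            for my $id (@ids) {
                my ($best, $best_d);
                for my $c (0 .. $#centroids) {
                    my $d = distance($docs->{$id}, $centroids[$c]);
                    ($best, $best_d) = ($c, $d) if !defined $best_d || $d < $best_d;
                }
                $assignment{$id} = $best;
            }
            # Update step: recompute each centroid as the mean of its members.
            for my $c (0 .. $#centroids) {
                my @members = grep { $assignment{$_} == $c } @ids;
                next unless @members;
                my %mean;
                for my $id (@members) {
                    $mean{$_} += $docs->{$id}{$_} for keys %{ $docs->{$id} };
                }
                $mean{$_} /= @members for keys %mean;
                $centroids[$c] = \%mean;
            }
        }
        return \%assignment;    # document ID => cluster index
    }

In practice the loop would be repeated until no document changes cluster, which is the convergence condition described above, rather than for a fixed number of iterations.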
The notoriously close-lipped Google Labs commented in a recent article about the possible use of clustering in future incarnations of their search engine. Urs Hoelzle, VP of Engineering at Google, commented that academic implementations of clustering had little success due to lack of data, and went on to comment, "If you have enough data, you get reasonably good answers out of it" [29].

3.3 Relevance Feedback

Relevance feedback was a technique proposed by Gerard Salton. [30] defines relevance feedback as "an interaction cycle in which a user selects a small set of documents that appear to be relevant to the query, and the system then uses features derived from these selected relevant documents to revise the original query." In the paper, "Improving Retrieval Performance by Relevance Feedback," Salton and Buckley list the main advantages of relevance feedback as: "It shields the user from the details of the query formulation process, and permits the construction of useful search statements without intimate knowledge of collection make-up and search environment," "It breaks down the search operation into a sequence of small steps, designed to approach the wanted subject gradually," and "It provides a controlled query alteration process designed to emphasize some terms and to de-emphasize others, as required in particular search environments" [31].

The question of the best way to implement relevance feedback has been widely researched. Relevance feedback can be implemented as term reweighting or query expansion. Term reweighting is "modifying term weights based on term use in retrieved relevant and non-relevant documents" [32]. Query expansion is the taking of the most frequent terms from a document that has been deemed relevant, adding them to the query, and then resubmitting the query. Query expansion cannot take into consideration negative feedback; a term cannot be removed from the query if it is not there. Term reweighting can take negative relevance into consideration: in the same way that common words in the document set receive a lower weight, common terms in documents which the user has marked as non-relevant receive a lower weight. Harman's research shows that query expansion performs better in average precision than term reweighting [32]. Query expansion results in a 27% improvement over the base search, and Harman observes that the greatest results are achieved by adding twenty terms to the query. Both query expansion and term reweighting see performance improvements in [32]. According to the experiments performed in [32], term reweighting performs best when not taking negative feedback into consideration, though the improvement is a marginal 0.2%.

Relevance feedback is an interesting technique due to its simplicity. With the help of the user, the search engine can personalize the documents that are retrieved. Many users already use a form of relevance feedback: once a user receives a listing of documents, occasionally they will notice a word in the descriptions that they had not used in their original query. They will then re-enter the query with this new information. The concept is not unfamiliar to a user, and is not terribly invasive. That does not make it easy. The user must quantify a document, often on the basis of little more than a brief description.
Pseudo relevance feedback or automatic query expansion works on the assumption that the top few returned pages are often the most relevant, and that pages that are similar tothosearealsoimportant[33]. Pseudorelevancefeedbackinessencecreatesanewquery based on the most common terms in the first few returned documents. The technique has beenverysuccessfulatTREC,andtheOKAPIsearchalgorithmusesaformofit[34]. The beautyofthesefeedbacksystemsisthattheydonotrequiretheusertoactivelyparticipate. The reason for the rise in research on pseudo-relevance feedback is due to the un- willingness of everyday users to use relevance feedback features. The study conducted by Jansen et. al. found that less than five percent of users utilize the relevance feedback option in online search engines [35]. The authors go on to comment that ?the question of actual low use of this feature should be addressed in contrast to assumptions about high usefulness of this in IR research? [35]. The authors question the direction of the research of relevance feedback commenting, ?This is one of the examples where users are voting withtheirfingers,andresearch isgoingtheotherway?[35]. 3.4 Term-WeightingandRankingFunctions The ranking functions of commercial online search engines are closely guarded se- crets. Little is known about the ranking function of the current leader Google beyond a graduate student conference paper. However, papers presented at TREC conferences have provided a wealth of information about the internals of many experimental ranking func- tions. Also,mostoftherankingfunctionsare derivedfrom thevectorspacemodel(VSM) created byGeraldSalton. 22 The vector space model views both the query and the documents as a vector. The document is already in this form if indexing has occurred. The vectors will be the size of thecombinedvocabularyofthequeryandthedocument. Ifawordiscontainedinthequery or the document, and not in the other a zero is placed in the query for that word location. Thesimilarityiscomputedbytakingtheinnerproductofthetwovectors[15]. Ifthequery and the document have no words in common this product will be zero. The inner product isnottheonlymeansofmeasuringsimilarity. Term weightinggoesbeyondconsideringthecountofthequerytermsinadocument. If a term occurs in many documents, then it gets a lower weight. This is the idea behind term-frequency inverse document frequency (TFIDF). Figure 3.4 shows the ranking func- tionPivotedTFIDF,Okapi[34]andINQUERY[36]. Alltherankingfunctionshavesimilar form,andoftenusetheexactsamevariables. Otherrankingfunctionsare webspecific, andareprimarilybasedonthelinkedstruc- tureoftheweb,suchasGoogle?sPageRankandHyper-InterlacedTopicSelection(HITS). PageRankusesthenumberofpagesthatlinktoadocumentasareflectionofrelevance,and those documents are higher ranked. Google recently filed for a patent that discussed the usingoftimemetricssuchashowoftenawebsite?scontentchanges,andwhenthedomain name for the site was registered [37]. This metric takes into consideration how long ago a domain name was registered. Most malicious web sites do not hold their domain names for more than a year. Google uses this metric to reduce the number of junk sites that are returned in their search results. Also Google takes into consideration the location of the querytermsinthepageandthesizeofthetextcontainingthequeryterms[38]. 
Figure 3.4: Ranking Functions [5]

3.5 Genetic Programming and Ranking Functions

"In the struggle for survival, the fittest win out at the expense of their rivals because they succeed in adapting themselves best to their environment." - Charles Darwin [39]

The authors of [40] propose using genetic programming to build search engine ranking functions. The IR ranking function can easily be represented as a tree, which is the data structure used in genetic programming. Their trees consist of eleven terminals, four operators, and a constant value between 0 and 1. The size of the trees is limited to a depth of no more than ten for their experiment. For the fitness function the authors use average precision. A set number of the top trees in the current generation become members of the next generation, and tournament selection is used to determine which trees cross over and have their offspring continue into the next generation. To determine the success of their solution they compare each individual's performance against Okapi BM25, a ranking function that has consistently performed well in TREC evaluations. The corpus is the Associated Press collection spanning three years. The document set is split into three sets: one for training, one for evaluation and one for testing. The split is done to prevent the GP from over-fitting itself to one document set and thus not producing accurate results. The result of their experiment is that they were able to discover a ranking function that outperformed Okapi by a significant percentage. One of the intriguing results of the study is that the GP framework discovered some commonly known ranking functions such as TFIDF.

The research in [40] further expands on the experiment from [6]. It is no longer a nameless GP framework; Fan names the system ARRANGER (Automatic Rendering of Ranking Functions by Genetic Programming) [6]. Fan et al. go on to propose using the ARRANGER system and pseudo relevance feedback (Fan refers to the technique as "blind feedback") to gain further performance improvements.

The researchers from [40] present results from another experiment based on the genetic framework they propose. Part of the results, though, consist of creating individual ranking functions for each query, and then comparing them to Okapi and TFIDF. In this portion of the experiment, when using short queries, they were able to gain an increase in average precision of 16.19% and 10.71% against PTFIDF and Okapi respectively [5]. When using longer queries they were able to perform 32.97% and 17.01% better. This portion of the experiment seems frivolous: it is unfair to create a function for each query, and it seems unreasonable in a normal IR environment. Also, TREC does provide long queries, which can range between seventeen and ninety words; rarely do search engines encounter queries of this size. The researchers also removed fifteen queries out of the set of fifty, because those queries did not have a significant number of relevant documents (they all had less than four relevant documents). The more important portion of the experiment, though, is using the genetic programming framework to create a consensus ranking function, one that will work well over all fifty queries. The results of this experiment are that, when dealing with short queries, the discovered ranking function was able to outperform Okapi and PTFIDF by 3.13% and 10.75% in average precision [5]. On long queries the gain in average precision grew to 18.4% and 3.13% against PTFIDF and Okapi respectively. Figure 3.5 shows the best ranking function that was discovered by the genetic programming framework.

Figure 3.5: Best Ranking Function Discovered by Fan et al.'s ARRANGER [6]
Fan et al. also studied the effects of a variety of fitness functions when using genetic programming to discover ranking functions [41]. In early experiments Fan and his colleagues limited the fitness function to average precision, and only considered the top twenty documents when evaluating fitness. In this article the fitness functions take into consideration the order of the documents: relevant documents should be higher in the order than non-relevant documents. Unfortunately, the relevancy judgments provided by TREC do not provide a relevance ranking for queries, only whether a document is relevant or not. The idea of this research [41] is, for example, that if only fifteen relevant documents exist for a query, the first fifteen documents retrieved should be relevant and the last five of the twenty should not be. Relevant and non-relevant documents should not be mixed throughout the returned set; relevant documents should come first and non-relevant documents last. This is referred to as precision. The authors state that they view this fitness evaluation as important due to the notion that "the utility of a relevant document decrease[s] with ranking order" [41]. The authors of [41] propose four new fitness functions based on this concept. Their experiment also contains three other fitness functions: average precision, the CHK fitness function developed by Chang and Kwok, and the LGM fitness function created by Lopez-Puljate [41]. The corpus used for the experiment was the TREC 10GB collection from 2000. The corpus was randomly split into training, validation and test sections. The results were measured by the average precision and the precision of the top ten hits. The results of the experiment show that three of the four fitness functions created by the authors were able to create ranking functions that outperformed Okapi BM25. These three fitness functions also outperformed PAVG and CHK. The LGM fitness function could not be used to create a ranking function that outperforms Okapi BM25, nor could one of the fitness functions the authors propose [41].

3.6 Personalized Search

Personalized search is user specific, unlike traditional search. The promise of personalized search has always been results that are tailored to the individual. This would be accomplished by developing a model or knowledge of a user. Over time the software would be able to recognize preferences and incorporate them into the ranking. Currently a semi-personalized search is available, but it often requires the user to fill out a survey, and often the results are tailored more to the person's sex, location, income, and age. These qualities do not encompass who a person is. Personalized search is a difficult concept to implement; the system must learn about a person, and then apply that information to the ranking of documents.

In the paper, "Context in Web Search," Steve Lawrence makes a distinction between a document being valuable as opposed to simply being relevant. Lawrence states that a document's value "depends on the context of the query - for example, the education, interests and previous experience of a user along with information about the current request" [42]. This idea is similar to Mizzaro's take on relevance, which must consider the user's background and needs, even if the user is not fully knowledgeable of those needs. Other researchers have referred to the traditional results returned by search engines as "consensus relevant," and the results of a personalized system as "personal relevancy" [43].
The example given in [42] is that of a user searching for a document on ?machinelearning?mightwanttolimitthecontextofthesearchtoresearcharticles. Inquir- ius2 requires explicit information from the user, but Lawrence also mentions the Watson project that does not require such information. The Watson project is able to observe the documentscurrentlyopenedandbeingedited,andusesinformationgainedfromthosedoc- uments. The query ismodified toinclude thisnewinformationpriorto beingsubmittedto a search engine. Lawrence goes on to define a personalized search engine as ?a search engine that knows all your previous requests and interests, and uses that information to tailorresults?[42]. Alsonotedisthatapersonalizedsearchenginewillnotreturnthesame resultstoaqueryfordifferentusers,andalsoovertimetheresultsauserreceivedtoapast querymaydifferfromtheresultstheyreceivetothesamequeryinthefuture. The Haystack project [44] is an attempt to create a personalized search engine that does not require the user to explicitly state context. The Haystack search engine does not focus on the web, but is concerned with the user?s desktop system. Haystack is capable of observing the user?s interaction with the system through a variety of proxies (web and email). By usinga proxy,Haystack can take intoconsiderationtemporal effects withrela- tionto theuser?s interactionwithweb documents. The longer a user spendsat a particular site,themorelikelythecontentofthatsiteintereststheuser. Thisinformationcanbeused tomoreaccuratelyreturnresultstoauser?squeryforinformationstoredontheirsystem. The primary concern of a personalized search engine must be the way in which they model a user?s interest. Tandujaja and Muitake issue with the schemes mostpersonalized search engines use to store user info, and refer to it in jest as ?a bag of words? [45]. The authorspointoutthatthe?bagofword?approachdoesnotprovideapropercontextforthe words,andcouldbeconfusedduetosynonyms. Theauthorsgivetheexampleoftheword 28 ?rose?,whichcouldeasilybeappliedtoflowers,orpossiblewine[45]. The?bagofwords? approach providesnomeansofdeterminingwhich. Theyalsocommentthatthisapproach focuses on the user?s likes, and does not take into consideration dislikes. Tandujaja and Mui propose Persona, a personalized search system combining filtering and user profiling [45]. Persona uses a tree-coloring technique to create a user profile. The tree is the Open Directory Project (ODP) taxonomy. The Open Directory Project is very similar to Yahoo in its structure. Persona incorporates the HITS algorithm discussed earlier. The Persona systemdoesrequireexplicitrelevancefeedbackfromtheuser. Usersaresupposetoratethe context, not the individual documents, negatively or positively. This feedback is reflected ontheuser?streebyacolor-codingofthenode. Thiscodingreflectsthe?numberoftimesit hasbeenvisited,ratedpositive,negativeandassociatedURL?s?[45]. Oncetheuserentersa quer,yPersona(ifthecontextisfoundintheuser?sprofile)givestheassociateddocuments more or less weight, depending on the color. If the context is not in the user?s profile, Persona tries the ODP taxonomy. It then checks if the context has the same parent as any node in the user?s profile, or if it is a child. If so, their weights are adjusted accordingly [45]. This tree based user profile is what makes Persona interesting. The use of a tree, which can easily be compared to a maintained taxonomy makes the user profiling system powerfulinthefaceofsynonymsandotherambiguities. Liuet. al. 
Liu et al. note that a user profile combined with a general profile or a categorical hierarchy similar to the ODP can be used to better map a user's query to a set of categories, and thus return more relevant results [46]. In their experiment they found that the combined system outperforms a simple user profile. Their findings show that a user history needs to be put into a context.

Collaborative filtering has primarily been used in recommendation systems on online commerce sites. Collaborative filtering is often referred to as a "word of mouth" system [47]. An example is Amazon's book recommender service, which uses feedback on books a user has ordered in the past to find similar users, and recommends books that a similar user has enjoyed but that the primary user has yet to purchase [48]. Collaborative filtering has not been fully implemented in ranking functions. The goal of collaborative filtering is to use other people's preferences, determine how similar those people are to the user, and then decide whether the user would be interested in the item.

3.7 Conclusion

The work of Weiguo Fan, Gerard Salton, van Rijsbergen and Mizzaro is pivotal to this thesis on a variety of levels. Salton and van Rijsbergen are the early innovators; they understood the idea that if information can be measured, then similarity can be discovered. Mizzaro took on the difficult task of actually defining relevance, a word that is thrown around far too often in IR literature with little consideration of what it means. Relevance incorporates more than just a measurement between a document and a query; the user should also be considered when computing relevance. The work of Weiguo Fan and his colleagues showed that GP ideas can be applied to IR. All of these articles, and the work of these scientists specifically, led me to my idea of using GP to determine the usefulness of history and social preference as a search metric.

CHAPTER 4

RESEARCH DESCRIPTION

"If it doesn't work right, we can always try something else." - John McCarthy [49]

It is my belief that the next logical step for search engines is towards personalization. If a search engine uses information gained about a user, it will be more capable of returning relevant results to a query. This personalization should mature over time: the more a user uses the search engine, the better the results should be. Also, a search engine should be able to gain this information by observation. A user should not have to answer a survey or explicitly hand over information.

The question arises of how best to use the history of a user to return relevant documents. Simply giving documents that a user has visited in the past a higher rank is not enough. Ranking highly documents similar to those that a user has visited in the past could provide a benefit in the usefulness of the returned documents. This technique could be used to determine the topics of interest in the past, and thus would be beneficial in alleviating ambiguities of language. All documents in the collection could be grouped together based on their similarities. The clusters would form ad-hoc topic categories. Thus, a more useful image of the user could be developed based on the content type of the documents they have visited and not simply the documents themselves. Based on the clustering hypothesis, if a document in a cluster is relevant to a query, it is probable that other documents in that cluster would be relevant.

I propose that keeping a history of the clusters a user has visited is more advantageous than simply maintaining a document history. Clusters are in essence a grouping of similar documents.

Figure 4.1: Cluster History
Often this type of information is provided by search engines to give the user documents similar to a chosen document, if the user so desires. Instead of being an added feature, this information should be a direct part of the ranking function. In effect, the clusters a user has selected documents from in the past should be used in determining a document's relevance. Figure 4.1 visualizes the way in which the history of the user is recorded as a listing of their cluster history, along with the clusters that contain documents the user has and has not visited. In Figure 4.1, the documents the user visited are in a bold square.

Many current search engines use the popularity of a document to determine its ranking. While this metric has had success, it seems that a better metric would be to reward the documents that users similar to the principal user have viewed in the past and found to be relevant. This would allow community knowledge to participate in the ranking function; it would provide it a social element. Once a user begins to use the search engine, it will discover other users that have searched for and found similar pages, and use this information as another metric. Figure 4.2 shows how, by utilizing a user's cluster history, similar preferences between users can be discovered. In a brief example, the users Amy and Barry are similar due to the number of shared clusters between them. Instead of the ranking function limiting its knowledge to just the user's past, it could also use the past of other like-minded users. As IR researcher Keith Stirling wrote, "relevance estimates from past users can be used to rank documents for future users who may submit similar information requests" [50].

This technique is known as "collaborative filtering." It is often utilized in recommendation systems; Amazon uses this technique to recommend books that similar customers have bought in the past or have given favorable feedback on. In a collaborative filtering system, similarity between users is determined in the same manner that similarity between a document and a query is determined in the vector space model [51]. The collaborative filtering score of a document is calculated by adding up the similarity scores of all the users who have accessed the particular document or object. This function can be normalized to take into consideration the size of the document and differences in the number of documents visited. This is done so as not to corrupt the results if a user has a particularly vast history, or if the document contains many more words than others. The score is an attempt to quantify how similar the users that have visited this document are to the user. Most collaborative filtering functions are based not simply on visiting the document but on the user giving explicit feedback on the usefulness of the document. This can be an impediment to the user. I believe that instead of basing the collaborative filtering on an explicit score or on document history, it would be better served by being based on cluster history. This is a natural conclusion based on what can be gained by recording a cluster history versus a document history. The goal of this research is to determine if collaborative filtering based on cluster history is a usable search metric.

The issue is how this personalization fits into the search engine algorithm. Throughout this thesis a variety of search metrics have been discussed: word count, term weights, and link structure. Personal preference should be a metric integrated into the ranking function. I have proposed three new metrics: personal history, personal cluster history, and the cluster history of similar users.
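Before discussing how these metrics are combined and balanced, a short sketch may make the third metric concrete. It assumes each user's cluster history is stored as a hash of cluster ID to visit count and that similarity between users is measured, as in the vector space model, by a normalized inner product; the subroutine names and the normalization choice are illustrative assumptions, not the code listed in the appendices.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Cosine similarity between two cluster-history hashes (cluster ID => visit count).
    sub user_similarity {
        my ($x, $y) = @_;
        my ($dot, $nx, $ny) = (0, 0, 0);
        $dot += ($x->{$_} // 0) * ($y->{$_} // 0) for keys %$x;
        $nx  += $_ ** 2 for values %$x;
        $ny  += $_ ** 2 for values %$y;
        return 0 unless $nx && $ny;
        return $dot / (sqrt($nx) * sqrt($ny));
    }

    # Collaborative filtering score for a document: the summed similarity between
    # the primary user and every other user who has visited the document's cluster.
    sub cf_score {
        my ($user, $doc_cluster, $histories) = @_;
        my $score = 0;
        for my $other (keys %$histories) {
            next if $other eq $user;
            next unless $histories->{$other}{$doc_cluster};
            $score += user_similarity($histories->{$user}, $histories->{$other});
        }
        return $score;
    }

    my %histories = (
        amy   => { c1 => 3, c2 => 1 },
        barry => { c1 => 2, c2 => 2 },
        carol => { c7 => 4 },
    );
    printf "CF score for a document in cluster c1: %.3f\n",
        cf_score('amy', 'c1', \%histories);

The score for a candidate document is simply the summed similarity of every other user who has selected documents from that document's cluster, so documents from clusters favored by like-minded users are pushed up the ranking.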
These metrics need to be integrated into the ranking function and balanced to return the most relevant results. To balance this ranking function it must be determined which metrics are the most important, and the ranking function must reflect that. Word count will always be an important metric, and it is unthinkable that another metric such as personal history would affect the ranking function more. If that were the case, no matter what the user's query, the pages they have already visited would be returned.

The question arises, though, of how best to integrate this new metric into a ranking function. The designers of ranking functions must carefully balance the importance of each metric and fine-tune the function. Traditionally it appears this is done by trial and error and gradual refinement. It also seems to involve a large amount of intuitive creation. If a creator proposed a new metric, how could they tell whether that metric was successful if the ranking function was built in such a manner? Perhaps the metric was superior, but the integration of the metric into the ranking function was flawed. Genetic programming is a way to determine such things. Researchers have proposed using GP to create better ranking functions, but that research involves taking the current metrics and rearranging them and the operators to produce better functions. As of yet, no one has proposed using this technique to determine whether or not a new metric is valid.

Given all the existing search metrics along with the newly proposed metric, over time the GP would create a ranking function that included the most usable metrics. Poor metrics would be lost in the evolution process that takes place between generations. To be successful, all that would be needed is a document set, queries, relevancy judgements, and user histories. This would be automated testing of the ranking function. The fitness function associated with the GP would give the creator insight into the performance of the function.

Figure 4.2: Sharing of Clusters Among Users

CHAPTER 5
EXPERIMENT AND RESULTS

The goal of this experiment was to demonstrate that the clustered history of similar users can be integrated into a ranking function and return relevant documents for a query. To accomplish this, a set of documents, a set of queries, and a set of relevancy judgements were needed. Relevancy judgements are the determined relevance between a document and a query, usually of a binary nature: either the document is relevant or it is not. The TREC conference (previously mentioned in Chapter 2) provided all of this necessary material to its participants. This resource allowed for the comparison of ranking functions to determine which performs better. I chose to use the TREC dataset because TREC is the preeminent conference in information retrieval research, and also because most of the past literature in IR that I studied during this research has exclusively used TREC datasets to test the ability of their retrieval techniques.

The list below provides an overview of the steps that were taken to accomplish the experiment.

1. The dataset was parsed, Porter-stemmed, and stopwords were removed. The dataset was then indexed and stored in a MySQL database. The queries and relevance judgements were also stored in a database.

2. User histories were created based on a portion of the available queries and a portion of the available relevance judgements.

3. The dataset was clustered using a K-means partitional clustering algorithm. The number of clusters was chosen to ensure that each cluster contained on average ten documents.

4. A GP framework was constructed.
The goal of this framework was to create a ranking function with optimal performance. The GP had terminals available to it consisting of search metrics from the Okapi, INQUERY, and Pivoted TFIDF ranking functions. Also included in the possible terminals were the personal history of the user, the cluster history, and a collaborative filtering metric based on the cluster history.

5. Once the optimal ranking function was discovered by the GP framework, all the queries were also run through the Okapi, INQUERY, and Pivoted TFIDF ranking functions. This allowed the performance of the GP-created ranking function to be compared to other common search engine ranking functions.

6. The ranking function that was created using the GP framework was then tested on two other datasets to ensure that over-training did not occur. Okapi, INQUERY, and Pivoted TFIDF were also tested to allow for more comparisons.

The TREC HARD data set used is referred to as CR99. The data set is part of the Congressional Record (CR). The CR99 data set is 146.6 MB in size. The CR data set is further divided into three data sets: CRE, CRS, and CRH. These sets are divided so that one is for the Senate, one is for the House, and the third is for extensions. The CRE, CRS, and CRH sets contain 4126, 6339, and 6146 documents respectively. The data sets are not perfect, though; all contain duplicate copies of documents. The data set also contains a listing of forty queries, and relevance judgements relating each document to each query.

The first step of the experiment was to get the data set into the form of an inverted index. This was accomplished using a Perl script, and the inverted index was stored in a MySQL database. The index is essential to the performance of the ranking function. It is simply a large two-key hash table: the first key is the document's ID, the second key is a term from the document, and the value is the count of that term in the document. Before each term was entered into the database it was Porter-stemmed, and all capitalization and whitespace was removed. Also, common stopwords such as "the" were not indexed.

TREC provides two different types of queries, long and short. The short query is a three- to five-word description of the information need, similar to what users normally enter into a search engine. The long query is a paragraph or more of detailed description of what the user is interested in. For the purposes of this experiment only the short queries were used, because that is most similar to what a common online search engine would face. Studies have shown that the average query contains only 2.21 terms, and that less than four percent of queries contain more than six terms [35]. The queries and the relevancy judgements were also stored in a MySQL database for quick access. The queries received the same stemming and removal of common words that the index received. These practices are consistent with normal search engine operation.

At this point the documents were clustered using a K-means partitional algorithm. The clustering program was a Perl script that accessed the inverted index stored in MySQL. The source code for the clustering program is available in the appendix. The number of clusters was chosen so that each cluster would average ten documents. Research has shown that having many clusters with a small number of documents is ideal in document retrieval systems [52]. The goal of the clustering was to group similar documents, in effect creating pockets of information. The data sets were clustered individually. The K-means partitional algorithm worked as follows: initially, all documents were placed into a single cluster.
Next the largest cluster, which at first was this single initial cluster, was split. A sample of the documents in the cluster was taken, and the two documents furthest apart based on vector space similarity were chosen to be the initial members of the two new clusters created by splitting the largest cluster. Then all the documents in the largest cluster were compared to each of the two to determine which was closer, and joined that cluster.

Unfortunately, user histories were not available from TREC. For the purposes of this experiment user histories were pseudo-randomly created. Each user was given between five and twenty of the forty available queries. They then had between two and ten of the relevant documents for each query added to their personal history. Ten users were created. It was briefly considered to have the users also select some non-relevant documents, but it did not seem logical for a user to go out of their way to visit documents not relevant to their query. In a traditional search engine the user is given a small preview of the page, and normally they can determine whether the document is relevant to their interest. It should be kept in mind that TREC does not provide a degree of relevance between a document and a query. It is a simple binary distinction: either the document is relevant or it is not. Also, for each query there can be up to one hundred relevant documents; the hypothetical user only chooses a small number of those documents.

A variety of known ranking functions were tested using the TREC data set. This was done to establish a performance baseline, with the intent of performing at a higher level. Given a query, the ten documents with the highest scores were returned. Then, using the relevancy judgements, it was determined how many of these documents were relevant; this gave a percentage. The percentage was then averaged over the entire set of queries. The ranking functions tested were Pivoted TFIDF, INQUERY, and Okapi BM25. None of these ranking functions considers user history as a component of its ranking criteria.

It is impossible to prove that a ranking function is correct. The goal is simply to create an optimal solution. In this experiment, the trial and error of creating a ranking function was simply replaced with genetic programming. All the variables in the previous ranking functions were given to the GP framework to create a ranking function. The proposed personal history metrics were also available. By using this setup, a ranking function could be found that outperformed the commonly accepted functions. Instead of creating a ranking function by hand and tweaking it until it performs well, I simply chose to let evolution accomplish it.

A genetic programming framework was created to develop an optimal ranking function. The framework was written in the Perl programming language. The choice of using Perl instead of Common Lisp, in which Koza developed his GP framework, was due to complications in getting Lisp to communicate with a MySQL database. Also, Perl contains an eval function, which allows a string to be evaluated as source code. The subroutine to calculate the rank of each document could thus be created on the fly. This was necessary to determine the fitness of the newly created ranking functions. In the GPIR framework the trees, or functions, had a limited number of terminals and operators. The operators consisted of log, addition, subtraction, multiplication, and division.
The available terminals were the term frequency, the number of documents in the collection in which the term is present, the average term frequency in the collection, the maximum term frequency in the document, the average term frequency in the document, the maximum of the number of documents in the collection in which any term is present, the number of documents in the collection, the word count (length) of the document, the personal history, the cluster history, and the collaborative filtering cluster score described in the prior section. These metrics were also used by Fan et al. in their various experiments [5]. The framework could compose a tree consisting of any of these terminals and operators. The maximum depth of each tree was limited to seven, and the minimum depth was limited to two. This was done to ensure the ranking functions did not become unruly, and possibly unusable, and, on the lower side, to ensure that the ranking function was an equation and not simply a single variable. Certain rules were also defined in the creation of the trees to ensure that correct equations were created, for example to avoid division by zero. The GP framework pseudo-randomly created an initial population of trees, consisting of pseudo-randomly chosen terminals and operators. Once created, the fitness of each of these trees was evaluated. The evaluation uses the tree to return the top ten documents for each of the forty queries provided by TREC. Limiting it to the top ten documents was done because studies have shown that 58% of users do not look past the first ten results [35]. Using the relevance judgements, it was determined how many of these ten documents were relevant to the query. A percentage of relevant documents was recorded for each query, and the fitness was the average of these percentages over all forty queries. On my test machine (a 1 GHz G4 with 1 GB RAM) it took approximately one minute to test each tree, a reasonable time considering that processing each tree was essentially submitting forty queries to a search engine.

Once an initial population of trees was tested, they were ordered according to their fitness. At this point the population evolved into the next generation. The size of the population remained consistent from generation to generation. The top trees in the current generation, numbering one less than the integer square root of the population size, automatically advanced into the next generation without change. Each of these select few was also crossed over with each of the others. Crossover consisted of taking a random portion of one tree and combining it with a random portion of another tree to produce a new tree. The figure below is an example of two trees being crossed over.

To fill out the remaining members of the population, trees were chosen at random and crossed over. This ensured that weaker-performing members of the population could persist into further generations. The population was evolved for a number of generations. At the end of each generation the best tree was printed out. Also, the average fitness of each population was recorded. This was used to determine whether the population became stronger through the steps of evolution.

For the purposes of this experiment the population size was fifty trees, with twenty generations. Preliminary runs of the genetic programming framework showed that an initial population size of fifty was sufficient to produce a good variety of ranking functions. Having a larger initial population did not produce better results, and the run-time of the experiment was significantly increased. The choice of twenty generations was due to initial observations that showed the optimal ranking function being discovered prior to the tenth generation; the use of twenty was to provide a buffer.
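To make the evaluation step concrete, the sketch below shows the shape of the fitness computation described above. It assumes a candidate tree already flattened to a Perl expression string, turned into a scoring subroutine with eval just as the framework does, and it uses tiny invented data structures in place of the MySQL-backed index; the expression, document IDs, and values are illustrative, not the evolved function from the run in Appendix A.

# Illustrative sketch of the fitness evaluation used to score candidate trees.
use strict;
use warnings;
use List::Util qw(sum);

# Tiny stand-ins for the MySQL-backed index: per-document metric values
# (invented numbers) and the relevance judgements for one query.
my %metrics = (
    d1 => { tf => 3, doclength => 120, collabscore => 2 },
    d2 => { tf => 1, doclength =>  80, collabscore => 0 },
    d3 => { tf => 2, doclength => 200, collabscore => 5 },
);
my %relevant = ( q1 => { d1 => 1, d3 => 1 } );

# A candidate tree flattened to a Perl expression over the metric variables
# (the operators here are chosen only for illustration), compiled into a
# scoring subroutine with eval.
my $expression = '$m->{tf} * $m->{collabscore} / $m->{doclength}';
my $rank = eval "sub { my \$m = shift; return $expression; }";

# Fitness: for every query, rank the documents, keep the top ten, record the
# percentage of those ten that the judgements mark relevant, and average the
# percentages.  (Scoring ignores the query terms here for brevity; the real
# framework evaluates the expression for each query term and sums the results.)
sub fitness {
    my @percentages;
    for my $query (keys %relevant) {
        my @ranked = sort { $rank->($metrics{$b}) <=> $rank->($metrics{$a}) }
                     keys %metrics;
        my @top  = @ranked[ 0 .. ($#ranked < 9 ? $#ranked : 9) ];
        my $hits = grep { $relevant{$query}{$_} } @top;
        push @percentages, 100 * $hits / 10;
    }
    return sum(@percentages) / @percentages;
}

print fitness(), "\n";    # 20, i.e. on average two of the top ten were relevant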
The fitness of the trees was determined against only one of the datasets, CRE. Once the GP run was completed, the best trees were taken and tested against the other two datasets to ensure that over-training did not occur. Not all terminals were used in each tree, and just as in biological evolution, where unwanted traits are lost, weaker metrics were evolved out of the population. Because the shared cluster history metric survived the process of evolution, it can be viewed as a useful metric. Also, by comparing the GP-created ranking function containing shared cluster history against common ranking functions, performance gains were further observed.

When the experiment began, problems with the dataset began to surface. It was known that the dataset had duplicate copies of documents, each with a different document identification code. The issue was that when the ranking function deemed both of these documents relevant to a query (after all, they were identical), the relevance judgements provided by TREC would list only one of the documents as relevant. This led to the percentage scores being low; but since this problem affected the GP ranking function, Okapi, INQUERY, and Pivoted TFIDF evenly, it allowed the experiment to progress, since it could still be shown which ranking function performed best. Unfortunately these scores, while they allowed a ranking to be assigned, were not an accurate representation of the ranking functions' true performance.

In generation four of the run, the tree that would be the best performer was discovered. The tree is shown in Figure 5.1. The tree used the cluster collaborative filtering score metric, the term frequency of the query terms in the document, the number of documents in the collection, and the document length. The metrics personal history and cluster history, which were simply lists of the documents and the clusters the user had visited in the past, were available, but they did not survive the evolution process. The complete run of the experiment is available in the appendix, including a record of all the trees in the final population and their scores.

Figure 5.1: Best Tree

Figure 5.2 is a graph containing the best fitness of each generation, and how it progressed over the entire run. Also in the graph is the average fitness of the population, which peaked at generation nine and plateaued at that point with little change.

The ranking function in Figure 5.1 was developed using the CRE dataset, but to ensure no over-training had occurred, the function was tested against the other two datasets. Also, the INQUERY, Pivoted TFIDF, and Okapi ranking functions were tested against all three data sets to allow a comparison to the GP-created function. These ranking functions were simply the equations used to rank the documents and did not take into consideration the use of a thesaurus or pseudo-relevance feedback. These are techniques outside of the ranking function that can be further used to improve the quality of results.

As the graph in Figure 5.3 shows, the GP-created ranking function using the collaborative filtering based on clusters was able to outperform the three other ranking functions on two out of three of the datasets. The function showed the greatest performance gain on the CRE data set on which it was trained, but also performed well on the CRS dataset, which the GP framework had no contact with. The complete source code of the GP framework is provided in the appendix.

Figure 5.2: Fitness of Trees over GPIR Run

Figure 5.3: Graph of CR99 Performance

                 CRE        CRS        CRH
Pivoted TFIDF    10         16.5625    15.16129
INQUERY          10         15.625     14.19
Okapi            10.625     15         15.16129
GPIR             14.0625    17.5       14.8387

Figure 5.4: Chart of Performance
Figure 5.4 is a table containing the scores of each of the ranking functions. The scores were calculated in the same manner as the fitness is in the GP framework.

On two of the three datasets, the ranking function created by the GP incorporating the new metric outperformed the more seasoned ranking functions. On the CRH dataset, where it placed third, it was still close to the top, with Okapi and Pivoted TFIDF tying in score and only being roughly 0.32 above it. On average, the GP-created ranking function outperformed its closest rival, Pivoted TFIDF, by about 1.5. The only significant advantages this ranking function had were the new personalized metric and being constructed using the genetic programming framework.

CHAPTER 6
CONCLUSION

Information retrieval experts claim that the next great advance will be personalized search: the search engine will actually know the user. That is not enough. The search engine should also know people like the user. This information helps put a user's interests in a better light.

A metaphor for this new personal search would be a person going to the bookstore to pick out a book. They know the basic subject that they are interested in, and they have past experience buying books; they know authors they have enjoyed, and publishers who make handsome books. Using this information they can make a better purchase than if they simply walked in knowing only the subject matter they were interested in. Consider, though, if this shopper also had recommendations from their friends, family and colleagues, people of similar intelligence, interests and location. Given this information, the customer would be far more likely to be satisfied with the purchase they made.

Search is a product. It is a service. The system that gives users what they want, with as little hassle as possible, will win. For an engine not to use all the information available is foolish. The only question is how to balance all these metrics into an equation that will produce optimal results, without torturing the user through a period of test and fix. The creators of genetic programming envisioned such problems when they created the concept.

Once a creator has proposed a new metric, how can they determine whether or not it actually works? Even if the metric has merit, how do they implement it into the ranking function? GP provides a way for a designer to discover how to integrate this new metric into a ranking function. Collaborative filtering can be helpful in the process of returning useful documents for a user's query.

With a much larger data set, and access to users' queries and their subsequent document selections for those queries, a better ranking function could be designed with the help of genetic programming, and new metrics could also be tested. If, with a new metric, the fitness of the final ranking function increases, then that metric can be deemed useful in ranking the documents.

Once a new metric has been proposed, the questions of how best to integrate it into the ranking function, and whether the proposed metric is useful at all, will arise. The solution to both questions is the use of genetic programming. Considering the impossibility of proving that a search metric works, the best alternative is to find a function incorporating the metric that outperforms the best-performing ranking function not containing the metric.

The user is the ultimate determiner of relevancy. Two users may enter exactly the same query, and a document that the first user believes is relevant the second user might disagree with. The inexact nature of this field forces researchers to search for the most optimal solution, since a totally correct solution seems to be unlikely.
Given more information, the performance of the ranking function can only improve. With larger data sets, more queries, more relevancy judgements, and more user histories, the genetic programming framework's ability to create a successful ranking function would be helped significantly.

Consider the possibility of a search engine that constantly changes. Slowly, over time, the ranking function is adapted to give better results. The ranking function used two years prior will not be the same as the current one, because of the knowledge gained. A genetic programming application would run concurrently beside the search engine, constantly attempting to discover a better ranking function. The genetic program is always there to allow a designer to easily add a new metric into the fold, and then discover whether the metric improves the ranking process.

When given the choice of which metrics to use, the genetic programming framework chose term frequency, the collaborative score proposed here, the number of documents, and the length of the document. All of the other metrics were deemed unneeded to produce the results, or possibly harmful. The ranking function created by the GP framework had its best performance increase over the other ranking functions on the dataset on which it was trained. This was to be expected. On the CRH dataset the GP-created ranking function was bested by Okapi and Pivoted TFIDF, yet the difference was very small. The GP-created function well outperformed all the functions on the CRS dataset. The GP function had only one real advantage over the other functions, and that was the ability to use the collaborative filtering metric I proposed earlier. If the GP had been allowed to train on all three datasets, it would have been able to improve its score on the CRH dataset. The point of training it on only one of the data sets was to support the notion that it would perform well on other datasets. The genetic programming framework was a tool. A single-day run of the genetic programming framework was able to outperform a variety of ranking functions that were created over months using heuristic techniques. The run was able to determine which metrics were usable by their appearance in the final ranking function. If these metrics were without merit, they would have been quickly removed during the evolution process.

The best ranking function was discovered after the fourth generation. Longer runs and a larger initial population would not have improved the final performance of the ranking function. The best way to improve the results would have been to have had more queries and more relevancy judgements. That would have allowed the GP framework to perfect the ranking function that much more.

On average, the ranking function created by the GP framework, which included the collaborative filtering based on clustering metric, performed at 15.467. This compares to Pivoted TFIDF, Okapi, and INQUERY, which performed at 13.9, 13.587, and 13.271 respectively. The addition of this metric caused a significant improvement in the performance of the ranking function. The performance gain was not due to over-training, as demonstrated by good performance on the other two datasets with which the GP had no contact.

The research conducted in this thesis has shown that genetic programming can be used to determine whether a newly proposed information retrieval metric (collaborative filtering based on cluster history) is effective. The framework also integrates this new metric into the ranking function in an optimal way, and fully automates the heuristic process of building a ranking function.
A technique such as this can allow researchers to gain insight into the performance of their metrics, and also significantly shorten the time it takes to create these search engine ranking functions.

CHAPTER 7
FUTURE WORK

In the future I hope to apply the techniques proposed here to a true online search engine. A search engine that could adjust its ranking function over time would be a significant improvement over current implementations. With larger datasets, more users, and thus more queries and relevance judgements, an even more significant improvement could be had over the modest gains demonstrated in this text. With nothing more than twenty users, and a dataset that did not exceed five thousand documents, I was able to create a system that could find an optimal ranking function.

Some might contend that adding larger numbers of documents and larger numbers of users would not help the process, and that the computation needed to test each function on each query would become too great. It should be viewed as a benefit, not a challenge. The dataset could be randomly divided, as could the users. Each time a function was created, it would be tested against a portion of the queries, a portion of the users, and a portion of the documents.

I imagine the genetic programming framework running in the background, able to integrate a better ranking function into the search engine whenever one is found. Also, new metrics could be quickly added to the search, without prior testing to decide whether they are actually useful. The framework will determine their usefulness.

In the future I also hope to look into new metrics that further use community information to improve search engine results. Hopefully this research will lead me to a greater understanding of what can be inferred from observation of a user's search history, and how these observations can best be integrated into the search process while maintaining the user's privacy.

BIBLIOGRAPHY

[1] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1993.

[2] S. Mizzaro, "How many relevances in information retrieval?," Interacting with Computers, vol. 10, no. 3, pp. 303-320, 1998.

[3] O. R. Zaïane, "From resource discovery to knowledge discovery on the internet," tech. rep. TR1998-13, Simon Fraser University, 1998.

[4] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 2, pp. 264-323, 1999.

[5] W. Fan, M. D. Gordon, and P. Pathak, "A generic ranking function discovery framework by genetic programming for information retrieval," Information Processing and Management, vol. 40, no. 4, pp. 587-602, 2004.

[6] W. Fan, M. Luo, L. Wang, W. Xi, and E. A. Fox, "Tuning before feedback: combining ranking discovery and blind feedback for robust retrieval," in SIGIR '04: Proceedings of the 27th annual international conference on Research and development in information retrieval, pp. 138-145, ACM Press, 2004.

[7] Google, http://www.google.com, date accessed June 2005.

[8] D. Sullivan, "Search engine size wars V erupts," November 2004. http://blog.searchenginewatch.com/blog/041111-084221.

[9] D. Sullivan, "Searches per day," February 2003. http://searchenginewatch.com/reports/article.php/2156461.

[10] Yahoo, http://www.yahoo.com, date accessed June 2005.

[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, 1998.

[12] V. Bush, "As we may think," The Atlantic Monthly, July 1945.

[13] C. N. Mooers, "The next twenty years in information retrieval: Some goals and predictions," American Documentation, vol. 11, pp. 229-236, March 1960.

[14] C. J. van Rijsbergen, Information Retrieval, 2nd edition.
Dept. of Computer Science, University of Glasgow, 1979.

[15] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.

[16] J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong, "MyLifeBits: fulfilling the Memex vision," in MULTIMEDIA '02: Proceedings of the tenth ACM international conference on Multimedia, pp. 235-238, ACM Press, 2002.

[17] W. Sonnenreich and T. Macinta, WebDeveloper.com Guide to Search Engines. Wiley, February 1998.

[18] NIST, "Text REtrieval Conference (TREC) home page," date accessed June 2005. http://trec.nist.gov.

[19] J. R. Koza, "Genetic programming," in Encyclopedia of Computer Science and Technology (J. G. Williams and A. Kent, eds.), pp. 29-43, Marcel Dekker, 1998.

[20] H. Hirsh, W. Banzhaf, J. R. Koza, C. Ryan, L. Spector, and C. Jacob, "Genetic programming," IEEE Intelligent Systems, vol. 15, pp. 74-84, May-June 2000.

[21] S. G. Steinberg, "Seek and ye shall find (maybe)," Wired, vol. 4, May 1996.

[22] M. Cafarella and D. Cutting, "Building Nutch: Open source search," Queue, vol. 2, no. 2, pp. 54-61, 2004.

[23] A. Patterson, "Why writing your own search engine is hard," Queue, vol. 2, no. 2, pp. 48-53, 2004.

[24] A. L. Penenberg, "Search rank easy to manipulate," March 2005. http://www.wired.com/news/culture/0,1284,66893,00.html.

[25] R. M. Losee and J. Lewis Church, "Are two document clusters better than one? The cluster performance question for information retrieval: Brief communication," Journal of the American Society for Information Science and Technology, vol. 56, pp. 106-108, January 2005.

[26] A. Leuski, "Evaluating document clustering for interactive information retrieval," in CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pp. 33-40, ACM Press, 2001.

[27] Y. Zhao and G. Karypis, "Evaluation of hierarchical clustering algorithms for document datasets," in CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management, pp. 515-524, ACM Press, 2002.

[28] A. Leouski and W. Croft, "An evaluation of techniques for clustering search results," tech. rep. IR-76, Department of Computer Science, University of Massachusetts, Amherst, 1996.

[29] S. Kuchinskas, "Peeking into Google," March 2005. http://www.internetnews.com/xSP/article.php/3487041.

[30] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.

[31] G. Salton and C. Buckley, "Improving retrieval performance by relevance feedback," Journal of the American Society of Information Science, vol. 41, pp. 286-297, June 1990.

[32] D. Harman, "Relevance feedback revisited," in SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 1-10, ACM Press, 1992.

[33] M. S. Khan and S. Khor, "Enhanced web document retrieval using automatic query expansion," Journal of the American Society for Information Science and Technology, vol. 55, pp. 29-49, January 2004.

[34] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, "Okapi at TREC-2," in Text REtrieval Conference, pp. 21-30, 1992.

[35] B. J. Jansen, A. Spink, and T. Saracevic, "Real life, real users, and real needs: a study and analysis of user queries on the web," Information Processing and Management: an International Journal, vol. 36, no. 2, pp. 207-227, 2000.

[36] J. P. Callan, W. B. Croft, and S. M. Harding, "The INQUERY retrieval system," in Proceedings of DEXA-92, 3rd International Conference on Database and Expert Systems Applications, pp. 78-83, 1992.

[37] A. Acharya, M. Cutts, J. Dean, P. Haahr, M.
Henzinger, U. Hoelzle, S. Lawrence, K. Pfleger, O. Sercinoglu, and S. Tong, "United States patent application 20050071741," 2003.

[38] S. Dominich and A. Skrop, "PageRank and interaction information retrieval: Research articles," Journal of the American Society for Information Science and Technology, vol. 56, no. 1, pp. 63-69, 2005.

[39] WikiQuote, "Charles Darwin - Wikiquote," date accessed June 2005. http://en.wikiquote.org/wiki/Charles_Darwin.

[40] W. Fan, M. D. Gordon, and P. Pathak, "Discovery of context-specific ranking functions for effective information retrieval using genetic programming," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 523-527, 2004.

[41] W. Fan, E. A. Fox, P. Pathak, and H. Wu, "The effects of fitness functions on genetic programming-based ranking discovery for web search," Journal of the American Society for Information Science and Technology (JASIST), vol. 55, pp. 628-636, 2004.

[42] S. Lawrence, "Context in web search," IEEE Data Engineering Bulletin, vol. 23, no. 3, pp. 25-32, 2000.

[43] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel, "Personalized search," Communications of the ACM, vol. 45, no. 9, pp. 50-55, 2002.

[44] E. Adar, D. Karger, and L. A. Stein, "Haystack: per-user information environments," in CIKM '99: Proceedings of the eighth international conference on Information and knowledge management, pp. 413-422, ACM Press, 1999.

[45] F. Tanudjaja and L. Mui, "Persona: A contextualized and personalized web search," in Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS '02), HICSS, 2002.

[46] F. Liu, C. Yu, and W. Meng, "Personalized web search for improving retrieval effectiveness," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 28-49, 2004.

[47] U. Shardanand and P. Maes, "Social information filtering: Algorithms for automating 'word of mouth'," in Proceedings of ACM CHI '95 Conference on Human Factors in Computing Systems, vol. 1, pp. 210-217, ACM Press, 1995.

[48] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, pp. 76-80, Jan-Feb 2003.

[49] WikiQuote, "John McCarthy - Wikiquote," date accessed June 2005. http://en.wikiquote.org/wiki/John_McCarthy.

[50] K. H. Stirling, "On the limitations of document ranking algorithms in information retrieval," in SIGIR '81: Proceedings of the 4th annual international ACM SIGIR conference on Information storage and retrieval, pp. 63-65, ACM Press, 1981.

[51] I. Soboroff and C. Nicholas, "Collaborative filtering and the generalized vector space model (poster session)," in SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 351-353, ACM Press, 2000.

[52] A. Griffiths, H. C. Luckhurst, and P. Willett, "Using interdocument similarity information in document retrieval systems," Journal of the American Society for Information Science and Technology, vol. 37, pp. 3-11, 1986.

APPENDICES

APPENDIX A

GENETIC PROGRAMMING FRAMEWORK EXPERIMENT RUN

Connected to DB
Inverted Index Loaded from DB
Queries Loaded from DB
Relevance Judgements Loaded from DB
User Profiles Loaded from DB
Building Similarity Matrix
Loaded Cluter Information
Number of Documents 4125
Generation 1
12.5
$collabscore ? $tf
Average Fitness of Generation 8.49375
Generation 2
12.8125
$collabscore ? $dfmaxcol
Average Fitness of Generation 9.95710784313725
Generation 3
12.8125
log $tfavgcol + $collabscore ? $tfavg / $tfavgcol ? log $doclengthavg + $dfmaxcol ?
$tf + $clhistory / $numberodocs / $doclength / $doclength / $dfmaxcol Average Fitness of Generation 11.1397058823529 Generation 4 13.75 $collabscore ? $tf ? $collabscore ? $dfmaxcol ? $numberodocs / $doclength Average Fitness of Generation 11.9362745098039 Generation 5 14.0625 $collabscore ? $tf ? $numberodocs / $doclength Average Fitness of Generation 11.9607843137255 Generation 6 14.0625 59 $collabscore ? $tf ? $collabscore / $collabscore / $doclength Average Fitness of Generation 12.4203431372549 Generation 7 14.0625 $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength Average Fitness of Generation 12.5122549019608 Generation 8 14.0625 $collabscore / $collabscore ? $tf ? $numberodocs / $doclength ? $collabscore Average Fitness of Generation 12.4877450980392 Generation 9 14.0625 $collabscore ? $tf ? $collabscore / $collabscore / $doclength Average Fitness of Generation 12.6899509803922 Generation 10 14.0625 $collabscore ? $tf ? $collabscore / $collabscore / $doclength Average Fitness of Generation 12.156862745098 Generation 11 14.0625 $collabscore ? $tf ? $collabscore / $collabscore / $doclength Average Fitness of Generation 12.2732843137255 Generation 12 14.0625 $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Average Fitness of Generation 12.2549019607843 Generation 13 14.0625 $collabscore ? $tf ? $collabscore / $collabscore / $doclength Average Fitness of Generation 12.156862745098 Generation 14 60 14.0625 $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Average Fitness of Generation 11.9791666666667 Generation 15 14.0625 $collabscore ? $tf ? $collabscore / $collabscore / $doclength ? $numberodocs Average Fitness of Generation 11.6421568627451 Generation 16 14.0625 $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Average Fitness of Generation 11.9669117647059 Generation 17 14.0625 $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Average Fitness of Generation 12.1629901960784 Generation 18 14.0625 $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Average Fitness of Generation 11.7708333333333 Generation 19 14.0625 $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Average Fitness of Generation 11.6666666666667 Final Generation ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 61 ???????????????????????????????????? $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $tf ? 
$numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $collabscore ? $tf ? $numberodocs Fitness 12.5 ???????????????????????????????????? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? 62 $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $collabscore ? $tf Fitness 12.5 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 63 ???????????????????????????????????? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 13.125 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? 
$numberodocs Fitness 14.0625 ???????????????????????????????????? 64 $numberodocs ? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength Fitness 13.75 ???????????????????????????????????? $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs 65 Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 ???????????????????????????????????? $numberodocs ? $numberodocs Fitness 9.375 ???????????????????????????????????? $numberodocs ? $numberodocs ? $collabscore ? $tf ? $numberodocs / $doclength ? $numberodocs ? $numberodocs Fitness 14.0625 Best FINAL Tree ???????????????????????????????????? $collabscore ? $tf ? $numberodocs / $doclength Best Fitness : 14.0625 66 APPENDIX B GENETIC PROGRAMMING FRAMEWORK SOURCE CODE # ! / opt / local / bin / perl ?w #GPIR Framework #Able to create ranking functions , test them for their ability #and evolve the ranking functions over generations use DBI; use porter ; use strict ; use warnings ; use Tree :: Binary ; use Tree :: Binary :: Search ; use Tree :: Visualize ; use Tree :: Binary :: Visitor :: InOrderTraversal ; use Tree :: Binary :: Visitor :: InOrderTraversalExpressionTree ; use Tie :: Hash :: StructKeyed ; use Carp; use Storable ; $SIG{ WARN } = \&carp ; $SIG{ DIE } = \& confess ; use vars qw(%collabscoretable @stopwords %clusterinfo %sim %clusterhistory %population $tfavgcol $doclengthavg %nkhash $popsize @terminals @operators @userprofile $dbh %queries % relevancejudgement %docscores $query %invertedindex $numberodocs %tfmaxhash %dfmaxcolhash % doclengthhash %tfavghash ) ; @operators = ( ?+? , ??? , ??? , ?/? , ?log? , ?null?) ; @terminals = ( ? $tf ? , ?$Nk? , ?$tfavgcol ? , ?$tfmax ? , ? $tfavg ? , ?$dfmaxcol ? , ?$numberodocs ? , ?$doclength ? , ?$doclengthavg ? , ?$phistory ? , ? $collabscore ? , ? $clhistory ? , ? null ?) 
; my $dsn = ?DBI:mysql : thesis : localhost ?; my $db user name = ?root ?; my $db password = ?dragon ?; my ( $id , $password) ; 67 $dbh = DBI?>connect($dsn , $db user name , $db password) ; print ?Connected to DB\n?; loadInvertedIndex () ; print ?Inverted Index Loaded from DB\n?; loadQueries () ; print ?Queries Loaded from DB\n?; loadRelevanceJudgements () ; print ?Relevance Judgements Loaded from DB\n?; loadUserProfiles () ; print ?User Profiles Loaded from DB\n?; buildSimMatrix (1) ; print ?Building Similarity Matrix \n?; loadClusterInfo () ; print ?Loaded Cluter Information \n?; my @docids = keys(%invertedindex ) ; $numberodocs = $# docids ; print ?Number of Documents ?. $numberodocs .?\n?; @docids = () ; $tfavgcol = tfavgcol () ; $doclengthavg = doclengthavg () ; $popsize = $ARGV[0]; my $numgeneration = $ARGV[1]; tie %population , ?Tie :: Hash :: StructKeyed ?; createInitialPopulation () ; my $bestfitness = 0; my $besttree = () ; for(my $x = 1; $x < $numgeneration ; $x++) { print ?\nGeneration ?.$x.?\n?; #Lets Print The Population Here my $holder = getBestFromPopulation () ; # PrintBest Member of Current Population print ?Best Fitness ?. $population{$holder }.?\n ?; printTree ( $holder ) ; if ( $population{$holder } > $bestfitness ) { $besttree = $holder ; $bestfitness = $population{$holder }; } 68 print ?Average Fitness of Generation ?. avgFitnessPopulation () .?\n\n?; evolvePopulation () ; } print ?Final Generation \n?; printPopulation () ; print ?Best FINAL Tree\n??????????????????\n?; printTree ( $besttree ) ; print ?Best Fitness : ?. $bestfitness .?\n?; #Output The Graphs Here exit ; sub createInitialPopulation { for(my $x = 0; $x < $popsize ; $x++) { my $treesize = int rand(5) + 2; my $parent = $operators [ int rand($# operators ) ]; my $tree = Tree :: Binary?>new( $parent ) ; $tree = buildTree ( $tree , $treesize , $parent ) ; $population{$tree } = getFitnessOfTree ( $tree ) ; } } sub evolvePopulation { tie my %newpopulation , ?Tie :: Hash :: StructKeyed ?; for(my $x=0; $x < (( int ( sqrt ( $popsize ) ) ) ? 1) ; $x++) { my $besttree = getBestFromPopulation () ; $newpopulation{ $besttree } = () ; delete $population{ $besttree }; #remove besttree from population } my @poparr = ( keys %newpopulation ) ; 69 foreach my $doca ( @poparr) { foreach my $docb ( @poparr) { my $temptree = crossoverTree ($doca , $docb) ; $newpopulation{ $temptree } = () ; } } my @tempary = keys (%newpopulation ) ; my $tempsize = $# tempary ; for(my $y=0; $y < ( $popsize ? $tempsize ) ; $y ++) { my $temptree = crossoverTree ( getRandomFromPopulation () , getRandomFromPopulation () ) ; $newpopulation{$temptree } = () ; } %population = () ; for my $tree ( keys %newpopulation ) { $population{$tree } = getFitnessOfTree ( $tree ) ; } } sub getBestFromPopulation { my $bestfitness = 0; my $besttree = () ; for my $tree ( keys %population ) { if ( $population{$tree } > $bestfitness ) { $bestfitness = $population{ $tree }; $besttree = $tree ; } } return $besttree ; } sub getRandomFromPopulation { my @populationlist = ( keys %population ) ; 70 my $randtree = $populationlist [ int (rand($# populationlist ) ) ]; return $randtree ; } sub printPopulation { for my $tree ( keys %population ) { print ????????????????\n?; printTree ( $tree ) ; print ?Fitness ?. $population{$tree }.?\ n?; } } sub printTree { my ( $tree ) = @ ; print Tree :: Visualize?>new( $tree , ?ASCII? , ? Diagonal ?)?>draw () . ?\n?; print ?\n? x 2; print Tree :: Visualize?>new( $tree , ?ASCII? , ? TopDown?)?>draw () . ?\n?; print ?\n? 
x 2; my $visitor = Tree :: Binary :: Visitor :: InOrderTraversalExpressionTree?>new() ; $tree?>accept( $visitor ) ; my $equation = ( join ? ? , $visitor?>getResults () ) ; print $equation .?\n\n?; } sub getFitnessOfTree { my ( $tree ) = @ ; my %docscores = () ; my $visitor = Tree :: Binary :: Visitor :: InOrderTraversalExpressionTree?>new() ; $tree?>accept( $visitor ) ; my $statement = ( join ? ? , $visitor?> getResults () ) ; eval ( ?sub rankingfunction { my ( $tf , $dfmaxcol , $Nk, $clhistory , $phistory , $tfavg , $doclength , $tfmax , $collabscore ) = @ ; my $score = 0; eval { $score =? . $statement . ? ; }; return $score ; } ?) ; foreach my $query ( keys %queries ) { 71 my @queryterms = split (? ? , $query) ; for my $document ( keys %invertedindex ) { my $score = 0; my $check = 1; my $tfmax = tfmax($document) ; my $doclength = doclength ( $document) ; my $tfavg = tfavg ($document) ; my $phistory = personalHistory ($document) ; my $clhistory = clusterHistory ( $clusterinfo {$document}) ; my $collabscore = collabClusterScore ( $clusterinfo {$document } , 1) ; foreach my $term ( @queryterms) { #Because the query contains multiple terms my $Nk = Nk($term) ; my $dfmaxcol = dfmaxcol($term) ; if ( exists ( $invertedindex{ $document}{$term }) ) { my $tf = $invertedindex {$document}{ $term }; eval { 72 $score = $score + rankingfunction ( $tf , $dfmaxcol , $Nk , $clhistory , $phistory , $tfavg , $doclength , $tfmax , $collabscore ) ? $tf ; }; } else { $check = 0; } } if ($check != 0) { $docscores{$queries{ $query}}{$document } = $score ; } } } my $count = 0; my $overallpercent = 0; for my $query ( keys %docscores ) { $count++; my @topdocs = () ; my $percent = 0; foreach my $did ( sort { $docscores{$query}{$b } <=> $docscores{$query}{$a }} (keys %{ $docscores{$query }})) { if (@topdocs <= 9) { push(@topdocs , $did ) ; if ( exists ( $relevancejudgement{ $query}{$did }) ) 73 { $percent = $percent + 10; } } } $overallpercent = $overallpercent + $percent ; } return ( $overallpercent / $count ) ; } sub buildTree { my ( $tree , $n , $p) = @ ; return $tree unless $n > 0; $n??; if ($n > 1&& $p ne ?log?){ $p = $operators [ int rand($# operators ) ]; $tree?>setLeft ( buildTree (Tree :: Binary?>new($p) , $n , $p) ) ; $p = $operators [ int rand($# operators ) ]; $tree?>setRight ( buildTree (Tree :: Binary?>new($p ) , $n , $p) ) ; } elsif ( $n == 1 && $p ne ?log?){ $p = $terminals [ int rand($# terminals ) ]; $tree?>setLeft ( buildTree (Tree :: Binary ?>new($p) , $n , $p) ) ; $p = $terminals [ int rand($# terminals ) ]; $tree?>setRight ( buildTree (Tree :: Binary ?>new($p) , $n , $p) ) ; } elsif ( $n == 1 && $p eq ?log?){ $p = $terminals [ int rand($# terminals ) ]; $tree?>setRight ( buildTree (Tree :: Binary ?>new($p) , $n , $p) ) ; } elsif ( $n > 1 && $p eq ?log?){ $p = $operators [ int rand($# operators ) ]; $tree?>setRight ( buildTree (Tree :: Binary ?>new($p) , $n , $p) ) ; } 74 return $tree ; } sub crossoverTree { my( $treea , $treeb ) = @ ; my $parent = () ; my $parenttest = int (rand(2) ) ; if ( $parenttest == 1) { $parent = $treea?>getNodeValue () ; } else { $parent = $treeb?>getNodeValue () ; } my $tree = Tree :: Binary?>new( $parent ) ; #Make precaution if root is log if ( $parent ne ?log?) 
{ $tree?>setLeft (treeGoDown( $treea , int ( rand( $treea?>height () ) ) ) ) ; $tree?>setRight (treeGoDown( $treeb , int (rand( $treeb?>height () ) ) ) ) ; } else { $tree?>setRight (treeGoDown( $treeb , int (rand( $treeb?>height () ) ) ) ) ; } return $tree ; } sub treeGoDown { my( $tree , $steps ) = @ ; for(my $x = 0; $x < $steps ; $x++) { if ( $tree?>hasLeft () == 1 && $tree?> hasRight () == 1) { if ( int (rand(2) ) == 0) { $tree = $tree?> getRight () ; } else { $tree = $tree?>getLeft () ; } } elsif ( $tree?>hasRight () == 1) { 75 $tree = $tree?>getRight ; } } return $tree ; } sub loadInvertedIndex { my $sthdoc = $dbh?>prepare (? select ? from CREOccurance?) ; $sthdoc?>execute or die ?Unable to execute query : $dbh?>errstr \n?; %invertedindex = () ; while(my $row = $sthdoc?>fetchrow hashref ) { $invertedindex{$row?>{?DID?}}{$row?>{? Term? }} = $row?>{?Count?}; } $sthdoc?>finish () ; } sub loadQueries { my $sthquery = $dbh?>prepare (? select ? from Queries?) ; $sthquery?>execute or die ?Unable to execute query\n?; %queries = () ; while (my $row = $sthquery?>fetchrow hashref ) { $queries{$row?>{?Query? }} = $row?>{?ID ?} } $sthquery?>finish () ; } sub loadRelevanceJudgements { my $sthreljud = $dbh?>prepare (? select ? from RelevanceJudgements?) ; 76 $sthreljud?>execute or die ?Unable to execute query : $dbh?>errstr \n?; %relevancejudgement = () ; while(my $row = $sthreljud?>fetchrow hashref ) { $relevancejudgement{$row?>{?Query?}}{ $row?>{?DID? }} = $row?>{?Count?}; } $sthreljud?>finish () ; } sub loadUserProfiles { @userprofile = () ; my $sthuprofile = $dbh?>prepare (? select DID from CREUserProfiles where UserID=?1??) ; $sthuprofile?>execute or die ?Unable to execute query : $dbh?>errstr \n?; while (my $ary = $sthuprofile?>fetchrow ) { push( @userprofile , $ary ) ; } $sthuprofile?>finish () ; } sub fixQueries { my ( $query) = @ ; my @queryterms = () ; $query =? tr /[A?Z]/[ a?z ]/; my @tempqueryterms = split (? ? , $query) ; foreach my $word ( @tempqueryterms) { my $holder = 0; foreach my $t ( @stopwords) { if ($word eq $t ) { $holder = 1; } } if ( $holder == 0) { 77 push(@queryterms , porter ($word ) ) } } return @queryterms ; } sub Nk { my $word = $ [0]; my $nkcount = 0; if (! exists ($nkhash{$word}) ) { for my $document ( keys %invertedindex ) { if ( exists ( $invertedindex{ $document}{$word}) ) { $nkcount++; } } $nkhash{$word } = $nkcount ; } else { $nkcount = $nkhash{$word } ; } return $nkcount ; } sub tfmax { # The Maximum Term Frequecny in a Document my $did = $ [0]; my $largest = 0; if (! exists ( $tfmaxhash{$did }) ) { for my $word ( keys %{$invertedindex{ $did }}) { if ( $largest < $invertedindex{ $did}{$word}) { $largest = $invertedindex{$did }{$word}; } } $tfmaxhash{$did } = $largest ; } 78 else { $largest = $tfmaxhash{$did }; } return $largest ; } sub doclength { my $did = $ [0]; my $total = 0; if (! exists ( $doclengthhash{$did }) ) { for my $word ( keys %{$invertedindex{ $did }}) { $total = $total + $invertedindex{$did}{$word }; } $doclengthhash{$did } = $total ; } else { $total = $doclengthhash{$did }; } return $total ; } sub doclengthavg { my $total = 0; for my $did ( keys %invertedindex ) { for my $word ( keys %{$invertedindex{ $did }}) { $total = $total + $invertedindex{$did}{$word }; } } return $total / $numberodocs ; } sub doccount { my @docids = ( keys %invertedindex ) ; return $# docids ; } sub tfavg { #Average Term Frequency in the current document 79 my $did = $ [0]; my $count = 0; my $total = 0; if (! 
exists ( $tfavghash{$did }) ) { for my $word ( keys %{$invertedindex{ $did }}) { $count++; $total = $total + $invertedindex{$did}{$word }; } $tfavghash{$did } = $total / $count ; } #return $total / $count ; return $tfavghash{$did }; #Average Term Frequency in the current document } sub tfavgcol { my $total = 0; for my $did ( keys %invertedindex ) { $total = $total + tfavg ( $did ) ; } return $total / doccount () ; } sub dfmaxcol { my $word = $ [0]; my $largest = 0; if (! exists ( $dfmaxcolhash{$word}) ) { for my $did ( keys %invertedindex ){ if ( exists ( $invertedindex{$did }{$word}) ) { if ( $invertedindex{$did }{$word}) { $largest = $invertedindex {$did}{$word }; } } } $dfmaxcolhash{$word } = $largest ; } else { 80 $largest = $dfmaxcolhash{$word}; } return $largest ; } sub df { #Number of Documents in A Collection That Contain A Word (SAME AS NK) my $word = $ [0]; my $total = 0; for my $did ( keys %invertedindex ) { if ( exists ( $invertedindex{$did}{$word}) ) { $total ++; } } return $total ; } sub hashValueDescending { $docscores{$query}{$b} <=> $docscores{$query}{$a}; } sub personalHistory { my $document = $ [0]; my $check = 0; foreach my $doc ( @userprofile ) { if ($document eq $doc) { $check = 1; } } return $check ; } sub clusterHistory { my $cluster = $ [0]; if ( exists ( $clusterhistory { $cluster }) ){ return 1; } else { return 0; } 81 } sub collabClusterScore { my( $clusternum , $userid ) = @ ; if ( exists ( $collabscoretable {$clusternum }) ){ return $collabscoretable {$clusternum }; } else{ my $score = 0; foreach my $id ( keys %clusterhistory ) { if ( exists $clusterhistory {$id}{ $clusternum }) { if ( $id != $userid ){ $score = $score + $sim {$id}{$userid }; } } } $collabscoretable {$clusternum } = $score ; return $score ; } } sub buildSimMatrix { my $userid = $ [0]; my $sthcluhistory = $dbh?>prepare (? select CREUserProfiles . UserID , CRECluster . Cluster from CREUserProfiles , CRECluster where CRECluster .DID = CREUserProfiles .DID?) ; $sthcluhistory?>execute or die ?Unable to execute query\n?; %clusterhistory = () ; while (my $row = $sthcluhistory?>fetchrow hashref ) { $clusterhistory {$row?>{?UserID?}}{$row?>{? Cluster? }} = 1; } $sthcluhistory?>finish () ; 82 foreach my $user ( keys %clusterhistory ) { if ( $userid != $user ) { $sim{$user}{$userid } = 0; foreach my $cluster ( keys %{ $clusterhistory {$userid }}) { if ( exists $clusterhistory { $user}{ $cluster }) { $sim{$user}{$userid } = $sim{$user}{ $userid } + 1; } } } } } sub avgFitnessPopulation { my $count = 0; my $total = 0; for my $tree ( keys %population ) { $count++; $total = $total + $population{$tree }; } return $total / $count ; } sub loadClusterInfo { %clusterinfo = () ; my $sthcluinfo = $dbh?>prepare (? select ? from CRECluster?) ; $sthcluinfo?>execute or die ?Unable to execute query : $dbh?>errstr \n?; while(my $row = $sthcluinfo?>fetchrow hashref ) { $clusterinfo {$row?>{?DID? }} = $row?>{? Cluster?}; } $sthcluinfo?>finish () ; } 83 sub loadClusterHistory { %clusterhistory = () ; my $sthcluhistory = $dbh?>prepare (? select CRECluster . Cluster from CREUserProfiles , CRECluster where CRECluster .DID = CREUserProfiles .DID and CREUserProfiles . UserID = 1?) ; $sthcluhistory?>execute or die ?Unable to execute query : $dbh?>errstr \n?; while(my $row = $sthcluhistory?> fetchrow hashref ) { $clusterhistory {$row?>{?Cluster? }} = 1; } $sthcluhistory?>finish () ; } 84 APPENDIX C CLUSTERING SOURCE CODE # ! 
#!/opt/local/bin/perl -w
# K-Means document clustering program
# The target number of clusters is taken from the command line
use DBI;

($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time);
printf "%4d-%02d-%02d %02d:%02d:%02d\n", $year + 1900, $mon + 1, $mday, $hour, $min, $sec;

# Hash of {documentID}{term} = count
%occurance = ();
# Hash of {documentID}{documentID} = distance
%distances = ();
# Hash of {documentID} = cluster
%cluster = ();

$clustersize = $ARGV[0];
print "Final Cluster Size: " . $clustersize . "\n";

my $dsn = "DBI:mysql:thesis:localhost";
my $db_user_name = "root";
my $db_password = "";
my ($id, $password);
my $dbh = DBI->connect($dsn, $db_user_name, $db_password);

@frdocs = ();
@biggestcluster = ();

$sth = $dbh->prepare("select * from CRSOccurance");
$sth->execute or die "Unable to execute query: " . $dbh->errstr . "\n";
while ($row = $sth->fetchrow_hashref) {
    $occurance{$row->{'DID'}}{$row->{'Term'}} = $row->{'Count'};
}
$sth->finish();

for $document (keys %occurance) {
    push(@frdocs, $document);
}

%documentwordcount = ();
for $document (keys %occurance) {
    for $term (keys %{$occurance{$document}}) {
        if (exists($documentwordcount{$document})) {
            $documentwordcount{$document} = $documentwordcount{$document} + $occurance{$document}{$term};
        }
        else {
            $documentwordcount{$document} = $occurance{$document}{$term};
        }
    }
}
print "StepOne Finished\n";

# This hash keeps track of how large each cluster is
%clustersizes = ();

# Start with every document assigned to a single cluster
foreach $document (@frdocs) {
    $cluster{$document} = 1;
    if (exists($clustersizes{1})) {
        $clustersizes{1} = $clustersizes{1} + 1;
    }
    else {
        $clustersizes{1} = 1;
    }
}

$numberofsplits = 1;
while ($numberofsplits < $clustersize) {
    $largest = 0;
    $numberofsplits++;

    # Find the largest remaining cluster
    foreach $key (keys %clustersizes) {
        if ($clustersizes{$key} > $largest) {
            $largest = $clustersizes{$key};
            $largestCluster = $key;
        }
    }
    print "Splitting Largest Cluster " . $largestCluster . "\n";
    print "Cluster Size " . $clustersizes{$largestCluster} . "\n";

    @biggestcluster = ();
    foreach $id (keys %cluster) {
        if ($cluster{$id} == $largestCluster) {
            push(@biggestcluster, $id);
        }
    }

    # The two most distant documents in the cluster become the new seeds
    @twodocs = greatestdistance(@biggestcluster);
    $pointA = $twodocs[0];
    $pointB = $twodocs[1];

    # Reassign each document in the split cluster to the nearer seed
    foreach $id (keys %cluster) {
        if ($cluster{$id} == $largestCluster) {
            $dista = distance($id, $pointA);
            $distb = distance($id, $pointB);
            if ($dista < $distb) {
                $cluster{$id} = $largestCluster;
            }
            else {
                $cluster{$id} = $numberofsplits;
            }
        }
    }
    print $cluster{$pointA} . "\t" . $cluster{$pointB} . "\n";

    # Recount cluster sizes after the split
    %clustersizes = ();
    foreach $id (keys %cluster) {
        $clLoc = $cluster{$id};
        if (exists($clustersizes{$clLoc})) {
            $clustersizes{$clLoc} = $clustersizes{$clLoc} + 1;
        }
        else {
            $clustersizes{$clLoc} = 1;
        }
    }
}

# Write the final document-to-cluster assignments back to the database
$sthb = $dbh->prepare("CREATE TABLE CRSCluster (DID VARCHAR(30), Cluster INT, PRIMARY KEY(`DID`, `Cluster`))");
$sthb->execute;
$sthb->finish();

foreach $id (keys %cluster) {
    $sthc = $dbh->prepare("insert into CRSCluster values ('" . $id . "','" . $cluster{$id} . "')");
    $sthc->execute;
    $sthc->finish();
}

foreach $thing (keys %clustersizes) {
    print $thing . "\t" . $clustersizes{$thing} . "\n";
}

($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time);
printf "%4d-%02d-%02d %02d:%02d:%02d\n", $year + 1900, $mon + 1, $mday, $hour, $min, $sec;
exit;

sub distance {
    # Squared Euclidean distance between the term-frequency vectors of two
    # documents, cached in %distances
    $distance = 0;
    $difference = 0;
    $didx = $_[0];
    $didy = $_[1];
    if (!exists($distances{$didx}{$didy})) {
        if ($didx ne $didy) {
            # Terms that appear in document x (and possibly also in document y)
            for $term (keys %{$occurance{$didx}}) {
                if (exists($occurance{$didy}{$term})) {
                    $difference = ($occurance{$didx}{$term} - $occurance{$didy}{$term})**2;
                }
                elsif (!exists($occurance{$didy}{$term})) {
                    $difference = ($occurance{$didx}{$term} - 0)**2;
                }
                $distance = $distance + $difference;
            }
            # Terms that appear only in document y
            for $term (keys %{$occurance{$didy}}) {
                if (!exists($occurance{$didx}{$term})) {
                    $difference = ($occurance{$didy}{$term} - 0)**2;
                    $distance = $distance + $difference;
                }
            }
        }
        else {
            $distance = 0;
        }
        $distances{$didx}{$didy} = $distance;
        $distances{$didy}{$didx} = $distance;
    }
    else {
        $distance = $distances{$didx}{$didy};
    }
    return $distance;
}

sub greatestdistance {
    # Returns the two most distant documents in the list; for large clusters only
    # a sample of the documents is examined to keep the pairwise search tractable
    @holderlist = @_;
    if ($#holderlist > 200) {
        $startpoint = int(($#holderlist) / 100);
        @documentlist = @holderlist[0 .. $startpoint];
    }
    else {
        @documentlist = @holderlist;
    }
    $greatestdistance = 0;
    @furthesttwo = ();
    foreach $documentone (@documentlist) {
        foreach $documenttwo (@documentlist) {
            $distance = distance($documentone, $documenttwo);
            if ($distance > $greatestdistance) {
                $greatestdistance = $distance;
                $pointa = $documentone;
                $pointb = $documenttwo;
            }
        }
    }
    print $pointa . "\t" . $pointb . "\n";
    push(@furthesttwo, $pointa);
    push(@furthesttwo, $pointb);
    return @furthesttwo;
}
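The clustering program above expects the target number of clusters as its only command-line argument and reads term counts from the CRSOccurance table of the local MySQL thesis database, writing the final assignments to CRSCluster. The following is only a minimal invocation sketch, not part of the framework itself: the file name cluster.pl is hypothetical, and 99 is merely an illustrative cluster count.

    # Hypothetical driver: assumes the program above is saved as cluster.pl and that
    # the MySQL "thesis" database is reachable with the credentials hard-coded above.
    system("perl", "cluster.pl", "99");    # request 99 clusters, written to CRSCluster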