Text and Predictive Analytics: Classification of On-line Customer Opinion Surveys

by Ahmet Yucel

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
December 12, 2011

Keywords: Singular Value Decomposition, Feature Selection, Predictive Modeling

Copyright 2011 by Ahmet Yucel

Approved by
David Mark Carpenter, Chair, Professor of Mathematics and Statistics
Asheber Abebe, Associate Professor of Mathematics and Statistics
Peng Zeng, Associate Professor of Mathematics and Statistics

Abstract

Thanks to computers and other data storage technologies, today we can easily store and rapidly access enormous amounts of data. According to a report published by Andrew Harbison and Pearse Ryan, more than 80% of the information stored on electronic platforms consists of written text (unstructured data). Text analytics can be a very useful tool for converting these large amounts of unstructured data into regular structured data and then into actionable information. Predictive models are very useful tools in text mining, and their results may give important information that can be used in decision making. Creating predictive models involves constructing features from pre-existing features and applying feature selection techniques. Research has shown that applying different feature selection techniques when creating predictive models can increase a model's accuracy rate. In this thesis we review text and predictive analytical methods and apply them to a customer satisfaction case study.

Acknowledgments

First and foremost, I would like to thank God. I would like to thank Professor David Mark Carpenter for all his support, encouragement, and patience. Thank you to my parents Hasan Huseyin and Zeynep Yucel, my sister Sevda Teker, and her husband Erkan Teker for their love, support, and encouragement. They have been very supportive and encouraging throughout my academic tenure here at Auburn. Last but not least, I would like to thank the members of my committee, Dr. Peng Zeng and Dr. Asheber Abebe, for their valuable feedback and suggestions that helped me improve the thesis.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1. INTRODUCTION
2. LITERATURE REVIEW
  2.1 TEXT MINING OVERVIEW
    2.1.1 TEXT CATEGORIZATION / CLASSIFICATION
      2.1.1.1 TOKENIZATION
      2.1.1.2 STOP WORD REMOVAL
      2.1.1.3 STEMMING
      2.1.1.4 FEATURE EXTRACTION
        2.1.1.4.1 FEATURE CONSTRUCTION
        2.1.1.4.2 FEATURE SELECTION
          2.1.1.4.2.1 Information Gain
          2.1.1.4.2.2 Chi-square
      2.1.1.5 WEIGHTING
        2.1.1.5.1 BOOLEAN RETRIEVAL
        2.1.1.5.2 TERM FREQUENCY WEIGHTING
        2.1.1.5.3 TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY WEIGHTING
      2.1.1.6 VECTOR SPACE MODEL
        2.1.1.6.1 VECTOR CREATING
          2.1.1.6.1.1 Binary Frequency Weighting
          2.1.1.6.1.2 Term Frequency Weighting
          2.1.1.6.1.3 Term Frequency-Inverse Document Frequency Weighting Method
        2.1.1.6.2 KEYWORD SELECTION AND WEIGHTING
    2.1.2 ASSOCIATION ANALYSIS
    2.1.3 CLUSTERING
    2.1.4 INFORMATION EXTRACTION
  2.2 TEXT MINING ALGORITHMS
    2.2.1 NAÏVE BAYES ALGORITHM
    2.2.2 GENERAL LINEAR MODELS (GLM) ALGORITHM
    2.2.3 SUPPORT VECTOR MACHINE (SVM) ALGORITHM
    2.2.4 K-MEANS ALGORITHM
    2.2.5 APRIORI ALGORITHM
    2.2.6 DECISION TREES
3. A TEXT ANALYTICS APPLICATION
  3.1 STUDY DESIGN AND DATA DESCRIPTION
  3.2 MATERIALS/TOOLS
  3.3 PROCEDURE
  3.4 RESULTS
4. CONCLUSIONS
  4.1 FUTURE WORK
REFERENCES
APPENDIX

List of Figures

Figure 1 Text mining Process
Figure 2 Text categorization/classification process
Figure 3 The task of text preparation and processing
Figure 4 Breaking up the text into tokens by appointed marks
Figure 5 Matrix representation of the document vectors
Figure 6 Market Basket Analysis
Figure 7 Data Clustering
Figure 8 Hierarchical Clustering
Figure 9 The hyperplane maximizing the geometric distance
Figure 10 Hyperplanes separating the two classes
Figure 11 Maximum-margin hyperplane and margins for an SVM
Figure 12 Steps of the K-Means Algorithm
Figure 13 Apriori Algorithm example
Figure 14 Decision Tree Layout
Figure 15 Text mining filtration window
Figure 16 Selected words
Figure 17 Scree Plot
Figure 18 Scatterplot of Component-2 vs Component-1
Figure 19 Colored Scatterplot of Component-2 vs Component-1 with Labels
Figure 20 General look of the Data Set
Figure 21 Recoding the Negative_Connotations variable
Figure 22 Interaction Plot: City vs Negative_Connotations
Figure 23 Decision Tree with SVD scores
Figure 24 Decision Tree-1
Figure 25 Decision Tree-2
Figure 26 First part of the second decision tree
Figure 27 Second part of the second decision tree
Figure 28 Third part of the second decision tree

List of Tables

Table 1 Data Frequency for each city
Table 2 Model Evaluation of the predictive model with SVD scores
Table 3 Selected Features-1
Table 4 Model Evaluation-1
Table 5 Selected Features-2
Table 6 Model Evaluation-2

1. INTRODUCTION

Advances in storage capabilities, huge data collections, and easy access to target data have left people in an immense pool of data. One of the most important ways to deal with this problem is data mining, the analysis process of discovering knowledge in a database. However, according to research by Merrill Lynch and Gartner, 85-90% of the data all over the world are stored in unstructured form (McKnight, 2005), and thus data mining algorithms are not enough by themselves. At this point, text mining plays an important role. Text mining is the process of exploring a collection of unstructured data and extracting useful, structured information from it. Text mining methods can be used in very different areas, including business documents, customer reviews, web pages, e-mails, and other sources.

One of the most popular text mining techniques is predictive modeling. Decision trees, neural networks, and boosted trees are different types of predictive models. Predictive models are used to determine which class a set of data belongs to. For example, a technology company can apply predictive modeling algorithms to identify the customers to target before bringing out a new model. Building an accurate predictive model is very important for obtaining reliable results. Especially in business, inaccurate results may be very expensive. For example, if the technology company mentioned above relies on inaccurate results and produces its new model according to them, it may lose its market, customer confidence, and money. Researchers generally think that an ensemble model approach, such as decision trees, will produce predictive models with higher accuracy rates. Therefore, in our experiment we use decision trees: we investigate this premise by applying different feature selection methods and then building decision trees for the selected features.

The rest of this thesis is organized as follows: Chapter 2 contains the literature review, with a general overview of text mining processes and descriptions of various text mining algorithms and techniques. Chapter 3 discusses the experiment design, data collection, materials/tools, and results. Chapter 4 presents conclusions and ideas for future work.

2. LITERATURE REVIEW

2.1 TEXT MINING OVERVIEW

Thanks to computers and internet technology, today it is very easy to reach a tremendous pile of information and knowledge. Every day, millions of pages of news text flow into the web sites of news agencies and the contents of millions of web sites are updated, and approximately 90 trillion e-mails (2010 est.) are sent per year among more than 2 billion e-mail account owners all over the world (Radicati & Hoang, 2011).
For example, if we consider that Microsoft's webmail service alone has 256.2 million users (Graham, 2008), it is easy to see what a pile of information we are in. In short, on the electronic platform, unstructured and continuously growing piles of information, such as reader/viewer comments, academic articles and journals, e-books, blogs, e-mails, chat conversations, research documents, etc., are waiting to be analyzed. According to the results of a report published by Andrew Harbison and Pearse Ryan (Harbison & Ryan, 2009), almost 80-85% of the information which can be processed digitally consists of written text (unstructured data). Using the methods of data mining to select the regular data from these piles of unstructured, dispersed data is therefore becoming very important.

Text and data mining are similar in that both try to obtain information from massive and unstructured sources; however, text mining is based on text sources (Chang, Healey, McHugh & Wang, 2001; Kroeze & Bothma, 2007). Regular structured data are extracted from the unstructured data (text), and thus hidden information is discovered. This process is done with a variety of text mining techniques (Kroeze & Bothma, 2007). However, it is a very difficult and time-consuming process.

Natural language processing (NLP) is a sub-discipline of computer science and linguistics in which studies are carried out on the computational processing of natural language texts and speech. Modern statistical NLP algorithms therefore require the use of linguistics, computer science, and statistics (Charniak, 1984). All programming languages used around the world have specific structures, rules, and a standardized form. Natural languages cannot be described so easily. All around the globe there are hundreds of different official or known languages, and each language has more than 100,000 words. In addition, languages are always changing and expanding, they contain a lot of uncertainty, and each has its own unique grammar structure. For this reason, it is impossible for text mining software to interpret a language 100% correctly (Erol, 2009).

Figure-1: Text mining Process

The text mining process (see Figure-1) (Stavrianou, Andritsos & Nicoloyannis, 2007) is divided into four main categories: text classification or text categorization (TC), association analysis, clustering, and information extraction (IE). The classification or TC process assigns objects to previously known categories or classes. Association analysis is used to identify words which are often associated with each other and to make sets of documents, or the contents of documents, more understandable. IE techniques are used to find the useful data in documents or statements. Cluster analysis is used to discover the underlying structure of document sets.

2.1.1 TEXT CATEGORIZATION / CLASSIFICATION

Text categorization/classification (TC) is the grouping of texts into two or more classes (Mahinovs & Tiwari, 2007). The goal of TC is to classify documents (academic articles, e-mails, etc.) into categories. For example, news articles can be classified into "local" and "global", e-mails into "spam" and "others", and customer feedback into "positive" and "negative". Today, the internet is one of the biggest and most important information sources. Undoubtedly, its importance comes from its being reachable very easily and fast.
However, with the uncontrolled expansion of the internet, reaching the targeted data has become like hunting a particular species of fish in an ocean of information. Categorization is a significant method for reducing the time needed to reach information, and this is one of the most important motivations for TC. An example use for TC is dealing with spam e-mails. However, because a large portion of the text data on the internet is written in natural language, categorizing texts in this format is very difficult. To overcome this problem, texts written in natural language should be transformed into digital texts (Asyali & Yildirim, 2004). Figure-2 shows the text categorization process (Mahinovs & Tiwari, 2007).

Figure-2: Text categorization/classification process.

In addition, there should of course be a pre-processing step that prepares the text to be categorized. For example, specific tags like XML/HTML are identified as blocks of text for section searching (Oracle, 2003), non-letter characters are replaced by spaces, single-letter words are deleted, and all characters are converted to lower case (Tonta, Bitirim & Sever, 2002). There are several steps before text categorization: tokenization, stop word removal, stemming, feature extraction, and the vector space model (Mahinovs & Tiwari, 2007).

2.1.1.1 TOKENIZATION

A token is a single instance of a type. In text mining, tokenization is the breaking of a stream of text up into tokens, and it is the very first step in preparing natural language text for categorization. Figure-3 shows where extended tokenization is located within the text preparation and processing framework (Hassler & Fliedl, 2006).

Figure-3: The task of text preparation and processing.

Firstly, the texts are broken up into meaningful components: for example, into chapters, sections, sentences, words, phrases, symbols, or other meaningful elements called tokens (Feldman & Sanger, 2007). Generally, words are surrounded by whitespace and may be followed by punctuation, parentheses, or quotes. So a simple tokenization rule can be stated as follows: break up the character sequence at whitespace positions and cut off punctuation, parentheses, and quotes at both ends of the fragments to get the sequence of tokens (Sherpa & Choejey, 2008). However, in following this rule we may encounter some problems that require extra work.

First of all, not all periods are punctuation. They may be markers for abbreviations such as "U.K.", "Mr.", "Dr.", and so on. Only periods which are sentence markers should be removed to get separate tokens (Feldman & Sanger, 2007). The sentence markers "!" and "?" are generally obvious punctuation. The most difficult symbols to handle are colons ":" and semicolons ";": distinguishing their different uses is very hard without analyzing the whole sentence. The other problem is with ordinal numbers. In Turkish and other European languages, ordinal numbers are written with a trailing period after the number; for example, "13th" is written "13.". These ordinal numbers cause the same problem as abbreviations: a number which is followed by a period may be an ordinal number, a cardinal number in sentence-final position, or an ordinal number in sentence-final position. Distinguishing them is not possible without contextual information.
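To make the basic rule concrete before turning to the further special cases below, here is a minimal Python sketch. The abbreviation list is a small, hypothetical one; the sketch handles only the whitespace rule and the abbreviation-period problem, not the harder cases discussed in the rest of this section.

import re  # not strictly needed here; real tokenizers usually rely on regexes

# A minimal tokenizer sketch, assuming a small, hypothetical abbreviation
# list; a real tokenizer needs a much larger lexicon and further rules.
ABBREVIATIONS = {"u.k.", "mr.", "dr.", "prof."}

def tokenize(text):
    """Break text at whitespace positions, then cut punctuation,
    parentheses, and quotes off both ends of each fragment."""
    tokens = []
    for fragment in text.split():
        # Strip surrounding quotes, brackets, commas, etc., but keep any
        # trailing period for the moment so abbreviations survive intact.
        core = fragment.strip('(),;:!?[]"\'')
        if core.lower() in ABBREVIATIONS:
            tokens.append(core)            # the period marks an abbreviation
        else:
            core = core.strip('.,;:!?()[]"\'')  # now remove periods as well
            if core:
                tokens.append(core)
    return tokens

print(tokenize('Dr. Smith arrived (late) from the U.K. and said "hello".'))
# -> ['Dr.', 'Smith', 'arrived', 'late', 'from', 'the', 'U.K.', 'and', 'said', 'hello']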
We said above that tokens do not contain whitespace. However, there are some multiword expressions, such as complex prepositions like "because of", conjunctions like "so that", adverbs like "at all", date expressions like "Jan. 16, 1986", time expressions like "2:30 am", and proper names like "AC Milan", which are better accepted as single tokens. The opposite is done for contracted words like "we'll" or "aren't": these are separated into two words, "we will" or "are not". The last problem that might be encountered in tokenization is missing whitespace. Sometimes there is no blank after a punctuation mark, as in "hours.The" or "however,when", which should be broken up into three tokens (Hassler & Fliedl, 2006; Ben-Hur & Weston).

Here is a step-by-step example of tokenization and the typing of tokens (Hassler & Fliedl, 2006):

1. identify single tokens
2. type single tokens
3. split sentence end markers
4. reinterpret single-token types
5. merge and split tokens recursively
6. reinterpret all token types

The token types include:

- Alphabetics: no letters capitalized, first letter capitalized, all letters capitalized, mixed cases, etc.
- Numerics: plain numbers, numbers containing periods or colons, etc.
- Punctuation marks: sentence ending markers, pairwise marks like brackets and quotes, single sentence-internal marks like commas, etc.
- Mixtures: ending with a sentence end marker, ending with a hyphen, starting with a hyphen, containing slashes/hyphens, containing numbers, etc.

Figure-4 shows how to break up the text into tokens by the marks appointed earlier.

Figure-4: Breaking up the text into tokens by appointed marks.

2.1.1.2 STOP WORD REMOVAL

Stop words are common words that do not carry much meaning in a retrieval system. They are a part of natural language that every text miner will encounter. Stop words should be removed from a text because they make the text look heavier and less important to analysts and are not necessary for the analysis, so eliminating them also gives some data reduction. A query done using stop words would have a weak ability to categorize the text, because these words return every element of the data set as a result (Adsiz, 2006). In enterprise search, all stop words, for example common words like "a" and "the", are removed from multiple-word queries to increase search performance. There is no single master list of stop words which all tools use. Any group of words can be chosen as the stop words for a given purpose, depending on their importance and on data reduction needs. For some search engines, these are some of the most common short function words, such as "the", "is", "at", "which", and "on". In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words, including lexical words such as "want", from queries in order to improve performance (Stackoverflow, 2008). (A lexical item, or lexical unit, is a single word or a chain of words that forms a basic element of a language's lexicon (vocabulary); examples are "cat", "traffic light", "take care of", "by the way", and "it's raining cats and dogs". A complete list of 429 English stop words can be found at http://www.lextek.com/manuals/onix/stopwords1.html.)
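On a token list, stop-word removal itself is a one-line filter. The following sketch uses a tiny illustrative stop-word set; a production system would use a larger list, such as the 429-word list referenced above.

from typing import List

# A tiny illustrative stop-word set, not an authoritative list.
STOP_WORDS = {"a", "an", "the", "is", "at", "which", "on", "and", "of", "to"}

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop common function words that carry little retrieval value."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "customer", "is", "happy", "with", "the", "service"]
print(remove_stop_words(tokens))   # -> ['customer', 'happy', 'with', 'service']

Note that such a filter would also discard the "The" in "The Who", which is exactly the phrase-search problem described above.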
2.1.1.3 STEMMING

Stemming is the process of reducing inflected or derived words to their stem, base, or root forms. Certainly, the number of occurrences of a word in a text gives important information about the text: it is quite likely that a word which is frequently repeated in a text is related to the subject of the text. If the inflected forms of a word and its simple form are counted as different words, the word appears less important than it should be. However, when we consider the meaning, different forms of the word may be treated as the same. For example, stemmer, stemming, and stemmed all reduce to stem (Julie, 1968), and these words will be treated as the same. The stem is the morphological root of the word and is not necessarily a valid word itself. Porter's algorithm is a popular stemming method (Mustafa, Akbar & Sultan, 2009).

2.1.1.4 FEATURE EXTRACTION

In all size reduction algorithms, the words in the documents are collected into a word list. Then, according to the results of the reduction algorithms, some words in the word list are removed, and finally only the remaining words are used. Feature extraction is a data reduction process that results in a much smaller and richer set of attributes: word clusters. Here, word clusters are created by the feature similarity of the words, and after grouping words into clusters, these clusters are treated as features instead of the individual words (Feldman & Sanger, 2007). Feature extraction maps a large amount of resources into simpler data (Dragos & Manolescu, 1998). When dealing with large and complex data, much of the text may not be meaningful for the aim of the analysis. To overcome this problem we need to use classification algorithms; the goal of feature extraction is to improve the effectiveness of classification and analysis (Rustum, 2007). Within the feature extraction frame, we can restructure the data by specific features and extract the important information which is crucial for the aim of the analysis, improving the quality of the classifier. In addition, because term frequency is an important source of information for distinguishing documents with different contents, we can create feature vectors through word algorithms and statistical approaches based on term frequency (Liangtu & Xiaoming, 2007). Feature extraction operates in two main steps: feature construction and feature selection.

2.1.1.4.1 FEATURE CONSTRUCTION

Feature construction is the process of creating a new, previously unknown feature (Li, 2007), or of designing a modified feature from the pre-existing features associated with a database.

2.1.1.4.2 FEATURE SELECTION

Feature selection is applied to eliminate insignificant and irrelevant features from texts and to create subsets consisting of significant features. In this way the existing features are simplified and the dimension of the texts is reduced; in other words, the existing features are transformed into a lower-dimensional space, and thus comprehensibility can be significantly improved (Feldman & Sanger, 2007). In short, feature selection aims at creating more suitable and clearer data that can be analyzed easily and at revealing important (Li, 2007) but hidden points (Kim, Street & Menczer, 2003). There are many filters for evaluating and eliminating features. In order to perform the filtering, the relevance of features, such as document frequency, should be calculated as a measure. Interclass distance, statistical dependence, and information-theoretic measures are typical examples of filters. Some of them are very strong and may remove almost 90-99% of all features (Feldman & Sanger, 2007).

The document frequency of a word is the number of documents in which the word appears. A researcher identifies a certain frequency threshold such that words with frequencies below that threshold are removed from the collection. This method is based on the idea that words appearing less often than a certain threshold do not have a decisive role or the ability to identify categories (Bolat, 2003).
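As a sketch of this document-frequency filter (the threshold min_df is an illustrative parameter standing in for the researcher's chosen cutoff):

from collections import Counter

def df_filter(documents, min_df=2):
    """Keep only the words whose document frequency (the number of
    documents in which the word appears) reaches the chosen threshold."""
    df = Counter()
    for doc in documents:        # each document is a list of tokens
        df.update(set(doc))      # count each word at most once per document
    return {word for word, count in df.items() if count >= min_df}

docs = [["good", "service", "fast"],
        ["slow", "service"],
        ["good", "value"]]
print(df_filter(docs, min_df=2))   # -> {'good', 'service'}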
There are also other useful measures of feature relevance that take the relations between features and categories into account, for example information gain (IG) and chi-square (Feldman & Sanger, 2007).

2.1.1.4.2.1 Information Gain (IG)

This method measures the effect of each word on the categorization. Let $c_1, c_2, \ldots, c_m$ be the possible categories. The IG of a word $w$ is computed with the following formula:

$$IG(w) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(w) \sum_{i=1}^{m} P(c_i \mid w) \log P(c_i \mid w) + P(\bar{w}) \sum_{i=1}^{m} P(c_i \mid \bar{w}) \log P(c_i \mid \bar{w}),$$

where $P(c_i)$ is the probability that a document belongs to category $c_i$ among all categories, $P(w)$ is the probability that $w$ appears in a random document among all documents, $P(c_i \mid w)$ is the probability that a document belongs to category $c_i$ given that $w$ appears in it, and $P(c_i \mid \bar{w})$ is the probability that a document belongs to category $c_i$ given that $w$ does not appear in it. The IG is computed for each word, and the words whose IG is less than a certain threshold identified by the analyst are removed from the collection (Adsiz, 2006).

2.1.1.4.2.2 Chi-square

Chi-square ($\chi^2$) measures the dependency between a word $w$ and a category $c_i$, where

A is the number of documents in which $w$ appears and which belong to category $c_i$,
B is the number of documents in which $w$ appears but which do not belong to category $c_i$,
C is the number of documents in which $w$ does not appear but which belong to category $c_i$,
D is the number of documents in which $w$ does not appear and which do not belong to category $c_i$, and
N is the total number of documents in the collection.

The measure is

$$\chi^2(w, c_i) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}.$$

The $\chi^2$ value is computed for each word, and the words with $\chi^2$ less than a certain threshold are removed from the collection (Bolat, 2003).

2.1.1.5 WEIGHTING

Weighting is a process that consists of choosing the terms which are important for a document (those that contribute more than the others) and giving these terms more importance (weight) in the analysis (Wikipedia, 2011). There are several weighting methods: Boolean retrieval, term frequency weighting, and term frequency-inverse document frequency (tf-idf) weighting (Adsiz, 2006).

2.1.1.5.1 Boolean Retrieval

Boolean retrieval is one of the simplest and oldest retrieval methods, using an exact-match model. The Boolean method comes from Boolean algebra, where keywords are combined with the connecting search operators AND, OR, and NOT. Boolean logic is a popular and simple method generally used on the internet. It is based on grouping into two classes, such as "0 or 1", "non-existent or existent", "white or black", "yes or no", etc. (George, 2003). The method is applied to text as follows: if a word occurs in the text, then the weight of the word is 1; otherwise it is 0. That is,

$$w_{ik} = \begin{cases} 1 & \text{if } tf_{ik} > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where $w_{ik}$ is the weight of the $i$th term of the $k$th text and $tf_{ik}$ is the number of occurrences of that term in the text (Adsiz, 2006).

2.1.1.5.2 Term Frequency Weighting

In this method, the number of occurrences of a word in a document is taken as the weight of that word. That is, the term frequency weight is equal to the number of occurrences of the word in the document (Adsiz, 2006):

$$w_{ik} = tf_{ik}.$$

Here, if we take the number of occurrences of each word as a term of the document, instead of the words themselves, then we have quantitative terms. This is called the bag of words model in the literature (Manning, Raghavan & Schütze, 2008).
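Both weighting schemes are straightforward to compute from a token list; a minimal sketch of the Boolean and term frequency weights defined above, over a hypothetical toy document:

from collections import Counter

def boolean_weights(tokens):
    """Boolean retrieval: w_ik = 1 if the term occurs in the text, else 0
    (absent terms simply get no entry in this sparse representation)."""
    return {term: 1 for term in set(tokens)}

def term_frequency_weights(tokens):
    """Bag-of-words: each term's weight is its number of occurrences."""
    return Counter(tokens)

doc = ["slow", "service", "slow", "slow", "friendly", "staff"]
print(boolean_weights(doc))         # e.g. {'slow': 1, 'service': 1, 'friendly': 1, 'staff': 1}
print(term_frequency_weights(doc))  # Counter({'slow': 3, 'service': 1, 'friendly': 1, 'staff': 1})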
2.1.1.5.3 Term Frequency-Inverse Document Frequency (Tf-Idf) Weighting

Term frequency-inverse document frequency (tf-idf) weighting is a popular weighting method whose main application area is information retrieval (IR). In this method, documents are identified in the vector space model. A tf-idf weighting function works as follows (Soucy & Mineau). As mentioned above, tf-idf weights are often used to evaluate how significant a word is to a text. We know that tf measures how frequent a word is in a document, while idf measures how infrequent it is across documents. A word that occurs frequently in a document appears significant for that document; if, in addition, the word has a low frequency in the whole collection of documents, then it can be very important for describing the category of the text. This is the general idea behind the tf-idf weighting procedure (Adsiz, 2006), which combines the two kinds of frequency into one weighting measure. We define the inverse document frequency of a term as follows (Manning, Raghavan & Schütze, 2008):

$$idf_i = \log \frac{N}{df_i},$$

where $N$ is the total number of documents and $df_i$ is the number of documents in which the word occurs. The tf-idf weighting value for the $i$th term of the $k$th text is therefore given by the following formula (Manning, Raghavan & Schütze, 2008):

$$w_{ik} = tf_{ik} \times \log \frac{N}{df_i}.$$

2.1.1.6 VECTOR SPACE MODEL

The Vector Space Model (VSM) was first introduced by Gerard Salton. This mathematical approach is one of the most popular information retrieval models in text mining (Salton & Singhal, 1995). Text is messy and difficult to manage; the VSM provides a mathematical representation of documents. In this model, every item in a collection is represented as a set, or vector, of terms extracted from the text itself (Stavrianou, Andritsos & Nicoloyannis, 2007). There are specific methods for creating these vectors of terms.

2.1.1.6.1 Vector Creating

Each document $d_k$ in the collection is represented as a vector of term weights, $d_k = (w_{1k}, w_{2k}, \ldots, w_{nk})$, where $n$ is the number of terms and each weight $w_{ik}$ is computed with one of the weighting methods described above.
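Putting the pieces together, here is a minimal sketch that builds tf-idf document vectors over a small tokenized collection, directly following the formulas above (the toy documents are hypothetical):

import math
from collections import Counter

def tf_idf_vectors(documents):
    """Represent each document as a vector of tf-idf weights,
    w_ik = tf_ik * log(N / df_i), over a shared, sorted vocabulary."""
    n_docs = len(documents)
    df = Counter()                        # document frequency of each term
    for doc in documents:
        df.update(set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

docs = [["good", "service"], ["slow", "service"], ["good", "value"]]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)        # -> ['good', 'service', 'slow', 'value']
print(vectors[0])   # tf-idf weights of the first document

Note that a term occurring in every document gets $idf = \log 1 = 0$, reflecting the fact that such a term cannot discriminate between documents.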