Upper and Expected Value Normalization for Evaluating Information Retrieval and Text Generation Systems
Type of Degree: PhD Dissertation
Computer Science and Software Engineering
Restriction Type: Auburn University Users
An evaluation metric is a crucial component in improving the performance of any system. An accurate metric can detect and compare multiple different models, thereby serving the user's need in a specific domain. Ranking in Information Retrieval (IR) and text summarization in Natural Language Processing (NLP) are two important tasks that often serve as key components within an intelligent system. For instance, a search engine uses a ranking algorithm to determine the order in which search results are displayed: the algorithm analyzes various factors to evaluate the relevance and quality of web pages and then ranks them by their perceived value to the user. Although many evaluation metrics have been proposed to better understand ranking/summarization models and to improve intelligent systems, empirical evaluation remains a challenge. While the original IR evaluation metrics are normalized with respect to their upper bounds based on an ideal ranked list, a corresponding expected value normalization for them has not yet been studied. We present a framework with both upper and expected value normalization, where the expected value is estimated from a randomized ranking of the documents present in the evaluation set. We then conducted two case studies by instantiating the new framework for two popular IR evaluation metrics, nDCG and MAP, and comparing them against the traditional metrics. For the NLP domain, we specifically consider ROUGE and BERTScore for text summarization evaluation and conducted two case studies by instantiating the new framework for ROUGE/BERTScore to observe the implications, where the expected ROUGE/BERTScore is calculated from an expected summary given a source document, resulting in an instance-level penalty for each source document.
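To illustrate the idea for one metric, the sketch below applies upper and expected value normalization to DCG, estimating the expected value by Monte Carlo over random permutations of the retrieved documents. The specific normalization formula (score minus expected value, divided by upper bound minus expected value) and the function name `ue_ndcg` are illustrative assumptions, not necessarily the exact formulation used in the dissertation.

```python
import math
import random

def dcg(relevances):
    # Discounted cumulative gain with the standard log2 position discount.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ue_ndcg(ranked_rels, n_samples=1000, seed=0):
    """Sketch of an upper/expected-value normalized nDCG.

    `ranked_rels` are the relevance labels of the documents in the
    order the system ranked them. The upper bound comes from the ideal
    (descending) ordering; the expected value is estimated by averaging
    DCG over random shuffles of the same documents. The exact formula
    here is an assumption for illustration.
    """
    upper = dcg(sorted(ranked_rels, reverse=True))  # ideal ranked list
    rng = random.Random(seed)
    pool = list(ranked_rels)
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(pool)
        total += dcg(pool)
    expected = total / n_samples
    if upper == expected:  # uninformative query: all documents tied
        return 0.0
    return (dcg(ranked_rels) - expected) / (upper - expected)
```

Under this normalization, an ideal ranking scores 1, a ranking no better than random scores about 0, and a query whose documents are all equally relevant (an uninformative query) contributes 0 rather than the perfect score that plain upper-bound normalization would assign it.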
For the ranking task, experiments on two Learning-to-Rank (LETOR) benchmark data sets, MSLR-WEB30K (30K queries and 3,771K documents) and MQ2007 (1,700 queries and 60K documents), with eight LETOR methods (pairwise and listwise) demonstrate the following properties of the new expected-value-normalized metrics: 1) Statistically significant differences between two methods in terms of an original metric no longer remain statistically significant in terms of its Upper-Expected (UE) normalized version, and vice versa, especially for uninformative query sets. 2) Compared against the original metrics, our proposed UE-normalized metrics demonstrate an average increase in discriminatory power of 23% and 19% on the MSLR-WEB30K and MQ2007 data sets, respectively. We found similar improvements in terms of consistency as well; for example, UE-normalized MAP decreases the swap rate by 28% when comparing across different data sets and by 26% across different query sets within the same data set. For the text summarization task, we also applied expected value normalization to two widely used metrics, ROUGE and BERTScore. Experiments on the CNN/Daily Mail dataset with 12 different abstractive summarization models demonstrate the following properties of the new expected-value-normalized metrics: 1) Compared against the original metric, our proposed UE-normalized BERTScore shows higher human correlation with respect to four important perspectives (consistency, coherence, relevance, fluency) across the 12 abstractive summarization methods, especially for heterogeneous documents. 2) Human judgment favors UE-normalized BERTScore over the original version across comparisons of 6 extractive summarization methods. For the ROUGE metric, on the other hand, UE normalization does not help much in terms of human correlation with abstractive summarization methods, though it does improve human correlation with extractive summarization methods.
These findings suggest that the IR and NLP communities should seriously consider UE normalization when computing nDCG, MAP, ROUGE, and BERTScore; a more in-depth study of UE normalization for general IR and NLP evaluation is warranted.