Robust Rank-based Metric for Evaluating Text Summarization
Type of Degree: PhD Dissertation
Computer Science and Software Engineering
In an age of exponentially growing information, finding effective ways to summarize text is more important than ever. Automated text summarization systems are designed to automate this process. Summarization approaches fall into two broad categories: abstractive and extractive. An abstractive summarization system generates a summary that may not appear verbatim in the original text but captures its essential points. In contrast, extractive summarization builds the summary by selecting core sentences or phrases from the source text. Since human evaluation is impractical due to time and cost constraints, automatic evaluation is essential for effectively assessing the performance of these summarization systems. One of the most popular metrics in this context is the Recall-Oriented Understudy for Gisting Evaluation (ROUGE). ROUGE scores a model summary by its direct lexical overlap with a reference summary. However, it has limited ability to capture semantic similarity, which is problematic in the context of extractive summaries. If the human-written summary contains more novel words than the original document, ROUGE will assign a poor score to extractive summaries because it lacks semantic awareness. Another limitation of ROUGE in the context of extractive summarization is that, while the extractive summarization task is generally framed as a sentence ranking problem, ROUGE was not originally designed to evaluate the quality of a ranker. To address these challenges in evaluating extractive summarization, we introduced a novel metric, Semantic-aware Normalized Cumulative Gain (Sem-nCG). Sem-nCG is semantic-aware and rewards a system-generated summary according to a ground-truth ranking of the sentences in the original document.
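To make the ranking-based idea concrete, the following is a minimal, purely illustrative sketch of normalized cumulative gain over a sentence ranking. The function name, the toy gain values, and the example ranking are all hypothetical; the dissertation's Sem-nCG additionally derives the per-sentence gains from semantic similarity rather than assuming them as given.

```python
def ncg_at_k(predicted_ranking, gains, k):
    """Normalized cumulative gain: reward the top-k predicted sentences
    relative to the best achievable (ideal) top-k under the ground-truth
    per-sentence gains."""
    achieved = sum(gains[i] for i in predicted_ranking[:k])
    ideal = sum(sorted(gains, reverse=True)[:k])
    return achieved / ideal

# Toy example: four source sentences with hypothetical ground-truth gains
# (e.g., semantic similarity to the reference summary).
gains = [0.1, 0.4, 0.3, 0.2]
predicted = [1, 3, 0, 2]                  # system ranks sentence 1 first, then 3, ...
score = ncg_at_k(predicted, gains, k=2)   # (0.4 + 0.2) / (0.4 + 0.3) ≈ 0.857
```

A perfect ranker places the highest-gain sentences first and scores 1.0; any other ordering is penalized in proportion to the gain it leaves on the table.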
However, the initial version of Sem-nCG neither accounts for redundancy in the summaries nor supports evaluation against multiple reference summaries. To address these issues, we enhanced Sem-nCG with a redundancy-aware version and showed how to use the metric when multiple reference summaries are available. Our findings suggest that, compared with conventional metrics, the redundancy-aware Sem-nCG exhibits a stronger correlation with human judgments (a 25% improvement in the coherence dimension and a 10% improvement in the relevance dimension). Finally, we also proposed a rank-based evaluation metric for abstractive summarization. Our findings demonstrate that rank-based evaluation for abstractive summarization is not only feasible but also improves alignment with human judgment, yielding a 41% increase in correlation with human judgments over strong baselines that use recent Large Language Model (LLM)-based embeddings. In summary, this thesis introduces a baseline Sem-nCG metric, an improved redundancy-aware version of Sem-nCG, and a rank-based evaluation approach for abstractive summarization. These metrics can play a vital role in the development and comparison of text summarization models, helping to track progress in the field and to identify state-of-the-art systems.
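One simple way a redundancy discount can interact with rank-based gains is sketched below. This MMR-style penalty, the function name, and the toy similarity matrix are purely illustrative assumptions; the dissertation's actual redundancy-aware formulation is not reproduced here.

```python
def redundancy_aware_gains(ranking, gains, sim):
    """Discount each ranked sentence's gain by its maximum similarity to
    sentences already placed above it (illustrative MMR-style sketch)."""
    adjusted, placed = [], []
    for i in ranking:
        penalty = max((sim[i][j] for j in placed), default=0.0)
        adjusted.append(gains[i] * (1.0 - penalty))
        placed.append(i)
    return adjusted

# Toy example: sentence 1 is half-redundant with sentence 0,
# so its effective gain is halved once sentence 0 is placed first.
gains = [0.4, 0.3]
sim = [[1.0, 0.5],
       [0.5, 1.0]]
adjusted = redundancy_aware_gains([0, 1], gains, sim)   # ≈ [0.4, 0.15]
```

For the multi-reference setting, one natural (and likewise illustrative) aggregation is to combine per-sentence gains across the available reference summaries before normalizing, so that sentences supported by several references earn higher gains.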