From extractive to abstractive summarization: a journey
Research in the field of text summarisation has primarily been dominated by investigations of various sentence extraction techniques, with a significant focus on news articles. In this thesis, we intend to look beyond generic sentence extraction and instead focus on domain-specific summarisation, methods for creating ensembles of multiple extractive summarisation techniques, and using sentence compression as the first step towards abstractive summarisation.

We start by proposing two new datasets for domain-specific summarisation. The first corpus is a collection of court judgements with corresponding handwritten summaries, while the second is a collection of scientific articles from the ACL Anthology. The legal summaries are recall-oriented and semi-extractive, whereas the abstracts of the ACL articles are more precision-oriented and abstractive. Both collections have a reasonable number of article-summary pairs, enabling us to use data-driven techniques. Excluding newswire corpora, where the summaries are usually article headlines, the proposed collections are amongst the largest openly available collections for document summarisation.

Next, we propose a completely data-driven technique for sentence extraction from legal and scientific articles. In both the legal and ACL corpora, the summaries have a predefined format. Hence, it is possible to identify summary-worthy sentences depending on whether they contain certain key phrases. Our proposed approach, based on an attention-based neural network, learns to automatically identify these key phrases from pseudo-labelled data, without requiring any annotation or handcrafted rules. The proposed model outperforms existing baselines and state-of-the-art systems by a large margin.

There are a large number of sentence extraction techniques, none of which guarantee better performance than the others.
As a part of this thesis, we explore whether it is possible to leverage this variance in performance to generate an ensemble of several extractive techniques. In the first model, we study the effect of using multiple sentence similarity scores, ranking algorithms and text representation techniques. We demonstrate that such variations can be used for improving rank aggregation. Using several sentence similarity metrics, with any given ranking algorithm, always generates better abstracts. Next, we propose several content-based aggregation models. Given the variation in performance of extractive techniques across documents, a priori knowledge of which technique would give the best result for a given document would drastically improve the result. In such a case, an oracle ensemble system can be built that chooses the best possible summary for a given document. In the proposed content-based aggregation models, we estimate the probability of a summary being good by looking at the amount of content it shares with other candidate summaries. We present a hypothesis that a good summary will necessarily share more information with another good summary, but not with a bad summary. We build upon this argument to construct several content-based aggregation techniques, achieving a substantial improvement in the ROUGE scores.

Finally, we propose another attention-based neural model for sentence compression. We use a novel context encoder, which helps the network to handle rare but informative terms better. We compare the proposed approach to some sentence compression and abstractive techniques that have been proposed in the past few years. We present our arguments for and against these techniques and build a further roadmap for abstractive summarisation.

We conclude by presenting results on an end-to-end system which performs sentence extraction using standalone summarisation systems as well as their ensembles, and then uses the sentence compression technique to generate the final abstractive summary.
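
The content-based aggregation hypothesis can be illustrated with a minimal sketch. It is not the thesis's actual model: it assumes word-set Jaccard overlap as the measure of shared content and simply picks the candidate summary that agrees most, on average, with the other candidates. The function names (`tokens`, `jaccard`, `agreement_scores`, `select_summary`) are hypothetical and introduced only for this illustration.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a deliberately crude content representation (assumption)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agreement_scores(candidates: List[str]) -> List[float]:
    """Mean overlap of each candidate summary with every other candidate."""
    token_sets = [tokens(c) for c in candidates]
    scores = []
    for i, ti in enumerate(token_sets):
        others = [jaccard(ti, tj) for j, tj in enumerate(token_sets) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def select_summary(candidates: List[str]) -> str:
    """Return the candidate that shares the most content with the rest."""
    if not candidates:
        raise ValueError("at least one candidate summary is required")
    scores = agreement_scores(candidates)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    # Toy candidates standing in for the outputs of different extractive systems.
    candidates = [
        "the court dismissed the appeal",
        "the appeal was dismissed by the court",
        "weather was pleasant that day",
    ]
    print(select_summary(candidates))
```

In this toy setting the off-topic third candidate shares almost nothing with the other two and therefore receives the lowest agreement score. A realistic system would likely replace the word-set overlap with a stronger content measure and turn the scores into probabilities, as the abstract describes.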
- PhD Theses