Semantic Based Content Search and Content Summarization

  • Georgios Mamakis

    Student thesis: Doctoral Thesis

    Abstract

    Document summarization has long been an intriguing task in computational linguistics. A number of definitions have been proposed in the literature, all of which treat document summarization as a problem of text compression. One of the most complete definitions, by Sparck-Jones, states that "...a summary is a reductive transformation of source text to summary text through content condensation by selection and/or generalisation on what is important in the source...". The importance of document summarization lies not only in presenting information in a shortened form, but also in selecting the most appropriate content to present. Summarization approaches can be categorized along several dimensions. The first is the number of sources from which a summary is produced, which distinguishes single-document from multi-document summarization. The second, which follows from the definition above, concerns what counts as important in the source and what the potential user considers important. This leads to the distinction between generic and query-based (or task-focused) summarization: a generic summarizer extracts information according to the main topics discussed in the document, while a query-based summarizer extracts information in response to simple or more complex questions about the document. The third concerns how the importance of content is determined: through knowledge-rich approaches (supervised and semi-supervised summarization) or knowledge-lean approaches (unsupervised or shallow summarization). The last categorization refers to how the summary is generated, the two main categories being extractive summarization, where sentences from the source are kept unaltered, and abstractive summarization, where sentences are either semantically altered or compressed.
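    The generic, single-document, extractive, knowledge-lean setting described above can be illustrated with a minimal sketch. This is not one of the algorithms developed in the thesis; it is a simple word-frequency baseline, and the function names, the naive sentence splitter, and the averaging score are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=2, stopwords=frozenset()):
    """Return up to max_sentences source sentences, unaltered and in source order."""
    sentences = split_sentences(text)
    # Lowercased word tokens per sentence, minus any caller-supplied stopwords.
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in stopwords]
        for s in sentences
    ]
    # Document-level word frequencies stand in for "what is important".
    freq = Counter(w for words in tokenized for w in words)
    # Score each sentence by the average frequency of its words.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Pick the highest-scoring sentences, then restore their original order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:max_sentences])]


if __name__ == "__main__":
    sample = "Cats chase mice. Cats sleep a lot. The weather is nice today."
    print(extractive_summary(sample, max_sentences=1))
```

    Because the selected sentences are returned verbatim, the output is extractive by construction; an abstractive system would instead rewrite or compress them.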

    The research presented in this thesis introduces novel document summarization approaches, based on the theories of Machine Learning (ML) and Natural Language Processing (NLP), for generic single-document extractive summarization. The motivation for targeting the Greek language came from the lack of Greek summarization systems: only one such system (GreekSum) exists in the literature. The research resulted in: the development of a stemming algorithm for noun and adjective identification, based on a grammatical analysis of the Greek language; the development of a novel statistical classification scheme, initially aimed at document summarization, that is shown to outperform other statistical summarizers such as the Naive Bayes Classifier (NBC) and Language Models (LM); the development of a supervised statistical summarization algorithm based on document classification techniques (Text Classification Assisted Summarization for Greek Language, TCASGL); and the development of a knowledge-lean summarization algorithm (Generic Unsupervised Text Summarization, GUTS) that uses shallow semantic document analysis and statistics. The results demonstrate that the classification algorithm significantly outperforms widely available statistical algorithms, while the ML approach yields results comparable to those of other supervised systems. In addition, GUTS was shown to perform on par with knowledge-rich approaches.

    Date of Award: Oct 2012
    Original language: English
    Supervisor: Andrew Ware

    Keywords

    • Computational linguistics
