Contributions to Shallow Discourse Parsing : To English and beyond

Abstract: Discourse is a coherent set of sentences where the sequential reading of the sentences yields a sense of accumulation and readers can easily follow why one sentence follows another. A text that lacks coherence will most certainly fail to communicate its intended message and leave the reader puzzled as to why the sentences are presented together. However, formally accounting for the differences between a coherent and a non-coherent text still remains a challenge. Various theories propose that the semantic links that are inferred between sentences/clauses, known as discourse relations, are the building blocks of the discourse that can be connected to one another in various ways to form the discourse structure. This dissertation focuses on the former problem of discovering such discourse relations without aiming to arrive at any structure, a task known as shallow discourse parsing (SDP). Unfortunately, so far, SDP has been almost exclusively performed on the available gold annotations in English, leading to only limited insight into how the existing models would perform  in a low-resource scenario potentially involving any non-English language. The main objective of the current dissertation is to address these shortcomings and help extend SDP to the non-English territory. This aim is pursued through three different threads: (i) investigation of what kind of supervision is minimally required to perform SDP, (ii) construction of multilingual resources annotated at discourse-level, (iii) extension of well-known means to (SDP-wise) low-resource languages. An additional aim is to explore the feasibility of SDP as a probing task to evaluate discourse-level understanding abilities of modern language models is also explored.The dissertation is based on six papers grouped in three themes. The first two papers perform different subtasks of SDP through relatively understudied means. Paper I presents a simplified method to perform explicit discourse relation labeling without any feature-engineering whereas Paper II shows how implicit discourse relation recognition benefits from large amounts of unlabeled text through a novel method for distant supervision. The third and fourth papers describe two novel multilingual discourse resources, TED-MDB (Paper III) and three bilingual discourse connective lexicons (Paper IV). Notably, Ted-MDB is the first parallel corpus annotated for PDTB-style discourse relations covering six non-English languages. Finally, the last two studies directly deal with multilingual discourse parsing where Paper V reports the first results in cross-lingual implicit discourse relation recognition and Paper VI proposes a multilingual benchmark including certain discourse-level tasks that have not been explored in this context before. Overall, the dissertation allows for a more detailed understanding of what is required to extend shallow discourse parsing beyond English. The conventional aspects of traditional supervised approaches are replaced in favor of less knowledge-intensive alternatives which, nevertheless, achieve state-of-the-art performance in their respective settings. Moreover, thanks to the introduction of TED-MDB, cross-lingual SDP is explored in a zero-shot setting for the first time. In sum, the proposed methodologies and the constructed resources are among the earliest steps towards building high-performance multilingual, or non-English monolingual, shallow discourse parsers.

  CLICK HERE TO DOWNLOAD THE WHOLE DISSERTATION. (in PDF format)