Modality in spoken Spanish and Japanese: a corpus-based study and automatic annotation

  1. Herrero Zorita, Carlos
Supervised by:
  1. Antonio Moreno Sandoval Supervisor

University of defence: Universidad Autónoma de Madrid

Date of defence: 11 May 2017

Examination committee:
  1. Kayoko Takagi Chair
  2. Mick O'Donnell Secretary
  3. Doaa Samy Member
  4. Paul Rayson Member
  5. Hiroto Ueda Member

Type: Thesis

Abstract

The main aim of this thesis is to automatically find and classify elements that signal modality in Spanish and Japanese sentences, taking into account both theoretical and empirical information. Bringing together disciplines such as typology, logic, corpus linguistics and computational linguistics, it sets out to answer three main questions: (1) What is the best definition and classification of modality for cross-linguistic computational work? (2) How is modality used in spoken Spanish and Japanese, and how are modal markers modified in discourse? (3) How can this information be formalised into a program that can annotate modals automatically in new texts? Modality is viewed from the perspective of logic as a semantic feature that adds necessity or possibility meanings to the predicate, since this proves to be the most suitable approach for this type of study. In both languages, modality is encoded in the sentence by a series of auxiliaries, adverbs, adjectives and grammatical moods. The corpora show how these markers are affected by negation, ellipsis, syntactic separation and ambiguity, all of which the program needs to detect for the sake of precision and recall. The corpora also provide information about modality usage and reveal that it is a feature correlated with the type of communication, probably in relation to social constraints. Monologues yield similar results in both languages, but when interaction takes place the difference is noticeable: in dialogues, necessity values predominate in Spanish, whereas necessity and possibility occur in nearly equal numbers in Japanese. The final result of the thesis is a rule-based program that outputs XML in which modal markers are annotated and classified in the same way for both languages. In the future it will be applied to larger and more varied types of text in order to draw more precise conclusions about both languages.
The program will also be made freely available through an online interface at http://elvira.lllf.uam.es/modtag/mainmodtagger.html, hosted on the Computational Linguistics Laboratory web page of the Universidad Autónoma de Madrid.