Are the existing training corpora unnecessarily large?

  1. Ballesteros, Miguel
  2. Herrera, Jesús
  3. Francisco, Virginia
  4. Gervás Gómez-Navarro, Pablo
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2012

Issue: 48

Pages: 21-27

Type: Article


Abstract

This paper addresses the problem of optimizing treebank data for training, since the size and quality of such data have always been a bottleneck. In previous studies we found that the corpora currently used to train machine-learning-based dependency parsers contain a significant proportion of redundant information at the level of syntactic structure. Because developing such training corpora requires considerable effort, we argue that an appropriate process for selecting the sentences to include can yield parsing models as accurate as those trained on larger, non-optimized corpora (or, alternatively, higher accuracy for an equivalent annotation effort). This argument is supported by the results of the study presented in this paper, which show that the training corpora contain more information than is needed to train accurate data-driven dependency parsers.

Sponsor(s): This research is funded by the Spanish Ministry of Education and Science (TIN2009-14659-C03-01 Project), Univers

References

  • Abeillé, Anne, editor. 2003. Treebanks: Building and Using Parsed Corpora, volume 20 of Text, Speech and Language Technology. Kluwer Academic Publishers, Dordrecht.
  • Afonso, Susana, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: A treebank for Portuguese. In LREC 2002.
  • Ballesteros, Miguel, Jesús Herrera, Virginia Francisco, and Pablo Gervás. 2010. Improving Parsing Accuracy for Spanish using Maltparser. SEPLN, 44:83-90, 05/2010.
  • Böhmová, A., J. Hajič, E. Hajičová, and B. Hladká. 2003. The PDT: a 3-level annotation scenario. In Abeillé (Abeillé, 2003), chapter 7.
  • Bosco, Cristina, Simonetta Montemagni, Alessandro Mazzei, Vincenzo Lombardo, Felice dell'Orletta, Alessandro Lenci, Leonardo Lesmo, Giuseppe Attardi, Maria Simi, Alberto Lavelli, Johan Hall, Jens Nilsson, and Joakim Nivre. 2010. Comparing the influence of different treebank annotations on dependency parsing. In LREC.
  • Brants, Sabine, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories (TLT).
  • Buchholz, Sabine and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X '06: Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149-164, Morristown, NJ, USA. Association for Computational Linguistics.
  • Chen, Keh-Jiann, Chi-Ching Luo, Ming-Chung Chang, Feng-Yi Chen, Chao-Jan Chen, Chu-Ren Huang, and Zhao-Ming Gao. 2003. Sinica treebank: Design criteria, representational issues and implementation. In Abeillé (Abeillé, 2003), chapter 13, pages 231-248.
  • Džeroski, S., T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtský, and A. Žele. 2006. Towards a Slovene dependency treebank. In Proc. Int. Conf. on Language Resources and Evaluation (LREC).
  • Hajič, Jan, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic dependency treebank: Development in data and tools. pages 110-117.
  • Herrera, J. and P. Gervás. 2008. Towards a Dependency Parser for Greek Using a Small Training Data Set. Journal of the Spanish Society for Natural Language Processing (SEPLN), 41:29-36.
  • Kawata, Y. and J. Bartels. 2000. Stylebook for the Japanese treebank in VERBMOBIL. Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen.
  • Kromann, Matthias T. 2003. The Danish dependency treebank and the underlying linguistic theory. Växjö, Sweden.
  • Nilsson, Jens, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. In Proc. of the NODALIDA Special Session on Treebanks.
  • Nivre, Joakim, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95-135.
  • Oflazer, Kemal, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür. 2003. Building a Turkish treebank. In Abeillé (Abeillé, 2003), chapter 15.
  • Palomar, M., M. Civit, A. Díaz, L. Moreno, E. Bisbal, M. Aranzabe, A. Ageno, M.A. Martí, and Navarro. 2004. 3LB: Construcción de una base de datos de árboles sintáctico-semánticos para el catalán, euskera y español. In Proceedings of the XX Conference of the Spanish Society for Natural Language Processing (SEPLN), pages 81-88. Sociedad Española para el Procesamiento del Lenguaje Natural.
  • Prokopidis, P., E. Desypri, M. Koutsombogera, H. Papageorgiou, and S. Piperidis. 2005. Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank. In Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain, pages 149-160.
  • Simov, Kiril, Petya Osenova, Alexander Simov, and Milen Kouylekov. 2005. Design and implementation of the Bulgarian HPSG-based treebank. Journal of Research on Language and Computation - Special Issue, 2(4):495-522, December.
  • van der Beek, Leonoor, Gosse Bouma, Robert Malouf, and Gertjan van Noord. 2002. The Alpino dependency treebank. In Computational Linguistics in the Netherlands (CLIN). Rodopi.