Legal-ES: a set of large scale resources for Spanish legal text processing

  1. Doaa Samy
  2. Jerónimo Arenas-García
  3. David Pérez-Fernández
Proceedings:
Proceedings of the First Workshop on Language Technologies for Government and Public Administration (LT4Gov)
  1. Doaa Samy (ed.)
  2. Jerónimo Arenas García (ed.)
  3. David Pérez Fernández (ed.)

Publisher: European Language Resources Association

Year of publication: 2020

Pages: 32-36

Type: Conference contribution

Abstract

This paper presents work in progress on the development of Legal-ES, a set of resources for Spanish legal text processing that includes a large scale corpus together with word embedding and topic models calculated from it. The corpus consists of over 2000 million words from open public legislative, jurisprudential and administrative texts, drawn from a variety of international, national and regional sources. The corpus is pre-processed and tokenized. Word embeddings are calculated over both the raw text and the lemmatised text, and some topic modelling experiments are carried out on the legislative subset of the corpus, which contains the text of the Spanish Official Bulletin of State (Boletín Oficial del Estado, BOE). Within the framework of the Workshop on Language Technologies for Government and Public Administration (LT4Gov), the paper showcases how Public Data is a valuable input for developing Language Resources. It fits within the second dimension of the workshop, i.e. PublicData4LRs. Legal-ES is the result of an initiative by the team of the Spanish Plan for the Advancement of Language Technologies (Plan TL), which aims to develop resources for the HLT community and to promote intelligent solutions from industry and academia for Public Administration and the Legal Domain.