Bilingual Parallel Corpora for Linguistic Research

9 pages. Published: November 28, 2016


This paper reflects on the specific needs of linguistic research regarding the construction of bilingual parallel corpora, and primarily on the conclusions to be drawn for their design, compilation and domains. A research group at the university in Santiago is currently building a bilingual parallel corpus (Corpus PaGeS) consisting of original texts in German and Spanish together with their translations into the other language, as well as German and Spanish translations from a third language. This corpus was originally intended for linguistic research purposes, specifically the analysis of the expression of spatial relations. First, a brief survey of some significant existing related corpora is presented, and their limitations for linguistic studies are outlined. The issues taken into account in the design of the corpus are then explained, such as type of texts, domains, regional language variety, and quality and direction of translations. After describing the manual preparation of the texts to make the documents suitable for further processing, the paper explains the manual and automatic annotation procedure: the metadata and the automatic linguistic annotation. Then the process of sentence alignment and the manual review of the alignment are described, and finally the next steps of future work are outlined.

Keyphrases: annotation, bilingual corpora, contrastive linguistics, corpus design, parallel corpora

In: Antonio Moreno Ortiz and Chantal Pérez-Hernández (editors). CILC2016. 8th International Conference on Corpus Linguistics, vol. 1, pages 88-96

