Linguistics

Written Texts

Corpus Diacrónico del Español (CORDE) is a data bank created by the Real Academia Española. It provides a structured set of texts, dating from the beginnings of the Spanish language to 1974, for lexicographical and grammatical research. It includes almost 300 million lexical forms and is divided into two main groups: fiction and non-fiction texts.
El Corpus del Español consists of the Historical Corpus, which contains more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s; the Web Corpus, with about two billion words of Spanish taken from two million contemporary web pages from 21 different Spanish-speaking countries; and the NOW (News on the Web) Corpus, which has about 7.6 billion words from web-based newspapers and magazines in 21 Spanish-speaking countries from 2012 to 2019.
El Grial Corpus of Spanish is a collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the Pontificia Universidad Católica de Valparaíso, Chile. The corpora were collected under specific methodological principles that distinguish specialized from non-specialized registers, written from spoken registers, and text types (academic, professional, technical, etc.). All documents have been tagged and parsed.

Spoken Language

Corpus Oral de Lenguaje Adolescente (COLA) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile.
El Corpus Oral y Sonoro del Español Rural (COSER) is made up of recordings of spoken language in rural enclaves of the Iberian Peninsula.
Hamburg Corpus of Argentinean Spanish (HaCASpa) consists of audio recordings of experimental, read and spontaneous speech from 60 adult speakers in Buenos Aires (Porteño dialect) and the Neuquén/Comahue area (Northern Patagonia).

Multilingual