Linguistics

National Corpora

  • The [Open] American National Corpus: The Open American National Corpus (OANC) is a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use.
  • The British National Corpus (BNC) : The BNC is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.

Bilingual/Multilingual

Romance Languages

Spain

Written Texts

  • Corpus del Español : More than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s.
  • CORDE Corpus Diacrónico del Español  : data bank that provides a structured set of texts for lexicographic and grammatical research dating from the beginning of the Spanish language until 1974. Includes almost 300 million lexical forms. It is divided into two main groups: fiction and non-fiction texts. Created by Real Academia Española.
  • El Grial Corpus of Spanish: a collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the Pontificia Universidad Católica de Valparaíso, Chile. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized, written/spoken registers and text types (academic, professional, technical, etc.). All documents have been tagged and parsed.
  • El Corpus del Español: Compiled by Mark Davies

Spoken Language

Latin America

Spoken Spanish

  • Hamburg Corpus of Argentinean Spanish (HaCASpa) : Audio and video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers are 18-69 years old and from two geographic areas.
Portuguese Text 
  • Corpus Brasileiro : 1 billion words of contemporary Brazilian Portuguese in a variety of genres and registers.
  • Corpus do Português : More than 45 million words in almost 57,000 Portuguese texts from the 1300s to the 1900s.
Spoken Portuguese
  • Arquivo Dialetal do CLUP (Dialectal Archive of the Center of Linguistics of the University of Porto) : a database of recordings of European Portuguese collected during the last two decades, spanning both Mainland Portugal and islands. 
Bi-directional
  • COMPARA (Portuguese-English Parallel Corpus) : a bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source-texts and translations. 

 

Italian Texts
  • CORIS (CORpus di Italiano Scritto) : reference corpus of written Italian. CORIS contains 130 million words
  • PAISA (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) :  large corpus of authentic contemporary Italian texts from the web. 
  • BoLC (Bononia Legal Corpus) : Italian legal language
Spoken Italian
  • BADIP Banca dati dell'italiano parlato : Database of spoken Italian. Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, and more data are being added continuously.
  • CLIPS  (Corpus of Spoken Italian) : a corpus of spoken Italian (audio files, annotation and documentation). About 100 hours of speech, equally represented by female and male voices. A section of the corpus is transcribed orthographically; a smaller section has been phonetically labeled.
Sicilian

Germanic Languages

 

German Texts
  • The German National Corpus :
  • Deutsche Diachrone Baumbank (DDB): a small (ca. 8000 tokens) deeply syntactically annotated corpus consisting of three subcorpora of different language periods of German (Old High German, Middle High German, Early New High German).
  • BOMP : A machine-readable German pronunciation dictionary
  • RIDGES project (Register in Diachronic German Science) is an investigation into the development of the German scientific language in the early modern and modern periods, ranging from the mid 16th to the late 19th century.
  • German Political Speeches Corpus and Visualization : This corpus consists of speeches by the German Presidents, Chancellors and a few ministers, all gathered from official sources. 
Spoken German
  • Bavarian Archive for Speech Signals : Audio corpora of German speech, including read, spontaneous, and telephone speech. Varieties include regional speech, adolescent speech, and intoxicated speech.
German as a Foreign Language
  • Falko : a freely available error-annotated learner corpus of German as a foreign language
  • KanDeL (Kansas Developmental Learner corpus) : a freely available longitudinal learner corpus of beginning to intermediate learners of German as a foreign language.
  • SUBTLEX-NL : database of Dutch word frequencies based on 44 million words from film and television subtitles.
Spoken Dutch
  • Spoken Dutch Corpus (CGN) : contains 900 hours (approximately 3.3 million words) of Dutch and Flemish speech.

Swedish

Danish

Norwegian

Icelandic

Faroese

 

Languages of Central and Eastern Europe (Slavic and others)

Russian
Ukrainian
Belarusian
  • The Polish National Corpus : a reference corpus of the Polish language containing over 1.5 billion words. The corpus is searchable by means of advanced tools that analyse Polish inflection and Polish sentence structure.
Czech

Czech Texts

  • The Czech National Corpus (Český národní korpus): large electronic corpus of written and spoken Czech.
  • Czech Academic Corpus v.1.0 (Českým akademickým korpuse) : a corpus of the Czech language with manual morphological annotation, consisting of approximately 600,000 words in continuous texts.

Spoken Czech

  • Prague Spoken Corpus : authentic spoken Czech, mainly colloquial and thematically unspecialised, from the Prague area and its surroundings. Covers the four sociolinguistic variables in balanced proportions: the speaker's gender, age, education and type of speech. 304 recordings.
Slovak
  • Slovak National Corpus : database of contemporary Slovak language texts, covering a broad range of language styles.

Middle Eastern Languages

  • Quranic Arabic Corpus : an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology.

 

Persian (Farsi) Texts
  • Hamshahri Corpus (پیکره همشهری‎)  : a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran
  • Bijankhan Corpus (پیکرهٔ بی‌جن‌خان ) : a tagged corpus with 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags. The collection is gathered from daily news and common texts. 

Languages of East Asia

  • Corpus of Spontaneous Japanese (CSJ) : The CSJ corpus contains approximately 650 hours of spontaneous contemporary Japanese speech, recorded between 1999 and 2003. Two kinds of speech, academic presentation speeches (APS) and simulated public speaking (SPS), were the main sources, but some material was also taken from interviews with the subjects about their APS or SPS, and from recordings of the subjects reading short passages aloud.
  • GCML General Corpus of the Modern Mongolian Language : contains 966 texts and 1,155,583 words.

 

Languages of Southeast Asia

  • HSE Thai Corpus : Corpus of modern texts written in the Thai language, collected from various Thai websites (mostly news sites). The texts contain 50 million tokens in total; each token was assigned its English translation and part-of-speech tag.

Languages of Africa

Bantu
  • Comparative Bantu OnLine Dictionary (CBOLD) : A lexicographic database to support the theoretical, descriptive, and historical linguistic study of Bantu languages. The database includes a substantial list of reconstructed Proto-Bantu roots, several thousand additional reconstructed regional roots, and reflexes of these roots for a substantial subset of the 500+ daughter languages.

Sign Language

ASL (American Sign Language)
  • National Center for Sign Language and Gesture Resources (Boston University) : A substantial corpus of American Sign Language (ASL) video data from native signers is being collected and made available. The data include multiple synchronized high-quality video files (available in a variety of formats) showing the signing from different angles as well as a close-up view of the face.

Turkic

Turkish Texts
  • TS Corpus (Turkish Corpus project) : a general-purpose corpus containing 491 million POS-tagged tokens. The TS Corpus V2 is the main Turkish corpus, but the project has also released 10 different corpora with different aims and functionalities, including:
    • TS TimeLine Corpus : a contemporary news and columns corpus of Turkish with over 2.2 million news items and articles spanning 19 years.
    • TS Wikipedia Corpus : compiled from the July 2013 dump of Turkish Wikipedia pages.
    • TweetS Corpus : 
    • Columns Corpus : columns written equally by female and male authors. The corpus covers a 10-year period and allows users to run queries restricted by the author's gender, the date and the source.
    • Abstract Corpus : samples academic writing from various disciplines. The TS Abstract Corpus is an especially useful source for text genre classification studies.
    • Syllable Corpus : features syllable tagging for Turkish. 
    • TS Gezi Corpus : a specialized corpus built from 2,968 articles published by 9 different sources, from both the Turkish and foreign press, during the Taksim Gezi Park protests.
    • Constitution Corpus : texts of the 3 different Turkish constitutions
    • Idioms & Proverbs Corpus : more than 10 thousand Turkish idioms and proverbs
Spoken Turkish