Linguistics

Select National Corpora

  • The Open American National Corpus (OANC): Massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use. 
  • The British National Corpus (BNC): A 100-million-word collection with samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late 20th century.
  • The Croatian National Corpus (Hrvatski nacionalni korpus - HNK): Largest and most important corpus of Croatian. Compilation started in 1998 at the Institute of Linguistics, University of Zagreb, and the corpus currently has 216.8 million tokens.
  • The Czech National Corpus (Český národní korpus - ČNK): Includes written contemporary Czech, spontaneous spoken language, a diachronic corpus of historical texts and a parallel corpus of translations from or to 30+ languages.
  • The Digital Dictionary of the German Language (Digitales Wörterbuch der deutschen Sprache - DWDS): Project of the Berlin-Brandenburg Academy of Sciences and Humanities to create a digital lexical system based on large digital text corpora.
  • The Hellenic National Corpus (Εθνικός Θησαυρός Ελληνικής Γλώσσας): Online platform developed by the Institute for Language and Speech Processing / R.C. Athena to offer language material (corpus) and computational tools for its processing.

  • The Hungarian National Corpus (Magyar Nemzeti Szövegtár): Started in 1998 at the Hungarian Academy of Sciences (HAS), the corpus has expanded to include variants of modern Hungarian from beyond the borders of Hungary.
  • The Korean National Corpus: Various written and spoken Korean language corpora developed by the National Institute of Korean Language.
  • The Polish National Corpus (Narodowy Korpus Języka Polskiego - NKJP): Collection of classical literature, daily newspapers, specialist journals, transcripts of conversations, and a variety of short-lived and internet texts with over 1.5 billion words.
  • The Russian National Corpus (Национальный корпус русского языка): Representative collection of texts in Russian, with more than 2 billion tokens, complete with linguistic annotation and search tools.
  • The Slovak National Corpus (Slovenský národný korpus - SNK): Electronic database containing Slovak language texts from 1955 onward and covering a broad range of language styles, genres, areas and regions.

Bilingual/Multilingual

  • Hamburg Centre for Language Corpora (Hamburger Zentrum für Sprachkorpora): Provides access to spoken and written corpora of various Eurasian languages managed and delivered by the Center for Sustainable Research Data Management.
  • Endangered Languages Archive (ELAR): A digital repository for preserving multimedia collections of endangered languages from all over the world.

Germanic Languages

German Texts
  • The German National Corpus assembles various corpora from the 15th century to the present day.
  • The Mannheim German Reference Corpus (Das Deutsche Referenzkorpus – DeReKo) contains corpora of contemporary written German at the IDS - Leibniz-Institut für Deutsche Sprache. 
  • DDB (Deutsche Diachrone Baumbank) is a small (ca. 8000 tokens) deeply syntactically annotated corpus consisting of three sub-corpora (Old High German, Middle High German, Early New High German).
  • RIDGES project (Register in Diachronic German Science) is an investigation into the development of the German scientific language in the early modern and modern periods, ranging from the mid 15th to the 20th century.
  • German Political Speeches Corpus and Visualization consists of speeches by the German Presidents, Chancellors and a few ministers, all gathered from official sources. 
Spoken German
  • Bavarian Archive for Speech Signals: Audio corpora of German speech, including read, spontaneous and telephone speech. Varieties include regional speech, adolescent speech and intoxicated speech.
German as a Foreign Language
  • Falko: A freely available error-annotated learner corpus of German as a foreign language.
  • KanDeL (Kansas Developmental Learner corpus): A freely available longitudinal corpus of beginning to intermediate learners of German as a foreign language.
Dutch Texts
  • SUBTLEX-NL is a database of Dutch word frequencies based on 44 million words from film and television subtitles.
Spoken Dutch
  • Spoken Dutch Corpus (CGN) contains 800 hours and nearly 9 million words of Dutch and Flemish speech.

Danish

Faroese

Icelandic

Norwegian

Swedish

Multilingual sources

  • Corpus of American Nordic Speech is a speech corpus with speakers from the USA and Canada speaking Norwegian and Swedish. There are 268 speakers from 63 places, with more than 774,000 tokens.
  • The LIA Sápmi Corpus is a speech corpus with recordings from 1960 to 1990 of Sami dialects from the northern part of Norway, Finland and Sweden. The corpus contains about 190,000 tokens from 122 speakers in 19 locations.
  • Nordic Dialect Corpus and Syntax Database is a corpus of Norwegian, Swedish, Danish, Faroese, Icelandic and Övdalian spoken languages. It consists of spontaneous speech data with 2.75 million words from a variety of sources recorded in 1998–2015.

 

Romance Languages

  • ARTFL (American and French Research on the Treasury of the French Language): 1880 French texts from the 12th to the 20th centuries, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing.

  • Lexique: A database developed at the University of Paris V that provides information for 140,000 words in the French language.
  • REDAC: A corpus made of the 664,982 articles extracted from the French Wikipedia.

Written Texts

  • Corpus Diacrónico del Español (CORDE) is a data bank created by Real Academia Española that provides a structured set of texts for lexicographical and grammatical research dating from the beginning of the Spanish language until 1974. Includes almost 300 million lexical forms. It is divided into two main groups: fiction and non-fiction texts.
  • El Corpus del Español: Consists of the Historical Corpus that contains more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s; the Web Corpus, with about two billion words of Spanish taken from two million contemporary web pages from 21 different Spanish-speaking countries; and the NOW (News on the Web) Corpus that has about 7.6 billion words from web-based newspapers and magazines in 21 Spanish-speaking countries from 2012 to 2019.
  • El Grial Corpus of Spanish is a collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the Pontificia Universidad Católica de Valparaíso, Chile. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized, written/spoken registers and text types (academic, professional, technical etc). All documents have been tagged and parsed.

Spoken Language

Portuguese Text 
  • O Corpus Brasileiro: 1 billion words of contemporary Brazilian Portuguese in a variety of genres and registers.
  • O Corpus do Português: Consists of the Historical Corpus that contains more than 45 million words in almost 57,000 Portuguese texts from the 1300s to the 1900s; the Web Corpus with about 1 billion words of Portuguese taken from 1 million web pages from four Portuguese-speaking countries (Brazil, Portugal, Angola and Mozambique); and the NOW (News on the Web) Corpus that has about 1.1 billion words from web-based newspapers and magazines in four Portuguese-speaking countries from 2012 to 2019.
Spoken Portuguese
  • Arquivo Dialetal do CLUP (Dialectal Archive of the Center of Linguistics of the University of Porto): A database of recordings of European Portuguese collected during the last two decades, spanning both Mainland Portugal and islands. 
Bi-directional
  • COMPARA (Portuguese-English Parallel Corpus): A bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source-texts and translations. 

 

Italian Texts
  • CORIS (CORpus di Italiano Scritto): A reference corpus of written Italian, CORIS contains 150 million words.
  • DiaCORIS (Diachronic Corpus of Written Italian): Project that aims at the construction of a diachronic corpus comprising written Italian texts produced between 1861 and 1945, extending the structure and possibilities of CORIS.
  • BoLC (Bononia Legal Corpus): A multilingual comparable legal corpus developed at the University of Bologna.
  • PAISA (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati): A large corpus of authentic contemporary Italian texts from the web.
Spoken Italian
  • BADIP (Banca dati dell'italiano parlato): Database of spoken Italian. Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, and more data are being added continuously. Deactivated as of 2019.
  • CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto) is a corpus of spoken and written Italian. It contains about 100 hours of speech, equally represented by female and male voices. A section of the corpus is transcribed orthographically and a smaller section has been phonetically labeled.
Sicilian

Slavic Languages

Russian
  • Russian National Corpus (RNC): A corpus of the modern Russian language that incorporates about 1.5 billion tokens. 
  • General Internet-Corpus of Russian (GICR): A megacorpus of tagged texts from the Russian Internet, including news sites, VKontakte, LiveJournal and Mail.ru blogs.
  • Collection of Russian Corpora: A tool developed at the Centre for Translation Studies, University of Leeds, that allows convenient searches of multiple corpora (RNC, Russian Newspapers, Russian Internet Corpus etc.) using a single interface.
  • Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students in different universities and academic disciplines, from economics and sociology to mathematics and philosophy. Currently, the size of the corpus is about 3.1 million tokens.
  • Russian Learner Corpus (RLC) comprises texts produced by learners of Russian as a foreign language and speakers of heritage Russian. The corpus contains both oral and written production.
Ukrainian
Belarusian
  • Belarusian N-korpus (Беларускі N-корпус): Corpus of texts in the modern Belarusian language with structural and grammatical marking and certification. The volume of the corpus is about 177 million words.
  • The Polish National Corpus: A reference corpus of Polish language containing over 1.5 billion words. The corpus is searchable by means of advanced tools that analyze Polish inflection and Polish sentence structure.
Czech Texts
  • The Czech National Corpus (Český národní korpus): Large electronic corpus of written and spoken Czech with more than 4 billion tokens.
  • Czech Academic Corpus (Český akademický korpus): A morphologically and syntactically annotated corpus of the Czech language consisting of approximately 650,000 words in continuous texts.

Spoken Czech
  • Prague DaTabase of Spoken Czech (PDTSC): Authentic spoken Czech from the Prague area and its surroundings with over 7,000 minutes of spontaneous speech recorded during several related projects.
Slovak
  • Slovak National Corpus: An electronic database of contemporary Slovak language texts (1955-present), covering a broad range of language styles, genres, areas, regions etc.
  • Corpus of Slovene Language (FIDA) contains just over 100 million words of Slovene texts found in the press, the Internet and speech transcripts from the late 1990s.
  • Slovenian National Corpus (Gigafida) aka Korpus pisne standardne slovenščine is an upgrade of the older FIDA corpus that currently contains over 1 billion words.
  • Slovene web corpus (slWaC) is a web corpus collected from the .si top-level domain. The current version of the corpus (v2.0) contains 1.2 billion tokens and is annotated with the lemma and the morphosyntax layer.

Bosnian

  • Bosnian web corpus (bsWaC) is a web corpus collected from the .ba top-level domain. The 1.0 version of the corpus contains 429 million tokens and is annotated with the lemma, morphosyntax and dependency syntax layers.
  • The Oslo Corpus of Bosnian Texts compiled at the University of Oslo consists of approximately 1.5 million words from various texts, including fiction, legal texts, newspapers and journals, mostly published in the 1990s.

Croatian

  • Croatian National Corpus (Hrvatski nacionalni korpus, HNK): A corpus of modern Croatian with over 2.5 million words.
  • Croatian Learner Corpus (CroLTeC) contains texts collected from learners of Croatian as a second and foreign language with over 1 million tokens.

  • Croatian web corpus (hrWaC) is a web corpus collected from the .hr top-level domain. The current version of the corpus (v2.0) contains 1.9 billion tokens and is annotated with the lemma, morphosyntax and dependency syntax layers.

Serbian

Uralic Languages

Estonian

Finnish

Veps-Karelian

Hungarian Texts

Spoken Hungarian

  • Beszélt nyelvi Adatbázis (BEA) or Spoken Language Database is a phonetically grounded multifunctional database of spontaneous Hungarian speech. It contains data from nearly 500 speakers, representative in age, sex, educational background and dialect in the central region of Hungary.
  • Budapesti Szociolingvisztikai Interjú (BUSZI) or Budapest Sociolinguistic Interview is a large-scale survey that provides reliable data and analyses of the varieties of the Hungarian language spoken in Budapest.

  • Mari Language is a digital collection of resources in and about Mari language varieties developed by researchers at the University of Vienna in collaboration with indigenous scholars and native speakers.
  • Meadow Mari Corpora contain the corpus of contemporary written literary texts in the Meadow Mari variety, as well as the corpus of social media materials.

Erzya

  • Erzya Corpora contain the corpus of contemporary written literary Erzya and the corpus of social media and forums in the Erzya language.

Moksha

  • Moksha Corpora contain the corpus of contemporary written literary Moksha and the corpus of social media and forums in the Moksha language.
  • Ob-Ugric Languages (OUL) is a collection of online descriptive resources, including text corpora, of two related and endangered Ob-Ugric languages: Khanty (Ostyak) - Kazym and Surgut dialects - and Mansi (Vogul) - Northern and Eastern dialects.
  • Ob-Ugric Database (OUB) is a project related to OUL with analysed text corpora and dictionaries for less described Ob-Ugric dialects: Yugan Khanty and Western Mansi.

Komi

Udmurt

  • Udmurt Corpora contain the corpus of contemporary written literary Udmurt, the corpus of Udmurt-language social media and the sound-aligned corpus of Udmurt dialects.

Kamas

  • The INEL Kamas corpus has been created within the long-term INEL project. It consists of two parts: folklore texts collected by Kai Donner in 1912–1914, and transcribed audio recordings of the now-extinct Kamas language's last speakers made between 1964 and 1970.

Nganasan

  • The Nganasan Spoken Language Corpus (NSLC) includes 177 communications, 136 of which contain an aligned audio recording, with glossed and annotated transcripts from 57 speakers. All texts have been translated into Russian and English, some also into German.

Selkup

  • The INEL Selkup Corpus has been created within the long-term INEL project. It is composed of archival material from between 1962 and 1977.
  • The Selkup Language Corpus (SLC) contains 144 communications with glossed and annotated transcripts from 53 speakers. All texts have been translated into English and German; most of them are also available in Russian.

Middle Eastern Languages

  • Quranic Arabic Corpus: An annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology.
  • Arabic Learner Corpus (ALC) aims to provide a collection of written and spoken materials produced by learners of Arabic in Saudi Arabia. It includes almost 300,000 words produced by close to 1,000 students.
  • Leeds Arabic Corpora: A tool developed at the Centre for Translation Studies, University of Leeds, that allows convenient searches of multiple corpora (Al Hayat News, Arabic Wikipedia etc.) using a single interface.
  • Tunisian Arabic Corpus contains close to 3,000 texts comprising over 1 million words from literary sources, TV, radio and the internet.

 

  • Farsdat (Farsi Speech Database) comprises recordings of 300 Iranian speakers representing ten different dialects. 6,000 utterances were segmented and labelled phonetically and phonemically.

  • Uppsala Persian Corpus (UPC) is a large, tagged and freely available Persian corpus. It is a modified version of the Bijankhan corpus and contains almost 3 million tokens.

Languages of East Asia

Text / Written Corpora
  • Balanced Corpus of Contemporary Written Japanese (BCCWJ): BCCWJ is a corpus that attempts to grasp the breadth of contemporary written Japanese. It contains extensive samples of modern Japanese texts in order to create as balanced a corpus as possible. The data comprise 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Random samples of each genre were taken.

Spoken Corpora
  • Corpus of Spontaneous Japanese (CSJ): The CSJ corpus contains approximately 650 hours of spontaneous contemporary Japanese speech recorded between 1999 and 2003. Two kinds of speech, academic presentation speeches (APS) and simulated public speaking (SPS), were the main sources, but some material was also taken from interviews with the subjects about their APS or SPS, and from recordings of the subjects reading short passages aloud.
  • Japanese Speech Corpora of Major City Dialects consists of a series of recordings made in the 1990s.
     

 

  • Classical Tibetan Corpus consists of a small number of Classical Tibetan texts that were linguistically analyzed and annotated.
  • Tibetan Corpus contains audio and video recordings of Tibetan native speakers.

Languages of Southeast Asia

  • HSE Thai Corpus is a corpus of modern texts written in the Thai language that were collected from a variety of Thai (news) websites and contain a total of 50 million tokens.

  • SEAlang Library Tagalog Text Corpus contains more than 2 million words taken from various internet sources, as well as the Ramos Tagalog-English Dictionary and the Tagalog Literary Text collection.
  • Tagalog Corpus contains a number of one-hour recordings of interactions between different groups of Tagalog-speaking adults (Adult language corpus), and children and their guardians (Child language corpus).
  • UP Filipino Language Corpus (UP-FLC) is a work in progress to create a corpus of standardized written and spoken texts.

Languages of Africa

  • Comparative Bantu OnLine Dictionary (CBOLD): A lexicographic database to support the theoretical, descriptive and historical linguistic study of Bantu languages. The database includes a substantial list of reconstructed Proto-Bantu roots, several thousand additional reconstructed regional roots and reflexes of these roots for a substantial subset of the 500+ daughter languages.

Sign Languages

ASL (American Sign Language)
  • National Center for Sign Language and Gesture Resources (Boston University): A substantial corpus of American Sign Language (ASL) video data from native signers is being collected and made available. It includes multiple synchronized high-quality video files (available in a variety of formats) that show the signing from different angles, as well as a close-up view of the face.