American English Dialect Recordings: The Center for Applied Linguistics Collection contains 118 hours of recordings documenting North American English dialects, dating from 1900-1999. A few recordings of Canadian speakers are included.
BNC: The British National Corpus is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
CLMETEV: The Corpus of Late Modern English Texts (extended version), compiled by Hendrik De Smet, Department of Linguistics, University of Leuven, Belgium.
COCA: The Corpus of Contemporary American English. Contains over 1 billion words, 1990–2019. Downloadable version is available on Abacus.
COHA: The Corpus of Historical American English. Contains more than 475 million words, 1820s–2010s. Downloadable version is available on Abacus.
EUSTACE: The Edinburgh University Speech Timing Archive and Corpus of English comprises 4,608 sentences spoken by six speakers of British English. The sentences are designed to examine a number of durational effects in speech and are controlled for length and phonetic content.
GloWbE: Global Web-Based English. Contains about 1.9 billion words of text from twenty different countries, 2012-2013. Downloadable version is available on Abacus.
Helsinki Corpus of English Texts is a multi-genre diachronic corpus that includes periodically organized text samples from Old, Middle and Early Modern English.
IDEA: The International Dialects of English Archive. Recordings are principally in English and of native speakers, and include both English-language dialects and English spoken in the accents of other languages.
LEON: Leuven English Old to New compiled by Peter Petré, Department of Linguistics, University of Leuven, Belgium.
NOW: News on the Web. Contains 16.1 billion words from web-based newspapers and magazines, 2010-present. Downloadable version is available on Abacus.
OANC: The Open American National Corpus (OANC) is a massive electronic collection of American English, including texts from all genres and transcripts of spoken data produced since 1990. All data and annotations are fully open and unrestricted for any use.
Penn Parsed Corpora of Historical English. Contains corpora of Middle English (PPCME2), Early Modern English (PPCEME) and Modern British English (PPCMBE2). Downloadable version is available on Abacus.
Rap Almanac: Rap Research Lab’s fully-searchable proprietary database of over 500,000 transcribed rap lyrics and associated metadata.
Strathy Corpus of Canadian English: Product of the Strathy Language Unit at Queen's University. Contains 50 million words from more than 1,100 spoken sources, fiction, magazines, newspapers and academic texts.
TIME: TIME Magazine Corpus. Contains 100 million words, 1923–2006.
YCOE: The York-Toronto-Helsinki Parsed Corpus of Old English Prose is a 1.5 million word syntactically-annotated corpus. Sister corpus to PPCME2 above.
The Oslo Corpus of Tagged Norwegian Texts is based on works of fiction, factual prose and newspapers/magazines. The Bokmål part contains some 18.5 million words, while the Nynorsk part contains about 3.8 million.
Stockholm-Umeå Corpus (SUC) is a collection of various Swedish texts from the 1990s, totaling 1 million words.
Multilingual sources
ARTFL (American and French Research on the Treasury of the French Language): 1880 French texts from the 12th to the 20th centuries, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing.
Written Texts
Spoken Language
Text Archive of the Ancient Sicilian: A corpus of 736 medieval Sicilian texts from the 13th to the 16th centuries.
Czech Academic Corpus (Český akademický korpus): A morphologically and syntactically annotated corpus of the Czech language consisting of approximately 650,000 words in continuous texts.
Croatian Learner Corpus (CroLTeC) contains over 1 million tokens of text collected from learners of Croatian as a second or foreign language.
Croatian web corpus (hrWaC) is a web corpus collected from the .hr top-level domain. The current version of the corpus (v2.0) contains 1.9 billion tokens and is annotated with the lemma, morphosyntax and dependency syntax layers.
The Corpus of Serbian Language (CSL) was compiled from a sample of 11 million words and spans the Serbian language from the 12th century to the contemporary period.
The Corpus of Modern Serbian Language (SrpKor2013) contains 122 million words collected from literary and administrative texts, newspapers, magazines and the web.
Serbian web corpus (srWaC) is a web corpus collected from the .rs top-level domain. The 1.0 version of the corpus contains 894 million tokens and is annotated with the lemma, morphosyntax and dependency syntax layers.
The Corpus of Estonian Literary Language is made up of fiction and newspaper texts published between the 1890s and the 1990s.
The Estonian Reference Corpus consists of full-text written materials divided into multiple sub-corpora.
Hungarian Texts
Spoken Hungarian
Budapesti Szociolingvisztikai Interjú (BUSZI), or Budapest Sociolinguistic Interview, is a large-scale survey that provides reliable data and analyses of the varieties of the Hungarian language spoken in Budapest.
Meadow Mari Corpora contain a corpus of contemporary written literary texts in the Meadow Mari variety, as well as a corpus of social media materials.
Corpus of the Komi Language (Коми кыв корпус / Korpus komi jazyka) is an electronic resource developed by the Interregional Laboratory of Information Support for the Functioning of Finno-Ugric Languages (FU-Lab).
The Komi media collection is an electronic reference information system based on annotated recordings of Komi dialects in video and audio formats, with translations into Russian and English.
The Komi-Zyrian Corpora contain a corpus of contemporary written literary texts in the Komi-Zyrian variety, as well as a corpus of social media materials.
Tunisian Arabic Corpus contains close to 3,000 texts comprising over 1 million words from literary sources, TV, radio and the internet.
Farsdat (Farsi Speech Database) comprises recordings of 300 Iranian speakers representing ten different dialects. 6,000 utterances were segmented and labelled phonetically and phonemically.
Uppsala Persian Corpus (UPC) is a large, tagged and freely available Persian corpus. It is a modified version of the Bijankhan corpus and contains almost 3 million tokens.
Balanced Corpus of Contemporary Written Japanese (BCCWJ): A corpus that attempts to capture the breadth of contemporary written Japanese. It contains extensive samples of modern Japanese texts, chosen to make the corpus as balanced as possible. The data comprises 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Random samples of each genre were taken.
National Institute of the Korean Language Corpus: Dataset with information on the frequency of use of modern Korean language.
Open Korean Corpora: A living document for Korean NLP dataset curation.
HSE Thai Corpus is a corpus of modern texts written in the Thai language, collected from a variety of Thai (news) websites and containing a total of 50 million tokens.