


A corpus is a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures.

Generally, corpora are assembled according to predefined criteria to suit an intended purpose, such as studying linguistic structures, machine translation, or natural language processing. Building a corpus is a time-consuming task.

This guide lists corpora across the world's languages:

The following titles provide a good starting point for students who are new to corpus-based research: