
Linguistics: Corpora

A guide to linguistics collections and e-resources
Subjects: Linguistics

Corpora

For corpus-based research, purpose-built corpora for various languages, including Czech, Dutch, French, Frisian, German and Spanish, are available on Databases A-Z.

Digital Language Resources in Oxford: find a variety of specialised corpora and tools for Oxford users, including the Linguistic Data Consortium (an open consortium of universities, libraries, corporations and government research laboratories) and the CLARIN European Research Infrastructure Consortium.

CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East-Asian languages.

CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute. For each language, this data set contains detailed information on:

  • orthography (variations in spelling, hyphenation)
  • phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
  • morphology (derivational and compositional structure, inflectional paradigms)
  • syntax (word class, word class-specific subcategorizations, argument structures)
  • word frequency (summed word and lemma counts, based on recent and representative text corpora)

CLARIN (European Research Infrastructure for Language Resources and Technology) contains many annotated resources, mainly but not only in Germanic languages, and tools such as a part-of-speech tagger. Access them via CLARIN Corpora, Lexical Resources and Tools.

It is not possible to list all CLARIN resources here, since the collection is updated continually. This is a selection of resources:

  • Historical Thesaurus of English: the University of Glasgow’s Historical Thesaurus of English is a unique resource charting the development of meaning in the huge and varied vocabulary of English. It consists of almost every recorded word in English from Anglo-Saxon times to the present day, all arranged into detailed hierarchies of meaning.
  • Oxford Text Archive: a repository of digital texts, language corpora and various other textual and lexical resources.

 

SketchEngine 

Initially an EU project, this database now contains lemmatized and POS-tagged corpora in many languages, including Chinese, Arabic, European and African languages.

 

Surrey Lexical Splits Database

The database from the Surrey Morphology Group offers a novel perspective on morphological typology, charting the range of possible deviations from canonical paradigms. It comprises 300 records taken from 50 languages, covering 27 families. Interactive visualisations of fieldwork data from Chichimec (Oto-Manguean, Mexico) and Skolt Saami (Finno-Ugric, Finland) are accessible through https://lexicalsplits.surrey.ac.uk/. The implications of the Chichimec data for morphological theory are reported in https://muse.jhu.edu/article/785538


 

English corpora

British National Corpus: a very large (over 100 million words) corpus of modern English, both spoken and written. See the further information about which BNC interface to use.

American National Corpus: a corpus of American English, both spoken and written, which will contain 100 million words when completed.

iWeb corpus is a corpus of English-language websites from the UK, US, Canada, Ireland, Australia and New Zealand for linguistic research. It contains 14 billion words from nearly 95,000 systematically selected websites (22 million webpages). The iWeb corpus can be browsed for word frequency lists, collocates, n-grams and full-text data. It can also be searched by individual word or by phrases, strings or substrings.

English corpora (list from BYU) can be found at https://corpus.byu.edu/ (mostly American, also including English and Canadian corpora).

COHA (Corpus of Historical American English), included in iWeb corpus (see above), contains more than 400 million words of text from the 1810s to the 2000s. The corpus is balanced by genre, decade by decade.

German corpora

Schweizerdeutsches MundartKorpus 

Newspaper archives

It is also possible to compile your own corpus from newspaper archives. For English newspapers use Nexis UK.

For French newspapers use Retronews (the newspaper archive of the Bibliothèque Nationale de France).

More specific newspaper corpora in various European languages can be accessed through CLARIN newspaper corpora.

 

Text archives

Compile your own corpus from the following text archives:

Oxford Text Archive (OTA) contains literary and linguistic resources for use in Higher Education, in research, teaching and learning.

Early English Books Online (EEBO), Eighteenth Century Collections Online and Eighteenth Century Journals are further sources of historical texts. For more corpora, browse Databases A-Z by subject for Linguistics/Corpora.

 

Register of corpora 

RE3data is a global register of research data repositories. Type 'linguistics' in the search box to find over 100 repositories, corpora and grammar databases. The records are tagged for refined searching and linked to the actual resources.

Create a corpus

A really good guide to creating corpora has been developed by IT Services.
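
Once you have compiled some texts, word frequency lists and n-grams of the kind offered by the corpus interfaces listed in this guide can be approximated with a few lines of Python. The following is only a minimal sketch and is not taken from the IT Services guide: the sample sentence, the function names and the very simple tokenisation rule (lowercasing and splitting on non-word characters) are all assumptions made for illustration, and real corpus work would normally use a proper tokeniser and tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately simple rule)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text; in practice, read in the contents of your own files.
sample = "The cat sat on the mat. The cat slept."

tokens = tokenize(sample)
word_freq = Counter(tokens)               # word frequency list
bigram_freq = Counter(ngrams(tokens, 2))  # frequency of two-word sequences

print(word_freq.most_common(3))
print(bigram_freq.most_common(1))
```

Changing the second argument of `ngrams` gives longer sequences, for example 3 for trigrams.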