
Linguistics: Corpora

A guide to linguistics collections and e-resources
Subjects: Linguistics

Key Resources

OxLip+ subscription e-resources

British National Corpus: a very large corpus (over 100 million words) of modern English, both spoken and written. More information is available on which BNC interface to use.


American National Corpus: a corpus of American English, both spoken and written, which will contain 100 million words when completed.


For corpus-based research, purpose-built corpora are available on OxLip+ for various languages, including Czech, Dutch, French, Frisian, German and Spanish. Browse by subject for Linguistics/Corpora.

Lexical databases and corpora is a non-exhaustive list of databases and corpora, organised by language.

CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East-Asian languages.

CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute.

For each language, this data set contains detailed information on:

  • orthography (variations in spelling, hyphenation)
  • phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
  • morphology (derivational and compositional structure, inflectional paradigms)
  • syntax (word class, word class-specific subcategorizations, argument structures)
  • word frequency (summed word and lemma counts, based on recent and representative text corpora)

You can also compile your own corpus from a collection of texts, e.g. from online newspapers (Nexis) or from older books in EEBO (Early English Books Online), Eighteenth Century Collections Online or Eighteenth Century Journals. For more corpora, browse OxLip+ by subject for Linguistics/Corpora.
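
As a rough illustration of what compiling a small corpus can involve once the texts have been saved as plain text, the following Python sketch tokenises a handful of documents and counts word frequencies. The file names, sample texts and function names are hypothetical, and the tokeniser is deliberately simple (lowercase ASCII letters and apostrophes only); a real project would likely need language-specific tokenisation and must respect the licence terms of the source database.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (ASCII letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_corpus(documents):
    """Map each document name to its list of tokens."""
    return {name: tokenize(text) for name, text in documents.items()}


def word_frequencies(corpus):
    """Sum token counts over every document in the corpus."""
    counts = Counter()
    for tokens in corpus.values():
        counts.update(tokens)
    return counts


# Hypothetical sample texts standing in for downloaded articles.
documents = {
    "article_1.txt": "The corpus is small. The corpus is plain text.",
    "article_2.txt": "A small corpus can still be useful.",
}

corpus = build_corpus(documents)
frequencies = word_frequencies(corpus)

# Show the most frequent word forms across the whole corpus.
print(frequencies.most_common(3))
```

Keeping the tokens per document, rather than merging everything into one list, makes it possible to compare sources later, for example one newspaper against another.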

For tagging and parsing texts and corpora, have a look at this list of tools.

IT Services has developed a very useful guide to creating corpora.

Subject Guide

Johanneke Sytsema
Subject Consultant for Linguistics, Dutch and Frisian
Taylor Institution Library
St Giles'
01865 (2)78159