Linguistics: Corpora

A guide to linguistics collections and e-resources
Subjects: Linguistics

Key Resources

Databases A-Z subscription e-resources

British National Corpus The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. Further guidance is available on which BNC interface to use.

 

American National Corpus A corpus of American English, both spoken and written, which will contain 100 million words when completed.

Corpora

For corpus-based research, purpose-built corpora for various languages, including Czech, Dutch, French, Frisian, German and Spanish, are available on Databases A-Z.

Digital Language Resources in Oxford Find a variety of specialised corpora and tools, including the Linguistic Data Consortium (an open consortium of universities, libraries, corporations and government research laboratories) and the CLARIN European Research Infrastructure Consortium.

CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East-Asian languages.

CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute. For each language, this data set contains detailed information on:

  • orthography (variations in spelling, hyphenation)
  • phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
  • morphology (derivational and compositional structure, inflectional paradigms)
  • syntax (word class, word class-specific subcategorizations, argument structures)
  • word frequency (summed word and lemma counts, based on recent and representative text corpora)

CLARIN contains many annotated resources, mainly but not only in Germanic languages, and tools such as a part-of-speech tagger.

ELAR Endangered Language Archive

IntelliText The Intelligent Tools for Creating and Analysing Electronic Text Corpora for Humanities Research.

GATE The General Architecture for Text Engineering is free, open-source software for a wide range of computational tasks involving human language.

#LancsBox is a new-generation software package for the analysis of language data and corpora, developed at Lancaster University.

Language Technology Software The Language Technology Group makes available various software packages, often free to researchers.

Wordtree makes an interactive visual representation of corpus concordance data.

Historical Thesaurus of English The University of Glasgow’s Historical Thesaurus of English is a unique resource charting the development of meaning in the huge and varied vocabulary of English. It consists of almost every recorded word in English from Anglo-Saxon times to the present day, all arranged into detailed hierarchies of meaning.

English corpora

iWeb corpus is a corpus of English-language websites from the UK, US, Canada, Ireland, Australia and New Zealand for linguistic research. It contains 14 billion words from nearly 95,000 systematically selected websites (22 million webpages). The iWeb corpus can be browsed for word frequency lists, collocates, n-grams and full-text data. It can also be searched by individual word, by strings or substrings (e.g. *ism, un*able), or by phrase patterns such as got VERB-ed, from ADJ to ADJ, phrasal verbs, or NOUN NOUN.
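
To give a feel for what frequency lists, n-grams and wildcard searches such as *ism or un*able compute, here is a minimal Python sketch that runs on a small sample text. It is an illustration only and does not reflect how the iWeb interface is implemented. The function names (tokenize, ngrams, wildcard_search) and the sample sentence are invented for this example.

```python
from collections import Counter
from fnmatch import fnmatchcase
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def wildcard_search(tokens, pattern):
    """Return the distinct tokens matching a pattern such as '*ism'."""
    return sorted({t for t in tokens if fnmatchcase(t, pattern)})


# Hypothetical sample text standing in for a real corpus.
sample = ("Realism and idealism are unavoidable topics. "
          "Critics call realism unbelievable and idealism unworkable.")

tokens = tokenize(sample)
frequencies = Counter(tokens)              # word frequency list
bigram_counts = Counter(ngrams(tokens, 2))  # counts of two-word sequences

print(frequencies.most_common(3))
print(bigram_counts.most_common(1))
print(wildcard_search(tokens, "*ism"))
print(wildcard_search(tokens, "un*able"))
```

Part-of-speech patterns such as NOUN NOUN need a tagged corpus, so they are outside the scope of this sketch.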

English corpora (list from BYU) can be found at https://corpus.byu.edu/ (mostly American, also including English and Canadian corpora).

COHA (Corpus of Historical American English), included in iWeb corpus (see above), contains more than 400 million words of text from the 1810s to the 2000s. The corpus is balanced by genre, decade by decade.

Newspaper archives

It is also possible to compile your own corpus from newspaper archives. For English newspapers, use Nexis UK.

For French newspapers, use Retronews (the newspaper archive of the Bibliothèque Nationale de France).

 

Text archives

Compile your own corpus from the following text archives:

Oxford Text Archive (OTA) contains literary and linguistic resources for use in Higher Education, in research, teaching and learning.

Early English Books Online (EEBO), Eighteenth Century Collections Online and Eighteenth Century Journals can also be used as sources. For more corpora, browse OxLip+ by subject for Linguistics/Corpora.

 

Tagging and parsing tools

For tagging and parsing texts or corpora, have a look at this list of tools.

A really good guide to creating corpora has been developed by IT Services.

Subject Guide

Johanneke Sytsema
Contact:
Subject Consultant for Linguistics, Dutch and Frisian
Taylor Institution Library
St Giles'
Oxford
OX1 3NA
01865 (2)78159