Oxford LibGuides: English Language: a short guide to online resources: Corpora

Introduction to Corpora

A corpus is a collection of texts or text extracts compiled to serve as a sample of a language or language variety. It consists of texts that have been produced in 'natural contexts' (published books, ordinary conversation, letters, newspapers, lectures, etc.), so it reflects language as it is actually used. A well-composed corpus can be used to answer questions about language use, such as:

Does 'wicked' generally mean 'good' or 'bad'? Has this meaning changed over time? Does the use differ between different kinds of text? Do different (kinds of) speakers use the word in the same way?

A reference corpus (created to be a balanced sample of a language variety) can be used as the basis of comparison between a text/genre and 'standard language'.
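As a rough illustration of this kind of comparison, the sketch below computes how often a word occurs per million words in a text under study and in a reference sample. The two sample texts, the simple tokeniser and the function names are invented for this example; real corpora are far larger, and corpus tools report such normalised frequencies for you.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(word, tokens):
    """Return how often `word` occurs per million tokens."""
    if not tokens:
        return 0.0
    return tokens.count(word) * 1_000_000 / len(tokens)


# Two tiny invented samples standing in for real corpus files.
study_text = "That film was wicked. A wicked good night, really wicked."
reference_text = (
    "The wicked witch lived alone. "
    "The village feared the witch and her cat."
)

study_tokens = tokenize(study_text)
reference_tokens = tokenize(reference_text)

# Normalising by corpus size makes samples of different lengths comparable.
for label, tokens in [("study", study_tokens), ("reference", reference_tokens)]:
    print(label, round(per_million("wicked", tokens)))
```

Because the counts are normalised, a higher figure in the text under study suggests that the word is over-represented there relative to the reference sample, although samples this small prove nothing on their own.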

Specialised corpora can be used to examine or compare different language varieties, such as the language of a particular area, of a certain genre or text type, or of a particular group of language users.

Corpora can be synchronic (covering a single period) or diachronic (covering several time periods), can consist of different media (written or spoken language) and can be composed of different languages.

Annotated corpora have extra information added, usually linguistic information (part-of-speech tags, lemmata) or metadata (information about the material in the corpus, speakers/authors, situation, extra-linguistic information, etc.).
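To make the idea of annotation concrete, here is a minimal sketch of how an annotated sentence might be represented and queried. The `Token` structure, the tag labels and the example sentence are assumptions made for illustration; each real corpus documents its own tagset and file format.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # the form as it appears in the text
    pos: str    # part-of-speech label (illustrative tagset)
    lemma: str  # dictionary headword


# A hand-annotated, invented sentence: "She sings and he sang".
sentence = [
    Token("She", "PRON", "she"),
    Token("sings", "VERB", "sing"),
    Token("and", "CONJ", "and"),
    Token("he", "PRON", "he"),
    Token("sang", "VERB", "sing"),
]


def forms_of(lemma, tokens):
    """Return every word form annotated with the given lemma."""
    return [t.word for t in tokens if t.lemma == lemma]


def words_tagged(pos, tokens):
    """Return every word form carrying the given part-of-speech label."""
    return [t.word for t in tokens if t.pos == pos]


print(forms_of("sing", sentence))
print(words_tagged("PRON", sentence))
```

Searching by lemma finds both 'sings' and 'sang' in one query, which a plain search for the string 'sing' would not.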

There are corpora that can be consulted online, via a custom-built interface, and ones that you explore with stand-alone tools that you install on your computer.

Specialised Corpora

The Old Bailey Corpus
This corpus is based on the Proceedings of the Old Bailey, published from 1674 to 1913. The 2163 volumes contain almost 134 million words. Since the proceedings were taken down in shorthand by scribes in the courtroom, the verbatim passages are arguably as near as we can get to the spoken word of the period. The material thus offers the rare opportunity of analyzing spoken language in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of English.

Oxford Text Archive

The Oxford Text Archive (OTA) holds many useful corpora that are available to download. Examples include:

The Lampeter Corpus of Early Modern English Tracts
Parsed Corpus of Early English Correspondence (PCEEC)
A Corpus of English Dialogues 1560-1760 (CED)
Dictionary of Old English Corpus in Electronic Form (DOEC)
The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose
The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)
Corpus of Early English Correspondence Sampler (CEECS)
The York-Helsinki parsed corpus of Old English poetry (YCOEP)
Anthology of Middle English texts
Complete corpus of Old English

Corpora downloaded from the OTA come as files that must be opened in corpus analysis software; we recommend AntConc. You will need to download AntConc and then load your files into it. The creators of AntConc have produced extensive video guides, and we recommend working through these to understand all the functions before you begin your analysis.

Useful links

British National Corpus (20th century English)
A large corpus of written and spoken (transcribed) material from different genres. Considered a standard reference. Available via several tools:

BYU-BNC (Brigham Young University)
English-Corpora: BNC
Easy-to-use online interface. Good for quick queries (with or without word-class tags), overall frequencies, searches in different written genres and collocations. Results are easy to compare with those from other BYU corpora. To use the BYU-BNC you must first register. To register, you need either to use a computer on campus or to be connected remotely via a proxy server (VPN; see https://www.it.ox.ac.uk/work-remotely). After registering, you will be able to access BYU-BNC remotely, but will need to re-authenticate every 365 days by logging on again on campus.
BNCweb at Lancaster
http://bncweb.lancs.ac.uk/bncwebSignup/
You must register before use. A guide to using BNCweb, created by IT Services, is available to download (see the BNCweb guide listed below).

American English

Corpus of Contemporary American English (COCA)
https://www.english-corpora.org/coca/
Available via Brigham Young University; the interface is the same as for BYU-BNC and COHA.
Corpus of Historical American English (COHA)
https://www.english-corpora.org/coha/
Available via Brigham Young University; the interface is the same as for BYU-BNC and COCA.

Old/Middle English

Dictionary of Old English: Web Corpus
http://tapor.library.utoronto.ca/doecorpus/
Corpus of Middle English Prose and Verse
http://quod.lib.umich.edu/c/cme/

Language of the Internet

iWeb: a corpus of 14 billion words from 22 million web pages
https://www.english-corpora.org/iweb/

BNCweb Guide: a guide to using BNCweb

Make your own Corpora

To make your own corpora, we recommend that you try AntConc.

AntConc is a freeware corpus analysis toolkit for concordancing and text analysis.
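For readers unfamiliar with concordancing, the sketch below shows the basic idea of a keyword-in-context (KWIC) display: every occurrence of a search word is listed with a few words of context on either side. It is not how AntConc itself is implemented; the function name `kwic`, the `width` parameter and the sample text are invented for this illustration.

```python
import re


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words either side of each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the search word with square brackets.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# An invented sample text standing in for a corpus file.
sample = "The wicked witch laughed. It was a wicked good party, said the witch."

for line in kwic(sample, "wicked"):
    print(line)
```

Reading the hits side by side makes it easier to see which senses and collocations of a word occur, which is the kind of question a concordancer such as AntConc is designed to support on full-sized corpora.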