British National Corpus The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. More information is available about which BNC interface to use.
American National Corpus A corpus of American English, both spoken and written, which will contain 100 million words when completed.
For corpus-based research, purpose-built corpora are available on Databases A-Z for various languages, including Czech, Dutch, French, Frisian, German and Spanish.
Digital Language Resources in Oxford Find a variety of specialised corpora and tools, including the Linguistic Data Consortium (an open consortium of universities, libraries, corporations and government research laboratories) and the CLARIN European Research Infrastructure Consortium.
CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East-Asian languages.
CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute. For each language, this data set contains detailed information on:
CLARIN contains many annotated resources, mainly but not only in Germanic languages, and tools such as a part-of-speech tagger.
IntelliText The Intelligent Tools for Creating and Analysing Electronic Text Corpora for Humanities Research.
GATE The General Architecture for Text Engineering is free, open-source software for a wide range of computational tasks involving human language.
#LancsBox is a new-generation software package for the analysis of language data and corpora, developed at Lancaster University.
Language Technology Software The Language Technology Group makes available various software packages, often free to researchers.
Wordtree creates an interactive visual representation of corpus concordance data.
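For readers unfamiliar with concordance data, the following is a minimal sketch of a keyword-in-context (KWIC) listing, the kind of output that concordance tools work from. It is not how Wordtree or any of the tools listed here is implemented; the function name `kwic`, the `width` parameter and the sample sentence are all assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: tokenisation is a simple lowercase word split,
    which is much cruder than what real corpus tools do.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token == target:
            # Take up to `width` tokens on either side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Invented sample text, used only to show the shape of the output.
sample = "The cat sat on the mat. The dog chased the cat around the garden."
rows = kwic(sample, "cat", width=2)

for left, word, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12} | {word} | {right}")
```

Each row pairs a hit with its neighbouring words, so patterns of usage around the keyword can be compared line by line.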
Historical Thesaurus of English The University of Glasgow’s Historical Thesaurus of English is a unique resource charting the development of meaning in the huge and varied vocabulary of English. It consists of almost every recorded word in English from Anglo-Saxon times to the present day, all arranged into detailed hierarchies of meaning.
iWeb corpus is a corpus of English-language websites from the UK, the US, Canada, Ireland, Australia and New Zealand for linguistic research. It contains 14 billion words from nearly 95,000 systematically selected websites (22 million webpages). iWeb corpus can be browsed for word frequency lists, collocates, n-grams and full-text data. It can also be searched by individual word, by strings or substrings (e.g. *ism, un*able), or by phrases such as got VERB-ed, from ADJ to ADJ, phrasal verbs, or NOUN NOUN.
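To illustrate what wildcard searches such as *ism and un*able, frequency lists and n-grams mean in practice, here is a small sketch that runs on a text of your own. It does not use or reproduce the iWeb interface; the helper names (`tokenize`, `wildcard_matches`, `ngrams`) and the sample sentence are assumptions, and the standard-library `fnmatch` module stands in for the corpus's own query syntax.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def wildcard_matches(tokens, pattern):
    """Return the tokens matching a wildcard pattern, where * is any string."""
    return [t for t in tokens if fnmatchcase(t, pattern)]


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample text standing in for a real corpus.
sample = ("Realism and idealism are unthinkable for some; "
          "others find tourism unbelievable.")
tokens = tokenize(sample)

freq = Counter(tokens)                      # word frequency list
ism_words = wildcard_matches(tokens, "*ism")
un_able_words = wildcard_matches(tokens, "un*able")
bigram_counts = Counter(ngrams(tokens, 2))  # 2-gram frequencies

print(ism_words)
print(un_able_words)
print(bigram_counts.most_common(3))
```

Part-of-speech patterns such as NOUN NOUN additionally require a tagged corpus, which this sketch does not attempt.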
It is also possible to compile your own corpus from newspaper archives. For English newspapers, use Nexis UK.
For French newspapers, use Retronews (the newspaper archive of the Bibliothèque Nationale de France).
Compile your own corpus from
Oxford Text Archive (OTA) contains literary and linguistic resources for use in Higher Education, in research, teaching and learning.
Tagging and parsing tools
For tagging and parsing texts or corpora, have a look at this list of tools.
A really good guide to creating corpora has been developed by IT Services.