For corpus-based research, purpose-built corpora are available for various languages, including Czech, Dutch, French, Frisian, German and Spanish, on Databases A-Z.
Digital Language Resources in Oxford: find a variety of specialised corpora and tools for Oxford users, including the Linguistic Data Consortium (an open consortium of universities, libraries, corporations and government research laboratories) and the CLARIN European Research Infrastructure Consortium.
CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East-Asian languages.
CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute. For each language, this data set contains detailed information on:
CLARIN (European Research Infrastructure for Language Resources and Technology) contains many annotated resources, mainly but not only in Germanic languages, and tools such as a part-of-speech tagger. Access them via CLARIN Corpora, Lexical Resources and Tools.
It is not possible to list all CLARIN resources here, since they are being updated all the time. This is a selection of resources:
Initially an EU project, this database now contains lemmatized and POS-tagged corpora in many languages, including Chinese, Arabic, European and African languages.
Surrey Lexical Splits Database
The database from the Surrey Morphology Group offers a novel perspective on morphological typology, charting the range of possible deviations from canonical paradigms, and comprises 300 records taken from 50 languages, covering 27 families. Interactive visualisations of fieldwork data from Chichimec (Oto-Manguean, Mexico) and Skolt Saami (Finno-Ugric, Finland) are accessible through https://lexicalsplits.surrey.ac.uk/. The implications of the Chichimec data for morphological theory are reported in https://muse.jhu.edu/article/785538
English corpora
British National Corpus: a very large (over 100 million words) corpus of modern English, both spoken and written. Further guidance is available on which BNC interface to use.
American National Corpus: a corpus of American English, both spoken and written, which will contain 100 million words when completed.
iWeb corpus is a corpus of English-language websites from the UK, the US, Canada, Ireland, Australia and New Zealand for linguistic research. It contains 14 billion words from nearly 95,000 systematically selected websites (22 million webpages). The iWeb corpus can be browsed for word frequency lists, collocates, n-grams and full-text data. It can also be searched by individual word or by phrases, strings or substrings.
A list of English corpora from BYU can be found at https://corpus.byu.edu/ (mostly American, but also including English and Canadian corpora).
COHA (Corpus of Historical American English), included in iWeb corpus (see above), contains more than 400 million words of text from the 1810s to the 2000s. The corpus is balanced by genre decade by decade.
German corpora
Schweizerdeutsches MundartKorpus
Newspaper archives
It is also possible to compile your own corpus from newspaper archives. For English newspapers, use Nexis UK.
For French newspapers, use Retronews (the newspaper archive of the Bibliothèque Nationale de France).
More specific newspaper corpora in various European languages can be accessed through CLARIN newspaper corpora.
Text archives
Compile your own corpus from:
Oxford Text Archive (OTA) contains literary and linguistic resources for use in Higher Education, in research, teaching and learning.
Early English Books Online (EEBO), Eighteenth Century Collections Online or Eighteenth Century Journals. For more corpora, browse Databases A-Z by subject for Linguistics/Corpora.
Register of corpora
RE3data is a global register of research data repositories. Type 'linguistics' in the search box to find over 100 repositories, corpora and grammar databases. The records are tagged for refined searching and linked to the actual resources.
A useful guide to creating corpora has been developed by IT Services.
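Once you have compiled your own corpus from the archives listed above, word frequency lists and n-grams of the kind offered by the iWeb interface can be approximated locally. The following is a minimal Python sketch, not taken from the IT Services guide: the function names, the sample sentence and the crude tokenisation rule (lowercase runs of the letters a-z, with an optional apostrophe part) are all assumptions made for illustration. Real corpora, especially non-English ones, will need a proper tokeniser.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens.

    Assumption: a token is a run of a-z letters, optionally followed by
    an apostrophe and more letters (e.g. "don't"). This is deliberately crude.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_list(items, top=None):
    """Return (item, count) pairs, most frequent first."""
    return Counter(items).most_common(top)


# Hypothetical sample standing in for a self-compiled corpus.
sample = "The cat sat on the mat. The cat slept."

tokens = tokenize(sample)
word_freq = frequency_list(tokens)
bigram_freq = frequency_list(ngrams(tokens, 2))

print(word_freq[:3])
print(bigram_freq[:3])
```

To run this on a real text, replace `sample` with the contents of a file, for example `open("my_corpus.txt", encoding="utf-8").read()`, where the file name is again hypothetical.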