For corpus-based research, purpose-built corpora for various languages, including Czech, Dutch, French, Frisian, German and Spanish, are available via Databases A-Z.
Digital Language Resources in Oxford: find a variety of specialised corpora and tools for Oxford users, including the Linguistic Data Consortium (an open consortium of universities, libraries, corporations and government research laboratories) and the CLARIN European Research Infrastructure Consortium.
CHILDES is a database of spoken child language with browsable transcripts in various languages, including Romance, Scandinavian, Celtic, Germanic and East Asian languages.
CELEX is a lexical database for Dutch, English and German, developed by the Max Planck Institute. For each language, this data set contains detailed information on:
CLARIN (European Research Infrastructure for Language Resources and Technology) contains many annotated resources, mainly but not only in Germanic languages, as well as tools such as a part-of-speech tagger. Access these via CLARIN Corpora, Lexical Resources and Tools.
It is not possible to list all CLARIN resources here, since the collection is updated continually. This is a selection of resources:
Initially an EU project, this database now contains lemmatized and POS-tagged corpora in many languages, including Chinese, Arabic, and European and African languages.
The database from the Surrey Morphology Group offers a novel perspective on morphological typology, charting the range of possible deviations from canonical paradigms. It comprises 300 records taken from 50 languages, covering 27 families. Interactive visualisations of fieldwork data from Chichimec (Oto-Manguean, Mexico) and Skolt Saami (Finno-Ugric, Finland) are accessible through https://lexicalsplits.surrey.ac.uk/. The implications of the Chichimec data for morphological theory are reported in https://muse.jhu.edu/article/785538
British National Corpus: a very large (over 100 million words) corpus of modern English, both spoken and written. More information is available on which BNC interface to use.
American National Corpus: a corpus of American English, both spoken and written, which will contain 100 million words when completed.
The iWeb corpus is a corpus of English-language websites from the UK, the US, Canada, Ireland, Australia and New Zealand for linguistic research. It contains 14 billion words from nearly 95,000 systematically selected websites (22 million webpages). The corpus can be browsed for word frequency lists, collocates, n-grams and full-text data. It can also be searched by individual word or by phrases, strings or substrings.
It is also possible to compile your own corpus from newspaper archives. For English newspapers, use Nexis UK.
For French newspapers, use Retronews (the newspaper archive of the Bibliothèque Nationale de France).
More specific newspaper corpora in various European languages can be accessed through CLARIN newspaper corpora.
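As an illustration of the kinds of query mentioned above for iWeb (word frequency lists, n-grams and collocates), the following is a minimal Python sketch of how they could be computed on a small corpus you have compiled yourself. The sample sentences, the function names and the simple regular-expression tokenizer are assumptions made for illustration only. They are not part of any resource listed here, and real corpora usually call for a dedicated tokenizer and lemmatizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (keeps internal apostrophes)."""
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def ngram_counts(texts, n=2):
    """Count n-grams within each text, never across text boundaries."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def collocates(texts, node, window=2):
    """Count tokens occurring within `window` positions of the word `node`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


# Hypothetical sample corpus; replace with texts you have collected yourself.
sample_texts = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

print(word_frequencies(sample_texts).most_common(3))
print(ngram_counts(sample_texts, n=2).most_common(3))
print(collocates(sample_texts, "sat", window=2).most_common(3))
```

Counting n-grams and collocates per text, rather than over one concatenated string, avoids spurious pairs that span two unrelated articles.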
Compile your own corpus from
Oxford Text Archive (OTA) contains literary and linguistic resources for use in Higher Education, in research, teaching and learning.
Register of corpora
RE3data is a global register of research data repositories. Type 'linguistics' in the search box to find over 100 repositories, corpora and grammar databases. The records are tagged for refined searching and linked to the actual resources.