Linguistic corpora - diachronic corpora - literary corpora - Newspapers
Linguistic corpora for Modern Dutch:
CELEX is a word frequency database of Modern Dutch, developed by the Max Planck Institute. Words can be searched in context.
CHILDES is a database of spoken child language with browsable transcripts.
The Alpino Treebank offers a corpus of syntactically annotated sentences totalling over 150,000 words. Each sentence is represented as a tree diagram.
OpenSoNaR: SoNaR is a 500-million-word reference corpus of contemporary written Dutch, intended for various types of linguistic (including lexicographic) research, human language technology (HLT) research, and the development of applications.
The Institute for the Dutch Language has developed a large number of linguistic corpora, accessible here and including:
Diachronic and dialect corpora
The Meertens Institute for Dialectology has links to diachronic corpora. The main compilation corpora are
Literary corpora
DBNL (Digitale Bibliotheek voor de Nederlandse Letteren/Digital Library of Dutch Literature) contains primary and secondary literature in full text from Middle Dutch to Modern Dutch.
Delpher contains newspapers, journals, magazines, books and radio bulletins. The newspapers cover 1618 to the present and can be downloaded as PDF or TXT files. It is possible to search all 29,368 newspaper articles at once. In addition, there are 2,480 journals/magazines and over 83,000 books from the 17th to the 20th century, many of which are also available in full text in DBNL. Delpher is developed and maintained by the Royal Library, The Hague.
A very good guide to creating corpora has been developed by the Oxford Text Archive.