Oxford LibGuides: Dutch Language and Literature: Corpora

Key Resources

Corpora and Databases

Linguistic corpora - diachronic corpora - literary corpora - Newspapers

Linguistic corpora for Dutch:

CELEX is a word frequency database of modern Dutch, developed by the Max Planck Institute. Words can be searched in context.

CHILDES is a database of spoken child language with browsable transcripts.

The Alpino Treebank offers a corpus of syntactically annotated sentences, totalling over 150,000 words. Each sentence is represented as a tree diagram.

OpenSoNaR: SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT (human language technology) research and in the development of applications.

The Institute for the Dutch Language has developed a large number of linguistic corpora, including:

Corpus Hedendaags Nederlands (Corpus of Modern Dutch), developed by the INT (Institute for the Dutch Language), comprises over 800,000 texts from 1814-2013, taken from newspapers, magazines, news bulletins and legal texts.
Spoken Dutch Corpus (Corpus Gesproken Nederlands; CGN) is a sound archive consisting of 10 million words of transcribed speech from Dutch and Flemish speakers. It covers various dialects, representing speakers of various ages. The samples were collected between 1998 and 2004.
Historic Dictionaries (Old, Middle and Modern Dutch and Modern Frisian)
17th Century Newspaper Corpus (Couranten Corpus): a POS-tagged and lemmatised corpus of thirteen 17th-century newspapers with an advanced search function.

Diachronic and dialect corpora

The Gesproken Corpus van de zuidelijk-Nederlandse Dialecten (GCND) (Spoken Corpus of Southern Dutch Dialects) of the University of Ghent is a pilot corpus tagged for parts of speech, based on recordings from 1960-1980.

The Meertens Institute for Dialectology has links to diachronic corpora. The main compilation corpora are:

Compilatiecorpus Historisch Nederlands 1.0 (CHN): ambtelijke teksten 1250-1800 (historical Dutch non-literary texts 1250-1800)
Compilatiecorpus Historisch Nederlands 1.0 (CHN): narratieve teksten 1575-2000 (historical Dutch narrative texts 1575-2000)

Historical Corpus of Dutch (HCD): a fully searchable, representative corpus of administrative texts, ego-documents and pamphlets from four regions of the Netherlands (Holland, Zeeland, Brabant, Flanders), covering the 16th to the 19th century. (Developed by the Institute for the Dutch Language.)

Literary corpora

DBNL (Digitale Bibliotheek voor de Nederlandse Letteren/Digital Library of Dutch Literature) contains primary and secondary literature in full text from Middle Dutch to Modern Dutch.

Delpher contains newspapers, journals, magazines, books and radio bulletins. The newspapers cover 1618 to the present and can be downloaded as PDF or TXT files. It is possible to search all 29,368 newspaper articles at once. In addition, there are 2,480 journals/magazines and over 83,000 books from the 17th to the 20th century, many of which are also available in full text in DBNL. Delpher is developed and maintained by the Royal Library, The Hague.

A useful guide to creating corpora has been developed by the Oxford Text Archive.

Subject Guide

Johanneke Sytsema

Contact:

Subject Consultant for Linguistics, Dutch and Frisian
Taylor Institution Library
St Giles'
Oxford
OX1 3NA

01865 (2)78159

Subjects: Linguistics