Dutch Language and Literature: Corpora

Corpora and Databases

Linguistic corpora - diachronic corpora - literary corpora - Newspapers

Linguistic corpora for Modern Dutch:

CELEX is a word frequency database of modern Dutch, developed by the Max Planck Institute. Words can be searched in context.

CHILDES is a database of spoken child language with browsable transcripts.

The Alpino Treebank offers a corpus of syntactically annotated sentences totalling over 150,000 words. Each sentence is represented as a tree diagram.

Open SoNaR: SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (including lexicographic) and human language technology (HLT) research and in the development of applications.

The Institute for the Dutch Language has developed a large number of linguistic corpora, accessible here and including:

  • Corpus Hedendaags Nederlands (Corpus of Modern Dutch), developed by the INT (Institute for the Dutch Language), comprises over 800,000 texts from 1814-2013, taken from newspapers, magazines, news bulletins and legal texts.
  • Spoken Dutch Corpus (Corpus Gesproken Nederlands; CGN) is a sound archive consisting of 10 million words of transcribed speech from Dutch and Flemish speakers. It covers various dialects, representing speakers of various ages. The samples were collected between 1998 and 2004.
  • Historic Dictionaries (Old, Middle and Modern Dutch and Modern Frisian)
  • 17th Century Newspaper Corpus (Couranten Corpus): a POS-tagged and lemmatised corpus of 13 17th-century newspapers with an advanced search function.

 

Diachronic and dialect corpora
The Meertens Institute for Dialectology has links to diachronic corpora. The main compilation corpora are

Literary corpora

DBNL (Digitale Bibliotheek voor de Nederlandse Letteren/Digital Library of Dutch Literature) contains primary and secondary literature in full text from Middle Dutch to Modern Dutch.

Delpher contains newspapers, journals, magazines, books and radio bulletins. The newspapers cover 1618 to the present and can be downloaded as PDF or TXT files. It is possible to search all 29,368 newspaper articles at once. In addition, there are 2,480 journals/magazines and over 83,000 books from the 17th to the 20th century, many of which are also available in full text in DBNL. Delpher is developed and maintained by the Royal Library, The Hague.

 

A very useful guide to creating corpora has been developed by the Oxford Text Archive.

Subject Guide

Johanneke Sytsema
Contact:
Subject Consultant for Linguistics, Dutch and Frisian
Taylor Institution Library
St Giles'
Oxford
OX1 3NA
01865 (2)78159
Subjects: Linguistics