During the academic year 2024/25, the Bodleian Libraries are running a one-year pilot to support Text and Data Mining (TDM) at the University of Oxford. The pilot seeks to highlight the use of Bodleian collections in TDM, promote the support librarians give in this area, and collect evidence about the text and data needs of Bodleian readers. The project is managed by librarians serving the Social Sciences Division, the Humanities Division and the Centre for Digital Scholarship.
This guide is aimed at those Bodleian readers who:
Want to know what Bodleian subscriptions or other collections can be used,
Require advice on what tools are available,
Wish to understand why copyright holders may prevent some collections from being used,
Are already engaged in application of these techniques and require advanced support,
Want to pursue funding for a project that will include TDM.
Contact us via the feedback form, or email John Southall (Social Sciences) or Frank Egerton (Humanities).
This guide should be seen as a work in progress that will expand as the pilot develops.
Text and data mining is a research method that uses computational processes to analyse very large sets of text or structured data and identify patterns and relationships that would otherwise remain unrecognised. It is the continual growth in both computing power and the amount of accessible material - such as the digital Bodleian Libraries subscriptions - that makes TDM possible.
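As a rough illustration of what "computational processes" can mean in practice, the sketch below counts how often each word appears across a tiny corpus and how many documents contain it. The three sample documents and the `tokenize` helper are invented for illustration and are not drawn from any Bodleian collection. A real project would work with far larger collections and more careful text preparation.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = {
    "doc1": "The library holds maps and manuscripts.",
    "doc2": "Manuscripts and maps are digitised by the library.",
    "doc3": "Readers consult digitised manuscripts online.",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


term_counts = Counter()  # total occurrences of each word across the corpus
doc_freq = Counter()     # number of documents in which each word appears

for text in corpus.values():
    tokens = tokenize(text)
    term_counts.update(tokens)
    doc_freq.update(set(tokens))  # set() counts each word once per document

# Print the five most frequent words with their document frequency.
for term, count in term_counts.most_common(5):
    print(f"{term}: {count} occurrences in {doc_freq[term]} documents")
```

Simple counts like these are only a starting point, but the same pattern of reading texts, breaking them into tokens and aggregating the results sits underneath many more sophisticated TDM methods.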
The Bodleian TDM pilot service offers support where:
Contact us for help in using these tools, arranging training or with any other feedback.
Contact us for help in negotiating appropriate access with publishers.
Contact us to discuss support in identifying such resources and ensuring access conditions are suitable for researchers.
A basic step of TDM is to reproduce or copy content from text collections, such as journals, or from databases. The resulting corpus, collection or dataset is then analysed using tools such as Python. However, unless the material in question is out of copyright, or copying is legally permitted, making such copies can potentially be viewed as copyright or database right infringement.
The ‘UK Copyright exception’ to the UK Copyright, Designs and Patents Act 1988 has been an important development in opening up material for TDM analysis, but there are caveats. The intent of TDM must be non-commercial and the copying must take place within the UK. Rights holders and suppliers are entitled to "apply reasonable measures to maintain their network security or stability". TDM extracts unusually large amounts of content, which can be mistaken for a data breach. When such unexpected activity is detected, suppliers may shut down access for a whole institution while the situation is assessed. Ideally, publishers should be contacted in advance and access negotiated. Staff of the Bodleian Libraries have established relationships with publishers that can be an asset when arranging such access.
Researchers should contact us if they have any questions about copyright legislation or require help in negotiating with publishers.
• Text and data mining for non-commercial research - GOV.UK
• Text and data mining - BDS
A number of new TDM products are being considered by the Bodleian Libraries. These will be trialled to allow researchers to try them out. Please pass on feedback and thoughts after using them.
One such product, Constellate, is a text analysis platform that integrates access to scholarly content and open educational resources into a cloud-based lab to help teach text analysis and data literacy skills. Constellate is part of ITHAKA’s portfolio of services, along with resources like JSTOR and Portico.
A Constellate Learn & Evaluate trial has been initiated by the Centre for Digital Scholarship (CDS) as a potential resource for online teaching of coding for the Digital Humanities. To find out more about the platform see their User guide.
However, ITHAKA has decided to sunset Constellate on July 1, 2025. Details of what will be removed, what will remain available and the steps users need to take are here.
To support staff and students who have been exploring Constellate's resources and learning computational analysis skills, ITHAKA plans to extend the university’s full access trial through to June 30, 2025. This includes access to all the classes and resources, and will give users time to transition and to preserve their work.
To summarise, through June 30, 2025:
After June 2025, ITHAKA will continue to support text-mining of JSTOR content through other means, and will provide details in due course.
University of Oxford IT Services' Digital Skills Courses and Resources provide in-person or online training in coding languages commonly used for text and data mining (TDM) - R and Python.