Text and Data Mining pilot project: Home

Social Sciences Division and Humanities

Purpose of this guide

During the academic year 2024/25, the Bodleian Libraries are running a one-year pilot to support Text and Data Mining (TDM) at the University of Oxford. The pilot seeks to highlight the use of Bodleian collections in TDM, promote the support given by librarians in this area, and collect evidence concerning the text and data needs of Bodleian readers. The project is managed by librarians serving the Social Sciences Division, Humanities Division and the Centre for Digital Scholarship.

This guide is aimed at those Bodleian readers who:

  • Want to know what Bodleian subscriptions or other collections can be used, 

  • Require advice on what tools are available, 

  • Wish to understand why copyright holders may prevent some collections from being used, 

  • Are already applying these techniques and require advanced support, 

  • Want to pursue funding for a project that will include TDM.  

Contact us via the feedback form, or email John Southall (Social Sciences) or Frank Egerton (Humanities).

This guide should be seen as a work in progress that will expand as the pilot develops.

TDM with Bodleian Collections and Subscriptions

Text and data mining is a research method that uses computational processes to analyse very large sets of text or structured data and identify patterns and relationships that would otherwise remain unrecognised. It is the continual growth in both computing power and the amount of accessible material - such as the digital Bodleian Libraries subscriptions - that makes TDM possible.

The Bodleian TDM pilot service offers support where:

  1. Researchers want access to, or guidance on, the key TDM tools currently available through the Bodleian Libraries. These give access to content and to a range of specially developed tools (including APIs) aimed both at those experienced in coding and at those looking for code-free solutions. The following are currently available through the Bodleian Libraries, and provide tools such as workbenches, code-free dashboards and tutorials:

Contact us for help in using these tools, arranging training or with any other feedback.

  2. Researchers want to create a mineable collection using Bodleian e-journal subscriptions. Copyright is still held by the publisher, but the subscription licence includes TDM permissions or recognises the UK Copyright exception. Mining must still be carried out within the technical framework the publisher applies to maintain its network security and stability. Large-scale downloading may be mistaken for malicious activity and lead to termination of access.

Contact us for help in negotiating appropriate access with publishers.

  3. Researchers have funding and can pay for project-level, rather than institution-wide, access to content for the purposes of text and data mining. Some data mining platforms require predefined research teams or projects as part of usage. These will not be suitable for institution-wide access or funding by the Bodleian Libraries but are ideal for projects developed as part of a funding application.

Contact us to discuss support in identifying such resources and ensuring access conditions are suitable for researchers.

Legislation relating to TDM

A basic step of TDM is to reproduce or copy content from text collections, such as journals, or from databases. The resulting corpus, collection or dataset is then analysed using tools such as Python. However, unless the material in question is out of copyright, or copying is legally permitted, making such copies can potentially be viewed as copyright or database right infringement.
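As an illustration of the analysis step only, the following is a minimal sketch of counting word frequencies across a small corpus in Python. The function names (`tokenise`, `term_frequencies`) and the two sample texts are hypothetical and are not taken from any Bodleian tool or collection; a real project would work on lawfully obtained content and would probably need more careful tokenisation.

```python
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(corpus):
    """Count how often each token occurs across every document in the corpus.

    `corpus` is assumed to be a dict mapping a document id to its full text.
    """
    counts = Counter()
    for document in corpus.values():
        counts.update(tokenise(document))
    return counts


# Hypothetical stand-in corpus; real material must be out of copyright
# or copied under a licence or exception that permits TDM.
sample_corpus = {
    "doc1": "Text mining finds patterns in text.",
    "doc2": "Data mining finds patterns in data.",
}

frequencies = term_frequencies(sample_corpus)
print(frequencies.most_common(3))
```

The same pattern extends to larger collections by loading each document from disk rather than holding the texts in the script.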

The ‘UK Copyright exception’ to the UK Copyright, Designs and Patents Act 1988 has been an important development in opening up material for TDM analysis, but there are caveats. The intent of TDM must be non-commercial and the copying must take place within the UK. Rights holders and suppliers are entitled to "apply reasonable measures to maintain their network security or stability". TDM extracts unusually large amounts of content, which can be mistaken for a data breach. When such unexpected activity is detected, suppliers may shut down access to a whole institution whilst the situation is assessed. Ideally, publishers should be contacted in advance and access negotiated. Staff of the Bodleian Libraries have established relationships with publishers that can be an asset when arranging such access.
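Where a publisher has agreed to scripted access, one common courtesy is to space requests out so that the activity does not resemble an attack. The sketch below shows the idea in Python under stated assumptions: `throttled_fetch` is a hypothetical helper, the `fetch` argument stands in for whatever download method the publisher permits, and the interval shown is purely illustrative. Any real rate should be agreed with the publisher, and throttling does not replace that agreement.

```python
import time


def throttled_fetch(urls, fetch, min_interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, keeping request starts at least
    `min_interval` seconds apart.

    `sleep` and `clock` are injectable so the pacing can be checked
    without real waiting.
    """
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the minimum interval.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[url] = fetch(url)
    return results


# Demonstration with a stand-in fetch function and placeholder identifiers;
# no network access takes place here.
def fake_fetch(url):
    return f"content of {url}"


articles = throttled_fetch(["article-1", "article-2"], fake_fetch,
                           min_interval=0.1)
print(len(articles))
```

In practice `fetch` would wrap the publisher's approved API or download route, and the interval would follow the publisher's stated limits.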

Researchers should contact us if they have any questions about copyright legislation or require help in negotiating with publishers.

Latest updates

A number of new TDM products are being considered by the Bodleian Libraries. These will be trialled to allow researchers to try them out. Please pass on feedback and thoughts after using them.

One such product, Constellate, is a text analysis platform that integrates access to scholarly content and open educational resources into a cloud-based lab to help teach text analysis and data literacy skills. Constellate is part of ITHAKA’s portfolio of services, along with resources like JSTOR and Portico.

A Constellate Learn & Evaluate trial has been initiated by the Centre for Digital Scholarship (CDS) as a potential resource for online teaching of coding for the Digital Humanities. To find out more about the platform see their User guide.

However, ITHAKA has decided to sunset Constellate on July 1, 2025. Details of what will be removed, what will remain available, and the steps users need to take are here.

To support staff and students who have been exploring Constellate's resources and learning computational analysis skills, ITHAKA plans to extend the university’s full access trial through to June 30, 2025. This includes access to all the classes and resources, and will give users time to transition and to preserve their work.

To summarise, through June 30, 2025:

  • The University of Oxford will have full access to all Constellate features and resources
  • The Spring 2025 classes will proceed as planned
  • ITHAKA is developing plans to ensure the university can access its educational materials (notebooks, recordings, etc.) after the sunset date
  • ITHAKA will communicate directly with those in the university who use Constellate to help them preserve their work

After June 2025, ITHAKA will continue to support text-mining of JSTOR content through other means, details of which will follow in due course.

iSkills training

University of Oxford IT Services' Digital Skills Courses and Resources provide in-person or online training in coding languages commonly used for text and data mining (TDM) - R and Python.

Feedback Form