Oxford LibGuides: Text and Data Mining pilot project: Current TDM-ready collections

Gale Digital Scholar Lab

This resource requires you to register before use.
"Gale Digital Scholar Lab is a cloud-based research environment that allows students and researchers to apply natural language processing tools to raw text data (OCR) from Gale's Primary Sources in a single research platform."

To find out how the Lab works, watch the Webinar.
"This webinar shows you how to create a content set, clean text, and run an analysis in Digital Scholar Lab."

An exciting and innovative product that introduces Digital Humanities research methods and analysis techniques.

Further information

A humanities-focussed resource with a cloud-based platform and built-in tools aimed at streamlining analysis, at finding, cleaning, and organising data, and at supporting natural language processing (NLP) for historical texts. Users at any level can work with large corpora of text and data. Dashboards are provided for those who do not wish to code. Users familiar with Python can code using Jupyter Notebooks and other options.

Support

The Lab is a single research platform where you can apply natural language processing tools to raw text data (OCR) from Bodleian Libraries' Gale Primary Sources subscriptions, or from uploaded OCR.

Gale Digital Scholar Lab is organised in three broad steps, Build, Clean, and Analyse, which support users in interpreting both Gale Primary Sources and their own documents.
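
The three steps can be illustrated outside the Lab with a minimal Python sketch. This is not the Lab's own tooling: the sample texts, the stop-word list, and the function names `clean` and `analyse` are all hypothetical, chosen only to show what building a content set, cleaning OCR text, and running a simple analysis might look like.

```python
import re
from collections import Counter

# Build: a tiny hypothetical content set standing in for OCR text.
content_set = {
    "doc1": "The rail-\nway opened in 1830. The RAILWAY was busy!",
    "doc2": "A new railway line; the town grew.",
}

# Hypothetical stop-word list; real projects usually use a longer one.
STOP_WORDS = {"the", "a", "in", "was", "new"}


def clean(text: str) -> list[str]:
    """Clean: rejoin words hyphenated across line breaks, lowercase,
    keep alphabetic tokens only, and drop stop words."""
    text = re.sub(r"-\s*\n\s*", "", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def analyse(docs: dict[str, str]) -> Counter:
    """Analyse: count term frequencies across the whole content set."""
    counts = Counter()
    for text in docs.values():
        counts.update(clean(text))
    return counts


term_counts = analyse(content_set)
print(term_counts.most_common(3))
```

Keeping the cleaning step separate from the analysis step mirrors the Lab's workflow: the same cleaned tokens could feed a different analysis without touching the source texts.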

An integrated Learning Center provides instructional tutorial videos and explanations throughout.

HathiTrust Research Center

This resource requires you to log in with your Oxford SSO again at the homepage. This resource requires you to register before use.
Alternative names: HathiTrust Research Center Analytics; HathiTrust Research Centre; HTRC

A unit within the HathiTrust Digital Library that manages access to its content for the specific purposes of text mining. HathiTrust is a repository containing over 17.5 million titles digitised by academic and research institutions across North America and Europe. It specialises in literature and U.S. Government document collections. Since the University of Oxford is a member institution, its students and staff may access books and create collections using their own free personal accounts. Creating a collection with the HathiTrust Digital Library is like creating your own text corpus, which can be used for computational text analysis via the Research Center. Collections may also be made public for other researchers to access.

HTRC is a unit within the HathiTrust Digital Library that manages access to its content for the specific purposes of text mining.

HTRC: Learn and Support

For those who are new to both text analysis and HTRC tools and data, the "New to HTRC?" page covers the basics.

A user with a background in natural language processing (NLP) methods, for example, might be ready to go straight to the documentation for the Data Capsule, a tool designed for intermediate to advanced HTRC users, since it requires familiarity with the Linux/Unix command line and programming languages such as Python or R.

TDM Studio (ProQuest)

TDM Studio
Once your account has been created, visit https://tdmstudio.proquest.com/home and log in with your University of Oxford email address.
TDM Studio is a text and data mining tool created by ProQuest. It allows programmatic analysis of published content from the millions of pages of news and scholarly publications provided through current university ProQuest subscriptions. These are listed under ‘Vendor/Provider’ in the Database A-Z.
Two ways of working are provided:
1. the Visualization Dashboard is designed for users of all levels and includes Topic Modelling, Geographic Analysis and Sentiment Analysis
2. the Workbench Dashboard is designed for experienced users and works within a Jupyter Notebook environment, allowing the creation of a text corpus or dataset in minutes. This dashboard is aimed at researchers familiar with coding in Python, R or similar.
For documentation and support, visit the TDM Studio LibGuide.
Registering for TDM Studio
Anyone with a valid University of Oxford email address can request access to TDM Studio. To request an account and workbench, please fill out this form. By default, each workbench can support 1-5 users.

A browser-based collaborative platform that gives access to large amounts of text and data from ProQuest, with a Jupyter Notebook work environment.

Support

The platform supports both teaching and research with a Python and R Jupyter coding interface as well as pre-configured visualizations.
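
As an illustration of the kind of task carried out in a notebook-based coding interface, here is a minimal sketch of building a small corpus by keyword and date range. It does not use any TDM Studio API: the `Article` class, the sample records, and the `build_corpus` function are hypothetical stand-ins, and a real workbench would supply licensed content instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    published: date
    text: str


# Hypothetical records used purely for illustration.
articles = [
    Article("Floods hit the coast", date(2019, 2, 3), "Severe floods closed roads."),
    Article("Budget announced", date(2020, 3, 11), "The budget raised spending."),
    Article("Flood defences funded", date(2021, 7, 9), "New flood barriers are planned."),
]


def build_corpus(items, keyword, start, end):
    """Keep articles that mention the keyword (case-insensitive)
    and were published inside the inclusive date range."""
    keyword = keyword.lower()
    return [
        a for a in items
        if start <= a.published <= end
        and keyword in (a.title + " " + a.text).lower()
    ]


corpus = build_corpus(articles, "flood", date(2019, 1, 1), date(2020, 12, 31))
print([a.title for a in corpus])
```

The keyword test here is a plain substring match, so "flood" also matches "floods"; a research project would normally choose its matching rule more deliberately.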

To open the Help & Learn content relevant to each dashboard, first open either Workbench or Visualization at the top of the page, then click the Help & Learn button.

TDM Studio LibGuides

GPT LLMs are now available for use as a beta feature.

Steps to access:

1. Log in with your university email address.
2. Turn on your TDM Studio Workbench.
3. Open Jupyter Notebook.
4. Navigate to: Getting Started/2025.04.1/ProQuest TDM Studio Samples/GPT_Sentiment_Analysis.ipynb
5. Run the sample script against your dataset.

TDM Studio Workbench GPT Tutorial

Email TDM Studio Team with your queries.

Journals and TDM

Some platforms (e.g. TDM Studio) allow researchers to import external content for TDM. However, this depends on the licensing agreement with the relevant publishers. Please note that it is the researchers' responsibility to make sure the content can be used.

The following publishers allow researchers to use their journal content to create TDM corpora where it is for non-commercial purposes and institutional access is in place.
Do not assume that this applies to other content from the publisher, and always contact the publisher before collecting large amounts of text or data. This is a precaution against TDM activity being misread as an ongoing network breach.

Elsevier
"A license-based approach that automatically enables researchers at subscribing institutions to text mine for non-commercial research purposes and to gain access to full-text content in XML for this purpose."
Cambridge University Press
"Terms of Use ... permit text and data mining of Cambridge Core content for any non-commercial purpose, as long as you have lawful access to the content you wish to mine".
Oxford University Press
"Non-commercial TDM rights are only permitted where users have lawful access to the content. Non-commercial TDM rights are detailed in our Subscription Agreement with your institution. To understand what is permitted under your institution’s agreement, please contact your librarian or OUP."
Sage
"Downloading articles from Sage Journals for the purposes of text and data mining is expressly permitted in our standard licence agreements and our terms of use for no extra fee. You do not need to ask permission to systematically download articles, with some restrictions; see their website."
Springer Nature
"For subscribed journals and books, Springer Nature grants researchers text and data mining rights via their institutions, provided the purpose is non-commercial research. For details and additional options go to the publisher's website."
Taylor & Francis
"If you or your institution subscribes to content from Taylor & Francis you can carry out TDM activities on this content, as well as open access content, without any additional charge, provided this is on a non-commercial basis."
Wiley
"Academic subscribers can perform TDM under license (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost."

Jupyter Project

Jupyter Notebook

Jupyter Notebook is a web-based interactive computational environment for creating and organising Jupyter notebook documents. It is both a tool and a working space, and it is supported by most TDM providers. The Notebook supports several languages, such as Python (IPython), R and Julia, and is largely used for data analysis, data visualisation and other interactive, exploratory computing. It combines live code, equations, narrative text and visualisations for a simple, streamlined, document-centric experience. It originates from Project Jupyter, which exists to develop open-source software, open standards, and services for interactive computing across multiple programming languages.
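
To make the idea of a notebook document concrete, the sketch below builds a minimal one by hand. A notebook file (`.ipynb`) is a JSON document holding a list of cells, here one markdown cell with narrative text and an equation, and one code cell. The cell contents are invented for illustration, and in practice notebooks are created through the Jupyter interface rather than written this way.

```python
import json

# A minimal notebook in the version 4 notebook format.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Narrative text and an equation live in markdown cells.
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Word counts\nRelative frequency is $f = n / N$.",
        },
        {
            # Live code lives in code cells; outputs are filled in when run.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "words = 'text and data mining'.split()\nlen(words)",
        },
    ],
}

# Serialise to the JSON text that would be saved in a .ipynb file.
serialised = json.dumps(notebook, indent=1)
print(serialised[:60])
```

Saving `serialised` to a file with the `.ipynb` extension should give a document that Jupyter can open, which is why notebooks are easy to share and to keep under version control.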