Text and Data Mining pilot project: Current TDM-ready collections

Social Sciences Division and Humanities

Gale Digital Scholar Lab

A humanities-focussed resource with a cloud-based platform and built-in tools aimed at streamlining analysis; finding, cleaning, and organising data; and supporting natural language processing (NLP) for historical texts. Users at any level can work with large corpora of text and data. Dashboards are provided for those who do not wish to code. Users familiar with Python can code using Jupyter Notebooks and other options.

Support

The Lab is a single research platform where you can apply natural language processing tools to raw text data (OCR) from Bodleian Libraries' Gale Primary Sources subscriptions, or from uploaded OCR.

Gale Digital Scholar Lab is organised in three broad steps (Build, Clean, and Analyse), which support users in interpreting both Gale Primary Sources and their own documents.

An integrated Learning Center provides instructional tutorial videos and explanations throughout.

HathiTrust Research Center

HTRC is a unit within the HathiTrust Digital Library that manages access to its content for the specific purposes of text mining.

HTRC: Learn and Support

For those who are new to both text analysis and HTRC tools and data, the New to HTRC? page addresses the basics.

A user with a background in natural language processing (NLP) methods, for example, might be ready to jump into the data capsule documentation, a tool designed for more intermediate to advanced HTRC users, since it requires you to be familiar with the Linux/Unix command-line and programming languages like Python or R.

See also an HTRC workshop on legal issues in TDM usage

TDM Studio (ProQuest)

A browser-based collaborative platform that gives access to large amounts of text and data from ProQuest with a Jupyter Notebook work environment.

Support

The platform supports both teaching and research with a Python and R Jupyter coding interface as well as pre-configured visualizations.

To open the Help & Learn content relevant to each dashboard, first open either Workbench or Visualization at the top of the page, then click the Help & Learn button.

TDM Studio LibGuides

Journals and TDM

Some platforms (e.g. TDM Studio) allow researchers to import external content for TDM. However, this depends on the licensing agreement with the relevant publishers. Please note that it is the researchers' responsibility to make sure the content can be used.

The following publishers allow researchers to use their journal content to create TDM corpora where it is for non-commercial purposes and institutional access is in place.
Do not assume that this applies to other content from the publisher, and always contact the publisher before collecting large amounts of text or data. This is a precaution against TDM activity being misread as an ongoing network breach.

  • Elsevier
    "A license-based approach that automatically enables researchers at subscribing institutions to text mine for non-commercial research purposes and to gain access to full-text content in XML for this purpose."
  • Cambridge University Press
    "Terms of Use ... permit text and data mining of Cambridge Core content for any non-commercial purpose, as long as you have lawful access to the content you wish to mine".
  • Oxford University Press
    "Non-commercial TDM rights are only permitted where users have lawful access to the content. Non-commercial TDM rights are detailed in our Subscription Agreement with your institution. To understand what is permitted under your institution’s agreement, please contact your librarian or OUP. "
  • Sage
    "Downloading articles from Sage Journals for the purposes of text and data mining is expressly permitted in our standard licence agreements and our terms of use for no extra fee. You do not need to ask permission to systematically download articles with some restrictions -see their website."
  • Springer Nature
    "For subscribed journals and books, Springer Nature grants researchers text and data mining rights via their institutions, provided the purpose is non-commercial research. For details and additional options go the publisher's website."
  • Taylor & Francis
    "If you or your institution subscribes to content from Taylor & Francis you can carry out TDM activities on this content, as well as open access content, without any additional charge, provided this is on a non-commercial basis."
  • Wiley
    "Academic subscribers can perform TDM under license (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost."

Jupyter Project

Jupyter Notebook

Jupyter Notebook is a web-based interactive computational environment for creating and organising Jupyter notebook documents. It is both a tool and a working space, and it is supported by most TDM providers. The Notebook supports several languages, such as Python (IPython), R and Julia, and is largely used for data analysis, data visualisation and other interactive, exploratory computing. It combines live code, equations, narrative text and visualisations for a simple, streamlined, document-centric experience. It originates from Project Jupyter, which develops open-source software, open standards, and services for interactive computing across multiple programming languages.
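
As a rough illustration of the kind of exploratory cell a researcher might run in a notebook, the sketch below counts the most frequent words in a short passage. It uses only the Python standard library. The function name `word_frequencies` and the sample sentence are hypothetical and are not taken from any TDM provider's tools; real platforms supply their own data access methods.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Return the top_n most common lowercase word tokens in text.

    Hypothetical helper for illustration only; the tokenisation is a
    simple regular expression, not a full NLP pipeline.
    """
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each token; most_common returns (token, count) pairs.
    return Counter(tokens).most_common(top_n)


# Made-up sample text standing in for a document from a corpus.
sample = "The library holds the archive. The archive holds letters."
result = word_frequencies(sample)
print(result)
```

In a notebook, the cell's output appears directly beneath the code, so the tokenising pattern or the value of `top_n` can be adjusted and the cell re-run while exploring a corpus.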