A humanities-focused, cloud-based platform with built-in tools aimed at streamlining analysis: finding, cleaning, and organising data, and supporting natural language processing (NLP) for historical texts. Users at any level can work with large corpora of text and data. Dashboards are provided for those who do not wish to code, while users familiar with Python can work in Jupyter Notebooks and other environments.
The Lab is a single research platform where you can apply natural language processing tools to raw text (OCR) data from Bodleian Libraries' Gale Primary Sources subscriptions, or to your own uploaded OCR content.
Gale Digital Scholar Lab is organised in three broad steps: Build, Clean, and Analyse, which support users in interpreting both Gale Primary Sources and their own documents.
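The Build–Clean–Analyse pattern can be illustrated outside the Lab with plain Python. A minimal sketch, assuming a small in-memory corpus standing in for retrieved OCR text (the sample text and cleaning rules are illustrative only, not the Lab's own tools):

```python
import re
from collections import Counter

# Build: a toy corpus standing in for OCR text gathered from an archive
raw_docs = [
    "The  Parliament   of 1835 -- debates &c. on the Corn Laws.",
    "Debates on the corn laws continued; the Parliament adjourned.",
]

def clean(text):
    """Clean: lower-case, strip punctuation and OCR noise, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()

# Analyse: simple term frequencies across the cleaned corpus
counts = Counter(word for doc in raw_docs for word in clean(doc).split())
print(counts.most_common(3))
```

Real workflows would substitute stop-word removal, lemmatisation, or other NLP steps at the Clean and Analyse stages; the point is only the shape of the pipeline.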
An integrated Learning Center provides instructional tutorial videos and explanations throughout.
HTRC is a unit within the HathiTrust Digital Library that manages access to its content specifically for text mining.
For those who are new to both text analysis and HTRC tools and data, the New to HTRC? page addresses the basics.
A user with a background in natural language processing (NLP) methods, for example, might be ready to jump into the Data Capsule documentation. The Data Capsule is designed for intermediate to advanced HTRC users, since it requires familiarity with the Linux/Unix command line and programming languages such as Python or R.
See also the HTRC workshop on legal issues in TDM.
A browser-based collaborative platform that provides access to large amounts of ProQuest text and data through a Jupyter Notebook working environment.
The platform supports both teaching and research with a Python and R Jupyter coding interface, as well as pre-configured visualisations.
To activate Help & Learn content relevant to each dashboard, first open either Workbench or Visualization at the top of the page, then click the Help & Learn button.
Some platforms (e.g. TDM Studio) allow researchers to import external content for TDM. However, this depends on the licensing agreement with the relevant publishers. Please note that it is the researcher's responsibility to make sure the content can be used.
The following publishers allow researchers to use their journal content to create TDM corpora, provided it is for non-commercial purposes and institutional access is in place.
Do not assume that this applies to other content from the publisher, and always contact the publisher before collecting large amounts of text or data.
This is a precaution against TDM activity being misread as an ongoing network breach.
Jupyter Notebook is a web-based interactive computational environment for creating and organising notebook documents, and a working space supported by most TDM providers. The Notebook supports several languages, including Python (IPython), R, and Julia, and is widely used for data analysis, data visualisation, and other interactive, exploratory computing. It combines live code, equations, narrative text, and visualisations in a simple, streamlined, document-centric experience. The platform originates from Project Jupyter, which develops open-source software, open standards, and services for interactive computing across multiple programming languages.
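Under the hood, a notebook document (.ipynb file) is plain JSON: an ordered list of cells mixing narrative and code, which is how the document-centric combination described above is stored. A minimal sketch using only the Python standard library (the field names follow the nbformat 4 schema; the cell contents are invented for illustration):

```python
import json

# A minimal notebook: one markdown (narrative) cell and one code cell,
# structured according to the nbformat 4 JSON schema.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Word counts in a sample corpus",
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": "from collections import Counter\n"
                      "Counter('the corn laws the corn'.split())",
        },
    ],
}

# Serialising this dict to a .ipynb file yields a document Jupyter can open.
serialised = json.dumps(notebook, indent=1)
print(len(serialised) > 0)
```

In practice, notebooks are created and edited in the Jupyter interface itself (or with the nbformat library); the point here is that the document format is open and inspectable.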