Learn about the different access points and formats for the data HTRC provides, as well as the affordances and limitations of each method. Your research project will largely dictate which method best suits your needs.
HTRC provides access to data from the HathiTrust corpus in several forms across its suite of tools and services for computational text analysis. Data is periodically synced from HathiTrust, but not all HTRC tools and services are updated on the same schedule. Additionally, copyright, user agreements, and security concerns impact data availability and format.
HTRC algorithms and HTRC Data Capsules can analyze the entire HathiTrust corpus, and both also make use of each volume’s MARC bibliographic and METS metadata. Both the algorithms and the Data Capsule environments draw on the HTRC Data API described below.
HTRC also makes two datasets available: the HTRC Extracted Features Dataset and a dataset of Word Frequencies in English Language Literature, 1700-1922. HTRC Extracted Features includes metadata and extracted page-level data (words and word counts) for 15.7 million volumes.
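To give a sense of what page-level words and word counts look like in practice, here is a minimal sketch that sums per-page counts into volume-level totals. It assumes a volume record shaped like the Extracted Features JSON, with a `features.pages` list whose `body.tokenPosCount` maps each token to part-of-speech counts. The `sample_volume` data and the `volume_word_counts` helper are hypothetical and only illustrate the idea; check the dataset documentation for the exact schema of the release you use.

```python
from collections import Counter


def volume_word_counts(volume: dict) -> Counter:
    """Sum page-level token counts into one case-folded, volume-level Counter.

    Assumes the layout features -> pages -> body -> tokenPosCount,
    where tokenPosCount maps token -> {part_of_speech: count}.
    """
    totals = Counter()
    for page in volume.get("features", {}).get("pages", []):
        body = page.get("body") or {}
        for token, pos_counts in (body.get("tokenPosCount") or {}).items():
            # Collapse part-of-speech distinctions and letter case.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


# Hypothetical two-page volume, invented for illustration only.
sample_volume = {
    "features": {
        "pages": [
            {"seq": "00000001",
             "body": {"tokenPosCount": {"Whale": {"NN": 2}, "the": {"DT": 3}}}},
            {"seq": "00000002",
             "body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 1}}}},
        ]
    }
}

counts = volume_word_counts(sample_volume)
```

Because the data is already tokenized and counted per page, this kind of aggregation needs no access to the full text of a volume.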
HathiTrust+Bookworm visualizes data for 15.7 million volumes.
- Search and select volumes to build a collection in HathiTrust. Import the collection to HTRC as a workset to use with algorithms, or call its data into your Capsule using the HTRC Workset Toolkit.
- Query the HathiTrust Bibliographic API.
- Utilize the HathiFiles.
- Need more help? HTRC can help you build a list of volume IDs to analyze. Contact htrc-help@hathitrust.org.
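As one illustration of the Bibliographic API option above, the following sketch builds a request URL and pulls volume IDs out of the JSON response. It assumes the `catalog.hathitrust.org/api/volumes/{brief|full}/{id type}/{id}.json` URL pattern and a response whose `items` entries each carry an `htid` field. The helper names are hypothetical, and the URL-encoding of identifiers containing special characters is an assumption, so verify both against the current API documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for the HathiTrust Bibliographic API.
BASE_URL = "https://catalog.hathitrust.org/api/volumes"
ID_TYPES = {"oclc", "lccn", "issn", "isbn", "htid", "recordnumber"}


def bib_api_url(id_type: str, value, detail: str = "brief") -> str:
    """Build a single-identifier Bibliographic API URL."""
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported id type: {id_type}")
    if detail not in {"brief", "full"}:
        raise ValueError(f"unsupported detail level: {detail}")
    # Percent-encoding here is an assumption about how the API treats
    # identifiers that contain reserved characters.
    return f"{BASE_URL}/{detail}/{id_type}/{quote(str(value))}.json"


def volume_ids(response: dict) -> list:
    """Collect the HathiTrust volume IDs listed in a parsed API response."""
    return [item["htid"] for item in response.get("items", []) if "htid" in item]


def fetch_volume_ids(id_type: str, value) -> list:
    """Query the API over the network and return the matching volume IDs."""
    with urlopen(bib_api_url(id_type, value), timeout=30) as resp:
        return volume_ids(json.load(resp))
```

A call such as `fetch_volume_ids("oclc", 424023)` would return a list of volume IDs that could then be gathered into a list for analysis.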
Data API
This table outlines the differences between the HTRC Data API and the HathiTrust Data API. As a researcher-accessible service, the HTRC Data API functions within the HTRC Data Capsules environment.
HathiTrust Data API
purpose | to provide access to HathiTrust text data within an HTRC Data Capsule and to serve high-performance, large-scale algorithms and programs (not publicly accessible) |
data available | entire HathiTrust corpus |
use | In-capsule via the HTRC Workset Toolkit |
throttling enforcement | no |
security | JWT |
bulk retrieval of volumes | yes |
metadata available |