HTRC Extracted Features 2.0 is the most current version of a derived dataset consisting of metadata and data elements extracted from volumes in the HathiTrust Digital Library. The dataset is composed of 17.1 million JSON files representing a snapshot of the HathiTrust corpus from February 2020.
This documentation describes the structure and contents of the HTRC Extracted Features 2.0 files for people who work with them. The specific features extracted are described in more detail below.
You can also refer to the technical documentation of the Extracted Features 2.0 JSON-LD schema here. The schema was developed collaboratively with JSTOR-Portico, and it could be applied to data from non-HathiTrust sources to create compatible datasets. This version of the extracted features vocabulary is designed as a linked data standard (JSON-LD).
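Because each volume is represented by one JSON file, a first look at a file can be done with the Python standard library alone. The following is a minimal sketch, not an official HTRC tool: the function names `load_ef_file` and `summarize` are hypothetical, and the handling of a `.bz2` suffix is an assumption about how a file may have been compressed. The sketch does not assume any schema field names; it only lists whatever top-level keys the file contains.

```python
import bz2
import json
from pathlib import Path


def load_ef_file(path):
    """Load one Extracted Features JSON file into a Python dict.

    Assumption: a ".bz2" suffix means the file is bzip2-compressed;
    any other suffix is treated as plain, uncompressed JSON.
    """
    path = Path(path)
    opener = bz2.open if path.suffix == ".bz2" else open
    # Text mode with an explicit encoding works for both openers.
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(document):
    """Return the sorted top-level keys of a loaded file.

    No field names are assumed here; consult the schema documentation
    for what each key means.
    """
    return sorted(document.keys())
```

For example, `summarize(load_ef_file("example-volume.json.bz2"))` would list the top-level keys of a file with that (hypothetical) name, which can then be compared against the schema documentation.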
Downloading the files
Coming soon!
Data Stats
# of volumes represented
17,123,746
# of pages represented
6,221,631,336
# of tokens represented
2,906,819,723,689
# files derived from in-copyright volumes
10,550,952
# pages derived from in-copyright volumes
3,743,561,467
# tokens derived from in-copyright volumes
5,826,138,059,446
# files derived from public domain & Creative Commons volumes
6,572,794
# pages derived from public domain & Creative Commons volumes
2,478,069,869
# tokens derived from public domain & Creative Commons volumes