Learn how to understand the data that you will be working with.
Volumes
Items in HathiTrust are called volumes. A volume is a discrete object that was digitized and cataloged as one unit. In the case of HathiTrust and its collection, volumes are typically books (monographs), but they may also be one issue of a periodical, several issues of a periodical bound and described together, or even a musical score. Keep in mind that a volume may be an anthology containing multiple works! Currently, volumes in HathiTrust start as physical objects that are digitized and added to the HathiTrust Digital Library.
...
HathiTrust volumes are identified by unique HathiTrust IDs. These alphanumeric IDs track volumes across HathiTrust and HTRC systems. For example, a volume of Jane Austen's letters has the volume ID hvd.32044021076179. When viewing a volume in the Digital Library, the volume ID can be found in the URL after "id=". The volume ID can be used to call metadata via HathiTrust's Bibliographic API or to pull volume content via HathiTrust's Data API. Additionally, the volume ID is often present in the file and/or directory name for content pertaining to a specific volume, and it also determines the (pairtree) directory structure for volumes accessed via HathiTrust dataset requests or the HTRC Extracted Features Dataset.
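As a minimal sketch of the two uses described above, the following Python pulls the volume ID out of a Digital Library URL and derives a pairtree-style directory path from it. The function names are hypothetical, the example URL shape is an assumption, and the pairtree mapping is simplified: it applies only the single-character substitutions ("/" to "=", ":" to "+", "." to ",") and does not hex-encode other special characters, so check the results against the actual dataset layout before relying on them.

```python
from urllib.parse import urlparse, parse_qs


def volume_id_from_url(url: str) -> str:
    """Return the volume ID found after "id=" in a Digital Library URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values:
        raise ValueError(f"no 'id' parameter in URL: {url}")
    # Some URLs separate parameters with ';' instead of '&'; keep only the ID.
    return values[0].split(";")[0]


def pairtree_path(volume_id: str) -> str:
    """Build a simplified pairtree path (assumed layout) for a volume ID."""
    prefix, _, local_id = volume_id.partition(".")
    # Simplified cleaning step: single-character substitutions only.
    cleaned = local_id.replace("/", "=").replace(":", "+").replace(".", ",")
    # Split the cleaned ID into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([prefix, "pairtree_root", *pairs, cleaned])


if __name__ == "__main__":
    # Assumed URL shape; only the "id=" parameter comes from the text above.
    url = "https://babel.hathitrust.org/cgi/pt?id=hvd.32044021076179"
    vol_id = volume_id_from_url(url)
    print(vol_id)
    print(pairtree_path(vol_id))
```

Keeping the URL parsing and the path construction in separate functions makes it easy to reuse either one when the IDs come from another source, such as a metadata file.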
HathiTrust volume IDs begin with a prefix code that identifies the library of origin (i.e., the holding library) of the digitized item. For example, all volume IDs that begin with uiug relate to objects held by the University of Illinois.
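A short Python sketch of how the prefix might be separated from the rest of the ID, assuming the prefix is everything before the first period. The function names are hypothetical, and the uiug IDs in the usage example are made-up placeholders rather than real volumes.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_volume_id(volume_id: str) -> Tuple[str, str]:
    """Split a volume ID into (prefix, local ID) at the first period."""
    prefix, sep, local_id = volume_id.partition(".")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed volume ID: {volume_id!r}")
    return prefix, local_id


def count_by_prefix(volume_ids: Iterable[str]) -> Counter:
    """Count how many volume IDs fall under each library prefix."""
    return Counter(split_volume_id(v)[0] for v in volume_ids)


if __name__ == "__main__":
    # The uiug IDs below are hypothetical placeholders.
    ids = ["hvd.32044021076179", "uiug.0000001", "uiug.0000002"]
    print(split_volume_id(ids[0]))
    print(count_by_prefix(ids))
```

Splitting only at the first period matters because the part after the prefix can itself contain periods or other punctuation.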
...
While metadata for volumes in HathiTrust exists in a variety of formats and for a number of intended use cases, it generally begins as MARC metadata, the standard for library cataloging. It is often helpful to rely on the MARC specifications to navigate HathiTrust metadata for analysis, for example when determining what certain codes mean or what certain data structures imply. Additionally, HathiTrust publishes specifications for its metadata records. These can be quite useful because some fields, particularly MARC field 975, are used in HathiTrust-specific ways and contain useful metadata about volumes: https://www.hathitrust.org/bib_specifications.
While HathiTrust does not facilitate bulk download of full metadata records at this time, metadata is available in various formats and through several services, each of which can be useful depending on the use case:
- Hathifiles: tab-delimited files of reduced bibliographic metadata pulled from MARC records that are released daily for incremental additions to HathiTrust. On the first of each month, a file of every volume currently in HathiTrust is released.
...
- HathiTrust Bibliographic API: for retrieving JSON-formatted MARC metadata via HathiTrust ID, HathiTrust record number, or OCLC number for up to 20 identifiers at a time.
- HTRC Extracted Features: volume-level JSON files include limited bibliographic metadata in addition to page-level metadata and features.
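To illustrate how two of the services above could be combined, here is a hedged Python sketch that reads volume IDs from a tab-delimited file and groups them into batches of at most 20, the per-request limit mentioned for the Bibliographic API. It assumes the volume ID is the first column and that the file has no header row; both should be checked against the Hathifiles documentation. The function names and sample data are hypothetical, and no request is actually sent.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, TextIO

MAX_IDS_PER_REQUEST = 20  # limit stated for the Bibliographic API


def read_volume_ids(handle: TextIO) -> Iterator[str]:
    """Yield the first column of each row (assumed to be the volume ID)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row and row[0].strip():
            yield row[0].strip()


def batches(ids: Iterable[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Group identifiers into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical sample rows standing in for a real tab-delimited file.
    sample = io.StringIO("".join(f"test.{n:04d}\tallow\n" for n in range(45)))
    for group in batches(read_volume_ids(sample)):
        print(len(group), group[0], group[-1])
```

Both functions are generators, so a full monthly file can be streamed without loading every row into memory at once.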
Additionally, this table of MARC coverage can help clarify the nature of the collection.
...
A volume's copyright and license status affects how researchers are permitted to interact with the data for that volume. Only public domain volumes are available via the HathiTrust Data API and custom data request process. Additionally, there are access restrictions based on the digitizing agent of volumes in HathiTrust that affect use of the HT Data API and custom data request procedures. Read more here: https://www.hathitrust.org/data. Only public domain volumes are available for research via the HTRC Analytics site and Data Capsule compute environments. HTRC Extracted Features and the HT+Bookworm tool, however, do provide analytic access to derived data from the entire corpus.
...