...

Filepaths

HTRC Extracted Features files downloaded through rsync arrive in a pairtree directory structure. Pairtree is a hierarchical filesystem convention developed by the University of California Curation Center that maps identifier strings (in this case the HathiTrust Volume ID, or HTID) to directory paths two characters at a time. The institutional short code at the front of each HTID (e.g. mdp, uc2) is kept intact in the filepath.
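To make the two-characters-at-a-time mapping concrete, here is a minimal sketch in plain Python. The function name htid_to_pairtree_dir is hypothetical and is not part of any HTRC library. The sketch assumes the usual pairtree single-character substitutions ("/" to "=", ":" to "+", "." to ",") and leaves out the hex-escaping step of the full pairtree specification. It builds only the directory portion of the path, not the file name.

```python
def htid_to_pairtree_dir(htid: str) -> str:
    """Sketch: map a HathiTrust volume ID to a pairtree-style directory path.

    Hypothetical helper for illustration only. It covers the common
    character substitutions, not the full pairtree cleaning rules.
    """
    # The institutional short code is everything before the first "."
    # and is kept intact as the top-level directory.
    prefix, _, local_id = htid.partition(".")
    if not prefix or not local_id:
        raise ValueError(f"expected '<prefix>.<id>', got {htid!r}")

    # Replace characters that are awkward in file paths (assumed mapping).
    cleaned = local_id.translate(str.maketrans({"/": "=", ":": "+", ".": ","}))

    # Split the cleaned identifier into two-character path segments.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]

    # The cleaned identifier itself names the final directory.
    return "/".join([prefix, "pairtree_root", *pairs, cleaned])


print(htid_to_pairtree_dir("miun.adx6300.0001.001"))
# miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001
```

In practice you would normally let the HTRC Feature Reader build these paths for you. The sketch is only meant to show why the downloaded directories look the way they do.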

Downloading with Rsync

To download the Extracted Features files via rsync, you must first generate a list of filepaths to include in the rsync command.

Converting HathiTrust Volume ID to Rsync URL (using HTRC Feature Reader)

If you already have a list of HT volume IDs, you can use the HTRC Feature Reader, a Python library developed by the HTRC, to prepare to rsync your volumes of interest. (Note: The HTRC Feature Reader library can do many other things as well!) The library includes a convenience function, htrc_features.utils.id_to_rsync(htid), which converts an HT volume ID into a URL for rsync; apply it to each ID in your list. Here is an example showing the conversion of one HT volume ID into an rsync URL:

...