Downloading Extracted Features

See the different ways you can download EF files to your local machine or inside a data capsule.

The Basics

Rsync

Extracted Features files are downloaded with rsync, a command line file transfer utility. Rsync synchronizes files from our servers to your system.

Tips

  • Not including a path will sync all files. This is possible, but remember that (for Extracted Features 2.0) the full dataset is 4 terabytes! Only download all of it if you know what you are doing. Most people download only a subset of the files.

  • The fastest way to sync is by pointing at exactly the files that you want. Use the htrc-ef-all-files.txt file if you want the paths to all the EF files.

  • Have a custom list of volumes and want to get the file paths for it? We have created two options to help:

    • Uploading the volume list to HTRC Analytics as a workset allows you to download a shell script. Details.

    • The HTRC Feature Reader installs a command line utility, htid2rsync, and Python functions for doing the same. Details

HTRC Feature Reader

We provide a Python library called the HTRC Feature Reader, which simplifies many of the activities that you may want to perform with the EF Dataset, including generating file paths for volumes for which you know the HathiTrust volume ID. The Feature Reader is compatible with Extracted Features versions 2.0 and 1.5.

Read full documentation for the HTRC Feature Reader, including code examples.

File names

Filenames are derived from the associated volume's HathiTrust volume ID, with the following characters substituted:

original character    substitute character
:                     +
/                     =
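The two substitutions listed above can be sketched in a few lines of Python. This is only an illustration: the helper names clean_volume_id and ef_filename are hypothetical, the volume ID is used purely as an example, and the .json.bz2 extension is taken from the file path example elsewhere on this page.

```python
# Substitutions described above: ':' becomes '+', '/' becomes '='.
SUBSTITUTIONS = {":": "+", "/": "="}


def clean_volume_id(volume_id: str) -> str:
    """Return the filename-safe form of a HathiTrust volume ID (hypothetical helper)."""
    return volume_id.translate(str.maketrans(SUBSTITUTIONS))


def ef_filename(volume_id: str) -> str:
    """Build the Extracted Features filename for a volume ID (hypothetical helper)."""
    return clean_volume_id(volume_id) + ".json.bz2"


# Illustrative ID containing both characters that get substituted.
print(ef_filename("uc2.ark:/13960/t4mk66f1d"))
# prints: uc2.ark+=13960=t4mk66f1d.json.bz2
```

IDs without a colon or slash, such as nyp.33433070251792, pass through unchanged.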

Unzipping downloaded files

Each downloaded file is compressed with bzip2 (.bz2). You will need to locate the files on your local system and decompress any that you wish to view; the bunzip2 command may be useful. You can also work with the compressed files directly, or expand files incrementally as part of your processing or analysis pipeline. The HTRC Feature Reader works with compressed files without expanding them first.
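If you are not using the Feature Reader, Python's standard library can also read a .bz2 file without expanding it on disk. The sketch below assumes the file holds a single JSON document; the helper name read_compressed_json is hypothetical, and the demo file uses placeholder keys that do not reflect the real EF schema.

```python
import bz2
import json
import os
import tempfile


def read_compressed_json(path):
    """Load a .json.bz2 file without writing an uncompressed copy to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with a tiny stand-in file (placeholder content, not real EF data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.bz2")
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as handle:
        json.dump({"placeholder": "value"}, handle)
    record = read_compressed_json(demo_path)

print(record)
```

To read a downloaded file, pass its path to read_compressed_json in place of the temporary demo file.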

Download Extracted Features version 2.5

Download Format

File Format

A detailed description of the EF 2.5 file format can be found on the EF 2.5 documentation page.

File paths

The dataset is stored in a directory specification created by HTRC called stubbytree. (Note: This is the same structure as the Extracted Features 2.0 dataset, but a change from the 0.2 and 1.5 versions of the EF dataset, which used the pairtree directory specification.) Stubbytree places files in a directory structure based on the file name. The highest-level directory is the HathiTrust source code (i.e. the volume ID prefix), and the sub-directory name is formed from every third character of the cleaned volume ID, starting with the first. For example, the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at:

root/nyp/33759/nyp.33433070251792.json.bz2
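The stubbytree rule described above can be sketched as follows. This is a minimal illustration, not the Feature Reader's implementation: the function name stubbytree_path is hypothetical, and it assumes that "cleaned volume ID" means the part of the ID after the first period, with the filename substitutions applied. For real work, the htid2rsync tool and utils.id_to_rsync described later on this page are the supported route.

```python
def stubbytree_path(volume_id: str) -> str:
    """Sketch of the stubbytree layout for one volume ID (hypothetical helper)."""
    # The source code (prefix) is everything before the first period.
    prefix, _, local_id = volume_id.partition(".")
    # Apply the filename substitutions: ':' -> '+', '/' -> '='.
    cleaned = local_id.translate(str.maketrans({":": "+", "/": "="}))
    # Every third character of the cleaned ID, starting with the first.
    stub = cleaned[::3]
    return f"{prefix}/{stub}/{prefix}.{cleaned}.json.bz2"


print(stubbytree_path("nyp.33433070251792"))
# prints: nyp/33759/nyp.33433070251792.json.bz2
```

The result is relative to the root of the rsync module, so it can be used directly as a {PATH-TO-FILE} value.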

Download Options

Rsync

The rsync module (or alias path) for Extracted Features 2.5 is data.analytics.hathitrust.org::features-2025.04/

Sync all EF files

If you run the following rsync command, without specifying file paths, it will sync all files to the current directory:

rsync -av data.analytics.hathitrust.org::features-2025.04/ .

Do not do this unless you are prepared to work with the full dataset, which is 4.4 TB. Including the final period . when running your command will sync the files to your current working directory. You can also substitute the period with a path to the directory of your choosing to which you would like to sync the files.

Get a list of the EF files

A full listing of all the files is available to download with the following command:

rsync -azv data.analytics.hathitrust.org::features-2025.04/listing/file_listing.txt .

Sync a single EF file

You can sync any single Extracted Features file with:

rsync -av data.analytics.hathitrust.org::features-2025.04/{PATH-TO-FILE} .

Sync multiple EF files

You can also download multiple files by writing the Extracted Features files’ paths to a text file, one path per line, and then running the following command:

rsync -av --files-from FILE.TXT data.analytics.hathitrust.org::features-2025.04/ .

Sync a single folder

You can sync into a single folder, throwing away the directory structure, by adding --no-relative to the rsync command:

rsync -av --no-relative --files-from FILE.TXT data.analytics.hathitrust.org::features-2025.04/ .

Note: if you’re downloading many thousands of files, be aware that a single folder can only contain so many files, depending on your file system and operating system! Rsync requests for many files with the --no-relative parameter included might cause errors.
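If you drive rsync from a script, the --files-from and --no-relative commands above can be assembled programmatically. The sketch below only writes the path list and builds the argument list; it does not contact the server. The helper names write_paths_file and build_rsync_command are hypothetical, and the module string is the one given on this page.

```python
import os
import tempfile

EF_MODULE = "data.analytics.hathitrust.org::features-2025.04/"


def write_paths_file(paths, out_path):
    """Write one EF file path per line, the layout --files-from expects."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


def build_rsync_command(paths_file, destination=".", flatten=False):
    """Return the rsync argument list; flatten=True adds --no-relative."""
    command = ["rsync", "-av"]
    if flatten:
        command.append("--no-relative")
    command += ["--files-from", paths_file, EF_MODULE, destination]
    return command


# Demonstration using the example path from this page.
with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, "FILE.TXT")
    write_paths_file(["nyp/33759/nyp.33433070251792.json.bz2"], listing)
    with open(listing, encoding="utf-8") as handle:
        written = handle.read()

print(build_rsync_command("FILE.TXT"))
print(build_rsync_command("FILE.TXT", flatten=True))
```

The returned list can be passed to subprocess.run(command, check=True) when you are ready to start the transfer.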

Use the HTRC Feature Reader utility to sync a single EF file

Rather than learning the stubbytree specification, you can use the HTRC Feature Reader’s command line htid2rsync tool that is installed as part of the Python library. For example, to rsync a single Extracted Features file when you know the HathiTrust volume ID use: 

htid2rsync {VOLUMEID} | rsync --files-from - data.analytics.hathitrust.org::features-2025.04/ .

Python + rsync

If you already have a list of HT volume IDs, you can use the HTRC Feature Reader library to prepare to rsync your volumes of interest. Here is an example showing the conversion of one HT volume ID into an rsync path using the Feature Reader in Python:

from htrc_features import utils
utils.id_to_rsync('hvd.32044140344292')  # 'format' is an optional parameter with the default value format='stubbytree'

The htid2rsync utility can also be used to generate filepaths to EF 2.5 and 2.0 data via the command line:

$ htid2rsync hvd.32044140344292
hvd/34449/hvd.32044140344292.json.bz2
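For a whole list of volume IDs, the same conversion can be applied in a loop. The sketch below is one possible shape for that step, assuming the IDs arrive one per line. The helper ids_to_paths is hypothetical and takes the converter as a parameter; demo_to_path is a stand-in used only so the example runs without the Feature Reader installed, and it handles only IDs with no colon or slash.

```python
from typing import Callable, Iterable, List


def ids_to_paths(volume_ids: Iterable[str], to_path: Callable[[str], str]) -> List[str]:
    """Convert volume IDs to rsync paths, skipping blank lines and duplicates."""
    seen = set()
    paths = []
    for raw in volume_ids:
        volume_id = raw.strip()
        if not volume_id or volume_id in seen:
            continue
        seen.add(volume_id)
        paths.append(to_path(volume_id))
    return paths


def demo_to_path(volume_id: str) -> str:
    # Stand-in converter for this demonstration only; it does not apply the
    # ':' and '/' substitutions, so it is valid only for simple IDs.
    prefix, _, local_id = volume_id.partition(".")
    return f"{prefix}/{local_id[::3]}/{volume_id}.json.bz2"


ids = ["hvd.32044140344292\n", "", "hvd.32044140344292", "nyp.33433070251792"]
for path in ids_to_paths(ids, demo_to_path):
    print(path)
```

With the Feature Reader installed, utils.id_to_rsync can be passed as the to_path argument instead of the stand-in, and the resulting paths written to a text file for use with rsync --files-from.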

HTRC EF Download Helper Algorithm + rsync

To download the Extracted Features data for a specific workset in HTRC Analytics, you can run the Extracted Features Download Helper algorithm on the workset. It generates the file paths and a shell script that runs the rsync command for the workset, simplifying the process. The tool can also be useful if you don’t want to work out the file paths yourself. Select version 2.5 when you run the algorithm to get the EF 2.5 version of the files.

The algorithm creates a shell script that you can download and run from your local command line. The file lists the rsync commands for every volume in an HTRC workset. Once you have run the algorithm and downloaded the resulting .sh file, run it like so:

$ sh ef_rsync.sh

Download Extracted Features version 2.0