
...

Read the docs

Sample Files

A sample of 100 extracted feature files is available for download through your browser: sample-EF202003.zip

Also, thematic collections are available to download: DocSouth (82 volumes), EEBO (234 volumes), ECCO (412 volumes).

Filepaths

The dataset is stored in a directory specification created by HTRC called stubbytree. (Note: This is a change from previous versions of the Extracted Features dataset, which used the pairtree directory specification.) Stubbytree places files in a directory structure based on the file name: the highest-level directory is the HathiTrust source code (i.e. the volume ID prefix), and a single sub-directory is named by taking every third character of the cleaned volume ID, starting with the first. For example, the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at:

...

Code Block
languagepy
from htrc_features import utils
utils.id_to_rsync('hvd.32044140344292')  # the optional 'format' parameter defaults to format='stubbytree'
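
To make the stubbytree rule described above concrete, the following is a minimal sketch that derives a path by hand, without the Feature Reader. It is not the library's implementation. The helper names clean_volume_id and stubbytree_path are hypothetical, and the cleaning step assumes the pairtree-style character substitutions (':' to '+', '/' to '=', '.' to ','), which this page does not spell out.

```python
import posixpath


def clean_volume_id(volume_id: str) -> str:
    # Assumption: "cleaning" means the pairtree-style substitutions
    # ':' -> '+', '/' -> '=', '.' -> ','.
    return volume_id.replace(":", "+").replace("/", "=").replace(".", ",")


def stubbytree_path(htid: str, extension: str = "json.bz2") -> str:
    # Hypothetical helper: split the ID into source prefix and volume ID
    # at the first period.
    prefix, volume_id = htid.split(".", 1)
    cleaned = clean_volume_id(volume_id)
    # Every third character of the cleaned volume ID, starting with the first.
    stub = cleaned[::3]
    filename = f"{prefix}.{cleaned}.{extension}"
    return posixpath.join(prefix, stub, filename)


print(stubbytree_path("hvd.32044140344292"))
# hvd/34449/hvd.32044140344292.json.bz2
```

The result matches the htid2rsync output shown on this page for the same volume ID.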


Feature Reader also comes with a command line utility called htid2rsync, which can be used to generate filepaths to EF 2.0 data:

Code Block
languagepowershell
$ htid2rsync hvd.32044140344292
hvd/34449/hvd.32044140344292.json.bz2


Downloading Extracted Features version 1.5

...

Code Block
languagepy
from htrc_features import utils
utils.id_to_rsync('hvd.32044140344292', format='pairtree')


Feature Reader also comes with a command line utility called htid2rsync, which can be used to generate filepaths to EF 1.5 data using the flag --oldstyle:

Code Block
languagepowershell
$ htid2rsync hvd.32044140344292 --oldstyle
hvd/pairtree_root/32/04/41/40/34/42/92/32044140344292/hvd.32044140344292.json.bz2


If you use the command line version, htid2rsync, the optional --oldstyle flag generates pairtree paths instead of stubbytree paths.
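
For comparison with stubbytree, here is a minimal sketch of how a pairtree path for EF 1.5 could be derived by hand. It is an illustration under stated assumptions, not the Feature Reader's own code. The helper names clean_volume_id and pairtree_path are hypothetical, and the cleaning step again assumes the pairtree character substitutions (':' to '+', '/' to '=', '.' to ',').

```python
import posixpath


def clean_volume_id(volume_id: str) -> str:
    # Assumption: pairtree-style cleaning, ':' -> '+', '/' -> '=', '.' -> ','.
    return volume_id.replace(":", "+").replace("/", "=").replace(".", ",")


def pairtree_path(htid: str, extension: str = "json.bz2") -> str:
    # Hypothetical helper: split the ID into source prefix and volume ID
    # at the first period.
    prefix, volume_id = htid.split(".", 1)
    cleaned = clean_volume_id(volume_id)
    # Break the cleaned ID into two-character directory names
    # (the last one may be a single character if the length is odd).
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    filename = f"{prefix}.{cleaned}.{extension}"
    # Layout: prefix / pairtree_root / pair dirs / cleaned ID / file name.
    return posixpath.join(prefix, "pairtree_root", *pairs, cleaned, filename)


print(pairtree_path("hvd.32044140344292"))
# hvd/pairtree_root/32/04/41/40/34/42/92/32044140344292/hvd.32044140344292.json.bz2
```

The deeper nesting here, one directory per character pair plus a directory named after the full cleaned ID, is what stubbytree collapses into a single short sub-directory.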

...