...
Sample Files
A sample of 100 extracted feature files is available for download through your browser: sample-EF202003.zip
Also, thematic collections are available to download: DocSouth (82 volumes), EEBO (234 volumes), ECCO (412 volumes).
Filepaths
The dataset is stored using stubbytree, a directory specification created by HTRC. (Note: This is a change from previous versions of the Extracted Features dataset, which used the pairtree directory specification.) Stubbytree places files in a directory structure based on the file name. The highest-level directory is the HathiTrust source code (i.e., the volume ID prefix), and the sub-directory beneath it is named by taking every third character of the cleaned volume ID, starting with the first. For example, the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at:
...
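The every-third-character rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the HTRC implementation: the function name `stubbytree_path` is hypothetical, and both the character cleaning (pairtree-style substitution of `:`, `/`, and `.`) and the `.json.bz2` file name layout are assumptions based on the paths shown elsewhere on this page.

```python
import posixpath

# Assumed pairtree-style cleaning of the volume ID; not stated in the text above.
_CLEAN = str.maketrans({":": "+", "/": "=", ".": ","})


def stubbytree_path(htid: str, extension: str = ".json.bz2") -> str:
    """Return a relative stubbytree path for a HathiTrust volume ID (sketch)."""
    # Everything before the first period is the HathiTrust source code (prefix).
    prefix, volume = htid.split(".", 1)
    cleaned = volume.translate(_CLEAN)
    # Every third character of the cleaned ID, starting with the first.
    stub = cleaned[::3]
    # Assumed file name layout: prefix, cleaned ID, then the extension.
    filename = f"{prefix}.{cleaned}{extension}"
    return posixpath.join(prefix, stub, filename)


print(stubbytree_path("nyp.33433070251792"))
```

For real work, the `id_to_rsync` utility in the HTRC Feature Reader is the better choice, since it tracks the layout HTRC actually serves.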
Code Block |
---|
rsync -av --no-relative --files-from FILE.TXT data.analytics.hathitrust.org::features-2020.03/ . |
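The file passed to `--files-from` is a plain text list with one relative path per line. As a rough sketch, the list and the matching rsync invocation could be prepared from Python as follows. The helper names `write_file_list` and `rsync_command` are hypothetical, and the example path is an illustration that assumes the stubbytree layout described above. The command is only printed here, not executed.

```python
import shlex
from pathlib import Path


def write_file_list(relative_paths, list_file="FILE.TXT"):
    """Write one relative path per line, as rsync --files-from expects."""
    Path(list_file).write_text("".join(f"{path}\n" for path in relative_paths))
    return str(list_file)


def rsync_command(list_file,
                  endpoint="data.analytics.hathitrust.org::features-2020.03/",
                  dest="."):
    """Build the argument list for the rsync call shown above."""
    return ["rsync", "-av", "--no-relative",
            "--files-from", str(list_file), endpoint, dest]


# Illustrative path only; it assumes the stubbytree layout.
example_paths = ["nyp/33759/nyp.33433070251792.json.bz2"]
list_file = write_file_list(example_paths)

# Print the command for review; pass the list to subprocess.run to execute it.
print(shlex.join(rsync_command(list_file)))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting concerns.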
Converting HathiTrust Volume ID to stubbytree URL
...
using HTRC Feature Reader
...
If you already have a list of HT volume IDs, you can use the HTRC Feature Reader, a Python library developed by the HTRC, to prepare to rsync your volumes of interest. Here is an example showing the conversion of one HT volume ID into an rsync URL, as it would be incorporated into a Python script:
Code Block | ||
---|---|---|
|
...
from htrc_features import utils
utils.id_to_rsync('mdp.39015058731913')  # the default is format='stubbytree' |
You can also call the id_to_rsync utility in the Feature Reader from an interactive Python session:
Code Block | ||
---|---|---|
| ||
>>> from htrc_features import utils
>>> utils.id_to_rsync('miun.adx6300.0001.001')
'miun/a600,1/miun.adx6300,0001,001.json.bz2' |
Downloading Extracted Features version 1.5
Download Format
File Format
...
Code Block |
---|
rsync -av --no-relative --files-from FILE.TXT data.analytics.hathitrust.org::features-2018.01/ . |
Converting HathiTrust Volume ID to rsync URL
...
using HTRC Feature Reader
...
If you already have a list of HT volume IDs, you can use the HTRC Feature Reader, a Python library developed by the HTRC, to prepare to rsync your volumes of interest. The library provides a convenience function, id_to_rsync, in htrc_features.utils. Here is an example showing the conversion of one HT volume ID into an rsync URL:
Code Block | ||
---|---|---|
| ||
from htrc_features import utils
utils.id_to_rsync('mdp.39015058731913', format='pairtree') |
You can also call the id_to_rsync utility in the Feature Reader from an interactive Python session to generate the rsync URL:
Code Block | ||
---|---|---|
| ||
>>> from htrc_features import utils
>>> utils.id_to_rsync('miun.adx6300.0001.001', format='pairtree')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2' |
If you use the command-line version, htid2rsync, the optional --oldstyle flag generates pairtree paths instead of stubbytree paths.
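To make the pairtree layout in the output above easier to read, here is a minimal sketch of how such a path can be assembled. It is an illustration under assumptions, not the Feature Reader's code: the function name `pairtree_path` is hypothetical, and the cleaning step covers only the three character substitutions needed for typical HathiTrust IDs rather than the full pairtree specification.

```python
import posixpath

# Assumed subset of pairtree character cleaning (":" "/" "." only).
_CLEAN = str.maketrans({":": "+", "/": "=", ".": ","})


def pairtree_path(htid: str, extension: str = ".json.bz2") -> str:
    """Return a relative pairtree path for a HathiTrust volume ID (sketch)."""
    # Everything before the first period is the HathiTrust source code (prefix).
    prefix, volume = htid.split(".", 1)
    cleaned = volume.translate(_CLEAN)
    # Split the cleaned ID into two-character segments; the last may be shorter.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    filename = f"{prefix}.{cleaned}{extension}"
    # The segments form nested directories, followed by a directory named
    # after the full cleaned ID, which holds the file itself.
    return posixpath.join(prefix, "pairtree_root", *segments, cleaned, filename)


print(pairtree_path("miun.adx6300.0001.001"))
# miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2
```

The deep nesting is what distinguishes pairtree from stubbytree, which collapses the same ID into a single short sub-directory name.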
Workset Builder 2.0
Extracted Features 1.5 files can also be downloaded from search results in HTRC's beta Workset Builder 2.0. After completing your search, you can download the Extracted Features files either for your entire result set or for individual volumes within it.
...