
This page introduces the Extracted Features functionality (currently in beta release) recently developed by the HathiTrust Research Center (HTRC). It is one of the ways in which users of HTRC's tools can perform non-consumptive analysis of subsets of the HathiTrust Digital Library's corpus that they have custom-selected through the HTRC workset mechanism.

 Currently, this functionality is available only for the HathiTrust Digital Library's public domain corpus, consisting of slightly less than 5 million volumes.


1. Overview

This document is a starting point for users interested in downloading the JSON-format Extracted Features (EF) data files for the HathiTrust Digital Library volumes that make up a custom workset built with the HTRC Workset Builder. We will show, step by step, how to create a custom workset and then how to download the data files corresponding to its contents.

 2. Create your workset

This section shows you how to create a custom workset; you will later download the advanced and basic EF data files for the volume(s) it contains. Your workset can contain as many volumes as you wish. For simplicity, however, the example workset in this section consists of a single volume from the HathiTrust Digital Library's public domain collection: a 1920 edition of Buch der Lieder, a book of poems by the German poet Heinrich Heine. We then show you how to download the EF data files corresponding to this workset. (One of the use cases we have posted for the EF approach to non-consumptive text analysis also uses this book by Heine to make its point.)

2.1 Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC)

Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC) page. Click on the link stating “Sign In” at the upper right corner of the screen.

...

 

2.2 Sign in to HTRC SHARC

After the previous step, you will reach the screen shown below. Enter your HTRC username and password in the respective fields, and then click the “Sign In” button.

...

 

2.3 Verify that you are logged in to HTRC SHARC

...

2.4 Prepare to create a workset

In this step-by-step instruction set, we will show you how to create a new workset using the HTRC Workset Builder. You will be performing a search on the HathiTrust Digital Library’s collection and then selecting some or all of the search results to constitute your workset. For simplicity, we will show you how to create a simple workset consisting of a single volume, for which you will eventually be able to download the basic and advanced feature data files, by the end of these instructions.

Other ways of creating worksets, or of making use of public worksets  created by other users, also exist. For example, if you already have a comma-separated values (CSV) file that specifies the list of HathiTrust volumeIDs corresponding to the volumes that you want your workset to comprise,  you can simply upload it using the "Upload workset" link. For more information about creating, uploading and browsing worksets, you can consult the instructions available at the HTRC Wiki.

Click on 'Create Workset'.

2.5 Access the HTRC Workset Builder

 You should now be at the screen shown below. Click on the “More options” link.

 


 

2.6 Prepare to search the HathiTrust Digital Library

You should now be at the screen shown below. Enter Buch der Lieder in the ‘Title’ field, Heine in the ‘Author’ field and 1920 in the ‘Publish Date’ field as shown below. Then click the “Search” button.

 

...

 

2.7 Select specific volume(s) from the search results

Select the first of the three volumes that show up, by clicking the checkbox next to ‘Select’, as shown below.

...

2.8 Prepare to view the selected volume(s)

Click on “Selected Items” from among the options at the upper right corner of the screen (as shown below).


2.9 Prepare to create a workset consisting of the selected volume(s)

The volume you selected now shows up as a selected item, as shown below. Click on the “Create/Update Workset” link that is within the grey area.

...

2.10 Name and describe the workset that is about to be created

 

Enter a name (for example, 'HeinePoetry') in the ‘Name:’ field as shown below. Enter a description in the ‘Description:’ field and set the availability to ‘Private’ or ‘Public’ as you prefer. Then click on the ‘Create’ button.

 

...

2.11 Verify that the workset has been created

You should see a message, as shown below, stating that the workset has been created.

...

3. Generate and execute the data file transfer script

 

3.1 Return to the HTRC SHARC screen

Click on the ‘Portal’ link, which is at the top right corner of the screen.

...

3.1 Prepare to execute an algorithm

 

You should now be back at the HTRC SHARC screen, as shown below. Click on the 'Algorithms' link, which is near the top of the screen.

 

...

3.2 Prepare to execute the EF_Rsync_Script_Generator algorithm

 

From the list of algorithms that shows up (as shown below), click on EF_Rsync_Script_Generator. This is the algorithm  for generating the script for downloading the feature data files that correspond to your workset.

 


3.3 Execute the EF_Rsync_Script_Generator algorithm

Specify a job name of your choosing. You also have to select a  workset that the EF_Rsync_Script_Generator algorithm will run against: Check the button saying “Select a workset from my worksets” and select your desired workset.  Your screen should now look like the figure below. At this point, click the ‘Submit’ button.

...

 

3.4 Check the status of the EF_Rsync_Script_Generator algorithm's execution

 

You can now see the status of the job, as shown below. The status will initially show as “Staging”. (If you refresh the screen after some time, the status will change to “Queued”.)

 


3.5 Open the completed job

Eventually, the job will have “completed”, and the screen, on refreshing, will look as follows. Click on the link representing the job name.

...

3.6 Prepare to download the results returned by the EF_Rsync_Script_Generator algorithm

At this time, the screen should look like the following:


...

3.7 Download the script returned by the EF_Rsync_Script_Generator algorithm

 

At this point, the script will be downloaded to your computer’s hard disk, and the message at the bottom left of your browser window will be replaced by just the name of the downloaded file:


 

3.8 Run the script returned by the EF_Rsync_Script_Generator algorithm

Windows users please note: before proceeding, you will need to complete additional steps to prepare your machine to work with rsync. Please follow the directions here.

 

After you download the script, from the command line navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may be different depending on your machine and if you have moved the file. Here is an example:

Code Block
cd ~/Downloads

 

Once you are in the directory where the file is located, you may want to list the file to verify that the script exists and to check its size:

Code Block
ls -l EF_Rsync.sh

Then run the file you downloaded. It is a shell script. When you run it, a basic features data file and an advanced features data file for each volume in your workset will be transferred to your hard disk via the rsync utility.

 

Code Block
sh EF_Rsync.sh

 

If your workset contained N volumes with HathiTrust volume IDs V1, V2, V3,... VN respectively, then executing the shell script as shown above will cause the following compressed advanced and basic feature data files for the corresponding volumes to be transferred to your computer’s hard disk via rsync:

V1.advanced.json.bz2, V1.basic.json.bz2, 

V2.advanced.json.bz2, V2.basic.json.bz2,

V3.advanced.json.bz2, V3.basic.json.bz2,...

VN.advanced.json.bz2, VN.basic.json.bz2

For the workset in this example, because it contained only one volume, the book Buch der Lieder by Heinrich Heine with the HathiTrust volumeID mdp.39015012864743, the script will transfer two files to your machine. They are the advanced and basic feature data files for the volume in the workset:

mdp.39015012864743.advanced.json.bz2

mdp.39015012864743.basic.json.bz2 
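
The naming pattern described above can be sketched in Python. This is only an illustration of the pattern stated in this section: the helper name expected_ef_files is hypothetical, and the sketch assumes the volume IDs contain no characters that need to be substituted in file names.

```python
def expected_ef_files(volume_ids):
    """Return the advanced and basic file names expected for each volume ID."""
    names = []
    for vid in volume_ids:
        # Each volume yields one advanced and one basic compressed JSON file.
        names.append(f"{vid}.advanced.json.bz2")
        names.append(f"{vid}.basic.json.bz2")
    return names


for name in expected_ef_files(["mdp.39015012864743"]):
    print(name)
```

Comparing this list with the contents of your download directory is a simple way to confirm that the script transferred everything you expected.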

 

3.9 Uncompress the downloaded files

Because the advanced and basic feature data files will be downloaded in a compressed format, you will need to uncompress them into JSON-formatted text files.

You will now be able to view the files in the text editor of your choice, and perform text analysis with them using your own code, in the programming language(s) of your choice.

 

 

...



See the different ways you can download EF files to your local machine or inside a data capsule.



The basics

Rsync

Extracted Features are downloaded using a file transfer utility called rsync, which is a command line utility. Rsync will synchronize files from our servers to your system.  

Tips

  • Not including a path will sync all files: this is possible, but remember that (for Extracted Features 2.0) the full dataset is 4 Terabytes! Only download all of it if you know what you are doing. Most people are downloading a subset of the files. 
  • The fastest way to sync is by pointing at exactly the files that you want. Use the htrc-ef-all-files.txt file if you want the paths to all the EF files.
  • Have a custom list of volumes and want to get the file paths for it? We have created two options to help:
    • Uploading the volume list to HTRC Analytics as a workset allows you to download a shell script. Details.
    • The HTRC Feature Reader installs a command line utility, htid2rsync, and Python functions for doing the same. Details

HTRC Feature Reader

We provide a Python library called the HTRC Feature Reader, which simplifies many of the activities that you may want to perform with the EF Dataset, including generating file paths for volumes for which you know the HathiTrust volume ID. The Feature Reader is compatible with Extracted Features versions 2.0 and 1.5.

Read full documentation for the HTRC Feature Reader, including code examples.

File names

Filenames are derived from the associated volume's HathiTrust volume ID, with the following characters substituted:

original character    substitute character
:                     +
/                     =
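
As a minimal sketch, the substitutions listed above can be applied in Python as follows. The function names are hypothetical, the ark-style ID is only an illustrative input, and the .json.bz2 suffix is the one used by the EF 1.5 and 2.0 examples elsewhere on this page.

```python
def clean_volume_id(htid):
    """Apply the two documented substitutions: ':' becomes '+', '/' becomes '='."""
    return htid.replace(":", "+").replace("/", "=")


def ef_file_name(htid):
    """Build the compressed file name from the cleaned volume ID."""
    return clean_volume_id(htid) + ".json.bz2"


# Illustrative ID containing both substituted characters.
print(ef_file_name("uc2.ark:/13960/t4qj7b90c"))
```

IDs without a colon or slash, such as nyp.33433070251792, pass through unchanged.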


Unzipping downloaded files

The downloaded content is compressed with bzip2 (.bz2). You will need to locate the file(s) on your local system and uncompress any that you wish to view; the bunzip2 command may be useful. Many researchers prefer to work with the compressed files directly, and the HTRC Feature Reader works with compressed files.
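
If you would rather not uncompress files by hand, Python's standard bz2 and json modules can read them directly. The sketch below is one possible approach, not part of any HTRC tool; the function names are hypothetical, and it assumes the file holds a single JSON document encoded as UTF-8.

```python
import bz2
import json
import shutil


def load_ef_file(path):
    """Read a bzip2-compressed Extracted Features file and return the parsed JSON."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def decompress_ef_file(path, out_path):
    """Write an uncompressed copy of the file, leaving the original in place."""
    with bz2.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, load_ef_file("nyp.33433070251792.json.bz2") would return the parsed contents of that file once it has been downloaded to the current directory.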

Download Extracted Features version 2.0

Download Format

File Format

Read the docs

Sample Files

A sample of 100 extracted feature files is available for download through your browser: sample-EF202003.zip

Also, thematic collections are available to download: DocSouth_sample_EF202003.zip (82 volumes), EEBO_sample_EF202003.zip (234 volumes), ECCO_sample_EF202003.zip (412 volumes).

Filepaths

The dataset is stored in a directory specification created by HTRC called stubbytree. (Note: This is a change from previous versions of the Extracted Features dataset that used the pairtree directory specification.) Stubbytree places files in a directory structure based on the file name, with the highest level directory being the HathiTrust source code (i.e. volume ID prefix), and then using every third character of the cleaned volume ID, starting with the first, to create a sub-directory. For example the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at: 

root/nyp/33759/nyp.33433070251792.json.bz2
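
The rule described above can be sketched in a few lines of Python. This is an illustration rather than the HTRC implementation: the function name is hypothetical, the returned path is relative to the root directory, and the sketch assumes that the only cleaning needed is the two file-name substitutions listed on this page. For real work, the HTRC Feature Reader is the safer choice.

```python
def stubbytree_path(htid):
    """Return the stubbytree path (relative to the root) for a HathiTrust volume ID."""
    # The part before the first period is the source code (volume ID prefix).
    libid, volid = htid.split(".", 1)
    # Assumption: only ':' and '/' need to be substituted.
    clean = volid.replace(":", "+").replace("/", "=")
    # Every third character of the cleaned ID, starting with the first.
    stub = clean[::3]
    return f"{libid}/{stub}/{libid}.{clean}.json.bz2"


print(stubbytree_path("nyp.33433070251792"))
# nyp/33759/nyp.33433070251792.json.bz2
```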

Download Options

Rsync

The rsync module (or alias path) for Extracted Features 2.0 is data.analytics.hathitrust.org::features-2020.03/ . 

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 4 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

A full listing of all the files is available from:


Code Block
rsync -azv data.analytics.hathitrust.org::features-2020.03/listing/file_listing.txt .


It is possible to sync any single Extracted Features file in the following manner:


Code Block
rsync -av data.analytics.hathitrust.org::features-2020.03/{PATH-TO-FILE} .


Rather than learning the stubbytree specification, we recommend using the HTRC Feature Reader’s command line htid2rsync tool. For example, to rsync a single Extracted Features file when you know the HathiTrust volume ID:


Code Block
htid2rsync {VOLUMEID} | rsync --files-from - data.analytics.hathitrust.org::features-2020.03/ .


You can also download multiple files by writing the Extracted Features files’ paths to a text file and then running the following command:


Code Block
rsync -av --files-from FILE.TXT data.analytics.hathitrust.org::features-2020.03/ .


You can sync into a single folder, throwing away the directory structure, by adding --no-relative to the rsync command:


Code Block
rsync -av --no-relative --files-from FILE.TXT data.analytics.hathitrust.org::features-2020.03/ .
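
If you assemble your path list in Python, the same commands can be built programmatically. The sketch below is a hedged illustration: the helper names and the EF2_MODULE constant are hypothetical, the rsync flags are the ones shown in the commands above, and rsync must already be installed for run_sync to work.

```python
import subprocess

# Rsync module for Extracted Features 2.0, as given on this page.
EF2_MODULE = "data.analytics.hathitrust.org::features-2020.03/"


def write_files_from(paths, list_file):
    """Write one file path per line, the layout rsync's --files-from expects."""
    with open(list_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


def build_rsync_command(list_file, dest=".", flatten=False, module=EF2_MODULE):
    """Return the rsync command as an argument list, without running it."""
    command = ["rsync", "-av"]
    if flatten:
        # Sync into a single folder, discarding the directory structure.
        command.append("--no-relative")
    command += ["--files-from", list_file, module, dest]
    return command


def run_sync(list_file, dest=".", flatten=False):
    """Run the transfer; raises CalledProcessError if rsync reports a failure."""
    subprocess.run(build_rsync_command(list_file, dest, flatten), check=True)


print(" ".join(build_rsync_command("FILE.TXT")))
```

Building the argument list separately from running it lets you inspect the command before any data is transferred.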

Converting HathiTrust Volume ID to stubbytree URL using HTRC Feature Reader

If you already have a list of HT volume IDs, you can use the HTRC Feature Reader, a Python library developed by the HTRC, to prepare to rsync your volumes of interest. Here is an example showing how the conversion of one HT volume ID into an rsync URL would be incorporated into Python:

Code Block
from htrc_features import utils
utils.id_to_rsync('hvd.32044140344292')  # 'format' is an optional parameter; its default value is format='stubbytree'


Feature Reader also comes with a command line utility called  htid2rsync which can be used to generate filepaths to EF 2.0 data:

Code Block
$ htid2rsync hvd.32044140344292
hvd/34449/hvd.32044140344292.json.bz2

Use HTRC EF Download Helper Algorithm


To download the Extracted Features data for a specific workset in HTRC Analytics, use the Extracted Features Download Helper, an algorithm that generates the rsync download script. The tool can also be useful if you don’t want to go through the process of determining file paths. Select version 2.0 when you run the algorithm to get the Extracted Features 2.0 version of the files.

The algorithm creates a shell script that you can download and run from your local command line. The file lists the rsync commands for every volume in an HTRC workset. Once you have run the algorithm and downloaded the resulting file, you will run the resulting .sh file.

Running the algorithm

Go to HTRC Analytics

Navigate to https://analytics.hathitrust.org and log in.


Go to the Worksets page of HTRC Analytics

Click on the 'Worksets' link near the top of the screen. From the list of worksets that appear, choose the one you would like to get Extracted Features for and click on its name.




From the 'Analyze with Algorithm' drop down menu, choose the Extracted Features Download Helper algorithm.


This algorithm generates a script for downloading the feature data files that correspond to your workset.


Execute the  Extracted Features Download Helper algorithm

Specify a job name of your choosing. Select Extracted Features 2.0 from the dataset drop down. Then, click the ‘Submit’ button.




Wait until the algorithm has finished and then open the completed job to download

Eventually, the job will complete, and it will move to the "Completed Jobs" section of the page. Click on the link representing the job name to see the results.

 


From the results page, click the blue button to download the shell script you will use to get the Extracted Features. The file will go wherever downloads typically end up on your machine, often the Downloads folder.



Run the shell script to download

Windows users: make sure your computer is set up to run rsync. We recommend cwRsync (Rsync for Windows) or Cygwin.

Navigate to the directory where the script file is located

From the command line, navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may be different depending on your machine and if you have moved the file. Here is an example:

Code Block
cd ~/Downloads


Run the shell script

Then run the file you downloaded. It is a shell script.


Code Block
sh EF_Rsync.sh


When you run it, the Extracted Features files for the volumes in your workset will be downloaded. Note that your workset could contain a volume for which an Extracted Features File has not been created due to the way that the versions of Extracted Features are snapshots of the HathiTrust corpus at specific points in time.

The files will download in the stubbytree directory structure used for Extracted Features 2.0.
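
Because some volumes in a workset may have no Extracted Features file, it can help to check afterwards which files actually arrived. The following sketch assumes you have a list of the relative paths you expected (for example, the paths named in the downloaded script); the function name is hypothetical.

```python
from pathlib import Path


def missing_downloads(expected_paths, root="."):
    """Return the expected relative paths that are not present under root."""
    root = Path(root)
    return [path for path in expected_paths if not (root / path).is_file()]
```

An empty list means that every expected file was transferred; any paths returned correspond to volumes that were skipped or not available in this version of the dataset.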


(Optional) Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. The compression used is bzip2. If you are using the files with the HTRC Feature Reader, the library will deal with the compression automatically.

Downloading Extracted Features version 1.5

Download Format

File Format

/wiki/spaces/DEV/pages/43125245

Sample Files

A sample of 100 extracted feature files is available for download through your browser: sample-EF201801.zip.

Also, thematic collections are available to download: DocSouth_sample_EF201801.zip (87 volumes), EEBO_sample_EF201801.zip (355 volumes), ECCO_sample_EF201801.zip (505 volumes).

Filepaths

The data is stored in a pairtree directory structure, allowing you to infer the location of any file based on its HathiTrust volume identifier. Pairtree format is an efficient directory structure, which is important for HathiTrust-scale data, where the files are placed in directories based on “pairs” of characters in their file names. For example the Extracted Features file for the volume with HathiTrust ID mdp.39015073767769 would be located at: 

mdp/pairtree_root/39/01/50/73/76/77/69/39015073767769

When you download the files, they will sync in pairtree directory structure, as well. 
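
To make the pairing rule concrete, here is a minimal Python sketch that needs no external library. It is an illustration only: the function name is hypothetical, and it handles just three character substitutions (':' to '+', '/' to '=', and '.' to ','), whereas the full pairtree specification defines further encoding rules. The file name pattern follows the htid2rsync --oldstyle output shown later on this page.

```python
def pairtree_path(htid):
    """Return the pairtree-style path for a HathiTrust volume ID (EF 1.5 layout)."""
    libid, volid = htid.split(".", 1)
    # Assumption: only these three substitutions are needed for the ID.
    clean = volid.replace(":", "+").replace("/", "=").replace(".", ",")
    # Split the cleaned ID into consecutive two-character directory names.
    pairs = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    filename = f"{libid}.{clean}.json.bz2"
    return "/".join([libid, "pairtree_root", *pairs, clean, filename])


print(pairtree_path("hvd.32044140344292"))
# hvd/pairtree_root/32/04/41/40/34/42/92/32044140344292/hvd.32044140344292.json.bz2
```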

Download Options

Rsync

The Rsync module (or alias path) for Extracted Features 1.5 is data.analytics.hathitrust.org::features-2018.01/ . 

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 4 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

A full listing of all the files is available from:


Code Block
rsync -azv data.analytics.hathitrust.org::features-2018.01/listing/file_listing.txt .


In order to rsync a file or set of files, you must know their directory path on HTRC’s servers. It is possible to sync any single Extracted Features file in the following manner:


Code Block
rsync -av data.analytics.hathitrust.org::features-2018.01/{PATH-TO-FILE} .


Rather than learning the pairtree specification, we recommend using the HTRC Feature Reader’s command line htid2rsync tool. For example, to rsync a single Extracted Features file when you know the HathiTrust volume ID:


Code Block
htid2rsync {VOLUMEID} | rsync --files-from - data.analytics.hathitrust.org::features-2018.01/ .

You can also download multiple files by writing the Extracted Features files’ paths to a text file and then running the following command:


Code Block
rsync -av --files-from FILE.TXT data.analytics.hathitrust.org::features-2018.01/ .


You can sync into a single folder, throwing away the directory structure, by adding --no-relative to the rsync command:


Code Block
rsync -av --no-relative --files-from FILE.TXT data.analytics.hathitrust.org::features-2018.01/ .


Converting HathiTrust Volume ID to rsync URL using HTRC Feature Reader

If you already have a list of HT volume IDs, you can use a Python library developed by the HTRC called the HTRC Feature Reader library, to prepare to rsync your volumes of interest. Here is an example showing the conversion of one HT volume ID into an rsync url:  

Code Block
from htrc_features import utils
utils.id_to_rsync('hvd.32044140344292', format='pairtree')


Feature Reader also comes with a command line utility called  htid2rsync which can be used to generate filepaths to EF 1.5 data using the flag --oldstyle:

Code Block
$ htid2rsync hvd.32044140344292 --oldstyle
hvd/pairtree_root/32/04/41/40/34/42/92/32044140344292/hvd.32044140344292.json.bz2


Don't want to use the Feature Reader?

Converting HathiTrust Volume ID to RSync URL (using Python)

Researchers who have a list of HT volume IDs but prefer not to use the HTRC Feature Reader can convert HT volume IDs into rsync URLs using a Python script. This example is a simplified part of a longer notebook, which further describes how to collect and download large lists of volumes: ID to EF Rsync Link.ipynb

If you don't have it, you may have to install the pairtree library with:  pip install pairtree (Python 2.x only).

 

Code Block
from pairtree import id2path, id_encode

def id_to_rsync(htid):
    '''
    Take an HTRC id and convert it to an Rsync location for syncing Extracted Features
    '''
    # Split the library prefix from the volume part of the ID
    libid, volid = htid.split('.', 1)
    # Encode characters that are not safe in file names
    volid_clean = id_encode(volid)
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    # id2path may return backslashes on Windows; rsync paths use forward slashes
    path = '/'.join([libid, 'pairtree_root', id2path(volid).replace('\\', '/'), volid_clean, filename])
    return path


Example: 

Code Block
id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'



Workset Builder 2.0

Extracted Features 1.5 files can also be downloaded for search results in HTRC's beta Workset Builder 2.0. After completing your search, you can download either the Extracted Features files for your results set or for single files from your results.

First, enter search terms for a desired set of volumes. Once results are returned, filter the results to remove any volumes for which you do not want Extracted Features files or to find the volume(s) most relevant to your work.


Download for all results

You can download the files for your entire results set. Once your results set includes all of the volumes you'd like Extracted Features files for, click "JSON Extracted Features (ZIP)" under the "Export Search Results" heading at the top of the left sidebar. Clicking this link will start a download of the Extracted Features files for all volumes in your results. Since this process fetches and compresses files, and result sets can be large, it may take a few moments to start and finish your download.




Download single file

If you'd like to download Extracted Features files for individual volumes, you can click the link "Download Extracted Features (complete volume)" under each volume in the search results. This will download uncompressed Extracted Features for the given volume, in JSON.



Use HTRC EF Download Helper Algorithm

To download the Extracted Features data for a specific workset in HTRC Analytics, use the Extracted Features Download Helper, an algorithm that generates the rsync download script. The tool can also be useful if you don’t want to go through the process of determining file paths. Select version 1.5 when you run the algorithm to get the Extracted Features 1.5 version of the files.

The algorithm creates a shell script that you can download and run from your local command line. The file lists the rsync commands for every volume in an HTRC workset. Once you have run the algorithm and downloaded the resulting file, you will run the resulting .sh file.


Running the algorithm

Go to HTRC Analytics

Navigate to https://analytics.hathitrust.org and log in.


Go to the Worksets page of HTRC Analytics

Click on the 'Worksets' link near the top of the screen. From the list of worksets that appear, choose the one you would like to get Extracted Features for and click on its name.




From the 'Analyze with Algorithm' drop down menu, choose the Extracted Features Download Helper algorithm.


This algorithm generates a script for downloading the feature data files that correspond to your workset.


Execute the  Extracted Features Download Helper algorithm

Specify a job name of your choosing. Select Extracted Features 1.5 from the dataset drop down. Then, click the ‘Submit’ button.




Wait until the algorithm has finished and then open the completed job to download

Eventually, the job will complete, and it will move to the "Completed Jobs" section of the page. Click on the link representing the job name to see the results.

 


From the results page, click the blue button to download the shell script you will use to get the Extracted Features. The file will go wherever downloads typically end up on your machine, often the Downloads folder.



Run the shell script to download

Windows users: make sure your computer is set up to run rsync. We recommend cwRsync (Rsync for Windows) or Cygwin.

Navigate to the directory where the script file is located

From the command line, navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may be different depending on your machine and if you have moved the file. Here is an example:

Code Block
cd ~/Downloads


Run the shell script

Then run the file you downloaded. It is a shell script.


Code Block
sh EF_Rsync.sh


When you run it, the Extracted Features files for the volumes in your workset will be downloaded. Note that your workset could contain a volume for which an Extracted Features File has not been created due to the way that the versions of Extracted Features are snapshots of the HathiTrust corpus at specific points in time.

The files will download in pairtree directory structure.


(Optional) Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. The compression used is bzip2. If you are using the files with the HTRC Feature Reader, the library will deal with the compression automatically.


Downloading Extracted Features version 0.2

Download Format

File Format

Read the docs

Sample Files

A sample of 100 extracted feature files is available for download through your browser: sample-EF201801.zip.

Also, thematic collections are available to download: DocSouth_sample_EF201801.zip (87 volumes), EEBO_sample_EF201801.zip (355 volumes), ECCO_sample_EF201801.zip (505 volumes).

Filepaths

The data is stored in a pairtree directory structure, allowing you to infer the location of any file based on its HathiTrust volume identifier. Pairtree format is an efficient directory structure, which is important for HathiTrust-scale data, where the files are placed in directories based on “pairs” of characters in their file names. For example the Extracted Features file for the volume with HathiTrust ID mdp.39015073767769 would be located at: 

mdp/pairtree_root/39/01/50/73/76/77/69/39015073767769

When you download the files, they will sync in pairtree directory structure, as well. 

Download Options

Rsync

Rsync will download each feature file individually, following a pairtree directory structure.

The Rsync module (or alias path) for Extracted Features 0.2 is data.analytics.hathitrust.org::features-2015.02

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 1.2 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

A full listing of all the files is available from:

Code Block
rsync -azv data.analytics.hathitrust.org::features-2015.02/listing/file_listing.txt .


Users hoping for a more flexible file listing can use rsync's --list-only flag.

To rsync only the files in a given text file:

Code Block
rsync -av --files-from FILE.TXT data.analytics.hathitrust.org::features-2015.02/ .


Use HTRC EF Download Helper Algorithm


To download the Extracted Features data for a specific workset in HTRC Analytics, use the Extracted Features Download Helper, an algorithm that generates the rsync download script. The tool can also be useful if you don’t want to go through the process of determining file paths. Select version 0.2 when you run the algorithm to get the Extracted Features 0.2 version of the files.

The algorithm creates a shell script that you can download and run from your local command line. The file lists the rsync commands for every volume in an HTRC workset. Once you have run the algorithm and downloaded the resulting file, you will run the resulting .sh file.


Running the algorithm

Go to HTRC Analytics

Navigate to https://analytics.hathitrust.org and log in.


Go to the Worksets page of HTRC Analytics

Click on the 'Worksets' link near the top of the screen. From the list of worksets that appear, choose the one you would like to get Extracted Features for and click on its name.




From the 'Analyze with Algorithm' drop down menu, choose the Extracted Features Download Helper algorithm.


This algorithm generates a script for downloading the feature data files that correspond to your workset.


Execute the  Extracted Features Download Helper algorithm

Specify a job name of your choosing. Select Extracted Features 0.2 from the dataset selection drop down. Then, click the ‘Submit’ button.




Wait until the algorithm has finished and then open the completed job to download

Eventually, the job will complete, and it will move to the "Completed Jobs" section of the page. Click on the link representing the job name to see the results.

 


From the results page, click the blue button to download the shell script you will use to get the Extracted Features. The file will go wherever downloads typically end up on your machine, often the Downloads folder.



Run the shell script to download

Windows users: make sure your computer is set up to run rsync. We recommend cwRsync (Rsync for Windows) or Cygwin.

Navigate to the directory where the script file is located

From the command line, navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may be different depending on your machine and if you have moved the file. Here is an example:

Code Block
cd ~/Downloads


Run the shell script

Then run the file you downloaded. It is a shell script.


Code Block
sh EF_Rsync.sh


When you run it, the Extracted Features files for the volumes in your workset will be downloaded. Note that your workset could contain a volume for which an Extracted Features File has not been created due to the way that the versions of Extracted Features are snapshots of the HathiTrust corpus at specific points in time.

The files will download in pairtree directory structure.


(Optional) Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. The compression used is bzip2. If you are using the files with the HTRC Feature Reader, the library will deal with the compression automatically.