User Guide to Downloading Extracted Features for your Custom Workset

This page introduces the Extracted Features functionality (currently in beta release) recently developed by the HathiTrust Research Center (HTRC). It is one of the ways in which users of HTRC's tools can perform non-consumptive analysis of subsets of the HathiTrust Digital Library's corpus that they have selected themselves using the HTRC workset mechanism.

(Currently, this functionality is available only for the HathiTrust Digital Library's public domain corpus, consisting of slightly less than 5 million volumes.)

1. Overview

This document is a starting point for users who want to download the JSON-format Extracted Features (EF) data files for the specific HathiTrust Digital Library volumes that make up a custom workset built with the HTRC Workset Builder. We show, step by step, how to create a custom workset and then how to download the data files corresponding to its contents.

2. Create your workset

This section shows you how to create a custom workset. Later, you will download the advanced and basic EF data files for the volume(s) it contains. Your workset can contain as many volumes as you wish. For simplicity, however, the example workset in this section consists of a single volume from the HathiTrust Digital Library's public domain collection: a 1920 edition of Buch der Lieder, a book of poems by the German poet Heinrich Heine. Section 3 then shows how to download the EF data files corresponding to this workset. (One of the use cases that we have posted for the EF approach to non-consumptive text analysis also uses this particular book by Heine to make its point.)

2.1 Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC)

Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC) page. Click on the link stating “Sign In” at the upper right corner of the screen.

2.2 Sign in to HTRC SHARC

After step 2.1, you will reach the screen shown below. Enter your HTRC username and password in the respective fields, and then click on the “Sign In” button.

2.3 Verify that you are logged in to HTRC SHARC

After step 2.2, you will arrive at the screen shown below. Verify that your HTRC username appears at the upper right corner, showing that you are successfully logged in to HTRC SHARC.

2.4 Prepare to create a workset

This instruction set shows you how to create a new workset using the HTRC Workset Builder. You will search the HathiTrust Digital Library’s collection and then select some or all of the search results to constitute your workset. For simplicity, the example workset consists of a single volume, for which you will be able to download the basic and advanced feature data files by the end of these instructions.

Other ways of creating worksets, or of making use of public worksets created by other users, also exist. For example, if you already have a comma-separated values (CSV) file that specifies the list of HathiTrust volumeIDs corresponding to the volumes that you want your workset to comprise, you can simply upload it using the "Upload workset" link. For more information about creating, uploading and browsing worksets, you can consult the instructions available at the HTRC Wiki.
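As a rough illustration of the CSV route mentioned above, the following shell sketch writes a list of volumeIDs to a file. The function name write_workset_csv is hypothetical, and the layout (one volumeID per line, no header row) is an assumption: the exact format that the "Upload workset" form expects is not described on this page, so check the HTRC Wiki before uploading.

```shell
# Write one HathiTrust volumeID per line to a CSV file.
# ASSUMPTION: a single column with no header row; the layout that the
# "Upload workset" form actually requires is not documented on this page.
write_workset_csv() {
  outfile=$1
  shift
  # Each remaining argument is treated as one volumeID.
  printf '%s\n' "$@" > "$outfile"
}

# Example: a one-volume list for the Heine edition used in this guide.
write_workset_csv HeinePoetry.csv mdp.39015012864743
```

Additional volumeIDs can be passed as further arguments, each of which becomes its own line in the file.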

Click on 'Create Workset'.

2.5 Access the HTRC Workset Builder

You should now be at the screen shown below. Click on the “More options” link.

2.6 Prepare to search the HathiTrust Digital Library

You should now be at the screen shown below. Enter Buch der Lieder in the ‘Title’ field, Heine in the ‘Author’ field and 1920 in the ‘Publish Date’ field as shown below. Then click the “Search” button.

2.7 Select specific volume(s) from the search results

Select the first of the three volumes that show up, by clicking the checkbox next to ‘Select’, as shown below.

2.8 Prepare to view the selected volume(s)

Click on “Selected Items” from among the options at the upper right corner of the screen (as shown below).

2.9 Prepare to create a workset consisting of the selected volume(s)

The volume you selected now shows up as a selected item, as shown below. Click on the “Create/Update Workset” link that is within the grey area.

2.10 Name and describe the workset that is about to be created

Enter a name (for example, 'HeinePoetry') in the ‘Name:’ field as shown below. Enter a description in the ‘Description:’ field and set the availability to ‘Private’ or ‘Public’ as you prefer. Then click on the ‘Create’ button.

2.11 Verify that the workset has been created

You should see a message, as shown below, stating that the workset has been created.

3. Generate and execute the data file transfer script

3.1 Return to the HTRC SHARC screen and prepare to execute an algorithm

Click on the ‘Portal’ link, which is at the top right corner of the screen. You should now be back at the HTRC SHARC screen, as shown below. Click on the 'Algorithms' link, which is near the top of the screen.

3.2 Prepare to execute the EF_Rsync_Script_Generator algorithm

From the list of algorithms that shows up (as shown below), click on EF_Rsync_Script_Generator. This is the algorithm for generating the script for downloading the feature data files that correspond to your workset.

3.3 Execute the EF_Rsync_Script_Generator algorithm

Specify a job name of your choosing. You also have to select the workset that the EF_Rsync_Script_Generator algorithm will run against: select the option labeled “Select a workset from my worksets” and choose your desired workset. Your screen should now look like the figure below. At this point, click the ‘Submit’ button.

3.4 Check the status of the EF_Rsync_Script_Generator algorithm's execution

You can now see the status of the job, as shown below. The status will initially show as “Staging”. (If you refresh the screen after some time, you will see that the status has changed to “Queued”.)

3.5 Open the completed job

Eventually, the job status will change to “completed”. After you refresh the screen, it will look as follows. Click on the link representing the job name.

3.6 Prepare to download the results returned by the EF_Rsync_Script_Generator algorithm

At this time, the screen should look like the following:

At the very bottom left of your browser window, you will see a message like the following. (The number within the parentheses may vary, depending on how many times you have executed this step before. If you are doing this step for the first time, there will be no parentheses.) Press the “Keep” button.

3.7 Download the script returned by the EF_Rsync_Script_Generator algorithm

At this point, the script will be downloaded to your computer’s hard disk, and the message at the bottom left of your browser window will be replaced by just the name of the downloaded file:

3.8 Additional steps for Windows users only

These additional steps are relevant only to Windows users. Mac or Linux/Unix users can skip to section 3.9.

Windows users will need to install a Unix shell in order to run the rsync script. The most common one is Cygwin. Please refer to its documentation on how to install it, taking care to include the rsync package in the installation.

If you are new to Cygwin, you may not be familiar with how to navigate to your C: drive to find the HTRC-generated rsync file you just downloaded. The following FAQ page provides information about how to locate your C: drive from the Cygwin shell.
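The two checks described in this section can be sketched in shell as follows. The function names require_cmd and find_ef_script are hypothetical helpers written for this guide, not part of Cygwin or HTRC. The sketch relies on Cygwin's default behavior of exposing the C: drive as /cygdrive/c. The Downloads folder locations in the usage note are assumptions about where your browser saved the file.

```shell
# Report whether a command (for example rsync) is available on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# Search the given directories, in order, for the downloaded EF_Rsync.sh
# and print the first match. Returns non-zero if none contains the file.
find_ef_script() {
  for dir in "$@"; do
    if [ -f "$dir/EF_Rsync.sh" ]; then
      printf '%s\n' "$dir/EF_Rsync.sh"
      return 0
    fi
  done
  return 1
}
```

From a Cygwin shell, you might call require_cmd rsync to confirm that the rsync package was installed, and then find_ef_script "/cygdrive/c/Users/$USER/Downloads" "$HOME/Downloads" to look for the script. Adjust the directories if your Windows user name or download location differs.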

3.9 Run the script returned by the EF_Rsync_Script_Generator algorithm

Find the downloaded file on your hard disk. It is a shell script you can run from a terminal window. When you run it, a basic features data file and an advanced features data file for each volume in your workset will be transferred to your hard disk via the rsync utility.
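If you would like the transferred files to land in one place, one option is to run the script from inside a dedicated directory. The sketch below assumes that the script writes its files relative to the current working directory, which this page does not state; inspect the script (for example with cat EF_Rsync.sh) to confirm. The function name run_in_dir and the directory name ef_data are hypothetical.

```shell
# Run a downloaded transfer script from inside a dedicated directory.
# ASSUMPTION: the script writes files relative to the current directory.
run_in_dir() {
  script=$1
  outdir=$2
  # Resolve the script's absolute path before changing directory.
  script_dir=$(cd "$(dirname "$script")" && pwd) || return 1
  script_path="$script_dir/$(basename "$script")"
  mkdir -p "$outdir" || return 1
  # Use a subshell so the caller's working directory is left unchanged.
  ( cd "$outdir" && sh "$script_path" )
}
```

For example, run_in_dir EF_Rsync.sh ef_data would create ef_data (if needed) and execute the script there.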

You can confirm that the script is present, and check its size, using the ls -l command at the Unix shell prompt, and then execute the EF_Rsync script, as shown below.

unix_shell_prompt$ ls -l EF_Rsync.sh

This should verify that the EF_Rsync script exists:

-rw-r-----@ 1 sayan  GSLIS-AD\sayan  320 May  5 00:53 EF_Rsync.sh


unix_shell_prompt$ sh EF_Rsync.sh

mdp.39015012864743.advanced.json.bz2

sent 152 bytes  received 200 bytes  704.00 bytes/sec

total size is 10192  speedup is 28.95

mdp.39015012864743.basic.json.bz2

sent 1538 bytes  received 1121 bytes  5318.00 bytes/sec

total size is 171977  speedup is 64.68

If your workset contains N volumes with HathiTrust volume IDs V1, V2, V3, ..., VN, then executing the shell script as shown above transfers the following compressed advanced and basic feature data files to your computer’s hard disk via rsync:

V1.advanced.json.bz2, V1.basic.json.bz2,

V2.advanced.json.bz2, V2.basic.json.bz2,

V3.advanced.json.bz2, V3.basic.json.bz2,...

VN.advanced.json.bz2, VN.basic.json.bz2

You can then uncompress these files into text files in JSON format, view the features by opening the uncompressed files in a suitable editor (such as Oxygen), and manipulate the files programmatically. (For this particular workset, recall that there was only one volume, the book Buch der Lieder by Heinrich Heine, with the HathiTrust volumeID mdp.39015012864743. Therefore, only the files mdp.39015012864743.advanced.json.bz2 and mdp.39015012864743.basic.json.bz2 were transferred.)
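A minimal shell sketch of the follow-up steps is given below: listing the file names you should expect for a set of volume IDs, and uncompressing whatever was transferred. The function names expected_ef_files and decompress_ef_files are hypothetical, and the sketch assumes that the standard bunzip2 utility is installed (Cygwin users may need to add the bzip2 package).

```shell
# Print the two file names expected for each volume ID given as an argument.
expected_ef_files() {
  for id in "$@"; do
    printf '%s\n' "${id}.advanced.json.bz2" "${id}.basic.json.bz2"
  done
}

# Uncompress every .json.bz2 file in a directory (default: the current one),
# keeping the compressed originals alongside the .json output.
decompress_ef_files() {
  dir=${1:-.}
  for f in "$dir"/*.json.bz2; do
    [ -e "$f" ] || continue           # no matching files: nothing to do
    [ -e "${f%.bz2}" ] && continue    # already uncompressed: skip
    bunzip2 -k "$f" || return 1       # -k keeps the original .bz2 file
  done
}
```

For the example workset, expected_ef_files mdp.39015012864743 prints the two file names listed above, and decompress_ef_files run in the download directory leaves mdp.39015012864743.advanced.json and mdp.39015012864743.basic.json next to the compressed files.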