* Here we introduce the Extracted Features functionality (currently in beta release) that has recently been developed by the HathiTrust Research Center (HTRC). This functionality is one of the ways in which HTRC users can perform non-consumptive analysis of subsets of the HathiTrust Digital Library's corpus that they have custom-selected by means of the HTRC workset mechanism. (Currently, this functionality is available only for the HathiTrust Digital Library's public domain corpus, which consists of slightly fewer than 5 million volumes.)
1. Overview
This document is a starting point for users who want to download the JSON-format Extracted Features (EF) data files for the specific HathiTrust Digital Library volumes in a custom workset built with the HTRC Workset Builder. We will show, step by step, how to create a custom workset and then how to download the data files corresponding to its contents.
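Once an EF data file has been downloaded, it can be inspected with any JSON library. The following is a minimal Python sketch, not an official HTRC tool. It assumes that a file is either plain JSON or bzip2-compressed JSON with a ".bz2" suffix; that naming convention is an assumption, and the function names load_ef_file and top_level_keys are hypothetical. The sketch does not assume any particular EF schema and only lists the top-level keys it finds.

```python
import bz2
import json
from pathlib import Path


def load_ef_file(path):
    """Load one EF data file (plain or bzip2-compressed JSON) into a dict."""
    path = Path(path)
    # Assumption: compressed files carry a ".bz2" suffix; anything else is
    # treated as plain JSON text.
    opener = bz2.open if path.suffix == ".bz2" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(document):
    """Return the sorted top-level keys, without assuming any EF schema."""
    return sorted(document.keys())
```

Calling top_level_keys(load_ef_file(path)) on a downloaded file is a quick way to see how that file is organized before writing any analysis code against it.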
2. Create your workset
This section shows you how to create a custom workset; you will later download the basic and advanced EF data files for the volume(s) it contains. Your workset can contain as many volumes as you wish. For simplicity, however, the example workset in this section consists of a single volume from the HathiTrust Digital Library's public domain collection: a 1920 edition of Buch der Lieder, a book of poems by the German poet Heinrich Heine. We then show you how to download the EF data files corresponding to this workset. (One of the use cases that we have posted for the EF approach to non-consumptive text analysis also uses this book by Heine to make its point.)
2.1 Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC)
Navigate to the HTRC Secure HathiTrust Analytics Research Commons (SHARC). Click on the link stating “Sign In” at the upper right corner of the screen.
2.2 Sign in to HTRC SHARC
After Step 2.1, you will reach the screen shown below. Enter your HTRC username and password in the respective fields, and then click the “Sign In” button.
2.3 Verify that you are logged in to HTRC SHARC
After Step 2.2, you will arrive at the screen shown below. Verify that your HTRC username appears in the upper right corner, which shows that you are logged in to HTRC SHARC.
2.4 Prepare to create a workset
In these step-by-step instructions, we will show you how to create a new workset using the HTRC Workset Builder. You will search the HathiTrust Digital Library’s collection and then select some or all of the search results to constitute your workset. For simplicity, we will create a workset consisting of a single volume; by the end of these instructions, you will be able to download the basic and advanced feature data files for it.
Other ways of creating worksets, or of making use of public worksets created by other users, also exist. For example, if you already have a comma-separated values (CSV) file that lists the HathiTrust volume IDs of the volumes that you want your workset to comprise, you can simply upload it using the “Upload workset” link. For more information about creating, uploading and browsing worksets, you can consult the instructions available at the HTRC Wiki.
Click on “Create Workset”.
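If you maintain your list of volume IDs in a script, a CSV file for upload can be generated programmatically. The following Python sketch writes one volume ID per row. The exact layout that the "Upload workset" link expects is not described here, so the single-column format and the "volume_id" header are assumptions; check the HTRC Wiki for the required format. The function name write_volume_id_csv is hypothetical.

```python
import csv


def write_volume_id_csv(volume_ids, csv_path, header="volume_id"):
    """Write one volume ID per row, dropping blanks and duplicates.

    The header name is an assumption; pass header=None to omit it.
    Returns the cleaned list of IDs in their original order.
    """
    seen = set()
    cleaned = []
    for volume_id in volume_ids:
        volume_id = volume_id.strip()
        if volume_id and volume_id not in seen:
            seen.add(volume_id)
            cleaned.append(volume_id)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if header:
            writer.writerow([header])
        writer.writerows([volume_id] for volume_id in cleaned)
    return cleaned
```

Removing duplicates before writing keeps the uploaded list aligned with the set of distinct volumes that the workset is meant to contain.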
2.5 Access the HTRC Workset Builder
You should now be at the screen shown below. Click on the “More options” link.
2.6 Prepare to search the HathiTrust Digital Library
You should now be at the screen shown below. Enter Buch der Lieder in the ‘Title’ field, Heine in the ‘Author’ field and 1920 in the ‘Publish Date’ field. Then click the “Search” button.
2.7 Restart a Virtual Machine
Although we do not provide a reset button for restarting the VM directly, you can always stop the VM and then start it again. This has the same effect as pushing the reset button on a physical machine.
2.8 Delete a Virtual Machine
You can delete a VM by clicking the “Delete VM” button. The VM is then wiped, and everything inside it is permanently lost.
3. Resources
1) The source code
The code base has three parts: a web GUI, a web service, and backend scripts.
You can download the code for the web GUI from http://sourceforge.net/p/htrc/code/HEAD/tree/HTRC-UI-Portal2/.
You can download the code repository for the web service and backend scripts from https://github.com/htrc/HTRC-Data-Capsules.
2) The web interface
You can find the URL for the web front end on the HTRC production portal: https://htrc2.pti.indiana.edu