Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Table of Contents
maxLevel23

Overview

The HTRC Extracted Features functionality is are one of the ways in which users of HTRC's tools can perform non-consumptive analysis of texts text in the HathiTrust Digital Library's corpus

...

. As with most HTRC functions, Extracted Features are available for HTRC Worksets. Worksets, which are user-create collections of volumes from the HTDL, can be small (one volume) to large (more than thousands of volumes), and at their core, consist simply of a list of HathiTrust Volume IDs.  

Researchers have several options for creating their workset, including quering the HTRC Solr Proxy APIResearchers who do not yet have a workset and who only want to work with the public domain texts can create a workset in the HTRC Workset Builder

Download Format

Files

The HTRC Extracted Features files are formatted in JSON. For more information about the fields, see the documentation for each release

Sample file: 

Code Block
languagejs
titleExample EF data for basic features for a single page
{  "id":"loc.ark:/13960/t1fj34w02",
   "metadata":{
      "schemaVersion":"1.2",
      "dateCreated":"2015-02-12T13:30",
      "title":"Shakespeare's Romeo and Juliet,",
      "pubDate":"1920",
      "language":"eng",
      "htBibUrl":"http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02.json",
      "handleUrl":"http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
      "oclc":"",
      "imprint":"Scott Foresman and company, [c1920]"
   },
   "features":{
      "schemaVersion":"2.0",
      "dateCreated":"2015-02-20T11:31",
      "pageCount":230,
      "pages":[
        {"seq":"00000015",
          “tokenCount":212,
          "lineCount":38,
          "emptyLineCount":10,
          "sentenceCount":7,
          "languages":[{"en":"1.00"}],
          "header":{
             "tokenCount":7,
             "lineCount":3,
             "emptyLineCount":1,
             "sentenceCount":1,
             "tokenPosCount":{
                "I.":{"NN":1},
                "THE":{"DT":1},
                "INTRODUCTION":{"NN":1},
                "DRAMA":{"NNPS":1},
                "SHAKESPEARE":{"NNP":1},
                "ENGLISH":{"NNP":1},
                "AND":{"CC":1}}},
          "body":{
             "tokenCount":205,
             "lineCount":35,
             "emptyLineCount":9,
             "sentenceCount":6,
             "tokenPosCount":{
                "striking":{"JJ":1},
                "his":{"PRP$":1},
                 "plays":{"NNS":1},
                "London":{"NNP":1},
                "four":{"CD":1},
                ".":{".":7},
                "dramatic":{"JJ":2},
                "1576":{"CD":1},
                "stands":{"VBZ":1},
                ...
                "growth":{"NN":1}
             }
          },
          "footer":{
             "tokenCount":0,
             "lineCount":0,
             "emptyLineCount":0,
                    "sentenceCount":0,
                    "tokenPosCount":{}}}]}}

Converting ID to RSync URL (Python with HTRC Feature Reader library)

...


...

Anchor
filepaths
filepaths
Filepaths

 HTRC Extracted Features files gathered through rsync will download in pairtree format. Pairtree format is a hierarchical filesystem developed by the University of California Curation Center that maps identifier strings (in this case the HathiTrust Volume ID) to directory paths two characters at a time. The filepaths keep the institutional short code (e.g. mpd, uc2) at the front of each HTID intact.

 

Preparing to Rsync

...

Converting HathiTrust Volume ID to Rsync URL (using HTRC Feature Reader)

 If you already have a list of HT volume IDs, you can use a Python library developed by the HTRC called the HTRC Feature Reader library, to prepare to rsync your volumes of interest. (Note: The HTRC Feature Reader library can do many other things as well!) In the Feature Reader library, there is a convenience function in htrc_features.utils.id_to_rsync(htid) which aids in transforming a list of HT Volume IDs into URLs for rsync. Here is an example showing the conversion of one HT volume ID into an rsync url: 

 

Code Block
languagepy
>> from htrc_features import utils
>> utils.id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

 

Converting HathiTrust Volume ID

...

to RSync URL (using Python)

 

Researchers who have their list of HT volume IDs but prefer not to use the HTRC Feature Reader, can convert HT volume IDs into rsync URLs using a Python script. This example is a simplified part of a longer notebook, which further describes how to collect and download large lists of volumes: ID to EF Rsync Link.ipynb.  

If you don't have it, you may have to install the pairtree library with:  pip  pip install pairtree pairtree (Python 2.x only).

 

Code Block
languagepy
import os
from pairtree import id2path, id_encode
def id_to_rsync(htid):
	'''
	Take an HTRC id and convert it to an Rsync location for syncing Extracted Features
 	'''
    libid, volid = htid.split('.', 1)
    volid_clean = id_encode(volid)
    filename = '.'.join([libid, volid_clean, kind, 'json.bz2'])
    path = '/'.join([kind, libid, 'pairtree_root', id2path(volid).replace('\\', '/'), volid_clean, filename])
    return path

...

Code Block
languagebash
rsync -azv data.analytics.hathitrust.org::pd-features/{{URL}} .

If you have not yet created a workset, the Portal & Workset Builder Tutorial for v3.0 includes information on creating an account and building a Workset.

...


Using the HTRC Portal Algorithm


The HTRC Portal can aid some researchers who do not feel comfortable creating rsync URLs themselves can use the HTRC EF_Rsync_Script_Generator algorithm in the Portal to create a shell script they will download and run locally to rsync the files.

 

Image Added


Go to the Algorithms page of the HTRC Portal

While logged into the HTRC Portal with the account you created your workset with, click on the 'Algorithms' link near the top of the screen. Image Removed From the list of algorithms that shows up, click on EF_Rsync_Script_Generator. This 'algorithm' generates a script for downloading the feature data files that correspond to your workset.

 

Image Modified

Execute the EF_Rsync_Script_Generator algorithm

Specify a job name of your choosing. You also have to select a  workset that the EF_Rsync_Script_Generator algorithm will run against: Check the button saying “Select a workset from my worksets” and select your desired workset.  Your screen should now look like the figure below. At this point, click the ‘Submit’ button.

Image Modified

...


Wait until the algorithm has finished

You can now see the status of the job, as shown below. The status of the job will initially show as “Staging”. ( Refresh the screen after some time and you will see the status has changed to “Queued”. )to see its progress.

 Image Modified


Open the completed job to download

Eventually, the job will have “completed”, and the screen, on refreshing, will look as followsit will move to the "Completed Jobs" section of the page. Click on the link representing the job name .

Image Removed 

...

to initiate downloading the EF_Rsync_Script

...

At this time, the screen should look like the following:

Image Removed

At the very bottom left of your browser window, you will see a message like the following. (The number you see within the parentheses may vary, depending on how many times you have executed this step before. If doing this step for the first time, there will be no parentheses.) Press the “Keep” button.Image Removed  

 Image Added 


At this point, the script will be downloaded to your computer’s hard disk , and you will see the message at the bottom left of your browser window be replaced by just the name of the downloaded file:Image Removedwherever browser-initiated downloads are stored, likely in your Downloads folder.

 

Run the script returned by the EF_Rsync_Script_Generator algorithm

Windows users please note: Before proceeding, Windows users will need to complete additional steps to prepare their machine to work with rsync. Please follow the directions here before continuing to ensure your machine is equipped for rsync.

After you download the script, from the From the command line, navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may be different depending on your machine and if you have moved the file. Here is an example:

Code Block
cd ~/Downloads

 

Once you are in the directory where the file is located, you may be interested in checking the file size to verify that the script exists:

Code Block
ls -l EF_Rsync.sh


Then run the file you downloaded. It is a shell script. When you run it, a basic features data file and an advanced features data file for each volume in your workset will be transferred to your hard disk via the rsync utility.

...

If your workset contained N volumes with HathiTrust volume IDs V1, V2, V3,... VN respectively, then executing the shell script as shown above will cause the following feature data files for the corresponding volumes to be transferred to your computer’s hard disk via rsync: V1.json.bz2, V2.json.bz2, V3.json.bz2, ..., VN.json.bz2The workset in our example only contained one volume, Buch der Lieder by Heinrich Heine with the HathiTrust volumeID mdp.39015012864743. The corresponding file is called mdp.39015012864743.json.bz2See Filepaths above to learn more about the pairtree structure the Extracted Features files follow.


(Optional) Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. The compression used is bzip2. If you are using the files with the HTRC Feature Reader, the library will deal with the compression automatically.