
Overview

HTRC Extracted Features are one of the ways in which users of HTRC's tools can perform non-consumptive analysis of text in the HathiTrust Digital Library (HTDL) corpus. As with most HTRC functions, Extracted Features are available for HTRC Worksets. Worksets are user-created collections of volumes from the HTDL. They can range from small (one volume) to large (thousands of volumes or more) and, at their core, consist simply of a list of HathiTrust Volume IDs.

This workset, or list of volume IDs, can be created by building a collection of volumes in HTDL and downloading its metadata, which contains the volume IDs, or by using one of the other metadata sources from HathiTrust. Contact htc-help@hathitrust.org if you need assistance creating a list of volume IDs.

Download Format

Files

The HTRC Extracted Features files are formatted in JSON. For more information about the fields, see the documentation for each release.

Sample file: 

Code Block
languagejs
titleExample EF data for basic features for a single page
{  "id":"loc.ark:/13960/t1fj34w02",
   "metadata":{
      "schemaVersion":"1.2",
      "dateCreated":"2015-02-12T13:30",
      "title":"Shakespeare's Romeo and Juliet,",
      "pubDate":"1920",
      "language":"eng",
      "htBibUrl":"http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02.json",
      "handleUrl":"http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
      "oclc":"",
      "imprint":"Scott Foresman and company, [c1920]"
   },
   "features":{
      "schemaVersion":"2.0",
      "dateCreated":"2015-02-20T11:31",
      "pageCount":230,
      "pages":[
        {"seq":"00000015",
          "tokenCount":212,
          "lineCount":38,
          "emptyLineCount":10,
          "sentenceCount":7,
          "languages":[{"en":"1.00"}],
          "header":{
             "tokenCount":7,
             "lineCount":3,
             "emptyLineCount":1,
             "sentenceCount":1,
             "tokenPosCount":{
                "I.":{"NN":1},
                "THE":{"DT":1},
                "INTRODUCTION":{"NN":1},
                "DRAMA":{"NNPS":1},
                "SHAKESPEARE":{"NNP":1},
                "ENGLISH":{"NNP":1},
                "AND":{"CC":1}}},
          "body":{
             "tokenCount":205,
             "lineCount":35,
             "emptyLineCount":9,
             "sentenceCount":6,
             "tokenPosCount":{
                "striking":{"JJ":1},
                "his":{"PRP$":1},
                 "plays":{"NNS":1},
                "London":{"NNP":1},
                "four":{"CD":1},
                ".":{".":7},
                "dramatic":{"JJ":2},
                "1576":{"CD":1},
                "stands":{"VBZ":1},
                ...
                "growth":{"NN":1}
             }
          },
          "footer":{
             "tokenCount":0,
             "lineCount":0,
             "emptyLineCount":0,
             "sentenceCount":0,
             "tokenPosCount":{}}}]}}
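
To illustrate how the nested structure above can be traversed, here is a minimal Python sketch that totals token counts for a volume. The SAMPLE record is a trimmed, hypothetical stand-in shaped like the example, and volume_token_counts is an illustrative helper name, not part of any HTRC library.

```python
import json
from collections import Counter

# Trimmed, hypothetical stand-in for one EF file, shaped like the sample above.
SAMPLE = json.loads("""
{"id": "loc.ark:/13960/t1fj34w02",
 "features": {"pageCount": 1, "pages": [
   {"seq": "00000015", "tokenCount": 3,
    "header": {"tokenCount": 1, "tokenPosCount": {"THE": {"DT": 1}}},
    "body": {"tokenCount": 2, "tokenPosCount": {"dramatic": {"JJ": 2}}},
    "footer": {"tokenCount": 0, "tokenPosCount": {}}}]}}
""")


def volume_token_counts(volume, sections=("header", "body", "footer")):
    """Sum token counts across all pages, page sections, and POS tags."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        for section in sections:
            # tokenPosCount maps token -> {part-of-speech tag -> count}
            for token, pos_counts in page[section]["tokenPosCount"].items():
                totals[token] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(SAMPLE)
print(counts.most_common(2))
```

Restricting the sections argument to ("body",) would leave out running headers and footers, which often repeat on every page.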


Filepaths

HTRC Extracted Features files gathered through rsync will download in pairtree format. Pairtree is a hierarchical filesystem convention developed by the University of California Curation Center that maps identifier strings (in this case the HathiTrust Volume ID) to directory paths two characters at a time. The filepaths keep the institutional short code (e.g. mdp, uc2) at the front of each HTID intact.
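
The mapping can be sketched in a few lines of Python with no third-party dependencies. This is an illustration only: pairtree_encode and ef_relative_path are hypothetical helper names, the character rules follow the published Pairtree specification as I understand it, and the file naming is inferred from the example output shown later on this page.

```python
def pairtree_encode(identifier):
    """Apply the Pairtree character mapping to an identifier string."""
    hex_chars = set('"*+,<=>?\\^|')
    out = []
    for ch in identifier:
        # Step 1: hex-encode reserved characters and anything outside
        # visible ASCII as ^hh (one ^hh group per UTF-8 byte).
        if ch in hex_chars or not (0x21 <= ord(ch) <= 0x7E):
            out.append(''.join('^%02x' % b for b in ch.encode('utf-8')))
        else:
            out.append(ch)
    # Step 2: single-character substitutions for '/', ':' and '.'.
    return ''.join(out).translate(str.maketrans('/:.', '=+,'))


def ef_relative_path(htid):
    """Build the relative pairtree path for a HathiTrust volume ID (sketch)."""
    libid, volid = htid.split('.', 1)  # institutional short code stays intact
    clean = pairtree_encode(volid)
    # Two characters per directory level
    segments = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    filename = libid + '.' + clean + '.json.bz2'
    return '/'.join([libid, 'pairtree_root'] + segments + [clean, filename])


print(ef_relative_path('miun.adx6300.0001.001'))
```

Note how the periods in the volume part of the ID become commas, while the period separating the short code from the rest of the ID is used only for splitting.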


Preparing to Rsync

Converting HathiTrust Volume ID to Rsync URL (using HTRC Feature Reader)

If you already have a list of HT volume IDs, you can use the HTRC Feature Reader, a Python library developed by the HTRC, to prepare to rsync your volumes of interest. (Note: The HTRC Feature Reader library can do many other things as well!) The library includes a convenience function, htrc_features.utils.id_to_rsync(htid), which transforms an HT Volume ID into a URL for rsync; call it once per ID to convert a whole list. Here is an example showing the conversion of one HT volume ID into an rsync URL:

 

Code Block
languagepy
>>> from htrc_features import utils
>>> utils.id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

 

Converting HathiTrust Volume ID to Rsync URL (using Python)

Researchers who have a list of HT volume IDs but prefer not to use the HTRC Feature Reader can convert the IDs into rsync URLs using a Python script. This example is a simplified part of a longer notebook, which further describes how to collect and download large lists of volumes: ID to EF Rsync Link.ipynb

If the pairtree library is not already installed, install it with: pip install pairtree (Python 2.x only).

 

Code Block
languagepy
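from pairtree import id2path, id_encode

def id_to_rsync(htid):
    '''
    Take an HTRC id and convert it to an Rsync location for syncing Extracted Features
    '''
    # Split the institutional short code (e.g. 'miun') from the rest of the ID
    libid, volid = htid.split('.', 1)
    # Encode characters that are unsafe in file names (e.g. '.' becomes ',')
    volid_clean = id_encode(volid)
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    # id2path yields the two-character directory levels; normalize Windows separators
    path = '/'.join([libid, 'pairtree_root', id2path(volid).replace('\\', '/'), volid_clean, filename])
    return path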

 

Example:

 

Code Block
languagepy
>>> id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'



The Extracted Features for this volume can then be downloaded using rsync, substituting the path returned above for {{URL}}:

 

Code Block
languagebash
rsync -azv data.analytics.hathitrust.org::features/{{URL}} .
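
For more than a handful of volumes, one option is to write all of the converted paths to a text file and pass that file to rsync through its --files-from option, rather than running one command per volume. The sketch below only builds and prints the command. The helper names and the ef_paths.txt file name are hypothetical, and the host and module are copied from the command above.

```python
import shlex

# Host and module taken from the single-volume command above
HOST_MODULE = "data.analytics.hathitrust.org::features/"


def write_paths(paths, list_file):
    """Write one relative pairtree path per line for rsync --files-from."""
    with open(list_file, "w", encoding="utf-8") as fh:
        fh.write("\n".join(paths) + "\n")


def build_rsync_command(list_file, dest="."):
    """Return the rsync argument list; the caller decides whether to run it."""
    return ["rsync", "-azv", "--files-from=" + list_file, HOST_MODULE, dest]


cmd = build_rsync_command("ef_paths.txt")
# Print a copy-and-paste friendly version of the command
print(" ".join(shlex.quote(part) for part in cmd))
```

With --files-from, rsync keeps the relative directory structure of each listed path, so the files arrive in the pairtree layout described under Filepaths.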


Using the HTRC Analytics algorithm


HTRC Analytics can help researchers who are not comfortable creating rsync URLs themselves. Its Extracted Features Download Helper algorithm creates a shell script that you download and run locally to rsync the files.




Go to the Worksets page of HTRC Analytics


While logged into HTRC Analytics, click on the 'Worksets' link near the top of the screen. From the list of worksets that appear, choose the one you would like to get Extracted Features for and click on its name.




From the 'Analyze with Algorithm' drop down menu, choose the Extracted Features Download Helper algorithm.


This algorithm generates a script for downloading the feature data files that correspond to your workset.


Execute the Extracted Features Download Helper algorithm


Specify a job name of your choosing. Then, click the ‘Submit’ button.





Wait until the algorithm has finished and then open the completed job to download

Eventually, the job will complete, and it will move to the "Completed Jobs" section of the page. Click on the link representing the job name to see the results.

 



From the results page, click the blue button to download the shell script you will use to get the Extracted Features. The file will go wherever downloads typically end up on your machine, often the Downloads folder.

 

Run the script returned by the Extracted Features Download Helper algorithm

Windows users: Please follow the directions here before continuing to ensure your machine is equipped for rsync.

From the command line, navigate to the directory where the script file is located. This directory will typically be called Downloads, though the location may differ depending on your machine and whether you have moved the file. Here is an example:

Code Block
cd ~/Downloads


Then run the shell script you downloaded. When you run it, a basic features data file and an advanced features data file for each volume in your workset will be transferred to your hard disk via the rsync utility.


Code Block
sh EF_Rsync.sh


If your workset contained N volumes with HathiTrust volume IDs V1, V2, V3,... VN respectively, then executing the shell script as shown above will cause the following feature data files for the corresponding volumes to be transferred to your computer’s hard disk via rsync: V1.json.bz2, V2.json.bz2, V3.json.bz2, ..., VN.json.bz2. See Filepaths above to learn more about the pairtree structure the Extracted Features files follow.


(Optional) Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. The compression used is bzip2. If you are using the files with the HTRC Feature Reader, the library will deal with the compression automatically.
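
If you are not using the Feature Reader, Python's standard bz2 and json modules are enough to read or uncompress a file. The following is a minimal sketch: load_ef_file and decompress_ef_file are hypothetical helper names, and the demo writes a tiny stand-in file to a temporary directory instead of using a real download.

```python
import bz2
import json
import os
import tempfile


def load_ef_file(path):
    """Read a bzip2-compressed EF file and return the parsed JSON."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def decompress_ef_file(path):
    """Write an uncompressed .json copy next to a .json.bz2 file."""
    if not path.endswith(".bz2"):
        raise ValueError("expected a .bz2 file: " + path)
    out_path = path[:-len(".bz2")]
    with bz2.open(path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in 1 MiB chunks so large volumes are not held in memory
        for chunk in iter(lambda: src.read(1 << 20), b""):
            dst.write(chunk)
    return out_path


# Demo with a tiny stand-in file (not real EF data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "V1.json.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as fh:
        json.dump({"id": "V1", "features": {"pageCount": 0, "pages": []}}, fh)
    volume = load_ef_file(demo_path)
    json_path = decompress_ef_file(demo_path)
    with open(json_path, encoding="utf-8") as fh:
        roundtrip = json.load(fh)

print(volume["id"], os.path.basename(json_path))
```

Reading directly with load_ef_file avoids keeping a second, much larger uncompressed copy of every volume on disk.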

Workset Builder 2.0

Extracted Features files can also be downloaded for search results in Workset Builder 2.0. First, enter search terms for a desired set of volumes. Once results are returned, filter them to remove any volumes for which you do not want Extracted Features files. Once your result set includes all of the volumes you'd like Extracted Features files for, click "JSON Extracted Features (ZIP)" under the "Export Search Results" heading at the top of the left sidebar:



Clicking this link will start a download of the Extracted Features files for all volumes in your results. Since this process is fetching and compressing files, and result sets can be large, it may take a few moments to start and finish your download.

Alternatively, you can download Extracted Features files for individual volumes by following the same steps as above, and then clicking the "Download Extracted Features (complete volume)" link under each desired volume in the search results:



This will download uncompressed Extracted Features for the given volume, in JSON.