
...

HTRC Extracted Features files retrieved through rsync are stored in pairtree format. Pairtree is a hierarchical filesystem convention developed by the University of California Curation Center that maps identifier strings (in this case, the HathiTrust volume ID, or HTID) to directory paths two characters at a time. The filepaths keep the institutional short code (e.g. mpd, uc2) at the front of each HTID intact.
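The mapping can be sketched in a few lines of Python. This is a minimal illustration, not the HTRC Feature Reader implementation: the function name `htid_to_pairtree_path` is hypothetical, and the sketch handles only the three single-character pairtree substitutions (`.` to `,`, `/` to `=`, `:` to `+`), not the hex-encoding that the pairtree specification applies to some other characters.

```python
def htid_to_pairtree_path(htid: str) -> str:
    """Map a HathiTrust volume ID to its pairtree-style relative path.

    Hypothetical helper for illustration only; it assumes the ID has the
    form '<short code>.<local id>' and skips pairtree hex-encoding.
    """
    # Split off the institutional short code at the first period.
    short_code, _, local_id = htid.partition(".")

    # Pairtree single-character substitutions for filesystem safety.
    clean_id = local_id.replace(":", "+").replace("/", "=").replace(".", ",")

    # Break the cleaned identifier into two-character directory names.
    pairs = [clean_id[i:i + 2] for i in range(0, len(clean_id), 2)]

    filename = f"{short_code}.{clean_id}.json.bz2"
    return "/".join([short_code, "pairtree_root", *pairs, clean_id, filename])


print(htid_to_pairtree_path("miun.adx6300.0001.001"))
```

For the volume ID `miun.adx6300.0001.001`, the two-character directory names are `ad`, `x6`, `30`, `0,`, `00`, `01`, `,0` and `01`. The institutional short code `miun` stays intact at the front of the path.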

 


Preparing to Rsync

Converting HathiTrust Volume ID to Rsync URL (using HTRC Feature Reader)

...

Code Block
languagepy
id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

 



The Extracted Features file for this volume can be downloaded using rsync:

...

Code Block
languagebash
rsync -azv data.analytics.hathitrust.org::features/{{URL}} .
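If you are scripting downloads, the same command can be assembled in Python. The sketch below is only an illustration: `build_rsync_command` is a hypothetical helper, and the host and flags are copied from the command above. It builds the argument list and prints it without contacting the server.

```python
import shlex

# Host and module taken from the rsync command shown in this section.
RSYNC_SOURCE = "data.analytics.hathitrust.org::features/"


def build_rsync_command(relative_path: str, dest: str = ".") -> list:
    """Return the rsync argument list for one Extracted Features file.

    Hypothetical helper; relative_path is the pairtree-style path that
    stands in for {{URL}} in the command above.
    """
    return ["rsync", "-azv", RSYNC_SOURCE + relative_path, dest]


cmd = build_rsync_command(
    "miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/"
    "miun.adx6300,0001,001.json.bz2"
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

To run the transfer rather than print it, the list could be passed to `subprocess.run(cmd, check=True)`, assuming rsync is installed and on your PATH.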


Using the HTRC Analytics algorithm


Researchers who do not feel comfortable creating rsync URLs themselves can use the Extracted Features Download Helper algorithm in HTRC Analytics. The algorithm creates a shell script that you download and run locally to rsync the files.

 

...





Go to the Worksets page of HTRC Analytics


While logged into HTRC Analytics, click on the 'Worksets' link near the top of the screen. From the list of worksets that appears, choose the one you would like to get Extracted Features for and click on its name.





From the 'Analyze with Algorithm' drop-down menu, choose the Extracted Features Download Helper algorithm.



This algorithm generates a script for downloading the feature data files that correspond to your workset.

 

...

Execute the Extracted Features Download Helper algorithm


Specify a job name of your choosing. You also have to select a workset that the Extracted Features Download Helper algorithm will run against: check the button saying “Select a workset from my worksets” and select your desired workset. Your screen should now look like the figure below. Then, click the ‘Submit’ button.

...




Image Added


Wait until the algorithm has finished

You can now see the status of the job. Refresh the screen to see its progress.


Open the completed job to download

Eventually, the job will complete, and it will move to the "Completed Jobs" section of the page. Click on the link representing the job name to see the results.

 

...




From the results page, click the blue button to download the shell script you will use to get the Extracted Features. The file will go wherever downloads typically end up on your machine, often the Downloads folder.


Run the script returned by the Extracted Features Download Helper algorithm

Windows users: Please follow the directions here before continuing to ensure your machine is equipped for rsync.

...

Then run the file you downloaded. It is a shell script. When you run it, a basic features data file and an advanced features data file for each volume in your workset will be transferred to your hard disk via the rsync utility.

...


Code Block
sh EF_Rsync.sh
 


If your workset contained N volumes with HathiTrust volume IDs V1, V2, V3,... VN respectively, then executing the shell script as shown above will cause the following feature data files for the corresponding volumes to be transferred to your computer’s hard disk via rsync: V1.json.bz2, V2.json.bz2, V3.json.bz2, ..., VN.json.bz2. See Filepaths above to learn more about the pairtree structure the Extracted Features files follow.
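Once the files are on disk, each one is a bzip2-compressed JSON document, as the `.json.bz2` extension indicates. The sketch below shows one way to open such a file using only the Python standard library. The function name `load_feature_file` is hypothetical, and the sketch makes no assumptions about which keys the JSON contains. The HTRC Feature Reader library offers a higher-level interface for the same data.

```python
import bz2
import json


def load_feature_file(path):
    """Decompress and parse one downloaded .json.bz2 feature file.

    Hypothetical helper for illustration; returns whatever JSON value
    the file holds, without interpreting its structure.
    """
    # bz2.open in text mode decompresses on the fly while json parses.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `load_feature_file("miun.adx6300,0001,001.json.bz2")` would return the parsed contents of that file, assuming it exists at that location inside the pairtree directory structure.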

...