The file path used to sync Extracted Features files through RSync follows a pairtree format, keeping the institutional shortcode intact (e.g. mdp, uc2).
If you are using the HTRC Feature Reader library, there is a convenience function, htrc_features.utils.id_to_rsync(htid):
    >>> from htrc_features import utils
    >>> utils.id_to_rsync('miun.adx6300.0001.001')
    'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
This example is a simplified part of a longer notebook, which further describes how to collect and download large lists of volumes: ID to EF Rsync Link.ipynb.
If you don't have it, you may have to install the pairtree library (Python 2.x only) with: pip install pairtree
    from pairtree import id2path, id_encode

    def id_to_rsync(htid):
        '''
        Take an HTRC id and convert it to an Rsync location for syncing
        Extracted Features
        '''
        # Split the institutional shortcode from the volume id at the first dot
        libid, volid = htid.split('.', 1)
        # Pairtree-clean the volume id (e.g. '.' becomes ',')
        volid_clean = id_encode(volid)
        filename = '.'.join([libid, volid_clean, 'json.bz2'])
        # id2path uses the OS path separator, so normalize backslashes for rsync
        path = '/'.join([libid, 'pairtree_root',
                         id2path(volid).replace('\\', '/'),
                         volid_clean, filename])
        return path
Example:
    >>> id_to_rsync('miun.adx6300.0001.001')
    'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
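Because the pairtree library is noted above as Python 2.x only, the same conversion can be sketched without it. The following is a minimal, dependency-free sketch, assuming the character-cleaning rules of the Pairtree specification (hex-escape a small set of characters, then map '/' to '=', ':' to '+', and '.' to ','). The names pairtree_encode and id_to_rsync_py3 are hypothetical and are not part of any HTRC library.

```python
import re

# Assumption (Pairtree spec): characters outside visible ASCII, plus
# " * + , < = > ? \ ^ |, are hex-escaped as ^hh before the single-char mapping.
_HEX_ENCODE = re.compile(r'[^\x21-\x7e]|["*+,<=>?\\^|]')
_SINGLE_CHAR_MAP = str.maketrans({'/': '=', ':': '+', '.': ','})


def pairtree_encode(volid):
    """Clean a volume id following the Pairtree character rules (hypothetical helper)."""
    def hex_escape(match):
        # Escape each UTF-8 byte of the matched character as ^hh
        return ''.join('^%02x' % b for b in match.group(0).encode('utf-8'))
    return _HEX_ENCODE.sub(hex_escape, volid).translate(_SINGLE_CHAR_MAP)


def id_to_rsync_py3(htid):
    """Convert an HTRC id to an Extracted Features rsync path (hypothetical helper)."""
    libid, volid = htid.split('.', 1)
    volid_clean = pairtree_encode(volid)
    # Split the cleaned id into two-character directory names
    shorties = [volid_clean[i:i + 2] for i in range(0, len(volid_clean), 2)]
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    return '/'.join([libid, 'pairtree_root'] + shorties + [volid_clean, filename])


print(id_to_rsync_py3('miun.adx6300.0001.001'))
# miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2
```

For the example volume this reproduces the path shown above. Ids containing unusual characters should be checked against the library's output before relying on the sketch.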
The Extracted Features for this volume can be downloaded using RSync:
rsync -azv data.analytics.hathitrust.org::features/{{URL}} .
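For more than a handful of volumes, one option is to write the paths to a text file and pass that file to rsync's --files-from option rather than running one command per volume. The sketch below only builds the file and the argument list; it does not contact the server. The function name build_batch_rsync and the list-file location are assumptions for illustration, and the paths are expected to be those returned by id_to_rsync.

```python
# Rsync module taken from the single-volume command above
RSYNC_SOURCE = 'data.analytics.hathitrust.org::features/'


def build_batch_rsync(paths, list_path, dest='.'):
    """Write one rsync path per line to list_path and return the rsync argv.

    Hypothetical helper: it prepares the command but does not run it.
    """
    with open(list_path, 'w') as f:
        f.write('\n'.join(paths) + '\n')
    # --files-from reads the relative paths to transfer from the given file
    return ['rsync', '-azv', '--files-from=' + list_path, RSYNC_SOURCE, dest]


if __name__ == '__main__':
    example_paths = [
        'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/'
        'miun.adx6300,0001,001.json.bz2',
    ]
    print(' '.join(build_batch_rsync(example_paths, 'ef-paths.txt')))
```

The returned list can be passed to subprocess.run if the transfer should be started from Python. With --files-from, rsync recreates the pairtree directory structure under the destination rather than flattening the files into one folder.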