This documentation has been updated for the newest format of URLs for the Extracted Features dataset, intended for release in August 2016. This format no longer has basic and advanced features described in separate files. If you are looking for information on the earlier format, see version 12 of this page.


The file path used to sync Extracted Features files through RSync follows the pairtree format, with the institutional shortcode (e.g. mpd, uc2) kept intact as the leading path component.

Converting ID to RSync URL (Python with HTRC Feature Reader library)

If you are using the HTRC Feature Reader library, there is a convenience function, htrc_features.utils.id_to_rsync(htid):

>>> from htrc_features import utils
>>> utils.id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

Converting ID to RSync URL (Python)

This example is a simplified part of a longer notebook, which further describes how to collect and download large lists of volumes: ID to EF Rsync Link.ipynb

If the pairtree library is not already installed, install it with: pip install pairtree (Python 2.x only).

from pairtree import id2path, id_encode

def id_to_rsync(htid):
    '''
    Take an HTRC id and convert it to an Rsync location for syncing Extracted Features
    '''
    # Split the institutional shortcode from the volume identifier
    libid, volid = htid.split('.', 1)
    # Pairtree-encode the volume id (e.g. '.' becomes ',')
    volid_clean = id_encode(volid)
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    # id2path may return backslashes on Windows; normalize to forward slashes
    path = '/'.join([libid, 'pairtree_root', id2path(volid).replace('\\', '/'), volid_clean, filename])
    return path

Example:

id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
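
Since the pairtree package is noted above as Python 2.x only, the following is a minimal sketch of the same conversion in plain Python 3 with no third-party dependency. The names pairtree_encode and id_to_rsync_py3 are hypothetical and not part of any HTRC library. The encoding rules are an assumption based on the Pairtree specification: a small set of special characters is hex-encoded as ^hh, then '/', ':' and '.' are mapped to '=', '+' and ','.

```python
import re

# Assumption (Pairtree spec): characters outside visible ASCII, plus
# " * + , < = > ? \ ^ |, are hex-encoded as ^hh.
_HEX_ENCODE = re.compile(r'[^\x21-\x7e]|["*+,<=>?\\^|]')
# Assumption (Pairtree spec): single-character substitutions applied afterwards.
_SINGLE_CHAR = str.maketrans({'/': '=', ':': '+', '.': ','})


def pairtree_encode(identifier):
    """Hypothetical helper: encode an identifier following the Pairtree rules."""
    def to_hex(match):
        # Encode every UTF-8 byte of the matched character as ^hh
        return ''.join('^%02x' % byte for byte in match.group(0).encode('utf-8'))
    return _HEX_ENCODE.sub(to_hex, identifier).translate(_SINGLE_CHAR)


def id_to_rsync_py3(htid):
    """Hypothetical Python 3 counterpart of id_to_rsync, without pairtree."""
    # Split the institutional shortcode from the volume identifier
    libid, volid = htid.split('.', 1)
    volid_clean = pairtree_encode(volid)
    # Break the encoded id into two-character directory names
    pairs = [volid_clean[i:i + 2] for i in range(0, len(volid_clean), 2)]
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    return '/'.join([libid, 'pairtree_root'] + pairs + [volid_clean, filename])


print(id_to_rsync_py3('miun.adx6300.0001.001'))
```

For the example volume this prints the same path as shown above. Identifiers containing unusual characters are worth checking against the pairtree library or the HTRC Feature Reader before relying on this sketch.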

The Extracted Features for this volume can be downloaded using RSync, substituting the path returned above for {{URL}}:

rsync -azv data.analytics.hathitrust.org::pd-features/{{URL}} .
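
As an illustration of filling in {{URL}}, the sketch below builds the rsync command line from a path produced by id_to_rsync. The function names build_rsync_command and build_batch_command are hypothetical. The batch variant assumes that the standard rsync option --files-from works against this server, which this page does not confirm. Nothing is executed here; the functions only return argument lists.

```python
# Host and module taken from the rsync command shown above
RSYNC_MODULE = 'data.analytics.hathitrust.org::pd-features/'


def build_rsync_command(rsync_path, dest='.'):
    """Hypothetical helper: argument list for downloading one volume."""
    return ['rsync', '-azv', RSYNC_MODULE + rsync_path, dest]


def build_batch_command(list_file, dest='.'):
    """Hypothetical helper: argument list for downloading every path listed
    (one per line) in list_file. Assumes the server accepts --files-from."""
    return ['rsync', '-azv', '--files-from=' + list_file, RSYNC_MODULE, dest]


path = ('miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/'
        'adx6300,0001,001/miun.adx6300,0001,001.json.bz2')
print(' '.join(build_rsync_command(path)))
```

The returned list can be passed to subprocess.run to perform the download. Passing a list rather than a single shell string means the commas in the path need no quoting.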