Release of Feature Extraction (alpha version)

HathiTrust Research Center feature extraction

Dear friends and colleagues,

The HathiTrust Research Center (HTRC) is proud to announce the alpha release of a new dataset, consisting of page-level features extracted from a quarter-million text volumes.

HTRC Extracted Features Dataset documentation and download.

Features are data attributes defined in such a way that they can be identified by a computer and analyzed at scale. To build the HTRC Feature Extraction alpha dataset, we have already processed the underlying text, identifying headers and footers and rejoining hyphenated words. The dataset offers page-level details such as:

- term-frequency counts, per section (header/body/footer), per page

- occurrences of terms as different parts of speech

- line counts and sentence counts

- character counts at the start or end of lines
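To make the list above concrete, here is a minimal sketch of how a few of these page-level features could be computed from one page of plain text. It is an illustration only, not the HTRC pipeline: the function names, the fixed one-line header and footer, and the simple lowercase tokenizer are all assumptions made for the example, and part-of-speech tagging and sentence counting are left out because they need a language-processing library.

```python
import re
from collections import Counter


def rejoin_hyphens(lines):
    """Join words split by a hyphen at a line break (e.g. 'ves-' + 'sel')."""
    text = "\n".join(lines)
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def term_counts(lines):
    """Count lowercase alphabetic tokens after rejoining hyphenated words."""
    return Counter(re.findall(r"[a-z]+", rejoin_hyphens(lines).lower()))


def page_features(lines, header_lines=1, footer_lines=1):
    """Compute simple features for one page, given as a list of lines.

    header_lines and footer_lines are hypothetical fixed sizes; a real
    system would have to detect headers and footers from the text.
    """
    n = len(lines)
    non_empty = [line for line in lines if line.strip()]
    return {
        # term-frequency counts, per section
        "header": term_counts(lines[:header_lines]),
        "body": term_counts(lines[header_lines:n - footer_lines]),
        "footer": term_counts(lines[n - footer_lines:]),
        # line count for the whole page
        "line_count": n,
        # counts of the first and last character of each non-empty line
        "begin_chars": Counter(line.strip()[0] for line in non_empty),
        "end_chars": Counter(line.strip()[-1] for line in non_empty),
    }


page = [
    "CHAPTER I",
    "The whale surfaced near the ves-",
    "sel, and the crew watched the whale.",
    "12",
]
features = page_features(page)
print(features["body"].most_common(2))
print(features["line_count"], dict(features["end_chars"]))
```

The end-of-line character counts are what make a trailing hyphen visible as a feature, which is the kind of signal a researcher could use to study hyphenation or to tell prose from verse or tables.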

Because the dataset is currently an alpha release, we are looking for feedback on how data like this can help you in your research and how we can better serve the scholarly community.

Today’s dataset is built upon the HathiTrust’s non-Google-digitized public domain volumes; that is, the original scanned representations of all the texts can be accessed through the HathiTrust. We have features for 67,932,813 pages from 250,178 volumes, spanning nearly six hundred years. The median date of the material is 1899, and the text is primarily English. While this alpha release originates from public domain data, this type of extracted feature dataset also provides a road map toward non-consumptive research on works not in the public domain, since the features, though useful for scholarly research purposes, are not sufficient to reconstruct the text itself.
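One reason term counts do not suffice to reconstruct a text is that they discard word order. The short sketch below illustrates this with two different sentences that yield identical counts. The `term_counts` helper and the sample sentences are hypothetical, chosen only for the illustration.

```python
from collections import Counter


def term_counts(text):
    """Count whitespace-separated lowercase tokens; word order is discarded."""
    return Counter(text.lower().split())


a = "the dog chased the cat"
b = "the cat chased the dog"

# Different sentences, identical counts: the counts alone cannot tell
# which of the two orderings was the original text.
print(a != b, term_counts(a) == term_counts(b))
```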

The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois. In conjunction with the HathiTrust Digital Library, the HTRC team strives to meet the technical challenges that researchers face when dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

Questions? Please contact <htrc-support-l@list.iu.edu>.