HTRC BookNLP Dataset for English-Language Fiction
Work with rich, unrestricted entity, word, and character data extracted from over 200,000 volumes of English-language fiction in the HathiTrust Digital Library (HTDL)
Short URL for this page: https://htrc.atlassian.net/l/cp/4W91G6sM
The HTRC BookNLP Dataset for English-Language Fiction (ELF) is a derived dataset created by running the BookNLP pipeline on the NovelTM English-language fiction set, a set of around 213,000 volumes in the HathiTrust Digital Library identified as fiction by supervised machine learning.
BookNLP is a text analysis pipeline tailored for common natural language processing (NLP) tasks, built to support work in computational linguistics, cultural analytics, NLP, machine learning, and other fields. This dataset has the potential to power new computational research on English-language literature, along with more methods-focused work in the areas mentioned above, with minimal infrastructure support from HTRC. As with all derived datasets, a key goal of the project is also to lower the barrier to working with HathiTrust Digital Library (HTDL) and HTRC data, and to let users leverage computational resources they have personally or through their institutional affiliation. Specifics of the data, a discussion of its non-consumptiveness, and other pros and cons of release follow in the next sections.
For this dataset, the standard BookNLP pipeline was modified to output only files that meet HTRC's non-consumptive use policy, under which only minimal data that cannot easily be reconstructed into the raw volume may be released. Please see the About the Data section below for the files that are included and their descriptions.
Dataset Stats
# of volumes represented | 201,527 |
# of in-copyright volumes represented | 90,857 |
# of files | 604,561 |
Size of full dataset (gigabytes) | 451.2 |
Attribution
Ryan Dubnicek, Boris Capitanu, Glen Layne-Worthey, Jennifer Christie, John A. Walsh, J. Stephen Downie (2023). The HathiTrust Research Center BookNLP Dataset for English-Language Fiction. HathiTrust Research Center. https://doi.org/10.13012/d4gy-4g41
This derived dataset is released under a Creative Commons Attribution 4.0 International License.
About the Data
BookNLP is a pipeline that combines state-of-the-art tools for a number of routine cultural analytics or NLP tasks, optimized for large volumes of text, including (verbatim from BookNLP’s GitHub documentation):
Part-of-speech tagging
Dependency parsing
Entity recognition
Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
Quotation speaker identification
Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
Event tagging
Referential gender inference (TOM_SAWYER -> he/him/his)
In practice, this means that for each book run through the pipeline, a standard (consumptive) BookNLP implementation generates the following six files:
The .tokens file*
The .book.html file*
The .quote file*
The .entities file
The .supersense file
The .book file
* Files marked with an asterisk are not part of this dataset because they violate HTRC's non-consumptive use policy. This release includes only the three non-consumptive files: the .entities, .supersense, and .book (character data) files. Read on for more specific information about each of these files.
The volumes this data was generated for are detailed in NovelTM Datasets for English-Language Fiction, 1700-2009 by Underwood, Kimutis, and Witte. In brief, these volumes were identified as English-language fiction by supervised machine learning using manually verified training data. A TSV of full HathiFiles metadata for the volumes in this dataset is available to download via this link.
Entities
The .entities files contain entities tagged by a state-of-the-art predictive model fine-tuned to identify and extract entities from narrative text. Each file is a TSV in which every row gives a coreference ID (so that mentions under different names can be resolved to the same entity), the volume-level start and end token positions of the mention (e.g. “Athens” is tagged with a start and end token of 2814, as it is a unigram), the type of entity, and the entity string itself.
Screenshot of example of .entities TSV
Entity tagging in BookNLP is done using a predictive model trained to recognize named entities using an annotated dataset that includes the public domain materials in LitBank plus "a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction".
Entities are tagged in the following categories:
People (PER): e.g., Tom Sawyer, her daughter
Facilities (FAC): the house, the kitchen
Geo-political entities (GPE): London, the village
Locations (LOC): the forest, the river
Vehicles (VEH): the ship, the car
Organizations (ORG): the army, the Church
See BookNLP's technical documentation for more information.
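As a rough illustration of working with the entity data described above, the following Python sketch reads an .entities TSV, counts mentions per entity category, and groups mention strings by coreference ID. The column names (COREF, start_token, end_token, prop, cat, text) are an assumption based on typical BookNLP output and should be checked against the header of an actual file; the sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the ASSUMED .entities layout (tab-separated).
# Verify the real header row before relying on these column names.
SAMPLE_ENTITIES = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "12\t2814\t2814\tPROP\tGPE\tAthens\n"
    "7\t2820\t2821\tPROP\tPER\tTom Sawyer\n"
    "7\t2830\t2830\tPRON\tPER\the\n"
    "31\t2840\t2841\tNOM\tFAC\tthe kitchen\n"
)


def read_entities(handle):
    """Parse an .entities TSV into a list of dicts with integer token offsets."""
    rows = []
    # QUOTE_NONE keeps quotation marks inside entity strings as literal text.
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        row["start_token"] = int(row["start_token"])
        row["end_token"] = int(row["end_token"])
        rows.append(row)
    return rows


def count_by_category(rows):
    """Count entity mentions per category (PER, FAC, GPE, LOC, VEH, ORG)."""
    return Counter(row["cat"] for row in rows)


def group_by_coref(rows):
    """Collect every mention string that shares a coreference ID."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["COREF"], []).append(row["text"])
    return grouped


entities = read_entities(io.StringIO(SAMPLE_ENTITIES))
print(count_by_category(entities))
print(group_by_coref(entities)["7"])
```

To run this against a real file, replace the in-memory sample with something like open(path, encoding="utf-8", newline=""), where path is the location of a downloaded .entities file.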
Supersenses
The .supersense file is a TSV of tagged tokens and phrases, each assigned to one of 41 lexical categories drawn from the computational linguistics database WordNet. These tags represent fine-grained semantic meaning for each token or phrase within the sentence in which it occurs. Along with the tokens/phrases (the supersenses themselves), the file contains the supersense category and the volume-level start and end token of each one.
Screenshot of example .supersense TSV
Supersense data could be used for fine-grained entity analysis, such as an investigation into how authors write social characters (counting and analyzing occurrences of tokens tagged "verb.social"), or which foods are most popular in certain sections of volumes (see how occurrences of "noun.food" cluster within the structure of books in a workset). A complete list of supersense tags is available in WordNet's lexicographical documentation.
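The kind of positional analysis suggested above could be sketched in Python as follows. The example counts how occurrences of one supersense category fall across ten equal segments of a volume. The column names (start_token, end_token, supersense_category, text) are an assumption about the file header, and because the .tokens file is not part of this release, the sketch uses the largest end token in the supersense file as a stand-in for the volume length, which is only an approximation. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the ASSUMED .supersense layout (tab-separated).
SAMPLE_SUPERSENSE = (
    "start_token\tend_token\tsupersense_category\ttext\n"
    "3\t3\tnoun.food\tbread\n"
    "10\t10\tverb.social\tmarried\n"
    "45\t46\tnoun.food\tplum pudding\n"
    "99\t99\tverb.motion\tran\n"
)


def read_supersenses(handle):
    """Parse a .supersense TSV into a list of dicts with integer token offsets."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        row["start_token"] = int(row["start_token"])
        row["end_token"] = int(row["end_token"])
        rows.append(row)
    return rows


def positions_by_segment(rows, category, segments=10):
    """Count occurrences of `category` in each of `segments` equal slices.

    The volume length is approximated by the last tagged token position,
    since the full token list is not distributed with the dataset.
    """
    if not rows:
        return Counter()
    last_token = max(row["end_token"] for row in rows)
    counts = Counter()
    for row in rows:
        if row["supersense_category"] == category:
            segment = row["start_token"] * segments // (last_token + 1)
            counts[min(segment, segments - 1)] += 1
    return counts


supersenses = read_supersenses(io.StringIO(SAMPLE_SUPERSENSE))
food_positions = positions_by_segment(supersenses, "noun.food")
print(sorted(food_positions.items()))
```

Summing these per-segment counts over every volume in a workset would give a simple picture of where a category such as "noun.food" clusters within the structure of books.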
Character data (.book files)
Lastly, the .book file contains a large JSON object whose main key, “characters” (fictional agents in the volumes, e.g. "Ebenezer Scrooge"), holds information about each character mentioned more than once in the text. This data includes these classes of information, by character:
All of the names with which the character is associated, including pronouns, to disambiguate mentions in the text (under the label "mentions" in the JSON). For example, from an excerpt like “I mean that's all I told D.B. about, and he's my brother and all”, the data for the character “D.B.” will include the name as it appears explicitly in the text and the pronoun “he” from this sentence.
Words that are used to describe the character ("mod")
Nouns the character possesses ("poss")
Actions the character does ("agent")
Actions done to the character ("patient")
An inferred gender label for the character ("g")
Example screenshot of JSON .book files viewed in a web browser with JSON pretty print plugin
This data alone could power inquiry into character descriptions, actions, and narrative role, among many other possible research questions, without requiring the full text of volumes, especially those where accessing the text could violate copyright.
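A minimal Python sketch of reading this character data follows. The top-level "characters" key and the labels "mentions", "mod", "poss", "agent", "patient", and "g" come from the description above. Everything nested inside them is an assumption for illustration: the "proper", "common", and "pronoun" groupings under "mentions", the "n" (name) and "c" (count) fields, the "w" (word) and "i" (token index) fields, and the "argmax" field under "g" should all be verified against a real .book file. The sample character is invented.

```python
import json
from collections import Counter

# Hypothetical sample; the nested field names are ASSUMPTIONS to verify.
SAMPLE_BOOK = json.dumps({
    "characters": [
        {
            "id": 7,
            "mentions": {
                "proper": [{"n": "Scrooge", "c": 2}, {"n": "Ebenezer", "c": 1}],
                "common": [],
                "pronoun": [{"n": "he", "c": 1}],
            },
            "mod": [{"w": "solitary", "i": 120}],
            "poss": [{"w": "nephew", "i": 340}],
            "agent": [
                {"w": "said", "i": 200},
                {"w": "walked", "i": 410},
                {"w": "said", "i": 500},
            ],
            "patient": [{"w": "visited", "i": 900}],
            "g": {"argmax": "he/him/his"},
        }
    ]
})


def display_name(character):
    """Return the most frequent proper-name mention, or a fallback ID label."""
    proper = character.get("mentions", {}).get("proper", [])
    if not proper:
        return f"character_{character.get('id')}"
    return max(proper, key=lambda mention: mention.get("c", 0))["n"]


def word_counts(character, role):
    """Count the words recorded under one role: mod, poss, agent, or patient."""
    return Counter(item["w"] for item in character.get(role, []))


def gender_label(character):
    """Return the inferred gender label, or None when no label is present."""
    g = character.get("g")
    return g.get("argmax") if isinstance(g, dict) else None


book = json.loads(SAMPLE_BOOK)
first = book["characters"][0]
print(display_name(first), word_counts(first, "agent").most_common(1), gender_label(first))
```

The helper functions use .get with defaults throughout so that a character missing one of the labels does not raise an error, which seems a sensible precaution when iterating over many volumes.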
A Note on Gender
Gender inference is a challenging computational task, and it is also a challenging task for human readers and scholars. Our understanding of gender has developed over time. The male-female gender binary, a relic of the past, is no longer considered suitable for fully reflecting the realities of our society, which complicates historical research, especially research done at scale with minimal human intervention. While BookNLP supports many labels in the gender field, the state of the art in gender tagging is advancing rapidly, and those who study gender are likely to consider the data in this dataset outdated very soon (if not already). Please keep this in mind when using this data.