About the Data
BookNLP is a pipeline, optimized for large volumes of text, that combines state-of-the-art tools for a number of routine cultural analytics and NLP tasks. These include (verbatim from BookNLP’s GitHub documentation):
- Part-of-speech tagging
- Dependency parsing
- Entity recognition
- Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
- Quotation speaker identification
- Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
- Event tagging
- Referential gender inference (TOM_SAWYER -> he/him/his)
In practice, this means that for each book run through the pipeline, a standard (consumptive) BookNLP implementation generates the following six files:
- The .tokens file*
- The .book.html file*
- The .quote file*
- The .entities file
- The .supersense file
- The .book file
* Files marked with an asterisk are not part of this dataset because releasing them would violate HTRC's non-consumptive use policy. This release includes only the non-consumptive files, files 4 through 6 in the list above: the entities, supersense, and character data. Read on for more specific information about each of these files.
The volumes this data was generated for are detailed in NovelTM Datasets for English-Language Fiction, 1700-2009 by Underwood, Kimutis, and Witte. In brief, these volumes were identified as English-language fiction using supervised machine learning and manually verified training data. A TSV of the full HathiFiles metadata for the volumes in this dataset is available to download via this link.
The .entities files contain entities tagged by a state-of-the-art predictive model fine-tuned to identify and extract entities from narrative text. Each file is a TSV in which every row gives a coreference ID (so that an entity referred to by different names can be resolved to the same entity), the volume-level start and end tokens of the entity mention (e.g., “Athens” has a start and end token of 2814, as it is a unigram), the type of entity, and the entity string itself.
Screenshot of example of .entities TSV
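As a rough illustration of how the coreference ID can be used, the sketch below groups every entity string under the ID it shares. The column names (COREF, start_token, end_token, prop, cat, text) are an assumption based on BookNLP's upstream output and should be checked against the header of the actual files; the sample rows and ID values are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. Column names are assumed, not confirmed by this
# dataset's documentation; verify them against a real .entities file.
SAMPLE = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "12\t2814\t2814\tPROP\tGPE\tAthens\n"
    "7\t2901\t2902\tPROP\tPER\tTom Sawyer\n"
    "7\t2950\t2950\tPRON\tPER\the\n"
)

def group_mentions(handle):
    """Map each coreference ID to the list of strings that refer to it."""
    mentions = defaultdict(list)
    # QUOTE_NONE keeps quotation marks inside entity strings from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        mentions[row["COREF"]].append(row["text"])
    return dict(mentions)

grouped = group_mentions(io.StringIO(SAMPLE))
print(grouped)
```

To run this against a real file, replace the `io.StringIO(SAMPLE)` argument with an open file handle for a .entities file.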
Entity tagging in BookNLP uses a predictive model trained to recognize named entities on an annotated dataset that includes the public domain materials in LitBank plus "a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction."
Entities are tagged in the following categories (with examples):
- People (PER): Tom Sawyer, her daughter
- Facilities (FAC): the house, the kitchen
- Geo-political entities (GPE): London, the village
- Locations (LOC): the forest, the river
- Vehicles (VEH): the ship, the car
- Organizations (ORG): the army, the Church
See BookNLP's technical documentation for more information.
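One possible use of the category tags is to list the most frequently mentioned strings of a given type in a volume. The following is a minimal sketch under the assumption that the entity type is stored in a column named "cat" and the entity string in a column named "text", as in BookNLP's upstream output; the function name `top_strings` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; column names are assumed and should be verified
# against a real .entities file.
SAMPLE = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "12\t2814\t2814\tPROP\tGPE\tAthens\n"
    "31\t2990\t2991\tNOM\tFAC\tthe kitchen\n"
    "12\t3105\t3105\tPROP\tGPE\tAthens\n"
    "40\t3200\t3200\tPROP\tGPE\tLondon\n"
)

def top_strings(handle, category, n=5):
    """Return the n most common entity strings tagged with `category`."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["text"] for row in reader if row["cat"] == category)
    return counts.most_common(n)

places = top_strings(io.StringIO(SAMPLE), "GPE")
print(places)
```

Counting raw strings treats "Athens" and "the city" as different items; grouping by coreference ID instead would merge mentions that the model resolved to the same entity.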
The .supersense file is a TSV that contains tokens and phrases, each tagged with one of 41 lexical categories drawn from the computational linguistics database WordNet. These tags represent the semantic meaning of each token or phrase within the sentence in which it occurs. Along with each tagged token or phrase, the file contains its supersense category and its volume-level start and end tokens.
Screenshot of example .supersense TSV
Supersense data could be used for fine-grained entity analysis, such as an investigation into how authors write social characters (counting and analyzing occurrences of tokens tagged "verb.social"), or into which foods are most popular in certain sections of volumes (examining how occurrences of "noun.food" cluster within the structure of books in a workset). A complete list of supersense tags is available in WordNet's lexicographical documentation.
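The "noun.food" idea above can be sketched by dividing a volume into equal sections and counting tagged spans in each. This is only one way to approach it. The column names (start_token, end_token, supersense_category, text) are assumed from BookNLP's upstream output, the sample rows are invented, and because the .tokens file is not part of this dataset, the sketch uses the largest end token seen in the file as a stand-in for the volume's length.

```python
import csv
import io

# Hypothetical sample rows; column names are assumed and should be verified
# against a real .supersense file.
SAMPLE = (
    "start_token\tend_token\tsupersense_category\ttext\n"
    "3\t3\tnoun.food\tbread\n"
    "10\t10\tverb.social\tmarried\n"
    "48\t49\tnoun.food\tplum pudding\n"
    "95\t95\tnoun.food\twine\n"
    "99\t99\tnoun.person\tclerk\n"
)

def tag_counts_by_section(handle, tag, sections=4):
    """Count spans tagged `tag` in each of `sections` equal slices of a volume."""
    rows = list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
    counts = [0] * sections
    if not rows:
        return counts
    # Approximate volume length: one past the largest tagged token index.
    length = max(int(row["end_token"]) for row in rows) + 1
    for row in rows:
        if row["supersense_category"] == tag:
            # Integer arithmetic maps a token position to a section index.
            section = int(row["start_token"]) * sections // length
            counts[section] += 1
    return counts

food = tag_counts_by_section(io.StringIO(SAMPLE), "noun.food")
print(food)
```

The same function works for any tag, so passing "verb.social" would give the distribution of social verbs across the volume instead.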
Character data (.book files)
Lastly, the .book file contains a large JSON object whose main key, “characters” (fictional agents in the volumes, e.g. "Ebenezer Scrooge"), holds information about each character mentioned more than once in the text. The data includes the following classes of information for each character:
- All of the names associated with the character, including pronouns, to disambiguate mentions in the text (under the label "mentions" in the JSON). For example, given an excerpt like “I mean that's all I told D.B. about, and he's my brother and all,” the data for the character “D.B.” will include the words explicitly associated with that name in the text as well as the pronoun “he” from this sentence.
- Words that are used to describe the character ("mod")
- Nouns the character possesses ("poss")
- Actions the character does ("agent")
- Actions done to the character ("patient")
- An inferred gender label for the character ("g")
Example screenshot of JSON .book files viewed in a web browser with JSON pretty print plugin
This data alone could power inquiry into character descriptions, actions, and narrative roles, among many other possible research questions, without requiring the full text of volumes. This is especially valuable for volumes where accessing the text could violate copyright.
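As one example of such an inquiry, the sketch below tallies the most frequent words under the "agent", "patient", "mod", and "poss" labels for each character. The labels come from the description above, but the internal shape of each entry is an assumption: the sample treats every entry as an object with a word field named "w", and the helper `word_of` also accepts plain strings in case the real files differ. The sample data is invented.

```python
import json
from collections import Counter

# Hypothetical sample mimicking the described layout. The "w" field name is
# an assumption; inspect a real .book file before relying on it.
SAMPLE = json.dumps({
    "characters": [
        {
            "agent": [{"w": "said"}, {"w": "walked"}, {"w": "said"}],
            "patient": [{"w": "told"}],
            "mod": [{"w": "old"}],
            "poss": [{"w": "ledger"}],
        }
    ]
})

LABELS = ("agent", "patient", "mod", "poss")

def word_of(entry):
    """Return the word for an entry, whether it is an object or a bare string."""
    return entry["w"] if isinstance(entry, dict) else entry

def summarize(book, top=3):
    """For each character, list the `top` most common words under each label."""
    summaries = []
    for character in book.get("characters", []):
        summaries.append({
            label: Counter(
                word_of(entry) for entry in character.get(label, [])
            ).most_common(top)
            for label in LABELS
        })
    return summaries

summaries = summarize(json.loads(SAMPLE))
print(summaries[0]["agent"])
```

For a real volume, `json.loads(SAMPLE)` would be replaced with `json.load` on an open .book file.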
A Note on Gender
Gender inference is a challenging computational task, and it is challenging even for human readers and scholars. Our understanding of gender has developed over time. The male-female gender binary, a relic of the past, is no longer considered suitable for fully reflecting the realities of our society. This complicates historical research, especially research done at scale with minimal human intervention. While BookNLP supports many labels in the gender field, the state of the art in gender tagging is advancing rapidly, and the data in this dataset is likely to be considered outdated very soon (if not already) by those who study gender. Please keep this in mind when using this data.