The HTRC BookNLP Dataset for English-Language Fiction (ELF) is a derived dataset created by running the BookNLP pipeline over the NovelTM English-language fiction set, a collection of around 213,000 volumes in the HathiTrust Digital Library that was identified using supervised machine learning. BookNLP is a text analysis pipeline tailored for common natural language processing (NLP) tasks, intended to support work in computational linguistics, cultural analytics, NLP, machine learning, and other fields.
The standard BookNLP pipeline was modified for this dataset so that it outputs only files that comply with HTRC's non-consumptive use policy, which requires that released data be minimal and not easily reconstructed into the raw volume. Please see the Data section below for details on which files are included and what they contain.
BookNLP combines state-of-the-art tools for a number of routine cultural analytics and NLP tasks and is optimized for large volumes of text. These tasks include (verbatim from BookNLP’s GitHub documentation):
- Part-of-speech tagging
- Dependency parsing
- Entity recognition
- Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
- Quotation speaker identification
- Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
- Event tagging
- Referential gender inference (TOM_SAWYER -> he/him/his)
To generate this dataset, each volume in the NovelTM English-Language Fiction dataset, sourced from the HathiTrust Digital Library, was run through the BookNLP pipeline, producing rich derived data for each volume.
For each book run through the pipeline, this dataset contains the following three files:
- The .entities file: contains entities tagged by a state-of-the-art predictive model fine-tuned to identify and extract entities from narrative text. The tagged entity types are people, facilities, geo-political entities, locations, vehicles, and organizations.
- The .supersense file: contains tagged tokens and phrases, each assigned to one of 41 lexical categories from the computational linguistics database WordNet. These tags represent the fine-grained semantic meaning of each token or phrase within the sentence in which it occurs.
- The .book file: contains a large JSON structure with “characters” (fictional agents in the volume, e.g. "Ebenezer Scrooge") as the main key, followed by information about each character mentioned more than once in the text. The data includes the following classes of information for each character:
  - All of the names with which the character is associated, including pronouns, to disambiguate mentions in the text ("mentions"). For example, given an excerpt like “I mean that's all I told D.B. about, and he's my brother and all”, the data for the character “D.B.” will include the words explicitly associated with that name in the text as well as the pronoun “he” from this sentence.
  - Words that are used to describe the character ("mod")
  - Nouns the character possesses ("poss")
  - Actions the character does ("agent")
  - Actions done to the character ("patient")
  - An inferred gender label for the character ("g")
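As an illustration of how the character data in a .book file might be explored, the following is a minimal Python sketch. It assumes the file follows the layout of the standard BookNLP output: a top-level "characters" list, "mentions" grouped by mention type with entries of the form {"n": name, "c": count}, "agent" entries of the form {"w": word}, and a "g" object with an "argmax" field. These field names are assumptions and should be checked against the actual files in this dataset. The function names and the sample data are hypothetical and for illustration only.

```python
import json
from collections import Counter


def load_book(path):
    """Parse a .book file (assumed to be a single JSON document)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_characters(book, top_n=3):
    """Return one small summary dict per character in a parsed .book structure."""
    summaries = []
    for character in book.get("characters", []):
        # Assumed layout: "mentions" maps a mention type (e.g. proper, pronoun)
        # to a list of {"n": name, "c": count} entries.
        names = Counter()
        for entries in character.get("mentions", {}).values():
            for entry in entries:
                names[entry["n"]] += entry.get("c", 1)

        # Assumed layout: "agent" is a list of {"w": word} entries,
        # one per action the character performs.
        actions = Counter(item["w"] for item in character.get("agent", []))

        # "g" may be missing or may not be an object, so read it defensively.
        gender = character.get("g")
        label = gender.get("argmax") if isinstance(gender, dict) else None

        summaries.append({
            "names": [name for name, _ in names.most_common(top_n)],
            "top_actions": [word for word, _ in actions.most_common(top_n)],
            "gender": label,
        })
    return summaries


# Made-up sample in the assumed layout (not taken from the dataset).
sample = {
    "characters": [
        {
            "mentions": {
                "proper": [{"n": "D.B.", "c": 4}],
                "common": [],
                "pronoun": [{"n": "he", "c": 2}],
            },
            "agent": [{"w": "wrote"}, {"w": "wrote"}, {"w": "drove"}],
            "patient": [],
            "mod": [],
            "poss": [{"w": "car"}],
            "g": {"argmax": "he/him/his"},
        }
    ]
}

summary = summarize_characters(sample)
print(summary)
```

For a real volume, the same function could be applied to the result of load_book called with the path to a .book file. Reading every field with a fallback is a deliberate choice here, since the modified pipeline used for this dataset may omit or reshape fields relative to the standard BookNLP output.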