Search and Recognition
To answer these questions, the researchers needed a corpus of documents written by or about Black women on which to perform topic modeling. They drew on the expertise of a university librarian during the corpus creation process. To create the corpus, they used the HathiTrust and JSTOR digital libraries and performed the following Boolean search:
(Black OR “african american” OR negro) AND (wom?n OR female* OR girl*)
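The logic of this query can be illustrated with a minimal Python sketch. It is only an approximation: it assumes that `?` stands for exactly one character and `*` for any number of trailing characters, and it uses plain regular expressions rather than the actual HathiTrust or JSTOR search engines, whose matching rules may differ. The function name `matches_query` is hypothetical.

```python
import re

# Left side of the query: Black OR "african american" OR negro
RACE_TERMS = re.compile(r"\b(black|african[\s-]+american|negro)\b", re.IGNORECASE)

# Right side of the query: wom?n OR female* OR girl*
# Assumption: '?' is one character, '*' is zero or more trailing characters.
GENDER_TERMS = re.compile(r"\b(wom\wn|female\w*|girl\w*)\b", re.IGNORECASE)


def matches_query(text: str) -> bool:
    """Return True only if the text satisfies both sides of the AND."""
    return bool(RACE_TERMS.search(text)) and bool(GENDER_TERMS.search(text))


if __name__ == "__main__":
    for title in ("A Black woman of the South",
                  "Negro girlhood in the rural South",
                  "Women in industry"):
        print(title, "->", matches_query(title))
```

Both groups must match for a text to qualify, which mirrors the AND between the two parenthesized halves of the search.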
In addition, the search was limited to English-language documents from 1746-2014 in the following formats:
Both JSTOR and HathiTrust have ways for researchers to obtain the full texts as well as the metadata for each volume. Because this is a guide for users of HathiTrust, the focus will be on obtaining documents from HathiTrust. More information on acquiring full-text documents from JSTOR can be found on their website.
For the HathiTrust data, this project made use of a custom dataset request. Custom datasets are available for public domain materials, though there are restrictions in place based on who digitized the item and where the requesting researcher is located. While not used in this project, an HTRC Data Capsule would have been another data access option. Capsules are available to anyone with an HTRC account, though only researchers from member institutions are allowed to analyze texts from the full HathiTrust collection, including in-copyright volumes. A capsule's default storage allotment holds up to 100GB of data, which would not have been enough for the more than 800,000 volumes used for this project. The researchers for this project had access to and used High Performance Computing resources at the University of Illinois.
The full-text HathiTrust volumes come as .txt files, and the metadata for each volume comes as a .tsv or JSON file; this project used JSON. After performing the search above, the researchers selected 19,398 documents that they believed to be authored by African Americans. They made this selection from the metadata, looking for known African American author names and/or subject headings containing phrases such as “African American” or “Negro”. Of these 19,398 documents, 6,120 came from the HathiTrust database and 13,278 from the JSTOR database.
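A metadata filter of this kind might look like the following minimal sketch. It is not the researchers' code. The field names `author` and `subjects`, the helper names, and the sample records are all assumptions made for illustration, and the real HathiTrust and JSTOR metadata schemas may differ.

```python
# Phrases to look for in subject headings (taken from the description above).
SUBJECT_PHRASES = ("african american", "negro")


def is_candidate(record, known_authors):
    """Flag a record if its author is on the known-author list
    or one of its subject headings contains a target phrase.

    Assumption: each record is a dict with a string 'author' field
    and a list-of-strings 'subjects' field (hypothetical schema).
    """
    author = record.get("author", "").lower()
    subjects = " ".join(record.get("subjects", [])).lower()
    by_author = any(name.lower() in author for name in known_authors)
    by_subject = any(phrase in subjects for phrase in SUBJECT_PHRASES)
    return by_author or by_subject


def select_candidates(records, known_authors):
    """Keep only the records that pass the author/subject check."""
    return [r for r in records if is_candidate(r, known_authors)]


if __name__ == "__main__":
    # Illustrative records only; in practice these would be loaded
    # from the JSON metadata files, e.g. with json.load().
    sample = [
        {"id": "a", "author": "Cooper, Anna J.", "subjects": ["Women"]},
        {"id": "b", "author": "Unknown",
         "subjects": ["African American women -- Education"]},
        {"id": "c", "author": "Unknown", "subjects": ["Botany"]},
    ]
    kept = select_candidates(sample, known_authors=["Cooper, Anna J."])
    print([r["id"] for r in kept])
```

Simple substring matching like this will produce both false positives and false negatives, which is consistent with the text describing these as documents the researchers "believed" to be authored by African Americans.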
The researchers then ran the 19,398 documents through the MALLET topic modeling tool, setting the number of topics to 100. Five researchers then reviewed the keyword list that makes up each topic, which included the words as well as the number of times each word was assigned to that topic. Each researcher individually determined which topics were most relevant, meaning they were about Black women specifically or were topics with which Black women might be connected (e.g. maternal health, education). Once each individual had identified topics of interest, the researchers gathered to discuss findings, view visualizations, and adjust interpretations, and came to a consensus on relevant topics.
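To show what such a keyword list with counts looks like in practice, here is a minimal Python sketch that builds one from MALLET output. It assumes the model was trained with MALLET's `train-topics` command using `--num-topics 100` and the `--word-topic-counts-file` option, and that each line of the resulting file has the form `index word topic:count topic:count ...`. The source does not say which MALLET options or output files the researchers used, so this is an illustration rather than their workflow. The function name `top_words_per_topic` is hypothetical.

```python
from collections import defaultdict


def top_words_per_topic(lines, n=10):
    """Build, for each topic, its n most frequently assigned words.

    Each input line is assumed to look like:
        <type index> <word> <topic>:<count> <topic>:<count> ...
    Returns {topic_id: [(word, count), ...]} sorted by descending count.
    """
    topics = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word = parts[1]
        for pair in parts[2:]:
            topic, count = pair.split(":")
            topics[int(topic)].append((word, int(count)))
    # Sort by count (highest first), breaking ties alphabetically.
    return {
        topic: sorted(words, key=lambda wc: (-wc[1], wc[0]))[:n]
        for topic, words in topics.items()
    }


if __name__ == "__main__":
    # Tiny made-up example; a real file would be opened and passed in,
    # e.g. top_words_per_topic(open("word_topic_counts.txt", encoding="utf-8")).
    demo = ["0 school 1:40 0:3", "1 mother 0:25",
            "2 teacher 1:12", "3 health 0:30 1:1"]
    for topic, words in sorted(top_words_per_topic(demo, n=2).items()):
        print(topic, words)
```

A listing like this only supports the human review described above. Deciding which topics are relevant remained a matter of individual judgment and group discussion.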