Excerpt
Read how the research team for the "Textual Geographies" project used text analysis methods to create an interactive visualization based on geographic locations.
Textual Geographies
Project Team:
Matthew Wilkens - Cornell University
Cameron Blevins - Northeastern University
David Chiang - University of Notre Dame
Elizabeth Evans - University of Notre Dame
Marissa Gemma - Max Planck Institute for Empirical Aesthetics
Ryan Heuser - Stanford University
Matthew Sisk - University of Notre Dame
Mads Rosendahl Thomsen - Aarhus University
Guangchen Ruan - Indiana University
Advisory Board:
David Bamman - University of California Berkeley
J. Stephen Downie - University of Illinois at Urbana-Champaign
Ian Gregory - Lancaster University
Beth Plale - Indiana University Bloomington
The Project
There are many ways to use the HathiTrust Digital Library (HTDL) and the HathiTrust Research Center (HTRC) for research. Here we give an example of one way the HTDL and HTRC were used for a National Endowment for the Humanities (NEH) grant-funded research project.
This project utilizes natural language processing (NLP) algorithms as well as mapping tools and APIs to find locations mentioned in text and then acquire geographic data about those locations. The text used for this project came exclusively from the HTDL. The final result is a web-based search tool that allows geospatial and graphical visualization of geographic textual data found in over 10 million HathiTrust volumes, along with downloadable versions of the data.
Needed tools/skills to conduct similar research:
(Please note that many people contributed to this project, so expertise in all of these skills and tools is not required of a single individual)
- Access to HTDL and HTRC
- Knowledge of a programming language (Python, R, Java, etc.)
- Knowledge of Stanford’s CoreNLP Natural Language Processing Toolkit, which includes Named Entity Recognition (NER)
- Access to high performance computing (only in instances with large datasets)
- Knowledge of APIs and how to access them
- Knowledge of JSON
- Some knowledge of PostgreSQL and databases
- Web design (HTML/CSS/JavaScript)
Let’s Begin
The corpus for this project comes entirely from the HTDL. As of September 1, 2020, the HTDL houses over 17 million digitized volumes, and the HTRC gives researchers from member institutions access to all 17 million of them. The only limitation on the volumes used for this project was the set of languages supported by the Stanford NER tool. Because the project began in 2016 and uses Stanford’s CoreNLP Natural Language Processing Toolkit, the volumes were limited to English, Spanish, German, and Chinese. Even with these limitations, a corpus of over 10 million volumes ranging from 1700 to the present day was produced. As with many projects of this type, the tools used have since been improved and additional volumes have been added to the HTDL and HTRC.
Also of note, the project was done in two phases. First, the algorithms described below were run on items in the HTDL that were in the public domain at the time. Then the exact same workflow was applied to items that were still in copyright.
Acquiring the Corpus
To find the volumes in these languages, the researchers did not rely on metadata or subject headings. Instead, the contents of the HTDL were run through a language detection algorithm written in Java. Most programming languages today, including Python and R, have packages or libraries that can perform this task with a high level of accuracy, so knowledge of Java is not specifically required. You may use the programming language with which you or a co-researcher are most familiar.
The researchers also had direct access to the HathiTrust collection as well as access to High Performance Computing (HPC). There are currently ways for researchers from member institutions to access the full HathiTrust corpus as well. This access comes via an HTRC Data Capsule and, as stated, affiliation with an institution that is a member of HathiTrust. While the default storage allotment would not hold the 10 TB of data used in this project, a capsule can hold up to 100 GB (the default maximum is 10 GB, but increases can be granted if needed), and in special circumstances (such as this example) other options can be arranged. For a corpus of this size, access to HPC is necessary for computing power. For a smaller corpus, however, the HTRC also provides many tools and services, including a web-based NER algorithm that can handle a corpus of up to 3,000 volumes.
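The project's language detection was written in Java, but the same step can be sketched in Python. Below is a minimal sketch assuming plain-text volume files and the langdetect package (an assumption, not the project's actual tool); it keeps only volumes detected as English, Spanish, German, or Chinese.

```python
# A minimal sketch of per-volume language detection using the langdetect
# package (an assumption; the project used a Java implementation).
# pip install langdetect
from pathlib import Path

from langdetect import detect

# langdetect codes for the four languages supported by the Stanford NER models
TARGET_LANGS = {"en", "es", "de", "zh-cn", "zh-tw"}

def volumes_in_target_languages(volume_dir):
    """Return {volume_path: detected_language} for volumes in the target languages."""
    keep = {}
    for path in Path(volume_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        try:
            lang = detect(text[:20000])  # a sample of the text is usually enough
        except Exception:  # langdetect raises an error on empty or unusable text
            continue
        if lang in TARGET_LANGS:
            keep[str(path)] = lang
    return keep

if __name__ == "__main__":
    for vol, lang in volumes_in_target_languages("volumes").items():
        print(lang, vol)
```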
Applying NER
Once the language recognition algorithm had isolated the desired languages, the corpus of English, German, Spanish, and Chinese volumes could be assembled. The corpus volumes were then run through Stanford’s CoreNLP software, by language, and the NER data for each volume was extracted. At a minimum, the NER algorithm tags the words in the volume data as LOCATION, PERSON, or ORGANIZATION. If a word does not fall into any category, it is marked with an “O” (for “outside,” i.e., not part of a named entity).
Once all the volumes have been run through the NER algorithm, you should have .txt files that can look like this:
Jane PERSON
Doe PERSON
will O
fly O
to O
Paris LOCATION
this O
weekend O
. O
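The project ran the Java CoreNLP toolkit directly; as a rough illustration of the same step, here is a minimal sketch using Stanza, the Stanford NLP Group's Python library, to produce one word-and-tag line per token. Note that Stanza's default English models use a different label set (e.g., GPE rather than LOCATION) and BIOES prefixes, so the exact tags will differ from the output shown above.

```python
# A minimal sketch of per-token NER tagging with Stanza (the Stanford NLP
# Group's Python library), used here as a stand-in for the Java CoreNLP
# toolkit the project actually ran.
# pip install stanza
import stanza

stanza.download("en")  # downloads the English models once
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

text = "Jane Doe will fly to Paris this weekend."
doc = nlp(text)

# Print one "word<TAB>tag" line per token; untagged tokens are "O".
# Stanza's labels (e.g., S-GPE) differ from CoreNLP's LOCATION/PERSON set.
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")
```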
At this point, it is possible to output the volume IDs, the words tagged by the NER algorithm as LOCATION, and their counts into a .tsv or .csv text file, or even a JSON file if preferred, using an algorithm written in your preferred programming language. Even with a corpus of a few thousand volumes, there can still be millions of lines, as each word tagged as a location creates a line in the file. With over 10 million volumes this becomes even more of a problem. To overcome this, the researchers used Hadoop’s sequence file format. These files then needed to be unpacked and converted to a format that could be processed by APIs from Google, such as the geocoding APIs described below. The researchers used Python for the unpacking and the subsequent steps, including loading the data into a Postgres database, querying the Google APIs, and adding the API responses to that database.
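As a small illustration of that output step, the sketch below tallies LOCATION-tagged tokens per volume and writes volume ID, location, and count to a .csv file. The directory layout and the whitespace-separated word/tag format (as in the example output above) are assumptions, not the project's actual sequence-file workflow.

```python
# A minimal sketch of counting LOCATION-tagged tokens per volume and writing
# the results to CSV. Assumes one whitespace-separated "word tag" line per
# token and one tagged .txt file per volume.
import csv
from collections import Counter
from pathlib import Path

def location_counts(tagged_dir, out_csv):
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["volume_id", "location", "count"])
        for path in Path(tagged_dir).glob("*.txt"):
            counts = Counter()
            for line in path.read_text(encoding="utf-8").splitlines():
                parts = line.split()
                if len(parts) == 2 and parts[1] == "LOCATION":
                    counts[parts[0]] += 1
            for location, count in counts.items():
                writer.writerow([path.stem, location, count])

location_counts("ner_output", "locations.csv")
```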
Given the purpose of this research project, the researchers only used the most frequently occurring locations. Any word tagged as a location that made up more than a certain percentage of all location terms was kept. This was done for each individual language, due to the different number of volumes in each, so the percentage threshold for a location word to make the cut varied by language. In addition, if a location term made up a certain percentage of the words in a single volume, it remained in the dataset, even if it missed the percentage threshold for that particular language in the corpus.
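In code, that filtering might look something like the sketch below, run once per language. The threshold values and the row format are placeholders for illustration; the project's actual cutoffs are not given here.

```python
# A minimal sketch of the two-part frequency filter described above:
# keep a location if it exceeds a share of all location tokens for the
# language, or if it makes up a large enough share of a single volume.
# Thresholds and row format are illustrative placeholders.
from collections import Counter

def filter_locations(rows, corpus_threshold=0.0001, volume_threshold=0.001):
    """rows: iterable of (volume_id, location, count, volume_word_count) for one language."""
    rows = list(rows)
    corpus_totals = Counter()
    for _, location, count, _ in rows:
        corpus_totals[location] += count
    all_location_tokens = sum(corpus_totals.values())

    kept = []
    for volume_id, location, count, volume_word_count in rows:
        corpus_share = corpus_totals[location] / all_location_tokens
        volume_share = count / volume_word_count
        if corpus_share >= corpus_threshold or volume_share >= volume_threshold:
            kept.append((volume_id, location, count))
    return kept
```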
Geocoding the Data
Now that we have locations pulled from the text and have a dataset, we need to acquire more detailed data about those locations. For this, Google’s Google Maps suite of geocoding APIs was used, particularly the Places API and the Geocoding API. The Places API was used first because it is better at handling locations that are points of interest (e.g., Hagia Sophia, Machu Picchu, the Eiffel Tower). The Places API assigns a unique identifier to each location. These unique identifiers can then be used with the Geocoding API to acquire more detailed data about the location, which, among much other information, includes its latitude and longitude. This geographic data comes in either JSON or XML format; for this particular project, JSON was chosen.
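A minimal sketch of that two-step lookup is below, using the requests library against Google's Find Place and Geocoding endpoints. It assumes you have a Google Maps API key; quota handling, caching, and error handling, all of which matter at this scale, are omitted.

```python
# A minimal sketch of resolving a location string to a place_id with the
# Places API, then fetching full geocoding details (including lat/lng) as
# JSON from the Geocoding API. Requires a Google Maps API key.
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"  # placeholder

def find_place_id(name):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/findplacefromtext/json",
        params={"input": name, "inputtype": "textquery",
                "fields": "place_id", "key": API_KEY},
        timeout=30,
    ).json()
    candidates = resp.get("candidates", [])
    return candidates[0]["place_id"] if candidates else None

def geocode(place_id):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"place_id": place_id, "key": API_KEY},
        timeout=30,
    ).json()
    return resp["results"][0] if resp.get("results") else {}

place_id = find_place_id("Hagia Sophia")
if place_id:
    result = geocode(place_id)
    print(result["geometry"]["location"])  # {'lat': ..., 'lng': ...}
```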
The Data
It is at this point that the HTRC volume metadata and the geocoding data are combined as well. This can be done by writing algorithms that access the Google APIs and append the query results to the existing data. As stated above, this was all done using algorithms written in Python; however, other programming languages have the same capabilities, so using Python is not required.
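As an illustration only, the sketch below appends a geocoding result to a row that already carries HTRC volume metadata and inserts it into Postgres with psycopg2. The table and column names are hypothetical, not the project's schema.

```python
# A minimal sketch of combining HTRC volume metadata with a Google geocoding
# result and storing the merged row in Postgres. Schema and DSN are
# illustrative placeholders.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=textgeo user=postgres")  # placeholder DSN
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS volume_locations (
        volume_id TEXT,
        title TEXT,
        language TEXT,
        pub_year INTEGER,
        location TEXT,
        mention_count INTEGER,
        lat DOUBLE PRECISION,
        lng DOUBLE PRECISION,
        geocode_json JSONB
    )
""")
conn.commit()

def insert_row(volume_meta, location, count, geocode_result):
    """volume_meta: dict of HTRC metadata; geocode_result: parsed Geocoding API JSON."""
    loc = geocode_result["geometry"]["location"]
    cur.execute(
        "INSERT INTO volume_locations VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
        (volume_meta["volume_id"], volume_meta["title"], volume_meta["language"],
         volume_meta["pub_year"], location, count,
         loc["lat"], loc["lng"], Json(geocode_result)),
    )
    conn.commit()
```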
Correcting Issues
As with any computational project involving text, the results are never perfect. The issues with deciphering geographic locations in text range from locations that are fictional (Kingsbridge in “The Pillars of the Earth”) to characters with names identical to locations (Jack London, George Washington, Indiana Jones, etc.) being mistaken for locations, with many other complexities in between. The first pass of the corpus through this workflow produced a recall of over 70% but a precision of under 50%. (Here is a Wikipedia article on precision and recall, in case you need a refresher.) While there was not much that could be done about the recall (as it is dependent on the algorithms used), the precision problem could be corrected. This was done by hand. There was too much data to go through every entry, so every location that made up at least 0.01% of the data or occurred 100 times or more in a single volume was reviewed and corrected if necessary. This hand correction improved the precision to almost 90%.
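For readers who want the arithmetic spelled out: precision is the share of items the pipeline tagged as locations that really are locations, and recall is the share of true locations the pipeline managed to find. A tiny sketch, assuming a hand-labeled sample is available to compare against (the data here are invented for illustration):

```python
# A minimal sketch of computing precision and recall against a hand-labeled
# sample. `predicted` is the set of (volume_id, location) pairs the pipeline
# produced for the sample; `gold` is the hand-labeled truth for the same sample.
def precision_recall(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: 4 of the 7 predictions are real locations (precision ≈ 0.57),
# and 4 of the 5 true locations were found (recall = 0.8).
predicted = {("vol1", "Paris"), ("vol1", "Jack London"), ("vol2", "Kingsbridge"),
             ("vol2", "Berlin"), ("vol3", "Madrid"), ("vol3", "Indiana Jones"),
             ("vol4", "Rome")}
gold = {("vol1", "Paris"), ("vol2", "Berlin"), ("vol3", "Madrid"),
        ("vol4", "Rome"), ("vol5", "Tokyo")}
print(precision_recall(predicted, gold))
```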
Making the Data Accessible
There are many options for making the data accessible. The researchers for this project made the data available in two ways through a single medium, their website: the data can be downloaded, or it can be searched and visualized directly on the site. The downloadable files are .csv files containing the merged volume metadata from HTRC and the Google geocoding data, broken up chronologically. The search-and-visualize aspect is made possible through a Postgres database and web design; however, there are many database options, most of which have a free version that can be upgraded to a paid service with more storage and features. Your institution may also already have database options available.
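To give a concrete, hypothetical sense of the search side, the sketch below exposes a small Flask endpoint that queries the Postgres table sketched earlier and returns the top locations as JSON for a front end to map or chart. Flask, the table, and the column names are assumptions, not the project's actual stack.

```python
# A minimal sketch of a web search endpoint over the volume_locations table,
# returning aggregated location counts as JSON for mapping or charting.
# pip install flask psycopg2-binary
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/locations")
def search_locations():
    language = request.args.get("language", "eng")
    year_from = int(request.args.get("from", 1700))
    year_to = int(request.args.get("to", 2024))

    conn = psycopg2.connect("dbname=textgeo user=postgres")  # placeholder DSN
    cur = conn.cursor()
    cur.execute(
        """SELECT location, lat, lng, SUM(mention_count) AS mentions
           FROM volume_locations
           WHERE language = %s AND pub_year BETWEEN %s AND %s
           GROUP BY location, lat, lng
           ORDER BY mentions DESC
           LIMIT 100""",
        (language, year_from, year_to),
    )
    rows = cur.fetchall()
    conn.close()
    return jsonify([
        {"location": loc, "lat": lat, "lng": lng, "mentions": n}
        for loc, lat, lng, n in rows
    ])

if __name__ == "__main__":
    app.run(debug=True)
```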
The Results
The result of this project is a website that allows you to search a database of 10 million HathiTrust volumes in English, Spanish, German, and Chinese from 1700 to the present day and visualize the locations mentioned in those volumes. Searches can be filtered by language, author, date, and subject heading and include geospatial and graphical visualization options, so you can see, for example, which locations are most prominent in Spanish works of fiction from 1850 to 1900. In addition, researchers can download the data for their own use without having to go through the process described above to acquire, match, correct, and clean the data. Thus, a very valuable tool and dataset was created and made freely available to any and all interested scholars.
Given the large number of volumes in the HathiTrust and the analytics tools made available by the HTRC, there are many possibilities for future grant-funded research that can enrich the research landscape by creating additional open-source tools and/or providing open access to data for other scholars.