HTRC Research Impact

This is a list of scholarly articles, datasets, and other research products from the scholarly community that have made substantial use of HTRC’s tools, services, expertise, and unique access to HathiTrust data. Research produced by HTRC and its staff is listed separately.

Monographs

Franklin, Samuel W. (2023). The Cult of Creativity: A Surprisingly Recent History. Chicago: The University of Chicago Press.

Based in part on Franklin’s 2016 HTRC Advanced Collaborative Support grant, “Inside the Creativity Boom.” Read the project report.

Sinykin, Dan (2023). Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. New York: Columbia University Press.

Based in part on Sinykin’s 2019 HTRC Advanced Collaborative Support grant, “Supporting The Conglomerate Era Project.” Read the project report.

So, Richard Jean (2021). Redlining Culture: A Data History of Racial Inequality and Postwar Fiction. New York: Columbia University Press.

Based in part on So’s 2017 HTRC Advanced Collaborative Support grant, “A Computational History of the U.S. Novel, 1950-2000.” Read the project report.

Underwood, Ted (2019). Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.

Based in part on Underwood’s longstanding collaborations with HTRC, including his co-creation of HTRC’s “NovelTM Datasets for English-Language Fiction, 1700-2009” (read the dataset project report here); his dataset of “Word Frequencies in English-Language Literature, 1700-1922”; and others.

Scholarly Articles

Adams, A.L. (2021). Online tools for digital humanities. Public Services Quarterly, 17(3), 177-182, DOI: https://doi.org/10.1080/15228959.2021.1938789

Bagga, S., & Piper, A. (2022). HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust. Journal of Open Humanities Data, 8(7). DOI: https://doi.org/10.5334/johd.71

Beausang, C. (2022). Diachronic delta: A computational method for analysing periods of accelerated change in literary datasets. Digital Scholarship in the Humanities, 37(3), 644–659. https://doi.org/10.1093/llc/fqab041

Bhattacharyya, S. (2017). Words in a world of scaling-up: Epistemic normativity and text as data. Sanglap, 4(1), 31-42. https://sanglap-journal.in/index.php/sanglap/article/view/86

Brown, N.M., Mendenhall, R., Black, M.L., Van Moer, M., Zerai, A., Flynn, K. (2016). Mechanized Margin to Digitized Center: Black Feminism's Contributions to Combatting Erasure within the Digital Humanities. International Journal of Humanities and Arts Computing, 10(1): 110-125. DOI: 10.3366/ijhac.2016.0163

Brown, N. M., Mendenhall, R., Black, M., Moer, M. V., Flynn, K., McKee, M., … Zhai, C. (2019). In Search of Zora/When Metadata Isn’t Enough: Rescuing the Experiences of Black Women Through Statistical Modeling. Journal of Library Metadata, 19(3–4), 141–162. DOI: 10.1080/19386389.2019.1652967

Chang, K. K., & Moravec, M. (2021). Feminist bestsellers: A digital history of 1970s feminism. Issue 7: Post45 x Journal of Cultural Analytics. https://post45.org/2021/04/feminist-bestsellers-a-digital-history-of-1970s-feminism/

Craig, K. (2018). Introduction to Bookworm; Robots Reading Vogue; Bookworm: HathiTrust; Bookworm: Open Library; Building a Bookworm. Journal of American History, 105(1): 244–247. DOI: https://doi.org/10.1093/jahist/jay139

Dobson, J. (2021). Interpretable Outputs: Criteria for Machine Learning in the Humanities. Digital Humanities Quarterly, 15(2). http://digitalhumanities.org:8081/dhq/vol/15/2/000555/000555.html

Dobson, J. (2022). Vector hermeneutics: On the interpretation of vector space models of text. Digital Scholarship in the Humanities, 37(1). DOI: https://doi.org/10.1093/llc/fqab079

Ehrlich, H. (2015). Poe in Cyberspace: Balloons! Drones!! The Global Internet!!! The Edgar Allan Poe Review, 16(2), 242–246. https://doi.org/10.5325/edgallpoerev.16.2.0242

Erlin, M., Piper, A., Knox, D., Pentecost, S. and Blank, A. (2022). The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-language Originals. Journal of Open Humanities Data, 8(0), p. 29. DOI: https://doi.org/10.5334/johd.94

Evans, E., & Wilkens, M. (2018). Nation, ethnicity, and the geography of British fiction, 1880-1940. Journal of Cultural Analytics, 3(2). https://doi.org/10.22148/16.024

Grallert, T. (2022). Open Arabic Periodical Editions: A Framework for Bootstrapped Scholarly Editions Outside the Global North. Digital Humanities Quarterly, 16(2). http://digitalhumanities.org:8081/dhq/vol/16/2/000593/000593.html 

Hamilton, S., & Piper, A. (2023). MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9(3). DOI: https://doi.org/10.5334/johd.95

Kelly, N.M., White, N., Glass, L. (2021). Squatter Regionalism: Postwar Fiction, Geography, and the Program Era. Journal of Cultural Analytics 6(2). DOI: https://doi.org/10.22148/001c.22332

Kilner, K., & Fitch, K. (2017). Searching for My Lady’s Bonnet: discovering poetry in the National Library of Australia’s newspapers database. Digital Scholarship in the Humanities, 32(1), i69–i83. DOI: https://doi.org/10.1093/llc/fqw062

Krause, T. B. (2018). IE4 : Forays into text mining. Folia Linguistica, 52(s39-s2), 511-520. DOI: https://doi.org/10.1515/flih-2018-0020

Lee, A. S., Chiarawongse, P., Guldi, J., & Zsom, A. (2020). The Role of Critical Thinking in Humanities Infrastructure: The Pipeline Concept with a Study of HaToRI (Hansard Topic Relevance Identifier). Digital Humanities Quarterly, 14(3). http://digitalhumanities.org:8081/dhq/vol/14/3/000481/000481.html 

Le-Khac, L., & Hao, K. (2021). The Asian American Literature We’ve Constructed. Journal of Cultural Analytics 6(2). DOI: https://doi.org/10.22148/001c.22330

Moravec, M., Chang, K.K. (2021). Feminist Bestsellers: A Digital History of 1970s Feminism. Journal of Cultural Analytics 6(2). DOI: https://doi.org/10.22148/001c.22333

Nurmikko-Fuller, T. (2022). Teaching Linked Open Data using Bibliographic Metadata. Journal of Open Humanities Data, 8(6). DOI: https://doi.org/10.5334/johd.60

Organisciak, P., Schmidt, B. M., & Durward, M. (2023). Approximate nearest neighbor for long document relationship labeling in digital libraries. International Journal on Digital Libraries, 24(4), 311-325. https://doi.org/10.1007/s00799-023-00354-5

Ravenscroft, A., Allen, C. (2019). Finding and Interpreting Arguments: An Important Challenge for Humanities Computing and Scholarly Practice. DHQ: Digital Humanities Quarterly, 13(4). http://digitalhumanities.org:8081/dhq/vol/13/4/000436/000436.html

Schmidt, B. (2018). Stable random projection: Lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics, 3(1). https://doi.org/10.22148/16.025

Shanahan, J., Burke, R., Lučić, A. (2020). Reading Chicago Reading: Quantitative Analysis of a Repeating Literary Program. DHQ: Digital Humanities Quarterly, 14(2). http://digitalhumanities.org:8081/dhq/vol/14/2/000461/000461.html

Sinykin, D., Roland, E. (2021). Against Conglomeration: Nonprofit Publishing and American Literature After 1980. Journal of Cultural Analytics, 6(2). DOI: https://doi.org/10.22148/001c.22331

Soto-Corominas, A., De la Rosa, J., & Suárez, J. L. (2018). What Loanwords Tell Us about Spanish (and Spain). Digital Studies/le Champ Numérique, 8(1), 4. DOI: http://doi.org/10.16995/dscn.297

Stevens, G. (2017). New Metadata Recipes for Old Cookbooks: Creating and Analyzing a Digital Collection Using the HathiTrust Research Center Portal. Code4Lib Journal, 37(1). Accessed November 23, 2022. https://journal.code4lib.org/articles/12548

Underwood, T. (2016). The Life Cycles of Genres. Journal of Cultural Analytics, 2(2). DOI: https://doi.org/10.22148/16.005

Underwood, T., Bamman, D., Lee, S. (2018). The Transformation of Gender in English-Language Fiction. Journal of Cultural Analytics, 3(2). DOI: https://doi.org/10.22148/16.019

Conference Papers, Presentations, and Posters

Ball, L., & Bothma, T. (2020). The capability of search tools to retrieve words with specific properties from large text collections. In Proceedings of ISIC, the Information Behaviour Conference, Pretoria, South Africa, 28 September - 1 October, 2020. Information Research, 25(4), paper isic2030. DOI: https://doi.org/10.47989/irisic2030

Bamman, D., Carney, M., Gillick, J., Hennesy, C., & Sridhar, V. (2017). Estimating the Date of First Publication in a Large-Scale Digital Library. 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 1-10. DOI: https://doi.org/10.5555/3200334.3200351

Kobak, D., Linderman, G., Steinerberger, S., Kluger, Y., Berens, P. (2020). Heavy-Tailed Kernels Reveal a Finer Cluster Structure in t-SNE Visualisations. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. DOI: https://doi.org/10.1007/978-3-030-46150-8_8

Ledbetter, W., & Spring, J. (2020). Peace Speech Identification Using ABBYY Fine Reader and HathiTrust. 2020 IEEE International Conference on Big Data (Big Data), 5739-5743. DOI: https://doi.org/10.1109/BigData50022.2020.9377870

Organisciak, P., Shetenhelm, S., Vasques, D.F.A., Matusiak, K. (2019). Characterizing Same Work Relationships in Large-Scale Digital Libraries. In: Taylor, N., Christian-Lamb, C., Martin, M., Nardi, B. (eds) Information in Contemporary Society. iConference 2019. Lecture Notes in Computer Science, vol 11420. Springer, Cham. DOI: 10.1007/978-3-030-15742-5_40

Thompson, L., & Mimno, D.M. (2018). Authorless Topic Models: Biasing Models Away from Known Structure. International Conference on Computational Linguistics. https://paperswithcode.com/paper/authorless-topic-models-biasing-models-away

VandenBosch, A., Schmidt, B.M., Matusiak, K.K. and Organisciak, P. (2021). Moving Past Metadata: Improving Digital Libraries with Content-Based Methods. Proceedings of the Association for Information Science and Technology, 58: 849-851. https://doi.org/10.1002/pra2.585

Grants

Berg-Kirkpatrick, T., Bamman, D. (2017-2021). Text in situ: reasoning about visual information in computational analysis of books. Digital Humanities Advancement Grant. National Endowment for the Humanities. FAIN: HAA-256044-17. $325,000.

Choi, K. (2022). Unbiased AI for Computational Poetry Analysis on Massive Digital Collections. Laura Bush 21st Century Librarian Program, Institute of Museum and Library Services. $420,819.

Lučić, A. (2024-2026). Automated Peritext Detection in Fiction and Non-Fiction Works. Digital Humanities Advancement Grant. National Endowment for the Humanities. FAIN: HAA-296436-24. $75,000.

Organisciak, P. (2018-2022). Text Duplication and Similarity in the HathiTrust Digital Library. National Leadership Grants, Institute of Museum and Library Services. $276,943.

Shanahan, J. (2015-2022). Reading Chicago Reading: Modeling Texts and Readers in a Public Library System. Digital Humanities Start-Up Grants. National Endowment for the Humanities. FAIN: HD-248600-16. $74,271.

Underwood, T. (2013-2015). Understanding Genre in a Collection of a Million Volumes. Digital Humanities Start-Up Grants. National Endowment for the Humanities. FAIN: HD-51787-13. $54,576.

Wilkens, M. (2016-2020). Textual geographies. Digital Humanities Implementation Grant. National Endowment for the Humanities. FAIN: HAA-256044-17. $325,000.

Datasets

Bagga, S., & Piper, A. (2021). HATHI 1M: Introducing a million page historical prose dataset in English from the Hathi Trust [Data set]. Harvard Dataverse, V2. DOI: 10.7910/DVN/HAKKUA

Hamilton, S., & Piper, A. (2023). MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9(3). DOI: 10.5334/johd.95

Schmidt, B. M. (2018). Materials for 2D representation of the HathiTrust Library [Data set]. In Journal of Cultural Analytics. Zenodo. DOI: 10.5281/zenodo.1477018

Underwood, T., Kimutis, P., Witte, J. (2020). NovelTM Datasets for English-Language Fiction, 1700-2009. Journal of Cultural Analytics, 5(2). DOI: 10.22148/001c.13147

Wilkens, M., & Ruan, G. (2020). Geographic locations in English-language literature, 1701-2011. DOI: 10.13012/2K5C-RF13

Dissertations

Baciu, D. C. (2018). From everything called Chicago School to the theory of varieties (Order No. 10791150). [Doctoral dissertation, Illinois Institute of Technology]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2070496437).

Ball, L. H. (2020). Enhancing digital text collections with detailed metadata to improve retrieval (Order No. 30700822). [Doctoral dissertation, University of Pretoria, South Africa]. Available from ProQuest Dissertations & Theses Global. (2890697122).

Heuser, R. J. (2019). Abstraction: A literary history (Order No. 28827993). [Doctoral dissertation, Stanford University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2598037086).

Kim, A. (2022). Understanding books through graphs and language models (Order No. 29319913). [Doctoral dissertation, State University of New York at Stony Brook]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2708971635).

Krien, B. (2022). Magazine ecology: Environmental knowledge infrastructures in nineteenth-century periodical publishing (Order No. 29165944). [Doctoral dissertation, University of Iowa]. ProQuest Dissertations & Theses Global Closed Collection. (2697052200).

Mohan, P. (2021). An analysis of gender bias in K-12 assigned literature through comparison of non-contextual word embedding models (Order No. 28264852). [Master of Science dissertation, University of Washington]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2495370013).

Murdock, J. (2019). Topic modeling the reading and writing behavior of information foragers (Order No. 13900717). [Doctoral dissertation, Indiana University]. Available from Dissertations & Theses @ Big Ten Academic Alliance; ProQuest Dissertations & Theses Global.

Peng, Z. (2018). Cloud-based service for access optimization to textual big data (Publication No. 10808749) [Doctoral dissertation, Indiana University]. ProQuest Dissertations & Theses.

Pethe, C. G. (2022). Natural language processing for the large-scale analysis of literary works (Order No. 29209879). [Doctoral dissertation, State University of New York at Stony Brook]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2681065092).

Robinson, C. S. (2021). Negotiating sovereignty within the British Atlantic – Text mining the discourse of colonial South Carolinian elites 1769-1776 (Order No. 28541629). [Doctoral dissertation, Washington State University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2572595609).

Thompson, L. J. (2020). Understanding and directing what models learn (Order No. 28260475). [Doctoral dissertation, Cornell University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2484982965).

Throne, J. (2018). Printed texts and digital doppelgangers: Reading literature in the 21st century (Order No. 13423010). [Doctoral dissertation, University of California, Santa Cruz]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2184770909).