HTRC Research Impact
This is a list of scholarly articles, datasets, and other research products from the scholarly community that have made substantial use of HTRC’s tools, services, expertise, and unique access to HathiTrust data. See also HTRC Publications and Presentations and Grant-funded projects for research produced by HTRC and its staff.
Monographs
Franklin, Samuel W. (2023). The Cult of Creativity: A Surprisingly Recent History. Chicago: The University of Chicago Press.
Based in part on Franklin’s 2016 HTRC Advanced Collaborative Support grant, “Inside the Creativity Boom.” Read the project report.
Sinykin, Dan (2023). Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. New York: Columbia University Press.
Based in part on Sinykin’s 2019 HTRC Advanced Collaborative Support grant, “Supporting The Conglomerate Era Project.” Read the project report.
So, Richard Jean (2021). Redlining Culture: A Data History of Racial Inequality and Postwar Fiction. New York: Columbia University Press.
Based in part on So’s 2017 HTRC Advanced Collaborative Support grant, “A Computational History of the U.S. Novel, 1950-2000.” Read the project report.
Underwood, Ted (2019). Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.
Based in part on Underwood’s longstanding collaborations with HTRC, including his co-creation of HTRC’s “NovelTM Datasets for English-Language Fiction, 1700-2009” (read the dataset project report here); his dataset of “Word Frequencies in English-Language Literature, 1700-1922”; and others.
Scholarly Articles
Adams, A.L. (2021). Online tools for digital humanities. Public Services Quarterly, 17(3), 177-182, DOI: 10.1080/15228959.2021.1938789
Almelhem, A., Iyigun, M., Kennedy, A., & Rubin, J. (2023). Enlightenment ideals and belief in progress in the run-up to the Industrial Revolution: A textual analysis. IZA Discussion Paper No. 16674. Available at SSRN: http://dx.doi.org/10.2139/ssrn.4668604
Baciu, D. C. (2020). Cultural life: Theory and empirical testing. Biosystems, 197, 104208. DOI: 10.1016/j.biosystems.2020.104208
Bagga, S., & Piper, A. (2022). HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust. Journal of Open Humanities Data, 8(7). DOI: 10.5334/johd.71
Ball, L. H., & Bothma, T. J. D. (2022). Investigating the extent to which words or phrases with specific attributes can be retrieved from digital text collections. Information Research, 27(1). DOI: 10.47989/irpaper917
Beausang, C. (2022). Diachronic delta: A computational method for analysing periods of accelerated change in literary datasets. Digital Scholarship in the Humanities, 37(3), 644–659. DOI: 10.1093/llc/fqab041
Bhattacharyya, S. (2017). Words in a world of scaling-up: Epistemic normativity and text as data. Sanglap, 4(1), 31-42. https://sanglap-journal.in/index.php/sanglap/article/view/86
Bologna, F. (2020). A computational approach to urban space in science fiction. Journal of Cultural Analytics, 5(2). DOI: 10.22148/001c.18120
Brown, N.M., Mendenhall, R., Black, M.L., Van Moer, M., Zerai, A., Flynn, K. (2016). Mechanized Margin to Digitized Center: Black Feminism's Contributions to Combatting Erasure within the Digital Humanities. International Journal of Humanities and Arts Computing, 10(1): 110-125. DOI: 10.3366/ijhac.2016.0163
Brown, N. M., Mendenhall, R., Black, M., Moer, M. V., Flynn, K., McKee, M., … Zhai, C. (2019). In Search of Zora/When Metadata Isn’t Enough: Rescuing the Experiences of Black Women Through Statistical Modeling. Journal of Library Metadata, 19(3–4), 141–162. DOI: 10.1080/19386389.2019.1652967
Brown, S. (2022). Graphing the Poetess. Victorian Review, 48(2), 194-200. DOI: 10.1353/vcr.2022.a900622
Chang, K. K., & Moravec, M. (2021). Feminist bestsellers: A digital history of 1970s feminism. Issue 7: Post45 x Journal of Cultural Analytics. https://post45.org/2021/04/feminist-bestsellers-a-digital-history-of-1970s-feminism/
Craig, K. (2018). Introduction to Bookworm; Robots Reading Vogue; Bookworm: HathiTrust; Bookworm: Open Library; Building a Bookworm. Journal of American History, 105(1): 244–247. DOI: https://doi.org/10.1093/jahist/jay139
Dobson, J. (2021). Interpretable Outputs: Criteria for Machine Learning in the Humanities. Digital Humanities Quarterly, 15(2). http://digitalhumanities.org:8081/dhq/vol/15/2/000555/000555.html
Dobson, J. (2022). Vector hermeneutics: On the interpretation of vector space models of text. Digital Scholarship in the Humanities, 37(1). DOI: https://doi.org/10.1093/llc/fqab079
Dobson, J., & Sanders, S. (2022). Distant approaches to the printed page. Digital Studies/le champ numérique (DSCN), 12(1).
Ehrlich, H. (2015). Poe in Cyberspace: Balloons! Drones!! The Global Internet!!! The Edgar Allan Poe Review, 16(2), 242–246. https://doi.org/10.5325/edgallpoerev.16.2.0242
Erlin, M., Piper, A., Knox, D., Pentecost, S. and Blank, A. (2022). The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-language Originals. Journal of Open Humanities Data, 8(0), p. 29. DOI: https://doi.org/10.5334/johd.94
Evans, E., & Wilkens, M. (2018). Nation, ethnicity, and the geography of British fiction, 1880-1940. Journal of Cultural Analytics, 3(2). https://doi.org/10.22148/16.024
Freedman, R. (2018). Listening to Melancholia: The Chansons of Orlando di Lasso. Revue Belge de Musicologie / Belgisch Tijdschrift Voor Muziekwetenschap, 72, 173–191. https://www.jstor.org/stable/45108296
Graham, S., Milligan, I., Weingart, S. B., & Martin, K. (2022). The joys of big data for historians. In S. Graham, I. Milligan, S. B. Weingart, & K. Martin (Eds.), Exploring big historical data: The historian's macroscope (2nd ed., pp. 1-34). World Scientific. https://doi.org/10.1142/9789811243042_0001
Grallert, T. (2022). Open Arabic Periodical Editions: A Framework for Bootstrapped Scholarly Editions Outside the Global North. Digital Humanities Quarterly, 16(2). http://digitalhumanities.org:8081/dhq/vol/16/2/000593/000593.html
Hamilton, S., & Piper, A. (2023). MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9(3). DOI: https://doi.org/10.5334/johd.95
Kelly, N.M., White, N., Glass, L. (2021). Squatter Regionalism: Postwar Fiction, Geography, and the Program Era. Journal of Cultural Analytics, 6(2). DOI: https://doi.org/10.22148/001c.22332
Kilner, K., & Fitch, K. (2017). Searching for My Lady’s Bonnet: discovering poetry in the National Library of Australia’s newspapers database. Digital Scholarship in the Humanities, 32(1), i69–i83. DOI: https://doi.org/10.1093/llc/fqw062
Klancher, J. (2018). Scale and skill in British print culture: Reading the technologies, 1680–1820. Studies in Eighteenth-Century Culture, 47(1), 89-106. DOI: https://doi.org/10.1353/sec.2018.0008
Krause, T. B. (2018). IE4 : Forays into text mining. Folia Linguistica, 52(s39-s2), 511-520. DOI: https://doi.org/10.1515/flih-2018-0020
Lee, A. S., Chiarawongse, P., Guldi, J., & Zsom, A. (2020). The Role of Critical Thinking in Humanities Infrastructure: The Pipeline Concept with a Study of HaToRI (Hansard Topic Relevance Identifier). Digital Humanities Quarterly, 14(3). http://digitalhumanities.org:8081/dhq/vol/14/3/000481/000481.html
Le-Khac, L., & Hao, K. (2021). The Asian American Literature We’ve Constructed. Journal of Cultural Analytics, 6(2). DOI: https://doi.org/10.22148/001c.22330
Liddle, D. (2019). Could fiction have an information history? Statistical probability and the rise of the novel. Journal of Cultural Analytics, 4(2). DOI: 10.22148/16.033
Marquez, X. (2023). Ancient tyranny and modern dictatorship. The Review of Politics. Advance online publication. Available at SSRN: https://ssrn.com/abstract=4662514
McAlister, S., Allen, C., Ravenscroft, A., Reed, C., Bourget, D., Lawrence, J., Börner, K., & Light, R. (2014). From big data to argument analysis and automated extraction: A selective study of argument in the philosophy of animal psychology from the volumes of the Hathi Trust collection. In Digging by debating.
Moravec, M., Chang, K.K. (2021). Feminist Bestsellers: A Digital History of 1970s Feminism. Journal of Cultural Analytics, 6(2). DOI: https://doi.org/10.22148/001c.22333
Nurmikko-Fuller, T. (2022). Teaching Linked Open Data using Bibliographic Metadata. Journal of Open Humanities Data, 8(6). DOI: https://doi.org/10.5334/johd.60
Organisciak, P., Newman, M., Eby, D., Acar, S., & Dumas, D. (2023). How do the kids speak? Improving educational use of text mining with child-directed language models. Information and Learning Science, 124(1), 25-47. DOI: 10.1108/ILS-06-2022-0082
Organisciak, P., & Ryan, M. (2024). Improving text relationship modelling with artificial data. Journal of Information Science, 50(2), 434-446. DOI: 10.1177/01655515221093031
Organisciak, P., Schmidt, B. M., & Durward, M. (2023). Approximate nearest neighbor for long document relationship labeling in digital libraries. International Journal on Digital Libraries, 24(4), 311-325. https://doi.org/10.1007/s00799-023-00354-5
Piper, A. (2016). Fictionality. Journal of Cultural Analytics, 2(2). DOI: 10.22148/16.011
Piper, A. (2022). Biodiversity is not declining in fiction. Journal of Cultural Analytics, 7(3). DOI: 10.22148/001c.38739
Ravenscroft, A., Allen, C. (2019). Finding and Interpreting Arguments: An Important Challenge for Humanities Computing and Scholarly Practice. DHQ: Digital Humanities Quarterly, 13(4). http://digitalhumanities.org:8081/dhq/vol/13/4/000436/000436.html
Schmidt, B. (2018). Stable random projection: Lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics, 3(1). DOI: 10.22148/16.025
Shanahan, J., Burke, R., Lučić, A. (2020). Reading Chicago Reading: Quantitative Analysis of a Repeating Literary Program. DHQ: Digital Humanities Quarterly, 14(2). http://digitalhumanities.org:8081/dhq/vol/14/2/000461/000461.html
Sinykin, D., Roland, E. (2021). Against Conglomeration: Nonprofit Publishing and American Literature After 1980. Journal of Cultural Analytics, 6(2). DOI: 10.22148/001c.22331
Sobchuk, O., & Šeļa, A. (2024). Computational thematics: Comparing algorithms for clustering the genres of literary fiction. Humanities and Social Sciences Communications, 11(438). DOI: 10.1057/s41599-024-02933-6
Soto-Corominas, A., De la Rosa, J., & Suárez, J. L. (2018). What Loanwords Tell Us about Spanish (and Spain). Digital Studies/le Champ Numérique, 8(1), 4. DOI: 10.16995/dscn.297
Stevens, G. (2017). New Metadata Recipes for Old Cookbooks: Creating and Analyzing a Digital Collection Using the HathiTrust Research Center Portal. Code4Lib Journal, 37(1). Accessed November 23, 2022. https://journal.code4lib.org/articles/12548
Stuhler, O. (2024). The gender agency gap in fiction writing (1850 to 2010). Proceedings of the National Academy of Sciences of the United States of America, 121(29), e2319514121. DOI: 10.1073/pnas.2319514121
Underwood, T. (2016). The Life Cycles of Genres. Journal of Cultural Analytics, 2(2). DOI: 10.22148/16.005
Underwood, T. (2018). Why literary time is measured in minutes. ELH, 85(2), 341-365. DOI: 10.1353/elh.2018.0013
Underwood, T., Bamman, D., Lee, S. (2018). The Transformation of Gender in English-Language Fiction. Journal of Cultural Analytics, 3(2). DOI: 10.22148/16.019
Underwood, T., & So, R. J. (2021). Can we map culture? Journal of Cultural Analytics, 6(3). DOI: 10.22148/001c.24911
Wilkens, M. (2021). Too isolated, too insular: American literature and the world. Journal of Cultural Analytics, 6(3). DOI: 10.22148/001c.25273
Conference Papers, Presentations, and Posters
Ball, L., & Bothma, T. (2020). The capability of search tools to retrieve words with specific properties from large text collections. In Proceedings of ISIC, the Information Behaviour Conference, Pretoria, South Africa, 28 September - 1 October, 2020. Information Research, 25(4), paper isic2030. DOI: 10.47989/irisic2030
Bamman, D., Carney, M., Gillick, J., Hennesy, C., & Sridhar, V. (2017). Estimating the Date of First Publication in a Large-Scale Digital Library. 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 1-10. DOI: 10.5555/3200334.3200351
Burns, P. J. (2019). Hacking Multi-word Named Entity Recognition on HathiTrust Extracted Features Data. Paper presented at ACH/ICCH 2019. Pittsburgh, PA.
Burns, P. J. (2020). Mapping library subject headings with the HathiTrust extracted features dataset. Digital Publication in Mediterranean Archaeology, Institute for the Study of the Ancient World, October 2020.
Kim, A., Pethe, C., & Skiena, S. (2020). What time is it? Temporal analysis of novels. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9076–9086). Association for Computational Linguistics.
Kobak, D., Linderman, G., Steinerberger, S., Kluger, Y., Berens, P. (2020). Heavy-Tailed Kernels Reveal a Finer Cluster Structure in t-SNE Visualisations. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. DOI: 10.1007/978-3-030-46150-8_8
Ledbetter, W., & Spring, J. (2020). Peace Speech Identification Using ABBYY Fine Reader and HathiTrust. 2020 IEEE International Conference on Big Data (Big Data), 5739-5743. DOI: 10.1109/BigData50022.2020.9377870
Lucic, A., Burke, R., & Shanahan, J. (2019). Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019) (pp. 53-56). ACM-IEEE Joint Conference on Digital Libraries JCDL. DOI: 10.1109/JCDL.2019.00018
McConnaughey, L., Dai, J., & Bamman, D. (2017). The labeled segmentation of printed books. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 737–747). Association for Computational Linguistics.
Miller, D. M., Wermer-Colan, H. A., Stefan, S., & Kane, M. (2023). Modeling Eco-Poetics and Eco-Politics in 20th Century Anglophone Climate Fiction: Toxic Water. Short paper presented at the ADHO Conference 2023, University of Graz, Graz, Austria.
Organisciak, P., Shetenhelm, S., Vasques, D.F.A., Matusiak, K. (2019). Characterizing Same Work Relationships in Large-Scale Digital Libraries. In: Taylor, N., Christian-Lamb, C., Martin, M., Nardi, B. (eds) Information in Contemporary Society. iConference 2019. Lecture Notes in Computer Science, vol 11420. Springer, Cham. DOI: 10.1007/978-3-030-15742-5_40
Organisciak, P., Therrell, G., Ryan, M., & Schmidt, B. M. (2019). Examining patterns of text reuse in digitized text collections. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (pp. 361-362). IEEE. https://doi.org/10.1109/JCDL.2019.00071
Thai, K., Chang, Y., Krishna, K., & Iyyer, M. (2022). RELiC: Retrieving evidence for literary claims. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7500–7518). Association for Computational Linguistics.
Thompson, L., & Mimno, D.M. (2018). Authorless Topic Models: Biasing Models Away from Known Structure. International Conference on Computational Linguistics. https://paperswithcode.com/paper/authorless-topic-models-biasing-models-away
VandenBosch, A., Schmidt, B.M., Matusiak, K.K. and Organisciak, P. (2021). Moving Past Metadata: Improving Digital Libraries with Content-Based Methods. Proceedings of the Association for Information Science and Technology, 58: 849-851. DOI: 10.1002/pra2.585
Grants
Sponsored research projects that prominently cite or make direct use of HTRC platforms and data products, or that rely substantially on HTRC staff in advisory or consultative roles.
Bamman, David (2020-2025). Multilingual BookNLP: Building a Literary NLP Pipeline Across Languages. Digital Humanities Advancement Grants. National Endowment for the Humanities. FAIN: HAA-271654-20. $292,054.00
Berg-Kirkpatrick, T., Bamman, D. (2017-2021). Text in situ: reasoning about visual information in computational analysis of books. Digital Humanities Advancement Grant. National Endowment for the Humanities. FAIN: HAA-256044-17. $325,000.
Choi, K. (2022). Unbiased AI for Computational Poetry Analysis on Massive Digital Collections. Laura Bush 21st Century Librarian Program, Institute of Museum and Library Services. $420,819.
Lučić, Ana (2024-2026). Automated Peritext Detection in Fiction and Non-Fiction Works. Digital Humanities Advancement Grant. National Endowment for the Humanities. FAIN: HAA-296436-24. $75,000.
Organisciak, Peter (2018-2022). Text Duplication and Similarity in the HathiTrust Digital Library. National Leadership Grants, Institute of Museum and Library Services. $276,943
Samberg, Rachael, et al. (2019-2021). Building Legal Literacies for Text Data Mining. Institute for Advanced Topics in the Digital Humanities, National Endowment for the Humanities. $165,000. Learn more
Shanahan, J. (2015-2022). Reading Chicago Reading: Modeling Texts and Readers in a Public Library System. Digital Humanities Start-Up Grants. National Endowment for the Humanities. FAIN: HD-248600-16. $74,271.
Underwood, Ted (2013-2015). Understanding Genre in a Collection of a Million Volumes. Digital Humanities: Digital Humanities Start-Up Grants. National Endowment for the Humanities. FAIN: HD-51787-13. $54,576. Interim Report
Underwood, Ted (2020-2021). Broadening Access to Text Analysis by Describing Uncertainty. National Endowment for the Humanities. FAIN: PR-268817-20. $73,122. Learn more
Potvin, Sarah, and Alex Wermer-Colan (2024-2025). Data Speculations. Institute of Museum and Library Services National Forum. Log no. LG-254864-OLS-23. $124,391. Learn more
Wilkens, M. (2016-2020). Textual geographies. Digital Humanities Implementation Grant. National Endowment for the Humanities. FAIN: HAA-256044-17. $325,000.
Datasets
Bagga, S., & Piper, A. (2021). HATHI 1M: Introducing a million page historical prose dataset in English from the Hathi Trust [Data set]. Harvard Dataverse, V2. DOI: 10.7910/DVN/HAKKUA
Hamilton, S., & Piper, A. (2023). MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9, 3. DOI: 10.5334/johd.95
Schmidt, B. M. (2018). Materials for 2D representation of the HathiTrust Library [Data set]. In Journal of Cultural Analytics. Zenodo. DOI: 10.5281/zenodo.1477018
Underwood, T., Kimutis, P., Witte, J. (2020). NovelTM Datasets for English-Language Fiction, 1700-2009. Journal of Cultural Analytics, 5(2). DOI: 10.22148/001c.13147
Wilkens, M., & Ruan, G. (2020). Geographic locations in English-language literature, 1701-2011. DOI: 10.13012/2K5C-RF13
Theses and Dissertations
Baciu, D. C. (2018). From everything called Chicago School to the theory of varieties (Order No. 10791150). [Doctoral dissertation, Illinois Institute of Technology]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2070496437).
Ball, L. H. (2020). Enhancing digital text collections with detailed metadata to improve retrieval (Order No. 30700822). [Doctoral dissertation, University of Pretoria, South Africa]. Available from ProQuest Dissertations & Theses Global. (2890697122).
Heuser, R. J. (2019). Abstraction: A literary history (Order No. 28827993). [Doctoral dissertation, Stanford University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2598037086).
Jett, J. (2019). Towards a general conceptual model for bibliographic aggregates: Four case studies from our bibliographic standards. [Doctoral dissertation, University of Illinois].
Kim, A. (2022). Understanding books through graphs and language models (Order No. 29319913). [Doctoral dissertation, State University of New York at Stony Brook]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2708971635).
Krien, B. (2022). Magazine ecology: Environmental knowledge infrastructures in nineteenth-century periodical publishing (Order No. 29165944). [Doctoral dissertation, University of Iowa]. ProQuest Dissertations & Theses Global Closed Collection. (2697052200).
Mohan, P. (2021). An analysis of gender bias in K-12 assigned literature through comparison of non-contextual word embedding models (Order No. 28264852). [Master of Science dissertation, University of Washington]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2495370013).
Murdock, J. (2019). Topic modeling the reading and writing behavior of information foragers (Order No. 13900717). [Doctoral dissertation, Indiana University]. Available from Dissertations & Theses @ Big Ten Academic Alliance; ProQuest Dissertations & Theses Global.
Peng, Z. (2018). Cloud-based service for access optimization to textual big data (Publication No. 10808749) [Doctoral dissertation, Indiana University]. ProQuest Dissertations & Theses.
Pethe, C. G. (2022). Natural language processing for the large-scale analysis of literary works (Order No. 29209879). [Doctoral dissertation, State University of New York at Stony Brook]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2681065092).
Robinson, C. S. (2021). Negotiating sovereignty within the British Atlantic – Text mining the discourse of colonial South Carolinian elites 1769-1776 (Order No. 28541629). [Doctoral dissertation, Washington State University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2572595609).
Stevens, G. (2017). Early American cookbooks: Creating and analyzing a digital collection using the HathiTrust Research Center Portal (Master's capstone). Graduate Faculty in Liberal Studies, The City University of New York.
Thompson, L. J. (2020). Understanding and directing what models learn (Order No. 28260475). [Doctoral dissertation, Cornell University]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2484982965).
Throne, J. (2018). Printed texts and digital doppelgangers: Reading literature in the 21st century (Order No. 13423010). [Doctoral dissertation, University of California, Santa Cruz]. Available from ProQuest Dissertations & Theses Global; ProQuest Dissertations & Theses Global Closed Collection. (2184770909).