Notes of the user group meeting on November 19, 2014

Participants: David Mimno (presenter, from Cornell), Michelle Paolillo (Cornell), Michael Black (UIUC), Matthew Wilkens (Notre Dame), Peter Leonard (Yale)

HTRC staff: Loretta Auvil, Miao Chen

David Mimno introduced a word similarity project that he developed using the HathiTrust non-Google-digitized open collection. The project is available at http://mimno.infosci.cornell.edu/wordsim/nearest.html?q=caste, where you can query a word and find its most similar words in a specific year.

He explained the similarity with an example from the last user group meeting: how do we study the meaning of the word "creativity"? It became a common word after World War II, but we don't know why or what this means. The tool he developed can help by showing the broad context of the word.

He generated a similarity matrix using a 3rd level function. The matrix is built from page-level data for a specific year. Similarity is defined as co-occurrence similarity; other measures, such as pairwise similarity, are also possible. The page-level word counts do not preserve word order, but they do include part-of-speech (POS) information. He used the POS information to filter words and then discarded it for the rest of the workflow.
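
As a rough illustration of the idea, the sketch below builds word vectors from page-level counts, filters by POS tag, drops the tags, and ranks words by cosine similarity of their page-occurrence vectors. This is only a minimal sketch under stated assumptions: the notes do not say which similarity formula was used, so cosine is an assumption, and the toy pages, the Penn Treebank-style tag names, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy input: each page is a list of (token, POS tag, count) triples.
# Word order within a page is not available, only counts.
pages = [
    [("caste", "NN", 2), ("system", "NN", 1), ("rigidly", "RB", 1)],
    [("caste", "NN", 1), ("system", "NN", 2), ("class", "NN", 1)],
    [("class", "NN", 3), ("system", "NN", 1)],
]

# Assumed filter: keep nouns, verbs, and adjectives; adverbs (RB) are dropped.
KEEP_PREFIXES = ("NN", "VB", "JJ")


def word_page_vectors(pages):
    """Map each kept word to a sparse vector {page_index: count}.

    POS tags are used only for filtering and are discarded afterwards.
    """
    vectors = defaultdict(dict)
    for page_id, page in enumerate(pages):
        for token, pos, count in page:
            if pos.startswith(KEEP_PREFIXES):
                vectors[token][page_id] = vectors[token].get(page_id, 0) + count
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(word, vectors, top_n=5):
    """Return the top_n words most similar to `word`, best first."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


vectors = word_page_vectors(pages)
for other, score in nearest("caste", vectors):
    print(other, round(score, 3))
```

In this toy data "system" ranks above "class" for the query "caste" because it shares more pages with it; the adverb "rigidly" never enters the vocabulary.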

High dimensionality in historical text processing: this data set has a very large number of dimensions. He looked at the most frequent words in a given year. One open question is how to make sure we're not biased towards the most informative year.
He ignored adverbs but kept verbs, although words such as "were" may not be very important. He also removed the words on a hand-crafted stop list. Locality-sensitive hashing was used for scalability. He used a cluster to distribute the processing by year, and the job took less than an hour to finish.
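
To make the scalability point concrete, here is a minimal sketch of one common locality-sensitive hashing scheme, random-hyperplane hashing, which approximates cosine similarity with short bit signatures. The notes do not say which LSH family the project used, so this choice is an assumption, and every name and parameter below is hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Hash a sparse vector {index: value} to a tuple of bits.

    Each bit records which side of one hyperplane the vector falls on.
    Vectors separated by a small angle agree on most bits.
    """
    return tuple(
        1 if sum(plane[i] * value for i, value in vector.items()) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def bucket_by_signature(vectors, hyperplanes):
    """Group words whose signatures match exactly; these are candidate neighbors."""
    buckets = {}
    for word, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(word)
    return buckets


# Hypothetical word vectors over three pages (dimension = number of pages).
toy_vectors = {
    "caste": {0: 2, 1: 1},
    "castes": {0: 4, 1: 2},   # same direction as "caste", twice the counts
    "anti": {0: -2, 1: -1},   # opposite direction, included only for illustration
}
planes = make_hyperplanes(num_bits=16, dim=3, seed=42)
signatures = {w: signature(v, planes) for w, v in toy_vectors.items()}
buckets = bucket_by_signature(toy_vectors, planes)
```

Only words that land in the same or nearby buckets need an exact similarity computation, which avoids comparing every pair of words. Because each year is processed independently, a step like this could run as a separate job per year.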

OCR quality issue: the project is biased towards frequent words, so OCR errors are not a big problem. Restricting the data to texts after 1800 is a fairly reliable way to avoid OCR errors.

Question: Do you have a sense of whether lemmatization, rather than stemming, would give interesting results?
I haven't thought about it, but it could be interesting. We need to be careful about stemming because it usually doesn't give me what I want.

Then the participants looked at an example query for "caste" and the similar words it returned.

Question: What's the potential use of this similarity?
It can be used in classification tasks. It wouldn't be a good idea to use it on modern text.

Question: How many processors did you use?

It was 124 jobs in total, one for each year.

-------------

Meeting announcement

Dear Scholars,

We are having our November user group teleconference on Wednesday, 11/19, 3-4pm ET (2-3pm CT). Please see below for call-in information.

For this meeting we invited Prof. David Mimno from Cornell University to discuss two main issues:

1) use case examples for the HTRC feature extraction functionality

2) challenges presented to HTRC text mining or, more broadly, to historical text: scalability, the curse of dimensionality, and OCR errors.

Everyone is welcome to join the discussion and share your thoughts. Please drop me a line if you plan to call in.

For more HTRC events, please see https://wiki.htrc.illinois.edu/display/COM/HathiTrust+Research+Community+Pages