Participants: David Mimno (presenter, Cornell), Michelle Paolillo (Cornell), Michael Black (UIUC), Matthew Wilkens (Notre Dame), Peter Leonard (Yale)
HTRC staff: Loretta Auvil, Miao Chen
David Mimno introduced a word similarity project that he developed using the non-Google-digitized open collection from HathiTrust. The project, where you can query a word and find its most similar words in a specific year, is available at http://mimno.infosci.cornell.edu/wordsim/nearest.html?q=caste
He motivated the project with an example from the last user group meeting: how would one study the meaning of the word "creativity"? It became a common word after World War II, but we don't know what it meant or why it became common. The tool he developed can help by showing the broad context of the word.
He generated a similarity matrix using a 3rd level function. The data is page-level word counts for a specific year. Similarity is defined as co-occurrence similarity; other measures, such as pairwise similarity, are possible. The page-level word counts carry no word order, but they do carry part-of-speech (POS) information. He used the POS information to filter words and then discarded it for the rest of the workflow.
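To make the idea of page-level co-occurrence similarity concrete, here is a minimal sketch in Python. It is not the presenter's code: the function names, the toy pages, and the choice of cosine similarity over co-occurrence counts are all assumptions made for illustration. Each page is treated as an unordered bag of words, matching the note that word order is not available.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(pages):
    """Build one co-occurrence vector per word.

    pages: list of word lists, one per page (word order is ignored).
    Two words co-occur once for every page that contains both.
    """
    vectors = defaultdict(Counter)
    for page in pages:
        for a, b in combinations(sorted(set(page)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(word, vectors, k=3):
    """Return the k words whose co-occurrence vectors are closest to `word`."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vectors[other]))
              for other in list(vectors) if other != word]
    # Sort by descending similarity, breaking ties alphabetically.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical toy data standing in for the pages of a single year.
pages = [
    ["caste", "hindu", "india"],
    ["caste", "india", "system"],
    ["bread", "butter"],
]
vectors = cooccurrence_vectors(pages)
print(nearest("caste", vectors, k=2))
```

On real data the vectors would be built separately for each year, so that the nearest words for a query can be compared across years.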
High dimensionality in historical text processing: this data set has a very large number of dimensions. He looked at the most frequent words in a given year. One open question is how to make sure we are not biased towards the most informative year.
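One simple way to keep the dimensionality manageable is to pick the vocabulary separately for each year. The sketch below assumes the input is a stream of (year, word counts) pairs, one per page; that input shape, the function name, and the cutoff `n` are hypothetical and not taken from the presentation.

```python
from collections import Counter


def top_words_by_year(page_counts, n=2):
    """Return the n most frequent words for each year.

    page_counts: iterable of (year, {word: count}) pairs, one per page.
    """
    totals = {}
    for year, counts in page_counts:
        # Add this page's counts to the running total for its year.
        totals.setdefault(year, Counter()).update(counts)
    return {year: [word for word, _ in counter.most_common(n)]
            for year, counter in totals.items()}


# Hypothetical toy input.
page_counts = [
    (1850, {"caste": 3, "the": 10}),
    (1850, {"caste": 2, "india": 1}),
    (1900, {"creativity": 4, "art": 1}),
]
print(top_words_by_year(page_counts, n=2))
```

Because the cutoff is applied within each year, a year with much more text does not decide the vocabulary for the others.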
He ignored adverbs but kept verbs, although words such as "were" may not be very important. He also removed words on a hand-crafted stop list. Locality-sensitive hashing was used for scalability. He used a cluster to distribute the processing by year, and the job took less than an hour to finish.
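The notes do not say which locality-sensitive hashing scheme was used. As an illustration only, the sketch below uses random hyperplane signatures, a common LSH scheme for cosine similarity: each word vector is reduced to a short bit signature, and vectors at a small angle to each other tend to agree on most bits. All names and parameters here are assumptions.

```python
import random


def make_hyperplanes(vocab, n_bits=16, seed=0):
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [{word: rng.gauss(0.0, 1.0) for word in vocab}
            for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """Hash a sparse vector to a tuple of bits.

    Each bit records which side of a hyperplane the vector falls on.
    Every key of `vector` must be in the vocabulary used for the planes.
    """
    bits = []
    for plane in hyperplanes:
        projection = sum(weight * plane[word]
                         for word, weight in vector.items())
        bits.append(1 if projection >= 0 else 0)
    return tuple(bits)


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy vocabulary and vectors.
vocab = ["caste", "india", "hindu", "system"]
planes = make_hyperplanes(vocab, n_bits=16, seed=0)
v = {"caste": 2.0, "india": 1.0}
scaled = {"caste": 4.0, "india": 2.0}    # same direction as v
flipped = {"caste": -2.0, "india": -1.0}  # opposite direction

print(hamming(signature(v, planes), signature(scaled, planes)))
print(hamming(signature(v, planes), signature(flipped, planes)))
```

Comparing short signatures is much cheaper than comparing full co-occurrence vectors, which is what makes this kind of hashing attractive when the vocabulary is large.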
OCR quality: the project is biased towards frequent words, so OCR errors are not a big problem. Restricting the data to texts after 1800 is a fairly reliable way to avoid OCR errors.
Question: Do you have a sense of whether you would get interesting results by lemmatizing instead of stemming?
I haven't thought about it, but it could be interesting. We need to be careful about stemming because it usually doesn't give me what I want.
Then the participants looked at an example query, "caste", and the similar words it returned.
Question: What's the potential use of this similarity?
It can be used in classification tasks. It wouldn't be a good idea to use it on modern text.
Question: How many processors did you use?
It was 124 jobs in total, one for each year.
-------------