Participants: David Mimno (presenter, from Cornell), Michelle Paolillo (Cornell), Lynn Thitchner (Cornell), Michael Black (UIUC), Matthew Wilken (Notre Dame), Peter Leonard (Yale)

...

He explained the similarity with an example from the last user group meeting: how do we study the meaning of the word "creativity"? It became a common word after World War II, but we don't know why, or what this means. The tool he developed can help with the broad context of the word.

...

High dimensionality with historical text processing: this data set has a very large number of dimensions. He looked at the most frequent words for a given year. One question is how we make sure we're not biased towards the most informative year.
He ignored adverbs but kept verbs, although words such as "were" may not be very important. He also removed words on a hand-crafted stop list. Locality-sensitive hashing was used for scalability. He used a cluster to distribute the processing by year. It took less than an hour to finish the job.
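
To illustrate the locality-sensitive hashing idea mentioned above, here is a minimal sketch in Python. The notes do not record which hashing scheme the presenter used, so this assumes the random-hyperplane (sign of random projection) variant; the function names, the toy word-count vectors, and the choice of 16 bits are all hypothetical and are not taken from the talk.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the vector lies on its positive side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(sig_a, sig_b):
    """Count differing bits; fewer differences suggest a smaller angle."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def bucket_by_signature(vectors, hyperplanes):
    """Group keys whose vectors hash to exactly the same signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(key)
    return buckets


# Hypothetical per-year word-count vectors over a 4-word vocabulary.
year_vectors = {
    1940: [3.0, 0.0, 1.0, 5.0],
    1941: [6.0, 0.0, 2.0, 10.0],  # same direction as 1940, twice the volume
    1960: [0.0, 7.0, 4.0, 1.0],
}
planes = make_hyperplanes(num_bits=16, dim=4, seed=42)
buckets = bucket_by_signature(year_vectors, planes)
```

Because each bit depends only on the sign of a dot product, vectors pointing in the same direction always share a signature regardless of their length, and only items that land in the same or nearby buckets need a full comparison. That is where the scalability benefit comes from when the vocabulary is very large.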

...

1) Use case examples for the HTRC feature extraction functionality, available in a Google doc at https://docs.google.com/document/d/14-be-4VBNeVPZsFO-e7UWephgf71LfYvhlr9qssEfTg/edit (document prepared by Sayan Bhattacharyya)
2) Challenges for HTRC text mining or, more broadly, for mining historical text: scalability, the curse of dimensionality, and OCR errors.

...