Notes of the user group meeting on September 24, 2014

Attendees: Samuel Franklin (Brown), Michelle Paolillo (Cornell), Ted Underwood (UIUC), Tassie Gniady (Indiana), David Mimno (Cornell)

HTRC team: Sayan Bhattacharyya, Miao Chen

We held our September user group teleconference on Wednesday, 9/24, 3-4pm ET (2-3pm CT). This time we invited Samuel Franklin from Brown University to talk about his research interests and needs related to HTRC. He presented his research plan for using the HT corpus, and the group followed up with an open discussion.

About Samuel and his project: He is a fifth-year PhD student in American Studies. His project is basically a keyword study of "creativity" after WWII. He studies discourse and communities, using case studies of people with experience championing creativity. It is partially a discourse analysis, e.g. what the concept has influenced in the past.

He has experimented a bit with the Google Ngram Viewer, JSTOR, and ProQuest. He also tried the HTRC portal, hoping to do much more with it. He looked at the curves for the word "creativity", parsed the curves, and found which disciplines account for the parsed curves.

Sayan: For studying discourse within specific communities or groups, would it be useful to have Library of Congress Classification?

Ted Underwood suggested that he look at HTRC Bookworm, which Ted thinks supports this kind of curve parsing.

Ted Underwood mentioned a project he is doing in parallel to Sam's, on discourse around money across time. He uses topic modeling. The next step is to find which topics these words (e.g. creativity) were assigned to, even though there was no topic on creativity itself.

Sayan mentioned that Sam can use the Dunning log-likelihood algorithm in the portal to compare the works of two authors, assuming they both write about creativity.
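
For readers unfamiliar with the Dunning comparison mentioned above, the following is a minimal Python sketch of the idea. It is not the portal's implementation. It assumes the common two-term log-likelihood (G2) form used in corpus comparison, and the function names dunning_g2 and compare_texts are hypothetical. Scores are signed so that positive values mark words overrepresented in the first author's text.

```python
import math
from collections import Counter


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed log-likelihood (G2) score for one word in two corpora.

    count_a / count_b: occurrences of the word in corpus A / B.
    total_a / total_b: total token counts of corpus A / B (must be > 0).
    Positive result: word is relatively more frequent in A.
    """
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2
    # Attach a sign to show which corpus overuses the word.
    if count_a / total_a < count_b / total_b:
        g2 = -g2
    return g2


def compare_texts(tokens_a, tokens_b):
    """Rank every word by signed G2, most characteristic of A first."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {
        word: dunning_g2(counts_a[word], total_a, counts_b[word], total_b)
        for word in set(counts_a) | set(counts_b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Given two tokenized texts, compare_texts(tokens_a, tokens_b) returns the words most characteristic of the first text at the top and those most characteristic of the second at the bottom.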

Another thing Sam looks at is abbreviation and what it means. This can be done using collocations. He is also interested in associations. It would be interesting to track influence by writer, by tradition of creativity, or through citation analysis.
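
As a rough illustration of the collocation idea, here is a minimal Python sketch that counts the words appearing within a fixed window around a keyword. It assumes the text is already tokenized and lowercased. The function name collocates and the default window size are hypothetical choices, not something discussed in the meeting.

```python
from collections import Counter


def collocates(tokens, target, window=2):
    """Count words occurring within `window` tokens of each `target` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    text = "the creativity of the artist and the creativity of science"
    for word, n in collocates(text.split(), "creativity", window=1).most_common():
        print(word, n)
```

Raw window counts favor very frequent words, so in practice one would likely filter stopwords or weight the counts with an association measure.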

How much of what you are interested in can be found in the HTRC portal?
Most of his specific searches yielded few results.

Miao introduced the HTRC Data Capsule as a solution to running algorithms against copyrighted content.

Sayan mentioned WordNet, a kind of network representation of words.

David: It would be great to have some examples showing what we can do with these data. We need to get publicity about what's sitting there. It would also be good to have statistical information, such as word entropy.
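
One possible reading of "word entropy" is the Shannon entropy of a text's word-frequency distribution. The Python sketch below computes it under that assumption. The function name word_entropy is hypothetical, and other definitions (for example, the entropy of a single word's distribution across volumes) may be what was intended.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy, in bits, of the word-frequency distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    # H = -sum(p * log2(p)) over the relative frequency p of each word.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text that repeats one word has entropy 0, while a text in which every word is different has the maximum entropy for its length.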