Notes of the user group meeting on September 24, 2014

Attendees: Samuel Franklin (Brown), Michelle Paolillo (Cornell), Ted Underwood (UIUC), Tassie Gniady (Indiana), David Mimno (Cornell)

HTRC team: Sayan Bhattacharyya, Miao Chen

We held our September user group teleconference on Wednesday, 9/24, 3-4pm ET (2-3pm CT). This time we invited Samuel Franklin from Brown University to talk about his research interests and needs related to HTRC. He presented his research plan using the HT corpus, and the group followed up with an open discussion.

About Samuel and his project: He is a fifth-year PhD student in American Studies. Part of his project is a keyword study around "creativity" after WWII. He studies how different communities of discourse used the term "creativity". It is partially a discourse analysis, e.g. of what the concept influenced in the past.

He has experimented a bit with the Google Ngram Viewer, JSTOR, and ProQuest. He has also tried the HTRC portal and hopes to do much more with it. He has looked at plots of the word "creativity", parsed them, and examined which disciplines account for them.

Sayan: Would you find it useful to treat specific Library of Congress classifications as proxies for communities of discourse?

Ted Underwood suggested that Samuel look at HTRC+Bookworm.

Ted Underwood mentioned a project he is doing in parallel to Sam's, on discourse around money across time. He uses topic modeling. The next step is to find which topics these words (e.g. "creativity") were assigned to, even though there may have been no topic on creativity itself.

Sayan mentioned that Sam can use the Dunning log-likelihood algorithm in the portal to, for example, compare the works of two authors (or two sets of authors), assuming both write about creativity.
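
As a rough illustration of the kind of comparison Sayan described, here is a minimal Python sketch of a Dunning-style log-likelihood score for each word across two token lists. It uses the common two-term form of the statistic and is not the portal's implementation. The function names (dunning_g2, compare_corpora), the signed-score convention, and the toy inputs are assumptions made for illustration, and both token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word in corpus A vs. corpus B."""
    combined = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def compare_corpora(tokens_a, tokens_b):
    """Return {word: signed G2}; positive means over-represented in A."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        g2 = dunning_g2(a, total_a, b, total_b)
        # Illustrative convention: flip the sign when B uses the word more.
        if a / total_a < b / total_b:
            g2 = -g2
        scores[word] = g2
    return scores


if __name__ == "__main__":
    # Toy data standing in for the works of two sets of authors.
    authors_a = "creativity genius art art".split()
    authors_b = "creativity management art team".split()
    ranked = sorted(compare_corpora(authors_a, authors_b).items(),
                    key=lambda item: item[1], reverse=True)
    for word, score in ranked:
        print(f"{word}\t{score:.3f}")
```

Words with large positive scores are distinctive of the first set of texts, and words with large negative scores are distinctive of the second; words used at the same rate in both score zero.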

Another thing Sam looks at is abbreviation, i.e. what a given abbreviation means. This can be done using collocations. He is also interested in associations. It would be interesting to track influence by writer, by tradition of creativity, or through citation analysis.

"How much of what you are interested in can be found in the HTRC portal?"
Most of his specific searches yielded low numbers.
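
The collocation approach mentioned in the discussion can be sketched in a few lines of Python: count the words that appear within a small window around a target term. This is only an illustrative sketch, not a feature of the HTRC portal. The function name collocates, the window size, and the sample sentence are assumptions.

```python
from collections import Counter


def collocates(tokens, target, window=2):
    """Count words occurring within `window` positions of each `target` token."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sentence used only to exercise the function.
    text = "the creativity of the child and the creativity of the engineer"
    for word, n in collocates(text.split(), "creativity", window=2).most_common():
        print(word, n)
```

Raw window counts favor very frequent words such as "the", so in practice one would likely filter stopwords or weight the counts by an association measure.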

Miao introduced the HTRC Data Capsule as a solution to running algorithms against copyrighted content.

Sayan mentioned WordNet, a kind of network representation of words, which can be useful (e.g. to search not just for "creativity" but for "creativity" and its synonyms, "creativity" and words semantically related to it, etc.).

David: It would be great to have some examples that show what we can do with these data. We need to get publicity about what is sitting there. It would also be good to have statistical information, such as word entropy.
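
One possible reading of "word entropy" is the Shannon entropy of the word-frequency distribution of a text. The notes do not say which definition David had in mind, so the following Python sketch is an assumption, as are the function name word_entropy and the toy inputs.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy, in bits, of the word-frequency distribution of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # Toy inputs: four equally likely words, then one repeated word.
    print(word_entropy(["art", "genius", "mind", "talent"]))
    print(word_entropy(["creativity", "creativity"]))
```

Under this definition, a text that spreads its usage evenly over many distinct words has higher entropy than one that repeats a few words.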