Notes of the user group meeting on September 20, 2013
Pre-meeting notes:
- word break (Ted Underwood)
- page vs. volume level (Ted Underwood)
- interaction of two worksets of pages (Sayan Bhattacharyya)
- topic modeling, sentiment analysis (Matthew Wilkens)
- derived stats and informational data (Beth Plale)
Meeting notes:
Ted Underwood (a professor in the Department of English at UIUC) discussed his use case with HTRC people during the meeting.
Ted: this would build on top of the current Data API, adding to it or making changes on top of it.
Jiaan: page-level token counts are already implemented in the API.
Ted: it can go beyond token counts to more specific features.
A page can be divided into pieces to get such features.
Pages can be tables of contents, indexes, poetry, etc.
We can count lines and the number of capitalized characters to get an idea of the page (a sketch of such page-level features appears below).
It is important to provide token counts for the full corpus, since people can usually get the open-open volumes directly from HathiTrust.
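A minimal Python sketch of what such page-level features could look like (the function name page_features and the naive regex tokenizer are assumptions for illustration, not HTRC's actual implementation):

from collections import Counter
import re

def page_features(page_text):
    # page_text is assumed to be the plain OCR text of a single page
    lines = page_text.splitlines()
    tokens = re.findall(r"\w+", page_text)  # naive tokenizer; real rules may differ
    return {
        "line_count": len(lines),
        "empty_line_count": sum(1 for l in lines if not l.strip()),
        "capitalized_char_count": sum(1 for c in page_text if c.isupper()),
        "token_counts": Counter(t.lower() for t in tokens),
    }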
The benefits of recalculating the token counts: one is speed; the other is security, because the counts are completely separated from the original collection. A firewall can sit between the Data API and the volumes.
One challenge is OCR errors, though random errors do not affect results much.
Hyphen/dash challenge: some hyphens come from OCR errors and some do not; the latter need to be joined during preprocessing, possibly by comparing against a dictionary to decide whether to join them.
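A minimal sketch of the dictionary-based joining idea (the helper name join_hyphenated and the plain-set dictionary are assumptions; the real preprocessing rules may differ):

def join_hyphenated(lines, dictionary):
    # Undo end-of-line hyphenation when the joined word appears in the dictionary.
    out = []
    i = 0
    while i < len(lines):
        line = lines[i].rstrip()
        nxt = lines[i + 1].lstrip() if i + 1 < len(lines) else ""
        if line.endswith("-") and line[:-1].split() and nxt.split():
            head = line[:-1].split()[-1]       # fragment before the hyphen
            tail, *rest = nxt.split(None, 1)   # first token of the next line
            joined = head + tail
            if joined.lower() in dictionary:
                out.append(line[: len(line) - 1 - len(head)] + joined
                           + (" " + rest[0] if rest else ""))
                i += 2
                continue
        out.append(lines[i])
        i += 1
    return out

For example, join_hyphenated(["This is a won-", "derful example."], {"wonderful"}) would return ["This is a wonderful example."].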
Punctuation challenge: should punctuation be kept in the output?
Miao: is it necessary to provide tf*idf?
Ted: researchers can calculate their own idf based on their context, so tf is sufficient for them.
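For illustration, given per-document term-frequency counts, idf can be derived in a few lines (a sketch assuming doc_term_counts is a list of {token: count} dicts, one per page or volume in the researcher's chosen sub-corpus):

import math
from collections import Counter

def idf(doc_term_counts):
    n_docs = len(doc_term_counts)
    doc_freq = Counter()
    for counts in doc_term_counts:
        doc_freq.update(counts.keys())  # number of documents containing each token
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}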
Loretta: provide stemmed version of words?
Ted: researchers can do this on their own.
Sayan: could units be provided at the chapter level? Where are the breaks on a page? What about sentence lengths within a page?
Ted: that may raise a security issue, but the total number of lines per page and the number of empty lines are fine.
Ted's use case is getting word counts and then applying a genre classifier (his own model).
Loretta summarizes the requested extensions (a rough sketch of the new counts follows):
- count of the number of lines on a page
- count of the number of lines that start with a capitalized token
- use a dictionary to deal with hyphens or non-dictionary tokens at the end of one line and the start of the next, combining these tokens only if the joined form exists in a dictionary
- add counts for punctuation tokens
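The two counts above not covered by the earlier sketches could look roughly like this (again a sketch for discussion, not the agreed implementation; tokenization and the punctuation set are assumptions):

import string

def extra_page_counts(page_text):
    lines = [l for l in page_text.splitlines() if l.strip()]
    # lines whose first token starts with an uppercase letter
    cap_start = sum(1 for l in lines if l.split()[0][0].isupper())
    # punctuation characters treated as their own tokens
    punct = sum(page_text.count(p) for p in string.punctuation)
    return {"cap_start_line_count": cap_start, "punctuation_token_count": punct}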
Action item:
Sayan and Jiaan will get the Python script from Ted and work on a simpler version of its logic.