Attendees: Michelle A. Paolillo (Cornell), Matt Wilkens (Notre Dame), Adam (Cornell), Ian Barba (Texas University Library)
HTRC: Loretta Auvil, Harriett Green, Sayan Bhattacharyya, Miao Chen
2 topics were discussed:
1) feedback on HTRC Bookworm http://sandbox.htrc.illinois.edu/bookworm/
2) NLP features important for your research
It would be nice to integrate Bookworm with curated collections, so that Bookworm can run against a work set.
Matt: on the NLP side of things, two things are useful to his research: POS tagging and named entity recognition (NER).
POS tagging is a relatively solved problem, with about 97% accuracy, and it only needs to be done once, so there is no need to offer it interactively.
NER is also one-time processing, but it is nowhere near as solved as POS tagging and produces a lot of errors. Probably more importantly, users might be able to supply their own data. He wondered how many potential users would have that kind of use case.
Loretta: We have it in the tool set now. We have a list of things to add to Bookworm, and NER is on the list. We will see whether it finally gets implemented.
POS tagging is well solved for modern corpora, but not for older ones. You will see more noise as you go back in time.
Headers/footers will cause problems, and she has that on the Bookworm (BW) list.
Matt: Ted Underwood has been working on header/footer removal. He is not sure it would be easy to integrate with the current workflow.
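To make the header/footer problem concrete, here is a minimal sketch of one generic heuristic. It is not Ted Underwood's method and not part of Bookworm. It assumes each volume is available as a list of pages, each page a list of lines, and it drops a first or last line whose digit-normalized text recurs in the same position on at least half of the pages. The function names and the 0.5 threshold are hypothetical.

```python
import re
from collections import Counter


def normalize(line):
    """Lowercase, trim, and mask digits so 'ROME 12' and 'ROME 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_running_headers(pages, min_share=0.5):
    """Drop first/last lines that repeat in the same position across pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    min_share: hypothetical threshold, the share of pages on which a
    normalized line must recur before it is treated as a running header/footer.
    """
    first = Counter(normalize(p[0]) for p in pages if p)
    last = Counter(normalize(p[-1]) for p in pages if p)
    threshold = min_share * len(pages)

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and first[normalize(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and last[normalize(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


# Made-up pages for illustration only.
pages = [
    ["THE HISTORY OF ROME 12", "It was a dark night.", "Chapter text continues."],
    ["THE HISTORY OF ROME 13", "More body text here.", "And still more."],
    ["THE HISTORY OF ROME 14", "Final page body.", "The end of the chapter."],
]
cleaned = strip_running_headers(pages)
```

A positional heuristic like this would miss headers that alternate between left and right pages or that are damaged by OCR, which is one reason the notes treat the problem as open.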
For POS tagging of older texts, a text processing group at Northwestern University works on that, especially for older material, e.g. the Shakespeare era. He thinks their training data is openly available.
The quality of NER is mixed. Even for people/organizations/locations, there are a lot of errors. If the results are made public, people will question whether they are good enough for their work.
Question: Can I use regular expressions to coalesce word forms, or use wildcards to capture divergent forms, etc.?
Answer: It is not offered in the current system, probably because of the way it is implemented in MySQL, but it should be doable.
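As an illustration of what coalescing word forms with a regular expression could look like, here is a minimal sketch that works on a plain token-count table outside of Bookworm. It does not use Bookworm's API or its MySQL schema; the function name and the counts are made up for the example.

```python
import re
from collections import Counter


def coalesce_counts(token_counts, pattern):
    """Sum the counts of every token that fully matches the regular expression.

    Returns the combined count and the matched forms, so the user can check
    which variants were actually merged.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matched = {tok: n for tok, n in token_counts.items() if regex.fullmatch(tok)}
    return sum(matched.values()), matched


# Made-up counts for illustration only.
counts = Counter({"color": 12, "colour": 30, "colours": 7, "Colored": 3, "collar": 5})
total, forms = coalesce_counts(counts, r"colou?r(s|ed)?")
```

Returning the matched forms alongside the total makes it easier to spot a pattern that is too greedy before trusting the combined curve.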
Michelle and Loretta have the impression that the Bookworm creators are also involved with the Google n-gram viewer. This suggests that Bookworm can be tweaked to do things similar to the Google n-gram viewer.
Loretta: we can change the back end to Solr to scale up to 3M/10M books.
Question: for metadata such as author/gender, publication, state, how complete is that metadata?
Loretta explained how the gender metadata was derived: it comes from Stacy Kowalczyk and Peng Zong's work, presented at the last UnCamp.
Sayan: maybe we need something to distinguish metadata that was decided by an algorithm from metadata that already existed.
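One way to picture Sayan's suggestion is to store a provenance marker next to each metadata value. The sketch below is only an assumption about how that could be modeled. The class, the field names, and the "catalog"/"inferred" labels are hypothetical and do not come from HTRC or Bookworm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataValue:
    value: str
    source: str       # hypothetical labels: "catalog" (pre-existing) or "inferred" (algorithm)
    method: str = ""  # free-text note on how an inferred value was produced


def inferred_fields(record):
    """Return the names of the fields whose values were decided by an algorithm."""
    return sorted(name for name, v in record.items() if v.source == "inferred")


# Placeholder record for illustration only.
record = {
    "author": MetadataValue("Example Author", "catalog"),
    "publication_year": MetadataValue("1850", "catalog"),
    "author_gender": MetadataValue("female", "inferred", method="hypothetical name-based classifier"),
}
flagged = inferred_fields(record)
```

With a marker like this, a facet built on inferred values could be labeled as such in the interface, or excluded by users who only trust catalog metadata.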
Loretta: are there any facets that people may come up with in the future? We would like to have the ability to extend it.
Miao: now the x-axis is time; what else could be the x-axis?
Michelle: one thing you can do is segment an individual text: to look at word frequency within a book, divide it into a certain number of segments and show the frequency of words over the content order of the book.
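Michelle's idea of using position in the book as the x-axis can be sketched as follows. This is a minimal illustration that assumes the text is already tokenized on whitespace. The function name and the sample sentence are made up.

```python
def segment_frequencies(tokens, word, num_segments):
    """Relative frequency of `word` in each of `num_segments` consecutive slices."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    word = word.lower()
    n = len(tokens)
    freqs = []
    for i in range(num_segments):
        # Integer arithmetic keeps the slices contiguous and nearly equal in size.
        start = i * n // num_segments
        end = (i + 1) * n // num_segments
        segment = tokens[start:end]
        hits = sum(1 for t in segment if t.lower() == word)
        freqs.append(hits / len(segment) if segment else 0.0)
    return freqs


# Made-up text for illustration only.
text = "the whale swam and the whale dove then the ship sailed on and on"
tokens = text.split()
profile = segment_frequencies(tokens, "whale", 2)
```

Each entry of `profile` is one point on the x-axis, so plotting the list shows where in the book the word is concentrated.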
Miao mentioned similar work done by Wolfram Alpha.
Loretta: we can also slice time a bit differently.
Sayan asked Michelle: what about providing a single year, to see the results for that particular year rather than for multiple years?
Loretta: the latest Bookworm version has the timeline laid out better, and we will install it.
Sayan: another possibility is to fix a particular place of publication, e.g. New York City, and compare language use at different locations, from east to west: different words, language variation, using a distance measure. Fix a particular place on the map and do the visualization.
Miao mentioned a problem with female vs. male writers for the word "revolution": at a given time point, the sum of the frequency of "revolution" by female writers and the frequency of "revolution" by male writers should be less than or equal to the overall frequency of "revolution", but that is not what HTRC Bookworm shows.
Loretta: it is because of a smoothing issue. You can turn off the smoothing option; you can also use the exact count option instead of the frequency percentage.
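The sketch below does not reproduce Bookworm's smoothing. It only illustrates why the exact count option is easier to sanity-check than percentages: raw counts from disjoint groups add up to the combined count, while percentages do not if each group's percentage is computed against that group's own word total. That normalization is an assumption about the chart, and all numbers are made up.

```python
def relative_frequency(hits, total_words):
    """Occurrences of a word divided by the number of words in the same group."""
    return hits / total_words


# Made-up numbers for a single year, for illustration only.
female = {"hits": 30, "words": 100_000}
male = {"hits": 50, "words": 400_000}

# Exact counts are additive across disjoint groups.
combined_hits = female["hits"] + male["hits"]
combined_words = female["words"] + male["words"]

# Relative frequencies use different denominators, so they are not additive.
female_freq = relative_frequency(female["hits"], female["words"])
male_freq = relative_frequency(male["hits"], male["words"])
overall_freq = relative_frequency(combined_hits, combined_words)

sum_exceeds_overall = (female_freq + male_freq) > overall_freq
```

Under these assumptions the two group frequencies sum to more than the overall frequency even with no smoothing at all, so checking the exact counts first helps separate a normalization effect from a smoothing effect.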
Adam: just started with HTRC. If you have the resources to set up meetings in WebEx etc. for screen sharing, that would help explain things.
Loretta: we will update to the newest version of the Bookworm UI, and probably won't make changes to the backend code unless something is incorrect in the metadata.
When things change, we need to regenerate everything from the beginning. That is why we need to improve scalability, so that we can adapt to metadata changes, e.g. if some gender value was wrong.