Notes of the user group meeting on Jan 23, 2014

Attendees: Michelle A. Paolillo (Cornell), Matt Wilkens (Notre Dame), Adam (Cornell), Ian Barba (Texas University Library)

HTRC: Loretta Auvil, Harriett Green, Sayan Bhattacharyya, Miao Chen

2 topics were discussed:

1) feedback on HTRC Bookworm http://sandbox.htrc.illinois.edu/bookworm/

2) NLP features important for your research

Integrating Bookworm with curated collections would be nice, i.e. having Bookworm work against a work set.

Matt: on the NLP side, 2 things are useful to his research: POS tagging and named entity extraction (NER).
POS tagging is a relatively solved problem, with about 97% accuracy, and it only needs to be done once, so there is no need to offer it interactively.
NER is also one-time processing, but it is nowhere near as solved as POS and produces a lot of errors. Probably more importantly, users might want to supply their own data. He wondered how many potential users would have that kind of use case.

Loretta: we have it in the tool set now. She has a list of things to add to Bookworm (BW), and NER is on the list. Will see whether to put it in the proposal.
POS is well solved, but not for older corpora; there will be more noise as you go back in time.
Headers/footers will cause problems; that is also on her BW list.

Matt: Ted Underwood has been working on header/footer removal. Not sure it is easy to integrate with the current workflow.
For POS on older text, there is a text processing group at NW working specifically on older material, e.g. Shakespeare-era text. He thinks their training data is openly available.

The quality of NER is mixed. In PR terms, there are going to be a lot of errors, even for people/organizations/locations. If it is offered publicly, people will question whether it is good enough for their work.

Can I use regular expressions to coalesce word forms, or wildcards to capture divergent forms, etc.?
Given the way it is implemented in MySQL, this should probably be doable.
The Bookworm creators are also involved with the n-gram viewer.
Bookworm can be tweaked to do things similar to the n-gram viewer.
The back end can be changed to Solr to scale up to 3m/10m books.
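
To illustrate the regular-expression question above, here is a minimal Python sketch of coalescing divergent word forms into one count. It is not Bookworm's implementation: the function name coalesce_counts, the labels, the patterns, and the toy token list are all assumptions made for illustration.

```python
import re
from collections import Counter


def coalesce_counts(tokens, patterns):
    """Count tokens, merging every token that fully matches a pattern under that pattern's label.

    `patterns` maps a label (hypothetical, chosen by the user) to a regular expression.
    Tokens matching no pattern are counted under their lowercased form.
    """
    compiled = {label: re.compile(rx, re.IGNORECASE) for label, rx in patterns.items()}
    counts = Counter()
    for tok in tokens:
        for label, rx in compiled.items():
            if rx.fullmatch(tok):
                counts[label] += 1
                break
        else:
            # No pattern matched this token.
            counts[tok.lower()] += 1
    return counts


# Toy input, invented for this example.
tokens = ["Colour", "color", "colours", "colored", "theatre", "theater", "stage"]
patterns = {
    "color*": r"colou?r(s|ed|ing)?",   # British and American spellings plus a few suffixes
    "theater*": r"theat(er|re)s?",
}
merged = coalesce_counts(tokens, patterns)
print(merged["color*"], merged["theater*"], merged["stage"])
```

In a MySQL-backed setup the same idea could presumably be pushed into the query with a REGEXP or LIKE condition, at some cost in speed.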

a question: author/gender, publication, state
how complete is that metadata?

Sayan: maybe have something to distinguish gender decided by an algorithm from gender already existing in the metadata.
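
One way to make that distinction concrete is to store the source of each gender value next to the value itself. The sketch below is only an assumption about how this could look; the class GenderValue, the function resolve_gender, and the source labels "metadata", "algorithm", and "none" are hypothetical and not part of Bookworm or the HTRC metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenderValue:
    value: str                          # e.g. "female", "male", "unknown"
    source: str                         # "metadata", "algorithm", or "none" (hypothetical labels)
    confidence: Optional[float] = None  # only set for algorithmic guesses


def resolve_gender(catalog_value, inferred_value, inferred_confidence=None):
    """Prefer the catalog value; fall back to the inferred one and record where it came from."""
    if catalog_value:
        return GenderValue(catalog_value, "metadata")
    if inferred_value:
        return GenderValue(inferred_value, "algorithm", inferred_confidence)
    return GenderValue("unknown", "none")


from_catalog = resolve_gender("female", "male", 0.6)
from_guess = resolve_gender(None, "male", 0.6)
print(from_catalog.source, from_guess.source)
```

A facet on the source field would then let users include or exclude the algorithmically assigned values.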

Loretta: facets that people may come up with in the future?
would like to have the ability to extend it, rather than

Segment an individual text: to look at the word frequency within a book, divide it into a certain number of segments.
wolfram alpha
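
The segmentation idea above can be sketched in a few lines of Python. This is an illustration only, assuming whitespace tokenization and equal-sized segments; the function name segment_counts and the toy text are made up for the example.

```python
def segment_counts(tokens, n_segments, word):
    """Split `tokens` into `n_segments` nearly equal parts and count `word` in each part."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    total = len(tokens)
    counts = []
    for i in range(n_segments):
        # Integer arithmetic keeps the segment boundaries contiguous and covers every token once.
        start = i * total // n_segments
        end = (i + 1) * total // n_segments
        counts.append(sum(1 for tok in tokens[start:end] if tok.lower() == word))
    return counts


# Toy text, invented for this example.
book = "whale sea ship whale whale sea ship ship whale sea sea sea".split()
per_segment = segment_counts(book, 3, "whale")
print(per_segment)  # [2, 1, 1]
```

Plotting the per-segment counts would show where in the book a word is concentrated.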

Loretta: being able to slice time a bit differently.

Sayan asked Michelle: provide a single year and see what is chosen for that particular year, rather than for multiple years?

Loretta: Bookworm has the timeline better laid out.

Fix a particular place of publication, e.g. New York City, and map it to the distribution of publications.
Language use at different locations: from east to west people use different words (language variation), so use a distance measure; fix a particular place on the map and do the visualization.
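
As a rough sketch of the distance-measure idea, the Python below fixes one place and computes the cosine distance between its word-frequency profile and those of other places. Cosine distance is just one possible choice, not something decided at the meeting, and the place names and word lists are invented toy data.

```python
import math
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse frequency dictionaries."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


# Toy data, invented for this example: tokens from books grouped by place of publication.
by_place = {
    "New York": "soda sneakers soda highway".split(),
    "Boston": "soda sneakers tonic highway".split(),
    "San Francisco": "pop tennis-shoes pop freeway".split(),
}
fixed = relative_freqs(by_place["New York"])
distances = {
    place: cosine_distance(fixed, relative_freqs(tokens))
    for place, tokens in by_place.items()
    if place != "New York"
}
print(distances)
```

The resulting distances could then be used to color each place on the map relative to the fixed one.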

A problem with the female vs. male comparison, because of a smoothing issue.
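
The notes do not record exactly what the smoothing problem was. One way smoothing can distort such a comparison is shown in this small sketch, which assumes a centered moving average (Bookworm's actual smoothing may differ) and uses invented yearly counts.

```python
def moving_average(values, window):
    """Centered moving average; the window is truncated at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):min(len(values), i + half + 1)]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy yearly counts, invented for this example.
female = [0, 0, 9, 0, 0]
male = [3, 3, 3, 3, 3]

exact_lead = [f > m for f, m in zip(female, male)]
smooth_lead = [f > m for f, m in zip(moving_average(female, 3), moving_average(male, 3))]
print(exact_lead)   # [False, False, True, False, False]
print(smooth_lead)  # [False, False, False, False, False]
```

With exact counts the single year where the female count exceeds the male count is visible; after smoothing it disappears.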

Adam: just started. If you have the resources, set up WebEx for screen sharing to explain things.

Loretta: better to have exact counts.

To Adam: send email to us, or to the HTRC user group, so we can follow up quickly with a screen sharing session.

Will update to the newest version of the Bookworm UI; probably won't make changes to the back-end code unless something is incorrect in the metadata.
In that case everything needs to be regenerated from the beginning. That's why we need to extend the scalability: to adapt to metadata changes, e.g. if some gender value was wrong.