...

HTRC: Loretta Auvil, Harriett Green, Sayan Bhattacharyya, Miao Chen

 

2 topics were discussed:

1) feedback on HTRC Bookworm http://sandbox.htrc.illinois.edu/bookworm/

...

Matt: On the NLP side of things, two things are useful to his research: POS tagging and named entity recognition (NER).
POS tagging is a relatively solved problem, with about 97% accuracy, and it only needs to be done once. There is no need to offer it interactively.
NER is also one-time processing, but it is nowhere near as solved as POS tagging and produces a lot of errors. Probably more importantly, users might be able to supply their own data. He wondered how many potential users would have that kind of use case.

Loretta: We have it in the tool set now. We have a list of things to add to Bookworm, and NER is on the list. We will see if it finally gets implemented.
POS tagging is well solved for modern corpora, but not for older ones. You will see more noise as you go back in time.
Headers/footers will cause problems, and that is on the Bookworm list as well.

Matt: Ted Underwood has been working on header/footer removal. Not sure it's easy to integrate with the current workflow.
For POS tagging of older texts, a text processing group at Northwestern University works on that, especially for older material, e.g. the Shakespeare era. He thinks their training data is openly available.

The quality of NER is mixed. In PR terms, there are going to be a lot of errors. Even for people/organizations/locations, there are a lot of errors. If the results are made public, people will question whether they are good enough for their work.

Question: Can I use regular expressions to coalesce word forms, or use wildcards to capture divergent forms, etc.?
Answer: It's not offered in the current system, probably because of the way it is implemented in MySQL, but it should be doable.
Michelle and Loretta have the impression that the Bookworm creators are also involved in the Google n-gram viewer.
This indicates that Bookworm can be tweaked to do things similar to the Google n-gram viewer.
Loretta: We can change the back end to Solr, to scale up to 3M/10M books.
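
To illustrate the question about coalescing word forms, here is a minimal sketch in Python. It is not how Bookworm or its MySQL back end works; the `counts` table and the helper names `coalesce_regex` and `coalesce_wildcard` are hypothetical and only show the idea of summing counts over every word form that matches a pattern.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical unigram counts; real data would come from the back end.
counts = Counter({"color": 12, "colour": 7, "colours": 3, "colored": 2, "collar": 5})

def coalesce_regex(counts, pattern):
    """Sum the counts of every word that fully matches a regular expression."""
    rx = re.compile(pattern)
    return sum(n for word, n in counts.items() if rx.fullmatch(word))

def coalesce_wildcard(counts, pattern):
    """Sum the counts of every word that matches a shell-style wildcard."""
    return sum(n for word, n in counts.items() if fnmatchcase(word, pattern))

# Spelling variants of "color" are merged; "collar" is left out.
print(coalesce_regex(counts, r"colou?r(s|ed)?"))
# A broad wildcard captures every word starting with "col".
print(coalesce_wildcard(counts, "col*"))
```

Requiring a full match rather than a partial one keeps a short pattern from silently absorbing unrelated words.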

Question: For metadata such as author/gender, publication, state, how complete is that metadata?

Loretta explained how the gender metadata was derived: it is Stacy Kowalczyk and Peng Zong's work, presented at the last UnCamp.

Sayan: Maybe we need something to distinguish between metadata that were decided by an algorithm and metadata that already existed.

Loretta: Are there any facets that people may come up with in the future?
We would like to have the ability to extend it, rather than.

Miao: Right now the x-axis is time. What else could be the x-axis?

Michelle: One thing you can do is segment an individual text: to look at word frequency within a book, divide it into a certain number of segments, and show the frequency of words over the content order of the book.
Miao mentioned similar work done by Wolfram Alpha.
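
The segmentation idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokenizer is a naive regular expression, and the function name `segment_frequencies` and its parameters are hypothetical, not part of Bookworm.

```python
import re

def segment_frequencies(text, word, n_segments=10):
    """Split a text into n_segments equal parts (by token count) and
    return the relative frequency of `word` in each part, in reading order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    word = word.lower()
    freqs = []
    for i in range(n_segments):
        start = i * len(tokens) // n_segments
        end = (i + 1) * len(tokens) // n_segments
        segment = tokens[start:end]
        # An empty segment (very short text) contributes a frequency of 0.
        freqs.append(segment.count(word) / len(segment) if segment else 0.0)
    return freqs

sample = "a whale b c d whale whale e"
print(segment_frequencies(sample, "whale", n_segments=2))
```

Plotting the returned list with the segment index on the x-axis gives position in the book as the horizontal axis instead of publication year.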

Loretta: we can also slice time a bit differently.

Sayan asked Michelle: Could one provide a single year and see the results for that particular year, rather than for multiple years?

Loretta: Bookworm has the timeline better laid out, and we will install the latest version.

Sayan: Another possibility is to fix a particular place of publication, e.g. New York City, map the distribution of publications, and find language use at different locations: going from east to west, which different words are used (language variation), using a distance measure. Fix a particular place on the map and do the visualization.

Miao mentioned a problem with female vs. male writers for the word "revolution": at a given time point, the sum of the "revolution" frequency for female writers and the "revolution" frequency for male writers should be less than or equal to the overall frequency of "revolution", but that is not what HTRC Bookworm shows.
Loretta: That is because of a smoothing issue. You can turn off the smoothing option; you can also use the exact count option instead of the frequency percentage.
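
The difference between the exact count option and the percentage option can be illustrated with made-up numbers. This sketch does not model Bookworm's smoothing. It assumes, hypothetically, that each group's percentage is normalized by that group's own word total; under that assumption exact counts add up across groups, while percentages need not.

```python
from fractions import Fraction

# Hypothetical word counts for one year; not real HTRC data.
groups = {
    "female":  {"hits": 30, "words": 1_000},
    "male":    {"hits": 60, "words": 4_000},
    "unknown": {"hits": 10, "words": 1_000},
}

total_hits = sum(g["hits"] for g in groups.values())
total_words = sum(g["words"] for g in groups.values())

# Exact counts are additive: female + male can never exceed the total.
count_sum = groups["female"]["hits"] + groups["male"]["hits"]

def rate(group):
    """Relative frequency of the word within one group's own text."""
    return Fraction(group["hits"], group["words"])

# Per-group rates use different denominators, so their sum is not bounded
# by the overall rate (the overall rate is a weighted average of the groups).
rate_sum = rate(groups["female"]) + rate(groups["male"])
overall_rate = Fraction(total_hits, total_words)

print(count_sum <= total_hits)   # counts: the expected inequality holds
print(rate_sum > overall_rate)   # rates: the inequality can fail
```

This is why checking the inequality with exact counts is the more reliable test.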

Adam: Just started with HTRC. If you have resources to set up meetings in Webex etc. for screen sharing, that would help explain things.

Loretta: It is better to use the exact count.

To Adam: Send email to us, or to the HTRC user group, so we can follow up quickly with a screen sharing session.

Loretta: We will update to the newest version of the Bookworm UI, and probably won't make changes to the back-end code unless something is incorrect in the metadata.

When things change, we need to regenerate everything from the beginning. That's why we need to extend the scalability: to adapt to metadata changes, e.g. if some gender was wrong.

...