Notes of the user group meeting on Dec 5, 2013
Participants: Michelle A. Paolillo, Loretta Auvil, Samitha Liyanage
Today's theme: personal usage of HTRC data, usage of data in general, or patrons' data needs from a library.
Michelle discussed a use case from a scholar at Cornell, Prof. Ed Baptist (History). He wanted to access in-copyright content in HathiTrust, namely slave narratives from the Federal Writers' Project. He would like to do some entity extraction and topic modeling on the data. However, the content is in copyright and HathiTrust does not provide the scanned images.
Miao: will dig out a previous email to look more into this use case
Loretta: For entity extraction of time expressions, it's easy to extract strings such as "today" or "yesterday" from text, but it's hard to place them on a timeline.
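A minimal Python sketch of the point about time expressions. The word list, function names, and sample sentence are made up for illustration and are not part of any HTRC tool. Finding the strings takes one regular expression, but turning them into dates requires a reference date that the text itself usually does not supply.

```python
import re
from datetime import date, timedelta

# Hypothetical, deliberately tiny lexicon of relative time words (day offsets).
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}
PATTERN = re.compile(r"\b(" + "|".join(RELATIVE_OFFSETS) + r")\b", re.IGNORECASE)

def extract_time_expressions(text):
    """Return the relative time words found in text, in order of appearance."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]

def resolve(expression, reference_date):
    """Map a relative time word to a calendar date.

    This only works when reference_date (the day the text was spoken or
    written) is known, which is the hard part in practice.
    """
    return reference_date + timedelta(days=RELATIVE_OFFSETS[expression])

# Made-up sample sentence.
sample = "Yesterday we left; today we rest."
found = extract_time_expressions(sample)
```

Here `found` holds the extracted strings, but `resolve` cannot be called meaningfully until someone decides what date the sentence was uttered on.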
Loretta also shared some information on how HTRC plans to support analysis of in-copyright content in the future. Mainly, features such as word frequencies will be extracted rather than exposing the full text. There has also been discussion about providing a co-occurrence matrix for topic modeling purposes.
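As a rough illustration of what such extracted features might look like, here is a small Python sketch that computes word frequencies and document-level co-occurrence counts. The notes do not describe HTRC's actual feature format, so the tokenizer, function names, and sample documents below are assumptions.

```python
from collections import Counter
from itertools import combinations

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]

def word_frequencies(docs):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts

def cooccurrence(docs):
    """Count, for each unordered word pair, the documents containing both.

    Pairs are stored as alphabetically sorted tuples, which gives a sparse
    representation of a symmetric co-occurrence matrix.
    """
    pairs = Counter()
    for doc in docs:
        vocab = sorted(set(tokenize(doc)))
        pairs.update(combinations(vocab, 2))
    return pairs

# Made-up sample documents.
docs = ["the river runs north", "the road runs south"]
freqs = word_frequencies(docs)
pairs = cooccurrence(docs)
```

Neither output preserves word order, so the original text cannot be read back from it, which is presumably why this kind of feature is of interest for in-copyright material.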
Michelle: Prof. Baptist taught a digital history class this past semester, which emphasized introducing students to algorithmic analysis. The class was introduced to the HTRC portal, and Prof. Baptist used topic modeling samples from it as teaching material. He showed students word clouds from different perspectives, e.g. comparing narratives of enslavement as told by slaves and by abolitionists; the word clouds were quite interesting to compare.
There has been interest in the scholarly community in obtaining XML files from some source and using them for other purposes.
Then the discussion moved on to visualization, which is useful for showing people associations in the data in an interactive way. Loretta mentioned that setting up servers or services for such visualization could be difficult.
Miao mentioned that the most recently planned HTRC architecture includes an R/Python component as another way of interacting with HTRC text.