Notes of the user group meeting on April 24, 2013

Attendees: Michelle Paolillo
HTRC team: Sayan Bhattacharyya, Loretta Auvil, Samitha Liyanage, Miao Chen

Topic: handling invalid volume IDs in HTRC worksets caused by takedowns.

HTRC periodically takes volumes down from the data store in response to takedown requests from HathiTrust, usually for copyright reasons. This makes work sets inconsistent: a volume may have been taken down from the HTRC data store while its volume ID is still in your work set. One side effect is that algorithms run on the work set will not process the taken-down volume. We are therefore asking the user community for input on how to notify users about takedowns and how to handle this inconsistency.
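
To make the inconsistency concrete, here is a minimal sketch of how the stale volume IDs in a work set could be detected. It is not HTRC's actual implementation: the function name `find_taken_down` and the volume IDs are hypothetical, and it assumes the set of IDs currently in the data store is available as a plain collection.

```python
def find_taken_down(workset_ids, available_ids):
    """Return the work set's volume IDs that are no longer in the data store.

    Both arguments are hypothetical inputs: plain iterables of volume ID strings.
    The order of the work set is preserved in the result.
    """
    available = set(available_ids)
    return [vid for vid in workset_ids if vid not in available]


# Hypothetical example: "vol.002" has been taken down from the data store.
workset = ["vol.001", "vol.002", "vol.003"]
data_store = ["vol.001", "vol.003", "vol.004"]
stale = find_taken_down(workset, data_store)
print(stale)  # ['vol.002']
```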

Michelle: I would like HTRC to notify users of taken-down volumes so that users are aware of the situation.

Loretta: If you have a work set in which no volumes have been taken down, do you still want to be notified?

Michelle: I think only the owner should be notified.

Sayan: For public work sets, if owners don't respond or take action, it affects others.

Loretta: You can save a copy of the work set yourself and correct it yourself.

Michelle: Scholars working on specific scholarly questions will eventually need to control the work set for reasons driven by their intellectual inquiry (e.g., reproducibility of results, relevance of inclusions/exclusions, etc.). They will more naturally work with work sets that they control, and/or ones that are built by collaborators.

Owners should get an email if volumes in their work set are taken down.

Loretta: Would you want multiple emails, for multiple work sets/volumes?

It would be easier to send a notification as soon as something is taken down than to accumulate taken-down volumes for users.

Michelle: that makes sense. These are essentially administrative tasks and dynamic notification allows the owner to respond as the need arises.
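
The notification approach discussed above (notify only owners of affected work sets, as soon as a volume is taken down) could be sketched as follows. This is an illustration under assumptions, not the portal's code: the function name `notifications_for_takedown` and the work set record layout (`name`, `owner`, `volume_ids`) are hypothetical.

```python
from collections import defaultdict


def notifications_for_takedown(taken_down_id, worksets):
    """Map each affected owner to the names of their work sets containing the volume.

    `worksets` is a hypothetical list of dicts with keys 'name', 'owner'
    and 'volume_ids'. Owners with no affected work set get no entry,
    so they would receive no email.
    """
    affected = defaultdict(list)
    for ws in worksets:
        if taken_down_id in ws["volume_ids"]:
            affected[ws["owner"]].append(ws["name"])
    return {owner: sorted(names) for owner, names in affected.items()}


# Hypothetical data: only alice's work set contains the taken-down volume.
worksets = [
    {"name": "Shakespeare", "owner": "alice", "volume_ids": ["vol.001", "vol.002"]},
    {"name": "Shakespeare", "owner": "bob", "volume_ids": ["vol.003"]},
]
to_notify = notifications_for_takedown("vol.002", worksets)
print(to_notify)  # {'alice': ['Shakespeare']}
```

Calling this once per takedown event yields one message per affected owner per event, rather than an accumulated digest.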

Loretta: Does the term "quarantine" bother you?

Michelle: It'd be nice to have a softer word. "Quarantine" seems to connote an issue of data security. In the detail text of the wireframe, under "When user clicks on the "Update Workset" button" and "When user clicks on the "View Details" button", the result for the user of each of the two alternatives is not clearly spelled out. What is the difference between quarantine and delete?

Loretta: Delete means the volume ID is completely deleted. Quarantine puts a flag on the volume.

Can we call the "delete" button "remove"?
Sayan: Maybe add "permanently" to "remove".

Sayan: Would "sequestration" make more sense?
Loretta: Or even just "flag", or "ignore".

Michelle: Likes "Ignore". Suggests that the buttons be relabeled "Ignore" and "Remove" and that the text be reworded as follows: "Click "Ignore" if you want to retain the deleted volume IDs in this workset but have them ignored during processing by algorithms. Click "Remove" if you want to permanently remove the deleted volume IDs from this workset. Press "Cancel" to do nothing - this workset will continue to generate errors when used by algorithms."
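
The difference between the three proposed buttons can be illustrated with a small sketch. The `Workset` class and `resolve` function below are hypothetical names chosen for illustration, not part of the HTRC portal; the sketch only assumes that "Ignore" flags IDs while keeping them, "Remove" deletes them, and "Cancel" changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Workset:
    """Hypothetical work set: an ordered list of volume IDs plus an ignore flag set."""
    volume_ids: list
    ignored: set = field(default_factory=set)

    def active_ids(self):
        """Volume IDs that algorithms would actually process."""
        return [v for v in self.volume_ids if v not in self.ignored]


def resolve(workset, taken_down, action):
    """Apply 'ignore', 'remove' or 'cancel' to the taken-down IDs in the work set."""
    taken = set(taken_down) & set(workset.volume_ids)
    if action == "ignore":
        # Keep the IDs in the work set but flag them so algorithms skip them.
        workset.ignored |= taken
    elif action == "remove":
        # Permanently drop the IDs (and any flags on them) from the work set.
        workset.volume_ids = [v for v in workset.volume_ids if v not in taken]
        workset.ignored -= taken
    elif action != "cancel":
        raise ValueError(f"unknown action: {action!r}")
    return workset


ws = Workset(["vol.001", "vol.002", "vol.003"])
resolve(ws, ["vol.002"], "ignore")
print(ws.volume_ids)    # ['vol.001', 'vol.002', 'vol.003']
print(ws.active_ids())  # ['vol.001', 'vol.003']
```

Because "Ignore" only sets a flag, it can be reversed later by clearing the flag, whereas "Remove" cannot be undone from the work set alone.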

Miao: Do you have privacy concerns if HTRC scans your work sets to detect taken-down volumes? Would you feel monitored?

Michelle: I don't necessarily feel my privacy is intruded upon. HTRC knows what's in the work set anyway. It'd be good to know the volumes are taken down, and the maintenance of a work set is essentially an administrative issue, much like getting an overdue notice from a library.

Loretta: Are there other facets you'd like to have for searching, other than author/keyword/year etc.?
Michelle: Public and private work set facets.

Loretta: Are you creating some public and some private work sets?
Michelle: Yes, actually doing a mix. It makes sense to distinguish your own work sets from others'.

Michelle: When I'm in the portal looking at the algorithms, I can see the authors/owners of the work sets, which is very good orientation for me. That helps distinguish work sets built around the same author or topic by different owners (e.g., many people make Shakespeare work sets). So @author helps me decide which work set to choose. In the work set list (?) I can't see the author, so I can't tell just by looking at the list. These are inconsistent views, perhaps by design.

Loretta: There is no need to have that inconsistency. It might have just been an oversight.

Michelle (switched to another issue): Keyword tags are important. People give tags to their material, which is important for search. It'd be nice for people to be able to tag other people's work sets.

Loretta: Summarized two things: add owner to the tag list; add tags to the work set details description. Right now only the owner can modify tags, but we might be able to create a second set of tags that other folks can use to tag public sets.

Miao: When taken-down volumes are to be removed from your work set, do you want to keep the same work set or have a new one?

Michelle: I would like to keep the same work set, because in the future, when in-copyright items are added to the portal experience, I may want to refer back to the work set and "un-ignore" the volume IDs.

Miao: How about versioning of the work set, e.g., creating a new version of the work set whenever it is updated?

Michelle: that would be the ideal situation. Does HTRC portal keep versioning of work sets?

Loretta: We will have to see if this is turned on in storage, but we can turn it on if necessary. Currently it's not displayed to users in the portal, so what people see is only the most recent list of their volumes.

Loretta: Do you like the way gender information is displayed?
Michelle: The gender icon was confusing the first time she saw it. She was not sure what M/F meant but then figured it out. Gender is not an interesting question for her with regard to the types of examples she uses the portal for. The icon seems a little big, bigger than the text anyway, which is somewhat visually annoying to her. Maybe just change to the male/female symbols, but she is not sure what that would mean for unknown.

Michelle: Had a class demoing the HTRC portal recently. She thinks that as participants engage, they are really trying to do several things at once, and it's hard to tease these apart. They are getting used to doing algorithmic text analysis, a methodology that is new to them. They are also just trying to find their way around the portal. This makes it difficult to actually evaluate their thoughts about the suitability/desirability of the algorithms.
There is a lot of excitement about the potential of the NaiveBayes algorithm. If we could actually train effectively on one set and classify on another set, that would be really helpful. People like topic modeling, and are in favor of word count. Sometimes a count just tells you about the grammatical necessities of the language, e.g., "the" is required frequently in English. Word count results can look very disappointing unless the scholar can control the stopwords dynamically, as they can in Voyant Tools (http://voyant-tools.org/).

Entity extraction is another thing that people get excited about. There is lots of potential to find your way through texts.
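
The point about stopwords can be shown with a minimal sketch of a word count where the scholar supplies the stopword list. This is not the portal's word count algorithm nor Voyant's implementation; the function `word_counts` and its simple lowercase tokenization are assumptions made for illustration.

```python
import re
from collections import Counter


def word_counts(text, stopwords=()):
    """Count lowercase word tokens, skipping any word in the caller's stopword list."""
    stop = {w.lower() for w in stopwords}
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop)


sample = "The cat and the hat"

# Without stopwords, the most frequent word is a grammatical necessity.
print(word_counts(sample).most_common(1))  # [('the', 2)]

# With a scholar-chosen stopword list, only content words remain.
filtered = word_counts(sample, stopwords=["the", "and"])
print(sorted(filtered.items()))  # [('cat', 1), ('hat', 1)]
```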