Not a lot of faculty know what HT or HTRC are.
Interest in digital humanities, digital publishing, and text-analysis projects has increased dramatically. Most of it comes from junior faculty and graduate students, but of late there has also been a surge of interest among advanced undergraduates.
Video tutorials that HTRC is developing will be useful for users. There is an "audience question": many find the HTRC tools to be fairly limiting, especially in comparison with commercial tools.
We don't know if the corpus tools fit the current interests of our linguistics faculty.
There are currently 3 million volumes of non-copyrighted materials and 8 million volumes of copyrighted materials.
HT-Bookworm (under development by HTRC and groups from Baylor and Northeastern) and the Data Capsule (which will allow users to run computational methods against protected data) are among the tools being developed that will make copyrighted material more usable in the future. For HT-Bookworm to work, the back end will need to be changed from MySQL (as it is currently) to Solr, which is likely to be more scalable. The plan is also to integrate HT-Bookworm with HTRC worksets, so that one would be able to go in both directions: from the Data Capsule to the workset, and from the workset to the Data Capsule. That is, from what the user discovers using HT-Bookworm, the user might be able to automatically generate a workset. The goal is to make HT-Bookworm work with all the public-domain material in HTRC within a year. If HTRC has the copyrighted data by then, the plan is to integrate that as well.
Interest will grow as more tutorial material becomes available. Also, the things that faculty want to do don't always fit into the existing tutorials. Doing something more visual with topic models, such as Termite from Stanford, may be good.
If someone wants to submit an algorithm to the portal, how is the decision made? Currently, it is decided case by case. Setting up a workflow or process for people to submit their algorithms would be useful.
What happens if someone wants to integrate textual content of their own with HTRC materials? Unfortunately, taking in data from other sources is not currently on the radar. Augmenting an analysis with additional data is on the radar, for example if a user were to validate their results against a dictionary. Augmenting the data itself with an additional corpus has not been on the agenda.
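As a rough illustration of the "validating with a dictionary" case mentioned above, here is a minimal Python sketch. It is not an HTRC tool or API: the function names, the tiny word list, and the sample passage are all hypothetical, and a real analysis would load a full dictionary and the user's own text. It only shows the general idea of checking the words in a text against an external word list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def validate_against_dictionary(text, dictionary):
    """Split token counts into words found in the dictionary and words not found."""
    counts = Counter(tokenize(text))
    known = {word: n for word, n in counts.items() if word in dictionary}
    unknown = {word: n for word, n in counts.items() if word not in dictionary}
    return known, unknown


# Hypothetical inputs: a tiny word list and a short passage with OCR-style errors.
dictionary = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
page_text = "The quick brovvn fox jumps ovcr the lazy dog"

known, unknown = validate_against_dictionary(page_text, dictionary)

# Share of tokens that the dictionary recognizes.
total = sum(known.values()) + sum(unknown.values())
coverage = sum(known.values()) / total

print(sorted(unknown))
print(round(coverage, 2))
```

The unrecognized words can then be reviewed by hand or filtered out before further analysis. The text being analyzed is unchanged; only the analysis is supplemented with outside data.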
When faculty come and talk with librarians about analysis of digitized text, there are three main types of use cases:
1) They bring their own text
2) They bring a set of texts and would like to use HTRC as a reference corpus for them
3) They bring specific texts that they want to compare with texts that are in the HT corpus
No matter how big the HT corpus is, or how powerful the existing algorithms are, if the existing (corpus + algorithms) do not meet the needs of the specific problem that the user is trying to solve, then the user will not use the resources. Often, people come to librarians with their own texts (such as EEBO, ECHO, the Old Bailey text corpus, etc.), describe the problem, and request that an algorithm be written to do the analysis they have in mind. Nowadays, even undergraduates come with quite complex tasks that they are trying to do.
Sometimes, the text data set that the user is interested in using is a set of government documents that are put out periodically, for example all the articles put out within a certain time period.
Some suggestions of what the HTRC could do:
1) Try to create a bibliography of conference papers, to keep track of what researchers are doing using HTRC tools and resources. (This is similar to what the ICPSR does.)
2) People need to be encouraged to use HTRC as the source for the documents they need. Try to find a way to communicate with undergraduates directly, and let them know about HTRC and what it can do. Often, undergraduates will run with the resource, and convince faculty to let them use it for class projects, etc. Faculty members tend to be more hesitant about using new resources than undergraduates are.
3) The crux of the matter is to come up with tutorials that people can play with whenever they have a few free minutes, and that still add up to a useful learning experience.